2024-03-28 00:23:19

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 00/14] Add Cgroup support for SGX EPC memory

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod level control in a Kubernates cluster on a VM or bare-metal host
[1,2].

This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which are normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.

For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design.

Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------

tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or to add new
LRUs. The result: Add a new LRU for each cgroup and scans those instead.
Replace the existing global cgroup with the root cgroup's LRU (only when
this new support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.

Unreclaimable Enclave Pages
---------------------------

There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array(VA) and Secure Enclave
Control Structure (SECS)[5], holds meta data of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed,
and the limit is only enforced at the time of new EPC allocation request.
When a cgroup hits its limit but nothing left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.

Unreclaimable Guest VM EPC Pages
--------------------------------

The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver translates the
ENOMEM error resulted from an EPC allocation request into a SIGBUS to the
user process exactly the same way handling host running out of physical
EPC.

This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-14], simplified the
implementation/design, added selftest scripts, fixed some stability issues
found from testing.

Thanks to all for the review/test/tags/feedback provided on the previous
versions.

I appreciate your further reviewing/testing and providing tags if
appropriate.

---
V10:
- Use enum instead of boolean for the 'reclaim' parameters in
sgx_alloc_epc_page(). (Dave, Jarkko)
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
- Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
- Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
(Kai)
- Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
- Does not return a value for callers to check.
- Its usage pattern is similar to that of sgx_reclaim_pages() now
- Add cond_resched() in the loop in the cgroup reclaimer to improve
liveliness.
- Add logic for cgroup level reclamation in sgx_reclaim_direct()
- Restructure V9 patches 7-10 to make them flow better. (Kai)
- Disable cgroup if workqueue allocation failed during init. (Kai)
- Shorten names for EPC cgroup functions, structures and variables.
(Jarkko)
- Separate out a helper for for addressing single iteration of the loop in
sgx_cgroup_try_charge(). (Jarkko)
- More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)

V9:
- Add comments for static variables outside functions. (Jarkko)
- Remove unnecessary ifs. (Tim)
- Add more Reviewed-By: tags from Jarkko and TJ.

V8:
- Style fixes. (Jarkko)
- Abstract _misc_res_free/alloc() (Jarkko)
- Remove unneeded NULL checks. (Jarkko)

V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)

[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/[email protected]/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/[email protected]/
[8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
[9]v4: https://lore.kernel.org/all/[email protected]/
[10]v5: https://lore.kernel.org/all/[email protected]/
[11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
[12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[14]v9: https://lore.kernel.org/lkml/[email protected]/T/

Haitao Huang (3):
x86/sgx: Replace boolean parameters with enums
x86/sgx: Charge mem_cgroup for per-cgroup reclamation
selftests/sgx: Add scripts for EPC cgroup testing

Kristen Carlson Accardi (9):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Abstract tracking reclaimable pages in LRU
x86/sgx: Add basic EPC reclamation flow for cgroup
x86/sgx: Implement async reclamation for cgroup
x86/sgx: Abstract check for global reclaimable pages
x86/sgx: Turn on per-cgroup EPC reclamation

Sean Christopherson (2):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
Docs/x86/sgx: Add description for cgroup support

Documentation/arch/x86/sgx.rst | 83 +++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 41 +--
arch/x86/kernel/cpu/sgx/encl.h | 7 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 325 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 79 +++++
arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
arch/x86/kernel/cpu/sgx/main.c | 199 ++++++++---
arch/x86/kernel/cpu/sgx/sgx.h | 34 +-
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
include/linux/misc_cgroup.h | 41 +++
kernel/cgroup/misc.c | 109 ++++--
.../selftests/sgx/run_epc_cg_selftests.sh | 246 +++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
15 files changed, 1098 insertions(+), 105 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh


base-commit: 4cece764965020c22cff7665b18a012006359095
--
2.25.1



2024-03-28 00:24:38

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

From: Kristen Carlson Accardi <[email protected]>

The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page into the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.

Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard coded global LRU with a call to this helper.

Next patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd before we
make the switch in the end for sgx_lru_list() to return per-cgroup
LRU.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4991eb0af748..8f83f7ac386e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
*/
static struct sgx_epc_lru_list sgx_global_lru;

+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+ return &sgx_global_lru;
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

/* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
}

/**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
* @page: EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ list_add_tail(&page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
}

/**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
* @page: EPC page
*
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
*
* Return:
* 0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return -EBUSY;
}

list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);

return 0;
}
--
2.25.1


2024-03-28 00:25:29

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

From: Kristen Carlson Accardi <[email protected]>

When a cgroup usage reaches its limit, and it is to be charged, i.e.,
sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
reclaim pages from its LRU or LRUs of its descendants to make room for
any new allocations. This patch adds the basic building block for the
per-cgroup reclamation flow and use it for synchronous reclamation in
sgx_cgroup_try_charge().

First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
which pages are reclaimed, so it can be reused by both the global and
cgroup reclaimers. Also return the number of pages attempted, so a
cgroup reclaimer can use it to track reclamation progress from its
descendants.

For the global reclaimer, replace all call sites of sgx_reclaim_pages()
with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
just calls sgx_reclaim_pages() with the global LRU passed in.

For cgroup reclamation, implement a basic reclamation flow, encapsulated
in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
at each node passing in the LRU of that node. It keeps track of total
attempted pages and stops the walk if desired number of pages are
attempted.

Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
a synchronous reclamation is allowed. If the caller allows and cgroup
usage is at its limit, trigger the synchronous reclamation by calling
sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
iterations.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for for addressing single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)

V9:
- Add comments for static variables. (Jarkko)

V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 127 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +-
arch/x86/kernel/cpu/sgx/main.c | 45 ++++++----
arch/x86/kernel/cpu/sgx/sgx.h | 1 +
4 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index a1dd43c195b2..f7a487a29ed1 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -9,16 +9,136 @@
static struct sgx_cgroup sgx_cg_root;

/**
- * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: Root of the tree to check
+ *
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_cgroup *sgx_cg;
+ bool ret = true;
+
+ /*
+ * Caller ensure css_root ref acquired
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+
+ rcu_read_unlock();
+
+ sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&sgx_cg->lru.lock);
+ ret = list_empty(&sgx_cg->lru.reclaimable);
+ spin_unlock(&sgx_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
+ * @root: The root of cgroup tree to reclaim from.
*
+ * This function performs a pre-order walk in the cgroup tree under the given
+ * root, attempting to reclaim pages at each node until a fixed number of pages
+ * (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of success on
+ * the actual reclamation process. In extreme cases, if all pages in front of
+ * the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
+ * page will actually be reclaimed after walking the whole tree.
+ *
+ * Callers check for the need for reclamation before calling this function. Some
+ * callers may run this function in a loop guarded by some criteria for
+ * triggering reclamation, and call cond_resched() in between iterations to
+ * avoid indefinite blocking.
+ */
+static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_cgroup *sgx_cg;
+ unsigned int cnt = 0;
+
+ /* Caller ensure css_root ref acquired */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+ rcu_read_unlock();
+
+ sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+ cnt += sgx_reclaim_pages(&sgx_cg->lru);
+
+ rcu_read_lock();
+ css_put(pos);
+
+ if (cnt >= SGX_NR_TO_SCAN)
+ break;
+ }
+
+ rcu_read_unlock();
+}
+
+static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
+{
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
+ return 0;
+
+ if (sgx_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ return -EBUSY;
+}
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @sgx_cg: The EPC cgroup to be charged for the page.
+ * @reclaim: Whether or not synchronous EPC reclaim is allowed.
* Return:
* * %0 - If successfully charged.
* * -errno - for failures.
*/
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
- return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+ int ret;
+
+ for (;;) {
+ ret = __sgx_cgroup_try_charge(sgx_cg);
+
+ if (ret != -EBUSY)
+ return ret;
+
+ if (reclaim == SGX_NO_RECLAIM)
+ return -ENOMEM;
+
+ sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ cond_resched();
+ }
+
+ return 0;
}

/**
@@ -50,6 +170,7 @@ const struct misc_res_ops sgx_cgroup_ops = {

static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
{
+ sgx_lru_init(&sgx_cg->lru);
cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
sgx_cg->cg = cg;
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 8f794e23fad6..f62dce0cac51 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void)

static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }

-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim r)
{
return 0;
}
@@ -31,6 +31,7 @@ static inline void sgx_cgroup_init(void) { }
#else
struct sgx_cgroup {
struct misc_cg *cg;
+ struct sgx_epc_lru_list lru;
};

static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -61,7 +62,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
put_misc_cg(sgx_cg->cg);
}

-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
void sgx_cgroup_init(void);

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 8f83f7ac386e..aaf341abc641 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -286,11 +286,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}

-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
+ * @lru: The LRU from which pages are reclaimed.
+ *
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
*
* Batch process a chunk of pages (at the moment 16) in order to degrade amount
* of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
@@ -298,8 +301,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* + EWB) but not sufficiently. Reclaiming one page at a time would also be
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
+ *
+ * Return: Number of pages attempted for reclamation.
*/
-static void sgx_reclaim_pages(void)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +315,9 @@ static void sgx_reclaim_pages(void)
int ret;
int i;

- spin_lock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;

@@ -328,7 +332,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);

for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +355,9 @@ static void sgx_reclaim_pages(void)
continue;

skip:
- spin_lock(&sgx_global_lru.lock);
- list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
+ list_add_tail(&epc_page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);

kref_put(&encl_page->encl->refcount, sgx_encl_release);

@@ -379,6 +383,8 @@ static void sgx_reclaim_pages(void)

sgx_free_epc_page(epc_page);
}
+
+ return cnt;
}

static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,6 +393,11 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}

+static void sgx_reclaim_pages_global(void)
+{
+ sgx_reclaim_pages(&sgx_global_lru);
+}
+
/*
* sgx_reclaim_direct() should be called (without enclave's mutex held)
* in locations where SGX memory resources might be low and might be
@@ -395,7 +406,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
}

static int ksgxd(void *p)
@@ -418,7 +429,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));

if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();

cond_resched();
}
@@ -572,7 +583,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
int ret;

sgx_cg = sgx_get_current_cg();
- ret = sgx_cgroup_try_charge(sgx_cg);
+ ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
if (ret) {
sgx_put_cg(sgx_cg);
return ERR_PTR(ret);
@@ -604,7 +615,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index fd28e2e98620..954f02efb516 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -119,6 +119,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);

void sgx_ipi_cb(void *info);

--
2.25.1


2024-03-28 00:25:39

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 13/14] Docs/x86/sgx: Add description for cgroup support

From: Sean Christopherson <[email protected]>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang<[email protected]>
Signed-off-by: Haitao Huang<[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 83 ++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running. The EPC is managed by the SGX subsystem and
+is not accounted by the memory controller. Note that this is true only for EPC
+memory itself, i.e. normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing store
+(normal system memory allocated via shmem). The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+ resource will show the total amount of EPC memory available on the
+ platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+ resource will show the current active EPC memory usage of the cgroup and
+ its descendants. EPC pages that are swapped out to backing RAM are not
+ included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The
+ sgx_epc resource will show the EPC usage hard limit. The default is
+ "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+ page fault handling, will be blocked until EPC can be reclaimed from the
+ cgroup. If there are no pages left that are reclaimable within the same
+ group, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not
+ reclaimable by the host kernel. In case the guest cgroup's limit is
+ reached and no reclaimable pages left in the same cgroup, the virtual
+ EPC driver returns SIGBUS to the user space process to indicate failure
+ on new EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower
+ than the current usage to this file, the cgroup will not preemptively
+ deallocate pages currently in use, and will only start blocking the next
+ allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim due to
+ its EPC usage approaching (or exceeding) its max EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged
+to the original cgroup until the page is released or reclaimed. Migrating a
+process to a different cgroup doesn't move the EPC charges that it incurred
+while in the previous cgroup to its new cgroup.
--
2.25.1


2024-03-28 00:28:30

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing

The scripts rely on cgroup-tools package from libcgroup [1].

To run selftests for epc cgroup:

sudo ./run_epc_cg_selftests.sh

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

/watch_misc_for_tests.sh current

With different cgroups, the script starts one or multiple concurrent SGX
selftests, each to run one unclobbered_vdso_oversubscribed test. Each
of such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all process exit.

The script also includes a test with low mem_cg limit and LARGE sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg.

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
.../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
2 files changed, 259 insertions(+)
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..e027bf39f005
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,246 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+ echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+ exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_MISC_ROOT=/sys/fs/cgroup
+CG_MEM_ROOT=/sys/fs/cgroup
+CG_V1=0
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+ echo "# cgroup V2 is in use."
+else
+ echo "# cgroup V1 is in use."
+ CG_MISC_ROOT=/sys/fs/cgroup/misc
+ CG_MEM_ROOT=/sys/fs/cgroup/memory
+ CG_V1=1
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
+# This is below number of VA pages needed for enclave of capacity size. So
+# should fail oversubscribed cases
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+wait_check_process_status() {
+ local pid=$1
+ local check_for_success=$2 # If 1, check for success;
+ # If 0, check for failure
+ wait "$pid"
+ local status=$?
+
+ if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+ echo "# Process $pid succeeded."
+ return 0
+ elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ return 0
+ fi
+ return 1
+}
+
+wait_and_detect_for_any() {
+ local pids=("$@")
+ local check_for_success=$1 # If 1, check for success;
+ # If 0, check for failure
+ local detected=1 # 0 for success detection
+
+ for pid in "${pids[@]:1}"; do
+ if wait_check_process_status "$pid" "$check_for_success"; then
+ detected=0
+ # Wait for other processes to exit
+ fi
+ done
+
+ return $detected
+}
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
+# Always use leaf node of misc cgroups so it works for both v1 and v2
+# these may fail on OOM
+cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [[ $? -eq 0 ]]; then
+ echo "# Fail on SMALL limit, not expecting any test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+else
+ echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one success...."
+
+pids=()
+for i in {1..4}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+
+if wait_and_detect_for_any 1 "${pids[@]}"; then
+ echo "# PASSED LARGE limit positive testing."
+else
+ echo "# Failed on LARGE limit positive testing, no test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one failure...."
+pids=()
+for i in {1..5}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 0 "${pids[@]}"; then
+ echo "# PASSED LARGE limit negative testing."
+else
+ echo "# Failed on LARGE limit negative testing, no test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ expecting no failure...."
+pids=()
+for i in {1..8}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 0 "${pids[@]}"; then
+ echo "# Failed on LARGER limit, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+else
+ echo "# PASSED LARGER limit tests."
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ randomly kill one, expecting no failure...."
+pids=()
+for i in {1..8}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+sleep $((RANDOM % 10 + 5))
+
+# Randomly select a PID to kill
+RANDOM_INDEX=$((RANDOM % 8))
+PID_TO_KILL=${pids[RANDOM_INDEX]}
+
+kill $PID_TO_KILL
+echo "# Killed process with PID: $PID_TO_KILL"
+
+any_failure=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [ "$pid" != "$PID_TO_KILL" ]; then
+ if [[ $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+ fi
+done
+
+if [[ $any_failure -ne 0 ]]; then
+ echo "# Failed on random killing, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+echo "# PASSED LARGER limit test with a process randomly killed."
+
+cgcreate -g memory:$TEST_CG_SUB2
+if [ $? -ne 0 ]; then
+ echo "# Failed creating memory controller."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+MEM_LIMIT_TOO_SMALL=$((CAPACITY - 2 * LARGE))
+
+if [[ $CG_V1 -eq 0 ]]; then
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.max
+else
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.limit_in_bytes
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.memsw.limit_in_bytes
+fi
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE EPC limit,
+ and too small RAM limit, expecting all failures...."
+pids=()
+for i in {1..4}; do
+ (
+ cgexec -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
+ >cgtest_large_oom_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 1 "${pids[@]}"; then
+ echo "# Failed on tests with memcontrol, some tests did not fail."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ if [[ $CG_V1 -ne 0 ]]; then
+ cgdelete -r -g memory:$TEST_ROOT_CG
+ fi
+ exit 1
+else
+ echo "# PASSED LARGE limit tests with memcontrol."
+fi
+
+sleep 2
+
+USAGE=$(grep '^sgx_epc' "$CG_MISC_ROOT/$TEST_ROOT_CG/misc.current" | awk '{print $2}')
+if [ "$USAGE" -ne 0 ]; then
+ echo "# Failed: Final usage is $USAGE, not 0."
+else
+ echo "# PASSED leakage check."
+ echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
+fi
+cgdelete -r -g misc:$TEST_ROOT_CG
+if [[ $CG_V1 -ne 0 ]]; then
+ cgdelete -r -g memory:$TEST_ROOT_CG
+fi
+echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+ then
+ echo "No argument supplied, please provide 'max', 'current' or 'events'"
+ exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
+ 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
--
2.25.1


2024-03-28 01:29:09

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 06/14] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

From: Sean Christopherson <[email protected]>

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
arch/x86/kernel/cpu/sgx/sgx.h | 15 +++++++++++++
2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 023af54c1beb..4991eb0af748 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);

/*
* These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
*/
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;

static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;

- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- if (list_empty(&sgx_active_page_list))
+ epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+ struct sgx_epc_page, list);
+ if (!epc_page)
break;

- epc_page = list_first_entry(&sgx_active_page_list,
- struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;

@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);

for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;

skip:
- spin_lock(&sgx_reclaimer_lock);
- list_add_tail(&epc_page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
+ list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);

kref_put(&encl_page->encl->refcount, sgx_encl_release);

@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_active_page_list);
+ !list_empty(&sgx_global_lru.reclaimable);
}

/*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)

ksgxd_tsk = tsk;

+ sgx_lru_init(&sgx_global_lru);
+
return true;
}

@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
}

/**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}

list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);

return 0;
}
@@ -578,7 +579,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}

- if (list_empty(&sgx_active_page_list)) {
+ if (list_empty(&sgx_global_lru.reclaimable)) {
page = ERR_PTR(-ENOMEM);
break;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 6accc81d19a9..fd28e2e98620 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -98,6 +98,21 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}

+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+ spinlock_t lock;
+ struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->reclaimable);
+}
+
void sgx_free_epc_page(struct sgx_epc_page *page);

void sgx_reclaim_direct(void);
--
2.25.1


2024-03-28 01:33:48

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

From: Kristen Carlson Accardi <[email protected]>

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables user to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
arch/x86/Kconfig | 13 +++++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 70 ++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/main.c | 51 ++++++++++++++++++-
arch/x86/kernel/cpu/sgx/sgx.h | 5 ++
include/linux/misc_cgroup.h | 2 +
7 files changed, 214 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 39886bab943a..bda78255a7ab 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1941,6 +1941,19 @@ config X86_SGX

If unsure, say N.

+config CGROUP_SGX_EPC
+ bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+ depends on X86_SGX && CGROUP_MISC
+ help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..a1dd43c195b2
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg: The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+ return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg: The charged sgx cgroup
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+ struct sgx_cgroup *sgx_cg;
+
+ sgx_cg = sgx_cgroup_from_misc_cg(cg);
+ if (!sgx_cg)
+ return;
+
+ kfree(sgx_cg);
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_res_ops sgx_cgroup_ops = {
+ .alloc = sgx_cgroup_alloc,
+ .free = sgx_cgroup_free,
+};
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
+{
+ cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+ sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_cgroup *sgx_cg;
+
+ sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+ if (!sgx_cg)
+ return -ENOMEM;
+
+ sgx_cgroup_misc_init(cg, sgx_cg);
+
+ return 0;
+}
+
+void sgx_cgroup_init(void)
+{
+ misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
+ sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..8f794e23fad6
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+ return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+ return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline void sgx_cgroup_init(void) { }
+#else
+struct sgx_cgroup {
+ struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+ return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_sgx_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+ if (sgx_cg)
+ put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_init(void);
+
+#endif
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d219f14365d4..023af54c1beb 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"

struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
static int sgx_nr_epc_sections;
@@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
*/
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
{
+ struct sgx_cgroup *sgx_cg;
struct sgx_epc_page *page;
+ int ret;
+
+ sgx_cg = sgx_get_current_cg();
+ ret = sgx_cgroup_try_charge(sgx_cg);
+ if (ret) {
+ sgx_put_cg(sgx_cg);
+ return ERR_PTR(ret);
+ }

for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}

- if (list_empty(&sgx_active_page_list))
- return ERR_PTR(-ENOMEM);
+ if (list_empty(&sgx_active_page_list)) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }

if (reclaim == SGX_NO_RECLAIM) {
page = ERR_PTR(-EBUSY);
@@ -580,10 +593,24 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}

+ /*
+ * Need to do a global reclamation if cgroup was not full but free
+ * physical pages run out, causing __sgx_alloc_epc_page() to fail.
+ */
sgx_reclaim_pages();
cond_resched();
}

+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (!IS_ERR(page)) {
+ WARN_ON_ONCE(page->sgx_cg);
+ /* sgx_put_cg() in sgx_free_epc_page() */
+ page->sgx_cg = sgx_cg;
+ } else {
+ sgx_cgroup_uncharge(sgx_cg);
+ sgx_put_cg(sgx_cg);
+ }
+#endif
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);

@@ -604,6 +631,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;

+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (page->sgx_cg) {
+ sgx_cgroup_uncharge(page->sgx_cg);
+ sgx_put_cg(page->sgx_cg);
+ page->sgx_cg = NULL;
+ }
+#endif
+
spin_lock(&node->lock);

page->owner = NULL;
@@ -643,6 +678,11 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].owner = NULL;
section->pages[i].poison = 0;
+
+#ifdef CONFIG_CGROUP_SGX_EPC
+ section->pages[i].sgx_cg = NULL;
+#endif
+
list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
}

@@ -787,6 +827,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -837,6 +878,7 @@ static bool __init sgx_page_cache_init(void)

sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;

sgx_nr_epc_sections++;
}
@@ -846,6 +888,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}

+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}

@@ -942,6 +986,9 @@ static int __init sgx_init(void)
if (sgx_vepc_init() && ret)
goto err_provision;

+ /* Setup cgroup if either the native or vepc driver is active */
+ sgx_cgroup_init();
+
return 0;

err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca34cd4f58ac..6accc81d19a9 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,12 +39,17 @@ enum sgx_reclaim {
SGX_DO_RECLAIM
};

+struct sgx_cgroup;
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
u16 poison;
struct sgx_encl_page *owner;
struct list_head list;
+#ifdef CONFIG_CGROUP_SGX_EPC
+ struct sgx_cgroup *sgx_cg;
+#endif
};

/*
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 2f6cc3a0ad23..1a16efdfcd3d 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@ struct misc_res_ops {
* @max: Maximum limit on the resource.
* @usage: Current usage of the resource.
* @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
*/
struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+ void *priv;
};

/**
--
2.25.1


2024-03-28 03:12:25

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 10/14] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

Enclave Page Cache(EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
ksgxd thread are user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the charging mm given is NULL, by
searching for an mm struct from enclave's mm_list.

Signed-off-by: Haitao Huang <[email protected]>
Reported-by: Mikko Ylinen <[email protected]>
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
arch/x86/kernel/cpu/sgx/encl.c | 29 ++++++++++++++--------------
arch/x86/kernel/cpu/sgx/encl.h | 3 +--
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++++++----
arch/x86/kernel/cpu/sgx/main.c | 29 +++++++++++++---------------
arch/x86/kernel/cpu/sgx/sgx.h | 2 +-
5 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
}

/*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
*/
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
{
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;

- /*
- * If called from normal task context, return the mem_cgroup
- * of the current task's mm. The remainder of the handling is for
- * ksgxd.
- */
- if (!current_is_ksgxd())
- return get_mem_cgroup_from_mm(current->mm);
+ /* Use the charge_mm if given. */
+ if (charge_mm)
+ return get_mem_cgroup_from_mm(charge_mm);

/*
* Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* @encl: an enclave pointer
* @page_index: enclave page index
* @backing: data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
*
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
* mms in the enclave's mm_list prior to any backing page allocation,
* in order to ensure that shmem page allocations are charged to the
* enclave. Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* -errno otherwise.
*/
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, struct mm_struct *charge_mm)
{
- struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+ struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;

diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
unsigned long end, unsigned long vm_flags);

-bool current_is_ksgxd(void);
void sgx_encl_release(struct kref *ref);
int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing);
+ struct sgx_backing *backing, struct mm_struct *charge_mm);
void sgx_encl_put_backing(struct sgx_backing *backing);
int sgx_encl_test_and_clear_young(struct mm_struct *mm,
struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 3ca89b1fb7e2..1defbf213e8d 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -111,6 +111,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
/**
* sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
* @root: The root of cgroup tree to reclaim from.
+ * @charge_mm: The mm to charge for backing store allocation.
*
* This function performs a pre-order walk in the cgroup tree under the given
* root, attempting to reclaim pages at each node until a fixed number of pages
@@ -124,7 +125,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
* triggering reclamation, and call cond_resched() in between iterations to
* avoid indefinite blocking.
*/
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
+static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -141,7 +142,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();

sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
- cnt += sgx_reclaim_pages(&sgx_cg->lru);
+ cnt += sgx_reclaim_pages(&sgx_cg->lru, charge_mm);

rcu_read_lock();
css_put(pos);
@@ -203,7 +204,8 @@ static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
* blocked until a worker makes its way through the global work queue.
*/
while (sgx_cgroup_should_reclaim(sgx_cg)) {
- sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ /* Indirect reclaim, no mm to charge, so NULL: */
+ sgx_cgroup_reclaim_pages(sgx_cg->cg, NULL);
cond_resched();
}
}
@@ -253,7 +255,7 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
return -EBUSY;
}

- sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
cond_resched();
}

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9f6594317ce4..c94f8b49e6f2 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -253,8 +253,8 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
}
}

-static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
- struct sgx_backing *backing)
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, struct sgx_backing *backing,
+ struct mm_struct *charge_mm)
{
struct sgx_encl_page *encl_page = epc_page->owner;
struct sgx_encl *encl = encl_page->encl;
@@ -270,7 +270,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,

if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
- &secs_backing);
+ &secs_backing, charge_mm);
if (ret)
goto out;

@@ -289,6 +289,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
/**
* sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
* @lru: The LRU from which pages are reclaimed.
+ * @charge_mm: The mm to charge for backing store allocation.
*
* Take a fixed number of pages from the head of a given LRU and reclaim them to
* the enclave's private shmem files. Skip the pages, which have been accessed
@@ -304,7 +305,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
*
* Return: Number of pages attempted for reclamation.
*/
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -344,7 +345,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);

mutex_lock(&encl_page->encl->lock);
- ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+ ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], charge_mm);
if (ret) {
mutex_unlock(&encl_page->encl->lock);
goto skip;
@@ -376,7 +377,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
continue;

encl_page = epc_page->owner;
- sgx_reclaimer_write(epc_page, &backing[i]);
+ sgx_reclaimer_write(epc_page, &backing[i], charge_mm);

kref_put(&encl_page->encl->refcount, sgx_encl_release);
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
@@ -393,9 +394,9 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}

-static void sgx_reclaim_pages_global(void)
+static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
{
- sgx_reclaim_pages(&sgx_global_lru);
+ sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}

/*
@@ -406,7 +407,7 @@ static void sgx_reclaim_pages_global(void)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(current->mm);
}

static int ksgxd(void *p)
@@ -429,7 +430,8 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));

if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages_global();
+ /* Indirect reclaim, no mm to charge, so NULL: */
+ sgx_reclaim_pages_global(NULL);

cond_resched();
}
@@ -452,11 +454,6 @@ static bool __init sgx_page_reclaimer_init(void)
return true;
}

-bool current_is_ksgxd(void)
-{
- return current == ksgxd_tsk;
-}
-
static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
{
struct sgx_numa_node *node = &sgx_numa_nodes[nid];
@@ -615,7 +612,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(current->mm);
cond_resched();
}

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 954f02efb516..689cec353d6f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -119,7 +119,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm);

void sgx_ipi_cb(void *info);

--
2.25.1


2024-03-28 03:12:39

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 11/14] x86/sgx: Abstract check for global reclaimable pages

From: Kristen Carlson Accardi <[email protected]>

To determine if any page available for reclamation at the global level,
only checking for emptiness of the global LRU is not adequate when pages
are tracked in multiple LRUs, one per cgroup. For this purpose, create a
new helper, sgx_can_reclaim(), currently only checks the global LRU,
later will check emptiness of LRUs of all cgroups when per-cgroup
tracking is turned on. Replace all the checks of the global LRU,
list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim().

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c94f8b49e6f2..7f92455d957d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
return &sgx_global_lru;
}

+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim(void)
+{
+ return !list_empty(&sgx_global_lru.reclaimable);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

/* Nodes with one or more EPC sections. */
@@ -391,7 +399,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *c
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_global_lru.reclaimable);
+ sgx_can_reclaim();
}

static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
@@ -593,7 +601,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}

- if (list_empty(&sgx_global_lru.reclaimable)) {
+ if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
--
2.25.1


2024-03-28 03:13:16

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 03/14] cgroup/misc: Export APIs for SGX driver

From: Kristen Carlson Accardi <[email protected]>

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or ancestor's limit. This requires a walk from the current cgroup up
to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.

The SGX driver also needs start a global level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
include/linux/misc_cgroup.h | 24 ++++++++++++++++++++++++
kernel/cgroup/misc.c | 21 ++++++++-------------
2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
};

+struct misc_cg *misc_cg_root(void);
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
}

+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
/*
* get_current_misc_cg() - Find and get the misc cgroup of the current task.
*
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
}

#else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+ return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+ return NULL;
+}

static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
{
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 14ab13ef3bc7..1f0d8e05b36c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];

/**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
*/
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
{
- return cgroup ? css_misc(cgroup->css.parent) : NULL;
+ return &root_cg;
}
+EXPORT_SYMBOL_GPL(misc_cg_root);

/**
* valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!amount)
return 0;

- for (i = cg; i; i = parent_misc(i)) {
+ for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];

new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
return 0;

err_charge:
- for (j = i; j; j = parent_misc(j)) {
+ for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}

- for (j = cg; j != i; j = parent_misc(j))
+ for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;

- for (i = cg; i; i = parent_misc(i))
+ for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
}
EXPORT_SYMBOL_GPL(misc_cg_uncharge);
--
2.25.1


2024-03-28 03:30:50

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

From: Kristen Carlson Accardi <[email protected]>

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., the sum of cgroup
limits greater than the total capacity, cgroups may never reclaim but
the total usage can still be near the capacity. Therefore a global
reclamation is still needed in those cases and it should be performed
from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU. Export
sgx_cgroup_reclaim_pages() in the header file so it can be reused for
this purpose.

Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
Export sgx_cgroup_lru_empty() so it can be reused for this purpose.

Finally, change sgx_reclaim_direct(), to check and ensure there are free
pages at cgroup level so forward progress can be made by the caller.
Export sgx_cgroup_should_reclaim() for reuse.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 6 +++---
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++++++
arch/x86/kernel/cpu/sgx/main.c | 29 +++++++++++++++++++++++++++-
3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 1defbf213e8d..cacd9e93344e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -72,7 +72,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
*
* Return: %true if all cgroups under the specified root have empty LRU lists.
*/
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -125,7 +125,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
* triggering reclamation, and call cond_resched() in between iterations to
* avoid indefinite blocking.
*/
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -166,7 +166,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *cha
* threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
* cgroup.
*/
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
{
u64 cur, max;

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index f66570d3ef42..8f55b38157da 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }

static inline void sgx_cgroup_init(void) { }
+
+static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
+{
+}
#else
struct sgx_cgroup {
struct misc_cg *cg;
@@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)

int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm);
void sgx_cgroup_init(void);

#endif
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 7f92455d957d..68f28ff2d5ef 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru;

static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (epc_page->sgx_cg)
+ return &epc_page->sgx_cg->lru;
+
+ /*
+ * This should not happen when cgroup is enabled: Every page belongs
+ * to a cgroup, or the root by default.
+ */
+ WARN_ON_ONCE(1);
+#endif
return &sgx_global_lru;
}

@@ -42,7 +52,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
*/
static inline bool sgx_can_reclaim(void)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ return !sgx_cgroup_lru_empty(misc_cg_root());
+#else
return !list_empty(&sgx_global_lru.reclaimable);
+#endif
}

static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -404,7 +418,10 @@ static bool sgx_should_reclaim(unsigned long watermark)

static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
{
- sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
+ else
+ sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}

/*
@@ -414,6 +431,16 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
*/
void sgx_reclaim_direct(void)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+
+ /* Make sure there are some free pages at cgroup level */
+ if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
+ sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
+ sgx_put_cg(sgx_cg);
+ }
+#endif
+ /* Make sure there are some free pages at global level */
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
}
--
2.25.1


2024-03-28 04:32:30

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 04/14] cgroup/misc: Add SGX EPC resource type

From: Kristen Carlson Accardi <[email protected]>

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V6:
- Split the original patch into this and the preceding one (Kai)
---
include/linux/misc_cgroup.h | 4 ++++
kernel/cgroup/misc.c | 4 ++++
2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..2f6cc3a0ad23 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* SGX EPC memory resource */
+ MISC_CG_RES_SGX_EPC,
#endif
MISC_CG_RES_TYPES
};
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 1f0d8e05b36c..e51d6a45007f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* Intel SGX EPC memory bytes */
+ "sgx_epc",
+#endif
};

/* Root misc cgroup */
--
2.25.1


2024-03-28 05:12:02

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

From: Kristen Carlson Accardi <[email protected]>

In cases EPC pages need be allocated during a page fault and the cgroup
usage is near its limit, an asynchronous reclamation needs be triggered
to avoid blocking the page fault handling.

Create a workqueue, corresponding work item and function definitions
for EPC cgroup to support the asynchronous reclamation.

In case the workqueue allocation is failed during init, disable cgroup.

In sgx_cgroup_try_charge(), if caller does not allow synchronous
reclamation, queue an asynchronous work into the workqueue.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fail during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 134 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 1 +
arch/x86/kernel/cpu/sgx/main.c | 8 +-
3 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index f7a487a29ed1..3ca89b1fb7e2 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,9 +5,63 @@
#include <linux/kernel.h>
#include "epc_cgroup.h"

+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT (SGX_CG_MIN_FREE_PAGE * 4)
+
/* The root SGX EPC cgroup */
static struct sgx_cgroup sgx_cg_root;

+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+ return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+ return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+ struct misc_cg *i = sgx_cg->cg;
+ u64 m = U64_MAX;
+
+ while (i) {
+ m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+ i = misc_cg_parent(i);
+ }
+
+ return m / PAGE_SIZE;
+}
+
/**
* sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
* @root: Root of the tree to check
@@ -99,6 +153,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
}

+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where
+ * the minimal number of free page needs be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if number of free pages available for the cgroup below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+ u64 cur, max;
+
+ if (sgx_cgroup_lru_empty(sgx_cg->cg))
+ return false;
+
+ max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+ /*
+ * Unless the limit is very low, maintain a minimal number of free pages
+ * so there is always a few pages available to serve new allocation
+ * requests quickly.
+ */
+ if (max > SGX_CG_LOW_LIMIT)
+ max -= SGX_CG_MIN_FREE_PAGE;
+
+ cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+ return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_cgroup *sgx_cg = container_of(work, struct sgx_cgroup, reclaim_work);
+
+ /*
+ * This work func is scheduled by sgx_cgroup_try_charge() when it cannot
+ * directly reclaim, i.e., EPC allocation in a fault handler. Waiting to
+ * reclaim until the cgroup is actually at its limit is less performant,
+ * as it means the task scheduling this asynchronous work is effectively
+ * blocked until a worker makes its way through the global work queue.
+ */
+ while (sgx_cgroup_should_reclaim(sgx_cg)) {
+ sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ cond_resched();
+ }
+}
+
static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
{
if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -125,14 +234,24 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
int ret;

+ /* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */
+ if (!sgx_cg_wq)
+ return 0;
+
+ /* Should not happen if the wq is allocated and SGX cgroup initialized properly. */
+ if (WARN_ON_ONCE(!sgx_cg))
+ return 0;
+
for (;;) {
ret = __sgx_cgroup_try_charge(sgx_cg);

if (ret != -EBUSY)
return ret;

- if (reclaim == SGX_NO_RECLAIM)
- return -ENOMEM;
+ if (reclaim == SGX_NO_RECLAIM) {
+ queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+ return -EBUSY;
+ }

sgx_cgroup_reclaim_pages(sgx_cg->cg);
cond_resched();
@@ -147,7 +266,8 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
*/
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
{
- misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+ if (sgx_cg)
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
}

static void sgx_cgroup_free(struct misc_cg *cg)
@@ -158,6 +278,7 @@ static void sgx_cgroup_free(struct misc_cg *cg)
if (!sgx_cg)
return;

+ cancel_work_sync(&sgx_cg->reclaim_work);
kfree(sgx_cg);
}

@@ -171,6 +292,7 @@ const struct misc_res_ops sgx_cgroup_ops = {
static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
{
sgx_lru_init(&sgx_cg->lru);
+ INIT_WORK(&sgx_cg->reclaim_work, sgx_cgroup_reclaim_work_func);
cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
sgx_cg->cg = cg;
}
@@ -190,6 +312,12 @@ static int sgx_cgroup_alloc(struct misc_cg *cg)

void sgx_cgroup_init(void)
{
+ sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, WQ_UNBOUND_MAX_ACTIVE);
+
+ /* All Cgroups functionalities are disabled. */
+ if (WARN_ON(!sgx_cg_wq))
+ return;
+
misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index f62dce0cac51..f66570d3ef42 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -32,6 +32,7 @@ static inline void sgx_cgroup_init(void) { }
struct sgx_cgroup {
struct misc_cg *cg;
struct sgx_epc_lru_list lru;
+ struct work_struct reclaim_work;
};

static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index aaf341abc641..9f6594317ce4 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -650,11 +650,9 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_numa_node *node = section->node;

#ifdef CONFIG_CGROUP_SGX_EPC
- if (page->sgx_cg) {
- sgx_cgroup_uncharge(page->sgx_cg);
- sgx_put_cg(page->sgx_cg);
- page->sgx_cg = NULL;
- }
+ sgx_cgroup_uncharge(page->sgx_cg);
+ sgx_put_cg(page->sgx_cg);
+ page->sgx_cg = NULL;
#endif

spin_lock(&node->lock);
--
2.25.1


2024-03-28 05:32:14

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v10 02/14] cgroup/misc: Add per resource callbacks for CSS events

From: Kristen Carlson Accardi <[email protected]>

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
include/linux/misc_cgroup.h | 11 +++++
kernel/cgroup/misc.c | 84 +++++++++++++++++++++++++++++++++----
2 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;

#include <linux/cgroup.h>

+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+ int (*alloc)(struct misc_cg *cg);
+ void (*free)(struct misc_cg *cg);
+};
+
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {

u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);

diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..14ab13ef3bc7 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
*/
static u64 misc_res_capacity[MISC_CG_RES_TYPES];

+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
/**
* parent_misc() - Get the parent of the passed misc cgroup.
* @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
}
EXPORT_SYMBOL_GPL(misc_cg_set_capacity);

+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ if (!ops->alloc) {
+ pr_err("%s: alloc missing\n", __func__);
+ return -EINVAL;
+ }
+
+ if (!ops->free) {
+ pr_err("%s: free missing\n", __func__);
+ return -EINVAL;
+ }
+
+ misc_res_ops[type] = ops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
/**
* misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
* @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
};

+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+ int ret;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ WRITE_ONCE(cg->res[i].max, MAX_NUM);
+ atomic64_set(&cg->res[i].usage, 0);
+ if (misc_res_ops[i]) {
+ ret = misc_res_ops[i]->alloc(cg);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+}
+
/**
* misc_cg_alloc() - Allocate misc cgroup.
* @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
static struct cgroup_subsys_state *
misc_cg_alloc(struct cgroup_subsys_state *parent_css)
{
- enum misc_res_type i;
- struct misc_cg *cg;
+ struct misc_cg *parent_cg, *cg;
+ int ret;

- if (!parent_css) {
- cg = &root_cg;
+ if (unlikely(!parent_css)) {
+ parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+ parent_cg = css_misc(parent_css);
}

- for (i = 0; i < MISC_CG_RES_TYPES; i++) {
- WRITE_ONCE(cg->res[i].max, MAX_NUM);
- atomic64_set(&cg->res[i].usage, 0);
+ ret = _misc_cg_res_alloc(cg);
+ if (ret) {
+ _misc_cg_res_free(cg);
+ if (likely(parent_css))
+ kfree(cg);
+
+ return ERR_PTR(ret);
}

return &cg->css;
@@ -410,7 +475,10 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
*/
static void misc_cg_free(struct cgroup_subsys_state *css)
{
- kfree(css_misc(css));
+ struct misc_cg *cg = css_misc(css);
+
+ _misc_cg_res_free(cg);
+ kfree(cg);
}

/* Cgroup controller callbacks */
--
2.25.1


2024-03-28 12:54:03

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality


> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.

It's 2024 now.

And looks you need to use C style comment for /* Copyright ... */, after looking
at some other C files.

> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;
> +
> +/**
> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + *
> + * @sgx_cg: The EPC cgroup to be charged for the page.
> + * Return:
> + * * %0 - If successfully charged.
> + * * -errno - for failures.
> + */
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +{
> + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +}
> +
> +/**
> + * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
> + * @sgx_cg: The charged sgx cgroup
> + */
> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
> +{
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +}
> +
> +static void sgx_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_cgroup *sgx_cg;
> +
> + sgx_cg = sgx_cgroup_from_misc_cg(cg);
> + if (!sgx_cg)
> + return;
> +
> + kfree(sgx_cg);
> +}
> +
> +static int sgx_cgroup_alloc(struct misc_cg *cg);

Again, this declaration can be removed if you move the below structure ...

> +
> +const struct misc_res_ops sgx_cgroup_ops = {
> + .alloc = sgx_cgroup_alloc,
> + .free = sgx_cgroup_free,
> +};
> +
> +static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
> + sgx_cg->cg = cg;
> +}
> +
> +static int sgx_cgroup_alloc(struct misc_cg *cg)
> +{
> + struct sgx_cgroup *sgx_cg;
> +
> + sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
> + if (!sgx_cg)
> + return -ENOMEM;
> +
> + sgx_cgroup_misc_init(cg, sgx_cg);
> +
> + return 0;
> +}

... here.

> +
> +void sgx_cgroup_init(void)
> +{
> + misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
> + sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
> +}
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..8f794e23fad6
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC

Nit: add an empty line to make text more breathable.

> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_cgroup;
> +
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + return NULL;
> +}
> +
> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
> +
> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +{
> + return 0;
> +}
> +
> +static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
> +
> +static inline void sgx_cgroup_init(void) { }
> +#else

Nit: I prefer two empty lines before and after the 'else'.

> +struct sgx_cgroup {
> + struct misc_cg *cg;
> +};
> +
> +static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +/**
> + * sgx_get_current_cg() - get the EPC cgroup of current process.
> + *
> + * Returned cgroup has its ref count increased by 1. Caller must call
> + * sgx_put_cg() to return the reference.
> + *
> + * Return: EPC cgroup to which the current task belongs to.
> + */
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

Again, I _think_ you need to check whether get_current_misc_cg() returns NULL?

Misc cgroup can be disabled by command line even it is on in the Kconfig.

I am not expert on cgroup, so could you check on this?

> +
> +/**
> + * sgx_put_sgx_cg() - Put the EPC cgroup and reduce its ref count.
> + * @sgx_cg - EPC cgroup to put.
> + */
> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
> +{
> + if (sgx_cg)
> + put_misc_cg(sgx_cg->cg);
> +}
> +
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_init(void);
> +
> +#endif
> +
> +#endif /* _SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d219f14365d4..023af54c1beb 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
> #include <linux/node.h>
> #include <linux/pagemap.h>
> #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
> #include "driver.h"
> #include "encl.h"
> #include "encls.h"
> +#include "epc_cgroup.h"
>
> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> static int sgx_nr_epc_sections;
> @@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> */
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> {
> + struct sgx_cgroup *sgx_cg;
> struct sgx_epc_page *page;
> + int ret;
> +
> + sgx_cg = sgx_get_current_cg();
> + ret = sgx_cgroup_try_charge(sgx_cg);
> + if (ret) {
> + sgx_put_cg(sgx_cg);
> + return ERR_PTR(ret);
> + }
>
> for ( ; ; ) {
> page = __sgx_alloc_epc_page();
> @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> break;
> }
>
> - if (list_empty(&sgx_active_page_list))
> - return ERR_PTR(-ENOMEM);
> + if (list_empty(&sgx_active_page_list)) {
> + page = ERR_PTR(-ENOMEM);
> + break;
> + }
>
> if (reclaim == SGX_NO_RECLAIM) {
> page = ERR_PTR(-EBUSY);
> @@ -580,10 +593,24 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> break;
> }
>
> + /*
> + * Need to do a global reclamation if cgroup was not full but free
> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> + */

Again, to me this comment shouldn't be here, because it doesn't add any more
information.

If you can reach here, you have passed the charge(). In fact, I believe this
doesn't matter: 

When you fail to allocate, you just need to reclaim.

Now you only have the global reclamation, thus you need to reclaim from it.

Perhaps it is useful when you have per-cgroup LRU list. In that case you can
put this comment there.

> sgx_reclaim_pages();
> cond_resched();
> }
>
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (!IS_ERR(page)) {
> + WARN_ON_ONCE(page->sgx_cg);
> + /* sgx_put_cg() in sgx_free_epc_page() */
> + page->sgx_cg = sgx_cg;
> + } else {
> + sgx_cgroup_uncharge(sgx_cg);
> + sgx_put_cg(sgx_cg);
> + }
> +#endif

Again, IMHO having CONFIG_CGROUP_SGX_EPC here is ugly, because it doesn't even
match the try_charge() above, which doesn't have the CONFIG_CGROUP_SGX_EPC.

If you add a wrapper in "epc_cgroup.h"

static inline void sgx_epc_page_set_cgroup(struct epc_page *epc_page, 
struct sgx_cgroup *sgx_cg)
{
#ifdef CONFIG_CGROUP_SGX_EPC
epc_page->sgx_cg = sgx_cg;
#endif
}

Then I believe the above can be simplified to:

if (!IS_ERR(page)) {
sgx_epc_page_set_cgroup(page, sgx_cg);
} else {
sgx_cgroup_uncharge(sgx_cg);
sgx_put_cg(sgx_cg);
}

> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> wake_up(&ksgxd_waitq);
>
> @@ -604,6 +631,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> struct sgx_numa_node *node = section->node;
>
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (page->sgx_cg) {
> + sgx_cgroup_uncharge(page->sgx_cg);
> + sgx_put_cg(page->sgx_cg);
> + page->sgx_cg = NULL;
> + }
> +#endif
> +

Similarly, how about adding a wrapper in "epc_cgroup.h"

struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
{
#ifdef CONFIG_CGROUP_SGX_EPC
return page->sgx_cg;
#else
return NULL;
#endif
}

Then this can be:

sgx_cg = sgx_epc_page_get_cgroup(page);
sgx_cgroup_uncharge(sgx_cg);
sgx_put_cg(sgx_cg);
sgx_epc_page_set_cgroup(page, NULL);

> spin_lock(&node->lock);
>
> page->owner = NULL;
> @@ -643,6 +678,11 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> section->pages[i].flags = 0;
> section->pages[i].owner = NULL;
> section->pages[i].poison = 0;
> +
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + section->pages[i].sgx_cg = NULL;
> +#endif

Can use the wrapper too.

[...]

(Btw, I'll be away for couple of days due to public holiday and will review the
rest starting from late next week).

2024-03-30 11:17:33

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
>
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > @@ -0,0 +1,74 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright(c) 2022 Intel Corporation.
>
> It's 2024 now.
>
> And looks you need to use C style comment for /* Copyright ... */, after looking
> at some other C files.

To be fair, this happens *all the time* to everyone :-)

I've proposed this few times in SGX context and going to say it now.
Given the nature of Git copyrights would anyway need to be sorted by
the Git log not possibly incorrect copyright platters in the header
and source files.

BR, Jarkko

2024-03-30 11:29:31

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 14/14] selftests/sgx: Add scripts for EPC cgroup testing

On Thu Mar 28, 2024 at 2:22 AM EET, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all process exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg.
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang <[email protected]>

My previous comments and you have two undocumented dependencies for your
selftest (I searched for cgexec and cgroups-tools as keywords).

BR, Jarkko

2024-03-31 19:02:03

by Haitao Huang

[permalink] [raw]
Subject: [PATCH] selftests/sgx: Improve cgroup test scripts

Make cgroup test scripts ash compatible.
Remove cg-tools dependency.
Add documentation for functions.

Tested with busybox on Ubuntu.

Signed-off-by: Haitao Huang <[email protected]>
---
tools/testing/selftests/sgx/ash_cgexec.sh | 58 ++++++
.../selftests/sgx/run_epc_cg_selftests.sh | 171 +++++++++++-------
.../selftests/sgx/watch_misc_for_tests.sh | 13 +-
3 files changed, 165 insertions(+), 77 deletions(-)
create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh

diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh
new file mode 100755
index 000000000000..51232d6452a8
--- /dev/null
+++ b/tools/testing/selftests/sgx/ash_cgexec.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2024 Intel Corporation.
+
+# Move the current shell process to the specified cgroup
+# Arguments:
+# $1 - The cgroup controller name, e.g., misc, memory.
+# $2 - The path of the cgroup,
+# relative to /sys/fs/cgroup for cgroup v2,
+# relative to /sys/fs/cgroup/$1 for v1.
+move_to_cgroup() {
+ controllers="$1"
+ path="$2"
+
+ # Check if cgroup v2 is in use
+ if [ ! -d "/sys/fs/cgroup/misc" ]; then
+ # Cgroup v2 logic
+ cgroup_full_path="/sys/fs/cgroup/${path}"
+ echo $$ > "${cgroup_full_path}/cgroup.procs"
+ else
+ # Cgroup v1 logic
+ OLD_IFS="$IFS"
+ IFS=','
+ for controller in $controllers; do
+ cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
+ echo $$ > "${cgroup_full_path}/tasks"
+ done
+ IFS="$OLD_IFS"
+ fi
+}
+
+if [ "$#" -lt 3 ] || [ "$1" != "-g" ]; then
+ echo "Usage: $0 -g <controller1,controller2:path> [-g <controller3:path> ...] <command> [args...]"
+ exit 1
+fi
+
+while [ "$#" -gt 0 ]; do
+ case "$1" in
+ -g)
+ # Ensure that a controller:path pair is provided after -g
+ if [ -z "$2" ]; then
+ echo "Error: Missing controller:path argument after -g"
+ exit 1
+ fi
+ IFS=':' read CONTROLLERS CGROUP_PATH <<EOF
+$2
+EOF
+ move_to_cgroup "$CONTROLLERS" "$CGROUP_PATH"
+ shift 2
+ ;;
+ *)
+ # Execute the command within the cgroup
+ exec "$@"
+ ;;
+ esac
+done
+
+
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
index e027bf39f005..b9c73ab784cb 100755
--- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -1,24 +1,14 @@
-#!/bin/bash
+#!/usr/bin/env sh
# SPDX-License-Identifier: GPL-2.0
-# Copyright(c) 2023 Intel Corporation.
+# Copyright(c) 2023, 2024 Intel Corporation.

TEST_ROOT_CG=selftest
-cgcreate -g misc:$TEST_ROOT_CG
-if [ $? -ne 0 ]; then
- echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
- exit 1
-fi
TEST_CG_SUB1=$TEST_ROOT_CG/test1
TEST_CG_SUB2=$TEST_ROOT_CG/test2
# We will only set limit in test1 and run tests in test3
TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
TEST_CG_SUB4=$TEST_ROOT_CG/test4

-cgcreate -g misc:$TEST_CG_SUB1
-cgcreate -g misc:$TEST_CG_SUB2
-cgcreate -g misc:$TEST_CG_SUB3
-cgcreate -g misc:$TEST_CG_SUB4
-
# Default to V2
CG_MISC_ROOT=/sys/fs/cgroup
CG_MEM_ROOT=/sys/fs/cgroup
@@ -31,6 +21,10 @@ else
CG_MEM_ROOT=/sys/fs/cgroup/memory
CG_V1=1
fi
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB1
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4

CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
# This is below number of VA pages needed for enclave of capacity size. So
@@ -48,34 +42,66 @@ echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max

+if [ $? -ne 0 ]; then
+ echo "# Failed setting up misc limits, make sure misc cgroup is mounted."
+ exit 1
+fi
+
+clean_up_misc()
+{
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB2
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB3
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB4
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB1
+rmdir $CG_MISC_ROOT/$TEST_ROOT_CG
+}
+
timestamp=$(date +%Y%m%d_%H%M%S)

test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"

+# Wait for a process and check for expected exit status.
+#
+# Arguments:
+# $1 - the pid of the process to wait and check.
+# $2 - 1 if expecting success, 0 for failure.
+#
+# Return:
+# 0 if the exit status of the process matches the expectation.
+# 1 otherwise.
wait_check_process_status() {
- local pid=$1
- local check_for_success=$2 # If 1, check for success;
- # If 0, check for failure
+ pid=$1
+ check_for_success=$2 # If 1, check for success;
+ # If 0, check for failure
wait "$pid"
- local status=$?
+ status=$?

- if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+ if [ $check_for_success -eq 1 ] && [ $status -eq 0 ]; then
echo "# Process $pid succeeded."
return 0
- elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+ elif [ $check_for_success -eq 0 ] && [ $status -ne 0 ]; then
echo "# Process $pid returned failure."
return 0
fi
return 1
}

+# Wait for a set of processes and check for expected exit status
+#
+# Arguments:
+# $1 - 1 if expecting success, 0 for failure.
+# remaining args - The pids of the processes
+#
+# Return:
+# 0 if exit status of any process matches the expectation.
+# 1 otherwise.
wait_and_detect_for_any() {
- local pids=("$@")
- local check_for_success=$1 # If 1, check for success;
- # If 0, check for failure
- local detected=1 # 0 for success detection
+ check_for_success=$1 # If 1, check for success;
+ # If 0, check for failure
+ shift
+ detected=1 # 0 for success detection

- for pid in "${pids[@]:1}"; do
+ for pid in $@; do
if wait_check_process_status "$pid" "$check_for_success"; then
detected=0
# Wait for other processes to exit
@@ -88,10 +114,10 @@ wait_and_detect_for_any() {
echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
# Always use leaf node of misc cgroups so it works for both v1 and v2
# these may fail on OOM
-cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
-if [[ $? -eq 0 ]]; then
+./ash_cgexec.sh -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [ $? -eq 0 ]; then
echo "# Fail on SMALL limit, not expecting any test passes."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
else
echo "# Test failed as expected."
@@ -102,54 +128,54 @@ echo "# PASSED SMALL limit."
echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
expecting at least one success...."

-pids=()
-for i in {1..4}; do
+pids=""
+for i in 1 2 3 4; do
(
- cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done


-if wait_and_detect_for_any 1 "${pids[@]}"; then
+if wait_and_detect_for_any 1 "$pids"; then
echo "# PASSED LARGE limit positive testing."
else
echo "# Failed on LARGE limit positive testing, no test passes."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi

echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
expecting at least one failure...."
-pids=()
-for i in {1..5}; do
+pids=""
+for i in 1 2 3 4 5; do
(
- cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 0 "${pids[@]}"; then
+if wait_and_detect_for_any 0 "$pids"; then
echo "# PASSED LARGE limit negative testing."
else
echo "# Failed on LARGE limit negative testing, no test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi

echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
expecting no failure...."
-pids=()
-for i in {1..8}; do
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
(
- cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 0 "${pids[@]}"; then
+if wait_and_detect_for_any 0 "$pids"; then
echo "# Failed on LARGER limit, at least one test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
else
echo "# PASSED LARGER limit tests."
@@ -157,51 +183,58 @@ fi

echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
randomly kill one, expecting no failure...."
-pids=()
-for i in {1..8}; do
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
(
- cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done
-
-sleep $((RANDOM % 10 + 5))
+random_number=$(awk 'BEGIN{srand();print int(rand()*10)}')
+sleep $((random_number + 5))

# Randomly select a PID to kill
-RANDOM_INDEX=$((RANDOM % 8))
-PID_TO_KILL=${pids[RANDOM_INDEX]}
+RANDOM_INDEX=$(awk 'BEGIN{srand();print int(rand()*8)}')
+counter=0
+for pid in $pids; do
+ if [ "$counter" -eq "$RANDOM_INDEX" ]; then
+ PID_TO_KILL=$pid
+ break
+ fi
+ counter=$((counter + 1))
+done

kill $PID_TO_KILL
echo "# Killed process with PID: $PID_TO_KILL"

any_failure=0
-for pid in "${pids[@]}"; do
+for pid in $pids; do
wait "$pid"
status=$?
if [ "$pid" != "$PID_TO_KILL" ]; then
- if [[ $status -ne 0 ]]; then
+ if [ $status -ne 0 ]; then
echo "# Process $pid returned failure."
any_failure=1
fi
fi
done

-if [[ $any_failure -ne 0 ]]; then
+if [ $any_failure -ne 0 ]; then
echo "# Failed on random killing, at least one test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi
echo "# PASSED LARGER limit test with a process randomly killed."

-cgcreate -g memory:$TEST_CG_SUB2
+mkdir -p $CG_MEM_ROOT/$TEST_CG_SUB2
if [ $? -ne 0 ]; then
echo "# Failed creating memory controller."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi
MEM_LIMIT_TOO_SMALL=$((CAPACITY - 2 * LARGE))

-if [[ $CG_V1 -eq 0 ]]; then
+if [ $CG_V1 -eq 0 ]; then
echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.max
else
echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.limit_in_bytes
@@ -210,20 +243,20 @@ fi

echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE EPC limit,
and too small RAM limit, expecting all failures...."
-pids=()
-for i in {1..4}; do
+pids=""
+for i in 1 2 3 4; do
(
- cgexec -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
+ ./ash_cgexec.sh -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
>cgtest_large_oom_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 1 "${pids[@]}"; then
+if wait_and_detect_for_any 1 "$pids"; then
echo "# Failed on tests with memcontrol, some tests did not fail."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
if [[ $CG_V1 -ne 0 ]]; then
- cgdelete -r -g memory:$TEST_ROOT_CG
+ rmdir $CG_MEM_ROOT/$TEST_CG_SUB2
fi
exit 1
else
@@ -239,8 +272,8 @@ else
echo "# PASSED leakage check."
echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
fi
-cgdelete -r -g misc:$TEST_ROOT_CG
-if [[ $CG_V1 -ne 0 ]]; then
- cgdelete -r -g memory:$TEST_ROOT_CG
+clean_up_misc
+if [ $CG_V1 -ne 0 ]; then
+ rmdir $CG_MEM_ROOT/$TEST_CG_SUB2
fi
echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
index dbd38f346e7b..3b05475938d0 100755
--- a/tools/testing/selftests/sgx/watch_misc_for_tests.sh
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -1,13 +1,10 @@
-#!/bin/bash
-# SPDX-License-Identifier: GPL-2.0
+##!/usr/bin/env sh
+#!/bin/sh
# Copyright(c) 2023 Intel Corporation.

-if [ -z "$1" ]
- then
- echo "No argument supplied, please provide 'max', 'current' or 'events'"
+if [ -z "$1" ]; then
+ echo "No argument supplied, please provide 'max', 'current', or 'events'"
exit 1
fi

-watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
- 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
-
+watch -n 1 'find /sys/fs/cgroup -wholename "*/test*/misc.'$1'" -exec sh -c '\''echo "$1:"; cat "$1"'\'' _ {} \;'

base-commit: 0d0d598f09a5ef12412b797fa160947febcd1777
--
2.25.1


2024-04-01 09:30:43

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> >
> > > --- /dev/null
> > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > @@ -0,0 +1,74 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +// Copyright(c) 2022 Intel Corporation.
> >
> > It's 2024 now.
> >
> > And looks you need to use C style comment for /* Copyright ... */, after looking
> > at some other C files.
>
> To be fair, this happens *all the time* to everyone :-)
>
> I've proposed this few times in SGX context and going to say it now.
> Given the nature of Git copyrights would anyway need to be sorted by
> the Git log not possibly incorrect copyright platters in the header
> and source files.
>

Sure fine to me either way. Thanks for pointing out.

I have some vague memory that we should update the year but I guess I was wrong.

2024-04-01 14:30:23

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

On Mon Apr 1, 2024 at 12:29 PM EEST, Huang, Kai wrote:
> On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> > >
> > > > --- /dev/null
> > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > > @@ -0,0 +1,74 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +// Copyright(c) 2022 Intel Corporation.
> > >
> > > It's 2024 now.
> > >
> > > And looks you need to use C style comment for /* Copyright ... */, after looking
> > > at some other C files.
> >
> > To be fair, this happens *all the time* to everyone :-)
> >
> > I've proposed this few times in SGX context and going to say it now.
> > Given the nature of Git copyrights would anyway need to be sorted by
> > the Git log not possibly incorrect copyright platters in the header
> > and source files.
> >
>
> Sure fine to me either way. Thanks for pointing out.
>
> I have some vague memory that we should update the year but I guess I was wrong.

I think updating year makes sense!

I'd be fine not having copyright platter at all since the commit is from
Intel domain anyway but if it is kept then the year needs to be
corrected.

I mean Git commit stores all the data, including exact date.

BR, Jarkko


2024-04-01 14:32:27

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH] selftests/sgx: Improve cgroup test scripts

On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
> Make cgroup test scripts ash compatible.
> Remove cg-tools dependency.
> Add documentation for functions.
>
> Tested with busybox on Ubuntu.
>
> Signed-off-by: Haitao Huang <[email protected]>

I'll run this next week on good old NUC7. Thank you.

I really wish that either (hopefully both) Intel or AMD would bring up
for developers home use meant platform to develop on TDX and SNP. It is
a shame that the latest and greatest is from 2018.

BR, Jarkko

2024-04-02 01:43:07

by Haitao Huang

[permalink] [raw]
Subject: [PATCH v2] selftests/sgx: Improve cgroup test scripts

Make cgroup test scripts ash compatible.
Remove cg-tools dependency.
Add documentation for functions.

Tested with busybox on Ubuntu.

Signed-off-by: Haitao Huang <[email protected]>
---
v2:
- Fixes for v2 cgroup
- Turn off swapping before memcontrol tests and back on after
- Add comments and reformat
---
tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++++++
.../selftests/sgx/run_epc_cg_selftests.sh | 187 +++++++++++-------
.../selftests/sgx/watch_misc_for_tests.sh | 13 +-
3 files changed, 179 insertions(+), 78 deletions(-)
create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh

diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh
new file mode 100755
index 000000000000..9607784378df
--- /dev/null
+++ b/tools/testing/selftests/sgx/ash_cgexec.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2024 Intel Corporation.
+
+# Move the current shell process to the specified cgroup
+# Arguments:
+# $1 - The cgroup controller name, e.g., misc, memory.
+# $2 - The path of the cgroup,
+# relative to /sys/fs/cgroup for cgroup v2,
+# relative to /sys/fs/cgroup/$1 for v1.
+move_to_cgroup() {
+ controllers="$1"
+ path="$2"
+
+ # Check if cgroup v2 is in use
+ if [ ! -d "/sys/fs/cgroup/misc" ]; then
+ # Cgroup v2 logic
+ cgroup_full_path="/sys/fs/cgroup/${path}"
+ echo $$ > "${cgroup_full_path}/cgroup.procs"
+ else
+ # Cgroup v1 logic
+ OLD_IFS="$IFS"
+ IFS=','
+ for controller in $controllers; do
+ cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
+ echo $$ > "${cgroup_full_path}/tasks"
+ done
+ IFS="$OLD_IFS"
+ fi
+}
+
+if [ "$#" -lt 3 ] || [ "$1" != "-g" ]; then
+ echo "Usage: $0 -g <controller1,controller2:path> [-g <controller3:path> ...] <command> [args...]"
+ exit 1
+fi
+
+while [ "$#" -gt 0 ]; do
+ case "$1" in
+ -g)
+ # Ensure that a controller:path pair is provided after -g
+ if [ -z "$2" ]; then
+ echo "Error: Missing controller:path argument after -g"
+ exit 1
+ fi
+ IFS=':' read CONTROLLERS CGROUP_PATH <<EOF
+$2
+EOF
+ move_to_cgroup "$CONTROLLERS" "$CGROUP_PATH"
+ shift 2
+ ;;
+ *)
+ # Execute the command within the cgroup
+ exec "$@"
+ ;;
+ esac
+done
+
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
index e027bf39f005..be4172f84580 100755
--- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -1,24 +1,14 @@
-#!/bin/bash
+#!/usr/bin/env sh
# SPDX-License-Identifier: GPL-2.0
-# Copyright(c) 2023 Intel Corporation.
+# Copyright(c) 2023, 2024 Intel Corporation.

TEST_ROOT_CG=selftest
-cgcreate -g misc:$TEST_ROOT_CG
-if [ $? -ne 0 ]; then
- echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
- exit 1
-fi
TEST_CG_SUB1=$TEST_ROOT_CG/test1
TEST_CG_SUB2=$TEST_ROOT_CG/test2
# We will only set limit in test1 and run tests in test3
TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
TEST_CG_SUB4=$TEST_ROOT_CG/test4

-cgcreate -g misc:$TEST_CG_SUB1
-cgcreate -g misc:$TEST_CG_SUB2
-cgcreate -g misc:$TEST_CG_SUB3
-cgcreate -g misc:$TEST_CG_SUB4
-
# Default to V2
CG_MISC_ROOT=/sys/fs/cgroup
CG_MEM_ROOT=/sys/fs/cgroup
@@ -31,6 +21,19 @@ else
CG_MEM_ROOT=/sys/fs/cgroup/memory
CG_V1=1
fi
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB1
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
+mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4
+
+# Turn on misc and memory controller in non-leaf nodes for V2
+if [ $CG_V1 -eq 0 ]; then
+ echo "+misc" > $CG_MISC_ROOT/cgroup.subtree_control
+ echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
+ echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+ echo "+memory" > $CG_MEM_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+ echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
+fi

CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
# This is below number of VA pages needed for enclave of capacity size. So
@@ -48,34 +51,67 @@ echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max

+if [ $? -ne 0 ]; then
+ echo "# Failed setting up misc limits, make sure misc cgroup is mounted."
+ exit 1
+fi
+
+clean_up_misc()
+{
+ sleep 2
+ rmdir $CG_MISC_ROOT/$TEST_CG_SUB2
+ rmdir $CG_MISC_ROOT/$TEST_CG_SUB3
+ rmdir $CG_MISC_ROOT/$TEST_CG_SUB4
+ rmdir $CG_MISC_ROOT/$TEST_CG_SUB1
+ rmdir $CG_MISC_ROOT/$TEST_ROOT_CG
+}
+
timestamp=$(date +%Y%m%d_%H%M%S)

test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"

+# Wait for a process and check for expected exit status.
+#
+# Arguments:
+# $1 - the pid of the process to wait and check.
+# $2 - 1 if expecting success, 0 for failure.
+#
+# Return:
+# 0 if the exit status of the process matches the expectation.
+# 1 otherwise.
wait_check_process_status() {
- local pid=$1
- local check_for_success=$2 # If 1, check for success;
- # If 0, check for failure
+ pid=$1
+ check_for_success=$2 # If 1, check for success;
+ # If 0, check for failure
wait "$pid"
- local status=$?
+ status=$?

- if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+ if [ $check_for_success -eq 1 ] && [ $status -eq 0 ]; then
echo "# Process $pid succeeded."
return 0
- elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+ elif [ $check_for_success -eq 0 ] && [ $status -ne 0 ]; then
echo "# Process $pid returned failure."
return 0
fi
return 1
}

+# Wait for a set of processes and check for expected exit status
+#
+# Arguments:
+# $1 - 1 if expecting success, 0 for failure.
+# remaining args - The pids of the processes
+#
+# Return:
+# 0 if exit status of any process matches the expectation.
+# 1 otherwise.
wait_and_detect_for_any() {
- local pids=("$@")
- local check_for_success=$1 # If 1, check for success;
- # If 0, check for failure
- local detected=1 # 0 for success detection
+ check_for_success=$1 # If 1, check for success;
+ # If 0, check for failure
+ shift
+ detected=1 # 0 for success detection

- for pid in "${pids[@]:1}"; do
+ for pid in $@; do
if wait_check_process_status "$pid" "$check_for_success"; then
detected=0
# Wait for other processes to exit
@@ -88,10 +124,10 @@ wait_and_detect_for_any() {
echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
# Always use leaf node of misc cgroups so it works for both v1 and v2
# these may fail on OOM
-cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
-if [[ $? -eq 0 ]]; then
+./ash_cgexec.sh -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [ $? -eq 0 ]; then
echo "# Fail on SMALL limit, not expecting any test passes."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
else
echo "# Test failed as expected."
@@ -102,54 +138,54 @@ echo "# PASSED SMALL limit."
echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
expecting at least one success...."

-pids=()
-for i in {1..4}; do
+pids=""
+for i in 1 2 3 4; do
(
- cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done


-if wait_and_detect_for_any 1 "${pids[@]}"; then
+if wait_and_detect_for_any 1 "$pids"; then
echo "# PASSED LARGE limit positive testing."
else
echo "# Failed on LARGE limit positive testing, no test passes."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi

echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
expecting at least one failure...."
-pids=()
-for i in {1..5}; do
+pids=""
+for i in 1 2 3 4 5; do
(
- cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 0 "${pids[@]}"; then
+if wait_and_detect_for_any 0 "$pids"; then
echo "# PASSED LARGE limit negative testing."
else
echo "# Failed on LARGE limit negative testing, no test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi

echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
expecting no failure...."
-pids=()
-for i in {1..8}; do
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
(
- cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 0 "${pids[@]}"; then
+if wait_and_detect_for_any 0 "$pids"; then
echo "# Failed on LARGER limit, at least one test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
else
echo "# PASSED LARGER limit tests."
@@ -157,51 +193,58 @@ fi

echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
randomly kill one, expecting no failure...."
-pids=()
-for i in {1..8}; do
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
(
- cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
+ ./ash_cgexec.sh -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done
-
-sleep $((RANDOM % 10 + 5))
+random_number=$(awk 'BEGIN{srand();print int(rand()*10)}')
+sleep $((random_number + 5))

# Randomly select a PID to kill
-RANDOM_INDEX=$((RANDOM % 8))
-PID_TO_KILL=${pids[RANDOM_INDEX]}
+RANDOM_INDEX=$(awk 'BEGIN{srand();print int(rand()*8)}')
+counter=0
+for pid in $pids; do
+ if [ "$counter" -eq "$RANDOM_INDEX" ]; then
+ PID_TO_KILL=$pid
+ break
+ fi
+ counter=$((counter + 1))
+done

kill $PID_TO_KILL
echo "# Killed process with PID: $PID_TO_KILL"

any_failure=0
-for pid in "${pids[@]}"; do
+for pid in $pids; do
wait "$pid"
status=$?
if [ "$pid" != "$PID_TO_KILL" ]; then
- if [[ $status -ne 0 ]]; then
+ if [ $status -ne 0 ]; then
echo "# Process $pid returned failure."
any_failure=1
fi
fi
done

-if [[ $any_failure -ne 0 ]]; then
+if [ $any_failure -ne 0 ]; then
echo "# Failed on random killing, at least one test fails."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi
echo "# PASSED LARGER limit test with a process randomly killed."

-cgcreate -g memory:$TEST_CG_SUB2
+mkdir -p $CG_MEM_ROOT/$TEST_CG_SUB2
if [ $? -ne 0 ]; then
echo "# Failed creating memory controller."
- cgdelete -r -g misc:$TEST_ROOT_CG
+ clean_up_misc
exit 1
fi
MEM_LIMIT_TOO_SMALL=$((CAPACITY - 2 * LARGE))

-if [[ $CG_V1 -eq 0 ]]; then
+if [ $CG_V1 -eq 0 ]; then
echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.max
else
echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.limit_in_bytes
@@ -210,23 +253,27 @@ fi

echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE EPC limit,
and too small RAM limit, expecting all failures...."
-pids=()
-for i in {1..4}; do
+# Ensure swapping off
+swapoff -a
+pids=""
+for i in 1 2 3 4; do
(
- cgexec -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
+ ./ash_cgexec.sh -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
>cgtest_large_oom_$timestamp.$i.log 2>&1
) &
- pids+=($!)
+ pids="$pids $!"
done

-if wait_and_detect_for_any 1 "${pids[@]}"; then
+if wait_and_detect_for_any 1 "$pids"; then
echo "# Failed on tests with memcontrol, some tests did not fail."
- cgdelete -r -g misc:$TEST_ROOT_CG
- if [[ $CG_V1 -ne 0 ]]; then
- cgdelete -r -g memory:$TEST_ROOT_CG
+ clean_up_misc
+ if [ $CG_V1 -ne 0 ]; then
+ rmdir $CG_MEM_ROOT/$TEST_CG_SUB2
fi
+ swapon -a
exit 1
else
+ swapon -a
echo "# PASSED LARGE limit tests with memcontrol."
fi

@@ -239,8 +286,8 @@ else
echo "# PASSED leakage check."
echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
fi
-cgdelete -r -g misc:$TEST_ROOT_CG
-if [[ $CG_V1 -ne 0 ]]; then
- cgdelete -r -g memory:$TEST_ROOT_CG
+clean_up_misc
+if [ $CG_V1 -ne 0 ]; then
+ rmdir $CG_MEM_ROOT/$TEST_CG_SUB2
fi
echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
index dbd38f346e7b..3b05475938d0 100755
--- a/tools/testing/selftests/sgx/watch_misc_for_tests.sh
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -1,13 +1,10 @@
-#!/bin/bash
-# SPDX-License-Identifier: GPL-2.0
+##!/usr/bin/env sh
+#!/bin/sh
# Copyright(c) 2023 Intel Corporation.

-if [ -z "$1" ]
- then
- echo "No argument supplied, please provide 'max', 'current' or 'events'"
+if [ -z "$1" ]; then
+ echo "No argument supplied, please provide 'max', 'current', or 'events'"
exit 1
fi

-watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
- 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
-
+watch -n 1 'find /sys/fs/cgroup -wholename "*/test*/misc.'$1'" -exec sh -c '\''echo "$1:"; cat "$1"'\'' _ {} \;'
--
2.25.1


2024-04-02 03:44:26

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH] selftests/sgx: Improve cgroup test scripts

On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen <[email protected]>
wrote:

> On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
>> Make cgroup test scripts ash compatible.
>> Remove cg-tools dependency.
>> Add documentation for functions.
>>
>> Tested with busybox on Ubuntu.
>>
>> Signed-off-by: Haitao Huang <[email protected]>
>
> I'll run this next week on good old NUC7. Thank you.
>
> I really wish that either (hopefully both) Intel or AMD would bring up
> for developers home use meant platform to develop on TDX and SNP. It is
> a shame that the latest and greatest is from 2018.
>
> BR, Jarkko
>

Argh, missed a few changes for v2 cgroup:

--- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup
CG_V1=0
if [ ! -d "/sys/fs/cgroup/misc" ]; then
echo "# cgroup V2 is in use."
+ echo "+misc" > $CG_MISC_ROOT/cgroup.subtree_control
+ echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
else
echo "# cgroup V1 is in use."
CG_MISC_ROOT=/sys/fs/cgroup/misc
@@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4

+if [ $CG_V1 -eq 0 ]; then
+echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
+fi

2024-04-02 07:37:34

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH] selftests/sgx: Improve cgroup test scripts

On Tue Apr 2, 2024 at 1:55 AM EEST, Haitao Huang wrote:
> On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
> >> Make cgroup test scripts ash compatible.
> >> Remove cg-tools dependency.
> >> Add documentation for functions.
> >>
> >> Tested with busybox on Ubuntu.
> >>
> >> Signed-off-by: Haitao Huang <[email protected]>
> >
> > I'll run this next week on good old NUC7. Thank you.
> >
> > I really wish that either (hopefully both) Intel or AMD would bring up
> > for developers home use meant platform to develop on TDX and SNP. It is
> > a shame that the latest and greatest is from 2018.
> >
> > BR, Jarkko
> >
>
> Argh, missed a few changes for v2 cgroup:
>
> --- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup
> CG_V1=0
> if [ ! -d "/sys/fs/cgroup/misc" ]; then
> echo "# cgroup V2 is in use."
> + echo "+misc" > $CG_MISC_ROOT/cgroup.subtree_control
> + echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
> else
> echo "# cgroup V1 is in use."
> CG_MISC_ROOT=/sys/fs/cgroup/misc
> @@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
> mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
> mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4
>
> +if [ $CG_V1 -eq 0 ]; then
> +echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
> +echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
> +fi

Maybe it would be most convenient just to +1 the kselftest patch?

Alternatively you could point out to a Git branch with the series
and the updated patch.

BR, Jarkko

2024-04-02 07:43:41

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
> Make cgroup test scripts ash compatible.
> Remove cg-tools dependency.
> Add documentation for functions.
>
> Tested with busybox on Ubuntu.
>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> v2:
> - Fixes for v2 cgroup
> - Turn off swapping before memcontrol tests and back on after
> - Add comments and reformat
> ---
> tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++++++
> .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++++++++++-------
> .../selftests/sgx/watch_misc_for_tests.sh | 13 +-
> 3 files changed, 179 insertions(+), 78 deletions(-)
> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>
> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh
> new file mode 100755
> index 000000000000..9607784378df
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> @@ -0,0 +1,57 @@
> +#!/usr/bin/env sh
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2024 Intel Corporation.
> +
> +# Move the current shell process to the specified cgroup
> +# Arguments:
> +# $1 - The cgroup controller name, e.g., misc, memory.
> +# $2 - The path of the cgroup,
> +# relative to /sys/fs/cgroup for cgroup v2,
> +# relative to /sys/fs/cgroup/$1 for v1.
> +move_to_cgroup() {
> + controllers="$1"
> + path="$2"
> +
> + # Check if cgroup v2 is in use
> + if [ ! -d "/sys/fs/cgroup/misc" ]; then
> + # Cgroup v2 logic
> + cgroup_full_path="/sys/fs/cgroup/${path}"
> + echo $$ > "${cgroup_full_path}/cgroup.procs"
> + else
> + # Cgroup v1 logic
> + OLD_IFS="$IFS"
> + IFS=','
> + for controller in $controllers; do
> + cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
> + echo $$ > "${cgroup_full_path}/tasks"
> + done
> + IFS="$OLD_IFS"
> + fi

I think that if you could point me to git v10 and all this I could
then quite easily create test image and see what I get from that.

I will code review the whole thing but this is definitely good
enough to start testing this series properly! Thanks for the
effort with this. The payback from this comes after the feature
is mainline. We have now sort of reference of the usage patterns
and less layers when we need to debug any possible (likely) bugs
in the future.

This is definitely to the right direction. I'm just wondering do
we want to support v1 cgroups or would it make sense support only
v2?

BR, Jarkko

2024-04-02 18:48:54

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen <[email protected]>
wrote:

> On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
>> Make cgroup test scripts ash compatible.
>> Remove cg-tools dependency.
>> Add documentation for functions.
>>
>> Tested with busybox on Ubuntu.
>>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> v2:
>> - Fixes for v2 cgroup
>> - Turn off swapping before memcontrol tests and back on after
>> - Add comments and reformat
>> ---
>> tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++++++
>> .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++++++++++-------
>> .../selftests/sgx/watch_misc_for_tests.sh | 13 +-
>> 3 files changed, 179 insertions(+), 78 deletions(-)
>> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
>>
>> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh
>> b/tools/testing/selftests/sgx/ash_cgexec.sh
>> new file mode 100755
>> index 000000000000..9607784378df
>> --- /dev/null
>> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
>> @@ -0,0 +1,57 @@
>> +#!/usr/bin/env sh
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright(c) 2024 Intel Corporation.
>> +
>> +# Move the current shell process to the specified cgroup
>> +# Arguments:
>> +# $1 - The cgroup controller name, e.g., misc, memory.
>> +# $2 - The path of the cgroup,
>> +# relative to /sys/fs/cgroup for cgroup v2,
>> +# relative to /sys/fs/cgroup/$1 for v1.
>> +move_to_cgroup() {
>> + controllers="$1"
>> + path="$2"
>> +
>> + # Check if cgroup v2 is in use
>> + if [ ! -d "/sys/fs/cgroup/misc" ]; then
>> + # Cgroup v2 logic
>> + cgroup_full_path="/sys/fs/cgroup/${path}"
>> + echo $$ > "${cgroup_full_path}/cgroup.procs"
>> + else
>> + # Cgroup v1 logic
>> + OLD_IFS="$IFS"
>> + IFS=','
>> + for controller in $controllers; do
>> + cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
>> + echo $$ > "${cgroup_full_path}/tasks"
>> + done
>> + IFS="$OLD_IFS"
>> + fi
>
> I think that if you could point me to git v10 and all this I could
> then quite easily create test image and see what I get from that.
>
> I will code review the whole thing but this is definitely good
> enough to start testing this series properly! Thanks for the
> effort with this. The payback from this comes after the feature
> is mainline. We have now sort of reference of the usage patterns
> and less layers when we need to debug any possible (likely) bugs
> in the future.
>
> This is definitely to the right direction. I'm just wondering do
> we want to support v1 cgroups or would it make sense support only
> v2?
> BR, Jarkko
>
I can drop v1. I think most distro now support v2.
Created this branch to host these changes so far:
https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus


Thanks
Haitao

2024-04-03 13:09:15

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> When a cgroup usage reaches its limit, and it is to be charged, i.e.,
> sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
> reclaim pages from its LRU or LRUs of its descendants to make room for
> any new allocations. This patch adds the basic building block for the
> per-cgroup reclamation flow and use it for synchronous reclamation in
> sgx_cgroup_try_charge().

It's better to firstly mention _why_ we need this first:

Currently in the EPC page allocation, the kernel simply fails the allocation
when the current EPC cgroup fails to charge due to its usage reaching limit.
This is not ideal. When that happens, a better way is to reclaim EPC page(s)
from the current EPC cgroup (and/or its descendants) to reduce its usage so the
new allocation can succeed.

Add the basic building blocks to support the per-cgroup reclamation flow ...

>
> First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
> which pages are reclaimed, so it can be reused by both the global and
> cgroup reclaimers. Also return the number of pages attempted, so a
> cgroup reclaimer can use it to track reclamation progress from its
> descendants.

IMHO you are jumping too fast to the implementation details. Better to have
some more background:

"
Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU
list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each
EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC page(s)
from a given EPC cgroup (and its descendants).
"

>
> For the global reclaimer, replace all call sites of sgx_reclaim_pages()
> with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
> just calls sgx_reclaim_pages() with the global LRU passed in.
>
> For cgroup reclamation, implement a basic reclamation flow, encapsulated
> in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
> pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
> at each node passing in the LRU of that node. It keeps track of total
> attempted pages and stops the walk if desired number of pages are
> attempted.

Then it's time to jump to implementation details:

"
Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It
always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages.
Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the
global LRU, and then tries to reclaim these pages at once for better
performance.

Use similar way to implement the "cgroup" variant EPC reclaim, but keep the
implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input,
and return the pages that are "scanned" (but not actually reclaimed); 2) loop
the given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of
the given EPC cgroup, and only moves to its descendants when there's no enough
reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases.
"

Then I think it's better to explain why "alternatives" are not chosen:

"
Note, this simple implementation doesn't _exactly_ mimic the current global EPC
reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN
pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual
reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs
with each being smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to have a
new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the
given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one
batch. But this is unnecessarily complicated at this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to return the
actual "reclaimed" pages, but not "scanned" pages. However this solution also
has cons: <CONS>
"

<CONS>:

I recall you mentioned "unable to control latency of each reclaim" etc, but IIUC
one could be:

This approach may result in higher chance of "reclaiming EPC pages from
descendants but not the root/given EPC cgorup", e.g., when all EPC pages in the
root EPC cgroup are all young while these in its descendants are not. This may
not be desired.

Makes sense?

>
> Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
> a synchronous reclamation is allowed. If the caller allows and cgroup
> usage is at its limit, trigger the synchronous reclamation by calling
> sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
> iterations.

This isn't needed IMHO as you can easily see in the code, and there's no "design
choices" here.

General rule: focus on explaining "why", and "design choices", but not
implementation details, which can be seen in the code.

>
> A later patch will add support for asynchronous reclamation reusing
> sgx_cgroup_reclaim_pages().

Please also mention why "leaving asynchronous reclamation to later patch(es)" is
fine. E.g., it won't break anything I suppose.

(That being said, as mentioned in previous version, I _think_ it's better to
have one patch to implement the "cgroup" variant EPC reclaim function, and
another patch to use it: both "sync" and "async" way. But for the sake of
moving forward, I am fine with the current way if nothing is broken.)

>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V10:
> - Simplify the signature by removing a pointer to nr_to_scan (Kai)
> - Return pages attempted instead of reclaimed as it is really what the
> cgroup caller needs to track progress. This further simplifies the design.
> - Merge patch for exposing sgx_reclaim_pages() with basic synchronous
> reclamation. (Kai)

(As mentioned above, I am not sure I suggested this but anyway...)

> - Shorten names for EPC cgroup functions. (Jarkko)
> - Fix/add comments to justify the design (Kai)
> - Separate out a helper for for addressing single iteration of the loop
> in sgx_cgroup_try_charge(). (Jarkko)
>
> V9:
> - Add comments for static variables. (Jarkko)
>
> V8:
> - Use width of 80 characters in text paragraphs. (Jarkko)
> - Remove alignment for substructure variables. (Jarkko)
>
> V7:
> - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
> function". Do not split the top level function (Kai)
> - Dropped patches 7 and 8 of V6.
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 127 ++++++++++++++++++++++++++-
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +-
> arch/x86/kernel/cpu/sgx/main.c | 45 ++++++----
> arch/x86/kernel/cpu/sgx/sgx.h | 1 +
> 4 files changed, 156 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index a1dd43c195b2..f7a487a29ed1 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -9,16 +9,136 @@
> static struct sgx_cgroup sgx_cg_root;
>
> /**
> - * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
> + * @root: Root of the tree to check
> + *
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.

I don't think this comment (the paragraph starting from "Used") should be here,
but should be put to the code where it applies.
 
Comment what this function does instead.

> + *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + */
> +static bool sgx_cgroup_lru_empty(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_cgroup *sgx_cg;
> + bool ret = true;
> +
> + /*
> + * Caller ensure css_root ref acquired
> + */

/* The caller must ensure ... */

> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&sgx_cg->lru.lock);
> + ret = list_empty(&sgx_cg->lru.reclaimable);
> + spin_unlock(&sgx_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
> + * @root: The root of cgroup tree to reclaim from.
> *
> + * This function performs a pre-order walk in the cgroup tree under the given
> + * root, attempting to reclaim pages at each node until a fixed number of pages
> + * (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of success on
> + * the actual reclamation process. In extreme cases, if all pages in front of
> + * the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
> + * page will actually be reclaimed after walking the whole tree.
> + *
> + * Callers check for the need for reclamation before calling this function. Some
> + * callers may run this function in a loop guarded by some criteria for
> + * triggering reclamation, and call cond_resched() in between iterations to
> + * avoid indefinite blocking.

Ditto IMHO the second paragraph isn't necessary. But anyway.

> + */
> +static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_cgroup *sgx_cg;
> + unsigned int cnt = 0;
> +
> + /* Caller ensure css_root ref acquired */
> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> + rcu_read_unlock();
> +
> + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
> + cnt += sgx_reclaim_pages(&sgx_cg->lru);
> +
> + rcu_read_lock();
> + css_put(pos);
> +
> + if (cnt >= SGX_NR_TO_SCAN)
> + break;
> + }
> +
> + rcu_read_unlock();
> +}
> +
> +static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
> +{
> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
> + return 0;
> +
> + if (sgx_cgroup_lru_empty(epc_cg->cg))
> + return -ENOMEM;
> +
> + if (signal_pending(current))
> + return -ERESTARTSYS;
> +
> + return -EBUSY;
> +}
> +
> +/**
> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
> * @sgx_cg: The EPC cgroup to be charged for the page.
> + * @reclaim: Whether or not synchronous EPC reclaim is allowed.
> * Return:
> * * %0 - If successfully charged.
> * * -errno - for failures.
> */
> -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
> {
> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> + int ret;
> +
> + for (;;) {
> + ret = __sgx_cgroup_try_charge(sgx_cg);
> +
> + if (ret != -EBUSY)
> + return ret;
> +
> + if (reclaim == SGX_NO_RECLAIM)
> + return -ENOMEM;
> +
> + sgx_cgroup_reclaim_pages(sgx_cg->cg);
> + cond_resched();
> + }
> +
> + return 0;
> }
>
> /**
> @@ -50,6 +170,7 @@ const struct misc_res_ops sgx_cgroup_ops = {
>
> static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
> {
> + sgx_lru_init(&sgx_cg->lru);
> cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
> sgx_cg->cg = cg;
> }
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index 8f794e23fad6..f62dce0cac51 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void)
>
> static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
>
> -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim r)

Is the @r here intentional for shorter typing?

[...]

> @@ -572,7 +583,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> int ret;
>
> sgx_cg = sgx_get_current_cg();
> - ret = sgx_cgroup_try_charge(sgx_cg);
> + ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
> if (ret) {
> sgx_put_cg(sgx_cg);
> return ERR_PTR(ret);
> @@ -604,7 +615,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> * Need to do a global reclamation if cgroup was not full but free
> * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> */
> - sgx_reclaim_pages();
> + sgx_reclaim_pages_global();
> cond_resched();
> }

I wish we could put the result of discussion around "per-cgroup reclaim" vs
"global reclaim" when try_charge() succeeds but still fails to allocate to the
changelog:

https://lore.kernel.org/lkml/[email protected]/

But perhaps it is just me that thinks this better to be clarified in changelog,
so up to you.

(btw, looks another reason to split the "cgroup" EPC reclaim function out as a
separate patch, but again, up to you.)





2024-04-03 15:35:29

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v2] selftests/sgx: Improve cgroup test scripts

On Tue Apr 2, 2024 at 8:31 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 02:43:25 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Tue Apr 2, 2024 at 4:42 AM EEST, Haitao Huang wrote:
> >> Make cgroup test scripts ash compatible.
> >> Remove cg-tools dependency.
> >> Add documentation for functions.
> >>
> >> Tested with busybox on Ubuntu.
> >>
> >> Signed-off-by: Haitao Huang <[email protected]>
> >> ---
> >> v2:
> >> - Fixes for v2 cgroup
> >> - Turn off swapping before memcontrol tests and back on after
> >> - Add comments and reformat
> >> ---
> >> tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++++++
> >> .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++++++++++-------
> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 +-
> >> 3 files changed, 179 insertions(+), 78 deletions(-)
> >> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh
> >> b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> new file mode 100755
> >> index 000000000000..9607784378df
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/ash_cgexec.sh
> >> @@ -0,0 +1,57 @@
> >> +#!/usr/bin/env sh
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +# Copyright(c) 2024 Intel Corporation.
> >> +
> >> +# Move the current shell process to the specified cgroup
> >> +# Arguments:
> >> +# $1 - The cgroup controller name, e.g., misc, memory.
> >> +# $2 - The path of the cgroup,
> >> +# relative to /sys/fs/cgroup for cgroup v2,
> >> +# relative to /sys/fs/cgroup/$1 for v1.
> >> +move_to_cgroup() {
> >> + controllers="$1"
> >> + path="$2"
> >> +
> >> + # Check if cgroup v2 is in use
> >> + if [ ! -d "/sys/fs/cgroup/misc" ]; then
> >> + # Cgroup v2 logic
> >> + cgroup_full_path="/sys/fs/cgroup/${path}"
> >> + echo $$ > "${cgroup_full_path}/cgroup.procs"
> >> + else
> >> + # Cgroup v1 logic
> >> + OLD_IFS="$IFS"
> >> + IFS=','
> >> + for controller in $controllers; do
> >> + cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
> >> + echo $$ > "${cgroup_full_path}/tasks"
> >> + done
> >> + IFS="$OLD_IFS"
> >> + fi
> >
> > I think that if you could point me to git v10 and all this I could
> > then quite easily create test image and see what I get from that.
> >
> > I will code review the whole thing but this is definitely good
> > enough to start testing this series properly! Thanks for the
> > effort with this. The payback from this comes after the feature
> > is mainline. We have now sort of reference of the usage patterns
> > and less layers when we need to debug any possible (likely) bugs
> > in the future.
> >
> > This is definitely to the right direction. I'm just wondering do
> > we want to support v1 cgroups or would it make sense support only
> > v2?
> > BR, Jarkko
> >
> I can drop v1. I think most distro now support v2.
> Created this branch to host these changes so far:
> https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v10_plus

Thanks!

I'll point my build to that, make a test image and report the
results.

BR, Jarkko

2024-04-04 11:20:02

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
>  
>  void sgx_cgroup_init(void)
>  {
> + sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, WQ_UNBOUND_MAX_ACTIVE);
> +
> + /* All Cgroups functionalities are disabled. */
> + if (WARN_ON(!sgx_cg_wq))
> + return;
> +

I don't think you should WARN(), because it's not a kernel bug or similar. Just
print a message saying EPC cgroup is disabled and move on.

if (!sgx_cg_wq) {
pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
return;
}

2024-04-04 20:28:04

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v10 09/14] x86/sgx: Implement async reclamation for cgroup

On Thu, 04 Apr 2024 06:16:54 -0500, Huang, Kai <[email protected]> wrote:

> On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
>> void sgx_cgroup_init(void)
>> {
>> + sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE,
>> WQ_UNBOUND_MAX_ACTIVE);
>> +
>> + /* All Cgroups functionalities are disabled. */
>> + if (WARN_ON(!sgx_cg_wq))
>> + return;
>> +
>
> I don't think you should WARN(), because it's not a kernel bug or
> similar. Just
> print a message saying EPC cgroup is disabled and move on.
>
> if (!sgx_cg_wq) {
> pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
> return;
> }


Sure
Thanks
Haitao

2024-04-04 21:32:39

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

Hi Kai,
Thanks for your suggestions. I'll adopt most of it as it.
Minor details below.

On Wed, 03 Apr 2024 08:08:28 -0500, Huang, Kai <[email protected]> wrote:

> On Wed, 2024-03-27 at 17:22 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> When a cgroup usage reaches its limit, and it is to be charged, i.e.,
>> sgx_cgroup_try_charge() called for new allocations, the cgroup needs to
>> reclaim pages from its LRU or LRUs of its descendants to make room for
>> any new allocations. This patch adds the basic building block for the
>> per-cgroup reclamation flow and use it for synchronous reclamation in
>> sgx_cgroup_try_charge().
>
> It's better to firstly mention _why_ we need this first:
>
> Currently in the EPC page allocation, the kernel simply fails the
> allocation
> when the current EPC cgroup fails to charge due to its usage reaching
> limit.
> This is not ideal. When that happens, a better way is to reclaim EPC
> page(s)
> from the current EPC cgroup (and/or its descendants) to reduce its usage
> so the
> new allocation can succeed.
>
> Add the basic building blocks to support the per-cgroup reclamation flow
> ...
>

ok

>>
>> First, modify sgx_reclaim_pages() to let callers to pass in the LRU from
>> which pages are reclaimed, so it can be reused by both the global and
>> cgroup reclaimers. Also return the number of pages attempted, so a
>> cgroup reclaimer can use it to track reclamation progress from its
>> descendants.
>
> IMHO you are jumping too fast to the implementation details. Better to
> have
> some more background:
>
> "
> Currently the kernel only has one place to reclaim EPC pages: the global
> EPC LRU
> list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for
> each
> EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC
> page(s)
> from a given EPC cgroup (and its descendants).
> "
>

ok

>>
>> For the global reclaimer, replace all call sites of sgx_reclaim_pages()
>> with calls to a newly created wrapper, sgx_reclaim_pages_global(), which
>> just calls sgx_reclaim_pages() with the global LRU passed in.
>>
>> For cgroup reclamation, implement a basic reclamation flow, encapsulated
>> in the top-level function, sgx_cgroup_reclaim_pages(). It performs a
>> pre-order walk on a given cgroup subtree, and calls sgx_reclaim_pages()
>> at each node passing in the LRU of that node. It keeps track of total
>> attempted pages and stops the walk if desired number of pages are
>> attempted.
>
> Then it's time to jump to implementation details:
>
> "
> Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
> It
> always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages.
> Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from
> the
> global LRU, and then tries to reclaim these pages at once for better
> performance.
>
> Use similar way to implement the "cgroup" variant EPC reclaim, but keep
> the
> implementation simple: 1) change sgx_reclaim_pages() to take an LRU as
> input,
> and return the pages that are "scanned" (but not actually reclaimed); 2)
> loop
> the given EPC cgroup and its descendants and do the new
> sgx_reclaim_pages()
> until SGX_NR_TO_SCAN pages are "scanned".
>
> This implementation always tries to reclaim SGX_NR_TO_SCAN pages from
> the LRU of
> the given EPC cgroup, and only moves to its descendants when there's no
> enough
> reclaimable EPC pages to "scan" in its LRU. It should be enough for
> most cases.
> "
>

ok

> Then I think it's better to explain why "alternatives" are not chosen:
>
> "
> Note, this simple implementation doesn't _exactly_ mimic the current
> global EPC
> reclaim (which always tries to do the actual reclaim in batch of
> SGX_NR_TO_SCAN
> pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the
> actual
> reclaim of EPC pages will be split into smaller batches _across_
> multiple LRUs
> with each being smaller than SGX_NR_TO_SCAN pages.
>
> A more precise way to mimic the current global EPC reclaim would be to
> have a
> new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_
> the
> given EPC cgroup _AND_ its descendants, and then do the actual reclaim
> in one
> batch. But this is unnecessarily complicated at this stage.
>
> Alternatively, the current sgx_reclaim_pages() could be changed to
> return the
> actual "reclaimed" pages, but not "scanned" pages. However this
> solution also
> has cons: <CONS>
> "
>
> <CONS>:
>
> I recall you mentioned "unable to control latency of each reclaim" etc,
> but IIUC
> one could be:
>
> This approach may result in higher chance of "reclaiming EPC pages from
> descendants but not the root/given EPC cgorup", e.g., when all EPC pages
> in the
> root EPC cgroup are all young while these in its descendants are not.
> This may
> not be desired.
>
> Makes sense?
>

Agree with the flow.
The con is that this function may block too long that is unacceptable for
some callers like synchronous flow which only needs some minimal (e.g.,
one page to pass try_charge()) to make forward progress. Convention is to
call this function loops to ensure caller's condition is met, i.e., the
way the original sgx_reclaim_pages() is called in existing code.

>>
>> Finally, pass a parameter to sgx_cgroup_try_charge() to indicate whether
>> a synchronous reclamation is allowed. If the caller allows and cgroup
>> usage is at its limit, trigger the synchronous reclamation by calling
>> sgx_cgroup_reclaim_pages() in a loop with cond_resched() in between
>> iterations.
>
> This isn't needed IMHO as you can easily see in the code, and there's no
> "design
> choices" here.
>
> General rule: focus on explaining "why", and "design choices", but not
> implementation details, which can be seen in the code.
>
>>
>> A later patch will add support for asynchronous reclamation reusing
>> sgx_cgroup_reclaim_pages().
>
> Please also mention why "leaving asynchronous reclamation to later
> patch(es)" is
> fine. E.g., it won't break anything I suppose.
>

Right. Pages are still in the global list at the moment and only global
reclaiming is active until the "turn on" patch. Separating out is really
just for the purpose of review IMHO.

> (That being said, as mentioned in previous version, I _think_ it's
> better to
> have one patch to implement the "cgroup" variant EPC reclaim function,
> and
> another patch to use it: both "sync" and "async" way. But for the sake
> of
> moving forward, I am fine with the current way if nothing is broken.)
>
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V10:
>> - Simplify the signature by removing a pointer to nr_to_scan (Kai)
>> - Return pages attempted instead of reclaimed as it is really what the
>> cgroup caller needs to track progress. This further simplifies the
>> design.
>> - Merge patch for exposing sgx_reclaim_pages() with basic synchronous
>> reclamation. (Kai)
>
> (As mentioned above, I am not sure I suggested this but anyway...)
>

Sorry above did not characterize your suggestion accurately.
I'll keep the sync reclaim flow in for now unless strong objections.

>> - Shorten names for EPC cgroup functions. (Jarkko)
>> - Fix/add comments to justify the design (Kai)
>> - Separate out a helper for for addressing single iteration of the loop
>> in sgx_cgroup_try_charge(). (Jarkko)
>>
>> V9:
>> - Add comments for static variables. (Jarkko)
>>
>> V8:
>> - Use width of 80 characters in text paragraphs. (Jarkko)
>> - Remove alignment for substructure variables. (Jarkko)
>>
>> V7:
>> - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC
>> reclaim
>> function". Do not split the top level function (Kai)
>> - Dropped patches 7 and 8 of V6.
>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> ---
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 127 ++++++++++++++++++++++++++-
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +-
>> arch/x86/kernel/cpu/sgx/main.c | 45 ++++++----
>> arch/x86/kernel/cpu/sgx/sgx.h | 1 +
>> 4 files changed, 156 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> index a1dd43c195b2..f7a487a29ed1 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -9,16 +9,136 @@
>> static struct sgx_cgroup sgx_cg_root;
>>
>> /**
>> - * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
>> + * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its
>> LRUs
>> + * @root: Root of the tree to check
>> + *
>> + * Used to avoid livelocks due to a cgroup having a non-zero charge
>> count but
>> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be
>> released or
>> + * because all pages in the cgroup are unreclaimable.
>
> I don't think this comment (the paragraph starting from "Used") should
> be here,
> but should be put to the code where it applies.
> Comment what this function does instead.
>

Ok. I can drop it if no objections from others.
This was in from the beginning and I saw couple of other examples in
existing code that explains expected usage. So I thought we were supposed
to do that.

>> + *
>> + * Return: %true if all cgroups under the specified root have empty
>> LRU lists.
>> + */
>> +static bool sgx_cgroup_lru_empty(struct misc_cg *root)
>> +{
>> + struct cgroup_subsys_state *css_root;
>> + struct cgroup_subsys_state *pos;
>> + struct sgx_cgroup *sgx_cg;
>> + bool ret = true;
>> +
>> + /*
>> + * Caller ensure css_root ref acquired
>> + */
>
> /* The caller must ensure ... */
>

ok

>> + css_root = &root->css;
>> +
>> + rcu_read_lock();
>> + css_for_each_descendant_pre(pos, css_root) {
>> + if (!css_tryget(pos))
>> + break;
>> +
>> + rcu_read_unlock();
>> +
>> + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
>> +
>> + spin_lock(&sgx_cg->lru.lock);
>> + ret = list_empty(&sgx_cg->lru.reclaimable);
>> + spin_unlock(&sgx_cg->lru.lock);
>> +
>> + rcu_read_lock();
>> + css_put(pos);
>> + if (!ret)
>> + break;
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + return ret;
>> +}
>> +
>> +/**
>> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
>> + * @root: The root of cgroup tree to reclaim from.
>> *
>> + * This function performs a pre-order walk in the cgroup tree under
>> the given
>> + * root, attempting to reclaim pages at each node until a fixed number
>> of pages
>> + * (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of
>> success on
>> + * the actual reclamation process. In extreme cases, if all pages in
>> front of
>> + * the LRUs are recently accessed, i.e., considered "too young" to
>> reclaim, no
>> + * page will actually be reclaimed after walking the whole tree.
>> + *
>> + * Callers check for the need for reclamation before calling this
>> function. Some
>> + * callers may run this function in a loop guarded by some criteria for
>> + * triggering reclamation, and call cond_resched() in between
>> iterations to
>> + * avoid indefinite blocking.
>
> Ditto IMHO the second paragraph isn't necessary. But anyway.
>

Same as above, I can drop if no objections.

>> + */
>> +static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
>> +{
>> + struct cgroup_subsys_state *css_root;
>> + struct cgroup_subsys_state *pos;
>> + struct sgx_cgroup *sgx_cg;
>> + unsigned int cnt = 0;
>> +
>> + /* Caller ensure css_root ref acquired */
>> + css_root = &root->css;
>> +
>> + rcu_read_lock();
>> + css_for_each_descendant_pre(pos, css_root) {
>> + if (!css_tryget(pos))
>> + break;
>> + rcu_read_unlock();
>> +
>> + sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
>> + cnt += sgx_reclaim_pages(&sgx_cg->lru);
>> +
>> + rcu_read_lock();
>> + css_put(pos);
>> +
>> + if (cnt >= SGX_NR_TO_SCAN)
>> + break;
>> + }
>> +
>> + rcu_read_unlock();
>> +}
>> +
>> +static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
>> +{
>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
>> + return 0;
>> +
>> + if (sgx_cgroup_lru_empty(epc_cg->cg))
>> + return -ENOMEM;
>> +
>> + if (signal_pending(current))
>> + return -ERESTARTSYS;
>> +
>> + return -EBUSY;
>> +}
>> +
>> +/**
>> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
>> * @sgx_cg: The EPC cgroup to be charged for the page.
>> + * @reclaim: Whether or not synchronous EPC reclaim is allowed.
>> * Return:
>> * * %0 - If successfully charged.
>> * * -errno - for failures.
>> */
>> -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
>> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim
>> reclaim)
>> {
>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
>> + int ret;
>> +
>> + for (;;) {
>> + ret = __sgx_cgroup_try_charge(sgx_cg);
>> +
>> + if (ret != -EBUSY)
>> + return ret;
>> +
>> + if (reclaim == SGX_NO_RECLAIM)
>> + return -ENOMEM;
>> +
>> + sgx_cgroup_reclaim_pages(sgx_cg->cg);
>> + cond_resched();
>> + }
>> +
>> + return 0;
>> }
>>
>> /**
>> @@ -50,6 +170,7 @@ const struct misc_res_ops sgx_cgroup_ops = {
>>
>> static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup
>> *sgx_cg)
>> {
>> + sgx_lru_init(&sgx_cg->lru);
>> cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
>> sgx_cg->cg = cg;
>> }
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> index 8f794e23fad6..f62dce0cac51 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -20,7 +20,7 @@ static inline struct sgx_cgroup
>> *sgx_get_current_cg(void)
>>
>> static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
>>
>> -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
>> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,
>> enum sgx_reclaim r)
>
> Is the @r here intentional for shorter typing?
>

yes :-)
Will speel out to make it consistent if that's the concern.

> [...]
>
>> @@ -572,7 +583,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> *owner, enum sgx_reclaim reclaim)
>> int ret;
>>
>> sgx_cg = sgx_get_current_cg();
>> - ret = sgx_cgroup_try_charge(sgx_cg);
>> + ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
>> if (ret) {
>> sgx_put_cg(sgx_cg);
>> return ERR_PTR(ret);
>> @@ -604,7 +615,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> *owner, enum sgx_reclaim reclaim)
>> * Need to do a global reclamation if cgroup was not full but free
>> * physical pages run out, causing __sgx_alloc_epc_page() to fail.
>> */
>> - sgx_reclaim_pages();
>> + sgx_reclaim_pages_global();
>> cond_resched();
>> }
>
> I wish we could put the result of discussion around "per-cgroup reclaim"
> vs
> "global reclaim" when try_charge() succeeds but still fails to allocate
> to the
> changelog:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> But perhaps it is just me that thinks this better to be clarified in
> changelog,
> so up to you.
>

Will add that.

> (btw, looks another reason to split the "cgroup" EPC reclaim function
> out as a
> separate patch, but again, up to you.)
>
>
Thanks
Haitao

2024-04-05 01:24:19

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

On Thu, 28 Mar 2024 07:53:45 -0500, Huang, Kai <[email protected]> wrote:

>
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -0,0 +1,74 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright(c) 2022 Intel Corporation.
>
> It's 2024 now.
>
> And looks you need to use C style comment for /* Copyright ... */, after
> looking
> at some other C files.
>
Ok, will update years and use C style.

>> +
>> +#include <linux/atomic.h>
>> +#include <linux/kernel.h>
>> +#include "epc_cgroup.h"
>> +
>> +/* The root SGX EPC cgroup */
>> +static struct sgx_cgroup sgx_cg_root;
>> +
>> +/**
>> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
>> + *
>> + * @sgx_cg: The EPC cgroup to be charged for the page.
>> + * Return:
>> + * * %0 - If successfully charged.
>> + * * -errno - for failures.
>> + */
>> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
>> +{
>> + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
>> +}
>> +
>> +/**
>> + * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
>> + * @sgx_cg: The charged sgx cgroup
>> + */
>> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
>> +{
>> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
>> +}
>> +
>> +static void sgx_cgroup_free(struct misc_cg *cg)
>> +{
>> + struct sgx_cgroup *sgx_cg;
>> +
>> + sgx_cg = sgx_cgroup_from_misc_cg(cg);
>> + if (!sgx_cg)
>> + return;
>> +
>> + kfree(sgx_cg);
>> +}
>> +
>> +static int sgx_cgroup_alloc(struct misc_cg *cg);
>
> Again, this declaration can be removed if you move the below structure
> ...
>
>> +
>> +const struct misc_res_ops sgx_cgroup_ops = {
>> + .alloc = sgx_cgroup_alloc,
>> + .free = sgx_cgroup_free,
>> +};
>> +
>> +static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup
>> *sgx_cg)
>> +{
>> + cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
>> + sgx_cg->cg = cg;
>> +}
>> +
>> +static int sgx_cgroup_alloc(struct misc_cg *cg)
>> +{
>> + struct sgx_cgroup *sgx_cg;
>> +
>> + sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
>> + if (!sgx_cg)
>> + return -ENOMEM;
>> +
>> + sgx_cgroup_misc_init(cg, sgx_cg);
>> +
>> + return 0;
>> +}
>
> ... here.
>

yes, thanks

>> +
>> +void sgx_cgroup_init(void)
>> +{
>> + misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
>> + sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
>> +}
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> new file mode 100644
>> index 000000000000..8f794e23fad6
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -0,0 +1,70 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2022 Intel Corporation. */
>> +#ifndef _SGX_EPC_CGROUP_H_
>> +#define _SGX_EPC_CGROUP_H_
>> +
>> +#include <asm/sgx.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/misc_cgroup.h>
>> +
>> +#include "sgx.h"
>> +
>> +#ifndef CONFIG_CGROUP_SGX_EPC
>
> Nit: add an empty line to make text more breathable.
>

ok

>> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
>> +struct sgx_cgroup;
>> +
>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
>> +
>> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
>> +
>> +static inline void sgx_cgroup_init(void) { }
>> +#else
>
> Nit: I prefer two empty lines before and after the 'else'.
>

ok

>> +struct sgx_cgroup {
>> + struct misc_cg *cg;
>> +};
>> +
>> +static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct
>> misc_cg *cg)
>> +{
>> + return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
>> +}
>> +
>> +/**
>> + * sgx_get_current_cg() - get the EPC cgroup of current process.
>> + *
>> + * Returned cgroup has its ref count increased by 1. Caller must call
>> + * sgx_put_cg() to return the reference.
>> + *
>> + * Return: EPC cgroup to which the current task belongs to.
>> + */
>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>> +{
>> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>> +}
>
> Again, I _think_ you need to check whether get_current_misc_cg() returns
> NULL?
>
> Misc cgroup can be disabled by command line even it is on in the Kconfig.
>
> I am not expert on cgroup, so could you check on this?
>

Good catch. Will add NULL check in sgx_cgroup_from_misc_cg()

>> +
>> +/**
>> + * sgx_put_sgx_cg() - Put the EPC cgroup and reduce its ref count.
>> + * @sgx_cg - EPC cgroup to put.
>> + */
>> +static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
>> +{
>> + if (sgx_cg)
>> + put_misc_cg(sgx_cg->cg);
>> +}
>> +
>> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
>> +void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
>> +void sgx_cgroup_init(void);
>> +
>> +#endif
>> +
>> +#endif /* _SGX_EPC_CGROUP_H_ */
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index d219f14365d4..023af54c1beb 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -6,6 +6,7 @@
>> #include <linux/highmem.h>
>> #include <linux/kthread.h>
>> #include <linux/miscdevice.h>
>> +#include <linux/misc_cgroup.h>
>> #include <linux/node.h>
>> #include <linux/pagemap.h>
>> #include <linux/ratelimit.h>
>> @@ -17,6 +18,7 @@
>> #include "driver.h"
>> #include "encl.h"
>> #include "encls.h"
>> +#include "epc_cgroup.h"
>>
>> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>> static int sgx_nr_epc_sections;
>> @@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct
>> sgx_epc_page *page)
>> */
>> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim
>> reclaim)
>> {
>> + struct sgx_cgroup *sgx_cg;
>> struct sgx_epc_page *page;
>> + int ret;
>> +
>> + sgx_cg = sgx_get_current_cg();
>> + ret = sgx_cgroup_try_charge(sgx_cg);
>> + if (ret) {
>> + sgx_put_cg(sgx_cg);
>> + return ERR_PTR(ret);
>> + }
>>
>> for ( ; ; ) {
>> page = __sgx_alloc_epc_page();
>> @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> *owner, enum sgx_reclaim reclaim)
>> break;
>> }
>>
>> - if (list_empty(&sgx_active_page_list))
>> - return ERR_PTR(-ENOMEM);
>> + if (list_empty(&sgx_active_page_list)) {
>> + page = ERR_PTR(-ENOMEM);
>> + break;
>> + }
>>
>> if (reclaim == SGX_NO_RECLAIM) {
>> page = ERR_PTR(-EBUSY);
>> @@ -580,10 +593,24 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> *owner, enum sgx_reclaim reclaim)
>> break;
>> }
>>
>> + /*
>> + * Need to do a global reclamation if cgroup was not full but free
>> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
>> + */
>
> Again, to me this comment shouldn't be here, because it doesn't add any
> more
> information.
>
> If you can reach here, you have passed the charge(). In fact, I believe
> this
> doesn't matter:
> When you fail to allocate, you just need to reclaim.
>
> Now you only have the global reclamation, thus you need to reclaim from
> it.
>
> Perhaps it is useful when you have per-cgroup LRU list. In that case
> you can
> put this comment there.
>

Ok

>> sgx_reclaim_pages();
>> cond_resched();
>> }
>>
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + if (!IS_ERR(page)) {
>> + WARN_ON_ONCE(page->sgx_cg);
>> + /* sgx_put_cg() in sgx_free_epc_page() */
>> + page->sgx_cg = sgx_cg;
>> + } else {
>> + sgx_cgroup_uncharge(sgx_cg);
>> + sgx_put_cg(sgx_cg);
>> + }
>> +#endif
>
> Again, IMHO having CONFIG_CGROUP_SGX_EPC here is ugly, because it
> doesn't even
> match the try_charge() above, which doesn't have the
> CONFIG_CGROUP_SGX_EPC.
>
> If you add a wrapper in "epc_cgroup.h"
>
Agree. but in sgx.h so sgx_epc_page struct is not exposed in epc_cgroup.h.

> static inline void sgx_epc_page_set_cgroup(struct epc_page *epc_page,
> struct sgx_cgroup *sgx_cg)
> {
> #ifdef CONFIG_CGROUP_SGX_EPC
> epc_page->sgx_cg = sgx_cg;
> #endif
> }
>
> Then I believe the above can be simplified to:
>
> if (!IS_ERR(page)) {
> sgx_epc_page_set_cgroup(page, sgx_cg);
> } else {
> sgx_cgroup_uncharge(sgx_cg);
> sgx_put_cg(sgx_cg);
> }
>
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> wake_up(&ksgxd_waitq);
>>
>> @@ -604,6 +631,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>> struct sgx_numa_node *node = section->node;
>>
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + if (page->sgx_cg) {
>> + sgx_cgroup_uncharge(page->sgx_cg);
>> + sgx_put_cg(page->sgx_cg);
>> + page->sgx_cg = NULL;
>> + }
>> +#endif
>> +
>
> Similarly, how about adding a wrapper in "epc_cgroup.h"
>
> struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
> {
> #ifdef CONFIG_CGROUP_SGX_EPC
> return page->sgx_cg;
> #else
> return NULL;
> #endif
> }
>
> Then this can be:
>
> sgx_cg = sgx_epc_page_get_cgroup(page);
> sgx_cgroup_uncharge(sgx_cg);
> sgx_put_cg(sgx_cg);
> sgx_epc_page_set_cgroup(page, NULL);
>

sure.
>> spin_lock(&node->lock);
>>
>> page->owner = NULL;
>> @@ -643,6 +678,11 @@ static bool __init sgx_setup_epc_section(u64
>> phys_addr, u64 size,
>> section->pages[i].flags = 0;
>> section->pages[i].owner = NULL;
>> section->pages[i].poison = 0;
>> +
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + section->pages[i].sgx_cg = NULL;
>> +#endif
>
> Can use the wrapper too.
>
yes.
> [...]
>
> (Btw, I'll be away for couple of days due to public holiday and will
> review the
> rest starting from late next week).
Thanks
Haitao

2024-04-05 02:57:55

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

On Thu, 2024-04-04 at 20:24 -0500, Haitao Huang wrote:
> > Again, IMHO having CONFIG_CGROUP_SGX_EPC here is ugly, because it 
> > doesn't even
> > match the try_charge() above, which doesn't have the 
> > CONFIG_CGROUP_SGX_EPC.
> >
> > If you add a wrapper in "epc_cgroup.h"
> >
> Agree. but in sgx.h so sgx_epc_page struct is not exposed in epc_cgroup.h.

I am fine with any place that suits.

2024-04-05 03:05:12

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > Please also mention why "leaving asynchronous reclamation to later 
> > patch(es)" is
> > fine.  E.g., it won't break anything I suppose.
> >
>
> Right. Pages are still in the global list at the moment and only global 
> reclaiming is active until the "turn on" patch. Separating out is really 
> just for the purpose of review IMHO.

Sounds good to me. Thanks.

2024-04-05 03:08:09

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, 
> > > enum sgx_reclaim r)
> >
> > Is the @r here intentional for shorter typing?
> >
>
> yes :-)
> Will speel out to make it consistent if that's the concern.

I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case. You
can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed:

static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,
enum sgx_reclaim reclaim)
{
return 0;
}

2024-04-08 12:30:26

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation


> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
> static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
>
> static inline void sgx_cgroup_init(void) { }
> +
> +static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
> +{
> +}
> #else
> struct sgx_cgroup {
> struct misc_cg *cg;
> @@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
>
> int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
> void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
> +bool sgx_cgroup_lru_empty(struct misc_cg *root);
> +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
> +void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm);

Seems the decision to choose whether to implement a stub function for some
function is quite random to me.

My impression is people in general don't like #ifdef in the C file but prefer to
implementing it in the header in some helper function.

I guess you might want to just implement a stub function for each of the 3
functions exposed, so that we can eliminate some #ifdefs in the sgx/main.c (see
below).

> void sgx_cgroup_init(void);
>
> #endif
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 7f92455d957d..68f28ff2d5ef 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru;
>
> static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
> {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (epc_page->sgx_cg)
> + return &epc_page->sgx_cg->lru;
> +
> + /*
> + * This should not happen when cgroup is enabled: Every page belongs
> + * to a cgroup, or the root by default.
> + */
> + WARN_ON_ONCE(1);

In the case MISC cgroup is enabled in Kconfig but disabled by command line, I
think this becomes legal now?

> +#endif
> return &sgx_global_lru;
> }
>
> @@ -42,7 +52,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
> */
> static inline bool sgx_can_reclaim(void)
> {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + return !sgx_cgroup_lru_empty(misc_cg_root());
> +#else
> return !list_empty(&sgx_global_lru.reclaimable);
> +#endif
> }
>

Here you are using #ifdef CONFIG_CGRUP_SGX_EPC, but ...

> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> @@ -404,7 +418,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
>
> static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> {
> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
> + else
> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> }

... here you are using IS_ENABLED(CONFIG_CGROUP_SGX_EPC).

Any reason they are not consistent?

Also, in the case where MISC cgroup is disabled via commandline, I think it
won't work, because misc_cg_root() should be NULL in this case while
IS_ENABLED(CONFIG_CGROUP_SGX_EPC) is true.

>
> /*
> @@ -414,6 +431,16 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> */
> void sgx_reclaim_direct(void)
> {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> + /* Make sure there are some free pages at cgroup level */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
> + sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
> + sgx_put_cg(sgx_cg);
> + }
> +#endif

This #ifdef CONFIG_CGROUP_SGX_EPC can be removed if we implement a stub function
for sgx_cgroup_should_reclaim().

> + /* Make sure there are some free pages at global level */
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> sgx_reclaim_pages_global(current->mm);
> }

2024-04-08 18:04:05

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

On Mon, 08 Apr 2024 07:20:23 -0500, Huang, Kai <[email protected]> wrote:

>
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -28,6 +28,10 @@ static inline int sgx_cgroup_try_charge(struct
>> sgx_cgroup *sgx_cg, enum sgx_recl
>> static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
>>
>> static inline void sgx_cgroup_init(void) { }
>> +
>> +static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root,
>> struct mm_struct *charge_mm)
>> +{
>> +}
>> #else
>> struct sgx_cgroup {
>> struct misc_cg *cg;
>> @@ -65,6 +69,9 @@ static inline void sgx_put_cg(struct sgx_cgroup
>> *sgx_cg)
>>
>> int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim
>> reclaim);
>> void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
>> +bool sgx_cgroup_lru_empty(struct misc_cg *root);
>> +bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
>> +void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct
>> *charge_mm);
>
> Seems the decision to choose whether to implement a stub function for
> some
> function is quite random to me.
>
> My impression is people in general don't like #ifdef in the C file but
> prefer to
> implementing it in the header in some helper function.
>
> I guess you might want to just implement a stub function for each of the
> 3
> functions exposed, so that we can eliminate some #ifdefs in the
> sgx/main.c (see
> below).
>
>> void sgx_cgroup_init(void);
>>
>> #endif
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index 7f92455d957d..68f28ff2d5ef 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -34,6 +34,16 @@ static struct sgx_epc_lru_list sgx_global_lru;
>>
>> static inline struct sgx_epc_lru_list *sgx_lru_list(struct
>> sgx_epc_page *epc_page)
>> {
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + if (epc_page->sgx_cg)
>> + return &epc_page->sgx_cg->lru;
>> +
>> + /*
>> + * This should not happen when cgroup is enabled: Every page belongs
>> + * to a cgroup, or the root by default.
>> + */
>> + WARN_ON_ONCE(1);
>
> In the case MISC cgroup is enabled in Kconfig but disabled by command
> line, I
> think this becomes legal now?
>

I'm not sure actually. Never saw the warning even if I set
"cgroup_disable=misc" in commandlibe. Tried both v1 and v2. Anyway, I
think we can remove this warning and we handle the NULL sgx_cg now.

>> +#endif
>> return &sgx_global_lru;
>> }
>>
>> @@ -42,7 +52,11 @@ static inline struct sgx_epc_lru_list
>> *sgx_lru_list(struct sgx_epc_page *epc_pag
>> */
>> static inline bool sgx_can_reclaim(void)
>> {
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + return !sgx_cgroup_lru_empty(misc_cg_root());
>> +#else
>> return !list_empty(&sgx_global_lru.reclaimable);
>> +#endif
>> }
>>
>
> Here you are using #ifdef CONFIG_CGRUP_SGX_EPC, but ...
>
>> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>> @@ -404,7 +418,10 @@ static bool sgx_should_reclaim(unsigned long
>> watermark)
>>
>> static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>> {
>> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> + if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
>> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
>> + else
>> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> }
>
> ... here you are using IS_ENABLED(CONFIG_CGROUP_SGX_EPC).
>
> Any reason they are not consistent?

I was hesitant to expose sgx_global_lru needed for implementing
sgx_cgroup_lru_empty() stub which would also be a random decision and
overkill to just remove couple of #ifdefs in short functions.

>
> Also, in the case where MISC cgroup is disabled via commandline, I think
> it
> won't work, because misc_cg_root() should be NULL in this case while
> IS_ENABLED(CONFIG_CGROUP_SGX_EPC) is true.
>
>>

The misc root cgroup is a static similar to sgx_cg_root. So
misc_cg_root() won't be NULL
However, based on how css_misc() was check NULL, I suppose
sgx_get_current_cg() may be NULL when cgroup is disabled (again not 100%
sure but we handle it anyway)

>> /*
>> @@ -414,6 +431,16 @@ static void sgx_reclaim_pages_global(struct
>> mm_struct *charge_mm)
>> */
>> void sgx_reclaim_direct(void)
>> {
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
>> +
>> + /* Make sure there are some free pages at cgroup level */
>> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
>> + sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
>> + sgx_put_cg(sgx_cg);
>> + }
>> +#endif
>
> This #ifdef CONFIG_CGROUP_SGX_EPC can be removed if we implement a stub
> function
> for sgx_cgroup_should_reclaim().
>
Yes.
>> + /* Make sure there are some free pages at global level */
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> sgx_reclaim_pages_global(current->mm);
>> }
>

Thanks
Haitao

2024-04-08 22:37:48

by Huang, Kai

[permalink] [raw]
Subject: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation



On 9/04/2024 6:03 am, Haitao Huang wrote:
>>>
>
> The misc root cgroup is a static similar to sgx_cg_root. So
> misc_cg_root()  won't be NULL
> However, based on how css_misc() was check NULL, I suppose
> sgx_get_current_cg() may be NULL when cgroup is disabled (again not 100%
> sure but we handle it anyway)

Could you help to check? Sorry I am busy on something else thus won't
be able to do any actual check.

2024-04-09 04:23:41

by Haitao Huang

[permalink] [raw]
Subject: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

On Mon, 08 Apr 2024 17:37:10 -0500, Huang, Kai <[email protected]> wrote:

>
>
> On 9/04/2024 6:03 am, Haitao Huang wrote:
>>>>
>> The misc root cgroup is a static similar to sgx_cg_root. So
>> misc_cg_root() won't be NULL
>> However, based on how css_misc() was check NULL, I suppose
>> sgx_get_current_cg() may be NULL when cgroup is disabled (again not
>> 100% sure but we handle it anyway)
>
> Could you help to check? Sorry I am busy on something else thus won't
> be able to do any actual check.
>
It's always non-NULL based on testing.

It's hard for me to say definitely by reading the code. But IIUC
cgroup_disable command-line only blocks operations in /sys/fs/cgroup so
user space can't set up controllers and config limits, etc., for the
diasabled ones. Each task->cgroups would still have a non-NULL pointer to
the static root object for each cgroup that is enabled by KConfig, so
get_current_misc_cg() thus sgx_get_current_cg() should not return NULL
regardless 'cgroup_disable=misc'.

Maybe @Michal or @tj can confirm?

Thanks
Haitao


2024-04-09 09:03:43

by Michal Koutný

[permalink] [raw]
Subject: Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang <[email protected]> wrote:
> It's always non-NULL based on testing.
>
> It's hard for me to say definitely by reading the code. But IIUC
> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and config limits, etc., for the diasabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by KConfig, so
> get_current_misc_cg() thus sgx_get_current_cg() should not return NULL
> regardless 'cgroup_disable=misc'.
>
> Maybe @Michal or @tj can confirm?

The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to intepret it.

HTH,
Michal


Attachments:
(No filename) (1.29 kB)
signature.asc (235.00 B)
Download all attachments

2024-04-09 15:36:58

by Haitao Huang

[permalink] [raw]
Subject: Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutn? <[email protected]> wrote:

> On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang
> <[email protected]> wrote:
>> It's always non-NULL based on testing.
>>
>> It's hard for me to say definitely by reading the code. But IIUC
>> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so
>> user
>> space can't set up controllers and config limits, etc., for the
>> diasabled
>> ones. Each task->cgroups would still have a non-NULL pointer to the
>> static
>> root object for each cgroup that is enabled by KConfig, so
>> get_current_misc_cg() thus sgx_get_current_cg() should not return NULL
>> regardless 'cgroup_disable=misc'.
>>
>> Maybe @Michal or @tj can confirm?
>
> The current implementation creates root css object (see cgroup_init(),
> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
> I.e. it will look like all tasks are members of root cgroup wrt given
> controller permanently and controller attribute files won't exist.
>
> (It is up to the controller implementation to do further optimization
> based on the boot-time disablement (e.g. see uses of
> mem_cgroup_disabled()). Not sure if this is useful for misc controller)
>
> As for the WARN_ON(1), taking example from memcg -- NULL is best
> synonymous with root. It's a judgement call which of the values to store
> and when to intepret it.
>
> HTH,
> Michal
Thanks for the info.

The way I see it, misc does not have special handling like memcg so every
task at least belong to the root(default) group even if it's disabled by
command line parameter. So we would not get NULL from
get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for now as a
reminder in case misc do have custom support for disabling in future.

Thanks
Haitao

2024-04-10 21:28:46

by Haitao Huang

[permalink] [raw]
Subject: Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

On Tue, 09 Apr 2024 10:34:06 -0500, Haitao Huang
<[email protected]> wrote:

> On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutn? <[email protected]>
> wrote:
>
>> On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang
>> <[email protected]> wrote:
>>> It's always non-NULL based on testing.
>>>
>>> It's hard for me to say definitely by reading the code. But IIUC
>>> cgroup_disable command-line only blocks operations in /sys/fs/cgroup
>>> so user
>>> space can't set up controllers and config limits, etc., for the
>>> diasabled
>>> ones. Each task->cgroups would still have a non-NULL pointer to the
>>> static
>>> root object for each cgroup that is enabled by KConfig, so
>>> get_current_misc_cg() thus sgx_get_current_cg() should not return NULL
>>> regardless 'cgroup_disable=misc'.
>>>
>>> Maybe @Michal or @tj can confirm?
>>
>> The current implementation creates root css object (see cgroup_init(),
>> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
>> I.e. it will look like all tasks are members of root cgroup wrt given
>> controller permanently and controller attribute files won't exist.
>>
>> (It is up to the controller implementation to do further optimization
>> based on the boot-time disablement (e.g. see uses of
>> mem_cgroup_disabled()). Not sure if this is useful for misc controller.)
>>
>> As for the WARN_ON(1), taking example from memcg -- NULL is best
>> synonymous with root. It's a judgement call which of the values to store
>> and when to intepret it.
>>
>> HTH,
>> Michal
> Thanks for the info.
>
> The way I see it, misc does not have special handling like memcg so
> every task at least belong to the root(default) group even if it's
> disabled by command line parameter. So we would not get NULL from
> get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for now as a
> reminder in case misc do have custom support for disabling in future.
>
Actually I think it makes more sense just add some comments instead of
WARN.
That's what I did in v11 now.

Thanks
Haitao

2024-04-13 20:57:02

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH v10 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

On Fri Apr 5, 2024 at 6:07 AM EEST, Huang, Kai wrote:
> On Thu, 2024-04-04 at 12:05 -0500, Haitao Huang wrote:
> > > > -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> > > > +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, 
> > > > enum sgx_reclaim r)
> > >
> > > Is the @r here intentional for shorter typing?
> > >
> >
> > yes :-)
> > Will speel out to make it consistent if that's the concern.
>
> I kinda prefer the full name to match the CONFIG_CGROUP_SGX_EPC on case. You
> can put the 'enum sgx_reclaim reclaim' parameter into the new line if needed:

Why don't cgroups for SGX get always enabled when SGX and
cgroups support are enabled?

BR, Jarkko