SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
This patchset implements support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
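As a usage sketch of the interface described above (hedged: paths assume
cgroup v2 mounted at /sys/fs/cgroup with CONFIG_CGROUP_SGX_EPC enabled,
and the cgroup name "sgx_test" is arbitrary):

```shell
# Create a cgroup and enable the misc controller for its subtree.
echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/sgx_test

# Cap the cgroup's total EPC usage at 64 MiB (values are in bytes).
echo "sgx_epc $((64 * 1024 * 1024))" > /sys/fs/cgroup/sgx_test/misc.max

# Inspect current EPC usage and how often the limit was hit.
grep sgx_epc /sys/fs/cgroup/sgx_test/misc.current
grep sgx_epc /sys/fs/cgroup/sgx_test/misc.events
```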
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted for by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
To support this, the implementation tracks reclaimable EPC pages in a
separate LRU list in each cgroup; more details on and justification for
this design follow below.
Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------
tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or to add new
LRUs. The result: add a new LRU for each cgroup and scan those instead.
Replace the existing global LRU with the root cgroup's LRU (only when this new
support is compiled in, obviously).
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
Even if unspeakably slow reclaim were accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
Unreclaimable Enclave Pages
---------------------------
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata for reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach: it does
not track these pages in per-cgroup LRUs and considers them unreclaimable.
Their allocation is still counted toward the usage of the owning cgroup
and is subject to the cgroup's EPC limits.
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for misc controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
and the limit is only enforced at the time of a new EPC allocation request.
When a cgroup hits its limit but nothing is left in the LRUs of its
subtree, i.e., there is nothing to reclaim in the cgroup, any new attempt
to allocate EPC within that cgroup fails with ENOMEM.
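As an illustrative sketch of this behavior (hedged: the cgroup name
"sgx_test" is arbitrary, "test_sgx" stands in for any enclave workload, and
the paths assume cgroup v2 mounted at /sys/fs/cgroup with the misc
controller enabled in the parent):

```shell
# Give the cgroup a limit far too small to back a new enclave.
echo "sgx_epc 4096" > /sys/fs/cgroup/sgx_test/misc.max

# Move the shell into the cgroup and launch an enclave workload. With
# nothing reclaimable in the subtree, the EPC allocation fails with
# ENOMEM; enclave creation fails rather than existing enclaves being
# force-killed.
echo $$ > /sys/fs/cgroup/sgx_test/cgroup.procs
./test_sgx || echo "enclave creation failed as expected"
```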
Unreclaimable Guest VM EPC Pages
--------------------------------
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore, an EPC cgroup also treats
them as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable is left within the cgroup. The virtual EPC driver translates an
ENOMEM resulting from an EPC allocation request into a SIGBUS to the user
process, exactly the same way it handles the host running out of physical
EPC.
This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-13], simplified the
implementation and design, added selftest scripts, and fixed some
stability issues found in testing.
Thanks to all for the review/test/tags/feedback provided on the previous
versions.
I appreciate your further reviewing/testing and providing tags if
appropriate.
---
V9:
- Add comments for static variables outside functions. (Jarkko)
- Remove unnecessary ifs. (Tim)
- Add more Reviewed-By: tags from Jarkko and TJ.
V8:
- Style fixes. (Jarkko)
- Abstract _misc_res_free/alloc() (Jarkko)
- Remove unneeded NULL checks. (Jarkko)
V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)
[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/[email protected]/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section "Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/[email protected]/
[8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
[9]v4: https://lore.kernel.org/all/[email protected]/
[10]v5: https://lore.kernel.org/all/[email protected]/
[11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
[12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
Haitao Huang (2):
x86/sgx: Charge mem_cgroup for per-cgroup reclamation
selftests/sgx: Add scripts for EPC cgroup testing
Kristen Carlson Accardi (10):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Abstract tracking reclaimable pages in LRU
x86/sgx: Implement EPC reclamation flows for cgroup
x86/sgx: Add EPC reclamation in cgroup try_charge()
x86/sgx: Abstract check for global reclaimable pages
x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
x86/sgx: Turn on per-cgroup EPC reclamation
Sean Christopherson (3):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
x86/sgx: Expose sgx_reclaim_pages() for cgroup
Docs/x86/sgx: Add description for cgroup support
Documentation/arch/x86/sgx.rst | 83 ++++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 38 ++-
arch/x86/kernel/cpu/sgx/encl.h | 3 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 276 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 83 ++++++
arch/x86/kernel/cpu/sgx/main.c | 180 +++++++++---
arch/x86/kernel/cpu/sgx/sgx.h | 22 ++
include/linux/misc_cgroup.h | 41 +++
kernel/cgroup/misc.c | 109 +++++--
.../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
13 files changed, 1019 insertions(+), 89 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
base-commit: 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V6:
- Split the original patch into this and the preceding one (Kai)
---
include/linux/misc_cgroup.h | 4 ++++
kernel/cgroup/misc.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..2f6cc3a0ad23 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* SGX EPC memory resource */
+ MISC_CG_RES_SGX_EPC,
#endif
MISC_CG_RES_TYPES
};
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 1f0d8e05b36c..e51d6a45007f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* Intel SGX EPC memory bytes */
+ "sgx_epc",
+#endif
};
/* Root misc cgroup */
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The functions sgx_{mark,unmark}_page_reclaimable() manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page to the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard-coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.
Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard-coded global LRU with a call to this helper.
Subsequent patches will first get the cgroup reclamation flow ready while
pages remain tracked in the global LRU and reclaimed by ksgxd, and only
then make the final switch for sgx_lru_list() to return the per-cgroup
LRU.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 912959c7ecc9..a131aa985c95 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
*/
static struct sgx_epc_lru_list sgx_global_lru;
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+ return &sgx_global_lru;
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
}
/**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
* @page: EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ list_add_tail(&page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
}
/**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
* @page: EPC page
*
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
*
* Return:
* 0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return 0;
}
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
example, in a Kubernetes environment, a user can request a certain EPC
quota for a pod, but the orchestrator cannot enforce the quota to limit
the pod's runtime EPC usage without an EPC cgroup controller.
Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in the SGX driver as the "sgx_epc" resource provider:
- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Set up sgx_epc resource-type-specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.
With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".
For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list and only reclaimed by the global
reclaimer when the total free page count falls below a threshold.
Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)
V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
V6:
- Split the original large patch "Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
arch/x86/Kconfig | 13 +++++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 74 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/main.c | 52 ++++++++++++++++++-
arch/x86/kernel/cpu/sgx/sgx.h | 5 ++
include/linux/misc_cgroup.h | 2 +
7 files changed, 218 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..10c3d1d099b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1947,6 +1947,19 @@ config X86_SGX
If unsure, say N.
+config CGROUP_SGX_EPC
+ bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+ depends on X86_SGX && CGROUP_MISC
+ help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..f4a37ace67d7
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+/* The root EPC cgroup */
+static struct sgx_epc_cgroup epc_cg_root;
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @epc_cg: The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+{
+ return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @epc_cg: The charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+ if (!epc_cg)
+ return;
+
+ kfree(epc_cg);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_res_ops sgx_epc_cgroup_ops = {
+ .alloc = sgx_epc_cgroup_alloc,
+ .free = sgx_epc_cgroup_free,
+};
+
+static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
+{
+ cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+ epc_cg->cg = cg;
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+ if (!epc_cg)
+ return -ENOMEM;
+
+ sgx_epc_misc_init(cg, epc_cg);
+
+ return 0;
+}
+
+void sgx_epc_cgroup_init(void)
+{
+ misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
+ sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..6b664b4c321f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
+{
+ return NULL;
+}
+
+static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+{
+ return 0;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_init(void) { }
+#else
+struct sgx_epc_cgroup {
+ struct misc_cg *cg;
+};
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_epc_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_epc_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
+{
+ return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_epc_cg() - Put the EPC cgroup and reduce its ref count.
+ * @epc_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
+{
+ if (epc_cg)
+ put_misc_cg(epc_cg->cg);
+}
+
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+void sgx_epc_cgroup_init(void);
+
+#endif
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..c32f18b70c73 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"
struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
static int sgx_nr_epc_sections;
@@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
*/
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
+ struct sgx_epc_cgroup *epc_cg;
struct sgx_epc_page *page;
+ int ret;
+
+ epc_cg = sgx_get_current_epc_cg();
+ ret = sgx_epc_cgroup_try_charge(epc_cg);
+ if (ret) {
+ sgx_put_epc_cg(epc_cg);
+ return ERR_PTR(ret);
+ }
for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_active_page_list))
- return ERR_PTR(-ENOMEM);
+ if (list_empty(&sgx_active_page_list)) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }
if (!reclaim) {
page = ERR_PTR(-EBUSY);
@@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
+ /*
+ * Need to do a global reclamation if cgroup was not full but free
+ * physical pages run out, causing __sgx_alloc_epc_page() to fail.
+ */
sgx_reclaim_pages();
cond_resched();
}
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (!IS_ERR(page)) {
+ WARN_ON_ONCE(page->epc_cg);
+ /* sgx_put_epc_cg() in sgx_free_epc_page() */
+ page->epc_cg = epc_cg;
+ } else {
+ sgx_epc_cgroup_uncharge(epc_cg);
+ sgx_put_epc_cg(epc_cg);
+ }
+#endif
+
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);
@@ -604,6 +632,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (page->epc_cg) {
+ sgx_epc_cgroup_uncharge(page->epc_cg);
+ sgx_put_epc_cg(page->epc_cg);
+ page->epc_cg = NULL;
+ }
+#endif
+
spin_lock(&node->lock);
page->owner = NULL;
@@ -643,6 +679,11 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].owner = NULL;
section->pages[i].poison = 0;
+
+#ifdef CONFIG_CGROUP_SGX_EPC
+ section->pages[i].epc_cg = NULL;
+#endif
+
list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
}
@@ -787,6 +828,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -837,6 +879,7 @@ static bool __init sgx_page_cache_init(void)
sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;
sgx_nr_epc_sections++;
}
@@ -846,6 +889,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}
+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}
@@ -942,6 +987,9 @@ static int __init sgx_init(void)
if (sgx_vepc_init() && ret)
goto err_provision;
+ /* Setup cgroup if either the native or vepc driver is active */
+ sgx_epc_cgroup_init();
+
return 0;
err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..a898d86dead0 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,12 +29,17 @@
/* Pages on free list */
#define SGX_EPC_PAGE_IS_FREE BIT(1)
+struct sgx_epc_cgroup;
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
u16 poison;
struct sgx_encl_page *owner;
struct list_head list;
+#ifdef CONFIG_CGROUP_SGX_EPC
+ struct sgx_epc_cgroup *epc_cg;
+#endif
};
/*
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 2f6cc3a0ad23..1a16efdfcd3d 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@ struct misc_res_ops {
* @max: Maximum limit on the resource.
* @usage: Current usage of the resource.
* @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
*/
struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+ void *priv;
};
/**
--
2.25.1
From: Sean Christopherson <[email protected]>
Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup will later have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage approaches its
limit.
Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V6:
- removed introduction to unreclaimables in commit message.
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.
V3:
- Removed the helper functions and revised commit messages.
---
arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
arch/x86/kernel/cpu/sgx/sgx.h | 15 +++++++++++++
2 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c32f18b70c73..912959c7ecc9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
/*
* These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
*/
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- if (list_empty(&sgx_active_page_list))
+ epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+ struct sgx_epc_page, list);
+ if (!epc_page)
break;
- epc_page = list_first_entry(&sgx_active_page_list,
- struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_reclaimer_lock);
- list_add_tail(&epc_page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
+ list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_active_page_list);
+ !list_empty(&sgx_global_lru.reclaimable);
}
/*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
ksgxd_tsk = tsk;
+ sgx_lru_init(&sgx_global_lru);
+
return true;
}
@@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
}
/**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return 0;
}
@@ -578,7 +579,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_active_page_list)) {
+ if (list_empty(&sgx_global_lru.reclaimable)) {
page = ERR_PTR(-ENOMEM);
break;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index a898d86dead0..0e99e9ae3a67 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -88,6 +88,21 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+ spinlock_t lock;
+ struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->reclaimable);
+}
+
struct sgx_epc_page *__sgx_alloc_epc_page(void);
void sgx_free_epc_page(struct sgx_epc_page *page);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Implement the reclamation flow for cgroups, encapsulated in the top-level
function sgx_epc_cgroup_reclaim_pages(). It performs a pre-order walk of
the cgroup's subtree and calls sgx_reclaim_pages() at each node, passing
in that node's LRU. It keeps track of the total pages reclaimed and the
number of pages left to attempt, and stops the walk once the desired
number of pages has been attempted.
In some contexts, e.g., page fault handling, only asynchronous
reclamation is allowed. Create a workqueue, along with the corresponding
work item and function definitions, to support asynchronous reclamation.
Both the synchronous and asynchronous flows invoke the same top-level
reclaim function, and will be triggered later by
sgx_epc_cgroup_try_charge() when usage of the cgroup is at or near its
limit.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V9:
- Add comments for static variables. (Jarkko)
V8:
- Remove alignment for substructure variables. (Jarkko)
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 181 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 +
2 files changed, 183 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index f4a37ace67d7..16b6d9f909eb 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -8,9 +8,180 @@
/* The root EPC cgroup */
static struct sgx_epc_cgroup epc_cg_root;
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, e.g., in a fault handler.
+ */
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+ return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+ return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is
+ * over its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+ struct misc_cg *i = epc_cg->cg;
+ u64 m = U64_MAX;
+
+ while (i) {
+ m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+ i = misc_cg_parent(i);
+ }
+
+ return m / PAGE_SIZE;
+}
+
/**
- * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: Root of the tree to check
*
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+ bool ret = true;
+
+ /*
+ * The caller must ensure a reference on css_root is held
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&epc_cg->lru.lock);
+ ret = list_empty(&epc_cg->lru.reclaimable);
+ spin_unlock(&epc_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
+ * @root: Root of the tree to start walking from.
+ * Return: Number of pages reclaimed.
+ */
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+{
+ /*
+ * Attempting to reclaim only a few pages will often fail and is
+ * inefficient, while reclaiming a huge number of pages can result in
+ * soft lockups due to holding various locks for an extended duration.
+ */
+ unsigned int nr_to_scan = SGX_NR_TO_SCAN;
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+ unsigned int cnt;
+
+ /* The caller must ensure a reference on css_root is held */
+ css_root = &root->css;
+
+ cnt = 0;
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+ cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!nr_to_scan)
+ break;
+ }
+
+ rcu_read_unlock();
+ return cnt;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup
+ * when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ u64 cur, max;
+
+ epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+ for (;;) {
+ max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+ /*
+ * Adjust the limit down by one page, the goal is to free up
+ * pages for fault allocations, not to simply obey the limit.
+ * Conditionally decrementing max also means the cur vs. max
+ * check will correctly handle the case where both are zero.
+ */
+ if (max)
+ max--;
+
+ /*
+ * Unless the limit is extremely low, in which case forcing
+ * reclaim will likely cause thrashing, force the cgroup to
+ * reclaim at least once if it's operating *near* its maximum
+ * limit by adjusting @max down by half the min reclaim size.
+ * This work func is scheduled by sgx_epc_cgroup_try_charge
+ * when it cannot directly reclaim due to being in an atomic
+ * context, e.g. EPC allocation in a fault handler. Waiting
+ * to reclaim until the cgroup is actually at its limit is less
+ * performant as it means the faulting task is effectively
+ * blocked until a worker makes its way through the global work
+ * queue.
+ */
+ if (max > SGX_NR_TO_SCAN * 2)
+ max -= (SGX_NR_TO_SCAN / 2);
+
+ cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+
+ if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ break;
+
+ /* Keep reclaiming until above condition is met. */
+ sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
+ }
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @epc_cg: The EPC cgroup to be charged for the page.
* Return:
* * %0 - If successfully charged.
@@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
if (!epc_cg)
return;
+ cancel_work_sync(&epc_cg->reclaim_work);
kfree(epc_cg);
}
@@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
{
+ sgx_lru_init(&epc_cg->lru);
+ INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
epc_cg->cg = cg;
}
@@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
void sgx_epc_cgroup_init(void)
{
+ sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+ WQ_UNBOUND | WQ_FREEZABLE,
+ WQ_UNBOUND_MAX_ACTIVE);
+ BUG_ON(!sgx_epc_cg_wq);
+
misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 6b664b4c321f..e3c6a08f0ee8 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -34,6 +34,8 @@ static inline void sgx_epc_cgroup_init(void) { }
#else
struct sgx_epc_cgroup {
struct misc_cg *cg;
+ struct sgx_epc_lru_list lru;
+ struct work_struct reclaim_work;
};
static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -66,6 +68,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
void sgx_epc_cgroup_init(void);
#endif
--
2.25.1
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of the mem_cgroup to charge is done
in sgx_encl_get_mem_cgroup(), which only considers two contexts in which
the swapping can be done: normal tasks and the ksgxd kthread.
With the new EPC cgroup implementation, the swapping can also happen in
EPC cgroup work-queue threads. In those cases, it improperly selects the
root mem_cgroup to charge for the RAM usage.
Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and
return the mem_cgroup of an mm_struct associated with the enclave. The
returned mem_cgroup is charged for EPC backing pages in all kthread
cases.
Pass a flag into the top level reclamation function,
sgx_reclaim_pages(), to explicitly indicate whether it is called from a
background kthread. Internally, if the flag is true, switch the active
mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(), prior to
any backing page allocation, in order to ensure that shmem page
allocations are charged to the enclave's cgroup.
Remove current_is_ksgxd() as it is no longer needed.
Signed-off-by: Haitao Huang <[email protected]>
Reported-by: Mikko Ylinen <[email protected]>
---
V9:
- Reduce number of if statements. (Tim)
V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
arch/x86/kernel/cpu/sgx/encl.c | 38 +++++++++++++---------------
arch/x86/kernel/cpu/sgx/encl.h | 3 +--
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 7 ++---
arch/x86/kernel/cpu/sgx/main.c | 27 +++++++++-----------
arch/x86/kernel/cpu/sgx/sgx.h | 3 ++-
5 files changed, 36 insertions(+), 42 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..4e5948362060 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
}
/*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list.
*/
static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
{
@@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
struct sgx_encl_mm *encl_mm;
int idx;
- /*
- * If called from normal task context, return the mem_cgroup
- * of the current task's mm. The remainder of the handling is for
- * ksgxd.
- */
- if (!current_is_ksgxd())
- return get_mem_cgroup_from_mm(current->mm);
-
/*
* Search the enclave's mm_list to find an mm associated with
* this enclave to charge the allocation to.
@@ -1047,27 +1037,33 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* @encl: an enclave pointer
* @page_index: enclave page index
* @backing: data for accessing backing storage for the page
+ * @indirect: in ksgxd or EPC cgroup work queue context
+ *
+ * Create a backing page for loading data back into an EPC page with ELDU. This
+ * function takes a reference on a new backing page which must be dropped with a
+ * corresponding call to sgx_encl_put_backing().
*
- * When called from ksgxd, sets the active memcg from one of the
- * mms in the enclave's mm_list prior to any backing page allocation,
- * in order to ensure that shmem page allocations are charged to the
- * enclave. Create a backing page for loading data back into an EPC page with
- * ELDU. This function takes a reference on a new backing page which
- * must be dropped with a corresponding call to sgx_encl_put_backing().
+ * When @indirect is true, sets the active memcg from one of the mms in the
+ * enclave's mm_list prior to any backing page allocation, in order to ensure
+ * that shmem page allocations are charged to the enclave.
*
* Return:
* 0 on success,
* -errno otherwise.
*/
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, bool indirect)
{
- struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
- struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
+ struct mem_cgroup *encl_memcg;
+ struct mem_cgroup *memcg;
int ret;
- ret = __sgx_encl_get_backing(encl, page_index, backing);
+ if (!indirect)
+ return __sgx_encl_get_backing(encl, page_index, backing);
+ encl_memcg = sgx_encl_get_mem_cgroup(encl);
+ memcg = set_active_memcg(encl_memcg);
+ ret = __sgx_encl_get_backing(encl, page_index, backing);
set_active_memcg(memcg);
mem_cgroup_put(encl_memcg);
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..549cd2e8d98b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
unsigned long end, unsigned long vm_flags);
-bool current_is_ksgxd(void);
void sgx_encl_release(struct kref *ref);
int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing);
+ struct sgx_backing *backing, bool indirect);
void sgx_encl_put_backing(struct sgx_backing *backing);
int sgx_encl_test_and_clear_young(struct mm_struct *mm,
struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 16b6d9f909eb..d399fda2b55e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -93,9 +93,10 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
/**
* sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
* @root: Root of the tree to start walking from.
+ * @indirect: In ksgxd or EPC cgroup work queue context.
* Return: Number of pages reclaimed.
*/
-unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
{
/*
* Attempting to reclaim only a few pages will often fail and is
@@ -119,7 +120,7 @@ unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
- cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
+ cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan, indirect);
rcu_read_lock();
css_put(pos);
@@ -176,7 +177,7 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
break;
/* Keep reclaiming until above condition is met. */
- sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
+ sgx_epc_cgroup_reclaim_pages(epc_cg->cg, true);
}
}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 4f5824c4751d..51904f191b97 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -254,7 +254,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
}
static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, bool indirect)
{
struct sgx_encl_page *encl_page = epc_page->owner;
struct sgx_encl *encl = encl_page->encl;
@@ -270,7 +270,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
- &secs_backing);
+ &secs_backing, indirect);
if (ret)
goto out;
@@ -304,9 +304,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* @lru: The LRU from which pages are reclaimed.
* @nr_to_scan: Pointer to the target number of pages to scan, must be less than
* SGX_NR_TO_SCAN.
+ * @indirect: In ksgxd or EPC cgroup work queue contexts.
* Return: Number of pages reclaimed.
*/
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan,
+ bool indirect)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -348,7 +350,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
mutex_lock(&encl_page->encl->lock);
- ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+ ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], indirect);
if (ret) {
mutex_unlock(&encl_page->encl->lock);
goto skip;
@@ -381,7 +383,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
continue;
encl_page = epc_page->owner;
- sgx_reclaimer_write(epc_page, &backing[i]);
+ sgx_reclaimer_write(epc_page, &backing[i], indirect);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
@@ -399,11 +401,11 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}
-static void sgx_reclaim_pages_global(void)
+static void sgx_reclaim_pages_global(bool indirect)
{
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
- sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
+ sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
}
/*
@@ -414,7 +416,7 @@ static void sgx_reclaim_pages_global(void)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(false);
}
static int ksgxd(void *p)
@@ -437,7 +439,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(true);
cond_resched();
}
@@ -460,11 +462,6 @@ static bool __init sgx_page_reclaimer_init(void)
return true;
}
-bool current_is_ksgxd(void)
-{
- return current == ksgxd_tsk;
-}
-
static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
{
struct sgx_numa_node *node = &sgx_numa_nodes[nid];
@@ -623,7 +620,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(false);
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 2593c013d091..cfe906054d85 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -110,7 +110,8 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan,
+ bool indirect);
void sgx_ipi_cb(void *info);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
To determine if any page is available for reclamation at the global
level, checking only the emptiness of the global LRU is not adequate
when pages are tracked in multiple LRUs, one per cgroup. For this
purpose, create a new helper, sgx_can_reclaim(), which currently checks
only the global LRU and will later also check the LRUs of all cgroups
when per-cgroup tracking is turned on. Replace all checks of the global
LRU, list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim().
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
v7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2279ae967707..6b0c26cac621 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
return &sgx_global_lru;
}
+static inline bool sgx_can_reclaim(void)
+{
+ return !list_empty(&sgx_global_lru.reclaimable);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_global_lru.reclaimable);
+ sgx_can_reclaim();
}
static void sgx_reclaim_pages_global(bool indirect)
@@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_global_lru.reclaimable)) {
+ if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
--
2.25.1
From: Sean Christopherson <[email protected]>
Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V8:
- Limit text width to 80 characters to be consistent.
V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommits EPC with a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 83 ++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running. The EPC is managed by the SGX subsystem and
+is not accounted by the memory controller. Note that this is true only for EPC
+memory itself, i.e. normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing store
+(normal system memory allocated via shmem). The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+ resource will show the total amount of EPC memory available on the
+ platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+ resource will show the current active EPC memory usage of the cgroup and
+ its descendants. EPC pages that are swapped out to backing RAM are not
+ included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The
+ sgx_epc resource will show the EPC usage hard limit. The default is
+ "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+ page fault handling, will be blocked until EPC can be reclaimed from the
+ cgroup. If there are no pages left that are reclaimable within the same
+ cgroup, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not
+ reclaimable by the host kernel. If the guest cgroup's limit is reached
+ and no reclaimable pages are left in the same cgroup, the virtual EPC
+ driver sends SIGBUS to the user space process to indicate failure on
+ new EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower
+ than the current usage to this file, the cgroup will not preemptively
+ deallocate pages currently in use, and will only start blocking the next
+ allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim due to
+ its EPC usage approaching (or exceeding) its max EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged
+to the original cgroup until the page is released or reclaimed. Migrating a
+process to a different cgroup doesn't move the EPC charges that it incurred
+while in the previous cgroup to its new cgroup.
--
2.25.1
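A rough usage sketch of the interface documented above. The cgroup v2 mount point, cgroup name, and limit value are assumptions for illustration; the real writes require root and a kernel built with this support, so they are shown as comments while the testable part is the PAGE_SIZE rounding the docs describe.

```shell
# Hypothetical cgroup path; assumes cgroup v2 mounted at /sys/fs/cgroup.
CG=/sys/fs/cgroup/sgx_test
PAGE_SIZE=4096

# Values written to misc.max are rounded down to the closest PAGE_SIZE
# multiple; compute what the controller would use for 1 MiB + 1 byte.
requested=$((1024 * 1024 + 1))
effective=$(( requested / PAGE_SIZE * PAGE_SIZE ))
echo "$effective"   # 1048576

# With a real cgroup, the sequence would be (root required):
#   mkdir "$CG"
#   echo "sgx_epc $requested" > "$CG/misc.max"
#   cat "$CG/misc.current"
#   cat /sys/fs/cgroup/misc.capacity
```

The `sgx_epc <bytes>` key-value format follows the generic Miscellaneous controller conventions in Documentation/admin-guide/cgroup-v2.rst.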
From: Kristen Carlson Accardi <[email protected]>
When cgroup support is enabled, all reclaimable pages will be tracked in
cgroup LRUs. The global reclaimer needs to start reclamation from the
root cgroup. Expose the top-level cgroup reclamation function so the
global reclaimer can reuse it.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V8:
- Remove unneeded breaks in function declarations. (Jarkko)
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index abf74fdb12b4..6e31f8727b8a 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
* @indirect: In ksgxd or EPC cgroup work queue context.
* Return: Number of pages reclaimed.
*/
-static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
{
/*
* Attempting to reclaim only a few pages will often fail and is
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index d061cd807b45..5b3e8e1b8630 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
static inline void sgx_epc_cgroup_init(void) { }
+
+static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
+{
+ return 0;
+}
#else
struct sgx_epc_cgroup {
struct misc_cg *cg;
@@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect);
+
void sgx_epc_cgroup_init(void);
#endif
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
When the EPC usage of a cgroup is near its limit, the cgroup needs to
reclaim pages used in the same cgroup to make room for new allocations.
This is analogous to the global reclaimer being triggered when global
usage is close to the total available EPC.
Add a boolean parameter to sgx_epc_cgroup_try_charge() to indicate
whether synchronous reclaim is allowed or not, and trigger the
synchronous or asynchronous reclamation flow accordingly.
Note that at this point all reclaimable EPC pages are still tracked in
the global LRU and the per-cgroup LRUs are empty, so no per-cgroup
reclamation is activated yet.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
arch/x86/kernel/cpu/sgx/main.c | 2 +-
3 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index d399fda2b55e..abf74fdb12b4 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
/**
* sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @epc_cg: The EPC cgroup to be charged for the page.
+ * @reclaim: Whether or not synchronous reclaim is allowed
* Return:
* * %0 - If successfully charged.
* * -errno - for failures.
*/
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
{
- return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+ for (;;) {
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+ PAGE_SIZE))
+ break;
+
+ if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ if (!reclaim) {
+ queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+ return -EBUSY;
+ }
+
+ if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+ /* All pages were too young to reclaim, try again a little later */
+ schedule();
+ }
+
+ return 0;
}
/**
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index e3c6a08f0ee8..d061cd807b45 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
-static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
{
return 0;
}
@@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
put_misc_cg(epc_cg->cg);
}
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
void sgx_epc_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 51904f191b97..2279ae967707 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -588,7 +588,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
int ret;
epc_cg = sgx_get_current_epc_cg();
- ret = sgx_epc_cgroup_try_charge(epc_cg);
+ ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim);
if (ret) {
sgx_put_epc_cg(epc_cg);
return ERR_PTR(ret);
--
2.25.1
The scripts rely on the cgroup-tools package from libcgroup [1].
To run selftests for epc cgroup:
sudo ./run_epc_cg_selftests.sh
To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:
./watch_misc_for_tests.sh current
With different cgroups, the script starts one or multiple concurrent SGX
selftests, each running one unclobbered_vdso_oversubscribed test. Each
such test tries to load an enclave with EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.
The script creates 3 different cgroups at the beginning with the
following expectations:
1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
   concurrent tests. The script starts 8 of them and expects all to pass.
   Then it reruns the same test with one process randomly killed, and
   checks that usage is zero after all processes exit.
The script also includes a test with a low mem_cg limit and a LARGE
sgx_epc limit to verify that the RAM used for per-cgroup reclamation is
charged to the proper mem_cg.
[1] https://github.com/libcgroup/libcgroup/blob/main/README
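The limit arithmetic the script applies can be checked stand-alone; the
capacity value below is made up for illustration (the real script reads
it from the 'sgx_epc' entry in misc.capacity):

```shell
# Mirror the script's limit derivation for a hypothetical EPC capacity.
CAPACITY=$((512 * 1024 * 1024))   # made-up capacity value for illustration
SMALL=$(( CAPACITY / 512 ))       # too small to load a capacity-sized enclave
LARGE=$(( SMALL * 4 ))            # enough for roughly 4 concurrent tests
LARGER=$CAPACITY                  # same as the capacity itself
echo "SMALL=$SMALL LARGE=$LARGE LARGER=$LARGER"
```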
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Added memcontrol test.
V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
.../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
2 files changed, 259 insertions(+)
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..e027bf39f005
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,246 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+ echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+ exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_MISC_ROOT=/sys/fs/cgroup
+CG_MEM_ROOT=/sys/fs/cgroup
+CG_V1=0
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+ echo "# cgroup V2 is in use."
+else
+ echo "# cgroup V1 is in use."
+ CG_MISC_ROOT=/sys/fs/cgroup/misc
+ CG_MEM_ROOT=/sys/fs/cgroup/memory
+ CG_V1=1
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
+# This is below the number of VA pages needed for an enclave of capacity
+# size, so it should fail oversubscribed cases.
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+wait_check_process_status() {
+ local pid=$1
+ local check_for_success=$2 # If 1, check for success;
+ # If 0, check for failure
+ wait "$pid"
+ local status=$?
+
+ if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+ echo "# Process $pid succeeded."
+ return 0
+ elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ return 0
+ fi
+ return 1
+}
+
+wait_and_detect_for_any() {
+ local pids=("$@")
+ local check_for_success=$1 # If 1, check for success;
+ # If 0, check for failure
+ local detected=1 # 0 for success detection
+
+ for pid in "${pids[@]:1}"; do
+ if wait_check_process_status "$pid" "$check_for_success"; then
+ detected=0
+ # Wait for other processes to exit
+ fi
+ done
+
+ return $detected
+}
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
+# Always use leaf node of misc cgroups so it works for both v1 and v2
+# these may fail on OOM
+cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [[ $? -eq 0 ]]; then
+ echo "# Fail on SMALL limit, not expecting any test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+else
+ echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one success...."
+
+pids=()
+for i in {1..4}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+
+if wait_and_detect_for_any 1 "${pids[@]}"; then
+ echo "# PASSED LARGE limit positive testing."
+else
+ echo "# Failed on LARGE limit positive testing, no test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one failure...."
+pids=()
+for i in {1..5}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 0 "${pids[@]}"; then
+ echo "# PASSED LARGE limit negative testing."
+else
+ echo "# Failed on LARGE limit negative testing, no test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ expecting no failure...."
+pids=()
+for i in {1..8}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 0 "${pids[@]}"; then
+ echo "# Failed on LARGER limit, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+else
+ echo "# PASSED LARGER limit tests."
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ randomly kill one, expecting no failure...."
+pids=()
+for i in {1..8}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+sleep $((RANDOM % 10 + 5))
+
+# Randomly select a PID to kill
+RANDOM_INDEX=$((RANDOM % 8))
+PID_TO_KILL=${pids[RANDOM_INDEX]}
+
+kill $PID_TO_KILL
+echo "# Killed process with PID: $PID_TO_KILL"
+
+any_failure=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [ "$pid" != "$PID_TO_KILL" ]; then
+ if [[ $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+ fi
+done
+
+if [[ $any_failure -ne 0 ]]; then
+ echo "# Failed on random killing, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+echo "# PASSED LARGER limit test with a process randomly killed."
+
+cgcreate -g memory:$TEST_CG_SUB2
+if [ $? -ne 0 ]; then
+ echo "# Failed creating memory controller."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+MEM_LIMIT_TOO_SMALL=$((CAPACITY - 2 * LARGE))
+
+if [[ $CG_V1 -eq 0 ]]; then
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.max
+else
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.limit_in_bytes
+ echo "$MEM_LIMIT_TOO_SMALL" > $CG_MEM_ROOT/$TEST_CG_SUB2/memory.memsw.limit_in_bytes
+fi
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE EPC limit,
+ and too small RAM limit, expecting all failures...."
+pids=()
+for i in {1..4}; do
+ (
+ cgexec -g memory:$TEST_CG_SUB2 -g misc:$TEST_CG_SUB2 $test_cmd \
+ >cgtest_large_oom_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+if wait_and_detect_for_any 1 "${pids[@]}"; then
+ echo "# Failed on tests with memcontrol, some tests did not fail."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ if [[ $CG_V1 -ne 0 ]]; then
+ cgdelete -r -g memory:$TEST_ROOT_CG
+ fi
+ exit 1
+else
+ echo "# PASSED LARGE limit tests with memcontrol."
+fi
+
+sleep 2
+
+USAGE=$(grep '^sgx_epc' "$CG_MISC_ROOT/$TEST_ROOT_CG/misc.current" | awk '{print $2}')
+if [ "$USAGE" -ne 0 ]; then
+ echo "# Failed: Final usage is $USAGE, not 0."
+else
+ echo "# PASSED leakage check."
+ echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
+fi
+cgdelete -r -g misc:$TEST_ROOT_CG
+if [[ $CG_V1 -ne 0 ]]; then
+ cgdelete -r -g memory:$TEST_ROOT_CG
+fi
+echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+ then
+ echo "No argument supplied, please provide 'max', 'current' or 'events'"
+ exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
+ 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for cgroup subsystem state (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.
Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.
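The registration contract can be pictured as a tiny registry that
rejects incomplete ops. The sketch below is an illustrative shell
analogue (the set_ops helper and handler names are hypothetical, not the
kernel API itself):

```shell
# Model of per-resource-type callback registration: a registry maps a
# resource type to alloc/free handlers and rejects incomplete ops.
declare -A ALLOC_CB FREE_CB

set_ops() {   # $1: resource type  $2: alloc handler  $3: free handler
    if [ -z "$2" ] || [ -z "$3" ]; then
        return 22                  # EINVAL: a required callback is missing
    fi
    ALLOC_CB[$1]=$2
    FREE_CB[$1]=$3
}

epc_alloc() { echo "alloc hook for cgroup $1"; }
epc_free()  { echo "free hook for cgroup $1"; }
```

Enforcing that both callbacks are present at registration time means
callers never have to null-check one of a pair later.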
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)
V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
include/linux/misc_cgroup.h | 11 +++++
kernel/cgroup/misc.c | 84 +++++++++++++++++++++++++++++++++----
2 files changed, 87 insertions(+), 8 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
#include <linux/cgroup.h>
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+ int (*alloc)(struct misc_cg *cg);
+ void (*free)(struct misc_cg *cg);
+};
+
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..14ab13ef3bc7 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
*/
static u64 misc_res_capacity[MISC_CG_RES_TYPES];
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
/**
* parent_misc() - Get the parent of the passed misc cgroup.
* @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
}
EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ if (!ops->alloc) {
+ pr_err("%s: alloc missing\n", __func__);
+ return -EINVAL;
+ }
+
+ if (!ops->free) {
+ pr_err("%s: free missing\n", __func__);
+ return -EINVAL;
+ }
+
+ misc_res_ops[type] = ops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
/**
* misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
* @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
};
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+ int ret;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ WRITE_ONCE(cg->res[i].max, MAX_NUM);
+ atomic64_set(&cg->res[i].usage, 0);
+ if (misc_res_ops[i]) {
+ ret = misc_res_ops[i]->alloc(cg);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+}
+
/**
* misc_cg_alloc() - Allocate misc cgroup.
* @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
static struct cgroup_subsys_state *
misc_cg_alloc(struct cgroup_subsys_state *parent_css)
{
- enum misc_res_type i;
- struct misc_cg *cg;
+ struct misc_cg *parent_cg, *cg;
+ int ret;
- if (!parent_css) {
- cg = &root_cg;
+ if (unlikely(!parent_css)) {
+ parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+ parent_cg = css_misc(parent_css);
}
- for (i = 0; i < MISC_CG_RES_TYPES; i++) {
- WRITE_ONCE(cg->res[i].max, MAX_NUM);
- atomic64_set(&cg->res[i].usage, 0);
+ ret = _misc_cg_res_alloc(cg);
+ if (ret) {
+ _misc_cg_res_free(cg);
+ if (likely(parent_css))
+ kfree(cg);
+
+ return ERR_PTR(ret);
}
return &cg->css;
@@ -410,7 +475,10 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
*/
static void misc_cg_free(struct cgroup_subsys_state *css)
{
- kfree(css_misc(css));
+ struct misc_cg *cg = css_misc(css);
+
+ _misc_cg_res_free(cg);
+ kfree(cg);
}
/* Cgroup controller callbacks */
--
2.25.1
From: Sean Christopherson <[email protected]>
Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
When a cgroup's usage reaches its limit, the cgroup needs to reclaim
pages from its LRU or the LRUs of its descendants to make room for any
new allocations.
To prepare for per-cgroup reclamation, expose the top-level reclamation
function, sgx_reclaim_pages(), in the header file for reuse. Add a
parameter to the function to pass in an LRU so cgroups can pass in
different tracking LRUs later. Add another parameter for passing in the
number of pages to scan, and make the function return the number of
pages reclaimed, as a cgroup reclaimer may need to track reclamation
progress from its descendants and change the number of pages to scan in
subsequent calls.
Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(),
to just call this function with the global LRU passed in. When
per-cgroup LRU is added later, the wrapper will perform global
reclamation from the root cgroup.
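The scan-and-skip behavior the new signature exposes can be modeled in a
few lines of shell; pages are just words here, and "young" marks a page
accessed since the last scan (all names are illustrative, not kernel
code):

```shell
# Model of sgx_reclaim_pages(): scan up to nr_to_scan pages from the LRU
# head, skip pages accessed since the last scan ("young") by aging them
# and keeping them on the list, and report how many were reclaimed.
reclaim_pages() {   # $1: nr_to_scan  $2: space-separated LRU, head first
    local nr=$1 reclaimed=0 kept=""
    for page in $2; do
        if [ "$nr" -eq 0 ]; then
            kept="$kept $page"         # beyond the scan budget: untouched
            continue
        fi
        nr=$((nr - 1))
        if [ "$page" = "young" ]; then
            kept="$kept old"           # accessed recently: age it, keep on tail
        else
            reclaimed=$((reclaimed + 1))
        fi
    done
    echo "$reclaimed"
}
```

The returned count is what lets a cgroup reclaimer track progress and
shrink the scan target on subsequent calls.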
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
---
V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
---
arch/x86/kernel/cpu/sgx/main.c | 53 +++++++++++++++++++++++-----------
arch/x86/kernel/cpu/sgx/sgx.h | 1 +
2 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index a131aa985c95..4f5824c4751d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -286,11 +286,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
+ *
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
*
* Batch process a chunk of pages (at the moment 16) in order to degrade amount
* of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
@@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* + EWB) but not sufficiently. Reclaiming one page at a time would also be
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
+ *
+ * @lru: The LRU from which pages are reclaimed.
+ * @nr_to_scan: Pointer to the target number of pages to scan, must be less than
+ * SGX_NR_TO_SCAN.
+ * Return: Number of pages reclaimed.
*/
-static void sgx_reclaim_pages(void)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_global_lru.lock);
- for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
+ spin_lock(&lru->lock);
+
+ for (; *nr_to_scan > 0; --(*nr_to_scan)) {
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
@@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+
+ spin_unlock(&lru->lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_global_lru.lock);
- list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
+ list_add_tail(&epc_page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void)
sgx_reclaimer_block(epc_page);
}
+ ret = 0;
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
if (!epc_page)
@@ -378,7 +387,10 @@ static void sgx_reclaim_pages(void)
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
sgx_free_epc_page(epc_page);
+ ret++;
}
+
+ return (unsigned int)ret;
}
static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,6 +399,13 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}
+static void sgx_reclaim_pages_global(void)
+{
+ unsigned int nr_to_scan = SGX_NR_TO_SCAN;
+
+ sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
+}
+
/*
* sgx_reclaim_direct() should be called (without enclave's mutex held)
* in locations where SGX memory resources might be low and might be
@@ -395,7 +414,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
}
static int ksgxd(void *p)
@@ -418,7 +437,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
@@ -604,7 +623,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0e99e9ae3a67..2593c013d091 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -110,6 +110,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan);
void sgx_ipi_cb(void *info);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Previous patches have implemented all the infrastructure needed for
per-cgroup EPC page tracking and reclaiming, but all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns a
hard-coded reference to the global LRU.
Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.
This makes all EPC pages tracked in per-cgroup LRUs, and the global
reclaimer (ksgxd) will no longer be able to reclaim any pages from the
global LRU. However, in cases of overcommitting, i.e., when the sum of
cgroup limits is greater than the total capacity, cgroups may never hit
their limits and thus never reclaim, while total usage can still be near
the capacity. Therefore global reclamation is still needed in those
cases, and it should reclaim from the root cgroup.
Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
when the EPC cgroup is enabled, otherwise from the global LRU.
Similarly, modify sgx_can_reclaim() to check the emptiness of the LRUs
of all cgroups when the EPC cgroup is enabled, otherwise only check the
global LRU.
With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 6b0c26cac621..d4265a390ba9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (epc_page->epc_cg)
+ return &epc_page->epc_cg->lru;
+
+ /* This should not happen if kernel is configured correctly */
+ WARN_ON_ONCE(1);
+#endif
return &sgx_global_lru;
}
static inline bool sgx_can_reclaim(void)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ return !sgx_epc_cgroup_lru_empty(misc_cg_root());
+#else
return !list_empty(&sgx_global_lru.reclaimable);
+#endif
}
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -410,7 +421,10 @@ static void sgx_reclaim_pages_global(bool indirect)
{
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
- sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ sgx_epc_cgroup_reclaim_pages(misc_cg_root(), indirect);
+ else
+ sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
}
/*
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its own or an ancestor's limit. This requires a walk from the current
cgroup up to the root, similar to misc_cg_try_charge(). Export
misc_cg_parent() to enable this walk.
The SGX driver may also need to start a global-level reclamation from
the root. Export misc_cg_root() for the SGX driver to access.
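The upward walk enabled by misc_cg_parent() can be pictured as climbing
a cgroup path. The sketch below models it with filesystem-style paths
(purely illustrative, not the kernel data structures):

```shell
# Model the charge/uncharge walk from a cgroup up to the root by treating
# cgroups as filesystem-style paths and parent lookup as dirname.
walk_to_root() {   # $1: cgroup path; prints each level, leaf to root
    local cg=$1
    while [ -n "$cg" ]; do
        echo "$cg"
        if [ "$cg" = "/" ]; then
            break                     # reached misc_cg_root()
        fi
        cg=$(dirname "$cg")           # one misc_cg_parent() step
    done
}
```

misc_cg_try_charge() performs exactly this kind of walk, charging each
level and unwinding on failure.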
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
include/linux/misc_cgroup.h | 24 ++++++++++++++++++++++++
kernel/cgroup/misc.c | 21 ++++++++-------------
2 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
};
+struct misc_cg *misc_cg_root(void);
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
}
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
/*
* get_current_misc_cg() - Find and get the misc cgroup of the current task.
*
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
}
#else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+ return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+ return NULL;
+}
static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
{
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 14ab13ef3bc7..1f0d8e05b36c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
/**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
*/
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
{
- return cgroup ? css_misc(cgroup->css.parent) : NULL;
+ return &root_cg;
}
+EXPORT_SYMBOL_GPL(misc_cg_root);
/**
* valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!amount)
return 0;
- for (i = cg; i; i = parent_misc(i)) {
+ for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];
new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
return 0;
err_charge:
- for (j = i; j; j = parent_misc(j)) {
+ for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
- for (j = cg; j != i; j = parent_misc(j))
+ for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
- for (i = cg; i; i = parent_misc(i))
+ for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
}
EXPORT_SYMBOL_GPL(misc_cg_uncharge);
--
2.25.1
On Mon, Feb 05, 2024 at 01:06:23PM -0800, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The existing
> cgroup memory controller cannot be used to limit or account for SGX EPC
> memory, which is a desirable feature in some environments, e.g., support
> for pod level control in a Kubernetes cluster on a VM or bare-metal host
> [1,2].
>
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. A user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>
> Much like normal system memory, EPC memory can be overcommitted via virtual
> memory techniques and pages can be swapped out of the EPC to their backing
> store, which are normal system memory allocated via shmem and accounted by
> the memory controller. Similar to per-cgroup reclamation done by the memory
> controller, the EPC misc controller needs to implement a per-cgroup EPC
> reclaiming process: when the EPC usage of a cgroup reaches its hard limit
> ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
> some EPC pages within the same cgroup to make room for new allocations.
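For context, the interface referred to above is the standard misc
controller file set; a minimal usage sketch (the cgroup path and the
64 MiB value are made up for illustration, and values are in bytes):

```shell
# Enable the misc controller for child cgroups (cgroup v2).
echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control

# Set a hard limit of 64 MiB of EPC for one cgroup:
# misc.max takes "<resource> <value>" pairs, here "sgx_epc".
echo "sgx_epc 67108864" > /sys/fs/cgroup/pod1/misc.max

# Inspect per-cgroup usage and limit-hit events, and total capacity.
grep sgx_epc /sys/fs/cgroup/pod1/misc.current
grep sgx_epc /sys/fs/cgroup/pod1/misc.events
grep sgx_epc /sys/fs/cgroup/misc.capacity
```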
>
> For that, this implementation tracks reclaimable EPC pages in a separate
> LRU list in each cgroup, and below are more details and justification of
> this design.
>
> Track EPC pages in per-cgroup LRUs (from Dave)
> ----------------------------------------------
>
> tl;dr: A cgroup hitting its limit should be as similar as possible to the
> system running out of EPC memory. The only two choices to implement that
> are nasty changes to the existing LRU scanning algorithm, or adding new
> LRUs. The result: Add a new LRU for each cgroup and scan those instead.
> Replace the existing global LRU with the root cgroup's LRU (only when this
> new support is compiled in, obviously).
>
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>
> This series attempts to implement that same logic for cgroups, for the same
> reasons: it's preferable to wait for memory to become available and let
> reclaim happen than to do things that are fatal to enclaves.
>
> There is currently a global reclaimable page SGX LRU list. That list (and
> the existing scanning algorithm) is essentially useless for doing reclaim
> when a cgroup hits its limit because the cgroup's pages are scattered
> around that LRU. It is unspeakably inefficient to scan a linked list with
> millions of entries for what could be dozens of pages from a cgroup that
> needs reclaim.
>
> Even if unspeakably slow reclaim were accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It would
> either need to hold the list locks for unreasonable amounts of time, or be
> taught to scan the list in pieces, which has its own challenges.
>
> Unreclaimable Enclave Pages
> ---------------------------
>
> There are a variety of page types for enclaves, each serving different
> purposes [5]. Although the SGX architecture supports swapping for all
> types, some special pages, e.g., Version Array (VA) and Secure Enclave
> Control Structure (SECS) [5], hold metadata for reclaimed pages and
> enclaves. That makes reclamation of such pages more intricate to manage.
> The SGX driver's global reclaimer currently does not swap out VA pages. It
> only swaps the SECS page of an enclave when all other associated pages have
> been swapped out. The cgroup reclaimer follows the same approach: it does
> not track those pages in per-cgroup LRUs and considers them unreclaimable.
> The allocation of these pages is counted towards the usage of a specific
> cgroup and is subject to the cgroup's set EPC limits.
>
> Earlier versions of this series implemented forced enclave-killing to
> reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
> particularly in scenarios where a user or administrator reduces this limit
> after enclaves have launched. However, subsequent discussions [3, 4]
> indicated that such preemptive enforcement is not necessary for the
> misc-controllers. Therefore, reclaiming SECS/VA pages by force-killing
> enclaves was removed, and the limit is only enforced at the time of a new
> EPC allocation request. When a cgroup hits its limit but nothing is left
> in the LRUs of its subtree, i.e., there is nothing to reclaim in the
> cgroup, any new attempt to allocate EPC within that cgroup will result in
> 'ENOMEM'.
>
> Unreclaimable Guest VM EPC Pages
> --------------------------------
>
> The EPC pages allocated for guest VMs by the virtual EPC driver are not
> reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
> those as unreclaimable and returns ENOMEM when its limit is hit and
> nothing reclaimable is left within the cgroup. The virtual EPC driver
> translates the ENOMEM error resulting from an EPC allocation request into
> a SIGBUS to the user process, exactly the same way it handles the host
> running out of physical EPC.
>
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to utilize the misc cgroup
> controller rather than a custom controller. I have been updating the
> patches based on review comments since V2 [7-13], simplifying the
> implementation/design, adding selftest scripts, and fixing some stability
> issues found during testing.
>
> Thanks to all for the review/test/tags/feedback provided on the previous
> versions.
>
> I appreciate your further reviewing/testing and providing tags if
> appropriate.
>
I've been running this patchset on my (single node) Kubernetes cluster
with EPC cgroup limits enforced for each SGX (Gramine) container and
everything seems to be working well and as expected.
Tested-by: Mikko Ylinen <[email protected]>
Regards, Mikko
> ---
> V9:
> - Add comments for static variables outside functions. (Jarkko)
> - Remove unnecessary ifs. (Tim)
> - Add more Reviewed-By: tags from Jarkko and TJ.
>
> V8:
> - Style fixes. (Jarkko)
> - Abstract _misc_res_free/alloc() (Jarkko)
> - Remove unneeded NULL checks. (Jarkko)
>
> V7:
> - Split the large patch for the final EPC implementation, #10 in V6, into
> smaller ones. (Dave, Kai)
> - Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
> into two functions (Kai)
> - Removed patches to introduce the EPC page states, list for storing
> candidate pages for reclamation. (not needed due to above changes)
> - Make ops one per resource type and store them in array (Michal)
> - Rename the ops struct to misc_res_ops, and enforce the constraints of
> required callback functions (Jarkko)
> - Initialize epc cgroup in sgx driver init function. (Kai)
> - Moved addition of priv field to patch 4 where it was used first. (Jarkko)
> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
> - Use a static for root cgroup (Kai)
>
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/lkml/[email protected]/
> [4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
> [5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
> [6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [7]v2: https://lore.kernel.org/all/[email protected]/
> [8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
> [9]v4: https://lore.kernel.org/all/[email protected]/
> [10]v5: https://lore.kernel.org/all/[email protected]/
> [11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
> [12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
> [13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
>
> Haitao Huang (2):
> x86/sgx: Charge mem_cgroup for per-cgroup reclamation
> selftests/sgx: Add scripts for EPC cgroup testing
>
> Kristen Carlson Accardi (10):
> cgroup/misc: Add per resource callbacks for CSS events
> cgroup/misc: Export APIs for SGX driver
> cgroup/misc: Add SGX EPC resource type
> x86/sgx: Implement basic EPC misc cgroup functionality
> x86/sgx: Abstract tracking reclaimable pages in LRU
> x86/sgx: Implement EPC reclamation flows for cgroup
> x86/sgx: Add EPC reclamation in cgroup try_charge()
> x86/sgx: Abstract check for global reclaimable pages
> x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
> x86/sgx: Turn on per-cgroup EPC reclamation
>
> Sean Christopherson (3):
> x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
> x86/sgx: Expose sgx_reclaim_pages() for cgroup
> Docs/x86/sgx: Add description for cgroup support
>
> Documentation/arch/x86/sgx.rst | 83 ++++++
> arch/x86/Kconfig | 13 +
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/encl.c | 38 ++-
> arch/x86/kernel/cpu/sgx/encl.h | 3 +-
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 276 ++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 83 ++++++
> arch/x86/kernel/cpu/sgx/main.c | 180 +++++++++---
> arch/x86/kernel/cpu/sgx/sgx.h | 22 ++
> include/linux/misc_cgroup.h | 41 +++
> kernel/cgroup/misc.c | 109 +++++--
> .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++
> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> 13 files changed, 1019 insertions(+), 89 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
>
> base-commit: 54be6c6c5ae8e0d93a6c4641cb7528eb0b6ba478
> --
> 2.25.1
>
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> Implement the reclamation flow for cgroups, encapsulated in the top-level
> function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk of its
> subtree, calling sgx_reclaim_pages() at each node and passing in the LRU
> of that node. It keeps track of the total reclaimed pages and the pages
> left to attempt, and stops the walk once the desired number of pages has
> been attempted.
>
> In some contexts, e.g., page fault handling, only asynchronous
> reclamation is allowed. Create a workqueue, a corresponding work item,
> and function definitions to support asynchronous reclamation. Both
> synchronous and asynchronous flows invoke the same top-level reclaim
> function, and will be triggered later by sgx_epc_cgroup_try_charge()
> when the usage of the cgroup is at or near its limit.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V9:
> - Add comments for static variables. (Jarkko)
>
> V8:
> - Remove alignment for substructure variables. (Jarkko)
>
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 181 ++++++++++++++++++++++++++-
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 +
> 2 files changed, 183 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index f4a37ace67d7..16b6d9f909eb 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -8,9 +8,180 @@
> /* The root EPC cgroup */
> static struct sgx_epc_cgroup epc_cg_root;
>
> +/*
> + * The work queue that reclaims EPC pages in the background for cgroups.
> + *
> + * A cgroup schedules a work item into this queue to reclaim pages within the
> + * same cgroup when its usage limit is reached and synchronous reclamation is not
> + * an option, e.g., in a fault handler.
> + */
Here I get a bit confused by the text because of "e.g., in a fault
handler". Can we broadly enumerate the sites where this could be
triggered?
It does not have to be complete, but listing at least the most common
locations would make this comment *actually* useful for later maintenance.
BR, Jarkko
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> Enclave Page Cache(EPC) memory can be swapped out to regular system
"Enclave Page Cache (EPC)"
~
> memory, and the consumed memory should be charged to a proper
> mem_cgroup. Currently the selection of mem_cgroup to charge is done in
> sgx_encl_get_mem_cgroup(). But it only considers two contexts in which
> the swapping can be done: normal tasks and the ksgxd kthread.
> With the new EPC cgroup implementation, the swapping can also happen in
> EPC cgroup work-queue threads. In those cases, it improperly selects the
> root mem_cgroup to charge for the RAM usage.
>
> Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and
> return the mem_cgroup of an mm_struct associated with the enclave. The
> return is used to charge for EPC backing pages in all kthread cases.
>
> Pass a flag into the top level reclamation function,
> sgx_reclaim_pages(), to explicitly indicate whether it is called from a
> background kthread. Internally, if the flag is true, switch the active
> mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(), prior to
> any backing page allocation, in order to ensure that shmem page
> allocations are charged to the enclave's cgroup.
>
> Removed current_is_ksgxd() as it is no longer needed.
>
> Signed-off-by: Haitao Huang <[email protected]>
> Reported-by: Mikko Ylinen <[email protected]>
> ---
> V9:
> - Reduce number of if statements. (Tim)
>
> V8:
> - Limit text paragraphs to 80 characters wide. (Jarkko)
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 38 +++++++++++++---------------
> arch/x86/kernel/cpu/sgx/encl.h | 3 +--
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 7 ++---
> arch/x86/kernel/cpu/sgx/main.c | 27 +++++++++-----------
> arch/x86/kernel/cpu/sgx/sgx.h | 3 ++-
> 5 files changed, 36 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..4e5948362060 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
> }
>
> /*
> - * When called from ksgxd, returns the mem_cgroup of a struct mm stored
> - * in the enclave's mm_list. When not called from ksgxd, just returns
> - * the mem_cgroup of the current task.
> + * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list.
> */
> static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
> {
> @@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
> struct sgx_encl_mm *encl_mm;
> int idx;
>
> - /*
> - * If called from normal task context, return the mem_cgroup
> - * of the current task's mm. The remainder of the handling is for
> - * ksgxd.
> - */
> - if (!current_is_ksgxd())
> - return get_mem_cgroup_from_mm(current->mm);
> -
> /*
> * Search the enclave's mm_list to find an mm associated with
> * this enclave to charge the allocation to.
> @@ -1047,27 +1037,33 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
> * @encl: an enclave pointer
> * @page_index: enclave page index
> * @backing: data for accessing backing storage for the page
> + * @indirect: in ksgxd or EPC cgroup work queue context
> + *
> + * Create a backing page for loading data back into an EPC page with ELDU. This
> + * function takes a reference on a new backing page which must be dropped with a
> + * corresponding call to sgx_encl_put_backing().
> *
> - * When called from ksgxd, sets the active memcg from one of the
> - * mms in the enclave's mm_list prior to any backing page allocation,
> - * in order to ensure that shmem page allocations are charged to the
> - * enclave. Create a backing page for loading data back into an EPC page with
> - * ELDU. This function takes a reference on a new backing page which
> - * must be dropped with a corresponding call to sgx_encl_put_backing().
> + * When @indirect is true, sets the active memcg from one of the mms in the
> + * enclave's mm_list prior to any backing page allocation, in order to ensure
> + * that shmem page allocations are charged to the enclave.
> *
> * Return:
> * 0 on success,
> * -errno otherwise.
> */
> int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
> - struct sgx_backing *backing)
> + struct sgx_backing *backing, bool indirect)
Boolean parameters should be avoided when possible because they are
confusing at the call sites.
> {
> - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
> - struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
> + struct mem_cgroup *encl_memcg;
> + struct mem_cgroup *memcg;
> int ret;
>
> - ret = __sgx_encl_get_backing(encl, page_index, backing);
> + if (!indirect)
> + return __sgx_encl_get_backing(encl, page_index, backing);
If a call is either at the head or tail of the code block, then the
obviously better option is to make __sgx_encl_get_backing() a
non-static sgx_encl_get_backing() and call it at those call sites
that would otherwise pass "false".
I.e., you need a new patch where this preparation is done.
>
> + encl_memcg = sgx_encl_get_mem_cgroup(encl);
> + memcg = set_active_memcg(encl_memcg);
> + ret = __sgx_encl_get_backing(encl, page_index, backing);
> set_active_memcg(memcg);
> mem_cgroup_put(encl_memcg);
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..549cd2e8d98b 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
> int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
> unsigned long end, unsigned long vm_flags);
>
> -bool current_is_ksgxd(void);
> void sgx_encl_release(struct kref *ref);
> int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
> const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
> int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
> - struct sgx_backing *backing);
> + struct sgx_backing *backing, bool indirect);
> void sgx_encl_put_backing(struct sgx_backing *backing);
> int sgx_encl_test_and_clear_young(struct mm_struct *mm,
> struct sgx_encl_page *page);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index 16b6d9f909eb..d399fda2b55e 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -93,9 +93,10 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> /**
> * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
> * @root: Root of the tree to start walking from.
> + * @indirect: In ksgxd or EPC cgroup work queue context.
> * Return: Number of pages reclaimed.
> */
> -unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
> {
> /*
> * Attempting to reclaim only a few pages will often fail and is
> @@ -119,7 +120,7 @@ unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> rcu_read_unlock();
>
> epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> - cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> + cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan, indirect);
>
> rcu_read_lock();
> css_put(pos);
> @@ -176,7 +177,7 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> break;
>
> /* Keep reclaiming until above condition is met. */
> - sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
> + sgx_epc_cgroup_reclaim_pages(epc_cg->cg, true);
> }
> }
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 4f5824c4751d..51904f191b97 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -254,7 +254,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
> }
>
> static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> - struct sgx_backing *backing)
> + struct sgx_backing *backing, bool indirect)
> {
> struct sgx_encl_page *encl_page = epc_page->owner;
> struct sgx_encl *encl = encl_page->encl;
> @@ -270,7 +270,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>
> if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
> ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
> - &secs_backing);
> + &secs_backing, indirect);
> if (ret)
> goto out;
>
> @@ -304,9 +304,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> * @lru: The LRU from which pages are reclaimed.
> * @nr_to_scan: Pointer to the target number of pages to scan, must be less than
> * SGX_NR_TO_SCAN.
> + * @indirect: In ksgxd or EPC cgroup work queue contexts.
> * Return: Number of pages reclaimed.
> */
> -unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan,
> + bool indirect)
> {
> struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -348,7 +350,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
> page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
>
> mutex_lock(&encl_page->encl->lock);
> - ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
> + ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], indirect);
> if (ret) {
> mutex_unlock(&encl_page->encl->lock);
> goto skip;
> @@ -381,7 +383,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
> continue;
>
> encl_page = epc_page->owner;
> - sgx_reclaimer_write(epc_page, &backing[i]);
> + sgx_reclaimer_write(epc_page, &backing[i], indirect);
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
> epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> @@ -399,11 +401,11 @@ static bool sgx_should_reclaim(unsigned long watermark)
> !list_empty(&sgx_global_lru.reclaimable);
> }
>
> -static void sgx_reclaim_pages_global(void)
> +static void sgx_reclaim_pages_global(bool indirect)
> {
> unsigned int nr_to_scan = SGX_NR_TO_SCAN;
>
> - sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
> + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
> }
>
> /*
> @@ -414,7 +416,7 @@ static void sgx_reclaim_pages_global(void)
> void sgx_reclaim_direct(void)
> {
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(false);
> }
>
> static int ksgxd(void *p)
> @@ -437,7 +439,7 @@ static int ksgxd(void *p)
> sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>
> if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(true);
>
> cond_resched();
> }
> @@ -460,11 +462,6 @@ static bool __init sgx_page_reclaimer_init(void)
> return true;
> }
>
> -bool current_is_ksgxd(void)
> -{
> - return current == ksgxd_tsk;
> -}
> -
> static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
> {
> struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> @@ -623,7 +620,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> * Need to do a global reclamation if cgroup was not full but free
> * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> */
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(false);
> cond_resched();
> }
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 2593c013d091..cfe906054d85 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -110,7 +110,8 @@ void sgx_reclaim_direct(void);
> void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
> int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
> -unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan);
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan,
> + bool indirect);
>
> void sgx_ipi_cb(void *info);
>
Br, Jarkko
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> reclaim pages used in the same cgroup to make room for new allocations.
> This is analogous to the behavior that the global reclaimer is triggered
> when the global usage is close to total available EPC.
>
> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> whether synchronous reclaim is allowed or not. And trigger the
> synchronous/asynchronous reclamation flow accordingly.
>
> Note at this point, all reclaimable EPC pages are still tracked in the
> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> is activated yet.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
> arch/x86/kernel/cpu/sgx/main.c | 2 +-
> 3 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index d399fda2b55e..abf74fdb12b4 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -184,13 +184,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> /**
> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> * @epc_cg: The EPC cgroup to be charged for the page.
> + * @reclaim: Whether or not synchronous reclaim is allowed
> * Return:
> * * %0 - If successfully charged.
> * * -errno - for failures.
> */
> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
> {
> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> + for (;;) {
> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> + PAGE_SIZE))
> + break;
> +
> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> + return -ENOMEM;
> +
> + if (signal_pending(current))
> + return -ERESTARTSYS;
> +
> + if (!reclaim) {
> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> + return -EBUSY;
> + }
> +
> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> + /* All pages were too young to reclaim, try again a little later */
> + schedule();
This will be a total pain to backtrack after a while when something
needs to be changed, so there definitely should be inline comments
addressing each branch condition.
I'd rethink this as:
1. Create static __sgx_epc_cgroup_try_charge() for addressing single
iteration with the new "reclaim" parameter.
2. Add a new sgx_epc_group_try_charge_reclaim() function.
There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
sgx_epc_cgroup_try_charge_reclaim() because both have almost the
same loop calling internal __sgx_epc_cgroup_try_charge() with
different parameters. That is totally acceptable.
Please also add my suggested-by.
BR, Jarkko
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> To determine if any page is available for reclamation at the global level,
> checking only the emptiness of the global LRU is not adequate when pages
> are tracked in multiple LRUs, one per cgroup. For this purpose, create a
> new helper, sgx_can_reclaim(), which currently only checks the global LRU
> and later will check the emptiness of the LRUs of all cgroups when
> per-cgroup tracking is turned on. Replace all checks of the global LRU,
> list_empty(&sgx_global_lru.reclaimable), with calls to
> sgx_can_reclaim().
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> v7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 2279ae967707..6b0c26cac621 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
> return &sgx_global_lru;
> }
>
/*
* Description
*/
> +static inline bool sgx_can_reclaim(void)
> +{
> + return !list_empty(&sgx_global_lru.reclaimable);
> +}
> +
> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>
> /* Nodes with one or more EPC sections. */
> @@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> - !list_empty(&sgx_global_lru.reclaimable);
> + sgx_can_reclaim();
> }
>
> static void sgx_reclaim_pages_global(bool indirect)
> @@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> - if (list_empty(&sgx_global_lru.reclaimable)) {
> + if (!sgx_can_reclaim()) {
> page = ERR_PTR(-ENOMEM);
> break;
> }
BR, Jarkko
On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> When cgroup is enabled, all reclaimable pages will be tracked in cgroup
> LRUs. The global reclaimer needs to start reclamation from the root
> cgroup. Expose the top level cgroup reclamation function so the global
> reclaimer can reuse it.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V8:
> - Remove unneeded breaks in function declarations. (Jarkko)
>
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++++++
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index abf74fdb12b4..6e31f8727b8a 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> * @indirect: In ksgxd or EPC cgroup work queue context.
> * Return: Number of pages reclaimed.
> */
> -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
Now that is an ugly function name...
IMHO, we would not lose a lot of information if these were shortened
to sgx_cgroup_reclaim_pages() and so forth.
No risk of ambiguity, and much more digestible code to read.
BR, Jarkko
Hi Jarkko
On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> When the EPC usage of a cgroup is near its limit, the cgroup needs to
>> reclaim pages used in the same cgroup to make room for new allocations.
>> This is analogous to the behavior that the global reclaimer is triggered
>> when the global usage is close to total available EPC.
>>
>> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
>> whether synchronous reclaim is allowed or not. And trigger the
>> synchronous/asynchronous reclamation flow accordingly.
>>
>> Note at this point, all reclaimable EPC pages are still tracked in the
>> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
>> is activated yet.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V7:
>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> ---
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
>> arch/x86/kernel/cpu/sgx/main.c | 2 +-
>> 3 files changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> index d399fda2b55e..abf74fdb12b4 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -184,13 +184,35 @@ static void
>> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>> /**
>> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC
>> page
>> * @epc_cg: The EPC cgroup to be charged for the page.
>> + * @reclaim: Whether or not synchronous reclaim is allowed
>> * Return:
>> * * %0 - If successfully charged.
>> * * -errno - for failures.
>> */
>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>> reclaim)
>> {
>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> + for (;;) {
>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> + PAGE_SIZE))
>> + break;
>> +
>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> + return -ENOMEM;
>> +
>> + if (signal_pending(current))
>> + return -ERESTARTSYS;
>> +
>> + if (!reclaim) {
>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> + return -EBUSY;
>> + }
>> +
>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> + /* All pages were too young to reclaim, try again a little later */
>> + schedule();
>
> This will be total pain to backtrack after a while when something
> needs to be changed so there definitely should be inline comments
> addressing each branch condition.
>
> I'd rethink this as:
>
> 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> iteration with the new "reclaim" parameter.
> 2. Add a new sgx_epc_group_try_charge_reclaim() function.
>
> There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> same loop calling internal __sgx_epc_cgroup_try_charge() with
> different parameters. That is totally acceptable.
>
> Please also add my suggested-by.
>
> BR, Jarkko
>
> BR, Jarkko
>
For #2:
The only caller of this function, sgx_alloc_epc_page(), has the same
boolean, which is passed into this function.
If we separate it into sgx_epc_cgroup_try_charge() and
sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
if/else branches. So the separation doesn't seem to help here?
For #1:
If we don't do #2, it seems overkill at the moment for such a short
function.
How about we add inline comments for each branch for now, and if later
there are more branches and the function becomes too long, we add
__sgx_epc_cgroup_try_charge() as you suggested?
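For illustration, here is a minimal userspace mock of that retry loop with one comment per branch, in the spirit Jarkko asked for. All helpers are stubs standing in for the kernel APIs (misc_cg_try_charge() etc.), not the real implementation:

```c
#include <stdbool.h>

/* Stubs standing in for the kernel helpers; purely illustrative. */
static int charge_attempts;
static bool misc_try_charge(void)    { return ++charge_attempts >= 3; }
static bool lru_empty(void)          { return false; }
static bool has_signal_pending(void) { return false; }
static unsigned int reclaim_some(void) { return 1; }

static int try_charge(bool reclaim)
{
	for (;;) {
		/* Charge fit under the limit: done. */
		if (misc_try_charge())
			return 0;
		/* Over the limit and nothing reclaimable: hard failure (-ENOMEM). */
		if (lru_empty())
			return -1;
		/* Let a pending signal abort the retry loop (-ERESTARTSYS). */
		if (has_signal_pending())
			return -2;
		/* Synchronous reclaim not allowed: defer to a work queue (-EBUSY). */
		if (!reclaim)
			return -3;
		/* Reclaim within this cgroup, then retry the charge. */
		reclaim_some();
	}
}
```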
Thanks
Haitao
On Mon, 12 Feb 2024 13:46:06 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
>> Enclave Page Cache(EPC) memory can be swapped out to regular system
>
> "Enclave Page Cache (EPC)"
> ~
>
Will fix.
[...]
>> int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long
>> page_index,
>> - struct sgx_backing *backing)
>> + struct sgx_backing *backing, bool indirect)
>
> Boolean parameters should be avoided when possible because they confuse
> in the call sites.
>
>> {
>> - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
>> - struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
>> + struct mem_cgroup *encl_memcg;
>> + struct mem_cgroup *memcg;
>> int ret;
>>
>> - ret = __sgx_encl_get_backing(encl, page_index, backing);
>> + if (!indirect)
>> + return __sgx_encl_get_backing(encl, page_index, backing);
>
> If a call is either at the head or tail of the code block, then the
> obviously better option is to make __sgx_encl_get_backing()
> non-static as sgx_encl_get_backing() and call it in those
> call sites that would call this with "false".
>
> I.e. you need a new patch where this preparation is done.
>
This would actually require more intrusive changes to the call stack for
global and cgroup reclaim:
{sgx_epc_cgroup_reclaim_pages(),sgx_reclaim_pages_global()}->sgx_reclaim_pages()[->sgx_reclaimer_write()]->sgx_encl_alloc_backing()
We would need to make two versions of each of those functions.
Refactoring sgx_reclaim_pages() into two versions would be especially
complicated.
I double-checked the history of current_is_ksgxd() [1]; it seems the
intent was to replace the "current->mm == NULL" check so it is more obvious
that we are running in ksgxd.
@Dave, @Kristen,
Can we restore the original criteria like below so it works for the cgroup
work-queues?
bool current_is_ksgxd(void)
{
	return current == ksgxd_tsk;
}
--->
bool current_is_kthread(void)
{
	return current->mm == NULL;
}
I'm not experienced in this area and not sure how reliable it is to use
current->mm == NULL for kthreads and work queues. But it would eliminate
the need for the boolean parameter.
Would appreciate the input.
Haitao
[1]https://lore.kernel.org/linux-sgx/[email protected]/
On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> Hi Jarkko
>
> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <[email protected]>
> >>
> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> reclaim pages used in the same cgroup to make room for new allocations.
> >> This is analogous to the behavior that the global reclaimer is triggered
> >> when the global usage is close to total available EPC.
> >>
> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> whether synchronous reclaim is allowed or not. And trigger the
> >> synchronous/asynchronous reclamation flow accordingly.
> >>
> >> Note at this point, all reclaimable EPC pages are still tracked in the
> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> >> is activated yet.
> >>
> >> Co-developed-by: Sean Christopherson <[email protected]>
> >> Signed-off-by: Sean Christopherson <[email protected]>
> >> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> >> Co-developed-by: Haitao Huang <[email protected]>
> >> Signed-off-by: Haitao Huang <[email protected]>
> >> ---
> >> V7:
> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> ---
> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
> >> arch/x86/kernel/cpu/sgx/main.c | 2 +-
> >> 3 files changed, 27 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> index d399fda2b55e..abf74fdb12b4 100644
> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> @@ -184,13 +184,35 @@ static void
> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >> /**
> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC
> >> page
> >> * @epc_cg: The EPC cgroup to be charged for the page.
> >> + * @reclaim: Whether or not synchronous reclaim is allowed
> >> * Return:
> >> * * %0 - If successfully charged.
> >> * * -errno - for failures.
> >> */
> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
> >> reclaim)
> >> {
> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> >> + for (;;) {
> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> + PAGE_SIZE))
> >> + break;
> >> +
> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> + return -ENOMEM;
> >> +
> >> + if (signal_pending(current))
> >> + return -ERESTARTSYS;
> >> +
> >> + if (!reclaim) {
> >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> >> + return -EBUSY;
> >> + }
> >> +
> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> + /* All pages were too young to reclaim, try again a little later */
> >> + schedule();
> >
> > This will be total pain to backtrack after a while when something
> > needs to be changed so there definitely should be inline comments
> > addressing each branch condition.
> >
> > I'd rethink this as:
> >
> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> > iteration with the new "reclaim" parameter.
> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
> >
> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> > same loop calling internal __sgx_epc_cgroup_try_charge() with
> > different parameters. That is totally acceptable.
> >
> > Please also add my suggested-by.
> >
> > BR, Jarkko
> >
> > BR, Jarkko
> >
> For #2:
> The only caller of this function, sgx_alloc_epc_page(), has the same
> boolean, which is passed into this function.
I know. This would be a good opportunity to fix that up. Large patch
sets should try to make the best possible space for their feature and
thus also clean up the code base overall.
> If we separate it into sgx_epc_cgroup_try_charge() and
> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
> if/else branches. So the separation doesn't seem to help here?
Of course it does. It makes the code in that location self-documenting
and easier to remember what it does.
BR, Jarkko
On 2/5/24 13:06, Haitao Huang wrote:
> static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
> {
> @@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
> struct sgx_encl_mm *encl_mm;
> int idx;
>
> - /*
> - * If called from normal task context, return the mem_cgroup
> - * of the current task's mm. The remainder of the handling is for
> - * ksgxd.
> - */
> - if (!current_is_ksgxd())
> - return get_mem_cgroup_from_mm(current->mm);
Why is this being removed?
Searching the enclave mm list is a last resort. It's expensive and
imprecise.
get_mem_cgroup_from_mm(current->mm), on the other hand is fast and precise.
Hi Dave,
On Thu, 15 Feb 2024 17:43:18 -0600, Dave Hansen <[email protected]>
wrote:
> On 2/5/24 13:06, Haitao Huang wrote:
>> static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl
>> *encl)
>> {
>> @@ -1003,14 +1001,6 @@ static struct mem_cgroup
>> *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
>> struct sgx_encl_mm *encl_mm;
>> int idx;
>>
>> - /*
>> - * If called from normal task context, return the mem_cgroup
>> - * of the current task's mm. The remainder of the handling is for
>> - * ksgxd.
>> - */
>> - if (!current_is_ksgxd())
>> - return get_mem_cgroup_from_mm(current->mm);
>
> Why is this being removed?
>
> Searching the enclave mm list is a last resort. It's expensive and
> imprecise.
>
> get_mem_cgroup_from_mm(current->mm), on the other hand is fast and
> precise.
>
I introduced a boolean flag to indicate the caller is in a kthread (ksgxd
or the cgroup workqueue), so sgx_encl_alloc_backing() only calls this
function if that flag is true, meaning a search through the mm_list is
needed.
But now I think a more straightforward way is to just replace
current_is_ksgxd() with (current->flags & PF_KTHREAD).
Thanks
Haitao
On 2/5/24 13:06, Haitao Huang wrote:
> @@ -414,7 +416,7 @@ static void sgx_reclaim_pages_global(void)
> void sgx_reclaim_direct(void)
> {
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(false);
> }
>
> static int ksgxd(void *p)
> @@ -437,7 +439,7 @@ static int ksgxd(void *p)
> sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>
> if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> - sgx_reclaim_pages_global();
> + sgx_reclaim_pages_global(true);
>
> cond_resched();
> }
First, I'm never a fan of random true/false or 0/1 arguments to
functions like this. You end up having to go look at the called
function to make any sense of it. You can either do an enum, or some
construct like this:
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES)) {
	bool indirect = true;

	sgx_reclaim_pages_global(indirect);
}
Yeah, it takes a few more lines but it saves you having to comment the
thing.
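As a sketch of the enum alternative (the enum and function names here are illustrative, not from the patch set):

```c
/* Illustrative names only; the patch set may spell these differently. */
enum sgx_reclaim_context {
	SGX_RECLAIM_DIRECT,   /* caller has an mm to charge */
	SGX_RECLAIM_INDIRECT, /* ksgxd or a cgroup work queue */
};

static int last_ctx = -1;

static void reclaim_pages_global(enum sgx_reclaim_context ctx)
{
	/* A stub: real code would pick the charging strategy from ctx. */
	last_ctx = ctx;
}
```

The call site then documents itself: reclaim_pages_global(SGX_RECLAIM_INDIRECT);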
Does this 'indirect' change any behavior other than whether it does a
search for an mm to find a place to charge the backing storage? Instead
of passing a flag around, why not just pass the mm?
This refactoring out of 'indirect' or passing the mm around really wants
to be in its own patch anyway.
On Fri, 16 Feb 2024 09:15:59 -0600, Dave Hansen <[email protected]>
wrote:
> On 2/5/24 13:06, Haitao Huang wrote:
>> @@ -414,7 +416,7 @@ static void sgx_reclaim_pages_global(void)
>> void sgx_reclaim_direct(void)
>> {
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> - sgx_reclaim_pages_global();
>> + sgx_reclaim_pages_global(false);
>> }
>>
>> static int ksgxd(void *p)
>> @@ -437,7 +439,7 @@ static int ksgxd(void *p)
>> sgx_should_reclaim(SGX_NR_HIGH_PAGES));
>>
>> if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
>> - sgx_reclaim_pages_global();
>> + sgx_reclaim_pages_global(true);
>>
>> cond_resched();
>> }
>
> First, I'm never a fan of random true/false or 0/1 arguments to
> functions like this. You end up having to go look at the called
> function to make any sense of it. You can either do an enum, or some
> construct like this:
>
> if (sgx_should_reclaim(SGX_NR_HIGH_PAGES)) {
> 	bool indirect = true;
>
> 	sgx_reclaim_pages_global(indirect);
> }
>
> Yeah, it takes a few more lines but it saves you having to comment the
> thing.
>
> Does this 'indirect' change any behavior other than whether it does a
> search for an mm to find a place to charge the backing storage?
No.
> Instead of passing a flag around, why not just pass the mm?
>
There is no need to pass in mm. We could just check whether current->mm ==
NULL to decide if the search of the enclave mm list is needed.
But you had a concern [1] that the purpose was not clear, hence you
suggested current_is_ksgxd().
Would it be OK if we replaced current_is_ksgxd() with (current->flags &
PF_KTHREAD)? That would express the real intent: checking whether the
calling context is a user context or not.
> This refactoring out of 'indirect' or passing the mm around really wants
> to be in its own patch anyway.
>
Looks like I could do:
1) Refactoring of the 'indirect' value/enum suggested above. This seems the
most straightforward, without depending on any assumptions about other
kernel code.
2) Replace current_is_ksgxd() with current->mm == NULL. This assumes
kthreads have no mm.
3) Replace current_is_ksgxd() with current->flags & PF_KTHREAD. This is a
direct use of the PF_KTHREAD flag, so it should be better than #2?
Any preference or further thoughts?
Thanks
Haitao
[1]
https://lore.kernel.org/linux-sgx/[email protected]/
On 2/16/24 13:38, Haitao Huang wrote:
> On Fri, 16 Feb 2024 09:15:59 -0600, Dave Hansen <[email protected]>
> wrote:
..
>> Does this 'indirect' change any behavior other than whether it does a
>> search for an mm to find a place to charge the backing storage?
>
> No.
>
>> Instead of passing a flag around, why not just pass the mm?
>>
> There is no need to pass in mm. We could just check if current->mm ==
> NULL for the need of doing the search in the enclave mm list.
>
> But you had a concern [1] that the purpose was not clear hence suggested
> current_is_ksgxd().
Right, because there was only one possible way that mm could be NULL but
it wasn't obvious from the code what that way was.
> Would it be OK if we replace current_is_ksgxd() with (current->flags &
> PF_KTHREAD)? That would express the real intent of checking if calling
> context is not in a user context.
No, I think that focuses on the symptom and not on the fundamental problem.
The fundamental problem is that you need an mm in order to charge your
allocations to the right group. Indirect reclaim means you are not in a
context which is connected to the mm that should be charged while direct
reclaim is.
>> This refactoring out of 'indirect' or passing the mm around really wants
>> to be in its own patch anyway.
>>
> Looks like I could do:
> 1) refactoring of 'indirect' value/enum suggested above. This seems the
> most straightforward without depending on any assumptions of other
> kernel code.
> 2) replace current_is_ksgxd() with current->mm == NULL. This assumes
> kthreads have no mm.
> 3) replace current_is_ksgxd() with current->flags & PF_KTHREAD. This is
> direct use of the flag PF_KTHREAD, so it should be better than #2?
>
> Any preference or further thoughts?
Pass around a:
struct mm_struct *charge_mm
Then, at the bottom do:
	/*
	 * Backing RAM allocations need to be charged to some mm and
	 * associated cgroup. If this context does not have an mm to
	 * charge, search the enclave's mm_list to find some mm
	 * associated with this enclave.
	 */
	if (!charge_mm)
		... do slow mm lookup
	else
		return mm_to_cgroup_whatever(charge_mm);
Then just comment the call sites where the initial charge_mm comes in:
/* Indirect SGX reclaim, no mm to charge, so NULL: */
foo(..., NULL);
/* Direct SGX reclaim, charge current mm for allocations: */
foo(..., current->mm);
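In userspace terms, that fallback shape looks roughly like this; the struct and both lookup helpers are stand-ins, not kernel APIs:

```c
#include <stddef.h>

/* Stand-ins for kernel types and helpers; illustrative only. */
struct mm_struct { int cgroup_id; };

static int slow_lookups; /* counts fallbacks to the enclave mm_list search */

static int slow_mm_list_lookup(void)
{
	slow_lookups++;  /* expensive, imprecise last resort */
	return 42;
}

static int mm_to_cgroup(struct mm_struct *mm)
{
	return mm->cgroup_id;  /* fast and precise */
}

/*
 * Backing RAM allocations need to be charged to some mm and its
 * cgroup. If this context has no mm (indirect reclaim), fall back
 * to searching the enclave's mm_list.
 */
static int backing_cgroup(struct mm_struct *charge_mm)
{
	if (!charge_mm)
		return slow_mm_list_lookup();

	return mm_to_cgroup(charge_mm);
}
```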
On Fri, 16 Feb 2024 15:55:10 -0600, Dave Hansen <[email protected]>
wrote:
> On 2/16/24 13:38, Haitao Huang wrote:
>> On Fri, 16 Feb 2024 09:15:59 -0600, Dave Hansen <[email protected]>
>> wrote:
> ...
>>> Does this 'indirect' change any behavior other than whether it does a
>>> search for an mm to find a place to charge the backing storage?
>>
>> No.
>>
>>> Instead of passing a flag around, why not just pass the mm?
>>>
>> There is no need to pass in mm. We could just check if current->mm ==
>> NULL for the need of doing the search in the enclave mm list.
>>
>> But you had a concern [1] that the purpose was not clear hence suggested
>> current_is_ksgxd().
>
> Right, because there was only one possible way that mm could be NULL but
> it wasn't obvious from the code what that way was.
>
>> Would it be OK if we replace current_is_ksgxd() with (current->flags &
>> PF_KTHREAD)? That would express the real intent of checking if calling
>> context is not in a user context.
>
> No, I think that focuses on the symptom and not on the fundamental
> problem.
>
> The fundamental problem is that you need an mm in order to charge your
> allocations to the right group. Indirect reclaim means you are not in a
> context which is connected to the mm that should be charged while direct
> reclaim is.
>
>>> This refactoring out of 'indirect' or passing the mm around really
>>> wants
>>> to be in its own patch anyway.
>>>
>> Looks like I could do:
>> 1) refactoring of 'indirect' value/enum suggested above. This seems the
>> most straightforward without depending on any assumptions of other
>> kernel code.
>> 2) replace current_is_ksgxd() with current->mm == NULL. This assumes
>> kthreads have no mm.
>> 3) replace current_is_ksgxd() with current->flags & PF_KTHREAD. This is
>> direct use of the flag PF_KTHREAD, so it should be better than #2?
>>
>> Any preference or further thoughts?
>
> Pass around a:
>
> struct mm_struct *charge_mm
>
> Then, at the bottom do:
>
> /*
> * Backing RAM allocations need to be charged to some mm and
> * associated cgroup. If this context does not have an mm to
> * charge, search the enclave's mm_list to find some mm
> * associated with this enclave.
> */
> if (!charge_mm)
> ... do slow mm lookup
> else
> return mm_to_cgroup_whatever(charge_mm);
>
> Then just comment the call sites where the initial charge_mm comes in:
>
>
> /* Indirect SGX reclaim, no mm to charge, so NULL: */
> foo(..., NULL);
>
>
> /* Direct SGX reclaim, charge current mm for allocations: */
> foo(..., current->mm);
>
>
Okay. got it now.
Thank you very much!
Haitao
On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
>> Hi Jarkko
>>
>> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <[email protected]>
>> wrote:
>>
>> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
>> >> From: Kristen Carlson Accardi <[email protected]>
>> >>
>> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
>> >> reclaim pages used in the same cgroup to make room for new
>> allocations.
>> >> This is analogous to the behavior that the global reclaimer is
>> triggered
>> >> when the global usage is close to total available EPC.
>> >>
>> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
>> >> whether synchronous reclaim is allowed or not. And trigger the
>> >> synchronous/asynchronous reclamation flow accordingly.
>> >>
>> >> Note at this point, all reclaimable EPC pages are still tracked in
>> the
>> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup
>> reclamation
>> >> is activated yet.
>> >>
>> >> Co-developed-by: Sean Christopherson
>> <[email protected]>
>> >> Signed-off-by: Sean Christopherson <[email protected]>
>> >> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> >> Co-developed-by: Haitao Huang <[email protected]>
>> >> Signed-off-by: Haitao Huang <[email protected]>
>> >> ---
>> >> V7:
>> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> >> ---
>> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
>> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
>> >> arch/x86/kernel/cpu/sgx/main.c | 2 +-
>> >> 3 files changed, 27 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> index d399fda2b55e..abf74fdb12b4 100644
>> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> >> @@ -184,13 +184,35 @@ static void
>> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>> >> /**
>> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single
>> EPC
>> >> page
>> >> * @epc_cg: The EPC cgroup to be charged for the page.
>> >> + * @reclaim: Whether or not synchronous reclaim is allowed
>> >> * Return:
>> >> * * %0 - If successfully charged.
>> >> * * -errno - for failures.
>> >> */
>> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>> >> reclaim)
>> >> {
>> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> PAGE_SIZE);
>> >> + for (;;) {
>> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> >> + PAGE_SIZE))
>> >> + break;
>> >> +
>> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> >> + return -ENOMEM;
>> >> +
>> >> + if (signal_pending(current))
>> >> + return -ERESTARTSYS;
>> >> +
>> >> + if (!reclaim) {
>> >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> >> + return -EBUSY;
>> >> + }
>> >> +
>> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> >> + /* All pages were too young to reclaim, try again a little later
>> */
>> >> + schedule();
>> >
>> > This will be total pain to backtrack after a while when something
>> > needs to be changed so there definitely should be inline comments
>> > addressing each branch condition.
>> >
>> > I'd rethink this as:
>> >
>> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
>> > iteration with the new "reclaim" parameter.
>> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
>> >
>> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
>> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
>> > same loop calling internal __sgx_epc_cgroup_try_charge() with
>> > different parameters. That is totally acceptable.
>> >
>> > Please also add my suggested-by.
>> >
>> > BR, Jarkko
>> >
>> > BR, Jarkko
>> >
>> For #2:
>> The only caller of this function, sgx_alloc_epc_page(), has the same
>> boolean, which is passed into this function.
>
> I know. This would be good opportunity to fix that up. Large patch
> sets should try to make the space for its feature best possible and
> thus also clean up the code base overally.
>
>> If we separate it into sgx_epc_cgroup_try_charge() and
>> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
>> if/else branches. So separation here seems not help?
>
> Of course it does. It makes the code in that location self-documenting
> and easier to remember what it does.
>
> BR, Jarkko
>
Please let me know if this aligns with your suggestion.
static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
	if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
		return 0;

	if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
		return -ENOMEM;

	if (signal_pending(current))
		return -ERESTARTSYS;

	return -EBUSY;
}

/**
 * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page
 * @epc_cg: The EPC cgroup to be charged for the page.
 *
 * Try to reclaim pages in the background if the group reaches its limit and
 * there are reclaimable pages in the group.
 * Return:
 * * %0 - If successfully charged.
 * * -errno - for failures.
 */
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
	int ret = ___sgx_epc_cgroup_try_charge(epc_cg);

	if (ret == -EBUSY)
		queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);

	return ret;
}

/**
 * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single page
 * @epc_cg: The EPC cgroup to be charged for the page.
 *
 * Try to reclaim pages directly if the group reaches its limit and there are
 * reclaimable pages in the group.
 * Return:
 * * %0 - If successfully charged.
 * * -errno - for failures.
 */
int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg)
{
	int ret;

	for (;;) {
		ret = ___sgx_epc_cgroup_try_charge(epc_cg);
		if (ret != -EBUSY)
			return ret;

		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm))
			/* All pages were too young to reclaim, try again a little later */
			schedule();
	}
}
It is a little more involved to remove the boolean for
sgx_alloc_epc_page() and its callers like sgx_encl_grow(),
sgx_alloc_va_page(). I'll send a separate patch for comments.
Thanks
Haitao
Remove all boolean parameters for 'reclaim' from the function
sgx_alloc_epc_page() and its callers by making two versions of each
function.
Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo.
Signed-off-by: Haitao Huang <[email protected]>
Suggested-by: Jarkko Sakkinen <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------
arch/x86/kernel/cpu/sgx/encl.h | 6 ++-
arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++---
arch/x86/kernel/cpu/sgx/main.c | 68 ++++++++++++++++++++++-----------
arch/x86/kernel/cpu/sgx/sgx.h | 4 +-
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
6 files changed, 115 insertions(+), 44 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..07f369ae855c 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
- epc_page = sgx_alloc_epc_page(encl_page, false);
+ epc_page = sgx_alloc_epc_page(encl_page);
if (IS_ERR(epc_page))
return epc_page;
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
goto err_out_unlock;
}
- epc_page = sgx_alloc_epc_page(encl_page, false);
+ epc_page = sgx_alloc_epc_page(encl_page);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
goto err_out_unlock;
}
- va_page = sgx_encl_grow(encl, false);
+ va_page = sgx_encl_grow(encl);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1230,23 +1230,23 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
} while (unlikely(encl->mm_list_version != mm_list_version));
}
-/**
- * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- * mutex should not be held if this is set.
- *
- * Allocate a free EPC page and convert it to a Version Array (VA) page.
+typedef struct sgx_epc_page *(*sgx_alloc_epc_page_fn_t)(void *owner);
+
+/*
+ * Allocate a Version Array (VA) page
+ * @alloc_fn: the EPC page allocation function.
*
* Return:
* a VA page,
* -errno otherwise
*/
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+static struct sgx_epc_page *__sgx_alloc_va_page(sgx_alloc_epc_page_fn_t alloc_fn)
{
struct sgx_epc_page *epc_page;
int ret;
- epc_page = sgx_alloc_epc_page(NULL, reclaim);
+ epc_page = alloc_fn(NULL);
+
if (IS_ERR(epc_page))
return ERR_CAST(epc_page);
@@ -1260,6 +1260,40 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
return epc_page;
}
+/**
+ * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ *
+ * Allocate a free EPC page and convert it to a VA page.
+ *
+ * Do not reclaim EPC pages if none available.
+ *
+ * Return:
+ * a VA page,
+ * -errno otherwise
+ */
+struct sgx_epc_page *sgx_alloc_va_page(void)
+{
+ return __sgx_alloc_va_page(sgx_alloc_epc_page);
+}
+
+/**
+ * sgx_alloc_va_page_reclaim() - Allocate a Version Array (VA) page
+ *
+ * Allocate a free EPC page and convert it to a VA page.
+ *
+ * Reclaim EPC pages directly if none available. Enclave mutex should not be
+ * held.
+ *
+ * Return:
+ * a VA page,
+ * -errno otherwise
+ */
+
+struct sgx_epc_page *sgx_alloc_va_page_reclaim(void)
+{
+ return __sgx_alloc_va_page(sgx_alloc_epc_page_reclaim);
+}
+
/**
* sgx_alloc_va_slot - allocate a VA slot
* @va_page: a &struct sgx_va_page instance
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..3248ff72e573 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,16 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
unsigned long offset,
u64 secinfo_flags);
void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(void);
+struct sgx_epc_page *sgx_alloc_va_page_reclaim(void);
unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
bool sgx_va_page_full(struct sgx_va_page *va_page);
void sgx_encl_free_epc_page(struct sgx_epc_page *page);
struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl);
+struct sgx_va_page *sgx_encl_grow_reclaim(struct sgx_encl *encl);
void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..70f904f753aa 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,8 @@
#include "encl.h"
#include "encls.h"
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+typedef struct sgx_epc_page *(*sgx_alloc_va_page_fn_t)(void);
+static struct sgx_va_page *__sgx_encl_grow(struct sgx_encl *encl, sgx_alloc_va_page_fn_t alloc_fn)
{
struct sgx_va_page *va_page = NULL;
void *err;
@@ -30,7 +31,7 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
if (!va_page)
return ERR_PTR(-ENOMEM);
- va_page->epc_page = sgx_alloc_va_page(reclaim);
+ va_page->epc_page = alloc_fn();
if (IS_ERR(va_page->epc_page)) {
err = ERR_CAST(va_page->epc_page);
kfree(va_page);
@@ -43,6 +44,16 @@ struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
return va_page;
}
+struct sgx_va_page *sgx_encl_grow_reclaim(struct sgx_encl *encl)
+{
+ return __sgx_encl_grow(encl, sgx_alloc_va_page_reclaim);
+}
+
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl)
+{
+ return __sgx_encl_grow(encl, sgx_alloc_va_page);
+}
+
void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
{
encl->page_cnt--;
@@ -64,7 +75,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
struct file *backing;
long ret;
- va_page = sgx_encl_grow(encl, true);
+ va_page = sgx_encl_grow_reclaim(encl);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +94,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
encl->backing = backing;
- secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+ secs_epc = sgx_alloc_epc_page_reclaim(&encl->secs);
if (IS_ERR(secs_epc)) {
ret = PTR_ERR(secs_epc);
goto err_out_backing;
@@ -269,13 +280,13 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
if (IS_ERR(encl_page))
return PTR_ERR(encl_page);
- epc_page = sgx_alloc_epc_page(encl_page, true);
+ epc_page = sgx_alloc_epc_page_reclaim(encl_page);
if (IS_ERR(epc_page)) {
kfree(encl_page);
return PTR_ERR(epc_page);
}
- va_page = sgx_encl_grow(encl, true);
+ va_page = sgx_encl_grow_reclaim(encl);
if (IS_ERR(va_page)) {
ret = PTR_ERR(va_page);
goto err_out_free;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..ed9b711049c2 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -463,14 +463,16 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
/**
* __sgx_alloc_epc_page() - Allocate an EPC page
*
- * Iterate through NUMA nodes and reserve ia free EPC page to the caller. Start
+ * Iterate through NUMA nodes and reserve a free EPC page to the caller. Start
* from the NUMA node, where the caller is executing.
*
+ * When a page is no longer needed it must be released with sgx_free_epc_page().
+ *
* Return:
* - an EPC page: A borrowed EPC pages were available.
- * - NULL: Out of EPC pages.
+ * - -errno: Out of EPC pages.
*/
-struct sgx_epc_page *__sgx_alloc_epc_page(void)
+static struct sgx_epc_page *__sgx_alloc_epc_page(void)
{
struct sgx_epc_page *page;
int nid_of_current = numa_node_id();
@@ -493,7 +495,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
return page;
}
- return ERR_PTR(-ENOMEM);
+ if (list_empty(&sgx_active_page_list))
+ return ERR_PTR(-ENOMEM);
+
+ return ERR_PTR(-EBUSY);
+}
+
+/*
+ * Post-processing after allocating an EPC page.
+ *
+ * Set its owner, wake up ksgxd when the number of pages goes below the watermark.
+ */
+static void sgx_alloc_epc_page_post(struct sgx_epc_page *page, void* owner)
+{
+ if (!IS_ERR(page))
+ page->owner = owner;
+
+ if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+ wake_up(&ksgxd_waitq);
}
/**
@@ -542,38 +561,44 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
/**
* sgx_alloc_epc_page() - Allocate an EPC page
* @owner: the owner of the EPC page
- * @reclaim: reclaim pages if necessary
*
- * Iterate through EPC sections and borrow a free EPC page to the caller. When a
- * page is no longer needed it must be released with sgx_free_epc_page(). If
- * @reclaim is set to true, directly reclaim pages when we are out of pages. No
- * mm's can be locked when @reclaim is set to true.
+ * When a page is no longer needed it must be released with sgx_free_epc_page().
*
- * Finally, wake up ksgxd when the number of pages goes below the watermark
- * before returning back to the caller.
+ * Return:
+ * an EPC page,
+ * -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner)
+{
+ struct sgx_epc_page *page;
+
+ page = __sgx_alloc_epc_page();
+
+ sgx_alloc_epc_page_post(page, owner);
+
+ return page;
+}
+/**
+ * sgx_alloc_epc_page_reclaim() - Allocate an EPC page, reclaim pages if necessary
+ * @owner: the owner of the EPC page
+ *
+ * Reclaim pages when we are out of pages. No mm's can be locked.
*
* Return:
* an EPC page,
* -errno on error
*/
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+struct sgx_epc_page *sgx_alloc_epc_page_reclaim(void *owner)
{
struct sgx_epc_page *page;
for ( ; ; ) {
page = __sgx_alloc_epc_page();
if (!IS_ERR(page)) {
- page->owner = owner;
break;
}
-
- if (list_empty(&sgx_active_page_list))
- return ERR_PTR(-ENOMEM);
-
- if (!reclaim) {
- page = ERR_PTR(-EBUSY);
+ if (PTR_ERR(page) != -EBUSY)
break;
- }
if (signal_pending(current)) {
page = ERR_PTR(-ERESTARTSYS);
@@ -584,8 +609,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
cond_resched();
}
- if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- wake_up(&ksgxd_waitq);
+ sgx_alloc_epc_page_post(page, owner);
return page;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..d6246a79f0b6 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -83,13 +83,13 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
-struct sgx_epc_page *__sgx_alloc_epc_page(void);
void sgx_free_epc_page(struct sgx_epc_page *page);
void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner);
+struct sgx_epc_page *sgx_alloc_epc_page_reclaim(void *owner);
void sgx_ipi_cb(void *info);
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 7aaa3652e31d..6a81d8af84ab 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -46,7 +46,7 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
if (epc_page)
return 0;
- epc_page = sgx_alloc_epc_page(vepc, false);
+ epc_page = sgx_alloc_epc_page(vepc);
if (IS_ERR(epc_page))
return PTR_ERR(epc_page);
--
2.25.1
On 2/19/24 07:39, Haitao Huang wrote:
> Remove all boolean parameters for 'reclaim' from the function
> sgx_alloc_epc_page() and its callers by making two versions of each
> function.
>
> Also opportunistically remove non-static declaration of
> __sgx_alloc_epc_page() and a typo
>
> Signed-off-by: Haitao Huang <[email protected]>
> Suggested-by: Jarkko Sakkinen <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------
> arch/x86/kernel/cpu/sgx/encl.h | 6 ++-
> arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++---
> arch/x86/kernel/cpu/sgx/main.c | 68 ++++++++++++++++++++++-----------
> arch/x86/kernel/cpu/sgx/sgx.h | 4 +-
> arch/x86/kernel/cpu/sgx/virt.c | 2 +-
> 6 files changed, 115 insertions(+), 44 deletions(-)
Jarkko, did this turn out how you expected?
I think passing around a function pointer to *only* communicate 1 bit of
information is a _bit_ overkill here.
Simply replacing the bool with:
enum sgx_reclaim {
SGX_NO_RECLAIM,
SGX_DO_RECLAIM
};
would do the same thing. Right?
Are you sure you want a function pointer for this?
On Mon Feb 19, 2024 at 3:39 PM UTC, Haitao Huang wrote:
> Remove all boolean parameters for 'reclaim' from the function
> sgx_alloc_epc_page() and its callers by making two versions of each
> function.
>
> Also opportunistically remove non-static declaration of
> __sgx_alloc_epc_page() and a typo
>
> Signed-off-by: Haitao Huang <[email protected]>
> Suggested-by: Jarkko Sakkinen <[email protected]>
I think this is for the better.
My viewpoint on kernel patches overall is that:
1. A feature should leave the subsystem in a cleaner state as
far as the existing framework of doing things goes.
2. A bugfix can sometimes do the opposite if a corner case
requires some weird dance to perform.
BR, Jarkko
On Mon Feb 19, 2024 at 3:12 PM UTC, Haitao Huang wrote:
> On Tue, 13 Feb 2024 19:52:25 -0600, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Tue Feb 13, 2024 at 1:15 AM EET, Haitao Huang wrote:
> >> Hi Jarkko
> >>
> >> On Mon, 12 Feb 2024 13:55:46 -0600, Jarkko Sakkinen <[email protected]>
> >> wrote:
> >>
> >> > On Mon Feb 5, 2024 at 11:06 PM EET, Haitao Huang wrote:
> >> >> From: Kristen Carlson Accardi <[email protected]>
> >> >>
> >> >> When the EPC usage of a cgroup is near its limit, the cgroup needs to
> >> >> reclaim pages used in the same cgroup to make room for new
> >> allocations.
> >> >> This is analogous to the behavior that the global reclaimer is
> >> triggered
> >> >> when the global usage is close to total available EPC.
> >> >>
> >> >> Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
> >> >> whether synchronous reclaim is allowed or not. And trigger the
> >> >> synchronous/asynchronous reclamation flow accordingly.
> >> >>
> >> >> Note at this point, all reclaimable EPC pages are still tracked in
> >> the
> >> >> global LRU and per-cgroup LRUs are empty. So no per-cgroup
> >> reclamation
> >> >> is activated yet.
> >> >>
> >> >> Co-developed-by: Sean Christopherson
> >> <[email protected]>
> >> >> Signed-off-by: Sean Christopherson <[email protected]>
> >> >> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> >> >> Co-developed-by: Haitao Huang <[email protected]>
> >> >> Signed-off-by: Haitao Huang <[email protected]>
> >> >> ---
> >> >> V7:
> >> >> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> >> >> ---
> >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
> >> >> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
> >> >> arch/x86/kernel/cpu/sgx/main.c | 2 +-
> >> >> 3 files changed, 27 insertions(+), 5 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> index d399fda2b55e..abf74fdb12b4 100644
> >> >> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> >> >> @@ -184,13 +184,35 @@ static void
> >> >> sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> >> >> /**
> >> >> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single
> >> EPC
> >> >> page
> >> >> * @epc_cg: The EPC cgroup to be charged for the page.
> >> >> + * @reclaim: Whether or not synchronous reclaim is allowed
> >> >> * Return:
> >> >> * * %0 - If successfully charged.
> >> >> * * -errno - for failures.
> >> >> */
> >> >> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> >> >> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
> >> >> reclaim)
> >> >> {
> >> >> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> PAGE_SIZE);
> >> >> + for (;;) {
> >> >> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> >> >> + PAGE_SIZE))
> >> >> + break;
> >> >> +
> >> >> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> >> >> + return -ENOMEM;
> >> >> + + if (signal_pending(current))
> >> >> + return -ERESTARTSYS;
> >> >> +
> >> >> + if (!reclaim) {
> >> >> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> >> >> + return -EBUSY;
> >> >> + }
> >> >> +
> >> >> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> >> >> + /* All pages were too young to reclaim, try again a little later
> >> */
> >> >> + schedule();
> >> >
> >> > This will be total pain to backtrack after a while when something
> >> > needs to be changed so there definitely should be inline comments
> >> > addressing each branch condition.
> >> >
> >> > I'd rethink this as:
> >> >
> >> > 1. Create static __sgx_epc_cgroup_try_charge() for addressing single
> >> > iteration with the new "reclaim" parameter.
> >> > 2. Add a new sgx_epc_group_try_charge_reclaim() function.
> >> >
> >> > There's a bit of redundancy with sgx_epc_cgroup_try_charge() and
> >> > sgx_epc_cgroup_try_charge_reclaim() because both have almost the
> >> > same loop calling internal __sgx_epc_cgroup_try_charge() with
> >> > different parameters. That is totally acceptable.
> >> >
> >> > Please also add my suggested-by.
> >> >
> >> > BR, Jarkko
> >> >
> >> > BR, Jarkko
> >> >
> >> For #2:
> >> The only caller of this function, sgx_alloc_epc_page(), has the same
> >> boolean which is passed into this function.
> >
> > I know. This would be good opportunity to fix that up. Large patch
> > sets should try to make the space for its feature best possible and
> > thus also clean up the code base overally.
> >
> >> If we separate it into sgx_epc_cgroup_try_charge() and
> >> sgx_epc_cgroup_try_charge_reclaim(), then the caller has to have the
> >> if/else branches. So separation here seems not help?
> >
> > Of course it does. It makes the code in that location self-documenting
> > and easier to remember what it does.
> >
> > BR, Jarkko
> >
>
> Please let me know if this aligns with your suggestion.
>
>
> static int ___sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> {
> if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> PAGE_SIZE))
> return 0;
>
> if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> return -ENOMEM;
>
> if (signal_pending(current))
> return -ERESTARTSYS;
>
> return -EBUSY;
> }
>
> /**
> * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single page
> * @epc_cg: The EPC cgroup to be charged for the page.
> *
> * Try to reclaim pages in the background if the group reaches its limit
> and
> * there are reclaimable pages in the group.
> * Return:
> * * %0 - If successfully charged.
> * * -errno - for failures.
> */
> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> {
> int ret = ___sgx_epc_cgroup_try_charge(epc_cg);
>
> if (ret == -EBUSY)
> queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>
> return ret;
> }
>
> /**
> * sgx_epc_cgroup_try_charge_reclaim() - try to charge cgroup for a single
> page
> * @epc_cg: The EPC cgroup to be charged for the page.
> *
> * Try to reclaim pages directly if the group reaches its limit and there
> are
> * reclaimable pages in the group.
> * Return:
> * * %0 - If successfully charged.
> * * -errno - for failures.
> */
> int sgx_epc_cgroup_try_charge_reclaim(struct sgx_epc_cgroup *epc_cg)
> {
> int ret;
>
> for (;;) {
> ret = ___sgx_epc_cgroup_try_charge(epc_cg);
> if (ret != -EBUSY)
> return ret;
>
> if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, current->mm))
> /* All pages were too young to reclaim, try again
> a little later */
> schedule();
> }
>
> return 0;
> }
>
> It is a little more involved to remove the boolean for
> sgx_alloc_epc_page() and its callers like sgx_encl_grow(),
> sgx_alloc_va_page(). I'll send a separate patch for comments.
With a quick look, it is heading in the right direction for sure.
BR, Jarkko
On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote:
> On 2/19/24 07:39, Haitao Huang wrote:
> > Remove all boolean parameters for 'reclaim' from the function
> > sgx_alloc_epc_page() and its callers by making two versions of each
> > function.
> >
> > Also opportunistically remove non-static declaration of
> > __sgx_alloc_epc_page() and a typo
> >
> > Signed-off-by: Haitao Huang <[email protected]>
> > Suggested-by: Jarkko Sakkinen <[email protected]>
> > ---
> > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------
> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++-
> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++---
> > arch/x86/kernel/cpu/sgx/main.c | 68 ++++++++++++++++++++++-----------
> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +-
> > arch/x86/kernel/cpu/sgx/virt.c | 2 +-
> > 6 files changed, 115 insertions(+), 44 deletions(-)
>
> Jarkko, did this turn out how you expected?
>
> I think passing around a function pointer to *only* communicate 1 bit of
> information is a _bit_ overkill here.
>
> Simply replacing the bool with:
>
> enum sgx_reclaim {
> SGX_NO_RECLAIM,
> SGX_DO_RECLAIM
> };
>
> would do the same thing. Right?
>
> Are you sure you want a function pointer for this?
To look at this in context I quickly drafted two branches representing an
imaginary next version of the patch set.
I guess this would be a simpler and totally sufficient approach.
With this approach I'd then change also:
[PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
And add the enum-parameter already in that patch with just "no reclaim"
enum. I.e. then 10/15 will add only "do reclaim" and the new
functionality.
BR, Jarkko
On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote:
>> On 2/19/24 07:39, Haitao Huang wrote:
>> > Remove all boolean parameters for 'reclaim' from the function
>> > sgx_alloc_epc_page() and its callers by making two versions of each
>> > function.
>> >
>> > Also opportunistically remove non-static declaration of
>> > __sgx_alloc_epc_page() and a typo
>> >
>> > Signed-off-by: Haitao Huang <[email protected]>
>> > Suggested-by: Jarkko Sakkinen <[email protected]>
>> > ---
>> > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------
>> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++-
>> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++---
>> > arch/x86/kernel/cpu/sgx/main.c | 68
>> ++++++++++++++++++++++-----------
>> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +-
>> > arch/x86/kernel/cpu/sgx/virt.c | 2 +-
>> > 6 files changed, 115 insertions(+), 44 deletions(-)
>>
>> Jarkko, did this turn out how you expected?
>>
>> I think passing around a function pointer to *only* communicate 1 bit of
>> information is a _bit_ overkill here.
>>
>> Simply replacing the bool with:
>>
>> enum sgx_reclaim {
>> SGX_NO_RECLAIM,
>> SGX_DO_RECLAIM
>> };
>>
>> would do the same thing. Right?
>>
>> Are you sure you want a function pointer for this?
>
> To look this in context I drafted quickly two branches representing
> imaginary next version of the patch set.
>
> I guess this would simpler and totally sufficient approach.
>
> With this approach I'd then change also:
>
> [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
>
> And add the enum-parameter already in that patch with just "no reclaim"
> enum. I.e. then 10/15 will add only "do reclaim" and the new
> functionality.
>
> BR, Jarkko
>
Thanks. My understanding is:
1) For this patch, replace the boolean with the enum as Dave suggested. No
two versions of the same functions. And this is a prerequisite for the
cgroup series, positioned before [PATCH v9 04/15]
2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to
sgx_epc_cg_try_charge() from sgx_alloc_epc_page().
3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim
enum parameter value from sgx_alloc_epc_page() to sgx_epc_cg_try_charge()
and add the reclaim logic.
I'll send patches soon. But please let me know if I misunderstood.
Thanks
Haitao
On Mon Feb 19, 2024 at 10:25 PM UTC, Haitao Huang wrote:
> On Mon, 19 Feb 2024 14:42:29 -0600, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Mon Feb 19, 2024 at 3:56 PM UTC, Dave Hansen wrote:
> >> On 2/19/24 07:39, Haitao Huang wrote:
> >> > Remove all boolean parameters for 'reclaim' from the function
> >> > sgx_alloc_epc_page() and its callers by making two versions of each
> >> > function.
> >> >
> >> > Also opportunistically remove non-static declaration of
> >> > __sgx_alloc_epc_page() and a typo
> >> >
> >> > Signed-off-by: Haitao Huang <[email protected]>
> >> > Suggested-by: Jarkko Sakkinen <[email protected]>
> >> > ---
> >> > arch/x86/kernel/cpu/sgx/encl.c | 56 +++++++++++++++++++++------
> >> > arch/x86/kernel/cpu/sgx/encl.h | 6 ++-
> >> > arch/x86/kernel/cpu/sgx/ioctl.c | 23 ++++++++---
> >> > arch/x86/kernel/cpu/sgx/main.c | 68
> >> ++++++++++++++++++++++-----------
> >> > arch/x86/kernel/cpu/sgx/sgx.h | 4 +-
> >> > arch/x86/kernel/cpu/sgx/virt.c | 2 +-
> >> > 6 files changed, 115 insertions(+), 44 deletions(-)
> >>
> >> Jarkko, did this turn out how you expected?
> >>
> >> I think passing around a function pointer to *only* communicate 1 bit of
> >> information is a _bit_ overkill here.
> >>
> >> Simply replacing the bool with:
> >>
> >> enum sgx_reclaim {
> >> SGX_NO_RECLAIM,
> >> SGX_DO_RECLAIM
> >> };
> >>
> >> would do the same thing. Right?
> >>
> >> Are you sure you want a function pointer for this?
> >
> > To look this in context I drafted quickly two branches representing
> > imaginary next version of the patch set.
> >
> > I guess this would simpler and totally sufficient approach.
> >
> > With this approach I'd then change also:
> >
> > [PATCH v9 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
> >
> > And add the enum-parameter already in that patch with just "no reclaim"
> > enum. I.e. then 10/15 will add only "do reclaim" and the new
> > functionality.
> >
> > BR, Jarkko
> >
>
> Thanks. My understanding is:
>
> 1) For this patch, replace the boolean with the enum as Dave suggested. No
> two versions of the same functions. And this is a prerequisite for the
> cgroup series, positioned before [PATCH v9 04/15]
>
> 2) For [PATCH v9 04/15], pass a hard coded SGX_NO_RECLAIM to
> sgx_epc_cg_try_charge() from sgx_alloc_epc_page().
Yup, this will make the whole patch set also a bit leaner as the API
does not change in the middle.
>
> 3) For [PATCH v9 10/15], remove the hard coded value, pass the reclaim
> enum parameter value from sgx_alloc_epc_page() to sgx_epc_cg_try_charge()
> and add the reclaim logic.
>
> I'll send patches soon. But please let me know if I misunderstood.
BR, Jarkko
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Sean Christopherson <[email protected]>
>
> Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
> When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
> from its LRU or LRUs of its descendants to make room for any new
> allocations.
>
> To prepare for reclamation per cgroup, expose the top level reclamation
> function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
> to the function to pass in an LRU so cgroups can pass in different
> tracking LRUs later.
>
[...]
> Add another parameter for passing in the number of
> pages to scan and make the function return the number of pages reclaimed
> as a cgroup reclaimer may need to track reclamation progress from its
> descendants, change number of pages to scan in subsequent calls.
Firstly, sorry for the late reply as I was away.
From the changelog, it's understandable you want to make this function return
pages that are actually reclaimed, and perhaps it's also OK to pass the number
of pages to scan.
But this doesn't explain why you need to make @nr_to_scan a pointer when
you are already returning the number of pages that are actually reclaimed?
And ...
[...]
> -/*
> - * Take a fixed number of pages from the head of the active page pool and
> - * reclaim them to the enclave's private shmem files. Skip the pages, which have
> - * been accessed since the last scan. Move those pages to the tail of active
> - * page pool so that the pages get scanned in LRU like fashion.
> +/**
> + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
> + *
> + * Take a fixed number of pages from the head of a given LRU and reclaim them to
> + * the enclave's private shmem files. Skip the pages, which have been accessed
> + * since the last scan. Move those pages to the tail of the list so that the
> + * pages get scanned in LRU like fashion.
> *
> * Batch process a chunk of pages (at the moment 16) in order to degrade amount
... there's no comment to explain such a design either (@nr_to_scan as a pointer).
Btw, with this change, it seems "Take a fixed number of pages ..." and "at the
moment 16" are no longer accurate.
> * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> * + EWB) but not sufficiently. Reclaiming one page at a time would also be
> * problematic as it would increase the lock contention too much, which would
> * halt forward progress.
> + *
> + * @lru: The LRU from which pages are reclaimed.
> + * @nr_to_scan: Pointer to the target number of pages to scan, must be less than
> + * SGX_NR_TO_SCAN.
> + * Return: Number of pages reclaimed.
> */
> -static void sgx_reclaim_pages(void)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)
> {
> struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
> int ret;
> int i;
>
> - spin_lock(&sgx_global_lru.lock);
> - for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> - struct sgx_epc_page, list);
> + spin_lock(&lru->lock);
> +
> + for (; *nr_to_scan > 0; --(*nr_to_scan)) {
> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
> if (!epc_page)
> break;
>
> @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
> */
> epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> }
> - spin_unlock(&sgx_global_lru.lock);
> +
> + spin_unlock(&lru->lock);
>
> for (i = 0; i < cnt; i++) {
> epc_page = chunk[i];
> @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
> continue;
>
> skip:
> - spin_lock(&sgx_global_lru.lock);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> - spin_unlock(&sgx_global_lru.lock);
> + spin_lock(&lru->lock);
> + list_add_tail(&epc_page->list, &lru->reclaimable);
> + spin_unlock(&lru->lock);
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
>
> @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void)
> sgx_reclaimer_block(epc_page);
> }
>
> + ret = 0;
> for (i = 0; i < cnt; i++) {
> epc_page = chunk[i];
> if (!epc_page)
> @@ -378,7 +387,10 @@ static void sgx_reclaim_pages(void)
> epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>
> sgx_free_epc_page(epc_page);
> + ret++;
> }
> +
> + return (unsigned int)ret;
> }
>
Here, basically, @nr_to_scan is reduced by the number of pages that are
isolated for reclamation, but these pages may not actually be reclaimed, e.g.,
due to aging.
Could you clarify the reason for this choice in the patch, preferably using a
comment (and/or in the changelog if better)?
In v8's reply you mentioned this is due to "the uncertainty of how long it takes
to reclaim pages" and some other reasons, but I am not sure that justifies it.
And AFAICT it also depends on how this function is called. Please also see my
reply to your next patch (where it is called).
That being said, does it make more sense if you can just merge this patch to
your next patch for better review?
> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors. Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> + struct misc_cg *i = epc_cg->cg;
> + u64 m = U64_MAX;
> +
> + while (i) {
> + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> + i = misc_cg_parent(i);
> + }
> +
> + return m / PAGE_SIZE;
> +}
I am not sure, but is it possible or legal for an ancestor to have a lower limit
than its children?
> +
> /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
> + * @root: Root of the tree to check
> *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + bool ret = true;
> +
> + /*
> + * Caller ensure css_root ref acquired
> + */
> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&epc_cg->lru.lock);
> + ret = list_empty(&epc_cg->lru.reclaimable);
> + spin_unlock(&epc_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
> + * @root: Root of the tree to start walking from.
> + * Return: Number of pages reclaimed.
Just wondering, do you need to return @cnt given this function is called w/o
checking the return value?
> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> + /*
> + * Attempting to reclaim only a few pages will often fail and is
> + * inefficient, while reclaiming a huge number of pages can result in
> + * soft lockups due to holding various locks for an extended duration.
> + */
Not sure we need this comment, given it's already implied in
sgx_reclaim_pages(). You cannot pass a value > SGX_NR_TO_SCAN anyway.
> + unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + unsigned int cnt;
> +
> + /* Caller ensure css_root ref acquired */
> + css_root = &root->css;
> +
> + cnt = 0;
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> + cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!nr_to_scan)
> + break;
> + }
> +
> + rcu_read_unlock();
> + return cnt;
> +}
Here the @nr_to_scan is reduced by the number of pages that are isolated, but
not actually reclaimed (which is reflected by @cnt).
IIUC, it looks like you want to make this function do "each cycle" as you
mentioned in v8 [1]:
I tested with that approach and found we can only target number of pages
attempted to reclaim not pages actually reclaimed due to the uncertainty
of how long it takes to reclaim pages. Besides targeting number of
scanned pages for each cycle is also what the ksgxd does.

If we target actual number of pages, sometimes it just takes too long. I
saw more timeouts with the default time limit when running parallel
selftests.
I am not sure what "sometimes it just takes too long" means, but I think you
are trying to write perfect yet complicated code here.
For instance, I don't think the selftests reflect real workloads, and I believe
adjusting the limit of a given EPC cgroup shouldn't be a frequent operation,
thus it is acceptable to use easy-to-maintain but less perfect code.
Here I still think having @nr_to_scan as a pointer is over-complicated. For
example, we can still let sgx_reclaim_pages() always scan SGX_NR_TO_SCAN
pages, but give up when enough pages have been reclaimed or when the EPC cgroup
and its descendants have all been looped over:
unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
{
unsigned int cnt = 0;
...
css_for_each_descendant_pre(pos, css_root) {
...
epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
cnt += sgx_reclaim_pages(&epc_cg->lru);
if (cnt >= SGX_NR_TO_SCAN)
break;
}
...
return cnt;
}
Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually reaches any
descendants, but that should be rare and we don't care that much, do we?
But I'll leave to maintainers to judge.
[1]
https://lore.kernel.org/linux-kernel/[email protected]/T/#md7b062b43d249218369f921682dfa7f975735dd1
> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup
> + * when the cgroup is at/near its maximum capacity
> + */
I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here. Does it
make more sense to move that code change to this patch for better review?
> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + u64 cur, max;
> +
> + epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> + for (;;) {
> + max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> + /*
> + * Adjust the limit down by one page, the goal is to free up
> + * pages for fault allocations, not to simply obey the limit.
> + * Conditionally decrementing max also means the cur vs. max
> + * check will correctly handle the case where both are zero.
> + */
> + if (max)
> + max--;
With the below max -= SGX_NR_TO_SCAN/2 stuff, do you still need this one?
> +
> + /*
> + * Unless the limit is extremely low, in which case forcing
> + * reclaim will likely cause thrashing, force the cgroup to
> + * reclaim at least once if it's operating *near* its maximum
> + * limit by adjusting @max down by half the min reclaim size.
OK. But why choose "SGX_NR_TO_SCAN * 2" as "extremely low"? E.g., could we
choose SGX_NR_TO_SCAN instead?
IMHO we should at least put a comment to explain this.
And maybe you can have a dedicated macro for it, which I believe would make
the code easier to understand.
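For illustration only, such a macro could look like the standalone sketch below. The macro name is hypothetical, and SGX_NR_TO_SCAN is stubbed just to make the snippet self-contained:

```c
#include <assert.h>

#define SGX_NR_TO_SCAN 16 /* stub value; the real one lives in the SGX code */

/*
 * Hypothetical name: below this limit, forcing reclaim would likely
 * just cause thrashing, so skip the "reclaim when near the limit"
 * adjustment entirely.
 */
#define SGX_EPC_CG_MIN_LIMIT_FOR_RECLAIM (SGX_NR_TO_SCAN * 2)

static unsigned long sgx_epc_cg_adjust_max(unsigned long max)
{
	/*
	 * Force the cgroup to reclaim at least once when operating
	 * *near* its maximum, unless the limit is extremely low.
	 */
	if (max > SGX_EPC_CG_MIN_LIMIT_FOR_RECLAIM)
		max -= SGX_NR_TO_SCAN / 2;
	return max;
}
```

With the threshold named, the "extremely low" decision is documented at a single place instead of being implied by a bare constant in the work function.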
> + * This work func is scheduled by sgx_epc_cgroup_try_charge
This has been mentioned in the function comment already.
> + * when it cannot directly reclaim due to being in an atomic
> + * context, e.g. EPC allocation in a fault handler.
>
Why is a fault handler an "atomic context"? Just say when it cannot directly
reclaim.
> Waiting
> + * to reclaim until the cgroup is actually at its limit is less
> + * performant as it means the faulting task is effectively
> + * blocked until a worker makes its way through the global work
> + * queue.
> + */
> + if (max > SGX_NR_TO_SCAN * 2)
> + max -= (SGX_NR_TO_SCAN / 2);
> +
> + cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +
> + if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
> + break;
> +
> + /* Keep reclaiming until above condition is met. */
> + sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
Also, each loop iteration here calls sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty(), both of which loop over the given EPC cgroup and
its descendants. If we still make sgx_reclaim_pages() always scan
SGX_NR_TO_SCAN pages, it seems we can reduce the number of loops here?
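As a standalone model of that suggestion (all names hypothetical, kernel details replaced by simple globals): compute @max once before looping, and infer "LRUs empty" from a zero return of the reclaim step instead of walking the descendants again on every iteration:

```c
#include <assert.h>

#define SGX_NR_TO_SCAN 16 /* stub */

static unsigned long usage;       /* models the cgroup page counter */
static unsigned long reclaimable; /* models pages left on the LRUs  */

/* Models sgx_reclaim_pages(): scan up to SGX_NR_TO_SCAN, return count. */
static unsigned int reclaim_step(void)
{
	unsigned int n = reclaimable < SGX_NR_TO_SCAN ?
			 (unsigned int)reclaimable : SGX_NR_TO_SCAN;

	reclaimable -= n;
	usage -= n;
	return n;
}

/*
 * Models the work func with the walks hoisted: @max is computed once
 * (one descendant walk) instead of once per iteration, and a zero
 * return from the reclaim step replaces the per-iteration
 * sgx_epc_cgroup_lru_empty() walk.
 */
static void reclaim_until(unsigned long max)
{
	while (usage > max) {
		if (!reclaim_step())
			break; /* nothing reclaimable left, give up */
	}
}
```

This trades some precision (the limit could change while the loop runs) for far fewer descendant walks per wakeup of the work function.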
> + }
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> * @epc_cg: The EPC cgroup to be charged for the page.
> * Return:
> * * %0 - If successfully charged.
> @@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
> if (!epc_cg)
> return;
>
> + cancel_work_sync(&epc_cg->reclaim_work);
> kfree(epc_cg);
> }
>
> @@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
>
> static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
> {
> + sgx_lru_init(&epc_cg->lru);
> + INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
> cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> epc_cg->cg = cg;
> }
> @@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
>
> void sgx_epc_cgroup_init(void)
> {
> + sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> + WQ_UNBOUND | WQ_FREEZABLE,
> + WQ_UNBOUND_MAX_ACTIVE);
> + BUG_ON(!sgx_epc_cg_wq);
You cannot BUG_ON() simply because you failed to allocate a workqueue. You
can find some way to mark the EPC cgroup as disabled but keep going. A static
key is one way, although we cannot re-enable it at runtime.
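A minimal standalone sketch of that "mark disabled but keep going" idea (names and return values are made up; in the kernel the flag could be a static key rather than a plain bool):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool sgx_epc_cg_async_ok; /* kernel: a static key set once at init */

/* Instead of BUG_ON(!wq): record the failure and continue booting. */
static void sgx_epc_cg_init_wq(void *wq)
{
	sgx_epc_cg_async_ok = (wq != NULL);
}

/*
 * The try_charge() path then falls back when asynchronous reclaim is
 * unavailable, rather than relying on a workqueue that never existed.
 */
static int sgx_epc_cg_queue_reclaim(void)
{
	if (!sgx_epc_cg_async_ok)
		return -1; /* e.g. fail the charge, or skip async reclaim */
	/* queue_work(sgx_epc_cg_wq, ...) in the real code */
	return 0;
}
```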
> +
> misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
> sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
> }
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index 6b664b4c321f..e3c6a08f0ee8 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -34,6 +34,8 @@ static inline void sgx_epc_cgroup_init(void) { }
> #else
> struct sgx_epc_cgroup {
> struct misc_cg *cg;
> + struct sgx_epc_lru_list lru;
> + struct work_struct reclaim_work;
> };
So you introduced the work/workqueue here but there's no place which actually
queues the work. IMHO you can either:
1) move relevant code change here; or
2) focus on introducing core functions to reclaim certain pages from a given EPC
cgroup w/o workqueue and introduce the work/workqueue in later patch.
Makes sense?
>
> static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> @@ -66,6 +68,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
>
> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
> void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
Not sure why this needs to be exposed. Perhaps you should make this change when
needed.
> void sgx_epc_cgroup_init(void);
>
> #endif
On Tue, Feb 20, 2024 at 09:52:39AM +0000, "Huang, Kai" <[email protected]> wrote:
> I am not sure, but is it possible or legal for an ancestor to have less limit
> than children?
Why not?
It is desired for proper hierarchical delegation, and the tightest limit
of the ancestors applies (cf. memory.max).
Michal
On Tue, 2024-02-20 at 14:18 +0100, Michal Koutný wrote:
> On Tue, Feb 20, 2024 at 09:52:39AM +0000, "Huang, Kai" <[email protected]> wrote:
> > I am not sure, but is it possible or legal for an ancestor to have less limit
> > than children?
>
> Why not?
> It is desired for proper hierarchical delegation, and the tightest limit
> of the ancestors applies (cf. memory.max).
>
OK. Thanks for the info.
Hi Kai
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]> wrote:
[...]
>
> So you introduced the work/workqueue here but there's no place which
> actually
> queues the work. IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC
> cgroup w/o workqueue and introduce the work/workqueue in later patch.
>
> Makes sense?
>
Starting in v7, I was trying to split the big patch, #10 in v6, as you and
others suggested. My thought process was to put the infrastructure needed for
per-cgroup reclaim up front, then turn on per-cgroup reclaim at the end, in
[v9 13/15].
Before that, all reclaimable pages are tracked in the global LRU, so there
really is no "reclaim certain pages from a given EPC cgroup w/o workqueue"
or reclaim through a workqueue before that point, as suggested in #2. This
patch puts down the implementation for both flows, but neither is used yet,
as stated in the commit message.
#1 would force me to go back and merge the patches again.
Sorry, I feel kind of lost on this whole thing by now. It seems so random
to me. Are there hard rules on this?
I was hoping these statements would help reviewers on the flow of the
patches.
At the end of [v9 04/15]:
For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.
Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
At the end of [v9 06/15]:
Next patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd before we
make the switch in the end for sgx_lru_list() to return per-cgroup
LRU.
At the end of [v9 08/15]:
Both synchronous and asynchronous flows invoke the same top level reclaim
function, and will be triggered later by sgx_epc_cgroup_try_charge()
when usage of the cgroup is at or near its limit.
At the end of [v9 10/15]:
Note at this point, all reclaimable EPC pages are still tracked in the
global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
is activated yet.
Thanks
Haitao
On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
> Hi Kai
> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]> wrote:
> [...]
> >
> > So you introduced the work/workqueue here but there's no place which
> > actually
> > queues the work. IMHO you can either:
> >
> > 1) move relevant code change here; or
> > 2) focus on introducing core functions to reclaim certain pages from a
> > given EPC
> > cgroup w/o workqueue and introduce the work/workqueue in later patch.
> >
> > Makes sense?
> >
>
> Starting in v7, I was trying to split the big patch, #10 in v6 as you and
> others suggested. My thought process was to put infrastructure needed for
> per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9
> 13/15] in the end.
That's reasonable for sure.
>
> Before that, all reclaimables are tracked in the global LRU so really
> there is no "reclaim certain pages from a given EPC cgroup w/o workqueue"
> or reclaim through workqueue before that point, as suggested in #2. This
> patch puts down the implementation for both flows but neither used yet, as
> stated in the commit message.
I know it's not used yet. The point is how to split patches to make them more
self-contained and easy to review.
For #2, sorry for not being explicit -- I meant it seems it's more reasonable to
split in this way:
Patch 1)
a). change to sgx_reclaim_pages();
b). introduce sgx_epc_cgroup_reclaim_pages();
c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better name),
which just takes an EPC cgroup as input w/o involving any work/workqueue.
These functions are all related to how to implement reclaiming pages from a
given EPC cgroup, and they are logically related in terms of implementation thus
it's easier to be reviewed together.
Then you just need to justify the design/implementation in changelog/comments.
Patch 2)
- Introduce work/workqueue, and implement the logic to queue the work.
Now we all know there's a function to reclaim pages for a given EPC cgroup, then
we can focus on when that is called, either directly or indirectly.
>
> #1 would force me to go back and merge the patches again.
I don't think so. I am not asking to put all things together, only asking to
split them in a better way (as I see it).
You mentioned some function is "Scheduled by sgx_epc_cgroup_try_charge() to
reclaim pages", but I am not seeing any code doing that in this patch. This
needs fixing, either by moving relevant code here, or removing these not-done-
yet comments.
For instance (I am just giving an example), if after review we found the
queue_work() shouldn't be done in try_charge(), you will need to go back to this
patch and remove these comments.
That's not the best way. Each patch needs to be self-contained.
>
> Sorry, I feel kind of lost on this whole thing by now. It seems so random
> to me. Are there hard rules on this?
See above. I am only offering my opinion on how to split the patches in a
better way.
>
> I was hoping these statements would help reviewers on the flow of the
> patches.
>
> At the end of [v9 04/15]:
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> At the end of [v9 06/15]:
>
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.
>
> At the end of [v9 08/15]:
>
> Both synchronous and asynchronous flows invoke the same top level reclaim
> function, and will be triggered later by sgx_epc_cgroup_try_charge()
> when usage of the cgroup is at or near its limit.
>
> At the end of [v9 10/15]:
> Note at this point, all reclaimable EPC pages are still tracked in the
> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
> is activated yet.
They are useful in the changelog in each patch I suppose, but to me we are
discussing different things.
One pain point in the review is that I have to jump back and forth many times
among multiple patches to see whether one patch is reasonable. That's why I am
asking whether there's a better way to split the patches so that each patch
can be logically self-contained in some way and easier to review.
On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:
> [...]
> >
> > Here the @nr_to_scan is reduced by the number of pages that are
> > isolated, but
> > not actually reclaimed (which is reflected by @cnt).
> >
> > IIUC, looks you want to make this function do "each cycle" as what you
> > mentioned
> > in the v8 [1]:
> >
> > I tested with that approach and found we can only target number of
> > pages
> > attempted to reclaim not pages actually reclaimed due to the
> > uncertainty
> > of how long it takes to reclaim pages. Besides targeting number of
> > scanned pages for each cycle is also what the ksgxd does.
> >
> > If we target actual number of pages, sometimes it just takes too long.
> > I
> > saw more timeouts with the default time limit when running parallel
> > selftests.
> >
> > I am not sure what "sometimes it just takes too long" means, but what I
> > am thinking is you are trying to write some perfect yet complicated code
> > here.
>
> I think what I observed was that the try_charge() would block too long
> before getting a chance to schedule() and yield, causing more timeouts
> than necessary.
> I'll do some re-test to be sure.
This looks like valid information that can be used to justify whatever you are
implementing in the EPC cgroup reclaiming function(s).
>
> >
> > For instance, I don't think selftest reflect the real workload, and I
> > believe
> > adjusting the limit of a given EPC cgroup shouldn't be a frequent
> > operation,
> > thus it is acceptable to use some easy-maintain code but less perfect
> > code.
> >
> > Here I still think having @nr_to_scan as a pointer is over-complicated.
> > For
> > example, we can still let sgx_reclaim_pages() to always scan
> > SGX_NR_TO_SCAN
> > pages, but give up when there's enough pages reclaimed or when the EPC
> > cgroup
> > and its descendants have been looped:
> >
> > unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> > {
> > 	unsigned int cnt = 0;
> > 	...
> >
> > 	css_for_each_descendant_pre(pos, css_root) {
> > 		...
> > 		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> > 		cnt += sgx_reclaim_pages(&epc_cg->lru);
> >
> > 		if (cnt >= SGX_NR_TO_SCAN)
> > 			break;
> > 	}
> >
> > 	...
> > 	return cnt;
> > }
> >
> > Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually
> > reaches any
> > descendants, but that should be rare and we don't care that much, do we?
> >
> I assume you meant @cnt here to be number of pages actually reclaimed.
Yes.
> This could cause sgx_epc_cgroup_reclaim_pages() to block too long, as @cnt
> may always be zero (all pages are too young) and you have to loop over all
> descendants.
I am not sure whether this is a valid point.
For example, your change in patch 10 "x86/sgx: Add EPC reclamation in cgroup
try_charge()" already loops all descendants in below code:
+ if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
Anyway, I can see all these can serve as justifications for your
design/implementation. My point is: please put these justifications in the
changelog/comments so that we can actually understand them.
Makes sense?
> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
> {
> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +	for (;;) {
> +		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
> +					PAGE_SIZE))
> +			break;
> +
> +		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> +			return -ENOMEM;
> +
> +		if (signal_pending(current))
> +			return -ERESTARTSYS;
> +
> +		if (!reclaim) {
> +			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
> +			return -EBUSY;
> +		}
> +
> +		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
> +			/* All pages were too young to reclaim, try again a little later */
> +			schedule();
> +	}
> +
> +	return 0;
> }
>
This code change seems ~90% similar to the existing code in
sgx_alloc_epc_page():
...
for ( ; ; ) {
	page = __sgx_alloc_epc_page();
	if (!IS_ERR(page)) {
		page->owner = owner;
		break;
	}

	if (list_empty(&sgx_active_page_list))
		return ERR_PTR(-ENOMEM);

	if (!reclaim) {
		page = ERR_PTR(-EBUSY);
		break;
	}

	if (signal_pending(current)) {
		page = ERR_PTR(-ERESTARTSYS);
		break;
	}

	sgx_reclaim_pages();
	cond_resched();
}
...
Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?
IIUC, the logic is quite similar: when you either fail to allocate one page
or fail to charge one page, you try to reclaim EPC page(s) from the current
EPC cgroup, either directly or indirectly.
No?
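A standalone model of what such a unified loop might look like (purely illustrative; the names and the failure-injection stubs are made up to keep it self-contained): both "allocation failed" and "charge failed" funnel into the same reclaim-then-retry path:

```c
#include <assert.h>
#include <stdbool.h>

static int failures_left; /* inject N alloc/charge failures, then succeed */
static int reclaim_calls; /* how many reclaim passes were needed          */

/* Models "either __sgx_alloc_epc_page() or misc_cg_try_charge() failed". */
static bool alloc_and_charge(void)
{
	return failures_left-- <= 0;
}

/* Models a unified sgx_alloc_epc_page() retry loop. */
static int alloc_epc_page(bool reclaim)
{
	for (;;) {
		if (alloc_and_charge())
			return 0;
		if (!reclaim)
			return -1; /* -EBUSY in the real code */
		reclaim_calls++;   /* reclaim from the current cgroup */
	}
}
```

With one loop, the allocation and charge failure paths share the same signal handling, -EBUSY behavior, and reclaim policy instead of duplicating them.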
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> When cgroup is enabled, all reclaimable pages will be tracked in cgroup
> LRUs. The global reclaimer needs to start reclamation from the root
> cgroup. Expose the top level cgroup reclamation function so the global
> reclaimer can reuse it.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V8:
> - Remove unneeded breaks in function declarations. (Jarkko)
>
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++++++
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index abf74fdb12b4..6e31f8727b8a 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> * @indirect: In ksgxd or EPC cgroup work queue context.
> * Return: Number of pages reclaimed.
> */
> -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
> {
> /*
> * Attempting to reclaim only a few pages will often fail and is
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index d061cd807b45..5b3e8e1b8630 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
> static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
>
> static inline void sgx_epc_cgroup_init(void) { }
> +
> +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
> +{
> + return 0;
> +}
> #else
> struct sgx_epc_cgroup {
> struct misc_cg *cg;
> @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
> void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect);
> +
> void sgx_epc_cgroup_init(void);
>
> #endif
I'd just prefer to merge a patch like this into the one that actually uses the
exposed function. It's just a couple of LOC, and we shouldn't have to read
these repeated changelogs and move back and forth between patches during
review.
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> Previous patches have implemented all infrastructure needed for
> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
> pages are still tracked in the global LRU as sgx_lru_list() returns hard
> coded reference to the global LRU.
>
> Change sgx_lru_list() to return the LRU of the cgroup in which the given
> EPC page is allocated.
>
> This makes all EPC pages tracked in per-cgroup LRUs and the global
> reclaimer (ksgxd) will not be able to reclaim any pages from the global
> LRU. However, in cases of over-committing, i.e., sum of cgroup limits
> greater than the total capacity, cgroups may never reclaim but the total
> usage can still be near the capacity. Therefore global reclamation is
> still needed in those cases and it should reclaim from the root cgroup.
>
> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
> when cgroup is enabled, otherwise from the global LRU.
>
> Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
>
> With these changes, the global reclamation and per-cgroup reclamation
> both work properly with all pages tracked in per-cgroup LRUs.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 6b0c26cac621..d4265a390ba9 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
>
> static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
> {
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (epc_page->epc_cg)
> + return &epc_page->epc_cg->lru;
> +
> + /* This should not happen if kernel is configured correctly */
> + WARN_ON_ONCE(1);
> +#endif
> return &sgx_global_lru;
> }
How about when the EPC cgroup is enabled, but one enclave doesn't belong to
any EPC cgroup? Is it OK to track EPC pages for these enclaves in the root
EPC cgroup's LRU list together with those of other enclaves that belong to
the root cgroup?
[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are
> isolated, but
> not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks you want to make this function do "each cycle" as what you
> mentioned
> in the v8 [1]:
>
> I tested with that approach and found we can only target number of
> pages
> attempted to reclaim not pages actually reclaimed due to the
> uncertainty
> of how long it takes to reclaim pages. Besides targeting number of
> scanned pages for each cycle is also what the ksgxd does.
>
> If we target actual number of pages, sometimes it just takes too long.
> I
> saw more timeouts with the default time limit when running parallel
> selftests.
>
> I am not sure what "sometimes it just takes too long" means, but what I
> am thinking is you are trying to write some perfect yet complicated code
> here.
I think what I observed was that the try_charge() would block too long
before getting a chance to schedule() and yield, causing more timeouts than
necessary.
I'll do some re-test to be sure.
>
> For instance, I don't think selftest reflect the real workload, and I
> believe
> adjusting the limit of a given EPC cgroup shouldn't be a frequent
> operation,
> thus it is acceptable to use some easy-maintain code but less perfect
> code.
>
> Here I still think having @nr_to_scan as a pointer is over-complicated.
> For
> example, we can still let sgx_reclaim_pages() to always scan
> SGX_NR_TO_SCAN
> pages, but give up when there's enough pages reclaimed or when the EPC
> cgroup
> and its descendants have been looped:
>
> unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> {
> 	unsigned int cnt = 0;
> 	...
>
> 	css_for_each_descendant_pre(pos, css_root) {
> 		...
> 		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> 		cnt += sgx_reclaim_pages(&epc_cg->lru);
>
> 		if (cnt >= SGX_NR_TO_SCAN)
> 			break;
> 	}
>
> 	...
> 	return cnt;
> }
>
> Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually
> reaches any
> descendants, but that should be rare and we don't care that much, do we?
>
I assume you meant @cnt here to be number of pages actually reclaimed.
This could cause sgx_epc_cgroup_reclaim_pages() to block too long, as @cnt
may always be zero (all pages are too young) and you have to loop over all
descendants.
Thanks
Haitao
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> To determine if any page is available for reclamation at the global level,
> only checking for emptiness of the global LRU is not adequate when pages
> are tracked in multiple LRUs, one per cgroup. For this purpose, create a
> new helper, sgx_can_reclaim(), which currently only checks the global LRU;
> later it will check emptiness of the LRUs of all cgroups when per-cgroup
> tracking is turned on. Replace all the checks of the global LRU,
> list_empty(&sgx_global_lru.reclaimable), with calls to
> sgx_can_reclaim().
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> v7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
> arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 2279ae967707..6b0c26cac621 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
> return &sgx_global_lru;
> }
>
> +static inline bool sgx_can_reclaim(void)
> +{
> + return !list_empty(&sgx_global_lru.reclaimable);
> +}
> +
> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>
> /* Nodes with one or more EPC sections. */
> @@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> - !list_empty(&sgx_global_lru.reclaimable);
> + sgx_can_reclaim();
> }
>
> static void sgx_reclaim_pages_global(bool indirect)
> @@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> - if (list_empty(&sgx_global_lru.reclaimable)) {
> + if (!sgx_can_reclaim()) {
> page = ERR_PTR(-ENOMEM);
> break;
> }
This seems like a basic elemental change. Why did you put this patch almost at
the end of this series rather than at an earlier place?
I think one advantage of putting elemental changes early is that if any later
patch touches the related code (here, the code that uses sgx_global_lru), the
updated form can be used. Otherwise, if you make the elemental change later,
you have to go back and replace all the places that were modified in previous
patches.
On Wed, 21 Feb 2024 05:10:36 -0600, Huang, Kai <[email protected]> wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> When cgroup is enabled, all reclaimable pages will be tracked in cgroup
>> LRUs. The global reclaimer needs to start reclamation from the root
>> cgroup. Expose the top level cgroup reclamation function so the global
>> reclaimer can reuse it.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V8:
>> - Remove unneeded breaks in function declarations. (Jarkko)
>>
>> V7:
>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> ---
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++++++
>> 2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> index abf74fdb12b4..6e31f8727b8a 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
>> * @indirect: In ksgxd or EPC cgroup work queue context.
>> * Return: Number of pages reclaimed.
>> */
>> -static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root,
>> bool indirect)
>> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool
>> indirect)
>> {
>> /*
>> * Attempting to reclaim only a few pages will often fail and is
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> index d061cd807b45..5b3e8e1b8630 100644
>> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct
>> sgx_epc_cgroup *epc_cg, bool
>> static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup
>> *epc_cg) { }
>>
>> static inline void sgx_epc_cgroup_init(void) { }
>> +
>> +static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg
>> *root, bool indirect)
>> +{
>> + return 0;
>> +}
>> #else
>> struct sgx_epc_cgroup {
>> struct misc_cg *cg;
>> @@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct
>> sgx_epc_cgroup *epc_cg)
>> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>> reclaim);
>> void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
>> bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
>> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool
>> indirect);
>> +
>> void sgx_epc_cgroup_init(void);
>>
>> #endif
>
> I'd just prefer to merge a patch like this into the one that actually
> uses the exposed function. It's just a couple of LOC, and we shouldn't
> have to read these repeated changelogs and move back and forth between
> patches during review.
>
>
IIRC, Jarkko prefers exposing functions first in a separate patch. Jarkko,
right?
Also, I find your definition/expectation of self-contained patches confusing,
or at least too constrained. I usually review each patch separately without
back and forth, and then review them together to see if they all make sense
in terms of breakdown. My minimal expectation is that a patch should not
depend on future changes and should not break the current state of
functionality.
For this one, I thought the idea was that you verify the exposed API makes
sense without looking at how it is used in the future. Then when you review
the usage patch, you see whether the usage is reasonable.
I would really hesitate to merge patches at this point unless we all agree
and have good/strong reasons, or there is a hard rule about this.
Thanks
Haitao
On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai <[email protected]> wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> Previous patches have implemented all infrastructure needed for
>> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
>> pages are still tracked in the global LRU as sgx_lru_list() returns hard
>> coded reference to the global LRU.
>>
>> Change sgx_lru_list() to return the LRU of the cgroup in which the given
>> EPC page is allocated.
>>
>> This makes all EPC pages tracked in per-cgroup LRUs and the global
>> reclaimer (ksgxd) will not be able to reclaim any pages from the global
>> LRU. However, in cases of over-committing, i.e., sum of cgroup limits
>> greater than the total capacity, cgroups may never reclaim but the total
>> usage can still be near the capacity. Therefore global reclamation is
>> still needed in those cases and it should reclaim from the root cgroup.
>>
>> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
>> when cgroup is enabled, otherwise from the global LRU.
>>
>> Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
>> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
>>
>> With these changes, the global reclamation and per-cgroup reclamation
>> both work properly with all pages tracked in per-cgroup LRUs.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V7:
>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>> ---
>> arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index 6b0c26cac621..d4265a390ba9 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
>>
>> static inline struct sgx_epc_lru_list *sgx_lru_list(struct
>> sgx_epc_page *epc_page)
>> {
>> +#ifdef CONFIG_CGROUP_SGX_EPC
>> + if (epc_page->epc_cg)
>> + return &epc_page->epc_cg->lru;
>> +
>> + /* This should not happen if kernel is configured correctly */
>> + WARN_ON_ONCE(1);
>> +#endif
>> return &sgx_global_lru;
>> }
>
> How about when EPC cgroup is enabled, but one enclave doesn't belong to
> any EPC
> cgroup? Is it OK to track EPC pages for these enclaves to the root EPC
> cgroup's
> LRU list together with other enclaves belongs to the root cgroup?
>
>
> This should be a valid case, right?
There is no such case. Each page is in the root by default.
Thanks
Haitao
On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <[email protected]> wrote:
>
>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>> reclaim)
>> {
>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> + for (;;) {
>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> + PAGE_SIZE))
>> + break;
>> +
>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> + return -ENOMEM;
>> +
>> + if (signal_pending(current))
>> + return -ERESTARTSYS;
>> +
>> + if (!reclaim) {
>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>> + return -EBUSY;
>> + }
>> +
>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>> + /* All pages were too young to reclaim, try again a little later */
>> + schedule();
>> + }
>> +
>> + return 0;
>> }
>>
>
> Seems this code change is 90% similar to the existing code in the
> sgx_alloc_epc_page():
>
> ...
> for ( ; ; ) {
> page = __sgx_alloc_epc_page();
> if (!IS_ERR(page)) {
> page->owner = owner;
> break;
> }
>
> if (list_empty(&sgx_active_page_list))
> return ERR_PTR(-ENOMEM);
>
> if (!reclaim) {
> page = ERR_PTR(-EBUSY);
> break;
> }
>
> if (signal_pending(current)) {
> page = ERR_PTR(-ERESTARTSYS);
> break;
> }
>
> sgx_reclaim_pages();
> cond_resched();
> }
> ...
>
> Is it better to move the logic/code change in try_charge() out to
> sgx_alloc_epc_page() to unify them?
>
> IIUC, the logic is quite similar: When you either failed to allocate one
> page,
> or failed to charge one page, you try to reclaim EPC page(s) from the
> current
> EPC cgroup, either directly or indirectly.
>
> No?
Only these lines are the same:
	if (!reclaim) {
		page = ERR_PTR(-EBUSY);
		break;
	}

	if (signal_pending(current)) {
		page = ERR_PTR(-ERESTARTSYS);
		break;
	}
In sgx_alloc_epc_page() we do global reclamation but here we do per-cgroup
reclamation. That's why the logic of the other lines is different, though
they look similar due to similar function names. For the global reclamation
we need to consider the case in which the cgroup is not enabled. Similarly,
list_empty(&sgx_active_page_list) would have to be changed to check the
root cgroup if cgroups are enabled, and otherwise check the global LRU.
The (!reclaim) case is also different. So I don't see an obvious good way
to abstract those to get meaningful savings.
Thanks
Haitao
On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai <[email protected]> wrote:
> On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:
>> [...]
>> >
>> > Here the @nr_to_scan is reduced by the number of pages that are
>> > isolated, but
>> > not actually reclaimed (which is reflected by @cnt).
>> >
>> > IIUC, looks you want to make this function do "each cycle" as what you
>> > mentioned
>> > in the v8 [1]:
>> >
>> > I tested with that approach and found we can only target number of
>> > pages
>> > attempted to reclaim not pages actually reclaimed due to the
>> > uncertainty
>> > of how long it takes to reclaim pages. Besides targeting number of
>> > scanned pages for each cycle is also what the ksgxd does.
>> >
>> > If we target actual number of pages, sometimes it just takes too
>> long.
>> > I
>> > saw more timeouts with the default time limit when running parallel
>> > selftests.
>> >
>> > I am not sure what does "sometimes it just takes too long" mean, but
>> > what I am
>> > thinking is you are trying to do some perfect but yet complicated code
>> > here.
>>
>> I think what I observed was that the try_charge() would block too long
>> before getting chance of schedule() to yield, causing more timeouts than
>> necessary.
>> I'll do some re-test to be sure.
>
> Looks this is a valid information that can be used to justify whatever
> you are
> implementing in the EPC cgroup reclaiming function(s).
>
I'll add some comments. I was assuming this just follows the old ksgxd
design.
There were some comments at the beginning of
sgx_epc_cgroup_reclaim_pages():
	/*
	 * Attempting to reclaim only a few pages will often fail and is
	 * inefficient, while reclaiming a huge number of pages can result in
	 * soft lockups due to holding various locks for an extended duration.
	 */
	unsigned int nr_to_scan = SGX_NR_TO_SCAN;
I think it can be improved to emphasize that we only "attempt" to finish
scanning a fixed number of pages for reclamation, not enforce the number
of pages successfully reclaimed.
>>
>> >
>> > For instance, I don't think selftest reflect the real workload, and I
>> > believe
>> > adjusting the limit of a given EPC cgroup shouldn't be a frequent
>> > operation,
>> > thus it is acceptable to use some easy-maintain code but less perfect
>> > code.
>> >
>> > Here I still think having @nr_to_scan as a pointer is
>> over-complicated.
>> > For
>> > example, we can still let sgx_reclaim_pages() to always scan
>> > SGX_NR_TO_SCAN
>> > pages, but give up when there's enough pages reclaimed or when the EPC
>> > cgroup
>> > and its descendants have been looped:
>> >
>> > unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
>> > {
>> > unsigned int cnt = 0;
>> > ...
>> >
>> > css_for_each_descendant_pre(pos, css_root) {
>> > ...
>> > epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
>> > cnt += sgx_reclaim_pages(&epc_cg->lru);
>> >
>> > if (cnt >= SGX_NR_TO_SCAN)
>> > break;
>> > }
>> >
>> > ...
>> > return cnt;
>> > }
>> >
>> > Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually
>> > reaches any
>> > descendants, but that should be rare and we don't care that much, do
>> we?
>> >
>> I assume you meant @cnt here to be number of pages actually reclaimed.
>
> Yes.
>
>> This could cause sgx_epc_cgroup_reclaim_pages() block too long as @cnt
>> may always be zero (all pages are too young) and you have to loop all
>> descendants.
>
> I am not sure whether this is a valid point.
>
> For example, your change in patch 10 "x86/sgx: Add EPC reclamation in
> cgroup
> try_charge()" already loops all descendants in below code:
>
> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
> + return -ENOMEM;
>
I meant looping over all descendants for reclamation, which is expensive
and which we want to avoid. Not just checking emptiness of the LRUs.
> Anyway, I can see all these can be justification to your
> design/implementation.
> My point is please put these justification in changelog/comments so that
> we can
> actually understand.
>
Yes, will add clarifying comments.
Thanks
On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai <[email protected]> wrote:
> On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
>> Hi Kai
>> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]>
>> wrote:
>> [...]
>> >
>> > So you introduced the work/workqueue here but there's no place which
>> > actually
>> > queues the work. IMHO you can either:
>> >
>> > 1) move relevant code change here; or
>> > 2) focus on introducing core functions to reclaim certain pages from a
>> > given EPC
>> > cgroup w/o workqueue and introduce the work/workqueue in later patch.
>> >
>> > Makes sense?
>> >
>>
>> Starting in v7, I was trying to split the big patch, #10 in v6 as you
>> and
>> others suggested. My thought process was to put infrastructure needed
>> for
>> per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9
>> 13/15] in the end.
>
> That's reasonable for sure.
>
Thanks for the confirmation :-)
>>
>> Before that, all reclaimables are tracked in the global LRU so really
>> there is no "reclaim certain pages from a given EPC cgroup w/o
>> workqueue"
>> or reclaim through workqueue before that point, as suggested in #2. This
>> patch puts down the implementation for both flows but neither used yet,
>> as
>> stated in the commit message.
>
> I know it's not used yet. The point is how to split patches to make
> them more
> self-contain and easy to review.
I would think this patch is already self-contained in that all of it
implements cgroup-reclamation building blocks utilized later. But
I'll try to follow your suggestions below to split further (I would prefer
not to merge patches in general unless there are strong reasons).
>
> For #2, sorry for not being explicit -- I meant it seems it's more
> reasonable to
> split in this way:
>
> Patch 1)
> a). change to sgx_reclaim_pages();
I'll still prefer this to be a separate patch. It is self-contained IMHO.
We were splitting the original patch because it was too big. I don't want
to merge back unless there is a strong reason.
> b). introduce sgx_epc_cgroup_reclaim_pages();
Ok.
> c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better name),
> which just takes an EPC cgroup as input w/o involving any
> work/workqueue.
This is for the workqueue use only. So I think it'd be better be with
patch #2 below?
>
> These functions are all related to how to implement reclaiming pages
> from a
> given EPC cgroup, and they are logically related in terms of
> implementation thus
> it's easier to be reviewed together.
>
This is pretty much the current patch + sgx_reclaim_pages() - workqueue.
> Then you just need to justify the design/implementation in
> changelog/comments.
>
How about just doing b) in patch #1, and stating that the new function is
the building block and will be used for both direct and indirect
reclamation?
> Patch 2)
> - Introduce work/workqueue, and implement the logic to queue the work.
>
> Now we all know there's a function to reclaim pages for a given EPC
> cgroup, then
> we can focus on when that is called, either directly or indirectly.
>
The try_charge() function will do both actually.
For indirect reclamation, it queues the work to the wq. For direct
reclamation it just calls sgx_epc_cgroup_reclaim_pages().
That part is already in separate (I think self-contained) patch [v9,
10/15].
So for this patch, I'll add sgx_epc_cgroup_reclaim_work_func() and
introduce work/workqueue so later work can be queued?
>>
>> #1 would force me go back and merge the patches again.
>
> I don't think so. I am not asking to put all things together, but only
> asking
> to split in better way (that I see).
>
Okay.
> You mentioned some function is "Scheduled by sgx_epc_cgroup_try_charge()
> to
> reclaim pages", but I am not seeing any code doing that in this patch.
> This
> needs fixing, either by moving relevant code here, or removing these
> not-done-
> yet comments.
>
Yes. The comments will be fixed.
> For instance (I am just giving an example), if after review we found the
> queue_work() shouldn't be done in try_charge(), you will need to go back
> to this
> patch and remove these comments.
>
> That's not the best way. Each patch needs to be self-contained.
>
>>
>> Sorry I feel kind of lost on this whole thing by now. It seems so random
>> to me. Is there hard rules on this?
>
> See above. I am only offering my opinion on how to split patches in
> better way.
>
To be honest, the part I'm finding most confusing is this
self-contained-ness. It seems to depend on how you look at things.
>>
>> I was hoping these statements would help reviewers on the flow of the
>> patches.
>>
>> At the end of [v9 04/15]:
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> At the end of [v9 06/15]:
>>
>> Next patches will first get the cgroup reclamation flow ready while
>> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
>> make the switch in the end for sgx_lru_list() to return per-cgroup
>> LRU.
>>
>> At the end of [v9 08/15]:
>>
>> Both synchronous and asynchronous flows invoke the same top level
>> reclaim
>> function, and will be triggered later by sgx_epc_cgroup_try_charge()
>> when usage of the cgroup is at or near its limit.
>>
>> At the end of [v9 10/15]:
>> Note at this point, all reclaimable EPC pages are still tracked in the
>> global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
>> is activated yet.
>
> They are useful in the changelog in each patch I suppose, but to me we
> are
> discussing different things.
>
> I found one pain in the review is I have to jump back and forth many
> times among
> multiple patches to see whether one patch is reasonable. That's why I
> am asking
> whether there's better way to split patches so that each patch can be
> self-
> contained logically in someway and easier to review.
>
I appreciate very much your time and effort on providing detailed review.
You have been very helpful.
If you think it makes sense, I'll split this patch into 2 with stated
modifications above.
Thanks
Haitao
On 23/02/2024 6:09 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <[email protected]> wrote:
>
>>
>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>>> reclaim)
>>> {
>>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>> PAGE_SIZE);
>>> + for (;;) {
>>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>> + PAGE_SIZE))
>>> + break;
>>> +
>>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>>> + return -ENOMEM;
>>> +
>>> + if (signal_pending(current))
>>> + return -ERESTARTSYS;
>>> +
>>> + if (!reclaim) {
>>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>>> + return -EBUSY;
>>> + }
>>> +
>>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>>> + /* All pages were too young to reclaim, try again a
>>> little later */
>>> + schedule();
>>> + }
>>> +
>>> + return 0;
>>> }
>>>
>>
>> Seems this code change is 90% similar to the existing code in the
>> sgx_alloc_epc_page():
>>
>> ...
>> for ( ; ; ) {
>> page = __sgx_alloc_epc_page();
>> if (!IS_ERR(page)) {
>> page->owner = owner;
>> break;
>> }
>>
>> if (list_empty(&sgx_active_page_list))
>> return ERR_PTR(-ENOMEM);
>>
>> if (!reclaim) {
>> page = ERR_PTR(-EBUSY);
>> break;
>> }
>>
>> if (signal_pending(current)) {
>> page = ERR_PTR(-ERESTARTSYS);
>> break;
>> }
>>
>> sgx_reclaim_pages();
>> cond_resched();
>> }
>> ...
>>
>> Is it better to move the logic/code change in try_charge() out to
>> sgx_alloc_epc_page() to unify them?
>>
>> IIUC, the logic is quite similar: When you either failed to allocate
>> one page,
>> or failed to charge one page, you try to reclaim EPC page(s) from the
>> current
>> EPC cgroup, either directly or indirectly.
>>
>> No?
>
> Only these lines are the same:
> if (!reclaim) {
> page = ERR_PTR(-EBUSY);
> break;
> }
>
> if (signal_pending(current)) {
> page = ERR_PTR(-ERESTARTSYS);
> break;
> }
>
> In sgx_alloc_epc_page() we do global reclamation but here we do
> per-cgroup reclamation.
But why? If we failed to allocate, shouldn't we try to reclaim from the
_current_ EPC cgroup instead of the global one? E.g., I thought one
enclave in one EPC cgroup requesting an insane amount of EPC shouldn't
impact enclaves inside other cgroups?
That's why the logic of other lines is different
> though they look similar due to similar function names. For the global
> reclamation we need consider case in that cgroup is not enabled.
> Similarly list_empty(&sgx_active_page_list) would have to be changed to
> check root cgroup if cgroups enabled otherwise check global LRU. The
> (!reclaim) case is also different.
Without an answer to my above question, so far I am not convinced why
such differences cannot be hidden inside wrapper function(s).
So I don't see an obvious good way
> to abstract those to get meaningful savings.
>
> Thanks
> Haitao
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]> wrote:
>>> +/**
>> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs
>> to reclaim pages
>> + * @root: Root of the tree to start walking from.
>> + * Return: Number of pages reclaimed.
>
> Just wondering, do you need to return @cnt given this function is called
> w/o
> checking the return value?
>
Yes. I will add an explicit comment that we need to scan a fixed number of
pages for attempted reclamation.
>> + */
>> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
>> +{
>> + /*
>> + * Attempting to reclaim only a few pages will often fail and is
>> + * inefficient, while reclaiming a huge number of pages can result in
>> + * soft lockups due to holding various locks for an extended duration.
>> + */
>
> Not sure we need this comment, given it's already implied in
> sgx_reclaim_pages(). You cannot pass a value > SGX_NR_TO_SCAN anyway.
I will rework these comments to make them more meaningful.
>
[other comments/questions addressed in separate email threads]
[...]
>> +
>> +/*
>> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
>> cgroup
>> + * when the cgroup is at/near its maximum capacity
>> + */
>
> I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here.
> Does it
> make more sense to move that code change to this patch for better review?
>
Right. This comment was left-over when I split the old patch.
>> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
>> +{
>> + struct sgx_epc_cgroup *epc_cg;
>> + u64 cur, max;
>> +
>> + epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
>> +
>> + for (;;) {
>> + max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
>> +
>> + /*
>> + * Adjust the limit down by one page, the goal is to free up
>> + * pages for fault allocations, not to simply obey the limit.
>> + * Conditionally decrementing max also means the cur vs. max
>> + * check will correctly handle the case where both are zero.
>> + */
>> + if (max)
>> + max--;
>
> With the below max -= SGX_NR_TO_SCAN/2 staff, do you still need this one?
>
Logically it is still needed for the case max <= SGX_NR_TO_SCAN * 2.
>> +
>> + /*
>> + * Unless the limit is extremely low, in which case forcing
>> + * reclaim will likely cause thrashing, force the cgroup to
>> + * reclaim at least once if it's operating *near* its maximum
>> + * limit by adjusting @max down by half the min reclaim size.
>
> OK. But why choose "SGX_NO_TO_SCAN * 2" as "extremely low"? E.g, could
> we
> choose SGX_NR_TO_SCAN instead?
> IMHO at least we should at least put a comment to mention this.
>
> And maybe you can have a dedicated macro for that in which way I believe
> the
> code would be easier to understand?
Good point. I think the value is kind of arbitrary. We consider
enclaves/cgroups of 64KB size to be very small. If such a cgroup ever
reaches the limit, then we don't aggressively reclaim, to optimize #PF
handling. The user might as well just raise the limit if it is not
performant.
>
>> + * This work func is scheduled by sgx_epc_cgroup_try_charge
>
> This has been mentioned in the function comment already.
>
>> + * when it cannot directly reclaim due to being in an atomic
>> + * context, e.g. EPC allocation in a fault handler.
>
> Why a fault handler is an "atomic context"? Just say when it cannot
> directly
> reclaim.
>
Sure.
>> Waiting
>> + * to reclaim until the cgroup is actually at its limit is less
>> + * performant as it means the faulting task is effectively
>> + * blocked until a worker makes its way through the global work
>> + * queue.
>> + */
>> + if (max > SGX_NR_TO_SCAN * 2)
>> + max -= (SGX_NR_TO_SCAN / 2);
>> +
>> + cur = sgx_epc_cgroup_page_counter_read(epc_cg);
>> +
>> + if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
>> + break;
>> +
>> + /* Keep reclaiming until above condition is met. */
>> + sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
>
> Also, each loop here calls sgx_epc_cgroup_max_pages_to_root() and
> sgx_epc_cgroup_lru_empty(), both loop the given EPC cgroup and
> descendants. If
> we still make sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages,
> seems we can
> reduce the number of loops here?
>
[We already scan SGX_NR_TO_SCAN pages for the cgroup at the level of
sgx_epc_cgroup_reclaim_pages().]
I think you mean that we keep scanning and reclaiming until at least
SGX_NR_TO_SCAN pages are reclaimed, as your code suggested above. We
could probably add such a variant for this background thread as an
optimization. But sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty() are not that bad unless we had very deep and
wide cgroup trees. So would you agree to defer this optimization for later?
>> + }
>> +}
>> +
>> +/**
>> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC
>> page
>> * @epc_cg: The EPC cgroup to be charged for the page.
>> * Return:
>> * * %0 - If successfully charged.
>> @@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
>> if (!epc_cg)
>> return;
>>
>> + cancel_work_sync(&epc_cg->reclaim_work);
>> kfree(epc_cg);
>> }
>>
>> @@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
>>
>> static void sgx_epc_misc_init(struct misc_cg *cg, struct
>> sgx_epc_cgroup *epc_cg)
>> {
>> + sgx_lru_init(&epc_cg->lru);
>> + INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
>> cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
>> epc_cg->cg = cg;
>> }
>> @@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
>>
>> void sgx_epc_cgroup_init(void)
>> {
>> + sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
>> + WQ_UNBOUND | WQ_FREEZABLE,
>> + WQ_UNBOUND_MAX_ACTIVE);
>> + BUG_ON(!sgx_epc_cg_wq);
>
> You cannot BUG_ON() simply due to unable to allocate a workqueue. You
> can use
> some way to mark EPC cgroup as disabled but keep going. Static key is
> one way
> although we cannot re-enable it at runtime.
>
>
Okay, I'll disable and print a log.
[...]
[workqueue related discussion in separate email]
Thanks
Haitao
On 23/02/2024 9:12 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai <[email protected]> wrote:
>
>> On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
>>> Hi Kai
>>> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]>
>>> wrote:
>>> [...]
>>> >
>>> > So you introduced the work/workqueue here but there's no place which
>>> > actually
>>> > queues the work. IMHO you can either:
>>> >
>>> > 1) move relevant code change here; or
>>> > 2) focus on introducing core functions to reclaim certain pages from a
>>> > given EPC
>>> > cgroup w/o workqueue and introduce the work/workqueue in later patch.
>>> >
>>> > Makes sense?
>>> >
>>>
>>> Starting in v7, I was trying to split the big patch, #10 in v6 as you
>>> and
>>> others suggested. My thought process was to put infrastructure needed
>>> for
>>> per-cgroup reclaim in the front, then turn on per-cgroup reclaim in [v9
>>> 13/15] in the end.
>>
>> That's reasonable for sure.
>>
>
> Thanks for the confirmation :-)
>
>>>
>>> Before that, all reclaimables are tracked in the global LRU so really
>>> there is no "reclaim certain pages from a given EPC cgroup w/o
>>> workqueue"
>>> or reclaim through workqueue before that point, as suggested in #2. This
>>> patch puts down the implementation for both flows but neither used
>>> yet, as
>>> stated in the commit message.
>>
>> I know it's not used yet. The point is how to split patches to make
>> them more
>> self-contain and easy to review.
>
> I would think this patch already self-contained in that all are
> implementation of cgroup reclamation building blocks utilized later. But
> I'll try to follow your suggestions below to split further (would prefer
> not to merge in general unless there is strong reasons).
>
>>
>> For #2, sorry for not being explicit -- I meant it seems it's more
>> reasonable to
>> split in this way:
>>
>> Patch 1)
>> a). change to sgx_reclaim_pages();
>
> I'll still prefer this to be a separate patch. It is self-contained IMHO.
> We were splitting the original patch because it was too big. I don't
> want to merge back unless there is a strong reason.
>
>> b). introduce sgx_epc_cgroup_reclaim_pages();
>
> Ok.
If I got you right, I believe you want to have a cgroup variant function
following the same behaviour as the one for global reclaim, i.e., the
_current_ sgx_reclaim_pages(), which always tries to scan and reclaim
SGX_NR_TO_SCAN pages each time.
And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries
to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup
and all the descendants".
And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due
to WHATEVER reasons.
In that case, the change to sgx_reclaim_pages() and the introduction of
sgx_epc_cgroup_reclaim_pages() should really go together because they
are completely tied together in terms of implementation.
In this way you can explain clearly in _ONE_ patch why you chose
this implementation, and for reviewers it's also easier because we can
discuss it all in one patch.
Makes sense?
>
>> c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better
>> name), which just takes an EPC cgroup as input w/o involving any
>> work/workqueue.
>
> This is for the workqueue use only. So I think it'd be better be with
> patch #2 below?
There are multiple levels of logic here IMHO:
1. a) and b) above focus on "each reclaim" of a given EPC cgroup
2. c) is about looping over the above to bring a given cgroup's usage down
to its limit
3. workqueue is one (probably the best) way to do c) in an async way
4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered
To me, it's clear 1) should be in one patch as stated above.
Also, to me 3) and 4) are better kept together since they give a
clear view of how the direct/indirect reclaim is triggered.
2) could be flexible depending on how you see it. If you prefer viewing
it from low-level implementation of reclaiming pages from cgroup, then
it's also OK to be together with 1). If you want to treat it as a part
of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK
to be with 3) and 4).
But to me 2) can be together with 1) or even a separate patch because
it's still kinda of low-level reclaiming details. 3) and 4) shouldn't
contain such detail but should focus on how direct/indirect reclaim is done.
[...]
>
> To be honest, the part I'm feeling most confusing is this
> self-contained-ness. It seems depend on how you look at things.
Completely understand. But I think our discussion should be helpful to
both of us and others.
On 23/02/2024 6:20 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai <[email protected]> wrote:
>
>> On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:
>>> [...]
>>> >
>>> > Here the @nr_to_scan is reduced by the number of pages that are
>>> > isolated, but
>>> > not actually reclaimed (which is reflected by @cnt).
>>> >
>>> > IIUC, looks you want to make this function do "each cycle" as what you
>>> > mentioned
>>> > in the v8 [1]:
>>> >
>>> > I tested with that approach and found we can only target number of
>>> > pages
>>> > attempted to reclaim not pages actually reclaimed due to the
>>> > uncertainty
>>> > of how long it takes to reclaim pages. Besides targeting number of
>>> > scanned pages for each cycle is also what the ksgxd does.
>>> >
>>> > If we target actual number of pages, sometimes it just takes
>>> too long.
>>> > I
>>> > saw more timeouts with the default time limit when running
>>> parallel
>>> > selftests.
>>> >
>>> > I am not sure what does "sometimes it just takes too long" mean, but
>>> > what I am
>>> > thinking is you are trying to do some perfect but yet complicated code
>>> > here.
>>>
>>> I think what I observed was that the try_charge() would block too long
>>> before getting chance of schedule() to yield, causing more timeouts than
>>> necessary.
>>> I'll do some re-test to be sure.
>>
>> Looks this is a valid information that can be used to justify whatever
>> you are
>> implementing in the EPC cgroup reclaiming function(s).
>>
> I'll add some comments. Was assuming this is just following the old
> design as ksgxd.
> There were some comments at the beginning of
> sgx_epc_cgrooup_reclaim_page().
> /*
> * Attempting to reclaim only a few pages will often fail and is
> * inefficient, while reclaiming a huge number of pages can
> result in
> * soft lockups due to holding various locks for an extended
> duration.
> */
> unsigned int nr_to_scan = SGX_NR_TO_SCAN;
>
> I think it can be improved to emphasize we only "attempt" to finish
> scanning fixed number of pages for reclamation, not enforce number of
> pages successfully reclaimed.
I'm not sure it needs to be this exact comment, but somewhere just state
that you are trying to follow ksgxd() (the current sgx_reclaim_pages()),
while doing it "_across_ a given cgroup and all its descendants".
That's the reason you made @nr_to_scan a pointer.
And also add some text to explain why you follow ksgxd() -- not wanting to
block longer due to looping over descendants etc. -- so we can focus on
discussing whether such justification is reasonable.
On 23/02/2024 5:36 am, Haitao Huang wrote:
> On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai <[email protected]> wrote:
>
>> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>>> From: Kristen Carlson Accardi <[email protected]>
>>>
>>> Previous patches have implemented all infrastructure needed for
>>> per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
>>> pages are still tracked in the global LRU as sgx_lru_list() returns hard
>>> coded reference to the global LRU.
>>>
>>> Change sgx_lru_list() to return the LRU of the cgroup in which the given
>>> EPC page is allocated.
>>>
>>> This makes all EPC pages tracked in per-cgroup LRUs and the global
>>> reclaimer (ksgxd) will not be able to reclaim any pages from the global
>>> LRU. However, in cases of over-committing, i.e., sum of cgroup limits
>>> greater than the total capacity, cgroups may never reclaim but the total
>>> usage can still be near the capacity. Therefore global reclamation is
>>> still needed in those cases and it should reclaim from the root cgroup.
>>>
>>> Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
>>> when cgroup is enabled, otherwise from the global LRU.
>>>
>>> Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
>>> cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
>>>
>>> With these changes, the global reclamation and per-cgroup reclamation
>>> both work properly with all pages tracked in per-cgroup LRUs.
>>>
>>> Co-developed-by: Sean Christopherson <[email protected]>
>>> Signed-off-by: Sean Christopherson <[email protected]>
>>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>>> Co-developed-by: Haitao Huang <[email protected]>
>>> Signed-off-by: Haitao Huang <[email protected]>
>>> ---
>>> V7:
>>> - Split this out from the big patch, #10 in V6. (Dave, Kai)
>>> ---
>>> arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/main.c
>>> b/arch/x86/kernel/cpu/sgx/main.c
>>> index 6b0c26cac621..d4265a390ba9 100644
>>> --- a/arch/x86/kernel/cpu/sgx/main.c
>>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>>> @@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
>>>
>>> static inline struct sgx_epc_lru_list *sgx_lru_list(struct
>>> sgx_epc_page *epc_page)
>>> {
>>> +#ifdef CONFIG_CGROUP_SGX_EPC
>>> + if (epc_page->epc_cg)
>>> + return &epc_page->epc_cg->lru;
>>> +
>>> + /* This should not happen if kernel is configured correctly */
>>> + WARN_ON_ONCE(1);
>>> +#endif
>>> return &sgx_global_lru;
>>> }
>>
>> How about when EPC cgroup is enabled, but one enclave doesn't belong
>> to any EPC
>> cgroup? Is it OK to track EPC pages for these enclaves to the root
>> EPC cgroup's
>> LRU list together with other enclaves belongs to the root cgroup?
>>
>>
>> This should be a valid case, right?
>
> There is no such case. Each page is in the root by default.
>
Is it guaranteed by the (misc) cgroup design/implementation? If so, please
add this information to the changelog and/or comments. It helps non-cgroup
experts like me to understand.
On Thu, 22 Feb 2024 15:26:05 -0600, Huang, Kai <[email protected]> wrote:
>
>
> On 23/02/2024 6:09 am, Haitao Huang wrote:
>> On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>>>
>>>> -int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
>>>> reclaim)
>>>> {
>>>> - return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>>> PAGE_SIZE);
>>>> + for (;;) {
>>>> + if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>>> + PAGE_SIZE))
>>>> + break;
>>>> +
>>>> + if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
>>>> + return -ENOMEM;
>>>> +
>>>> + if (signal_pending(current))
>>>> + return -ERESTARTSYS;
>>>> +
>>>> + if (!reclaim) {
>>>> + queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
>>>> + return -EBUSY;
>>>> + }
>>>> +
>>>> + if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
>>>> + /* All pages were too young to reclaim, try again a
>>>> little later */
>>>> + schedule();
>>>> + }
>>>> +
>>>> + return 0;
>>>> }
>>>>
>>>
>>> Seems this code change is 90% similar to the existing code in the
>>> sgx_alloc_epc_page():
>>>
>>> ...
>>> for ( ; ; ) {
>>> page = __sgx_alloc_epc_page();
>>> if (!IS_ERR(page)) {
>>> page->owner = owner;
>>> break;
>>> }
>>>
>>> if (list_empty(&sgx_active_page_list))
>>> return ERR_PTR(-ENOMEM);
>>>
>>> if (!reclaim) {
>>> page = ERR_PTR(-EBUSY);
>>> break;
>>> }
>>>
>>> if (signal_pending(current)) {
>>> page = ERR_PTR(-ERESTARTSYS);
>>> break;
>>> }
>>>
>>> sgx_reclaim_pages();
>>> cond_resched();
>>> }
>>> ...
>>>
>>> Is it better to move the logic/code change in try_charge() out to
>>> sgx_alloc_epc_page() to unify them?
>>>
>>> IIUC, the logic is quite similar: When you either failed to allocate
>>> one page,
>>> or failed to charge one page, you try to reclaim EPC page(s) from the
>>> current
>>> EPC cgroup, either directly or indirectly.
>>>
>>> No?
>> Only these lines are the same:
>> if (!reclaim) {
>> page = ERR_PTR(-EBUSY);
>> break;
>> }
>> if (signal_pending(current)) {
>> page = ERR_PTR(-ERESTARTSYS);
>> break;
>> }
>> In sgx_alloc_epc_page() we do global reclamation but here we do
>> per-cgroup reclamation.
>
> But why? If we failed to allocate, shouldn't we try to reclaim from the
> _current_ EPC cgroup instead of global? E.g., I thought one enclave in
> one EPC cgroup requesting insane amount of EPC shouldn't impact enclaves
> inside other cgroups?
>
Right. When the code reaches here, we have already done per-cgroup reclaim.
The cgroup may not be at its limit, but the system has run out of physical
EPC.
Thanks
Haitao
> >
> Right. When code reaches to here, we already passed reclaim per cgroup.
Yes, if try_charge() failed we must do per-cgroup reclaim.
> The cgroup may not at or reach limit but system has run out of physical
> EPC.
>
But after try_charge() we can still choose to reclaim from the current group,
but not necessarily have to be global, right? I am not sure whether I am
missing something, but could you elaborate why we should choose to reclaim from
the global?
On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <[email protected]> wrote:
>> >
>> Right. When code reaches to here, we already passed reclaim per cgroup.
>
> Yes, if try_charge() failed we must do per-cgroup reclaim.
>
>> The cgroup may not at or reach limit but system has run out of physical
>> EPC.
>>
>
> But after try_charge() we can still choose to reclaim from the current
> group,
> but not necessarily have to be global, right? I am not sure whether I am
> missing something, but could you elaborate why we should choose to
> reclaim from
> the global?
>
Once try_charge() is done and returns zero, the cgroup usage is charged and
it is not over its limit. So you really can't reclaim from that cgroup if
the allocation failed; the only thing you can do is reclaim globally.
This could happen when the sum of the limits of all cgroups is greater than
the physical EPC, i.e., the user is overcommitting.
Thanks
Haitao
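The two limits being debated here can be made concrete with a small
userspace C model of the flow described in this thread. This is a sketch
with hypothetical names and heavy simplifications (one page per call, and
global reclaim modeled as evicting from a single designated victim group
rather than a global LRU); it is not the kernel code. The point it
illustrates: try_charge() enforces only the cgroup limit, so with
overcommitted limits a charge can succeed while the physical EPC is already
full, and only global reclaim can then make progress.

```c
#include <assert.h>
#include <stdbool.h>

#define EPC_CAPACITY 100    /* physical EPC pages in the system */

static int global_usage;    /* pages in use across all cgroups  */

struct epc_cgroup {
	int usage;          /* pages charged to this cgroup        */
	int limit;          /* cgroup's sgx_epc limit ("misc.max") */
};

/* Per-cgroup reclaim: modeled as freeing one page held by the group. */
static bool cgroup_reclaim(struct epc_cgroup *cg)
{
	if (cg->usage == 0)
		return false;
	cg->usage--;
	global_usage--;
	return true;
}

/* Global reclaim: simplified to evicting from one victim group. */
static bool global_reclaim(struct epc_cgroup *victim)
{
	return cgroup_reclaim(victim);
}

/*
 * try_charge() enforces only the cgroup limit; a successful charge says
 * nothing about whether a physical page is actually free.
 */
static bool try_charge(struct epc_cgroup *cg)
{
	while (cg->usage + 1 > cg->limit) {
		if (!cgroup_reclaim(cg))
			return false;
	}
	cg->usage++;
	global_usage++;
	return true;
}

/* Allocation fails only when physical EPC is exhausted -> global reclaim. */
static bool alloc_epc_page(struct epc_cgroup *cg, struct epc_cgroup *victim)
{
	if (!try_charge(cg))
		return false;
	while (global_usage > EPC_CAPACITY) {
		if (!global_reclaim(victim))
			return false;
	}
	return true;
}
```

For example, with two groups each limited to 60 pages against 100 pages of
capacity, a charge in the under-limit group succeeds while the allocation
still has to evict a page from the other group.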
On Thu, 22 Feb 2024 16:44:45 -0600, Huang, Kai <[email protected]> wrote:
>
>
> On 23/02/2024 5:36 am, Haitao Huang wrote:
>> On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>>> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>>>> [full patch quoted above; trimmed]
>>>
>>> How about when EPC cgroup is enabled, but one enclave doesn't belong
>>> to any EPC
>>> cgroup? Is it OK to track EPC pages for these enclaves to the root
>>> EPC cgroup's
>>> LRU list together with other enclaves belongs to the root cgroup?
>>>
>>>
>>> This should be a valid case, right?
>> There is no such case. Each page is in the root by default.
>>
>
> Is it guaranteed by the (misc) cgroup design/implementation? If so
> please add this information to the changelog and/or comments? It helps
> non-cgroup expert like me to understand.
>
Will do
Thanks
Haitao
On 24/02/2024 6:00 am, Haitao Huang wrote:
> On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <[email protected]> wrote:
>
>>> [...]
>> But after try_charge() we can still choose to reclaim from the current
>> group,
>> but not necessarily have to be global, right? I am not sure whether I am
>> missing something, but could you elaborate why we should choose to
>> reclaim from
>> the global?
>>
>
> Once try_charge is done and returns zero that means the cgroup usage is
> charged and it's not over usage limit. So you really can't reclaim from
> that cgroup if allocation failed. The only thing you can do is to
> reclaim globally.
Sorry, I still cannot establish the logic here.
Let's say the sum of all cgroup limits is greater than the physical EPC,
and enclave(s) in each cgroup could potentially fault w/o reaching the
cgroup's limit.
In this case, when enclave(s) in one cgroup fault, why can we not reclaim
from the current cgroup, but have to reclaim globally?
Is there any real downside to the former, or do you just want to follow the
reclaim logic w/o cgroups at all?
IIUC, there's at least one advantage to reclaiming from the current group:
faults of enclave(s) in one group won't impact other enclaves in other
cgroups. E.g., in this way other enclaves in other groups may never need
to trigger faults.
Or perhaps I am missing something?
On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <[email protected]> wrote:
>
>
> On 24/02/2024 6:00 am, Haitao Huang wrote:
>> On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>>> [...]
>> Once try_charge is done and returns zero that means the cgroup usage
>> is charged and it's not over usage limit. So you really can't reclaim
>> from that cgroup if allocation failed. The only thing you can do is to
>> reclaim globally.
>
> Sorry I still cannot establish the logic here.
>
> Let's say the sum of all cgroups are greater than the physical EPC, and
> enclave(s) in each cgroup could potentially fault w/o reaching cgroup's
> limit.
>
> In this case, when enclave(s) in one cgroup faults, why we cannot
> reclaim from the current cgroup, but have to reclaim from global?
>
> Is there any real downside of the former, or you just want to follow the
> reclaim logic w/o cgroup at all?
>
> IIUC, there's at least one advantage of reclaim from the current group,
> that faults of enclave(s) in one group won't impact other enclaves in
> other cgroups. E.g., in this way other enclaves in other groups may
> never need to trigger faults.
>
> Or perhaps I am missing anything?
>
The use case here is that the user knows it's OK for group A to borrow some
pages from group B for some time without impacting performance much, and
vice versa. That's why the user is overcommitting: so the system can run
more enclaves/groups. Otherwise, if she is concerned about the impact of A
on B, she could lower the limit for A so it never interferes, or interferes
less, with B (assuming the lower limit is still high enough to run all
enclaves in A), and sacrifice some of A's performance. Or if she does not
want any interference between groups, just don't over-commit. So we don't
really lose anything here.
In the case of overcommitting, even if we always reclaim from the same
cgroup for each fault, one group may still interfere with the other: e.g.,
consider an extreme case in which group A has used up almost all EPC at the
time group B has a fault; B has to fail the allocation and kill enclaves.
Haitao
On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <[email protected]> wrote:
>
> >
> >
> > On 24/02/2024 6:00 am, Haitao Huang wrote:
> > > On Fri, 23 Feb 2024 04:18:18 -0600, Huang, Kai <[email protected]>
> > > wrote:
> > >
> > > > [...]
> > > Once try_charge is done and returns zero that means the cgroup usage
> > > is charged and it's not over usage limit. So you really can't reclaim
> > > from that cgroup if allocation failed. The only thing you can do is to
> > > reclaim globally.
> >
> > Sorry I still cannot establish the logic here.
> >
> > Let's say the sum of all cgroups are greater than the physical EPC, and
> > > enclave(s) in each cgroup could potentially fault w/o reaching cgroup's
> > limit.
> >
> > In this case, when enclave(s) in one cgroup faults, why we cannot
> > reclaim from the current cgroup, but have to reclaim from global?
> >
> > Is there any real downside of the former, or you just want to follow the
> > reclaim logic w/o cgroup at all?
> >
> > IIUC, there's at least one advantage of reclaim from the current group,
> > that faults of enclave(s) in one group won't impact other enclaves in
> > other cgroups. E.g., in this way other enclaves in other groups may
> > never need to trigger faults.
> >
> > Or perhaps I am missing anything?
> >
> The use case here is that user knows it's OK for group A to borrow some
> pages from group B for some time without impact much performance, vice
> versa. That's why the user is overcomitting so system can run more
> enclave/groups. Otherwise, if she is concerned about impact of A on B, she
> could lower limit for A so it never interfere or interfere less with B
> (assume the lower limit is still high enough to run all enclaves in A),
> and sacrifice some of A's performance. Or if she does not want any
> interference between groups, just don't over-comit. So we don't really
> lose anything here.
But if we reclaim from the same group, it seems we could enable a use case
that allows the admin to ensure a certain group won't be impacted at all,
while allowing other groups to over-commit?
E.g., let's say we have 100M physical EPC. And let's say the admin wants to
run some performance-critical enclave(s) which cost 50M EPC w/o being
impacted. The admin also wants to run other enclaves which could cost 100M
EPC in total, but EPC swapping among them is acceptable.
If we choose to reclaim from the current EPC cgroup, then it seems the
admin can achieve the above by setting up 2 groups, with group1 having a
50M limit and group2 having a 100M limit, and then running the
performance-critical enclave(s) in group1 and the others in group2? Or am I
missing anything?
If we choose to do global reclaim, then we cannot achieve that.
>
> In case of overcomitting, even if we always reclaim from the same cgroup
> for each fault, one group may still interfere the other: e.g., consider an
> extreme case in that group A used up almost all EPC at the time group B
> has a fault, B has to fail allocation and kill enclaves.
If the admin allows group A to use almost all EPC, to me it's fair to say
he/she doesn't want to run anything inside B at all, and it is acceptable
for enclaves in B to be killed.
On 2/26/24 03:36, Huang, Kai wrote:
>> In case of overcomitting, even if we always reclaim from the same cgroup
>> for each fault, one group may still interfere the other: e.g., consider an
>> extreme case in that group A used up almost all EPC at the time group B
>> has a fault, B has to fail allocation and kill enclaves.
> If the admin allows group A to use almost all EPC, to me it's fair to say he/she
> doesn't want to run anything inside B at all and it is acceptable enclaves in B
> to be killed.
Folks, I'm having a really hard time following this thread. It sounds
like there's disagreement about when to do system-wide reclaim. Could
someone remind me of the choices that we have? (A proposed patch would
go a _long_ way to helping me understand)
Also, what does the core mm memcg code do?
Last, what is the simplest (least amount of code) thing that the SGX
cgroup controller could implement here?
On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang <[email protected]> wrote:
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_res_ops sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> +}
This is possibly a nitpick, but I share it here for consideration.
Would it be more prudent to have a signature like
alloc(struct misc_res *res, struct misc_cg *cg)
so that implementations are free of the assumption of how cg and res are
stored?
Thanks,
Michal
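Michal's suggestion can be sketched as compilable userspace C with minimal
stand-in types (the real misc controller structs differ; these are
hypothetical stand-ins for illustration only). The idea is that the core
passes the per-resource slot explicitly, so the implementation stores its
state through 'res' without knowing that the slot happens to live at
cg->res[MISC_CG_RES_SGX_EPC]:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the misc-controller types (hypothetical layout). */
struct misc_res { void *priv; };
struct misc_cg  { struct misc_res res[1]; };

struct sgx_epc_cgroup { struct misc_cg *cg; };

static struct sgx_epc_cgroup epc_storage; /* stands in for a kmalloc() */

/*
 * Suggested callback shape: alloc(res, cg). The implementation only
 * touches the slot it is handed, instead of open-coding where the
 * per-resource private pointer lives inside struct misc_cg.
 */
static int sgx_epc_cgroup_alloc(struct misc_res *res, struct misc_cg *cg)
{
	struct sgx_epc_cgroup *epc_cg = &epc_storage;

	res->priv = epc_cg;
	epc_cg->cg = cg;
	return 0;
}
```

By contrast, the current callback receives only the misc_cg and must know
the cg->res[...] layout itself, which is the assumption Michal is
questioning.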
On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <[email protected]> wrote:
> On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
>> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>> > [...]
>> The use case here is that user knows it's OK for group A to borrow some
>> pages from group B for some time without impact much performance, vice
>> versa. That's why the user is overcomitting so system can run more
>> enclave/groups. Otherwise, if she is concerned about impact of A on B,
>> she
>> could lower limit for A so it never interfere or interfere less with B
>> (assume the lower limit is still high enough to run all enclaves in A),
>> and sacrifice some of A's performance. Or if she does not want any
>> interference between groups, just don't over-comit. So we don't really
>> lose anything here.
>
> But if we reclaim from the same group, seems we could enable a user case
> that
> allows the admin to ensure certain group won't be impacted at all, while
> allowing other groups to over-commit?
>
> E.g., let's say we have 100M physical EPC. And let's say the admin
> wants to run
> some performance-critical enclave(s) which costs 50M EPC w/o being
> impacted.
> The admin also wants to run other enclaves which could cost 100M EPC in
> total
> but EPC swapping among them is acceptable.
>
> If we choose to reclaim from the current EPC cgroup, then seems to that
> the
> admin can achieve the above by setting up 2 groups with group1 having
> 50M limit
> and group2 having 100M limit, and then run performance-critical
> enclave(s) in
> group1 and others in group2? Or am I missing anything?
>
The more important groups should have limits higher than or equal to their
peak usage to ensure no impact.
The less important groups should have limits lower than their peak usage to
avoid impacting higher-priority groups.
The limit is the maximum usage allowed.
By setting group2's limit to 100M, you are allowing it to use 100M. So as
soon as it gets up and consumes 100M, group1 cannot even load any enclave
if we only reclaim per-cgroup and do not do global reclaim.
> If we choose to do global reclaim, then we cannot achieve that.
You can achieve this by setting group2's limit to 50M. No need to
overcommit the system.
Group2 will swap as soon as it hits 50M, which is the maximum it can
consume, so there is no impact to group1.
>
>>
>> In case of overcomitting, even if we always reclaim from the same cgroup
>> for each fault, one group may still interfere the other: e.g., consider
>> an
>> extreme case in that group A used up almost all EPC at the time group B
>> has a fault, B has to fail allocation and kill enclaves.
>
> If the admin allows group A to use almost all EPC, to me it's fair to
> say he/she
> doesn't want to run anything inside B at all and it is acceptable
> enclaves in B
> to be killed.
>
>
I don't think so. The user just knows that the peak usage of A + B is
higher than the system capacity, and she is OK for them to share some of
the pages dynamically. So the kernel should allow one to borrow from the
other at a particular instant when one group has higher demand, and later
do the opposite. IOW, the favor goes both ways.
Haitao
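For reference, the limit settings discussed above would go through the misc
controller's standard cgroup v2 interface. A sketch only, assuming a cgroup
v2 mount at /sys/fs/cgroup and that this series is applied so the "sgx_epc"
resource appears in misc.capacity:

```shell
# Confirm the sgx_epc resource is reported (requires this series).
grep sgx_epc /sys/fs/cgroup/misc.capacity

# Enable the misc controller for child groups.
echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/group1 /sys/fs/cgroup/group2

# misc.max takes "<resource> <value>" pairs; EPC is accounted in bytes.
# Cap both groups at 50M so neither can dip into the other's share:
echo "sgx_epc $((50 << 20))" > /sys/fs/cgroup/group1/misc.max
echo "sgx_epc $((50 << 20))" > /sys/fs/cgroup/group2/misc.max

# Observe per-group usage and limit-hit events:
cat /sys/fs/cgroup/group2/misc.current
cat /sys/fs/cgroup/group2/misc.events
```

With both limits at 50M against 100M of physical EPC (no overcommit),
group2 starts per-cgroup reclaim at its own limit and never forces group1
to swap.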
Hi Dave,
On Mon, 26 Feb 2024 08:04:54 -0600, Dave Hansen <[email protected]>
wrote:
> On 2/26/24 03:36, Huang, Kai wrote:
>>> In case of overcomitting, even if we always reclaim from the same
>>> cgroup
>>> for each fault, one group may still interfere the other: e.g.,
>>> consider an
>>> extreme case in that group A used up almost all EPC at the time group B
>>> has a fault, B has to fail allocation and kill enclaves.
>> If the admin allows group A to use almost all EPC, to me it's fair to
>> say he/she
>> doesn't want to run anything inside B at all and it is acceptable
>> enclaves in B
>> to be killed.
>
> Folks, I'm having a really hard time following this thread. It sounds
> like there's disagreement about when to do system-wide reclaim. Could
> someone remind me of the choices that we have? (A proposed patch would
> go a _long_ way to helping me understand)
>
In the case of overcommitting, i.e., the sum of the limits being greater
than the EPC capacity: if one group has a fault, and its usage is not above
its own limit (try_charge() passes), yet the total usage of the system has
exceeded the capacity, the question is whether we do global reclaim or only
reclaim pages in the current faulting group.
> Also, what does the core mm memcg code do?
>
I'm not sure. I'll try to find out, but it'd be appreciated if someone more
knowledgeable could comment on this. memcg also has a protection mechanism
(i.e., the min and low settings) to guarantee some allocation per group, so
its approach might not be applicable to the misc controller here.
> Last, what is the simplest (least amount of code) thing that the SGX
> cgroup controller could implement here?
>
>
I still think the current approach of doing global reclaim is reasonable
and simple: try_charge() checks the cgroup limit and reclaims within the
group if needed, then the EPC page allocation is done, reclaiming globally
if the allocation fails because global usage has reached the capacity.
I'm not sure how not doing global reclaim in this case would bring any
benefit. Please see my response to Kai's example cases.
Thanks
Haitao
On 2/26/24 13:48, Haitao Huang wrote:
> In case of overcomitting, i.e., sum of limits greater than the EPC
> capacity, if one group has a fault, and its usage is not above its own
> limit (try_charge() passes), yet total usage of the system has exceeded
> the capacity, whether we do global reclaim or just reclaim pages in the
> current faulting group.
I don't see _any_ reason to limit reclaim to the current faulting cgroup.
>> Last, what is the simplest (least amount of code) thing that the SGX
>> cgroup controller could implement here?
>
> I still think the current approach of doing global reclaim is reasonable
> and simple: try_charge() checks cgroup limit and reclaim within the
> group if needed, then do EPC page allocation, reclaim globally if
> allocation fails due to global usage reaches the capacity.
>
> I'm not sure how not doing global reclaiming in this case would bring
> any benefit.
I tend to agree.
Kai, I think your examples sound a little bit contrived. Have actual
users expressed a strong intent for doing anything with this series
other than limiting bad actors from eating all the EPC?
On 27/02/2024 10:18 am, Haitao Huang wrote:
> On Mon, 26 Feb 2024 05:36:02 -0600, Huang, Kai <[email protected]> wrote:
>
>> On Sun, 2024-02-25 at 22:03 -0600, Haitao Huang wrote:
>>> On Sun, 25 Feb 2024 19:38:26 -0600, Huang, Kai <[email protected]>
>>> wrote:
>>> [...]
>>> The use case here is that user knows it's OK for group A to borrow some
>>> pages from group B for some time without impact much performance, vice
>>> versa. That's why the user is overcomitting so system can run more
>>> enclave/groups. Otherwise, if she is concerned about impact of A on
>>> B, she
>>> could lower limit for A so it never interfere or interfere less with B
>>> (assume the lower limit is still high enough to run all enclaves in A),
>>> and sacrifice some of A's performance. Or if she does not want any
>>> interference between groups, just don't over-comit. So we don't really
>>> lose anything here.
>>
>> But if we reclaim from the same group, seems we could enable a user
>> case that
>> allows the admin to ensure certain group won't be impacted at all, while
>> allowing other groups to over-commit?
>>
>> E.g., let's say we have 100M physical EPC. And let's say the admin
>> wants to run
>> some performance-critical enclave(s) which costs 50M EPC w/o being
>> impacted.
>> The admin also wants to run other enclaves which could cost 100M EPC
>> in total
>> but EPC swapping among them is acceptable.
>>
>> If we choose to reclaim from the current EPC cgroup, then seems to
>> that the
>> admin can achieve the above by setting up 2 groups with group1 having
>> 50M limit
>> and group2 having 100M limit, and then run performance-critical
>> enclave(s) in
>> group1 and others in group2? Or am I missing anything?
>>
>
> The more important groups should have limits higher than or equal to
> peak usage to ensure no impact.
Yes. But if you do global reclaim, there's no guarantee of this regardless
of the limit setting. It depends on the limit settings of the other
groups.
> The less important groups should have lower limits than its peak usage
> to avoid impacting higher priority groups.
Yeah, but depending on how low the limit is, try_charge() can still succeed
while physical EPC is already running out.
Are you saying we should always expect the admin to set the group limits so
they do not exceed the physical EPC?
> The limit is the maximum usage allowed.
>
> By setting group2 limit to 100M, you are allowing it to use 100M. So as
> soon as it gets up and consume 100M, group1 can not even load any
> enclave if we only reclaim per-cgroup and do not do global reclaim.
I kinda forgot, but I think SGX supports swapping out an enclave's EPC
before EINIT? Also, with SGX2 the initial enclave can take less EPC to be
loaded.
>
>> If we choose to do global reclaim, then we cannot achieve that.
>
>
> You can achieve this by setting group 2 limit to 50M. No need to
> overcommiting to the system.
> Group 2 will swap as soon as it hits 50M, which is the maximum it can
> consume so no impact to group 1.
Right. We can achieve this by doing so. But as said above, you are
depending on the limit setup to get per-cgroup reclaim.
So, back to the question:
What is the downside of doing per-group reclaim when try_charge() succeeds
for the enclave but the EPC page allocation fails?
Could you give a complete answer for why you chose global reclaim for the
above case?
On 2/26/24 14:24, Huang, Kai wrote:
> What is the downside of doing per-group reclaim when try_charge()
> succeeds for the enclave but failed to allocate EPC page?
>
> Could you give an complete answer why you choose to use global reclaim
> for the above case?
There are literally two different limits at play. There's the limit
that the cgroup imposes and then the actual physical limit.
Hitting the cgroup limit induces cgroup reclaim.
Hitting the physical limit induces global reclaim.
Maybe I'm just being dense, but I fail to understand why you would want
to entangle those two different concepts more than absolutely necessary.
On 27/02/2024 11:31 am, Dave Hansen wrote:
> On 2/26/24 14:24, Huang, Kai wrote:
>> [...]
>
> There are literally two different limits at play. There's the limit
> that the cgroup imposes and then the actual physical limit.
>
> Hitting the cgroup limit induces cgroup reclaim.
>
> Hitting the physical limit induces global reclaim.
>
> Maybe I'm just being dense, but I fail to understand why you would want
> to entangle those two different concepts more than absolutely necessary.
OK. Yes, I agree that doing per-cgroup reclaim when hitting the physical
limit would add another layer of decisions about when to fall back to
global reclaim, which is unnecessary for now.
>
> Kai, I think your examples sound a little bit contrived. Have actual
> users expressed a strong intent for doing anything with this series
> other than limiting bad actors from eating all the EPC?
I am not sure about this. I am also trying to get a full picture.
I asked because I didn't quite like the duplicated code changes in
try_charge() in this patch and in sgx_alloc_epc_page(). I think with
per-cgroup reclaim we can unify the code (I have even started to write
it) and I don't see the downside of doing so.
So I am trying to understand the actual downside of per-cgroup reclaim,
or the full reason we chose global reclaim.
On 2/26/24 14:34, Huang, Kai wrote:
> So I am trying to understand the actual downside of per-cgroup reclaim,
> or the full reason we chose global reclaim.
Take the most extreme example:
while (hit_global_sgx_limit())
reclaim_from_this(cgroup);
You eventually end up with all of 'cgroup's pages gone and handed out to
other users on the system who stole them all. Other users might cause
you to go over the global limit. *They* should be paying part of the
cost, not just you and your cgroup.
On 27/02/2024 11:38 am, Dave Hansen wrote:
> On 2/26/24 14:34, Huang, Kai wrote:
>> So I am trying to understand the actual downside of per-cgroup reclaim,
>> or the full reason we chose global reclaim.
>
> Take the most extreme example:
>
> while (hit_global_sgx_limit())
> reclaim_from_this(cgroup);
>
> You eventually end up with all of 'cgroup's pages gone and handed out to
> other users on the system who stole them all. Other users might cause
> you to go over the global limit. *They* should be paying part of the
> cost, not just you and your cgroup.
Yeah, likely we would need another layer of logic to decide when to do
global reclaim. I agree that is a downside and it is unnecessary for
this patchset.
Thanks for the comments.
Hello.
On Mon, Feb 26, 2024 at 03:48:18PM -0600, Haitao Huang <[email protected]> wrote:
> In case of overcommitting, i.e., the sum of limits being greater than the
> EPC capacity: if one group faults, and its usage is not above its own limit
> (try_charge() passes), yet total usage of the system has exceeded the
> capacity, should we do global reclaim or just reclaim pages in the current
> faulting group?
>
> > Also, what does the core mm memcg code do?
> >
> I'm not sure. I'll try to find out but it'd be appreciated if someone more
> knowledgeable can comment on this. memcg also has the protection mechanism
(i.e., min, low settings) to guarantee some allocation per group, so its
approach might not be applicable to the misc controller here.
I only follow the discussion superficially, but it'd be nice to have
analogous mechanisms in memcg and the SGX controller.
The memory limits are rather simple -- when allocating new memory, the
tightest limit among ancestors applies, and reclaim is applied to the whole
subtree of that ancestor (not necessarily the originating cgroup of
the allocation). Overcommit is admitted; competition among siblings is
resolved on a first come, first served basis.
The memory protections are an additional (and in a sense orthogonal)
mechanism. They can be interpreted as limits that are enforced not at
the time of allocation but at the time of reclaim (and reclaim is
triggered independently, either globally or by the limits above). The
consequence is that the protection code must do some normalization to
evaluate overcommitted values sensibly.
(Historically, there was also memory.soft_limit_in_bytes, which combined
"protection" and global reclaim, but it turned out broken. I suggest
reading Documentation/admin-guide/cgroup-v2.rst, section Controller
Issues and Remedies/Memory, for more details and as a cautionary example.)
HTH,
Michal
On Mon Feb 26, 2024 at 11:56 PM EET, Dave Hansen wrote:
> On 2/26/24 13:48, Haitao Huang wrote:
> > In case of overcommitting, i.e., the sum of limits being greater than
> > the EPC capacity: if one group faults, and its usage is not above its
> > own limit (try_charge() passes), yet total usage of the system has
> > exceeded the capacity, should we do global reclaim or just reclaim
> > pages in the current faulting group?
>
> I don't see _any_ reason to limit reclaim to the current faulting cgroup.
>
> >> Last, what is the simplest (least amount of code) thing that the SGX
> >> cgroup controller could implement here?
> >
> > I still think the current approach of doing global reclaim is reasonable
> > and simple: try_charge() checks the cgroup limit and reclaims within the
> > group if needed, then does the EPC page allocation, reclaiming globally
> > if the allocation fails because global usage has reached the capacity.
> >
> > I'm not sure how not doing global reclaim in this case would bring
> > any benefit.
> I tend to agree.
>
> Kai, I think your examples sound a little bit contrived. Have actual
> users expressed a strong intent for doing anything with this series
> other than limiting bad actors from eating all the EPC?
I'd consider this from the viewpoint of whether anything in the
user-space-visible portion of the patch set would limit tuning the
performance later on, if required, say, by a workload that acts
sub-optimally. If not, then most performance-related issues can only be
identified by actual use of the code.
BR, Jarkko
On Mon, 26 Feb 2024 12:25:58 -0600, Michal Koutný <[email protected]> wrote:
> On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang
> <[email protected]> wrote:
>> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
>> +
>> +const struct misc_res_ops sgx_epc_cgroup_ops = {
>> + .alloc = sgx_epc_cgroup_alloc,
>> + .free = sgx_epc_cgroup_free,
>> +};
>> +
>> +static void sgx_epc_misc_init(struct misc_cg *cg, struct
>> sgx_epc_cgroup *epc_cg)
>> +{
>> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
>> + epc_cg->cg = cg;
>> +}
>
> This is possibly a nitpick, but I share it here for consideration.
>
> Would it be more prudent to have the signature like
> alloc(struct misc_res *res, struct misc_cg *cg)
> so that implementations are free of the assumption of how cg and res are
> stored?
>
>
> Thanks,
> Michal
Will do.
Thanks
Haitao
On Tue, 27 Feb 2024 15:35:38 -0600, Haitao Huang
<[email protected]> wrote:
> On Mon, 26 Feb 2024 12:25:58 -0600, Michal Koutný <[email protected]>
> wrote:
>
>> On Mon, Feb 05, 2024 at 01:06:27PM -0800, Haitao Huang
>> <[email protected]> wrote:
>>> [...]
>>
>> This is possibly a nitpick, but I share it here for consideration.
>>
>> Would it be more prudent to have the signature like
>> alloc(struct misc_res *res, struct misc_cg *cg)
>> so that implementations are free of the assumption of how cg and res are
>> stored?
>>
>>
>> Thanks,
>> Michal
>
> Will do.
>
> Thanks
> Haitao
>
Actually, because the root node is initialized in sgx_cgroup_init(),
which only has access to misc_cg_root(), we can't pass a misc_res struct
without knowing the cg->res relationship. We could hide it behind a
getter, but I think that's a little overkill at the moment. I can sign
up for adding this improvement if we feel it's needed in the future.
Thanks
Haitao
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> With different cgroups, the script starts one or multiple concurrent
> SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test.
> Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with
> following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> if
> more than 4 concurrent tests are run. The script starts 4 expecting
> at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run
> lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all process exit.
>
> The script also includes a test with low mem_cg limit and LARGE
> sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is
> charged
> to a proper mem_cg.
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V7:
> - Added memcontrol test.
>
> V5:
> - Added script with automatic results checking, remove the
> interactive
> script.
> - The script can run independent from the series below.
> ---
> .../selftests/sgx/run_epc_cg_selftests.sh | 246
> ++++++++++++++++++
> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> 2 files changed, 259 insertions(+)
> create mode 100755
> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> create mode 100755
> tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index 000000000000..e027bf39f005
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,246 @@
> +#!/bin/bash
This is not portable, nor does it hold in the wild.
It often does not hold because it is not uncommon to place bash
at /usr/bin/bash. If I recall correctly, e.g. NixOS has
a path that is neither of those two.
It should be #!/usr/bin/env bash.
That is the POSIX-compatible form.
Just got around trying to test this in NUC7 so looking into this in
more detail.
That said can you make the script work with just "#!/usr/bin/env sh"
and make sure that it is busybox ash compatible?
I don't see any necessity to make this bash-only, and it adds to the
compilation time of the image. Otherwise a lot of this could be tested
just with qemu+bzImage+busybox (inside initramfs).
Now you are adding full glibc shenanigans for the sake of syntax
sugar.
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG
How do you know that cgcreate exists? It is used a lot in the script
with no check for its existence. Please fix, e.g. with "command -v
cgcreate".
> +if [ $? -ne 0 ]; then
> + echo "# Please make sure cgroup-tools is installed, and misc
> cgroup is mounted."
> + exit 1
> +fi
And please do not do it this way. Also, please remove the advice about
"cgroup-tools"; this is not meant to be Debian-only. It would be better
to, e.g., point out the URL of the upstream project.
And the whole message should be based on "command -v", not done like
this.
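A minimal sketch of the check being asked for, assuming nothing beyond POSIX sh and `command -v` (the function name and message are illustrative, not the final selftest):

```shell
# Probe each required tool with `command -v` and point at the upstream
# project rather than a distro package name.
require_tools() {
	for tool in "$@"; do
		if ! command -v "$tool" >/dev/null 2>&1; then
			echo "# missing: $tool (see https://github.com/libcgroup/libcgroup)" >&2
			return 1
		fi
	done
	return 0
}

# Example: require_tools cgcreate cgexec cgdelete || exit 1
```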
> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +# We will only set limit in test1 and run tests in test3
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1
> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_MISC_ROOT=/sys/fs/cgroup
> +CG_MEM_ROOT=/sys/fs/cgroup
> +CG_V1=0
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> + echo "# cgroup V2 is in use."
> +else
> + echo "# cgroup V1 is in use."
Is the "#" prefix a standard for kselftest? I don't know, thus asking.
> + CG_MISC_ROOT=/sys/fs/cgroup/misc
> + CG_MEM_ROOT=/sys/fs/cgroup/memory
> + CG_V1=1
Have you checked what the indentation policy is for bash scripts inside
the kernel tree? I don't know what it is; that's why I'm asking.
> +fi
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
> '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity
> size. So
> +# should fail oversubscribed cases
> +SMALL=$(( CAPACITY / 512 ))
> +
> +# At least load one enclave of capacity size successfully, maybe up
> to 4.
> +# But some may fail if we run more than 4 concurrent enclaves of
> capacity size.
> +LARGE=$(( SMALL * 4 ))
> +
> +# Load lots of enclaves
> +LARGER=$CAPACITY
> +echo "# Setting up limits."
> +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
> +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
> +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
> +
> +timestamp=$(date +%Y%m%d_%H%M%S)
> +
> +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
> +
> +wait_check_process_status() {
> + local pid=$1
> + local check_for_success=$2 # If 1, check for success;
> + # If 0, check for failure
> + wait "$pid"
> + local status=$?
> +
> + if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
> + echo "# Process $pid succeeded."
> + return 0
> + elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
> + echo "# Process $pid returned failure."
> + return 0
> + fi
> + return 1
> +}
> +
> +wait_and_detect_for_any() {
What is "any"?
Maybe some key functions could have short documentation on what they
are and which test uses them. I cannot possibly remember all of this
just from hints such as "this waits for Any" ;-)
I don't think there is an actual kernel guideline to engineer the script
to work with just ash, but at least for me that would inevitably
increase my motivation to test this patch set more rather than less.
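As an illustration, the quoted wait_check_process_status() can be written in POSIX sh (busybox-ash friendly) by dropping `[[ ]]` in favor of `[ ]` and avoiding `local`. This is a sketch against the quoted code, not the final selftest.

```shell
# POSIX sh version: plain variables (no `local`, which POSIX does not
# guarantee) and `[ ]` tests instead of bash's `[[ ]]`.
wait_check_process_status() {
	wcps_pid=$1
	wcps_expect_success=$2	# 1 = expect success, 0 = expect failure
	wait "$wcps_pid"
	wcps_status=$?

	if [ "$wcps_expect_success" -eq 1 ] && [ "$wcps_status" -eq 0 ]; then
		echo "# Process $wcps_pid succeeded."
		return 0
	elif [ "$wcps_expect_success" -eq 0 ] && [ "$wcps_status" -ne 0 ]; then
		echo "# Process $wcps_pid returned failure."
		return 0
	fi
	return 1
}
```

The `wcps_` prefixes are only there to avoid clobbering globals in the absence of `local`.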
BR, Jarkko
On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> > [...]
>
> [...]
>
> I don't think there is an actual kernel guideline to engineer the script
> to work with just ash, but at least for me that would inevitably
> increase my motivation to test this patch set more rather than less.
I also wonder whether the cgroup-tools dependency is absolutely required,
or could you just have a function that interacts with sysfs?
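A sketch of what a cgroup-tools-free helper set could look like, talking to cgroupfs directly. The helper names are hypothetical, and the default path assumes cgroup v2 mounted at /sys/fs/cgroup.

```shell
CGROUP_ROOT=${CGROUP_ROOT:-/sys/fs/cgroup}

cg_create() { mkdir -p "$CGROUP_ROOT/$1"; }
cg_delete() { rmdir "$CGROUP_ROOT/$1"; }

# Set a misc limit, e.g.: cg_set_max selftest/test1 "sgx_epc 4096"
cg_set_max() { echo "$2" > "$CGROUP_ROOT/$1/misc.max"; }

# Replacement for cgexec: move the current shell into the group, then
# exec the command so the child inherits the membership. Meant to be
# run in a subshell, e.g.: ( cg_exec selftest/test1 ./test_sgx ... ) &
cg_exec() {
	cg_group=$1; shift
	echo $$ > "$CGROUP_ROOT/$cg_group/cgroup.procs" && exec "$@"
}
```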
BR, Jarkko
On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
>> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>> > The scripts rely on cgroup-tools package from libcgroup [1].
>> >
>> > To run selftests for epc cgroup:
>> >
>> > sudo ./run_epc_cg_selftests.sh
>> >
>> > To watch misc cgroup 'current' changes during testing, run this in a
>> > separate terminal:
>> >
>> > ./watch_misc_for_tests.sh current
>> >
>> > With different cgroups, the script starts one or multiple concurrent
>> > SGX
>> > selftests, each to run one unclobbered_vdso_oversubscribed test.> Each
>> > of such test tries to load an enclave of EPC size equal to the EPC
>> > capacity available on the platform. The script checks results against
>> > the expectation set for each cgroup and reports success or failure.
>> >
>> > The script creates 3 different cgroups at the beginning with
>> > following
>> > expectations:
>> >
>> > 1) SMALL - intentionally small enough to fail the test loading an
>> > enclave of size equal to the capacity.
>> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
>> > if
>> > more than 4 concurrent tests are run. The script starts 4 expecting
>> > at
>> > least one test to pass, and then starts 5 expecting at least one test
>> > to fail.
>> > 3) LARGER - limit is the same as the capacity, large enough to run
>> > lots of
>> > concurrent tests. The script starts 8 of them and expects all pass.
>> > Then it reruns the same test with one process randomly killed and
>> > usage checked to be zero after all process exit.
>> >
>> > The script also includes a test with low mem_cg limit and LARGE
>> > sgx_epc
>> > limit to verify that the RAM used for per-cgroup reclamation is
>> > charged
>> > to a proper mem_cg.
>> >
>> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
>> >
>> > Signed-off-by: Haitao Huang <[email protected]>
>> > ---
>> > V7:
>> > - Added memcontrol test.
>> >
>> > V5:
>> > - Added script with automatic results checking, remove the
>> > interactive
>> > script.
>> > - The script can run independent from the series below.
>> > ---
>> > .../selftests/sgx/run_epc_cg_selftests.sh | 246
>> > ++++++++++++++++++
>> > .../selftests/sgx/watch_misc_for_tests.sh | 13 +
>> > 2 files changed, 259 insertions(+)
>> > create mode 100755
>> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> > create mode 100755
>> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
>> >
>> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> > new file mode 100755
>> > index 000000000000..e027bf39f005
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> > @@ -0,0 +1,246 @@
>> > +#!/bin/bash
>>
>> This is not portable and neither does hold in the wild.
>>
>> It does not even often hold as it is not uncommon to place bash
>> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
>> a path that is neither of those two.
>>
>> Should be #!/usr/bin/env bash
>>
>> That is POSIX compatible form.
>>
>> Just got around trying to test this in NUC7 so looking into this in
>> more detail.
>>
>> That said can you make the script work with just "#!/usr/bin/env sh"
>> and make sure that it is busybox ash compatible?
>>
>> I don't see any necessity to make this bash only and it adds to the
>> compilation time of the image. Otherwise lot of this could be tested
>> just with qemu+bzImage+busybox(inside initramfs).
>>
>> Now you are adding fully glibc shenanigans for the sake of syntax
>> sugar.
>>
>> > +# SPDX-License-Identifier: GPL-2.0
>> > +# Copyright(c) 2023 Intel Corporation.
>> > +
>> > +TEST_ROOT_CG=selftest
>> > +cgcreate -g misc:$TEST_ROOT_CG
>>
>> How do you know that cgcreate exists? It is used a lot in the script
>> with no check for the existence. Please fix e.g. with "command -v
>> cgreate".
>>
>> > +if [ $? -ne 0 ]; then
>> > + echo "# Please make sure cgroup-tools is installed, and misc
>> > cgroup is mounted."
>> > + exit 1
>> > +fi
>>
>> And please do not do it this way. Also, please remove the advice for
>> "cgroups-tool". This is not meant to be debian only. Better would be
>> to e.g. point out the URL of the upstream project.
>>
>> And yeah the whole message should be based on "command -v", not like
>> this.
>>
>> > +TEST_CG_SUB1=$TEST_ROOT_CG/test1
>> > +TEST_CG_SUB2=$TEST_ROOT_CG/test2
>> > +# We will only set limit in test1 and run tests in test3
>> > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
>> > +TEST_CG_SUB4=$TEST_ROOT_CG/test4
>> > +
>> > +cgcreate -g misc:$TEST_CG_SUB1
>>
>>
>>
>> > +cgcreate -g misc:$TEST_CG_SUB2
>> > +cgcreate -g misc:$TEST_CG_SUB3
>> > +cgcreate -g misc:$TEST_CG_SUB4
>> > +
>> > +# Default to V2
>> > +CG_MISC_ROOT=/sys/fs/cgroup
>> > +CG_MEM_ROOT=/sys/fs/cgroup
>> > +CG_V1=0
>> > +if [ ! -d "/sys/fs/cgroup/misc" ]; then
>> > + echo "# cgroup V2 is in use."
>> > +else
>> > + echo "# cgroup V1 is in use."
>>
>> Is "#" prefix a standard for kselftest? I don't know this, thus asking.
>>
>> > + CG_MISC_ROOT=/sys/fs/cgroup/misc
>> > + CG_MEM_ROOT=/sys/fs/cgroup/memory
>> > + CG_V1=1
>>
>> Have you checked what is the indentation policy for bash scripts inside
>> kernel tree. I don't know what it is. That's why I'm asking.
>>
>> > +fi
>> > +
>> > +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
>> > '{print $2}')
>> > +# This is below number of VA pages needed for enclave of capacity
>> > size. So
>> > +# should fail oversubscribed cases
>> > +SMALL=$(( CAPACITY / 512 ))
>> > +
>> > +# At least load one enclave of capacity size successfully, maybe up
>> > to 4.
>> > +# But some may fail if we run more than 4 concurrent enclaves of
>> > capacity size.
>> > +LARGE=$(( SMALL * 4 ))
>> > +
>> > +# Load lots of enclaves
>> > +LARGER=$CAPACITY
>> > +echo "# Setting up limits."
>> > +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
>> > +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
>> > +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
>> > +
>> > +timestamp=$(date +%Y%m%d_%H%M%S)
>> > +
>> > +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
>> > +
>> > +wait_check_process_status() {
>> > + local pid=$1
>> > + local check_for_success=$2 # If 1, check for success;
>> > + # If 0, check for failure
>> > + wait "$pid"
>> > + local status=$?
>> > +
>> > + if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
>> > + echo "# Process $pid succeeded."
>> > + return 0
>> > + elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
>> > + echo "# Process $pid returned failure."
>> > + return 0
>> > + fi
>> > + return 1
>> > +}
>> > +
>> > +wai
>> > wait_and_detect_for_any() {
>>
>> what is "any"?
>>
>> Maybe for some key functions could have short documentation what they
>> are and for what test uses them. I cannot possibly remember all of this
>> just by hints such as "this waits for Any" ;-)
>>
>> I don't think there is actual kernel guideline to engineer the script
>> to work with just ash but at least for me that would inevitably
>> increase my motivation to test this patch set more rather than less.
>
> I also wonder is cgroup-tools dependency absolutely required or could
> you just have a function that would interact with sysfs?
I should have checked email before hitting the send button for v10 :-).
It'd be more complicated and less readable to do all of this without
cgroup-tools, especially cgexec. I checked the dependency: cgroup-tools
only depends on libc, so I hope this would not cause too much
inconvenience.
I saw bash was also used in the cgroup test scripts, so at least that's
consistent :-)
I can look into ash if that's required; let me know.
I can certainly add more docs as you suggested.
Thanks
Haitao
On Wed, 27 Mar 2024 19:57:26 -0500, Haitao Huang
<[email protected]> wrote:
> On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
>>> [...]
>>
>> [...]
>>
>> I also wonder whether the cgroup-tools dependency is absolutely required,
>> or could you just have a function that interacts with sysfs?
>
> [...]
>
> I can look into ash if that's required; let me know.
>
Sorry, I missed your earlier comments. I actually tried to make the
script compatible with busybox on Ubuntu.
I'll address your other comments and update, but meanwhile could you
try out this version in your environment?
https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
Thanks
Haitao
On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>> The scripts rely on cgroup-tools package from libcgroup [1].
>>
>> To run selftests for epc cgroup:
>>
>> sudo ./run_epc_cg_selftests.sh
>>
>> To watch misc cgroup 'current' changes during testing, run this in a
>> separate terminal:
>>
>> ./watch_misc_for_tests.sh current
>>
>> With different cgroups, the script starts one or multiple concurrent
>> SGX
>> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
>> of such test tries to load an enclave of EPC size equal to the EPC
>> capacity available on the platform. The script checks results against
>> the expectation set for each cgroup and reports success or failure.
>>
>> The script creates 3 different cgroups at the beginning with
>> following
>> expectations:
>>
>> 1) SMALL - intentionally small enough to fail the test loading an
>> enclave of size equal to the capacity.
>> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
>> if
>> more than 4 concurrent tests are run. The script starts 4 expecting
>> at
>> least one test to pass, and then starts 5 expecting at least one test
>> to fail.
>> 3) LARGER - limit is the same as the capacity, large enough to run
>> lots of
>> concurrent tests. The script starts 8 of them and expects all pass.
>> Then it reruns the same test with one process randomly killed and
>> usage checked to be zero after all processes exit.
>>
>> The script also includes a test with low mem_cg limit and LARGE
>> sgx_epc
>> limit to verify that the RAM used for per-cgroup reclamation is
>> charged
>> to a proper mem_cg.
>>
>> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V7:
>> - Added memcontrol test.
>>
>> V5:
>> - Added script with automatic results checking, remove the
>> interactive
>> script.
>> - The script can run independent from the series below.
>> ---
>> .../selftests/sgx/run_epc_cg_selftests.sh | 246
>> ++++++++++++++++++
>> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
>> 2 files changed, 259 insertions(+)
>> create mode 100755
>> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> create mode 100755
>> tools/testing/selftests/sgx/watch_misc_for_tests.sh
>>
>> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> new file mode 100755
>> index 000000000000..e027bf39f005
>> --- /dev/null
>> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> @@ -0,0 +1,246 @@
>> +#!/bin/bash
>
> This is not portable and neither does hold in the wild.
>
> It does not even often hold as it is not uncommon to place bash
> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> a path that is neither of those two.
>
> Should be #!/usr/bin/env bash
>
> That is POSIX compatible form.
>
Sure
> Just got around trying to test this in NUC7 so looking into this in
> more detail.
Thanks. Could you please check if this version works for you?
https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
>
> That said can you make the script work with just "#!/usr/bin/env sh"
> and make sure that it is busybox ash compatible?
Yes.
>
> I don't see any necessity to make this bash only and it adds to the
> compilation time of the image. Otherwise lot of this could be tested
> just with qemu+bzImage+busybox(inside initramfs).
>
We will still need cgroup-tools as you pointed out later. Is compiling it
from its upstream code OK?
> Now you are adding fully glibc shenanigans for the sake of syntax
> sugar.
>
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright(c) 2023 Intel Corporation.
>> +
>> +TEST_ROOT_CG=selftest
>> +cgcreate -g misc:$TEST_ROOT_CG
>
> How do you know that cgcreate exists? It is used a lot in the script
> with no check for the existence. Please fix e.g. with "command -v
> cgcreate".
>
>> +if [ $? -ne 0 ]; then
>> + echo "# Please make sure cgroup-tools is installed, and misc
>> cgroup is mounted."
>> + exit 1
>> +fi
>
> And please do not do it this way. Also, please remove the advice for
> "cgroups-tool". This is not meant to be debian only. Better would be
> to e.g. point out the URL of the upstream project.
>
> And yeah the whole message should be based on "command -v", not like
> this.
>
OK
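The suggested "command -v" probe could be sketched like this (the function
name, tool list, and message below are illustrative, not the final script):

```shell
# Probe for each required tool with POSIX "command -v" instead of relying
# on the first cgcreate invocation failing. Tool list is illustrative.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "# $tool not found; see https://github.com/libcgroup/libcgroup" >&2
            return 1
        fi
    done
    return 0
}

# Usage in the selftest script would be something like:
#   require_tools cgcreate cgexec cgdelete || exit 1
```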
>> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
>> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
>> +# We will only set limit in test1 and run tests in test3
>> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
>> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
>> +
>> +cgcreate -g misc:$TEST_CG_SUB1
>
>
>
>> +cgcreate -g misc:$TEST_CG_SUB2
>> +cgcreate -g misc:$TEST_CG_SUB3
>> +cgcreate -g misc:$TEST_CG_SUB4
>> +
>> +# Default to V2
>> +CG_MISC_ROOT=/sys/fs/cgroup
>> +CG_MEM_ROOT=/sys/fs/cgroup
>> +CG_V1=0
>> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
>> + echo "# cgroup V2 is in use."
>> +else
>> + echo "# cgroup V1 is in use."
>
> Is "#" prefix a standard for kselftest? I don't know this, thus asking.
>
>> + CG_MISC_ROOT=/sys/fs/cgroup/misc
>> + CG_MEM_ROOT=/sys/fs/cgroup/memory
>> + CG_V1=1
>
> Have you checked what is the indentation policy for bash scripts inside
> kernel tree. I don't know what it is. That's why I'm asking.
>
Right. I looked around and found scripts using bash in cgroup selftests
(at least they are marked with "#!/bin/bash"), and that's why I used it
initially.
I don't see any specific rule for testing scripts after searching through
the documentation.
I do see bash is one of the minimal requirements for compiling the kernel:
https://docs.kernel.org/process/changes.html?highlight=bash
Anyway, I think we can make it compatible with busybox if needed.
>> +fi
>> +
>> +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
>> '{print $2}')
>> +# This is below number of VA pages needed for enclave of capacity
>> size. So
>> +# should fail oversubscribed cases
>> +SMALL=$(( CAPACITY / 512 ))
>> +
>> +# At least load one enclave of capacity size successfully, maybe up
>> to 4.
>> +# But some may fail if we run more than 4 concurrent enclaves of
>> capacity size.
>> +LARGE=$(( SMALL * 4 ))
>> +
>> +# Load lots of enclaves
>> +LARGER=$CAPACITY
>> +echo "# Setting up limits."
>> +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
>> +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
>> +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
>> +
>> +timestamp=$(date +%Y%m%d_%H%M%S)
>> +
>> +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
>> +
>> +wait_check_process_status() {
>> + local pid=$1
>> + local check_for_success=$2 # If 1, check for success;
>> + # If 0, check for failure
>> + wait "$pid"
>> + local status=$?
>> +
>> + if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
>> + echo "# Process $pid succeeded."
>> + return 0
>> + elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
>> + echo "# Process $pid returned failure."
>> + return 0
>> + fi
>> + return 1
>> +}
>> +
>> +wait_and_detect_for_any() {
>
> what is "any"?
>
> Maybe for some key functions could have short documentation what they
> are and for what test uses them. I cannot possibly remember all of this
> just by hints such as "this waits for Any" ;-)
>
Will add comments
> I don't think there is actual kernel guideline to engineer the script
> to work with just ash but at least for me that would inevitably
> increase my motivation to test this patch set more rather than less.
Ok
Thanks
Haitao
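For the busybox/ash direction discussed above, the quoted
wait_check_process_status helper can be sketched in plain POSIX sh by
dropping the bash-only "[[ ]]" and "local" (behavior intended to match the
original; variable names here are illustrative):

```shell
# POSIX-sh rewrite of the quoted helper: plain "[ ]" tests and no "local",
# so it runs under "#!/usr/bin/env sh" (busybox ash included).
wait_check_process_status() {
    wcps_pid=$1
    wcps_expect_success=$2  # 1: expect success; 0: expect failure
    wait "$wcps_pid"
    wcps_status=$?

    if [ "$wcps_expect_success" -eq 1 ] && [ "$wcps_status" -eq 0 ]; then
        echo "# Process $wcps_pid succeeded."
        return 0
    elif [ "$wcps_expect_success" -eq 0 ] && [ "$wcps_status" -ne 0 ]; then
        echo "# Process $wcps_pid returned failure."
        return 0
    fi
    return 1
}
```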
On Thu, 22 Feb 2024 16:24:47 -0600, Huang, Kai <[email protected]> wrote:
>
>
> On 23/02/2024 9:12 am, Haitao Huang wrote:
>> On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>>> On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:
>>>> Hi Kai
>>>> On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai <[email protected]>
>>>> wrote:
>>>> [...]
>>>> >
>>>> > So you introduced the work/workqueue here but there's no place which
>>>> > actually
>>>> > queues the work. IMHO you can either:
>>>> >
>>>> > 1) move relevant code change here; or
>>>> > 2) focus on introducing core functions to reclaim certain pages
>>>> from a
>>>> > given EPC
>>>> > cgroup w/o workqueue and introduce the work/workqueue in later
>>>> patch.
>>>> >
>>>> > Makes sense?
>>>> >
>>>>
>>>> Starting in v7, I was trying to split the big patch, #10 in v6 as you
>>>> and
>>>> others suggested. My thought process was to put infrastructure needed
>>>> for
>>>> per-cgroup reclaim in the front, then turn on per-cgroup reclaim in
>>>> [v9
>>>> 13/15] in the end.
>>>
>>> That's reasonable for sure.
>>>
>> Thanks for the confirmation :-)
>>
>>>>
>>>> Before that, all reclaimables are tracked in the global LRU so really
>>>> there is no "reclaim certain pages from a given EPC cgroup w/o
>>>> workqueue"
>>>> or reclaim through workqueue before that point, as suggested in #2.
>>>> This
>>>> patch puts down the implementation for both flows but neither used
>>>> yet, as
>>>> stated in the commit message.
>>>
>>> I know it's not used yet. The point is how to split patches to make
>>> them more self-contained and easy to review.
>> I would think this patch is already self-contained in that it is all
>> implementation of cgroup reclamation building blocks utilized later.
>> But I'll try to follow your suggestions below to split further (would
>> prefer not to merge in general unless there are strong reasons).
>>
>>>
>>> For #2, sorry for not being explicit -- I meant it seems it's more
>>> reasonable to
>>> split in this way:
>>>
>>> Patch 1)
>>> a). change to sgx_reclaim_pages();
>> I'll still prefer this to be a separate patch. It is self-contained
>> IMHO.
>> We were splitting the original patch because it was too big. I don't
>> want to merge back unless there is a strong reason.
>>
>>> b). introduce sgx_epc_cgroup_reclaim_pages();
>> Ok.
>
> If I got you right, I believe you want to have a cgroup variant function
> following the same behaviour of the one for global reclaim, i.e., the
> _current_ sgx_reclaim_pages(), which always tries to scan and reclaim
> SGX_NR_TO_SCAN pages each time.
>
> And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries
> to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup
> and all the descendants".
>
> And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due
> to WHATEVER reasons.
>
> In that case, the change to sgx_reclaim_pages() and the introduce of
> sgx_epc_cgroup_reclaim_pages() should really be together because they
> are completely tied together in terms of implementation.
>
> In this way you can just explain clearly in _ONE_ patch why you choose
> this implementation, and for reviewer it's also easier to review because
> we can just discuss in one patch.
>
> Makes sense?
>
>>
>>> c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better
>>> name), which just takes an EPC cgroup as input w/o involving any
>>> work/workqueue.
>> This is for the workqueue use only. So I think it'd be better be with
>> patch #2 below?
>
> There are multiple levels of logic here IMHO:
>
> 1. a) and b) above focus on "each reclaim" a given EPC cgroup
> 2. c) is about a loop of above to bring given cgroup's usage to limit
> 3. workqueue is one (probably best) way to do c) in async way
> 4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered
>
> To me, it's clear 1) should be in one patch as stated above.
>
> Also, to me 3) and 4) are better to be together since they give you a
> clear view on how the direct/indirect reclaim are triggered.
>
> 2) could be flexible depending on how you see it. If you prefer viewing
> it from low-level implementation of reclaiming pages from cgroup, then
> it's also OK to be together with 1). If you want to treat it as a part
> of _async_ way of bring down usage to limit, then _MAYBE_ it's also OK
> to be with 3) and 4).
>
> But to me 2) can be together with 1) or even a separate patch because
> it's still kinda of low-level reclaiming details. 3) and 4) shouldn't
> contain such detail but should focus on how direct/indirect reclaim is
> done.
I incorporated most of your suggestions, and think it'd be better to
discuss this with actual code.
So I'm sending out v10, and will just quickly summarize what I did to
address this particular issue here.
I pretty much follow above suggestions and end up with two patches:
1) a) and b) above plus direct reclaim triggered in try_charge() so
reviewers can see at least one use of sgx_cgroup_reclaim_pages(),
which is the basic building block.
2) All async-related parts: c) above, the workqueue, and indirect reclaim
triggered in try_charge(), which queues the work.
Please review v10, and if you think the triggering parts need to be
separated, then I'll separate them.
Additionally, after more experimentation, I simplified sgx_reclaim_pages()
by removing the pointer for *nr_to_scan as you suggested, but returning
pages collected for isolation (attempted for reclaim) instead of pages
actually reclaimed. I found performance is acceptable with this approach.
Thanks again for your review.
Haitao
On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> The scripts rely on cgroup-tools package from libcgroup [1].
> >>
> >> To run selftests for epc cgroup:
> >>
> >> sudo ./run_epc_cg_selftests.sh
> >>
> >> To watch misc cgroup 'current' changes during testing, run this in a
> >> separate terminal:
> >>
> >> ./watch_misc_for_tests.sh current
> >>
> >> With different cgroups, the script starts one or multiple concurrent
> >> SGX
> >> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> >> of such test tries to load an enclave of EPC size equal to the EPC
> >> capacity available on the platform. The script checks results against
> >> the expectation set for each cgroup and reports success or failure.
> >>
> >> The script creates 3 different cgroups at the beginning with
> >> following
> >> expectations:
> >>
> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> enclave of size equal to the capacity.
> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> >> if
> >> more than 4 concurrent tests are run. The script starts 4 expecting
> >> at
> >> least one test to pass, and then starts 5 expecting at least one test
> >> to fail.
> >> 3) LARGER - limit is the same as the capacity, large enough to run
> >> lots of
> >> concurrent tests. The script starts 8 of them and expects all pass.
> >> Then it reruns the same test with one process randomly killed and
> >> usage checked to be zero after all processes exit.
> >>
> >> The script also includes a test with low mem_cg limit and LARGE
> >> sgx_epc
> >> limit to verify that the RAM used for per-cgroup reclamation is
> >> charged
> >> to a proper mem_cg.
> >>
> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >>
> >> Signed-off-by: Haitao Huang <[email protected]>
> >> ---
> >> V7:
> >> - Added memcontrol test.
> >>
> >> V5:
> >> - Added script with automatic results checking, remove the
> >> interactive
> >> script.
> >> - The script can run independent from the series below.
> >> ---
> >> .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> ++++++++++++++++++
> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> >> 2 files changed, 259 insertions(+)
> >> create mode 100755
> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> create mode 100755
> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >>
> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> new file mode 100755
> >> index 000000000000..e027bf39f005
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> @@ -0,0 +1,246 @@
> >> +#!/bin/bash
> >
> > This is not portable and neither does hold in the wild.
> >
> > It does not even often hold as it is not uncommon to place bash
> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> > a path that is neither of those two.
> >
> > Should be #!/usr/bin/env bash
> >
> > That is POSIX compatible form.
> >
>
> Sure
>
> > Just got around trying to test this in NUC7 so looking into this in
> > more detail.
>
> Thanks. Could you please check if this version works for you?
>
> https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
>
> >
> > That said can you make the script work with just "#!/usr/bin/env sh"
> > and make sure that it is busybox ash compatible?
>
> Yes.
>
> >
> > I don't see any necessity to make this bash only and it adds to the
> > compilation time of the image. Otherwise lot of this could be tested
> > just with qemu+bzImage+busybox(inside initramfs).
> >
>
> We will still need cgroup-tools as you pointed out later. Is compiling it
> from its upstream code OK?
Can you explain why you need it?
What is the thing you cannot do without it?
BR, Jarkko
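One way to frame the question: the main thing cgexec provides is launching
a process inside a given cgroup, which can plausibly be done against sysfs
directly by writing the PID into cgroup.procs. A hedged sketch
(run_in_cgroup is a hypothetical helper; cgroup v2 layout assumed, error
handling and v1 paths omitted):

```shell
# Sketch: run a command inside an existing cgroup without cgexec. A child
# shell adds its own PID to the cgroup's cgroup.procs and then execs the
# command, so only the command (not the caller) joins the cgroup.
run_in_cgroup() {
    cg_path=$1
    shift
    sh -c 'echo $$ > "$1/cgroup.procs" && shift && exec "$@"' sh "$cg_path" "$@"
}
```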
On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote:
> On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> > The scripts rely on cgroup-tools package from libcgroup [1].
> >> >
> >> > To run selftests for epc cgroup:
> >> >
> >> > sudo ./run_epc_cg_selftests.sh
> >> >
> >> > To watch misc cgroup 'current' changes during testing, run this in a
> >> > separate terminal:
> >> >
> >> > ./watch_misc_for_tests.sh current
> >> >
> >> > With different cgroups, the script starts one or multiple concurrent
> >> > SGX
> >> > selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> >> > of such test tries to load an enclave of EPC size equal to the EPC
> >> > capacity available on the platform. The script checks results against
> >> > the expectation set for each cgroup and reports success or failure.
> >> >
> >> > The script creates 3 different cgroups at the beginning with
> >> > following
> >> > expectations:
> >> >
> >> > 1) SMALL - intentionally small enough to fail the test loading an
> >> > enclave of size equal to the capacity.
> >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> >> > if
> >> > more than 4 concurrent tests are run. The script starts 4 expecting
> >> > at
> >> > least one test to pass, and then starts 5 expecting at least one test
> >> > to fail.
> >> > 3) LARGER - limit is the same as the capacity, large enough to run
> >> > lots of
> >> > concurrent tests. The script starts 8 of them and expects all pass.
> >> > Then it reruns the same test with one process randomly killed and
> >> > usage checked to be zero after all processes exit.
> >> >
> >> > The script also includes a test with low mem_cg limit and LARGE
> >> > sgx_epc
> >> > limit to verify that the RAM used for per-cgroup reclamation is
> >> > charged
> >> > to a proper mem_cg.
> >> >
> >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >> >
> >> > Signed-off-by: Haitao Huang <[email protected]>
> >> > ---
> >> > V7:
> >> > - Added memcontrol test.
> >> >
> >> > V5:
> >> > - Added script with automatic results checking, remove the
> >> > interactive
> >> > script.
> >> > - The script can run independent from the series below.
> >> > ---
> >> > .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> > ++++++++++++++++++
> >> > .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> >> > 2 files changed, 259 insertions(+)
> >> > create mode 100755
> >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > create mode 100755
> >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >> >
> >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > new file mode 100755
> >> > index 000000000000..e027bf39f005
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> > @@ -0,0 +1,246 @@
> >> > +#!/bin/bash
> >>
> >> This is not portable and neither does hold in the wild.
> >>
> >> It does not even often hold as it is not uncommon to place bash
> >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> >> a path that is neither of those two.
> >>
> >> Should be #!/usr/bin/env bash
> >>
> >> That is POSIX compatible form.
> >>
> >> Just got around trying to test this in NUC7 so looking into this in
> >> more detail.
> >>
> >> That said can you make the script work with just "#!/usr/bin/env sh"
> >> and make sure that it is busybox ash compatible?
> >>
> >> I don't see any necessity to make this bash only and it adds to the
> >> compilation time of the image. Otherwise lot of this could be tested
> >> just with qemu+bzImage+busybox(inside initramfs).
> >>
> >> Now you are adding fully glibc shenanigans for the sake of syntax
> >> sugar.
> >>
> >> > +# SPDX-License-Identifier: GPL-2.0
> >> > +# Copyright(c) 2023 Intel Corporation.
> >> > +
> >> > +TEST_ROOT_CG=selftest
> >> > +cgcreate -g misc:$TEST_ROOT_CG
> >>
> >> How do you know that cgcreate exists? It is used a lot in the script
> >> with no check for the existence. Please fix e.g. with "command -v
> >> cgcreate".
> >>
> >> > +if [ $? -ne 0 ]; then
> >> > + echo "# Please make sure cgroup-tools is installed, and misc
> >> > cgroup is mounted."
> >> > + exit 1
> >> > +fi
> >>
> >> And please do not do it this way. Also, please remove the advice for
> >> "cgroups-tool". This is not meant to be debian only. Better would be
> >> to e.g. point out the URL of the upstream project.
> >>
> >> And yeah the whole message should be based on "command -v", not like
> >> this.
> >>
> >> > +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> >> > +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> >> > +# We will only set limit in test1 and run tests in test3
> >> > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> >> > +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> >> > +
> >> > +cgcreate -g misc:$TEST_CG_SUB1
> >>
> >>
> >>
> >> > +cgcreate -g misc:$TEST_CG_SUB2
> >> > +cgcreate -g misc:$TEST_CG_SUB3
> >> > +cgcreate -g misc:$TEST_CG_SUB4
> >> > +
> >> > +# Default to V2
> >> > +CG_MISC_ROOT=/sys/fs/cgroup
> >> > +CG_MEM_ROOT=/sys/fs/cgroup
> >> > +CG_V1=0
> >> > +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> >> > + echo "# cgroup V2 is in use."
> >> > +else
> >> > + echo "# cgroup V1 is in use."
> >>
> >> Is "#" prefix a standard for kselftest? I don't know this, thus asking.
> >>
> >> > + CG_MISC_ROOT=/sys/fs/cgroup/misc
> >> > + CG_MEM_ROOT=/sys/fs/cgroup/memory
> >> > + CG_V1=1
> >>
> >> Have you checked what is the indentation policy for bash scripts inside
> >> kernel tree. I don't know what it is. That's why I'm asking.
> >>
> >> > +fi
> >> > +
> >> > +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
> >> > '{print $2}')
> >> > +# This is below number of VA pages needed for enclave of capacity
> >> > size. So
> >> > +# should fail oversubscribed cases
> >> > +SMALL=$(( CAPACITY / 512 ))
> >> > +
> >> > +# At least load one enclave of capacity size successfully, maybe up
> >> > to 4.
> >> > +# But some may fail if we run more than 4 concurrent enclaves of
> >> > capacity size.
> >> > +LARGE=$(( SMALL * 4 ))
> >> > +
> >> > +# Load lots of enclaves
> >> > +LARGER=$CAPACITY
> >> > +echo "# Setting up limits."
> >> > +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
> >> > +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
> >> > +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
> >> > +
> >> > +timestamp=$(date +%Y%m%d_%H%M%S)
> >> > +
> >> > +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
> >> > +
> >> > +wait_check_process_status() {
> >> > + local pid=$1
> >> > + local check_for_success=$2 # If 1, check for success;
> >> > + # If 0, check for failure
> >> > + wait "$pid"
> >> > + local status=$?
> >> > +
> >> > + if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
> >> > + echo "# Process $pid succeeded."
> >> > + return 0
> >> > + elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
> >> > + echo "# Process $pid returned failure."
> >> > + return 0
> >> > + fi
> >> > + return 1
> >> > +}
> >> > +
> >> > +wait_and_detect_for_any() {
> >>
> >> what is "any"?
> >>
> >> Maybe for some key functions could have short documentation what they
> >> are and for what test uses them. I cannot possibly remember all of this
> >> just by hints such as "this waits for Any" ;-)
> >>
> >> I don't think there is actual kernel guideline to engineer the script
> >> to work with just ash but at least for me that would inevitably
> >> increase my motivation to test this patch set more rather than less.
> >
> > I also wonder is cgroup-tools dependency absolutely required or could
> > you just have a function that would interact with sysfs?
>
> I should have checked email before hitting the send button for v10 :-).
>
> It'd be more complicated and less readable to do all the stuff without the
> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends
> on libc so I hope this would not cause too much inconvenience.
As for cgroup-tools, please prove this. It makes the job more
complicated *for you*, and you are making the job more complicated
for every possible person on the planet running any kernel QA.
I weight the latter more than the former. And it is exactly the
reason why we did custom user space kselftest in the first place.
Let's keep the tradition. All I can say is that kselftest is
unfinished in its current form.
What is "esp cgexec"?
BR, Jarkko
On Sat Mar 30, 2024 at 1:23 PM EET, Jarkko Sakkinen wrote:
> On Thu Mar 28, 2024 at 2:57 AM EET, Haitao Huang wrote:
> > On Wed, 27 Mar 2024 11:56:35 -0500, Jarkko Sakkinen <[email protected]>
> > wrote:
> >
> > > On Wed Mar 27, 2024 at 2:55 PM EET, Jarkko Sakkinen wrote:
> > >> On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> > >> > The scripts rely on cgroup-tools package from libcgroup [1].
> > >> >
> > >> > To run selftests for epc cgroup:
> > >> >
> > >> > sudo ./run_epc_cg_selftests.sh
> > >> >
> > >> > To watch misc cgroup 'current' changes during testing, run this in a
> > >> > separate terminal:
> > >> >
> > >> > ./watch_misc_for_tests.sh current
> > >> >
> > >> > With different cgroups, the script starts one or multiple concurrent
> > >> > SGX
> > >> > selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> > >> > of such test tries to load an enclave of EPC size equal to the EPC
> > >> > capacity available on the platform. The script checks results against
> > >> > the expectation set for each cgroup and reports success or failure.
> > >> >
> > >> > The script creates 3 different cgroups at the beginning with
> > >> > following
> > >> > expectations:
> > >> >
> > >> > 1) SMALL - intentionally small enough to fail the test loading an
> > >> > enclave of size equal to the capacity.
> > >> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
> > >> > if
> > >> > more than 4 concurrent tests are run. The script starts 4 expecting
> > >> > at
> > >> > least one test to pass, and then starts 5 expecting at least one test
> > >> > to fail.
> > >> > 3) LARGER - limit is the same as the capacity, large enough to run
> > >> > lots of
> > >> > concurrent tests. The script starts 8 of them and expects all pass.
> > >> > Then it reruns the same test with one process randomly killed and
> > >> > usage checked to be zero after all processes exit.
> > >> >
> > >> > The script also includes a test with low mem_cg limit and LARGE
> > >> > sgx_epc
> > >> > limit to verify that the RAM used for per-cgroup reclamation is
> > >> > charged
> > >> > to a proper mem_cg.
> > >> >
> > >> > [1] https://github.com/libcgroup/libcgroup/blob/main/README
> > >> >
> > >> > Signed-off-by: Haitao Huang <[email protected]>
> > >> > ---
> > >> > V7:
> > >> > - Added memcontrol test.
> > >> >
> > >> > V5:
> > >> > - Added script with automatic results checking, remove the
> > >> > interactive
> > >> > script.
> > >> > - The script can run independent from the series below.
> > >> > ---
> > >> > .../selftests/sgx/run_epc_cg_selftests.sh | 246
> > >> > ++++++++++++++++++
> > >> > .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> > >> > 2 files changed, 259 insertions(+)
> > >> > create mode 100755
> > >> > tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > create mode 100755
> > >> > tools/testing/selftests/sgx/watch_misc_for_tests.sh
> > >> >
> > >> > diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > new file mode 100755
> > >> > index 000000000000..e027bf39f005
> > >> > --- /dev/null
> > >> > +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> > >> > @@ -0,0 +1,246 @@
> > >> > +#!/bin/bash
> > >>
> > >> This is not portable and neither does hold in the wild.
> > >>
> > >> It does not even often hold as it is not uncommon to place bash
> > >> to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> > >> a path that is neither of those two.
> > >>
> > >> Should be #!/usr/bin/env bash
> > >>
> > >> That is POSIX compatible form.
> > >>
> > >> Just got around trying to test this in NUC7 so looking into this in
> > >> more detail.
> > >>
> > >> That said can you make the script work with just "#!/usr/bin/env sh"
> > >> and make sure that it is busybox ash compatible?
> > >>
> > >> I don't see any necessity to make this bash only and it adds to the
> > >> compilation time of the image. Otherwise lot of this could be tested
> > >> just with qemu+bzImage+busybox(inside initramfs).
> > >>
> > >> Now you are adding fully glibc shenanigans for the sake of syntax
> > >> sugar.
> > >>
> > >> > +# SPDX-License-Identifier: GPL-2.0
> > >> > +# Copyright(c) 2023 Intel Corporation.
> > >> > +
> > >> > +TEST_ROOT_CG=selftest
> > >> > +cgcreate -g misc:$TEST_ROOT_CG
> > >>
> > >> How do you know that cgcreate exists? It is used a lot in the script
> > >> with no check for the existence. Please fix e.g. with "command -v
> > >> cgcreate".
> > >>
> > >> > +if [ $? -ne 0 ]; then
> > >> > + echo "# Please make sure cgroup-tools is installed, and misc
> > >> > cgroup is mounted."
> > >> > + exit 1
> > >> > +fi
> > >>
> > >> And please do not do it this way. Also, please remove the advice for
> > >> "cgroups-tool". This is not meant to be debian only. Better would be
> > >> to e.g. point out the URL of the upstream project.
> > >>
> > >> And yeah the whole message should be based on "command -v", not like
> > >> this.
> > >>
> > >> > +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> > >> > +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> > >> > +# We will only set limit in test1 and run tests in test3
> > >> > +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> > >> > +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> > >> > +
> > >> > +cgcreate -g misc:$TEST_CG_SUB1
> > >>
> > >>
> > >>
> > >> > +cgcreate -g misc:$TEST_CG_SUB2
> > >> > +cgcreate -g misc:$TEST_CG_SUB3
> > >> > +cgcreate -g misc:$TEST_CG_SUB4
> > >> > +
> > >> > +# Default to V2
> > >> > +CG_MISC_ROOT=/sys/fs/cgroup
> > >> > +CG_MEM_ROOT=/sys/fs/cgroup
> > >> > +CG_V1=0
> > >> > +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> > >> > + echo "# cgroup V2 is in use."
> > >> > +else
> > >> > + echo "# cgroup V1 is in use."
> > >>
> > >> Is "#" prefix a standard for kselftest? I don't know this, thus asking.
> > >>
> > >> > + CG_MISC_ROOT=/sys/fs/cgroup/misc
> > >> > + CG_MEM_ROOT=/sys/fs/cgroup/memory
> > >> > + CG_V1=1
> > >>
> > >> Have you checked what is the indentation policy for bash scripts inside
> > >> kernel tree. I don't know what it is. That's why I'm asking.
> > >>
> > >> > +fi
> > >> > +
> > >> > +CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk
> > >> > '{print $2}')
> > >> > +# This is below number of VA pages needed for enclave of capacity
> > >> > size. So
> > >> > +# should fail oversubscribed cases
> > >> > +SMALL=$(( CAPACITY / 512 ))
> > >> > +
> > >> > +# At least load one enclave of capacity size successfully, maybe up to 4.
> > >> > +# But some may fail if we run more than 4 concurrent enclaves of capacity size.
> > >> > +LARGE=$(( SMALL * 4 ))
> > >> > +
> > >> > +# Load lots of enclaves
> > >> > +LARGER=$CAPACITY
> > >> > +echo "# Setting up limits."
> > >> > +echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
> > >> > +echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
> > >> > +echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
> > >> > +
> > >> > +timestamp=$(date +%Y%m%d_%H%M%S)
> > >> > +
> > >> > +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
> > >> > +
> > >> > +wait_check_process_status() {
> > >> > + local pid=$1
> > >> > + local check_for_success=$2 # If 1, check for success;
> > >> > + # If 0, check for failure
> > >> > + wait "$pid"
> > >> > + local status=$?
> > >> > +
> > >> > + if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
> > >> > + echo "# Process $pid succeeded."
> > >> > + return 0
> > >> > + elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
> > >> > + echo "# Process $pid returned failure."
> > >> > + return 0
> > >> > + fi
> > >> > + return 1
> > >> > +}
> > >> > +
> > >> > +wait_and_detect_for_any() {
> > >>
> > >> what is "any"?
> > >>
> > >> Maybe for some key functions could have short documentation what they
> > >> are and for what test uses them. I cannot possibly remember all of this
> > >> just by hints such as "this waits for Any" ;-)
> > >>
> > >> I don't think there is actual kernel guideline to engineer the script
> > >> to work with just ash but at least for me that would inevitably
> > >> increase my motivation to test this patch set more rather than less.
> > >
> > > I also wonder if the cgroup-tools dependency is absolutely required or could
> > > you just have a function that would interact with sysfs?
> >
> > I should have checked email before hitting the send button for v10 :-).
> >
> > It'd be more complicated and less readable to do all the stuff without the
> > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends
> > on libc so I hope this would not cause too much inconvenience.
>
> As per cgroup-tools, please prove this. It makes the job for more
> complicated *for you* and you are making the job more complicated
> to every possible person in the planet running any kernel QA.
>
> I weight the latter more than the former. And it is exactly the
> reason why we did custom user space kselftest in the first place.
> Let's keep the tradition. All I can say is that kselftest is
> unfinished in its current form.
>
> What is "esp cgexec"?
Also in kselftest we don't drive ultimate simplicity, we drive
efficient CI/QA. By open coding something like subset of
cgroup-tools needed to run the test you also help us later
on to backtrack the kernel changes. With cgroup-tools you
would have to use strace to get the same info.
BR, Jarkko
On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
>> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen <[email protected]>
>> wrote:
>>
>> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
>> >> The scripts rely on cgroup-tools package from libcgroup [1].
>> >>
>> >> To run selftests for epc cgroup:
>> >>
>> >> sudo ./run_epc_cg_selftests.sh
>> >>
>> >> To watch misc cgroup 'current' changes during testing, run this in a
>> >> separate terminal:
>> >>
>> >> ./watch_misc_for_tests.sh current
>> >>
>> >> With different cgroups, the script starts one or multiple concurrent
>> >> SGX
>> >> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
>> >> such test tries to load an enclave of EPC size equal to the EPC
>> >> capacity available on the platform. The script checks results against
>> >> the expectation set for each cgroup and reports success or failure.
>> >>
>> >> The script creates 3 different cgroups at the beginning with
>> >> following
>> >> expectations:
>> >>
>> >> 1) SMALL - intentionally small enough to fail the test loading an
>> >> enclave of size equal to the capacity.
>> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail some
>> >> if
>> >> more than 4 concurrent tests are run. The script starts 4 expecting
>> >> at
>> >> least one test to pass, and then starts 5 expecting at least one test
>> >> to fail.
>> >> 3) LARGER - limit is the same as the capacity, large enough to run
>> >> lots of
>> >> concurrent tests. The script starts 8 of them and expects all pass.
>> >> Then it reruns the same test with one process randomly killed and
>> >> usage checked to be zero after all process exit.
>> >>
>> >> The script also includes a test with low mem_cg limit and LARGE
>> >> sgx_epc
>> >> limit to verify that the RAM used for per-cgroup reclamation is
>> >> charged
>> >> to a proper mem_cg.
>> >>
>> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>> >>
>> >> Signed-off-by: Haitao Huang <[email protected]>
>> >> ---
>> >> V7:
>> >> - Added memcontrol test.
>> >>
>> >> V5:
>> >> - Added script with automatic results checking, remove the
>> >> interactive
>> >> script.
>> >> - The script can run independent from the series below.
>> >> ---
>> >> .../selftests/sgx/run_epc_cg_selftests.sh | 246
>> >> ++++++++++++++++++
>> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
>> >> 2 files changed, 259 insertions(+)
>> >> create mode 100755
>> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> >> create mode 100755
>> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
>> >>
>> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> >> new file mode 100755
>> >> index 000000000000..e027bf39f005
>> >> --- /dev/null
>> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> >> @@ -0,0 +1,246 @@
>> >> +#!/bin/bash
>> >
>> > This is not portable and neither does hold in the wild.
>> >
>> > It does not even often hold as it is not uncommon to place bash
>> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
>> > a path that is neither of those two.
>> >
>> > Should be #!/usr/bin/env bash
>> >
>> > That is POSIX compatible form.
>> >
>>
>> Sure
>>
>> > Just got around trying to test this in NUC7 so looking into this in
>> > more detail.
>>
>> Thanks. Could you please check if this version works for you?
>>
>> https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
>>
>> >
>> > That said can you make the script work with just "#!/usr/bin/env sh"
>> > and make sure that it is busybox ash compatible?
>>
>> Yes.
>>
>> >
>> > I don't see any necessity to make this bash only and it adds to the
>> > compilation time of the image. Otherwise lot of this could be tested
>> > just with qemu+bzImage+busybox(inside initramfs).
>> >
>>
>> will still need cgroup-tools as you pointed out later. Compiling from
>> its
>> upstream code OK?
>
> Can you explain why you need it?
>
> What is the thing you cannot do without it?
>
> BR, Jarkko
>
I did not find a nice way to start a process in a specified cgroup like
cgexec does. I could wrap the test in a shell: move the current shell to a
new cgroup then do exec to run the test app. But that seems cumbersome as
I need to spawn many shells and check their results. Another option is to
create my own cgexec, which I'm sure will be very similar to the cgexec code.
Was not sure how far we want to go with this.
I can experiment with the shell wrapper idea and see how bad it can be and
fall back to implement cgexec? Open to suggestions.
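For reference, the wrapper idea can be sketched in a few lines (hypothetical code, not the actual patch; `run_in_cgroup` is a made-up name, and it assumes a cgroup v2 `cgroup.procs` interface file):

```shell
#!/usr/bin/env sh
# Hypothetical cgexec replacement: a fresh shell writes its own PID
# into the target cgroup's cgroup.procs, then replaces itself with the
# test command via exec, so the command runs entirely inside the cgroup.
run_in_cgroup() {
    cg_dir=$1
    shift
    sh -c 'echo $$ > "$0/cgroup.procs" && exec "$@"' "$cg_dir" "$@" &
    # The background job's PID is left in $! for the caller to wait on.
}
```

The caller can then `wait $!` and check the exit status, much like wait_check_process_status() does for cgexec-started processes.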
Thanks
Haitao
On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote:
> I did not find a nice way to start a process in a specified cgroup like
> cgexec does. I could wrap the test in a shell: move the current shell to a
> new cgroup then do exec to run the test app. But that seems cumbersome as
> I need to spawn many shells and check their results. Another option is to
> create my own cgexec, which I'm sure will be very similar to the cgexec code.
> Was not sure how far we want to go with this.
>
> I can experiment with the shell wrapper idea and see how bad it can be and
> fall back to implement cgexec? Open to suggestions.
I guess there's only a few variants of how cgexec is invoked, right?
The first thing we need to do is figure out what exact steps those
variants perform.
After that we can decide how to implement exactly those variants.
E.g., whether we need to perform any syscalls or whether this can all be
done through sysfs affects somewhat how to proceed.
Right now if I want to build e.g. an image with BuildRoot I'd need
to patch the existing recipe to add new dependencies in order to
execute these tests, and probably do the same for every project
that can package selftests into an image.
BR, Jarkko
On Sun, 31 Mar 2024 11:19:04 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> I guess there's only few variants of how cgexec is invoked right?
>
> The first thing we need to do is what exact steps those variants
> perform.
>
> After that we an decide how to implement exactly those variants.
>
> E.g. without knowing do we need to perform any syscalls or can
> this all done through sysfs affects somewhat how to proceed.
>
> Right now if I want to build e.g. image with BuildRoot I'd need
> to patch the existing recipe to add new dependencies in order to
> execute these tests, and probably do the same for every project
> that can package selftests to image.
>
> BR, Jarkko
>
Can be done through sysfs without syscalls. I implemented the wrapper
shell and it looks not as bad as I expected.
Will send an add-on patch with that change and other changes to address
your comments so far for the test scripts.
If we agree on the approach, I'll squash it with this one later.
Thanks
Haitao
On Sun Mar 31, 2024 at 8:35 PM EEST, Haitao Huang wrote:
> Can be done through sysfs without syscalls. I implemented the wrapper
> shell and it looks not as bad as I expected.
> Will send an add-on patch with that change and other changes to address
> your comments so far for the test scripts.
> If we agree on the approach, I'll squash it with this one later.
So think of it this way: by having these open coded we can check in detail
that the feature works exactly how we would like.
Those tools add an abstraction layer which makes it harder to evaluate
this as a whole.
Also, open coding removes all possible issues related to different
versions of 3rd party tools and their possible incompatibilities.
If pure bash for some reason would turn out to be unmanageable, nothing
prevents us from e.g. implementing a small tool in C or Python that does the
tasks required. E.g. for testing TPM2 chips I implemented a custom TPM2
user space just for kselftest, and in SGX we implemented "bare metal"
enclave implementation with the same rationale.
BR, Jarkko
Hello.
On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen <[email protected]> wrote:
> Also in kselftest we don't drive ultimate simplicity, we drive
> efficient CI/QA. By open coding something like subset of
> cgroup-tools needed to run the test you also help us later
> on to backtrack the kernel changes. With cgroup-tools you
> would have to use strace to get the same info.
FWIW, see also functions in
tools/testing/selftests/cgroup/cgroup_util.{h,c}.
They likely cover what you need already -- if the tests are in C.
(I admit that stuff in tools/testing/selftests/cgroup/ is best
understood with strace.)
HTH,
Michal
On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> Hello.
>
> FWIW, see also functions in
> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> They likely cover what you need already -- if the tests are in C.
>
> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> understood with strace.)
Thanks!
My conclusions are that:
1. We probably cannot move the test to be part of the cgroup tests
themselves, given the enclave payload dependency.
2. I think it makes sense to still follow the same pattern as the
other cgroups tests and re-use the cgroup_util.[ch] functionality.
So yeah I guess we need two test programs instead of one.
Something along the lines:
1. main.[ch] -> test_sgx.[ch]
2. introduce test_sgx_cgroup.c
And test_sgx_cgroup.c would implement a similar test as the shell
script and would follow the structure of the existing cgroups tests.
>
> HTH,
> Michal
BR, Jarkko
On 3/30/24 04:23, Jarkko Sakkinen wrote:
>>> I also wonder if the cgroup-tools dependency is absolutely required or could
>>> you just have a function that would interact with sysfs?
>> I should have checked email before hitting the send button for v10 :-).
>>
>> It'd be more complicated and less readable to do all the stuff without the
>> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends
>> on libc so I hope this would not cause too much inconvenience.
> As per cgroup-tools, please prove this. It makes the job for more
> complicated *for you* and you are making the job more complicated
> to every possible person in the planet running any kernel QA.
I don't see any other use of cgroup-tools in testing/selftests.
I *DO* see a ton of /bin/bash use though. I wouldn't go to much trouble
to make the thing ash-compatible.
That said, the most important thing is to get some selftests in place.
If using cgroup-tools means we get actual, runnable tests in place,
that's a heck of a lot more important than making them perfect.
Remember, almost nobody uses SGX. It's available on *VERY* few systems
from one CPU vendor and only in very specific hardware configurations.
On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
>> Hello.
>>
>> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen
>> <[email protected]> wrote:
>> > > > It'd be more complicated and less readable to do all the stuff
>> without the
>> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only
>> depends
>> > > > on libc so I hope this would not cause too much inconvenience.
>> > >
>> > > As per cgroup-tools, please prove this. It makes the job for more
>> > > complicated *for you* and you are making the job more complicated
>> > > to every possible person in the planet running any kernel QA.
>> > >
>> > > I weight the latter more than the former. And it is exactly the
>> > > reason why we did custom user space kselftest in the first place.
>> > > Let's keep the tradition. All I can say is that kselftest is
>> > > unfinished in its current form.
>> > >
>> > > What is "esp cgexec"?
>> >
>> > Also in kselftest we don't drive ultimate simplicity, we drive
>> > efficient CI/QA. By open coding something like a subset of
>> > cgroup-tools needed to run the test you also help us later
>> > on to backtrack the kernel changes. With cgroup-tools you
>> > would have to use strace to get the same info.
>>
>> FWIW, see also functions in
>> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
>> They likely cover what you need already -- if the tests are in C.
>>
>> (I admit that stuff in tools/testing/selftests/cgroup/ is best
>> understood with strace.)
>
> Thanks!
>
> My conclusions are that:
>
> 1. We probably cannot make the test part of the cgroup tests
> themselves, given the enclave payload dependency.
> 2. I think it makes sense to still follow the same pattern as
> the other cgroups tests and re-use the cgroup_util.[ch] functionality.
>
> So yeah I guess we need two test programs instead of one.
>
> Something along these lines:
>
> 1. main.[ch] -> test_sgx.[ch]
> 2. introduce test_sgx_cgroup.c
>
> And test_sgx_cgroup.c would implement a similar test as the shell
> script and would follow the structure of the existing cgroups tests.
>
>>
>> HTH,
>> Michal
>
> BR, Jarkko
>
Do we really want to have it implemented in C? There are far fewer lines
of code in shell scripts. Note we are not really testing basic cgroup
stuff. All we need is to create/delete cgroups and set limits, which I
think the ash scripts have now demonstrated to be feasible.
Given Dave's comments, and that the test scripts are working and cover the
needed cases IMHO, I don't see much need to move to C code. I can add more
cases if needed and fall back to a C implementation later if any case
can't be implemented in scripts. How about that?
Haitao
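The cgroup operations Haitao lists (create/delete a group, set a limit) can be sketched as a few ash-compatible helpers. This is only an illustrative sketch: the function names and the CG_ROOT default are hypothetical, not taken from the actual selftest scripts.

```shell
# Illustrative ash-compatible helpers; names and CG_ROOT default are
# hypothetical, not from the actual selftest scripts.
CG_ROOT="${CG_ROOT:-/sys/fs/cgroup}"

# Create a child cgroup under the cgroup v2 root.
cg_create() {
	mkdir -p "$CG_ROOT/$1"
}

# Set a hard EPC limit (in bytes) via the misc controller's 'sgx_epc' key.
cg_set_epc_limit() {
	echo "sgx_epc $2" > "$CG_ROOT/$1/misc.max"
}

# Remove an (empty) cgroup.
cg_delete() {
	rmdir "$CG_ROOT/$1"
}
```

With CG_ROOT overridable, the same helpers can also be exercised against a scratch directory when no cgroup v2 hierarchy is available.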
On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang <[email protected]> wrote:
> Do we really want to have it implemented in C?
I only pointed to the available C boilerplate.
> There are far fewer lines of
> code in shell scripts. Note we are not really testing basic cgroup stuff.
> All we need is to create/delete cgroups and set limits, which I think
> the ash scripts have now demonstrated to be feasible.
I assume you refer to
Message-Id: <[email protected]>
right?
Could it be even simpler if you didn't stick to cgtools APIs and v1
compatibility?
Reducing ash_cgexec.sh to something like
echo 0 >$R/$1/cgroup.procs
shift
exec "$@"
(with some small boilerplate for $R and the previous mkdirs)
Thanks,
Michal
On Tue, 02 Apr 2024 12:40:03 -0500, Michal Koutný <[email protected]> wrote:
> On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang
> <[email protected]> wrote:
>> Do we really want to have it implemented in C?
>
> I only pointed to the available C boilerplate.
>
>> There are far fewer lines of code in shell scripts. Note we are not
>> really testing basic cgroup stuff. All we need is to create/delete
>> cgroups and set limits, which I think the ash scripts have now
>> demonstrated to be feasible.
>
> I assume you refer to
> Message-Id: <[email protected]>
> right?
>
> Could it be even simpler if you didn't stick to cgtools APIs and v1
> compatibility?
>
> Reducing ash_cgexec.sh to something like
> echo 0 >$R/$1/cgroup.procs
> shift
> exec "$@"
> (with some small boilerplate for $R and the previous mkdirs)
>
Yes, thanks for the suggestion.
Haitao
On Tue Apr 2, 2024 at 6:42 PM EEST, Dave Hansen wrote:
> On 3/30/24 04:23, Jarkko Sakkinen wrote:
> >>> I also wonder is cgroup-tools dependency absolutely required or could
> >>> you just have a function that would interact with sysfs?
> >> I should have checked email before hitting the send button for v10.
> >>
> >> It'd be more complicated and less readable to do all the stuff without the
> >> cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only depends
> >> on libc so I hope this would not cause too much inconvenience.
> > As per cgroup-tools, please prove this. It makes the job more
> > complicated *for you* and you are making the job more complicated
> > for every possible person on the planet running any kernel QA.
>
> I don't see any other use of cgroup-tools in testing/selftests.
>
> I *DO* see a ton of /bin/bash use though. I wouldn't go to too much
> trouble to make the thing ash-compatible.
>
> That said, the most important thing is to get some selftests in place.
> If using cgroup-tools means we get actual, runnable tests in place,
> that's a heck of a lot more important than making them perfect.
> Remember, almost nobody uses SGX. It's available on *VERY* few systems
> from one CPU vendor and only in very specific hardware configurations.
Ash-compatible is good enough for me, so let's draw the line there.
Ash-compatibility does not cause any major hurdle, as can be seen from
Haitao's patch. The earlier version was not even POSIX-compatible, given
that it used a hard-coded path.
Most of the added code comes from open coding the tools, but in test
code that is not a big deal, and it helps with debugging in the future.
Even right now it helps with reviewing kernel patches because it documents
exactly how the feature is seen from user space.
BR, Jarkko
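As a rough illustration of that user-space view (the group name and limit value below are made up; the misc controller files are as described in the cover letter):

```sh
# Hypothetical example; requires root and a mounted cgroup v2 hierarchy.
mkdir /sys/fs/cgroup/sgx_test                                # new cgroup
echo "sgx_epc 2097152" > /sys/fs/cgroup/sgx_test/misc.max    # 2 MiB EPC cap
cat /sys/fs/cgroup/sgx_test/misc.current                     # per-cgroup usage
cat /sys/fs/cgroup/misc.capacity                             # total EPC capacity
```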
On Tue Apr 2, 2024 at 7:20 PM EEST, Haitao Huang wrote:
> On Tue, 02 Apr 2024 06:58:40 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Tue Apr 2, 2024 at 2:23 PM EEST, Michal Koutný wrote:
> >> Hello.
> >>
> >> On Sat, Mar 30, 2024 at 01:26:08PM +0200, Jarkko Sakkinen
> >> <[email protected]> wrote:
> >> > > > It'd be more complicated and less readable to do all the stuff
> >> without the
> >> > > > cgroup-tools, esp cgexec. I checked dependency, cgroup-tools only
> >> depends
> >> > > > on libc so I hope this would not cause too much inconvenience.
> >> > >
> >> > > As per cgroup-tools, please prove this. It makes the job more
> >> > > complicated *for you* and you are making the job more complicated
> >> > > for every possible person on the planet running any kernel QA.
> >> > >
> >> > > I weigh the latter more than the former. And it is exactly the
> >> > > reason why we did custom user space kselftest in the first place.
> >> > > Let's keep the tradition. All I can say is that kselftest is
> >> > > unfinished in its current form.
> >> > >
> >> > > What is "esp cgexec"?
> >> >
> >> > Also in kselftest we don't drive ultimate simplicity, we drive
> >> > efficient CI/QA. By open coding something like a subset of
> >> > cgroup-tools needed to run the test you also help us later
> >> > on to backtrack the kernel changes. With cgroup-tools you
> >> > would have to use strace to get the same info.
> >>
> >> FWIW, see also functions in
> >> tools/testing/selftests/cgroup/cgroup_util.{h,c}.
> >> They likely cover what you need already -- if the tests are in C.
> >>
> >> (I admit that stuff in tools/testing/selftests/cgroup/ is best
> >> understood with strace.)
> >
> > Thanks!
> >
> > My conclusions are that:
> >
> > 1. We probably cannot make the test part of the cgroup tests
> > themselves, given the enclave payload dependency.
> > 2. I think it makes sense to still follow the same pattern as
> > the other cgroups tests and re-use the cgroup_util.[ch] functionality.
> >
> > So yeah I guess we need two test programs instead of one.
> >
> > Something along these lines:
> >
> > 1. main.[ch] -> test_sgx.[ch]
> > 2. introduce test_sgx_cgroup.c
> >
> > And test_sgx_cgroup.c would implement a similar test as the shell
> > script and would follow the structure of the existing cgroups tests.
> >
> >>
> >> HTH,
> >> Michal
> >
> > BR, Jarkko
> >
> Do we really want to have it implemented in C? There are far fewer lines
> of code in shell scripts. Note we are not really testing basic cgroup
> stuff. All we need is to create/delete cgroups and set limits, which I
> think the ash scripts have now demonstrated to be feasible.
>
> Given Dave's comments, and that the test scripts are working and cover the
> needed cases IMHO, I don't see much need to move to C code. I can add more
> cases if needed and fall back to a C implementation later if any case
> can't be implemented in scripts. How about that?
We can settle on: ash + no dependencies. I guess you already have all
the work done for that.
BR, Jarkko
On Tue Apr 2, 2024 at 8:40 PM EEST, Michal Koutný wrote:
> On Tue, Apr 02, 2024 at 11:20:21AM -0500, Haitao Huang <[email protected]> wrote:
> > Do we really want to have it implemented in C?
>
> I only pointed to the available C boilerplate.
>
> > There are far fewer lines of code in shell scripts. Note we are not
> > really testing basic cgroup stuff. All we need is to create/delete
> > cgroups and set limits, which I think the ash scripts have now
> > demonstrated to be feasible.
>
> I assume you refer to
> Message-Id: <[email protected]>
> right?
>
> Could it be even simpler if you didn't stick to cgtools APIs and v1
> compatibility?
>
> Reducing ash_cgexec.sh to something like
> echo 0 >$R/$1/cgroup.procs
> shift
> exec "$@"
> (with some small boilerplate for $R and the previous mkdirs)
I already asked about the necessity of v1 in another response, and fully
support this idea. Then cgexec can simply be a function wrapping along
the lines of what you proposed.
BR, Jarkko
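A sketch of what that wrapper function might look like, following Michal's reduction; CG_ROOT and the name cg_exec are placeholders of mine, not from the actual patch.

```shell
# Hypothetical sketch of cgexec as a plain shell function, following
# Michal's reduction; CG_ROOT and cg_exec are illustrative placeholders.
CG_ROOT="${CG_ROOT:-/sys/fs/cgroup}"

cg_exec() {
	grp="$1"
	shift
	# Writing 0 to cgroup.procs moves the calling process itself
	# into the cgroup; then exec replaces it with the payload.
	echo 0 > "$CG_ROOT/$grp/cgroup.procs"
	exec "$@"
}
```

Because it ends in exec, the function is meant to be called from a subshell (or as the last statement of a wrapper script), exactly like the original ash_cgexec.sh.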