SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
This patchset implements support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and the number of times the limit was hit per
cgroup, as well as the total system capacity.
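For illustration, with the misc controller enabled for a child cgroup (the
cgroup name and limit value below are only examples), the interface can be
used roughly as follows:
  # cap the cgroup at 64 MiB of EPC (format: "<resource> <max>")
  $ echo "sgx_epc 67108864" > /sys/fs/cgroup/pod-1/misc.max
  # read back current usage and the number of times the limit was hit
  $ grep sgx_epc /sys/fs/cgroup/pod-1/misc.current
  $ grep sgx_epc /sys/fs/cgroup/pod-1/misc.events
  # total EPC available on the system
  $ grep sgx_epc /sys/fs/cgroup/misc.capacity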
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design.
Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------
tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or adding new
LRUs. The result: add a new LRU for each cgroup and scan those instead.
Replace the existing global LRU with the root cgroup's LRU (only when this
new support is compiled in, obviously).
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
Even if unspeakably slow reclaim were accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
Unreclaimable Enclave Pages
---------------------------
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver's global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach: it does
not track these pages in per-cgroup LRUs and considers them unreclaimable.
The allocation of these pages is still counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
after enclaves have launched. However, subsequent discussions [3, 4]
indicated that such preemptive enforcement is not necessary for the misc
controller. Therefore, reclaiming SECS/VA pages by force-killing enclaves
was removed, and the limit is only enforced at the time of a new EPC
allocation request.
When a cgroup hits its limit but nothing is left in the LRUs of its
subtree, i.e., there is nothing to reclaim in the cgroup, any new attempt
to allocate EPC within that cgroup results in ENOMEM.
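From userspace, this shows up as an allocation failure in the enclave
application, and the limit hits are visible in the cgroup's 'misc.events'
file, e.g. (cgroup path and count below are only illustrative):
  $ grep sgx_epc /sys/fs/cgroup/pod-1/misc.events
  sgx_epc.max 3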
Unreclaimable Guest VM EPC Pages
--------------------------------
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable is left within the cgroup. The virtual EPC driver translates
the ENOMEM error resulting from an EPC allocation request into a SIGBUS to
the user process, exactly the same way it handles the host running out of
physical EPC.
This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-16], simplified the
implementation and design, added selftest scripts, and fixed some
stability issues found during testing.
Thanks to all for the review/test/tags/feedback provided on the previous
versions.
I appreciate your further reviewing/testing and providing tags if
appropriate.
---
V12:
- Integrate test scripts to kselftests "run_tests" target. (Jarkko)
- Remove CGROUP_SGX_EPC kconfig, conditionally compile with CGROUP_MISC enabled. (Jarkko)
- Explain in the changelog for patch #2 why 'struct misc_cg *cg' is taken as a parameter
instead of 'struct misc_res *res'. (Kai)
- Remove "unlikely" in patch #2 (Kai)
V11:
- Update copyright years and use c style (Kai)
- Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
(Jarkko, Michal)
- Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
- Revise commit message for patch #8 to clarify design rationale (Kai)
- Print error instead of WARN for init failure. (Kai)
- Add check for need to queue an async reclamation before returning from
sgx_cgroup_try_charge(), do so if needed.
V10:
- Use enum instead of boolean for the 'reclaim' parameters in
sgx_alloc_epc_page(). (Dave, Jarkko)
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
- Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
- Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
(Kai)
- Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
- Does not return a value for callers to check.
- Its usage pattern is similar to that of sgx_reclaim_pages() now
- Add cond_resched() in the loop in the cgroup reclaimer to improve
liveness.
- Add logic for cgroup level reclamation in sgx_reclaim_direct()
- Restructure V9 patches 7-10 to make them flow better. (Kai)
- Disable cgroup if workqueue allocation failed during init. (Kai)
- Shorten names for EPC cgroup functions, structures and variables.
(Jarkko)
- Separate out a helper for addressing a single iteration of the loop in
sgx_cgroup_try_charge(). (Jarkko)
- More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)
V9:
- Add comments for static variables outside functions. (Jarkko)
- Remove unnecessary ifs. (Tim)
- Add more Reviewed-By: tags from Jarkko and TJ.
V8:
- Style fixes. (Jarkko)
- Abstract _misc_res_free/alloc() (Jarkko)
- Remove unneeded NULL checks. (Jarkko)
V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)
[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/[email protected]/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/[email protected]/
[8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
[9]v4: https://lore.kernel.org/all/[email protected]/
[10]v5: https://lore.kernel.org/all/[email protected]/
[11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
[12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[14]v9: https://lore.kernel.org/lkml/[email protected]/T/
[15]v10: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[16]v11: https://lore.kernel.org/lkml/[email protected]/
Haitao Huang (3):
x86/sgx: Replace boolean parameters with enums
x86/sgx: Charge mem_cgroup for per-cgroup reclamation
selftests/sgx: Add scripts for EPC cgroup testing
Kristen Carlson Accardi (9):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Abstract tracking reclaimable pages in LRU
x86/sgx: Add basic EPC reclamation flow for cgroup
x86/sgx: Implement async reclamation for cgroup
x86/sgx: Abstract check for global reclaimable pages
x86/sgx: Turn on per-cgroup EPC reclamation
Sean Christopherson (2):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
Docs/x86/sgx: Add description for cgroup support
Documentation/arch/x86/sgx.rst | 83 +++++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 41 +--
arch/x86/kernel/cpu/sgx/encl.h | 7 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 312 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 101 ++++++
arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
arch/x86/kernel/cpu/sgx/main.c | 205 +++++++++---
arch/x86/kernel/cpu/sgx/sgx.h | 50 ++-
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
include/linux/misc_cgroup.h | 41 +++
kernel/cgroup/misc.c | 107 ++++--
tools/testing/selftests/sgx/Makefile | 3 +-
tools/testing/selftests/sgx/README | 116 +++++++
tools/testing/selftests/sgx/ash_cgexec.sh | 16 +
.../selftests/sgx/run_epc_cg_selftests.sh | 283 ++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 11 +
17 files changed, 1284 insertions(+), 105 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100644 tools/testing/selftests/sgx/README
create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
--
2.25.1
Replace boolean parameters for 'reclaim' in the function
sgx_alloc_epc_page() and its callers with an enum.
Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo in its comment.
Signed-off-by: Haitao Huang <[email protected]>
Suggested-by: Jarkko Sakkinen <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.c | 12 ++++++------
arch/x86/kernel/cpu/sgx/encl.h | 4 ++--
arch/x86/kernel/cpu/sgx/ioctl.c | 10 +++++-----
arch/x86/kernel/cpu/sgx/main.c | 14 +++++++-------
arch/x86/kernel/cpu/sgx/sgx.h | 13 +++++++++++--
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
6 files changed, 32 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
- epc_page = sgx_alloc_epc_page(encl_page, false);
+ epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return epc_page;
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
goto err_out_unlock;
}
- epc_page = sgx_alloc_epc_page(encl_page, false);
+ epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
goto err_out_unlock;
}
- va_page = sgx_encl_grow(encl, false);
+ va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
/**
* sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- * mutex should not be held if this is set.
+ * @reclaim: Whether reclaim EPC pages directly if none available. Enclave
+ * mutex should not be held for SGX_DO_RECLAIM.
*
* Allocate a free EPC page and convert it to a Version Array (VA) page.
*
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
* a VA page,
* -errno otherwise
*/
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
{
struct sgx_epc_page *epc_page;
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
unsigned long offset,
u64 secinfo_flags);
void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
bool sgx_va_page_full(struct sgx_va_page *va_page);
void sgx_encl_free_epc_page(struct sgx_epc_page *page);
struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim);
void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
#include "encl.h"
#include "encls.h"
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim)
{
struct sgx_va_page *va_page = NULL;
void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
struct file *backing;
long ret;
- va_page = sgx_encl_grow(encl, true);
+ va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
encl->backing = backing;
- secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+ secs_epc = sgx_alloc_epc_page(&encl->secs, SGX_DO_RECLAIM);
if (IS_ERR(secs_epc)) {
ret = PTR_ERR(secs_epc);
goto err_out_backing;
@@ -269,13 +269,13 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
if (IS_ERR(encl_page))
return PTR_ERR(encl_page);
- epc_page = sgx_alloc_epc_page(encl_page, true);
+ epc_page = sgx_alloc_epc_page(encl_page, SGX_DO_RECLAIM);
if (IS_ERR(epc_page)) {
kfree(encl_page);
return PTR_ERR(epc_page);
}
- va_page = sgx_encl_grow(encl, true);
+ va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page)) {
ret = PTR_ERR(va_page);
goto err_out_free;
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..d219f14365d4 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -463,14 +463,14 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
/**
* __sgx_alloc_epc_page() - Allocate an EPC page
*
- * Iterate through NUMA nodes and reserve ia free EPC page to the caller. Start
+ * Iterate through NUMA nodes and reserve a free EPC page to the caller. Start
* from the NUMA node, where the caller is executing.
*
* Return:
* - an EPC page: A borrowed EPC pages were available.
* - NULL: Out of EPC pages.
*/
-struct sgx_epc_page *__sgx_alloc_epc_page(void)
+static struct sgx_epc_page *__sgx_alloc_epc_page(void)
{
struct sgx_epc_page *page;
int nid_of_current = numa_node_id();
@@ -542,12 +542,12 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
/**
* sgx_alloc_epc_page() - Allocate an EPC page
* @owner: the owner of the EPC page
- * @reclaim: reclaim pages if necessary
+ * @reclaim: whether reclaim pages if necessary
*
* Iterate through EPC sections and borrow a free EPC page to the caller. When a
* page is no longer needed it must be released with sgx_free_epc_page(). If
- * @reclaim is set to true, directly reclaim pages when we are out of pages. No
- * mm's can be locked when @reclaim is set to true.
+ * @reclaim is set to SGX_DO_RECLAIM, directly reclaim pages when we are out of
+ * pages. No mm's can be locked for SGX_DO_RECLAIM.
*
* Finally, wake up ksgxd when the number of pages goes below the watermark
* before returning back to the caller.
@@ -556,7 +556,7 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
* an EPC page,
* -errno on error
*/
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
{
struct sgx_epc_page *page;
@@ -570,7 +570,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
if (list_empty(&sgx_active_page_list))
return ERR_PTR(-ENOMEM);
- if (!reclaim) {
+ if (reclaim == SGX_NO_RECLAIM) {
page = ERR_PTR(-EBUSY);
break;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..ca34cd4f58ac 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,6 +29,16 @@
/* Pages on free list */
#define SGX_EPC_PAGE_IS_FREE BIT(1)
+/**
+ * enum sgx_reclaim - Whether EPC reclamation is allowed within a function.
+ * %SGX_NO_RECLAIM: Do not reclaim EPC pages.
+ * %SGX_DO_RECLAIM: Reclaim EPC pages as needed.
+ */
+enum sgx_reclaim {
+ SGX_NO_RECLAIM,
+ SGX_DO_RECLAIM
+};
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
@@ -83,13 +93,12 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
-struct sgx_epc_page *__sgx_alloc_epc_page(void);
void sgx_free_epc_page(struct sgx_epc_page *page);
void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
-struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
void sgx_ipi_cb(void *info);
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 7aaa3652e31d..e7fdc3a9abae 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -46,7 +46,7 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
if (epc_page)
return 0;
- epc_page = sgx_alloc_epc_page(vepc, false);
+ epc_page = sgx_alloc_epc_page(vepc, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return PTR_ERR(epc_page);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or ancestor's limit. This requires a walk from the current cgroup up
to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.
The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V6:
- Make commit messages more concise and split the original patch into two (Kai)
---
include/linux/misc_cgroup.h | 24 ++++++++++++++++++++++++
kernel/cgroup/misc.c | 21 ++++++++-------------
2 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
};
+struct misc_cg *misc_cg_root(void);
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
}
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
/*
* get_current_misc_cg() - Find and get the misc cgroup of the current task.
*
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
}
#else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+ return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+ return NULL;
+}
static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
{
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4a602d68cf7d..7d852139121a 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
/**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
*/
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
{
- return cgroup ? css_misc(cgroup->css.parent) : NULL;
+ return &root_cg;
}
+EXPORT_SYMBOL_GPL(misc_cg_root);
/**
* valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!amount)
return 0;
- for (i = cg; i; i = parent_misc(i)) {
+ for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];
new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
return 0;
err_charge:
- for (j = i; j; j = parent_misc(j)) {
+ for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
- for (j = cg; j != i; j = parent_misc(j))
+ for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
- for (i = cg; i; i = parent_misc(i))
+ for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
}
EXPORT_SYMBOL_GPL(misc_cg_uncharge);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for cgroup subsystem state (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.
Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.
This design passes a struct misc_cg into the callbacks. An alternative,
passing only a struct misc_res, was considered for encapsulating how
misc_cg stores its misc_res array. However, the SGX driver would still
need to access the misc_res array in the statically defined misc root_cg
during initialization to set resource-specific fields. Therefore, some
extra getter/setter APIs would be needed to abstract such access to the
misc_res array within the misc_cg struct. That seems overkill at the
moment and is deferred to a later improvement when there are more users
of these callbacks.
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V12:
- Add comments in commit to clarify reason to pass in misc_cg, not
misc_res. (Kai)
- Remove unlikely (Kai)
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)
V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
include/linux/misc_cgroup.h | 11 +++++
kernel/cgroup/misc.c | 82 +++++++++++++++++++++++++++++++++----
2 files changed, 86 insertions(+), 7 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
#include <linux/cgroup.h>
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+ int (*alloc)(struct misc_cg *cg);
+ void (*free)(struct misc_cg *cg);
+};
+
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..4a602d68cf7d 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
*/
static u64 misc_res_capacity[MISC_CG_RES_TYPES];
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
/**
* parent_misc() - Get the parent of the passed misc cgroup.
* @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
}
EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ if (!ops->alloc) {
+ pr_err("%s: alloc missing\n", __func__);
+ return -EINVAL;
+ }
+
+ if (!ops->free) {
+ pr_err("%s: free missing\n", __func__);
+ return -EINVAL;
+ }
+
+ misc_res_ops[type] = ops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
/**
* misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
* @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
};
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+ int ret;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ WRITE_ONCE(cg->res[i].max, MAX_NUM);
+ atomic64_set(&cg->res[i].usage, 0);
+ if (misc_res_ops[i]) {
+ ret = misc_res_ops[i]->alloc(cg);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+}
+
/**
* misc_cg_alloc() - Allocate misc cgroup.
* @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
static struct cgroup_subsys_state *
misc_cg_alloc(struct cgroup_subsys_state *parent_css)
{
- enum misc_res_type i;
- struct misc_cg *cg;
+ struct misc_cg *parent_cg, *cg;
+ int ret;
if (!parent_css) {
- cg = &root_cg;
+ parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+ parent_cg = css_misc(parent_css);
}
- for (i = 0; i < MISC_CG_RES_TYPES; i++) {
- WRITE_ONCE(cg->res[i].max, MAX_NUM);
- atomic64_set(&cg->res[i].usage, 0);
+ ret = _misc_cg_res_alloc(cg);
+ if (ret) {
+ _misc_cg_res_free(cg);
+ if (likely(parent_css))
+ kfree(cg);
+
+ return ERR_PTR(ret);
}
return &cg->css;
@@ -410,7 +475,10 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
*/
static void misc_cg_free(struct cgroup_subsys_state *css)
{
- kfree(css_misc(css));
+ struct misc_cg *cg = css_misc(css);
+
+ _misc_cg_res_free(cg);
+ kfree(cg);
}
/* Cgroup controller callbacks */
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC (Jarkko)
V6:
- Split the original patch into this and the preceding one (Kai)
---
include/linux/misc_cgroup.h | 4 ++++
kernel/cgroup/misc.c | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..440ed2bb8053 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_X86_SGX
+ /* SGX EPC memory resource */
+ MISC_CG_RES_SGX_EPC,
#endif
MISC_CG_RES_TYPES
};
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 7d852139121a..863f9147602b 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
#endif
+#ifdef CONFIG_X86_SGX
+ /* Intel SGX EPC memory bytes */
+ "sgx_epc",
+#endif
};
/* Root misc cgroup */
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
The functions sgx_{mark,unmark}_page_reclaimable() manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page into the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard-coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.
Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard coded global LRU with a call to this helper.
Subsequent patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd, before
finally switching sgx_lru_list() to return the per-cgroup LRU.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 30 ++++++++++++++++++------------
1 file changed, 18 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b782207d41b6..552455365761 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
*/
static struct sgx_epc_lru_list sgx_global_lru;
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+ return &sgx_global_lru;
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
}
/**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU.
* @page: EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ list_add_tail(&page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
}
/**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
* @page: EPC page
*
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
*
* Return:
* 0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return 0;
}
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.
Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in the SGX driver as the "sgx_epc" resource provider:
- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.
With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".
For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list and are only reclaimed by the
global reclaimer when the total free page count drops below a threshold.
Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)
V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)
V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)
V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)
V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 72 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 72 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/main.c | 43 ++++++++++++++++-
arch/x86/kernel/cpu/sgx/sgx.h | 21 ++++++++
include/linux/misc_cgroup.h | 2 +
6 files changed, 209 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..400baa7cfb69 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,6 +1,7 @@
obj-y += \
driver.o \
encl.o \
+ epc_cgroup.o \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..ff4d4a25dbe7
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg: The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+ return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @sgx_cg: The charged sgx cgroup.
+ */
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg)
+{
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_cgroup_free(struct misc_cg *cg)
+{
+ struct sgx_cgroup *sgx_cg;
+
+ sgx_cg = sgx_cgroup_from_misc_cg(cg);
+ if (!sgx_cg)
+ return;
+
+ kfree(sgx_cg);
+}
+
+static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
+{
+ cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
+ sgx_cg->cg = cg;
+}
+
+static int sgx_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_cgroup *sgx_cg;
+
+ sgx_cg = kzalloc(sizeof(*sgx_cg), GFP_KERNEL);
+ if (!sgx_cg)
+ return -ENOMEM;
+
+ sgx_cgroup_misc_init(cg, sgx_cg);
+
+ return 0;
+}
+
+const struct misc_res_ops sgx_cgroup_ops = {
+ .alloc = sgx_cgroup_alloc,
+ .free = sgx_cgroup_free,
+};
+
+void sgx_cgroup_init(void)
+{
+ misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
+ sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..bd9606479e67
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SGX_EPC_CGROUP_H_
+#define _SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/misc_cgroup.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_MISC
+
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_cgroup;
+
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+ return NULL;
+}
+
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
+
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+ return 0;
+}
+
+static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+
+static inline void sgx_cgroup_init(void) { }
+
+#else /* CONFIG_CGROUP_MISC */
+
+struct sgx_cgroup {
+ struct misc_cg *cg;
+};
+
+static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call
+ * sgx_put_cg() to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_cgroup *sgx_get_current_cg(void)
+{
+ /* get_current_misc_cg() never returns NULL when Kconfig enabled */
+ return sgx_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_cg() - Put the EPC cgroup and reduce its ref count.
+ * @sgx_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
+{
+ put_misc_cg(sgx_cg->cg);
+}
+
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_init(void);
+
+#endif /* CONFIG_CGROUP_MISC */
+
+#endif /* _SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d219f14365d4..d482ae7fdabf 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"
struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
static int sgx_nr_epc_sections;
@@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
*/
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
{
+ struct sgx_cgroup *sgx_cg;
struct sgx_epc_page *page;
+ int ret;
+
+ sgx_cg = sgx_get_current_cg();
+ ret = sgx_cgroup_try_charge(sgx_cg);
+ if (ret) {
+ sgx_put_cg(sgx_cg);
+ return ERR_PTR(ret);
+ }
for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
- if (list_empty(&sgx_active_page_list))
- return ERR_PTR(-ENOMEM);
+ if (list_empty(&sgx_active_page_list)) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }
if (reclaim == SGX_NO_RECLAIM) {
page = ERR_PTR(-EBUSY);
@@ -584,6 +597,15 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
cond_resched();
}
+ if (!IS_ERR(page)) {
+ WARN_ON_ONCE(sgx_epc_page_get_cgroup(page));
+ /* sgx_put_cg() in sgx_free_epc_page() */
+ sgx_epc_page_set_cgroup(page, sgx_cg);
+ } else {
+ sgx_cgroup_uncharge(sgx_cg);
+ sgx_put_cg(sgx_cg);
+ }
+
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);
@@ -602,8 +624,16 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
void sgx_free_epc_page(struct sgx_epc_page *page)
{
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+ struct sgx_cgroup *sgx_cg = sgx_epc_page_get_cgroup(page);
struct sgx_numa_node *node = section->node;
+ /* sgx_cg could be NULL if called from __sgx_sanitize_pages() */
+ if (sgx_cg) {
+ sgx_cgroup_uncharge(sgx_cg);
+ sgx_put_cg(sgx_cg);
+ sgx_epc_page_set_cgroup(page, NULL);
+ }
+
spin_lock(&node->lock);
page->owner = NULL;
@@ -643,6 +673,8 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].owner = NULL;
section->pages[i].poison = 0;
+ sgx_epc_page_set_cgroup(&section->pages[i], NULL);
+
list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
}
@@ -787,6 +819,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -837,6 +870,7 @@ static bool __init sgx_page_cache_init(void)
sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;
sgx_nr_epc_sections++;
}
@@ -846,6 +880,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}
+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}
@@ -942,6 +978,9 @@ static int __init sgx_init(void)
if (sgx_vepc_init() && ret)
goto err_provision;
+ /* Setup cgroup if either the native or vepc driver is active */
+ sgx_cgroup_init();
+
return 0;
err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca34cd4f58ac..fae8eef10232 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -39,14 +39,35 @@ enum sgx_reclaim {
SGX_DO_RECLAIM
};
+struct sgx_cgroup;
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
u16 poison;
struct sgx_encl_page *owner;
struct list_head list;
+#ifdef CONFIG_CGROUP_MISC
+ struct sgx_cgroup *sgx_cg;
+#endif
};
+static inline void sgx_epc_page_set_cgroup(struct sgx_epc_page *page, struct sgx_cgroup *cg)
+{
+#ifdef CONFIG_CGROUP_MISC
+ page->sgx_cg = cg;
+#endif
+}
+
+static inline struct sgx_cgroup *sgx_epc_page_get_cgroup(struct sgx_epc_page *page)
+{
+#ifdef CONFIG_CGROUP_MISC
+ return page->sgx_cg;
+#else
+ return NULL;
+#endif
+}
+
/*
* Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
* the free page list local to the node is stored here.
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 440ed2bb8053..c9b47a5e966a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@ struct misc_res_ops {
* @max: Maximum limit on the resource.
* @usage: Current usage of the resource.
* @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
*/
struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+ void *priv;
};
/**
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
For the global reclaimer to determine if any page is available for
reclamation at the global level, it currently only checks whether the
global LRU is empty. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim(), to abstract this check. Currently it only checks the
global LRU; later it will check the LRUs of all cgroups when per-cgroup
tracking is turned on. Replace all the checks of the global LRU,
list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim().
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V10:
- Add comments for the new function. (Jarkko)
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 7f5428571c6a..11edbdb06782 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,14 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
return &sgx_global_lru;
}
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim(void)
+{
+ return !list_empty(&sgx_global_lru.reclaimable);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -391,7 +399,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *c
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_global_lru.reclaimable);
+ sgx_can_reclaim();
}
static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
@@ -593,7 +601,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
- if (list_empty(&sgx_global_lru.reclaimable)) {
+ if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
--
2.25.1
With different cgroups, the script starts one or multiple concurrent SGX
selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
test case, which loads an enclave of EPC size equal to the EPC capacity
available on the platform. The script checks results against the
expectation set for each cgroup and reports success or failure.
The script creates 3 different cgroups at the beginning with following
expectations:
1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all to pass.
Then it reruns the same test with one process randomly killed and checks
that usage is zero after all processes exit.
The script also includes a test with a low mem_cg limit and the LARGE
sgx_epc limit to verify that the RAM used for per-cgroup reclamation is
charged to a proper mem_cg. For this test, it turns off swapping before
the test starts and turns swapping back on afterwards.
Add README to document how to run the tests.
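For reference, the cgroup tests can be run either through the kselftest
framework or standalone from the selftests/sgx directory (root is required
in both cases), as documented in the README:
  $ sudo make -C tools/testing/selftests/sgx run_tests
  $ sudo ./run_epc_cg_selftests.sh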
Signed-off-by: Haitao Huang <[email protected]>
---
V12:
- Integrate the scripts to the "run_tests" target. (Jarkko)
V11:
- Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
- Drop support for cgroup v1 and simplify. (Michal, Jarkko)
- Add documentation for functions. (Jarkko)
- Turn off swapping before memcontrol tests and back on after
- Format and style fixes, name for hard coded values
V7:
- Added memcontrol test.
V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
tools/testing/selftests/sgx/Makefile | 3 +-
tools/testing/selftests/sgx/README | 116 +++++++
tools/testing/selftests/sgx/ash_cgexec.sh | 16 +
.../selftests/sgx/run_epc_cg_selftests.sh | 283 ++++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 11 +
5 files changed, 428 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/sgx/README
create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile
index 867f88ce2570..739376af9e33 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none
ifeq ($(CAN_BUILD_X86_64), 1)
TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh
all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
endif
diff --git a/tools/testing/selftests/sgx/README b/tools/testing/selftests/sgx/README
new file mode 100644
index 000000000000..dfc0c74ce99d
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,116 @@
+SGX selftests
+
+The SGX selftests include a C program (test_sgx) that covers basic user-space
+facing APIs and a shell script (run_epc_cg_selftests.sh) testing the SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+=======================
+
+BUILD
+-----
+
+Build the executable file "test_sgx" from the top-level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all sgx tests as sudo or root since the cgroup tests need to configure cgroup
+limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores, the
+tests may need to run longer than the default timeout of 45 seconds. To avoid
+timeouts, set a value for kselftest_override_timeout for the make command:
+
+ $ sudo kselftest_override_timeout=165 make -C tools/testing/selftests/sgx run_tests
+
+Or use --override-timeout option if running the installed kselftests from the
+installation root directory:
+
+ $ sudo ./run_kselftest.sh --override-timeout 165 -c sgx
+
+More details about kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from this
+directory (tools/testing/selftests/sgx/):
+
+ $ make
+
+RUN
+---
+
+Run all non-cgroup tests:
+
+ $ ./test_sgx
+
+To test SGX cgroup:
+
+ $ sudo ./run_epc_cg_selftests.sh
+
+THE SGX CGROUP TEST SCRIPTS
+===========================
+
+Overview of the main cgroup test script
+---------------------------------------
+
+With different cgroups, the script (run_epc_cg_selftests.sh) starts one or
+multiple concurrent SGX selftests (test_sgx), each to run the
+unclobbered_vdso_oversubscribed test case, which loads an enclave of EPC size
+equal to the EPC capacity available on the platform. The script checks results
+against the expectation set for each cgroup and reports success or failure.
+
+The script creates 3 different cgroups at the beginning with following
+expectations:
+
+ 1) SMALL - intentionally small enough to fail the test loading an enclave of
+ size equal to the capacity.
+
+ 2) LARGE - large enough to run up to 4 concurrent tests but fail some if more
+ than 4 concurrent tests are run. The script starts 4 expecting at
+ least one test to pass, and then starts 5 expecting at least one
+ test to fail.
+
+ 3) LARGER - limit is the same as the capacity, large enough to run lots of
+ concurrent tests. The script starts 8 of them and expects all
+ pass. Then it reruns the same test with one process randomly
+ killed and usage checked to be zero after all processes exit.
+
+The script also includes a test with a low mem_cg limit (memory.max) and a LARGE
+sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged
+to the proper mem_cg. To validate that the mem_cg OOM-kills processes when its
+memory.max limit is reached due to SGX EPC reclamation, the script turns off
+swapping before this particular test starts and turns swapping back on afterwards.
+
+The helper scripts for monitoring and trouble-shooting
+------------------------------------------------------
+
+To help developers/testers monitor the SGX cgroup settings and behavior, or
+troubleshoot issues encountered during testing, a helper script,
+watch_misc_for_tests.sh, is provided. It watches the relevant entries in
+cgroupfs files. For example, to watch changes to the SGX cgroup 'current'
+counter during testing, run this in a separate terminal from this directory:
+
+ $ ./watch_misc_for_tests.sh current
+
+For more details about SGX cgroups, see "Cgroup Support" in
+Documentation/arch/x86/sgx.rst.
+
+The scripts require cgroup v2 support. More details about cgroup v2 can be found
+in Documentation/admin-guide/cgroup-v2.rst.
+
diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh
new file mode 100755
index 000000000000..cfa5d2b0e795
--- /dev/null
+++ b/tools/testing/selftests/sgx/ash_cgexec.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2024 Intel Corporation.
+
+# Start a program in a given cgroup.
+# Supports V2 cgroup paths, relative to /sys/fs/cgroup
+if [ "$#" -lt 2 ]; then
+ echo "Usage: $0 <v2 cgroup path> <command> [args...]"
+ exit 1
+fi
+# Move this shell to the cgroup.
+echo 0 >/sys/fs/cgroup/$1/cgroup.procs
+shift
+# Execute the command within the cgroup
+exec "$@"
+
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..cd2911e865f0
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,283 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023, 2024 Intel Corporation.
+
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+if [ "$(id -u)" -ne 0 ]; then
+ echo "SKIP: SGX Cgroup tests need root privileges."
+ exit $ksft_skip
+fi
+
+TEST_ROOT_CG=selftest
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+# Cgroup v2 only
+CG_ROOT=/sys/fs/cgroup
+mkdir -p $CG_ROOT/$TEST_CG_SUB1
+mkdir -p $CG_ROOT/$TEST_CG_SUB2
+mkdir -p $CG_ROOT/$TEST_CG_SUB3
+mkdir -p $CG_ROOT/$TEST_CG_SUB4
+
+# Turn on misc and memory controller in non-leaf nodes
+echo "+misc" > $CG_ROOT/cgroup.subtree_control && \
+echo "+memory" > $CG_ROOT/cgroup.subtree_control && \
+echo "+misc" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
+echo "+memory" > $CG_ROOT/$TEST_ROOT_CG/cgroup.subtree_control && \
+echo "+misc" > $CG_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
+if [ $? -ne 0 ]; then
+ echo "# Failed setting up cgroups, make sure misc and memory cgroups are enabled."
+ exit 1
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
+# This is below the number of VA pages needed for an enclave of capacity size,
+# so the oversubscribed test cases should fail.
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" > $CG_ROOT/$TEST_CG_SUB1/misc.max && \
+echo "sgx_epc $LARGE" > $CG_ROOT/$TEST_CG_SUB2/misc.max && \
+echo "sgx_epc $LARGER" > $CG_ROOT/$TEST_CG_SUB4/misc.max
+if [ $? -ne 0 ]; then
+ echo "# Failed setting up misc limits."
+ exit 1
+fi
+
+clean_up()
+{
+ sleep 2
+ rmdir $CG_ROOT/$TEST_CG_SUB2
+ rmdir $CG_ROOT/$TEST_CG_SUB3
+ rmdir $CG_ROOT/$TEST_CG_SUB4
+ rmdir $CG_ROOT/$TEST_CG_SUB1
+ rmdir $CG_ROOT/$TEST_ROOT_CG
+}
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+PROCESS_SUCCESS=1
+PROCESS_FAILURE=0
+
+# Wait for a process and check for expected exit status.
+#
+# Arguments:
+# $1 - the pid of the process to wait and check.
+# $2 - 1 if expecting success, 0 for failure.
+#
+# Return:
+# 0 if the exit status of the process matches the expectation.
+# 1 otherwise.
+wait_check_process_status() {
+ pid=$1
+ check_for_success=$2
+
+ wait "$pid"
+ status=$?
+
+ if [ $check_for_success -eq $PROCESS_SUCCESS ] && [ $status -eq 0 ]; then
+ echo "# Process $pid succeeded."
+ return 0
+ elif [ $check_for_success -eq $PROCESS_FAILURE ] && [ $status -ne 0 ]; then
+ echo "# Process $pid returned failure."
+ return 0
+ fi
+ return 1
+}
+
+# Wait for a set of processes and check for expected exit status
+#
+# Arguments:
+# $1 - 1 if expecting success, 0 for failure.
+# remaining args - The pids of the processes
+#
+# Return:
+# 0 if exit status of any process matches the expectation.
+# 1 otherwise.
+wait_and_detect_for_any() {
+ check_for_success=$1
+
+ shift
+ detected=1 # 0 for success detection
+
+ for pid in $@; do
+ if wait_check_process_status "$pid" "$check_for_success"; then
+ detected=0
+ # Wait for other processes to exit
+ fi
+ done
+
+ return $detected
+}
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
+# Always use leaf node of misc cgroups
+# these may fail on OOM
+./ash_cgexec.sh $TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [ $? -eq 0 ]; then
+ echo "# Fail on SMALL limit, not expecting any test passes."
+ clean_up
+ exit 1
+else
+ echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one success...."
+
+pids=""
+for i in 1 2 3 4; do
+ (
+ ./ash_cgexec.sh $TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ) &
+ pids="$pids $!"
+done
+
+
+if wait_and_detect_for_any $PROCESS_SUCCESS "$pids"; then
+ echo "# PASSED LARGE limit positive testing."
+else
+ echo "# Failed on LARGE limit positive testing, no test passes."
+ clean_up
+ exit 1
+fi
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one failure...."
+pids=""
+for i in 1 2 3 4 5; do
+ (
+ ./ash_cgexec.sh $TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ) &
+ pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_FAILURE "$pids"; then
+ echo "# PASSED LARGE limit negative testing."
+else
+ echo "# Failed on LARGE limit negative testing, no test fails."
+ clean_up
+ exit 1
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ expecting no failure...."
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
+ (
+ ./ash_cgexec.sh $TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ) &
+ pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_FAILURE "$pids"; then
+ echo "# Failed on LARGER limit, at least one test fails."
+ clean_up
+ exit 1
+else
+ echo "# PASSED LARGER limit tests."
+fi
+
+echo "# Start 8 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ randomly kill one, expecting no failure...."
+pids=""
+for i in 1 2 3 4 5 6 7 8; do
+ (
+ ./ash_cgexec.sh $TEST_CG_SUB4 $test_cmd >cgtest_larger_kill_$timestamp.$i.log 2>&1
+ ) &
+ pids="$pids $!"
+done
+random_number=$(awk 'BEGIN{srand();print int(rand()*5)}')
+sleep $((random_number + 1))
+
+# Randomly select a process to kill
+# Make sure usage counter not leaked at the end.
+RANDOM_INDEX=$(awk 'BEGIN{srand();print int(rand()*8)}')
+counter=0
+for pid in $pids; do
+ if [ "$counter" -eq "$RANDOM_INDEX" ]; then
+ PID_TO_KILL=$pid
+ break
+ fi
+ counter=$((counter + 1))
+done
+
+kill $PID_TO_KILL
+echo "# Killed process with PID: $PID_TO_KILL"
+
+any_failure=0
+for pid in $pids; do
+ wait "$pid"
+ status=$?
+ if [ "$pid" != "$PID_TO_KILL" ]; then
+ if [ $status -ne 0 ]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+ fi
+done
+
+if [ $any_failure -ne 0 ]; then
+ echo "# Failed on random killing, at least one test fails."
+ clean_up
+ exit 1
+fi
+echo "# PASSED LARGER limit test with a process randomly killed."
+
+MEM_LIMIT_TOO_SMALL=$((CAPACITY - 2 * LARGE))
+
+echo "$MEM_LIMIT_TOO_SMALL" > $CG_ROOT/$TEST_CG_SUB2/memory.max
+if [ $? -ne 0 ]; then
+ echo "# Failed creating memory controller."
+ clean_up
+ exit 1
+fi
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE EPC limit,
+ and too small RAM limit, expecting all failures...."
+# Ensure swapping off so the OOM killer is activated when mem_cgroup limit is hit.
+swapoff -a
+pids=""
+for i in 1 2 3 4; do
+ (
+ ./ash_cgexec.sh $TEST_CG_SUB2 $test_cmd >cgtest_large_oom_$timestamp.$i.log 2>&1
+ ) &
+ pids="$pids $!"
+done
+
+if wait_and_detect_for_any $PROCESS_SUCCESS "$pids"; then
+ echo "# Failed on tests with memcontrol, some tests did not fail."
+ clean_up
+ swapon -a
+ exit 1
+else
+ swapon -a
+ echo "# PASSED LARGE limit tests with memcontrol."
+fi
+
+sleep 2
+
+USAGE=$(grep '^sgx_epc' "$CG_ROOT/$TEST_ROOT_CG/misc.current" | awk '{print $2}')
+if [ "$USAGE" -ne 0 ]; then
+ echo "# Failed: Final usage is $USAGE, not 0."
+else
+ echo "# PASSED leakage check."
+ echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
+fi
+clean_up
+echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..1c9985726ace
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023, 2024 Intel Corporation.
+
+if [ -z "$1" ]; then
+ echo "No argument supplied, please provide 'max', 'current', or 'events'"
+ exit 1
+fi
+
+watch -n 1 'find /sys/fs/cgroup -wholename "*/test*/misc.'$1'" -exec \
+ sh -c '\''echo "$1:"; cat "$1"'\'' _ {} \;'
--
2.25.1
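For a quick manual check outside the full test script, the ash_cgexec.sh helper
above can run a single oversubscribed enclave test inside an ad-hoc cgroup. The
following is only a sketch: it assumes this series is applied, CONFIG_CGROUP_MISC
is enabled, test_sgx is built in this directory, and root privileges; the cgroup
name "sgx_demo" and the 64 MiB limit are arbitrary examples, not part of the
selftests.

  # Enable the misc controller for children of the root cgroup (cgroup v2).
  echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
  # Create a leaf cgroup and cap its EPC usage at 64 MiB.
  mkdir -p /sys/fs/cgroup/sgx_demo
  echo "sgx_epc $((64 * 1024 * 1024))" > /sys/fs/cgroup/sgx_demo/misc.max
  # Run one enclave test of EPC-capacity size inside that cgroup.
  ./ash_cgexec.sh sgx_demo ./test_sgx -t unclobbered_vdso_oversubscribed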
From: Kristen Carlson Accardi <[email protected]>
Previous patches have implemented all the infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU because sgx_lru_list() returns a
hard-coded reference to the global LRU.
Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.
This makes all EPC pages tracked in per-cgroup LRUs, so the global
reclaimer (ksgxd) will no longer be able to reclaim any pages from the
global LRU. However, in over-committing cases, i.e., when the sum of
cgroup limits is greater than the total capacity, individual cgroups may
never reclaim even though the total usage can still approach the capacity.
Therefore global reclamation is still needed in those cases, and it
should be performed from the root cgroup.
Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
when cgroup support is enabled, otherwise from the global LRU. Export
sgx_cgroup_reclaim_pages() in the header file so it can be reused for
this purpose.
Similarly, modify sgx_can_reclaim() to check the emptiness of the LRUs of
all cgroups when the EPC cgroup is enabled, otherwise only check the
global LRU. Export sgx_cgroup_lru_empty() so it can be reused for this
purpose.
Finally, change sgx_reclaim_direct() to ensure there are free pages at
the cgroup level so the caller can make forward progress. Export
sgx_cgroup_should_reclaim() for reuse.
With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Tested-by: Mikko Ylinen <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditionally compile SGX cgroup for
  CONFIG_CGROUP_MISC. (Jarkko)
V11:
- Reword the comments for global reclamation for allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in c file (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)
V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 6 ++--
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 27 +++++++++++++++++
arch/x86/kernel/cpu/sgx/main.c | 43 ++++++++++++++++++++++++++--
3 files changed, 71 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 2efc33476b0b..16fe0e1574ec 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -68,7 +68,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
*
* Return: %true if all cgroups under the specified root have empty LRU lists.
*/
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -116,7 +116,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
* the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
* page will actually be reclaimed after walking the whole tree.
*/
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -157,7 +157,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *cha
* threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
* cgroup.
*/
-static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
{
u64 cur, max;
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 2044e0d64076..9d69608eadf6 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -13,6 +13,11 @@
#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
struct sgx_cgroup;
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+ return NULL;
+}
+
static inline struct sgx_cgroup *sgx_get_current_cg(void)
{
return NULL;
@@ -27,8 +32,22 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl
static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
+static inline bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+ return true;
+}
+
+static inline bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+ return false;
+}
+
static inline void sgx_cgroup_init(void) { }
+static inline void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
+{
+}
+
#else /* CONFIG_CGROUP_MISC */
struct sgx_cgroup {
@@ -37,6 +56,11 @@ struct sgx_cgroup {
struct work_struct reclaim_work;
};
+static inline struct misc_cg *misc_from_sgx(struct sgx_cgroup *sgx_cg)
+{
+ return sgx_cg->cg;
+}
+
static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
{
return (struct sgx_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
@@ -67,6 +91,9 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
+bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm);
void sgx_cgroup_init(void);
#endif /* CONFIG_CGROUP_MISC */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 11edbdb06782..9343f7d50649 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,9 +32,30 @@ static DEFINE_XARRAY(sgx_epc_address_space);
*/
static struct sgx_epc_lru_list sgx_global_lru;
+/*
+ * Get the per-cgroup or global LRU list that tracks the given reclaimable page.
+ */
static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
{
+#ifdef CONFIG_CGROUP_MISC
+ /*
+ * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
+ * life between sgx_alloc_epc_page() and sgx_free_epc_page():
+ *
+ * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
+ * sgx_get_current_cg() which is the misc cgroup of the current task, or
+ * the root by default even if the misc cgroup is disabled by kernel
+ * command line.
+ *
+ * epc_page->sgx_cg is only unset by sgx_free_epc_page().
+ *
+ * This function is never used before sgx_alloc_epc_page() or after
+ * sgx_free_epc_page().
+ */
+ return &epc_page->sgx_cg->lru;
+#else
return &sgx_global_lru;
+#endif
}
/*
@@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
*/
static inline bool sgx_can_reclaim(void)
{
- return !list_empty(&sgx_global_lru.reclaimable);
+ return !sgx_cgroup_lru_empty(misc_cg_root()) ||
+ !list_empty(&sgx_global_lru.reclaimable);
}
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
{
- sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+ if (IS_ENABLED(CONFIG_CGROUP_MISC))
+ sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
+ else
+ sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}
/*
@@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
*/
void sgx_reclaim_direct(void)
{
+ struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+
+ /* Make sure there are some free pages at cgroup level */
+ if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
+ sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
+ sgx_put_cg(sgx_cg);
+ }
+ /* Make sure there are some free pages at global level */
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
}
@@ -616,6 +649,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
+ /*
+ * At this point, the usage within this cgroup is under its
+ * limit but there is no physical page left for allocation.
+ * Perform a global reclaim to get some pages released from any
+ * cgroup with reclaimable pages.
+ */
sgx_reclaim_pages_global(current->mm);
cond_resched();
}
--
2.25.1
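One way to exercise the over-commit case described in this patch, where no
single cgroup ever hits its own limit yet the total usage approaches the EPC
capacity, is sketched below. It assumes the whole series is applied, root
privileges, cgroup v2 with the misc controller available, and the selftest
binaries from tools/testing/selftests/sgx; the cgroup names are arbitrary.

  # Total EPC capacity in bytes, as reported by the root misc.capacity file.
  CAP=$(grep sgx_epc /sys/fs/cgroup/misc.capacity | awk '{print $2}')
  echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
  for cg in epc_over1 epc_over2; do
      mkdir -p /sys/fs/cgroup/$cg
      # Each limit equals the full capacity, so neither cgroup reclaims on
      # its own; together they oversubscribe the EPC.
      echo "sgx_epc $CAP" > /sys/fs/cgroup/$cg/misc.max
  done
  # Global reclamation from the root cgroup keeps both enclaves runnable.
  ./ash_cgexec.sh epc_over1 ./test_sgx -t unclobbered_vdso_oversubscribed &
  ./ash_cgexec.sh epc_over2 ./test_sgx -t unclobbered_vdso_oversubscribed &
  wait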
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of the mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it assumes all contexts other than the
ksgxd thread are user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.
Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument that explicitly specifies the mm struct to charge
for allocations. Callers from background kthreads not associated with a
charging mm struct set it to NULL, while callers in user process
contexts set it to current->mm.
Internally, when the given charging mm is NULL, it searches for an mm
struct from the enclave's mm_list.
Signed-off-by: Haitao Huang <[email protected]>
Reported-by: Mikko Ylinen <[email protected]>
Tested-by: Mikko Ylinen <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
V9:
- Reduce number of if statements. (Tim)
V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
arch/x86/kernel/cpu/sgx/encl.c | 29 ++++++++++++++--------------
arch/x86/kernel/cpu/sgx/encl.h | 3 +--
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 10 ++++++----
arch/x86/kernel/cpu/sgx/main.c | 29 +++++++++++++---------------
arch/x86/kernel/cpu/sgx/sgx.h | 2 +-
5 files changed, 36 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
}
/*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
*/
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
{
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;
- /*
- * If called from normal task context, return the mem_cgroup
- * of the current task's mm. The remainder of the handling is for
- * ksgxd.
- */
- if (!current_is_ksgxd())
- return get_mem_cgroup_from_mm(current->mm);
+ /* Use the charge_mm if given. */
+ if (charge_mm)
+ return get_mem_cgroup_from_mm(charge_mm);
/*
* Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* @encl: an enclave pointer
* @page_index: enclave page index
* @backing: data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
*
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
* mms in the enclave's mm_list prior to any backing page allocation,
* in order to ensure that shmem page allocations are charged to the
* enclave. Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* -errno otherwise.
*/
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, struct mm_struct *charge_mm)
{
- struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+ struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
unsigned long end, unsigned long vm_flags);
-bool current_is_ksgxd(void);
void sgx_encl_release(struct kref *ref);
int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing);
+ struct sgx_backing *backing, struct mm_struct *charge_mm);
void sgx_encl_put_backing(struct sgx_backing *backing);
int sgx_encl_test_and_clear_young(struct mm_struct *mm,
struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 8151371a198b..2efc33476b0b 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -107,6 +107,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
/**
* sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
* @root: The root of cgroup tree to reclaim from.
+ * @charge_mm: The mm to charge for backing store allocation.
*
* This function performs a pre-order walk in the cgroup tree under the given
* root, attempting to reclaim pages at each node until a fixed number of pages
@@ -115,7 +116,7 @@ static bool sgx_cgroup_lru_empty(struct misc_cg *root)
* the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
* page will actually be reclaimed after walking the whole tree.
*/
-static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
+static void sgx_cgroup_reclaim_pages(struct misc_cg *root, struct mm_struct *charge_mm)
{
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
@@ -132,7 +133,7 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
- cnt += sgx_reclaim_pages(&sgx_cg->lru);
+ cnt += sgx_reclaim_pages(&sgx_cg->lru, charge_mm);
rcu_read_lock();
css_put(pos);
@@ -194,7 +195,8 @@ static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
* blocked until a worker makes its way through the global work queue.
*/
while (sgx_cgroup_should_reclaim(sgx_cg)) {
- sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ /* Indirect reclaim, no mm to charge, so NULL: */
+ sgx_cgroup_reclaim_pages(sgx_cg->cg, NULL);
cond_resched();
}
}
@@ -241,7 +243,7 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
return -EBUSY;
}
- sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ sgx_cgroup_reclaim_pages(sgx_cg->cg, current->mm);
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b79c1d6cdc23..7f5428571c6a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -253,8 +253,8 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
}
}
-static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
- struct sgx_backing *backing)
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, struct sgx_backing *backing,
+ struct mm_struct *charge_mm)
{
struct sgx_encl_page *encl_page = epc_page->owner;
struct sgx_encl *encl = encl_page->encl;
@@ -270,7 +270,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
- &secs_backing);
+ &secs_backing, charge_mm);
if (ret)
goto out;
@@ -289,6 +289,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
/**
* sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
* @lru: The LRU from which pages are reclaimed.
+ * @charge_mm: The mm to charge for backing store allocation.
*
* Take a fixed number of pages from the head of a given LRU and reclaim them to
* the enclave's private shmem files. Skip the pages, which have been accessed
@@ -304,7 +305,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
*
* Return: Number of pages attempted for reclamation.
*/
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -344,7 +345,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
mutex_lock(&encl_page->encl->lock);
- ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+ ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], charge_mm);
if (ret) {
mutex_unlock(&encl_page->encl->lock);
goto skip;
@@ -376,7 +377,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
continue;
encl_page = epc_page->owner;
- sgx_reclaimer_write(epc_page, &backing[i]);
+ sgx_reclaimer_write(epc_page, &backing[i], charge_mm);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
@@ -393,9 +394,9 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}
-static void sgx_reclaim_pages_global(void)
+static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
{
- sgx_reclaim_pages(&sgx_global_lru);
+ sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}
/*
@@ -406,7 +407,7 @@ static void sgx_reclaim_pages_global(void)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(current->mm);
}
static int ksgxd(void *p)
@@ -429,7 +430,8 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages_global();
+ /* Indirect reclaim, no mm to charge, so NULL: */
+ sgx_reclaim_pages_global(NULL);
cond_resched();
}
@@ -452,11 +454,6 @@ static bool __init sgx_page_reclaimer_init(void)
return true;
}
-bool current_is_ksgxd(void)
-{
- return current == ksgxd_tsk;
-}
-
static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
{
struct sgx_numa_node *node = &sgx_numa_nodes[nid];
@@ -611,7 +608,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
- sgx_reclaim_pages_global();
+ sgx_reclaim_pages_global(current->mm);
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 89adac646381..72b022755ff1 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -135,7 +135,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
-unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, struct mm_struct *charge_mm);
void sgx_ipi_cb(void *info);
--
2.25.1
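The selftest described in the README exercises this charging path with a low
memory.max; a stripped-down sketch of the same check is below. It is only
illustrative: the cgroup name and the limits are arbitrary, root privileges and
cgroup v2 with both the misc and memory controllers are assumed, and swap is
turned off so the memcg OOM killer fires instead of swapping out the shmem
backing pages.

  echo "+misc +memory" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir -p /sys/fs/cgroup/epc_oom
  # Allow plenty of EPC but very little RAM for the swapped-out EPC pages.
  echo "sgx_epc $((512 * 1024 * 1024))" > /sys/fs/cgroup/epc_oom/misc.max
  echo $((64 * 1024 * 1024)) > /sys/fs/cgroup/epc_oom/memory.max
  swapoff -a
  ./ash_cgexec.sh epc_oom ./test_sgx -t unclobbered_vdso_oversubscribed
  # If the backing RAM is charged to this cgroup, its OOM counters go up.
  grep -E '^oom' /sys/fs/cgroup/epc_oom/memory.events
  swapon -a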
From: Kristen Carlson Accardi <[email protected]>
In cases where EPC pages need to be allocated during a page fault and the
cgroup usage is near its limit, asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.
Create a workqueue, a corresponding work item, and function definitions
for the EPC cgroup to support the asynchronous reclamation.
If the workqueue allocation fails during init, disable the cgroup support.
In sgx_cgroup_try_charge(), if the caller does not allow synchronous
reclamation, queue an asynchronous work item into the workqueue.
Reclaiming only when the usage is at or very close to the limit would
cause thrashing. To avoid that, before returning from
sgx_cgroup_try_charge(), check the need for reclamation (usage too close
to the limit) and queue an async work item if needed, similar to how the
global reclaimer wakes up its reclaiming thread after each allocation in
sgx_alloc_epc_page().
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.
V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fails during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().
V9:
- Add comments for static variables. (Jarkko)
V8:
- Remove alignment for substructure variables. (Jarkko)
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 129 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 1 +
2 files changed, 128 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 74d403d1e0d4..8151371a198b 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,9 +5,63 @@
#include <linux/kernel.h>
#include "epc_cgroup.h"
+/*
+ * The minimal free pages maintained by per-cgroup reclaimer
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75%
+ * pages used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT (SGX_CG_MIN_FREE_PAGE * 4)
+
/* The root SGX EPC cgroup */
static struct sgx_cgroup sgx_cg_root;
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+ return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+ return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+ struct misc_cg *i = sgx_cg->cg;
+ u64 m = U64_MAX;
+
+ while (i) {
+ m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+ i = misc_cg_parent(i);
+ }
+
+ return m / PAGE_SIZE;
+}
+
/**
* sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
* @root: Root of the tree to check
@@ -90,6 +144,61 @@ static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
rcu_read_unlock();
}
+/**
+ * sgx_cgroup_should_reclaim() - check if EPC reclamation is needed for a cgroup
+ * @sgx_cg: The cgroup to be checked.
+ *
+ * This function can be used to guard a call to sgx_cgroup_reclaim_pages() where
+ * the minimal number of free page needs be maintained for the cgroup to make
+ * good forward progress.
+ *
+ * Return: %true if number of free pages available for the cgroup below a
+ * threshold (%SGX_CG_MIN_FREE_PAGE) and there are reclaimable pages within the
+ * cgroup.
+ */
+static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
+{
+ u64 cur, max;
+
+ if (sgx_cgroup_lru_empty(sgx_cg->cg))
+ return false;
+
+ max = sgx_cgroup_max_pages_to_root(sgx_cg);
+
+ /*
+ * Unless the limit is very low, maintain a minimal number of free pages
+ * so there is always a few pages available to serve new allocation
+ * requests quickly.
+ */
+ if (max > SGX_CG_LOW_LIMIT)
+ max -= SGX_CG_MIN_FREE_PAGE;
+
+ cur = sgx_cgroup_page_counter_read(sgx_cg);
+
+ return (cur >= max);
+}
+
+/*
+ * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity.
+ */
+static void sgx_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_cgroup *sgx_cg = container_of(work, struct sgx_cgroup, reclaim_work);
+
+ /*
+ * This work func is scheduled by sgx_cgroup_try_charge() when it cannot
+ * directly reclaim, i.e., EPC allocation in a fault handler. Waiting to
+ * reclaim until the cgroup is actually at its limit is less performant,
+ * as it means the task scheduling this asynchronous work is effectively
+ * blocked until a worker makes its way through the global work queue.
+ */
+ while (sgx_cgroup_should_reclaim(sgx_cg)) {
+ sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ cond_resched();
+ }
+}
+
static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
{
if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
@@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
int ret;
+ /* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */
+ if (!sgx_cg_wq)
+ return 0;
+
for (;;) {
ret = __sgx_cgroup_try_charge(sgx_cg);
if (ret != -EBUSY)
return ret;
- if (reclaim == SGX_NO_RECLAIM)
- return -ENOMEM;
+ if (reclaim == SGX_NO_RECLAIM) {
+ queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+ return -EBUSY;
+ }
sgx_cgroup_reclaim_pages(sgx_cg->cg);
cond_resched();
}
+ if (sgx_cgroup_should_reclaim(sgx_cg))
+ queue_work(sgx_cg_wq, &sgx_cg->reclaim_work);
+
return 0;
}
@@ -150,12 +268,14 @@ static void sgx_cgroup_free(struct misc_cg *cg)
if (!sgx_cg)
return;
+ cancel_work_sync(&sgx_cg->reclaim_work);
kfree(sgx_cg);
}
static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
{
sgx_lru_init(&sgx_cg->lru);
+ INIT_WORK(&sgx_cg->reclaim_work, sgx_cgroup_reclaim_work_func);
cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
sgx_cg->cg = cg;
}
@@ -182,4 +302,9 @@ void sgx_cgroup_init(void)
{
misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_cgroup_ops);
sgx_cgroup_misc_init(misc_cg_root(), &sgx_cg_root);
+
+ sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE, WQ_UNBOUND_MAX_ACTIVE);
+
+ if (!sgx_cg_wq)
+ pr_err("SGX EPC cgroup disabled: alloc_workqueue() failed.\n");
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 538524f5669d..2044e0d64076 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -34,6 +34,7 @@ static inline void sgx_cgroup_init(void) { }
struct sgx_cgroup {
struct misc_cg *cg;
struct sgx_epc_lru_list lru;
+ struct work_struct reclaim_work;
};
static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
--
2.25.1
Missed adding the config file:
index 000000000000..e7f1db1d3eff
--- /dev/null
+++ b/tools/testing/selftests/sgx/config
@@ -0,0 +1,4 @@
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_MISC=y
+CONFIG_MEMCG=y
+CONFIG_X86_SGX=y
I'll send a fixup for this patch or another version of the series if more
changes are needed.
Thanks
Haitao
>
> I'll send a fixup for this patch or another version of the series if more
> changes are needed.
Hi Haitao,
I don't like to say it, but in general I think you are sending too frequently.
The last version was sent April 11th (my time), so considering the weekend it
has only been 3 or at most 4 days.
Please slow down a little bit to give people more time.
For more information, please also see:
https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders
From: Kristen Carlson Accardi <[email protected]>
Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching its limit. This is not ideal. When that happens, a better way
is to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.
Add the basic building blocks to support per-cgroup reclamation.
Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.
Currently the kernel does the global EPC reclaim in sgx_reclaim_pages().
It always tries to reclaim EPC pages in batches of SGX_NR_TO_SCAN (16)
pages. Specifically, it always "scans", or "isolates", SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.
Implement the "cgroup" variant EPC reclaim in a similar way, but keep
the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
as input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop the
given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".
This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves on to its descendants when there are not enough
reclaimable EPC pages to "scan" in its LRU. This should be enough for
most cases.
Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batches
of SGX_NR_TO_SCAN pages): when LRUs have fewer than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs, with each being smaller than
SGX_NR_TO_SCAN pages.
A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch. But this is unnecessarily complicated at
this stage.
Alternatively, the current sgx_reclaim_pages() could be changed to
return the actual "reclaimed" pages, rather than the "scanned" pages.
However, reclamation is a lengthy process, and forcing the successful
reclamation of a predetermined number of pages may block the caller for
too long. That may not be acceptable in some synchronous contexts, e.g.,
when serving an ioctl().
With this building block in place, add synchronous reclamation support
in sgx_cgroup_try_charge(): trigger a call to
sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
caller allows synchronous reclaim, as indicated by a newly added
parameter.
A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().
Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V11:
- Use commit message suggested by Kai
- Remove "usage" comments for functions. (Kai)
V10:
- Simplify the signature by removing a pointer to nr_to_scan (Kai)
- Return pages attempted instead of reclaimed as it is really what the
cgroup caller needs to track progress. This further simplifies the design.
- Merge patch for exposing sgx_reclaim_pages() with basic synchronous
reclamation. (Kai)
- Shorten names for EPC cgroup functions. (Jarkko)
- Fix/add comments to justify the design (Kai)
- Separate out a helper for addressing a single iteration of the loop
in sgx_cgroup_try_charge(). (Jarkko)
V9:
- Add comments for static variables. (Jarkko)
V8:
- Use width of 80 characters in text paragraphs. (Jarkko)
- Remove alignment for substructure variables. (Jarkko)
V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 119 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +-
arch/x86/kernel/cpu/sgx/main.c | 45 ++++++----
arch/x86/kernel/cpu/sgx/sgx.h | 1 +
4 files changed, 148 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ff4d4a25dbe7..74d403d1e0d4 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -9,16 +9,128 @@
static struct sgx_cgroup sgx_cg_root;
/**
- * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: Root of the tree to check
*
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ */
+static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_cgroup *sgx_cg;
+ bool ret = true;
+
+ /*
+ * Caller must ensure css_root ref acquired
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+
+ rcu_read_unlock();
+
+ sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&sgx_cg->lru.lock);
+ ret = list_empty(&sgx_cg->lru.reclaimable);
+ spin_unlock(&sgx_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
+ * @root: The root of cgroup tree to reclaim from.
+ *
+ * This function performs a pre-order walk in the cgroup tree under the given
+ * root, attempting to reclaim pages at each node until a fixed number of pages
+ * (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of success on
+ * the actual reclamation process. In extreme cases, if all pages in front of
+ * the LRUs are recently accessed, i.e., considered "too young" to reclaim, no
+ * page will actually be reclaimed after walking the whole tree.
+ */
+static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_cgroup *sgx_cg;
+ unsigned int cnt = 0;
+
+ /* Caller must ensure css_root ref acquired */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+ rcu_read_unlock();
+
+ sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
+ cnt += sgx_reclaim_pages(&sgx_cg->lru);
+
+ rcu_read_lock();
+ css_put(pos);
+
+ if (cnt >= SGX_NR_TO_SCAN)
+ break;
+ }
+
+ rcu_read_unlock();
+}
+
+static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
+{
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
+ return 0;
+
+ /* No reclaimable pages left in the cgroup */
+ if (sgx_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ return -EBUSY;
+}
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @sgx_cg: The EPC cgroup to be charged for the page.
+ * @reclaim: Whether or not synchronous EPC reclaim is allowed.
* Return:
* * %0 - If successfully charged.
* * -errno - for failures.
*/
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
- return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
+ int ret;
+
+ for (;;) {
+ ret = __sgx_cgroup_try_charge(sgx_cg);
+
+ if (ret != -EBUSY)
+ return ret;
+
+ if (reclaim == SGX_NO_RECLAIM)
+ return -ENOMEM;
+
+ sgx_cgroup_reclaim_pages(sgx_cg->cg);
+ cond_resched();
+ }
+
+ return 0;
}
/**
@@ -43,6 +155,7 @@ static void sgx_cgroup_free(struct misc_cg *cg)
static void sgx_cgroup_misc_init(struct misc_cg *cg, struct sgx_cgroup *sgx_cg)
{
+ sgx_lru_init(&sgx_cg->lru);
cg->res[MISC_CG_RES_SGX_EPC].priv = sgx_cg;
sgx_cg->cg = cg;
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index bd9606479e67..538524f5669d 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -20,7 +20,7 @@ static inline struct sgx_cgroup *sgx_get_current_cg(void)
static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) { }
-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
return 0;
}
@@ -33,6 +33,7 @@ static inline void sgx_cgroup_init(void) { }
struct sgx_cgroup {
struct misc_cg *cg;
+ struct sgx_epc_lru_list lru;
};
static inline struct sgx_cgroup *sgx_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -63,7 +64,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
put_misc_cg(sgx_cg->cg);
}
-int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg);
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
void sgx_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 552455365761..b79c1d6cdc23 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -286,11 +286,14 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_reclaim_pages() - Attempt to reclaim a fixed number of pages from an LRU
+ * @lru: The LRU from which pages are reclaimed.
+ *
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
*
* Batch process a chunk of pages (at the moment 16) in order to degrade amount
* of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
@@ -298,8 +301,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* + EWB) but not sufficiently. Reclaiming one page at a time would also be
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
+ *
+ * Return: Number of pages attempted for reclamation.
*/
-static void sgx_reclaim_pages(void)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +315,9 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
@@ -328,7 +332,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +355,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_global_lru.lock);
- list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
+ list_add_tail(&epc_page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -379,6 +383,8 @@ static void sgx_reclaim_pages(void)
sgx_free_epc_page(epc_page);
}
+
+ return cnt;
}
static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,6 +393,11 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}
+static void sgx_reclaim_pages_global(void)
+{
+ sgx_reclaim_pages(&sgx_global_lru);
+}
+
/*
* sgx_reclaim_direct() should be called (without enclave's mutex held)
* in locations where SGX memory resources might be low and might be
@@ -395,7 +406,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
}
static int ksgxd(void *p)
@@ -418,7 +429,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
@@ -572,7 +583,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
int ret;
sgx_cg = sgx_get_current_cg();
- ret = sgx_cgroup_try_charge(sgx_cg);
+ ret = sgx_cgroup_try_charge(sgx_cg, reclaim);
if (ret) {
sgx_put_cg(sgx_cg);
return ERR_PTR(ret);
@@ -600,7 +611,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 3cf5a59a4eac..89adac646381 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -135,6 +135,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru);
void sgx_ipi_cb(void *info);
--
2.25.1
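To observe the synchronous per-cgroup reclaim added here (once the rest of the
series switches tracking to the per-cgroup LRUs), one rough approach is to cap
a cgroup at a fraction of the EPC capacity and run the oversubscribed selftest
inside it while watching misc.current. This is only a sketch; the cgroup name
and the half-capacity limit are arbitrary, and root privileges plus the
selftest binaries are assumed.

  CAP=$(grep sgx_epc /sys/fs/cgroup/misc.capacity | awk '{print $2}')
  echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir -p /sys/fs/cgroup/epc_half
  echo "sgx_epc $((CAP / 2))" > /sys/fs/cgroup/epc_half/misc.max
  ./ash_cgexec.sh epc_half ./test_sgx -t unclobbered_vdso_oversubscribed &
  # Usage should hover at or below half the capacity while an enclave of
  # capacity size is loaded, because pages are reclaimed within the cgroup.
  watch -n 1 grep sgx_epc /sys/fs/cgroup/epc_half/misc.current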
From: Sean Christopherson <[email protected]>
Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
Tested-by: Mikko Ylinen <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V8:
- Limit text width to 80 characters to be consistent.
V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 83 ++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running. The EPC is managed by the SGX subsystem and
+is not accounted by the memory controller. Note that this is true only for EPC
+memory itself, i.e. normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing store
+(normal system memory allocated via shmem). The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+ resource will show the total amount of EPC memory available on the
+ platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+ resource will show the current active EPC memory usage of the cgroup and
+ its descendants. EPC pages that are swapped out to backing RAM are not
+ included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The
+ sgx_epc resource will show the EPC usage hard limit. The default is
+ "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+ page fault handling, will be blocked until EPC can be reclaimed from the
+ cgroup. If there are no pages left that are reclaimable within the same
+ group, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not
+ reclaimable by the host kernel. If the cgroup's limit is reached and
+ there are no reclaimable pages left in the same cgroup, the virtual EPC
+ driver sends SIGBUS to the user space process to indicate failure on new
+ EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower
+ than the current usage to this file, the cgroup will not preemptively
+ deallocate pages currently in use, and will only start blocking the next
+ allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim due to
+ its EPC usage approaching (or exceeding) its max EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged
+to the original cgroup until the page is released or reclaimed. Migrating a
+process to a different cgroup doesn't move the EPC charges that it incurred
+while in the previous cgroup to its new cgroup.
--
2.25.1
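As a quick illustration of the interface files documented above, the sgx_epc
resource can be exercised from the shell roughly as follows. This is only a
sketch: it assumes cgroup v2 is mounted at /sys/fs/cgroup with the misc
controller available, and the cgroup name and limit value are made up for the
example.
  # Total EPC on the platform (root cgroup only).
  grep sgx_epc /sys/fs/cgroup/misc.capacity
  # Enable the misc controller for children and create a cgroup.
  echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/sgx_test
  # Set a hard EPC limit in bytes. A value that is not PAGE_SIZE aligned
  # is rounded down to the closest PAGE_SIZE multiple by the controller.
  echo "sgx_epc 8388608" > /sys/fs/cgroup/sgx_test/misc.max
  # Run the enclave workload from within the cgroup, then inspect usage
  # and the number of times the limit was hit.
  echo $$ > /sys/fs/cgroup/sgx_test/cgroup.procs
  grep sgx_epc /sys/fs/cgroup/sgx_test/misc.current
  cat /sys/fs/cgroup/sgx_test/misc.events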
From: Sean Christopherson <[email protected]>
Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup will later have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage approaches
its limit.
Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Jarkko Sakkinen <[email protected]>
---
V6:
- removed introduction to unreclaimables in commit message.
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists. (Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.
V3:
- Removed the helper functions and revised commit messages.
---
arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
arch/x86/kernel/cpu/sgx/sgx.h | 15 +++++++++++++
2 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d482ae7fdabf..b782207d41b6 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
/*
* These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
*/
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- if (list_empty(&sgx_active_page_list))
+ epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+ struct sgx_epc_page, list);
+ if (!epc_page)
break;
- epc_page = list_first_entry(&sgx_active_page_list,
- struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_reclaimer_lock);
- list_add_tail(&epc_page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
+ list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_active_page_list);
+ !list_empty(&sgx_global_lru.reclaimable);
}
/*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
ksgxd_tsk = tsk;
+ sgx_lru_init(&sgx_global_lru);
+
return true;
}
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
}
/**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return 0;
}
@@ -578,7 +579,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
break;
}
- if (list_empty(&sgx_active_page_list)) {
+ if (list_empty(&sgx_global_lru.reclaimable)) {
page = ERR_PTR(-ENOMEM);
break;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index fae8eef10232..3cf5a59a4eac 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -114,6 +114,21 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+ spinlock_t lock;
+ struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->reclaimable);
+}
+
void sgx_free_epc_page(struct sgx_epc_page *page);
void sgx_reclaim_direct(void);
--
2.25.1
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
>
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Reviewed-by: Jarkko Sakkinen <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
> Tested-by: Jarkko Sakkinen <[email protected]>
I don't see any big issue, so feel free to add:
Reviewed-by: Kai Huang <[email protected]>
Nitpickings below:
[...]
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
It doesn't seem you need the above two here.
They are probably needed in later patches; in that case we can move them to
the relevant patch(es) where they get used.
However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.
Btw, I am not sure whether you want to use <linux/kernel.h> because it looks
like it contains a lot of unrelated stuff. Anyway I guess nobody cares.
> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;
The comment isn't necessary (sorry, didn't notice before), because the code
already says that pretty clearly IMHO.
[...]
>
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
I don't see why you need <asm/sgx.h> here. Also, ...
> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"
... "sgx.h" already includes <asm/sgx.h>
[...]
>
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}
I spent some time looking into this. And yes, if I am reading the code
correctly, get_current_misc_cg() should never return NULL when the Kconfig
option is on.
I typed my analysis below in [*]. And it would be helpful if any cgroup
expert can have a second eye on this.
[...]
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?
[...]
[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slightly adjusted for text wrap).
Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless of whether it is disabled on the kernel command line.
int __init cgroup_init(void)
{
...
for_each_subsys(ss, ssid) {
if (ss->early_init) {
...
} else {
cgroup_init_subsys(ss, false);
}
...
if (!cgroup_ssid_enabled(ssid))
continue;
...
}
...
}
cgroup_init_subsys() makes sure a valid @css is allocated for the MISC cgroup
and sets the pointer in @init_css_set.
static void __init cgroup_init_subsys(struct cgroup_subsys *ss,
...)
{
struct cgroup_subsys_state *css;
...
css = ss->css_alloc(NULL);
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));
...
init_css_set.subsys[ss->id] = css;
...
}
All processes are by default associated to the @init_css_set:
void cgroup_fork(struct task_struct *child)
{
RCU_INIT_POINTER(child->cgroups, &init_css_set);
INIT_LIST_HEAD(&child->cg_list);
}
At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:
static int cgroup_apply_control_enable(struct cgroup *cgrp)
{
struct cgroup *dsct;
struct cgroup_subsys_state *d_css;
struct cgroup_subsys *ss;
int ssid, ret;
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
for_each_subsys(ss, ssid) {
struct cgroup_subsys_state *css =
cgroup_css(dsct, ss);
if (!(cgroup_ss_mask(dsct) &
(1 << ss->id)))
continue;
if (!css) {
css = css_create(dsct, ss);
if (IS_ERR(css))
return PTR_ERR(css);
}
...
}
}
}
We can see that if cgroup_ss_mask(dsct) doesn't have the subsystem enabled,
css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL.
However, when a process is bound to a specific cgroup, the kernel tries to
get the cgroup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.
There are a couple of code paths that can lead to this, but they all reach
static struct css_set *find_existing_css_set(...)
{
struct cgroup_root *root = cgrp->root;
struct cgroup_subsys *ss;
struct css_set *cset;
unsigned long key;
int i;
for_each_subsys(ss, i) {
if (root->subsys_mask & (1UL << i)) {
/*
* @ss is in this hierarchy, so we want
* the effective css from @cgrp.
*/
template[i] = cgroup_e_css_by_mask(cgrp,
ss);
} else {
/*
* @ss is not in this hierarchy, so we
* don't want to change the css.
*/
template[i] = old_cset->subsys[i];
}
}
...
}
This calls cgroup_e_css_by_mask() to get the "effective css" when the subsys
is enabled in the root cgroup (which means the MISC cgroup is not disabled on
the kernel command line), or gets the default css, which is
@init_css_set->subsys[ssid] and is always valid for the MISC cgroup.
More specifically, cgroup_e_css_by_mask() finds the "effective css" by
walking up the hierarchy, so the MISC cgroup will always have a valid
"effective css".
static struct cgroup_subsys_state *cgroup_e_css_by_mask(
struct cgroup *cgrp,
struct cgroup_subsys *ss)
{
lockdep_assert_held(&cgroup_mutex);
if (!ss)
return &cgrp->self;
...
while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
cgrp = cgroup_parent(cgrp);
if (!cgrp)
return NULL;
}
return cgroup_css(cgrp, ss);
}
The comment of cgroup_e_css_by_mask() says:
* Similar to cgroup_css() but returns the effective css, which is defined
* as the matching css of the nearest ancestor including self which has @ss
* enabled. If @ss is associated with the hierarchy @cgrp is on, this
* function is guaranteed to return non-NULL css.
It's hard for me to interpret the second sentence, specifically, what does
"@ss is associated with the hierarchy @cgrp is on" mean? I interpret it
as "the subsys is enabled in the root and/or any descendants".
But again, in the find_existing_css_set() it is called when the root
cgroup has enabled the subsys, so it should always return a non-NULL css.
And that means for any process, get_current_misc_cg() cannot be NULL.
On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang <[email protected]>
jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C tools/testing/selftests/sgx run_tests
make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -c sigstruct.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/.././../tools/include -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack -lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE -fno-stack-protector -mrdrnd -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include test_encl.c test_encl_bootstrap.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf -Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
TAP version 13
1..2
# timeout set to 45
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# # RUN enclave.unclobbered_vdso ...
# # OK enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# # RUN enclave.unclobbered_vdso_oversubscribed ...
# # OK enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# # RUN enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes from enclave may take a while ...
# # OK enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# # RUN enclave.clobbered_vdso ...
# # OK enclave.clobbered_vdso
# ok 4 enclave.clobbered_vdso
# # RUN enclave.clobbered_vdso_and_user_function ...
# # OK enclave.clobbered_vdso_and_user_function
# ok 5 enclave.clobbered_vdso_and_user_function
# # RUN enclave.tcs_entry ...
# # OK enclave.tcs_entry
# ok 6 enclave.tcs_entry
# # RUN enclave.pte_permissions ...
# # OK enclave.pte_permissions
# ok 7 enclave.pte_permissions
# # RUN enclave.tcs_permissions ...
# # OK enclave.tcs_permissions
# ok 8 enclave.tcs_permissions
# # RUN enclave.epcm_permissions ...
# # OK enclave.epcm_permissions
# ok 9 enclave.epcm_permissions
# # RUN enclave.augment ...
# # OK enclave.augment
# ok 10 enclave.augment
# # RUN enclave.augment_via_eaccept ...
# # OK enclave.augment_via_eaccept
# ok 11 enclave.augment_via_eaccept
# # RUN enclave.tcs_create ...
# # OK enclave.tcs_create
# ok 12 enclave.tcs_create
# # RUN enclave.remove_added_page_no_eaccept ...
# # OK enclave.remove_added_page_no_eaccept
# ok 13 enclave.remove_added_page_no_eaccept
# # RUN enclave.remove_added_page_invalid_access ...
# # OK enclave.remove_added_page_invalid_access
# ok 14 enclave.remove_added_page_invalid_access
# # RUN enclave.remove_added_page_invalid_access_after_eaccept ..
# # OK enclave.remove_added_page_invalid_access_after_eaccept
# ok 15 enclave.remove_added_page_invalid_access_after_eaccept
# # RUN enclave.remove_untouched_page ...
# # OK enclave.remove_untouched_page
# ok 16 enclave.remove_untouched_page
# # PASSED: 16 / 16 tests passed.
# # Totals: pass:16 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: sgx: test_sgx
# timeout set to 45
# selftests: sgx: run_epc_cg_selftests.sh
# # Setting up limits.
# ./run_epc_cg_selftests.sh: line 50: echo: write error: Invalid argument
# # Failed setting up misc limits.
not ok 2 selftests: sgx: run_epc_cg_selftests.sh # exit=1
make: Leaving directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
This is what happens now.
BTW, I noticed a file that should not exist, i.e. README. The only thing
that should exist is the tests for kselftest, and anything else should
not exist at all, so this file by definition should not exist.
BR, Jarkko
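For reference, the SMALL/LARGE/LARGER cgroups described in the quoted commit
message boil down to misc.max writes like the ones below; the "write error:
Invalid argument" above is consistent with such a write being rejected when
the kernel does not expose the sgx_epc resource. This is only a sketch,
assuming cgroup v2 at /sys/fs/cgroup; the cgroup names and the fractions of
capacity are illustrative, not the values the script actually uses.
  CG=/sys/fs/cgroup
  CAPACITY=$(grep sgx_epc "$CG/misc.capacity" | awk '{print $2}')
  echo "+misc" > "$CG/cgroup.subtree_control"
  mkdir "$CG/sgx_small" "$CG/sgx_large" "$CG/sgx_larger"
  # SMALL: well below capacity, so a capacity-sized enclave must fail to load.
  echo "sgx_epc $((CAPACITY / 8))" > "$CG/sgx_small/misc.max"
  # LARGE: below capacity; some of >4 concurrent oversubscribed tests should fail.
  echo "sgx_epc $((CAPACITY / 2))" > "$CG/sgx_large/misc.max"
  # LARGER: equal to capacity; all concurrent tests are expected to pass.
  echo "sgx_epc $CAPACITY" > "$CG/sgx_larger/misc.max"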
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
> of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
> allocated page into the global LRU list while
> sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
> global LRU references in these functions to make them reusable when
> pages are tracked in per-cgroup LRUs.
>
> Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
> EPC page. It simply returns the global LRU now, and will later return
> the LRU of the cgroup within which the EPC page was allocated. Replace
> the hard coded global LRU with a call to this helper.
>
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.
I found the first paragraph hard to read. Provide my version below for
your reference:
"
The SGX driver tracks reclaimable EPC pages via
sgx_mark_page_reclaimable(), which adds the newly allocated page into the
global LRU list. sgx_unmark_page_reclaimable() does the opposite.
To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and the new allocated EPC page will need to be added
to the LRU of associated cgroup, but not always the global LRU list.
When sgx_mark_page_reclaimable() is called, the cgroup that the new
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.
Add a helper, sgx_lru_list(), to return the LRU that the EPC page should
be/is added to for the given EPC page. Currently it just returns the
global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the helper
function to get the LRU from the EPC page instead of referring to the
global LRU directly.
This allows an EPC page to be tracked in a "per-cgroup" LRU when
that becomes ready.
"
Nit:
That being said, is sgx_epc_page_lru() better than sgx_lru_list()?
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Reviewed-by: Jarkko Sakkinen <[email protected]>
> Tested-by: Jarkko Sakkinen <[email protected]>
> ---
>
Feel free to add:
Reviewed-by: Kai Huang <[email protected]>
On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> >
> > I'll send a fixup for this patch or another version of the series if more
> > changes are needed.
>
> Hi Haitao,
>
> I don't like to say it, but in general I think you are sending too frequently. The
> last version was sent April 11th (my time), so considering the weekend it has
> only been 3 or at most 4 days.
>
> Please slow down a little bit to give people more time.
>
> More information please also see:
>
> https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders
+1
Yes, exactly. I'd take a one week break and cycle the kselftest part
internally a bit, as I said in my previous response. I'm sure that there
is expertise inside Intel on how to implement it properly. I.e. take some
time to find the right person, and wait until that person has a
bit of bandwidth to go through the test and suggest modifications.
I cannot blame you, as I've made the same mistake a few times in the past,
but yeah, this would be the best possible corrective action to take.
BR, Jarkko
On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote:
> On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> > With different cgroups, the script starts one or multiple concurrent SGX
> > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> > test case, which loads an enclave of EPC size equal to the EPC capacity
> > available on the platform. The script checks results against the
> > expectation set for each cgroup and reports success or failure.
> >
> > The script creates 3 different cgroups at the beginning with following
> > expectations:
> >
> > 1) SMALL - intentionally small enough to fail the test loading an
> > enclave of size equal to the capacity.
> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> > more than 4 concurrent tests are run. The script starts 4 expecting at
> > least one test to pass, and then starts 5 expecting at least one test
> > to fail.
> > 3) LARGER - limit is the same as the capacity, large enough to run lots of
> > concurrent tests. The script starts 8 of them and expects all pass.
> > Then it reruns the same test with one process randomly killed and
> > usage checked to be zero after all processes exit.
> >
> > The script also includes a test with low mem_cg limit and LARGE sgx_epc
> > limit to verify that the RAM used for per-cgroup reclamation is charged
> > to a proper mem_cg. For this test, it turns off swapping before start,
> > and turns swapping back on afterwards.
> >
> > Add README to document how to run the tests.
> >
> > Signed-off-by: Haitao Huang <[email protected]>
>
> jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C tools/testing/selftests/sgx run_tests
> make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -c sigstruct.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
> gcc -Wall -Werror -g -I/home/jarkko/linux-tpmdd/tools/testing/selftests/./../../tools/include -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack -lcrypto
> gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE -fno-stack-protector -mrdrnd -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../.././tools/include test_encl.c test_encl_bootstrap.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf -Wl,-T,test_encl.lds,--build-id=none
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable stack
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
> TAP version 13
> 1..2
> # timeout set to 45
> # selftests: sgx: test_sgx
> # TAP version 13
> # 1..16
> # # Starting 16 tests from 1 test cases.
> # # RUN enclave.unclobbered_vdso ...
> # # OK enclave.unclobbered_vdso
> # ok 1 enclave.unclobbered_vdso
> # # RUN enclave.unclobbered_vdso_oversubscribed ...
> # # OK enclave.unclobbered_vdso_oversubscribed
> # ok 2 enclave.unclobbered_vdso_oversubscribed
> # # RUN enclave.unclobbered_vdso_oversubscribed_remove ...
> # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 98566144 bytes heap may take a while ...
> # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 bytes to trimmed may take a while ...
> # # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run EACCEPT for each page of 98566144 bytes may take a while ...
> # # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes from enclave may take a while ...
> # # OK enclave.unclobbered_vdso_oversubscribed_remove
> # ok 3 enclave.unclobbered_vdso_oversubscribed_remove
> # # RUN enclave.clobbered_vdso ...
> # # OK enclave.clobbered_vdso
> # ok 4 enclave.clobbered_vdso
> # # RUN enclave.clobbered_vdso_and_user_function ...
> # # OK enclave.clobbered_vdso_and_user_function
> # ok 5 enclave.clobbered_vdso_and_user_function
> # # RUN enclave.tcs_entry ...
> # # OK enclave.tcs_entry
> # ok 6 enclave.tcs_entry
> # # RUN enclave.pte_permissions ...
> # # OK enclave.pte_permissions
> # ok 7 enclave.pte_permissions
> # # RUN enclave.tcs_permissions ...
> # # OK enclave.tcs_permissions
> # ok 8 enclave.tcs_permissions
> # # RUN enclave.epcm_permissions ...
> # # OK enclave.epcm_permissions
> # ok 9 enclave.epcm_permissions
> # # RUN enclave.augment ...
> # # OK enclave.augment
> # ok 10 enclave.augment
> # # RUN enclave.augment_via_eaccept ...
> # # OK enclave.augment_via_eaccept
> # ok 11 enclave.augment_via_eaccept
> # # RUN enclave.tcs_create ...
> # # OK enclave.tcs_create
> # ok 12 enclave.tcs_create
> # # RUN enclave.remove_added_page_no_eaccept ...
> # # OK enclave.remove_added_page_no_eaccept
> # ok 13 enclave.remove_added_page_no_eaccept
> # # RUN enclave.remove_added_page_invalid_access ...
> # # OK enclave.remove_added_page_invalid_access
> # ok 14 enclave.remove_added_page_invalid_access
> # # RUN enclave.remove_added_page_invalid_access_after_eaccept ...
> # # OK enclave.remove_added_page_invalid_access_after_eaccept
> # ok 15 enclave.remove_added_page_invalid_access_after_eaccept
> # # RUN enclave.remove_untouched_page ...
> # # OK enclave.remove_untouched_page
> # ok 16 enclave.remove_untouched_page
> # # PASSED: 16 / 16 tests passed.
> # # Totals: pass:16 fail:0 xfail:0 xpass:0 skip:0 error:0
> ok 1 selftests: sgx: test_sgx
> # timeout set to 45
> # selftests: sgx: run_epc_cg_selftests.sh
> # # Setting up limits.
> # ./run_epc_cg_selftests.sh: line 50: echo: write error: Invalid argument
> # # Failed setting up misc limits.
> not ok 2 selftests: sgx: run_epc_cg_selftests.sh # exit=1
> make: Leaving directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
>
> This is what happens now.
>
> BTW, I noticed a file that should not exist, i.e. README. Only thing
> that should exist is the tests for kselftest and anything else should
> not exist at all, so this file by definiton should not exist.
I'd suggest sanity-checking the kselftest with a person from Intel who
has worked with kselftest before the next version, so that it will be
nailed next time. Or better, internally review this single patch with a
person with expertise in kernel QA.
I did not check this, but I also suspect that it might have some
checks for whether it is run as root or not. If there are any, those should
be removed too. Let people set up their environment however they want...
BR, Jarkko
On Tue, 16 Apr 2024 00:42:41 -0500, Huang, Kai <[email protected]> wrote:
>>
>> I'll send a fixup for this patch or another version of the series if
>> more
>> changes are needed.
>
> Hi Haitao,
>
> I don't like to say but in general I think you are sending too
> frequently. The
> last version was sent April, 11th (my time), so considering the weekend
> it has
> only been 3 or at most 4 days.
> Please slow down a little bit to give people more time.
>
> More information please also see:
>
> https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders
Thanks for the feedback. I'll slow down sending new versions.
Haitao
On Tue, 16 Apr 2024 09:10:12 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote:
>> On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
>> > With different cgroups, the script starts one or multiple concurrent
>> SGX
>> > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
>> > test case, which loads an enclave of EPC size equal to the EPC
>> capacity
>> > available on the platform. The script checks results against the
>> > expectation set for each cgroup and reports success or failure.
>> >
>> > The script creates 3 different cgroups at the beginning with following
>> > expectations:
>> >
>> > 1) SMALL - intentionally small enough to fail the test loading an
>> > enclave of size equal to the capacity.
>> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some
>> if
>> > more than 4 concurrent tests are run. The script starts 4 expecting at
>> > least one test to pass, and then starts 5 expecting at least one test
>> > to fail.
>> > 3) LARGER - limit is the same as the capacity, large enough to run
>> lots of
>> > concurrent tests. The script starts 8 of them and expects all pass.
>> > Then it reruns the same test with one process randomly killed and
>> > usage checked to be zero after all processes exit.
>> >
>> > The script also includes a test with low mem_cg limit and LARGE
>> sgx_epc
>> > limit to verify that the RAM used for per-cgroup reclamation is
>> charged
>> > to a proper mem_cg. For this test, it turns off swapping before start,
>> > and turns swapping back on afterwards.
>> >
>> > Add README to document how to run the tests.
>> >
>> > Signed-off-by: Haitao Huang <[email protected]>
>>
>> jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C
>> tools/testing/selftests/sgx run_tests
>> make: Entering directory
>> '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -c main.c -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -c load.c -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -c sigstruct.c -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -c call.S -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -c sign_key.S -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
>> gcc -Wall -Werror -g
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z
>> noexecstack -lcrypto
>> gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE
>> -fno-stack-protector -mrdrnd
>> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include
>> test_encl.c test_encl_bootstrap.S -o
>> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf
>> -Wl,-T,test_encl.lds,--build-id=none
>> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:
>> warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies
>> executable stack
>> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:
>> NOTE: This behaviour is deprecated and will be removed in a future
>> version of the linker
>> TAP version 13
>> 1..2
>> # timeout set to 45
>> # selftests: sgx: test_sgx
>> # TAP version 13
>> # 1..16
>> # # Starting 16 tests from 1 test cases.
>> # # RUN enclave.unclobbered_vdso ...
>> # # OK enclave.unclobbered_vdso
>> # ok 1 enclave.unclobbered_vdso
>> # # RUN enclave.unclobbered_vdso_oversubscribed ...
>> # # OK enclave.unclobbered_vdso_oversubscribed
>> # ok 2 enclave.unclobbered_vdso_oversubscribed
>> # # RUN enclave.unclobbered_vdso_oversubscribed_remove ...
>> # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an
>> enclave with 98566144 bytes heap may take a while ...
>> # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of
>> 98566144 bytes to trimmed may take a while ...
>> # # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave
>> to run EACCEPT for each page of 98566144 bytes may take a while ...
>> # # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144
>> bytes from enclave may take a while ...
>> # # OK enclave.unclobbered_vdso_oversubscribed_remove
>> # ok 3 enclave.unclobbered_vdso_oversubscribed_remove
>> # # RUN enclave.clobbered_vdso ...
>> # # OK enclave.clobbered_vdso
>> # ok 4 enclave.clobbered_vdso
>> # # RUN enclave.clobbered_vdso_and_user_function ...
>> # # OK enclave.clobbered_vdso_and_user_function
>> # ok 5 enclave.clobbered_vdso_and_user_function
>> # # RUN enclave.tcs_entry ...
>> # # OK enclave.tcs_entry
>> # ok 6 enclave.tcs_entry
>> # # RUN enclave.pte_permissions ...
>> # # OK enclave.pte_permissions
>> # ok 7 enclave.pte_permissions
>> # # RUN enclave.tcs_permissions ...
>> # # OK enclave.tcs_permissions
>> # ok 8 enclave.tcs_permissions
>> # # RUN enclave.epcm_permissions ...
>> # # OK enclave.epcm_permissions
>> # ok 9 enclave.epcm_permissions
>> # # RUN enclave.augment ...
>> # # OK enclave.augment
>> # ok 10 enclave.augment
>> # # RUN enclave.augment_via_eaccept ...
>> # # OK enclave.augment_via_eaccept
>> # ok 11 enclave.augment_via_eaccept
>> # # RUN enclave.tcs_create ...
>> # # OK enclave.tcs_create
>> # ok 12 enclave.tcs_create
>> # # RUN enclave.remove_added_page_no_eaccept ...
>> # # OK enclave.remove_added_page_no_eaccept
>> # ok 13 enclave.remove_added_page_no_eaccept
>> # # RUN enclave.remove_added_page_invalid_access ...
>> # # OK enclave.remove_added_page_invalid_access
>> # ok 14 enclave.remove_added_page_invalid_access
>> # # RUN
>> enclave.remove_added_page_invalid_access_after_eaccept ...
>> # # OK
>> enclave.remove_added_page_invalid_access_after_eaccept
>> # ok 15 enclave.remove_added_page_invalid_access_after_eaccept
>> # # RUN enclave.remove_untouched_page ...
>> # # OK enclave.remove_untouched_page
>> # ok 16 enclave.remove_untouched_page
>> # # PASSED: 16 / 16 tests passed.
>> # # Totals: pass:16 fail:0 xfail:0 xpass:0 skip:0 error:0
>> ok 1 selftests: sgx: test_sgx
>> # timeout set to 45
>> # selftests: sgx: run_epc_cg_selftests.sh
>> # # Setting up limits.
>> # ./run_epc_cg_selftests.sh: line 50: echo: write error: Invalid
>> argument
>> # # Failed setting up misc limits.
>> not ok 2 selftests: sgx: run_epc_cg_selftests.sh # exit=1
>> make: Leaving directory
This means the sgx cgroup is not turned on (echoing sgx_epc entries into
misc.max is not allowed).
v12 removed the need for the config CGROUP_SGX_EPC.
Did you by chance run on a previous kernel build without the sgx
cgroup configured?
I did declare the configs in the config file but I missed it in my patch,
as stated earlier. IIUC, that would not cause this error though.
Maybe I should exit with the skip code if no CGROUP_MISC (there is no more
CGROUP_SGX_EPC) is configured?
>> '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
>>
>> This is what happens now.
>>
>> BTW, I noticed a file that should not exist, i.e. README. Only thing
>> that should exist is the tests for kselftest and anything else should
>> not exist at all, so this file by definiton should not exist.
>
Could you point me to this rule?
I felt some instructions were needed as the tests are getting more complex,
and was following these examples:
tools/testing/selftests$ find . -name README
./futex/README
./tc-testing/README
./net/forwarding/README
./powerpc/nx-gzip/README
./ftrace/README
./arm64/signal/README
./arm64/fp/README
./arm64/README
./zram/README
./livepatch/README
./resctrl/README
> I'd suggest to sanity-check the kselftest with a person from Intel who
> has worked with kselftest before the next version so that it will be
> nailed next time. Or better internal review this single patch with a
> person with expertise on kernel QA.
>
I'll double check.
> I did not check this but I have also suspicion that it might have some
> checks whetehr it is run as root or not. If there are any, those should
> be removed too. Let people set their environment however want...
>
Do you mean this part?
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+if [ "$(id -u)" -ne 0 ]; then
+ echo "SKIP: SGX Cgroup tests need root privileges."
+ exit $ksft_skip
+fi
I saw lots of similar code reported when I ran following in the selftests
directory:
tools/testing/selftests$ grep -C 5 -r "root" */*.sh
Thanks
Haitao
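A minimal sketch of the suggested skip, reusing the ksft_skip convention shown
earlier in this reply and assuming cgroup v2 is mounted at /sys/fs/cgroup (the
exact check the script ends up using may differ):
  # Kselftest framework requirement - SKIP code is 4.
  ksft_skip=4
  # Skip when the misc controller (CONFIG_CGROUP_MISC) is unavailable, or when
  # it does not expose the sgx_epc resource.
  if ! grep -qw misc /sys/fs/cgroup/cgroup.controllers 2>/dev/null || \
     ! grep -q sgx_epc /sys/fs/cgroup/misc.capacity 2>/dev/null; then
          echo "SKIP: misc cgroup controller with sgx_epc resource not available."
          exit $ksft_skip
  fi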
On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> I did declare the configs in the config file but I missed it in my patch
> as stated earlier. IIUC, that would not cause this error though.
>
> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> CGROUP_SGX_EPC) is configured?
OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
> tools/testing/selftests$ find . -name README
> ./futex/README
> ./tc-testing/README
> ./net/forwarding/README
> ./powerpc/nx-gzip/README
> ./ftrace/README
> ./arm64/signal/README
> ./arm64/fp/README
> ./arm64/README
> ./zram/README
> ./livepatch/README
> ./resctrl/README
So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?
BR, Jarkko
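For reference, one way to check whether the running kernel was built with
CONFIG_CGROUP_MISC (either command, depending on how the distro exposes the
kernel config) is:
  grep CONFIG_CGROUP_MISC "/boot/config-$(uname -r)"
  zgrep CONFIG_CGROUP_MISC /proc/config.gz   # requires CONFIG_IKCONFIG_PROC=y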
On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
>> I did declare the configs in the config file but I missed it in my patch
>> as stated earlier. IIUC, that would not cause this error though.
>>
>> Maybe I should exit with the skip code if no CGROUP_MISC (no more
>> CGROUP_SGX_EPC) is configured?
>
> OK, so I wanted to do a distro kernel test here, and used the default
> OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
>
>> tools/testing/selftests$ find . -name README
>> ./futex/README
>> ./tc-testing/README
>> ./net/forwarding/README
>> ./powerpc/nx-gzip/README
>> ./ftrace/README
>> ./arm64/signal/README
>> ./arm64/fp/README
>> ./arm64/README
>> ./zram/README
>> ./livepatch/README
>> ./resctrl/README
>
> So is there a README because of override timeout parameter? Maybe it
> should be just set to a high enough value?
>
> BR, Jarkko
>
From the docs, I think we are supposed to use override.
See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests
Thanks
Haitao
On Mon, 15 Apr 2024 22:20:02 -0500, Haitao Huang
<[email protected]> wrote:
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..400baa7cfb69 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -1,6 +1,7 @@
> obj-y += \
> driver.o \
> encl.o \
> + epc_cgroup.o \
It should be:
+obj-$(CONFIG_CGROUP_MISC) += epc_cgroup.o
Haitao
On Tue, 16 Apr 2024 17:04:21 -0500, Haitao Huang
<[email protected]> wrote:
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
>> On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
>>> I did declare the configs in the config file but I missed it in my
>>> patch
>>> as stated earlier. IIUC, that would not cause this error though.
>>>
>>> Maybe I should exit with the skip code if no CGROUP_MISC (no more
>>> CGROUP_SGX_EPC) is configured?
>>
>> OK, so I wanted to do a distro kernel test here, and used the default
>> OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
>>
>>> tools/testing/selftests$ find . -name README
>>> ./futex/README
>>> ./tc-testing/README
>>> ./net/forwarding/README
>>> ./powerpc/nx-gzip/README
>>> ./ftrace/README
>>> ./arm64/signal/README
>>> ./arm64/fp/README
>>> ./arm64/README
>>> ./zram/README
>>> ./livepatch/README
>>> ./resctrl/README
>>
>> So is there a README because of override timeout parameter? Maybe it
>> should be just set to a high enough value?
>>
>> BR, Jarkko
>>
>
>
> From the docs, I think we are supposed to use override.
> See:
> https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests
>
> Thanks
> Haitao
>
Maybe you are suggesting we add a settings file? I can do that.
The README also explains what the tests do though. Do you still think it
should not exist?
I was mostly following resctrl as an example.
Thanks
Haitao
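A kselftest "settings" file is just a plain key=value file placed in the test
directory; a sketch for the sgx tests (the timeout value here is arbitrary)
would be:
  # Override the default 45 second kselftest timeout for the long-running tests.
  echo "timeout=180" > tools/testing/selftests/sgx/settings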
On Tue, 16 Apr 2024 09:07:33 -0500, Huang, Kai <[email protected]> wrote:
> On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
>> of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
>> allocated page into the global LRU list while
>> sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
>> global LRU references in these functions to make them reusable when
>> pages are tracked in per-cgroup LRUs.
>>
>> Create a helper, sgx_lru_list(), that returns the LRU that tracks a
>> given
>> EPC page. It simply returns the global LRU now, and will later return
>> the LRU of the cgroup within which the EPC page was allocated. Replace
>> the hard coded global LRU with a call to this helper.
>>
>> Next patches will first get the cgroup reclamation flow ready while
>> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
>> make the switch in the end for sgx_lru_list() to return per-cgroup
>> LRU.
>
> I found the first paragraph hard to read. Provide my version below for
> your reference:
>
> "
> The SGX driver tracks reclaimable EPC pages via
> sgx_mark_page_reclaimable(), which adds the newly allocated page into the
> global LRU list. sgx_unmark_page_reclaimable() does the opposite.
>
> To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
> list for each cgroup, and the new allocated EPC page will need to be
> added
> to the LRU of associated cgroup, but not always the global LRU list.
>
> When sgx_mark_page_reclaimable() is called, the cgroup that the new
> allocated EPC page belongs to is already known, i.e., it has been set to
> the 'struct sgx_epc_page'.
>
> Add a helper, sgx_lru_list(), to return the LRU that the EPC page should
> be/is added to for the given EPC page. Currently it just returns the
> global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the
> helper
> function to get the LRU from the EPC page instead of referring to the
> global LRU directly.
>
> This allows EPC page being able to be tracked in "per-cgroup" LRU when
> that becomes ready.
> "
>
Thanks. Will use
> Nit:
>
> That being said, is sgx_epc_page_lru() better than sgx_lru_list()?
>
Sure
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> Reviewed-by: Jarkko Sakkinen <[email protected]>
>> Tested-by: Jarkko Sakkinen <[email protected]>
>> ---
>>
>
> Feel free to add:
>
> Reviewed-by: Kai Huang <[email protected]>
Thanks
Haitao
On Tue, 16 Apr 2024 17:21:24 -0500, Haitao Huang
<[email protected]> wrote:
> On Tue, 16 Apr 2024 17:04:21 -0500, Haitao Huang
> <[email protected]> wrote:
>
>> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
>> wrote:
>>
>>> On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
>>>> I did declare the configs in the config file but I missed it in my
>>>> patch
>>>> as stated earlier. IIUC, that would not cause this error though.
>>>>
>>>> Maybe I should exit with the skip code if no CGROUP_MISC (no more
>>>> CGROUP_SGX_EPC) is configured?
>>>
>>> OK, so I wanted to do a distro kernel test here, and used the default
>>> OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
>>>
>>>> tools/testing/selftests$ find . -name README
>>>> ./futex/README
>>>> ./tc-testing/README
>>>> ./net/forwarding/README
>>>> ./powerpc/nx-gzip/README
>>>> ./ftrace/README
>>>> ./arm64/signal/README
>>>> ./arm64/fp/README
>>>> ./arm64/README
>>>> ./zram/README
>>>> ./livepatch/README
>>>> ./resctrl/README
>>>
>>> So is there a README because of override timeout parameter? Maybe it
>>> should be just set to a high enough value?
>>>
>>> BR, Jarkko
>>>
>>
>>
>> From the docs, I think we are supposed to use override.
>> See:
>> https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests
>>
>> Thanks
>> Haitao
>>
>
> Maybe you are suggesting we add settings file? I can do that.
> README also explains what the tests do though. Do you still think they
> should not exist?
> I was mostly following resctrl as example.
>
> Thanks
> Haitao
>
With the settings file I shortened the README quite a bit. Now I also lean
towards removing it. Let me know your preference. You can check the latest
at my branch for reference:
https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v12_plus
Thanks
Haitao
On Wed Apr 17, 2024 at 1:04 AM EEST, Haitao Huang wrote:
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> >> I did declare the configs in the config file but I missed it in my patch
> >> as stated earlier. IIUC, that would not cause this error though.
> >>
> >> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> >> CGROUP_SGX_EPC) is configured?
> >
> > OK, so I wanted to do a distro kernel test here, and used the default
> > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
> >
> >> tools/testing/selftests$ find . -name README
> >> ./futex/README
> >> ./tc-testing/README
> >> ./net/forwarding/README
> >> ./powerpc/nx-gzip/README
> >> ./ftrace/README
> >> ./arm64/signal/README
> >> ./arm64/fp/README
> >> ./arm64/README
> >> ./zram/README
> >> ./livepatch/README
> >> ./resctrl/README
> >
> > So is there a README because of override timeout parameter? Maybe it
> > should be just set to a high enough value?
> >
> > BR, Jarkko
> >
>
>
> From the docs, I think we are supposed to use override.
> See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests
OK, fair enough :-) I did not know this.
BR, Jarkko
On 16/04/2024 3:20 pm, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> Currently in the EPC page allocation, the kernel simply fails the
> allocation when the current EPC cgroup fails to charge due to its usage
> reaching limit. This is not ideal. When that happens, a better way is
> to reclaim EPC page(s) from the current EPC cgroup (and/or its
> descendants) to reduce its usage so the new allocation can succeed.
>
> Add the basic building blocks to support per-cgroup reclamation.
>
> Currently the kernel only has one place to reclaim EPC pages: the global
> EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU
> list for each EPC cgroup, and introduce a "cgroup" variant function to
> reclaim EPC pages from a given EPC cgroup and its descendants.
>
> Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
> It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
> pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
> pages from the global LRU, and then tries to reclaim these pages at once
> for better performance.
>
> Implement the "cgroup" variant EPC reclaim in a similar way, but keep
> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
> as input, and return the pages that are "scanned" and attempted for
> reclamation (but not necessarily reclaimed successfully); 2) loop the
> given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
> until SGX_NR_TO_SCAN pages are "scanned".
>
> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
> cgroup, and only moves to its descendants when there are not enough
> reclaimable EPC pages to "scan" in its LRU. It should be enough for
> most cases.
>
> Note, this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in batch
> of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
> reclaimable pages, the actual reclaim of EPC pages will be split into
> smaller batches _across_ multiple LRUs with each being smaller than
> SGX_NR_TO_SCAN pages.
>
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch. But this is unnecessarily complicated at
> this stage.
>
> Alternatively, the current sgx_reclaim_pages() could be changed to
> return the actual "reclaimed" pages, but not "scanned" pages. However,
> the reclamation is a lengthy process, forcing a successful reclamation
> of predetermined number of pages may block the caller for too long. And
> that may not be acceptable in some synchronous contexts, e.g., in
> serving an ioctl().
>
> With this building block in place, add synchronous reclamation support
> in sgx_cgroup_try_charge(): trigger a call to
> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
> caller allows synchronous reclaim as indicated by a newly added
> parameter.
>
> A later patch will add support for asynchronous reclamation reusing
> sgx_cgroup_reclaim_pages().
>
> Note all reclaimable EPC pages are still tracked in the global LRU thus
> no per-cgroup reclamation is actually active at the moment. Per-cgroup
> tracking and reclamation will be turned on in the end after all
> necessary infrastructure is in place.
Nit:
"all necessary infrastructures are in place", or, "all necessary
building blocks are in place".
?
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Tested-by: Jarkko Sakkinen <[email protected]>
> ---
Reviewed-by: Kai Huang <[email protected]>
More nitpickings below:
[...]
> -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
Let's still wrap the text on 80-character basis.
I guess most people are more used to that.
[...]
> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> - struct sgx_epc_page, list);
> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
Ditto.
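Also, just to double-check my reading of the changelog, the flow it
describes is roughly the sketch below. Names follow the patch as I read it
(sgx_cgroup_reclaim_pages(), __sgx_cgroup_try_charge(), SGX_NO_RECLAIM,
the per-cgroup 'lru'), but the details here are illustrative, not the
actual patch code:

	/* Sketch only: walk the cgroup and its descendants, scanning each LRU. */
	static void sgx_cgroup_reclaim_pages(struct misc_cg *root)
	{
		struct cgroup_subsys_state *css_root = &root->css;
		struct cgroup_subsys_state *pos;
		struct sgx_cgroup *sgx_cg;
		unsigned int cnt = 0;

		rcu_read_lock();
		css_for_each_descendant_pre(pos, css_root) {
			if (!css_tryget(pos))
				break;

			rcu_read_unlock();

			sgx_cg = sgx_cgroup_from_misc_cg(container_of(pos, struct misc_cg, css));

			/* Scan (and attempt to reclaim) from this cgroup's own LRU. */
			cnt += sgx_reclaim_pages(&sgx_cg->lru);

			rcu_read_lock();
			css_put(pos);

			/* Stop once SGX_NR_TO_SCAN pages have been "scanned" in total. */
			if (cnt >= SGX_NR_TO_SCAN)
				break;
		}
		rcu_read_unlock();
	}

	/* Sketch only: synchronous reclaim-and-retry when charging hits the limit. */
	int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
	{
		int ret;

		for (;;) {
			ret = __sgx_cgroup_try_charge(sgx_cg);
			if (ret != -EBUSY || reclaim == SGX_NO_RECLAIM)
				return ret;

			/* Over the limit: reclaim within this cgroup's subtree, then retry. */
			sgx_cgroup_reclaim_pages(sgx_cg->cg);
			cond_resched();
		}
	}

If that matches your intent, the "only move to descendants when the
parent's LRU runs short" behavior falls out of the pre-order walk naturally.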
On Tue, 16 Apr 2024 08:22:06 -0500, Huang, Kai <[email protected]> wrote:
> On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
>> RAM allocations, and are managed solely by the SGX subsystem. The
>> existing cgroup memory controller cannot be used to limit or account for
>> SGX EPC memory, which is a desirable feature in some environments. For
>> instance, within a Kubernetes environment, while a user may specify a
>> particular EPC quota for a pod, the orchestrator requires a mechanism to
>> enforce that the pod's actual runtime EPC usage does not exceed the
>> allocated quota.
>>
>> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
>> limit and track EPC allocations per cgroup. Earlier patches have added
>> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
>> support in SGX driver as the "sgx_epc" resource provider:
>>
>> - Set "capacity" of EPC by calling misc_cg_set_capacity()
>> - Update EPC usage counter, "current", by calling charge and uncharge
>> APIs for EPC allocation and deallocation, respectively.
>> - Setup sgx_epc resource type specific callbacks, which perform
>> initialization and cleanup during cgroup allocation and deallocation,
>> respectively.
>>
>> With these changes, the misc cgroup controller enables users to set a
>> hard
>> limit for EPC usage in the "misc.max" interface file. It reports current
>> usage in "misc.current", the total EPC memory available in
>> "misc.capacity", and the number of times EPC usage reached the max limit
>> in "misc.events".
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> Reviewed-by: Jarkko Sakkinen <[email protected]>
>> Reviewed-by: Tejun Heo <[email protected]>
>> Tested-by: Jarkko Sakkinen <[email protected]>
>
> I don't see any big issue, so feel free to add:
>
> Reviewed-by: Kai Huang <[email protected]>
>
Thanks
> Nitpickings below:
>
> [...]
>
>
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -0,0 +1,72 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright(c) 2022-2024 Intel Corporation. */
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/kernel.h>
>
> It doesn't seem you need the above two here.
>
> Probably they are needed in later patches, in that case we can move to
> the
> relevant patch(es) that they got used.
>
> However I think it's better to explicitly include <linux/slab.h> since
> kzalloc()/kfree() are used.
>
> Btw, I am not sure whether you want to use <linux/kernel.h> because looks
> it contains a lot of unrelated staff. Anyway I guess nobody cares.
>
I'll check and remove as needed.
>> +#include "epc_cgroup.h"
>> +
>> +/* The root SGX EPC cgroup */
>> +static struct sgx_cgroup sgx_cg_root;
>
> The comment isn't necessary (sorry didn't notice before), because the
> code
> is pretty clear saying that IMHO.
>
Was requested by Jarkko:
https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t
> [...]
>
>>
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -0,0 +1,72 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _SGX_EPC_CGROUP_H_
>> +#define _SGX_EPC_CGROUP_H_
>> +
>> +#include <asm/sgx.h>
>
> I don't see why you need <asm/sgx.h> here. Also, ...
>
>> +#include <linux/cgroup.h>
>> +#include <linux/misc_cgroup.h>
>> +
>> +#include "sgx.h"
>
> ... "sgx.h" already includes <asm/sgx.h>
>
> [...]
>
right
>>
>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>> +{
>> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
>> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>> +}
>
> I spent some time looking into this. And yes if I was reading code
> correctly the get_current_misc_cg() should never return NULL when Kconfig
> is on.
>
> I typed my analysis below in [*]. And it would be helpful if any cgroup
> expert can have a second eye on this.
>
> [...]
>
Thanks for checking this. I did a similar analysis and agree with the
conclusion. I think this is also confirmed by Michal's description AFAICT:
"
The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist."
>
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -6,6 +6,7 @@
>> #include <linux/highmem.h>
>> #include <linux/kthread.h>
>> #include <linux/miscdevice.h>
>> +#include <linux/misc_cgroup.h>
>
> Is this needed? I believe SGX variants in "epc_cgroup.h" should be
> enough
> for sgx/main.c?
>
> [...]
>
>
right
> [*] IIUC get_current_misc_cg() should never return NULL when Kconfig is
> on
yes
[...]
Thanks
Haitao
> Was requested by Jarkko:
> https://lore.kernel.org/lkml/CYU504RLY7QU.QZY9LWC076NX@suppilovahvero/#t
>
>> [...]
Ah I missed that. No problem to me.
>>
>>>
>>> --- /dev/null
>>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>> @@ -0,0 +1,72 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _SGX_EPC_CGROUP_H_
>>> +#define _SGX_EPC_CGROUP_H_
>>> +
>>> +#include <asm/sgx.h>
>>
>> I don't see why you need <asm/sgx.h> here. Also, ...
>>
>>> +#include <linux/cgroup.h>
>>> +#include <linux/misc_cgroup.h>
>>> +
>>> +#include "sgx.h"
>>
>> ... "sgx.h" already includes <asm/sgx.h>
>>
>> [...]
>>
> right
>
>>>
>>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>>> +{
>>> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
>>> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>>> +}
>>
>> I spent some time looking into this. And yes if I was reading code
>> correctly the get_current_misc_cg() should never return NULL when Kconfig
>> is on.
>>
>> I typed my analysis below in [*]. And it would be helpful if any cgroup
>> expert can have a second eye on this.
>>
>> [...]
>>
> Thanks for checking this and I did similar and agree with the
> conclusion. I think this is confirmed also by Michal's description AFAICT:
> "
> The current implementation creates root css object (see cgroup_init(),
> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
> I.e. it will look like all tasks are members of root cgroup wrt given
> controller permanently and controller attribute files won't exist."
After looking, I believe we can even disable the MISC cgroup at runtime for a
particular cgroup (haven't actually verified on a real machine, though):
# echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control
And if you look at the MISC cgroup core code, many functions actually
handle a NULL css, e.g., misc_cg_try_charge():
int misc_cg_try_charge(enum misc_res_type type,
struct misc_cg *cg, u64 amount)
{
...
if (!(valid_type(type) && cg &&
READ_ONCE(misc_res_capacity[type])))
return -EINVAL;
...
}
That's why I am still a little bit worried about this. And it's better
to have cgroup expert(s) to confirm here.
Btw, AMD SEV doesn't need to worry because it doesn't dereference @css
but just passes it to MISC cgroup core functions like
misc_cg_try_charge(). But for SGX, we actually dereference it directly.
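E.g., the dereference I mean is in sgx_get_current_cg(). A defensive
variant would look roughly like the below -- purely illustrative, not what
the patch does, and hopefully unnecessary per the analysis above:

	static inline struct sgx_cgroup *sgx_get_current_cg(void)
	{
		struct misc_cg *cg = get_current_misc_cg();

		/* Should never happen when CONFIG_CGROUP_MISC is on. */
		if (WARN_ON_ONCE(!cg))
			return NULL;

		return sgx_cgroup_from_misc_cg(cg);
	}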
On 16/04/2024 3:20 pm, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> In cases where EPC pages need to be allocated during a page fault and the
> cgroup usage is near its limit, an asynchronous reclamation needs to be
> triggered to avoid blocking the page fault handling.
>
> Create a workqueue, corresponding work item and function definitions
> for EPC cgroup to support the asynchronous reclamation.
>
> In case the workqueue allocation fails during init, disable the cgroup.
It's fine and reasonable to disable (SGX EPC) cgroup. The problem is
"exactly what does this mean" isn't quite clear.
Given SGX EPC is just one type of MISC cgroup resources, we cannot just
disable MISC cgroup as a whole.
So, the first interpretation is that we treat the entire MISC_CG_RES_SGX
resource type as if it doesn't exist, that is, we just don't show control
files in the file system, and all EPC pages are tracked in the global list.
But it might not be straightforward to implement in the SGX driver,
i.e., we might need more MISC cgroup core code changes to support
disabling a particular resource at runtime -- I need to double check.
So if that is not worth doing, we will still need to live with the fact
that the user is able to create SGX cgroups in the hierarchy, see those
control files, and read/write them.
The second interpretation I suppose is, although the SGX cgroup is still
seen as supported in userspace, in kernel we just treat it doesn't exist.
Specifically, that means: 1) we always return the root SGX cgroup for
any EPC page when allocating a new one; 2) as a result, we still track
all EPC pages in a single global list.
But from the code below ...
> static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
> {
> if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
> @@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
> {
> int ret;
>
> + /* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */
> + if (!sgx_cg_wq)
> + return 0;
> +
..., IIUC you choose a (third) solution that is even one more step back:
It just makes try_charge() always succeed, but EPC pages are still
managed in the "per-cgroup" list.
But this solution, AFAICT, doesn't work. The reason is when you fail to
allocate an EPC page you will do the global reclaim, but now the global
list is empty.
Am I missing anything?
So my thinking is, we have two options:
1) Modify the MISC cgroup core code to allow the kernel to disable one
particular resource. It shouldn't be hard, e.g., we can add a
'disabled' flag to the 'struct misc_res'.
Hmm.. wait, after checking, the MISC cgroup won't show any control files
if the "capacity" of the resource is 0:
"
* Miscellaneous resources capacity for the entire machine. 0 capacity
* means resource is not initialized or not present in the host.
"
So I really suppose we should go with this route, i.e., by just setting
the EPC capacity to 0?
Note misc_cg_try_charge() will fail if capacity is 0, but we can make it
return success by explicitly checking whether the SGX cgroup is disabled by
using a helper, e.g., sgx_cgroup_disabled().
And you always return the root SGX cgroup in sgx_get_current_cg() when
sgx_cgroup_disabled() is true.
And in sgx_reclaim_pages_global(), you do something like:
static void sgx_reclaim_pages_global(..)
{
#ifdef CONFIG_CGROUP_MISC
if (sgx_cgroup_disabled())
sgx_reclaim_pages(&sgx_root_cg.lru);
else
sgx_cgroup_reclaim_pages(misc_cg_root());
#else
sgx_reclaim_pages(&sgx_global_list);
#endif
}
I am perhaps missing some other spots too but you got the idea.
At last, after typing those, I believe we should have a separate patch
to handle disabling the SGX cgroup at initialization time. And you can even
put this patch _somewhere_ after the patch
"x86/sgx: Implement basic EPC misc cgroup functionality"
and before this patch.
It makes sense to have such a patch anyway, because with it we can easily
add a kernel command line 'sgx_cgroup=disabled' if the user wants it
disabled (when someone has such a requirement in the future).
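To make the idea concrete, something like the below (sketch only; the
names and the reclaim-on-limit path are omitted or made up here, untested):

	/* Sketch only -- not actual patch code. */
	static bool sgx_cgroup_inited;

	static inline bool sgx_cgroup_disabled(void)
	{
		return !sgx_cgroup_inited;
	}

	void __init sgx_cgroup_init(void)
	{
		sgx_cg_wq = alloc_workqueue("sgx_cg_wq",
					    WQ_UNBOUND | WQ_FREEZABLE,
					    WQ_UNBOUND_MAX_ACTIVE);
		if (!sgx_cg_wq) {
			/* Hide "sgx_epc" from all misc.* control files. */
			misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
			return;
		}

		sgx_cgroup_inited = true;
	}

	int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
	{
		if (sgx_cgroup_disabled())
			return 0;

		return __sgx_cgroup_try_charge(sgx_cg);
	}

And sgx_get_current_cg() would similarly return the root SGX cgroup when
sgx_cgroup_disabled() is true, so all pages stay on the root's LRU.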
On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <[email protected]> wrote:
>
>
> On 16/04/2024 3:20 pm, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>> In cases EPC pages need be allocated during a page fault and the cgroup
>> usage is near its limit, an asynchronous reclamation needs be triggered
>> to avoid blocking the page fault handling.
>> Create a workqueue, corresponding work item and function definitions
>> for EPC cgroup to support the asynchronous reclamation.
>> In case the workqueue allocation is failed during init, disable cgroup.
>
> It's fine and reasonable to disable (SGX EPC) cgroup. The problem is
> "exactly what does this mean" isn't quite clear.
>
First, this is really a corner case most people don't care about: during
init, the kernel can't even allocate a workqueue object. So I don't think we
should write extra code to implement some sophisticated solution. Any
solution we come up with may just not work the way users want, or may not
solve the real issue, given that such an allocation failure happens as
early as init time.
So IMHO the current solution should be fine and I'll answer some of your
detailed questions below.
> Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> disable MISC cgroup as a whole.
>
> So, the first interpretation is we treat the entire MISC_CG_RES_SGX
> resource type doesn't exist, that is, we just don't show control files
> in the file system, and all EPC pages are tracked in the global list.
>
> But it might be not straightforward to implement in the SGX driver,
> i.e., we might need to do more MISC cgroup core code change to make it
> being able to support disable particular resource at runtime -- I need
> to double check.
>
> So if that is not something worth to do, we will still need to live with
> the fact that, the user is still able to create SGX cgroup in the
> hierarchy and see those control files, and being able to read/write them.
>
We cannot reliably predict what will happen. Most likely -ENOMEM will be
returned by sgx_cgroup_alloc() if it is reached, or some other error
earlier in the call stack if it is not, and the user fails to create
anything.
But if they do end up creating some cgroups (sgx_cgroup_alloc() and
everything else on the call stack passes without failure), everything
still kind of works for the reason answered below.
> The second interpretation I suppose is, although the SGX cgroup is still
> seen as supported in userspace, in kernel we just treat it doesn't exist.
>
> Specifically, that means: 1) we always return the root SGX cgroup for
> any EPC page when allocating a new one; 2) as a result, we still track
> all EPC pages in a single global list.
>
Current code has similar behavior without extra code.
> But from the code below ...
>
>
>> static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg)
>> {
>> if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE))
>> @@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup
>> *sgx_cg, enum sgx_reclaim reclaim)
>> {
>> int ret;
>> + /* cgroup disabled due to wq allocation failure during
>> sgx_cgroup_init(). */
>> + if (!sgx_cg_wq)
>> + return 0;
>> +
>
> ..., IIUC you choose a (third) solution that is even one more step back:
>
> It just makes try_charge() always succeed, but EPC pages are still
> managed in the "per-cgroup" list.
>
> But this solution, AFAICT, doesn't work. The reason is when you fail to
> allocate EPC page you will do the global reclaim, but now the global
> list is empty.
>
> Am I missing anything?
But when cgroups are enabled in the config, global reclamation starts from
the root and reclaims from the whole hierarchy, including any cgroups the
user may still be able to create. It's just that we don't have async/sync
per-cgroup reclaim triggered.
>
> So my thinking is, we have two options:
>
> 1) Modify the MISC cgroup core code to allow the kernel to disable one
> particular resource. It shouldn't be hard, e.g., we can add a
> 'disabled' flag to the 'struct misc_res'.
>
> Hmm.. wait, after checking, the MISC cgroup won't show any control files
> if the "capacity" of the resource is 0:
>
> "
> * Miscellaneous resources capacity for the entire machine. 0 capacity
> * means resource is not initialized or not present in the host.
> "
>
> So I really suppose we should go with this route, i.e., by just setting
> the EPC capacity to 0?
>
> Note misc_cg_try_charge() will fail if capacity is 0, but we can make it
> return success by explicitly check whether SGX cgroup is disabled by
> using a helper, e.g., sgx_cgroup_disabled().
>
> And you always return the root SGX cgroup in sgx_get_current_cg() when
> sgx_cgroup_disabled() is true.
>
> And in sgx_reclaim_pages_global(), you do something like:
>
> static void sgx_reclaim_pages_global(..)
> {
> #ifdef CONFIG_CGROUP_MISC
> if (sgx_cgroup_disabled())
> sgx_reclaim_pages(&sgx_root_cg.lru);
> else
> sgx_cgroup_reclaim_pages(misc_cg_root());
> #else
> sgx_reclaim_pages(&sgx_global_list);
> #endif
> }
>
> I am perhaps missing some other spots too but you got the idea.
>
> At last, after typing those, I believe we should have a separate patch
> to handle disable SGX cgroup at initialization time. And you can even
> put this patch _somewhere_ after the patch
>
> "x86/sgx: Implement basic EPC misc cgroup functionality"
>
> and before this patch.
>
> It makes sense to have such patch anyway, because with it we can easily
> to add a kernel command line 'sgx_cgroup=disabled" if the user wants it
> disabled (when someone has such requirement in the future).
>
I think we can add support for "sgx_cgroup=disabled" in the future if indeed
needed. But is it worth doing just for this init failure?
Thanks
Haitao
> Documentation of task_get_css() says it always
> returns a valid css. This function is used by get_current_misc_cg() to get
> the css refernce.
>
>
> /**
> * task_get_css - find and get the css for (task, subsys)
> * @task: the target task
> * @subsys_id: the target subsystem ID
> *
> * Find the css for the (@task, @subsys_id) combination, increment a
> * reference on and return it. This function is guaranteed to return a
> * valid css. The returned css may already have been offlined.
> */
> static inline struct cgroup_subsys_state *
> task_get_css(struct task_struct *task, int subsys_id)
Ah, I missed this comment.
This confirms my code reading too.
>
>
> If you look at the code of this function, you will see it does not check
> NULL either for task_css().
>
> So I think we are pretty sure here it's confirmed by this documentation
> and testing.
Yeah agreed. Thanks.
On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote:
> On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <[email protected]> wrote:
>
> >
> >
> > On 16/04/2024 3:20 pm, Haitao Huang wrote:
> > > From: Kristen Carlson Accardi <[email protected]>
> > > In cases EPC pages need be allocated during a page fault and the cgroup
> > > usage is near its limit, an asynchronous reclamation needs be triggered
> > > to avoid blocking the page fault handling.
> > > Create a workqueue, corresponding work item and function definitions
> > > for EPC cgroup to support the asynchronous reclamation.
> > > In case the workqueue allocation is failed during init, disable cgroup.
> >
> > It's fine and reasonable to disable (SGX EPC) cgroup. The problem is
> > "exactly what does this mean" isn't quite clear.
> >
> First, this is really some corner case most people don't care: during
> init, kernel can't even allocate a workqueue object. So I don't think we
> should write extra code to implement some sophisticated solution. Any
> solution we come up with may just not work as the way user want or solve
> the real issue due to the fact such allocation failure even happens at
> init time.
I think for such a boot time failure we can either directly BUG_ON(),
or try to handle it _nicely_, but not half-way. My experience is that
adding BUG_ON() should be avoided in general, but it might be acceptable
during kernel boot. I will leave it to others.
[...]
> >
> > ..., IIUC you choose a (third) solution that is even one more step back:
> >
> > It just makes try_charge() always succeed, but EPC pages are still
> > managed in the "per-cgroup" list.
> >
> > But this solution, AFAICT, doesn't work. The reason is when you fail to
> > allocate EPC page you will do the global reclaim, but now the global
> > list is empty.
> >
> > Am I missing anything?
>
> But when cgroups enabled in config, global reclamation starts from root
> and reclaim from the whole hierarchy if user may still be able to create.
> Just that we don't have async/sync per-cgroup reclaim triggered.
OK. I missed this as it is in a later patch.
>
> >
> > So my thinking is, we have two options:
> >
> > 1) Modify the MISC cgroup core code to allow the kernel to disable one
> > particular resource. It shouldn't be hard, e.g., we can add a
> > 'disabled' flag to the 'struct misc_res'.
> >
> > Hmm.. wait, after checking, the MISC cgroup won't show any control files
> > if the "capacity" of the resource is 0:
> >
> > "
> > * Miscellaneous resources capacity for the entire machine. 0 capacity
> > * means resource is not initialized or not present in the host.
> > "
> >
> > So I really suppose we should go with this route, i.e., by just setting
> > the EPC capacity to 0?
> >
> > Note misc_cg_try_charge() will fail if capacity is 0, but we can make it
> > return success by explicitly check whether SGX cgroup is disabled by
> > using a helper, e.g., sgx_cgroup_disabled().
> >
> > And you always return the root SGX cgroup in sgx_get_current_cg() when
> > sgx_cgroup_disabled() is true.
> >
> > And in sgx_reclaim_pages_global(), you do something like:
> >
> > static void sgx_reclaim_pages_global(..)
> > {
> > #ifdef CONFIG_CGROUP_MISC
> > if (sgx_cgroup_disabled())
> > sgx_reclaim_pages(&sgx_root_cg.lru);
> > else
> > sgx_cgroup_reclaim_pages(misc_cg_root());
> > #else
> > sgx_reclaim_pages(&sgx_global_list);
> > #endif
> > }
> >
> > I am perhaps missing some other spots too but you got the idea.
> >
> > At last, after typing those, I believe we should have a separate patch
> > to handle disable SGX cgroup at initialization time. And you can even
> > put this patch _somewhere_ after the patch
> >
> > "x86/sgx: Implement basic EPC misc cgroup functionality"
> >
> > and before this patch.
> >
> > It makes sense to have such patch anyway, because with it we can easily
> > to add a kernel command line 'sgx_cgroup=disabled" if the user wants it
> > disabled (when someone has such requirement in the future).
> >
>
> I think we can add support for "sgx_cgroup=disabled" in future if indeed
> needed. But just for init failure, no?
>
It's not about the commandline, which we can add in the future when
needed. It's about needing a way to handle the SGX cgroup being
disabled at boot time nicely, because we already have a case where we need
to do so.
Your approach looks half-way to me, and is not future extensible. If we
choose to do it, do it right -- that is, we need a way to disable it
completely in both kernel and userspace so that userspace won't be able to
see it.
On Thu, 18 Apr 2024 18:29:53 -0500, Huang, Kai <[email protected]> wrote:
>>>>
>>>> --- /dev/null
>>>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>>> @@ -0,0 +1,72 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _SGX_EPC_CGROUP_H_
>>>> +#define _SGX_EPC_CGROUP_H_
>>>> +
>>>> +#include <asm/sgx.h>
>>>
>>> I don't see why you need <asm/sgx.h> here. Also, ...
>>>
>>>> +#include <linux/cgroup.h>
>>>> +#include <linux/misc_cgroup.h>
>>>> +
>>>> +#include "sgx.h"
>>>
>>> ... "sgx.h" already includes <asm/sgx.h>
>>>
>>> [...]
>>>
>> right
>>
>>>>
>>>> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
>>>> +{
>>>> + /* get_current_misc_cg() never returns NULL when Kconfig enabled
>>>> */
>>>> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
>>>> +}
>>>
>>> I spent some time looking into this. And yes if I was reading code
>>> correctly the get_current_misc_cg() should never return NULL when
>>> Kconfig
>>> is on.
>>>
>>> I typed my analysis below in [*]. And it would be helpful if any
>>> cgroup
>>> expert can have a second eye on this.
>>>
>>> [...]
>>>
>> Thanks for checking this and I did similar and agree with the
>> conclusion. I think this is confirmed also by Michal's description
>> AFAICT:
>> "
>> The current implementation creates root css object (see cgroup_init(),
>> cgroup_ssid_enabled() check is after cgroup_init_subsys()).
>> I.e. it will look like all tasks are members of root cgroup wrt given
>> controller permanently and controller attribute files won't exist."
>
> After looking I believe we can even disable MISC cgroup at runtime for a
> particular cgroup (haven't actually verified on real machine, though):
>
> # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control
>
My test confirms this does not cause a NULL cgroup for the tasks.
It actually works the same way as disabling via the command line, except
that this only disables misc in the subtree and does not show any misc.*
files or allow creating such files in the subtree.
> And if you look at the MISC cgroup core code, many functions actually
> handle a NULL css, e.g., misc_cg_try_charge():
>
> int misc_cg_try_charge(enum misc_res_type type,
> struct misc_cg *cg, u64 amount)
> {
> ...
>
> if (!(valid_type(type) && cg &&
> READ_ONCE(misc_res_capacity[type])))
> return -EINVAL;
>
> ...
> }
>
> That's why I am still a little bit worried about this. And it's better
> to have cgroup expert(s) to confirm here.
>
I think it's just being defensive, as this function is a public API called by
other parts of the kernel. The documentation of task_get_css() says it always
returns a valid css. This function is used by get_current_misc_cg() to get
the css reference.
/**
* task_get_css - find and get the css for (task, subsys)
* @task: the target task
* @subsys_id: the target subsystem ID
*
* Find the css for the (@task, @subsys_id) combination, increment a
* reference on and return it. This function is guaranteed to return a
* valid css. The returned css may already have been offlined.
*/
static inline struct cgroup_subsys_state *
task_get_css(struct task_struct *task, int subsys_id)
If you look at the code of this function, you will see it does not check
for NULL from task_css() either.
So I think we are pretty sure here; it's confirmed by this documentation
and testing.
Thanks
Haitao
On Fri, 19 Apr 2024 17:44:59 -0500, Huang, Kai <[email protected]> wrote:
> On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote:
>> On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai <[email protected]>
>> wrote:
>>
>> >
>> >
>> > On 16/04/2024 3:20 pm, Haitao Huang wrote:
>> > > From: Kristen Carlson Accardi <[email protected]>
>> > > In cases EPC pages need be allocated during a page fault and the
>> cgroup
>> > > usage is near its limit, an asynchronous reclamation needs be
>> triggered
>> > > to avoid blocking the page fault handling.
>> > > Create a workqueue, corresponding work item and function
>> definitions
>> > > for EPC cgroup to support the asynchronous reclamation.
>> > > In case the workqueue allocation is failed during init, disable
>> cgroup.
>> >
>> > It's fine and reasonable to disable (SGX EPC) cgroup. The problem is
>> > "exactly what does this mean" isn't quite clear.
>> >
>> First, this is really some corner case most people don't care: during
>> init, kernel can't even allocate a workqueue object. So I don't think we
>> should write extra code to implement some sophisticated solution. Any
>> solution we come up with may just not work as the way user want or solve
>> the real issue due to the fact such allocation failure even happens at
>> init time.
>
> I think for such boot time failure we can either choose directly
> BUG_ON(),
> or we try to handle it _nicely_, but not half-way. My experience is
> adding BUG_ON() should be avoided in general, but it might be acceptable
> during kernel boot. I will leave it to others.
>
>
> [...]
>
>> >
>> > ..., IIUC you choose a (third) solution that is even one more step
>> back:
>> >
>> > It just makes try_charge() always succeed, but EPC pages are still
>> > managed in the "per-cgroup" list.
>> >
>> > But this solution, AFAICT, doesn't work. The reason is when you fail
>> to
>> > allocate EPC page you will do the global reclaim, but now the global
>> > list is empty.
>> >
>> > Am I missing anything?
>>
>> But when cgroups enabled in config, global reclamation starts from root
>> and reclaim from the whole hierarchy if user may still be able to
>> create.
>> Just that we don't have async/sync per-cgroup reclaim triggered.
>
> OK. I missed this as it is in a later patch.
>
>>
>> >
>> > So my thinking is, we have two options:
>> >
>> > 1) Modify the MISC cgroup core code to allow the kernel to disable one
>> > particular resource. It shouldn't be hard, e.g., we can add a
>> > 'disabled' flag to the 'struct misc_res'.
>> >
>> > Hmm.. wait, after checking, the MISC cgroup won't show any control
>> files
>> > if the "capacity" of the resource is 0:
>> >
>> > "
>> > * Miscellaneous resources capacity for the entire machine. 0
>> capacity
>> > * means resource is not initialized or not present in the host.
>> > "
>> >
>> > So I really suppose we should go with this route, i.e., by just
>> setting
>> > the EPC capacity to 0?
>> >
>> > Note misc_cg_try_charge() will fail if capacity is 0, but we can make
>> it
>> > return success by explicitly check whether SGX cgroup is disabled by
>> > using a helper, e.g., sgx_cgroup_disabled().
>> >
>> > And you always return the root SGX cgroup in sgx_get_current_cg() when
>> > sgx_cgroup_disabled() is true.
>> >
>> > And in sgx_reclaim_pages_global(), you do something like:
>> >
>> > static void sgx_reclaim_pages_global(..)
>> > {
>> > #ifdef CONFIG_CGROUP_MISC
>> > if (sgx_cgroup_disabled())
>> > sgx_reclaim_pages(&sgx_root_cg.lru);
>> > else
>> > sgx_cgroup_reclaim_pages(misc_cg_root());
>> > #else
>> > sgx_reclaim_pages(&sgx_global_list);
>> > #endif
>> > }
>> >
>> > I am perhaps missing some other spots too but you got the idea.
>> >
>> > At last, after typing those, I believe we should have a separate patch
>> > to handle disable SGX cgroup at initialization time. And you can even
>> > put this patch _somewhere_ after the patch
>> >
>> > "x86/sgx: Implement basic EPC misc cgroup functionality"
>> >
>> > and before this patch.
>> >
>> > It makes sense to have such patch anyway, because with it we can
>> easily
>> > to add a kernel command line 'sgx_cgroup=disabled" if the user wants
>> it
>> > disabled (when someone has such requirement in the future).
>> >
>>
>> I think we can add support for "sgx_cgroup=disabled" in future if indeed
>> needed. But just for init failure, no?
>>
>
> It's not about the commandline, which we can add in the future when
> needed. It's about we need to have a way to handle SGX cgroup being
> disabled at boot time nicely, because we already have a case where we
> need
> to do so.
>
> Your approach looks half-way to me, and is not future extendible. If we
> choose to do it, do it right -- that is, we need a way to disable it
> completely in both kernel and userspace so that userspace won't be able
> to
> see it.
That would need more changes in the misc cgroup implementation to support
disabling sgx. Right now misc does not have separate files for different
resource types. So we can only block echoing "sgx_epc..." to those interface
files; we can't really make the files invisible.
Anyway, I see this is really a configuration failure case. And we handle
it without added complexity and do as much as we can gracefully until
another fatal error happens. So if we need to support disabling sgx through
the command line in the future, then we can also make this failure handling
even more graceful at that time. Nothing really is lost IMHO.
Originally we put in BUG_ON() but then we changed to handle it based on
your feedback. I can put BUG_ON() back if we agree that's more appropriate
at the moment.
@Dave, @Jarkko, @Michal, your thoughts?
Thanks
Haitao
On Mon, Apr 15, 2024 at 08:20:10PM -0700, Haitao Huang wrote:
> From: Sean Christopherson <[email protected]>
>
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
>
The doc LGTM, thanks!
Reviewed-by: Bagas Sanjaya <[email protected]>
--
An old man doll... just what I always wanted! - Clara
On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > I think we can add support for "sgx_cgroup=disabled" in future if indeed
> > > needed. But just for init failure, no?
> > >
> >
> > It's not about the commandline, which we can add in the future when
> > needed. It's about we need to have a way to handle SGX cgroup being
> > disabled at boot time nicely, because we already have a case where we
> > need
> > to do so.
> >
> > Your approach looks half-way to me, and is not future extendible. If we
> > choose to do it, do it right -- that is, we need a way to disable it
> > completely in both kernel and userspace so that userspace won't be able
> > to
> > see it.
>
> That would need more changes in misc cgroup implementation to support
> sgx-disable. Right now misc does not have separate files for different
> resource types. So we can only block echo "sgx_epc..." to those interface
> files, can't really make files not visible.
"won't be able to see" I mean "only for SGX EPC resource", but not the
control files for the entire MISC cgroup.
I replied at the beginning of the previous reply:
"
Given SGX EPC is just one type of MISC cgroup resources, we cannot just
disable MISC cgroup as a whole.
"
You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. See
the comment of @misc_res_capacity:
* Miscellaneous resources capacity for the entire machine. 0 capacity
* means resource is not initialized or not present in the host.
And "blocking echo sgx_epc ... to those control files" is already
sufficient for the purpose of not exposing SGX EPC to userspace, correct?
E.g., if SGX cgroup is enabled, you can see below when you read "max":
# cat /sys/fs/cgroup/my_group/misc.max
# <resource1> <max1>
sgx_epc ...
...
Otherwise you won't be able to see "sgx_epc":
# cat /sys/fs/cgroup/my_group/misc.max
# <resource1> <max1>
...
And when you try to write the "max" for "sgx_epc", you will hit error:
# echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
# ... echo: write error: Invalid argument
The above applies to all the control files. To me this is pretty much
means "SGX EPC is disabled" or "not supported" for userspace.
Am I missing anything?
On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <[email protected]> wrote:
> On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > I think we can add support for "sgx_cgroup=disabled" in future if
>> indeed
>> > > needed. But just for init failure, no?
>> > >
>> >
>> > It's not about the commandline, which we can add in the future when
>> > needed. It's about we need to have a way to handle SGX cgroup being
>> > disabled at boot time nicely, because we already have a case where we
>> > need
>> > to do so.
>> >
>> > Your approach looks half-way to me, and is not future extendible. If
>> we
>> > choose to do it, do it right -- that is, we need a way to disable it
>> > completely in both kernel and userspace so that userspace won't be
>> able> to
>> > see it.
>>
>> That would need more changes in misc cgroup implementation to support
>> sgx-disable. Right now misc does not have separate files for different
>> resource types. So we can only block echo "sgx_epc..." to those
>> interfacefiles, can't really make files not visible.
>
> "won't be able to see" I mean "only for SGX EPC resource", but not the
> control files for the entire MISC cgroup.
>
> I replied at the beginning of the previous reply:
>
> "
> Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> disable MISC cgroup as a whole.
> "
>
Sorry I missed this point. below.
> You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. See
> the comment of @misc_res_capacity:
>
> * Miscellaneous resources capacity for the entire machine. 0 capacity
> * means resource is not initialized or not present in the host.
>
IIUC I don't think the situation we have is either of those cases. In our
case, the resource is initialized and present on the host, but we have an
allocation error for the SGX cgroup infrastructure.
> And "blocking echo sgx_epc ... to those control files" is already
> sufficient for the purpose of not exposing SGX EPC to userspace, correct?
>
> E.g., if SGX cgroup is enabled, you can see below when you read "max":
>
> # cat /sys/fs/cgroup/my_group/misc.max
> # <resource1> <max1>
> sgx_epc ...
> ...
>
> Otherwise you won't be able to see "sgx_epc":
>
> # cat /sys/fs/cgroup/my_group/misc.max
> # <resource1> <max1>
> ...
>
> And when you try to write the "max" for "sgx_epc", you will hit error:
>
> # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> # ... echo: write error: Invalid argument
>
> The above applies to all the control files. To me this is pretty much
> means "SGX EPC is disabled" or "not supported" for userspace.
>
You are right, capacity == 0 does block echoing max, and users see an error
if they do that. But 1) I doubt you literally want "SGX EPC is disabled"
and to make it unsupported in this case, and 2) even if we accept this as
"sgx cgroup disabled", I don't see how it is a much better user experience
than the current solution or how it really helps users.
Also, to implement this approach, as you mentioned, we need to work around
the fact that misc_cg_try_charge() fails when capacity is set to zero, and
add code to always return the root cgroup. So it seems like more workaround
code just to make it work for a failing case no one really cares much
about, and the end result is not really much better IMHO.
Thanks
Haitao
On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <[email protected]> wrote:
>
> > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > I think we can add support for "sgx_cgroup=disabled" in future if
> > > indeed
> > > > > needed. But just for init failure, no?
> > > > >
> > > >
> > > > It's not about the commandline, which we can add in the future when
> > > > needed. It's about we need to have a way to handle SGX cgroup being
> > > > disabled at boot time nicely, because we already have a case where we
> > > > need
> > > > to do so.
> > > >
> > > > Your approach looks half-way to me, and is not future extendible. If
> > > we
> > > > choose to do it, do it right -- that is, we need a way to disable it
> > > > completely in both kernel and userspace so that userspace won't be
> > > able> to
> > > > see it.
> > >
> > > That would need more changes in misc cgroup implementation to support
> > > sgx-disable. Right now misc does not have separate files for different
> > > resource types. So we can only block echo "sgx_epc..." to those
> > > interfacefiles, can't really make files not visible.
> >
> > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > control files for the entire MISC cgroup.
> >
> > I replied at the beginning of the previous reply:
> >
> > "
> > Given SGX EPC is just one type of MISC cgroup resources, we cannot just
> > disable MISC cgroup as a whole.
> > "
> >
> Sorry I missed this point. below.
>
> > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC. See
> > the comment of @misc_res_capacity:
> >
> > * Miscellaneous resources capacity for the entire machine. 0 capacity
> > * means resource is not initialized or not present in the host.
> >
>
> IIUC I don't think the situation we have is either of those cases. For our
> case, resource is inited and present on the host but we have allocation
> error for sgx cgroup infra.
You have calculated the "capacity", but later you failed something and
then reset the "capacity" to 0, i.e., cleanup. What's wrong with that?
>
> > And "blocking echo sgx_epc ... to those control files" is already
> > sufficient for the purpose of not exposing SGX EPC to userspace, correct?
> >
> > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> >
> > # cat /sys/fs/cgroup/my_group/misc.max
> > # <resource1> <max1>
> > sgx_epc ...
> > ...
> >
> > Otherwise you won't be able to see "sgx_epc":
> >
> > # cat /sys/fs/cgroup/my_group/misc.max
> > # <resource1> <max1>
> > ...
> >
> > And when you try to write the "max" for "sgx_epc", you will hit error:
> >
> > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> > # ... echo: write error: Invalid argument
> >
> > The above applies to all the control files. To me this is pretty much
> > means "SGX EPC is disabled" or "not supported" for userspace.
> >
> You are right, capacity == 0 does block echoing max and users see an error
> if they do that. But 1) doubt you literately wanted "SGX EPC is disabled"
> and make it unsupported in this case,
>
I don't understand. Something failed during SGX cgroup initialization,
you _literally_ cannot continue to support it.
> 2) even if we accept this is "sgx
> cgroup disabled" I don't see how it is much better user experience than
> current solution or really helps user better.
In your way, userspace is still able to see "sgx_epc" in the control files
and is able to update them. So from userspace's perspective the SGX cgroup
is enabled, but obviously updating "max" doesn't have any impact. This
will confuse userspace.
>
> Also to implement this approach, as you mentioned, we need workaround the
> fact that misc_try_charge() fails when capacity set to zero, and adding
> code to return root always?
>
Why this is a problem?
> So it seems like more workaround code to just
> make it work for a failing case no one really care much and end result is
> not really much better IMHO.
It's not workaround, it's the right thing to do.
The result is userspace will see it being disabled when kernel disables
it.
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson <[email protected]>
>
> Add initial documentation of how to regulate the distribution of
> SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
> controller.
>
>
Acked-by: Kai Huang <[email protected]>
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> Enclave Page Cache(EPC) memory can be swapped out to regular system
> memory, and the consumed memory should be charged to a proper
> mem_cgroup. Currently the selection of mem_cgroup to charge is done in
> sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
> ksgxd thread are user processes. With the new EPC cgroup implementation,
> the swapping can also happen in EPC cgroup work-queue threads. In those
> cases, it improperly selects the root mem_cgroup to charge for the RAM
> usage.
>
> Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
> an additional argument to explicitly specify the mm struct to charge for
> allocations. Callers from background kthreads not associated with a
> charging mm struct would set it to NULL, while callers in user process
> contexts set it to current->mm.
>
> Internally, it handles the case when the charging mm given is NULL, by
> searching for an mm struct from enclave's mm_list.
>
> Signed-off-by: Haitao Huang <[email protected]>
> Reported-by: Mikko Ylinen <[email protected]>
> Tested-by: Mikko Ylinen <[email protected]>
> Tested-by: Jarkko Sakkinen <[email protected]>
>
Reviewed-by: Kai Huang <[email protected]>
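FWIW, my mental model of the resulting helper, based on the changelog above,
is roughly the sketch below (illustrative only, not necessarily the exact
patch code):

	static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
							  struct mm_struct *mm)
	{
		struct mem_cgroup *memcg = NULL;
		struct sgx_encl_mm *encl_mm;
		int idx;

		/* A user-process context passes its own mm to charge. */
		if (mm)
			return get_mem_cgroup_from_mm(mm);

		/*
		 * Background kthreads (ksgxd, EPC cgroup workqueue) pass NULL:
		 * pick an mm from the enclave's mm_list to charge instead.
		 */
		idx = srcu_read_lock(&encl->srcu);
		list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
			if (!mmget_not_zero(encl_mm->mm))
				continue;

			memcg = get_mem_cgroup_from_mm(encl_mm->mm);
			mmput_async(encl_mm->mm);
			break;
		}
		srcu_read_unlock(&encl->srcu, idx);

		/* Fall back to the current (likely root) memcg if none was found. */
		return memcg ?: get_mem_cgroup_from_current();
	}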
On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <[email protected]> wrote:
> On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
>> On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <[email protected]>
>> wrote:
>>
>> > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > > > I think we can add support for "sgx_cgroup=disabled" in future
>> if
>> > > indeed
>> > > > > needed. But just for init failure, no?
>> > > > >
>> > > >
>> > > > It's not about the commandline, which we can add in the future
>> when
>> > > > needed. It's about we need to have a way to handle SGX cgroup
>> being
>> > > > disabled at boot time nicely, because we already have a case
>> where we
>> > > > need
>> > > > to do so.
>> > > >
>> > > > Your approach looks half-way to me, and is not future
>> extendible. If
>> > > we
>> > > > choose to do it, do it right -- that is, we need a way to disable
>> it
>> > > > completely in both kernel and userspace so that userspace won't be
>> > > able> to
>> > > > see it.
>> > >
>> > > That would need more changes in misc cgroup implementation to
>> support
>> > > sgx-disable. Right now misc does not have separate files for
>> different
>> > > resource types. So we can only block echo "sgx_epc..." to those
>> > > interfacefiles, can't really make files not visible.
>> >
>> > "won't be able to see" I mean "only for SGX EPC resource", but not the
>> > control files for the entire MISC cgroup.
>> >
>> > I replied at the beginning of the previous reply:
>> >
>> > "
>> > Given SGX EPC is just one type of MISC cgroup resources, we cannot
>> just
>> > disable MISC cgroup as a whole.
>> > "
>> >
>> Sorry I missed this point. below.
>>
>> > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.
>> See
>> > the comment of @misc_res_capacity:
>> >
>> > * Miscellaneous resources capacity for the entire machine. 0 capacity
>> > * means resource is not initialized or not present in the host.
>> >
>>
>> IIUC I don't think the situation we have is either of those cases. For
>> our
>> case, resource is inited and present on the host but we have allocation
>> error for sgx cgroup infra.
>
> You have calculated the "capacity", but later you failed something and
> then reset the "capacity" to 0, i.e., cleanup. What's wrong with that?
>
>>
>> > And "blocking echo sgx_epc ... to those control files" is already
>> > sufficient for the purpose of not exposing SGX EPC to userspace,
>> correct?
>> >
>> > E.g., if SGX cgroup is enabled, you can see below when you read "max":
>> >
>> > # cat /sys/fs/cgroup/my_group/misc.max
>> > # <resource1> <max1>
>> > sgx_epc ...
>> > ...
>> >
>> > Otherwise you won't be able to see "sgx_epc":
>> >
>> > # cat /sys/fs/cgroup/my_group/misc.max
>> > # <resource1> <max1>
>> > ...
>> >
>> > And when you try to write the "max" for "sgx_epc", you will hit error:
>> >
>> > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
>> > # ... echo: write error: Invalid argument
>> >
>> > The above applies to all the control files. To me this is pretty much
>> > means "SGX EPC is disabled" or "not supported" for userspace.
>> >
>> You are right, capacity == 0 does block echoing max and users see an
>> error
>> if they do that. But 1) doubt you literately wanted "SGX EPC is
>> disabled"
>> and make it unsupported in this case,
>
> I don't understand. Something failed during SGX cgroup initialization,
> you _literally_ cannot continue to support it.
>
>
Then we should just return -ENOMEM from sgx_init() when sgx cgroup
initialization fails?
I thought we only disable SGX cgroup support. SGX can still run.
>> 2) even if we accept this is "sgx
>> cgroup disabled" I don't see how it is much better user experience than
>> current solution or really helps user better.
>
> In your way, the userspace is still able to see "sgx_epc" in control
> files
> and is able to update them. So from userspace's perspective SGX cgroup
> is
> enabled, but obviously updating to "max" doesn't have any impact. This
> will confuse userspace.
>
>>
Setting the capacity to zero also confuses user space. Some applications may
rely on this file to know the capacity.
>> Also to implement this approach, as you mentioned, we need workaround
>> the
>> fact that misc_try_charge() fails when capacity set to zero, and adding
>> code to return root always?
>
> Why this is a problem?
>
It changes/overrides the original meaning of capacity==0: no one can
allocate if capacity is zero.
>> So it seems like more workaround code to just
>> make it work for a failing case no one really care much and end result
>> is
>> not really much better IMHO.
>
> It's not workaround, it's the right thing to do.
>
> The result is userspace will see it being disabled when kernel disables
> it.
>
>
It's a workaround because you use capacity==0, but that does not really
mean disabling the misc cgroup for a specific resource IIUC.
There is an explicit way for the user to disable misc without setting the
capacity to zero. So in the future, if we really want to disable the sgx_epc
cgroup specifically, we should not use capacity. Therefore your approach is
not extensible/reusable.
Given this is a rare corner case caused by configuration, we can only do
as much as possible IMHO, not try to implement a perfect solution at
the moment. Maybe BUG_ON() is more appropriate?
(If you want to disable sgx, we can return an error from sgx_init(), but I
still doubt that's what you meant.)
Thanks
Haitao
On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote:
> On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <[email protected]> wrote:
>
> > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
> > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <[email protected]>
> > > wrote:
> > >
> > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
> > > > > > > I think we can add support for "sgx_cgroup=disabled" in future
> > > if
> > > > > indeed
> > > > > > > needed. But just for init failure, no?
> > > > > > >
> > > > > >
> > > > > > It's not about the commandline, which we can add in the future
> > > when
> > > > > > needed. It's about we need to have a way to handle SGX cgroup
> > > being
> > > > > > disabled at boot time nicely, because we already have a case
> > > where we
> > > > > > need
> > > > > > to do so.
> > > > > >
> > > > > > Your approach looks half-way to me, and is not future
> > > extendible. If
> > > > > we
> > > > > > choose to do it, do it right -- that is, we need a way to disable
> > > it
> > > > > > completely in both kernel and userspace so that userspace won't be
> > > > > able> to
> > > > > > see it.
> > > > >
> > > > > That would need more changes in misc cgroup implementation to
> > > support
> > > > > sgx-disable. Right now misc does not have separate files for
> > > different
> > > > > resource types. So we can only block echo "sgx_epc..." to those
> > > > > interfacefiles, can't really make files not visible.
> > > >
> > > > "won't be able to see" I mean "only for SGX EPC resource", but not the
> > > > control files for the entire MISC cgroup.
> > > >
> > > > I replied at the beginning of the previous reply:
> > > >
> > > > "
> > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot
> > > just
> > > > disable MISC cgroup as a whole.
> > > > "
> > > >
> > > Sorry I missed this point. below.
> > >
> > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX EPC.
> > > See
> > > > the comment of @misc_res_capacity:
> > > >
> > > > * Miscellaneous resources capacity for the entire machine. 0 capacity
> > > > * means resource is not initialized or not present in the host.
> > > >
> > >
> > > IIUC I don't think the situation we have is either of those cases. For
> > > our
> > > case, resource is inited and present on the host but we have allocation
> > > error for sgx cgroup infra.
> >
> > You have calculated the "capacity", but later you failed something and
> > then reset the "capacity" to 0, i.e., cleanup. What's wrong with that?
> >
> > >
> > > > And "blocking echo sgx_epc ... to those control files" is already
> > > > sufficient for the purpose of not exposing SGX EPC to userspace,
> > > correct?
> > > >
> > > > E.g., if SGX cgroup is enabled, you can see below when you read "max":
> > > >
> > > > # cat /sys/fs/cgroup/my_group/misc.max
> > > > # <resource1> <max1>
> > > > sgx_epc ...
> > > > ...
> > > >
> > > > Otherwise you won't be able to see "sgx_epc":
> > > >
> > > > # cat /sys/fs/cgroup/my_group/misc.max
> > > > # <resource1> <max1>
> > > > ...
> > > >
> > > > And when you try to write the "max" for "sgx_epc", you will hit error:
> > > >
> > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
> > > > # ... echo: write error: Invalid argument
> > > >
> > > > The above applies to all the control files. To me this is pretty much
> > > > means "SGX EPC is disabled" or "not supported" for userspace.
> > > >
> > > You are right, capacity == 0 does block echoing max and users see an
> > > error
> > > if they do that. But 1) doubt you literately wanted "SGX EPC is
> > > disabled"
> > > and make it unsupported in this case,
> >
> > I don't understand. Something failed during SGX cgroup initialization,
> > you _literally_ cannot continue to support it.
> >
> >
>
> Then we should just return -ENOMEM from sgx_init() when sgx cgroup
> initialization fails?
> I thought we only disable SGX cgroup support. SGX can still run.
I am not sure how you got this conclusion. I specifically said something
failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to
be disabled, not SGX as a whole.
>
> > > 2) even if we accept this is "sgx
> > > cgroup disabled" I don't see how it is much better user experience than
> > > current solution or really helps user better.
> >
> > In your way, the userspace is still able to see "sgx_epc" in control
> > files
> > and is able to update them. So from userspace's perspective SGX cgroup
> > is
> > enabled, but obviously updating to "max" doesn't have any impact. This
> > will confuse userspace.
> >
> > >
>
> Setting capacity to zero also confuses user space. Some application may
> rely on this file to know the capacity.
Why??
Are you saying before this SGX cgroup patchset those applications cannot
run?
>
> > > Also to implement this approach, as you mentioned, we need workaround
> > > the
> > > fact that misc_try_charge() fails when capacity set to zero, and adding
> > > code to return root always?
> >
> > Why this is a problem?
> >
>
> It changes/overrides the the original meaning of capacity==0: no one can
> allocate if capacity is zero.
Why??
Are you saying before this series, no one can allocate EPC page?
>
> > > So it seems like more workaround code to just
> > > make it work for a failing case no one really care much and end result
> > > is
> > > not really much better IMHO.
> >
> > It's not workaround, it's the right thing to do.
> >
> > The result is userspace will see it being disabled when kernel disables
> > it.
> >
> >
> It's a workaround because you use the capacity==0 but it does not really
> mean to disable the misc cgroup for specific resource IIUC.
Please read the comment around @misc_res_capacity again:
* Miscellaneous resources capacity for the entire machine. 0 capacity
* means resource is not initialized or not present in the host.
>
> There is explicit way for user to disable misc without setting capacity to
> zero.
>
Which way are you talking about?
> So in future if we want really disable sgx_epc cgroup specifically
> we should not use capacity. Therefore your approach is not
> extensible/reusable.
>
> Given this is a rare corner case caused by configuration, we can only do
> as much as possible IMHO, not trying to implement a perfect solution at
> the moment. Maybe BUG_ON() is more appropriate?
>
I think I will reply to this thread for the last time:
I don't have a strong opinion against using BUG_ON() when you fail to
allocate the workqueue. If you choose to do this, I'll leave it to others.
If you want to "disable SGX cgroup" when you fail to allocate the
workqueue, reset the "capacity" to 0 to disable it.
On Tue, 23 Apr 2024 09:19:53 -0500, Huang, Kai <[email protected]> wrote:
> On Tue, 2024-04-23 at 08:08 -0500, Haitao Huang wrote:
>> On Mon, 22 Apr 2024 17:16:34 -0500, Huang, Kai <[email protected]>
>> wrote:
>>
>> > On Mon, 2024-04-22 at 11:17 -0500, Haitao Huang wrote:
>> > > On Sun, 21 Apr 2024 19:22:27 -0500, Huang, Kai <[email protected]>
>> > > wrote:
>> > >
>> > > > On Fri, 2024-04-19 at 20:14 -0500, Haitao Huang wrote:
>> > > > > > > I think we can add support for "sgx_cgroup=disabled" in
>> future
>> > > if
>> > > > > indeed
>> > > > > > > needed. But just for init failure, no?
>> > > > > > >
>> > > > > >
>> > > > > > It's not about the commandline, which we can add in the future
>> > > when
>> > > > > > needed. It's about we need to have a way to handle SGX cgroup
>> > > being
>> > > > > > disabled at boot time nicely, because we already have a case
>> > > where we
>> > > > > > need
>> > > > > > to do so.
>> > > > > >
>> > > > > > Your approach looks half-way to me, and is not future
>> > > extendible. If
>> > > > > we
>> > > > > > choose to do it, do it right -- that is, we need a way to
>> disable
>> > > it
>> > > > > > completely in both kernel and userspace so that userspace
>> won't be
>> > > > > able to
>> > > > > > see it.
>> > > > >
>> > > > > That would need more changes in misc cgroup implementation to
>> > > support
>> > > > > sgx-disable. Right now misc does not have separate files for
>> > > different
>> > > > > resource types. So we can only block echo "sgx_epc..." to those
>> > > > > interface files, can't really make files not visible.
>> > > >
>> > > > "won't be able to see" I mean "only for SGX EPC resource", but
>> not the
>> > > > control files for the entire MISC cgroup.
>> > > >
>> > > > I replied at the beginning of the previous reply:
>> > > >
>> > > > "
>> > > > Given SGX EPC is just one type of MISC cgroup resources, we cannot
>> > > just
>> > > > disable MISC cgroup as a whole.
>> > > > "
>> > > >
>> > > Sorry I missed this point. below.
>> > >
>> > > > You just need to set the SGX EPC "capacity" to 0 to disable SGX
>> EPC.
>> > > See
>> > > > the comment of @misc_res_capacity:
>> > > >
>> > > > * Miscellaneous resources capacity for the entire machine. 0
>> capacity
>> > > > * means resource is not initialized or not present in the host.
>> > > >
>> > >
>> > > IIUC I don't think the situation we have is either of those cases.
>> For
>> > > our
>> > > case, resource is inited and present on the host but we have
>> allocation
>> > > error for sgx cgroup infra.
>> >
>> > You have calculated the "capacity", but later you failed something and
>> > then reset the "capacity" to 0, i.e., cleanup. What's wrong with
>> that?
>> >
>> > >
>> > > > And "blocking echo sgx_epc ... to those control files" is already
>> > > > sufficient for the purpose of not exposing SGX EPC to userspace,
>> > > correct?
>> > > >
>> > > > E.g., if SGX cgroup is enabled, you can see below when you read
>> "max":
>> > > >
>> > > > # cat /sys/fs/cgroup/my_group/misc.max
>> > > > # <resource1> <max1>
>> > > > sgx_epc ...
>> > > > ...
>> > > >
>> > > > Otherwise you won't be able to see "sgx_epc":
>> > > >
>> > > > # cat /sys/fs/cgroup/my_group/misc.max
>> > > > # <resource1> <max1>
>> > > > ...
>> > > >
>> > > > And when you try to write the "max" for "sgx_epc", you will hit
>> error:
>> > > >
>> > > > # echo "sgx_epc 100" > /sys/fs/cgroup/my_group/misc.max
>> > > > # ... echo: write error: Invalid argument
>> > > >
>> > > > The above applies to all the control files. To me this is pretty
>> much
>> > > > means "SGX EPC is disabled" or "not supported" for userspace.
>> > > >
>> > > You are right, capacity == 0 does block echoing max and users see an
>> > > error
>> > > if they do that. But 1) I doubt you literally wanted "SGX EPC is
>> > > disabled"
>> > > and make it unsupported in this case,
>> >
>> > I don't understand. Something failed during SGX cgroup
>> initialization,
>> > you _literally_ cannot continue to support it.
>> >
>> >
>>
>> Then we should just return -ENOMEM from sgx_init() when sgx cgroup
>> initialization fails?
>> I thought we only disable SGX cgroup support. SGX can still run.
>
> I am not sure how you got this conclusion. I specifically said something
> failed during SGX "cgroup" initialization, so only SGX "cgroup" needs to
> be disabled, not SGX as a whole.
>
>>
>> > > 2) even if we accept this is "sgx
>> > > cgroup disabled" I don't see how it is much better user experience
>> than
>> > > current solution or really helps user better.
>> >
>> > In your way, the userspace is still able to see "sgx_epc" in control
>> > files
>> > and is able to update them. So from userspace's perspective SGX
>> cgroup
>> > is
>> > enabled, but obviously updating to "max" doesn't have any impact.
>> This
>> > will confuse userspace.
>> >
>> > >
>>
>> Setting capacity to zero also confuses user space. Some application may
>> rely on this file to know the capacity.
>
>
> Why??
>
> Are you saying before this SGX cgroup patchset those applications cannot
> run?
>
>>
>> > > Also to implement this approach, as you mentioned, we need
>> workaround
>> > > the
>> > > fact that misc_try_charge() fails when capacity set to zero, and
>> adding
>> > > code to return root always?
>> >
>> > Why this is a problem?
>> >
>>
>> It changes/overrides the original meaning of capacity==0: no one can
>> allocate if capacity is zero.
>
> Why??
>
> Are you saying before this series, no one can allocate EPC page?
>
>>
>> > > So it seems like more workaround code to just
>> > > make it work for a failing case no one really care much and end
>> result
>> > > is
>> > > not really much better IMHO.
>> >
>> > It's not workaround, it's the right thing to do.
>> >
>> > The result is userspace will see it being disabled when kernel
>> disables
>> > it.
>> >
>> >
>> It's a workaround because you use the capacity==0 but it does not really
>> mean to disable the misc cgroup for specific resource IIUC.
>
> Please read the comment around @misc_res_capacity again:
>
> * Miscellaneous resources capacity for the entire machine. 0 capacity
> * means resource is not initialized or not present in the host.
>
I mentioned this in an earlier email. I think this means there is no SGX
EPC. It does not mean the SGX EPC cgroup is not enabled. That's also
consistent with the behavior that try_charge() fails if capacity is zero.
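For reference, the check at the top of misc_cg_try_charge() in
kernel/cgroup/misc.c is roughly the following, if I read the current code
correctly:

	if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type])))
		return -EINVAL;

i.e., a zero capacity makes every charge attempt fail with -EINVAL, which
reads as "no resource present" rather than "cgroup support disabled".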
>>
>> There is explicit way for user to disable misc without setting capacity
>> to
>> zero.
>
> Which way are you talking about?
Echo "-misc" to cgroup.subtree_control at root level for example still
shows non-zero sgx_epc capacity.
>
>> So in future if we want really disable sgx_epc cgroup specifically
>> we should not use capacity. Therefore your approach is not
>> extensible/reusable.
>>
>> Given this is a rare corner case caused by configuration, we can only do
>> as much as possible IMHO, not trying to implement a perfect solution at
>> the moment. Maybe BUG_ON() is more appropriate?
>>
>
> I think I will reply to this thread for the last time:
>
> I don't have a strong opinion against using BUG_ON() when you fail to
> allocate the workqueue. If you choose to do this, I'll leave it to others.
>
> If you want to "disable SGX cgroup" when you fail to allocate the
> workqueue, reset the "capacity" to 0 to disable it.
Unless I hear otherwise, I'll revert to BUG_ON().
Thanks
Haitao
On Wed, 17 Apr 2024 18:51:28 -0500, Huang, Kai <[email protected]> wrote:
>
>
> On 16/04/2024 3:20 pm, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>> Currently in the EPC page allocation, the kernel simply fails the
>> allocation when the current EPC cgroup fails to charge due to its usage
>> reaching limit. This is not ideal. When that happens, a better way is
>> to reclaim EPC page(s) from the current EPC cgroup (and/or its
>> descendants) to reduce its usage so the new allocation can succeed.
>> Add the basic building blocks to support per-cgroup reclamation.
>> Currently the kernel only has one place to reclaim EPC pages: the
>> global
>> EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU
>> list for each EPC cgroup, and introduce a "cgroup" variant function to
>> reclaim EPC pages from a given EPC cgroup and its descendants.
>> Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
>> It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
>> pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
>> pages from the global LRU, and then tries to reclaim these pages at once
>> for better performance.
>> Implement the "cgroup" variant EPC reclaim in a similar way, but keep
>> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
>> as input, and return the pages that are "scanned" and attempted for
>> reclamation (but not necessarily reclaimed successfully); 2) loop the
>> given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
>> until SGX_NR_TO_SCAN pages are "scanned".
>> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
>> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
>> cgroup, and only moves to its descendants when there are not enough
>> reclaimable EPC pages to "scan" in its LRU. It should be enough for
>> most cases.
>> Note, this simple implementation doesn't _exactly_ mimic the current
>> global EPC reclaim (which always tries to do the actual reclaim in batch
>> of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
>> reclaimable pages, the actual reclaim of EPC pages will be split into
>> smaller batches _across_ multiple LRUs with each being smaller than
>> SGX_NR_TO_SCAN pages.
>> A more precise way to mimic the current global EPC reclaim would be to
>> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
>> _across_ the given EPC cgroup _AND_ its descendants, and then do the
>> actual reclaim in one batch. But this is unnecessarily complicated at
>> this stage.
>> Alternatively, the current sgx_reclaim_pages() could be changed to
>> return the actual "reclaimed" pages, but not "scanned" pages. However,
>> the reclamation is a lengthy process, forcing a successful reclamation
>> of predetermined number of pages may block the caller for too long. And
>> that may not be acceptable in some synchronous contexts, e.g., in
>> serving an ioctl().
>> With this building block in place, add synchronous reclamation support
>> in sgx_cgroup_try_charge(): trigger a call to
>> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
>> caller allows synchronous reclaim as indicated by a newly added
>> parameter.
>> A later patch will add support for asynchronous reclamation reusing
>> sgx_cgroup_reclaim_pages().
>> Note all reclaimable EPC pages are still tracked in the global LRU thus
>> no per-cgroup reclamation is actually active at the moment. Per-cgroup
>> tracking and reclamation will be turned on in the end after all
>> necessary infrastructure is in place.
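(The loop described above is roughly the following, simplified; the
helper names, e.g. sgx_cgroup_from_misc_cg() and css_misc(), are
approximations rather than the exact code in the patch:)

	static void sgx_cgroup_reclaim_pages(struct misc_cg *root,
					     struct mm_struct *charge_mm)
	{
		struct cgroup_subsys_state *css_root = &root->css;
		struct cgroup_subsys_state *pos;
		unsigned int cnt = 0;

		rcu_read_lock();
		css_for_each_descendant_pre(pos, css_root) {
			if (!css_tryget(pos))
				break;
			rcu_read_unlock();

			/* Scan/reclaim from this cgroup's own LRU. */
			cnt += sgx_reclaim_pages(&sgx_cgroup_from_misc_cg(css_misc(pos))->lru,
						 charge_mm);

			rcu_read_lock();
			css_put(pos);

			/* Stop once SGX_NR_TO_SCAN pages have been "scanned". */
			if (cnt >= SGX_NR_TO_SCAN)
				break;
		}
		rcu_read_unlock();
	}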
>
> Nit:
>
> "all necessary infrastructures are in place", or, "all necessary
> building blocks are in place".
>
> ?
>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> Tested-by: Jarkko Sakkinen <[email protected]>
>> ---
>
> Reviewed-by: Kai Huang <[email protected]>
>
Thanks
> More nitpickings below:
>
> [...]
>
>> -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
>> +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg,
>> enum sgx_reclaim reclaim)
>
> Let's still wrap the text on 80-character basis.
>
> I guess most people are more used to that.
>
> [...]
>
>> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
>> - struct sgx_epc_page, list);
>> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct
>> sgx_epc_page, list);
>
> Ditto.
>
Actually I changed to 100-character width based on comments from Jarkko,
IIRC. I don't have a personal preference, but I will not change back to 80
unless Jarkko also agrees.
Thanks
Haitao
On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > It's a workaround because you use the capacity==0 but it does not really
> > > mean to disable the misc cgroup for specific resource IIUC.
> >
> > Please read the comment around @misc_res_capacity again:
> >
> > * Miscellaneous resources capacity for the entire machine. 0 capacity
> > * means resource is not initialized or not present in the host.
> >
>
> I mentioned this in earlier email. I think this means no SGX EPC. It does
> not mean sgx epc cgroup not enabled. That's also consistent with the
> behavior try_charge() fails if capacity is zero.
OK. To me the "capacity" is purely a cgroup concept, so it must be
interpreted within the scope of "cgroup". If the cgroup, in our case the
SGX cgroup, is disabled, then "leaving the capacity to reflect the
presence of the hardware resource" doesn't really matter.
So what you are saying is that the kernel must initialize the capacity of
a MISC resource once it is added as a valid type.
And you must initialize the "capacity" even when the MISC cgroup is
disabled entirely by the kernel command line, in which case, IIUC,
misc.capacity is not even going to show up in the cgroup filesystem.
If this is your point, then your patch:
cgroup/misc: Add SGX EPC resource type
is already broken, because you added the new type w/o initializing the
capacity.
Please fix that up.
>
> > >
> > > There is explicit way for user to disable misc without setting capacity
> > > to
> > > zero.
> >
> > Which way are you talking about?
>
> Echo "-misc" to cgroup.subtree_control at root level for example still
> shows non-zero sgx_epc capacity.
I guess "having to disable all MISC resources just in order to disable SGX
EPC cgroup" is a brilliant idea.
You can easily disable the entire MISC cgroup by commandline for that
purpose if that's acceptable.
And I have no idea why "still showing non-zero EPC capacity" is important
if SGX cgroup cannot be supported at all.
On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai <[email protected]> wrote:
> On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
>> > > It's a workaround because you use the capacity==0 but it does not
>> really
>> > > mean to disable the misc cgroup for specific resource IIUC.
>> >
>> > Please read the comment around @misc_res_capacity again:
>> >
>> > * Miscellaneous resources capacity for the entire machine. 0
>> capacity
>> > * means resource is not initialized or not present in the host.
>> >
>>
>> I mentioned this in earlier email. I think this means no SGX EPC. It
>> does not mean sgx epc cgroup not enabled. That's also consistent with the
>> behavior try_charge() fails if capacity is zero.
>
> OK. To me the "capacity" is purely the concept of cgroup, so it must be
> interpreted within the scope of "cgroup". If cgroup, in our case, SGX
> cgroup, is disabled, then whether "leaving the capacity to reflect the
> presence of hardware resource" doesn't really matter.
> So what you are saying is that, the kernel must initialize the capacity
> of
> some MISC resource once it is added as valid type.
> And you must initialize the "capacity" even MISC cgroup is disabled
> entirely by kernel commandline, in which case, IIUC, misc.capacity is not
> even going to show in the /fs.
>
> If this is your point, then your patch:
>
> cgroup/misc: Add SGX EPC resource type
>
> is already broken, because you added the new type w/o initializing the
> capacity.
>
> Please fix that up.
>
>>
>> > >
>> > > There is explicit way for user to disable misc without setting
>> > > capacity to
>> > > zero.
>> >
>> > Which way are you talking about?
>>
>> Echo "-misc" to cgroup.subtree_control at root level for example still
>> shows non-zero sgx_epc capacity.
>
> I guess "having to disable all MISC resources just in order to disable
> SGX
> EPC cgroup" is a brilliant idea.
>
> You can easily disable the entire MISC cgroup by commandline for that
> purpose if that's acceptable.
>
> And I have no idea why "still showing non-zero EPC capacity" is important
> if SGX cgroup cannot be supported at all.
>
Okay, all I'm trying to say is that we should care about consistency in
the code and not have SGX do something different. Mixing "disable" with
"capacity==0" causes inconsistencies AFAICS:
1) The try_charge() API currently returns an error when capacity is zero.
So it does not appear to mean that the cgroup is disabled; otherwise it
should return success.
2) The current explicit way ("-misc") to disable misc still shows non-zero
entries in misc.capacity. (At least for cgroup v2, it does when I tested.)
Maybe this is not important, but I just don't feel good about this
inconsistency.
For now I'll just do BUG_ON() unless there are stronger opinions one way
or the other.
BR
Haitao
On Tue, 2024-04-23 at 19:26 -0500, Haitao Huang wrote:
> On Tue, 23 Apr 2024 17:13:15 -0500, Huang, Kai <[email protected]> wrote:
>
> > On Tue, 2024-04-23 at 10:30 -0500, Haitao Huang wrote:
> > > > > It's a workaround because you use the capacity==0 but it does not
> > > really
> > > > > mean to disable the misc cgroup for specific resource IIUC.
> > > >
> > > > Please read the comment around @misc_res_capacity again:
> > > >
> > > > * Miscellaneous resources capacity for the entire machine. 0
> > > capacity
> > > > * means resource is not initialized or not present in the host.
> > > >
> > >
> > > I mentioned this in earlier email. I think this means no SGX EPC. It
> > > does not mean sgx epc cgroup not enabled. That's also consistent with the
> > > behavior try_charge() fails if capacity is zero.
> >
> > OK. To me the "capacity" is purely the concept of cgroup, so it must be
> > interpreted within the scope of "cgroup". If cgroup, in our case, SGX
> > cgroup, is disabled, then whether "leaving the capacity to reflect the
> > presence of hardware resource" doesn't really matter.
> > So what you are saying is that, the kernel must initialize the capacity
> > of
> > some MISC resource once it is added as valid type.
> > And you must initialize the "capacity" even MISC cgroup is disabled
> > entirely by kernel commandline, in which case, IIUC, misc.capacity is not
> > even going to show in the /fs.
> >
> > If this is your point, then your patch:
> >
> > cgroup/misc: Add SGX EPC resource type
> >
> > is already broken, because you added the new type w/o initializing the
> > capacity.
> >
> > Please fix that up.
> >
> > >
> > > > >
> > > > > There is explicit way for user to disable misc without setting
> > > > > capacity to
> > > > > zero.
> > > >
> > > > Which way are you talking about?
> > >
> > > Echo "-misc" to cgroup.subtree_control at root level for example still
> > > shows non-zero sgx_epc capacity.
> >
> > I guess "having to disable all MISC resources just in order to disable
> > SGX
> > EPC cgroup" is a brilliant idea.
> >
> > You can easily disable the entire MISC cgroup by commandline for that
> > purpose if that's acceptable.
> >
> > And I have no idea why "still showing non-zero EPC capacity" is important
> > if SGX cgroup cannot be supported at all.
> >
>
> Okay, all I'm trying to say is we should care about consistency in code
> and don't want SGX do something different. Mixing "disable" with
> "capacity==0" causes inconsistencies AFAICS:
>
> 1) The try_charge() API currently returns error when capacity is zero. So
> it appears not to mean that the cgroup is disabled otherwise it should
> return success.
I agree this isn't ideal. My view is we can fix it in MISC code if
needed.
>
> 2) The current explicit way ("-misc") to disable misc still shows non-zero
> entries in misc.capacity. (At least for v2 cgroup, it does when I tested).
> Maybe this is not important but I just don't feel good about this
> inconsistency.
This belongs to "the MISC resource cgroup was initially enabled by the
kernel at boot time, but was later disabled *somewhere in the hierarchy*
by the user".
In fact, if you only do "-misc" for "some subtree", it's quite reasonable
to still report the resource in misc.capacity. In the case above, the
"some subtree" happens to be the root.
So to me it's reasonable to still show misc.capacity in this case. And you
can actually argue that the kernel still supports the cgroup for the
resource. E.g., you can do "+misc" at runtime to re-enable it.
However, if the kernel isn't able to support a certain MISC resource
cgroup at boot time, it's quite reasonable to just set the "capacity" to 0
so it isn't visible to userspace.
Note:
My key point is, when userspace sees 0 "capacity", it shouldn't need to
care about whether that is because "the hardware resource is not
available" or "the hardware resource is available but the kernel cannot
support the cgroup for it". The resource cgroup is simply unavailable.
That means the kernel has every right to just hide that resource from the
cgroup at boot time.
But this should be just within the "cgroup's scope", i.e., the resource
can still be available if the kernel can provide it. If some app wants to
additionally check whether such a resource is indeed available while only
the cgroup is not, it should check the resource-specific interface rather
than rely on the MISC cgroup interface.
>
> For now I'll just do BUG_ON() unless there are more strong opinions one
> way or the other.
>
Fine to me.
Hi Jarkko
On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
>> I did declare the configs in the config file but I missed it in my patch
>> as stated earlier. IIUC, that would not cause this error though.
>>
>> Maybe I should exit with the skip code if no CGROUP_MISC (no more
>> CGROUP_SGX_EPC) is configured?
> OK, so I wanted to do a distro kernel test here, and used the default
> OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
I couldn't figure out why you encountered this failure. I think the
OpenSUSE kernel most likely has CGROUP_MISC configured.
Also, if CGROUP_MISC were not set, then an error should happen earlier, on
echoing "+misc" to cgroup.subtree_control at line 20. But your log
indicates an error only on echoing "sgx_epc ..." to
/sys/fs/cgroup/...//misc.max.
I can only speculate that this could happen (if the SGX EPC cgroup was
compiled in) when the cgroup-fs subdirectories in question already have a
populated config that conflicts with the scripts.
Could you double check or start from a clean environment?
Thanks
Haitao
On Wed Apr 24, 2024 at 10:42 PM EEST, Haitao Huang wrote:
> Hi Jarkko
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> >> I did declare the configs in the config file but I missed it in my patch
> >> as stated earlier. IIUC, that would not cause this error though.
> >>
> >> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> >> CGROUP_SGX_EPC) is configured?
> > OK, so I wanted to do a distro kernel test here, and used the default
> > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
>
> I couldn't figure out why this failure you have encountered. I think
> OpenSUSE kernel most likely config CGROUP_MISC.
>
> Also if CGROUP_MISC not set, then there should be error happen earlier on
> echoing "+misc" to cgroup.subtree_control at line 20. But your log
> indicates only error on echoing "sgx_epc ..." to
> /sys/fs/cgroup/...//misc.max.
>
> I can only speculate that can could happen (if sgx epc cgroup was compiled
> in) when the cgroup-fs subdirectories in question already have populated
> config that is conflicting with the scripts.
>
> Could you double check or start from a clean environment?
> Thanks
> Haitao
I can re-check next week once I'm back from Estonia. I'm travelling now
to https://lu.ma/uncloud.
BR, Jarkko
On 4/16/24 07:15, Jarkko Sakkinen wrote:
> On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> Yes, exactly. I'd take one week break and cycle the kselftest part
> internally a bit as I said my previous response. I'm sure that there
> is experise inside Intel how to implement it properly. I.e. take some
> time to find the right person, and wait as long as that person has a
> bit of bandwidth to go through the test and suggest modifications.
Folks, I worry that this series is getting bogged down in the selftests.
Yes, selftests are important. But getting _some_ tests in the kernel
is substantially more important than getting perfect tests.
I don't think Haitao needs to "cycle" this back inside Intel.
On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
> On 4/16/24 07:15, Jarkko Sakkinen wrote:
> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> > Yes, exactly. I'd take one week break and cycle the kselftest part
> > internally a bit as I said my previous response. I'm sure that there
> > is experise inside Intel how to implement it properly. I.e. take some
> > time to find the right person, and wait as long as that person has a
> > bit of bandwidth to go through the test and suggest modifications.
>
> Folks, I worry that this series is getting bogged down in the selftests.
> Yes, selftests are important. But getting _some_ tests in the kernel
> is substantially more important than getting perfect tests.
>
> I don't think Haitao needs to "cycle" this back inside Intel.
The problem with the tests was that they were hard to run on anything
other than Ubuntu (and perhaps Debian). That is hopefully now taken care
of. Selftests do not have to be perfect, but at minimum they need to be
runnable.
I need to re-test the latest series because it is possible that I did not
have the right flags (I was travelling for a few days and thus have not
done it yet).
BR, Jarkko
> +/*
> + * Get the per-cgroup or global LRU list that tracks the given reclaimable page.
> + */
> static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
> {
> +#ifdef CONFIG_CGROUP_MISC
> + /*
> + * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
> + * life between sgx_alloc_epc_page() and sgx_free_epc_page():
> + *
> + * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
> + * sgx_get_current_cg() which is the misc cgroup of the current task, or
> + * the root by default even if the misc cgroup is disabled by kernel
> + * command line.
> + *
> + * epc_page->sgx_cg is only unset by sgx_free_epc_page().
> + *
> + * This function is never used before sgx_alloc_epc_page() or after
> + * sgx_free_epc_page().
> + */
> + return &epc_page->sgx_cg->lru;
> +#else
> return &sgx_global_lru;
> +#endif
> }
>
> /*
> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
> */
> static inline bool sgx_can_reclaim(void)
> {
> - return !list_empty(&sgx_global_lru.reclaimable);
> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
> + !list_empty(&sgx_global_lru.reclaimable);
> }
Shouldn't this be:
if (IS_ENABLED(CONFIG_CGROUP_MISC))
return !sgx_cgroup_lru_empty(misc_cg_root());
else
return !list_empty(&sgx_global_lru.reclaimable);
?
In this way, it is consistent with the sgx_reclaim_pages_global() below.
>
> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
>
> static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> {
> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> + if (IS_ENABLED(CONFIG_CGROUP_MISC))
> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
> + else
> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
> }
>
> /*
> @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
> */
> void sgx_reclaim_direct(void)
> {
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> + /* Make sure there are some free pages at cgroup level */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
> + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
> + sgx_put_cg(sgx_cg);
> + }
Empty line.
> + /* Make sure there are some free pages at global level */
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
Looking at the code, to me sgx_should_reclaim() is a little bit vague
because from the name we don't know whether it internally checks the
current cgroup or the global LRU.
It's better to rename it to sgx_should_reclaim_global().
Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global().
And I think you can do the renaming in the previous patch, because in the
changelog of your previous patch it seems you have called out that the two
functions are for global reclaim.
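I.e., something like below (just a sketch; I'm assuming the bodies stay
as they are in that previous patch and only the names change):

	static bool sgx_can_reclaim_global(void)
	{
		return !list_empty(&sgx_global_lru.reclaimable);
	}

	static bool sgx_should_reclaim_global(unsigned long watermark)
	{
		return atomic_long_read(&sgx_nr_free_pages) < watermark &&
		       sgx_can_reclaim_global();
	}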
> sgx_reclaim_pages_global(current->mm);
> }
> @@ -616,6 +649,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, enum sgx_reclaim reclaim)
> break;
> }
>
> + /*
> + * At this point, the usage within this cgroup is under its
> + * limit but there is no physical page left for allocation.
> + * Perform a global reclaim to get some pages released from any
> + * cgroup with reclaimable pages.
> + */
> sgx_reclaim_pages_global(current->mm);
> cond_resched();
> }
On Mon, 29 Apr 2024 05:49:13 -0500, Huang, Kai <[email protected]> wrote:
>
>> +/*
>> + * Get the per-cgroup or global LRU list that tracks the given
>> reclaimable page.
>> + */
>> static inline struct sgx_epc_lru_list *sgx_lru_list(struct
>> sgx_epc_page *epc_page)
>> {
>> +#ifdef CONFIG_CGROUP_MISC
>> + /*
>> + * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
>> + * life between sgx_alloc_epc_page() and sgx_free_epc_page():
>> + *
>> + * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
>> + * sgx_get_current_cg() which is the misc cgroup of the current task,
>> or
>> + * the root by default even if the misc cgroup is disabled by kernel
>> + * command line.
>> + *
>> + * epc_page->sgx_cg is only unset by sgx_free_epc_page().
>> + *
>> + * This function is never used before sgx_alloc_epc_page() or after
>> + * sgx_free_epc_page().
>> + */
>> + return &epc_page->sgx_cg->lru;
>> +#else
>> return &sgx_global_lru;
>> +#endif
>> }
>>
>> /*
>> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list
>> *sgx_lru_list(struct sgx_epc_page *epc_pag
>> */
>> static inline bool sgx_can_reclaim(void)
>> {
>> - return !list_empty(&sgx_global_lru.reclaimable);
>> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
>> + !list_empty(&sgx_global_lru.reclaimable);
>> }
>
> Shouldn't this be:
>
> if (IS_ENABLED(CONFIG_CGROUP_MISC))
> return !sgx_cgroup_lru_empty(misc_cg_root());
> else
> return !list_empty(&sgx_global_lru.reclaimable);
> ?
>
> In this way, it is consistent with the sgx_reclaim_pages_global() below.
>
I changed it to this way because sgx_cgroup_lru_empty() is now defined in
both Kconfig cases.
And it seems better to minimize use of the Kconfig variables based on
earlier feedback (some of it yours).
I don't really have a strong preference here, so let me know one way or
the other.
>>
>> static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>> @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long
>> watermark)
>>
>> static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>> {
>> - sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> + if (IS_ENABLED(CONFIG_CGROUP_MISC))
>> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
>> + else
>> + sgx_reclaim_pages(&sgx_global_lru, charge_mm);
>> }
>>
>> /*
>> @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct
>> mm_struct *charge_mm)
>> */
>> void sgx_reclaim_direct(void)
>> {
>> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
>> +
>> + /* Make sure there are some free pages at cgroup level */
>> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
>> + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
>> + sgx_put_cg(sgx_cg);
>> + }
>
> Empty line.
>
Sure
>> + /* Make sure there are some free pages at global level */
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>
> Looking at the code, to me sgx_should_reclaim() is a little bit vague
> because from the name we don't know whether it interally checks the
> current cgroup or the global.
> It's better to rename to sgx_should_reclaim_global().
>
> Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global().
>
> And I think you can do the renaming in the previous patch, because in the
> changelog of your previous patch, it seems you have called out the two
> functions are for global reclaim.
>
ok
Thanks
Haitao
Hi Jarkko
On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
>> On 4/16/24 07:15, Jarkko Sakkinen wrote:
>> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
>> > Yes, exactly. I'd take one week break and cycle the kselftest part
>> > internally a bit as I said my previous response. I'm sure that there
>> > is experise inside Intel how to implement it properly. I.e. take some
>> > time to find the right person, and wait as long as that person has a
>> > bit of bandwidth to go through the test and suggest modifications.
>>
>> Folks, I worry that this series is getting bogged down in the selftests.
>> Yes, selftests are important. But getting _some_ tests in the kernel
>> is substantially more important than getting perfect tests.
>>
>> I don't think Haitao needs to "cycle" this back inside Intel.
>
> The problem with the tests was that they are hard to run anything else
> than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
> Selftests do not have to be perfect but at minimum they need to be
> runnable.
>
> I need ret-test the latest series because it is possible that I did not
> have right flags (I was travelling few days thus have not done it yet).
>
> BR, Jarkko
>
Let me know if you want me to send v13 before testing, or you can just
use the sgx_cg_upstream_v12_plus branch in my repo.
Also, thanks for the "Reviewed-by" tags for the other patches. But I've
not got a "Reviewed-by" from you for patches #8-12 (not sure if I missed
them). Could you go through those also when you get a chance?
Thanks
Haitao
On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
> >> On 4/16/24 07:15, Jarkko Sakkinen wrote:
> >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> >> > Yes, exactly. I'd take one week break and cycle the kselftest part
> >> > internally a bit as I said my previous response. I'm sure that there
> >> > is experise inside Intel how to implement it properly. I.e. take some
> >> > time to find the right person, and wait as long as that person has a
> >> > bit of bandwidth to go through the test and suggest modifications.
> >>
> >> Folks, I worry that this series is getting bogged down in the selftests.
> >> Yes, selftests are important. But getting _some_ tests in the kernel
> >> is substantially more important than getting perfect tests.
> >>
> >> I don't think Haitao needs to "cycle" this back inside Intel.
> >
> > The problem with the tests was that they are hard to run anything else
> > than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
> > Selftests do not have to be perfect but at minimum they need to be
> > runnable.
> >
> > I need ret-test the latest series because it is possible that I did not
> > have right flags (I was travelling few days thus have not done it yet).
> >
> > BR, Jarkko
> >
>
> Let me know if you want me to send v13 before testing or you can just use
> the sgx_cg_upstream_v12_plus branch in my repo.
>
> Also thanks for the "Reviewed-by" tags for other patches. But I've not got
> "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go
> through those alsoe when you get chance?
So, I compiled the v12 branch. Was the only difference in the selftests?
I can just copy them to the device.
BR, Jarkko
On Mon, 29 Apr 2024 11:43:05 -0500, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote:
>> Hi Jarkko
>>
>> On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen <[email protected]>
>> wrote:
>>
>> > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
>> >> On 4/16/24 07:15, Jarkko Sakkinen wrote:
>> >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
>> >> > Yes, exactly. I'd take one week break and cycle the kselftest part
>> >> > internally a bit as I said my previous response. I'm sure that
>> there
>> >> > is experise inside Intel how to implement it properly. I.e. take
>> some
>> >> > time to find the right person, and wait as long as that person has
>> a
>> >> > bit of bandwidth to go through the test and suggest modifications.
>> >>
>> >> Folks, I worry that this series is getting bogged down in the
>> selftests.
>> >> Yes, selftests are important. But getting _some_ tests in the
>> kernel
>> >> is substantially more important than getting perfect tests.
>> >>
>> >> I don't think Haitao needs to "cycle" this back inside Intel.
>> >
>> > The problem with the tests was that they are hard to run anything else
>> > than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
>> > Selftests do not have to be perfect but at minimum they need to be
>> > runnable.
>> >
>> > I need ret-test the latest series because it is possible that I did
>> not
>> > have right flags (I was travelling few days thus have not done it
>> yet).
>> >
>> > BR, Jarkko
>> >
>>
>> Let me know if you want me to send v13 before testing or you can just
>> use
>> the sgx_cg_upstream_v12_plus branch in my repo.
>>
>> Also thanks for the "Reviewed-by" tags for other patches. But I've not
>> got
>> "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you
>> go
>> through those alsoe when you get chance?
>
> So, I compiled v12 branch. Was the only difference in selftests?
>
> I can just copy them to the device.
>
> BR, Jarkko
>
The only other functional change is using BUG_ON() when workqueue
allocation fails in sgx_cgroup_init(). It should not affect testing
results.
Thanks
Haitao
>>> /*
>>> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list
>>> *sgx_lru_list(struct sgx_epc_page *epc_pag
>>> */
>>> static inline bool sgx_can_reclaim(void)
>>> {
>>> - return !list_empty(&sgx_global_lru.reclaimable);
>>> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
>>> + !list_empty(&sgx_global_lru.reclaimable);
>>> }
>>
>> Shouldn't this be:
>>
>> if (IS_ENABLED(CONFIG_CGROUP_MISC))
>> return !sgx_cgroup_lru_empty(misc_cg_root());
>> else
>> return !list_empty(&sgx_global_lru.reclaimable);
>> ?
>>
>> In this way, it is consistent with the sgx_reclaim_pages_global() below.
>>
>
> I changed to this way because sgx_cgroup_lru_empty() is now defined in
> both KConfig cases.
> And it seems better to minimize use of the KConfig variables based on
> earlier feedback (some are yours).
> Don't really have strong preference here. So let me know one way of the
> other.
>
But IMHO your code could be confusing, e.g., it can be interpreted as:
The EPC pages can be managed by both the cgroup LRUs and the
sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC
is on.
Which is not true.
So we should make the code clearly reflect the true behaviour.
On Mon, 29 Apr 2024 17:18:05 -0500, Huang, Kai <[email protected]> wrote:
>
>>>> /*
>>>> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list
>>>> *sgx_lru_list(struct sgx_epc_page *epc_pag
>>>> */
>>>> static inline bool sgx_can_reclaim(void)
>>>> {
>>>> - return !list_empty(&sgx_global_lru.reclaimable);
>>>> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
>>>> + !list_empty(&sgx_global_lru.reclaimable);
>>>> }
>>>
>>> Shouldn't this be:
>>>
>>> if (IS_ENABLED(CONFIG_CGROUP_MISC))
>>> return !sgx_cgroup_lru_empty(misc_cg_root());
>>> else
>>> return !list_empty(&sgx_global_lru.reclaimable);
>>> ?
>>>
>>> In this way, it is consistent with the sgx_reclaim_pages_global()
>>> below.
>>>
>> I changed to this way because sgx_cgroup_lru_empty() is now defined in
>> both KConfig cases.
>> And it seems better to minimize use of the KConfig variables based on
>> earlier feedback (some are yours).
>> Don't really have strong preference here. So let me know one way of the
>> other.
>>
>
> But IMHO your code could be confusing, e.g., it can be interpreted as:
>
> The EPC pages can be managed by both the cgroup LRUs and the
> sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC
> is on.
>
> Which is not true.
>
> So we should make the code clearly reflect the true behaviour.
>
Ok, I'll change back.
Thanks
Haitao