2023-09-14 05:06:49

by Haitao Huang

Subject: [PATCH v4 00/18] Add Cgroup support for SGX EPC memory

SGX EPC memory allocations are separate from normal RAM allocations, and
are managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory, which is
a desirable feature in some environments, e.g., support for pod level
control in a Kubernetes cluster on a VM or bare-metal host [1,2].

This patchset implements support for sgx_epc memory within the misc
cgroup controller. The user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports per-cgroup current usage and limit-reached events, as well as the
total system capacity.

This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to work with more recent
kernels and to utilize the misc cgroup controller rather than a custom
controller. I have now updated the patches based on review comments on the
v2 and v3 series [3,4], simplified a few aspects of the
implementation/design, and fixed some stability issues found in testing,
while keeping the same user-space-facing interfaces.

The patchset adds support for multiple LRU lists to track both reclaimable
EPC pages (i.e., pages the reclaimer knows about), as well as unreclaimable
EPC pages (i.e., pages which the reclaimer isn't aware of, such as VA
pages). These pages are assigned to an LRU list, as well as an enclave, so
that an enclave's full EPC usage can be tracked, and subject to the
per-cgroup limit. During OOM events, an enclave can have its memory zapped,
and all the EPC pages tracked by the LRU lists can be freed.
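
For reference, a minimal sketch of the LRU container added by the series
(the field names match the code in the patches below; the real definition
in sgx.h may differ in detail):

/*
 * Sketch only: a single lock protects both lists, so a page can be
 * moved between the reclaimable and unreclaimable lists atomically.
 */
struct sgx_epc_lru_lists {
	spinlock_t lock;
	struct list_head reclaimable;	/* pages the reclaimer tracks */
	struct list_head unreclaimable;	/* e.g., VA and SECS pages */
};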

The EPC pages allocated for KVM guests by the virtual EPC driver are not
reclaimable by the host kernel [5]. Therefore they are not tracked by any
LRU lists for reclaiming purposes in this implementation, but they are
charged toward the cgroup of the user process (e.g., QEMU) launching the
guest. When the cgroup's EPC usage reaches its limit, the virtual EPC
driver stops allocating more EPC for the VM and returns SIGBUS to the
user process, which aborts the VM launch.
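
Roughly, the vEPC fault path then behaves along these lines. This is a
hedged sketch, not the actual patch: sgx_vepc_fault_sketch() is a made-up
name, while sgx_alloc_epc_page() matches the allocator used elsewhere in
this series:

static vm_fault_t sgx_vepc_fault_sketch(struct vm_fault *vmf)
{
	struct sgx_epc_page *epc_page;

	/* Charged to the launching process's cgroup; never reclaimed. */
	epc_page = sgx_alloc_epc_page(vmf->vma->vm_private_data, false);
	if (IS_ERR(epc_page))
		return VM_FAULT_SIGBUS;	/* user process (e.g., QEMU) gets SIGBUS */

	/* ... map the page into the guest's vEPC range ... */
	return VM_FAULT_NOPAGE;
}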

To make them easier to follow, I reordered the patches in v4 into the
following clusters:
- Patches 1&2 are prerequisite misc cgroup changes.
- Patches 3-8 deal with the 'reclaimable' pages.
- Patches 9-12 deal with the 'unreclaimable' pages, which are freed only
in OOM scenarios.
- Patches 13-15 reorganize the EPC reclaiming code to be reusable by the
EPC cgroup.
- Patch 16 implements the EPC cgroup as a misc cgroup.
- Patch 17 adds documentation for the EPC cgroup.
- Patch 18 adds test scripts. They depend on earlier fixes and enhancements
reviewed previously [6].

I appreciate your comments and feedback.

---
v4:
* Collected "Tested-by" from Mikko. I kept it for now as no functional changes in v4.
* Rebased on to v6.6_rc1 and reordered patches as described above.
* Separated out the bug fixes [7,8,9]. This series depend on those patches. (Dave, Jarkko)
* Added comments in commit message to give more preview what's to come next. (Jarkko)
* Fixed some documentation error, gap, style (Mikko, Randy)
* Fixed some comments, typo, style in code (Mikko, Kai)
* Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
* Fixed typo (Pavel)
* Exclude the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6]
* Use the same to list for cover and all patches. (Sohil)

v3:

* Added EPC states to replace flags in the sgx_epc_page struct. (Jarkko)
* Unrolled wrappers for cond_resched and list operations. (Dave)
* Separate patches for adding reclaimable and unreclaimable lists. (Dave)
* Other improvements to patch flow, commit messages, and styles. (Dave, Jarkko)
* Simplified the cgroup tree walking with plain
css_for_each_descendant_pre.
* Fixed race conditions and crashes.
* Made the OOM killer wait for the victim enclave's pages to be reclaimed.
* Unblocked the user by handling the misc_max_write callback asynchronously.
* Rebased onto 6.4; this series is no longer based on the MCA patchset.
* Fixed an overflow in misc_try_charge.
* Fixed a NULL pointer dereference in the SGX #PF handler.
* Updated and included the SGX selftest patches previously reviewed. Those
patches fix issues triggered under the high EPC pressure required for
cgroup testing.
* Added test scripts to help set up and test SGX EPC cgroups.

[1] https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2] https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/linux-sgx/[email protected]/
[5] Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[6] https://lore.kernel.org/linux-sgx/[email protected]/
[7] https://lore.kernel.org/linux-sgx/[email protected]/
[8] https://lore.kernel.org/linux-sgx/[email protected]/
[9] https://lore.kernel.org/linux-sgx/[email protected]/

Haitao Huang (4):
x86/sgx: Introduce EPC page states
x86/sgx: Store struct sgx_encl when allocating new VA pages
x86/sgx: Prepare for multiple LRUs
selftests/sgx: Add scripts for epc cgroup testing

Kristen Carlson Accardi (9):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver
x86/sgx: Add sgx_epc_lru_lists to encapsulate LRU lists
x86/sgx: Use sgx_epc_lru_lists for existing active page list
x86/sgx: Store reclaimable EPC pages in sgx_epc_lru_lists
x86/sgx: Use a list to track to-be-reclaimed pages
x86/sgx: Store unreclaimable pages in LRU lists
x86/sgx: Limit process EPC usage with misc cgroup controller
Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (5):
x86/sgx: Introduce RECLAIM_IN_PROGRESS state
x86/sgx: Add EPC page flags to identify owner types
x86/sgx: Add EPC OOM path to forcefully reclaim EPC
x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
x86/sgx: Add helper to grab pages from an arbitrary EPC LRU

Documentation/arch/x86/sgx.rst | 82 ++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/driver.c | 27 +-
arch/x86/kernel/cpu/sgx/encl.c | 74 +++-
arch/x86/kernel/cpu/sgx/encl.h | 4 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 406 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 59 +++
arch/x86/kernel/cpu/sgx/ioctl.c | 27 +-
arch/x86/kernel/cpu/sgx/main.c | 399 +++++++++++++----
arch/x86/kernel/cpu/sgx/sgx.h | 115 ++++-
include/linux/misc_cgroup.h | 34 ++
kernel/cgroup/misc.c | 57 ++-
.../selftests/sgx/run_tests_in_misc_cg.sh | 68 +++
tools/testing/selftests/sgx/setup_epc_cg.sh | 29 ++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
16 files changed, 1257 insertions(+), 151 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/run_tests_in_misc_cg.sh
create mode 100755 tools/testing/selftests/sgx/setup_epc_cg.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

--
2.25.1


2023-09-14 10:11:09

by Haitao Huang

Subject: [PATCH v4 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

From: Sean Christopherson <[email protected]>

Introduce the OOM path for killing an enclave when a reclaimer is no
longer able to reclaim enough EPC pages. Find a victim enclave, which
will be an enclave with only "unreclaimable" EPC pages left in the
cgroup LRU lists. Once a victim is identified, mark the enclave as OOM,
zap the enclave's entire page range, and drain all mm references in
encl->mm_list. Block allocation of any EPC pages in the #PF handler,
reloading of any pages on all paths, and creation of any new mappings.

The OOM killing path may race with the reclaimers: in some cases, the
victim enclave is in the process of reclaiming its last EPC pages when
OOM happens, that is, all pages other than the SECS and VA pages are in
the RECLAIM_IN_PROGRESS state. The reclaiming process requires access to
the enclave's backing storage and VA pages as well as the SECS. So the
OOM killer does not directly release those enclave resources; instead,
it lets all in-progress reclaiming finish and relies (as currently done)
on kref_put() on encl->refcount to trigger sgx_encl_release() to do the
final cleanup.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V4:
- Updates for patch reordering and typo fixes.

V3:
- Rebased to use the new VMA_ITERATOR to zap VMAs.
- Fixed the racing cases by blocking new page allocation/mapping and
reloading when an enclave is marked for OOM. Do not release any enclave
resources other than draining mm_list entries, and let pages in
RECLAIM_IN_PROGRESS be reaped by reclaimers.
- Due to the above changes, also removed the no-longer-needed encl->lock in
the OOM path, which was causing deadlocks reported by lockdep.
---
arch/x86/kernel/cpu/sgx/driver.c | 27 +-----
arch/x86/kernel/cpu/sgx/encl.c | 48 ++++++++++-
arch/x86/kernel/cpu/sgx/encl.h | 2 +
arch/x86/kernel/cpu/sgx/ioctl.c | 9 ++
arch/x86/kernel/cpu/sgx/main.c | 140 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/sgx.h | 1 +
6 files changed, 200 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 262f5fb18d74..ff42d649c7b6 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -44,7 +44,6 @@ static int sgx_open(struct inode *inode, struct file *file)
static int sgx_release(struct inode *inode, struct file *file)
{
struct sgx_encl *encl = file->private_data;
- struct sgx_encl_mm *encl_mm;

/*
* Drain the remaining mm_list entries. At this point the list contains
@@ -52,31 +51,7 @@ static int sgx_release(struct inode *inode, struct file *file)
* not exited yet. The processes, which have exited, are gone from the
* list by sgx_mmu_notifier_release().
*/
- for ( ; ; ) {
- spin_lock(&encl->mm_lock);
-
- if (list_empty(&encl->mm_list)) {
- encl_mm = NULL;
- } else {
- encl_mm = list_first_entry(&encl->mm_list,
- struct sgx_encl_mm, list);
- list_del_rcu(&encl_mm->list);
- }
-
- spin_unlock(&encl->mm_lock);
-
- /* The enclave is no longer mapped by any mm. */
- if (!encl_mm)
- break;
-
- synchronize_srcu(&encl->srcu);
- mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
- kfree(encl_mm);
-
- /* 'encl_mm' is gone, put encl_mm->encl reference: */
- kref_put(&encl->refcount, sgx_encl_release);
- }
-
+ sgx_encl_mm_drain(encl);
kref_put(&encl->refcount, sgx_encl_release);
return 0;
}
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index bf0ac3677ca8..85b6f218f029 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -453,6 +453,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
if (unlikely(!encl))
return VM_FAULT_SIGBUS;

+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ return VM_FAULT_SIGBUS;
+
/*
* The page_array keeps track of all enclave pages, whether they
* are swapped out or not. If there is no entry for this page and
@@ -651,7 +654,8 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
if (!encl)
return -EFAULT;

- if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+ if (!test_bit(SGX_ENCL_DEBUG, &encl->flags) ||
+ test_bit(SGX_ENCL_OOM, &encl->flags))
return -EFAULT;

for (i = 0; i < len; i += cnt) {
@@ -776,6 +780,45 @@ void sgx_encl_release(struct kref *ref)
kfree(encl);
}

+/**
+ * sgx_encl_mm_drain - drain all mm_list entries
+ * @encl: address of the sgx_encl to drain
+ *
+ * Used during OOM kill to empty the mm_list entries after they have been
+ * zapped, or by sgx_release() to drain the remaining mm_list entries when
+ * the enclave fd is closed. After this call, sgx_encl_release() will be
+ * triggered by kref_put().
+ */
+void sgx_encl_mm_drain(struct sgx_encl *encl)
+{
+ struct sgx_encl_mm *encl_mm;
+
+ for ( ; ; ) {
+ spin_lock(&encl->mm_lock);
+
+ if (list_empty(&encl->mm_list)) {
+ encl_mm = NULL;
+ } else {
+ encl_mm = list_first_entry(&encl->mm_list,
+ struct sgx_encl_mm, list);
+ list_del_rcu(&encl_mm->list);
+ }
+
+ spin_unlock(&encl->mm_lock);
+
+ /* The enclave is no longer mapped by any mm. */
+ if (!encl_mm)
+ break;
+
+ synchronize_srcu(&encl->srcu);
+ mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+ kfree(encl_mm);
+
+ /* 'encl_mm' is gone, put encl_mm->encl reference: */
+ kref_put(&encl->refcount, sgx_encl_release);
+ }
+}
+
/*
* 'mm' is exiting and no longer needs mmu notifications.
*/
@@ -847,6 +890,9 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
struct sgx_encl_mm *encl_mm;
int ret;

+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ return -ENOMEM;
+
/*
* Even though a single enclave may be mapped into an mm more than once,
* each 'mm' only appears once on encl->mm_list. This is guaranteed by
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 831d63f80f5a..47792fb00cee 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -39,6 +39,7 @@ enum sgx_encl_flags {
SGX_ENCL_DEBUG = BIT(1),
SGX_ENCL_CREATED = BIT(2),
SGX_ENCL_INITIALIZED = BIT(3),
+ SGX_ENCL_OOM = BIT(4),
};

struct sgx_encl_mm {
@@ -125,5 +126,6 @@ struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
unsigned long addr);
struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+void sgx_encl_mm_drain(struct sgx_encl *encl);

#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 8c23bb524674..1f65c79664a2 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -421,6 +421,9 @@ static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
return -EINVAL;

+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ return -ENOMEM;
+
if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
return -EFAULT;

@@ -606,6 +609,9 @@ static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
return -EINVAL;

+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ return -ENOMEM;
+
if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
return -EFAULT;

@@ -682,6 +688,9 @@ static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
return -EINVAL;

+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ return -ENOMEM;
+
return 0;
}

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index f3a3ed894616..c8900d62cfff 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -621,6 +621,146 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
atomic_long_inc(&sgx_nr_free_pages);
}

+static bool sgx_oom_get_ref(struct sgx_epc_page *epc_page)
+{
+ struct sgx_encl *encl;
+
+ if (epc_page->flags & SGX_EPC_OWNER_PAGE)
+ encl = epc_page->encl_page->encl;
+ else if (epc_page->flags & SGX_EPC_OWNER_ENCL)
+ encl = epc_page->encl;
+ else
+ return false;
+
+ return kref_get_unless_zero(&encl->refcount);
+}
+
+static struct sgx_epc_page *sgx_oom_get_victim(struct sgx_epc_lru_lists *lru)
+{
+ struct sgx_epc_page *epc_page, *tmp;
+
+ if (list_empty(&lru->unreclaimable))
+ return NULL;
+
+ list_for_each_entry_safe(epc_page, tmp, &lru->unreclaimable, list) {
+ list_del_init(&epc_page->list);
+
+ if (sgx_oom_get_ref(epc_page))
+ return epc_page;
+ }
+ return NULL;
+}
+
+static void sgx_epc_oom_zap(void *owner, struct mm_struct *mm, unsigned long start,
+ unsigned long end, const struct vm_operations_struct *ops)
+{
+ VMA_ITERATOR(vmi, mm, start);
+ struct vm_area_struct *vma;
+
+ /*
+ * Use end because start can be zero yet not be mapped into the
+ * enclave even if encl->base == 0.
+ */
+ for_each_vma_range(vmi, vma, end) {
+ if (vma->vm_ops == ops && vma->vm_private_data == owner &&
+ vma->vm_start < end) {
+ zap_vma_pages(vma);
+ }
+ }
+}
+
+static bool sgx_oom_encl(struct sgx_encl *encl)
+{
+ unsigned long mm_list_version;
+ struct sgx_encl_mm *encl_mm;
+ bool ret = false;
+ int idx;
+
+ if (!test_bit(SGX_ENCL_CREATED, &encl->flags))
+ goto out_put;
+
+ /* Done OOM on this enclave previously, do not redo it. This may
+ * happen when the SECS page is still UNRECLAIMABLE because another
+ * page is in RECLAIM_IN_PROGRESS. Still return true so the OOM killer
+ * can wait until the reclaimer is done with the held-up page and the
+ * SECS before it moves on to find another victim.
+ */
+ if (test_bit(SGX_ENCL_OOM, &encl->flags))
+ goto out;
+
+ set_bit(SGX_ENCL_OOM, &encl->flags);
+
+ do {
+ mm_list_version = encl->mm_list_version;
+
+ /* Pairs with smp_rmb() in sgx_encl_mm_add(). */
+ smp_rmb();
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ mmap_read_lock(encl_mm->mm);
+
+ sgx_epc_oom_zap(encl, encl_mm->mm, encl->base,
+ encl->base + encl->size, &sgx_vm_ops);
+
+ mmap_read_unlock(encl_mm->mm);
+
+ mmput_async(encl_mm->mm);
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+ } while (WARN_ON_ONCE(encl->mm_list_version != mm_list_version));
+
+ sgx_encl_mm_drain(encl);
+out:
+ ret = true;
+
+out_put:
+ /*
+ * This puts the refcount we took when we identified this enclave as
+ * an OOM victim.
+ */
+ kref_put(&encl->refcount, sgx_encl_release);
+ return ret;
+}
+
+static inline bool sgx_oom_encl_page(struct sgx_encl_page *encl_page)
+{
+ return sgx_oom_encl(encl_page->encl);
+}
+
+/**
+ * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
+ * @lru: LRU that is low
+ *
+ * Return: %true if a victim was found and kicked.
+ */
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
+{
+ struct sgx_epc_page *victim;
+
+ spin_lock(&lru->lock);
+ victim = sgx_oom_get_victim(lru);
+ spin_unlock(&lru->lock);
+
+ if (!victim)
+ return false;
+
+ if (victim->flags & SGX_EPC_OWNER_PAGE)
+ return sgx_oom_encl_page(victim->encl_page);
+
+ if (victim->flags & SGX_EPC_OWNER_ENCL)
+ return sgx_oom_encl(victim->encl);
+
+ /* Will never happen unless we add more owner types in the future */
+ WARN_ON_ONCE(1);
+ return false;
+}
+
static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
unsigned long index,
struct sgx_epc_section *section)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index e210af77f0cf..3818be5a8bd3 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -176,6 +176,7 @@ void sgx_reclaim_direct(void);
void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
int sgx_drop_epc_page(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);

void sgx_ipi_cb(void *info);

--
2.25.1

2023-09-14 11:27:13

by Haitao Huang

Subject: [PATCH v4 14/18] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU

From: Sean Christopherson <[email protected]>

Move the isolation loop into a helper, sgx_isolate_epc_pages(), in
preparation for the existence of multiple LRUs. Expose the helper to other
SGX code so that it can be called from the EPC cgroup code, e.g., to
isolate pages from a single cgroup's LRU. Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V4:
- No changes other than reordering the patches
---
arch/x86/kernel/cpu/sgx/main.c | 57 +++++++++++++++++++++-------------
arch/x86/kernel/cpu/sgx/sgx.h | 2 ++
2 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e1dde431a400..ce316bd5e5bb 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -283,6 +283,40 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}

+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru: LRU from which to reclaim
+ * @nr_to_scan: Number of pages to scan for reclaim
+ * @dst: Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lru, size_t nr_to_scan,
+ struct list_head *dst)
+{
+ struct sgx_encl_page *encl_page;
+ struct sgx_epc_page *epc_page;
+
+ spin_lock(&lru->lock);
+ for (; nr_to_scan > 0; --nr_to_scan) {
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
+ if (!epc_page)
+ break;
+
+ encl_page = epc_page->encl_page;
+
+ if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+ sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
+ list_move_tail(&epc_page->list, dst);
+ } else {
+ /* The owner is freeing the page, remove it from the
+ * LRU list
+ */
+ sgx_epc_page_reset_state(epc_page);
+ list_del_init(&epc_page->list);
+ }
+ }
+ spin_unlock(&lru->lock);
+}
+
/**
* sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
* @nr_to_scan: Number of EPC pages to scan for reclaim
@@ -310,28 +344,7 @@ size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age)
size_t ret;
size_t i;

- spin_lock(&sgx_global_lru.lock);
- for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
- if (!epc_page)
- break;
-
- list_del_init(&epc_page->list);
- encl_page = epc_page->encl_page;
-
- if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
- sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
- list_move_tail(&epc_page->list, &iso);
- } else {
- /* The owner is freeing the page, remove it from the
- * LRU list
- */
- sgx_epc_page_reset_state(epc_page);
- list_del_init(&epc_page->list);
- }
- }
- spin_unlock(&sgx_global_lru.lock);
+ sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);

if (list_empty(&iso))
return 0;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index aa4ec2c0ce96..7e21192b87a8 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -178,6 +178,8 @@ int sgx_drop_epc_page(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
bool sgx_epc_oom(struct sgx_epc_lru_lists *lrus);
size_t sgx_reclaim_epc_pages(size_t nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru_lists *lrus, size_t nr_to_scan,
+ struct list_head *dst);

void sgx_ipi_cb(void *info);

--
2.25.1

2023-09-15 19:28:59

by Tejun Heo

Subject: Re: [PATCH v4 00/18] Add Cgroup support for SGX EPC memory

On Tue, Sep 12, 2023 at 09:06:17PM -0700, Haitao Huang wrote:
> SGX EPC memory allocations are separate from normal RAM allocations, and
> are managed solely by the SGX subsystem. The existing cgroup memory
> controller cannot be used to limit or account for SGX EPC memory, which is
> a desirable feature in some environments, e.g., support for pod level
> control in a Kubernetes cluster on a VM or bare-metal host [1,2].
>
> This patchset implements support for sgx_epc memory within the misc
> cgroup controller. The user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports per-cgroup current usage and limit-reached events, as well as the
> total system capacity.

Minor nit aside, it looks fine from cgroup side.

Thanks.

--
tejun

2023-09-16 20:26:33

by Haitao Huang

Subject: [PATCH v4 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver

From: Kristen Carlson Accardi <[email protected]>

The SGX driver will need access to the root misc_cg object to do
iterative walks and also to determine whether a charge will go toward
the root cgroup or not.

To manage the SGX EPC memory via the misc controller, the SGX
driver will also need to be able to iterate over the misc cgroup
hierarchy.
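
As a rough illustration, such a walk could be built on plain
css_for_each_descendant_pre() (see the v3 changelog in the cover letter).
This is a sketch only; the function name and loop body are hypothetical:

static void sgx_epc_cg_walk_sketch(void)
{
	struct cgroup_subsys_state *pos;
	struct misc_cg *cg;

	rcu_read_lock();
	css_for_each_descendant_pre(pos, &misc_cg_root()->css) {
		cg = css_misc(pos);
		/* e.g., inspect cg->res[MISC_CG_RES_SGX_EPC].usage or ->priv */
	}
	rcu_read_unlock();
}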

Move parent_misc() into misc_cgroup.h and make it inline so that the
function is available to SGX, rename it to misc_cg_parent(), and update
misc.c to use the new name.

Add per-resource-type private data so that SGX can store additional
per-cgroup data in the misc_cg struct.

Allow SGX EPC memory to be a valid resource type for the misc
controller.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V4:
- Moved this to the second in the series.
---
include/linux/misc_cgroup.h | 29 +++++++++++++++++++++++++++++
kernel/cgroup/misc.c | 25 ++++++++++++-------------
2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e1bcd176c2de..6f8330f435ba 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* SGX EPC memory resource */
+ MISC_CG_RES_SGX_EPC,
#endif
MISC_CG_RES_TYPES
};
@@ -37,6 +41,7 @@ struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+ void *priv;

/* per resource callback ops */
int (*misc_cg_alloc)(struct misc_cg *cg);
@@ -59,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
};

+struct misc_cg *misc_cg_root(void);
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
@@ -78,6 +84,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
}

+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
/*
* get_current_misc_cg() - Find and get the misc cgroup of the current task.
*
@@ -102,6 +122,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
}

#else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+ return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+ return NULL;
+}

static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
{
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index e0092170d0dd..dbd881be773f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+ /* Intel SGX EPC memory bytes */
+ "sgx_epc",
+#endif
};

/* Root misc cgroup */
@@ -40,18 +44,13 @@ static struct misc_cg root_cg;
static u64 misc_res_capacity[MISC_CG_RES_TYPES];

/**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
*/
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
{
- return cgroup ? css_misc(cgroup->css.parent) : NULL;
+ return &root_cg;
}
+EXPORT_SYMBOL_GPL(misc_cg_root);

/**
* valid_type() - Check if @type is valid or not.
@@ -150,7 +149,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!amount)
return 0;

- for (i = cg; i; i = parent_misc(i)) {
+ for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];

new_usage = atomic64_add_return(amount, &res->usage);
@@ -163,12 +162,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
return 0;

err_charge:
- for (j = i; j; j = parent_misc(j)) {
+ for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}

- for (j = cg; j != i; j = parent_misc(j))
+ for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -190,7 +189,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;

- for (i = cg; i; i = parent_misc(i))
+ for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
}
EXPORT_SYMBOL_GPL(misc_cg_uncharge);
--
2.25.1