2023-02-06 07:48:23

by Alistair Popple

Subject: [PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory

Having large amounts of unmovable or unreclaimable memory in a system
can lead to instability by increasing the likelihood of encountering
out-of-memory conditions. Therefore it is desirable to limit the
amount of memory users can lock or pin.

From userspace such limits can be enforced by setting
RLIMIT_MEMLOCK. However there is no standard method that drivers and
other in-kernel users can use to check and enforce this limit.

This has led to a large number of inconsistencies in how limits are
enforced. For example some drivers will use mm->locked_vm while others
will use mm->pinned_vm or user->locked_vm. It is therefore possible to
have up to three times RLIMIT_MEMLOCK pinned.

Limiting pinned memory per task also makes it easy for users to
exceed the limit. For example memory pinned by drivers with
pin_user_pages() tends to remain pinned after fork. To deal with this
and other issues this series introduces a cgroup for tracking and
limiting the number of pages pinned or locked by tasks in the group.

However the existing behaviour with regard to the rlimit needs to be
maintained, so the lesser of the two limits is enforced. Furthermore
having CAP_IPC_LOCK usually bypasses the rlimit, but this bypass is
not allowed for the cgroup limit.
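
The effective check is therefore roughly the following (a sketch only,
not the actual implementation; the names and the way the cgroup limit
is looked up are illustrative):

	/*
	 * Illustrative sketch: the rlimit may be bypassed with
	 * CAP_IPC_LOCK, the cgroup limit may not.
	 */
	static bool pin_would_fit(unsigned long new_total_pages,
				  unsigned long rlimit_pages,
				  unsigned long cgroup_limit_pages,
				  bool bypass_rlimit)
	{
		if (!bypass_rlimit && new_total_pages > rlimit_pages)
			return false;

		return new_total_pages <= cgroup_limit_pages;
	}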

The first part of this series converts existing drivers which
open-code the use of locked_mm/pinned_mm over to a common interface
which manages the refcounts of the associated task/mm/user
structs. This ensures accounting of pages is consistent and makes it
easier to add charging of the cgroup.
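
For illustration, a converted driver ends up using the interface
roughly as follows (a sketch based on the helpers added in patch 01;
the surrounding driver structure is hypothetical):

	#include <linux/vm_account.h>

	struct my_driver_buf {
		struct vm_account vm_account;
		unsigned long npages;
	};

	static int my_driver_pin(struct my_driver_buf *buf, unsigned long npages)
	{
		/* Takes references on the task/mm/user used for accounting */
		vm_account_init_current(&buf->vm_account);

		/* Charge the pages, enforcing the rlimit (and later the cgroup) */
		if (vm_account_pinned(&buf->vm_account, npages)) {
			vm_account_release(&buf->vm_account);
			return -ENOMEM;
		}

		buf->npages = npages;
		return 0;
	}

	static void my_driver_unpin(struct my_driver_buf *buf)
	{
		/* Uncharge before dropping the accounting references */
		vm_unaccount_pinned(&buf->vm_account, buf->npages);
		vm_account_release(&buf->vm_account);
	}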

The second part of the series adds the cgroup controller and converts
core mm code such as mlock over to charging the cgroup before finally
introducing some selftests.

Rather than adding to an existing cgroup controller such as memcg we
introduce a new controller. This is primarily because we wish to limit
the total number of pages tasks within a cgroup may pin/lock.

As I don't have access to systems with all the various devices I
haven't been able to test all driver changes. Any help there would be
appreciated.

Note that this series is based on v6.2-rc5 and
https://lore.kernel.org/linux-rdma/[email protected]/
which makes updating the siw driver easier (thanks Bernard).

Changes from initial RFC:

- Fixes to some driver error handling.

- Pages charged with vm_account will always increment mm->pinned_vm
and enforce the limit against user->locked_vm or mm->pinned_vm
depending on initialisation flags.

- Moved vm_account prototypes and struct definitions into a separate header.

- Minor updates to commit messages and kernel docs (thanks to Jason,
Christoph, Yosry and T.J.).

Outstanding issues:

- David H pointed out that the vm_account naming is potentially
confusing and I agree. However I have yet to come up with something
better so will rename this in a subsequent version of this series
(suggestions welcome).

- Jason G raised some issues with adding the accounting struct to
struct sock which are unresolved.

Alistair Popple (19):
mm: Introduce vm_account
drivers/vhost: Convert to use vm_account
drivers/vdpa: Convert vdpa to use the new vm_account structure
infiniband/umem: Convert to use vm_account
RDMA/siw: Convert to use vm_account
RDMA/usnic: convert to use vm_account
vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
vfio/spapr_tce: Convert accounting to pinned_vm
io_uring: convert to use vm_account
net: skb: Switch to using vm_account
xdp: convert to use vm_account
kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
fpga: dfl: afu: convert to use vm_account
mm: Introduce a cgroup for pinned memory
mm/util: Extend vm_account to charge pages against the pin cgroup
mm/util: Refactor account_locked_vm
mm: Convert mmap and mlock to use account_locked_vm
mm/mmap: Charge locked memory to pins cgroup
selftests/vm: Add pins-cgroup selftest for mlock/mmap

MAINTAINERS | 8 +-
arch/powerpc/kvm/book3s_64_vio.c | 10 +-
arch/powerpc/mm/book3s64/iommu_api.c | 30 +--
drivers/fpga/dfl-afu-dma-region.c | 11 +-
drivers/fpga/dfl-afu.h | 2 +-
drivers/infiniband/core/umem.c | 16 +-
drivers/infiniband/core/umem_odp.c | 6 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 14 +-
drivers/infiniband/hw/usnic/usnic_uiom.h | 2 +-
drivers/infiniband/sw/siw/siw.h | 3 +-
drivers/infiniband/sw/siw/siw_mem.c | 21 +--
drivers/infiniband/sw/siw/siw_verbs.c | 15 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 21 +--
drivers/vfio/vfio_iommu_spapr_tce.c | 16 +-
drivers/vfio/vfio_iommu_type1.c | 60 +----
drivers/vhost/vdpa.c | 17 +-
drivers/vhost/vhost.c | 2 +-
drivers/vhost/vhost.h | 2 +-
include/linux/cgroup.h | 20 ++-
include/linux/cgroup_subsys.h | 4 +-
include/linux/io_uring_types.h | 4 +-
include/linux/kvm_host.h | 2 +-
include/linux/mm.h | 5 +-
include/linux/skbuff.h | 7 +-
include/linux/vm_account.h | 57 +++++-
include/net/sock.h | 3 +-
include/net/xdp_sock.h | 3 +-
include/rdma/ib_umem.h | 2 +-
io_uring/io_uring.c | 20 +--
io_uring/notif.c | 4 +-
io_uring/notif.h | 10 +-
io_uring/rsrc.c | 38 +---
io_uring/rsrc.h | 9 +-
mm/Kconfig | 11 +-
mm/Makefile | 1 +-
mm/internal.h | 2 +-
mm/mlock.c | 76 +------
mm/mmap.c | 76 +++----
mm/mremap.c | 54 +++--
mm/pins_cgroup.c | 273 ++++++++++++++++++++++++-
mm/secretmem.c | 6 +-
mm/util.c | 234 +++++++++++++++++++--
net/core/skbuff.c | 47 +---
net/rds/message.c | 10 +-
net/xdp/xdp_umem.c | 38 +--
tools/testing/selftests/vm/Makefile | 1 +-
tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
virt/kvm/kvm_main.c | 3 +-
48 files changed, 1143 insertions(+), 404 deletions(-)
create mode 100644 include/linux/vm_account.h
create mode 100644 mm/pins_cgroup.c
create mode 100644 tools/testing/selftests/vm/pins-cgroup.c

base-commit: f0076df3552b965d8107318bd9d9e678530f1687
--
git-series 0.9.1


2023-02-06 07:48:34

by Alistair Popple

Subject: [PATCH 01/19] mm: Introduce vm_account

Kernel drivers that pin pages should account them against
user->locked_vm and/or mm->pinned_vm, and fail the pinning if
RLIMIT_MEMLOCK would be exceeded and CAP_IPC_LOCK isn't held.

Currently drivers open-code this accounting and use various methods to
update the atomic variables and check against the limits, leading to
bugs and inconsistencies. To fix this, introduce a standard interface
for charging pinned and locked memory. As this involves taking
references on kernel objects such as mm_struct or user_struct, we
introduce a new vm_account struct to hold these references. Several
helper functions are then introduced to grab references and check
limits.

As the way these limits are charged and enforced is visible to
userspace we need to be careful not to break existing applications by
charging to different counters. As a result the vm_account functions
support accounting to different counters as required.

A future change will extend this to also account against a cgroup for
pinned pages.
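
For example, a driver that needs the historical user->locked_vm
behaviour would charge roughly as follows (illustrative sketch only;
the flag semantics are documented in the new header below):

	/*
	 * Charge npages against user->locked_vm, letting CAP_IPC_LOCK
	 * bypass the rlimit (but not any cgroup limit added later).
	 */
	static int example_charge(struct vm_account *vm_account,
				  unsigned long npages)
	{
		vm_account_init(vm_account, current, current_user(),
				VM_ACCOUNT_USER |
				(capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));

		if (vm_account_pinned(vm_account, npages)) {
			vm_account_release(vm_account);
			return -ENOMEM;
		}

		return 0;
	}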

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/vm_account.h | 56 +++++++++++++++++-
mm/util.c | 127 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 183 insertions(+)
create mode 100644 include/linux/vm_account.h

diff --git a/include/linux/vm_account.h b/include/linux/vm_account.h
new file mode 100644
index 0000000..b4b2e90
--- /dev/null
+++ b/include/linux/vm_account.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VM_ACCOUNT_H
+#define _LINUX_VM_ACCOUNT_H
+
+/**
+ * enum vm_account_flags - Determine how pinned/locked memory is accounted.
+ * @VM_ACCOUNT_TASK: Account pinned memory to mm->pinned_vm.
+ * @VM_ACCOUNT_BYPASS: Don't enforce rlimit on any charges.
+ * @VM_ACCOUNT_USER: Account locked memory to user->locked_vm.
+ *
+ * Determines which statistic pinned/locked memory is accounted
+ * against. All limits will be enforced against RLIMIT_MEMLOCK and the
+ * pins cgroup if CONFIG_CGROUP_PINS is enabled.
+ *
+ * New drivers should use VM_ACCOUNT_USER. VM_ACCOUNT_TASK is used by
+ * pre-existing drivers to maintain existing accounting against
+ * mm->pinned_vm rather than user->locked_vm.
+ *
+ * VM_ACCOUNT_BYPASS may also be specified to bypass rlimit
+ * checks. Typically this is used to cache CAP_IPC_LOCK from when a
+ * driver is first initialised. Note that this does not bypass cgroup
+ * limit checks.
+ */
+enum vm_account_flags {
+ VM_ACCOUNT_TASK = 0,
+ VM_ACCOUNT_BYPASS = 1,
+ VM_ACCOUNT_USER = 2,
+};
+
+struct vm_account {
+ struct task_struct *task;
+ struct mm_struct *mm;
+ struct user_struct *user;
+ enum vm_account_flags flags;
+};
+
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+ struct user_struct *user, enum vm_account_flags flags);
+
+/**
+ * vm_account_init_current - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ *
+ * Helper to initialise a vm_account for the common case of charging
+ * with VM_ACCOUNT_TASK against current.
+ */
+static inline void vm_account_init_current(struct vm_account *vm_account)
+{
+ vm_account_init(vm_account, current, NULL, VM_ACCOUNT_TASK);
+}
+
+void vm_account_release(struct vm_account *vm_account);
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages);
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages);
+
+#endif /* _LINUX_VM_ACCOUNT_H */
diff --git a/mm/util.c b/mm/util.c
index b56c92f..d8c19f8 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -23,6 +23,7 @@
#include <linux/processor.h>
#include <linux/sizes.h>
#include <linux/compat.h>
+#include <linux/vm_account.h>

#include <linux/uaccess.h>

@@ -431,6 +432,132 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
#endif

/**
+ * vm_account_init - Initialise a new struct vm_account.
+ * @vm_account: pointer to uninitialised vm_account.
+ * @task: task to charge against.
+ * @user: user to charge against. Must be non-NULL for VM_ACCOUNT_USER.
+ * @flags: flags to use when charging to vm_account.
+ *
+ * Initialise a new uninitialised struct vm_account. Takes references
+ * on the task/mm/user/cgroup as required although callers must ensure
+ * any references passed in remain valid for the duration of this
+ * call.
+ */
+void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
+ struct user_struct *user, enum vm_account_flags flags)
+{
+ vm_account->task = get_task_struct(task);
+
+ if (flags & VM_ACCOUNT_USER)
+ vm_account->user = get_uid(user);
+
+ mmgrab(task->mm);
+ vm_account->mm = task->mm;
+ vm_account->flags = flags;
+}
+EXPORT_SYMBOL_GPL(vm_account_init);
+
+/**
+ * vm_account_release - Release references held by a struct vm_account.
+ * @vm_account: pointer to initialised vm_account.
+ *
+ * Drop any object references obtained by vm_account_init(). The
+ * vm_account must not be used after calling this unless reinitialised
+ * with vm_account_init().
+ */
+void vm_account_release(struct vm_account *vm_account)
+{
+ put_task_struct(vm_account->task);
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ free_uid(vm_account->user);
+
+ mmdrop(vm_account->mm);
+}
+EXPORT_SYMBOL_GPL(vm_account_release);
+
+/*
+ * Charge pages with an atomic compare and swap. Returns -ENOMEM on
+ * failure, 1 on success and 0 for retry.
+ */
+static int vm_account_cmpxchg(struct vm_account *vm_account,
+ unsigned long npages, unsigned long lock_limit)
+{
+ u64 cur_pages, new_pages;
+
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ cur_pages = atomic_long_read(&vm_account->user->locked_vm);
+ else
+ cur_pages = atomic64_read(&vm_account->mm->pinned_vm);
+
+ new_pages = cur_pages + npages;
+ if (lock_limit != RLIM_INFINITY && new_pages > lock_limit)
+ return -ENOMEM;
+
+ if (vm_account->flags & VM_ACCOUNT_USER) {
+ return atomic_long_cmpxchg(&vm_account->user->locked_vm,
+ cur_pages, new_pages) == cur_pages;
+ } else {
+ return atomic64_cmpxchg(&vm_account->mm->pinned_vm,
+ cur_pages, new_pages) == cur_pages;
+ }
+}
+
+/**
+ * vm_account_pinned - Charge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to charge.
+ *
+ * Return: 0 on success, -ENOMEM if a limit would be exceeded.
+ *
+ * Note: All pages must be explicitly uncharged with
+ * vm_unaccount_pinned() prior to releasing the vm_account with
+ * vm_account_release().
+ */
+int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+ unsigned long lock_limit = RLIM_INFINITY;
+ int ret;
+
+ if (!(vm_account->flags & VM_ACCOUNT_BYPASS) && !capable(CAP_IPC_LOCK))
+ lock_limit = task_rlimit(vm_account->task,
+ RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+ while (true) {
+ ret = vm_account_cmpxchg(vm_account, npages, lock_limit);
+ if (ret > 0)
+ break;
+ else if (ret < 0)
+ return ret;
+ }
+
+ /*
+ * Always add pinned pages to mm->pinned_vm even when we're
+ * not enforcing the limit against that.
+ */
+ if (vm_account->flags & VM_ACCOUNT_USER)
+ atomic64_add(npages, &vm_account->mm->pinned_vm);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vm_account_pinned);
+
+/**
+ * vm_unaccount_pinned - Uncharge pinned or locked memory to the vm_account.
+ * @vm_account: pointer to an initialised vm_account.
+ * @npages: number of pages to uncharge.
+ */
+void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
+{
+ if (vm_account->flags & VM_ACCOUNT_USER) {
+ atomic_long_sub(npages, &vm_account->user->locked_vm);
+ atomic64_sub(npages, &vm_account->mm->pinned_vm);
+ } else {
+ atomic64_sub(npages, &vm_account->mm->pinned_vm);
+ }
+}
+EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
+
+/**
* __account_locked_vm - account locked pages to an mm's locked_vm
* @mm: mm to account against
* @pages: number of pages to account
--
git-series 0.9.1

2023-02-06 07:49:09

by Alistair Popple

Subject: [PATCH 02/19] drivers/vhost: Convert to use vm_account

Convert vhost to use the new vm_account structure and associated
vm_account_pinned() functions. This means vhost will now enforce
RLIMIT_MEMLOCK and fail the mapping request when a user without
CAP_IPC_LOCK exceeds the limit.

Signed-off-by: Alistair Popple <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
drivers/vhost/vdpa.c | 17 ++++++++++-------
drivers/vhost/vhost.c | 2 ++
drivers/vhost/vhost.h | 2 ++
3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ec32f78..d970fcc 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -716,7 +716,7 @@ static void vhost_vdpa_pa_unmap(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
set_page_dirty_lock(page);
unpin_user_page(page);
}
- atomic64_sub(PFN_DOWN(map->size), &dev->mm->pinned_vm);
+ vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(map->size));
vhost_vdpa_general_unmap(v, map, asid);
vhost_iotlb_map_free(iotlb, map);
}
@@ -780,10 +780,14 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
u32 asid = iotlb_to_asid(iotlb);
int r = 0;

+ if (!vdpa->use_va)
+ if (vm_account_pinned(&dev->vm_account, PFN_DOWN(size)))
+ return -ENOMEM;
+
r = vhost_iotlb_add_range_ctx(iotlb, iova, iova + size - 1,
pa, perm, opaque);
if (r)
- return r;
+ goto out_unaccount;

if (ops->dma_map) {
r = ops->dma_map(vdpa, asid, iova, size, pa, perm, opaque);
@@ -794,15 +798,14 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = iommu_map(v->domain, iova, pa, size,
perm_to_iommu_flags(perm));
}
- if (r) {
+ if (r)
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
- return r;
- }

- if (!vdpa->use_va)
- atomic64_add(PFN_DOWN(size), &dev->mm->pinned_vm);
+out_unaccount:
+ if (r && !vdpa->use_va)
+ vm_unaccount_pinned(&dev->vm_account, PFN_DOWN(size));

- return 0;
+ return r;
}

static void vhost_vdpa_unmap(struct vhost_vdpa *v,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bf..5645c26 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -556,6 +556,7 @@ static void vhost_attach_mm(struct vhost_dev *dev)
dev->mm = current->mm;
mmgrab(dev->mm);
}
+ vm_account_init_current(&dev->vm_account);
}

static void vhost_detach_mm(struct vhost_dev *dev)
@@ -569,6 +570,7 @@ static void vhost_detach_mm(struct vhost_dev *dev)
mmdrop(dev->mm);

dev->mm = NULL;
+ vm_account_release(&dev->vm_account);
}

/* Caller should have device mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d910910..b2434dd 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -14,6 +14,7 @@
#include <linux/atomic.h>
#include <linux/vhost_iotlb.h>
#include <linux/irqbypass.h>
+#include <linux/vm_account.h>

struct vhost_work;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -144,6 +145,7 @@ struct vhost_msg_node {
struct vhost_dev {
struct mm_struct *mm;
struct mutex mutex;
+ struct vm_account vm_account;
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
--
git-series 0.9.1

2023-02-06 07:49:19

by Alistair Popple

Subject: [PATCH 03/19] drivers/vdpa: Convert vdpa to use the new vm_account structure

Convert vdpa to use the new vm_account structure and associated
vm_account_pinned() functions. This also fixes a bug where
vduse_dev_reg_umem() checked and updated mm->pinned_vm non-atomically,
allowing racing callers to exceed the rlimit.

Signed-off-by: Alistair Popple <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/vdpa/vdpa_user/vduse_dev.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0c3b486..bc300e2 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -23,6 +23,7 @@
#include <linux/nospec.h>
#include <linux/vmalloc.h>
#include <linux/sched/mm.h>
+#include <linux/vm_account.h>
#include <uapi/linux/vduse.h>
#include <uapi/linux/vdpa.h>
#include <uapi/linux/virtio_config.h>
@@ -70,7 +71,7 @@ struct vduse_umem {
unsigned long iova;
unsigned long npages;
struct page **pages;
- struct mm_struct *mm;
+ struct vm_account vm_account;
};

struct vduse_dev {
@@ -950,8 +951,7 @@ static int vduse_dev_dereg_umem(struct vduse_dev *dev,
vduse_domain_remove_user_bounce_pages(dev->domain);
unpin_user_pages_dirty_lock(dev->umem->pages,
dev->umem->npages, true);
- atomic64_sub(dev->umem->npages, &dev->umem->mm->pinned_vm);
- mmdrop(dev->umem->mm);
+ vm_unaccount_pinned(&dev->umem->vm_account, dev->umem->npages);
vfree(dev->umem->pages);
kfree(dev->umem);
dev->umem = NULL;
@@ -967,7 +967,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
struct page **page_list = NULL;
struct vduse_umem *umem = NULL;
long pinned = 0;
- unsigned long npages, lock_limit;
+ unsigned long npages;
int ret;

if (!dev->domain->bounce_map ||
@@ -990,8 +990,8 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,

mmap_read_lock(current->mm);

- lock_limit = PFN_DOWN(rlimit(RLIMIT_MEMLOCK));
- if (npages + atomic64_read(&current->mm->pinned_vm) > lock_limit)
+ vm_account_init_current(&umem->vm_account);
+ if (vm_account_pinned(&umem->vm_account, npages))
goto out;

pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
@@ -1006,22 +1006,21 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
if (ret)
goto out;

- atomic64_add(npages, &current->mm->pinned_vm);
-
umem->pages = page_list;
umem->npages = pinned;
umem->iova = iova;
- umem->mm = current->mm;
- mmgrab(current->mm);

dev->umem = umem;
out:
- if (ret && pinned > 0)
+ if (ret && pinned > 0) {
unpin_user_pages(page_list, pinned);
+ vm_unaccount_pinned(&umem->vm_account, npages);
+ }

mmap_read_unlock(current->mm);
unlock:
if (ret) {
+ vm_account_release(&umem->vm_account);
vfree(page_list);
kfree(umem);
}
--
git-series 0.9.1

2023-02-06 07:49:37

by Alistair Popple

Subject: [PATCH 04/19] infiniband/umem: Convert to use vm_account

Converts the infiniband core umem code to use the vm_account structure
so that pinned pages can be charged to the correct cgroup with
vm_account_pinned().

Signed-off-by: Alistair Popple <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/infiniband/core/umem.c | 16 ++++++----------
drivers/infiniband/core/umem_odp.c | 6 ++++++
include/rdma/ib_umem.h | 2 ++
3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 755a9c5..479b7f0 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -149,8 +149,6 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
{
struct ib_umem *umem;
struct page **page_list;
- unsigned long lock_limit;
- unsigned long new_pinned;
unsigned long cur_base;
unsigned long dma_attr = 0;
struct mm_struct *mm;
@@ -186,6 +184,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
umem->writable = ib_access_writable(access);
umem->owning_mm = mm = current->mm;
mmgrab(mm);
+ vm_account_init_current(&umem->vm_account);

page_list = (struct page **) __get_free_page(GFP_KERNEL);
if (!page_list) {
@@ -199,11 +198,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
goto out;
}

- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
- if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
- atomic64_sub(npages, &mm->pinned_vm);
+ if (vm_account_pinned(&umem->vm_account, npages)) {
ret = -ENOMEM;
goto out;
}
@@ -248,12 +243,13 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,

umem_release:
__ib_umem_release(device, umem, 0);
- atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
+ vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
out:
free_page((unsigned long) page_list);
umem_kfree:
if (ret) {
mmdrop(umem->owning_mm);
+ vm_account_release(&umem->vm_account);
kfree(umem);
}
return ret ? ERR_PTR(ret) : umem;
@@ -275,8 +271,8 @@ void ib_umem_release(struct ib_umem *umem)

__ib_umem_release(umem->ibdev, umem, 1);

- atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
- mmdrop(umem->owning_mm);
+ vm_unaccount_pinned(&umem->vm_account, ib_umem_num_pages(umem));
+ vm_account_release(&umem->vm_account);
kfree(umem);
}
EXPORT_SYMBOL(ib_umem_release);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d..4fbca3e 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -130,6 +130,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
umem->ibdev = device;
umem->writable = ib_access_writable(access);
umem->owning_mm = current->mm;
+ vm_account_init_current(&umem->vm_account);
umem_odp->is_implicit_odp = 1;
umem_odp->page_shift = PAGE_SHIFT;

@@ -137,6 +138,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
ret = ib_init_umem_odp(umem_odp, NULL);
if (ret) {
put_pid(umem_odp->tgid);
+ vm_account_release(&umem->vm_account);
kfree(umem_odp);
return ERR_PTR(ret);
}
@@ -179,6 +181,7 @@ ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr,
umem->address = addr;
umem->writable = root->umem.writable;
umem->owning_mm = root->umem.owning_mm;
+ umem->vm_account = root->umem.vm_account;
odp_data->page_shift = PAGE_SHIFT;
odp_data->notifier.ops = ops;

@@ -239,6 +242,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
umem_odp->umem.address = addr;
umem_odp->umem.writable = ib_access_writable(access);
umem_odp->umem.owning_mm = current->mm;
+ vm_account_init_current(&umem_odp->umem.vm_account);
umem_odp->notifier.ops = ops;

umem_odp->page_shift = PAGE_SHIFT;
@@ -255,6 +259,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,

err_put_pid:
put_pid(umem_odp->tgid);
+ vm_account_release(&umem_odp->umem.vm_account);
kfree(umem_odp);
return ERR_PTR(ret);
}
@@ -278,6 +283,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
kvfree(umem_odp->pfn_list);
}
put_pid(umem_odp->tgid);
+ vm_account_release(&umem_odp->umem.vm_account);
kfree(umem_odp);
}
EXPORT_SYMBOL(ib_umem_odp_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 92a673c..065cd2c 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -10,6 +10,7 @@
#include <linux/list.h>
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
+#include <linux/vm_account.h>
#include <rdma/ib_verbs.h>

struct ib_ucontext;
@@ -19,6 +20,7 @@ struct dma_buf_attach_ops;
struct ib_umem {
struct ib_device *ibdev;
struct mm_struct *owning_mm;
+ struct vm_account vm_account;
u64 iova;
size_t length;
unsigned long address;
--
git-series 0.9.1

2023-02-06 07:49:50

by Alistair Popple

Subject: [PATCH 06/19] RDMA/usnic: convert to use vm_account

Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Christian Benvenuti <[email protected]>
Cc: Nelson Escobar <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/infiniband/hw/usnic/usnic_uiom.c | 14 ++++++--------
drivers/infiniband/hw/usnic/usnic_uiom.h | 2 ++
2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index c301b3b..8952ee5 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -89,8 +89,6 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
struct page **page_list;
struct scatterlist *sg;
struct usnic_uiom_chunk *chunk;
- unsigned long locked;
- unsigned long lock_limit;
unsigned long cur_base;
unsigned long npages;
int ret;
@@ -123,10 +121,9 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
uiomr->owning_mm = mm = current->mm;
mmap_read_lock(mm);

- locked = atomic64_add_return(npages, &current->mm->pinned_vm);
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+ vm_account_init_current(&uiomr->vm_account);
+ if (vm_account_pinned(&uiomr->vm_account, npages)) {
+ npages = 0;
ret = -ENOMEM;
goto out;
}
@@ -178,7 +175,8 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
out:
if (ret < 0) {
usnic_uiom_put_pages(chunk_list, 0);
- atomic64_sub(npages, &current->mm->pinned_vm);
+ vm_unaccount_pinned(&uiomr->vm_account, npages);
+ vm_account_release(&uiomr->vm_account);
} else
mmgrab(uiomr->owning_mm);

@@ -430,7 +428,7 @@ void usnic_uiom_reg_release(struct usnic_uiom_reg *uiomr)
{
__usnic_uiom_reg_release(uiomr->pd, uiomr, 1);

- atomic64_sub(usnic_uiom_num_pages(uiomr), &uiomr->owning_mm->pinned_vm);
+ vm_unaccount_pinned(&uiomr->vm_account, usnic_uiom_num_pages(uiomr));
__usnic_uiom_release_tail(uiomr);
}

diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.h b/drivers/infiniband/hw/usnic/usnic_uiom.h
index 5a9acf9..5238d06 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.h
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.h
@@ -36,6 +36,7 @@

#include <linux/list.h>
#include <linux/scatterlist.h>
+#include <linux/vm_account.h>

#include "usnic_uiom_interval_tree.h"

@@ -72,6 +73,7 @@ struct usnic_uiom_reg {
struct list_head chunk_list;
struct work_struct work;
struct mm_struct *owning_mm;
+ struct vm_account vm_account;
};

struct usnic_uiom_chunk {
--
git-series 0.9.1

2023-02-06 07:49:54

by Alistair Popple

Subject: [PATCH 05/19] RDMA/siw: Convert to use vm_account

Convert to using a vm_account structure to account pinned memory to
both the mm and the pins cgroup.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Bernard Metzler <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/infiniband/sw/siw/siw.h | 3 ++-
drivers/infiniband/sw/siw/siw_mem.c | 21 +++++++--------------
drivers/infiniband/sw/siw/siw_verbs.c | 15 ---------------
3 files changed, 9 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
index 2f3a9cd..6d4aabd 100644
--- a/drivers/infiniband/sw/siw/siw.h
+++ b/drivers/infiniband/sw/siw/siw.h
@@ -13,6 +13,7 @@
#include <crypto/hash.h>
#include <linux/crc32.h>
#include <linux/crc32c.h>
+#include <linux/vm_account.h>

#include <rdma/siw-abi.h>
#include "iwarp.h"
@@ -124,7 +125,7 @@ struct siw_umem {
int num_pages;
bool writable;
u64 fp_addr; /* First page base address */
- struct mm_struct *owning_mm;
+ struct vm_account vm_account;
};

struct siw_pble {
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index f51ab2c..be90121 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -68,7 +68,6 @@ static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,

void siw_umem_release(struct siw_umem *umem, bool dirty)
{
- struct mm_struct *mm_s = umem->owning_mm;
int i, num_pages = umem->num_pages;

for (i = 0; num_pages; i++) {
@@ -79,9 +78,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
kfree(umem->page_chunk[i].plist);
num_pages -= to_free;
}
- atomic64_sub(umem->num_pages, &mm_s->pinned_vm);
+ vm_unaccount_pinned(&umem->vm_account, umem->num_pages);
+ vm_account_release(&umem->vm_account);

- mmdrop(mm_s);
kfree(umem->page_chunk);
kfree(umem);
}
@@ -365,9 +364,7 @@ struct siw_pbl *siw_pbl_alloc(u32 num_buf)
struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
{
struct siw_umem *umem;
- struct mm_struct *mm_s;
u64 first_page_va;
- unsigned long mlock_limit;
unsigned int foll_flags = FOLL_LONGTERM;
int num_pages, num_chunks, i, rv = 0;

@@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
if (!umem)
return ERR_PTR(-ENOMEM);

- mm_s = current->mm;
- umem->owning_mm = mm_s;
umem->writable = writable;

- mmgrab(mm_s);
+ vm_account_init_current(&umem->vm_account);

if (writable)
foll_flags |= FOLL_WRITE;

- mmap_read_lock(mm_s);
+ mmap_read_lock(current->mm);

- mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- if (atomic64_add_return(num_pages, &mm_s->pinned_vm) > mlock_limit) {
+ if (vm_account_pinned(&umem->vm_account, num_pages)) {
rv = -ENOMEM;
goto out_sem_up;
}
@@ -434,14 +427,14 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
}
}
out_sem_up:
- mmap_read_unlock(mm_s);
+ mmap_read_unlock(current->mm);

if (rv > 0)
return umem;

/* Adjust accounting for pages not pinned */
if (num_pages)
- atomic64_sub(num_pages, &mm_s->pinned_vm);
+ vm_unaccount_pinned(&umem->vm_account, num_pages);

siw_umem_release(umem, false);

diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index 906fde1..8fab009 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -1321,8 +1321,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
struct siw_umem *umem = NULL;
struct siw_ureq_reg_mr ureq;
struct siw_device *sdev = to_siw_dev(pd->device);
-
- unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
int rv;

siw_dbg_pd(pd, "start: 0x%pK, va: 0x%pK, len: %llu\n",
@@ -1338,19 +1336,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
rv = -EINVAL;
goto err_out;
}
- if (mem_limit != RLIM_INFINITY) {
- unsigned long num_pages =
- (PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
- mem_limit >>= PAGE_SHIFT;
-
- if (num_pages > mem_limit - current->mm->locked_vm) {
- siw_dbg_pd(pd, "pages req %lu, max %lu, lock %lu\n",
- num_pages, mem_limit,
- current->mm->locked_vm);
- rv = -ENOMEM;
- goto err_out;
- }
- }
umem = siw_umem_get(start, len, ib_access_writable(rights));
if (IS_ERR(umem)) {
rv = PTR_ERR(umem);
--
git-series 0.9.1

2023-02-06 07:50:04

by Alistair Popple

Subject: [PATCH 07/19] vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm

This switches the charge to pinned_vm to be consistent with other
drivers that pin pages with FOLL_LONGTERM. It also allows the use of
the vm_account helper struct, which makes a future change to implement
cgroup accounting of pinned pages easier, as that requires a reference
to the cgroup to be maintained.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Cornelia Huck <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/vfio/vfio_iommu_type1.c | 60 +++++++++-------------------------
1 file changed, 17 insertions(+), 43 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe..a3957b8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -38,6 +38,7 @@
#include <linux/workqueue.h>
#include <linux/notifier.h>
#include <linux/irqdomain.h>
+#include <linux/vm_account.h>
#include "vfio.h"

#define DRIVER_VERSION "0.2"
@@ -95,11 +96,11 @@ struct vfio_dma {
size_t size; /* Map size (bytes) */
int prot; /* IOMMU_READ/WRITE */
bool iommu_mapped;
- bool lock_cap; /* capable(CAP_IPC_LOCK) */
bool vaddr_invalid;
struct task_struct *task;
struct rb_root pfn_list; /* Ex-user pinned pfn list */
unsigned long *bitmap;
+ struct vm_account vm_account;
};

struct vfio_batch {
@@ -412,31 +413,6 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn)
return ret;
}

-static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
-{
- struct mm_struct *mm;
- int ret;
-
- if (!npage)
- return 0;
-
- mm = async ? get_task_mm(dma->task) : dma->task->mm;
- if (!mm)
- return -ESRCH; /* process exited */
-
- ret = mmap_write_lock_killable(mm);
- if (!ret) {
- ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
- dma->lock_cap);
- mmap_write_unlock(mm);
- }
-
- if (async)
- mmput(mm);
-
- return ret;
-}
-
/*
* Some mappings aren't backed by a struct page, for example an mmap'd
* MMIO range for our own or another device. These use a different
@@ -715,16 +691,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
* externally pinned pages are already counted against
* the user.
*/
- if (!rsvd && !vfio_find_vpfn(dma, iova)) {
- if (!dma->lock_cap &&
- mm->locked_vm + lock_acct + 1 > limit) {
- pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
- __func__, limit << PAGE_SHIFT);
- ret = -ENOMEM;
- goto unpin_out;
- }
+ if (!rsvd && !vfio_find_vpfn(dma, iova))
lock_acct++;
- }

pinned++;
npage--;
@@ -744,7 +712,11 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
}

out:
- ret = vfio_lock_acct(dma, lock_acct, false);
+ if (vm_account_pinned(&dma->vm_account, lock_acct)) {
+ ret = -ENOMEM;
+ lock_acct = 0;
+ pr_warn("%s: RLIMIT_MEMLOCK exceeded\n", __func__);
+ }

unpin_out:
if (batch->size == 1 && !batch->offset) {
@@ -759,6 +731,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
put_pfn(pfn, dma->prot);
}
vfio_batch_unpin(batch, dma);
+ vm_unaccount_pinned(&dma->vm_account, lock_acct);

return ret;
}
@@ -782,7 +755,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
}

if (do_accounting)
- vfio_lock_acct(dma, locked - unlocked, true);
+ vm_unaccount_pinned(&dma->vm_account, locked - unlocked);

return unlocked;
}
@@ -805,7 +778,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
ret = 0;

if (do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
- ret = vfio_lock_acct(dma, 1, true);
+ ret = vm_account_pinned(&dma->vm_account, 1);
if (ret) {
put_pfn(*pfn_base, dma->prot);
if (ret == -ENOMEM)
@@ -833,7 +806,7 @@ static int vfio_unpin_page_external(struct vfio_dma *dma, dma_addr_t iova,
unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);

if (do_accounting)
- vfio_lock_acct(dma, -unlocked, true);
+ vm_unaccount_pinned(&dma->vm_account, unlocked);

return unlocked;
}
@@ -921,7 +894,7 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
ret = vfio_add_to_pfn_list(dma, iova, phys_pfn);
if (ret) {
if (put_pfn(phys_pfn, dma->prot) && do_accounting)
- vfio_lock_acct(dma, -1, true);
+ vm_unaccount_pinned(&dma->vm_account, 1);
goto pin_unwind;
}

@@ -1162,7 +1135,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
}

if (do_accounting) {
- vfio_lock_acct(dma, -unlocked, true);
+ vm_unaccount_pinned(&dma->vm_account, unlocked);
return 0;
}
return unlocked;
@@ -1674,7 +1647,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
*/
get_task_struct(current->group_leader);
dma->task = current->group_leader;
- dma->lock_cap = capable(CAP_IPC_LOCK);
+ vm_account_init(&dma->vm_account, dma->task, NULL, VM_ACCOUNT_TASK |
+ (capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));

dma->pfn_list = RB_ROOT;

@@ -2398,7 +2372,7 @@ static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
if (!is_invalid_reserved_pfn(vpfn->pfn))
locked++;
}
- vfio_lock_acct(dma, locked - unlocked, true);
+ vm_unaccount_pinned(&dma->vm_account, locked - unlocked);
}
}

--
git-series 0.9.1

2023-02-06 07:50:11

by Alistair Popple

Subject: [PATCH 11/19] xdp: convert to use vm_account

Switch to using the new vm_account struct to charge pinned pages and
enforce the rlimit. This will allow a future change to also charge a
cgroup for limiting the number of pinned pages.

Signed-off-by: Alistair Popple <[email protected]>
Cc: "Björn Töpel" <[email protected]>
Cc: Magnus Karlsson <[email protected]>
Cc: Maciej Fijalkowski <[email protected]>
Cc: Jonathan Lemon <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/net/xdp_sock.h | 3 ++-
net/xdp/xdp_umem.c | 38 +++++++++++++-------------------------
2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 3057e1a..9a21054 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -12,6 +12,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
+#include <linux/vm_account.h>
#include <net/sock.h>

struct net_device;
@@ -25,7 +26,7 @@ struct xdp_umem {
u32 chunk_size;
u32 chunks;
u32 npgs;
- struct user_struct *user;
+ struct vm_account vm_account;
refcount_t users;
u8 flags;
bool zc;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 4681e8e..4b5fb2f 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -29,12 +29,10 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
umem->pgs = NULL;
}

-static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem, u32 npgs)
{
- if (umem->user) {
- atomic_long_sub(umem->npgs, &umem->user->locked_vm);
- free_uid(umem->user);
- }
+ vm_unaccount_pinned(&umem->vm_account, npgs);
+ vm_account_release(&umem->vm_account);
}

static void xdp_umem_addr_unmap(struct xdp_umem *umem)
@@ -54,13 +52,15 @@ static int xdp_umem_addr_map(struct xdp_umem *umem, struct page **pages,

static void xdp_umem_release(struct xdp_umem *umem)
{
+ u32 npgs = umem->npgs;
+
umem->zc = false;
ida_free(&umem_ida, umem->id);

xdp_umem_addr_unmap(umem);
xdp_umem_unpin_pages(umem);

- xdp_umem_unaccount_pages(umem);
+ xdp_umem_unaccount_pages(umem, npgs);
kfree(umem);
}

@@ -127,24 +127,13 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)

static int xdp_umem_account_pages(struct xdp_umem *umem)
{
- unsigned long lock_limit, new_npgs, old_npgs;
-
- if (capable(CAP_IPC_LOCK))
- return 0;
-
- lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- umem->user = get_uid(current_user());
+ vm_account_init(&umem->vm_account, current,
+ current_user(), VM_ACCOUNT_USER);
+ if (vm_account_pinned(&umem->vm_account, umem->npgs)) {
+ vm_account_release(&umem->vm_account);
+ return -ENOBUFS;
+ }

- do {
- old_npgs = atomic_long_read(&umem->user->locked_vm);
- new_npgs = old_npgs + umem->npgs;
- if (new_npgs > lock_limit) {
- free_uid(umem->user);
- umem->user = NULL;
- return -ENOBUFS;
- }
- } while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
- new_npgs) != old_npgs);
return 0;
}

@@ -204,7 +193,6 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
umem->chunks = chunks;
umem->npgs = (u32)npgs;
umem->pgs = NULL;
- umem->user = NULL;
umem->flags = mr->flags;

INIT_LIST_HEAD(&umem->xsk_dma_list);
@@ -227,7 +215,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
out_unpin:
xdp_umem_unpin_pages(umem);
out_account:
- xdp_umem_unaccount_pages(umem);
+ xdp_umem_unaccount_pages(umem, npgs);
return err;
}

--
git-series 0.9.1

2023-02-06 07:50:34

by Alistair Popple

Subject: [PATCH 08/19] vfio/spapr_tce: Convert accounting to pinned_vm

Convert from accounting pages against locked_vm to accounting them to
pinned_vm. This allows struct vm_account to be used to track the
mm_struct used to charge the pages. A future change also uses this to
track a cgroup for controlling pinned pages.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Cornelia Huck <[email protected]>
Cc: Alexey Kardashevskiy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/powerpc/mm/book3s64/iommu_api.c | 30 ++++++++++++++++++-----------
drivers/vfio/vfio_iommu_spapr_tce.c | 16 ++++++++++-----
2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 7fcfba1..338b111 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -18,6 +18,7 @@
#include <asm/mmu_context.h>
#include <asm/pte-walk.h>
#include <linux/mm_inline.h>
+#include <linux/vm_account.h>

static DEFINE_MUTEX(mem_list_mutex);

@@ -30,6 +31,7 @@ struct mm_iommu_table_group_mem_t {
unsigned long used;
atomic64_t mapped;
unsigned int pageshift;
+ struct vm_account vm_account;
u64 ua; /* userspace address */
u64 entries; /* number of entries in hpas/hpages[] */
/*
@@ -62,20 +64,24 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
unsigned int pageshift;
unsigned long entry, chunk;

- if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
- ret = account_locked_vm(mm, entries, true);
- if (ret)
- return ret;
-
- locked_entries = entries;
- }
-
mem = kzalloc(sizeof(*mem), GFP_KERNEL);
if (!mem) {
ret = -ENOMEM;
goto unlock_exit;
}

+ vm_account_init_current(&mem->vm_account);
+ if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
+ ret = vm_account_pinned(&mem->vm_account, entries);
+ if (ret) {
+ vm_account_release(&mem->vm_account);
+ kfree(mem);
+ return ret;
+ }
+
+ locked_entries = entries;
+ }
+
if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
mem->dev_hpa = dev_hpa;
@@ -175,10 +181,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
unpin_user_pages(mem->hpages, pinned);

vfree(mem->hpas);
- kfree(mem);

unlock_exit:
- account_locked_vm(mm, locked_entries, false);
+ vm_unaccount_pinned(&mem->vm_account, locked_entries);
+ vm_account_release(&mem->vm_account);
+ kfree(mem);

return ret;
}
@@ -229,6 +236,7 @@ static void mm_iommu_do_free(struct mm_iommu_table_group_mem_t *mem)

mm_iommu_unpin(mem);
vfree(mem->hpas);
+ vm_account_release(&mem->vm_account);
kfree(mem);
}

@@ -279,7 +287,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
unlock_exit:
mutex_unlock(&mem_list_mutex);

- account_locked_vm(mm, unlock_entries, false);
+ vm_unaccount_pinned(&mem->vm_account, unlock_entries);

return ret;
}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 60a50ce..454ccc4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -21,6 +21,7 @@
#include <linux/sched/mm.h>
#include <linux/sched/signal.h>
#include <linux/mm.h>
+#include <linux/vm_account.h>
#include "vfio.h"

#include <asm/iommu.h>
@@ -67,6 +68,7 @@ struct tce_container {
bool def_window_pending;
unsigned long locked_pages;
struct mm_struct *mm;
+ struct vm_account vm_account;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
struct list_head prereg_list;
@@ -82,6 +84,7 @@ static long tce_iommu_mm_set(struct tce_container *container)
BUG_ON(!current->mm);
container->mm = current->mm;
mmgrab(container->mm);
+ vm_account_init_current(&container->vm_account);

return 0;
}
@@ -291,7 +294,7 @@ static int tce_iommu_enable(struct tce_container *container)
return ret;

locked = table_group->tce32_size >> PAGE_SHIFT;
- ret = account_locked_vm(container->mm, locked, true);
+ ret = vm_account_pinned(&container->vm_account, locked);
if (ret)
return ret;

@@ -310,7 +313,7 @@ static void tce_iommu_disable(struct tce_container *container)
container->enabled = false;

BUG_ON(!container->mm);
- account_locked_vm(container->mm, container->locked_pages, false);
+ vm_unaccount_pinned(&container->vm_account, container->locked_pages);
}

static void *tce_iommu_open(unsigned long arg)
@@ -372,8 +375,10 @@ static void tce_iommu_release(void *iommu_data)
WARN_ON(tce_iommu_prereg_free(container, tcemem));

tce_iommu_disable(container);
- if (container->mm)
+ if (container->mm) {
mmdrop(container->mm);
+ vm_account_release(&container->vm_account);
+ }
mutex_destroy(&container->lock);

kfree(container);
@@ -619,7 +624,8 @@ static long tce_iommu_create_table(struct tce_container *container,
if (!table_size)
return -EINVAL;

- ret = account_locked_vm(container->mm, table_size >> PAGE_SHIFT, true);
+ ret = vm_account_pinned(&container->vm_account,
+ table_size >> PAGE_SHIFT);
if (ret)
return ret;

@@ -638,7 +644,7 @@ static void tce_iommu_free_table(struct tce_container *container,
unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;

iommu_tce_table_put(tbl);
- account_locked_vm(container->mm, pages, false);
+ vm_unaccount_pinned(&container->vm_account, pages);
}

static long tce_iommu_create_window(struct tce_container *container,
--
git-series 0.9.1

2023-02-06 07:50:40

by Alistair Popple

Subject: [PATCH 12/19] kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()

book3s_64_vio currently accounts for pinned pages with
account_locked_vm() which charges the pages to mm->locked_vm. To make
this consistent with other drivers switch to using
vm_account_pinned().

Signed-off-by: Alistair Popple <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Alexey Kardashevskiy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/powerpc/kvm/book3s_64_vio.c | 10 +++++-----
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 3 +++
3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 95e738e..ecd1deb 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -273,8 +273,8 @@ static int kvm_spapr_tce_release(struct inode *inode, struct file *filp)
}
}

- account_locked_vm(kvm->mm,
- kvmppc_stt_pages(kvmppc_tce_pages(stt->size)), false);
+ vm_unaccount_pinned(&kvm->vm_account,
+ kvmppc_stt_pages(kvmppc_tce_pages(stt->size)));

kvm_put_kvm(stt->kvm);

@@ -301,8 +301,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
(args->offset + args->size > (ULLONG_MAX >> args->page_shift)))
return -EINVAL;

npages = kvmppc_tce_pages(args->size);
- ret = account_locked_vm(mm, kvmppc_stt_pages(npages), true);
+ ret = vm_account_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));
if (ret)
return ret;

@@ -347,7 +347,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,

kfree(stt);
fail_acct:
- account_locked_vm(mm, kvmppc_stt_pages(npages), false);
+ vm_unaccount_pinned(&kvm->vm_account, kvmppc_stt_pages(npages));
return ret;
}

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b24..bd7a7be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -35,6 +35,7 @@
#include <linux/interval_tree.h>
#include <linux/rbtree.h>
#include <linux/xarray.h>
+#include <linux/vm_account.h>
#include <asm/signal.h>

#include <linux/kvm.h>
@@ -717,6 +718,7 @@ struct kvm {
*/
struct mutex slots_arch_lock;
struct mm_struct *mm; /* userspace tied to this vm */
+ struct vm_account vm_account;
unsigned long nr_memslot_pages;
/* The two memslot sets - active and inactive (per address space) */
struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384..770d037 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
+ vm_account_init_current(&kvm->vm_account);
kvm_eventfd_init(kvm);
mutex_init(&kvm->lock);
mutex_init(&kvm->irq_lock);
@@ -1258,6 +1259,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
+ vm_account_release(&kvm->vm_account);
module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
@@ -1327,6 +1329,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
preempt_notifier_dec();
hardware_disable_all();
mmdrop(mm);
+ vm_account_release(&kvm->vm_account);
module_put(kvm_chardev_ops.owner);
}

--
git-series 0.9.1

2023-02-06 07:50:57

by Alistair Popple

Subject: [PATCH 09/19] io_uring: convert to use vm_account

Convert io_uring to use vm_account instead of directly charging pages
against the user/mm. Rather than charging pages against both
user->locked_vm and mm->pinned_vm, the limit will now only be enforced
against user->locked_vm.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/io_uring_types.h | 4 ++--
io_uring/io_uring.c | 20 +++---------------
io_uring/notif.c | 4 ++--
io_uring/notif.h | 10 +++------
io_uring/rsrc.c | 38 +++--------------------------------
io_uring/rsrc.h | 9 +--------
6 files changed, 17 insertions(+), 68 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a..45ac75d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -5,6 +5,7 @@
#include <linux/task_work.h>
#include <linux/bitmap.h>
#include <linux/llist.h>
+#include <linux/vm_account.h>
#include <uapi/linux/io_uring.h>

struct io_wq_work_node {
@@ -343,8 +344,7 @@ struct io_ring_ctx {
struct io_wq_hash *hash_map;

/* Only used for accounting purposes */
- struct user_struct *user;
- struct mm_struct *mm_account;
+ struct vm_account vm_account;

/* ctx exit and cancelation */
struct llist_head fallback_llist;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0a4efad..912da4f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2744,15 +2744,11 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));

- if (ctx->mm_account) {
- mmdrop(ctx->mm_account);
- ctx->mm_account = NULL;
- }
+ vm_account_release(&ctx->vm_account);
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);

percpu_ref_exit(&ctx->refs);
- free_uid(ctx->user);
io_req_caches_free(ctx);
if (ctx->hash_map)
io_wq_put_hash(ctx->hash_map);
@@ -3585,8 +3581,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
ctx->syscall_iopoll = 1;

ctx->compat = in_compat_syscall();
- if (!capable(CAP_IPC_LOCK))
- ctx->user = get_uid(current_user());
+ vm_account_init(&ctx->vm_account, current, current_user(),
+ VM_ACCOUNT_USER |
+ (capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));

/*
* For SQPOLL, we just need a wakeup, always. For !SQPOLL, if
@@ -3619,15 +3616,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
goto err;
}

- /*
- * This is just grabbed for accounting purposes. When a process exits,
- * the mm is exited and dropped before the files, hence we need to hang
- * on to this mm purely for the purposes of being able to unaccount
- * memory (locked/pinned vm). It's not used for anything else.
- */
- mmgrab(current->mm);
- ctx->mm_account = current->mm;
-
ret = io_allocate_scq_urings(ctx, p);
if (ret)
goto err;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index c4bb793..0f589fa 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -17,8 +17,8 @@ static void io_notif_complete_tw_ext(struct io_kiocb *notif, bool *locked)
if (nd->zc_report && (nd->zc_copied || !nd->zc_used))
notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;

- if (nd->account_pages && ctx->user) {
- __io_unaccount_mem(ctx->user, nd->account_pages);
+ if (nd->account_pages) {
+ vm_unaccount_pinned(&ctx->vm_account, nd->account_pages);
nd->account_pages = 0;
}
io_req_task_complete(notif, locked);
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800..e2cb44a 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -43,11 +43,9 @@ static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
unsigned nr_pages = (len >> PAGE_SHIFT) + 2;
int ret;

- if (ctx->user) {
- ret = __io_account_mem(ctx->user, nr_pages);
- if (ret)
- return ret;
- nd->account_pages += nr_pages;
- }
+ ret = __io_account_mem(&ctx->vm_account, nr_pages);
+ if (ret)
+ return ret;
+ nd->account_pages += nr_pages;
return 0;
}
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18de10c..aa44528 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -42,49 +42,19 @@ void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
}
}

-int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages)
{
- unsigned long page_limit, cur_pages, new_pages;
-
- if (!nr_pages)
- return 0;
-
- /* Don't allow more pages than we can safely lock */
- page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
- cur_pages = atomic_long_read(&user->locked_vm);
- do {
- new_pages = cur_pages + nr_pages;
- if (new_pages > page_limit)
- return -ENOMEM;
- } while (!atomic_long_try_cmpxchg(&user->locked_vm,
- &cur_pages, new_pages));
- return 0;
+ return vm_account_pinned(vm_account, nr_pages);
}

static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
{
- if (ctx->user)
- __io_unaccount_mem(ctx->user, nr_pages);
-
- if (ctx->mm_account)
- atomic64_sub(nr_pages, &ctx->mm_account->pinned_vm);
+ vm_unaccount_pinned(&ctx->vm_account, nr_pages);
}

static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
{
- int ret;
-
- if (ctx->user) {
- ret = __io_account_mem(ctx->user, nr_pages);
- if (ret)
- return ret;
- }
-
- if (ctx->mm_account)
- atomic64_add(nr_pages, &ctx->mm_account->pinned_vm);
-
- return 0;
+ return vm_account_pinned(&ctx->vm_account, nr_pages);
}

static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b87436..d8833d0 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,12 +167,5 @@ static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);

-int __io_account_mem(struct user_struct *user, unsigned long nr_pages);
-
-static inline void __io_unaccount_mem(struct user_struct *user,
- unsigned long nr_pages)
-{
- atomic_long_sub(nr_pages, &user->locked_vm);
-}
-
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages);
#endif
--
git-series 0.9.1

2023-02-06 07:51:02

by Alistair Popple

Subject: [PATCH 10/19] net: skb: Switch to using vm_account

Switch to using vm_account to charge pinned pages. This will allow a
future change to charge the pinned pages to a cgroup to limit the
overall number of pinned pages in the system.
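
For reference, the call pattern the zerocopy paths switch to looks roughly
like the sketch below. The names are taken from the vm_account patches
earlier in the series; the wrapper functions and the exact init site are
illustrative only, not code added by this patch.

#include <linux/skbuff.h>
#include <linux/vm_account.h>

/* Sketch: charge a zerocopy buffer against the socket's vm_account and
 * release the charge when the notification completes.
 */
static int example_zc_charge(struct sock *sk, struct ubuf_info_msgzc *uarg,
                             size_t size)
{
        /* Takes references on current's mm/user (and later the pins cgroup). */
        vm_account_init(&sk->sk_vm_account, current, current_user(),
                        VM_ACCOUNT_USER);

        if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size)) {
                vm_account_release(&sk->sk_vm_account);
                return -ENOBUFS;
        }

        return 0;
}

static void example_zc_uncharge(struct sock *sk, struct ubuf_info_msgzc *uarg)
{
        /* Drops the charge and the references taken at init time. */
        mm_unaccount_pinned_pages(&sk->sk_vm_account, &uarg->mmp);
}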

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/skbuff.h | 7 +++---
include/net/sock.h | 3 +++-
net/core/skbuff.c | 47 +++++++++++++++----------------------------
net/rds/message.c | 10 ++++++---
4 files changed, 31 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c84924..14f29a0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@
#include <linux/in6.h>
#include <linux/if_packet.h>
#include <linux/llist.h>
+#include <linux/vm_account.h>
#include <net/flow.h>
#include <net/page_pool.h>
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
@@ -554,7 +555,6 @@ struct ubuf_info_msgzc {
};

struct mmpin {
- struct user_struct *user;
unsigned int num_pg;
} mmp;
};
@@ -563,8 +563,9 @@ struct ubuf_info_msgzc {
#define uarg_to_msgzc(ubuf_ptr) container_of((ubuf_ptr), struct ubuf_info_msgzc, \
ubuf)

-int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
-void mm_unaccount_pinned_pages(struct mmpin *mmp);
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+ size_t size);
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp);

/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
diff --git a/include/net/sock.h b/include/net/sock.h
index dcd72e6..0e756d3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -63,6 +63,7 @@
#include <linux/atomic.h>
#include <linux/refcount.h>
#include <linux/llist.h>
+#include <linux/vm_account.h>
#include <net/dst.h>
#include <net/checksum.h>
#include <net/tcp_states.h>
@@ -334,6 +335,7 @@ struct sk_filter;
* @sk_security: used by security modules
* @sk_mark: generic packet mark
* @sk_cgrp_data: cgroup data for this cgroup
+ * @sk_vm_account: data for pinned memory accounting
* @sk_memcg: this socket's memory cgroup association
* @sk_write_pending: a write to stream socket waits to start
* @sk_state_change: callback to indicate change in the state of the sock
@@ -523,6 +525,7 @@ struct sock {
void *sk_security;
#endif
struct sock_cgroup_data sk_cgrp_data;
+ struct vm_account sk_vm_account;
struct mem_cgroup *sk_memcg;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4a0eb55..bed3fc9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1309,42 +1309,25 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
}
EXPORT_SYMBOL_GPL(skb_morph);

-int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+int mm_account_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp,
+ size_t size)
{
- unsigned long max_pg, num_pg, new_pg, old_pg;
- struct user_struct *user;
-
- if (capable(CAP_IPC_LOCK) || !size)
- return 0;
+ unsigned int num_pg;

num_pg = (size >> PAGE_SHIFT) + 2; /* worst case */
- max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- user = mmp->user ? : current_user();
+ if (vm_account_pinned(vm_account, num_pg))
+ return -ENOBUFS;

- old_pg = atomic_long_read(&user->locked_vm);
- do {
- new_pg = old_pg + num_pg;
- if (new_pg > max_pg)
- return -ENOBUFS;
- } while (!atomic_long_try_cmpxchg(&user->locked_vm, &old_pg, new_pg));
-
- if (!mmp->user) {
- mmp->user = get_uid(user);
- mmp->num_pg = num_pg;
- } else {
- mmp->num_pg += num_pg;
- }
+ mmp->num_pg += num_pg;

return 0;
}
EXPORT_SYMBOL_GPL(mm_account_pinned_pages);

-void mm_unaccount_pinned_pages(struct mmpin *mmp)
+void mm_unaccount_pinned_pages(struct vm_account *vm_account, struct mmpin *mmp)
{
- if (mmp->user) {
- atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
- free_uid(mmp->user);
- }
+ vm_unaccount_pinned(vm_account, mmp->num_pg);
+ vm_account_release(vm_account);
}
EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);

@@ -1361,9 +1344,12 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)

BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
uarg = (void *)skb->cb;
- uarg->mmp.user = NULL;
+ uarg->mmp.num_pg = 0;
+ vm_account_init(&sk->sk_vm_account, current,
+ current_user(), VM_ACCOUNT_USER);

- if (mm_account_pinned_pages(&uarg->mmp, size)) {
+ if (mm_account_pinned_pages(&sk->sk_vm_account, &uarg->mmp, size)) {
+ vm_account_release(&sk->sk_vm_account);
kfree_skb(skb);
return NULL;
}
@@ -1416,7 +1402,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,

next = (u32)atomic_read(&sk->sk_zckey);
if ((u32)(uarg_zc->id + uarg_zc->len) == next) {
- if (mm_account_pinned_pages(&uarg_zc->mmp, size))
+ if (mm_account_pinned_pages(&sk->sk_vm_account,
+ &uarg_zc->mmp, size))
return NULL;
uarg_zc->len++;
uarg_zc->bytelen = bytelen;
@@ -1466,7 +1453,7 @@ static void __msg_zerocopy_callback(struct ubuf_info_msgzc *uarg)
u32 lo, hi;
u16 len;

- mm_unaccount_pinned_pages(&uarg->mmp);
+ mm_unaccount_pinned_pages(&sk->sk_vm_account, &uarg->mmp);

/* if !len, there was only 1 call, and it was aborted
* so do not queue a completion notification
diff --git a/net/rds/message.c b/net/rds/message.c
index b47e4f0..4595540 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -35,6 +35,7 @@
#include <linux/export.h>
#include <linux/skbuff.h>
#include <linux/list.h>
+#include <linux/vm_account.h>
#include <linux/errqueue.h>

#include "rds.h"
@@ -99,7 +100,7 @@ static void rds_rm_zerocopy_callback(struct rds_sock *rs,
struct list_head *head;
unsigned long flags;

- mm_unaccount_pinned_pages(&znotif->z_mmp);
+ mm_unaccount_pinned_pages(&rs->rs_sk.sk_vm_account, &znotif->z_mmp);
q = &rs->rs_zcookie_queue;
spin_lock_irqsave(&q->lock, flags);
head = &q->zcookie_head;
@@ -367,6 +368,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
int ret = 0;
int length = iov_iter_count(from);
struct rds_msg_zcopy_info *info;
+ struct vm_account *vm_account = &rm->m_rs->rs_sk.sk_vm_account;

rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from));

@@ -380,7 +382,9 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
return -ENOMEM;
INIT_LIST_HEAD(&info->rs_zcookie_next);
rm->data.op_mmp_znotifier = &info->znotif;
- if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp,
+ vm_account_init(vm_account, current, current_user(), VM_ACCOUNT_USER);
+ if (mm_account_pinned_pages(vm_account,
+ &rm->data.op_mmp_znotifier->z_mmp,
length)) {
ret = -ENOMEM;
goto err;
@@ -399,7 +403,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
for (i = 0; i < rm->data.op_nents; i++)
put_page(sg_page(&rm->data.op_sg[i]));
mmp = &rm->data.op_mmp_znotifier->z_mmp;
- mm_unaccount_pinned_pages(mmp);
+ mm_unaccount_pinned_pages(vm_account, mmp);
ret = -EFAULT;
goto err;
}
--
git-series 0.9.1

2023-02-06 07:51:08

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 13/19] fpga: dfl: afu: convert to use vm_account

To charge pinned pages against the pins cgroup drivers must use the
vm_account_pinned() functions, which require initialisation of a
struct vm_account. Convert the dfl-afu-region code to do this and
charge any pins to the pins cgroup.
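
For reference, the resulting accounting lifecycle in the region code looks
roughly like the sketch below. The vm_account calls and the struct member
are from this patch; the wrapper functions themselves are illustrative only.

#include <linux/vm_account.h>

/* Sketch: per-region accounting lifecycle used by this conversion. */
static int example_region_map(struct dfl_afu_dma_region *region, long npages)
{
        int ret;

        /* The charge is held for the lifetime of the DMA region. */
        vm_account_init_current(&region->vm_account);

        ret = vm_account_pinned(&region->vm_account, npages);
        if (ret) {
                vm_account_release(&region->vm_account);
                return ret;
        }

        /* ... pin the user pages and map the region ... */
        return 0;
}

static void example_region_unmap(struct dfl_afu_dma_region *region, long npages)
{
        vm_unaccount_pinned(&region->vm_account, npages);
        vm_account_release(&region->vm_account);
}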

Signed-off-by: Alistair Popple <[email protected]>
Cc: Wu Hao <[email protected]>
Cc: Tom Rix <[email protected]>
Cc: Moritz Fischer <[email protected]>
Cc: Xu Yilun <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
drivers/fpga/dfl-afu-dma-region.c | 11 ++++++++---
drivers/fpga/dfl-afu.h | 2 ++
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index 02b60fd..3b99784 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -38,7 +38,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
struct device *dev = &pdata->dev->dev;
int ret, pinned;

- ret = account_locked_vm(current->mm, npages, true);
+ ret = vm_account_pinned(&region->vm_account, npages);
if (ret)
return ret;

@@ -67,7 +67,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
free_pages:
kfree(region->pages);
unlock_vm:
- account_locked_vm(current->mm, npages, false);
+ vm_unaccount_pinned(&region->vm_account, npages);
return ret;
}

@@ -87,7 +87,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata,

unpin_user_pages(region->pages, npages);
kfree(region->pages);
- account_locked_vm(current->mm, npages, false);
+ vm_unaccount_pinned(&region->vm_account, npages);

dev_dbg(dev, "%ld pages unpinned\n", npages);
}
@@ -223,6 +223,7 @@ void afu_dma_region_destroy(struct dfl_feature_platform_data *pdata)
afu_dma_unpin_pages(pdata, region);

node = rb_next(node);
+ vm_account_release(&region->vm_account);
kfree(region);
}
}
@@ -322,6 +323,8 @@ int afu_dma_map_region(struct dfl_feature_platform_data *pdata,
region->user_addr = user_addr;
region->length = length;

+ vm_account_init_current(&region->vm_account);
+
/* Pin the user memory region */
ret = afu_dma_pin_pages(pdata, region);
if (ret) {
@@ -365,6 +368,7 @@ int afu_dma_map_region(struct dfl_feature_platform_data *pdata,
unpin_pages:
afu_dma_unpin_pages(pdata, region);
free_region:
+ vm_account_release(&region->vm_account);
kfree(region);
return ret;
}
@@ -399,6 +403,7 @@ int afu_dma_unmap_region(struct dfl_feature_platform_data *pdata, u64 iova)
dma_unmap_page(dfl_fpga_pdata_to_parent(pdata),
region->iova, region->length, DMA_BIDIRECTIONAL);
afu_dma_unpin_pages(pdata, region);
+ vm_account_release(&region->vm_account);
kfree(region);

return 0;
diff --git a/drivers/fpga/dfl-afu.h b/drivers/fpga/dfl-afu.h
index e5020e2..2ca6117 100644
--- a/drivers/fpga/dfl-afu.h
+++ b/drivers/fpga/dfl-afu.h
@@ -18,6 +18,7 @@
#define __DFL_AFU_H

#include <linux/mm.h>
+#include <linux/vm_account.h>

#include "dfl.h"

@@ -51,6 +52,7 @@ struct dfl_afu_mmio_region {
* @in_use: flag to indicate if this region is in_use.
*/
struct dfl_afu_dma_region {
+ struct vm_account vm_account;
u64 user_addr;
u64 length;
u64 iova;
--
git-series 0.9.1

2023-02-06 07:51:39

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 15/19] mm/util: Extend vm_account to charge pages against the pin cgroup

The vm_account_pinned() functions currently only account pages against
pinned_vm/locked_vm and enforce limits against RLIMIT_MEMLOCK. Extend
these to account pages and enforce limits using the pin count cgroup.

Accounting of pages will fail if either RLIMIT_MEMLOCK or the cgroup
limit is exceeded. Unlike rlimit enforcement, which can be bypassed if
the user has CAP_IPC_LOCK, cgroup limits cannot be bypassed.
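
From a caller's point of view nothing changes other than the extra check;
a minimal illustrative caller (not part of this patch) looks like:

/* Sketch: a single vm_account_pinned() call now enforces both limits.
 * -ENOMEM is returned if either the rlimit or the pins cgroup limit
 * would be exceeded, and CAP_IPC_LOCK does not bypass the cgroup check.
 */
static int example_pin(struct vm_account *vm_account, unsigned long npages)
{
        int ret = vm_account_pinned(vm_account, npages);

        if (ret)
                return ret;

        /* ... pin_user_pages() or similar ... */
        return 0;
}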

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/vm_account.h | 1 +
mm/util.c | 26 ++++++++++++++++++++------
2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/vm_account.h b/include/linux/vm_account.h
index b4b2e90..4fd5d3a 100644
--- a/include/linux/vm_account.h
+++ b/include/linux/vm_account.h
@@ -31,6 +31,7 @@ struct vm_account {
struct task_struct *task;
struct mm_struct *mm;
struct user_struct *user;
+ struct pins_cgroup *pins_cg;
enum vm_account_flags flags;
};

diff --git a/mm/util.c b/mm/util.c
index d8c19f8..0e93625 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -453,6 +453,7 @@ void vm_account_init(struct vm_account *vm_account, struct task_struct *task,

mmgrab(task->mm);
vm_account->mm = task->mm;
+ vm_account->pins_cg = get_pins_cg(task);
vm_account->flags = flags;
}
EXPORT_SYMBOL_GPL(vm_account_init);
@@ -472,6 +473,7 @@ void vm_account_release(struct vm_account *vm_account)
free_uid(vm_account->user);

mmdrop(vm_account->mm);
+ put_pins_cg(vm_account->pins_cg);
}
EXPORT_SYMBOL_GPL(vm_account_release);

@@ -502,6 +504,17 @@ static int vm_account_cmpxchg(struct vm_account *vm_account,
}
}

+static void vm_unaccount_legacy(struct vm_account *vm_account,
+ unsigned long npages)
+{
+ if (vm_account->flags & VM_ACCOUNT_USER) {
+ atomic_long_sub(npages, &vm_account->user->locked_vm);
+ atomic64_sub(npages, &vm_account->mm->pinned_vm);
+ } else {
+ atomic64_sub(npages, &vm_account->mm->pinned_vm);
+ }
+}
+
/**
* vm_account_pinned - Charge pinned or locked memory to the vm_account.
* @vm_account: pointer to an initialised vm_account.
@@ -537,6 +550,11 @@ int vm_account_pinned(struct vm_account *vm_account, unsigned long npages)
if (vm_account->flags & VM_ACCOUNT_USER)
atomic64_add(npages, &vm_account->mm->pinned_vm);

+ if (!pins_try_charge(vm_account->pins_cg, npages)) {
+ vm_unaccount_legacy(vm_account, npages);
+ return -ENOMEM;
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(vm_account_pinned);
@@ -548,12 +566,8 @@ EXPORT_SYMBOL_GPL(vm_account_pinned);
*/
void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long npages)
{
- if (vm_account->flags & VM_ACCOUNT_USER) {
- atomic_long_sub(npages, &vm_account->user->locked_vm);
- atomic64_sub(npages, &vm_account->mm->pinned_vm);
- } else {
- atomic64_sub(npages, &vm_account->mm->pinned_vm);
- }
+ vm_unaccount_legacy(vm_account, npages);
+ pins_uncharge(vm_account->pins_cg, npages);
}
EXPORT_SYMBOL_GPL(vm_unaccount_pinned);

--
git-series 0.9.1

2023-02-06 07:51:41

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

If too much memory in a system is pinned or locked it can lead to
problems such as performance degradation or in the worst case
out-of-memory errors as such memory cannot be moved or paged out.

In order to prevent users without CAP_IPC_LOCK from causing these
issues the amount of memory that can be pinned is typically limited by
RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
between tasks and the enforcement of these limits is inconsistent
between in-kernel users of pinned memory such as mlock() and device
drivers which may also pin pages with pin_user_pages().

To allow for a single limit to be set introduce a cgroup controller
which can be used to limit the number of pages being pinned by all
tasks in the cgroup.
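
From userspace the controller uses the usual cgroup v2 file layout. A
minimal example, assuming the controller has been enabled in
cgroup.subtree_control and that /sys/fs/cgroup/mygroup is a hypothetical
cgroup the caller controls:

#include <stdio.h>

/* Limit a cgroup to 1024 pinned/locked pages and read back the current
 * charge. Writing "max" to pins.max removes the limit.
 */
int main(void)
{
        long pins;
        FILE *f;

        f = fopen("/sys/fs/cgroup/mygroup/pins.max", "w");
        if (!f)
                return 1;
        fprintf(f, "1024\n");
        fclose(f);

        f = fopen("/sys/fs/cgroup/mygroup/pins.current", "r");
        if (!f)
                return 1;
        if (fscanf(f, "%ld", &pins) == 1)
                printf("pins.current = %ld\n", pins);
        fclose(f);

        return 0;
}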

Signed-off-by: Alistair Popple <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
MAINTAINERS | 7 +-
include/linux/cgroup.h | 20 +++-
include/linux/cgroup_subsys.h | 4 +-
mm/Kconfig | 11 +-
mm/Makefile | 1 +-
mm/pins_cgroup.c | 273 +++++++++++++++++++++++++++++++++++-
6 files changed, 316 insertions(+)
create mode 100644 mm/pins_cgroup.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f781f93..f8526e2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5381,6 +5381,13 @@ F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_kmem.c
F: tools/testing/selftests/cgroup/test_memcontrol.c

+CONTROL GROUP - PINNED AND LOCKED MEMORY
+M: Alistair Popple <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Maintained
+F: mm/pins_cgroup.c
+
CORETEMP HARDWARE MONITORING DRIVER
M: Fenghua Yu <[email protected]>
L: [email protected]
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aec..d98de25 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -857,4 +857,24 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}

#endif /* CONFIG_CGROUP_BPF */

+#ifdef CONFIG_CGROUP_PINS
+
+struct pins_cgroup *get_pins_cg(struct task_struct *task);
+void put_pins_cg(struct pins_cgroup *cg);
+void pins_uncharge(struct pins_cgroup *pins, int num);
+bool pins_try_charge(struct pins_cgroup *pins, int num);
+
+#else /* CONFIG_CGROUP_PINS */
+
+static inline struct pins_cgroup *get_pins_cg(struct task_struct *task)
+{
+ return NULL;
+}
+
+static inline void put_pins_cg(struct pins_cgroup *cg) {}
+static inline void pins_uncharge(struct pins_cgroup *pins, int num) {}
+static inline bool pins_try_charge(struct pins_cgroup *pins, int num) { return 1; }
+
+#endif /* CONFIG_CGROUP_PINS */
+
#endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 4452354..c1b4aab 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
SUBSYS(misc)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_PINS)
+SUBSYS(pins)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209..4472043 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,17 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }

+config CGROUP_PINS
+ bool "Cgroup controller for pinned and locked memory"
+ depends on CGROUPS
+ default y
+ help
+ Having too much memory pinned or locked can lead to system
+ instability due to increased likelihood of encountering
+ out-of-memory conditions. Select this option to enable a cgroup
+ controller which can be used to limit the overall number of
+ pages processes in a cgroup may lock or have pinned by drivers.
+
source "mm/damon/Kconfig"

endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5..81db189 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_CGROUP_PINS) += pins_cgroup.o
diff --git a/mm/pins_cgroup.c b/mm/pins_cgroup.c
new file mode 100644
index 0000000..2d8c6c7
--- /dev/null
+++ b/mm/pins_cgroup.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Controller for cgroups limiting the number of pages pinned with FOLL_LONGTERM.
+ *
+ * Copyright (C) 2022 Alistair Popple <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/sched/task.h>
+
+#define PINS_MAX (-1ULL)
+#define PINS_MAX_STR "max"
+
+struct pins_cgroup {
+ struct cgroup_subsys_state css;
+
+ atomic64_t counter;
+ atomic64_t limit;
+
+ struct cgroup_file events_file;
+ atomic64_t events_limit;
+};
+
+static struct pins_cgroup *css_pins(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct pins_cgroup, css);
+}
+
+static struct pins_cgroup *parent_pins(struct pins_cgroup *pins)
+{
+ return css_pins(pins->css.parent);
+}
+
+struct pins_cgroup *get_pins_cg(struct task_struct *task)
+{
+ return css_pins(task_get_css(task, pins_cgrp_id));
+}
+
+void put_pins_cg(struct pins_cgroup *cg)
+{
+ css_put(&cg->css);
+}
+
+static struct cgroup_subsys_state *
+pins_css_alloc(struct cgroup_subsys_state *parent)
+{
+ struct pins_cgroup *pins;
+
+ pins = kzalloc(sizeof(struct pins_cgroup), GFP_KERNEL);
+ if (!pins)
+ return ERR_PTR(-ENOMEM);
+
+ atomic64_set(&pins->counter, 0);
+ atomic64_set(&pins->limit, PINS_MAX);
+ atomic64_set(&pins->events_limit, 0);
+ return &pins->css;
+}
+
+static void pins_css_free(struct cgroup_subsys_state *css)
+{
+ kfree(css_pins(css));
+}
+
+/**
+ * pins_cancel - uncharge the local pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to cancel
+ *
+ * This function will WARN if the pin count goes under 0, because such a case is
+ * a bug in the pins controller proper.
+ */
+void pins_cancel(struct pins_cgroup *pins, int num)
+{
+ /*
+ * A negative count (or overflow for that matter) is invalid,
+ * and indicates a bug in the `pins` controller proper.
+ */
+ WARN_ON_ONCE(atomic64_add_negative(-num, &pins->counter));
+}
+
+/**
+ * pins_uncharge - hierarchically uncharge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to uncharge
+ */
+void pins_uncharge(struct pins_cgroup *pins, int num)
+{
+ struct pins_cgroup *p;
+
+ for (p = pins; parent_pins(p); p = parent_pins(p))
+ pins_cancel(p, num);
+}
+
+/**
+ * pins_charge - hierarchically charge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to charge
+ *
+ * This function does *not* follow the pin limit set. It cannot fail and the new
+ * pin count may exceed the limit. This is only used for reverting failed
+ * attaches, where there is no other way out than violating the limit.
+ */
+static void pins_charge(struct pins_cgroup *pins, int num)
+{
+ struct pins_cgroup *p;
+
+ for (p = pins; parent_pins(p); p = parent_pins(p))
+ atomic64_add(num, &p->counter);
+}
+
+/**
+ * pins_try_charge - hierarchically try to charge the pin count
+ * @pins: the pin cgroup state
+ * @num: the number of pins to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit. Returns true if the charge
+ * succeeded, false otherwise.
+ */
+bool pins_try_charge(struct pins_cgroup *pins, int num)
+{
+ struct pins_cgroup *p, *q;
+
+ for (p = pins; parent_pins(p); p = parent_pins(p)) {
+ uint64_t new = atomic64_add_return(num, &p->counter);
+ uint64_t limit = atomic64_read(&p->limit);
+
+ if (limit != PINS_MAX && new > limit)
+ goto revert;
+ }
+
+ return true;
+
+revert:
+ for (q = pins; q != p; q = parent_pins(q))
+ pins_cancel(q, num);
+ pins_cancel(p, num);
+
+ return false;
+}
+
+static int pins_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css;
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, dst_css, tset) {
+ struct pins_cgroup *pins = css_pins(dst_css);
+ struct cgroup_subsys_state *old_css;
+ struct pins_cgroup *old_pins;
+
+ old_css = task_css(task, pins_cgrp_id);
+ old_pins = css_pins(old_css);
+
+ pins_charge(pins, task->mm->locked_vm);
+ pins_uncharge(old_pins, task->mm->locked_vm);
+ }
+
+ return 0;
+}
+
+static void pins_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css;
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, dst_css, tset) {
+ struct pins_cgroup *pins = css_pins(dst_css);
+ struct cgroup_subsys_state *old_css;
+ struct pins_cgroup *old_pins;
+
+ old_css = task_css(task, pins_cgrp_id);
+ old_pins = css_pins(old_css);
+
+ pins_charge(old_pins, task->mm->locked_vm);
+ pins_uncharge(pins, task->mm->locked_vm);
+ }
+}
+
+
+static ssize_t pins_max_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct cgroup_subsys_state *css = of_css(of);
+ struct pins_cgroup *pins = css_pins(css);
+ uint64_t limit;
+ int err;
+
+ buf = strstrip(buf);
+ if (!strcmp(buf, PINS_MAX_STR)) {
+ limit = PINS_MAX;
+ goto set_limit;
+ }
+
+ err = kstrtoll(buf, 0, &limit);
+ if (err)
+ return err;
+
+ if (limit < 0 || limit >= PINS_MAX)
+ return -EINVAL;
+
+set_limit:
+ /*
+ * Limit updates don't need to be mutex'd, since it isn't
+ * critical that any racing fork()s follow the new limit.
+ */
+ atomic64_set(&pins->limit, limit);
+ return nbytes;
+}
+
+static int pins_max_show(struct seq_file *sf, void *v)
+{
+ struct cgroup_subsys_state *css = seq_css(sf);
+ struct pins_cgroup *pins = css_pins(css);
+ uint64_t limit = atomic64_read(&pins->limit);
+
+ if (limit >= PINS_MAX)
+ seq_printf(sf, "%s\n", PINS_MAX_STR);
+ else
+ seq_printf(sf, "%lld\n", limit);
+
+ return 0;
+}
+
+static s64 pins_current_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct pins_cgroup *pins = css_pins(css);
+
+ return atomic64_read(&pins->counter);
+}
+
+static int pins_events_show(struct seq_file *sf, void *v)
+{
+ struct pins_cgroup *pins = css_pins(seq_css(sf));
+
+ seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pins->events_limit));
+ return 0;
+}
+
+static struct cftype pins_files[] = {
+ {
+ .name = "max",
+ .write = pins_max_write,
+ .seq_show = pins_max_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "current",
+ .read_s64 = pins_current_read,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "events",
+ .seq_show = pins_events_show,
+ .file_offset = offsetof(struct pins_cgroup, events_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys pins_cgrp_subsys = {
+ .css_alloc = pins_css_alloc,
+ .css_free = pins_css_free,
+ .legacy_cftypes = pins_files,
+ .dfl_cftypes = pins_files,
+ .can_attach = pins_can_attach,
+ .cancel_attach = pins_cancel_attach,
+};
--
git-series 0.9.1

2023-02-06 07:52:08

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 16/19] mm/util: Refactor account_locked_vm

account_locked_vm() takes a flag to indicate if pages are being
accounted or unaccounted for. A flag is also provided to bypass
rlimits. However unaccounting of pages always succeeds, so the rlimit
bypass flag is irrelevant on that path. The flags make calling code harder
to understand, so refactor the accounting and unaccounting paths into
separate functions.
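
After the refactor callers pick the function for the direction they need
rather than passing a bool. An illustrative caller (not part of this
patch):

/* Sketch of the old vs new calling convention. */
static int example_lock(struct mm_struct *mm, unsigned long pages)
{
        /* Was: account_locked_vm(mm, pages, true); */
        return account_locked_vm(mm, pages);
}

static void example_unlock(struct mm_struct *mm, unsigned long pages)
{
        /* Was: account_locked_vm(mm, pages, false); unaccounting never fails. */
        unaccount_locked_vm(mm, pages);
}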

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/mm.h | 5 +--
mm/util.c | 73 +++++++++++++++++++++++++++++++++--------------
2 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f85716..126b756 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2090,9 +2090,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);

-int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
-int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+int account_locked_vm(struct mm_struct *mm, unsigned long pages);
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
struct task_struct *task, bool bypass_rlim);
+void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages);

struct kvec;
int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
diff --git a/mm/util.c b/mm/util.c
index 0e93625..1ca0dfe 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -575,7 +575,6 @@ EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
* __account_locked_vm - account locked pages to an mm's locked_vm
* @mm: mm to account against
* @pages: number of pages to account
- * @inc: %true if @pages should be considered positive, %false if not
* @task: task used to check RLIMIT_MEMLOCK
* @bypass_rlim: %true if checking RLIMIT_MEMLOCK should be skipped
*
@@ -586,7 +585,7 @@ EXPORT_SYMBOL_GPL(vm_unaccount_pinned);
* * 0 on success
* * -ENOMEM if RLIMIT_MEMLOCK would be exceeded.
*/
-int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
+int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
struct task_struct *task, bool bypass_rlim)
{
unsigned long locked_vm, limit;
@@ -595,33 +594,44 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
mmap_assert_write_locked(mm);

locked_vm = mm->locked_vm;
- if (inc) {
- if (!bypass_rlim) {
- limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked_vm + pages > limit)
- ret = -ENOMEM;
- }
- if (!ret)
- mm->locked_vm = locked_vm + pages;
- } else {
- WARN_ON_ONCE(pages > locked_vm);
- mm->locked_vm = locked_vm - pages;
+ if (!bypass_rlim) {
+ limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked_vm + pages > limit)
+ ret = -ENOMEM;
}

- pr_debug("%s: [%d] caller %ps %c%lu %lu/%lu%s\n", __func__, task->pid,
- (void *)_RET_IP_, (inc) ? '+' : '-', pages << PAGE_SHIFT,
- locked_vm << PAGE_SHIFT, task_rlimit(task, RLIMIT_MEMLOCK),
- ret ? " - exceeded" : "");
+ if (!ret)
+ mm->locked_vm = locked_vm + pages;
+
+ pr_debug("%s: [%d] caller %ps %lu %lu/%lu%s\n", __func__, task->pid,
+ (void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
+ task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");

return ret;
}
EXPORT_SYMBOL_GPL(__account_locked_vm);

/**
+ * __unaccount_locked_vm - unaccount locked pages from an mm's locked_vm
+ * @mm: mm to unaccount against
+ * @pages: number of pages to unaccount
+ *
+ * Assumes @mm is valid and that mmap_lock is held as writer.
+ */
+void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
+{
+ unsigned long locked_vm = mm->locked_vm;
+
+ mmap_assert_write_locked(mm);
+ WARN_ON_ONCE(pages > locked_vm);
+ mm->locked_vm = locked_vm - pages;
+}
+EXPORT_SYMBOL_GPL(__unaccount_locked_vm);
+
+/**
* account_locked_vm - account locked pages to an mm's locked_vm
* @mm: mm to account against, may be NULL
* @pages: number of pages to account
- * @inc: %true if @pages should be considered positive, %false if not
*
* Assumes a non-NULL @mm is valid (i.e. at least one reference on it).
*
@@ -629,7 +639,7 @@ EXPORT_SYMBOL_GPL(__account_locked_vm);
* * 0 on success, or if mm is NULL
* * -ENOMEM if RLIMIT_MEMLOCK would be exceeded.
*/
-int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc)
+int account_locked_vm(struct mm_struct *mm, unsigned long pages)
{
int ret;

@@ -637,14 +647,35 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc)
return 0;

mmap_write_lock(mm);
- ret = __account_locked_vm(mm, pages, inc, current,
- capable(CAP_IPC_LOCK));
+ ret = __account_locked_vm(mm, pages, current, capable(CAP_IPC_LOCK));
mmap_write_unlock(mm);

return ret;
}
EXPORT_SYMBOL_GPL(account_locked_vm);

+/**
+ * unaccount_locked_vm - unaccount locked pages from an mm's locked_vm
+ * @mm: mm to unaccount against, may be NULL
+ * @pages: number of pages to unaccount
+ *
+ * Assumes a non-NULL @mm is valid (i.e. at least one reference on it).
+ *
+ * Unlike account_locked_vm() this cannot fail. It does nothing if @mm is
+ * NULL and expects @pages to have previously been accounted with
+ * account_locked_vm().
+ */
+void unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
+{
+ if (pages == 0 || !mm)
+ return;
+
+ mmap_write_lock(mm);
+ __unaccount_locked_vm(mm, pages);
+ mmap_write_unlock(mm);
+}
+EXPORT_SYMBOL_GPL(unaccount_locked_vm);
+
unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long pgoff)
--
git-series 0.9.1

2023-02-06 07:52:11

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 17/19] mm: Convert mmap and mlock to use account_locked_vm

A future change introduces a cgroup to control the amount of
locked/pinned memory on the system. To ensure memory locked via mlock
and mmap is accounted for, use the common account_locked_vm()
function.

As cgroups can outlive individual processes, also unaccount the
locked memory during process teardown.

This patch should introduce no user visible change.
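
The call sites converted below all follow the same pattern, sketched here
for reference (illustrative only): charge before committing the mapping
change so a failed charge can simply bail out, and unaccount on any later
error path or at teardown.

static int example_mlock_site(struct mm_struct *mm, unsigned long nr_pages)
{
        if (__account_locked_vm(mm, nr_pages, current, capable(CAP_IPC_LOCK)))
                return -ENOMEM;

        /* ... update the VMA; on any subsequent failure: */
        /* __unaccount_locked_vm(mm, nr_pages); */

        return 0;
}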

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/internal.h | 2 +-
mm/mlock.c | 76 ++++++++++-----------------------------------------
mm/mmap.c | 76 +++++++++++++++++++++++++--------------------------
mm/mremap.c | 54 ++++++++++++++++++++++++++----------
mm/secretmem.c | 6 +---
5 files changed, 95 insertions(+), 119 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8..7c8c3f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -489,8 +489,6 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
extern long faultin_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
bool write, int *locked);
-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
- unsigned long len);
/*
* mlock_vma_page() and munlock_vma_page():
* should be called with vma's mmap_lock held for read or write,
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6d..a97c8c5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -416,6 +416,20 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
goto out;

+ /*
+ * Keep track of amount of locked VM.
+ */
+ nr_pages = (end - start) >> PAGE_SHIFT;
+ if (!(newflags & VM_LOCKED)) {
+ __unaccount_locked_vm(mm, nr_pages);
+ } else if (!(oldflags & VM_LOCKED)) {
+ if (__account_locked_vm(mm, nr_pages, current,
+ capable(CAP_IPC_LOCK))) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ }
+
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
@@ -439,16 +453,6 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,

success:
/*
- * Keep track of amount of locked VM.
- */
- nr_pages = (end - start) >> PAGE_SHIFT;
- if (!(newflags & VM_LOCKED))
- nr_pages = -nr_pages;
- else if (oldflags & VM_LOCKED)
- nr_pages = 0;
- mm->locked_vm += nr_pages;
-
- /*
* vm_flags is protected by the mmap_lock held in write mode.
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, populate_vma_page_range will bring it back.
@@ -517,42 +521,6 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
}

/*
- * Go through vma areas and sum size of mlocked
- * vma pages, as return value.
- * Note deferred memory locking case(mlock2(,,MLOCK_ONFAULT)
- * is also counted.
- * Return value: previously mlocked page counts
- */
-static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
- unsigned long start, size_t len)
-{
- struct vm_area_struct *vma;
- unsigned long count = 0;
- unsigned long end;
- VMA_ITERATOR(vmi, mm, start);
-
- /* Don't overflow past ULONG_MAX */
- if (unlikely(ULONG_MAX - len < start))
- end = ULONG_MAX;
- else
- end = start + len;
-
- for_each_vma_range(vmi, vma, end) {
- if (vma->vm_flags & VM_LOCKED) {
- if (start > vma->vm_start)
- count -= (start - vma->vm_start);
- if (end < vma->vm_end) {
- count += end - vma->vm_start;
- break;
- }
- count += vma->vm_end - vma->vm_start;
- }
- }
-
- return count >> PAGE_SHIFT;
-}
-
-/*
* convert get_user_pages() return value to posix mlock() error
*/
static int __mlock_posix_error_return(long retval)
@@ -585,21 +553,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
if (mmap_write_lock_killable(current->mm))
return -EINTR;

- locked += current->mm->locked_vm;
- if ((locked > lock_limit) && (!capable(CAP_IPC_LOCK))) {
- /*
- * It is possible that the regions requested intersect with
- * previously mlocked areas, that part area in "mm->locked_vm"
- * should not be counted to new mlock increment count. So check
- * and adjust locked count if necessary.
- */
- locked -= count_mm_mlocked_page_nr(current->mm,
- start, len);
- }
-
- /* check against resource limits */
- if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
- error = apply_vma_lock_flags(start, len, flags);
+ error = apply_vma_lock_flags(start, len, flags);

mmap_write_unlock(current->mm);
if (error)
diff --git a/mm/mmap.c b/mm/mmap.c
index 425a934..2c05582 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -160,7 +160,7 @@ static int check_brk_limits(unsigned long addr, unsigned long len)
if (IS_ERR_VALUE(mapped_addr))
return mapped_addr;

- return mlock_future_check(current->mm, current->mm->def_flags, len);
+ return 0;
}
static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
unsigned long newbrk, unsigned long oldbrk,
@@ -1184,23 +1184,6 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
return hint;
}

-int mlock_future_check(struct mm_struct *mm, unsigned long flags,
- unsigned long len)
-{
- unsigned long locked, lock_limit;
-
- /* mlock MCL_FUTURE? */
- if (flags & VM_LOCKED) {
- locked = len >> PAGE_SHIFT;
- locked += mm->locked_vm;
- lock_limit = rlimit(RLIMIT_MEMLOCK);
- lock_limit >>= PAGE_SHIFT;
- if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- return -EAGAIN;
- }
- return 0;
-}
-
static inline u64 file_mmap_size_max(struct file *file, struct inode *inode)
{
if (S_ISREG(inode->i_mode))
@@ -1310,9 +1293,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!can_do_mlock())
return -EPERM;

- if (mlock_future_check(mm, vm_flags, len))
- return -EAGAIN;
-
if (file) {
struct inode *inode = file_inode(file);
unsigned long flags_mask;
@@ -1882,22 +1862,27 @@ static int acct_stack_growth(struct vm_area_struct *vma,
if (size > rlimit(RLIMIT_STACK))
return -ENOMEM;

- /* mlock limit tests */
- if (mlock_future_check(mm, vma->vm_flags, grow << PAGE_SHIFT))
- return -ENOMEM;
-
/* Check to ensure the stack will not grow into a hugetlb-only region */
new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
vma->vm_end - size;
if (is_hugepage_only_range(vma->vm_mm, new_start, size))
return -EFAULT;

+ /* mlock limit tests */
+ if (vma->vm_flags & VM_LOCKED)
+ if (__account_locked_vm(mm, grow, current,
+ capable(CAP_IPC_LOCK)))
+ return -ENOMEM;
+
/*
* Overcommit.. This must be the final test, as it will
* update security statistics.
*/
- if (security_vm_enough_memory_mm(mm, grow))
+ if (security_vm_enough_memory_mm(mm, grow)) {
+ if (vma->vm_flags & VM_LOCKED)
+ __unaccount_locked_vm(mm, grow);
return -ENOMEM;
+ }

return 0;
}
@@ -1975,8 +1960,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
* to guard against concurrent vma expansions.
*/
spin_lock(&mm->page_table_lock);
- if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
@@ -2056,8 +2039,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
* to guard against concurrent vma expansions.
*/
spin_lock(&mm->page_table_lock);
- if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_start = address;
@@ -2281,7 +2262,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
return -ENOMEM;

if (vma->vm_flags & VM_LOCKED)
- vma->vm_mm->locked_vm -= vma_pages(vma);
+ __unaccount_locked_vm(vma->vm_mm, vma_pages(vma));

return 0;
}
@@ -2525,6 +2506,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct vm_area_struct *next, *prev, *merge;
pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
+ unsigned long locked = 0;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
pgoff_t vm_pgoff;
@@ -2560,6 +2542,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}

+ if (vm_flags & VM_LOCKED) {
+ locked = len >> PAGE_SHIFT;
+ if (__account_locked_vm(mm, locked, current,
+ capable(CAP_IPC_LOCK))) {
+ error = -ENOMEM;
+ goto unacct_error;
+ }
+ }
+
next = mas_next(&mas, ULONG_MAX);
prev = mas_prev(&mas, 0);
if (vm_flags & VM_SPECIAL)
@@ -2605,7 +2596,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma = vm_area_alloc(mm);
if (!vma) {
error = -ENOMEM;
- goto unacct_error;
+ goto unlock_error;
}

vma->vm_start = addr;
@@ -2725,8 +2716,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
- else
- mm->locked_vm += (len >> PAGE_SHIFT);
}

if (file)
@@ -2759,6 +2748,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mapping_unmap_writable(file->f_mapping);
free_vma:
vm_area_free(vma);
+unlock_error:
+ if (locked)
+ __unaccount_locked_vm(mm, locked);
unacct_error:
if (charged)
vm_unacct_memory(charged);
@@ -2942,8 +2934,13 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;

+ if (flags & VM_LOCKED)
+ if (__account_locked_vm(mm, len >> PAGE_SHIFT, current,
+ capable(CAP_IPC_LOCK)))
+ return -ENOMEM;
+
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
- return -ENOMEM;
+ goto unacct_locked;

/*
* Expand the existing vma if possible; Note that singular lists do not
@@ -2993,8 +2990,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
mm->data_vm += len >> PAGE_SHIFT;
- if (flags & VM_LOCKED)
- mm->locked_vm += (len >> PAGE_SHIFT);
vma->vm_flags |= VM_SOFTDIRTY;
validate_mm(mm);
return 0;
@@ -3003,6 +2998,8 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
vm_area_free(vma);
unacct_fail:
vm_unacct_memory(len >> PAGE_SHIFT);
+unacct_locked:
+ __unaccount_locked_vm(mm, len >> PAGE_SHIFT);
return -ENOMEM;
}

@@ -3064,7 +3061,7 @@ void exit_mmap(struct mm_struct *mm)
{
struct mmu_gather tlb;
struct vm_area_struct *vma;
- unsigned long nr_accounted = 0;
+ unsigned long nr_accounted = 0, nr_locked = 0;
MA_STATE(mas, &mm->mm_mt, 0, 0);
int count = 0;

@@ -3107,6 +3104,8 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
+ if (vma->vm_flags & VM_LOCKED)
+ nr_locked += vma_pages(vma);
remove_vma(vma);
count++;
cond_resched();
@@ -3116,6 +3115,7 @@ void exit_mmap(struct mm_struct *mm)

trace_exit_mmap(mm);
__mt_destroy(&mm->mm_mt);
+ __unaccount_locked_vm(mm, nr_locked);
mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
}
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5..67cc4f3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -574,7 +574,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
bool *locked, unsigned long flags,
struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
{
- long to_account = new_len - old_len;
+ long to_account = (new_len - old_len) >> PAGE_SHIFT;
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma;
unsigned long vm_flags = vma->vm_flags;
@@ -594,7 +594,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;

if (unlikely(flags & MREMAP_DONTUNMAP))
- to_account = new_len;
+ to_account = new_len >> PAGE_SHIFT;

if (vma->vm_ops && vma->vm_ops->may_split) {
if (vma->vm_start != old_addr)
@@ -618,16 +618,36 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return err;

if (vm_flags & VM_ACCOUNT) {
- if (security_vm_enough_memory_mm(mm, to_account >> PAGE_SHIFT))
+ if (security_vm_enough_memory_mm(mm, to_account))
return -ENOMEM;
}

+ /*
+ * MREMAP_DONTUNMAP clears VM_LOCKED on the old vma and
+ * implies new_len == old_len so no need to account locked
+ * pages.
+ */
+ if ((vm_flags & VM_LOCKED) && likely(!(flags & MREMAP_DONTUNMAP))) {
+ if (__account_locked_vm(mm, to_account, current,
+ capable(CAP_IPC_LOCK))) {
+ if (vm_flags & VM_ACCOUNT)
+ vm_unacct_memory(to_account);
+ return -ENOMEM;
+ }
+ *locked = true;
+ }
+
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
&need_rmap_locks);
if (!new_vma) {
if (vm_flags & VM_ACCOUNT)
- vm_unacct_memory(to_account >> PAGE_SHIFT);
+ vm_unacct_memory(to_account);
+ if ((vm_flags & VM_LOCKED) &&
+ likely(!(flags & MREMAP_DONTUNMAP))) {
+ __unaccount_locked_vm(mm, to_account);
+ *locked = false;
+ }
return -ENOMEM;
}

@@ -696,10 +716,11 @@ static unsigned long move_vma(struct vm_area_struct *vma,
vma->vm_end == (old_addr + old_len))
unlink_anon_vmas(vma);

- /* Because we won't unmap we don't need to touch locked_vm */
return new_addr;
}

+ /* Make sure do_munmap() doesn't unaccount locked pages */
+ vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
/* OOM: unable to split vma, just get accounts right */
if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
@@ -707,15 +728,11 @@ static unsigned long move_vma(struct vm_area_struct *vma,
excess = 0;
}

- if (vm_flags & VM_LOCKED) {
- mm->locked_vm += new_len >> PAGE_SHIFT;
- *locked = true;
- }
-
mm->hiwater_vm = hiwater_vm;

/* Restore VM_ACCOUNT if one or two pieces of vma left */
if (excess) {
+ vma->vm_flags = vm_flags;
vma->vm_flags |= VM_ACCOUNT;
if (split)
find_vma(mm, vma->vm_end)->vm_flags |= VM_ACCOUNT;
@@ -768,9 +785,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
return ERR_PTR(-EFAULT);

- if (mlock_future_check(mm, vma->vm_flags, new_len - old_len))
- return ERR_PTR(-EAGAIN);
-
if (!may_expand_vm(mm, vma->vm_flags,
(new_len - old_len) >> PAGE_SHIFT))
return ERR_PTR(-ENOMEM);
@@ -1026,6 +1040,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
}
}

+ if (vma->vm_flags & VM_LOCKED) {
+ if (__account_locked_vm(mm, pages, current,
+ capable(CAP_IPC_LOCK))) {
+ if (vma->vm_flags & VM_ACCOUNT)
+ vm_unacct_memory(pages);
+ ret = -ENOMEM;
+ goto out;
+ }
+ }
+
/*
* Function vma_merge() is called on the extension we are adding to
* the already existing vma, vma_merge() will merge this extension with
@@ -1038,14 +1062,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
extension_pgoff, vma_policy(vma),
vma->vm_userfaultfd_ctx, anon_vma_name(vma));
if (!vma) {
+ // TODO: We always unacct memory
+ // regardless of VM_ACCOUNT flags?
vm_unacct_memory(pages);
+ __unaccount_locked_vm(mm, pages);
ret = -ENOMEM;
goto out;
}

vm_stat_account(mm, vma->vm_flags, pages);
if (vma->vm_flags & VM_LOCKED) {
- mm->locked_vm += pages;
locked = true;
new_addr = addr;
}
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 04c3ac9..4515eb4 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,13 +120,11 @@ static int secretmem_release(struct inode *inode, struct file *file)

static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
{
- unsigned long len = vma->vm_end - vma->vm_start;
-
if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
return -EINVAL;

- if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
- return -EAGAIN;
+ if (account_locked_vm(vma->vm_mm, vma->vm_end - vma->vm_start))
+ return -ENOMEM;

vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
vma->vm_ops = &secretmem_vm_ops;
--
git-series 0.9.1

2023-02-06 07:52:37

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 19/19] selftests/vm: Add pins-cgroup selftest for mlock/mmap

Add some basic tests of mlock/mmap cgroup accounting for pinned
memory.

Signed-off-by: Alistair Popple <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
MAINTAINERS | 1 +-
tools/testing/selftests/vm/Makefile | 1 +-
tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
3 files changed, 273 insertions(+)
create mode 100644 tools/testing/selftests/vm/pins-cgroup.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f8526e2..4c4eed9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5387,6 +5387,7 @@ L: [email protected]
L: [email protected]
S: Maintained
F: mm/pins_cgroup.c
+F: tools/testing/selftests/vm/pins-cgroup.c

CORETEMP HARDWARE MONITORING DRIVER
M: Fenghua Yu <[email protected]>
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 89c14e4..0653720 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -56,6 +56,7 @@ TEST_GEN_PROGS += soft-dirty
TEST_GEN_PROGS += split_huge_page_test
TEST_GEN_FILES += ksm_tests
TEST_GEN_PROGS += ksm_functional_tests
+TEST_GEN_FILES += pins-cgroup

ifeq ($(MACHINE),x86_64)
CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
diff --git a/tools/testing/selftests/vm/pins-cgroup.c b/tools/testing/selftests/vm/pins-cgroup.c
new file mode 100644
index 0000000..c2eabc2
--- /dev/null
+++ b/tools/testing/selftests/vm/pins-cgroup.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "../kselftest_harness.h"
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+#include <unistd.h>
+
+#define CGROUP_TEMP "/sys/fs/cgroup/pins_XXXXXX"
+#define PINS_MAX (-1UL)
+
+FIXTURE(pins_cg)
+{
+ char *cg_path;
+ long page_size;
+};
+
+static char *cgroup_new(void)
+{
+ char *cg;
+
+ cg = malloc(sizeof(CGROUP_TEMP));
+ strcpy(cg, CGROUP_TEMP);
+ if (!mkdtemp(cg)) {
+ perror("Failed to create cgroup");
+ return NULL;
+ }
+
+ return cg;
+}
+
+static int cgroup_add_proc(char *cg, pid_t pid)
+{
+ char *cg_proc;
+ FILE *f;
+ int ret = 0;
+
+ if (asprintf(&cg_proc, "%s/cgroup.procs", cg) < 0)
+ return -1;
+
+ f = fopen(cg_proc, "w");
+ free(cg_proc);
+ if (!f)
+ return -1;
+
+ if (fprintf(f, "%ld\n", (long) pid) < 0)
+ ret = -1;
+
+ fclose(f);
+ return ret;
+}
+
+static int cgroup_set_limit(char *cg, unsigned long limit)
+{
+ char *cg_pins_max;
+ FILE *f;
+ int ret = 0;
+
+ if (asprintf(&cg_pins_max, "%s/pins.max", cg) < 0)
+ return -1;
+
+ f = fopen(cg_pins_max, "w");
+ free(cg_pins_max);
+ if (!f)
+ return -1;
+
+ if (limit != PINS_MAX) {
+ if (fprintf(f, "%ld\n", limit) < 0)
+ ret = -1;
+ } else {
+ if (fprintf(f, "max\n") < 0)
+ ret = -1;
+ }
+
+ fclose(f);
+ return ret;
+}
+
+FIXTURE_SETUP(pins_cg)
+{
+ char *cg_subtree_control;
+ FILE *f;
+
+ if (asprintf(&cg_subtree_control,
+ "/sys/fs/cgroup/cgroup.subtree_control") < 0)
+ return;
+
+ f = fopen(cg_subtree_control, "w");
+ free(cg_subtree_control);
+ if (!f)
+ return;
+
+ fprintf(f, "+pins\n");
+ fclose(f);
+
+ self->cg_path = cgroup_new();
+ self->page_size = sysconf(_SC_PAGE_SIZE);
+}
+
+FIXTURE_TEARDOWN(pins_cg)
+{
+ cgroup_add_proc("/sys/fs/cgroup", getpid());
+
+ rmdir(self->cg_path);
+ free(self->cg_path);
+}
+
+static long cgroup_pins(char *cg)
+{
+ long pin_count;
+ char *cg_pins_current;
+ FILE *f;
+ int ret;
+
+ if (asprintf(&cg_pins_current, "%s/pins.current", cg) < 0)
+ return -1;
+
+ f = fopen(cg_pins_current, "r");
+ if (!f) {
+ printf("Can't open %s\n", cg_pins_current);
+ getchar();
+ free(cg_pins_current);
+ return -2;
+ }
+
+ free(cg_pins_current);
+
+ if (fscanf(f, "%ld", &pin_count) == EOF)
+ ret = -3;
+ else
+ ret = pin_count;
+
+ fclose(f);
+ return ret;
+}
+
+static int set_rlim_memlock(unsigned long size)
+{
+ struct rlimit rlim_memlock = {
+ .rlim_cur = size,
+ .rlim_max = size,
+ };
+ cap_t cap;
+ cap_value_t capability[1] = { CAP_IPC_LOCK };
+
+ /*
+ * Many of the rlimit checks are skipped if a process has
+ * CAP_IPC_LOCK. As this test should be run as root we need to
+ * explicitly drop it.
+ */
+ cap = cap_get_proc();
+ if (!cap)
+ return -1;
+ if (cap_set_flag(cap, CAP_EFFECTIVE, 1, capability, CAP_CLEAR))
+ return -1;
+ if (cap_set_proc(cap))
+ return -1;
+ return setrlimit(RLIMIT_MEMLOCK, &rlim_memlock);
+}
+
+TEST_F(pins_cg, basic)
+{
+ pid_t child_pid;
+ long page_size = self->page_size;
+ char *p;
+
+ ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+ p = mmap(NULL, 32*page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ ASSERT_NE(p, MAP_FAILED);
+
+ ASSERT_EQ(cgroup_pins(self->cg_path), 0);
+ memset(p, 0, 16*page_size);
+ ASSERT_EQ(mlock(p, page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 1);
+ ASSERT_EQ(mlock(p + page_size, page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+ ASSERT_EQ(mlock(p, page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+ ASSERT_EQ(mlock(p, 4*page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 4);
+ ASSERT_EQ(munlock(p + 2*page_size, 2*page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+ ASSERT_EQ(cgroup_set_limit(self->cg_path, 8), 0);
+ ASSERT_EQ(mlock(p, 16*page_size), -1);
+ ASSERT_EQ(errno, ENOMEM);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 2);
+ ASSERT_EQ(cgroup_set_limit(self->cg_path, PINS_MAX), 0);
+
+ /* check mremap() a locked region correctly accounts locked pages */
+ ASSERT_EQ(mlock(p, 32*page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+ p = mremap(p, 32*page_size, 64*page_size, MREMAP_MAYMOVE);
+ ASSERT_NE(p, MAP_FAILED);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 64);
+ ASSERT_EQ(munmap(p + 32*page_size, 32*page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+ p = mremap(p, 32*page_size, 32*page_size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
+ ASSERT_NE(p, MAP_FAILED);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+ ASSERT_EQ(munlock(p, 32*page_size), 0);
+
+ /* mremap() a locked region should fail if limit exceeded */
+ ASSERT_EQ(set_rlim_memlock(32*page_size), 0);
+ ASSERT_EQ(mlock(p, 32*page_size), 0);
+ ASSERT_EQ(mremap(p, 32*page_size, 64*page_size, 0), MAP_FAILED);
+ ASSERT_EQ(munlock(p, 32*page_size), 0);
+
+ /* Exceeds rlimit, expected to fail */
+ ASSERT_EQ(set_rlim_memlock(16*page_size), 0);
+ ASSERT_EQ(mlock(p, 32*page_size), -1);
+ ASSERT_EQ(errno, ENOMEM);
+
+ /* memory in the child isn't locked so shouldn't increase pin_cg count */
+ ASSERT_EQ(mlock(p, 16*page_size), 0);
+ child_pid = fork();
+ if (!child_pid) {
+ ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+ ASSERT_EQ(mlock(p, 16*page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 32);
+ return;
+
+ }
+ waitpid(child_pid, NULL, 0);
+
+ /* check that child exit uncharged the pins */
+ ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+}
+
+TEST_F(pins_cg, mmap)
+{
+ char *p;
+
+ ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+ p = mmap(NULL, 4*self->page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED, -1, 0);
+ ASSERT_NE(p, MAP_FAILED);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 4);
+}
+
+/*
+ * Test moving to a different cgroup.
+ */
+TEST_F(pins_cg, move_cg)
+{
+ char *p, *new_cg;
+
+ ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+ p = mmap(NULL, 16*self->page_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ ASSERT_NE(p, MAP_FAILED);
+ memset(p, 0, 16*self->page_size);
+ ASSERT_EQ(mlock(p, 16*self->page_size), 0);
+ ASSERT_EQ(cgroup_pins(self->cg_path), 16);
+ ASSERT_NE(new_cg = cgroup_new(), NULL);
+ ASSERT_EQ(cgroup_add_proc(new_cg, getpid()), 0);
+ ASSERT_EQ(cgroup_pins(new_cg), 16);
+ ASSERT_EQ(cgroup_add_proc(self->cg_path, getpid()), 0);
+ rmdir(new_cg);
+}
+TEST_HARNESS_MAIN
--
git-series 0.9.1

2023-02-06 07:53:01

by Alistair Popple

[permalink] [raw]
Subject: [PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup

account_locked_vm() is used to account memory to mm->locked_vm. This
adds accounting to the pins cgroup as well, since locked memory behaves
similarly and should be accounted against the same limit if one is set.

This means memory must now be unaccounted for correctly, as the cgroup
typically outlives both the mm and the task. It is assumed that
callers of account_locked_vm() only do accounting against the current
task. Callers that need to do accounting against remote tasks should
use account_pinned_vm() and associated struct vm_account to hold
references to the cgroup.
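
The current-task assumption matters because the uncharge path looks up the
cgroup from current rather than from the task that originally charged, as
sketched below (simplified from the patch):

/* Sketch: __unaccount_locked_vm() after this patch. A caller that charged
 * on behalf of another task would uncharge the wrong cgroup here, which is
 * why remote-mm users should use the vm_account API instead.
 */
static void unaccount_locked_vm_sketch(struct mm_struct *mm, unsigned long pages)
{
        struct pins_cgroup *pins_cg = get_pins_cg(current);

        if (pins_cg) {
                pins_uncharge(pins_cg, pages);
                put_pins_cg(pins_cg);
        }
        mm->locked_vm -= pages;
}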

Signed-off-by: Alistair Popple <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/util.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 1ca0dfe..755bada 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -589,15 +589,21 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
struct task_struct *task, bool bypass_rlim)
{
unsigned long locked_vm, limit;
+ struct pins_cgroup *pins_cg = get_pins_cg(task);
int ret = 0;

mmap_assert_write_locked(mm);

+ if (pins_cg && !pins_try_charge(pins_cg, pages))
+ return -ENOMEM;
+
locked_vm = mm->locked_vm;
if (!bypass_rlim) {
limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- if (locked_vm + pages > limit)
+ if (locked_vm + pages > limit) {
+ pins_uncharge(pins_cg, pages);
ret = -ENOMEM;
+ }
}

if (!ret)
@@ -607,6 +613,12 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
(void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");

+ if (pins_cg)
+ put_pins_cg(pins_cg);
return ret;
}
EXPORT_SYMBOL_GPL(__account_locked_vm);
@@ -622,8 +634,18 @@ void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
{
unsigned long locked_vm = mm->locked_vm;

+ /*
+ * TODO: Convert book3s vio to use pinned vm to ensure
+ * unaccounting happens to the correct cgroup.
+ */
+ struct pins_cgroup *pins_cg = get_pins_cg(current);
+
mmap_assert_write_locked(mm);
WARN_ON_ONCE(pages > locked_vm);
+ if (pins_cg) {
+ pins_uncharge(pins_cg, pages);
+ put_pins_cg(pins_cg);
+ }
mm->locked_vm = locked_vm - pages;
}
EXPORT_SYMBOL_GPL(__unaccount_locked_vm);
--
git-series 0.9.1

2023-02-06 15:29:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account

On 2/6/23 12:47 AM, Alistair Popple wrote:
> Convert io_uring to use vm_account instead of directly charging pages
> against the user/mm. Rather than charge pages to both user->locked_vm
> and mm->pinned_vm this will only charge pages to user->locked_vm.

Not sure how we're supposed to review this, when you just send us 9/19
and vm_account_release() is supposedly an earlier patch in this series.

Either CC the whole series, or at least the cover letter, core parts,
and the per-subsystem parts.

--
Jens Axboe



2023-02-06 21:02:00

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Sun, Feb 5, 2023 at 11:49 PM Alistair Popple <[email protected]> wrote:
>
> If too much memory in a system is pinned or locked it can lead to
> problems such as performance degradation or in the worst case
> out-of-memory errors as such memory cannot be moved or paged out.
>
> In order to prevent users without CAP_IPC_LOCK from causing these
> issues the amount of memory that can be pinned is typically limited by
> RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
> between tasks and the enforcement of these limits is inconsistent
> between in-kernel users of pinned memory such as mlock() and device
> drivers which may also pin pages with pin_user_pages().
>
> To allow for a single limit to be set introduce a cgroup controller
> which can be used to limit the number of pages being pinned by all
> tasks in the cgroup.
>
> Signed-off-by: Alistair Popple <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Zefan Li <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> MAINTAINERS | 7 +-
> include/linux/cgroup.h | 20 +++-
> include/linux/cgroup_subsys.h | 4 +-
> mm/Kconfig | 11 +-
> mm/Makefile | 1 +-
> mm/pins_cgroup.c | 273 +++++++++++++++++++++++++++++++++++-
> 6 files changed, 316 insertions(+)
> create mode 100644 mm/pins_cgroup.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f781f93..f8526e2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5381,6 +5381,13 @@ F: tools/testing/selftests/cgroup/memcg_protection.m
> F: tools/testing/selftests/cgroup/test_kmem.c
> F: tools/testing/selftests/cgroup/test_memcontrol.c
>
> +CONTROL GROUP - PINNED AND LOCKED MEMORY
> +M: Alistair Popple <[email protected]>
> +L: [email protected]
> +L: [email protected]
> +S: Maintained
> +F: mm/pins_cgroup.c
> +
> CORETEMP HARDWARE MONITORING DRIVER
> M: Fenghua Yu <[email protected]>
> L: [email protected]
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 3410aec..d98de25 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -857,4 +857,24 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
>
> #endif /* CONFIG_CGROUP_BPF */
>
> +#ifdef CONFIG_CGROUP_PINS
> +
> +struct pins_cgroup *get_pins_cg(struct task_struct *task);
> +void put_pins_cg(struct pins_cgroup *cg);
> +void pins_uncharge(struct pins_cgroup *pins, int num);
> +bool pins_try_charge(struct pins_cgroup *pins, int num);
> +
> +#else /* CONFIG_CGROUP_PINS */
> +
> +static inline struct pins_cgroup *get_pins_cg(struct task_struct *task)
> +{
> + return NULL;
> +}
> +
> +static inline void put_pins_cg(struct pins_cgroup *cg) {}
> +static inline void pins_uncharge(struct pins_cgroup *pins, int num) {}
> +static inline bool pins_try_charge(struct pins_cgroup *pins, int num) { return 1; }
> +
> +#endif /* CONFIG_CGROUP_PINS */
> +
> #endif /* _LINUX_CGROUP_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 4452354..c1b4aab 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -65,6 +65,10 @@ SUBSYS(rdma)
> SUBSYS(misc)
> #endif
>
> +#if IS_ENABLED(CONFIG_CGROUP_PINS)
> +SUBSYS(pins)
> +#endif
> +
> /*
> * The following subsystems are not supported on the default hierarchy.
> */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ff7b209..4472043 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1183,6 +1183,17 @@ config LRU_GEN_STATS
> This option has a per-memcg and per-node memory overhead.
> # }
>
> +config CGROUP_PINS
> + bool "Cgroup controller for pinned and locked memory"
> + depends on CGROUPS
> + default y
> + help
> + Having too much memory pinned or locked can lead to system
> + instability due to increased likelihood of encountering
> + out-of-memory conditions. Select this option to enable a cgroup
> + controller which can be used to limit the overall number of
> + pages processes in a cgroup may lock or have pinned by drivers.
> +
> source "mm/damon/Kconfig"
>
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 8e105e5..81db189 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
> obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
> obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
> obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_CGROUP_PINS) += pins_cgroup.o
> diff --git a/mm/pins_cgroup.c b/mm/pins_cgroup.c
> new file mode 100644
> index 0000000..2d8c6c7
> --- /dev/null
> +++ b/mm/pins_cgroup.c
> @@ -0,0 +1,273 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Controller for cgroups limiting the number of pages pinned with FOLL_LONGTERM.
> + *
> + * Copyright (C) 2022 Alistair Popple <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/threads.h>
> +#include <linux/atomic.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/sched/task.h>
> +
> +#define PINS_MAX (-1ULL)
> +#define PINS_MAX_STR "max"
> +
> +struct pins_cgroup {
> + struct cgroup_subsys_state css;
> +
> + atomic64_t counter;
> + atomic64_t limit;
> +
> + struct cgroup_file events_file;
> + atomic64_t events_limit;
> +};
> +
> +static struct pins_cgroup *css_pins(struct cgroup_subsys_state *css)
> +{
> + return container_of(css, struct pins_cgroup, css);
> +}
> +
> +static struct pins_cgroup *parent_pins(struct pins_cgroup *pins)
> +{
> + return css_pins(pins->css.parent);
> +}
> +
> +struct pins_cgroup *get_pins_cg(struct task_struct *task)

I think you'll need a version of this that gets multiple refs to the
pins cgroup. I will add more details in the patch that actually hooks
the accounting.

> +{
> + return css_pins(task_get_css(task, pins_cgrp_id));
> +}
> +
> +void put_pins_cg(struct pins_cgroup *cg)
> +{
> + css_put(&cg->css);
> +}
> +
> +static struct cgroup_subsys_state *
> +pins_css_alloc(struct cgroup_subsys_state *parent)
> +{
> + struct pins_cgroup *pins;
> +
> + pins = kzalloc(sizeof(struct pins_cgroup), GFP_KERNEL);
> + if (!pins)
> + return ERR_PTR(-ENOMEM);
> +
> + atomic64_set(&pins->counter, 0);
> + atomic64_set(&pins->limit, PINS_MAX);
> + atomic64_set(&pins->events_limit, 0);
> + return &pins->css;
> +}
> +
> +static void pins_css_free(struct cgroup_subsys_state *css)
> +{
> + kfree(css_pins(css));
> +}
> +
> +/**
> + * pins_cancel - uncharge the local pin count
> + * @pins: the pin cgroup state
> + * @num: the number of pins to cancel
> + *
> + * This function will WARN if the pin count goes under 0, because such a case is
> + * a bug in the pins controller proper.
> + */
> +void pins_cancel(struct pins_cgroup *pins, int num)
> +{
> + /*
> + * A negative count (or overflow for that matter) is invalid,
> + * and indicates a bug in the `pins` controller proper.
> + */
> + WARN_ON_ONCE(atomic64_add_negative(-num, &pins->counter));
> +}
> +
> +/**
> + * pins_uncharge - hierarchically uncharge the pin count
> + * @pins: the pin cgroup state
> + * @num: the number of pins to uncharge
> + */
> +void pins_uncharge(struct pins_cgroup *pins, int num)
> +{
> + struct pins_cgroup *p;
> +
> + for (p = pins; parent_pins(p); p = parent_pins(p))
> + pins_cancel(p, num);
> +}
> +
> +/**
> + * pins_charge - hierarchically charge the pin count
> + * @pins: the pin cgroup state
> + * @num: the number of pins to charge
> + *
> + * This function does *not* follow the pin limit set. It cannot fail and the new
> + * pin count may exceed the limit. This is only used for reverting failed
> + * attaches, where there is no other way out than violating the limit.
> + */
> +static void pins_charge(struct pins_cgroup *pins, int num)
> +{
> + struct pins_cgroup *p;
> +
> + for (p = pins; parent_pins(p); p = parent_pins(p))
> + atomic64_add(num, &p->counter);
> +}
> +
> +/**
> + * pins_try_charge - hierarchically try to charge the pin count
> + * @pins: the pin cgroup state
> + * @num: the number of pins to charge
> + *
> + * This function follows the set limit. It will fail if the charge would cause
> + * the new value to exceed the hierarchical limit. Returns true if the charge
> + * succeeded, false otherwise.
> + */
> +bool pins_try_charge(struct pins_cgroup *pins, int num)
> +{
> + struct pins_cgroup *p, *q;
> +
> + for (p = pins; parent_pins(p); p = parent_pins(p)) {
> + uint64_t new = atomic64_add_return(num, &p->counter);
> + uint64_t limit = atomic64_read(&p->limit);
> +
> + if (limit != PINS_MAX && new > limit)
> + goto revert;
> + }
> +
> + return true;
> +
> +revert:
> + for (q = pins; q != p; q = parent_pins(q))
> + pins_cancel(q, num);
> + pins_cancel(p, num);
> +
> + return false;
> +}
> +
> +static int pins_can_attach(struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *dst_css;
> + struct task_struct *task;
> +
> + cgroup_taskset_for_each(task, dst_css, tset) {
> + struct pins_cgroup *pins = css_pins(dst_css);
> + struct cgroup_subsys_state *old_css;
> + struct pins_cgroup *old_pins;
> +
> + old_css = task_css(task, pins_cgrp_id);
> + old_pins = css_pins(old_css);
> +
> + pins_charge(pins, task->mm->locked_vm);

Why are we not using pins_try_charge() here, and failing attachment if
we cannot charge?

The comment above pins_charge() states that it is only used when
cancelling attachment, but this is not the case here, right?

> + pins_uncharge(old_pins, task->mm->locked_vm);
> + }
> +
> + return 0;
> +}
> +
> +static void pins_cancel_attach(struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *dst_css;
> + struct task_struct *task;
> +
> + cgroup_taskset_for_each(task, dst_css, tset) {
> + struct pins_cgroup *pins = css_pins(dst_css);
> + struct cgroup_subsys_state *old_css;
> + struct pins_cgroup *old_pins;
> +
> + old_css = task_css(task, pins_cgrp_id);
> + old_pins = css_pins(old_css);
> +
> + pins_charge(old_pins, task->mm->locked_vm);
> + pins_uncharge(pins, task->mm->locked_vm);
> + }
> +}
> +
> +
> +static ssize_t pins_max_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + struct cgroup_subsys_state *css = of_css(of);
> + struct pins_cgroup *pins = css_pins(css);
> + uint64_t limit;
> + int err;
> +
> + buf = strstrip(buf);
> + if (!strcmp(buf, PINS_MAX_STR)) {
> + limit = PINS_MAX;
> + goto set_limit;
> + }
> +
> + err = kstrtoll(buf, 0, &limit);
> + if (err)
> + return err;
> +
> + if (limit < 0 || limit >= PINS_MAX)
> + return -EINVAL;
> +
> +set_limit:
> + /*
> + * Limit updates don't need to be mutex'd, since it isn't
> + * critical that any racing fork()s follow the new limit.
> + */
> + atomic64_set(&pins->limit, limit);
> + return nbytes;
> +}
> +
> +static int pins_max_show(struct seq_file *sf, void *v)
> +{
> + struct cgroup_subsys_state *css = seq_css(sf);
> + struct pins_cgroup *pins = css_pins(css);
> + uint64_t limit = atomic64_read(&pins->limit);
> +
> + if (limit >= PINS_MAX)
> + seq_printf(sf, "%s\n", PINS_MAX_STR);
> + else
> + seq_printf(sf, "%lld\n", limit);
> +
> + return 0;
> +}
> +
> +static s64 pins_current_read(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + struct pins_cgroup *pins = css_pins(css);
> +
> + return atomic64_read(&pins->counter);
> +}
> +
> +static int pins_events_show(struct seq_file *sf, void *v)
> +{
> + struct pins_cgroup *pins = css_pins(seq_css(sf));
> +
> + seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pins->events_limit));
> + return 0;
> +}
> +
> +static struct cftype pins_files[] = {
> + {
> + .name = "max",
> + .write = pins_max_write,
> + .seq_show = pins_max_show,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> + {
> + .name = "current",
> + .read_s64 = pins_current_read,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> + {
> + .name = "events",
> + .seq_show = pins_events_show,
> + .file_offset = offsetof(struct pins_cgroup, events_file),
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> + { } /* terminate */
> +};
> +
> +struct cgroup_subsys pins_cgrp_subsys = {
> + .css_alloc = pins_css_alloc,
> + .css_free = pins_css_free,
> + .legacy_cftypes = pins_files,
> + .dfl_cftypes = pins_files,
> + .can_attach = pins_can_attach,
> + .cancel_attach = pins_cancel_attach,
> +};
> --
> git-series 0.9.1
>

2023-02-06 21:12:47

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 18/19] mm/mmap: Charge locked memory to pins cgroup

On Sun, Feb 5, 2023 at 11:50 PM Alistair Popple <[email protected]> wrote:
>
> account_locked_vm() is used to account memory to mm->locked_vm. This
> adds accounting to the pins cgroup, as locked memory behaves similarly
> and should be accounted against the same global limit when one is set.
>
> This means memory must now be unaccounted against the correct cgroup,
> as the cgroup typically outlives both the mm and the task. It is
> assumed that callers of account_locked_vm() only do accounting against
> the current task. Callers that need to do accounting against remote
> tasks should use account_pinned_vm() and the associated struct
> vm_account to hold references to the cgroup.
>
> Signed-off-by: Alistair Popple <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/util.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 1ca0dfe..755bada 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -589,15 +589,21 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
> struct task_struct *task, bool bypass_rlim)
> {
> unsigned long locked_vm, limit;
> + struct pins_cgroup *pins_cg = get_pins_cg(task);

Here we get one ref on the pins cgroup for the entire locked region
that may contain multiple pages, right? During unlock, we drop the
ref. Is it possible that we lock a region (acquiring one ref), and
then unlock it in chunks (dropping multiple refs)?

If this is possible, we may have a problem here. We may need to
acquire one ref per pinned page (not sure if this can overflow). We
may also want to defer the refcount handling to the pins cgroup
controller code, similar to charge_memcg(), a function that tries to
charge and acquires any necessary refs, same for uncharging.

WDYT?

> int ret = 0;
>
> mmap_assert_write_locked(mm);
>
> + if (pins_cg && !pins_try_charge(pins_cg, pages))
> + return -ENOMEM;
> +
> locked_vm = mm->locked_vm;
> if (!bypass_rlim) {
> limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> - if (locked_vm + pages > limit)
> + if (locked_vm + pages > limit) {
> + pins_uncharge(pins_cg, pages);
> ret = -ENOMEM;
> + }
> }
>
> if (!ret)
> @@ -607,6 +613,12 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
> (void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
> task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
>
> + pr_debug("%s: [%d] caller %ps %lu %lu/%lu%s\n", __func__, task->pid,
> + (void *)_RET_IP_, pages << PAGE_SHIFT, locked_vm << PAGE_SHIFT,
> + task_rlimit(task, RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
> +
> + if (pins_cg)
> + put_pins_cg(pins_cg);
> return ret;
> }
> EXPORT_SYMBOL_GPL(__account_locked_vm);
> @@ -622,8 +634,18 @@ void __unaccount_locked_vm(struct mm_struct *mm, unsigned long pages)
> {
> unsigned long locked_vm = mm->locked_vm;
>
> + /*
> + * TODO: Convert book3s vio to use pinned vm to ensure
> + * unaccounting happens to the correct cgroup.
> + */
> + struct pins_cgroup *pins_cg = get_pins_cg(current);
> +
> mmap_assert_write_locked(mm);
> WARN_ON_ONCE(pages > locked_vm);
> + if (pins_cg) {
> + pins_uncharge(pins_cg, pages);
> + put_pins_cg(pins_cg);
> + }
> mm->locked_vm = locked_vm - pages;
> }
> EXPORT_SYMBOL_GPL(__unaccount_locked_vm);
> --
> git-series 0.9.1
>
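
For what it's worth, a rough sketch of the second option Yosry describes (keep the charge and the reference handling together inside the controller, taking one reference per charged page so a region locked in one call can be unlocked in smaller chunks) might look something like the following. None of this is in the posted series; pins_charge_current() and pins_uncharge_pages() are hypothetical names, the per-page css_get_many()/css_put_many() referencing is just one possible scheme, and it assumes callers charge at least one page.

struct pins_cgroup *pins_charge_current(unsigned long pages)
{
	struct pins_cgroup *pins = get_pins_cg(current);

	if (!pins)
		return NULL;			/* controller not enabled */

	if (!pins_try_charge(pins, pages)) {
		put_pins_cg(pins);
		return ERR_PTR(-ENOMEM);
	}

	/* One reference per charged page so partial uncharges balance. */
	css_get_many(&pins->css, pages);
	put_pins_cg(pins);			/* drop the lookup reference */
	return pins;
}

void pins_uncharge_pages(struct pins_cgroup *pins, unsigned long pages)
{
	if (IS_ERR_OR_NULL(pins))
		return;
	pins_uncharge(pins, pages);
	css_put_many(&pins->css, pages);
}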

2023-02-06 21:14:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 06:47:51PM +1100, Alistair Popple wrote:
> If too much memory in a system is pinned or locked it can lead to
> problems such as performance degradation or in the worst case
> out-of-memory errors as such memory cannot be moved or paged out.
>
> In order to prevent users without CAP_IPC_LOCK from causing these
> issues the amount of memory that can be pinned is typically limited by
> RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
> between tasks and the enforcement of these limits is inconsistent
> between in-kernel users of pinned memory such as mlock() and device
> drivers which may also pin pages with pin_user_pages().
>
> To allow for a single limit to be set introduce a cgroup controller
> which can be used to limit the number of pages being pinned by all
> tasks in the cgroup.

As I wrote before, I think this might fit better as a part of memcg than as
its own controller.

Thanks.

--
tejun

2023-02-06 22:32:58

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 6, 2023 at 1:14 PM Tejun Heo <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 06:47:51PM +1100, Alistair Popple wrote:
> > If too much memory in a system is pinned or locked it can lead to
> > problems such as performance degradation or in the worst case
> > out-of-memory errors as such memory cannot be moved or paged out.
> >
> > In order to prevent users without CAP_IPC_LOCK from causing these
> > issues the amount of memory that can be pinned is typically limited by
> > RLIMIT_MEMLOCK. However this is inflexible as limits can't be shared
> > between tasks and the enforcement of these limits is inconsistent
> > between in-kernel users of pinned memory such as mlock() and device
> > drivers which may also pin pages with pin_user_pages().
> >
> > To allow for a single limit to be set introduce a cgroup controller
> > which can be used to limit the number of pages being pinned by all
> > tasks in the cgroup.
>
> As I wrote before, I think this might fit better as a part of memcg than as
> its own controller.

I guess it boils down to which we want:
(a) Limit the amount of memory that processes in a cgroup can pin/lock.
(b) Limit the amount of memory charged to a cgroup that can be pinned/locked.

The proposal is doing (a), I suppose if this was part of memcg it
would be (b), right?

I am not saying it should be one or the other, I am just making sure
my understanding is clear.

>
> Thanks.
>
> --
> tejun

2023-02-06 22:36:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
> I guess it boils down to which we want:
> (a) Limit the amount of memory that processes in a cgroup can pin/lock.
> (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
>
> The proposal is doing (a), I suppose if this was part of memcg it
> would be (b), right?
>
> I am not saying it should be one or the other, I am just making sure
> my understanding is clear.

I don't quite understand what the distinction would mean in practice. It's
just odd to put locked memory in a separate controller from interface POV.

Thanks.

--
tejun

2023-02-06 22:40:00

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 6, 2023 at 2:36 PM Tejun Heo <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
> > I guess it boils down to which we want:
> > (a) Limit the amount of memory that processes in a cgroup can pin/lock.
> > (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
> >
> > The proposal is doing (a), I suppose if this was part of memcg it
> > would be (b), right?
> >
> > I am not saying it should be one or the other, I am just making sure
> > my understanding is clear.
>
> I don't quite understand what the distinction would mean in practice. It's
> just odd to put locked memory in a separate controller from interface POV.

Assume we have 2 cgroups, A and B. A process in cgroup A creates a
tmpfs file and writes to it, so the memory is now charged to cgroup A.
Now imagine a process in cgroup B tries to lock this memory.
- With (a) the amount of locked memory will count towards cgroup A's
limit, because cgroup A is charged for the memory.
- With (b) the amount of locked memory will count towards cgroup B's
limit, because a process in cgroup B is locking the memory.

I agree that it is confusing from an interface POV.

>
> Thanks.
>
> --
> tejun

2023-02-06 23:25:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 02:39:17PM -0800, Yosry Ahmed wrote:
> On Mon, Feb 6, 2023 at 2:36 PM Tejun Heo <[email protected]> wrote:
> >
> > On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
> > > I guess it boils down to which we want:
> > > (a) Limit the amount of memory that processes in a cgroup can pin/lock.
> > > (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
> > >
> > > The proposal is doing (a), I suppose if this was part of memcg it
> > > would be (b), right?
> > >
> > > I am not saying it should be one or the other, I am just making sure
> > > my understanding is clear.
> >
> > I don't quite understand what the distinction would mean in practice. It's
> > just odd to put locked memory in a separate controller from interface POV.
>
> Assume we have 2 cgroups, A and B. A process in cgroup A creates a
> tmpfs file and writes to it, so the memory is now charged to cgroup A.
> Now imagine a process in cgroup B tries to lock this memory.
> - With (a) the amount of locked memory will count towards cgroup A's
> limit, because cgroup A is charged for the memory.
> - With (b) the amount of locked memory will count towards cgroup B's
> limit, because a process in cgroup B is locking the memory.
>
> I agree that it is confusing from an interface POV.

Oh yeah, that's confusing. I'd go with (a) for consistency with the rest of
memcg - locked memory should fit inside e.g. memory.max. The problem with
shared memory accounting exists for non-locked memory as well and prolly
best to handle the same way rather than handling differently.

Thanks.

--
tejun

2023-02-06 23:35:55

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 6, 2023 at 3:25 PM Tejun Heo <[email protected]> wrote:
>
> On Mon, Feb 06, 2023 at 02:39:17PM -0800, Yosry Ahmed wrote:
> > On Mon, Feb 6, 2023 at 2:36 PM Tejun Heo <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
> > > > I guess it boils down to which we want:
> > > > (a) Limit the amount of memory that processes in a cgroup can pin/lock.
> > > > (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
> > > >
> > > > The proposal is doing (a), I suppose if this was part of memcg it
> > > > would be (b), right?
> > > >
> > > > I am not saying it should be one or the other, I am just making sure
> > > > my understanding is clear.
> > >
> > > I don't quite understand what the distinction would mean in practice. It's
> > > just odd to put locked memory in a separate controller from interface POV.
> >
> > Assume we have 2 cgroups, A and B. A process in cgroup A creates a
> > tmpfs file and writes to it, so the memory is now charged to cgroup A.
> > Now imagine a process in cgroup B tries to lock this memory.
> > - With (a) the amount of locked memory will count towards cgroup A's
> > limit, because cgroup A is charged for the memory.
> > - With (b) the amount of locked memory will count towards cgroup B's
> > limit, because a process in cgroup B is locking the memory.
> >
> > I agree that it is confusing from an interface POV.
>
> Oh yeah, that's confusing. I'd go with (a) for consistency with the rest of
> memcg - locked memory should fit inside e.g. memory.max. The problem with
> shared memory accounting exists for non-locked memory as well and prolly
> best to handle the same way rather than handling differently.

+Michal Hocko +Roman Gushchin +Shakeel Butt for visibility with memcg.

>
> Thanks.
>
> --
> tejun

2023-02-06 23:41:10

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 01:25:33PM -1000, Tejun Heo wrote:
> On Mon, Feb 06, 2023 at 02:39:17PM -0800, Yosry Ahmed wrote:
> > On Mon, Feb 6, 2023 at 2:36 PM Tejun Heo <[email protected]> wrote:
> > >
> > > On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
> > > > I guess it boils down to which we want:
> > > > (a) Limit the amount of memory that processes in a cgroup can pin/lock.
> > > > (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
> > > >
> > > > The proposal is doing (a), I suppose if this was part of memcg it
> > > > would be (b), right?
> > > >
> > > > I am not saying it should be one or the other, I am just making sure
> > > > my understanding is clear.
> > >
> > > I don't quite understand what the distinction would mean in practice. It's
> > > just odd to put locked memory in a separate controller from interface POV.
> >
> > Assume we have 2 cgroups, A and B. A process in cgroup A creates a
> > tmpfs file and writes to it, so the memory is now charged to cgroup A.
> > Now imagine a process in cgroup B tries to lock this memory.
> > - With (a) the amount of locked memory will count towards cgroup A's
> > limit, because cgroup A is charged for the memory.
> > - With (b) the amount of locked memory will count towards cgroup B's
> > limit, because a process in cgroup B is locking the memory.
> >
> > I agree that it is confusing from an interface POV.
>
> Oh yeah, that's confusing. I'd go with (a) for consistency with the rest of
> memcg - locked memory should fit inside e.g. memory.max. The problem with
> shared memory accounting exists for non-locked memory as well and prolly
> best to handle the same way rather than handling differently.

(a) kind of destroys the point of this as a sandboxing tool

It is not so harmful to use memory that someone else has been charged
with allocating.

But it is harmful to pin memory if someone else is charged for the
pin. It means it is unpredictable how much memory a sandbox can
actually lock down.

Plus we have the double accounting problem, if 1000 processes in
different cgroups open the tmpfs and all pin the memory then cgroup A
will be charged 1000x for the memory and hit its limit, possibly
creating a DOS from less priv to more priv

Jason

2023-02-07 00:32:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

Hello,

On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> (a) kind of destroys the point of this as a sandboxing tool
>
> It is not so harmful to use memory that someone else has been charged
> with allocating.
>
> But it is harmful to pin memory if someone else is charged for the
> pin. It means it is unpredictable how much memory a sandbox can
> actually lock down.
>
> Plus we have the double accounting problem, if 1000 processes in
> different cgroups open the tmpfs and all pin the memory then cgroup A
> will be charged 1000x for the memory and hit its limit, possibly
> creating a DOS from less priv to more priv

Let's hear what memcg people think about it. I'm not a fan of disassociating
the ownership and locker of the same page but it is true that actively
increasing locked consumption on a remote cgroup is awkward too.

Thanks.

--
tejun

2023-02-07 01:01:48

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On 2/6/23 17:39, Yosry Ahmed wrote:
> On Mon, Feb 6, 2023 at 2:36 PM Tejun Heo <[email protected]> wrote:
>> On Mon, Feb 06, 2023 at 02:32:10PM -0800, Yosry Ahmed wrote:
>>> I guess it boils down to which we want:
>>> (a) Limit the amount of memory that processes in a cgroup can pin/lock.
>>> (b) Limit the amount of memory charged to a cgroup that can be pinned/locked.
>>>
>>> The proposal is doing (a), I suppose if this was part of memcg it
>>> would be (b), right?
>>>
>>> I am not saying it should be one or the other, I am just making sure
>>> my understanding is clear.
>> I don't quite understand what the distinction would mean in practice. It's
>> just odd to put locked memory in a separate controller from interface POV.
> Assume we have 2 cgroups, A and B. A process in cgroup A creates a
> tmpfs file and writes to it, so the memory is now charged to cgroup A.
> Now imagine a process in cgroup B tries to lock this memory.
> - With (a) the amount of locked memory will count towards cgroup A's
> limit, because cgroup A is charged for the memory.
> - With (b) the amount of locked memory will count towards cgroup B's
> limit, because a process in cgroup B is locking the memory.
>
> I agree that it is confusing from an interface POV.

If it should not be part of the memcg, does it make sense to make it a
resource in the existing misc controller? I believe we don't want a
proliferation of new cgroup controllers.

Cheers,
Longman


2023-02-07 01:03:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 08:00:54PM -0500, Waiman Long wrote:
> If it should not be part of the memcg, does it make sense to make it a
> resource in the existing misc controller? I believe we don't want a
> proliferation of new cgroup controllers.

Yeah, if it's gonna be an independent knob, I suppose so, but I really think
the locked accounting should be tied to the page, which most likely would
mean that it'd be tied to the page ownership too making its natural place
memcg.

Thanks.

--
tejun

2023-02-07 01:12:03

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account


Jens Axboe <[email protected]> writes:

> On 2/6/23 12:47 AM, Alistair Popple wrote:
>> Convert io_uring to use vm_account instead of directly charging pages
>> against the user/mm. Rather than charge pages to both user->locked_vm
>> and mm->pinned_vm this will only charge pages to user->locked_vm.
>
> Not sure how we're supposed to review this, when you just send us 9/19
> and vm_account_release() is supposedly an earlier patch in this series.
>
> Either CC the whole series, or at least the cover letter, core parts,
> and the per-subsystem parts.

Ok, thanks. Will be sure to add everyone to the cover letter and patch
01 when I send the next version.

For reference the cover letter is here:

https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/

And the core patch that introduces vm_account is here:

https://lore.kernel.org/linux-mm/e80b61561f97296a6c08faeebe281cb949333d1d.1675669136.git-series.apopple@nvidia.com/

No problem if you want to wait for the resend/next version before
taking another look though.

2023-02-07 01:53:26

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory


Tejun Heo <[email protected]> writes:

> On Mon, Feb 06, 2023 at 08:00:54PM -0500, Waiman Long wrote:
>> If it should not be part of the memcg, does it make sense to make it a
>> resource in the existing misc controller? I believe we don't want a
>> proliferation of new cgroup controllers.
>
> Yeah, if it's gonna be an independent knob, I suppose so, but I really think
> the locked accounting should be tied to the page, which most likely would
> mean that it'd be tied to the page ownership too making its natural place
> memcg.

Yes, I think it might be possible. I looked briefly at doing it when
initially developing the series but I would like to resolve the question
of independent knob vs. memcg before heading too far down either path.

> Thanks.


2023-02-07 12:20:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon, Feb 06, 2023 at 02:32:37PM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > (a) kind of destroys the point of this as a sandboxing tool
> >
> > It is not so harmful to use memory that someone else has been charged
> > with allocating.
> >
> > But it is harmful to pin memory if someone else is charged for the
> > pin. It means it is unpredictable how much memory a sandbox can
> > actually lock down.
> >
> > Plus we have the double accounting problem, if 1000 processes in
> > different cgroups open the tmpfs and all pin the memory then cgroup A
> > will be charged 1000x for the memory and hit its limit, possibly
> > creating a DOS from less priv to more priv
>
> Let's hear what memcg people think about it. I'm not a fan of disassociating
> the ownership and locker of the same page but it is true that actively
> increasing locked consumption on a remote cgroup is awkward too.

The main purpose of all this is to support libvirt, so they need to
support (a) too.

(b) is what we have now and most closely emulates the way the RLIMIT
works.

Jason

2023-02-07 14:29:05

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account

On 2/6/23 6:03 PM, Alistair Popple wrote:
>
> Jens Axboe <[email protected]> writes:
>
>> On 2/6/23 12:47 AM, Alistair Popple wrote:
>>> Convert io_uring to use vm_account instead of directly charging pages
>>> against the user/mm. Rather than charge pages to both user->locked_vm
>>> and mm->pinned_vm this will only charge pages to user->locked_vm.
>>
>> Not sure how we're supposed to review this, when you just send us 9/19
>> and vm_account_release() is supposedly an earlier patch in this series.
>>
>> Either CC the whole series, or at least the cover letter, core parts,
>> and the per-subsystem parts.
>
> Ok, thanks. Will be sure to add everyone to the cover letter and patch
> 01 when I send the next version.
>
> For reference the cover letter is here:
>
> https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/
>
> And the core patch that introduces vm_account is here:
>
> https://lore.kernel.org/linux-mm/e80b61561f97296a6c08faeebe281cb949333d1d.1675669136.git-series.apopple@nvidia.com/
>
> No problem if you want to wait for the resend/next version before
> taking another look though.

Thanks, that helps. Like listed in the cover letter, I also have to
agree that this is badly named. It's way too generic, it needs to have a
name that tells you what it does. There's tons of accounting, you need
to be more specific.

Outside of that, we're now doubling the amount of memory associated with
tracking this. That isn't necessarily a showstopper, but it is not
ideal. I didn't take a look at the other conversions (again, because
they were not sent to me), but seems like the task_struct and flags
could just be passed in as they may very well be known to many/most
callers?

--
Jens Axboe


2023-02-07 14:56:09

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account

On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:

> Outside of that, we're now doubling the amount of memory associated with
> tracking this. That isn't necessarily a showstopper, but it is not
> ideal. I didn't take a look at the other conversions (again, because
> they were not sent to me), but seems like the task_struct and flags
> could just be passed in as they may very well be known to many/most
> callers?

For places doing the mm accounting type it cannot use the task struct
as the underlying mm can be replaced and keep the task, IIRC.

We just had a bug in VFIO related to this..

If we could go back from the mm to the task (even from a destroyed
mm though) that might work to reduce storage?

Jason


2023-02-07 17:05:47

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account

On 2/7/23 7:55 AM, Jason Gunthorpe wrote:
> On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:
>
>> Outside of that, we're now doubling the amount of memory associated with
>> tracking this. That isn't necessarily a showstopper, but it is not
>> ideal. I didn't take a look at the other conversions (again, because
>> they were not sent to me), but seems like the task_struct and flags
>> could just be passed in as they may very well be known to many/most
>> callers?
>
> For places doing the mm accounting type it cannot use the task struct
> as the underlying mm can be replaced and keep the task, IIRC.
>
> We just had a bug in VFIO related to this..
>
>> If we could go back from the mm to the task (even from a destroyed
> mm though) that might work to reduce storage?

Then maybe just nest them:

struct small_one {
struct mm_struct *mm;
struct user_struct *user;
};

struct big_one {
struct small_one foo;
struct task_struct *task;
enum vm_account_flags flags;
};

and have the real helpers deal with small_one, and wrappers around that
taking big_one that just passes in the missing bits. Then users that
don't need the extra bits can just use the right API.

--
Jens Axboe


2023-02-12 17:32:42

by Bernard Metzler

[permalink] [raw]
Subject: RE: [PATCH 05/19] RMDA/siw: Convert to use vm_account



> -----Original Message-----
> From: Alistair Popple <[email protected]>
> Sent: Monday, 6 February 2023 08:48
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Daniel Berrange <[email protected]>;
> Alex Williamson <[email protected]>; Alistair Popple
> <[email protected]>; Bernard Metzler <[email protected]>; Jason Gunthorpe
> <[email protected]>; Leon Romanovsky <[email protected]>; linux-
> [email protected]
> Subject: [EXTERNAL] [PATCH 05/19] RMDA/siw: Convert to use vm_account
>
> Convert to using a vm_account structure to account pinned memory to
> both the mm and the pins cgroup.
>
> Signed-off-by: Alistair Popple <[email protected]>
> Cc: Bernard Metzler <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Leon Romanovsky <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> drivers/infiniband/sw/siw/siw.h | 3 ++-
> drivers/infiniband/sw/siw/siw_mem.c | 21 +++++++--------------
> drivers/infiniband/sw/siw/siw_verbs.c | 15 ---------------
> 3 files changed, 9 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/infiniband/sw/siw/siw.h
> b/drivers/infiniband/sw/siw/siw.h
> index 2f3a9cd..6d4aabd 100644
> --- a/drivers/infiniband/sw/siw/siw.h
> +++ b/drivers/infiniband/sw/siw/siw.h
> @@ -13,6 +13,7 @@
> #include <crypto/hash.h>
> #include <linux/crc32.h>
> #include <linux/crc32c.h>
> +#include <linux/vm_account.h>
>
> #include <rdma/siw-abi.h>
> #include "iwarp.h"
> @@ -124,7 +125,7 @@ struct siw_umem {
> int num_pages;
> bool writable;
> u64 fp_addr; /* First page base address */
> - struct mm_struct *owning_mm;
> + struct vm_account vm_account;
> };
>
> struct siw_pble {
> diff --git a/drivers/infiniband/sw/siw/siw_mem.c
> b/drivers/infiniband/sw/siw/siw_mem.c
> index f51ab2c..be90121 100644
> --- a/drivers/infiniband/sw/siw/siw_mem.c
> +++ b/drivers/infiniband/sw/siw/siw_mem.c
> @@ -68,7 +68,6 @@ static void siw_free_plist(struct siw_page_chunk *chunk,
> int num_pages,
>
> void siw_umem_release(struct siw_umem *umem, bool dirty)
> {
> - struct mm_struct *mm_s = umem->owning_mm;
> int i, num_pages = umem->num_pages;
>
> for (i = 0; num_pages; i++) {
> @@ -79,9 +78,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
> kfree(umem->page_chunk[i].plist);
> num_pages -= to_free;
> }
> - atomic64_sub(umem->num_pages, &mm_s->pinned_vm);
> + vm_unaccount_pinned(&umem->vm_account, umem->num_pages);
> + vm_account_release(&umem->vm_account);
>
> - mmdrop(mm_s);
> kfree(umem->page_chunk);
> kfree(umem);
> }
> @@ -365,9 +364,7 @@ struct siw_pbl *siw_pbl_alloc(u32 num_buf)
> struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
> {
> struct siw_umem *umem;
> - struct mm_struct *mm_s;
> u64 first_page_va;
> - unsigned long mlock_limit;
> unsigned int foll_flags = FOLL_LONGTERM;
> int num_pages, num_chunks, i, rv = 0;
>
> @@ -385,20 +382,16 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> if (!umem)
> return ERR_PTR(-ENOMEM);
>
> - mm_s = current->mm;
> - umem->owning_mm = mm_s;
> umem->writable = writable;
>
> - mmgrab(mm_s);
> + vm_account_init_current(&umem->vm_account);
>
> if (writable)
> foll_flags |= FOLL_WRITE;
>
> - mmap_read_lock(mm_s);
> + mmap_read_lock(current->mm);
>
> - mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> -
> - if (atomic64_add_return(num_pages, &mm_s->pinned_vm) > mlock_limit) {
> + if (vm_account_pinned(&umem->vm_account, num_pages)) {
> rv = -ENOMEM;
> goto out_sem_up;
> }
> @@ -434,14 +427,14 @@ struct siw_umem *siw_umem_get(u64 start, u64 len,
> bool writable)
> }
> }
> out_sem_up:
> - mmap_read_unlock(mm_s);
> + mmap_read_unlock(current->mm);
>
> if (rv > 0)
> return umem;
>
> /* Adjust accounting for pages not pinned */
> if (num_pages)
> - atomic64_sub(num_pages, &mm_s->pinned_vm);
> + vm_unaccount_pinned(&umem->vm_account, num_pages);
>
> siw_umem_release(umem, false);
>
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c
> b/drivers/infiniband/sw/siw/siw_verbs.c
> index 906fde1..8fab009 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -1321,8 +1321,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64
> start, u64 len,
> struct siw_umem *umem = NULL;
> struct siw_ureq_reg_mr ureq;
> struct siw_device *sdev = to_siw_dev(pd->device);
> -
> - unsigned long mem_limit = rlimit(RLIMIT_MEMLOCK);
> int rv;
>
> siw_dbg_pd(pd, "start: 0x%pK, va: 0x%pK, len: %llu\n",
> @@ -1338,19 +1336,6 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64
> start, u64 len,
> rv = -EINVAL;
> goto err_out;
> }
> - if (mem_limit != RLIM_INFINITY) {
> - unsigned long num_pages =
> - (PAGE_ALIGN(len + (start & ~PAGE_MASK))) >> PAGE_SHIFT;
> - mem_limit >>= PAGE_SHIFT;
> -
> - if (num_pages > mem_limit - current->mm->locked_vm) {
> - siw_dbg_pd(pd, "pages req %lu, max %lu, lock %lu\n",
> - num_pages, mem_limit,
> - current->mm->locked_vm);
> - rv = -ENOMEM;
> - goto err_out;
> - }
> - }

Yes, makes sense. This double checking here and then again
in siw_umem_get() was just redundant. Thanks!

> umem = siw_umem_get(start, len, ib_access_writable(rights));
> if (IS_ERR(umem)) {
> rv = PTR_ERR(umem);
LGTM!
Reviewed-by: Bernard Metzler <[email protected]>

> --
> git-series 0.9.1

2023-02-13 11:32:33

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 09/19] io_uring: convert to use vm_account


Jens Axboe <[email protected]> writes:

> On 2/7/23 7:55 AM, Jason Gunthorpe wrote:
>> On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:
>>
>>> Outside of that, we're now doubling the amount of memory associated with
>>> tracking this. That isn't necessarily a showstopper, but it is not
>>> ideal. I didn't take a look at the other conversions (again, because
>>> they were not sent to me), but seems like the task_struct and flags
>>> could just be passed in as they may very well be known to many/most
>>> callers?
>>
>> For places doing the mm accounting type it cannot use the task struct
>> as the underlying mm can be replaced and keep the task, IIRC.
>>
>> We just had a bug in VFIO related to this..
>>
>> If we could go back from the mm to the task (even from a destroyed
>> mm though) that might work to reduce storage?

Yes, it's the going back from a destroyed mm that gets a bit murky. I
don't think it's a showstopper as we could probably keep track of that
when we destroy the mm but it seems like a fair amount of complexity to
save a smallish amount of memory.

However if we end up tacking this onto memcg instead then we could use
that to go back to the task and move any charges when the mm moves.

> Then maybe just nest them:
>
> struct small_one {
> struct mm_struct *mm;
> struct user_struct *user;
> };
>
> struct big_one {
> struct small_one foo;
> struct task_struct *task;
> enum vm_account_flags flags;
> };
>
> and have the real helpers deal with small_one, and wrappers around that
> taking big_one that just passes in the missing bits. Then users that
> don't need the extra bits can just use the right API.

Thanks for the suggestion, that should work, noting that we will have to
add a struct pins_cgroup pointer to struct small_one. It may also help
with a similar problem I was having with one of the networking
conversions.
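
To make the above concrete, the shape being converged on might look roughly like this. The struct and field names are taken from Jens's placeholder suggestion plus the pins_cgroup pointer Alistair mentions; the helper names and everything else are illustrative, not code from the series.

struct small_one {
	struct mm_struct *mm;
	struct user_struct *user;
	struct pins_cgroup *pins_cg;	/* cgroup charged at account time */
};

struct big_one {
	struct small_one foo;
	struct task_struct *task;
	enum vm_account_flags flags;
};

/* The real helpers would operate on the small struct... */
int small_account_pinned(struct small_one *s, unsigned long npages);
void small_unaccount_pinned(struct small_one *s, unsigned long npages);

/* ...with thin wrappers for callers that need the task and flags. */
static inline int big_account_pinned(struct big_one *b, unsigned long npages)
{
	return small_account_pinned(&b->foo, npages);
}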

2023-02-15 19:00:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Mon 06-02-23 14:32:37, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > (a) kind of destroys the point of this as a sandboxing tool
> >
> > It is not so harmful to use memory that someone else has been charged
> > with allocating.
> >
> > But it is harmful to pin memory if someone else is charged for the
> > pin. It means it is unpredictable how much memory a sandbox can
> > actually lock down.
> >
> > Plus we have the double accounting problem, if 1000 processes in
> > different cgroups open the tmpfs and all pin the memory then cgroup A
> > will be charged 1000x for the memory and hit its limit, possibly
> > creating a DOS from less priv to more priv
>
> Let's hear what memcg people think about it. I'm not a fan of disassociating
> the ownership and locker of the same page but it is true that actively
> increasing locked consumption on a remote cgroup is awkward too.

One thing that is not really clear to me is whether those pins do
actually have any "ownership". The interface itself doesn't talk about
anything like that and so it seems perfectly fine to unpin from a
completely different context than pinning. If there is no enforcement
then Tejun is right and relying on memcg ownership is likely the only
reliable way to use for tracking. The downside is sharing obviously but
this is the same problem we already deal with for shared pages.

Another thing that is not really clear to me is how the limit is
actually going to be used in practice. As there is no concept of a
reclaim for pins then I can imagine that it would be quite easy to
reach the hard limit and essentially DoS any further use of pins. Cross
cgroup pinning would make it even worse because it could become a DoS
vector very easily. Practically speaking what tends to be a corner case
in the memcg limit world would be the norm for a pin-based limit.

Or am I misunderstanding something here?
--
Michal Hocko
SUSE Labs

2023-02-15 19:07:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Wed, Feb 15, 2023 at 08:00:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-23 14:32:37, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > > (a) kind of destroys the point of this as a sandboxing tool
> > >
> > > It is not so harmful to use memory that someone else has been charged
> > > with allocating.
> > >
> > > But it is harmful to pin memory if someone else is charged for the
> > > pin. It means it is unpredictable how much memory a sandbox can
> > > actually lock down.
> > >
> > > Plus we have the double accounting problem, if 1000 processes in
> > > different cgroups open the tmpfs and all pin the memory then cgroup A
> > > will be charged 1000x for the memory and hit its limit, possibly
> > > creating a DOS from less priv to more priv
> >
> > Let's hear what memcg people think about it. I'm not a fan of disassociating
> > the ownership and locker of the same page but it is true that actively
> > increasing locked consumption on a remote cgroup is awkward too.
>
> One thing that is not really clear to me is whether those pins do
> actually have any "ownership".

In most cases the ownership traces back to a file descriptor. When the
file is closed the pin goes away.

> The interface itself doesn't talk about
> anything like that and so it seems perfectly fine to unpin from a
> completely different context than pinning.

Yes, conceivably the close of the FD can be in a totally different
process with a different cgroup.

> If there is no enforcement then Tejun is right and relying on memcg
> ownership is likely the only reliable way to use for tracking. The
> downside is sharing obviously but this is the same problem we
> already deal with for shared pages.

I think this does not work well because the owner in a memcg sense is
unrelated to the file descriptor which is the true owner.

So we can get cases where the pin is charged to the wrong cgroup which
is effectively fatal for sandboxing, IMHO.

> Another thing that is not really clear to me is how the limit is
> actually going to be used in practice. As there is no concept of a
> reclaim for pins then I can imagine that it would be quite easy to
> reach the hard limit and essentially DoS any further use of pins.

Yes, that is the purpose. It is to sandbox pin users to put some limit
on the effect they have on the full machine.

It replaces the rlimit mess that was doing the same thing.

> Cross cgroup pinning would make it even worse because it could
> become a DoS vector very easily. Practically speaking what tends to
> be a corner case in the memcg limit world would be the norm for a
> pin-based limit.

This is why the cgroup charged for the pin must be tightly linked to
some cgroup that is obviously connected to the creator of the FD
owning the pin.

Jason

2023-02-16 08:04:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Wed 15-02-23 15:07:05, Jason Gunthorpe wrote:
> On Wed, Feb 15, 2023 at 08:00:22PM +0100, Michal Hocko wrote:
> > On Mon 06-02-23 14:32:37, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Mon, Feb 06, 2023 at 07:40:55PM -0400, Jason Gunthorpe wrote:
> > > > (a) kind of destroys the point of this as a sandboxing tool
> > > >
> > > > It is not so harmful to use memory that someone else has been charged
> > > > with allocating.
> > > >
> > > > But it is harmful to pin memory if someone else is charged for the
> > > > pin. It means it is unpredictable how much memory a sandbox can
> > > > actually lock down.
> > > >
> > > > Plus we have the double accounting problem, if 1000 processes in
> > > > different cgroups open the tmpfs and all pin the memory then cgroup A
> > > > will be charged 1000x for the memory and hit its limit, possibly
> > > > creating a DOS from less priv to more priv
> > >
> > > Let's hear what memcg people think about it. I'm not a fan of disassociating
> > > the ownership and locker of the same page but it is true that actively
> > > increasing locked consumption on a remote cgroup is awkward too.
> >
> > One thing that is not really clear to me is whether those pins do
> > actually have any "ownership".
>
> In most cases the ownership traces back to a file descriptor. When the
> file is closed the pin goes away.

This assumes a specific use of {un}pin_user_page*, right? IIUC the
cgroup charging is meant to be used from vm_account but that doesn't
really tell anything about the lifetime nor the ownership. Maybe this is
just a matter of documentation update...

> > The interface itself doesn't talk about
> > anything like that and so it seems perfectly fine to unpin from a
> > completely different context than pinning.
>
> Yes, conceivably the close of the FD can be in a totally different
> process with a different cgroup.

Wouldn't you get unbalanced charges then? How can an admin recover from
that situation?

> > If there is no enforcement then Tejun is right and relying on memcg
> > ownership is likely the only reliable way to use for tracking. The
> > downside is sharing obviously but this is the same problem we
> > already deal with for shared pages.
>
> I think this does not work well because the owner in a memcg sense is
> unrelated to the file descriptor which is the true owner.
>
> So we can get cases where the pin is charged to the wrong cgroup which
> is effectively fatal for sandboxing, IMHO.

OK, I see. This makes it really much more complicated then.

> > Another thing that is not really clear to me is how the limit is
> > actually going to be used in practice. As there is no concept of a
> > reclaim for pins then I can imagine that it would be quite easy to
> > reach the hard limit and essentially DoS any further use of pins.
>
> Yes, that is the purpose. It is to sandbox pin users to put some limit
> on the effect they have on the full machine.
>
> It replaces the rlimit mess that was doing the same thing.

arguably rlimit has a concept of the owner at least AFAICS. I do realize
this is not really great wrt a high level resource control though.

> > Cross cgroup pinning would make it even worse because it could
> > become a DoS vector very easily. Practically speaking what tends to
> > be a corner case in the memcg limit world would be the norm for a
> > pin-based limit.
>
> This is why the cgroup charged for the pin must be tightly linked to
> some cgroup that is obviously connected to the creator of the FD
> owning the pin.

The problem I can see is that the fd is just too fluid for tracking. You
can pass the fd over to a different cgroup context and then all the tracking
just loses any trail to an owner.

I can see how the underlying memcg tracking information is not really
feasible for your usecases but I am really worried that it is just too
easy to misaccount without any other proper ownership tracking.
--
Michal Hocko
SUSE Labs

2023-02-16 11:02:44

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory

On 06.02.23 08:47, Alistair Popple wrote:
> Having large amounts of unmovable or unreclaimable memory in a system
> can lead to system instability due to increasing the likelihood of
> encountering out-of-memory conditions. Therefore it is desirable to
> limit the amount of memory users can lock or pin.
>
> From userspace such limits can be enforced by setting
> RLIMIT_MEMLOCK. However there is no standard method that drivers and
> other in-kernel users can use to check and enforce this limit.
>
> This has lead to a large number of inconsistencies in how limits are
> enforced. For example some drivers will use mm->locked_mm while others
> will use mm->pinned_mm or user->locked_mm. It is therefore possible to
> have up to three times RLIMIT_MEMLOCKED pinned.
>
> Having pinned memory limited per-task also makes it easy for users to
> exceed the limit. For example drivers that pin memory with
> pin_user_pages() it tends to remain pinned after fork. To deal with
> this and other issues this series introduces a cgroup for tracking and
> limiting the number of pages pinned or locked by tasks in the group.
>
> However the existing behaviour with regards to the rlimit needs to be
> maintained. Therefore the lesser of the two limits is
> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
> but this bypass is not allowed for the cgroup.
>
> The first part of this series converts existing drivers which
> open-code the use of locked_mm/pinned_mm over to a common interface
> which manages the refcounts of the associated task/mm/user
> structs. This ensures accounting of pages is consistent and makes it
> easier to add charging of the cgroup.
>
> The second part of the series adds the cgroup controller and converts
> core mm code such as mlock over to charging the cgroup before finally
> introducing some selftests.
>
> Rather than adding onto an exisiting cgroup controller such as memcg
> we introduce a new controller. This is primarily because we wish to
> limit the total number of pages tasks within a cgroup may pin/lock.
>
> As I don't have access to systems with all the various devices I
> haven't been able to test all driver changes. Any help there would be
> appreciated.
>
> Note that this series is based on v6.2-rc5 and
> https://lore.kernel.org/linux-rdma/[email protected]/
> which makes updating the siw driver easier (thanks Bernard).
>
> Changes from initial RFC:
>
> - Fixes to some driver error handling.
>
> - Pages charged with vm_account will always increment mm->pinned_vm
> and enforce the limit against user->locked_vm or mm->pinned_vm
> depending on initialisation flags.
>
> - Moved vm_account prototypes and struct definitions into a separate header.
>
> - Minor updates to commit messages and kernel docs (thanks to Jason,
> Christoph, Yosry and T.J.).
>
> Outstanding issues:
>
> - David H pointed out that the vm_account naming is potentially
> confusing and I agree. However I have yet to come up with something
> better so will rename this in a subsequent version of this series
> (suggestions welcome).


vm_lockaccount? vm_pinaccount?

Less confusing than reusing VM_ACCOUNT, which translates to "commit
accounting".

Might also make sense to rename VM_ACCOUNT to VM_COMMIT or something
like that.

--
Thanks,

David / dhildenb


2023-02-16 12:45:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 16, 2023 at 09:04:03AM +0100, Michal Hocko wrote:

> > In most cases the ownship traces back to a file descriptor. When the
> > file is closed the pin goes away.
>
> This assumes a specific use of {un}pin_user_page*, right? IIUC the
> cgroup charging is meant to be used from vm_account but that doesn't
> really tell anything about the lifetime nor the ownership. Maybe this is
> just a matter of documentation update...

Yes documentation.

> > > The interface itself doesn't talk about
> > > anything like that and so it seems perfectly fine to unpin from a
> > > completely different context then pinning.
> >
> > Yes, concievably the close of the FD can be in a totally different
> > process with a different cgroup.
>
> Wouldn't you get an unbalanced charges then? How can admin recover that
> situation?

No, the approach in this patch series captures the cgroup that was
charged and stores it in the FD until uncharge.

This is the same as we do today for rlimit. The user/process that is
charged is captured and the uncharge always applies to the user/process
that was charged, not the user/process that happens to be associated
with the uncharging context.

cgroups just add another option, so it is the user/process/cgroup that
can hold the charge.

It is conceptually similar to how each struct page has the memcg that
its allocation was charged to - we just record this in the FD not the
page.
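
A minimal sketch of that flow, with purely illustrative names (this is
not the series' actual API; the charge/uncharge helpers are hypothetical):

    /* Illustrative only: record the charged cgroup at pin time so the
     * uncharge always goes back to the same owner. */
    struct pin_charge {
            struct cgroup *charged;         /* cgroup charged at pin time */
            unsigned long npages;
    };

    static int fd_pin_charge(struct pin_charge *pc, unsigned long npages)
    {
            pc->charged = current_pins_cgroup();            /* hypothetical */
            pc->npages = npages;
            return pins_try_charge(pc->charged, npages);    /* hypothetical */
    }

    static void fd_pin_uncharge(struct pin_charge *pc)
    {
            /* Uses the recorded owner, not whoever happens to close the FD. */
            pins_uncharge(pc->charged, pc->npages);         /* hypothetical */
    }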

> > > Another thing that is not really clear to me is how the limit is
> > > actually going to be used in practice. As there is no concept of a
> > > reclaim for pins then I can imagine that it would be quite easy to
> > > reach the hard limit and essentially DoS any further use of pins.
> >
> > Yes, that is the purpose. It is to sandbox pin users to put some limit
> > on the effect they have on the full machine.
> >
> > It replaces the rlimit mess that was doing the same thing.
>
> arguably rlimit has a concept of the owner at least AFAICS. I do realize
> this is not really great wrt a high level resource control though.

rlimit uses either the user or the process as the "owner". In this
model we view a cgroup as the "owner". The lifetime logic is all the
same: you figure out the owner (cgroup/user/process) when the charge
is made and record it; when the uncharge comes, the recorded owner is
uncharged.

It never allows unbalanced charge/uncharge because that would be a
security problem even for rlimit cases today.

Jason

2023-02-21 16:51:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

Hello,

On Thu, Feb 16, 2023 at 08:45:38AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 16, 2023 at 09:04:03AM +0100, Michal Hocko wrote:
>
> > > In most cases the ownship traces back to a file descriptor. When the
> > > file is closed the pin goes away.
> >
> > This assumes a specific use of {un}pin_user_page*, right? IIUC the
> > cgroup charging is meant to be used from vm_account but that doesn't
> > really tell anything about the lifetime nor the ownership. Maybe this is
> > just a matter of documentation update...
>
> Yes documentation.
>
> > > > The interface itself doesn't talk about
> > > > anything like that and so it seems perfectly fine to unpin from a
> > > > completely different context then pinning.
> > >
> > > Yes, concievably the close of the FD can be in a totally different
> > > process with a different cgroup.
> >
> > Wouldn't you get an unbalanced charges then? How can admin recover that
> > situation?
>
> No, the approach in this patch series captures the cgroup that was
> charged and stores it in the FD until uncharge.
>
> This is the same as we do today for rlimit. The user/process that is
> charged is captured and the uncharge always applies to user/process
> that was charged, not the user/process that happens to be associated
> with the uncharging context.
>
> cgroup just add another option so it is user/process/cgroup that can
> hold the charge.
>
> It is conceptually similar to how each struct page has the memcg that
> its allocation was charged to - we just record this in the FD not the
> page.

I don't have a lot of context here, but it looks like the problem is
that the newly proposed controller is introducing an ownership discrepancy.
If memory which is pinned by a cgroup is to be charged to the owner of the
fd, the memory which isn't pinned should be charged to the memcg of the same
cgroup, right? It makes little sense to me to separate the owner of the
memory page and the pinner of it. They should be one and the same.

Alistair, can you please elaborate how these pages are allocated, charged
and used?

Thanks.

--
tejun

2023-02-21 17:27:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 06:51:48AM -1000, Tejun Heo wrote:
> cgroup, right? It makes little sense to me to separate the owner of the
> memory page and the pinner of it. They should be one and the same.

The owner and pinner are not always the same entity or we could just
use the page's cgroup.

Jason

2023-02-21 17:29:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 01:25:59PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 21, 2023 at 06:51:48AM -1000, Tejun Heo wrote:
> > cgroup, right? It makes little sense to me to separate the owner of the
> > memory page and the pinner of it. They should be one and the same.
>
> The owner and pinner are not always the same entity or we could just
> use the page's cgroup.

Yeah, so, what I'm trying to say is that that might be the source of the
problem. Is the current page ownership attribution correct given that the fd
for whatever reason is determining the pinning ownership or should the page
ownership be attributed the same way too? If they indeed need to differ,
that probably would need pretty strong justifications.

Thanks.

--
tejun

2023-02-21 17:51:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 07:29:18AM -1000, Tejun Heo wrote:
> On Tue, Feb 21, 2023 at 01:25:59PM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 21, 2023 at 06:51:48AM -1000, Tejun Heo wrote:
> > > cgroup, right? It makes little sense to me to separate the owner of the
> > > memory page and the pinner of it. They should be one and the same.
> >
> > The owner and pinner are not always the same entity or we could just
> > use the page's cgroup.
>
> Yeah, so, what I'm trying to say is that that might be the source of the
> problem. Is the current page ownership attribution correct

It should be correct.

This mechanism is driven by pin_user_page() (as it is the only API
that can actually create a pin), so the cgroup owner of the page is
broadly related to the "owner" of the VMA's inode.

The owner of the pin is the caller of pin_user_page(), which is
initiated by some FD/process that is not necessarily related to the
VMA's inode.

E.g. concretely, something like io_uring will do something like:

   buffer = mmap()                        <- Charge memcg for the pages
   fd = io_uring_setup(..)
   io_uring_register(fd, xx, buffer, ..); <- Charge the pincg for the pin

If mmap is a private anonymous VMA created by the same process then it
is likely the pages will have the same cgroup as io_uring_register()
and the FD.

Otherwise the page cgroup is unconstrained. MAP_SHARED mappings will
have the page cgroup point at whatever cgroup was first to allocate
the page for the VMA's inode.
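
For illustration only, here is roughly what that flow looks like from
userspace with liburing (buffer size and the lack of error handling are
just for brevity; nothing here is part of the series):

    #include <liburing.h>
    #include <sys/mman.h>

    int main(void)
    {
            struct io_uring ring;
            size_t len = 1 << 20;

            /* Private anonymous memory: the pages get charged to this
             * process's memcg when they are faulted in. */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            io_uring_queue_init(8, &ring, 0);

            struct iovec iov = { .iov_base = buf, .iov_len = len };
            /* pin_user_pages() happens under here; the pin would be
             * charged to this process's cgroup (the "pincg" above),
             * whatever cgroup owns the pages themselves. */
            io_uring_register_buffers(&ring, &iov, 1);

            io_uring_queue_exit(&ring);
            return 0;
    }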

AFAIK there are few real use cases to establish a pin on MAP_SHARED
mappings outside your cgroup. However, it is possible, the APIs allow
it, and for security sandbox purposes we can't allow a process inside
a cgroup to trigger a charge on a different cgroup. That breaks the
sandbox goal.

If memcg could support multiple owners then it would be logical that
the pinner would be one of the memcg owners.

> for whatever reason is determining the pinning ownership or should the page
> ownership be attributed the same way too? If they indeed need to differ,
> that probably would need pretty strong justifications.

It is inherent to how pin_user_pages() works. It is an API that
establishes pins on existing pages. There is nothing about it that says
who the page's memcg owner is.

I don't think we can do anything about this without breaking things.

Jason

2023-02-21 18:07:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

Hello,

On Tue, Feb 21, 2023 at 01:51:12PM -0400, Jason Gunthorpe wrote:
> > Yeah, so, what I'm trying to say is that that might be the source of the
> > problem. Is the current page ownership attribution correct
>
> It should be correct.
>
> This mechanism is driven by pin_user_page(), (as it is the only API
> that can actually create a pin) so the cgroup owner of the page is
> broadly related to the "owner" of the VMA's inode.
>
> The owner of the pin is the caller of pin_user_page(), which is
> initated by some FD/proces that is not necessarily related to the
> VMA's inode.
>
> Eg concretely, something like io_uring will do something like:
> buffer = mmap() <- Charge memcg for the pages
> fd = io_uring_setup(..)
> io_uring_register(fd,xx,buffer,..); <- Charge the pincg for the pin
>
> If mmap is a private anonymous VMA created by the same process then it
> is likely the pages will have the same cgroup as io_uring_register and
> the FD.
>
> Otherwise the page cgroup is unconstrained. MAP_SHARED mappings will
> have the page cgroup point at whatever cgroup was first to allocate
> the page for the VMA's inode.
>
> AFAIK there are few real use cases to establish a pin on MAP_SHARED
> mappings outside your cgroup. However, it is possible, the APIs allow
> it, and for security sandbox purposes we can't allow a process inside
> a cgroup to triger a charge on a different cgroup. That breaks the
> sandbox goal.

It seems broken anyway. Please consider the following scenario:

1. A is a tiny cgroup which only does streaming IOs and has memory.high of
128M, which is more than sufficient for its IO window. The last file it
streamed happened to be F, which was about 256M.

2. B is a much larger cgroup w/ a pin limit way above 256M. B pins the
entirety of F.

3. A now tries to stream another file but F is almost fully occupying its
memory allowance and can't be evicted. A keeps thrashing due to lack of
memory and isolation is completely broken.

This stems directly from page ownership and pin accounting discrepancy.

> If memcg could support multiple owners then it would be logical that
> the pinner would be one of the memcg owners.
>
> > for whatever reason is determining the pinning ownership or should the page
> > ownership be attributed the same way too? If they indeed need to differ,
> > that probably would need pretty strong justifications.
>
> It is inherent to how pin_user_pages() works. It is an API that
> establishs pins on existing pages. There is nothing about it that says
> who the page's memcg owner is.
>
> I don't think we can do anything about this without breaking things.

That's a discrepancy in an internal interface and we don't wanna codify
something like that into a userspace interface. Semantically, it seems like
if pin_user_pages() wants to charge pinning to the cgroup associated with an
fd (or whatever), it should also claim the ownership of the pages themselves.
I have no idea how feasible that'd be from the memcg POV tho. Given that this
would be a fairly cold path (in most cases, the ownership should already
match), maybe it won't be too bad?

Thanks.

--
tejun

2023-02-21 19:26:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 08:07:13AM -1000, Tejun Heo wrote:
> > AFAIK there are few real use cases to establish a pin on MAP_SHARED
> > mappings outside your cgroup. However, it is possible, the APIs allow
> > it, and for security sandbox purposes we can't allow a process inside
> > a cgroup to triger a charge on a different cgroup. That breaks the
> > sandbox goal.
>
> It seems broken anyway. Please consider the following scenario:

Yes, this is already broken in this way today - memcg doesn't work
entirely perfectly for MAP_SHARED scenarios, IMHO.

> > > for whatever reason is determining the pinning ownership or should the page
> > > ownership be attributed the same way too? If they indeed need to differ,
> > > that probably would need pretty strong justifications.
> >
> > It is inherent to how pin_user_pages() works. It is an API that
> > establishs pins on existing pages. There is nothing about it that says
> > who the page's memcg owner is.
> >
> > I don't think we can do anything about this without breaking things.
>
> That's a discrepancy in an internal interface and we don't wanna codify
> something like that into userspace interface. Semantially, it seems like if
> pin_user_pages() wanna charge pinning to the cgroup associated with an fd
> (or whatever), it should also claim the ownership of the pages
> themselves.

Multiple cgroups can pin the same page, so it is not as simple as just
transferring ownership; we would need multi-ownership and to really fix
the memcg limitations with MAP_SHARED without an API impact.

You are right that pinning is really just a special case of
allocation, but there is a reason the memcg was left with weak support
for MAP_SHARED, and changing that may be more than just hard but an
infeasible trade-off.

At least I don't have a good idea how to even approach building a
reasonable data structure that can track the number of
charges per-cgroup per page. :\

Jason

2023-02-21 19:45:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

Hello,

On Tue, Feb 21, 2023 at 03:26:33PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 21, 2023 at 08:07:13AM -1000, Tejun Heo wrote:
> > > AFAIK there are few real use cases to establish a pin on MAP_SHARED
> > > mappings outside your cgroup. However, it is possible, the APIs allow
> > > it, and for security sandbox purposes we can't allow a process inside
> > > a cgroup to triger a charge on a different cgroup. That breaks the
> > > sandbox goal.
> >
> > It seems broken anyway. Please consider the following scenario:
>
> Yes, this is broken like this already today - memcg doesn't work
> entirely perfectly for MAP_SHARED scenarios, IMHO.

It is far from perfect but the existing behavior isn't that broken. e.g. in
the same scenario, without pinning, even if the larger cgroup keeps using
the same page, the smaller cgroup should be able to evict the pages as they
are not pinned and the cgroup is under heavy reclaim pressure. The larger
cgroup will refault them back in and end up owning those pages.

memcg can't capture the case of the same pages being actively shared by
multiple cgroups concurrently (I think those cases should be handled by
pushing them to the common parent as discussed elsewhere but that's a
separate topic) but it can converge when page usage transfers across cgroups
if needed. Disassociating ownership and pinning will break that in an
irreversible way.

> > > > for whatever reason is determining the pinning ownership or should the page
> > > > ownership be attributed the same way too? If they indeed need to differ,
> > > > that probably would need pretty strong justifications.
> > >
> > > It is inherent to how pin_user_pages() works. It is an API that
> > > establishs pins on existing pages. There is nothing about it that says
> > > who the page's memcg owner is.
> > >
> > > I don't think we can do anything about this without breaking things.
> >
> > That's a discrepancy in an internal interface and we don't wanna codify
> > something like that into userspace interface. Semantially, it seems like if
> > pin_user_pages() wanna charge pinning to the cgroup associated with an fd
> > (or whatever), it should also claim the ownership of the pages
> > themselves.
>
> Multiple cgroup can pin the same page, so it is not as simple as just
> transfering ownership, we need multi-ownership and to really fix the
> memcg limitations with MAP_SHARED without an API impact.
>
> You are right that pinning is really just a special case of
> allocation, but there is a reason the memcg was left with weak support
> for MAP_SHARED and changing that may be more than just hard but an
> infeasible trade off..
>
> At least I don't have a good idea how to even approach building a
> reasonable datstructure that can track the number of
> charges per-cgroup per page. :\

As I wrote above, I don't think the problem here is the case of pages being
shared by multiple cgroups concurrently. We can leave that problem for
another thread. However, if we want to support accounting and control of
pinned memory, we really shouldn't introduce a fundamental discrepancy like
the owner and pinner disagreeing with each other. At least conceptually, the
solution is rather straightforward - whoever pins a page should also claim
the ownership of it.

Thanks.

--
tejun

2023-02-21 19:49:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 09:45:15AM -1000, Tejun Heo wrote:
> > Multiple cgroup can pin the same page, so it is not as simple as just
> > transfering ownership, we need multi-ownership and to really fix the
> > memcg limitations with MAP_SHARED without an API impact.
> >
> > You are right that pinning is really just a special case of
> > allocation, but there is a reason the memcg was left with weak support
> > for MAP_SHARED and changing that may be more than just hard but an
> > infeasible trade off..
> >
> > At least I don't have a good idea how to even approach building a
> > reasonable datstructure that can track the number of
> > charges per-cgroup per page. :\
>
> As I wrote above, I don't think the problem here is the case of pages being
> shared by multiple cgroups concurrently. We can leave that problem for
> another thread. However, if we want to support accounting and control of
> pinned memory, we really shouldn't introduce a fundmental discrepancy like
> the owner and pinner disagreeing with each other. At least conceptually, the
> solution is rather straight-forward - whoever pins a page should also claim
> the ownership of it.

Ah, sorry, I missed the part about multiple cgroups pinning the same page.
Yeah, I can't think of a good answer for that.

Thanks.

--
tejun

2023-02-21 19:58:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Tue, Feb 21, 2023 at 09:45:15AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Feb 21, 2023 at 03:26:33PM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 21, 2023 at 08:07:13AM -1000, Tejun Heo wrote:
> > > > AFAIK there are few real use cases to establish a pin on MAP_SHARED
> > > > mappings outside your cgroup. However, it is possible, the APIs allow
> > > > it, and for security sandbox purposes we can't allow a process inside
> > > > a cgroup to triger a charge on a different cgroup. That breaks the
> > > > sandbox goal.
> > >
> > > It seems broken anyway. Please consider the following scenario:
> >
> > Yes, this is broken like this already today - memcg doesn't work
> > entirely perfectly for MAP_SHARED scenarios, IMHO.
>
> It is far from perfect but the existing behavior isn't that broken. e.g. in
> the same scenario, without pinning, even if the larger cgroup keeps using
> the same page, the smaller cgroup should be able to evict the pages as they
> are not pinned and the cgroup is under heavy reclaim pressure. The larger
> cgroup will refault them back in and end up owning those pages.
>
> memcg can't capture the case of the same pages being actively shared by
> multiple cgroups concurrently (I think those cases should be handled by
> pushing them to the common parent as discussed elswhere but that's a
> separate topic) but it can converge when page usage transfers across cgroups
> if needed. Disassociating ownership and pinning will break that in an
> irreversible way.

It is already disassociated - memcg is broken as you describe today
with pin_user_pages().

If you want to fix that, then we need to redefine how memcg works with
pin_user_pages() and I'm open to ideas.

> the owner and pinner disagreeing with each other. At least
> conceptually, the solution is rather straight-forward - whoever pins
> a page should also claim the ownership of it.

If the answer is that the pinner is the owner, then multi-pinners must
mean multi-owner too. We probably can't block multi-pinners without
causing uAPI problems.

You are not wrong on any of these remarks, but this loses sight of
the point - it is to take the existing broken RLIMIT scheme and make it
incrementally better by being the same broken scheme, just with
cgroups.

If we eventually fix everything so memcg can do multi-pinners/owners
then would it be reasonable to phase out the new pincg at that time?

Jason

2023-02-22 11:42:15

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory


Jason Gunthorpe <[email protected]> writes:

>> the owner and pinner disagreeing with each other. At least
>> conceptually, the solution is rather straight-forward - whoever pins
>> a page should also claim the ownership of it.
>
> If the answer is pinner is owner, then multi-pinners must mean
> multi-owner too. We probably can't block multi-pinner without causing
> uAPI problems.

It seems the problem is how to track multiple pinners of the page. At
the moment memcg ownership is stored in folio->memcg_data which
basically points to the owning struct mem_cgroup *.

For pinning this series already introduces this data structure:

struct vm_account {
        struct task_struct *task;
        struct mm_struct *mm;
        struct user_struct *user;
        enum vm_account_flags flags;
};

We could modify it to something like:

struct vm_account {
        struct list_head possible_pinners;
        struct mem_cgroup *memcg;
        [...]
};

When a page is pinned the first pinner takes ownership and stores its
memcg there, updating memcg_data to point to it. This would require a
new page_memcg_data_flags but I think we have one left. Subsequent
pinners create a vm_account and add it to the pinners list.

When a driver unpins a page we scan the pinners list and assign
ownership to the next driver pinning the page by updating memcg_data and
removing the vm_account from the list.

The problem with this approach is that each pinner (i.e. struct vm_account)
may cover different subsets of pages. Drivers have to store a list of
pinned pages somewhere, so we could query drivers or store the list of
pinned pages in the vm_account. That seems like a fair bit of overhead
though and would make unpinning expensive as we'd have to traverse
several lists.

We'd also have to ensure possible owners had enough free memory in the
owning memcg to accept having the page charged when another pinner
unpins. That could be done by reserving space during pinning.

And of course it only works for pin_user_pages() - other users don't
always have a list of pages conveniently available although I suppose
they could walk the rmap, but again that adds overhead. So not sure it's
a great solution but figured I'd leave it here in case it triggers other
ideas.
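
A very rough sketch of that hand-off, using hypothetical helper names
(how folio->memcg_data and the charge actually get transferred is
hand-waved here):

    struct vm_account {
            struct list_head possible_pinners; /* other accounts pinning these pages */
            struct mem_cgroup *memcg;          /* memcg charged for this account's pins */
            /* ... */
    };

    /* Hypothetical: called when @acct drops its pin on @page. */
    static void vm_account_unpin(struct vm_account *acct, struct page *page)
    {
            struct vm_account *next;

            if (page_owner_is(page, acct->memcg) &&          /* hypothetical */
                !list_empty(&acct->possible_pinners)) {
                    next = list_first_entry(&acct->possible_pinners,
                                            struct vm_account, possible_pinners);
                    /* Hand ownership (and the memcg charge) to the next
                     * pinner; hypothetical helper, charging glossed over. */
                    transfer_page_owner(page, next->memcg);
            }
            unpin_user_page(page);
    }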

> You are not wrong on any of these remarks, but this looses sight of
> the point - it is take the existing broken RLIMIT scheme and make it
> incrementally better by being the same broken scheme just with
> cgroups.

Right. RLIMIT_MEMLOCK is pretty broken because most uses enforce it
against a specific task so it can be easily bypassed. The aim here was
to make it at least possible to enforce a meaningful limit.

> If we eventually fix everything so memcg can do multi-pinners/owners
> then would it be reasonable to phase out the new pincg at that time?
>
> Jason


2023-02-22 12:57:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> When a driver unpins a page we scan the pinners list and assign
> ownership to the next driver pinning the page by updating memcg_data and
> removing the vm_account from the list.

I don't see how this works with just the data structure you outlined??
Every unique page needs its own list_head in the vm_account; it is
doable, just incredibly costly.

Jason

2023-02-22 23:11:01

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory


Jason Gunthorpe <[email protected]> writes:

> On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
>> When a driver unpins a page we scan the pinners list and assign
>> ownership to the next driver pinning the page by updating memcg_data and
>> removing the vm_account from the list.
>
> I don't see how this works with just the data structure you outlined??
> Every unique page needs its own list_head in the vm_account, it is
> doable just incredibly costly.

The idea was every driver already needs to allocate a pages array to
pass to pin_user_pages(), and by necessity drivers have to keep a
reference to the contents of that in one form or another. So
conceptually the equivalent of:

struct vm_account {
        struct list_head possible_pinners;
        struct mem_cgroup *memcg;
        struct page **pages;
        [...]
};

Unpinning involves finding a new owner by traversing the list of
page->memcg_data->possible_pinners and iterating over *pages[] to figure
out if that vm_account actually has this page pinned or not and could
own it.

Agree this is costly though. And I don't think all drivers keep the
array around so "iterating over *pages[]" may need to be a callback.

> Jason


2023-02-23 00:06:08

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
> The idea was every driver already needs to allocate a pages array to
> pass to pin_user_pages(), and by necessity drivers have to keep a
> reference to the contents of that in one form or another. So
> conceptually the equivalent of:
>
> struct vm_account {
> struct list_head possible_pinners;
> struct mem_cgroup *memcg;
> struct pages **pages;
> [...]
> };
>
> Unpinnig involves finding a new owner by traversing the list of
> page->memcg_data->possible_pinners and iterating over *pages[] to figure
> out if that vm_account actually has this page pinned or not and could
> own it.
>
> Agree this is costly though. And I don't think all drivers keep the
> array around so "iterating over *pages[]" may need to be a callback.

Is pinning in this context referring to FOLL_LONGTERM pins or any
FOLL_PIN? In the latter case block based direct I/O does not keep
the pages array around, and also is absolutely not willing to pay
for the overhead.

For FOLL_LONGTERM the scheme sounds vaguely reasonable to me.

2023-02-23 00:36:00

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory


Christoph Hellwig <[email protected]> writes:

> On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
>> The idea was every driver already needs to allocate a pages array to
>> pass to pin_user_pages(), and by necessity drivers have to keep a
>> reference to the contents of that in one form or another. So
>> conceptually the equivalent of:
>>
>> struct vm_account {
>> struct list_head possible_pinners;
>> struct mem_cgroup *memcg;
>> struct pages **pages;
>> [...]
>> };
>>
>> Unpinnig involves finding a new owner by traversing the list of
>> page->memcg_data->possible_pinners and iterating over *pages[] to figure
>> out if that vm_account actually has this page pinned or not and could
>> own it.
>>
>> Agree this is costly though. And I don't think all drivers keep the
>> array around so "iterating over *pages[]" may need to be a callback.
>
> Is pinning in this context referring to FOLL_LONGTERM pins or any
> FOLL_PIN? In the latter case block based direct I/O does not keep
> the pages array around, and also is absolutely not willing to pay
> for the overhead.

Good point. I was primarily targeting FOLL_LONGTERM users. I'm not too
familiar with block based direct I/O but from what I can tell it
currently doesn't respect any kind of RLIMIT anyway, so I guess the
requirement to limit pinned pages there isn't so relevant.

2023-02-23 01:54:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
>
> Jason Gunthorpe <[email protected]> writes:
>
> > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> >> When a driver unpins a page we scan the pinners list and assign
> >> ownership to the next driver pinning the page by updating memcg_data and
> >> removing the vm_account from the list.
> >
> > I don't see how this works with just the data structure you outlined??
> > Every unique page needs its own list_head in the vm_account, it is
> > doable just incredibly costly.
>
> The idea was every driver already needs to allocate a pages array to
> pass to pin_user_pages(), and by necessity drivers have to keep a
> reference to the contents of that in one form or another. So
> conceptually the equivalent of:
>
> struct vm_account {
> struct list_head possible_pinners;
> struct mem_cgroup *memcg;
> struct pages **pages;
> [...]
> };
>
> Unpinnig involves finding a new owner by traversing the list of
> page->memcg_data->possible_pinners and iterating over *pages[] to figure
> out if that vm_account actually has this page pinned or not and could
> own it.

Oh, you are focusing on Tejun's DOS scenario.

The DOS problem is to prevent a pin user in cgroup A from keeping
memory charged to cgroup B that it isn't using any more.

cgroup B doesn't need to be pinning the memory, it could just be
normal VMAs and "isn't using anymore" means it has unmapped all the
VMAs.

Solving that problem means figuring out when every cgroup stops using
the memory - pinning or not. That seems to be very costly.

AFAIK this problem also already exists today as the memcg of a page
doesn't change while it is pinned. So maybe we don't need to address
it.

Arguably the pins are not the problem. If we want to treat the pin
like an allocation then we simply charge the non-owning memcgs for the
pin as though it was an allocation. E.g. go over every page and, if the
owning memcg is not the current memcg, then charge the current memcg
for an allocation of the MAP_SHARED memory. Undoing this is trivial
enough.
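
A sketch of what that could look like; mem_cgroup_charge_pin() and
mem_cgroup_uncharge_pin() are hypothetical helpers, not existing memcg
API:

    /* Charge the pinning (current) memcg for every page it pins that is
     * owned by some other memcg, as if it had allocated the page itself. */
    static int charge_foreign_pins(struct mem_cgroup *pinner,
                                   struct page **pages, unsigned long npages)
    {
            unsigned long i;

            for (i = 0; i < npages; i++) {
                    if (page_memcg(pages[i]) == pinner)
                            continue;       /* already our own page */
                    if (mem_cgroup_charge_pin(pinner, pages[i]))    /* hypothetical */
                            goto undo;
            }
            return 0;

    undo:
            /* Undoing is simple because we know exactly what we charged. */
            while (i--)
                    if (page_memcg(pages[i]) != pinner)
                            mem_cgroup_uncharge_pin(pinner, pages[i]); /* hypothetical */
            return -ENOMEM;
    }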

This doesn't fix the DOS problem but it does sort of harmonize the pin
accounting with the memcg by multi-accounting every pin of a
MAP_SHARED page.

The other drawback is that this isn't the same thing as the current
rlimit. The rlimit is largely restricting the creation of unmovable
memory.

Though, AFAICT memcg seems to bundle unmovable memory (eg GFP_KERNEL)
along with movable user pages so it would be self-consistent.

I'm unclear if this is OK for libvirt.

> Agree this is costly though. And I don't think all drivers keep the
> array around so "iterating over *pages[]" may need to be a callback.

I think searching lists of pages is not reasonable. Things like VFIO &
KVM use cases effectively pin 90% of all system memory, that is
potentially TB of page lists that might need linear searching!

Jason

2023-02-23 09:13:23

by Daniel P. Berrangé

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Wed, Feb 22, 2023 at 09:53:56PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
> >
> > Jason Gunthorpe <[email protected]> writes:
> >
> > > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> > >> When a driver unpins a page we scan the pinners list and assign
> > >> ownership to the next driver pinning the page by updating memcg_data and
> > >> removing the vm_account from the list.
> > >
> > > I don't see how this works with just the data structure you outlined??
> > > Every unique page needs its own list_head in the vm_account, it is
> > > doable just incredibly costly.
> >
> > The idea was every driver already needs to allocate a pages array to
> > pass to pin_user_pages(), and by necessity drivers have to keep a
> > reference to the contents of that in one form or another. So
> > conceptually the equivalent of:
> >
> > struct vm_account {
> > struct list_head possible_pinners;
> > struct mem_cgroup *memcg;
> > struct pages **pages;
> > [...]
> > };
> >
> > Unpinnig involves finding a new owner by traversing the list of
> > page->memcg_data->possible_pinners and iterating over *pages[] to figure
> > out if that vm_account actually has this page pinned or not and could
> > own it.
>
> Oh, you are focusing on Tejun's DOS scenario.
>
> The DOS problem is to prevent a pin users in cgroup A from keeping
> memory charged to cgroup B that it isn't using any more.
>
> cgroup B doesn't need to be pinning the memory, it could just be
> normal VMAs and "isn't using anymore" means it has unmapped all the
> VMAs.
>
> Solving that problem means figuring out when every cgroup stops using
> the memory - pinning or not. That seems to be very costly.
>
> AFAIK this problem also already exists today as the memcg of a page
> doesn't change while it is pinned. So maybe we don't need to address
> it.
>
> Arguably the pins are not the problem. If we want to treat the pin
> like allocation then we simply charge the non-owning memcg's for the
> pin as though it was an allocation. Eg go over every page and if the
> owning memcg is not the current memcg then charge the current memcg
> for an allocation of the MAP_SHARED memory. Undoing this is trivial
> enoug.
>
> This doesn't fix the DOS problem but it does sort of harmonize the pin
> accounting with the memcg by multi-accounting every pin of a
> MAP_SHARED page.
>
> The other drawback is that this isn't the same thing as the current
> rlimit. The rlimit is largely restricting the creation of unmovable
> memory.
>
> Though, AFAICT memcg seems to bundle unmovable memory (eg GFP_KERNEL)
> along with movable user pages so it would be self-consistent.
>
> I'm unclear if this is OK for libvirt..

I'm not sure what exact scenario you're thinking of when talking
about two distinct cgroups and its impact on libvirt. Nonetheless,
here's a rough summary of libvirt's approach to cgroups and memory.

On the libvirt side, we create a single cgroup per VM, in which lives
at least the QEMU process, but possibly some additional per-VM helper
processes (swtpm for TPM, sometimes slirp/passt for NIC, etc).

Potentially there are externally managed processes that are handling
some resources on behalf of the VM. These might be a single centralized
daemon handling work for many VMs, or might be per VM services. Either
way, since they are externally managed, their setup and usage of cgroups
is completely opaque to libvirt.

Libvirt is only concerned with the 1 cgroup per VM that it creates and
manages. Its goal is to protect the host OS from a misbehaving guest
OS/compromised QEMU.

The memory limits we can set on VMs are somewhat limited. In general
we prefer to avoid setting any hard per-VM memory cap by default.
QEMU's worst-case memory usage is incredibly hard to predict, because
of the extremely broad range of possible configurations and opaque
behaviour/usage from ELF libraries it uses. Every time anyone has
tried hard memory caps, we've ended up with VMs being incorrectly
killed because they genuinely wanted more memory than anticipated
by the algorithm.

To protect the host OS, I tend to suggest mgmt apps/admins set a
hard memory limit across all VMs in aggregate, e.g. at /machine.slice,
instead of per-VM. This aims to make it possible to ensure that
the host OS always has some memory reserved for its own system
services, while allowing the individual VMs to battle it out
between themselves.

We do still have to apply some tuning for VFIO, around what amount
of memory it is allowed to lock, but that is not so bad as we just
need to allow it to lock guest RAM, which is known, plus a finite extra
amount, so we don't need to take account of all of QEMU's memory
allocations in general. This is all still just in the context of one
cgroup though, at least as far as libvirt is aware. Any other
cgroups involved are opaque to libvirt, and not our concern as long
as QEMU's cgroup is preventing QEMU's misbehaviour as configured.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|


2023-02-23 17:18:41

by T.J. Mercier

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Wed, Feb 22, 2023 at 5:54 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
> >
> > Jason Gunthorpe <[email protected]> writes:
> >
> > > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> > >> When a driver unpins a page we scan the pinners list and assign
> > >> ownership to the next driver pinning the page by updating memcg_data and
> > >> removing the vm_account from the list.
> > >
> > > I don't see how this works with just the data structure you outlined??
> > > Every unique page needs its own list_head in the vm_account, it is
> > > doable just incredibly costly.
> >
> > The idea was every driver already needs to allocate a pages array to
> > pass to pin_user_pages(), and by necessity drivers have to keep a
> > reference to the contents of that in one form or another. So
> > conceptually the equivalent of:
> >
> > struct vm_account {
> > struct list_head possible_pinners;
> > struct mem_cgroup *memcg;
> > struct pages **pages;
> > [...]
> > };
> >
> > Unpinnig involves finding a new owner by traversing the list of
> > page->memcg_data->possible_pinners and iterating over *pages[] to figure
> > out if that vm_account actually has this page pinned or not and could
> > own it.
>
> Oh, you are focusing on Tejun's DOS scenario.
>
> The DOS problem is to prevent a pin users in cgroup A from keeping
> memory charged to cgroup B that it isn't using any more.
>
> cgroup B doesn't need to be pinning the memory, it could just be
> normal VMAs and "isn't using anymore" means it has unmapped all the
> VMAs.
>
> Solving that problem means figuring out when every cgroup stops using
> the memory - pinning or not. That seems to be very costly.
>
This is the current behavior of accounting for memfds, and I suspect
any kind of shared memory.

If cgroup A creates a memfd, maps and faults in pages, shares the
memfd with cgroup B and then A unmaps and closes the memfd, then
cgroup A is still charged for the pages it faulted in.

FWIW this is also the behavior I was trying to use to attribute
dma-buffers to their original allocators. Whoever touches it first
gets charged as long as the memory is alive somewhere.

Can't we do the same thing for pins?

> AFAIK this problem also already exists today as the memcg of a page
> doesn't change while it is pinned. So maybe we don't need to address
> it.
>
> Arguably the pins are not the problem. If we want to treat the pin
> like allocation then we simply charge the non-owning memcg's for the
> pin as though it was an allocation. Eg go over every page and if the
> owning memcg is not the current memcg then charge the current memcg
> for an allocation of the MAP_SHARED memory. Undoing this is trivial
> enoug.
>
> This doesn't fix the DOS problem but it does sort of harmonize the pin
> accounting with the memcg by multi-accounting every pin of a
> MAP_SHARED page.
>
> The other drawback is that this isn't the same thing as the current
> rlimit. The rlimit is largely restricting the creation of unmovable
> memory.
>
> Though, AFAICT memcg seems to bundle unmovable memory (eg GFP_KERNEL)
> along with movable user pages so it would be self-consistent.
>
> I'm unclear if this is OK for libvirt..
>
> > Agree this is costly though. And I don't think all drivers keep the
> > array around so "iterating over *pages[]" may need to be a callback.
>
> I think searching lists of pages is not reasonable. Things like VFIO &
> KVM use cases effectively pin 90% of all system memory, that is
> potentially TB of page lists that might need linear searching!
>
> Jason

2023-02-23 17:28:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 09:18:23AM -0800, T.J. Mercier wrote:

> > Solving that problem means figuring out when every cgroup stops using
> > the memory - pinning or not. That seems to be very costly.
> >
> This is the current behavior of accounting for memfds, and I suspect
> any kind of shared memory.
>
> If cgroup A creates a memfd, maps and faults in pages, shares the
> memfd with cgroup B and then A unmaps and closes the memfd, then
> cgroup A is still charged for the pages it faulted in.

As we discussed, as long as the memory is swappable then eventually
memory pressure on cgroup A will evict the memfd pages and then cgroup
B will swap them in and be charged for them.

> FWIW this is also the behavior I was trying to use to attribute
> dma-buffers to their original allocators. Whoever touches it first
> gets charged as long as the memory is alive somewhere.
>
> Can't we do the same thing for pins?

If pins are tracked independently from memcg then definitely not;
a process in cgroup A should never be able to make a charge on cgroup
B as a matter of security.

If pins are part of the memcg then we can't always turn the pin
request into a NOP - the current cgroup always has to be charged for
the memory. Otherwise what is the point from a security perspective?

Jason

2023-02-23 17:31:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 09:12:19AM +0000, Daniel P. Berrangé wrote:

> The memory limits we can set on VMs are somewhat limited. In general
> we prefer to avoid setting any hard per-VM memory cap by default.

So if you don't use hard limits on the memcg...

But to do this with a hard limit:

> We do still have to apply some tuning for VFIO, around what amount
> of memory it is allowed to lock, but that is not so bad as we just
> need to allow it to lock guest RAM which is known + an finite extra
> amount, so don't need to take account of all of QEMU's memory
> allocations in general.

Will need its own controller, right?

Jason

2023-02-23 18:04:35

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 9:28 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 09:18:23AM -0800, T.J. Mercier wrote:
>
> > > Solving that problem means figuring out when every cgroup stops using
> > > the memory - pinning or not. That seems to be very costly.
> > >
> > This is the current behavior of accounting for memfds, and I suspect
> > any kind of shared memory.
> >
> > If cgroup A creates a memfd, maps and faults in pages, shares the
> > memfd with cgroup B and then A unmaps and closes the memfd, then
> > cgroup A is still charged for the pages it faulted in.
>
> As we discussed, as long as the memory is swappable then eventually
> memory pressure on cgroup A will evict the memfd pages and then cgroup
> B will swap it in and be charged for it.

I am not familiar with memfd, but based on
mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
the pages they will remain charged to cgroup A, unless cgroup A is
removed/offlined. Am I missing something?

>
> > FWIW this is also the behavior I was trying to use to attribute
> > dma-buffers to their original allocators. Whoever touches it first
> > gets charged as long as the memory is alive somewhere.
> >
> > Can't we do the same thing for pins?
>
> If pins are tracked independently from memcg then definately not,
> a process in cgroup A should never be able to make a charge on cgroup
> B as a matter of security.
>
> If pins are part of the memcg then we can't always turn the pin
> request in to a NOP - the current cgroup always has to be charged for
> the memory. Otherwise what is the point from a security perspective?
>
> Jason

2023-02-23 18:11:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 10:03:50AM -0800, Yosry Ahmed wrote:
> On Thu, Feb 23, 2023 at 9:28 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023 at 09:18:23AM -0800, T.J. Mercier wrote:
> >
> > > > Solving that problem means figuring out when every cgroup stops using
> > > > the memory - pinning or not. That seems to be very costly.
> > > >
> > > This is the current behavior of accounting for memfds, and I suspect
> > > any kind of shared memory.
> > >
> > > If cgroup A creates a memfd, maps and faults in pages, shares the
> > > memfd with cgroup B and then A unmaps and closes the memfd, then
> > > cgroup A is still charged for the pages it faulted in.
> >
> > As we discussed, as long as the memory is swappable then eventually
> > memory pressure on cgroup A will evict the memfd pages and then cgroup
> > B will swap it in and be charged for it.
>
> I am not familiar with memfd, but based on
> mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
> the pages they will remain charged to cgroup A, unless cgroup A is
> removed/offlined. Am I missing something?

Ah, I don't know, Tejun said:

"but it can converge when page usage transfers across cgroups
if needed."

Which I assumed was swap related but I don't know how convergence
works.

Jason

2023-02-23 18:14:47

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 10:11 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 10:03:50AM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 23, 2023 at 9:28 AM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Thu, Feb 23, 2023 at 09:18:23AM -0800, T.J. Mercier wrote:
> > >
> > > > > Solving that problem means figuring out when every cgroup stops using
> > > > > the memory - pinning or not. That seems to be very costly.
> > > > >
> > > > This is the current behavior of accounting for memfds, and I suspect
> > > > any kind of shared memory.
> > > >
> > > > If cgroup A creates a memfd, maps and faults in pages, shares the
> > > > memfd with cgroup B and then A unmaps and closes the memfd, then
> > > > cgroup A is still charged for the pages it faulted in.
> > >
> > > As we discussed, as long as the memory is swappable then eventually
> > > memory pressure on cgroup A will evict the memfd pages and then cgroup
> > > B will swap it in and be charged for it.
> >
> > I am not familiar with memfd, but based on
> > mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
> > the pages they will remain charged to cgroup A, unless cgroup A is
> > removed/offlined. Am I missing something?
>
> Ah, I don't know, Tejun said:
>
> "but it can converge when page usage transfers across cgroups
> if needed."
>
> Which I assumed was swap related but I don't know how convergence
> works.

I believe that's the case for file-backed pages, but I do not believe
it's the case for swap-backed pages.

>
> Jason

2023-02-23 18:15:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 02:10:56PM -0400, Jason Gunthorpe wrote:
> > I am not familiar with memfd, but based on
> > mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
> > the pages they will remain charged to cgroup A, unless cgroup A is
> > removed/offlined. Am I missing something?
>
> Ah, I don't know, Tejun said:
>
> "but it can converge when page usage transfers across cgroups
> if needed."
>
> Which I assumed was swap related but I don't know how convergence
> works.

That'd work for pagecache. For swap-backed, I think Yosry is right. Is
MAP_SHARED | MAP_ANONYMOUS a concern? Such mappings can only be shared
through forking, so it's not a common thing to be shared across different
resource domains.

Thanks.

--
tejun

2023-02-23 18:17:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 08:15:17AM -1000, Tejun Heo wrote:
> On Thu, Feb 23, 2023 at 02:10:56PM -0400, Jason Gunthorpe wrote:
> > > I am not familiar with memfd, but based on
> > > mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
> > > the pages they will remain charged to cgroup A, unless cgroup A is
> > > removed/offlined. Am I missing something?
> >
> > Ah, I don't know, Tejun said:
> >
> > "but it can converge when page usage transfers across cgroups
> > if needed."
> >
> > Which I assumed was swap related but I don't know how convergence
> > works.
>
> That'd work for pagecache. For swap-backed, I think Yosry is right. Is
> MAP_SHARED | MAP_ANONYMOUS a concern? Such mappings can only be shared
> through forking, so it's not a common thing to be shared across different
> resource domains.

Isn't memfd also in the same boat?

Jason

2023-02-23 18:22:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory

On Thu, Feb 23, 2023 at 02:17:18PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 08:15:17AM -1000, Tejun Heo wrote:
> > On Thu, Feb 23, 2023 at 02:10:56PM -0400, Jason Gunthorpe wrote:
> > > > I am not familiar with memfd, but based on
> > > > mem_cgroup_swapin_charge_folio() it seems like if cgroup B swapped in
> > > > the pages they will remain charged to cgroup A, unless cgroup A is
> > > > removed/offlined. Am I missing something?
> > >
> > > Ah, I don't know, Tejun said:
> > >
> > > "but it can converge when page usage transfers across cgroups
> > > if needed."
> > >
> > > Which I assumed was swap related but I don't know how convergence
> > > works.
> >
> > That'd work for pagecache. For swap-backed, I think Yosry is right. Is
> > MAP_SHARED | MAP_ANONYMOUS a concern? Such mappings can only be shared
> > through forking, so it's not a common thing to be shared across different
> > resource domains.
>
> Isn't memfd also in the same boat?

I see. Yeah, that looks like named shared anon. The first one
instantiating a page would always be the owner.

Thanks.

--
tejun