Based on discussions at LPC [1], this series adds a memory.stat counter
for exported dmabufs. This counter allows us to continue tracking
system-wide total exported buffer sizes, which there is otherwise no
way to obtain without DMABUF_SYSFS_STATS, and adds a new capability to
track per-cgroup exported buffer sizes. The total (root counter) is
helpful for accounting in-kernel dmabuf use (by comparing with the sum
of child nodes, or with the sum of sizes of mapped buffers or FD
references in procfs), and for identifying driver memory leaks when
in-kernel use continually increases over time. With per-application
cgroups, the per-cgroup counter allows us to quickly see how much
dma-buf memory an application has caused to be allocated. This avoids
the need to read through all of procfs, which can be a lengthy process,
and causes the charge to "stick" to the allocating process/cgroup for
as long as the buffer is alive, regardless of how the buffer is shared
(unless the charge is transferred).
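For illustration, the per-cgroup value can be read directly from the
cgroup's memory.stat file. A minimal sketch in C (the cgroup path here
is hypothetical):

#include <stdio.h>

int main(void)
{
	char line[256];
	unsigned long long bytes;
	/* Hypothetical per-app cgroup path; adjust for the real hierarchy. */
	FILE *f = fopen("/sys/fs/cgroup/apps/example.app/memory.stat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "dmabuf %llu", &bytes) == 1)
			printf("exported dma-buf: %llu bytes\n", bytes);
	fclose(f);
	return 0;
}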
The first patch adds the counter to memcg. The next two patches allow
the charge for a buffer to be transferred across cgroups, which is
necessary because of the way most dmabufs are allocated from a central
process on Android. The fourth patch adds the binder object flags to
the existing selinux_binder_transfer_file LSM hook and a SELinux
permission for charge transfers.
[1] https://lore.kernel.org/all/[email protected]/
v2:
Actually charge the memcg instead of just mutating the stat counter,
per Shakeel Butt and Michal Hocko. Shakeel pointed me at the skmem
functions, which turned out to be very similar to how I was thinking
the dmabuf tracking should work. So I've added a pair of dmabuf
functions that do essentially the same thing, conditionally implemented
behind CONFIG_MEMCG alongside the other charge/uncharge functions.
Drop security_binder_transfer_charge per Casey Schaufler and Paul Moore
Drop BINDER_FDA_FLAG_XFER_CHARGE (and fix commit message) per Carlos
Llamas
Don't expose is_dma_buf_file for use by binder per Hillf Danton
Call dma_buf_stats_teardown in dma_buf_export error handling
Rebase onto v6.2-rc5
Hridya Valsaraju (1):
binder: Add flags to relinquish ownership of fds
T.J. Mercier (3):
memcg: Track exported dma-buffers
dmabuf: Add cgroup charge transfer function
security: binder: Add binder object flags to
selinux_binder_transfer_file
Documentation/admin-guide/cgroup-v2.rst | 5 ++
drivers/android/binder.c | 27 ++++++++--
drivers/dma-buf/dma-buf.c | 69 +++++++++++++++++++++++++
include/linux/dma-buf.h | 4 ++
include/linux/lsm_hook_defs.h | 2 +-
include/linux/lsm_hooks.h | 5 +-
include/linux/memcontrol.h | 43 +++++++++++++++
include/linux/security.h | 6 ++-
include/uapi/linux/android/binder.h | 19 +++++--
mm/memcontrol.c | 19 +++++++
security/security.c | 4 +-
security/selinux/hooks.c | 13 ++++-
security/selinux/include/classmap.h | 2 +-
13 files changed, 201 insertions(+), 17 deletions(-)
base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
--
2.39.0.246.g2a6d74b583-goog
When a buffer is exported to userspace, use memcg to attribute the
buffer to the allocating cgroup until all buffer references are
released.
Unlike the dmabuf sysfs stats implementation, this memcg accounting
avoids contention over the kernfs_rwsem incurred when creating or
removing nodes.
Signed-off-by: T.J. Mercier <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 4 +++
drivers/dma-buf/dma-buf.c | 13 +++++++++
include/linux/dma-buf.h | 3 ++
include/linux/memcontrol.h | 38 +++++++++++++++++++++++++
mm/memcontrol.c | 19 +++++++++++++
5 files changed, 77 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..538ae22bc514 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1455,6 +1455,10 @@ PAGE_SIZE multiple when read back.
Amount of memory used for storing in-kernel data
structures.
+ dmabuf (npn)
+ Amount of memory used for exported DMA buffers allocated by the cgroup.
+ Stays with the allocating cgroup regardless of how the buffer is shared.
+
workingset_refault_anon
Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index e6528767efc7..a6a8cb5cb32d 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -75,6 +75,9 @@ static void dma_buf_release(struct dentry *dentry)
*/
BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
+ mem_cgroup_uncharge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE);
+ mem_cgroup_put(dmabuf->memcg);
+
dma_buf_stats_teardown(dmabuf);
dmabuf->ops->release(dmabuf);
@@ -673,6 +676,13 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
if (ret)
goto err_dmabuf;
+ dmabuf->memcg = get_mem_cgroup_from_mm(current->mm);
+ if (!mem_cgroup_charge_dmabuf(dmabuf->memcg, PAGE_ALIGN(dmabuf->size) / PAGE_SIZE,
+ GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto err_memcg;
+ }
+
file->private_data = dmabuf;
file->f_path.dentry->d_fsdata = dmabuf;
dmabuf->file = file;
@@ -683,6 +693,9 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
return dmabuf;
+err_memcg:
+ mem_cgroup_put(dmabuf->memcg);
+ dma_buf_stats_teardown(dmabuf);
err_dmabuf:
if (!resv)
dma_resv_fini(dmabuf->resv);
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 6fa8d4e29719..1f0ffb8e4bf5 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
+#include <linux/memcontrol.h>
struct device;
struct dma_buf;
@@ -446,6 +447,8 @@ struct dma_buf {
struct dma_buf *dmabuf;
} *sysfs_entry;
#endif
+ /* The cgroup to which this buffer is currently attributed */
+ struct mem_cgroup *memcg;
};
/**
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d3c8203cab6c..c10b8565fdbf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,7 @@ enum memcg_stat_item {
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
+ MEMCG_DMABUF,
MEMCG_NR_STAT,
};
@@ -673,6 +674,25 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
+
+/**
+ * mem_cgroup_charge_dmabuf - Charge dma-buf memory to a cgroup and update stat counter
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ * @gfp_mask: reclaim mode
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fits within
+ * @memcg's configured limit, %false if it does not.
+ */
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask);
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+ gfp_t gfp_mask)
+{
+ if (mem_cgroup_disabled())
+	return true;
+ return __mem_cgroup_charge_dmabuf(memcg, nr_pages, gfp_mask);
+}
+
void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
void __mem_cgroup_uncharge(struct folio *folio);
@@ -690,6 +710,14 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
__mem_cgroup_uncharge(folio);
}
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages);
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ if (mem_cgroup_disabled())
+ return;
+ __mem_cgroup_uncharge_dmabuf(memcg, nr_pages);
+}
+
void __mem_cgroup_uncharge_list(struct list_head *page_list);
static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
{
@@ -1242,6 +1270,12 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
return 0;
}
+static inline bool mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages,
+ gfp_t gfp_mask)
+{
+ return true;
+}
+
static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
{
}
@@ -1250,6 +1284,10 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
{
}
+static inline void mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..375d18370f4b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1502,6 +1502,7 @@ static const struct memory_stat memory_stats[] = {
{ "unevictable", NR_UNEVICTABLE },
{ "slab_reclaimable", NR_SLAB_RECLAIMABLE_B },
{ "slab_unreclaimable", NR_SLAB_UNRECLAIMABLE_B },
+ { "dmabuf", MEMCG_DMABUF },
/* The memory events */
{ "workingset_refault_anon", WORKINGSET_REFAULT_ANON },
@@ -4042,6 +4043,7 @@ static const unsigned int memcg1_stats[] = {
WORKINGSET_REFAULT_ANON,
WORKINGSET_REFAULT_FILE,
MEMCG_SWAP,
+ MEMCG_DMABUF,
};
static const char *const memcg1_stat_names[] = {
@@ -4057,6 +4059,7 @@ static const char *const memcg1_stat_names[] = {
"workingset_refault_anon",
"workingset_refault_file",
"swap",
+ "dmabuf",
};
/* Universal VM events cgroup1 shows, original sort order */
@@ -7299,6 +7302,22 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+bool __mem_cgroup_charge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask)
+{
+ if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
+ mod_memcg_state(memcg, MEMCG_DMABUF, nr_pages);
+ return true;
+ }
+
+ return false;
+}
+
+void __mem_cgroup_uncharge_dmabuf(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ mod_memcg_state(memcg, MEMCG_DMABUF, -nr_pages);
+ refill_stock(memcg, nr_pages);
+}
+
static int __init cgroup_memory(char *s)
{
char *token;
--
2.39.0.246.g2a6d74b583-goog
The dma_buf_transfer_charge function provides a way for processes to
transfer charge of a buffer to a different cgroup. This is essential
for the cases where a central allocator process does allocations for
various subsystems, hands over the fd to the client who requested the
memory, and drops all references to the allocated memory.
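A sketch of the intended call pattern from an IPC mechanism (binder
becomes the in-tree caller in the next patch; the wrapper below is
illustrative only):

/* Illustrative only: hand a dma-buf's memcg charge to the receiving task. */
static int hand_over_dmabuf(struct file *dmabuf_file,
			    struct task_struct *recipient)
{
	/* Fails with -EPERM if the caller is not in the cgroup currently
	 * charged, or -ENOMEM if the charge does not fit within the
	 * target's limit.
	 */
	int ret = dma_buf_transfer_charge(dmabuf_file, recipient);

	if (ret)
		pr_warn("dma-buf charge transfer failed: %d\n", ret);
	return ret;
}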
Signed-off-by: T.J. Mercier <[email protected]>
---
drivers/dma-buf/dma-buf.c | 56 ++++++++++++++++++++++++++++++++++++++
include/linux/dma-buf.h | 1 +
include/linux/memcontrol.h | 5 ++++
3 files changed, 62 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index a6a8cb5cb32d..ac3d02a7ecf8 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -11,6 +11,7 @@
* refining of this idea.
*/
+#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/dma-buf.h>
@@ -1626,6 +1627,61 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map)
}
EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
+/**
+ * dma_buf_transfer_charge - Change the cgroup to which the provided dma_buf is charged.
+ * @dmabuf_file: [in] file for buffer whose charge will be migrated to a different cgroup
+ * @target: [in] the task_struct of the destination process for the cgroup charge
+ *
+ * Only tasks that belong to the same cgroup the buffer is currently charged to
+ * may call this function, otherwise it will return -EPERM.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int dma_buf_transfer_charge(struct file *dmabuf_file, struct task_struct *target)
+{
+ struct mem_cgroup *current_cg, *target_cg;
+ struct dma_buf *dmabuf;
+ unsigned int nr_pages;
+ int ret = 0;
+
+ if (!IS_ENABLED(CONFIG_MEMCG))
+ return 0;
+
+ if (WARN_ON(!dmabuf_file) || WARN_ON(!target))
+ return -EINVAL;
+
+ if (!is_dma_buf_file(dmabuf_file))
+ return -EBADF;
+ dmabuf = dmabuf_file->private_data;
+
+ nr_pages = PAGE_ALIGN(dmabuf->size) / PAGE_SIZE;
+ current_cg = mem_cgroup_from_task(current);
+ target_cg = get_mem_cgroup_from_mm(target->mm);
+
+ if (current_cg == target_cg)
+ goto skip_transfer;
+
+ if (!mem_cgroup_charge_dmabuf(target_cg, nr_pages, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto skip_transfer;
+ }
+
+ if (cmpxchg(&dmabuf->memcg, current_cg, target_cg) != current_cg) {
+ /* Only the current owner can transfer the charge */
+ ret = -EPERM;
+ mem_cgroup_uncharge_dmabuf(target_cg, nr_pages);
+ goto skip_transfer;
+ }
+
+ mem_cgroup_uncharge_dmabuf(current_cg, nr_pages);
+ mem_cgroup_put(current_cg); /* unref from buffer - buffer keeps new ref to target_cg */
+ return 0;
+
+skip_transfer:
+ mem_cgroup_put(target_cg);
+ return ret;
+}
+
#ifdef CONFIG_DEBUG_FS
static int dma_buf_debug_show(struct seq_file *s, void *unused)
{
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 1f0ffb8e4bf5..f25eb8e60fb2 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -634,4 +634,5 @@ int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map);
void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
+int dma_buf_transfer_charge(struct file *dmabuf_file, struct task_struct *target);
#endif /* __DMA_BUF_H__ */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c10b8565fdbf..009298a446fe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1335,6 +1335,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
return NULL;
}
+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+ return NULL;
+}
+
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
}
--
2.39.0.246.g2a6d74b583-goog
From: Hridya Valsaraju <[email protected]>
This patch introduces flag BINDER_FD_FLAG_XFER_CHARGE that a process
sending an individual fd or fd array to another process over binder IPC
can set to relinquish ownership of the fd(s) being sent for memory
accounting purposes. If the flag is found to be set during the fd or fd
array translation and the fd is for a DMA-BUF, the buffer is uncharged
from the sender's cgroup and charged to the receiving process's cgroup
instead.
It is up to the sending process to ensure that it closes the fds
regardless of whether the transfer failed or succeeded.
Most graphics shared memory allocations in Android are done by the
graphics allocator HAL process. On requests from clients, the HAL
process allocates memory and sends the fds to the clients over binder
IPC. The graphics allocator HAL will not retain any references to the
buffers. When the HAL sets BINDER_FD_FLAG_XFER_CHARGE, binder will
transfer the charge for the buffer from the allocator process cgroup to
the client process cgroup.
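For illustration, a sender marks the fd object for charge transfer
roughly as follows (a sketch only; transaction assembly and the ioctl
plumbing are omitted):

#include <string.h>
#include <linux/android/binder.h>

/* Sketch: build an fd object whose dma-buf charge follows the receiver. */
static void init_xfer_fd_object(struct binder_fd_object *obj, int dmabuf_fd)
{
	memset(obj, 0, sizeof(*obj));
	obj->hdr.type = BINDER_TYPE_FD;
	obj->flags = BINDER_FD_FLAG_XFER_CHARGE; /* relinquish the charge */
	obj->fd = dmabuf_fd;
}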
The pad [1] and pad_flags [2] fields of binder_fd_object and
binder_fda_array_object come from alignment with flat_binder_object and
have never been exposed for use from userspace. This new use of the
flags field follows the pattern set by binder_buffer_object.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=feba3900cabb8e7c87368faa28e7a6936809ba22
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=5cdcf4c6a638591ec0e98c57404a19e7f9997567
Signed-off-by: Hridya Valsaraju <[email protected]>
Signed-off-by: T.J. Mercier <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 3 ++-
drivers/android/binder.c | 25 +++++++++++++++++++++----
include/uapi/linux/android/binder.h | 19 +++++++++++++++----
3 files changed, 38 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 538ae22bc514..d225295932c0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1457,7 +1457,8 @@ PAGE_SIZE multiple when read back.
dmabuf (npn)
Amount of memory used for exported DMA buffers allocated by the cgroup.
- Stays with the allocating cgroup regardless of how the buffer is shared.
+ Stays with the allocating cgroup regardless of how the buffer is shared
+ unless explicitly transferred.
workingset_refault_anon
Number of refaults of previously evicted anonymous pages.
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 880224ec6abb..5e707974793f 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/dma-buf.h>
#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/freezer.h>
@@ -2237,7 +2238,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
return ret;
}
-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
struct binder_transaction *t,
struct binder_thread *thread,
struct binder_transaction *in_reply_to)
@@ -2275,6 +2276,20 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
goto err_security;
}
+ if (IS_ENABLED(CONFIG_MEMCG) && (flags & BINDER_FD_FLAG_XFER_CHARGE)) {
+ ret = dma_buf_transfer_charge(file, target_proc->tsk);
+ if (unlikely(ret == -EBADF)) {
+ binder_user_error(
+ "%d:%d got transaction with XFER_CHARGE for non-DMA-BUF fd, %d\n",
+ proc->pid, thread->pid, fd);
+ goto err_dmabuf;
+ } else if (ret) {
+ pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+ proc->pid, thread->pid, target_proc->pid);
+ goto err_xfer;
+ }
+ }
+
/*
* Add fixup record for this transaction. The allocation
* of the fd in the target needs to be done from a
@@ -2294,6 +2309,8 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
return ret;
err_alloc:
+err_xfer:
+err_dmabuf:
err_security:
fput(file);
err_fget:
@@ -2604,7 +2621,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
if (!ret)
- ret = binder_translate_fd(fd, offset, t, thread,
+ ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
in_reply_to);
if (ret)
return ret > 0 ? -EINVAL : ret;
@@ -3383,8 +3400,8 @@ static void binder_transaction(struct binder_proc *proc,
struct binder_fd_object *fp = to_binder_fd_object(hdr);
binder_size_t fd_offset = object_offset +
(uintptr_t)&fp->fd - (uintptr_t)fp;
- int ret = binder_translate_fd(fp->fd, fd_offset, t,
- thread, in_reply_to);
+ int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
+ t, thread, in_reply_to);
fp->pad_binder = 0;
if (ret < 0 ||
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index e72e4de8f452..4b20dd1dccb1 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
/**
* struct binder_fd_object - describes a filedescriptor to be fixed up.
* @hdr: common header structure
- * @pad_flags: padding to remain compatible with old userspace code
+ * @flags: One or more BINDER_FD_FLAG_* flags
* @pad_binder: padding to remain compatible with old userspace code
* @fd: file descriptor
* @cookie: opaque data, used by user-space
*/
struct binder_fd_object {
struct binder_object_header hdr;
- __u32 pad_flags;
+ __u32 flags;
union {
binder_uintptr_t pad_binder;
__u32 fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
binder_uintptr_t cookie;
};
+enum {
+ /**
+ * @BINDER_FD_FLAG_XFER_CHARGE
+ *
+ * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
+ * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
+ * sender's cgroup and charged to the receiving process's cgroup instead.
+ */
+ BINDER_FD_FLAG_XFER_CHARGE = 0x01,
+};
+
/* struct binder_buffer_object - object describing a userspace buffer
* @hdr: common header structure
* @flags: one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {
/* struct binder_fd_array_object - object describing an array of fds in a buffer
* @hdr: common header structure
- * @pad: padding to ensure correct alignment
+ * @flags: One or more BINDER_FD_FLAG_* flags
* @num_fds: number of file descriptors in the buffer
* @parent: index in offset array to buffer holding the fd array
* @parent_offset: start offset of fd array in the buffer
@@ -162,7 +173,7 @@ enum {
*/
struct binder_fd_array_object {
struct binder_object_header hdr;
- __u32 pad;
+ __u32 flags;
binder_size_t num_fds;
binder_size_t parent;
binder_size_t parent_offset;
--
2.39.0.246.g2a6d74b583-goog
Any process can cause a memory charge transfer to occur to any other
process when transmitting a file descriptor through binder. This should
only be possible for central allocator processes, so the binder object
flags are added to the security_binder_transfer_file hook so that LSMs
can enforce restrictions on charge transfers.
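Policy can then grant the new permission only to the allocator domain,
for example with a rule along these lines (the domain names are
hypothetical):

allow hal_graphics_allocator appdomain:binder transfer_charge;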
Signed-off-by: T.J. Mercier <[email protected]>
---
drivers/android/binder.c | 2 +-
include/linux/lsm_hook_defs.h | 2 +-
include/linux/lsm_hooks.h | 5 ++++-
include/linux/security.h | 6 ++++--
security/security.c | 4 ++--
security/selinux/hooks.c | 13 ++++++++++++-
security/selinux/include/classmap.h | 2 +-
7 files changed, 25 insertions(+), 9 deletions(-)
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 5e707974793f..7b1bb23b6b79 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -2270,7 +2270,7 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
ret = -EBADF;
goto err_fget;
}
- ret = security_binder_transfer_file(proc->cred, target_proc->cred, file);
+ ret = security_binder_transfer_file(proc->cred, target_proc->cred, file, flags);
if (ret < 0) {
ret = -EPERM;
goto err_security;
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index ed6cb2ac55fa..84ee61089f7b 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -32,7 +32,7 @@ LSM_HOOK(int, 0, binder_transaction, const struct cred *from,
LSM_HOOK(int, 0, binder_transfer_binder, const struct cred *from,
const struct cred *to)
LSM_HOOK(int, 0, binder_transfer_file, const struct cred *from,
- const struct cred *to, struct file *file)
+ const struct cred *to, struct file *file, u32 binder_object_flags)
LSM_HOOK(int, 0, ptrace_access_check, struct task_struct *child,
unsigned int mode)
LSM_HOOK(int, 0, ptrace_traceme, struct task_struct *parent)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 0a5ba81f7367..d57977336ae8 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1381,9 +1381,12 @@
* Return 0 if permission is granted.
* @binder_transfer_file:
* Check whether @from is allowed to transfer @file to @to.
+ * If @binder_object_flags indicates a memory charge transfer for @file, then
+ * permission for the charge transfer can be checked as well.
* @from contains the struct cred for the sending process.
- * @file contains the struct file being transferred.
* @to contains the struct cred for the receiving process.
+ * @file contains the struct file being transferred.
+ * @binder_object_flags contains the flags associated with the binder object.
* Return 0 if permission is granted.
*
* @ptrace_access_check:
diff --git a/include/linux/security.h b/include/linux/security.h
index 5b67f208f7de..c4b80fc8d104 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -269,7 +269,8 @@ int security_binder_transaction(const struct cred *from,
int security_binder_transfer_binder(const struct cred *from,
const struct cred *to);
int security_binder_transfer_file(const struct cred *from,
- const struct cred *to, struct file *file);
+ const struct cred *to, struct file *file,
+ u32 binder_object_flags);
int security_ptrace_access_check(struct task_struct *child, unsigned int mode);
int security_ptrace_traceme(struct task_struct *parent);
int security_capget(struct task_struct *target,
@@ -542,7 +543,8 @@ static inline int security_binder_transfer_binder(const struct cred *from,
static inline int security_binder_transfer_file(const struct cred *from,
const struct cred *to,
- struct file *file)
+ struct file *file,
+ u32 binder_object_flags)
{
return 0;
}
diff --git a/security/security.c b/security/security.c
index d1571900a8c7..12ccaca744c0 100644
--- a/security/security.c
+++ b/security/security.c
@@ -796,9 +796,9 @@ int security_binder_transfer_binder(const struct cred *from,
}
int security_binder_transfer_file(const struct cred *from,
- const struct cred *to, struct file *file)
+ const struct cred *to, struct file *file, u32 binder_object_flags)
{
- return call_int_hook(binder_transfer_file, 0, from, to, file);
+ return call_int_hook(binder_transfer_file, 0, from, to, file, binder_object_flags);
}
int security_ptrace_access_check(struct task_struct *child, unsigned int mode)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 3c5be76a9199..d4cfca3c9a3b 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -88,6 +88,7 @@
#include <linux/bpf.h>
#include <linux/kernfs.h>
#include <linux/stringhash.h> /* for hashlen_string() */
+#include <uapi/linux/android/binder.h>
#include <uapi/linux/mount.h>
#include <linux/fsnotify.h>
#include <linux/fanotify.h>
@@ -2029,7 +2030,8 @@ static int selinux_binder_transfer_binder(const struct cred *from,
static int selinux_binder_transfer_file(const struct cred *from,
const struct cred *to,
- struct file *file)
+ struct file *file,
+ u32 binder_object_flags)
{
u32 sid = cred_sid(to);
struct file_security_struct *fsec = selinux_file(file);
@@ -2038,6 +2040,15 @@ static int selinux_binder_transfer_file(const struct cred *from,
struct common_audit_data ad;
int rc;
+ if (binder_object_flags & BINDER_FD_FLAG_XFER_CHARGE) {
+ rc = avc_has_perm(&selinux_state,
+ cred_sid(from), sid,
+ SECCLASS_BINDER, BINDER__TRANSFER_CHARGE,
+ NULL);
+ if (rc)
+ return rc;
+ }
+
ad.type = LSM_AUDIT_DATA_PATH;
ad.u.path = file->f_path;
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index a3c380775d41..2eef180d10d7 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -172,7 +172,7 @@ const struct security_class_mapping secclass_map[] = {
{ "tun_socket",
{ COMMON_SOCK_PERMS, "attach_queue", NULL } },
{ "binder", { "impersonate", "call", "set_context_mgr", "transfer",
- NULL } },
+ "transfer_charge", NULL } },
{ "cap_userns",
{ COMMON_CAP_PERMS, NULL } },
{ "cap2_userns",
--
2.39.0.246.g2a6d74b583-goog
On Mon, Jan 23, 2023 at 2:18 PM T.J. Mercier <[email protected]> wrote:
>
> Any process can cause a memory charge transfer to occur to any other
> process when transmitting a file descriptor through binder. This should
> only be possible for central allocator processes, so the binder object
> flags are added to the security_binder_transfer_file hook so that LSMs
> can enforce restrictions on charge transfers.
>
> Signed-off-by: T.J. Mercier <[email protected]>
> ---
> drivers/android/binder.c | 2 +-
> include/linux/lsm_hook_defs.h | 2 +-
> include/linux/lsm_hooks.h | 5 ++++-
> include/linux/security.h | 6 ++++--
> security/security.c | 4 ++--
> security/selinux/hooks.c | 13 ++++++++++++-
> security/selinux/include/classmap.h | 2 +-
> 7 files changed, 25 insertions(+), 9 deletions(-)
...
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 3c5be76a9199..d4cfca3c9a3b 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -88,6 +88,7 @@
> #include <linux/bpf.h>
> #include <linux/kernfs.h>
> #include <linux/stringhash.h> /* for hashlen_string() */
> +#include <uapi/linux/android/binder.h>
> #include <uapi/linux/mount.h>
> #include <linux/fsnotify.h>
> #include <linux/fanotify.h>
> @@ -2029,7 +2030,8 @@ static int selinux_binder_transfer_binder(const struct cred *from,
>
> static int selinux_binder_transfer_file(const struct cred *from,
> const struct cred *to,
> - struct file *file)
> + struct file *file,
> + u32 binder_object_flags)
> {
> u32 sid = cred_sid(to);
> struct file_security_struct *fsec = selinux_file(file);
> @@ -2038,6 +2040,15 @@ static int selinux_binder_transfer_file(const struct cred *from,
> struct common_audit_data ad;
> int rc;
>
> + if (binder_object_flags & BINDER_FD_FLAG_XFER_CHARGE) {
> + rc = avc_has_perm(&selinux_state,
> + cred_sid(from), sid,
> + SECCLASS_BINDER, BINDER__TRANSFER_CHARGE,
> + NULL);
> + if (rc)
> + return rc;
> + }
> +
> ad.type = LSM_AUDIT_DATA_PATH;
> ad.u.path = file->f_path;
>
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index a3c380775d41..2eef180d10d7 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -172,7 +172,7 @@ const struct security_class_mapping secclass_map[] = {
> { "tun_socket",
> { COMMON_SOCK_PERMS, "attach_queue", NULL } },
> { "binder", { "impersonate", "call", "set_context_mgr", "transfer",
> - NULL } },
> + "transfer_charge", NULL } },
> { "cap_userns",
> { COMMON_CAP_PERMS, NULL } },
> { "cap2_userns",
My first take on reading these changes above is that you've completely
ignored my previous comments about SELinux access controls around
resource management. You've leveraged the existing LSM/SELinux hook
as we discussed previously, that's good, but can you explain what
changes you've made to address my concerns about one-off resource
management controls?
--
paul-moore.com
On Mon, Jan 23, 2023 at 2:04 PM T.J. Mercier <[email protected]> wrote:
>
>
>
> On Mon, Jan 23, 2023 at 1:36 PM Paul Moore <[email protected]> wrote:
>>
>> On Mon, Jan 23, 2023 at 2:18 PM T.J. Mercier <[email protected]> wrote:
>> >
>> > Any process can cause a memory charge transfer to occur to any other
>> > process when transmitting a file descriptor through binder. This should
>> > only be possible for central allocator processes, so the binder object
>> > flags are added to the security_binder_transfer_file hook so that LSMs
>> > can enforce restrictions on charge transfers.
>> >
>> > Signed-off-by: T.J. Mercier <[email protected]>
>> > ---
>> > drivers/android/binder.c | 2 +-
>> > include/linux/lsm_hook_defs.h | 2 +-
>> > include/linux/lsm_hooks.h | 5 ++++-
>> > include/linux/security.h | 6 ++++--
>> > security/security.c | 4 ++--
>> > security/selinux/hooks.c | 13 ++++++++++++-
>> > security/selinux/include/classmap.h | 2 +-
>> > 7 files changed, 25 insertions(+), 9 deletions(-)
>>
>> ...
>>
>> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>> > index 3c5be76a9199..d4cfca3c9a3b 100644
>> > --- a/security/selinux/hooks.c
>> > +++ b/security/selinux/hooks.c
>> > @@ -88,6 +88,7 @@
>> > #include <linux/bpf.h>
>> > #include <linux/kernfs.h>
>> > #include <linux/stringhash.h> /* for hashlen_string() */
>> > +#include <uapi/linux/android/binder.h>
>> > #include <uapi/linux/mount.h>
>> > #include <linux/fsnotify.h>
>> > #include <linux/fanotify.h>
>> > @@ -2029,7 +2030,8 @@ static int selinux_binder_transfer_binder(const struct cred *from,
>> >
>> > static int selinux_binder_transfer_file(const struct cred *from,
>> > const struct cred *to,
>> > - struct file *file)
>> > + struct file *file,
>> > + u32 binder_object_flags)
>> > {
>> > u32 sid = cred_sid(to);
>> > struct file_security_struct *fsec = selinux_file(file);
>> > @@ -2038,6 +2040,15 @@ static int selinux_binder_transfer_file(const struct cred *from,
>> > struct common_audit_data ad;
>> > int rc;
>> >
>> > + if (binder_object_flags & BINDER_FD_FLAG_XFER_CHARGE) {
>> > + rc = avc_has_perm(&selinux_state,
>> > + cred_sid(from), sid,
>> > + SECCLASS_BINDER, BINDER__TRANSFER_CHARGE,
>> > + NULL);
>> > + if (rc)
>> > + return rc;
>> > + }
>> > +
>> > ad.type = LSM_AUDIT_DATA_PATH;
>> > ad.u.path = file->f_path;
>> >
>> > diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
>> > index a3c380775d41..2eef180d10d7 100644
>> > --- a/security/selinux/include/classmap.h
>> > +++ b/security/selinux/include/classmap.h
>> > @@ -172,7 +172,7 @@ const struct security_class_mapping secclass_map[] = {
>> > { "tun_socket",
>> > { COMMON_SOCK_PERMS, "attach_queue", NULL } },
>> > { "binder", { "impersonate", "call", "set_context_mgr", "transfer",
>> > - NULL } },
>> > + "transfer_charge", NULL } },
>> > { "cap_userns",
>> > { COMMON_CAP_PERMS, NULL } },
>> > { "cap2_userns",
>>
>> My first take on reading these changes above is that you've completely
>> ignored my previous comments about SELinux access controls around
>> resource management. You've leveraged the existing LSM/SELinux hook
>> as we discussed previously, that's good, but can you explain what
>> changes you've made to address my concerns about one-off resource
>> management controls?
>>
> It's been a couple of weeks since v1, so I've sent this update out
> now to incorporate all the other feedback so far and to make sure
> it's headed in the right direction. I've tried opening up a
> discussion about this rather unique case, but there's been no
> activity on that yet.
>
Someone pointed out this didn't make it to the lists. Retrying.
>> --
>> paul-moore.com
On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> When a buffer is exported to userspace, use memcg to attribute the
> buffer to the allocating cgroup until all buffer references are
> released.
Is there any reason why this memory cannot be charged during the
allocation (__GFP_ACCOUNT used)?
Also you do charge and account the memory but underlying pages do not
know about their memcg (this is normally done with commit_charge for
user mapped pages). This would become a problem if the memory is
migrated for example. This also means that you have to maintain memcg
reference outside of the memcg proper which is not really nice either.
This mimics the tcp kmem limit implementation, which I really have to
say I am not a great fan of, and this pattern shouldn't be copied.
Also you are not really saying anything about the oom behavior. With
this implementation the kernel will try to reclaim the memory and even
trigger the memcg oom killer if the request size is <= 8 pages. Is this
a desirable behavior?
--
Michal Hocko
SUSE Labs
On Tue, Jan 24, 2023 at 7:00 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > When a buffer is exported to userspace, use memcg to attribute the
> > buffer to the allocating cgroup until all buffer references are
> > released.
>
> Is there any reason why this memory cannot be charged during the
> allocation (__GFP_ACCOUNT used)?
My main motivation was to keep code changes away from exporters and
implement the accounting in one common spot for all of them. This is a
bit of a carryover from a previous approach [1] where there was some
objection to pushing off this work onto exporters and forcing them to
adapt, but __GFP_ACCOUNT does seem like a smaller burden than before
at least initially. However in order to support charge transfer
between cgroups with __GFP_ACCOUNT we'd need to be able to get at the
pages backing dmabuf objects, and the exporters are the ones with that
access. Meaning I think we'd have to add some additional dma_buf_ops
to achieve that, which was the objection from [1].
[1] https://lore.kernel.org/lkml/[email protected]/
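(For comparison, charging at allocation time inside an exporter would
look roughly like the sketch below; this is illustrative only, not a
proposed change to any exporter:)

#include <linux/gfp.h>

/* Sketch: an exporter charging its backing pages at allocation time.
 * With __GFP_ACCOUNT the pages themselves record their memcg, so no
 * external memcg reference is needed, but a later charge transfer
 * would require access to these pages.
 */
static struct page *exporter_alloc_pages(unsigned int order)
{
	return alloc_pages(GFP_KERNEL | __GFP_ACCOUNT, order);
}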
>
> Also you do charge and account the memory but underlying pages do not
> know about their memcg (this is normally done with commit_charge for
> user mapped pages). This would become a problem if the memory is
> migrated for example.
Hmm, what problem do you see in this situation? If the backing pages
are to be migrated that requires the cooperation of the exporter,
which currently has no influence on how the cgroup charging is done
and that seems fine. (Unless you mean migrating the charge across
cgroups? In which case that's the next patch.)
> This also means that you have to maintain memcg
> reference outside of the memcg proper which is not really nice either.
> This mimics the tcp kmem limit implementation, which I really have to
> say I am not a great fan of, and this pattern shouldn't be copied.
>
Ah, what can I say. This way looked simple to me. I think otherwise
we're back to making all exporters do more stuff for the accounting.
> Also you are not really saying anything about the oom behavior. With
> this implementation the kernel will try to reclaim the memory and even
> trigger the memcg oom killer if the request size is <= 8 pages. Is this
> a desirable behavior?
It will try to reclaim some memory, but not the dmabuf pages right?
Not *yet* anyway. This behavior sounds expected to me. I would only
expect it to be surprising for cgroups making heavy use of dmabufs
(that weren't accounted before) *and* with hard limits already very
close to actual usage. I remember Johannes mentioning that what counts
under memcg use is already a bit of a moving target.
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > When a buffer is exported to userspace, use memcg to attribute the
> > buffer to the allocating cgroup until all buffer references are
> > released.
>
> Is there any reason why this memory cannot be charged during the
> allocation (__GFP_ACCOUNT used)?
> Also you do charge and account the memory but underlying pages do not
> know about their memcg (this is normally done with commit_charge for
> user mapped pages). This would become a problem if the memory is
> migrated for example.
I don't think this is movable memory.
> This also means that you have to maintain memcg
> reference outside of the memcg proper which is not really nice either.
> This mimics the tcp kmem limit implementation, which I really have to
> say I am not a great fan of, and this pattern shouldn't be copied.
>
I think we should keep the discussion on technical merits instead of
personal preference. To me using an skmem-like interface is totally
fine, but the pros/cons need to be very explicit and the clear reasons
to select that option should be included.
To me there are two options:
1. Using an skmem-like interface as in this patch series:
The main pro of this option is that it is very simple. Let me list the
cons of this approach:
a. There is a time window between the actual memory allocation/free and
the charge and uncharge; the [un]charge happens only when the whole
buffer is allocated or freed. I think for the charge path that might
not be a big issue, but on the uncharge this can cause issues. The
application and the potential shrinkers have freed some of this dmabuf
memory, but until the whole dmabuf is freed, the memcg uncharge will
not happen. This can have consequences on the reclaim and OOM behavior
of the application.
b. Due to the usage model, i.e. a central daemon allocating the dmabuf
memory upfront, there is a requirement for memcg charge transfer
functionality to move the charge from the central daemon to the client
applications. This does introduce complexity and avenues for weird
reclaim and OOM behavior.
2. Allocate and charge the memory on page fault by the actual user
In this approach, the memory is not allocated upfront by the central
daemon but rather on page fault by the client application, and the
memcg charge happens at the same time.
The only con I can think of is that this approach is more involved and
may need some clever tricks to track the page on the free path, i.e. we
need to decrement the dmabuf memcg stat on the free path. Maybe a page
flag.
The pros of this approach are that there is no need for charge transfer
functionality, and that the charge/uncharge is closely tied to the
actual memory allocation and free.
Personally I would prefer the second approach, but I don't want to just
block this work if the dmabuf folks are OK with the cons mentioned for
the first approach.
thanks,
Shakeel
Hi "T.J. Mercier",
Thank you for the patch! Yet something to improve:
[auto build test ERROR on 2241ab53cbb5cdb08a6b2d4688feb13971058f65]
url: https://github.com/intel-lab-lkp/linux/commits/T-J-Mercier/memcg-Track-exported-dma-buffers/20230124-032017
base: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
patch link: https://lore.kernel.org/r/20230123191728.2928839-4-tjmercier%40google.com
patch subject: [PATCH v2 3/4] binder: Add flags to relinquish ownership of fds
config: x86_64-randconfig-a016-20230123 (https://download.01.org/0day-ci/archive/20230125/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/41e80f59d1b70691eefc0490e7f1df800cead9f2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review T-J-Mercier/memcg-Track-exported-dma-buffers/20230124-032017
git checkout 41e80f59d1b70691eefc0490e7f1df800cead9f2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: dma_buf_transfer_charge
>>> referenced by binder.c:2280 (drivers/android/binder.c:2280)
>>> drivers/android/binder.o:(binder_translate_fd) in archive vmlinux.a
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > > When a buffer is exported to userspace, use memcg to attribute the
> > > buffer to the allocating cgroup until all buffer references are
> > > released.
> >
> > Is there any reason why this memory cannot be charged during the
> > allocation (__GFP_ACCOUNT used)?
> > Also you do charge and account the memory but underlying pages do not
> > know about their memcg (this is normally done with commit_charge for
> > user mapped pages). This would become a problem if the memory is
> > migrated for example.
>
> I don't think this is movable memory.
>
> > This also means that you have to maintain memcg
> > reference outside of the memcg proper which is not really nice either.
> > This mimics the tcp kmem limit implementation, which I really have to
> > say I am not a great fan of, and this pattern shouldn't be copied.
> >
>
> I think we should keep the discussion on technical merits instead of
> personal preference. To me using an skmem-like interface is totally
> fine, but the pros/cons need to be very explicit and the clear reasons
> to select that option should be included.
I do agree with that. I didn't want to sound personal wrt tcp kmem
accounting, but the overall code maintenance cost is higher because
of how the tcp take on accounting differs from anything else in the
memcg proper. I would prefer not to grow another example like that.
> To me there are two options:
>
> 1. Using an skmem-like interface as in this patch series:
>
> The main pro of this option is that it is very simple. Let me list the
> cons of this approach:
>
> a. There is a time window between the actual memory allocation/free and
> the charge and uncharge; the [un]charge happens only when the whole
> buffer is allocated or freed. I think for the charge path that might
> not be a big issue, but on the uncharge this can cause issues. The
> application and the potential shrinkers have freed some of this dmabuf
> memory, but until the whole dmabuf is freed, the memcg uncharge will
> not happen. This can have consequences on the reclaim and OOM behavior
> of the application.
>
> b. Due to the usage model, i.e. a central daemon allocating the dmabuf
> memory upfront, there is a requirement for memcg charge transfer
> functionality to move the charge from the central daemon to the client
> applications. This does introduce complexity and avenues for weird
> reclaim and OOM behavior.
>
>
> 2. Allocate and charge the memory on page fault by the actual user
>
> In this approach, the memory is not allocated upfront by the central
> daemon but rather on page fault by the client application, and the
> memcg charge happens at the same time.
>
> The only con I can think of is that this approach is more involved and
> may need some clever tricks to track the page on the free path, i.e. we
> need to decrement the dmabuf memcg stat on the free path. Maybe a page
> flag.
>
> The pros of this approach are that there is no need for charge transfer
> functionality, and that the charge/uncharge is closely tied to the
> actual memory allocation and free.
>
> Personally I would prefer the second approach, but I don't want to just
> block this work if the dmabuf folks are OK with the cons mentioned for
> the first approach.
I am not familiar with dmabuf internals to judge complexity on their end
but I fully agree that charge-when-used is much more easier to reason
about and it should have less subtle surprises.
--
Michal Hocko
SUSE Labs
On Tue 24-01-23 10:55:21, T.J. Mercier wrote:
> On Tue, Jan 24, 2023 at 7:00 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > > When a buffer is exported to userspace, use memcg to attribute the
> > > buffer to the allocating cgroup until all buffer references are
> > > released.
> >
> > Is there any reason why this memory cannot be charged during the
> > allocation (__GFP_ACCOUNT used)?
>
> My main motivation was to keep code changes away from exporters and
> implement the accounting in one common spot for all of them. This is a
> bit of a carryover from a previous approach [1] where there was some
> objection to pushing off this work onto exporters and forcing them to
> adapt, but __GFP_ACCOUNT does seem like a smaller burden than before
> at least initially. However in order to support charge transfer
> between cgroups with __GFP_ACCOUNT we'd need to be able to get at the
> pages backing dmabuf objects, and the exporters are the ones with that
> access. Meaning I think we'd have to add some additional dma_buf_ops
> to achieve that, which was the objection from [1].
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> >
> > Also you do charge and account the memory but underlying pages do not
> > know about their memcg (this is normally done with commit_charge for
> > user mapped pages). This would become a problem if the memory is
> > migrated for example.
>
> Hmm, what problem do you see in this situation? If the backing pages
> are to be migrated that requires the cooperation of the exporter,
> which currently has no influence on how the cgroup charging is done
> and that seems fine. (Unless you mean migrating the charge across
> cgroups? In which case that's the next patch.)
My main concern was that page migration could lose the external tracking
without some additional steps on the dmabuf front.
> > This also means that you have to maintain memcg
> > reference outside of the memcg proper which is not really nice either.
> > This mimics the tcp kmem limit implementation, which I really have to
> > say I am not a great fan of, and this pattern shouldn't be copied.
> >
> Ah, what can I say. This way looked simple to me. I think otherwise
> we're back to making all exporters do more stuff for the accounting.
>
> > Also you are not really saying anything about the oom behavior. With
> > this implementation the kernel will try to reclaim the memory and even
> > trigger the memcg oom killer if the request size is <= 8 pages. Is this
> > a desirable behavior?
>
> It will try to reclaim some memory, but not the dmabuf pages right?
> Not *yet* anyway. This behavior sounds expected to me.
Yes, we have discussed that shrinkers will follow up later, which is
fine. The question is how much reclaim actually makes sense at this
stage. The charging interface usually copes with sizes resulting from
allocation requests (so usually 1<<order based). I can imagine that a
batch charge like the one implemented here could easily be 100s of MBs,
and that is much harder to define reclaim targets for. At least that is
something the memcg charging hasn't really considered yet. Maybe the
existing try_charge implementation can cope with that just fine, but it
would be really great to have the expected behavior described.
E.g. should the memcg OOM killer be invoked? Should reclaim really
target regular memory at all costs, or is just a lightweight memory
reclaim preferred (i.e. is the dmabuf charge failure an expensive
operation wrt. memory refault due to reclaim)?
--
Michal Hocko
SUSE Labs
On Mon, Jan 23, 2023 at 07:17:25PM +0000, T.J. Mercier wrote:
> From: Hridya Valsaraju <[email protected]>
>
> This patch introduces flag BINDER_FD_FLAG_XFER_CHARGE that a process
> sending an individual fd or fd array to another process over binder IPC
> can set to relinquish ownership of the fd(s) being sent for memory
> accounting purposes. If the flag is found to be set during the fd or fd
> array translation and the fd is for a DMA-BUF, the buffer is uncharged
> from the sender's cgroup and charged to the receiving process's cgroup
> instead.
>
> It is up to the sending process to ensure that it closes the fds
> regardless of whether the transfer failed or succeeded.
>
> Most graphics shared memory allocations in Android are done by the
> graphics allocator HAL process. On requests from clients, the HAL
> process allocates memory and sends the fds to the clients over binder
> IPC. The graphics allocator HAL will not retain any references to the
> buffers. When the HAL sets BINDER_FD_FLAG_XFER_CHARGE, binder will
> transfer the charge for the buffer from the allocator process cgroup to
> the client process cgroup.
>
> The pad [1] and pad_flags [2] fields of binder_fd_object and
> binder_fda_array_object come from alignment with flat_binder_object and
> have never been exposed for use from userspace. This new use of the
> flags field follows the pattern set by binder_buffer_object.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=feba3900cabb8e7c87368faa28e7a6936809ba22
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=5cdcf4c6a638591ec0e98c57404a19e7f9997567
>
> Signed-off-by: Hridya Valsaraju <[email protected]>
> Signed-off-by: T.J. Mercier <[email protected]>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 3 ++-
> drivers/android/binder.c | 25 +++++++++++++++++++++----
> include/uapi/linux/android/binder.h | 19 +++++++++++++++----
> 3 files changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 538ae22bc514..d225295932c0 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1457,7 +1457,8 @@ PAGE_SIZE multiple when read back.
>
> dmabuf (npn)
> Amount of memory used for exported DMA buffers allocated by the cgroup.
> - Stays with the allocating cgroup regardless of how the buffer is shared.
> + Stays with the allocating cgroup regardless of how the buffer is shared
> + unless explicitly transferred.
>
> workingset_refault_anon
> Number of refaults of previously evicted anonymous pages.
> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> index 880224ec6abb..5e707974793f 100644
> --- a/drivers/android/binder.c
> +++ b/drivers/android/binder.c
> @@ -42,6 +42,7 @@
>
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> +#include <linux/dma-buf.h>
> #include <linux/fdtable.h>
> #include <linux/file.h>
> #include <linux/freezer.h>
> @@ -2237,7 +2238,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
> return ret;
> }
>
> -static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> +static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
> struct binder_transaction *t,
> struct binder_thread *thread,
> struct binder_transaction *in_reply_to)
> @@ -2275,6 +2276,20 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> goto err_security;
> }
>
> + if (IS_ENABLED(CONFIG_MEMCG) && (flags & BINDER_FD_FLAG_XFER_CHARGE)) {
Do we need to test for MEMCG here? It seems this has been offloaded to
dma_buf_transfer_charge()?
> + ret = dma_buf_transfer_charge(file, target_proc->tsk);
> + if (unlikely(ret == -EBADF)) {
> + binder_user_error(
> + "%d:%d got transaction with XFER_CHARGE for non-DMA-BUF fd, %d\n",
> + proc->pid, thread->pid, fd);
> + goto err_dmabuf;
> + } else if (ret) {
> + pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
> + proc->pid, thread->pid, target_proc->pid);
> + goto err_xfer;
> + }
> + }
> +
> /*
> * Add fixup record for this transaction. The allocation
> * of the fd in the target needs to be done from a
> @@ -2294,6 +2309,8 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> return ret;
>
> err_alloc:
> +err_xfer:
> +err_dmabuf:
> err_security:
> fput(file);
> err_fget:
> @@ -2604,7 +2621,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
>
> ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
> if (!ret)
> - ret = binder_translate_fd(fd, offset, t, thread,
> + ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
> in_reply_to);
> if (ret)
> return ret > 0 ? -EINVAL : ret;
> @@ -3383,8 +3400,8 @@ static void binder_transaction(struct binder_proc *proc,
> struct binder_fd_object *fp = to_binder_fd_object(hdr);
> binder_size_t fd_offset = object_offset +
> (uintptr_t)&fp->fd - (uintptr_t)fp;
> - int ret = binder_translate_fd(fp->fd, fd_offset, t,
> - thread, in_reply_to);
> + int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
> + t, thread, in_reply_to);
>
> fp->pad_binder = 0;
> if (ret < 0 ||
> diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
> index e72e4de8f452..4b20dd1dccb1 100644
> --- a/include/uapi/linux/android/binder.h
> +++ b/include/uapi/linux/android/binder.h
> @@ -91,14 +91,14 @@ struct flat_binder_object {
> /**
> * struct binder_fd_object - describes a filedescriptor to be fixed up.
> * @hdr: common header structure
> - * @pad_flags: padding to remain compatible with old userspace code
> + * @flags: One or more BINDER_FD_FLAG_* flags
> * @pad_binder: padding to remain compatible with old userspace code
> * @fd: file descriptor
> * @cookie: opaque data, used by user-space
> */
> struct binder_fd_object {
> struct binder_object_header hdr;
> - __u32 pad_flags;
> + __u32 flags;
> union {
> binder_uintptr_t pad_binder;
> __u32 fd;
> @@ -107,6 +107,17 @@ struct binder_fd_object {
> binder_uintptr_t cookie;
> };
>
> +enum {
> + /**
> + * @BINDER_FD_FLAG_XFER_CHARGE
> + *
> + * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
> + * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
> + * sender's cgroup and charged to the receiving process's cgroup instead.
> + */
> + BINDER_FD_FLAG_XFER_CHARGE = 0x01,
> +};
> +
> /* struct binder_buffer_object - object describing a userspace buffer
> * @hdr: common header structure
> * @flags: one or more BINDER_BUFFER_* flags
> @@ -141,7 +152,7 @@ enum {
>
> /* struct binder_fd_array_object - object describing an array of fds in a buffer
> * @hdr: common header structure
> - * @pad: padding to ensure correct alignment
> + * @flags: One or more BINDER_FD_FLAG_* flags
> * @num_fds: number of file descriptors in the buffer
> * @parent: index in offset array to buffer holding the fd array
> * @parent_offset: start offset of fd array in the buffer
> @@ -162,7 +173,7 @@ enum {
> */
> struct binder_fd_array_object {
> struct binder_object_header hdr;
> - __u32 pad;
> + __u32 flags;
> binder_size_t num_fds;
> binder_size_t parent;
> binder_size_t parent_offset;
> --
> 2.39.0.246.g2a6d74b583-goog
>
Other than the previous question, this looks good to me. Also, the error
from the test robot seems to indicate a missing stub for
dma_buf_transfer_charge() when !CONFIG_DMA_SHARED_BUFFER. However, this
is likely to be fixed outside of this patch. Feel free to add this tag
to the following round:
Acked-by: Carlos Llamas <[email protected]>
Thanks,
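(For reference, the missing stub would look something like the sketch
below; the -EBADF fallback mirrors the existing "not a dma-buf" error
and is an assumption, not part of this series:)

/* Hypothetical stub for builds without CONFIG_DMA_SHARED_BUFFER. */
static inline int dma_buf_transfer_charge(struct file *dmabuf_file,
					  struct task_struct *target)
{
	return -EBADF;
}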
Hi,
On 25/01/2023 11:52, Michal Hocko wrote:
> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
>>>> When a buffer is exported to userspace, use memcg to attribute the
>>>> buffer to the allocating cgroup until all buffer references are
>>>> released.
>>>
>>> Is there any reason why this memory cannot be charged during the
>>> allocation (__GFP_ACCOUNT used)?
>>> Also you do charge and account the memory but underlying pages do not
>>> know about their memcg (this is normally done with commit_charge for
>>> user mapped pages). This would become a problem if the memory is
>>> migrated for example.
>>
>> I don't think this is movable memory.
>>
>>> This also means that you have to maintain memcg
>>> reference outside of the memcg proper which is not really nice either.
>>> This mimics the tcp kmem limit implementation, which I really have to say I
>>> am not a great fan of, and this pattern shouldn't be copied.
>>>
>>
>> I think we should keep the discussion on technical merits instead of
>> personal preference. To me, using a skmem-like interface is totally fine,
>> but the pros/cons need to be very explicit and the clear reasons to
>> select that option should be included.
>
> I do agree with that. I didn't want to sound personal wrt tcp kmem
> accounting, but the overall code maintenance cost is higher because
> of how the tcp take on accounting differs from anything else in the memcg
> proper. I would prefer not to grow another example like that.
>
>> To me there are two options:
>>
>> 1. Using skmem like interface as this patch series:
>>
>> The main pros of this option is that it is very simple. Let me list down
>> the cons of this approach:
>>
>> a. There is a time window between the actual memory allocation/free and
>> the charge and uncharge: the [un]charge happens when the whole memory is
>> allocated or freed. I think for the charge path that might not be a big
>> issue, but on the uncharge this can cause issues. The application and
>> the potential shrinkers may have freed some of this dmabuf memory, but until
>> the whole dmabuf is freed, the memcg uncharge will not happen. This can
>> have consequences on the reclaim and oom behavior of the application.
>>
>> b. Due to the usage model, i.e. a central daemon allocating the dmabuf
>> memory upfront, there is a requirement for memcg charge transfer
>> functionality to move the charge from the central daemon to the
>> client applications. This does introduce complexity and avenues of weird
>> reclaim and oom behavior.
>>
>>
>> 2. Allocate and charge the memory on page fault by actual user
>>
>> In this approach, the memory is not allocated upfront by the central
>> daemon but rather on a page fault by the client application, and the
>> memcg charge happens at the same time.
>>
>> The only con I can think of is that this approach is more involved and may
>> need some clever tricks to track the page on the free path, i.e. we need to
>> decrement the dmabuf memcg stat on the free path. Maybe a page flag.
>>
>> The pros of this approach are that there is no need to have charge transfer
>> functionality, and the charge/uncharge is closely tied to the actual
>> memory allocation and free.
>>
>> Personally I would prefer the second approach, but I don't want to just
>> block this work if the dmabuf folks are ok with the cons mentioned for
>> the first approach.
>
> I am not familiar enough with dmabuf internals to judge complexity on their
> end, but I fully agree that charge-when-used is much easier to reason
> about and should have fewer subtle surprises.
Disclaimer that I don't seem to see patches 3&4 on dri-devel so maybe I
am missing something, but in principle yes, I agree that the 2nd option
(charge the user, not the exporter) should be preferred. The thing being
that at export time there may not be any backing store allocated; plus,
if the series restricts the charge transfer to just Android clients, then
it has the potential to miss many other use cases. At the least it needs
to outline how the feature will be useful outside Android.
Also stepping back for a moment - is a new memory category really
needed, versus perhaps attempting to charge the actual backing store
memory to the correct client? (There might have been many past
discussions on this so it's okay to point me towards something in the
archives.)
Regards,
Tvrtko
On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> Hi,
>
> On 25/01/2023 11:52, Michal Hocko wrote:
> > [snip]
>
> Disclaimer that I don't seem to see patches 3&4 on dri-devel so maybe I
> am missing something, but in principle yes, I agree that the 2nd option
> (charge the user, not the exporter) should be preferred. The thing being
> that at export time there may not be any backing store allocated; plus,
> if the series restricts the charge transfer to just Android clients, then
> it has the potential to miss many other use cases. At the least it needs
> to outline how the feature will be useful outside Android.
>
There is no restriction like that. It's available to anybody who wants
to call dma_buf_transfer_charge if they actually have a need for that,
which I don't really expect to be common since most users/owners of
the buffers will be the ones causing the export in the first place.
It's just not like that on Android with the extra allocator process in
the middle most of the time.
> Also stepping back for a moment - is a new memory category really
> needed, versus perhaps attempting to charge the actual backing store
> memory to the correct client? (There might have been many past
> discussions on this so it's okay to point me towards something in the
> archives.)
>
Well the dmabuf counter for the stat file is really just a subcategory
of memory that is charged. Its existence is not related to getting the
charge attributed to the right process/cgroup. We do want to know how
much of the memory attributed to a process is for dmabufs, which is
the main point of this series.
> Regards,
>
> Tvrtko
On Wed, Jan 25, 2023 at 4:05 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 24-01-23 10:55:21, T.J. Mercier wrote:
> > On Tue, Jan 24, 2023 at 7:00 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > > > When a buffer is exported to userspace, use memcg to attribute the
> > > > buffer to the allocating cgroup until all buffer references are
> > > > released.
> > >
> > > Is there any reason why this memory cannot be charged during the
> > > allocation (__GFP_ACCOUNT used)?
> >
> > My main motivation was to keep code changes away from exporters and
> > implement the accounting in one common spot for all of them. This is a
> > bit of a carryover from a previous approach [1] where there was some
> > objection to pushing off this work onto exporters and forcing them to
> > adapt, but __GFP_ACCOUNT does seem like a smaller burden than before
> > at least initially. However in order to support charge transfer
> > between cgroups with __GFP_ACCOUNT we'd need to be able to get at the
> > pages backing dmabuf objects, and the exporters are the ones with that
> > access. Meaning I think we'd have to add some additional dma_buf_ops
> > to achieve that, which was the objection from [1].
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > >
> > > Also you do charge and account the memory but underlying pages do not
> > > know about their memcg (this is normally done with commit_charge for
> > > user mapped pages). This would become a problem if the memory is
> > > migrated for example.
> >
> > Hmm, what problem do you see in this situation? If the backing pages
> > are to be migrated that requires the cooperation of the exporter,
> > which currently has no influence on how the cgroup charging is done
> > and that seems fine. (Unless you mean migrating the charge across
> > cgroups? In which case that's the next patch.)
>
> My main concern was that page migration could lose the external tracking
> without some additional steps on the dmabuf front.
>
I see, yes that would be true if an exporter moves data around between
system memory and VRAM for example. (I think TTM does this sort of
thing, but not sure if that's actually within a single dma buffer.)
VRAM feels like it maybe doesn't belong in memcg, yet it would still
be charged there under this series right now. I don't really see a way
around this except to involve the exporters directly in the accounting
(or don't attempt to distinguish between types of memory).
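For reference, the allocation-time charging discussed above would look
roughly like the following inside an exporter. This is only a sketch of the
__GFP_ACCOUNT route, not code from the series:

/* Each backing page is charged to the allocating task's memcg as it
 * is allocated, and uncharged automatically when it is freed. */
struct page *page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);
if (!page)
        return -ENOMEM;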
> > > This also means that you have to maintain memcg
> > > reference outside of the memcg proper which is not really nice either.
> > > This mimics the tcp kmem limit implementation, which I really have to say I
> > > am not a great fan of, and this pattern shouldn't be copied.
> > >
> > Ah, what can I say. This way looked simple to me. I think otherwise
> > we're back to making all exporters do more stuff for the accounting.
> >
> > > Also you are not really saying anything about the oom behavior. With
> > > this implementation the kernel will try to reclaim the memory and even
> > > trigger the memcg oom killer if the request size is <= 8 pages. Is this
> > > a desirable behavior?
> >
> > It will try to reclaim some memory, but not the dmabuf pages right?
> > Not *yet* anyway. This behavior sounds expected to me.
>
> Yes, we have discussed that shrinkers will follow up later which is
> fine. The question is how much reclaim actually makes sense at this
> stage. The charging interface usually copes with sizes resulting from
> allocation requests (so usually 1<<order based). I can imagine that a
> batch charge like the one implemented here could easily be 100s of MBs,
> and it is much harder to define reclaim targets for. At least that is something
> the memcg charging hasn't really considered yet. Maybe the existing
> try_charge implementation can cope with that just fine but it would be
> really great to have the expected behavior described.
>
> E.g. should the memcg OOM killer be invoked? Should reclaim really target
> regular memory at all costs, or is just a lightweight memory reclaim
> preferred (is the dmabuf charge failure an expensive operation wrt
> memory refault due to reclaim)?
Ah, in my experience very large individual buffers like that are rare.
Cumulative system-wide usage might reach 100s of megs or more spread
across many buffers. On my phone the majority of buffer sizes are 4
pages or less, but there are a few that reach into the tens of megs.
But now I see your point. I still think that where a memcg limit is
exceeded and we can't reclaim enough as a result of a new dmabuf
allocation, we should see a memcg OOM kill. Sounds like you are
looking for that to be written down, so I'll try to find a place for
that.
Part of the motivation for this accounting is to eventually have a
well-defined limit for applications to know how much more they can
allocate. So where buffer size or number of buffers is a flexible
variable, I'd like to see an application checking this limit before
making a large request in an effort to avoid reclaim in the first
place (a rough sketch of that check follows below). Where there is
heavy memory pressure and multiple competing apps, the status quo
today is a kill for us anyway (typically by LMKD).
> --
> Michal Hocko
> SUSE Labs
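The limit check described above might look something like this from
userspace. An illustrative sketch only: it assumes cgroup v2 mounted at
/sys/fs/cgroup, and /sys/fs/cgroup/app is a placeholder for the
application's own cgroup directory:

#include <stdio.h>

/* Read a numeric cgroup value; returns -1 if the file is missing or
 * holds "max" (memory.max contains "max" when no limit is set). */
static long long cg_read(const char *path)
{
        FILE *f = fopen(path, "r");
        long long v = -1;

        if (f) {
                if (fscanf(f, "%lld", &v) != 1)
                        v = -1;
                fclose(f);
        }
        return v;
}

/* Check headroom before a large buffer request to avoid reclaim. */
static int can_allocate(long long buffer_size)
{
        long long max = cg_read("/sys/fs/cgroup/app/memory.max");
        long long cur = cg_read("/sys/fs/cgroup/app/memory.current");

        if (max < 0 || cur < 0)     /* unlimited or unknown */
                return 1;
        return max - cur >= buffer_size;
}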
On Wed, Jan 25, 2023 at 9:30 AM Carlos Llamas <[email protected]> wrote:
>
> On Mon, Jan 23, 2023 at 07:17:25PM +0000, T.J. Mercier wrote:
> > From: Hridya Valsaraju <[email protected]>
> >
> > This patch introduces flag BINDER_FD_FLAG_XFER_CHARGE that a process
> > sending an individual fd or fd array to another process over binder IPC
> > can set to relinquish ownership of the fd(s) being sent for memory
> > accounting purposes. If the flag is found to be set during the fd or fd
> > array translation and the fd is for a DMA-BUF, the buffer is uncharged
> > from the sender's cgroup and charged to the receiving process's cgroup
> > instead.
> >
> > It is up to the sending process to ensure that it closes the fds
> > regardless of whether the transfer failed or succeeded.
> >
> > Most graphics shared memory allocations in Android are done by the
> > graphics allocator HAL process. On requests from clients, the HAL
> > process allocates memory and sends the fds to the clients over binder
> > IPC. The graphics allocator HAL will not retain any references to the
> > buffers. When the HAL sets BINDER_FD_FLAG_XFER_CHARGE, binder will
> > transfer the charge for the buffer from the allocator process cgroup to
> > the client process cgroup.
> >
> > The pad [1] and pad_flags [2] fields of binder_fd_object and
> > binder_fda_array_object come from alignment with flat_binder_object and
> > have never been exposed for use from userspace. This new use of flags
> > follows the pattern set by binder_buffer_object.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=feba3900cabb8e7c87368faa28e7a6936809ba22
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/include/uapi/linux/android/binder.h?id=5cdcf4c6a638591ec0e98c57404a19e7f9997567
> >
> > Signed-off-by: Hridya Valsaraju <[email protected]>
> > Signed-off-by: T.J. Mercier <[email protected]>
> > ---
> > Documentation/admin-guide/cgroup-v2.rst | 3 ++-
> > drivers/android/binder.c | 25 +++++++++++++++++++++----
> > include/uapi/linux/android/binder.h | 19 +++++++++++++++----
> > 3 files changed, 38 insertions(+), 9 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 538ae22bc514..d225295932c0 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1457,7 +1457,8 @@ PAGE_SIZE multiple when read back.
> >
> > dmabuf (npn)
> > Amount of memory used for exported DMA buffers allocated by the cgroup.
> > - Stays with the allocating cgroup regardless of how the buffer is shared.
> > + Stays with the allocating cgroup regardless of how the buffer is shared
> > + unless explicitly transferred.
> >
> > workingset_refault_anon
> > Number of refaults of previously evicted anonymous pages.
> > diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> > index 880224ec6abb..5e707974793f 100644
> > --- a/drivers/android/binder.c
> > +++ b/drivers/android/binder.c
> > @@ -42,6 +42,7 @@
> >
> > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> >
> > +#include <linux/dma-buf.h>
> > #include <linux/fdtable.h>
> > #include <linux/file.h>
> > #include <linux/freezer.h>
> > @@ -2237,7 +2238,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
> > return ret;
> > }
> >
> > -static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> > +static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
> > struct binder_transaction *t,
> > struct binder_thread *thread,
> > struct binder_transaction *in_reply_to)
> > @@ -2275,6 +2276,20 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
> > goto err_security;
> > }
> >
> > + if (IS_ENABLED(CONFIG_MEMCG) && (flags & BINDER_FD_FLAG_XFER_CHARGE)) {
>
> Do we need to test for MEMCG here? It seems this has been offloaded to
> dma_buf_transfer_charge()?
>
Nope, that's a duplicate check now. Will remove.
> > + ret = dma_buf_transfer_charge(file, target_proc->tsk);
> > + if (unlikely(ret == -EBADF)) {
> > + binder_user_error(
> > + "%d:%d got transaction with XFER_CHARGE for non-DMA-BUF fd, %d\n",
> > + proc->pid, thread->pid, fd);
> > + goto err_dmabuf;
> > + } else if (ret) {
> > + pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
> > + proc->pid, thread->pid, target_proc->pid);
> > + goto err_xfer;
> > + }
> > + }
> > +
> > [snip]
> Other than the previous question this looks good to me. Also, the error
> from the test robot seems to indicate a missing stub for
> dma_buf_transfer_charge() when !CONFIG_DMA_SHARED_BUFFER. However, this
> is likely to be fixed outside of this patch. Feel free to add this tag
> to the following round:
>
> Acked-by: Carlos Llamas <[email protected]>
>
> Thanks,
Got it, thanks!
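To illustrate the uapi above: a sender opting in to the charge transfer
would populate the fd object along these lines. A sketch only; dmabuf_fd
stands in for a real DMA-BUF fd being placed in a transaction:

#include <linux/android/binder.h>

struct binder_fd_object obj = {
        .hdr.type = BINDER_TYPE_FD,
        /* Hand the memcg charge for the buffer to the receiver. */
        .flags    = BINDER_FD_FLAG_XFER_CHARGE,
        .fd       = dmabuf_fd,
};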
On 25/01/2023 20:04, T.J. Mercier wrote:
> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
> <[email protected]> wrote:
>> [snip]
> There is no restriction like that. It's available to anybody who wants
> to call dma_buf_transfer_charge if they actually have a need for that,
> which I don't really expect to be common since most users/owners of
> the buffers will be the ones causing the export in the first place.
> It's just not like that on Android with the extra allocator process in
> the middle most of the time.
Yeah I used the wrong term "restrict", apologies. What I meant was, if
the idea was to allow spotting memory leaks, with the charge transfer
being optional and in the series only wired up for Android Binder, then
it obviously only fully works for that one case. So a step back...
For instance, is it not feasible to transfer the charge when the dmabuf
is attached, or imported? That would attribute the usage to the
user/importer and so give better visibility on who is actually causing the
memory leak.
Furthermore, if the above is feasible, then could it also be implemented in
the common layer so it would automatically cover all drivers?
>> Also stepping back for a moment - is a new memory category really
>> needed, versus perhaps attempting to charge the actual backing store
>> memory to the correct client? (There might have been many past
>> discussions on this so it's okay to point me towards something in the
>> archives.)
>>
> Well the dmabuf counter for the stat file is really just a subcategory
> of memory that is charged. Its existence is not related to getting the
> charge attributed to the right process/cgroup. We do want to know how
> much of the memory attributed to a process is for dmabufs, which is
> the main point of this series.
Then I am probably missing something, because the statement that the proposal
is not intended to charge the right process, but wants to know how
much dmabuf "size" is attributed to a process, confuses me due to a seeming
contradiction. And there is the fact that it would not be externally observable
how much of the stats is accurate and how much is not (without knowing the
implementation detail of which drivers implement charge transfer and
when). Maybe I completely misunderstood the use case.
Regards,
Tvrtko
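Charging at attach time, as suggested above, might look roughly like the
sketch below. This is hypothetical, not code from the series, and the
charge helper name is an assumption modelled on the skmem-like pair
discussed earlier:

/* Charge the importer, not the exporter, once an attachment
 * guarantees that backing store exists. */
static int dmabuf_charge_importer(struct dma_buf *dmabuf)
{
        struct mem_cgroup *memcg = get_mem_cgroup_from_mm(current->mm);

        if (!mem_cgroup_charge_dmabuf(memcg, dmabuf->size >> PAGE_SHIFT,
                                      GFP_KERNEL)) {
                mem_cgroup_put(memcg);
                return -ENOMEM;
        }
        dmabuf->memcg = memcg;  /* reference transferred to the buffer */
        return 0;
}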
On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 25/01/2023 20:04, T.J. Mercier wrote:
> > [snip]
>
> Yeah I used the wrong term "restrict", apologies. What I meant was, if
> the idea was to allow spotting memory leaks, with the charge transfer
> being optional and in the series only wired up for Android Binder, then
> it obviously only fully works for that one case. So a step back...
>
Oh, spotting kernel memory leaks is a side-benefit of accounting
kernel-only buffers in the root cgroup. The primary goal is to
attribute buffers to applications that originated them (via
per-application cgroups) simply for accounting purposes. Buffers are
using memory on the system, and we want to know who created them and
how much memory is used. That information is/will no longer be available
with the recent deprecation of the dmabuf sysfs statistics.
> For instance, is it not feasible to transfer the charge when the dmabuf
> is attached, or imported? That would attribute the usage to the
> user/importer and so give better visibility on who is actually causing the
> memory leak.
>
Instead of accounting at export, we could account at attach. That just
turns out not to be very useful when the majority of our
heap-allocated buffers don't have attachments at any particular point
in time. :\ But again it's less about leaks and more about knowing
which buffers exist in the first place.
> Furthermore, if the above is feasible, then could it also be implemented in
> the common layer so it would automatically cover all drivers?
>
Which common layer code specifically? The dmabuf interface appears to
be the most central/common place to me.
> >> Also stepping back for a moment - is a new memory category really
> >> needed, versus perhaps attempting to charge the actual backing store
> >> memory to the correct client? (There might have been many past
> >> discussions on this so it's okay to point me towards something in the
> >> archives.)
> >>
> > Well the dmabuf counter for the stat file is really just a subcategory
> > of memory that is charged. Its existence is not related to getting the
> > charge attributed to the right process/cgroup. We do want to know how
> > much of the memory attributed to a process is for dmabufs, which is
> > the main point of this series.
>
> Then I am probably missing something, because the statement that the proposal
> is not intended to charge the right process, but wants to know how
> much dmabuf "size" is attributed to a process, confuses me due to a seeming
> contradiction. And there is the fact that it would not be externally observable
> how much of the stats is accurate and how much is not (without knowing the
> implementation detail of which drivers implement charge transfer and
> when). Maybe I completely misunderstood the use case.
>
Hmm, did I clear this up above or no? The current proposal is for the
process causing the export of a buffer to be charged for it,
regardless of whatever happens afterwards. (Unless that process is
like gralloc on Android, in which case the charge is transferred from
gralloc to whoever called gralloc to allocate the buffer on their
behalf.)
> Regards,
>
> Tvrtko
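In other words, the transfer described above boils down to a charge to the
new owner followed by an uncharge from the old one. A simplified sketch of
what dma_buf_transfer_charge() amounts to; the charge/uncharge helper names
are assumptions based on the series' skmem-like pair:

static int transfer_charge(struct dma_buf *dmabuf, struct mem_cgroup *to)
{
        unsigned int nr_pages = dmabuf->size >> PAGE_SHIFT;

        if (!mem_cgroup_charge_dmabuf(to, nr_pages, GFP_KERNEL))
                return -ENOMEM; /* new owner is over its limit */
        css_get(&to->css);      /* buffer holds a ref on its memcg */
        mem_cgroup_uncharge_dmabuf(dmabuf->memcg, nr_pages);
        mem_cgroup_put(dmabuf->memcg);
        dmabuf->memcg = to;
        return 0;
}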
On 01/02/2023 01:49, T.J. Mercier wrote:
> On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
> <[email protected]> wrote:
>> [snip]
> Hmm, did I clear this up above or no? The current proposal is for the
> process causing the export of a buffer to be charged for it,
> regardless of whatever happens afterwards. (Unless that process is
> like gralloc on Android, in which case the charge is transferred from
> gralloc to whoever called gralloc to allocate the buffer on their
> behalf.)
The main problem for me is that charging at export time has no relation to memory used. But I am not familiar enough with the memcg counters to know if any other counter sets the same precedent. If all the others are about real memory use then IMO this does not fit that well. I mean specifically this:
+ dmabuf (npn)
+ Amount of memory used for exported DMA buffers allocated by the cgroup.
+ Stays with the allocating cgroup regardless of how the buffer is shared.
+
I think that "Amount of memory used for exported..." is not correct. As implemented it is more akin the virtual address space size in the cpu space - it can have no relation to the actual usage since backing store is not allocated until the attachment is made.
Then also this:
@@ -446,6 +447,8 @@ struct dma_buf {
struct dma_buf *dmabuf;
} *sysfs_entry;
#endif
+ /* The cgroup to which this buffer is currently attributed */
+ struct mem_cgroup *memcg;
};
This does not conceptually fit in my mind. Dmabufs are not associated with
just one cgroup at a time.
So if you would place the tracking into dma_buf_attach/detach you would be
able to charge the correct cgroup regardless of the driver, and since by
contract at this stage there is backing store, the reflected memory usage
counter would be truthful.
But then you state a problem: that the majority of the time there are no
attachments in your setup, and you also say the proposal is not so much
about leaks but more about knowing what is exported.
In this case you could additionally track that via dma_buf_getfile /
dma_buf_file_release as a separate category like dmabuf-exported? But
again, I personally don't know if such "may not really be using memory"
counters fit in memcg.
(Hm, you'd probably still need dmabuf->export_memcg to store who the
original caller of dma_buf_getfile was, in case the last reference is
dropped from a different process/context. You'd even need
dmabuf->attach_memcg for attach/detach to work correctly, for the same
reason.)
Regards,
Tvrtko
On 01/02/2023 14:23, Tvrtko Ursulin wrote:
>
> On 01/02/2023 01:49, T.J. Mercier wrote:
>> On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
>> <[email protected]> wrote:
>>>
>>>
>>> On 25/01/2023 20:04, T.J. Mercier wrote:
>>>> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 25/01/2023 11:52, Michal Hocko wrote:
>>>>>> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
>>>>>>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
>>>>>>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
>>>>>>>>> When a buffer is exported to userspace, use memcg to attribute the
>>>>>>>>> buffer to the allocating cgroup until all buffer references are
>>>>>>>>> released.
>>>>>>>>
>>>>>>>> Is there any reason why this memory cannot be charged during the
>>>>>>>> allocation (__GFP_ACCOUNT used)?
>>>>>>>> Also you do charge and account the memory but underlying pages
>>>>>>>> do not
>>>>>>>> know about their memcg (this is normally done with commit_charge
>>>>>>>> for
>>>>>>>> user mapped pages). This would become a problem if the memory is
>>>>>>>> migrated for example.
>>>>>>>
>>>>>>> I don't think this is movable memory.
>>>>>>>
>>>>>>>> This also means that you have to maintain memcg
>>>>>>>> reference outside of the memcg proper which is not really nice
>>>>>>>> either.
>>>>>>>> This mimicks tcp kmem limit implementation which I really have
>>>>>>>> to say I
>>>>>>>> am not a great fan of and this pattern shouldn't be coppied.
>>>>>>>>
>>>>>>>
>>>>>>> I think we should keep the discussion on technical merits instead of
>>>>>>> personal perference. To me using skmem like interface is totally
>>>>>>> fine
>>>>>>> but the pros/cons need to be very explicit and the clear reasons to
>>>>>>> select that option should be included.
>>>>>>
>>>>>> I do agree with that. I didn't want sound to be personal wrt tcp kmem
>>>>>> accounting but the overall code maintenance cost is higher because
>>>>>> of how tcp take on accounting differs from anything else in the memcg
>>>>>> proper. I would prefer to not grow another example like that.
>>>>>>
>>>>>>> To me there are two options:
>>>>>>>
>>>>>>> 1. Using skmem like interface as this patch series:
>>>>>>>
>>>>>>> The main pros of this option is that it is very simple. Let me
>>>>>>> list down
>>>>>>> the cons of this approach:
>>>>>>>
>>>>>>> a. There is time window between the actual memory allocation/free
>>>>>>> and
>>>>>>> the charge and uncharge and [un]charge happen when the whole
>>>>>>> memory is
>>>>>>> allocated or freed. I think for the charge path that might not be
>>>>>>> a big
>>>>>>> issue but on the uncharge, this can cause issues. The application
>>>>>>> and
>>>>>>> the potential shrinkers have freed some of this dmabuf memory but
>>>>>>> until
>>>>>>> the whole dmabuf is freed, the memcg uncharge will not happen.
>>>>>>> This can
>>>>>>> consequences on reclaim and oom behavior of the application.
>>>>>>>
>>>>>>> b. Due to the usage model i.e. a central daemon allocating the
>>>>>>> dmabuf
>>>>>>> memory upfront, there is a requirement to have a memcg charge
>>>>>>> transfer
>>>>>>> functionality to transfer the charge from the central daemon to the
>>>>>>> client applications. This does introduce complexity and avenues
>>>>>>> of weird
>>>>>>> reclaim and oom behavior.
>>>>>>>
>>>>>>>
>>>>>>> 2. Allocate and charge the memory on page fault by actual user
>>>>>>>
>>>>>>> In this approach, the memory is not allocated upfront by the central
>>>>>>> daemon but rather on the page fault by the client application and
>>>>>>> the
>>>>>>> memcg charge happen at the same time.
>>>>>>>
>>>>>>> The only cons I can think of is this approach is more involved
>>>>>>> and may
>>>>>>> need some clever tricks to track the page on the free patch i.e.
>>>>>>> we to
>>>>>>> decrement the dmabuf memcg stat on free path. Maybe a page flag.
>>>>>>>
>>>>>>> The pros of this approach is there is no need have a charge transfer
>>>>>>> functionality and the charge/uncharge being closely tied to the
>>>>>>> actual
>>>>>>> memory allocation and free.
>>>>>>>
>>>>>>> Personally I would prefer the second approach but I don't want to
>>>>>>> just
>>>>>>> block this work if the dmabuf folks are ok with the cons
>>>>>>> mentioned of
>>>>>>> the first approach.
>>>>>>
>>>>>> I am not familiar with dmabuf internals to judge complexity on
>>>>>> their end
>>>>>> but I fully agree that charge-when-used is much more easier to reason
>>>>>> about and it should have less subtle surprises.
>>>>>
>>>>> Disclaimer that I don't seem to see patches 3&4 on dri-devel so
>>>>> maybe I
>>>>> am missing something, but in principle yes, I agree that the 2nd
>>>>> option
>>>>> (charge the user, not exporter) should be preferred. Thing being
>>>>> that at
>>>>> export time there may not be any backing store allocated, plus if the
>>>>> series is restricting the charge transfer to just Android clients then
>>>>> it seems it has the potential to miss many other use cases. At least
>>>>> needs to outline a description on how the feature will be useful
>>>>> outside
>>>>> Android.
>>>>>
>>>> There is no restriction like that. It's available to anybody who wants
>>>> to call dma_buf_charge_transfer if they actually have a need for that,
>>>> which I don't really expect to be common since most users/owners of
>>>> the buffers will be the ones causing the export in the first place.
>>>> It's just not like that on Android with the extra allocator process in
>>>> the middle most of the time.
>>>
>>> Yeah I used the wrong term "restrict", apologies. What I meant was, if
>>> the idea was to allow spotting memory leaks, with the charge transfer
>>> being optional and in the series only wired up for Android Binder, then
>>> it obviously only fully works for that one case. So a step back..
>>>
>> Oh, spotting kernel memory leaks is a side-benefit of accounting
>> kernel-only buffers in the root cgroup. The primary goal is to
>> attribute buffers to applications that originated them (via
>> per-application cgroups) simply for accounting purposes. Buffers are
>> using memory on the system, and we want to know who created them and
>> how much memory is used. That information is/will no longer available
>> with the recent deprecation of the dmabuf sysfs statistics.
>>
>>> .. For instance, it is not feasible to transfer the charge when dmabuf
>>> is attached, or imported? That would attribute the usage to the
>>> user/importer so give better visibility on who is actually causing the
>>> memory leak.
>>>
>> Instead of accounting at export, we could account at attach. That just
>> turns out not to be very useful when the majority of our
>> heap-allocated buffers don't have attachments at any particular point
>> in time. :\ But again it's less about leaks and more about knowing
>> which buffers exist in the first place.
>>
>>> Further more, if above is feasible, then could it also be implemented in
>>> the common layer so it would automatically cover all drivers?
>>>
>> Which common layer code specifically? The dmabuf interface appears to
>> be the most central/common place to me.
>
> Yes, I meant dma_buf_attach / detach. More below.
>>>>> Also stepping back for a moment - is a new memory category really
>>>>> needed, versus perhaps attempting to charge the actual backing store
>>>>> memory to the correct client? (There might have been many past
>>>>> discussions on this so it's okay to point me towards something in the
>>>>> archives.)
>>>>>
>>>> Well the dmabuf counter for the stat file is really just a subcategory
>>>> of memory that is charged. Its existence is not related to getting the
>>>> charge attributed to the right process/cgroup. We do want to know how
>>>> much of the memory attributed to a process is for dmabufs, which is
>>>> the main point of this series.
>>>
>>> Then I am probably missing something because the statement how proposal
>>> is not intended to charge to the right process, but wants to know how
>>> much dmabuf "size" is attributed to a process, confuses me due a seeming
>>> contradiction. And the fact it would not be externally observable how
>>> much of the stats is accurate and how much is not (without knowing the
>>> implementation detail of which drivers implement charge transfer and
>>> when). Maybe I completely misunderstood the use case.
>>>
>> Hmm, did I clear this up above or no? The current proposal is for the
>> process causing the export of a buffer to be charged for it,
>> regardless of whatever happens afterwards. (Unless that process is
>> like gralloc on Android, in which case the charge is transferred from
>> gralloc to whoever called gralloc to allocate the buffer on their
>> behalf.)
>
> Main problem for me is that charging at export time has no relation to
> memory used. But I am not familiar with the memcg counters to know if
> any other counter sets that same precedent. If all other are about real
> memory use then IMO this does not fit that well. I mean specifically this:
>
> +	  dmabuf (npn)
> +		Amount of memory used for exported DMA buffers allocated by
> +		the cgroup.
> +		Stays with the allocating cgroup regardless of how the buffer
> +		is shared.
> +
>
> I think that "Amount of memory used for exported..." is not correct. As
> implemented it is more akin to the virtual address space size in the cpu
> space - it can have no relation to the actual usage since backing store
> is not allocated until the attachment is made.
>
> Then also this:
>
> @@ -446,6 +447,8 @@ struct dma_buf {
> 		struct dma_buf *dmabuf;
> 	} *sysfs_entry;
>  #endif
> +	/* The cgroup to which this buffer is currently attributed */
> +	struct mem_cgroup *memcg;
>  };
>
> Does not conceptually fit in my mind. Dmabufs are not associated with
> one cgroup at a time.
>
> So if you would place tracking into dma_buf_attach/detach you would be
> able to charge the correct cgroup regardless of the driver, and since by
> contract at this stage there is backing store, the reflected memory
> usage counter would be truthful.
>
> But then you state a problem, that majority of the time there are no
> attachments in your setup, and you also say the proposal is not so much
> about leaks but more about knowing what is exported.
>
> In this case you could additionally track that via dma_buf_getfile /
> dma_buf_file_release as a separate category like dmabuf-exported? But
> again, I personally don't know if such "may not really be using memory"
> counters fit in memcg.
>
> (Hm you'd probably still need dmabuf->export_memcg to store who was the
> original caller of dma_buf_getfile, in case the last reference is dropped
> from a different process/context. Even dmabuf->attach_memcg for
> attach/detach to work correctly for the same reason.)
Or to work around the "may not really be using memory" problem with the
exported tracking, perhaps you could record dmabuf->export_memcg at
dma_buf_export time, but only charge against it at dma_buf_getfile time.
Assuming it is possible to keep references to those memcgs over the
dmabuf lifetime without any issues.
That way we could have dmabuf-exported and dmabuf-imported memcg
categories which would better correlate with real memory usage. I say
better, because I don't think it would still be perfect since individual
drivers are allowed to hold onto the backing store post detach and that
is invisible to dmabuf API. But that probably is a different problem.
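For what it is worth, a rough sketch of the exported-side half of that
idea - export_memcg is the field mentioned above, while the
mem_cgroup_charge/uncharge_dmabuf_exported helpers are names invented
here purely for illustration:

/* Called from dma_buf_getfile: charge "dmabuf-exported" to the caller. */
static void dma_buf_account_exported(struct dma_buf *dmabuf)
{
	unsigned int nr_pages = PAGE_ALIGN(dmabuf->size) >> PAGE_SHIFT;

	dmabuf->export_memcg = get_mem_cgroup_from_mm(current->mm);
	mem_cgroup_charge_dmabuf_exported(dmabuf->export_memcg, nr_pages);
}

/* Called from dma_buf_file_release: undo the above against the memcg
 * recorded at export, not whoever happened to drop the last reference. */
static void dma_buf_unaccount_exported(struct dma_buf *dmabuf)
{
	unsigned int nr_pages = PAGE_ALIGN(dmabuf->size) >> PAGE_SHIFT;

	mem_cgroup_uncharge_dmabuf_exported(dmabuf->export_memcg, nr_pages);
	mem_cgroup_put(dmabuf->export_memcg);
}

The imported side would be the same shape, keyed off dma_buf_attach /
dma_buf_detach.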
Regards,
Tvrtko
On Wed, Feb 1, 2023 at 6:23 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 01/02/2023 01:49, T.J. Mercier wrote:
> > On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> On 25/01/2023 20:04, T.J. Mercier wrote:
> >>> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 25/01/2023 11:52, Michal Hocko wrote:
> >>>>> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
> >>>>>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
> >>>>>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> >>>>>>>> When a buffer is exported to userspace, use memcg to attribute the
> >>>>>>>> buffer to the allocating cgroup until all buffer references are
> >>>>>>>> released.
> >>>>>>>
> >>>>>>> Is there any reason why this memory cannot be charged during the
> >>>>>>> allocation (__GFP_ACCOUNT used)?
> >>>>>>> Also you do charge and account the memory but underlying pages do not
> >>>>>>> know about their memcg (this is normally done with commit_charge for
> >>>>>>> user mapped pages). This would become a problem if the memory is
> >>>>>>> migrated for example.
> >>>>>>
> >>>>>> I don't think this is movable memory.
> >>>>>>
> >>>>>>> This also means that you have to maintain memcg
> >>>>>>> reference outside of the memcg proper which is not really nice either.
> >>>>>>> This mimics tcp kmem limit implementation which I really have to say I
> >>>>>>> am not a great fan of, and this pattern shouldn't be copied.
> >>>>>>>
> >>>>>>
> >>>>>> I think we should keep the discussion on technical merits instead of
> >>>>>> personal preference. To me using an skmem-like interface is totally fine
> >>>>>> but the pros/cons need to be very explicit and the clear reasons to
> >>>>>> select that option should be included.
> >>>>>
> >>>>> I do agree with that. I didn't want to sound personal wrt tcp kmem
> >>>>> accounting, but the overall code maintenance cost is higher because
> >>>>> of how tcp's take on accounting differs from anything else in the memcg
> >>>>> proper. I would prefer to not grow another example like that.
> >>>>>
> >>>>>> To me there are two options:
> >>>>>>
> >>>>>> 1. Using skmem like interface as this patch series:
> >>>>>>
> >>>>>> The main pro of this option is that it is very simple. Let me list down
> >>>>>> the cons of this approach:
> >>>>>>
> >>>>>> a. There is a time window between the actual memory allocation/free and
> >>>>>> the charge/uncharge, and the [un]charge happens when the whole memory is
> >>>>>> allocated or freed. I think for the charge path that might not be a big
> >>>>>> issue but on the uncharge, this can cause issues. The application and
> >>>>>> the potential shrinkers have freed some of this dmabuf memory but until
> >>>>>> the whole dmabuf is freed, the memcg uncharge will not happen. This can
> >>>>>> have consequences on reclaim and oom behavior of the application.
> >>>>>>
> >>>>>> b. Due to the usage model i.e. a central daemon allocating the dmabuf
> >>>>>> memory upfront, there is a requirement to have a memcg charge transfer
> >>>>>> functionality to transfer the charge from the central daemon to the
> >>>>>> client applications. This does introduce complexity and avenues of weird
> >>>>>> reclaim and oom behavior.
> >>>>>>
> >>>>>>
> >>>>>> 2. Allocate and charge the memory on page fault by actual user
> >>>>>>
> >>>>>> In this approach, the memory is not allocated upfront by the central
> >>>>>> daemon but rather on the page fault by the client application and the
> >>>>>> memcg charge happens at the same time.
> >>>>>>
> >>>>>> The only con I can think of is that this approach is more involved and may
> >>>>>> need some clever tricks to track the page on the free path, i.e. we need to
> >>>>>> decrement the dmabuf memcg stat on the free path. Maybe a page flag.
> >>>>>>
> >>>>>> The pros of this approach are that there is no need to have a charge transfer
> >>>>>> functionality and that the charge/uncharge is closely tied to the actual
> >>>>>> memory allocation and free.
> >>>>>>
> >>>>>> Personally I would prefer the second approach but I don't want to just
> >>>>>> block this work if the dmabuf folks are ok with the mentioned cons of
> >>>>>> the first approach.
> >>>>>
> >>>>> I am not familiar enough with dmabuf internals to judge complexity on their
> >>>>> end, but I fully agree that charge-when-used is much easier to reason
> >>>>> about and it should have fewer subtle surprises.
> >>>>
> >>>> Disclaimer that I don't seem to see patches 3&4 on dri-devel so maybe I
> >>>> am missing something, but in principle yes, I agree that the 2nd option
> >>>> (charge the user, not exporter) should be preferred. Thing being that at
> >>>> export time there may not be any backing store allocated, plus if the
> >>>> series is restricting the charge transfer to just Android clients then
> >>>> it seems it has the potential to miss many other use cases. At least
> >>>> it needs to outline how the feature will be useful outside
> >>>> Android.
> >>>>
> >>> There is no restriction like that. It's available to anybody who wants
> >>> to call dma_buf_charge_transfer if they actually have a need for that,
> >>> which I don't really expect to be common since most users/owners of
> >>> the buffers will be the ones causing the export in the first place.
> >>> It's just not like that on Android with the extra allocator process in
> >>> the middle most of the time.
> >>
> >> Yeah I used the wrong term "restrict", apologies. What I meant was, if
> >> the idea was to allow spotting memory leaks, with the charge transfer
> >> being optional and in the series only wired up for Android Binder, then
> >> it obviously only fully works for that one case. So a step back..
> >>
> > Oh, spotting kernel memory leaks is a side-benefit of accounting
> > kernel-only buffers in the root cgroup. The primary goal is to
> > attribute buffers to applications that originated them (via
> > per-application cgroups) simply for accounting purposes. Buffers are
> > using memory on the system, and we want to know who created them and
> > how much memory is used. That information is/will no longer be available
> > with the recent deprecation of the dmabuf sysfs statistics.
> >
> >> .. For instance, is it not feasible to transfer the charge when dmabuf
> >> is attached, or imported? That would attribute the usage to the
> >> user/importer, so giving better visibility on who is actually causing the
> >> memory leak.
> >>
> > Instead of accounting at export, we could account at attach. That just
> > turns out not to be very useful when the majority of our
> > heap-allocated buffers don't have attachments at any particular point
> > in time. :\ But again it's less about leaks and more about knowing
> > which buffers exist in the first place.
> >
> >> Furthermore, if the above is feasible, could it also be implemented in
> >> the common layer so it would automatically cover all drivers?
> >>
> > Which common layer code specifically? The dmabuf interface appears to
> > be the most central/common place to me.
>
> Yes, I meant dma_buf_attach / detach. More below.
> >>>> Also stepping back for a moment - is a new memory category really
> >>>> needed, versus perhaps attempting to charge the actual backing store
> >>>> memory to the correct client? (There might have been many past
> >>>> discussions on this so it's okay to point me towards something in the
> >>>> archives.)
> >>>>
> >>> Well the dmabuf counter for the stat file is really just a subcategory
> >>> of memory that is charged. Its existence is not related to getting the
> >>> charge attributed to the right process/cgroup. We do want to know how
> >>> much of the memory attributed to a process is for dmabufs, which is
> >>> the main point of this series.
> >>
> >> Then I am probably missing something, because the statement that the proposal
> >> is not intended to charge to the right process, but wants to know how
> >> much dmabuf "size" is attributed to a process, confuses me due to a seeming
> >> contradiction. And the fact that it would not be externally observable how
> >> much of the stats is accurate and how much is not (without knowing the
> >> implementation detail of which drivers implement charge transfer and
> >> when). Maybe I completely misunderstood the use case.
> >>
> > Hmm, did I clear this up above or no? The current proposal is for the
> > process causing the export of a buffer to be charged for it,
> > regardless of whatever happens afterwards. (Unless that process is
> > like gralloc on Android, in which case the charge is transferred from
> > gralloc to whoever called gralloc to allocate the buffer on their
> > behalf.)
>
> The main problem for me is that charging at export time has no relation to memory used. But I am not familiar with the memcg counters to know if any other counter sets that same precedent. If all others are about real memory use then IMO this does not fit that well. I mean specifically this:
>
> + dmabuf (npn)
> + Amount of memory used for exported DMA buffers allocated by the cgroup.
> + Stays with the allocating cgroup regardless of how the buffer is shared.
> +
>
> I think that "Amount of memory used for exported..." is not correct. As implemented it is more akin to the virtual address space size in the cpu space - it can have no relation to the actual usage since backing store is not allocated until the attachment is made.
>
> Then also this:
>
> @@ -446,6 +447,8 @@ struct dma_buf {
> struct dma_buf *dmabuf;
> } *sysfs_entry;
> #endif
> + /* The cgroup to which this buffer is currently attributed */
> + struct mem_cgroup *memcg;
> };
>
> Does not conceptually fit in my mind. Dmabufs are not associated with one cgroup at a time.
>
It's true that a dmabuf could be shared among processes in different
cgroups, but this refers to the one that's charged for it. Similar to
how the shmem pages that back memfds (which can be similarly shared) get
charged to the first cgroup that touches each page, here it's the
entire buffer instead of each individual page. Maybe it'd be possible
to charge whoever attaches / maps first, but I have to point out
there'd be a gap between then and export where we'd have no accounting
of the memory for cases where pages actually do get allocated during
export (like in the system_heap).
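For the gralloc-style case the charge then has to be able to follow the
buffer. Roughly what a transfer needs to do, as a sketch only - the
mem_cgroup_*_dmabuf helpers are stand-ins modeled on the existing
mem_cgroup_charge_skmem()/mem_cgroup_uncharge_skmem() pair, and the
actual dma_buf_charge_transfer() in this series may differ:

int dma_buf_charge_transfer(struct dma_buf *dmabuf, struct mem_cgroup *target)
{
	unsigned int nr_pages = PAGE_ALIGN(dmabuf->size) >> PAGE_SHIFT;
	struct mem_cgroup *old = dmabuf->memcg;

	/* Charge the target first so the buffer is never unaccounted. */
	if (!mem_cgroup_charge_dmabuf(target, nr_pages))
		return -ENOMEM;
	css_get(&target->css);
	dmabuf->memcg = target;

	mem_cgroup_uncharge_dmabuf(old, nr_pages);
	mem_cgroup_put(old);
	return 0;
}

So after gralloc exports on an app's behalf, binder can move the whole
charge to the app's cgroup in one step, and the uncharge at release time
then hits the right cgroup.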
> So if you would place tracking into dma_buf_attach/detach you would be able to charge the correct cgroup regardless of the driver, and since by contract at this stage there is backing store, the reflected memory usage counter would be truthful.
>
> But then you state a problem, that majority of the time there are no attachments in your setup, and you also say the proposal is not so much about leaks but more about knowing what is exported.
>
> In this case you could additionally track that via dma_buf_getfile / dma_buf_file_release as a separate category like dmabuf-exported? But again, I personally don't know if such "may not really be using memory" counters fit in memcg.
>
> (Hm you'd probably still need dmabuf->export_memcg to store who was the original caller of dma_buf_getfile, in case the last reference is dropped from a different process/context. Even dmabuf->attach_memcg for attach/detach to work correctly for the same reason.)
>
> Regards,
>
> Tvrtko
On Wed, Feb 1, 2023 at 6:52 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> On 01/02/2023 14:23, Tvrtko Ursulin wrote:
> >
> > On 01/02/2023 01:49, T.J. Mercier wrote:
> >> On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
> >> <[email protected]> wrote:
> >>>
> >>>
> >>> On 25/01/2023 20:04, T.J. Mercier wrote:
> >>>> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> On 25/01/2023 11:52, Michal Hocko wrote:
> >>>>>> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
> >>>>>>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
> >>>>>>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> >>>>>>>>> When a buffer is exported to userspace, use memcg to attribute the
> >>>>>>>>> buffer to the allocating cgroup until all buffer references are
> >>>>>>>>> released.
> >>>>>>>>
> >>>>>>>> Is there any reason why this memory cannot be charged during the
> >>>>>>>> allocation (__GFP_ACCOUNT used)?
> >>>>>>>> Also you do charge and account the memory but underlying pages
> >>>>>>>> do not
> >>>>>>>> know about their memcg (this is normally done with commit_charge
> >>>>>>>> for
> >>>>>>>> user mapped pages). This would become a problem if the memory is
> >>>>>>>> migrated for example.
> >>>>>>>
> >>>>>>> I don't think this is movable memory.
> >>>>>>>
> >>>>>>>> This also means that you have to maintain memcg
> >>>>>>>> reference outside of the memcg proper which is not really nice
> >>>>>>>> either.
> >>>>>>>> This mimics tcp kmem limit implementation which I really have
> >>>>>>>> to say I
> >>>>>>>> am not a great fan of, and this pattern shouldn't be copied.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I think we should keep the discussion on technical merits instead of
> >>>>>>> personal preference. To me using an skmem-like interface is totally
> >>>>>>> fine
> >>>>>>> but the pros/cons need to be very explicit and the clear reasons to
> >>>>>>> select that option should be included.
> >>>>>>
> >>>>>> I do agree with that. I didn't want to sound personal wrt tcp kmem
> >>>>>> accounting, but the overall code maintenance cost is higher because
> >>>>>> of how tcp's take on accounting differs from anything else in the memcg
> >>>>>> proper. I would prefer to not grow another example like that.
> >>>>>>
> >>>>>>> To me there are two options:
> >>>>>>>
> >>>>>>> 1. Using skmem like interface as this patch series:
> >>>>>>>
> >>>>>>> The main pro of this option is that it is very simple. Let me
> >>>>>>> list down
> >>>>>>> the cons of this approach:
> >>>>>>>
> >>>>>>> a. There is a time window between the actual memory allocation/free
> >>>>>>> and the charge/uncharge, and the [un]charge happens when the whole
> >>>>>>> memory is allocated or freed. I think for the charge path that might
> >>>>>>> not be a big issue but on the uncharge, this can cause issues. The
> >>>>>>> application and the potential shrinkers have freed some of this
> >>>>>>> dmabuf memory but until the whole dmabuf is freed, the memcg uncharge
> >>>>>>> will not happen. This can have consequences on reclaim and oom
> >>>>>>> behavior of the application.
> >>>>>>>
> >>>>>>> b. Due to the usage model i.e. a central daemon allocating the
> >>>>>>> dmabuf
> >>>>>>> memory upfront, there is a requirement to have a memcg charge
> >>>>>>> transfer
> >>>>>>> functionality to transfer the charge from the central daemon to the
> >>>>>>> client applications. This does introduce complexity and avenues
> >>>>>>> of weird
> >>>>>>> reclaim and oom behavior.
> >>>>>>>
> >>>>>>>
> >>>>>>> 2. Allocate and charge the memory on page fault by actual user
> >>>>>>>
> >>>>>>> In this approach, the memory is not allocated upfront by the central
> >>>>>>> daemon but rather on the page fault by the client application, and
> >>>>>>> the memcg charge happens at the same time.
> >>>>>>>
> >>>>>>> The only con I can think of is that this approach is more involved
> >>>>>>> and may need some clever tricks to track the page on the free path,
> >>>>>>> i.e. we need to decrement the dmabuf memcg stat on the free path.
> >>>>>>> Maybe a page flag.
> >>>>>>>
> >>>>>>> The pros of this approach are that there is no need to have a charge
> >>>>>>> transfer functionality and that the charge/uncharge is closely tied
> >>>>>>> to the actual memory allocation and free.
> >>>>>>>
> >>>>>>> Personally I would prefer the second approach but I don't want to
> >>>>>>> just block this work if the dmabuf folks are ok with the mentioned
> >>>>>>> cons of the first approach.
> >>>>>>
> >>>>>> I am not familiar enough with dmabuf internals to judge complexity
> >>>>>> on their end, but I fully agree that charge-when-used is much easier
> >>>>>> to reason about and it should have fewer subtle surprises.
> >>>>>
> >>>>> Disclaimer that I don't seem to see patches 3&4 on dri-devel so
> >>>>> maybe I
> >>>>> am missing something, but in principle yes, I agree that the 2nd
> >>>>> option
> >>>>> (charge the user, not exporter) should be preferred. Thing being
> >>>>> that at
> >>>>> export time there may not be any backing store allocated, plus if the
> >>>>> series is restricting the charge transfer to just Android clients then
> >>>>> it seems it has the potential to miss many other use cases. At least
> >>>>> it needs to outline how the feature will be useful outside
> >>>>> Android.
> >>>>>
> >>>> There is no restriction like that. It's available to anybody who wants
> >>>> to call dma_buf_charge_transfer if they actually have a need for that,
> >>>> which I don't really expect to be common since most users/owners of
> >>>> the buffers will be the ones causing the export in the first place.
> >>>> It's just not like that on Android with the extra allocator process in
> >>>> the middle most of the time.
> >>>
> >>> Yeah I used the wrong term "restrict", apologies. What I meant was, if
> >>> the idea was to allow spotting memory leaks, with the charge transfer
> >>> being optional and in the series only wired up for Android Binder, then
> >>> it obviously only fully works for that one case. So a step back..
> >>>
> >> Oh, spotting kernel memory leaks is a side-benefit of accounting
> >> kernel-only buffers in the root cgroup. The primary goal is to
> >> attribute buffers to applications that originated them (via
> >> per-application cgroups) simply for accounting purposes. Buffers are
> >> using memory on the system, and we want to know who created them and
> >> how much memory is used. That information is/will no longer be available
> >> with the recent deprecation of the dmabuf sysfs statistics.
> >>
> >>> .. For instance, is it not feasible to transfer the charge when dmabuf
> >>> is attached, or imported? That would attribute the usage to the
> >>> user/importer, so giving better visibility on who is actually causing the
> >>> memory leak.
> >>>
> >> Instead of accounting at export, we could account at attach. That just
> >> turns out not to be very useful when the majority of our
> >> heap-allocated buffers don't have attachments at any particular point
> >> in time. :\ But again it's less about leaks and more about knowing
> >> which buffers exist in the first place.
> >>
> >>> Furthermore, if the above is feasible, could it also be implemented in
> >>> the common layer so it would automatically cover all drivers?
> >>>
> >> Which common layer code specifically? The dmabuf interface appears to
> >> be the most central/common place to me.
> >
> > Yes, I meant dma_buf_attach / detach. More below.
> >>>>> Also stepping back for a moment - is a new memory category really
> >>>>> needed, versus perhaps attempting to charge the actual backing store
> >>>>> memory to the correct client? (There might have been many past
> >>>>> discussions on this so it's okay to point me towards something in the
> >>>>> archives.)
> >>>>>
> >>>> Well the dmabuf counter for the stat file is really just a subcategory
> >>>> of memory that is charged. Its existence is not related to getting the
> >>>> charge attributed to the right process/cgroup. We do want to know how
> >>>> much of the memory attributed to a process is for dmabufs, which is
> >>>> the main point of this series.
> >>>
> >>> Then I am probably missing something, because the statement that the proposal
> >>> is not intended to charge to the right process, but wants to know how
> >>> much dmabuf "size" is attributed to a process, confuses me due to a seeming
> >>> contradiction. And the fact that it would not be externally observable how
> >>> much of the stats is accurate and how much is not (without knowing the
> >>> implementation detail of which drivers implement charge transfer and
> >>> when). Maybe I completely misunderstood the use case.
> >>>
> >> Hmm, did I clear this up above or no? The current proposal is for the
> >> process causing the export of a buffer to be charged for it,
> >> regardless of whatever happens afterwards. (Unless that process is
> >> like gralloc on Android, in which case the charge is transferred from
> >> gralloc to whoever called gralloc to allocate the buffer on their
> >> behalf.)
> >
> > The main problem for me is that charging at export time has no relation to
> > memory used. But I am not familiar with the memcg counters to know if
> > any other counter sets that same precedent. If all others are about real
> > memory use then IMO this does not fit that well. I mean specifically this:
> >
> > + dmabuf (npn)
> > + Amount of memory used for exported DMA buffers allocated by the
> > cgroup.
> > + Stays with the allocating cgroup regardless of how the buffer
> > is shared.
> > +
> >
> > I think that "Amount of memory used for exported..." is not correct. As
> > implemented it is more akin to the virtual address space size in the cpu
> > space - it can have no relation to the actual usage since backing store
> > is not allocated until the attachment is made.
> >
> > Then also this:
> >
> > @@ -446,6 +447,8 @@ struct dma_buf {
> > struct dma_buf *dmabuf;
> > } *sysfs_entry;
> > #endif
> > + /* The cgroup to which this buffer is currently attributed */
> > + struct mem_cgroup *memcg;
> > };
> >
> > Does not conceptually fit in my mind. Dmabufs are not associated with
> > one cgroup at a time.
> >
> > So if you would place tracking into dma_buf_attach/detach you would be
> > able to charge the correct cgroup regardless of the driver, and since by
> > contract at this stage there is backing store, the reflected memory
> > usage counter would be truthful.
> >
> > But then you state a problem, that majority of the time there are no
> > attachments in your setup, and you also say the proposal is not so much
> > about leaks but more about knowing what is exported.
> >
> > In this case you could additionally track that via dma_buf_getfile /
> > dma_buf_file_release as a separate category like dmabuf-exported? But
> > again, I personally don't know if such "may not really be using memory"
> > counters fit in memcg.
> >
> > (Hm you'd probably still need dmabuf->export_memcg to store who was the
> > original caller of dma_buf_getfile, in case the last reference is dropped
> > from a different process/context. Even dmabuf->attach_memcg for
> > attach/detach to work correctly for the same reason.)
>
> Or to work around the "may not really be using memory" problem with the
> exported tracking, perhaps you could record dmabuf->export_memcg at
> dma_buf_export time, but only charge against it at dma_buf_getfile time.
> Assuming it is possible to keep references to those memcgs over the
> dmabuf lifetime without any issues.
>
I don't follow here. dma_buf_export calls dma_buf_getfile. Did you
mean dma_buf_attach / dma_buf_mmap instead of dma_buf_getfile? If so
that's an interesting idea, but I want to make sure I'm tracking
correctly.
> That way we could have dmabuf-exported and dmabuf-imported memcg
> categories which would better correlate with real memory usage. I say
> better, because I don't think it would still be perfect since individual
> drivers are allowed to hold onto the backing store post detach and that
> is invisible to dmabuf API. But that probably is a different problem.
>
Oh, that sounds... broken.
> Regards,
>
> Tvrtko
On 02/02/2023 23:43, T.J. Mercier wrote:
> On Wed, Feb 1, 2023 at 6:52 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 01/02/2023 14:23, Tvrtko Ursulin wrote:
>>>
>>> On 01/02/2023 01:49, T.J. Mercier wrote:
>>>> On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 25/01/2023 20:04, T.J. Mercier wrote:
>>>>>> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 25/01/2023 11:52, Michal Hocko wrote:
>>>>>>>> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
>>>>>>>>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
>>>>>>>>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
>>>>>>>>>>> When a buffer is exported to userspace, use memcg to attribute the
>>>>>>>>>>> buffer to the allocating cgroup until all buffer references are
>>>>>>>>>>> released.
>>>>>>>>>>
>>>>>>>>>> Is there any reason why this memory cannot be charged during the
>>>>>>>>>> allocation (__GFP_ACCOUNT used)?
>>>>>>>>>> Also you do charge and account the memory but underlying pages
>>>>>>>>>> do not
>>>>>>>>>> know about their memcg (this is normally done with commit_charge
>>>>>>>>>> for
>>>>>>>>>> user mapped pages). This would become a problem if the memory is
>>>>>>>>>> migrated for example.
>>>>>>>>>
>>>>>>>>> I don't think this is movable memory.
>>>>>>>>>
>>>>>>>>>> This also means that you have to maintain memcg
>>>>>>>>>> reference outside of the memcg proper which is not really nice
>>>>>>>>>> either.
>>>>>>>>>> This mimics tcp kmem limit implementation which I really have
>>>>>>>>>> to say I
>>>>>>>>>> am not a great fan of, and this pattern shouldn't be copied.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think we should keep the discussion on technical merits instead of
>>>>>>>>> personal preference. To me using an skmem-like interface is totally
>>>>>>>>> fine
>>>>>>>>> but the pros/cons need to be very explicit and the clear reasons to
>>>>>>>>> select that option should be included.
>>>>>>>>
>>>>>>>> I do agree with that. I didn't want to sound personal wrt tcp kmem
>>>>>>>> accounting, but the overall code maintenance cost is higher because
>>>>>>>> of how tcp's take on accounting differs from anything else in the memcg
>>>>>>>> proper. I would prefer to not grow another example like that.
>>>>>>>>
>>>>>>>>> To me there are two options:
>>>>>>>>>
>>>>>>>>> 1. Using skmem like interface as this patch series:
>>>>>>>>>
>>>>>>>>> The main pro of this option is that it is very simple. Let me
>>>>>>>>> list down
>>>>>>>>> the cons of this approach:
>>>>>>>>>
>>>>>>>>> a. There is a time window between the actual memory allocation/free
>>>>>>>>> and the charge/uncharge, and the [un]charge happens when the whole
>>>>>>>>> memory is allocated or freed. I think for the charge path that might
>>>>>>>>> not be a big issue but on the uncharge, this can cause issues. The
>>>>>>>>> application and the potential shrinkers have freed some of this
>>>>>>>>> dmabuf memory but until the whole dmabuf is freed, the memcg
>>>>>>>>> uncharge will not happen. This can have consequences on reclaim and
>>>>>>>>> oom behavior of the application.
>>>>>>>>>
>>>>>>>>> b. Due to the usage model i.e. a central daemon allocating the
>>>>>>>>> dmabuf
>>>>>>>>> memory upfront, there is a requirement to have a memcg charge
>>>>>>>>> transfer
>>>>>>>>> functionality to transfer the charge from the central daemon to the
>>>>>>>>> client applications. This does introduce complexity and avenues
>>>>>>>>> of weird
>>>>>>>>> reclaim and oom behavior.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Allocate and charge the memory on page fault by actual user
>>>>>>>>>
>>>>>>>>> In this approach, the memory is not allocated upfront by the central
>>>>>>>>> daemon but rather on the page fault by the client application, and
>>>>>>>>> the memcg charge happens at the same time.
>>>>>>>>>
>>>>>>>>> The only con I can think of is that this approach is more involved
>>>>>>>>> and may need some clever tricks to track the page on the free path,
>>>>>>>>> i.e. we need to decrement the dmabuf memcg stat on the free path.
>>>>>>>>> Maybe a page flag.
>>>>>>>>>
>>>>>>>>> The pros of this approach are that there is no need to have a charge
>>>>>>>>> transfer functionality and that the charge/uncharge is closely tied
>>>>>>>>> to the actual memory allocation and free.
>>>>>>>>>
>>>>>>>>> Personally I would prefer the second approach but I don't want to
>>>>>>>>> just block this work if the dmabuf folks are ok with the mentioned
>>>>>>>>> cons of the first approach.
>>>>>>>>
>>>>>>>> I am not familiar enough with dmabuf internals to judge complexity
>>>>>>>> on their end, but I fully agree that charge-when-used is much easier
>>>>>>>> to reason about and it should have fewer subtle surprises.
>>>>>>>
>>>>>>> Disclaimer that I don't seem to see patches 3&4 on dri-devel so
>>>>>>> maybe I
>>>>>>> am missing something, but in principle yes, I agree that the 2nd
>>>>>>> option
>>>>>>> (charge the user, not exporter) should be preferred. Thing being
>>>>>>> that at
>>>>>>> export time there may not be any backing store allocated, plus if the
>>>>>>> series is restricting the charge transfer to just Android clients then
>>>>>>> it seems it has the potential to miss many other use cases. At least
>>>>>>> it needs to outline how the feature will be useful outside
>>>>>>> Android.
>>>>>>>
>>>>>> There is no restriction like that. It's available to anybody who wants
>>>>>> to call dma_buf_charge_transfer if they actually have a need for that,
>>>>>> which I don't really expect to be common since most users/owners of
>>>>>> the buffers will be the ones causing the export in the first place.
>>>>>> It's just not like that on Android with the extra allocator process in
>>>>>> the middle most of the time.
>>>>>
>>>>> Yeah I used the wrong term "restrict", apologies. What I meant was, if
>>>>> the idea was to allow spotting memory leaks, with the charge transfer
>>>>> being optional and in the series only wired up for Android Binder, then
>>>>> it obviously only fully works for that one case. So a step back..
>>>>>
>>>> Oh, spotting kernel memory leaks is a side-benefit of accounting
>>>> kernel-only buffers in the root cgroup. The primary goal is to
>>>> attribute buffers to applications that originated them (via
>>>> per-application cgroups) simply for accounting purposes. Buffers are
>>>> using memory on the system, and we want to know who created them and
>>>> how much memory is used. That information is/will no longer be available
>>>> with the recent deprecation of the dmabuf sysfs statistics.
>>>>
>>>>> .. For instance, is it not feasible to transfer the charge when dmabuf
>>>>> is attached, or imported? That would attribute the usage to the
>>>>> user/importer, so giving better visibility on who is actually causing the
>>>>> memory leak.
>>>>>
>>>> Instead of accounting at export, we could account at attach. That just
>>>> turns out not to be very useful when the majority of our
>>>> heap-allocated buffers don't have attachments at any particular point
>>>> in time. :\ But again it's less about leaks and more about knowing
>>>> which buffers exist in the first place.
>>>>
>>>>> Furthermore, if the above is feasible, could it also be implemented in
>>>>> the common layer so it would automatically cover all drivers?
>>>>>
>>>> Which common layer code specifically? The dmabuf interface appears to
>>>> be the most central/common place to me.
>>>
>>> Yes, I meant dma_buf_attach / detach. More below.
>>>>>>> Also stepping back for a moment - is a new memory category really
>>>>>>> needed, versus perhaps attempting to charge the actual backing store
>>>>>>> memory to the correct client? (There might have been many past
>>>>>>> discussions on this so it's okay to point me towards something in the
>>>>>>> archives.)
>>>>>>>
>>>>>> Well the dmabuf counter for the stat file is really just a subcategory
>>>>>> of memory that is charged. Its existence is not related to getting the
>>>>>> charge attributed to the right process/cgroup. We do want to know how
>>>>>> much of the memory attributed to a process is for dmabufs, which is
>>>>>> the main point of this series.
>>>>>
>>>>> Then I am probably missing something, because the statement that the proposal
>>>>> is not intended to charge to the right process, but wants to know how
>>>>> much dmabuf "size" is attributed to a process, confuses me due to a seeming
>>>>> contradiction. And the fact that it would not be externally observable how
>>>>> much of the stats is accurate and how much is not (without knowing the
>>>>> implementation detail of which drivers implement charge transfer and
>>>>> when). Maybe I completely misunderstood the use case.
>>>>>
>>>> Hmm, did I clear this up above or no? The current proposal is for the
>>>> process causing the export of a buffer to be charged for it,
>>>> regardless of whatever happens afterwards. (Unless that process is
>>>> like gralloc on Android, in which case the charge is transferred from
>>>> gralloc to whoever called gralloc to allocate the buffer on their
>>>> behalf.)
>>>
>>> The main problem for me is that charging at export time has no relation to
>>> memory used. But I am not familiar with the memcg counters to know if
>>> any other counter sets that same precedent. If all others are about real
>>> memory use then IMO this does not fit that well. I mean specifically this:
>>>
>>> + dmabuf (npn)
>>> + Amount of memory used for exported DMA buffers allocated by the
>>> cgroup.
>>> + Stays with the allocating cgroup regardless of how the buffer
>>> is shared.
>>> +
>>>
>>> I think that "Amount of memory used for exported..." is not correct. As
>>> implemented it is more akin to the virtual address space size in the cpu
>>> space - it can have no relation to the actual usage since backing store
>>> is not allocated until the attachment is made.
>>>
>>> Then also this:
>>>
>>> @@ -446,6 +447,8 @@ struct dma_buf {
>>> struct dma_buf *dmabuf;
>>> } *sysfs_entry;
>>> #endif
>>> + /* The cgroup to which this buffer is currently attributed */
>>> + struct mem_cgroup *memcg;
>>> };
>>>
>>> Does not conceptually fit in my mind. Dmabufs are not associated with
>>> one cgroup at a time.
>>>
>>> So if you would place tracking into dma_buf_attach/detach you would be
>>> able to charge the correct cgroup regardless of the driver, and since by
>>> contract at this stage there is backing store, the reflected memory
>>> usage counter would be truthful.
>>>
>>> But then you state a problem, that majority of the time there are no
>>> attachments in your setup, and you also say the proposal is not so much
>>> about leaks but more about knowing what is exported.
>>>
>>> In this case you could additionally track that via dma_buf_getfile /
>>> dma_buf_file_release as a separate category like dmabuf-exported? But
>>> again, I personally don't know if such "may not really be using memory"
>>> counters fit in memcg.
>>>
>>> (Hm you'd probably still need dmabuf->export_memcg to store who was the
>>> original caller of dma_buf_getfile, in case the last reference is dropped
>>> from a different process/context. Even dmabuf->attach_memcg for
>>> attach/detach to work correctly for the same reason.)
>>
>> Or to work around the "may not really be using memory" problem with the
>> exported tracking, perhaps you could record dmabuf->export_memcg at
>> dma_buf_export time, but only charge against it at dma_buf_getfile time.
>> Assuming it is possible to keep references to those memcgs over the
>> dmabuf lifetime without any issues.
>>
> I don't follow here. dma_buf_export calls dma_buf_getfile. Did you
> mean dma_buf_attach / dma_buf_mmap instead of dma_buf_getfile? If so
> that's an interesting idea, but want to make sure I'm tracking
> correctly.
Yes sorry, I confused the two sides when typing.
Exported lifetime: dma_buf_getfile to dma_buf_file_release.
Imported lifetime: dma_buf_attach to dma_buf_detach.
There can be multiple attachments though, so if you want to track imported
size the importer memcg would probably need to be stored in struct
dma_buf_attachment.
And the exported size would only need to be charged once, when the first
importer attaches.
I am not sure whether cgroup migrations would automatically be handled
or not if you permanently store memcg pointers in the respective dmabuf
structures.
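Purely as a sketch of that - the memcg field in dma_buf_attachment and
both charge helpers are made up, and locking of the attachments list is
hand-waved away:

static int dma_buf_account_attach(struct dma_buf *dmabuf,
				  struct dma_buf_attachment *attach)
{
	unsigned int nr_pages = PAGE_ALIGN(dmabuf->size) >> PAGE_SHIFT;

	/* Hypothetical new field: the importer owns this charge. */
	attach->memcg = get_mem_cgroup_from_mm(current->mm);
	mem_cgroup_charge_dmabuf_imported(attach->memcg, nr_pages);

	/* Charge the exported size once, when the first importer attaches. */
	if (list_is_singular(&dmabuf->attachments))
		mem_cgroup_charge_dmabuf_exported(dmabuf->export_memcg,
						  nr_pages);
	return 0;
}

dma_buf_detach would do the reverse: always uncharge the importer, and
uncharge the exporter only when the last attachment goes away.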
>> That way we could have dmabuf-exported and dmabuf-imported memcg
>> categories which would better correlate with real memory usage. I say
>> better, because I don't think it would still be perfect since individual
>> drivers are allowed to hold onto the backing store post detach and that
>> is invisible to dmabuf API. But that probably is a different problem.
>>
> Oh, that sounds... broken.
Not broken in general, but definitely an asterisk on the dmabuf charging
semantics. Unless it is completely incompatible with anything to be
tracked under memcg?
Regards,
Tvrtko
On 02/02/2023 23:43, T.J. Mercier wrote:
> On Wed, Feb 1, 2023 at 6:23 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> On 01/02/2023 01:49, T.J. Mercier wrote:
>>> On Tue, Jan 31, 2023 at 6:01 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> On 25/01/2023 20:04, T.J. Mercier wrote:
>>>>> On Wed, Jan 25, 2023 at 9:31 AM Tvrtko Ursulin
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 25/01/2023 11:52, Michal Hocko wrote:
>>>>>>> On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
>>>>>>>> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
>>>>>>>>> On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
>>>>>>>>>> When a buffer is exported to userspace, use memcg to attribute the
>>>>>>>>>> buffer to the allocating cgroup until all buffer references are
>>>>>>>>>> released.
>>>>>>>>>
>>>>>>>>> Is there any reason why this memory cannot be charged during the
>>>>>>>>> allocation (__GFP_ACCOUNT used)?
>>>>>>>>> Also you do charge and account the memory but underlying pages do not
>>>>>>>>> know about their memcg (this is normally done with commit_charge for
>>>>>>>>> user mapped pages). This would become a problem if the memory is
>>>>>>>>> migrated for example.
>>>>>>>>
>>>>>>>> I don't think this is movable memory.
>>>>>>>>
>>>>>>>>> This also means that you have to maintain memcg
>>>>>>>>> reference outside of the memcg proper which is not really nice either.
>>>>>>>>> This mimics tcp kmem limit implementation which I really have to say I
>>>>>>>>> am not a great fan of, and this pattern shouldn't be copied.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think we should keep the discussion on technical merits instead of
>>>>>>>> personal preference. To me using an skmem-like interface is totally fine
>>>>>>>> but the pros/cons need to be very explicit and the clear reasons to
>>>>>>>> select that option should be included.
>>>>>>>
>>>>>>> I do agree with that. I didn't want to sound personal wrt tcp kmem
>>>>>>> accounting, but the overall code maintenance cost is higher because
>>>>>>> of how tcp's take on accounting differs from anything else in the memcg
>>>>>>> proper. I would prefer to not grow another example like that.
>>>>>>>
>>>>>>>> To me there are two options:
>>>>>>>>
>>>>>>>> 1. Using skmem like interface as this patch series:
>>>>>>>>
>>>>>>>> The main pro of this option is that it is very simple. Let me list down
>>>>>>>> the cons of this approach:
>>>>>>>>
>>>>>>>> a. There is a time window between the actual memory allocation/free and
>>>>>>>> the charge/uncharge, and the [un]charge happens when the whole memory is
>>>>>>>> allocated or freed. I think for the charge path that might not be a big
>>>>>>>> issue but on the uncharge, this can cause issues. The application and
>>>>>>>> the potential shrinkers have freed some of this dmabuf memory but until
>>>>>>>> the whole dmabuf is freed, the memcg uncharge will not happen. This can
>>>>>>>> have consequences on reclaim and oom behavior of the application.
>>>>>>>>
>>>>>>>> b. Due to the usage model i.e. a central daemon allocating the dmabuf
>>>>>>>> memory upfront, there is a requirement to have a memcg charge transfer
>>>>>>>> functionality to transfer the charge from the central daemon to the
>>>>>>>> client applications. This does introduce complexity and avenues of weird
>>>>>>>> reclaim and oom behavior.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. Allocate and charge the memory on page fault by actual user
>>>>>>>>
>>>>>>>> In this approach, the memory is not allocated upfront by the central
>>>>>>>> daemon but rather on the page fault by the client application and the
>>>>>>>> memcg charge happens at the same time.
>>>>>>>>
>>>>>>>> The only con I can think of is that this approach is more involved and may
>>>>>>>> need some clever tricks to track the page on the free path, i.e. we need to
>>>>>>>> decrement the dmabuf memcg stat on the free path. Maybe a page flag.
>>>>>>>>
>>>>>>>> The pros of this approach are that there is no need to have a charge transfer
>>>>>>>> functionality and that the charge/uncharge is closely tied to the actual
>>>>>>>> memory allocation and free.
>>>>>>>>
>>>>>>>> Personally I would prefer the second approach but I don't want to just
>>>>>>>> block this work if the dmabuf folks are ok with the mentioned cons of
>>>>>>>> the first approach.
>>>>>>>
>>>>>>> I am not familiar enough with dmabuf internals to judge complexity on their
>>>>>>> end, but I fully agree that charge-when-used is much easier to reason
>>>>>>> about and it should have fewer subtle surprises.
>>>>>>
>>>>>> Disclaimer that I don't seem to see patches 3&4 on dri-devel so maybe I
>>>>>> am missing something, but in principle yes, I agree that the 2nd option
>>>>>> (charge the user, not exporter) should be preferred. Thing being that at
>>>>>> export time there may not be any backing store allocated, plus if the
>>>>>> series is restricting the charge transfer to just Android clients then
>>>>>> it seems it has the potential to miss many other use cases. At least
>>>>>> it needs to outline how the feature will be useful outside
>>>>>> Android.
>>>>>>
>>>>> There is no restriction like that. It's available to anybody who wants
>>>>> to call dma_buf_charge_transfer if they actually have a need for that,
>>>>> which I don't really expect to be common since most users/owners of
>>>>> the buffers will be the ones causing the export in the first place.
>>>>> It's just not like that on Android with the extra allocator process in
>>>>> the middle most of the time.
>>>>
>>>> Yeah I used the wrong term "restrict", apologies. What I meant was, if
>>>> the idea was to allow spotting memory leaks, with the charge transfer
>>>> being optional and in the series only wired up for Android Binder, then
>>>> it obviously only fully works for that one case. So a step back..
>>>>
>>> Oh, spotting kernel memory leaks is a side-benefit of accounting
>>> kernel-only buffers in the root cgroup. The primary goal is to
>>> attribute buffers to applications that originated them (via
>>> per-application cgroups) simply for accounting purposes. Buffers are
>>> using memory on the system, and we want to know who created them and
>>> how much memory is used. That information is/will no longer be available
>>> with the recent deprecation of the dmabuf sysfs statistics.
>>>
>>>> .. For instance, is it not feasible to transfer the charge when dmabuf
>>>> is attached, or imported? That would attribute the usage to the
>>>> user/importer, so giving better visibility on who is actually causing the
>>>> memory leak.
>>>>
>>> Instead of accounting at export, we could account at attach. That just
>>> turns out not to be very useful when the majority of our
>>> heap-allocated buffers don't have attachments at any particular point
>>> in time. :\ But again it's less about leaks and more about knowing
>>> which buffers exist in the first place.
>>>
>>>> Furthermore, if the above is feasible, could it also be implemented in
>>>> the common layer so it would automatically cover all drivers?
>>>>
>>> Which common layer code specifically? The dmabuf interface appears to
>>> be the most central/common place to me.
>>
>> Yes, I meant dma_buf_attach / detach. More below.
>>>>>> Also stepping back for a moment - is a new memory category really
>>>>>> needed, versus perhaps attempting to charge the actual backing store
>>>>>> memory to the correct client? (There might have been many past
>>>>>> discussions on this so it's okay to point me towards something in the
>>>>>> archives.)
>>>>>>
>>>>> Well the dmabuf counter for the stat file is really just a subcategory
>>>>> of memory that is charged. Its existence is not related to getting the
>>>>> charge attributed to the right process/cgroup. We do want to know how
>>>>> much of the memory attributed to a process is for dmabufs, which is
>>>>> the main point of this series.
>>>>
>>>> Then I am probably missing something, because the statement that the proposal
>>>> is not intended to charge to the right process, but wants to know how
>>>> much dmabuf "size" is attributed to a process, confuses me due to a seeming
>>>> contradiction. And the fact that it would not be externally observable how
>>>> much of the stats is accurate and how much is not (without knowing the
>>>> implementation detail of which drivers implement charge transfer and
>>>> when). Maybe I completely misunderstood the use case.
>>>>
>>> Hmm, did I clear this up above or no? The current proposal is for the
>>> process causing the export of a buffer to be charged for it,
>>> regardless of whatever happens afterwards. (Unless that process is
>>> like gralloc on Android, in which case the charge is transferred from
>>> gralloc to whoever called gralloc to allocate the buffer on their
>>> behalf.)
>>
>> The main problem for me is that charging at export time has no relation to memory used. But I am not familiar with the memcg counters to know if any other counter sets that same precedent. If all others are about real memory use then IMO this does not fit that well. I mean specifically this:
>>
>> + dmabuf (npn)
>> + Amount of memory used for exported DMA buffers allocated by the cgroup.
>> + Stays with the allocating cgroup regardless of how the buffer is shared.
>> +
>>
>> I think that "Amount of memory used for exported..." is not correct. As implemented it is more akin to the virtual address space size in the cpu space - it can have no relation to the actual usage since backing store is not allocated until the attachment is made.
>>
>> Then also this:
>>
>> @@ -446,6 +447,8 @@ struct dma_buf {
>> struct dma_buf *dmabuf;
>> } *sysfs_entry;
>> #endif
>> + /* The cgroup to which this buffer is currently attributed */
>> + struct mem_cgroup *memcg;
>> };
>>
>> Does not conceptually fit in my mind. Dmabufs are not associated with one cgroup at a time.
>>
> It's true that a dmabuf could be shared among processes in different
> cgroups, but this refers to the one that's charged for it. Similar to
> how the shmem pages that back memfds (which can be similarly shared) get
> charged to the first cgroup that touches each page, here it's the
> entire buffer instead of each individual page. Maybe it'd be possible
> to charge whoever attaches / maps first, but I have to point out
> there'd be a gap between then and export where we'd have no accounting
> of the memory for cases where pages actually do get allocated during
> export (like in the system_heap).
Okay, I wasn't familiar with heaps until now - indeed - allocating a dma
buf from there is allocation and export in one, with no delayed/lazy
anything on either edge. Therefore charging at export works there.
One option - rename the proposed memcg category to make it clear it is
only for dma buf heaps?
But does it not create double accounting, btw? Since there are both the
page/cma allocations that would be tracked and the new dma buf category.
Another option would be to allow each "backend" to specify whether the
charge needs to happen at export or at import time, to be more accurate?
(Like a flag for dma_buf_export_info maybe.)
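For example, something like this on the exporter side - the charge_mode
field and its values are invented here for illustration, the rest is the
existing dma_buf_export_info usage:

	/* In a heap's allocate path: */
	struct dma_buf *dmabuf;
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops = &my_heap_buf_ops;	/* assumed exporter ops */
	exp_info.size = len;
	exp_info.flags = O_RDWR | O_CLOEXEC;
	exp_info.priv = buffer;
	/* Hypothetical new field: a heap has backing store at export
	 * time, so ask the core to charge the memcg right away. */
	exp_info.charge_mode = DMA_BUF_CHARGE_AT_EXPORT;

	dmabuf = dma_buf_export(&exp_info);

A lazily-backed exporter would pass DMA_BUF_CHARGE_AT_IMPORT instead and
the core would defer the charge until first attach.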
Regards,
Tvrtko
>> So if you would place tracking into dma_buf_attach/detach you would be able to charge the correct cgroup regardless of the driver, and since by contract at this stage there is backing store, the reflected memory usage counter would be truthful.
>>
>> But then you state a problem, that majority of the time there are no attachments in your setup, and you also say the proposal is not so much about leaks but more about knowing what is exported.
>>
>> In this case you could additionally track that via dma_buf_getfile / dma_buf_file_release as a separate category like dmabuf-exported? But again, I personally don't know if such "may not really be using memory" counters fit in memcg.
>>
>> (Hm you'd probably still need dmabuf->export_memcg to store who was the original caller of dma_buf_getfile, in case the last reference is dropped from a different process/context. Even dmabuf->attach_memcg for attach/detach to work correctly for the same reason.)
>>
>> Regards,
>>
>> Tvrtko