2021-01-27 22:08:25

by Kalesh Singh

Subject: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

In order to measure how much memory a process actually consumes, it is
necessary to include the DMA buffer sizes for that process in the memory
accounting. Since the handles to DMA buffers are raw FDs, it is important
to be able to identify which processes have FD references to a DMA buffer.

Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both of which are only root readable, as follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
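As a sketch, the four steps above can be combined into a small shell
helper (hypothetical; assumes the "size:" field that dma-buf exports in
fdinfo, and must run as root or as the owner of the target process):

```shell
#!/bin/sh
# list_dmabufs <pid>: print "fd inode size" for each dmabuf FD of a
# process. Sketch of the fdinfo-based accounting described above.
list_dmabufs() {
    pid="${1:-self}"
    for fd in /proc/"$pid"/fd/*; do
        # Steps 1-2: a dmabuf FD's readlink target begins with "/dmabuf"
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
        /dmabuf*)
            # Step 3: stat the file to get the dmabuf inode number
            ino=$(stat -Lc %i "$fd")
            # Step 4: fdinfo carries a "size:" line with the buffer size
            size=$(awk '/^size:/ {print $2}' "/proc/$pid/fdinfo/${fd##*/}")
            printf '%s %s %s\n' "${fd##*/}" "$ino" "$size"
            ;;
        esac
    done
}

list_dmabufs self
```

Each FD costs a readlink, a stat and an fdinfo read, which is where the
syscall overhead discussed below comes from.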

Android captures per-process system memory state when certain low memory
events (e.g. a foreground app kill) occur, to identify potential memory
hoggers. To include a process’s dmabuf usage as part of its memory state,
the data collection needs to be fast enough to reflect the memory state at
the time of such events.

Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
privileges, this approach is not suitable for production builds. Granting
root privileges even to a system process increases the attack surface and
is highly undesirable. Additionally, this approach is slow, as it requires many
context switches for searching and getting the dma-buf info.

With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
details can be queried using their unique inode numbers.

This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.

/proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
every DMA buffer FD that the task has. Entries with the same inode
number can appear more than once, indicating the total FD references
for the associated DMA buffer.

If a thread shares the same files as the group leader then its
dmabuf_fds file will be empty, as these dmabufs are reported by the
group leader.

The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
and allows the efficient accounting of per-process DMA buffer usage without
requiring root privileges. (See data below)
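Under those rules, an unprivileged userspace collector could aggregate a
process's dmabuf usage with only procfs and sysfs reads. The helper below
is a hypothetical sketch: it assumes this patch plus the per-buffer sysfs
stats from [1] (/sys/kernel/dmabuf/buffers/<inode>/size), and prints the
total in kB. On kernels without the patch the per-task files are absent
and it reports 0.

```shell
#!/bin/sh
# total_dmabuf_kb <pid>: sum the sizes of all DMA buffers referenced by
# FDs in any thread of <pid>, using the proposed dmabuf_fds interface.
total_dmabuf_kb() {
    pid="${1:-self}"
    total=0
    for tid_dir in /proc/"$pid"/task/*; do
        # Threads sharing the leader's files expose an empty file, so
        # shared buffers are never counted twice.
        [ -r "$tid_dir/dmabuf_fds" ] || continue
        for ino in $(cat "$tid_dir/dmabuf_fds"); do
            # One sysfs read per FD reference resolves the buffer size.
            size=$(cat "/sys/kernel/dmabuf/buffers/$ino/size" 2>/dev/null) || continue
            total=$((total + size))
        done
    done
    echo $((total / 1024))
}

total_dmabuf_kb self
```

Repeated inode numbers in dmabuf_fds deliberately add the same buffer
once per FD reference, mirroring what the fdinfo-based walk would report.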

Performance Comparison:
-----------------------

The following data compares the time to capture the sizes of all DMA
buffers referenced by FDs for all processes on an arm64 android device.

-----------------------------------------------------------
|                     | Core 0 (Little) | Core 7 (Big)    |
-----------------------------------------------------------
| From <pid>/fdinfo   |     318 ms      |     145 ms      |
-----------------------------------------------------------
| Inodes from         |     114 ms      |      27 ms      |
| dmabuf_fds;         |    (2.8x ^)     |    (5.4x ^)     |
| data from sysfs     |                 |                 |
-----------------------------------------------------------

It can be inferred that in the worst case there is a 2.8x speedup for
determining per-process DMA buffer FD sizes, when using the proposed
interfaces.

[1] https://lore.kernel.org/dri-devel/[email protected]/

Signed-off-by: Kalesh Singh <[email protected]>
---
Documentation/filesystems/proc.rst | 30 ++++++
drivers/dma-buf/dma-buf.c | 7 +-
fs/proc/Makefile | 1 +
fs/proc/base.c | 1 +
fs/proc/dma_bufs.c | 159 +++++++++++++++++++++++++++++
fs/proc/internal.h | 1 +
include/linux/dma-buf.h | 5 +
7 files changed, 198 insertions(+), 6 deletions(-)
create mode 100644 fs/proc/dma_bufs.c

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..757dd47ab679 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <[email protected]> June 9 2009
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
+ 3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD

4 Configuring procfs
4.1 Mount options
@@ -2131,6 +2132,35 @@ AVX512_elapsed_ms
the task is unlikely an AVX512 user, but depends on the workload and the
scheduling scenario, it also could be a false negative mentioned above.

+3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
+-------------------------------------------------------------------------
+This file exposes a list of the inode numbers for every DMA buffer
+FD that the task has.
+
+The same inode number can appear more than once, indicating the total
+FD references for the associated DMA buffer.
+
+The inode number can be used to lookup the DMA buffer information in
+the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/.
+
+Example Output
+~~~~~~~~~~~~~~
+$ cat /proc/612/task/612/dmabuf_fds
+30972 30973 45678 49326
+
+Permission to access this file is governed by a ptrace access mode
+PTRACE_MODE_READ_FSCREDS.
+
+Threads can have different files when created without specifying
+the CLONE_FILES flag. For this reason the interface is presented as
+/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds.
+This simplifies kernel code and aggregation can be handled in
+userspace.
+
+If a thread has the same files as its group leader, then its dmabuf_fds
+file will be empty as these dmabufs are already reported by the
+group leader.
+
Chapter 4: Configuring procfs
=============================

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 9ad6397aaa97..0660c06be4c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -29,8 +29,6 @@
#include <uapi/linux/dma-buf.h>
#include <uapi/linux/magic.h>

-static inline int is_dma_buf_file(struct file *);
-
struct dma_buf_list {
struct list_head head;
struct mutex lock;
@@ -434,10 +432,7 @@ static const struct file_operations dma_buf_fops = {
.show_fdinfo = dma_buf_show_fdinfo,
};

-/*
- * is_dma_buf_file - Check if struct file* is associated with dma_buf
- */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
{
return file->f_op == &dma_buf_fops;
}
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index bd08616ed8ba..91a67f43ddf4 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -16,6 +16,7 @@ proc-y += cmdline.o
proc-y += consoles.o
proc-y += cpuinfo.o
proc-y += devices.o
+proc-y += dma_bufs.o
proc-y += interrupts.o
proc-y += loadavg.o
proc-y += meminfo.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..af15a60b9831 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3598,6 +3598,7 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_SECCOMP_CACHE_DEBUG
ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
#endif
+ REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations),
};

static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c
new file mode 100644
index 000000000000..46ea9cf968ed
--- /dev/null
+++ b/fs/proc/dma_bufs.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-process DMA-BUF Stats
+ *
+ * Copyright (C) 2021 Google LLC.
+ */
+
+#include <linux/dma-buf.h>
+#include <linux/fdtable.h>
+#include <linux/ptrace.h>
+#include <linux/seq_file.h>
+
+#include "internal.h"
+
+struct dmabuf_fds_private {
+ struct inode *inode;
+ struct task_struct *task;
+ struct file *dmabuf_file;
+};
+
+static loff_t *next_dmabuf(struct dmabuf_fds_private *priv,
+ loff_t *pos)
+{
+ struct fdtable *fdt;
+ struct file *file;
+
+ rcu_read_lock();
+ fdt = files_fdtable(priv->task->files);
+ for (; *pos < fdt->max_fds; ++*pos) {
+ file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos);
+ if (file && is_dma_buf_file(file) && get_file_rcu(file)) {
+ priv->dmabuf_file = file;
+ break;
+ }
+ }
+ if (*pos >= fdt->max_fds)
+ pos = NULL;
+ rcu_read_unlock();
+
+ return pos;
+}
+
+static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos)
+{
+ struct dmabuf_fds_private *priv = s->private;
+ struct files_struct *group_leader_files;
+
+ priv->task = get_proc_task(priv->inode);
+
+ if (!priv->task)
+ return ERR_PTR(-ESRCH);
+
+ /* Hold task lock for duration that files need to be stable */
+ task_lock(priv->task);
+
+ /*
+ * If this task is not the group leader but shares the same files, leave file empty.
+ * These dmabufs are already reported in the group leader's dmabuf_fds.
+ */
+ group_leader_files = priv->task->group_leader->files;
+ if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) {
+ task_unlock(priv->task);
+ put_task_struct(priv->task);
+ priv->task = NULL;
+ return NULL;
+ }
+
+ return next_dmabuf(priv, pos);
+}
+
+static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
+{
+ ++*pos;
+ return next_dmabuf(s->private, pos);
+}
+
+static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
+{
+ struct dmabuf_fds_private *priv = s->private;
+
+ if (priv->task) {
+ task_unlock(priv->task);
+ put_task_struct(priv->task);
+
+ }
+ if (priv->dmabuf_file)
+ fput(priv->dmabuf_file);
+}
+
+static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
+{
+ struct dmabuf_fds_private *priv = s->private;
+ struct file *file = priv->dmabuf_file;
+ struct dma_buf *dmabuf = file->private_data;
+
+ if (!dmabuf)
+ return -ESRCH;
+
+ seq_printf(s, "%8lu ", file_inode(file)->i_ino);
+
+ fput(priv->dmabuf_file);
+ priv->dmabuf_file = NULL;
+
+ return 0;
+}
+
+static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
+ .start = dmabuf_fds_seq_start,
+ .next = dmabuf_fds_seq_next,
+ .stop = dmabuf_fds_seq_stop,
+ .show = dmabuf_fds_seq_show
+};
+
+static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
+ const struct seq_operations *ops)
+{
+ struct dmabuf_fds_private *priv;
+ struct task_struct *task;
+ bool allowed = false;
+
+ task = get_proc_task(inode);
+ if (!task)
+ return -ESRCH;
+
+ allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+ put_task_struct(task);
+
+ if (!allowed)
+ return -EACCES;
+
+ priv = __seq_open_private(file, ops, sizeof(*priv));
+ if (!priv)
+ return -ENOMEM;
+
+ priv->inode = inode;
+ priv->task = NULL;
+ priv->dmabuf_file = NULL;
+
+ return 0;
+}
+
+static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
+{
+ return seq_release_private(inode, file);
+}
+
+static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
+{
+ return proc_dmabuf_fds_open(inode, file,
+ &proc_tid_dmabuf_fds_seq_ops);
+}
+
+const struct file_operations proc_tid_dmabuf_fds_operations = {
+ .open = tid_dmabuf_fds_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = proc_dmabuf_fds_release,
+};
+
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index f60b379dcdc7..4ca74220db9c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -303,6 +303,7 @@ extern const struct file_operations proc_pid_smaps_operations;
extern const struct file_operations proc_pid_smaps_rollup_operations;
extern const struct file_operations proc_clear_refs_operations;
extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_tid_dmabuf_fds_operations;

extern unsigned long task_vsize(struct mm_struct *);
extern unsigned long task_statm(struct mm_struct *,
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index cf72699cb2bc..087e11f7f193 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -27,6 +27,11 @@ struct device;
struct dma_buf;
struct dma_buf_attachment;

+/**
+ * Check if struct file* is associated with dma_buf.
+ */
+int is_dma_buf_file(struct file *file);
+
/**
* struct dma_buf_ops - operations possible on struct dma_buf
* @vmap: [optional] creates a virtual mapping for the buffer into kernel
--
2.30.0.280.ga3ce27912f-goog


2021-01-27 22:13:18

by Jann Horn

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

+jeffv from Android

On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <[email protected]> wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handle to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.

Or you could try to let the DMA buffer take a reference on the
mm_struct and account its size into the mm_struct? That would probably
be nicer to work with than having to poke around in procfs separately
for DMA buffers.

> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:

That's not quite right. They can both also be accessed by the user
owning the process. Also, fdinfo is a standard interface for
inspecting process state that doesn't permit reading process memory or
manipulating process state - so I think it would be fine to permit
access to fdinfo under a PTRACE_MODE_READ_FSCREDS check, just like the
interface you're suggesting.

> 1. Do a readlink on each FD.
> 2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
> 3. stat the file to get the dmabuf inode number.
> 4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
>
> Android captures per-process system memory state when certain low memory
> events (e.g a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
>
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds.

It should be easy to add enough information to /proc/<pid>/fdinfo/ so
that you don't need to look at /proc/<pid>/fd/ anymore.

> Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally this is slow as it requires many
> context switches for searching and getting the dma-buf info.

What do you mean by "context switches"? Task switches or kernel/user
transitions (e.g. via syscall)?

> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
>
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
>
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task has. Entries with the same inode
> number can appear more than once, indicating the total FD references
> for the associated DMA buffer.
>
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
>
> The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)

I'm not convinced that introducing a new procfs file for this is the
right way to go. And the idea of having to poke into multiple
different files in procfs and in sysfs just to be able to compute a
proper memory usage score for a process seems weird to me. "How much
memory is this process using" seems like the kind of question the
kernel ought to be able to answer (and the kernel needs to be able to
answer somewhat accurately so that its own OOM killer can do its job
properly)?

2021-01-27 22:18:10

by Michal Hocko

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On Wed 27-01-21 11:47:29, Jann Horn wrote:
> +jeffv from Android
>
> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <[email protected]> wrote:
> > In order to measure how much memory a process actually consumes, it is
> > necessary to include the DMA buffer sizes for that process in the memory
> > accounting. Since the handle to DMA buffers are raw FDs, it is important
> > to be able to identify which processes have FD references to a DMA buffer.
>
> Or you could try to let the DMA buffer take a reference on the
> mm_struct and account its size into the mm_struct? That would probably
> be nicer to work with than having to poke around in procfs separately
> for DMA buffers.

Yes that would make some sense to me as well but how do you know that
the process actually uses a buffer? If it mmaps it then you already have
that information via /proc/<pid>/maps. My understanding of dma-buf is
really coarse but my impression is that you can consume the memory via
standard read syscall as well. How would you account for that?

[...]
Skipping over a large part of your response but I do agree that the
interface is really elaborate to drill down to the information.

> I'm not convinced that introducing a new procfs file for this is the
> right way to go. And the idea of having to poke into multiple
> different files in procfs and in sysfs just to be able to compute a
> proper memory usage score for a process seems weird to me. "How much
> memory is this process using" seems like the kind of question the
> kernel ought to be able to answer (and the kernel needs to be able to
> answer somewhat accurately so that its own OOM killer can do its job
> properly)?

Well, shared buffers are tricky but it is true that we already consider
shmem in badness so this wouldn't go out of line. Kernel oom killer
could be more clever with these special fds though and query for buffer
size directly.
--
Michal Hocko
SUSE Labs

2021-01-27 23:50:24

by Christian König

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On 27.01.21 at 11:57, Michal Hocko wrote:
> On Wed 27-01-21 11:47:29, Jann Horn wrote:
>> +jeffv from Android
>>
>> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <[email protected]> wrote:
>>> In order to measure how much memory a process actually consumes, it is
>>> necessary to include the DMA buffer sizes for that process in the memory
>>> accounting. Since the handle to DMA buffers are raw FDs, it is important
>>> to be able to identify which processes have FD references to a DMA buffer.
>> Or you could try to let the DMA buffer take a reference on the
>> mm_struct and account its size into the mm_struct? That would probably
>> be nicer to work with than having to poke around in procfs separately
>> for DMA buffers.
> Yes that would make some sense to me as well but how do you know that
> the process actually uses a buffer? If it mmaps it then you already have
> that information via /proc/<pid>/maps. My understanding of dma-buf is
> really coarse but my impression is that you can consume the memory via
> standard read syscall as well. How would you account for that.

Somewhat correct. This interface here really doesn't make sense, since
the file descriptor representation of DMA-buf is only meant for
short-term usage.

E.g. the idea is that you can export a DMA-buf fd from your device
driver, transfer that to another process and then import it again into a
device driver.

Keeping a long term reference to a DMA-buf fd sounds like a design bug
in userspace to me.

> [...]
> Skipping over a large part of your response but I do agree that the
> interface is really elaborate to drill down to the information.
>
>> I'm not convinced that introducing a new procfs file for this is the
>> right way to go. And the idea of having to poke into multiple
>> different files in procfs and in sysfs just to be able to compute a
>> proper memory usage score for a process seems weird to me. "How much
>> memory is this process using" seems like the kind of question the
>> kernel ought to be able to answer (and the kernel needs to be able to
>> answer somewhat accurately so that its own OOM killer can do its job
>> properly)?
> Well, shared buffers are tricky but it is true that we already consider
> shmem in badness so this wouldn't go out of line. Kernel oom killer
> could be more clever with these special fds though and query for buffer
> size directly.

Some years ago I proposed a change to improve the OOM killer to take
into account how much memory is referenced through special file
descriptors like device drivers or DMA-buf.

But that never went anywhere because of concerns that it would add too
much overhead to the OOM killer.

Regards,
Christian.

2021-01-27 23:52:34

by Michal Hocko

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On Wed 27-01-21 12:01:55, Christian König wrote:
[...]
> Some years ago I proposed a change to improve the OOM killer to take
> into account how much memory is referenced through special file descriptors
> like device drivers or DMA-buf.
>
> But that never went anywhere because of concerns that it would add too much
> overhead to the OOM killer.

I have to admit I do not remember such a discussion. There were many on
the related topic, concerns were mostly about a proper accounting of
shared resources or locking required to gain the information. I do not
remember overhead as being the main concern as oom is a real cold path.

Anyway, if you want to revamp those patches then feel free to CC me and
we can discuss further. I do not want to hijack this thread by an
unrelated topic.
--
Michal Hocko
SUSE Labs

2021-01-28 00:09:35

by Kalesh Singh

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On Wed, Jan 27, 2021 at 5:47 AM Jann Horn <[email protected]> wrote:
>
> +jeffv from Android
>
> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <[email protected]> wrote:
> > In order to measure how much memory a process actually consumes, it is
> > necessary to include the DMA buffer sizes for that process in the memory
> > accounting. Since the handle to DMA buffers are raw FDs, it is important
> > to be able to identify which processes have FD references to a DMA buffer.
>
> Or you could try to let the DMA buffer take a reference on the
> mm_struct and account its size into the mm_struct? That would probably
> be nicer to work with than having to poke around in procfs separately
> for DMA buffers.
>
> > Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> > /proc/<pid>/fdinfo -- both of which are only root readable, as follows:
>
> That's not quite right. They can both also be accessed by the user
> owning the process. Also, fdinfo is a standard interface for
> inspecting process state that doesn't permit reading process memory or
> manipulating process state - so I think it would be fine to permit
> access to fdinfo under a PTRACE_MODE_READ_FSCREDS check, just like the
> interface you're suggesting.


Hi everyone. Thank you for the feedback.

I understand there is a deeper problem of accounting for shared memory
in the kernel that is not specific to DMA buffers. In the case of DMA
buffers, I think Jann's proposal is the cleanest way to attribute the
shared buffers to processes. I can respin a patch modifying fdinfo as
suggested, if this is not an issue from a security perspective.

Thanks,
Kalesh

>
>
> > 1. Do a readlink on each FD.
> > 2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
> > 3. stat the file to get the dmabuf inode number.
> > 4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
> >
> > Android captures per-process system memory state when certain low memory
> > events (e.g a foreground app kill) occur, to identify potential memory
> > hoggers. To include a process’s dmabuf usage as part of its memory state,
> > the data collection needs to be fast enough to reflect the memory state at
> > the time of such events.
> >
> > Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> > privileges, this approach is not suitable for production builds.
>
> It should be easy to add enough information to /proc/<pid>/fdinfo/ so
> that you don't need to look at /proc/<pid>/fd/ anymore.
>
> > Granting
> > root privileges even to a system process increases the attack surface and
> > is highly undesirable. Additionally this is slow as it requires many
> > context switches for searching and getting the dma-buf info.
>
> What do you mean by "context switches"? Task switches or kernel/user
> transitions (e.g. via syscall)?
>
> > With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> > details can be queried using their unique inode numbers.
> >
> > This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
> >
> > /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> > every DMA buffer FD that the task has. Entries with the same inode
> > number can appear more than once, indicating the total FD references
> > for the associated DMA buffer.
> >
> > If a thread shares the same files as the group leader then its
> > dmabuf_fds file will be empty, as these dmabufs are reported by the
> > group leader.
> >
> > The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> > and allows the efficient accounting of per-process DMA buffer usage without
> > requiring root privileges. (See data below)
>
> I'm not convinced that introducing a new procfs file for this is the
> right way to go. And the idea of having to poke into multiple
> different files in procfs and in sysfs just to be able to compute a
> proper memory usage score for a process seems weird to me. "How much
> memory is this process using" seems like the kind of question the
> kernel ought to be able to answer (and the kernel needs to be able to
> answer somewhat accurately so that its own OOM killer can do its job
> properly)?

2021-01-28 10:06:56

by Pekka Paalanen

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On Wed, 27 Jan 2021 12:01:55 +0100
Christian König <[email protected]> wrote:

> Somewhat correct. This interface here really doesn't make sense since
> the file descriptor representation of DMA-buf is only meant to be used
> for short term usage.
>
> E.g. the idea is that you can export a DMA-buf fd from your device
> driver, transfer that to another process and then import it again into a
> device driver.
>
> Keeping a long term reference to a DMA-buf fd sounds like a design bug
> in userspace to me.

Except keeping the fd is exactly what userspace must do if it wishes to
re-use the buffer without passing a new fd over IPC again. Particularly
Wayland compositors need to keep the client buffer dmabuf fd open after
receiving it, so that they can re-import it to EGL to ensure updated
contents are correctly flushed as EGL has no other API for it.

That is my vague understanding, and what Weston implements. You can say
it's a bad userspace API design in EGL, but what else can we do?

However, in the particular case of Wayland, the shared dmabufs should
be accounted to the Wayland client process. OOM-killing the client
process will eventually free the dmabuf, along with the Wayland server's
references to it. Killing the Wayland server (compositor, display
server) OTOH is something that should not be done as long as there are
e.g. Wayland clients to be killed.

Unfortunately(?), Wayland clients do not have a reason to keep the
dmabuf fd open themselves, so they probably close it as soon as it has
been sent to a display server. So the process that should be OOM-killed
does not have an open fd for the dmabuf (but probably has something
else, but not an mmap for CPU). Buffer re-use in Wayland does not
require re-sending the dmabuf fd over IPC.

(In general, dmabufs are never mmapped for CPU. They are accessed by
devices.)


Thanks,
pq



2021-02-03 10:25:32

by Daniel Vetter

Subject: Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

On Fri, Jan 29, 2021 at 03:22:06PM +0100, Christian König wrote:
> On 29.01.21 at 15:17, Simon Ser wrote:
> > On Friday, January 29th, 2021 at 3:13 PM, Pekka Paalanen <[email protected]> wrote:
> >
> > > > Re-importing it adds quite a huge CPU overhead to both userspace as well
> > > > as the kernel.
> > > Perhaps, but so far it seems no-one has noticed the overhead, with Mesa
> > > at least.
> > >
> > > I happily stand corrected.
> > Note, all of this doesn't mean that compositors will stop keeping
> > DMA-BUF FDs around. They may want to keep them open for other purposes
> > like importing them into KMS or other EGL displays as needed.
>
> Correct and that's a perfectly valid use case. Just re-importing it on every
> frame is something we should really try to avoid.
>
> At least with debugging enabled it's massive overhead and maybe even
> performance penalty when we have to re-create device page tables all the
> time.
>
> But thinking more about that it is possible that we short-cut this step as
> long as the original import was still referenced. Otherwise we probably
> would have noticed this much earlier.

Yeah kernel keeps lots of caches around and just gives you back the
previous buffer if it's still around. Still probably not the smartest
idea.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch