Recently, we noticed an issue where a process went into direct reclaim
while holding the kernfs rw semaphore for sysfs in write (exclusive)
mode. This caused processes who were doing DMA-BUF exports and releases
to go into uninterruptible sleep since they needed to acquire the same
semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
blocking DMA-BUF export for an indeterminate amount of time while
another process is holding the sysfs rw semaphore in exclusive mode,
this patch moves the per-buffer sysfs file creation to the default work
queue. Note that this can lead to a short-term inaccuracy in the dmabuf
sysfs statistics, but this is a tradeoff to prevent the hot path from
being blocked. A work_struct is added to dma_buf to achieve this, but as
it is unioned with the kobject in the sysfs_entry, dma_buf does not
increase in size.
Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
Originally-by: Hridya Valsaraju <[email protected]>
Signed-off-by: T.J. Mercier <[email protected]>
---
See the originally submitted patch by Hridya Valsaraju here:
https://lkml.org/lkml/2022/1/4/1066
v2 changes:
- Defer only sysfs creation instead of creation and teardown per
Christian König
- Use a work queue instead of a kthread for deferred work per
Christian König
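
As a quick reference for reviewers, here is a minimal sketch of the layout
this change ends up with. It only restates the hunk against
include/linux/dma-buf.h below with the kernel-doc trimmed (in the real
patch the struct is declared inline inside struct dma_buf), but it makes
the union/memset interplay explicit:

  #include <linux/kobject.h>
  #include <linux/workqueue.h>

  struct dma_buf_sysfs_entry {
          union {
                  /* Valid once sysfs_add_workfn() has initialized it. */
                  struct kobject kobj;
                  /* Valid from dma_buf_stats_setup() until the work runs. */
                  struct work_struct sysfs_add_work;
          };
          struct dma_buf *dmabuf;
  };

Because the work_struct occupies the same bytes as the kobject, the work
function re-zeroes the kobject before calling kobject_init_and_add(), and
struct dma_buf does not grow in size.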
---
drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
include/linux/dma-buf.h | 14 ++++++-
2 files changed, 54 insertions(+), 16 deletions(-)
diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
index 2bba0babcb62..67b0a298291c 100644
--- a/drivers/dma-buf/dma-buf-sysfs-stats.c
+++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
@@ -11,6 +11,7 @@
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/sysfs.h>
+#include <linux/workqueue.h>
#include "dma-buf-sysfs-stats.h"
@@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
kset_unregister(dma_buf_stats_kset);
}
+static void sysfs_add_workfn(struct work_struct *work)
+{
+ struct dma_buf_sysfs_entry *sysfs_entry =
+ container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
+ struct dma_buf *dmabuf = sysfs_entry->dmabuf;
+
+ /*
+ * A dmabuf is ref-counted via its file member. If this handler holds the only
+ * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
+ * optimization and a race; when the reference count drops to 1 immediately after
+ * this check it is not harmful as the sysfs entry will still get cleaned up in
+ * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
+ * is released, and that can't happen until the end of this function.
+ */
+ if (file_count(dmabuf->file) > 1) {
+ /*
+ * kobject_init_and_add expects kobject to be zero-filled, but we have populated it
+ * (the sysfs_add_work union member) to trigger this work function.
+ */
+ memset(&dmabuf->sysfs_entry->kobj, 0, sizeof(dmabuf->sysfs_entry->kobj));
+ dmabuf->sysfs_entry->kobj.kset = dma_buf_per_buffer_stats_kset;
+ if (kobject_init_and_add(&dmabuf->sysfs_entry->kobj, &dma_buf_ktype, NULL,
+ "%lu", file_inode(dmabuf->file)->i_ino)) {
+ kobject_put(&dmabuf->sysfs_entry->kobj);
+ dmabuf->sysfs_entry = NULL;
+ }
+ } else {
+ /*
+ * Free the sysfs_entry and reset the pointer so dma_buf_stats_teardown doesn't
+ * attempt to operate on it.
+ */
+ kfree(dmabuf->sysfs_entry);
+ dmabuf->sysfs_entry = NULL;
+ }
+ dma_buf_put(dmabuf);
+}
+
int dma_buf_stats_setup(struct dma_buf *dmabuf)
{
struct dma_buf_sysfs_entry *sysfs_entry;
- int ret;
if (!dmabuf || !dmabuf->file)
return -EINVAL;
@@ -181,25 +218,16 @@ int dma_buf_stats_setup(struct dma_buf *dmabuf)
return -EINVAL;
}
- sysfs_entry = kzalloc(sizeof(struct dma_buf_sysfs_entry), GFP_KERNEL);
+ sysfs_entry = kmalloc(sizeof(struct dma_buf_sysfs_entry), GFP_KERNEL);
if (!sysfs_entry)
return -ENOMEM;
- sysfs_entry->kobj.kset = dma_buf_per_buffer_stats_kset;
sysfs_entry->dmabuf = dmabuf;
-
dmabuf->sysfs_entry = sysfs_entry;
- /* create the directory for buffer stats */
- ret = kobject_init_and_add(&sysfs_entry->kobj, &dma_buf_ktype, NULL,
- "%lu", file_inode(dmabuf->file)->i_ino);
- if (ret)
- goto err_sysfs_dmabuf;
+ INIT_WORK(&dmabuf->sysfs_entry->sysfs_add_work, sysfs_add_workfn);
+ get_dma_buf(dmabuf); /* This reference will be dropped in sysfs_add_workfn. */
+ schedule_work(&dmabuf->sysfs_entry->sysfs_add_work);
return 0;
-
-err_sysfs_dmabuf:
- kobject_put(&sysfs_entry->kobj);
- dmabuf->sysfs_entry = NULL;
- return ret;
}
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 2097760e8e95..0200caa3c515 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
+#include <linux/workqueue.h>
struct device;
struct dma_buf;
@@ -365,7 +366,7 @@ struct dma_buf {
*/
const char *name;
- /** @name_lock: Spinlock to protect name acces for read access. */
+ /** @name_lock: Spinlock to protect name access for read access. */
spinlock_t name_lock;
/**
@@ -441,6 +442,7 @@ struct dma_buf {
__poll_t active;
} cb_in, cb_out;
+
#ifdef CONFIG_DMABUF_SYSFS_STATS
/**
* @sysfs_entry:
@@ -449,7 +451,15 @@ struct dma_buf {
* `DMA-BUF statistics`_ for the uapi this enables.
*/
struct dma_buf_sysfs_entry {
- struct kobject kobj;
+ union {
+ struct kobject kobj;
+
+ /** @sysfs_add_work:
+ *
+ * For deferred sysfs kobject creation using a workqueue.
+ */
+ struct work_struct sysfs_add_work;
+ };
struct dma_buf *dmabuf;
} *sysfs_entry;
#endif
--
2.36.0.550.gb090851708-goog
On Fri, May 20, 2022 at 12:03 AM Christian König
<[email protected]> wrote:
>
> On 20.05.22 at 00:58, T.J. Mercier wrote:
> > [SNIP]
> >>> Is there some other
> >>> solution to the problem of exports getting blocked that you would
> >>> suggest here?
> >> Well, pretty much the same as what Greg outlined. Go back to your
> >> drawing board and come back with a solution which does not need such
> >> workarounds.
> >>
> >> Alternatively you can give me a full overview of the design and explain
> >> why exactly that interface here is necessary in exactly that form.
> >>
> > We ended up here because we could not use debugfs.
>
> [SNIP]
>
> > Another idea was adding per-buffer stats to procfs, but that was not
> > an option since per-buffer stats are not process specific.
> >
> > So it seemed like sysfs was an appropriate solution at the time. It
> > comes with a stable interface as a bonus, but with the limitation of 1
> > value per file this leads to creating lots of files in sysfs for all
> > dma buffers. This leads to increased kernfs lock contention, and
> > unfortunately we try to take the lock on the hot path.
>
> That's something I totally agree with. debugfs is for debugging and not
> for production use.
>
> So either sysfs or procfs or something completely different seems to be
> the right direction for the solution of the problem.
>
> > With the description and links to the userspace code which actually
> > uses the feature I feel like that's a complete picture of what's
> > currently happening with this interface. If you could explain what
> > information is missing I'll do my best to provide it.
>
> Yeah, I've realized that I didn't make it clear what my concerns are
> here. So let me try once more from the beginning:
>
> DMA-buf is a framework for sharing device buffers and their handles
> between different userspace processes and kernel devices. It's based
> around the concept of representing those buffers as files which can then
> be mmap()ed, referenced with a file descriptor, etc.
>
> Those abilities come with a certain overhead, using inode numbers,
> reference counters, creating virtual files for tracking (both debugfs,
> sysfs, procfs), etc. So what both drivers and userspace implementations
> of DMA-buf do is share buffers using this framework only when they
> have to.
>
> In other words for upstream graphics drivers 99.9% of the buffers are
> *not* shared using DMA-buf. And this is perfectly intentional because of
> the additional overhead. Only the 3 or 4 buffers which are shared per
> process between the client and server in a display environment are
> actually exported and imported as DMA-buf.
>
> What the recent patches suggest is that this is not the case on Android.
> So for example overrunning a 32bit inode number means that you manage to
> create and destroy over 4 billion DMA-bufs. Same for this sysfs-based
> accounting, this only makes sense when you really export *everything* as
> DMA-buf.
>
> So if that is correct, then that would be a pretty clear design issue in
> Android. Now, if you want to keep this design then that is perfectly
> fine with the kernel, but it also means that you need to deal with any
> arising problems by yourself.
>
> Pushing patches upstream indicates that you want to share your work with
> others. And in this case it suggests that you want to encourage others
> to follow the Android design and that is something I would pretty
> clearly reject.
>
Ok, thank you, this is clear and I understand your position. Yes,
Android does things a little differently. My team is actually hoping
to create a presentation on this topic explaining why things are the
way they are, because these differences keep coming up in discussions.
The inode number rollover happened after running for two weeks, but
that's still around 300M a day, which is extraordinary, so I think they
must have been stress testing. But yes, the Android graphics stack does
make much more use of DMA-bufs than other users.
> >> Yeah and to be honest I have the strong feeling now that this was
> >> absolutely not well thought through.
> > I'm open to working on a replacement for this if we can't find an
> > acceptable solution here, but I would appreciate some direction on
> > what would be acceptable. For example Greg's idea sounds workable, but
> > the question is if it mergeable?
>
> Well one possibility would be to use cgroups. That framework needs to do
> accounting as well, just with an additional limitation to it.
>
> And there are already some proposed cgroup patches for device driver
> memory. While reviewing those, both Daniel and I already made it pretty
> clear that it must be separated from DMA-buf, exactly because we
> probably don't want every buffer exported.
>
Cgroups definitely would help out with per-application accounting.
Much nicer than parsing through procfs. For our use case this requires
associating the exporter name with the cgroup resource, which is part
of the data that comes from sysfs now. I have some patches which do
this, but this naming component is a point of contention at the
moment. Maybe it would be better to focus efforts on the problem of
how to categorize and aggregate (or not aggregate) graphics resources
for accounting with cgroups in a way that suits everyone's needs.
Thanks,
T.J.
> But to work on a full-blown solution I need a better understanding of
> how your userspace components work.
>
> Regards,
> Christian.