Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> On Mon, May 16, 2022 at 10:20 AM Christian König
> <[email protected]> wrote:
>> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
>>> Recently, we noticed an issue where a process went into direct reclaim
>>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
>>> mode. This caused processes that were doing DMA-BUF exports and releases
>>> to go into uninterruptible sleep since they needed to acquire the same
>>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
>>> blocking DMA-BUF export for an indeterminate amount of time while
>>> another process is holding the sysfs rw semaphore in exclusive mode,
>>> this patch moves the per-buffer sysfs file creation to the default work
>>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
>>> sysfs statistics, but this is a tradeoff to prevent the hot path from
>>> being blocked. A work_struct is added to dma_buf to achieve this, but as
>>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
>>> increase in size.
>> I'm still not very keen on this approach, as it strongly feels like we
>> are working around a shortcoming somewhere else.
>>
> My read of the thread for the last version is that we're running into
> a situation where sysfs is getting used for something it wasn't
> originally intended for, but we're also stuck with this sysfs
> functionality for dmabufs.
>
>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
>>> Originally-by: Hridya Valsaraju <[email protected]>
>>> Signed-off-by: T.J. Mercier <[email protected]>
>>>
>>> ---
>>> See the originally submitted patch by Hridya Valsaraju here:
>>> https://lkml.org/lkml/2022/1/4/1066
>>>
>>> v2 changes:
>>> - Defer only sysfs creation instead of creation and teardown per
>>> Christian König
>>>
>>> - Use a work queue instead of a kthread for deferred work per
>>> Christian König
>>> ---
>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
>>> include/linux/dma-buf.h | 14 ++++++-
>>> 2 files changed, 54 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>> index 2bba0babcb62..67b0a298291c 100644
>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>> @@ -11,6 +11,7 @@
>>> #include <linux/printk.h>
>>> #include <linux/slab.h>
>>> #include <linux/sysfs.h>
>>> +#include <linux/workqueue.h>
>>>
>>> #include "dma-buf-sysfs-stats.h"
>>>
>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
>>> kset_unregister(dma_buf_stats_kset);
>>> }
>>>
>>> +static void sysfs_add_workfn(struct work_struct *work)
>>> +{
>>> + struct dma_buf_sysfs_entry *sysfs_entry =
>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
>>> +
>>> + /*
>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
>>> + * optimization and a race; when the reference count drops to 1 immediately after
>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
>>> + * is released, and that can't happen until the end of this function.
>>> + */
>>> + if (file_count(dmabuf->file) > 1) {
>> Please completely drop that. I see absolutely no justification for this
>> additional complexity.
>>
> This case gets hit around 5% of the time in my testing so the else is
> not a completely unused branch.
Well I can only repeat myself: This means that your userspace is
severely broken!
DMA-bufs are meant to be long-living objects, and when you create and
destroy them faster than a work item can create the sysfs entries, then
there is a serious design issue here.
My suspicion is that your IOCTL to create the DMA-buf is somehow
interrupted/aborted after it was already exported, which results in its
immediate destruction.
Regards,
Christian.
> It's only 3 extra lines of actual
> code. I'd prefer to keep it, but I'll remove it to reduce complexity.
> This means doing work that we know is useless some of the time, and
> adding contention for a global kernfs lock which this patch is aimed
> at avoiding (on the hot path), but at least that work is on a worker
> thread with this patch.
>
>>> + /*
>>> + * kobject_init_and_add expects kobject to be zero-filled, but we have populated it
>>> + * (the sysfs_add_work union member) to trigger this work function.
>>> + */
>>> + memset(&dmabuf->sysfs_entry->kobj, 0, sizeof(dmabuf->sysfs_entry->kobj));
>>> + dmabuf->sysfs_entry->kobj.kset = dma_buf_per_buffer_stats_kset;
>>> + if (kobject_init_and_add(&dmabuf->sysfs_entry->kobj, &dma_buf_ktype, NULL,
>>> + "%lu", file_inode(dmabuf->file)->i_ino)) {
>>> + kobject_put(&dmabuf->sysfs_entry->kobj);
>>> + dmabuf->sysfs_entry = NULL;
>>> + }
>>> + } else {
>>> + /*
>>> + * Free the sysfs_entry and reset the pointer so dma_buf_stats_teardown doesn't
>>> + * attempt to operate on it.
>>> + */
>>> + kfree(dmabuf->sysfs_entry);
>>> + dmabuf->sysfs_entry = NULL;
>>> + }
>>> + dma_buf_put(dmabuf);
>>> +}
>>> +
>>> int dma_buf_stats_setup(struct dma_buf *dmabuf)
>>> {
>>> struct dma_buf_sysfs_entry *sysfs_entry;
>>> - int ret;
>>>
>>> if (!dmabuf || !dmabuf->file)
>>> return -EINVAL;
>>> @@ -181,25 +218,16 @@ int dma_buf_stats_setup(struct dma_buf *dmabuf)
>>> return -EINVAL;
>>> }
>>>
>>> - sysfs_entry = kzalloc(sizeof(struct dma_buf_sysfs_entry), GFP_KERNEL);
>>> + sysfs_entry = kmalloc(sizeof(struct dma_buf_sysfs_entry), GFP_KERNEL);
>>> if (!sysfs_entry)
>>> return -ENOMEM;
>>>
>>> - sysfs_entry->kobj.kset = dma_buf_per_buffer_stats_kset;
>>> sysfs_entry->dmabuf = dmabuf;
>>> -
>>> dmabuf->sysfs_entry = sysfs_entry;
>>>
>>> - /* create the directory for buffer stats */
>>> - ret = kobject_init_and_add(&sysfs_entry->kobj, &dma_buf_ktype, NULL,
>>> - "%lu", file_inode(dmabuf->file)->i_ino);
>>> - if (ret)
>>> - goto err_sysfs_dmabuf;
>>> + INIT_WORK(&dmabuf->sysfs_entry->sysfs_add_work, sysfs_add_workfn);
>>> + get_dma_buf(dmabuf); /* This reference will be dropped in sysfs_add_workfn. */
>>> + schedule_work(&dmabuf->sysfs_entry->sysfs_add_work);
>>>
>>> return 0;
>>> -
>>> -err_sysfs_dmabuf:
>>> - kobject_put(&sysfs_entry->kobj);
>>> - dmabuf->sysfs_entry = NULL;
>>> - return ret;
>>> }
>>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>>> index 2097760e8e95..0200caa3c515 100644
>>> --- a/include/linux/dma-buf.h
>>> +++ b/include/linux/dma-buf.h
>>> @@ -22,6 +22,7 @@
>>> #include <linux/fs.h>
>>> #include <linux/dma-fence.h>
>>> #include <linux/wait.h>
>>> +#include <linux/workqueue.h>
>>>
>>> struct device;
>>> struct dma_buf;
>>> @@ -365,7 +366,7 @@ struct dma_buf {
>>> */
>>> const char *name;
>>>
>>> - /** @name_lock: Spinlock to protect name acces for read access. */
>>> + /** @name_lock: Spinlock to protect name access for read access. */
>>> spinlock_t name_lock;
>>>
>>> /**
>>> @@ -441,6 +442,7 @@ struct dma_buf {
>>>
>>> __poll_t active;
>>> } cb_in, cb_out;
>>> +
>> Those changes are unrelated.
>>
> I included it here because I thought it was bad form to submit a
> typo-only patch. Will remove.
>
>
>
>
>> Regards,
>> Christian.
>>
>>> #ifdef CONFIG_DMABUF_SYSFS_STATS
>>> /**
>>> * @sysfs_entry:
>>> @@ -449,7 +451,15 @@ struct dma_buf {
>>> * `DMA-BUF statistics`_ for the uapi this enables.
>>> */
>>> struct dma_buf_sysfs_entry {
>>> - struct kobject kobj;
>>> + union {
>>> + struct kobject kobj;
>>> +
>>> + /** @sysfs_add_work:
>>> + *
>>> + * For deferred sysfs kobject creation using a workqueue.
>>> + */
>>> + struct work_struct sysfs_add_work;
>>> + };
>>> struct dma_buf *dmabuf;
>>> } *sysfs_entry;
>>> #endif
On Mon, May 16, 2022 at 12:21 PM Christian König
<[email protected]> wrote:
>
> Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > On Mon, May 16, 2022 at 10:20 AM Christian König
> > <[email protected]> wrote:
> >> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > >>> [SNIP]
> >>> + if (file_count(dmabuf->file) > 1) {
> >> Please completely drop that. I see absolutely no justification for this
> >> additional complexity.
> >>
> > This case gets hit around 5% of the time in my testing so the else is
> > not a completely unused branch.
>
> Well I can only repeat myself: This means that your userspace is
> severely broken!
>
> DMA-bufs are meant to be long-living objects
This patch addresses export *latency* regardless of how long-lived the
object is. Even a single, long-lived export will benefit from this
change if it would otherwise be blocked on adding an object to sysfs.
I think attempting to improve this latency still has merit.
> and when you create and
> destroy them faster than a work item can create the sysfs entries, then
> there is a serious design issue here.
>
> My suspicion is that your IOCTL to create the DMA-buf is somehow
> interrupted/aborted after it was already exported, which results in its
> immediate destruction.
>
This patch uses the system default workqueue, so this work function
could be blocked by any other user of the queue ahead of it, not just
other dmabuf exports. So this is independent of getting blocked on
manipulating sysfs.
I will look closer into the lifecycle of dmabufs during camera app
launches to see if there are buffers that have extremely short
lifetimes by design, but I would still like to get this change in to
improve the latency of export for all buffers.
On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> On Mon, May 16, 2022 at 12:21 PM Christian König
> <[email protected]> wrote:
> >
> > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > <[email protected]> wrote:
> > >> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > >>> [SNIP]
> >
> > Well I can only repeat myself: This means that your userspace is
> > severely broken!
> >
> > DMA-bufs are meant to be long-living objects
> This patch addresses export *latency* regardless of how long-lived the
> object is. Even a single, long-lived export will benefit from this
> change if it would otherwise be blocked on adding an object to sysfs.
> I think attempting to improve this latency still has merit.
Fixing the latency is nice, but as it's just pushing the needed work off
to another code path, it will take longer overall for the sysfs stuff to
be ready for userspace to see.
Perhaps we need to step back and understand what this code is supposed
to be doing. As I recall, it was created because some systems do not
allow debugfs anymore, and they wanted the debugging information that
the dmabuf code was exposing to debugfs on a "normal" system. Moving
that logic to sysfs made sense, but now I am wondering why we didn't see
these issues in the debugfs code previously?
Perhaps we should go just one step further and make a misc device node
for dmabuf debugging information, and just have userspace
poll/read on the device node and we spit the info that used to be in
debugfs out through that? That way this only affects systems when they
want to read the information, and not normal code paths? Yeah, that's a
hack, but this whole thing feels overly complex now.
thanks,
greg k-h
Am 17.05.22 um 08:13 schrieb Greg Kroah-Hartman:
> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
>> [SNIP]
>>> Well I can only repeat myself: This means that your userspace is
>>> severely broken!
>>>
>>> DMA-bufs are meant to be long-living objects
>> This patch addresses export *latency* regardless of how long-lived the
>> object is. Even a single, long-lived export will benefit from this
>> change if it would otherwise be blocked on adding an object to sysfs.
>> I think attempting to improve this latency still has merit.
> Fixing the latency is nice, but as it's just pushing the needed work off
> to another code path, it will take longer overall for the sysfs stuff to
> be ready for userspace to see.
>
> Perhaps we need to step back and understand what this code is supposed
> to be doing. As I recall, it was created because some systems do not
> allow debugfs anymore, and they wanted the debugging information that
> the dmabuf code was exposing to debugfs on a "normal" system. Moving
> that logic to sysfs made sense, but now I am wondering why we didn't see
> these issues in the debugfs code previously?
Well, I think that a key piece of information is that adding the sysfs
support was justified with the argument that it is not only used for debugging.
If it were used only for debugging, then debugfs would be the right
choice for this. If debugfs is not available in your environment,
then you should *not* ask the kernel to work around that. Instead we
should discuss why you want to disable some debugging access, but not
all of it.
So for now let's assume that this is also used for accounting, e.g. when
userspace wants to know how many DMA-bufs of which size are flying
around, to make decisions like which process to put into the background or
which to swap out based on that information.
> Perhaps we should go just one step further and make a misc device node
> for dmabug debugging information to be in and just have userspace
> poll/read on the device node and we spit the info that used to be in
> debugfs out through that? That way this only affects systems when they
> want to read the information and not normal code paths? Yeah that's a
> hack, but this whole thing feels overly complex now.
Yeah, totally agree on the complexity note. I'm just absolutely not keen
to add hack over hack over hack to make something work which, from my
point of view, has some serious issues with its design.
For example, trying to do accounting based on DMA-bufs is extremely
questionable to begin with. A modern game can have
between 10k and 100k different buffers; reserving one file descriptor
for each of those objects is absolutely not going to work.
So my request is to please describe your full use case and not just why
you think this patch is justified.
Regards,
Christian.
>
> thanks,
>
> greg k-h
On Mon, May 16, 2022 at 11:13 PM Greg Kroah-Hartman
<[email protected]> wrote:
>
> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > On Mon, May 16, 2022 at 12:21 PM Christian König
> > <[email protected]> wrote:
> > >
> > > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > <[email protected]> wrote:
> > > >> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > > >>> [SNIP]
> > > >>> + if (file_count(dmabuf->file) > 1) {
> > > >> Please completely drop that. I see absolutely no justification for this
> > > >> additional complexity.
> > > >>
> > > > This case gets hit around 5% of the time in my testing so the else is
> > > > not a completely unused branch.
> > >
> > > Well I can only repeat myself: This means that your userspace is
> > > severely broken!
> > >
> > > DMA-buf are meant to be long living objects
> > This patch addresses export *latency* regardless of how long-lived the
> > object is. Even a single, long-lived export will benefit from this
> > change if it would otherwise be blocked on adding an object to sysfs.
> > I think attempting to improve this latency still has merit.
>
> Fixing the latency is nice, but as it's just pushing the needed work off
> to another code path, it will take longer overall for the sysfs stuff to
> be ready for userspace to see.
>
> Perhaps we need to step back and understand what this code is supposed
> to be doing. As I recall, it was created because some systems do not
> allow debugfs anymore, and they wanted the debugging information that
> the dmabuf code was exposing to debugfs on a "normal" system. Moving
> that logic to sysfs made sense, but now I am wondering why we didn't see
> these issues in the debugfs code previously?
>
The debugfs stuff doesn't happen on every export, right?
> Perhaps we should go just one step further and make a misc device node
> for dmabuf debugging information to be in and just have userspace
> poll/read on the device node and we spit the info that used to be in
> debugfs out through that? That way this only affects systems when they
> want to read the information and not normal code paths? Yeah that's a
> hack, but this whole thing feels overly complex now.
>
And deprecate sysfs support? I'm happy to try out anything you think
might be a better way. As for the complexity of this patch, this
revision is a much simpler version of the one from Hridya you already
reviewed.
> thanks,
>
> greg k-h
On Mon, May 16, 2022 at 11:59 PM Christian König
<[email protected]> wrote:
>
> > On 17.05.22 at 08:13, Greg Kroah-Hartman wrote:
> > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> >> [SNIP]
> >>>>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> >>>>>> Originally-by: Hridya Valsaraju <[email protected]>
> >>>>>> Signed-off-by: T.J. Mercier <[email protected]>
> >>>>>>
> >>>>>> ---
> >>>>>> See the originally submitted patch by Hridya Valsaraju here:
> >>>>>> https://lkml.org/lkml/2022/1/4/1066
> >>>>>>
> >>>>>> v2 changes:
> >>>>>> - Defer only sysfs creation instead of creation and teardown per
> >>>>>> Christian König
> >>>>>>
> >>>>>> - Use a work queue instead of a kthread for deferred work per
> >>>>>> Christian König
> >>>>>> ---
> >>>>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> >>>>>> include/linux/dma-buf.h | 14 ++++++-
> >>>>>> 2 files changed, 54 insertions(+), 16 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>> index 2bba0babcb62..67b0a298291c 100644
> >>>>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>> @@ -11,6 +11,7 @@
> >>>>>> #include <linux/printk.h>
> >>>>>> #include <linux/slab.h>
> >>>>>> #include <linux/sysfs.h>
> >>>>>> +#include <linux/workqueue.h>
> >>>>>>
> >>>>>> #include "dma-buf-sysfs-stats.h"
> >>>>>>
> >>>>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> >>>>>> kset_unregister(dma_buf_stats_kset);
> >>>>>> }
> >>>>>>
> >>>>>> +static void sysfs_add_workfn(struct work_struct *work)
> >>>>>> +{
> >>>>>> + struct dma_buf_sysfs_entry *sysfs_entry =
> >>>>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> >>>>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> >>>>>> +
> >>>>>> + /*
> >>>>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> >>>>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> >>>>>> + * optimization and a race; when the reference count drops to 1 immediately after
> >>>>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> >>>>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> >>>>>> + * is released, and that can't happen until the end of this function.
> >>>>>> + */
> >>>>>> + if (file_count(dmabuf->file) > 1) {
> >>>>> Please completely drop that. I see absolutely no justification for this
> >>>>> additional complexity.
> >>>>>
> >>>> This case gets hit around 5% of the time in my testing so the else is
> >>>> not a completely unused branch.
> >>> Well I can only repeat myself: This means that your userspace is
> >>> severely broken!
> >>>
> >>> DMA-buf are meant to be long living objects
> >> This patch addresses export *latency* regardless of how long-lived the
> >> object is. Even a single, long-lived export will benefit from this
> >> change if it would otherwise be blocked on adding an object to sysfs.
> >> I think attempting to improve this latency still has merit.
> > Fixing the latency is nice, but as it's just pushing the needed work off
> > to another code path, it will take longer overall for the sysfs stuff to
> > be ready for userspace to see.
> >
> > Perhaps we need to step back and understand what this code is supposed
> > to be doing. As I recall, it was created because some systems do not
> > allow debugfs anymore, and they wanted the debugging information that
> > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > that logic to sysfs made sense, but now I am wondering why we didn't see
> > these issues in the debugfs code previously?
>
> Well, I think that some key information is that adding the sysfs support
> was justified with the argument that this is not only used for debugging.
>
> If it would be used only for debugging then debugfs would be the right
> choice for this. If debugfs is then not available in your environment
> then you should *not* ask the kernel to work around that. Instead we
> should discuss why you want to disable some debugging access, but not
> all of that.
>
> So for now let's assume that this is also used for accounting, e.g. when
> userspace wants to know how many DMA-bufs of which size are flying
> around to make decisions like which process to put into background or
> which to swap out based on that information.
>
Yes, the accounting of buffers at runtime on production devices is
part of the use case:
https://lore.kernel.org/all/CA+wgaPPtoz_JSAwsVVpFGLrcrO8-tAGD+gdrsWmBA3jpidigzQ@mail.gmail.com/
> > Perhaps we should go just one step further and make a misc device node
> > for dmabuf debugging information to be in and just have userspace
> > poll/read on the device node and we spit the info that used to be in
> > debugfs out through that? That way this only affects systems when they
> > want to read the information and not normal code paths? Yeah that's a
> > hack, but this whole thing feels overly complex now.
>
> Yeah, totally agree on the complexity note. I'm just absolutely not keen
> to add hack over hack over hack to make something work which from my
> point of view has some serious issues with its design.
>
Why is this patch a hack? We found a problem with the initial design
which nobody saw when it was originally created, and now we're trying
to address it within the constraints that exist. Is there some other
solution to the problem of exports getting blocked that you would
suggest here?
> For example trying to do accounting based on DMA-bufs is extremely
> questionable to begin with. See a modern game for example can have
> between 10k and 100k of different buffers, reserving one file descriptor
> for each of those objects is absolutely not going to work.
>
> So my request is to please describe your full use case and not just why
> you think this patch is justified.
>
The use case was described in the commit message when the feature was
initially added (after discussion about it on the list) including
links to code that uses the feature:
https://lore.kernel.org/all/[email protected]/
> Regards,
> Christian.
>
> >
> > thanks,
> >
> > greg k-h
>
On 18.05.22 at 01:09, T.J. Mercier wrote:
> [SNIP]
>>> Perhaps we should go just one step further and make a misc device node
>>> for dmabuf debugging information to be in and just have userspace
>>> poll/read on the device node and we spit the info that used to be in
>>> debugfs out through that? That way this only affects systems when they
>>> want to read the information and not normal code paths? Yeah that's a
>>> hack, but this whole thing feels overly complex now.
>> Yeah, totally agree on the complexity note. I'm just absolutely not keen
>> to add hack over hack over hack to make something work which from my
>> point of view has some serious issues with its design.
>>
> Why is this patch a hack? We found a problem with the initial design
> which nobody saw when it was originally created, and now we're trying
> to address it within the constraints that exist.
Well the issue is that you don't try to tackle the underlying problem,
but rather just mitigate the unforeseen consequences. That is pretty
clearly a hack to me.
> Is there some other
> solution to the problem of exports getting blocked that you would
> suggest here?
Well pretty much the same as Greg outlined as well. Go back to your
drawing board and come back with a solution which does not need such
workarounds.
Alternatively you can give me a full overview of the design and explain
why exactly that interface here is necessary in exactly that form.
>> For example trying to do accounting based on DMA-bufs is extremely
>> questionable to begin with. See a modern game for example can have
>> between 10k and 100k of different buffers, reserving one file descriptor
>> for each of those objects is absolutely not going to work.
>>
>> So my request is to please describe your full use case and not just why
>> you think this patch is justified.
>>
> The use case was described in the commit message when the feature was
> initially added (after discussion about it on the list) including
> links to code that uses the feature:
> https://lore.kernel.org/all/[email protected]/
Yeah, and to be honest I have the strong feeling now that this was
absolutely not well thought through. This description as well as the
in-kernel documentation are not even remotely sufficient to explain what
you guys are doing with this.
So please come up with a complete design description for both userspace
and kernel why this interface here is necessary inside the upstream kernel.
If you can't convince me that this is a good idea I will just bluntly
mark this DMA-buf sysfs interface as deprecated.
Regards,
Christian.
>
>
>> Regards,
>> Christian.
>>
>>> thanks,
>>>
>>> greg k-h
On Tue, May 17, 2022 at 04:09:36PM -0700, T.J. Mercier wrote:
> On Mon, May 16, 2022 at 11:13 PM Greg Kroah-Hartman
> <[email protected]> wrote:
> >
> > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > <[email protected]> wrote:
> > > >
> > > > On 16.05.22 at 20:08, T.J. Mercier wrote:
> > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > <[email protected]> wrote:
> > > > >> On 16.05.22 at 19:13, T.J. Mercier wrote:
> > > > >>> Recently, we noticed an issue where a process went into direct reclaim
> > > > >>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > >>> mode. This caused processes who were doing DMA-BUF exports and releases
> > > > >>> to go into uninterruptible sleep since they needed to acquire the same
> > > > >>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > >>> blocking DMA-BUF export for an indeterminate amount of time while
> > > > >>> another process is holding the sysfs rw semaphore in exclusive mode,
> > > > >>> this patch moves the per-buffer sysfs file creation to the default work
> > > > >>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > >>> sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > >>> being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > >>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > >>> increase in size.
> > > > >> I'm still not very keen on this approach as it strongly feels like we
> > > > >> are working around a shortcoming somewhere else.
> > > > >>
> > > > > My read of the thread for the last version is that we're running into
> > > > > a situation where sysfs is getting used for something it wasn't
> > > > > originally intended for, but we're also stuck with this sysfs
> > > > > functionality for dmabufs.
> > > > >
> > > > >>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > >>> Originally-by: Hridya Valsaraju <[email protected]>
> > > > >>> Signed-off-by: T.J. Mercier <[email protected]>
> > > > >>>
> > > > >>> ---
> > > > >>> See the originally submitted patch by Hridya Valsaraju here:
> > > > >>> https://lkml.org/lkml/2022/1/4/1066
> > > > >>>
> > > > >>> v2 changes:
> > > > >>> - Defer only sysfs creation instead of creation and teardown per
> > > > >>> Christian König
> > > > >>>
> > > > >>> - Use a work queue instead of a kthread for deferred work per
> > > > >>> Christian König
> > > > >>> ---
> > > > >>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > >>> include/linux/dma-buf.h | 14 ++++++-
> > > > >>> 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > >>>
> > > > >>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> index 2bba0babcb62..67b0a298291c 100644
> > > > >>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> @@ -11,6 +11,7 @@
> > > > >>> #include <linux/printk.h>
> > > > >>> #include <linux/slab.h>
> > > > >>> #include <linux/sysfs.h>
> > > > >>> +#include <linux/workqueue.h>
> > > > >>>
> > > > >>> #include "dma-buf-sysfs-stats.h"
> > > > >>>
> > > > >>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > >>> kset_unregister(dma_buf_stats_kset);
> > > > >>> }
> > > > >>>
> > > > >>> +static void sysfs_add_workfn(struct work_struct *work)
> > > > >>> +{
> > > > >>> + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > >>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > >>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > >>> +
> > > > >>> + /*
> > > > >>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > >>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > >>> + * optimization and a race; when the reference count drops to 1 immediately after
> > > > >>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > >>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > >>> + * is released, and that can't happen until the end of this function.
> > > > >>> + */
> > > > >>> + if (file_count(dmabuf->file) > 1) {
> > > > >> Please completely drop that. I see absolutely no justification for this
> > > > >> additional complexity.
> > > > >>
> > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > not a completely unused branch.
> > > >
> > > > Well I can only repeat myself: This means that your userspace is
> > > > severely broken!
> > > >
> > > > DMA-buf are meant to be long living objects
> > > This patch addresses export *latency* regardless of how long-lived the
> > > object is. Even a single, long-lived export will benefit from this
> > > change if it would otherwise be blocked on adding an object to sysfs.
> > > I think attempting to improve this latency still has merit.
> >
> > Fixing the latency is nice, but as it's just pushing the needed work off
> > to another code path, it will take longer overall for the sysfs stuff to
> > be ready for userspace to see.
> >
> > Perhaps we need to step back and understand what this code is supposed
> > to be doing. As I recall, it was created because some systems do not
> > allow debugfs anymore, and they wanted the debugging information that
> > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > that logic to sysfs made sense, but now I am wondering why we didn't see
> > these issues in the debugfs code previously?
> >
> The debugfs stuff doesn't happen on every export, right?
I do not remember. If not, then why not do what that does? :)
> > Perhaps we should go just one step further and make a misc device node
> > for dmabuf debugging information to be in and just have userspace
> > poll/read on the device node and we spit the info that used to be in
> > debugfs out through that? That way this only affects systems when they
> > want to read the information and not normal code paths? Yeah that's a
> > hack, but this whole thing feels overly complex now.
> >
>
> And deprecate sysfs support?
As Android is the only current user, that would be easy to do.
> I'm happy to try out anything you think might be a better way. As for
> the complexity of this patch, this revision is a much simpler version
> of the one from Hridya that you already reviewed.
I agree we should work to resolve this now, but going forward perhaps
this whole API should be revisited now that we see just how much
complexity and extra system load it has added to the system due to
how some users interact with the dmabuf APIs.
You never learn just how bad an API really is until it gets used a lot,
so it's no one's fault. But let's learn and move forward if possible to
make things better.
thanks,
greg k-h
On Wed, May 18, 2022 at 5:03 AM Christian König
<[email protected]> wrote:
>
> On 18.05.22 at 01:09, T.J. Mercier wrote:
> > [SNIP]
> >>> Perhaps we should go just one step further and make a misc device node
> >>> for dmabuf debugging information to be in and just have userspace
> >>> poll/read on the device node and we spit the info that used to be in
> >>> debugfs out through that? That way this only affects systems when they
> >>> want to read the information and not normal code paths? Yeah that's a
> >>> hack, but this whole thing feels overly complex now.
> >> Yeah, totally agree on the complexity note. I'm just absolutely not keen
> >> to add hack over hack over hack to make something work which from my
> >> point of view has some serious issues with its design.
> >>
> > Why is this patch a hack? We found a problem with the initial design
> > which nobody saw when it was originally created, and now we're trying
> > to address it within the constraints that exist.
>
> Well the issue is that you don't try to tackle the underlying problem,
> but rather just mitigate the unforeseen consequences. That is pretty
> clearly a hack to me.
>
> > Is there some other
> > solution to the problem of exports getting blocked that you would
> > suggest here?
>
> Well pretty much the same as Greg outlined as well. Go back to your
> drawing board and come back with a solution which does not need such
> workarounds.
>
> Alternatively you can give me a full overview of the design and explain
> why exactly that interface here is necessary in exactly that form.
>
We ended up here because we could not use debugfs.
https://source.android.com/setup/start/android-11-release#debugfs
Another idea was adding per-buffer stats to procfs, but that was not
an option since per-buffer stats are not process specific.
So it seemed like sysfs was an appropriate solution at the time. It
comes with a stable interface as a bonus, but with the limitation of one
value per file, this leads to creating lots of files in sysfs for all
dma-bufs. This leads to increased kernfs lock contention, and
unfortunately we try to take the lock on the hot path.
With the description and links to the userspace code which actually
uses the feature I feel like that's a complete picture of what's
currently happening with this interface. If you could explain what
information is missing I'll do my best to provide it.
https://cs.android.com/android/platform/superproject/+/master:system/memory/libmeminfo/libdmabufinfo/dmabuf_sysfs_stats.cpp;l=55;drc=3f7db7aec60bbba14c6dd81600820ed62fe3f475;bpv=0;bpt=1
https://cs.android.com/android/platform/superproject/+/master:system/memory/libmeminfo/libdmabufinfo/tools/dmabuf_dump.cpp;l=182;drc=3f7db7aec60bbba14c6dd81600820ed62fe3f475;bpv=0;bpt=1
https://cs.android.com/android/platform/superproject/+/master:frameworks/native/cmds/dumpstate/dumpstate.cpp;drc=621533f895b30ca281e79a7a7c03eefd352359b5;l=1832-1833
> >> For example trying to do accounting based on DMA-bufs is extremely
> >> questionable to begin with. See a modern game for example can have
> >> between 10k and 100k of different buffers, reserving one file descriptor
> >> for each of those objects is absolutely not going to work.
> >>
> >> So my request is to please describe your full use case and not just why
> >> you think this patch is justified.
> >>
> > The use case was described in the commit message when the feature was
> > initially added (after discussion about it on the list) including
> > links to code that uses the feature:
> > https://lore.kernel.org/all/[email protected]/
>
> Yeah and to be honest I have the strong feeling now that this was
> absolutely not well thought through.
I'm open to working on a replacement for this if we can't find an
acceptable solution here, but I would appreciate some direction on
what would be acceptable. For example, Greg's idea sounds workable, but
the question is whether it is mergeable.
> This description as well as the
> in-kernel documentation are not even remotely sufficient to explain what
> you guys are doing with this.
>
> So please come up with a complete design description for both userspace
> and kernel why this interface here is necessary inside the upstream kernel.
>
> If you can't convince me that this is a good idea I will just bluntly
> mark this DMA-buf sysfs interface as deprecated.
>
> Regards,
> Christian.
>
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>> thanks,
> >>>
> >>> greg k-h
>
Am 20.05.22 um 00:58 schrieb T.J. Mercier:
> [SNIP]
>>> Is there some other
>>> solution to the problem of exports getting blocked that you would
>>> suggest here?
>> Well pretty much the same as Greg outlined as well. Go back to your
>> drawing board and come back with a solution which does not need such
>> workarounds.
>>
>> Alternatively you can give me a full overview of the design and explain
>> why exactly that interface here is necessary in exactly that form.
>>
> We ended up here because we could not use debugfs.
[SNIP]
> Another idea was adding per-buffer stats to procfs, but that was not
> an option since per-buffer stats are not process specific.
>
> So it seemed like sysfs was an appropriate solution at the time. It
> comes with a stable interface as a bonus, but with the limitation of 1
> value per file this leads to creating lots of files in sysfs for all
> dma buffers. This leads to increased kernfs lock contention, and
> unfortunately we try to take the lock on the hot path.
That's something I totally agree on: debugfs is for debugging and not
for production use.
So either sysfs or procfs or something completely different seems to be
the right direction for the solution of the problem.
> With the description and links to the userspace code which actually
> uses the feature I feel like that's a complete picture of what's
> currently happening with this interface. If you could explain what
> information is missing I'll do my best to provide it.
Yeah, I've realized that I didn't make it clear what my concerns are
here. So let me try once more from the beginning:
DMA-buf is a framework for sharing device buffers and their handles
between different userspace processes and kernel devices. It's based
around the concept of representing those buffers as files which can then
be mmap()ed, referenced with a file descriptor, etc.
Those abilities come with a certain overhead: inode numbers, reference
counters, virtual files for tracking (debugfs, sysfs, procfs), etc. So
what both drivers and userspace implementing DMA-buf do is share buffers
using this framework only when they have to.
In other words for upstream graphics drivers 99.9% of the buffers are
*not* shared using DMA-buf. And this is perfectly intentional because of
the additional overhead. Only the 3 or 4 buffers which are shared per
process between the client and server in a display environment are
actually exported and imported as DMA-buf.
What the recent patches suggest is that this is not the case on Android.
So for example overrunning a 32-bit inode number means that you managed
to create and destroy over 4 billion DMA-bufs. The same goes for this
sysfs-based accounting: it only makes sense when you really export
*everything* as DMA-buf.
So if that is correct, then that would be a pretty clear design issue in
Android. Now, if you want to keep this design then that is perfectly
fine with the kernel, but it also means that you need to deal with any
arising problems by yourself.
Pushing patches upstream indicates that you want to share your work with
others. And in this case it suggests that you want to encourage others
to follow the Android design and that is something I would pretty
clearly reject.
>> Yeah and to be honest I have the strong feeling now that this was
>> absolutely not well thought through.
> I'm open to working on a replacement for this if we can't find an
> acceptable solution here, but I would appreciate some direction on
> what would be acceptable. For example, Greg's idea sounds workable, but
> the question is whether it is mergeable.
Well one possibility would be to use cgroups. That framework needs to do
accounting as well, just with an additional limitation to it.
And there are already some proposed cgroup patches for device driver
memory. While reviewing those both Daniel and I already made it pretty
clear that it must be separated from DMA-buf, exactly because of the
reason that we probably don't want every buffer exported.
But to work on a full-blown solution I need a better understanding of
what your userspace components do.
Regards,
Christian.
On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > On Mon, May 16, 2022 at 12:21 PM Christian König
> > <[email protected]> wrote:
> > >
> > > On 16.05.22 at 20:08, T.J. Mercier wrote:
> > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > <[email protected]> wrote:
> > > >> On 16.05.22 at 19:13, T.J. Mercier wrote:
> > > >>> Recently, we noticed an issue where a process went into direct reclaim
> > > >>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > >>> mode. This caused processes who were doing DMA-BUF exports and releases
> > > >>> to go into uninterruptible sleep since they needed to acquire the same
> > > >>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > >>> blocking DMA-BUF export for an indeterminate amount of time while
> > > >>> another process is holding the sysfs rw semaphore in exclusive mode,
> > > >>> this patch moves the per-buffer sysfs file creation to the default work
> > > >>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > >>> sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > >>> being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > >>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > >>> increase in size.
> > > >> I'm still not very keen on this approach as it strongly feels like we
> > > >> are working around a shortcoming somewhere else.
> > > >>
> > > > My read of the thread for the last version is that we're running into
> > > > a situation where sysfs is getting used for something it wasn't
> > > > originally intended for, but we're also stuck with this sysfs
> > > > functionality for dmabufs.
> > > >
> > > >>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > >>> Originally-by: Hridya Valsaraju <[email protected]>
> > > >>> Signed-off-by: T.J. Mercier <[email protected]>
> > > >>>
> > > >>> ---
> > > >>> See the originally submitted patch by Hridya Valsaraju here:
> > > >>> https://lkml.org/lkml/2022/1/4/1066
> > > >>>
> > > >>> v2 changes:
> > > >>> - Defer only sysfs creation instead of creation and teardown per
> > > >>> Christian König
> > > >>>
> > > >>> - Use a work queue instead of a kthread for deferred work per
> > > >>> Christian König
> > > >>> ---
> > > >>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > >>> include/linux/dma-buf.h | 14 ++++++-
> > > >>> 2 files changed, 54 insertions(+), 16 deletions(-)
> > > >>>
> > > >>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > >>> index 2bba0babcb62..67b0a298291c 100644
> > > >>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > >>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > >>> @@ -11,6 +11,7 @@
> > > >>> #include <linux/printk.h>
> > > >>> #include <linux/slab.h>
> > > >>> #include <linux/sysfs.h>
> > > >>> +#include <linux/workqueue.h>
> > > >>>
> > > >>> #include "dma-buf-sysfs-stats.h"
> > > >>>
> > > >>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > >>> kset_unregister(dma_buf_stats_kset);
> > > >>> }
> > > >>>
> > > >>> +static void sysfs_add_workfn(struct work_struct *work)
> > > >>> +{
> > > >>> + struct dma_buf_sysfs_entry *sysfs_entry =
> > > >>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > >>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > >>> +
> > > >>> + /*
> > > >>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > >>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > >>> + * optimization and a race; when the reference count drops to 1 immediately after
> > > >>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > >>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > >>> + * is released, and that can't happen until the end of this function.
> > > >>> + */
> > > >>> + if (file_count(dmabuf->file) > 1) {
> > > >> Please completely drop that. I see absolutely no justification for this
> > > >> additional complexity.
> > > >>
> > > > This case gets hit around 5% of the time in my testing so the else is
> > > > not a completely unused branch.
> > >
> > > Well I can only repeat myself: This means that your userspace is
> > > severely broken!
> > >
> > > DMA-buf are meant to be long living objects
> > This patch addresses export *latency* regardless of how long-lived the
> > object is. Even a single, long-lived export will benefit from this
> > change if it would otherwise be blocked on adding an object to sysfs.
> > I think attempting to improve this latency still has merit.
>
> Fixing the latency is nice, but as it's just pushing the needed work off
> to another code path, it will take longer overall for the sysfs stuff to
> be ready for userspace to see.
>
> Perhaps we need to step back and understand what this code is supposed
> to be doing. As I recall, it was created because some systems do not
> allow debugfs anymore, and they wanted the debugging information that
> the dmabuf code was exposing to debugfs on a "normal" system. Moving
> that logic to sysfs made sense, but now I am wondering why we didn't see
> these issues in the debugfs code previously?
>
> Perhaps we should go just one step further and make a misc device node
> for dmabuf debugging information to be in and just have userspace
> poll/read on the device node and we spit the info that used to be in
> debugfs out through that? That way this only affects systems when they
> want to read the information and not normal code paths? Yeah that's a
> hack, but this whole thing feels overly complex now.
A bit late on this discussion, but just wanted to add my +1 that we should
either redesign the uapi, or fix the underlying latency issue in sysfs, or
whatever else is deemed the proper fix.

Making uapi interfaces async in ways that userspace can't discover is a
hack that we really shouldn't consider, at least for upstream. All kinds
of hilarious things might start to happen when an object exists, but not
consistently in all the places where it should be visible. There's a
reason sysfs has all these neat property groups so that absolutely
everything is added atomically. Doing stuff later on just because usually
no one notices that the illusion falls apart isn't great.

Unfortunately I don't have a clear idea here what would be the right
solution :-/ One idea perhaps: Should we dynamically enumerate the objects
when userspace does a readdir()? That's absolutely not how sysfs works,
but procfs works like that and there's discussions going around about
moving these optimizations to other kernfs implementations. At least there
was a recent lwn article on this:

https://lwn.net/Articles/895111/

But that would be serious amounts of work I guess.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
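For readers following along, the deferral pattern the quoted patch implements looks roughly like this. Only the workfn body appears in the quoted hunks; the structure layout and the export-side scheduling are reconstructed here as a sketch and may differ from the actual submission:

```c
/*
 * Sketch of the deferred sysfs creation under discussion. Parts not
 * shown in the quoted diff (struct layout, setup side, error handling)
 * are reconstructed guesses, not the submitted code.
 */
struct dma_buf_sysfs_entry {
	union {
		struct kobject kobj;		   /* valid once the work has run */
		struct work_struct sysfs_add_work; /* valid until then */
	};
	struct dma_buf *dmabuf;
};

static void sysfs_add_workfn(struct work_struct *work)
{
	struct dma_buf_sysfs_entry *sysfs_entry =
		container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
	struct dma_buf *dmabuf = sysfs_entry->dmabuf;

	if (file_count(dmabuf->file) > 1) {
		/*
		 * This is the call that can block on the kernfs rwsem;
		 * moving it here keeps it off the export hot path.
		 * (Error handling omitted in this sketch.)
		 */
		kobject_init_and_add(&sysfs_entry->kobj, &dma_buf_ktype, NULL,
				     "%lu", file_inode(dmabuf->file)->i_ino);
	}
	fput(dmabuf->file);	/* reference taken when the work was queued */
}

int dma_buf_stats_setup(struct dma_buf *dmabuf)
{
	/* ... allocate dmabuf->sysfs_entry as before ... */
	get_file(dmabuf->file);	/* keep the buffer alive until the work runs */
	INIT_WORK(&dmabuf->sysfs_entry->sysfs_add_work, sysfs_add_workfn);
	schedule_work(&dmabuf->sysfs_entry->sysfs_add_work);
	return 0;
}
```

The union is why dma_buf does not grow: the work_struct occupies the kobject's storage until the work runs, at which point the kobject is initialized over it.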
On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
>
> On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > <[email protected]> wrote:
> > > >
> > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > <[email protected]> wrote:
> > > > >> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > > > >>> Recently, we noticed an issue where a process went into direct reclaim
> > > > >>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > >>> mode. This caused processes who were doing DMA-BUF exports and releases
> > > > >>> to go into uninterruptible sleep since they needed to acquire the same
> > > > >>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > >>> blocking DMA-BUF export for an indeterminate amount of time while
> > > > >>> another process is holding the sysfs rw semaphore in exclusive mode,
> > > > >>> this patch moves the per-buffer sysfs file creation to the default work
> > > > >>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > >>> sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > >>> being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > >>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > >>> increase in size.
> > > > >> I'm still not very keen of this approach as it strongly feels like we
> > > > >> are working around shortcoming somewhere else.
> > > > >>
> > > > > My read of the thread for the last version is that we're running into
> > > > > a situation where sysfs is getting used for something it wasn't
> > > > > originally intended for, but we're also stuck with this sysfs
> > > > > functionality for dmabufs.
> > > > >
> > > > >>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > >>> Originally-by: Hridya Valsaraju <[email protected]>
> > > > >>> Signed-off-by: T.J. Mercier <[email protected]>
> > > > >>>
> > > > >>> ---
> > > > >>> See the originally submitted patch by Hridya Valsaraju here:
> > > > >>> https://lkml.org/lkml/2022/1/4/1066
> > > > >>>
> > > > >>> v2 changes:
> > > > >>> - Defer only sysfs creation instead of creation and teardown per
> > > > >>> Christian König
> > > > >>>
> > > > >>> - Use a work queue instead of a kthread for deferred work per
> > > > >>> Christian König
> > > > >>> ---
> > > > >>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > >>> include/linux/dma-buf.h | 14 ++++++-
> > > > >>> 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > >>>
> > > > >>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> index 2bba0babcb62..67b0a298291c 100644
> > > > >>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > >>> @@ -11,6 +11,7 @@
> > > > >>> #include <linux/printk.h>
> > > > >>> #include <linux/slab.h>
> > > > >>> #include <linux/sysfs.h>
> > > > >>> +#include <linux/workqueue.h>
> > > > >>>
> > > > >>> #include "dma-buf-sysfs-stats.h"
> > > > >>>
> > > > >>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > >>> kset_unregister(dma_buf_stats_kset);
> > > > >>> }
> > > > >>>
> > > > >>> +static void sysfs_add_workfn(struct work_struct *work)
> > > > >>> +{
> > > > >>> + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > >>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > >>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > >>> +
> > > > >>> + /*
> > > > >>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > >>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > >>> + * optimization and a race; when the reference count drops to 1 immediately after
> > > > >>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > >>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > >>> + * is released, and that can't happen until the end of this function.
> > > > >>> + */
> > > > >>> + if (file_count(dmabuf->file) > 1) {
> > > > >> Please completely drop that. I see absolutely no justification for this
> > > > >> additional complexity.
> > > > >>
> > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > not a completely unused branch.
> > > >
> > > > Well I can only repeat myself: This means that your userspace is
> > > > severely broken!
> > > >
> > > > DMA-buf are meant to be long living objects
> > > This patch addresses export *latency* regardless of how long-lived the
> > > object is. Even a single, long-lived export will benefit from this
> > > change if it would otherwise be blocked on adding an object to sysfs.
> > > I think attempting to improve this latency still has merit.
> >
> > Fixing the latency is nice, but as it's just pushing the needed work off
> > to another code path, it will take longer overall for the sysfs stuff to
> > be ready for userspace to see.
> >
> > Perhaps we need to step back and understand what this code is supposed
> > to be doing. As I recall, it was created because some systems do not
> > allow debugfs anymore, and they wanted the debugging information that
> > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > that logic to sysfs made sense, but now I am wondering why we didn't see
> > these issues in the debugfs code previously?
> >
> > Perhaps we should go just one step further and make a misc device node
> > for dmabuf debugging information to be in and just have userspace
> > poll/read on the device node and we spit the info that used to be in
> > debugfs out through that? That way this only affects systems when they
> > want to read the information and not normal code paths? Yeah that's a
> > hack, but this whole thing feels overly complex now.
>
> A bit late on this discussion, but just wanted to add my +1 that we should
> either redesign the uapi, or fix the underlying latency issue in sysfs, or
> whatever else is deemed the proper fix.
>
> Making uapi interfaces async in ways that userspace can't discover is a
> hack that we really shouldn't consider, at least for upstream. All kinds
> of hilarious things might start to happen when an object exists, but not
> consistently in all the places where it should be visible. There's a
> reason sysfs has all these neat property groups so that absolutely
> everything is added atomically. Doing stuff later on just because usually
> no one notices that the illusion falls apart isn't great.
>
> Unfortunately I don't have a clear idea here what would be the right
> solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> when userspace does a readdir()? That's absolutely not how sysfs works,
> but procfs works like that and there's discussions going around about
> moving these optimizations to other kernfs implementations. At least there
> was a recent lwn article on this:
>
> https://lwn.net/Articles/895111/
>
> But that would be serious amounts of work I guess.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Hi Daniel,

My team has been discussing this, and I think we're approaching a
consensus on a way forward that involves deprecating the existing
uapi.

I actually proposed a similar (but less elegant) idea to the readdir()
one. A new "dump_dmabuf_data" sysfs file that a user would write to,
which would cause a one-time creation of the per-buffer files. These
could be left around to become stale, or get cleaned up after first
read. However to me it seems impossible to correctly deal with
multiple simultaneous users with this technique. We're not currently
planning to pursue this.

Thanks for the link to the article. That on-demand creation sounds
like it would allow us to keep the existing structure and files for
DMA-buf, assuming there is not a similar lock contention issue when
adding a new node to the virtual tree. :)
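The abandoned "dump_dmabuf_data" idea described above might have looked something like the following. This is purely a hypothetical sketch of a proposal that was never implemented; the attribute name comes from the mail, everything else is illustrative (it assumes access to dma-buf's internal buffer list, here called db_list):

```c
/*
 * Hypothetical sketch of the write-triggered one-shot dump idea.
 * Writing to the kset-level "dump_dmabuf_data" attribute would walk
 * the global buffer list and create the per-buffer sysfs entries on
 * demand, leaving them to go stale afterwards. Never submitted.
 */
static ssize_t dump_dmabuf_data_store(struct kobject *kobj,
				      struct kobj_attribute *attr,
				      const char *buf, size_t count)
{
	struct dma_buf *dmabuf;

	mutex_lock(&db_list.lock);
	list_for_each_entry(dmabuf, &db_list.head, list_node)
		dma_buf_stats_setup(dmabuf);	/* create per-buffer entries */
	mutex_unlock(&db_list.lock);

	return count;
}

static struct kobj_attribute dump_attr = __ATTR_WO(dump_dmabuf_data);
```

As the mail notes, the sketch makes the problem with multiple readers visible: two concurrent writers racing the creation and cleanup of the same per-buffer files have no way to coordinate, which is why the idea was dropped.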
On Wed, May 25, 2022 at 2:05 PM T.J. Mercier <[email protected]> wrote:
>
> On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
> >
> > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > > <[email protected]> wrote:
> > > > >
> > > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > > <[email protected]> wrote:
> > > > > >> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > > > > >>> Recently, we noticed an issue where a process went into direct reclaim
> > > > > >>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > > >>> mode. This caused processes who were doing DMA-BUF exports and releases
> > > > > >>> to go into uninterruptible sleep since they needed to acquire the same
> > > > > >>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > > >>> blocking DMA-BUF export for an indeterminate amount of time while
> > > > > >>> another process is holding the sysfs rw semaphore in exclusive mode,
> > > > > >>> this patch moves the per-buffer sysfs file creation to the default work
> > > > > >>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > > >>> sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > > >>> being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > > >>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > > >>> increase in size.
> > > > > >> I'm still not very keen of this approach as it strongly feels like we
> > > > > >> are working around shortcoming somewhere else.
> > > > > >>
> > > > > > My read of the thread for the last version is that we're running into
> > > > > > a situation where sysfs is getting used for something it wasn't
> > > > > > originally intended for, but we're also stuck with this sysfs
> > > > > > functionality for dmabufs.
> > > > > >
> > > > > >>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > > >>> Originally-by: Hridya Valsaraju <[email protected]>
> > > > > >>> Signed-off-by: T.J. Mercier <[email protected]>
> > > > > >>>
> > > > > >>> ---
> > > > > >>> See the originally submitted patch by Hridya Valsaraju here:
> > > > > >>> https://lkml.org/lkml/2022/1/4/1066
> > > > > >>>
> > > > > >>> v2 changes:
> > > > > >>> - Defer only sysfs creation instead of creation and teardown per
> > > > > >>> Christian König
> > > > > >>>
> > > > > >>> - Use a work queue instead of a kthread for deferred work per
> > > > > >>> Christian König
> > > > > >>> ---
> > > > > >>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > > >>> include/linux/dma-buf.h | 14 ++++++-
> > > > > >>> 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > > >>>
> > > > > >>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > >>> index 2bba0babcb62..67b0a298291c 100644
> > > > > >>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > >>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > >>> @@ -11,6 +11,7 @@
> > > > > >>> #include <linux/printk.h>
> > > > > >>> #include <linux/slab.h>
> > > > > >>> #include <linux/sysfs.h>
> > > > > >>> +#include <linux/workqueue.h>
> > > > > >>>
> > > > > >>> #include "dma-buf-sysfs-stats.h"
> > > > > >>>
> > > > > >>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > > >>> kset_unregister(dma_buf_stats_kset);
> > > > > >>> }
> > > > > >>>
> > > > > >>> +static void sysfs_add_workfn(struct work_struct *work)
> > > > > >>> +{
> > > > > >>> + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > > >>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > > >>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > > >>> +
> > > > > >>> + /*
> > > > > >>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > > >>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > > >>> + * optimization and a race; when the reference count drops to 1 immediately after
> > > > > >>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > > >>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > > >>> + * is released, and that can't happen until the end of this function.
> > > > > >>> + */
> > > > > >>> + if (file_count(dmabuf->file) > 1) {
> > > > > >> Please completely drop that. I see absolutely no justification for this
> > > > > >> additional complexity.
> > > > > >>
> > > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > > not a completely unused branch.
> > > > >
> > > > > Well I can only repeat myself: This means that your userspace is
> > > > > severely broken!
> > > > >
> > > > > DMA-buf are meant to be long living objects
> > > > This patch addresses export *latency* regardless of how long-lived the
> > > > object is. Even a single, long-lived export will benefit from this
> > > > change if it would otherwise be blocked on adding an object to sysfs.
> > > > I think attempting to improve this latency still has merit.
> > >
> > > Fixing the latency is nice, but as it's just pushing the needed work off
> > > to another code path, it will take longer overall for the sysfs stuff to
> > > be ready for userspace to see.
> > >
> > > Perhaps we need to step back and understand what this code is supposed
> > > to be doing. As I recall, it was created because some systems do not
> > > allow debugfs anymore, and they wanted the debugging information that
> > > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > > that logic to sysfs made sense, but now I am wondering why we didn't see
> > > these issues in the debugfs code previously?
> > >
> > > Perhaps we should go just one step further and make a misc device node
> > > for dmabuf debugging information to be in and just have userspace
> > > poll/read on the device node and we spit the info that used to be in
> > > debugfs out through that? That way this only affects systems when they
> > > want to read the information and not normal code paths? Yeah that's a
> > > hack, but this whole thing feels overly complex now.
> >
> > A bit late on this discussion, but just wanted to add my +1 that we should
> > either redesign the uapi, or fix the underlying latency issue in sysfs, or
> > whatever else is deemed the proper fix.
> >
> > Making uapi interfaces async in ways that userspace can't discover is a
> > hack that we really shouldn't consider, at least for upstream. All kinds
> > of hilarious things might start to happen when an object exists, but not
> > consistently in all the places where it should be visible. There's a
> > reason sysfs has all these neat property groups so that absolutely
> > everything is added atomically. Doing stuff later on just because usually
> > no one notices that the illusion falls apart isn't great.
> >
> > Unfortunately I don't have a clear idea here what would be the right
> > solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> > when userspace does a readdir()? That's absolutely not how sysfs works,
> > but procfs works like that and there's discussions going around about
> > moving these optimizations to other kernfs implementations. At least there
> > was a recent lwn article on this:
> >
> > https://lwn.net/Articles/895111/
> >
> > But that would be serious amounts of work I guess.
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
> Hi Daniel,
>
> My team has been discussing this, and I think we're approaching a
> consensus on a way forward that involves deprecating the existing
> uapi.
>
> I actually proposed a similar (but less elegant) idea to the readdir()
> one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> which would cause a one-time creation of the per-buffer files. These
> could be left around to become stale, or get cleaned up after first
> read. However to me it seems impossible to correctly deal with
> multiple simultaneous users with this technique. We're not currently
> planning to pursue this.
>
> Thanks for the link to the article. That on-demand creation sounds
> like it would allow us to keep the existing structure and files for
> DMA-buf, assuming there is not a similar lock contention issue when
> adding a new node to the virtual tree. :)
I'll follow up with Steven on this topic. Thanks again.
Am 25.05.22 um 23:05 schrieb T.J. Mercier:
> On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
>> On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
>>> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
>>>> On Mon, May 16, 2022 at 12:21 PM Christian König
>>>> <[email protected]> wrote:
>>>>> Am 16.05.22 um 20:08 schrieb T.J. Mercier:
>>>>>> On Mon, May 16, 2022 at 10:20 AM Christian König
>>>>>> <[email protected]> wrote:
>>>>>>> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
>>>>>>>> Recently, we noticed an issue where a process went into direct reclaim
>>>>>>>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
>>>>>>>> mode. This caused processes who were doing DMA-BUF exports and releases
>>>>>>>> to go into uninterruptible sleep since they needed to acquire the same
>>>>>>>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
>>>>>>>> blocking DMA-BUF export for an indeterminate amount of time while
>>>>>>>> another process is holding the sysfs rw semaphore in exclusive mode,
>>>>>>>> this patch moves the per-buffer sysfs file creation to the default work
>>>>>>>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
>>>>>>>> sysfs statistics, but this is a tradeoff to prevent the hot path from
>>>>>>>> being blocked. A work_struct is added to dma_buf to achieve this, but as
>>>>>>>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
>>>>>>>> increase in size.
>>>>>>> I'm still not very keen of this approach as it strongly feels like we
>>>>>>> are working around shortcoming somewhere else.
>>>>>>>
>>>>>> My read of the thread for the last version is that we're running into
>>>>>> a situation where sysfs is getting used for something it wasn't
>>>>>> originally intended for, but we're also stuck with this sysfs
>>>>>> functionality for dmabufs.
>>>>>>
>>>>>>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
>>>>>>>> Originally-by: Hridya Valsaraju <[email protected]>
>>>>>>>> Signed-off-by: T.J. Mercier <[email protected]>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> See the originally submitted patch by Hridya Valsaraju here:
>>>>>>>> https://lkml.org/lkml/2022/1/4/1066
>>>>>>>>
>>>>>>>> v2 changes:
>>>>>>>> - Defer only sysfs creation instead of creation and teardown per
>>>>>>>> Christian König
>>>>>>>>
>>>>>>>> - Use a work queue instead of a kthread for deferred work per
>>>>>>>> Christian König
>>>>>>>> ---
>>>>>>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
>>>>>>>> include/linux/dma-buf.h | 14 ++++++-
>>>>>>>> 2 files changed, 54 insertions(+), 16 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>> index 2bba0babcb62..67b0a298291c 100644
>>>>>>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>> #include <linux/printk.h>
>>>>>>>> #include <linux/slab.h>
>>>>>>>> #include <linux/sysfs.h>
>>>>>>>> +#include <linux/workqueue.h>
>>>>>>>>
>>>>>>>> #include "dma-buf-sysfs-stats.h"
>>>>>>>>
>>>>>>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
>>>>>>>> kset_unregister(dma_buf_stats_kset);
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static void sysfs_add_workfn(struct work_struct *work)
>>>>>>>> +{
>>>>>>>> + struct dma_buf_sysfs_entry *sysfs_entry =
>>>>>>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
>>>>>>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
>>>>>>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
>>>>>>>> + * optimization and a race; when the reference count drops to 1 immediately after
>>>>>>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
>>>>>>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
>>>>>>>> + * is released, and that can't happen until the end of this function.
>>>>>>>> + */
>>>>>>>> + if (file_count(dmabuf->file) > 1) {
>>>>>>> Please completely drop that. I see absolutely no justification for this
>>>>>>> additional complexity.
>>>>>>>
>>>>>> This case gets hit around 5% of the time in my testing so the else is
>>>>>> not a completely unused branch.
>>>>> Well I can only repeat myself: This means that your userspace is
>>>>> severely broken!
>>>>>
>>>>> DMA-buf are meant to be long living objects
>>>> This patch addresses export *latency* regardless of how long-lived the
>>>> object is. Even a single, long-lived export will benefit from this
>>>> change if it would otherwise be blocked on adding an object to sysfs.
>>>> I think attempting to improve this latency still has merit.
>>> Fixing the latency is nice, but as it's just pushing the needed work off
>>> to another code path, it will take longer overall for the sysfs stuff to
>>> be ready for userspace to see.
>>>
>>> Perhaps we need to step back and understand what this code is supposed
>>> to be doing. As I recall, it was created because some systems do not
>>> allow debugfs anymore, and they wanted the debugging information that
>>> the dmabuf code was exposing to debugfs on a "normal" system. Moving
>>> that logic to sysfs made sense, but now I am wondering why we didn't see
>>> these issues in the debugfs code previously?
>>>
>>> Perhaps we should go just one step further and make a misc device node
>>> for dmabuf debugging information to be in and just have userspace
>>> poll/read on the device node and we spit the info that used to be in
>>> debugfs out through that? That way this only affects systems when they
>>> want to read the information and not normal code paths? Yeah that's a
>>> hack, but this whole thing feels overly complex now.
>> A bit late on this discussion, but just wanted to add my +1 that we should
>> either redesign the uapi, or fix the underlying latency issue in sysfs, or
>> whatever else is deemed the proper fix.
>>
>> Making uapi interfaces async in ways that userspace can't discover is a
>> hack that we really shouldn't consider, at least for upstream. All kinds
>> of hilarious things might start to happen when an object exists, but not
>> consistently in all the places where it should be visible. There's a
>> reason sysfs has all these neat property groups so that absolutely
>> everything is added atomically. Doing stuff later on just because usually
>> no one notices that the illusion falls apart isn't great.
>>
>> Unfortunately I don't have a clear idea here what would be the right
>> solution :-/ One idea perhaps: Should we dynamically enumerate the objects
>> when userspace does a readdir()? That's absolutely not how sysfs works,
>> but procfs works like that and there's discussions going around about
>> moving these optimizations to other kernfs implementations. At least there
>> was a recent lwn article on this:
>>
>> https://lwn.net/Articles/895111/
>>
>> But that would be serious amounts of work I guess.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
> Hi Daniel,
>
> My team has been discussing this, and I think we're approaching a
> consensus on a way forward that involves deprecating the existing
> uapi.
>
> I actually proposed a similar (but less elegant) idea to the readdir()
> one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> which would cause a one-time creation of the per-buffer files. These
> could be left around to become stale, or get cleaned up after first
> read. However to me it seems impossible to correctly deal with
> multiple simultaneous users with this technique. We're not currently
> planning to pursue this.
>
> Thanks for the link to the article. That on-demand creation sounds
> like it would allow us to keep the existing structure and files for
> DMA-buf, assuming there is not a similar lock contention issue when
> adding a new node to the virtual tree. :)
I think that this on demand creation is even worse than the existing
ideas, but if you can get Greg to accept the required sysfs changes then
that's at least outside of my maintenance domain any more :)
Regards,
Christian.

On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote:
> Am 25.05.22 um 23:05 schrieb T.J. Mercier:
> > On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
> > > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > > > <[email protected]> wrote:
> > > > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > > > <[email protected]> wrote:
> > > > > > > > Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > > > > > > > > Recently, we noticed an issue where a process went into direct reclaim
> > > > > > > > > while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > > > > > > mode. This caused processes who were doing DMA-BUF exports and releases
> > > > > > > > > to go into uninterruptible sleep since they needed to acquire the same
> > > > > > > > > semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > > > > > > blocking DMA-BUF export for an indeterminate amount of time while
> > > > > > > > > another process is holding the sysfs rw semaphore in exclusive mode,
> > > > > > > > > this patch moves the per-buffer sysfs file creation to the default work
> > > > > > > > > queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > > > > > > sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > > > > > > being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > > > > > > it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > > > > > > increase in size.
> > > > > > > > I'm still not very keen of this approach as it strongly feels like we
> > > > > > > > are working around shortcoming somewhere else.
> > > > > > > >
> > > > > > > My read of the thread for the last version is that we're running into
> > > > > > > a situation where sysfs is getting used for something it wasn't
> > > > > > > originally intended for, but we're also stuck with this sysfs
> > > > > > > functionality for dmabufs.
> > > > > > >
> > > > > > > > > Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > > > > > > Originally-by: Hridya Valsaraju <[email protected]>
> > > > > > > > > Signed-off-by: T.J. Mercier <[email protected]>
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > See the originally submitted patch by Hridya Valsaraju here:
> > > > > > > > > https://lkml.org/lkml/2022/1/4/1066
> > > > > > > > >
> > > > > > > > > v2 changes:
> > > > > > > > > - Defer only sysfs creation instead of creation and teardown per
> > > > > > > > > Christian König
> > > > > > > > >
> > > > > > > > > - Use a work queue instead of a kthread for deferred work per
> > > > > > > > > Christian König
> > > > > > > > > ---
> > > > > > > > > drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > > > > > > include/linux/dma-buf.h | 14 ++++++-
> > > > > > > > > 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > index 2bba0babcb62..67b0a298291c 100644
> > > > > > > > > --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > @@ -11,6 +11,7 @@
> > > > > > > > > #include <linux/printk.h>
> > > > > > > > > #include <linux/slab.h>
> > > > > > > > > #include <linux/sysfs.h>
> > > > > > > > > +#include <linux/workqueue.h>
> > > > > > > > >
> > > > > > > > > #include "dma-buf-sysfs-stats.h"
> > > > > > > > >
> > > > > > > > > @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > > > > > > kset_unregister(dma_buf_stats_kset);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > +static void sysfs_add_workfn(struct work_struct *work)
> > > > > > > > > +{
> > > > > > > > > + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > > > > > > + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > > > > > > + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > > > > > > + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > > > > > > + * optimization and a race; when the reference count drops to 1 immediately after
> > > > > > > > > + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > > > > > > + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > > > > > > + * is released, and that can't happen until the end of this function.
> > > > > > > > > + */
> > > > > > > > > + if (file_count(dmabuf->file) > 1) {
> > > > > > > > Please completely drop that. I see absolutely no justification for this
> > > > > > > > additional complexity.
> > > > > > > >
> > > > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > > > not a completely unused branch.
> > > > > > Well I can only repeat myself: This means that your userspace is
> > > > > > severely broken!
> > > > > >
> > > > > > DMA-buf are meant to be long living objects
> > > > > This patch addresses export *latency* regardless of how long-lived the
> > > > > object is. Even a single, long-lived export will benefit from this
> > > > > change if it would otherwise be blocked on adding an object to sysfs.
> > > > > I think attempting to improve this latency still has merit.
> > > > Fixing the latency is nice, but as it's just pushing the needed work off
> > > > to another code path, it will take longer overall for the sysfs stuff to
> > > > be ready for userspace to see.
> > > >
> > > > Perhaps we need to step back and understand what this code is supposed
> > > > to be doing. As I recall, it was created because some systems do not
> > > > allow debugfs anymore, and they wanted the debugging information that
> > > > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > > > that logic to sysfs made sense, but now I am wondering why we didn't see
> > > > these issues in the debugfs code previously?
> > > >
> > > > Perhaps we should go just one step further and make a misc device node
> > > > for dmabuf debugging information to be in and just have userspace
> > > > poll/read on the device node and we spit the info that used to be in
> > > > debugfs out through that? That way this only affects systems when they
> > > > want to read the information and not normal code paths? Yeah that's a
> > > > hack, but this whole thing feels overly complex now.
> > > A bit late on this discussion, but just wanted to add my +1 that we should
> > > either redesign the uapi, or fix the underlying latency issue in sysfs, or
> > > whatever else is deemed the proper fix.
> > >
> > > Making uapi interfaces async in ways that userspace can't discover is a
> > > hack that we really shouldn't consider, at least for upstream. All kinds
> > > of hilarious things might start to happen when an object exists, but not
> > > consistently in all the places where it should be visible. There's a
> > > reason sysfs has all these neat property groups so that absolutely
> > > everything is added atomically. Doing stuff later on just because usually
> > > no one notices that the illusion falls apart isn't great.
> > >
> > > Unfortunately I don't have a clear idea here what would be the right
> > > solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> > > when userspace does a readdir()? That's absolutely not how sysfs works,
> > > but procfs works like that and there's discussions going around about
> > > moving these optimizations to other kernfs implementations. At least there
> > > was a recent lwn article on this:
> > >
> > > https://lwn.net/Articles/895111/
> > >
> > > But that would be serious amounts of work I guess.
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> > Hi Daniel,
> >
> > My team has been discussing this, and I think we're approaching a
> > consensus on a way forward that involves deprecating the existing
> > uapi.
> >
> > I actually proposed a similar (but less elegant) idea to the readdir()
> > one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> > which would cause a one-time creation of the per-buffer files. These
> > could be left around to become stale, or get cleaned up after first
> > read. However to me it seems impossible to correctly deal with
> > multiple simultaneous users with this technique. We're not currently
> > planning to pursue this.
> >
> > Thanks for the link to the article. That on-demand creation sounds
> > like it would allow us to keep the existing structure and files for
> > DMA-buf, assuming there is not a similar lock contention issue when
> > adding a new node to the virtual tree. :)
>
> I think that this on demand creation is even worse than the existing ideas,
> but if you can get Greg to accept the required sysfs changes then that's at
> least outside of my maintenance domain any more :)
I think doing it cleanly in sysfs without changing the current uapi sounds
pretty good. The hand-rolled "touch a magic file to force update all the
files into existence" sounds like a horror show to me :-) Plus I don't see
how you can actually avoid the locking pain with that since once the files
are created, you have to remove them synchronously again, plus you get to
deal with races on top (and likely some locking inversion fun on top).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

On Sun, May 29, 2022 at 11:12 PM Christian König
<[email protected]> wrote:
>
> Am 25.05.22 um 23:05 schrieb T.J. Mercier:
> > On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
> >> On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> >>> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> >>>> On Mon, May 16, 2022 at 12:21 PM Christian König
> >>>> <[email protected]> wrote:
> >>>>> Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> >>>>>> On Mon, May 16, 2022 at 10:20 AM Christian König
> >>>>>> <[email protected]> wrote:
> >>>>>>> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> >>>>>>>> Recently, we noticed an issue where a process went into direct reclaim
> >>>>>>>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
> >>>>>>>> mode. This caused processes who were doing DMA-BUF exports and releases
> >>>>>>>> to go into uninterruptible sleep since they needed to acquire the same
> >>>>>>>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> >>>>>>>> blocking DMA-BUF export for an indeterminate amount of time while
> >>>>>>>> another process is holding the sysfs rw semaphore in exclusive mode,
> >>>>>>>> this patch moves the per-buffer sysfs file creation to the default work
> >>>>>>>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> >>>>>>>> sysfs statistics, but this is a tradeoff to prevent the hot path from
> >>>>>>>> being blocked. A work_struct is added to dma_buf to achieve this, but as
> >>>>>>>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
> >>>>>>>> increase in size.
> >>>>>>> I'm still not very keen of this approach as it strongly feels like we
> >>>>>>> are working around shortcoming somewhere else.
> >>>>>>>
> >>>>>> My read of the thread for the last version is that we're running into
> >>>>>> a situation where sysfs is getting used for something it wasn't
> >>>>>> originally intended for, but we're also stuck with this sysfs
> >>>>>> functionality for dmabufs.
> >>>>>>
> >>>>>>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> >>>>>>>> Originally-by: Hridya Valsaraju <[email protected]>
> >>>>>>>> Signed-off-by: T.J. Mercier <[email protected]>
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>> See the originally submitted patch by Hridya Valsaraju here:
> >>>>>>>> https://lkml.org/lkml/2022/1/4/1066
> >>>>>>>>
> >>>>>>>> v2 changes:
> >>>>>>>> - Defer only sysfs creation instead of creation and teardown per
> >>>>>>>> Christian König
> >>>>>>>>
> >>>>>>>> - Use a work queue instead of a kthread for deferred work per
> >>>>>>>> Christian König
> >>>>>>>> ---
> >>>>>>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> >>>>>>>> include/linux/dma-buf.h | 14 ++++++-
> >>>>>>>> 2 files changed, 54 insertions(+), 16 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>>>> index 2bba0babcb62..67b0a298291c 100644
> >>>>>>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> >>>>>>>> @@ -11,6 +11,7 @@
> >>>>>>>> #include <linux/printk.h>
> >>>>>>>> #include <linux/slab.h>
> >>>>>>>> #include <linux/sysfs.h>
> >>>>>>>> +#include <linux/workqueue.h>
> >>>>>>>>
> >>>>>>>> #include "dma-buf-sysfs-stats.h"
> >>>>>>>>
> >>>>>>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> >>>>>>>> kset_unregister(dma_buf_stats_kset);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> +static void sysfs_add_workfn(struct work_struct *work)
> >>>>>>>> +{
> >>>>>>>> + struct dma_buf_sysfs_entry *sysfs_entry =
> >>>>>>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> >>>>>>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> >>>>>>>> +
> >>>>>>>> + /*
> >>>>>>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
> >>>>>>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> >>>>>>>> + * optimization and a race; when the reference count drops to 1 immediately after
> >>>>>>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
> >>>>>>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> >>>>>>>> + * is released, and that can't happen until the end of this function.
> >>>>>>>> + */
> >>>>>>>> + if (file_count(dmabuf->file) > 1) {
> >>>>>>> Please completely drop that. I see absolutely no justification for this
> >>>>>>> additional complexity.
> >>>>>>>
> >>>>>> This case gets hit around 5% of the time in my testing so the else is
> >>>>>> not a completely unused branch.
> >>>>> Well I can only repeat myself: This means that your userspace is
> >>>>> severely broken!
> >>>>>
> >>>>> DMA-buf are meant to be long living objects
> >>>> This patch addresses export *latency* regardless of how long-lived the
> >>>> object is. Even a single, long-lived export will benefit from this
> >>>> change if it would otherwise be blocked on adding an object to sysfs.
> >>>> I think attempting to improve this latency still has merit.
> >>> Fixing the latency is nice, but as it's just pushing the needed work off
> >>> to another code path, it will take longer overall for the sysfs stuff to
> >>> be ready for userspace to see.
> >>>
> >>> Perhaps we need to step back and understand what this code is supposed
> >>> to be doing. As I recall, it was created because some systems do not
> >>> allow debugfs anymore, and they wanted the debugging information that
> >>> the dmabuf code was exposing to debugfs on a "normal" system. Moving
> >>> that logic to sysfs made sense, but now I am wondering why we didn't see
> >>> these issues in the debugfs code previously?
> >>>
> >>> Perhaps we should go just one step further and make a misc device node
> >>> for dmabuf debugging information to be in and just have userspace
> >>> poll/read on the device node and we spit the info that used to be in
> >>> debugfs out through that? That way this only affects systems when they
> >>> want to read the information and not normal code paths? Yeah that's a
> >>> hack, but this whole thing feels overly complex now.
> >> A bit late on this discussion, but just wanted to add my +1 that we should
> >> either redesign the uapi, or fix the underlying latency issue in sysfs, or
> >> whatever else is deemed the proper fix.
> >>
> >> Making uapi interfaces async in ways that userspace can't discover is a
> >> hack that we really shouldn't consider, at least for upstream. All kinds
> >> of hilarious things might start to happen when an object exists, but not
> >> consistently in all the places where it should be visible. There's a
> >> reason sysfs has all these neat property groups so that absolutely
> >> everything is added atomically. Doing stuff later on just because usually
> >> no one notices that the illusion falls apart isn't great.
> >>
> >> Unfortunately I don't have a clear idea here what would be the right
> >> solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> >> when userspace does a readdir()? That's absolutely not how sysfs works,
> >> but procfs works like that and there's discussions going around about
> >> moving these optimizations to other kernfs implementations. At least there
> >> was a recent lwn article on this:
> >>
> >> https://lwn.net/Articles/895111/
> >>
> >> But that would be serious amounts of work I guess.
> >> -Daniel
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch
> > Hi Daniel,
> >
> > My team has been discussing this, and I think we're approaching a
> > consensus on a way forward that involves deprecating the existing
> > uapi.
> >
> > I actually proposed a similar (but less elegant) idea to the readdir()
> > one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> > which would cause a one-time creation of the per-buffer files. These
> > could be left around to become stale, or get cleaned up after first
> > read. However to me it seems impossible to correctly deal with
> > multiple simultaneous users with this technique. We're not currently
> > planning to pursue this.
> >
> > Thanks for the link to the article. That on-demand creation sounds
> > like it would allow us to keep the existing structure and files for
> > DMA-buf, assuming there is not a similar lock contention issue when
> > adding a new node to the virtual tree. :)
>
> I think that this on demand creation is even worse than the existing
> ideas, but if you can get Greg to accept the required sysfs changes then
> that's at least outside of my maintenance domain any more :)
Hah, ok. After chatting with Steven it sounds like an attempt at on
demand creation for sysfs is not likely to happen soon, as the focus
is on getting it working for tracefs first and letting it stew there
for a while to polish it. I'll check with Greg when he's available
again next week about whether that is a direction we should even
consider before moving forward from here.
>
>
> Regards,
> Christian.

On Wed, Jun 1, 2022 at 5:40 AM Daniel Vetter <[email protected]> wrote:
>
> On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote:
> > Am 25.05.22 um 23:05 schrieb T.J. Mercier:
> > > On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
> > > > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> > > > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > > > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > > > > <[email protected]> wrote:
> > > > > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier:
> > > > > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > > > > <[email protected]> wrote:
> > > > > > > > > Am 16.05.22 um 19:13 schrieb T.J. Mercier:
> > > > > > > > > > Recently, we noticed an issue where a process went into direct reclaim
> > > > > > > > > > while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > > > > > > > mode. This caused processes who were doing DMA-BUF exports and releases
> > > > > > > > > > to go into uninterruptible sleep since they needed to acquire the same
> > > > > > > > > > semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > > > > > > > blocking DMA-BUF export for an indeterminate amount of time while
> > > > > > > > > > another process is holding the sysfs rw semaphore in exclusive mode,
> > > > > > > > > > this patch moves the per-buffer sysfs file creation to the default work
> > > > > > > > > > queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > > > > > > > sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > > > > > > > being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > > > > > > > it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > > > > > > > increase in size.
> > > > > > > > > I'm still not very keen of this approach as it strongly feels like we
> > > > > > > > > are working around shortcoming somewhere else.
> > > > > > > > >
> > > > > > > > My read of the thread for the last version is that we're running into
> > > > > > > > a situation where sysfs is getting used for something it wasn't
> > > > > > > > originally intended for, but we're also stuck with this sysfs
> > > > > > > > functionality for dmabufs.
> > > > > > > >
> > > > > > > > > > Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > > > > > > > Originally-by: Hridya Valsaraju <[email protected]>
> > > > > > > > > > Signed-off-by: T.J. Mercier <[email protected]>
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > See the originally submitted patch by Hridya Valsaraju here:
> > > > > > > > > > https://lkml.org/lkml/2022/1/4/1066
> > > > > > > > > >
> > > > > > > > > > v2 changes:
> > > > > > > > > > - Defer only sysfs creation instead of creation and teardown per
> > > > > > > > > > Christian König
> > > > > > > > > >
> > > > > > > > > > - Use a work queue instead of a kthread for deferred work per
> > > > > > > > > > Christian König
> > > > > > > > > > ---
> > > > > > > > > > drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > > > > > > > include/linux/dma-buf.h | 14 ++++++-
> > > > > > > > > > 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > index 2bba0babcb62..67b0a298291c 100644
> > > > > > > > > > --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > @@ -11,6 +11,7 @@
> > > > > > > > > > #include <linux/printk.h>
> > > > > > > > > > #include <linux/slab.h>
> > > > > > > > > > #include <linux/sysfs.h>
> > > > > > > > > > +#include <linux/workqueue.h>
> > > > > > > > > >
> > > > > > > > > > #include "dma-buf-sysfs-stats.h"
> > > > > > > > > >
> > > > > > > > > > @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > > > > > > > kset_unregister(dma_buf_stats_kset);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > +static void sysfs_add_workfn(struct work_struct *work)
> > > > > > > > > > +{
> > > > > > > > > > + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > > > > > > > + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > > > > > > > + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > > > > > > > +
> > > > > > > > > > + /*
> > > > > > > > > > + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > > > > > > > + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > > > > > > > + * optimization and a race; when the reference count drops to 1 immediately after
> > > > > > > > > > + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > > > > > > > + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > > > > > > > + * is released, and that can't happen until the end of this function.
> > > > > > > > > > + */
> > > > > > > > > > + if (file_count(dmabuf->file) > 1) {
> > > > > > > > > Please completely drop that. I see absolutely no justification for this
> > > > > > > > > additional complexity.
> > > > > > > > >
> > > > > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > > > > not a completely unused branch.
> > > > > > > Well I can only repeat myself: This means that your userspace is
> > > > > > > severely broken!
> > > > > > >
> > > > > > > DMA-buf are meant to be long living objects
> > > > > > This patch addresses export *latency* regardless of how long-lived the
> > > > > > object is. Even a single, long-lived export will benefit from this
> > > > > > change if it would otherwise be blocked on adding an object to sysfs.
> > > > > > I think attempting to improve this latency still has merit.
> > > > > Fixing the latency is nice, but as it's just pushing the needed work off
> > > > > to another code path, it will take longer overall for the sysfs stuff to
> > > > > be ready for userspace to see.
> > > > >
> > > > > Perhaps we need to step back and understand what this code is supposed
> > > > > to be doing. As I recall, it was created because some systems do not
> > > > > allow debugfs anymore, and they wanted the debugging information that
> > > > > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > > > > that logic to sysfs made sense, but now I am wondering why we didn't see
> > > > > these issues in the debugfs code previously?
> > > > >
> > > > > Perhaps we should go just one step further and make a misc device node
> > > > > for dmabuf debugging information to be in and just have userspace
> > > > > poll/read on the device node and we spit the info that used to be in
> > > > > debugfs out through that? That way this only affects systems when they
> > > > > want to read the information and not normal code paths? Yeah that's a
> > > > > hack, but this whole thing feels overly complex now.
> > > > A bit late on this discussion, but just wanted to add my +1 that we should
> > > > either redesign the uapi, or fix the underlying latency issue in sysfs, or
> > > > whatever else is deemed the proper fix.
> > > >
> > > > Making uapi interfaces async in ways that userspace can't discover is a
> > > > hack that we really shouldn't consider, at least for upstream. All kinds
> > > > of hilarious things might start to happen when an object exists, but not
> > > > consistently in all the places where it should be visible. There's a
> > > > reason sysfs has all these neat property groups so that absolutely
> > > > everything is added atomically. Doing stuff later on just because usually
> > > > no one notices that the illusion falls apart isn't great.
> > > >
> > > > Unfortunately I don't have a clear idea here what would be the right
> > > > solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> > > > when userspace does a readdir()? That's absolutely not how sysfs works,
> > > > but procfs works like that and there's discussions going around about
> > > > moving these optimizations to other kernfs implementations. At least there
> > > > was a recent lwn article on this:
> > > >
> > > > https://lwn.net/Articles/895111/
> > > >
> > > > But that would be serious amounts of work I guess.
> > > > -Daniel
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > > Hi Daniel,
> > >
> > > My team has been discussing this, and I think we're approaching a
> > > consensus on a way forward that involves deprecating the existing
> > > uapi.
> > >
> > > I actually proposed a similar (but less elegant) idea to the readdir()
> > > one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> > > which would cause a one-time creation of the per-buffer files. These
> > > could be left around to become stale, or get cleaned up after first
> > > read. However to me it seems impossible to correctly deal with
> > > multiple simultaneous users with this technique. We're not currently
> > > planning to pursue this.
> > >
> > > Thanks for the link to the article. That on-demand creation sounds
> > > like it would allow us to keep the existing structure and files for
> > > DMA-buf, assuming there is not a similar lock contention issue when
> > > adding a new node to the virtual tree. :)
> >
> > I think that this on-demand creation is even worse than the existing ideas,
> > but if you can get Greg to accept the required sysfs changes then that's at
> > least outside of my maintenance domain any more :)
>
> I think doing it cleanly in sysfs without changing the current uapi sounds
> pretty good. The hand-rolled "touch a magic file to force update all the
> files into existence" sounds like a horror show to me :-) Plus I don't see
> how you can actually avoid the locking pain with that since once the files
> are created, you have to remove them synchronously again, plus you get to
> deal with races on top (and likely some locking inversion fun on top).
> -Daniel
Yes, lots of reasons not to pursue that angle. :)
So I asked Greg about modifying sysfs for this purpose, and he's quite
convincing that it's not the right way to approach this problem. So
that leaves deprecating the per-buffer statistics. It looks like we
can maintain the userspace functionality that depended on this by
replacing it with a single sysfs node for "dmabuf_total_size" along
with adding exporter information to procfs (via Kalesh's path patch
[1]). However there is a separate effort to account dmabufs from heaps
with cgroups [2], so if I'm able to make that work then we would not
need the new "dmabuf_total_size" file since this same information
could be obtained from the root cgroup instead. So I'd like to try
that first before falling back to adding a new dmabuf_total_size file.
[1] https://lore.kernel.org/lkml/[email protected]/T/#m43a3d345f821a02babd4ebb1f4257982d027c9e4
[2] https://lore.kernel.org/lkml/CABdmKX1xvm87WMEDkMc9Aye46E4zv1-scenwgaRxHesrOCsaYg@mail.gmail.com/T/#mb82eca5438a4ea7ab157ab9cd7f044cbcfeb5509
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
On Wed, 15 Jun 2022 at 19:43, T.J. Mercier <[email protected]> wrote:
>
> On Wed, Jun 1, 2022 at 5:40 AM Daniel Vetter <[email protected]> wrote:
> >
> > On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote:
> > > On 25.05.22 at 23:05, T.J. Mercier wrote:
> > > > On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
> > > > > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
> > > > > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
> > > > > > > On Mon, May 16, 2022 at 12:21 PM Christian König
> > > > > > > <[email protected]> wrote:
> > > > > > > > On 16.05.22 at 20:08, T.J. Mercier wrote:
> > > > > > > > > On Mon, May 16, 2022 at 10:20 AM Christian König
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > On 16.05.22 at 19:13, T.J. Mercier wrote:
> > > > > > > > > > > Recently, we noticed an issue where a process went into direct reclaim
> > > > > > > > > > > while holding the kernfs rw semaphore for sysfs in write (exclusive)
> > > > > > > > > > > mode. This caused processes that were doing DMA-BUF exports and releases
> > > > > > > > > > > to go into uninterruptible sleep since they needed to acquire the same
> > > > > > > > > > > semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
> > > > > > > > > > > blocking DMA-BUF export for an indeterminate amount of time while
> > > > > > > > > > > another process is holding the sysfs rw semaphore in exclusive mode,
> > > > > > > > > > > this patch moves the per-buffer sysfs file creation to the default work
> > > > > > > > > > > queue. Note that this can lead to a short-term inaccuracy in the dmabuf
> > > > > > > > > > > sysfs statistics, but this is a tradeoff to prevent the hot path from
> > > > > > > > > > > being blocked. A work_struct is added to dma_buf to achieve this, but as
> > > > > > > > > > > it is unioned with the kobject in the sysfs_entry, dma_buf does not
> > > > > > > > > > > increase in size.
> > > > > > > > > > I'm still not very keen on this approach as it strongly feels like we
> > > > > > > > > > are working around a shortcoming somewhere else.
> > > > > > > > > >
> > > > > > > > > My read of the thread for the last version is that we're running into
> > > > > > > > > a situation where sysfs is getting used for something it wasn't
> > > > > > > > > originally intended for, but we're also stuck with this sysfs
> > > > > > > > > functionality for dmabufs.
> > > > > > > > >
> > > > > > > > > > > Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
> > > > > > > > > > > Originally-by: Hridya Valsaraju <[email protected]>
> > > > > > > > > > > Signed-off-by: T.J. Mercier <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > ---
> > > > > > > > > > > See the originally submitted patch by Hridya Valsaraju here:
> > > > > > > > > > > https://lkml.org/lkml/2022/1/4/1066
> > > > > > > > > > >
> > > > > > > > > > > v2 changes:
> > > > > > > > > > > - Defer only sysfs creation instead of creation and teardown per
> > > > > > > > > > > Christian König
> > > > > > > > > > >
> > > > > > > > > > > - Use a work queue instead of a kthread for deferred work per
> > > > > > > > > > > Christian König
> > > > > > > > > > > ---
> > > > > > > > > > > drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
> > > > > > > > > > > include/linux/dma-buf.h | 14 ++++++-
> > > > > > > > > > > 2 files changed, 54 insertions(+), 16 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > > index 2bba0babcb62..67b0a298291c 100644
> > > > > > > > > > > --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > > +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
> > > > > > > > > > > @@ -11,6 +11,7 @@
> > > > > > > > > > > #include <linux/printk.h>
> > > > > > > > > > > #include <linux/slab.h>
> > > > > > > > > > > #include <linux/sysfs.h>
> > > > > > > > > > > +#include <linux/workqueue.h>
> > > > > > > > > > >
> > > > > > > > > > > #include "dma-buf-sysfs-stats.h"
> > > > > > > > > > >
> > > > > > > > > > > @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
> > > > > > > > > > > kset_unregister(dma_buf_stats_kset);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > +static void sysfs_add_workfn(struct work_struct *work)
> > > > > > > > > > > +{
> > > > > > > > > > > + struct dma_buf_sysfs_entry *sysfs_entry =
> > > > > > > > > > > + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
> > > > > > > > > > > + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
> > > > > > > > > > > +
> > > > > > > > > > > + /*
> > > > > > > > > > > + * A dmabuf is ref-counted via its file member. If this handler holds the only
> > > > > > > > > > > + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
> > > > > > > > > > > + * optimization and a race; when the reference count drops to 1 immediately after
> > > > > > > > > > > + * this check it is not harmful as the sysfs entry will still get cleaned up in
> > > > > > > > > > > + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
> > > > > > > > > > > + * is released, and that can't happen until the end of this function.
> > > > > > > > > > > + */
> > > > > > > > > > > + if (file_count(dmabuf->file) > 1) {
> > > > > > > > > > Please completely drop that. I see absolutely no justification for this
> > > > > > > > > > additional complexity.
> > > > > > > > > >
> > > > > > > > > This case gets hit around 5% of the time in my testing so the else is
> > > > > > > > > not a completely unused branch.
> > > > > > > > Well I can only repeat myself: This means that your userspace is
> > > > > > > > severely broken!
> > > > > > > >
> > > > > > > > DMA-bufs are meant to be long-lived objects
> > > > > > > This patch addresses export *latency* regardless of how long-lived the
> > > > > > > object is. Even a single, long-lived export will benefit from this
> > > > > > > change if it would otherwise be blocked on adding an object to sysfs.
> > > > > > > I think attempting to improve this latency still has merit.
> > > > > > Fixing the latency is nice, but as it's just pushing the needed work off
> > > > > > to another code path, it will take longer overall for the sysfs stuff to
> > > > > > be ready for userspace to see.
> > > > > >
> > > > > > Perhaps we need to step back and understand what this code is supposed
> > > > > > to be doing. As I recall, it was created because some systems do not
> > > > > > allow debugfs anymore, and they wanted the debugging information that
> > > > > > the dmabuf code was exposing to debugfs on a "normal" system. Moving
> > > > > > that logic to sysfs made sense, but now I am wondering why we didn't see
> > > > > > these issues in the debugfs code previously?
> > > > > >
> > > > > > Perhaps we should go just one step further and make a misc device node
> > > > > > for dmabuf debugging information to be in and just have userspace
> > > > > > poll/read on the device node and we spit the info that used to be in
> > > > > > debugfs out through that? That way this only affects systems when they
> > > > > > want to read the information and not normal code paths? Yeah that's a
> > > > > > hack, but this whole thing feels overly complex now.
> > > > > A bit late on this discussion, but just wanted to add my +1 that we should
> > > > > either redesign the uapi, or fix the underlying latency issue in sysfs, or
> > > > > whatever else is deemed the proper fix.
> > > > >
> > > > > Making uapi interfaces async in ways that userspace can't discover is a
> > > > > hack that we really shouldn't consider, at least for upstream. All kinds
> > > > > of hilarious things might start to happen when an object exists, but not
> > > > > consistently in all the places where it should be visible. There's a
> > > > > reason sysfs has all these neat property groups so that absolutely
> > > > > everything is added atomically. Doing stuff later on just because usually
> > > > > no one notices that the illusion falls apart isn't great.
> > > > >
> > > > > Unfortunately I don't have a clear idea here what would be the right
> > > > > solution :-/ One idea perhaps: Should we dynamically enumerate the objects
> > > > > when userspace does a readdir()? That's absolutely not how sysfs works,
> > > > > but procfs works like that and there's discussions going around about
> > > > > moving these optimizations to other kernfs implementations. At least there
> > > > > was a recent lwn article on this:
> > > > >
> > > > > https://lwn.net/Articles/895111/
> > > > >
> > > > > But that would be serious amounts of work I guess.
> > > > > -Daniel
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > > > Hi Daniel,
> > > >
> > > > My team has been discussing this, and I think we're approaching a
> > > > consensus on a way forward that involves deprecating the existing
> > > > uapi.
> > > >
> > > > I actually proposed a similar (but less elegant) idea to the readdir()
> > > > one. A new "dump_dmabuf_data" sysfs file that a user would write to,
> > > > which would cause a one-time creation of the per-buffer files. These
> > > > could be left around to become stale, or get cleaned up after first
> > > > read. However to me it seems impossible to correctly deal with
> > > > multiple simultaneous users with this technique. We're not currently
> > > > planning to pursue this.
> > > >
> > > > Thanks for the link to the article. That on-demand creation sounds
> > > > like it would allow us to keep the existing structure and files for
> > > > DMA-buf, assuming there is not a similar lock contention issue when
> > > > adding a new node to the virtual tree. :)
> > >
> > > I think that this on-demand creation is even worse than the existing ideas,
> > > but if you can get Greg to accept the required sysfs changes then that's at
> > > least outside of my maintenance domain any more :)
> >
> > I think doing it cleanly in sysfs without changing the current uapi sounds
> > pretty good. The hand-rolled "touch a magic file to force update all the
> > files into existence" sounds like a horror show to me :-) Plus I don't see
> > how you can actually avoid the locking pain with that since once the files
> > are created, you have to remove them synchronously again, plus you get to
> > deal with races on top (and likely some locking inversion fun on top).
> > -Daniel
>
> Yes, lots of reasons not to pursue that angle. :)
>
> So I asked Greg about modifying sysfs for this purpose, and he's quite
> convincing that it's not the right way to approach this problem. So
> that leaves deprecating the per-buffer statistics. It looks like we
> can maintain the userspace functionality that depended on this by
> replacing it with a single sysfs node for "dmabuf_total_size" along
> with adding exporter information to procfs (via Kalesh's path patch
> [1]). However there is a separate effort to account dmabufs from heaps
> with cgroups [2], so if I'm able to make that work then we would not
> need the new "dmabuf_total_size" file since this same information
> could be obtained from the root cgroup instead. So I'd like to try
> that first before falling back to adding a new dmabuf_total_size file.
Sounds like a plan.
-Daniel
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/#m43a3d345f821a02babd4ebb1f4257982d027c9e4
> [2] https://lore.kernel.org/lkml/CABdmKX1xvm87WMEDkMc9Aye46E4zv1-scenwgaRxHesrOCsaYg@mail.gmail.com/T/#mb82eca5438a4ea7ab157ab9cd7f044cbcfeb5509
>
>
>
>
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On 15.06.22 at 20:32, Daniel Vetter wrote:
> On Wed, 15 Jun 2022 at 19:43, T.J. Mercier <[email protected]> wrote:
>> On Wed, Jun 1, 2022 at 5:40 AM Daniel Vetter <[email protected]> wrote:
>>> On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote:
>>>> On 25.05.22 at 23:05, T.J. Mercier wrote:
>>>>> On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <[email protected]> wrote:
>>>>>> On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
>>>>>>> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
>>>>>>>> On Mon, May 16, 2022 at 12:21 PM Christian König
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On 16.05.22 at 20:08, T.J. Mercier wrote:
>>>>>>>>>> On Mon, May 16, 2022 at 10:20 AM Christian König
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> On 16.05.22 at 19:13, T.J. Mercier wrote:
>>>>>>>>>>>> Recently, we noticed an issue where a process went into direct reclaim
>>>>>>>>>>>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
>>>>>>>>>>>> mode. This caused processes that were doing DMA-BUF exports and releases
>>>>>>>>>>>> to go into uninterruptible sleep since they needed to acquire the same
>>>>>>>>>>>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
>>>>>>>>>>>> blocking DMA-BUF export for an indeterminate amount of time while
>>>>>>>>>>>> another process is holding the sysfs rw semaphore in exclusive mode,
>>>>>>>>>>>> this patch moves the per-buffer sysfs file creation to the default work
>>>>>>>>>>>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
>>>>>>>>>>>> sysfs statistics, but this is a tradeoff to prevent the hot path from
>>>>>>>>>>>> being blocked. A work_struct is added to dma_buf to achieve this, but as
>>>>>>>>>>>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
>>>>>>>>>>>> increase in size.
>>>>>>>>>>> I'm still not very keen on this approach as it strongly feels like we
>>>>>>>>>>> are working around a shortcoming somewhere else.
>>>>>>>>>>>
>>>>>>>>>> My read of the thread for the last version is that we're running into
>>>>>>>>>> a situation where sysfs is getting used for something it wasn't
>>>>>>>>>> originally intended for, but we're also stuck with this sysfs
>>>>>>>>>> functionality for dmabufs.
>>>>>>>>>>
>>>>>>>>>>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
>>>>>>>>>>>> Originally-by: Hridya Valsaraju <[email protected]>
>>>>>>>>>>>> Signed-off-by: T.J. Mercier <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> See the originally submitted patch by Hridya Valsaraju here:
>>>>>>>>>>>> https://lkml.org/lkml/2022/1/4/1066
>>>>>>>>>>>>
>>>>>>>>>>>> v2 changes:
>>>>>>>>>>>> - Defer only sysfs creation instead of creation and teardown per
>>>>>>>>>>>> Christian König
>>>>>>>>>>>>
>>>>>>>>>>>> - Use a work queue instead of a kthread for deferred work per
>>>>>>>>>>>> Christian König
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
>>>>>>>>>>>> include/linux/dma-buf.h | 14 ++++++-
>>>>>>>>>>>> 2 files changed, 54 insertions(+), 16 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> index 2bba0babcb62..67b0a298291c 100644
>>>>>>>>>>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>>>>>> #include <linux/printk.h>
>>>>>>>>>>>> #include <linux/slab.h>
>>>>>>>>>>>> #include <linux/sysfs.h>
>>>>>>>>>>>> +#include <linux/workqueue.h>
>>>>>>>>>>>>
>>>>>>>>>>>> #include "dma-buf-sysfs-stats.h"
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
>>>>>>>>>>>> kset_unregister(dma_buf_stats_kset);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> +static void sysfs_add_workfn(struct work_struct *work)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct dma_buf_sysfs_entry *sysfs_entry =
>>>>>>>>>>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
>>>>>>>>>>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
>>>>>>>>>>>> +
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
>>>>>>>>>>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
>>>>>>>>>>>> + * optimization and a race; when the reference count drops to 1 immediately after
>>>>>>>>>>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
>>>>>>>>>>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
>>>>>>>>>>>> + * is released, and that can't happen until the end of this function.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + if (file_count(dmabuf->file) > 1) {
>>>>>>>>>>> Please completely drop that. I see absolutely no justification for this
>>>>>>>>>>> additional complexity.
>>>>>>>>>>>
>>>>>>>>>> This case gets hit around 5% of the time in my testing so the else is
>>>>>>>>>> not a completely unused branch.
>>>>>>>>> Well I can only repeat myself: This means that your userspace is
>>>>>>>>> severely broken!
>>>>>>>>>
>>>>>>>>> DMA-bufs are meant to be long-lived objects
>>>>>>>> This patch addresses export *latency* regardless of how long-lived the
>>>>>>>> object is. Even a single, long-lived export will benefit from this
>>>>>>>> change if it would otherwise be blocked on adding an object to sysfs.
>>>>>>>> I think attempting to improve this latency still has merit.
>>>>>>> Fixing the latency is nice, but as it's just pushing the needed work off
>>>>>>> to another code path, it will take longer overall for the sysfs stuff to
>>>>>>> be ready for userspace to see.
>>>>>>>
>>>>>>> Perhaps we need to step back and understand what this code is supposed
>>>>>>> to be doing. As I recall, it was created because some systems do not
>>>>>>> allow debugfs anymore, and they wanted the debugging information that
>>>>>>> the dmabuf code was exposing to debugfs on a "normal" system. Moving
>>>>>>> that logic to sysfs made sense, but now I am wondering why we didn't see
>>>>>>> these issues in the debugfs code previously?
>>>>>>>
>>>>>>> Perhaps we should go just one step further and make a misc device node
>>>>>>> for dmabuf debugging information to be in and just have userspace
>>>>>>> poll/read on the device node and we spit the info that used to be in
>>>>>>> debugfs out through that? That way this only affects systems when they
>>>>>>> want to read the information and not normal code paths? Yeah that's a
>>>>>>> hack, but this whole thing feels overly complex now.
>>>>>> A bit late on this discussion, but just wanted to add my +1 that we should
>>>>>> either redesign the uapi, or fix the underlying latency issue in sysfs, or
>>>>>> whatever else is deemed the proper fix.
>>>>>>
>>>>>> Making uapi interfaces async in ways that userspace can't discover is a
>>>>>> hack that we really shouldn't consider, at least for upstream. All kinds
>>>>>> of hilarious things might start to happen when an object exists, but not
>>>>>> consistently in all the places where it should be visible. There's a
>>>>>> reason sysfs has all these neat property groups so that absolutely
>>>>>> everything is added atomically. Doing stuff later on just because usually
>>>>>> no one notices that the illusion falls apart isn't great.
>>>>>>
>>>>>> Unfortunately I don't have a clear idea here what would be the right
>>>>>> solution :-/ One idea perhaps: Should we dynamically enumerate the objects
>>>>>> when userspace does a readdir()? That's absolutely not how sysfs works,
>>>>>> but procfs works like that and there's discussions going around about
>>>>>> moving these optimizations to other kernfs implementations. At least there
>>>>>> was a recent lwn article on this:
>>>>>>
>>>>>> https://lwn.net/Articles/895111/
>>>>>>
>>>>>> But that would be serious amounts of work I guess.
>>>>>> -Daniel
>>>>>> --
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch
>>>>> Hi Daniel,
>>>>>
>>>>> My team has been discussing this, and I think we're approaching a
>>>>> consensus on a way forward that involves deprecating the existing
>>>>> uapi.
>>>>>
>>>>> I actually proposed a similar (but less elegant) idea to the readdir()
>>>>> one. A new "dump_dmabuf_data" sysfs file that a user would write to,
>>>>> which would cause a one-time creation of the per-buffer files. These
>>>>> could be left around to become stale, or get cleaned up after first
>>>>> read. However to me it seems impossible to correctly deal with
>>>>> multiple simultaneous users with this technique. We're not currently
>>>>> planning to pursue this.
>>>>>
>>>>> Thanks for the link to the article. That on-demand creation sounds
>>>>> like it would allow us to keep the existing structure and files for
>>>>> DMA-buf, assuming there is not a similar lock contention issue when
>>>>> adding a new node to the virtual tree. :)
>>>> I think that this on-demand creation is even worse than the existing ideas,
>>>> but if you can get Greg to accept the required sysfs changes then that's at
>>>> least outside of my maintenance domain any more :)
>>> I think doing it cleanly in sysfs without changing the current uapi sounds
>>> pretty good. The hand-rolled "touch a magic file to force update all the
>>> files into existence" sounds like a horror show to me :-) Plus I don't see
>>> how you can actually avoid the locking pain with that since once the files
>>> are created, you have to remove them synchronously again, plus you get to
>>> deal with races on top (and likely some locking inversion fun on top).
>>> -Daniel
>> Yes, lots of reasons not to pursue that angle. :)
>>
>> So I asked Greg about modifying sysfs for this purpose, and he's quite
>> convincing that it's not the right way to approach this problem. So
>> that leaves deprecating the per-buffer statistics. It looks like we
>> can maintain the userspace functionality that depended on this by
>> replacing it with a single sysfs node for "dmabuf_total_size" along
>> with adding exporter information to procfs (via Kalesh's path patch
>> [1]). However there is a separate effort to account dmabufs from heaps
>> with cgroups [2], so if I'm able to make that work then we would not
>> need the new "dmabuf_total_size" file since this same information
>> could be obtained from the root cgroup instead. So I'd like to try
>> that first before falling back to adding a new dmabuf_total_size file.
> Sounds like a plan.
Yep, totally agree.
Christian.
> -Daniel
>
>> [1] https://lore.kernel.org/lkml/[email protected]/T/#m43a3d345f821a02babd4ebb1f4257982d027c9e4
>> [2] https://lore.kernel.org/lkml/CABdmKX1xvm87WMEDkMc9Aye46E4zv1-scenwgaRxHesrOCsaYg@mail.gmail.com/T/#mb82eca5438a4ea7ab157ab9cd7f044cbcfeb5509
>>
>>
>>
>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>
>