Received: by 2002:a05:6602:18e:0:0:0:0 with SMTP id m14csp543460ioo; Thu, 26 May 2022 09:04:35 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyvqKNduENg9Kt20OjJWYAWHnwp/x8M4ugDEHvTUWe72DxO/6SrK1ykBvZFtdU8QPYw23A/ X-Received: by 2002:a17:907:7ba1:b0:6fe:e18a:dc89 with SMTP id ne33-20020a1709077ba100b006fee18adc89mr19431871ejc.392.1653581075571; Thu, 26 May 2022 09:04:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1653581075; cv=none; d=google.com; s=arc-20160816; b=BsW3sKuCQDserKN6Cq5lKBh9OpfxIVYkOf5xLOm5U9Z7bQOVs5SxtRPawJ4TLXLAYU 3kp75y5chDtuoKedXfko1f5jId/3FhC5hO46hkUBNZ17+RBGQjPzu3dZfrNTqTqfVfQc sC3BFCrlnt5w1WXHwNIA8qZQFPt4n48Oawqf+qTSC8xCEBTfOk4MEg9GMRO76Em8QQ09 lEwLP++PIBHI68SGKlzOQKUdihS689uU0GXTVV7dPZ0JYUJKdCWeKubmn7abPmTt9LpU /4ayD5DLMmMajO6Bat3NCeb/J6HaLCV6q4HmUl17NFb6/68O07iLjysFm5gwGalYGobo w0UQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=I+Uxt8or9INmUhCswJXtP42LvPPtnJppQlsHVVvfcE8=; b=kYioEv9z9h4blq3+fgpAlDrCvfreiFrzqhgSnBMlIvsv1OXdgAY4hTSZQ7l0nM6UzS tVK2c1n8lU6lohjzDH4HNwNCQeNRcqNEmjRgr/MXEMM4ANVKV6YQvD8SlTKa5xgxKXDD PyjmfBT4HUSZp9GbCJ+GUHA3wvfvcmwhWGBy0zDX2XKaH2WBjJq+YYWatEIQIUg03xnD d4krzYAq5nu0gSm9PsShaPC23Yb92pn8yppppQAlFMnKPcH/2S7G6wi9mRR5IgCScyh6 Dqlc3AkEkXu+lKOWn6QA46ucBv/8hz+EvyuzBiD/OMgWajxsFslu4ceRz1C1+w7Puf1v xwRg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=sXLh7SX3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id o20-20020a170906975400b006fed9affed0si110750ejy.528.2022.05.26.09.04.06; Thu, 26 May 2022 09:04:35 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=sXLh7SX3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231173AbiEYVGO (ORCPT + 99 others); Wed, 25 May 2022 17:06:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53036 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232167AbiEYVGM (ORCPT ); Wed, 25 May 2022 17:06:12 -0400 Received: from mail-ej1-x62c.google.com (mail-ej1-x62c.google.com [IPv6:2a00:1450:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BC010BA982 for ; Wed, 25 May 2022 14:06:10 -0700 (PDT) Received: by mail-ej1-x62c.google.com with SMTP id wh22so43922611ejb.7 for ; Wed, 25 May 2022 14:06:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=I+Uxt8or9INmUhCswJXtP42LvPPtnJppQlsHVVvfcE8=; b=sXLh7SX3RbE0HYOz3JA0kn8NgdQw4xbVmwwMy219EyvwlobFQWO0MmhKby/pIDLMB8 Wl7DXJrqvQcHXSBoJVsUGzDU2rjZjWW/qvrIVVpLQ4FfOdmFSTiUxMEUcDO67bOIgxYI fUBTmKS2CeqpqvWsi1z8Sk5cHhoCJx9gFPGpfI3Wd9BwFbXey7/aP8Nik4oQDlNX5tZx cQR+f9ucdVGoPC/ZwluxcPTWl0NIxu77JmQOUdvC4duDfPecPz2TBwzGCUZ8KybVtLfB MuyQbJgyAfRDT1d9VUEsxt6uvPRlbzjUNC9r8KMutvWUFiAPeok0HPhL0u3ste+fXv0P KkpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=I+Uxt8or9INmUhCswJXtP42LvPPtnJppQlsHVVvfcE8=; b=Qxi34o8Y4xyrLaisNT5YrUTQZL5X/d3qLowbTHEenP2Jb7BJlp3A9/jKXxNOJ5hBsL XS91tNWcHv1RSfNCGLbwcnFCE59EDBabaLBL+VWbq++tsPZmIkeJb0QNZK90x9l3xd5F k771Gl6oUUKRu8cU06qs6uhxZ5Yl65AD/tw7xXp1fnEkczYJC+x4YI6PCVmemT5eJW4W nDzj6XJeItxBnIJlcSiX0x6VC0pF+fWo92PrxkRvonpJw90B1IacYX2hmF0sn1g6g5Si a2b2NdMSIZDCCyu4vC9EeY72bRAEyeh3W0XCHPeIQVgXJQKCg/MwUEAwDkBu/JQ2LKKC QmAQ== X-Gm-Message-State: AOAM533jt376wlhO1Mkw2NS7bdp566GmsBY0i1a3/pwEKgzyxAXihry/ 45cI/vAZcBVyVMP9T4TVg0bXV8gqi6mPEYHunNC+Ag== X-Received: by 2002:a17:907:97d0:b0:6ff:e43:2145 with SMTP id js16-20020a17090797d000b006ff0e432145mr5397626ejc.273.1653512769030; Wed, 25 May 2022 14:06:09 -0700 (PDT) MIME-Version: 1.0 References: <20220516171315.2400578-1-tjmercier@google.com> <175c5af3-9224-9c8e-0784-349dad9a2954@amd.com> <0875fa95-3a25-a354-1433-201fca81ed3e@amd.com> In-Reply-To: From: "T.J. Mercier" Date: Wed, 25 May 2022 14:05:57 -0700 Message-ID: Subject: Re: [PATCH v2] dma-buf: Move sysfs work out of DMA-BUF export path To: Greg Kroah-Hartman , "T.J. Mercier" , =?UTF-8?Q?Christian_K=C3=B6nig?= , Suren Baghdasaryan , Kalesh Singh , Minchan Kim , Greg Kroah-Hartman , John Stultz , Sumit Semwal , Hridya Valsaraju , kernel-team@android.com, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org, linux-kernel@vger.kernel.org Cc: Daniel Vetter Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 25, 2022 at 7:38 AM Daniel Vetter wrote: > > On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote: > > On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote: > > > On Mon, May 16, 2022 at 12:21 PM Christian K=C3=B6nig > > > wrote: > > > > > > > > Am 16.05.22 um 20:08 schrieb T.J. Mercier: > > > > > On Mon, May 16, 2022 at 10:20 AM Christian K=C3=B6nig > > > > > wrote: > > > > >> Am 16.05.22 um 19:13 schrieb T.J. Mercier: > > > > >>> Recently, we noticed an issue where a process went into direct = reclaim > > > > >>> while holding the kernfs rw semaphore for sysfs in write (exclu= sive) > > > > >>> mode. This caused processes who were doing DMA-BUF exports and = releases > > > > >>> to go into uninterruptible sleep since they needed to acquire t= he same > > > > >>> semaphore for the DMA-BUF sysfs entry creation/deletion. In ord= er to avoid > > > > >>> blocking DMA-BUF export for an indeterminate amount of time whi= le > > > > >>> another process is holding the sysfs rw semaphore in exclusive = mode, > > > > >>> this patch moves the per-buffer sysfs file creation to the defa= ult work > > > > >>> queue. Note that this can lead to a short-term inaccuracy in th= e dmabuf > > > > >>> sysfs statistics, but this is a tradeoff to prevent the hot pat= h from > > > > >>> being blocked. A work_struct is added to dma_buf to achieve thi= s, but as > > > > >>> it is unioned with the kobject in the sysfs_entry, dma_buf does= not > > > > >>> increase in size. > > > > >> I'm still not very keen of this approach as it strongly feels li= ke we > > > > >> are working around shortcoming somewhere else. > > > > >> > > > > > My read of the thread for the last version is that we're running = into > > > > > a situation where sysfs is getting used for something it wasn't > > > > > originally intended for, but we're also stuck with this sysfs > > > > > functionality for dmabufs. > > > > > > > > > >>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-= BUF stats in sysfs") > > > > >>> Originally-by: Hridya Valsaraju > > > > >>> Signed-off-by: T.J. Mercier > > > > >>> > > > > >>> --- > > > > >>> See the originally submitted patch by Hridya Valsaraju here: > > > > >>> https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%= 2F%2Flkml.org%2Flkml%2F2022%2F1%2F4%2F1066&data=3D05%7C01%7Cchristian.k= oenig%40amd.com%7C794614324d114880a25508da37672e4b%7C3dd8961fe4884e608e11a8= 2d994e183d%7C0%7C0%7C637883213566903705%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4= wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&= ;sdata=3DbGlA2FeubfSeL5XDHYyWMZqUXfScoCphZjjK4jrqQJs%3D&reserved=3D0 > > > > >>> > > > > >>> v2 changes: > > > > >>> - Defer only sysfs creation instead of creation and teardown pe= r > > > > >>> Christian K=C3=B6nig > > > > >>> > > > > >>> - Use a work queue instead of a kthread for deferred work per > > > > >>> Christian K=C3=B6nig > > > > >>> --- > > > > >>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 +++++++++++++++++= +++------- > > > > >>> include/linux/dma-buf.h | 14 ++++++- > > > > >>> 2 files changed, 54 insertions(+), 16 deletions(-) > > > > >>> > > > > >>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dm= a-buf/dma-buf-sysfs-stats.c > > > > >>> index 2bba0babcb62..67b0a298291c 100644 > > > > >>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c > > > > >>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c > > > > >>> @@ -11,6 +11,7 @@ > > > > >>> #include > > > > >>> #include > > > > >>> #include > > > > >>> +#include > > > > >>> > > > > >>> #include "dma-buf-sysfs-stats.h" > > > > >>> > > > > >>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void= ) > > > > >>> kset_unregister(dma_buf_stats_kset); > > > > >>> } > > > > >>> > > > > >>> +static void sysfs_add_workfn(struct work_struct *work) > > > > >>> +{ > > > > >>> + struct dma_buf_sysfs_entry *sysfs_entry =3D > > > > >>> + container_of(work, struct dma_buf_sysfs_entry, sy= sfs_add_work); > > > > >>> + struct dma_buf *dmabuf =3D sysfs_entry->dmabuf; > > > > >>> + > > > > >>> + /* > > > > >>> + * A dmabuf is ref-counted via its file member. If this h= andler holds the only > > > > >>> + * reference to the dmabuf, there is no need for sysfs ko= bject creation. This is an > > > > >>> + * optimization and a race; when the reference count drop= s to 1 immediately after > > > > >>> + * this check it is not harmful as the sysfs entry will s= till get cleaned up in > > > > >>> + * dma_buf_stats_teardown, which won't get called until t= he final dmabuf reference > > > > >>> + * is released, and that can't happen until the end of th= is function. > > > > >>> + */ > > > > >>> + if (file_count(dmabuf->file) > 1) { > > > > >> Please completely drop that. I see absolutely no justification f= or this > > > > >> additional complexity. > > > > >> > > > > > This case gets hit around 5% of the time in my testing so the els= e is > > > > > not a completely unused branch. > > > > > > > > Well I can only repeat myself: This means that your userspace is > > > > severely broken! > > > > > > > > DMA-buf are meant to be long living objects > > > This patch addresses export *latency* regardless of how long-lived th= e > > > object is. Even a single, long-lived export will benefit from this > > > change if it would otherwise be blocked on adding an object to sysfs. > > > I think attempting to improve this latency still has merit. > > > > Fixing the latency is nice, but as it's just pushing the needed work of= f > > to another code path, it will take longer overall for the sysfs stuff t= o > > be ready for userspace to see. > > > > Perhaps we need to step back and understand what this code is supposed > > to be doing. As I recall, it was created because some systems do not > > allow debugfs anymore, and they wanted the debugging information that > > the dmabuf code was exposing to debugfs on a "normal" system. Moving > > that logic to sysfs made sense, but now I am wondering why we didn't se= e > > these issues in the debugfs code previously? > > > > Perhaps we should go just one step further and make a misc device node > > for dmabug debugging information to be in and just have userspace > > poll/read on the device node and we spit the info that used to be in > > debugfs out through that? That way this only affects systems when they > > want to read the information and not normal code paths? Yeah that's a > > hack, but this whole thing feels overly complex now. > > A bit late on this discussion, but just wanted to add my +1 that we shoul= d > either redesign the uapi, or fix the underlying latency issue in sysfs, o= r > whatever else is deemed the proper fix. > > Making uapi interfaces async in ways that userspace can't discover is a > hack that we really shouldn't consider, at least for upstream. All kinds > of hilarious things might start to happen when an object exists, but not > consistently in all the places where it should be visible. There's a > reason sysfs has all these neat property groups so that absolutely > everything is added atomically. Doing stuff later on just because usually > no one notices that the illusion falls apart isn't great. > > Unfortunately I don't have a clear idea here what would be the right > solution :-/ One idea perhaps: Should we dynamically enumerate the object= s > when userspace does a readdir()? That's absolutely not how sysfs works, > but procfs works like that and there's discussions going around about > moving these optimizations to other kernfs implementations. At least ther= e > was a recent lwn article on this: > > https://lwn.net/Articles/895111/ > > But that would be serious amounts of work I guess. > -Daniel > -- > Daniel Vetter" > Software Engineer, Intel Corporation > http://blog.ffwll.ch Hi Daniel, My team has been discussing this, and I think we're approaching a consensus on a way forward that involves deprecating the existing uapi. I actually proposed a similar (but less elegant) idea to the readdir() one. A new "dump_dmabuf_data" sysfs file that a user would write to, which would cause a one-time creation of the per-buffer files. These could be left around to become stale, or get cleaned up after first read. However to me it seems impossible to correctly deal with multiple simultaneous users with this technique. We're not currently planning to pursue this. Thanks for the link to the article. That on-demand creation sounds like it would allow us to keep the existing structure and files for DMA-buf, assuming there is not a similar lock contention issue when adding a new node to the virtual tree. :)