From: Rob Clark
Date: Fri, 17 Mar 2023 08:59:48 -0700
Subject: Re: [PATCH v10 01/15] dma-buf/dma-fence: Add deadline awareness
To: Jonas Ådahl
Cc: dri-devel@lists.freedesktop.org, Rob Clark, Pekka Paalanen,
    Jonathan Corbet, Christian König, intel-gfx@lists.freedesktop.org,
    "open list:DOCUMENTATION", open list, Sumit Semwal,
    "moderated list:DMA BUFFER SHARING FRAMEWORK", Luben Tuikov,
    Bagas Sanjaya, Rodrigo Vivi, Gustavo Padovan, Matt Turner,
    freedreno@lists.freedesktop.org, Christian König,
    "open list:DMA BUFFER SHARING FRAMEWORK"
References: <20230308155322.344664-1-robdclark@gmail.com> <20230308155322.344664-2-robdclark@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Mar 17, 2023 at 3:23 AM Jonas Ådahl wrote:
>
> On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote:
> > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl wrote:
> > >
> > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote:
> > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl wrote:
> > > > >
> > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote:
> > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl wrote:
> > > > > > >
> > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote:
> > > > > > > > From: Rob Clark
> > > > > > > >
> > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as
> > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid
> > > > > > > > the fence signaler in making power management decisions, like boosting
> > > > > > > > frequency as the deadline approaches and awareness of missing deadlines
> > > > > > > > so that can be factored in to the frequency scaling.
> > > > > > > >
> > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate
> > > > > > > >     deadlines, to avoid increasing dma_fence size. The fence-context
> > > > > > > >     implementation will need similar logic to track deadlines of all
> > > > > > > >     the fences on the same timeline. [ckoenig]
> > > > > > > > v3: Clarify locking wrt. set_deadline callback
> > > > > > > > v4: Clarify in docs comment that this is a hint
> > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT.
> > > > > > > > v6: More docs
> > > > > > > > v7: Fix typo, clarify past deadlines
> > > > > > > >
> > > > > > > > Signed-off-by: Rob Clark
> > > > > > > > Reviewed-by: Christian König
> > > > > > > > Acked-by: Pekka Paalanen
> > > > > > > > Reviewed-by: Bagas Sanjaya
> > > > > > > > ---
> > > > > > >
> > > > > > > Hi Rob!
> > > > > > >
> > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++
> > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++
> > > > > > > > include/linux/dma-fence.h | 22 +++++++++++
> > > > > > > > 3 files changed, 87 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > > > > > > > index 622b8156d212..183e480d8cea 100644
> > > > > > > > --- a/Documentation/driver-api/dma-buf.rst
> > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst
> > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations
> > > > > > > >  .. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > > > > >     :doc: fence signalling annotation
> > > > > > > >
> > > > > > > > +DMA Fence Deadline Hints
> > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > > +
> > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > > > > > +   :doc: deadline hints
> > > > > > > > +
> > > > > > > >  DMA Fences Functions Reference
> > > > > > > >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > >
> > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > > > > index 0de0482cd36e..f177c56269bb 100644
> > > > > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(dma_fence_wait_any_timeout);
> > > > > > > >
> > > > > > > > +/**
> > > > > > > > + * DOC: deadline hints
> > > > > > > > + *
> > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently
> > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum
> > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize
> > > > > > > > + * power consumption. But in the real world there are many workloads which
> > > > > > > > + * defy this ideal. For example, but not limited to:
> > > > > > > > + *
> > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods
> > > > > > > > + *   of CPU waiting for device, and device waiting on CPU. This can result in
> > > > > > > > + *   devfreq and cpufreq seeing idle time in their respective domains and in
> > > > > > > > + *   result reduce frequency.
> > > > > > > > + *
> > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double
> > > > > > > > + *   buffered GPU rendering vs vblank sync'd page flipping. In this scenario,
> > > > > > > > + *   missing a vblank deadline results in an *increase* in idle time on the GPU
> > > > > > > > + *   (since it has to wait an additional vblank period), sending a signal to
> > > > > > > > + *   the GPU's devfreq to reduce frequency, when in fact the opposite is what is
> > > > > > > > + *   needed.
> > > > > > >
> > > > > > > This is the use case I'd like to get some better understanding about how
> > > > > > > this series intends to work, as the problematic scheduling behavior
> > > > > > > triggered by missed deadlines has plagued compositing display servers
> > > > > > > for a long time.
> > > > > > >
> > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver
> > > > > > > developer, so I will need some hand holding when it comes to
> > > > > > > understanding exactly what piece of software is responsible for
> > > > > > > communicating what piece of information.
> > > > > > >
> > > > > > > > + *
> > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline.
> > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to
> > > > > > > > + * convey an appropriate sense of urgency to the signaling driver.
> > > > > > > > + *
> > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace
> > > > > > > > + * facing APIs). The time could either be some point in the future (such as
> > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's
> > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline
> > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled).
> > > > > > >
> > > > > > > Is it guaranteed that a GPU driver will use the actual start of the
> > > > > > > vblank as the effective deadline? I have some memories of seeing
> > > > > > > something about vblank evasion browsing driver code, which I might have
> > > > > > > misunderstood, but I have yet to find whether this is something
> > > > > > > userspace can actually expect to be something it can rely on.
> > > > > >
> > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more
> > > > > > clear if we talk about them separately even if they happen to be the
> > > > > > same device.
> > > > >
> > > > > Sure, sorry about being unclear about that.
> > > > >
> > > > > >
> > > > > > Assuming that is what you mean, nothing strongly defines what the
> > > > > > deadline is. In practice there is probably some buffering in the
> > > > > > display controller. For ex, block based (including bandwidth
> > > > > > compressed) formats, you need to buffer up a row of blocks to
> > > > > > efficiently linearize for scanout. So you probably need to latch some
> > > > > > time before you start sending pixel data to the display. But details
> > > > > > like this are heavily implementation dependent. I think the most
> > > > > > reasonable thing to target is start of vblank.
> > > > >
> > > > > The driver exposing those details would be quite useful for userspace
> > > > > though, so that it can delay committing updates to too late, but not too
> > > > > late. Setting a deadline to be the vblank seems easy enough, but it
> > > > > isn't enough for scheduling the actual commit.
> > > >
> > > > I'm not entirely sure how that would even work.. but OTOH I think you
> > > > are talking about something on the order of 100us? But that is a bit
> > > > of another topic.
> > >
> > > Yes, something like that. But yea, it's not really related. Scheduling
> > > commits closer to the deadline has more complex behavior than that too,
> > > e.g. the need for real time scheduling, and knowing how long it usually
> > > takes to create and commit and for the kernel to process.
> > >
> > > > > 8-< *snip* 8-<
> > > > >
> > > > > > You need a fence to set the deadline, and for that work needs to be
> > > > > > flushed. But you can't associate a deadline with work that the kernel
> > > > > > is unaware of anyways.
> > > > >
> > > > > That makes sense, but it might also be a bit inadequate to have it as the
> > > > > only way to tell the kernel it should speed things up. Even with the
> > > > > trick i915 does, with GNOME Shell, we still end up with the feedback
> > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying
> > > > > or dropping the first frame is so far the best work around that works,
> > > > > except doing other tricks that make the kernel ramp up its clock.
> > > > > Having to rely on choosing between latency and frame drops should
> > > > > ideally not have to be made.
> > > >
> > > > Before you have a fence, the thing you want to be speeding up is the
> > > > CPU, not the GPU. There are existing mechanisms for that.
> > >
> > > Is there no benefit to let the GPU know earlier that it should speed up,
> > > so that when the job queue arrives, it's already up to speed?
> >
> > Downstream we have input notifier that resumes the GPU so we can
> > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we
> > wait to boost freq until we have cmdstream to submit, since that
> > doesn't take as long. What needs help initially after input is all
> > the stuff that happens on the CPU before the GPU can start to do
> > anything ;-)
>
> How do you deal with boosting CPU speeds downstream? Does the input
> notifier do that too?

Yes.. actually currently downstream (depending on device) we have 1 to
3 input notifiers, one for CPU boost, one for early-PSR-exit, and one
to get a head start on booting up the GPU.

> >
> > Btw, I guess I haven't made this clear, dma-fence deadline is trying
> > to help the steady-state situation, rather than the input-latency
> > situation. It might take a frame or two of missed deadlines for
> > gpufreq to arrive at a good steady-state freq.
>
> I'm just not sure it will help. Missed deadlines set at commit hasn't
> been enough in the past to let the kernel understand it should speed
> things up before the next frame (which will be a whole frame late
> without any triple buffering which should be a last resort), so I don't
> see how it will help by adding a userspace hook to do the same thing.

So deadline is just a superset of "right now" and "sometime in the
future".. and this has been useful enough for i915 that they have both
forms, when waiting on GPU via i915 specific ioctls and when pageflip
(assuming userspace isn't deferring composition decision and instead
just pushing it all down to the kernel). But this breaks down in a
few cases:

1) non pageflip (for ex. ping-ponging between cpu and gpu) use cases
   when you wait via polling on fence fd or wait via drm_syncobj
   instead of DRM_IOCTL_I915_GEM_WAIT
2) when userspace decides late in frame to not pageflip because app
   fence isn't signaled yet

And this is all done in a way that doesn't help for situations where
you have separate kms and render devices. Or the kms driver doesn't
bypass atomic helpers (ie. uses drm_atomic_helper_wait_for_fences()).
So the technique has already proven to be useful. This series just
extends it beyond driver specific primitives (ie.
dma_fence/drm_syncobj)

> I think input latency and steady state target frequency here is tightly
> linked; what we should aim for is to provide enough information at the
> right time so that it does *not* take a frame or two of missed
> deadlines to arrive at the target frequency, as those missed deadlines
> mean either stuttering and/or lag.

If you have some magic way for a gl/vk driver to accurately predict
how many cycles it will take to execute a sequence of draws, I'm all
ears.

Realistically, the best solution on sudden input is to overshoot and
let freqs settle back down.

But there is a lot more to input latency than GPU freq. In UI
workloads, even fullscreen animation, I don't really see the GPU going
above the 2nd lowest OPP even on relatively small things like a618.
UI input latency (touch scrolling, on-screen stylus /
low-latency-ink, animations) are a separate issue from what this
series addresses, and aren't too much to do with GPU freq.

> That it helps with the deliberately late commit I do understand, but we
> don't do that yet, but intend to when there is kernel uapi that lets us do
> so without negative consequences.
>
> > > >
> > > > TBF I'm of the belief that there is still a need for input based cpu
> > > > boost (and early wake-up trigger for GPU).. we have something like
> > > > this in CrOS kernel. That is a bit of a different topic, but my point
> > > > is that fence deadlines are just one of several things we need to
> > > > optimize power/perf and responsiveness, rather than the single thing
> > > > that solves every problem under the sun ;-)
> > >
> > > Perhaps; but I believe it's a bit of a back channel of intent; the piece
> > > of the puzzle that has the information to know whether there is need
> > > actually speed up is the compositor, not the kernel.
> > >
> > > For example, pressing 'p' while a terminal is focused does not need high
> > > frequency clocks, it just needs the terminal emulator to draw a 'p' and
> > > the compositor to composite that update. Pressing <Super> may however
> > > trigger a non-trivial animation moving a lot of stuff around on screen,
> > > maybe triggering Wayland clients to draw and what not, and should most
> > > arguably have the ability to "warn" the kernel about the upcoming flood
> > > of work before it is already knocking on its door step.
> >
> > The super key is problematic, but not for the reason you think. It is
> > because it is a case where we should boost on key-up instead of
> > key-down.. and the second key-up event comes after the cpu-boost is
> > already in its cool-down period. But even if suboptimal in cases
> > like this, it is still useful for touch/stylus cases where the
> > slightest of lag is much more perceptible.
>
> Other keys are even more problematic. Alt, for example, does nothing,
> Alt + Tab does some light rendering, but Alt + KeyAboveTab will,
> depending on the current active applications, suddenly trigger N Wayland
> surfaces to start rendering at the same time.
>
> >
> > This is getting off topic but I kinda favor coming up with some sort
> > of static definition that userspace could give the kernel to let the
> > kernel know what input to boost on. Or maybe something could be done
> > with BPF?
>
> I have a hard time seeing any static information can be enough, it
> depends too much on context what is expected to happen. And can a BPF
> program really help? Unless BPF programs that pulls some internal kernel
> strings to speed things up whenever userspace wants I don't see how it
> is that much better.
>
> I don't think userspace is necessarily too slow to actively participate
> in providing direct scheduling hints either. Input processing can, for
> example, be off loaded to a real time scheduled thread, and plumbing any
> hints about future expectations from rendering, windowing and layout
> subsystems will be significantly easier to plumb to a real time input
> thread than translated into static information or BPF programs.

I mean, the kernel side input handler is called from irq context long
before even the scheduler gets involved..

But I think you are over-thinking the Alt + SomeOtherKey case. The
important thing isn't what the other key is, it is just to know that
Alt is a modifier key (ie. handle it on key-up instead of key-down).
No need to over-complicate things. It's probably enough to give the
kernel a list of modifier+key combos that do _something_..

And like I've said before, keyboard input is the least problematic in
terms of latency. It is a _lot_ easier to notice lag with touch
scrolling or stylus (on screen). (The latter case, I think Wayland
has some catching up to do compared to CrOS or Android.. you really
need a way to allow the app to do front buffer rendering to an overlay
for the stylus case, because even just 16ms delay is _very_
noticeable.)

BR,
-R
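
For reference, a rough userspace-side sketch of the "absolute
CLOCK_MONOTONIC deadline" convention from the quoted doc comment, with
start of vblank as the target. This is illustration only and not part of
the series: monotonic_now_ns() and next_vblank_deadline_ns() are
hypothetical helper names, the last-vblank timestamp and refresh period
are assumed to come from wherever the compositor already tracks them, and
how the value reaches the kernel is deliberately left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
uint64_t monotonic_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: return the start of the first vblank strictly
 * after now_ns, as an absolute time on the same clock as the inputs.
 *
 * last_vblank_ns: timestamp of a previously observed vblank (assumed input)
 * period_ns:      refresh period, e.g. 16666667 for ~60Hz (assumed input)
 */
uint64_t next_vblank_deadline_ns(uint64_t now_ns, uint64_t last_vblank_ns,
				 uint64_t period_ns)
{
	uint64_t elapsed;

	assert(period_ns > 0);

	/* Reference vblank still lies ahead of us: it is the next one. */
	if (now_ns < last_vblank_ns)
		return last_vblank_ns;

	/* Round up to the next whole period after the reference vblank. */
	elapsed = now_ns - last_vblank_ns;
	return last_vblank_ns + (elapsed / period_ns + 1) * period_ns;
}
```

A caller would compute something like next_vblank_deadline_ns(
monotonic_now_ns(), last_vblank_ns, period_ns) and pass the result as the
deadline hint. Per the doc comment, passing the current time instead
indicates an immediate deadline.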