Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751717AbbKPMIM (ORCPT ); Mon, 16 Nov 2015 07:08:12 -0500 Received: from mga02.intel.com ([134.134.136.20]:42120 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751562AbbKPMIL (ORCPT ); Mon, 16 Nov 2015 07:08:11 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,302,1444719600"; d="scan'208";a="851775086" Subject: Re: [PATCH 2/2] drm/i915: Limit the busy wait on requests to 2us not 10ms! To: Chris Wilson , Jens Axboe , intel-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Daniel Vetter , Eero Tamminen , "Rantala, Valtteri" , stable@kernel.vger.org References: <1447594364-4206-1-git-send-email-chris@chris-wilson.co.uk> <1447594364-4206-2-git-send-email-chris@chris-wilson.co.uk> <5649AEED.9090807@linux.intel.com> <20151116111208.GQ569@nuc-i3427.alporthouse.com> From: Tvrtko Ursulin Organization: Intel Corporation UK Plc Message-ID: <5649C728.5040109@linux.intel.com> Date: Mon, 16 Nov 2015 12:08:08 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <20151116111208.GQ569@nuc-i3427.alporthouse.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5630 Lines: 141 On 16/11/15 11:12, Chris Wilson wrote: > On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote: >> >> Hi, >> >> On 15/11/15 13:32, Chris Wilson wrote: >>> When waiting for high frequency requests, the finite amount of time >>> required to set up the irq and wait upon it limits the response rate. By >>> busywaiting on the request completion for a short while we can service >>> the high frequency waits as quick as possible. However, if it is a slow >>> request, we want to sleep as quickly as possible. The tradeoff between >>> waiting and sleeping is roughly the time it takes to sleep on a request, >>> on the order of a microsecond. Based on measurements from big core, I >>> have set the limit for busywaiting as 2 microseconds. >> >> Sounds like solid reasoning. Would it also be worth finding the >> trade off limit for small core? > > Takes a bit longer, but 2us seems "ok" on PineView (as in it doesn't > lose the boost from spinning rather than sleeping). Have some more > testing to do on vlv/byt. Cool. >>> The code currently uses the jiffie clock, but that is far too coarse (on >>> the order of 10 milliseconds) and results in poor interactivity as the >>> CPU ends up being hogged by slow requests. To get microsecond resolution >>> we need to use a high resolution timer. The cheapest of which is polling >>> local_clock(), but that is only valid on the same CPU. If we switch CPUs >>> because the task was preempted, we can also use that as an indicator that >>> the system is too busy to waste cycles on spinning and we should sleep >>> instead. >> >> Hm, need_resched would not cover the CPU switch anyway? Or maybe >> need_resched means something other than I thought which is "there >> are other runnable tasks"? > > As I understand it, it means that the scheduler tick fired (or something > else yielded). I haven't spotted if it gets set as the runqueue changes. > As it pertains to us, it means that we need to get to schedule() as > quick as possible which along this path means going to sleep. > > I'm not sure if need_resched() would catch the cpu switch, if we were > preempted as the flag would be set on the idle process not us. Could be, I wasn't sure at all, just curious and trying to understand it fully. Cpu check is so cheap as implemented that it is not under any criticism. >> This would also have impact on the patch subject line.I thought we >> would burn a jiffie of CPU cycles only if there are no other >> runnable tasks - so how come an impact on interactivity? > > I have a couple of ideas for the effect on interactivty: > > 1. Burning through the time slice is acting as a penalty against running > that process (typically the compositor) in the near future, perhaps > enough to miss some deadlines. > > 2. Processor power balancing. > >> Also again I think the commit message needs some data on how this >> was found and what is the impact. > > The system felt unresponsive. It would be interesting for me to know a > few more details about the tick on that system (HZ, tickless?, > preemption?) to see if changing the config on my xps13 also produces the > lag/jitter/poor interactivty. Yes interesting but not critical I think. Since the new scheme looks as efficient as the old one so there should be no downside anyway. >> Btw as it happens, just last week as I was playing with perf, I did >> notice busy spinning is the top cycle waster in some benchmarks. I >> was in the process of trying to quantize the difference with it on >> or off but did not complete it. > > I found that spin-request appearing in profiles makes tracking down the > culprit higer in the stack much easier. Otherwise you have to remember to > enable a pass with the tracepoint to find the stalls (or use > intel-gpu-overlay which does it for you). I'll put it on my TODO list of things to play with. >>> +static u64 local_clock_us(unsigned *cpu) >>> +{ >>> + u64 t; >>> + >>> + *cpu = get_cpu(); >>> + t = local_clock() >> 10; >> >> Needs comment I think to explicitly mention the approximation, or >> maybe drop the _us suffix? > > I did consider _approx_us but thought that was overkill. A comment along > the lines of > /* Approximately convert ns to us - the error is less than the > * truncation! > */ And the result is not used in subsequent calculations apart from comparing against an approximate timeout? >>> @@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state) >>> if (signal_pending_state(state, current)) >>> break; >>> >>> - if (time_after_eq(jiffies, timeout)) >>> + if (busywait_stop(timeout, cpu)) >>> break; >>> >>> cpu_relax_lowlatency(); >>> >> >> Otherwise looks good. Not sure what would you convert to 32-bit from >> your follow up reply since you need us resolution? > > s/u64/unsigned long/ s/time_after64/time_after/ > > 32bits of us resolution gives us 1000s before wraparound between the two > samples. And I hope that a 1000s doesn't pass between loops. Or if it does, > the GPU managed to complete its task. Now I see that you did say low bits.. yes that sounds fine. Btw while you are optimizing things maybe pick up this micro optimization: http://patchwork.freedesktop.org/patch/64339/ Not in scope of this thread but under the normal development patch flow. Btw2, any benchmark result changes with this? Regards, Tvrtko -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/