Date: Tue, 18 Jul 2017 15:23:12 +0200
From: Peter Zijlstra
To: "Li, Aubrey"
Cc: Andi Kleen, Frederic Weisbecker, Christoph Lameter, Aubrey Li,
	tglx@linutronix.de, len.brown@intel.com, rjw@rjwysocki.net,
	tim.c.chen@linux.intel.com, arjan@linux.intel.com,
	paulmck@linux.vnet.ibm.com, yang.zhang.wz@gmail.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, daniel.lezcano@linaro.org
Subject: Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
Message-ID: <20170718132312.sbw52mgmz6wrcbys@hirez.programming.kicks-ass.net>
In-Reply-To: <34371ef8-b8bc-d2bf-93de-3fccd6beb032@linux.intel.com>
User-Agent: NeoMutt/20170609 (1.8.3)

On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote:
> On 2017/7/18 3:23, Peter Zijlstra wrote:
> > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> >>> And as said;
> >>> Daniel has been working on a better predictor -- now he's
> >>> probably not used it on the network workload you're looking at, so that
> >>> might be something to consider.
> >>
> >> Deriving a better idle predictor is a bit orthogonal to fast idle.
> >
> > No. If you want a different C state selected we need to fix the current
> > C state selector. We're not going to tinker.
> >
> > And the predictor is probably the most fundamental part of the whole C
> > state selection logic.
> >
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> >
> > If you want to improve the worst case, we need to consider a cumulative
> > distribution function, and staying with the Gaussian assumption already
> > present, that would mean using:
> >
> >                    1              x - mu
> >          CDF(x) = --- [ 1 + erf(---------------) ]
> >                    2            sigma sqrt(2)
> >
> > Where, per the normal convention, mu is the average and sigma^2 the
> > variance. See also:
> >
> >   https://en.wikipedia.org/wiki/Normal_distribution
> >
> > We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> > the time (IIRC something like 'mu - 2*sigma' ends up being 5% or so).
> >
> > This conceptually gets us better exit latency for the cases where we got
> > it wrong before, and practically pushes down the estimate, which gets us
> > C1 longer.
> >
> > Of course, this all assumes a Gaussian distribution to begin with; if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
>
> Maybe you are talking about applying some machine learning algorithm online
> to fit a multivariate normal distribution, :)

Nah, nothing that fancy..
Something that _could_ work and deals with arbitrary distributions is
buckets divided on the actual C-state selection boundaries and a (cyclic)
array of the N most recent idle times. Something like so:

struct est_data {
	u8	array[64];
	u8	*dist;
	u8	idx;
};

DEFINE_PER_CPU(struct est_data, est_data);

void est_init(void)
{
	/* drv is the cpuidle driver, assumed in scope */
	int size = drv->state_count;
	int cpu;

	for_each_possible_cpu(cpu) {
		per_cpu(est_data, cpu).dist = kzalloc(size, GFP_KERNEL);
		// handle errors
	}
}

u8 est_duration_2_state(u64 duration)
{
	int i;

	for (i = 0; i < drv->state_count; i++) {
		if (duration / 1024 < drv->state[i].target_residency)
			return i;
	}

	return i - 1;
}

void est_contemplate(u64 duration)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int state = est_duration_2_state(duration);
	int idx = ed->idx++ % ARRAY_SIZE(ed->array);

	ed->dist[ed->array[idx]]--;
	ed->array[idx] = state;
	ed->dist[ed->array[idx]]++;
}

int est_state(int pct)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int limit = pct * ARRAY_SIZE(ed->array) / 100; /* XXX move div out of line */
	int cnt = 0, last = 0;
	int i;

	/* CDF */
	for (i = 0; i < drv->state_count; i++) {
		cnt += ed->dist[i];
		if (cnt > limit)
			break;
		last = i;
	}

	return last;
}

> Well, back to the problem, when the scheduler picks up the idle thread, it
> does not look at the history, nor make a prediction. So it's possible it
> has to switch back to a task ASAP when it's going into idle (very common
> under some workloads).
>
> That is, (idle_entry + idle_exit) > idle. If the system has multiple
> hardware idle states, then:
>
> (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep
>
> So we eventually want the idle path lighter than what we currently have.
> A complex predictor may have high accuracy, but the cost could be high as
> well.

I never suggested anything complex. The current menu thing uses an
average; all I said is that if, instead of the average, you use something
less, say 'avg - 2*stdev' (it already computes the stdev), you get
something which, assuming Gaussian, gets the exit latency wrong less than
~5% of the time.
The above, also a simple thing, uses a generic (empirical) distribution
function, which works because it buckets on the exact C-state boundaries
we're limited to anyway. Of course, the above still needs to be augmented
with the IRQ bits etc..