Date: Fri, 14 Aug 2015 14:02:48 +0100
From: Morten Rasmussen
To: Peter Zijlstra
Cc: mingo@redhat.com, vincent.guittot@linaro.org, daniel.lezcano@linaro.org,
 Dietmar Eggemann, yuyang.du@intel.com, mturquette@baylibre.com,
 rjw@rjwysocki.net, Juri Lelli, sgurrappadi@nvidia.com,
 pang.xunlei@zte.com.cn, linux-kernel@vger.kernel.org,
 linux-pm@vger.kernel.org
Subject: Re: [RFCv5 PATCH 25/46] sched: Add over-utilization/tipping point indicator
Message-ID: <20150814130247.GD29326@e105550-lin.cambridge.arm.com>
In-Reply-To: <20150813173533.GZ19282@twins.programming.kicks-ass.net>
References: <1436293469-25707-1-git-send-email-morten.rasmussen@arm.com>
 <1436293469-25707-26-git-send-email-morten.rasmussen@arm.com>
 <20150813173533.GZ19282@twins.programming.kicks-ass.net>

On Thu, Aug 13, 2015 at 07:35:33PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 07:24:08PM +0100, Morten Rasmussen wrote:
> > Energy-aware scheduling is only meant to be active while the system is
> > _not_ over-utilized. That is, there are spare cycles available to shift
> > tasks around based on their actual utilization to get a more
> > energy-efficient task distribution without depriving any tasks. When
> > above the tipping point, task placement is done the traditional way,
> > spreading the tasks across as many cpus as possible based on
> > priority-scaled load to preserve smp_nice.
> >
> > The over-utilization condition is conservatively chosen to indicate
> > over-utilization as soon as one cpu is fully utilized at its highest
> > frequency. We don't consider groups, as lumping usage and capacity
> > together for a group of cpus may hide the fact that one or more cpus
> > in the group are over-utilized while group-siblings are partially
> > idle. The tasks could be served better if moved to another group with
> > completely idle cpus. This is particularly problematic if some cpus
> > have a significantly reduced capacity due to RT/IRQ pressure or if the
> > system has cpus of different capacity (e.g. ARM big.LITTLE).
>
> I might be tired, but I'm having a very hard time deciphering this
> second paragraph.

I can see why, let me try again :-)

It is essentially about when we make balancing decisions based on
load_avg and when based on util_avg (using the new names from Yuyang's
rewrite). As you mentioned in another thread recently, we want to use
util_avg until the system is over-utilized and then switch to load_avg.
We need to define the conditions that determine the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define over-utilized
as being something like:

	sum_{cpus}(rq::cfs::avg::util_avg) + margin > sum_{cpus}(rq::capacity)

some individual cpus may be over-utilized, running multiple tasks, even
when the above condition is false.
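To make the notation concrete, a minimal C sketch of that system-wide
check (illustrative only: cpu_rq(), for_each_online_cpu() and
SCHED_CAPACITY_SCALE are existing kernel symbols, but the helper name,
the margin value and the exact field paths are assumptions, not the
patch code):

	/* Purely illustrative headroom: 1/8 of max capacity (1024). */
	#define OU_MARGIN	(SCHED_CAPACITY_SCALE >> 3)

	static bool system_wide_overutilized(void)
	{
		unsigned long sum_util = 0, sum_cap = 0;
		int cpu;

		for_each_online_cpu(cpu) {
			sum_util += cpu_rq(cpu)->cfs.avg.util_avg; /* rq::cfs::avg::util_avg */
			sum_cap += cpu_rq(cpu)->cpu_capacity;      /* rq::capacity */
		}

		/* sum_{cpus}(util_avg) + margin > sum_{cpus}(capacity) */
		return sum_util + OU_MARGIN > sum_cap;
	}

E.g. on two cpus of capacity 1024 each, this can report "not
over-utilized" while one cpu sits at 100% with tasks queued up and the
other is half idle, which is exactly the problem described above.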
Individual cpus being over-utilized should be okay as long as we try to
spread the tasks out to avoid per-cpu over-utilization as much as
possible, and as long as all tasks have the _same_ priority. If the
latter isn't true, we have to consider priority to preserve smp_nice.
For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with the nice=-10 tasks sharing cpus and the nice=0
tasks getting cpus of their own, as we have 1.5*n_cpus tasks in total
and 55%+55% is less over-utilized than 55%+60% for those cpus that have
to be shared. The system utilization is only 85% of the system capacity
(n_cpus*55% + (n_cpus/2)*60% = n_cpus*85%), but we are breaking
smp_nice.

To be sure not to break smp_nice, we have instead defined
over-utilization as when:

	cpu_rq(any)::cfs::avg::util_avg + margin > cpu_rq(any)::capacity

is true for any cpu in the system. IOW, as soon as one cpu is (nearly)
100% utilized, we switch to load_avg to factor in priority (see the
sketch at the end of this mail).

Now with this definition we can skip periodic load-balance, as no cpu
has an always-running task when the system is not over-utilized. All
tasks will be periodic and we can balance them at wake-up. This
conservative condition does, however, mean that some scenarios that
could still benefit from energy-aware decisions while one cpu is fully
utilized would not get those benefits.

For systems where some cpus might have reduced capacity (RT pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as
just a single cpu is fully utilized, as it might be one of those with
reduced capacity, and in that case we want to migrate the task.

I haven't found any reasonably easy-to-track conditions that would work
better. Suggestions are very welcome.
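For reference, the per-cpu condition above rendered as a C sketch in
the same illustrative style as before, reusing the made-up OU_MARGIN
from the earlier sketch (cpu_overutilized() is an assumed helper name;
only cpu_rq() and for_each_online_cpu() are real kernel symbols):

	/* cpu_rq(cpu)::cfs::avg::util_avg + margin > cpu_rq(cpu)::capacity */
	static bool cpu_overutilized(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		return rq->cfs.avg.util_avg + OU_MARGIN > rq->cpu_capacity;
	}

	static bool system_overutilized(void)
	{
		int cpu;

		/* One (nearly) fully utilized cpu tips the whole system. */
		for_each_online_cpu(cpu)
			if (cpu_overutilized(cpu))
				return true;

		return false;
	}

Compared to the system-wide sum, this tips as soon as any one cpu
saturates, which is what lets us fall back to load_avg early enough to
preserve smp_nice in the example above.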