Date: Mon, 20 Aug 2012 17:36:39 +0200
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler
From: Vincent Guittot
To: Peter Zijlstra
Cc: Alex Shi, Suresh Siddha, Arjan van de Ven, svaidy@linux.vnet.ibm.com, Ingo Molnar, Andrew Morton, Linus Torvalds, "linux-kernel@vger.kernel.org", Thomas Gleixner, Paul Turner
In-Reply-To: <1345028738.31459.82.camel@twins>
References: <5028F12C.7080405@intel.com> <1345028738.31459.82.camel@twins>

On 15 August 2012 13:05, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit.
>> So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>     update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>>     if (sd->nr_running > sd's capacity) {
>>         //power saving policy is not suitable for
>>         //this scenario, it runs like performance policy
>>         mv tasks from busiest cpu in busiest group to
>>         idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>     } else {// the sd has enough capacity to hold all tasks.
>>         if (sg->nr_running > sg's capacity) {
>>             //imbalanced between groups
>>             if (schedule policy == performance) {
>>                 //when 2 busiest group at same busy
>>                 //degree, need to prefer the one has
>>                 // softest group??
>>                 move tasks from busiest group to
>>                 idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>             } else if (schedule policy == power)
>>                 move tasks from busiest group to
>>                 idlest group until busiest is just full
>>                 of capacity.
>>                 //the busiest group can balance
>>                 //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>>         } else {
>>             //all groups has enough capacity for its tasks.
>>             if (schedule policy == performance)
>>                 //all tasks may has enough cpu
>>                 //resources to run,
>>                 //mv tasks from busiest to idlest group?
>>                 //no, at this time, it's better to keep
>>                 //the task on current cpu.
>>                 //so, it is maybe better to do balance
>>                 //in each of groups
>>                 for_each_imbalance_groups()
>>                     move tasks from busiest cpu to
>>                     idlest cpu in each of groups;
>>             else if (schedule policy == power) {
>>                 if (no hard pin in idlest group)
>>                     mv tasks from idlest group to
>>                     busiest until busiest full.
>>                 else
>>                     mv unpin tasks to the biggest
>>                     hard pin group.
>>             }
>>         }
>>     }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub proposal:
>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>> cpu'. If so, it may can reduce one more time balancing.
>> The idlest cpu can prefer the new idle cpu; and is the least load cpu;
>> 2, se or task load is good for running time setting.
>> but it should the second basis in load balancing.
>> The first basis of LB
>> is running tasks' number in group/cpu. Since whatever of the weight of
>> groups is, if the tasks number is less than cpu number, the group is
>> still has capacity to take more tasks. (will consider the SMT cpu power
>> or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want it to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJTs per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?

Sorry for the late reply, but I had almost no network access over the
last few weeks.

Linaro is indeed also working on a power aware scheduler, as Peter
mentioned.

Based on previous tests, we have concluded that the main drawback of the
(now removed) old power scheduler was that it had no way to distinguish
between short and long running tasks, whereas this is a key input (at
least for phones) both for deciding whether to pack tasks and for
selecting the core on an asymmetric system.

Another key piece of information is the power distribution in the
system, which can have a finer granularity than the current sched_domain
description. Peter's proposal was to use a SHARE_POWERLINE flag, similar
to the flags that already describe whether a sched_domain shares
resources or cpu capacity.

With these two new pieces of information, we can have a first power
saving scheduler which spreads or packs tasks across cores and packages.

Vincent

>
>> unsolved issues:
>> 1, like current scheduler, it didn't handled cpu affinity well in
>> load_balance.
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2, task group that isn't consider well in this rough proposal.
>
> You mean the cgroup mess?
>
>> It isn't consider well and may has mistaken . So just share my ideas and
>> hope it become better and workable in your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpu that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
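
As a rough illustration of the utilization check and the 'pack' placement
discussed in the quoted text, here is a minimal sketch in plain C. It is
not kernel code and not taken from any posted patch: struct cpu_stat,
task_fits(), pick_pack_cpu() and the use of percentages for utilization
are all hypothetical names and assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu bookkeeping; not a kernel structure. */
struct cpu_stat {
    unsigned int util_pct; /* share of recent time spent non-idle, 0..100 */
};

/*
 * Would placing a task with the given runnable average on this cpu keep
 * the cpu at or below the utilization limit?
 */
static int task_fits(const struct cpu_stat *cpu, unsigned int task_avg_pct,
                     unsigned int limit_pct)
{
    assert(limit_pct <= 100);
    return cpu->util_pct + task_avg_pct <= limit_pct;
}

/*
 * 'Pack' placement: prefer the busiest cpu that can still absorb the
 * task, so that idle cpus are only used when nothing busier fits.
 * Returns the cpu index, or -1 if the task fits nowhere.
 */
static int pick_pack_cpu(const struct cpu_stat *cpus, size_t nr_cpus,
                         unsigned int task_avg_pct, unsigned int limit_pct)
{
    int best = -1;
    size_t i;

    for (i = 0; i < nr_cpus; i++) {
        if (!task_fits(&cpus[i], task_avg_pct, limit_pct))
            continue;
        /* Keep the first cpu found with the highest utilization. */
        if (best < 0 || cpus[i].util_pct > cpus[best].util_pct)
            best = (int)i;
    }
    return best;
}
```

A 'spread' (performance) policy would flip the comparison and pick the
least utilized cpu instead. A -1 result corresponds to the overflow case
in which another power domain has to be woken.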