From: Juri Lelli
Date: Tue, 11 Aug 2015 16:07:48 +0100
To: Vincent Guittot
CC: Morten Rasmussen, Peter Zijlstra, "mingo@redhat.com",
 Daniel Lezcano, Dietmar Eggemann, Yuyang Du, Michael Turquette,
 "rjw@rjwysocki.net", Sai Charan Gurrappadi, "pang.xunlei@zte.com.cn",
 linux-kernel, "linux-pm@vger.kernel.org"
Subject: Re: [RFCv5 PATCH 41/46] sched/fair: add triggers for OPP change requests

Hi Vincent,

On 11/08/15 12:41, Vincent Guittot wrote:
> On 11 August 2015 at 11:08, Juri Lelli wrote:
>> On 10/08/15 16:07, Vincent Guittot wrote:
>>> On 10 August 2015 at 15:43, Juri Lelli wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> On 04/08/15 14:41, Vincent Guittot wrote:
>>>>> Hi Juri,
>>>>>
>>>>> On 7 July 2015 at 20:24, Morten Rasmussen wrote:
>>>>>> From: Juri Lelli
>>>>>>
>>>>>> Each time a task is {en,de}queued we might need to adapt the current
>>>>>> frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
>>>>>> this purpose. Only trigger a freq request if we are effectively waking up
>>>>>> or going to sleep. Filter out load balancing related calls to reduce the
>>>>>> number of triggers.
>>>>>>
>>>>>> cc: Ingo Molnar
>>>>>> cc: Peter Zijlstra
>>>>>>
>>>>>> Signed-off-by: Juri Lelli
>>>>>> ---
>>>>>>  kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>>>>>  1 file changed, 40 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index f74e9d2..b8627c6 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -4281,7 +4281,10 @@ static inline void hrtick_update(struct rq *rq)
>>>>>>  }
>>>>>>  #endif
>>>>>>
>>>>>> +static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>> +
>>>>>>  static bool cpu_overutilized(int cpu);
>>>>>> +static unsigned long get_cpu_usage(int cpu);
>>>>>>  struct static_key __sched_energy_freq __read_mostly = STATIC_KEY_INIT_FALSE;
>>>>>>
>>>>>>  /*
>>>>>> @@ -4332,6 +4335,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>                 if (!task_new && !rq->rd->overutilized &&
>>>>>>                     cpu_overutilized(rq->cpu))
>>>>>>                         rq->rd->overutilized = true;
>>>>>> +               /*
>>>>>> +                * We want to trigger a freq switch request only for tasks that
>>>>>> +                * are waking up; this is because we get here also during
>>>>>> +                * load balancing, but in these cases it seems wise to trigger
>>>>>> +                * a single request after load balancing is done.
>>>>>> +                *
>>>>>> +                * XXX: how about fork()? Do we need a special flag/something
>>>>>> +                * to tell if we are here after a fork() (wakeup_task_new)?
>>>>>> +                *
>>>>>> +                * Also, we add a margin (same ~20% used for the tipping point)
>>>>>> +                * to our request to provide some head room if p's utilization
>>>>>> +                * further increases.
>>>>>> +                */
>>>>>> +               if (sched_energy_freq() && !task_new) {
>>>>>> +                       unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>> +
>>>>>> +                       req_cap = req_cap * capacity_margin
>>>>>> +                                       >> SCHED_CAPACITY_SHIFT;
>>>>>> +                       cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>> +               }
>>>>>>         }
>>>>>>         hrtick_update(rq);
>>>>>>  }
>>>>>> @@ -4393,6 +4416,23 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>>         if (!se) {
>>>>>>                 sub_nr_running(rq, 1);
>>>>>>                 update_rq_runnable_avg(rq, 1);
>>>>>> +               /*
>>>>>> +                * We want to trigger a freq switch request only for tasks that
>>>>>> +                * are going to sleep; this is because we get here also during
>>>>>> +                * load balancing, but in these cases it seems wise to trigger
>>>>>> +                * a single request after load balancing is done.
>>>>>> +                *
>>>>>> +                * Also, we add a margin (same ~20% used for the tipping point)
>>>>>> +                * to our request to provide some head room if p's utilization
>>>>>> +                * further increases.
>>>>>> +                */
>>>>>> +               if (sched_energy_freq() && task_sleep) {
>>>>>> +                       unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>>>>>> +
>>>>>> +                       req_cap = req_cap * capacity_margin
>>>>>> +                                       >> SCHED_CAPACITY_SHIFT;
>>>>>> +                       cpufreq_sched_set_cap(cpu_of(rq), req_cap);
>>>>>
>>>>> Could you clarify why you want to trigger a freq switch for tasks that
>>>>> are going to sleep?
>>>>> The cpu_usage should not change that much, as the se_utilization of
>>>>> the entity moves from utilization_load_avg to utilization_blocked_avg
>>>>> of the rq, and the usage and the freq are updated periodically.
>>>>
>>>> I think we still need to cover multiple back-to-back dequeues. Suppose
>>>> that you have, let's say, 3 tasks that get enqueued at the same time.
>>>> After some time the first one goes to sleep and its utilization, as you
>>>> say, gets moved to utilization_blocked_avg. So, nothing changes, and
>>>> the trigger is superfluous (even if, I guess, no freq change will be
>>>> issued, as we are already servicing enough capacity). However, after a
>>>> while, the second task goes to sleep. Now we still use get_cpu_usage(),
>>>> and the first task's contribution in utilization_blocked_avg should
>>>> have been decayed by this time. The same thing may then happen for the
>>>> third task as well. So, if we don't check whether we need to scale down
>>>> in dequeue_task_fair(), it seems to me that we might miss some
>>>> opportunities, as the blocked contribution of other tasks could have
>>>> been successively decayed.
>>>>
>>>> What do you think?
>>>
>>> The tick is used to monitor such variation of the usage (in both ways:
>>> decay of the usage of sleeping tasks and increase of the usage of
>>> running tasks). So in your example, if the duration between the sleeps
>>> of the 2 tasks is significant enough, the tick will handle this
>>> variation.
>>>
>>
>> The tick is used to decide if we need to scale up (to max OPP for the
>> time being), but we don't scale down. It makes more logical sense to
>
> Why don't you want to check if you need to scale down?
>

Well, because if I'm still executing something the cpu usage is only
subject to rise.

>> scale down at task deactivation, or wakeup after a long time, IMHO.
>
> But waking up or going to sleep doesn't have any impact on the usage
> of a cpu. The only events that impact the cpu usage are:
> - task migration,

We explicitly cover this on the load balancing paths.

> - new task

We cover this in enqueue_task_fair(), introducing a new flag.

> - time that elapses, which can be monitored by periodically checking
> the usage.

Do you mean when a task's utilization crosses some threshold related to
the current OPP? If that is the case, we have a check in
task_tick_fair().
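Just to make that concrete, what I have in mind is along these lines.
This is an illustrative sketch only, not the code from the series:
get_cpu_usage(), capacity_margin and cpufreq_sched_set_cap() are the
ones from the patch above, while capacity_curr_of() is a placeholder
for whatever returns the capacity provided at the current OPP.

/*
 * Illustrative tick-side check: if the cpu's estimated usage,
 * inflated by the same ~20% margin, no longer fits in the capacity
 * provided at the current OPP, kick the OPP up (to max, for the time
 * being, as discussed above).
 */
static void sched_freq_tick(int cpu)
{
        unsigned long req_cap;

        if (!sched_energy_freq())
                return;

        req_cap = get_cpu_usage(cpu) * capacity_margin
                        >> SCHED_CAPACITY_SHIFT;
        if (req_cap > capacity_curr_of(cpu))
                cpufreq_sched_set_cap(cpu, SCHED_CAPACITY_SCALE);
}

With capacity_margin = 1280 and SCHED_CAPACITY_SHIFT = 10, the
inflated value is usage * 1280 / 1024 = usage * 1.25, i.e. we ask for
more capacity once usage grows past ~80% of what the current OPP
provides; that's the same ~20% head room as the tipping point.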
> - and for nohz systems, when a cpu enters or leaves idle state
>

We address this in dequeue_task_fair(). In particular, if the cpu is
going to be idle we don't trigger any change, as it seems not always
wise to wake up a thread just to change the OPP and then go idle; some
platforms might require this behaviour anyway, but that is probably
more cpuidle/fw related?

I would also add:

- task is going to die

We address this in dequeue as well, as its contribution is removed
from usage (mod Yuyang's patches).

> waking up and going to sleep events don't give any useful
> information, and using them to trigger the monitoring of the usage
> variation doesn't give you a predictable/periodic update of it,
> whereas the tick does.
>

So, one key point of this solution is to get away as much as we can
from periodic updates/sampling and move towards a (fully) event-driven
approach. The event logically associated with task_tick_fair() is when
we realize that a task is going to saturate the current capacity; in
this case we trigger a freq switch to a higher capacity.

Also, if we never react to normal wakeups (as I understand you are
proposing), we might miss some chances to adapt quickly enough. As an
example, if you have a big task that suddenly goes to sleep, and
sleeps until its decayed utilization goes almost to zero, then when it
wakes up, if we don't have a trigger in enqueue_task_fair(), we'll
have to wait until the next tick to select an appropriate (low) OPP.
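To put rough numbers on this scenario, here is a quick userspace
approximation. It assumes the standard PELT behaviour where a blocked
contribution halves every 32ms (LOAD_AVG_PERIOD), and uses floating
point where the kernel uses a precomputed fixed-point table; the task
and its utilization are hypothetical.

#include <stdio.h>
#include <math.h>

/* Blocked utilization decays geometrically, halving every 32ms. */
static double decayed_util(double util, unsigned int ms)
{
        return util * pow(0.5, ms / 32.0);
}

int main(void)
{
        double util = 800.0;    /* hypothetical big task, ~80% of 1024 */
        unsigned int ms[] = { 32, 64, 100, 200 };

        for (int i = 0; i < 4; i++)
                printf("asleep %3ums -> util ~%.0f\n",
                       ms[i], decayed_util(util, ms[i]));

        return 0;
}

This prints ~400, ~200, ~92 and ~11: after a couple hundred ms of
sleep the task's contribution is nearly gone, so at wakeup a low OPP
is the appropriate one, and the trigger in enqueue_task_fair() lets us
select it right away instead of keeping the stale OPP until the next
tick. The same decay is why the back-to-back dequeues in the 3-task
example above can each see less blocked contribution than the
previous one.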
Best,

- Juri

>>
>> Best,
>>
>> - Juri
>>
>>> Regards,
>>> Vincent
>>>>
>>>> Thanks,
>>>>
>>>> - Juri
>>>>
>>>>> It should be the same for the wake up of a task in enqueue_task_fair
>>>>> above, even if it's less obvious for this latter use case, because
>>>>> the cpu might wake up from a long idle phase during which its
>>>>> utilization_blocked_avg has not been updated. Nevertheless, a trigger
>>>>> of the freq switch at wake up of the cpu, once its usage has been
>>>>> updated, should do the job.
>>>>>
>>>>> So tick, migration of tasks, new tasks, entering/leaving idle state
>>>>> of the cpu should be enough to trigger freq switches.
>>>>>
>>>>> Regards,
>>>>> Vincent
>>>>>
>>>>>
>>>>>> +               }
>>>>>>         }
>>>>>>         hrtick_update(rq);
>>>>>>  }
>>>>>> @@ -4959,8 +4999,6 @@ static int find_new_capacity(struct energy_env *eenv,
>>>>>>         return idx;
>>>>>>  }
>>>>>>
>>>>>> -static unsigned int capacity_margin = 1280; /* ~20% margin */
>>>>>> -
>>>>>>  static bool cpu_overutilized(int cpu)
>>>>>>  {
>>>>>>         return (capacity_of(cpu) * 1024) <
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>>
>>>>
>>>
>>
>