From: Vincent Guittot
Date: Wed, 16 May 2018 09:12:36 +0200
Subject: Re: [PATCH 3/3] sched/fair: schedutil: explicit update only when required
To: Patrick Bellasi
Cc: Joel Fernandes, linux-kernel, "open list:THERMAL", Ingo Molnar, Peter Zijlstra,
Wysocki" , Viresh Kumar , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Joel Fernandes , Steve Muckle Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 15 May 2018 at 16:53, Patrick Bellasi wrote: > On 15-May 12:19, Vincent Guittot wrote: >> On 14 May 2018 at 18:32, Patrick Bellasi wrote: >> > On 12-May 23:25, Joel Fernandes wrote: >> >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote: >> >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote: > > [...] > >> >> > One question about this change. In enqueue, throttle and unthrottle - you are >> >> > conditionally calling cpufreq_update_util incase the task was >> >> > visible/not-visible in the hierarchy. >> >> > >> >> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent. >> >> > Is this because of util_est or something? Could you add a comment here >> >> > explaining why this is so? >> >> >> >> The big question I have is incase se != NULL, then its still visible at the >> >> root RQ level. >> > >> > My understanding it that you get !se at dequeue time when we are >> > dequeuing a task from a throttled RQ. Isn't it? >> >> Yes se becomes NULL only when you reach root domain > > Right, my point was mainly what I'm saing below: a task removed from a > "throttled" cfs_rq is _already_ not visible from the root cfs_rq since > it has been de-accounted at throttle_cfs_rq() time. > > [...] > >> > At dequeue time instead, since we certainly removed some estimated >> > utilization, then I unconditionally updated schedutil. >> > >> > HOWEVER, I was not considering these two things: >> > >> > 1. for a task going to sleep, we still have its blocked utilization >> > accounted in the cfs_rq utilization. >> >> It might be still interesting to reduce the frequency because the >> blocked utilization can be lower than its estimated utilization. > > Good point, this is the case of a task which, in its last activation, > executed for less time then its previous estimated utilization. > > However, it could also very well be the opposite, a task which > executed more then its past activation. In this case a schedutil > update could trigger a frequency increase. > Thus, the scheduler knows that we are going to sleep: does is really > makes sense to send a notification in this case? Why do you say that we are going to sleep ? a task that does to sleep doesn't mean that cpu is going to sleep as well > > To me that's not a policy to choose when it makes sense to change the > frequency, but just the proper definition of when it makes sense to > send a notification. > > IMHO we should better consider not only (and blindly) the utilization > changes but also what the scheduler knows about the status of a task. > Thus: if the utilization change while a task is running, it's worth to > send a notification. While, when a task is done on a CPU and that CPU > is likely going to be idle, then maybe we should skip the > notification. I don't think so. Depending of the c-state, the power consumption can be impacted and in addition we will have to do the frequency change at wake up > >> > 2. for a task being migrated, at dequeue time we still have not >> > removed the task's utilization from the cfs_rq's utilization. 
>
>> > 2. For a task being migrated, at dequeue time we still have not
>> >    removed the task's utilization from the cfs_rq's utilization.
>> >    This usually happens later, for example we can have:
>> >
>> >    move_queued_task()
>> >        dequeue_task()                       --> CFS task dequeued
>> >        set_task_cpu()                       --> schedutil updated
>> >            migrate_task_rq_fair()
>> >                detach_entity_cfs_rq()
>> >                    detach_entity_load_avg() --> CFS util removal
>> >        enqueue_task()
>> >
>> > Moreover, the "CFS util removal" actually affects the cfs_rq only if
>> > we hold the RQ lock; otherwise we know that it's just back-annotated
>> > as "removed" utilization, and the actual cfs_rq utilization is fixed up
>> > at the next chance we have the RQ lock.
>> >
>> > Thus, I would say that in both cases, at dequeue time it does not make
>> > sense to update schedutil, since we always see the task's utilization
>> > in the cfs_rq and thus we will not reduce the frequency.
>>
>> Yes, only attach/detach make sense from a utilization PoV, and that's
>> where we should check for a frequency update for utilization.
>
> Indeed, I was considering the idea to move the callbacks there, which
> are the only code places where we know for sure that some utilization
> joined or departed from a CPU.
>
> I still have to check better however, because these functions can also
> be called for non-root cfs_rqs... thus we will need again the
> filtering condition we have now in the wrapper function.

Yes, probably.
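(For reference, the "filtering condition in the wrapper function" mentioned
above is essentially "only the root cfs_rq triggers a cpufreq update". Below
is a self-contained userspace sketch of that condition, not real kernel code;
the names only loosely mirror cfs_rq_util_change() in kernel/sched/fair.c.)

#include <stdio.h>

struct cfs_rq { const char *name; };
struct rq { struct cfs_rq cfs; /* root cfs_rq embedded in the rq */ };

static void cpufreq_update_util(struct rq *rq, unsigned int flags)
{
	printf("cpufreq update requested (flags=%u)\n", flags);
}

static void cfs_rq_util_change(struct rq *rq, struct cfs_rq *cfs_rq)
{
	/* Only the root cfs_rq is what schedutil actually cares about. */
	if (&rq->cfs == cfs_rq)
		cpufreq_update_util(rq, 0);
	else
		printf("update on '%s' filtered out\n", cfs_rq->name);
}

int main(void)
{
	struct rq rq = { .cfs = { "root" } };
	struct cfs_rq group = { "some task group" };

	cfs_rq_util_change(&rq, &group);   /* nested group: filtered out */
	cfs_rq_util_change(&rq, &rq.cfs);  /* root: triggers the update */
	return 0;
}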
>
>> > NOTE, this is true independently from the refactoring I'm proposing.
>> > At dequeue time, although we call update_load_avg() on the root RQ,
>> > it does not make sense to update schedutil, since we still see either
>> > the blocked utilization of a sleeping task or the not yet removed
>> > utilization of a migrating task. In both cases the risk is to ask for
>> > a higher OPP right when a CPU is going to be IDLE.
>>
>> We have to take care not to mix the opportunity to update the
>> frequency when we are updating the utilization with the policy that we
>> want to apply regarding (what we think is) the best time to
>> update the frequency. Like saying that we should wait a bit more to
>> make sure that the current utilization is sustainable, because a
>> frequency change is expensive on the platform (or not).
>
> I completely agree on keeping utilization update notifications separated
> from schedutil decisions...
>
>> It's not because a task is dequeued that we should not update and
>> increase the frequency; or even that we should not decrease it because
>> we have just taken into account some removed utilization of a previous
>> migration.
>> The same happens when a task migrates: we don't know if the utilization
>> that is about to be migrated will be higher or lower than the normal
>> update of the utilization (since the last update) and cannot generate
>> a frequency change.
>>
>> I see your explanation above like a kind of policy where you want to
>> balance the cost of a frequency change with the probability that we
>> will not have to re-update the frequency soon.
>
> That was not my thinking. What I wanted to say is just that we should
> send notifications when they really make sense, because we have the most
> valuable information to pass.
>
> Thus, notifying schedutil when we update the RQ utilization is a bit
> of a greedy approach with respect to the information the scheduler has.
>
> In the migration example above:
>  - first we update the RQ utilization
>  - then we actually remove from the RQ the utilization of the migrated
>    task
> If we notify schedutil at the first step we are more likely to pass
> already outdated information, since from the scheduler standpoint we
> know that we are going to reduce the CPU utilization quite soon.
> Thus, would it not be better to defer the notification to detach time?

Better or not, I would say that this depends on the platform: the cost of
changing the frequency, how many OPPs there are, and the gap between
these OPPs ...

>
> After all, that's the original goal of this patch.
>
>> I agree that some scheduling events give higher chances of a
>> sustainable utilization level and we should favor these events when
>> the frequency change is costly, but I'm not sure that we should remove
>> all other opportunities to adjust the frequency to the current
>> utilization level when the cost is low or negligible.
>
> Maybe we can try to run hackbench to quantify the overhead we add with
> useless schedutil updates. However, my main concern is that if we
> want a proper decoupling between the scheduler and schedutil, then we
> also have to ensure that we call back for updates only when it really
> makes sense.
>
> Otherwise, the risk is that the schedutil policy will take decisions
> based on wrong assumptions like: ok, let's increase the OPP (since I
> can now and it's cheap), without knowing that the CPU is instead going
> to be almost empty or even IDLE.
>
>> Can't we classify the utilization events into some kind of major and
>> minor changes?
>
> Doesn't a classification itself look more like a policy?
>
> Maybe we can consider it, but still I think we should be able to find
> when the scheduler has the most accurate and updated information about
> the tasks actually RUNNABLE on a CPU, and at that point send a
> notification to schedutil.
>
> IMO there are few small events when the utilization could have big
> changes: and these are wakeups (because of util_est) and migrations.

These 2 events look very few.

> For all the rest, the tick should be a good enough update rate,
> considering also that, even at 250Hz, in 4ms PELT never builds up more
> than ~8%.

And at 100Hz, which is the default for arm32, it's almost 20%.
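(Just to double-check those two percentages: with the 32ms PELT half-life,
utilization built up from zero over t ms is roughly 1 - 0.5^(t/32). A quick
standalone check in plain C, not the kernel's fixed-point PELT code:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double halflife_ms = 32.0;		/* PELT half-life */
	const double tick_ms[] = { 4.0, 10.0 };		/* 250Hz and 100Hz tick periods */

	for (int i = 0; i < 2; i++) {
		/* geometric-series approximation of PELT ramp-up from zero */
		double ramp = 1.0 - pow(0.5, tick_ms[i] / halflife_ms);
		printf("%.0fms tick: utilization ramps up by ~%.1f%%\n",
		       tick_ms[i], 100.0 * ramp);	/* ~8.3% and ~19.5% */
	}
	return 0;
}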
>
> [...]
>
>> > .:: Conclusions
>> > ===============
>> >
>> > All that considered, I think I've convinced myself that we really need
>> > to notify schedutil only in these cases:
>> >
>> >  1. enqueue time
>> >     because of the changes in estimated utilization and the
>> >     possibility to go straight to a better OPP
>> >
>> >  2. task tick time
>> >     because of the possible ramp-up of the utilization
>> >
>> > Another case is related to remote CPUs' blocked utilization updates,
>> > after Vincent's recent patches. Currently indeed:
>> >
>> >   update_blocked_averages()
>> >       update_load_avg()
>> >           --> update schedutil
>> >
>> > and thus, potentially we wake up an IDLE cluster just to reduce its
>> > OPP. If the cluster is in a deep idle state, I'm not entirely sure
>> > this is good from an energy saving standpoint.
>> > However, with the patch I'm proposing we are missing that support,
>> > meaning that an IDLE cluster will get its utilization decayed but we
>> > don't wake it up just to drop its frequency.
>>
>> So more than deciding in the scheduler if we should wake it up or not,
>> we should give cpufreq a chance to decide if it wants to update the
>> frequency or not, as this decision is somehow platform-specific: cost
>> of frequency change, clock topology and shared clocks, voltage topology
>> ...
>
> Fair enough, then we should keep updating schedutil from that code path.
>
> What about adding a new explicit callback at the end of:
>    update_blocked_averages() ?
>
> Something like:
>
> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb77407ba485..6eb0f31c656d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
>         if (done)
>                 rq->has_blocked_load = 0;
>  #endif
> +
> +       cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> +
>         rq_unlock_irqrestore(rq, &rf);
>  }
> ---8<---
>
> Where we can also pass in a new SCHED_CPUFREQ_IDLE flag, just to notify
> schedutil that the CPU is currently IDLE?
>
> Could that work?
>
> --
> #include
>
> Patrick Bellasi
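(To make the intent of the proposed SCHED_CPUFREQ_IDLE flag concrete, here is
a standalone toy rather than real schedutil code; the flag, its value and the
sugov-like names below are all hypothetical, since the flag is only being
proposed in this thread. The sketch shows one possible policy: record the new
utilization but skip an immediate OPP change when the update comes from an
idle CPU.)

#include <stdbool.h>
#include <stdio.h>

#define SCHED_CPUFREQ_IDLE	(1U << 3)	/* hypothetical flag value */

struct sugov_cpu_model {
	unsigned long util;		/* last reported utilization */
	bool need_freq_update;		/* re-evaluate at the next busy update */
};

static void sugov_update_model(struct sugov_cpu_model *sg,
			       unsigned long util, unsigned int flags)
{
	sg->util = util;		/* always record the new utilization */

	if (flags & SCHED_CPUFREQ_IDLE) {
		/*
		 * CPU is idle: note that a re-evaluation is pending instead
		 * of waking the hardware just to lower the OPP right now.
		 */
		sg->need_freq_update = true;
		printf("idle update: util=%lu recorded, OPP change deferred\n",
		       sg->util);
		return;
	}

	printf("busy update: frequency re-evaluated for util=%lu\n", sg->util);
}

int main(void)
{
	struct sugov_cpu_model sg = { .util = 0, .need_freq_update = false };

	sugov_update_model(&sg, 300, 0);			/* busy update */
	sugov_update_model(&sg, 120, SCHED_CPUFREQ_IDLE);	/* idle decay */
	return 0;
}

Whether the governor actually defers the change or applies it immediately
would be exactly the platform-specific decision discussed above.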