Date: Tue, 15 May 2018 15:53:43 +0100
From: Patrick Bellasi
To: Vincent Guittot
Cc: Joel Fernandes, linux-kernel, "open list:THERMAL", Ingo Molnar,
    Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar, Dietmar Eggemann,
    Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle
Subject: Re: [PATCH 3/3] sched/fair: schedutil: explicit update only when required
Message-ID: <20180515145343.GJ30654@e110439-lin>
References: <20180510150553.28122-1-patrick.bellasi@arm.com>
 <20180510150553.28122-4-patrick.bellasi@arm.com>
 <20180513060443.GB64158@joelaf.mtv.corp.google.com>
 <20180513062538.GA116730@joelaf.mtv.corp.google.com>
 <20180514163206.GF30654@e110439-lin>
X-Mailing-List: linux-kernel@vger.kernel.org

On 15-May 12:19, Vincent Guittot wrote:
> On 14 May 2018 at 18:32, Patrick Bellasi wrote:
> > On 12-May 23:25, Joel Fernandes wrote:
> >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:

[...]

> >> > One question about this change. In enqueue, throttle and
> >> > unthrottle you are conditionally calling cpufreq_update_util in
> >> > case the task was visible/not-visible in the hierarchy.
> >> >
> >> > But in dequeue you're unconditionally calling it. Seems a bit
> >> > inconsistent. Is this because of util_est or something? Could you
> >> > add a comment here explaining why this is so?
> >>
> >> The big question I have is: in case se != NULL, is it still visible
> >> at the root RQ level?
> >
> > My understanding is that you get !se at dequeue time when we are
> > dequeuing a task from a throttled RQ. Isn't it?
>
> Yes, se becomes NULL only when you reach the root domain.

Right, my point was mainly what I'm saying below: a task removed from a
"throttled" cfs_rq is _already_ not visible from the root cfs_rq, since
it has been de-accounted at throttle_cfs_rq() time.

[...]
> > At dequeue time instead, since we certainly removed some estimated
> > utilization, I unconditionally updated schedutil.
> >
> > HOWEVER, I was not considering these two things:
> >
> >  1. for a task going to sleep, we still have its blocked utilization
> >     accounted in the cfs_rq utilization.
>
> It might still be interesting to reduce the frequency, because the
> blocked utilization can be lower than its estimated utilization.

Good point, this is the case of a task which, in its last activation,
executed for less time than its previous estimated utilization.

However, it could also very well be the opposite: a task which executed
for more time than in its past activations. In that case a schedutil
update could trigger a frequency increase.

Thus, since the scheduler knows that we are going to sleep: does it
really make sense to send a notification in this case?

To me that's not a policy choice about when it makes sense to change
the frequency, but just the proper definition of when it makes sense to
send a notification.

IMHO we should not consider only (and blindly) the utilization changes,
but also what the scheduler knows about the status of a task. Thus: if
the utilization changes while a task is running, it's worth sending a
notification. Whereas, when a task is done on a CPU and that CPU is
likely going to be idle, then maybe we should skip the notification.

> >  2. for a task being migrated, at dequeue time we still have not
> >     removed the task's utilization from the cfs_rq's utilization.
> >     This usually happens later; for example we can have:
> >
> >       move_queued_task()
> >         dequeue_task()                 --> CFS task dequeued
> >         set_task_cpu()                 --> schedutil updated
> >           migrate_task_rq_fair()
> >             detach_entity_cfs_rq()
> >               detach_entity_load_avg() --> CFS util removal
> >         enqueue_task()
> >
> > Moreover, the "CFS util removal" actually affects the cfs_rq only if
> > we hold the RQ lock; otherwise we know that it's just back-annotated
> > as "removed" utilization, and the actual cfs_rq utilization is fixed
> > up at the next chance we have the RQ lock.
> >
> > Thus, I would say that in both cases, at dequeue time it does not
> > make sense to update schedutil, since we still see the task's
> > utilization in the cfs_rq and thus we will not reduce the frequency.
>
> Yes, only attach/detach make sense from a utilization PoV, and that's
> where we should check for a frequency update for utilization.

Indeed, I was considering the idea of moving the callbacks there, since
those are the only code places where we know for sure that some
utilization has joined or departed from a CPU. I still have to check
this better, however, because these functions can be called also for
non-root cfs_rqs... thus we would need again the filtering condition we
have now in the wrapper function.

> > NOTE, this is true independently from the refactoring I'm proposing.
> > At dequeue time, although we call update_load_avg() on the root RQ,
> > it does not make sense to update schedutil, since we still see either
> > the blocked utilization of a sleeping task or the not-yet-removed
> > utilization of a migrating task. In both cases the risk is to ask for
> > a higher OPP right when a CPU is going to be IDLE.
>
> We have to take care not to mix the opportunity to update the
> frequency, when we are updating the utilization, with the policy that
> we want to apply regarding (what we think is) the best time to
> update the frequency.
> Like saying that we should wait a bit more to make sure that the
> current utilization is sustainable, because a frequency change is
> expensive on the platform (or not).

I completely agree on keeping utilization update notifications
separated from schedutil decisions...

> It's not because a task is dequeued that we should not update and
> increase the frequency; or even that we should not decrease it,
> because we have just taken into account some removed utilization of a
> previous migration.
> The same happens when a task migrates: we don't know if the
> utilization that is about to be migrated will be higher or lower than
> the normal update of the utilization (since the last update), and so
> it cannot generate a frequency change.
>
> I see your explanation above like a kind of policy where you want to
> balance the cost of a frequency change with the probability that we
> will not have to re-update the frequency soon.

That was not my thinking. What I wanted to say is just that we should
send a notification when it really makes sense: when we have the most
valuable information to pass.

Thus, notifying schedutil whenever we update the RQ utilization is a
bit of a greedy approach with respect to the information the scheduler
has. In the migration example above:
 - first we update the RQ utilization
 - then we actually remove from the RQ the utilization of the migrated
   task

If we notify schedutil at the first step, we are more likely to pass
already outdated information, since from the scheduler standpoint we
know that we are going to reduce the CPU utilization quite soon.

Thus, would it not be better to defer the notification to detach time?
After all, that's the original goal of this patch.

> I agree that some scheduling events give higher chances of a
> sustainable utilization level, and we should favor these events when
> the frequency change is costly. But I'm not sure that we should remove
> all other opportunities to adjust the frequency to the current
> utilization level when the cost is low or negligible.

Maybe we can try to run hackbench to quantify the overhead we add with
useless schedutil updates. However, my main concern is that if we want
a proper decoupling between the scheduler and schedutil, then we also
have to ensure that we call back for updates only when it really makes
sense.

Otherwise, the risk is that the schedutil policy will take decisions
based on wrong assumptions like: ok, let's increase the OPP (since I
can now and it's cheap), without knowing that the CPU is instead going
to be almost empty or even IDLE.

> Can't we classify the utilization events into some kind of major and
> minor changes?

Doesn't a classification itself look more like a policy?

Maybe we can consider it, but still I think we should be able to find
the points where the scheduler has the most accurate and up-to-date
information about the tasks actually RUNNABLE on a CPU, and at those
points send a notification to schedutil.

IMO there are a few small events where the utilization can change a
lot: these are wakeups (because of util_est) and migrations. For all
the rest, the tick should be a good enough update rate, considering
also that, even at 250Hz, in 4ms PELT never builds up more than ~8%.

[...]

> > .:: Conclusions
> > ===============
> >
> > All that considered, I think I've convinced myself that we really
> > need to notify schedutil only in these cases:
> >
> >  1. enqueue time
> >     because of the changes in estimated utilization and the
> >     possibility to jump straight to a better OPP
> >
> >  2.
> >     task tick time
> >     because of the possible ramp-up of the utilization
> >
> > Another case is related to remote CPUs' blocked utilization updates,
> > after Vincent's recent patches. Currently indeed:
> >
> >   update_blocked_averages()
> >     update_load_avg()
> >       --> update schedutil
> >
> > and thus we potentially wake up an IDLE cluster just to reduce its
> > OPP. If the cluster is in a deep idle state, I'm not entirely sure
> > this is good from an energy saving standpoint. However, with the
> > patch I'm proposing we are missing that support, meaning that an
> > IDLE cluster will get its utilization decayed but we don't wake it
> > up just to drop its frequency.
>
> So, more than deciding in the scheduler whether we should wake it up
> or not, we should give cpufreq a chance to decide if it wants to
> update the frequency or not, as this decision is somehow platform
> specific: cost of frequency change, clock topology and shared clocks,
> voltage topology...

Fair enough, then we should keep updating schedutil from that code
path. What about adding a new explicit callback at the end of
update_blocked_averages()? Something like:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb77407ba485..6eb0f31c656d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
 	if (done)
 		rq->has_blocked_load = 0;
 #endif
+
+	cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
+
 	rq_unlock_irqrestore(rq, &rf);
 }
---8<---

Where we also pass in a new SCHED_CPUFREQ_IDLE flag, just to notify
schedutil that the CPU is currently IDLE?

Could that work?

-- 
#include <best/regards.h>

Patrick Bellasi