From: Vincent Guittot
Date: Tue, 15 May 2018 12:19:21 +0200
Subject: Re: [PATCH 3/3] sched/fair: schedutil: explicit update only when required
To: Patrick Bellasi
Cc: Joel Fernandes, linux-kernel, "open list:THERMAL", Ingo Molnar,
    Peter Zijlstra, "Rafael J. Wysocki", Viresh Kumar, Dietmar Eggemann,
    Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle
In-Reply-To: <20180514163206.GF30654@e110439-lin>
Wysocki" , Viresh Kumar , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Joel Fernandes , Steve Muckle Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 14 May 2018 at 18:32, Patrick Bellasi wrote: > On 12-May 23:25, Joel Fernandes wrote: >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote: >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote: >> > > Schedutil updates for FAIR tasks are triggered implicitly each time a >> > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently >> > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has >> > > changed, and {attach,detach}_entity_load_avg(). >> > > >> > > This design is based on the idea that "we should callback schedutil >> > > frequently enough" to properly update the CPU frequency at every >> > > utilization change. However, such an integration strategy has also >> > > some downsides: >> > >> > Hi Patrick, > > Hi Joel, > [...] >> > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) >> > > update_cfs_group(se); >> > > } >> > > >> > > + /* The task is no more visible from the root cfs_rq */ >> > > if (!se) >> > > sub_nr_running(rq, 1); >> > > >> > > util_est_dequeue(&rq->cfs, p, task_sleep); >> > > + cpufreq_update_util(rq, 0); >> > >> > One question about this change. In enqueue, throttle and unthrottle - you are >> > conditionally calling cpufreq_update_util incase the task was >> > visible/not-visible in the hierarchy. >> > >> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent. >> > Is this because of util_est or something? Could you add a comment here >> > explaining why this is so? >> >> The big question I have is incase se != NULL, then its still visible at the >> root RQ level. > > My understanding it that you get !se at dequeue time when we are > dequeuing a task from a throttled RQ. Isn't it? Yes se becomes NULL only when you reach root domain > > Thus, this means you are dequeuing a throttled task, I guess for > example because of a migration. > However, the point is that a task dequeue from a throttled RQ _is > already_ not visible from the root RQ, because of the sub_nr_running() > done by throttle_cfs_rq(). > >> In that case should we still call the util_est_dequeue and the >> cpufreq_update_util? > > I had a better look at the different code paths and I've possibly come > up with some interesting observations. Lemme try to resume theme here. > > First of all, we need to distinguish from estimated utilization > updates and schedutil updates, since they respond to two very > different goals. > > > .:: Estimated utilization updates > ================================= > > Goal: account for the amount of utilization we are going to > expect on a CPU > > At {en,de}queue time, util_est_{en,de}queue() is always > unconditionally called because it tracks the utilization which is > estimated to be generated by all the RUNNABLE tasks. > > We do not care about throttled/un-throttled RQ here because the effect > of throttling is already folded into the estimated utilization. > > For example, a 100% tasks which is placed into a 50% bandwidth > limited TG will generate a 50% (estimated) utilization. Thus, when the > task is enqueued we can account immediately for that utilization > although the RQ can be currently throttled. 
>
> .:: Schedutil updates
> =====================
>
> Goal: select a better frequency, if and _when_ required.
>
> At enqueue time, if the task is visible at the root RQ, then it's
> expected to run within a scheduler latency period. Thus, it makes
> sense to call schedutil immediately to account for its estimated
> utilization and possibly increase the OPP.
>
> If instead the task is enqueued into a throttled RQ, then I skip the
> update, since the task will not run until the RQ is actually
> un-throttled.
>
> HOWEVER, I would say that in general we could skip this last
> optimization and always unconditionally update schedutil at enqueue
> time, considering that the effects of a throttled RQ are always
> reflected in the (estimated) utilization of a task.

I think so too.

> At dequeue time instead, since we have certainly removed some
> estimated utilization, I unconditionally updated schedutil.
>
> HOWEVER, I was not considering these two things:
>
> 1. for a task going to sleep, we still have its blocked utilization
>    accounted in the cfs_rq utilization.

It might still be interesting to reduce the frequency, because the
blocked utilization can be lower than its estimated utilization.

> 2. for a task being migrated, at dequeue time we still have not
>    removed the task's utilization from the cfs_rq's utilization.
>    This usually happens later; for example we can have:
>
>      move_queued_task()
>        dequeue_task()                   --> CFS task dequeued
>        set_task_cpu()                   --> schedutil updated
>          migrate_task_rq_fair()
>            detach_entity_cfs_rq()
>              detach_entity_load_avg()   --> CFS util removal
>        enqueue_task()
>
> Moreover, the "CFS util removal" actually affects the cfs_rq only if
> we hold the RQ lock; otherwise it is just annotated as "removed"
> utilization, and the actual cfs_rq utilization is fixed up the next
> time we take the RQ lock.
>
> Thus, I would say that in both cases, at dequeue time it does not make
> sense to update schedutil, since we still see the task's utilization
> in the cfs_rq and thus we will not reduce the frequency.

Yes, only attach/detach make sense from a utilization point of view,
and that's where we should check for a utilization-driven frequency
update.
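
As an aside, the "removed" back-annotation mentioned above can be
modelled roughly like this. It is a simplified, self-contained sketch
with invented names, not the kernel's actual implementation: a
migrating task adds its utilization to a lockless "removed" counter,
which is folded into the cfs_rq utilization only at the next update
done under the RQ lock.

/* Userspace model of the "removed utilization" annotation. */
#include <stdatomic.h>
#include <stdio.h>

struct cfs_rq_model {
        unsigned long util_avg;        /* only touched under the RQ lock */
        _Atomic unsigned long removed; /* migration paths add here */
};

/* Migration path, possibly without the RQ lock: just annotate. */
static void remove_util(struct cfs_rq_model *cfs, unsigned long util)
{
        atomic_fetch_add(&cfs->removed, util);
}

/* Next update under the RQ lock: fold the annotation in. */
static void fold_removed(struct cfs_rq_model *cfs)
{
        unsigned long r = atomic_exchange(&cfs->removed, 0);

        cfs->util_avg = cfs->util_avg > r ? cfs->util_avg - r : 0;
}

int main(void)
{
        struct cfs_rq_model cfs = { .util_avg = 700 };

        remove_util(&cfs, 300); /* task migrated away */
        printf("before fold: util_avg=%lu\n", cfs.util_avg); /* 700 */
        fold_removed(&cfs);     /* RQ lock taken later */
        printf("after fold:  util_avg=%lu\n", cfs.util_avg); /* 400 */
        return 0;
}

This is why a schedutil update at dequeue time can still see the
migrating task's utilization: the subtraction has only been annotated,
not yet applied.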
> NOTE, this is true independently of the refactoring I'm proposing.
> At dequeue time, although we call update_load_avg() on the root RQ,
> it does not make sense to update schedutil, since we still see either
> the blocked utilization of a sleeping task or the not-yet-removed
> utilization of a migrating task. In both cases the risk is to ask for
> a higher OPP right when a CPU is about to go idle.

We have to take care not to mix the opportunity to update the frequency
(whenever we update the utilization) with the policy we want to apply
about the best time to update the frequency, such as waiting a bit
longer to make sure the current utilization is sustainable because a
frequency change is expensive on the platform (or not).

The fact that a task is dequeued does not mean we should not update and
increase the frequency, or that we should not decrease it because we
have just taken into account some removed utilization from a previous
migration.

The same happens when a task migrates: we don't know whether the
utilization that is about to be migrated will be higher or lower than
the normal update of the utilization (since the last update), nor
whether it can generate a frequency change.

I see your explanation above as a kind of policy where you want to
balance the cost of a frequency change against the probability that we
will not have to re-update the frequency soon. I agree that some
scheduling events give higher chances of a sustainable utilization
level, and we should favor those events when the frequency change is
costly, but I'm not sure we should remove all other opportunities to
adjust the frequency to the current utilization level when the cost is
low or negligible. Can't we classify the utilization events into some
kind of major and minor changes?

> Moreover, it seems that in general we prefer a "conservative" approach
> to frequency reduction.
> For example, it could be harmful to trigger a frequency reduction when
> a task is migrating off a CPU if, right after, another task is
> migrated onto the same CPU.
>
>
> .:: Conclusions
> ===============
>
> All that considered, I think I've convinced myself that we really need
> to notify schedutil only in these cases:
>
> 1. enqueue time,
>    because of the changes in estimated utilization and the
>    possibility to jump straight to a better OPP
>
> 2. task tick time,
>    because of the possible ramp-up of the utilization
>
> Another case is related to updates of remote CPUs' blocked
> utilization, after Vincent's recent patches. Currently indeed:
>
>   update_blocked_averages()
>     update_load_avg()
>       --> update schedutil
>
> and thus we potentially wake up an IDLE cluster just to reduce its
> OPP. If the cluster is in a deep idle state, I'm not entirely sure
> this is good from an energy-saving standpoint.
> However, with the patch I'm proposing we are missing that support,
> meaning that an IDLE cluster will get its utilization decayed but we
> will not wake it up just to drop its frequency.

So rather than deciding in the scheduler whether we should wake it up
or not, we should give cpufreq a chance to decide whether it wants to
update the frequency, as this decision is somewhat platform-specific:
cost of a frequency change, clock topology and shared clocks, voltage
topology, and so on.

Regards,
Vincent

> Perhaps we should pass this information to schedutil via a flag (e.g.
> SCHED_FREQ_REMOTE_UPDATE) and implement a policy there to decide if
> and when it makes sense to drop the OPP. Or otherwise find a way for
> the special DL tasks to always run on the lower capacity_orig CPUs.
>
>> Sorry if I missed something obvious.
>
> Thanks for the question; it has actually triggered a better analysis
> of what we have and what we need.
>
> Looking forward to some feedback on the above before posting a new
> version of this last patch.
>
>> thanks!
>>
>> - Joel
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
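
As a strawman for the flag idea quoted just above: something like the
following userspace model shows where such a platform-specific decision
could live on the governor side. SCHED_FREQ_REMOTE_UPDATE is only the
name floated in this thread, and the policy structure and the
"expensive frequency change" knob are purely illustrative assumptions,
not schedutil's actual code.

/* Sketch of a governor-side policy for remote updates. */
#include <stdbool.h>
#include <stdio.h>

#define SCHED_FREQ_REMOTE_UPDATE (1U << 0)

struct gov_policy_model {
        bool freq_change_expensive; /* platform-specific cost knob */
        bool cpu_is_deep_idle;
};

/*
 * The scheduler always notifies; the governor decides whether waking
 * an idle cluster just to lower its OPP is worth the cost.
 */
static bool should_update_freq(const struct gov_policy_model *p,
                               unsigned int flags)
{
        if ((flags & SCHED_FREQ_REMOTE_UPDATE) &&
            p->cpu_is_deep_idle && p->freq_change_expensive)
                return false; /* let the idle cluster sleep */
        return true;
}

int main(void)
{
        struct gov_policy_model p = { .freq_change_expensive = true,
                                      .cpu_is_deep_idle = true };

        printf("remote update honored: %d\n",
               should_update_freq(&p, SCHED_FREQ_REMOTE_UPDATE)); /* 0 */
        printf("local update honored:  %d\n",
               should_update_freq(&p, 0));                        /* 1 */
        return 0;
}

The design point is the one made above: the scheduler reports the
utilization event, and the cpufreq side, which knows the clock and
voltage topology, applies the platform-specific policy.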