From: Vincent Guittot
Date: Mon, 9 Apr 2018 10:51:27 +0200
Subject: Re: [PATCH] sched/fair: schedutil: update only with all info available
To: Patrick Bellasi, Peter Zijlstra
Cc: linux-kernel, "open list:THERMAL", Ingo Molnar, "Rafael J. Wysocki",
 Viresh Kumar, Juri Lelli, Joel Fernandes, Steve Muckle,
 Dietmar Eggemann, Morten Rasmussen
In-Reply-To: <20180406172835.20078-1-patrick.bellasi@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Patrick,

On 6 April 2018 at 19:28, Patrick Bellasi wrote:
> Schedutil is not properly updated when the first FAIR task wakes up on a
> CPU and when a RQ is (un)throttled. This is mainly due to the current
> integration strategy, which relies on updates being triggered implicitly
> each time a cfs_rq's utilization is updated.
>
> Those updates are currently provided (mainly) via
>
>    cfs_rq_util_change()
>
> which is used in:
> - update_cfs_rq_load_avg(), when the utilization of a cfs_rq is updated
> - {attach,detach}_entity_load_avg()
>
> This is done based on the idea that "we should callback schedutil
> frequently enough" to properly update the CPU frequency at every
> utilization change.
>
> Since this recent schedutil update:
>
>    commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we use additional RQ information to properly account for FAIR tasks
> utilization. Specifically, cfs_rq::h_nr_running has to be non-zero
> in sugov_aggregate_util() to sum up the cfs_rq's utilization.

Isn't the use of cfs_rq::h_nr_running the root cause of the problem?
I can now see a lot of frequency changes on my hikey with this new
condition in sugov_aggregate_util(). With an rt-app use case that
creates a periodic cfs task, I get a lot of frequency changes instead
of staying at the same frequency.

Peter, what was your goal with adding the condition
"if (rq->cfs.h_nr_running)" for the aggregation of CFS utilization?

Thanks,
Vincent

> However, cfs_rq::h_nr_running is usually updated as:
>
>    enqueue_entity()
>       ...
>       update_load_avg()
>          ...
>          cfs_rq_util_change ==> trigger schedutil update
>    ...
>    cfs_rq->h_nr_running += number_of_tasks
>
> both in enqueue_task_fair() as well as in unthrottle_cfs_rq().
> A similar pattern is used also in dequeue_task_fair() and
> throttle_cfs_rq() to remove tasks.
>
> This means that we are likely to see a zero cfs_rq utilization when we
> enqueue a task on an empty CPU, or a non-zero cfs_rq utilization when,
> instead, for example, we are throttling all the FAIR tasks of a CPU.
>
> While the second issue is less important, since we are less likely to
> reduce frequency when CPU utilization decreases, the first issue can
> instead impact performance. Indeed, we potentially introduce an
> undesired latency between a task's enqueue on a CPU and its frequency
> increase.
>
> Another possible unwanted side effect is the iowait boosting of a CPU
> when we enqueue a task into a throttled cfs_rq.
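The ordering problem described above can be sketched with a toy model. This is not kernel code: all `toy_*` names are invented for illustration, and the governor is reduced to a single aggregation step that mimics the h_nr_running gating in sugov_aggregate_util(). The point is only to show why an update fired before h_nr_running is incremented sees an "empty" cfs_rq:

```c
/* Toy model (not kernel code) of the update ordering issue:
 * the implicit schedutil callback fires from the load-avg update,
 * i.e. before h_nr_running is incremented, so the governor sees no
 * runnable CFS tasks and skips the frequency raise on first wakeup. */

struct toy_rq {
	unsigned int  h_nr_running;   /* tasks visible at the root  */
	unsigned long cfs_util;       /* aggregated CFS utilization */
	unsigned long freq_request;   /* what the governor last saw */
};

/* Mimics sugov_aggregate_util(): CFS util counts only if tasks are visible. */
static unsigned long toy_aggregate_util(struct toy_rq *rq)
{
	return rq->h_nr_running ? rq->cfs_util : 0;
}

static void toy_cpufreq_update(struct toy_rq *rq)
{
	rq->freq_request = toy_aggregate_util(rq);
}

/* Current (implicit) integration: update fired as a side effect of the
 * load-avg update, before h_nr_running is bumped. */
static void enqueue_implicit(struct toy_rq *rq, unsigned long task_util)
{
	rq->cfs_util += task_util;
	toy_cpufreq_update(rq);       /* too early: task not yet visible */
	rq->h_nr_running += 1;
}

/* Patched (explicit) integration: update once all info is in place. */
static void enqueue_explicit(struct toy_rq *rq, unsigned long task_util)
{
	rq->cfs_util += task_util;
	rq->h_nr_running += 1;
	toy_cpufreq_update(rq);       /* sees a consistent RQ state */
}
```

With the implicit ordering, the first enqueue on an idle toy RQ requests frequency for zero utilization; the explicit ordering requests frequency for the newly enqueued task's utilization.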
>
> Moreover, the current schedutil integration has these other downsides:
>
> - schedutil updates are triggered by RQ load updates, which makes
>   sense in general but does not let us know exactly which other RQ
>   related information has been updated (e.g. h_nr_running).
>
> - increasing the chances to update schedutil does not always mean
>   providing the most accurate information for a proper frequency
>   selection, thus we can skip some updates.
>
> - we don't know exactly at which point a schedutil update is triggered,
>   and thus potentially a frequency change started, because the update
>   is a side effect of cfs_rq_util_change() instead of an explicit call
>   from the most suitable call path.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already
>   existing "public API", cpufreq_update_util(), to ensure we actually
>   update schedutil only when we are updating a root RQ. Thus, especially
>   when task groups are in use, most of the calls to this wrapper
>   function are really not required.
>
> - the usage of a wrapper function is not completely consistent across
>   fair.c, since we still sometimes need additional explicit calls to
>   cpufreq_update_util(), for example to support the IOWAIT boost flag
>   in the wakeup path.
>
> - it makes it hard to integrate new features, since it can require
>   changing other function prototypes just to pass in an additional
>   flag, as happened for example here:
>
>    commit ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's try to make schedutil updates more
> explicit in fair.c by:
>
> - removing the cfs_rq_util_change() wrapper function and using the
>   cpufreq_update_util() public API only when the root cfs_rq is updated
>
> - removing indirect and side-effect (sometimes not required) schedutil
>   updates when the cfs_rq utilization is updated
>
> - calling cpufreq_update_util() explicitly in the few call sites where
>   it really makes sense and all the required information has been
>   updated
>
> By doing so, this patch mainly removes code and adds explicit calls to
> schedutil only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
>
> All the other code paths, currently _indirectly_ covered by a call to
> update_load_avg(), are also covered by the above three calls.
> Some already imply enqueue/dequeue calls:
> - switch_{to,from}_fair()
> - sched_move_task()
> or are followed by enqueue/dequeue calls:
> - cpu_cgroup_fork() and post_init_entity_util_avg():
>   are used at wakeup_new_task() time and thus already followed by an
>   enqueue_task_fair()
> - migrate_task_rq_fair():
>   updates the removed utilization but not the actual cfs_rq
>   utilization, which is updated by a following sched event
>
> This new proposal also allows us to better aggregate schedutil related
> flags, which are required only at enqueue_task_fair() time.
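The claim that most wrapper calls are redundant with task groups can also be sketched with a toy model. Again, this is not kernel code: the `toy_*` names are invented, and the `parent == NULL` check stands in for the `&rq->cfs == cfs_rq` root test. It shows that walking a group hierarchy invokes the wrapper once per level but can reach the governor at most once:

```c
/* Toy illustration (not kernel code) of why the cfs_rq_util_change()
 * wrapper is mostly a no-op with task groups: only the root cfs_rq's
 * update may reach cpufreq, so walking a deep hierarchy calls the
 * wrapper at every level but fires the governor at most once. */

#include <stddef.h>

struct toy_cfs_rq {
	struct toy_cfs_rq *parent;   /* NULL for the root cfs_rq */
};

static int wrapper_calls;    /* wrapper invocations           */
static int governor_calls;   /* actual cpufreq_update_util()s */

static void toy_cfs_rq_util_change(struct toy_cfs_rq *cfs_rq)
{
	wrapper_calls++;
	if (!cfs_rq->parent)         /* stands in for &rq->cfs == cfs_rq */
		governor_calls++;
}

/* Walk from a leaf group up to the root, as for_each_sched_entity()
 * does in enqueue_task_fair(), updating load at each level. */
static void toy_enqueue(struct toy_cfs_rq *leaf)
{
	struct toy_cfs_rq *q;

	for (q = leaf; q; q = q->parent)
		toy_cfs_rq_util_change(q);
}
```

For a three-level hierarchy, three wrapper calls produce a single governor update; the patch's approach drops the wrapper and makes that single root-level call explicit instead.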
> Indeed, IOWAIT and MIGRATION flags are now requested only when a task
> is actually visible at the root cfs_rq level.
>
> Signed-off-by: Patrick Bellasi
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Rafael J. Wysocki
> Cc: Viresh Kumar
> Cc: Joel Fernandes
> Cc: Juri Lelli
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
>
> ---
>
> The SCHED_CPUFREQ_MIGRATION flag, recently introduced by:
>
>    ea14b57e8a18 sched/cpufreq: Provide migration hint
>
> is maintained although there are no actual users of this hint so far
> in mainline... do we really need it?
> ---
>  kernel/sched/fair.c | 84 ++++++++++++++++++++++++-----------------------
>  1 file changed, 38 insertions(+), 46 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c58d2f..e726f91f0089 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>   * For !fair tasks do:
>   *
>  	update_cfs_rq_load_avg(now, cfs_rq);
> -	attach_entity_load_avg(cfs_rq, se, 0);
> +	attach_entity_load_avg(cfs_rq, se);
>  	switched_from_fair(rq, p);
>   *
>   * such that the next switched_to_fair() has the
> @@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -	struct rq *rq = rq_of(cfs_rq);
> -
> -	if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
> -		/*
> -		 * There are a few boundary cases this might miss but it should
> -		 * get called often enough that that should (hopefully) not be
> -		 * a real problem.
> -		 *
> -		 * It will not get called when we go idle, because the idle
> -		 * thread is a different class (!fair), nor will the utilization
> -		 * number include things like RT tasks.
> -		 *
> -		 * As is, the util number is not freq-invariant (we'd have to
> -		 * implement arch_scale_freq_capacity() for that).
> -		 *
> -		 * See cpu_util().
> -		 */
> -		cpufreq_update_util(rq, flags);
> -	}
> -}
> -
>  #ifdef CONFIG_SMP
>  /*
>   * Approximate:
> @@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>
> -	if (decayed)
> -		cfs_rq_util_change(cfs_rq, 0);
> -
>  	return decayed;
>  }
>
> @@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>   * Must call update_cfs_rq_load_avg() before this, since we rely on
>   * cfs_rq->avg.last_update_time being current.
>   */
> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>
> @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>  	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> -	cfs_rq_util_change(cfs_rq, flags);
>  }
>
>  /**
> @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>  	add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> -	cfs_rq_util_change(cfs_rq, 0);
>  }
>
>  /*
> @@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  	 *
>  	 * IOW we're enqueueing a task on a new CPU.
>  	 */
> -	attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> +	attach_entity_load_avg(cfs_rq, se);
>  	update_tg_load_avg(cfs_rq, 0);
>
>  } else if (decayed && (flags & UPDATE_TG))
> @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> -	cfs_rq_util_change(cfs_rq, 0);
>  }
>
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
>
>  static inline void
> -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
> +attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>  static inline void
>  detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>
> @@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  		dequeue = 0;
>  	}
>
> -	if (!se)
> +	/* The tasks are no more visible from the root cfs_rq */
> +	if (!se) {
>  		sub_nr_running(rq, task_delta);
> +		cpufreq_update_util(rq, 0);
> +	}
>
>  	cfs_rq->throttled = 1;
>  	cfs_rq->throttled_clock = rq_clock(rq);
> @@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  			break;
>  	}
>
> -	if (!se)
> +	/* The tasks are now visible from the root cfs_rq */
> +	if (!se) {
>  		add_nr_running(rq, task_delta);
> +		cpufreq_update_util(rq, 0);
> +	}
>
>  	/* Determine whether we need to wake up potentially idle CPU: */
>  	if (rq->curr == rq->idle && rq->cfs.nr_running)
> @@ -5356,14 +5333,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se = &p->se;
>
> -	/*
> -	 * If in_iowait is set, the code below may not trigger any cpufreq
> -	 * utilization updates, so do it here explicitly with the IOWAIT flag
> -	 * passed.
> -	 */
> -	if (p->in_iowait)
> -		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
>  			break;
> @@ -5394,9 +5363,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>
> -	if (!se)
> +	/* The task is visible from the root cfs_rq */
> +	if (!se) {
> +		unsigned int flags = 0;
> +
>  		add_nr_running(rq, 1);
>
> +		if (p->in_iowait)
> +			flags |= SCHED_CPUFREQ_IOWAIT;
> +
> +		/*
> +		 * !last_update_time means we've passed through
> +		 * migrate_task_rq_fair() indicating we migrated.
> +		 *
> +		 * IOW we're enqueueing a task on a new CPU.
> +		 */
> +		if (!p->se.avg.last_update_time)
> +			flags |= SCHED_CPUFREQ_MIGRATION;
> +
> +		cpufreq_update_util(rq, flags);
> +	}
> +
>  	util_est_enqueue(&rq->cfs, p);
>  	hrtick_update(rq);
>  }
> @@ -5454,8 +5441,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>
> -	if (!se)
> +	/* The task is no more visible from the root cfs_rq */
> +	if (!se) {
>  		sub_nr_running(rq, 1);
> +		cpufreq_update_util(rq, 0);
> +	}
>
>  	util_est_dequeue(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
> @@ -9950,6 +9940,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>  	if (static_branch_unlikely(&sched_numa_balancing))
>  		task_tick_numa(rq, curr);
> +
> +	cpufreq_update_util(rq, 0);
>  }
>
>  /*
> @@ -10087,7 +10079,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
>
>  	/* Synchronize entity with its cfs_rq */
>  	update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
> -	attach_entity_load_avg(cfs_rq, se, 0);
> +	attach_entity_load_avg(cfs_rq, se);
>  	update_tg_load_avg(cfs_rq, false);
>  	propagate_entity_cfs_rq(se);
>  }
> --
> 2.15.1