Date: Thu, 10 Jan 2019 15:30:42 +0000
From: Patrick Bellasi
To: Vincent Guittot
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, "Rafael J. Wysocki",
 Dietmar Eggemann, Morten Rasmussen, Paul Turner, Ben Segall,
 Thara Gopinath, pkondeti@codeaurora.org, Quentin Perret,
 Srinivas Pandruvada
Wysocki" , Dietmar Eggemann , Morten Rasmussen , Paul Turner , Ben Segall , Thara Gopinath , pkondeti@codeaurora.org, Quentin Perret , Srinivas Pandruvada Subject: Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT Message-ID: <20190110153031.4rh64xz2muctkffe@e110439-lin> References: <20181128115336.GB23094@e110439-lin> <20181128144039.GC23094@e110439-lin> <20181128152133.GD23094@e110439-lin> <20181128163545.GE23094@e110439-lin> <20181129150020.GF23094@e110439-lin> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: NeoMutt/20180716 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 29-Nov 17:19, Vincent Guittot wrote: > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi wrote: > > On 29-Nov 11:43, Vincent Guittot wrote: [...] > > Seems we agree that, when there is no idle time: > > - the two 15% tasks will be overestimated > > - their utilization will reach 50% after a while > > > > If I'm not wrong, we will have: > > - 30% CPU util in ~16ms @1024 capacity > > ~64ms @256 capacity > > > > Thus, the tasks will be certainly over-estimated after ~64ms. > > Is that correct ? > > From a pure util_avg pov it's correct > But i'd like to weight that a bit with the example below > > > Now, we can argue that 64ms is a pretty long time and thus it's quite > > unlucky we will have no idle for such a long time. > > > > Still, I'm wondering if we should keep collecting those samples or > > better find a way to detect that and skip the sampling. > > The problem is that you can have util_avg above capacity even with idle time > In the 1st example of this thread, the 39ms/80ms task will reach 709 > which is the value saved by util_est on a big core > But on core with half capacity, there is still idle time so 709 is a > correct value although above 512 Right, I see your point and (in principle) I like the idea of collecting samples for tasks which happen to run at a lower capacity then required and the utilization value makes sense... > In fact, max will be always above the linear ratio because it's based > on geometric series > > And this is true even with 15.6ms/32ms (same ratio as above) task > although the impact is smaller (max value, which should be saved by > util est, becomes 587 in this case). However that's not always the case... as per my example above. Moreover, we should also consider that util_est is mainly meant to be a lower-bound for tasks utilization. That's why task_util_est() already returns the actual util_avg when it's higher than the estimated utilization. With your new signal and without any special check on samples collection, if a task is limited because of thermal capping for example, we could end up overestimating its utilization and thus perhaps generating an unwanted frequency spike when the capping is relaxed... and (even worst) it will take some more activations for the estimated utilization to converge back to the actual utilization. 
Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher than the CPU
capacity, I would prefer to just skip the sampling, with something
like:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9332863d122a..485053026533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 {
 	long last_ewma_diff;
 	struct util_est ue;
+	int cpu;
 
 	if (!sched_feat(UTIL_EST))
 		return;
@@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
 		return;
 
+	/*
+	 * To avoid overestimation of actual task utilization, skip updates if
+	 * we cannot grant there is idle time in this CPU.
+	 */
+	cpu = cpu_of(rq_of(cfs_rq));
+	if (task_util(p) > capacity_orig_of(cpu))
+		return;
+
 	/*
 	 * Update Task's estimated utilization
 	 *
---8<---

At least this will ensure that util_est always provides an actual,
measured lower bound for a task's utilization.

If you think this makes sense, feel free to add such a patch on top of
your series.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi