2021-02-22 16:03:51

by Quentin Perret

Subject: Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

On Monday 22 Feb 2021 at 15:01:51 (+0000), Vincent Donnefort wrote:
> You mean that it could lead to a wrong frequency estimation when doing
> freq = map_util_freq() in em_cpu_energy()?

I'm not too worried about the map_util_freq() part, I'm worried about
the schedutil aggregation. Specifically, when a task is enqueued on a
rq, we sum its util_avg to the rq's util_avg, and the _task_util_est()
to the rq's util_est.enqueued member (as per util_est_enqueue()).
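
For reference, the util_est side of that accounting looks roughly like
this (paraphrasing kernel/sched/fair.c from memory, so modulo details):

static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
				    struct task_struct *p)
{
	unsigned int enqueued;

	if (!sched_feat(UTIL_EST))
		return;

	/* Update root cfs_rq's estimated utilization */
	enqueued  = cfs_rq->avg.util_est.enqueued;
	enqueued += _task_util_est(p);
	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}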

Now, in schedutil, sugov_get_util() calls cpu_util_cfs(), which does the
following:

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

	if (sched_feat(UTIL_EST)) {
		util = max_t(unsigned long, util,
			     READ_ONCE(rq->cfs.avg.util_est.enqueued));
	}

	return util;
}

And that value will be the base for frequency selection. cpu_util_next()
tries to mimic this as accurately as possible by doing the sums
independently and then computing the max, exactly as will happen when
the task is enqueued and a freq update is generated.
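
To make that mimicry concrete, the aggregation in cpu_util_next() is
roughly the below (again paraphrased from memory, clamping details may
differ):

static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);

	/* Add/remove @p's util_avg depending on the candidate placement */
	if (task_cpu(p) == cpu && dst_cpu != cpu)
		sub_positive(&util, task_util(p));
	else if (task_cpu(p) != cpu && dst_cpu == cpu)
		util += task_util(p);

	if (sched_feat(UTIL_EST)) {
		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);

		/* Simulate util_est_enqueue() for the not-yet-enqueued task */
		if (dst_cpu == cpu)
			util_est += _task_util_est(p);

		/* Sums first, then the max -- same shape as cpu_util_cfs() */
		util = max(util, util_est);
	}

	return min(util, capacity_orig_of(cpu));
}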

> But in any case, the computed energy, being the product of sum_util and the
> OPP's cost, is directly affected by this util_avg/util_est difference.

Sure, but we're not going to fix it by messing up the OPP selection part ;-)

> In the case where the task placement doesn't change the OPP, which is often the
> case, we can simplify the comparison and end-up with the following:
>
> delta_energy(CPU-3): OPP3 cost * (cpu_util_avg + task_util_avg - cpu_util_avg)
> delta_energy(CPU-2): OPP2 cost * (cpu_util_est + task_util_est - cpu_util_est)
>
> => OPP3 cost * task_util_avg < task_util_est * OPP2 cost
>
> With the same example I described previously, if you add the scaled OPP cost of
> 0.76 for CPU-3 and 0.65 for CPU-2 (real life OPP scaled costs), we have:
>
> 2.3 (CPU-3) < 7.15 (CPU-2)
>
> The task is placed on CPU-3, while it would have been much more efficient to use
> CPU-2.

That should really be a transient state: having a util_avg much larger
than util_est.enqueued is indicative of either a new task or a workload
changing behaviour. And so, chances are all the estimations are wrong
anyways -- it's hard to do good estimations when the present doesn't
look like the recent past.

But in any case, if we're going to address this, I'm still not sure this
patch will be what we want. As per my first comment we need to keep the
frequency estimation right.

> > > When computing the energy
> > > deltas, pd0's is likely to be higher than pd1's, only because the task
> > > contribution is higher for one comparison than the other.
> >
> > You mean the contribution to sum_util right? I think I see what you mean
> > but I'm still not sure if this really is an issue. This is how util_est
> > works, and the EM stuff is just consistent with that.
> >
> > The issue you describe can only happen (I think) when a rq's util_avg is
> > larger than its util_est ewma by some margin (that has to do with the
> > ewma-util_avg delta for the task?). But that means the ewma is not to be
> > trusted to begin with, so ...
>
> cfs_rq->avg.util_est.ewma is not used. cpu_util() will only return the max
> between ue.enqueued and util_avg.

Right, my bad, it was the 'enqueued' member. The rest of the argument
is still valid I think, but with s/ewma/enqueued :-)

Thanks,
Quentin


2021-02-22 16:25:54

by Quentin Perret

Subject: Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
> But in any case, if we're going to address this, I'm still not sure this
> patch will be what we want. As per my first comment we need to keep the
> frequency estimation right.

Totally untested, but I think in principle you would like something like
the snippet below. Would that work?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..6594d875c6ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * its pd list and will not be accounted by compute_energy().
 	 */
 	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long util_running = cpu_util_without(cpu, p);
 		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+		unsigned long cpu_util;
+
+		if (cpu == dst_cpu)
+			util_running += task_util_est(p);
 
 		/*
 		 * Busy time computation: utilization clamping is not
@@ -6543,7 +6548,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += schedutil_cpu_util(cpu, util_running, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6553,7 +6558,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = schedutil_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}
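
To spell out the intent (assuming I got the helpers right):
util_running counts the task's busy-time contribution only once, while
util_freq keeps the schedutil-style aggregation for OPP selection.
That is, when @p is placed on @cpu:

	util_running = max(cpu_util, cpu_util_est) +
		       max(task_util, _task_util_est)

	util_freq    = max(cpu_util + task_util,
			   cpu_util_est + _task_util_est)

since cpu_util_without() returns the max of the rq's util_avg and
util_est.enqueued (discounting @p), and task_util_est() is
max(task_util, _task_util_est).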

2021-02-23 20:04:43

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()

On 22/02/2021 17:23, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
>> But in any case, if we're going to address this, I'm still not sure this
>> patch will be what we want. As per my first comment we need to keep the
>> frequency estimation right.
>
> Totally untested, but I think in principle you would like something like
> the snippet below. Would that work?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..6594d875c6ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  	 * its pd list and will not be accounted by compute_energy().
>  	 */
>  	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> +		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> +		unsigned long util_running = cpu_util_without(cpu, p);

Wouldn't this be the same as:

	unsigned long util_running = cpu_util_next(cpu, p, -1);

except for some different handling of !last_update_time and
'task_on_rq_queued(p) || current == p' in cpu_util_without(), which
shouldn't happen in EAS?
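
That is, the two special cases in question are roughly (paraphrasing
cpu_util_without() from memory):

	/* Task has no contribution or is new */
	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
		return cpu_util(cpu);
	...
	if (sched_feat(UTIL_EST)) {
		unsigned long estimated =
			READ_ONCE(cfs_rq->avg.util_est.enqueued);

		/* Discount @p's util_est only if @p is actually enqueued */
		if (unlikely(task_on_rq_queued(p) || current == p))
			lsub_positive(&estimated, _task_util_est(p));

		util = max(util, estimated);
	}

neither of which should trigger on the EAS wake-up path.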

We have quite a lot of util related functions!

[...]