There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.
cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.
In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util
Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..77c9f5816c31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
- if (boost)
- util_est = max(util_est, runnable);
-
/*
* During wake-up @p isn't enqueued yet and doesn't contribute
* to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
--
2.34.1
On 07/06/23 15:51, Vincent Guittot wrote:
> There is no need to use runnable_avg when estimating util_est and that
> even generates wrong behavior because one includes blocked tasks whereas
> the other one doesn't. This can lead to accounting twice the waking task p,
> once with the blocked runnable_avg and another one when adding its
> util_est.
>
> cpu's runnable_avg is already used when computing util_avg which is then
> compared with util_est.
>
> In some situation, feec will not select prev_cpu but another one on the
> same performance domain because of higher max_util
>
> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
Can we verify the numbers that introduced this magic boost are still valid
please?
Otherwise LGTM.
Thanks!
--
Qais Yousef
> kernel/sched/fair.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a80a73909dc2..77c9f5816c31 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>
> util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
>
> - if (boost)
> - util_est = max(util_est, runnable);
> -
> /*
> * During wake-up @p isn't enqueued yet and doesn't contribute
> * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> --
> 2.34.1
>
On Tue, 11 Jul 2023 at 17:47, Qais Yousef <[email protected]> wrote:
>
> On 07/06/23 15:51, Vincent Guittot wrote:
> > There is no need to use runnable_avg when estimating util_est and that
> > even generates wrong behavior because one includes blocked tasks whereas
> > the other one doesn't. This can lead to accounting twice the waking task p,
> > once with the blocked runnable_avg and another one when adding its
> > util_est.
> >
> > cpu's runnable_avg is already used when computing util_avg which is then
> > compared with util_est.
> >
> > In some situation, feec will not select prev_cpu but another one on the
> > same performance domain because of higher max_util
> >
> > Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
>
> Can we verify the numbers that introduced this magic boost are still valid
> please?
TBH I don't expect it but I agree it's worth checking. Dietmar could
you rerun your tests with this change ?
>
> Otherwise LGTM.
>
>
> Thanks!
>
> --
>
> Qais Yousef
>
> > kernel/sched/fair.c | 3 ---
> > 1 file changed, 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..77c9f5816c31 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >
> > util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> >
> > - if (boost)
> > - util_est = max(util_est, runnable);
> > -
> > /*
> > * During wake-up @p isn't enqueued yet and doesn't contribute
> > * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> > --
> > 2.34.1
> >
On 12/07/2023 17:30, Vincent Guittot wrote:
> On Tue, 11 Jul 2023 at 17:47, Qais Yousef <[email protected]> wrote:
>>
>> On 07/06/23 15:51, Vincent Guittot wrote:
>>> There is no need to use runnable_avg when estimating util_est and that
>>> even generates wrong behavior because one includes blocked tasks whereas
>>> the other one doesn't. This can lead to accounting twice the waking task p,
>>> once with the blocked runnable_avg and another one when adding its
>>> util_est.
... and we don't have this issue for the util_avg case since we have:
7317 } else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
^^^^^^^^^^^^^^^^^^
7318 util += task_util(p);
>>> cpu's runnable_avg is already used when computing util_avg which is then
>>> compared with util_est.
We discussed why I have to use max(X, runnable) for X=util and
X=util_est in v2:
https://lkml.kernel.org/r/[email protected]
-->
I need the util_est = max(util_est, runnable) further down as well. Just
want to fetch runnable only once.
util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
<--
But I assume your point is that:
7327 if (boost)
7328 util_est = max(util_est, runnable);
7356 if (dst_cpu == cpu) <-- (1)
7357 util_est += _task_util_est(p);
7358 else if (p && unlikely(task_on_rq_queued(p) || current == p))
7359 lsub_positive(&util_est, _task_util_est(p));
7360
7361 util = max(util, util_est);
--> (1) doesn't work anymore in case `util_est == runnable`.
It will break the assumption for the if condition depicted in
cpu_util()'s comment:
7331 * During wake-up (2) @p isn't enqueued yet and doesn't contribute
7332 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
7333 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
7334 * has been enqueued.
(2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
<---
Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
of both will still show the anticipated results. Likelihood is high that
it will since we do `util = max(util, util_est)` at the end of cpu_util().
>>> In some situation, feec will not select prev_cpu but another one on the
>>> same performance domain because of higher max_util
>>>
>>> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
>>> Signed-off-by: Vincent Guittot <[email protected]>
>>> ---
>>
>> Can we verify the numbers that introduced this magic boost are still valid
>> please?
>
> TBH I don't expect it but I agree it's worth checking. Dietmar could
> you rerun your tests with this change ?
Could do. But first lets understand the issue properly.
>> Otherwise LGTM.
>>
>>
>> Thanks!
>>
>> --
>>
>> Qais Yousef
>>
>>> kernel/sched/fair.c | 3 ---
>>> 1 file changed, 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index a80a73909dc2..77c9f5816c31 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>>>
>>> util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
>>>
>>> - if (boost)
>>> - util_est = max(util_est, runnable);
>>> -
>>> /*
>>> * During wake-up @p isn't enqueued yet and doesn't contribute
>>> * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
>>> --
>>> 2.34.1
>>>
On Fri, 21 Jul 2023 at 18:09, Dietmar Eggemann <[email protected]> wrote:
>
> On 12/07/2023 17:30, Vincent Guittot wrote:
> > On Tue, 11 Jul 2023 at 17:47, Qais Yousef <[email protected]> wrote:
> >>
> >> On 07/06/23 15:51, Vincent Guittot wrote:
> >>> There is no need to use runnable_avg when estimating util_est and that
> >>> even generates wrong behavior because one includes blocked tasks whereas
> >>> the other one doesn't. This can lead to accounting twice the waking task p,
> >>> once with the blocked runnable_avg and another one when adding its
> >>> util_est.
>
> ... and we don't have this issue for the util_avg case since we have:
>
> 7317 } else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
> ^^^^^^^^^^^^^^^^^^
> 7318 util += task_util(p);
>
> >>> cpu's runnable_avg is already used when computing util_avg which is then
> >>> compared with util_est.
>
> We discussed why I have to use max(X, runnable) for X=util and
> X=util_est in v2:
>
> https://lkml.kernel.org/r/[email protected]
>
> -->
>
> I need the util_est = max(util_est, runnable) further down as well. Just
> want to fetch runnable only once.
>
> util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
>
> max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
>
Hmm, I don't get your point here. Why should they be equal ?
Below is a example to describe my problem:
task A with util_avg=200 util_est=300 runnable=200
task A is attached to CPU0 so it contributes to CPU0's util_avg and
runnable_avg.
In eenv_pd_max_util() we call cpu_util(cpu, p, dst_cpu, 1) to get the
max utilization and the OPP to use to compute energy.
Let say that there is nothing else running on CPU0 and CPU1 and the
both belong to the same performance domain so
CPU0 util_avg= 200 util_est=0 runnable_avg=200
CPU1 util_avg=0 util_est=0 runnable_avg=0
For CPU0, cpu_util(cpu, p, dst_cpu, 1) will return (200 + 300) = 500
For CPU1, cpu_util(cpu, p, dst_cpu, 1) will return (0 + 300) = 300
If there is an OPP with a capacity between these 2 values, CPU1 will
use a lower OPP than CPU0 and its computed energy will be lower.
The condition if (max_spare_cap_cpu >= 0 && max_spare_cap >
prev_spare_cap) filters some cases when CPU0 and CPU1 have the exact
same spare capacity. But we often see a smaller spare capacity for
CPU0 because of small side activities like cpufreq, timer, irq, rcu
... The difference is often only 1 but enough to bypass the condition
above. task A will migrate to CPU1 whereas there is no need. Then it
will move back to CPU0 once CPU1 will have a smaller spare capacity
I ran a test on snapdragon RB5 with the latest tip/sched/core. I start
3 tasks: 1 large enough to be on medium CPUs and 2 small enough to
stay on little CPUs during 30 seconds
With tip/sched/core, the 3 tasks are migrating around 3665
With the patch, there is only 8 migration at the beginning of the test
> <--
>
> But I assume your point is that:
>
> 7327 if (boost)
> 7328 util_est = max(util_est, runnable);
>
> 7356 if (dst_cpu == cpu) <-- (1)
> 7357 util_est += _task_util_est(p);
> 7358 else if (p && unlikely(task_on_rq_queued(p) || current == p))
> 7359 lsub_positive(&util_est, _task_util_est(p));
> 7360
> 7361 util = max(util, util_est);
>
> --> (1) doesn't work anymore in case `util_est == runnable`.
>
> It will break the assumption for the if condition depicted in
> cpu_util()'s comment:
exactly
>
> 7331 * During wake-up (2) @p isn't enqueued yet and doesn't contribute
> 7332 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> 7333 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
> 7334 * has been enqueued.
>
> (2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
>
> <---
>
> Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
> of both will still show the anticipated results. Likelihood is high that
> it will since we do `util = max(util, util_est)` at the end of cpu_util().
I think the same
>
> >>> In some situation, feec will not select prev_cpu but another one on the
> >>> same performance domain because of higher max_util
> >>>
> >>> Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
> >>> Signed-off-by: Vincent Guittot <[email protected]>
> >>> ---
> >>
> >> Can we verify the numbers that introduced this magic boost are still valid
> >> please?
> >
> > TBH I don't expect it but I agree it's worth checking. Dietmar could
> > you rerun your tests with this change ?
>
> Could do. But first lets understand the issue properly.
>
> >> Otherwise LGTM.
> >>
> >>
> >> Thanks!
> >>
> >> --
> >>
> >> Qais Yousef
> >>
> >>> kernel/sched/fair.c | 3 ---
> >>> 1 file changed, 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index a80a73909dc2..77c9f5816c31 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -7289,9 +7289,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >>>
> >>> util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> >>>
> >>> - if (boost)
> >>> - util_est = max(util_est, runnable);
> >>> -
> >>> /*
> >>> * During wake-up @p isn't enqueued yet and doesn't contribute
> >>> * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
> >>> --
> >>> 2.34.1
> >>>
>
On 24/07/2023 15:06, Vincent Guittot wrote:
> On Fri, 21 Jul 2023 at 18:09, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 12/07/2023 17:30, Vincent Guittot wrote:
>>> On Tue, 11 Jul 2023 at 17:47, Qais Yousef <[email protected]> wrote:
>>>>
>>>> On 07/06/23 15:51, Vincent Guittot wrote:
[...]
>> -->
>>
>> I need the util_est = max(util_est, runnable) further down as well. Just
>> want to fetch runnable only once.
>>
>> util = 50, task_util = 5, util_est = 60, task_util_est = 10, runnable = 70
>>
>> max(70 + 5, 60 + 10) != max (70 + 5, 70 + 10) when dst_cpu == cpu
>>
>
> Hmm, I don't get your point here. Why should they be equal ?
>
> Below is a example to describe my problem:
>
> task A with util_avg=200 util_est=300 runnable=200
> task A is attached to CPU0 so it contributes to CPU0's util_avg and
> runnable_avg.
>
> In eenv_pd_max_util() we call cpu_util(cpu, p, dst_cpu, 1) to get the
> max utilization and the OPP to use to compute energy.
>
> Let say that there is nothing else running on CPU0 and CPU1 and the
> both belong to the same performance domain so
> CPU0 util_avg= 200 util_est=0 runnable_avg=200
> CPU1 util_avg=0 util_est=0 runnable_avg=0
>
> For CPU0, cpu_util(cpu, p, dst_cpu, 1) will return (200 + 300) = 500
> For CPU1, cpu_util(cpu, p, dst_cpu, 1) will return (0 + 300) = 300
>
> If there is an OPP with a capacity between these 2 values, CPU1 will
> use a lower OPP than CPU0 and its computed energy will be lower.
>
> The condition if (max_spare_cap_cpu >= 0 && max_spare_cap >
> prev_spare_cap) filters some cases when CPU0 and CPU1 have the exact
> same spare capacity. But we often see a smaller spare capacity for
> CPU0 because of small side activities like cpufreq, timer, irq, rcu
> ... The difference is often only 1 but enough to bypass the condition
> above. task A will migrate to CPU1 whereas there is no need. Then it
> will move back to CPU0 once CPU1 will have a smaller spare capacity
>
> I ran a test on snapdragon RB5 with the latest tip/sched/core. I start
> 3 tasks: 1 large enough to be on medium CPUs and 2 small enough to
> stay on little CPUs during 30 seconds
> With tip/sched/core, the 3 tasks are migrating around 3665
> With the patch, there is only 8 migration at the beginning of the test
I agree with this. The fact that cfs_rq->avg.runnable_avg contains
blocked contributions from task A makes it unsuitable for the util_est
(no blocked contributions) if condition (dst_cpu == cpu) since we don't
want to add A's util_est to util_est to simulate during wakeup that A is
enqueued.
>> <--
>>
>> But I assume your point is that:
>>
>> 7327 if (boost)
>> 7328 util_est = max(util_est, runnable);
>>
>> 7356 if (dst_cpu == cpu) <-- (1)
>> 7357 util_est += _task_util_est(p);
>> 7358 else if (p && unlikely(task_on_rq_queued(p) || current == p))
>> 7359 lsub_positive(&util_est, _task_util_est(p));
>> 7360
>> 7361 util = max(util, util_est);
>>
>> --> (1) doesn't work anymore in case `util_est == runnable`.
>>
>> It will break the assumption for the if condition depicted in
>> cpu_util()'s comment:
>
> exactly
OK.
>>
>> 7331 * During wake-up (2) @p isn't enqueued yet and doesn't contribute
>> 7332 * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
>> 7333 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
>> 7334 * has been enqueued.
>>
>> (2) eenv_pd_max_util() and find_energy_efficient_cpu() call-site.
>>
>> <---
>>
>> Rerunning Jankbench tests on Pix6 will tell if boosting util_avg instead
>> of both will still show the anticipated results. Likelihood is high that
>> it will since we do `util = max(util, util_est)` at the end of cpu_util().
>
> I think the same
Reran the Jankbench test with the patch (fix) on exactly the same
platform (Pixel6, Android 12) I used for v3 (base, runnable):
https://lkml.kernel.org/r/[email protected]
Max_frame_duration:
+-----------------+------------+
| kernel | value [ms] |
+-----------------+------------+
| base | 163.1 |
| runnable | 162.0 |
| fix | 157.1 |
+-----------------+------------+
Mean_frame_duration:
+-----------------+------------+----------+
| kernel | value [ms] | diff [%] |
+-----------------+------------+----------+
| base | 18.0 | 0.0 |
| runnable | 12.7 | -29.43 |
| fix | 13.0 | -27.78 |
+-----------------+------------+----------+
Jank percentage (Jank deadline 16ms):
+-----------------+------------+----------+
| kernel | value [%] | diff [%] |
+-----------------+------------+----------+
| base | 3.6 | 0.0 |
| runnable | 1.0 | -68.86 |
| fix | 1.0 | -68.86 |
+-----------------+------------+----------+
Power usage [mW] (total - all CPUs):
+-----------------+------------+----------+
| kernel | value [mW] | diff [%] |
+-----------------+------------+----------+
| base | 129.5 | 0.0 |
| runnable | 134.3 | 3.71 |
| fix | 129.9 | 0.31 |
+-----------------+------------+----------+
Test results look good to me.
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
[...]
On 07/24/23 23:11, Dietmar Eggemann wrote:
> Reran the Jankbench test with the patch (fix) on exactly the same
> platform (Pixel6, Android 12) I used for v3 (base, runnable):
>
> https://lkml.kernel.org/r/[email protected]
>
> Max_frame_duration:
> +-----------------+------------+
> | kernel | value [ms] |
> +-----------------+------------+
> | base | 163.1 |
> | runnable | 162.0 |
> | fix | 157.1 |
> +-----------------+------------+
>
> Mean_frame_duration:
> +-----------------+------------+----------+
> | kernel | value [ms] | diff [%] |
> +-----------------+------------+----------+
> | base | 18.0 | 0.0 |
> | runnable | 12.7 | -29.43 |
> | fix | 13.0 | -27.78 |
> +-----------------+------------+----------+
>
> Jank percentage (Jank deadline 16ms):
> +-----------------+------------+----------+
> | kernel | value [%] | diff [%] |
> +-----------------+------------+----------+
> | base | 3.6 | 0.0 |
> | runnable | 1.0 | -68.86 |
> | fix | 1.0 | -68.86 |
> +-----------------+------------+----------+
>
> Power usage [mW] (total - all CPUs):
> +-----------------+------------+----------+
> | kernel | value [mW] | diff [%] |
> +-----------------+------------+----------+
> | base | 129.5 | 0.0 |
> | runnable | 134.3 | 3.71 |
> | fix | 129.9 | 0.31 |
> +-----------------+------------+----------+
>
> Test results look good to me.
>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Tested-by: Dietmar Eggemann <[email protected]>
Thanks for re-running the test!
Cheers
--
Qais Yousef
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c2e164ac33f75e0acb93004960c73bd9166d3d35
Gitweb: https://git.kernel.org/tip/c2e164ac33f75e0acb93004960c73bd9166d3d35
Author: Vincent Guittot <[email protected]>
AuthorDate: Thu, 06 Jul 2023 15:51:44 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 26 Jul 2023 12:28:50 +02:00
sched/fair: remove util_est boosting
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.
cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.
In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util
Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3df5b1..f55b0a7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7320,9 +7320,6 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
- if (boost)
- util_est = max(util_est, runnable);
-
/*
* During wake-up @p isn't enqueued yet and doesn't contribute
* to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.