2022-01-11 13:47:14

by Vincent Guittot

Subject: [PATCH v3 0/4] sched/pelt: Relax the sync of *_sum with *_avg

Rick reported performance regressions in bugzilla because of cpu
frequency being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

More details are available in commit message of patch 1.

This patchset reverts the commit above and adds several checks when
propagating the changes in the hierarchy to make sure that we still have
coherent util_avg and util_sum.

Dietmar found a simple way to reproduce the WARN fixed by
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
by looping on hackbench in several different sched group levels.

This patchset has been run on the reproducer with success, but it probably
needs more tests by people who faced the WARN before.

The changes done on util_sum have also been applied to runnable_sum and
load_sum, which face the same rounding problem, although this has not been
reflected in a measurable performance impact.
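
As background: PELT computes each _avg as _sum / divider, where divider =
LOAD_AVG_MAX - 1024 + period_contrib and period_contrib stays below 1024
(see get_pelt_divider() in kernel/sched/pelt.h). Because integer division
rounds down, a coherent _sum can never be smaller than
_avg * (LOAD_AVG_MAX - 1024), which is the lower bound the added checks
clamp against. The following is a minimal, purely illustrative user-space
sketch of that bound; only LOAD_AVG_MAX and the divider formula mirror the
kernel, the sample util_sum value is arbitrary:

/* Illustrative user-space sketch, not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX    47742                   /* maximum possible PELT sum */
#define MIN_DIVIDER     (LOAD_AVG_MAX - 1024)   /* divider when period_contrib == 0 */

int main(void)
{
        uint32_t util_sum = 9000000;    /* arbitrary accumulated sum */
        uint32_t period_contrib;

        for (period_contrib = 0; period_contrib < 1024; period_contrib++) {
                uint32_t divider = MIN_DIVIDER + period_contrib;
                uint32_t util_avg = util_sum / divider;

                /* integer division rounds down, so this always holds */
                assert(util_sum >= util_avg * MIN_DIVIDER);
        }
        printf("util_sum >= util_avg * MIN_DIVIDER for any period_contrib\n");
        return 0;
}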

Changes for v3:
- split patch 1 into 2 patches:
  - one to fix Rick's regression
  - one to apply the same changes in other places
- fix some typos
- move the main comment so it appears in the 1st patch

Changes for v2:
- fix wrong update of load_sum
- move a change from patch 3 to patch 2
- update patch 3 commit message

Vincent Guittot (4):
sched/pelt: Relax the sync of util_sum with util_avg
sched/pelt: Continue to relax the sync of util_sum with util_avg
sched/pelt: Relax the sync of runnable_sum with runnable_avg
sched/pelt: Relax the sync of load_sum with load_avg

kernel/sched/fair.c | 113 +++++++++++++++++++++++++++++---------------
1 file changed, 75 insertions(+), 38 deletions(-)

--
2.17.1



2022-01-11 13:47:17

by Vincent Guittot

Subject: [PATCH v3 1/4] sched/pelt: Relax the sync of util_sum with util_avg

Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded to its lower bound and might lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.

Instead of always setting util_sum to the low bound of util_avg, which can
significantly lower the utilization of root cfs_rq after propagating the
change down into the hierarchy, we revert the change of util_sum and
propagate the difference.

In addition, we also check that cfs_rq's util_sum always stays above the
lower bound for a given util_avg, as it has been observed that a
sched_entity's util_sum is sometimes above the cfs_rq's one.
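
To make the magnitude concrete, below is a stand-alone sketch of the
arithmetic described above, with made-up numbers; only LOAD_AVG_MAX and the
clamp against the minimum divider mirror the kernel, the rest (plain C in
place of sub_positive()/max_t(), the chosen util values) is illustrative:

/* Illustrative only: made-up numbers showing the rounding loss. */
#include <stdint.h>
#include <stdio.h>

#define LOAD_AVG_MAX    47742
#define MIN_DIVIDER     (LOAD_AVG_MAX - 1024)

int main(void)
{
        uint32_t divider = MIN_DIVIDER + 512;   /* some period_contrib */

        /* cfs_rq state: 40000 of contribution accumulated since the last sync */
        uint32_t util_avg = 300;
        uint32_t util_sum = util_avg * divider + 40000;

        uint32_t removed = 100;                 /* util_avg of a detaching task */
        uint32_t new_avg = util_avg - removed;

        /* with commit 1c35b07e6d39: util_sum floored to the new util_avg */
        uint32_t floored = new_avg * divider;

        /* with this patch: remove the task's share, then clamp to the bound */
        uint32_t relaxed = util_sum - removed * divider;
        if (relaxed < new_avg * MIN_DIVIDER)
                relaxed = new_avg * MIN_DIVIDER;

        /* prints: floored=9446000 relaxed=9486000 lost=40000 */
        printf("floored=%u relaxed=%u lost=%u\n",
               floored, relaxed, relaxed - floored);
        return 0;
}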

Reported-by: Rick Yiu <[email protected]>
Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..ed35255fdb85 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}

+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)

/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
@@ -3449,7 +3450,6 @@ void set_task_rq_fair(struct sched_entity *se,
* XXX: only do this for the part of runnable > running ?
*
*/
-
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
@@ -3681,7 +3681,19 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ /*
+ * Because of rounding, se->util_sum might ends up being +1 more than
+ * cfs->util_sum. Although this is not a problem by itself, detaching
+ * a lot of tasks with the rounding problem between 2 updates of
+ * util_avg (~1ms) can make cfs->util_sum becoming null whereas
+ * cfs_util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
--
2.17.1


2022-01-11 13:47:18

by Vincent Guittot

Subject: [PATCH v3 2/4] sched/pelt: Continue to relax the sync of util_sum with util_avg

Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded to its lower bound and might lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.

update_tg_cfs_util() is not the only place where we round util_sum and
lose some accumulated contributions that are not already reflected in
util_avg. Modify update_tg_cfs_util() and detach_entity_load_avg() to not
sync util_sum with the new util_avg. Instead of always setting util_sum to
the low bound of util_avg, which can significantly lower the utilization,
we propagate the difference. In addition, we also check that cfs_rq's
util_sum always stays above the lower bound for a given util_avg, as it has
been observed that a sched_entity's util_sum is sometimes above the
cfs_rq's one.
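
For readers of the hunks in this series: they rely on the fair.c helpers
sub_positive() and add_positive(), which are not visible in the quoted
context. As a rough mental model only (the real ones are type-generic
macros using READ_ONCE()/WRITE_ONCE()), they behave roughly like the
simplified sketch below:

/* Simplified user-space sketch of the fair.c helpers, not the real macros. */
#include <stdint.h>
#include <stdio.h>

/* subtract val from *ptr, clamping at 0 instead of wrapping below zero */
static void sub_positive_u32(uint32_t *ptr, uint32_t val)
{
        uint32_t res = *ptr - val;

        if (res > *ptr)         /* unsigned wrap: val was larger than *ptr */
                res = 0;
        *ptr = res;
}

/* add a possibly negative delta to *ptr, clamping at 0 on underflow */
static void add_positive_s64(int64_t *ptr, int64_t val)
{
        int64_t res = *ptr + val;

        if (res < 0)
                res = 0;
        *ptr = res;
}

int main(void)
{
        uint32_t util_sum = 100;
        int64_t load_sum = 100;

        sub_positive_u32(&util_sum, 250);       /* would underflow -> 0 */
        add_positive_s64(&load_sum, -250);      /* would underflow -> 0 */
        printf("util_sum=%u load_sum=%lld\n", util_sum, (long long)load_sum);
        return 0;
}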

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed35255fdb85..3eb73ce6ef13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3453,11 +3453,11 @@ void set_task_rq_fair(struct sched_entity *se,
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3466,13 +3466,20 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);

+
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
- se->avg.util_sum = se->avg.util_avg * divider;
+ new_sum = se->avg.util_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.util_sum;
+ se->avg.util_sum = new_sum;

/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ add_positive(&cfs_rq->avg.util_avg, delta_avg);
+ add_positive(&cfs_rq->avg.util_sum, delta_sum);
+
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
}

static inline void
@@ -3792,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
+
sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;

--
2.17.1


2022-01-11 13:47:22

by Vincent Guittot

Subject: [PATCH v3 3/4] sched/pelt: Relax the sync of runnable_sum with runnable_avg

Similarly to util_avg and util_sum, don't sync runnable_sum with the low
bound of runnable_avg but only ensure that runnable_sum stays in the
correct range.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3eb73ce6ef13..e34de721a630 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3485,11 +3485,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3500,11 +3500,16 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf

/* Set new sched_entity's runnable */
se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
- se->avg.runnable_sum = se->avg.runnable_avg * divider;
+ new_sum = se->avg.runnable_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
+ se->avg.runnable_sum = new_sum;

/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
+ add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);
}

static inline void
@@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ /* See sa->util_sum above */
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * MIN_DIVIDER);

/*
* removed_runnable is the unweighted version of removed_load so we
@@ -3791,12 +3799,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*/
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
- * See ___update_load_avg() for details.
- */
- u32 divider = get_pelt_divider(&cfs_rq->avg);
-
dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
@@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
cfs_rq->avg.util_avg * MIN_DIVIDER);

sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);

add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);

--
2.17.1


2022-01-11 13:47:27

by Vincent Guittot

Subject: [PATCH v3 4/4] sched/pelt: Relax the sync of load_sum with load_avg

Similarly to util_avg and util_sum, don't sync load_sum with the low
bound of load_avg but only ensure that load_sum stays in the correct range.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e34de721a630..12d139fed1f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u32 divider = get_pelt_divider(&se->avg);
sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
+
}
#else
static inline void
@@ -3381,8 +3386,6 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}

-#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
-
/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
* propagate its contribution. The key to this propagation is the invariant
@@ -3515,9 +3518,10 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+ long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
unsigned long load_avg;
u64 load_sum = 0;
+ s64 delta_sum;
u32 divider;

if (!runnable_sum)
@@ -3544,7 +3548,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
* assuming all tasks are equally runnable.
*/
if (scale_load_down(gcfs_rq->load.weight)) {
- load_sum = div_s64(gcfs_rq->avg.load_sum,
+ load_sum = div_u64(gcfs_rq->avg.load_sum,
scale_load_down(gcfs_rq->load.weight));
}

@@ -3561,19 +3565,22 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
runnable_sum = max(runnable_sum, running_sum);

- load_sum = (s64)se_weight(se) * runnable_sum;
- load_avg = div_s64(load_sum, divider);
+ load_sum = se_weight(se) * runnable_sum;
+ load_avg = div_u64(load_sum, divider);

- se->avg.load_sum = runnable_sum;
-
- delta = load_avg - se->avg.load_avg;
- if (!delta)
+ delta_avg = load_avg - se->avg.load_avg;
+ if (!delta_avg)
return;

- se->avg.load_avg = load_avg;
+ delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;

- add_positive(&cfs_rq->avg.load_avg, delta);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ se->avg.load_sum = runnable_sum;
+ se->avg.load_avg = load_avg;
+ add_positive(&cfs_rq->avg.load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.load_sum, delta_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
}

static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3689,7 +3696,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ /* See sa->util_sum below */
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);

r = removed_util;
sub_positive(&sa->util_avg, r);
--
2.17.1


2022-01-12 11:25:03

by Sachin Sant

Subject: Re: [PATCH v3 0/4] sched/pelt: Relax the sync of *_sum with *_avg


> On 11-Jan-2022, at 7:16 PM, Vincent Guittot <[email protected]> wrote:
>
> Rick reported performance regressions in bugzilla because of cpu
> frequency being lower than before:
> https://bugzilla.kernel.org/show_bug.cgi?id=215045
>
> He bisected the problem to:
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
>
> More details are available in commit message of patch 1.
>
> This patchset reverts the commit above and adds several checks when
> propagating the changes in the hierarchy to make sure that we still have
> coherent util_avg and util_sum.
>
> Dietmar found a simple way to reproduce the WARN fixed by
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> by looping on hackbench in several different sched group levels.
>
> This patchset has been run on the reproducer with success, but it probably needs
> more tests by people who faced the WARN before.
>

I ran scheduler regression tests (including cfs_bandwidth) from LTP
for about 6 hours. I did not observe any (new or previously reported)
kernel WARN messages.

Based on this test result for ppc64le
Tested-by: Sachin Sant <[email protected]>

-Sachin

2022-01-12 13:18:13

by Vincent Guittot

Subject: Re: [PATCH v3 0/4] sched/pelt: Relax the sync of *_sum with *_avg

On Wed, 12 Jan 2022 at 12:24, Sachin Sant <[email protected]> wrote:
>
>
> > On 11-Jan-2022, at 7:16 PM, Vincent Guittot <[email protected]> wrote:
> >
> > Rick reported performance regressions in bugzilla because of cpu
> > frequency being lower than before:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215045
> >
> > He bisected the problem to:
> > commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> >
> > More details are available in commit message of patch 1.
> >
> > This patchset reverts the commit above and adds several checks when
> > propagating the changes in the hierarchy to make sure that we still have
> > coherent util_avg and util_sum.
> >
> > Dietmar found a simple way to reproduce the WARN fixed by
> > commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> > by looping on hackbench in several different sched group levels.
> >
> > This patchset has been run on the reproducer with success, but it probably needs
> > more tests by people who faced the WARN before.
> >
>
> I ran scheduler regression tests (including cfs_bandwidth) from LTP
> for about 6 hours. I did not observe any (new or previously reported)
> kernel WARN messages.
>
> Based on this test result for ppc64le
> Tested-by: Sachin Sant <[email protected]>

Thanks

>
> -Sachin

2022-01-12 15:26:25

by Dietmar Eggemann

Subject: Re: [PATCH v3 0/4] sched/pelt: Relax the sync of *_sum with *_avg

On 11/01/2022 14:46, Vincent Guittot wrote:
> Rick reported performance regressions in bugzilla because of cpu
> frequency being lower than before:
> https://bugzilla.kernel.org/show_bug.cgi?id=215045
>
> He bisected the problem to:
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
>
> More details are available in commit message of patch 1.
>
> This patchset reverts the commit above and adds several checks when
> propagating the changes in the hierarchy to make sure that we still have
> coherent util_avg and util_sum.
>
> Dietmar found a simple way to reproduce the WARN fixed by
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> by looping on hackbench in several different sched group levels.
>
> This patchset has been run on the reproducer with success, but it probably needs
> more tests by people who faced the WARN before.
>
> The changes done on util_sum have also been applied to runnable_sum and
> load_sum, which face the same rounding problem, although this has not been
> reflected in a measurable performance impact.
>
> Changes for v3:
> - split patch 1 into 2 patches:
>   - one to fix Rick's regression
>   - one to apply the same changes in other places
> - fix some typos
> - move the main comment so it appears in the 1st patch
>
> Changes for v2:
> - fix wrong update of load_sum
> - move a change from patch 3 to patch 2
> - update patch 3 commit message
>
> Vincent Guittot (4):
> sched/pelt: Relax the sync of util_sum with util_avg
> sched/pelt: Continue to relax the sync of util_sum with util_avg
> sched/pelt: Relax the sync of runnable_sum with runnable_avg
> sched/pelt: Relax the sync of load_sum with load_avg
>
> kernel/sched/fair.c | 113 +++++++++++++++++++++++++++++---------------
> 1 file changed, 75 insertions(+), 38 deletions(-)

LGTM. Just a couple of questions on the patch header of 1/4.

Reviewed-by: Dietmar Eggemann <[email protected]>

2022-01-12 15:26:36

by Dietmar Eggemann

Subject: Re: [PATCH v3 1/4] sched/pelt: Relax the sync of util_sum with util_avg

On 11/01/2022 14:46, Vincent Guittot wrote:
> Rick reported performance regressions in bugzilla because of cpu frequency
> being lower than before:
> https://bugzilla.kernel.org/show_bug.cgi?id=215045
>
> He bisected the problem to:
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
>
> This commit forces util_sum to be synced with the new util_avg after
> removing the contribution of a task and before the next periodic sync. By
> doing so, util_sum is rounded to its lower bound and might lose up to
> LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
> reflected in util_avg.
>
> Instead of always setting util_sum to the low bound of util_avg, which can
> significantly lower the utilization of root cfs_rq after propagating the
> change down into the hierarchy, we revert the change of util_sum and
> propagate the difference.

IMHO, this paragraph does not match the changes in this patch.

> In addition, we also check that cfs_rq's util_sum always stays above the
> lower bound for a given util_avg, as it has been observed that a
> sched_entity's util_sum is sometimes above the cfs_rq's one.

And I guess this one also refers to the code change in 2/4, i.e. in
update_tg_cfs_util().

> Reported-by: Rick Yiu <[email protected]>
> Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..ed35255fdb85 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se,
> se->avg.last_update_time = n_last_update_time;
> }
>
> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
>
> /*
> * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> @@ -3449,7 +3450,6 @@ void set_task_rq_fair(struct sched_entity *se,
> * XXX: only do this for the part of runnable > running ?
> *
> */
> -
> static inline void
> update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> {
> @@ -3681,7 +3681,19 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + /*
> + * Because of rounding, se->util_sum might ends up being +1 more than

There are no se's involved in update_cfs_rq_load_avg(). Could be hard to
grasp for people only looking at this function.

> + * cfs->util_sum. Although this is not a problem by itself, detaching
> + * a lot of tasks with the rounding problem between 2 updates of
> + * util_avg (~1ms) can make cfs->util_sum becoming null whereas
> + * cfs_util_avg is not.
> + * Check that util_sum is still above its lower bound for the new
> + * util_avg. Given that period_contrib might have moved since the last
> + * sync, we are only sure that util_sum must be above or equal to
> + * util_avg * minimum possible divider
^^^
some superfluous whitespaces.

> + */
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
>


2022-01-12 16:05:22

by Vincent Guittot

Subject: Re: [PATCH v3 1/4] sched/pelt: Relax the sync of util_sum with util_avg

On Wed, 12 Jan 2022 at 16:26, Dietmar Eggemann <[email protected]> wrote:
>
> On 11/01/2022 14:46, Vincent Guittot wrote:
> > Rick reported performance regressions in bugzilla because of cpu frequency
> > being lower than before:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215045
> >
> > He bisected the problem to:
> > commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> >
> > This commit forces util_sum to be synced with the new util_avg after
> > removing the contribution of a task and before the next periodic sync. By
> > doing so, util_sum is rounded to its lower bound and might lose up to
> > LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
> > reflected in util_avg.
> >
> > Instead of always setting util_sum to the low bound of util_avg, which can
> > significantly lower the utilization of root cfs_rq after propagating the
> > change down into the hierarchy, we revert the change of util_sum and
> > propagate the difference.
>
> IMHO, this paragraph does not match the changes in this patch.



>
> > In addition, we also check that cfs_rq's util_sum always stays above the
> > lower bound for a given util_avg, as it has been observed that a
> > sched_entity's util_sum is sometimes above the cfs_rq's one.
>
> And I guess this one also refers to the code change in 2/4, i.e. in
> update_tg_cfs_util().

All 3 places can have the problem, although it's probably more obvious
in detach_entity_load_avg() or update_tg_cfs_util() because there the
change is done synchronously, unlike here.

> > Reported-by: Rick Yiu <[email protected]>
> > Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 16 ++++++++++++++--
> > 1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 095b0aa378df..ed35255fdb85 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se,
> > se->avg.last_update_time = n_last_update_time;
> > }
> >
> > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
> >
> > /*
> > * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > @@ -3449,7 +3450,6 @@ void set_task_rq_fair(struct sched_entity *se,
> > * XXX: only do this for the part of runnable > running ?
> > *
> > */
> > -
> > static inline void
> > update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
> > {
> > @@ -3681,7 +3681,19 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > r = removed_util;
> > sub_positive(&sa->util_avg, r);
> > - sa->util_sum = sa->util_avg * divider;
> > + sub_positive(&sa->util_sum, r * divider);
> > + /*
> > + * Because of rounding, se->util_sum might ends up being +1 more than
>
> There are no se's involved in update_cfs_rq_load_avg(). Could be hard to
> grasp for people only looking at this function.

I moved the comment there so that it stays in patch 1, but it reflects the
whole problem of rounding when removing some util_avg/util_sum from a
cfs_rq, whereas we only have part of the problem in update_cfs_rq_load_avg().

>
> > + * cfs->util_sum. Although this is not a problem by itself, detaching
> > + * a lot of tasks with the rounding problem between 2 updates of
> > + * util_avg (~1ms) can make cfs->util_sum becoming null whereas
> > + * cfs_util_avg is not.
> > + * Check that util_sum is still above its lower bound for the new
> > + * util_avg. Given that period_contrib might have moved since the last
> > + * sync, we are only sure that util_sum must be above or equal to
> > + * util_avg * minimum possible divider
> ^^^
> some superfluous whitespaces.
>
> > + */
> > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >
> > r = removed_runnable;
> > sub_positive(&sa->runnable_avg, r);
> >
>

Subject: [tip: sched/urgent] sched/pelt: Relax the sync of load_sum with load_avg

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 2d02fa8cc21a93da35cfba462bf8ab87bf2db651
Gitweb: https://git.kernel.org/tip/2d02fa8cc21a93da35cfba462bf8ab87bf2db651
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 11 Jan 2022 14:46:59 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Jan 2022 12:09:58 +01:00

sched/pelt: Relax the sync of load_sum with load_avg

Similarly to util_avg and util_sum, don't sync load_sum with the low
bound of load_avg but only ensure that load_sum stays in the correct range.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0e87e19..f4f02c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3028,9 +3028,11 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u32 divider = get_pelt_divider(&se->avg);
sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
}
#else
static inline void
@@ -3513,9 +3515,10 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+ long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
unsigned long load_avg;
u64 load_sum = 0;
+ s64 delta_sum;
u32 divider;

if (!runnable_sum)
@@ -3542,7 +3545,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
* assuming all tasks are equally runnable.
*/
if (scale_load_down(gcfs_rq->load.weight)) {
- load_sum = div_s64(gcfs_rq->avg.load_sum,
+ load_sum = div_u64(gcfs_rq->avg.load_sum,
scale_load_down(gcfs_rq->load.weight));
}

@@ -3559,19 +3562,22 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
runnable_sum = max(runnable_sum, running_sum);

- load_sum = (s64)se_weight(se) * runnable_sum;
- load_avg = div_s64(load_sum, divider);
-
- se->avg.load_sum = runnable_sum;
+ load_sum = se_weight(se) * runnable_sum;
+ load_avg = div_u64(load_sum, divider);

- delta = load_avg - se->avg.load_avg;
- if (!delta)
+ delta_avg = load_avg - se->avg.load_avg;
+ if (!delta_avg)
return;

- se->avg.load_avg = load_avg;
+ delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;

- add_positive(&cfs_rq->avg.load_avg, delta);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ se->avg.load_sum = runnable_sum;
+ se->avg.load_avg = load_avg;
+ add_positive(&cfs_rq->avg.load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.load_sum, delta_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
}

static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3687,7 +3693,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ /* See sa->util_sum below */
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * PELT_MIN_DIVIDER);

r = removed_util;
sub_positive(&sa->util_avg, r);

Subject: [tip: sched/urgent] sched/pelt: Continue to relax the sync of util_sum with util_avg

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 7ceb77103001544a43e11d7f3a8a69a2c1f422cf
Gitweb: https://git.kernel.org/tip/7ceb77103001544a43e11d7f3a8a69a2c1f422cf
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 11 Jan 2022 14:46:57 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Jan 2022 12:09:58 +01:00

sched/pelt: Continue to relax the sync of util_sum with util_avg

Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded to its lower bound and might lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.

update_tg_cfs_util() is not the only place where we round util_sum and
lose some accumulated contributions that are not already reflected in
util_avg. Modify update_tg_cfs_util() and detach_entity_load_avg() to not
sync util_sum with the new util_avg. Instead of always setting util_sum to
the low bound of util_avg, which can significantly lower the utilization,
we propagate the difference. In addition, we also check that cfs_rq's
util_sum always stays above the lower bound for a given util_avg, as it has
been observed that a sched_entity's util_sum is sometimes above the
cfs_rq's one.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8f068d..ad2809c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3451,11 +3451,11 @@ void set_task_rq_fair(struct sched_entity *se,
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3464,13 +3464,20 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);

+
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
- se->avg.util_sum = se->avg.util_avg * divider;
+ new_sum = se->avg.util_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.util_sum;
+ se->avg.util_sum = new_sum;

/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ add_positive(&cfs_rq->avg.util_avg, delta_avg);
+ add_positive(&cfs_rq->avg.util_sum, delta_sum);
+
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
}

static inline void
@@ -3790,7 +3797,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
+
sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;

Subject: [tip: sched/urgent] sched/pelt: Relax the sync of runnable_sum with runnable_avg

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 95246d1ec80b8d19d882cd8eb7ad094e63b41bb8
Gitweb: https://git.kernel.org/tip/95246d1ec80b8d19d882cd8eb7ad094e63b41bb8
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 11 Jan 2022 14:46:58 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Jan 2022 12:09:58 +01:00

sched/pelt: Relax the sync of runnable_sum with runnable_avg

Similarly to util_avg and util_sum, don't sync runnable_sum with the low
bound of runnable_avg but only ensure that runnable_sum stays in the
correct range.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad2809c..0e87e19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3483,11 +3483,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3498,11 +3498,16 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf

/* Set new sched_entity's runnable */
se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
- se->avg.runnable_sum = se->avg.runnable_avg * divider;
+ new_sum = se->avg.runnable_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
+ se->avg.runnable_sum = new_sum;

/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
+ add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);
}

static inline void
@@ -3702,7 +3707,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ /* See sa->util_sum above */
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * PELT_MIN_DIVIDER);

/*
* removed_runnable is the unweighted version of removed_load so we
@@ -3789,12 +3797,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*/
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
- * See ___update_load_avg() for details.
- */
- u32 divider = get_pelt_divider(&cfs_rq->avg);
-
dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
@@ -3803,7 +3805,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);

sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
+ /* See update_cfs_rq_load_avg() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);

add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);

Subject: [tip: sched/urgent] sched/pelt: Relax the sync of util_sum with util_avg

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 98b0d890220d45418cfbc5157b3382e6da5a12ab
Gitweb: https://git.kernel.org/tip/98b0d890220d45418cfbc5157b3382e6da5a12ab
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 11 Jan 2022 14:46:56 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 18 Jan 2022 12:09:58 +01:00

sched/pelt: Relax the sync of util_sum with util_avg

Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded to its lower bound and might lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.

Instead of always setting util_sum to the low bound of util_avg, which can
significantly lower the utilization of root cfs_rq after propagating the
change down into the hierarchy, we revert the change of util_sum and
propagate the difference.

In addition, we also check that cfs_rq's util_sum always stays above the
lower bound for a given util_avg, as it has been observed that a
sched_entity's util_sum is sometimes above the cfs_rq's one.

Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
Reported-by: Rick Yiu <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 16 +++++++++++++---
kernel/sched/pelt.h | 4 +++-
2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa..d8f068d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3381,7 +3381,6 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}

-
/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
* propagate its contribution. The key to this propagation is the invariant
@@ -3449,7 +3448,6 @@ void set_task_rq_fair(struct sched_entity *se,
* XXX: only do this for the part of runnable > running ?
*
*/
-
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
@@ -3681,7 +3679,19 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ /*
+ * Because of rounding, se->util_sum might ends up being +1 more than
+ * cfs->util_sum. Although this is not a problem by itself, detaching
+ * a lot of tasks with the rounding problem between 2 updates of
+ * util_avg (~1ms) can make cfs->util_sum becoming null whereas
+ * cfs_util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * PELT_MIN_DIVIDER);

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index e06071b..c336f5f 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -37,9 +37,11 @@ update_irq_load_avg(struct rq *rq, u64 running)
}
#endif

+#define PELT_MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
static inline u32 get_pelt_divider(struct sched_avg *avg)
{
- return LOAD_AVG_MAX - 1024 + avg->period_contrib;
+ return PELT_MIN_DIVIDER + avg->period_contrib;
}

static inline void cfs_se_util_change(struct sched_avg *avg)