2021-12-22 09:38:10

by Vincent Guittot

Subject: [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg

Rick reported performance regressions in bugzilla because of cpu
frequency being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

More details are available in the commit message of patch 1.

This patchset reverts the commit above and adds several checks when
propagating the changes in the hierarchy to make sure that we still have
coherent util_avg and util_sum.

Dietmar found a simple way to reproduce the WARN fixed by
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
by looping on hackbench in several different sched group levels.

This patchset has been run on the reproducer with success, but it
probably needs more testing by people who faced the WARN before.

The changes done on util_sum have also been applied to runnable_sum and
load_sum, which face the same rounding problem, although this has not
been reflected in a measurable performance impact.

Changes for v2:
- fix wrong update of load_sum
- move a change from patch 3 to patch 2
- update patch 3 commit message

Vincent Guittot (3):
sched/pelt: Don't sync hardly util_sum with uti_avg
sched/pelt: Don't sync hardly runnable_sum with runnable_avg
sched/pelt: Don't sync hardly load_sum with load_avg

kernel/sched/fair.c | 113 +++++++++++++++++++++++++++++---------------
1 file changed, 75 insertions(+), 38 deletions(-)

--
2.17.1



2021-12-22 09:38:15

by Vincent Guittot

Subject: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045

He bisected the problem to:
commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so, util_sum is rounded down to its lower bound and might lose up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.
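
As an illustration, here is a minimal standalone sketch (not kernel code;
LOAD_AVG_MAX matches the PELT definition, but the period_contrib and the
accumulated remainder below are made-up numbers):

	#include <stdio.h>

	#define LOAD_AVG_MAX	47742	/* maximum value a PELT sum can reach */

	int main(void)
	{
		unsigned int divider = LOAD_AVG_MAX - 1024 + 512;	/* assumed period_contrib = 512 */
		unsigned long util_sum = 10UL * divider + 40000;	/* 40000 not yet reflected in util_avg */
		unsigned long util_avg = util_sum / divider;		/* integer division: 10 */
		unsigned long synced_sum = util_avg * divider;		/* hard sync to util_avg's lower bound */

		printf("util_sum=%lu synced_sum=%lu lost=%lu (can be up to divider-1=%u)\n",
		       util_sum, synced_sum, util_sum - synced_sum, divider - 1);
		return 0;
	}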

Instead of always setting util_sum to the lower bound of util_avg, which
can significantly lower the utilization of the root cfs_rq after
propagating the change down into the hierarchy, we revert the commit and
propagate the difference in util_sum.

In addition, we also check that the cfs_rq's util_sum always stays above
the lower bound for a given util_avg, as it has been observed that the
sched_entity's util_sum is sometimes above the cfs_rq's one.

Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
Reported-by: Rick Yiu <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..9ac28f0f3645 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3381,6 +3381,7 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}

+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)

/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
@@ -3449,15 +3450,14 @@ void set_task_rq_fair(struct sched_entity *se,
* XXX: only do this for the part of runnable > running ?
*
*/
-
static inline void
update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.util_avg - se->avg.util_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);

+
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
- se->avg.util_sum = se->avg.util_avg * divider;
+ new_sum = se->avg.util_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.util_sum;
+ se->avg.util_sum = new_sum;

/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ add_positive(&cfs_rq->avg.util_avg, delta_avg);
+ add_positive(&cfs_rq->avg.util_sum, delta_sum);
+
+ /*
+ * Because of rounding, se->util_sum might ends up being +1 more than
+ * cfs->util_sum (XXX fix the rounding). Although this is not
+ * a problem by itself, detaching a lot of tasks with the rounding
+ * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
+ * becoming null whereas cfs_util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
}

static inline void
@@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
@@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
+ sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
+ cfs_rq->avg.util_avg * MIN_DIVIDER);
+
sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;

--
2.17.1


2021-12-22 09:38:24

by Vincent Guittot

Subject: [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg

Similarly to util_avg and util_sum, don't sync runnable_sum with the
lower bound of runnable_avg but only ensure that runnable_sum stays in
the correct range.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ac28f0f3645..b4c350715c16 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3495,11 +3495,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
static inline void
update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
- u32 divider;
+ long delta_sum, delta_avg = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg;
+ u32 new_sum, divider;

/* Nothing to update */
- if (!delta)
+ if (!delta_avg)
return;

/*
@@ -3510,11 +3510,16 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf

/* Set new sched_entity's runnable */
se->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
- se->avg.runnable_sum = se->avg.runnable_avg * divider;
+ new_sum = se->avg.runnable_avg * divider;
+ delta_sum = (long)new_sum - (long)se->avg.runnable_sum;
+ se->avg.runnable_sum = new_sum;

/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
+ add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);
}

static inline void
@@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * MIN_DIVIDER);

/*
* removed_runnable is the unweighted version of removed_load so we
@@ -3791,12 +3799,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*/
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- /*
- * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
- * See ___update_load_avg() for details.
- */
- u32 divider = get_pelt_divider(&cfs_rq->avg);
-
dequeue_load_avg(cfs_rq, se);
sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
@@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
cfs_rq->avg.util_avg * MIN_DIVIDER);

sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
+ sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
+ cfs_rq->avg.runnable_avg * MIN_DIVIDER);

add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);

--
2.17.1


2021-12-22 09:38:26

by Vincent Guittot

Subject: [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg

Similarly to util_avg and util_sum, don't sync load_sum with the lower
bound of load_avg but only ensure that load_sum stays in the correct range.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++----------------
1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4c350715c16..488213d98770 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- u32 divider = get_pelt_divider(&se->avg);
sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
+
}
#else
static inline void
@@ -3381,8 +3386,6 @@ void set_task_rq_fair(struct sched_entity *se,
se->avg.last_update_time = n_last_update_time;
}

-#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
-
/*
* When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
* propagate its contribution. The key to this propagation is the invariant
@@ -3525,9 +3528,10 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
{
- long delta, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+ long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
unsigned long load_avg;
u64 load_sum = 0;
+ s64 delta_sum;
u32 divider;

if (!runnable_sum)
@@ -3554,7 +3558,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
* assuming all tasks are equally runnable.
*/
if (scale_load_down(gcfs_rq->load.weight)) {
- load_sum = div_s64(gcfs_rq->avg.load_sum,
+ load_sum = div_u64(gcfs_rq->avg.load_sum,
scale_load_down(gcfs_rq->load.weight));
}

@@ -3571,19 +3575,22 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
runnable_sum = max(runnable_sum, running_sum);

- load_sum = (s64)se_weight(se) * runnable_sum;
- load_avg = div_s64(load_sum, divider);
-
- se->avg.load_sum = runnable_sum;
+ load_sum = se_weight(se) * runnable_sum;
+ load_avg = div_u64(load_sum, divider);

- delta = load_avg - se->avg.load_avg;
- if (!delta)
+ delta_avg = load_avg - se->avg.load_avg;
+ if (!delta_avg)
return;

- se->avg.load_avg = load_avg;
+ delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;

- add_positive(&cfs_rq->avg.load_avg, delta);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
+ se->avg.load_sum = runnable_sum;
+ se->avg.load_avg = load_avg;
+ add_positive(&cfs_rq->avg.load_avg, delta_avg);
+ add_positive(&cfs_rq->avg.load_sum, delta_sum);
+ /* See update_tg_cfs_util() */
+ cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
+ cfs_rq->avg.load_avg * MIN_DIVIDER);
}

static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ /* See update_tg_cfs_util() */
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);

r = removed_util;
sub_positive(&sa->util_avg, r);
--
2.17.1


2022-01-04 11:47:01

by Dietmar Eggemann

Subject: Re: [PATCH v2 0/3] sched/pelt: Don't sync hardly *_sum with *_avg

On 22/12/2021 10:37, Vincent Guittot wrote:

IMHO you mean s/hardly/hard here?

> Rick reported performance regressions in bugzilla because of cpu
> frequency being lower than before:
> https://bugzilla.kernel.org/show_bug.cgi?id=215045
>
> He bisected the problem to:
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
>
> More details are available in commit message of patch 1.
>
> This patchset reverts the commit above and adds several checks when
> propagating the changes in the hierarchy to make sure that we still have
> coherent util_avg and util_sum.
>
> Dietmar found a simple way to reproduce the WARN fixed by
> commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
> by looping on hackbench in several different sched group levels.
>
> This patchset as run on the reproducer with success but it probably needs
> more tests by people who faced the WARN before.
>
> The changes done on util_sum have been also applied to runnable_sum and
> load_sum which faces the same rounding problem although this has not been
> reflected in measurable performance impact.

I think the overall idea here is that:

[add|sub]_positive(&sa->X_avg, Y); (`add` in update_tg_cfs_X())
sa->X_sum = sa->X_avg * divider;

with X equal to (util, runnable, load)

changes to:

[add|sub]_positive(&sa->X_avg, Y);
[add|sub]_positive(&sa->X_sum, Z);
sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER);

This change is done in:

(1) update_cfs_rq_load_avg()
(2) detach_entity_load_avg() and dequeue_load_avg()
(3) update_tg_cfs_X() (+ down-propagating _sum independently from _avg)

Prior to:

1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced w/ *_avg")
ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after dequeue")

(i.e. the commits which get fixed by this patchset):

sub_positive(&sa->X_avg, Y);
sub_positive(&sa->X_sum, Z);

was used in (1) and (2).

(3) already used sa->util_sum = sa->util_avg * divider before (since
95d685935a2e ("sched/pelt: Sync util/runnable_sum with PELT window when
propagating")).

[...]

2022-01-04 11:47:16

by Dietmar Eggemann

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On 22/12/2021 10:38, Vincent Guittot wrote:

s/util_sum with uti_avg/util_sum with util_avg

[...]

> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)

Shouldn't this be in pelt.h?

[...]

> @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> */
> divider = get_pelt_divider(&cfs_rq->avg);
>
> +
> /* Set new sched_entity's utilization */
> se->avg.util_avg = gcfs_rq->avg.util_avg;
> - se->avg.util_sum = se->avg.util_avg * divider;
> + new_sum = se->avg.util_avg * divider;
> + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> + se->avg.util_sum = new_sum;
>
> /* Update parent cfs_rq utilization */
> - add_positive(&cfs_rq->avg.util_avg, delta);
> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> +
> + /*
> + * Because of rounding, se->util_sum might ends up being +1 more than
> + * cfs->util_sum (XXX fix the rounding). Although this is not
> + * a problem by itself, detaching a lot of tasks with the rounding
> + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> + * becoming null whereas cfs_util_avg is not.
> + * Check that util_sum is still above its lower bound for the new
> + * util_avg. Given that period_contrib might have moved since the last
> + * sync, we are only sure that util_sum must be above or equal to
> + * util_avg * minimum possible divider
> + */
> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> }
>

I still wonder whether the regression only comes from the changes in
update_cfs_rq_load_avg() introduced by 1c35b07e6d39, and could be fixed
only by this part of the patch-set (A):

@@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
*cfs_rq)

r = removed_load;
sub_positive(&sa->load_avg, r);
- sa->load_sum = sa->load_avg * divider;
+ sub_positive(&sa->load_sum, r * divider);
+ sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);

r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
- sa->runnable_sum = sa->runnable_avg * divider;
+ sub_positive(&sa->runnable_sum, r * divider);
+ sa->runnable_sum = max_t(u32, sa->runnable_sum,
+ sa->runnable_avg * MIN_DIVIDER);

i.e. w/o changing update_tg_cfs_X() (and
detach_entity_load_avg()/dequeue_load_avg()).

update_load_avg()
update_cfs_rq_load_avg() <---
propagate_entity_load_avg()
update_tg_cfs_X() <---


I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
hackbench in several different sched group levels on
[Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].

Rick is probably in a position to test whether this would be sufficient
to cure the CPU frequency regression.

I can see that you want to use the same _avg/_sum sync in
detach_entity_load_avg()/dequeue_load_avg() as in
update_cfs_rq_load_avg(). (B)

And finally in update_tg_cfs_X() as well plus down-propagating _sum
independently from _avg. (C)

So rather than splitting the patchset by X (util, runnable, load), the
whole change might be easier to handle IMHO when split into (A), (B),
(C) (obviously only in case (A) cures the regression).

> static inline void
> @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> dequeue_load_avg(cfs_rq, se);
> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> +

Maybe add a:

Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
with *_avg")

[...]

This max_t() should make sure that `_sum is always >= _avg *
MIN_DIVIDER`, which is sometimes not the case. Currently this is done in

(1) update_cfs_rq_load_avg()
(2) detach_entity_load_avg() and dequeue_load_avg()
(3) update_tg_cfs_X()

but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
reason for this?

2022-01-04 11:47:21

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/3] sched/pelt: Don't sync hardly runnable_sum with runnable_avg

On 22/12/2021 10:38, Vincent Guittot wrote:

[...]

> @@ -3704,7 +3709,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);

Maybe add a:

Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay
consistent")

[...]

> @@ -3805,7 +3807,10 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> cfs_rq->avg.util_avg * MIN_DIVIDER);
>
> sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
> - cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * divider;
> + sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
> + cfs_rq->avg.runnable_avg * MIN_DIVIDER);

Maybe add a:

Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
with *_avg")

[...]

2022-01-04 11:47:27

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/3] sched/pelt: Don't sync hardly load_sum with load_avg

On 22/12/2021 10:38, Vincent Guittot wrote:
> Similarly to util_avg and util_sum, don't sync load_sum with the low
> bound of load_avg but only ensure that load_sum stays in the correct range.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++----------------
> 1 file changed, 25 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b4c350715c16..488213d98770 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3025,12 +3025,17 @@ enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
> }
>
> +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
> +
> static inline void
> dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - u32 divider = get_pelt_divider(&se->avg);
> sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
> + sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
> + /* See update_tg_cfs_util() */
> + cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
> + cfs_rq->avg.load_avg * MIN_DIVIDER);
> +

Maybe add a:

Fixes: ceb6ba45dc80 ("sched/fair: Sync load_sum with load_avg after
dequeue")

[...]

> static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
> @@ -3699,7 +3706,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + /* See update_tg_cfs_util() */
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);

Since this max_t() is used 9 times in this patch-set, maybe a macro in
pelt.h is better:

+/* Because of rounding, se->util_sum might ends up being +1 more than
+ * cfs->util_sum (XXX fix the rounding). Although this is not
+ * a problem by itself, detaching a lot of tasks with the rounding
+ * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
+ * becoming null whereas cfs_util_avg is not.
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
+
+#define enforce_lower_bound_on_pelt_sum(sa, var) do { \
+ (sa)->var##_sum = max_t(u32, \
+ (sa)->var##_sum, \
+ (sa)->var##_avg * MIN_DIVIDER); \
+} while (0)

This way, you could also add the comment from update_tg_cfs_util()
there. And there would be no need anymore to point to it from the other
places where it is used.
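
A quick standalone check of how such a helper would expand (userspace
sketch only; max_t() is open-coded here and struct pelt_avg is a made-up
stand-in for struct sched_avg):

	#include <stdio.h>

	#define LOAD_AVG_MAX	47742
	#define MIN_DIVIDER	(LOAD_AVG_MAX - 1024)

	/* open-coded max_t(u32, ...) for this userspace sketch */
	#define enforce_lower_bound_on_pelt_sum(sa, var) do {			\
		if ((sa)->var##_sum < (sa)->var##_avg * MIN_DIVIDER)		\
			(sa)->var##_sum = (sa)->var##_avg * MIN_DIVIDER;	\
	} while (0)

	struct pelt_avg {		/* made-up stand-in for struct sched_avg */
		unsigned long util_avg;
		unsigned int util_sum;
	};

	int main(void)
	{
		/* util_sum subtracted down to 0 while util_avg is still non-zero */
		struct pelt_avg sa = { .util_avg = 3, .util_sum = 0 };

		enforce_lower_bound_on_pelt_sum(&sa, util);
		printf("util_sum=%u (>= util_avg * MIN_DIVIDER = %lu)\n",
		       sa.util_sum, sa.util_avg * MIN_DIVIDER);
		return 0;
	}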

[...]

2022-01-04 13:43:00

by Vincent Guittot

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
>
> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> s/util_sum with uti_avg/util_sum with util_avg

yes

>
> [...]
>
> > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
>
> Shouldn't this be in pelt.h?
>
> [...]
>
> > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> > */
> > divider = get_pelt_divider(&cfs_rq->avg);
> >
> > +
> > /* Set new sched_entity's utilization */
> > se->avg.util_avg = gcfs_rq->avg.util_avg;
> > - se->avg.util_sum = se->avg.util_avg * divider;
> > + new_sum = se->avg.util_avg * divider;
> > + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> > + se->avg.util_sum = new_sum;
> >
> > /* Update parent cfs_rq utilization */
> > - add_positive(&cfs_rq->avg.util_avg, delta);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> > + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> > +
> > + /*
> > + * Because of rounding, se->util_sum might ends up being +1 more than
> > + * cfs->util_sum (XXX fix the rounding). Although this is not
> > + * a problem by itself, detaching a lot of tasks with the rounding
> > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> > + * becoming null whereas cfs_util_avg is not.
> > + * Check that util_sum is still above its lower bound for the new
> > + * util_avg. Given that period_contrib might have moved since the last
> > + * sync, we are only sure that util_sum must be above or equal to
> > + * util_avg * minimum possible divider
> > + */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > }
> >
>
> I still wonder whether the regression only comes from the changes in
> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> And could be fixed only by this part of the patch-set (A):

I have been able to trigger the warning even with (A), though it took
much more time.
And I have been able to catch wrong situations (with additional
traces) in the 3 places A, B and C.

>
> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);
>
> i.e. w/o changing update_tg_cfs_X() (and
> detach_entity_load_avg()/dequeue_load_avg()).
>
> update_load_avg()
> update_cfs_rq_load_avg() <---
> propagate_entity_load_avg()
> update_tg_cfs_X() <---
>
>
> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> hackbench in several different sched group levels on
> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].

IIRC, it was with hikey960 with cgroup v1.
As a side note, I never triggered the problem with dragonboard845 and cgroup v2.

>
> Rick is probably in a position to test whether this would be sufficient
> to cure the CPU frequency regression.
>
> I can see that you want to use the same _avg/_sum sync in
> detach_entity_load_avg()/dequeue_load_avg() as in
> update_cfs_rq_load_avg(). (B)
>
> And finally in update_tg_cfs_X() as well plus down-propagating _sum
> independently from _avg. (C)
>
> So rather splitting the patchset into X (util, runnable, load) the whole
> change might be easier to handle IMHO when split into (A), (B), (C)
> (obviously only in case (A) cures the regression).
>
> > static inline void
> > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > r = removed_util;
> > sub_positive(&sa->util_avg, r);
> > - sa->util_sum = sa->util_avg * divider;
> > + sub_positive(&sa->util_sum, r * divider);
> > + /* See update_tg_cfs_util() */
> > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >
> > r = removed_runnable;
> > sub_positive(&sa->runnable_avg, r);
> > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > dequeue_load_avg(cfs_rq, se);
> > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > + /* See update_tg_cfs_util() */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > +
>
> Maybe add a:
>
> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> with *_avg")

I spent time thinking about adding a Fixes tag. There is no crash/warn
so far, so should we propagate it back into LTS for better performance?
>
> [...]
>
> This max_t() should make sure that `_sum is always >= _avg *
> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>
> (1) update_cfs_rq_load_avg()
> (2) detach_entity_load_avg() and dequeue_load_avg()
> (3) update_tg_cfs_X()
>
> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> reason for this?

The main reason is that I have never seen the problem there.
Also, the problem comes from subtracting the task's value, whereas here
we always add a positive value.

2022-01-04 13:48:17

by Vincent Guittot

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
>
> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> s/util_sum with uti_avg/util_sum with util_avg
>
> [...]
>
> > +#define MIN_DIVIDER (LOAD_AVG_MAX - 1024)
>
> Shouldn't this be in pelt.h?

It is only used in fair.c, so I kept it local, like some other defines in fair.c.


>
> [...]
>
> > @@ -3466,13 +3466,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
> > */
> > divider = get_pelt_divider(&cfs_rq->avg);
> >
> > +
> > /* Set new sched_entity's utilization */
> > se->avg.util_avg = gcfs_rq->avg.util_avg;
> > - se->avg.util_sum = se->avg.util_avg * divider;
> > + new_sum = se->avg.util_avg * divider;
> > + delta_sum = (long)new_sum - (long)se->avg.util_sum;
> > + se->avg.util_sum = new_sum;
> >
> > /* Update parent cfs_rq utilization */
> > - add_positive(&cfs_rq->avg.util_avg, delta);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + add_positive(&cfs_rq->avg.util_avg, delta_avg);
> > + add_positive(&cfs_rq->avg.util_sum, delta_sum);
> > +
> > + /*
> > + * Because of rounding, se->util_sum might ends up being +1 more than
> > + * cfs->util_sum (XXX fix the rounding). Although this is not
> > + * a problem by itself, detaching a lot of tasks with the rounding
> > + * problem between 2 updates of util_avg (~1ms) can make cfs->util_sum
> > + * becoming null whereas cfs_util_avg is not.
> > + * Check that util_sum is still above its lower bound for the new
> > + * util_avg. Given that period_contrib might have moved since the last
> > + * sync, we are only sure that util_sum must be above or equal to
> > + * util_avg * minimum possible divider
> > + */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > }
> >
>
> I still wonder whether the regression only comes from the changes in
> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> And could be fixed only by this part of the patch-set (A):
>
> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> - sa->load_sum = sa->load_avg * divider;
> + sub_positive(&sa->load_sum, r * divider);
> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> - sa->runnable_sum = sa->runnable_avg * divider;
> + sub_positive(&sa->runnable_sum, r * divider);
> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> + sa->runnable_avg * MIN_DIVIDER);
>
> i.e. w/o changing update_tg_cfs_X() (and
> detach_entity_load_avg()/dequeue_load_avg()).
>
> update_load_avg()
> update_cfs_rq_load_avg() <---
> propagate_entity_load_avg()
> update_tg_cfs_X() <---
>
>
> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> hackbench in several different sched group levels on
> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>
> Rick is probably in a position to test whether this would be sufficient
> to cure the CPU frequency regression.
>
> I can see that you want to use the same _avg/_sum sync in
> detach_entity_load_avg()/dequeue_load_avg() as in
> update_cfs_rq_load_avg(). (B)
>
> And finally in update_tg_cfs_X() as well plus down-propagating _sum
> independently from _avg. (C)
>
> So rather splitting the patchset into X (util, runnable, load) the whole
> change might be easier to handle IMHO when split into (A), (B), (C)
> (obviously only in case (A) cures the regression).
>
> > static inline void
> > @@ -3681,7 +3698,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >
> > r = removed_util;
> > sub_positive(&sa->util_avg, r);
> > - sa->util_sum = sa->util_avg * divider;
> > + sub_positive(&sa->util_sum, r * divider);
> > + /* See update_tg_cfs_util() */
> > + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >
> > r = removed_runnable;
> > sub_positive(&sa->runnable_avg, r);
> > @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > dequeue_load_avg(cfs_rq, se);
> > sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > + /* See update_tg_cfs_util() */
> > + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > +
>
> Maybe add a:
>
> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> with *_avg")
>
> [...]
>
> This max_t() should make sure that `_sum is always >= _avg *
> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>
> (1) update_cfs_rq_load_avg()
> (2) detach_entity_load_avg() and dequeue_load_avg()
> (3) update_tg_cfs_X()
>
> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> reason for this?

2022-01-05 13:15:33

by Dietmar Eggemann

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On 04/01/2022 14:42, Vincent Guittot wrote:
> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 22/12/2021 10:38, Vincent Guittot wrote:

[...]

>> I still wonder whether the regression only comes from the changes in
>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
>> And could be fixed only by this part of the patch-set (A):
>
> I have been able to trigger the warning even with (A) though It took
> much more time.
> And I have been able to catch wrong situations (with additional
> traces) in the 3 places A, B and C

OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER'?

>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
>> *cfs_rq)
>>
>> r = removed_load;
>> sub_positive(&sa->load_avg, r);
>> - sa->load_sum = sa->load_avg * divider;
>> + sub_positive(&sa->load_sum, r * divider);
>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>>
>> r = removed_util;
>> sub_positive(&sa->util_avg, r);
>> - sa->util_sum = sa->util_avg * divider;
>> + sub_positive(&sa->util_sum, r * divider);
>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>>
>> r = removed_runnable;
>> sub_positive(&sa->runnable_avg, r);
>> - sa->runnable_sum = sa->runnable_avg * divider;
>> + sub_positive(&sa->runnable_sum, r * divider);
>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
>> + sa->runnable_avg * MIN_DIVIDER);
>>
>> i.e. w/o changing update_tg_cfs_X() (and
>> detach_entity_load_avg()/dequeue_load_avg()).
>>
>> update_load_avg()
>> update_cfs_rq_load_avg() <---
>> propagate_entity_load_avg()
>> update_tg_cfs_X() <---
>>
>>
>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
>> hackbench in several different sched group levels on
>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>
> IIRC, it was with hikey960 with cgroup v1
> As a side note, I never trigger the problem with dragonboard845 and cgroup v2

OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.

[...]

>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>>
>>> dequeue_load_avg(cfs_rq, se);
>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>>> + /* See update_tg_cfs_util() */
>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
>>> +
>>
>> Maybe add a:
>>
>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
>> with *_avg")
>
> I spent time thinking about adding fixes tag. There is no crash/warn
> so far so should we propagate it back in LTS for better performance ?

Not sure I understand. What do you mean by 'should we propagate it back
in LTS'?

[...]

>> This max_t() should make sure that `_sum is always >= _avg *
>> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
>>
>> (1) update_cfs_rq_load_avg()
>> (2) detach_entity_load_avg() and dequeue_load_avg()
>> (3) update_tg_cfs_X()
>>
>> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
>> reason for this?
>
> Main reason is that I have never seen the problem.
> Then, the problem comes from subtracting task's value whereas here we
> always add positive value

OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values.

2022-01-05 13:58:01

by Vincent Guittot

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <[email protected]> wrote:
>
> On 04/01/2022 14:42, Vincent Guittot wrote:
> > On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 22/12/2021 10:38, Vincent Guittot wrote:
>
> [...]
>
> >> I still wonder whether the regression only comes from the changes in
> >> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> >> And could be fixed only by this part of the patch-set (A):
> >
> > I have been able to trigger the warning even with (A) though It took
> > much more time.
> > And I have been able to catch wrong situations (with additional
> > traces) in the 3 places A, B and C
>
> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ?

Not only.
Also util_sum == 0 but util_avg != 0 in different places, although these
situations didn't trigger the sched_warn because some other sync happened
before the periodic call of __update_blocked_fair,
or because nr_running == 1 and the task's util_avg/sum > cfs' util_avg/sum
just before removing the task.

>
> >> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> >> *cfs_rq)
> >>
> >> r = removed_load;
> >> sub_positive(&sa->load_avg, r);
> >> - sa->load_sum = sa->load_avg * divider;
> >> + sub_positive(&sa->load_sum, r * divider);
> >> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> >>
> >> r = removed_util;
> >> sub_positive(&sa->util_avg, r);
> >> - sa->util_sum = sa->util_avg * divider;
> >> + sub_positive(&sa->util_sum, r * divider);
> >> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >>
> >> r = removed_runnable;
> >> sub_positive(&sa->runnable_avg, r);
> >> - sa->runnable_sum = sa->runnable_avg * divider;
> >> + sub_positive(&sa->runnable_sum, r * divider);
> >> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> >> + sa->runnable_avg * MIN_DIVIDER);
> >>
> >> i.e. w/o changing update_tg_cfs_X() (and
> >> detach_entity_load_avg()/dequeue_load_avg()).
> >>
> >> update_load_avg()
> >> update_cfs_rq_load_avg() <---
> >> propagate_entity_load_avg()
> >> update_tg_cfs_X() <---
> >>
> >>
> >> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> >> hackbench in several different sched group levels on
> >> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> >
> > IIRC, it was with hikey960 with cgroup v1
> > As a side note, I never trigger the problem with dragonboard845 and cgroup v2
>
> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
>
> [...]
>
> >>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>>
> >>> dequeue_load_avg(cfs_rq, se);
> >>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> >>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> >>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> >>> + /* See update_tg_cfs_util() */
> >>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> >>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> >>> +
> >>
> >> Maybe add a:
> >>
> >> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> >> with *_avg")
> >
> > I spent time thinking about adding fixes tag. There is no crash/warn
> > so far so should we propagate it back in LTS for better performance ?
>
> Not sure I understand. What do you mean by 'should we propagate it back
> in LTS'?

Sorry, I had all stable kernels in mind and not only LTS.

Some of the changes done in PELT signal propagation that replace
subtracting util_sum with using util_avg * divider instead are related
to other problems with the sched group hierarchy and
throttling/unthrottling. I'm not 100% confident that backporting this
to stable with a Fixes tag wouldn't also require backporting more
patches in other areas in order not to resurrect old problems. So I
wonder if it's worth backporting this to stable.

>
> [...]
>
> >> This max_t() should make sure that `_sum is always >= _avg *
> >> MIN_DIVIDER`. Which is not the case sometimes. Currently this is done in
> >>
> >> (1) update_cfs_rq_load_avg()
> >> (2) detach_entity_load_avg() and dequeue_load_avg()
> >> (3) update_tg_cfs_X()
> >>
> >> but not in attach_entity_load_avg(), enqueue_load_avg(). What's the
> >> reason for this?
> >
> > Main reason is that I have never seen the problem.
> > Then, the problem comes from subtracting task's value whereas here we
> > always add positive value
>
> OK, I see. The add_positive()'s in update_tg_cfs_X() deal w/ `long` values.

2022-01-07 11:43:55

by Dietmar Eggemann

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On 05/01/2022 14:57, Vincent Guittot wrote:
> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 04/01/2022 14:42, Vincent Guittot wrote:
>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 22/12/2021 10:38, Vincent Guittot wrote:
>>
>> [...]
>>
>>>> I still wonder whether the regression only comes from the changes in
>>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
>>>> And could be fixed only by this part of the patch-set (A):
>>>
>>> I have been able to trigger the warning even with (A) though It took
>>> much more time.
>>> And I have been able to catch wrong situations (with additional
>>> traces) in the 3 places A, B and C
>>
>> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ?
>
> not only.
> also util_sum == 0 but util_avg !=0 in different places although these

Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.

> situation didn't trigger sched_warn because some other sync happened
> before the periodic call of __update_blocked_fair
> or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum
> just before removing the task

I see.

>>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
>>>> *cfs_rq)
>>>>
>>>> r = removed_load;
>>>> sub_positive(&sa->load_avg, r);
>>>> - sa->load_sum = sa->load_avg * divider;
>>>> + sub_positive(&sa->load_sum, r * divider);
>>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
>>>>
>>>> r = removed_util;
>>>> sub_positive(&sa->util_avg, r);
>>>> - sa->util_sum = sa->util_avg * divider;
>>>> + sub_positive(&sa->util_sum, r * divider);
>>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>>>>
>>>> r = removed_runnable;
>>>> sub_positive(&sa->runnable_avg, r);
>>>> - sa->runnable_sum = sa->runnable_avg * divider;
>>>> + sub_positive(&sa->runnable_sum, r * divider);
>>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
>>>> + sa->runnable_avg * MIN_DIVIDER);
>>>>
>>>> i.e. w/o changing update_tg_cfs_X() (and
>>>> detach_entity_load_avg()/dequeue_load_avg()).
>>>>
>>>> update_load_avg()
>>>> update_cfs_rq_load_avg() <---
>>>> propagate_entity_load_avg()
>>>> update_tg_cfs_X() <---
>>>>
>>>>
>>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
>>>> hackbench in several different sched group levels on
>>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
>>>
>>> IIRC, it was with hikey960 with cgroup v1
>>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
>>
>> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.

Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
E/F G/H), >45h uptime.

>>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>>>>
>>>>> dequeue_load_avg(cfs_rq, se);
>>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
>>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
>>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
>>>>> + /* See update_tg_cfs_util() */
>>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
>>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
>>>>> +
>>>>
>>>> Maybe add a:
>>>>
>>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
>>>> with *_avg")
>>>
>>> I spent time thinking about adding fixes tag. There is no crash/warn
>>> so far so should we propagate it back in LTS for better performance ?
>>
>> Not sure I understand. What do you mean by 'should we propagate it back
>> in LTS'?
>
> Sorry I had any stables in mind and not only LTS.
>
> Some of the changes done in PELT signal propagation that replace
> subtracting util_sum by using util_avg * divider instead, are related
> to other problems with sched group hierarchy and
> throttling/unthrottling. I'm not 100% confident that using fixes tag
> to backport this on stables doesn't need to backport more patches on
> other areas in order to not resurrect old problems. So I wonder if
> it's worth backporting this on stables

OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
was the reason why I initially suggested to split the patch-set
differently. But you said that you saw the issue also when (1) is fixed.

2022-01-07 15:21:47

by Vincent Guittot

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <[email protected]> wrote:
>
> On 05/01/2022 14:57, Vincent Guittot wrote:
> > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 04/01/2022 14:42, Vincent Guittot wrote:
> >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
> >>>>
> >>>> On 22/12/2021 10:38, Vincent Guittot wrote:
> >>
> >> [...]
> >>
> >>>> I still wonder whether the regression only comes from the changes in
> >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> >>>> And could be fixed only by this part of the patch-set (A):
> >>>
> >>> I have been able to trigger the warning even with (A) though It took
> >>> much more time.
> >>> And I have been able to catch wrong situations (with additional
> >>> traces) in the 3 places A, B and C
> >>
> >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ?
> >
> > not only.
> > also util_sum == 0 but util_avg !=0 in different places although these
>
> Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.
>
> > situation didn't trigger sched_warn because some other sync happened
> > before the periodic call of __update_blocked_fair
> > or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum
> > just before removing the task
>
> I see.
>
> >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> >>>> *cfs_rq)
> >>>>
> >>>> r = removed_load;
> >>>> sub_positive(&sa->load_avg, r);
> >>>> - sa->load_sum = sa->load_avg * divider;
> >>>> + sub_positive(&sa->load_sum, r * divider);
> >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> >>>>
> >>>> r = removed_util;
> >>>> sub_positive(&sa->util_avg, r);
> >>>> - sa->util_sum = sa->util_avg * divider;
> >>>> + sub_positive(&sa->util_sum, r * divider);
> >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> >>>>
> >>>> r = removed_runnable;
> >>>> sub_positive(&sa->runnable_avg, r);
> >>>> - sa->runnable_sum = sa->runnable_avg * divider;
> >>>> + sub_positive(&sa->runnable_sum, r * divider);
> >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> >>>> + sa->runnable_avg * MIN_DIVIDER);
> >>>>
> >>>> i.e. w/o changing update_tg_cfs_X() (and
> >>>> detach_entity_load_avg()/dequeue_load_avg()).
> >>>>
> >>>> update_load_avg()
> >>>> update_cfs_rq_load_avg() <---
> >>>> propagate_entity_load_avg()
> >>>> update_tg_cfs_X() <---
> >>>>
> >>>>
> >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> >>>> hackbench in several different sched group levels on
> >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> >>>
> >>> IIRC, it was with hikey960 with cgroup v1
> >>> As a side note, I never trigger the problem with dragonboard845 and cgroup v2
> >>
> >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
>
> Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
> E/F G/H > 45h uptime
>
> >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >>>>>
> >>>>> dequeue_load_avg(cfs_rq, se);
> >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> >>>>> + /* See update_tg_cfs_util() */
> >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> >>>>> +
> >>>>
> >>>> Maybe add a:
> >>>>
> >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> >>>> with *_avg")
> >>>
> >>> I spent time thinking about adding fixes tag. There is no crash/warn
> >>> so far so should we propagate it back in LTS for better performance ?
> >>
> >> Not sure I understand. What do you mean by 'should we propagate it back
> >> in LTS'?
> >
> > Sorry I had any stables in mind and not only LTS.
> >
> > Some of the changes done in PELT signal propagation that replace
> > subtracting util_sum by using util_avg * divider instead, are related
> > to other problems with sched group hierarchy and
> > throttling/unthrottling. I'm not 100% confident that using fixes tag
> > to backport this on stables doesn't need to backport more patches on
> > other areas in order to not resurrect old problems. So I wonder if
> > it's worth backporting this on stables
>
> OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
> was the reason why I initially suggested to split the patch-set
> differently. But you said that you saw the issue also when (1) is fixed.

Ok, I think that we were not speaking about the same setup. I wrongly
read that you were saying that
sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
was only needed in update_cfs_rq_load_avg() but not in the other places.

But what you said is that we only need the below to fix the perf
regression raised by Rick?
r = removed_util;
sub_positive(&sa->util_avg, r);
- sa->util_sum = sa->util_avg * divider;
+ sub_positive(&sa->util_sum, r * divider);
+ sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

The WARN that I mentioned in my previous email was about not adding
the max_t in all 3 places. I reran some tests today and I triggered the
WARN after a detach without the max_t line.

I can probably isolate the code above in a dedicated patch for the
regression raised by Rick and we could consider adding a Fixes tag; I
will run more tests with only this part during the weekend.
That being said, we need to stay consistent in all 3 places where we
move or propagate some *_avg. In particular, using "sa->util_sum =
sa->util_avg * divider" has the drawback of filtering out the
contribution not already accounted for in util_avg, and the impact is
much larger than expected. It means that although fixing only
update_cfs_rq_load_avg() seems enough for Rick's case, some other
cases could be impacted by the 2 other places, and we need to fix
them now that we have a better view of the root cause.

2022-01-11 07:54:38

by Vincent Guittot

Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <[email protected]> wrote:
>
> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <[email protected]> wrote:
> >
> > On 05/01/2022 14:57, Vincent Guittot wrote:
> > > On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <[email protected]> wrote:
> > >>
> > >> On 04/01/2022 14:42, Vincent Guittot wrote:
> > >>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
> > >>>>
> > >>>> On 22/12/2021 10:38, Vincent Guittot wrote:
> > >>
> > >> [...]
> > >>
> > >>>> I still wonder whether the regression only comes from the changes in
> > >>>> update_cfs_rq_load_avg(), introduced by 1c35b07e6d39.
> > >>>> And could be fixed only by this part of the patch-set (A):
> > >>>
> > >>> I have been able to trigger the warning even with (A) though It took
> > >>> much more time.
> > >>> And I have been able to catch wrong situations (with additional
> > >>> traces) in the 3 places A, B and C
> > >>
> > >> OK. By wrong situation you mean '_sum < _avg * MIN_DIVIDER' ?
> > >
> > > not only.
> > > also util_sum == 0 but util_avg !=0 in different places although these
> >
> > Ah OK, I saw this one as part of '_sum < _avg * MIN_DIVIDER'.
> >
> > > situation didn't trigger sched_warn because some other sync happened
> > > before the periodic call of __update_blocked_fair
> > > or if nr_running == 1 and and task's util_avg/sum > cfs' util_avg/sum
> > > just before removing the task
> >
> > I see.
> >
> > >>>> @@ -3677,15 +3706,22 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq
> > >>>> *cfs_rq)
> > >>>>
> > >>>> r = removed_load;
> > >>>> sub_positive(&sa->load_avg, r);
> > >>>> - sa->load_sum = sa->load_avg * divider;
> > >>>> + sub_positive(&sa->load_sum, r * divider);
> > >>>> + sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * MIN_DIVIDER);
> > >>>>
> > >>>> r = removed_util;
> > >>>> sub_positive(&sa->util_avg, r);
> > >>>> - sa->util_sum = sa->util_avg * divider;
> > >>>> + sub_positive(&sa->util_sum, r * divider);
> > >>>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> > >>>>
> > >>>> r = removed_runnable;
> > >>>> sub_positive(&sa->runnable_avg, r);
> > >>>> - sa->runnable_sum = sa->runnable_avg * divider;
> > >>>> + sub_positive(&sa->runnable_sum, r * divider);
> > >>>> + sa->runnable_sum = max_t(u32, sa->runnable_sum,
> > >>>> + sa->runnable_avg * MIN_DIVIDER);
> > >>>>
> > >>>> i.e. w/o changing update_tg_cfs_X() (and
> > >>>> detach_entity_load_avg()/dequeue_load_avg()).
> > >>>>
> > >>>> update_load_avg()
> > >>>> update_cfs_rq_load_avg() <---
> > >>>> propagate_entity_load_avg()
> > >>>> update_tg_cfs_X() <---
> > >>>>
> > >>>>
> > >>>> I didn't see the SCHED_WARN_ON() [cfs_rq_is_decayed()] when looping on
> > >>>> hackbench in several different sched group levels on
> > >>>> [Hikey620 (Arm64, 8 CPUs, SMP, 4 taskgroups: A/B C/D E/F G/H), >12h uptime].
> > >>>
> > >>> IIRC, it was on hikey960 with cgroup v1
> > >>> As a side note, I never triggered the problem with dragonboard845 and cgroup v2
> > >>
> > >> OK, just started a test on hikey960 cgroupv1. Let's see if I can catch it.
> >
> > Still no sign of the issue (hikey960, cgroupv1, 4 taskgroups: A/B C/D
> > E/F G/H, >45h uptime).
> >
> > >>>>> @@ -3780,7 +3799,11 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > >>>>>
> > >>>>> dequeue_load_avg(cfs_rq, se);
> > >>>>> sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
> > >>>>> - cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * divider;
> > >>>>> + sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
> > >>>>> + /* See update_tg_cfs_util() */
> > >>>>> + cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
> > >>>>> + cfs_rq->avg.util_avg * MIN_DIVIDER);
> > >>>>> +
> > >>>>
> > >>>> Maybe add a:
> > >>>>
> > >>>> Fixes: fcf6631f3736 ("sched/pelt: Ensure that *_sum is always synced
> > >>>> with *_avg")
> > >>>
> > >>> I spent time thinking about adding a Fixes tag. There is no crash/WARN
> > >>> so far, so should we propagate it back to LTS for better performance?
> > >>
> > >> Not sure I understand. What do you mean by 'should we propagate it back
> > >> in LTS'?
> > >
> > > Sorry, I had all the stable releases in mind and not only LTS.
> > >
> > > Some of the changes done in PELT signal propagation that replace
> > > subtracting util_sum with using util_avg * divider instead are related
> > > to other problems with the sched group hierarchy and
> > > throttling/unthrottling. I'm not 100% confident that using a Fixes tag
> > > to backport this to stable won't require backporting more patches in
> > > other areas in order not to resurrect old problems. So I wonder if
> > > it's worth backporting this to stable.
> >
> > OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
> > update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
> > was the reason why I initially suggested to split the patch-set
> > differently. But you said that you saw the issue also when (1) is fixed.
>
> Ok, I think that we were not speaking about the same setup. I wrongly
> read that you were saying that
> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> was only needed in update_cfs_rq_load_avg() but not in the other places.
>
> But what you said is that we only need the change below to fix the perf
> regression raised by Rick?
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> - sa->util_sum = sa->util_avg * divider;
> + sub_positive(&sa->util_sum, r * divider);
> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);

The test with the code above didn't trigger any SCHED_WARN over the
weekend, so it's probably OK to make a dedicated patch for this with a
Fixes tag.
I'm going to prepare a v2.

>
> The WARN that I mentioned in my previous email was about not adding
> the max_t in all 3 places. I reran some tests today and triggered the
> WARN after a detach without the max_t line.
>
> I can probably isolate the code above in a dedicated patch for the
> regression raised by Rick and we could consider adding a Fixes tag; I
> will run more tests with only this part during the weekend.
> That being said, we need to stay consistent in all 3 places where we
> move or propagate some *_avg. In particular, using "sa->util_sum =
> sa->util_avg * divider" has the drawback of filtering out the
> contribution not yet accounted for in util_avg, and the impact is much
> larger than expected. It means that although fixing only
> update_cfs_rq_load_avg() seems enough for Rick's case, some other
> cases could still be affected by the 2 other places, and we need to fix
> them now that we have a better view of the root cause.

2022-01-11 12:37:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On 11/01/2022 08:54, Vincent Guittot wrote:
> On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <[email protected]> wrote:
>>
>> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <[email protected]> wrote:
>>>
>>> On 05/01/2022 14:57, Vincent Guittot wrote:
>>>> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <[email protected]> wrote:
>>>>>
>>>>> On 04/01/2022 14:42, Vincent Guittot wrote:
>>>>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <[email protected]> wrote:
>>>>>>>
>>>>>>> On 22/12/2021 10:38, Vincent Guittot wrote:

[...]

>>>> Some of the changes done in PELT signal propagation that replace
>>>> subtracting util_sum with using util_avg * divider instead are related
>>>> to other problems with the sched group hierarchy and
>>>> throttling/unthrottling. I'm not 100% confident that using a Fixes tag
>>>> to backport this to stable won't require backporting more patches in
>>>> other areas in order not to resurrect old problems. So I wonder if
>>>> it's worth backporting this to stable.
>>>
>>> OK, I see. So only 1c35b07e6d39 (i.e. the util _sum/_avg change in
>>> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
>>> was the reason why I initially suggested to split the patch-set
>>> differently. But you said that you saw the issue also when (1) is fixed.
>>
>> Ok, I think that we were not speaking about the same setup. I wrongly
>> read that you were saying that
>> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>> was only needed in update_cfs_rq_load_avg() but not in the other places.
>>
>> But what you said is that we only need the change below to fix the perf
>> regression raised by Rick?
>> r = removed_util;
>> sub_positive(&sa->util_avg, r);
>> - sa->util_sum = sa->util_avg * divider;
>> + sub_positive(&sa->util_sum, r * divider);
>> + sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>
> The test with the code above didn't trigger any SCHED_WARN over the
> weekend, so it's probably OK to make a dedicated patch for this with a
> Fixes tag.
> I'm going to prepare a v2.

Yes, `sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER)` is
needed for all 3 X = [load, runnable, util] in update_cfs_rq_load_avg()
to not hit the SCHED_WARN_ON() in cfs_rq_is_decayed().
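
In other words (a rough standalone sketch, not a copy of the kernel
code, and the helper names below are purely illustrative): _avg is
derived as _sum / divider, so once the _sum values have fully decayed
to zero the _avg values must be zero as well, and keeping
_sum >= _avg * MIN_DIVIDER at every sync point is what guarantees that:

/*
 * Illustrative only: roughly the invariant behind the SCHED_WARN_ON()
 * in cfs_rq_is_decayed() - a zero _sum with a non-zero _avg is
 * inconsistent. sum >= avg * MIN_DIVIDER implies: sum == 0 => avg == 0.
 */
static inline int pelt_pair_consistent(unsigned long avg, unsigned int sum)
{
	return !(sum == 0 && avg != 0);
}

static inline int cfs_avgs_consistent(unsigned long load_avg, unsigned int load_sum,
				      unsigned long runnable_avg, unsigned int runnable_sum,
				      unsigned long util_avg, unsigned int util_sum)
{
	return pelt_pair_consistent(load_avg, load_sum) &&
	       pelt_pair_consistent(runnable_avg, runnable_sum) &&
	       pelt_pair_consistent(util_avg, util_sum);
}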

>> The WARN that I mentioned in my previous email was about not adding
>> the max_t in all 3 places. I reran some tests today and triggered the
>> WARN after a detach without the max_t line.
>>
>> I can probably isolate the code above in a dedicated patch for the
>> regression raised by Rick and we could consider adding a Fixes tag; I
>> will run more tests with only this part during the weekend.
>> That being said, we need to stay consistent in all 3 places where we
>> move or propagate some *_avg. In particular, using "sa->util_sum =
>> sa->util_avg * divider" has the drawback of filtering out the
>> contribution not yet accounted for in util_avg, and the impact is much
>> larger than expected. It means that although fixing only
>> update_cfs_rq_load_avg() seems enough for Rick's case, some other
>> cases could still be affected by the 2 other places, and we need to fix
>> them now that we have a better view of the root cause.

Agreed.