2017-04-24 20:13:58

by Tejun Heo

Subject: [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use

Hello,

We've noticed scheduling latency spikes when cgroup is in use, even
when the machine is mostly idle, the scheduling frequency is moderate
and there is only a single level of cgroup nesting. More details are
in the patch descriptions, but here's a schbench run from the root
cgroup.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 26
75.0000th: 62
90.0000th: 74
95.0000th: 86
*99.0000th: 887
99.5000th: 3692
99.9000th: 10832
min=0, max=13374

And here's one from inside a first level cgroup.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 31
75.0000th: 65
90.0000th: 71
95.0000th: 91
*99.0000th: 7288
99.5000th: 10352
99.9000th: 12496
min=0, max=13023

The p99 latency spike was tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, which left load_balance() operating
on out-of-sync load information. It ended up picking the wrong CPU as
the load balance target often enough to significantly impact p99
latency.

This patchset fixes the issue by always propagating runnable_load_avg
so that, regardless of nesting, every cfs_rq's runnable_load_avg is
the sum of the scaled loads of all tasks queued below it.
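
To make the intended invariant concrete, here's a minimal sketch (not
kernel code; the structure and helper names are made up for
illustration) of how a single task's load is meant to show up in an
ancestor cfs_rq's runnable_load_avg once the series is applied:

/* Toy model: each nesting level scales the load by the group's shares
 * over the total weight queued on its cfs_rq, capped at the shares. */
struct toy_group {
        unsigned long shares;           /* weight of the group's sched_entity */
        unsigned long queued_weight;    /* sum of weights queued on its cfs_rq */
        struct toy_group *parent;       /* NULL at the root */
};

static unsigned long toy_root_contrib(unsigned long task_load_avg,
                                      const struct toy_group *g)
{
        unsigned long load = task_load_avg;

        for (; g; g = g->parent) {
                load = load * g->shares / g->queued_weight;
                if (load > g->shares)
                        load = g->shares;
        }
        return load;
}

The root cfs_rq's runnable_load_avg is then meant to be the sum of
this over every queued task; intermediate cfs_rqs see the same sum
with the scaling stopped at their level.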

As a side effect, this changes the load_avg behavior of sched_entities
associated with cfs_rq's. It doesn't seem wrong to me and I can't
think of a better / cleaner way, but if there is one, please let me
know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
sched/fair: Always propagate runnable_load_avg

diffstat follows.

kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

--
tejun


2017-04-24 20:14:27

by Tejun Heo

Subject: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

09a43ace1f98 ("sched/fair: Propagate load during synchronous
attach/detach") added immediate load propagation from cfs_rq to its
sched_entity then to the parent cfs_rq; however, what gets propagated
doesn't seem to make sense.

It repeats the tg_weight calculation done in calc_cfs_shares() but
only uses it to compensate for shares being out of date. After that,
it sets the sched_entity's load_avg to the load_avg of the
corresponding cfs_rq.

This doesn't make sense, as the cfs_rq's load_avg is some fraction of
its total weight, which the sched_entity's weight has nothing to do
with. For example, if the cfs_rq has a single task with a constant
load of 1, the cfs_rq's load_avg would be around 1. If that cfs_rq's
sched_entity is the only active one in the parent cfs_rq and has the
maximum weight, the sched_entity's load should be around the maximum
weight, but update_tg_cfs_load() ends up overriding it to 1.

At the parent's level, the absolute value of load_avg inside a child
cfs_rq doesn't mean anything. Only the ratio against its weight is
meaningful.

This patch changes update_tg_cfs_load() to normalize the
runnable_load_avg of the cfs_rq and then scale it to the matching
sched_entity's freshly calculated shares for propagation. The use of
runnable_load_avg instead of load_avg is intentional and keeps the
parent's runnable_load_avg true to the sum of the scaled loads of all
tasks queued under it, which is critical for the correct operation of
the load balancer. The next patch will depend on it.
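
As a rough worked example (an illustration, not something the patch
states: assume a single weight-1 task running full-tilt, so
gcfs_rq->runnable_load_avg is ~1 and gcfs_rq->load.weight is 1, and
assume calc_cfs_shares() comes back with 1024, using the 10-bit scale
for readability):

        load = min(1 * 1024 / 1, 1024) = 1024

i.e. the group's sched_entity now contributes what a directly queued
weight-1024 task would, instead of the 1 that the current code
propagates.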

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,29 @@ static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
- long delta, load = gcfs_rq->avg.load_avg;
+ long load = 0, delta;

/*
- * If the load of group cfs_rq is null, the load of the
- * sched_entity will also be null so we can skip the formula
+ * A cfs_rq's load avg contribution to the parent should be scaled
+ * to the sched_entity's weight. Use freshly calculated shares
+ * instead of @se->load.weight as the latter may not reflect
+ * changes from the current scheduling operation.
+ *
+ * Note that the propagation source is runnable_load_avg instead of
+ * load_avg. This keeps every cfs_rq's runnable_load_avg true to
+ * the sum of the scaled loads of all tasks queued under it, which
+ * is important for the correct operation of the load balancer.
+ *
+ * This can make the sched_entity's load_avg jumpier but that
+ * correctly reflects what would happen without cgroups if each
+ * task's load is scaled across nesting - the load is being
+ * averaged at the task and each cfs_rq.
*/
- if (load) {
- long tg_load;
+ if (gcfs_rq->load.weight) {
+ long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);

- /* Get tg's load and ensure tg_load > 0 */
- tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
- /* Ensure tg_load >= load and updated with current load*/
- tg_load -= gcfs_rq->tg_load_avg_contrib;
- tg_load += load;
-
- /*
- * We need to compute a correction term in the case that the
- * task group is consuming more CPU than a task of equal
- * weight. A task with a weight equals to tg->shares will have
- * a load less or equal to scale_load_down(tg->shares).
- * Similarly, the sched_entities that represent the task group
- * at parent level, can't have a load higher than
- * scale_load_down(tg->shares). And the Sum of sched_entities'
- * load must be <= scale_load_down(tg->shares).
- */
- if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
- /* scale gcfs_rq's load into tg's shares*/
- load *= scale_load_down(gcfs_rq->tg->shares);
- load /= tg_load;
- }
+ load = min(gcfs_rq->runnable_load_avg *
+ shares / gcfs_rq->load.weight, shares);
}

delta = load - se->avg.load_avg;

2017-04-24 20:14:59

by Tejun Heo

Subject: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

We noticed that with cgroup CPU controller in use, the scheduling
latency gets wonky regardless of nesting level or weight
configuration. This is easily reproducible with Chris Mason's
schbench[1].

All tests are run on a single-socket machine with 16 cores and 32
threads. While the machine is mostly idle, it isn't completely idle;
there's always some variable management load going on. The command
used is

schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads each of which
repeatedly runs for 15ms and then sleeps for 10ms. Here's a
representative result when running from the root cgroup.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 26
75.0000th: 62
90.0000th: 74
95.0000th: 86
*99.0000th: 887
99.5000th: 3692
99.9000th: 10832
min=0, max=13374

The following is inside a first level CPU cgroup with the maximum
weight.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 31
75.0000th: 65
90.0000th: 71
95.0000th: 91
*99.0000th: 7288
99.5000th: 10352
99.9000th: 12496
min=0, max=13023

Note the drastic increase in p99 scheduling latency. After
investigation, it turned out that update_sd_lb_stats(), which is used
by load_balance() to pick the most loaded group, was often picking the
wrong group. A CPU which has one schbench thread running and another
queued wouldn't report the correspondingly higher weighted_cpuload()
and would get overlooked as the target of load balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
sum of the load_avg of all queued sched_entities. Without cgroups or
at the root cgroup, each task's load_avg contributes directly to the
sum. When a task wakes up or goes to sleep, the change is immediately
reflected in runnable_load_avg, which in turn affects load balancing.
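
For reference, weighted_cpuload() here boils down to roughly the
following (paraphrased from kernel/sched/fair.c of this era; the exact
helper chain may differ):

static unsigned long weighted_cpuload(const int cpu)
{
        /* the root cfs_rq's runnable_load_avg is what the LB compares */
        return cpu_rq(cpu)->cfs.runnable_load_avg;
}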

However, when CPU cgroup is in use, a nesting cfs_rq blocks this
immediate reflection. When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but doesn't get
propagated to the parent. It only affects the matching sched_entity's
load_avg over time which then gets propagated to the runnable_load_avg
at that level. This makes weighted_cpuload() often temporarily out of
sync leading to suboptimal behavior of load_balance() and increase in
scheduling latencies as shown above.

This patch fixes the issue by updating propagate_entity_load_avg() to
always propagate to the parent's runnable_load_avg. Combined with the
previous patch, this keeps a cfs_rq's runnable_load_avg always equal
to the sum of the scaled loads of all tasks queued below it, removing
the artifacts from nesting cfs_rqs. The following is from inside three
levels of nesting with the patch applied.

# ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 40
75.0000th: 71
90.0000th: 89
95.0000th: 108
*99.0000th: 679
99.5000th: 3500
99.9000th: 10960
min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Chris Mason <[email protected]>
---
kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq

/* Take into account change of load of a child task group */
static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ bool propagate_avg)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
long load = 0, delta;
@@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
se->avg.load_avg = load;
se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;

- /* Update parent cfs_rq load */
- add_positive(&cfs_rq->avg.load_avg, delta);
- cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+ if (propagate_avg) {
+ /* Update parent cfs_rq load */
+ add_positive(&cfs_rq->avg.load_avg, delta);
+ cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+ }

/*
* If the sched_entity is already enqueued, we also have to update the
@@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
/* Update task and its cfs_rq load average */
static inline int propagate_entity_load_avg(struct sched_entity *se)
{
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ bool propagate_avg;

if (entity_is_task(se))
return 0;

- if (!test_and_clear_tg_cfs_propagate(se))
- return 0;
-
- cfs_rq = cfs_rq_of(se);
+ propagate_avg = test_and_clear_tg_cfs_propagate(se);

- set_tg_cfs_propagate(cfs_rq);
+ /*
+ * We want to keep @cfs_rq->runnable_load_avg always in sync so
+ * that the load balancer can accurately determine the busiest CPU
+ * regardless of cfs_rq nesting.
+ */
+ update_tg_cfs_load(cfs_rq, se, propagate_avg);

- update_tg_cfs_util(cfs_rq, se);
- update_tg_cfs_load(cfs_rq, se);
+ if (propagate_avg) {
+ set_tg_cfs_propagate(cfs_rq);
+ update_tg_cfs_util(cfs_rq, se);
+ }

- return 1;
+ return propagate_avg;
}

#else /* CONFIG_FAIR_GROUP_SCHED */

2017-04-24 21:33:37

by Tejun Heo

Subject: [PATCH v2 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

09a43ace1f98 ("sched/fair: Propagate load during synchronous
attach/detach") added immediate load propagation from cfs_rq to its
sched_entity then to the parent cfs_rq; however, what gets propagated
doesn't seem to make sense.

It repeats the tg_weight calculation done in calc_cfs_shares() but
only uses it to compensate for shares being out of date. After that,
it sets the sched_entity's load_avg to the load_avg of the
corresponding cfs_rq.

This doesn't make sense, as the cfs_rq's load_avg is some fraction of
its total weight, which the sched_entity's weight has nothing to do
with. For example, if the cfs_rq has a single task with a constant
load of 1, the cfs_rq's load_avg would be around 1. If that cfs_rq's
sched_entity is the only active one in the parent cfs_rq and has the
maximum weight, the sched_entity's load should be around the maximum
weight, but update_tg_cfs_load() ends up overriding it to 1.

At the parent's level, the absolute value of load_avg inside a child
cfs_rq doesn't mean anything. Only the ratio against its weight is
meaningful.

This patch changes update_tg_cfs_load() to normalize the
runnable_load_avg of the cfs_rq and then scale it to the matching
sched_entity's freshly calculated shares for propagation. The use of
runnable_load_avg instead of load_avg is intentional and keeps the
parent's runnable_load_avg true to the sum of the scaled loads of all
tasks queued under it, which is critical for the correct operation of
the load balancer. The next patch will depend on it.

v2: Use min_t() to squash a build warning.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
- long delta, load = gcfs_rq->avg.load_avg;
+ long load = 0, delta;

/*
- * If the load of group cfs_rq is null, the load of the
- * sched_entity will also be null so we can skip the formula
+ * A cfs_rq's load avg contribution to the parent should be scaled
+ * to the sched_entity's weight. Use freshly calculated shares
+ * instead of @se->load.weight as the latter may not reflect
+ * changes from the current scheduling operation.
+ *
+ * Note that the propagation source is runnable_load_avg instead of
+ * load_avg. This keeps every cfs_rq's runnable_load_avg true to
+ * the sum of the scaled loads of all tasks queued under it, which
+ * is important for the correct operation of the load balancer.
+ *
+ * This can make the sched_entity's load_avg jumpier but that
+ * correctly reflects what would happen without cgroups if each
+ * task's load is scaled across nesting - the load is being
+ * averaged at the task and each cfs_rq.
*/
- if (load) {
- long tg_load;
+ if (gcfs_rq->load.weight) {
+ long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);

- /* Get tg's load and ensure tg_load > 0 */
- tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
- /* Ensure tg_load >= load and updated with current load*/
- tg_load -= gcfs_rq->tg_load_avg_contrib;
- tg_load += load;
-
- /*
- * We need to compute a correction term in the case that the
- * task group is consuming more CPU than a task of equal
- * weight. A task with a weight equals to tg->shares will have
- * a load less or equal to scale_load_down(tg->shares).
- * Similarly, the sched_entities that represent the task group
- * at parent level, can't have a load higher than
- * scale_load_down(tg->shares). And the Sum of sched_entities'
- * load must be <= scale_load_down(tg->shares).
- */
- if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
- /* scale gcfs_rq's load into tg's shares*/
- load *= scale_load_down(gcfs_rq->tg->shares);
- load /= tg_load;
- }
+ load = min_t(long, shares,
+ gcfs_rq->runnable_load_avg *
+ shares / gcfs_rq->load.weight);
}

delta = load - se->avg.load_avg;

2017-04-24 21:35:41

by Tejun Heo

Subject: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities

Now that a cfs_rq sched_entity's load_avg always gets propagated from
the associated cfs_rq, there's no point in calling __update_load_avg()
on it. The two mechanisms compete with each other and we'd always be
using a value close to the propagated one anyway.

Skip __update_load_avg() for cfs_rq sched_entities. Also, relocate
propagate_entity_load_avg() to signify that propagation is the
counterpart to __update_load_avg() for cfs_rq sched_entities. This
puts the propagation before update_cfs_rq_load_avg() which shouldn't
disturb anything.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Turner <[email protected]>
---
Hello,

A follow-up patch. This removes __update_load_avg() on cfs_rq se's,
as the value is now constantly kept in sync from the cfs_rq. The
patch doesn't cause any noticeable changes in tests.

Thanks.

kernel/sched/fair.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3293,20 +3293,22 @@ static inline void update_load_avg(struc
u64 now = cfs_rq_clock_task(cfs_rq);
struct rq *rq = rq_of(cfs_rq);
int cpu = cpu_of(rq);
- int decayed;
+ int decayed = 0;

/*
* Track task load average for carrying it to new CPU after migrated, and
* track group sched_entity load average for task_h_load calc in migration
*/
- if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
- __update_load_avg(now, cpu, &se->avg,
- se->on_rq * scale_load_down(se->load.weight),
- cfs_rq->curr == se, NULL);
+ if (entity_is_task(se)) {
+ if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+ __update_load_avg(now, cpu, &se->avg,
+ se->on_rq * scale_load_down(se->load.weight),
+ cfs_rq->curr == se, NULL);
+ } else {
+ decayed |= propagate_entity_load_avg(se);
}

- decayed = update_cfs_rq_load_avg(now, cfs_rq, true);
- decayed |= propagate_entity_load_avg(se);
+ decayed |= update_cfs_rq_load_avg(now, cfs_rq, true);

if (decayed && (flags & UPDATE_TG))
update_tg_load_avg(cfs_rq, 0);

2017-04-24 21:49:11

by Peter Zijlstra

Subject: Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities

On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> - if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
> - __update_load_avg(now, cpu, &se->avg,
> - se->on_rq * scale_load_down(se->load.weight),
> - cfs_rq->curr == se, NULL);
> + if (entity_is_task(se)) {
> + if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> + __update_load_avg(now, cpu, &se->avg,
> + se->on_rq * scale_load_down(se->load.weight),
> + cfs_rq->curr == se, NULL);

I've not looked at these patches yet, but you've been patching old code.
__update_load_avg() no longer exists (the conversion shouldn't be too
hard, it's mostly been a restructure/rename thing).

2017-04-24 22:54:16

by Tejun Heo

Subject: Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities

Hello, Peter.

On Mon, Apr 24, 2017 at 11:48:59PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> > - if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
> > - __update_load_avg(now, cpu, &se->avg,
> > - se->on_rq * scale_load_down(se->load.weight),
> > - cfs_rq->curr == se, NULL);
> > + if (entity_is_task(se)) {
> > + if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > + __update_load_avg(now, cpu, &se->avg,
> > + se->on_rq * scale_load_down(se->load.weight),
> > + cfs_rq->curr == se, NULL);
>
> I've not looked at these patches yet, but you've been patching old code.
> __update_load_avg() no longer exists (the conversion shouldn't be too
> hard, its mostly been a restructure/rename thing).

Ah, sure. The patchset still being RFC, I wanted to post the version
I was working with. If you want the patchset refreshed now, please
let me know.

Thanks.

--
tejun

2017-04-25 08:36:29

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

Hi Tejun

On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date. After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to with.
> For example, if the cfs_rq has a single constant load 1 task the
> cfs_rq's load_avg would be around 1. If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.

I'm not sure I follow your example:
a task TA with a load_avg = 1 is the only task in a task group GB, so
the cfs_rq's load_avg = 1 too, and the group_entity of this cfs_rq has
a weight of 1024 (I use the 10-bit format for readability), which is
the total share of task group GB.

Are you saying that the group_entity load_avg should be around 1024 and not 1?

I would say it depends on TA's weight. I assume that TA's weight is
the default value (1024) as you don't specify any value in your
example.

If TA ran directly at the parent level, its sched_entity would have a
load_avg of 1, so why should the group entity's load_avg be 1024? It
would just temporarily show the cfs_rq as more loaded than it really
is, and in the end the group entity's load_avg would go back to 1.

>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything. Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation. Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it which is critical for the correction operation
> of load balancer. The next patch will depend on it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Paul Turner <[email protected]>
> ---
> kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
> 1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
> update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> - long delta, load = gcfs_rq->avg.load_avg;
> + long load = 0, delta;
>
> /*
> - * If the load of group cfs_rq is null, the load of the
> - * sched_entity will also be null so we can skip the formula
> + * A cfs_rq's load avg contribution to the parent should be scaled
> + * to the sched_entity's weight. Use freshly calculated shares
> + * instead of @se->load.weight as the latter may not reflect
> + * changes from the current scheduling operation.
> + *
> + * Note that the propagation source is runnable_load_avg instead of
> + * load_avg. This keeps every cfs_rq's runnable_load_avg true to
> + * the sum of the scaled loads of all tasks queued under it, which
> + * is important for the correct operation of the load balancer.
> + *
> + * This can make the sched_entity's load_avg jumpier but that
> + * correctly reflects what would happen without cgroups if each
> + * task's load is scaled across nesting - the load is being
> + * averaged at the task and each cfs_rq.
> */
> - if (load) {
> - long tg_load;
> + if (gcfs_rq->load.weight) {
> + long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> - /* Get tg's load and ensure tg_load > 0 */
> - tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> - /* Ensure tg_load >= load and updated with current load*/
> - tg_load -= gcfs_rq->tg_load_avg_contrib;
> - tg_load += load;
> -
> - /*
> - * We need to compute a correction term in the case that the
> - * task group is consuming more CPU than a task of equal
> - * weight. A task with a weight equals to tg->shares will have
> - * a load less or equal to scale_load_down(tg->shares).
> - * Similarly, the sched_entities that represent the task group
> - * at parent level, can't have a load higher than
> - * scale_load_down(tg->shares). And the Sum of sched_entities'
> - * load must be <= scale_load_down(tg->shares).
> - */
> - if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> - /* scale gcfs_rq's load into tg's shares*/
> - load *= scale_load_down(gcfs_rq->tg->shares);
> - load /= tg_load;
> - }
> + load = min(gcfs_rq->runnable_load_avg *
> + shares / gcfs_rq->load.weight, shares);
> }
>
> delta = load - se->avg.load_avg;

2017-04-25 08:47:05

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration. This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single socket, 16 cores, 32 threads machine.
> While the machine is mostly idle, it isn't completey. There's always
> some variable management load going on. The command used is
>
> schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads each of which
> repeatedly runs for 15ms and then sleeps for 10ms. Here's a
> representative result when running from the root cgroup.
>
> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
> Latency percentiles (usec)
> 50.0000th: 26
> 75.0000th: 62
> 90.0000th: 74
> 95.0000th: 86
> *99.0000th: 887
> 99.5000th: 3692
> 99.9000th: 10832
> min=0, max=13374
>
> The following is inside a first level CPU cgroup with the maximum
> weight.
>
> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
> Latency percentiles (usec)
> 50.0000th: 31
> 75.0000th: 65
> 90.0000th: 71
> 95.0000th: 91
> *99.0000th: 7288
> 99.5000th: 10352
> 99.9000th: 12496
> min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency. After
> investigation, it turned out that the update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group. A CPU which has one schbench running and
> another queued wouldn't report the correspondingly higher
> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
> sum of the load_avg of all queued sched_entities. Without cgroups or
> at the root cgroup, each task's load_avg contributes directly to the
> sum. When a task wakes up or goes to sleep, the change is immediately
> reflected on runnable_load_avg which in turn affects load balancing.
>
> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
> immediate reflection. When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent. It only affects the matching sched_entity's
> load_avg over time which then gets propagated to the runnable_load_avg
> at that level. This makes weighted_cpuload() often temporarily out of
> sync leading to suboptimal behavior of load_balance() and increase in
> scheduling latencies as shown above.
>
> This patch fixes the issue by updating propagate_entity_load_avg() to
> always propagate to the parent's runnable_load_avg. Combined with the
> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
> of the scaled loads of all tasks queued below removing the artifacts
> from nesting cfs_rqs. The following is from inside three levels of
> nesting with the patch applied.

So you are changing the purpose of propagate_entity_load_avg(), which
aims to propagate load_avg/util_avg changes only when a task migrates,
and you also want to propagate the enqueue/dequeue into the parent
cfs_rq->runnable_load_avg.

>
> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
> Latency percentiles (usec)
> 50.0000th: 40
> 75.0000th: 71
> 90.0000th: 89
> 95.0000th: 108
> *99.0000th: 679
> 99.5000th: 3500
> 99.9000th: 10960
> min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Chris Mason <[email protected]>
> ---
> kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
> 1 file changed, 21 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>
> /* Take into account change of load of a child task group */
> static inline void
> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
> + bool propagate_avg)
> {
> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> long load = 0, delta;
> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
> se->avg.load_avg = load;
> se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>
> - /* Update parent cfs_rq load */
> - add_positive(&cfs_rq->avg.load_avg, delta);
> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> + if (propagate_avg) {
> + /* Update parent cfs_rq load */
> + add_positive(&cfs_rq->avg.load_avg, delta);
> + cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> + }
>
> /*
> * If the sched_entity is already enqueued, we also have to update the
> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
> /* Update task and its cfs_rq load average */
> static inline int propagate_entity_load_avg(struct sched_entity *se)
> {
> - struct cfs_rq *cfs_rq;
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + bool propagate_avg;
>
> if (entity_is_task(se))
> return 0;
>
> - if (!test_and_clear_tg_cfs_propagate(se))
> - return 0;
> -
> - cfs_rq = cfs_rq_of(se);
> + propagate_avg = test_and_clear_tg_cfs_propagate(se);
>
> - set_tg_cfs_propagate(cfs_rq);
> + /*
> + * We want to keep @cfs_rq->runnable_load_avg always in sync so
> + * that the load balancer can accurately determine the busiest CPU
> + * regardless of cfs_rq nesting.
> + */
> + update_tg_cfs_load(cfs_rq, se, propagate_avg);
>
> - update_tg_cfs_util(cfs_rq, se);
> - update_tg_cfs_load(cfs_rq, se);
> + if (propagate_avg) {
> + set_tg_cfs_propagate(cfs_rq);
> + update_tg_cfs_util(cfs_rq, se);
> + }
>
> - return 1;
> + return propagate_avg;
> }
>
> #else /* CONFIG_FAIR_GROUP_SCHED */

2017-04-25 09:06:02

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 25 April 2017 at 10:46, Vincent Guittot <[email protected]> wrote:
> On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration. This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey. There's always
>> some variable management load going on. The command used is
>>
>> schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms. Here's a
>> representative result when running from the root cgroup.
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 26
>> 75.0000th: 62
>> 90.0000th: 74
>> 95.0000th: 86
>> *99.0000th: 887
>> 99.5000th: 3692
>> 99.9000th: 10832
>> min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 31
>> 75.0000th: 65
>> 90.0000th: 71
>> 95.0000th: 91
>> *99.0000th: 7288
>> 99.5000th: 10352
>> 99.9000th: 12496
>> min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency. After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group. A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all queued sched_entities. Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum. When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>> immediate reflection. When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent. It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level. This makes weighted_cpuload() often temporarily out of
>> sync leading to suboptimal behavior of load_balance() and increase in
>> scheduling latencies as shown above.
>>
>> This patch fixes the issue by updating propagate_entity_load_avg() to
>> always propagate to the parent's runnable_load_avg. Combined with the
>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>> of the scaled loads of all tasks queued below removing the artifacts
>> from nesting cfs_rqs. The following is from inside three levels of
>> nesting with the patch applied.
>
> So you are changing the purpose of propagate_entity_load_avg which
> aims to propagate load_avg/util_avg changes only when a task migrate
> and you also want to propagate the enqueue/dequeue in the parent
> cfs_rq->runnable_load_avg

In fact you want the sched_entity's load_avg to reflect
cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

>
>>
>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>> Latency percentiles (usec)
>> 50.0000th: 40
>> 75.0000th: 71
>> 90.0000th: 89
>> 95.0000th: 108
>> *99.0000th: 679
>> 99.5000th: 3500
>> 99.9000th: 10960
>> min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Mike Galbraith <[email protected]>
>> Cc: Paul Turner <[email protected]>
>> Cc: Chris Mason <[email protected]>
>> ---
>> kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>> 1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>
>> /* Take into account change of load of a child task group */
>> static inline void
>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> + bool propagate_avg)
>> {
>> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> long load = 0, delta;
>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>> se->avg.load_avg = load;
>> se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>
>> - /* Update parent cfs_rq load */
>> - add_positive(&cfs_rq->avg.load_avg, delta);
>> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> + if (propagate_avg) {
>> + /* Update parent cfs_rq load */
>> + add_positive(&cfs_rq->avg.load_avg, delta);
>> + cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> + }
>>
>> /*
>> * If the sched_entity is already enqueued, we also have to update the
>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>> /* Update task and its cfs_rq load average */
>> static inline int propagate_entity_load_avg(struct sched_entity *se)
>> {
>> - struct cfs_rq *cfs_rq;
>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> + bool propagate_avg;
>>
>> if (entity_is_task(se))
>> return 0;
>>
>> - if (!test_and_clear_tg_cfs_propagate(se))
>> - return 0;
>> -
>> - cfs_rq = cfs_rq_of(se);
>> + propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>
>> - set_tg_cfs_propagate(cfs_rq);
>> + /*
>> + * We want to keep @cfs_rq->runnable_load_avg always in sync so
>> + * that the load balancer can accurately determine the busiest CPU
>> + * regardless of cfs_rq nesting.
>> + */
>> + update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>
>> - update_tg_cfs_util(cfs_rq, se);
>> - update_tg_cfs_load(cfs_rq, se);
>> + if (propagate_avg) {
>> + set_tg_cfs_propagate(cfs_rq);
>> + update_tg_cfs_util(cfs_rq, se);
>> + }
>>
>> - return 1;
>> + return propagate_avg;
>> }
>>
>> #else /* CONFIG_FAIR_GROUP_SCHED */

2017-04-25 12:59:56

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 25 April 2017 at 11:05, Vincent Guittot <[email protected]> wrote:
> On 25 April 2017 at 10:46, Vincent Guittot <[email protected]> wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
>>> We noticed that with cgroup CPU controller in use, the scheduling
>>> latency gets wonky regardless of nesting level or weight
>>> configuration. This is easily reproducible with Chris Mason's
>>> schbench[1].
>>>
>>> All tests are run on a single socket, 16 cores, 32 threads machine.
>>> While the machine is mostly idle, it isn't completey. There's always
>>> some variable management load going on. The command used is
>>>
>>> schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>
>>> which measures scheduling latency with 32 threads each of which
>>> repeatedly runs for 15ms and then sleeps for 10ms. Here's a
>>> representative result when running from the root cgroup.
>>>
>>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>> Latency percentiles (usec)
>>> 50.0000th: 26
>>> 75.0000th: 62
>>> 90.0000th: 74
>>> 95.0000th: 86
>>> *99.0000th: 887
>>> 99.5000th: 3692
>>> 99.9000th: 10832
>>> min=0, max=13374
>>>
>>> The following is inside a first level CPU cgroup with the maximum
>>> weight.
>>>
>>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>> Latency percentiles (usec)
>>> 50.0000th: 31
>>> 75.0000th: 65
>>> 90.0000th: 71
>>> 95.0000th: 91
>>> *99.0000th: 7288
>>> 99.5000th: 10352
>>> 99.9000th: 12496
>>> min=0, max=13023
>>>
>>> Note the drastic increase in p99 scheduling latency. After
>>> investigation, it turned out that the update_sd_lb_stats(), which is
>>> used by load_balance() to pick the most loaded group, was often
>>> picking the wrong group. A CPU which has one schbench running and
>>> another queued wouldn't report the correspondingly higher
>>> weighted_cpuload() and get looked over as the target of load
>>> balancing.
>>>
>>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>>> sum of the load_avg of all queued sched_entities. Without cgroups or
>>> at the root cgroup, each task's load_avg contributes directly to the
>>> sum. When a task wakes up or goes to sleep, the change is immediately
>>> reflected on runnable_load_avg which in turn affects load balancing.
>>>
>>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>>> immediate reflection. When a task wakes up inside a cgroup, the
>>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>>> propagated to the parent. It only affects the matching sched_entity's
>>> load_avg over time which then gets propagated to the runnable_load_avg
>>> at that level. This makes weighted_cpuload() often temporarily out of
>>> sync leading to suboptimal behavior of load_balance() and increase in
>>> scheduling latencies as shown above.
>>>
>>> This patch fixes the issue by updating propagate_entity_load_avg() to
>>> always propagate to the parent's runnable_load_avg. Combined with the
>>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>>> of the scaled loads of all tasks queued below removing the artifacts
>>> from nesting cfs_rqs. The following is from inside three levels of
>>> nesting with the patch applied.
>>
>> So you are changing the purpose of propagate_entity_load_avg which
>> aims to propagate load_avg/util_avg changes only when a task migrate
>> and you also want to propagate the enqueue/dequeue in the parent
>> cfs_rq->runnable_load_avg
>
> In fact you want that sched_entity load_avg reflects
> cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

I have run a quick test with your patches and schbench on my platform.
I haven't been able to reproduce your regression, but my platform is
quite different from yours (only 8 cores, without SMT).
But most importantly, the parent cfs_rq->runnable_load_avg never
reaches 0 (or almost 0) when it is idle. Instead, it still has a
runnable_load_avg (this is not due to rounding in the computation)
whereas runnable_load_avg should be 0.

Just out of curiosity, is your regression still there if you disable
SMT/hyperthreading on your platform?

Regards,
Vincent

>
>>
>>>
>>> # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>> Latency percentiles (usec)
>>> 50.0000th: 40
>>> 75.0000th: 71
>>> 90.0000th: 89
>>> 95.0000th: 108
>>> *99.0000th: 679
>>> 99.5000th: 3500
>>> 99.9000th: 10960
>>> min=0, max=13790
>>>
>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>>
>>> Signed-off-by: Tejun Heo <[email protected]>
>>> Cc: Vincent Guittot <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: Peter Zijlstra <[email protected]>
>>> Cc: Mike Galbraith <[email protected]>
>>> Cc: Paul Turner <[email protected]>
>>> Cc: Chris Mason <[email protected]>
>>> ---
>>> kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>>> 1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>>
>>> /* Take into account change of load of a child task group */
>>> static inline void
>>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> + bool propagate_avg)
>>> {
>>> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>> long load = 0, delta;
>>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>> se->avg.load_avg = load;
>>> se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>>
>>> - /* Update parent cfs_rq load */
>>> - add_positive(&cfs_rq->avg.load_avg, delta);
>>> - cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> + if (propagate_avg) {
>>> + /* Update parent cfs_rq load */
>>> + add_positive(&cfs_rq->avg.load_avg, delta);
>>> + cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> + }
>>>
>>> /*
>>> * If the sched_entity is already enqueued, we also have to update the
>>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>> /* Update task and its cfs_rq load average */
>>> static inline int propagate_entity_load_avg(struct sched_entity *se)
>>> {
>>> - struct cfs_rq *cfs_rq;
>>> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> + bool propagate_avg;
>>>
>>> if (entity_is_task(se))
>>> return 0;
>>>
>>> - if (!test_and_clear_tg_cfs_propagate(se))
>>> - return 0;
>>> -
>>> - cfs_rq = cfs_rq_of(se);
>>> + propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>>
>>> - set_tg_cfs_propagate(cfs_rq);
>>> + /*
>>> + * We want to keep @cfs_rq->runnable_load_avg always in sync so
>>> + * that the load balancer can accurately determine the busiest CPU
>>> + * regardless of cfs_rq nesting.
>>> + */
>>> + update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>>
>>> - update_tg_cfs_util(cfs_rq, se);
>>> - update_tg_cfs_load(cfs_rq, se);
>>> + if (propagate_avg) {
>>> + set_tg_cfs_propagate(cfs_rq);
>>> + update_tg_cfs_util(cfs_rq, se);
>>> + }
>>>
>>> - return 1;
>>> + return propagate_avg;
>>> }
>>>
>>> #else /* CONFIG_FAIR_GROUP_SCHED */

2017-04-25 18:12:39

by Tejun Heo

Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

Hello,

On Tue, Apr 25, 2017 at 10:35:53AM +0200, Vincent Guittot wrote:
> not sure to catch your example:
> a task TA with a load_avg = 1 is the only task in a task group GB so
> the cfs_rq load_avg = 1 too and the group_entity of this cfs_rq has
> got a weight of 1024 (I use 10bits format for readability) which is
> the total share of task group GB

The group_entity (the sched_entity corresponding to the cfs_rq) should
behave as if it's a task which has the weight of 1024.

> Are you saying that the group_entity load_avg should be around 1024 and not 1 ?

Yes.

> I would say it depends of TA weight. I assume that TA weight is the
> default value (1024) as you don't specify any value in your example

Please consider the following configuration, where GA is a group
entity, and TA and TB are tasks.

ROOT - GA (weight 1024) - TA (weight 1)
\ GB (weight 1 ) - TB (weight 1)

Let's say both TA and TB are running full-tilt. Now let's take out GA
and GB.

ROOT - TA1 (weight 1024)
\ TB1 (weight 1 )

GA should behave the same as TA1, and GB the same as TB1. GA's load
should match TA1's, and GA's load as seen from ROOT's cfs_rq has
nothing to do with how much total absolute weight it has inside it.

ROOT - GA2 (weight 1024) - TA2 (weight 1 )
\ GB2 (weight 1 ) - TB2 (weight 1024)

If TA2 and TB2 are constantly running, GA2's and GB2's loads in ROOT's
cfs_rq should match GA's and GB's, respectively.
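
Running the numbers through the formula from patch 1 (a sketch: it
assumes every task runs full-tilt, so a weight-w task has a load_avg
of about w, and that each group's shares come out equal to its
configured weight), each group entity's load becomes
min(child runnable_load_avg * shares / child load.weight, shares):

        GA : min(   1 * 1024 /    1, 1024) = 1024  (matches TA1)
        GB : min(   1 *    1 /    1,    1) =    1  (matches TB1)
        GA2: min(   1 * 1024 /    1, 1024) = 1024  (matches GA)
        GB2: min(1024 *    1 / 1024,    1) =    1  (matches GB)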

> If TA directly runs at parent level, its sched_entity would have a
> load_avg of 1 so why the group entity load_avg should be 1024 ? it

Because then the hierarchical weight configuration doesn't mean
anything.

> will just temporally show the cfs_rq more loaded than it is really and
> at the end the group entity load_avg will go back to 1

It's not temporary. The weight of a group entity is its shares, which
is its load fraction of the group's configured weight. Assuming UP, if
you configure a group to a weight of 1024 and have any task running
full-tilt in it, the group will converge to a load of 1024. The
problem is that the propagation logic is currently doing something
completely different and temporarily pushes down the load whenever it
triggers.

Thanks.

--
tejun

2017-04-25 18:49:54

by Tejun Heo

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hello,

On Tue, Apr 25, 2017 at 02:59:18PM +0200, Vincent Guittot wrote:
> >> So you are changing the purpose of propagate_entity_load_avg which
> >> aims to propagate load_avg/util_avg changes only when a task migrate
> >> and you also want to propagate the enqueue/dequeue in the parent
> >> cfs_rq->runnable_load_avg

Yeah, it always propagates runnable_load_avg, and load_avg/util_avg
too on migrations.

> > In fact you want that sched_entity load_avg reflects
> > cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

Yes, that's how it gets changed. The load balancer assumes that the
root's runnable_load_avg is the total sum of the loads of all
currently active tasks. Nesting cfs_rq's shouldn't change that, and
how the loads should be mapped is clearly defined (scaled recursively
until they reach the root), which is what the code calculates. The
change in cfs_rq->avg.load_avg's behavior is to reflect that immediate
propagation, as load_avg and runnable_load_avg are tightly coupled.

It does change a nested cfs_rq's load_avg behavior, though: it sheds
the extra layer of averaging and directly reflects the scaled load
avgs of its members, which are already time averaged. I could have
missed something but couldn't spot anything which can break from this.

> I have run a quick test with your patches and schbench on my platform.
> I haven't been able to reproduce your regression but my platform is
> quite different from yours (only 8 cores without SMT)
> But most importantly, the parent cfs_rq->runnable_load_avg never
> reaches 0 (or almost 0) when it is idle. Instead, it still has a
> runnable_load_avg (this is not due to rounding computation) whereas
> runnable_load_avg should be 0

Heh, let me try that out. Probably a silly mistake somewhere.

> Just to be curious, Is your regression still there if you disable
> SMT/hyperthreading on your paltform?

Will try that too. I can't see why HT would change it because I see
single CPU queues being misevaluated. Just in case: you need to tune
the test params so that the test doesn't load the machine too much and
so that there are some non-CPU-intensive workloads going on to perturb
things a bit. Anyways, I'm gonna try disabling HT.

Thanks.

--
tejun

2017-04-25 20:49:18

by Tejun Heo

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
> Will try that too. I can't see why HT would change it because I see
> single CPU queues misevaluated. Just in case, you need to tune the
> test params so that it doesn't load the machine too much and that
> there are some non-CPU intensive workloads going on to purturb things
> a bit. Anyways, I'm gonna try disabling HT.

It's finickier, but after changing the duty cycle a bit, it reproduces
w/ HT off. I think the trick is setting the number of threads to the
number of logical CPUs and tuning -s/-c so that p99 starts climbing
up. The following is from the root cgroup.

~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
Latency percentiles (usec)
50.0000th: 51
75.0000th: 62
90.0000th: 67
95.0000th: 70
*99.0000th: 1482
99.5000th: 5048
99.9000th: 9008
min=0, max=10066

And the following is from a first-level cgroup with maximum CPU
weight.

# ~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
Latency percentiles (usec)
50.0000th: 51
75.0000th: 62
90.0000th: 71
95.0000th: 84
*99.0000th: 10064
99.5000th: 10064
99.9000th: 10064
min=0, max=10089

It's interesting that p99 ends up aligned with the CPU burn duration.
It looks like some threads end up waiting for full durations.

The following is with the patches applied in the same cgroup setup.

# ~/schbench -m 2 -t 8 -s 15000 -c 10000 -r 30
Latency percentiles (usec)
50.0000th: 64
75.0000th: 73
90.0000th: 102
95.0000th: 111
*99.0000th: 1954
99.5000th: 5432
99.9000th: 9520
min=0, max=10012

The numbers fluctuate quite a bit between runs but the pattern is
still very clear - e.g. 10ms p99 never shows up in the root cgroup or
on the patched kernel.

Thanks.

--
tejun

2017-04-25 21:08:21

by Tejun Heo

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
> > I have run a quick test with your patches and schbench on my platform.
> > I haven't been able to reproduce your regression but my platform is
> > quite different from yours (only 8 cores without SMT)
> > But most importantly, the parent cfs_rq->runnable_load_avg never
> > reaches 0 (or almost 0) when it is idle. Instead, it still has a
> > runnable_load_avg (this is not due to rounding computation) whereas
> > runnable_load_avg should be 0
>
> Heh, let me try that out. Probably a silly mistake somewhere.

This is from the follow-up patch. I was confused. Because we don't
propagate decays, we should still decay the runnable_load_avg;
otherwise, we end up accumulating errors in the counter. I'll drop
the last patch.

Thanks.

--
tejun

2017-04-25 21:16:48

by Chris Mason

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg



On 04/25/2017 04:49 PM, Tejun Heo wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> Will try that too. I can't see why HT would change it because I see
>> single CPU queues misevaluated. Just in case, you need to tune the
>> test params so that it doesn't load the machine too much and that
>> there are some non-CPU intensive workloads going on to purturb things
>> a bit. Anyways, I'm gonna try disabling HT.
>
> It's finickier but after changing the duty cycle a bit, it reproduces
> w/ HT off. I think the trick is setting the number of threads to the
> number of logical CPUs and tune -s/-c so that p99 starts climbing up.
> The following is from the root cgroup.

Since it's only measuring wakeup latency, schbench is best at exposing
problems when the machine is just barely below saturation. At
saturation, everyone has to wait for the CPUs, and if we're relatively
idle there's always a CPU to be found.

There's schbench -a to try and find this magic tipping point, but I
haven't found a great way to automate it for every kind of machine yet
(sorry).

-chris

2017-04-26 10:22:21

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 25 April 2017 at 23:08, Tejun Heo <[email protected]> wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> > I have run a quick test with your patches and schbench on my platform.
>> > I haven't been able to reproduce your regression but my platform is
>> > quite different from yours (only 8 cores without SMT)
>> > But most importantly, the parent cfs_rq->runnable_load_avg never
>> > reaches 0 (or almost 0) when it is idle. Instead, it still has a
>> > runnable_load_avg (this is not due to rounding computation) whereas
>> > runnable_load_avg should be 0
>>
>> Heh, let me try that out. Probably a silly mistake somewhere.
>
> This is from the follow-up patch. I was confused. Because we don't
> propagate decays, we still should decay the runnable_load_avg;
> otherwise, we end up accumulating errors in the counter. I'll drop
> the last patch.

Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
sometimes see runnable_load_avg significantly higher than load_avg,
which is normally not possible as load_avg = runnable_load_avg +
sleeping tasks' load_avg.

Also, I see the opposite behavior on my platform: an increase of
latency at p99 with your patches.
My platform is a hikey (2x4 ARM cores) and I have used schbench -m 2
-t 4 -s 10000 -c 15000 -r 30, so I have 1 worker thread per CPU, which
is similar to what you are doing on your platform.

With v4.11-rc8, I have run the test 10 times and get consistent results:
schbench -m 2 -t 4 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 255
75.0000th: 350
90.0000th: 454
95.0000th: 489
*99.0000th: 539
99.5000th: 585
99.9000th: 10224
min=0, max=13567

With your patches I see an increase of the latency at p99. I ran the
test 10 times too, and half of the runs show a latency increase like
the one below:
schbench$ ./schbench -m 2 -t 4 -s 10000 -c 15000 -r 30
Latency percentiles (usec)
50.0000th: 216
75.0000th: 295
90.0000th: 395
95.0000th: 444
*99.0000th: 2034
99.5000th: 5960
99.9000th: 12240
min=0, max=14744

>
> Thanks.
>
> --
> tejun

2017-04-26 16:14:53

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date. After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to with.
> For example, if the cfs_rq has a single constant load 1 task the
> cfs_rq's load_avg would be around 1. If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.
>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything. Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation. Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it which is critical for the correction operation
> of load balancer. The next patch will depend on it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Paul Turner <[email protected]>
> ---
> kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
> 1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
> update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> - long delta, load = gcfs_rq->avg.load_avg;
> + long load = 0, delta;
>
> /*
> - * If the load of group cfs_rq is null, the load of the
> - * sched_entity will also be null so we can skip the formula
> + * A cfs_rq's load avg contribution to the parent should be scaled
> + * to the sched_entity's weight. Use freshly calculated shares
> + * instead of @se->load.weight as the latter may not reflect
> + * changes from the current scheduling operation.
> + *
> + * Note that the propagation source is runnable_load_avg instead of
> + * load_avg. This keeps every cfs_rq's runnable_load_avg true to
> + * the sum of the scaled loads of all tasks queued under it, which
> + * is important for the correct operation of the load balancer.
> + *
> + * This can make the sched_entity's load_avg jumpier but that
> + * correctly reflects what would happen without cgroups if each
> + * task's load is scaled across nesting - the load is being
> + * averaged at the task and each cfs_rq.
> */
> - if (load) {
> - long tg_load;
> + if (gcfs_rq->load.weight) {
> + long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>
> - /* Get tg's load and ensure tg_load > 0 */
> - tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> - /* Ensure tg_load >= load and updated with current load*/
> - tg_load -= gcfs_rq->tg_load_avg_contrib;
> - tg_load += load;
> -
> - /*
> - * We need to compute a correction term in the case that the
> - * task group is consuming more CPU than a task of equal
> - * weight. A task with a weight equals to tg->shares will have
> - * a load less or equal to scale_load_down(tg->shares).
> - * Similarly, the sched_entities that represent the task group
> - * at parent level, can't have a load higher than
> - * scale_load_down(tg->shares). And the Sum of sched_entities'
> - * load must be <= scale_load_down(tg->shares).
> - */
> - if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> - /* scale gcfs_rq's load into tg's shares*/
> - load *= scale_load_down(gcfs_rq->tg->shares);
> - load /= tg_load;
> - }
> + load = min(gcfs_rq->runnable_load_avg *
> + shares / gcfs_rq->load.weight, shares);

There is a unit problem above:
runnable_load_avg and shares are not in the same range, but
runnable_load_avg and scale_load_down(gcfs_rq->load.weight) are, so
you should use
gcfs_rq->runnable_load_avg * scale_load_down(shares) /
scale_load_down(gcfs_rq->load.weight).
Hopefully both scale_load_down() calls cancel each other out, but the
min should then be taken against scale_load_down(shares) and not
shares alone.


> }
>
> delta = load - se->avg.load_avg;
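
For reference, a minimal standalone sketch of the range mismatch
described above. The macros only mimic the kernel's 64-bit load
scaling and all numbers are made up:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT  10
#define scale_load(w)           ((w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)
#define min(a, b)               ((a) < (b) ? (a) : (b))

int main(void)
{
        /* one nice-0 task whose runnable_load_avg briefly overshoots 1024 */
        long runnable_load_avg = 1100;  /* load_avg scale (already "scaled down") */
        long weight = scale_load(1024); /* gcfs_rq->load.weight, scaled-up range */
        long shares = scale_load(1024); /* calc_cfs_shares() result, scaled-up range */

        /* shares/weight is dimensionless, so the product stays in the load_avg scale */
        long load = runnable_load_avg * shares / weight;

        printf("capped by shares                  : %ld\n",
               min(load, shares));                      /* 1100 */
        printf("capped by scale_load_down(shares) : %ld\n",
               min(load, scale_load_down(shares)));     /* 1024 */
        return 0;
}

The first cap is effectively a no-op because shares is roughly 1024
times larger than the value being capped.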

2017-04-26 16:51:37

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

On Tuesday 25 Apr 2017 at 11:12:19 (-0700), Tejun Heo wrote:
> Hello,
>
> On Tue, Apr 25, 2017 at 10:35:53AM +0200, Vincent Guittot wrote:
> > not sure to catch your example:
> > a task TA with a load_avg = 1 is the only task in a task group GB so
> > the cfs_rq load_avg = 1 too and the group_entity of this cfs_rq has
> > got a weight of 1024 (I use 10bits format for readability) which is
> > the total share of task group GB
>
> The group_entity (the sched_entity corresponding to the cfs_rq) should
> behave as if it's a task which has the weight of 1024.
>
> > Are you saying that the group_entity load_avg should be around 1024 and not 1 ?
>
> Yes.
>
> > I would say it depends of TA weight. I assume that TA weight is the
> > default value (1024) as you don't specify any value in your example
>
> Please consider the following configuration, where GA is a group
> entity, and TA and TB are tasks.
>
> ROOT - GA (weight 1024) - TA (weight 1)
> \ GB (weight 1 ) - TB (weight 1)
>
> Let's say both TA and TB are running full-tilt. Now let's take out GA
> and GB.
>
> ROOT - TA1 (weight 1024)
> \ TB1 (weight 1 )
>
> GA should behave the same as TA1 and GB TB1. GA's load should match
> TA1's, and GA's load when seen from ROOT's cfs_rq has nothing to do
> with how much total absolute weight it has inside it.
>
> ROOT - GA2 (weight 1024) - TA2 (weight 1 )
> \ GB2 (weight 1 ) - TB2 (weight 1024)
>
> If TA2 and TB2 are constantly running, GA2 and GB2's in ROOT's cfs_rq
> should match GA and GB's, respectively.

Yes I agree

>
> > If TA directly runs at parent level, its sched_entity would have a
> > load_avg of 1 so why the group entity load_avg should be 1024 ? it
>
> Because then the hierarchical weight configuration doesn't mean
> anything.
>
> > will just temporally show the cfs_rq more loaded than it is really and
> > at the end the group entity load_avg will go back to 1
>
> It's not temporary. The weight of a group is its shares, which is its
> load fraction of the configured weight of the group. Assuming UP, if
> you configure a group to the weight of 1024 and have any task running
> full-tilt in it, the group will converge to the load of 1024. The
> problem is that the propagation logic is currently doing something
> completely different and temporarily push down the load whenever it
> triggers.

Ok, I see your point and agree that there is an issue when propagating
the load_avg of a task group which has tasks with a lower weight than
the share, but your proposal has an issue because it uses
runnable_load_avg instead of load_avg, and this makes the propagation
of load_avg incorrect; something like below, which keeps using
load_avg, solves the problem:

+ if (gcfs_rq->load.weight) {
+ long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
+
+ load = min(gcfs_rq->avg.load_avg *
+ shares / scale_load_down(gcfs_rq->load.weight), shares);

I have run schbench with the change above on v4.11-rc8 and latency are ok

Thanks
Vincent
>
>
> Thanks.
>
> --
> tejun

2017-04-26 18:12:44

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
>
> Note the drastic increase in p99 scheduling latency. After
> investigation, it turned out that the update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group. A CPU which has one schbench running and

Could the problem be on the wakeup path side rather than in load
balancing, and more precisely in task placement?
After looking at the trace, it seems that task placement happens in the
wakeup path, and if it fails to select the right idle cpu at wake up,
you will have to wait for a load balance, which is already too late.

> another queued wouldn't report the correspondingly higher

It will, as load_avg includes runnable_load_avg, so whatever load is
in runnable_load_avg will be in load_avg too. But on the contrary,
runnable_load_avg will not include the blocked load of tasks that are
going to wake up soon, as in the schbench case.

One last thing: the load_avg of an idle CPU can stay blocked for a
while (until a load balance happens and updates the blocked load), so
the CPU can be seen as "busy" whereas it is not. Could that be a
reason for your problem?

I have an ongoing patch to at least partly solve that problem, if it
turns out to be a reason.

> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
> sum of the load_avg of all queued sched_entities. Without cgroups or
> at the root cgroup, each task's load_avg contributes directly to the
> sum. When a task wakes up or goes to sleep, the change is immediately
> reflected on runnable_load_avg which in turn affects load balancing.
>

> #else /* CONFIG_FAIR_GROUP_SCHED */

2017-04-26 22:27:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

Hello, Vincent.

On Wed, Apr 26, 2017 at 06:14:17PM +0200, Vincent Guittot wrote:
> > + if (gcfs_rq->load.weight) {
> > + long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
> >
> > + load = min(gcfs_rq->runnable_load_avg *
> > + shares / gcfs_rq->load.weight, shares);
>
> There is a unit problem above:
> runnable_load_avg and shares are not in the same range but
> runnable_load_avg and scale_load_down(gcfs_rq->load.weight) are so
> you should use
> gcfs_rq->runnable_load_avg * scale_load_down(shares) /
> scale_load_down(gcfs_rq->load.weight).

But the only difference there is that we lose accuracy in calculation;
otherwise, the end results are the same, no?

> Hopefully both scale_load_down cancel between them
> But the min should be then tested with scale_load_down(shares) and not
> only shares

Ah, that's right. The min should be against scaled down shares.

Thanks.

--
tejun

2017-04-26 22:40:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

Hello,

On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
> > It's not temporary. The weight of a group is its shares, which is its
> > load fraction of the configured weight of the group. Assuming UP, if
> > you configure a group to the weight of 1024 and have any task running
> > full-tilt in it, the group will converge to the load of 1024. The
> > problem is that the propagation logic is currently doing something
> > completely different and temporarily push down the load whenever it
> > triggers.
>
> Ok, I see your point and agree that there is an issue when propagating
> load_avg of a task group which has tasks with lower weight than the share
> but your proposal has got issue because it uses runnable_load_avg instead
> of load_avg and this makes propagation of loadavg_avg incorrect, something
> like below which keeps using load_avg solve the problem
>
> + if (gcfs_rq->load.weight) {
> + long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
> +
> + load = min(gcfs_rq->avg.load_avg *
> + shares / scale_load_down(gcfs_rq->load.weight), shares);
>
> I have run schbench with the change above on v4.11-rc8 and latency are ok

Hmm... so, I'll test this but this wouldn't solve the problem of
root's runnable_load_avg being out of sync with the approximate sum of
all task loads, which is the cause of the latencies that I'm seeing.

Are you saying that with the above change, you're not seeing the
higher latency issue that you reported in the other reply?

Thanks.

--
tejun

2017-04-26 22:52:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hello,

On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> Can the problem be on the load balance side instead ? and more
> precisely in the wakeup path ?
> After looking at the trace, it seems that task placement happens at
> wake up path and if it fails to select the right idle cpu at wake up,
> you will have to wait for a load balance which is alreayd too late

Oh, I was tracing most scheduler activities, and the ratio of wakeups
picking idle CPUs was about the same regardless of cgroup membership.
I can confidently say that the latency issue that I'm seeing is from
the load balancer picking the wrong busiest CPU, which is not to say
that there can't be other problems.

> > another queued wouldn't report the correspondingly higher
>
> It will as load_avg includes the runnable_load_avg so whatever load is
> in runnable_load_avg will be in load_avg too. But at the contrary,
> runnable_load_avg will not have the blocked that is going to wake up
> soon in the case of schbench

The decaying contribution of blocked tasks doesn't affect the busiest
CPU selection. Without cgroups, runnable_load_avg is immediately
increased and decreased as tasks enter and leave the queue; otherwise
we end up with CPUs which are idle while threads queued on other CPUs
accumulate scheduling latencies.

The patch doesn't change how the busiest CPU is picked. It already
uses runnable_load_avg. The change that cgroup causes is that it
blocks updates to runnable_load_avg from newly scheduled or sleeping
tasks.

The issue isn't about whether runnable_load_avg or load_avg should be
used but the unexpected differences in the metrics that the load
balancer uses depending on whether cgroup is used or not.

> One last thing, the load_avg of an idle CPU can stay blocked for a
> while (until a load balance happens that will update blocked load) and
> can be seen has "busy" whereas it is not. Could it be a reason of your
> problem ?

AFAICS, the load balancer doesn't use load_avg.

Thanks.

--
tejun

2017-04-27 00:30:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hello, Vincent.

On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> > This is from the follow-up patch. I was confused. Because we don't
> > propagate decays, we still should decay the runnable_load_avg;
> > otherwise, we end up accumulating errors in the counter. I'll drop
> > the last patch.
>
> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
> see runnable_load_avg sometimes significantly higher than load_avg
> which is normally not possible as load_avg = runnable_load_avg +
> sleeping task's load_avg

So, while load_avg would eventually converge on runnable_load_avg +
blocked load_avg given a stable enough workload for long enough,
runnable_load_avg jumping above load_avg temporarily is expected,
AFAICS. That's the whole point of it, a sum closely tracking what's
currently on the cpu so that we can pick the cpu which has the most on
it now. It doesn't make sense to try to pick threads off of a cpu
which is generally loaded but doesn't have much going on right now,
after all.

> Then, I just have the opposite behavior on my platform. I see a
> increase of latency at p99 with your patches.
> My platform is a hikey : 2x4 cores ARM and I have used schbench -m 2
> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU which
> is similar to what you are doing on your platform
>
> With v4.11-rc8. I have run 10 times the test and get consistent results
...
> *99.0000th: 539
...
> With your patches i see an increase of the latency for p99. I run 10
> *99.0000th: 2034

I see. This is surprising given that at least the purpose of the
patch is restoring cgroup behavior to match !cgroup one. I could have
totally messed it up tho. Hmm... there are several ways forward I
guess.

* Can you please double check that the higher latencies w/ the patch
are reliably reproducible? The test machines that I use have
variable management load. It never dominates the machine but is
enough to disturb the results, so drawing out a reliable pattern
takes a lot of repeated runs. I'd really appreciate it if you
could double check that the pattern is reliable with different run
patterns (ie. instead of 10 consecutive runs one after another,
interleaved).

* Is the board something easily obtainable? It'd be the easiest for
me to set up the same environment and reproduce the problem. I
looked up hikey boards on amazon but couldn't easily find 2x4 core
ones. If there's something I can easily buy, please point me to it.
If there's something I can loan, that'd be great too.

* If not, I'll try to clean up the debug patches I have and send them
your way to get more visibility, but given these things tend to be
very iterative, it might take quite a few rounds of back and forth.

Thanks!

--
tejun

2017-04-27 07:01:19

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

On 27 April 2017 at 00:40, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 06:51:23PM +0200, Vincent Guittot wrote:
>> > It's not temporary. The weight of a group is its shares, which is its
>> > load fraction of the configured weight of the group. Assuming UP, if
>> > you configure a group to the weight of 1024 and have any task running
>> > full-tilt in it, the group will converge to the load of 1024. The
>> > problem is that the propagation logic is currently doing something
>> > completely different and temporarily push down the load whenever it
>> > triggers.
>>
>> Ok, I see your point and agree that there is an issue when propagating
>> load_avg of a task group which has tasks with lower weight than the share
>> but your proposal has got issue because it uses runnable_load_avg instead
>> of load_avg and this makes propagation of loadavg_avg incorrect, something
>> like below which keeps using load_avg solve the problem
>>
>> + if (gcfs_rq->load.weight) {
>> + long shares = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg));
>> +
>> + load = min(gcfs_rq->avg.load_avg *
>> + shares / scale_load_down(gcfs_rq->load.weight), shares);
>>
>> I have run schbench with the change above on v4.11-rc8 and latency are ok
>
> Hmm... so, I'll test this but this wouldn't solve the problem of
> root's runnable_load_avg being out of sync with the approximate sum of
> all task loads, which is the cause of the latencies that I'm seeing.
>
> Are you saying that with the above change, you're not seeing the
> higher latency issue that you reported in the other reply?

Yes, with the above change, which uses load_avg but fixes the
propagation for a task with a lower weight than the task group's
share, I don't see any latency regression compared to v4.11-rc8.

>
> Thanks.
>
> --
> tejun

2017-04-27 08:28:39

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 27 April 2017 at 02:30, Tejun Heo <[email protected]> wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
>> > This is from the follow-up patch. I was confused. Because we don't
>> > propagate decays, we still should decay the runnable_load_avg;
>> > otherwise, we end up accumulating errors in the counter. I'll drop
>> > the last patch.
>>
>> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
>> see runnable_load_avg sometimes significantly higher than load_avg
>> which is normally not possible as load_avg = runnable_load_avg +
>> sleeping task's load_avg
>
> So, while load_avg would eventually converge on runnable_load_avg +
> blocked load_avg given stable enough workload for long enough,
> runnable_load_avg jumping above load avg temporarily is expected,

No, it's not. Look at load_avg/runnable_load_avg at the root domain
when only tasks are involved: runnable_load_avg will never be higher
than load_avg because
    load_avg = sum of the load_avg of the tasks attached to the cfs_rq
    runnable_load_avg = sum of the load_avg of the tasks attached and
                        enqueued on the cfs_rq
so load_avg = runnable_load_avg + the blocked tasks' load_avg, and as a
result runnable_load_avg is always lower than load_avg.
And with the propagate load/util_avg patchset, we can even reflect
task migration directly at the root domain, whereas before we had to
wait for util/load_avg and runnable_load_avg to converge to the new
value.
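
A toy model of that invariant for a plain (no-cgroup) cfs_rq; the
names and numbers below are illustrative only, not kernel code:

#include <stdbool.h>
#include <stdio.h>

struct toy_task { long load_avg; bool queued; };

int main(void)
{
        struct toy_task tasks[] = {
                { .load_avg = 700, .queued = true  },   /* running */
                { .load_avg = 300, .queued = true  },   /* runnable, waiting on the rq */
                { .load_avg = 500, .queued = false },   /* blocked / sleeping */
        };
        long load_avg = 0, runnable_load_avg = 0;

        for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
                load_avg += tasks[i].load_avg;                  /* attached: always counted */
                if (tasks[i].queued)
                        runnable_load_avg += tasks[i].load_avg; /* enqueued only */
        }

        /* prints load_avg=1500 runnable_load_avg=1000: the blocked task only
         * contributes to load_avg, so runnable can never exceed load here */
        printf("load_avg=%ld runnable_load_avg=%ld\n", load_avg, runnable_load_avg);
        return 0;
}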

Just to confirm one of my assumptions: the latency regression was
already there in previous kernel versions and is not a result of the
propagate load/util_avg patchset, is it?

> AFAICS. That's the whole point of it, a sum closely tracking what's
> currently on the cpu so that we can pick the cpu which has the most on
> it now. It doesn't make sense to try to pick threads off of a cpu
> which is generally loaded but doesn't have much going on right now,
> after all.

The only point of runnable_load_avg is to be null when a cfs_rq is
idle whereas load_avg is not, not to be higher than load_avg. The
root cause is that load_balance() only looks at "load" but not at the
number of tasks currently running, and that's probably the main
problem: runnable_load_avg has been added because load_balance() fails
to filter out idle groups and idle rqs. We should rather add a new
type in group_classify() to tag groups that are idle, and do the same
in find_busiest_queue().
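
A purely hypothetical, standalone sketch of that idea; none of the
names, fields or thresholds below come from the kernel:

#include <stdio.h>

enum toy_group_type { TOY_GROUP_IDLE, TOY_GROUP_OTHER, TOY_GROUP_OVERLOADED };

struct toy_group_stats {
        unsigned int sum_nr_running;    /* tasks currently running in the group */
        unsigned int group_weight;      /* number of CPUs in the group */
        long         avg_load;          /* whatever load metric is in use */
};

/* classify before comparing loads, so an idle group can be filtered out
 * no matter how large its (possibly stale) load figure is */
static enum toy_group_type toy_classify(const struct toy_group_stats *s)
{
        if (!s->sum_nr_running)
                return TOY_GROUP_IDLE;
        if (s->sum_nr_running > s->group_weight)
                return TOY_GROUP_OVERLOADED;
        return TOY_GROUP_OTHER;
}

int main(void)
{
        struct toy_group_stats idle = { .sum_nr_running = 0, .group_weight = 4, .avg_load = 800 };
        struct toy_group_stats busy = { .sum_nr_running = 5, .group_weight = 4, .avg_load = 700 };

        /* the idle group gets its own type even though its stale avg_load is higher */
        printf("idle group -> type %d, busy group -> type %d\n",
               toy_classify(&idle), toy_classify(&busy));
        return 0;
}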

>
>> Then, I just have the opposite behavior on my platform. I see a
>> increase of latency at p99 with your patches.
>> My platform is a hikey : 2x4 cores ARM and I have used schbench -m 2
>> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU which
>> is similar to what you are doing on your platform
>>
>> With v4.11-rc8. I have run 10 times the test and get consistent results
> ...
>> *99.0000th: 539
> ...
>> With your patches i see an increase of the latency for p99. I run 10
>> *99.0000th: 2034
>
> I see. This is surprising given that at least the purpose of the
> patch is restoring cgroup behavior to match !cgroup one. I could have
> totally messed it up tho. Hmm... there are several ways forward I
> guess.
>
> * Can you please double check that the higher latencies w/ the patch
> is reliably reproducible? The test machines that I use have
> variable management load. They never dominate the machine but are
> enough to disturb the results so that to drawing out a reliable
> pattern takes a lot of repeated runs. I'd really appreciate if you
> could double check that the pattern is reliable with different run
> patterns (ie. instead of 10 consecutive runs after another,
> interleaved).

I always leave some time between 2 consecutive runs, and the 10
consecutive runs take around 2 minutes to execute.

I have also run these 10 consecutive tests several times and the
results stay the same.

>
> * Is the board something easily obtainable? It'd be the eaisest for
> me to set up the same environment and reproduce the problem. I
> looked up hikey boards on amazon but couldn't easily find 2x4 core

It is often listed as a hikey octa core, but I say 2x4 cores just to
point out that there are 2 clusters, which is important for scheduler
topology and behavior.

> ones. If there's something I can easily buy, please point me to it.
> If there's something I can loan, that'd be great too.

It looks like most web sites are currently out of stock.

>
> * If not, I'll try to clean up the debug patches I have and send them
> your way to get more visiblity but given these things tend to be
> very iterative, it might take quite a few back and forth.

Yes, that could be useful. Even a trace of the regression would be useful.

I can also push the debug patch that I use for tracking load metrics
to my git tree if you want. It's ugly but it does the job.

Thanks

>
> Thanks!
>
> --
> tejun

2017-04-27 08:29:38

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 27 April 2017 at 00:52, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
>> Can the problem be on the load balance side instead ? and more
>> precisely in the wakeup path ?
>> After looking at the trace, it seems that task placement happens at
>> wake up path and if it fails to select the right idle cpu at wake up,
>> you will have to wait for a load balance which is alreayd too late
>
> Oh, I was tracing most of scheduler activities and the ratios of
> wakeups picking idle CPUs were about the same regardless of cgroup
> membership. I can confidently say that the latency issue that I'm
> seeing is from load balancer picking the wrong busiest CPU, which is
> not to say that there can be other problems.

Ok. Is there any trace that you can share? Your behavior seems
different from mine.

>
>> > another queued wouldn't report the correspondingly higher
>>
>> It will as load_avg includes the runnable_load_avg so whatever load is
>> in runnable_load_avg will be in load_avg too. But at the contrary,
>> runnable_load_avg will not have the blocked that is going to wake up
>> soon in the case of schbench
>
> Decaying contribution of blocked tasks don't affect the busiest CPU
> selection. Without cgroup, runnable_load_avg is immediately increased
> and decreased as tasks enter and leave the queue and otherwise we end
> up with CPUs which are idle when there are threads queued on different
> CPUs accumulating scheduling latencies.
>
> The patch doesn't change how the busiest CPU is picked. It already
> uses runnable_load_avg. The change that cgroup causes is that it
> blocks updates to runnable_load_avg from newly scheduled or sleeping
> tasks.
>
> The issue isn't about whether runnable_load_avg or load_avg should be
> used but the unexpected differences in the metrics that the load

I think that's the root of the problem. I explain my view a bit more
in the other thread.

> balancer uses depending on whether cgroup is used or not.
>
>> One last thing, the load_avg of an idle CPU can stay blocked for a
>> while (until a load balance happens that will update blocked load) and
>> can be seen has "busy" whereas it is not. Could it be a reason of your
>> problem ?
>
> AFAICS, the load balancer doesn't use load_avg.
>
> Thanks.
>
> --
> tejun

2017-04-27 08:59:52

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

On 27 April 2017 at 00:27, Tejun Heo <[email protected]> wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 06:14:17PM +0200, Vincent Guittot wrote:
>> > + if (gcfs_rq->load.weight) {
>> > + long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
>> >
>> > + load = min(gcfs_rq->runnable_load_avg *
>> > + shares / gcfs_rq->load.weight, shares);
>>
>> There is a unit problem above:
>> runnable_load_avg and shares are not in the same range but
>> runnable_load_avg and scale_load_down(gcfs_rq->load.weight) are so
>> you should use
>> gcfs_rq->runnable_load_avg * scale_load_down(shares) /
>> scale_load_down(gcfs_rq->load.weight).
>
> But the only difference there is that we lose accuracy in calculation;
> otherwise, the end results are the same, no?

Yes, the end result is the same; it was mainly to point out the range
difference and explain why we need scale_load_down(shares) for the 2nd
argument of min(). This should also explain the warning issue you
mentioned earlier.

>
>> Hopefully both scale_load_down cancel between them
>> But the min should be then tested with scale_load_down(shares) and not
>> only shares
>
> Ah, that's right. The min should be against scaled down shares.
>
> Thanks.
>
> --
> tejun

2017-04-28 16:15:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <[email protected]> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch. I was confused. Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter. I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But i
> >> see runnable_load_avg sometimes significantly higher than load_avg
> >> which is normally not possible as load_avg = runnable_load_avg +
> >> sleeping task's load_avg
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given stable enough workload for long enough,
> > runnable_load_avg jumping above load avg temporarily is expected,
>
> No, it's not. Look at load_avg/runnable_avg at root domain when only
> task are involved, runnable_load_avg will never be higher than
> load_avg because
> load_avg = /sum load_avg of tasks attached to the cfs_rq
> runnable_load_avg = /sum load_avg of tasks attached and enqueued
> to the cfs_rq
> load_avg = runnable_load_avg + blocked tasks and as a result
> runnable_load_avg is always lower than load_avg.

We don't actually calculate it that way tho, but yeah, given the
calculations we do, runnable shouldn't be significantly higher than
load_avg. I often see runnable going above load_avg by a bit, but the
margins are small. I wonder whether you're seeing errors accumulating
from the incorrect min() capping.

> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at root domain whereas we had to wait for the
> util/load_avg and runnable_load_avg to converge to the new value
> before
>
> Just to confirm one of my assumption, the latency regression was
> already there in previous kernel versions and is not a result of
> propagate load/util_avg patchset, isn't it ?

Yeah, the latency problem is independent of the propagation logic.
It's really the meaning of the top level runnable_load_avg being
different w/ and w/o cgroups.

> > AFAICS. That's the whole point of it, a sum closely tracking what's
> > currently on the cpu so that we can pick the cpu which has the most on
> > it now. It doesn't make sense to try to pick threads off of a cpu
> > which is generally loaded but doesn't have much going on right now,
> > after all.
>
> The only interest of runnable_load_avg is to be null when a cfs_rq is
> idle whereas load_avg is not but not to be higher than load_avg. The
> root cause is that load_balance only looks at "load" but not number of
> task currently running and that's probably the main problem:
> runnable_load_avg has been added because load_balance fails to filter
> idle group and idle rq. We should better add a new type in
> group_classify to tag group that are idle and the same in
> find_busiest_queue more.

I'll follow up in the other subthread, but there really is a
fundamental difference in how we calculate runnable_avg w/ and w/o
cgroups. Independent of whether we can improve the load balancer
further, it is an existing bug.

> > * Can you please double check that the higher latencies w/ the patch
> > is reliably reproducible? The test machines that I use have
> > variable management load. They never dominate the machine but are
> > enough to disturb the results so that to drawing out a reliable
> > pattern takes a lot of repeated runs. I'd really appreciate if you
> > could double check that the pattern is reliable with different run
> > patterns (ie. instead of 10 consecutive runs after another,
> > interleaved).
>
> I always let time between 2 consecutive run and the 10 consecutive
> runs take around 2min to execute
>
> Then I have run several time these 10 consecutive test and results stay the same

Can you please try the patch at the end of this message? I'm
wondering whether you're seeing the errors accumulating from the wrong
min().

> > ones. If there's something I can easily buy, please point me to it.
> > If there's something I can loan, that'd be great too.
>
> It looks like most of web site are currently out of stock

:(

> > * If not, I'll try to clean up the debug patches I have and send them
> > your way to get more visiblity but given these things tend to be
> > very iterative, it might take quite a few back and forth.
>
> Yes, that could be usefull. Even a trace of regression could be useful

Yeah, will clean it up a bit and post. Will also post some traces.

And here's the updated patch. I didn't add the suggested
scale_load_down()s. I'll follow up on that in the other subthread.
Thanks.

kernel/sched/fair.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
- long delta, load = gcfs_rq->avg.load_avg;
+ long load = 0, delta;

/*
- * If the load of group cfs_rq is null, the load of the
- * sched_entity will also be null so we can skip the formula
+ * A cfs_rq's load avg contribution to the parent should be scaled
+ * to the sched_entity's weight. Use freshly calculated shares
+ * instead of @se->load.weight as the latter may not reflect
+ * changes from the current scheduling operation.
+ *
+ * Note that the propagation source is runnable_load_avg instead of
+ * load_avg. This keeps every cfs_rq's runnable_load_avg true to
+ * the sum of the scaled loads of all tasks queued under it, which
+ * is important for the correct operation of the load balancer.
+ *
+ * This can make the sched_entity's load_avg jumpier but that
+ * correctly reflects what would happen without cgroups if each
+ * task's load is scaled across nesting - the load is being
+ * averaged at the task and each cfs_rq.
*/
- if (load) {
- long tg_load;
+ if (gcfs_rq->load.weight) {
+ long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);

- /* Get tg's load and ensure tg_load > 0 */
- tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
- /* Ensure tg_load >= load and updated with current load*/
- tg_load -= gcfs_rq->tg_load_avg_contrib;
- tg_load += load;
-
- /*
- * We need to compute a correction term in the case that the
- * task group is consuming more CPU than a task of equal
- * weight. A task with a weight equals to tg->shares will have
- * a load less or equal to scale_load_down(tg->shares).
- * Similarly, the sched_entities that represent the task group
- * at parent level, can't have a load higher than
- * scale_load_down(tg->shares). And the Sum of sched_entities'
- * load must be <= scale_load_down(tg->shares).
- */
- if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
- /* scale gcfs_rq's load into tg's shares*/
- load *= scale_load_down(gcfs_rq->tg->shares);
- load /= tg_load;
- }
+ load = min_t(long, scale_load_down(shares),
+ gcfs_rq->runnable_load_avg *
+ shares / gcfs_rq->load.weight);
}

delta = load - se->avg.load_avg;

2017-04-28 17:46:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:59:12AM +0200, Vincent Guittot wrote:
> > But the only difference there is that we lose accuracy in calculation;
> > otherwise, the end results are the same, no?
>
> Yes the end result is the same, it was mainly to point out the range
> difference and explain why we need scale_load_down(shares) for the 2nd
> argument of min.
> This should also explain the warning issue you mentioned earlier

I'm not sure this makes sense. Practically, we're doing more shifts
just to lose calculation accuracy. Even conceptually, what we're
doing is

C
A * ---
B

Where A is in a different scale while B and C are in the same. What
you're suggesting is

scale_down(C)
A * ---------------
scale_down(B)

I can't see why this is better in any way.
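
A quick standalone check of the accuracy argument; the macro only
mimics the kernel's definition and the values are made up:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT  10
#define scale_load_down(w)      ((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
        long A = 900;                   /* runnable_load_avg, already in the down-scaled range */
        long B = 5 * 1024 + 700;        /* load.weight, scaled-up range, low bits set */
        long C = 3 * 1024 + 200;        /* freshly computed shares, scaled-up range, low bits set */

        long divide_first = A * C / B;                                   /* 900 * 3272 / 5820 = 505 */
        long shift_first  = A * scale_load_down(C) / scale_load_down(B); /* 900 * 3    / 5    = 540 */

        printf("A * C/B                           : %ld\n", divide_first);
        printf("A * scale_down(C) / scale_down(B) : %ld\n", shift_first);
        return 0;
}

The shift discards the low 10 bits of B and C before the division, and
the smaller the scaled-down values, the bigger the relative error.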

Thanks.

--
tejun

2017-04-28 20:33:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 00:52, Tejun Heo <[email protected]> wrote:
> > Hello,
> >
> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> On 24 April 2017 at 22:14, Tejun Heo <[email protected]> wrote:
> >> Can the problem be on the load balance side instead ? and more
> >> precisely in the wakeup path ?
> >> After looking at the trace, it seems that task placement happens at
> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> you will have to wait for a load balance which is alreayd too late
> >
> > Oh, I was tracing most of scheduler activities and the ratios of
> > wakeups picking idle CPUs were about the same regardless of cgroup
> > membership. I can confidently say that the latency issue that I'm
> > seeing is from load balancer picking the wrong busiest CPU, which is
> > not to say that there can be other problems.
>
> ok. Is there any trace that you can share ? your behavior seems
> different of mine

I'm attaching the debug patch. With your change (avg instead of
runnable_avg), the following trace shows why it's wrong.

It's dumping a case where group A has a CPU w/ two or more schbench
threads and B doesn't, but the load balancer is determining that B is
more heavily loaded.

dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
dbg_odd_dump: A: CPU001: run=1 schb=1
dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
dbg_odd_dump: A: CPU017: run=2 schb=2
dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
dbg_odd_dump: B: CPU014: run=1 schb=1
dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
dbg_odd_dump: B: CPU030: run=1 schb=1
dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516

You can notice that B's pertask weight is 4.409, which is way higher
than A's 2.779, and this is because Q014-asdf's contribution to Q014-/
is twice as high as it should be. The root queue's runnable avg should
only contain what's currently active, but because we're scaling
load_avg, which includes both active and blocked load, we end up
picking group B over A.
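
A quick arithmetic check of the group-level numbers in the dump above,
assuming avg is group load over group capacity and pertask is group
load over the number of running tasks (the small differences come from
display rounding):

#include <stdio.h>

int main(void)
{
        /* group A: grp=8.337, gcap=1.150, run=3 */
        printf("A: avg=%.3f pertask=%.3f\n", 8.337 / 1.150, 8.337 / 3);
        /* group B: grp=8.819, gcap=1.150, run=2 */
        printf("B: avg=%.3f pertask=%.3f\n", 8.819 / 1.150, 8.819 / 2);
        return 0;
}

This reproduces roughly avg 7.25 vs 7.67 and pertask 2.779 vs 4.409
from the dump, so B wins on avg_load even though A has the
doubly-loaded CPU.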

This shows up in the total number of times we pick the wrong queue,
and thus in latency. I'm running the following script with the debug
patch applied.

#!/bin/bash

date
cat /proc/self/cgroup

echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
echo 0 > /sys/module/fair/parameters/dbg_odd_cnt

~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

cat /sys/module/fair/parameters/dbg_odd_cnt


With your patch applied, in the root cgroup,

Fri Apr 28 12:48:59 PDT 2017
0::/
Latency percentiles (usec)
50.0000th: 26
75.0000th: 63
90.0000th: 78
95.0000th: 88
*99.0000th: 707
99.5000th: 5096
99.9000th: 10352
min=0, max=13743
577


In the /asdf cgroup,

Fri Apr 28 13:19:53 PDT 2017
0::/asdf
Latency percentiles (usec)
50.0000th: 35
75.0000th: 67
90.0000th: 81
95.0000th: 98
*99.0000th: 2212
99.5000th: 4536
99.9000th: 11024
min=0, max=13026
1708


The last line is the number of times the load balancer picked a group
w/o a CPU running two or more schbench threads over one w/ such a CPU.
Some number of these are expected as there are other threads and there
is some play in all the calculations, but propagating avg, or not
propagating at all, significantly increases the count and latency.

> > The issue isn't about whether runnable_load_avg or load_avg should be
> > used but the unexpected differences in the metrics that the load
>
> I think that's the root of the problem. I explain a bit more my view
> on the other thread

So, when picking the busiest group, the only thing which matters is
the queue's runnable_load_avg, which should approximate the sum of all
on-queue loads on that CPU.

If we don't propagate, or propagate load_avg, we're factoring the
blocked avg of descendant cgroups into the root's runnable_load_avg,
which is obviously wrong.
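
To put made-up numbers on that: a child cfs_rq with one queued task
and one blocked task, and a group se whose shares equal the child's
weight so the shares/weight ratio drops out:

#include <stdio.h>

int main(void)
{
        long queued_load  = 500;        /* task currently enqueued on the child cfs_rq */
        long blocked_load = 500;        /* task attached to it but sleeping */

        long child_load_avg          = queued_load + blocked_load;     /* 1000 */
        long child_runnable_load_avg = queued_load;                    /*  500 */

        /* what the parent (ultimately the root) sees in its runnable_load_avg
         * for this group se, depending on which signal is propagated */
        printf("propagate load_avg          -> parent sees %ld\n", child_load_avg);
        printf("propagate runnable_load_avg -> parent sees %ld\n", child_runnable_load_avg);
        return 0;
}

Only 500 worth of load is actually queued on that CPU, so the first
choice inflates the root's runnable_load_avg by the child's blocked
load, which is what skews update_sd_pick_busiest() in the traces above.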

We can argue whether overriding a cfs_rq se's load_avg to the scaled
runnable_load_avg of the cfs_rq is the right way to go or we should
introduce a separate channel to propagate runnable_load_avg; however,
it's clear that we need to fix runnable_load_avg propagation one way
or another.

The thing with a cfs_rq se's load_avg is that it isn't really used
anywhere else AFAICS, so overriding it to the cfs_rq's
runnable_load_avg isn't the prettiest but doesn't really change
anything.

Thanks.

--
tejun

2017-04-28 20:38:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

Here's the debug patch.

The debug condition triggers when the load balancer picks a group w/o
two or more schbench threads on a CPU over one w/ them.

/sys/module/fair/parameters/dbg_odd_cnt: resettable counter
/sys/module/fair/parameters/dbg_odd_nth: dump group states on Nth
occurrence via trace_printk()

The load / weights are printed out so that NICE_0_LOAD is 1.000.

Thanks.
---
kernel/sched/fair.c | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 159 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -32,11 +32,18 @@
#include <linux/mempolicy.h>
#include <linux/migrate.h>
#include <linux/task_work.h>
+#include <linux/moduleparam.h>

#include <trace/events/sched.h>

#include "sched.h"

+static unsigned long dbg_odd_nth;
+static unsigned long dbg_odd_cnt;
+
+module_param(dbg_odd_nth, ulong, 0644);
+module_param(dbg_odd_cnt, ulong, 0644);
+
/*
* Targeted preemption latency for CPU-bound tasks:
*
@@ -7413,6 +7420,149 @@ static inline void update_sg_lb_stats(st
sgs->group_type = group_classify(group, sgs);
}

+static int count_schb(struct rq *rq)
+{
+ unsigned long flags;
+ struct task_struct *p;
+ int cnt = 0;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+
+ list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
+ if (!strncmp(p->comm, "schbench", 8))
+ cnt++;
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ return cnt;
+}
+
+static bool sg_has_two_schb(struct sched_group *sg)
+{
+ int cpu;
+
+ for_each_cpu(cpu, sched_group_cpus(sg))
+ if (count_schb(cpu_rq(cpu)) >= 2)
+ return true;
+ return false;
+}
+
+static DEFINE_PER_CPU(char [PAGE_SIZE], odd_buf);
+
+#define lbw(x) (int)((x) / NICE_0_LOAD), (int)(((x) % NICE_0_LOAD) * 1000 / NICE_0_LOAD)
+#define lba(x) (int)((scale_load(x)) / NICE_0_LOAD), (int)(((scale_load(x)) % NICE_0_LOAD) * 1000 / NICE_0_LOAD)
+
+static int odd_append_se(struct sched_entity *se, const char *postfix,
+ int cnt, char *buf, size_t size)
+{
+#define odd_append(fmt, args...) do { \
+ cnt += scnprintf(buf + cnt, size - cnt, fmt, ##args); \
+ cnt = min_t(int, cnt, size); \
+} while (0)
+
+ if (entity_is_task(se)) {
+ struct task_struct *task = task_of(se);
+ odd_append(" %s(%d%s)", task->comm, task->pid, postfix);
+ } else {
+ char nbuf[64];
+ cgroup_name(se->my_q->tg->css.cgroup, nbuf, sizeof(nbuf));
+ odd_append(" %s(%s)", nbuf, postfix);
+ }
+ odd_append(":w=%d.%03d,l=%d.%03d,u=%d.%03d",
+ lbw(se->load.weight),
+ lba(se->avg.load_avg),
+ lba(se->avg.util_avg));
+
+ return cnt;
+}
+
+static void dbg_odd_dump(const char *pref,
+ struct sched_group *sg, struct sg_lb_stats *sgs)
+{
+ int cpu;
+
+ trace_printk("%sgrp=%*pbl w=%u avg=%d.%03d grp=%d.%03d sum=%d.%03d pertask=%d.%03d\n", pref,
+ cpumask_pr_args(sched_group_cpus(sg)), sg->group_weight,
+ lba(sgs->avg_load), lba(sgs->group_load),
+ lba(sgs->sum_weighted_load), lba(sgs->load_per_task));
+ trace_printk("%sgcap=%d.%03d gutil=%d.%03d run=%u idle=%u gwt=%u type=%d nocap=%d\n",
+ pref,
+ lba(sgs->group_capacity), lba(sgs->group_util),
+ sgs->sum_nr_running, sgs->idle_cpus, sgs->group_weight,
+ sgs->group_type, sgs->group_no_capacity);
+
+ for_each_cpu(cpu, sched_group_cpus(sg)) {
+ struct task_group *tg;
+ unsigned long flags;
+
+ trace_printk("%sCPU%03d: run=%u schb=%d\n", pref, cpu,
+ cpu_rq(cpu)->nr_running, count_schb(cpu_rq(cpu)));
+
+ raw_spin_lock_irqsave(&cpu_rq(cpu)->lock, flags);
+
+ list_for_each_entry_rcu(tg, &task_groups, list) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+ char qname[32] = "root";
+ int depth = 0;
+ long tg_weight = 0, tg_shares = 0;
+ struct sched_entity *se;
+ char *buf = per_cpu_ptr(odd_buf, cpu);
+ int cnt;
+
+ if (!cfs_rq->nr_running)
+ continue;
+
+ if (cfs_rq->tg) {
+ cgroup_name(cfs_rq->tg->css.cgroup, qname, sizeof(qname));
+ if (cfs_rq->tg->se[cpu])
+ depth = cfs_rq->tg->se[cpu]->depth;
+ tg_weight = atomic_long_read(&cfs_rq->tg->load_avg);
+ tg_shares = cfs_rq->tg->shares;
+ }
+
+ trace_printk("%sQ%03d-%s@%d: w=%d.%03d,l=%d.%03d,u=%d.%03d,r=%d.%03d run=%u hrun=%u tgs=%d.%03d tgw=%d.%03d\n",
+ pref, cpu, qname, depth,
+ lbw(cfs_rq->load.weight),
+ lba(cfs_rq->avg.load_avg),
+ lba(cfs_rq->avg.util_avg),
+ lba(cfs_rq->runnable_load_avg),
+ cfs_rq->nr_running, cfs_rq->h_nr_running,
+ lbw(tg_shares),
+ lba(tg_weight));
+
+ buf[0] = '\0';
+ cnt = 0;
+
+ if (cfs_rq->curr)
+ cnt = odd_append_se(cfs_rq->curr, "C", cnt, buf, PAGE_SIZE);
+
+ for (se = __pick_first_entity(cfs_rq); se;
+ se = __pick_next_entity(se))
+ cnt = odd_append_se(se, "", cnt, buf, PAGE_SIZE);
+
+ trace_printk("%sQ%03d-%s@%d: %s\n",
+ pref, cpu, qname, depth, buf);
+ }
+
+ raw_spin_unlock_irqrestore(&cpu_rq(cpu)->lock, flags);
+ }
+}
+
+/* a has >= 2 dts, b doesn't */
+static void dbg_odd(struct lb_env *env,
+ struct sched_group *sga, struct sg_lb_stats *sgsa,
+ struct sched_group *sgb, struct sg_lb_stats *sgsb)
+{
+ if (dbg_odd_nth && (dbg_odd_cnt++ % dbg_odd_nth))
+ return;
+
+ trace_printk("odd: dst=%d idle=%d brk=%u lbtgt=%*pbl type=%d\n",
+ env->dst_cpu, env->idle, env->loop_break,
+ cpumask_pr_args(env->cpus), env->fbq_type);
+ dbg_odd_dump("A: ", sga, sgsa);
+ dbg_odd_dump("B: ", sgb, sgsb);
+}
+
/**
* update_sd_pick_busiest - return 1 on busiest group
* @env: The load balancing environment.
@@ -7432,6 +7582,8 @@ static bool update_sd_pick_busiest(struc
struct sg_lb_stats *sgs)
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ bool busiest_has_two = sds->busiest && sg_has_two_schb(sds->busiest);
+ bool sg_has_two = sg_has_two_schb(sg);

if (sgs->group_type > busiest->group_type)
return true;
@@ -7439,8 +7591,14 @@ static bool update_sd_pick_busiest(struc
if (sgs->group_type < busiest->group_type)
return false;

- if (sgs->avg_load <= busiest->avg_load)
+ if (sgs->avg_load <= busiest->avg_load) {
+ if (sg_has_two && !busiest_has_two)
+ dbg_odd(env, sg, sgs, sds->busiest, busiest);
return false;
+ }
+
+ if (!sg_has_two && busiest_has_two)
+ dbg_odd(env, sds->busiest, busiest, sg, sgs);

if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
goto asym_packing;

2017-04-25 21:10:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities

On Mon, Apr 24, 2017 at 02:35:28PM -0700, Tejun Heo wrote:
> Now that a cfs_rq sched_entity's load_avg always gets propagated from
> the associated cfs_rq, there's no point in calling __update_load_avg()
> on it. The two mechanisms compete with each other and we'd be always
> using a value close to the propagated one anyway.
>
> Skip __update_load_avg() for cfs_rq sched_entities. Also, relocate
> propagate_entity_load_avg() to signify that propagation is the
> counterpart to __update_load_avg() for cfs_rq sched_entities. This
> puts the propagation before update_cfs_rq_load_avg() which shouldn't
> disturb anything.

Please ignore this patch. As we don't propagate on decays, we still
need __update_load_avg() on runnable_load_avg so that it can decay on
its own.

Thanks.

--
tejun