by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 1/8] sched: introduce task_se_h_load helper

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Sometimes the hierarchical load of a sched_entity needs to be calculated.
> Split out task_h_load into a task_se_h_load that takes a sched_entity pointer
> as its argument, and a task_h_load wrapper that calls task_se_h_load.
>
> No functional changes.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> kernel/sched/fair.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f5e528..df624f7a68e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -706,6 +706,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> #ifdef CONFIG_SMP
>
> static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
> +static unsigned long task_se_h_load(struct sched_entity *se);
> static unsigned long task_h_load(struct task_struct *p);
> static unsigned long capacity_of(int cpu);
>
> @@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
> }
> }
>
> -static unsigned long task_h_load(struct task_struct *p)
> +static unsigned long task_se_h_load(struct sched_entity *se)
> {
> - struct cfs_rq *cfs_rq = task_cfs_rq(p);
> + struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> update_cfs_rq_h_load(cfs_rq);
> - return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> + return div64_ul(se->avg.load_avg * cfs_rq->h_load,
> cfs_rq_load_avg(cfs_rq) + 1);
> }

I wonder if this is necessary. I placed a BUG_ON(!entity_is_task(se))
into task_se_h_load() after I applied the whole patch-set and ran some
taskgroup related testcases. It didn't hit.

So why not use task_h_load(task_of(se)) instead?

[...]

2019-06-19 13:59:32

On Wed, 2019-06-19 at 17:18 +0200, Dietmar Eggemann wrote:
> On 6/19/19 3:57 PM, Rik van Riel wrote:
>
> > That would work, but task_h_load then dereferences
> > task->se to get the se->avg.load_avg value.
> >
> > Going back to task from the se, only to then get the
> > se from the task seems a little unnecessary :)
> >
> > Can you explain why you think task_h_load(task_of(se))
> > would be better? I think I may be overlooking something.
>
> Ah, OK, I just wanted to avoid having task_se_h_load() and
> task_h_load()
> at the same time. You could replace the remaining calls to
> task_h_load(p) with task_se_h_load(&p->se) in this case.
>
> - task_load = task_h_load(p);
> + task_load = task_se_h_load(&p->se);
>
> Not that important though right now ...

That I can do.

I might as well do that while going through the
rest of the series to merge in the bug fix that
I have for the performance regression, and the
fixes for compilation with other config options.

Thank you for the suggestion.

--
All Rights Reversed.

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part

2019-06-20 16:24:35

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper

On 6/12/19 9:32 PM, Rik van Riel wrote:
> Use an explicit "cfs_rq of parent sched_entity" helper in a few
> strategic places, where cfs_rq_of(se) may no longer point at the
> right runqueue once we flatten the hierarchical cgroup runqueues.
>
> No functional change.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> kernel/sched/fair.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dcc521d251e3..c6ede2ecc935 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -275,6 +275,15 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
> return grp->my_q;
> }
>
> +/* runqueue owned by the parent entity */
> +static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
> +{
> + if (se->parent)
> + return group_cfs_rq(se->parent);
> +
> + return &cfs_rq_of(se)->rq->cfs;
> +}
> +
> static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> {
> struct rq *rq = rq_of(cfs_rq);
> @@ -3298,7 +3307,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
>
> gcfs_rq->propagate = 0;
>
> - cfs_rq = cfs_rq_of(se);
> + cfs_rq = group_cfs_rq_of_parent(se);
>
> add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
>
> @@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>
> WRITE_ONCE(cfs_rq->h_load_next, NULL);
> for_each_sched_entity(se) {
> - cfs_rq = cfs_rq_of(se);
> + cfs_rq = group_cfs_rq_of_parent(se);

Why do you change this here? task_se_h_load() calls
update_cfs_rq_h_load() with cfs_rq = group_cfs_rq_of_parent(se) because
the task might not be on the cfs_rq yet.

But inside update_cfs_rq_h_load() the first se is derived from
cfs_rq->tg->se[cpu_of(rq)] so in the first for_each_sched_entity() loop
we should still start with group_cfs_rq() (se->my_q) ?

The system doesn't barf with these two WARN_ON's in.

@@ -7663,12 +7673,17 @@ static void update_cfs_rq_h_load(struct cfs_rq
*cfs_rq)
unsigned long now = jiffies;
unsigned long load;

+ WARN_ON(se && (se != group_cfs_rq(se)->tg->se[cpu_of(rq)]));
+
if (cfs_rq->last_h_load_update == now)
return;

WRITE_ONCE(cfs_rq->h_load_next, NULL);
for_each_sched_entity(se) {
cfs_rq = group_cfs_rq_of_parent(se);
+
+ WARN_ON(se != group_cfs_rq(se)->tg->se[cpu_of(rq)]);
+
WRITE_ONCE(cfs_rq->h_load_next, se);
if (cfs_rq->last_h_load_update == now)
break;

[...]

2019-06-20 16:30:07

On Tue, 2019-06-25 at 11:50 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
>
> [...]
>
> > @@ -410,6 +412,11 @@ static inline struct sched_entity
> > *parent_entity(struct sched_entity *se)
> > return se->parent;
> > }
> >
> > +static inline bool task_se_in_cgroup(struct sched_entity *se)
> > +{
> > + return parent_entity(se);
> > +}
>
> IMHO, s/in_cgroup/not_in_root_tg/ reads easier. "/", i.e. the root tg
> is
> still a cgroup, I guess. But you could use existing parent_entity(se)
> as
> well.

I agree my name is not the prettiest, but I am not
entirely convinced your idea is an improvement.

I'll hold out for better ideas by other reviewers :)

> > @@ -679,22 +710,16 @@ static inline u64 calc_delta_fair(u64 delta,
> > struct sched_entity *se)
> > static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity
> > *se)
> > {
> > u64 slice = sysctl_sched_latency;
> > + struct load_weight *load = &cfs_rq->load;
> > + struct load_weight lw;
> >
> > - for_each_sched_entity(se) {
> > - struct load_weight *load;
> > - struct load_weight lw;
> > + if (unlikely(!se->on_rq)) {
> > + lw = cfs_rq->load;
> >
> > - cfs_rq = cfs_rq_of(se);
> > - load = &cfs_rq->load;
> > -
> > - if (unlikely(!se->on_rq)) {
> > - lw = cfs_rq->load;
> > -
> > - update_load_add(&lw, se->load.weight);
> > - load = &lw;
> > - }
> > - slice = __calc_delta(slice, se->load.weight, load);
> > + update_load_add(&lw, task_se_h_load(se));
> > + load = &lw;
> > }
> > + slice = __calc_delta(slice, task_se_h_load(se), load);
>
> task_se_h_load(se) and se->load.weight are off my factor of >= 1024
> on
> 64bit.

Oh indeed they are!

I wonder if this is the root cause of that
performance regression I have been hunting for
the past few weeks :)

Let me go test some things...

> ...
> bash pid=3250: task_se_h_load(se)=1023 se->load.weight=1048576
> sysctl_sched_latency=18000000 slice=0 old_slice=17999995
> ...
>
> [...]
>
--
All Rights Reversed.

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part

2019-06-26 14:36:25

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load

On 6/12/19 9:32 PM, Rik van Riel wrote:
> The runnable_load magic is used to quickly propagate information about
> runnable tasks up the hierarchy of runqueues. lhen switching to a flat

Looks like here is some information missing.

> runqueue, that no longer works.
>
> Redefine the CPU cfs_rq runnable_load_avg to be the sum of task_h_loads
> of the runnable tasks. This provides enough information to the load
> balancer.
>
> The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> used for anything, so don't bother calculating those.
>
> This removes one of the things that the code currently traverses the
> cgroup hierarchy for, and getting rid of it brings us one step closer
> to a flat runqueue for the CPU controller.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> include/linux/sched.h | 3 +-
> kernel/sched/core.c | 2 -
> kernel/sched/debug.c | 1 +
> kernel/sched/fair.c | 125 +++++++++++++-----------------------------
> kernel/sched/pelt.c | 49 ++++++-----------
> kernel/sched/sched.h | 6 --
> 6 files changed, 55 insertions(+), 131 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 11837410690f..f5bb6948e40c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -391,7 +391,6 @@ struct util_est {
> struct sched_avg {
> u64 last_update_time;
> u64 load_sum;
> - u64 runnable_load_sum;
> u32 util_sum;
> u32 period_contrib;
> unsigned long load_avg;

Could you not also remove runnable_load_avg from struct sched_avg and
put it into the struct cfs_rq directly. The signal has nothing to to
with PELT anymore and se don't have to carry it. You only need it for
the root cfs_rq's but it's at least better than having it still for all
se's as well.

[...]

> @@ -2767,20 +2765,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> static inline void
> enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - cfs_rq->runnable_weight += se->runnable_weight;
> + if (entity_is_task(se)) {
> + struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;

There are a couple of comments in fair.c referring to this cfs_rq as the
root cfs_rq, rather the cpu cfs_rq. IMHO, easier to read if we stick to
one name (root_cfs_rq vs. cpu_cfs_rq).

[...]

2019-06-26 15:59:18

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper

2019-06-26 16:17:45

On Fri, 2019-06-28 at 12:26 +0200, Dietmar Eggemann wrote:
> On 6/12/19 9:32 PM, Rik van Riel wrote:
> > Flatten the hierarchical runqueues into just the per CPU rq.cfs
> > runqueue.
> >
> > Iteration of the sched_entity hierarchy is rate limited to once per
> > jiffy
> > per sched_entity, which is a smaller change than it seems, because
> > load
> > average adjustments were already rate limited to once per jiffy
> > before this
> > patch series.
> >
> > This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to
> > park tasks
> > from throttled cgroups onto their cgroup runqueues, and slowly
> > (using the
> > GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once
> > the cgroup
> > gets unthrottled, to prevent thundering herd issues.
> >
> > Signed-off-by: Rik van Riel <[email protected]>
> > ---
> > include/linux/sched.h | 2 +
> > kernel/sched/fair.c | 478 +++++++++++++++++---------------------
> > ----
> > kernel/sched/pelt.c | 6 +-
> > kernel/sched/pelt.h | 2 +-
> > kernel/sched/sched.h | 2 +-
> > 5 files changed, 194 insertions(+), 296 deletions(-)
> >
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>
> [...]
>
> > @@ -3491,7 +3544,7 @@ static inline bool update_load_avg(struct
> > cfs_rq *cfs_rq, struct sched_entity *s
> > * track group sched_entity load average for task_h_load calc
> > in migration
> > */
> > if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > - updated = __update_load_avg_se(now, cfs_rq, se);
> > + updated = __update_load_avg_se(now, cfs_rq, se, curr,
> > curr);
>
> I wonder if task migration is still working correctly.
>
> migrate_task_rq_fair(p, ...) -> remove_entity_load_avg(&p->se) would
> use
> cfs_rq = se->cfs_rq (i.e. root cfs_rq). So load (and util) will not
> propagate through the taskgroup hierarchy.
>
> [...]

Good point. This should be the group cfs_rq, and
then on the next tick the load change will be
propagated up.

Let me add that change in for v2 as well.

--
All Rights Reversed.

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part