2020-07-02 14:44:35

by Vincent Guittot

Subject: [PATCH] sched/fair: handle case of task_h_load() returning 0

task_h_load() can return 0 in some situations, such as when running
stress-ng mmapfork, which forks thousands of threads, in a sched group on
a 224-core system. Load balancing doesn't handle this correctly because
env->imbalance never decreases, so it stops pulling tasks only after
reaching loop_max, which can be equal to the number of running tasks of
the cfs_rq. Make sure that the imbalance is decreased by at least 1.

Misfit task handling is the other feature that doesn't handle such a
situation correctly, although the problem is probably harder to hit there
because of the smaller number of CPUs and running tasks on heterogeneous
systems.

We can't simply ensure that task_h_load() returns at least one, because
that would imply handling the underrun in other places.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fab1d17c575..62747c24aa9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
 		return;
 	}
 
-	rq->misfit_task_load = task_h_load(p);
+	/*
+	 * Make sure that misfit_task_load will not be null even if
+	 * task_h_load() returns 0. misfit_task_load is only used to select
+	 * the rq with the highest load, so adding 1 will not modify the
+	 * result of the comparison.
+	 */
+	rq->misfit_task_load = task_h_load(p) + 1;
 }
 
 #else /* CONFIG_SMP */
@@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
 			    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
 				goto next;
 
+			/*
+			 * Depending on the number of CPUs and tasks and the
+			 * cgroup hierarchy, task_h_load() can return a null
+			 * value. Make sure that env->imbalance decreases,
+			 * otherwise detach_tasks() will stop only after
+			 * detaching up to loop_max tasks.
+			 */
+			if (!load)
+				load = 1;
+
 			env->imbalance -= load;
 			break;

--
2.17.1


2020-07-02 16:12:09

by Valentin Schneider

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0


On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.

Nasty one, that...

Random thought: isn't that the kind of thing we have scale_load() and
scale_load_down() for? There's more uses of task_h_load() than I would like
for this, but if we upscale its output (or introduce an upscaled variant),
we could do something like:

---
detach_tasks()
{
	long imbalance = env->imbalance;

	if (env->migration_type == migrate_load)
		imbalance = scale_load(imbalance);

	while (!list_empty(tasks)) {
		/* ... */
		switch (env->migration_type) {
		case migrate_load:
			load = task_h_load_upscaled(p);
			/* ... usual bits here ... */
			lsub_positive(&env->imbalance, load);
			break;
		/* ... */
		}

		if (!scale_load_down(env->imbalance))
			break;
	}
}
---

It's not perfect, and there's still the misfit situation to sort out -
still, do you think this is something we could go towards?

>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> return;
> }
>
> - rq->misfit_task_load = task_h_load(p);
> + /*
> + * Make sure that misfit_task_load will not be null even if
> + * task_h_load() returns 0. misfit_task_load is only used to select
> + * rq with highest load so adding 1 will not modify the result
> + * of the comparison.
> + */
> + rq->misfit_task_load = task_h_load(p) + 1;
> }
>
> #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> goto next;
>
> + /*
> + * Depending of the number of CPUs and tasks and the
> + * cgroup hierarchy, task_h_load() can return a null
> + * value. Make sure that env->imbalance decreases
> + * otherwise detach_tasks() will stop only after
> + * detaching up to loop_max tasks.
> + */
> + if (!load)
> + load = 1;
> +
> env->imbalance -= load;
> break;

2020-07-02 16:30:26

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
<[email protected]> wrote:
>
>
> On 02/07/20 15:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
>
> Nasty one, that...
>
> Random thought: isn't that the kind of thing we have scale_load() and
> scale_load_down() for? There's more uses of task_h_load() than I would like
> for this, but if we upscale its output (or introduce an upscaled variant),
> we could do something like:
>
> ---
> detach_tasks()
> {
> long imbalance = env->imbalance;
>
> if (env->migration_type == migrate_load)
> imbalance = scale_load(imbalance);
>
> while (!list_empty(tasks)) {
> /* ... */
> switch (env->migration_type) {
> case migrate_load:
> load = task_h_load_upscaled(p);
> /* ... usual bits here ...*/
> lsub_positive(&env->imbalance, load);
> break;
> /* ... */
> }
>
> if (!scale_load_down(env->imbalance))
> break;
> }
> }
> ---
>
> It's not perfect, and there's still the misfit situation to sort out -
> still, do you think this is something we could go towards?

This will not work on 32-bit systems.

For 64-bit, I have to think a bit more about whether the upscaling would
fix all cases and support propagation across a hierarchy. In that case we
could also consider making scale_load()/scale_load_down() a nop all the
time.
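
For reference, a simplified from-memory sketch of those helpers as defined
in kernel/sched/sched.h (details approximate); the key point is that on
32-bit they are no-ops, so upscaling buys no extra resolution there:

#ifdef CONFIG_64BIT
# define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
# define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)
#else
# define scale_load(w)		(w)
# define scale_load_down(w)	(w)
#endif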

>
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > return;
> > }
> >
> > - rq->misfit_task_load = task_h_load(p);
> > + /*
> > + * Make sure that misfit_task_load will not be null even if
> > + * task_h_load() returns 0. misfit_task_load is only used to select
> > + * rq with highest load so adding 1 will not modify the result
> > + * of the comparison.
> > + */
> > + rq->misfit_task_load = task_h_load(p) + 1;
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > goto next;
> >
> > + /*
> > + * Depending of the number of CPUs and tasks and the
> > + * cgroup hierarchy, task_h_load() can return a null
> > + * value. Make sure that env->imbalance decreases
> > + * otherwise detach_tasks() will stop only after
> > + * detaching up to loop_max tasks.
> > + */
> > + if (!load)
> > + load = 1;
> > +
> > env->imbalance -= load;
> > break;

2020-07-07 13:31:29

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On Thu, 2 Jul 2020 at 18:28, Vincent Guittot <[email protected]> wrote:
>
> On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
> <[email protected]> wrote:
> >
> >
> > On 02/07/20 15:42, Vincent Guittot wrote:
> > > task_h_load() can return 0 in some situations like running stress-ng
> > > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > > system. The load balance doesn't handle this correctly because
> > > env->imbalance never decreases and it will stop pulling tasks only after
> > > reaching loop_max, which can be equal to the number of running tasks of
> > > the cfs. Make sure that imbalance will be decreased by at least 1.
> > >
> > > misfit task is the other feature that doesn't handle correctly such
> > > situation although it's probably more difficult to face the problem
> > > because of the smaller number of CPUs and running tasks on heterogenous
> > > system.
> > >
> > > We can't simply ensure that task_h_load() returns at least one because it
> > > would imply to handle underrun in other places.
> >
> > Nasty one, that...
> >
> > Random thought: isn't that the kind of thing we have scale_load() and
> > scale_load_down() for? There's more uses of task_h_load() than I would like
> > for this, but if we upscale its output (or introduce an upscaled variant),
> > we could do something like:
> >
> > ---
> > detach_tasks()
> > {
> > long imbalance = env->imbalance;
> >
> > if (env->migration_type == migrate_load)
> > imbalance = scale_load(imbalance);
> >
> > while (!list_empty(tasks)) {
> > /* ... */
> > switch (env->migration_type) {
> > case migrate_load:
> > load = task_h_load_upscaled(p);
> > /* ... usual bits here ...*/
> > lsub_positive(&env->imbalance, load);
> > break;
> > /* ... */
> > }
> >
> > if (!scale_load_down(env->imbalance))
> > break;
> > }
> > }
> > ---
> >
> > It's not perfect, and there's still the misfit situation to sort out -
> > still, do you think this is something we could go towards?
>
> This will not work for 32bits system.
>
> For 64bits, I have to think a bit more if the upscale would fix all
> cases and support propagation across a hierarchy. And in this case we
> could also consider to make scale_load/scale_load_down a nop all the
> time

In addition to the problem remaining on 32-bit, the issue can still
happen even after extending the scale, so the current patch still makes
sense.

Then, if we want to reduce the cases where task_h_load() returns 0, we
had better make scale_load_down() a nop; otherwise we will have to
maintain 2 values, h_load and scale_h_load, across the hierarchy.

>
> >
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 18 +++++++++++++++++-
> > > 1 file changed, 17 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6fab1d17c575..62747c24aa9e 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > return;
> > > }
> > >
> > > - rq->misfit_task_load = task_h_load(p);
> > > + /*
> > > + * Make sure that misfit_task_load will not be null even if
> > > + * task_h_load() returns 0. misfit_task_load is only used to select
> > > + * rq with highest load so adding 1 will not modify the result
> > > + * of the comparison.
> > > + */
> > > + rq->misfit_task_load = task_h_load(p) + 1;
> > > }
> > >
> > > #else /* CONFIG_SMP */
> > > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > > env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > > goto next;
> > >
> > > + /*
> > > + * Depending of the number of CPUs and tasks and the
> > > + * cgroup hierarchy, task_h_load() can return a null
> > > + * value. Make sure that env->imbalance decreases
> > > + * otherwise detach_tasks() will stop only after
> > > + * detaching up to loop_max tasks.
> > > + */
> > > + if (!load)
> > > + load = 1;
> > > +
> > > env->imbalance -= load;
> > > break;

2020-07-08 09:46:36

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On 02/07/2020 16:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because

I guess the issue here is that 'cfs_rq->h_load' in

task_h_load() {
	struct cfs_rq *cfs_rq = task_cfs_rq(p);
	...
	return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
			cfs_rq_load_avg(cfs_rq) + 1);
}

is still ~0 (or at least pretty small) compared to se.avg.load_avg being
1024 and cfs_rq_load_avg(cfs_rq) being n*1024 in these lb occurrences.
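
A minimal user-space sketch of the same truncation, with illustrative
values roughly in line with the 40-CPU traces later in this thread:

#include <stdio.h>

int main(void)
{
	unsigned long load_avg = 1024;		/* p->se.avg.load_avg of a busy task */
	unsigned long h_load   = 25;		/* cfs_rq->h_load */
	unsigned long cfs_load = 40 * 1024;	/* cfs_rq_load_avg() */

	/* same integer division as task_h_load() */
	printf("%lu\n", load_avg * h_load / (cfs_load + 1));	/* prints 0 */
	return 0;
}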

> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> return;
> }
>
> - rq->misfit_task_load = task_h_load(p);
> + /*
> + * Make sure that misfit_task_load will not be null even if
> + * task_h_load() returns 0. misfit_task_load is only used to select
> + * rq with highest load so adding 1 will not modify the result
> + * of the comparison.
> + */
> + rq->misfit_task_load = task_h_load(p) + 1;
> }
>
> #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> goto next;
>
> + /*
> + * Depending of the number of CPUs and tasks and the
> + * cgroup hierarchy, task_h_load() can return a null
> + * value. Make sure that env->imbalance decreases
> + * otherwise detach_tasks() will stop only after
> + * detaching up to loop_max tasks.
> + */
> + if (!load)
> + load = 1;
> +
> env->imbalance -= load;
> break;

I assume that this is related to the LKP mail

https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I ran the test (5.8.0-rc4 w/o vs. w/ your patch) on 'Intel(R) Xeon(R)
CPU E5-2690 v2 @ 3.00GHz' (2*2*10, 40 CPUs).
I can't see changes of the magnitude shown in the email above (they
used a 96-CPU system though).
I used only the scheduler stressor mmapfork in taskgroup /A/B/C:

stress-ng --timeout 1 --times --verify --metrics-brief --sequential 40 --class scheduler --exclude (all except mmapfork)

5.8.0-rc4-custom-dieegg01-stress-ng-base

stress-ng: info: [3720] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [3720] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [3720] mmapfork 40 1.98 12.53 71.12 20.21 0.48
stress-ng: info: [5201] mmapfork 40 2.50 13.10 98.61 16.01 0.36
stress-ng: info: [6682] mmapfork 40 2.58 14.80 98.63 15.88 0.36
stress-ng: info: [8195] mmapfork 40 1.79 12.57 61.61 22.31 0.54
stress-ng: info: [9679] mmapfork 40 2.20 12.17 82.66 18.20 0.42
stress-ng: info: [11164] mmapfork 40 2.61 15.09 102.86 16.86 0.37
stress-ng: info: [12773] mmapfork 40 1.89 12.32 65.09 21.15 0.52
stress-ng: info: [3883] mmapfork 40 2.14 12.90 76.73 18.68 0.45
stress-ng: info: [6845] mmapfork 40 2.25 11.83 84.06 17.80 0.42
stress-ng: info: [8326] mmapfork 40 1.76 12.93 56.65 22.70 0.57

Mean: 18.98 (σ: 2.369)
5.8.0-rc4-custom-dieegg01-stress-ng

stress-ng: info: [3895] mmapfork 40 2.40 13.56 92.83 16.67 0.38
stress-ng: info: [5379] mmapfork 40 2.08 13.65 74.11 19.23 0.46
stress-ng: info: [6860] mmapfork 40 2.15 13.72 80.24 18.62 0.43
stress-ng: info: [8341] mmapfork 40 2.37 13.74 90.93 16.85 0.38
stress-ng: info: [9822] mmapfork 40 2.10 12.48 83.85 19.09 0.42
stress-ng: info: [13816] mmapfork 40 2.05 12.13 77.64 19.49 0.45
stress-ng: info: [15297] mmapfork 40 2.53 13.16 100.26 15.84 0.35
stress-ng: info: [16780] mmapfork 40 2.00 12.10 71.25 20.02 0.48
stress-ng: info: [18262] mmapfork 40 1.73 12.24 57.69 23.09 0.57
stress-ng: info: [19743] mmapfork 40 1.78 12.51 57.89 22.48 0.57

Mean: 19.14 (σ: 2.239)





2020-07-08 09:50:40

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <[email protected]> wrote:
>
> On 02/07/2020 16:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
>
> I guess the issue here is that 'cfs_rq->h_load' in
>
> task_h_load() {
> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> ...
> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> cfs_rq_load_avg(cfs_rq) + 1);
> }
>
> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > return;
> > }
> >
> > - rq->misfit_task_load = task_h_load(p);
> > + /*
> > + * Make sure that misfit_task_load will not be null even if
> > + * task_h_load() returns 0. misfit_task_load is only used to select
> > + * rq with highest load so adding 1 will not modify the result
> > + * of the comparison.
> > + */
> > + rq->misfit_task_load = task_h_load(p) + 1;
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > goto next;
> >
> > + /*
> > + * Depending of the number of CPUs and tasks and the
> > + * cgroup hierarchy, task_h_load() can return a null
> > + * value. Make sure that env->imbalance decreases
> > + * otherwise detach_tasks() will stop only after
> > + * detaching up to loop_max tasks.
> > + */
> > + if (!load)
> > + load = 1;
> > +
> > env->imbalance -= load;
> > break;
>
> I assume that this is related to the LKP mail

I found this problem while studying the regression raised in the email
below, but this patch doesn't fix that regression. At least, it's not
enough.

>
> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?
>
> I ran the test (5.8.0-rc4 w/o vs. w/ your patch) on 'Intel(R) Xeon(R)
> CPU E5-2690 v2 @ 3.00GHz' (2*2*10, 40 CPUs).
> I can't see the changes in the magnitude shown in the email above (they
> used a 96 CPU system though).
> I used only scheduler stressor mmapfork in taskgroup /A/B/C:
>
> stress-ng --timeout 1 --times --verify --metrics-brief --sequential 40 --class scheduler --exclude (all except mmapfork)
>
> 5.8.0-rc4-custom-dieegg01-stress-ng-base
>
> stress-ng: info: [3720] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
> stress-ng: info: [3720] (secs) (secs) (secs) (real time) (usr+sys time)
> stress-ng: info: [3720] mmapfork 40 1.98 12.53 71.12 20.21 0.48
> stress-ng: info: [5201] mmapfork 40 2.50 13.10 98.61 16.01 0.36
> stress-ng: info: [6682] mmapfork 40 2.58 14.80 98.63 15.88 0.36
> stress-ng: info: [8195] mmapfork 40 1.79 12.57 61.61 22.31 0.54
> stress-ng: info: [9679] mmapfork 40 2.20 12.17 82.66 18.20 0.42
> stress-ng: info: [11164] mmapfork 40 2.61 15.09 102.86 16.86 0.37
> stress-ng: info: [12773] mmapfork 40 1.89 12.32 65.09 21.15 0.52
> stress-ng: info: [3883] mmapfork 40 2.14 12.90 76.73 18.68 0.45
> stress-ng: info: [6845] mmapfork 40 2.25 11.83 84.06 17.80 0.42
> stress-ng: info: [8326] mmapfork 40 1.76 12.93 56.65 22.70 0.57
>
> Mean: 18.98 (σ: 2.369)
> 5.8.0-rc4-custom-dieegg01-stress-ng
>
> stress-ng: info: [3895] mmapfork 40 2.40 13.56 92.83 16.67 0.38
> stress-ng: info: [5379] mmapfork 40 2.08 13.65 74.11 19.23 0.46
> stress-ng: info: [6860] mmapfork 40 2.15 13.72 80.24 18.62 0.43
> stress-ng: info: [8341] mmapfork 40 2.37 13.74 90.93 16.85 0.38
> stress-ng: info: [9822] mmapfork 40 2.10 12.48 83.85 19.09 0.42
> stress-ng: info: [13816] mmapfork 40 2.05 12.13 77.64 19.49 0.45
> stress-ng: info: [15297] mmapfork 40 2.53 13.16 100.26 15.84 0.35
> stress-ng: info: [16780] mmapfork 40 2.00 12.10 71.25 20.02 0.48
> stress-ng: info: [18262] mmapfork 40 1.73 12.24 57.69 23.09 0.57
> stress-ng: info: [19743] mmapfork 40 1.78 12.51 57.89 22.48 0.57
>
> Mean: 19.14 (σ: 2.239)

2020-07-08 10:37:36

by Valentin Schneider

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0


On 07/07/20 14:30, Vincent Guittot wrote:
> On Thu, 2 Jul 2020 at 18:28, Vincent Guittot <[email protected]> wrote:
>>
>> On Thu, 2 Jul 2020 at 18:11, Valentin Schneider
>> <[email protected]> wrote:
>> >
>> >
>> > On 02/07/20 15:42, Vincent Guittot wrote:
>> > > task_h_load() can return 0 in some situations like running stress-ng
>> > > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>> > > system. The load balance doesn't handle this correctly because
>> > > env->imbalance never decreases and it will stop pulling tasks only after
>> > > reaching loop_max, which can be equal to the number of running tasks of
>> > > the cfs. Make sure that imbalance will be decreased by at least 1.
>> > >
>> > > misfit task is the other feature that doesn't handle correctly such
>> > > situation although it's probably more difficult to face the problem
>> > > because of the smaller number of CPUs and running tasks on heterogenous
>> > > system.
>> > >
>> > > We can't simply ensure that task_h_load() returns at least one because it
>> > > would imply to handle underrun in other places.
>> >
>> > Nasty one, that...
>> >
>> > Random thought: isn't that the kind of thing we have scale_load() and
>> > scale_load_down() for? There's more uses of task_h_load() than I would like
>> > for this, but if we upscale its output (or introduce an upscaled variant),
>> > we could do something like:
>> >
>> > ---
>> > detach_tasks()
>> > {
>> > long imbalance = env->imbalance;
>> >
>> > if (env->migration_type == migrate_load)
>> > imbalance = scale_load(imbalance);
>> >
>> > while (!list_empty(tasks)) {
>> > /* ... */
>> > switch (env->migration_type) {
>> > case migrate_load:
>> > load = task_h_load_upscaled(p);
>> > /* ... usual bits here ...*/
>> > lsub_positive(&env->imbalance, load);
>> > break;
>> > /* ... */
>> > }
>> >
>> > if (!scale_load_down(env->imbalance))
>> > break;
>> > }
>> > }
>> > ---
>> >
>> > It's not perfect, and there's still the misfit situation to sort out -
>> > still, do you think this is something we could go towards?
>>
>> This will not work for 32bits system.
>>
>> For 64bits, I have to think a bit more if the upscale would fix all
>> cases and support propagation across a hierarchy. And in this case we
>> could also consider to make scale_load/scale_load_down a nop all the
>> time
>
> In addition that problem remains on 32bits, the problem can still
> happen after extending the scale so this current patch still makes
> sense.
>

Right, I think we'd want to have that at the very least for 32bit anyway. I
haven't done the math, but doesn't it require an obscene amount of tasks
for that to still happen on 64bit with the increased resolution?

> Then if we want to reduce the cases where task_h_load returns 0, we
> should better make scale_load_down a nop otherwise we will have to
> maintain 2 values h_load and scale_h_load across the hierarchy
>

I don't fully grasp yet how much surgery that would require, but it does
sound like something we've been meaning to do, see e.g. se_weight:

* XXX we want to get rid of these helpers and use the full load resolution.
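
For context, the helper carrying that comment looks roughly like this in
kernel/sched/sched.h (a from-memory sketch, not an exact quote):

static inline long se_weight(struct sched_entity *se)
{
	return scale_load_down(se->load.weight);
}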

2020-07-09 13:07:58

by Valentin Schneider

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0


On 02/07/20 15:42, Vincent Guittot wrote:
> task_h_load() can return 0 in some situations like running stress-ng
> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> system. The load balance doesn't handle this correctly because
> env->imbalance never decreases and it will stop pulling tasks only after
> reaching loop_max, which can be equal to the number of running tasks of
> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> misfit task is the other feature that doesn't handle correctly such
> situation although it's probably more difficult to face the problem
> because of the smaller number of CPUs and running tasks on heterogenous
> system.
>
> We can't simply ensure that task_h_load() returns at least one because it
> would imply to handle underrun in other places.
>
> Signed-off-by: Vincent Guittot <[email protected]>

I dug some more into this; if I got my math right, this can be reproduced
with a single task group below the root. Forked tasks get max load, so this
can be tried out with either tons of forks or tons of CPU hogs.

We need

p->se.avg.load_avg * cfs_rq->h_load
----------------------------------- < 1
cfs_rq_load_avg(cfs_rq) + 1

Assuming homogeneous system with tasks spread out all over (no other tasks
interfering), that should boil down to

1024 * (tg.shares / nr_cpus)
--------------------------- < 1
1024 * (nr_tasks_on_cpu)

IOW

tg.shares / nr_cpus < nr_tasks_on_cpu

If we get tasks nicely spread out, a simple condition to hit this should be
to have more tasks than shares.
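
Plugging in the numbers from the repro below, and assuming the group
entity's load has ramped up to roughly its weight: shares = 128,
everything pinned to one CPU, ~130 hogs, so cfs_rq->h_load ~= 128 and
cfs_rq_load_avg() ~= 130 * 1024, giving

	task_h_load ~= 1024 * 128 / (130 * 1024 + 1)
	            =  131072 / 133121
	            =  0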

I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
make things simpler; big.LITTLE doesn't yield equal weights between CPUs):

cgcreate -g cpu:tg0

echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares

for ((i=0; i<130; i++)); do
	# busy loop of your choice
	taskset -c 0 ./loop.sh &
	echo $! > /sys/fs/cgroup/cpu/tg0/tasks
done

> ---
> kernel/sched/fair.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6fab1d17c575..62747c24aa9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> return;
> }
>
> - rq->misfit_task_load = task_h_load(p);
> + /*
> + * Make sure that misfit_task_load will not be null even if
> + * task_h_load() returns 0. misfit_task_load is only used to select
> + * rq with highest load so adding 1 will not modify the result
> + * of the comparison.
> + */
> + rq->misfit_task_load = task_h_load(p) + 1;

For here and below, wouldn't it be a tad cleaner to just do

foo = max(task_h_load(p), 1);

Otherwise, I think I've properly convinced myself we do want to have
that in one form or another. So either way:

Reviewed-by: Valentin Schneider <[email protected]>

> }
>
> #else /* CONFIG_SMP */
> @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> goto next;
>
> + /*
> + * Depending of the number of CPUs and tasks and the
> + * cgroup hierarchy, task_h_load() can return a null
> + * value. Make sure that env->imbalance decreases
> + * otherwise detach_tasks() will stop only after
> + * detaching up to loop_max tasks.
> + */
> + if (!load)
> + load = 1;
> +
> env->imbalance -= load;
> break;

2020-07-09 13:35:51

by Dietmar Eggemann

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On 08/07/2020 11:47, Vincent Guittot wrote:
> On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 02/07/2020 16:42, Vincent Guittot wrote:
>>> task_h_load() can return 0 in some situations like running stress-ng
>>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>>> system. The load balance doesn't handle this correctly because
>>
>> I guess the issue here is that 'cfs_rq->h_load' in
>>
>> task_h_load() {
>> struct cfs_rq *cfs_rq = task_cfs_rq(p);
>> ...
>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>> cfs_rq_load_avg(cfs_rq) + 1);
>> }
>>
>> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
>> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>>
>>> env->imbalance never decreases and it will stop pulling tasks only after
>>> reaching loop_max, which can be equal to the number of running tasks of
>>> the cfs. Make sure that imbalance will be decreased by at least 1.

Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

[...]

>> I assume that this is related to the LKP mail
>
> I have found this problem while studying the regression raised in the
> email below but it doesn't fix it. At least, it's not enough
>
>>
>> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I see. It also happens with other workloads but it's most visible
at the beginning of a workload (fork).

Still on E5-2690 v2 (2*2*10, 40 CPUs):

In the taskgroup, cfs_rq->h_load is ~1024/40 = 25, so this leads to
task_h_load = 0 when cfs_rq->avg.load_avg is 40 times higher than the
individual task load (1024).
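
E.g. for the first trace line below: 1023 * 23 / (43006 + 1) = 23529 /
43007 = 0 with integer division.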

One incarnation of 20 loops w/o any progress (that's w/o your patch).

With loop='loop/loop_break/loop_max'
and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'

Jul 9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
Jul 9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
Jul 9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
...
Jul 9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
Jul 9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
Jul 9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
...

Reviewed-by: Dietmar Eggemann <[email protected]>
Tested-by: Dietmar Eggemann <[email protected]>







2020-07-09 13:52:11

by Vincent Guittot

Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On Thu, 9 Jul 2020 at 15:06, Valentin Schneider
<[email protected]> wrote:
>
>
> On 02/07/20 15:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle correctly such
> > situation although it's probably more difficult to face the problem
> > because of the smaller number of CPUs and running tasks on heterogenous
> > system.
> >
> > We can't simply ensure that task_h_load() returns at least one because it
> > would imply to handle underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> I dug some more into this; if I got my math right, this can be reproduced
> with a single task group below the root. Forked tasks get max load, so this
> can be tried out with either tons of forks or tons of CPU hogs.
>
> We need
>
> p->se.avg.load_avg * cfs_rq->h_load
> ----------------------------------- < 1
> cfs_rq_load_avg(cfs_rq) + 1
>
> Assuming homogeneous system with tasks spread out all over (no other tasks
> interfering), that should boil down to
>
> 1024 * (tg.shares / nr_cpus)
> --------------------------- < 1
> 1024 * (nr_tasks_on_cpu)
>
> IOW
>
> tg.shares / nr_cpus < nr_tasks_on_cpu
>
> If we get tasks nicely spread out, a simple condition to hit this should be
> to have more tasks than shares.
>
> I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
> make things simpler; big.LITTLE doesn't yield equal weights between CPUs):
>
> cgcreate -g cpu:tg0
>
> echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares
>
> for ((i=0; i<130; i++)); do
> # busy loop of your choice
> taskset -c 0 ./loop.sh &
> echo $! > /sys/fs/cgroup/cpu/tg0/tasks
> done
>
> > ---
> > kernel/sched/fair.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > return;
> > }
> >
> > - rq->misfit_task_load = task_h_load(p);
> > + /*
> > + * Make sure that misfit_task_load will not be null even if
> > + * task_h_load() returns 0. misfit_task_load is only used to select
> > + * rq with highest load so adding 1 will not modify the result
> > + * of the comparison.
> > + */
> > + rq->misfit_task_load = task_h_load(p) + 1;
>
> For here and below; wouldn't it be a tad cleaner to just do
>
> foo = max(task_h_load(p), 1);

+1

For the one below, my goal was mainly to avoid impacting the result of
the earlier tests by applying the +1 only at the end, but doing it
before will not change the results.

I'm going to update it
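
A sketch of what the updated version could look like, assuming max_t() is
used at both sites (hypothetical until the v2 is posted):

-	rq->misfit_task_load = task_h_load(p);
+	rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);

and in detach_tasks():

-			load = task_h_load(p);
+			load = max_t(unsigned long, task_h_load(p), 1);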

>
> Otherwise, I think I've properly convinced myself we do want to have
> that in one form or another. So either way:
>
> Reviewed-by: Valentin Schneider <[email protected]>

Thanks

>
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > goto next;
> >
> > + /*
> > + * Depending of the number of CPUs and tasks and the
> > + * cgroup hierarchy, task_h_load() can return a null
> > + * value. Make sure that env->imbalance decreases
> > + * otherwise detach_tasks() will stop only after
> > + * detaching up to loop_max tasks.
> > + */
> > + if (!load)
> > + load = 1;
> > +
> > env->imbalance -= load;
> > break;

2020-07-09 13:53:48

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On Thu, 9 Jul 2020 at 15:34, Dietmar Eggemann <[email protected]> wrote:
>
> On 08/07/2020 11:47, Vincent Guittot wrote:
> > On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 02/07/2020 16:42, Vincent Guittot wrote:
> >>> task_h_load() can return 0 in some situations like running stress-ng
> >>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> >>> system. The load balance doesn't handle this correctly because
> >>
> >> I guess the issue here is that 'cfs_rq->h_load' in
> >>
> >> task_h_load() {
> >> struct cfs_rq *cfs_rq = task_cfs_rq(p);
> >> ...
> >> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >> cfs_rq_load_avg(cfs_rq) + 1);
> >> }
> >>
> >> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> >> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
> >>
> >>> env->imbalance never decreases and it will stop pulling tasks only after
> >>> reaching loop_max, which can be equal to the number of running tasks of
> >>> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

yes

>
> env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
>
> [...]
>
> >> I assume that this is related to the LKP mail
> >
> > I have found this problem while studying the regression raised in the
> > email below but it doesn't fix it. At least, it's not enough
> >
> >>
> >> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?
>
> I see. It also happens with other workloads but it's most visible
> at the beginning of a workload (fork).
>
> Still on E5-2690 v2 (2*2*10, 40 CPUs):
>
> In the taskgroup cfs_rq->h_load is ~ 1024/40 = 25 so this leads to
> task_h_load = 0 with cfs_rq->avg.load_avg 40 times higher than the
> individual task load (1024).
>
> One incarnation of 20 loops w/o any progress (that's w/o your patch).
>
> With loop='loop/loop_break/loop_max'
> and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'
>
> Jul 9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
> Jul 9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
> Jul 9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
> ...
> Jul 9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
> Jul 9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
> Jul 9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
> ...
>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Tested-by: Dietmar Eggemann <[email protected]>

Thanks
