2021-11-12 09:59:26

by Vincent Guittot

Subject: [PATCH 0/2] avoid spurious blocked load update

This patchset is a follow-up of:
https://lore.kernel.org/lkml/[email protected]/

It ensures that newly idle load balance will not kick the update of
blocked load if it skips the load balance because avg_idle is too short.
It also makes sure that rq->next_balance doesn't go into the past when
updated.
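
To illustrate the second point, here is a minimal sketch of the idea
(an illustration only, not patch 2/2 itself; where exactly the clamp
goes is an assumption):

	static void
	update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
	{
		unsigned long interval, next;

		/* used by newly idle balance, so cpu_busy = 0 */
		interval = get_sd_balance_interval(sd, 0);
		next = sd->last_balance + interval;

		/*
		 * Sketch of the fix: never let the candidate time fall
		 * behind the current jiffies, so rq->next_balance can't
		 * end up in the past.
		 */
		if (time_before(next, jiffies))
			next = jiffies + 1;

		if (time_after(*next_balance, next))
			*next_balance = next;
	}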

Tim Chen (1):
sched: Fix rq->next_balance time updated to earlier than
current time

Vincent Guittot (1):
sched/fair: skip newidle update stats

kernel/sched/fair.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)

--
2.17.1



2021-11-12 09:59:33

by Vincent Guittot

Subject: [PATCH 1/2] sched/fair: skip newidle update stats

In case we skip the newly idle LB entirely, or we abort it because we are
going to exceed avg_idle, we have to make sure not to start an update of
the blocked load when entering idle.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13950beb01a2..a162b0ec8963 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int this_cpu = this_rq->cpu;
u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
- int pulled_task = 0;
+ int pulled_task = 0, early_stop = 0;

update_misfit_status(NULL, this_rq);

@@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- if (sd)
+ if (sd) {
update_next_balance(sd, &next_balance);
+
+ /*
+ * We skip new idle LB because there is not enough
+ * time before next wake up. Make sure that we will
+ * not kick NOHZ_NEWILB_KICK
+ */
+ early_stop = 1;
+ }
rcu_read_unlock();

goto out;
@@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

update_next_balance(sd, &next_balance);

- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+ early_stop = 1;
break;
+ }

if (sd->flags & SD_BALANCE_NEWIDLE) {

@@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

if (pulled_task)
this_rq->idle_stamp = 0;
- else
+ else if (!early_stop)
nohz_newidle_balance(this_rq);

rq_repin_lock(this_rq, rf);
--
2.17.1


2021-11-12 14:29:50

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats

On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> In case we skip the newly idle LB entirely, or we abort it because we are
> going to exceed avg_idle, we have to make sure not to start an update of
> the blocked load when entering idle.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 13950beb01a2..a162b0ec8963 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> int this_cpu = this_rq->cpu;
> u64 t0, t1, curr_cost = 0;
> struct sched_domain *sd;
> - int pulled_task = 0;
> + int pulled_task = 0, early_stop = 0;
>
> update_misfit_status(NULL, this_rq);
>
> @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> if (!READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - if (sd)
> + if (sd) {
> update_next_balance(sd, &next_balance);
> +
> + /*
> + * We skip new idle LB because there is not enough
> + * time before next wake up. Make sure that we will
> + * not kick NOHZ_NEWILB_KICK
> + */
> + early_stop = 1;
> + }
> rcu_read_unlock();
>
> goto out;
> @@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> update_next_balance(sd, &next_balance);
>
> - if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> + early_stop = 1;
> break;
> + }
>
> if (sd->flags & SD_BALANCE_NEWIDLE) {
>
> @@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> - else
> + else if (!early_stop)
> nohz_newidle_balance(this_rq);
>
> rq_repin_lock(this_rq, rf);

Urgh code flow is a mess... Let me see if I can fix some of that.

Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
with this on?

2021-11-12 14:47:37

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats

On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > In case we skip the newly idle LB entirely, or we abort it because we are
> > going to exceed avg_idle, we have to make sure not to start an update of
> > the blocked load when entering idle.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 18 ++++++++++++++----
> > 1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 13950beb01a2..a162b0ec8963 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > int this_cpu = this_rq->cpu;
> > u64 t0, t1, curr_cost = 0;
> > struct sched_domain *sd;
> > - int pulled_task = 0;
> > + int pulled_task = 0, early_stop = 0;
> >
> > update_misfit_status(NULL, this_rq);
> >
> > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > if (!READ_ONCE(this_rq->rd->overload) ||
> > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> >
> > - if (sd)
> > + if (sd) {
> > update_next_balance(sd, &next_balance);
> > +
> > + /*
> > + * We skip new idle LB because there is not enough
> > + * time before next wake up. Make sure that we will
> > + * not kick NOHZ_NEWILB_KICK
> > + */
> > + early_stop = 1;
> > + }
> > rcu_read_unlock();
> >
> > goto out;
> > @@ -10918,8 +10926,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > update_next_balance(sd, &next_balance);
> >
> > - if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> > + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > + early_stop = 1;
> > break;
> > + }
> >
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> >
> > @@ -10969,7 +10979,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > if (pulled_task)
> > this_rq->idle_stamp = 0;
> > - else
> > + else if (!early_stop)
> > nohz_newidle_balance(this_rq);
> >
> > rq_repin_lock(this_rq, rf);
>
> Urgh code flow is a mess... Let me see if I can fix some of that.

Yeah, I haven't found a better way.

>
> Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> with this on?

This test still covers cases with a short newly idle balance. It errs on
the conservative side: people have never complained that the blocked load
average of idle CPUs is updated too rarely; it's most often the
opposite.
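
For context, the ->avg_idle test in question is the early bail-out at
the top of nohz_newidle_balance(); roughly, as of v5.15-era kernels
(paraphrased from memory, details may differ):

	static void nohz_newidle_balance(struct rq *this_rq)
	{
		int this_cpu = this_rq->cpu;

		/* Don't disturb a CPU excluded from scheduler housekeeping */
		if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
			return;

		/* Will wake up very soon: no time to do anything else */
		if (this_rq->avg_idle < sysctl_sched_migration_cost)
			return;

		/* No blocked load to update, or the update isn't due yet */
		if (!READ_ONCE(nohz.has_blocked) ||
		    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
			return;

		/*
		 * Ask for an ILB to update the blocked load of idle CPUs
		 * before this CPU enters idle.
		 */
		atomic_or(NOHZ_NEWILB_KICK, nohz_flags(this_cpu));
	}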

2021-11-12 14:55:27

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats



Subject: sched/fair: Simplify newidle_balance()
From: Peter Zijlstra <[email protected]>
Date: Fri Nov 12 15:46:26 CET 2021

Move rq_{un,re}pin_lock() next to raw_spin_rq_{un,}lock().

Remove all rcu_read_{,un}lock(), since we have preempt/irqs disabled
over the whole function and those hold off RCU (again).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10884,15 +10884,6 @@ static int newidle_balance(struct rq *th
if (!cpu_active(this_cpu))
return 0;

- /*
- * This is OK, because current is on_cpu, which avoids it being picked
- * for load-balance and preemption/IRQs are still disabled avoiding
- * further scheduler activity on it and we're being very careful to
- * re-start the picking loop.
- */
- rq_unpin_lock(this_rq, rf);
-
- rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);

if (!READ_ONCE(this_rq->rd->overload) ||
@@ -10908,18 +10899,22 @@ static int newidle_balance(struct rq *th
*/
early_stop = 1;
}
- rcu_read_unlock();

goto out;
}
- rcu_read_unlock();

+ /*
+ * This is OK, because current is on_cpu, which avoids it being picked
+ * for load-balance and preemption/IRQs are still disabled avoiding
+ * further scheduler activity on it and we're being very careful to
+ * re-start the picking loop.
+ */
+ rq_unpin_lock(this_rq, rf);
raw_spin_rq_unlock(this_rq);

t0 = sched_clock_cpu(this_cpu);
update_blocked_averages(this_cpu);

- rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
u64 domain_cost;
@@ -10953,9 +10948,9 @@ static int newidle_balance(struct rq *th
this_rq->ttwu_pending)
break;
}
- rcu_read_unlock();

raw_spin_rq_lock(this_rq);
+ rq_repin_lock(this_rq, rf);

if (curr_cost > this_rq->max_idle_balance_cost)
this_rq->max_idle_balance_cost = curr_cost;
@@ -10982,8 +10977,6 @@ static int newidle_balance(struct rq *th
else if (!early_stop)
nohz_newidle_balance(this_rq);

- rq_repin_lock(this_rq, rf);
-
return pulled_task;
}


2021-11-12 14:55:46

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats


Subject: sched/fair: Reflow newidle_balance()
From: Peter Zijlstra <[email protected]>
Date: Fri Nov 12 15:46:08 CET 2021

The control flow in newidle_balance() is a little convoluted, attempt
simplification.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++-------------
1 file changed, 8 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10858,10 +10858,10 @@ static inline void nohz_newidle_balance(
static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
unsigned long next_balance = jiffies + HZ;
+ int pulled_task = 0, timeout = 0;
int this_cpu = this_rq->cpu;
u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
- int pulled_task = 0, early_stop = 0;

update_misfit_status(NULL, this_rq);

@@ -10889,17 +10889,9 @@ static int newidle_balance(struct rq *th
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- if (sd) {
+ if (sd)
update_next_balance(sd, &next_balance);

- /*
- * We skip new idle LB because there is not enough
- * time before next wake up. Make sure that we will
- * not kick NOHZ_NEWILB_KICK
- */
- early_stop = 1;
- }
-
goto out;
}

@@ -10922,7 +10914,7 @@ static int newidle_balance(struct rq *th
update_next_balance(sd, &next_balance);

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
- early_stop = 1;
+ timeout = 1;
break;
}

@@ -10967,6 +10959,11 @@ static int newidle_balance(struct rq *th
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
pulled_task = -1;

+ if (pulled_task || timeout)
+ goto out;
+
+ nohz_newidle_balance(this_rq);
+
out:
/* Move the next balance forward */
if (time_after(this_rq->next_balance, next_balance))
@@ -10974,8 +10971,6 @@ static int newidle_balance(struct rq *th

if (pulled_task)
this_rq->idle_stamp = 0;
- else if (!early_stop)
- nohz_newidle_balance(this_rq);

return pulled_task;
}

2021-11-12 15:30:01

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats

On Fri, Nov 12, 2021 at 03:47:21PM +0100, Vincent Guittot wrote:
> On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > > In case we skip the newly idle LB entirely, or we abort it because we are
> > > going to exceed avg_idle, we have to make sure not to start an update of
> > > the blocked load when entering idle.
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 18 ++++++++++++++----
> > > 1 file changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 13950beb01a2..a162b0ec8963 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > int this_cpu = this_rq->cpu;
> > > u64 t0, t1, curr_cost = 0;
> > > struct sched_domain *sd;
> > > - int pulled_task = 0;
> > > + int pulled_task = 0, early_stop = 0;
> > >
> > > update_misfit_status(NULL, this_rq);
> > >
> > > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > if (!READ_ONCE(this_rq->rd->overload) ||
> > > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > >
> > > - if (sd)
> > > + if (sd) {
> > > update_next_balance(sd, &next_balance);
> > > +
> > > + /*
> > > + * We skip new idle LB because there is not enough
> > > + * time before next wake up. Make sure that we will
> > > + * not kick NOHZ_NEWILB_KICK
> > > + */
> > > + early_stop = 1;
> > > + }
> > > rcu_read_unlock();
> > >
> > > goto out;

> > Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> > with this on?
>
> This test still covers cases with a short newly idle balance. It errs on
> the conservative side: people have never complained that the blocked load
> average of idle CPUs is updated too rarely; it's most often the
> opposite.

Well, per commit c5b0a7eefc70 ("sched/fair: Remove
sysctl_sched_migration_cost condition") combined with the above change,
we no longer call nohz_newidle_balance() in exactly that condition,
right?

Or are we worried about that !overload case?

2021-11-12 16:00:24

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats

On Fri, 12 Nov 2021 at 16:29, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Nov 12, 2021 at 03:47:21PM +0100, Vincent Guittot wrote:
> > On Fri, 12 Nov 2021 at 15:29, Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 10:58:56AM +0100, Vincent Guittot wrote:
> > > > In case we skip the newly idle LB entirely, or we abort it because we are
> > > > going to exceed avg_idle, we have to make sure not to start an update of
> > > > the blocked load when entering idle.
> > > >
> > > > Signed-off-by: Vincent Guittot <[email protected]>
> > > > ---
> > > > kernel/sched/fair.c | 18 ++++++++++++++----
> > > > 1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 13950beb01a2..a162b0ec8963 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10861,7 +10861,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > > int this_cpu = this_rq->cpu;
> > > > u64 t0, t1, curr_cost = 0;
> > > > struct sched_domain *sd;
> > > > - int pulled_task = 0;
> > > > + int pulled_task = 0, early_stop = 0;
> > > >
> > > > update_misfit_status(NULL, this_rq);
> > > >
> > > > @@ -10898,8 +10898,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > > if (!READ_ONCE(this_rq->rd->overload) ||
> > > > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > > >
> > > > - if (sd)
> > > > + if (sd) {
> > > > update_next_balance(sd, &next_balance);
> > > > +
> > > > + /*
> > > > + * We skip new idle LB because there is not enough
> > > > + * time before next wake up. Make sure that we will
> > > > + * not kick NOHZ_NEWILB_KICK
> > > > + */
> > > > + early_stop = 1;
> > > > + }
> > > > rcu_read_unlock();
> > > >
> > > > goto out;
>
> > > Anyway, does nohz_newidle_balance() want to lose its ->avg_idle test
> > > with this on?
> >
> > This test still covers cases with a short newly idle balance. It errs on
> > the conservative side: people have never complained that the blocked load
> > average of idle CPUs is updated too rarely; it's most often the
> > opposite.
>
> Well, per commit c5b0a7eefc70 ("sched/fair: Remove
> sysctl_sched_migration_cost condition") combined with the above change,
> we no longer call nohz_newidle_balance() in exactly that condition,
> right?
>
> Or are we worried about that !overload case?

We can do a complete newly idle LB and still have this_rq->avg_idle <
sysctl_sched_migration_cost. In this case, the condition will continue
to skip the update of other idle CPUs.

2021-11-12 16:05:56

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: skip newidle update stats

On Fri, 12 Nov 2021 at 15:55, Peter Zijlstra <[email protected]> wrote:
>
>
> Subject: sched/fair: Reflow newidle_balance()
> From: Peter Zijlstra <[email protected]>
> Date: Fri Nov 12 15:46:08 CET 2021
>
> The control flow in newidle_balance() is a little convoluted, attempt
> simplification.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 21 ++++++++-------------
> 1 file changed, 8 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10858,10 +10858,10 @@ static inline void nohz_newidle_balance(
> static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> {
> unsigned long next_balance = jiffies + HZ;
> + int pulled_task = 0, timeout = 0;
> int this_cpu = this_rq->cpu;
> u64 t0, t1, curr_cost = 0;
> struct sched_domain *sd;
> - int pulled_task = 0, early_stop = 0;
>
> update_misfit_status(NULL, this_rq);
>
> @@ -10889,17 +10889,9 @@ static int newidle_balance(struct rq *th
> if (!READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - if (sd) {
> + if (sd)
> update_next_balance(sd, &next_balance);
>
> - /*
> - * We skip new idle LB because there is not enough
> - * time before next wake up. Make sure that we will
> - * not kick NOHZ_NEWILB_KICK
> - */
> - early_stop = 1;
> - }
> -
> goto out;
> }
>
> @@ -10922,7 +10914,7 @@ static int newidle_balance(struct rq *th
> update_next_balance(sd, &next_balance);
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> - early_stop = 1;
> + timeout = 1;
> break;
> }
>
> @@ -10967,6 +10959,11 @@ static int newidle_balance(struct rq *th
> if (this_rq->nr_running != this_rq->cfs.h_nr_running)
> pulled_task = -1;
>
> + if (pulled_task || timeout)
> + goto out;
> +
> + nohz_newidle_balance(this_rq);

maybe

if (!pulled_task && !timeout)
nohz_newidle_balance(this_rq);

> +
> out:
> /* Move the next balance forward */
> if (time_after(this_rq->next_balance, next_balance))
> @@ -10974,8 +10971,6 @@ static int newidle_balance(struct rq *th
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> - else if (!early_stop)
> - nohz_newidle_balance(this_rq);
>
> return pulled_task;
> }

2021-11-16 23:48:42

by Tim Chen

Subject: Re: [PATCH 0/2] avoid spurious blocked load update

On Fri, 2021-11-12 at 10:58 +0100, Vincent Guittot wrote:
> This patchset is a follow-up of:
> https://lore.kernel.org/lkml/[email protected]/
>
> It ensures that newly idle load balance will not kick the update of
> blocked load if it skips the load balance because avg_idle is too
> short.
> It also makes sure that rq->next_balance doesn't go into the past when
> updated.
>
> Tim Chen (1):
> sched: Fix rq->next_balance time updated to earlier than
> current time
>
> Vincent Guittot (1):
> sched/fair: skip newidle update stats
>
> kernel/sched/fair.c | 25 ++++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>

Vincent,

Got some data back from the benchmark team.
To my surprise, the skip_new_idle_update_stats patch
actually makes things a little worse.

                                        Relative Performance
                                        (higher better)
5.15 rc4 vanilla (cgroup disabled)          100%
5.15 rc4 vanilla (cgroup enabled)            96%
patch v2                                     96%
patch v3                                     96%
patch v3
  + skip_new_idle_update_stats              93.7%
patch v3
  + skip_new_idle_update_stats
  + Fix rq->next_balance_time               93.7%

The CPU utilization is actually similar to that with just the v3
patch. In both cases it is
81% user
12% kernel
2% idle
5% waiting for IO


A profile of key load-balancing functions shows a little more CPU
utilization, which is unexpected since we are cutting newidle_balance()
short.

patch v3
0.56% [k] __update_load_avg_cfs_rq
0.51% [k] update_load_avg
0.39% [k] update_blocked_averages
0.36% [k] __update_load_avg_se
0.05% [k] newidle_balance

patch v3 + skip_new_idle_update_stats
0.58% [k] __update_load_avg_cfs_rq
0.53% [k] update_load_avg
0.40% [k] update_blocked_averages
0.37% [k] __update_load_avg_se
0.06% [k] newidle_balance

Context switch frequency is lower by 4% with the skip_new_idle_update_stats
patch.

Thanks.

Tim


2021-11-18 14:40:44

by Vincent Guittot

Subject: Re: [PATCH 0/2] avoid spurious blocked load update

On Wed, 17 Nov 2021 at 00:48, Tim Chen <[email protected]> wrote:
>
> On Fri, 2021-11-12 at 10:58 +0100, Vincent Guittot wrote:
> > This patchset is a follow-up of:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > It ensures that newly idle load balance will not kick the update of
> > blocked load if it skips the load balance because avg_idle is too
> > short.
> > It also makes sure that rq->next_balance doesn't go into the past when
> > updated.
> >
> > Tim Chen (1):
> > sched: Fix rq->next_balance time updated to earlier than
> > current time
> >
> > Vincent Guittot (1):
> > sched/fair: skip newidle update stats
> >
> > kernel/sched/fair.c | 25 ++++++++++++++++++++-----
> > 1 file changed, 20 insertions(+), 5 deletions(-)
> >
>
> Vincent,
>
> Got some data back from the benchmark team.
> To my surprise, the skip_new_idle_update_stats patch
> actually makes things a little worse.
>
>                                       Relative Performance
>                                       (higher better)
> 5.15 rc4 vanilla (cgroup disabled)         100%
> 5.15 rc4 vanilla (cgroup enabled)           96%
> patch v2                                    96%
> patch v3                                    96%
> patch v3
>   + skip_new_idle_update_stats             93.7%
> patch v3
>   + skip_new_idle_update_stats
>   + Fix rq->next_balance_time              93.7%
>

Yeah, that looks surprising.
The skip_new_idle_update_stats patch only ensures that the CPU will not
run an update of the blocked load average of idle CPUs before entering
idle (outside newidle_balance()) when it expects a task to wake up
soon.
The end result is that we run
_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE) less often
before entering idle.
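
For reference, that call comes from the idle-entry hook
nohz_run_idle_balance(), which only fires when NOHZ_NEWILB_KICK was set;
roughly (paraphrased, details may differ between versions):

	void nohz_run_idle_balance(int cpu)
	{
		unsigned int flags;

		flags = atomic_fetch_andnot(NOHZ_NEWILB_KICK, nohz_flags(cpu));

		/*
		 * Only update the blocked load if the newidle kick was the
		 * only pending nohz flag (no softirq ILB already queued)
		 * and no task is about to run.
		 */
		if ((flags == NOHZ_NEWILB_KICK) && !need_resched())
			_nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
	}

So with the kick skipped, this path simply does nothing on idle entry.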

> The CPU utilization is actually similar to that with just the v3
> patch. In both cases it is
> 81% user
> 12% kernel
> 2% idle
> 5% waiting for IO
>
>
> A profile of key load-balancing functions shows a little more CPU
> utilization, which is unexpected since we are cutting newidle_balance()
> short.
>
> patch v3
> 0.56% [k] __update_load_avg_cfs_rq
> 0.51% [k] update_load_avg
> 0.39% [k] update_blocked_averages
> 0.36% [k] __update_load_avg_se
> 0.05% [k] newidle_balance
>
> patch v3 + skip_new_idle_update_stats
> 0.58% [k] __update_load_avg_cfs_rq
> 0.53% [k] update_load_avg
> 0.40% [k] update_blocked_averages
> 0.37% [k] __update_load_avg_se
> 0.06% [k] newidle_balance
>
> Context switch frequency is lower by 4% with the skip_new_idle_update_stats
> patch.
>
> Thanks.
>
> Tim
>