2021-10-04 18:16:57

by Vincent Guittot

Subject: [PATCH 0/2] sched/fair: Improve cost accounting of newidle_balance

Improve the precision of the cost accounting of newidle_balance() by taking
into account the time spent updating the blocked load, and skip the load
balance loop entirely when there is no chance to run at least one iteration.

Vincent Guittot (2):
sched/fair: account update_blocked_averages in newidle_balance cost
sched/fair: Skip update_blocked_averages if we are deferring load
balance

kernel/sched/fair.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

--
2.17.1


2021-10-04 18:18:58

by Vincent Guittot

Subject: [PATCH 1/2] sched/fair: account update_blocked_averages in newidle_balance cost

The time spent updating the blocked load can be significant depending on
the complexity of the cgroup hierarchy. Take this time into account when
deciding to stop newidle_balance() because it exceeds the expected idle
time.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8943dbb94365..1f78b2e3b71c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10810,7 +10810,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int this_cpu = this_rq->cpu;
struct sched_domain *sd;
int pulled_task = 0;
- u64 curr_cost = 0;
+ u64 t0, domain_cost, curr_cost = 0;

update_misfit_status(NULL, this_rq);

@@ -10855,11 +10855,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

raw_spin_rq_unlock(this_rq);

+ t0 = sched_clock_cpu(this_cpu);
update_blocked_averages(this_cpu);
+ domain_cost = sched_clock_cpu(this_cpu) - t0;
+ curr_cost += domain_cost;
+
rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
- u64 t0, domain_cost;

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
--
2.17.1

2021-10-04 23:58:06

by Tim Chen

Subject: Re: [PATCH 0/2] sched/fair: Improve cost accounting of newidle_balance

On Mon, 2021-10-04 at 19:14 +0200, Vincent Guittot wrote:
> Improve the precision of the cost accounting of newidle_balance() by taking
> into account the time spent updating the blocked load, and skip the load
> balance loop entirely when there is no chance to run at least one iteration.
>

Thanks Vincent.

I'll ask our benchmark team to give it a spin. They've been fairly
busy. I hope to get some bandwidth from them in the next couple of
weeks.

Tim

> Vincent Guittot (2):
> sched/fair: account update_blocked_averages in newidle_balance cost
> sched/fair: Skip update_blocked_averages if we are deferring load
> balance
>
> kernel/sched/fair.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>

2021-10-05 00:16:37

by Vincent Guittot

Subject: [PATCH 2/2] sched/fair: Skip update_blocked_averages if we are deferring load balance

In newidle_balance(), the scheduler skips the load balance for the newly idle
CPU when, for the 1st sd of this_rq:

this_rq->avg_idle < sd->max_newidle_lb_cost

When this condition is true, the costly call to update_blocked_averages()
is not useful and simply adds overhead.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f78b2e3b71c..1294b78503d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10841,17 +10841,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);

+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;
}
+ rcu_read_unlock();

raw_spin_rq_unlock(this_rq);

--
2.17.1

2021-10-05 20:43:19

by Peter Zijlstra

Subject: Re: [PATCH 1/2] sched/fair: account update_blocked_averages in newidle_balance cost

On Mon, Oct 04, 2021 at 07:14:50PM +0200, Vincent Guittot wrote:
> The time spent updating the blocked load can be significant depending on
> the complexity of the cgroup hierarchy. Take this time into account when
> deciding to stop newidle_balance() because it exceeds the expected idle
> time.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8943dbb94365..1f78b2e3b71c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10810,7 +10810,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> int this_cpu = this_rq->cpu;
> struct sched_domain *sd;
> int pulled_task = 0;
> - u64 curr_cost = 0;
> + u64 t0, domain_cost, curr_cost = 0;
>
> update_misfit_status(NULL, this_rq);
>
> @@ -10855,11 +10855,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> raw_spin_rq_unlock(this_rq);
>
> + t0 = sched_clock_cpu(this_cpu);
> update_blocked_averages(this_cpu);
> + domain_cost = sched_clock_cpu(this_cpu) - t0;
> + curr_cost += domain_cost;
> +
> rcu_read_lock();
> for_each_domain(this_cpu, sd) {
> int continue_balancing = 1;
> - u64 t0, domain_cost;
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> update_next_balance(sd, &next_balance);

Does this make sense? It avoids a bunch of clock calls (and thereby
accounts more actual time).

Also, perhaps we should use some asymmetric IIR instead of a strict MAX
filter for max_newidle_lb_cost.
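
For instance, something along these lines (a completely untested sketch; the
1/8 decay factor is only an illustrative choice):

	/*
	 * Asymmetric IIR: jump straight to a new maximum, but otherwise
	 * let the estimate decay slowly towards the latest observed cost.
	 */
	if (domain_cost > sd->max_newidle_lb_cost)
		sd->max_newidle_lb_cost = domain_cost;
	else
		sd->max_newidle_lb_cost -=
			(sd->max_newidle_lb_cost - domain_cost) >> 3;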

---
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -10759,9 +10759,9 @@ static int newidle_balance(struct rq *th
{
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
+ u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
int pulled_task = 0;
- u64 t0, domain_cost, curr_cost = 0;

update_misfit_status(NULL, this_rq);

@@ -10808,8 +10808,9 @@ static int newidle_balance(struct rq *th

t0 = sched_clock_cpu(this_cpu);
update_blocked_averages(this_cpu);
- domain_cost = sched_clock_cpu(this_cpu) - t0;
- curr_cost += domain_cost;
+ t1 = sched_clock_cpu(this_cpu);
+ curr_cost += t1 - t0;
+ t0 = t1;

rcu_read_lock();
for_each_domain(this_cpu, sd) {
@@ -10821,17 +10822,19 @@ static int newidle_balance(struct rq *th
}

if (sd->flags & SD_BALANCE_NEWIDLE) {
- t0 = sched_clock_cpu(this_cpu);
+ u64 domain_cost;

pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
&continue_balancing);

- domain_cost = sched_clock_cpu(this_cpu) - t0;
+ t1 = sched_clock_cpu(this_cpu);
+ domain_cost = t1 - t0;
if (domain_cost > sd->max_newidle_lb_cost)
sd->max_newidle_lb_cost = domain_cost;

curr_cost += domain_cost;
+ t0 = t1;
}

update_next_balance(sd, &next_balance);

2021-10-05 20:54:23

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/fair: Skip update_blocked_averages if we are deferring load balance

On Mon, Oct 04, 2021 at 07:14:51PM +0200, Vincent Guittot wrote:
> In newidle_balance(), the scheduler skips the load balance for the newly idle
> CPU when, for the 1st sd of this_rq:
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> When this condition is true, the costly call to update_blocked_averages()
> is not useful and simply adds overhead.
>
> Check the condition early in newidle_balance() to skip
> update_blocked_averages() when possible.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1f78b2e3b71c..1294b78503d9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10841,17 +10841,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> */
> rq_unpin_lock(this_rq, rf);
>
> + rcu_read_lock();
> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

set cino=(0:0, please.

Also, people have, in the past, tried to get rid of the first clause
here; perhaps this can replace it instead of augmenting it?

>
> - rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out;
> }
> + rcu_read_unlock();

There's another rcu_read_lock section right below this; at the very
least we can merge them.

Also, IIRC we're running all this with preemption disabled, and since
rcu-sched got folded into rcu, all that rcu_read_*lock() stuff isn't
strictly required anymore.

(We've come full circle there: back in the day RCU implied RCU-sched and the
scheduler relied on preempt-disable for lots of this stuff; then Paul
split them, and I spent a fair amount of time adding all this
rcu_read_*lock() crud; and now he's merged them again, so it can all go
away again.)

Except of course, I think we need to make rcu_dereference_check happy
first :/

2021-10-06 07:54:21

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: account update_blocked_averages in newidle_balance cost

On Tue, 5 Oct 2021 at 22:41, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Oct 04, 2021 at 07:14:50PM +0200, Vincent Guittot wrote:
> > The time spent updating the blocked load can be significant depending on
> > the complexity of the cgroup hierarchy. Take this time into account when
> > deciding to stop newidle_balance() because it exceeds the expected idle
> > time.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8943dbb94365..1f78b2e3b71c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10810,7 +10810,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > int this_cpu = this_rq->cpu;
> > struct sched_domain *sd;
> > int pulled_task = 0;
> > - u64 curr_cost = 0;
> > + u64 t0, domain_cost, curr_cost = 0;
> >
> > update_misfit_status(NULL, this_rq);
> >
> > @@ -10855,11 +10855,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >
> > raw_spin_rq_unlock(this_rq);
> >
> > + t0 = sched_clock_cpu(this_cpu);
> > update_blocked_averages(this_cpu);
> > + domain_cost = sched_clock_cpu(this_cpu) - t0;
> > + curr_cost += domain_cost;
> > +
> > rcu_read_lock();
> > for_each_domain(this_cpu, sd) {
> > int continue_balancing = 1;
> > - u64 t0, domain_cost;
> >
> > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > update_next_balance(sd, &next_balance);
>
> Does this make sense? It avoids a bunch of clock calls (and thereby
> accounts more actual time).

Originally, I didn't want to modify the current accounting of the
sched_domain but only account for the sometimes large
update_blocked_averages(). But I agree that we can make sure to account
more actual time
>
> Also, perhaps we should use some asymmetric IIR instead of a strict MAX
> filter for max_newidle_lb_cost.

Ok. I'm going to look at this and see how all this goes

>
> ---
> Index: linux-2.6/kernel/sched/fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/fair.c
> +++ linux-2.6/kernel/sched/fair.c
> @@ -10759,9 +10759,9 @@ static int newidle_balance(struct rq *th
> {
> unsigned long next_balance = jiffies + HZ;
> int this_cpu = this_rq->cpu;
> + u64 t0, t1, curr_cost = 0;
> struct sched_domain *sd;
> int pulled_task = 0;
> - u64 t0, domain_cost, curr_cost = 0;
>
> update_misfit_status(NULL, this_rq);
>
> @@ -10808,8 +10808,9 @@ static int newidle_balance(struct rq *th
>
> t0 = sched_clock_cpu(this_cpu);
> update_blocked_averages(this_cpu);
> - domain_cost = sched_clock_cpu(this_cpu) - t0;
> - curr_cost += domain_cost;
> + t1 = sched_clock_cpu(this_cpu);
> + curr_cost += t1 - t0;
> + t0 = t1;
>
> rcu_read_lock();
> for_each_domain(this_cpu, sd) {
> @@ -10821,17 +10822,19 @@ static int newidle_balance(struct rq *th
> }
>
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> - t0 = sched_clock_cpu(this_cpu);
> + u64 domain_cost;
>
> pulled_task = load_balance(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE,
> &continue_balancing);
>
> - domain_cost = sched_clock_cpu(this_cpu) - t0;
> + t1 = sched_clock_cpu(this_cpu);
> + domain_cost = t1 - t0;
> if (domain_cost > sd->max_newidle_lb_cost)
> sd->max_newidle_lb_cost = domain_cost;
>
> curr_cost += domain_cost;
> + t0 = t1;
> }
>
> update_next_balance(sd, &next_balance);

2021-10-06 08:15:25

by Vincent Guittot

Subject: Re: [PATCH 2/2] sched/fair: Skip update_blocked_averages if we are deferring load balance

On Tue, 5 Oct 2021 at 22:49, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Oct 04, 2021 at 07:14:51PM +0200, Vincent Guittot wrote:
> > In newidle_balance(), the scheduler skips the load balance for the newly idle
> > CPU when, for the 1st sd of this_rq:
> >
> > this_rq->avg_idle < sd->max_newidle_lb_cost
> >
> > When this condition is true, the costly call to update_blocked_averages()
> > is not useful and simply adds overhead.
> >
> > Check the condition early in newidle_balance() to skip
> > update_blocked_averages() when possible.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > Signed-off-by: Tim Chen <[email protected]>
> > ---
> > kernel/sched/fair.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1f78b2e3b71c..1294b78503d9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10841,17 +10841,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > */
> > rq_unpin_lock(this_rq, rf);
> >
> > + rcu_read_lock();
> > + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> > +
> > if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> > - !READ_ONCE(this_rq->rd->overload)) {
> > + !READ_ONCE(this_rq->rd->overload) ||
> > + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> set cino=(0:0, please.
>
> Also, people have, in the past, tried to get rid of the first clause
> here; perhaps this can replace it instead of augmenting it?

Yes, that's a good point.
Either sd->max_newidle_lb_cost >= sysctl_sched_migration_cost, in which
case the current condition is covered by the new one, or
sd->max_newidle_lb_cost < sysctl_sched_migration_cost and we will run the
newidle balance; but that also means we have time to run the newly idle
LB.
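
So the first clause could probably just go away, and the check would become
something like (untested):

	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {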

>
> >
> > - rcu_read_lock();
> > - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> > if (sd)
> > update_next_balance(sd, &next_balance);
> > rcu_read_unlock();
> >
> > goto out;
> > }
> > + rcu_read_unlock();
>
> There's another rcu_read_lock section right below this; at the very
> least we can merge them.

We release the rq lock in between, which adds more conditions to check
in all situations.

>
> Also, IIRC we're running all this with preemption disabled, and since
> rcu-sched got folded into rcu, all that rcu_read_*lock() stuff isn't
> strictly required anymore.
>
> (We've come full circle there: back in the day RCU implied RCU-sched and the
> scheduler relied on preempt-disable for lots of this stuff; then Paul
> split them, and I spent a fair amount of time adding all this
> rcu_read_*lock() crud; and now he's merged them again, so it can all go
> away again.)
>
> Except of course, I think we need to make rcu_dereference_check happy
> first :/

2021-10-06 08:19:09

by Vincent Guittot

Subject: Re: [PATCH 1/2] sched/fair: account update_blocked_averages in newidle_balance cost

On Wed, 6 Oct 2021 at 09:52, Vincent Guittot <[email protected]> wrote:
>
> On Tue, 5 Oct 2021 at 22:41, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Oct 04, 2021 at 07:14:50PM +0200, Vincent Guittot wrote:
> > > The time spent updating the blocked load can be significant depending on
> > > the complexity of the cgroup hierarchy. Take this time into account when
> > > deciding to stop newidle_balance() because it exceeds the expected idle
> > > time.
> > >
> > > Signed-off-by: Vincent Guittot <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 7 +++++--
> > > 1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8943dbb94365..1f78b2e3b71c 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10810,7 +10810,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > int this_cpu = this_rq->cpu;
> > > struct sched_domain *sd;
> > > int pulled_task = 0;
> > > - u64 curr_cost = 0;
> > > + u64 t0, domain_cost, curr_cost = 0;
> > >
> > > update_misfit_status(NULL, this_rq);
> > >
> > > @@ -10855,11 +10855,14 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > >
> > > raw_spin_rq_unlock(this_rq);
> > >
> > > + t0 = sched_clock_cpu(this_cpu);
> > > update_blocked_averages(this_cpu);
> > > + domain_cost = sched_clock_cpu(this_cpu) - t0;
> > > + curr_cost += domain_cost;
> > > +
> > > rcu_read_lock();
> > > for_each_domain(this_cpu, sd) {
> > > int continue_balancing = 1;
> > > - u64 t0, domain_cost;
> > >
> > > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > > update_next_balance(sd, &next_balance);
> >
> > Does this make sense? It avoids a bunch of clock calls (and thereby
> > accounts more actual time).
>
> Originally, I didn't want to modify the current accounting of the
> sched_domain but only account for the sometimes large
> update_blocked_averages(). But I agree that we can make sure to account
> more actual time
> >
> > Also, perhaps we should use some asymmetric IIR instead of a strict MAX
> > filter for max_newidle_lb_cost.
>
> Ok. I'm going to look at this and see how all this goes
>
> >
> > ---
> > Index: linux-2.6/kernel/sched/fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched/fair.c
> > +++ linux-2.6/kernel/sched/fair.c
> > @@ -10759,9 +10759,9 @@ static int newidle_balance(struct rq *th
> > {
> > unsigned long next_balance = jiffies + HZ;
> > int this_cpu = this_rq->cpu;
> > + u64 t0, t1, curr_cost = 0;
> > struct sched_domain *sd;
> > int pulled_task = 0;
> > - u64 t0, domain_cost, curr_cost = 0;
> >
> > update_misfit_status(NULL, this_rq);
> >
> > @@ -10808,8 +10808,9 @@ static int newidle_balance(struct rq *th
> >
> > t0 = sched_clock_cpu(this_cpu);
> > update_blocked_averages(this_cpu);
> > - domain_cost = sched_clock_cpu(this_cpu) - t0;

I wonder if we shouldn't include the duration of
update_blocked_averages() in the 1st domain cost, to make sure that we
don't do the update and then abort before running the 1st domain because
there is not enough remaining time.
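
One way to do that, on top of the sketch above, would be to drop the cost/t0
update right after update_blocked_averages() and let the first domain_cost
absorb it, e.g. (untested):

	t0 = sched_clock_cpu(this_cpu);
	update_blocked_averages(this_cpu);
	/*
	 * No curr_cost/t0 update here: the first domain_cost measured in the
	 * loop below starts at t0, so it includes the update and gets folded
	 * into that domain's max_newidle_lb_cost.
	 */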

> > - curr_cost += domain_cost;
> > + t1 = sched_clock_cpu(this_cpu);
> > + curr_cost += t1 - t0;
> > + t0 = t1;
> >
> > rcu_read_lock();
> > for_each_domain(this_cpu, sd) {
> > @@ -10821,17 +10822,19 @@ static int newidle_balance(struct rq *th
> > }
> >
> > if (sd->flags & SD_BALANCE_NEWIDLE) {
> > - t0 = sched_clock_cpu(this_cpu);
> > + u64 domain_cost;
> >
> > pulled_task = load_balance(this_cpu, this_rq,
> > sd, CPU_NEWLY_IDLE,
> > &continue_balancing);
> >
> > - domain_cost = sched_clock_cpu(this_cpu) - t0;
> > + t1 = sched_clock_cpu(this_cpu);
> > + domain_cost = t1 - t0;
> > if (domain_cost > sd->max_newidle_lb_cost)
> > sd->max_newidle_lb_cost = domain_cost;
> >
> > curr_cost += domain_cost;
> > + t0 = t1;
> > }
> >
> > update_next_balance(sd, &next_balance);