This patchset updates newidle lb cost tracking and early abort:
The time spent running update_blocked_averages is now accounted in the 1st
sched_domain level. This time can be significant and move the cost of
newidle lb above the avg_idle time.
The decay of max_newidle_lb_cost is modified to start only when the field
has not been updated for a while. A recent update will not be decayed
immediately but only after a while.
The condition of an avg_idle lower than sysctl_sched_migration_cost has
been removed, as the 500us value is quite large and prevents opportunities to
pull tasks onto the newly idle CPU for at least the first domain levels.
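For reference, the combined effect of these changes on the early-abort path of
newidle_balance() looks roughly as below. This is only a condensed sketch of
the hunks found later in the series (patches 2 and 4), not an extra change:

	rcu_read_lock();
	sd = rcu_dereference_check_sched_domain(this_rq->sd);

	/*
	 * Abort early: either there is nothing to pull, or the expected
	 * idle time is shorter than the measured newidle lb cost of the
	 * 1st domain (which now includes update_blocked_averages()).
	 */
	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();
		goto out;
	}
	rcu_read_unlock();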
Monitoring sd->max_newidle_lb_cost on cpu0 of an Arm64 THX2 system
(2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
following results:
min avg max
SMT: 1us 33us 273us - this one includes the update of blocked load
MC: 7us 49us 398us
NUMA: 10us 45us 158us
Some results for hackbench -l $LOOPS -g $group :
group tip/sched/core + this patchset
1 15.189(+/- 2%) 14.987(+/- 2%) +1%
4 4.336(+/- 3%) 4.322(+/- 5%) +0%
16 3.654(+/- 1%) 2.922(+/- 3%) +20%
32 3.209(+/- 1%) 2.919(+/- 3%) +9%
64 2.965(+/- 1%) 2.826(+/- 1%) +4%
128 2.954(+/- 1%) 2.993(+/- 8%) -1%
256 2.951(+/- 1%) 2.894(+/- 1%) +2%
tbench and reaim have not shown any difference
Change since v2:
- Update and decay of sd->last_decay_max_lb_cost are gathered in
update_newidle_cost(). The behavior remains almost the same except that
the decay can happen during newidle_balance now.
Test results haven't shown any difference.
I haven't modified rq->max_idle_balance_cost. It acts as the max value
for avg_idle and prevents the latter from reaching a high value during long
idle phases. Moving to an IIR filter instead could delay the convergence
of avg_idle to a reasonable value that reflects the current situation (see
the sketch after this changelog).
- Added a minor cleanup of newidle_balance
Change since v1:
- account the time spent in update_blocked_averages() in the 1st domain
- reduce number of call of sched_clock_cpu()
- change the way max_newidle_lb_cost is decayed. Peter suggested using an
IIR, but keeping track of the current max value gave the best result
- removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
as suggested by Peter
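For context on the IIR filters mentioned above: avg_idle is already maintained
with the kernel's simple IIR helper and is capped on wakeup by
rq->max_idle_balance_cost, roughly as in the paraphrased sketch below (taken
from the kernel/sched/core.c wakeup path, not part of this series):

	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;
		*avg += diff / 8;	/* simple IIR: new = old + (sample - old) / 8 */
	}

	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;
		u64 max = 2 * rq->max_idle_balance_cost;

		update_avg(&rq->avg_idle, delta);

		/* Cap avg_idle so a long idle phase cannot inflate it forever */
		if (rq->avg_idle > max)
			rq->avg_idle = max;

		rq->idle_stamp = 0;
	}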
Vincent Guittot (5):
sched/fair: Account update_blocked_averages in newidle_balance cost
sched/fair: Skip update_blocked_averages if we are defering load
balance
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: cleanup newidle_balance
include/linux/sched/topology.h | 2 +-
kernel/sched/fair.c | 65 ++++++++++++++++++++++------------
kernel/sched/topology.c | 2 +-
3 files changed, 45 insertions(+), 24 deletions(-)
--
2.17.1
In newidle_balance(), the scheduler skips load balance to the new idle cpu
when the 1st sd of this_rq is:
this_rq->avg_idle < sd->max_newidle_lb_cost
Doing a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.
Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c0145677ee99..c4c36865321b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10873,17 +10873,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);
+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
goto out;
}
+ rcu_read_unlock();
raw_spin_rq_unlock(this_rq);
--
2.17.1
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.
Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched/topology.h | 2 +-
kernel/sched/fair.c | 36 +++++++++++++++++++++++++---------
kernel/sched/topology.c | 2 +-
3 files changed, 29 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2f9166f6dec8..c07bfa2d80f2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -105,7 +105,7 @@ struct sched_domain {
/* idle_balance() stats */
u64 max_newidle_lb_cost;
- unsigned long next_decay_max_lb_cost;
+ unsigned long last_decay_max_lb_cost;
u64 avg_scan_cost; /* select_idle_sibling */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4c36865321b..e50fd751e1df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10239,6 +10239,30 @@ void update_max_interval(void)
max_load_balance_interval = HZ*num_online_cpus()/10;
}
+static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+{
+ if (cost > sd->max_newidle_lb_cost) {
+ /*
+ * Track max cost of a domain to make sure to not delay the
+ * next wakeup on the CPU.
+ */
+ sd->max_newidle_lb_cost = cost;
+ sd->last_decay_max_lb_cost = jiffies;
+ } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+ /*
+ * Decay the newidle max times by ~1% per second to ensure that
+ * it is not outdated and the current max cost is actually
+ * shorter.
+ */
+ sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
+ sd->last_decay_max_lb_cost = jiffies;
+
+ return true;
+ }
+
+ return false;
+}
+
/*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@@ -10262,14 +10286,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
for_each_domain(cpu, sd) {
/*
* Decay the newidle max times here because this is a regular
- * visit to all the domains. Decay ~1% per second.
+ * visit to all the domains.
*/
- if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
- sd->max_newidle_lb_cost =
- (sd->max_newidle_lb_cost * 253) / 256;
- sd->next_decay_max_lb_cost = jiffies + HZ;
- need_decay = 1;
- }
+ need_decay = update_newidle_cost(sd, 0);
max_cost += sd->max_newidle_lb_cost;
/*
@@ -10911,8 +10930,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
t1 = sched_clock_cpu(this_cpu);
domain_cost = t1 - t0;
- if (domain_cost > sd->max_newidle_lb_cost)
- sd->max_newidle_lb_cost = domain_cost;
+ update_newidle_cost(sd, domain_cost);
curr_cost += domain_cost;
t0 = t1;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e81246787560..30169c7685b6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1568,7 +1568,7 @@ sd_init(struct sched_domain_topology_level *tl,
.last_balance = jiffies,
.balance_interval = sd_weight,
.max_newidle_lb_cost = 0,
- .next_decay_max_lb_cost = jiffies,
+ .last_decay_max_lb_cost = jiffies,
.child = child,
#ifdef CONFIG_SCHED_DEBUG
.name = tl->name,
--
2.17.1
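As a side note on the decay rate used above: multiplying by 253/256 at most
once per second is roughly a 1.2% decay per second, so an untouched
max_newidle_lb_cost halves in about a minute. A small standalone illustration
(plain userspace C, not kernel code; the 398us starting value is just the MC
max from the cover letter):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long cost = 398000;	/* 398us, in ns */
		int sec;

		for (sec = 1; sec <= 60; sec++) {
			/* same factor as update_newidle_cost(), applied once per second */
			cost = cost * 253 / 256;
			if (sec % 15 == 0)
				printf("after %2ds without update: %llu ns\n", sec, cost);
		}
		return 0;
	}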
With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e50fd751e1df..57eae0ebc492 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload) ||
+ if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
if (sd)
--
2.17.1
update_next_balance() uses sd->last_balance which is not modified by
load_balance() so we can merge the 2 calls in one place.
No functional change
Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57eae0ebc492..13950beb01a2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,10 +10916,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int continue_balancing = 1;
u64 domain_cost;
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance);
+
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
break;
- }
if (sd->flags & SD_BALANCE_NEWIDLE) {
@@ -10935,8 +10935,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
t0 = t1;
}
- update_next_balance(sd, &next_balance);
-
/*
* Stop searching for tasks to pull if there are
* now runnable tasks on this rq.
--
2.17.1
On Tue, Oct 19, 2021 at 02:35:34PM +0200, Vincent Guittot wrote:
> In newidle_balance(), the scheduler skips load balance to the new idle cpu
> when the 1st sd of this_rq is:
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> Doing a costly call to update_blocked_averages() will not be useful and
> simply adds overhead when this condition is true.
>
> Check the condition early in newidle_balance() to skip
> update_blocked_averages() when possible.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Tim Chen <[email protected]>
The Signed-off-by seems to be in the wrong order but otherwise
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, Oct 19, 2021 at 02:35:36PM +0200, Vincent Guittot wrote:
> With a default value of 500us, sysctl_sched_migration_cost is
> significanlty higher than the cost of load_balance. Remove the
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, Oct 19, 2021 at 02:35:37PM +0200, Vincent Guittot wrote:
> update_next_balance() uses sd->last_balance which is not modified by
> load_balance() so we can merge the 2 calls in one place.
>
> No functional change
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
Note that the change in the subject format causes b4 to miss this patch
when downloading the series as an mbox. Just something for the maintainers
to watch for if they use b4.
On Tue, Oct 19, 2021 at 02:35:35PM +0200, Vincent Guittot wrote:
> Decay max_newidle_lb_cost only when it has not been updated for a while
> and ensure to not decay a recently changed value.
>
> Signed-off-by: Vincent Guittot <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
>
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
>
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. Recent update will not be decayed
> immediatlybut only after a while.
>
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevent opportunity to
> pull task on the newly idle CPU for at least 1st domain levels.
>
> Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
> min avg max
> SMT: 1us 33us 273us - this one includes the update of blocked load
> MC: 7us 49us 398us
> NUMA: 10us 45us 158us
>
>
> Some results for hackbench -l $LOOPS -g $group :
> group tip/sched/core + this patchset
> 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
>
I read the patches earlier but had queued tests and was waiting on the results
before Acking. The hackbench results were not bad, not a universal win,
but they win more than they lose, with small decreases in system CPU usage.
Most other results showed small gains or losses, nothing overly dramatic
and mostly within the noise.
--
Mel Gorman
SUSE Labs
On Thu, 21 Oct 2021 at 11:52, Mel Gorman <[email protected]> wrote:
>
> On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the field
> > has not been updated for a while. Recent update will not be decayed
> > immediatlybut only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > been removed as the 500us value is quite large and prevent opportunity to
> > pull task on the newly idle CPU for at least 1st domain levels.
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> > min avg max
> > SMT: 1us 33us 273us - this one includes the update of blocked load
> > MC: 7us 49us 398us
> > NUMA: 10us 45us 158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group :
> > group tip/sched/core + this patchset
> > 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> > 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> > 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> > 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> > 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> > 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> > 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
> >
>
> I read the patches earlier but had queued tests and waiting on the results
> before Acking. The hackbench results were not bad, not a universal win,
> but wins more than it loses with small decreaseds in system CPU usage.
Thanks for running tests
>
> Most other results showed small gains or losses, nothing overly dramatic
> and mostly within the noise.
>
> --
> Mel Gorman
> SUSE Labs
On Thu, 21 Oct 2021 at 11:44, Mel Gorman <[email protected]> wrote:
>
> On Tue, Oct 19, 2021 at 02:35:34PM +0200, Vincent Guittot wrote:
> > In newidle_balance(), the scheduler skips load balance to the new idle cpu
> > when the 1st sd of this_rq is:
> >
> > this_rq->avg_idle < sd->max_newidle_lb_cost
> >
> > Doing a costly call to update_blocked_averages() will not be useful and
> > simply adds overhead when this condition is true.
> >
> > Check the condition early in newidle_balance() to skip
> > update_blocked_averages() when possible.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > Signed-off-by: Tim Chen <[email protected]>
>
> The Signed-off-by seems to be in the wrong order but otherwise
The right signoff sequence should be:
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
But I skipped the last one which was only a rebase and my signoff was
already there
>
> Acked-by: Mel Gorman <[email protected]>
>
> --
> Mel Gorman
> SUSE Labs
On Tue, 2021-10-19 at 14:35 +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
>
> The time spent running update_blocked_averages is now accounted in
> the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
>
> The decay of max_newidle_lb_cost is modified to start only when the
> field
> has not been updated for a while. Recent update will not be decayed
> immediatlybut only after a while.
>
> The condition of an avg_idle lower than sysctl_sched_migration_cost
> has
> been removed as the 500us value is quite large and prevent
> opportunity to
> pull task on the newly idle CPU for at least 1st domain levels.
>
> Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
> min avg max
> SMT: 1us 33us 273us - this one includes the update of blocked
> load
> MC: 7us 49us 398us
> NUMA: 10us 45us 158us
>
>
> Some results for hackbench -l $LOOPS -g $group :
> group tip/sched/core + this patchset
> 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
>
> tbench and reaim have not shown any difference
>
Vincent,
Our benchmark team tested the patches for our OLTP benchmark
on a 2 socket Cascade Lake
with 28 cores/socket. It is a smaller configuration
than the 2 socket Ice Lake we hae tested previously that has 40
cores/socket so the overhead on update_blocked_averages is smaller
(~4%).
Here's a summary of the results:
Relative Performance
(higher better)
5.15 rc4 vanilla (cgroup disabled) 100%
5.15 rc4 vanilla (cgroup enabled) 96%
patch v2 96%
patch v3 96%
We didn't see much change in performance from the patch set.
Looking at the profile on update_blocked_averages a bit more,
the majority of the calls to update_blocked_averages
happen in run_rebalance_domains. And we are not
including the cost of update_blocked_averages for
run_rebalance_domains in our current patch set. I think
the patch set should account for that too.
0.60% 0.00% 3 [kernel.vmlinux] [k] run_rebalance_domains - -
|
--0.59%--run_rebalance_domains
|
--0.57%--update_blocked_averages
Thanks.
Tim
On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
>
> > Looking at the profile on update_blocked_averages a bit more,
> > the majority of the call to update_blocked_averages
> > happens in run_rebalance_domain. And we are not
> > including that cost of update_blocked_averages for
> > run_rebalance_domains in our current patch set. I think
> > the patch set should account for that too.
>
> nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> This would probably benefit to take into account the cost of
> update_blocked_averages instead
>
For the case where
this_rq->avg_idle < sysctl_sched_migration_cost
in newidle_balance(), we skip to the out: label
out:
/* Move the next balance forward */
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
if (pulled_task)
this_rq->idle_stamp = 0;
else
nohz_newidle_balance(this_rq);
and we call nohz_newidle_balance as we don't have a pulled_task.
It seems to make sense to skip the call
to nohz_newidle_balance() for this case?
We expect a very short idle and a task to wake shortly.
So we do not have to pull a task
to this idle cpu and incur the migration cost.
Tim
On Tue, 26 Oct 2021 at 19:25, Tim Chen <[email protected]> wrote:
>
> On Tue, 2021-10-19 at 14:35 +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in
> > the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the
> > field
> > has not been updated for a while. Recent update will not be decayed
> > immediatlybut only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost
> > has
> > been removed as the 500us value is quite large and prevent
> > opportunity to
> > pull task on the newly idle CPU for at least 1st domain levels.
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> > min avg max
> > SMT: 1us 33us 273us - this one includes the update of blocked
> > load
> > MC: 7us 49us 398us
> > NUMA: 10us 45us 158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group :
> > group tip/sched/core + this patchset
> > 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> > 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> > 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> > 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> > 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> > 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> > 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
> >
> > tbench and reaim have not shown any difference
> >
>
> Vincent,
>
> Our benchmark team tested the patches for our OLTP benchmark
> on a 2 socket Cascade Lake
> with 28 cores/socket. It is a smaller configuration
> than the 2 socket Ice Lake we hae tested previously that has 40
> cores/socket so the overhead on update_blocked_averages is smaller
> (~4%).
>
> Here's a summary of the results:
> Relative Performance
> (higher better)
> 5.15 rc4 vanilla (cgroup disabled) 100%
> 5.15 rc4 vanilla (cgroup enabled) 96%
> patch v2 96%
> patch v3 96%
>
> We didn't see much change in performance from the patch set.
Thanks for testing.
I was not expecting this patch to fix all your problems but at least
those which have been highlighted during previous exchanges.
A few problems still remain in your case if I'm not wrong:
There is a patch that ensures that rq->next_balance is never set in the past.
IIRC, you also mentioned in a previous thread that the idle LB
happening during the tick just after the new idle balance was a
problem in your case, which is probably linked to your figures below
>
> Looking at the profile on update_blocked_averages a bit more,
> the majority of the call to update_blocked_averages
> happens in run_rebalance_domain. And we are not
> including that cost of update_blocked_averages for
> run_rebalance_domains in our current patch set. I think
> the patch set should account for that too.
nohz_newidle_balance keeps using sysctl_sched_migration_cost to
trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
This would probably benefit to take into account the cost of
update_blocked_averages instead
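For reference, the entry checks of nohz_newidle_balance() around v5.15 look
roughly like the paraphrased sketch below; the sysctl_sched_migration_cost
test is the one referred to here, and it is still what gates raising
NOHZ_NEWILB_KICK before entering idle:

	static void nohz_newidle_balance(struct rq *this_rq)
	{
		int this_cpu = this_rq->cpu;

		/* This CPU doesn't want to be disturbed by scheduler housekeeping */
		if (!housekeeping_cpu(this_cpu, HK_FLAG_SCHED))
			return;

		/* Will wake up very soon. No time for doing anything else */
		if (this_rq->avg_idle < sysctl_sched_migration_cost)
			return;

		/* Don't need to update blocked load of idle CPUs */
		if (!READ_ONCE(nohz.has_blocked) ||
		    time_before(jiffies, READ_ONCE(nohz.next_blocked)))
			return;

		/*
		 * Set the need to trigger ILB in order to update blocked load
		 * before entering idle state.
		 */
		atomic_or(NOHZ_NEWILB_KICK, nohz_flags(this_cpu));
	}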
>
>
> 0.60% 0.00% 3 [kernel.vmlinux] [k] run_rebalance_domains - -
> |
> --0.59%--run_rebalance_domains
> |
> --0.57%--update_blocked_averages
>
> Thanks.
>
> Tim
>
On Wednesday, 27 Oct 2021 at 13:53:32 (-0700), Tim Chen wrote:
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> >
> > > Looking at the profile on update_blocked_averages a bit more,
> > > the majority of the call to update_blocked_averages
> > > happens in run_rebalance_domain. And we are not
> > > including that cost of update_blocked_averages for
> > > run_rebalance_domains in our current patch set. I think
> > > the patch set should account for that too.
> >
> > nohz_newidle_balance keeps using sysctl_sched_migration_cost to
> > trigger a _nohz_idle_balance(cpu_rq(cpu), NOHZ_STATS_KICK, CPU_IDLE);
> > This would probably benefit to take into account the cost of
> > update_blocked_averages instead
> >
>
> For the case where
>
> this_rq->avg_idle < sysctl_sched_migration_cost
>
> in newidle_balance(), we skip to the out: label
>
> out:
> /* Move the next balance forward */
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> if (pulled_task)
> this_rq->idle_stamp = 0;
> else
> nohz_newidle_balance(this_rq);
>
> and we call nohz_newidle_balance as we don't have a pulled_task.
>
> It seems to make sense to skip the call
> to nohz_newidle_balance() for this case?
nohz_newidle_balance() also tests this condition:
(this_rq->avg_idle < sysctl_sched_migration_cost)
and doesn't set NOHZ_NEWILB_KICK in such a case.
But this patch now uses the condition:
this_rq->avg_idle < sd->max_newidle_lb_cost
and sd->max_newidle_lb_cost can be higher than sysctl_sched_migration_cost,
which means that we can set NOHZ_NEWILB_KICK:
-although we decided to skip newidle loop
-or when we abort because this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost
This is even more true when sysctl_sched_migration_cost is lowered which is your case IIRC
The patch below ensures that we don't set NOHZ_NEWILB_KICK in such cases:
---
kernel/sched/fair.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c19f4bb3df1a..36ddae208959 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10779,7 +10779,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int this_cpu = this_rq->cpu;
u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
- int pulled_task = 0;
+ int pulled_task = 0, early_stop = 0;
update_misfit_status(NULL, this_rq);
@@ -10816,8 +10816,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
- if (sd)
+ if (sd) {
update_next_balance(sd, &next_balance);
+
+ /*
+ * We skip new idle LB because there is not enough
+ * time before next wake up. Make sure that we will
+ * not kick NOHZ_NEWILB_KICK
+ */
+ early_stop = 1;
+ }
rcu_read_unlock();
goto out;
@@ -10836,8 +10844,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
update_next_balance(sd, &next_balance);
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+ early_stop = 1;
break;
+ }
if (sd->flags & SD_BALANCE_NEWIDLE) {
@@ -10887,7 +10897,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
if (pulled_task)
this_rq->idle_stamp = 0;
- else
+ else if (!early_stop)
nohz_newidle_balance(this_rq);
rq_repin_lock(this_rq, rf);
--
> We expect a very short idle and a task to wake shortly.
> So we do not have to pull a task
> to this idle cpu and incur the migration cost.
>
> Tim
>
>
>
>
On Thu, 2021-10-28 at 14:15 +0200, Vincent Guittot wrote:
>
> > It seems to make sense to skip the call
> > to nohz_newidle_balance() for this case?
>
> nohz_newidle_balance() also tests this condition :
> (this_rq->avg_idle < sysctl_sched_migration_cost)
> and doesn't set NOHZ_NEWILB_KICKi in such case
>
> But this patch now used the condition :
> this_rq->avg_idle < sd->max_newidle_lb_cost
> and sd->max_newidle_lb_cost can be higher than
> sysctl_sched_migration_cost
>
> which means that we can set NOHZ_NEWILB_KICK:
> -although we decided to skip newidle loop
> -or when we abort because this_rq->avg_idle < curr_cost + sd-
> >max_newidle_lb_cost
>
> This is even more true when sysctl_sched_migration_cost is lowered
> which is your case IIRC
>
> The patch below ensures that we don't set NOHZ_NEWILB_KICK in such
> cases:
>
Thanks. Will ask our benchmark team to give it a spin.
Tim
Hi, Vincent, Peter,
On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
>
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
>
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. Recent update will not be decayed
> immediatlybut only after a while.
>
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevent opportunity to
> pull task on the newly idle CPU for at least 1st domain levels.
It appears this series is not yet in upstream Linus's tree. What's the latest on it?
I often see on ARM64 devices that load balance is skipped due to the
high sysctl_sched_migration_cost. I saw another thread as well where
someone complained that the performance varies and the default might be too high:
https://lkml.org/lkml/2021/9/14/150
Any other thoughts? Happy to help on any progress on this series as well. Thanks,
- Joel
>
> Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
> min avg max
> SMT: 1us 33us 273us - this one includes the update of blocked load
> MC: 7us 49us 398us
> NUMA: 10us 45us 158us
>
>
> Some results for hackbench -l $LOOPS -g $group :
> group tip/sched/core + this patchset
> 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
>
> tbench and reaim have not shown any difference
>
> Change since v2:
> - Update and decay of sd->last_decay_max_lb_cost are gathered in
> update_newidle_cost(). The behavior remains almost the same except that
> the decay can happen during newidle_balance now.
>
> Tests results haven't shown any differences
>
> I haven't modified rq->max_idle_balance_cost. It acts as the max value
> for avg_idle and prevents the latter to reach high value during long
> idle phase. Moving on an IIR filter instead, could delay the convergence
> of avg_idle to a reasonnable value that reflect current situation.
>
> - Added a minor cleanup of newidle_balance
>
> Change since v1:
> - account the time spent in update_blocked_averages() in the 1st domain
>
> - reduce number of call of sched_clock_cpu()
>
> - change the way max_newidle_lb_cost is decayed. Peter suggested to use a
> IIR but keeping a track of the current max value gave the best result
>
> - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> as suggested by Peter
>
> Vincent Guittot (5):
> sched/fair: Account update_blocked_averages in newidle_balance cost
> sched/fair: Skip update_blocked_averages if we are defering load
> balance
> sched/fair: Wait before decaying max_newidle_lb_cost
> sched/fair: Remove sysctl_sched_migration_cost condition
> sched/fair: cleanup newidle_balance
>
> include/linux/sched/topology.h | 2 +-
> kernel/sched/fair.c | 65 ++++++++++++++++++++++------------
> kernel/sched/topology.c | 2 +-
> 3 files changed, 45 insertions(+), 24 deletions(-)
>
> --
> 2.17.1
>
On Fri, 29 Oct 2021 at 01:25, Joel Fernandes <[email protected]> wrote:
>
> Hi, Vincent, Peter,
>
> On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > This patchset updates newidle lb cost tracking and early abort:
> >
> > The time spent running update_blocked_averages is now accounted in the 1st
> > sched_domain level. This time can be significant and move the cost of
> > newidle lb above the avg_idle time.
> >
> > The decay of max_newidle_lb_cost is modified to start only when the field
> > has not been updated for a while. Recent update will not be decayed
> > immediatlybut only after a while.
> >
> > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > been removed as the 500us value is quite large and prevent opportunity to
> > pull task on the newly idle CPU for at least 1st domain levels.
>
> It appears this series is not yet in upstream Linus's tree. What's the latest on it?
>
I sent an addon yesterday to cover cases that Tim cares about
> I see a lot of times on ARM64 devices that load balance is skipped due to the
> high the sysctl_sched_migration_cost. I saw another thread as well where
Have you tested the patchset? Does it enable more load balance on
your platform?
> someone complained the performance varies and the default might be too high:
> https://lkml.org/lkml/2021/9/14/150
Added Yicong and Barry in the list
>
> Any other thoughts? Happy to help on any progress on this series as well. Thanks,
>
> - Joel
>
> >
> > Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > following results:
> > min avg max
> > SMT: 1us 33us 273us - this one includes the update of blocked load
> > MC: 7us 49us 398us
> > NUMA: 10us 45us 158us
> >
> >
> > Some results for hackbench -l $LOOPS -g $group :
> > group tip/sched/core + this patchset
> > 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> > 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> > 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> > 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> > 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> > 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> > 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
> >
> > tbench and reaim have not shown any difference
> >
> > Change since v2:
> > - Update and decay of sd->last_decay_max_lb_cost are gathered in
> > update_newidle_cost(). The behavior remains almost the same except that
> > the decay can happen during newidle_balance now.
> >
> > Tests results haven't shown any differences
> >
> > I haven't modified rq->max_idle_balance_cost. It acts as the max value
> > for avg_idle and prevents the latter to reach high value during long
> > idle phase. Moving on an IIR filter instead, could delay the convergence
> > of avg_idle to a reasonnable value that reflect current situation.
> >
> > - Added a minor cleanup of newidle_balance
> >
> > Change since v1:
> > - account the time spent in update_blocked_averages() in the 1st domain
> >
> > - reduce number of call of sched_clock_cpu()
> >
> > - change the way max_newidle_lb_cost is decayed. Peter suggested to use a
> > IIR but keeping a track of the current max value gave the best result
> >
> > - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> > as suggested by Peter
> >
> > Vincent Guittot (5):
> > sched/fair: Account update_blocked_averages in newidle_balance cost
> > sched/fair: Skip update_blocked_averages if we are defering load
> > balance
> > sched/fair: Wait before decaying max_newidle_lb_cost
> > sched/fair: Remove sysctl_sched_migration_cost condition
> > sched/fair: cleanup newidle_balance
> >
> > include/linux/sched/topology.h | 2 +-
> > kernel/sched/fair.c | 65 ++++++++++++++++++++++------------
> > kernel/sched/topology.c | 2 +-
> > 3 files changed, 45 insertions(+), 24 deletions(-)
> >
> > --
> > 2.17.1
> >
On 19/10/2021 14:35, Vincent Guittot wrote:
> Decay max_newidle_lb_cost only when it has not been updated for a while
> and ensure to not decay a recently changed value.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> include/linux/sched/topology.h | 2 +-
> kernel/sched/fair.c | 36 +++++++++++++++++++++++++---------
> kernel/sched/topology.c | 2 +-
> 3 files changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 2f9166f6dec8..c07bfa2d80f2 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -105,7 +105,7 @@ struct sched_domain {
>
> /* idle_balance() stats */
> u64 max_newidle_lb_cost;
> - unsigned long next_decay_max_lb_cost;
> + unsigned long last_decay_max_lb_cost;
>
> u64 avg_scan_cost; /* select_idle_sibling */
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c4c36865321b..e50fd751e1df 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10239,6 +10239,30 @@ void update_max_interval(void)
> max_load_balance_interval = HZ*num_online_cpus()/10;
> }
>
> +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> +{
> + if (cost > sd->max_newidle_lb_cost) {
> + /*
> + * Track max cost of a domain to make sure to not delay the
> + * next wakeup on the CPU.
> + */
> + sd->max_newidle_lb_cost = cost;
> + sd->last_decay_max_lb_cost = jiffies;
That's the actual change of the patch: sd->last_decay_max_lb_cost being
moved forward also when newidle cost is updated from newidle_balance() ?
> + } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> + /*
> + * Decay the newidle max times by ~1% per second to ensure that
> + * it is not outdated and the current max cost is actually
> + * shorter.
I assume that `max cost` refers here to a local variable of the only
caller of update_newidle_cost(..., 0) - rebalance_domains()?
"the current max cost" has to be shorter so that
rq->max_idle_balance_cost also decays in this case. Is this what this
comment tries to say here?
[...]
On 19/10/2021 14:35, Vincent Guittot wrote:
> This patchset updates newidle lb cost tracking and early abort:
>
> The time spent running update_blocked_averages is now accounted in the 1st
> sched_domain level. This time can be significant and move the cost of
> newidle lb above the avg_idle time.
>
> The decay of max_newidle_lb_cost is modified to start only when the field
> has not been updated for a while. Recent update will not be decayed
> immediatlybut only after a while.
>
> The condition of an avg_idle lower than sysctl_sched_migration_cost has
> been removed as the 500us value is quite large and prevent opportunity to
> pull task on the newly idle CPU for at least 1st domain levels.
>
> Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> following results:
> min avg max
> SMT: 1us 33us 273us - this one includes the update of blocked load
> MC: 7us 49us 398us
> NUMA: 10us 45us 158us
>
>
> Some results for hackbench -l $LOOPS -g $group :
> group tip/sched/core + this patchset
> 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
>
> tbench and reaim have not shown any difference
>
> Change since v2:
> - Update and decay of sd->last_decay_max_lb_cost are gathered in
> update_newidle_cost(). The behavior remains almost the same except that
> the decay can happen during newidle_balance now.
>
> Tests results haven't shown any differences
>
> I haven't modified rq->max_idle_balance_cost. It acts as the max value
> for avg_idle and prevents the latter to reach high value during long
> idle phase. Moving on an IIR filter instead, could delay the convergence
> of avg_idle to a reasonnable value that reflect current situation.
>
> - Added a minor cleanup of newidle_balance
>
> Change since v1:
> - account the time spent in update_blocked_averages() in the 1st domain
>
> - reduce number of call of sched_clock_cpu()
>
> - change the way max_newidle_lb_cost is decayed. Peter suggested to use a
> IIR but keeping a track of the current max value gave the best result
>
> - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> as suggested by Peter
>
> Vincent Guittot (5):
> sched/fair: Account update_blocked_averages in newidle_balance cost
> sched/fair: Skip update_blocked_averages if we are defering load
> balance
> sched/fair: Wait before decaying max_newidle_lb_cost
> sched/fair: Remove sysctl_sched_migration_cost condition
> sched/fair: cleanup newidle_balance
>
> include/linux/sched/topology.h | 2 +-
> kernel/sched/fair.c | 65 ++++++++++++++++++++++------------
> kernel/sched/topology.c | 2 +-
> 3 files changed, 45 insertions(+), 24 deletions(-)
Reviewed-by: Dietmar Eggemann <[email protected]>
LGTM, just a couple of questions in 3/5 and 4/5.
On 19/10/2021 14:35, Vincent Guittot wrote:
> With a default value of 500us, sysctl_sched_migration_cost is
> significanlty higher than the cost of load_balance. Remove the
Shouldn't this be rather `load balance cost on the lowest sd`? I assume
here that lb cost stands for sd->max_newidle_lb_cost of the 1st sd.
We still use sysctl_sched_migration_cost as a floor against max_cost
(i.e. lb cost of all sd's) when setting rq->max_idle_balance_cost in
rebalance_domains().
And in the add-on discussion (disabling the call to
nohz_newidle_balance()) you mention that sd->max_newidle_lb_cost can be
higher than sysctl_sched_migration_cost (even with the default 500us).
> condition and rely on the sd->max_newidle_lb_cost to abort
> newidle_balance.
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e50fd751e1df..57eae0ebc492 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> - if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload) ||
> + if (!READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> if (sd)
>
On Fri, 29 Oct 2021 at 12:01, Dietmar Eggemann <[email protected]> wrote:
>
> On 19/10/2021 14:35, Vincent Guittot wrote:
> > Decay max_newidle_lb_cost only when it has not been updated for a while
> > and ensure to not decay a recently changed value.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > include/linux/sched/topology.h | 2 +-
> > kernel/sched/fair.c | 36 +++++++++++++++++++++++++---------
> > kernel/sched/topology.c | 2 +-
> > 3 files changed, 29 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 2f9166f6dec8..c07bfa2d80f2 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -105,7 +105,7 @@ struct sched_domain {
> >
> > /* idle_balance() stats */
> > u64 max_newidle_lb_cost;
> > - unsigned long next_decay_max_lb_cost;
> > + unsigned long last_decay_max_lb_cost;
> >
> > u64 avg_scan_cost; /* select_idle_sibling */
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c4c36865321b..e50fd751e1df 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10239,6 +10239,30 @@ void update_max_interval(void)
> > max_load_balance_interval = HZ*num_online_cpus()/10;
> > }
> >
> > +static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
> > +{
> > + if (cost > sd->max_newidle_lb_cost) {
> > + /*
> > + * Track max cost of a domain to make sure to not delay the
> > + * next wakeup on the CPU.
> > + */
> > + sd->max_newidle_lb_cost = cost;
> > + sd->last_decay_max_lb_cost = jiffies;
>
> That's the actual change of the patch: sd->last_decay_max_lb_cost being
> moved forward also when newidle cost is updated from newidle_balance() ?
>
> > + } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
> > + /*
> > + * Decay the newidle max times by ~1% per second to ensure that
> > + * it is not outdated and the current max cost is actually
> > + * shorter.
>
> I assume that `max cost` refers here to a local variable of the only
> caller of update_newidle_cost(..., 0) - rebalance_domains()?
>
> "the current max cost" has to be shorter so that
> rq->max_idle_balance_cost also decays in this case. Is this what this
> comment tries to say here?
I refer to the time tracked in sd->max_newidle_lb_cost.
Here I set the current cost to zero to trigger a possible decay of
sd->max_newidle_lb_cost.
>
> [...]
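To summarise the two call sites after this patch (condensed from the hunks
above, nothing new beyond what the diff already shows):

	/* rebalance_domains(): regular visit to every domain; a cost of 0 can
	 * only take the decay branch, and the return value reports whether a
	 * decay happened so rq->max_idle_balance_cost can be recomputed.
	 */
	need_decay = update_newidle_cost(sd, 0);

	/* newidle_balance(): feed the freshly measured domain cost; a new max
	 * also moves last_decay_max_lb_cost forward, delaying the next decay.
	 */
	update_newidle_cost(sd, domain_cost);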
On Fri, 29 Oct 2021 at 12:02, Dietmar Eggemann <[email protected]> wrote:
>
> On 19/10/2021 14:35, Vincent Guittot wrote:
> > With a default value of 500us, sysctl_sched_migration_cost is
> > significanlty higher than the cost of load_balance. Remove the
>
> Shouldn't this be rather `load balance cost on the lowest sd`? I assume
> here that lb cost stands for sd->max_newidle_lb_cost of the 1st sd.
Both.
During the tests, I did on thx2, the sum of sd->max_newidle_lb_cost
could range between 18us and 829us.
During those tests, I had a limited number of cgroup and the
update_blocked_average could certainly go above the 273us of SMT
sd->max_newidle_lb_cost and the 500us of sysctl_sched_migration_cost
>
> We still use sysctl_sched_migration_cost as a floor against max_cost
> (i.e. lb cost of all sd's) when setting rq->max_idle_balance_cost in
> rebalance_domains().
rq->max_idle_balance_cost is used to cap the value used to compute rq->avg_idle.
>
> And in the add-on discussion (disabling the call to
> nohz_newidle_balance() you mention that sd->max_newidle_lb_cost can be
> higher than sysctl_sched_migration_cost (even when default 500us).
yes, I have reached 273us on thx2 with 2 empty cgroups
>
>
> > condition and rely on the sd->max_newidle_lb_cost to abort
> > newidle_balance.
> >
> > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e50fd751e1df..57eae0ebc492 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > rcu_read_lock();
> > sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >
> > - if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> > - !READ_ONCE(this_rq->rd->overload) ||
> > + if (!READ_ONCE(this_rq->rd->overload) ||
> > (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> >
> > if (sd)
> >
>
On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
>
>
> Few problems still remain in your case if I'm not wrong:
> There is a patch that ensures that rq->next_balance is never set in
> the past.
>
Vincent,
Were you planning to take the patch that prevents next_balance from being
set in the past?
Tim
---
From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
From: Tim Chen <[email protected]>
Date: Tue, 11 May 2021 09:55:41 -0700
Subject: [PATCH] sched: sched: Fix rq->next_balance time updated to earlier
than current time
To: [email protected]
In traces on newidle_balance(), this_rq->next_balance
time goes backward and earlier than current time jiffies, e.g.
11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
It doesn't make sense to have a next_balance in the past.
Fix newidle_balance() and update_next_balance() so the next
balance time is at least jiffies+1.
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..740a0572cbf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
/* used by idle balance, so cpu_busy = 0 */
interval = get_sd_balance_interval(sd, 0);
- next = sd->last_balance + interval;
+ if (time_after(jiffies+1, sd->last_balance + interval))
+ next = jiffies+1;
+ else
+ next = sd->last_balance + interval;
if (time_after(*next_balance, next))
*next_balance = next;
@@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
out:
/* Move the next balance forward */
+ if (time_after(jiffies+1, this_rq->next_balance))
+ this_rq->next_balance = jiffies+1;
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
--
2.20.1
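Reading the traces in the changelog above: at 11.624 the recorded
next_balance is 0x1004fb731 while jiffies is 0x1004fb739, i.e.
0x1004fb739 - 0x1004fb731 = 8 jiffies in the past, which is why the patch
clamps both update_next_balance() and the out: path to at least jiffies+1.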
Hi Vincent,
On Fri, Oct 29, 2021 at 09:49:23AM +0200, Vincent Guittot wrote:
> On Fri, 29 Oct 2021 at 01:25, Joel Fernandes <[email protected]> wrote:
> >
> > Hi, Vincent, Peter,
> >
> > On Tue, Oct 19, 2021 at 02:35:32PM +0200, Vincent Guittot wrote:
> > > This patchset updates newidle lb cost tracking and early abort:
> > >
> > > The time spent running update_blocked_averages is now accounted in the 1st
> > > sched_domain level. This time can be significant and move the cost of
> > > newidle lb above the avg_idle time.
> > >
> > > The decay of max_newidle_lb_cost is modified to start only when the field
> > > has not been updated for a while. Recent update will not be decayed
> > > immediatlybut only after a while.
> > >
> > > The condition of an avg_idle lower than sysctl_sched_migration_cost has
> > > been removed as the 500us value is quite large and prevent opportunity to
> > > pull task on the newly idle CPU for at least 1st domain levels.
> >
> > It appears this series is not yet in upstream Linus's tree. What's the latest on it?
> >
>
> I sent an addon yesterday to cover cases that Tim cares about
Oh ok, I'll check it out. Thanks for letting me know.
> > I see a lot of times on ARM64 devices that load balance is skipped due to the
> > high the sysctl_sched_migration_cost. I saw another thread as well where
>
> Have you tested the patchset ? Does it enable more load balance on
> your platform ?
Yes, I tested and noticed load balancing happening more often. I ran some
workloads and noticed it makes things smoother. I am doing more tests as well.
Also, we need this series to fix the update_blocked_averages() latency issues,
and I verified that the preemptoff tracer does not show the same high latency
with this series.
thanks,
- Joel
> > someone complained the performance varies and the default might be too high:
> > https://lkml.org/lkml/2021/9/14/150
>
> Added Yicong and Barry in the list
>
> >
> > Any other thoughts? Happy to help on any progress on this series as well. Thanks,
> >
> > - Joel
> >
> > >
> > > Monitoring sd->max_newidle_lb_cost on cpu0 of a Arm64 system
> > > THX2 (2 nodes * 28 cores * 4 cpus) during the benchmarks gives the
> > > following results:
> > > min avg max
> > > SMT: 1us 33us 273us - this one includes the update of blocked load
> > > MC: 7us 49us 398us
> > > NUMA: 10us 45us 158us
> > >
> > >
> > > Some results for hackbench -l $LOOPS -g $group :
> > > group tip/sched/core + this patchset
> > > 1 15.189(+/- 2%) 14.987(+/- 2%) +1%
> > > 4 4.336(+/- 3%) 4.322(+/- 5%) +0%
> > > 16 3.654(+/- 1%) 2.922(+/- 3%) +20%
> > > 32 3.209(+/- 1%) 2.919(+/- 3%) +9%
> > > 64 2.965(+/- 1%) 2.826(+/- 1%) +4%
> > > 128 2.954(+/- 1%) 2.993(+/- 8%) -1%
> > > 256 2.951(+/- 1%) 2.894(+/- 1%) +2%
> > >
> > > tbench and reaim have not shown any difference
> > >
> > > Change since v2:
> > > - Update and decay of sd->last_decay_max_lb_cost are gathered in
> > > update_newidle_cost(). The behavior remains almost the same except that
> > > the decay can happen during newidle_balance now.
> > >
> > > Tests results haven't shown any differences
> > >
> > > I haven't modified rq->max_idle_balance_cost. It acts as the max value
> > > for avg_idle and prevents the latter to reach high value during long
> > > idle phase. Moving on an IIR filter instead, could delay the convergence
> > > of avg_idle to a reasonnable value that reflect current situation.
> > >
> > > - Added a minor cleanup of newidle_balance
> > >
> > > Change since v1:
> > > - account the time spent in update_blocked_averages() in the 1st domain
> > >
> > > - reduce number of call of sched_clock_cpu()
> > >
> > > - change the way max_newidle_lb_cost is decayed. Peter suggested to use a
> > > IIR but keeping a track of the current max value gave the best result
> > >
> > > - removed the condition (this_rq->avg_idle < sysctl_sched_migration_cost)
> > > as suggested by Peter
> > >
> > > Vincent Guittot (5):
> > > sched/fair: Account update_blocked_averages in newidle_balance cost
> > > sched/fair: Skip update_blocked_averages if we are defering load
> > > balance
> > > sched/fair: Wait before decaying max_newidle_lb_cost
> > > sched/fair: Remove sysctl_sched_migration_cost condition
> > > sched/fair: cleanup newidle_balance
> > >
> > > include/linux/sched/topology.h | 2 +-
> > > kernel/sched/fair.c | 65 ++++++++++++++++++++++------------
> > > kernel/sched/topology.c | 2 +-
> > > 3 files changed, 45 insertions(+), 24 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e60b56e46b384cee1ad34e6adc164d883049c6c3
Gitweb: https://git.kernel.org/tip/e60b56e46b384cee1ad34e6adc164d883049c6c3
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 19 Oct 2021 14:35:35 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00
sched/fair: Wait before decaying max_newidle_lb_cost
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched/topology.h | 2 +-
kernel/sched/fair.c | 36 ++++++++++++++++++++++++---------
kernel/sched/topology.c | 2 +-
3 files changed, 29 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2f9166f..c07bfa2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -105,7 +105,7 @@ struct sched_domain {
/* idle_balance() stats */
u64 max_newidle_lb_cost;
- unsigned long next_decay_max_lb_cost;
+ unsigned long last_decay_max_lb_cost;
u64 avg_scan_cost; /* select_idle_sibling */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4c3686..e50fd75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10239,6 +10239,30 @@ void update_max_interval(void)
max_load_balance_interval = HZ*num_online_cpus()/10;
}
+static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
+{
+ if (cost > sd->max_newidle_lb_cost) {
+ /*
+ * Track max cost of a domain to make sure to not delay the
+ * next wakeup on the CPU.
+ */
+ sd->max_newidle_lb_cost = cost;
+ sd->last_decay_max_lb_cost = jiffies;
+ } else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
+ /*
+ * Decay the newidle max times by ~1% per second to ensure that
+ * it is not outdated and the current max cost is actually
+ * shorter.
+ */
+ sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
+ sd->last_decay_max_lb_cost = jiffies;
+
+ return true;
+ }
+
+ return false;
+}
+
/*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@@ -10262,14 +10286,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
for_each_domain(cpu, sd) {
/*
* Decay the newidle max times here because this is a regular
- * visit to all the domains. Decay ~1% per second.
+ * visit to all the domains.
*/
- if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
- sd->max_newidle_lb_cost =
- (sd->max_newidle_lb_cost * 253) / 256;
- sd->next_decay_max_lb_cost = jiffies + HZ;
- need_decay = 1;
- }
+ need_decay = update_newidle_cost(sd, 0);
max_cost += sd->max_newidle_lb_cost;
/*
@@ -10911,8 +10930,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
t1 = sched_clock_cpu(this_cpu);
domain_cost = t1 - t0;
- if (domain_cost > sd->max_newidle_lb_cost)
- sd->max_newidle_lb_cost = domain_cost;
+ update_newidle_cost(sd, domain_cost);
curr_cost += domain_cost;
t0 = t1;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e812467..30169c7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1568,7 +1568,7 @@ sd_init(struct sched_domain_topology_level *tl,
.last_balance = jiffies,
.balance_interval = sd_weight,
.max_newidle_lb_cost = 0,
- .next_decay_max_lb_cost = jiffies,
+ .last_decay_max_lb_cost = jiffies,
.child = child,
#ifdef CONFIG_SCHED_DEBUG
.name = tl->name,
The following commit has been merged into the sched/core branch of tip:
Commit-ID: c5b0a7eefc70150caf23e37bc9d639c68c87a097
Gitweb: https://git.kernel.org/tip/c5b0a7eefc70150caf23e37bc9d639c68c87a097
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 19 Oct 2021 14:35:36 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00
sched/fair: Remove sysctl_sched_migration_cost condition
With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e50fd75..57eae0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload) ||
+ if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
if (sd)
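To see why the fixed 500us gate was too coarse, a small illustrative sketch
(hypothetical numbers, rd->overload check ignored, not kernel code): with an expected
idle time of 300us, the old condition always skipped newidle balance, even when the
measured per-domain cost is far smaller.

#include <stdio.h>

int main(void)
{
	/* Hypothetical values, in nanoseconds. */
	unsigned long long avg_idle = 300000;			/* 300us expected idle */
	unsigned long long max_newidle_lb_cost = 50000;		/* 50us measured cost  */
	unsigned long long sysctl_sched_migration_cost = 500000;	/* 500us default */

	/* Old gate: fixed threshold, skips regardless of the measured cost. */
	printf("old gate skips: %d\n", avg_idle < sysctl_sched_migration_cost);

	/* New gate: skip only when the idle window cannot cover the measured cost. */
	printf("new gate skips: %d\n", avg_idle < max_newidle_lb_cost);

	return 0;
}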
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8ea9183db4ad8afbcb7089a77c23eaf965b0cacd
Gitweb: https://git.kernel.org/tip/8ea9183db4ad8afbcb7089a77c23eaf965b0cacd
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 19 Oct 2021 14:35:37 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Sun, 31 Oct 2021 11:11:38 +01:00
sched/fair: Cleanup newidle_balance
update_next_balance() uses sd->last_balance, which is not modified by
load_balance(), so the two calls can be merged into one place.
No functional change
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57eae0e..13950be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,10 +10916,10 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
int continue_balancing = 1;
u64 domain_cost;
- if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance);
+
+ if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
break;
- }
if (sd->flags & SD_BALANCE_NEWIDLE) {
@@ -10935,8 +10935,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
t0 = t1;
}
- update_next_balance(sd, &next_balance);
-
/*
* Stop searching for tasks to pull if there are
* now runnable tasks on this rq.
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 9d783c8dd112ad4b619e74e4bf57d2be0b400693
Gitweb: https://git.kernel.org/tip/9d783c8dd112ad4b619e74e4bf57d2be0b400693
Author: Vincent Guittot <[email protected]>
AuthorDate: Tue, 19 Oct 2021 14:35:34 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Sun, 31 Oct 2021 11:11:37 +01:00
sched/fair: Skip update_blocked_averages if we are deferring load balance
In newidle_balance(), the scheduler skips load balance to the new idle cpu
when the 1st sd of this_rq is:
this_rq->avg_idle < sd->max_newidle_lb_cost
Doing a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.
Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c014567..c4c3686 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10873,17 +10873,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);
+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
goto out;
}
+ rcu_read_unlock();
raw_spin_rq_unlock(this_rq);
On Fri, 29 Oct 2021 at 19:00, Tim Chen <[email protected]> wrote:
>
> On Wed, 2021-10-27 at 10:49 +0200, Vincent Guittot wrote:
> >
> >
> > A few problems still remain in your case, if I'm not wrong:
> > There is a patch that ensures that rq->next_balance is never set in
> > the past.
> >
>
> Vincent,
>
> Were you planning to take the patch to prevent next_balance from being
> set in the past?
I can add it to this series
>
> Tim
>
> ---
> From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
> From: Tim Chen <[email protected]>
> Date: Tue, 11 May 2021 09:55:41 -0700
> Subject: [PATCH] sched: Fix rq->next_balance time updated to earlier
> than current time
> To: [email protected]
>
> In traces on newidle_balance(), the this_rq->next_balance time goes
> backward and ends up earlier than the current jiffies, e.g.
>
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> It doesn't make sense to have a next_balance in the past.
> Fix newidle_balance() and update_next_balance() so the next
> balance time is at least jiffies+1.
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..740a0572cbf1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
> /* used by idle balance, so cpu_busy = 0 */
> interval = get_sd_balance_interval(sd, 0);
> - next = sd->last_balance + interval;
> + if (time_after(jiffies+1, sd->last_balance + interval))
> + next = jiffies+1;
> + else
> + next = sd->last_balance + interval;
>
> if (time_after(*next_balance, next))
> *next_balance = next;
> @@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> + if (time_after(jiffies+1, this_rq->next_balance))
> + this_rq->next_balance = jiffies+1;
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> --
> 2.20.1
>
>
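For completeness, a standalone sketch of what the jiffies+1 clamp in the quoted patch
does (values borrowed from the trace above, last_balance and interval are made up,
time_after() is a simplified stand-in for the kernel macro, and a 64-bit unsigned long
is assumed):

#include <stdio.h>

/* Same semantics as the kernel's time_after(a, b): true if a is after b. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

int main(void)
{
	unsigned long jiffies = 0x1004fb739;		/* from the trace */
	unsigned long last_balance = 0x1004fb700;	/* hypothetical   */
	unsigned long interval = 0x31;			/* hypothetical   */
	unsigned long next = last_balance + interval;	/* 0x1004fb731, already in the past */

	if (time_after(jiffies + 1, next))
		next = jiffies + 1;			/* clamp into the future */

	printf("next balance %#lx, jiffies %#lx\n", next, jiffies);
	return 0;
}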