2023-06-12 08:56:22

by Chen Yu

[permalink] [raw]
Subject: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance

Hi,

This is an attempt to reduce the cost of newidle balance which is
found to occupy noticeable CPU cycles on some high-core count systems.

For example, by running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:

6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats

The main idea comes from the following question raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?

The proposal ILB_UTIL mainly adjusts the newidle balance scan depth
within the current sched domain, based on the system utilization in
this domain. The more spare time there is in the domain, the more time
each newidle balance can spend on scanning for a busy group. Although
the newidle balance has per domain max_newidle_lb_cost to decide
whether to launch the balance or not, the ILB_UTIL provides a smaller
granularity to decide how many groups each newidle balance can scan.

patch 1/4 is code cleanup.

patch 2/4 is to introduce a new variable in sched domain to indicate the
number of groups, and will be used by patch 3 and patch 4.

patch 3/4 is to calculate the scan depth in each periodic load balance.

patch 4/4 is to limit the scan depth based on the result of patch 3,
and the depth will be used by newidle_balance()->
find_busiest_group() -> update_sd_lb_stats()


According to the test result, netperf/tbench shows some improvements
when the system is underloaded, while no noticeable difference from
hackbench/schbench. While I'm trying to run more benchmarks including
some macro-benchmarks, I send this draft patch out and seek for suggestion
from the community if this is the right thing to do and if we are in the
right direction.

[We also have other wild ideas like sorting the groups by their load
in the periodic load balance, later newidle_balance() can fetch the
corresponding group in O(1). And this change seems to get improvement
too according to the test result].

Any comments would be appreciated.

Chen Yu (4):
sched/fair: Extract the function to get the sd_llc_shared
sched/topology: Introduce nr_groups in sched_domain to indicate the
number of groups
sched/fair: Calculate the scan depth for idle balance based on system
utilization
sched/fair: Throttle the busiest group scanning in idle load balance

include/linux/sched/topology.h | 5 +++
kernel/sched/fair.c | 74 +++++++++++++++++++++++++++++-----
kernel/sched/features.h | 1 +
kernel/sched/topology.c | 10 ++++-
4 files changed, 79 insertions(+), 11 deletions(-)

--
2.25.1



2023-06-12 09:48:59

by Chen Yu

[permalink] [raw]
Subject: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

When CPU is about to enter idle, it invokes newidle_balance() to pull
some tasks from other runqueues. Although there is per domain
max_newidle_lb_cost to throttle the newidle_balance(), it would be
good to further limit the scan based on overall system utilization.
The reason is that there is no limitation for newidle_balance() to
launch this balance simultaneously on multiple CPUs. Since each
newidle_balance() has to traverse all the CPUs to calculate the
statistics one by one, this total time cost on newidle_balance()
could be O(n^2). This is not good for performance or power saving.

For example, sqlite has spent quite some time on newidle balance()
on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the LLC domain. Let the number of
scanned groups be a linear function of the utilization ratio:

nr_groups_to_scan = nr_groups * (1 - util_ratio)

Besides, save the total_load, total_capacity of the current
sched domain in each periodic load balance. This statistic
can be reused later by CPU_NEWLY_IDLE load balance if it quits
the scan earlier. Introduce a sched feature ILB_UTIL to
control this.

Suggested-by: Tim Chen <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 4 ++++
kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
3 files changed, 39 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 1faececd5694..d7b2bac9bdf3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,10 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+ /* ilb scan depth and load balance statistic snapshot */
+ int ilb_nr_scan;
+ unsigned long ilb_total_load;
+ unsigned long ilb_total_capacity;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3a24aead848..f999e838114e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10122,6 +10122,39 @@ static void update_idle_cpu_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
}

+static void update_ilb_group_scan(struct lb_env *env,
+ unsigned long sum_util,
+ struct sched_domain_shared *sd_share,
+ struct sd_lb_stats *sds)
+{
+ u64 tmp, nr_scan;
+
+ if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ if (!sd_share)
+ return;
+ /*
+ * Limit the newidle balance scan depth based on overall system
+ * utilization:
+ * nr_groups_scan = nr_groups * (1 - util_ratio)
+ * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+ */
+ nr_scan = env->sd->nr_groups * sum_util;
+ tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+ do_div(nr_scan, tmp);
+ nr_scan = env->sd->nr_groups - nr_scan;
+ if ((int)nr_scan != sd_share->ilb_nr_scan)
+ WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
+
+ /* Also save the statistic snapshot of the periodic load balance */
+ if (sds->total_load != sd_share->ilb_total_load)
+ WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
+
+ if (sds->total_capacity != sd_share->ilb_total_capacity)
+ WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10200,6 +10233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_idle_cpu_scan(env, sum_util, sd_share);
+ update_ilb_group_scan(env, sum_util, sd_share, sds);
}

/**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..8f6e5b08408d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)

SCHED_FEAT(RT_RUNTIME_SHARE, false)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(ILB_UTIL, true)
SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
--
2.25.1


2023-06-12 09:54:38

by Chen Yu

[permalink] [raw]
Subject: [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups

Record the number of sched groups within each sched domain. Prepare for
newidle_balance() scan depth calculation.

Signed-off-by: Chen Yu <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..1faececd5694 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
struct sched_domain_shared *shared;

unsigned int span_weight;
+ unsigned int nr_groups;
/*
* Span of all CPUs in this domain.
*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..255606e88956 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
- int i;
+ int i, nr_groups = 0;

cpumask_clear(covered);

@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!sg)
goto fail;

+ nr_groups++;
+
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last->next = first;
}
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;

@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
struct sd_data *sdd = sd->private;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered;
- int i;
+ int i, nr_groups = 0;

lockdep_assert_held(&sched_domains_mutex);
covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)

sg = get_group(i, sdd);

+ nr_groups++;
+
cpumask_or(covered, covered, sched_group_span(sg));

if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
}
last->next = first;
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;
}
--
2.25.1


2023-06-15 04:58:28

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance

Hello Chen Yu,


On Tue, Jun 13, 2023 at 12:17:53AM +0800, Chen Yu wrote:
> Hi,
>
> This is an attempt to reduce the cost of newidle balance which is
> found to occupy noticeable CPU cycles on some high-core count systems.
>
> For example, by running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
>
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> The main idea comes from the following question raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?
>
> The proposal ILB_UTIL mainly adjusts the newidle balance scan depth
> within the current sched domain, based on the system utilization in
> this domain. The more spare time there is in the domain, the more time
> each newidle balance can spend on scanning for a busy group. Although
> the newidle balance has per domain max_newidle_lb_cost to decide
> whether to launch the balance or not, the ILB_UTIL provides a smaller
> granularity to decide how many groups each newidle balance can scan.

Thanks for the patchset. This is an interesting approach to minimise
the time spent in newidle balance when the system is relatively
busy. In the past we have seen that newly idle CPUs whose avg idle
duration is slightly higher than the max_newidle_balance_cost spend
quite a bit of time trying to find a busiest group/cpu only to not
find any task to pull. But that's the opportunity lost to go idle. I
suppose this patch is targetted towards a SMT --> MC(DIE) kind of
topologies where we have a lot of groups in the DIE domain, but it
would be interesting to see if this can help when there are fewer
groups in each sched-domain.

Will try this on our setup.

>
> patch 1/4 is code cleanup.
>
> patch 2/4 is to introduce a new variable in sched domain to indicate the
> number of groups, and will be used by patch 3 and patch 4.
>
> patch 3/4 is to calculate the scan depth in each periodic load balance.
>
> patch 4/4 is to limit the scan depth based on the result of patch 3,
> and the depth will be used by newidle_balance()->
> find_busiest_group() -> update_sd_lb_stats()
>
>
> According to the test result, netperf/tbench shows some improvements
> when the system is underloaded, while no noticeable difference from
> hackbench/schbench. While I'm trying to run more benchmarks including
> some macro-benchmarks, I send this draft patch out and seek for suggestion
> from the community if this is the right thing to do and if we are in the
> right direction.
>
> [We also have other wild ideas like sorting the groups by their load
> in the periodic load balance, later newidle_balance() can fetch the
> corresponding group in O(1). And this change seems to get improvement
> too according to the test result].
>
> Any comments would be appreciated.
>
> Chen Yu (4):
> sched/fair: Extract the function to get the sd_llc_shared
> sched/topology: Introduce nr_groups in sched_domain to indicate the
> number of groups
> sched/fair: Calculate the scan depth for idle balance based on system
> utilization
> sched/fair: Throttle the busiest group scanning in idle load balance
>
> include/linux/sched/topology.h | 5 +++
> kernel/sched/fair.c | 74 +++++++++++++++++++++++++++++-----
> kernel/sched/features.h | 1 +
> kernel/sched/topology.c | 10 ++++-
> 4 files changed, 79 insertions(+), 11 deletions(-)
>

--
Thanks and Regards
gautham.

2023-06-15 06:21:15

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hello Chen Yu,


On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> When CPU is about to enter idle, it invokes newidle_balance() to pull
> some tasks from other runqueues. Although there is per domain
> max_newidle_lb_cost to throttle the newidle_balance(), it would be
> good to further limit the scan based on overall system utilization.
> The reason is that there is no limitation for newidle_balance() to
> launch this balance simultaneously on multiple CPUs. Since each
> newidle_balance() has to traverse all the CPUs to calculate the
> statistics one by one, this total time cost on newidle_balance()
> could be O(n^2). This is not good for performance or power saving.
>
> For example, sqlite has spent quite some time on newidle balance()
> on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> Based on this observation, limit the scan depth of newidle_balance()
> by considering the utilization of the LLC domain. Let the number of
> scanned groups be a linear function of the utilization ratio:
>

Is there any particular reason why this is being limited only to the
LLC domain ?

On architectures where the LLC domain may not be so large (POWER9/10,
AMD), the additional cost is usually paid at the higher domains where
the number of groups is greater / equal to the number of groups in the
LLC domain and where sd_span is pretty large. It would be good to
explore avoiding the scan cost on those domains as well, right?

> nr_groups_to_scan = nr_groups * (1 - util_ratio)

If we can extend this logic to higher domains, on a Zen3 1 Socket
server with 128 CPUs at the DIE domain containing 8 groups, we can
expect a significant reduction in the time spent doing
update_sg_lb_stats() at higher utilizations.

util_ratio nr_groups_to_scan nr_cpus_scanned
========================================================
0.9 1 16 (-87.5%)
0.75 2 32 (-75%)
0.5 4 64 (-50%)
0.25 6 96 (-25%)
0.1 7 112 (-12.5%)


On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
12 groups, values will be:

util_ratio nr_groups_to_scan nr_cpus_scanned
========================================================
0.9 1 16 (-91%)
0.75 3 48 (-75%)
0.5 6 96 (-50%)
0.25 9 144 (-25%)
0.1 10 160 (-16.7%)

>
> Besides, save the total_load, total_capacity of the current
> sched domain in each periodic load balance. This statistic
> can be reused later by CPU_NEWLY_IDLE load balance if it quits
> the scan earlier. Introduce a sched feature ILB_UTIL to
> control this.
>
> Suggested-by: Tim Chen <[email protected]>
> Signed-off-by: Chen Yu <[email protected]>
> ---
> include/linux/sched/topology.h | 4 ++++
> kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 3 files changed, 39 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 1faececd5694..d7b2bac9bdf3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,6 +82,10 @@ struct sched_domain_shared {
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> int nr_idle_scan;
> + /* ilb scan depth and load balance statistic snapshot */
> + int ilb_nr_scan;
> + unsigned long ilb_total_load;
> + unsigned long ilb_total_capacity;
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3a24aead848..f999e838114e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10122,6 +10122,39 @@ static void update_idle_cpu_scan(struct lb_env *env,
> WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
> }
>
> +static void update_ilb_group_scan(struct lb_env *env,
> + unsigned long sum_util,
> + struct sched_domain_shared *sd_share,
> + struct sd_lb_stats *sds)
> +{
> + u64 tmp, nr_scan;
> +
> + if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
> + return;
> +
> + if (!sd_share)
> + return;
> + /*
> + * Limit the newidle balance scan depth based on overall system
> + * utilization:
> + * nr_groups_scan = nr_groups * (1 - util_ratio)
> + * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> + */
> + nr_scan = env->sd->nr_groups * sum_util;
> + tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> + do_div(nr_scan, tmp);
> + nr_scan = env->sd->nr_groups - nr_scan;
> + if ((int)nr_scan != sd_share->ilb_nr_scan)
> + WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
> +
> + /* Also save the statistic snapshot of the periodic load balance */
> + if (sds->total_load != sd_share->ilb_total_load)
> + WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
> +
> + if (sds->total_capacity != sd_share->ilb_total_capacity)
> + WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
> +}
> +
> /**
> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10200,6 +10233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
>
> update_idle_cpu_scan(env, sum_util, sd_share);
> + update_ilb_group_scan(env, sum_util, sd_share, sds);
> }
>
> /**
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..8f6e5b08408d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>
> SCHED_FEAT(RT_RUNTIME_SHARE, false)
> SCHED_FEAT(LB_MIN, false)
> +SCHED_FEAT(ILB_UTIL, true)
> SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> SCHED_FEAT(WA_IDLE, true)
> --
> 2.25.1
>

2023-06-16 06:23:10

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hi Gautham,
On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
>
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> >
> > For example, sqlite has spent quite some time on newidle balance()
> > on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the LLC domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
>
> Is there any particular reason why this is being limited only to the
> LLC domain ?
>
Thank you for your interest in this change. The original thought as you
might know is that our LLC has a huge number of groups.
And since SIS_UTIL has done similar thing for LLC, we want to propose
a simple prototype to demonstrate the idea.
> On architectures where the LLC domain may not be so large (POWER9/10,
> AMD), the additional cost is usually paid at the higher domains where
> the number of groups is greater / equal to the number of groups in the
> LLC domain and where sd_span is pretty large. It would be good to
> explore avoiding the scan cost on those domains as well, right?
>
Exactly.
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
>
> If we can extend this logic to higher domains, on a Zen3 1 Socket
> server with 128 CPUs at the DIE domain containing 8 groups, we can
> expect a significant reduction in the time spent doing
> update_sg_lb_stats() at higher utilizations.
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-87.5%)
> 0.75 2 32 (-75%)
> 0.5 4 64 (-50%)
> 0.25 6 96 (-25%)
> 0.1 7 112 (-12.5%)
>
>
> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
> 12 groups, values will be:
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-91%)
> 0.75 3 48 (-75%)
> 0.5 6 96 (-50%)
> 0.25 9 144 (-25%)
> 0.1 10 160 (-16.7%)
>
Thanks for this information, I think we can extend the scan limitation
for not only LLC(MC domain) but also higher domains. I'll think about
it.

thanks,
Chenyu

2023-06-21 08:09:25

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hi Gautham,
On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
>
>
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
> >
> > For example, sqlite has spent quite some time on newidle balance()
> > on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the LLC domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
>
> Is there any particular reason why this is being limited only to the
> LLC domain ?
>
> On architectures where the LLC domain may not be so large (POWER9/10,
> AMD), the additional cost is usually paid at the higher domains where
> the number of groups is greater / equal to the number of groups in the
> LLC domain and where sd_span is pretty large. It would be good to
> explore avoiding the scan cost on those domains as well, right?
>
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
>
> If we can extend this logic to higher domains, on a Zen3 1 Socket
> server with 128 CPUs at the DIE domain containing 8 groups, we can
> expect a significant reduction in the time spent doing
> update_sg_lb_stats() at higher utilizations.
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-87.5%)
> 0.75 2 32 (-75%)
> 0.5 4 64 (-50%)
> 0.25 6 96 (-25%)
> 0.1 7 112 (-12.5%)
>
>
> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
> 12 groups, values will be:
>
> util_ratio nr_groups_to_scan nr_cpus_scanned
> ========================================================
> 0.9 1 16 (-91%)
> 0.75 3 48 (-75%)
> 0.5 6 96 (-50%)
> 0.25 9 144 (-25%)
> 0.1 10 160 (-16.7%)
>
I have an idea to limit scan depth for newidle balance for big domains.
These domains should have CPUs higher than/equals to LLC(MC domain).
However it seems that in current kernel only domain with SD_SHARE_PKG_RESOURCES
flag set will have the shared struct sched_domain_shared among the CPUs in this
domain. And this is reasonable because the cost to access the struct sched_domain_shared
is lower if the CPUs share cache. Since ILB_UTIL relies on the sched_domain_shared
to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
during sched_domain_shared assignment.
If non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
the overhead should be not too high(only periodic load balance would
write to sched_domain_shared). Here is a untest patch which shows what
I'm thinking of, and I'll do some refinement based on this:

thanks,
Chenyu

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..ce7ffbb7b3f8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,10 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+ /* ilb scan depth and load balance statistic snapshot */
+ int ilb_nr_scan;
+ unsigned long ilb_total_load;
+ unsigned long ilb_total_capacity;
};

struct sched_domain {
@@ -152,6 +156,7 @@ struct sched_domain {
struct sched_domain_shared *shared;

unsigned int span_weight;
+ unsigned int nr_groups;
/*
* Span of all CPUs in this domain.
*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d724215826ae..34619dbb2f4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10162,6 +10162,54 @@ static void update_idle_cpu_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
}

+/*
+ * Get the domain shared information of dst CPU.
+ */
+static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
+{
+ /*
+ * Do not consider the domains smaller than LLC because those
+ * small domains have low cost on idle load balance.
+ */
+ if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
+ return NULL;
+
+ return env->sd->shared;
+}
+
+static void update_ilb_group_scan(struct lb_env *env,
+ unsigned long sum_util,
+ struct sched_domain_shared *sd_share,
+ struct sd_lb_stats *sds)
+{
+ u64 tmp, nr_scan;
+
+ if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ if(!sd_share)
+ return;
+ /*
+ * Limit the newidle balance scan depth based on overall system
+ * utilization:
+ * nr_groups_scan = nr_groups * (1 - util_ratio)
+ * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+ */
+ nr_scan = env->sd->nr_groups * sum_util;
+ tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+ do_div(nr_scan, tmp);
+ nr_scan = env->sd->nr_groups - nr_scan;
+ if ((int)nr_scan != sd_share->ilb_nr_scan)
+ WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
+
+ /* save the statistic snapshot of the periodic load balance */
+ if (sds->total_load != sd_share->ilb_total_load)
+ WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
+
+ if (sds->total_capacity != sd_share->ilb_total_capacity)
+ WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10170,11 +10218,17 @@ static void update_idle_cpu_scan(struct lb_env *env,

static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
+ struct sched_domain_shared *sd_share = get_sd_shared(env);
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- int sg_status = 0;
+ int sg_status = 0, nr_scan_ilb;
+ bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
+ sd_share && READ_ONCE(sd_share->ilb_total_capacity);
+
+ if (ilb_util_enabled)
+ nr_scan_ilb = sd_share->ilb_nr_scan;

do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10192,6 +10246,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

update_sg_lb_stats(env, sds, sg, sgs, &sg_status);

+ if (ilb_util_enabled && --nr_scan_ilb <= 0) {
+ /*
+ * Borrow the statistic of previous periodic load balance.
+ * The data might be stale and it is a trade-off.
+ */
+ sds->total_load = READ_ONCE(sd_share->ilb_total_load);
+ sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
+
+ break;
+ }
+
if (local_group)
goto next_group;

@@ -10239,6 +10304,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}

update_idle_cpu_scan(env, sum_util);
+ update_ilb_group_scan(env, sum_util, sd_share, sds);
}

/**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..8f6e5b08408d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)

SCHED_FEAT(RT_RUNTIME_SHARE, false)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(ILB_UTIL, true)
SCHED_FEAT(ATTACH_AGE_LOAD, true)

SCHED_FEAT(WA_IDLE, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..98bfac5f7836 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
- int i;
+ int i, nr_groups = 0;

cpumask_clear(covered);

@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!sg)
goto fail;

+ nr_groups++;
+
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);

@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last->next = first;
}
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;

@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
struct sd_data *sdd = sd->private;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered;
- int i;
+ int i, nr_groups = 0;

lockdep_assert_held(&sched_domains_mutex);
covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)

sg = get_group(i, sdd);

+ nr_groups++;
+
cpumask_or(covered, covered, sched_group_span(sg));

if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
}
last->next = first;
sd->groups = first;
+ sd->nr_groups = nr_groups;

return 0;
}
@@ -1641,14 +1647,12 @@ sd_init(struct sched_domain_topology_level *tl,
}

/*
- * For all levels sharing cache; connect a sched_domain_shared
+ * For all levels; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+ atomic_set(&sd->shared->nr_busy_cpus, sd_weight);

sd->private = sdd;

--
2.25.1


2023-06-21 11:33:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

On Thu, Jun 15, 2023 at 11:31:07AM +0530, Gautham R. Shenoy wrote:

> Is there any particular reason why this is being limited only to the
> LLC domain ?

Yeah, agreed, this should be uniform across the domains below NUMA.
Above that we already have SD_SERIALIZE anyway -- and the interval is
fairly high.

2023-06-21 11:39:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> When CPU is about to enter idle, it invokes newidle_balance() to pull
> some tasks from other runqueues. Although there is per domain
> max_newidle_lb_cost to throttle the newidle_balance(), it would be
> good to further limit the scan based on overall system utilization.
> The reason is that there is no limitation for newidle_balance() to
> launch this balance simultaneously on multiple CPUs. Since each
> newidle_balance() has to traverse all the CPUs to calculate the
> statistics one by one, this total time cost on newidle_balance()
> could be O(n^2). This is not good for performance or power saving.

Another possible solution is to keep struct sg_lb_stats in
sd->child->shared (below the NUMA domains) and put a lock around it.

Then have update_sd_lb_stats() do something like:

struct sg_lb_stats *sgs = &sds->sgs;

if (raw_spin_trylock(&sds->sg_lock)) {
struct sg_lb_stats tmp;

... collect tmp

sds->sgs = tmp;
raw_spin_unlock(&sds->sg_lock);
}

... use sgs

Then you know you've always got a 'recent' copy but avoid the concurrent
updates.

2023-06-22 06:52:23

by Gautham R. Shenoy

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

On Wed, Jun 21, 2023 at 03:29:27PM +0800, Chen Yu wrote:
> Hi Gautham,

[..snip..]

> >
> I have an idea to limit scan depth for newidle balance for big domains.
> These domains should have CPUs higher than/equals to LLC(MC domain).
> However it seems that in current kernel only domain with SD_SHARE_PKG_RESOURCES
> flag set will have the shared struct sched_domain_shared among the CPUs in this
> domain. And this is reasonable because the cost to access the struct sched_domain_shared
> is lower if the CPUs share cache.

Yes, this was a conscious choice. But now, we may need to explore
using it and consider the trade-off between cachelines bouncing across
LLCs while updating sd_shared on higher domains (I would still prefer
limiting it to below the NUMA domain!) vs having to recompute stuff at
the higher domains, if something has been recently computed for this
level by some other CPU in the domain.

> Since ILB_UTIL relies on the sched_domain_shared
> to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
> during sched_domain_shared assignment.

That's what Prateek was exploring as well. I wonder however, if we
should have sd_shared defined only for non-NUMA domains.

> If non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
> the overhead should be not too high(only periodic load balance would
> write to sched_domain_shared).
> Here is a untest patch which shows what
> I'm thinking of, and I'll do some refinement based on this:

Thanks a lot for the patch. I will add it to our test queue.

--
Thanks and Regards
gautham.

2023-06-23 14:42:22

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hi Peter,
On 2023-06-21 at 13:17:21 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is per domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limitation for newidle_balance() to
> > launch this balance simultaneously on multiple CPUs. Since each
> > newidle_balance() has to traverse all the CPUs to calculate the
> > statistics one by one, this total time cost on newidle_balance()
> > could be O(n^2). This is not good for performance or power saving.
>
> Another possible solution is to keep struct sg_lb_stats in
> sd->child->shared (below the NUMA domains) and put a lock around it.
>
> Then have update_sd_lb_stats() do something like:
>
> struct sg_lb_stats *sgs = &sds->sgs;
>
> if (raw_spin_trylock(&sds->sg_lock)) {
> struct sg_lb_stats tmp;
>
> ... collect tmp
>
> sds->sgs = tmp;
> raw_spin_unlock(&sds->sg_lock);
> }
>
> ... use sgs
>
> Then you know you've always got a 'recent' copy but avoid the concurrent
> updates.
Thanks for taking a look and gave the suggestions! Yes, this is a good idea, by
doing this we can further limit the number of CPU to do update in parallel, and
allow the newidle CPU to reuse the data for idle load balance from others.
This lock only allow 1 CPU in that domain to iterate the whole group, and the
bottleneck might reply on how fast the CPU who grabs the lock can finish
collecting the tmp sgs data. For MC domain, it would not take too much time, and for
higher domains between MC and NUMA domain, it depends on how many CPUs there are in that
domain. I'll create one prototype based on your suggestion and measure the test data.

thanks,
Chenyu

2023-06-23 14:52:46

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

On 2023-06-23 at 22:33:23 +0800, Chen Yu wrote:
> Hi Peter,
> On 2023-06-21 at 13:17:21 +0200, Peter Zijlstra wrote:
> > On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > > When CPU is about to enter idle, it invokes newidle_balance() to pull
> > > some tasks from other runqueues. Although there is per domain
> > > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > > good to further limit the scan based on overall system utilization.
> > > The reason is that there is no limitation for newidle_balance() to
> > > launch this balance simultaneously on multiple CPUs. Since each
> > > newidle_balance() has to traverse all the CPUs to calculate the
> > > statistics one by one, this total time cost on newidle_balance()
> > > could be O(n^2). This is not good for performance or power saving.
> >
> > Another possible solution is to keep struct sg_lb_stats in
> > sd->child->shared (below the NUMA domains) and put a lock around it.
> >
> > Then have update_sd_lb_stats() do something like:
> >
> > struct sg_lb_stats *sgs = &sds->sgs;
> >
> > if (raw_spin_trylock(&sds->sg_lock)) {
> > struct sg_lb_stats tmp;
> >
> > ... collect tmp
> >
> > sds->sgs = tmp;
> > raw_spin_unlock(&sds->sg_lock);
> > }
> >
> > ... use sgs
> >
> > Then you know you've always got a 'recent' copy but avoid the concurrent
> > updates.
> Thanks for taking a look and gave the suggestions! Yes, this is a good idea, by
> doing this we can further limit the number of CPU to do update in parallel, and
> allow the newidle CPU to reuse the data for idle load balance from others.
> This lock only allow 1 CPU in that domain to iterate the whole group, and the
> bottleneck might reply on how fast the CPU who grabs the lock can finish
> collecting the tmp sgs data. For MC domain, it would not take too much time, and for
> higher domains between MC and NUMA domain, it depends on how many CPUs there are in that
> domain.
I just realized that it's a trylock, so it should not block other CPUs who launch
the idle balance, but just to let 1 CPUs update the 'snapshot' at one time.
I'll do some tests.

thanks,
Chenyu
> I'll create one prototype based on your suggestion and measure the test data.
>
> thanks,
> Chenyu

2023-07-10 11:25:14

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hello Chenyu,

Thank you for sharing this extended version. Sharing the results from
testing below.

tl;dr

- tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.

- schbench (old) sees a regression in tail latency once system is heavily
loaded. DeathStarBench and SPECjbb too see a small drop under those
conditions.

- Rest of the benchmark results do not vary much.

On 6/21/2023 12:59 PM, Chen Yu wrote:
> Hi Gautham,
> On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
>> Hello Chen Yu,
>>
>>
>> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
>>> When CPU is about to enter idle, it invokes newidle_balance() to pull
>>> some tasks from other runqueues. Although there is per domain
>>> max_newidle_lb_cost to throttle the newidle_balance(), it would be
>>> good to further limit the scan based on overall system utilization.
>>> The reason is that there is no limitation for newidle_balance() to
>>> launch this balance simultaneously on multiple CPUs. Since each
>>> newidle_balance() has to traverse all the CPUs to calculate the
>>> statistics one by one, this total time cost on newidle_balance()
>>> could be O(n^2). This is not good for performance or power saving.
>>>
>>> For example, sqlite has spent quite some time on newidle balance()
>>> on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
>>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>>
>>> Based on this observation, limit the scan depth of newidle_balance()
>>> by considering the utilization of the LLC domain. Let the number of
>>> scanned groups be a linear function of the utilization ratio:
>>>
>>
>> Is there any particular reason why this is being limited only to the
>> LLC domain ?
>>
>> On architectures where the LLC domain may not be so large (POWER9/10,
>> AMD), the additional cost is usually paid at the higher domains where
>> the number of groups is greater / equal to the number of groups in the
>> LLC domain and where sd_span is pretty large. It would be good to
>> explore avoiding the scan cost on those domains as well, right?
>>
>>> nr_groups_to_scan = nr_groups * (1 - util_ratio)
>>
>> If we can extend this logic to higher domains, on a Zen3 1 Socket
>> server with 128 CPUs at the DIE domain containing 8 groups, we can
>> expect a significant reduction in the time spent doing
>> update_sg_lb_stats() at higher utilizations.
>>
>> util_ratio nr_groups_to_scan nr_cpus_scanned
>> ========================================================
>> 0.9 1 16 (-87.5%)
>> 0.75 2 32 (-75%)
>> 0.5 4 64 (-50%)
>> 0.25 6 96 (-25%)
>> 0.1 7 112 (-12.5%)
>>
>>
>> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
>> 12 groups, values will be:
>>
>> util_ratio nr_groups_to_scan nr_cpus_scanned
>> ========================================================
>> 0.9 1 16 (-91%)
>> 0.75 3 48 (-75%)
>> 0.5 6 96 (-50%)
>> 0.25 9 144 (-25%)
>> 0.1 10 160 (-16.7%)
>>
> I have an idea to limit scan depth for newidle balance for big domains.
> These domains should have CPUs higher than/equals to LLC(MC domain).
> However it seems that in current kernel only domain with SD_SHARE_PKG_RESOURCES
> flag set will have the shared struct sched_domain_shared among the CPUs in this
> domain. And this is reasonable because the cost to access the struct sched_domain_shared
> is lower if the CPUs share cache. Since ILB_UTIL relies on the sched_domain_shared
> to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
> during sched_domain_shared assignment.
> If non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
> the overhead should be not too high(only periodic load balance would
> write to sched_domain_shared). Here is a untest patch which shows what
> I'm thinking of, and I'll do some refinement based on this:
>
> thanks,
> Chenyu
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 67b573d5bf28..ce7ffbb7b3f8 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,6 +82,10 @@ struct sched_domain_shared {
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> int nr_idle_scan;
> + /* ilb scan depth and load balance statistic snapshot */
> + int ilb_nr_scan;
> + unsigned long ilb_total_load;
> + unsigned long ilb_total_capacity;
> };
>
> struct sched_domain {
> @@ -152,6 +156,7 @@ struct sched_domain {
> struct sched_domain_shared *shared;
>
> unsigned int span_weight;
> + unsigned int nr_groups;
> /*
> * Span of all CPUs in this domain.
> *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d724215826ae..34619dbb2f4e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10162,6 +10162,54 @@ static void update_idle_cpu_scan(struct lb_env *env,
> WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
> }
>
> +/*
> + * Get the domain shared information of dst CPU.
> + */
> +static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
> +{
> + /*
> + * Do not consider the domains smaller than LLC because those
> + * small domains have low cost on idle load balance.
> + */
> + if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
> + return NULL;
> +
> + return env->sd->shared;
> +}
> +
> +static void update_ilb_group_scan(struct lb_env *env,
> + unsigned long sum_util,
> + struct sched_domain_shared *sd_share,
> + struct sd_lb_stats *sds)
> +{
> + u64 tmp, nr_scan;
> +
> + if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
> + return;
> +
> + if(!sd_share)
> + return;
> + /*
> + * Limit the newidle balance scan depth based on overall system
> + * utilization:
> + * nr_groups_scan = nr_groups * (1 - util_ratio)
> + * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> + */
> + nr_scan = env->sd->nr_groups * sum_util;
> + tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> + do_div(nr_scan, tmp);
> + nr_scan = env->sd->nr_groups - nr_scan;
> + if ((int)nr_scan != sd_share->ilb_nr_scan)
> + WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
> +
> + /* save the statistic snapshot of the periodic load balance */
> + if (sds->total_load != sd_share->ilb_total_load)
> + WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
> +
> + if (sds->total_capacity != sd_share->ilb_total_capacity)
> + WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
> +}
> +
> /**
> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10170,11 +10218,17 @@ static void update_idle_cpu_scan(struct lb_env *env,
>
> static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
> {
> + struct sched_domain_shared *sd_share = get_sd_shared(env);
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> unsigned long sum_util = 0;
> - int sg_status = 0;
> + int sg_status = 0, nr_scan_ilb;
> + bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
> + sd_share && READ_ONCE(sd_share->ilb_total_capacity);
> +
> + if (ilb_util_enabled)
> + nr_scan_ilb = sd_share->ilb_nr_scan;
>
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10192,6 +10246,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
>
> + if (ilb_util_enabled && --nr_scan_ilb <= 0) {
> + /*
> + * Borrow the statistic of previous periodic load balance.
> + * The data might be stale and it is a trade-off.
> + */
> + sds->total_load = READ_ONCE(sd_share->ilb_total_load);
> + sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
> +
> + break;
> + }
> +
> if (local_group)
> goto next_group;
>
> @@ -10239,6 +10304,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
>
> update_idle_cpu_scan(env, sum_util);
> + update_ilb_group_scan(env, sum_util, sd_share, sds);
> }
>
> /**
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..8f6e5b08408d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>
> SCHED_FEAT(RT_RUNTIME_SHARE, false)
> SCHED_FEAT(LB_MIN, false)
> +SCHED_FEAT(ILB_UTIL, true)
> SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> SCHED_FEAT(WA_IDLE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..98bfac5f7836 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> struct cpumask *covered = sched_domains_tmpmask;
> struct sd_data *sdd = sd->private;
> struct sched_domain *sibling;
> - int i;
> + int i, nr_groups = 0;
>
> cpumask_clear(covered);
>
> @@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> if (!sg)
> goto fail;
>
> + nr_groups++;
> +
> sg_span = sched_group_span(sg);
> cpumask_or(covered, covered, sg_span);
>
> @@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> last->next = first;
> }
> sd->groups = first;
> + sd->nr_groups = nr_groups;
>
> return 0;
>
> @@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> struct sd_data *sdd = sd->private;
> const struct cpumask *span = sched_domain_span(sd);
> struct cpumask *covered;
> - int i;
> + int i, nr_groups = 0;
>
> lockdep_assert_held(&sched_domains_mutex);
> covered = sched_domains_tmpmask;
> @@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>
> sg = get_group(i, sdd);
>
> + nr_groups++;
> +
> cpumask_or(covered, covered, sched_group_span(sg));
>
> if (!first)
> @@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> }
> last->next = first;
> sd->groups = first;
> + sd->nr_groups = nr_groups;
>
> return 0;
> }
> @@ -1641,14 +1647,12 @@ sd_init(struct sched_domain_topology_level *tl,
> }
>
> /*
> - * For all levels sharing cache; connect a sched_domain_shared
> + * For all levels; connect a sched_domain_shared
> * instance.
> */
> - if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> + sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> + atomic_inc(&sd->shared->ref);
> + atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>
> sd->private = sdd;
>

o System Details

Dual Socket 3rd Generation EPYC System (2 x 64C/128T)

o NPS Modes

NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255

o Kernel Versions

- tip - tip:sched/core at commit e2a1f85bf9f5 "sched/psi:
Avoid resetting the min update period when it is
unnecessary")

- ILB_UTIL - tip:sched/core + this patch

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip ILB_UTIL
1-groups: 3.92 (0.00 pct) 3.66 (6.63 pct)
2-groups: 4.58 (0.00 pct) 4.18 (8.73 pct)
4-groups: 4.99 (0.00 pct) 4.46 (10.62 pct)
8-groups: 5.67 (0.00 pct) 5.39 (4.93 pct)
16-groups: 7.88 (0.00 pct) 10.43 (-32.36 pct)

o NPS2

Test: tip ILB_UTIL
1-groups: 3.82 (0.00 pct) 3.59 (6.02 pct)
2-groups: 4.40 (0.00 pct) 4.08 (7.27 pct)
4-groups: 4.84 (0.00 pct) 4.44 (8.26 pct)
8-groups: 5.45 (0.00 pct) 6.32 (-15.96 pct)
16-groups: 6.94 (0.00 pct) 11.71 (-68.73 pct)

o NPS4

Test: tip ILB_UTIL
1-groups: 3.82 (0.00 pct) 3.65 (4.45 pct)
2-groups: 4.44 (0.00 pct) 4.19 (5.63 pct)
4-groups: 4.86 (0.00 pct) 4.60 (5.34 pct)
8-groups: 5.42 (0.00 pct) 5.36 (1.10 pct)
16-groups: 6.68 (0.00 pct) 10.09 (-51.04 pct)

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers: tip ILB_UTIL
1: 26.00 (0.00 pct) 26.00 (0.00 pct)
2: 27.00 (0.00 pct) 28.00 (-3.70 pct)
4: 31.00 (0.00 pct) 27.00 (12.90 pct)
8: 36.00 (0.00 pct) 40.00 (-11.11 pct)
16: 49.00 (0.00 pct) 50.00 (-2.04 pct)
32: 80.00 (0.00 pct) 80.00 (0.00 pct)
64: 169.00 (0.00 pct) 170.00 (-0.59 pct)
128: 343.00 (0.00 pct) 338.00 (1.45 pct)
256: 42048.00 (0.00 pct) 45760.00 (-8.82 pct)
512: 95104.00 (0.00 pct) 109696.00 (-15.34 pct)

o NPS2

#workers: tip ILB_UTIL
1: 23.00 (0.00 pct) 21.00 (8.69 pct)
2: 24.00 (0.00 pct) 25.00 (-4.16 pct)
4: 31.00 (0.00 pct) 29.00 (6.45 pct)
8: 41.00 (0.00 pct) 43.00 (-4.87 pct)
16: 48.00 (0.00 pct) 50.00 (-4.16 pct)
32: 81.00 (0.00 pct) 81.00 (0.00 pct)
64: 157.00 (0.00 pct) 180.00 (-14.64 pct)
128: 386.00 (0.00 pct) 385.00 (0.25 pct)
256: 48832.00 (0.00 pct) 52032.00 (-6.55 pct)
512: 92032.00 (0.00 pct) 113024.00 (-22.80 pct)

o NPS4

#workers: tip ILB_UTIL
1: 21.00 (0.00 pct) 23.00 (-9.52 pct)
2: 28.00 (0.00 pct) 30.00 (-7.14 pct)
4: 32.00 (0.00 pct) 33.00 (-3.12 pct)
8: 46.00 (0.00 pct) 51.00 (-10.86 pct)
16: 51.00 (0.00 pct) 54.00 (-5.88 pct)
32: 82.00 (0.00 pct) 88.00 (-7.31 pct)
64: 173.00 (0.00 pct) 175.00 (-1.15 pct)
128: 396.00 (0.00 pct) 387.00 (2.27 pct)
256: 48832.00 (0.00 pct) 46912.00 (3.93 pct)
512: 95104.00 (0.00 pct) 110720.00 (-16.41 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip ILB_UTIL
1 452.49 (0.00 pct) 449.93 (-0.56 pct)
2 862.44 (0.00 pct) 875.04 (1.46 pct)
4 1604.27 (0.00 pct) 1626.23 (1.36 pct)
8 2966.77 (0.00 pct) 3036.80 (2.36 pct)
16 5176.70 (0.00 pct) 5402.88 (4.36 pct)
32 8205.24 (0.00 pct) 9256.48 (12.81 pct)
64 13956.71 (0.00 pct) 15581.58 (11.64 pct)
128 24005.50 (0.00 pct) 24782.63 (3.23 pct)
256 32457.61 (0.00 pct) 30810.66 (-5.07 pct)
512 34345.24 (0.00 pct) 40971.90 (19.29 pct)
1024 33432.92 (0.00 pct) 41604.06 (24.44 pct)

o NPS2

Clients: tip ILB_UTIL
1 453.73 (0.00 pct) 444.72 (-1.98 pct)
2 861.71 (0.00 pct) 853.67 (-0.93 pct)
4 1599.14 (0.00 pct) 1573.69 (-1.59 pct)
8 2951.03 (0.00 pct) 3021.87 (2.40 pct)
16 5080.32 (0.00 pct) 5464.64 (7.56 pct)
32 7900.41 (0.00 pct) 10304.44 (30.42 pct)
64 14629.65 (0.00 pct) 17083.33 (16.77 pct)
128 23155.88 (0.00 pct) 25278.86 (9.16 pct)
256 33449.57 (0.00 pct) 32964.11 (-1.45 pct)
512 33757.47 (0.00 pct) 40951.04 (21.30 pct)
1024 34823.14 (0.00 pct) 41737.76 (19.85 pct)

o NPS4

Clients: tip ILB_UTIL
1 450.14 (0.00 pct) 451.88 (0.38 pct)
2 863.26 (0.00 pct) 864.96 (0.19 pct)
4 1618.71 (0.00 pct) 1632.00 (0.82 pct)
8 2929.35 (0.00 pct) 3071.80 (4.86 pct)
16 5114.04 (0.00 pct) 5373.74 (5.07 pct)
32 7912.18 (0.00 pct) 8830.49 (11.60 pct)
64 14424.72 (0.00 pct) 15598.13 (8.13 pct)
128 23614.97 (0.00 pct) 24563.76 (4.01 pct)
256 34365.13 (0.00 pct) 32096.70 (-6.60 pct)
512 34215.50 (0.00 pct) 42068.49 (22.95 pct)
1024 35421.90 (0.00 pct) 42230.56 (19.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test: tip ILB_UTIL
Copy: 271317.35 (0.00 pct) 304210.62 (12.12 pct)
Scale: 205533.77 (0.00 pct) 204155.75 (-0.67 pct)
Add: 221624.62 (0.00 pct) 228757.07 (3.21 pct)
Triad: 228500.68 (0.00 pct) 236454.48 (3.48 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 317381.65 (0.00 pct) 321587.90 (1.32 pct)
Scale: 214145.00 (0.00 pct) 211397.70 (-1.28 pct)
Add: 239243.29 (0.00 pct) 235497.67 (-1.56 pct)
Triad: 249477.76 (0.00 pct) 240764.14 (-3.49 pct)

o NPS2

- 10 Runs:

Test: tip ILB_UTIL
Copy: 277761.29 (0.00 pct) 279582.97 (0.65 pct)
Scale: 215193.83 (0.00 pct) 203628.71 (-5.37 pct)
Add: 242725.75 (0.00 pct) 232522.80 (-4.20 pct)
Triad: 237253.44 (0.00 pct) 245716.42 (3.56 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 318082.10 (0.00 pct) 320640.80 (0.80 pct)
Scale: 219338.56 (0.00 pct) 222158.47 (1.28 pct)
Add: 248118.20 (0.00 pct) 254163.15 (2.43 pct)
Triad: 247088.55 (0.00 pct) 252459.53 (2.17 pct)

o NPS4

- 10 Runs:

Test: tip ILB_UTIL
Copy: 273307.14 (0.00 pct) 269979.40 (-1.21 pct)
Scale: 235715.23 (0.00 pct) 225429.20 (-4.36 pct)
Add: 244500.40 (0.00 pct) 227988.81 (-6.75 pct)
Triad: 250600.04 (0.00 pct) 234012.67 (-6.61 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 345396.19 (0.00 pct) 335548.25 (-2.85 pct)
Scale: 241521.63 (0.00 pct) 228991.04 (-5.18 pct)
Add: 261157.86 (0.00 pct) 247020.34 (-5.41 pct)
Triad: 267804.99 (0.00 pct) 258260.01 (-3.56 pct)

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

Test: tip ILB_UTIL
1-clients: 102839.97 (0.00 pct) 101826.77 (-0.98 pct)
2-clients: 98428.08 (0.00 pct) 98563.25 (0.13 pct)
4-clients: 92298.45 (0.00 pct) 95310.26 (3.26 pct)
8-clients: 85618.41 (0.00 pct) 87859.85 (2.61 pct)
16-clients: 78722.18 (0.00 pct) 79430.42 (0.89 pct)
32-clients: 73610.75 (0.00 pct) 76459.08 (3.86 pct)
64-clients: 55285.07 (0.00 pct) 64071.43 (15.89 pct)
128-clients: 31176.92 (0.00 pct) 37287.20 (19.59 pct)
256-clients: 20011.44 (0.00 pct) 31243.73 (56.12 pct)

o NPS2

Test: tip ILB_UTIL
1-clients: 103105.55 (0.00 pct) 99162.95 (-3.82 pct)
2-clients: 98720.29 (0.00 pct) 96055.84 (-2.69 pct)
4-clients: 92289.39 (0.00 pct) 92818.61 (0.57 pct)
8-clients: 84998.63 (0.00 pct) 86693.17 (1.99 pct)
16-clients: 76395.81 (0.00 pct) 77137.01 (0.97 pct)
32-clients: 71110.89 (0.00 pct) 70154.80 (-1.34 pct)
64-clients: 49526.21 (0.00 pct) 55032.79 (11.11 pct)
128-clients: 27917.51 (0.00 pct) 36377.03 (30.30 pct)
256-clients: 20067.17 (0.00 pct) 27607.78 (37.57 pct)

o NPS4

Test: tip ILB_UTIL
1-clients: 102139.49 (0.00 pct) 103414.93 (1.24 pct)
2-clients: 98259.53 (0.00 pct) 101472.40 (3.26 pct)
4-clients: 91576.79 (0.00 pct) 96917.69 (5.83 pct)
8-clients: 84742.30 (0.00 pct) 90389.72 (6.66 pct)
16-clients: 79540.75 (0.00 pct) 85183.23 (7.09 pct)
32-clients: 71166.14 (0.00 pct) 78511.48 (10.32 pct)
64-clients: 51763.24 (0.00 pct) 61334.30 (18.49 pct)
128-clients: 27829.29 (0.00 pct) 35989.34 (29.32 pct)
256-clients: 24185.37 (0.00 pct) 35769.17 (47.89 pct)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 41202944.91 ( -0.29%)
Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6193511930.01 * -0.94%*
Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2558258.57 * -2.27%*
Amean unixbench-syscall-512 8120524.00 ( 0.00%) 8014692.00 * 1.30%*
Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2395716.82 * 1.54%*
Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 339462110.52 ( 0.20%)
Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4786.09 * 12.31%*
Hmean unixbench-spawn-512 64328.93 ( 0.00%) 68328.36 * 6.22%*
Hmean unixbench-execl-1 3677.73 ( 0.00%) 3671.96 ( -0.16%)
Hmean unixbench-execl-512 11984.83 ( 0.00%) 13272.01 ( 10.74%)

o NPS2

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41311787.29 ( 0.00%) 41209738.92 ( -0.25%)
Hmean unixbench-dhry2reg-512 6243873272.76 ( 0.00%) 6198007442.15 * -0.73%*
Amean unixbench-syscall-1 2503190.70 ( 0.00%) 2559295.30 * -2.24%*
Amean unixbench-syscall-512 8012388.13 ( 0.00%) 7984268.83 * 0.35%*
Hmean unixbench-pipe-1 2340486.25 ( 0.00%) 2395174.42 * 2.34%*
Hmean unixbench-pipe-512 338965319.79 ( 0.00%) 339972146.39 ( 0.30%)
Hmean unixbench-spawn-1 5241.83 ( 0.00%) 5041.98 * -3.81%*
Hmean unixbench-spawn-512 65799.86 ( 0.00%) 68871.88 * 4.67%*
Hmean unixbench-execl-1 3670.65 ( 0.00%) 3659.10 ( -0.31%)
Hmean unixbench-execl-512 13682.00 ( 0.00%) 13984.58 ( 2.21%)

o NPS4

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41025577.99 ( 0.00%) 41039940.89 ( 0.04%)
Hmean unixbench-dhry2reg-512 6255568261.91 ( 0.00%) 6216198481.97 * -0.63%*
Amean unixbench-syscall-1 2507165.37 ( 0.00%) 2553468.33 * -1.85%*
Amean unixbench-syscall-512 7458476.50 ( 0.00%) 7483366.27 * -0.33%*
Hmean unixbench-pipe-1 2369301.21 ( 0.00%) 2397653.84 * 1.20%*
Hmean unixbench-pipe-512 340299405.72 ( 0.00%) 340332182.64 ( 0.01%)
Hmean unixbench-spawn-1 5571.78 ( 0.00%) 5389.50 ( -3.27%)
Hmean unixbench-spawn-512 63999.96 ( 0.00%) 68343.41 * 6.79%*
Hmean unixbench-execl-1 3587.15 ( 0.00%) 3628.48 * 1.15%*
Hmean unixbench-execl-512 14184.17 ( 0.00%) 13720.55 ( -3.27%)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

base: 298681.00 (var: 2.31%)
ILB_UTIL 292352.67 (var: 3.31%) (-2.11%)

o NPS2:

base: 296570.00 (var: 1.01%)
ILB_UTIL 298804.67 (var: 1.50%) (0.75%)

o NPS4:

base 297181.67 (var: 0.46%)
ILB_UTIL 297495.00 (var: 0.33%) (0.10%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1:

- 1 CCD

base: 1.00 (var: 0.27%)
ILB_UTIL: 1.03 (var: 0.16%) (+3.31%)

- 2 CCD

base: 1.00 (var: 0.42%)
ILB_UTIL: 1.01 (var: 0.19%) (+1.48%)

- 4 CCD

base: 1.00 (var: 0.46%)
ILB_UTIL: 0.98 (var: 0.17%) (-2.00%)

- 8 CCD

base: 1.00 (var: 0.63%)
ILB_UTIL: 0.96 (var: 0.46%) (-3.79%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb2015 - multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

max-jOPS 1.00 0.99 (-1.11%)
critical-jOPS 1.00 0.99 (-1.06%)

--

I have a couple of theories:

o Either new_idle_balance is failing to find an overloaded busy rq as a
result of the limit.

o Or, there is a chain reaction where pulling from a loaded rq which is not
the most loaded, will lead to more new_idle_balancing attempts which is
degrading performance.

I'll go back and get some data to narrow down the cause. Meanwhile if
there is any specific benchmark you would like me to run on the test
system, please do let me know.

--
Thanks and Regards,
Prateek

2023-07-10 16:33:00

by Chen Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hi Prateek,

thanks for testing this patch,
On 2023-07-10 at 16:36:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> Thank you for sharing this extended version. Sharing the results from
> testing below.
>
> tl;dr
>
> - tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.
>
> - schbench (old) sees a regression in tail latency once system is heavily
> loaded. DeathStarBench and SPECjbb too see a small drop under those
> conditions.
>
> - Rest of the benchmark results do not vary much.
>
>
[...]
> I have a couple of theories:
>
Thanks for the insights, I agree the risk you mentioned below could impact the
performance. Some thoughts below:
> o Either new_idle_balance is failing to find an overloaded busy rq as a
> result of the limit.
>
If the system is overloaded, the ilb_util finds a relatively busy rq and pulls from it.
There could be no much difference between a relatively busy rq and the busiest one,
because all rqs are quite busy I suppose?
> o Or, there is a chain reaction where pulling from a loaded rq which is not
> the most loaded, will lead to more new_idle_balancing attempts which is
> degrading performance.
Yeah, it could be possible that the ilb_util finds a relatively busy rq, but the
imbalance is not high so ilb decides not pull from it. However the busiest
rq is still waiting for someone to help, and this could trigger idle load
balance more frequently.
>
> I'll go back and get some data to narrow down the cause. Meanwhile if
> there is any specific benchmark you would like me to run on the test
> system, please do let me know.
>
Another hints might be that, as Gautham and Peter suggested, we should apply ILB_UTIL
to non-Numa domains. In above patch all the domains has sd_share which could
bring negative impact when accessing/writing cross-node data.
Sorry I did not post the latest version with Numa domain excluded previously as
I was trying to create a protype to further speed up idle load balance by
reusing the statistic suggested by Peter. I will send them out once it is stable.

Thanks again,
Chenyu
> --
> Thanks and Regards,
> Prateek

2023-07-11 03:01:53

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

Hello Chenyu,

Thank you for taking a look at the results.

On 7/10/2023 9:07 PM, Chen Yu wrote:
> Hi Prateek,
>
> thanks for testing this patch,
> On 2023-07-10 at 16:36:47 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> Thank you for sharing this extended version. Sharing the results from
>> testing below.
>>
>> tl;dr
>>
>> - tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.
>>
>> - schbench (old) sees a regression in tail latency once system is heavily
>> loaded. DeathStarBench and SPECjbb too see a small drop under those
>> conditions.
>>
>> - Rest of the benchmark results do not vary much.
>>
>>
> [...]
>> I have a couple of theories:
>>
> Thanks for the insights, I agree the risk you mentioned below could impact the
> performance. Some thoughts below:
>> o Either new_idle_balance is failing to find an overloaded busy rq as a
>> result of the limit.
>>
> If the system is overloaded, the ilb_util finds a relatively busy rq and pulls from it.
> There could be no much difference between a relatively busy rq and the busiest one,
> because all rqs are quite busy I suppose?
>> o Or, there is a chain reaction where pulling from a loaded rq which is not
>> the most loaded, will lead to more new_idle_balancing attempts which is
>> degrading performance.
> Yeah, it could be possible that the ilb_util finds a relatively busy rq, but the
> imbalance is not high so ilb decides not pull from it. However the busiest
> rq is still waiting for someone to help, and this could trigger idle load
> balance more frequently.

Yes, I was thinking along the similar lines.

>>
>> I'll go back and get some data to narrow down the cause. Meanwhile if
>> there is any specific benchmark you would like me to run on the test
>> system, please do let me know.
>>
> Another hints might be that, as Gautham and Peter suggested, we should apply ILB_UTIL
> to non-Numa domains. In above patch all the domains has sd_share which could
> bring negative impact when accessing/writing cross-node data.
> Sorry I did not post the latest version with Numa domain excluded previously as
> I was trying to create a protype to further speed up idle load balance by
> reusing the statistic suggested by Peter. I will send them out once it is stable.

Sure. I'll keep an eye out for the next version :)

>
> [..snip..]
>

--
Thanks and Regards,
Prateek