in
Hi Abel,
On Fri, May 6, 2022 at 1:21 AM Abel Wu <[email protected]> wrote:
>
> Try to improve searching efficiency of SIS by filtering out the
> overloaded cpus, and as a result the more overloaded the system
> is, the less cpus will be searched.
>
My understanding is that, this patch aims to address the following issue:
What kind of CPUs should the SIS scan.
And we also have another patch[1] from another angle:
How many CPUs should the SIS scan.
I assume the two direction could both help speed up the SIS process, so
I'm curious what the result would be with both patch applied, and I planned
to run your patch on my system too.
> The overloaded cpus are tracked through LLC shared domain. To
> regulate accesses to the shared data, the update happens mainly
> at the tick. But in order to make it more accurate, we also take
> the task migrations into consideration during load balancing which
> can be quite frequent due to short running workload causing newly-
> idle. Since an overloaded runqueue requires at least 2 non-idle
> tasks runnable, we could have more faith on the "frequent newly-
> idle" case.
>
> Benchmark
> =========
>
> Tests are done in an Intel(R) Xeon(R) Platinum 8260 [email protected]
> machine with 2 NUMA nodes each of which has 24 cores with SMT2
> enabled, so 96 CPUs in total.
>
> All of the benchmarks are done inside a normal cpu cgroup in a
Do you have any script that I can leverage to launch the test?
> clean environment with cpu turbo disabled.
I would recommend to apply the following patch(queued for 5.19) if
the intel_pstate driver is loaded, because it seems that there is a
utilization calculation
issue when turbo is disabled:
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 846bb3a78788..2216b24b6f84 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a,
struct kobj_attribute *b,
mutex_unlock(&intel_pstate_limits_lock);
intel_pstate_update_policies();
+ arch_set_max_freq_ratio(global.no_turbo);
mutex_unlock(&intel_pstate_driver_lock);
--
2.25.1
[cut]
>
> v3:
> - removed sched-idle balance feature and focus on SIS
> - take non-CFS tasks into consideration
> - several fixes/improvement suggested by Josh Don
>
> v2:
> - several optimizations on sched-idle balancing
> - ignore asym topos in can_migrate_task
> - add more benchmarks including SIS efficiency
> - re-organize patch as suggested by Mel
>
> v1: https://lore.kernel.org/lkml/[email protected]/
> v2: https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Abel Wu <[email protected]>
> ---
> include/linux/sched/topology.h | 12 ++++++++++
> kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
> kernel/sched/idle.c | 1 +
> kernel/sched/sched.h | 4 ++++
> kernel/sched/topology.c | 4 +++-
> 6 files changed, 92 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 56cffe42abbc..95c7ad1e05b5 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> +
> + /*
> + * Tracking of the overloaded cpus can be heavy, so start
> + * a new cacheline to avoid false sharing.
> + */
Although we put the following items into different cache line compared to
above ones, is it possible that there is still cache false sharing if
CPU1 is reading nr_overloaded_cpus while
CPU2 is updating overloaded_cpus?
> + atomic_t nr_overloaded_cpus ____cacheline_aligned;
____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
be more lightweight?
> + unsigned long overloaded_cpus[]; /* Must be last */
> };
>
[cut]
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..79b4ff24faee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
> {
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
> + struct sched_domain_shared *sds = sd->shared;
> + int nr, nro, weight = sd->span_weight;
> + int i, cpu, idle_cpu = -1;
> struct rq *this_rq = this_rq();
> int this = smp_processor_id();
> struct sched_domain *this_sd;
> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> if (!this_sd)
> return -1;
>
> + nro = atomic_read(&sds->nr_overloaded_cpus);
> + if (nro == weight)
> + goto out;
> +
> + nr = min_t(int, weight, p->nr_cpus_allowed);
> +
> + /*
> + * It's unlikely to find an idle cpu if the system is under
> + * heavy pressure, so skip searching to save a few cycles
> + * and relieve cache traffic.
> + */
> + if (weight - nro < (nr >> 4) && !has_idle_core)
> + return -1;
In [1] we used util_avg to check if the domain is overloaded and quit
earlier, since util_avg would be
more stable and contains historic data. But I think nr_running in your
patch could be used as
complementary metric and added to update_idle_cpu_scan() in [1] IMO.
> +
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> + if (nro > 1)
> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
If I understand correctly, this is the core of the optimization: SIS
filters out the busy cores. I wonder if it
is possible to save historic h_nr_running/idle_h_nr_running and use
the average value? (like the calculation
of avg_scan_cost).
>
> if (sched_feat(SIS_PROP) && !has_idle_core) {
> u64 avg_cost, avg_idle, span_avg;
> @@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> avg_idle = this_rq->wake_avg_idle;
> avg_cost = this_sd->avg_scan_cost + 1;
>
> - span_avg = sd->span_weight * avg_idle;
> + span_avg = weight * avg_idle;
> if (span_avg > 4*avg_cost)
> nr = div_u64(span_avg, avg_cost);
> else
[cut]
[1] https://lore.kernel.org/lkml/[email protected]/
--
Thanks,
Chenyu
Hi Chen,
On 5/8/22 12:09 AM, Chen Yu Wrote:
> in
> Hi Abel,
> On Fri, May 6, 2022 at 1:21 AM Abel Wu <[email protected]> wrote:
>>
>> Try to improve searching efficiency of SIS by filtering out the
>> overloaded cpus, and as a result the more overloaded the system
>> is, the less cpus will be searched.
>>
> My understanding is that, this patch aims to address the following issue:
> What kind of CPUs should the SIS scan.
> And we also have another patch[1] from another angle:
> How many CPUs should the SIS scan.
> I assume the two direction could both help speed up the SIS process, so
Agreed.
> I'm curious what the result would be with both patch applied, and I planned
> to run your patch on my system too.
>> The overloaded cpus are tracked through LLC shared domain. To
>> regulate accesses to the shared data, the update happens mainly
>> at the tick. But in order to make it more accurate, we also take
>> the task migrations into consideration during load balancing which
>> can be quite frequent due to short running workload causing newly-
>> idle. Since an overloaded runqueue requires at least 2 non-idle
>> tasks runnable, we could have more faith on the "frequent newly-
>> idle" case.
>>
>> Benchmark
>> =========
>>
>> Tests are done in an Intel(R) Xeon(R) Platinum 8260 [email protected]
>> machine with 2 NUMA nodes each of which has 24 cores with SMT2
>> enabled, so 96 CPUs in total.
>>
>> All of the benchmarks are done inside a normal cpu cgroup in a
> Do you have any script that I can leverage to launch the test?
I benchmarked the following configs in mmtests:
config-scheduler-unbound
config-scheduler-schbench
config-network-netperf-stream-unbound
config-network-tbench
more details: https://github.com/gormanm/mmtests
>> clean environment with cpu turbo disabled.
> I would recommend to apply the following patch(queued for 5.19) if
> the intel_pstate driver is loaded, because it seems that there is a
> utilization calculation
> issue when turbo is disabled:
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 846bb3a78788..2216b24b6f84 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a,
> struct kobj_attribute *b,
> mutex_unlock(&intel_pstate_limits_lock);
>
> intel_pstate_update_policies();
> + arch_set_max_freq_ratio(global.no_turbo);
>
> mutex_unlock(&intel_pstate_driver_lock);
>
> --
> 2.25.1
> [cut]
Thanks, I will apply it before next rounds of testing.
>>
>> v3:
>> - removed sched-idle balance feature and focus on SIS
>> - take non-CFS tasks into consideration
>> - several fixes/improvement suggested by Josh Don
>>
>> v2:
>> - several optimizations on sched-idle balancing
>> - ignore asym topos in can_migrate_task
>> - add more benchmarks including SIS efficiency
>> - re-organize patch as suggested by Mel
>>
>> v1: https://lore.kernel.org/lkml/[email protected]/
>> v2: https://lore.kernel.org/lkml/[email protected]/
>>
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> include/linux/sched/topology.h | 12 ++++++++++
>> kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
>> kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
>> kernel/sched/idle.c | 1 +
>> kernel/sched/sched.h | 4 ++++
>> kernel/sched/topology.c | 4 +++-
>> 6 files changed, 92 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 56cffe42abbc..95c7ad1e05b5 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -81,8 +81,20 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> +
>> + /*
>> + * Tracking of the overloaded cpus can be heavy, so start
>> + * a new cacheline to avoid false sharing.
>> + */
> Although we put the following items into different cache line compared to
> above ones, is it possible that there is still cache false sharing if
> CPU1 is reading nr_overloaded_cpus while
> CPU2 is updating overloaded_cpus?
I think it's not false sharing, it's just cache contention. But yes,
it is still possible if the two items mixed with others (by compiler)
in one cacheline, which seems out of our control..
>> + atomic_t nr_overloaded_cpus ____cacheline_aligned;
> ____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
> seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
> be more lightweight?
I didn't see the difference of the two macros, it would be appreciate
if you can shed some light.
>> + unsigned long overloaded_cpus[]; /* Must be last */
>> };
>>
> [cut]
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..79b4ff24faee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>> static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>> {
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> - int i, cpu, idle_cpu = -1, nr = INT_MAX;
>> + struct sched_domain_shared *sds = sd->shared;
>> + int nr, nro, weight = sd->span_weight;
>> + int i, cpu, idle_cpu = -1;
>> struct rq *this_rq = this_rq();
>> int this = smp_processor_id();
>> struct sched_domain *this_sd;
>> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> if (!this_sd)
>> return -1;
>>
>> + nro = atomic_read(&sds->nr_overloaded_cpus);
>> + if (nro == weight)
>> + goto out;
>> +
>> + nr = min_t(int, weight, p->nr_cpus_allowed);
>> +
>> + /*
>> + * It's unlikely to find an idle cpu if the system is under
>> + * heavy pressure, so skip searching to save a few cycles
>> + * and relieve cache traffic.
>> + */
>> + if (weight - nro < (nr >> 4) && !has_idle_core)
>> + return -1;
> In [1] we used util_avg to check if the domain is overloaded and quit
> earlier, since util_avg would be
> more stable and contains historic data. But I think nr_running in your
> patch could be used as
> complementary metric and added to update_idle_cpu_scan() in [1] IMO.
>> +
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>> + if (nro > 1)
>> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
> If I understand correctly, this is the core of the optimization: SIS
> filters out the busy cores. I wonder if it
> is possible to save historic h_nr_running/idle_h_nr_running and use
> the average value? (like the calculation
> of avg_scan_cost).
Yes, I have been already working on that for several days, and
along with some improvement on load balance (group_has_spare).
Ideally we can finally get rid out of the cache issues.
>
>>
>> if (sched_feat(SIS_PROP) && !has_idle_core) {
>> u64 avg_cost, avg_idle, span_avg;
>> @@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> avg_idle = this_rq->wake_avg_idle;
>> avg_cost = this_sd->avg_scan_cost + 1;
>>
>> - span_avg = sd->span_weight * avg_idle;
>> + span_avg = weight * avg_idle;
>> if (span_avg > 4*avg_cost)
>> nr = div_u64(span_avg, avg_cost);
>> else
> [cut]
> [1] https://lore.kernel.org/lkml/[email protected]/
>
I followed all 3 versions of your patch, and I think your work makes
sense. My only concern was that the depth is updated every llc_size
milliseconds which could be a long period in large machines and the
load can vary quickly enough to deviate from the historic value. But
it seems not a big deal as we discussed in your v1 patch.
Thanks & BR,
Abel
On Sun, May 8, 2022 at 1:50 AM Abel Wu <[email protected]> wrote:
>
> Hi Chen,
>
> On 5/8/22 12:09 AM, Chen Yu Wrote:
[cut]
> >> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> >> atomic_t ref;
> >> atomic_t nr_busy_cpus;
> >> int has_idle_cores;
> >> +
> >> + /*
> >> + * Tracking of the overloaded cpus can be heavy, so start
> >> + * a new cacheline to avoid false sharing.
> >> + */
> > Although we put the following items into different cache line compared to
> > above ones, is it possible that there is still cache false sharing if
> > CPU1 is reading nr_overloaded_cpus while
> > CPU2 is updating overloaded_cpus?
>
> I think it's not false sharing, it's just cache contention. But yes,
> it is still possible if the two items mixed with others (by compiler)
> in one cacheline, which seems out of our control..
>
My understanding is that, since nr_overloaded_cpus starts with a new
cache line, overloaded_cpus is very likely to be in the same cache line.
Only If the write to nr_overloaded_cpus mask is not frequent(maybe tick based
update is not frequent), the read of nr_overloaded_cpus can survive from cache
false sharing, which is mainly read by SIS. I have a stupid thought
that if nr_overloaded_cpus
mask and nr_overloaded_cpus could be put to 2 cache lines.
> >> + atomic_t nr_overloaded_cpus ____cacheline_aligned;
> > ____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
> > seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
> > be more lightweight?
>
> I didn't see the difference of the two macros, it would be appreciate
> if you can shed some light.
>
Sorry I mistook ____cacheline_aligned for __cacheline_aligned which is
put into a data section. Please ignore my previous comment.
> >> + unsigned long overloaded_cpus[]; /* Must be last */
> >> };
> >>
[cut]
> >> + /*
> >> + * It's unlikely to find an idle cpu if the system is under
> >> + * heavy pressure, so skip searching to save a few cycles
> >> + * and relieve cache traffic.
> >> + */
> >> + if (weight - nro < (nr >> 4) && !has_idle_core)
> >> + return -1;
> > In [1] we used util_avg to check if the domain is overloaded and quit
> > earlier, since util_avg would be
> > more stable and contains historic data. But I think nr_running in your
> > patch could be used as
> > complementary metric and added to update_idle_cpu_scan() in [1] IMO.
> >> +
> >> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> + if (nro > 1)
> >> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
> > If I understand correctly, this is the core of the optimization: SIS
> > filters out the busy cores. I wonder if it
> > is possible to save historic h_nr_running/idle_h_nr_running and use
> > the average value? (like the calculation
> > of avg_scan_cost).
>
> Yes, I have been already working on that for several days, and
> along with some improvement on load balance (group_has_spare).
> Ideally we can finally get rid out of the cache issues.
>
Ok, could you please also Cc me in the next version? I'd like to have
a try.
--
Thanks,
Chenyu
On Mon, May 9, 2022 at 11:21 PM Chen Yu <[email protected]> wrote:
>
> On Sun, May 8, 2022 at 1:50 AM Abel Wu <[email protected]> wrote:
> >
> > Hi Chen,
> >
> > On 5/8/22 12:09 AM, Chen Yu Wrote:
> [cut]
> > >> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> > >> atomic_t ref;
> > >> atomic_t nr_busy_cpus;
> > >> int has_idle_cores;
> > >> +
> > >> + /*
> > >> + * Tracking of the overloaded cpus can be heavy, so start
> > >> + * a new cacheline to avoid false sharing.
> > >> + */
> > > Although we put the following items into different cache line compared to
> > > above ones, is it possible that there is still cache false sharing if
> > > CPU1 is reading nr_overloaded_cpus while
> > > CPU2 is updating overloaded_cpus?
> >
> > I think it's not false sharing, it's just cache contention. But yes,
> > it is still possible if the two items mixed with others (by compiler)
> > in one cacheline, which seems out of our control..
> >
> My understanding is that, since nr_overloaded_cpus starts with a new
> cache line, overloaded_cpus is very likely to be in the same cache line.
> Only If the write to nr_overloaded_cpus mask is not frequent(maybe tick based
> update is not frequent), the read of nr_overloaded_cpus can survive from cache
> false sharing, which is mainly read by SIS. I have a stupid thought
> that if nr_overloaded_cpus
> mask and nr_overloaded_cpus could be put to 2 cache lines.
Not exactly, as overloaded_cpus and nr_overloaded_cpus are updated at the same
time, it is not a false sharing case.
--
Thanks,
Chenyu
On 5/9/22 11:21 PM, Chen Yu Wrote:
>> Yes, I have been already working on that for several days, and
>> along with some improvement on load balance (group_has_spare).
>> Ideally we can finally get rid out of the cache issues.
>>
> Ok, could you please also Cc me in the next version? I'd like to have
> a try.
>
I will, thanks!
Abel