Introduce a new per-task property, latency-nice, for controlling scalability
in the scheduler's idle CPU search path. Valid latency-nice values are from 1
to 100, indicating 1% to 100% search of the LLC domain in select_idle_cpu. A
new CPU cgroup file, cpu.latency-nice, is added as an interface to set and
get the value. All tasks in the same cgroup share the same latency-nice
value. Using a lower latency-nice value can help latency-intolerant tasks,
e.g. very short-running OLTP threads, where the full LLC search cost can be
significant compared to the run time of the threads. The default latency-nice
value is 5.
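
A rough sketch of how the cpu.latency-nice cgroup file could be wired up is
below (patch 1 itself is not reproduced in this excerpt; the handler names,
the task_group::latency_nice field and the propagation of the value to member
tasks' p->latency_nice are illustrative assumptions, not the actual diff):

	static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
					     struct cftype *cft)
	{
		return css_tg(css)->latency_nice;
	}

	static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
					      struct cftype *cft, u64 latency_nice)
	{
		if (latency_nice < 1 || latency_nice > LATENCY_NICE_MAX)
			return -ERANGE;
		css_tg(css)->latency_nice = latency_nice;
		/* propagation to each member task's p->latency_nice omitted */
		return 0;
	}

	/* entry added to the cpu controller's cftype array: */
	{
		.name		= "latency-nice",
		.read_u64	= cpu_latency_nice_read_u64,
		.write_u64	= cpu_latency_nice_write_u64,
	},
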
In addition to latency-nice, the series adds a new sched feature, SIS_CORE,
to allow disabling the idle core search altogether; that search is costly and
hurts more than it helps for short-running workloads.
Finally, it introduces a new per-CPU variable, next_cpu, to track where the
previous search ended so that each new search starts from that point. This
rotating search window over the CPUs in the LLC domain ensures that idle CPUs
are eventually found under high load.
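
The rotating window in select_idle_cpu could look roughly like the sketch
below (the actual diffs of patches 8 and 9 are not reproduced in this
excerpt; this assumes per_cpu(next_cpu, ...) is initialized to -1 and uses
the then-current mainline field names, so treat it as an illustration only):

	int cpu, start = per_cpu(next_cpu, target);

	if (start < 0)				/* no previous search on this CPU */
		start = target;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), start) {
		per_cpu(next_cpu, target) = cpu;  /* remember where we stopped */
		if (!--nr)
			return -1;
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;
		if (available_idle_cpu(cpu))
			break;
	}
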
Uperf pingpong on a 2-socket, 44-core, 88-thread Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  latency-nice=5,SIS_CORE  latency-nice=5,NO_SIS_CORE
  8       64.66     64.38  (-0.43%)          64.79  (0.2%)
 16      123.34    122.88  (-0.37%)         125.87  (2.05%)
 32      215.18    215.55  (0.17%)          247.77  (15.15%)
 48      278.56    321.6   (15.45%)         321.2   (15.3%)
 64      259.99    319.45  (22.87%)         333.95  (28.44%)
128      431.1     437.69  (1.53%)          431.09  (0%)
subhra mazumdar (9):
sched,cgroup: Add interface for latency-nice
sched: add search limit as per latency-nice
sched: add sched feature to disable idle core search
sched: SIS_CORE to disable idle core search
sched: Define macro for number of CPUs in core
x86/smpboot: Optimize cpumask_weight_sibling macro for x86
sched: search SMT before LLC domain
sched: introduce per-cpu var next_cpu to track search limit
sched: rotate the cpu search window for better spread
arch/x86/include/asm/smp.h | 1 +
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/smpboot.c | 17 ++++++++++++++++-
include/linux/sched.h | 1 +
include/linux/topology.h | 4 ++++
kernel/sched/core.c | 42 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 34 +++++++++++++++++++++------------
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 9 +++++++++
9 files changed, 97 insertions(+), 13 deletions(-)
--
2.9.3
Search the SMT siblings before all CPUs in the LLC domain for an idle CPU.
This helps L1 cache locality.
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8856503..94dd4a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return i;
}
- i = select_idle_cpu(p, sd, target);
+ i = select_idle_smt(p, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
- i = select_idle_smt(p, target);
+ i = select_idle_cpu(p, sd, target);
if ((unsigned)i < nr_cpumask_bits)
return i;
--
2.9.3
Put upper and lower limits on the CPU search in select_idle_cpu. The lower
limit is set to the number of CPUs in a core, while the upper limit is
derived from the latency-nice of the thread. This ensures that on any
architecture we will usually search beyond a core. Changing the latency-nice
value lets the user adjust the search cost to suit a given workload.
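
As a back-of-the-envelope example (assuming an LLC spanning 44 CPUs, i.e. one
socket of the test machine in the cover letter, with 2 SMT siblings per
core):

	floor = number of SMT siblings           = 2
	nr    = (5  * 44) / 100, capped to floor = 2	(default latency-nice)
	nr    = (50 * 44) / 100                  = 22	(latency-nice = 50)

so the default scans only a couple of CPUs per wakeup, while raising
latency-nice widens the scan proportionally, up to the whole LLC at 100.
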
Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b08d00c..c31082d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, nr = INT_MAX;
+ int cpu, floor, nr = INT_MAX;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (sched_feat(SIS_PROP)) {
- u64 span_avg = sd->span_weight * avg_idle;
- if (span_avg > 4*avg_cost)
- nr = div_u64(span_avg, avg_cost);
- else
- nr = 4;
+ floor = cpumask_weight(topology_sibling_cpumask(target));
+ if (floor < 2)
+ floor = 2;
+ nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
+ if (nr < floor)
+ nr = floor;
}
time = local_clock();
--
2.9.3
Introduce a macro, topology_sibling_weight, for the number of sibling CPUs in
a core, and use it in select_idle_cpu.
Signed-off-by: subhra mazumdar <[email protected]>
---
include/linux/topology.h | 4 ++++
kernel/sched/fair.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
#ifndef topology_sibling_cpumask
#define topology_sibling_cpumask(cpu) cpumask_of(cpu)
#endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu) \
+ cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
#ifndef topology_core_cpumask
#define topology_core_cpumask(cpu) cpumask_of(cpu)
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ec9c6..8856503 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6205,7 +6205,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (sched_feat(SIS_PROP)) {
- floor = cpumask_weight(topology_sibling_cpumask(target));
+ floor = topology_sibling_weight(target);
if (floor < 2)
floor = 2;
nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
--
2.9.3
Use a per-CPU variable, cpumask_weight_sibling, behind the
topology_sibling_weight macro on x86 for fast lookup in select_idle_cpu. This
avoids reading multiple cache lines on systems with large numbers of CPUs,
where the sibling bitmask can span multiple cache lines. Even if the bitmask
spans only one cache line, this avoids looping through it to count the bits
and gets the weight in O(1).
Signed-off-by: subhra mazumdar <[email protected]>
---
arch/x86/include/asm/smp.h | 1 +
arch/x86/include/asm/topology.h | 1 +
arch/x86/kernel/smpboot.c | 17 ++++++++++++++++-
3 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
extern unsigned int num_processors;
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
/* cpus sharing the last level cache: */
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
#ifdef CONFIG_SMP
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu) (per_cpu(cpumask_weight_sibling, cpu))
extern unsigned int __max_logical_packages;
#define topology_max_packages() (__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..57ad88d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,9 @@
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+
/* representing HT and core siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +523,8 @@ void set_cpu_sibling_map(int cpu)
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+ per_cpu(cpumask_weight_sibling, cpu) =
+ cpumask_weight(topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
c->booted_cores = 1;
@@ -529,8 +534,13 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
- if ((i == cpu) || (has_smt && match_smt(c, o)))
+ if ((i == cpu) || (has_smt && match_smt(c, o))) {
link_mask(topology_sibling_cpumask, cpu, i);
+ per_cpu(cpumask_weight_sibling, cpu) =
+ cpumask_weight(topology_sibling_cpumask(cpu));
+ per_cpu(cpumask_weight_sibling, i) =
+ cpumask_weight(topology_sibling_cpumask(i));
+ }
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
else
physid_set_mask_of_physid(0, &phys_cpu_present_map);
cpumask_set_cpu(0, topology_sibling_cpumask(0));
+ per_cpu(cpumask_weight_sibling, 0) =
+ cpumask_weight(topology_sibling_cpumask(0));
cpumask_set_cpu(0, topology_core_cpumask(0));
}
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
for_each_cpu(sibling, topology_core_cpumask(cpu)) {
cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+ per_cpu(cpumask_weight_sibling, sibling) =
+ cpumask_weight(topology_sibling_cpumask(sibling));
/*/
* last thread sibling in this cpu core going down
*/
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
+ per_cpu(cpumask_weight_sibling, cpu) = 0;
cpumask_clear(topology_core_cpumask(cpu));
c->cpu_core_id = 0;
c->booted_cores = 0;
--
2.9.3
Add a new sched feature, SIS_CORE, to provide an option to disable the idle
core search (select_idle_core).
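
The later patch that consumes this feature bit ("sched: SIS_CORE to disable
idle core search") is not reproduced in this excerpt; the check in
select_idle_sibling is expected to take roughly this shape (sketch only, not
the actual diff):

	if (sched_feat(SIS_CORE)) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}
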
Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/features.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_AVG_CPU, false)
SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.9.3
Hi Subhra,
On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path. Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.
>
> Finally it also introduces a new per-cpu variable next_cpu to track
> the limit of search so that every time search starts from where it ended.
> This rotating search window over cpus in LLC domain ensures that idle
> cpus are eventually found in case of high load.
>
> Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
> message size = 8k (higher is better):
> threads baseline latency-nice=5,SIS_CORE latency-nice=5,NO_SIS_CORE
> 8 64.66 64.38 (-0.43%) 64.79 (0.2%)
> 16 123.34 122.88 (-0.37%) 125.87 (2.05%)
> 32 215.18 215.55 (0.17%) 247.77 (15.15%)
> 48 278.56 321.6 (15.45%) 321.2 (15.3%)
> 64 259.99 319.45 (22.87%) 333.95 (28.44%)
> 128 431.1 437.69 (1.53%) 431.09 (0%)
>
The results look appealing on your experimental setup.
BTW, do you have any plans to do load balancing as well based on the latency
niceness of the tasks? It seems an even more interesting case when we pack
the less latency-sensitive tasks onto fewer CPUs.
Also, do you see any workload results showing a performance regression with
NO_SIS_CORE?
Thanks,
Parth
On 8/30/19 11:19 PM, subhra mazumdar wrote:
> Put upper and lower limit on CPU search in select_idle_cpu. The lower limit
> is set to amount of CPUs in a core while upper limit is derived from the
> latency-nice of the thread. This ensures for any architecture we will
> usually search beyond a core. Changing the latency-nice value by user will
> change the search cost making it appropriate for given workload.
>
> Signed-off-by: subhra mazumdar <[email protected]>
> ---
> kernel/sched/fair.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b08d00c..c31082d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> u64 avg_cost, avg_idle;
> u64 time, cost;
> s64 delta;
> - int cpu, nr = INT_MAX;
> + int cpu, floor, nr = INT_MAX;
>
> this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> if (!this_sd)
> @@ -6205,11 +6205,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> return -1;
>
> if (sched_feat(SIS_PROP)) {
> - u64 span_avg = sd->span_weight * avg_idle;
> - if (span_avg > 4*avg_cost)
> - nr = div_u64(span_avg, avg_cost);
> - else
> - nr = 4;
> + floor = cpumask_weight(topology_sibling_cpumask(target));
> + if (floor < 2)
> + floor = 2;
> + nr = (p->latency_nice * sd->span_weight) / LATENCY_NICE_MAX;
I see you defined LATENCY_NICE_MAX = 100; is the value 100 an experimental
choice? I was hoping for it to be a power of 2, so the scaling could be a
simple ">>" rather than the heavier division operation (see the sketch after
the quoted hunk below).
> + if (nr < floor)
> + nr = floor;
> }
>
> time = local_clock();
>
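To make the shift suggestion above concrete, a power-of-two variant could
look like this (purely illustrative values, not part of the posted series):

	#define LATENCY_NICE_MAX	128	/* hypothetical power-of-two range */
	#define LATENCY_NICE_SHIFT	7	/* log2(LATENCY_NICE_MAX) */

	nr = (p->latency_nice * sd->span_weight) >> LATENCY_NICE_SHIFT;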
On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
> in L1 cache locality.
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8856503..94dd4a32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> return i;
> }
>
> - i = select_idle_cpu(p, sd, target);
> + i = select_idle_smt(p, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> - i = select_idle_smt(p, target);
> + i = select_idle_cpu(p, sd, target);
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
But it is absolutely conceptually wrong. An idle core is a much better
target than an idle sibling.
On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...
> Add a new sched feature SIS_CORE to have an option to disable idle core
> search (select_idle_core).
>
> Signed-off-by: subhra mazumdar <[email protected]>
> ---
> kernel/sched/features.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 858589b..de4d506 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_AVG_CPU, false)
> SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_CORE, true)
Why do we need a sched_feature? If you think there are systems in which the
usage of latency-nice does not make sense in "Select Idle Sibling", then we
should probably add a new Kconfig option instead.
If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:
commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...
> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.
As per my comments in the other message, we should try to better split the
"latency nice" concept introduction (API and mechanisms) from its usage in
different scheduler code paths.
This distinction should be evident from the cover letter too. What you use
"latency nice" for is just one possible use case; thus it's important to make
sure it's generic enough to fit other usages too.
> Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.
I don't see why your latency-nice cannot be used to implement what you get
with NO_SIS_CORE.
IMHO, "latency nice" should be exposed as a pretty generic concept which
progressively enables different "levels of biasing", both at wake-up time and
at load-balance time.
Why is it not possible to have the SIS_CORE/NO_SIS_CORE switch implemented
just as different thresholds on a task's latency-nice value?
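
For instance (a hypothetical sketch, not part of the posted series), the idle
core scan could be keyed off the task's latency-nice instead of a feature
bit:

	/* hypothetical cutoff, illustration only */
	#define LATENCY_NICE_CORE_SCAN_MIN	20

	if (p->latency_nice >= LATENCY_NICE_CORE_SCAN_MIN) {
		i = select_idle_core(p, sd, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;
	}
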
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 9/5/19 3:17 AM, Patrick Bellasi wrote:
> On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...
>
>> Add a new sched feature SIS_CORE to have an option to disable idle core
>> search (select_idle_core).
>>
>> Signed-off-by: subhra mazumdar <[email protected]>
>> ---
>> kernel/sched/features.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index 858589b..de4d506 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>> */
>> SCHED_FEAT(SIS_AVG_CPU, false)
>> SCHED_FEAT(SIS_PROP, true)
>> +SCHED_FEAT(SIS_CORE, true)
> Why do we need a sched_feature? If you think there are systems in
> which the usage of latency-nice does not make sense for in "Select Idle
> Sibling", then we should probably better add a new Kconfig option.
This is not for latency-nice but to be able to disable a different aspect of
the scheduler, i.e. searching for idle cores. This can be made part of
latency-nice (i.e. skip the idle core search if latency-nice is below a
certain value), but even then having a feature to disable it doesn't hurt.
>
> If that's the case, you can probably use the init/Kconfig's
> "Scheduler features" section, recently added by:
>
> commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
>
>> /*
>> * Issue a WARN when we do multiple update_rq_clock() calls
> Best,
> Patrick
>
On 9/5/19 2:31 AM, Peter Zijlstra wrote:
> On Fri, Aug 30, 2019 at 10:49:42AM -0700, subhra mazumdar wrote:
>> Search SMT siblings before all CPUs in LLC domain for idle CPU. This helps
>> in L1 cache locality.
>> ---
>> kernel/sched/fair.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8856503..94dd4a32 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6274,11 +6274,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> return i;
>> }
>>
>> - i = select_idle_cpu(p, sd, target);
>> + i = select_idle_smt(p, target);
>> if ((unsigned)i < nr_cpumask_bits)
>> return i;
>>
>> - i = select_idle_smt(p, target);
>> + i = select_idle_cpu(p, sd, target);
>> if ((unsigned)i < nr_cpumask_bits)
>> return i;
>>
> But it is absolutely conceptually wrong. An idle core is a much better
> target than an idle sibling.
This is select_idle_smt, not select_idle_core.