2019-06-26 22:53:32

by Subhra Mazumdar

Subject: [RFC PATCH 0/3] Scheduler Soft Affinity

When multiple instances of workloads are consolidated on the same host it is
good practice to partition them for best performance, e.g. give a NUMA node
partition to each instance. Currently the Linux kernel provides two
interfaces for hard partitioning: the sched_setaffinity system call and the
cpuset.cpus cgroup file. But neither allows one instance to burst out of its
partition and use available CPUs from other partitions when they are idle.
Running all instances free range without any affinity, on the other hand,
suffers from cache coherence overhead across sockets (NUMA nodes) when all
instances are busy.

To achieve the best of both worlds, one potential way is to use the AutoNUMA
balancer, which migrates memory and threads to align them on the same NUMA
node when all instances are busy. But it doesn't work if memory is spread
across NUMA nodes, has a high reaction time due to its periodic scanning
mechanism, and can't handle sub-NUMA levels. Some motivational experiments
were done with 2 DB instances running on a 2-socket x86 Intel system with
22 cores per socket. 'numactl -m' was used to bind the memory of each
instance to one NUMA node; the idea was to have AutoNUMA migrate only
threads, as memory is pinned. But AutoNUMA ON vs OFF didn't make any
difference in performance. It was also found that AutoNUMA still migrated
pages across NUMA nodes, so numactl only controls the initial allocation of
memory and AutoNUMA is free to migrate pages later. The following are the
vmstats for different numbers of users running TPC-C in each DB instance
with numactl and AutoNUMA ON. With AutoNUMA OFF the page migrations were of
course zero.

users (2x16)
numa_hint_faults 1672485
numa_hint_faults_local 1158283
numa_pages_migrated 373670

users (2x24)
numa_hint_faults 2267425
numa_hint_faults_local 1548501
numa_pages_migrated 586473


users (2x32)
numa_hint_faults 1916625
numa_hint_faults_local 1499772
numa_pages_migrated 229581

Given the above drawbacks, the most logical way to achieve the desired
behavior is via the task scheduler. A new interface is added, a new system
call sched_setaffinity2 in this case, to specify the set of soft affinity
CPUs. It takes an extra parameter to specify hard or soft affinity, where
hard behaves the same as the existing sched_setaffinity. I am open to using
other interfaces like cgroup or anything else I might not have considered.
Also, this patchset only allows it for CFS class threads, as for the RT
class the preferential search latency may not be tolerable. But nothing in
theory stops us from implementing soft affinity for the RT class too.
Finally, it also adds new scheduler tunables to tune the "softness" of soft
affinity, as different workloads may have different optimal points. This is
done using two tunables, sched_allowed and sched_preferred: if the ratio of
CPU utilization of the preferred set to the allowed set crosses the ratio
sched_allowed:sched_preferred, the scheduler will use the entire allowed
set instead of the preferred set in the first level of search in
select_task_rq_fair. The default value is 100:1.

The following are the performance results of running 2 instances and 1
instance of Hackbench and Oracle DB on a 2-socket Intel x86 system with 22
cores per socket, with the default tunable settings. For the 2-instance
case the DB shows substantial improvement over no affinity, but Hackbench
shows negligible improvement. For the 1-instance case DB performance is
close to no affinity, but Hackbench has a significant regression. Hard
affinity numbers are also added for comparison. The load in each Hackbench
and DB instance is varied by varying the number of groups and users
respectively. %gain is w.r.t. no affinity.

(100:1)
Hackbench %gain with soft affinity %gain with hard affinity
2*4 1.12 1.3
2*8 1.67 1.35
2*16 1.3 1.12
2*32 0.31 0.61
1*4 -18 -58
1*8 -24 -59
1*16 -33 -72
1*32 -38 -83

DB %gain with soft affinity %gain with hard affinity
2*16 4.26 4.1
2*24 5.43 7.17
2*32 6.31 6.22
1*16 -0.02 0.62
1*24 0.12 -26.1
1*32 -1.48 -8.65

The experiments were repeated with sched_allowed:sched_preferred set to
5:4 to get "softer" soft affinity. The following numbers show this
preserves the (negligible) improvement in the 2-instance Hackbench case
while reducing the regression for 1 instance significantly. For the DB this
setting doesn't work well, as the improvements in the 2-instance case go
away. This also shows that different workloads have different optimal
settings.

(5:4)
Hackbench %gain with soft affinity
2*4 1.43
2*8 1.36
2*16 1.01
2*32 1.45
1*4 -2.55
1*8 -5.06
1*16 -8
1*32 -7.32

DB %gain with soft affinity
2*16 0.46
2*24 3.68
2*32 -3.34
1*16 0.08
1*24 1.6
1*32 -1.29

Finally, I measured the overhead of soft affinity when it is NOT used by
comparing against the baseline kernel for the no-affinity and hard-affinity
Hackbench cases. The following is the improvement of the soft affinity
kernel w.r.t. the baseline, but the numbers are really within the noise
margin. This shows soft affinity has no overhead when not used.

Hackbench %diff of no affinity %diff of hard affinity
2*4 0.11 0.31
2*8 0.13 0.55
2*16 0.61 0.90
2*32 0.86 1.01
1*4 0.48 0.43
1*8 0.45 0.33
1*16 0.61 0.64
1*32 0.11 0.63

A final set of experiments was done (numbers not shown) with the memory of
each DB instance spread evenly across both NUMA nodes. This showed similar
improvements with soft affinity for the 2-instance case, indicating the
improvement comes from saving LLC coherence overhead.

subhra mazumdar (3):
sched: Introduce new interface for scheduler soft affinity
sched: change scheduler to give preference to soft affinity CPUs
sched: introduce tunables to control soft affinity

arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/sched.h | 5 +-
include/linux/sched/sysctl.h | 2 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 3 +
init/init_task.c | 2 +
kernel/compat.c | 2 +-
kernel/rcu/tree_plugin.h | 3 +-
kernel/sched/core.c | 167 ++++++++++++++++++++++++++++-----
kernel/sched/fair.c | 154 ++++++++++++++++++++++--------
kernel/sched/sched.h | 2 +
kernel/sysctl.c | 14 +++
13 files changed, 297 insertions(+), 65 deletions(-)

--
2.9.3


2019-06-26 22:53:35

by Subhra Mazumdar

Subject: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

The soft affinity CPUs present in the cpumask cpus_preferred are used by
the scheduler at two levels of search. The first is in determining wake
affine, which chooses the LLC domain, and the second is while searching for
idle CPUs in the LLC domain. At the first level it uses cpus_preferred to
prune the search space. At the second level it first searches
cpus_preferred and then cpus_allowed. Using the affinity_unequal flag it
breaks out early to avoid any overhead in the scheduler fast path when soft
affinity is not used. This only changes the wakeup path of the scheduler;
the idle balancing is unchanged. Together they achieve the "softness" of
scheduling.

Signed-off-by: subhra mazumdar <[email protected]>
---
kernel/sched/fair.c | 137 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 100 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..53aa7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5807,7 +5807,7 @@ static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
*/
static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
- int this_cpu, int sd_flag)
+ int this_cpu, int sd_flag, struct cpumask *cpus)
{
struct sched_group *idlest = NULL, *group = sd->groups;
struct sched_group *most_spare_sg = NULL;
@@ -5831,7 +5831,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,

/* Skip over this group if it has no CPUs allowed */
if (!cpumask_intersects(sched_group_span(group),
- &p->cpus_allowed))
+ cpus))
continue;

local_group = cpumask_test_cpu(this_cpu,
@@ -5949,7 +5949,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
* find_idlest_group_cpu - find the idlest CPU among the CPUs in the group.
*/
static int
-find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p,
+ int this_cpu, struct cpumask *cpus)
{
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
@@ -5963,7 +5964,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
return cpumask_first(sched_group_span(group));

/* Traverse only the allowed CPUs */
- for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
+ for_each_cpu_and(i, sched_group_span(group), cpus) {
if (available_idle_cpu(i)) {
struct rq *rq = cpu_rq(i);
struct cpuidle_state *idle = idle_get_state(rq);
@@ -5999,7 +6000,8 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
}

static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p,
- int cpu, int prev_cpu, int sd_flag)
+ int cpu, int prev_cpu, int sd_flag,
+ struct cpumask *cpus)
{
int new_cpu = cpu;

@@ -6023,13 +6025,14 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
continue;
}

- group = find_idlest_group(sd, p, cpu, sd_flag);
+ group = find_idlest_group(sd, p, cpu, sd_flag, cpus);
+
if (!group) {
sd = sd->child;
continue;
}

- new_cpu = find_idlest_group_cpu(group, p, cpu);
+ new_cpu = find_idlest_group_cpu(group, p, cpu, cpus);
if (new_cpu == cpu) {
/* Now try balancing at a lower domain level of 'cpu': */
sd = sd->child;
@@ -6104,6 +6107,27 @@ void __update_idle_core(struct rq *rq)
rcu_read_unlock();
}

+static inline int
+scan_cpu_mask_for_idle_cores(struct cpumask *cpus, int target)
+{
+ int core, cpu;
+
+ for_each_cpu_wrap(core, cpus, target) {
+ bool idle = true;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ cpumask_clear_cpu(cpu, cpus);
+ if (!idle_cpu(cpu))
+ idle = false;
+ }
+
+ if (idle)
+ return core;
+ }
+
+ return -1;
+}
+
/*
* Scan the entire LLC domain for idle cores; this dynamically switches off if
* there are no idle cores left in the system; tracked through
@@ -6112,7 +6136,7 @@ void __update_idle_core(struct rq *rq)
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu;
+ int core;

if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -6120,21 +6144,22 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
if (!test_idle_cores(target, false))
return -1;

- cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+ cpumask_and(cpus, sched_domain_span(sd), &p->cpus_preferred);
+ core = scan_cpu_mask_for_idle_cores(cpus, target);

- for_each_cpu_wrap(core, cpus, target) {
- bool idle = true;
+ if (core >= 0)
+ return core;

- for_each_cpu(cpu, cpu_smt_mask(core)) {
- __cpumask_clear_cpu(cpu, cpus);
- if (!available_idle_cpu(cpu))
- idle = false;
- }
+ if (!p->affinity_unequal)
+ goto out;

- if (idle)
- return core;
- }
+ cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+ cpumask_andnot(cpus, cpus, &p->cpus_preferred);
+ core = scan_cpu_mask_for_idle_cores(cpus, target);

+ if (core >= 0)
+ return core;
+out:
/*
* Failed to find an idle core; stop looking for one.
*/
@@ -6143,24 +6168,40 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
return -1;
}

+static inline int
+scan_cpu_mask_for_idle_smt(struct cpumask *cpus, int target)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(cpu, cpus))
+ continue;
+ if (idle_cpu(cpu))
+ return cpu;
+ }
+
+ return -1;
+}
+
/*
* Scan the local SMT mask for idle CPUs.
*/
static int select_idle_smt(struct task_struct *p, int target)
{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
int cpu;

if (!static_branch_likely(&sched_smt_present))
return -1;

- for_each_cpu(cpu, cpu_smt_mask(target)) {
- if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
- continue;
- if (available_idle_cpu(cpu))
- return cpu;
- }
+ cpu = scan_cpu_mask_for_idle_smt(&p->cpus_preferred, target);

- return -1;
+ if (cpu >= 0 || !p->affinity_unequal)
+ return cpu;
+
+ cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+
+ return scan_cpu_mask_for_idle_smt(cpus, target);
}

#else /* CONFIG_SCHED_SMT */
@@ -6177,6 +6218,24 @@ static inline int select_idle_smt(struct task_struct *p, int target)

#endif /* CONFIG_SCHED_SMT */

+static inline int
+scan_cpu_mask_for_idle_cpu(struct cpumask *cpus, int target,
+ struct sched_domain *sd, int *nr)
+{
+ int cpu;
+
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+ if (!--(*nr))
+ return -1;
+ if (!cpumask_test_cpu(cpu, cpus))
+ continue;
+ if (available_idle_cpu(cpu))
+ break;
+ }
+
+ return cpu;
+}
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6185,10 +6244,11 @@ static inline int select_idle_smt(struct task_struct *p, int target)
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
struct sched_domain *this_sd;
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
u64 avg_cost, avg_idle;
u64 time, cost;
s64 delta;
- int cpu, nr = INT_MAX;
+ int cpu, nr = INT_MAX, nr_begin;

this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -6212,16 +6272,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
nr = 4;
}

+ nr_begin = nr;
time = local_clock();

- for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
- if (!--nr)
- return -1;
- if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
- continue;
- if (available_idle_cpu(cpu))
- break;
- }
+ cpu = scan_cpu_mask_for_idle_cpu(&p->cpus_preferred, target, sd, &nr);
+
+ if (!nr || !p->affinity_unequal || cpu != target || nr >= nr_begin - 1)
+ goto out;
+
+ cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+
+ cpu = scan_cpu_mask_for_idle_cpu(cpus, target, sd, &nr);
+out:

time = local_clock() - time;
cost = this_sd->avg_scan_cost;
@@ -6677,6 +6739,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ struct cpumask *cpus = &p->cpus_preferred;

if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
@@ -6689,7 +6752,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}

want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
- cpumask_test_cpu(cpu, &p->cpus_allowed);
+ cpumask_test_cpu(cpu, cpus);
}

rcu_read_lock();
@@ -6718,7 +6781,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

if (unlikely(sd)) {
/* Slow path */
- new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
+ new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag, cpus);
} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
/* Fast path */

--
2.9.3

2019-06-26 22:53:55

by Subhra Mazumdar

Subject: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity

For different workloads the optimal "softness" of soft affinity can be
different. Introduce tunables sched_allowed and sched_preferred that can
be tuned via /proc. They control at what utilization difference the
scheduler will choose cpus_allowed over cpus_preferred in the first level
of search. Depending on the extent of data sharing, the cache coherency
overhead of the system, etc., the optimal point may vary.

Signed-off-by: subhra mazumdar <[email protected]>
---
include/linux/sched/sysctl.h | 2 ++
kernel/sched/fair.c | 19 ++++++++++++++++++-
kernel/sched/sched.h | 2 ++
kernel/sysctl.c | 14 ++++++++++++++
4 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d7..0e75602 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -41,6 +41,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
#ifdef CONFIG_SCHED_DEBUG
extern __read_mostly unsigned int sysctl_sched_migration_cost;
extern __read_mostly unsigned int sysctl_sched_nr_migrate;
+extern __read_mostly unsigned int sysctl_sched_preferred;
+extern __read_mostly unsigned int sysctl_sched_allowed;

int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 53aa7f2..d222d78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -85,6 +85,8 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;

const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+const_debug unsigned int sysctl_sched_preferred = 1UL;
+const_debug unsigned int sysctl_sched_allowed = 100UL;

#ifdef CONFIG_SMP
/*
@@ -6739,7 +6741,22 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
- struct cpumask *cpus = &p->cpus_preferred;
+ int cpux, cpuy;
+ struct cpumask *cpus;
+
+ if (!p->affinity_unequal) {
+ cpus = &p->cpus_allowed;
+ } else {
+ cpux = cpumask_any(&p->cpus_preferred);
+ cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ cpumask_andnot(cpus, &p->cpus_allowed, &p->cpus_preferred);
+ cpuy = cpumask_any(cpus);
+ if (sysctl_sched_preferred * cpu_rq(cpux)->cfs.avg.util_avg >
+ sysctl_sched_allowed * cpu_rq(cpuy)->cfs.avg.util_avg)
+ cpus = &p->cpus_allowed;
+ else
+ cpus = &p->cpus_preferred;
+ }

if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..f856bdb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1863,6 +1863,8 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);

extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;
+extern const_debug unsigned int sysctl_sched_preferred;
+extern const_debug unsigned int sysctl_sched_allowed;

#ifdef CONFIG_SCHED_HRTICK

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7d1008b..bdffb48 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -383,6 +383,20 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "sched_preferred",
+ .data = &sysctl_sched_preferred,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_allowed",
+ .data = &sysctl_sched_allowed,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#ifdef CONFIG_SCHEDSTATS
{
.procname = "sched_schedstats",
--
2.9.3

2019-06-26 22:55:26

by Subhra Mazumdar

Subject: [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity

A new system call, sched_setaffinity2, is introduced for scheduler soft
affinity. It takes an extra parameter to specify hard or soft affinity,
where hard behaves the same as the existing sched_setaffinity. A new
cpumask, cpus_preferred, is introduced for this purpose and is always a
subset of cpus_allowed. A boolean, affinity_unequal, stores whether the two
masks are unequal, for fast lookup. Setting hard affinity also resets the
soft affinity set to be equal to it. Soft affinity is only allowed for CFS
class threads.

Signed-off-by: subhra mazumdar <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/sched.h | 5 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 3 +
init/init_task.c | 2 +
kernel/compat.c | 2 +-
kernel/rcu/tree_plugin.h | 3 +-
kernel/sched/core.c | 167 ++++++++++++++++++++++++++++-----
9 files changed, 162 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e..1dccdd2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+434 common sched_setaffinity2 __x64_sys_sched_setaffinity2

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b863fa8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -652,6 +652,8 @@ struct task_struct {
unsigned int policy;
int nr_cpus_allowed;
cpumask_t cpus_allowed;
+ cpumask_t cpus_preferred;
+ bool affinity_unequal;

#ifdef CONFIG_PREEMPT_RCU
int rcu_read_lock_nesting;
@@ -1784,7 +1786,8 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
# define vcpu_is_preempted(cpu) false
#endif

-extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
+extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask,
+ int flags);
extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

#ifndef TASK_SIZE_OF
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe..147a4e5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -669,6 +669,9 @@ asmlinkage long sys_sched_rr_get_interval(pid_t pid,
struct __kernel_timespec __user *interval);
asmlinkage long sys_sched_rr_get_interval_time32(pid_t pid,
struct old_timespec32 __user *interval);
+asmlinkage long sys_sched_setaffinity2(pid_t pid, unsigned int len,
+ unsigned long __user *user_mask_ptr,
+ int flags);

/* kernel/signal.c */
asmlinkage long sys_restart_syscall(void);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904d..d77b366 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_sched_setaffinity2 434
+__SYSCALL(__NR_sched_setaffinity2, sys_sched_setaffinity2)

#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 435

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee17..f910cd5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -52,6 +52,9 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04

+#define SCHED_HARD_AFFINITY 0
+#define SCHED_SOFT_AFFINITY 1
+
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN)
diff --git a/init/init_task.c b/init/init_task.c
index c70ef65..aa226a3 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -73,6 +73,8 @@ struct task_struct init_task
.normal_prio = MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_allowed = CPU_MASK_ALL,
+ .cpus_preferred = CPU_MASK_ALL,
+ .affinity_unequal = false,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
diff --git a/kernel/compat.c b/kernel/compat.c
index b5f7063..96621d7 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -226,7 +226,7 @@ COMPAT_SYSCALL_DEFINE3(sched_setaffinity, compat_pid_t, pid,
if (retval)
goto out;

- retval = sched_setaffinity(pid, new_mask);
+ retval = sched_setaffinity(pid, new_mask, SCHED_HARD_AFFINITY);
out:
free_cpumask_var(new_mask);
return retval;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1102765..bdff600 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2287,7 +2287,8 @@ static bool init_nocb_callback_list(struct rcu_data *rdp)
void rcu_bind_current_to_nocb(void)
{
if (cpumask_available(rcu_nocb_mask) && cpumask_weight(rcu_nocb_mask))
- WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask));
+ WARN_ON(sched_setaffinity(current->pid, rcu_nocb_mask,
+ SCHED_HARD_AFFINITY));
}
EXPORT_SYMBOL_GPL(rcu_bind_current_to_nocb);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..eca3e98b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1060,6 +1060,12 @@ void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_ma
p->nr_cpus_allowed = cpumask_weight(new_mask);
}

+void set_cpus_preferred_common(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ cpumask_copy(&p->cpus_preferred, new_mask);
+}
+
void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
struct rq *rq = task_rq(p);
@@ -1082,6 +1088,37 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
put_prev_task(rq, p);

p->sched_class->set_cpus_allowed(p, new_mask);
+ set_cpus_preferred_common(p, new_mask);
+
+ if (queued)
+ enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+ if (running)
+ set_curr_task(rq, p);
+}
+
+void do_set_cpus_preferred(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+ struct rq *rq = task_rq(p);
+ bool queued, running;
+
+ lockdep_assert_held(&p->pi_lock);
+
+ queued = task_on_rq_queued(p);
+ running = task_current(rq, p);
+
+ if (queued) {
+ /*
+ * Because __kthread_bind() calls this on blocked tasks without
+ * holding rq->lock.
+ */
+ lockdep_assert_held(&rq->lock);
+ dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+ }
+ if (running)
+ put_prev_task(rq, p);
+
+ set_cpus_preferred_common(p, new_mask);

if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -1170,6 +1207,41 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
return ret;
}

+static int
+__set_cpus_preferred_ptr(struct task_struct *p, const struct cpumask *new_mask)
+{
+ const struct cpumask *cpu_valid_mask = cpu_active_mask;
+ unsigned int dest_cpu;
+ struct rq_flags rf;
+ struct rq *rq;
+ int ret = 0;
+
+ rq = task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+
+ if (p->flags & PF_KTHREAD) {
+ /*
+ * Kernel threads are allowed on online && !active CPUs
+ */
+ cpu_valid_mask = cpu_online_mask;
+ }
+
+ if (cpumask_equal(&p->cpus_preferred, new_mask))
+ goto out;
+
+ if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ do_set_cpus_preferred(p, new_mask);
+
+out:
+ task_rq_unlock(rq, p, &rf);
+
+ return ret;
+}
+
int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
return __set_cpus_allowed_ptr(p, new_mask, false);
@@ -4724,7 +4796,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
return retval;
}

-long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
+long sched_setaffinity(pid_t pid, const struct cpumask *in_mask, int flags)
{
cpumask_var_t cpus_allowed, new_mask;
struct task_struct *p;
@@ -4742,6 +4814,11 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
get_task_struct(p);
rcu_read_unlock();

+ if (flags == SCHED_SOFT_AFFINITY &&
+ p->sched_class != &fair_sched_class) {
+ retval = -EINVAL;
+ goto out_put_task;
+ }
if (p->flags & PF_NO_SETAFFINITY) {
retval = -EINVAL;
goto out_put_task;
@@ -4790,18 +4867,37 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
}
#endif
again:
- retval = __set_cpus_allowed_ptr(p, new_mask, true);
-
- if (!retval) {
- cpuset_cpus_allowed(p, cpus_allowed);
- if (!cpumask_subset(new_mask, cpus_allowed)) {
- /*
- * We must have raced with a concurrent cpuset
- * update. Just reset the cpus_allowed to the
- * cpuset's cpus_allowed
- */
- cpumask_copy(new_mask, cpus_allowed);
- goto again;
+ if (flags == SCHED_HARD_AFFINITY) {
+ retval = __set_cpus_allowed_ptr(p, new_mask, true);
+
+ if (!retval) {
+ cpuset_cpus_allowed(p, cpus_allowed);
+ if (!cpumask_subset(new_mask, cpus_allowed)) {
+ /*
+ * We must have raced with a concurrent cpuset
+ * update. Just reset the cpus_allowed to the
+ * cpuset's cpus_allowed
+ */
+ cpumask_copy(new_mask, cpus_allowed);
+ goto again;
+ }
+ p->affinity_unequal = false;
+ }
+ } else if (flags == SCHED_SOFT_AFFINITY) {
+ retval = __set_cpus_preferred_ptr(p, new_mask);
+ if (!retval) {
+ cpuset_cpus_allowed(p, cpus_allowed);
+ if (!cpumask_subset(new_mask, cpus_allowed)) {
+ /*
+ * We must have raced with a concurrent cpuset
+ * update.
+ */
+ cpumask_and(new_mask, new_mask, cpus_allowed);
+ goto again;
+ }
+ if (!cpumask_equal(&p->cpus_allowed,
+ &p->cpus_preferred))
+ p->affinity_unequal = true;
}
}
out_free_new_mask:
@@ -4824,30 +4920,53 @@ static int get_user_cpu_mask(unsigned long __user *user_mask_ptr, unsigned len,
return copy_from_user(new_mask, user_mask_ptr, len) ? -EFAULT : 0;
}

-/**
- * sys_sched_setaffinity - set the CPU affinity of a process
- * @pid: pid of the process
- * @len: length in bytes of the bitmask pointed to by user_mask_ptr
- * @user_mask_ptr: user-space pointer to the new CPU mask
- *
- * Return: 0 on success. An error code otherwise.
- */
-SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
- unsigned long __user *, user_mask_ptr)
+static bool
+valid_affinity_flags(int flags)
+{
+ return flags == SCHED_HARD_AFFINITY || flags == SCHED_SOFT_AFFINITY;
+}
+
+static int
+sched_setaffinity_common(pid_t pid, unsigned int len,
+ unsigned long __user *user_mask_ptr, int flags)
{
cpumask_var_t new_mask;
int retval;

+ if (!valid_affinity_flags(flags))
+ return -EINVAL;
+
if (!alloc_cpumask_var(&new_mask, GFP_KERNEL))
return -ENOMEM;

retval = get_user_cpu_mask(user_mask_ptr, len, new_mask);
if (retval == 0)
- retval = sched_setaffinity(pid, new_mask);
+ retval = sched_setaffinity(pid, new_mask, flags);
free_cpumask_var(new_mask);
return retval;
}

+SYSCALL_DEFINE4(sched_setaffinity2, pid_t, pid, unsigned int, len,
+ unsigned long __user *, user_mask_ptr, int, flags)
+{
+ return sched_setaffinity_common(pid, len, user_mask_ptr, flags);
+}
+
+/**
+ * sys_sched_setaffinity - set the CPU affinity of a process
+ * @pid: pid of the process
+ * @len: length in bytes of the bitmask pointed to by user_mask_ptr
+ * @user_mask_ptr: user-space pointer to the new CPU mask
+ *
+ * Return: 0 on success. An error code otherwise.
+ */
+SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
+ unsigned long __user *, user_mask_ptr)
+{
+ return sched_setaffinity_common(pid, len, user_mask_ptr,
+ SCHED_HARD_AFFINITY);
+}
+
long sched_getaffinity(pid_t pid, struct cpumask *mask)
{
struct task_struct *p;
--
2.9.3

2019-07-02 16:24:04

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity

On Wed, Jun 26, 2019 at 03:47:16PM -0700, subhra mazumdar wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1183741..b863fa8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -652,6 +652,8 @@ struct task_struct {
> unsigned int policy;
> int nr_cpus_allowed;
> cpumask_t cpus_allowed;

You're patching dead code, that no longer exists.

> + cpumask_t cpus_preferred;
> + bool affinity_unequal;

Urgh, no. cpumask_t is an abomination and having one of them is already
unfortunate, having two is really not sane, esp. since for 99% of the
tasks they'll be exactly the same.

Why not add cpus_ptr_soft or something like that, and have it point at
cpus_mask by default, and when it needs to not be the same, allocate a
cpumask for it. That also gets rid of that unequal thing.


2019-07-02 16:30:23

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/3] sched: Introduce new interface for scheduler soft affinity

On Wed, Jun 26, 2019 at 03:47:16PM -0700, subhra mazumdar wrote:
> @@ -1082,6 +1088,37 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
> put_prev_task(rq, p);
>
> p->sched_class->set_cpus_allowed(p, new_mask);
> + set_cpus_preferred_common(p, new_mask);
> +
> + if (queued)
> + enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> + if (running)
> + set_curr_task(rq, p);
> +}
> +
> +void do_set_cpus_preferred(struct task_struct *p,
> + const struct cpumask *new_mask)
> +{
> + struct rq *rq = task_rq(p);
> + bool queued, running;
> +
> + lockdep_assert_held(&p->pi_lock);
> +
> + queued = task_on_rq_queued(p);
> + running = task_current(rq, p);
> +
> + if (queued) {
> + /*
> + * Because __kthread_bind() calls this on blocked tasks without
> + * holding rq->lock.
> + */
> + lockdep_assert_held(&rq->lock);
> + dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
> + }
> + if (running)
> + put_prev_task(rq, p);
> +
> + set_cpus_preferred_common(p, new_mask);
>
> if (queued)
> enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> @@ -1170,6 +1207,41 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> return ret;
> }
>
> +static int
> +__set_cpus_preferred_ptr(struct task_struct *p, const struct cpumask *new_mask)
> +{
> + const struct cpumask *cpu_valid_mask = cpu_active_mask;
> + unsigned int dest_cpu;
> + struct rq_flags rf;
> + struct rq *rq;
> + int ret = 0;
> +
> + rq = task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> +
> + if (p->flags & PF_KTHREAD) {
> + /*
> + * Kernel threads are allowed on online && !active CPUs
> + */
> + cpu_valid_mask = cpu_online_mask;
> + }
> +
> + if (cpumask_equal(&p->cpus_preferred, new_mask))
> + goto out;
> +
> + if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + do_set_cpus_preferred(p, new_mask);
> +
> +out:
> + task_rq_unlock(rq, p, &rf);
> +
> + return ret;
> +}
> +
> int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
> {
> return __set_cpus_allowed_ptr(p, new_mask, false);
> @@ -4724,7 +4796,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> return retval;
> }
>
> -long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> +long sched_setaffinity(pid_t pid, const struct cpumask *in_mask, int flags)
> {
> cpumask_var_t cpus_allowed, new_mask;
> struct task_struct *p;
> @@ -4742,6 +4814,11 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> get_task_struct(p);
> rcu_read_unlock();
>
> + if (flags == SCHED_SOFT_AFFINITY &&
> + p->sched_class != &fair_sched_class) {
> + retval = -EINVAL;
> + goto out_put_task;
> + }
> if (p->flags & PF_NO_SETAFFINITY) {
> retval = -EINVAL;
> goto out_put_task;
> @@ -4790,18 +4867,37 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> }
> #endif
> again:
> - retval = __set_cpus_allowed_ptr(p, new_mask, true);
> -
> - if (!retval) {
> - cpuset_cpus_allowed(p, cpus_allowed);
> - if (!cpumask_subset(new_mask, cpus_allowed)) {
> - /*
> - * We must have raced with a concurrent cpuset
> - * update. Just reset the cpus_allowed to the
> - * cpuset's cpus_allowed
> - */
> - cpumask_copy(new_mask, cpus_allowed);
> - goto again;
> + if (flags == SCHED_HARD_AFFINITY) {
> + retval = __set_cpus_allowed_ptr(p, new_mask, true);
> +
> + if (!retval) {
> + cpuset_cpus_allowed(p, cpus_allowed);
> + if (!cpumask_subset(new_mask, cpus_allowed)) {
> + /*
> + * We must have raced with a concurrent cpuset
> + * update. Just reset the cpus_allowed to the
> + * cpuset's cpus_allowed
> + */
> + cpumask_copy(new_mask, cpus_allowed);
> + goto again;
> + }
> + p->affinity_unequal = false;
> + }
> + } else if (flags == SCHED_SOFT_AFFINITY) {
> + retval = __set_cpus_preferred_ptr(p, new_mask);
> + if (!retval) {
> + cpuset_cpus_allowed(p, cpus_allowed);
> + if (!cpumask_subset(new_mask, cpus_allowed)) {
> + /*
> + * We must have raced with a concurrent cpuset
> + * update.
> + */
> + cpumask_and(new_mask, new_mask, cpus_allowed);
> + goto again;
> + }
> + if (!cpumask_equal(&p->cpus_allowed,
> + &p->cpus_preferred))
> + p->affinity_unequal = true;
> }
> }
> out_free_new_mask:

This seems like a terrible lot of pointless duplication; don't you get a
much smaller diff by passing the hard/soft thing into
__set_cpus_allowed_ptr() and only branching where it matters?
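The branch-only-where-it-matters shape can be sketched as a toy model. In the duplicated hunks above, the one step that actually differs is the recovery from a racing cpuset update: the hard path resets to the cpuset's mask, the soft path intersects with it. A hedged sketch (names hypothetical, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpumask_t;  /* stand-in for the kernel's cpumask_t */

enum { HARD_AFFINITY, SOFT_AFFINITY };

/*
 * Toy model of the single point where hard and soft affinity need to
 * diverge when a concurrent cpuset update is detected; everything else
 * in the setaffinity path could stay shared.
 */
cpumask_t resolve_cpuset_race(cpumask_t new_mask, cpumask_t cpuset_mask,
			      int flags)
{
	if (flags == HARD_AFFINITY)
		return cpuset_mask;             /* reset to the cpuset's CPUs */
	return new_mask & cpuset_mask;          /* prune to the cpuset's CPUs */
}
```

With the flag passed down into a common `__set_cpus_allowed_ptr()`-style helper, both `again:` retry loops collapse into one.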

2019-07-02 17:30:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
> scheduler in two levels of search. First is in determining wake affine
> which choses the LLC domain and secondly while searching for idle CPUs in
> LLC domain. In the first level it uses cpus_preferred to prune out the
> search space. In the second level it first searches the cpus_preferred and
> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
> any overhead in the scheduler fast path when soft affinity is not used.
> This only changes the wake up path of the scheduler, the idle balancing
> is unchanged; together they achieve the "softness" of scheduling.

I really dislike this implementation.

I thought the idea was to remain work conserving (in so far as that
we're that anyway), so changing select_idle_sibling() doesn't make sense
to me. If there is idle, we use it.

Same for newidle; which you already retained.

This then leaves regular balancing, and for that we can fudge with
can_migrate_task() and nr_balance_failed or something.

And I also really don't want a second utilization tipping point; we
already have the overloaded thing.

I also still dislike how you never looked into the numa balancer, which
already has preferred_nid stuff.
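The `can_migrate_task()`/`nr_balance_failed` idea floated above could look roughly like this toy model, where regular balancing respects the soft mask at first but gives up after repeated failures so the task still spills when the preferred set stays busy (names and threshold are hypothetical, not from any patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: allow a migration to a non-preferred CPU only once the
 * load balancer has failed to balance a few times in a row, mirroring
 * how aggressive-migration decisions already key off nr_balance_failed.
 */
int can_migrate_soft(uint64_t soft_mask, int dst_cpu,
		     int nr_balance_failed, int threshold)
{
	if (soft_mask & (1ULL << dst_cpu))
		return 1;                         /* destination is preferred */
	return nr_balance_failed > threshold;     /* spill only after failures */
}
```

This keeps the wakeup path untouched and work conserving, pushing the "softness" into periodic balancing instead.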

2019-07-17 03:03:52

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs


On 7/2/19 10:58 PM, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
>> scheduler in two levels of search. First is in determining wake affine
>> which choses the LLC domain and secondly while searching for idle CPUs in
>> LLC domain. In the first level it uses cpus_preferred to prune out the
>> search space. In the second level it first searches the cpus_preferred and
>> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
>> any overhead in the scheduler fast path when soft affinity is not used.
>> This only changes the wake up path of the scheduler, the idle balancing
>> is unchanged; together they achieve the "softness" of scheduling.
> I really dislike this implementation.
>
> I thought the idea was to remain work conserving (in so far as that
> we're that anyway), so changing select_idle_sibling() doesn't make sense
> to me. If there is idle, we use it.
>
> Same for newidle; which you already retained.
The scheduler is already not work conserving in many ways. Soft affinity is
only for those who want to use it and has no side effects when not used.
Also, given the way the first level of search is implemented, it may not be
possible to make it work conserving; I am open to ideas.
>
> This then leaves regular balancing, and for that we can fudge with
> can_migrate_task() and nr_balance_failed or something.
Possibly, but I don't know whether similar performance behavior can be
achieved by the periodic load balancer. Do you want a performance comparison
of the two approaches?
>
> And I also really don't want a second utilization tipping point; we
> already have the overloaded thing.
The numbers in the cover letter show that a static tipping point will not
work for all workloads. What soft affinity is doing is essentially trading
off cache coherence for more CPU. The optimum tradeoff point will vary from
workload to workload and with system characteristics such as cache coherence
overhead. If we just use the domain overload, that becomes a static
definition of the tipping point; we need something tunable that captures
this tradeoff. The ratio of CPU utilization seemed to work well and capture
that.
>
> I also still dislike how you never looked into the numa balancer, which
> already has preferred_nid stuff.
Not sure if you mean using the existing NUMA balancer or enhancing it. If
the former, I have numbers in the cover letter that show the NUMA balancer
is not making any difference. I allocated the memory of each DB instance to
one NUMA node using numactl, but the NUMA balancer still migrated pages, so
numactl only seems to control the initial allocation. Secondly, even though
the NUMA balancer migrated pages, it had no performance benefit compared to
disabling it.
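As a hedged sketch of what the utilization-ratio tipping point discussed above might compute — this is one reading of the `sched_preferred`/`sched_allowed` tunables, not the patch's actual code — the decision reduces to a cross-multiplied ratio comparison:

```c
#include <assert.h>

/*
 * Toy model: switch the first level of search from cpus_preferred to
 * cpus_allowed once the preferred set's utilization, relative to the
 * allowed set's, exceeds the tunable ratio. Cross-multiplied to avoid
 * division:
 *   util_preferred / util_allowed > sched_allowed / sched_preferred
 * All names and the exact comparison are illustrative assumptions.
 */
int use_allowed_set(unsigned long util_preferred, unsigned long util_allowed,
		    unsigned int sched_preferred, unsigned int sched_allowed)
{
	return util_preferred * sched_preferred >
	       util_allowed * sched_allowed;
}
```

A larger `sched_allowed`/`sched_preferred` ratio makes the scheduler cling to the preferred set longer; equal values spill over as soon as the preferred set is busier than the allowed set.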

2019-07-18 10:11:10

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity

* subhra mazumdar <[email protected]> [2019-06-26 15:47:18]:

> For different workloads the optimal "softness" of soft affinity can be
> different. Introduce tunables sched_allowed and sched_preferred that can
> be tuned via /proc. This allows to chose at what utilization difference
> the scheduler will chose cpus_allowed over cpus_preferred in the first
> level of search. Depending on the extent of data sharing, cache coherency
> overhead of the system etc. the optimal point may vary.
>
> Signed-off-by: subhra mazumdar <[email protected]>
> ---

Correct me if I'm wrong, but this patchset seems to concentrate only on the
wakeup path; I don't see any changes in the regular load balancer or the
numa balancer. If the system is loaded or the tasks are CPU intensive,
wouldn't these tasks be moved to cpus_allowed instead of cpus_preferred,
thereby breaking the soft affinity?

--
Thanks and Regards
Srikar Dronamraju

2019-07-18 11:40:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs

On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:
>
> On 7/2/19 10:58 PM, Peter Zijlstra wrote:
> > On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
> > > The soft affinity CPUs present in the cpumask cpus_preferred is used by the
> > > scheduler in two levels of search. First is in determining wake affine
> > > which choses the LLC domain and secondly while searching for idle CPUs in
> > > LLC domain. In the first level it uses cpus_preferred to prune out the
> > > search space. In the second level it first searches the cpus_preferred and
> > > then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
> > > any overhead in the scheduler fast path when soft affinity is not used.
> > > This only changes the wake up path of the scheduler, the idle balancing
> > > is unchanged; together they achieve the "softness" of scheduling.
> > I really dislike this implementation.
> >
> > I thought the idea was to remain work conserving (in so far as that
> > we're that anyway), so changing select_idle_sibling() doesn't make sense
> > to me. If there is idle, we use it.
> >
> > Same for newidle; which you already retained.
> The scheduler is already not work conserving in many ways. Soft affinity is
> only for those who want to use it and has no side effects when not used.
> Also the way scheduler is implemented in the first level of search it may
> not be possible to do it in a work conserving way, I am open to ideas.

I really don't understand the premise of this soft affinity stuff then.

I understood it was to allow spreading when under-utilized but grouping
when over-utilized, but you're arguing for the exact opposite, which doesn't
make sense.

> > And I also really don't want a second utilization tipping point; we
> > already have the overloaded thing.
> The numbers in the cover letter show that a static tipping point will not
> work for all workloads. What soft affinity is doing is essentially trading
> off cache coherence for more CPU. The optimum tradeoff point will vary
> from workload to workload and the system metrics of coherence overhead etc.
> If we just use the domain overload that becomes a static definition of
> tipping point, we need something tunable that captures this tradeoff. The
> ratio of CPU util seemed to work well and capture that.

And then you run two workloads with different characteristics on the
same box.

Global knobs are buggered.

2019-07-19 02:59:55

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: change scheduler to give preference to soft affinity CPUs


On 7/18/19 5:07 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2019 at 08:31:25AM +0530, Subhra Mazumdar wrote:
>> On 7/2/19 10:58 PM, Peter Zijlstra wrote:
>>> On Wed, Jun 26, 2019 at 03:47:17PM -0700, subhra mazumdar wrote:
>>>> The soft affinity CPUs present in the cpumask cpus_preferred is used by the
>>>> scheduler in two levels of search. First is in determining wake affine
>>>> which choses the LLC domain and secondly while searching for idle CPUs in
>>>> LLC domain. In the first level it uses cpus_preferred to prune out the
>>>> search space. In the second level it first searches the cpus_preferred and
>>>> then cpus_allowed. Using affinity_unequal flag it breaks early to avoid
>>>> any overhead in the scheduler fast path when soft affinity is not used.
>>>> This only changes the wake up path of the scheduler, the idle balancing
>>>> is unchanged; together they achieve the "softness" of scheduling.
>>> I really dislike this implementation.
>>>
>>> I thought the idea was to remain work conserving (in so far as that
>>> we're that anyway), so changing select_idle_sibling() doesn't make sense
>>> to me. If there is idle, we use it.
>>>
>>> Same for newidle; which you already retained.
>> The scheduler is already not work conserving in many ways. Soft affinity is
>> only for those who want to use it and has no side effects when not used.
>> Also the way scheduler is implemented in the first level of search it may
>> not be possible to do it in a work conserving way, I am open to ideas.
> I really don't understand the premise of this soft affinity stuff then.
>
> I understood it was to allow spreading if under-utilized, but group when
> over-utilized, but you're arguing for the exact opposite, which doesn't
> make sense.
You are right on the premise. The whole knob thing came into existence
because I couldn't make the first level of search work conserving. I am
concerned that trying to make it work conserving could introduce significant
latency in that code path when SA is used. I have made the second level of
search work conserving when we search the LLC domain.

Having said that, SA need not necessarily be binary, i.e., only spilling
over to the allowed set once the preferred set is 100% utilized (work
conserving). The spill-over can happen before that, giving SA a degree of
softness.

These two points made me go down the knob path for the first level of
search.

2019-07-19 07:25:10

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [RFC PATCH 3/3] sched: introduce tunables to control soft affinity


On 7/18/19 3:38 PM, Srikar Dronamraju wrote:
> * subhra mazumdar <[email protected]> [2019-06-26 15:47:18]:
>
>> For different workloads the optimal "softness" of soft affinity can be
>> different. Introduce tunables sched_allowed and sched_preferred that can
>> be tuned via /proc. This allows to chose at what utilization difference
>> the scheduler will chose cpus_allowed over cpus_preferred in the first
>> level of search. Depending on the extent of data sharing, cache coherency
>> overhead of the system etc. the optimal point may vary.
>>
>> Signed-off-by: subhra mazumdar <[email protected]>
>> ---
> Correct me but this patchset only seems to be concentrated on the wakeup
> path, I don't see any changes in the regular load balancer or the
> numa-balancer. If system is loaded or tasks are CPU intensive, then wouldn't
> these tasks be moved to cpus_allowed instead of cpus_preferred and hence
> breaking this soft affinity.
>
The newidle balancing is purposefully unchanged; if threads get stolen from
the preferred set to the allowed set, that is intended. Together with the
enqueue side, this achieves the softness of the affinity.