2018-11-09 13:03:55

by Steven Sistare

Subject: [PATCH v3 00/10] steal tasks to improve CPU utilization

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
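
In rough outline, the idle path after this series looks like the following
sketch (simplified from the pick_next_task_fair() changes in patch 8; locking,
schedstats, and the IF_SMP() wrappers are omitted, and the wrapper name here
is illustrative):

/*
 * Simplified sketch: try the existing idle_balance() first, and only if
 * it finds nothing, attempt the cheap steal from an overloaded LLC peer.
 */
static int find_new_tasks_when_idle(struct rq *rq, struct rq_flags *rf)
{
        int new_tasks;

        rq->idle_stamp = rq_clock(rq);          /* count the search as idle time */

        new_tasks = idle_balance(rq, rf);       /* existing wide search */
        if (new_tasks == 0)
                new_tasks = try_steal(rq, rf);  /* new: first migratable task in LLC */

        if (new_tasks)
                rq->idle_stamp = 0;

        return new_tasks;       /* < 0: a higher-class task appeared; retry the pick */
}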

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.

Patch 1 defines the sparsemask type and its operations.

Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.

Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.

Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.

Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
time being because of performance regressions that are not due to stealing
per se. See the patch description for details.

Patch 10 adds schedstats for comparing the new behavior to the old, and is
provided as a convenience for developers only, not for integration.

The patch series is based on kernel 4.20.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
bandwidth control were tested.

Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in the old and new functions that search for
idle CPUs, search for tasks to steal, and set the overloaded CPUs bitmap.

steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps time %busy slice sched idle wake %find steal
1 8.084 75.02 0.10 105476 46291 59183 0.31 0
2 13.892 85.33 0.10 190225 70958 119264 0.45 0
3 19.668 89.04 0.10 263896 87047 176850 0.49 0
4 25.279 91.28 0.10 322171 94691 227474 0.51 0
8 47.832 94.86 0.09 630636 144141 486322 0.56 0

new
grps time %busy slice sched idle wake %find steal %speedup
1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up
by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a
regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec. Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new.

------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 8.008 0.1 5.905 0.2 35.6
2 13.814 0.2 11.438 0.1 20.7
3 19.488 0.2 16.919 0.1 15.1
4 25.059 0.1 22.409 0.1 11.8
8 47.478 0.1 44.221 0.1 7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 4.586 0.8 4.596 0.6 -0.3
2 7.693 0.2 5.775 1.3 33.2
3 10.442 0.3 8.288 0.3 25.9
4 13.087 0.2 11.057 0.1 18.3
8 24.145 0.2 22.076 0.3 9.3
16 43.779 0.1 41.741 0.2 4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs.

clients %speedup
1 16.2
2 11.7
4 9.9
8 12.8
16 13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Benchmark %speedup
specjbb2015_critical_jops 5.7
mysql_sysb1.0.14_mutex_2 40.6
mysql_sysb1.0.14_oltp_2 3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 7.945 0.2 7.219 8.7 10.0
2 8.444 0.4 6.689 1.5 26.2
3 12.100 1.1 9.962 2.0 21.4
4 15.001 0.4 13.109 1.1 14.4
8 27.960 0.2 26.127 0.3 7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 5.826 5.4 5.840 5.0 -0.3
2 5.041 5.3 6.171 23.4 -18.4
3 6.839 2.1 6.324 3.8 8.1
4 8.177 0.6 7.318 3.6 11.7
8 14.429 0.7 13.966 1.3 3.3
16 26.401 0.3 25.149 1.5 4.9


X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

Customers Users %speedup
1200000 40 -1.2
2400000 80 2.7
3600000 120 8.9
4800000 160 4.4
6000000 200 3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".

Benchmark %speedup

mysql_sysb1.0.14_fileio_56_rndrd 19.6
mysql_sysb1.0.14_fileio_56_seqrd 12.1
mysql_sysb1.0.14_fileio_56_rndwr 0.4
mysql_sysb1.0.14_fileio_56_seqrewr -0.3

pgsql_sysb1.0.14_fileio_56_rndrd 19.5
pgsql_sysb1.0.14_fileio_56_seqrd 8.6
pgsql_sysb1.0.14_fileio_56_rndwr 1.0
pgsql_sysb1.0.14_fileio_56_seqrewr 0.5

opatch_time_ASM_12.2.0.1.0_HP2M 7.5
select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8

lm3_memlat_L2 4.8
lm3_memlat_L1 0.0

ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

NAS_OMP
bench class ncpu %improved(Mops)
dc B 72 1.3
is C 72 0.9
is D 72 0.7

sysbench mysql, average of 24 runs
--- base --- --- new ---
nthr events %stdev events %stdev %speedup
1 331.0 0.25 331.0 0.24 -0.1
2 661.3 0.22 661.8 0.22 0.0
4 1297.0 0.88 1300.5 0.82 0.2
8 2420.8 0.04 2420.5 0.04 -0.1
16 4826.3 0.07 4825.4 0.05 -0.1
32 8815.3 0.27 8830.2 0.18 0.1
64 12823.0 0.24 12823.6 0.26 0.0

--------------------------------------------------------------

Changes from v1 to v2:
- Remove stray find_time hunk from patch 5
- Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
- Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2
- Steal iff avg_idle exceeds the cost of stealing

Changes from v2 to v3:
- Update series for kernel 4.20. Context changes only.

Steve Sistare (10):
sched: Provide sparsemask, a reduced contention bitmap
sched/topology: Provide hooks to allocate data shared per LLC
sched/topology: Provide cfs_overload_cpus bitmap
sched/fair: Dynamically update cfs_overload_cpus
sched/fair: Hoist idle_stamp up from idle_balance
sched/fair: Generalize the detach_task interface
sched/fair: Provide can_migrate_task_llc
sched/fair: Steal work from an overloaded CPU when CPU goes idle
sched/fair: disable stealing if too many NUMA nodes
sched/fair: Provide idle search schedstats

include/linux/sched/topology.h | 1 +
include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++
kernel/sched/core.c | 30 +++-
kernel/sched/fair.c | 346 +++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 13 +-
kernel/sched/stats.c | 11 +-
kernel/sched/stats.h | 13 ++
kernel/sched/topology.c | 117 +++++++++++++-
lib/Makefile | 2 +-
lib/sparsemask.c | 142 +++++++++++++++++
11 files changed, 906 insertions(+), 35 deletions(-)
create mode 100644 include/linux/sparsemask.h
create mode 100644 lib/sparsemask.c

--
1.8.3.1



2018-11-09 13:01:32

by Steven Sistare

Subject: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

From: Steve Sistare <[email protected]>

Define and initialize a sparse bitmap of overloaded CPUs, per
last-level-cache scheduling domain, for use by the CFS scheduling class.
Save a pointer to cfs_overload_cpus in the rq for efficient access.

Signed-off-by: Steve Sistare <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 21 +++++++++++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 6b99761..b173a77 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ struct sparsemask *cfs_overload_cpus;
};

struct sched_domain {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 618577f..eacf5db 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -81,6 +81,7 @@

struct rq;
struct cpuidle_state;
+struct sparsemask;

/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED 1
@@ -812,6 +813,7 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+ struct sparsemask *cfs_overload_cpus;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 3e72ce0..6455bde 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,7 @@
* Scheduler topology setup/handling methods
*/
#include "sched.h"
+#include <linux/sparsemask.h>

DEFINE_MUTEX(sched_domains_mutex);

@@ -441,6 +442,7 @@ static void update_top_cache_domain(int cpu)
static void
cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
{
+ struct sparsemask *cfs_overload_cpus;
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;

@@ -482,6 +484,10 @@ static void update_top_cache_domain(int cpu)
dirty_sched_domain_sysctl(cpu);
destroy_sched_domains(tmp);

+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ cfs_overload_cpus = (sd ? sd->shared->cfs_overload_cpus : NULL);
+ rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
+
update_top_cache_domain(cpu);
}

@@ -1619,9 +1625,19 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}

+#define ZALLOC_MASK(maskp, nelems, node) \
+ (!*(maskp) && !zalloc_sparsemask_node(maskp, nelems, \
+ SPARSEMASK_DENSITY_DEFAULT, \
+ GFP_KERNEL, node)) \
+
static int sd_llc_alloc(struct sched_domain *sd)
{
- /* Allocate sd->shared data here. Empty for now. */
+ struct sched_domain_shared *sds = sd->shared;
+ struct cpumask *span = sched_domain_span(sd);
+ int nid = cpu_to_node(cpumask_first(span));
+
+ if (ZALLOC_MASK(&sds->cfs_overload_cpus, nr_cpu_ids, nid))
+ return 1;

return 0;
}
@@ -1633,7 +1649,8 @@ static void sd_llc_free(struct sched_domain *sd)
if (!sds)
return;

- /* Free data here. Empty for now. */
+ free_sparsemask(sds->cfs_overload_cpus);
+ sds->cfs_overload_cpus = NULL;
}

static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
--
1.8.3.1


2018-11-09 13:01:40

by Steven Sistare

Subject: [PATCH v3 10/10] sched/fair: Provide idle search schedstats

Add schedstats to measure the effectiveness of searching for idle CPUs
and stealing tasks. This is a temporary patch intended for use during
development only. SCHEDSTAT_VERSION is bumped to 16, and the following
fields are added to the per-CPU statistics of /proc/schedstat:

field 10: # of times select_idle_sibling "easily" found an idle CPU --
prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle CPU.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle
CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is
migratable.
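
As an aside for developers, a minimal userspace sketch for pulling the new
per-CPU fields out of a version 16 /proc/schedstat could look like this
(illustrative only, not part of the patch; it assumes the nine existing
per-CPU fields precede the new ones, as described above):

/* Illustrative only: print the new idle-search fields (10-16) per CPU. */
#include <stdio.h>

int main(void)
{
        char line[1024];
        unsigned long long v[17];
        int cpu;
        FILE *f = fopen("/proc/schedstat", "r");

        if (!f)
                return 1;

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line,
                           "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu"
                           " %llu %llu %llu %llu %llu %llu %llu",
                           &cpu, &v[1], &v[2], &v[3], &v[4], &v[5], &v[6],
                           &v[7], &v[8], &v[9], &v[10], &v[11], &v[12],
                           &v[13], &v[14], &v[15], &v[16]) < 17)
                        continue;       /* version/domain line, or older format */

                printf("cpu%d easy=%llu found_cpu=%llu found_core=%llu none=%llu"
                       " find_ns=%llu steal=%llu steal_fail=%llu\n",
                       cpu, v[10], v[11], v[12], v[13], v[14], v[15], v[16]);
        }
        fclose(f);
        return 0;
}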

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/core.c | 30 +++++++++++++++++++++++++++--
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 9 +++++++++
kernel/sched/stats.c | 11 ++++++++++-
kernel/sched/stats.h | 13 +++++++++++++
5 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f12225f..49b48da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2220,17 +2220,43 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
DEFINE_STATIC_KEY_FALSE(sched_schedstats);
static bool __initdata __sched_schedstats = false;

+unsigned long schedstat_skid;
+
+static void compute_skid(void)
+{
+ int i, n = 0;
+ s64 t, skid = 0;
+
+ for (i = 0; i < 100; i++) {
+ t = local_clock();
+ t = local_clock() - t;
+ if (t > 0 && t < 1000) { /* only use sane samples */
+ skid += t;
+ n++;
+ }
+ }
+
+ if (n > 0)
+ schedstat_skid = skid / n;
+ else
+ schedstat_skid = 0;
+ pr_info("schedstat_skid = %lu\n", schedstat_skid);
+}
+
static void set_schedstats(bool enabled)
{
- if (enabled)
+ if (enabled) {
+ compute_skid();
static_branch_enable(&sched_schedstats);
- else
+ } else {
static_branch_disable(&sched_schedstats);
+ }
}

void force_schedstat_enabled(void)
{
if (!schedstat_enabled()) {
+ compute_skid();
pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
static_branch_enable(&sched_schedstats);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac5bbf7..115b1a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3740,29 +3740,35 @@ static inline bool steal_enabled(void)
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;

if (!steal_enabled())
return;

+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_clear_elem(rq->cpu, overload_cpus);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}

static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;

if (!steal_enabled())
return;

+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_set_elem(rq->cpu, overload_cpus);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}

static int try_steal(struct rq *this_rq, struct rq_flags *rf);
@@ -6183,6 +6189,16 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}

+#define SET_STAT(STAT) \
+ do { \
+ if (schedstat_enabled()) { \
+ struct rq *rq = this_rq(); \
+ \
+ if (rq) \
+ __schedstat_inc(rq->STAT); \
+ } \
+ } while (0)
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
@@ -6191,14 +6207,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
struct sched_domain *sd;
int i, recent_used_cpu;

- if (available_idle_cpu(target))
+ if (available_idle_cpu(target)) {
+ SET_STAT(found_idle_cpu_easy);
return target;
+ }

/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
- if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+ if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev)) {
+ SET_STAT(found_idle_cpu_easy);
return prev;
+ }

/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
@@ -6211,26 +6231,36 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* Replace recent_used_cpu with prev as it is a potential
* candidate for the next wake:
*/
+ SET_STAT(found_idle_cpu_easy);
p->recent_used_cpu = prev;
return recent_used_cpu;
}

sd = rcu_dereference(per_cpu(sd_llc, target));
- if (!sd)
+ if (!sd) {
+ SET_STAT(nofound_idle_cpu);
return target;
+ }

i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_core);
return i;
+ }

i = select_idle_cpu(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_cpu);
return i;
+ }

i = select_idle_smt(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_cpu);
return i;
+ }

+ SET_STAT(nofound_idle_cpu);
return target;
}

@@ -6384,6 +6414,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
+ unsigned long time = schedstat_start_time();
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
@@ -6432,6 +6463,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
current->recent_used_cpu = cpu;
}
rcu_read_unlock();
+ schedstat_end_time(cpu_rq(cpu)->find_time, time);

return new_cpu;
}
@@ -6678,6 +6710,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ unsigned long time;

again:
if (!cfs_rq->nr_running)
@@ -6792,6 +6825,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
idle:
update_misfit_status(NULL, rq);

+ time = schedstat_start_time();
+
/*
* We must set idle_stamp _before_ calling try_steal() or
* idle_balance(), such that we measure the duration as idle time.
@@ -6805,6 +6840,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
if (new_tasks)
IF_SMP(rq->idle_stamp = 0;)

+ schedstat_end_time(rq->find_time, time);
+
/*
* Because try_steal() and idle_balance() release (and re-acquire)
* rq->lock, it is possible for any higher priority task to appear.
@@ -9878,6 +9915,7 @@ static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
update_rq_clock(dst_rq);
attach_task(dst_rq, p);
stolen = 1;
+ schedstat_inc(dst_rq->steal);
}
local_irq_restore(rf.flags);

@@ -9902,6 +9940,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
int dst_cpu = dst_rq->cpu;
bool locked = true;
int stolen = 0;
+ bool any_overload = false;
struct sparsemask *overload_cpus;

if (!steal_enabled())
@@ -9944,6 +9983,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen = 1;
goto out;
}
+ any_overload = true;
}

out:
@@ -9955,6 +9995,8 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen |= (dst_rq->cfs.h_nr_running > 0);
if (dst_rq->nr_running != dst_rq->cfs.h_nr_running)
stolen = -1;
+ if (!stolen && any_overload)
+ schedstat_inc(dst_rq->steal_fail);
return stolen;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a28340..f61f640 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -915,6 +915,15 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* Idle search stats */
+ unsigned int found_idle_core;
+ unsigned int found_idle_cpu;
+ unsigned int found_idle_cpu_easy;
+ unsigned int nofound_idle_cpu;
+ unsigned long find_time;
+ unsigned int steal;
+ unsigned int steal_fail;
#endif

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 750fb3c..00b3de5 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
* Bump this up when changing the output format or the meaning of an existing
* format, so that tools can adapt (or abort)
*/
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16

static int show_schedstat(struct seq_file *seq, void *v)
{
@@ -37,6 +37,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
rq->rq_cpu_time,
rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);

+ seq_printf(seq, " %u %u %u %u %lu %u %u",
+ rq->found_idle_cpu_easy,
+ rq->found_idle_cpu,
+ rq->found_idle_core,
+ rq->nofound_idle_cpu,
+ rq->find_time,
+ rq->steal,
+ rq->steal_fail);
+
seq_printf(seq, "\n");

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 4904c46..63ba3c2 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -39,6 +39,17 @@
#define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)
+#define schedstat_start_time() schedstat_val_or_zero(local_clock())
+#define schedstat_end_time(stat, time) \
+ do { \
+ unsigned long endtime; \
+ \
+ if (schedstat_enabled() && (time)) { \
+ endtime = local_clock() - (time) - schedstat_skid; \
+ schedstat_add((stat), endtime); \
+ } \
+ } while (0)
+extern unsigned long schedstat_skid;

#else /* !CONFIG_SCHEDSTATS: */
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
@@ -53,6 +64,8 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
+# define schedstat_start_time() 0
+# define schedstat_end_time(stat, t) do { } while (0)
#endif /* CONFIG_SCHEDSTATS */

#ifdef CONFIG_PSI
--
1.8.3.1


2018-11-09 13:02:04

by Steven Sistare

Subject: [PATCH v3 04/10] sched/fair: Dynamically update cfs_overload_cpus

An overloaded CPU has more than 1 runnable task. When a CFS task wakes
on a CPU, if h_nr_running rises from fewer than 2 to 2 or more, then set the
CPU in the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_running
drops from 2 or more to fewer than 2, then clear the CPU in cfs_overload_cpus.
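
In sketch form, the checks added by the hunks below reduce to the following
(simplified; the real hunks are open-coded at each enqueue/dequeue and
throttle/unthrottle site, and the helper name here is illustrative):

/* Simplified form of the transition checks added below. */
static void update_overload(struct rq *rq, long prev_nr, long delta)
{
        /* rq->cfs.h_nr_running was prev_nr and is now prev_nr + delta */
        if (prev_nr < 2 && prev_nr + delta >= 2)
                overload_set(rq);       /* fewer than 2 -> 2 or more: now overloaded */
        else if (prev_nr >= 2 && prev_nr + delta < 2)
                overload_clear(rq);     /* 2 or more -> fewer than 2: no longer overloaded */
}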

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee271bb..9031d39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
#include "sched.h"

#include <trace/events/sched.h>
+#include <linux/sparsemask.h>

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -3724,6 +3725,28 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
rq->misfit_task_load = task_h_load(p);
}

+static void overload_clear(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_clear_elem(rq->cpu, overload_cpus);
+ rcu_read_unlock();
+}
+
+static void overload_set(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_set_elem(rq->cpu, overload_cpus);
+ rcu_read_unlock();
+}
+
#else /* CONFIG_SMP */

#define UPDATE_TG 0x0
@@ -3747,6 +3770,9 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+static inline void overload_clear(struct rq *rq) {}
+static inline void overload_set(struct rq *rq) {}
+
static inline void
util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}

@@ -4441,6 +4467,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
+ unsigned int prev_nr = rq->cfs.h_nr_running;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long task_delta, dequeue = 1;
@@ -4468,8 +4495,12 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue = 0;
}

- if (!se)
+ if (!se) {
sub_nr_running(rq, task_delta);
+ if (prev_nr >= 2 && prev_nr - task_delta < 2)
+ overload_clear(rq);
+
+ }

cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
@@ -4499,6 +4530,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
+ unsigned int prev_nr = rq->cfs.h_nr_running;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
int enqueue = 1;
@@ -4535,8 +4567,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}

- if (!se)
+ if (!se) {
add_nr_running(rq, task_delta);
+ if (prev_nr < 2 && prev_nr + task_delta >= 2)
+ overload_set(rq);
+ }

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5082,6 +5117,7 @@ static inline void hrtick_update(struct rq *rq)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ unsigned int prev_nr = rq->cfs.h_nr_running;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -5129,8 +5165,12 @@ static inline void hrtick_update(struct rq *rq)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ if (prev_nr == 1)
+ overload_set(rq);
+
+ }

hrtick_update(rq);
}
@@ -5147,6 +5187,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
+ unsigned int prev_nr = rq->cfs.h_nr_running;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -5188,8 +5229,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
sub_nr_running(rq, 1);
+ if (prev_nr == 2)
+ overload_clear(rq);
+ }

util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
--
1.8.3.1


2018-11-09 13:02:19

by Steven Sistare

Subject: [PATCH v3 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle

When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
task, then attempt to steal a task from an overloaded CPU in the same LLC,
using the cfs_overload_cpus bitmap to efficiently identify candidates. To
minimize search time, steal the first migratable task that is found when
the bitmap is traversed. For fairness, search for migratable tasks on an
overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.

Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in the old and new functions that search for
idle CPUs, search for tasks to steal, and set the overloaded CPUs bitmap.

steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps time %busy slice sched idle wake %find steal
1 8.084 75.02 0.10 105476 46291 59183 0.31 0
2 13.892 85.33 0.10 190225 70958 119264 0.45 0
3 19.668 89.04 0.10 263896 87047 176850 0.49 0
4 25.279 91.28 0.10 322171 94691 227474 0.51 0
8 47.832 94.86 0.09 630636 144141 486322 0.56 0

new
grps time %busy slice sched idle wake %find steal %speedup
1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up
by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a
regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec. Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new.

------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 8.008 0.1 5.905 0.2 35.6
2 13.814 0.2 11.438 0.1 20.7
3 19.488 0.2 16.919 0.1 15.1
4 25.059 0.1 22.409 0.1 11.8
8 47.478 0.1 44.221 0.1 7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 4.586 0.8 4.596 0.6 -0.3
2 7.693 0.2 5.775 1.3 33.2
3 10.442 0.3 8.288 0.3 25.9
4 13.087 0.2 11.057 0.1 18.3
8 24.145 0.2 22.076 0.3 9.3
16 43.779 0.1 41.741 0.2 4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

clients %speedup
1 16.2
2 11.7
4 9.9
8 12.8
16 13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Benchmark %speedup
specjbb2015_critical_jops 5.7
mysql_sysb1.0.14_mutex_2 40.6
mysql_sysb1.0.14_oltp_2 3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 7.945 0.2 7.219 8.7 10.0
2 8.444 0.4 6.689 1.5 26.2
3 12.100 1.1 9.962 2.0 21.4
4 15.001 0.4 13.109 1.1 14.4
8 27.960 0.2 26.127 0.3 7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 5.826 5.4 5.840 5.0 -0.3
2 5.041 5.3 6.171 23.4 -18.4
3 6.839 2.1 6.324 3.8 8.1
4 8.177 0.6 7.318 3.6 11.7
8 14.429 0.7 13.966 1.3 3.3
16 26.401 0.3 25.149 1.5 4.9

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

Customers Users %speedup
1200000 40 -1.2
2400000 80 2.7
3600000 120 8.9
4800000 160 4.4
6000000 200 3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".

Benchmark %speedup

mysql_sysb1.0.14_fileio_56_rndrd 19.6
mysql_sysb1.0.14_fileio_56_seqrd 12.1
mysql_sysb1.0.14_fileio_56_rndwr 0.4
mysql_sysb1.0.14_fileio_56_seqrewr -0.3

pgsql_sysb1.0.14_fileio_56_rndrd 19.5
pgsql_sysb1.0.14_fileio_56_seqrd 8.6
pgsql_sysb1.0.14_fileio_56_rndwr 1.0
pgsql_sysb1.0.14_fileio_56_seqrewr 0.5

opatch_time_ASM_12.2.0.1.0_HP2M 7.5
select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8

lm3_memlat_L2 4.8
lm3_memlat_L1 0.0

ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

NAS_OMP
bench class ncpu %improved(Mops)
dc B 72 1.3
is C 72 0.9
is D 72 0.7

sysbench mysql, average of 24 runs
--- base --- --- new ---
nthr events %stdev events %stdev %speedup
1 331.0 0.25 331.0 0.24 -0.1
2 661.3 0.22 661.8 0.22 0.0
4 1297.0 0.88 1300.5 0.82 0.2
8 2420.8 0.04 2420.5 0.04 -0.1
16 4826.3 0.07 4825.4 0.05 -0.1
32 8815.3 0.27 8830.2 0.18 0.1
64 12823.0 0.24 12823.6 0.26 0.0

-------------------------------------------------------------

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 169 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 6 ++
2 files changed, 170 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dc6224d..97bdea2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3731,6 +3731,9 @@ static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;

+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
@@ -3742,6 +3745,9 @@ static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;

+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
@@ -3749,6 +3755,8 @@ static void overload_set(struct rq *rq)
rcu_read_unlock();
}

+static int try_steal(struct rq *this_rq, struct rq_flags *rf);
+
#else /* CONFIG_SMP */

#define UPDATE_TG 0x0
@@ -3785,6 +3793,11 @@ static inline void overload_set(struct rq *rq) {}
bool task_sleep) {}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}

+static inline int try_steal(struct rq *this_rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
#endif /* CONFIG_SMP */

static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -6770,20 +6783,22 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
update_misfit_status(NULL, rq);

/*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
+ * We must set idle_stamp _before_ calling try_steal() or
+ * idle_balance(), such that we measure the duration as idle time.
*/
IF_SMP(rq->idle_stamp = rq_clock(rq);)

new_tasks = idle_balance(rq, rf);
+ if (new_tasks == 0)
+ new_tasks = try_steal(rq, rf);

if (new_tasks)
IF_SMP(rq->idle_stamp = 0;)

/*
- * Because idle_balance() releases (and re-acquires) rq->lock, it is
- * possible for any higher priority task to appear. In that case we
- * must re-start the pick_next_entity() loop.
+ * Because try_steal() and idle_balance() release (and re-acquire)
+ * rq->lock, it is possible for any higher priority task to appear.
+ * In that case we must re-start the pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
@@ -9789,6 +9804,150 @@ void trigger_load_balance(struct rq *rq)
nohz_balancer_kick(rq);
}

+/*
+ * Search the runnable tasks in @cfs_rq in order of next to run, and find
+ * the first one that can be migrated to @dst_rq. @cfs_rq is locked on entry.
+ * On success, dequeue the task from @cfs_rq and return it, else return NULL.
+ */
+static struct task_struct *
+detach_next_task(struct cfs_rq *cfs_rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+ struct task_struct *p;
+ struct rq *rq = rq_of(cfs_rq);
+
+ lockdep_assert_held(&rq_of(cfs_rq)->lock);
+
+ list_for_each_entry_reverse(p, &rq->cfs_tasks, se.group_node) {
+ if (can_migrate_task_llc(p, rq, dst_rq)) {
+ detach_task(p, rq, dst_cpu);
+ return p;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * Attempt to migrate a CFS task from @src_cpu to @dst_rq. @locked indicates
+ * whether @dst_rq is already locked on entry. This function may lock or
+ * unlock @dst_rq, and updates @locked to indicate the locked state on return.
+ * The locking protocol is based on idle_balance().
+ * Returns 1 on success and 0 on failure.
+ */
+static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
+ int src_cpu)
+{
+ struct task_struct *p;
+ struct rq_flags rf;
+ int stolen = 0;
+ int dst_cpu = dst_rq->cpu;
+ struct rq *src_rq = cpu_rq(src_cpu);
+
+ if (dst_cpu == src_cpu || src_rq->cfs.h_nr_running < 2)
+ return 0;
+
+ if (*locked) {
+ rq_unpin_lock(dst_rq, dst_rf);
+ raw_spin_unlock(&dst_rq->lock);
+ *locked = false;
+ }
+ rq_lock_irqsave(src_rq, &rf);
+ update_rq_clock(src_rq);
+
+ if (src_rq->cfs.h_nr_running < 2 || !cpu_active(src_cpu))
+ p = NULL;
+ else
+ p = detach_next_task(&src_rq->cfs, dst_rq);
+
+ rq_unlock(src_rq, &rf);
+
+ if (p) {
+ raw_spin_lock(&dst_rq->lock);
+ rq_repin_lock(dst_rq, dst_rf);
+ *locked = true;
+ update_rq_clock(dst_rq);
+ attach_task(dst_rq, p);
+ stolen = 1;
+ }
+ local_irq_restore(rf.flags);
+
+ return stolen;
+}
+
+/*
+ * Conservative upper bound on the max cost of a steal, in nsecs (the typical
+ * cost is 1-2 microsec). Do not steal if average idle time is less.
+ */
+#define SCHED_STEAL_COST 10000
+
+/*
+ * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
+ * and migrate it to @dst_rq. rq_lock is held on entry and return, but
+ * may be dropped in between. Return 1 on success, 0 on failure, and -1
+ * if a task in a different scheduling class has become runnable on @dst_rq.
+ */
+static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
+{
+ int src_cpu;
+ int dst_cpu = dst_rq->cpu;
+ bool locked = true;
+ int stolen = 0;
+ struct sparsemask *overload_cpus;
+
+ if (!sched_feat(STEAL))
+ return 0;
+
+ if (!cpu_active(dst_cpu))
+ return 0;
+
+ if (dst_rq->avg_idle < SCHED_STEAL_COST)
+ return 0;
+
+ /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
+ if (!overload_cpus) {
+ rcu_read_unlock();
+ return 0;
+ }
+
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * First try overloaded CPUs on the same core to preserve cache warmth.
+ */
+ if (static_branch_likely(&sched_smt_present)) {
+ for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
+ if (sparsemask_test_elem(src_cpu, overload_cpus) &&
+ steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ goto out;
+ }
+ }
+ }
+#endif /* CONFIG_SCHED_SMT */
+
+ /* Accept any suitable task in the LLC */
+
+ for_each_sparse_wrap(src_cpu, overload_cpus, dst_cpu) {
+ if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ goto out;
+ }
+ }
+
+out:
+ rcu_read_unlock();
+ if (!locked) {
+ raw_spin_lock(&dst_rq->lock);
+ rq_repin_lock(dst_rq, dst_rf);
+ }
+ stolen |= (dst_rq->cfs.h_nr_running > 0);
+ if (dst_rq->nr_running != dst_rq->cfs.h_nr_running)
+ stolen = -1;
+ return stolen;
+}
+
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..a665a9e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,12 @@
SCHED_FEAT(SIS_PROP, true)

/*
+ * Steal a CFS task from another CPU when going idle.
+ * Improves CPU utilization.
+ */
+SCHED_FEAT(STEAL, true)
+
+/*
* Issue a WARN when we do multiple update_rq_clock() calls
* in a single rq->lock section. Default disabled because the
* annotations are not complete.
--
1.8.3.1


2018-11-09 13:03:12

by Steven Sistare

Subject: [PATCH v3 02/10] sched/topology: Provide hooks to allocate data shared per LLC

Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and
free data pointed to by struct sched_domain_shared at the last-level-cache
domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to
eliminate the unnecessary allocations that would occur if we instead
allocated in __sdt_alloc() and then figured out which shared nodes are
redundant.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/topology.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8d7f15b..3e72ce0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -10,6 +10,12 @@
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;

+struct s_data;
+static int sd_llc_alloc(struct sched_domain *sd);
+static void sd_llc_free(struct sched_domain *sd);
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d);
+static void sd_llc_free_all(const struct cpumask *cpu_map);
+
#ifdef CONFIG_SCHED_DEBUG

static int __init sched_debug_setup(char *str)
@@ -361,8 +367,10 @@ static void destroy_sched_domain(struct sched_domain *sd)
*/
free_sched_groups(sd->groups, 1);

- if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+ if (sd->shared && atomic_dec_and_test(&sd->shared->ref)) {
+ sd_llc_free(sd);
kfree(sd->shared);
+ }
kfree(sd);
}

@@ -996,6 +1004,7 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
free_percpu(d->sd);
/* Fall through */
case sa_sd_storage:
+ sd_llc_free_all(cpu_map);
__sdt_free(cpu_map);
/* Fall through */
case sa_none:
@@ -1610,6 +1619,62 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}

+static int sd_llc_alloc(struct sched_domain *sd)
+{
+ /* Allocate sd->shared data here. Empty for now. */
+
+ return 0;
+}
+
+static void sd_llc_free(struct sched_domain *sd)
+{
+ struct sched_domain_shared *sds = sd->shared;
+
+ if (!sds)
+ return;
+
+ /* Free data here. Empty for now. */
+}
+
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
+{
+ struct sched_domain *sd, *hsd;
+ int i;
+
+ for_each_cpu(i, cpu_map) {
+ /* Find highest domain that shares resources */
+ hsd = NULL;
+ for (sd = *per_cpu_ptr(d->sd, i); sd; sd = sd->parent) {
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
+ break;
+ hsd = sd;
+ }
+ if (hsd && sd_llc_alloc(hsd))
+ return 1;
+ }
+
+ return 0;
+}
+
+static void sd_llc_free_all(const struct cpumask *cpu_map)
+{
+ struct sched_domain_topology_level *tl;
+ struct sched_domain *sd;
+ struct sd_data *sdd;
+ int j;
+
+ for_each_sd_topology(tl) {
+ sdd = &tl->data;
+ if (!sdd)
+ continue;
+ for_each_cpu(j, cpu_map) {
+ sd = *per_cpu_ptr(sdd->sd, j);
+ if (sd)
+ sd_llc_free(sd);
+ }
+ }
+}
+
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *child, int dflags, int cpu)
@@ -1769,6 +1834,14 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
}
}

+ /*
+ * Allocate shared sd data at last level cache. Must be done after
+ * domains are built above, but before the data is used in
+ * cpu_attach_domain and descendants below.
+ */
+ if (sd_llc_alloc_all(cpu_map, &d))
+ goto error;
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
--
1.8.3.1


2018-11-09 13:03:22

by Steven Sistare

Subject: [PATCH v3 09/10] sched/fair: disable stealing if too many NUMA nodes

The STEAL feature causes regressions on hackbench on larger NUMA systems,
so disable it on systems with more than sched_steal_node_limit nodes
(default 2). Note that the feature remains enabled as seen in features.h
and /sys/kernel/debug/sched_features, but stealing is only performed if
nodes <= sched_steal_node_limit. This arrangement allows users to activate
stealing on reboot by setting the kernel parameter sched_steal_node_limit
on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary
and will be deleted when the regression is fixed.

Details of the regression follow. With the STEAL feature set, hackbench
is slower on many-node systems:

X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
Average of 10 runs of: hackbench <groups> process 50000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 3.627 15.8 3.876 7.3 -6.5
2 4.545 24.7 5.583 16.7 -18.6
3 5.716 25.0 7.367 14.2 -22.5
4 6.901 32.9 7.718 14.5 -10.6
8 8.604 38.5 9.111 16.0 -5.6
16 7.734 6.8 11.007 8.2 -29.8

Total CPU time increases. Profiling shows that CPU time increases
uniformly across all functions, suggesting a systemic increase in cache
or memory latency. This may be due to NUMA migrations, as they cause
loss of LLC cache footprint and remote memory latencies.

The domains for this system and their flags are:

domain0 (SMT) : 1 core
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
SD_WAKE_AFFINE

domain1 (MC) : 1 socket
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
SD_WAKE_AFFINE

domain2 (NUMA) : 4 sockets
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SERIALIZE SD_OVERLAP SD_NUMA
SD_WAKE_AFFINE

domain3 (NUMA) : 8 sockets
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE
SD_SERIALIZE SD_OVERLAP SD_NUMA

Schedstats point to the root cause of the regression. hackbench is run
10 times per group and the average schedstat accumulation per-run and
per-cpu is shown below. Note that domain3 moves are zero because
SD_WAKE_AFFINE is not set there.

NO_STEAL
--- domain2 --- --- domain3 ---
grp time %busy sched idle wake steal remote move pull remote move pull
1 20.3 10.3 28710 14346 14366 0 490 3378 0 4039 0 0
2 26.4 18.8 56721 28258 28469 0 792 7026 12 9229 0 7
3 29.9 28.3 90191 44933 45272 0 5380 7204 19 16481 0 3
4 30.2 35.8 121324 60409 60933 0 7012 9372 27 21438 0 5
8 27.7 64.2 229174 111917 117272 0 11991 1837 168 44006 0 32
16 32.6 74.0 334615 146784 188043 0 3404 1468 49 61405 0 8

STEAL
--- domain2 --- --- domain3 ---
grp time %busy sched idle wake steal remote move pull remote move pull
1 20.6 10.2 28490 14232 14261 18 3 3525 0 4254 0 0
2 27.9 18.8 56757 28203 28562 303 1675 7839 5 9690 0 2
3 35.3 27.7 87337 43274 44085 698 741 12785 14 15689 0 3
4 36.8 36.0 118630 58437 60216 1579 2973 14101 28 18732 0 7
8 48.1 73.8 289374 133681 155600 18646 35340 10179 171 65889 0 34
16 41.4 82.5 268925 91908 177172 47498 17206 6940 176 71776 0 20

Cross-numa-node migrations are caused by load balancing pulls and
wake_affine moves. Pulls are small and similar for no_steal and steal.
However, moves are significantly higher for steal, and rows above with the
highest moves have the worst regressions for time; see for example grp=8.

Moves increase for steal due to the following logic in wake_affine_idle()
for synchronous wakeup:

if (sync && cpu_rq(this_cpu)->nr_running == 1)
return this_cpu; // move the task

The steal feature does a better job of smoothing the load between idle
and busy CPUs, so nr_running is 1 more often, and moves are performed
more often. For hackbench, cross-node affine moves early in the run are
good because they colocate wakers and wakees from the same group on the
same node, but continued moves later in the run are bad, because the wakee
is moved away from peers on its previous node. Note that even no_steal
is far from optimal; binding an instance of "hackbench 2" to each of the
8 NUMA nodes runs much faster than running "hackbench 16" with no binding.

Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node
migrations and eliminates the difference between no_steal and steal
performance. However, overall performance is lower than WA_IDLE because
some migrations are helpful as explained above.

I have tried many heuristics in an attempt to optimize the number of
cross-node moves in all conditions, with limited success. The fundamental
problem is that the scheduler does not track which groups of tasks talk to
each other. Parts of several groups become entrenched on the same node,
filling it to capacity, leaving no room for either group to pull its peers
over, and there is neither data nor mechanism for the scheduler to evict
one group to make room for the other.

For now, disable STEAL on such systems until we can do better, or it is
shown that hackbench is atypical and most workloads benefit from stealing.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++---
kernel/sched/sched.h | 2 +-
kernel/sched/topology.c | 25 +++++++++++++++++++++++++
3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97bdea2..ac5bbf7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3727,11 +3727,21 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)

#define IF_SMP(statement) statement

+static inline bool steal_enabled(void)
+{
+#ifdef CONFIG_NUMA
+ bool allow = static_branch_likely(&sched_steal_allow);
+#else
+ bool allow = true;
+#endif
+ return sched_feat(STEAL) && allow;
+}
+
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return;

rcu_read_lock();
@@ -3745,7 +3755,7 @@ static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return;

rcu_read_lock();
@@ -9894,7 +9904,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
int stolen = 0;
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return 0;

if (!cpu_active(dst_cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eacf5db..2a28340 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -936,7 +936,6 @@ static inline int cpu_of(struct rq *rq)
#endif
}

-
#ifdef CONFIG_SCHED_SMT

extern struct static_key_false sched_smt_present;
@@ -1185,6 +1184,7 @@ enum numa_topology_type {
#endif

#ifdef CONFIG_NUMA
+extern struct static_key_true sched_steal_allow;
extern void sched_init_numa(void);
extern void sched_domains_numa_masks_set(unsigned int cpu);
extern void sched_domains_numa_masks_clear(unsigned int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6455bde..fc511de 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1345,6 +1345,30 @@ static void init_numa_topology_type(void)
}
}

+DEFINE_STATIC_KEY_TRUE(sched_steal_allow);
+static int sched_steal_node_limit;
+#define SCHED_STEAL_NODE_LIMIT_DEFAULT 2
+
+static int __init steal_node_limit_setup(char *buf)
+{
+ get_option(&buf, &sched_steal_node_limit);
+ return 0;
+}
+
+early_param("sched_steal_node_limit", steal_node_limit_setup);
+
+static void check_node_limit(void)
+{
+ int n = num_possible_nodes();
+
+ if (sched_steal_node_limit == 0)
+ sched_steal_node_limit = SCHED_STEAL_NODE_LIMIT_DEFAULT;
+ if (n > sched_steal_node_limit) {
+ static_branch_disable(&sched_steal_allow);
+ pr_debug("Suppressing sched STEAL. To enable, reboot with sched_steal_node_limit=%d", n);
+ }
+}
+
void sched_init_numa(void)
{
int next_distance, curr_distance = node_distance(0, 0);
@@ -1493,6 +1517,7 @@ void sched_init_numa(void)
sched_max_numa_distance = sched_domains_numa_distance[level - 1];

init_numa_topology_type();
+ check_node_limit();
}

void sched_domains_numa_masks_set(unsigned int cpu)
--
1.8.3.1


2018-11-09 13:03:24

by Steven Sistare

Subject: [PATCH v3 07/10] sched/fair: Provide can_migrate_task_llc

Define a simpler version of can_migrate_task, called can_migrate_task_llc,
which does not require a struct lb_env argument and judges whether a
migration from one CPU to another within the same LLC should be allowed.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 453d280..dc6224d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7199,6 +7199,34 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
}

/*
+ * Return true if task @p can migrate from @rq to @dst_rq in the same LLC.
+ * No need to test for co-locality, and no need to test task_hot(), as sharing
+ * LLC provides cache warmth at that level.
+ */
+static bool
+can_migrate_task_llc(struct task_struct *p, struct rq *rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+
+ lockdep_assert_held(&rq->lock);
+
+ if (throttled_lb_pair(task_group(p), cpu_of(rq), dst_cpu))
+ return false;
+
+ if (!cpumask_test_cpu(dst_cpu, &p->cpus_allowed)) {
+ schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+ return false;
+ }
+
+ if (task_running(rq, p)) {
+ schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+ return false;
+ }
+
+ return true;
+}
+
+/*
* detach_task() -- detach the task for the migration from @src_rq to @dst_cpu.
*/
static void detach_task(struct task_struct *p, struct rq *src_rq, int dst_cpu)
--
1.8.3.1


2018-11-09 13:04:07

by Steven Sistare

Subject: [PATCH v3 06/10] sched/fair: Generalize the detach_task interface

The detach_task function takes a struct lb_env argument, but only needs a
few of its members. Pass the rq and cpu arguments explicitly so the
function may be called from code that is not based on lb_env. No
functional change.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da368ed..453d280 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7199,15 +7199,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
}

/*
- * detach_task() -- detach the task for the migration specified in env
+ * detach_task() -- detach the task for the migration from @src_rq to @dst_cpu.
*/
-static void detach_task(struct task_struct *p, struct lb_env *env)
+static void detach_task(struct task_struct *p, struct rq *src_rq, int dst_cpu)
{
- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(&src_rq->lock);

p->on_rq = TASK_ON_RQ_MIGRATING;
- deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
- set_task_cpu(p, env->dst_cpu);
+ deactivate_task(src_rq, p, DEQUEUE_NOCLOCK);
+ set_task_cpu(p, dst_cpu);
}

/*
@@ -7227,7 +7227,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
if (!can_migrate_task(p, env))
continue;

- detach_task(p, env);
+ detach_task(p, env->src_rq, env->dst_cpu);

/*
* Right now, this is only the second place where
@@ -7294,7 +7294,7 @@ static int detach_tasks(struct lb_env *env)
if ((load / 2) > env->imbalance)
goto next;

- detach_task(p, env);
+ detach_task(p, env->src_rq, env->dst_cpu);
list_add(&p->se.group_node, &env->tasks);

detached++;
--
1.8.3.1


2018-11-09 13:04:15

by Steven Sistare

Subject: [PATCH v3 01/10] sched: Provide sparsemask, a reduced contention bitmap

From: Steve Sistare <[email protected]>

Provide struct sparsemask and functions to manipulate it. A sparsemask is
a sparse bitmap. It reduces cache contention vs the usual bitmap when many
threads concurrently set, clear, and visit elements, by reducing the number
of significant bits per cacheline. For each 64 byte chunk of the mask,
only the first K bits of the first word are used, and the remaining bits
are ignored, where K is a creation time parameter. Thus a sparsemask that
can represent a set of N elements is approximately (N/K * 64) bytes in
size.
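
As a rough illustration of the layout arithmetic described above (not part
of the patch; it simply restates the chunking rule that the macros in
sparsemask.h encode):

/*
 * Illustration only: with K significant bits per 64-byte chunk, element i
 * lives in chunk i / K, at bit i % K of that chunk's first word, so a mask
 * of N elements needs roughly N/K chunks of 64 bytes each.
 */
#define CHUNK_BYTES     64
#define CHUNK_BITS      (CHUNK_BYTES * 8)

static inline unsigned int sparse_bit_of(unsigned int elem, unsigned int K)
{
        return (elem / K) * CHUNK_BITS + (elem % K);
}

static inline unsigned int sparse_bytes_for(unsigned int nelems, unsigned int K)
{
        return ((nelems + K - 1) / K) * CHUNK_BYTES;    /* ~ (N/K) * 64 bytes */
}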

Signed-off-by: Steve Sistare <[email protected]>
---
include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/sparsemask.c | 142 +++++++++++++++++++++++++
3 files changed, 403 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sparsemask.h
create mode 100644 lib/sparsemask.c

diff --git a/include/linux/sparsemask.h b/include/linux/sparsemask.h
new file mode 100644
index 0000000..d36a3be
--- /dev/null
+++ b/include/linux/sparsemask.h
@@ -0,0 +1,260 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sparsemask.h - sparse bitmap operations
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __LINUX_SPARSEMASK_H
+#define __LINUX_SPARSEMASK_H
+
+#include <linux/kernel.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+
+/*
+ * A sparsemask is a sparse bitmap. It reduces cache contention vs the usual
+ * bitmap when many threads concurrently set, clear, and visit elements. For
+ * each 64 byte chunk of the mask, only the first K bits of the first word are
+ * used, and the remaining bits are ignored, where K is a creation time
+ * parameter. Thus a sparsemask that can represent a set of N elements is
+ * approximately (N/K * 64) bytes in size.
+ *
+ * Clients pass and receive element numbers in the public API, and the
+ * implementation translates them to bit numbers to perform the bitmap
+ * operations.
+ *
+ * This file is partially derived from cpumask.h, and the public sparsemask
+ * operations are drop-in replacements for cpumask operations. However,
+ * sparsemask has no dependency on CPU definitions and can be used to
+ * represent any kind of elements.
+ */
+
+struct sparsemask {
+ short nelems; /* current number of elements */
+ short density; /* store 2^density elements per chunk */
+ unsigned long bits[0]; /* embedded array of chunks */
+};
+
+/* The maximum value for density, which implicitly defines the chunk size */
+
+#define _SMASK_DENSITY_MAX 6
+
+#define SMASK_DENSITY_TO_BYTES(density) (1U << (density))
+#define SMASK_DENSITY_TO_ELEMS(density) (1U << (density))
+
+/* The number of elements/bits/bytes/longs in a chunk */
+
+#define SMASK_ELEMS(mask) SMASK_DENSITY_TO_ELEMS((mask)->density)
+#define SMASK_BYTES SMASK_DENSITY_TO_BYTES(_SMASK_DENSITY_MAX)
+#define SMASK_BITS (SMASK_BYTES * BITS_PER_BYTE)
+#define SMASK_LONGS (SMASK_BYTES / sizeof(long))
+
+/*
+ * Translate element index @elem to a bit/byte/long index.
+ * @density: the density of a chunk.
+ */
+
+#define _SMASK_ELEM_TO_BIT(elem, density) \
+ ((elem) / SMASK_DENSITY_TO_ELEMS(density) * SMASK_BITS +\
+ (elem) % SMASK_DENSITY_TO_ELEMS(density))
+
+#define _SMASK_ELEM_TO_BYTE(elem, density) \
+ (_SMASK_ELEM_TO_BIT(elem, density) / BITS_PER_BYTE)
+
+#define _SMASK_ELEM_TO_LONG(elem, density) \
+ (_SMASK_ELEM_TO_BYTE(elem, density) / sizeof(long))
+
+/* Translate @bit/@byte/@long index to an element index */
+
+#define _SMASK_BIT_TO_ELEM(bit, density) \
+ ((bit) / SMASK_BITS * SMASK_DENSITY_TO_ELEMS(density) + \
+ (bit) % SMASK_BITS)
+
+#define _SMASK_BYTE_TO_ELEM(byte, density) \
+ _SMASK_BIT_TO_ELEM((byte) * BITS_PER_BYTE, density)
+
+#define _SMASK_LONG_TO_ELEM(index, density) \
+ _SMASK_BYTE_TO_ELEM((index) * sizeof(long), density)
+
+/* Same translations as above, but taking sparsemask @m instead of density */
+
+#define SMASK_ELEM_TO_BYTE(elem, m) _SMASK_ELEM_TO_BYTE(elem, (m)->density)
+#define SMASK_ELEM_TO_BIT(elem, m) _SMASK_ELEM_TO_BIT(elem, (m)->density)
+#define SMASK_ELEM_TO_LONG(elem, m) _SMASK_ELEM_TO_LONG(elem, (m)->density)
+#define SMASK_BYTE_TO_ELEM(byte, m) _SMASK_BYTE_TO_ELEM(byte, (m)->density)
+#define SMASK_BIT_TO_ELEM(bit, m) _SMASK_BIT_TO_ELEM(bit, (m)->density)
+#define SMASK_LONG_TO_ELEM(index, m) _SMASK_LONG_TO_ELEM(index, (m)->density)
+
+/*
+ * Verify the @elem argument to sparsemask functions, and return its bit.
+ */
+static inline int
+sparsemask_check(int elem, const struct sparsemask *mask)
+{
+ WARN_ON_ONCE(elem >= mask->nelems);
+ return SMASK_ELEM_TO_BIT(elem, mask);
+}
+
+int sparsemask_next(int n, const struct sparsemask *srcp);
+int sparsemask_next_wrap(int n, const struct sparsemask *mask,
+ int start, bool wrap);
+
+/****************** The public API ********************/
+
+/*
+ * for_each_sparse - iterate over every element in a mask
+ * @elem: the (optionally unsigned) integer iterator
+ * @mask: the sparsemask
+ *
+ * After the loop, @elem is >= @mask->nelems.
+ */
+#define for_each_sparse(elem, mask) \
+ for ((elem) = -1; \
+ (elem) = sparsemask_next((elem), (mask)), \
+ (elem) < (mask)->nelems;)
+
+/*
+ * for_each_sparse_wrap - iterate over every element in a mask, starting at a
+ * specified location.
+ * @elem: the (optionally unsigned) integer iterator
+ * @mask: the sparsemask
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ * After the loop, @elem is >= @mask->nelems.
+ */
+#define for_each_sparse_wrap(elem, mask, start) \
+ for ((elem) = sparsemask_next_wrap((start)-1, (mask), (start), false); \
+ (elem) < (mask)->nelems; \
+ (elem) = sparsemask_next_wrap((elem), (mask), (start), true))
+
+/*
+ * sparsemask_set_elem - set an element in a sparsemask
+ * @elem: element number (< @dstp->nelems)
+ * @dstp: the sparsemask
+ */
+static inline void sparsemask_set_elem(int elem, struct sparsemask *dstp)
+{
+ set_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+static inline void __sparsemask_set_elem(int elem, struct sparsemask *dstp)
+{
+ __set_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+/*
+ * sparsemask_clear_elem - clear an element in a sparsemask
+ * @elem: element number (< @dstp->nelems)
+ * @dstp: the sparsemask
+ */
+static inline void sparsemask_clear_elem(int elem, struct sparsemask *dstp)
+{
+ clear_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+static inline void __sparsemask_clear_elem(int elem, struct sparsemask *dstp)
+{
+ __clear_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+/*
+ * sparsemask_test_elem - test for an element in a sparsemask
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in @mask, else returns 0
+ */
+static inline int sparsemask_test_elem(int elem, const struct sparsemask *mask)
+{
+ return test_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_test_and_set_elem - atomically test and set an element
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in old bitmap of @mask, else returns 0
+ */
+static inline int
+sparsemask_test_and_set_elem(int elem, struct sparsemask *mask)
+{
+ return test_and_set_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_test_and_clear_elem - atomically test and clear an element
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in old bitmap of @mask, else returns 0
+ */
+static inline int
+sparsemask_test_and_clear_elem(int elem, struct sparsemask *mask)
+{
+ return test_and_clear_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_weight - return count of bits in @mask (<= @mask->nelems)
+ * @mask: the sparsemask
+ */
+unsigned int sparsemask_weight(const struct sparsemask *srcp);
+
+/*
+ * Suggested and max value for the density parameter
+ */
+#define SPARSEMASK_DENSITY_DEFAULT 3
+#define SMASK_DENSITY_MAX _SMASK_DENSITY_MAX
+
+/*
+ * Allocate and initialize a sparsemask and return it in @maskp.
+ * @nelems - maximum number of elements.
+ * @density - store 2^density elements per 64-byte chunk.
+ * values from 0 to SMASK_DENSITY_MAX inclusive.
+ * @flags - kmalloc allocation flags
+ * @node - numa node
+ *
+ * Return true on success, like the cpumask functions.
+ */
+
+bool alloc_sparsemask(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags);
+
+bool zalloc_sparsemask(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags);
+
+bool alloc_sparsemask_node(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags, int node);
+
+bool zalloc_sparsemask_node(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags, int node);
+
+/*
+ * Free a sparsemask allocated by any of the above
+ */
+void free_sparsemask(struct sparsemask *mask);
+
+/*
+ * Return bytes to allocate for a sparsemask, for custom allocators
+ */
+size_t sparsemask_size(int nelems, int density);
+
+/*
+ * Initialize an allocated sparsemask, for custom allocators
+ */
+void sparsemask_init(struct sparsemask *mask, int nelems, int density);
+
+#endif /* __LINUX_SPARSEMASK_H */
diff --git a/lib/Makefile b/lib/Makefile
index db06d12..0af90fa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_MMU) += ioremap.o
-lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SMP) += cpumask.o sparsemask.o

lib-y += kobject.o klist.o
obj-y += lockref.o
diff --git a/lib/sparsemask.c b/lib/sparsemask.c
new file mode 100644
index 0000000..d51cf50
--- /dev/null
+++ b/lib/sparsemask.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * sparsemask.c - sparse bitmap operations
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/sparsemask.h>
+
+/*
+ * Return the next one bit in @mask after @start, not including @start.
+ */
+int sparsemask_next(int start, const struct sparsemask *mask)
+{
+ const unsigned long *addr = mask->bits;
+ unsigned long nelems = mask->nelems;
+ unsigned long tmp, nbits;
+
+ /* -1 is a legal arg here. */
+ if (start != -1)
+ sparsemask_check(start, mask);
+ start++;
+
+ if (unlikely(start >= nelems))
+ return nelems;
+
+ start = SMASK_ELEM_TO_BIT(start, mask);
+ nbits = SMASK_ELEM_TO_BIT(nelems, mask);
+ tmp = addr[start / BITS_PER_LONG];
+
+ /* Handle 1st word. */
+ tmp &= BITMAP_FIRST_WORD_MASK(start);
+ start = round_down(start, BITS_PER_LONG);
+
+ while (!tmp) {
+ start += SMASK_BITS;
+ if (start >= nbits)
+ return nelems;
+ tmp = addr[start / BITS_PER_LONG];
+ }
+
+ return min(SMASK_BIT_TO_ELEM(start, mask) + __ffs(tmp), nelems);
+}
+
+int
+sparsemask_next_wrap(int n, const struct sparsemask *mask, int start, bool wrap)
+{
+ int next;
+
+again:
+ next = sparsemask_next(n, mask);
+
+ if (wrap && n < start && next >= start) {
+ return mask->nelems;
+
+ } else if (next >= mask->nelems) {
+ wrap = true;
+ n = -1;
+ goto again;
+ }
+
+ return next;
+}
+
+unsigned int sparsemask_weight(const struct sparsemask *mask)
+{
+ int weight = 0;
+ const unsigned long *addr = mask->bits;
+ int nlongs = SMASK_ELEM_TO_LONG(mask->nelems, mask);
+ int i, extra, shift;
+
+ for (i = 0; i < nlongs; i += SMASK_LONGS) {
+ if (addr[i])
+ weight += hweight_long(addr[i]);
+ }
+ extra = mask->nelems - SMASK_LONG_TO_ELEM(i, mask);
+ if (extra > 0) {
+ shift = BITS_PER_LONG - extra;
+ weight += hweight_long((addr[i] << shift) >> shift);
+ }
+ return weight;
+}
+
+size_t sparsemask_size(int nelems, int density)
+{
+ nelems = round_up(nelems, SMASK_DENSITY_TO_ELEMS(density));
+ return sizeof(struct sparsemask) + _SMASK_ELEM_TO_BYTE(nelems, density);
+}
+
+void sparsemask_init(struct sparsemask *mask, int nelems, int density)
+{
+ WARN_ON(density < 0 || density > SMASK_DENSITY_MAX);
+ mask->nelems = nelems;
+ mask->density = density;
+}
+
+bool alloc_sparsemask_node(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags, int node)
+{
+ *mask = kmalloc_node(sparsemask_size(nelems, density), flags, node);
+ if (*mask)
+ sparsemask_init(*mask, nelems, density);
+ return !!*mask;
+}
+
+bool zalloc_sparsemask_node(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags, int node)
+{
+ flags |= __GFP_ZERO;
+ return alloc_sparsemask_node(mask, nelems, density, flags, node);
+}
+
+bool alloc_sparsemask(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags)
+{
+ return alloc_sparsemask_node(mask, nelems, density, flags,
+ NUMA_NO_NODE);
+}
+
+bool zalloc_sparsemask(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags)
+{
+ flags |= __GFP_ZERO;
+ return alloc_sparsemask(mask, nelems, density, flags);
+}
+
+void free_sparsemask(struct sparsemask *mask)
+{
+ kfree(mask);
+}
--
1.8.3.1


2018-11-09 13:04:55

by Steven Sistare

[permalink] [raw]
Subject: [PATCH v3 05/10] sched/fair: Hoist idle_stamp up from idle_balance

Move the update of idle_stamp from idle_balance to the call site in
pick_next_task_fair, to prepare for a future patch that adds work to
pick_next_task_fair which must be included in the idle_stamp interval.
No functional change.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9031d39..da368ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3725,6 +3725,8 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
rq->misfit_task_load = task_h_load(p);
}

+#define IF_SMP(statement) statement
+
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
@@ -3770,6 +3772,8 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+#define IF_SMP(statement) /* empty */
+
static inline void overload_clear(struct rq *rq) {}
static inline void overload_set(struct rq *rq) {}

@@ -6764,8 +6768,18 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_

idle:
update_misfit_status(NULL, rq);
+
+ /*
+ * We must set idle_stamp _before_ calling idle_balance(), such that we
+ * measure the duration of idle_balance() as idle time.
+ */
+ IF_SMP(rq->idle_stamp = rq_clock(rq);)
+
new_tasks = idle_balance(rq, rf);

+ if (new_tasks)
+ IF_SMP(rq->idle_stamp = 0;)
+
/*
* Because idle_balance() releases (and re-acquires) rq->lock, it is
* possible for any higher priority task to appear. In that case we
@@ -9611,12 +9625,6 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
u64 curr_cost = 0;

/*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
- */
- this_rq->idle_stamp = rq_clock(this_rq);
-
- /*
* Do not pull tasks towards !active CPUs...
*/
if (!cpu_active(this_cpu))
@@ -9707,9 +9715,6 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
pulled_task = -1;

- if (pulled_task)
- this_rq->idle_stamp = 0;
-
rq_repin_lock(this_rq, rf);

return pulled_task;
--
1.8.3.1


2018-11-09 15:06:52

by Steven Sistare

[permalink] [raw]
Subject: hackbench run scripts

Hi folks,
I am attaching the bash helper scripts I used to run and post-process hackbench
in case you find them useful. They compute the statistics and print a nicely
formatted result:

feat - Enable/disable one or more sched_features.
runmany - Run a command many times and print average time and stdev.
hackdiff - Print difference between 2 identical series of runmany hackbench runs.

You supply the top level script that runs a series of groups. Example:

---- hackseries ---------------
#!/bin/sh

iters=50000
groups="1 2 3 4"
runs=5

tmpfile=temp$$.out
rm -f $tmpfile

runall() {
    for g in $groups ; do
        echo hackbench $g process $iters | tee -a $tmpfile
        runmany $runs hackbench $g process $iters | tee -a $tmpfile
    done
}

sudo feat NO_STEAL ; runall ; echo
sudo feat STEAL ; runall ; echo
hackdiff < $tmpfile
rm -f $tmpfile
--------------------------------

$ hackseries
hackbench 1 process 50000
4.096 4.090 4.073 4.129 4.092 Average 4.096 stdev 0.5%
hackbench 2 process 50000
7.215 7.214 7.242 7.208 7.224 Average 7.220 stdev 0.2%
hackbench 3 process 50000
10.082 10.035 10.049 10.068 10.082 Average 10.063 stdev 0.2%
hackbench 4 process 50000
12.851 12.948 12.905 12.916 12.994 Average 12.922 stdev 0.4%

hackbench 1 process 50000
3.193 3.261 3.257 3.223 3.247 Average 3.236 stdev 0.9%
hackbench 2 process 50000
6.020 5.960 6.003 6.008 5.998 Average 5.997 stdev 0.4%
hackbench 3 process 50000
8.598 8.692 8.536 8.661 8.468 Average 8.591 stdev 1.1%
hackbench 4 process 50000
11.201 11.148 11.053 11.174 11.127 Average 11.140 stdev 0.5%

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 4.096 0.5 3.236 0.9 26.5
2 7.220 0.2 5.997 0.4 20.3
3 10.063 0.2 8.591 1.1 17.1
4 12.922 0.4 11.140 0.5 15.9

- Steve


Attachments:
feat (836.00 B)
runmany (1.54 kB)
hackdiff (2.11 kB)

2018-11-09 17:39:27

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

Hi Steve,

On 09/11/2018 12:50, Steve Sistare wrote:
[...]
> @@ -482,6 +484,10 @@ static void update_top_cache_domain(int cpu)
> dirty_sched_domain_sysctl(cpu);
> destroy_sched_domains(tmp);
>
> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
> + cfs_overload_cpus = (sd ? sd->shared->cfs_overload_cpus : NULL);
> + rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
> +

Why not do this in update_top_cache_domain() where we also look for the
highest SD_SHARE_PKG_RESOURCES and setup shortcut pointers?

> update_top_cache_domain(cpu);
> }
>
> @@ -1619,9 +1625,19 @@ static void __sdt_free(const struct cpumask *cpu_map)
> }
> }
>
> +#define ZALLOC_MASK(maskp, nelems, node) \
> + (!*(maskp) && !zalloc_sparsemask_node(maskp, nelems, \
> + SPARSEMASK_DENSITY_DEFAULT, \
> + GFP_KERNEL, node)) \
> +
> static int sd_llc_alloc(struct sched_domain *sd)
> {
> - /* Allocate sd->shared data here. Empty for now. */
> + struct sched_domain_shared *sds = sd->shared;
> + struct cpumask *span = sched_domain_span(sd);
> + int nid = cpu_to_node(cpumask_first(span));
> +
> + if (ZALLOC_MASK(&sds->cfs_overload_cpus, nr_cpu_ids, nid))

Mmm so this is called once on every CPU, but the !*(maskp) check in the
macro makes it so there is only one allocation per sd_llc_shared.

I wouldn't mind having that called out in a comment, or having the
pointer check done explicitly outside of the macro.
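
Something like this, for instance (rough sketch; the return convention is
assumed from the surrounding sd_llc_alloc() code):

----->8-----
static int sd_llc_alloc(struct sched_domain *sd)
{
	struct sched_domain_shared *sds = sd->shared;
	struct cpumask *span = sched_domain_span(sd);
	int nid = cpu_to_node(cpumask_first(span));

	/* Only the first CPU attaching to this sd_llc_shared allocates */
	if (sds->cfs_overload_cpus)
		return 0;

	if (!zalloc_sparsemask_node(&sds->cfs_overload_cpus, nr_cpu_ids,
				    SPARSEMASK_DENSITY_DEFAULT,
				    GFP_KERNEL, nid))
		return 1;

	return 0;
}
-----8<-----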

[...]

2018-11-09 19:09:55

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 05/10] sched/fair: Hoist idle_stamp up from idle_balance

Hi Steve,

On 09/11/2018 12:50, Steve Sistare wrote:
> Move the update of idle_stamp from idle_balance to the call site in
> pick_next_task_fair, to prepare for a future patch that adds work to
> pick_next_task_fair which must be included in the idle_stamp interval.
> No functional change.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> kernel/sched/fair.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9031d39..da368ed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3725,6 +3725,8 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> rq->misfit_task_load = task_h_load(p);
> }
>
> +#define IF_SMP(statement) statement
> +

I'm not too hot on those IF_SMP() macros. Since you're not introducing
any other user for them, what about an inline function for rq->idle_stamp
setting ? When it's mapped to an empty statement (!CONFIG_SMP) GCC is
smart enough to remove the rq_clock() that would be passed to it on
CONFIG_SMP:

----->8-----

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c11adf3..34d9864 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3725,7 +3725,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
rq->misfit_task_load = task_h_load(p);
}

-#define IF_SMP(statement) statement
+static inline void set_rq_idle_stamp(struct rq *rq, u64 value)
+{
+ rq->idle_stamp = value;
+}

static void overload_clear(struct rq *rq)
{
@@ -3772,7 +3775,7 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

-#define IF_SMP(statement) /* empty */
+static inline void set_rq_idle_stamp(struct rq *rq, u64 value) {}

static inline void overload_clear(struct rq *rq) {}
static inline void overload_set(struct rq *rq) {}
@@ -6773,12 +6776,12 @@ done: __maybe_unused;
* We must set idle_stamp _before_ calling idle_balance(), such that we
* measure the duration of idle_balance() as idle time.
*/
- IF_SMP(rq->idle_stamp = rq_clock(rq);)
+ set_rq_idle_stamp(rq, rq_clock(rq));

new_tasks = idle_balance(rq, rf);

if (new_tasks)
- IF_SMP(rq->idle_stamp = 0;)
+ set_rq_idle_stamp(rq, 0);

/*
* Because idle_balance() releases (and re-acquires) rq->lock, it is

2018-11-10 17:10:11

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v3 10/10] sched/fair: Provide idle search schedstats

Hi Steve,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.20-rc1 next-20181109]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Steve-Sistare/steal-tasks-to-improve-CPU-utilization/20181110-072317
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm

All errors (new ones prefixed by >>):

kernel/sched/core.o: In function `compute_skid':
>> core.c:(.text+0x2d48): undefined reference to `__aeabi_ldivmod'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
.config.gz (33.61 kB)

2018-11-12 16:43:01

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

Hi Steve,

On 09/11/2018 12:50, Steve Sistare wrote:
> From: Steve Sistare <[email protected]>
>
> Define and initialize a sparse bitmap of overloaded CPUs, per
> last-level-cache scheduling domain, for use by the CFS scheduling class.
> Save a pointer to cfs_overload_cpus in the rq for efficient access.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/sched.h | 2 ++
> kernel/sched/topology.c | 21 +++++++++++++++++++--
> 3 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 6b99761..b173a77 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
> atomic_t ref;
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> + struct sparsemask *cfs_overload_cpus;

Thinking about misfit stealing, we can't use the sd_llc_shared's because
on big.LITTLE misfit migrations happen across LLC domains.

I was thinking of adding a misfit sparsemask to the root_domain, but
then I thought we could do the same thing for cfs_overload_cpus.

By doing so we'd have a single source of information for overloaded CPUs,
and we could filter that down during idle balance - you mentioned earlier
wanting to try stealing at each SD level. This would also let you get
rid of [PATCH 02].

The main part of try_steal() could then be written down as something like
this:

----->8-----

for_each_domain(this_cpu, sd) {
span = sched_domain_span(sd)

for_each_sparse_wrap(src_cpu, overload_cpus) {
if (cpumask_test_cpu(src_cpu, span) &&
steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
stolen = 1;
goto out;
}
}
}

------8<-----

We could limit the stealing to stop at the highest SD_SHARE_PKG_RESOURCES
domain for now so there would be no behavioural change - but we'd
factorize the #ifdef SCHED_SMT bit. Furthermore, the door would be open
to further stealing.

What do you think?

[...]

2018-11-19 17:33:22

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 05/10] sched/fair: Hoist idle_stamp up from idle_balance

On 11/9/2018 2:07 PM, Valentin Schneider wrote:
> Hi Steve,
>
> On 09/11/2018 12:50, Steve Sistare wrote:
>> Move the update of idle_stamp from idle_balance to the call site in
>> pick_next_task_fair, to prepare for a future patch that adds work to
>> pick_next_task_fair which must be included in the idle_stamp interval.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> kernel/sched/fair.c | 23 ++++++++++++++---------
>> 1 file changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9031d39..da368ed 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3725,6 +3725,8 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>> rq->misfit_task_load = task_h_load(p);
>> }
>>
>> +#define IF_SMP(statement) statement
>> +
>
> I'm not too hot on those IF_SMP() macros. Since you're not introducing
> any other user for them, what about an inline function for rq->idle_stamp
> setting ? When it's mapped to an empty statement (!CONFIG_SMP) GCC is
> smart enough to remove the rq_clock() that would be passed to it on
> CONFIG_SMP:

That may be true now, but I worry that rq_clock or its subroutines may gain
side effects in the future that prevent the compiler from removing it. However,
I could push rq_clock into the inline function:

static inline void rq_idle_stamp_set(rq) { rq->idle_stamp = rq_clock(rq); }
static inline void rq_idle_stamp_clear(rq) { rq->idle_stamp = 0; }
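
For illustration, spelled out together with the call site (rough sketch only;
the !CONFIG_SMP stubs would sit next to the other SMP stubs in fair.c):

----->8-----
#ifdef CONFIG_SMP
static inline void rq_idle_stamp_set(struct rq *rq)
{
	/* count the upcoming idle search in the measured idle interval */
	rq->idle_stamp = rq_clock(rq);
}

static inline void rq_idle_stamp_clear(struct rq *rq)
{
	rq->idle_stamp = 0;
}
#else
static inline void rq_idle_stamp_set(struct rq *rq) {}
static inline void rq_idle_stamp_clear(struct rq *rq) {}
#endif

	/* in pick_next_task_fair(), at the idle: label */
	rq_idle_stamp_set(rq);

	new_tasks = idle_balance(rq, rf);

	if (new_tasks)
		rq_idle_stamp_clear(rq);
-----8<-----

That way rq_clock() is never even referenced on !CONFIG_SMP, so any future
side effects it gains cannot matter there.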

I like that better, do you?

- Steve

> ----->8-----
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c11adf3..34d9864 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3725,7 +3725,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> rq->misfit_task_load = task_h_load(p);
> }
>
> -#define IF_SMP(statement) statement
> +static inline void set_rq_idle_stamp(struct rq *rq, u64 value)
> +{
> + rq->idle_stamp = value;
> +}
>
> static void overload_clear(struct rq *rq)
> {
> @@ -3772,7 +3775,7 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
> return 0;
> }
>
> -#define IF_SMP(statement) /* empty */
> +static inline void set_rq_idle_stamp(struct rq *rq, u64 value) {}
>
> static inline void overload_clear(struct rq *rq) {}
> static inline void overload_set(struct rq *rq) {}
> @@ -6773,12 +6776,12 @@ done: __maybe_unused;
> * We must set idle_stamp _before_ calling idle_balance(), such that we
> * measure the duration of idle_balance() as idle time.
> */
> - IF_SMP(rq->idle_stamp = rq_clock(rq);)
> + set_rq_idle_stamp(rq, rq_clock(rq));
>
> new_tasks = idle_balance(rq, rf);
>
> if (new_tasks)
> - IF_SMP(rq->idle_stamp = 0;)
> + set_rq_idle_stamp(rq, 0);
>
> /*
> * Because idle_balance() releases (and re-acquires) rq->lock, it is
>

2018-11-19 17:34:44

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

On 11/9/2018 12:38 PM, Valentin Schneider wrote:
> Hi Steve,
>
> On 09/11/2018 12:50, Steve Sistare wrote:
> [...]
>> @@ -482,6 +484,10 @@ static void update_top_cache_domain(int cpu)
>> dirty_sched_domain_sysctl(cpu);
>> destroy_sched_domains(tmp);
>>
>> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>> + cfs_overload_cpus = (sd ? sd->shared->cfs_overload_cpus : NULL);
>> + rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
>> +
>
> Why not do this in update_top_cache_domain() where we also look for the
> highest SD_SHARE_PKG_RESOURCES and setup shortcut pointers?

My snippet needs rq which is currently referenced in cpu_attach_domain() but
not in update_top_cache_domain(). I could just as easily do it in
update_top_cache_domain(). Either way is fine with me.
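
Roughly like this, for example (sketch only; the elided lines are the
existing shortcut setup, unchanged):

----->8-----
static void update_top_cache_domain(int cpu)
{
	struct sparsemask *cfs_overload_cpus = NULL;
	struct rq *rq = cpu_rq(cpu);
	struct sched_domain *sd;
	...

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		...
		cfs_overload_cpus = sd->shared->cfs_overload_cpus;
	}

	rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
	...
}
-----8<-----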

>> update_top_cache_domain(cpu);
>> }
>>
>> @@ -1619,9 +1625,19 @@ static void __sdt_free(const struct cpumask *cpu_map)
>> }
>> }
>>
>> +#define ZALLOC_MASK(maskp, nelems, node) \
>> + (!*(maskp) && !zalloc_sparsemask_node(maskp, nelems, \
>> + SPARSEMASK_DENSITY_DEFAULT, \
>> + GFP_KERNEL, node)) \
>> +
>> static int sd_llc_alloc(struct sched_domain *sd)
>> {
>> - /* Allocate sd->shared data here. Empty for now. */
>> + struct sched_domain_shared *sds = sd->shared;
>> + struct cpumask *span = sched_domain_span(sd);
>> + int nid = cpu_to_node(cpumask_first(span));
>> +
>> + if (ZALLOC_MASK(&sds->cfs_overload_cpus, nr_cpu_ids, nid))
>
> Mmm so this is called once on every CPU, but the !*(maskp) check in the
> macro makes it so there is only one allocation per sd_llc_shared.
>
> I wouldn't mind having that called out in a comment, or having the
> pointer check done explicitly outside of the macro.

OK, will add a comment. I like the macro because the code is cleaner if/when
multiple sets are created.
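
For example (sketch of the comment in place; error return assumed):

----->8-----
	/*
	 * sd_llc_alloc() runs for every CPU in the LLC, but the !*(maskp)
	 * test in ZALLOC_MASK allocates only once per sd_llc_shared;
	 * later callers see a non-NULL pointer and skip the allocation.
	 */
	if (ZALLOC_MASK(&sds->cfs_overload_cpus, nr_cpu_ids, nid))
		return 1;
-----8<-----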

- Steve

2018-11-19 17:36:35

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

On 11/12/2018 11:42 AM, Valentin Schneider wrote:
> Hi Steve,
>
> On 09/11/2018 12:50, Steve Sistare wrote:
>> From: Steve Sistare <[email protected]>
>>
>> Define and initialize a sparse bitmap of overloaded CPUs, per
>> last-level-cache scheduling domain, for use by the CFS scheduling class.
>> Save a pointer to cfs_overload_cpus in the rq for efficient access.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> include/linux/sched/topology.h | 1 +
>> kernel/sched/sched.h | 2 ++
>> kernel/sched/topology.c | 21 +++++++++++++++++++--
>> 3 files changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index 6b99761..b173a77 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>> atomic_t ref;
>> atomic_t nr_busy_cpus;
>> int has_idle_cores;
>> + struct sparsemask *cfs_overload_cpus;
>
> Thinking about misfit stealing, we can't use the sd_llc_shared's because
> on big.LITTLE misfit migrations happen across LLC domains.
>
> I was thinking of adding a misfit sparsemask to the root_domain, but
> then I thought we could do the same thing for cfs_overload_cpus.
>
> By doing so we'd have a single source of information for overloaded CPUs,
> and we could filter that down during idle balance - you mentioned earlier
> wanting to try stealing at each SD level. This would also let you get
> rid of [PATCH 02].
>
> The main part of try_steal() could then be written down as something like
> this:
>
> ----->8-----
>
> for_each_domain(this_cpu, sd) {
> span = sched_domain_span(sd)
>
> for_each_sparse_wrap(src_cpu, overload_cpus) {
> if (cpumask_test_cpu(src_cpu, span) &&
> steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
> stolen = 1;
> goto out;
> }
> }
> }
>
> ------8<-----
>
> We could limit the stealing to stop at the highest SD_SHARE_PKG_RESOURCES
> domain for now so there would be no behavioural change - but we'd
> factorize the #ifdef SCHED_SMT bit. Furthermore, the door would be open
> to further stealing.
>
> What do you think?

That is not efficient for a multi-level search because at each domain level we
would (re) iterate over overloaded candidates that do not belong in that level.
To extend stealing across LLC, I would like to keep the per-LLC sparsemask,
but add to each SD a list of sparsemask pointers. The list nodes would be
private, but the sparsemask structs would be shared. Each list would include
the masks that overlap the SD's members. The list would be a singleton at the
core and LLC levels (same as the socket level for most processors), and would
have multiple elements at the NUMA level.
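
As a rough sketch (all names here are illustrative, none of this is in the
posted series):

----->8-----
/* list node: private to one sd, pointing at a shared per-LLC sparsemask */
struct overload_mask_node {
	struct sparsemask	*mask;
	struct list_head	node;
};

/* in struct sched_domain: masks that overlap this domain's span */
	struct list_head	overload_masks;

/* stealing at a given level then walks only the overlapping masks */
	struct overload_mask_node *n;

	list_for_each_entry(n, &sd->overload_masks, node)
		for_each_sparse_wrap(src_cpu, n->mask, this_cpu)
			/* try steal_from(src_cpu) here */ ;
-----8<-----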

- Steve


2018-11-20 10:26:10

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 05/10] sched/fair: Hoist idle_stamp up from idle_balance

On 19/11/2018 17:31, Steven Sistare wrote:
[...]
>>> +#define IF_SMP(statement) statement
>>> +
>>
>> I'm not too hot on those IF_SMP() macros. Since you're not introducing
>> any other user for them, what about an inline function for rq->idle_stamp
>> setting ? When it's mapped to an empty statement (!CONFIG_SMP) GCC is
>> smart enough to remove the rq_clock() that would be passed to it on
>> CONFIG_SMP:
>
> That may be true now, but I worry that rq_clock or its subroutines may gain
> side effects in the future that prevent the compiler from removing it. However,
> I could push rq_clock into the inline function:
>
> static inline void rq_idle_stamp_set(rq) { rq->idle_stamp = rq_clock(rq); }
> static inline void rq_idle_stamp_clear(rq) { rq->idle_stamp = 0; }
>
> I like that better, do you?
>

That works for me, though I can't resist nitpicking on
s/rq_idle_stamp_set/rq_idle_stamp_update/

> - Steve

2018-11-20 13:22:27

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap



On 19/11/2018 17:32, Steven Sistare wrote:
> On 11/9/2018 12:38 PM, Valentin Schneider wrote:
>> Hi Steve,
>>
>> On 09/11/2018 12:50, Steve Sistare wrote:
>> [...]
>>> @@ -482,6 +484,10 @@ static void update_top_cache_domain(int cpu)
>>> dirty_sched_domain_sysctl(cpu);
>>> destroy_sched_domains(tmp);
>>>
>>> + sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>>> + cfs_overload_cpus = (sd ? sd->shared->cfs_overload_cpus : NULL);
>>> + rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
>>> +
>>
>> Why not do this in update_top_cache_domain() where we also look for the
>> highest SD_SHARE_PKG_RESOURCES and setup shortcut pointers?
>
> My snippet needs rq which is currently referenced in cpu_attach_domain() but
> not in update_top_cache_domain(). I could just as easily do it in
> update_top_cache_domain(). Either way is fine with me.
>

I think having this in update_top_cache_domain() makes more sense - it's
where all the domain shortcuts are set, and we already play with LLC there.

[...]

2018-11-20 14:55:31

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

On 19/11/2018 17:33, Steven Sistare wrote:
[...]
>>
>> Thinking about misfit stealing, we can't use the sd_llc_shared's because
>> on big.LITTLE misfit migrations happen across LLC domains.
>>
>> I was thinking of adding a misfit sparsemask to the root_domain, but
>> then I thought we could do the same thing for cfs_overload_cpus.
>>
>> By doing so we'd have a single source of information for overloaded CPUs,
>> and we could filter that down during idle balance - you mentioned earlier
>> wanting to try stealing at each SD level. This would also let you get
>> rid of [PATCH 02].
>>
>> The main part of try_steal() could then be written down as something like
>> this:
>>
>> ----->8-----
>>
>> for_each_domain(this_cpu, sd) {
>> span = sched_domain_span(sd)
>>
>> for_each_sparse_wrap(src_cpu, overload_cpus) {
>> if (cpumask_test_cpu(src_cpu, span) &&
>> steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
>> stolen = 1;
>> goto out;
>> }
>> }
>> }
>>
>> ------8<-----
>>
>> We could limit the stealing to stop at the highest SD_SHARE_PKG_RESOURCES
>> domain for now so there would be no behavioural change - but we'd
>> factorize the #ifdef SCHED_SMT bit. Furthermore, the door would be open
>> to further stealing.
>>
>> What do you think?
>
> That is not efficient for a multi-level search because at each domain level we
> would (re) iterate over overloaded candidates that do not belong in that level.


Mmm I was thinking we could abuse the wrap() and start at
(fls(prev_span) + 1), but we're not guaranteed to have contiguous spans -
the Arm Juno for instance has [0, 3, 4], [1, 2] as MC-level domains, so
that goes down the drain.

Another thing that has been trotting in my head would be some helper to
create a cpumask from a sparsemask (some sort of sparsemask_span()),
which would let us use the standard mask operators:

----->8-----
struct cpumask *overload_span = sparsemask_span(overload_cpus)

for_each_domain(this_cpu, sd)
for_each_cpu_and(src_cpu, overload_span, sched_domain_span(sd))
<steal_from here>
-----8>-----

The cpumask could be part of the sparsemask struct to save us the
allocation, and only updated when calling sparsemask_span().
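
Something like this, perhaps (very rough sketch; "span" being that embedded
cpumask):

----->8-----
static struct cpumask *sparsemask_span(struct sparsemask *mask)
{
	int elem;

	/* rebuild the dense view from the sparse one on demand */
	cpumask_clear(&mask->span);
	for_each_sparse(elem, mask)
		cpumask_set_cpu(elem, &mask->span);

	return &mask->span;
}
-----8<-----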

> To extend stealing across LLC, I would like to keep the per-LLC sparsemask,
> but add to each SD a list of sparsemask pointers. The list nodes would be
> private, but the sparsemask structs would be shared. Each list would include
> the masks that overlap the SD's members. The list would be a singleton at the
> core and LLC levels (same as the socket level for most processors), and would
> have multiple elements at the NUMA level.
>

I see. As for misfit, creating asym_cpucapacity siblings of the sd_llc_*()
functions seems a bit much - there'd be a lot of redundancy for basically
just a single shared sparsemask, which is why I was rambling about moving
things to root_domain.

Having different locations where sparsemasks are stored is a bit of a
pain which I'd like to avoid, but if it can't be unified I suppose we'll
have to live with it.

> - Steve
>

2018-11-26 19:09:09

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

On 11/20/2018 7:42 AM, Valentin Schneider wrote:
> On 19/11/2018 17:33, Steven Sistare wrote:
> [...]
>>>
>>> Thinking about misfit stealing, we can't use the sd_llc_shared's because
>>> on big.LITTLE misfit migrations happen across LLC domains.
>>>
>>> I was thinking of adding a misfit sparsemask to the root_domain, but
>>> then I thought we could do the same thing for cfs_overload_cpus.
>>>
>>> By doing so we'd have a single source of information for overloaded CPUs,
>>> and we could filter that down during idle balance - you mentioned earlier
>>> wanting to try stealing at each SD level. This would also let you get
>>> rid of [PATCH 02].
>>>
>>> The main part of try_steal() could then be written down as something like
>>> this:
>>>
>>> ----->8-----
>>>
>>> for_each_domain(this_cpu, sd) {
>>> span = sched_domain_span(sd)
>>>
>>> for_each_sparse_wrap(src_cpu, overload_cpus) {
>>> if (cpumask_test_cpu(src_cpu, span) &&
>>> steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
>>> stolen = 1;
>>> goto out;
>>> }
>>> }
>>> }
>>>
>>> ------8<-----
>>>
>>> We could limit the stealing to stop at the highest SD_SHARE_PKG_RESOURCES
>>> domain for now so there would be no behavioural change - but we'd
>>> factorize the #ifdef SCHED_SMT bit. Furthermore, the door would be open
>>> to further stealing.
>>>
>>> What do you think?
>>
>> That is not efficient for a multi-level search because at each domain level we
>> would (re) iterate over overloaded candidates that do not belong in that level.
>
>
> Mmm I was thinking we could abuse the wrap() and start at
> (fls(prev_span) + 1), but we're not guaranteed to have contiguous spans -
> the Arm Juno for instance has [0, 3, 4], [1, 2] as MC-level domains, so
> that goes down the drain.
>
> Another thing that has been trotting in my head would be some helper to
> create a cpumask from a sparsemask (some sort of sparsemask_span()),
> which would let us use the standard mask operators:
>
> ----->8-----
> struct cpumask *overload_span = sparsemask_span(overload_cpus)
>
> for_each_domain(this_cpu, sd)
> for_each_cpu_and(src_cpu, overload_span, sched_domain_span(sd))
> <steal_from here>
> -----8>-----
>
> The cpumask could be part of the sparsemask struct to save us the
> allocation, and only updated when calling sparsemask_span().

I thought of providing something like this along with other sparsemask
utility functions, but I decided to be minimalist, and let others add
more functions if/when they become needed. this_cpu_cpumask_var_ptr(select_idle_mask)
is a temporary that could be used as the destination of the conversion.

Also, conversion adds cost, particularly on larger systems. When comparing a
cpumask and a sparsemask, it is more efficient to iterate over the smaller
set, and test for membership in the larger, such as in try_steal:

for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
if (sparsemask_test_elem(src_cpu, overload_cpus)

>> To extend stealing across LLC, I would like to keep the per-LLC sparsemask,
>> but add to each SD a list of sparsemask pointers. The list nodes would be
>> private, but the sparsemask structs would be shared. Each list would include
>> the masks that overlap the SD's members. The list would be a singleton at the
>> core and LLC levels (same as the socket level for most processors), and would
>> have multiple elements at the NUMA level.
>
> I see. As for misfit, creating asym_cpucapacity siblings of the sd_llc_*()
> functions seems a bit much - there'd be a lot of redundancy for basically
> just a single shared sparsemask, which is why I was rambling about moving
> things to root_domain.
>
> Having different locations where sparsemasks are stored is a bit of a
> pain which I'd like to avoid, but if it can't be unified I suppose we'll
> have to live with it.

I don't follow. A per-LLC sparsemask representing misfits can be allocated with
one more line in sd_llc_alloc, and you can steal across LLC using the list I
briefly described above.
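
i.e. one more ZALLOC_MASK line next to the existing one, roughly (the field
name here is illustrative only):

----->8-----
	if (ZALLOC_MASK(&sds->cfs_misfit_cpus, nr_cpu_ids, nid))
		return 1;
-----8<-----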

- Steve

2018-11-27 19:25:18

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 01/10] sched: Provide sparsemask, a reduced contention bitmap

On 11/9/2018 7:50 AM, Steve Sistare wrote:
> From: Steve Sistare <[email protected]>
>
> Provide struct sparsemask and functions to manipulate it. A sparsemask is
> a sparse bitmap. It reduces cache contention vs the usual bitmap when many
> threads concurrently set, clear, and visit elements, by reducing the number
> of significant bits per cacheline. For each 64 byte chunk of the mask,
> only the first K bits of the first word are used, and the remaining bits
> are ignored, where K is a creation time parameter. Thus a sparsemask that
> can represent a set of N elements is approximately (N/K * 64) bytes in
> size.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
> lib/Makefile | 2 +-
> lib/sparsemask.c | 142 +++++++++++++++++++++++++
> 3 files changed, 403 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/sparsemask.h
> create mode 100644 lib/sparsemask.c

Hi Peter and Ingo,
I need your opinion: would you prefer that I keep the new sparsemask type,
or fold it into the existing sbitmap type? There is some overlap between the
two, but mostly in trivial one line functions. The main differences are:

* sparsemask defines iterators that allow an inline loop body, like cpumask,
whereas the sbitmap iterator forces us to define a callback function for
the body, which is awkward.

* sparsemask is slightly more efficient. The struct and variable length
bitmap are allocated contiguously, and sbitmap uses an extra field "depth"
per bitmap cacheline.

* The order of arguments is different for the sparsemask accessors and
sbitmap accessors. sparsemask mimics cpumask which is used extensively
in the sched code.

* Much of the sbitmap code supports queueing, sleeping, and waking on bit
allocation, which is N/A for scheduler load balancing. However, we
can call the basic functions which do not use queueing.

I could add the sparsemask iterators to sbitmap (90 lines), and define
a thin layer to change the argument order to mimic cpumask, but that
essentially recreates sparsemask.

Also, pushing sparsemask into sbitmap would limit our freedom to evolve the
type to meet the future needs of sched, as sbitmap has its own maintainer,
and is used by drivers, so changes to its API and ABI will be frowned upon.

FWIW, here is the amount of code involved:

include/linux/sbitmap.h
250 lines basic operations
284 lines for queueing
---
534 lines total

lib/sbitmap.c
201 lines basic operations
380 lines for queueing
---
581 lines total

include/linux/sparsemask.h
260 lines total
https://lkml.org/lkml/2018/11/9/1176

lib/sparsemask.c
142 lines total
https://lkml.org/lkml/2018/11/9/1176

- Steve

2018-11-28 01:20:13

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH v3 01/10] sched: Provide sparsemask, a reduced contention bitmap

On Tue, Nov 27, 2018 at 10:16:56AM -0500, Steven Sistare wrote:
> On 11/9/2018 7:50 AM, Steve Sistare wrote:
> > From: Steve Sistare <[email protected]>
> >
> > Provide struct sparsemask and functions to manipulate it. A sparsemask is
> > a sparse bitmap. It reduces cache contention vs the usual bitmap when many
> > threads concurrently set, clear, and visit elements, by reducing the number
> > of significant bits per cacheline. For each 64 byte chunk of the mask,
> > only the first K bits of the first word are used, and the remaining bits
> > are ignored, where K is a creation time parameter. Thus a sparsemask that
> > can represent a set of N elements is approximately (N/K * 64) bytes in
> > size.
> >
> > Signed-off-by: Steve Sistare <[email protected]>
> > ---
> > include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
> > lib/Makefile | 2 +-
> > lib/sparsemask.c | 142 +++++++++++++++++++++++++
> > 3 files changed, 403 insertions(+), 1 deletion(-)
> > create mode 100644 include/linux/sparsemask.h
> > create mode 100644 lib/sparsemask.c
>
> Hi Peter and Ingo,
> I need your opinion: would you prefer that I keep the new sparsemask type,
> or fold it into the existing sbitmap type? There is some overlap between the
> two, but mostly in trivial one line functions. The main differences are:

Adding Jens and myself.

> * sparsemask defines iterators that allow an inline loop body, like cpumask,
> whereas the sbitmap iterator forces us to define a callback function for
> the body, which is awkward.
>
> * sparsemask is slightly more efficient. The struct and variable length
> bitmap are allocated contiguously,

That just means you have the pointer indirection elsewhere :) The users
of sbitmap embed it in whatever structure they have.

> and sbitmap uses an extra field "depth"
> per bitmap cacheline.

The depth field is memory which would otherwise be unused, and it's only
used for sbitmap_get(), so it doesn't have any cost if you're using it
like a cpumask.

> * The order of arguments is different for the sparsemask accessors and
> sbitmap accessors. sparsemask mimics cpumask which is used extensively
> in the sched code.
>
> * Much of the sbitmap code supports queueing, sleeping, and waking on bit
> allocation, which is N/A for scheduler load load balancing. However, we
> can call the basic functions which do not use queueing.
>
> I could add the sparsemask iterators to sbitmap (90 lines), and define
> a thin layer to change the argument order to mimic cpumask, but that
> essentially recreates sparsemask.

We only use sbitmap_for_each_set() in a few places. Maybe a for_each()
style macro would be cleaner for those users, too, in which case I
wouldn't be opposed to changing it. The cpumask argument order thing is
annoying, though.

> Also, pushing sparsemask into sbitmap would limit our freedom to evolve the
> type to meet the future needs of sched, as sbitmap has its own maintainer,
> and is used by drivers, so changes to its API and ABI will be frowned upon.

It's a generic data structure, so of course Jens and I have no problem
with changing it to meet more needs :) Personally, I'd prefer to only
have one datastructure for this, but I suppose it depends on whether
Peter and Ingo think the argument order is important enough.

> FWIW, here is the amount of code involved:
>
> include/linux/sbitmap.h
> 250 lines basic operations
> 284 lines for queueing
> ---
> 534 lines total
>
> lib/sbitmap.c
> 201 lines basic operations
> 380 lines for queueing
> ---
> 581 lines total
>
> include/linux/sparsemask.h
> 260 lines total
> https://lkml.org/lkml/2018/11/9/1176
>
> lib/sparsemask.c
> 142 lines total
> https://lkml.org/lkml/2018/11/9/1176
>
> - Steve

2018-12-03 16:58:08

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

Hi Steve,

On 26/11/2018 19:06, Steven Sistare wrote:
> [...]
>> Mmm I was thinking we could abuse the wrap() and start at
>> (fls(prev_span) + 1), but we're not guaranteed to have contiguous spans -
>> the Arm Juno for instance has [0, 3, 4], [1, 2] as MC-level domains, so
>> that goes down the drain.
>>
>> Another thing that has been trotting in my head would be some helper to
>> create a cpumask from a sparsemask (some sort of sparsemask_span()),
>> which would let us use the standard mask operators:
>>
>> ----->8-----
>> struct cpumask *overload_span = sparsemask_span(overload_cpus)
>>
>> for_each_domain(this_cpu, sd)
>> for_each_cpu_and(src_cpu, overload_span, sched_domain_span(sd))
>> <steal_from here>
>> -----8>-----
>>
>> The cpumask could be part of the sparsemask struct to save us the
>> allocation, and only updated when calling sparsemask_span().
>
> I thought of providing something like this along with other sparsemask
> utility functions, but I decided to be minimalist, and let others add
> more functions if/when they become needed. this_cpu_cpumask_var_ptr(select_idle_mask)
> is a temporary that could be used as the destination of the conversion.
>
> Also, conversion adds cost, particularly on larger systems. When comparing a
> cpumask and a sparsemask, it is more efficient to iterate over the smaller
> set, and test for membership in the larger, such as in try_steal:
>
> for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
> if (sparsemask_test_elem(src_cpu, overload_cpus)
>
>>> To extend stealing across LLC, I would like to keep the per-LLC sparsemask,
>>> but add to each SD a list of sparsemask pointers. The list nodes would be
>>> private, but the sparsemask structs would be shared. Each list would include
>>> the masks that overlap the SD's members. The list would be a singleton at the
>>> core and LLC levels (same as the socket level for most processors), and would
>>> have multiple elements at the NUMA level.
>>
>> I see. As for misfit, creating asym_cpucapacity siblings of the sd_llc_*()
>> functions seems a bit much - there'd be a lot of redundancy for basically
>> just a single shared sparsemask, which is why I was rambling about moving
>> things to root_domain.
>>
>> Having different locations where sparsemasks are stored is a bit of a
>> pain which I'd like to avoid, but if it can't be unified I suppose we'll
>> have to live with it.
>
> I don't follow. A per-LLC sparsemask representing misfits can be allocated with
> one more line in sd_llc_alloc, and you can steal across LLC using the list I
> briefly described above.
>

Ah yes, that would work. Thing is, I had excluded having the misfit masks
being in the sd_llc_shareds, since from a logical standpoint they don't
really belong there.

With asymmetric CPU capacities we kind of disregard the cache landscape
and focus on, well, CPU asymmetries. There's a few commits laying around
that forgo some cache usage optimisations for asymmetric systems -
this one comes to mind:

9c63e84db29b ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")

So in truth I was envisioning separate SD_ASYM_CPUCAPACITY-based
sparsemasks, which is why I was rambling about SD_ASYM_CPUCAPACITY siblings
of sd_llc_*()... *But* after I had a go at it, it looked to me like that
was a lot of duplicated code.

My root_domain suggestion stems from the fact that we only really need one
single sparsemask for misfit stealing, and it provides a unique location
to store the sparsemasks (and you mask them however you want when it comes
to using them).

Sadly I think that doesn't work as well for cfs_overload_cpus since you
can't split a sparsemask's chunks over several NUMA nodes, so we'd be
stuck with an allocation on a single node (but we already do that in some
places, e.g. for nohz.idle_cpus_mask, so... Is it that bad?).

> - Steve
>

2018-12-06 16:13:03

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 01/10] sched: Provide sparsemask, a reduced contention bitmap

On 11/27/2018 8:19 PM, Omar Sandoval wrote:
> On Tue, Nov 27, 2018 at 10:16:56AM -0500, Steven Sistare wrote:
>> On 11/9/2018 7:50 AM, Steve Sistare wrote:
>>> From: Steve Sistare <[email protected]>
>>>
>>> Provide struct sparsemask and functions to manipulate it. A sparsemask is
>>> a sparse bitmap. It reduces cache contention vs the usual bitmap when many
>>> threads concurrently set, clear, and visit elements, by reducing the number
>>> of significant bits per cacheline. For each 64 byte chunk of the mask,
>>> only the first K bits of the first word are used, and the remaining bits
>>> are ignored, where K is a creation time parameter. Thus a sparsemask that
>>> can represent a set of N elements is approximately (N/K * 64) bytes in
>>> size.
>>>
>>> Signed-off-by: Steve Sistare <[email protected]>
>>> ---
>>> include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
>>> lib/Makefile | 2 +-
>>> lib/sparsemask.c | 142 +++++++++++++++++++++++++
>>> 3 files changed, 403 insertions(+), 1 deletion(-)
>>> create mode 100644 include/linux/sparsemask.h
>>> create mode 100644 lib/sparsemask.c
>>
>> Hi Peter and Ingo,
>> I need your opinion: would you prefer that I keep the new sparsemask type,
>> or fold it into the existing sbitmap type? There is some overlap between the
>> two, but mostly in trivial one line functions. The main differences are:
>
> Adding Jens and myself.
>
>> * sparsemask defines iterators that allow an inline loop body, like cpumask,
>> whereas the sbitmap iterator forces us to define a callback function for
>> the body, which is awkward.
>>
>> * sparsemask is slightly more efficient. The struct and variable length
>> bitmap are allocated contiguously,
>
> That just means you have the pointer indirection elsewhere :) The users
> of sbitmap embed it in whatever structure they have.

Yes, the sparsemask can be embedded in one place, but in my use case I also cache
pointers to the mask from elsewhere, and those sites incur the cost of 2 indirections
to perform bitmap operations.
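
To illustrate the point (sketch only, written from memory of the sbitmap
layout, so take the details with a grain of salt; rq->cfs_overload_cpus_sb
is a hypothetical field for the comparison):

----->8-----
/* sparsemask: bits[] is embedded, so one hop from the cached pointer */
struct sparsemask *m = rcu_dereference(rq->cfs_overload_cpus);
sparsemask_set_elem(cpu, m);		/* m -> bits word */

/* sbitmap: the word array hangs off sb->map, adding a second hop */
struct sbitmap *sb = rcu_dereference(rq->cfs_overload_cpus_sb);
sbitmap_set_bit(sb, cpu);		/* sb -> map -> bits word */
-----8<-----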

>> and sbitmap uses an extra field "depth"
>> per bitmap cacheline.
>
> The depth field is memory which would otherwise be unused, and it's only
> used for sbitmap_get(), so it doesn't have any cost if you're using it
> like a cpumask.
>
>> * The order of arguments is different for the sparsemask accessors and
>> sbitmap accessors. sparsemask mimics cpumask which is used extensively
>> in the sched code.
>>
>> * Much of the sbitmap code supports queueing, sleeping, and waking on bit
>> allocation, which is N/A for scheduler load load balancing. However, we
>> can call the basic functions which do not use queueing.
>>
>> I could add the sparsemask iterators to sbitmap (90 lines), and define
>> a thin layer to change the argument order to mimic cpumask, but that
>> essentially recreates sparsemask.
>
> We only use sbitmap_for_each_set() in a few places. Maybe a for_each()
> style macro would be cleaner for those users, too, in which case I
> wouldn't be opposed to changing it. The cpumask argument order thing is
> annoying, though.
>
>> Also, pushing sparsemask into sbitmap would limit our freedom to evolve the
>> type to meet the future needs of sched, as sbitmap has its own maintainer,
>> and is used by drivers, so changes to its API and ABI will be frowned upon.
>
> It's a generic data structure, so of course Jens and I have no problem
> with changing it to meet more needs :) Personally, I'd prefer to only
> have one datastructure for this, but I suppose it depends on whether
> Peter and Ingo think the argument order is important enough.

The argument order is a minor thing, not a blocker to adoption, but efficiency
is important in the core scheduler code. I actually did the work to write a
for_each macro with an inline body to sbitmap, and converted my patches to use sbitmap.
But then I noticed your very recent patch adding the cleared word to each cacheline,
which must be loaded and ANDed with each bitset word in the for_each traversal,
adding more overhead which we don't need for the scheduler use case, on top of the
extra indirection noted above. You might add more such things in the future (a
"deferred set" word?) to support the needs of the block drivers who are the
intended clients of sbitmap.

Your sbitmap is more than a simple bitmap abstraction, and for the scheduler we
just need simple. Therefore, I propose to trim sparsemask to the bare minimum,
and move it to kernel/sched for use by sched only. It was 400 lines, but will
be 200, and 80 of those are comments.

If anyone objects, please speak now.

- Steve

>> FWIW, here is the amount of code involved:
>>
>> include/linux/sbitmap.h
>> 250 lines basic operations
>> 284 lines for queueing
>> ---
>> 534 lines total
>>
>> lib/sbitmap.c
>> 201 lines basic operations
>> 380 lines for queueing
>> ---
>> 581 lines total
>>
>> include/linux/sparsemask.h
>> 260 lines total
>> https://lkml.org/lkml/2018/11/9/1176
>>
>> lib/sparsemask.c
>> 142 lines total
>> https://lkml.org/lkml/2018/11/9/1176
>>
>> - Steve

2018-12-06 16:41:51

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

On 12/3/2018 11:56 AM, Valentin Schneider wrote:
> Hi Steve,
> On 26/11/2018 19:06, Steven Sistare wrote:
>> [...]
>>> Mmm I was thinking we could abuse the wrap() and start at
>>> (fls(prev_span) + 1), but we're not guaranteed to have contiguous spans -
>>> the Arm Juno for instance has [0, 3, 4], [1, 2] as MC-level domains, so
>>> that goes down the drain.
>>>
>>> Another thing that has been rattling around in my head would be some helper to
>>> create a cpumask from a sparsemask (some sort of sparsemask_span()),
>>> which would let us use the standard mask operators:
>>>
>>> ----->8-----
>>> struct cpumask *overload_span = sparsemask_span(overload_cpus);
>>>
>>> for_each_domain(this_cpu, sd)
>>>         for_each_cpu_and(src_cpu, overload_span, sched_domain_span(sd))
>>>                 <steal_from here>
>>> -----8>-----
>>>
>>> The cpumask could be part of the sparsemask struct to save us the
>>> allocation, and only updated when calling sparsemask_span().
>>
>> I thought of providing something like this along with other sparsemask
>> utility functions, but I decided to be minimalist, and let others add
>> more functions if/when they become needed. this_cpu_cpumask_var_ptr(select_idle_mask)
>> is a temporary that could be used as the destination of the conversion.
>>
>> Also, conversion adds cost, particularly on larger systems. When comparing a
>> cpumask and a sparsemask, it is more efficient to iterate over the smaller
>> set, and test for membership in the larger, such as in try_steal:
>>
>> for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
>>         if (sparsemask_test_elem(src_cpu, overload_cpus))
>>                 <steal from src_cpu>
>> }
>>
>>>> To extend stealing across LLC, I would like to keep the per-LLC sparsemask,
>>>> but add to each SD a list of sparsemask pointers. The list nodes would be
>>>> private, but the sparsemask structs would be shared. Each list would include
>>>> the masks that overlap the SD's members. The list would be a singleton at the
>>>> core and LLC levels (same as the socket level for most processors), and would
>>>> have multiple elements at the NUMA level.
>>>
>>> I see. As for misfit, creating asym_cpucapacity siblings of the sd_llc_*()
>>> functions seems a bit much - there'd be a lot of redundancy for basically
>>> just a single shared sparsemask, which is why I was rambling about moving
>>> things to root_domain.
>>>
>>> Having different locations where sparsemasks are stored is a bit of a
>>> pain which I'd like to avoid, but if it can't be unified I suppose we'll
>>> have to live with it.
>>
>> I don't follow. A per-LLC sparsemask representing misfits can be allocated with
>> one more line in sd_llc_alloc, and you can steal across LLC using the list I
>> briefly described above.
>
> Ah yes, that would work. Thing is, I had excluded having the misfit masks
> being in the sd_llc_shareds, since from a logical standpoint they don't
> really belong there.
>
> With asymmetric CPU capacities we kind of disregard the cache landscape

Sure, but adding awareness of the cache hierarchy can only make it better,
and a per-LLC mask organization can serve both the overloaded and misfit
use cases quite naturally.
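
Roughly, the organization I have in mind looks like this; sparsemask_link and
steal_from_mask() are made-up names, purely a sketch:

struct sparsemask;			/* shared per-LLC mask */

struct sparsemask_link {		/* private per-sched-domain list node */
	struct sparsemask	*mask;	/* points to a shared LLC mask */
	struct sparsemask_link	*next;	/* next mask overlapping this domain */
};

/* Walk each overlapping LLC's mask in turn; steal_from_mask() is an
 * assumed helper that returns a source CPU or -1. */
static int steal_from_domain(struct sparsemask_link *links, int dst_cpu)
{
	for (; links; links = links->next) {
		int src_cpu = steal_from_mask(links->mask, dst_cpu);

		if (src_cpu >= 0)
			return src_cpu;
	}
	return -1;
}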

> and focus on, well, CPU asymmetries. There are a few commits lying around
> that forgo some cache usage optimisations for asymmetric systems -
> this one comes to mind:
>
> 9c63e84db29b ("sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains")
>
> So in truth I was envisioning separate SD_ASYM_CPUCAPACITY-based
> sparsemasks, which is why I was rambling about SD_ASYM_CPUCAPACITY siblings
> of sd_llc_*()... *But* after I had a go at it, it looked to me like that
> was a lot of duplicated code.

I would be happy to review your code and make suggestions to reduce duplication,
and happy to continue to discuss clean and optimal handling for misfits. However,
I have a request: can we push my patches across the finish line first? Stealing
for misfits can be its own patch series. Please consider sending your reviewed-by
for the next version of my series. I will send it soon.

> My root_domain suggestion stems from the fact that we only really need one
> single sparsemask for misfit stealing, and it provides a unique location
> to store the sparsemasks (and you mask them however you want when it comes
> to using them).
>
> Sadly I think that doesn't work as well for cfs_overload_cpus since you
> can't split a sparsemask's chunks over several NUMA nodes, so we'd be
> stuck with an allocation on a single node (but we already do that in some
> places, e.g. for nohz.idle_cpus_mask, so... Is it that bad?).

It can be bad for high memory bandwidth workloads, as the sparsemasks will
be displaced from cache and we incur remote memory latencies on next access.
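
For example, a per-LLC mask can be placed on its own node, roughly like this;
the helper name and shape are a sketch, not the posted patch:

/* Assumes <linux/slab.h> and <linux/topology.h>. */
static struct sparsemask *alloc_llc_overload_mask(int cpu, size_t bytes)
{
	/* Put the mask on the node that owns this LLC, so the CPUs that
	 * touch it most avoid cross-node memory latency. */
	return kzalloc_node(bytes, GFP_KERNEL, cpu_to_node(cpu));
}

A single shared mask cannot be split this way, so some CPUs always pay the
remote cost.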

- Steve

2018-12-06 17:29:47

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH v3 03/10] sched/topology: Provide cfs_overload_cpus bitmap

Hi Steve,

On 06/12/2018 16:40, Steven Sistare wrote:
> [...]
>>
>> Ah yes, that would work. Thing is, I had excluded having the misfit masks
>> being in the sd_llc_shareds, since from a logical standpoint they don't
>> really belong there.
>>
>> With asymmetric CPU capacities we kind of disregard the cache landscape
>
> Sure, but adding awareness of the cache hierarchy can only make it better,
> and a per-LLC mask organization can serve both the overloaded and misfit
> use cases quite naturally.
> [...]
>> So in truth I was envisioning separate SD_ASYM_CPUCAPACITY-based
>> sparsemasks, which is why I was rambling about SD_ASYM_CPUCAPACITY siblings
>> of sd_llc_*()... *But* after I had a go at it, it looked to me like that
>> was a lot of duplicated code.
>
> I would be happy to review your code and make suggestions to reduce duplication,
> and happy to continue to discuss clean and optimal handling for misfits. However,
> I have a request: can we push my patches across the finish line first? Stealing
> for misfits can be its own patch series. Please consider sending your reviewed-by
> for the next version of my series. I will send it soon.
>

Sure, as things stand right now I'm fairly convinced this doesn't harm
asymmetric systems.

The only thing I would add (ignoring misfits) is that with EAS we would
need to gate stealing with something like:

!static_branch_unlikely(&sched_energy_present) ||
READ_ONCE(rq->rd->overutilized)

And who "gets" to add this gating (or at least, when must it be added)
depends on which patch-set gets in first.
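
Roughly something like this, with steal_ok() being just an illustrative name
(it's not in either series as posted):

static inline bool steal_ok(struct rq *rq)
{
	/* Steal only when EAS is not active, or when the root domain is
	 * already overutilized and EAS has ceded placement to the
	 * regular load balance paths. */
	return !static_branch_unlikely(&sched_energy_present) ||
	       READ_ONCE(rq->rd->overutilized);
}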

[...]
>> Sadly I think that doesn't work as well for cfs_overload_cpus since you
>> can't split a sparsemask's chunks over several NUMA nodes, so we'd be
>> stuck with an allocation on a single node (but we already do that in some
>> places, e.g. for nohz.idle_cpus_mask, so... Is it that bad?).
>
> It can be bad for high memory bandwidth workloads, as the sparsemasks will
> be displaced from cache and we incur remote memory latencies on next access.
>

Aye, I just caught up with the LPC videos and was about to reply here to
say that, all things considered, it's probably not such a good idea...

> - Steve
>

2018-12-06 18:21:17

by Omar Sandoval

[permalink] [raw]
Subject: Re: [PATCH v3 01/10] sched: Provide sparsemask, a reduced contention bitmap

On Thu, Dec 06, 2018 at 11:07:46AM -0500, Steven Sistare wrote:
> On 11/27/2018 8:19 PM, Omar Sandoval wrote:
> > On Tue, Nov 27, 2018 at 10:16:56AM -0500, Steven Sistare wrote:
> >> On 11/9/2018 7:50 AM, Steve Sistare wrote:
> >>> From: Steve Sistare <[email protected]>
> >>>
> >>> Provide struct sparsemask and functions to manipulate it. A sparsemask is
> >>> a sparse bitmap. It reduces cache contention vs the usual bitmap when many
> >>> threads concurrently set, clear, and visit elements, by reducing the number
> >>> of significant bits per cacheline. For each 64 byte chunk of the mask,
> >>> only the first K bits of the first word are used, and the remaining bits
> >>> are ignored, where K is a creation time parameter. Thus a sparsemask that
> >>> can represent a set of N elements is approximately (N/K * 64) bytes in
> >>> size.
> >>>
> >>> Signed-off-by: Steve Sistare <[email protected]>
> >>> ---
> >>> include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
> >>> lib/Makefile | 2 +-
> >>> lib/sparsemask.c | 142 +++++++++++++++++++++++++
> >>> 3 files changed, 403 insertions(+), 1 deletion(-)
> >>> create mode 100644 include/linux/sparsemask.h
> >>> create mode 100644 lib/sparsemask.c
> >>
> >> Hi Peter and Ingo,
> >> I need your opinion: would you prefer that I keep the new sparsemask type,
> >> or fold it into the existing sbitmap type? There is some overlap between the
> >> two, but mostly in trivial one line functions. The main differences are:
> >
> > Adding Jens and myself.
> >
> [...]
>
> Your sbitmap is more than a simple bitmap abstraction, and for the scheduler we
> just need simple. Therefore, I propose to trim sparsemask to the bare minimum
> and move it to kernel/sched for use by sched only. It was 400 lines, but will
> be 200, and 80 of those are comments.
>
> If anyone objects, please speak now.

Yes, after the recent changes, I think it's reasonable to have a
separate implementation for sched.