2018-10-22 15:13:09

by Steven Sistare

Subject: [PATCH 00/10] steal tasks to improve CPU utilization

When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.
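
In code terms the hook is small; condensed from patch 8, the end of the
CFS pick_next_task path becomes roughly:

        new_tasks = idle_balance(rq, rf);
        if (new_tasks == 0)
                new_tasks = try_steal(rq, rf);  /* scan cfs_overload_cpus in this LLC */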

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap enough to be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
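
For illustration only (not part of the series), the index arithmetic behind
the sparsemask can be exercised in a few lines of user-space C. The math
mirrors the _SMASK_ELEM_TO_BIT() macro added in patch 1, using the default
density of 3 (K = 8 usable bits per 64-byte chunk); the 44-CPU example is
arbitrary:

#include <stdio.h>

/*
 * Each 64-byte (512-bit) chunk holds only K = 2^density usable bits, at
 * the start of the chunk; the rest of the chunk is padding that keeps
 * concurrent set/clear traffic on different cache lines.
 */
static unsigned int elem_to_bit(unsigned int elem, unsigned int density)
{
        unsigned int k = 1u << density;          /* elements per chunk */

        return (elem / k) * 512 + (elem % k);    /* 512 bits per chunk */
}

int main(void)
{
        unsigned int density = 3, nelems = 44;   /* e.g. a 44-CPU LLC span */
        unsigned int k = 1u << density;
        unsigned int chunks = (nelems + k - 1) / k;
        unsigned int elem;

        for (elem = 0; elem < nelems; elem += 9)
                printf("element %2u -> bit %4u (cache line %u)\n", elem,
                       elem_to_bit(elem, density),
                       elem_to_bit(elem, density) / 512);

        /* Approximately N/K * 64 bytes of bit storage, plus a small header */
        printf("%u elements use %u 64-byte chunks = %u bytes\n",
               nelems, chunks, chunks * 64);
        return 0;
}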

Patch 1 defines the sparsemask type and its operations.

Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.

Patches 5 and 6 refactor existing code for a cleaner merge of later
patches.

Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.

Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
time being because of performance regressions that are not due to stealing
per se. See the patch description for details.

Patch 10 adds schedstats for comparing the new behavior to the old. It is
provided as a convenience for developers only, not for integration.

The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
bandwidth control were tested.

Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in old and new functions that search for
idle CPUs and tasks to steal, and that update the overloaded CPUs bitmap.

steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps time %busy slice sched idle wake %find steal
1 8.084 75.02 0.10 105476 46291 59183 0.31 0
2 13.892 85.33 0.10 190225 70958 119264 0.45 0
3 19.668 89.04 0.10 263896 87047 176850 0.49 0
4 25.279 91.28 0.10 322171 94691 227474 0.51 0
8 47.832 94.86 0.09 630636 144141 486322 0.56 0

new
grps time %busy slice sched idle wake %find steal %speedup
1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up
by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a
regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec. Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new.
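
(For reproducing these runs: on 4.19 this tunable is exposed, on kernels
built with CONFIG_SCHED_DEBUG, as the sysctl kernel.sched_wakeup_granularity_ns,
so e.g. "sysctl -w kernel.sched_wakeup_granularity_ns=15000000" before each
run should match the setup used here.)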

------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 8.008 0.1 5.905 0.2 35.6
2 13.814 0.2 11.438 0.1 20.7
3 19.488 0.2 16.919 0.1 15.1
4 25.059 0.1 22.409 0.1 11.8
8 47.478 0.1 44.221 0.1 7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 4.586 0.8 4.596 0.6 -0.3
2 7.693 0.2 5.775 1.3 33.2
3 10.442 0.3 8.288 0.3 25.9
4 13.087 0.2 11.057 0.1 18.3
8 24.145 0.2 22.076 0.3 9.3
16 43.779 0.1 41.741 0.2 4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs.

clients %speedup
1 16.2
2 11.7
4 9.9
8 12.8
16 13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Benchmark %speedup
specjbb2015_critical_jops 5.7
mysql_sysb1.0.14_mutex_2 40.6
mysql_sysb1.0.14_oltp_2 3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 7.945 0.2 7.219 8.7 10.0
2 8.444 0.4 6.689 1.5 26.2
3 12.100 1.1 9.962 2.0 21.4
4 15.001 0.4 13.109 1.1 14.4
8 27.960 0.2 26.127 0.3 7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 5.826 5.4 5.840 5.0 -0.3
2 5.041 5.3 6.171 23.4 -18.4
3 6.839 2.1 6.324 3.8 8.1
4 8.177 0.6 7.318 3.6 11.7
8 14.429 0.7 13.966 1.3 3.3
16 26.401 0.3 25.149 1.5 4.9


X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

Customers Users %speedup
1200000 40 -1.2
2400000 80 2.7
3600000 120 8.9
4800000 160 4.4
6000000 200 3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".

Benchmark %speedup

mysql_sysb1.0.14_fileio_56_rndrd 19.6
mysql_sysb1.0.14_fileio_56_seqrd 12.1
mysql_sysb1.0.14_fileio_56_rndwr 0.4
mysql_sysb1.0.14_fileio_56_seqrewr -0.3

pgsql_sysb1.0.14_fileio_56_rndrd 19.5
pgsql_sysb1.0.14_fileio_56_seqrd 8.6
pgsql_sysb1.0.14_fileio_56_rndwr 1.0
pgsql_sysb1.0.14_fileio_56_seqrewr 0.5

opatch_time_ASM_12.2.0.1.0_HP2M 7.5
select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8

lm3_memlat_L2 4.8
lm3_memlat_L1 0.0

ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

NAS_OMP
bench class ncpu %improved(Mops)
dc B 72 1.3
is C 72 0.9
is D 72 0.7

sysbench mysql, average of 24 runs
--- base --- --- new ---
nthr events %stdev events %stdev %speedup
1 331.0 0.25 331.0 0.24 -0.1
2 661.3 0.22 661.8 0.22 0.0
4 1297.0 0.88 1300.5 0.82 0.2
8 2420.8 0.04 2420.5 0.04 -0.1
16 4826.3 0.07 4825.4 0.05 -0.1
32 8815.3 0.27 8830.2 0.18 0.1
64 12823.0 0.24 12823.6 0.26 0.0

--------------------------------------------------------------

Steve Sistare (10):
sched: Provide sparsemask, a reduced contention bitmap
sched/topology: Provide hooks to allocate data shared per LLC
sched/topology: Provide cfs_overload_cpus bitmap
sched/fair: Dynamically update cfs_overload_cpus
sched/fair: Hoist idle_stamp up from idle_balance
sched/fair: Generalize the detach_task interface
sched/fair: Provide can_migrate_task_llc
sched/fair: Steal work from an overloaded CPU when CPU goes idle
sched/fair: disable stealing if too many NUMA nodes
sched/fair: Provide idle search schedstats

include/linux/sched/topology.h | 1 +
include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++
kernel/sched/core.c | 30 +++-
kernel/sched/fair.c | 338 +++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 13 +-
kernel/sched/stats.c | 11 +-
kernel/sched/stats.h | 13 ++
kernel/sched/topology.c | 117 +++++++++++++-
lib/Makefile | 2 +-
lib/sparsemask.c | 142 +++++++++++++++++
11 files changed, 898 insertions(+), 35 deletions(-)
create mode 100644 include/linux/sparsemask.h
create mode 100644 lib/sparsemask.c

--
1.8.3.1



2018-10-22 15:11:24

by Steven Sistare

Subject: [PATCH 04/10] sched/fair: Dynamically update cfs_overload_cpus

An overloaded CPU has more than 1 runnable CFS task. When a CFS task wakes
on a CPU, if h_nr_running transitions from 1 to more than 1, then set the CPU
in the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_running
transitions from 2 to less than 2, then clear the CPU in cfs_overload_cpus.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fc4a37..c623338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
#include "sched.h"

#include <trace/events/sched.h>
+#include <linux/sparsemask.h>

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -3723,6 +3724,28 @@ static inline bool within_margin(int value, int margin)
WRITE_ONCE(p->se.avg.util_est, ue);
}

+static void overload_clear(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_clear_elem(rq->cpu, overload_cpus);
+ rcu_read_unlock();
+}
+
+static void overload_set(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_set_elem(rq->cpu, overload_cpus);
+ rcu_read_unlock();
+}
+
#else /* CONFIG_SMP */

#define UPDATE_TG 0x0
@@ -3746,6 +3769,9 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+static inline void overload_clear(struct rq *rq) {}
+static inline void overload_set(struct rq *rq) {}
+
static inline void
util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}

@@ -4439,6 +4465,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
+ unsigned int prev_nr = rq->cfs.h_nr_running;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long task_delta, dequeue = 1;
@@ -4466,8 +4493,12 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue = 0;
}

- if (!se)
+ if (!se) {
sub_nr_running(rq, task_delta);
+ if (prev_nr >= 2 && prev_nr - task_delta < 2)
+ overload_clear(rq);
+
+ }

cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
@@ -4493,6 +4524,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
+ unsigned int prev_nr = rq->cfs.h_nr_running;
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
int enqueue = 1;
@@ -4529,8 +4561,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}

- if (!se)
+ if (!se) {
add_nr_running(rq, task_delta);
+ if (prev_nr < 2 && prev_nr + task_delta >= 2)
+ overload_set(rq);
+ }

/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5064,6 +5099,7 @@ static inline void hrtick_update(struct rq *rq)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ unsigned int prev_nr = rq->cfs.h_nr_running;

/*
* The code below (indirectly) updates schedutil which looks at
@@ -5111,8 +5147,12 @@ static inline void hrtick_update(struct rq *rq)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ if (prev_nr == 1)
+ overload_set(rq);
+
+ }

hrtick_update(rq);
}
@@ -5129,6 +5169,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
+ unsigned int prev_nr = rq->cfs.h_nr_running;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -5170,8 +5211,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}

- if (!se)
+ if (!se) {
sub_nr_running(rq, 1);
+ if (prev_nr == 2)
+ overload_clear(rq);
+ }

util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
--
1.8.3.1


2018-10-22 15:11:24

by Steven Sistare

Subject: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

The STEAL feature causes regressions on hackbench on larger NUMA systems,
so disable it on systems with more than sched_steal_node_limit nodes
(default 2). Note that the feature remains enabled as seen in features.h
and /sys/kernel/debug/sched_features, but stealing is only performed if
nodes <= sched_steal_node_limit. This arrangement allows users to activate
stealing on reboot by setting the kernel parameter sched_steal_node_limit
on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary
and will be deleted when the regression is fixed.
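
For example, to keep stealing active on an 8-node system, the parameter
would be added to the kernel command line at boot:

        sched_steal_node_limit=8

which is the same form that the pr_debug message added below suggests.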

Details of the regression follow. With the STEAL feature set, hackbench
is slower on many-node systems:

X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz
Average of 10 runs of: hackbench <groups> processes 50000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 3.627 15.8 3.876 7.3 -6.5
2 4.545 24.7 5.583 16.7 -18.6
3 5.716 25.0 7.367 14.2 -22.5
4 6.901 32.9 7.718 14.5 -10.6
8 8.604 38.5 9.111 16.0 -5.6
16 7.734 6.8 11.007 8.2 -29.8

Total CPU time increases. Profiling shows that CPU time increases
uniformly across all functions, suggesting a systemic increase in cache
or memory latency. This may be due to NUMA migrations, as they cause
loss of LLC cache footprint and remote memory latencies.

The domains for this system and their flags are:

domain0 (SMT) : 1 core
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY
SD_WAKE_AFFINE

domain1 (MC) : 1 socket
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING
SD_WAKE_AFFINE

domain2 (NUMA) : 4 sockets
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_SERIALIZE SD_OVERLAP SD_NUMA
SD_WAKE_AFFINE

domain3 (NUMA) : 8 sockets
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE
SD_SERIALIZE SD_OVERLAP SD_NUMA

Schedstats point to the root cause of the regression. hackbench is run
10 times per group and the average schedstat accumulation per-run and
per-cpu is shown below. Note that domain3 moves are zero because
SD_WAKE_AFFINE is not set there.

NO_STEAL
--- domain2 --- --- domain3 ---
grp time %busy sched idle wake steal remote move pull remote move pull
1 20.3 10.3 28710 14346 14366 0 490 3378 0 4039 0 0
2 26.4 18.8 56721 28258 28469 0 792 7026 12 9229 0 7
3 29.9 28.3 90191 44933 45272 0 5380 7204 19 16481 0 3
4 30.2 35.8 121324 60409 60933 0 7012 9372 27 21438 0 5
8 27.7 64.2 229174 111917 117272 0 11991 1837 168 44006 0 32
16 32.6 74.0 334615 146784 188043 0 3404 1468 49 61405 0 8

STEAL
--- domain2 --- --- domain3 ---
grp time %busy sched idle wake steal remote move pull remote move pull
1 20.6 10.2 28490 14232 14261 18 3 3525 0 4254 0 0
2 27.9 18.8 56757 28203 28562 303 1675 7839 5 9690 0 2
3 35.3 27.7 87337 43274 44085 698 741 12785 14 15689 0 3
4 36.8 36.0 118630 58437 60216 1579 2973 14101 28 18732 0 7
8 48.1 73.8 289374 133681 155600 18646 35340 10179 171 65889 0 34
16 41.4 82.5 268925 91908 177172 47498 17206 6940 176 71776 0 20

Cross-numa-node migrations are caused by load balancing pulls and
wake_affine moves. Pulls are small and similar for no_steal and steal.
However, moves are significantly higher for steal, and rows above with the
highest moves have the worst regressions for time; see for example grp=8.

Moves increase for steal due to the following logic in wake_affine_idle()
for synchronous wakeup:

if (sync && cpu_rq(this_cpu)->nr_running == 1)
return this_cpu; // move the task

The steal feature does a better job of smoothing the load between idle
and busy CPUs, so nr_running is 1 more often, and moves are performed
more often. For hackbench, cross-node affine moves early in the run are
good because they colocate wakers and wakees from the same group on the
same node, but continued moves later in the run are bad, because the wakee
is moved away from peers on its previous node. Note that even no_steal
is far from optimal; binding an instance of "hackbench 2" to each of the
8 NUMA nodes runs much faster than running "hackbench 16" with no binding.

Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node
migrations and eliminates the difference between no_steal and steal
performance. However, overall performance is lower than WA_IDLE because
some migrations are helpful as explained above.

I have tried many heuristics in an attempt to optimize the number of
cross-node moves in all conditions, with limited success. The fundamental
problem is that the scheduler does not track which groups of tasks talk to
each other. Parts of several groups become entrenched on the same node,
filling it to capacity, leaving no room for either group to pull its peers
over, and there is neither data nor mechanism for the scheduler to evict
one group to make room for the other.

For now, disable STEAL on such systems until we can do better, or it is
shown that hackbench is atypical and most workloads benefit from stealing.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 16 +++++++++++++---
kernel/sched/sched.h | 2 +-
kernel/sched/topology.c | 25 +++++++++++++++++++++++++
3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb86ec9..33d24ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3726,11 +3726,21 @@ static inline bool within_margin(int value, int margin)

#define IF_SMP(statement) statement

+static inline bool steal_enabled(void)
+{
+#ifdef CONFIG_NUMA
+ bool allow = static_branch_likely(&sched_steal_allow);
+#else
+ bool allow = true;
+#endif
+ return sched_feat(STEAL) && allow;
+}
+
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return;

rcu_read_lock();
@@ -3744,7 +3754,7 @@ static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return;

rcu_read_lock();
@@ -9782,7 +9792,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
int stolen = 0;
struct sparsemask *overload_cpus;

- if (!sched_feat(STEAL))
+ if (!steal_enabled())
return 0;

if (!cpu_active(dst_cpu))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aadfe68..5f181e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -928,7 +928,6 @@ static inline int cpu_of(struct rq *rq)
#endif
}

-
#ifdef CONFIG_SCHED_SMT

extern struct static_key_false sched_smt_present;
@@ -1083,6 +1082,7 @@ enum numa_topology_type {
#endif

#ifdef CONFIG_NUMA
+extern struct static_key_true sched_steal_allow;
extern void sched_init_numa(void);
extern void sched_domains_numa_masks_set(unsigned int cpu);
extern void sched_domains_numa_masks_clear(unsigned int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f18c416..b7d2070 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1337,6 +1337,30 @@ static void init_numa_topology_type(void)
}
}

+DEFINE_STATIC_KEY_TRUE(sched_steal_allow);
+static int sched_steal_node_limit;
+#define SCHED_STEAL_NODE_LIMIT_DEFAULT 2
+
+static int __init steal_node_limit_setup(char *buf)
+{
+ get_option(&buf, &sched_steal_node_limit);
+ return 0;
+}
+
+early_param("sched_steal_node_limit", steal_node_limit_setup);
+
+static void check_node_limit(void)
+{
+ int n = num_possible_nodes();
+
+ if (sched_steal_node_limit == 0)
+ sched_steal_node_limit = SCHED_STEAL_NODE_LIMIT_DEFAULT;
+ if (n > sched_steal_node_limit) {
+ static_branch_disable(&sched_steal_allow);
+ pr_debug("Suppressing sched STEAL. To enable, reboot with sched_steal_node_limit=%d", n);
+ }
+}
+
void sched_init_numa(void)
{
int next_distance, curr_distance = node_distance(0, 0);
@@ -1485,6 +1509,7 @@ void sched_init_numa(void)
sched_max_numa_distance = sched_domains_numa_distance[level - 1];

init_numa_topology_type();
+ check_node_limit();
}

void sched_domains_numa_masks_set(unsigned int cpu)
--
1.8.3.1


2018-10-22 15:11:24

by Steven Sistare

Subject: [PATCH 05/10] sched/fair: Hoist idle_stamp up from idle_balance

Move the update of idle_stamp from idle_balance to the call site in
pick_next_task_fair, to prepare for a future patch that adds work to
pick_next_task_fair which must be included in the idle_stamp interval.
No functional change.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c623338..9f31045 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3724,6 +3724,8 @@ static inline bool within_margin(int value, int margin)
WRITE_ONCE(p->se.avg.util_est, ue);
}

+#define IF_SMP(statement) statement
+
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
@@ -3769,6 +3771,8 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+#define IF_SMP(statement) /* empty */
+
static inline void overload_clear(struct rq *rq) {}
static inline void overload_set(struct rq *rq) {}

@@ -6740,8 +6744,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return p;

idle:
+ /*
+ * We must set idle_stamp _before_ calling idle_balance(), such that we
+ * measure the duration of idle_balance() as idle time.
+ */
+ IF_SMP(rq->idle_stamp = rq_clock(rq);)
+
new_tasks = idle_balance(rq, rf);

+ if (new_tasks)
+ IF_SMP(rq->idle_stamp = 0;)
+
+ schedstat_end_time(rq->find_time, time);
+
/*
* Because idle_balance() releases (and re-acquires) rq->lock, it is
* possible for any higher priority task to appear. In that case we
@@ -9504,12 +9519,6 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
u64 curr_cost = 0;

/*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
- */
- this_rq->idle_stamp = rq_clock(this_rq);
-
- /*
* Do not pull tasks towards !active CPUs...
*/
if (!cpu_active(this_cpu))
@@ -9600,9 +9609,6 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
if (this_rq->nr_running != this_rq->cfs.h_nr_running)
pulled_task = -1;

- if (pulled_task)
- this_rq->idle_stamp = 0;
-
rq_repin_lock(this_rq, rf);

return pulled_task;
--
1.8.3.1


2018-10-22 15:12:35

by Steven Sistare

Subject: [PATCH 01/10] sched: Provide sparsemask, a reduced contention bitmap

Provide struct sparsemask and functions to manipulate it. A sparsemask is
a sparse bitmap. It reduces cache contention vs the usual bitmap when many
threads concurrently set, clear, and visit elements, by reducing the number
of significant bits per cacheline. For each 64 byte chunk of the mask,
only the first K bits of the first word are used, and the remaining bits
are ignored, where K is a creation time parameter. Thus a sparsemask that
can represent a set of N elements is approximately (N/K * 64) bytes in
size.
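
As an informal illustration (not part of the patch), typical use of the
interface declared below might look like this, where nid, start, and
do_something() are placeholders; the real callers appear in patches 3, 4,
and 8:

        struct sparsemask *mask;
        int cpu, elem;

        /* one element per CPU; 2^3 = 8 usable bits per 64-byte chunk */
        if (!zalloc_sparsemask_node(&mask, nr_cpu_ids,
                                    SPARSEMASK_DENSITY_DEFAULT,
                                    GFP_KERNEL, nid))
                return -ENOMEM;

        sparsemask_set_elem(cpu, mask);          /* atomic set */
        for_each_sparse_wrap(elem, mask, start)  /* visit set elements */
                do_something(elem);
        sparsemask_clear_elem(cpu, mask);        /* atomic clear */

        free_sparsemask(mask);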

Signed-off-by: Steve Sistare <[email protected]>
---
include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/sparsemask.c | 142 +++++++++++++++++++++++++
3 files changed, 403 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sparsemask.h
create mode 100644 lib/sparsemask.c

diff --git a/include/linux/sparsemask.h b/include/linux/sparsemask.h
new file mode 100644
index 0000000..d36a3be
--- /dev/null
+++ b/include/linux/sparsemask.h
@@ -0,0 +1,260 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sparsemask.h - sparse bitmap operations
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __LINUX_SPARSEMASK_H
+#define __LINUX_SPARSEMASK_H
+
+#include <linux/kernel.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+
+/*
+ * A sparsemask is a sparse bitmap. It reduces cache contention vs the usual
+ * bitmap when many threads concurrently set, clear, and visit elements. For
+ * each 64 byte chunk of the mask, only the first K bits of the first word are
+ * used, and the remaining bits are ignored, where K is a creation time
+ * parameter. Thus a sparsemask that can represent a set of N elements is
+ * approximately (N/K * 64) bytes in size.
+ *
+ * Clients pass and receive element numbers in the public API, and the
+ * implementation translates them to bit numbers to perform the bitmap
+ * operations.
+ *
+ * This file is partially derived from cpumask.h, and the public sparsemask
+ * operations are drop-in replacements for cpumask operations. However,
+ * sparsemask has no dependency on CPU definitions and can be used to
+ * represent any kind of elements.
+ */
+
+struct sparsemask {
+ short nelems; /* current number of elements */
+ short density; /* store 2^density elements per chunk */
+ unsigned long bits[0]; /* embedded array of chunks */
+};
+
+/* The maximum value for density, which implicitly defines the chunk size */
+
+#define _SMASK_DENSITY_MAX 6
+
+#define SMASK_DENSITY_TO_BYTES(density) (1U << (density))
+#define SMASK_DENSITY_TO_ELEMS(density) (1U << (density))
+
+/* The number of elements/bits/bytes/longs in a chunk */
+
+#define SMASK_ELEMS(mask) SMASK_DENSITY_TO_ELEMS((mask)->density)
+#define SMASK_BYTES SMASK_DENSITY_TO_BYTES(_SMASK_DENSITY_MAX)
+#define SMASK_BITS (SMASK_BYTES * BITS_PER_BYTE)
+#define SMASK_LONGS (SMASK_BYTES / sizeof(long))
+
+/*
+ * Translate element index @elem to a bit/byte/long index.
+ * @density: the density of a chunk.
+ */
+
+#define _SMASK_ELEM_TO_BIT(elem, density) \
+ ((elem) / SMASK_DENSITY_TO_ELEMS(density) * SMASK_BITS +\
+ (elem) % SMASK_DENSITY_TO_ELEMS(density))
+
+#define _SMASK_ELEM_TO_BYTE(elem, density) \
+ (_SMASK_ELEM_TO_BIT(elem, density) / BITS_PER_BYTE)
+
+#define _SMASK_ELEM_TO_LONG(elem, density) \
+ (_SMASK_ELEM_TO_BYTE(elem, density) / sizeof(long))
+
+/* Translate @bit/@byte/@long index to an element index */
+
+#define _SMASK_BIT_TO_ELEM(bit, density) \
+ ((bit) / SMASK_BITS * SMASK_DENSITY_TO_ELEMS(density) + \
+ (bit) % SMASK_BITS)
+
+#define _SMASK_BYTE_TO_ELEM(byte, density) \
+ _SMASK_BIT_TO_ELEM((byte) * BITS_PER_BYTE, density)
+
+#define _SMASK_LONG_TO_ELEM(index, density) \
+ _SMASK_BYTE_TO_ELEM((index) * sizeof(long), density)
+
+/* Same translations as above, but taking sparsemask @m instead of density */
+
+#define SMASK_ELEM_TO_BYTE(elem, m) _SMASK_ELEM_TO_BYTE(elem, (m)->density)
+#define SMASK_ELEM_TO_BIT(elem, m) _SMASK_ELEM_TO_BIT(elem, (m)->density)
+#define SMASK_ELEM_TO_LONG(elem, m) _SMASK_ELEM_TO_LONG(elem, (m)->density)
+#define SMASK_BYTE_TO_ELEM(byte, m) _SMASK_BYTE_TO_ELEM(byte, (m)->density)
+#define SMASK_BIT_TO_ELEM(bit, m) _SMASK_BIT_TO_ELEM(bit, (m)->density)
+#define SMASK_LONG_TO_ELEM(index, m) _SMASK_LONG_TO_ELEM(index, (m)->density)
+
+/*
+ * Verify the @elem argument to sparsemask functions, and return its bit.
+ */
+static inline int
+sparsemask_check(int elem, const struct sparsemask *mask)
+{
+ WARN_ON_ONCE(elem >= mask->nelems);
+ return SMASK_ELEM_TO_BIT(elem, mask);
+}
+
+int sparsemask_next(int n, const struct sparsemask *srcp);
+int sparsemask_next_wrap(int n, const struct sparsemask *mask,
+ int start, bool wrap);
+
+/****************** The public API ********************/
+
+/*
+ * for_each_sparse - iterate over every element in a mask
+ * @elem: the (optionally unsigned) integer iterator
+ * @mask: the sparsemask
+ *
+ * After the loop, @elem is >= @mask->nelems.
+ */
+#define for_each_sparse(elem, mask) \
+ for ((elem) = -1; \
+ (elem) = sparsemask_next((elem), (mask)), \
+ (elem) < (mask)->nelems;)
+
+/*
+ * for_each_sparse_wrap - iterate over every element in a mask, starting at a
+ * specified location.
+ * @elem: the (optionally unsigned) integer iterator
+ * @mask: the sparsemask
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ * After the loop, @elem is >= @mask->nelems.
+ */
+#define for_each_sparse_wrap(elem, mask, start) \
+ for ((elem) = sparsemask_next_wrap((start)-1, (mask), (start), false); \
+ (elem) < (mask)->nelems; \
+ (elem) = sparsemask_next_wrap((elem), (mask), (start), true))
+
+/*
+ * sparsemask_set_elem - set an element in a sparsemask
+ * @elem: element number (< @dstp->nelems)
+ * @dstp: the sparsemask
+ */
+static inline void sparsemask_set_elem(int elem, struct sparsemask *dstp)
+{
+ set_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+static inline void __sparsemask_set_elem(int elem, struct sparsemask *dstp)
+{
+ __set_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+/*
+ * sparsemask_clear_elem - clear an element in a sparsemask
+ * @elem: element number (< @dstp->nelems)
+ * @dstp: the sparsemask
+ */
+static inline void sparsemask_clear_elem(int elem, struct sparsemask *dstp)
+{
+ clear_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+static inline void __sparsemask_clear_elem(int elem, struct sparsemask *dstp)
+{
+ __clear_bit(sparsemask_check(elem, dstp), dstp->bits);
+}
+
+/*
+ * sparsemask_test_elem - test for an element in a sparsemask
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in @mask, else returns 0
+ */
+static inline int sparsemask_test_elem(int elem, const struct sparsemask *mask)
+{
+ return test_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_test_and_set_elem - atomically test and set an element
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in old bitmap of @mask, else returns 0
+ */
+static inline int
+sparsemask_test_and_set_elem(int elem, struct sparsemask *mask)
+{
+ return test_and_set_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_test_and_clear_elem - atomically test and clear an element
+ * @elem: element number (< @mask->nelems)
+ * @mask: the sparsemask
+ *
+ * Returns 1 if @elem is set in old bitmap of @mask, else returns 0
+ */
+static inline int
+sparsemask_test_and_clear_elem(int elem, struct sparsemask *mask)
+{
+ return test_and_clear_bit(sparsemask_check(elem, mask), mask->bits);
+}
+
+/*
+ * sparsemask_weight - return count of bits in @mask (<= @mask->nelems)
+ * @mask: the sparsemask
+ */
+unsigned int sparsemask_weight(const struct sparsemask *srcp);
+
+/*
+ * Suggested and max value for the density parameter
+ */
+#define SPARSEMASK_DENSITY_DEFAULT 3
+#define SMASK_DENSITY_MAX _SMASK_DENSITY_MAX
+
+/*
+ * Allocate and initialize a sparsemask and return it in @maskp.
+ * @nelems - maximum number of elements.
+ * @density - store 2^density elements per 64-byte chunk.
+ * values from 0 to SMASK_DENSITY_MAX inclusive.
+ * @flags - kmalloc allocation flags
+ * @node - numa node
+ *
+ * Return true on success, like the cpumask functions.
+ */
+
+bool alloc_sparsemask(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags);
+
+bool zalloc_sparsemask(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags);
+
+bool alloc_sparsemask_node(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags, int node);
+
+bool zalloc_sparsemask_node(struct sparsemask **maskp, int nelems,
+ int density, gfp_t flags, int node);
+
+/*
+ * Free a sparsemask allocated by any of the above
+ */
+void free_sparsemask(struct sparsemask *mask);
+
+/*
+ * Return bytes to allocate for a sparsemask, for custom allocators
+ */
+size_t sparsemask_size(int nelems, int density);
+
+/*
+ * Initialize an allocated sparsemask, for custom allocators
+ */
+void sparsemask_init(struct sparsemask *mask, int nelems, int density);
+
+#endif /* __LINUX_SPARSEMASK_H */
diff --git a/lib/Makefile b/lib/Makefile
index ca3f7eb..77c168e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -28,7 +28,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_MMU) += ioremap.o
-lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SMP) += cpumask.o sparsemask.o

lib-y += kobject.o klist.o
obj-y += lockref.o
diff --git a/lib/sparsemask.c b/lib/sparsemask.c
new file mode 100644
index 0000000..d51cf50
--- /dev/null
+++ b/lib/sparsemask.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * sparsemask.c - sparse bitmap operations
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/sparsemask.h>
+
+/*
+ * Return the next one bit in @mask after @start, not including @start.
+ */
+int sparsemask_next(int start, const struct sparsemask *mask)
+{
+ const unsigned long *addr = mask->bits;
+ unsigned long nelems = mask->nelems;
+ unsigned long tmp, nbits;
+
+ /* -1 is a legal arg here. */
+ if (start != -1)
+ sparsemask_check(start, mask);
+ start++;
+
+ if (unlikely(start >= nelems))
+ return nelems;
+
+ start = SMASK_ELEM_TO_BIT(start, mask);
+ nbits = SMASK_ELEM_TO_BIT(nelems, mask);
+ tmp = addr[start / BITS_PER_LONG];
+
+ /* Handle 1st word. */
+ tmp &= BITMAP_FIRST_WORD_MASK(start);
+ start = round_down(start, BITS_PER_LONG);
+
+ while (!tmp) {
+ start += SMASK_BITS;
+ if (start >= nbits)
+ return nelems;
+ tmp = addr[start / BITS_PER_LONG];
+ }
+
+ return min(SMASK_BIT_TO_ELEM(start, mask) + __ffs(tmp), nelems);
+}
+
+int
+sparsemask_next_wrap(int n, const struct sparsemask *mask, int start, bool wrap)
+{
+ int next;
+
+again:
+ next = sparsemask_next(n, mask);
+
+ if (wrap && n < start && next >= start) {
+ return mask->nelems;
+
+ } else if (next >= mask->nelems) {
+ wrap = true;
+ n = -1;
+ goto again;
+ }
+
+ return next;
+}
+
+unsigned int sparsemask_weight(const struct sparsemask *mask)
+{
+ int weight = 0;
+ const unsigned long *addr = mask->bits;
+ int nlongs = SMASK_ELEM_TO_LONG(mask->nelems, mask);
+ int i, extra, shift;
+
+ for (i = 0; i < nlongs; i += SMASK_LONGS) {
+ if (addr[i])
+ weight += hweight_long(addr[i]);
+ }
+ extra = mask->nelems - SMASK_LONG_TO_ELEM(i, mask);
+ if (extra > 0) {
+ shift = BITS_PER_LONG - extra;
+ weight += hweight_long((addr[i] << shift) >> shift);
+ }
+ return weight;
+}
+
+size_t sparsemask_size(int nelems, int density)
+{
+ nelems = round_up(nelems, SMASK_DENSITY_TO_ELEMS(density));
+ return sizeof(struct sparsemask) + _SMASK_ELEM_TO_BYTE(nelems, density);
+}
+
+void sparsemask_init(struct sparsemask *mask, int nelems, int density)
+{
+ WARN_ON(density < 0 || density > SMASK_DENSITY_MAX);
+ mask->nelems = nelems;
+ mask->density = density;
+}
+
+bool alloc_sparsemask_node(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags, int node)
+{
+ *mask = kmalloc_node(sparsemask_size(nelems, density), flags, node);
+ if (*mask)
+ sparsemask_init(*mask, nelems, density);
+ return !!*mask;
+}
+
+bool zalloc_sparsemask_node(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags, int node)
+{
+ flags |= __GFP_ZERO;
+ return alloc_sparsemask_node(mask, nelems, density, flags, node);
+}
+
+bool alloc_sparsemask(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags)
+{
+ return alloc_sparsemask_node(mask, nelems, density, flags,
+ NUMA_NO_NODE);
+}
+
+bool zalloc_sparsemask(struct sparsemask **mask, int nelems, int density,
+ gfp_t flags)
+{
+ flags |= __GFP_ZERO;
+ return alloc_sparsemask(mask, nelems, density, flags);
+}
+
+void free_sparsemask(struct sparsemask *mask)
+{
+ kfree(mask);
+}
--
1.8.3.1


2018-10-22 15:12:54

by Steven Sistare

Subject: [PATCH 03/10] sched/topology: Provide cfs_overload_cpus bitmap

Define and initialize a sparse bitmap of overloaded CPUs, per
last-level-cache scheduling domain, for use by the CFS scheduling class.
Save a pointer to cfs_overload_cpus in the rq for efficient access.

Signed-off-by: Steve Sistare <[email protected]>
---
include/linux/sched/topology.h | 1 +
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 21 +++++++++++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2634774..8bac15d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ struct sparsemask *cfs_overload_cpus;
};

struct sched_domain {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 455fa33..aadfe68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -81,6 +81,7 @@

struct rq;
struct cpuidle_state;
+struct sparsemask;

/* task_struct::on_rq states: */
#define TASK_ON_RQ_QUEUED 1
@@ -805,6 +806,7 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+ struct sparsemask *cfs_overload_cpus;

#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2363f6..f18c416 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -3,6 +3,7 @@
* Scheduler topology setup/handling methods
*/
#include "sched.h"
+#include <linux/sparsemask.h>

DEFINE_MUTEX(sched_domains_mutex);

@@ -440,6 +441,7 @@ static void update_top_cache_domain(int cpu)
static void
cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
{
+ struct sparsemask *cfs_overload_cpus;
struct rq *rq = cpu_rq(cpu);
struct sched_domain *tmp;

@@ -481,6 +483,10 @@ static void update_top_cache_domain(int cpu)
dirty_sched_domain_sysctl(cpu);
destroy_sched_domains(tmp);

+ sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
+ cfs_overload_cpus = (sd ? sd->shared->cfs_overload_cpus : NULL);
+ rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
+
update_top_cache_domain(cpu);
}

@@ -1611,9 +1617,19 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}

+#define ZALLOC_MASK(maskp, nelems, node) \
+ (!*(maskp) && !zalloc_sparsemask_node(maskp, nelems, \
+ SPARSEMASK_DENSITY_DEFAULT, \
+ GFP_KERNEL, node)) \
+
static int sd_llc_alloc(struct sched_domain *sd)
{
- /* Allocate sd->shared data here. Empty for now. */
+ struct sched_domain_shared *sds = sd->shared;
+ struct cpumask *span = sched_domain_span(sd);
+ int nid = cpu_to_node(cpumask_first(span));
+
+ if (ZALLOC_MASK(&sds->cfs_overload_cpus, nr_cpu_ids, nid))
+ return 1;

return 0;
}
@@ -1625,7 +1641,8 @@ static void sd_llc_free(struct sched_domain *sd)
if (!sds)
return;

- /* Free data here. Empty for now. */
+ free_sparsemask(sds->cfs_overload_cpus);
+ sds->cfs_overload_cpus = NULL;
}

static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
--
1.8.3.1


2018-10-22 15:14:03

by Steven Sistare

Subject: [PATCH 02/10] sched/topology: Provide hooks to allocate data shared per LLC

Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and
free data pointed to by struct sched_domain_shared at the last-level-cache
domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to
eliminate the unnecessary allocations that would occur if we instead
allocated in __sdt_alloc() and then figured out which shared nodes are
redundant.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/topology.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 505a41c..a2363f6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -10,6 +10,12 @@
cpumask_var_t sched_domains_tmpmask;
cpumask_var_t sched_domains_tmpmask2;

+struct s_data;
+static int sd_llc_alloc(struct sched_domain *sd);
+static void sd_llc_free(struct sched_domain *sd);
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d);
+static void sd_llc_free_all(const struct cpumask *cpu_map);
+
#ifdef CONFIG_SCHED_DEBUG

static int __init sched_debug_setup(char *str)
@@ -361,8 +367,10 @@ static void destroy_sched_domain(struct sched_domain *sd)
*/
free_sched_groups(sd->groups, 1);

- if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+ if (sd->shared && atomic_dec_and_test(&sd->shared->ref)) {
+ sd_llc_free(sd);
kfree(sd->shared);
+ }
kfree(sd);
}

@@ -993,6 +1001,7 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
free_percpu(d->sd);
/* Fall through */
case sa_sd_storage:
+ sd_llc_free_all(cpu_map);
__sdt_free(cpu_map);
/* Fall through */
case sa_none:
@@ -1602,6 +1611,62 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}

+static int sd_llc_alloc(struct sched_domain *sd)
+{
+ /* Allocate sd->shared data here. Empty for now. */
+
+ return 0;
+}
+
+static void sd_llc_free(struct sched_domain *sd)
+{
+ struct sched_domain_shared *sds = sd->shared;
+
+ if (!sds)
+ return;
+
+ /* Free data here. Empty for now. */
+}
+
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
+{
+ struct sched_domain *sd, *hsd;
+ int i;
+
+ for_each_cpu(i, cpu_map) {
+ /* Find highest domain that shares resources */
+ hsd = NULL;
+ for (sd = *per_cpu_ptr(d->sd, i); sd; sd = sd->parent) {
+ if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
+ break;
+ hsd = sd;
+ }
+ if (hsd && sd_llc_alloc(hsd))
+ return 1;
+ }
+
+ return 0;
+}
+
+static void sd_llc_free_all(const struct cpumask *cpu_map)
+{
+ struct sched_domain_topology_level *tl;
+ struct sched_domain *sd;
+ struct sd_data *sdd;
+ int j;
+
+ for_each_sd_topology(tl) {
+ sdd = &tl->data;
+ if (!sdd)
+ continue;
+ for_each_cpu(j, cpu_map) {
+ sd = *per_cpu_ptr(sdd->sd, j);
+ if (sd)
+ sd_llc_free(sd);
+ }
+ }
+}
+
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *child, int cpu)
@@ -1690,6 +1755,14 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
}
}

+ /*
+ * Allocate shared sd data at last level cache. Must be done after
+ * domains are built above, but before the data is used in
+ * cpu_attach_domain and descendants below.
+ */
+ if (sd_llc_alloc_all(cpu_map, &d))
+ goto error;
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
--
1.8.3.1


2018-10-22 15:14:16

by Steven Sistare

Subject: [PATCH 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle

When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
task, then attempt to steal a task from an overloaded CPU in the same LLC,
using the cfs_overload_cpus bitmap to efficiently identify candidates. To
minimize search time, steal the first migratable task that is found when
the bitmap is traversed. For fairness, search for migratable tasks on an
overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap enough to be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature STEAL, which is enabled
by default.
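
Like any other scheduler feature, it can be toggled at run time on kernels
built with CONFIG_SCHED_DEBUG through the standard sched_features interface,
e.g. "echo NO_STEAL > /sys/kernel/debug/sched_features"; no new interface is
added for that here.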

Stealing improves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

%find - percent of time spent in old and new functions that search for
idle CPUs and tasks to steal, and that update the overloaded CPUs bitmap.

steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench <grps> process 100000
sched_wakeup_granularity_ns=15000000

baseline
grps time %busy slice sched idle wake %find steal
1 8.084 75.02 0.10 105476 46291 59183 0.31 0
2 13.892 85.33 0.10 190225 70958 119264 0.45 0
3 19.668 89.04 0.10 263896 87047 176850 0.49 0
4 25.279 91.28 0.10 322171 94691 227474 0.51 0
8 47.832 94.86 0.09 630636 144141 486322 0.56 0

new
grps time %busy slice sched idle wake %find steal %speedup
1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up
by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a
regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec. Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new.

------------------ 1 Socket Results ------------------

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 8.008 0.1 5.905 0.2 35.6
2 13.814 0.2 11.438 0.1 20.7
3 19.488 0.2 16.919 0.1 15.1
4 25.059 0.1 22.409 0.1 11.8
8 47.478 0.1 44.221 0.1 7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 4.586 0.8 4.596 0.6 -0.3
2 7.693 0.2 5.775 1.3 33.2
3 10.442 0.3 8.288 0.3 25.9
4 13.087 0.2 11.057 0.1 18.3
8 24.145 0.2 22.076 0.3 9.3
16 43.779 0.1 41.741 0.2 4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

clients %speedup
1 16.2
2 11.7
4 9.9
8 12.8
16 13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

Benchmark %speedup
specjbb2015_critical_jops 5.7
mysql_sysb1.0.14_mutex_2 40.6
mysql_sysb1.0.14_oltp_2 3.9

------------------ 2 Socket Results ------------------

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 7.945 0.2 7.219 8.7 10.0
2 8.444 0.4 6.689 1.5 26.2
3 12.100 1.1 9.962 2.0 21.4
4 15.001 0.4 13.109 1.1 14.4
8 27.960 0.2 26.127 0.3 7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench <groups> process 100000

--- base -- --- new ---
groups time %stdev time %stdev %speedup
1 5.826 5.4 5.840 5.0 -0.3
2 5.041 5.3 6.171 23.4 -18.4
3 6.839 2.1 6.324 3.8 8.1
4 8.177 0.6 7.318 3.6 11.7
8 14.429 0.7 13.966 1.3 3.3
16 26.401 0.3 25.149 1.5 4.9

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Oracle database OLTP, logging disabled, NVRAM storage

Customers Users %speedup
1200000 40 -1.2
2400000 80 2.7
3600000 120 8.9
4800000 160 4.4
6000000 200 3.0

X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Results from the Oracle "Performance PIT".

Benchmark %speedup

mysql_sysb1.0.14_fileio_56_rndrd 19.6
mysql_sysb1.0.14_fileio_56_seqrd 12.1
mysql_sysb1.0.14_fileio_56_rndwr 0.4
mysql_sysb1.0.14_fileio_56_seqrewr -0.3

pgsql_sysb1.0.14_fileio_56_rndrd 19.5
pgsql_sysb1.0.14_fileio_56_seqrd 8.6
pgsql_sysb1.0.14_fileio_56_rndwr 1.0
pgsql_sysb1.0.14_fileio_56_seqrewr 0.5

opatch_time_ASM_12.2.0.1.0_HP2M 7.5
select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8

lm3_memlat_L2 4.8
lm3_memlat_L1 0.0

ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4

X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

NAS_OMP
bench class ncpu %improved(Mops)
dc B 72 1.3
is C 72 0.9
is D 72 0.7

sysbench mysql, average of 24 runs
--- base --- --- new ---
nthr events %stdev events %stdev %speedup
1 331.0 0.25 331.0 0.24 -0.1
2 661.3 0.22 661.8 0.22 0.0
4 1297.0 0.88 1300.5 0.82 0.2
8 2420.8 0.04 2420.5 0.04 -0.1
16 4826.3 0.07 4825.4 0.05 -0.1
32 8815.3 0.27 8830.2 0.18 0.1
64 12823.0 0.24 12823.6 0.26 0.0

-------------------------------------------------------------

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 6 ++
2 files changed, 161 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6548bed..cb86ec9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3730,6 +3730,9 @@ static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;

+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
@@ -3741,6 +3744,9 @@ static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;

+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
@@ -3748,6 +3754,8 @@ static void overload_set(struct rq *rq)
rcu_read_unlock();
}

+static int try_steal(struct rq *this_rq, struct rq_flags *rf);
+
#else /* CONFIG_SMP */

#define UPDATE_TG 0x0
@@ -3783,6 +3791,11 @@ static inline void overload_set(struct rq *rq) {}
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}

+static inline int try_steal(struct rq *this_rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
#endif /* CONFIG_SMP */

static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -6745,12 +6758,14 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_

idle:
/*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
+ * We must set idle_stamp _before_ calling try_steal() or
+ * idle_balance(), such that we measure the duration as idle time.
*/
IF_SMP(rq->idle_stamp = rq_clock(rq);)

new_tasks = idle_balance(rq, rf);
+ if (new_tasks == 0)
+ new_tasks = try_steal(rq, rf);

if (new_tasks)
IF_SMP(rq->idle_stamp = 0;)
@@ -6758,9 +6773,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
schedstat_end_time(rq->find_time, time);

/*
- * Because idle_balance() releases (and re-acquires) rq->lock, it is
- * possible for any higher priority task to appear. In that case we
- * must re-start the pick_next_entity() loop.
+ * Because try_steal() and idle_balance() release (and re-acquire)
+ * rq->lock, it is possible for any higher priority task to appear.
+ * In that case we must re-start the pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
@@ -9683,6 +9698,141 @@ void trigger_load_balance(struct rq *rq)
nohz_balancer_kick(rq);
}

+/*
+ * Search the runnable tasks in @cfs_rq in order of next to run, and find
+ * the first one that can be migrated to @dst_rq. @cfs_rq is locked on entry.
+ * On success, dequeue the task from @cfs_rq and return it, else return NULL.
+ */
+static struct task_struct *
+detach_next_task(struct cfs_rq *cfs_rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+ struct task_struct *p;
+ struct rq *rq = rq_of(cfs_rq);
+
+ lockdep_assert_held(&rq_of(cfs_rq)->lock);
+
+ list_for_each_entry_reverse(p, &rq->cfs_tasks, se.group_node) {
+ if (can_migrate_task_llc(p, rq, dst_rq)) {
+ detach_task(p, rq, dst_cpu);
+ return p;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * Attempt to migrate a CFS task from @src_cpu to @dst_rq. @locked indicates
+ * whether @dst_rq is already locked on entry. This function may lock or
+ * unlock @dst_rq, and updates @locked to indicate the locked state on return.
+ * The locking protocol is based on idle_balance().
+ * Returns 1 on success and 0 on failure.
+ */
+static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
+ int src_cpu)
+{
+ struct task_struct *p;
+ struct rq_flags rf;
+ int stolen = 0;
+ int dst_cpu = dst_rq->cpu;
+ struct rq *src_rq = cpu_rq(src_cpu);
+
+ if (dst_cpu == src_cpu || src_rq->cfs.h_nr_running < 2)
+ return 0;
+
+ if (*locked) {
+ rq_unpin_lock(dst_rq, dst_rf);
+ raw_spin_unlock(&dst_rq->lock);
+ *locked = false;
+ }
+ rq_lock_irqsave(src_rq, &rf);
+ update_rq_clock(src_rq);
+
+ if (src_rq->cfs.h_nr_running < 2 || !cpu_active(src_cpu))
+ p = NULL;
+ else
+ p = detach_next_task(&src_rq->cfs, dst_rq);
+
+ rq_unlock(src_rq, &rf);
+
+ if (p) {
+ raw_spin_lock(&dst_rq->lock);
+ rq_repin_lock(dst_rq, dst_rf);
+ *locked = true;
+ update_rq_clock(dst_rq);
+ attach_task(dst_rq, p);
+ stolen = 1;
+ }
+ local_irq_restore(rf.flags);
+
+ return stolen;
+}
+
+/*
+ * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
+ * and migrate it to @dst_rq. rq_lock is held on entry and return, but
+ * may be dropped in between. Return 1 on success, 0 on failure, and -1
+ * if a task in a different scheduling class has become runnable on @dst_rq.
+ */
+static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
+{
+ int src_cpu;
+ int dst_cpu = dst_rq->cpu;
+ bool locked = true;
+ int stolen = 0;
+ struct sparsemask *overload_cpus;
+
+ if (!sched_feat(STEAL))
+ return 0;
+
+ if (!cpu_active(dst_cpu))
+ return 0;
+
+ /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
+ if (!overload_cpus) {
+ rcu_read_unlock();
+ return 0;
+ }
+
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * First try overloaded CPUs on the same core to preserve cache warmth.
+ */
+ if (static_branch_likely(&sched_smt_present)) {
+ for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
+ if (sparsemask_test_elem(src_cpu, overload_cpus) &&
+ steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ goto out;
+ }
+ }
+ }
+#endif /* CONFIG_SCHED_SMT */
+
+ /* Accept any suitable task in the LLC */
+
+ for_each_sparse_wrap(src_cpu, overload_cpus, dst_cpu) {
+ if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ break;
+ }
+ }
+
+out:
+ rcu_read_unlock();
+ if (!locked) {
+ raw_spin_lock(&dst_rq->lock);
+ rq_repin_lock(dst_rq, dst_rf);
+ }
+ stolen |= (dst_rq->cfs.h_nr_running > 0);
+ if (dst_rq->nr_running != dst_rq->cfs.h_nr_running)
+ stolen = -1;
+ return stolen;
+}
+
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..4faab90 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,12 @@
SCHED_FEAT(SIS_PROP, true)

/*
+ * Steal a CFS task from another CPU when going idle.
+ * Improves CPU utilization.
+ */
+SCHED_FEAT(STEAL, true)
+
+/*
* Issue a WARN when we do multiple update_rq_clock() calls
* in a single rq->lock section. Default disabled because the
* annotations are not complete.
--
1.8.3.1


2018-10-22 15:25:25

by Steven Sistare

[permalink] [raw]
Subject: [PATCH 10/10] sched/fair: Provide idle search schedstats

Add schedstats to measure the effectiveness of searching for idle CPUs
and stealing tasks. This is a temporary patch intended for use during
development only. SCHEDSTAT_VERSION is bumped to 16, and the following
fields are added to the per-CPU statistics of /proc/schedstat:

field 10: # of times select_idle_sibling "easily" found an idle CPU --
prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle cpu.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle
CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is
migratable.
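
As an example of consuming these, a minimal user-space reader might look
as follows; it assumes the version-16 field numbering above and is
illustrative only, not part of the series:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char cpu[32];
		unsigned long long field[17] = { 0 };

		if (strncmp(line, "cpu", 3))	/* per-CPU lines only */
			continue;

		sscanf(line,
		       "%31s %llu %llu %llu %llu %llu %llu %llu %llu"
		       " %llu %llu %llu %llu %llu %llu %llu %llu",
		       cpu, &field[1], &field[2], &field[3], &field[4],
		       &field[5], &field[6], &field[7], &field[8],
		       &field[9], &field[10], &field[11], &field[12],
		       &field[13], &field[14], &field[15], &field[16]);

		printf("%s: easy %llu cpu %llu core %llu none %llu"
		       " find_ns %llu steal %llu steal_fail %llu\n",
		       cpu, field[10], field[11], field[12], field[13],
		       field[14], field[15], field[16]);
	}
	fclose(f);
	return 0;
}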

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/core.c | 30 +++++++++++++++++++++++++++--
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 9 +++++++++
kernel/sched/stats.c | 11 ++++++++++-
kernel/sched/stats.h | 13 +++++++++++++
5 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad97f3b..b61d15d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2214,17 +2214,43 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
DEFINE_STATIC_KEY_FALSE(sched_schedstats);
static bool __initdata __sched_schedstats = false;

+unsigned long schedstat_skid;
+
+static void compute_skid(void)
+{
+ int i, n = 0;
+ s64 t, skid = 0;
+
+ for (i = 0; i < 100; i++) {
+ t = local_clock();
+ t = local_clock() - t;
+ if (t > 0 && t < 1000) { /* only use sane samples */
+ skid += t;
+ n++;
+ }
+ }
+
+ if (n > 0)
+ schedstat_skid = skid / n;
+ else
+ schedstat_skid = 0;
+ pr_info("schedstat_skid = %lu\n", schedstat_skid);
+}
+
static void set_schedstats(bool enabled)
{
- if (enabled)
+ if (enabled) {
+ compute_skid();
static_branch_enable(&sched_schedstats);
- else
+ } else {
static_branch_disable(&sched_schedstats);
+ }
}

void force_schedstat_enabled(void)
{
if (!schedstat_enabled()) {
+ compute_skid();
pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
static_branch_enable(&sched_schedstats);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33d24ee..cdad63b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3739,29 +3739,35 @@ static inline bool steal_enabled(void)
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;

if (!steal_enabled())
return;

+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_clear_elem(rq->cpu, overload_cpus);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}

static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;

if (!steal_enabled())
return;

+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_set_elem(rq->cpu, overload_cpus);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}

static int try_steal(struct rq *this_rq, struct rq_flags *rf);
@@ -6165,6 +6171,16 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}

+#define SET_STAT(STAT) \
+ do { \
+ if (schedstat_enabled()) { \
+ struct rq *rq = this_rq(); \
+ \
+ if (rq) \
+ __schedstat_inc(rq->STAT); \
+ } \
+ } while (0)
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
@@ -6173,14 +6189,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
struct sched_domain *sd;
int i, recent_used_cpu;

- if (available_idle_cpu(target))
+ if (available_idle_cpu(target)) {
+ SET_STAT(found_idle_cpu_easy);
return target;
+ }

/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
- if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
+ if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev)) {
+ SET_STAT(found_idle_cpu_easy);
return prev;
+ }

/* Check a recently used CPU as a potential idle candidate: */
recent_used_cpu = p->recent_used_cpu;
@@ -6193,26 +6213,36 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* Replace recent_used_cpu with prev as it is a potential
* candidate for the next wake:
*/
+ SET_STAT(found_idle_cpu_easy);
p->recent_used_cpu = prev;
return recent_used_cpu;
}

sd = rcu_dereference(per_cpu(sd_llc, target));
- if (!sd)
+ if (!sd) {
+ SET_STAT(nofound_idle_cpu);
return target;
+ }

i = select_idle_core(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_core);
return i;
+ }

i = select_idle_cpu(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_cpu);
return i;
+ }

i = select_idle_smt(p, sd, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_cpu);
return i;
+ }

+ SET_STAT(nofound_idle_cpu);
return target;
}

@@ -6363,6 +6393,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
+ unsigned long time = schedstat_start_time();
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
@@ -6411,6 +6442,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
current->recent_used_cpu = cpu;
}
rcu_read_unlock();
+ schedstat_end_time(cpu_rq(cpu)->find_time, time);

return new_cpu;
}
@@ -6657,6 +6689,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ unsigned long time;

again:
if (!cfs_rq->nr_running)
@@ -6767,6 +6800,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
return p;

idle:
+ time = schedstat_start_time();
+
/*
* We must set idle_stamp _before_ calling try_steal() or
* idle_balance(), such that we measure the duration as idle time.
@@ -6782,6 +6817,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_

+ schedstat_end_time(rq->find_time, time);
+
/*
* Because try_steal() and idle_balance() release (and re-acquire)
* rq->lock, it is possible for any higher priority task to appear.
@@ -9772,6 +9809,7 @@ static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
update_rq_clock(dst_rq);
attach_task(dst_rq, p);
stolen = 1;
+ schedstat_inc(dst_rq->steal);
}
local_irq_restore(rf.flags);

@@ -9790,6 +9828,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
int dst_cpu = dst_rq->cpu;
bool locked = true;
int stolen = 0;
+ bool any_overload = false;
struct sparsemask *overload_cpus;

if (!steal_enabled())
@@ -9829,6 +9868,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen = 1;
break;
}
+ any_overload = true;
}

out:
@@ -9840,6 +9880,8 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen |= (dst_rq->cfs.h_nr_running > 0);
if (dst_rq->nr_running != dst_rq->cfs.h_nr_running)
stolen = -1;
+ if (!stolen && any_overload)
+ schedstat_inc(dst_rq->steal_fail);
return stolen;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f181e9..9f58e17 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -907,6 +907,15 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* Idle search stats */
+ unsigned int found_idle_core;
+ unsigned int found_idle_cpu;
+ unsigned int found_idle_cpu_easy;
+ unsigned int nofound_idle_cpu;
+ unsigned long find_time;
+ unsigned int steal;
+ unsigned int steal_fail;
#endif

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 750fb3c..00b3de5 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -10,7 +10,7 @@
* Bump this up when changing the output format or the meaning of an existing
* format, so that tools can adapt (or abort)
*/
-#define SCHEDSTAT_VERSION 15
+#define SCHEDSTAT_VERSION 16

static int show_schedstat(struct seq_file *seq, void *v)
{
@@ -37,6 +37,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
rq->rq_cpu_time,
rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);

+ seq_printf(seq, " %u %u %u %u %lu %u %u",
+ rq->found_idle_cpu_easy,
+ rq->found_idle_cpu,
+ rq->found_idle_core,
+ rq->nofound_idle_cpu,
+ rq->find_time,
+ rq->steal,
+ rq->steal_fail);
+
seq_printf(seq, "\n");

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8aea199..50c3cf8 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -39,6 +39,17 @@
#define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)
+#define schedstat_start_time() schedstat_val_or_zero(local_clock())
+#define schedstat_end_time(stat, time) \
+ do { \
+ unsigned long endtime; \
+ \
+ if (schedstat_enabled() && (time)) { \
+ endtime = local_clock() - (time) - schedstat_skid; \
+ schedstat_add((stat), endtime); \
+ } \
+ } while (0)
+extern unsigned long schedstat_skid;

#else /* !CONFIG_SCHEDSTATS: */
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
@@ -53,6 +64,8 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
+# define schedstat_start_time() 0
+# define schedstat_end_time(stat, t) do { } while (0)
#endif /* CONFIG_SCHEDSTATS */

#ifdef CONFIG_SCHED_INFO
--
1.8.3.1


2018-10-22 15:26:44

by Steven Sistare

[permalink] [raw]
Subject: [PATCH 06/10] sched/fair: Generalize the detach_task interface

The detach_task function takes a struct lb_env argument, but only needs a
few of its members. Pass the rq and cpu arguments explicitly so the
function may be called from code that is not based on lb_env. No
functional change.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f31045..4acdd8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7168,15 +7168,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
}

/*
- * detach_task() -- detach the task for the migration specified in env
+ * detach_task() -- detach the task for the migration from @src_rq to @dst_cpu.
*/
-static void detach_task(struct task_struct *p, struct lb_env *env)
+static void detach_task(struct task_struct *p, struct rq *src_rq, int dst_cpu)
{
- lockdep_assert_held(&env->src_rq->lock);
+ lockdep_assert_held(&src_rq->lock);

p->on_rq = TASK_ON_RQ_MIGRATING;
- deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
- set_task_cpu(p, env->dst_cpu);
+ deactivate_task(src_rq, p, DEQUEUE_NOCLOCK);
+ set_task_cpu(p, dst_cpu);
}

/*
@@ -7196,7 +7196,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
if (!can_migrate_task(p, env))
continue;

- detach_task(p, env);
+ detach_task(p, env->src_rq, env->dst_cpu);

/*
* Right now, this is only the second place where
@@ -7263,7 +7263,7 @@ static int detach_tasks(struct lb_env *env)
if ((load / 2) > env->imbalance)
goto next;

- detach_task(p, env);
+ detach_task(p, env->src_rq, env->dst_cpu);
list_add(&p->se.group_node, &env->tasks);

detached++;
--
1.8.3.1


2018-10-22 15:30:06

by Steven Sistare

[permalink] [raw]
Subject: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

Define a simpler version of can_migrate_task called can_migrate_task_llc
which does not require a struct lb_env argument, and judges whether a
migration from one CPU to another within the same LLC should be allowed.

Signed-off-by: Steve Sistare <[email protected]>
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4acdd8d..6548bed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7168,6 +7168,34 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
}

/*
+ * Return true if task @p can migrate from @rq to @dst_rq in the same LLC.
+ * No need to test for co-locality, and no need to test task_hot(), as sharing
+ * LLC provides cache warmth at that level.
+ */
+static bool
+can_migrate_task_llc(struct task_struct *p, struct rq *rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+
+ lockdep_assert_held(&rq->lock);
+
+ if (throttled_lb_pair(task_group(p), cpu_of(rq), dst_cpu))
+ return false;
+
+ if (!cpumask_test_cpu(dst_cpu, &p->cpus_allowed)) {
+ schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
+ return false;
+ }
+
+ if (task_running(rq, p)) {
+ schedstat_inc(p->se.statistics.nr_failed_migrations_running);
+ return false;
+ }
+
+ return true;
+}
+
+/*
* detach_task() -- detach the task for the migration from @src_rq to @dst_cpu.
*/
static void detach_task(struct task_struct *p, struct rq *src_rq, int dst_cpu)
--
1.8.3.1


2018-10-22 17:08:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On Mon, Oct 22, 2018 at 07:59:31AM -0700, Steve Sistare wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
>
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle. idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy. Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.

While I don't dislike the idea, I feel it is unfortunate to have two
different mechanisms to do effectively the same thing.

Can't we improve idle_balance() instead of building this parallel
functionality?



2018-10-22 17:09:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/fair: Dynamically update cfs_overload_cpus

On Mon, Oct 22, 2018 at 07:59:35AM -0700, Steve Sistare wrote:
> An overloaded CPU has more than 1 runnable task. When a CFS task wakes
> on a CPU, if h_nr_running transitions from 1 to more, then set the CPU in
> the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_running
> transitions from 2 to less, then clear the CPU in cfs_overload_cpus.

> @@ -5111,8 +5147,12 @@ static inline void hrtick_update(struct rq *rq)
> update_cfs_group(se);
> }
>
> - if (!se)
> + if (!se) {
> add_nr_running(rq, 1);
> + if (prev_nr == 1)
> + overload_set(rq);
> +
> + }

This is very similar to what RT already has, except they also track if
the task in question is in fact migratable (p->nr_cpus_allowed > 1).

And similarly, it might make sense to use the sparse mask thing over
there.
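
For illustration only, overload_set()/overload_clear() could be folded into
a single helper that also requires at least one migratable task, analogous
to RT's rt_nr_migratory accounting. The rq->cfs_nr_migratory counter below
is hypothetical and is not maintained anywhere in this series; it would be
adjusted on enqueue/dequeue and when a task's cpus_allowed changes:

/*
 * Hypothetical sketch only: mark a CPU overloaded when it has more than
 * one runnable CFS task and at least one queued task can run elsewhere.
 * rq->cfs_nr_migratory (count of queued CFS tasks with nr_cpus_allowed > 1)
 * does not exist in this series.
 */
static void overload_update(struct rq *rq)
{
	struct sparsemask *overload_cpus;
	bool overloaded = rq->cfs.h_nr_running > 1 && rq->cfs_nr_migratory;

	rcu_read_lock();
	overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
	if (overload_cpus) {
		if (overloaded)
			sparsemask_set_elem(rq->cpu, overload_cpus);
		else
			sparsemask_clear_elem(rq->cpu, overload_cpus);
	}
	rcu_read_unlock();
}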

2018-10-22 17:50:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

On Mon, Oct 22, 2018 at 07:59:40AM -0700, Steve Sistare wrote:
> The STEAL feature causes regressions on hackbench on larger NUMA systems,
> so disable it on systems with more than sched_steal_node_limit nodes
> (default 2).

How come? From a quick read the stealing is per LLC, where do we steal
across nodes?

2018-10-22 19:26:08

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 04/10] sched/fair: Dynamically update cfs_overload_cpus

On 10/22/2018 12:56 PM, Peter Zijlstra wrote:
> On Mon, Oct 22, 2018 at 07:59:35AM -0700, Steve Sistare wrote:
>> An overloaded CPU has more than 1 runnable task. When a CFS task wakes
>> on a CPU, if h_nr_running transitions from 1 to more, then set the CPU in
>> the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_running
>> transitions from 2 to less, then clear the CPU in cfs_overload_cpus.
>
>> @@ -5111,8 +5147,12 @@ static inline void hrtick_update(struct rq *rq)
>> update_cfs_group(se);
>> }
>>
>> - if (!se)
>> + if (!se) {
>> add_nr_running(rq, 1);
>> + if (prev_nr == 1)
>> + overload_set(rq);
>> +
>> + }
>
> This is very similar to what RT already has, except they also track if
> the task in question is in fact migratable (p->nr_cpus_allowed > 1).
>
> And similarly, it might make sense to use the sparse mask thing over
> there.

Yes. I forgot to mention RT as likely follow on work.

- Steve

2018-10-22 19:51:48

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

On 10/22/2018 1:06 PM, Peter Zijlstra wrote:
> On Mon, Oct 22, 2018 at 07:59:40AM -0700, Steve Sistare wrote:
>> The STEAL feature causes regressions on hackbench on larger NUMA systems,
>> so disable it on systems with more than sched_steal_node_limit nodes
>> (default 2).
>
> How come? From a quick read the stealing is per LLC, where do we steal
> across nodes?

See the complete explanation in this patch. It is deeper than can be gleaned
from a quick read.

- Steve

2018-10-22 20:08:36

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/22/2018 1:04 PM, Peter Zijlstra wrote:
> On Mon, Oct 22, 2018 at 07:59:31AM -0700, Steve Sistare wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle. idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy. Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>
> While I don't dislike the idea, I feel it is unfortunate to have two
> different mechanisms to do effectively the same thing.
>
> Can't we improve idle_balance() instead of building this parallel
> functionality?

We could delete idle_balance() and use stealing exclusively for handling
new idle. For each sd level, stealing would look for an overloaded CPU
in the overloaded bitmap(s) that overlap that level. I played with that
a little but it is not ready for prime time, and I did not want to hold
the patch series for it. Also, I would like folks to get some production
experience with stealing on a variety of architectures before considering
a radical step like replacing idle_balance().
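
A rough sketch of that direction, for illustration only: walk the domains
of the newly idle CPU and steal from the first overloaded CPU found at each
level. The sd_overload_cpus() accessor is hypothetical; only the per-LLC
cfs_overload_cpus mask exists in this series.

static int try_steal_hierarchy(struct rq *dst_rq, struct rq_flags *dst_rf)
{
	struct sched_domain *sd;
	struct sparsemask *mask;
	bool locked = true;
	int src_cpu, stolen = 0;

	rcu_read_lock();
	for_each_domain(dst_rq->cpu, sd) {
		mask = sd_overload_cpus(sd);	/* hypothetical per-level mask */
		if (!mask)
			continue;
		/* Steal from the first overloaded CPU spanned by this level. */
		for_each_sparse_wrap(src_cpu, mask, dst_rq->cpu) {
			if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
				stolen = 1;
				goto out;
			}
		}
	}
out:
	rcu_read_unlock();
	if (!locked) {
		/* Re-acquire and re-pin dst_rq->lock, as try_steal() does. */
		raw_spin_lock(&dst_rq->lock);
		rq_repin_lock(dst_rq, dst_rf);
	}
	return stolen;
}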

We could merge the stealing code into the idle_balance() code to get a
union of the two, but IMO that would be less readable.

We could remove the core and socket levels from idle_balance() and let
stealing handle those levels. I think that makes sense after stealing
performance is validated on more architectures, but we would still have
two different mechanisms.

- Steve

2018-10-22 20:11:02

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

On 10/22/2018 2:47 PM, Steven Sistare wrote:
> On 10/22/2018 1:06 PM, Peter Zijlstra wrote:
>> On Mon, Oct 22, 2018 at 07:59:40AM -0700, Steve Sistare wrote:
>>> The STEAL feature causes regressions on hackbench on larger NUMA systems,
>>> so disable it on systems with more than sched_steal_node_limit nodes
>>> (default 2).
>>
>> How come? From a quick read the stealing is per LLC, where do we steal
>> across nodes?
>
> See the complete explanation in this patch. It is deeper than can be gleaned
> from a quick read.

I should have said a bit more. Your quick take on stealing is correct, we do
not steal across nodes. However, stealing reduces average run queue length which
influences wake_affine migrations. Now see the complete explanation.

- Steve



2018-10-22 23:39:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On Mon, Oct 22, 2018 at 03:07:10PM -0400, Steven Sistare wrote:
> On 10/22/2018 1:04 PM, Peter Zijlstra wrote:
> > On Mon, Oct 22, 2018 at 07:59:31AM -0700, Steve Sistare wrote:
> >> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> >> find a task, then attempt to steal a task from an overloaded CPU in the
> >> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> >> identify candidates. To minimize search time, steal the first migratable
> >> task that is found when the bitmap is traversed. For fairness, search
> >> for migratable tasks on an overloaded CPU in order of next to run.
> >>
> >> This simple stealing yields a higher CPU utilization than idle_balance()
> >> alone, because the search is cheap, so it may be called every time the CPU
> >> is about to go idle. idle_balance() does more work because it searches
> >> widely for the busiest queue, so to limit its CPU consumption, it declines
> >> to search if the system is too busy. Simple stealing does not offload the
> >> globally busiest queue, but it is much better than running nothing at all.
> >
> > While I don't dislike the idea, I feel it is unfortunate to have two
> > different mechanisms to do effectively the same thing.
> >
> > Can't we improve idle_balance() instead of building this parallel
> > functionality?
>
> We could delete idle_balance() and use stealing exclusively for handling
> new idle. For each sd level, stealing would look for an overloaded CPU
> in the overloaded bitmap(s) that overlap that level. I played with that
> a little but it is not ready for prime time, and I did not want to hold
> the patch series for it. Also, I would like folks to get some production
> experience with stealing on a variety of architectures before considering
> a radical step like replacing idle_balance().

Fair enough. And yes, it might make sense to fully replace the current
newidle balance with something along these lines.

> We could remove the core and socket levels from idle_balance() and let
> stealing handle those levels. I think that makes sense after stealing
> performance is validated on more architectures, but we would still have
> two different mechanisms.

Yes, this would be a fairly simple change and make sense until we have a
full replacement.

> We could merge the stealing code into the idle_balance() code to get a
> union of the two, but IMO that would be less readable.

Agreed; I don't think that'll be pretty.

2018-10-22 23:39:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

On Mon, Oct 22, 2018 at 03:21:20PM -0400, Steven Sistare wrote:
> On 10/22/2018 2:47 PM, Steven Sistare wrote:
> > On 10/22/2018 1:06 PM, Peter Zijlstra wrote:
> >> On Mon, Oct 22, 2018 at 07:59:40AM -0700, Steve Sistare wrote:
> >>> The STEAL feature causes regressions on hackbench on larger NUMA systems,
> >>> so disable it on systems with more than sched_steal_node_limit nodes
> >>> (default 2).
> >>
> >> How come? From a quick read the stealing is per LLC, where do we steal
> >> across nodes?
> >
> > See the complete explanation in this patch. It is deeper than can be gleaned
> > from a quick read.
>
> I should have said a bit more. Your quick take on stealing is correct, we do
> not steal across nodes. However, stealing reduces average run queue length which
> influences wake_affine migrations. Now see the complete explanation.

Right; read a bit more just now.

hackbench is a fairly poor benchmark for numa performance. One that
comes to mind is multi-warehouse specjbb stuff (assuming you have numa
balance enabled of course).



2018-10-23 13:22:12

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 09/10] sched/fair: disable stealing if too many NUMA nodes

On 10/22/2018 10:59 AM, Steve Sistare wrote:
[...]
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index f18c416..b7d2070 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1337,6 +1337,30 @@ static void init_numa_topology_type(void)
> }
> }
>
> +DEFINE_STATIC_KEY_TRUE(sched_steal_allow);
> +static int sched_steal_node_limit;
> +#define SCHED_STEAL_NODE_LIMIT_DEFAULT 1

This is an oversight; I meant to suggest a value of 2. All my 2-socket
testing was done with stealing enabled.

- Steve

2018-10-24 15:36:18

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

Hi,

On 22/10/2018 20:07, Steven Sistare wrote:
> On 10/22/2018 1:04 PM, Peter Zijlstra wrote:
[...]
>
> We could delete idle_balance() and use stealing exclusively for handling
> new idle. For each sd level, stealing would look for an overloaded CPU
> in the overloaded bitmap(s) that overlap that level. I played with that
> a little but it is not ready for prime time, and I did not want to hold
> the patch series for it. Also, I would like folks to get some production
> experience with stealing on a variety of architectures before considering
> a radical step like replacing idle_balance().
>

I think this could work fine for standard symmetrical systems, but I have
some concerns for asymmetric systems (Arm big.LITTLE & co). One thing that
should show up in 4.20-rc1 is the misfit logic, which caters to those
asymmetric systems.

If you look at 757ffdd705ee ("sched/fair: Set rq->rd->overload when
misfit") on Linus' tree, we can set rq->rd->overload even if
(rq->nr_running == 1). This is because we do want to do an idle_balance()
when we have misfit tasks, which should lead to active balancing one of
those CPU-hungry tasks to move it to a more powerful CPU.

With a pure try_steal() approach, we won't do any active balancing - we
could steal some task from a cfs_overload_cpu but that's not what the
load balancer would have done. The load balancer would only do such a thing
if the imbalance type is group_overloaded, which means:

sum_nr_running > group_weight &&
group_util * sd->imbalance_pct > group_capacity * 100

(IOW the number of tasks running on the CPU is not the sole deciding
factor)

Otherwise, misfit tasks (group_misfit_task imbalance type) would have
priority.

Perhaps we could decorate the cfs_overload_cpus with some more information
(e.g. misfit task presence), but then we'd have to add some logic to decide
when to steal what.




We'd also lose the NOHZ update done in idle_balance(), though I think it's
not such a big deal - we were piggy-backing this on idle_balance() just
because it happened to be convenient, and we still have NOHZ_STATS_KICK
anyway.




Another thing - in your test cases, what is the most prevalent cause of
failure to pull a task in idle_balance()? Is it the load_balance() itself
that fails to find a task (e.g. because the imbalance is not deemed big
enough), or is it the idle migration cost logic that prevents
load_balance() from running to completion?

In the first case, try_steal() makes perfect sense to me. In the second
case, I'm not sure if we really want to pull something if we know (well,
we *think*) we're about to resume the execution of some other task.

> We could merge the stealing code into the idle_balance() code to get a
> union of the two, but IMO that would be less readable.
>
> We could remove the core and socket levels from idle_balance()

I understand that as only doing load_balance() at DIE level in
idle_balance(), as that is what makes most sense to me (with big.LITTLE
those misfit migrations are done at DIE level), is that correct?

Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
level, which could cause issues there.

> and let
> stealing handle those levels. I think that makes sense after stealing
> performance is validated on more architectures, but we would still have
> two different mechanisms.
>
> - Steve
>

I'll try out those patches on top of the misfit series to see how the
whole thing behaves.

2018-10-24 19:29:22

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/24/2018 11:34 AM, Valentin Schneider wrote:
> Hi,
>
> On 22/10/2018 20:07, Steven Sistare wrote:
>> On 10/22/2018 1:04 PM, Peter Zijlstra wrote:
> [...]
>>
>> We could delete idle_balance() and use stealing exclusively for handling
>> new idle. For each sd level, stealing would look for an overloaded CPU
>> in the overloaded bitmap(s) that overlap that level. I played with that
>> a little but it is not ready for prime time, and I did not want to hold
>> the patch series for it. Also, I would like folks to get some production
>> experience with stealing on a variety of architectures before considering
>> a radical step like replacing idle_balance().
>
> I think this could work fine for standard symmetrical systems, but I have
> some concerns for asymmetric systems (Arm big.LITTLE & co). One thing that
> should show up in 4.20-rc1 is the misfit logic, which caters to those
> asymmetric systems.
>
> If you look at 757ffdd705ee ("sched/fair: Set rq->rd->overload when
> misfit") on Linus' tree, we can set rq->rd->overload even if
> (rq->nr_running == 1). This is because we do want to do an idle_balance()
> when we have misfit tasks, which should lead to active balancing one of
> those CPU-hungry tasks to move it to a more powerful CPU.
>
> With a pure try_steal() approach, we won't do any active balancing - we
> could steal some task from a cfs_overload_cpu but that's not what the
> load balancer would have done. The load balancer would only do such a thing
> if the imbalance type is group_overloaded, which means:
>
> sum_nr_running > group_weight &&
> group_util * sd->imbalance_pct > group_capacity * 100
>
> (IOW the number of tasks running on the CPU is not the sole deciding
> factor)
>
> Otherwise, misfit tasks (group_misfit_task imbalance type) would have
> priority.
>
> Perhaps we could decorate the cfs_overload_cpus with some more information
> (e.g. misfit task presence), but then we'd have to add some logic to decide
> when to steal what.

Hi Valentin,

Asymmetric systems could maintain a separate bitmap for misfits; set a bit
when a misfit task goes on CPU, clear it when it goes off. When a fast CPU goes new idle,
it would first search the misfits mask, then search cfs_overload_cpus.
The misfits logic would be conditionalized with CONFIG or sched feat static
branches so symmetric systems do not incur extra overhead.
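
A sketch of how try_steal() might consult such a mask first, inside its
existing rcu_read_lock() section and before the SMT/LLC searches. The
cfs_misfit_cpus mask is hypothetical, and the sched_asym_cpucapacity static
key comes from the 4.20 misfit work, not from this series:

	if (static_branch_unlikely(&sched_asym_cpucapacity)) {
		struct sparsemask *misfit_cpus;

		/* Prefer CPUs whose current task does not fit their capacity. */
		misfit_cpus = rcu_dereference(dst_rq->cfs_misfit_cpus);
		if (misfit_cpus) {
			for_each_sparse_wrap(src_cpu, misfit_cpus, dst_cpu) {
				if (steal_from(dst_rq, dst_rf, &locked,
					       src_cpu)) {
					stolen = 1;
					goto out;
				}
			}
		}
	}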

> We'd also lose the NOHZ update done in idle_balance(), though I think it's
> not such a big deal - we were piggy-backing this on idle_balance() just
> because it happened to be convenient, and we still have NOHZ_STATS_KICK
> anyway.

Agreed.

> Another thing - in your test cases, what is the most prevalent cause of
> failure to pull a task in idle_balance()? Is it the load_balance() itself
> that fails to find a task (e.g. because the imbalance is not deemed big
> enough), or is it the idle migration cost logic that prevents
> load_balance() from running to completion?

The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
so idle_balance bails at the top:
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
...
goto out

For other tests, we get past that clause but bail from a domain:
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
...
break;

> In the first case, try_steal() makes perfect sense to me. In the second
> case, I'm not sure if we really want to pull something if we know (well,
> we *think*) we're about to resume the execution of some other task.

355.566 microsec is enough time to steal, go on CPU, do useful work, and go
off CPU, particularly for chatty workloads like hackbench. The performance
data bear this out. For the higher loads, the average timeslice for
hackbench

Perhaps I could skip try_steal() if avg_idle is very small, although with
hackbench I have seen average time slice as small as 10 microsec under
high load and preemptions. I'll run some experiments.
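
For illustration, such a guard in the idle: path of pick_next_task_fair()
could look like the following; sched_steal_min_idle_ns is a placeholder
threshold, not a tunable that exists in this series:

	new_tasks = idle_balance(rq, rf);

	/* Hypothetical: only steal if the CPU expects to stay idle a while. */
	if (new_tasks == 0 && rq->avg_idle > sched_steal_min_idle_ns)
		new_tasks = try_steal(rq, rf);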

>> We could merge the stealing code into the idle_balance() code to get a
>> union of the two, but IMO that would be less readable.
>>
>> We could remove the core and socket levels from idle_balance()
>
> I understand that as only doing load_balance() at DIE level in
> idle_balance(), as that is what makes most sense to me (with big.LITTLE
> those misfit migrations are done at DIE level), is that correct?

Correct.
> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
> level, which could cause issues there.

We could keep idle_balance for this level and fall back to stealing as in
my patch, or you could extend the misfits bitmap to also include CPUs
with reduced memory bandwidth and active tasks. (if I understand the asymmetry
correctly).

>> and let
>> stealing handle those levels. I think that makes sense after stealing
>> performance is validated on more architectures, but we would still have
>> two different mechanisms.
>>
>> - Steve
>
> I'll try out those patches on top of the misfit series to see how the
> whole thing behaves.

Very good, thanks.

- Steve

2018-10-25 07:52:04

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

Hi Steve,

On Mon, 22 Oct 2018 at 17:10, Steve Sistare <[email protected]> wrote:
>
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
>
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle. idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy. Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
>
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
>
> Patch 1 defines the sparsemask type and its operations.
>
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>
> Patches 5 and 6 refactor existing code for a cleaner merge of later
> patches.
>
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per-se. See the patch description for details.
>
> Patch 10 adds schedstats for comparing the new behavior to the old, and
> provided as a convenience for developers only, not for integration.
>
> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
> bandwidth control were tested.
>
> Stealing imprroves utilization with only a modest CPU overhead in scheduler
> code. In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
>
> %find - percent of time spent in old and new functions that search for
> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>
> steal - number of times a task is stolen from another CPU.
>
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000

Why do you mention this sched_wakeup_granularity_ns value?
Is it something that you changed for your tests?
The comment for this tunable says that the default value is 1ms *
ilog(ncpus) = 4ms for 20 CPUs.

>
> baseline
> grps time %busy slice sched idle wake %find steal
> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
>
> new
> grps time %busy slice sched idle wake %find steal %speedup
> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
>
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.

>

2018-10-25 11:30:12

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/25/2018 3:50 AM, Vincent Guittot wrote:
> Hi Steve,
>
> On Mon, 22 Oct 2018 at 17:10, Steve Sistare <[email protected]> wrote:
>>
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle. idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy. Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>>
>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>> reduce cache contention vs the usual bitmap when many threads concurrently
>> set, clear, and visit elements.
>>
>> Patch 1 defines the sparsemask type and its operations.
>>
>> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>>
>> Patches 5 and 6 refactor existing code for a cleaner merge of later
>> patches.
>>
>> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>>
>> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
>> time being because of performance regressions that are not due to stealing
>> per-se. See the patch description for details.
>>
>> Patch 10 adds schedstats for comparing the new behavior to the old, and
>> provided as a convenience for developers only, not for integration.
>>
>> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
>> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
>> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
>> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
>> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
>> bandwidth control were tested.
>>
>> Stealing imprroves utilization with only a modest CPU overhead in scheduler
>> code. In the following experiment, hackbench is run with varying numbers
>> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
>> for each run, averaged per CPU, augmented with these non-standard stats:
>>
>> %find - percent of time spent in old and new functions that search for
>> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>>
>> steal - number of times a task is stolen from another CPU.
>>
>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> hackbench <grps> process 100000
>> sched_wakeup_granularity_ns=15000000
>
> Why do you mention this sched_wakeup_granularity_ns value?
> Is it something that you changed for your tests?
> The comment for this tunable says that the default value is 1ms *
> ilog(ncpus) = 4ms for 20 CPUs.

I changed it for the test, and I explain why a few paragraphs later.
The value matches the one set by tuned.service, for those running it.

- Steve

>
>>
>> baseline
>> grps time %busy slice sched idle wake %find steal
>> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
>> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
>> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
>> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
>> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
>>
>> new
>> grps time %busy slice sched idle wake %find steal %speedup
>> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
>> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
>> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
>> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
>> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
>>
>> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
>> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
>> The cost is at most 0.4% more find time.
>
>>

2018-10-25 11:31:55

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization


On 24/10/2018 20:27, Steven Sistare wrote:
[...]
> Hi Valentin,
>
> Asymmetric systems could maintain a separate bitmap for misfits; set a bit
> when a misfit task goes on CPU, clear it when it goes off. When a fast CPU goes new idle,
> it would first search the misfits mask, then search cfs_overload_cpus.
> The misfits logic would be conditionalized with CONFIG or sched feat static
> branches so symmetric systems do not incur extra overhead.
>

That sounds reasonable - besides, misfit already introduces a
sched_asym_cpucapacity static key. I'll try to play around with that.

>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>> not such a big deal - we were piggy-backing this on idle_balance() just
>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>> anyway.
>
> Agreed.
>
>> Another thing - in your test cases, what is the most prevalent cause of
>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>> that fails to find a task (e.g. because the imbalance is not deemed big
>> enough), or is it the idle migration cost logic that prevents
>> load_balance() from running to completion?
>
> The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
> so idle_balance bails at the top:
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> ...
> goto out
>
> For other tests, we get past that clause but bail from a domain:
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> ...
> break;
>
>> In the first case, try_steal() makes perfect sense to me. In the second
>> case, I'm not sure if we really want to pull something if we know (well,
>> we *think*) we're about to resume the execution of some other task.
>
> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go
> off CPU, particularly for chatty workloads like hackbench. The performance
> data bear this out. For the higher loads, the average timeslice for
> hackbench
>

Thanks for the explanation. AIUI the big difference here is that try_steal()
is considerably cheaper than load_balance(), so the rq->avg_idle concerns
matter less (or at least, on a considerably smaller scale).

> Perhaps I could skip try_steal() if avg_idle is very small, although with
> hackbench I have seen average time slice as small as 10 microsec under
> high load and preemptions. I'll run some experiments.
>

That might be a safe thing to do. In the same department, maybe we could
skip try_steal() if we bail out of idle_balance() because
!(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
are decoupled, they should express the same thing here.

>>> We could merge the stealing code into the idle_balance() code to get a
>>> union of the two, but IMO that would be less readable.
>>>
>>> We could remove the core and socket levels from idle_balance()
>>
>> I understand that as only doing load_balance() at DIE level in
>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>> those misfit migrations are done at DIE level), is that correct?
>
> Correct.
>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>> level, which could cause issues there.
>
> We could keep idle_balance for this level and fall back to stealing as in
> my patch, or you could extend the misfits bitmap to also include CPUs
> with reduced memory bandwidth and active tasks. (if I understand the asymmetry
> correctly).
>

It's mostly µarch asymmetry, so by "asymmetry at MC level" I meant "we'll
see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
in some way or another), that could work.

>>> and let
>>> stealing handle those levels. I think that makes sense after stealing
>>> performance is validated on more architectures, but we would still have
>>> two different mechanisms.
>>>
>>> - Steve
>>
>> I'll try out those patches on top of the misfit series to see how the
>> whole thing behaves.
>
> Very good, thanks.
>
> - Steve
>

2018-10-25 12:23:23

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/25/2018 7:31 AM, Valentin Schneider wrote:
>
> On 24/10/2018 20:27, Steven Sistare wrote:
> [...]
>> Hi Valentin,
>>
>> Asymmetric systems could maintain a separate bitmap for misfits; set a bit
>> when a misfit task goes on CPU, clear it when it goes off. When a fast CPU goes new idle,
>> it would first search the misfits mask, then search cfs_overload_cpus.
>> The misfits logic would be conditionalized with CONFIG or sched feat static
>> branches so symmetric systems do not incur extra overhead.
>
> That sounds reasonable - besides, misfit already introduces a
> sched_asym_cpucapacity static key. I'll try to play around with that.
>
>>> We'd also lose the NOHZ update done in idle_balance(), though I think it's
>>> not such a big deal - we were piggy-backing this on idle_balance() just
>>> because it happened to be convenient, and we still have NOHZ_STATS_KICK
>>> anyway.
>>
>> Agreed.
>>
>>> Another thing - in your test cases, what is the most prevalent cause of
>>> failure to pull a task in idle_balance()? Is it the load_balance() itself
>>> that fails to find a task (e.g. because the imbalance is not deemed big
>>> enough), or is it the idle migration cost logic that prevents
>>> load_balance() from running to completion?
>>
>> The latter. Eg, for the test "X6-2, 40 CPUs, hackbench 3 process 50000",
>> CPU avg_idle is 355566 nsec, and sched_migration_cost_ns = 500000,
>> so idle_balance bails at the top:
>> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> ...
>> goto out
>>
>> For other tests, we get past that clause but bail from a domain:
>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> ...
>> break;
>>
>>> In the first case, try_steal() makes perfect sense to me. In the second
>>> case, I'm not sure if we really want to pull something if we know (well,
>>> we *think*) we're about to resume the execution of some other task.
>>
>> 355.566 microsec is enough time to steal, go on CPU, do useful work, and go
>> off CPU, particularly for chatty workloads like hackbench. The performance
>> data bear this out. For the higher loads, the average timeslice for
>> hackbench
>>
>
> Thanks for the explanation. AIUI the big difference here is that try_steal()
> is considerably cheaper than load_balance(), so the rq->avg_idle concerns
> matter less (or at least, on a considerably smaller scale).

Right.

>> Perhaps I could skip try_steal() if avg_idle is very small, although with
>> hackbench I have seen average time slice as small as 10 microsec under
>> high load and preemptions. I'll run some experiments.
>
> That might be a safe thing to do. In the same department, maybe we could
> skip try_steal() if we bail out of idle_balance() because
> !(this_rq->rd->overload). Although rq->rd->overload and cfs_overload_cpus
> are decoupled, they should express the same thing here.

I tried that in an earlier version of my code:

new_tasks = idle_balance(rq, rf);
if (new_tasks == 0 && rq->rd->overload)
new_tasks = try_steal(rq, rf);

but I did not see any performance improvement vs without the overload check,
so I omitted it for simplicity.

- Steve

>>>> We could merge the stealing code into the idle_balance() code to get a
>>>> union of the two, but IMO that would be less readable.
>>>>
>>>> We could remove the core and socket levels from idle_balance()
>>>
>>> I understand that as only doing load_balance() at DIE level in
>>> idle_balance(), as that is what makes most sense to me (with big.LITTLE
>>> those misfit migrations are done at DIE level), is that correct?
>>
>> Correct.
>>> Also, with DynamIQ (next gen big.LITTLE) we could have asymmetry at MC
>>> level, which could cause issues there.
>>
>> We could keep idle_balance for this level and fall back to stealing as in
>> my patch, or you could extend the misfits bitmap to also include CPUs
>> with reduced memory bandwidth and active tasks. (if I understand the asymmetry
>> correctly).
>>
>
> It's mostly µarch asymmetry, so by "asymmetry at MC level" I meant "we'll
> see the SD_ASYM_CPUCAPACITY flag at MC level". But if we tweak stealing
> to take misfit tasks into account (so we'd rely on SD_ASYM_CPUCAPACITY
> in some way or another), that could work.
>
>>>> and let
>>>> stealing handle those levels. I think that makes sense after stealing
>>>> performance is validated on more architectures, but we would still have
>>>> two different mechanisms.
>>>>
>>>> - Steve
>>>
>>> I'll try out those patches on top of the misfit series to see how the
>>> whole thing behaves.
>>
>> Very good, thanks.
>>
>> - Steve
>>

2018-10-25 12:46:28

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On Thu, 25 Oct 2018 at 13:29, Steven Sistare <[email protected]> wrote:
>
> On 10/25/2018 3:50 AM, Vincent Guittot wrote:
> > Hi Steve,
> >
> > On Mon, 22 Oct 2018 at 17:10, Steve Sistare <[email protected]> wrote:
> >>
> >> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> >> find a task, then attempt to steal a task from an overloaded CPU in the
> >> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> >> identify candidates. To minimize search time, steal the first migratable
> >> task that is found when the bitmap is traversed. For fairness, search
> >> for migratable tasks on an overloaded CPU in order of next to run.
> >>
> >> This simple stealing yields a higher CPU utilization than idle_balance()
> >> alone, because the search is cheap, so it may be called every time the CPU
> >> is about to go idle. idle_balance() does more work because it searches
> >> widely for the busiest queue, so to limit its CPU consumption, it declines
> >> to search if the system is too busy. Simple stealing does not offload the
> >> globally busiest queue, but it is much better than running nothing at all.
> >>
> >> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> >> reduce cache contention vs the usual bitmap when many threads concurrently
> >> set, clear, and visit elements.
> >>
> >> Patch 1 defines the sparsemask type and its operations.
> >>
> >> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
> >>
> >> Patches 5 and 6 refactor existing code for a cleaner merge of later
> >> patches.
> >>
> >> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
> >>
> >> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> >> time being because of performance regressions that are not due to stealing
> >> per-se. See the patch description for details.
> >>
> >> Patch 10 adds schedstats for comparing the new behavior to the old, and
> >> provided as a convenience for developers only, not for integration.
> >>
> >> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
> >> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> >> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
> >> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> >> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
> >> bandwidth control were tested.
> >>
> >> Stealing imprroves utilization with only a modest CPU overhead in scheduler
> >> code. In the following experiment, hackbench is run with varying numbers
> >> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> >> for each run, averaged per CPU, augmented with these non-standard stats:
> >>
> >> %find - percent of time spent in old and new functions that search for
> >> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
> >>
> >> steal - number of times a task is stolen from another CPU.
> >>
> >> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> >> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> >> hackbench <grps> process 100000
> >> sched_wakeup_granularity_ns=15000000
> >
> > Why do you mention this sched_wakeup_granularity_ns value?
> > Is it something that you changed for your tests?
> > The comment for this tunable says that the default value is 1ms *
> > ilog(ncpus) = 4ms for 20 CPUs.
>
> I changed it for the test, and I explain why a few paragraphs later.
> The value matches the one set by tuned.service, for those running it.

OK. I hadn't noticed that later explanation.

You said " Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec. Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new."

What do you mean exactly by "distort"?

>
> - Steve
>
> >
> >>
> >> baseline
> >> grps time %busy slice sched idle wake %find steal
> >> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
> >> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
> >> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
> >> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
> >> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
> >>
> >> new
> >> grps time %busy slice sched idle wake %find steal %speedup
> >> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
> >> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
> >> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
> >> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
> >> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
> >>
> >> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> >> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> >> The cost is at most 0.4% more find time.
> >
> >>

2018-10-25 13:49:11

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle

Hi Steve,

On 22/10/2018 15:59, Steve Sistare wrote:
[...]
> @@ -9683,6 +9698,141 @@ void trigger_load_balance(struct rq *rq)
> nohz_balancer_kick(rq);
> }
>
> +/*
> + * Search the runnable tasks in @cfs_rq in order of next to run, and find
> + * the first one that can be migrated to @dst_rq. @cfs_rq is locked on entry.
> + * On success, dequeue the task from @cfs_rq and return it, else return NULL.
> + */
> +static struct task_struct *
> +detach_next_task(struct cfs_rq *cfs_rq, struct rq *dst_rq)
> +{
> + int dst_cpu = dst_rq->cpu;
> + struct task_struct *p;
> + struct rq *rq = rq_of(cfs_rq);
> +
> + lockdep_assert_held(&rq_of(cfs_rq)->lock);
> +
> + list_for_each_entry_reverse(p, &rq->cfs_tasks, se.group_node) {
> + if (can_migrate_task_llc(p, rq, dst_rq)) {
> + detach_task(p, rq, dst_cpu);
> + return p;
> + }
> + }
> + return NULL;
> +}
> +
> +/*
> + * Attempt to migrate a CFS task from @src_cpu to @dst_rq. @locked indicates
> + * whether @dst_rq is already locked on entry. This function may lock or
> + * unlock @dst_rq, and updates @locked to indicate the locked state on return.
> + * The locking protocol is based on idle_balance().
> + * Returns 1 on success and 0 on failure.
> + */
> +static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
> + int src_cpu)
> +{
> + struct task_struct *p;
> + struct rq_flags rf;
> + int stolen = 0;
> + int dst_cpu = dst_rq->cpu;
> + struct rq *src_rq = cpu_rq(src_cpu);
> +
> + if (dst_cpu == src_cpu || src_rq->cfs.h_nr_running < 2)
> + return 0;
> +
> + if (*locked) {
> + rq_unpin_lock(dst_rq, dst_rf);
> + raw_spin_unlock(&dst_rq->lock);
> + *locked = false;
> + }
> + rq_lock_irqsave(src_rq, &rf);
> + update_rq_clock(src_rq);
> +
> + if (src_rq->cfs.h_nr_running < 2 || !cpu_active(src_cpu))
> + p = NULL;
> + else
> + p = detach_next_task(&src_rq->cfs, dst_rq);
> +
> + rq_unlock(src_rq, &rf);
> +
> + if (p) {
> + raw_spin_lock(&dst_rq->lock);
> + rq_repin_lock(dst_rq, dst_rf);
> + *locked = true;
> + update_rq_clock(dst_rq);
> + attach_task(dst_rq, p);
> + stolen = 1;
> + }
> + local_irq_restore(rf.flags);
> +
> + return stolen;
> +}
> +
> +/*
> + * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
> + * and migrate it to @dst_rq. rq_lock is held on entry and return, but
> + * may be dropped in between. Return 1 on success, 0 on failure, and -1
> + * if a task in a different scheduling class has become runnable on @dst_rq.
> + */
> +static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
> +{
> + int src_cpu;
> + int dst_cpu = dst_rq->cpu;
> + bool locked = true;
> + int stolen = 0;
> + struct sparsemask *overload_cpus;
> +
> + if (!sched_feat(STEAL))
> + return 0;
> +
> + if (!cpu_active(dst_cpu))
> + return 0;
> +
> + /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
> +
> + rcu_read_lock();
> + overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
> + if (!overload_cpus) {
> + rcu_read_unlock();
> + return 0;
> + }
> +
> +#ifdef CONFIG_SCHED_SMT
> + /*
> + * First try overloaded CPUs on the same core to preserve cache warmth.
> + */
> + if (static_branch_likely(&sched_smt_present)) {
> + for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
> + if (sparsemask_test_elem(src_cpu, overload_cpus) &&
> + steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
> + stolen = 1;
> + goto out;
> + }
> + }
> + }
> +#endif /* CONFIG_SCHED_SMT */
> +
> + /* Accept any suitable task in the LLC */
> +
> + for_each_sparse_wrap(src_cpu, overload_cpus, dst_cpu) {
> + if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
> + stolen = 1;
> + break;
^^^^^^
You might want to have a 'goto out' there for consistency and to make GCC
happy for !CONFIG_SCHED_SMT (I get a "warning: label ‘out’ defined but not
used")

[...]

2018-10-25 13:49:51

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 05/10] sched/fair: Hoist idle_stamp up from idle_balance

Hi Steve,

On 22/10/2018 15:59, Steve Sistare wrote:
[...]
> @@ -6740,8 +6744,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> return p;
>
> idle:
> + /*
> + * We must set idle_stamp _before_ calling idle_balance(), such that we
> + * measure the duration of idle_balance() as idle time.
> + */
> + IF_SMP(rq->idle_stamp = rq_clock(rq);)
> +
> new_tasks = idle_balance(rq, rf);
>
> + if (new_tasks)
> + IF_SMP(rq->idle_stamp = 0;)
> +
> + schedstat_end_time(rq->find_time, time);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
That's a stray hunk from 10/10

[...]

2018-10-25 14:07:00

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 05/10] sched/fair: Hoist idle_stamp up from idle_balance

On 10/25/2018 9:47 AM, Valentin Schneider wrote:
> Hi Steve,
>
> On 22/10/2018 15:59, Steve Sistare wrote:
> [...]
>> @@ -6740,8 +6744,19 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>> return p;
>>
>> idle:
>> + /*
>> + * We must set idle_stamp _before_ calling idle_balance(), such that we
>> + * measure the duration of idle_balance() as idle time.
>> + */
>> + IF_SMP(rq->idle_stamp = rq_clock(rq);)
>> +
>> new_tasks = idle_balance(rq, rf);
>>
>> + if (new_tasks)
>> + IF_SMP(rq->idle_stamp = 0;)
>> +
>> + schedstat_end_time(rq->find_time, time);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> That's a stray hunk from 10/10

Thanks Valentin, will fix in next version. I reordered the patches and missed this.

- Steve

2018-10-25 14:10:33

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle

On 10/25/2018 9:48 AM, Valentin Schneider wrote:
> Hi Steve,
>
> On 22/10/2018 15:59, Steve Sistare wrote:
> [...]
>> +/*
>> + * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
>> + * and migrate it to @dst_rq. rq_lock is held on entry and return, but
>> + * may be dropped in between. Return 1 on success, 0 on failure, and -1
>> + * if a task in a different scheduling class has become runnable on @dst_rq.
>> + */
>> +static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
>> +{
>> + int src_cpu;
>> + int dst_cpu = dst_rq->cpu;
>> + bool locked = true;
>> + int stolen = 0;
>> + struct sparsemask *overload_cpus;
>> +
>> + if (!sched_feat(STEAL))
>> + return 0;
>> +
>> + if (!cpu_active(dst_cpu))
>> + return 0;
>> +
>> + /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
>> +
>> + rcu_read_lock();
>> + overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
>> + if (!overload_cpus) {
>> + rcu_read_unlock();
>> + return 0;
>> + }
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> + /*
>> + * First try overloaded CPUs on the same core to preserve cache warmth.
>> + */
>> + if (static_branch_likely(&sched_smt_present)) {
>> + for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
>> + if (sparsemask_test_elem(src_cpu, overload_cpus) &&
>> + steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
>> + stolen = 1;
>> + goto out;
>> + }
>> + }
>> + }
>> +#endif /* CONFIG_SCHED_SMT */
>> +
>> [...]
>> +
>> + for_each_sparse_wrap(src_cpu, overload_cpus, dst_cpu) {
>> + if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
>> + stolen = 1;
>> + break;
> ^^^^^^
> You might want to have a 'goto out' there for consistency and to make GCC
> happy for !CONFIG_SCHED_SMT (I get a "warning: label ‘out’ defined but not
> used")

Absolutely, will fix in next version. Thanks for the careful review.

- Steve

2018-10-25 14:24:02

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/25/2018 8:43 AM, Vincent Guittot wrote:
> On Thu, 25 Oct 2018 at 13:29, Steven Sistare <[email protected]> wrote:
>>
>> On 10/25/2018 3:50 AM, Vincent Guittot wrote:
>>> Hi Steve,
>>>
>>> On Mon, 22 Oct 2018 at 17:10, Steve Sistare <[email protected]> wrote:
>>>>
>>>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>>>> find a task, then attempt to steal a task from an overloaded CPU in the
>>>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>>>> identify candidates. To minimize search time, steal the first migratable
>>>> task that is found when the bitmap is traversed. For fairness, search
>>>> for migratable tasks on an overloaded CPU in order of next to run.
>>>>
>>>> This simple stealing yields a higher CPU utilization than idle_balance()
>>>> alone, because the search is cheap, so it may be called every time the CPU
>>>> is about to go idle. idle_balance() does more work because it searches
>>>> widely for the busiest queue, so to limit its CPU consumption, it declines
>>>> to search if the system is too busy. Simple stealing does not offload the
>>>> globally busiest queue, but it is much better than running nothing at all.
>>>>
>>>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>>>> reduce cache contention vs the usual bitmap when many threads concurrently
>>>> set, clear, and visit elements.
>>>>
>>>> Patch 1 defines the sparsemask type and its operations.
>>>>
>>>> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>>>>
>>>> Patches 5 and 6 refactor existing code for a cleaner merge of later
>>>> patches.
>>>>
>>>> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>>>>
>>>> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
>>>> time being because of performance regressions that are not due to stealing
>>>> per-se. See the patch description for details.
>>>>
>>>> Patch 10 adds schedstats for comparing the new behavior to the old, and
>>>> provided as a convenience for developers only, not for integration.
>>>>
>>>> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
>>>> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
>>>> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
>>>> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
>>>> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
>>>> bandwidth control were tested.
>>>>
>>>> Stealing improves utilization with only a modest CPU overhead in scheduler
>>>> code. In the following experiment, hackbench is run with varying numbers
>>>> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
>>>> for each run, averaged per CPU, augmented with these non-standard stats:
>>>>
>>>> %find - percent of time spent in old and new functions that search for
>>>> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>>>>
>>>> steal - number of times a task is stolen from another CPU.
>>>>
>>>> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>> hackbench <grps> process 100000
>>>> sched_wakeup_granularity_ns=15000000
>>>
>>> Why do you mention this sched_wakeup_granularity_ns value?
>>> Is it something that you changed for your tests?
>>> The comment for this tunable says that the default value is 1ms *
>>> ilog(ncpus) = 4ms for 20 CPUs
>>
>> I changed it for the test, and I explain why a few paragraphs later.
>> The value matches the one set by tuned.service, for those running it.
>
> OK. I hadn't noticed that later explanation.
>
> You said " Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec. Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new."
>
> What do you mean exactly by distort?

With the default value of sched_wakeup_granularity_ns and the load range
I tested, as load and CPU utilization increase, preemptions increase, average
timeslice decreases, and time per task goes up. For a given task count, stealing
achieves higher utilization than the baseline, so it is hit harder by the
preemption effect. Raising sched_wakeup_granularity_ns factors this effect
out of the comparison.

- Steve

>>>> baseline
>>>> grps time %busy slice sched idle wake %find steal
>>>> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
>>>> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
>>>> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
>>>> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
>>>> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
>>>>
>>>> new
>>>> grps time %busy slice sched idle wake %find steal %speedup
>>>> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
>>>> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
>>>> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
>>>> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
>>>> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
>>>>
>>>> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
>>>> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
>>>> The cost is at most 0.4% more find time.
>>>
>>>>

2018-10-26 18:04:57

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

Hi Steve,

On 22/10/2018 15:59, Steve Sistare wrote:
> Define a simpler version of can_migrate_task called can_migrate_task_llc
> which does not require a struct lb_env argument, and judges whether a
> migration from one CPU to another within the same LLC should be allowed.
>
> Signed-off-by: Steve Sistare <[email protected]>
> ---
> kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4acdd8d..6548bed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7168,6 +7168,34 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> }
>
> /*
> + * Return true if task @p can migrate from @rq to @dst_rq in the same LLC.
> + * No need to test for co-locality, and no need to test task_hot(), as sharing
> + * LLC provides cache warmth at that level.

I was thinking that perhaps we could have scenarios where some rq's
keep stealing tasks off of each other and we end up circulating tasks
between CPUs. Now, that would only happen if we had a handful of tasks
with a very tiny period, and I'm not familiar with real workloads as
hyperactive as those generated by hackbench, where that could happen.

In short, I wonder if we should have task_hot() in there. Drawing a
parallel with load_balance(), even if load-balancing is happening between
rqs of the same LLC, we do go check task_hot(). Have you already experimented
with adding a task_hot() check in here?



I've run some iterations of hackbench (hackbench 2 process 100000) to
investigate this task bouncing, but I didn't really see any of it. That was
just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
with more CPUs.

----->8-----

activations: # of task activations (task starts running)
cpu_migrations: # of activations where cpu != prev_cpu
% stats are percentiles

- STEAL:

| stat | cpu_migrations | activations |
|-------+----------------+-------------|
| count | 2005.000000 | 2005.000000 |
| mean | 16.244888 | 290.608479 |
| std | 38.963138 | 253.003528 |
| min | 0.000000 | 3.000000 |
| 50% | 3.000000 | 239.000000 |
| 75% | 8.000000 | 436.000000 |
| 90% | 45.000000 | 626.000000 |
| 99% | 188.960000 | 1073.000000 |
| max | 369.000000 | 1417.000000 |

- NO_STEAL:

| stat | cpu_migrations | activations |
|-------+----------------+-------------|
| count | 2005.000000 | 2005.000000 |
| mean | 15.260848 | 297.860848 |
| std | 46.331890 | 253.210813 |
| min | 0.000000 | 3.000000 |
| 50% | 3.000000 | 252.000000 |
| 75% | 7.000000 | 444.000000 |
| 90% | 32.600000 | 643.600000 |
| 99% | 214.880000 | 1127.520000 |
| max | 467.000000 | 1547.000000 |

----->8-----

Otherwise, my only other concern at the moment is that since stealing
doesn't care about load, we could steal a task that would cause a big
imbalance, which wouldn't have happened with a call to load_balance().

I don't think this can be triggered with a symmetrical workload like
hackbench, so I'll go explore something else.

2018-10-26 18:30:16

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> Hi Steve,
> On 22/10/2018 15:59, Steve Sistare wrote:
>> Define a simpler version of can_migrate_task called can_migrate_task_llc
>> which does not require a struct lb_env argument, and judges whether a
>> migration from one CPU to another within the same LLC should be allowed.
>>
>> Signed-off-by: Steve Sistare <[email protected]>
>> ---
>> kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4acdd8d..6548bed 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7168,6 +7168,34 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>> }
>>
>> /*
>> + * Return true if task @p can migrate from @rq to @dst_rq in the same LLC.
>> + * No need to test for co-locality, and no need to test task_hot(), as sharing
>> + * LLC provides cache warmth at that level.
>
> I was thinking that perhaps we could have scenarios where some rq's
> keep stealing tasks off of each other and we end up circulating tasks
> between CPUs. Now, that would only happen if we had a handful of tasks
> with a very tiny period, and I'm not familiar with real workloads as
> hyperactive as those generated by hackbench, where that could happen.

That will not happen with the current code, as it only steals if nr_running > 1.
The src loses a task, the dst gains it and has nr_running == 1, so it will not
be re-stolen.

If we modify the code to handle misfits, we may steal when src nr_running == 1,
but a fast CPU will only steal the lone task from a slow one, never fast from fast,
and never slow from fast, so no tug of war.

> In short, I wonder if we should have task_hot() in there. Drawing a
> parallel with load_balance(), even if load-balancing is happening between
> rqs of the same LLC, we do go check task_hot(). Have you already experimented
> with adding a task_hot() check in here?

I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
and it reduced steals and overall performance.

> I've run some iterations of hackbench (hackbench 2 process 100000) to
> investigate this task bouncing, but I didn't really see any of it. That was
> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
> with more CPUs.
>
> ----->8-----
>
> activations: # of task activations (task starts running)
> cpu_migrations: # of activations where cpu != prev_cpu
> % stats are percentiles
>
> - STEAL:
>
> | stat | cpu_migrations | activations |
> |-------+----------------+-------------|
> | count | 2005.000000 | 2005.000000 |
> | mean | 16.244888 | 290.608479 |
> | std | 38.963138 | 253.003528 |
> | min | 0.000000 | 3.000000 |
> | 50% | 3.000000 | 239.000000 |
> | 75% | 8.000000 | 436.000000 |
> | 90% | 45.000000 | 626.000000 |
> | 99% | 188.960000 | 1073.000000 |
> | max | 369.000000 | 1417.000000 |
>
> - NO_STEAL:
>
> | stat | cpu_migrations | activations |
> |-------+----------------+-------------|
> | count | 2005.000000 | 2005.000000 |
> | mean | 15.260848 | 297.860848 |
> | std | 46.331890 | 253.210813 |
> | min | 0.000000 | 3.000000 |
> | 50% | 3.000000 | 252.000000 |
> | 75% | 7.000000 | 444.000000 |
> | 90% | 32.600000 | 643.600000 |
> | 99% | 214.880000 | 1127.520000 |
> | max | 467.000000 | 1547.000000 |
>
> ----->8-----
>
> Otherwise, my only other concern at the moment is that since stealing
> doesn't care about load, we could steal a task that would cause a big
> imbalance, which wouldn't have happened with a call to load_balance().
>
> I don't think this can be triggered with a symmetrical workload like
> hackbench, so I'll go explore something else.

The dst is about to go idle with zero load, so stealing can only improve the
instantaneous balance between src and dst. For longer term average load, we
still rely on periodic load_balance to make adjustments.

All good questions, keep them coming.

- Steve

2018-10-29 19:35:40

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

On 26/10/2018 19:28, Steven Sistare wrote:
> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
[...]
>>
>> I was thinking that perhaps we could have scenarios where some rq's
>> keep stealing tasks off of each other and we end up circulating tasks
>> between CPUs. Now, that would only happen if we had a handful of tasks
>> with a very tiny period, and I'm not familiar with real workloads as
>> hyperactive as those generated by hackbench, where that could happen.
>
> That will not happen with the current code, as it only steals if nr_running > 1.
> The src loses a task, the dst gains it and has nr_running == 1, so it will not
> be re-stolen.
>

That's indeed fine, I was thinking of something like this:


Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).

As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
can potentially steal the non-running task from r_2. Sometime later the task
that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
and r_2->nr_running == 1.

IOW we've swapped their role in that example, and the whole thing can
repeat.

The shorter the period of those tasks, the more we'll migrate them
between rq's, hence why I wonder if we shouldn't have some sort of
throttling.

> If we modify the code to handle misfits, we may steal when src nr_running == 1,
> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
> and never slow from fast, so no tug of war.
>
>> In short, I wonder if we should have task_hot() in there. Drawing a
>> parallel with load_balance(), even if load-balancing is happening between
>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>> with adding a task_hot() check in here?
>
> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
> and it reduced steals and overall performance.
>

Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
sched feature and the exec_start threshold.

The first one should be sidestepped in the stealing case since we won't
pass (if env->dst_rq->nr_running), that leaves us with the threshold.

We might want to sidestep it when we are doing balancing within an LLC
domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
in such cases.

In any case, I think it would make sense to add some LLC conditions to
task_hot() so that
- regular load_balance() can also benefit from them
- task stealing has at least some sort of throttling
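
For concreteness, a rough sketch of what a lower in-LLC threshold could look
like at the tail of task_hot() - the >> 2 scaling is an arbitrary placeholder
rather than a tuned value, and the hunk is illustrative only, not taken from
any posted patch:

	delta = rq_clock_task(env->src_rq) - p->se.exec_start;

	/*
	 * Within an LLC the shared cache keeps migrations relatively cheap,
	 * so use a smaller exec_start threshold before calling the task hot.
	 */
	if (env->sd->flags & SD_SHARE_PKG_RESOURCES)
		return delta < (s64)(sysctl_sched_migration_cost >> 2);

	return delta < (s64)sysctl_sched_migration_cost;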


On a sidenote, I find it a bit odd that the exec_start threshold depends on
sysctl_sched_migration_cost, which to me is more about idle_balance() cost
than "how long does it take for a previously run task to go cache cold".

>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>> investigate this task bouncing, but I didn't really see any of it. That was
>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>> with more CPUs.
>>
>> ----->8-----
>>
>> activations: # of task activations (task starts running)
>> cpu_migrations: # of activations where cpu != prev_cpu
>> % stats are percentiles
>>
>> - STEAL:
>>
>> | stat | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count | 2005.000000 | 2005.000000 |
>> | mean | 16.244888 | 290.608479 |
>> | std | 38.963138 | 253.003528 |
>> | min | 0.000000 | 3.000000 |
>> | 50% | 3.000000 | 239.000000 |
>> | 75% | 8.000000 | 436.000000 |
>> | 90% | 45.000000 | 626.000000 |
>> | 99% | 188.960000 | 1073.000000 |
>> | max | 369.000000 | 1417.000000 |
>>
>> - NO_STEAL:
>>
>> | stat | cpu_migrations | activations |
>> |-------+----------------+-------------|
>> | count | 2005.000000 | 2005.000000 |
>> | mean | 15.260848 | 297.860848 |
>> | std | 46.331890 | 253.210813 |
>> | min | 0.000000 | 3.000000 |
>> | 50% | 3.000000 | 252.000000 |
>> | 75% | 7.000000 | 444.000000 |
>> | 90% | 32.600000 | 643.600000 |
>> | 99% | 214.880000 | 1127.520000 |
>> | max | 467.000000 | 1547.000000 |
>>
>> ----->8-----
>>
>> Otherwise, my only other concern at the moment is that since stealing
>> doesn't care about load, we could steal a task that would cause a big
>> imbalance, which wouldn't have happened with a call to load_balance().
>>
>> I don't think this can be triggered with a symmetrical workload like
>> hackbench, so I'll go explore something else.
>
> The dst is about to go idle with zero load, so stealing can only improve the
> instantaneous balance between src and dst. For longer term average load, we
> still rely on periodic load_balance to make adjustments.
>

Right, so my line of thinking was that by not doing a load_balance() and
taking a shortcut (stealing a task), we may end up just postponing a
load_balance() to after we've stolen a task. I guess in those cases
there's no magic trick to be found and we just have to deal with it.

And then there's some of the logic like we have in update_sd_pick_busiest()
where we e.g. try to prevent misfit tasks from running on LITTLEs, but
then if such tasks are waiting to be run and a LITTLE frees itself up,
I *think* it's okay to steal it.

> All good questions, keep them coming.
>
> - Steve
>

2018-10-31 15:46:09

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

On 10/29/2018 3:34 PM, Valentin Schneider wrote:
> On 26/10/2018 19:28, Steven Sistare wrote:
>> On 10/26/2018 2:04 PM, Valentin Schneider wrote:
> [...]
>>>
>>> I was thinking that perhaps we could have scenarios where some rq's
>>> keep stealing tasks off of each other and we end up circulating tasks
>>> between CPUs. Now, that would only happen if we had a handful of tasks
>>> with a very tiny period, and I'm not familiar with real workloads as
>>> hyperactive as those generated by hackbench, where that could happen.
>>
>> That will not happen with the current code, as it only steals if nr_running > 1.
>> The src loses a task, the dst gains it and has nr_running == 1, so it will not
>> be re-stolen.
>
> That's indeed fine, I was thinking of something like this:
>
> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>
> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
> can potentially steal the non-running task from r_2. Sometime later the task
> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
> and r_2->nr_running == 1.
>
> IOW we've swapped their role in that example, and the whole thing can
> repeat.
>
> The shorter the period of those tasks, the more we'll migrate them
> between rq's, hence why I wonder if we shouldn't have some sort of
> throttling.

Stealing is still the right move in this scenario. Idle cycles become useful
cycles. The only cost is the CPU time to dequeue from a remote rq and
enqueue on the local rq. Earlier we discussed skipping try_steal() if avg_idle
is very small, on the order of 10 usec. I think that type of throttling would
cover your scenario. I will add it in my next version.
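
For reference, the throttle I have in mind is roughly the following, placed
where the newly idle path decides whether to call try_steal(); the 10 usec
constant is a placeholder and the exact form may change in V2:

	/*
	 * Stealing costs a remote dequeue plus a local enqueue, so skip it
	 * if this CPU's recent average idle time is too short to pay for
	 * that work.
	 */
	if (!new_tasks && rq->avg_idle > 10000)		/* ~10 usec, in ns */
		new_tasks = try_steal(rq, rf);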

>> If we modify the code to handle misfits, we may steal when src nr_running == 1,
>> but a fast CPU will only steal the lone task from a slow one, never fast from fast,
>> and never slow from fast, so no tug of war.
>>
>>> In short, I wonder if we should have task_hot() in there. Drawing a
>>> parallel with load_balance(), even if load-balancing is happening between
>>> rqs of the same LLC, we do go check task_hot(). Have you already experimented
>>> with adding a task_hot() check in here?
>>
>> I tried task_hot, to see if L1/L2 cache warmth matters much on L1/L2/L3 systems,
>> and it reduced steals and overall performance.
>
> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
> sched feature and the exec_start threshold.
>
> The first one should be sidestepped in the stealing case since we won't
> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>
> We might want to sidestep it when we are doing balancing within an LLC
> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
> in such cases.
>
> In any case, I think it would make sense to add some LLC conditions to
> task_hot() so that
> - regular load_balance() can also benefit from them

This is probably a good idea (lower threshold for task_hot within LLC).
I would rather see it done as a separate patch, with a separate performance
evaluation, as it will affect all workloads, even those that do not steal.
A load balancing migration when !task_hot() may be performed even when
the dst CPU already has a task to run, so the migration may or may not
improve utilization. By contrast, a newly idle CPU that does not find
work goes idle and definitely wastes cycles. Note how
migrate_degrades_locality() chooses migration regardless of preferred node
when the dst is idle:

/* Leaving a core idle is often worse than degrading locality. */
if (env->idle != CPU_NOT_IDLE)
return -1;

I apply the same principle in can_migrate_task_llc().

> - task stealing has at least some sort of throttling
>
>
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

Agreed, but these are all magic numbers anyway :)

>>> I've run some iterations of hackbench (hackbench 2 process 100000) to
>>> investigate this task bouncing, but I didn't really see any of it. That was
>>> just a 4+4 big.LITTLE system though, I'll try to get numbers on a system
>>> with more CPUs.
>>>
>>> ----->8-----
>>>
>>> activations: # of task activations (task starts running)
>>> cpu_migrations: # of activations where cpu != prev_cpu
>>> % stats are percentiles
>>>
>>> - STEAL:
>>>
>>> | stat | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count | 2005.000000 | 2005.000000 |
>>> | mean | 16.244888 | 290.608479 |
>>> | std | 38.963138 | 253.003528 |
>>> | min | 0.000000 | 3.000000 |
>>> | 50% | 3.000000 | 239.000000 |
>>> | 75% | 8.000000 | 436.000000 |
>>> | 90% | 45.000000 | 626.000000 |
>>> | 99% | 188.960000 | 1073.000000 |
>>> | max | 369.000000 | 1417.000000 |
>>>
>>> - NO_STEAL:
>>>
>>> | stat | cpu_migrations | activations |
>>> |-------+----------------+-------------|
>>> | count | 2005.000000 | 2005.000000 |
>>> | mean | 15.260848 | 297.860848 |
>>> | std | 46.331890 | 253.210813 |
>>> | min | 0.000000 | 3.000000 |
>>> | 50% | 3.000000 | 252.000000 |
>>> | 75% | 7.000000 | 444.000000 |
>>> | 90% | 32.600000 | 643.600000 |
>>> | 99% | 214.880000 | 1127.520000 |
>>> | max | 467.000000 | 1547.000000 |
>>>
>>> ----->8-----
>>>
>>> Otherwise, my only other concern at the moment is that since stealing
>>> doesn't care about load, we could steal a task that would cause a big
>>> imbalance, which wouldn't have happened with a call to load_balance().
>>>
>>> I don't think this can be triggered with a symmetrical workload like
>>> hackbench, so I'll go explore something else.
>>
>> The dst is about to go idle with zero load, so stealing can only improve the
>> instantaneous balance between src and dst. For longer term average load, we
>> still rely on periodic load_balance to make adjustments.
>
> Right, so my line of thinking was that by not doing a load_balance() and
> taking a shortcut (stealing a task), we may end up just postponing a
> load_balance() to after we've stolen a task. I guess in those cases
> there's no magic trick to be found and we just have to deal with it.

In the current code I call idle_balance/load_balance first and then try_steal.
If idle_balance fails because of cost, then it has effectively postponed itself,
independently of stealing. The next successful call to load_balance will
correct any imbalance caused by stealing.
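
To make the ordering concrete, the idle path of pick_next_task_fair() ends up
doing essentially this (paraphrasing patch 08 rather than quoting it verbatim):

idle:
	new_tasks = idle_balance(rq, rf);
	if (new_tasks == 0)
		new_tasks = try_steal(rq, rf);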

> And then there's some of the logic like we have in update_sd_pick_busiest()
> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
> then if such tasks are waiting to be run and a LITTLE frees itself up,
> I *think* it's okay to steal it.

Should be OK to steal. If a BIG subsequently goes idle, load_balance will move
the task to the BIG, or the BIG may steal it when we support misfit stealing.

Questions for you, Valentin:

- Should misfit stealing be a separate patch, after my series? I prefer that,
so we get stealing in people's hands as soon as possible. I think separating
it is OK because stealing should not cause any regression for misfits, as
my code still calls idle_balance/load_balance, which handles misfits.

- Who should implement misfit stealing -- you, me, someone else? I have no
preference.

- Steve

2018-10-31 18:49:02

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc



On 31/10/2018 15:43, Steven Sistare wrote:
> On 10/29/2018 3:34 PM, Valentin Schneider wrote:
[...]
>> Suppose you have 2 rq's sharing a workload of 3 tasks. You get one rq with
>> nr_running == 1 (r_1) and one rq with nr_running == 2 (r_2).
>>
>> As soon as the task on r_1 ends/blocks, we'll go through idle balancing and
>> can potentially steal the non-running task from r_2. Sometime later the task
>> that was running on r_1 wakes up, and we end up with r_1->nr_running == 2
>> and r_2->nr_running == 1.
>>
>> IOW we've swapped their role in that example, and the whole thing can
>> repeat.
>>
>> The shorter the period of those tasks, the more we'll migrate them
>> between rq's, hence why I wonder if we shouldn't have some sort of
>> throttling.
>
> Stealing is still the right move in this scenario. Idle cycles become useful
> cycles. The only cost is the CPU time to dequeue from a remote rq and
> enqueue on the local rq. Earlier we discussed skipping try_steal() if avg_idle
> is very small, on the order of 10 usec. I think that type of throttling would
> cover your scenario. I will add it in my next version.
>

Sounds good to me. Out of curiosity, how did you establish this 10 µsec
threshold? I guess we just want something very tiny but still big enough
to broadly cover try_steal() worst case exec times.

[...]
>> Mmm so task_hot() mainly implements two mechanisms - the CACHE_HOT_BUDDY
>> sched feature and the exec_start threshold.
>>
>> The first one should be sidestepped in the stealing case since we won't
>> pass (if env->dst_rq->nr_running), that leaves us with the threshold.
>>
>> We might want to sidestep it when we are doing balancing within an LLC
>> domain (env->sd->flags & SD_SHARE_PKG_RESOURCES) - or use a lower threshold
>> in such cases.
>>
>> In any case, I think it would make sense to add some LLC conditions to
>> task_hot() so that
>> - regular load_balance() can also benefit from them
>
> This is probably a good idea (lower threshold for task_hot within LLC).
> I would rather see it done as a separate patch, with a separate performance
> evaluation, as it will affect all workloads, even those that do not steal.

Agreed.

> A load balancing migration when !task_hot() may be performed even when
> the dst CPU already has a task to run, so the migration may or may not
> improve utilization. By contrast, a newly idle CPU that does not find
> work goes idle and definitely wastes cycles. Note how
> migrate_degrades_locality() chooses migration regardless of preferred node
> when the dst is idle:
>
> /* Leaving a core idle is often worse than degrading locality. */
> if (env->idle != CPU_NOT_IDLE)
> return -1;
>
> I apply the same principle in can_migrate_task_llc().
>
[...]
>> Right, so my line of thinking was that by not doing a load_balance() and
>> taking a shortcut (stealing a task), we may end up just postponing a
>> load_balance() to after we've stolen a task. I guess in those cases
>> there's no magic trick to be found and we just have to deal with it.
>
> In the current code I call idle_balance/load_balance first and then try_steal.
> If idle_balance fails because of cost, then it has effectively postponed itself,
> independently of stealing. The next successful call to load_balance will
> correct any imbalance caused by stealing.
>
>> And then there's some of the logic like we have in update_sd_pick_busiest()
>> where we e.g. try to prevent misfit tasks from running on LITTLEs, but
>> then if such tasks are waiting to be run and a LITTLE frees itself up,
>> I *think* it's okay to steal it.
>
> Should be OK to steal. If a BIG subsequently goes idle, load_balance will move
> the task to the BIG, or the BIG may steal it when we support misfit stealing.
>
> Questions for you, Valentin:
>
> - Should misfit stealing be a separate patch, after my series? I prefer that,
> so we get stealing in people's hands as soon as possible. I think separating
> it is OK because stealing should not cause any regression for misfits, as
> my code still calls idle_balance/load_balance, which handles misfits.
>

I don't have any objections. Quentin (added on cc) might want to disable
stealing when !overutilized for Energy Aware Scheduling, but then that
also depends on what gets in first :)

> - Who should implement misfit stealing -- you, me, someone else? I have no
> preference.

It would make more sense if I did that, as it's easy for me to test it -
unless you're really curious about Arm stuff and want to have some fun :)

>
> - Steve
>

2018-10-31 19:16:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

On Mon, Oct 29, 2018 at 07:34:50PM +0000, Valentin Schneider wrote:
> On a sidenote, I find it a bit odd that the exec_start threshold depends on
> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
> than "how long does it take for a previously run task to go cache cold".

A long, long time ago (think 2.6.20) there was code that did boot-time
measurement of the various cache topology costs and that threshold was
related to that.

That code got killed because of boot to boot variance and dubious
benefits.

The migration cost is what it was replaced with as a single measure.

migration cost was then later abused in the newidle balance because it
was over eager. Ideally we'd get rid of it there, because we've now got
that much more elaborate accounting, but Rohit tried and found some
regression because of that.

Maybe we should remove it from newidle anyway.

2018-10-31 19:38:16

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

Thanks very much to everyone who has commented on my patch series.
Here are the issues to be addressed in V2 of the series, and the person
that suggested it, or raised the issue that led to it.

Changes for V2:
* Remove stray patch 10 hunk from patch 5 (Valentin)
* Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
(Valentin)
* Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2 (Steve)
* Call try_steal iff avg_idle exceeds some small threshold (Steve, Valentin)

Possible future work:
* Use sparsemask and stealing for RT (Steve, Peter)
* Remove the core and socket levels from idle_balance() and let stealing
handle those levels (Steve, Peter)
* Delete idle_balance() and use stealing exclusively for handling new idle
(Steve, Peter)
* Test specjbb multi-warehouse on 8-node systems when stealing for
large NUMA systems is revisited (Peter)
* Enhance stealing to handle misfits (Valentin)
* Lower time threshold for task_hot within LLC (Valentin)

Dropped:
* Skip try_steal() if we bail out of idle_balance() because !this_rq->rd->overload
(Valentin)
I tried it and saw no difference. Dropped for simplicity.

Does anyone else plan to review the code? Please tell me now, even if your
review will be delayed. If yes, I will wait for all comments before producing
V2. The code changes so far are small.

- Steve

On 10/22/2018 10:59 AM, Steve Sistare wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
>
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle. idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy. Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
>
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
>
> Patch 1 defines the sparsemask type and its operations.
>
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>
> Patches 5 and 6 refactor existing code for a cleaner merge of later
> patches.
>
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per-se. See the patch description for details.
>
> Patch 10 adds schedstats for comparing the new behavior to the old, and
> provided as a convenience for developers only, not for integration.
>
> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
> bandwidth control were tested.
>
> Stealing improves utilization with only a modest CPU overhead in scheduler
> code. In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
>
> %find - percent of time spent in old and new functions that search for
> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>
> steal - number of times a task is stolen from another CPU.
>
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000
>
> baseline
> grps time %busy slice sched idle wake %find steal
> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
>
> new
> grps time %busy slice sched idle wake %find steal %speedup
> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
>
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.
>
> Additional performance results follow. A negative "speedup" is a
> regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec. Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new.
>
> ------------------ 1 Socket Results ------------------
>
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 8.008 0.1 5.905 0.2 35.6
> 2 13.814 0.2 11.438 0.1 20.7
> 3 19.488 0.2 16.919 0.1 15.1
> 4 25.059 0.1 22.409 0.1 11.8
> 8 47.478 0.1 44.221 0.1 7.3
>
> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 4.586 0.8 4.596 0.6 -0.3
> 2 7.693 0.2 5.775 1.3 33.2
> 3 10.442 0.3 8.288 0.3 25.9
> 4 13.087 0.2 11.057 0.1 18.3
> 8 24.145 0.2 22.076 0.3 9.3
> 16 43.779 0.1 41.741 0.2 4.8
>
> KVM 4-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> tbench, average of 11 runs.
>
> clients %speedup
> 1 16.2
> 2 11.7
> 4 9.9
> 8 12.8
> 16 13.7
>
> KVM 2-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>
> Benchmark %speedup
> specjbb2015_critical_jops 5.7
> mysql_sysb1.0.14_mutex_2 40.6
> mysql_sysb1.0.14_oltp_2 3.9
>
> ------------------ 2 Socket Results ------------------
>
> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 7.945 0.2 7.219 8.7 10.0
> 2 8.444 0.4 6.689 1.5 26.2
> 3 12.100 1.1 9.962 2.0 21.4
> 4 15.001 0.4 13.109 1.1 14.4
> 8 27.960 0.2 26.127 0.3 7.0
>
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 5.826 5.4 5.840 5.0 -0.3
> 2 5.041 5.3 6.171 23.4 -18.4
> 3 6.839 2.1 6.324 3.8 8.1
> 4 8.177 0.6 7.318 3.6 11.7
> 8 14.429 0.7 13.966 1.3 3.3
> 16 26.401 0.3 25.149 1.5 4.9
>
>
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Oracle database OLTP, logging disabled, NVRAM storage
>
> Customers Users %speedup
> 1200000 40 -1.2
> 2400000 80 2.7
> 3600000 120 8.9
> 4800000 160 4.4
> 6000000 200 3.0
>
> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> Results from the Oracle "Performance PIT".
>
> Benchmark %speedup
>
> mysql_sysb1.0.14_fileio_56_rndrd 19.6
> mysql_sysb1.0.14_fileio_56_seqrd 12.1
> mysql_sysb1.0.14_fileio_56_rndwr 0.4
> mysql_sysb1.0.14_fileio_56_seqrewr -0.3
>
> pgsql_sysb1.0.14_fileio_56_rndrd 19.5
> pgsql_sysb1.0.14_fileio_56_seqrd 8.6
> pgsql_sysb1.0.14_fileio_56_rndwr 1.0
> pgsql_sysb1.0.14_fileio_56_seqrewr 0.5
>
> opatch_time_ASM_12.2.0.1.0_HP2M 7.5
> select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
> select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
> swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8
>
> lm3_memlat_L2 4.8
> lm3_memlat_L1 0.0
>
> ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
> ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
> ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
> ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
>
> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>
> NAS_OMP
> bench class ncpu %improved(Mops)
> dc B 72 1.3
> is C 72 0.9
> is D 72 0.7
>
> sysbench mysql, average of 24 runs
> --- base --- --- new ---
> nthr events %stdev events %stdev %speedup
> 1 331.0 0.25 331.0 0.24 -0.1
> 2 661.3 0.22 661.8 0.22 0.0
> 4 1297.0 0.88 1300.5 0.82 0.2
> 8 2420.8 0.04 2420.5 0.04 -0.1
> 16 4826.3 0.07 4825.4 0.05 -0.1
> 32 8815.3 0.27 8830.2 0.18 0.1
> 64 12823.0 0.24 12823.6 0.26 0.0
>
> --------------------------------------------------------------
>
> Steve Sistare (10):
> sched: Provide sparsemask, a reduced contention bitmap
> sched/topology: Provide hooks to allocate data shared per LLC
> sched/topology: Provide cfs_overload_cpus bitmap
> sched/fair: Dynamically update cfs_overload_cpus
> sched/fair: Hoist idle_stamp up from idle_balance
> sched/fair: Generalize the detach_task interface
> sched/fair: Provide can_migrate_task_llc
> sched/fair: Steal work from an overloaded CPU when CPU goes idle
> sched/fair: disable stealing if too many NUMA nodes
> sched/fair: Provide idle search schedstats
>
> include/linux/sched/topology.h | 1 +
> include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++
> kernel/sched/core.c | 30 +++-
> kernel/sched/fair.c | 338 +++++++++++++++++++++++++++++++++++++----
> kernel/sched/features.h | 6 +
> kernel/sched/sched.h | 13 +-
> kernel/sched/stats.c | 11 +-
> kernel/sched/stats.h | 13 ++
> kernel/sched/topology.c | 117 +++++++++++++-
> lib/Makefile | 2 +-
> lib/sparsemask.c | 142 +++++++++++++++++
> 11 files changed, 898 insertions(+), 35 deletions(-)
> create mode 100644 include/linux/sparsemask.h
> create mode 100644 lib/sparsemask.c
>

2018-11-01 11:17:54

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 07/10] sched/fair: Provide can_migrate_task_llc

On 31/10/2018 19:14, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 07:34:50PM +0000, Valentin Schneider wrote:
>> On a sidenote, I find it a bit odd that the exec_start threshold depends on
>> sysctl_sched_migration_cost, which to me is more about idle_balance() cost
>> than "how long does it take for a previously run task to go cache cold".
>
> A long, long time ago (think 2.6.20) there was code that did boot-time
> measurement of the various cache topology costs and that threshold was
> related to that.
>
> That code got killed because of boot to boot variance and dubious
> benefits.
>
> The migration cost is what it was replaced with as a single measure.
>
> migration cost was then later abused in the newidle balance because it
> was over eager.


Thanks! I tried following the blame rabbit hole but got a bit lost along
the way.

> Ideally we'd get rid of it there, because we've now got
> that much more elaborate accounting, but Rohit tried and found some
> regression because of that.
>
> Maybe we should remove it from newidle anyway.
>

Well, if we ditch idle_balance() in favor of stealing, that would "solve"
it...

2018-11-01 11:58:33

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 10/22/2018 10:59 AM, Steve Sistare wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
> [...]
> Steve Sistare (10):
> sched: Provide sparsemask, a reduced contention bitmap
> sched/topology: Provide hooks to allocate data shared per LLC
> sched/topology: Provide cfs_overload_cpus bitmap
> sched/fair: Dynamically update cfs_overload_cpus
> sched/fair: Hoist idle_stamp up from idle_balance
> sched/fair: Generalize the detach_task interface
> sched/fair: Provide can_migrate_task_llc
> sched/fair: Steal work from an overloaded CPU when CPU goes idle
> sched/fair: disable stealing if too many NUMA nodes
> sched/fair: Provide idle search schedstats

(resend, reformatted)

Thanks very much to everyone who has commented on my patch series.
Here are the issues to be addressed in V2 of the series, and the person
that suggested it, or raised the issue that led to it.

Changes for V2:
* Remove stray patch 10 hunk from patch 5 (Valentin)
* Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
(Valentin)
* Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2 (Steve)
* Call try_steal iff avg_idle exceeds some small threshold (Steve, Valentin)

Possible future work:
* Use sparsemask and stealing for RT (Steve, Peter)
* Remove the core and socket levels from idle_balance() and let stealing
handle those levels (Steve, Peter)
* Delete idle_balance() and use stealing exclusively for handling new idle
(Steve, Peter)
* Test specjbb multi-warehouse on 8-node systems when stealing for
large NUMA systems is revisited (Peter)
* Enhance stealing to handle misfits (Valentin)
* Lower time threshold for task_hot within LLC (Valentin)

Dropped:
* Skip try_steal() if we bail out of idle_balance() because !this_rq->rd->overload
(Valentin)
I tried it and saw no difference. Dropped for simplicity.

Does anyone else plan to review the code? Please tell me now, even if your
review will be delayed. If yes, I will wait for all comments before producing
V2. The code changes so far are small.

- Steve

2018-11-02 23:40:12

by Subhra Mazumdar

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization


On 10/22/18 7:59 AM, Steve Sistare wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
>
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle. idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy. Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
>
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
>
Is the bitmask saving much? I tried a simple form of stealing that just starts
searching the domain from the current cpu and steals a thread from the
first cpu that has more than one runnable thread. It seems to perform
similarly to your patch.

hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
groups          baseline        %stdev  patch (%speedup) %stdev
1(40 tasks)     0.5524          2.36    0.5522 (0.045%) 3.82
2(80 tasks)     0.6482          11.4    0.7241 (-11.7%) 20.34
4(160 tasks)    0.9756          0.95    0.8276 (15.1%) 5.8
8(320 tasks)    1.7699          1.62    1.6655 (5.9%) 1.57
16(640 tasks)   3.1018          0.77    2.9858 (3.74%) 1.4
32(1280 tasks)  5.565           0.62    5.3388 (4.1%) 0.72

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Oracle database OLTP, logging _enabled_

Users %speedup
20 1.2
40 -0.41
60 0.83
80 2.37
100 1.54
120 3.0
140 2.24
160 1.82
180 1.94
200 2.23
220 1.49

Below is the patch (not in best shape)

----------->8--------------------

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf..1690451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3540,6 +3540,7 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
 }

 static int idle_balance(struct rq *this_rq, struct rq_flags *rf);
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf);

 static inline unsigned long task_util(struct task_struct *p)
 {
@@ -6619,6 +6620,8 @@ done: __maybe_unused;

 idle:
        new_tasks = idle_balance(rq, rf);
+       if (new_tasks == 0)
+               new_tasks = my_idle_balance(rq, rf);

        /*
         * Because idle_balance() releases (and re-acquires) rq->lock, it is
@@ -8434,6 +8437,75 @@ static int should_we_balance(struct lb_env *env)
        return balance_cpu == env->dst_cpu;
 }

+int get_best_cpu(int this_cpu, struct sched_domain *sd)
+{
+       struct rq *this_rq, *rq;
+       int i;
+       int best_cpu = -1;
+
+       this_rq = cpu_rq(this_cpu);
+       for_each_cpu_wrap(i, sched_domain_span(sd), this_cpu) {
+               if (this_rq->nr_running > 0)
+                       return (-1);
+               if (i == this_cpu)
+                       continue;
+               rq = cpu_rq(i);
+               if (rq->nr_running <= 1)
+                       continue;
+               best_cpu = i;
+               break;
+       }
+       return (best_cpu);
+}
+static int my_load_balance(int this_cpu, struct rq *this_rq,
+                          struct sched_domain *sd, enum cpu_idle_type idle)
+{
+       int ld_moved = 0;
+       struct rq *busiest;
+       unsigned long flags;
+       struct task_struct *p = NULL;
+       struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+       int best_cpu;
+
+       struct lb_env env = {
+               .sd             = sd,
+               .dst_cpu        = this_cpu,
+               .dst_rq         = this_rq,
+               .dst_grpmask    = sched_group_span(sd->groups),
+               .idle           = idle,
+               .cpus           = cpus,
+               .tasks          = LIST_HEAD_INIT(env.tasks),
+       };
+
+       if (idle == CPU_NEWLY_IDLE)
+               env.dst_grpmask = NULL;
+
+       cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+       best_cpu = get_best_cpu(this_cpu, sd);
+
+       if (best_cpu >= 0)
+               busiest = cpu_rq(best_cpu);
+       else
+               goto out;
+
+       env.src_cpu = busiest->cpu;
+       env.src_rq = busiest;
+       raw_spin_lock_irqsave(&busiest->lock, flags);
+
+       p = detach_one_task(&env);
+       raw_spin_unlock(&busiest->lock);
+       if (p) {
+               attach_one_task(this_rq, p);
+               ld_moved++;
+       }
+       local_irq_restore(flags);
+
+out:
+       return ld_moved;
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -9369,6 +9441,71 @@ static inline bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle
 static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */

+
+static int my_idle_balance(struct rq *this_rq, struct rq_flags *rf)
+{
+       int this_cpu = this_rq->cpu;
+       struct sched_domain *sd;
+       int pulled_task = 0;
+       int level;
+
+       /*
+        * We must set idle_stamp _before_ calling idle_balance(), such that we
+        * measure the duration of idle_balance() as idle time.
+        */
+       this_rq->idle_stamp = rq_clock(this_rq);
+
+       rq_unpin_lock(this_rq, rf);
+
+       /*
+        * Drop the rq->lock, but keep IRQ/preempt disabled.
+        */
+       raw_spin_unlock(&this_rq->lock);
+
+       rcu_read_lock();
+
+       level = 0;
+       for_each_domain(this_cpu, sd) {
+               if (level++ > 1)
+                       break;
+
+               if (!(sd->flags & SD_LOAD_BALANCE))
+                       continue;
+
+               if (sd->flags & SD_BALANCE_NEWIDLE)
+                       pulled_task = my_load_balance(this_cpu, this_rq,
+                                                     sd, CPU_NEWLY_IDLE);
+
+               /*
+                * Stop searching for tasks to pull if there are
+                * now runnable tasks on this rq.
+                */
+               if (pulled_task || this_rq->nr_running > 0)
+                       break;
+       }
+       rcu_read_unlock();
+
+       raw_spin_lock(&this_rq->lock);
+
+       /*
+        * While browsing the domains, we released the rq lock, a task could
+        * have been enqueued in the meantime. Since we're not going idle,
+        * pretend we pulled a task.
+        */
+       if (this_rq->cfs.h_nr_running && !pulled_task)
+               pulled_task = 1;
+out:
+       /* Is there a task of a high priority class? */
+       if (this_rq->nr_running != this_rq->cfs.h_nr_running)
+               pulled_task = -1;
+
+       if (pulled_task)
+               this_rq->idle_stamp = 0;
+
+       rq_repin_lock(this_rq, rf);
+       return pulled_task;
+}
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.


2018-11-05 20:09:45

by Steven Sistare

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 11/2/2018 7:39 PM, Subhra Mazumdar wrote:
> On 10/22/18 7:59 AM, Steve Sistare wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates.  To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed.  For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This simple stealing yields a higher CPU utilization than idle_balance()
>> alone, because the search is cheap, so it may be called every time the CPU
>> is about to go idle.  idle_balance() does more work because it searches
>> widely for the busiest queue, so to limit its CPU consumption, it declines
>> to search if the system is too busy.  Simple stealing does not offload the
>> globally busiest queue, but it is much better than running nothing at all.
>>
>> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
>> reduce cache contention vs the usual bitmap when many threads concurrently
>> set, clear, and visit elements.
>>
> Is the bitmask saving much? I tried a simple stealing scheme that just
> starts searching the domain from the current cpu and steals a thread from
> the first cpu that has more than one runnable thread. It seems to perform
> similarly to your patch.
>
> hackbench on X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> groups(tasks)   baseline        %stdev  patch (%improvement)    %stdev
> 1(40 tasks)     0.5524          2.36    0.5522 (0.045%) 3.82
> 2(80 tasks)     0.6482          11.4    0.7241 (-11.7%) 20.34
> 4(160 tasks)    0.9756          0.95    0.8276 (15.1%) 5.8
> 8(320 tasks)    1.7699          1.62    1.6655 (5.9%) 1.57
> 16(640 tasks)   3.1018          0.77    2.9858 (3.74%) 1.4
> 32(1280 tasks)  5.565           0.62    5.3388 (4.1%) 0.72
>
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Oracle database OLTP, logging _enabled_
>
> Users %speedup
> 20 1.2
> 40 -0.41
> 60 0.83
> 80 2.37
> 100 1.54
> 120 3.0
> 140 2.24
> 160 1.82
> 180 1.94
> 200 2.23
> 220 1.49
Hi Subhra,
The bitset is a few percent faster than iterating over CPUs in the
tests I ran on the X6-2 with 44 CPUs per node. If we extend stealing to
RT, folks care even more about a few percent. The difference should be
greater on systems with more CPUs per socket, and greater if we extend
stealing to steal across NUMA nodes, and greater if Valentin adds another
bitset for misfits. Lastly, there is no measurable downside in maintaining
the overloaded CPUs bitset. I ran experiments where I set and cleared the
bits in overload_set and overload_clear, but disabled stealing itself, and
saw no significant difference versus the baseline.
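
To make the contention argument concrete, here is a minimal user-space
sketch of the idea, assuming a simplified layout in which each CPU's
"overloaded" flag is padded onto its own cache line. It is not the
sparsemask type from patch 1, only an illustration of why spreading the
flags across cache lines reduces write contention; every name in it
(toy_overload_mask, toy_set_overloaded, toy_find_overloaded) is made up
for illustration:

#include <stdatomic.h>
#include <stdio.h>

#define TOY_NR_CPUS   88        /* e.g. the 2-socket X6-2 above */
#define TOY_LINE_SIZE 64        /* assumed cache-line size in bytes */

/* One "overloaded" flag per CPU, padded so neighbours never share a line. */
struct toy_overload_mask {
        struct {
                atomic_int overloaded;
                char pad[TOY_LINE_SIZE - sizeof(atomic_int)];
        } cpu[TOY_NR_CPUS];
};

static struct toy_overload_mask mask;

/* Writer: a CPU flags itself; it only dirties its own cache line. */
static void toy_set_overloaded(int cpu, int val)
{
        atomic_store_explicit(&mask.cpu[cpu].overloaded, val,
                              memory_order_relaxed);
}

/*
 * Reader: an idle CPU looks for the first overloaded candidate, wrapping
 * around from itself, analogous to the bitmap traversal in the patches.
 */
static int toy_find_overloaded(int this_cpu)
{
        for (int i = 1; i < TOY_NR_CPUS; i++) {
                int cpu = (this_cpu + i) % TOY_NR_CPUS;

                if (atomic_load_explicit(&mask.cpu[cpu].overloaded,
                                         memory_order_relaxed))
                        return cpu;
        }
        return -1;              /* nothing to steal */
}

int main(void)
{
        toy_set_overloaded(7, 1);
        printf("steal candidate = %d\n", toy_find_overloaded(0));
        return 0;
}

In this sketch, writers that mark different CPUs overloaded touch different
cache lines, so concurrent set/clear operations rarely bounce a shared line;
the reader pays by scanning more lines. A real implementation would balance
writer contention against reader scan cost by packing more than one flag per
line; the sketch errs entirely on the writer's side.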

- Steve

2019-01-04 16:50:11

by Shijith Thotton

[permalink] [raw]
Subject: Re: [PATCH 00/10] steal tasks to improve CPU utilization

On 22-Oct-18 8:40 PM, Steve Sistare wrote:
>
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
>
> This simple stealing yields a higher CPU utilization than idle_balance()
> alone, because the search is cheap, so it may be called every time the CPU
> is about to go idle. idle_balance() does more work because it searches
> widely for the busiest queue, so to limit its CPU consumption, it declines
> to search if the system is too busy. Simple stealing does not offload the
> globally busiest queue, but it is much better than running nothing at all.
>
> The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
> reduce cache contention vs the usual bitmap when many threads concurrently
> set, clear, and visit elements.
>
> Patch 1 defines the sparsemask type and its operations.
>
> Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
>
> Patches 5 and 6 refactor existing code for a cleaner merge of later
> patches.
>
> Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
>
> Patch 9 disables stealing on systems with more than 2 NUMA nodes for the
> time being because of performance regressions that are not due to stealing
> per-se. See the patch description for details.
>
> Patch 10 adds schedstats for comparing the new behavior to the old, and
> provided as a convenience for developers only, not for integration.
>
> The patch series is based on kernel 4.19.0-rc7. It compiles, boots, and
> runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
> and CONFIG_PREEMPT. It runs without error with CONFIG_DEBUG_PREEMPT +
> CONFIG_SLUB_DEBUG + CONFIG_DEBUG_PAGEALLOC + CONFIG_DEBUG_MUTEXES +
> CONFIG_DEBUG_SPINLOCK + CONFIG_DEBUG_ATOMIC_SLEEP. CPU hot plug and CPU
> bandwidth control were tested.
>
> Stealing improves utilization with only a modest CPU overhead in scheduler
> code. In the following experiment, hackbench is run with varying numbers
> of groups (40 tasks per group), and the delta in /proc/schedstat is shown
> for each run, averaged per CPU, augmented with these non-standard stats:
>
> %find - percent of time spent in old and new functions that search for
> idle CPUs and tasks to steal and set the overloaded CPUs bitmap.
>
> steal - number of times a task is stolen from another CPU.
>
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> hackbench <grps> process 100000
> sched_wakeup_granularity_ns=15000000
>
> baseline
> grps time %busy slice sched idle wake %find steal
> 1 8.084 75.02 0.10 105476 46291 59183 0.31 0
> 2 13.892 85.33 0.10 190225 70958 119264 0.45 0
> 3 19.668 89.04 0.10 263896 87047 176850 0.49 0
> 4 25.279 91.28 0.10 322171 94691 227474 0.51 0
> 8 47.832 94.86 0.09 630636 144141 486322 0.56 0
>
> new
> grps time %busy slice sched idle wake %find steal %speedup
> 1 5.938 96.80 0.24 31255 7190 24061 0.63 7433 36.1
> 2 11.491 99.23 0.16 74097 4578 69512 0.84 19463 20.9
> 3 16.987 99.66 0.15 115824 1985 113826 0.77 24707 15.8
> 4 22.504 99.80 0.14 167188 2385 164786 0.75 29353 12.3
> 8 44.441 99.86 0.11 389153 1616 387401 0.67 38190 7.6
>
> Elapsed time improves by 8 to 36%, and CPU busy utilization is up
> by 5 to 22% hitting 99% for 2 or more groups (80 or more tasks).
> The cost is at most 0.4% more find time.
>
> Additional performance results follow. A negative "speedup" is a
> regression. Note: for all hackbench runs, sched_wakeup_granularity_ns
> is set to 15 msec. Otherwise, preemptions increase at higher loads and
> distort the comparison between baseline and new.
>
> ------------------ 1 Socket Results ------------------
>
> X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 8.008 0.1 5.905 0.2 35.6
> 2 13.814 0.2 11.438 0.1 20.7
> 3 19.488 0.2 16.919 0.1 15.1
> 4 25.059 0.1 22.409 0.1 11.8
> 8 47.478 0.1 44.221 0.1 7.3
>
> X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 4.586 0.8 4.596 0.6 -0.3
> 2 7.693 0.2 5.775 1.3 33.2
> 3 10.442 0.3 8.288 0.3 25.9
> 4 13.087 0.2 11.057 0.1 18.3
> 8 24.145 0.2 22.076 0.3 9.3
> 16 43.779 0.1 41.741 0.2 4.8
>
> KVM 4-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
> tbench, average of 11 runs.
>
> clients %speedup
> 1 16.2
> 2 11.7
> 4 9.9
> 8 12.8
> 16 13.7
>
> KVM 2-cpu
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>
> Benchmark %speedup
> specjbb2015_critical_jops 5.7
> mysql_sysb1.0.14_mutex_2 40.6
> mysql_sysb1.0.14_oltp_2 3.9
>
> ------------------ 2 Socket Results ------------------
>
> X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 7.945 0.2 7.219 8.7 10.0
> 2 8.444 0.4 6.689 1.5 26.2
> 3 12.100 1.1 9.962 2.0 21.4
> 4 15.001 0.4 13.109 1.1 14.4
> 8 27.960 0.2 26.127 0.3 7.0
>
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Average of 10 runs of: hackbench <groups> process 100000
>
> --- base -- --- new ---
> groups time %stdev time %stdev %speedup
> 1 5.826 5.4 5.840 5.0 -0.3
> 2 5.041 5.3 6.171 23.4 -18.4
> 3 6.839 2.1 6.324 3.8 8.1
> 4 8.177 0.6 7.318 3.6 11.7
> 8 14.429 0.7 13.966 1.3 3.3
> 16 26.401 0.3 25.149 1.5 4.9
>
>
> X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Oracle database OLTP, logging disabled, NVRAM storage
>
> Customers Users %speedup
> 1200000 40 -1.2
> 2400000 80 2.7
> 3600000 120 8.9
> 4800000 160 4.4
> 6000000 200 3.0
>
> X6-2: 2 sockets * 14 cores * 2 hyperthreads = 56 CPUs
> Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
> Results from the Oracle "Performance PIT".
>
> Benchmark %speedup
>
> mysql_sysb1.0.14_fileio_56_rndrd 19.6
> mysql_sysb1.0.14_fileio_56_seqrd 12.1
> mysql_sysb1.0.14_fileio_56_rndwr 0.4
> mysql_sysb1.0.14_fileio_56_seqrewr -0.3
>
> pgsql_sysb1.0.14_fileio_56_rndrd 19.5
> pgsql_sysb1.0.14_fileio_56_seqrd 8.6
> pgsql_sysb1.0.14_fileio_56_rndwr 1.0
> pgsql_sysb1.0.14_fileio_56_seqrewr 0.5
>
> opatch_time_ASM_12.2.0.1.0_HP2M 7.5
> select-1_users-warm_asmm_ASM_12.2.0.1.0_HP2M 5.1
> select-1_users_asmm_ASM_12.2.0.1.0_HP2M 4.4
> swingbenchv3_asmm_soebench_ASM_12.2.0.1.0_HP2M 5.8
>
> lm3_memlat_L2 4.8
> lm3_memlat_L1 0.0
>
> ub_gcc_56CPUs-56copies_Pipe-based_Context_Switching 60.1
> ub_gcc_56CPUs-56copies_Shell_Scripts_1_concurrent 5.2
> ub_gcc_56CPUs-56copies_Shell_Scripts_8_concurrent -3.0
> ub_gcc_56CPUs-56copies_File_Copy_1024_bufsize_2000_maxblocks 2.4
>
> X5-2: 2 sockets * 18 cores * 2 hyperthreads = 72 CPUs
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
>
> NAS_OMP
> bench class ncpu %improved(Mops)
> dc B 72 1.3
> is C 72 0.9
> is D 72 0.7
>
> sysbench mysql, average of 24 runs
> --- base --- --- new ---
> nthr events %stdev events %stdev %speedup
> 1 331.0 0.25 331.0 0.24 -0.1
> 2 661.3 0.22 661.8 0.22 0.0
> 4 1297.0 0.88 1300.5 0.82 0.2
> 8 2420.8 0.04 2420.5 0.04 -0.1
> 16 4826.3 0.07 4825.4 0.05 -0.1
> 32 8815.3 0.27 8830.2 0.18 0.1
> 64 12823.0 0.24 12823.6 0.26 0.0
>
> --------------------------------------------------------------
>
> Steve Sistare (10):
> sched: Provide sparsemask, a reduced contention bitmap
> sched/topology: Provide hooks to allocate data shared per LLC
> sched/topology: Provide cfs_overload_cpus bitmap
> sched/fair: Dynamically update cfs_overload_cpus
> sched/fair: Hoist idle_stamp up from idle_balance
> sched/fair: Generalize the detach_task interface
> sched/fair: Provide can_migrate_task_llc
> sched/fair: Steal work from an overloaded CPU when CPU goes idle
> sched/fair: disable stealing if too many NUMA nodes
> sched/fair: Provide idle search schedstats
>
> include/linux/sched/topology.h | 1 +
> include/linux/sparsemask.h | 260 +++++++++++++++++++++++++++++++
> kernel/sched/core.c | 30 +++-
> kernel/sched/fair.c | 338 +++++++++++++++++++++++++++++++++++++----
> kernel/sched/features.h | 6 +
> kernel/sched/sched.h | 13 +-
> kernel/sched/stats.c | 11 +-
> kernel/sched/stats.h | 13 ++
> kernel/sched/topology.c | 117 +++++++++++++-
> lib/Makefile | 2 +-
> lib/sparsemask.c | 142 +++++++++++++++++
> 11 files changed, 898 insertions(+), 35 deletions(-)
> create mode 100644 include/linux/sparsemask.h
> create mode 100644 lib/sparsemask.c
>
> --
> 1.8.3.1
>
>

Hi Steve,

Tried your patchset on ThunderX2 with 2 nodes. Please find my observations below.

Hackbench was run on a single node due to run-to-run variance on 2 nodes,
and it showed improvement under load.

Single node hackbench numbers:
group   old time   new time   steals     %change
1       6.717      7.275      21         -8.31
2       8.449      9.268      106        -9.69
3       12.035     12.761     173071     -6.03
4       14.648     9.787      595889     33.19
8       22.513     18.329     2397394    18.58
16      39.861     36.263     3949903    9.06

column "new time" shows hackbench runtime in seconds with the patchset.

The benchmarks below were tried with 2 nodes, but no performance benefit or
degradation was observed across multiple runs.
- MySQL (read/write/PS etc with sysbench)
- HHVM running oss-performance benchmarks

Shijith