2020-01-21 06:35:18

by Parth Shah

Subject: [RFC v6 0/5] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations

This is the 6th version of the patch series to sustain Turbo frequencies
for longer durations.

The previous versions can be found here:
v5: https://lkml.org/lkml/2019/10/7/118
v4: https://lkml.org/lkml/2019/7/25/296
v3: https://lkml.org/lkml/2019/6/25/25
v2: https://lkml.org/lkml/2019/5/15/1258
v1: https://lwn.net/Articles/783959/

The changes across versions are:
v5 -> v6:
- Addressed comments from Vincent Guittot and Hillf Danton
- Rebased the series on top of the latency_nice patch series posted at
https://lkml.org/lkml/2020/1/16/319. This allows the use cases in [1] to
classify small background tasks from userspace through the latency_nice
framework.
ToDo:
- Add documentation for TurboSched, including possible regressions, as per
the comment from Pavel Machek

v4 -> v5:
- Removed the core capacity calculation for finding a non-idle core
- Use idle_cpu() and cpu_overutilized() to find the core for task packing
- This changes the functionality a bit; updated results for the POWER9
system accordingly
- Renamed the ambiguous term "jitter" to "small background" tasks

v3 -> v4:
- Based on Patrick Bellasi's comments, removed the use of the UCLAMP-based
mechanism to classify tasks as jitter
- Added support to sched_setattr to mark a task as jitter by adding a new
flag to the existing task_struct->flags attribute. This avoids adding a
new variable to task_struct and thus prevents size bloat.
- No functional changes

v2 -> v3:
- Added a new attribute in task_struct to allow per-task jitter
classification, so that the scheduler can use it as a request to change
the wakeup path for task packing
- Use a syscall for jitter classification; removed cgroup-based task
classification
- Use a mutex instead of a spinlock to avoid the task-sleeping problem
- Changed _Bool->int everywhere
- Split a few patches to keep arch-specific code separate from core
scheduler code

v1 -> v2:
- No classification of CPU-bound tasks; only jitter tasks are classified,
via the cpu cgroup controller
- Use a spinlock rather than a mutex to count the number of jitter tasks
classified from cgroup
- The architecture-specific core capacity multiplication factor changes
dynamically based on the number of active threads in the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask"; instead use the sched_domain for the DIE
domain to find a better fit



Abstract
========

Modern servers allow multiple cores to run at a range of frequencies
higher than the rated frequency range. However, the power budget of the
system inhibits sustaining these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to the remaining busy cores, allowing them to
sustain higher frequencies.

One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but doing so indiscriminately may incur a performance penalty
for the packed tasks, in which case sustaining higher frequencies is of no
benefit. However, if one can identify unimportant, low-utilization tasks
that can be packed onto already-active cores, then waking up new cores can
be avoided. Such tasks are short and/or bursty "background tasks", and
waking up a new core is expensive for them.

The current CFS algorithm in the kernel scheduler is performance oriented
and hence tries to assign an idle CPU first when waking up tasks. This
policy is right for the major categories of workloads, but for background
tasks one can save energy by packing them onto the active cores and
letting those cores run at higher frequencies.

This patch-set tunes the task wakeup logic in the scheduler to pack
exclusively classified background tasks onto busy cores. The
classification of such tasks is done through a syscall-based mechanism.

In brief, if we can pack such small background tasks onto busy cores, we
can save power by keeping the other cores idle and allow the busier cores
to run at turbo frequencies. The patch-set implements this in the simplest
manner, by packing only tasks with latency_nice == 19 and utilization
<= 12.5%.


Implementation
==============

These patches use the latency_nice [3] syscall-based mechanism to classify
tasks as small background noise. The task wakeup logic uses this
information to pack such tasks onto cores that are already busy running
CPU-intensive tasks. The packing is done only in `select_task_rq_fair`, so
that in case of a wrong decision the load balancer may still pull the
classified background tasks away to maximize performance.
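
For illustration, a task could classify itself as small background from
userspace roughly as follows. This is a minimal sketch assuming the
sched_attr layout and flag proposed in the latency_nice series [3]; the
sched_latency_nice field and the SCHED_FLAG_LATENCY_NICE value are taken
from that RFC and may change:
```
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Layout as proposed in the latency_nice RFC [3]; subject to change. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* utilization clamp fields */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	/* new field proposed by the latency_nice series [3] */
	int32_t  sched_latency_nice;
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* value per the RFC; may change */

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = 19;	/* mark self as small background */

	/* glibc has no wrapper for sched_setattr; call it directly. */
	return syscall(SYS_sched_setattr, 0 /* self */, &attr, 0);
}
```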

We define a core to be non-idle if at least one of its CPUs has >12.5%
utilization and no more than one of its CPUs is overutilized (>80%
utilization); background tasks are packed onto such cores using a
first-fit approach, sketched below.

The 12.5% utilization threshold indicates that the CPU is sufficiently
busy not to enter deeper idle states (target_residency >= 10ms), so tasks
can be packed there.
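
In rough pseudo-C, the first-fit walk over cores looks like the sketch
below. This is a simplification of the actual logic in patch 3, not the
kernel code itself; counter names and structure are approximate:
```
/*
 * Simplified sketch of the first-fit packing walk (patch 3); names are
 * illustrative, not the exact kernel symbols.
 */
static int select_non_idle_core_sketch(struct task_struct *p, int prev_cpu)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int iter_cpu, sibling;

	/* Only consider online CPUs the task is allowed to run on. */
	cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);

	for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
		int busy_count = 0, overutil_count = 0;

		/* Inspect all SMT siblings of this core. */
		for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
			if (!idle_cpu(sibling) && merely_used_cpu(sibling))
				busy_count++;		/* >12.5% utilized */
			if (cpu_overutilized(sibling))
				overutil_count++;	/* >80% utilized */
		}

		/*
		 * First fit: pick the first core that is already running
		 * something but has at most one overutilized CPU.
		 */
		if (busy_count && overutil_count <= 1)
			return iter_cpu;
	}

	return -1;	/* no suitable core; fall back to normal wakeup */
}
```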

To demonstrate/benchmark this, the patches use turbo_bench, a synthetic
workload generator [2].

The following snippet demonstrates the use of the TurboSched feature:
```
i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
```
This spawns 2*i threads in total: i CPU-bound and i low-utilization
threads.

The current implementation packs only tasks classified as small background
onto the first busy cores found, but it could be further optimized by
taking userspace input about important tasks and keeping track of them.
This would make the search for non-idle cores both faster and more
accurate, since userspace hints are safer than automatic classification of
busy cores/tasks.


Result
======

The patch-set proves useful for systems and workloads where a frequency
boost is more beneficial than spreading tasks across cores. An IBM POWER9
system shows a frequency benefit of up to 18%, which translates to a
maximum workload benefit of up to 14%.

(higher is better)

Frequency benefit of TurboSched w.r.t. CFS
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
20 +-+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +-+
| ** Frequency benefit in % |
| ** |
15 +-+ ** +-+
| **** ** ** |
| * **** ****** |
10 +-+ * **** ****** +-+
| * **** ****** |
| * * ************ * |
5 +-+ * * ************ * * ** +-+
| ** * * ************ * * **** |
0 +-******** * * ************ * * ************ * * * ********** * * * **+
| ** |
| |
-5 +-+ +-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
0 1 2 3 4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031
No. of workload threads

Performance benefit of TurboSched w.r.t. CFS
20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
| Performance benefit in % |
15 +-+ ** +-+
| ** |
| ****** |
10 +-+ ****** +-+
| ********** |
| ************ |
5 +-+ ************ * ** +-+
| ************ * **** |
| * ************ * * ****** ** |
0 +-******** * * ************ * * ************ * * * ********** * * * **+
| ** ** * |
| |
-5 +-+ +-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
0 1 2 3 4 5 6 7 8 91011 1213141516171819 2021222324252627 28293031
No. of workload threads


These numbers are w.r.t. the `turbo_bench.c` multi-threaded benchmark,
which can create two kinds of tasks: CPU-bound (high utilization) and
background (low utilization). N on the X-axis represents N CPU-bound and
N background tasks spawned. The performance (operations per second) graph
indicates that the benefit with TurboSched can be up to 14% compared to
the CFS task placement strategy for such background-classified tasks.


Series organization
===================
- Patches 1-2: Small background task classification using a syscall
- Patch 3    : Tune the CFS task wakeup logic to pack such tasks onto busy
cores
- Patches 4-5: Change the non-idle core search domain to LLC by default
and provide an arch hook to change it to the NUMA domain for powerpc.

The series can be applied on top of the latency_nice attribute
introduction patches [3].


References
==========
[1]. Usecases for the per-task latency-nice attribute,
https://lkml.org/lkml/2019/9/30/215
[2]. Test benchmark: turbo_bench,
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
[3]. Introduce per-task latency_nice for scheduler hints,
https://lkml.org/lkml/2020/1/16/319


Parth Shah (5):
sched: Introduce switch to enable TurboSched for task packing
sched/core: Update turbo_sched count only when required
sched/fair: Tune task wake-up logic to pack small background tasks on
fewer cores
sched/fair: Provide arch hook to find domain for non idle core search
scan
powerpc: Set turbo domain to NUMA node for task packing

arch/powerpc/include/asm/topology.h | 3 +
arch/powerpc/kernel/smp.c | 7 +++
kernel/sched/core.c | 37 +++++++++++
kernel/sched/fair.c | 95 ++++++++++++++++++++++++++++-
kernel/sched/sched.h | 15 +++++
5 files changed, 156 insertions(+), 1 deletion(-)

--
2.17.2


2020-01-21 06:36:19

by Parth Shah

Subject: [RFC v6 4/5] sched/fair: Provide arch hook to find domain for non idle core search scan

Specify the method which returns cpumask within which to limit the
search for a non idle core. By default, limit the search in LLC domain
which usually includes few/all the cores in the processor chip.

select_non_idle_core searches for non-idle cores in the LLC domain. But on
systems with multiple NUMA domains, the turbo frequency can be sustained
within a NUMA domain without being affected by the other NUMA domains. For
such cases, arch_turbo_domain can be overridden to change the domain for
the non-idle core search.
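
As an illustration of such an override, the powerpc hook from patch 5
(not reproduced in this excerpt) amounts to something like the sketch
below, returning the NUMA node span instead of the LLC span; treat the
exact form as approximate:
```
/* Sketch of the powerpc override (patch 5): use the NUMA node span. */

/* arch/powerpc/include/asm/topology.h */
#define arch_turbo_domain powerpc_turbo_domain
struct cpumask *powerpc_turbo_domain(int cpu);

/* arch/powerpc/kernel/smp.c */
struct cpumask *powerpc_turbo_domain(int cpu)
{
	/* Limit the non-idle core search to this CPU's NUMA node. */
	return cpumask_of_node(cpu_to_node(cpu));
}
```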

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8643e6309451..af19e1f9d56d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5890,6 +5890,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
/* Define non-idle CPU as the one with the utilization >= 12.5% */
#define merely_used_cpu(util) ((cpu_util(util)) > (100 >> 3))

+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+ return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
/*
* Classify small background tasks with higher latency_nice value for task
* packing.
@@ -5916,6 +5923,7 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu)
int iter_cpu, sibling;

cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+ cpumask_and(cpus, cpus, arch_turbo_domain(prev_cpu));

for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
int idle_cpu_count = 0, non_idle_cpu_count = 0;
--
2.17.2