2019-06-25 04:38:01

by Parth Shah

Subject: [RFCv3 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations


This is the 3rd version of the patchset to sustain Turbo frequencies for
longer durations.

The previous versions can be found here:
v2: https://lkml.org/lkml/2019/5/15/1258
v1: https://lwn.net/Articles/783959/

The changes in this version are:
v[2] -> v[3]:
- Added a new attribute in task_struct to allow per-task jitter
classification so that the scheduler can use it as a request to change the
wakeup path for task packing
- Use a syscall for jitter classification; removed cgroup-based task
classification
- Use a mutex instead of a spinlock to get rid of the task-sleeping problem
- Changed _Bool->int everywhere
- Split a few patches to keep arch-specific code separate from core
scheduler code
ToDo:
- Recompute core capacity only during CPU-Hotplug operation
- Remove smt capacity

v[1] -> v[2]:
- No classification of CPU-bound tasks; only jitter tasks are classified
from the cpu cgroup controller
- Use of a spinlock rather than a mutex to count the number of jitters in
the system classified from the cgroup
- Architecture-specific implementation of the core capacity multiplication
factor, which changes dynamically based on the number of active threads in
the core
- Selection of a non-idle core in the system is bounded by the DIE domain
- Use of the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask"; instead use the sd for the DIE domain to find
a better fit



Abstract
========

Modern servers allow multiple cores to run at a range of frequencies higher
than the rated frequency range. But the power budget of the system inhibits
sustaining these higher frequencies for longer durations.

However, when certain cores are put into idle states, the power can be
effectively channelled to other busy cores, allowing them to sustain the
higher frequency.

One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but this may incur a performance penalty for the packed tasks,
in which case sustaining higher frequencies is of no benefit. However, if
one can identify unimportant low-utilization tasks which can be packed on
already active cores, then waking up new cores can be avoided. Such tasks
are short and/or bursty "jitter tasks", and waking up a new core for them is
expensive.

The current CFS algorithm in the kernel scheduler is performance oriented
and hence tries to assign an idle CPU first when waking up tasks. This
policy works well for most workload categories, but for jitter tasks one can
save energy by packing them onto already active cores and allowing those
cores to run at higher frequencies.

This patch-set tunes the task wake-up logic in the scheduler to pack only
explicitly classified jitter tasks onto busy cores. The work involves
classifying jitter tasks through a syscall-based mechanism.

In brief, if we can pack jitter tasks onto busy cores then we can save power
by keeping the other cores idle and allow the busier cores to run at turbo
frequencies; the patch-set tries to achieve this in the simplest manner.
There are still some implementation challenges (like smt_capacity, unneeded
arch hooks, etc.) that I am trying to work around, and it would be great to
have a discussion around these patches.


Implementation
==============

These patches use the UCLAMP mechanism[2], which clamps utilization from
userspace, to classify jitter tasks. The task wake-up logic uses this
information to pack such tasks onto cores which are already busy running
CPU-intensive tasks. The task packing is done at `select_task_rq_fair` only,
so that in case of a wrong decision the load balancer may pull the
classified jitter tasks away to maximize performance.

Any task clamped with cpu.util.max=1 (via the sched_setattr syscall) is
classified as a jitter task. We define a core to be non-idle if it is
utilized above 12.5% of its capacity; jitter tasks are packed onto these
cores using a first-fit approach.
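
For reference, the non-idle check used by the packing logic (as introduced
in patch 6) reduces to a single shift; a minimal sketch, with an
illustrative capacity value in the trailing comment:
```
/*
 * A core is treated as non-idle when its aggregated utilization is at
 * least 1/8th (12.5%) of its capacity; cores below this threshold are
 * skipped by the first-fit packing loop (see patch 6).
 */
#define UNDERUTILIZED_THRESHOLD 3	/* core_capacity >> 3 == 12.5% */

static inline bool core_underutilized(unsigned long core_util,
				      unsigned long core_capacity)
{
	return core_util < (core_capacity >> UNDERUTILIZED_THRESHOLD);
}

/* e.g. core_capacity = 1024  ->  threshold = 1024 >> 3 = 128 */
```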

To demonstrate/benchmark, one can use a synthetic workload generator
`turbo_bench.c`[1] available at
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c

The following snippet demonstrates the use of the TurboSched feature,
spawning i CPU-bound and i jitter threads (see patch 1):
```
i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
```

The current implementation packs only jitter-classified tasks onto the first
busy cores, but it can be further optimized by taking userspace input about
important tasks and keeping track of them. This would make the search for
non-idle cores faster and also more accurate, as userspace hints are safer
than automatically classified busy cores/tasks.


Result
======

The patch-set proves to be useful for systems and workloads where the gain
from the frequency boost outweighs the cost of packing tasks onto fewer
cores. On an IBM POWER9 system, the benefit for such a workload can be up to
13%.

Performance benefit of TurboSched w.r.t. CFS
[ASCII chart: performance benefit in % (y-axis, -5 to 15) vs. no. of
workload threads (x-axis, 2 to 24)]


Frequency benefit of TurboSched w.r.t. CFS
[ASCII chart: frequency benefit in % (y-axis, -5 to 15) vs. no. of
workload threads (x-axis, 2 to 24)]


These numbers are for the `turbo_bench.c` multi-threaded benchmark, which
can create two kinds of tasks: CPU-bound (high utilization) and jitter (low
utilization). A value of N on the x-axis represents N CPU-bound and N jitter
tasks spawned.


Series organization
===================
- Patches [01-03]: Jitter task classification using a syscall
- Patches [04-05]: Define core capacity to limit task packing
- Patches [06-08]: Tune the CFS task wake-up logic to pack tasks onto busy
cores

The series can be applied on top of Patrick Bellasi's UCLAMP RFCv9[2]
patches, on a branch based on tip/sched/core and with the UCLAMP_TASK_GROUP
config option enabled.


References
==========

[1] "Turbo_bench: Synthetic workload generator"
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c

[2] "Patrick Bellasi, Add utilization clamping support"
https://lkml.org/lkml/2019/5/15/212



Parth Shah (8):
sched/core: Add manual jitter classification using sched_setattr
syscall
sched: Introduce switch to enable TurboSched mode
sched/core: Update turbo_sched count only when required
sched/fair: Define core capacity to limit task packing
powerpc: Define Core Capacity for POWER systems
sched/fair: Tune task wake-up logic to pack jitter tasks
sched/fair: Bound non idle core search within LLC domain
powerpc: Set turbo domain to NUMA node for task packing

arch/powerpc/include/asm/topology.h | 7 ++
arch/powerpc/kernel/smp.c | 38 ++++++++
include/linux/sched.h | 6 ++
kernel/sched/core.c | 35 +++++++
kernel/sched/fair.c | 141 +++++++++++++++++++++++++++-
kernel/sched/sched.h | 9 ++
6 files changed, 235 insertions(+), 1 deletion(-)

--
2.17.1


2019-06-25 04:38:06

by Parth Shah

Subject: [RFCv3 1/8] sched/core: Add manual jitter classification using sched_setattr syscall

Jitter tasks are short/bursty tasks, typically performing some housekeeping,
and are less important in the overall scheme of things. This patch provides
a mechanism based on Patrick Bellasi's UCLAMP framework to classify jitter
tasks.

We define jitter tasks as those whose util.max in the UCLAMP framework is
the least (=0). This also provides the benefit of running jitter tasks at
the lowest frequency, which is useful if all jitters are packed onto a
separate core.

UCLAMP already provides a way to set util.max for a task via a syscall
interface. This patch uses the `sched_setattr` syscall to set the
sched_util_max attribute of a task, which is then used to classify the task
as jitter.
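
For illustration, a minimal userspace sketch of the classification step
(assuming the struct sched_attr layout and the SCHED_FLAG_UTIL_CLAMP_MAX
flag from the uclamp series [2]; glibc provides no wrapper, so the raw
syscall is used):
```
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>	/* SCHED_NORMAL, SCHED_FLAG_UTIL_CLAMP_MAX */
#include <linux/sched/types.h>	/* struct sched_attr (with uclamp fields) */

/*
 * Clamp util.max of the calling task to the lowest possible value,
 * marking it as a jitter task for TurboSched.
 */
static int mark_self_as_jitter(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_NORMAL,
		.sched_flags	= SCHED_FLAG_UTIL_CLAMP_MAX,
		.sched_util_max	= 0,	/* least clamp => jitter */
	};

	return syscall(SYS_sched_setattr, 0 /* self */, &attr, 0);
}
```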

Use Case with turbo_bench.c
===========================
```
i=8;
./turbo_bench -t 30 -h $i -n $((2*i)) -j
```
This spawns 2*i threads in total: i CPU-bound and i jitter threads.

Signed-off-by: Parth Shah <[email protected]>
---
include/linux/sched.h | 6 ++++++
kernel/sched/core.c | 9 +++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2d80e6a187d..2bd9f75a3abb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,12 @@ struct task_struct {
struct uclamp_se uclamp_req[UCLAMP_CNT];
/* Effective clamp values used for a scheduling entity */
struct uclamp_se uclamp[UCLAMP_CNT];
+ /*
+ * Tag the task as jitter.
+ * 0 = regular. Follows regular CFS policy for task placement.
+ * 1 = Jitter tasks. Should be packed to reduce active core count.
+ */
+ unsigned int is_jitter;
#endif

#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab0aa319fe60..19c7204d6351 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1189,6 +1189,15 @@ static void __setscheduler_uclamp(struct task_struct *p,
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
+
+ /*
+ * Set task to jitter class if Max util is clamped to the least
+ * possible value
+ */
+ if (p->uclamp_req[UCLAMP_MAX].bucket_id == 0 && !p->is_jitter)
+ p->is_jitter = 1;
+ else if (p->is_jitter)
+ p->is_jitter = 0;
}
}

--
2.17.1

2019-06-25 04:38:19

by Parth Shah

Subject: [RFCv3 6/8] sched/fair: Tune task wake-up logic to pack jitter tasks

The algorithm finds the first non-idle core in the system and tries to place
the task on the least utilized CPU of the chosen core. To maintain cache
hotness, the search for a non-idle core starts from prev_cpu, which also
reduces task ping-pong behaviour inside the core.

This patch defines a new method named core_underutilized() which determines
whether the core utilization is less than 12.5% of its capacity. Since a
core with low utilization should not be selected for packing, the
under-utilization margin is kept at 12.5% of the core capacity.

12.5% is an experimental number used to decide whether a core is considered
idle or not. For task packing, the algorithm should select the best core
where the task can be accommodated without waking up an idle core. At the
same time, jitter tasks should not be placed on a core which is about to go
idle. If a core has an aggregated utilization below 12.5%, it may go idle
soon, and hence packing on such a core should be avoided. For example, with
a scaled core capacity of 1024, the threshold works out to 1024 >> 3 = 128.
Experiments showed that a threshold of 12.5% gives better decision
capability for not selecting a core which will soon idle out.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/fair.c | 116 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 114 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff3f88d788d8..9d11631ce18c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5318,6 +5318,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Working cpumask for: load_balance, load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);

#ifdef CONFIG_NO_HZ_COMMON

@@ -5929,8 +5931,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}

-#ifdef CONFIG_SCHED_SMT
+#ifdef CONFIG_UCLAMP_TASK
+static inline bool is_task_jitter(struct task_struct *p)
+{
+ if (p->is_jitter == 1)
+ return true;

+ return false;
+}
+#else
+static inline bool is_task_jitter(struct task_struct *p)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_SCHED_SMT
#ifndef arch_scale_core_capacity
static inline unsigned long arch_scale_core_capacity(int first_thread,
unsigned long smt_cap)
@@ -5946,6 +5962,81 @@ static inline unsigned long arch_scale_core_capacity(int first_thread,
}
#endif

+/*
+ * Core is defined as under-utilized in case if the aggregated utilization of a
+ * all the CPUs in a core is less than 12.5%
+ */
+#define UNDERUTILIZED_THRESHOLD 3
+static inline bool core_underutilized(unsigned long core_util,
+ unsigned long core_capacity)
+{
+ return core_util < (core_capacity >> UNDERUTILIZED_THRESHOLD);
+}
+
+/*
+ * Try to find a non idle core in the system with spare capacity
+ * available for task packing, thereby keeping minimal cores active.
+ * Uses first fit algorithm to pack low util jitter tasks on active cores.
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+ int iter_cpu, sibling;
+
+ cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+
+ for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+ unsigned long core_util = 0;
+ unsigned long core_cap = arch_scale_core_capacity(iter_cpu,
+ capacity_of(iter_cpu));
+ unsigned long est_util = 0, est_util_enqueued = 0;
+ unsigned long util_best_cpu = ULONG_MAX;
+ int best_cpu = iter_cpu;
+ struct cfs_rq *cfs_rq;
+
+ for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+ __cpumask_clear_cpu(sibling, cpus);
+ core_util += cpu_util(sibling);
+
+ /*
+ * Keep track of least utilized CPU in the core
+ */
+ if (cpu_util(sibling) < util_best_cpu) {
+ util_best_cpu = cpu_util(sibling);
+ best_cpu = sibling;
+ }
+ }
+
+ /*
+ * Find if the selected task will fit into this core or not by
+ * estimating the utilization of the core.
+ */
+ if (!core_underutilized(core_util, core_cap)) {
+ cfs_rq = &cpu_rq(best_cpu)->cfs;
+ est_util =
+ READ_ONCE(cfs_rq->avg.util_avg) + task_util(p);
+ est_util_enqueued =
+ READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ est_util_enqueued += _task_util_est(p);
+ est_util = max(est_util, est_util_enqueued);
+ est_util = core_util - util_best_cpu + est_util;
+
+ if (est_util < core_cap) {
+ /*
+ * Try to bias towards prev_cpu to avoid task
+ * ping-pong behaviour inside the core.
+ */
+ if (cpumask_test_cpu(prev_cpu,
+ cpu_smt_mask(iter_cpu)))
+ return prev_cpu;
+
+ return best_cpu;
+ }
+ }
+ }
+
+ return select_idle_sibling(p, prev_cpu, target);
+}
#endif

/*
@@ -6402,6 +6493,23 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
return -1;
}

+#ifdef CONFIG_SCHED_SMT
+/*
+ * Select all tasks of type 1(jitter) for task packing
+ */
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+ int prev_cpu, int target)
+{
+ return select_non_idle_core(p, prev_cpu, target);
+}
+#else
+static inline int turbosched_select_non_idle_core(struct task_struct *p,
+ int prev_cpu, int target)
+{
+ return select_idle_sibling(p, prev_cpu, target);
+}
+#endif
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6467,7 +6575,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
/* Fast path */

- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ if (is_turbosched_enabled() && unlikely(is_task_jitter(p)))
+ new_cpu = turbosched_select_non_idle_core(p, prev_cpu,
+ new_cpu);
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

if (want_affine)
current->recent_used_cpu = cpu;
--
2.17.1

2019-06-25 04:38:21

by Parth Shah

Subject: [RFCv3 8/8] powerpc: Set turbo domain to NUMA node for task packing

This patch provides a powerpc architecture-specific implementation of the
turbo domain, bounding the search for a core within the NUMA node. This
provides a way to decrease the search time on specific architectures.

Signed-off-by: Parth Shah <[email protected]>
---
arch/powerpc/include/asm/topology.h | 3 +++
arch/powerpc/kernel/smp.c | 5 +++++
2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 1c777ee67180..410b94c9e1a2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,10 +133,13 @@ static inline void shared_proc_topology_init(void) {}
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
#define arch_scale_core_capacity powerpc_scale_core_capacity
+#define arch_turbo_domain powerpc_turbo_domain

unsigned long powerpc_scale_core_capacity(int first_smt,
unsigned long smt_cap);

+struct cpumask *powerpc_turbo_domain(int cpu);
+
int dlpar_cpu_readd(int cpu);
#endif
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 149a3fbf8ed3..856f7233190e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1200,6 +1200,11 @@ unsigned long powerpc_scale_core_capacity(int first_cpu,
/* Scale core capacity based on smt mode */
return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
}
+
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+ return cpumask_of_node(cpu_to_node(cpu));
+}
#endif

static inline void add_cpu_to_smallcore_masks(int cpu)
--
2.17.1

2019-06-25 04:38:47

by Parth Shah

Subject: [RFCv3 7/8] sched/fair: Bound non idle core search within LLC domain

This patch adds a method which returns the sched domain used to limit the
search for a non-idle core. By default, the search is limited to the LLC
domain, which usually includes all cores across the system.

select_non_idle_core() searches for non-idle cores across the whole system.
But on systems with multiple NUMA domains, the turbo frequency can be
sustained within a NUMA domain without being affected by the other NUMA
domains. For such cases, arch_turbo_domain can be overridden to change the
domain for the non-idle core search.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d11631ce18c..b049c9d73f1d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5973,6 +5973,13 @@ static inline bool core_underutilized(unsigned long core_util,
return core_util < (core_capacity >> UNDERUTILIZED_THRESHOLD);
}

+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+ return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
/*
* Try to find a non idle core in the system with spare capacity
* available for task packing, thereby keeping minimal cores active.
@@ -5983,7 +5990,8 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu, int target)
struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
int iter_cpu, sibling;

- cpumask_and(cpus, cpu_online_mask, p->cpus_ptr);
+ cpumask_and(cpus, cpu_online_mask, arch_turbo_domain(prev_cpu));
+ cpumask_and(cpus, cpus, p->cpus_ptr);

for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
unsigned long core_util = 0;
--
2.17.1

2019-06-25 04:38:52

by Parth Shah

Subject: [RFCv3 4/8] sched/fair: Define core capacity to limit task packing

This patch defines a method named arch_scale_core_capacity which returns the
capacity of a core. This method is used in later patches to determine
whether any spare capacity is left in the core to pack jitter tasks.

For some architectures, core capacity does not increase much with the number
of threads (or CPUs) in the core. For such cases, architecture-specific
calculations need to be done to find the core capacity.

This patch provides a default implementation for the scaling core capacity.

ToDo: As per Peter's comments, if we are getting rid of SMT capacity then we
need a different way of limiting task packing. I'm working on a solution for
this, but would like to get the community's response first to have a better
view.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8c5eb339e35..ff3f88d788d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5929,6 +5929,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}

+#ifdef CONFIG_SCHED_SMT
+
+#ifndef arch_scale_core_capacity
+static inline unsigned long arch_scale_core_capacity(int first_thread,
+ unsigned long smt_cap)
+{
+ /* Default capacity of core is sum of cap of all the threads */
+ unsigned long ret = 0;
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(first_thread))
+ ret += cpu_rq(sibling)->cpu_capacity;
+
+ return ret;
+}
+#endif
+
+#endif
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
--
2.17.1

2019-06-25 04:39:10

by Parth Shah

Subject: [RFCv3 3/8] sched/core: Update turbo_sched count only when required

Use the get/put methods to account for users of TurboSched support, such
that the feature is turned on only if there is at least one jitter task.

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef83725bd4b0..c7b628d0be2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1214,10 +1214,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
* Set task to jitter class if Max util is clamped to the least
* possible value
*/
- if (p->uclamp_req[UCLAMP_MAX].bucket_id == 0 && !p->is_jitter)
+ if (p->uclamp_req[UCLAMP_MAX].bucket_id == 0 && !p->is_jitter) {
p->is_jitter = 1;
- else if (p->is_jitter)
+ turbo_sched_get();
+ } else if (p->is_jitter) {
p->is_jitter = 0;
+ turbo_sched_put();
+ }
}
}

@@ -3215,6 +3218,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
mmdrop(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
+ if (unlikely(prev->is_jitter))
+ turbo_sched_put();
+
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);

--
2.17.1

2019-06-25 04:40:09

by Parth Shah

Subject: [RFCv3 5/8] powerpc: Define Core Capacity for POWER systems

This patch implements arch_scale_core_capacity for the powerpc arch by
scaling the capacity w.r.t. the number of online SMT threads in the core,
such that for SMT-4 the core capacity is 1.5x the capacity of a single
thread.
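
As a worked example of the scaling below (the multiplication factor being
1 + smt_mode * 0.125), assuming a per-thread capacity of 1024:
```
/*
 * powerpc_scale_core_capacity(first_cpu, 1024), with all siblings online:
 *   SMT-1: 1024                            (1.0x)
 *   SMT-2: 1024 + ((1024 * 2) >> 3) = 1280 (1.25x)
 *   SMT-4: 1024 + ((1024 * 4) >> 3) = 1536 (1.5x)
 *   SMT-8: 1024 + ((1024 * 8) >> 3) = 2048 (2.0x)
 */
```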

Signed-off-by: Parth Shah <[email protected]>
---
arch/powerpc/include/asm/topology.h | 4 ++++
arch/powerpc/kernel/smp.c | 33 +++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..1c777ee67180 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -132,6 +132,10 @@ static inline void shared_proc_topology_init(void) {}
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
+#define arch_scale_core_capacity powerpc_scale_core_capacity
+
+unsigned long powerpc_scale_core_capacity(int first_smt,
+ unsigned long smt_cap);

int dlpar_cpu_readd(int cpu);
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..149a3fbf8ed3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1169,6 +1169,39 @@ static void remove_cpu_from_masks(int cpu)
}
#endif

+#ifdef CONFIG_SCHED_SMT
+/*
+ * Calculate capacity of a core based on the active threads in the core
+ * Scale the capacity of first SM-thread based on total number of
+ * active threads in the respective smt_mask.
+ *
+ * The scaling is done such that for
+ * SMT-4, core_capacity = 1.5x first_cpu_capacity
+ * and for SMT-8, core_capacity multiplication factor is 2x
+ *
+ * So, core_capacity multiplication factor = (1 + smt_mode*0.125)
+ *
+ * @first_cpu: First/any CPU id in the core
+ * @cap: Capacity of the first_cpu
+ */
+unsigned long powerpc_scale_core_capacity(int first_cpu,
+ unsigned long cap)
+{
+ struct cpumask select_idles;
+ struct cpumask *cpus = &select_idles;
+ int cpu, smt_mode = 0;
+
+ cpumask_and(cpus, cpu_smt_mask(first_cpu), cpu_online_mask);
+
+ /* Find SMT mode from active SM-threads */
+ for_each_cpu(cpu, cpus)
+ smt_mode++;
+
+ /* Scale core capacity based on smt mode */
+ return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
+}
+#endif
+
static inline void add_cpu_to_smallcore_masks(int cpu)
{
struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
--
2.17.1

2019-06-25 04:40:33

by Parth Shah

Subject: [RFCv3 2/8] sched: Introduce switch to enable TurboSched mode

This patch creates a static key which allows the TurboSched feature to be
enabled or disabled at runtime.

The static key helps optimize the scheduler fast path when the TurboSched
feature is disabled.

The patch also provides get/put methods to keep track of the tasks using the
TurboSched feature. This allows the feature to be enabled when the first
task is classified as jitter and, similarly, disabled when the last such
task is unclassified.
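
For context, the way the static key is expected to gate the wake-up fast
path (as wired up in patch 6) is roughly:
```
	/*
	 * Wake-up fast path (see patch 6): with the static branch, the
	 * check is effectively free while TurboSched is disabled.
	 */
	if (is_turbosched_enabled() && unlikely(is_task_jitter(p)))
		new_cpu = turbosched_select_non_idle_core(p, prev_cpu, new_cpu);
	else
		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
```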

Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 20 ++++++++++++++++++++
kernel/sched/sched.h | 9 +++++++++
2 files changed, 29 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 19c7204d6351..ef83725bd4b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -61,6 +61,26 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;

+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_MUTEX(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+ mutex_lock(&turbo_sched_lock);
+ if (!turbo_sched_count++)
+ static_branch_enable(&__turbo_sched_enabled);
+ mutex_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+ mutex_lock(&turbo_sched_lock);
+ if (!--turbo_sched_count)
+ static_branch_disable(&__turbo_sched_enabled);
+ mutex_unlock(&turbo_sched_lock);
+}
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5599b33ffb8..1f239a960a6d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2426,3 +2426,12 @@ static inline bool sched_energy_enabled(void)
static inline bool sched_energy_enabled(void) { return false; }

#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+void turbo_sched_get(void);
+void turbo_sched_put(void);
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+ return static_branch_unlikely(&__turbo_sched_enabled);
+}
--
2.17.1

2019-06-28 13:16:10

by Patrick Bellasi

Subject: Re: [RFCv3 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations

On 25-Jun 10:07, Parth Shah wrote:

[...]

> Implementation
> ==============
>
> These patches uses UCLAMP mechanism[2] used to clamp utilization from the
> userspace, which can be used to classify the jitter tasks. The task wakeup
> logic uses this information to pack such tasks onto cores which are already
> running busy with CPU intensive tasks. The task packing is done at
> `select_task_rq_fair` only so that in case of wrong decision load balancer
> may pull the classified jitter tasks for maximizing performance.
>
> Any tasks clamped with cpu.util.max=1 (with sched_setattr syscall) are
> classified as jitter tasks.

I don't like this approach, it's overloading the meaning of clamps and
it also brings in unwanted side effects, like running jitter tasks at
the minimum OPP.

Do you have any expected minimum frequency for those jitter tasks?
I expect those to be relatively small tasks, but still, perhaps it makes
sense to run them at higher than the minimal OPP.

Why not just add a new dedicated per-task scheduling attribute,
e.g. SCHED_FLAG_LATENCY_TOLERANT, and manage it via
sched_{set,get}attr()?

I guess such a concept could work well on defining a generic
spread-vs-pack wakeup policy which is something Android also could
benefit from.

However, what we will still be missing is proper cgroups support.
It is not always possible and/or convenient to explicitly set per-task
attributes. But at the same time, AFAIK using cgroups to define
task properties which do not represent a "resource repartition" is
something very difficult to get accepted mainline.

In the past, back in 2011, there was an attempt to introduce a timer
slack controller, but apparently it was not very well received:

Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

But perhaps now the times are more mature and we can try to come up
with compelling cases from both the server and the mobile world.

> We define a core to be non-idle if it is over 12.5% utilized of its
> capacity;

This looks like a random number, can you elaborate on that?

> the jitters are packed over these cores using First-fit
> approach.
>
> To demonstrate/benchmark, one can use a synthetic workload generator
> `turbo_bench.c`[1] available at
> https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
>
> Following snippet demonstrates the use of TurboSched feature:
> ```
> i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
> ```
>
> Current implementation uses only jitter classified tasks to be packed on
> the first busy cores, but can be further optimized by getting userspace
> input of important tasks and keeping track of such tasks.
> This leads to optimized searching of non idle cores and also more
> accurate as userspace hints are safer than auto classified busy
> cores/tasks.

Hints from user-space look like an interesting concept; could you
elaborate on what you are thinking about in this sense?

--
#include <best/regards.h>

Patrick Bellasi

2019-06-28 16:42:51

by Parth Shah

Subject: Re: [RFCv3 0/8] TurboSched: A scheduler for sustaining Turbo Frequencies for longer durations

Hi Patrick,

Thank you for taking an interest in the patch set.


On 6/28/19 6:44 PM, Patrick Bellasi wrote:
> On 25-Jun 10:07, Parth Shah wrote:
>
> [...]
>
>> Implementation
>> ==============
>>
>> These patches uses UCLAMP mechanism[2] used to clamp utilization from the
>> userspace, which can be used to classify the jitter tasks. The task wakeup
>> logic uses this information to pack such tasks onto cores which are already
>> running busy with CPU intensive tasks. The task packing is done at
>> `select_task_rq_fair` only so that in case of wrong decision load balancer
>> may pull the classified jitter tasks for maximizing performance.
>>
>> Any tasks clamped with cpu.util.max=1 (with sched_setattr syscall) are
>> classified as jitter tasks.
>
> I don't like this approach, it's overloading the meaning of clamps and
> it also brings in un-wanted side effects, like running jitter tasks at
> the minimum OPP.
>
> Do you have any expected minimum frequency for those jitter tasks ?
> I expect those to be relatively small tasks but still perhaps it makes
> sense to run them on higher then minimal OPP.
>

I absolutely agree with you that it may overload the meaning of clamps.
AFAIK, the only way to detect jitters is by looking at their utilization,
where low-util tasks are possibly jitters unless they are important tasks.
If userspace clamps a task to the least OPP, then that is an indication of a
low-utilization or unimportant task, which we call a jitter.

Also, as we discussed at OSPM, if all the jitters are given a dedicated core
by the scheduler, then UCLAMP ensures the least OPP for such tasks, which
can save a further bit of power that can be channelled to the busier cores,
allowing them to sustain or boost turbo frequencies.

I agree that it may have side-effects, but I'm just putting the idea out
here. Also, I understand that task packing and frequency are not correlated,
but for this specific turbo-sustaining problem, jitters should be given the
least power so that other tasks can have the extra headroom; hence jitters
should be given a lower frequency.

> Why not just adding a new dedicated per-task scheduling attribute,
> e.g. SCHED_FLAG_LATENCY_TOLERANT, and manage it via
> sched_{set,get}attr() ?
>
> I guess such a concept could work well on defining a generic
> spread-vs-pack wakeup policy which is something Android also could
> benefit from.
>

I made an attempt to use per-task attributes for task classification in the
first series of TurboSched and it works fine.
https://lwn.net/ml/linux-pm/[email protected]/

Then, based on input from Dietmar, I thought of giving UCLAMP a try for this
purpose. But now I guess having one more task attribute is useful, as it can
serve multiple purposes including Android and task packing. I will add it in
v4 then.

> However, what we will still be missing is a proper cgroups support.
> Not always is possible and/or convenient to explicitly set per-task
> attributes. But at the same time, AFAIK using cgroups to define
> task properties which do not represent a "resource repartition" is
> something very difficult to get accepted mainline.
>

Yeah, I faced that problem in v2.
https://lkml.org/lkml/2019/5/15/1395

> In the past, back in 2011, there was an attempt to introduce a timer
> slack controller, but apparently it was not very well received:
>
> Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/
>
> But perhaps now the times are more mature and we can try to come up
> with compelling cases from both the server and the mobile world.
>

The patch series you pointed to seems appealing and I will have a look at it.

>> We define a core to be non-idle if it is over 12.5% utilized of its
>> capacity;
>
> This looks like a random number, can you elaborate on that?

It is an experimental value for deciding whether a "core" should be
considered idle or not. This is because, even though a core may be running a
bunch of tasks summing up to around 10% of its utilization, it may be going
into shallower idle-states periodically, which is a kind of power saving;
placing new tasks on such a core should be avoided as far as possible.

I have only tested this on SMT-4/8 systems, where it works as expected, but
in the end it is still an experimental value.

>
>> the jitters are packed over these cores using First-fit
>> approach.
>>
>> To demonstrate/benchmark, one can use a synthetic workload generator
>> `turbo_bench.c`[1] available at
>> https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
>>
>> Following snippet demonstrates the use of TurboSched feature:
>> ```
>> i=8; ./turbo_bench -t 30 -h $i -n $((i*2)) -j
>> ```
>>
>> Current implementation uses only jitter classified tasks to be packed on
>> the first busy cores, but can be further optimized by getting userspace
>> input of important tasks and keeping track of such tasks.
>> This leads to optimized searching of non idle cores and also more
>> accurate as userspace hints are safer than auto classified busy
>> cores/tasks.
>
> Hints from user-space looks like an interesting concept, could you
> better elaborate what you are thinking about in this sense?
>

Currently, we are just tagging tasks as jitters and packing them on already
busy cores (>12.5% core utilization). The packing strategy is a simple
first-fit algorithm looking for the first core in a DIE where the waking-up
jitter task can be accommodated. This is a lot of work in the fast path but
can be optimized out. If the user can also tag CPU-intensive and/or
important tasks, then we can keep track of the cores occupied by such tasks
and use them for task packing, reducing the effort of finding non-idle
cores. Again, this can be set with UCLAMP via
cpu.util-min=SCHED_CAPACITY_SCALE.

In fact, v1 does this, but then I thought of breaking the problem down into
steps; this optimization can be introduced later.
https://lwn.net/ml/linux-pm/[email protected]/

So we could have some task attribute like task_type or similar which hints
the scheduler about several features like packing, spreading, giving a
dedicated core where siblings will not be scheduled, or even core
scheduling, all of which in certain ways affect scheduling decisions.


Thanks
Parth