Abstract
========
Modern servers allow multiple cores to run at a range of frequencies
higher than the rated frequency range, but the power budget of the
system inhibits sustaining these higher frequencies for longer
durations.
However, when certain cores are put into idle states, the power can be
effectively channelled to the remaining busy cores, allowing them to
sustain the higher frequency.
One way to achieve this is to pack tasks onto fewer cores and keep the
others idle, but this may cause a performance penalty for the packed
tasks, in which case sustaining higher frequencies is of no benefit.
However, if one can identify unimportant low-utilization tasks which can
be packed onto already active cores, then waking up new cores can be
avoided. Such tasks are short and/or bursty "jitter tasks", for which
waking up a new core is expensive.
The current CFS algorithm in the kernel scheduler is performance
oriented and hence tries to assign an idle CPU first when a task wakes
up. This policy is fine for most categories of workload, but for jitter
tasks one can save power by packing them onto already active cores,
allowing the remaining cores to run at higher frequencies.
This patch set tunes the task wake-up logic in the scheduler to pack
only those tasks classified as jitter onto busy cores. The work involves
the use of additional attributes inside the "cpu" cgroup controller to
manually classify tasks as jitter.
Implementation
==============
These patches use the UCLAMP mechanism of the "cpu" cgroup controller to
classify jitter tasks. The task wake-up logic uses this information to
pack such tasks onto cores which are busy running other workloads. Task
packing is done only at `select_task_rq_fair` so that, in case of a
wrong decision, the load balancer may still pull the classified jitter
tasks to a CPU giving better performance.
Any task added to a "cpu" cgroup tagged with cpu.util.max=1 (i.e. a
value in the lowest uclamp bucket, as is util.max=0 in the snippet
below) is classified as jitter. We define a core as non-idle if it is
more than 12.5% utilized; jitters are packed onto such cores using a
first-fit approach.
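As a rough illustration of the packing criterion (the helper below is a
made-up sketch for this cover letter; the real logic lives in
core_underutilized() and select_non_idle_core() in patches 4-5):
```
/*
 * Illustrative sketch only: pack a waking jitter task onto a core iff
 * the core is already more than 12.5% utilized (so it is not about to
 * go idle) and the task's utilization still fits within the core's
 * capacity. The actual patch additionally evaluates the fit against
 * the least-utilized CPU of the core using util_est.
 */
static inline bool can_pack_jitter(unsigned long core_util,
				   unsigned long core_cap,
				   unsigned long task_util)
{
	/* Skip cores below the 12.5% under-utilization margin */
	if (core_util < (core_cap >> 3))
		return false;

	/* First fit: the first core where the task still fits wins */
	return core_util + task_util < core_cap;
}
```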
To demonstrate and benchmark this, one can use the synthetic workload
generator `turbo_bench.c` [2], available at
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
The following snippet demonstrates the use of the TurboSched feature:
```
mkdir -p /sys/fs/cgroup/cpu/jitter
echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
i=8;
./turbo_bench -t 30 -h $i -n $i &
./turbo_bench -t 30 -h 0 -n $i &
echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs
```
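Note that `$!` expands to the PID of the most recently backgrounded
command, so it is the second `turbo_bench` invocation whose threads end
up in the jitter cgroup; the first invocation stays in the root cgroup.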
The current implementation packs only jitter-classified tasks onto busy
cores, but it could be further optimized by taking userspace hints about
important tasks and keeping track of them. This would make the search
for non-idle cores faster and more accurate, since userspace hints are
safer than auto-classified busy cores/tasks.
Result
======
The patch set proves useful on systems and workloads where a frequency
boost is worth more than spreading tasks across idle cores. On an IBM
POWER9 system the benefit for such a workload can be up to 13%.
Performance benefit of TurboSched over CFS
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
15 +-+ Performance benefit in % +-+
| ** |
| ** |
10 +-+ ******** +-+
| ******** |
| ************ * |
5 +-+ ************ * +-+
| * ************ * **** |
| ** * * ************ * * ****** |
| ** * * ************ * * ************ * |
0 +-******** * * ************ * * ************ * * * ********** * * * **+
| ** **** |
| ** |
-5 +-+ ** +-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
1 2 3 4 5 6 7 8 9101112 1314151617181920 2122232425262728 29303132
Workload threads count
Frequency benefit of TurboSched over CFS
20 +-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
| Frequency benefit in % |
15 +-+ ** +-+
| ** |
| ******** |
10 +-+ * ************ +-+
| * ************ |
| * ************ |
5 +-+ * * ************ * +-+
| ** * * ************ * * |
| **** * * ************ * * ****** ** |
0 +-******** * * ************ * * ************ * * * ********** * * * **+
| ** |
| ** |
-5 +-+ ** +-+
| + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + |
+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+-+-+-+--+-+-+-+-+
1 2 3 4 5 6 7 8 9101112 1314151617181920 2122232425262728 29303132
Workload threads count
These numbers are from the `turbo_bench.c` benchmark [2], which spawns
a mix of high-utilization and low-utilization (jitter) threads. The
X-axis represents the count of each category of tasks spawned.
Series organization
===================
- Patches [01-03]: Cgroup based jitter tasks classification
- Patches [04]: Defines Core Capacity to limit task packing
- Patches [05-06]: Tune CFS task wakeup logic to pack tasks onto busy
cores
The series can be applied on top of Patrick Bellasi's UCLAMP RFCv8 [3]
patches, on a branch based on tip/sched/core and with the
UCLAMP_TASK_GROUP config option enabled.
Changelogs
==========
This patch set is a respin of TurboSched RFCv1 [1]
(https://lwn.net/Articles/783959/) and includes the following main
changes:
- No WOF task classification; only jitter tasks are classified, via the
  cpu cgroup controller
- Use of a spinlock rather than a mutex to count the number of jitters
  classified in the system via cgroup
- Architecture-specific implementation of the core capacity
  multiplication factor, which changes dynamically based on the number
  of active threads in the core
- Selection of a non-idle core in the system is bounded by the DIE
  domain
- Use of the UCLAMP mechanism to classify jitter tasks
- Removed "highutil_cpu_mask"; instead uses the sched domain for the DIE
  domain to find a better fit
References
==========
[1] "TurboSched : A scheduler for sustaining Turbo frequency for longer
durations" https://lwn.net/Articles/783959/
[2] "Turbo_bench: Synthetic workload generator"
https://github.com/parthsl/tools/blob/master/benchmarks/turbo_bench.c
[3] "Patrick Bellasi, Add utilization clamping support"
https://lore.kernel.org/lkml/[email protected]/
Parth Shah (6):
sched/core: Add manual jitter classification from cgroup interface
sched: Introduce switch to enable TurboSched mode
sched/core: Update turbo_sched count only when required
sched/fair: Define core capacity to limit task packing
sched/fair: Tune task wake-up logic to pack jitter tasks
sched/fair: Bound non idle core search by DIE domain
arch/powerpc/include/asm/topology.h | 7 ++
arch/powerpc/kernel/smp.c | 37 ++++++++
kernel/sched/core.c | 32 +++++++
kernel/sched/fair.c | 127 +++++++++++++++++++++++++++-
kernel/sched/sched.h | 8 ++
5 files changed, 210 insertions(+), 1 deletion(-)
--
2.17.1
This patch creates a static key which allows the TurboSched feature to
be enabled or disabled at runtime. The static key keeps the scheduler
fast path cheap while the TurboSched feature is disabled.
The patch also provides get/put methods to keep track of the cgroups
using the TurboSched feature. The feature is enabled when the first
cgroup classified as jitter is added, and disabled again when the last
such cgroup is removed.
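The key is queried via is_turbosched_enabled(), a
static_branch_unlikely() check, on the task wake-up fast path (patch
5/6), so the feature costs essentially nothing while no cgroup is
classified as jitter.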
Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 20 ++++++++++++++++++++
kernel/sched/sched.h | 7 +++++++
2 files changed, 27 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77aa4aee4478..facbedd2554e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,26 @@ __read_mostly int scheduler_running;
*/
int sysctl_sched_rt_runtime = 950000;
+DEFINE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+static DEFINE_SPINLOCK(turbo_sched_lock);
+static int turbo_sched_count;
+
+void turbo_sched_get(void)
+{
+ spin_lock(&turbo_sched_lock);
+ if (!turbo_sched_count++)
+ static_branch_enable(&__turbo_sched_enabled);
+ spin_unlock(&turbo_sched_lock);
+}
+
+void turbo_sched_put(void)
+{
+ spin_lock(&turbo_sched_lock);
+ if (!--turbo_sched_count)
+ static_branch_disable(&__turbo_sched_enabled);
+ spin_unlock(&turbo_sched_lock);
+}
+
/*
* __task_rq_lock - lock the rq @p resides on.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e75ffaf3ff34..0339964cdf43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2437,3 +2437,10 @@ static inline bool sched_energy_enabled(void)
static inline bool sched_energy_enabled(void) { return false; }
#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+
+DECLARE_STATIC_KEY_FALSE(__turbo_sched_enabled);
+
+static inline bool is_turbosched_enabled(void)
+{
+ return static_branch_unlikely(&__turbo_sched_enabled);
+}
--
2.17.1
The algorithm finds the first non-idle core in the system and tries to
place the task on the least utilized CPU of the chosen core. To maintain
cache hotness, the search for a non-idle core starts from prev_cpu,
which also reduces task ping-pong behaviour inside the core.
A core is defined as under-utilized when the aggregated utilization of
its CPUs is less than 12.5% of the core capacity. The function is named
core_underutilized because of its specific use in finding a non-idle
core.
This patch uses the core_underutilized method to decide whether a core
should be considered sufficiently busy or not. Since a core with low
utilization should not be selected for packing, the under-utilization
margin is kept at 12.5% of core capacity. This number is experimental
and can be modified as needed; the larger the number, the more
aggressive the task packing will be.
12.5% is an experimental number which identifies whether a core should
be considered idle or not. For task packing, the algorithm should select
the best core where the task can be accommodated without waking up an
idle core. But jitter tasks should also not be placed on a core which is
about to go idle: a core with an aggregated utilization below 12.5% may
go idle soon, so packing on such a core should be avoided. Experiments
showed that keeping this threshold at 12.5% gives good decision
capability for not selecting a core which will soon idle out.
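For illustration, assuming a core capacity of 1024 (the value is only an
example), the margin works out to 1024 >> 3 = 128: a core whose CPUs
together report less than 128 units of utilization is treated as likely
to go idle soon and is skipped for packing.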
Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/fair.c | 100 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 99 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2578e6bdf85b..d2d556eb6d0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5323,6 +5323,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Working cpumask for: load_balance, load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+/* A cpumask to find active cores in the system. */
+DEFINE_PER_CPU(cpumask_var_t, turbo_sched_mask);
#ifdef CONFIG_NO_HZ_COMMON
/*
@@ -6248,6 +6250,73 @@ static inline unsigned long arch_scale_core_capacity(int first_thread,
}
#endif
+/*
+ * Core is defined as under-utilized in case if the aggregated utilization of a
+ * all the CPUs in a core is less than 12.5%
+ */
+static inline bool core_underutilized(unsigned long core_util,
+ unsigned long core_capacity)
+{
+ return core_util < (core_capacity >> 3);
+}
+
+/*
+ * Try to find a non idle core in the system with spare capacity
+ * available for task packing, thereby keeping minimal cores active.
+ * Uses first fit algorithm to pack low util jitter tasks on active cores.
+ */
+static int select_non_idle_core(struct task_struct *p, int prev_cpu)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
+ int iter_cpu, sibling;
+
+ cpumask_and(cpus, cpu_online_mask, &p->cpus_allowed);
+
+ for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
+ unsigned long core_util = 0;
+ unsigned long core_cap = arch_scale_core_capacity(iter_cpu,
+ capacity_of(iter_cpu));
+ unsigned long est_util = 0, est_util_enqueued = 0;
+ unsigned long util_best_cpu = (unsigned int)-1;
+ int best_cpu = iter_cpu;
+ struct cfs_rq *cfs_rq;
+
+ for_each_cpu(sibling, cpu_smt_mask(iter_cpu)) {
+ __cpumask_clear_cpu(sibling, cpus);
+ core_util += cpu_util(sibling);
+
+ /*
+ * Keep track of least utilized CPU in the core
+ */
+ if (cpu_util(sibling) < util_best_cpu) {
+ util_best_cpu = cpu_util(sibling);
+ best_cpu = sibling;
+ }
+ }
+
+ /*
+ * Find if the selected task will fit into the tracked minutil
+ * CPU or not by estimating the utilization of the CPU.
+ */
+ cfs_rq = &cpu_rq(best_cpu)->cfs;
+ est_util = READ_ONCE(cfs_rq->avg.util_avg) + task_util(p);
+ est_util_enqueued = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ est_util_enqueued += _task_util_est(p);
+ est_util = max(est_util, est_util_enqueued);
+
+ if (!core_underutilized(core_util, core_cap) && est_util < core_cap) {
+ /*
+ * Try to bias towards prev_cpu to avoid task ping-pong
+ * behaviour inside the core.
+ */
+ if (cpumask_test_cpu(prev_cpu, cpu_smt_mask(iter_cpu)))
+ return prev_cpu;
+
+ return best_cpu;
+ }
+ }
+ return -1;
+}
#endif
/*
@@ -6704,6 +6773,31 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
return -1;
}
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Select all tasks of type 1(jitter) for task packing
+ */
+static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
+ int target)
+{
+ int new_cpu;
+
+ if (unlikely(task_group(p)->turbo_sched_enabled)) {
+ new_cpu = select_non_idle_core(p, prev_cpu);
+ if (new_cpu >= 0)
+ return new_cpu;
+ }
+
+ return select_idle_sibling(p, prev_cpu, target);
+}
+#else
+static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
+ int target)
+{
+ return select_idle_sibling(p, prev_cpu, target);
+}
+#endif
+
/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6769,7 +6863,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
} else if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ if (is_turbosched_enabled())
+ new_cpu = turbosched_select_idle_sibling(p, prev_cpu,
+ new_cpu);
+ else
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
if (want_affine)
current->recent_used_cpu = cpu;
--
2.17.1
Task packing on a core needs to be bounded by the core's capacity. This
patch defines a new method which acts as the tipping point for task
packing.
The core capacity is what limits task packing beyond a certain point. In
general, the capacity of a core is defined as the aggregated capacity of
all the CPUs in the core.
Some architectures do not have a core capacity that increases linearly
with the number of threads (or CPUs) in the core. For such cases,
architecture-specific calculations need to be done to find the core
capacity.
`arch_scale_core_capacity` is currently tuned for the `powerpc` arch by
scaling the capacity w.r.t. the number of online SMT threads in the
core.
The patch provides a default handler for other architectures which
computes the core capacity as the sum of the capacities of all the
threads in the core.
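For illustration, assuming a per-thread capacity of 1024: with SMT-4 the
scaled core capacity is 1024 + ((1024 * 4) >> 3) = 1536, i.e. 1.5x, and
with SMT-8 it is 1024 + ((1024 * 8) >> 3) = 2048, i.e. 2x, matching the
(1 + smt_mode * 0.125) multiplication factor described in the powerpc
handler below.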
ToDo: The SMT mode is calculated each time a jitter task wakes up,
adding redundant decision time; this could be eliminated by keeping
track of online CPUs during CPU hotplug.
Signed-off-by: Parth Shah <[email protected]>
---
arch/powerpc/include/asm/topology.h | 4 ++++
arch/powerpc/kernel/smp.c | 32 +++++++++++++++++++++++++++++
kernel/sched/fair.c | 19 +++++++++++++++++
3 files changed, 55 insertions(+)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..1c777ee67180 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -132,6 +132,10 @@ static inline void shared_proc_topology_init(void) {}
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
+#define arch_scale_core_capacity powerpc_scale_core_capacity
+
+unsigned long powerpc_scale_core_capacity(int first_smt,
+ unsigned long smt_cap);
int dlpar_cpu_readd(int cpu);
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index e784342bdaa1..256ab2a50f6e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1173,6 +1173,38 @@ static void remove_cpu_from_masks(int cpu)
}
#endif
+#ifdef CONFIG_SCHED_SMT
+/*
+ * Calculate capacity of a core based on the active threads in the core
+ * Scale the capacity of first SM-thread based on total number of
+ * active threads in the respective smt_mask.
+ *
+ * The scaling is done such that for
+ * SMT-4, core_capacity = 1.5x first_cpu_capacity
+ * and for SMT-8, core_capacity multiplication factor is 2x
+ *
+ * So, core_capacity multiplication factor = (1 + smt_mode*0.125)
+ *
+ * @first_cpu: First/any CPU id in the core
+ * @cap: Capacity of the first_cpu
+ */
+inline unsigned long powerpc_scale_core_capacity(int first_cpu,
+ unsigned long cap) {
+ struct cpumask select_idles;
+ struct cpumask *cpus = &select_idles;
+ int cpu, smt_mode = 0;
+
+ cpumask_and(cpus, cpu_smt_mask(first_cpu), cpu_online_mask);
+
+ /* Find SMT mode from active SM-threads */
+ for_each_cpu(cpu, cpus)
+ smt_mode++;
+
+ /* Scale core capacity based on smt mode */
+ return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
+}
+#endif
+
static inline void add_cpu_to_smallcore_masks(int cpu)
{
struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7eea9dc4644..2578e6bdf85b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6231,6 +6231,25 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return cpu;
}
+#ifdef CONFIG_SCHED_SMT
+
+#ifndef arch_scale_core_capacity
+static inline unsigned long arch_scale_core_capacity(int first_thread,
+ unsigned long smt_cap)
+{
+ /* Default capacity of core is sum of cap of all the threads */
+ unsigned long ret = 0;
+ int sibling;
+
+ for_each_cpu(sibling, cpu_smt_mask(first_thread))
+ ret += cpu_rq(sibling)->cpu_capacity;
+
+ return ret;
+}
+#endif
+
+#endif
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
--
2.17.1
Use the get/put methods to enable or disable TurboSched support as
cgroups are classified or de-classified as jitter.
Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index facbedd2554e..4c55b5399985 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7216,10 +7216,13 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
* Classify the tasks belonging to the last bucket of MAX UCLAMP as
* jitters
*/
- if (uclamp_bucket_id(max_value) == 0)
+ if (uclamp_bucket_id(max_value) == 0) {
tg->turbo_sched_enabled = 1;
- else if (tg->turbo_sched_enabled)
+ turbo_sched_get();
+ } else if (tg->turbo_sched_enabled) {
tg->turbo_sched_enabled = 0;
+ turbo_sched_put();
+ }
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(css, UCLAMP_MAX);
--
2.17.1
Jitter tasks are usually less important in terms of performance and are
short/bursty in character. TurboSched uses this jitter classification to
pack jitters onto already running busy cores, in order to keep the total
idle core count high.
This patch describes the use of the UCLAMP mechanism to classify tasks.
Patrick Bellasi came up with a mechanism to classify tasks from
userspace:
https://lore.kernel.org/lkml/[email protected]/
This UCLAMP mechanism can be used to classify tasks as jitter. Jitters
can be classified per cgroup by keeping util.max of the tasks at the
lowest value (=0). This also provides the benefit of requesting the
lowest frequency for those jitter tasks, which is useful if all jitters
end up packed onto a separate core.
Use Case with UCLAMP
====================
To create a cgroup with all its tasks classified as jitters:
```
mkdir -p /sys/fs/cgroup/cpu/jitter
echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
i=8;
./turbo_bench -t 30 -h $i -n $i &
./turbo_bench -t 30 -h 0 -n $i &
echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
```
Signed-off-by: Parth Shah <[email protected]>
---
kernel/sched/core.c | 9 +++++++++
kernel/sched/sched.h | 1 +
2 files changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d42c0f5eefa9..77aa4aee4478 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
tg->uclamp_req[UCLAMP_MAX].value = max_value;
tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
+ /*
+ * Classify the tasks belonging to the last bucket of MAX UCLAMP as
+ * jitters
+ */
+ if (uclamp_bucket_id(max_value) == 0)
+ tg->turbo_sched_enabled = 1;
+ else if (tg->turbo_sched_enabled)
+ tg->turbo_sched_enabled = 0;
+
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(css, UCLAMP_MAX);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b4019012d84b..e75ffaf3ff34 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -407,6 +407,7 @@ struct task_group {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
+ bool turbo_sched_enabled;
};
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.17.1
This patch specifies the sched domain within which to search for a
non-idle core.
select_non_idle_core() searches for non-idle cores across the whole
system. But on systems with multiple NUMA domains, the Turbo frequency
can be sustained within a NUMA domain without being affected by the
other NUMA domains.
This patch provides an architecture-specific hook for defining the turbo
domain, so that the search for a core is bounded within the NUMA domain.
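Architectures which do not define arch_turbo_domain fall back to the
LLC domain span (see the default helper added below); powerpc overrides
this with the NUMA node mask via powerpc_turbo_domain().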
Signed-off-by: Parth Shah <[email protected]>
---
arch/powerpc/include/asm/topology.h | 3 +++
arch/powerpc/kernel/smp.c | 5 +++++
kernel/sched/fair.c | 10 +++++++++-
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 1c777ee67180..410b94c9e1a2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,10 +133,13 @@ static inline void shared_proc_topology_init(void) {}
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_core_id(cpu) (cpu_to_core_id(cpu))
#define arch_scale_core_capacity powerpc_scale_core_capacity
+#define arch_turbo_domain powerpc_turbo_domain
unsigned long powerpc_scale_core_capacity(int first_smt,
unsigned long smt_cap);
+struct cpumask *powerpc_turbo_domain(int cpu);
+
int dlpar_cpu_readd(int cpu);
#endif
#endif
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 256ab2a50f6e..e13ba3981891 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1203,6 +1203,11 @@ inline unsigned long powerpc_scale_core_capacity(int first_cpu,
/* Scale core capacity based on smt mode */
return smt_mode == 1 ? cap : ((cap * smt_mode) >> 3) + cap;
}
+
+inline struct cpumask *powerpc_turbo_domain(int cpu)
+{
+ return cpumask_of_node(cpu_to_node(cpu));
+}
#endif
static inline void add_cpu_to_smallcore_masks(int cpu)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d2d556eb6d0f..bd9985775db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6260,6 +6260,13 @@ static inline bool core_underutilized(unsigned long core_util,
return core_util < (core_capacity >> 3);
}
+#ifndef arch_turbo_domain
+static __always_inline struct cpumask *arch_turbo_domain(int cpu)
+{
+ return sched_domain_span(rcu_dereference(per_cpu(sd_llc, cpu)));
+}
+#endif
+
/*
* Try to find a non idle core in the system with spare capacity
* available for task packing, thereby keeping minimal cores active.
@@ -6270,7 +6277,8 @@ static int select_non_idle_core(struct task_struct *p, int prev_cpu)
struct cpumask *cpus = this_cpu_cpumask_var_ptr(turbo_sched_mask);
int iter_cpu, sibling;
- cpumask_and(cpus, cpu_online_mask, &p->cpus_allowed);
+ cpumask_and(cpus, cpu_online_mask, arch_turbo_domain(prev_cpu));
+ cpumask_and(cpus, cpus, &p->cpus_allowed);
for_each_cpu_wrap(iter_cpu, cpus, prev_cpu) {
unsigned long core_util = 0;
--
2.17.1
On Wed, May 15, 2019 at 07:23:17PM +0530, Parth Shah wrote:
> Subject: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
How can this be v2 ?! I've never seen v1.
> Jitter tasks are usually of less important in terms of performance
> and are short/bursty in characteristics. TurboSched uses this jitter
> classification to pack jitters into the already running busy cores to
> keep the total idle core count high.
>
> The patch describes the use of UCLAMP mechanism to classify tasks. Patrick
> Bellasi came up with a mechanism to classify tasks from the userspace
> https://lore.kernel.org/lkml/[email protected]/
The canonical form is:
https://lkml.kernel.org/r/$MSGID
> This UCLAMP mechanism can be useful in classifying tasks as jitter. Jitters
> can be classified for the cgroup by keeping util.max of the tasks as the
> least(=0). This also provides benefit of giving the least frequency to
> those jitter tasks, which is useful if all jitters are packed onto a
> separate core.
>
> Use Case with UCLAMP
> ===================
> To create a cgroup with all the tasks classified as jitters;
>
> ```
> mkdir -p /sys/fs/cgroup/cpu/jitter
> echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
> i=8;
> ./turbo_bench -t 30 -h $i -n $i &
> ./turbo_bench -t 30 -h 0 -n $i &
> echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
> ```
>
> Signed-off-by: Parth Shah <[email protected]>
> ---
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/sched.h | 1 +
> 2 files changed, 10 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d42c0f5eefa9..77aa4aee4478 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> tg->uclamp_req[UCLAMP_MAX].value = max_value;
> tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>
> + /*
> + * Classify the tasks belonging to the last bucket of MAX UCLAMP as
> + * jitters
> + */
> + if (uclamp_bucket_id(max_value) == 0)
> + tg->turbo_sched_enabled = 1;
> + else if (tg->turbo_sched_enabled)
> + tg->turbo_sched_enabled = 0;
> +
> /* Update effective clamps to track the most restrictive value */
> cpu_util_update_eff(css, UCLAMP_MAX);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b4019012d84b..e75ffaf3ff34 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -407,6 +407,7 @@ struct task_group {
> struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> + bool turbo_sched_enabled;
> };
Your simple patch has 3 problems:
- it limits itself; for no apparent reason; to the cgroup interface.
- it is inconsistent in the terminology; pick either jitter or
turbo-sched, and I think the latter is a horrid name, it wants to be
'pack' or something similar. Also, jitter really doesn't make sense
given the classification.
- you use '_Bool' in a composite type.
On Wed, May 15, 2019 at 07:23:18PM +0530, Parth Shah wrote:
> +void turbo_sched_get(void)
> +{
> + spin_lock(&turbo_sched_lock);
> + if (!turbo_sched_count++)
> + static_branch_enable(&__turbo_sched_enabled);
> + spin_unlock(&turbo_sched_lock);
> +}
Muwhahaha, you didn't test this code, did you?
On Wed, May 15, 2019 at 07:23:19PM +0530, Parth Shah wrote:
> Use the get/put methods to add/remove the use of TurboSched support from
> the cgroup.
Didn't anybody tell you that cgroup only interfaces are frowned upon?
On Wed, May 15, 2019 at 07:23:20PM +0530, Parth Shah wrote:
> The task packing on a core needs to be bounded based on its capacity. This
> patch defines a new method which acts as a tipping point for task packing.
>
> The Core capacity is the method which limits task packing above certain
> point. In general, the capacity of a core is defined to be the aggregated
> sum of all the CPUs in the Core.
>
> Some architectures does not have core capacity linearly increasing with the
> number of threads( or CPUs) in the core. For such cases, architecture
> specific calculations needs to be done to find core capacity.
>
> The `arch_scale_core_capacity` is currently tuned for `powerpc` arch by
> scaling capacity w.r.t to the number of online SMT in the core.
>
> The patch provides default handler for other architecture by scaling core
> capacity w.r.t. to the capacity of all the threads in the core.
>
> ToDo: SMT mode is calculated each time a jitter task wakes up leading to
> redundant decision time which can be eliminated by keeping track of online
> CPUs during hotplug task.
Urgh, we just got rid of capacity for SMT. Also I don't think the above
clearly defines your metric.
On Wed, May 15, 2019 at 07:23:21PM +0530, Parth Shah wrote:
> @@ -6704,6 +6773,31 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> return -1;
> }
>
> +#ifdef CONFIG_SCHED_SMT
> +/*
> + * Select all tasks of type 1(jitter) for task packing
> + */
> +static int turbosched_select_idle_sibling(struct task_struct *p, int prev_cpu,
> + int target)
> +{
> + int new_cpu;
> +
> + if (unlikely(task_group(p)->turbo_sched_enabled)) {
So if you build without cgroups, this is a NULL dereference.
Also, this really should not be group based.
> + new_cpu = select_non_idle_core(p, prev_cpu);
> + if (new_cpu >= 0)
> + return new_cpu;
> + }
> +
> + return select_idle_sibling(p, prev_cpu, target);
> +}
> +#else
On Wed, May 15, 2019 at 07:23:22PM +0530, Parth Shah wrote:
> This patch specifies the sched domain to search for a non idle core.
>
> The select_non_idle_core searches for the non idle cores across whole
> system. But in the systems with multiple NUMA domains, the Turbo frequency
> can be sustained within the NUMA domain without being affected from other
> NUMA.
>
> This patch provides an architecture specific implementation for defining
> the turbo domain to make searching of the core to be bound within the NUMA.
NAK, this is insane. You don't need arch hooks to find the numa domain.
On Wed, May 15, 2019 at 07:23:16PM +0530, Parth Shah wrote:
> Abstract
> ========
>
> The modern servers allows multiple cores to run at range of
> frequencies higher than rated range of frequencies. But the power budget
> of the system inhibits sustaining these higher frequencies for
> longer durations.
>
> However when certain cores are put to idle states, the power can be
> effectively channelled to other busy cores, allowing them to sustain
> the higher frequency.
>
> One way to achieve this is to pack tasks onto fewer cores keeping others idle,
> but it may lead to performance penalty for such tasks and sustaining higher
> frequencies proves to be of no benefit. But if one can identify unimportant low
> utilization tasks which can be packed on the already active cores then waking up
> of new cores can be avoided. Such tasks are short and/or bursty "jitter tasks"
> and waking up new core is expensive for such case.
>
> Current CFS algorithm in kernel scheduler is performance oriented and hence
> tries to assign any idle CPU first for the waking up of new tasks. This policy
> is perfect for major categories of the workload, but for jitter tasks, one
> can save energy by packing it onto active cores and allow other cores to run at
> higher frequencies.
>
> These patch-set tunes the task wake up logic in scheduler to pack exclusively
> classified jitter tasks onto busy cores. The work involves the use of additional
> attributes inside "cpu" cgroup controller to manually classify tasks as jitter.
Why does this make sense? Don't these higher freq bins burn power like
stupid? That is, it makes sense to use turbo-bins for single threaded
workloads that are CPU-bound and need performance.
But why pack a bunch of 'crap' tasks onto a core and give it turbo;
that's just burning power without getting anything back for it.
On 5/15/19 10:00 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:18PM +0530, Parth Shah wrote:
>> +void turbo_sched_get(void)
>> +{
>> + spin_lock(&turbo_sched_lock);
>> + if (!turbo_sched_count++)
>> + static_branch_enable(&__turbo_sched_enabled);
>> + spin_unlock(&turbo_sched_lock);
>> +}
>
> Muwhahaha, you didn't test this code, did you?
>
Yeah, I didn't see the task-sleep problem coming.
I will change to a mutex.
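For instance, something like the following untested sketch, where
turbo_sched_mutex replaces the spinlock (static_branch_enable() may
sleep, hence the sleepable lock):
```
static DEFINE_MUTEX(turbo_sched_mutex);
static int turbo_sched_count;

void turbo_sched_get(void)
{
	mutex_lock(&turbo_sched_mutex);
	/* Enable the key when the first jitter group appears */
	if (!turbo_sched_count++)
		static_branch_enable(&__turbo_sched_enabled);
	mutex_unlock(&turbo_sched_mutex);
}

void turbo_sched_put(void)
{
	mutex_lock(&turbo_sched_mutex);
	/* Disable the key when the last jitter group goes away */
	if (!--turbo_sched_count)
		static_branch_disable(&__turbo_sched_enabled);
	mutex_unlock(&turbo_sched_mutex);
}
```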
Thanks for pointing out.
On 5/15/19 10:14 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:22PM +0530, Parth Shah wrote:
>> This patch specifies the sched domain to search for a non idle core.
>>
>> The select_non_idle_core searches for the non idle cores across whole
>> system. But in the systems with multiple NUMA domains, the Turbo frequency
>> can be sustained within the NUMA domain without being affected from other
>> NUMA.
>>
>> This patch provides an architecture specific implementation for defining
>> the turbo domain to make searching of the core to be bound within the NUMA.
>
> NAK, this is insane. You don't need arch hooks to find the numa domain.
>
The aim here is to limit the search for non-idle cores to within a NUMA
node (or the DIE sched domain), because some systems can sustain the
Turbo frequency when tasks are packed within a NUMA node; hence the
turbo domain for them should be DIE.
Since not all systems have a DIE domain, the arch hook allows each
architecture to override the turbo domain within which task packing is
allowed.
Thanks
On 5/15/19 10:18 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:16PM +0530, Parth Shah wrote:
>> Abstract
>> ========
>>
>> The modern servers allows multiple cores to run at range of
>> frequencies higher than rated range of frequencies. But the power budget
>> of the system inhibits sustaining these higher frequencies for
>> longer durations.
>>
>> However when certain cores are put to idle states, the power can be
>> effectively channelled to other busy cores, allowing them to sustain
>> the higher frequency.
>>
>> One way to achieve this is to pack tasks onto fewer cores keeping others idle,
>> but it may lead to performance penalty for such tasks and sustaining higher
>> frequencies proves to be of no benefit. But if one can identify unimportant low
>> utilization tasks which can be packed on the already active cores then waking up
>> of new cores can be avoided. Such tasks are short and/or bursty "jitter tasks"
>> and waking up new core is expensive for such case.
>>
>> Current CFS algorithm in kernel scheduler is performance oriented and hence
>> tries to assign any idle CPU first for the waking up of new tasks. This policy
>> is perfect for major categories of the workload, but for jitter tasks, one
>> can save energy by packing it onto active cores and allow other cores to run at
>> higher frequencies.
>>
>> These patch-set tunes the task wake up logic in scheduler to pack exclusively
>> classified jitter tasks onto busy cores. The work involves the use of additional
>> attributes inside "cpu" cgroup controller to manually classify tasks as jitter.
>
> Why does this make sense? Don't these higher freq bins burn power like
> stupid? That is, it makes sense to use turbo-bins for single threaded
> workloads that are CPU-bound and need performance.
>
> But why pack a bunch of 'crap' tasks onto a core and give it turbo;
> that's just burning power without getting anything back for it.
>
Thanks for taking an interest in my patch series. I will try my best to
answer your questions.
This patch series tries to pack jitter tasks onto the busier cores to
avoid waking up any idle core for as long as possible. This approach is
supposed to give more performance to the CPU-bound tasks by sustaining
Turbo for a longer duration.
The current task wake-up implementation is biased towards waking an idle
CPU first, which in turn consumes power as the CPU leaves its idle
state.
On a system supporting Turbo frequencies the power budget is fixed, and
hence to stay within this budget the system may throttle the frequency.
So the idea is: if we can pack the jitter tasks onto already running
cores, then we can avoid waking up new cores and save power, thereby
sustaining Turbo for a longer duration.
On 5/15/19 9:59 PM, Peter Zijlstra wrote:
> On Wed, May 15, 2019 at 07:23:17PM +0530, Parth Shah wrote:
>
>> Subject: [RFCv2 1/6] sched/core: Add manual jitter classification from cgroup interface
>
> How can this be v2 ?! I've never seen v1.
>
Actually, I sent out v1 on the [email protected] mailing list. The patch
set was then refined and re-organized to get comments from a larger
audience. You can find v1 at https://lwn.net/Articles/783959/
>> Jitter tasks are usually of less important in terms of performance
>> and are short/bursty in characteristics. TurboSched uses this jitter
>> classification to pack jitters into the already running busy cores to
>> keep the total idle core count high.
>>
>> The patch describes the use of UCLAMP mechanism to classify tasks. Patrick
>> Bellasi came up with a mechanism to classify tasks from the userspace
>> https://lore.kernel.org/lkml/[email protected]/
>
> The canonical form is:
>
> https://lkml.kernel.org/r/$MSGID
>
Thanks for pointing that out. I will use that form from next time onwards.
>> This UCLAMP mechanism can be useful in classifying tasks as jitter. Jitters
>> can be classified for the cgroup by keeping util.max of the tasks as the
>> least(=0). This also provides benefit of giving the least frequency to
>> those jitter tasks, which is useful if all jitters are packed onto a
>> separate core.
>>
>> Use Case with UCLAMP
>> ===================
>> To create a cgroup with all the tasks classified as jitters;
>>
>> ```
>> mkdir -p /sys/fs/cgroup/cpu/jitter
>> echo 0 > /proc/sys/kernel/sched_uclamp_util_min;
>> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.min;
>> echo 0 > /sys/fs/cgroup/cpu/jitter/cpu.util.max;
>> i=8;
>> ./turbo_bench -t 30 -h $i -n $i &
>> ./turbo_bench -t 30 -h 0 -n $i &
>> echo $! > /sys/fs/cgroup/cpu/jitter/cgroup.procs;
>> ```
>>
>> Signed-off-by: Parth Shah <[email protected]>
>> ---
>> kernel/sched/core.c | 9 +++++++++
>> kernel/sched/sched.h | 1 +
>> 2 files changed, 10 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d42c0f5eefa9..77aa4aee4478 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7192,6 +7192,15 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>> tg->uclamp_req[UCLAMP_MAX].value = max_value;
>> tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>>
>> + /*
>> + * Classify the tasks belonging to the last bucket of MAX UCLAMP as
>> + * jitters
>> + */
>> + if (uclamp_bucket_id(max_value) == 0)
>> + tg->turbo_sched_enabled = 1;
>> + else if (tg->turbo_sched_enabled)
>> + tg->turbo_sched_enabled = 0;
>> +
>> /* Update effective clamps to track the most restrictive value */
>> cpu_util_update_eff(css, UCLAMP_MAX);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index b4019012d84b..e75ffaf3ff34 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -407,6 +407,7 @@ struct task_group {
>> struct uclamp_se uclamp[UCLAMP_CNT];
>> #endif
>>
>> + bool turbo_sched_enabled;
>> };
>
> Your simple patch has 3 problems:
>
> - it limits itself; for no apparent reason; to the cgroup interface.
Maybe I can add other interfaces, such as a syscall, to allow per-entity classification.
>
> - it is inconsistent in the terminology; pick either jitter or
> turbo-sched, and I think the latter is a horrid name, it wants to be
> 'pack' or something similar. Also, jitter really doesn't make sense
> given the classification.
>
Yes, I will be happy to rename any variables/functions to enhance readability.
> - you use '_Bool' in a composite type.
>
Maybe I can switch to int.
Thanks,
Parth