This RFC is an update to the initial SchedTune proposal [1] for a central
scheduler-driven power-performance control.
The posting is being made ahead of the LPC to facilitate discussions there.
The initial proposal was refined, eventually merged into the AOSP, and it
currently finds good use in production mobile devices [*]. This series is a
scaled down version of the complete solution that aims to restart discussions.
The focus is on a suitable user-space <-> kernel space interface for tuning the
scheduler’s behavior at run-time. Specifically, the intention is to highlight
how the proposed interface can be used by the scheduler to bias the selection
of the CPU's operating frequency depending on information injected from
userspace.
Patch Set Organization
======================
The concept of a simple power-performance tunable that is wholly scheduler
centric is implemented by patches [01-04].
This is where we introduce a ‘global task boosting’ knob which is integrated
with schedutil to allow the scheduler to bias OPP selection. These first four
patches allow schedutil to be dynamically tuned up to the point where it
behaves like the existing ‘performance’ governor.
Patches [05-07] extend the basic mechanism to use different boost values for
different tasks. This allows informed runtimes (e.g. Android and ChromeOS) to
feed the scheduler with their knowledge about the specific demands of
different tasks and/or use-cases.
Thanks to SchedTune’s defined interface, the scheduler is now able to collect
simple yet powerful information about tasks: how much the user cares about
their performance.
Although it can be argued that something similar is already provided by the
existing concept of task priority, we believe that the proposed interface is
much more generic and can be further extended to support both OPP selection and
task placement, thus leading in the future to a more comprehensive energy-aware
scheduler driven solution.
These patches enable schedutil to service interactive workloads, like touch
screen interaction, which until now only out-of-tree cpufreq governors, such
as the Interactive governor, were able to service.
The last patch in the series introduces the concept of ‘negative boosting’.
Negative boosting is beneficial for mobile devices in scenarios where it is
desired to intentionally reduce the performance of a task by running it at a
lower frequency than the one selected by schedutil.
For certain tasks, like compute intensive background operations or memory
bounded tasks, negative boosting can have measurable energy-saving benefits.
In these cases, a negative SchedTune value allows schedutil to be biased
towards the selection of a lower OPP. Importantly, this can be achieved using
the same SchedTune interface.
This patch allows schedutil to be dynamically tuned up to the point where it
effectively replaces the “powersave” governor.
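As a concrete example of the proposed interface in action, a user-space
runtime could set the global boost with a trivial helper like the following
(an illustrative sketch only; error handling is minimal and the sysctl path is
the one introduced by this series):

  /* Set the system-wide boost: 100 mimics the "performance" governor,
   * -100 approximates "powersave" once the last patch is applied. */
  #include <stdio.h>

  static int set_cfs_boost(int boost)
  {
          FILE *f = fopen("/proc/sys/kernel/sched_cfs_boost", "w");

          if (!f)
                  return -1;
          fprintf(f, "%d\n", boost);
          return fclose(f);
  }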
The patches are based on tip/sched/core:
a225023 - sched/core: Explain sleep/wakeup in a better way
For testing purposes an integration branch, providing the required dependencies
as well as a set of debugging tracepoints, is available here:
git://www.linux-arm.com/linux-pb eas/stune/rfcv2
Test results
============
Extensive testing of the proposed solution has already been done as SchedTune
is shipping on a production mobile device, with benefits observed for key
use-cases (e.g. improved responsiveness and performance of key workloads).
The following synthetic, focused tests are used to show functional benefits and
report overheads. All these tests have been performed on a HiKey board, an
octa-core (ARM Cortex-A53 @ 1.2GHz) SMP platform, running a Debian image on a
mainline kernel and using schedutil configured with a 1ms rate limit value.
Performance boosting validation
-------------------------------
The functional validation of the boost mechanism has been performed by
considering a ramp task generated using rt-app, as provided by the LISA
testing suite [2].
The ramp is configured as a 16ms periodic task which increases its utilization
by 5% every second, starting from 5% up to 60%. The task is pinned to run on a
single CPU and executed with different boost values:
0%, 15%, 30%, 60% and -100%.
The following table reports:
- the value used to boost the task in each experiment
- the rt-app’s reported performance index:
PerfIndex Avg (the higher the better)
which expresses the average time left from completion of a task
activation (i.e. a fixed amount of work) until its next activation
- the CPU average frequency (FreqAvg)
- the actual boost measured for the PerfIndex and FreqAvg
Boost    PerfIndex         Actual   FreqAvg   Actual
value    Avg      Std      Boost    [MHz]     Boost
    0    0.53     0.12        0%      606        0%
   15    0.61     0.07       17%      658        9%
   30    0.68     0.07       26%      739       22%
   60    0.71     0.05       40%      852       41%
 -100  -98.84   120.00     -2K%      363      -36%
For positive boost values, SchedTune can improve the performance of a task
(i.e. its time to completion) by a quantity which is proportional to the boost
value. This is reported by the increasingly higher values of the PerfIndex Avg
as well as the average frequencies used to execute the task.
For negative boost values the performance is progressively reduced. In the
reported case of -100% boost we verified that the system runs most of the time
at one of the lowest OPPs (thus providing a behavior similar to the powersave
governor) while still running at higher OPPs when other (non negative-boosted)
tasks need to run. That’s why the reported average frequency (363MHz) is
slightly higher than the minimum OPP (208MHz).
A graphical representation of the task’s behavior at different boost values
and the corresponding CPU frequencies is available here:
https://gist.github.com/derkling/8be0a8ac365c935b3df585cb24afec6c
Impact on scheduler performance
-------------------------------
Performance impact has been evaluated using the hackbench test provided by perf
with this command line:
perf bench sched messaging --thread --group 25 --loop 1000
Reported completion times (CTime) in seconds are averages over 10 runs:
                  |           |   SchedTune (per-task) boost value   |
                  | Schedutil |     0%     |    10%     |    90%     |
------------------+-----------+------------+------------+------------+
 CTime [s]        |   12.93   |   13.08    |   13.32    |   13.27    |
 vs Schedutil [%] |           |    1.1%    |    3.0%    |    2.7%    |
SchedTune currently introduces overheads when used on saturated systems, such
as the one generated by the hackbench test. This is possibly due to the
locking scheme currently used, which can be further optimized.
On the other hand, the SchedTune extension is mainly useful for lightly loaded
systems (mobile devices, laptops, etc.) where the additional overhead has been
verified to be compensated by the performance benefits due to (for example)
faster task completion. Some of these benefits are reported in the previous
section.
ChangeLog
=========
Changes since v1:
- Rebase on tip/sched/core:
a225023 sched/core: Explain sleep/wakeup in a better way
- Integrated with schedutil (in replacement of SchedFreq)
- Improved tasks accounting for correct boostgroups activations
- Added support for negative boosting
- Extensively tested on production-grade devices
Credits
=======
[*] This work has been supported by an extensive collaborative effort between
ARM, Linaro and Google, targeting production devices.
References
==========
[1] https://lkml.org/lkml/2015/8/19/419
[2] https://github.com/ARM-software/lisa
Patrick Bellasi (8):
sched/tune: add detailed documentation
sched/tune: add sysctl interface to define a boost value
sched/fair: add function to convert boost value into "margin"
sched/fair: add boosted CPU usage
sched/tune: add initial support for CGroups based boosting
sched/tune: compute and keep track of per CPU boost value
sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
sched/{fair,tune}: add support for negative boosting
Documentation/scheduler/sched-tune.txt | 426 +++++++++++++++++++++++++
include/linux/cgroup_subsys.h | 4 +
include/linux/sched/sysctl.h | 16 +
init/Kconfig | 73 +++++
kernel/exit.c | 5 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_schedutil.c | 4 +-
kernel/sched/fair.c | 119 +++++++
kernel/sched/sched.h | 2 +
kernel/sched/tune.c | 561 +++++++++++++++++++++++++++++++++
kernel/sched/tune.h | 40 +++
kernel/sysctl.c | 16 +
12 files changed, 1265 insertions(+), 2 deletions(-)
create mode 100644 Documentation/scheduler/sched-tune.txt
create mode 100644 kernel/sched/tune.c
create mode 100644 kernel/sched/tune.h
--
2.10.1
The current (CFS) scheduler implementation does not allow task performance to
be "boosted" by running tasks at a higher OPP compared to the minimum required
to meet their workload demands.
To support task performance boosting, the scheduler should provide a "knob"
to tune how much the system is optimised for energy efficiency vs performance
boosting.
It is worth noting that by energy efficiency we mean running a CPU at the
minimum OPP which satisfies its utilization, while by performance boosting we
mean running a task as fast as possible.
This patch is the first of a series which provides a simple interface to
define a tuning knob. One system-wide "boost" tunable is exposed via:
/proc/sys/kernel/sched_cfs_boost
which can be configured in the range [0..100], to define a percentage
where:
0% boost means operating in "standard" mode, scheduling tasks at the
minimum capacity required by their workload demand
100% boost means pushing task performance to the maximum, "regardless"
of the incurred energy consumption
A boost value in between these two boundaries is used to bias the
power/performance trade-off; the higher the boost value, the more the
scheduler is biased toward performance boosting instead of energy
efficiency.
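To sketch the intended use of this knob from the scheduler side (illustrative
only; the actual margin computation is introduced later in this series):

  /* Hypothetical consumer: inflate a utilization signal by the global
   * boost percentage, towards SCHED_CAPACITY_SCALE. */
  static unsigned long boosted_util(unsigned long util)
  {
          unsigned int boost = get_sysctl_sched_cfs_boost();

          return util + boost * (SCHED_CAPACITY_SCALE - util) / 100;
  }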
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
include/linux/sched/sysctl.h | 16 ++++++++++++++++
init/Kconfig | 31 +++++++++++++++++++++++++++++++
kernel/sched/Makefile | 1 +
kernel/sched/tune.c | 23 +++++++++++++++++++++++
kernel/sysctl.c | 11 +++++++++++
5 files changed, 82 insertions(+)
create mode 100644 kernel/sched/tune.c
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 4411453..5bfbb14 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -55,6 +55,22 @@ extern int sysctl_sched_rt_runtime;
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
+#ifdef CONFIG_SCHED_TUNE
+extern unsigned int sysctl_sched_cfs_boost;
+int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length,
+ loff_t *ppos);
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+ return sysctl_sched_cfs_boost;
+}
+#else
+static inline unsigned int get_sysctl_sched_cfs_boost(void)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_SCHED_AUTOGROUP
extern unsigned int sysctl_sched_autogroup_enabled;
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 34407f1..461e052 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1248,6 +1248,37 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.
+config SCHED_TUNE
+ bool "Boosting for CFS tasks (EXPERIMENTAL)"
+ depends on SMP
+ help
+ This option enables the system-wide support for task boosting.
+ When this support is enabled a new sysctl interface is exposed to
+ user-space via:
+ /proc/sys/kernel/sched_cfs_boost
+ which allows a system-wide boost value in the range [0..100] to be set.
+
+ The current boosting strategy is implemented in such a way that:
+ - a 0% boost value means operating in "standard" mode, scheduling all
+ tasks at the minimum capacity required by their workload demand
+ - a 100% boost value means pushing task performance to the maximum,
+ "regardless" of the incurred energy consumption
+
+ A boost value in between these two boundaries is used to bias the
+ power/performance trade-off; the higher the boost value, the more the
+ scheduler is biased toward performance boosting instead of energy
+ efficiency.
+
+ Since this support exposes a single system-wide knob, the specified
+ boost value is applied to all (CFS) tasks in the system.
+
+ NOTE: SchedTune support is available only on SMP systems, since the
+ utilization signal for RQs and SEs is currently defined and tracked
+ only on those systems.
+
+ If unsure, say N.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..26ab2a6 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_SCHED_TUNE) += tune.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
new file mode 100644
index 0000000..7336118
--- /dev/null
+++ b/kernel/sched/tune.c
@@ -0,0 +1,23 @@
+/*
+ * Scheduler Tunability (SchedTune) Extensions for CFS
+ *
+ * Copyright (C) 2016 ARM Ltd, Patrick Bellasi <[email protected]>
+ */
+
+#include "sched.h"
+
+unsigned int sysctl_sched_cfs_boost __read_mostly;
+
+int
+sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (ret || !write)
+ return ret;
+
+ return 0;
+}
+
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 739fb17..43b6d14 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -442,6 +442,17 @@ static struct ctl_table kern_table[] = {
.extra1 = &one,
},
#endif
+#ifdef CONFIG_SCHED_TUNE
+ {
+ .procname = "sched_cfs_boost",
+ .data = &sysctl_sched_cfs_boost,
+ .maxlen = sizeof(sysctl_sched_cfs_boost),
+ .mode = 0644,
+ .proc_handler = &sysctl_sched_cfs_boost_handler,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
--
2.10.1
The topic of a single simple power-performance tunable, that is wholly
scheduler centric and has well defined and predictable properties, has
come up on several occasions in the past. With techniques such as
scheduler-driven DVFS, now provided in mainline via the schedutil
governor, we have a good framework for implementing such a tunable.
This patch provides a detailed description of the motivations and design
decisions behind the implementation of SchedTune.
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Signed-off-by: Patrick Bellasi <[email protected]>
---
Documentation/scheduler/sched-tune.txt | 392 +++++++++++++++++++++++++++++++++
1 file changed, 392 insertions(+)
create mode 100644 Documentation/scheduler/sched-tune.txt
diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..da7b3eb
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,392 @@
+ SchedTune Support for CFS Tasks
+ central, scheduler-driven, power-performance control
+ (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric and has well defined and predictable properties, has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+
+Scheduler-driven DVFS provides the foundation mechanism on top of which it is
+possible to differentiate performance levels based on the specific
+requirements of different use-cases. For example, on mobile systems there are
+likely to be demanding use-cases which may benefit from running at a higher
+OPP than the one currently chosen by schedutil.
+
+In the past this behavior was provided by changing the governor or by tuning
+the many different parameters of non-mainline governors; now we can achieve
+similar behaviors by tuning schedutil. This approach also allows for a more
+fine-grained control, which can be extended to consider the requirements of
+specific tasks.
+
+This document introduces SchedTune and describes the overall ideas behind its
+design and implementation.
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Questions and Answers
+   - What about "auto" mode?
+   - How are multiple groups of tasks with different boost values managed?
+7. References
+
+
+1. Motivation
+=============
+
+schedutil [3] is a new event-driven cpufreq governor which allows the scheduler
+to select the optimal DVFS Operating Performance Point (OPP) for running a task
+allocated to a CPU. The introduction of schedutil enables running workloads at
+the most efficient OPPs.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one actually required by its
+CPU bandwidth demand.
+
+This last requirement is especially important if we consider that schedutil can
+potentially replace all currently available CPUFreq policies. Since schedutil
+is event based, as opposed to the sampling driven governors, it is already more
+responsive at selecting the optimal OPP to run tasks allocated to a CPU.
+However, just tracking the actual task utilization demand may not be enough
+from a performance standpoint. For example, it is not possible to get
+behaviors similar to those provided by the "performance" and "powersave"
+CPUFreq governors.
+
+This document describes an implementation of a tunable, stacked on top of
+schedutil, which extends its functionality to support task performance
+boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates).
+For example, if we consider a simple periodic task which executes the same
+workload for 5[ms] every 20[ms] while running at a certain OPP, a boosted
+execution of that task should be able to complete each of its activations in
+less than 5[ms].
+
+Previous attempts to introduce such a boosting feature have not been
+successful, mainly because of the complexity of the proposed solutions. The
+approach described in this document exposes a single simple interface to user-space.
+This single knob allows the tuning of system wide scheduler behaviours ranging
+from energy efficiency at one end through to incremental performance boosting
+at the other end. This tunable affects all tasks. A more advanced extension of
+this concept is also provided, which uses CGroups to boost the performance of
+selected tasks while using the default schedutil behaviors for all others.
+
+The rest of this document introduces in more details the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface with a single power-performance
+tunable:
+
+ /proc/sys/kernel/sched_cfs_boost
+
+This permits expressing a boost value as an integer in the range [0..100].
+
+A value of 0 (default) for a CFS task means that schedutil will attempt
+to match the compute capacity of the CPU where the task is scheduled
+to its current utilization, with a few spare cycles left. A value of
+100 means that schedutil will select the highest available OPP.
+
+The range between 0 and 100 can be set to satisfy other scenarios suitably.
+For example to satisfy interactive response or depending on other system events
+(battery level, thermal status, etc).
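+
+For example, assuming the interface above, a moderate 25% boost for all tasks
+can be requested with:
+
+   # echo 25 > /proc/sys/kernel/sched_cfs_boost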
+
+A CGroup based extension is also provided, which permits further user-space
+defined task classification to tune the scheduler for different goals depending
+on the specific nature of the task, e.g. background vs interactive vs
+low-priority.
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and schedutil by introducing a bias on the Operating
+Performance Point (OPP) selection. Each time a task is allocated on a CPU,
+schedutil has the opportunity to tune the operating frequency of that CPU to
+better match the workload demand. The selection of the actual OPP being
+activated is influenced by the global boost value, or the boost value for the
+task CGroup when in use.
+
+This simple biasing approach leverages existing frameworks, which means
+minimal modifications to the scheduler, and yet it allows a range of different
+behaviours to be achieved, all from a single simple tunable knob. The only new
+concept introduced is that of signal boosting.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few utilization tracking
+signals which basically track the CPU bandwidth requirements for tasks and the
+capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
+inflate some of these utilization tracking signals to make a task or RQ appear
+more demanding than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer". However,
+independently of the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+ margin := boosting_strategy(sched_cfs_boost, signal)
+ boosted_signal := signal + margin
+
+Different boosting strategies were identified and analyzed before selecting the
+one found to be most effective. The next section describes the details of this
+boosting strategy.
+
+Signal Proportional Compensation (SPC)
+--------------------------------------
+
+In this boosting strategy the sched_cfs_boost value is used to compute a margin
+which is proportional to the complement of the original signal. This
+complement is defined as the delta between the actual value of a signal and its
+maximum possible value.
+
+Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE
+as the maximum possible value, the margin becomes:
+
+ margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each intermediate value of sched_cfs_boost inflates the signal in question
+  by a quantity which is proportional to its delta from the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+- 0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+- 50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run half-way
+between the performance granted by its actual utilization and the maximum
+theoretically achievable on the specific target platform.
+
+For example, assuming a 50% task which runs for 8ms every 16ms,
+ignoring any margins built into schedutil we should get:
+
+ Boost OPP Expected Completion Time
+ 0% 50% 16.00 ms
+ 50% 75% 16*50/75 = 10.67 ms
+ 100% 100% 8.00 ms
+
+Note that boosting by 100% instead of 50% reduces the completion time by only
+a further 25%.
+
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a 50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+ | SCHED_CAPACITY_SCALE
+ +-----------------------------------------------------------------+
+ |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+ |
+ | boosted_signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb
+ |
+ | original signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+ | |
+ |bbbbbbbbbbbbbbbbbb |
+ | |
+ | |
+ | |
+ | +-----------------------+
+ | |
+ | |
+ | |
+ |------------------+
+ |
+ |
+ +----------------------------------------------------------------------->
+
+The plot above shows a ramped utilization signal (titled 'original_signal') and
+its boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway from the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new
+utilization signals. Instead, it provides an API to tune existing signals. This
+tuning is done on demand and only in scheduler code paths where it is sensible
+to do so. The new API calls are defined to return either the default signal or
+a boosted one, depending on the value of sched_cfs_boost. This is a clean and
+non-invasive modification of the existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To schedutil, this allows a CPU
+(i.e. a CFS run-queue) to appear more utilized than it actually is.
+
+Thus, with sched_cfs_boost enabled, we have the following main functions to
+get the current utilization of a CPU:
+
+ cpu_util()
+ boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where schedutil needs to
+decide the OPP to run a CPU at.
+For example, this allows selecting the highest OPP for a CPU which has
+the boost value set to 100%.
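+
+As a sketch of this integration point (simplified; variable names are
+illustrative and schedutil applies its own internal margins on top of the
+returned value):
+
+   util = boosted_cpu_util(cpu);
+   next_freq = max_freq * util / capacity_orig_of(cpu);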
+
+
+5. Per task group boosting
+==========================
+
+The availability of a single knob which is used to boost all tasks in the
+system is certainly a simple solution, but it quite likely doesn't fit many
+use-cases, especially in the mobile device space.
+
+For example, on battery powered devices there usually are many background
+services which are long running and need energy efficient scheduling. On the
+other hand, some applications are more performance sensitive and require an
+interactive response and/or maximum performance, regardless of the energy cost.
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", can be enabled, which allows task
+groups with different boost values to be defined and configured. Tasks that
+require special performance can be put into separate CGroups. The value of the
+boost associated with the tasks in this group can be specified using a single
+knob exposed by the CGroup controller:
+
+ schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+ 1) It is only possible to create 1 level depth hierarchies
+
+ The root control group defines the system-wide boost value to be applied
+ by default to all tasks. Its direct subgroups are named "boost groups" and
+ they define the boost value for a specific set of tasks.
+ Further nested subgroups are not allowed since they do not have a sensible
+ meaning from a user-space standpoint.
+
+ 2) It is possible to define only a limited number of "boost groups"
+
+ This number is defined at compile time and by default configured to 16.
+ This is a design decision motivated by two main reasons:
+ a) In a real system we do not expect use-cases with more than a few
+ boost groups. For example, a reasonable collection of groups could be
+ just "background", "interactive" and "performance".
+ b) It simplifies the implementation considerably, especially for the code
+ which has to compute the per CPU boosting once there are multiple
+ RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks. Moreover, this
+interface can be easily integrated by user-space run-times (e.g. Android,
+ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CGROUP_SCHED_TUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+ root@derkdell:~# cat /proc/cgroups
+ #subsys_name hierarchy num_cgroups enabled
+ cpuset 0 1 1
+ cpu 0 1 1
+ schedtune 0 1 1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+ root@derkdell:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+ root@derkdell:~# mkdir /sys/fs/cgroup/stune
+ root@derkdell:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Setup the system-wide boost value (Optional)
+
+ If not configured the root control group has a 0% boost value, which
+ basically disables boosting for all tasks in the system, thus running in
+ an energy-efficient mode. Let's assume SYSBOOST defines the default boost
+ value to be used for all tasks:
+
+ root@derkdell:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
+
+5. Create task groups and configure their specific boost value (Optional)
+
+ For example, here we create a "performance" boost group configured to
+ boost all its tasks to 100%:
+
+ root@derkdell:~# mkdir /sys/fs/cgroup/stune/performance
+ root@derkdell:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+6. Move tasks into the boost group
+
+ For example, the following moves the task with PID $TASKPID (and all its
+ threads) into the "performance" boost group.
+
+ root@derkdell:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the threads of the $TASKPID task to
+run, when needed, at the highest OPP of the most capable CPU of the system.
+
+
+6. Questions and Answers
+========================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode, as described in previously proposed approaches, can be
+implemented by interfacing SchedTune with some suitable user-space element.
+This element could use the exposed system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. Once schedutil selects the OPP for a CPU, the utilization of that CPU
+is boosted with a value which is the maximum of the boost values of all the
+currently RUNNABLE tasks on that CPU.
+
+This allows schedutil to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+7. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] http://lkml.org/lkml/2016/3/16/559
+
--
2.10.1
The basic idea of the boost knob is to "artificially inflate" a signal to
make a task (or CPU) appear more demanding than it actually is.
Independently of the specific signal, a consistent and possibly
simple semantic for the concept of "signal boosting" must define:
1. how we translate a boost percentage into a "margin" value,
to be added to the original signal we want to inflate
2. what a boost value means from a user-space perspective
This patch provides the implementation of a possible boost semantic,
namely "Signal Proportional Compensation" (SPC), where the "Boost
Percentage" (BP) is used to compute a "Margin" (M) which is proportional
to the complement of the "Original Signal" (OS):
M = BP * (SCHED_CAPACITY_SCALE - OS)
The computed margin can be added to the OS to obtain a "Boosted Signal" (BS):
BS = OS + M
The proposed boost semantic has these main features:
- each signal gets a boost which is proportional to its delta with respect
to the maximum available capacity in the system,
i.e. SCHED_CAPACITY_SCALE
- a 100% boost has a clear meaning from a user-space perspective:
it means running (possibly) "all" tasks at the maximum OPP
- each boost value speeds up a task by a quantity which is
proportional to the maximum speedup achievable by that
task on that system
Thus this semantic enforces a behavior where:
50% boosting means running half-way between the current and the
maximum performance which a task could achieve on that system
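For example, with SCHED_CAPACITY_SCALE=1024 and a 50% utilization signal
(OS=512) boosted by BP=50%:
M  = 0.5 * (1024 - 512) = 256
BS = 512 + 256 = 768
i.e. the boosted signal sits half-way between its original value and the
maximum.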
NOTE: this code is suitable for all signals operating in range
[0..SCHED_CAPACITY_SCALE]
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/tune.c | 9 +++++++++
kernel/sched/tune.h | 13 +++++++++++++
3 files changed, 59 insertions(+)
create mode 100644 kernel/sched/tune.h
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79d464a..fdacc29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,7 @@
#include <trace/events/sched.h>
#include "sched.h"
+#include "tune.h"
/*
* Targeted preemption latency for CPU-bound tasks:
@@ -5543,6 +5544,42 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;
}
+#ifdef CONFIG_SCHED_TUNE
+
+struct reciprocal_value schedtune_spc_rdiv;
+
+/*
+ * schedtune_margin returns the "margin" to be added on top of
+ * the original value of a "signal".
+ *
+ * The Boost (B) value [%] is used to compute a Margin (M) which
+ * is proportional to the complement of the original Signal (S):
+ *
+ * M = B * (SCHED_CAPACITY_SCALE - S)
+ *
+ * The obtained value M could be used by the caller to "boost" S.
+ */
+static unsigned long
+schedtune_margin(unsigned long signal, unsigned int boost)
+{
+ unsigned long long margin = 0;
+
+ /* Do not boost saturated signals */
+ if (signal >= SCHED_CAPACITY_SCALE)
+ return 0UL;
+
+ /* Signal Proportional Compensation (SPC) */
+ margin = SCHED_CAPACITY_SCALE - signal;
+ if (boost < 100) {
+ margin *= boost;
+ margin = reciprocal_divide(margin, schedtune_spc_rdiv);
+ }
+
+ return margin;
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
/*
* cpu_util returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 7336118..c28a06f 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -5,6 +5,7 @@
*/
#include "sched.h"
+#include "tune.h"
unsigned int sysctl_sched_cfs_boost __read_mostly;
@@ -21,3 +22,11 @@ sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
return 0;
}
+static int
+schedtune_init(void)
+{
+ schedtune_spc_rdiv = reciprocal_value(100);
+ return 0;
+}
+late_initcall(schedtune_init);
+
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
new file mode 100644
index 0000000..515d02a
--- /dev/null
+++ b/kernel/sched/tune.h
@@ -0,0 +1,13 @@
+/*
+ * Scheduler Tunability (SchedTune) Extensions for CFS
+ *
+ * Copyright (C) 2016 ARM Ltd, Patrick Bellasi <[email protected]>
+ */
+
+#ifdef CONFIG_SCHED_TUNE
+
+#include <linux/reciprocal_div.h>
+
+extern struct reciprocal_value schedtune_spc_rdiv;
+
+#endif /* CONFIG_SCHED_TUNE */
--
2.10.1
When per task boosting is enabled, we can have multiple RUNNABLE tasks
which are concurrently scheduled on the same CPU but each one with a
different boost value.
For example, we could have a scenario like this:
Task SchedTune CGroup Boost Value
T1 root 0
T2 low-priority 10
T3 interactive 90
In these conditions we expect a CPU to be configured according to a
proper "aggregation" of the required boost values for all the tasks
currently RUNNABLE on this CPU.
A simple aggregation function is the one which tracks the MAX boost
value for all the tasks RUNNABLE on a CPU. This approach always
satisfies the most boost-demanding task while at the same time:
a) boosting all its co-scheduled tasks, thus reducing potential
side-effects on the most boost-demanding tasks
b) reducing the number of frequency switches requested by schedutil,
thus being more friendly to architectures with slow frequency
switching times.
Every time a task enters/exits the RQ of a CPU the max boost value
should be updated considering all the boost groups currently "affecting"
that CPU, i.e. which have at least one RUNNABLE task currently allocated
on a RQ of that CPU.
This patch introduces the required support to keep track of the boost
groups currently affecting a CPU. Thanks to the limited number of boost
groups, a small and memory efficient per-cpu array of boost group
values (cpu_boost_groups) is updated by schedtune_boostgroup_update(),
but only when a schedtune CGroup boost value is changed. This is
expected to be an infrequent operation, perhaps done just once at boot,
or whenever user-space needs to tune the boost value for a specific
group of tasks (e.g. touch boost behavior on Android systems).
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/fair.c | 2 +-
kernel/sched/tune.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/tune.h | 14 ++++++++++
3 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 26c3911..313a815 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5581,7 +5581,7 @@ schedtune_margin(unsigned long signal, unsigned int boost)
static inline unsigned long
schedtune_cpu_margin(unsigned long util, int cpu)
{
- unsigned int boost = get_sysctl_sched_cfs_boost();
+ unsigned int boost = schedtune_cpu_boost(cpu);
if (boost == 0)
return 0UL;
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 4eaea1d..6a51a4d 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -104,6 +104,73 @@ struct boost_groups {
/* Boost groups affecting each CPU in the system */
DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+static void
+schedtune_cpu_update(int cpu)
+{
+ struct boost_groups *bg;
+ unsigned int boost_max;
+ int idx;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /* The root boost group is always active */
+ boost_max = bg->group[0].boost;
+ for (idx = 1; idx < boostgroups_max; ++idx) {
+ /*
+ * A boost group affects a CPU only if it has
+ * RUNNABLE tasks on that CPU
+ */
+ if (bg->group[idx].tasks == 0)
+ continue;
+ boost_max = max(boost_max, bg->group[idx].boost);
+ }
+
+ bg->boost_max = boost_max;
+}
+
+static void
+schedtune_boostgroup_update(int idx, int boost)
+{
+ struct boost_groups *bg;
+ int cur_boost_max;
+ int old_boost;
+ int cpu;
+
+ /* Update per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /*
+ * Keep track of current boost values to compute the per CPU
+ * maximum only when it has been affected by the new value of
+ * the updated boost group
+ */
+ cur_boost_max = bg->boost_max;
+ old_boost = bg->group[idx].boost;
+
+ /* Update the boost value of this boost group */
+ bg->group[idx].boost = boost;
+
+ /* Check if this update increases the current max */
+ if (boost > cur_boost_max && bg->group[idx].tasks) {
+ bg->boost_max = boost;
+ continue;
+ }
+
+ /* Check if this update has decreased current max */
+ if (cur_boost_max == old_boost && old_boost > boost)
+ schedtune_cpu_update(cpu);
+ }
+}
+
+int schedtune_cpu_boost(int cpu)
+{
+ struct boost_groups *bg;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ return bg->boost_max;
+}
+
static u64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
@@ -123,6 +190,9 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
st->boost = boost;
if (css == &root_schedtune.css)
sysctl_sched_cfs_boost = boost;
+ /* Update CPU boost */
+ schedtune_boostgroup_update(st->idx, st->boost);
+
return 0;
}
@@ -199,6 +269,9 @@ schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
static void
schedtune_boostgroup_release(struct schedtune *st)
{
+ /* Reset this boost group */
+ schedtune_boostgroup_update(st->idx, 0);
+
/* Keep track of allocated boost groups */
allocated_group[st->idx] = NULL;
}
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
index 515d02a..e936b91 100644
--- a/kernel/sched/tune.h
+++ b/kernel/sched/tune.h
@@ -10,4 +10,18 @@
extern struct reciprocal_value schedtune_spc_rdiv;
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+
+int schedtune_cpu_boost(int cpu);
+
+#else /* CONFIG_CGROUP_SCHED_TUNE */
+
+#define schedtune_cpu_boost(cpu) get_sysctl_sched_cfs_boost()
+
+#endif /* CONFIG_CGROUP_SCHED_TUNE */
+
+#else /* CONFIG_SCHED_TUNE */
+
+#define schedtune_cpu_boost(cpu) 0
+
#endif /* CONFIG_SCHED_TUNE */
--
2.10.1
Boosting support allows a signal to be inflated by a margin which is
defined to be proportional to its delta from its maximum possible
value. Such a mechanism allows a task to run at an OPP which is higher
than the minimum capacity which can satisfy its demands.
In certain use-cases we could be interested in the opposite goal, i.e.
running a task at an OPP which is lower than the minimum required.
Currently the only way to achieve such a goal is to use the "powersave"
governor, thus forcing all tasks to run at the lowest OPP, or the
"userspace" governor, still forcing all tasks to run at a fixed OPP.
With the availability of schedutil and the addition of SchedTune, we now
have the support to tune the way OPPs are selected depending on which
tasks are active on a CPU.
This patch extends SchedTune to introduce support for negative
boosting. While boosting inflates a signal, with negative boosting we
can artificially reduce the value of a signal. The boosting strategy
used to reduce a signal is quite simple and extends the concept of
"margin" already used for positive boosting.
The Boost (B) value [%] is used to compute a Margin (M) which, in case
of negative boosting, is a fraction of the original Signal (S):
M = B * S, when B in [-100%, 0%)
Such a value of M is defined to be a negative quantity which, once
added to the original signal S, reduces that signal by a fraction of
its original value.
With such a definition, a 50% utilization task will run at:
- 25% capacity OPP when boosted -50%
- minimum capacity OPP when boosted -100%
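In SCHED_CAPACITY_SCALE (1024) terms, for the -50% case above with S=512:
M  = -0.5 * 512 = -256
BS = 512 - 256 = 256
which corresponds to the 25% capacity OPP.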
It is worth noting that the boost values of all tasks on a CPU are
aggregated to figure out the max boost value currently required.
Thus, for example, if we have two tasks:
T1 boosted @ -20%
T2 boosted @ +30%
the CPU is boosted +30% while T2 is active, even if T1 is also active,
while the CPU is "slowed down" 20% when T1 is the only task active on
that CPU.
Cc: Jonathan Corbet <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Suggested-by: Srinath Sridharan <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
Documentation/scheduler/sched-tune.txt | 44 ++++++++++++++++++++++++++++++----
include/linux/sched/sysctl.h | 6 ++---
kernel/sched/fair.c | 38 +++++++++++++++++++++--------
kernel/sched/tune.c | 33 +++++++++++++++----------
kernel/sysctl.c | 3 ++-
5 files changed, 92 insertions(+), 32 deletions(-)
diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
index da7b3eb..5822f9f 100644
--- a/Documentation/scheduler/sched-tune.txt
+++ b/Documentation/scheduler/sched-tune.txt
@@ -100,12 +100,17 @@ This permits expressing a boost value as an integer in the range [0..100].
A value of 0 (default) for a CFS task means that schedutil will attempt
to match the compute capacity of the CPU where the task is scheduled
-to its current utilization, with a few spare cycles left. A value of
-100 means that schedutil will select the highest available OPP.
+to its current utilization, with a few spare cycles left.
-The range between 0 and 100 can be set to satisfy other scenarios suitably.
-For example to satisfy interactive response or depending on other system events
-(battery level, thermal status, etc).
+A value of 100 means that schedutil will select the highest available OPP,
+while a negative value means that schedutil will try to run tasks at lower
+OPPs. Together, positive and negative boost values allow schedutil to provide
+behaviors similar to those of the existing "performance" and "powersave"
+governors, but with a more fine-grained control.
+
+The range between -100 and 100 can be used to suit other scenarios, for
+example to satisfy interactive response requirements or to react to other
+system events (battery level, thermal status, etc).
A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
@@ -227,6 +232,27 @@ corresponding to a 50% boost is midway from the original signal and the upper
bound. Boosting by 100% generates a boosted signal which is always saturated to
the upper bound.
+Negative boosting
+-----------------
+
+While positive boosting uses the SPC strategy to inflate a signal, with
+negative boosting we can artificially reduce the value of a signal. The boosting
+strategy used to reduce a signal is quite simple and extends the concept of
+"margin" already used for positive boosting.
+
+When sched_cfs_boost is defined in [-100%, 0%), the boost value [%] is used to
+compute a margin which is a fraction of the original signal:
+
+ margin := sched_cfs_boost * signal
+
+Such a margin is defined to be a negative quantity which, once added to the
+original signal, reduces that signal by a fraction of its original value.
+
+With such a definition, for example a 50% utilization task will run at:
+ - 25% capacity OPP when boosted -50%
+ - minimum capacity OPP when boosted -100%
+
4. OPP selection using boosted CPU utilization
==============================================
@@ -304,6 +330,14 @@ main characteristics:
which has to compute the per CPU boosting once there are multiple
RUNNABLE tasks with different boost values.
+It is worth noting that the boost values of all tasks on a CPU are aggregated
+to figure out the max boost value currently required. Thus, for example, if we
+have two tasks:
+ T1 boosted @ -20%
+ T2 boosted @ +30%
+the CPU is boosted +30% while T2 is active, even if T1 is also active, while
+the CPU is "slowed down" 20% when T1 is the only task active on that CPU.
+
Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks. Moreover, this
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5bfbb14..fe878c9 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,16 +56,16 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
#ifdef CONFIG_SCHED_TUNE
-extern unsigned int sysctl_sched_cfs_boost;
+extern int sysctl_sched_cfs_boost;
int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
{
return sysctl_sched_cfs_boost;
}
#else
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
{
return 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f56953b..43a4989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5580,17 +5580,34 @@ struct reciprocal_value schedtune_spc_rdiv;
* schedtune_margin returns the "margin" to be added on top of
* the original value of a "signal".
*
- * The Boost (B) value [%] is used to compute a Margin (M) which
- * is proportional to the complement of the original Signal (S):
+ * The Boost (B) value [%] is used to compute a Margin (M) which, in case of
+ * positive boosting, it is proportional to the complement of the original
+ * Signal (S):
*
- * M = B * (SCHED_CAPACITY_SCALE - S)
+ * M = B * (SCHED_CAPACITY_SCALE - S), when B is in (0%, 100%]
+ *
+ * In case of negative boosting, the computed margin is a fraction of the
+ * original S:
+ *
+ * M = B * S, when B in [-100%, 0%)
*
* The obtained value M could be used by the caller to "boost" S.
*/
-static unsigned long
-schedtune_margin(unsigned long signal, unsigned int boost)
+static long
+schedtune_margin(unsigned long signal, int boost)
{
- unsigned long long margin = 0;
+ long long margin = 0;
+
+ /* A -100% boost nullifies the original signal */
+ if (unlikely(boost == -100))
+ return -signal;
+
+ /* A negative boost produces a proportional (negative) margin */
+ if (unlikely(boost < 0)) {
+ margin = -boost * signal;
+ margin = reciprocal_divide(margin, schedtune_spc_rdiv);
+ return -margin;
+ }
/* Do not boost saturated signals */
if (signal >= SCHED_CAPACITY_SCALE)
@@ -5606,10 +5623,10 @@ schedtune_margin(unsigned long signal, unsigned int boost)
return margin;
}
-static inline unsigned long
+static inline long
schedtune_cpu_margin(unsigned long util, int cpu)
{
- unsigned int boost = schedtune_cpu_boost(cpu);
+ int boost = schedtune_cpu_boost(cpu);
if (boost == 0)
return 0UL;
@@ -5619,7 +5636,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)
#else /* CONFIG_SCHED_TUNE */
-static inline unsigned long
+static inline long
schedtune_cpu_margin(unsigned long util, int cpu)
{
return 0;
@@ -5665,9 +5682,10 @@ unsigned long boosted_cpu_util(int cpu)
{
unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);
+ int boost = schedtune_cpu_boost(cpu);
/* Do not boost saturated utilizations */
- if (util >= capacity)
+ if (boost >= 0 && util >= capacity)
return capacity;
/* Add margin to current CPU's capacity */
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 965a3e1..ed90830 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -13,7 +13,7 @@
#include "sched.h"
#include "tune.h"
-unsigned int sysctl_sched_cfs_boost __read_mostly;
+int sysctl_sched_cfs_boost __read_mostly;
#ifdef CONFIG_CGROUP_SCHED_TUNE
@@ -32,7 +32,7 @@ struct schedtune {
int idx;
/* Boost value for tasks on that SchedTune CGroup */
- unsigned int boost;
+ int boost;
};
@@ -95,10 +95,10 @@ static struct schedtune *allocated_group[boostgroups_max] = {
*/
struct boost_groups {
/* Maximum boost value for all RUNNABLE tasks on a CPU */
- unsigned int boost_max;
+ int boost_max;
struct {
/* The boost for tasks on that boost group */
- unsigned int boost;
+ int boost;
/* Count of RUNNABLE tasks on that boost group */
unsigned int tasks;
} group[boostgroups_max];
@@ -112,15 +112,14 @@ DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
static void
schedtune_cpu_update(int cpu)
{
+ bool active_tasks = false;
struct boost_groups *bg;
- unsigned int boost_max;
+ int boost_max = -100;
int idx;
bg = &per_cpu(cpu_boost_groups, cpu);
- /* The root boost group is always active */
- boost_max = bg->group[0].boost;
- for (idx = 1; idx < boostgroups_max; ++idx) {
+ for (idx = 0; idx < boostgroups_max; ++idx) {
/*
* A boost group affects a CPU only if it has
* RUNNABLE tasks on that CPU
@@ -128,8 +127,13 @@ schedtune_cpu_update(int cpu)
if (bg->group[idx].tasks == 0)
continue;
boost_max = max(boost_max, bg->group[idx].boost);
+ active_tasks = true;
}
+ /* Reset boosting when there are no RUNNABLE tasks on this CPU */
+ if (!active_tasks)
+ boost_max = 0;
+
bg->boost_max = boost_max;
}
@@ -383,7 +387,7 @@ void schedtune_exit_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rq_flags);
}
-static u64
+static s64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct schedtune *st = css_st(css);
@@ -393,15 +397,18 @@ boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
static int
boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 boost)
+ s64 boost)
{
struct schedtune *st = css_st(css);
- if (boost > 100)
+ if (boost < -100 || boost > 100)
return -EINVAL;
+
+ /* Update boostgroup and global boosting (if required) */
st->boost = boost;
if (css == &root_schedtune.css)
sysctl_sched_cfs_boost = boost;
+
/* Update CPU boost */
schedtune_boostgroup_update(st->idx, st->boost);
@@ -411,8 +418,8 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
static struct cftype files[] = {
{
.name = "boost",
- .read_u64 = boost_read,
- .write_u64 = boost_write,
+ .read_s64 = boost_read,
+ .write_s64 = boost_write,
},
{ } /* terminate */
};
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 12c3432..3b412fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -127,6 +127,7 @@ static int __maybe_unused four = 4;
static unsigned long one_ul = 1;
static int one_hundred = 100;
static int one_thousand = 1000;
+static int __maybe_unused one_hundred_neg = -100;
#ifdef CONFIG_PRINTK
static int ten_thousand = 10000;
#endif
@@ -453,7 +454,7 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
#endif
.proc_handler = &sysctl_sched_cfs_boost_handler,
- .extra1 = &zero,
+ .extra1 = &one_hundred_neg,
.extra2 = &one_hundred,
},
#endif
--
2.10.1
When per-task boosting is enabled, every time a task enters/exits a CPU
its boost value could impact the currently selected OPP for that CPU.
Thus, the "aggregated" boost value for that CPU potentially needs to
be updated to match the current maximum boost value among all the tasks
currently RUNNABLE on that CPU.
This patch introduces the required support to keep track of which boost
groups are impacting a CPU. Each time a task is enqueued/dequeued to/from
a RQ of a CPU its boost group is used to increment a per-cpu counter of
RUNNABLE tasks on that CPU.
Only when the number of RUNNABLE tasks for a specific boost group
becomes 1 or 0 does the corresponding boost group change its effect on
that CPU, specifically:
a) boost_group::tasks == 1: this boost group starts to impact the CPU
b) boost_group::tasks == 0: this boost group stops impacting the CPU
In each of these two conditions the aggregation function:
schedtune_cpu_update(cpu)
could be required to run in order to identify the new maximum boost
value required for the CPU.
The proposed patch minimizes the number of times the aggregation
function is executed while still providing the required support to
always boost a CPU to the maximum boost value required by all its
currently RUNNABLE tasks.
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/exit.c | 5 ++
kernel/sched/fair.c | 28 +++++++
kernel/sched/tune.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/tune.h | 13 ++++
4 files changed, 262 insertions(+)
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45..541e4e1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -55,6 +55,8 @@
#include <linux/shm.h>
#include <linux/kcov.h>
+#include "sched/tune.h"
+
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include <asm/pgtable.h>
@@ -775,6 +777,9 @@ void __noreturn do_exit(long code)
}
exit_signals(tsk); /* sets PF_EXITING */
+
+ schedtune_exit_task(tsk);
+
/*
* Ensure that all new tsk->pi_lock acquisitions must observe
* PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 313a815..f56953b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4570,6 +4570,25 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
+ /*
+ * Update SchedTune accounting.
+ *
+ * We do it before updating the CPU capacity to ensure the
+ * boost value of the current task is accounted for in the
+ * selection of the OPP.
+ *
+ * We do it also in the case where we enqueue a throttled task;
+ * we could argue that a throttled task should not boost a CPU,
+ * however:
+ * a) properly implementing CPU boosting considering throttled
+ * tasks would considerably increase the complexity of the solution
+ * b) it's not easy to quantify the benefits introduced by
+ * such a more complex solution.
+ * Thus, for the time being we go for the simple solution and boost
+ * also for throttled RQs.
+ */
+ schedtune_enqueue_task(p, cpu_of(rq));
+
if (!se)
add_nr_running(rq, 1);
@@ -4629,6 +4648,15 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
+ /*
+ * Update SchedTune accounting
+ *
+ * We do it before updating the CPU capacity to ensure the
+ * boost value of the current task is accounted for in the
+ * selection of the OPP.
+ */
+ schedtune_dequeue_task(p, cpu_of(rq));
+
if (!se)
sub_nr_running(rq, 1);
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 6a51a4d..965a3e1 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -7,6 +7,7 @@
#include <linux/cgroup.h>
#include <linux/err.h>
#include <linux/percpu.h>
+#include <linux/rcupdate.h>
#include <linux/slab.h>
#include "sched.h"
@@ -16,6 +17,8 @@ unsigned int sysctl_sched_cfs_boost __read_mostly;
#ifdef CONFIG_CGROUP_SCHED_TUNE
+static bool schedtune_initialized;
+
/*
* CFS Scheduler Tunables for Task Groups.
*/
@@ -99,6 +102,8 @@ struct boost_groups {
/* Count of RUNNABLE tasks on that boost group */
unsigned int tasks;
} group[boostgroups_max];
+ /* CPU's boost group locking */
+ raw_spinlock_t lock;
};
/* Boost groups affecting each CPU in the system */
@@ -171,6 +176,213 @@ int schedtune_cpu_boost(int cpu)
return bg->boost_max;
}
+#define ENQUEUE_TASK 1
+#define DEQUEUE_TASK -1
+
+static inline void
+schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ int tasks = bg->group[idx].tasks + task_count;
+
+ /* Update the boosted-tasks count, preventing it from going negative */
+ bg->group[idx].tasks = max(0, tasks);
+
+ /* Boost group activation or deactivation on that RQ */
+ if (tasks == 1 || tasks == 0)
+ schedtune_cpu_update(cpu);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_enqueue_task(struct task_struct *p, int cpu)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ unsigned long irq_flags;
+ struct schedtune *st;
+ int idx;
+
+ lockdep_assert_held(&cpu_rq(cpu)->lock);
+
+ if (unlikely(!schedtune_initialized))
+ return;
+
+ /*
+ * When a task is marked PF_EXITING by do_exit() it's going to be
+ * dequeued and enqueued multiple times in the exit path.
+ * Thus we avoid any further update, since we do not want to change
+ * CPU boosting while the task is exiting.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Boost group accounting is protected by a per-CPU lock and requires
+ * interrupts to be disabled to avoid race conditions, for example with
+ * do_exit()::cgroup_exit() and task migration.
+ */
+ raw_spin_lock_irqsave(&bg->lock, irq_flags);
+ rcu_read_lock();
+
+ st = task_schedtune(p);
+ idx = st->idx;
+
+ schedtune_tasks_update(p, cpu, idx, ENQUEUE_TASK);
+
+ rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
+}
+
+static int schedtune_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css;
+ struct rq_flags rq_flags;
+ struct task_struct *task;
+ struct boost_groups *bg;
+ unsigned int cpu;
+ struct rq *rq;
+ int src_bg; /* Source boost group index */
+ int dst_bg; /* Destination boost group index */
+ int tasks;
+
+ if (unlikely(!schedtune_initialized))
+ return 0;
+
+ cgroup_taskset_for_each(task, dst_css, tset) {
+
+ /*
+ * Lock the RQ of the CPU the task is enqueued on, to avoid
+ * races with the migration code while the task is being
+ * accounted.
+ */
+ rq = task_rq_lock(task, &rq_flags);
+
+ if (!task->on_rq) {
+ task_rq_unlock(rq, task, &rq_flags);
+ continue;
+ }
+
+ /*
+ * Boost group accounting is protected by a per-CPU lock and
+ * requires interrupts to be disabled to avoid race conditions
+ * with task migration.
+ */
+ cpu = cpu_of(rq);
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ raw_spin_lock(&bg->lock);
+
+ dst_bg = css_st(dst_css)->idx;
+ src_bg = task_schedtune(task)->idx;
+
+ /*
+ * The task is not changing its boost group; this can
+ * happen when the unified hierarchy is in use.
+ */
+ if (unlikely(dst_bg == src_bg)) {
+ raw_spin_unlock(&bg->lock);
+ task_rq_unlock(rq, task, &rq_flags);
+ continue;
+ }
+
+ /*
+ * This is the case of a RUNNABLE task which is switching its
+ * current boost group.
+ */
+
+ /* Move task from src to dst boost group */
+ tasks = bg->group[src_bg].tasks - 1;
+ bg->group[src_bg].tasks = max(0, tasks);
+ bg->group[dst_bg].tasks += 1;
+
+ /*
+ * Update the CPU boost group while still holding bg->lock, so
+ * that the per-group task counters cannot change under us.
+ */
+ if (bg->group[src_bg].tasks == 0 ||
+ bg->group[dst_bg].tasks == 1)
+ schedtune_cpu_update(cpu);
+
+ raw_spin_unlock(&bg->lock);
+ task_rq_unlock(rq, task, &rq_flags);
+ }
+
+ return 0;
+}
+
+static void schedtune_cancel_attach(struct cgroup_taskset *tset)
+{
+ /*
+ * This can happen only if the SchedTune controller is co-mounted
+ * with other controllers and one of them fails. Since SchedTune is
+ * usually mounted on its own hierarchy, for the time being we do
+ * not implement a proper rollback mechanism.
+ */
+ WARN(1, "SchedTune cancel attach not implemented");
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_dequeue_task(struct task_struct *p, int cpu)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ unsigned long irq_flags;
+ struct schedtune *st;
+ int idx;
+
+ lockdep_assert_held(&cpu_rq(cpu)->lock);
+
+ if (unlikely(!schedtune_initialized))
+ return;
+
+ /*
+ * When a task is marked PF_EXITING by do_exit() it's going to be
+ * dequeued and enqueued multiple times in the exit path.
+ * Thus we avoid any further update, since we do not want to change
+ * CPU boosting while the task is exiting.
+ * The last dequeue is already enforced by the do_exit() code path
+ * via schedtune_exit_task().
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Boost group accounting is protected by a per-CPU lock and requires
+ * interrupts to be disabled to avoid race conditions, for example with
+ * do_exit()::cgroup_exit() and task migration.
+ */
+ raw_spin_lock_irqsave(&bg->lock, irq_flags);
+ rcu_read_lock();
+
+ st = task_schedtune(p);
+ idx = st->idx;
+
+ schedtune_tasks_update(p, cpu, idx, DEQUEUE_TASK);
+
+ rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
+}
+
+void schedtune_exit_task(struct task_struct *tsk)
+{
+ struct rq_flags rq_flags;
+ struct schedtune *st;
+ unsigned int cpu;
+ struct rq *rq;
+ int idx;
+
+ if (unlikely(!schedtune_initialized))
+ return;
+
+ rq = task_rq_lock(tsk, &rq_flags);
+ rcu_read_lock();
+
+ cpu = cpu_of(rq);
+ st = task_schedtune(tsk);
+ idx = st->idx;
+ schedtune_tasks_update(tsk, cpu, idx, DEQUEUE_TASK);
+
+ rcu_read_unlock();
+ task_rq_unlock(rq, tsk, &rq_flags);
+}
+
static u64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
@@ -288,6 +500,8 @@ schedtune_css_free(struct cgroup_subsys_state *css)
struct cgroup_subsys schedtune_cgrp_subsys = {
.css_alloc = schedtune_css_alloc,
.css_free = schedtune_css_free,
+ .can_attach = schedtune_can_attach,
+ .cancel_attach = schedtune_cancel_attach,
.legacy_cftypes = files,
.early_init = 1,
};
@@ -306,6 +520,8 @@ schedtune_init_cgroups(void)
pr_info("schedtune: configured to support %d boost groups\n",
boostgroups_max);
+
+ schedtune_initialized = true;
}
#endif /* CONFIG_CGROUP_SCHED_TUNE */
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
index e936b91..ae7dccf 100644
--- a/kernel/sched/tune.h
+++ b/kernel/sched/tune.h
@@ -14,14 +14,27 @@ extern struct reciprocal_value schedtune_spc_rdiv;
int schedtune_cpu_boost(int cpu);
+void schedtune_exit_task(struct task_struct *tsk);
+
+void schedtune_enqueue_task(struct task_struct *p, int cpu);
+void schedtune_dequeue_task(struct task_struct *p, int cpu);
+
#else /* CONFIG_CGROUP_SCHED_TUNE */
#define schedtune_cpu_boost(cpu) get_sysctl_sched_cfs_boost()
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+#define schedtune_exit_task(task) do { } while (0)
+
#endif /* CONFIG_CGROUP_SCHED_TUNE */
#else /* CONFIG_SCHED_TUNE */
#define schedtune_cpu_boost(cpu) 0
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+#define schedtune_exit_task(task) do { } while (0)
+
#endif /* CONFIG_SCHED_TUNE */
--
2.10.1
To support task performance boosting, the use of a single knob has the
advantage of being a simple solution, both from the implementation and the
usability standpoint. However, on a real system it can be difficult to
identify a single knob value which fits the needs of many different
tasks. For example, some kernel threads and/or user-space background
services are better managed the "standard" way, while we still want to
be able to boost the performance of specific workloads.
In order to improve the flexibility of the task boosting mechanism, this
patch is the first of a small series which extends the previous
implementation to introduce "per task group" support.
This first patch introduces just the basic CGroups support: a new
"schedtune" CGroups controller is added which allows different boost
values to be configured for different groups of tasks.
To keep the implementation simple while still supporting an effective
boosting strategy, the new controller:
1. allows only a two-layer hierarchy
2. supports only a limited number of boost groups
A two-layer hierarchy allows each task to be placed either:
a) in the root control group
thus being subject to a system-wide boosting value
b) in a child of the root group
thus being subject to the specific boost value defined by that
"boost group"
The limited number of "boost groups" supported is mainly motivated by
the observation that in a real system only a few classes of tasks
deserve different treatment, for example background vs foreground or
interactive vs low-priority.
As an additional benefit, a limited number of boost groups also allows
for a simpler implementation, especially for the code required to
compute the boost value for CPUs which have RUNNABLE tasks belonging to
different boost groups.
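As a usage illustration (not part of this patch), an informed runtime can
classify tasks and assign boost values through the controller's filesystem
interface. A minimal sketch in C, assuming the controller is mounted on
/sys/fs/cgroup/stune and a "foreground" group has already been created with
mkdir; the mount point, group name and PID below are hypothetical:

    #include <stdio.h>

    /* Write a string to a cgroup attribute file; returns 0 on success */
    static int cgroup_write(const char *path, const char *val)
    {
            FILE *f = fopen(path, "w");

            if (!f)
                    return -1;
            fprintf(f, "%s\n", val);
            return fclose(f);
    }

    int main(void)
    {
            /* Boost every task in the "foreground" group by 60% */
            cgroup_write("/sys/fs/cgroup/stune/foreground/schedtune.boost", "60");

            /* Classify task 1234 (hypothetical PID) as foreground */
            cgroup_write("/sys/fs/cgroup/stune/foreground/tasks", "1234");

            return 0;
    }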
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
include/linux/cgroup_subsys.h | 4 +
init/Kconfig | 42 ++++++++
kernel/sched/tune.c | 233 ++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 4 +
4 files changed, 283 insertions(+)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..4fd0f82 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -20,6 +20,10 @@ SUBSYS(cpu)
SUBSYS(cpuacct)
#endif
+#if IS_ENABLED(CONFIG_CGROUP_SCHED_TUNE)
+SUBSYS(schedtune)
+#endif
+
#if IS_ENABLED(CONFIG_BLK_CGROUP)
SUBSYS(io)
#endif
diff --git a/init/Kconfig b/init/Kconfig
index 461e052..5bce1ef 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1074,6 +1074,48 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config CGROUP_SCHED_TUNE
+ bool "Tasks boosting controller"
+ depends on SCHED_TUNE
+ help
+ This option allows users to define boost values for groups of
+ SCHED_OTHER tasks. Once enabled, the utilization of a CPU is boosted
+ by a factor proportional to the maximum boost value of all the tasks
+ RUNNABLE on that CPU.
+
+ This new controller:
+ 1. allows only a two-layer hierarchy, where the root defines the
+ system-wide boost value and its children define a
+ "boost group" whose tasks will be boosted with the configured
+ value.
+ 2. supports only a limited number of different boost groups, each
+ of which can be configured with a different boost value.
+
+ Say N if unsure.
+
+config SCHED_TUNE_BOOSTGROUPS
+ int "Maximum number of SchedTune's boost groups"
+ range 2 16
+ default 5
+ depends on CGROUP_SCHED_TUNE
+
+ help
+ When per-task boosting is used we still allow only a limited number of
+ boost groups, for two main reasons:
+ 1. on a real system we usually have only a few classes of workloads
+ which it makes sense to boost with different values,
+ e.g. background vs foreground tasks, interactive vs low-priority tasks
+ 2. a limited number allows for a simpler and more memory/time efficient
+ implementation, especially for the computation of the per-CPU boost
+ value
+
+ NOTE: The first boost group is reserved to define the global boost value
+ applied to all tasks, thus the minimum number of boost groups is 2.
+ Indeed, if only global boosting is required, then per-task boosting is
+ not needed and this support can be disabled.
+
+ Use the default value (5) if unsure.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index c28a06f..4eaea1d 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -4,11 +4,239 @@
* Copyright (C) 2016 ARM Ltd, Patrick Bellasi <[email protected]>
*/
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+
#include "sched.h"
#include "tune.h"
unsigned int sysctl_sched_cfs_boost __read_mostly;
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+
+/*
+ * CFS Scheduler Tunables for Task Groups.
+ */
+
+/* SchedTune tunables for a group of tasks */
+struct schedtune {
+ /* SchedTune CGroup subsystem */
+ struct cgroup_subsys_state css;
+
+ /* Boost group allocated ID */
+ int idx;
+
+ /* Boost value for tasks on that SchedTune CGroup */
+ unsigned int boost;
+
+};
+
+static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct schedtune, css) : NULL;
+}
+
+static inline struct schedtune *task_schedtune(struct task_struct *tsk)
+{
+ return css_st(task_css(tsk, schedtune_cgrp_id));
+}
+
+static inline struct schedtune *parent_st(struct schedtune *st)
+{
+ return css_st(st->css.parent);
+}
+
+/*
+ * SchedTune root control group
+ * The root control group is used to define a system-wide boosting tuning,
+ * which is applied to all tasks in the system.
+ * Task specific boost tuning could be specified by creating and
+ * configuring a child control group under the root one.
+ * By default, system-wide boosting is disabled, i.e. no boosting is applied
+ * to tasks which are not in a child control group.
+ */
+static struct schedtune
+root_schedtune = {
+ .boost = 0,
+};
+
+/*
+ * Maximum number of boost groups to support
+ * When per-task boosting is used we still allow only a limited number of
+ * boost groups, for two main reasons:
+ * 1. on a real system we usually have only a few classes of workloads
+ * which it makes sense to boost with different values (e.g. background
+ * vs foreground tasks, interactive vs low-priority tasks)
+ * 2. a limited number allows for a simpler and more memory/time efficient
+ * implementation, especially for the computation of the per-CPU boost
+ * value
+ */
+#define boostgroups_max CONFIG_SCHED_TUNE_BOOSTGROUPS
+
+/* Array of configured boostgroups */
+static struct schedtune *allocated_group[boostgroups_max] = {
+ &root_schedtune,
+ NULL,
+};
+
+/* SchedTune boost groups
+ * Keep track of all the boost groups which impact a CPU, for example when a
+ * CPU has two RUNNABLE tasks belonging to two different boost groups and thus
+ * likely with different boost values. Since the maximum number of boost
+ * groups is limited by CONFIG_SCHED_TUNE_BOOSTGROUPS, which is capped at 16,
+ * we use a simple array to keep track of the metrics required to compute the
+ * maximum per-CPU boosting value.
+ */
+struct boost_groups {
+ /* Maximum boost value for all RUNNABLE tasks on a CPU */
+ unsigned int boost_max;
+ struct {
+ /* The boost for tasks on that boost group */
+ unsigned int boost;
+ /* Count of RUNNABLE tasks on that boost group */
+ unsigned int tasks;
+ } group[boostgroups_max];
+};
+
+/* Boost groups affecting each CPU in the system */
+DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+
+static u64
+boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct schedtune *st = css_st(css);
+
+ return st->boost;
+}
+
+static int
+boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 boost)
+{
+ struct schedtune *st = css_st(css);
+
+ if (boost > 100)
+ return -EINVAL;
+ st->boost = boost;
+ if (css == &root_schedtune.css)
+ sysctl_sched_cfs_boost = boost;
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "boost",
+ .read_u64 = boost_read,
+ .write_u64 = boost_write,
+ },
+ { } /* terminate */
+};
+
+static int
+schedtune_boostgroup_init(struct schedtune *st)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Keep track of allocated boost groups */
+ allocated_group[st->idx] = st;
+
+ /* Initialize the per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ bg->group[st->idx].boost = 0;
+ bg->group[st->idx].tasks = 0;
+ }
+
+ return 0;
+}
+
+static struct cgroup_subsys_state *
+schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct schedtune *st;
+ int idx;
+
+ if (!parent_css)
+ return &root_schedtune.css;
+
+ /* Allow only single-level hierarchies */
+ if (parent_css != &root_schedtune.css) {
+ pr_err("Nested SchedTune boosting groups not allowed\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Allow only a limited number of boosting groups */
+ for (idx = 1; idx < boostgroups_max; ++idx)
+ if (!allocated_group[idx])
+ break;
+ if (idx == boostgroups_max) {
+ pr_err("Trying to create more than %d SchedTune boosting groups\n",
+ boostgroups_max);
+ return ERR_PTR(-ENOSPC);
+ }
+
+ st = kzalloc(sizeof(*st), GFP_KERNEL);
+ if (!st)
+ goto out;
+
+ /* Initialize per-CPU boost group support */
+ st->idx = idx;
+ if (schedtune_boostgroup_init(st))
+ goto release;
+
+ return &st->css;
+
+release:
+ kfree(st);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void
+schedtune_boostgroup_release(struct schedtune *st)
+{
+ /* Keep track of allocated boost groups */
+ allocated_group[st->idx] = NULL;
+}
+
+static void
+schedtune_css_free(struct cgroup_subsys_state *css)
+{
+ struct schedtune *st = css_st(css);
+
+ schedtune_boostgroup_release(st);
+ kfree(st);
+}
+
+struct cgroup_subsys schedtune_cgrp_subsys = {
+ .css_alloc = schedtune_css_alloc,
+ .css_free = schedtune_css_free,
+ .legacy_cftypes = files,
+ .early_init = 1,
+};
+
+static inline void
+schedtune_init_cgroups(void)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Initialize the per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ memset(bg, 0, sizeof(struct boost_groups));
+ }
+
+ pr_info("schedtune: configured to support %d boost groups\n",
+ boostgroups_max);
+}
+
+#endif /* CONFIG_CGROUP_SCHED_TUNE */
+
int
sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
@@ -26,6 +254,11 @@ static int
schedtune_init(void)
{
schedtune_spc_rdiv = reciprocal_value(100);
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+ schedtune_init_cgroups();
+#else
+ pr_info("schedtune: configured to support global boosting only\n");
+#endif
return 0;
}
late_initcall(schedtune_init);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 43b6d14..12c3432 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -447,7 +447,11 @@ static struct ctl_table kern_table[] = {
.procname = "sched_cfs_boost",
.data = &sysctl_sched_cfs_boost,
.maxlen = sizeof(sysctl_sched_cfs_boost),
+#ifdef CONFIG_CGROUP_SCHED_TUNE
+ .mode = 0444,
+#else
.mode = 0644,
+#endif
.proc_handler = &sysctl_sched_cfs_boost_handler,
.extra1 = &zero,
.extra2 = &one_hundred,
--
2.10.1
The CPU utilization signal (cpu_rq(cpu)->cfs.avg.util_avg) is used by
the scheduler as an estimation of the overall bandwidth currently
allocated on a CPU. When the schedutil CPUFreq governor is in use, this
signal drives the selection of the Operating Performance Points (OPP)
required to accommodate all the workload allocated on that CPU.
A convenient way to boost the performance of tasks running on a CPU,
which is also minimally intrusive, is to boost the CPU utilization signal
each time it is used to select an OPP.
This patch introduces a new function:
boosted_cpu_util(cpu)
to return a boosted value for the usage of a specified CPU.
The margin added to the original usage is:
1. computed based on the "boosting strategy" in use
2. proportional to the system-wide boost value defined via the
provided user-space interface
The boosted signal is used by schedutil (transparently) each time it
requires an estimation of the capacity required by the CFS tasks
which are currently RUNNABLE on a CPU.
It's worth noticing that the RUNNABLE status is used to define _when_
a CPU needs to be boosted, while _what_ we boost is the CPU utilization,
which also includes the blocked utilization.
Currently SchedTune is available only for CONFIG_SMP systems; thus we
have a single point of integration with schedutil, provided by the
cfs_rq_util_change(cfs_rq) function, which ultimately calls into:
kernel/sched/cpufreq_schedutil.c::sugov_get_util(util, max)
Each time a CFS utilization update is required, if SchedTune is compiled
in, we use the global boost value to boost the utilization requested
by the CFS class.
Such a simple mechanism allows, for example, schedutil to mimic the
behavior of other governors, e.g. performance (when boost=100%, and
only while there are RUNNABLE tasks on that CPU).
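The body of schedtune_margin(), visible in the hunk context below, is
introduced by an earlier patch in this series. For reference, a minimal
sketch of the Signal Proportional Compensation (SPC) formula it implements
(illustrative: the actual code performs the division by 100 with a
reciprocal divide via schedtune_spc_rdiv):

    /*
     * SPC margin sketch: with boost in [0..100], the margin covers
     * boost% of the room between the current utilization signal and
     * the maximum capacity, so boost=100 always requests the max OPP.
     */
    static unsigned long
    schedtune_margin(unsigned long signal, unsigned int boost)
    {
            unsigned long margin = 0;

            if (signal < SCHED_CAPACITY_SCALE)
                    margin = SCHED_CAPACITY_SCALE - signal;

            margin *= boost;
            margin /= 100;

            return margin;
    }

For example, with util=400, capacity=1024 and boost=50, the margin is
(1024 - 400) * 50 / 100 = 312, so schedutil sees a utilization of 712.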
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 4 ++--
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 69e0689..0382df7 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -148,12 +148,12 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
static void sugov_get_util(unsigned long *util, unsigned long *max)
{
- struct rq *rq = this_rq();
unsigned long cfs_max;
cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
- *util = min(rq->cfs.avg.util_avg, cfs_max);
+ *util = boosted_cpu_util(smp_processor_id());
+ *util = min(*util, cfs_max);
*max = cfs_max;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdacc29..26c3911 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5578,6 +5578,25 @@ schedtune_margin(unsigned long signal, unsigned int boost)
return margin;
}
+static inline unsigned long
+schedtune_cpu_margin(unsigned long util, int cpu)
+{
+ unsigned int boost = get_sysctl_sched_cfs_boost();
+
+ if (boost == 0)
+ return 0UL;
+
+ return schedtune_margin(util, boost);
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static inline unsigned long
+schedtune_cpu_margin(unsigned long util, int cpu)
+{
+ return 0;
+}
+
#endif /* CONFIG_SCHED_TUNE */
/*
@@ -5614,6 +5633,23 @@ static int cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}
+unsigned long boosted_cpu_util(int cpu)
+{
+ unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ /* Do not boost saturated utilizations */
+ if (util >= capacity)
+ return capacity;
+
+ /* Add the boost margin to the current utilization */
+ util += schedtune_cpu_margin(util, cpu);
+ if (util >= capacity)
+ return capacity;
+
+ return util;
+}
+
static inline int task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..fd85818 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1764,6 +1764,8 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+unsigned long boosted_cpu_util(int cpu);
+
#ifdef CONFIG_CPU_FREQ
DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
--
2.10.1
Hello, Patrick.
On Thu, Oct 27, 2016 at 06:41:05PM +0100, Patrick Bellasi wrote:
> To support task performance boosting, the use of a single knob has the
> advantage of being a simple solution, both from the implementation and the
> usability standpoint. However, on a real system it can be difficult to
> identify a single knob value which fits the needs of many different
> tasks. For example, some kernel threads and/or user-space background
> services are better managed the "standard" way, while we still want to
> be able to boost the performance of specific workloads.
>
> In order to improve the flexibility of the task boosting mechanism, this
> patch is the first of a small series which extends the previous
> implementation to introduce "per task group" support.
>
> This first patch introduces just the basic CGroups support: a new
> "schedtune" CGroups controller is added which allows different boost
> values to be configured for different groups of tasks.
> To keep the implementation simple while still supporting an effective
> boosting strategy, the new controller:
> 1. allows only a two-layer hierarchy
> 2. supports only a limited number of boost groups
>
> A two-layer hierarchy allows each task to be placed either:
> a) in the root control group
> thus being subject to a system-wide boosting value
> b) in a child of the root group
> thus being subject to the specific boost value defined by that
> "boost group"
>
> The limited number of "boost groups" supported is mainly motivated by
> the observation that in a real system only a few classes of tasks
> deserve different treatment, for example background vs foreground or
> interactive vs low-priority.
> As an additional benefit, a limited number of boost groups also allows
> for a simpler implementation, especially for the code required to
> compute the boost value for CPUs which have RUNNABLE tasks belonging to
> different boost groups.
So, skipping over the actual details of the boosting mechanism, in terms of
cgroup support, it should be integrated into the existing cpu
controller and have proper support for hierarchy. Note that hierarchy
support doesn't necessarily mean that boosting itself has to be
hierarchical. It can be, for example, something along the lines of
"the descendants are allowed up to this level of boosting" so that the
hierarchy just serves to assign the appropriate boosting values to the
groups of tasks.
Thanks.
--
tejun
On 27-Oct 14:30, Tejun Heo wrote:
> Hello, Patrick.
Hi Tejun,
> On Thu, Oct 27, 2016 at 06:41:05PM +0100, Patrick Bellasi wrote:
> > To support task performance boosting, the use of a single knob has the
> > advantage of being a simple solution, both from the implementation and the
> > usability standpoint. However, on a real system it can be difficult to
> > identify a single knob value which fits the needs of many different
> > tasks. For example, some kernel threads and/or user-space background
> > services are better managed the "standard" way, while we still want to
> > be able to boost the performance of specific workloads.
> >
> > In order to improve the flexibility of the task boosting mechanism, this
> > patch is the first of a small series which extends the previous
> > implementation to introduce "per task group" support.
> >
> > This first patch introduces just the basic CGroups support: a new
> > "schedtune" CGroups controller is added which allows different boost
> > values to be configured for different groups of tasks.
> > To keep the implementation simple while still supporting an effective
> > boosting strategy, the new controller:
> > 1. allows only a two-layer hierarchy
> > 2. supports only a limited number of boost groups
> >
> > A two-layer hierarchy allows each task to be placed either:
> > a) in the root control group
> > thus being subject to a system-wide boosting value
> > b) in a child of the root group
> > thus being subject to the specific boost value defined by that
> > "boost group"
> >
> > The limited number of "boost groups" supported is mainly motivated by
> > the observation that in a real system only a few classes of tasks
> > deserve different treatment, for example background vs foreground or
> > interactive vs low-priority.
> > As an additional benefit, a limited number of boost groups also allows
> > for a simpler implementation, especially for the code required to
> > compute the boost value for CPUs which have RUNNABLE tasks belonging to
> > different boost groups.
>
> So, skipping over the actual details of the boosting mechanism, in terms of
> cgroup support, it should be integrated into the existing cpu
> controller and have proper support for hierarchy.
I have a couple of concerns/questions about both of these points.
First, regarding the integration with the cpu controller,
don't we risk overloading the semantics of the cpu controller?
Right now this controller is devoted to tracking the bandwidth that a
group of tasks can consume and/or to repartitioning the available
bandwidth among the tasks in that group.
Boosting is a different concept: it's somewhat related to CPU bandwidth,
but it targets a completely different goal, i.e. biasing schedutil
and (in the future) scheduler decisions.
I'm also wondering how confusing and complex it can be to configure a
system where the groups of tasks defined for bandwidth control do not
match those defined for boosting.
For example, let's assume we have three tasks: A, B, and C, and we want:
Bandwidth: 10% for A and B, 20% for C
Boost: 10% for A, 0% for B and C
IMO, configuring such a set of constraints would be quite complex if
we expose the boost value through the cpu controller.
> Note that hierarchy
> support doesn't necessarily mean that boosting itself has to be
> hierarchical.
Initially I've actually considered such a design, however...
> It can be, for example, something along the lines of
> "the descendants are allowed up to this level of boosting" so that the
> hierarchy just serves to assign the appropriate boosting values to the
> groups of tasks.
... the current "single layer hierarchy" has been proposed instead for
two main reasons.
First, we were not able to think of realistic use-cases where this
"up to this level" semantic is needed.
For boosting purposes, tasks are grouped based on their role and/or
importance in the system. This property is usually defined in
"absolute" terms rather than "relative" terms.
Does it make sense to say that task A can be boosted only up to as
much as task B? In our experience, probably never.
The second reason is mainly related to the possibility of having an
efficient and low-overhead implementation. The currently defined
semantics of CPU boosting require certain operations to be performed
at each task enqueue and dequeue event. Some of these operations are
part of the hot path in the scheduler code. The flat hierarchy allows
the use of per-CPU data structures and algorithms which aim at
reducing the overhead incurred in doing the required accounting.
As a final remark, I would like to say that Google is currently using
SchedTune in Android to classify tasks by "importance" and feed this
information into the scheduler. In doing this exercise, so far we have
not spotted limitations related to the use of a flat hierarchy.
However, I'm happy to have this discussion, which is actually the main
goal of this RFC. My suggestion is just that we should think about
use-cases first and then introduce a more complex solution, but only
if we convince ourselves that it brings more benefits than burdens in
code maintainability.
Is your request for "proper support for hierarchy" somehow related to
the requirements of the "unified hierarchy"? Or do you also see other,
more functional/semantic aspects?
> Thanks.
If you are going to attend LPC next week, I hope we can have a chat on
these topics.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hello, Patrick.
On Thu, Oct 27, 2016 at 09:14:39PM +0100, Patrick Bellasi wrote:
> I'm also wondering how confusing and complex it can be to configure a
> system where the groups of tasks defined for bandwidth control do not
> match those defined for boosting.
>
> For example, let's assume we have three tasks: A, B, and C, and we want:
>
> Bandwidth: 10% for A and B, 20% for C
> Boost: 10% for A, 0% for B and C
>
> IMO, configuring such a set of constraints would be quite complex if
> we expose the boost value through the cpu controller.
Going back to your use case point, when would we realistically need
this?
> > Note that hierarchy
> > support doesn't necessarily mean that boosting itself has to be
> > hierarchical.
>
> Initially I've actually considered such a design, however...
>
> > It can be, for example, something along the lines of
> > "the descendants are allowed up to this level of boosting" so that the
> > hierarchy just serves to assign the appropriate boosting values to the
> > groups of tasks.
>
> ... the current "single layer hierarchy" has been proposed instead for
> two main reasons.
>
> First, we were not able to think of realistic use-cases where this
> "up to this level" semantic is needed.
> For boosting purposes, tasks are grouped based on their role and/or
> importance in the system. This property is usually defined in
> "absolute" terms rather than "relative" terms.
> Does it make sense to say that task A can be boosted only up to as
> much as task B? In our experience, probably never.
There are basic semantics that people expect when they use cgroup for
resource control and it enables things like layering and delegating
configuration.
> The second reason is mainly related to the possibility of having an
> efficient and low-overhead implementation. The currently defined
> semantics of CPU boosting require certain operations to be performed
> at each task enqueue and dequeue event. Some of these operations are
> part of the hot path in the scheduler code. The flat hierarchy allows
> the use of per-CPU data structures and algorithms which aim at
> reducing the overhead incurred in doing the required accounting.
Unless I'm misunderstanding, the actually applied attributes should be
calculable during config changes or task migration, right? The
hierarchy's function would be allowing layering and delegating
configurations and shouldn't get in the way of actual enforcement.
> As a final remark, I would like to say that Google is currently using
> SchedTune in Android to classify tasks by "importance" and feed this
> information into the scheduler. In doing this exercise, so far we have
> not spotted limitations related to the use of a flat hierarchy.
>
> However, I'm happy to have this discussion, which is actually the main
> goal of this RFC. My suggestion is just that we should think about
> use-cases first and then introduce a more complex solution, but only
> if we convince ourselves that it brings more benefits than burdens in
> code maintainability.
>
> Is your request for "proper support for hierarchy" somehow related to
> the requirements of the "unified hierarchy"? Or do you also see other,
> more functional/semantic aspects?
Not necessarily. In general, all controllers, whether on v1 or v2,
should be fully hierarchical for reasons mentioned above. I get that
flat was fine for android but flat hierarchy would be fine for most
controllers for android. That's not the only use case we should be
considering, right?
> > Thanks.
>
> If you are going to attend LPC next week, I hope we can have a chat on
> these topics.
Yeah, sure, I'll be around till Thursday. Let's chat there.
Thanks.
--
tejun
On Thu, Oct 27, 2016 at 06:41:00PM +0100, Patrick Bellasi wrote:
>
> This RFC is an update to the initial SchedTune proposal [1] for a central
> scheduler-driven power-performance control.
> The posting is being made ahead of the LPC to facilitate discussions there.
This is weeks too late for that. There is literally no time left to look
at any of this before LPC.
On 27-Oct 16:39, Tejun Heo wrote:
> Hello, Patrick.
>
> On Thu, Oct 27, 2016 at 09:14:39PM +0100, Patrick Bellasi wrote:
> > I'm also wondering how confusing and complex it can be to configure a
> > system where the groups of tasks defined for bandwidth control do not
> > match those defined for boosting.
> >
> > For example, let's assume we have three tasks: A, B, and C, and we want:
> >
> > Bandwidth: 10% for A and B, 20% for C
> > Boost: 10% for A, 0% for B and C
> >
> > IMO, configuring such a set of constraints would be quite complex if
> > we expose the boost value through the cpu controller.
>
> Going back to your use case point, when would we realistically need
> this?
If we really want to be generic, we cannot exclude that such
scenarios exist. What this toy example aims to show is just that, in
general, how much we want to boost a task can be decoupled from the
bandwidth reservation it shares with other tasks.
> > > Note that hierarchy
> > > support doesn't necessarily mean that boosting itself has to be
> > > hierarchical.
> >
> > Initially I've actually considered such a design, however...
> >
> > > It can be, for example, something along the lines of
> > > "the descendants are allowed up to this level of boosting" so that the
> > > hierarchy just serves to assign the appropriate boosting values to the
> > > groups of tasks.
> >
> > ... the current "single layer hierarchy" has been proposed instead for
> > two main reasons.
> >
> > First, we were not able to think of realistic use-cases where this
> > "up to this level" semantic is needed.
> > For boosting purposes, tasks are grouped based on their role and/or
> > importance in the system. This property is usually defined in
> > "absolute" terms rather than "relative" terms.
> > Does it make sense to say that task A can be boosted only up to as
> > much as task B? In our experience, probably never.
>
> There are basic semantics that people expect when they use cgroup for
> resource control and it enables things like layering and delegating
> configuration.
I see your point and I understand it; still, I'm not completely
convinced that these concepts (i.e. layering and delegation) are
really required for the specific topic of "task classification" for
the purposes of energy-vs-performance tuning.
Perhaps this boils down to the fact that, for the specific needs of
task boosting, we are less interested in the "Control" component of
the "Control Group" framework than in the "Group" one.
> > The second reason is mainly related to the possibility of having an
> > efficient and low-overhead implementation. The currently defined
> > semantics of CPU boosting require certain operations to be performed
> > at each task enqueue and dequeue event. Some of these operations are
> > part of the hot path in the scheduler code. The flat hierarchy allows
> > the use of per-CPU data structures and algorithms which aim at
> > reducing the overhead incurred in doing the required accounting.
>
> Unless I'm misunderstanding, the actually applied attributes should be
> calculable during config changes or task migration, right?
Perhaps you are missing the enqueue/dequeue operations, which are not
necessarily due only to task migrations.
For example, the semantics exposed by SchedTune are such that if we have
two tasks RUNNABLE on a CPU:
T1 boosted 30%
T2 boosted 60%
then the CPU will be boosted 60% while T2 is runnable, and boosted
only 30% as soon as T2 goes to sleep and only T1 is still runnable.
> The
> hierarchy's function would be allowing layering and delegating
> configurations and shouldn't get in the way of actual enforcement.
Ok, I should think more about this distinction between
layering/delegation and control enforcement.
> > As a final remark, I would like to say that Google is currently using
> > SchedTune in Android to classify tasks by "importance" and feed this
> > information into the scheduler. In doing this exercise, so far we have
> > not spotted limitations related to the use of a flat hierarchy.
> >
> > However, I'm happy to have this discussion, which is actually the main
> > goal of this RFC. My suggestion is just that we should think about
> > use-cases first and then introduce a more complex solution, but only
> > if we convince ourselves that it brings more benefits than burdens in
> > code maintainability.
> >
> > Is your request for "proper support for hierarchy" somehow related to
> > the requirements of the "unified hierarchy"? Or do you also see other,
> > more functional/semantic aspects?
>
> Not necessarily. In general, all controllers, whether on v1 or v2,
> should be fully hierarchical for reasons mentioned above. I get that
> flat was fine for android but flat hierarchy would be fine for most
> controllers for android. That's not the only use case we should be
> considering, right?
So far we have had experience mainly with Android and ChromeOS, which
expose a valuable and quite interesting set of realistic use-cases,
especially if we consider the possibility of collecting task
information from "informed runtimes".
However, I absolutely agree with you: it's worth considering all the
other use-cases we can think of.
> > > Thanks.
> >
> > If you are going to attend LPC next week, I hope we can have a chat on
> > these topics.
>
> Yeah, sure, I'll be around till Thursday. Let's chat there.
Cool, thanks!
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 27-Oct 22:58, Peter Zijlstra wrote:
> On Thu, Oct 27, 2016 at 06:41:00PM +0100, Patrick Bellasi wrote:
> >
> > This RFC is an update to the initial SchedTune proposal [1] for a central
> > scheduler-driven power-performance control.
> > The posting is being made ahead of the LPC to facilitate discussions there.
>
> This is weeks too late for that. There is literally no time left to look
> at any of this before LPC.
Hi Peter, apologies. Yes, the timing is clearly off.
However, I wanted to post it anyway before LPC so that the code is available
and I can try to get a chance to describe its aims and architecture at the
conference.
If that doesn't work out, I'll resort to list discussion after LPC,
which I'm still happy to do.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 27-10-16, 18:41, Patrick Bellasi wrote:
> +This last requirement is especially important if we consider that schedutil can
> +potentially replace all currently available CPUFreq policies. Since schedutil
> +is event based, as opposed to the sampling driven governors, it is already more
> +responsive at selecting the optimal OPP to run tasks allocated to a CPU.
I am not sure I follow this paragraph. All the governors follow the same
basic rules now. They are all event driven (events from the scheduler), but
they act only after a certain sampling period has finished. Isn't this the case?
> +SchedTune exposes a simple user-space interface with a single power-performance
> +tunable:
> +
> + /proc/sys/kernel/sched_cfs_boost
> +
> +This permits expressing a boost value as an integer in the range [0..100].
> +
> +A value of 0 (default) for a CFS task means that schedutil will attempt
> +to match the compute capacity of the CPU where the task is scheduled
> +to its current utilization, with a few spare cycles left. A value of
> +100 means that schedutil will select the highest available OPP.
> +
> +Values between 0 and 100 can be used to suit other scenarios, for
> +example to satisfy interactive response requirements or to react to
> +other system events (battery level, thermal status, etc).
An earlier section said that schedutil+schedtune can replace all earlier
governors. How will schedutil behave like the powersave governor with
schedtune? I was expecting the possible values of sched_cfs_boost to be in
the range -100 to 100, where -100 would make it behave like powersave, +100
like performance, and 0 would make no changes.
--
viresh
On 04-Nov 15:16, Viresh Kumar wrote:
> On 27-10-16, 18:41, Patrick Bellasi wrote:
> > +This last requirement is especially important if we consider that schedutil can
> > +potentially replace all currently available CPUFreq policies. Since schedutil
> > +is event based, as opposed to the sampling driven governors, it is already more
> > +responsive at selecting the optimal OPP to run tasks allocated to a CPU.
>
> I am not sure I follow this paragraph. All the governors follow the same
> basic rules now. They are all event driven (events from the scheduler), but
> they act only after a certain sampling period has finished. Isn't this the case?
Right, the main difference from what I call "sample based" governors
(e.g. ondemand, interactive) is that they consider metrics which are
averaged across time (e.g. how long a CPU is idle on average).
By contrast, with schedutil we have direct input from the scheduler
about the required CPU bandwidth demand.
Thus, schedutil is not only event based but can also exploit more
direct knowledge of the CPU bandwidth demand. Moreover, depending on
the CPUFreq driver latencies of a specific platform, schedutil can be
much more aggressive in triggering frequency transitions, e.g. on some
ARM platforms we can easily have 1ms OPP switches.
AFAIK, such fast transitions cannot be exploited by "sample based"
governors because they cannot collect sensible averages in such a
limited timeframe without the risk of being "unstable" (i.e. almost
always making a wrong decision).
> > +SchedTune exposes a simple user-space interface with a single power-performance
> > +tunable:
> > +
> > + /proc/sys/kernel/sched_cfs_boost
> > +
> > +This permits expressing a boost value as an integer in the range [0..100].
> > +
> > +A value of 0 (default) for a CFS task means that schedutil will attempt
> > +to match the compute capacity of the CPU where the task is scheduled
> > +to its current utilization, with a few spare cycles left. A value of
> > +100 means that schedutil will select the highest available OPP.
> > +
> > +Values between 0 and 100 can be used to suit other scenarios, for
> > +example to satisfy interactive response requirements or to react to
> > +other system events (battery level, thermal status, etc).
>
> An earlier section said that schedutil+schedtune can replace all earlier
> governors. How will schedutil behave like the powersave governor with
> schedtune? I was expecting the possible values of sched_cfs_boost to be in
> the range -100 to 100, where -100 would make it behave like powersave, +100
> like performance, and 0 would make no changes.
You're right; however, negative boost values are introduced by the
last patch of this series. That patch also updates the
documentation to describe the meaning of negative boost values.
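For reference, a minimal sketch of how negative boost values could bias the
utilization downward, assuming the SPC idea is simply extended to the
[-100..100] range (illustrative only; the actual formula in the last patch
may differ):

    /*
     * Sketch: signed SPC margin. Assumes util <= SCHED_CAPACITY_SCALE
     * (callers clamp saturated utilizations). A negative boost removes
     * |boost|% of the current utilization, biasing schedutil towards
     * the selection of a lower OPP.
     */
    static long schedtune_margin_signed(unsigned long util, int boost)
    {
            if (boost >= 0)
                    return ((SCHED_CAPACITY_SCALE - util) * boost) / 100;

            return -(long)((util * -boost) / 100);
    }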
> --
> viresh
--
#include <best/regards.h>
Patrick Bellasi