This is a respin of:
https://lkml.org/lkml/2017/8/24/721
which finally removes the RFC tag since we have properly addressed all the major
concerns from previous discussions, mainly at LPCs and OSPM conferences, and we
now aim at finalizing this series for a mainline merge.
Comments and feedback are more than welcome!
The content of this series will also be discussed at the upcoming OSPM Summit:
http://retis.sssup.it/ospm-summit/
A live stream and a recording of the event will be available for the benefit of
those interested who will not be able to join the summit.
.:: Main changes
The main change in this version is the introduction of a new userspace API, as
requested by Tejun. This now makes the cgroup-based interface a "secondary" one
for utilization clamping, which allows this feature to be used also on systems
where the cgroup CPU controller is not available or not in use.
The primary interface for utilization clamping is now a per-task API, which has
been added by extending the existing sched_{set,get}attr syscalls.
Here we propose a simple yet effective extension of these syscalls based on a
couple of additional attributes.
A possible alternative implementation is also described, as a note in the
corresponding commit message, which would not require changing the syscall but
just properly reusing existing attributes currently available only for
DEADLINE tasks. Since this would require a more complex implementation, we
decided to go for the simple option and open up the discussion on this
(hopefully last) point while we finalize the patchset.
Due to the new API, the series has also been re-organized into the main
sections described below.
Data Structures and Mechanisms
==============================
[PATCH 1/7] sched/core: uclamp: add CPU clamp groups accounting
[PATCH 2/7] sched/core: uclamp: map TASK clamp values into CPU clamp groups
Add the necessary data structures and mechanisms to translate tasks'
utilization clamp values into an effective and low-overhead fast path
(i.e. enqueue/dequeue time) tracking of the CPU's utilization clamp values.
Here we also introduce a new CONFIG_UCLAMP_TASK Kconfig option, which allows
the utilization clamping code to be completely removed on systems not needing it.
Being mainly a mechanism to be used in conjunction with schedutil, utilization
clamping also depends on CPU_FREQ_GOV_SCHEDUTIL being enabled.
We also add the possibility to define at compile time how many different clamp
values can be used. This is done because it makes sense from a practical usage
standpoint and it also has interesting size/overhead benefits.
Per task (primary) API
======================
[PATCH 3/7] sched/core: uclamp: extend sched_setattr to support utilization clamping
Provides a simple yet effective user-space API to define per-task minimum and
maximum utilization clamp values. A simple implementation is proposed here,
while a possible alternative is described in the notes.
Per task group (secondary) API
==============================
[PATCH 4/7] sched/core: uclamp: add utilization clamping to the CPU controller
[PATCH 5/7] sched/core: uclamp: use TG clamps to restrict TASK clamps
Add the same task-group-based API presented in the previous posting, but this
time layered on top of the per-task one. The second patch is dedicated to the
aggregation of per-task and per-task_group clamp values.
Schedutil integration
=====================
[PATCH 6/7] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
[PATCH 7/7] sched/cpufreq: uclamp: add utilization clamping for RT tasks
Extend sugov_aggregate_util() and sugov_set_iowait_boost() to clamp the
utilization reported by cfs_rq and rt_rq in the selection of the OPP.
This patch set is based on today's tip/sched/core:
commit b720342 ("sched/core: Update preempt_notifier_key to modern API")
but it depends on a couple of schedutil related refactoring patches which
I've posted separately on the list. For your convenience, a complete tree for
testing and evaluation is available here:
git://linux-arm.org/linux-pb.git lkml/utilclamp_v1
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v1
.:: Newcomer's Short Abstract
The Linux scheduler is able to drive frequency selection, when the schedutil
cpufreq governor is in use, based on task utilization aggregated at the CPU
level. The CPU utilization is then used to select the frequency which best
fits the workload generated by the tasks. The current translation of utilization
values into a frequency selection is pretty simple: we just go to max for RT
tasks or to the minimum frequency which can accommodate the utilization of
DL+FAIR tasks.
While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks
we can aim at better frequency driving which takes into consideration hints
coming from user-space.
Utilization clamping is a mechanism which allows the utilization generated by
RT and FAIR tasks to be "clamped" (i.e. filtered) within a range defined by
user-space. The clamped utilization value can then be used to enforce a minimum
and/or maximum frequency depending on which tasks are currently active on a
CPU.
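In other words, and as a minimal sketch only (the actual helper added by this
series is uclamp_util(), introduced by the schedutil integration patches), the
clamping operation boils down to:

   /* Illustration only: filter a utilization within [util_min, util_max] */
   static unsigned int clamp_utilization(unsigned int util,
                                         unsigned int util_min,
                                         unsigned int util_max)
   {
           if (util < util_min)
                   return util_min;
           if (util > util_max)
                   return util_max;
           return util;
   }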
The main use-cases for utilization clamping are:
- boosting: better interactive response for small tasks which
are affecting the user experience. Consider for example the case of a
small control thread for an external accelerator (e.g. GPU, DSP, other
devices). In this case the scheduler does not have a complete view of
the task's bandwidth requirements and, since it's a small task,
schedutil will keep selecting a lower frequency, thus affecting the
overall time required to complete its activations.
- capping: increased energy efficiency for background tasks not directly
affecting the user experience. Since running at a lower frequency is in
general more energy efficient, when completion time is not a main
goal, then capping the maximum frequency to be used for certain (maybe big)
tasks can have positive effects, both on energy consumption and thermal
stress.
Moreover, this last feature also allows making RT tasks more energy
friendly on mobile systems, where running them at the maximum
frequency is not strictly required.
Cheers Patrick
Patrick Bellasi (7):
sched/core: uclamp: add CPU clamp groups accounting
sched/core: uclamp: map TASK clamp values into CPU clamp groups
sched/core: uclamp: extend sched_setattr to support utilization
clamping
sched/core: uclamp: add utilization clamping to the CPU controller
sched/core: uclamp: use TG clamps to restrict TASK clamps
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
sched/cpufreq: uclamp: add utilization clamping for RT tasks
include/linux/sched.h | 34 ++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 65 ++-
init/Kconfig | 64 +++
kernel/sched/core.c | 824 +++++++++++++++++++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 46 ++-
kernel/sched/sched.h | 180 +++++++++
7 files changed, 1192 insertions(+), 25 deletions(-)
--
2.15.1
Currently schedutil enforces the maximum frequency when RT tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace, thus allowing for example to define a max frequency which is
still reasonable for the execution of a specific RT workload. This
will contribute to making the RT class more friendly to power/energy
sensitive use-cases.
This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the required utilization
is defined by the constraints of the CPU control group the task belongs to.
The IO wait boost value is thus subject to clamping for RT tasks too.
This is to ensure that RT tasks, as well as CFS ones, are always subject
to the set of current utilization clamping constraints.
It's worth noticing that, by default, the clamp values are
min_util, max_util = (0, SCHED_CAPACITY_SCALE)
and thus RT tasks always run at the maximum OPP if not otherwise
constrained by userspace.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fe53984d06a0..1c80ad65065a 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -182,16 +182,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
+ unsigned long util = sg_cpu->util_dl;
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util;
- if (rq->rt.rt_nr_running) {
- util = sg_cpu->max;
- } else {
- util = sg_cpu->util_dl;
- if (rq->cfs.h_nr_running)
- util += uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);
- }
+ /*
+ * RT and CFS utilizations are clamped, according to the current CPU
+ * constraints, only when they have RUNNABLE tasks.
+ * Their utilizations are individually clamped to ensure fairness across
+ * classes, meaning that CFS always gets (if possible) the (minimum)
+ * required bandwidth on top of that required by higher priority
+ * classes.
+ */
+ if (rq->rt.rt_nr_running)
+ util += uclamp_util(sg_cpu->cpu, sg_cpu->max);
+ if (rq->cfs.h_nr_running)
+ util += uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);
/*
* Ideally we would like to set util_dl as min/guaranteed freq and
@@ -235,12 +240,10 @@ static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, unsigned int flags)
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since for RT tasks we are going to max, we don't want to
- * clamp the IO boost max value.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
- max_boost = sg_cpu->iowait_boost_max;
- if (!cpu_rq(sg_cpu->cpu)->rt.rt_nr_running)
- max_boost = uclamp_util(sg_cpu->cpu, max_boost);
+ max_boost = uclamp_util(sg_cpu->cpu, sg_cpu->iowait_boost_max);
/* Double the IO boost at each frequency increase */
sg_cpu->iowait_boost <<= 1;
--
2.15.1
When a task's util_clamp value is configured via sched_setattr, this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated with those of the CPU
controller's cgroup in which the task is currently living.
Let's update uclamp_cpu_get() to provide an aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp value
between the task's and the TG's ones. This mimics what already happens for
a task's CPU affinity mask when the task is also living in a cpuset.
The overall idea is that cgroup attributes are always used to restrict
the per-task attributes.
For consistency purposes, as well as to properly inform userspace, the
sched_getattr call is updated to always return the properly aggregated
constraints as described above. This also makes sched_getattr a
convenient userspace API to know the utilization constraints enforced on
a task by the cgroup CPU controller.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b8299a4f03e7..592de8d32427 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -966,9 +966,18 @@ static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
clamp_value = p->uclamp[clamp_id].value;
group_id = p->uclamp[clamp_id].group_id;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Use TG's clamp value to limit task specific values */
+ if (group_id == UCLAMP_NONE ||
+ clamp_value >= task_group(p)->uclamp[clamp_id].value) {
+ clamp_value = task_group(p)->uclamp[clamp_id].value;
+ group_id = task_group(p)->uclamp[clamp_id].group_id;
+ }
+#else
/* No task specific clamp values: nothing to do */
if (group_id == UCLAMP_NONE)
return;
+#endif
/* Increment the current group_id */
uc_cpu->group[group_id].tasks += 1;
@@ -5401,6 +5410,12 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
#ifdef CONFIG_UCLAMP_TASK
attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ if (task_group(p)->uclamp[UCLAMP_MIN].value < attr.sched_util_min)
+ attr.sched_util_min = task_group(p)->uclamp[UCLAMP_MIN].value;
+ if (task_group(p)->uclamp[UCLAMP_MAX].value < attr.sched_util_max)
+ attr.sched_util_max = task_group(p)->uclamp[UCLAMP_MAX].value;
+#endif
#endif
rcu_read_unlock();
--
2.15.1
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider the requirements suggested by userspace, for
example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "required" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
operating frequency selection, thus allowing a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
Let's add the required support to clamp the utilization generated by
FAIR tasks within the boundaries defined by their aggregated utilization
clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each RUNNABLE task. This max
aggregation responds to the goal of not penalizing, for example, highly
boosted (i.e. more important for the user-experience) CFS tasks which
happen to be co-scheduled with highly capped (i.e. less important for the
user-experience) CFS tasks.
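As a purely illustrative sketch (the actual implementation uses per-CPU
reference counted clamp groups updated at enqueue/dequeue time, rather than
scanning tasks), the per-CPU aggregation is conceptually equivalent to:

   /* Illustration only: the CPU clamp is the max of its RUNNABLE tasks' clamps */
   static unsigned int cpu_clamp_value(const unsigned int *task_clamps,
                                       int nr_runnable)
   {
           unsigned int value = 0;
           int i;

           for (i = 0; i < nr_runnable; i++) {
                   if (task_clamps[i] > value)
                           value = task_clamps[i];
           }
           return value;
   }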
For FAIR tasks both the utilization and the IOWait boost values
are clamped according to the CPU's aggregated utilization clamp
constraints.
The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 29 ++++++++++++++------
kernel/sched/sched.h | 59 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 80 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e021db562308..fe53984d06a0 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -190,7 +190,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
} else {
util = sg_cpu->util_dl;
if (rq->cfs.h_nr_running)
- util += sg_cpu->util_cfs;
+ util += uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);
}
/*
@@ -213,6 +213,7 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
*/
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, unsigned int flags)
{
+ unsigned int max_boost;
/* Boost only tasks waking up after IO */
if (!(flags & SCHED_CPUFREQ_IOWAIT))
@@ -223,16 +224,28 @@ static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, unsigned int flags)
return;
sg_cpu->iowait_boost_pending = true;
- /* Double the IO boost at each frequency increase */
- if (sg_cpu->iowait_boost) {
- sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ /* At first wakeup after IO, start with minimum boost */
+ if (!sg_cpu->iowait_boost) {
+ sg_cpu->iowait_boost = sg_cpu->sg_policy->policy->min;
return;
}
- /* At first wakeup after IO, start with minimum boost */
- sg_cpu->iowait_boost = sg_cpu->sg_policy->policy->min;
+ /*
+ * Boost only up to the current max CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since for RT tasks we are going to max, we don't want to
+ * clamp the IO boost max value.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ if (!cpu_rq(sg_cpu->cpu)->rt.rt_nr_running)
+ max_boost = uclamp_util(sg_cpu->cpu, max_boost);
+
+ /* Double the IO boost at each frequency increase */
+ sg_cpu->iowait_boost <<= 1;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
}
/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a91b9cd162a3..1aa5d9b93fa5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2268,6 +2268,65 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_value: get the current CPU's utilization clamp value
+ * @cpu: the CPU to consider
+ * @clamp_id: the utilization clamp index (i.e. min or max utilization)
+ *
+ * The utilization clamp value for a CPU depends on its set of currently
+ * active tasks and their specific util_{min,max} constraints.
+ * A max aggregated value is tracked for each CPU and returned by this
+ * function. An IDLE CPU never enforces a clamp value.
+ *
+ * Return: the current value for the specified CPU and clamp index
+ */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ int clamp_value = uclamp_none(clamp_id);
+
+ /* Use the CPU's aggregated clamp value, if any */
+ if (uc_cpu->value != UCLAMP_NONE)
+ clamp_value = uc_cpu->value;
+
+ return clamp_value;
+}
+
+/**
+ * uclamp_util: clamp a utilization value for a specified CPU
+ * @cpu: the CPU to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently active tasks. Given a utilization signal, i.e. a signal in the
+ * [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ unsigned int min_util = uclamp_value(cpu, UCLAMP_MIN);
+ unsigned int max_util = uclamp_value(cpu, UCLAMP_MAX);
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.15.1
The cgroup CPU controller allows a specified (maximum) bandwidth to be
assigned to the tasks of a group. However, this bandwidth is defined and
enforced only on a temporal basis, without considering the actual
frequency a CPU is running at. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency the CPU is running that task at.
With the availability of schedutil, the scheduler is now able
to drive frequency selection based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
constrain the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently active on a CPU.
Given the above mechanisms, it is now possible to extend the CPU
controller to specify the minimum (or maximum) utilization which
a task is allowed to generate. By adding new constraints on the minimum
and maximum utilization allowed for tasks in a CPU control group, it will
also be possible to better control the actual amount of CPU bandwidth
consumed by these tasks.
The ultimate goal of this new pair of constraints is to enable:
- boosting: by selecting a higher execution frequency for small tasks
which are affecting the user interactive experience
- capping: by selecting lower execution frequency, which usually improves
energy efficiency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.
This patch extends the CPU controller by adding a couple of new attributes,
util_min and util_max, which can be used to enforce frequency boosting and
capping. Specifically:
- util_min: defines the minimum CPU utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE;
i.e. the task will run at least at a minimum frequency which
corresponds to the util_min utilization
- util_max: defines the maximum CPU utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE;
i.e. the task will run up to a maximum frequency which
corresponds to the util_max utilization
These attributes:
a) are tunable at all hierarchy levels, i.e. at the root group level too, thus
allowing the definition of minimum and maximum frequency constraints for all
otherwise non-classified tasks (e.g. autogroups), and acting as a sort-of
replacement for cpufreq's powersave, ondemand and performance
governors.
b) allow the creation of subgroups of tasks which do not violate the
utilization constraints defined by the parent group.
Tasks in a subgroup can only be further boosted and/or capped, which
matches the "limits" schema proposed by the "Resource Distribution
Model (RDM)" suggested by the cgroups v2 documentation:
Documentation/cgroup-v2.txt
This patch provides the basic support to expose the two new attributes and
to validate their run-time update based on the "limits" of the
aforementioned RDM schema.
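As a purely illustrative example of the intended usage (assuming the legacy
CPU controller is mounted under /sys/fs/cgroup/cpu and a "background" group
already exists), the new attributes can be written like any other cgroup
attribute, e.g. to cap that group to ~25% of the maximum CPU capacity:

   #include <fcntl.h>
   #include <stdio.h>
   #include <unistd.h>

   static int write_u64(const char *path, unsigned long long value)
   {
           char buf[32];
           int fd, len, ret;

           fd = open(path, O_WRONLY);
           if (fd < 0)
                   return -1;
           len = snprintf(buf, sizeof(buf), "%llu", value);
           ret = (write(fd, buf, len) == len) ? 0 : -1;
           close(fd);
           return ret;
   }

   int main(void)
   {
           /* 256 out of SCHED_CAPACITY_SCALE (1024) is ~25% capacity */
           return write_u64("/sys/fs/cgroup/cpu/background/cpu.util_max", 256);
   }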
We first ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time). This is done by
slightly refactoring uclamp_group_get() to accept a *cgroup_subsys_state
alongside *task_struct.
When uclamp_group_get() is called with a valid *cgroup_subsys_state, a
clamp group is assigned to the task group, which is possibly different
from the task-specific clamp group. We then make sure to update the
current clamp group accounting for all the tasks which are currently
runnable in the cgroup, via a new uclamp_group_get_tg() call.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
The actual aggregation of per-task and per-task_group utilization
constraints is provided in a separate patch, to make it clearer and
better documented how this aggregation is performed.
---
init/Kconfig | 22 +++++
kernel/sched/core.c | 271 ++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 21 ++++
3 files changed, 311 insertions(+), 3 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index 977aa4d1e42a..d999879f8625 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -795,6 +795,28 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, a possibly
+ specified task-specific clamp value is constrained by the cgroup
+ specified clamp value. Neither the minimum nor the maximum task clamp
+ can be bigger than the corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ee4f380aba6..b8299a4f03e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1130,8 +1130,22 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
}
+static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int group_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /* Update clamp groups for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id, group_id);
+ css_task_iter_end(&it);
+}
+
/**
* uclamp_group_get: increase the reference count for a clamp group
+ * @css: reference to the task group to account
* @clamp_id: the clamp index affected by the task group
* @uc_se: the utilization clamp data for the task group
* @clamp_value: the new clamp value for the task group
@@ -1145,6 +1159,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
* Return: -ENOSPC if there are not available clamp groups, 0 on success.
*/
static inline int uclamp_group_get(struct task_struct *p,
+ struct cgroup_subsys_state *css,
int clamp_id, struct uclamp_se *uc_se,
unsigned int clamp_value)
{
@@ -1172,8 +1187,13 @@ static inline int uclamp_group_get(struct task_struct *p,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+ /* Newly created TG don't have tasks assigned */
+ if (css)
+ uclamp_group_get_tg(css, clamp_id, next_group_id);
+
/* Update current task if task specific clamp has been changed */
- uclamp_task_update_active(p, clamp_id, next_group_id);
+ if (p)
+ uclamp_task_update_active(p, clamp_id, next_group_id);
/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);
@@ -1181,6 +1201,103 @@ static inline int uclamp_group_get(struct task_struct *p,
return 0;
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * init_uclamp_sched_group: initialize data structures required for TG's
+ * utilization clamping
+ */
+static inline void init_uclamp_sched_group(void)
+{
+ struct uclamp_map *uc_map;
+ struct uclamp_se *uc_se;
+ int group_id;
+ int clamp_id;
+
+ /* Root TG's are initialized to the first clamp group */
+ group_id = 0;
+
+ /* Initialize root TG's to default (none) clamp values */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_map = &uclamp_maps[clamp_id][0];
+
+ /* Map root TG's clamp value */
+ uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
+
+ /* Init root TG's clamp group */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->group_id = group_id;
+
+ /* Attach root TG's clamp group */
+ uc_map[group_id].se_count = 1;
+ }
+}
+
+/**
+ * alloc_uclamp_sched_group: initialize a new TG for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same parent's group index.
+ *
+ * Return: !0 on error
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+ int ret = 1;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+
+ uc_se->value = parent->uclamp[clamp_id].value;
+ uc_se->group_id = UCLAMP_NONE;
+
+ if (uclamp_group_get(NULL, NULL, clamp_id, uc_se,
+ parent->uclamp[clamp_id].value)) {
+ ret = 0;
+ goto out;
+ }
+ }
+
+out:
+ return ret;
+}
+
+/**
+ * free_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all the reference
+ * counting to clamp groups.
+ */
+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+ uclamp_group_put(clamp_id, uc_se->group_id);
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK_GROUP */
+static inline void init_uclamp_sched_group(void) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1196,12 +1313,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update min utilization clamp */
uc_se = &p->uclamp[UCLAMP_MIN];
- retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
+ retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se,
attr->sched_util_min);
/* Update max utilization clamp */
uc_se = &p->uclamp[UCLAMP_MAX];
- retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
+ retval |= uclamp_group_get(p, NULL, UCLAMP_MAX, uc_se,
attr->sched_util_max);
mutex_unlock(&uclamp_mutex);
@@ -1243,10 +1360,18 @@ static inline void init_uclamp(void)
memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
}
}
+
+ init_uclamp_sched_group();
}
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_task_update(struct rq *rq, struct task_struct *p) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -6823,6 +6948,7 @@ static DEFINE_SPINLOCK(task_group_lock);
static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -6844,6 +6970,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;
err:
@@ -7064,6 +7193,130 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct cgroup_subsys_state *pos;
+ struct uclamp_se *uc_se;
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return ret;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+
+ /* Already at the required value */
+ if (tg->uclamp[UCLAMP_MIN].value == min_value) {
+ ret = 0;
+ goto out;
+ }
+
+ /* Ensure to not exceed the maximum clamp value */
+ if (tg->uclamp[UCLAMP_MAX].value < min_value)
+ goto out;
+
+ /* Ensure min clamp fits within parent's clamp value */
+ if (tg->parent &&
+ tg->parent->uclamp[UCLAMP_MIN].value > min_value)
+ goto out;
+
+ /* Ensure each child is a restriction of this TG */
+ css_for_each_child(pos, css) {
+ if (css_tg(pos)->uclamp[UCLAMP_MIN].value < min_value)
+ goto out;
+ }
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MIN];
+ ret = uclamp_group_get(NULL, css, UCLAMP_MIN, uc_se, min_value);
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct cgroup_subsys_state *pos;
+ struct uclamp_se *uc_se;
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return ret;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+
+ /* Already at the required value */
+ if (tg->uclamp[UCLAMP_MAX].value == max_value) {
+ ret = 0;
+ goto out;
+ }
+
+ /* Ensure to not go below the minimum clamp value */
+ if (tg->uclamp[UCLAMP_MIN].value > max_value)
+ goto out;
+
+ /* Ensure max clamp fits within parent's clamp value */
+ if (tg->parent &&
+ tg->parent->uclamp[UCLAMP_MAX].value < max_value)
+ goto out;
+
+ /* Ensure each child is a restriction of this TG */
+ css_for_each_child(pos, css) {
+ if (css_tg(pos)->uclamp[UCLAMP_MAX].value > max_value)
+ goto out;
+ }
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MAX];
+ ret = uclamp_group_get(NULL, css, UCLAMP_MAX, uc_se, max_value);
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7391,6 +7644,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util_min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util_max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 25c2011ecc41..a91b9cd162a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -388,6 +388,11 @@ struct task_group {
#endif
struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -460,6 +465,22 @@ struct uclamp_cpu {
struct uclamp_group group[CONFIG_UCLAMP_GROUPS_COUNT + 1];
};
+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
/**
* uclamp_task_affects: check if a task affects a utilization clamp
* @p: the task to consider
--
2.15.1
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can be translated into proper
decisions for both task placement and frequency selection.
Other classes have a simpler model which is essentially based on
the relatively simple concept of POSIX priorities.
Such a simple priority based model however does not allow exploiting
some of the more advanced features of the Linux scheduler, like, for
example, driving frequency selection via the schedutil cpufreq
governor. Nevertheless, also for non SCHED_DEADLINE tasks, it's still
interesting to define task properties which can be used to better
support certain scheduler decisions.
Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization it should consider for a
task. This will allow the implementation of a more advanced per-task
frequency control mechanism which is not based just on a "passive"
measured task utilization but on a more "proactive" approach. For example,
it could be possible to boost interactive tasks, thus getting better
performance, or cap background tasks, thus being more energy efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq
powersave, performance and userspace governors, but with a much more fine
grained and per-task control.
Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr, a syscall which already
allows defining task-specific properties for different scheduling
classes.
Specifically, a new pair of attributes allows specifying a minimum and
maximum utilization which the scheduler should consider for a task.
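For reference, a minimal (and purely hypothetical) userspace usage of the
extended syscall could look like the following; there is no glibc wrapper for
sched_setattr, so the raw syscall is used and the structure layout mirrors the
one proposed by this patch:

   #include <stdint.h>
   #include <string.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   #define SCHED_FLAG_UTIL_CLAMP  0x08

   /* Local copy of the (extended) sched_attr proposed by this patch */
   struct sched_attr {
           uint32_t size;
           uint32_t sched_policy;
           uint64_t sched_flags;
           int32_t  sched_nice;
           uint32_t sched_priority;
           uint64_t sched_runtime;
           uint64_t sched_deadline;
           uint64_t sched_period;
           uint32_t sched_util_min;
           uint32_t sched_util_max;
   };

   int main(void)
   {
           struct sched_attr attr;

           memset(&attr, 0, sizeof(attr));
           attr.size = sizeof(attr);
           attr.sched_policy = 0;                  /* SCHED_NORMAL */
           attr.sched_flags = SCHED_FLAG_UTIL_CLAMP;
           attr.sched_util_min = 512;              /* boost to at least ~50% */
           attr.sched_util_max = 1024;             /* no upper restriction */

           /* pid == 0 means the calling task */
           return syscall(__NR_sched_setattr, 0, &attr, 0);
   }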
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
The solution proposed here exposes the concept of "utilization" to
userspace. This should be a quite generic concept, especially if we abstract
it a bit to be considered just as a percentage of the CPU capacity, and
thus in the range [0..100] instead of [0..SCHED_CAPACITY_SCALE] as it is
now.
If such a definition of utilization should still be considered too much of
an implementation detail, a possible alternative proposal is to
"recycle" the usage of:
sched_runtime
sched_period
to be translated internally into a proper utilization.
Such a model, although slightly more complicated from a coding
standpoint, would allow a more abstract description of the
expected task utilization and, combined with in-kernel knowledge
of the math governing PELT, could probably be translated into a better
min/max utilization clamp value.
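As a rough sketch, such a translation could be something like the following
(ignoring rounding and the kernel's fixed point helpers, and assuming
SCHED_CAPACITY_SCALE == 1024):

   /* Map a (runtime, period) pair, in ns, into a [0..1024] utilization */
   static unsigned int dl_bandwidth_to_util(unsigned long long runtime_ns,
                                            unsigned long long period_ns)
   {
           if (!period_ns)
                   return 0;
           return (unsigned int)((runtime_ns * 1024) / period_ns);
   }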
For this first proposal, I've opted to present the simplest solution,
based on a simple "abstract" utilization metric.
---
include/uapi/linux/sched.h | 4 ++-
include/uapi/linux/sched/types.h | 65 ++++++++++++++++++++++++++++++++++------
kernel/sched/core.c | 52 ++++++++++++++++++++++++++++++++
3 files changed, 111 insertions(+), 10 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..c27d6e81517b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_UTIL_CLAMP 0x08
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_UTIL_CLAMP)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..c243288465b2 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -21,8 +21,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +59,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +72,29 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the utilization which
+ * should be expected by a task. These attributes allow to inform the
+ * scheduler about the utilization boundaries within which it is safe to
+ * schedule the task. These boundaries are valuable information to support
+ * scheduler decisions on both task placement and frequency selection.
+ *
+ * @sched_min_utilization represents the minimum CPU utilization
+ * @sched_max_utilization represents the maximum CPU utilization
+ *
+ * Utilization is a value in the range [0..1023] which represents the
+ * percentage of CPU time used by a task when running at the maximum frequency
+ * in the highest capacity CPU of the system. Thus, for example, a 20%
+ * utilization task is a task running for 2ms every 10ms on a cpu with the
+ * highest capacity in the system.
+ *
+ * A task with a min utilization value bigger than 0 is more likely to be
+ * scheduled on a CPU which can support that utilization level.
+ * A task with a max utilization value smaller than 1024 is more likely to be
+ * scheduled on a CPU which does not support more than that utilization level.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +112,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a602b7b9d5f9..6ee4f380aba6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1181,6 +1181,41 @@ static inline int uclamp_group_get(struct task_struct *p,
return 0;
}
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ struct uclamp_se *uc_se;
+ int retval = 0;
+
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ mutex_lock(&uclamp_mutex);
+
+ /* Update min utilization clamp */
+ uc_se = &p->uclamp[UCLAMP_MIN];
+ retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
+ attr->sched_util_min);
+
+ /* Update max utilization clamp */
+ uc_se = &p->uclamp[UCLAMP_MAX];
+ retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
+ attr->sched_util_max);
+
+ mutex_unlock(&uclamp_mutex);
+
+ /*
+ * If setting one of the two clamp values failed,
+ * let userspace know
+ */
+ if (retval)
+ return -ENOSPC;
+
+ return 0;
+}
+
/**
* init_uclamp: initialize data structures required for utilization clamping
*/
@@ -1212,6 +1247,11 @@ static inline void init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_task_update(struct rq *rq, struct task_struct *p) { }
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
@@ -4720,6 +4760,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -5226,6 +5273,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);
+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+#endif
+
rcu_read_unlock();
retval = sched_read_attr(uattr, &attr, size);
--
2.15.1
Utilization clamping requires each CPU to know which clamp values are
assigned to the tasks currently active on that CPU; multiple tasks can
be assigned the same clamp value for a given clamp index
(clamp_id := util_{min,max}). Thus, it is required to map clamp values
to a data structure which is suitable for fast and efficient aggregation
of the clamp values required by active tasks.
To this purpose we use a per-CPU array of reference counters,
where each slot is used to account for how many tasks requiring a certain
clamp value are currently active on each CPU.
Each clamp value corresponds to a "clamp group", whose index identifies
the position within the reference counters array.
:
:
SLOW PATH : FAST PATH
task_struct::uclamp::value : sched/core::enqueue/dequeue
: cpufreq_schedutil
:
+----------------+ +--------------------+ +-------------------+
| TASK | | CLAMP GROUP | | CPU CLAMPS |
+----------------+ +--------------------+ +-------------------+
| | | clamp_{min,max} | | clamp_{min,max} |
| util_{min,max} | | tg_count | | tasks count |
+----------------+ +--------------------+ +-------------------+
:
+------------------> : +------------------->
map(clamp_value, group_id) : ref_count(group_id)
:
:
This patch introduces the support to map tasks to "clamp groups".
Specifically, it introduces the required functions to translate a clamp
value into a clamp group index (group_id).
Only a limited number of (different) clamp values is supported since:
1. there are usually only a few classes of workloads for which it makes
sense to boost/cap to different frequencies
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path
The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can result in
a -ENOSPC error in case this would exceed the maximum number of
different clamp values supported.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 21 ++++
2 files changed, 315 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 009e65cbd4f4..a602b7b9d5f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -747,6 +747,168 @@ static void set_load_weight(struct task_struct *p, bool update_load)
*/
static DEFINE_MUTEX(uclamp_mutex);
+/**
+ * uclamp_map: reference counts a utilization "clamp value"
+ * @value: the utilization "clamp value" required
+ * @se_count: the number of scheduling entities requiring the "clamp value"
+ * @se_lock: serialize reference count updates by protecting se_count
+ */
+struct uclamp_map {
+ int value;
+ int se_count;
+ raw_spinlock_t se_lock;
+};
+
+/**
+ * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
+ *
+ * Since only a limited number of different "clamp values" are supported, we
+ * need to map each possible "clamp value" into a "clamp group" to be used by
+ * the per-CPU accounting in the fast-path (i.e. tasks enqueuing/dequeuing).
+ * We also support different kind of utilization clamping, min and max
+ * utilization, each representing a "clamp index" (clamp_id).
+ *
+ * A matrix is thus required to map clamp values to clamp groups (group_id)
+ * for each clamp index (clamp_id), where:
+ * - rows are indexed by clamp_id and collect the clamp groups for a given
+ * clamp index, i.e. min or max utilization;
+ * - columns are indexed by group_id and collect the clamp values which maps
+ * to that clamp group, i.e. actual uclamp_map instances;
+ *
+ * Thus, the column index of a given (clamp_id, value) pair represents the
+ * clamp group (group_id) used by the fast-path's per-CPU accounting.
+ *
+ * NOTE: first clamp group (group_id=0) is reserved for tracking of non
+ * clamped tasks. Thus we allocate one more slot than the value of
+ * CONFIG_UCLAMP_GROUPS_COUNT.
+ *
+ * Here is the map layout and, right below, how entries are accessed by the
+ * following code.
+ *
+ * uclamp_maps is a matrix of
+ * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
+ * | |
+ * | /---------------+---------------\
+ * | +------------+ +------------+
+ * | / UCLAMP_MIN | value | | value |
+ * | | | se_count |...... | se_count |
+ * | | +------------+ +------------+
+ * +--+ +------------+ +------------+
+ * | | value | | value |
+ * \ UCLAMP_MAX | se_count |...... | se_count |
+ * +-----^------+ +----^-------+
+ * | |
+ * uc_map = + |
+ * &uclamp_maps[clamp_id][0] +
+ * clamp_value =
+ * uc_map[group_id].value
+ */
+static struct uclamp_map uclamp_maps[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+
+/**
+ * uclamp_group_available: checks if a clamp group is available
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index in the given clamp_id
+ *
+ * A clamp group is not free if there is at least one TG using a clamp value
+ * mapped on the specified clamp_id. These TGs are reference counted by the
+ * se_count of a uclamp_map entry.
+ *
+ * Return: true if there are no TG's mapped on the specified clamp
+ * index and group
+ */
+static inline bool uclamp_group_available(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ return (uc_map[group_id].value == UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_init: maps a clamp value on a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index to map a given clamp_value
+ * @clamp_value: the utilization clamp value to map
+ *
+ * Each different clamp value, for a given clamp index (i.e. min/max
+ * utilization clamp), is mapped by a clamp group whose index is used by the
+ * fast-path code to keep track of active tasks requiring a certain clamp
+ * value.
+ *
+ * This function initializes a clamp group to track tasks from the fast-path.
+ */
+static inline void uclamp_group_init(int clamp_id, int group_id,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ struct uclamp_cpu *uc_cpu;
+ int cpu;
+
+ /* Set clamp group map */
+ uc_map[group_id].value = clamp_value;
+ uc_map[group_id].se_count = 0;
+
+ /* Set clamp groups on all CPUs */
+ for_each_possible_cpu(cpu) {
+ uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ uc_cpu->group[group_id].value = clamp_value;
+ uc_cpu->group[group_id].tasks = 0;
+ }
+}
+
+/**
+ * uclamp_group_reset: resets a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @group_id: the group index to release
+ *
+ * A clamp group can be reset every time there are no more TGs using the
+ * clamp value it maps for a given clamp index.
+ */
+static inline void uclamp_group_reset(int clamp_id, int group_id)
+{
+ uclamp_group_init(clamp_id, group_id, UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_find: finds the group index of a utilization clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @clamp_value: the utilization clamping value lookup for
+ *
+ * Verify if a group has been assigned to a certain clamp value and return
+ * its index to be used for accounting.
+ *
+ * Since only a limited number of utilization clamp groups are allowed, if no
+ * groups have been assigned for the specified value, a new group is assigned
+ * if possible. Otherwise an error is returned, meaning that a different clamp
+ * value is not (currently) supported.
+ */
+static int
+uclamp_group_find(int clamp_id, unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int free_group_id = UCLAMP_NONE;
+ unsigned int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Keep track of first free clamp group */
+ if (uclamp_group_available(clamp_id, group_id)) {
+ if (free_group_id == UCLAMP_NONE)
+ free_group_id = group_id;
+ continue;
+ }
+ /* Return index of first group with same clamp value */
+ if (uc_map[group_id].value == clamp_value)
+ return group_id;
+ }
+ /* Default to first free clamp group */
+ if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
+ group_id = free_group_id;
+ /* All clamp groups are already tracking different clamp values */
+ if (group_id == UCLAMP_NONE)
+ return -ENOSPC;
+ return group_id;
+}
+
/**
* uclamp_cpu_update: updates the utilization clamp of a CPU
* @cpu: the CPU which utilization clamp has to be updated
@@ -896,18 +1058,150 @@ static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
}
}
+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task which clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ * @group_id: the clamp group to update
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups have to be updated for each CPU containing a RUNNABLE task belonging
+ * to this tasks group. Sleeping tasks are not updated since they will be
+ * enqueued with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) RQ of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * The setting of the clamp group is serialized by task_rq_lock().
+ * Thus, if the task's task_struct is not referencing a valid group
+ * index, then that task is not yet RUNNABLE or it's going to be
+ * enqueued with the proper clamp group value.
+ */
+ if (!uclamp_task_active(p))
+ goto done;
+
+ /* Release p's currently referenced clamp group */
+ uclamp_cpu_put(p, task_cpu(p), clamp_id);
+
+ /* Get p's new clamp group */
+ uclamp_cpu_get(p, task_cpu(p), clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @group_id: the clamp group index to release
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value. A clamp group is
+ * released when there are no more task groups referencing its clamp value.
+ */
+static inline void uclamp_group_put(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ unsigned long flags;
+
+ /* Ignore SE's not yet attached */
+ if (group_id == UCLAMP_NONE)
+ return;
+
+ /* Remove SE from this clamp group */
+ raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
+ uc_map[group_id].se_count -= 1;
+ if (uc_map[group_id].se_count == 0)
+ uclamp_group_reset(clamp_id, group_id);
+ raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @p: the (possibly RUNNABLE) task whose active clamp group is updated
+ * @clamp_id: the clamp index affected by the task group
+ * @uc_se: the utilization clamp data for the task group
+ * @clamp_value: the new clamp value for the task group
+ *
+ * Each time a task group changes the utilization clamp value, for a specified
+ * clamp index, we need to find an available clamp group which can be used
+ * to track its new clamp value. The corresponding clamp group index will be
+ * used by tasks in this task group to reference count the clamp value on CPUs
+ * where they are enqueued.
+ *
+ * Return: -ENOSPC if there are no available clamp groups, 0 on success.
+ */
+static inline int uclamp_group_get(struct task_struct *p,
+ int clamp_id, struct uclamp_se *uc_se,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int prev_group_id = uc_se->group_id;
+ int next_group_id = UCLAMP_NONE;
+ unsigned long flags;
+
+ /* Lookup for a usable utilization clamp group */
+ next_group_id = uclamp_group_find(clamp_id, clamp_value);
+ if (next_group_id < 0) {
+ pr_err("Cannot allocate more than %d utilization clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ return -ENOSPC;
+ }
+
+ /* Allocate new clamp group for this clamp value */
+ if (uclamp_group_available(clamp_id, next_group_id))
+ uclamp_group_init(clamp_id, next_group_id, clamp_value);
+
+ /* Update SE's clamp values and attach it to new clamp group */
+ raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
+ uc_se->value = clamp_value;
+ uc_se->group_id = next_group_id;
+ uc_map[next_group_id].se_count += 1;
+ raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+
+ /* Update current task if task specific clamp has been changed */
+ uclamp_task_update_active(p, clamp_id, next_group_id);
+
+ /* Release the previous clamp group */
+ uclamp_group_put(clamp_id, prev_group_id);
+
+ return 0;
+}
+
/**
* init_uclamp: initialize data structures required for utilization clamping
*/
static inline void init_uclamp(void)
{
struct uclamp_cpu *uc_cpu;
+ struct uclamp_map *uc_map;
+ int group_id;
int clamp_id;
int cpu;
mutex_init(&uclamp_mutex);
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_map = &uclamp_maps[clamp_id][0];
+ /* Init SE's clamp map */
+ group_id = 0;
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ uc_map[group_id].value = UCLAMP_NONE;
+ raw_spin_lock_init(&uc_map[group_id].se_lock);
+ }
/* Init CPU's clamp groups */
for_each_possible_cpu(cpu) {
uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be93d833ad6b..25c2011ecc41 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -481,6 +481,27 @@ static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
return (task_group_id != UCLAMP_NONE);
}
+/**
+ * uclamp_task_active: check if a task is currently clamping a CPU
+ * @p: the task to check
+ *
+ * A task affects the utilization clamp of a CPU if it references a valid
+ * clamp group index for at least one clamp index.
+ *
+ * Return: true if p is currently clamping the utilization of its CPU.
+ */
+static inline bool uclamp_task_active(struct task_struct *p)
+{
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (uclamp_task_affects(p, clamp_id))
+ return true;
+ }
+
+ return false;
+}
+
/**
* uclamp_group_active: check if a clamp group is active on a CPU
* @uc_cpu: the array of clamp groups for a CPU
--
2.15.1
Utilization clamping allows to clamp the utilization of a CPU within a
[util_min, util_max] range. This range depends on the set of currently
active tasks on a CPU, where each task references two "clamp groups"
defining the util_min and the util_max clamp values to be considered for
that task. The clamp value mapped by a clamp group applies to a CPU only
when there is at least one task active referencing that clamp group.
When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
it can be necessary to re-compute the new "aggregated" clamp
value to apply on that CPU.
Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
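For example, with arbitrary clamp values for two RUNNABLE tasks on the
same CPU:

   /*
    * Task A (boosted): util_min = 512, util_max = 1024
    * Task B (capped) : util_min =   0, util_max =  200
    *
    * CPU clamps:
    *   util_min = max(512, 0)    =  512
    *   util_max = max(1024, 200) = 1024
    */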
Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.
Tasks have a task_struct::uclamp::group_id indexing the clamp group into
which they should be accounted at enqueue time. This index is
cached into task_struct::uclamp_group_id once a task is enqueued on a
CPU to ensure a consistent and efficient update of the reference count
at dequeue time.
Each CPU's rq has a rq::uclamp::group[].tasks field which is used to reference
count how many tasks are currently active on that CPU for each clamp
group. The clamp value of each clamp group is tracked by
rq::uclamp::group[].value, thus making rq::uclamp::group[] an unordered
array of clamp values. However, the MAX aggregation of the currently
active clamp groups is implemented to minimize the number of times we
need to scan the complete (unordered) clamp group array to figure out
the new max value. This operation indeed happens only when we dequeue
the last task of the clamp group corresponding to the current max clamp, and
thus the CPU is either entering IDLE or going to schedule a less boosted
or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not worth the complexity. In real use-cases we
expect fewer than 10 different values.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 34 +++++++++
init/Kconfig | 42 +++++++++++
kernel/sched/core.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 79 ++++++++++++++++++++
4 files changed, 353 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f228c6033832..d25460754d03 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -238,6 +238,17 @@ struct vtime {
u64 gtime;
};
+enum uclamp_id {
+ /* No utilization clamp group assigned */
+ UCLAMP_NONE = -1,
+
+ UCLAMP_MIN = 0, /* Minimum utilization */
+ UCLAMP_MAX, /* Maximum utilization */
+
+ /* Utilization clamping constraints count */
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -526,6 +537,22 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};
+/**
+ * Utilization's clamp group
+ *
+ * A utilization clamp group maps a "clamp value" (value), i.e.
+ * util_{min,max}, to a "clamp group index" (group_id).
+ *
+ * Thus, the same "group_id" is used by all the TG's which enforce the same
+ * clamp "value" for a given clamp index.
+ */
+struct uclamp_se {
+ /* Utilization constraint for tasks in this group */
+ unsigned int value;
+ /* Utilization clamp group for this constraint */
+ unsigned int group_id;
+};
+
union rcu_special {
struct {
u8 blocked;
@@ -608,6 +635,13 @@ struct task_struct {
#endif
struct sched_dl_entity dl;
+#ifdef CONFIG_UCLAMP_TASK
+ /* Clamp group the task is currently accounted into */
+ int uclamp_group_id[UCLAMP_CNT];
+ /* Utilization clamp values for this task */
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/init/Kconfig b/init/Kconfig
index e37f4b2a6445..977aa4d1e42a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -585,6 +585,48 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool
+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enabled utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max CPU
+ bandwidth which is allowed for a task.
+ The max bandwidth allows clamping the maximum frequency a task can
+ use, while the min bandwidth allows defining a minimum frequency a
+ task will always use.
+
+ If in doubt, say N.
+
+
+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 16
+ default 4
+ depends on UCLAMP_TASK
+ help
+ This defines the maximum number of different utilization clamp
+ values which can be concurrently enforced for each utilization
+ clamp index (i.e. minimum and maximum utilization).
+
+ Only a limited number of clamp values are supported because:
+ 1. there are usually only a few classes of workloads for which it
+ makes sense to boost/cap to different frequencies
+ e.g. background vs foreground, interactive vs low-priority
+ 2. it allows a simpler and more memory/time efficient tracking of
+ the per-CPU clamp values.
+
+ Set to 0 (default value) to disable the utilization clamping feature.
+
+ If in doubt, use the default value.
+
+endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de440456f15c..009e65cbd4f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -735,6 +735,192 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_mutex: serializes updates of utilization clamp values
+ *
+ * A utilization clamp value update is usually triggered from a user-space
+ * process (slow-path) but it requires a synchronization with the scheduler's
+ * (fast-path) enqueue/dequeue operations.
+ * While the fast-path synchronization is protected by RQs spinlock, this
+ * mutex ensures that we sequentially serve user-space requests.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @cpu: the CPU which utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups is subject to change. Since each clamp group enforces a
+ * different utilization clamp value, once the set of these groups changes it
+ * can be necessary to re-compute the new clamp value to apply for that
+ * CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of active tasks on that CPU.
+ */
+static inline void uclamp_cpu_update(int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ int max_value = UCLAMP_NONE;
+ unsigned int group_id;
+
+ for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
+ if (!uclamp_group_active(uc_cpu, group_id))
+ continue;
+
+ /* Both min and max clamp are MAX aggregated */
+ max_value = max(max_value, uc_cpu->group[group_id].value);
+
+ /* Stop if we reach the max possible clamp */
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ uc_cpu->value = max_value;
+}
+
+/**
+ * uclamp_cpu_get(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @cpu: the CPU where the clamp group has to be reference counted
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
+ *
+ * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
+ * the task's uclamp.group_id is reference counted on that CPU.
+ * We keep track of the reference counted clamp group by storing its index
+ * (group_id) into the task's task_struct::uclamp_group_id, which will then be
+ * used at task's dequeue time to release the reference count.
+ */
+static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ int clamp_value;
+ int group_id;
+
+ /* Get task's specific clamp value */
+ clamp_value = p->uclamp[clamp_id].value;
+ group_id = p->uclamp[clamp_id].group_id;
+
+ /* No task specific clamp values: nothing to do */
+ if (group_id == UCLAMP_NONE)
+ return;
+
+ /* Increment the current group_id */
+ uc_cpu->group[group_id].tasks += 1;
+
+ /* Mark task as enqueued for this clamp index */
+ p->uclamp_group_id[clamp_id] = group_id;
+
+ /*
+ * If this is the new max utilization clamp value, then we can update
+ * straight away the CPU clamp value. Otherwise, the current CPU clamp
+ * value is still valid and we are done.
+ */
+ if (uc_cpu->value < clamp_value)
+ uc_cpu->value = clamp_value;
+}
+
+/**
+ * uclamp_cpu_put(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @cpu: the CPU from where the clamp group has to be released
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
+ *
+ * When a task is dequeued from a CPU's RQ, the clamp group reference counted
+ * by the task, which is reported by task_struct::uclamp_group_id, is decreased
+ * for that CPU. If this was the last task defining the current max clamp
+ * group, then the CPU clamping is updated to find out the new max for the
+ * specified clamp index.
+ */
+static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ unsigned int clamp_value;
+ int group_id;
+
+ /* Decrement the task's reference counted group index */
+ group_id = p->uclamp_group_id[clamp_id];
+ uc_cpu->group[group_id].tasks -= 1;
+
+ /* Mark task as dequeued for this clamp IDX */
+ p->uclamp_group_id[clamp_id] = UCLAMP_NONE;
+
+ /* If this is not the last task, no updates are required */
+ if (uc_cpu->group[group_id].tasks > 0)
+ return;
+
+ /*
+ * Update the CPU only if this was the last task of the group
+ * defining the current clamp value.
+ */
+ clamp_value = uc_cpu->group[group_id].value;
+ if (clamp_value >= uc_cpu->value)
+ uclamp_cpu_update(cpu, clamp_id);
+}
+
+/**
+ * uclamp_task_update: update clamp group referenced by a task
+ * @rq: the RQ the task is going to be enqueued/dequeued to/from
+ * @p: the task being enqueued/dequeued
+ *
+ * Utilization clamp constraints for a CPU depend on tasks which are active
+ * (i.e. RUNNABLE or RUNNING) on that CPU. To keep track of task
+ * requirements, each active task reference counts a clamp group on the CPU
+ * where it is currently enqueued for execution.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
+{
+ int cpu = cpu_of(rq);
+ int clamp_id;
+
+ /* The idle task does not affect CPU's clamps */
+ if (unlikely(p->sched_class == &idle_sched_class))
+ return;
+ /* DEADLINE tasks do not affect CPU's clamps */
+ if (unlikely(p->sched_class == &dl_sched_class))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (uclamp_task_affects(p, clamp_id))
+ uclamp_cpu_put(p, cpu, clamp_id);
+ else
+ uclamp_cpu_get(p, cpu, clamp_id);
+ }
+}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static inline void init_uclamp(void)
+{
+ struct uclamp_cpu *uc_cpu;
+ int clamp_id;
+ int cpu;
+
+ mutex_init(&uclamp_mutex);
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ /* Init CPU's clamp groups */
+ for_each_possible_cpu(cpu) {
+ uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
+ memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
+ }
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_task_update(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -743,6 +929,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
+ uclamp_task_update(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -754,6 +941,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);
+ uclamp_task_update(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2154,6 +2342,14 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_UCLAMP_TASK
+ memset(&p->uclamp_group_id, UCLAMP_NONE, sizeof(p->uclamp_group_id));
+ p->uclamp[UCLAMP_MIN].value = 0;
+ p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NONE;
+ p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
+ p->uclamp[UCLAMP_MAX].group_id = UCLAMP_NONE;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -6108,6 +6304,8 @@ void __init sched_init(void)
init_schedstats();
+ init_uclamp();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c3deaee7a7a2..be93d833ad6b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -423,6 +423,80 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
extern int tg_nop(struct task_group *tg, void *data);
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * Utilization clamp Group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ /* Utilization clamp value for tasks on this clamp group */
+ int value;
+ /* Number of RUNNABLE tasks on this clamp group */
+ int tasks;
+};
+
+/**
+ * CPU's utilization clamp
+ *
+ * Keep track of active tasks on a CPU to aggregate their clamp values. A
+ * clamp value affects a CPU when there is at least one RUNNABLE
+ * (or actually running) task with that value.
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we wanna run the CPU at least at the max of the minimum
+ * utilization required by its currently active tasks.
+ * - for util_max: we wanna allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently active tasks.
+ *
+ * Since on each system we expect only a limited number of utilization clamp
+ * values, we can use a simple array to track the metrics required to compute
+ * all the per-CPU utilization clamp values.
+ */
+struct uclamp_cpu {
+ /* Utilization clamp value for a CPU */
+ int value;
+ /* Utilization clamp groups affecting this CPU */
+ struct uclamp_group group[CONFIG_UCLAMP_GROUPS_COUNT + 1];
+};
+
+/**
+ * uclamp_task_affects: check if a task affects a utilization clamp
+ * @p: the task to consider
+ * @clamp_id: the utilization clamp to check
+ *
+ * A task affects a clamp index if its task_struct::uclamp_group_id is a
+ * valid clamp group index for the specified clamp index.
+ * Once a task is dequeued from a CPU, its clamp group indexes are reset to
+ * UCLAMP_NONE. A valid clamp group index is assigned to a task only when it
+ * is RUNNABLE on a CPU and it represents the clamp group which is currently
+ * reference counted by that task.
+ *
+ * Return: true if p currently affects the specified clamp_id
+ */
+static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
+{
+ int task_group_id = p->uclamp_group_id[clamp_id];
+
+ return (task_group_id != UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_active: check if a clamp group is active on a CPU
+ * @uc_cpu: the array of clamp groups for a CPU
+ * @group_id: the clamp group to check
+ *
+ * A clamp group affects a CPU if it has at least one "active" task.
+ *
+ * Return: true if the specified CPU has at least one active task for
+ * the specified clamp group.
+ */
+static inline bool uclamp_group_active(struct uclamp_cpu *uc_cpu, int group_id)
+{
+ return uc_cpu->group[group_id].tasks > 0;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
extern void free_fair_sched_group(struct task_group *tg);
extern int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent);
extern void online_fair_sched_group(struct task_group *tg);
@@ -811,6 +885,11 @@ struct rq {
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig;
+#ifdef CONFIG_UCLAMP_TASK
+ /* util_{min,max} clamp values based on CPU's active tasks */
+ struct uclamp_cpu uclamp[UCLAMP_CNT];
+#endif
+
struct callback_head *balance_callback;
unsigned char idle_balance;
--
2.15.1
Hello, Patrick.
Comments purely on cgroup interface side.
On Mon, Apr 09, 2018 at 05:56:12PM +0100, Patrick Bellasi wrote:
> This patch extends the CPU controller by adding a couple of new attributes,
> util_min and util_max, which can be used to enforce frequency boosting and
> capping. Specifically:
>
> - util_min: defines the minimum CPU utilization which should be considered,
> e.g. when schedutil selects the frequency for a CPU while a
> task in this group is RUNNABLE.
> i.e. the task will run at least at a minimum frequency which
> corresponds to the min_util utilization
>
> - util_max: defines the maximum CPU utilization which should be considered,
> e.g. when schedutil selects the frequency for a CPU while a
> task in this group is RUNNABLE.
> i.e. the task will run up to a maximum frequency which
> corresponds to the max_util utilization
I'm not too enthusiastic about util_min/max given that it can easily
be read as actual utilization based bandwidth control when what's
actually implemented, IIUC, is affecting CPU frequency selection.
Maybe something like cpu.freq.min/max are better names?
> These attributes:
> a) are tunable at all hierarchy levels, i.e. at root group level too, thus
> allowing to define the minimum and maximum frequency constraints for all
> otherwise non-classified tasks (e.g. autogroups) and to be a sort-of
> replacement for cpufreq's powersave, ondemand and performance
> governors.
This is a problem which exists for all other interfaces. For
historical and other reasons, at least till now, we've opted to put
everything at system level outside of cgroup interface. We might
change this in the future and duplicate system-level information and
interfaces in the root cgroup but we wanna do that in a more systematic
fashion than adding a one-off knob in the cgroup root.
Besides, if a feature makes sense at the system level which is the
cgroup root, it makes sense without cgroup mounted or enabled, so it
needs a place outside cgroup one way or the other.
> b) allow to create subgroups of tasks which are not violating the
> utilization constraints defined by the parent group.
Tying creation / config operations to the config propagation doesn't
work well with delegation and is inconsistent with what other
controllers are doing. For cases where the propagated config being
visible in a sub cgroup is necessary, please add .effective files.
> Tasks on a subgroup can only be more boosted and/or capped, which is
Less boosted. .low at a parent level must set the upper bound of .low
that all its descendants can have.
Thanks.
--
tejun
Hi Tejun,
On 09-Apr 15:24, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 05:56:12PM +0100, Patrick Bellasi wrote:
> > This patch extends the CPU controller by adding a couple of new attributes,
> > util_min and util_max, which can be used to enforce frequency boosting and
> > capping. Specifically:
> >
> > - util_min: defines the minimum CPU utilization which should be considered,
> > e.g. when schedutil selects the frequency for a CPU while a
> > task in this group is RUNNABLE.
> > i.e. the task will run at least at a minimum frequency which
> > corresponds to the min_util utilization
> >
> > - util_max: defines the maximum CPU utilization which should be considered,
> > e.g. when schedutil selects the frequency for a CPU while a
> > task in this group is RUNNABLE.
> > i.e. the task will run up to a maximum frequency which
> > corresponds to the max_util utilization
>
> I'm not too enthusiastic about util_min/max given that it can easily
> be read as actual utilization based bandwidth control when what's
> actually implemented, IIUC, is affecting CPU frequency selection.
Right now we are basically affecting the frequency selection.
However, the next step is to use this same interface to possibly bias
task placement.
The idea is that:
- the util_min value can be used to possibly avoid CPUs which have
a (maybe temporarily) limited capacity, for example, due to thermal
pressure.
- a util_max value can be used to possibly identify tasks which can
be co-scheduled together in a (maybe) limited capacity CPU since
they are more likely "less important" tasks.
Thus, since this is a new user-space API, we would like to find a
concept which is generic enough to express the current requirement but
also easily accommodate future extensions.
> Maybe something like cpu.freq.min/max are better names?
IMO this is too platform specific.
I agree that utilization is maybe too much of an implementation detail,
but perhaps this can be solved by using a more generic range.
What about using values in the [0..100] range which define:
a percentage of the maximum available capacity
for the CPUs in the target system
Do you think this can work?
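A minimal sketch of the mapping I have in mind, assuming a linear scale
onto SCHED_CAPACITY_SCALE (the helper name below is only illustrative):

   /* Map a [0..100] capacity percentage onto the utilization scale */
   static inline unsigned int capacity_pct_to_util(unsigned int pct)
   {
           return pct * SCHED_CAPACITY_SCALE / 100;
   }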
> > These attributes:
> > a) are tunable at all hierarchy levels, i.e. at root group level too, thus
> > allowing to define the minimum and maximum frequency constraints for all
> > otherwise non-classified tasks (e.g. autogroups) and to be a sort-of
> > replacement for cpufreq's powersave, ondemand and performance
> > governors.
>
> This is a problem which exists for all other interfaces. For
> historical and other reasons, at least till now, we've opted to put
> everything at system level outside of cgroup interface. We might
> change this in the future and duplicate system-level information and
> interfaces in the root cgroup but we wanna do that in a more systemtic
> fashion than adding an one-off knob in the cgroup root.
I see, I think we can easily come up with a procfs/sysfs interface
usable to define system-wide values.
Any suggestion for something already existing which I can use as a
reference?
> Besides, if a feature makes sense at the system level which is the
> cgroup root, it makes sense without cgroup mounted or enabled, so it
> needs a place outside cgroup one way or the other.
Indeed, and it makes perfect sense now that we also have a non
cgroup-based primary API.
> > b) allow to create subgroups of tasks which are not violating the
> > utilization constraints defined by the parent group.
>
> Tying creation / config operations to the config propagation doesn't
> work well with delegation and is inconsistent with what other
> controllers are doing. For cases where the propagated config being
> visible in a sub cgroup is necessary, please add .effective files.
I'm not sure I understand this point: you mean that we should not
enforce "consistency rules" among parent-child groups?
I have to look better into this "effective" concept.
Meanwhile, can you make a simple example?
> > Tasks on a subgroup can only be more boosted and/or capped, which is
>
> Less boosted. .low at a parent level must set the upper bound of .low
> that all its descendants can have.
Is that a mandatory requirement? Or, given a proper justification,
could you also accept what I'm proposing?
I've always thought that what I'm proposing could make more sense for
the general case, but perhaps I just need to go back and better check
the use-cases we have on hand to see whether it's really required or
not.
Thanks for the prompt feedback!
--
#include <best/regards.h>
Patrick Bellasi
Hello,
On Tue, Apr 10, 2018 at 06:16:12PM +0100, Patrick Bellasi wrote:
> > I'm not too enthusiastic about util_min/max given that it can easily
> > be read as actual utilization based bandwidth control when what's
> > actually implemented, IIUC, is affecting CPU frequency selection.
>
> Right now we are basically affecting the frequency selection.
> However, the next step is to use this same interface to possibly bias
> task placement.
>
> The idea is that:
>
> - the util_min value can be used to possibly avoid CPUs which have
> a (maybe temporarily) limited capacity, for example, due to thermal
> pressure.
>
> - a util_max value can use used to possibly identify tasks which can
> be co-scheduled together in a (maybe) limited capacity CPU since
> they are more likely "less important" tasks.
>
> Thus, since this is a new user-space API, we would like to find a
> concept which is generic enough to express the current requirement but
> also easily accommodate future extensions.
I'm not sure we can overload the meanings like that on the same
interface. Right now, it doesn't say anything about bandwidth (or
utilization) allocation. It just limits the frequency range the
particular cpu that the task ended up on can be in and what you're
describing above is the third different thing. It doesn't seem clear
that they're something which can be overloaded onto the same
interface.
> > Maybe something like cpu.freq.min/max are better names?
>
> IMO this is something too much platform specific.
>
> I agree that utilization is maybe too much an implementation detail,
> but perhaps this can be solved by using a more generic range.
>
> What about using values in the [0..100] range which define:
>
> a percentage of the maximum available capacity
> for the CPUs in the target system
>
> Do you think this can work?
Yeah, sure, it's more that right now the intention isn't clear. A
cgroup control knob which limits cpu frequency range while the cgroup
is on a cpu is a very different thing from a cgroup knob which
restricts what tasks can be scheduled on the same cpu. They're
actually incompatible. Doing the latter actively breaks the former.
> > This is a problem which exists for all other interfaces. For
> > historical and other reasons, at least till now, we've opted to put
> > everything at system level outside of cgroup interface. We might
> > change this in the future and duplicate system-level information and
> > interfaces in the root cgroup but we wanna do that in a more systemtic
> > fashion than adding an one-off knob in the cgroup root.
>
> I see, I think we can easily come up with a procfs/sysfs interface
> usable to define system-wide values.
>
> Any suggestion for something already existing which I can use as a
> reference?
Most system level interfaces are there with a long history and things
aren't that consistent. One route could be finding an interface
implementing a nearby feature and staying consistent with that.
> > Tying creation / config operations to the config propagation doesn't
> > work well with delegation and is inconsistent with what other
> > controllers are doing. For cases where the propagated config being
> > visible in a sub cgroup is necessary, please add .effective files.
>
> I'm not sure to understand this point: you mean that we should not
> enforce "consistency rules" among parent-child groups?
You should. It just shouldn't make configurations fail cuz that ends
up breaking delegations.
> I have to look better into this "effective" concept.
> Meanwhile, can you make a simple example?
There's a recent cpuset patchset posted by Waiman Long. Googling for
lkml cpuset and Waiman Long should find it easily.
> > > Tasks on a subgroup can only be more boosted and/or capped, which is
> >
> > Less boosted. .low at a parent level must set the upper bound of .low
> > that all its descendants can have.
>
> Is that a mandatory requirement? Or based on a proper justification
> you can also accept what I'm proposing?
>
> I've always been more of the idea that what I'm proposing could make
> more sense for a general case but perhaps I just need to go back and
> better check the use-cases we have on hand to see if it's really
> required or not.
Yeah, I think we want to stick to that semantics. That's what memory
controller does and it'd be really confusing to flip the directions on
different controllers.
Thanks.
--
tejun
On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> +{
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> +static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
> +{
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
That all seems daft, since you already have rq at the call site.
> +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> +{
> + int cpu = cpu_of(rq);
> + int clamp_id;
> +
> + /* The idle task does not affect CPU's clamps */
> + if (unlikely(p->sched_class == &idle_sched_class))
> + return;
> + /* DEADLINE tasks do not affect CPU's clamps */
> + if (unlikely(p->sched_class == &dl_sched_class))
> + return;
This is wrong; it misses the stop_sched_class.
And since you're looking at sched_class anyway, maybe put a marker in
there:
if (!p->sched_class->has_clamping)
return;
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + if (uclamp_task_affects(p, clamp_id))
> + uclamp_cpu_put(p, cpu, clamp_id);
> + else
> + uclamp_cpu_get(p, cpu, clamp_id);
> + }
> +}
On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> +static inline void init_uclamp(void)
WTH is that inline?
> +{
> + struct uclamp_cpu *uc_cpu;
> + int clamp_id;
> + int cpu;
> +
> + mutex_init(&uclamp_mutex);
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + /* Init CPU's clamp groups */
> + for_each_possible_cpu(cpu) {
> + uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> + memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
> + }
> + }
Those loops are the wrong way around, this shreds your cache. This is a
slow path so it doesn't much matter, but it is sloppy.
> +}
On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> +{
> + int cpu = cpu_of(rq);
> + int clamp_id;
> +
> + /* The idle task does not affect CPU's clamps */
> + if (unlikely(p->sched_class == &idle_sched_class))
> + return;
> + /* DEADLINE tasks do not affect CPU's clamps */
> + if (unlikely(p->sched_class == &dl_sched_class))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + if (uclamp_task_affects(p, clamp_id))
> + uclamp_cpu_put(p, cpu, clamp_id);
> + else
> + uclamp_cpu_get(p, cpu, clamp_id);
> + }
> +}
Is that uclamp_task_affects() thing there to fix up the fact you failed
to propagate the calling context (enqueue/dequeue) ?
I find this code _really_ hard to read...
> @@ -743,6 +929,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & ENQUEUE_RESTORE))
> sched_info_queued(rq, p);
>
> + uclamp_task_update(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -754,6 +941,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & DEQUEUE_SAVE))
> sched_info_dequeued(rq, p);
>
> + uclamp_task_update(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
On Fri, Apr 13, 2018 at 11:30:05AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > +struct uclamp_group {
> > + /* Utilization clamp value for tasks on this clamp group */
> > + int value;
> > + /* Number of RUNNABLE tasks on this clamp group */
> > + int tasks;
> > +};
>
> > +struct uclamp_cpu {
> > + /* Utilization clamp value for a CPU */
> > + int value;
> > + /* Utilization clamp groups affecting this CPU */
> > + struct uclamp_group group[CONFIG_UCLAMP_GROUPS_COUNT + 1];
> > +};
>
> > @@ -811,6 +885,11 @@ struct rq {
> > unsigned long cpu_capacity;
> > unsigned long cpu_capacity_orig;
> >
> > +#ifdef CONFIG_UCLAMP_TASK
> > + /* util_{min,max} clamp values based on CPU's active tasks */
> > + struct uclamp_cpu uclamp[UCLAMP_CNT];
> > +#endif
> > +
> > struct callback_head *balance_callback;
> >
> > unsigned char idle_balance;
>
> So that is:
>
> struct rq {
>
> struct uclamp_cpu {
> int value;
>
> /* 4 byte hole */
>
> struct uclamp_group {
> int value;
> int tasks;
> } [COUNT + 1]; // size: COUNT+2 * 8 [bytes]
>
> } [2]; // size: 2 * COUNT+2 * 8 [bytes]
>
> };
Note that you can slightly compress the data structure if you do
something like:
struct rq {
int active_clamp[2];
struct uclamp_group {
int value;
int task;
} [COUNT + 1][2]; // XXX check array order of C
};
It also avoids the active values being split over 2 lines; and we need
those values every time, even if there are no active clamp tasks on the
system.
On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> +{
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> + int clamp_value;
> + int group_id;
> +
> + /* Get task's specific clamp value */
> + clamp_value = p->uclamp[clamp_id].value;
> + group_id = p->uclamp[clamp_id].group_id;
> +
> + /* No task specific clamp values: nothing to do */
> + if (group_id == UCLAMP_NONE)
> + return;
> +
> + /* Increment the current group_id */
That I think qualifies being called a bad comment.
> + uc_cpu->group[group_id].tasks += 1;
> +
> + /* Mark task as enqueued for this clamp index */
> + p->uclamp_group_id[clamp_id] = group_id;
Why exactly do we need this? we got group_id from @p in the first place.
I suspect this is because when we update p->uclamp[], we don't update
this active value (when needed), is that worth it?
> + /*
> + * If this is the new max utilization clamp value, then we can update
> + * straight away the CPU clamp value. Otherwise, the current CPU clamp
> + * value is still valid and we are done.
> + */
> + if (uc_cpu->value < clamp_value)
> + uc_cpu->value = clamp_value;
> +}
On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> +struct uclamp_group {
> + /* Utilization clamp value for tasks on this clamp group */
> + int value;
> + /* Number of RUNNABLE tasks on this clamp group */
> + int tasks;
> +};
> +struct uclamp_cpu {
> + /* Utilization clamp value for a CPU */
> + int value;
> + /* Utilization clamp groups affecting this CPU */
> + struct uclamp_group group[CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +};
> @@ -811,6 +885,11 @@ struct rq {
> unsigned long cpu_capacity;
> unsigned long cpu_capacity_orig;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* util_{min,max} clamp values based on CPU's active tasks */
> + struct uclamp_cpu uclamp[UCLAMP_CNT];
> +#endif
> +
> struct callback_head *balance_callback;
>
> unsigned char idle_balance;
So that is:
struct rq {
struct uclamp_cpu {
int value;
/* 4 byte hole */
struct uclamp_group {
int value;
int tasks;
} [COUNT + 1]; // size: COUNT+2 * 8 [bytes]
} [2]; // size: 2 * COUNT+2 * 8 [bytes]
};
Which for the default (4) ends up being 96 bytes, or at least 2 lines.
Do you want some alignment such that we don't span more lines?
Which we touch extra on enqueue/dequeue; at least it's only for tasks that
have a clamp set.
On 13-Apr 12:22, Peter Zijlstra wrote:
> On Fri, Apr 13, 2018 at 10:26:48AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> > > +{
> > > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> >
> > > +static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
> > > +{
> > > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> >
> > That all seems daft, since you already have rq at the call site.
> >
> > > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > > +{
> > > + int cpu = cpu_of(rq);
> > > + int clamp_id;
> > > +
> > > + /* The idle task does not affect CPU's clamps */
> > > + if (unlikely(p->sched_class == &idle_sched_class))
> > > + return;
> > > + /* DEADLINE tasks do not affect CPU's clamps */
> > > + if (unlikely(p->sched_class == &dl_sched_class))
> > > + return;
> >
> > This is wrong; it misses the stop_sched_class.
> >
> > And since you're looking at sched_class anyway, maybe put a marker in
> > there:
> >
> > if (!p->sched_class->has_clamping)
> > return;
>
> Alternatively, we could simply add uclamp_task_{en,de}queue() into
> {en,de}queue_task_{fair,rt}().
I like your first proposal better; I think it makes sense to factor
this out into core code used by both RT and FAIR in the same way.
Do you have a strong preference?
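For reference, a minimal sketch of that first option, using the
has_clamping name you suggested (exact placement is only illustrative):

   /* kernel/sched/sched.h: new member of struct sched_class */
   bool has_clamping;

   /* kernel/sched/fair.c and kernel/sched/rt.c: class definitions */
   .has_clamping = true,

   /* kernel/sched/core.c: early bail-out in uclamp_task_update() */
   if (!p->sched_class->has_clamping)
           return;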
> > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > > + if (uclamp_task_affects(p, clamp_id))
> > > + uclamp_cpu_put(p, cpu, clamp_id);
> > > + else
> > > + uclamp_cpu_get(p, cpu, clamp_id);
> > > + }
> > > +}
--
#include <best/regards.h>
Patrick Bellasi
On 13-Apr 11:46, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> > +{
> > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> > + int clamp_value;
> > + int group_id;
> > +
> > + /* Get task's specific clamp value */
> > + clamp_value = p->uclamp[clamp_id].value;
> > + group_id = p->uclamp[clamp_id].group_id;
> > +
> > + /* No task specific clamp values: nothing to do */
> > + if (group_id == UCLAMP_NONE)
> > + return;
> > +
> > + /* Increment the current group_id */
>
> That I think qualifies being called a bad comment.
my bad :/
> > + uc_cpu->group[group_id].tasks += 1;
> > +
> > + /* Mark task as enqueued for this clamp index */
> > + p->uclamp_group_id[clamp_id] = group_id;
>
> Why exactly do we need this? we got group_id from @p in the first place.
The idea is to back-annotate on the task the group in which it has
been refcounted. That allows much simpler and less racy refcount
decrements at dequeue/migration time.
That's also why we have a single call-back, uclamp_task_update(),
for both enqueue/dequeue. Depending on the check performed by
uclamp_task_affects() we know if we have to get or put the refcounter.
> I suspect this is because when we update p->uclamp[], we don't update
> this active value (when needed), is that worth it?
What do you mean by "we don't update this active value"?
> > + /*
> > + * If this is the new max utilization clamp value, then we can update
> > + * straight away the CPU clamp value. Otherwise, the current CPU clamp
> > + * value is still valid and we are done.
> > + */
> > + if (uc_cpu->value < clamp_value)
> > + uc_cpu->value = clamp_value;
> > +}
--
#include <best/regards.h>
Patrick Bellasi
On 13-Apr 10:43, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > +{
> > + int cpu = cpu_of(rq);
> > + int clamp_id;
> > +
> > + /* The idle task does not affect CPU's clamps */
> > + if (unlikely(p->sched_class == &idle_sched_class))
> > + return;
> > + /* DEADLINE tasks do not affect CPU's clamps */
> > + if (unlikely(p->sched_class == &dl_sched_class))
> > + return;
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + if (uclamp_task_affects(p, clamp_id))
> > + uclamp_cpu_put(p, cpu, clamp_id);
> > + else
> > + uclamp_cpu_get(p, cpu, clamp_id);
> > + }
> > +}
>
> Is that uclamp_task_affects() thing there to fix up the fact you failed
> to propagate the calling context (enqueue/dequeue) ?
Not really, it's intended by design: we back annotate the clamp_group
a task has been refcounted in.
The uclamp_task_affects() tells if we are refcounted now and then we
know from the back-annotation from which refcounter we need to remove
the task.
I found this solution much less racy and more effective in avoiding
screwing up the refcounter whenever we look at a task at
dequeue/migration time, since these operations can overlap with the
slow-path, i.e. when we change the task specific clamp_group
either via syscall or cgroup attributes.
IOW, the back annotation allows to decouple refcounting from
clamp_group configuration in a lockless way.
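In code terms, the back annotation boils down to something like this
(simplified from uclamp_cpu_{get,put} above):

   /* enqueue: refcount the clamp group currently set for the task */
   group_id = p->uclamp[clamp_id].group_id;
   uc_cpu->group[group_id].tasks += 1;
   p->uclamp_group_id[clamp_id] = group_id;        /* back annotation */

   /* dequeue: release exactly the group refcounted at enqueue time,
    * even if p->uclamp[clamp_id].group_id changed in the meantime */
   group_id = p->uclamp_group_id[clamp_id];
   uc_cpu->group[group_id].tasks -= 1;
   p->uclamp_group_id[clamp_id] = UCLAMP_NONE;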
> I find this code _really_ hard to read...
Hope the explanation above clarifies the logic... do you have
alternative proposals?
> > @@ -743,6 +929,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> > if (!(flags & ENQUEUE_RESTORE))
> > sched_info_queued(rq, p);
> >
> > + uclamp_task_update(rq, p);
> > p->sched_class->enqueue_task(rq, p, flags);
> > }
> >
> > @@ -754,6 +941,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > if (!(flags & DEQUEUE_SAVE))
> > sched_info_dequeued(rq, p);
> >
> > + uclamp_task_update(rq, p);
> > p->sched_class->dequeue_task(rq, p, flags);
> > }
> >
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Apr 13, 2018 at 12:04:26PM +0100, Patrick Bellasi wrote:
> On 13-Apr 12:22, Peter Zijlstra wrote:
> > On Fri, Apr 13, 2018 at 10:26:48AM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > > +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> > > > +{
> > > > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> > >
> > > > +static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
> > > > +{
> > > > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> > >
> > > That all seems daft, since you already have rq at the call site.
> > >
> > > > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > > > +{
> > > > + int cpu = cpu_of(rq);
> > > > + int clamp_id;
> > > > +
> > > > + /* The idle task does not affect CPU's clamps */
> > > > + if (unlikely(p->sched_class == &idle_sched_class))
> > > > + return;
> > > > + /* DEADLINE tasks do not affect CPU's clamps */
> > > > + if (unlikely(p->sched_class == &dl_sched_class))
> > > > + return;
> > >
> > > This is wrong; it misses the stop_sched_class.
> > >
> > > And since you're looking at sched_class anyway, maybe put a marker in
> > > there:
> > >
> > > if (!p->sched_class->has_clamping)
> > > return;
> >
> > Alternatively, we could simply add uclamp_task_{en,de}queue() into
> > {en,de}queue_task_{fair,rt}().
>
> I like better your first proposal, I think it makes sense to factor
> out in core code used by both RT and FAIR the same way.
>
> Do you have a strong preference?
The second is probably faster as it avoids the load+branch; then again,
without LTO you'll get an actual call in return. Dunno...
On 13-Apr 10:40, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > +static inline void init_uclamp(void)
>
> WTH is that inline?
You mean I can avoid the attribute?
... or that I should do it in another way?
> > +{
> > + struct uclamp_cpu *uc_cpu;
> > + int clamp_id;
> > + int cpu;
> > +
> > + mutex_init(&uclamp_mutex);
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + /* Init CPU's clamp groups */
> > + for_each_possible_cpu(cpu) {
> > + uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> > + memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
> > + }
> > + }
>
> Those loops are the wrong way around, this shreds your cache. This is a
> slow path so it doesn't much matter, but it is sloppy.
Yess... my bad!... good catch! ;)
>
> > +}
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Apr 13, 2018 at 12:08:48PM +0100, Patrick Bellasi wrote:
> On 13-Apr 11:46, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> > > +{
> > > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
> > > + int clamp_value;
> > > + int group_id;
> > > +
> > > + /* Get task's specific clamp value */
> > > + clamp_value = p->uclamp[clamp_id].value;
> > > + group_id = p->uclamp[clamp_id].group_id;
> > > +
> > > + /* No task specific clamp values: nothing to do */
> > > + if (group_id == UCLAMP_NONE)
> > > + return;
> > > +
> > > + /* Increment the current group_id */
> >
> > That I think qualifies being called a bad comment.
>
> my bad :/
>
> > > + uc_cpu->group[group_id].tasks += 1;
> > > +
> > > + /* Mark task as enqueued for this clamp index */
> > > + p->uclamp_group_id[clamp_id] = group_id;
> >
> > Why exactly do we need this? we got group_id from @p in the first place.
>
> The idea is to back-annotate on the task the group in which it has
> been refcounted. That allows a much simpler and less racy refcount
> decrements at dequeue/migration time.
I'm not following; the only possible reason for having this second copy
of group_id is when your original value (p->uclamp[clamp_id].group_id)
can change between enqueue and dequeue.
Why can this happen?
> That's also why we have a single call-back, uclamp_task_update(),
> for both enqueue/dequeue. Depending on the check performed by
> uclamp_task_affects() we know if we have to get or put the refcounter.
But that check is _completely_ redundant, because you already _know_
from being in the en/de-queue path. So having that single callback is
actively harmful (and confusing).
On Fri, Apr 13, 2018 at 12:17:53PM +0100, Patrick Bellasi wrote:
> On 13-Apr 10:40, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > +static inline void init_uclamp(void)
> >
> > WTH is that inline?
>
> You mean I can avoid the attribute?
> ... or that I should do it in another way?
yea, it's init code, no need to go all inline with that (gcc will likely
inline it anyway because static-with-single-callsite).
On Fri, Apr 13, 2018 at 10:26:48AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > +static inline void uclamp_cpu_get(struct task_struct *p, int cpu, int clamp_id)
> > +{
> > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
>
> > +static inline void uclamp_cpu_put(struct task_struct *p, int cpu, int clamp_id)
> > +{
> > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp[clamp_id];
>
> That all seems daft, since you already have rq at the call site.
>
> > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > +{
> > + int cpu = cpu_of(rq);
> > + int clamp_id;
> > +
> > + /* The idle task does not affect CPU's clamps */
> > + if (unlikely(p->sched_class == &idle_sched_class))
> > + return;
> > + /* DEADLINE tasks do not affect CPU's clamps */
> > + if (unlikely(p->sched_class == &dl_sched_class))
> > + return;
>
> This is wrong; it misses the stop_sched_class.
>
> And since you're looking at sched_class anyway, maybe put a marker in
> there:
>
> if (!p->sched_class->has_clamping)
> return;
Alternatively, we could simply add uclamp_task_{en,de}queue() into
{en,de}queue_task_{fair,rt}().
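E.g., roughly, reusing the existing helpers and keeping the
uclamp_task_affects() check only on the dequeue side:

   static inline void uclamp_task_enqueue(struct rq *rq, struct task_struct *p)
   {
           int clamp_id;

           for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
                   uclamp_cpu_get(p, cpu_of(rq), clamp_id);
   }

   static inline void uclamp_task_dequeue(struct rq *rq, struct task_struct *p)
   {
           int clamp_id;

           for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
                   if (uclamp_task_affects(p, clamp_id))
                           uclamp_cpu_put(p, cpu_of(rq), clamp_id);
           }
   }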
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + if (uclamp_task_affects(p, clamp_id))
> > + uclamp_cpu_put(p, cpu, clamp_id);
> > + else
> > + uclamp_cpu_get(p, cpu, clamp_id);
> > + }
> > +}
On 13-Apr 13:29, Peter Zijlstra wrote:
> On Fri, Apr 13, 2018 at 12:17:53PM +0100, Patrick Bellasi wrote:
> > On 13-Apr 10:40, Peter Zijlstra wrote:
> > > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > > +static inline void init_uclamp(void)
> > >
> > > WTH is that inline?
> >
> > You mean I can avoid the attribute?
> > ... or that I should do it in another way?
>
> yea, its init code, no need to go all inline with that (gcc will likely
> inline it anyway because static-with-single-callsite).
Yes, indeed... I think I just took the init_schedstats() right above
as a reference and lazily went for code consistency :(
However, if we remove inline, we should probably still add an __init,
shouldn't we?
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Apr 13, 2018 at 12:15:10PM +0100, Patrick Bellasi wrote:
> On 13-Apr 10:43, Peter Zijlstra wrote:
> > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > > +{
> > > + int cpu = cpu_of(rq);
> > > + int clamp_id;
> > > +
> > > + /* The idle task does not affect CPU's clamps */
> > > + if (unlikely(p->sched_class == &idle_sched_class))
> > > + return;
> > > + /* DEADLINE tasks do not affect CPU's clamps */
> > > + if (unlikely(p->sched_class == &dl_sched_class))
> > > + return;
> > > +
> > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > > + if (uclamp_task_affects(p, clamp_id))
> > > + uclamp_cpu_put(p, cpu, clamp_id);
> > > + else
> > > + uclamp_cpu_get(p, cpu, clamp_id);
> > > + }
> > > +}
> >
> > Is that uclamp_task_affects() thing there to fix up the fact you failed
> > to propagate the calling context (enqueue/dequeue) ?
>
> Not really, it's intended by design: we back annotate the clamp_group
> a task has been refcounted in.
>
> The uclamp_task_affects() tells if we are refcounted now and then we
> know from the back-annotation from which refcounter we need to remove
> the task.
>
> I found this solution much less racy and effective in avoiding to
> screw up the refcounter whenever we look at a task at either
> dequeue/migration time and these operations can overlaps with the
> slow-path. Meaning, when we change the task specific clamp_group
> either via syscall or cgroups attributes.
>
> IOW, the back annotation allows to decouple refcounting from
> clamp_group configuration in a lockless way.
But it adds extra state and logic, to a fastpath, for no reason.
I suspect you messed up the cgroup side; because the syscall should
already have done task_rq_lock() and hold both p->pi_lock and rq->lock
and have dequeued the task when changing the attribute.
It is actually really hard to make the syscall do it wrong.
On 13-Apr 12:47, Patrick Bellasi wrote:
> On 13-Apr 13:36, Peter Zijlstra wrote:
> > On Fri, Apr 13, 2018 at 12:15:10PM +0100, Patrick Bellasi wrote:
> > > On 13-Apr 10:43, Peter Zijlstra wrote:
> > > > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > > > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > > > > +{
> > > > > + int cpu = cpu_of(rq);
> > > > > + int clamp_id;
> > > > > +
> > > > > + /* The idle task does not affect CPU's clamps */
> > > > > + if (unlikely(p->sched_class == &idle_sched_class))
> > > > > + return;
> > > > > + /* DEADLINE tasks do not affect CPU's clamps */
> > > > > + if (unlikely(p->sched_class == &dl_sched_class))
> > > > > + return;
> > > > > +
> > > > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > > > > + if (uclamp_task_affects(p, clamp_id))
> > > > > + uclamp_cpu_put(p, cpu, clamp_id);
> > > > > + else
> > > > > + uclamp_cpu_get(p, cpu, clamp_id);
> > > > > + }
> > > > > +}
> > > >
> > > > Is that uclamp_task_affects() thing there to fix up the fact you failed
> > > > to propagate the calling context (enqueue/dequeue) ?
> > >
> > > Not really, it's by design: we back-annotate the clamp_group
> > > a task has been refcounted in.
> > >
> > > uclamp_task_affects() tells us whether the task is currently refcounted,
> > > and the back-annotation then tells us from which refcounter we need to
> > > remove the task.
> > >
> > > I found this solution much less racy and more effective at avoiding
> > > screwing up the refcounter whenever we look at a task at
> > > dequeue/migration time and these operations overlap with the
> > > slow-path, i.e. when we change the task-specific clamp_group
> > > either via syscall or cgroup attributes.
> > >
> > > IOW, the back-annotation allows us to decouple refcounting from
> > > clamp_group configuration in a lockless way.
> >
> > But it adds extra state and logic, to a fastpath, for no reason.
> >
> > I suspect you messed up the cgroup side; because the syscall should
> > already have done task_rq_lock() and hold both p->pi_lock and rq->lock
> > and have dequeued the task when changing the attribute.
>
> Yes, actually I'm using task_rq_lock() from the cgroup callback to
> update each task already queued. And I do the same from the
> sched_setattr syscall...
>
> > It is actually really hard to make the syscall do it wrong.
>
> ... thus, I'll look into this more carefully.
>
> I'm not sure now whether there was some other corner case.
Actually, I've just remembered another use-case for that
back-annotation: it is needed when the cgroup and per-task APIs
assert two different clamp values.
For example, a task in a TG with max_clamp=50 can be further clamped by a
task-specific max_clamp=10. The back-annotation tracks the group_id in
which the task is currently refcounted, which is the task-specific group
in the previous example.
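In code, the restriction in this example could be expressed roughly as
follows; this is just a sketch of the aggregation being described, with an
illustrative helper name (min() as in <linux/kernel.h>):

/*
 * Sketch of the TG vs. per-task aggregation described above: the
 * task-specific clamp can only further restrict the task group one.
 * E.g. tg_max = 50, task_max = 10  ->  effective max = 10.
 */
static unsigned int uclamp_effective_max(unsigned int tg_max,
                                         unsigned int task_max)
{
        return min(tg_max, task_max);
}

The back-annotation then records the clamp group matching that effective
value (the task-specific one in the example above), so the dequeue path
releases the refcount from the right group.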
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Apr 13, 2018 at 12:47:45PM +0100, Patrick Bellasi wrote:
> In the past I remember some funny dance in the cgroup callbacks when a
> task was terminating (like being moved to the root-rq just before
> exiting). But, as you say, if we always hold task_rq_lock we
> should be safe.
The syscall does the whole:
task_rq_lock();
queued = task_on_rq_queued();
running = task_current();
if (queued)
        dequeue_task();
if (running)
        put_prev_task();
/*
 * task is in invariant state here,
 * muck with it.
 */
if (queued)
        enqueue_task();
if (running)
        set_curr_task();
task_rq_unlock();
pattern, which, because C sucks, we've not found a good way to template yet.
Without the dequeue/put, enqueue/set you have to jump through a few extra hoops.
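For reference, a slightly fleshed-out version of that pattern, roughly as it
appears in __sched_setscheduler() in kernel/sched/core.c (simplified: the
real code passes additional queue flags and handles class/priority changes):

static void change_task_attr(struct task_struct *p)
{
        struct rq_flags rf;
        struct rq *rq;
        bool queued, running;

        /* Takes both p->pi_lock and rq->lock. */
        rq = task_rq_lock(p, &rf);

        queued = task_on_rq_queued(p);
        running = task_current(rq, p);
        if (queued)
                dequeue_task(rq, p, DEQUEUE_SAVE);
        if (running)
                put_prev_task(rq, p);

        /*
         * The task is neither enqueued nor running here: its attributes
         * can be changed without racing the fast path.
         */

        if (queued)
                enqueue_task(rq, p, ENQUEUE_RESTORE);
        if (running)
                set_curr_task(rq, p);

        task_rq_unlock(rq, p, &rf);
}

Anything done between the dequeue/put and enqueue/set pairs sees the task off
the runqueue and cannot race with the enqueue/dequeue fast path.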
On 13-Apr 13:36, Peter Zijlstra wrote:
> On Fri, Apr 13, 2018 at 12:15:10PM +0100, Patrick Bellasi wrote:
> > On 13-Apr 10:43, Peter Zijlstra wrote:
> > > On Mon, Apr 09, 2018 at 05:56:09PM +0100, Patrick Bellasi wrote:
> > > > +static inline void uclamp_task_update(struct rq *rq, struct task_struct *p)
> > > > +{
> > > > + int cpu = cpu_of(rq);
> > > > + int clamp_id;
> > > > +
> > > > + /* The idle task does not affect CPU's clamps */
> > > > + if (unlikely(p->sched_class == &idle_sched_class))
> > > > + return;
> > > > + /* DEADLINE tasks do not affect CPU's clamps */
> > > > + if (unlikely(p->sched_class == &dl_sched_class))
> > > > + return;
> > > > +
> > > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > > > + if (uclamp_task_affects(p, clamp_id))
> > > > + uclamp_cpu_put(p, cpu, clamp_id);
> > > > + else
> > > > + uclamp_cpu_get(p, cpu, clamp_id);
> > > > + }
> > > > +}
> > >
> > > Is that uclamp_task_affects() thing there to fix up the fact you failed
> > > to propagate the calling context (enqueue/dequeue)?
> >
> > Not really, it's by design: we back-annotate the clamp_group
> > a task has been refcounted in.
> >
> > uclamp_task_affects() tells us whether the task is currently refcounted,
> > and the back-annotation then tells us from which refcounter we need to
> > remove the task.
> >
> > I found this solution much less racy and more effective at avoiding
> > screwing up the refcounter whenever we look at a task at
> > dequeue/migration time and these operations overlap with the
> > slow-path, i.e. when we change the task-specific clamp_group
> > either via syscall or cgroup attributes.
> >
> > IOW, the back-annotation allows us to decouple refcounting from
> > clamp_group configuration in a lockless way.
>
> But it adds extra state and logic, to a fastpath, for no reason.
>
> I suspect you messed up the cgroup side; because the syscall should
> already have done task_rq_lock() and hold both p->pi_lock and rq->lock
> and have dequeued the task when changing the attribute.
Yes, actually I'm using task_rq_lock() from the cgroup callback to
update each task already queued. And I do the same from the
sched_setattr syscall...
> It is actually really hard to make the syscall do it wrong.
... thus, I'll look into this more carefully.
I'm not sure now whether there was some other corner case.
In the past I remember some funny dance in the cgroup callbacks when a
task was terminating (like being moved to the root-rq just before
exiting). But, as you say, if we always hold task_rq_lock we
should be safe.
--
#include <best/regards.h>
Patrick Bellasi
Hi Tejun,
On Tue, Apr 10, 2018 at 1:05 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Tue, Apr 10, 2018 at 06:16:12PM +0100, Patrick Bellasi wrote:
>> > I'm not too enthusiastic about util_min/max given that it can easily
>> > be read as actual utilization based bandwidth control when what's
>> > actually implemented, IIUC, is affecting CPU frequency selection.
>>
>> Right now we are basically affecting the frequency selection.
>> However, the next step is to use this same interface to possibly bias
>> task placement.
>>
>> The idea is that:
>>
>> - the util_min value can be used to possibly avoid CPUs which have
>>   a (maybe temporarily) limited capacity, for example, due to thermal
>>   pressure.
>>
>> - a util_max value can be used to possibly identify tasks which can
>>   be co-scheduled together on a (maybe) limited-capacity CPU since
>>   they are more likely "less important" tasks.
>>
>> Thus, since this is a new user-space API, we would like to find a
>> concept which is generic enough to express the current requirement but
>> also easily accommodate future extensions.
>
> I'm not sure we can overload the meanings like that on the same
> interface. Right now, it doesn't say anything about bandwidth (or
> utilization) allocation. It just limits the frequency range the
> particular cpu that the task ended up on can be in, and what you're
> describing above is a third, different thing. It doesn't seem clear
> that they're something which can be overloaded onto the same
> interface.
Actually no, it's not about overloading them. What Patrick is
defining here is a property/attribute. What that attribute is used for
(the algorithms that use it) is a different topic. Like, it can be
used by the frequency selection algorithms or the task placement
algorithm. There are multiple algorithms that can use the property. To
me, this part of the patch makes sense. Maybe it should really be
called "task_size" or something, since that's what it really is.
[...]
>> > > Tasks on a subgroup can only be more boosted and/or capped, which is
>> >
>> > Less boosted. .low at a parent level must set the upper bound of .low
>> > that all its descendants can have.
>>
>> Is that a mandatory requirement? Or, given a proper justification,
>> could you also accept what I'm proposing?
>>
>> I've always been of the idea that what I'm proposing could make
>> more sense for the general case, but perhaps I just need to go back and
>> better check the use-cases we have on hand to see whether it's really
>> required or not.
>
> Yeah, I think we want to stick to that semantics. That's what the memory
> controller does and it'd be really confusing to flip the directions on
> different controllers.
>
What about the .high? I think there was some confusion about how to
define that for subgroups. It could perhaps be such that the .high of the
parent is the lower bound of the .high of a child, but then I'm not sure
if that fits well with the delegation policies...
thanks,
- Joel
Hello, Joel.
On Sat, Apr 21, 2018 at 02:08:30PM -0700, Joel Fernandes wrote:
> Actually no, it's not about overloading them. What Patrick is
> defining here is a property/attribute. What that attribute is used for
> (the algorithms that use it) is a different topic. Like, it can be
> used by the frequency selection algorithms or the task placement
> algorithm. There are multiple algorithms that can use the property. To
> me, this part of the patch makes sense. Maybe it should really be
> called "task_size" or something, since that's what it really is.
I understand that the interface can encode certain intentions and that
there can be different strategies to implement them, but the two
things mentioned here seem too fundamentally different to declare them
two different implementations of the same intention.
> > Yeah, I think we want to stick to that semantics. That's what the memory
> > controller does and it'd be really confusing to flip the directions on
> > different controllers.
>
> What about the .high? I think there was some confusion about how to
> define that for subgroups. It could perhaps be such that the .high of the
> parent is the lower bound of the .high of a child, but then I'm not sure
> if that fits well with the delegation policies...
The basic rule is simple. A child can never obtain more than its
ancestors.
Thanks.
--
tejun
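One possible reading of that rule applied to both clamps, as a small sketch
with illustrative names (this is not code from the series; min() as in
<linux/kernel.h>):

/* Requested clamp values for one group in the hierarchy (illustrative). */
struct uclamp_req {
        unsigned int util_min;
        unsigned int util_max;
};

/*
 * A child can never obtain more than its ancestors: both of its effective
 * values are capped by the parent's ones.
 */
static void uclamp_effective(const struct uclamp_req *parent,
                             const struct uclamp_req *child,
                             unsigned int *eff_min, unsigned int *eff_max)
{
        *eff_min = min(child->util_min, parent->util_min);
        *eff_max = min(child->util_max, parent->util_max);
}

Under this reading the parent's .low is the upper bound for its descendants'
.low, matching the memory controller semantics referred to above.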