Hi all, this is a respin of:
https://lore.kernel.org/lkml/[email protected]/
which includes the following main changes:
- fix "max local update" by moving into uclamp_rq_inc_id()
- use for_each_clamp_id() and uclamp_se_set() to make code less fragile
- rename sysfs node: s/sched_uclamp_util_{min,max}/sched_util_clamp_{min,max}/
- remove unused uclamp_eff_bucket_id()
- move uclamp_bucket_base_value() definition into the patch using it
- get rid of the unnecessary SCHED_POLICY_MAX define
- update sched_setattr() syscall to just force the current policy on
SCHED_FLAG_KEEP_POLICY
- return EOPNOTSUPP from uclamp_validate() on !CONFIG_UCLAMP_TASK
- make alloc_uclamp_sched_group() a void function
- simplify bits_per() definition
- add rq's lockdep assert to uclamp_rq_{inc,dec}_id()
With the above, I think we captured all the major inputs from Peter [1].
Thus, this v9 is likely the right version to unlock Tejun's review [2] on the
remaining cgroup related bits, i.e. patches [12-16].
Cheers,
Patrick
Series Organization
===================
The series is organized into these main sections:
- Patches [01-07]: Per task (primary) API
- Patches [08-09]: Schedutil integration for FAIR and RT tasks
- Patches [10-11]: Integration with EAS's energy_compute()
- Patches [12-16]: Per task group (secondary) API
It is based on today's tip/sched/core and the full tree is available here:
git://linux-arm.org/linux-pb.git lkml/utilclamp_v9
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v9
Newcomer's Short Abstract
=========================
The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [3].
When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.
The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such, they are not ideally suited for task placement decisions.
Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.
Utilization clamping is a mechanism which allows user-space to "clamp" (i.e.
filter) the utilization generated by RT and FAIR tasks within a given range.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.
The main use-cases for utilization clamping are:
- boosting: better interactive response for small tasks which
are affecting the user experience.
Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). From the task utilization
alone, the scheduler does not have a complete view of the task's
requirements and, since it's a small-utilization task, it keeps selecting
a more energy-efficient CPU, with smaller capacity and lower frequency,
thus negatively impacting the overall time required to complete the
task's activations.
- capping: increase energy efficiency for background tasks not affecting the
user experience.
Since running on a lower-capacity CPU at a lower frequency is more energy
efficient, when the completion time is not a main goal, capping the
utilization considered for certain (maybe big) tasks can have positive
effects, both on energy consumption and thermal headroom.
This feature also allows making RT tasks more energy-friendly on mobile
systems, where running them on high-capacity CPUs and at the maximum
frequency is not required.
From these two use-cases, it's worth noting that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is helping the scheduler make task placement decisions.
Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.
Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:
- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.
- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.
This series does not present this additional usage of utilization clamping,
but it's an integral part of the EAS feature set, of which [4] presents one
of the main components.
Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.
References
==========
[1] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/
[2] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/
[3] Energy Aware Scheduling
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1
[4] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/
Patrick Bellasi (16):
sched/core: uclamp: Add CPU's clamp buckets refcounting
sched/core: uclamp: Add bucket local max tracking
sched/core: uclamp: Enforce last task's UCLAMP_MAX
sched/core: uclamp: Add system default clamps
sched/core: Allow sched_setattr() to use the current policy
sched/core: uclamp: Extend sched_setattr() to support utilization
clamping
sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
sched/core: uclamp: Set default clamps for RT tasks
sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
sched/core: uclamp: Add uclamp_util_with()
sched/fair: uclamp: Add uclamp support to energy_compute()
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Propagate system defaults to root group
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes
Documentation/admin-guide/cgroup-v2.rst | 46 ++
include/linux/log2.h | 34 ++
include/linux/sched.h | 58 ++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 14 +-
include/uapi/linux/sched/types.h | 66 ++-
init/Kconfig | 75 +++
kernel/sched/core.c | 754 +++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 22 +-
kernel/sched/fair.c | 44 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 123 +++-
kernel/sysctl.c | 16 +
14 files changed, 1229 insertions(+), 44 deletions(-)
--
2.21.0
The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However, this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task at.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller, which is currently based just on time constraints.
Extend the CPU controller with a couple of new attributes, util.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group. Specifically:
- util.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the util.min
utilization
- util.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the util.max
utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups. This system wide
interface enforces constraints on tasks in the root node.
b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent
c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus making it possible to control and restrict task requests
Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing util.min <= util.max.
Keep it simple by not yet caring about "effective" values computation
and propagation along the hierarchy.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Changes in v9
Message-ID: <20190507114232.npsvba4itex5qpvl@e110439-lin>
- make alloc_uclamp_sched_group() a void function
Message-ID: <[email protected]>:
- use for_each_clamp_id() and uclamp_se_set() to make code less fragile
---
Documentation/admin-guide/cgroup-v2.rst | 27 +++++
init/Kconfig | 22 ++++
kernel/sched/core.c | 135 +++++++++++++++++++++++-
kernel/sched/sched.h | 6 ++
4 files changed, 189 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..3a940bfe4e8c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+Cycles distribution is based, by default, on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -974,6 +980,27 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.
+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization in the range [0, 1024].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1024", i.e. no utilization capping.
+
+ The requested maximum utilization in the range [0, 1024].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+
Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index 8e103505456a..5617742b97e5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -894,6 +894,28 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping the maximum frequency a task
+ can use, while the min bandwidth allows defining a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any
+ task-specific clamp value is constrained by the cgroup-specified
+ clamp value. Neither the minimum nor the maximum task clamp can
+ be bigger than the corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eed7664437ac..19437257a08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1137,8 +1137,12 @@ static void __init init_uclamp(void)
/* System defaults allow max clamp values for both indexes */
uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+ }
}
#else /* CONFIG_UCLAMP_TASK */
@@ -6695,6 +6699,17 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);
+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for_each_clamp_id(clamp_id)
+ tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+#endif
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6718,6 +6733,8 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ alloc_uclamp_sched_group(tg, parent);
+
return tg;
err:
@@ -6938,6 +6955,96 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp_req[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7282,6 +7389,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7449,6 +7568,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6335cfcc81ba..fd31527fdcc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,12 @@ struct task_group {
#endif
struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Clamp values requested for a task group */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+#endif
+
};
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.21.0
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:
- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.
- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.
When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.
Utilization clamping already supports the concepts of:
- system defaults: which define the maximum possible clamp values
usable by tasks.
- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information related
to the values "requested" by them.
Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efedbd3a0ce6..bd96a977ed07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,6 +1005,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id);
+#else
+#define cpu_util_update_eff(...)
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -1038,6 +1045,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
sysctl_sched_uclamp_util_max, false);
}
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MIN);
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MAX);
+
/*
* Updating all the RUNNABLE task is expensive, keep it simple and do
* just a lazy update at each next enqueue time.
--
2.21.0
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints. This will allow, for example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
frequency selection thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.
Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/fair.c | 4 ++++
kernel/sched/rt.c | 4 ++++
kernel/sched/sched.h | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c41ea367422..7d786d99fdb4 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -201,8 +201,10 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
+ if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
+ }
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -218,9 +220,16 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = util_cfs + cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);
dl_util = cpu_util_dl(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..5e5fe5462099 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10690,6 +10690,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 1e6b909dca36..7fe730382092 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,
.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a8a693c75669..1ea44f4dec16 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2275,6 +2275,29 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */
+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.21.0
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements that can translate into proper
decisions for both task placement and frequency selection. Other
classes have a more simplified model based on the POSIX concept of
priorities.
Such a simple priority-based model however does not allow exploiting the
most advanced features of the Linux scheduler like, for example, driving
frequency selection via the schedutil cpufreq governor. However, also
for non-SCHED_DEADLINE tasks, it's still interesting to define task
properties to support scheduler decisions.
Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This makes it possible to implement a "proactive"
per-task frequency control policy, more advanced than the current one
based just on "passively" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).
Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows
defining task-specific properties for different scheduling classes. A new
pair of attributes allows specifying a minimum and maximum utilization
the scheduler can consider for a task.
Do that by validating the required clamp values first and then applying
the required changes using the same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the new
clamp values.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
Changes in v9:
Message-ID: <20190507111347.4ivnjwbymsf7i3e6@e110439-lin>
- return EOPNOTSUPP from uclamp_validate() on !CONFIG_UCLAMP_TASK
- update commit message
---
include/linux/sched.h | 9 ++++
include/uapi/linux/sched.h | 12 ++++-
include/uapi/linux/sched/types.h | 66 +++++++++++++++++++----
kernel/sched/core.c | 91 +++++++++++++++++++++++++++++---
4 files changed, 161 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0860c8f55c1d..863f70843875 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_dl_entity {
* @value: clamp value "assigned" to a se
* @bucket_id: bucket index corresponding to the "assigned" value
* @active: the se is currently refcounted in a rq's bucket
+ * @user_defined: the requested clamp value comes from user-space
*
* The bucket_id is the index of the clamp bucket matching the clamp value
* which is pre-computed and stored to avoid expensive integer divisions from
@@ -595,11 +596,19 @@ struct sched_dl_entity {
* which can be different from the clamp value "requested" from user-space.
* This allows to know a task is refcounted in the rq's bucket corresponding
* to the "effective" bucket_id.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this task
+ * just as a restriction. This allows to relax default clamps when a less
+ * restrictive task-specific value has been requested, thus allowing to
+ * implement a "nice" semantic. For example, a task running with a 20%
+ * default boost can still drop its own boosting to 0%.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
+ unsigned int user_defined : 1;
};
#endif /* CONFIG_UCLAMP_TASK */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 261b6e43846c..62895bf8e65e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -51,10 +51,20 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_KEEP_POLICY 0x08
+#define SCHED_FLAG_KEEP_PARAMS 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+
+#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
+ SCHED_FLAG_KEEP_PARAMS)
+
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
- SCHED_FLAG_KEEP_POLICY)
+ SCHED_FLAG_KEEP_ALL | \
+ SCHED_FLAG_UTIL_CLAMP)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..c852153ddb0d 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -9,6 +9,7 @@ struct sched_param {
};
#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
/*
* Extended scheduling parameters data structure.
@@ -21,8 +22,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Task Attributes
+ * =========================================
+ *
+ * A subset of sched_attr attributes allows describing a so-called
+ * sporadic time-constrained task.
+ *
+ * In such a model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +60,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +73,29 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows specifying the utilization
+ * expected for a task. These attributes inform the scheduler about
+ * the utilization boundaries within which it should schedule the task. These
+ * boundaries are valuable hints to support scheduler decisions on both task
+ * placement and frequency selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. For example, a
+ * 20% utilization task is a task running for 2ms every 10ms at maximum
+ * frequency.
+ *
+ * A task with a min utilization value bigger than 0 is more likely scheduled
+ * on a CPU with a capacity big enough to fit the specified value.
+ * A task with a max utilization value smaller than 1024 is more likely
+ * scheduled on a CPU with no more capacity than the specified value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +113,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43b29b2efa4c..3e035fbb187d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,10 +793,12 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}
-static inline void uclamp_se_set(struct uclamp_se *uc_se, unsigned int value)
+static inline void uclamp_se_set(struct uclamp_se *uc_se,
+ unsigned int value, bool user_defined)
{
uc_se->value = value;
uc_se->bucket_id = uclamp_bucket_id(value);
+ uc_se->user_defined = user_defined;
}
static inline unsigned int
@@ -1004,11 +1006,11 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
uclamp_se_set(&uclamp_default[UCLAMP_MIN],
- sysctl_sched_uclamp_util_min);
+ sysctl_sched_uclamp_util_min, false);
}
if (old_max != sysctl_sched_uclamp_util_max) {
uclamp_se_set(&uclamp_default[UCLAMP_MAX],
- sysctl_sched_uclamp_util_max);
+ sysctl_sched_uclamp_util_max, false);
}
/*
@@ -1026,6 +1028,42 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
return result;
}
+static int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
+ unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ lower_bound = attr->sched_util_min;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ upper_bound = attr->sched_util_max;
+
+ if (lower_bound > upper_bound)
+ return -EINVAL;
+ if (upper_bound > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
+ return;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
+ attr->sched_util_min, true);
+ }
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
+ attr->sched_util_max, true);
+ }
+}
+
static void uclamp_fork(struct task_struct *p)
{
unsigned int clamp_id;
@@ -1047,11 +1085,11 @@ static void __init init_uclamp(void)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
- uclamp_none(clamp_id));
+ uclamp_none(clamp_id), false);
}
/* System defaults allow max clamp values for both indexes */
- uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX));
+ uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
for_each_clamp_id(clamp_id)
uclamp_default[clamp_id] = uc_max;
}
@@ -1059,6 +1097,13 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EOPNOTSUPP;
+}
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
@@ -4378,6 +4423,13 @@ static void __setscheduler_params(struct task_struct *p,
static void __setscheduler(struct rq *rq, struct task_struct *p,
const struct sched_attr *attr, bool keep_boost)
{
+ /*
+ * If params can't change scheduling class changes aren't allowed
+ * either.
+ */
+ if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
+ return;
+
__setscheduler_params(p, attr);
/*
@@ -4515,6 +4567,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ /* Update task specific "requested" clamps */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = uclamp_validate(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4544,6 +4603,8 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
+ goto change;
p->sched_reset_on_fork = reset_on_fork;
task_rq_unlock(rq, p, &rf);
@@ -4624,7 +4685,9 @@ static int __sched_setscheduler(struct task_struct *p,
put_prev_task(rq, p);
prev_class = p->sched_class;
+
__setscheduler(rq, p, attr, pi);
+ __setscheduler_uclamp(p, attr);
if (queued) {
/*
@@ -4800,6 +4863,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return -EFAULT;
+ if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
+ size < SCHED_ATTR_SIZE_VER1)
+ return -EINVAL;
+
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -4869,10 +4936,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();
+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}
@@ -5023,6 +5095,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);
+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
+#endif
+
rcu_read_unlock();
retval = sched_read_attr(uattr, &attr, size);
--
2.21.0
The sched_setattr() syscall mandates that a policy is always specified.
This requires user-space to always know which policy a task will have
when attributes are configured, which makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy, since
we cannot be sure it is not changed concurrently.
Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) which enforces the use of the current policy.
Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
Changes in v9:
Message-ID: <20190509145901.um7rrsslg7de4blf@e110439-lin>
- get rid of not necessary SCHED_POLICY_MAX define
- update sched_setattr() syscall to just force the current policy on
SCHED_FLAG_KEEP_POLICY
---
include/uapi/linux/sched.h | 4 +++-
kernel/sched/core.c | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..261b6e43846c 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dac73a5959b6..43b29b2efa4c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4863,6 +4863,8 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
if ((int)attr.sched_policy < 0)
return -EINVAL;
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;
rcu_read_lock();
retval = -ESRCH;
--
2.21.0
When a task specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}queued. However, when cgroups are also in use, the task
specific clamp values could be restricted by the task_group (TG)
clamp values.
Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp bucket corresponding
to the smaller clamp between the task specific value and its TG effective
value. This makes it possible to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. tasks are boosted only up to the effective granted value
or clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG
This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.
Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.
Apply task group clamp restrictions only to tasks belonging to a child
group, while for tasks in the root group or in an autogroup only
system defaults are enforced.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd96a977ed07..354d925a6ba8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -861,16 +861,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ * group or in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
/* System default restrictions always apply */
--
2.21.0
A forked task gets the same clamp values as its parent. However, when
the RESET_ON_FORK flag is set on the parent, e.g. via:
sys_sched_setattr()
sched_setattr()
__sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)
the newly forked task is expected to start with all attributes reset to
default values.
Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e035fbb187d..03b1cd82bc48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1070,6 +1070,14 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;
+
+ if (likely(!p->sched_reset_on_fork))
+ return;
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
static void __init init_uclamp(void)
--
2.21.0
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.
Do that by slightly refactoring uclamp_bucket_inc(). An additional
cgroup_subsys_state (css) parameter is used to walk the list of tasks
in the TG and update the RUNNABLE ones. This is done by taking the rq
lock for each task, the same mechanism used for CPU affinity mask
updates.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 354d925a6ba8..0c078d586f36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1031,6 +1031,51 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(rq, p, clamp_id);
+ uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css, int clamp_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css,
unsigned int clamp_id);
@@ -7044,6 +7089,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css,
uc_se->value = value;
uc_se->bucket_id = uclamp_bucket_id(value);
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamp_id);
}
}
--
2.21.0
In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to its descendants.
Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.
Since it can be interesting for userspace, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes
cpu.util.{min,max}.effective.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 19 +++++
kernel/sched/core.c | 108 ++++++++++++++++++++++--
kernel/sched/sched.h | 2 +
3 files changed, 124 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3a940bfe4e8c..4d13f88bfe25 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -990,6 +990,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.
+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization in the range [0, 1024].
+
+	This value can be lower than cpu.util.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1024". i.e. no utilization capping
@@ -1000,6 +1010,15 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization in the range [0, 1024].
+
+	This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.
Memory
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 19437257a08d..efedbd3a0ce6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,6 +761,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
@@ -1125,6 +1137,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;
+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1141,6 +1155,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
#ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+ root_task_group.uclamp[clamp_id] = uc_max;
#endif
}
}
@@ -6705,8 +6720,10 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
#ifdef CONFIG_UCLAMP_TASK_GROUP
int clamp_id;
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+ tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
+ }
#endif
}
@@ -6956,6 +6973,44 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+ unsigned int value;
+
+ css_for_each_descendant_pre(css, top_css) {
+ value = css_tg(css)->uclamp_req[clamp_id].value;
+
+ uc_parent = NULL;
+ if (css_tg(css)->parent)
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+
+ /*
+ * Skip the whole subtrees if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive value, as effective clamp.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (uc_se->value == value &&
+ uc_parent && uc_parent->value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent && uc_parent->value < value)
+ value = uc_parent->value;
+ if (uc_se->value == value)
+ continue;
+
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -6965,6 +7020,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
+ mutex_lock(&uclamp_mutex);
rcu_read_lock();
tg = css_tg(css);
@@ -6981,8 +7037,12 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(css, UCLAMP_MIN);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
return ret;
}
@@ -6996,6 +7056,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
+ mutex_lock(&uclamp_mutex);
rcu_read_lock();
tg = css_tg(css);
@@ -7012,21 +7073,28 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(css, UCLAMP_MAX);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
return ret;
}
static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;
rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp_req[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].value
+ : tg->uclamp_req[clamp_id].value;
rcu_read_unlock();
return util_clamp;
@@ -7035,13 +7103,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}
static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */
@@ -7396,11 +7476,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7576,12 +7664,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fd31527fdcc8..f3c65af96756 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -403,6 +403,8 @@ struct task_group {
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a task group */
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif
};
--
2.21.0
Hi Tejun,
this is just a gentle ping.
I had a chance to speak with Peter and Rafael at the last OSPM and
they both seem to be reasonably happy with the current status of the
core bits.
Thus, it would be very useful if you could jump into the discussion and
start reviewing the cgroup integration bits.
You can find them in the last 5 patches of this series.
Luckily, on the cgroup side we don't start from ground zero.
Many aspects and pain points have already been discussed in the past
and have been addressed by the current version.
For your benefit, here is a list of previously discussed items:
- A child can never obtain more than its ancestors
https://lore.kernel.org/lkml/[email protected]/
- Enforcing consistency rules among parent-child groups
https://lore.kernel.org/lkml/[email protected]/
- Use "effective" attributes to enforce restrictions:
https://lore.kernel.org/lkml/[email protected]
https://lore.kernel.org/lkml/[email protected]
- Tasks in a subgroup can only be further boosted and/or capped,
i.e. the parent always sets an upper bound on both max and min clamps
https://lore.kernel.org/lkml/[email protected]
https://lore.kernel.org/lkml/[email protected]
- Put everything at system level outside of the cgroup interface,
i.e. no root group tunable attributes
https://lore.kernel.org/lkml/[email protected]/
- CGroups don't need any hierarchical behaviors on their own
https://lore.kernel.org/lkml/[email protected]/
Looking forward to getting more feedback from you.
Cheers,
Patrick
On 15-May 10:44, Patrick Bellasi wrote:
> Hi all, this is a respin of:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> which includes the following main changes:
>
> - fix "max local update" by moving into uclamp_rq_inc_id()
> - use for_each_clamp_id() and uclamp_se_set() to make code less fragile
> - rename sysfs node: s/sched_uclamp_util_{min,max}/sched_util_clamp_{min,max}/
> - removed not used uclamp_eff_bucket_id()
> - move uclamp_bucket_base_value() definition into patch using it
> - get rid of not necessary SCHED_POLICY_MAX define
> - update sched_setattr() syscall to just force the current policy on
> SCHED_FLAG_KEEP_POLICY
> - return EOPNOTSUPP from uclamp_validate() on !CONFIG_UCLAMP_TASK
> - make alloc_uclamp_sched_group() a void function
> - simplify bits_per() definition
> - add rq's lockdep assert to uclamp_rq_{inc,dec}_id()
>
> With the above, I think we captured all the major inputs from Peter [1].
> Thus, this v9 is likely the right version to unlock Tejun's review [2] on the
> remaining cgroup related bits, i.e. patches [12-16].
>
> Cheers Patrick
>
>
> Series Organization
> ===================
>
> The series is organized into these main sections:
>
> - Patches [01-07]: Per task (primary) API
> - Patches [08-09]: Schedutil integration for FAIR and RT tasks
> - Patches [10-11]: Integration with EAS's energy_compute()
> - Patches [12-16]: Per task group (secondary) API
>
> It is based on today's tip/sched/core and the full tree is available here:
>
> git://linux-arm.org/linux-pb.git lkml/utilclamp_v9
> http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v9
>
>
> Newcomer's Short Abstract
> =========================
>
> The Linux scheduler tracks a "utilization" signal for each scheduling entity
> (SE), e.g. tasks, to know how much CPU time they use. This signal allows the
> scheduler to know how "big" a task is and, in principle, it can support
> advanced task placement strategies by selecting the best CPU to run a task.
> Some of these strategies are represented by the Energy Aware Scheduler [3].
>
> When the schedutil cpufreq governor is in use, the utilization signal allows
> the Linux scheduler to also drive frequency selection. The CPU utilization
> signal, which represents the aggregated utilization of tasks scheduled on that
> CPU, is used to select the frequency which best fits the workload generated by
> the tasks.
>
> The current translation of utilization values into a frequency selection is
> simple: we go to max for RT tasks or to the minimum frequency which can
> accommodate the utilization of DL+FAIR tasks.
> However, utilization values by themselves cannot convey the desired
> power/performance behaviors of each task as intended by user-space.
> As such they are not ideally suited for task placement decisions.
>
> Task placement and frequency selection policies in the kernel can be improved
> by taking into consideration hints coming from authorized user-space elements,
> like for example the Android middleware or more generally any "System
> Management Software" (SMS) framework.
>
> Utilization clamping is a mechanism to "clamp" (i.e. filter) the
> utilization generated by RT and FAIR tasks within a range defined by user-space.
> The clamped utilization value can then be used, for example, to enforce a
> minimum and/or maximum frequency depending on which tasks are active on a CPU.
>
> The main use-cases for utilization clamping are:
>
> - boosting: better interactive response for small tasks which
> are affecting the user experience.
>
> Consider for example the case of a small control thread for an external
> accelerator (e.g. GPU, DSP, other devices). From the task utilization alone
> the scheduler does not have a complete view of the task's requirements and,
> since its utilization is small, the scheduler keeps selecting a more energy
> efficient CPU, with smaller capacity and lower frequency, thus negatively
> impacting the overall time required to complete the task's activations.
>
> - capping: increase energy efficiency for background tasks not affecting the
> user experience.
>
> Running on a lower capacity CPU at a lower frequency is more energy
> efficient; thus, when completion time is not a main goal, capping the
> utilization considered for certain (maybe big) tasks can have positive
> effects, both on energy consumption and thermal headroom.
> This feature also allows making RT tasks more energy friendly on mobile
> systems, where running them on high capacity CPUs and at the maximum
> frequency is not required.
>
> From these two use-cases, it's worth noting that frequency selection
> biasing, introduced by patches 9 and 10 of this series, is just one possible
> usage of utilization clamping. Another compelling extension of utilization
> clamping is helping the scheduler make task placement decisions.
>
> Utilization is (also) a task specific property the scheduler uses to know
> how much CPU bandwidth a task requires, at least as long as there is idle time.
> Thus, the utilization clamp values, defined either per-task or per-task_group,
> can represent tasks to the scheduler as being bigger (or smaller) than what
> they actually are.
>
> Utilization clamping thus enables interesting additional optimizations, for
> example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
> where:
>
> - boosting: try to run small/foreground tasks on higher-capacity CPUs to
> complete them faster despite being less energy efficient.
>
> - capping: try to run big/background tasks on low-capacity CPUs to save power
> and thermal headroom for more important tasks
>
> This series does not present this additional usage of utilization clamping but
> it's an integral part of the EAS feature set, where [4] is one of its main
> components.
>
> Android kernels use SchedTune, a solution similar to utilization clamping, to
> bias both 'frequency selection' and 'task placement'. This series provides the
> foundation to add similar features to mainline while focusing, for the
> time being, just on schedutil integration.
>
>
> References
> ==========
>
> [1] Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/
>
> [2] Message-ID: <[email protected]>
> https://lore.kernel.org/lkml/[email protected]/
>
> [3] Energy Aware Scheduling
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1
>
> [4] Expressing per-task/per-cgroup performance hints
> Linux Plumbers Conference 2018
> https://linuxplumbersconf.org/event/2/contributions/128/
>
>
> Patrick Bellasi (16):
> sched/core: uclamp: Add CPU's clamp buckets refcounting
> sched/core: uclamp: Add bucket local max tracking
> sched/core: uclamp: Enforce last task's UCLAMP_MAX
> sched/core: uclamp: Add system default clamps
> sched/core: Allow sched_setattr() to use the current policy
> sched/core: uclamp: Extend sched_setattr() to support utilization
> clamping
> sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
> sched/core: uclamp: Set default clamps for RT tasks
> sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
> sched/core: uclamp: Add uclamp_util_with()
> sched/fair: uclamp: Add uclamp support to energy_compute()
> sched/core: uclamp: Extend CPU's cgroup controller
> sched/core: uclamp: Propagate parent clamps
> sched/core: uclamp: Propagate system defaults to root group
> sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
> sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>
> Documentation/admin-guide/cgroup-v2.rst | 46 ++
> include/linux/log2.h | 34 ++
> include/linux/sched.h | 58 ++
> include/linux/sched/sysctl.h | 11 +
> include/linux/sched/topology.h | 6 -
> include/uapi/linux/sched.h | 14 +-
> include/uapi/linux/sched/types.h | 66 ++-
> init/Kconfig | 75 +++
> kernel/sched/core.c | 754 +++++++++++++++++++++++-
> kernel/sched/cpufreq_schedutil.c | 22 +-
> kernel/sched/fair.c | 44 +-
> kernel/sched/rt.c | 4 +
> kernel/sched/sched.h | 123 +++-
> kernel/sysctl.c | 16 +
> 14 files changed, 1229 insertions(+), 44 deletions(-)
>
> --
> 2.21.0
>
--
#include <best/regards.h>
Patrick Bellasi
Hello, Patrick.
On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> Extend the CPU controller with a couple of new attributes util.{min,max}
> which allow enforcing utilization boosting and capping for all the
> tasks in a group. Specifically:
>
> - util.min: defines the minimum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run at least at a
> minimum frequency which corresponds to the util.min
> utilization
>
> - util.max: defines the maximum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run up to a
> maximum frequency which corresponds to the util.max
> utilization
Let's please use a prefix which is more specific. It's clamping the
utilization estimates of the member tasks which in turn affect
scheduling / frequency decisions but cpu.util.max reads like it's
gonna limit the cpu utilization directly. Maybe just use uclamp?
> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
> hierarchies, while system wide clamps are defined by a generic
> interface which does not depend on cgroups. This system wide
> interface enforces constraints on tasks in the root node.
I'd much prefer if they weren't entangled this way. The system wide
limits should work the same regardless of cgroup's existence. cgroup
can put further restriction on top but mere creation of cgroups with
cpu controller enabled shouldn't take them out of the system-wide
limits.
> b) enforce effective constraints at each level of the hierarchy which
> are a restriction of the group requests considering its parent's
> effective constraints. Root group effective constraints are defined
> by the system wide interface.
> This mechanism allows each (non-root) level of the hierarchy to:
> - request whatever clamp values it would like to get
> - effectively get only up to the maximum amount allowed by its parent
I'll come back to this later.
> c) have higher priority than task-specific clamps, defined via
> sched_setattr(), thus allowing to control and restrict task requests
This sounds good.
> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Validate local consistency by enforcing util.min < util.max.
> Keep it simple by not caring, for now, about "effective" values
> computation and propagation along the hierarchy.
So, the following is what we're doing for hierarchical protection
and limit propagation.
* Limits (high / max) default to max. Protections (low / min) 0. A
new cgroup by default doesn't constrain itself further and doesn't
have any protection.
* A limit defines the upper ceiling for the subtree. If an ancestor
has a limit of X, none of its descendants can have more than X.
* A protection defines the upper ceiling of protections for the
subtree. If an ancestor has a protection of X, none of its
descendants can have more protection than X.
Note that there's no way for an ancestor to enforce protection on its
descendants. It can only allow them to claim some. This is
intentional, as the alternative is descendants losing the ability to
further distribute protections as they see fit.
For proportions (as opposed to weights), we use percentage rational
numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits
pending. I'll put them on cgroup/for-5.3.
Thanks.
--
tejun
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.
Hi Tejun!
> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> > Extend the CPU controller with a couple of new attributes util.{min,max}
> > which allow enforcing utilization boosting and capping for all the
> > tasks in a group. Specifically:
> >
> > - util.min: defines the minimum utilization which should be considered
> > i.e. the RUNNABLE tasks of this group will run at least at a
> > minimum frequency which corresponds to the util.min
> > utilization
> >
> > - util.max: defines the maximum utilization which should be considered
> > i.e. the RUNNABLE tasks of this group will run up to a
> > maximum frequency which corresponds to the util.max
> > utilization
>
> Let's please use a prefix which is more specific. It's clamping the
> utilization estimates of the member tasks which in turn affect
> scheduling / frequency decisions but cpu.util.max reads like it's
> gonna limit the cpu utilization directly. Maybe just use uclamp?
Doesn't being too specific risk exposing implementation details?
If that's not a problem and Peter likes:
cpu.uclamp.{min,max}
that's ok with me.
--
#include <best/regards.h>
Patrick Bellasi
On 31-May 08:35, Tejun Heo wrote:
[...]
> > These attributes:
> >
> > a) are available only for non-root nodes, both on default and legacy
> > hierarchies, while system wide clamps are defined by a generic
> > interface which does not depend on cgroups. This system wide
> > interface enforces constraints on tasks in the root node.
>
> I'd much prefer if they weren't entangled this way. The system wide
> limits should work the same regardless of cgroup's existence. cgroup
> can put further restriction on top but mere creation of cgroups with
> cpu controller enabled shouldn't take them out of the system-wide
> limits.
That's correct, and what you describe matches, at least in its
intent, the current implementation provided in:
[PATCH v9 14/16] sched/core: uclamp: Propagate system defaults to root group
https://lore.kernel.org/lkml/[email protected]/
System clamps always work the same way, independently of cgroups:
they define the upper bound for both min and max clamps.
When cgroups are not available, task-specific clamps are always
capped by system clamps.
When cgroups are available, the root task group clamps are capped by
the system clamps, which affects its "effective" clamps and propagates
them down the hierarchy to the children's "effective" clamps.
That's done in:
[PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
https://lore.kernel.org/lkml/[email protected]/
Example 1
---------
Here is an example of system and group clamps aggregation:

                          min   max
  system defaults         400   600

  cg_name                 min   min.effective  max   max.effective
  /uclamp                 1024  400            500   500
  /uclamp/app             512   400            512   500
  /uclamp/app/pid_smalls  100   100            200   200
  /uclamp/app/pid_bigs    500   400            700   500
The ".effective" clamps are used to define the actual clamp value to
apply to tasks, according to the aggregation rules defined in:
[PATCH v9 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
https://lore.kernel.org/lkml/[email protected]/
All the above, to me, means that:
- cgroups are always capped by system clamps
- cgroups can further restrict system clamps
Does that match with your view?
> > b) enforce effective constraints at each level of the hierarchy which
> > are a restriction of the group requests considering its parent's
> > effective constraints. Root group effective constraints are defined
> > by the system wide interface.
> > This mechanism allows each (non-root) level of the hierarchy to:
> > - request whatever clamp values it would like to get
> > - effectively get only up to the maximum amount allowed by its parent
>
> I'll come back to this later.
>
> > c) have higher priority than task-specific clamps, defined via
> > sched_setattr(), thus allowing to control and restrict task requests
>
> This sounds good.
>
> > Add two new attributes to the cpu controller to collect "requested"
> > clamp values. Allow that at each non-root level of the hierarchy.
> > Validate local consistency by enforcing util.min < util.max.
> > Keep it simple by not caring, for now, about "effective" values
> > computation and propagation along the hierarchy.
>
> So, the following is what we're doing for hierarchical protection
> and limit propagation.
>
> * Limits (high / max) default to max. Protections (low / min) 0. A
> new cgroup by default doesn't constrain itself further and doesn't
> have any protection.
Example 2
---------
Let's say we have:

  /tg1:
    util_min=200 (as a protection)
    util_max=800 (as a limit)

the moment we create a subgroup /tg1/tg11, in v9 it is initialized
with the same limits _and protections_ as its parent:

  /tg1/tg11:
    util_min=200 (protection inherited from /tg1)
    util_max=800 (limit inherited from /tg1)

Do you mean that we should have instead:

  /tg1/tg11:
    util_min=0 (no protection by default at creation time)
    util_max=800 (limit inherited from /tg1)

i.e. we need to reset the protection of a newly created subgroup?
> * A limit defines the upper ceiling for the subtree. If an ancestor
> has a limit of X, none of its descendants can have more than X.
That's correct; however, we distinguish between "requested" and
"effective" values.
Example 3
---------
We can have:
  cg_name               max   max.effective
  /uclamp/app           400   400
  /uclamp/app/pid_bigs  500   400

Which means that a subgroup can "request" a limit (max=500) higher
than its parent's (max=400), while still getting only up to what its
parent allows (max.effective=400).
Example 4
---------
Tracking the actual requested limit (max=500) is useful to enforce
it once the parent's limit is relaxed; for example, we would then have:

  cg_name               max   max.effective
  /uclamp/app           600   600
  /uclamp/app/pid_bigs  500   500

where a subgroup gets no more than what it was configured for.
This is the logic implemented by cpu_util_update_eff() in:
[PATCH v9 13/16] sched/core: uclamp: Propagate parent clamps
https://lore.kernel.org/lkml/[email protected]/
> * A protection defines the upper ceiling of protections for the
> subtree. If an ancestor has a protection of X, none of its
> descendants can have more protection than X.
Right, that's the current behavior in v9.
> Note that there's no way for an ancestor to enforce protection on its
> descendants. It can only allow them to claim some. This is
> intentional, as the alternative is descendants losing the ability to
> further distribute protections as they see fit.
Ok, that means in v10 I need to update the initialization of a
subgroup's min clamp to be zero by default, as discussed in Example 2
above, right?
[...]
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 31-May 08:35, Tejun Heo wrote:
> Hello, Patrick.
>
> On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
[...]
> For proportions (as opposed to weights), we use percentage rational
> numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits
> pending. I'll put them on cgroup/for-5.3.
That's a point worth discussing with Peter; we already changed once
from percentages to the 1024 scale.
Utilization clamps are percentages by definition; they are just
expressed on a convenient 1024 scale, which should not be alien to
people using those knobs.
If we wanna use a "more specific" name like uclamp.{min,max}, then we
should probably also accept a "more specific" metric, shouldn't we?
I personally like the [0..1024] range, but I guess that's really up to
you and Peter to agree upon.
> Thanks.
>
> --
> tejun
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hello,
On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> All the above, to me it means that:
> - cgroups are always capped by system clamps
> - cgroups can further restrict system clamps
>
> Does that match with your view?
Yeah, as long as what's defined at system level clamps everything in
the system whether they're in cgroups or not, it's all good.
> > * Limits (high / max) default to max. Protections (low / min) 0. A
> > new cgroup by default doesn't constrain itself further and doesn't
> > have any protection.
>
> Example 2
> ---------
>
> Let's say we have:
>
> /tg1:
> util_min=200 (as a protection)
> util_max=800 (as a limit)
>
> the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> with the same limits _and protections_ of its father:
>
> /tg1/tg11:
> util_min=200 (protection inherited from /tg1)
> util_max=800 (limit inherited from /tg1)
>
> Do you mean that we should have instead:
>
> /tg1/tg11:
> util_min=0 (no protection by default at creation time)
> util_max=800 (limit inherited from /tg1)
>
>
> i.e. we need to reset the protection of a newly created subgroup?
The default value for limits should be max, protections 0. Don't
inherit config values from the parent. That gets confusing super fast
because the timing of when the parent config is set and each child is
created plays into the overall configuration. Hierarchical
confinements should always be enforced and a new cgroup should always
start afresh in terms of its own configuration.
> > * A limit defines the upper ceiling for the subtree. If an ancestor
> > has a limit of X, none of its descendants can have more than X.
>
> That's correct, however we distinguish between "requested" and
> "effective" values.
Sure, all property propagating controllers should.
> > Note that there's no way for an ancestor to enforce protection on its
> > descendants. It can only allow them to claim some. This is
> > intentional, as the alternative is descendants losing the ability to
> > further distribute protections as they see fit.
>
> Ok, that means I need to update in v10 the initialization of subgroups
> min clamps to be none by default as discussed in the above Example 2,
> right?
Yeah and max to max.
Thanks.
--
tejun
Hello,
On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> On 31-May 08:35, Tejun Heo wrote:
> > Hello, Patrick.
> >
> > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
>
> [...]
>
> > For proportions (as opposed to weights), we use percentage rational
> > numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits
> > pending. I'll put them on cgroup/for-5.3.
>
> That's a point worth discussing with Peter, we already changed one
> time from percentages to 1024 scale.
cgroup tries to use uniform units for its interface files as much as
possible, even when that deviates from non-cgroup interfaces. We can
bikeshed the pros and cons for that design choice for sure but I don't
think it makes sense to deviate from that at this point unless there
are really strong reasons to do so.
> Utilization clamps are expressed as percentages by definition,
> they are just expressed in a convenient 1024 scale which should not be
> alien to people using those knobs.
>
> If we wanna use a "more specific" name like uclamp.{min,max} then we
> should probably also accept to use a "more specific" metric, don't we?
Heh, this actually made me chuckle. It's an interesting bargaining
take but I don't think that same word being in two different places
makes them tradable entities. We can go into the weeds with the
semantics but how about us using an alternative adjective "misleading"
for the cpu.util.min/max names to short-circuit that?
Thanks.
--
tejun
On 05-Jun 07:03, Tejun Heo wrote:
> Hello,
Hi!
> On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> > All the above, to me it means that:
> > - cgroups are always capped by system clamps
> > - cgroups can further restrict system clamps
> >
> > Does that match with your view?
>
> Yeah, as long as what's defined at system level clamps everything in
> the system whether they're in cgroups or not, it's all good.
Right, then we are good with v9 on this point.
> > > * Limits (high / max) default to max. Protections (low / min) 0. A
> > > new cgroup by default doesn't constrain itself further and doesn't
> > > have any protection.
> >
> > Example 2
> > ---------
> >
> > Let's say we have:
> >
> > /tg1:
> > util_min=200 (as a protection)
> > util_max=800 (as a limit)
> >
> > the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> > with the same limits _and protections_ of its father:
> >
> > /tg1/tg11:
> > util_min=200 (protection inherited from /tg1)
> > util_max=800 (limit inherited from /tg1)
> >
> > Do you mean that we should have instead:
> >
> > /tg1/tg11:
> > util_min=0 (no protection by default at creation time)
> > util_max=800 (limit inherited from /tg1)
> >
> >
> > i.e. we need to reset the protection of a newly created subgroup?
>
> The default value for limits should be max, protections 0. Don't
> inherit config values from the parent. That gets confusing super fast
> because the timing of when the parent config is set and each child is
> created plays into the overall configuration. Hierarchical
> confinements should always be enforced and a new cgroup should always
> start afresh in terms of its own configuration.
Got it, so in the example above we will create:

  /tg1/tg11:
    util_min=0    (no requested protection by default at creation time)
    util_max=1024 (no requested limit by default at creation time)

That's it for the "requested" values side, while the "effective"
values are enforced by the hierarchical confinement rules from
creation time onwards.
Which means we will enforce the effective values as:

  /tg1/tg11:
    util_min.effective=0
      i.e. keep the child's protection, since smaller than the parent's
    util_max.effective=800
      i.e. keep the parent's limit, since stricter than the child's
Please shout if I got it wrong, otherwise I'll update v10 to
implement the above logic.
> > > * A limit defines the upper ceiling for the subtree. If an ancestor
> > > has a limit of X, none of its descendants can have more than X.
> >
> > That's correct, however we distinguish between "requested" and
> > "effective" values.
>
> Sure, all property propagating controllers should.
Right.
> > Note that there's no way for an ancestor to enforce protection on its
> > descendants. It can only allow them to claim some. This is
> > intentional, as the alternative is descendants losing the ability to
> > further distribute protections as they see fit.
> >
> > Ok, that means I need to update in v10 the initialization of subgroups
> > min clamps to be none by default as discussed in the above Example 2,
> > right?
>
> Yeah and max to max.
Right, I've got it now.
> Thanks.
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hello,
On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> Which means we will enforce the effective values as:
>
> /tg1/tg11:
>
> util_min.effective=0
> i.e. keep the child protection since smaller than parent
>
> util_max.effective=800
> i.e. keep parent limit since stricter than child
>
> Please shout if I got it wrong, otherwise I'll update v10 to
> implement the above logic.
Everything sounds good to me. Please note that cgroup interface files
actually use literal "max" for limit/protection max settings so that 0
and "max" mean the same things for all limit/protection knobs.
Thanks.
--
tejun
On 05-Jun 07:09, Tejun Heo wrote:
> Hello,
Hi,
> On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> > On 31-May 08:35, Tejun Heo wrote:
> > > Hello, Patrick.
> > >
> > > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > For proportions (as opposed to weights), we use percentage rational
> > > numbers - e.g. 38.44 for 38.44%. I have parser and doc update commits
> > > pending. I'll put them on cgroup/for-5.3.
> >
> > That's a point worth discussing with Peter, we already changed one
> > time from percentages to 1024 scale.
>
> cgroup tries to use uniform units for its interface files as much as
> possible, even when that deviates from non-cgroup interfaces. We can
> bikeshed the pros and cons for that design choice for sure but I don't
> think it makes sense to deviate from that at this point unless there
> are really strong reasons to do so.
That makes sense to me; having a uniform interface certainly has
value.
The only additional point I can think of as a (slightly) stronger
reason is that we would presumably like to have the same API for
cgroups as well as for the task-specific and system-wide settings.
The task-specific values come in via the sched_setattr() syscall:
[PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
https://lore.kernel.org/lkml/[email protected]/
where we need to encode each clamp into a __u32 value.
System-wide settings are exposed similarly to these:
grep '' /proc/sys/kernel/sched_*
where we have always integer numbers.
AFAIU, your proposal would require using a "scaled percentage" - e.g.
3844 for 38.44% - which, however, is still not quite the same as
writing the string "38.44".
Not sure that's a strong enough argument, is it?
> > Utilization clamps are expressed as percentages by definition,
> > they are just expressed in a convenient 1024 scale which should not be
> > alien to people using those knobs.
> >
> > If we wanna use a "more specific" name like uclamp.{min,max} then we
> > should probably also accept to use a "more specific" metric, don't we?
>
> Heh, this actually made me chuckle.
:)
> It's an interesting bargaining take but I don't think that same word
> being in two different places makes them tradable entities.
Sure, that was not my intention.
I was just trying to see if the need to be more specific could also
be an argument for having a more specific value.
> We can go into the weeds with the semantics but how about us using
> an alternative adjective "misleading" for the cpu.util.min/max names
> to short-circuit that?
Not quite sure I get what you mean here. Are you pointing out that
with clamps we don't strictly enforce a bandwidth but just set a
bias?
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hello, Patrick.
On Wed, Jun 05, 2019 at 04:06:30PM +0100, Patrick Bellasi wrote:
> The only additional point I can think about as a (slightly) stronger
> reason is that I guess we would like to have the same API for cgroups
> as well as for the task specific and the system wide settings.
>
> The task-specific values come in via the sched_setattr() syscall:
>
> [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
> https://lore.kernel.org/lkml/[email protected]/
>
> where we need to encode each clamp into a __u32 value.
>
> System-wide settings are exposed similarly to these:
>
> grep '' /proc/sys/kernel/sched_*
>
> where we have always integer numbers.
>
> AFAIU, your proposal would require using a "scaled percentage" - e.g.
> 3844 for 38.44% - which, however, is still not quite the same as
> writing the string "38.44".
>
> Not sure that's a strong enough argument, is it?
It definitely is an argument, but the thing is that the units we use
in kernel APIs are all over the place anyway. Even for something as
simple as sizes, we use bytes, 512 byte sectors, kilobytes and pages
all over the place. Some for good reasons (as you mentioned above)
and others for historical / random ones.
So, I'm generally not too concerned about units differing between
cgroup interface and, say, syscall interface. That ship has sailed a
long while ago and we have to deal with it everywhere anyway (in many
cases there isn't even a good unit to pick for compatibility because
the existing interfaces are already mixing units heavily). As long as
the translation is trivial, it isn't a big issue. Note that some
translations are not trivial. For example, the sched nice value
mapping to weight has a separate unit matching knob for that reason.
> > We can go into the weeds with the semantics but how about us using
> > an alternative adjective "misleading" for the cpu.util.min/max names
> > to short-circuit that?
>
> Not quite sure I get what you mean here. Are you pointing out that
> with clamps we don't strictly enforce a bandwidth but just set a
> bias?
It's just that "util" is already used a lot, and cpu.util.max reads
like it should cap cpu utilization (wallclock based) to 80%, and it's
likely that it'd read that way to many other folks too. A more
distinctive name signals that it isn't something that obvious.
Thanks.
--
tejun
On 05-Jun 07:44, Tejun Heo wrote:
> Hello,
Hi,
> On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> > Which means we will enforce the effective values as:
> >
> > /tg1/tg11:
> >
> > util_min.effective=0
> > i.e. keep the child protection since smaller than parent
> >
> > util_max.effective=800
> > i.e. keep parent limit since stricter than child
> >
> > Please shout if I got it wrong, otherwise I'll update v10 to
> > implement the above logic.
>
> Everything sounds good to me. Please note that cgroup interface files
> actually use literal "max" for limit/protection max settings so that 0
> and "max" mean the same things for all limit/protection knobs.
Lemme see if I've got it right: do you mean that we can:
1) write the _string_ "max" into a cgroup attribute to:
- set 0 for util_max, since it's a protection
- set 1024 for util_min, since it's a limit
2) write the _string_ "0" into a cgroup attribute to:
- set 1024 for util_max, since it's a protection
- set 0 for util_min, since it's a limit
Is that correct, or is it just me being totally confused?
> Thanks.
>
> --
> tejun
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hello, Patrick.
On Wed, Jun 05, 2019 at 04:37:43PM +0100, Patrick Bellasi wrote:
> > Everything sounds good to me. Please note that cgroup interface files
> > actually use literal "max" for limit/protection max settings so that 0
> > and "max" mean the same things for all limit/protection knobs.
>
> Lemme see if I've got it right, do you mean that we can:
>
> 1) write the _string_ "max" into a cgroup attribute to:
>
> - set 0 for util_max, since it's a protection
> - set 1024 for util_min, since it's a limit
>
> 2) write the _string_ "0" into a cgroup attribute to:
>
> - set 1024 for util_max, since it's a protection
> - set 0 for util_min, since it's a limit
>
> Is that correct or it's just me totally confused?
Heh, sorry about not being clearer. "max" just means the numerically
highest possible value for the config knob, so in your case "max"
would always map to 1024.
Thanks.
--
tejun