2019-08-22 15:10:08

by Patrick Bellasi

Subject: [PATCH v14 0/6] Add utilization clamping support (CGroups API)

Hi all, this is a respin of:

https://lore.kernel.org/lkml/[email protected]/

which introduces only a small fix suggested by Michal and adds his Reviewed-by.
Thanks Michal for your additional review!

The series is based on top of today's tip/sched/core:

commit a46d14eca7b7 ("sched/fair: Use rq_lock/unlock in online_fair_sched_group")

Since there were only minor changes, I've kept Tejun's ACK tag.

Cheers,
Patrick


Series Organization
===================

The full tree is available here:

git://linux-arm.org/linux-pb.git lkml/utilclamp_v14
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v14


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [1].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows user-space to "clamp" (i.e.
filter) the utilization generated by RT and FAIR tasks within a defined range.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.
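As a minimal, purely illustrative sketch (not taken from the kernel sources;
the helper name is made up and the ~25% headroom only mirrors the spirit of
schedutil's map_util_freq()), the resulting frequency selection looks like:

/*
 * Illustrative only: restrict the CPU's aggregated utilization to the
 * user-space defined [util_min, util_max] range, then map the clamped
 * value to a frequency with some headroom, as schedutil does.
 */
static unsigned long freq_from_clamped_util(unsigned long util,
					    unsigned long util_min,
					    unsigned long util_max,
					    unsigned long max_cap,
					    unsigned long max_freq)
{
	util = clamp(util, util_min, util_max);

	/* ~25% headroom, in the spirit of map_util_freq() */
	return max_freq * (util + (util >> 2)) / max_cap;
}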

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). From its utilization alone, the
scheduler does not have a complete view of the task's requirements and, since
it is a small utilization task, it keeps selecting a more energy efficient
CPU, with smaller capacity and lower frequency, thus negatively impacting the
overall time required to complete the task's activations.

- capping: increase energy efficiency for background tasks not affecting the
user experience.

Since running on a lower capacity CPU at a lower frequency is more energy
efficient, capping the utilization considered for certain (maybe big) tasks,
when their completion time is not a main goal, can have positive effects both
on energy consumption and thermal headroom.
This feature also makes RT tasks more energy friendly on mobile systems,
where running them on high capacity CPUs at the maximum frequency is not
always required.

Looking at these two use-cases, it's worth noting that frequency selection
biasing is just one possible usage of utilization clamping. Another compelling
extension of utilization clamping is in helping the scheduler make task
placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.

- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.

This series does not present this additional usage of utilization clamping,
but it is an integral part of the EAS feature set; see [2] for more details.
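To give a purely illustrative flavour of that placement usage (this is not
part of this series and the helper below is hypothetical), a boosted clamp
value makes a task look "bigger" to a capacity fitness check of the kind an
asymmetric-capacity placement policy relies on:

static bool task_fits_cpu_capacity(unsigned long clamped_util,
				   unsigned long cpu_capacity)
{
	/* keep ~20% headroom, similar in spirit to fits_capacity() */
	return clamped_util * 1280 < cpu_capacity * 1024;
}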

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Energy Aware Scheduling
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (6):
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Propagate system defaults to root group
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes
sched/core: uclamp: always use enum uclamp_id for clamp_id values

Documentation/admin-guide/cgroup-v2.rst | 34 +++
init/Kconfig | 22 ++
kernel/sched/core.c | 375 ++++++++++++++++++++++--
kernel/sched/sched.h | 12 +-
4 files changed, 421 insertions(+), 22 deletions(-)

--
2.22.0


2019-08-22 15:11:35

by Patrick Bellasi

Subject: [PATCH v14 3/6] sched/core: uclamp: Propagate system defaults to root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.

- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not in use.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

- system defaults: which define the maximum possible clamp values
usable by tasks.

- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information about the
values "requested" by them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update the next time each
task is enqueued.
Do that since we assume stricter resource control is required when cgroups
are in use. This also keeps "effective" clamp values updated in case we need
to expose them to user-space.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dab64247597..3ca054ad3f3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+ struct task_group *tg = &root_task_group;
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+ rcu_read_lock();
+ cpu_util_update_eff(&root_task_group.css);
+ rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
{
+ bool update_root_tg = false;
int old_min, old_max;
int result;

@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
uclamp_se_set(&uclamp_default[UCLAMP_MIN],
sysctl_sched_uclamp_util_min, false);
+ update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
uclamp_se_set(&uclamp_default[UCLAMP_MAX],
sysctl_sched_uclamp_util_max, false);
+ update_root_tg = true;
}

+ if (update_root_tg)
+ uclamp_update_root_tg();
+
/*
- * Updating all the RUNNABLE task is expensive, keep it simple and do
- * just a lazy update at each next enqueue time.
+ * We update all RUNNABLE tasks only when task groups are in use.
+ * Otherwise, keep it simple and do just a lazy update at each next
+ * task enqueue time.
*/
+
goto done;

undo:
--
2.22.0

2019-08-22 15:15:29

by Patrick Bellasi

Subject: [PATCH v14 6/6] sched/core: uclamp: always use enum uclamp_id for clamp_id values

The supported clamp indexes are defined in enum uclamp_id. However, because
of the code logic in some of the first versions of the utilization clamping
series, we sometimes needed to use unsigned int to represent the indexes.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 38 +++++++++++++++++++-------------------
kernel/sched/sched.h | 2 +-
2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc2dc86a2abe..269c14ad4473 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
{
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
}

static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
unsigned int clamp_value)
{
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
}

-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
unsigned int clamp_value)
{
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
}

static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
- unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int clamp_value)
{
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
}

static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
#ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
}

-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_eff;

@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
* for each bucket when all its RUNNABLE tasks require the same clamp.
*/
static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- unsigned int clamp_id)
+ enum uclamp_id clamp_id)
{
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
* enforce the expected state and warn.
*/
static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- unsigned int clamp_id)
+ enum uclamp_id clamp_id)
{
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)

static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1044,7 +1044,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
}

static inline void
-uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
{
struct rq_flags rf;
struct rq *rq;
@@ -1080,9 +1080,9 @@ static inline void
uclamp_update_active_tasks(struct cgroup_subsys_state *css,
unsigned int clamps)
{
+ enum uclamp_id clamp_id;
struct css_task_iter it;
struct task_struct *p;
- unsigned int clamp_id;

css_task_iter_start(css, 0, &it);
while ((p = css_task_iter_next(&it))) {
@@ -1190,7 +1190,7 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

/*
* On scheduling class change, reset to default clamps for tasks
@@ -1227,7 +1227,7 @@ static void __setscheduler_uclamp(struct task_struct *p,

static void uclamp_fork(struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;
@@ -1249,7 +1249,7 @@ static void uclamp_fork(struct task_struct *p)
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;
int cpu;

mutex_init(&uclamp_mutex);
@@ -6924,7 +6924,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
#ifdef CONFIG_UCLAMP_TASK_GROUP
- int clamp_id;
+ enum uclamp_id clamp_id;

for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
@@ -7182,7 +7182,7 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
struct uclamp_se *uc_parent = NULL;
struct uclamp_se *uc_se = NULL;
unsigned int eff[UCLAMP_CNT];
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;
unsigned int clamps;

css_for_each_descendant_pre(css, top_css) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5b343112a47b..00ff5b57e9cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2281,7 +2281,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

#ifdef CONFIG_UCLAMP_TASK
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id);
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

static __always_inline
unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
--
2.22.0

2019-08-22 17:10:10

by Patrick Bellasi

Subject: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task at.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller, which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization

- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups. This system wide
interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus making it possible to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.
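As a reference for the interface semantics, the requested values are
percentages which are internally converted to the utilization scale. A
minimal sketch of that conversion (mirroring capacity_from_percent() in the
patch below, and assuming SCHED_CAPACITY_SCALE = 1024):

/*
 * Sketch only: a percentage expressed with two decimals (e.g. 2500 for
 * "25.00") maps to 2500 * 1024 / 10000 = 256 on the capacity scale.
 */
static unsigned long util_from_percent(unsigned long percent_x100)
{
	return DIV_ROUND_CLOSEST_ULL((u64)percent_x100 << SCHED_CAPACITY_SHIFT,
				     100 * 100);
}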

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v14:
Message-ID: <[email protected]>
- move uclamp_mutex usage here from the following patch
---
Documentation/admin-guide/cgroup-v2.rst | 34 +++++
init/Kconfig | 22 +++
kernel/sched/core.c | 188 +++++++++++++++++++++++-
kernel/sched/sched.h | 8 +
4 files changed, 248 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5f1c266131b0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for details.

+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max". i.e. no utilization capping
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650d4a99..ac285cfa78b6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -928,6 +928,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, an eventually
+ specified task-specific clamp value is constrained by the cgroup
+ specified clamp value. Both minimum and maximum task clamping cannot
+ be bigger than the corresponding clamping defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6661852907b..7b610e1a4cda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;

@@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
loff_t *ppos)
{
int old_min, old_max;
- static DEFINE_MUTEX(mutex);
int result;

- mutex_lock(&mutex);
+ mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;

@@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
done:
- mutex_unlock(&mutex);
+ mutex_unlock(&uclamp_mutex);

return result;
}
@@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1149,8 +1162,12 @@ static void __init init_uclamp(void)

/* System defaults allow max clamp values for both indexes */
uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+ }
}

#else /* CONFIG_UCLAMP_TASK */
@@ -6798,6 +6815,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&tg->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
+#endif
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6821,6 +6851,8 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ alloc_uclamp_sched_group(tg, parent);
+
return tg;

err:
@@ -7037,6 +7069,126 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+
+#define _POW10(exp) ((unsigned int)1e##exp)
+#define POW10(exp) _POW10(exp)
+
+struct uclamp_request {
+#define UCLAMP_PERCENT_SHIFT 2
+#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
+ s64 percent;
+ u64 util;
+ int ret;
+};
+
+static inline struct uclamp_request
+capacity_from_percent(char *buf)
+{
+ struct uclamp_request req = {
+ .percent = UCLAMP_PERCENT_SCALE,
+ .util = SCHED_CAPACITY_SCALE,
+ .ret = 0,
+ };
+
+ buf = strim(buf);
+ if (strncmp("max", buf, 4)) {
+ req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
+ &req.percent);
+ if (req.ret)
+ return req;
+ if (req.percent > UCLAMP_PERCENT_SCALE) {
+ req.ret = -ERANGE;
+ return req;
+ }
+
+ req.util = req.percent << SCHED_CAPACITY_SHIFT;
+ req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
+ }
+
+ return req;
+}
+
+static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off,
+ enum uclamp_id clamp_id)
+{
+ struct uclamp_request req;
+ struct task_group *tg;
+
+ req = capacity_from_percent(buf);
+ if (req.ret)
+ return req.ret;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(of_css(of));
+ if (tg->uclamp_req[clamp_id].value != req.util)
+ uclamp_se_set(&tg->uclamp_req[clamp_id], req.util, false);
+
+ /*
+ * Because of not recoverable conversion rounding we keep track of the
+ * exact requested value
+ */
+ tg->uclamp_pct[clamp_id] = req.percent;
+
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return nbytes;
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MIN);
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MAX);
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+ u64 percent;
+ u32 rem;
+
+ rcu_read_lock();
+ tg = css_tg(seq_css(sf));
+ util_clamp = tg->uclamp_req[clamp_id].value;
+ rcu_read_unlock();
+
+ if (util_clamp == SCHED_CAPACITY_SCALE) {
+ seq_puts(sf, "max\n");
+ return;
+ }
+
+ percent = tg->uclamp_pct[clamp_id];
+ percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
+ seq_printf(sf, "%llu.%0*u\n", percent, UCLAMP_PERCENT_SHIFT, rem);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MIN);
+ return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MAX);
+ return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7381,6 +7533,20 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
#endif
{ } /* Terminate */
};
@@ -7548,6 +7714,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7111e3a1eeb4..ae1be61fb279 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,14 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* The two decimal precision [%] value requested from user-space */
+ unsigned int uclamp_pct[UCLAMP_CNT];
+ /* Clamp values requested for a task group */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.22.0

2019-08-22 17:11:18

by Patrick Bellasi

Subject: [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update its RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity mask
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04fc161e4dbe..fc2dc86a2abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(rq, p, clamp_id);
+ uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+ unsigned int clamps)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+ unsigned int clamp_id;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it))) {
+ for_each_clamp_id(clamp_id) {
+ if ((0x1 << clamp_id) & clamps)
+ uclamp_update_active(p, clamp_id);
+ }
+ }
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css);
static void uclamp_update_root_tg(void)
@@ -7160,8 +7211,13 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
- if (!clamps)
+ if (!clamps) {
css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamps);
}
}

--
2.22.0

2019-08-22 18:46:59

by Patrick Bellasi

Subject: [PATCH v14 2/6] sched/core: uclamp: Propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still locally consistent and constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to the descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation time to ensure user-space writes
never fail while still always tracking the most restrictive values.
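
A worked example of the propagation described above (percent values are
purely illustrative):

  parent effective:   min = 60, max = 70
  child requested:    min = 80, max = 50

  child effective max = min(50, 70) = 50
  child effective min = min(80, 60) = 60, then capped by the limit:
  child effective min = min(60, 50) = 50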

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 ++
2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b610e1a4cda..8dab64247597 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
#ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+ root_task_group.uclamp[clamp_id] = uc_max;
#endif
}
}
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
#endif
}
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_parent = NULL;
+ struct uclamp_se *uc_se = NULL;
+ unsigned int eff[UCLAMP_CNT];
+ unsigned int clamp_id;
+ unsigned int clamps;
+
+ css_for_each_descendant_pre(css, top_css) {
+ uc_parent = css_tg(css)->parent
+ ? css_tg(css)->parent->uclamp : NULL;
+
+ for_each_clamp_id(clamp_id) {
+ /* Assume effective clamps matches requested clamps */
+ eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+ /* Cap effective clamps with parent's effective clamps */
+ if (uc_parent &&
+ eff[clamp_id] > uc_parent[clamp_id].value) {
+ eff[clamp_id] = uc_parent[clamp_id].value;
+ }
+ }
+ /* Ensure protection is always capped by limit */
+ eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+ /* Propagate most restrictive effective clamps */
+ clamps = 0x0;
+ uc_se = css_tg(css)->uclamp;
+ for_each_clamp_id(clamp_id) {
+ if (eff[clamp_id] == uc_se[clamp_id].value)
+ continue;
+ uc_se[clamp_id].value = eff[clamp_id];
+ uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+ clamps |= (0x1 << clamp_id);
+ }
+ if (!clamps)
+ css = css_rightmost_descendant(css);
+ }
+}

#define _POW10(exp) ((unsigned int)1e##exp)
#define POW10(exp) _POW10(exp)
@@ -7133,6 +7174,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
*/
tg->uclamp_pct[clamp_id] = req.percent;

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(of_css(of));
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae1be61fb279..5b343112a47b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
unsigned int uclamp_pct[UCLAMP_CNT];
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a task group */
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

};
--
2.22.0

2019-08-22 18:47:29

by Patrick Bellasi

Subject: [PATCH v14 4/6] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}queued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_eff_get() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This makes
it possible to:

1. ensure cgroup clamps are always used to restrict task specific requests,
i.e. a task is boosted no more than its TG effective protection and capped
at least as much as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
less than what is enforced by their TG effective limits and protections.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. For tasks in the root group or in an autogroup, system defaults
are still enforced instead.
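
A short illustration of the aggregation rule described above (values are
purely illustrative):

  TG effective uclamp.min = 50%

  task A requests util_min = 80%  ->  gets 50% (restricted by the TG)
  task B requests util_min = 30%  ->  gets 30% ("nice-like": less is allowed)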

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ca054ad3f3e..04fc161e4dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ * group or in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];

/* System default restrictions always apply */
--
2.22.0

2019-08-30 09:47:31

by Peter Zijlstra

Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

On Thu, Aug 22, 2019 at 02:28:06PM +0100, Patrick Bellasi wrote:
> +#define _POW10(exp) ((unsigned int)1e##exp)
> +#define POW10(exp) _POW10(exp)

What is this magic? You're forcing a float literal into an integer.
Surely that deserves a comment!

> +struct uclamp_request {
> +#define UCLAMP_PERCENT_SHIFT 2
> +#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
> + s64 percent;
> + u64 util;
> + int ret;
> +};
> +
> +static inline struct uclamp_request
> +capacity_from_percent(char *buf)
> +{
> + struct uclamp_request req = {
> + .percent = UCLAMP_PERCENT_SCALE,
> + .util = SCHED_CAPACITY_SCALE,
> + .ret = 0,
> + };
> +
> + buf = strim(buf);
> + if (strncmp("max", buf, 4)) {

That is either a bug, and you meant to write: strncmp(buf, "max", 3), or
it is not, and then you could've written: strcmp(buf, "max")

But as written it doesn't make sense.

> + req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
> + &req.percent);
> + if (req.ret)
> + return req;
> + if (req.percent > UCLAMP_PERCENT_SCALE) {
> + req.ret = -ERANGE;
> + return req;
> + }
> +
> + req.util = req.percent << SCHED_CAPACITY_SHIFT;
> + req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
> + }
> +
> + return req;
> +}

2019-08-30 09:51:05

by Peter Zijlstra

Subject: Re: [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On Thu, Aug 22, 2019 at 02:28:10PM +0100, Patrick Bellasi wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 04fc161e4dbe..fc2dc86a2abe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> uclamp_rq_dec_id(rq, p, clamp_id);
> }
>
> +static inline void
> +uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + /*
> + * Lock the task and the rq where the task is (or was) queued.
> + *
> + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
> + * price to pay to safely serialize util_{min,max} updates with
> + * enqueues, dequeues and migration operations.
> + * This is the same locking schema used by __set_cpus_allowed_ptr().
> + */
> + rq = task_rq_lock(p, &rf);

Since modifying cgroup parameters is priv only, this should be OK I
suppose. Priv can already DoS the system anyway.

> + /*
> + * Setting the clamp bucket is serialized by task_rq_lock().
> + * If the task is not yet RUNNABLE and its task_struct is not
> + * affecting a valid clamp bucket, the next time it's enqueued,
> + * it will already see the updated clamp bucket value.
> + */
> + if (!p->uclamp[clamp_id].active)
> + goto done;
> +
> + uclamp_rq_dec_id(rq, p, clamp_id);
> + uclamp_rq_inc_id(rq, p, clamp_id);
> +
> +done:

I'm thinking that:

if (p->uclamp[clamp_id].active) {
uclamp_rq_dec_id(rq, p, clamp_id);
uclamp_rq_inc_id(rq, p, clamp_id);
}

was too obvious? ;-)

> +
> + task_rq_unlock(rq, p, &rf);
> +}

2019-09-02 06:57:48

by Patrick Bellasi

Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller


On Fri, Aug 30, 2019 at 09:45:05 +0000, Peter Zijlstra wrote...

> On Thu, Aug 22, 2019 at 02:28:06PM +0100, Patrick Bellasi wrote:
>> +#define _POW10(exp) ((unsigned int)1e##exp)
>> +#define POW10(exp) _POW10(exp)
>
> What is this magic? You're forcing a float literal into an integer.
> Surely that deserves a comment!

Yes, I'm introducing the two constants:
UCLAMP_PERCENT_SHIFT,
UCLAMP_PERCENT_SCALE
similar to what we have for CAPACITY. Moreover, I need both 100*100 (for
the scale) and 100 further down in the code for the:

percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);

used in cpu_uclamp_print().

That's why adding compile-time support to compute 10^N is useful.

C provides the "1eN" literal, I just convert it to integer and to do
that at compile time I need a two level macros.

What if I add this comment just above the macro definitions:

/*
* Integer 10^N with a given N exponent by casting to integer the literal "1eN"
* C expression. Since there is no way to convert a macro argument (N) into a
* character constant, use two levels of macros.
*/

is this clear enough?
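
For illustration, the expansion goes:

  POW10(2) -> _POW10(2) -> (unsigned int)1e2 == 100
  UCLAMP_PERCENT_SCALE == 100 * POW10(UCLAMP_PERCENT_SHIFT) == 100 * 100 == 10000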

>
>> +struct uclamp_request {
>> +#define UCLAMP_PERCENT_SHIFT 2
>> +#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
>> + s64 percent;
>> + u64 util;
>> + int ret;
>> +};
>> +
>> +static inline struct uclamp_request
>> +capacity_from_percent(char *buf)
>> +{
>> + struct uclamp_request req = {
>> + .percent = UCLAMP_PERCENT_SCALE,
>> + .util = SCHED_CAPACITY_SCALE,
>> + .ret = 0,
>> + };
>> +
>> + buf = strim(buf);
>> + if (strncmp("max", buf, 4)) {
>
> That is either a bug, and you meant to write: strncmp(buf, "max", 3),
> or it is not, and then you could've written: strcmp(buf, "max")

I don't think it's a bug.

The usage of 4 is intentional, to force a '\0' check while using
strncmp(). Otherwise, strncmp(buf, "max", 3) would also accept strings
starting with "max", which we don't want.

> But as written it doesn't make sense.

The code is safe, but I agree that strcmp() does just the same without
generating confusion. That's actually a pretty good example of how it's
not always better to use strncmp() instead of strcmp().

Cheers,
Patrick

2019-09-02 07:00:05

by Patrick Bellasi

Subject: Re: [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes


On Fri, Aug 30, 2019 at 09:48:34 +0000, Peter Zijlstra wrote...

> On Thu, Aug 22, 2019 at 02:28:10PM +0100, Patrick Bellasi wrote:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 04fc161e4dbe..fc2dc86a2abe 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>> uclamp_rq_dec_id(rq, p, clamp_id);
>> }
>>
>> +static inline void
>> +uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
>> +{
>> + struct rq_flags rf;
>> + struct rq *rq;
>> +
>> + /*
>> + * Lock the task and the rq where the task is (or was) queued.
>> + *
>> + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
>> + * price to pay to safely serialize util_{min,max} updates with
>> + * enqueues, dequeues and migration operations.
>> + * This is the same locking schema used by __set_cpus_allowed_ptr().
>> + */
>> + rq = task_rq_lock(p, &rf);
>
> Since modifying cgroup parameters is priv only, this should be OK I
> suppose. Priv can already DoS the system anyway.

Are you referring to the possibility of DoS'ing the scheduler by repeatedly
writing cgroup attributes?

Because, in that case, I think cgroup attributes could also be written by
non-privileged users. It all depends on how they are mounted and how
permissions are set. Isn't it?

Anyway, I'm not sure we can fix that here... and in principle we could
have that DoS by setting CPU affinities, which is user exposed.
Isn't it?

>> + /*
>> + * Setting the clamp bucket is serialized by task_rq_lock().
>> + * If the task is not yet RUNNABLE and its task_struct is not
>> + * affecting a valid clamp bucket, the next time it's enqueued,
>> + * it will already see the updated clamp bucket value.
>> + */
>> + if (!p->uclamp[clamp_id].active)
>> + goto done;
>> +
>> + uclamp_rq_dec_id(rq, p, clamp_id);
>> + uclamp_rq_inc_id(rq, p, clamp_id);
>> +
>> +done:
>
> I'm thinking that:
>
> if (p->uclamp[clamp_id].active) {
> uclamp_rq_dec_id(rq, p, clamp_id);
> uclamp_rq_inc_id(rq, p, clamp_id);
> }
>
> was too obvious? ;-)

Yep, right... I think there was some more code in prev versions but I
forgot to get rid of that "goto" pattern after some change.

>> +
>> + task_rq_unlock(rq, p, &rf);
>> +}

Cheers,
Patrick

2019-09-02 07:41:35

by Peter Zijlstra

Subject: Re: [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On Mon, Sep 02, 2019 at 07:44:40AM +0100, Patrick Bellasi wrote:
> On Fri, Aug 30, 2019 at 09:48:34 +0000, Peter Zijlstra wrote...
> > On Thu, Aug 22, 2019 at 02:28:10PM +0100, Patrick Bellasi wrote:

> >> + rq = task_rq_lock(p, &rf);
> >
> > Since modifying cgroup parameters is priv only, this should be OK I
> > suppose. Priv can already DoS the system anyway.
>
> Are you referring to the possibility to DoS the scheduler by keep
> writing cgroup attributes?

Yep.

> Because, in that case I think cgroup attributes could be written also by
> non priv users. It all depends on how they are mounted and permissions
> are set. Isn't it?
>
> Anyway, I'm not sure we can fix that here... and in principle we could
> have that DoS by setting CPUs affinities, which is user exposed.
> Isn't it?

Only for a single task; by using the cgroup thing we have that in-kernel
iteration of tasks.

The thing I worry about is bouncing rq->lock around the system; but
yeah, I suppose a normal user could achieve something similar with
enough tasks.

> >> + /*
> >> + * Setting the clamp bucket is serialized by task_rq_lock().
> >> + * If the task is not yet RUNNABLE and its task_struct is not
> >> + * affecting a valid clamp bucket, the next time it's enqueued,
> >> + * it will already see the updated clamp bucket value.
> >> + */
> >> + if (!p->uclamp[clamp_id].active)
> >> + goto done;
> >> +
> >> + uclamp_rq_dec_id(rq, p, clamp_id);
> >> + uclamp_rq_inc_id(rq, p, clamp_id);
> >> +
> >> +done:
> >
> > I'm thinking that:
> >
> > if (p->uclamp[clamp_id].active) {
> > uclamp_rq_dec_id(rq, p, clamp_id);
> > uclamp_rq_inc_id(rq, p, clamp_id);
> > }
> >
> > was too obvious? ;-)
>
> Yep, right... I think there was some more code in prev versions but I
> forgot to get rid of that "goto" pattern after some change.

OK, already fixed that.

2019-09-02 07:49:55

by Peter Zijlstra

Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

On Mon, Sep 02, 2019 at 07:38:53AM +0100, Patrick Bellasi wrote:
>
> On Fri, Aug 30, 2019 at 09:45:05 +0000, Peter Zijlstra wrote...
>
> > On Thu, Aug 22, 2019 at 02:28:06PM +0100, Patrick Bellasi wrote:
> >> +#define _POW10(exp) ((unsigned int)1e##exp)
> >> +#define POW10(exp) _POW10(exp)
> >
> > What is this magic? You're forcing a float literal into an integer.
> > Surely that deserves a comment!
>
> Yes, I'm introducing the two constants:
> UCLAMP_PERCENT_SHIFT,
> UCLAMP_PERCENT_SCALE
> similar to what we have for CAPACITY. Moreover, I need both 100*100 (for
> the scale) and 100 further down in the code for the:

Ooh, right you are. I clearly was in need of a weekend. Somehow I read
that code as if you were forcing the float representation into an
integer, which is not what you do.

> percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
>
> used in cpu_uclamp_print().
>
> That's why adding a compile time support to compute a 10^N is useful.
>
> C provides the "1eN" literal, I just convert it to integer and to do
> that at compile time I need a two level macros.
>
> What if I add this comment just above the macro definitions:
>
> /*
> * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
> * C expression. Since there is no way to convert a macro argument (N) into a
> * character constant, use two levels of macros.
> */
>
> is this clear enough?

Yeah, let me go add that.

> >
> >> +struct uclamp_request {
> >> +#define UCLAMP_PERCENT_SHIFT 2
> >> +#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
> >> + s64 percent;
> >> + u64 util;
> >> + int ret;
> >> +};
> >> +
> >> +static inline struct uclamp_request
> >> +capacity_from_percent(char *buf)
> >> +{
> >> + struct uclamp_request req = {
> >> + .percent = UCLAMP_PERCENT_SCALE,
> >> + .util = SCHED_CAPACITY_SCALE,
> >> + .ret = 0,
> >> + };
> >> +
> >> + buf = strim(buf);
> >> + if (strncmp("max", buf, 4)) {
> >
> > That is either a bug, and you meant to write: strncmp(buf, "max", 3),
> > or it is not, and then you could've written: strcmp(buf, "max")
>
> I don't think it's a bug.
>
> The usage of 4 is intentional, to force a '\0' check while using
> strncmp(). Otherwise, strncmp(buf, "max", 3) would accept also strings
> starting by "max", which we don't want.

Right; I figured.

> > But as written it doesn't make sense.
>
> The code is safe but I agree that strcmp() does just the same and it
> does not generate confusion. That's actually a pretty good example
> on how it's not always better to use strncmp() instead of strcmp().

OK, I made it strcmp(), because that is what I figured was the intended
semantics.

2019-09-02 23:05:08

by Suren Baghdasaryan

Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

Hi Patrick,

On Thu, Aug 22, 2019 at 6:28 AM Patrick Bellasi <[email protected]> wrote:
>
> The cgroup CPU bandwidth controller allows to assign a specified
> (maximum) bandwidth to the tasks of a group. However this bandwidth is
> defined and enforced only on a temporal base, without considering the
> actual frequency a CPU is running on. Thus, the amount of computation
> completed by a task within an allocated bandwidth can be very different
> depending on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.
>
> With the availability of schedutil, the scheduler is now able
> to drive frequency selections based on actual task utilization.
> Moreover, the utilization clamping support provides a mechanism to
> bias the frequency selection operated by schedutil depending on
> constraints assigned to the tasks currently RUNNABLE on a CPU.
>
> Giving the mechanisms described above, it is now possible to extend the
> cpu controller to specify the minimum (or maximum) utilization which
> should be considered for tasks RUNNABLE on a cpu.
> This makes it possible to better defined the actual computational
> power assigned to task groups, thus improving the cgroup CPU bandwidth
> controller which is currently based just on time constraints.
>
> Extend the CPU controller with a couple of new attributes uclamp.{min,max}
> which allow to enforce utilization boosting and capping for all the
> tasks in a group.
>
> Specifically:
>
> - uclamp.min: defines the minimum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run at least at a
> minimum frequency which corresponds to the uclamp.min
> utilization
>
> - uclamp.max: defines the maximum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run up to a
> maximum frequency which corresponds to the uclamp.max
> utilization
>
> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
> hierarchies, while system wide clamps are defined by a generic
> interface which does not depend on cgroups. This system wide
> interface enforces constraints on tasks in the root node.
>
> b) enforce effective constraints at each level of the hierarchy which
> are a restriction of the group requests considering its parent's
> effective constraints. Root group effective constraints are defined
> by the system wide interface.
> This mechanism allows each (non-root) level of the hierarchy to:
> - request whatever clamp values it would like to get
> - effectively get only up to the maximum amount allowed by its parent
>
> c) have higher priority than task-specific clamps, defined via
> sched_setattr(), thus allowing to control and restrict task requests.
>
> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Keep it simple by not caring now about "effective" values computation
> and propagation along the hierarchy.
>
> Update sysctl_sched_uclamp_handler() to use the newly introduced
> uclamp_mutex so that we serialize system default updates with cgroup
> related updates.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Reviewed-by: Michal Koutny <[email protected]>
> Acked-by: Tejun Heo <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
>
> ---
> Changes in v14:
> Message-ID: <[email protected]>
> - move uclamp_mutex usage here from the following patch
> ---
> Documentation/admin-guide/cgroup-v2.rst | 34 +++++
> init/Kconfig | 22 +++
> kernel/sched/core.c | 188 +++++++++++++++++++++++-
> kernel/sched/sched.h | 8 +
> 4 files changed, 248 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3b29005aa981..5f1c266131b0 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
> normal scheduling policy and absolute bandwidth allocation model for
> realtime scheduling policy.
>
> +In all the above models, cycles distribution is defined only on a temporal
> +base and it does not account for the frequency at which tasks are executed.
> +The (optional) utilization clamping support allows to hint the schedutil
> +cpufreq governor about the minimum desired frequency which should always be
> +provided by a CPU, as well as the maximum desired frequency, which should not
> +be exceeded by a CPU.
> +
> WARNING: cgroup2 doesn't yet support control of realtime processes and
> the cpu controller can only be enabled when all RT processes are in
> the root cgroup. Be aware that system management software may already
> @@ -1016,6 +1023,33 @@ All time durations are in microseconds.
> Shows pressure stall information for CPU. See
> Documentation/accounting/psi.rst for details.
>
> + cpu.uclamp.min
> + A read-write single value file which exists on non-root cgroups.
> + The default is "0", i.e. no utilization boosting.
> +
> + The requested minimum utilization (protection) as a percentage
> + rational number, e.g. 12.34 for 12.34%.
> +
> + This interface allows reading and setting minimum utilization clamp
> + values similar to the sched_setattr(2). This minimum utilization
> + value is used to clamp the task specific minimum utilization clamp.
> +
> + The requested minimum utilization (protection) is always capped by
> + the current value for the maximum utilization (limit), i.e.
> + `cpu.uclamp.max`.
> +
> + cpu.uclamp.max
> + A read-write single value file which exists on non-root cgroups.
> + The default is "max". i.e. no utilization capping
> +
> + The requested maximum utilization (limit) as a percentage rational
> + number, e.g. 98.76 for 98.76%.
> +
> + This interface allows reading and setting maximum utilization clamp
> + values similar to the sched_setattr(2). This maximum utilization
> + value is used to clamp the task specific maximum utilization clamp.
> +
> +
>
> Memory
> ------
> diff --git a/init/Kconfig b/init/Kconfig
> index bd7d650d4a99..ac285cfa78b6 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -928,6 +928,28 @@ config RT_GROUP_SCHED
>
> endif #CGROUP_SCHED
>
> +config UCLAMP_TASK_GROUP
> + bool "Utilization clamping per group of tasks"
> + depends on CGROUP_SCHED
> + depends on UCLAMP_TASK
> + default n
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> + When this option is enabled, the user can specify a min and max
> + CPU bandwidth which is allowed for each single task in a group.
> + The max bandwidth allows to clamp the maximum frequency a task
> + can use, while the min bandwidth allows to define a minimum
> + frequency a task will always use.
> +
> + When task group based utilization clamping is enabled, an eventually
> + specified task-specific clamp value is constrained by the cgroup
> + specified clamp value. Both minimum and maximum task clamping cannot
> + be bigger than the corresponding clamping defined at task group level.
> +
> + If in doubt, say N.
> +
> config CGROUP_PIDS
> bool "PIDs controller"
> help
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a6661852907b..7b610e1a4cda 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
>
> #ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Serializes updates of utilization clamp values
> + *
> + * The (slow-path) user-space triggers utilization clamp value updates which
> + * can require updates on (fast-path) scheduler's data structures used to
> + * support enqueue/dequeue operations.
> + * While the per-CPU rq lock protects fast-path update operations, user-space
> + * requests are serialized using a mutex to reduce the risk of conflicting
> + * updates or API abuses.
> + */
> +static DEFINE_MUTEX(uclamp_mutex);
> +
> /* Max allowed minimum utilization */
> unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>
> @@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> loff_t *ppos)
> {
> int old_min, old_max;
> - static DEFINE_MUTEX(mutex);
> int result;
>
> - mutex_lock(&mutex);
> + mutex_lock(&uclamp_mutex);
> old_min = sysctl_sched_uclamp_util_min;
> old_max = sysctl_sched_uclamp_util_max;
>
> @@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> sysctl_sched_uclamp_util_min = old_min;
> sysctl_sched_uclamp_util_max = old_max;
> done:
> - mutex_unlock(&mutex);
> + mutex_unlock(&uclamp_mutex);
>
> return result;
> }
> @@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
> unsigned int clamp_id;
> int cpu;
>
> + mutex_init(&uclamp_mutex);
> +
> for_each_possible_cpu(cpu) {
> memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> cpu_rq(cpu)->uclamp_flags = 0;
> @@ -1149,8 +1162,12 @@ static void __init init_uclamp(void)
>
> /* System defaults allow max clamp values for both indexes */
> uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
> - for_each_clamp_id(clamp_id)
> + for_each_clamp_id(clamp_id) {
> uclamp_default[clamp_id] = uc_max;
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + root_task_group.uclamp_req[clamp_id] = uc_max;
> +#endif
> + }
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> @@ -6798,6 +6815,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
> /* task_group_lock serializes the addition/removal of task groups */
> static DEFINE_SPINLOCK(task_group_lock);
>
> +static inline void alloc_uclamp_sched_group(struct task_group *tg,
> + struct task_group *parent)
> +{
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + int clamp_id;
> +
> + for_each_clamp_id(clamp_id) {
> + uclamp_se_set(&tg->uclamp_req[clamp_id],
> + uclamp_none(clamp_id), false);
> + }
> +#endif
> +}
> +
> static void sched_free_group(struct task_group *tg)
> {
> free_fair_sched_group(tg);
> @@ -6821,6 +6851,8 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> + alloc_uclamp_sched_group(tg, parent);
> +
> return tg;
>
> err:
> @@ -7037,6 +7069,126 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> sched_move_task(task);
> }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +
> +#define _POW10(exp) ((unsigned int)1e##exp)
> +#define POW10(exp) _POW10(exp)
> +
> +struct uclamp_request {
> +#define UCLAMP_PERCENT_SHIFT 2
> +#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
> + s64 percent;
> + u64 util;
> + int ret;
> +};
> +
> +static inline struct uclamp_request
> +capacity_from_percent(char *buf)
> +{
> + struct uclamp_request req = {
> + .percent = UCLAMP_PERCENT_SCALE,
> + .util = SCHED_CAPACITY_SCALE,
> + .ret = 0,
> + };
> +
> + buf = strim(buf);
> + if (strncmp("max", buf, 4)) {
> + req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
> + &req.percent);
> + if (req.ret)
> + return req;
> + if (req.percent > UCLAMP_PERCENT_SCALE) {
> + req.ret = -ERANGE;
> + return req;
> + }
> +
> + req.util = req.percent << SCHED_CAPACITY_SHIFT;
> + req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
> + }
> +
> + return req;
> +}
> +
> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off,
> + enum uclamp_id clamp_id)
> +{
> + struct uclamp_request req;
> + struct task_group *tg;
> +
> + req = capacity_from_percent(buf);
> + if (req.ret)
> + return req.ret;
> +
> + mutex_lock(&uclamp_mutex);
> + rcu_read_lock();
> +
> + tg = css_tg(of_css(of));
> + if (tg->uclamp_req[clamp_id].value != req.util)
> + uclamp_se_set(&tg->uclamp_req[clamp_id], req.util, false);
> +
> + /*
> + * Because of not recoverable conversion rounding we keep track of the
> + * exact requested value
> + */
> + tg->uclamp_pct[clamp_id] = req.percent;
> +
> + rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
> +
> + return nbytes;
> +}
> +
> +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MIN);
> +}
> +
> +static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MAX);
> +}
> +
> +static inline void cpu_uclamp_print(struct seq_file *sf,
> + enum uclamp_id clamp_id)
> +{
> + struct task_group *tg;
> + u64 util_clamp;
> + u64 percent;
> + u32 rem;
> +
> + rcu_read_lock();
> + tg = css_tg(seq_css(sf));
> + util_clamp = tg->uclamp_req[clamp_id].value;
> + rcu_read_unlock();
> +
> + if (util_clamp == SCHED_CAPACITY_SCALE) {
> + seq_puts(sf, "max\n");
> + return;
> + }
> +
> + percent = tg->uclamp_pct[clamp_id];

You are taking RCU lock when accessing tg->uclamp_req but not when
accessing tg->uclamp_pct. Is that intentional? Can tg be destroyed
under you?

> + percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
> + seq_printf(sf, "%llu.%0*u\n", percent, UCLAMP_PERCENT_SHIFT, rem);
> +}
> +
> +static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
> +{
> + cpu_uclamp_print(sf, UCLAMP_MIN);
> + return 0;
> +}
> +
> +static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
> +{
> + cpu_uclamp_print(sf, UCLAMP_MAX);
> + return 0;
> +}
> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 shareval)
> @@ -7381,6 +7533,20 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_rt_period_read_uint,
> .write_u64 = cpu_rt_period_write_uint,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "uclamp.min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = cpu_uclamp_min_show,
> + .write = cpu_uclamp_min_write,
> + },
> + {
> + .name = "uclamp.max",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = cpu_uclamp_max_show,
> + .write = cpu_uclamp_max_write,
> + },
> #endif
> { } /* Terminate */
> };
> @@ -7548,6 +7714,20 @@ static struct cftype cpu_files[] = {
> .seq_show = cpu_max_show,
> .write = cpu_max_write,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "uclamp.min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = cpu_uclamp_min_show,
> + .write = cpu_uclamp_min_write,
> + },
> + {
> + .name = "uclamp.max",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = cpu_uclamp_max_show,
> + .write = cpu_uclamp_max_write,
> + },
> #endif
> { } /* terminate */
> };
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7111e3a1eeb4..ae1be61fb279 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -391,6 +391,14 @@ struct task_group {
> #endif
>
> struct cfs_bandwidth cfs_bandwidth;
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + /* The two decimal precision [%] value requested from user-space */
> + unsigned int uclamp_pct[UCLAMP_CNT];
> + /* Clamp values requested for a task group */
> + struct uclamp_se uclamp_req[UCLAMP_CNT];
> +#endif
> +
> };
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.22.0
>
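
For reference, the percent-to-utilization conversion done by
capacity_from_percent() above, and the rounding loss which motivates
keeping the exact request in uclamp_pct, can be reproduced with a small
stand-alone sketch (not part of the patch; it assumes SCHED_CAPACITY_SHIFT
is 10, i.e. a capacity scale of 1024, and approximates
DIV_ROUND_CLOSEST_ULL() with a local helper):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)	/* 1024 */
#define UCLAMP_PERCENT_SHIFT	2
#define UCLAMP_PERCENT_SCALE	(100 * 100)			/* 100.00% -> 10000 */

/* Round-to-closest division, standing in for DIV_ROUND_CLOSEST_ULL() */
static unsigned long long div_closest(unsigned long long n, unsigned long long d)
{
	return (n + d / 2) / d;
}

int main(void)
{
	/* "12.34" parsed with two decimal places becomes 1234 */
	unsigned long long percent = 1234;

	/* Same arithmetic as capacity_from_percent() */
	unsigned long long util = div_closest(percent << SCHED_CAPACITY_SHIFT,
					      UCLAMP_PERCENT_SCALE);

	/* Reconstructing the percentage from util alone loses precision,
	 * which is why the patch stores the exact request in uclamp_pct */
	unsigned long long back = div_closest(util * UCLAMP_PERCENT_SCALE,
					      SCHED_CAPACITY_SCALE);

	printf("request 12.34%% -> util %llu -> back to %llu.%02llu%%\n",
	       util, back / 100, back % 100);	/* util 126, back 12.30% */
	return 0;
}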

Subject: [tip: sched/core] sched/uclamp: Update CPU's refcount on TG's clamp changes

The following commit has been merged into the sched/core branch of tip:

Commit-ID: babbe170e053c6ec2343751749995b7b9fd5fd2c
Gitweb: https://git.kernel.org/tip/babbe170e053c6ec2343751749995b7b9fd5fd2c
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:10 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:40 +02:00

sched/uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the cgroup_subsys_state (css) pointer to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity mask
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 55 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c32ac07..55a1c07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,54 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (p->uclamp[clamp_id].active) {
+ uclamp_rq_dec_id(rq, p, clamp_id);
+ uclamp_rq_inc_id(rq, p, clamp_id);
+ }
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+ unsigned int clamps)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+ unsigned int clamp_id;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it))) {
+ for_each_clamp_id(clamp_id) {
+ if ((0x1 << clamp_id) & clamps)
+ uclamp_update_active(p, clamp_id);
+ }
+ }
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css);
static void uclamp_update_root_tg(void)
@@ -7160,8 +7208,13 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
- if (!clamps)
+ if (!clamps) {
css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamps);
}
}
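
For reference, the "clamps" bitmask built above and consumed by
uclamp_update_active_tasks() follows a one-bit-per-clamp-index convention;
a minimal stand-alone sketch (not part of the patch):

#include <stdio.h>

enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

int main(void)
{
	unsigned int clamps = 0;

	/* cpu_util_update_eff(): record which clamp indexes actually changed */
	clamps |= (0x1 << UCLAMP_MAX);

	/* uclamp_update_active_tasks(): refresh only the changed clamp indexes */
	for (int clamp_id = 0; clamp_id < UCLAMP_CNT; clamp_id++) {
		if ((0x1 << clamp_id) & clamps)
			printf("refresh clamp_id %d on RUNNABLE tasks\n", clamp_id);
	}
	return 0;
}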

Subject: [tip: sched/core] sched/uclamp: Always use 'enum uclamp_id' for clamp_id values

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0413d7f33e60751570fd6c179546bde2f7d82dcb
Gitweb: https://git.kernel.org/tip/0413d7f33e60751570fd6c179546bde2f7d82dcb
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:11 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:40 +02:00

sched/uclamp: Always use 'enum uclamp_id' for clamp_id values

The supported clamp indexes are defined in 'enum uclamp_id'; however, because
of the code logic in some versions of the first utilization clamping series,
we sometimes needed to use 'unsigned int' to represent indices.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 38 +++++++++++++++++++-------------------
kernel/sched/sched.h | 2 +-
2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55a1c07..3c7b90b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
{
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
}

static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
unsigned int clamp_value)
{
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
}

-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
unsigned int clamp_value)
{
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
}

static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
- unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+ unsigned int clamp_value)
{
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
}

static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
#ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
}

-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
struct uclamp_se uc_eff;

@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
* for each bucket when all its RUNNABLE tasks require the same clamp.
*/
static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- unsigned int clamp_id)
+ enum uclamp_id clamp_id)
{
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
* enforce the expected state and warn.
*/
static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- unsigned int clamp_id)
+ enum uclamp_id clamp_id)
{
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)

static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1044,7 +1044,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
}

static inline void
-uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
{
struct rq_flags rf;
struct rq *rq;
@@ -1077,9 +1077,9 @@ static inline void
uclamp_update_active_tasks(struct cgroup_subsys_state *css,
unsigned int clamps)
{
+ enum uclamp_id clamp_id;
struct css_task_iter it;
struct task_struct *p;
- unsigned int clamp_id;

css_task_iter_start(css, 0, &it);
while ((p = css_task_iter_next(&it))) {
@@ -1187,7 +1187,7 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

/*
* On scheduling class change, reset to default clamps for tasks
@@ -1224,7 +1224,7 @@ static void __setscheduler_uclamp(struct task_struct *p,

static void uclamp_fork(struct task_struct *p)
{
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;

for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;
@@ -1246,7 +1246,7 @@ static void uclamp_fork(struct task_struct *p)
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;
int cpu;

mutex_init(&uclamp_mutex);
@@ -6921,7 +6921,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
#ifdef CONFIG_UCLAMP_TASK_GROUP
- int clamp_id;
+ enum uclamp_id clamp_id;

for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
@@ -7179,7 +7179,7 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
struct uclamp_se *uc_parent = NULL;
struct uclamp_se *uc_se = NULL;
unsigned int eff[UCLAMP_CNT];
- unsigned int clamp_id;
+ enum uclamp_id clamp_id;
unsigned int clamps;

css_for_each_descendant_pre(css, top_css) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5b34311..00ff5b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2281,7 +2281,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

#ifdef CONFIG_UCLAMP_TASK
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id);
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);

static __always_inline
unsigned int uclamp_util_with(struct rq *rq, unsigned int util,

Subject: [tip: sched/core] sched/uclamp: Use TG's clamps to restrict TASK's clamps

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3eac870a324728e5d17118888840dad70bcd37f3
Gitweb: https://git.kernel.org/tip/3eac870a324728e5d17118888840dad70bcd37f3
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:09 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:39 +02:00

sched/uclamp: Use TG's clamps to restrict TASK's clamps

When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}queued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
makes it possible to:

1. ensure cgroup clamps are always used to restrict task specific requests,
i.e. a task is boosted no more than its TG effective protection and is
capped at least as much as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
less than what is enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group; for tasks in the root group or in an autogroup, system defaults
are still enforced.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e6800fe..c32ac07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ * group or in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];

/* System default restrictions always apply */
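
For reference, the aggregation rule implemented by uclamp_tg_restrict()
above can be illustrated with a small stand-alone sketch (not part of the
patch; types are simplified and the values assume a capacity scale of 1024):

#include <stdio.h>
#include <stdbool.h>

struct uclamp_se_sketch {
	unsigned int value;
	bool user_defined;
};

/* Mirror of the uclamp_tg_restrict() rule: the TG clamp wins whenever the
 * task request exceeds it or the task never set a value explicitly */
static unsigned int tg_restrict(struct uclamp_se_sketch task,
				struct uclamp_se_sketch tg)
{
	if (task.value > tg.value || !task.user_defined)
		return tg.value;
	return task.value;
}

int main(void)
{
	struct uclamp_se_sketch tg_min = { .value = 512, .user_defined = false };

	/* Task asks for an 80% boost (819/1024) but its TG only protects 50% */
	struct uclamp_se_sketch big   = { .value = 819, .user_defined = true };
	/* Task asks for a 25% boost (256/1024): "nice-like", kept as requested */
	struct uclamp_se_sketch small = { .value = 256, .user_defined = true };

	printf("big task   -> %u\n", tg_restrict(big, tg_min));		/* 512 */
	printf("small task -> %u\n", tg_restrict(small, tg_min));	/* 256 */
	return 0;
}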

Subject: [tip: sched/core] sched/uclamp: Extend CPU's cgroup controller

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2480c093130f64ac3a410504fa8b3db1fc4b87ce
Gitweb: https://git.kernel.org/tip/2480c093130f64ac3a410504fa8b3db1fc4b87ce
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:06 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization

- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups. This system wide
interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 34 ++++-
init/Kconfig | 22 +++-
kernel/sched/core.c | 193 ++++++++++++++++++++++-
kernel/sched/sched.h | 8 +-
4 files changed, 253 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005..5f1c266 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for details.

+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization (protection) as a percentage
+ rational number, e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ The requested minimum utilization (protection) is always capped by
+ the current value for the maximum utilization (limit), i.e.
+ `cpu.uclamp.max`.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max". i.e. no utilization capping
+
+ The requested maximum utilization (limit) as a percentage rational
+ number, e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650..ac285cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -928,6 +928,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, an eventually
+ specified task-specific clamp value is constrained by the cgroup
+ specified clamp value. Both minimum and maximum task clamping cannot
+ be bigger than the corresponding clamping defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a666185..c186abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;

@@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
loff_t *ppos)
{
int old_min, old_max;
- static DEFINE_MUTEX(mutex);
int result;

- mutex_lock(&mutex);
+ mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;

@@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
done:
- mutex_unlock(&mutex);
+ mutex_unlock(&uclamp_mutex);

return result;
}
@@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1149,8 +1162,12 @@ static void __init init_uclamp(void)

/* System defaults allow max clamp values for both indexes */
uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+ }
}

#else /* CONFIG_UCLAMP_TASK */
@@ -6798,6 +6815,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&tg->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
+#endif
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6821,6 +6851,8 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ alloc_uclamp_sched_group(tg, parent);
+
return tg;

err:
@@ -7037,6 +7069,131 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+
+/*
+ * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
+ * C expression. Since there is no way to convert a macro argument (N) into a
+ * character constant, use two levels of macros.
+ */
+#define _POW10(exp) ((unsigned int)1e##exp)
+#define POW10(exp) _POW10(exp)
+
+struct uclamp_request {
+#define UCLAMP_PERCENT_SHIFT 2
+#define UCLAMP_PERCENT_SCALE (100 * POW10(UCLAMP_PERCENT_SHIFT))
+ s64 percent;
+ u64 util;
+ int ret;
+};
+
+static inline struct uclamp_request
+capacity_from_percent(char *buf)
+{
+ struct uclamp_request req = {
+ .percent = UCLAMP_PERCENT_SCALE,
+ .util = SCHED_CAPACITY_SCALE,
+ .ret = 0,
+ };
+
+ buf = strim(buf);
+ if (strcmp(buf, "max")) {
+ req.ret = cgroup_parse_float(buf, UCLAMP_PERCENT_SHIFT,
+ &req.percent);
+ if (req.ret)
+ return req;
+ if (req.percent > UCLAMP_PERCENT_SCALE) {
+ req.ret = -ERANGE;
+ return req;
+ }
+
+ req.util = req.percent << SCHED_CAPACITY_SHIFT;
+ req.util = DIV_ROUND_CLOSEST_ULL(req.util, UCLAMP_PERCENT_SCALE);
+ }
+
+ return req;
+}
+
+static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off,
+ enum uclamp_id clamp_id)
+{
+ struct uclamp_request req;
+ struct task_group *tg;
+
+ req = capacity_from_percent(buf);
+ if (req.ret)
+ return req.ret;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(of_css(of));
+ if (tg->uclamp_req[clamp_id].value != req.util)
+ uclamp_se_set(&tg->uclamp_req[clamp_id], req.util, false);
+
+ /*
+ * Because of not recoverable conversion rounding we keep track of the
+ * exact requested value
+ */
+ tg->uclamp_pct[clamp_id] = req.percent;
+
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return nbytes;
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MIN);
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ return cpu_uclamp_write(of, buf, nbytes, off, UCLAMP_MAX);
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+ u64 percent;
+ u32 rem;
+
+ rcu_read_lock();
+ tg = css_tg(seq_css(sf));
+ util_clamp = tg->uclamp_req[clamp_id].value;
+ rcu_read_unlock();
+
+ if (util_clamp == SCHED_CAPACITY_SCALE) {
+ seq_puts(sf, "max\n");
+ return;
+ }
+
+ percent = tg->uclamp_pct[clamp_id];
+ percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), &rem);
+ seq_printf(sf, "%llu.%0*u\n", percent, UCLAMP_PERCENT_SHIFT, rem);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MIN);
+ return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MAX);
+ return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7382,6 +7539,20 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
+#endif
{ } /* Terminate */
};

@@ -7549,6 +7720,20 @@ static struct cftype cpu_files[] = {
.write = cpu_max_write,
},
#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
+#endif
{ } /* terminate */
};

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7111e3a..ae1be61 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,14 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* The two decimal precision [%] value requested from user-space */
+ unsigned int uclamp_pct[UCLAMP_CNT];
+ /* Clamp values requested for a task group */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED

Subject: [tip: sched/core] sched/uclamp: Propagate system defaults to the root group

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d
Gitweb: https://git.kernel.org/tip/7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:08 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate system defaults to the root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.

- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not in use.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

- system defaults: which define the maximum possible clamp values
usable by tasks.

- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information related
to the values they "requested".

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update the next time
each task is enqueued.
Do that since we assume stricter resource control is required when
cgroups are in use. This also keeps "effective" clamp values updated in
case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8855481..e6800fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+ struct task_group *tg = &root_task_group;
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+ rcu_read_lock();
+ cpu_util_update_eff(&root_task_group.css);
+ rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
{
+ bool update_root_tg = false;
int old_min, old_max;
int result;

@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
uclamp_se_set(&uclamp_default[UCLAMP_MIN],
sysctl_sched_uclamp_util_min, false);
+ update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
uclamp_se_set(&uclamp_default[UCLAMP_MAX],
sysctl_sched_uclamp_util_max, false);
+ update_root_tg = true;
}

+ if (update_root_tg)
+ uclamp_update_root_tg();
+
/*
- * Updating all the RUNNABLE task is expensive, keep it simple and do
- * just a lazy update at each next enqueue time.
+ * We update all RUNNABLE tasks only when task groups are in use.
+ * Otherwise, keep it simple and do just a lazy update at each next
+ * task enqueue time.
*/
+
goto done;

undo:

Subject: [tip: sched/core] sched/uclamp: Propagate parent clamps

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Gitweb: https://git.kernel.org/tip/0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Author: Patrick Bellasi <[email protected]>
AuthorDate: Thu, 22 Aug 2019 14:28:07 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still locally consistent and constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation time to ensure user-space writes
never fail while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Michal Koutny <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 2 ++-
2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c186abe..8855481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
#ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+ root_task_group.uclamp[clamp_id] = uc_max;
#endif
}
}
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
#endif
}
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_parent = NULL;
+ struct uclamp_se *uc_se = NULL;
+ unsigned int eff[UCLAMP_CNT];
+ unsigned int clamp_id;
+ unsigned int clamps;
+
+ css_for_each_descendant_pre(css, top_css) {
+ uc_parent = css_tg(css)->parent
+ ? css_tg(css)->parent->uclamp : NULL;
+
+ for_each_clamp_id(clamp_id) {
+ /* Assume effective clamps matches requested clamps */
+ eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+ /* Cap effective clamps with parent's effective clamps */
+ if (uc_parent &&
+ eff[clamp_id] > uc_parent[clamp_id].value) {
+ eff[clamp_id] = uc_parent[clamp_id].value;
+ }
+ }
+ /* Ensure protection is always capped by limit */
+ eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+ /* Propagate most restrictive effective clamps */
+ clamps = 0x0;
+ uc_se = css_tg(css)->uclamp;
+ for_each_clamp_id(clamp_id) {
+ if (eff[clamp_id] == uc_se[clamp_id].value)
+ continue;
+ uc_se[clamp_id].value = eff[clamp_id];
+ uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+ clamps |= (0x1 << clamp_id);
+ }
+ if (!clamps)
+ css = css_rightmost_descendant(css);
+ }
+}

/*
* Integer 10^N with a given N exponent by casting to integer the literal "1eN"
@@ -7138,6 +7179,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
*/
tg->uclamp_pct[clamp_id] = req.percent;

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(of_css(of));
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae1be61..5b34311 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
unsigned int uclamp_pct[UCLAMP_CNT];
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a task group */
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

};
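
For reference, the per-group step performed by cpu_util_update_eff() above
(cap the request by the parent's effective value, then cap the protection
by the limit) can be illustrated with a small stand-alone sketch (not part
of the patch; values assume a capacity scale of 1024):

#include <stdio.h>

#define UMIN 0
#define UMAX 1

/* One propagation step of cpu_util_update_eff() for a single group */
static void update_eff(const unsigned int req[2], const unsigned int parent[2],
		       unsigned int eff[2])
{
	for (int id = 0; id < 2; id++) {
		eff[id] = req[id];
		if (eff[id] > parent[id])	/* cap by parent's effective clamp */
			eff[id] = parent[id];
	}
	if (eff[UMIN] > eff[UMAX])		/* protection never exceeds the limit */
		eff[UMIN] = eff[UMAX];
}

int main(void)
{
	unsigned int parent_eff[2] = { 1024, 512 };	/* min 100%, max 50% */
	unsigned int child_req[2]  = {  717, 819 };	/* asks min 70%, max 80% */
	unsigned int child_eff[2];

	update_eff(child_req, parent_eff, child_eff);
	printf("child effective: min=%u max=%u\n",
	       child_eff[UMIN], child_eff[UMAX]);	/* min=512 max=512 */
	return 0;
}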

2019-09-03 08:54:09

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

On Mon, Sep 02, 2019 at 04:02:57PM -0700, Suren Baghdasaryan <[email protected]> wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > + enum uclamp_id clamp_id)
> > [...]
> > + rcu_read_lock();
> > + tg = css_tg(seq_css(sf));
> > + util_clamp = tg->uclamp_req[clamp_id].value;
> > + rcu_read_unlock();
> > +
> > + if (util_clamp == SCHED_CAPACITY_SCALE) {
> > + seq_puts(sf, "max\n");
> > + return;
> > + }
> > +
> > + percent = tg->uclamp_pct[clamp_id];
>
> You are taking RCU lock when accessing tg->uclamp_req but not when
> accessing tg->uclamp_pct.
Good point.

> Is that intentional? Can tg be destroyed under you?
Actually, the rcu_read{,un}lock should be unnecessary in the context of
the kernfs file op handler -- the tg/css won't go away as long as its
kernfs file is being worked with.



2019-09-03 14:23:40

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

On Tue, Sep 3, 2019 at 4:53 AM Michal Koutný <[email protected]> wrote:
>
> On Mon, Sep 02, 2019 at 04:02:57PM -0700, Suren Baghdasaryan <[email protected]> wrote:
> > > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > > + enum uclamp_id clamp_id)
> > > [...]
> > > + rcu_read_lock();
> > > + tg = css_tg(seq_css(sf));
> > > + util_clamp = tg->uclamp_req[clamp_id].value;
> > > + rcu_read_unlock();
> > > +
> > > + if (util_clamp == SCHED_CAPACITY_SCALE) {
> > > + seq_puts(sf, "max\n");
> > > + return;
> > > + }
> > > +
> > > + percent = tg->uclamp_pct[clamp_id];
> >
> > You are taking RCU lock when accessing tg->uclamp_req but not when
> > accessing tg->uclamp_pct.
> Good point.
>
> > Is that intentional? Can tg be destroyed under you?
> Actually, the rcu_read{,un}lock should be unnecessary in the context of
> the kernfs file op handler -- the tg/css won't go away as long as its
> kernfs file is being worked with.
>

Also, add to that the fact that there is no rcu_dereference() call to
access any of the pointers in the reader or any of its callers. And I
don't see any "wait for completion" type of pattern here, so the
rcu_read_{lock, unlock}() pair does seem useless.

thanks,

- Joel