2018-08-28 13:55:06

by Patrick Bellasi

Subject: [PATCH v4 00/16] Add utilization clamping support

This is a respin of:

https://lore.kernel.org/lkml/[email protected]/

Which has been rebased on v4.19-rc1.

Thanks for all the valuable comments collected so far!
Further comments and feedback are more than welcome!

Cheers Patrick


Main changes in v4
==================

.:: Fix issues due to limited number of clamp groups
----------------------------------------------------

As Juri pointed out in:

https://lore.kernel.org/lkml/[email protected]/

we had an issue related to the limited number of supported clamp groups, which
could have affected both normal users as well as cgroups users.

This problem has been fixed by a couple of new patches:

[PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default
[PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support

which ensure that only privileged tasks can create clamp groups and/or that
clamp groups are transformed into buckets, so that there is always a mapping
for each possible userspace request.


.:: Better integrate RT tasks
-----------------------------

Quentin pointed out a change in behavior for RT tasks in:

https://lore.kernel.org/lkml/20180809155551.bp46sixk4u3ilcnh@queper01-lin/

which has been fixed by improving this patch:

[PATCH v4 16/16] sched/cpufreq: uclamp: add utilization clamping for RT tasks

This patch has also been moved to the end of the series since the solution is
partially based on some bits already added by other patches of this series.


.:: Improved refcounting code
-----------------------------

Pavan reported some code paths not covered by refcounting:

https://lore.kernel.org/lkml/[email protected]/

This led to a complete review and improvement of the slow-path code where we
refcount clamp group availability.

We now properly refcount clamp groups usage for all the main entities, i.e.
init_task, root_task_group and sysfs defaults.
The same refcounting code has been properly integrated with fork/exit as well
as cgroups creation/release code paths.


.:: Series Organization
-----------------------

The series is organized into these main sections:

- Patches [01-05]: Per task (primary) API
- Patches [06] : Schedutil integration for CFS tasks
- Patches [07-13]: Per task group (secondary) API
- Patches [14-15]: Fix issues related to limited clamp groups
- Patches [16] : Schedutil integration for RT tasks


Newcomer's Short Abstract (Updated)
===================================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware scheduler
extension [1].

When the schedutil cpufreq governor is in use, the utilization signal also
allows the Linux scheduler to drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
those tasks.

However, the current translation of utilization values into a frequency
selection is pretty simple: we just go to max for RT tasks or to the minimum
frequency which can accommodate the utilization of DL+FAIR tasks. For task
placement, instead, utilization is of limited use since its value alone is
not enough to properly describe the expected power/performance behavior
of each task.

In general, for both RT and FAIR tasks we can aim at better task placement and
frequency selection policies if we take into consideration hints coming from
user-space.

Utilization clamping is a mechanism which allows us to "clamp" (i.e. filter)
the utilization generated by RT and FAIR tasks within a range defined from
user-space. The clamped utilization value can then be used, for example, to
enforce a minimum and/or maximum frequency depending on which tasks are
currently active on a CPU.
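
As a minimal sketch of the idea above (not code from this series), the
following clamps a measured utilization into a user-defined range and then
maps it to a frequency. The helper names clamp_util() and util_to_freq() are
hypothetical; the 25% headroom in the mapping is modelled on schedutil's
frequency selection and is an assumption of this example.

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024u

/* Hypothetical helper: filter a utilization into [util_min, util_max]. */
static inline uint32_t clamp_util(uint32_t util, uint32_t util_min,
				  uint32_t util_max)
{
	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}

/*
 * Hypothetical helper: map a utilization to a frequency in kHz using
 * 1.25 * max_freq * util / SCHED_CAPACITY_SCALE, capped at max_freq.
 */
static inline uint32_t util_to_freq(uint32_t util, uint32_t max_freq_khz)
{
	uint64_t freq = (uint64_t)max_freq_khz + (max_freq_khz >> 2);

	freq = freq * util / SCHED_CAPACITY_SCALE;
	return freq > max_freq_khz ? max_freq_khz : (uint32_t)freq;
}
```

With these helpers, a small task with utilization 100 and a minimum clamp of
512 is treated as a 512 utilization task, so it requests a higher frequency
than its measured utilization alone would.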

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). In this case, the scheduler
cannot get a complete view of the task's requirements from its utilization
alone and, if it's a small utilization task, schedutil will keep
selecting a more energy efficient CPU, with smaller capacity and lower
frequency, thus affecting the overall time required to complete the task
activations.

- capping: increase energy efficiency for background tasks not directly
affecting the user experience.

Since running on a lower capacity CPU at a lower frequency is in general
more energy efficient, when the completion time is not a main goal, then
capping the utilization considered for certain (maybe big) tasks can have
positive effects, both on energy consumption and thermal stress.
Moreover, capping also makes it possible to run RT tasks in a more energy
friendly way on mobile systems, where running them on high capacity CPUs and
at the maximum frequency is not strictly required.

From these two use-cases, it's worth noticing that frequency selection
biasing, introduced by patches 6 and 16 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler with task placement decisions.

Utilization is a task specific property which is used by the scheduler to know
how much CPU bandwidth a task requires (under certain conditions).
Thus, the utilization clamp values, defined either per-task or via the CPU
controller, can be used to represent tasks to the scheduler as being bigger
(or smaller) than what they really are.

Utilization clamping thus ultimately enables interesting additional
optimizations, especially on asymmetric capacity systems like Arm
big.LITTLE and DynamIQ CPUs, where:

- boosting: small tasks are preferably scheduled on higher-capacity CPUs
where, despite being less energy efficient, they can complete faster

- capping: big/background tasks are preferably scheduled on low-capacity CPUs
where, being more energy efficient, they can still run but save power and
thermal headroom for more important tasks.

This additional usage of utilization clamping is not presented in this
series but it's an integral part of the Energy Aware Scheduler (EAS) feature
set, where [1] is one of its main components. A solution similar to utilization
clamping, namely SchedTune, is already used on Android kernels to bias
both 'frequency selection' and 'task placement'.
This series provides the foundation bits to add similar features to mainline
by focusing just on schedutil integration.

[1] https://lore.kernel.org/lkml/[email protected]/


Detailed Changelog
==================

Changes in v4:
Message-ID: <20180809152313.lewfhufidhxb2qrk@darkstar>
- implements the idea discussed in this thread
Message-ID: <[email protected]>
- remove not required default setting
- fixed some tabs/spaces
Message-ID: <[email protected]>
- replace/rephrase "bandwidth" references to use "capacity"
- better stress that this does not enforce any bandwidth requirement
but "just" give hints to the scheduler
- fixed some typos
Message-ID: <[email protected]>
- add uclamp_exit_task() to release clamp refcount from do_exit()
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN but beautify a bit that code
- keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
- add another WARN on the unexpected condition of releasing a refcount
from a CPU which has a lower clamp value active
Message-ID: <[email protected]>
- move uclamp_enabled at the top of sched_class to keep it on the same
cache line of other main wakeup time callbacks
Message-ID: <20180816132249.GA2960@e110439-lin>
- inline uclamp_task_active() code into uclamp_task_update_active()
- get rid of the now unused uclamp_task_active()
Message-ID: <20180816172016.GG2960@e110439-lin>
- ensure to always reset clamp holding on wakeup from IDLE
Message-ID: <CAKfTPtC2adLupg7wy1JU9zxKx1466Sza6fSCcr92wcawm1OYkg@mail.gmail.com>
- use *rq instead of cpu for both uclamp_util() and uclamp_value()
Message-ID: <20180816135300.GC2960@e110439-lin>
- remove uclamp_value() which is never used outside CONFIG_UCLAMP_TASK
Message-ID: <20180816140731.GD2960@e110439-lin>
- add ".effective" attributes to the default hierarchy
- reuse already existing:
task_struct::uclamp::effective::group_id
instead of adding:
task_struct::uclamp_group_id
to back annotate the effective clamp group in which a task has been
refcounted
Message-ID: <20180820122728.GM2960@e110439-lin>
- fix unwanted reset of clamp values on refcount success
Other:
- by default all tasks have a UCLAMP_NOT_VALID task specific clamp
- always use:
p->uclamp[clamp_id].effective.value
to track the actual clamp value the task has been refcounted into.
This matches with the usage of
p->uclamp[clamp_id].effective.group_id
- allow to call uclamp_group_get() without a task pointer, which is
used to refcount the initial clamp group for all the global objects
(init_task, root_task_group and system_defaults)
- ensure (and check) that all tasks have a valid group_id at
uclamp_cpu_get_id()
- rework uclamp_cpu layout to better fit into just 2x64B cache lines
- fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
- init uclamp for the init_task and refcount its clamp groups
- add uclamp specific fork time code into uclamp_fork
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- enable uclamp only for FAIR tasks, RT class will be enabled only
by a following patch which also integrates the class with schedutil
- define uclamp_maps ____cacheline_aligned_in_smp
- in uclamp_group_get() ensure to include uclamp_group_available() and
uclamp_group_init() into the atomic section defined by:
uc_map[next_group_id].se_lock
- do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task
which is also not needed since refcounting is already guarded by
the uc_map[group_id].se_lock spinlock
- consolidate init_uclamp_sched_group() into init_uclamp()
- refcount root_task_group's clamp groups from init_uclamp()
- small documentation fixes
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
- remove not necessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() fail if both clamps are required but
there are no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
- add explicit calls to uclamp_group_find()
which is no longer part of uclamp_group_get()
- fixed a not required override
- fixed some typos in comments and changelog
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
- added new patches: 09 and 12
Other changes:
- rebased on tip/sched/core

Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not required at this stage where we have only per-task
clamping support. It will be introduced later when cgroup support is
added.
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system-wide API does not seem to be of immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Other changes:
- improved documentation to make more explicit some concepts
- set UCLAMP_GROUPS_COUNT=2 by default
which allows to fit all the hot-path CPU clamps data into a single cache
line while still supporting up to 2 different {min,max}_util clamps.
- use -ERANGE as range violation error
- add attributes to the default hierarchy as well as the legacy one
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values,
i.e. tasks running on a TG are only allowed to demote themselves.
- patches re-ordering in top-down way
- rebased on v4.18-rc4

Patrick Bellasi (16):
sched/core: uclamp: extend sched_setattr to support utilization
clamping
sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
sched/core: uclamp: add CPU's clamp groups accounting
sched/core: uclamp: update CPU's refcount on clamp changes
sched/core: uclamp: enforce last task UCLAMP_MAX
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
sched/core: uclamp: extend cpu's cgroup controller
sched/core: uclamp: propagate parent clamps
sched/core: uclamp: map TG's clamp values into CPU's clamp groups
sched/core: uclamp: use TG's clamps to restrict Task's clamps
sched/core: uclamp: add system default clamps
sched/core: uclamp: update CPU's refcount on TG's clamp changes
sched/core: uclamp: use percentage clamp values
sched/core: uclamp: request CAP_SYS_ADMIN by default
sched/core: uclamp: add clamp group discretization support
sched/cpufreq: uclamp: add utilization clamping for RT tasks

Documentation/admin-guide/cgroup-v2.rst | 46 +
.../admin-guide/kernel-parameters.txt | 3 +
include/linux/sched.h | 65 +
include/linux/sched/sysctl.h | 11 +
include/linux/sched/task.h | 6 +
include/uapi/linux/sched.h | 8 +-
include/uapi/linux/sched/types.h | 68 +-
init/Kconfig | 63 +
init/init_task.c | 1 +
kernel/exit.c | 1 +
kernel/sched/core.c | 1368 ++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 31 +-
kernel/sched/fair.c | 4 +
kernel/sched/features.h | 10 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 177 ++-
kernel/sysctl.c | 16 +
17 files changed, 1863 insertions(+), 19 deletions(-)

--
2.18.0



2018-08-28 13:55:16

by Patrick Bellasi

Subject: [PATCH v4 01/16] sched/core: uclamp: extend sched_setattr to support utilization clamping

The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can be translated into proper
decisions for both task placement and frequency selection.
Other classes have a more simplified model which is essentially based on
the relatively simple concept of POSIX priorities.

Such a simple priority based model however does not allow exploiting
some of the most advanced features of the Linux scheduler like, for
example, driving frequency selection via the schedutil cpufreq
governor. However, also for non SCHED_DEADLINE tasks, it's still
interesting to define task properties which can be used to better
support certain scheduler decisions.

Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization for a task.
This will allow implementing a more advanced per-task frequency control
mechanism which is not based just on a "passive" measured task
utilization but on a more "active" approach. For example, it could be
possible to boost interactive tasks, thus getting better performance, or
cap background tasks, thus being more energy efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq's
powersave, performance and userspace governors but with a much finer
grained and per-task control.

Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr, a syscall which already
allows defining task specific properties for different scheduling
classes.
Specifically, a new pair of attributes allows specifying a minimum and
maximum utilization which the scheduler should consider for a task.
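
A minimal user-space sketch of how a caller might prepare such a request is
shown below. It assumes the sched_attr layout and the SCHED_FLAG_UTIL_CLAMP
value (0x08) proposed by this patch; struct uclamp_sched_attr and
uclamp_attr_init() are hypothetical names used only for illustration, and the
range checks mirror the ones __setscheduler_uclamp() performs in this patch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SCHED_CAPACITY_SCALE	1024u
#define SCHED_FLAG_UTIL_CLAMP	0x08	/* value proposed by this patch */

/* Hypothetical user-space mirror of the extended struct sched_attr. */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	/* Utilization hints */
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/*
 * Hypothetical helper: fill @attr with a clamp request, rejecting the same
 * invalid ranges the kernel side of this patch rejects with -EINVAL.
 */
static int uclamp_attr_init(struct uclamp_sched_attr *attr,
			    uint32_t util_min, uint32_t util_max)
{
	if (util_min > util_max || util_max > SCHED_CAPACITY_SCALE)
		return -EINVAL;

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_UTIL_CLAMP;
	attr->sched_util_min = util_min;
	attr->sched_util_max = util_max;
	return 0;
}
```

On a kernel carrying this patch, the filled structure would then be passed to
the sched_setattr syscall for the target task. Note that the zeroed
sched_policy field selects SCHED_NORMAL, so a real caller would likely read
the current attributes with sched_getattr first and only update the clamp
fields.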

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <[email protected]>
- remove not required default setting
- fixed some tabs/spaces
Message-ID: <[email protected]>
- replace/rephrase "bandwidth" references to use "capacity"
- better stress that this does not enforce any bandwidth requirement
but "just" give hints to the scheduler
- fixed some typos
Others:
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- move at the head of the series

As discussed at OSPM, using a [0..SCHED_CAPACITY_SCALE] range seems to
be acceptable. However, an additional patch has been added at the end of
the series which introduces a simple abstraction to use a more
generic [0..100] range.

At OSPM we also discarded the idea to "recycle" the usage of
sched_runtime and sched_period which would have made the API much too
complex for limited benefits.
---
include/linux/sched.h | 13 +++++++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 66 +++++++++++++++++++++++++++-----
init/Kconfig | 21 ++++++++++
init/init_task.c | 5 +++
kernel/sched/core.c | 39 +++++++++++++++++++
6 files changed, 138 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..880a0c5c1f87 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,14 @@ struct vtime {
u64 gtime;
};

+enum uclamp_id {
+ UCLAMP_MIN = 0, /* Minimum utilization */
+ UCLAMP_MAX, /* Maximum utilization */
+
+ /* Utilization clamping constraints count */
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -649,6 +657,11 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values for this task */
+ int uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..c27d6e81517b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_UTIL_CLAMP 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_UTIL_CLAMP)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..7512b5934013 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -21,8 +21,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +59,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +72,30 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the utilization which
+ * should be expected by a task. These attributes allow to inform the
+ * scheduler about the utilization boundaries within which it is expected to
+ * schedule the task. These boundaries are valuable hints to support scheduler
+ * decisions on both task placement and frequencies selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. Thus, for
+ * example, a 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger than 0 is more likely to be
+ * scheduled on a CPU which has a capacity big enough to fit the specified
+ * minimum utilization value.
+ * A task with a max utilization value smaller than 1024 is more likely to be
+ * scheduled on a CPU which does not necessarily have more capacity than the
+ * specified max utilization value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +113,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..738974c4f628 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -613,6 +613,27 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max CPU
+ utilization which is allowed for RUNNABLE tasks.
+ The max utilization allows to request a maximum frequency a task should
+ use, while the min utilization allows to request a minimum frequency a
+ task should use.
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+endmenu
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..5bfdcc3fb839 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/topology.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -91,6 +92,10 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp[UCLAMP_MIN] = 0,
+ .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 625bc9897f62..16d3544c7ffa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -716,6 +716,28 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
+ p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+
+ return 0;
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2320,6 +2342,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);

+#ifdef CONFIG_UCLAMP_TASK
+ p->uclamp[UCLAMP_MIN] = 0;
+ p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif
+
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -4215,6 +4242,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4721,6 +4755,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);

+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN];
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+#endif
+
rcu_read_unlock();

retval = sched_read_attr(uattr, &attr, size);
--
2.18.0


2018-08-28 13:55:47

by Patrick Bellasi

Subject: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Utilization clamping requires each CPU to know which clamp values are
assigned to tasks that are currently RUNNABLE on that CPU.
Multiple tasks can be assigned the same clamp value and tasks with
different clamp values can be concurrently active on the same CPU.
Thus, a proper data structure is required to support a fast and
efficient aggregation of the clamp values required by the currently
RUNNABLE tasks.

For this purpose we use a per-CPU array of reference counters,
where each slot is used to account how many tasks requiring a certain
clamp value are currently RUNNABLE on each CPU.
Each clamp value corresponds to a "clamp index" which identifies the
position within the array of reference counters.
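
The per-CPU accounting described above can be sketched as follows. This is an
illustration only: the type and function names (cpu_clamp, cpu_clamp_get(),
cpu_clamp_put(), cpu_clamp_value()) and the group count are hypothetical, and
aggregating by taking the maximum value among active groups is an assumption
of this sketch rather than something stated in this changelog.

```c
#include <stdint.h>

#define CPU_CLAMP_GROUPS 4	/* hypothetical number of clamp groups */

/* One slot per clamp group: its clamp value and its RUNNABLE task count. */
struct cpu_clamp_group {
	uint32_t value;
	uint32_t tasks;
};

struct cpu_clamp {
	struct cpu_clamp_group group[CPU_CLAMP_GROUPS];
};

/* Enqueue side: one more RUNNABLE task refcounts this clamp group. */
static void cpu_clamp_get(struct cpu_clamp *cc, int group_id)
{
	cc->group[group_id].tasks++;
}

/* Dequeue side: release the reference, never dropping below zero. */
static void cpu_clamp_put(struct cpu_clamp *cc, int group_id)
{
	if (cc->group[group_id].tasks > 0)
		cc->group[group_id].tasks--;
}

/*
 * Aggregate the clamp value for the CPU: the maximum value among groups
 * with at least one RUNNABLE task, or @idle_value when none is active.
 */
static uint32_t cpu_clamp_value(const struct cpu_clamp *cc,
				uint32_t idle_value)
{
	uint32_t max = 0;
	int active = 0;
	int i;

	for (i = 0; i < CPU_CLAMP_GROUPS; i++) {
		if (!cc->group[i].tasks)
			continue;
		active = 1;
		if (cc->group[i].value > max)
			max = cc->group[i].value;
	}
	return active ? max : idle_value;
}
```

Because the fast path only increments or decrements a counter at a known
index, the enqueue/dequeue cost does not depend on how many tasks share a
clamp value.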

:
(user-space changes) : (kernel space / scheduler)
:
SLOW PATH : FAST PATH
:
task_struct::uclamp::value : sched/core::enqueue/dequeue
: cpufreq_schedutil
:
+----------------+ +--------------------+ +-------------------+
| TASK | | CLAMP GROUP | | CPU CLAMPS |
+----------------+ +--------------------+ +-------------------+
| | | clamp_{min,max} | | clamp_{min,max} |
| util_{min,max} | | se_count | | tasks count |
+----------------+ +--------------------+ +-------------------+
:
+------------------> : +------------------->
group_id = map(clamp_value) : ref_count(group_id)
:
:

Let's introduce the support to map tasks to "clamp groups".
Specifically we introduce the required functions to translate a
"clamp value" into a clamp's "group index" (group_id).

Only a limited number of (different) clamp values are supported since:
1. there are usually only few classes of workloads for which it makes
sense to boost/limit to different frequencies,
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path.

The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can result in
a -ENOSPC error if doing so would exceed the maximum number of different
clamp values supported.
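
A simplified sketch of such a slow-path mapping is shown below, assuming a
small fixed-size table where a slot is free when no scheduling entity
references it. The names clamp_group_find(), clamp_group_get() and
clamp_group_put() are hypothetical stand-ins and do not reproduce the
functions added by this patch; locking is omitted.

```c
#include <errno.h>
#include <stdint.h>

#define CLAMP_GROUPS_MAX 2	/* hypothetical compile-time limit */

struct clamp_group {
	uint32_t value;		/* clamp value mapped by this group */
	uint32_t se_count;	/* scheduling entities referencing it */
};

/*
 * Return the group already mapping @value, else the first free group,
 * else -ENOSPC when every group is in use by a different value.
 */
static int clamp_group_find(const struct clamp_group *map, uint32_t value)
{
	int free_id = -1;
	int i;

	for (i = 0; i < CLAMP_GROUPS_MAX; i++) {
		if (!map[i].se_count) {
			if (free_id < 0)
				free_id = i;
			continue;
		}
		if (map[i].value == value)
			return i;
	}
	return free_id >= 0 ? free_id : -ENOSPC;
}

/* Take a reference on the group mapping @value; returns its group_id. */
static int clamp_group_get(struct clamp_group *map, uint32_t value)
{
	int group_id = clamp_group_find(map, value);

	if (group_id < 0)
		return group_id;
	map[group_id].value = value;
	map[group_id].se_count++;
	return group_id;
}

/* Drop a reference; the group becomes reusable once the count hits zero. */
static void clamp_group_put(struct clamp_group *map, int group_id)
{
	if (map[group_id].se_count > 0)
		map[group_id].se_count--;
}
```

With two groups, a third distinct clamp value fails with -ENOSPC until one of
the existing values loses its last reference.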

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <[email protected]>
- add uclamp_exit_task() to release clamp refcount from do_exit()
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN but beautify a bit that code
Message-ID: <[email protected]>
- move uclamp_enabled at the top of sched_class to keep it on the same
cache line of other main wakeup time callbacks
Others:
- init uclamp for the init_task and refcount its clamp groups
- add uclamp specific fork time code into uclamp_fork
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- enable uclamp only for FAIR tasks, RT class will be enabled only
by a following patch which also integrates the class with schedutil
- define uclamp_maps ____cacheline_aligned_in_smp
- in uclamp_group_get() ensure to include uclamp_group_available() and
uclamp_group_init() into the atomic section defined by:
uc_map[next_group_id].se_lock
- do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task
which is also not needed since refcounting is already guarded by
the uc_map[group_id].se_lock spinlock
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- remove not necessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() failing if both clamps are required but
there is no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- set UCLAMP_GROUPS_COUNT=2 by default
which allows fitting all the hot-path CPU clamps data, partially
introduced also by the following patches, into a single cache line
while still supporting up to 2 different {min,max}_util clamps.
---
include/linux/sched.h | 16 +-
include/linux/sched/task.h | 6 +
include/uapi/linux/sched.h | 6 +-
init/Kconfig | 20 ++
init/init_task.c | 4 -
kernel/exit.c | 1 +
kernel/sched/core.c | 395 +++++++++++++++++++++++++++++++++++--
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 28 ++-
9 files changed, 456 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 880a0c5c1f87..7385f0b1a7c0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,9 @@ struct vtime {
u64 gtime;
};

+/* Clamp not valid, i.e. group not assigned or invalid value */
+#define UCLAMP_NOT_VALID -1
+
enum uclamp_id {
UCLAMP_MIN = 0, /* Minimum utilization */
UCLAMP_MAX, /* Maximum utilization */
@@ -575,6 +578,17 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+/**
+ * Utilization's clamp group
+ *
+ * A utilization clamp group maps a "clamp value" (value), i.e.
+ * util_{min,max}, to a "clamp group index" (group_id).
+ */
+struct uclamp_se {
+ unsigned int value;
+ unsigned int group_id;
+};
+
union rcu_special {
struct {
u8 blocked;
@@ -659,7 +673,7 @@ struct task_struct {

#ifdef CONFIG_UCLAMP_TASK
/* Utlization clamp values for this task */
- int uclamp[UCLAMP_CNT];
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 108ede99e533..36c81c364112 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -68,6 +68,12 @@ static inline void exit_thread(struct task_struct *tsk)
#endif
extern void do_group_exit(int);

+#ifdef CONFIG_UCLAMP_TASK
+extern void uclamp_exit_task(struct task_struct *p);
+#else
+static inline void uclamp_exit_task(struct task_struct *p) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index c27d6e81517b..ae7e12de32ca 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,7 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
-#define SCHED_FLAG_UTIL_CLAMP 0x08
+
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
diff --git a/init/Kconfig b/init/Kconfig
index 738974c4f628..10536cb83295 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -633,7 +633,27 @@ config UCLAMP_TASK

If in doubt, say N.

+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 32
+ default 5
+ depends on UCLAMP_TASK
+ help
+ This defines the maximum number of different utilization clamp
+ values which can be concurrently enforced for each utilization
+ clamp index (i.e. minimum and maximum utilization).
+
+ Only a limited number of clamp values are supported because:
+ 1. there are usually only few classes of workloads for which it
+ makes sense to boost/cap for different frequencies,
+ e.g. background vs foreground, interactive vs low-priority.
+ 2. it allows a simpler and more memory/time efficient tracking of
+ the per-CPU clamp values.
+
+ If in doubt, use the default value.
+
endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5bfdcc3fb839..7f77741b6a9b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -92,10 +92,6 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
-#endif
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp[UCLAMP_MIN] = 0,
- .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..feb540558051 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -877,6 +877,7 @@ void __noreturn do_exit(long code)

sched_autogroup_exit_task(tsk);
cgroup_exit(tsk);
+ uclamp_exit_task(tsk);

/*
* FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 16d3544c7ffa..2668990b96d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -717,25 +717,389 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_mutex: serializes updates of utilization clamp values
+ *
+ * A utilization clamp value update is usually triggered from a user-space
+ * process (slow-path) but it requires a synchronization with the scheduler's
+ * (fast-path) enqueue/dequeue operations.
+ * While the fast-path synchronization is protected by RQs spinlock, this
+ * mutex ensures that we sequentially serve user-space requests.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * uclamp_map: reference counts a utilization "clamp value"
+ * @value: the utilization "clamp value" required
+ * @se_count: the number of scheduling entities requiring the "clamp value"
+ * @se_lock: serialize reference count updates by protecting se_count
+ */
+struct uclamp_map {
+ int value;
+ int se_count;
+ raw_spinlock_t se_lock;
+};
+
+/**
+ * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
+ *
+ * Since only a limited number of different "clamp values" are supported, we
+ * need to map each different clamp value into a "clamp group" (group_id) to
+ * be used by the per-CPU accounting in the fast-path, when tasks are
+ * enqueued and dequeued.
+ * We also support different kinds of utilization clamping, min and max
+ * utilization for example, each representing what we call a "clamp index"
+ * (clamp_id).
+ *
+ * A matrix is thus required to map "clamp values" to "clamp groups"
+ * (group_id), for each "clamp index" (clamp_id), where:
+ * - rows are indexed by clamp_id and they collect the clamp groups for a
+ * given clamp index
+ * - columns are indexed by group_id and they collect the clamp values which
+ * maps to that clamp group
+ *
+ * Thus, the column index of a given (clamp_id, value) pair represents the
+ * clamp group (group_id) used by the fast-path's per-CPU accounting.
+ *
+ * NOTE: first clamp group (group_id=0) is reserved for tracking of non
+ * clamped tasks. Thus we allocate one more slot than the value of
+ * CONFIG_UCLAMP_GROUPS_COUNT.
+ *
+ * Here is the map layout and, right below, how entries are accessed by the
+ * following code.
+ *
+ * uclamp_maps is a matrix of
+ * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
+ * | |
+ * | /---------------+---------------\
+ * | +------------+ +------------+
+ * | / UCLAMP_MIN | value | | value |
+ * | | | se_count |...... | se_count |
+ * | | +------------+ +------------+
+ * +--+ +------------+ +------------+
+ * | | value | | value |
+ * \ UCLAMP_MAX | se_count |...... | se_count |
+ * +-----^------+ +----^-------+
+ * | |
+ * uc_map = + |
+ * &uclamp_maps[clamp_id][0] +
+ * clamp_value =
+ * uc_map[group_id].value
+ */
+static struct uclamp_map uclamp_maps[UCLAMP_CNT]
+ [CONFIG_UCLAMP_GROUPS_COUNT + 1]
+ ____cacheline_aligned_in_smp;
+
+#define UCLAMP_ENOSPC_FMT "Cannot allocate more than " \
+ __stringify(CONFIG_UCLAMP_GROUPS_COUNT) " UTIL_%s clamp groups\n"
+
+/**
+ * uclamp_group_available: checks if a clamp group is available
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index in the given clamp_id
+ *
+ * A clamp group is not free if there is at least one SE which is using a clamp
+ * value mapped on the specified clamp_id. These SEs are reference counted by
+ * the se_count of a uclamp_map entry.
+ *
+ * Return: true if there are no SE's mapped on the specified clamp
+ * index and group
+ */
+static inline bool uclamp_group_available(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ return (uc_map[group_id].value == UCLAMP_NOT_VALID);
+}
+
+/**
+ * uclamp_group_init: maps a clamp value on a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index to map a given clamp_value
+ * @clamp_value: the utilization clamp value to map
+ *
+ * Initializes a clamp group to track tasks from the fast-path.
+ * Each different clamp value, for a given clamp index (i.e. min/max
+ * utilization clamp), is mapped by a clamp group whose index is used by the
+ * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp
+ * value.
+ *
+ */
+static inline void uclamp_group_init(int clamp_id, int group_id,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ uc_map[group_id].value = clamp_value;
+ uc_map[group_id].se_count = 0;
+}
+
+/**
+ * uclamp_group_reset: resets a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @group_id: the group index to release
+ *
+ * A clamp group can be reset every time there are no more task groups using
+ * the clamp value it maps for a given clamp index.
+ */
+static inline void uclamp_group_reset(int clamp_id, int group_id)
+{
+ uclamp_group_init(clamp_id, group_id, UCLAMP_NOT_VALID);
+}
+
+/**
+ * uclamp_group_find: finds the group index of a utilization clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @clamp_value: the utilization clamp value to look up
+ *
+ * Verify if a group has been assigned to a certain clamp value and return
+ * its index to be used for accounting.
+ *
+ * Since only a limited number of utilization clamp groups are allowed, if no
+ * groups have been assigned for the specified value, a new group is assigned,
+ * if possible.
+ * Otherwise an error is returned, meaning that an additional clamp value is
+ * not (currently) supported.
+ */
+static int
+uclamp_group_find(int clamp_id, unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int free_group_id = UCLAMP_NOT_VALID;
+ unsigned int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Keep track of first free clamp group */
+ if (uclamp_group_available(clamp_id, group_id)) {
+ if (free_group_id == UCLAMP_NOT_VALID)
+ free_group_id = group_id;
+ continue;
+ }
+ /* Return index of first group with same clamp value */
+ if (uc_map[group_id].value == clamp_value)
+ return group_id;
+ }
+
+ if (likely(free_group_id != UCLAMP_NOT_VALID))
+ return free_group_id;
+
+ return -ENOSPC;
+}
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @group_id: the index of the clamp group to release
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value. A clamp group is
+ * released when there are no more task groups referencing its clamp value.
+ */
+static inline void uclamp_group_put(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ unsigned long flags;
+
+ /* Ignore SE's not yet attached */
+ if (group_id == UCLAMP_NOT_VALID)
+ return;
+
+ /* Remove SE from this clamp group */
+ raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
+ if (likely(uc_map[group_id].se_count))
+ uc_map[group_id].se_count -= 1;
+#ifdef SCHED_DEBUG
+ else {
+ WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
+ clamp_id, group_id);
+ }
+#endif
+ if (uc_map[group_id].se_count == 0)
+ uclamp_group_reset(clamp_id, group_id);
+ raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @clamp_id: the clamp index affected by the task
+ * @next_group_id: the clamp group to refcount
+ * @uc_se: the utilization clamp data for the task
+ * @clamp_value: the new clamp value for the task
+ *
+ * Each time a task changes its utilization clamp value, for a specified clamp
+ * index, we need to find an available clamp group which can be used to track
+ * this new clamp value. The corresponding clamp group index will be used by
+ * the task to reference count the clamp value on CPUs while enqueued.
+ */
+static inline void uclamp_group_get(int clamp_id, int next_group_id,
+ struct uclamp_se *uc_se,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int prev_group_id = uc_se->group_id;
+ unsigned long flags;
+
+ /* Allocate new clamp group for this clamp value */
+ raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
+ if (uclamp_group_available(clamp_id, next_group_id))
+ uclamp_group_init(clamp_id, next_group_id, clamp_value);
+
+ /* Update SE's clamp values and attach it to new clamp group */
+ uc_se->value = clamp_value;
+ uc_se->group_id = next_group_id;
+ uc_map[next_group_id].se_count += 1;
+ raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+
+ /* Release the previous clamp group */
+ uclamp_group_put(clamp_id, prev_group_id);
+}
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
- if (attr->sched_util_min > attr->sched_util_max)
- return -EINVAL;
- if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
- return -EINVAL;
+ int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
+ int lower_bound, upper_bound;
+ struct uclamp_se *uc_se;
+ int result = 0;

- p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
- p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+ mutex_lock(&uclamp_mutex);

- return 0;
+ /* Find a valid group_id for each required clamp value */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ ? attr->sched_util_max
+ : p->uclamp[UCLAMP_MAX].value;
+
+ if (upper_bound == UCLAMP_NOT_VALID)
+ upper_bound = SCHED_CAPACITY_SCALE;
+ if (attr->sched_util_min > upper_bound) {
+ result = -EINVAL;
+ goto done;
+ }
+
+ result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
+ if (result == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MIN");
+ goto done;
+ }
+ group_id[UCLAMP_MIN] = result;
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ ? attr->sched_util_min
+ : p->uclamp[UCLAMP_MIN].value;
+
+ if (lower_bound == UCLAMP_NOT_VALID)
+ lower_bound = 0;
+ if (attr->sched_util_max < lower_bound ||
+ attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto done;
+ }
+
+ result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
+ if (result == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MAX");
+ goto done;
+ }
+ group_id[UCLAMP_MAX] = result;
+ }
+
+ /* Update each required clamp group */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ uc_se = &p->uclamp[UCLAMP_MIN];
+ uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uc_se, attr->sched_util_min);
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ uc_se = &p->uclamp[UCLAMP_MAX];
+ uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uc_se, attr->sched_util_max);
+ }
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
+/**
+ * uclamp_exit_task: release referenced clamp groups
+ * @p: the task exiting
+ *
+ * When a task terminates, release all its (eventually) refcounted
+ * task-specific clamp groups.
+ */
+void uclamp_exit_task(struct task_struct *p)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &p->uclamp[clamp_id];
+ uclamp_group_put(clamp_id, uc_se->group_id);
+ }
+}
+
+/**
+ * uclamp_fork: refcount task-specific clamp values for a new task
+ */
+static void uclamp_fork(struct task_struct *p, bool reset)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ int next_group_id = p->uclamp[clamp_id].group_id;
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+
+ if (unlikely(reset)) {
+ next_group_id = 0;
+ p->uclamp[clamp_id].value = uclamp_none(clamp_id);
+ }
+
+ p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(clamp_id, next_group_id, uc_se,
+ p->uclamp[clamp_id].value);
+ }
+}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static void __init init_uclamp(void)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ mutex_init(&uclamp_mutex);
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ uc_map[group_id].value = UCLAMP_NOT_VALID;
+ raw_spin_lock_init(&uc_map[group_id].se_lock);
+ }
+
+ /* Init init_task's clamp group */
+ uc_se = &init_task.uclamp[clamp_id];
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(clamp_id, 0, uc_se, uclamp_none(clamp_id));
+ }
}
+
#else /* CONFIG_UCLAMP_TASK */
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
return -EINVAL;
}
+static inline void uclamp_fork(struct task_struct *p, bool reset) { }
+static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2314,6 +2678,7 @@ static inline void init_schedstats(void) {}
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
+ bool reset;

__sched_fork(clone_flags, p);
/*
@@ -2331,7 +2696,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
/*
* Revert to default priority/policy on fork if requested.
*/
- if (unlikely(p->sched_reset_on_fork)) {
+ reset = p->sched_reset_on_fork;
+ if (unlikely(reset)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
@@ -2342,11 +2708,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);

-#ifdef CONFIG_UCLAMP_TASK
- p->uclamp[UCLAMP_MIN] = 0;
- p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
-#endif
-
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -2363,6 +2724,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)

init_entity_runnable_average(&p->se);

+ uclamp_fork(p, reset);
+
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
@@ -4756,8 +5119,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN];
- attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
#endif

rcu_read_unlock();
@@ -6107,6 +6470,8 @@ void __init sched_init(void)

init_schedstats();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..dab0405386c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10055,6 +10055,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..72df2dc779bc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1501,10 +1501,12 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*yield_task) (struct rq *rq);
- bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

@@ -1537,7 +1539,6 @@ struct sched_class {
void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
- void (*task_dead)(struct task_struct *p);

/*
* The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1554,12 +1555,17 @@ struct sched_class {

void (*update_curr)(struct rq *rq);

+ void (*yield_task) (struct rq *rq);
+ bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+ void (*task_dead)(struct task_struct *p);
};

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
@@ -2177,6 +2183,22 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.18.0


2018-08-28 13:56:32

by Patrick Bellasi

Subject: [PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support

The limited number of clamp groups is required to have both an effective
and efficient run-time tracking of the clamp groups required by RUNNABLE
tasks. However, being a limited number imposes some constraints on its
usage at run-time. Specifically, a System Management Software should
"reserve" all the possible clamp values required at run-time to ensure
there will always be a clamp group to track them whenever required.

To fix this problem we can trade off CPU clamping precision for
efficiency by transforming CPU's clamp groups into buckets of a
predefined range.

The number of clamp groups configured at compile time defines the range
of utilization clamp values tracked by each CPU clamp group. Thus, for
example, with the default:
CONFIG_UCLAMP_GROUPS_COUNT 5
we will have 5 clamp groups tracking 20% utilization each and a task
with util_min=25% will have group_id=1.

Scheduling entities keep tracking the specific value defined from
user-space, which can still be used for task placement biasing
decisions. However, at enqueue time tasks will be refcounted in the
clamp group whose range includes the task-specific clamp value.

Each CPU's clamp value will also be updated to aggregate and represent
at run-time the most restrictive value among those of the RUNNABLE tasks
refcounted by that group. Each time a CPU clamp group becomes empty we
reset its clamp value to the minimum value of the range it tracks.
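
The bucketing arithmetic described above can be sketched in user space as follows. This is a minimal model of the rounding step only, assuming SCHED_CAPACITY_SCALE is 1024 and the default of 5 clamp groups; the names CAPACITY_SCALE, GROUPS_COUNT, GROUP_DELTA, GROUP_UPPER and round_clamp() are local to this example, and the sched_feat() check is left out.

```c
#define CAPACITY_SCALE 1024  /* assumed value of SCHED_CAPACITY_SCALE */
#define GROUPS_COUNT   5     /* default CONFIG_UCLAMP_GROUPS_COUNT */

/* Width of each bucket: 1024 / 5 = 204, i.e. roughly 20% of the scale. */
#define GROUP_DELTA (CAPACITY_SCALE / GROUPS_COUNT)
/* First value treated as "full capacity": 204 * 5 = 1020. */
#define GROUP_UPPER (GROUP_DELTA * GROUPS_COUNT)

/* Round a clamp value down to the lower bound of its bucket. */
static int round_clamp(int value)
{
	if (value <= 0)
		return value;
	if (value >= GROUP_UPPER)
		return CAPACITY_SCALE;

	return GROUP_DELTA * (value / GROUP_DELTA);
}
```

Under these assumptions, util_min=25% (256) and util_min=35% (358) both round to 204, the lower bound of the second bucket, so both tasks would share one clamp group. The possible rounded values are 0, 204, 408, 612, 816 and 1024.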

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180809152313.lewfhufidhxb2qrk@darkstar>
- implements the idea discussed in this thread
Others:
- new patch added in this version
- rebased on v4.19-rc1
---
include/linux/sched.h | 13 ++++-----
kernel/sched/core.c | 59 ++++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 5 ++++
3 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca0a80881fa9..752fcd5d2cea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -608,17 +608,18 @@ struct sched_dl_entity {
* either tasks or task groups, to enforce the same clamp "value" for a given
* clamp index.
*
- * Scheduling entity's specific clamp group index can be different
- * from the effective clamp group index used at enqueue time since
- * task groups's clamps can be restricted by their parent task group.
+ * Scheduling entity's specific clamp value and group index can be different
+ * from the effective value and group index used at enqueue time. Indeed:
+ * - task's clamps can be restricted by their task group clamps
+ * - task groups' clamps can be restricted by their parent task group
*/
struct uclamp_se {
unsigned int value;
unsigned int group_id;
/*
- * Effective task (group) clamp value and group index.
- * For task groups it's the value (eventually) enforced by a parent
- * task group.
+ * Effective task (group) clamp value and group index:
+ * for tasks: those used at enqueue time
+ * for task groups: those (eventually) enforced by a parent task group
*/
struct {
unsigned int value;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8341ce580a9a..f71e15eaf152 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -807,6 +807,34 @@ static struct uclamp_map uclamp_maps[UCLAMP_CNT]
#define UCLAMP_ENOSPC_FMT "Cannot allocate more than " \
__stringify(CONFIG_UCLAMP_GROUPS_COUNT) " UTIL_%s clamp groups\n"

+/*
+ * uclamp_round: round a clamp value to the closest trackable value
+ *
+ * The number of clamp groups, which is defined at compile time, allows
+ * tracking a finite number of different clamp values. This makes sense both
+ * from a practical standpoint, since we do not expect many different values
+ * on a real system, and for run-time efficiency.
+ *
+ * To ensure a clamp group is always available, this method discretizes
+ * a required value into one of the possible available clamp
+ * groups.
+ */
+static inline int uclamp_round(int value)
+{
+#define UCLAMP_GROUP_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_GROUPS_COUNT)
+#define UCLAMP_GROUP_UPPER (UCLAMP_GROUP_DELTA * CONFIG_UCLAMP_GROUPS_COUNT)
+
+ if (unlikely(!sched_feat(UCLAMP_ROUNDING)))
+ return value;
+
+ if (value <= 0)
+ return value;
+ if (value >= UCLAMP_GROUP_UPPER)
+ return SCHED_CAPACITY_SCALE;
+
+ return UCLAMP_GROUP_DELTA * (value / UCLAMP_GROUP_DELTA);
+}
+
/**
* uclamp_group_available: checks if a clamp group is available
* @clamp_id: the utilization clamp index (i.e. min or max clamp)
@@ -846,6 +874,9 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
struct uclamp_cpu *uc_cpu;
int cpu;

+ /* Clamp groups are always initialized to the rounded clamp value */
+ clamp_value = uclamp_round(clamp_value);
+
/* Set clamp group map */
uc_map[group_id].value = clamp_value;
uc_map[group_id].se_count = 0;
@@ -892,6 +923,7 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
int free_group_id = UCLAMP_NOT_VALID;
unsigned int group_id = 0;

+ clamp_value = uclamp_round(clamp_value);
for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
/* Keep track of first free clamp group */
if (uclamp_group_available(clamp_id, group_id)) {
@@ -979,6 +1011,22 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
* task_struct::uclamp::effective::value
* is updated to represent the clamp value corresponding to the taks effective
* group index.
+ *
+ * Thus, the effective clamp value for a task is guaranteed to be in the range of
+ * the rounded clamp values of its effective clamp group. For example:
+ * - CONFIG_UCLAMP_GROUPS_COUNT=5 => UCLAMP_GROUP_DELTA=20%
+ * - TaskA: util_min=25% => clamp_group1: range [20-39]%
+ * - TaskB: util_min=35% => clamp_group1: range [20-39]%
+ * - TaskGroupA: util_min=10% => clamp_group0: range [ 0-19]%
+ * Then, when TaskA is part of TaskGroupA, it will be:
+ * - allocated in clamp_group1
+ * - clamp_group1.value=25
+ * while TaskA is running alone
+ * - clamp_group1.value=35
+ * since TaskB was RUNNABLE and until TaskA is RUNNABLE
+ * - clamp_group1.value=20
+ * i.e. CPU's clamp group value is reset to the nominal rounded value,
+ * while TaskA and TaskB are not RUNNABLE
*/
static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
{
@@ -1106,6 +1154,10 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
uc_cpu->value[clamp_id] = clamp_value;
}

+ /* Track the max effective clamp value for each CPU's clamp group */
+ if (clamp_value > uc_cpu->group[clamp_id][group_id].value)
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+
/*
* If this is the new max utilization clamp value, then we can update
* straight away the CPU clamp value. Otherwise, the current CPU clamp
@@ -1170,8 +1222,13 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
cpu_of(rq), clamp_id, group_id);
}
#endif
- if (clamp_value >= uc_cpu->value[clamp_id])
+ if (clamp_value >= uc_cpu->value[clamp_id]) {
+ /* Reset CPU's clamp value to rounded clamp group value */
+ clamp_value = uclamp_group_value(clamp_id, group_id);
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+
uclamp_cpu_update(rq, clamp_id, clamp_value);
+ }
}

/**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index aad826aa55f8..5b7d0965b090 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -95,3 +95,8 @@ SCHED_FEAT(UTIL_EST, true)
* Utilization clamping lazy update.
*/
SCHED_FEAT(UCLAMP_LAZY_UPDATE, false)
+
+/*
+ * Utilization clamping discretization.
+ */
+SCHED_FEAT(UCLAMP_ROUNDING, true)
--
2.18.0


2018-08-28 13:56:44

by Patrick Bellasi

Subject: [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values

The utilization is a well defined property of tasks and CPUs with an
in-kernel representation based on power-of-two values.
The current representation, in the [0..SCHED_CAPACITY_SCALE] range,
allows efficient computations in hot-paths and a sufficient fixed point
arithmetic precision.
However, the utilization values range is still an implementation detail
which is also possibly subject to changes in the future.

Since we don't want to commit new user-space APIs to any in-kernel
implementation detail, let's add an abstraction layer on top of the APIs
used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's
cpu.util_{min,max} attributes.

We do that by adding a couple of conversion functions which can be used
to conveniently transform utilization/capacity values from/to the internal
SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage
in the standard [0..100] range.
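
A minimal sketch of what such a pair of conversions could look like is shown below. It assumes a fixed-point scale of 1024 and round-to-nearest integer division; the helper names pct_to_scale() and scale_to_pct() are hypothetical and are not taken from the patch.

```c
#define FIXEDPOINT_SCALE 1024u  /* assumed value of SCHED_FIXEDPOINT_SCALE */

/* Convert a percentage in [0..100] to the internal [0..1024] range. */
static unsigned int pct_to_scale(unsigned int pct)
{
	/* Adding half the divisor rounds to the nearest integer. */
	return (pct * FIXEDPOINT_SCALE + 50u) / 100u;
}

/* Convert an internal value in [0..1024] back to a percentage. */
static unsigned int scale_to_pct(unsigned int value)
{
	return (value * 100u + FIXEDPOINT_SCALE / 2u) / FIXEDPOINT_SCALE;
}
```

With round-to-nearest in both directions, every integer percentage in [0..100] survives a round trip through the internal representation in this sketch, so a value written from user-space would be read back unchanged.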

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Others:
- rebased on v4.19-rc1

Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- none: this is a new patch
---
Documentation/admin-guide/cgroup-v2.rst | 10 +++----
include/linux/sched.h | 20 +++++++++++++
include/uapi/linux/sched/types.h | 14 +++++----
kernel/sched/core.c | 38 ++++++++++++++-----------
4 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 72272f58d304..4b236390273b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,7 +976,7 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.

- The requested minimum utilization in the range [0, 1023].
+ The requested minimum percentage of utilization in the range [0, 100].

This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
@@ -987,16 +987,16 @@ All time durations are in microseconds.
reports minimum utilization clamp value currently enforced on a task
group.

- The actual minimum utilization in the range [0, 1023].
+ The actual minimum percentage of utilization in the range [0, 100].

This value can be lower then cpu.util.min in case a parent cgroup
is enforcing a more restrictive clamping on minimum utilization.

cpu.util.max
A read-write single value file which exists on non-root cgroups.
- The default is "1023". i.e. no bandwidth clamping
+ The default is "100". i.e. no bandwidth clamping

- The requested maximum utilization in the range [0, 1023].
+ The requested maximum percentage of utilization in the range [0, 100].

This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
@@ -1007,7 +1007,7 @@ All time durations are in microseconds.
reports maximum utilization clamp value currently enforced on a task
group.

- The actual maximum utilization in the range [0, 1023].
+ The actual maximum percentage of utilization in the range [0, 100].

This value can be lower then cpu.util.max in case a parent cgroup
is enforcing a more restrictive clamping on max utilization.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4e5522ed57e0..ca0a80881fa9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -321,6 +321,26 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+static inline unsigned int util_from_pct(unsigned int pct)
+{
+ WARN_ON(pct > 100);
+
+ return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
+}
+
+static inline unsigned int util_to_pct(unsigned int value)
+{
+ unsigned int rounding = 0;
+
+ WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
+
+ /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
+ if (likely((value & 0xFF) && ~(value & 0x700)))
+ rounding = 1;
+
+ return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE));
+}
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 7512b5934013..b0fe00939fb3 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -84,16 +84,18 @@ struct sched_param {
*
* @sched_util_min represents the minimum utilization
* @sched_util_max represents the maximum utilization
+ * @sched_util_min represents the minimum utilization percentage
+ * @sched_util_max represents the maximum utilization percentage
*
- * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
- * represents the percentage of CPU time used by a task when running at the
- * maximum frequency on the highest capacity CPU of the system. Thus, for
- * example, a 20% utilization task is a task running for 2ms every 10ms.
+ * Utilization is a value in the range [0..100] which represents the
+ * percentage of CPU time used by a task when running at the maximum frequency
+ * on the highest capacity CPU of the system. Thus, for example, a 20%
+ * utilization task is a task running for 2ms every 10ms.
*
- * A task with a min utilization value bigger then 0 is more likely to be
+ * A task with a min utilization value bigger than 0% is more likely to be
* scheduled on a CPU which has a capacity big enough to fit the specified
* minimum utilization value.
- * A task with a max utilization value smaller then 1024 is more likely to be
+ * A task with a max utilization value smaller than 100% is more likely to be
* scheduled on a CPU which do not necessarily have more capacity then the
* specified max utilization value.
*/
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9ca881d1ff9e..222397edb8a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -730,15 +730,15 @@ static DEFINE_MUTEX(uclamp_mutex);

/*
* Minimum utilization for tasks in the root cgroup
- * default: 0
+ * default: 0%
*/
unsigned int sysctl_sched_uclamp_util_min;

/*
* Maximum utilization for tasks in the root cgroup
- * default: 1024
+ * default: 100%
*/
-unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+unsigned int sysctl_sched_uclamp_util_max = 100;

static struct uclamp_se uclamp_default[UCLAMP_CNT];

@@ -940,7 +940,7 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
max_value = max(max_value, uc_grp[group_id].value);

/* Stop if we reach the max possible clamp */
- if (max_value >= SCHED_CAPACITY_SCALE)
+ if (max_value >= 100)
break;
}

@@ -1397,7 +1397,7 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
result = -EINVAL;
if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
goto undo;
- if (sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE)
+ if (sysctl_sched_uclamp_util_max > 100)
goto undo;

/* Find a valid group_id for each required clamp value */
@@ -1424,13 +1424,15 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
/* Update each required clamp group */
if (old_min != sysctl_sched_uclamp_util_min) {
uc_se = &uclamp_default[UCLAMP_MIN];
+ value = util_from_pct(sysctl_sched_uclamp_util_min);
uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
- uc_se, sysctl_sched_uclamp_util_min);
+ uc_se, value);
}
if (old_max != sysctl_sched_uclamp_util_max) {
uc_se = &uclamp_default[UCLAMP_MAX];
+ value = util_from_pct(sysctl_sched_uclamp_util_max);
uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
- uc_se, sysctl_sched_uclamp_util_max);
+ uc_se, value);
}
goto done;

@@ -1525,7 +1527,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
: p->uclamp[UCLAMP_MAX].value;

if (upper_bound == UCLAMP_NOT_VALID)
- upper_bound = SCHED_CAPACITY_SCALE;
+ upper_bound = 100;
if (attr->sched_util_min > upper_bound) {
result = -EINVAL;
goto done;
@@ -1546,7 +1548,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
if (lower_bound == UCLAMP_NOT_VALID)
lower_bound = 0;
if (attr->sched_util_max < lower_bound ||
- attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+ attr->sched_util_max > 100) {
result = -EINVAL;
goto done;
}
@@ -1563,12 +1565,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
- uc_se, attr->sched_util_min);
+ uc_se, util_from_pct(attr->sched_util_min));
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
- uc_se, attr->sched_util_max);
+ uc_se, util_from_pct(attr->sched_util_max));
}

done:
@@ -5727,8 +5729,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN);
- attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX);
+ attr.sched_util_min = util_to_pct(uclamp_task_value(p, UCLAMP_MIN));
+ attr.sched_util_max = util_to_pct(uclamp_task_value(p, UCLAMP_MAX));
#endif

rcu_read_unlock();
@@ -7581,8 +7583,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
int ret = -EINVAL;
int group_id;

- if (min_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (min_value > 100)
return -ERANGE;
+ min_value = util_from_pct(min_value);

mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7626,8 +7630,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
int ret = -EINVAL;
int group_id;

- if (max_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (max_value > 100)
return -ERANGE;
+ max_value = util_from_pct(max_value);

mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7677,7 +7683,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
: tg->uclamp[clamp_id].value;
rcu_read_unlock();

- return util_clamp;
+ return util_to_pct(util_clamp);
}

static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
--
2.18.0


2018-08-28 13:56:54

by Patrick Bellasi

Subject: [PATCH v4 16/16] sched/cpufreq: uclamp: add utilization clamping for RT tasks

Currently schedutil enforces a maximum frequency when RT tasks are
RUNNABLE. Such a mandatory policy can be made tunable from userspace,
for example to define a max frequency which is still reasonable for
the execution of a specific RT workload. This will help make the RT
class friendlier to power/energy sensitive use-cases.

This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the util required is
defined by its task specific clamp value. However, we still want to run
at maximum capacity those RT tasks which:
- do not have task specific clamp values
- run either in the root task group or an autogroup

Let's add uclamp_default_perf, a special set of clamp values to be used
for tasks that require maximum performance. This set of clamps is then
used whenever the above conditions match for an RT task being enqueued
on a CPU.

Since utilization clamping now applies to both CFS and RT tasks, we
clamp the combined utilization of these two classes.
This approach, contrary to combining individually clamped utilizations,
is more power efficient. Indeed, it selects lower frequencies when we
have both RT and CFS clamped tasks.
However, it could also affect performance of the lower priority CFS
class, since the CFS's minimum utilization clamp could be completely
eclipsed by the RT workloads.
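
The difference between the two approaches can be sketched in user-space as
follows. This is a simplified model, not the schedutil code: clamp_util() is
a stand-in for uclamp_util(), the two aggregate helpers are hypothetical, and
a single [util_min, util_max] pair is assumed to apply to both classes.

```c
#include <assert.h>

/* Clamp util into [util_min, util_max]; stands in for uclamp_util(). */
static unsigned long clamp_util(unsigned long util, unsigned long util_min,
				unsigned long util_max)
{
	assert(util_min <= util_max);
	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}

/* Approach described above: clamp the sum of RT and CFS utilization. */
static unsigned long clamp_combined(unsigned long rt, unsigned long cfs,
				    unsigned long util_min,
				    unsigned long util_max)
{
	return clamp_util(rt + cfs, util_min, util_max);
}

/* Alternative: clamp each class on its own, then add the results. */
static unsigned long clamp_individually(unsigned long rt, unsigned long cfs,
					unsigned long util_min,
					unsigned long util_max)
{
	return clamp_util(rt, util_min, util_max) +
	       clamp_util(cfs, util_min, util_max);
}
```

With rt = 300, cfs = 400 and a max clamp of 512, the combined form yields 512
while the individual form yields 700, hence the lower frequency. With
rt = 300, cfs = 50 and a min clamp of 200, the combined form yields 350: the
RT utilization alone already exceeds the minimum, so the CFS boost has no
visible effect.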

The IO wait boost value is also subject to clamping for RT tasks.
This is to ensure that RT tasks as well as CFS ones are always subject
to the set of current utilization clamping constraints.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180813150112.GE2605@e110439-lin>
- remove UCLAMP_SCHED_CLASS policy since we do not have in the current
implementation a proper per-sched_class clamp tracking support
Message-ID: <20180809155551.bp46sixk4u3ilcnh@queper01-lin>
- add default boost for not clamped RT tasks
Others:
- rebased on v4.19-rc1

Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++------
kernel/sched/cpufreq_schedutil.c | 22 ++++++++++++----------
kernel/sched/rt.c | 4 ++++
3 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f71e15eaf152..9761457af1ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -741,6 +741,7 @@ unsigned int sysctl_sched_uclamp_util_min;
unsigned int sysctl_sched_uclamp_util_max = 100;

static struct uclamp_se uclamp_default[UCLAMP_CNT];
+static struct uclamp_se uclamp_default_perf[UCLAMP_CNT];

/**
* uclamp_map: reference counts a utilization "clamp value"
@@ -1052,10 +1053,15 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
*/
if (unclamped && (task_group_is_autogroup(task_group(p)) ||
task_group(p) == &root_task_group)) {
- p->uclamp[clamp_id].effective.value =
- uclamp_default[clamp_id].value;

- return uclamp_default[clamp_id].group_id;
+ /* Unclamped RT tasks: max perfs by default */
+ uc_se = task_has_rt_policy(p)
+ ? &uclamp_default_perf[clamp_id]
+ : &uclamp_default[clamp_id];
+
+ p->uclamp[clamp_id].effective.value = uc_se->value;
+
+ return uc_se->group_id;
}

/* Use TG's clamp value to limit task specific values */
@@ -1069,10 +1075,15 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
#else
/* By default, all tasks get the system default clamp value */
if (unclamped) {
- p->uclamp[clamp_id].effective.value =
- uclamp_default[clamp_id].value;

- return uclamp_default[clamp_id].group_id;
+ /* Unclamped RT tasks: max perfs by default */
+ uc_se = task_has_rt_policy(p)
+ ? &uclamp_default_perf[clamp_id]
+ : &uclamp_default[clamp_id];
+
+ p->uclamp[clamp_id].effective.value = uc_se->value;
+
+ return uc_se->group_id;
}
#endif

@@ -1761,6 +1772,13 @@ static void __init init_uclamp(void)
uc_se->group_id = UCLAMP_NOT_VALID;
uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se,
uclamp_none(clamp_id));
+
+ /* Init max perf clamps: default for RT tasks */
+ uc_se = &uclamp_default_perf[clamp_id];
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se,
+ uclamp_none(UCLAMP_MAX));
+
}
}

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 949082555ee8..8a2d12a691eb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -205,7 +205,10 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);

- if (rt_rq_is_runnable(&rq->rt))
+ util = rt_rq_is_runnable(&rq->rt)
+ ? uclamp_util(rq, SCHED_CAPACITY_SCALE)
+ : cpu_util_rt(rq);
+ if (unlikely(util >= max))
return max;

/*
@@ -223,13 +226,14 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
- * CFS utilization can be boosted or capped, depending on utilization
- * clamp constraints configured for currently RUNNABLE tasks.
+ * CFS and RT utilizations can be boosted or capped, depending on
+ * utilization constraints enforced by currently RUNNABLE tasks.
*/
- util = cpu_util_cfs(rq);
+ util += cpu_util_cfs(rq);
if (util)
util = uclamp_util(rq, util);
- util += cpu_util_rt(rq);
+ if (unlikely(util >= max))
+ return max;

/*
* We do not make cpu_util_dl() a permanent part of this sum because we
@@ -333,13 +337,11 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since RT tasks are not utiliation clamped, we don't want
- * to apply clamping on IO boost while there is blocked RT
- * utilization.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
max_boost = sg_cpu->iowait_boost_max;
- if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
- max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);

/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2e2955a8cf8f..06ec33467dd9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2404,6 +2404,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
--
2.18.0


2018-08-28 13:57:05

by Patrick Bellasi

Subject: [PATCH v4 12/16] sched/core: uclamp: update CPU's refcount on TG's clamp changes

When a task group refcounts a new clamp group, we need to ensure that
the new clamp values are immediately enforced on all its tasks which are
currently RUNNABLE. This is to ensure that all currently RUNNABLE tasks
are boosted and/or clamped as requested as soon as possible.

Let's ensure that, whenever a new clamp group is refcounted by a task
group, all its RUNNABLE tasks are correctly accounted in their
respective CPUs. We do that by slightly refactoring uclamp_group_get()
to take an additional *cgroup_subsys_state parameter which, when
provided, is used to walk the list of tasks in the corresponding TGs
and update the RUNNABLE ones.

This is a "brute force" solution which allows reusing the same refcount
update code already used by the per-task API. That's also the only way
to ensure a prompt enforcement of new clamp constraints on RUNNABLE
tasks, as soon as a task group attribute is tweaked.
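
The walk described above can be pictured with a small user-space model. It is
only a sketch under simplifying assumptions: struct model_task, the
cpu_group_tasks counters and both helpers are hypothetical names for this
example, there is no locking, and a single clamp index is tracked.

```c
#include <assert.h>

#define NR_CPUS_MODEL	2	/* hypothetical sizes for this model */
#define NR_GROUPS_MODEL	4

struct model_task {
	int runnable;	/* 1 while enqueued on a CPU */
	int cpu;	/* CPU the task is enqueued on */
	int group_id;	/* clamp group accounted for the task */
};

/* Per-CPU count of RUNNABLE tasks in each clamp group. */
static int cpu_group_tasks[NR_CPUS_MODEL][NR_GROUPS_MODEL];

/* Enqueue: account the task in its clamp group on the target CPU. */
static void model_enqueue(struct model_task *p, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	p->cpu = cpu;
	p->runnable = 1;
	cpu_group_tasks[cpu][p->group_id]++;
}

/*
 * The task group moved to a new clamp group: walk all its tasks and move
 * the RUNNABLE ones right away. Sleeping tasks only record the new group
 * and are accounted at their next enqueue.
 */
static void model_tg_set_group(struct model_task *tasks, int nr_tasks,
			       int next_group_id)
{
	int i;

	assert(next_group_id >= 0 && next_group_id < NR_GROUPS_MODEL);
	for (i = 0; i < nr_tasks; i++) {
		struct model_task *p = &tasks[i];

		if (p->runnable) {
			cpu_group_tasks[p->cpu][p->group_id]--;
			cpu_group_tasks[p->cpu][next_group_id]++;
		}
		p->group_id = next_group_id;
	}
}
```

In this model the lazy mode would amount to skipping the counter updates for
RUNNABLE tasks and letting the next enqueue pick up the new group.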

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Others:
- rebased on v4.19-rc1

Changes in v3:
- rebased on tip/sched/core
- fixed some typos
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 52 ++++++++++++++++++++++++++++++++---------
kernel/sched/features.h | 5 ++++
2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbc8d9fdfdbb..9ca881d1ff9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1306,9 +1306,30 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
}

+static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int group_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /*
+ * In lazy update mode, tasks will be accounted into the right clamp
+ * group the next time they will be requeued.
+ */
+ if (unlikely(sched_feat(UCLAMP_LAZY_UPDATE)))
+ return;
+
+ /* Update clamp groups for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id, group_id);
+ css_task_iter_end(&it);
+}
+
/**
* uclamp_group_get: increase the reference count for a clamp group
* @p: the task which clamp value must be tracked
+ * @css: the task group which clamp value must be tracked
* @clamp_id: the clamp index affected by the task
* @next_group_id: the clamp group to refcount
* @uc_se: the utilization clamp data for the task
@@ -1320,6 +1341,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
* the task to reference count the clamp value on CPUs while enqueued.
*/
static inline void uclamp_group_get(struct task_struct *p,
+ struct cgroup_subsys_state *css,
int clamp_id, int next_group_id,
struct uclamp_se *uc_se,
unsigned int clamp_value)
@@ -1339,6 +1361,10 @@ static inline void uclamp_group_get(struct task_struct *p,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);

+ /* Newly created TGs don't have tasks assigned */
+ if (css)
+ uclamp_group_get_tg(css, clamp_id, next_group_id);
+
/* Update CPU's clamp group refcounts of RUNNABLE task */
if (p)
uclamp_task_update_active(p, clamp_id, next_group_id);
@@ -1398,12 +1424,12 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
/* Update each required clamp group */
if (old_min != sysctl_sched_uclamp_util_min) {
uc_se = &uclamp_default[UCLAMP_MIN];
- uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, sysctl_sched_uclamp_util_min);
}
if (old_max != sysctl_sched_uclamp_util_max) {
uc_se = &uclamp_default[UCLAMP_MAX];
- uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, sysctl_sched_uclamp_util_max);
}
goto done;
@@ -1448,7 +1474,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,

next_group_id = parent->uclamp[clamp_id].group_id;
uc_se->group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, next_group_id, uc_se,
parent->uclamp[clamp_id].value);
}

@@ -1536,12 +1562,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
- uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, attr->sched_util_min);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
- uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, attr->sched_util_max);
}

@@ -1593,7 +1619,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID;

p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, next_group_id, uc_se,
p->uclamp[clamp_id].value);

/* By default we do not want task-specific clamp values */
@@ -1631,7 +1657,7 @@ static void __init init_uclamp(void)
/* Init init_task's clamp group */
uc_se = &init_task.uclamp[clamp_id];
uc_se->group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se,
uclamp_none(clamp_id));
/*
* By default we do not want task-specific clamp values,
@@ -1652,14 +1678,14 @@ static void __init init_uclamp(void)
* all child groups.
*/
uc_se->group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se,
uclamp_none(UCLAMP_MAX));
#endif

/* Init system defaul's clamp group */
uc_se = &uclamp_default[clamp_id];
uc_se->group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, 0, uc_se,
uclamp_none(clamp_id));
}
}
@@ -7540,6 +7566,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,

uc_se->effective.value = value;
uc_se->effective.group_id = group_id;
+
+ /* Immediately update descendants' active tasks */
+ if (css != top_css)
+ uclamp_group_get_tg(css, clamp_id, group_id);
}
}

@@ -7579,7 +7609,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,

/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MIN];
- uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value);
+ uclamp_group_get(NULL, css, UCLAMP_MIN, group_id, uc_se, min_value);

out:
rcu_read_unlock();
@@ -7624,7 +7654,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,

/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MAX];
- uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value);
+ uclamp_group_get(NULL, css, UCLAMP_MAX, group_id, uc_se, max_value);

out:
rcu_read_unlock();
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..aad826aa55f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,8 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Utilization clamping lazy update.
+ */
+SCHED_FEAT(UCLAMP_LAZY_UPDATE, false)
--
2.18.0


2018-08-28 13:57:14

by Patrick Bellasi

Subject: [PATCH v4 07/16] sched/core: uclamp: extend cpu's cgroup controller

The cgroup's CPU controller allows assigning a specified (maximum)
bandwidth to the tasks of a group. However, this bandwidth is defined
and enforced only on a temporal basis, without considering the actual
frequency a CPU is running at. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency at which the CPU is running that task.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the above mechanisms, it is now possible to extend the cpu
controller to specify the minimum (or maximum) utilization which
a task is expected (or allowed) to generate.
Constraints on minimum and maximum utilization allowed for tasks in a
CPU cgroup can improve the control on the actual amount of CPU bandwidth
consumed by tasks.

Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can be
conveniently used to place important interactive tasks on more capable
CPUs or to run low priority and background tasks on more energy
efficient CPUs.

The ultimate goal of utilization clamping is thus to enable:

- boosting: by selecting a higher capacity CPU and/or higher execution
frequency for small tasks which are affecting the user
interactive experience.

- capping: by selecting more energy efficient CPUs or lower execution
frequency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.

Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich information hints to the
scheduler regarding the tasks it's going to schedule.

This patch extends the CPU controller by adding a couple of new
attributes, util.min and util.max, which can be used to enforce task's
utilization boosting and capping. Specifically:

- util.min: defines the minimum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run at least at a minimum frequency which
corresponds to the min_util utilization

- util.max: defines the maximum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run up to a maximum frequency which
corresponds to the max_util utilization
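
To make the effect of the two attributes on frequency selection concrete,
here is a minimal sketch. It assumes the requested frequency is simply
proportional to the clamped utilization; schedutil also adds some headroom
on top, which is left out here. clamped_freq_khz() and the 2 GHz maximum
used in the note below are hypothetical, chosen only for illustration.

```c
#include <assert.h>

#define CAPACITY_SCALE 1024ul	/* mirrors SCHED_CAPACITY_SCALE */

/* Restrict util to the [util_min, util_max] range of the group. */
static unsigned long clamp_util(unsigned long util, unsigned long util_min,
				unsigned long util_max)
{
	assert(util_min <= util_max);
	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}

/*
 * Frequency requested for a CPU, proportional to its clamped utilization.
 * Hypothetical helper: the real governor applies extra headroom.
 */
static unsigned long clamped_freq_khz(unsigned long max_freq_khz,
				      unsigned long util,
				      unsigned long util_min,
				      unsigned long util_max)
{
	unsigned long clamped = clamp_util(util, util_min, util_max);

	return max_freq_khz * clamped / CAPACITY_SCALE;
}
```

On a CPU with a 2000000 kHz maximum, a small task with utilization 100 in a
group with util.min = 512 would request 1000000 kHz instead of about
195000 kHz, while a big task with utilization 900 in a group with
util.max = 256 would be limited to 500000 kHz.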

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies
b) do not enforce any constraints and/or dependency between the parent
and its child nodes, thus relying on the delegation model and
permission settings defined by the system management software
c) allow (eventually) further restricting task-specific clamps defined
via sched_setattr(2)

This patch provides the basic support to expose the two new attributes
and to validate their run-time updates. However, we do not actually
allocate clamp groups and thus the write calls added by this patch
always return -EINVAL. The following patches will provide the missing
bits.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Others:
- consolidate init_uclamp_sched_group() into init_uclamp()
- refcount root_task_group's clamp groups from init_uclamp()
- small documentation fixes

Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
Others:
- rebased on v4.19-rc1
Changes in v2:
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system wide API seems of not immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Others:
- rebased on v4.18-rc4
- reduced code to simplify the review of this patch
which now provides just the basic code for CGroups integration
- add attributes to the default hierarchy as well as the legacy one
- use -ERANGE as range violation error

These additional bits:
- refcounting of clamp groups
- RUNNABLE tasks refcount updates
- aggregation of per-task and per-task_group utilization constraints
are provided in separate and following patches to make it more clear and
documented how they are performed.
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++
include/linux/sched.h | 4 +
init/Kconfig | 22 ++++
kernel/sched/core.c | 154 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
5 files changed, 210 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..80ef7bdc517b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -907,6 +907,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is based, by default, on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -966,6 +972,25 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.

+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no bandwidth boosting.
+
+ The minimum utilization in the range [0, 1023].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1023". i.e. no bandwidth clamping
+
+ The maximum utilization in the range [0, 1023].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.

Memory
------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7385f0b1a7c0..dc39b67a366a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -583,6 +583,10 @@ struct sched_dl_entity {
*
* A utilization clamp group maps a "clamp value" (value), i.e.
* util_{min,max}, to a "clamp group index" (group_id).
+ *
+ * The same "group_id" can be used by multiple scheduling entities, i.e.
+ * either tasks or task groups, to enforce the same clamp "value" for a given
+ * clamp index.
*/
struct uclamp_se {
unsigned int value;
diff --git a/init/Kconfig b/init/Kconfig
index 10536cb83295..089db7a804a8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -827,6 +827,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any
+ task-specific clamp value is constrained by the cgroup-specified
+ clamp value. Both minimum and maximum task clamping cannot
+ be bigger than the corresponding clamping defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ba0e7208c65a..dcbf22abd0bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1233,6 +1233,41 @@ static inline void uclamp_group_get(struct task_struct *p,
uclamp_group_put(clamp_id, prev_group_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * alloc_uclamp_sched_group: initialize a new TG for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same parent's group index.
+ *
+ * Return: 0 on error
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+ uc_se->value = parent->uclamp[clamp_id].value;
+ uc_se->group_id = parent->uclamp[clamp_id].group_id;
+ }
+
+ return 1;
+}
+#else /* CONFIG_UCLAMP_TASK_GROUP */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1376,12 +1411,24 @@ static void __init init_uclamp(void)
uc_se->group_id = UCLAMP_NOT_VALID;
uclamp_group_get(NULL, clamp_id, 0, uc_se,
uclamp_none(clamp_id));
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Init root TG's clamp group */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->group_id = 0;
+#endif
}
}

#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -6955,6 +7002,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;

err:
@@ -7175,6 +7225,84 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MIN].value == min_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MAX].value < min_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MAX].value == max_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MIN].value > max_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7512,6 +7640,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7679,6 +7819,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util_max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1b05b38b1081..489d7403affe 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,11 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.18.0


2018-08-28 13:57:23

by Patrick Bellasi

Subject: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

The number of clamp groups supported is limited and defined at compile
time. However, a malicious user can currently ask for many different
clamp values thus consuming all the available clamp groups.

Since on properly configured systems we expect only a limited set of
different clamp values, the previous problem can be mitigated by
allowing access to clamp groups configuration only to privileged tasks.
This should still allow a System Management Software to properly
pre-configure the system.

Let's restrict the tuning of utilization clamp values, by default, to
tasks with CAP_SYS_ADMIN capabilities.

Should this be considered too restrictive and/or not required for a
specific platform, a kernel boot option (uclamp_user) is provided to
change this default behavior, thus allowing non-privileged tasks to
change their utilization clamp values.
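
As an illustration only, the permission check described above can be
modeled in user space as follows. This is a minimal sketch, not the kernel
code: toy_uclamp_permission() and its boolean parameters are hypothetical
stand-ins for capable(CAP_SYS_ADMIN) and for the "user" argument of
__sched_setscheduler(), and the static flag mimics the state set by the
uclamp_user boot option.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flag set by the "uclamp_user" boot option. */
static bool uclamp_user_allowed;

/*
 * Toy model of the permission gate. has_cap_sys_admin stands in for
 * capable(CAP_SYS_ADMIN); user is true for requests coming from user space.
 * Returns 0 when the request may proceed, -EPERM otherwise.
 */
static int toy_uclamp_permission(bool has_cap_sys_admin, bool user)
{
	if (!has_cap_sys_admin && user && !uclamp_user_allowed)
		return -EPERM;
	return 0;
}

/* Walk the decision table for both settings of the boot option. */
static void toy_uclamp_permission_selfcheck(void)
{
	/* Default: only privileged or in-kernel requests are accepted. */
	uclamp_user_allowed = false;
	assert(toy_uclamp_permission(false, true) == -EPERM);
	assert(toy_uclamp_permission(true, true) == 0);
	assert(toy_uclamp_permission(false, false) == 0);

	/* With the boot option set, unprivileged requests are accepted too. */
	uclamp_user_allowed = true;
	assert(toy_uclamp_permission(false, true) == 0);
}
```

The only rejected combination is an unprivileged user-space request made
while the boot option is not set.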

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Others:
- new patch added in this version
- rebased on v4.19-rc1
---
.../admin-guide/kernel-parameters.txt | 3 +++
kernel/sched/core.c | 22 ++++++++++++++++---
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9871e649ffef..481f8214ea9a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4561,6 +4561,9 @@
<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
See also Documentation/input/devices/joystick-parport.rst

+ uclamp_user [KNL] Enable task-specific utilization clamping tuning
+ also from tasks without CAP_SYS_ADMIN capability.
+
udbg-immortal [PPC] When debugging early kernel crashes that
happen after console_init() and before a proper
console driver takes over, this boot options might
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 222397edb8a7..8341ce580a9a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
static inline void free_uclamp_sched_group(struct task_group *tg) { }
#endif /* CONFIG_UCLAMP_TASK_GROUP */

+static bool uclamp_user_allowed __read_mostly;
+static int __init uclamp_user_allow(char *str)
+{
+ uclamp_user_allowed = true;
+
+ return 0;
+}
+early_param("uclamp_user", uclamp_user_allow);
+
static inline int __setscheduler_uclamp(struct task_struct *p,
- const struct sched_attr *attr)
+ const struct sched_attr *attr,
+ bool user)
{
int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
int lower_bound, upper_bound;
struct uclamp_se *uc_se;
int result = 0;

+ if (!capable(CAP_SYS_ADMIN) &&
+ user && !uclamp_user_allowed) {
+ return -EPERM;
+ }
+
mutex_lock(&uclamp_mutex);

/* Find a valid group_id for each required clamp value */
@@ -1702,7 +1717,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
return 1;
}
static inline int __setscheduler_uclamp(struct task_struct *p,
- const struct sched_attr *attr)
+ const struct sched_attr *attr,
+ bool user)
{
return -EINVAL;
}
@@ -5217,7 +5233,7 @@ static int __sched_setscheduler(struct task_struct *p,

/* Configure utilization clamps for the task */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
- retval = __setscheduler_uclamp(p, attr);
+ retval = __setscheduler_uclamp(p, attr, user);
if (retval)
return retval;
}
--
2.18.0


2018-08-28 13:57:38

by Patrick Bellasi

Subject: [PATCH v4 05/16] sched/core: uclamp: enforce last task UCLAMP_MAX

When a util_max clamped task sleeps, its clamp constraints are removed
from the CPU. However, the blocked utilization on that CPU can still be
higher than the max clamp value enforced while that task was running.
This max clamp removal when a CPU is going to be idle could thus allow
unwanted CPU frequency increases, right while the task is not running.

This can happen, for example, when there is another (smaller) task
running on a different CPU of the same frequency domain.
In this case, when we aggregate the utilization of all the CPUs in a
shared frequency domain, schedutil can still see the full non-clamped
blocked utilization of all the CPUs and thus eventually increase the
frequency.

Let's fix this by using:

uclamp_cpu_put_id(UCLAMP_MAX)
uclamp_cpu_update(last_clamp_value)

to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition. Thus, while a CPU is idle, we can still enforce the last used
clamp value for it.

On the contrary, we do not track any UCLAMP_MIN since, while a CPU is
idle, we don't want to enforce any minimum frequency.
Indeed, we rely just on blocked load decay to smoothly reduce the
frequency.
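
For illustration, the idle clamp holding described above can be sketched
with a small user-space model. This is a simplified sketch under stated
assumptions: the toy_* names, the fixed TOY_GROUPS count and the
single-clamp (util_max only) layout are hypothetical and are not the data
structures used by the patch.

```c
#include <assert.h>

#define UCLAMP_NOT_VALID  (-1)
#define UCLAMP_FLAG_IDLE  0x01
#define TOY_GROUPS        4	/* hypothetical number of clamp groups */

struct toy_group {
	int value;	/* clamp value mapped by this group */
	int tasks;	/* RUNNABLE tasks refcounted in this group */
};

struct toy_cpu {
	struct toy_group group[TOY_GROUPS];
	int max_clamp;	/* util_max currently enforced on the CPU */
	int flags;
};

/* Recompute the CPU util_max; hold the last value if no task is RUNNABLE. */
static void toy_cpu_update_max(struct toy_cpu *cpu, int last_clamp_value)
{
	int max_value = UCLAMP_NOT_VALID;

	for (int i = 0; i < TOY_GROUPS; i++) {
		if (!cpu->group[i].tasks)
			continue;
		if (cpu->group[i].value > max_value)
			max_value = cpu->group[i].value;
	}

	if (max_value == UCLAMP_NOT_VALID) {
		cpu->flags |= UCLAMP_FLAG_IDLE;
		max_value = last_clamp_value;
	}
	cpu->max_clamp = max_value;
}

/* Enqueue a task of the given group; drop any clamp held while idle. */
static void toy_enqueue(struct toy_cpu *cpu, int group_id)
{
	assert(group_id >= 0 && group_id < TOY_GROUPS);

	int value = cpu->group[group_id].value;

	cpu->group[group_id].tasks++;
	if (cpu->flags & UCLAMP_FLAG_IDLE) {
		cpu->flags &= ~UCLAMP_FLAG_IDLE;
		cpu->max_clamp = value;
	}
	if (cpu->max_clamp < value)
		cpu->max_clamp = value;
}

/* Dequeue a task; recompute only if it could have defined the max. */
static void toy_dequeue(struct toy_cpu *cpu, int group_id)
{
	assert(group_id >= 0 && group_id < TOY_GROUPS);
	assert(cpu->group[group_id].tasks > 0);

	int value = cpu->group[group_id].value;

	cpu->group[group_id].tasks--;
	if (value >= cpu->max_clamp)
		toy_cpu_update_max(cpu, value);
}
```

In this model, dequeuing the last RUNNABLE task keeps max_clamp at that
task's value and sets UCLAMP_FLAG_IDLE; the next enqueue clears the flag
and replaces the held value.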

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180816172016.GG2960@e110439-lin>
- ensure to always reset clamp holding on wakeup from IDLE
Others:
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Changes in v2:
- rebased on v4.18-rc4
- new patch to improve a specific issue
---
kernel/sched/core.c | 39 +++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 11 +++++++++++
2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64e5c96bfdaf..ba0e7208c65a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -910,7 +910,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
* For the specified clamp index, this method computes the new CPU utilization
* clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
*/
-static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
+ unsigned int last_clamp_value)
{
struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
int max_value = UCLAMP_NOT_VALID;
@@ -928,6 +929,24 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
if (max_value >= SCHED_CAPACITY_SCALE)
break;
}
+
+ /*
+ * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
+ * tasks, we want to keep the CPU clamped to the last task's clamp
+ * value. This is to avoid frequency spikes to MAX when one CPU, with
+ * a high blocked utilization, sleeps and another CPU, in the same
+ * frequency domain, no longer sees the clamp on the first CPU.
+ *
+ * The UCLAMP_FLAG_IDLE is set whenever we detect, from the above
+ * loop, that there are no more RUNNABLE tasks on that CPU.
+ * In this case we enforce the CPU util_max to that of the last
+ * dequeued task.
+ */
+ if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
+ rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
+ max_value = last_clamp_value;
+ }
+
rq->uclamp.value[clamp_id] = max_value;
}

@@ -962,13 +981,25 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;

+ /* Reset clamp holds on idle exit */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
+ /*
+ * This function is called for both UCLAMP_MIN (before) and
+ * UCLAMP_MAX (after). Let's reset the flag only the second time,
+ * once we know that UCLAMP_MIN has already been updated.
+ */
+ if (clamp_id == UCLAMP_MAX)
+ uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
+ uc_cpu->value[clamp_id] = clamp_value;
+ }
+
/*
* If this is the new max utilization clamp value, then we can update
* straight away the CPU clamp value. Otherwise, the current CPU clamp
* value is still valid and we are done.
*/
- uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
if (uc_cpu->value[clamp_id] < clamp_value)
uc_cpu->value[clamp_id] = clamp_value;
}
@@ -1026,7 +1057,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
}
#endif
if (clamp_value >= uc_cpu->value[clamp_id])
- uclamp_cpu_update(rq, clamp_id);
+ uclamp_cpu_update(rq, clamp_id, clamp_value);
}

/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 25d1d218ae10..411635c4c09a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -805,6 +805,17 @@ struct uclamp_group {
struct uclamp_cpu {
struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
int value[UCLAMP_CNT];
+/*
+ * Idle clamp holding
+ * Whenever a CPU is idle, we enforce the util_max clamp value of the last
+ * task running on that CPU. This bit is used to flag a clamp holding
+ * currently active for a CPU. This flag is:
+ * - set when we update the clamp value of a CPU at the time of dequeuing the
+ * last task before entering idle
+ * - reset when we enqueue the first task after a CPU wakeup from IDLE
+ */
+#define UCLAMP_FLAG_IDLE 0x01
+ int flags;
};
#endif /* CONFIG_UCLAMP_TASK */

--
2.18.0


2018-08-28 13:57:55

by Patrick Bellasi

Subject: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained by the parent's assigned
resources. This requires properly propagating and aggregating parent
attributes down to its descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is also the clamp value which is actually used to
clamp tasks in each task group.

Since it can be interesting for tasks in a cgroup to know exactly what
is the currently propagated/enforced configuration, the effective clamp
values are exposed to user-space by means of a new pair of read-only
attributes: cpu.util.{min,max}.effective.
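
The definition above can be illustrated with a small user-space sketch.
It is a simplified model, not the patch's implementation: struct toy_tg
and toy_update_effective() are hypothetical names, groups are assumed to
be passed in an order where each parent precedes its children, and the
subtree-skipping optimization used by cpu_util_update_hier() is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for a task group and one clamp index. */
struct toy_tg {
	struct toy_tg *parent;		/* NULL for the root group */
	unsigned int value;		/* requested clamp value */
	unsigned int effective;		/* value actually enforced */
};

static unsigned int toy_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Recompute every effective value as:
 *   effective = min(value, parent->effective)
 * groups[] must list each parent before its children, so the parent's
 * effective value is already up to date when a child is visited.
 */
static void toy_update_effective(struct toy_tg *groups[], size_t count)
{
	for (size_t i = 0; i < count; i++) {
		struct toy_tg *tg = groups[i];

		assert(tg != NULL);
		tg->effective = tg->parent
			? toy_min(tg->value, tg->parent->effective)
			: tg->value;
	}
}
```

With a root at 1023, a child requesting 512 and a grandchild requesting
800, the grandchild's effective value is 512; if the child is later
relaxed to 900, the grandchild's effective value becomes its own 800.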

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180816140731.GD2960@e110439-lin>
- add ".effective" attributes to the default hierarchy
Others:
- small documentation fixes
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <[email protected]>
- new patch in v3, to implement a suggestion from v1 review
---
Documentation/admin-guide/cgroup-v2.rst | 25 +++++-
include/linux/sched.h | 8 ++
kernel/sched/core.c | 112 +++++++++++++++++++++++-
3 files changed, 139 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 80ef7bdc517b..72272f58d304 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,22 +976,43 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.

- The minimum utilization in the range [0, 1023].
+ The requested minimum utilization in the range [0, 1023].

This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.

+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization in the range [0, 1023].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ is enforcing a more restrictive clamping on minimum utilization.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1023". i.e. no bandwidth clamping

- The maximum utilization in the range [0, 1023].
+ The requested maximum utilization in the range [0, 1023].

This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.

+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization in the range [0, 1023].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.
+
+
Memory
------

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc39b67a366a..2da130d17e70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -591,6 +591,14 @@ struct sched_dl_entity {
struct uclamp_se {
unsigned int value;
unsigned int group_id;
+ /*
+ * Effective task (group) clamp value.
+ * For task groups, this is the value (possibly) enforced by a parent
+ * task group.
+ */
+ struct {
+ unsigned int value;
+ } effective;
};

union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dcbf22abd0bf..b2d438b6484b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1254,6 +1254,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &tg->uclamp[clamp_id];
+ uc_se->effective.value =
+ parent->uclamp[clamp_id].effective.value;
uc_se->value = parent->uclamp[clamp_id].value;
uc_se->group_id = parent->uclamp[clamp_id].group_id;
}
@@ -1415,6 +1417,7 @@ static void __init init_uclamp(void)
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp group */
uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->effective.value = uclamp_none(clamp_id);
uc_se->value = uclamp_none(clamp_id);
uc_se->group_id = 0;
#endif
@@ -7226,6 +7229,68 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * cpu_util_update_hier: propagate effective clamp down the hierarchy
+ * @css: the task group to update
+ * @clamp_id: the clamp index to update
+ * @value: the new task group clamp value
+ *
+ * The effective clamp for a TG is expected to track the most restrictive
+ * value between the TG's clamp value and its parent's effective clamp value.
+ * This method achieves that by:
+ * 1. updating the current TG effective value
+ * 2. walking all the descendant task groups that need an update
+ *
+ * A TG's effective clamp needs to be updated when its current value is not
+ * matching the TG's clamp value. In this case indeed either:
+ * a) the parent has got a more relaxed clamp value
+ * thus potentially we can relax the effective value for this group
+ * b) the parent has got a more strict clamp value
+ * thus potentially we have to restrict the effective value of this group
+ *
+ * Restriction and relaxation of the current TG's effective clamp values need
+ * to be propagated down to all the descendants. When a subgroup is found whose
+ * effective clamp value already matches its clamp value, we can safely skip
+ * all its descendants, which are guaranteed to be already in sync.
+ */
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ int clamp_id, int value)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+
+ css_for_each_descendant_pre(css, top_css) {
+ /*
+ * The first visited task group is top_css, whose clamp value
+ * is the one passed as parameter. For descendant task
+ * groups we consider their current value.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (css != top_css)
+ value = uc_se->value;
+ /*
+ * Skip the whole subtree if the current effective clamp
+ * already matches the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive one, as effective clamp.
+ */
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+ if (uc_se->effective.value == value &&
+ uc_parent->effective.value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent->effective.value < value)
+ value = uc_parent->effective.value;
+ if (uc_se->effective.value == value)
+ continue;
+
+ uc_se->effective.value = value;
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -7245,6 +7310,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MAX].value < min_value)
goto out;

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+
out:
rcu_read_unlock();

@@ -7270,6 +7338,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MIN].value > max_value)
goto out;

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+
out:
rcu_read_unlock();

@@ -7277,14 +7348,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}

static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;

rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].effective.value
+ : tg->uclamp[clamp_id].value;
rcu_read_unlock();

return util_clamp;
@@ -7293,13 +7367,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}

static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */

@@ -7647,11 +7733,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7827,12 +7921,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util_max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
--
2.18.0


2018-08-28 13:57:56

by Patrick Bellasi

Subject: [PATCH v4 10/16] sched/core: uclamp: use TG's clamps to restrict Task's clamps

When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated with those of the CPU
controller's Task Group (TG) in which the task is currently living.

Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG effective value.

This also mimics what already happens for a task's CPU affinity mask when
the task is also living in a cpuset. The overall idea is that cgroup
attributes are always used to restrict the per-task attributes.

Thus, this implementation allows us to:

1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG

For this mechanism to work properly, we exploit the concept of
"effective" clamp, which is already used by a TG to track parent
enforced restrictions.
In this patch we re-use the same variable:
task_struct::uclamp::effective::group_id
to track the currently most restrictive clamp group each task is
subject to and thus is also currently refcounted into.

This solution allows also to better decouple the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.

For consistency purposes, as well as to properly inform userspace, the
sched_getattr(2) call is updated to always return the properly
aggregated constraints as described above. This will also make
sched_getattr(2) a convenient userspace API to know the utilization
constraints enforced on a task by the cgroup's CPU controller.
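
As a rough illustration of the aggregation rule above, the following
user-space sketch picks the clamp a task is accounted into. It is a
simplified model under stated assumptions: struct toy_clamp and
toy_task_effective() are hypothetical names, and the RUNNABLE fast-path
of uclamp_task_group_id() is not modeled.

```c
#include <assert.h>

/* Hypothetical pairing of a clamp value with its clamp group index. */
struct toy_clamp {
	int value;
	int group_id;
};

/*
 * The task specific clamp is used as long as it does not exceed the
 * effective clamp of the task group; otherwise the task group's clamp
 * (and clamp group) takes over. Tasks can thus only ask for less than
 * what their task group grants.
 */
static struct toy_clamp toy_task_effective(struct toy_clamp task,
					   struct toy_clamp tg_effective)
{
	assert(task.value >= 0 && tg_effective.value >= 0);

	if (task.value > tg_effective.value)
		return tg_effective;
	return task;
}
```

For example, a task requesting 800 inside a group limited to 512 is
accounted in the group's clamp group, while a task requesting 300 in the
same group keeps its own clamp group.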

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180816140731.GD2960@e110439-lin>
- reuse already existing:
task_struct::uclamp::effective::group_id
instead of adding:
task_struct::uclamp_group_id
to back annotate the effective clamp group in which a task has been
refcounted
Others:
- small documentation fixes
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- fix not required override
- fix typos in changelog
Others:
- clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
clamp group_id/value code into dedicated getter functions:
uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
- rebased on tip/sched/core
Changes in v2:
OSPM discussion:
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values, i.e. tasks running on a
TG are only allowed to demote themselves.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 86 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 +-
2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e617a7b18f2d..da0b3bd41e96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -950,14 +950,75 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
rq->uclamp.value[clamp_id] = max_value;
}

+/**
+ * uclamp_task_group_id: get the effective clamp group index of a task
+ *
+ * The effective clamp group index of a task depends on its status, RUNNABLE
+ * or SLEEPING, and on:
+ * - the task specific clamp value, when !UCLAMP_NOT_VALID
+ * - its task group effective clamp value, for tasks not in the root group
+ * - the system default clamp value, for tasks in the root group
+ *
+ * This method returns the effective group index for a task, depending on its
+ * status and a proper aggregation of the clamp values listed above.
+ */
+static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
+{
+ struct uclamp_se *uc_se;
+ int clamp_value;
+ int group_id;
+
+ /* Task currently accounted into a clamp group */
+ if (uclamp_task_affects(p, clamp_id))
+ return p->uclamp[clamp_id].effective.group_id;
+
+ /* Task specific clamp value */
+ uc_se = &p->uclamp[clamp_id];
+ clamp_value = uc_se->value;
+ group_id = uc_se->group_id;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Use TG's clamp value to limit task specific values */
+ uc_se = &task_group(p)->uclamp[clamp_id];
+ if (clamp_value > uc_se->effective.value)
+ group_id = uc_se->effective.group_id;
+#endif
+
+ return group_id;
+}
+
+static inline int uclamp_group_value(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ if (group_id == UCLAMP_NOT_VALID)
+ return uclamp_none(clamp_id);
+
+ return uc_map[group_id].value;
+}
+
+static inline int uclamp_task_value(struct task_struct *p, int clamp_id)
+{
+ int group_id = uclamp_task_group_id(p, clamp_id);
+
+ return uclamp_group_value(clamp_id, group_id);
+}
+
/**
* uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
* @p: the task being enqueued on a CPU
* @rq: the CPU's rq where the clamp group has to be reference counted
* @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
*
- * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
- * the task's uclamp.group_id is reference counted on that CPU.
+ * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group,
+ * between the task-specific one and that of the task's cgroup, is reference
+ * counted on that CPU.
+ *
+ * Since the CPU's reference counted clamp group can be either that of the task
+ * or of its cgroup, we keep track of the reference counted clamp group by
+ * storing its index (group_id) into task_struct::uclamp::effective::group_id.
+ * This group index will then be used at task's dequeue time to release the
+ * correct refcount.
*/
static inline void uclamp_cpu_get_id(struct task_struct *p,
struct rq *rq, int clamp_id)
@@ -968,7 +1029,7 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
int group_id;

/* Every task must reference a clamp group */
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = uclamp_task_group_id(p, clamp_id);
#ifdef CONFIG_SCHED_DEBUG
if (unlikely(group_id == UCLAMP_NOT_VALID)) {
WARN(1, "invalid task [%d:%s] clamp group\n",
@@ -977,6 +1038,9 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
}
#endif

+ /* Track the effective clamp group */
+ p->uclamp[clamp_id].effective.group_id = group_id;
+
/* Reference count the task into its current group_id */
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;
@@ -1025,7 +1089,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
int group_id;

/* New tasks don't have a previous clamp group */
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = p->uclamp[clamp_id].effective.group_id;
if (group_id == UCLAMP_NOT_VALID)
return;

@@ -1040,6 +1104,9 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
}
#endif

+ /* Flag the task as not affecting any clamp index */
+ p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID;
+
/* If this is not the last task, no updates are required */
if (uc_grp[group_id].tasks > 0)
return;
@@ -1402,6 +1469,8 @@ static void uclamp_fork(struct task_struct *p, bool reset)
next_group_id = 0;
p->uclamp[clamp_id].value = uclamp_none(clamp_id);
}
+ /* Forked tasks are not yet enqueued */
+ p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID;

p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
@@ -5497,8 +5566,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
- attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+ attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN);
+ attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX);
#endif

rcu_read_unlock();
@@ -7308,8 +7377,11 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
* groups we consider their current value.
*/
uc_se = &css_tg(css)->uclamp[clamp_id];
- if (css != top_css)
+ if (css != top_css) {
value = uc_se->value;
+ group_id = uc_se->effective.group_id;
+ }
+
/*
* Skip the whole subtrees if the current effective clamp is
* already matching the TG's clamp value.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 489d7403affe..72b022b9a407 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2240,7 +2240,7 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
*/
static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
{
- return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID);
+ return (p->uclamp[clamp_id].effective.group_id != UCLAMP_NOT_VALID);
}
#endif /* CONFIG_UCLAMP_TASK */

--
2.18.0


2018-08-28 13:58:02

by Patrick Bellasi

Subject: [PATCH v4 11/16] sched/core: uclamp: add system default clamps

Clamp values cannot be tuned at the root cgroup level. Moreover, because
of the delegation model requirements and how the parent clamps
propagation works, if we want to enable subgroups to set a non null
util.min, we need to be able to configure the root group util.min to
allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).

Unfortunately this setup also means that all tasks running in the
root group will always get a maximum util.min clamp, unless they have a
lower task specific clamp, which is definitely not a desirable default
configuration.

Let's fix this by explicitly adding a system default configuration
(sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
for all tasks running in the root group.

This interface is available independently from cgroups, thus providing a
complete solution for system wide utilization clamping configuration.
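
As a rough illustration of the sanity checks applied to the two system
defaults, here is a minimal userspace sketch. The sketch_* names are
hypothetical and exist only for this example. Note one deliberate
difference: the handler in this patch lets proc_dointvec() write the new
values first and restores the old ones on failure, while the sketch
validates before committing; the accepted and rejected inputs are meant
to be the same.

```c
#define SKETCH_CAPACITY_SCALE 1024u

/* Hypothetical container for the two system default clamps. */
struct sketch_defaults {
	unsigned int util_min;
	unsigned int util_max;
};

/*
 * Accept the new pair only if min <= max <= SKETCH_CAPACITY_SCALE.
 * Returns 0 on success, -1 on rejection; on rejection *cur is untouched.
 */
static int sketch_set_defaults(struct sketch_defaults *cur,
			       unsigned int new_min, unsigned int new_max)
{
	if (new_min > new_max)
		return -1;
	if (new_max > SKETCH_CAPACITY_SCALE)
		return -1;

	cur->util_min = new_min;
	cur->util_max = new_max;
	return 0;
}
```

Starting from the defaults (0, 1024), a request for (200, 800) is
accepted, while (900, 800) or (0, 2000) is rejected and leaves the
previous pair in place.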

Each task has now by default:
task_struct::uclamp::value = UCLAMP_NOT_VALID
unless:
- the task has been forked from a parent with a valid clamp and
!SCHED_FLAG_RESET_ON_FORK
- the task has got a task-specific value set via sched_setattr()

A task with a UCLAMP_NOT_VALID clamp value is refcounted considering the
system default clamps if either we do not have task group support or
it is part of the root_task_group.
Tasks without a task specific clamp value in a child task group will be
refcounted instead considering the task group clamps.
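
The selection rules above can be modelled with a small userspace C
function. This is a sketch under stated assumptions: sketch_resolve()
and SKETCH_NOT_VALID are names invented for the example, and the model
covers only the case where task group support is enabled.

```c
#include <stdbool.h>

#define SKETCH_NOT_VALID (-1)

/*
 * Pick the clamp value a task is refcounted with:
 *  - no task-specific value, root group: the system default
 *  - no task-specific value, child group: the group's effective clamp
 *  - task-specific value: kept, unless the group's effective clamp is
 *    lower, in which case the group's clamp restricts it
 */
static int sketch_resolve(int task_value, bool in_root_group,
			  int tg_effective, int sys_default)
{
	bool unclamped = (task_value == SKETCH_NOT_VALID);

	if (unclamped && in_root_group)
		return sys_default;
	if (unclamped || task_value > tg_effective)
		return tg_effective;
	return task_value;
}
```

For instance, with a system default of 300 and a group clamp of 512, an
unclamped task gets 300 in the root group and 512 in the child group,
while a task asking for 800 in the child group is limited to 512.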

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180820122728.GM2960@e110439-lin>
- fix unwanted reset of clamp values on refcount success
Others:
- by default all tasks have a UCLAMP_NOT_VALID task specific clamp
- always use:
p->uclamp[clamp_id].effective.value
to track the actual clamp value the task has been refcounted into.
This matches with the usage of
p->uclamp[clamp_id].effective.group_id
- rebased on v4.19-rc1
---
include/linux/sched/sysctl.h | 11 +++
kernel/sched/core.c | 147 +++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 16 ++++
3 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a9c32daeb9d8..445fb54eaeff 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_UCLAMP_TASK
+extern int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da0b3bd41e96..fbc8d9fdfdbb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -728,6 +728,20 @@ static void set_load_weight(struct task_struct *p, bool update_load)
*/
static DEFINE_MUTEX(uclamp_mutex);

+/*
+ * Minimum utilization for tasks in the root cgroup
+ * default: 0
+ */
+unsigned int sysctl_sched_uclamp_util_min;
+
+/*
+ * Maximum utilization for tasks in the root cgroup
+ * default: 1024
+ */
+unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+
+static struct uclamp_se uclamp_default[UCLAMP_CNT];
+
/**
* uclamp_map: reference counts a utilization "clamp value"
* @value: the utilization "clamp value" required
@@ -961,11 +975,16 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
*
* This method returns the effective group index for a task, depending on its
* status and a proper aggregation of the clamp values listed above.
+ * Moreover, it ensures that the task's effective value:
+ * task_struct::uclamp::effective::value
+ * is updated to represent the clamp value corresponding to the task's effective
+ * group index.
*/
static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
{
struct uclamp_se *uc_se;
int clamp_value;
+ bool unclamped;
int group_id;

/* Task currently accounted into a clamp group */
@@ -977,13 +996,40 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
clamp_value = uc_se->value;
group_id = uc_se->group_id;

+ unclamped = (clamp_value == UCLAMP_NOT_VALID);
#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /*
+ * Tasks in the root group, which do not have a task specific clamp
+ * value, get the system default clamp value.
+ */
+ if (unclamped && (task_group_is_autogroup(task_group(p)) ||
+ task_group(p) == &root_task_group)) {
+ p->uclamp[clamp_id].effective.value =
+ uclamp_default[clamp_id].value;
+
+ return uclamp_default[clamp_id].group_id;
+ }
+
/* Use TG's clamp value to limit task specific values */
uc_se = &task_group(p)->uclamp[clamp_id];
- if (clamp_value > uc_se->effective.value)
- group_id = uc_se->effective.group_id;
+ if (unclamped || clamp_value > uc_se->effective.value) {
+ p->uclamp[clamp_id].effective.value =
+ uc_se->effective.value;
+
+ return uc_se->effective.group_id;
+ }
+#else
+ /* By default, all tasks get the system default clamp value */
+ if (unclamped) {
+ p->uclamp[clamp_id].effective.value =
+ uclamp_default[clamp_id].value;
+
+ return uclamp_default[clamp_id].group_id;
+ }
#endif

+ p->uclamp[clamp_id].effective.value = clamp_value;
+
return group_id;
}

@@ -999,9 +1045,10 @@ static inline int uclamp_group_value(int clamp_id, int group_id)

static inline int uclamp_task_value(struct task_struct *p, int clamp_id)
{
- int group_id = uclamp_task_group_id(p, clamp_id);
+ /* Ensure the task's effective clamp value is updated */
+ uclamp_task_group_id(p, clamp_id);

- return uclamp_group_value(clamp_id, group_id);
+ return p->uclamp[clamp_id].effective.value;
}

/**
@@ -1047,7 +1094,7 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,

/* Reset clamp holds on idle exit */
uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
+ clamp_value = p->uclamp[clamp_id].effective.value;
if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
/*
* This function is called for both UCLAMP_MIN (before) and
@@ -1300,6 +1347,77 @@ static inline void uclamp_group_get(struct task_struct *p,
uclamp_group_put(clamp_id, prev_group_id);
}

+int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
+ struct uclamp_se *uc_se;
+ int old_min, old_max;
+ unsigned int value;
+ int result;
+
+ mutex_lock(&uclamp_mutex);
+
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ result = -EINVAL;
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
+ goto undo;
+ if (sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE)
+ goto undo;
+
+ /* Find a valid group_id for each required clamp value */
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ value = sysctl_sched_uclamp_util_min;
+ result = uclamp_group_find(UCLAMP_MIN, value);
+ if (result == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MIN");
+ goto undo;
+ }
+ group_id[UCLAMP_MIN] = result;
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ value = sysctl_sched_uclamp_util_max;
+ result = uclamp_group_find(UCLAMP_MAX, value);
+ if (result == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MAX");
+ goto undo;
+ }
+ group_id[UCLAMP_MAX] = result;
+ }
+ result = 0;
+
+ /* Update each required clamp group */
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uc_se = &uclamp_default[UCLAMP_MIN];
+ uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uc_se, sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uc_se = &uclamp_default[UCLAMP_MAX];
+ uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uc_se, sysctl_sched_uclamp_util_max);
+ }
+ goto done;
+
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
/**
* alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
@@ -1468,6 +1586,8 @@ static void uclamp_fork(struct task_struct *p, bool reset)
if (unlikely(reset)) {
next_group_id = 0;
p->uclamp[clamp_id].value = uclamp_none(clamp_id);
+ p->uclamp[clamp_id].effective.value =
+ uclamp_none(clamp_id);
}
/* Forked tasks are not yet enqueued */
p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID;
@@ -1475,6 +1595,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
p->uclamp[clamp_id].value);
+
+ /* By default we do not want task-specific clamp values */
+ if (unlikely(reset))
+ p->uclamp[clamp_id].value = UCLAMP_NOT_VALID;
}
}

@@ -1509,12 +1633,17 @@ static void __init init_uclamp(void)
uc_se->group_id = UCLAMP_NOT_VALID;
uclamp_group_get(NULL, clamp_id, 0, uc_se,
uclamp_none(clamp_id));
+ /*
+ * By default we do not want task-specific clamp values,
+ * so that system default values apply.
+ */
+ uc_se->value = UCLAMP_NOT_VALID;

#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp group */
uc_se = &root_task_group.uclamp[clamp_id];

- uc_se->effective.value = uclamp_none(clamp_id);
+ uc_se->effective.value = uclamp_none(UCLAMP_MAX);
uc_se->effective.group_id = 0;

/*
@@ -1526,6 +1655,12 @@ static void __init init_uclamp(void)
uclamp_group_get(NULL, clamp_id, 0, uc_se,
uclamp_none(UCLAMP_MAX));
#endif
+
+ /* Init system default's clamp group */
+ uc_se = &uclamp_default[clamp_id];
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_none(clamp_id));
}
}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050fd0c4..378ea57e5fc5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -445,6 +445,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.18.0


2018-08-28 13:58:10

by Patrick Bellasi

Subject: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider the requirements suggested by userspace, for
example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "required" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support a per-task based tuning of the
frequency selection, thus allowing a fine grained definition of
performance boosting vs energy saving strategies in kernel space.

Let's add the required support to clamp the utilization generated by
FAIR tasks within the boundaries defined by their aggregated utilization
clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each task. This max aggregation
responds to the goal of not penalizing, for example, high boosted (i.e.
more important for the user-experience) CFS tasks which happen to be
co-scheduled with high capped (i.e. less important for the
user-experience) CFS tasks.
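
A minimal userspace sketch of this max aggregation follows. The
sketch_* names are hypothetical, and returning the default clamps for a
CPU with no RUNNABLE tasks is an assumption of the example, not a claim
about how the series handles idle CPUs.

```c
#include <stddef.h>

#define SKETCH_CAPACITY_SCALE 1024u

/* Hypothetical per-task clamp pair. */
struct sketch_task {
	unsigned int util_min;
	unsigned int util_max;
};

/*
 * Aggregate the clamps of the tasks RUNNABLE on one CPU by taking the
 * maximum of each clamp, so that a boosted task is not penalized by a
 * capped task it happens to be co-scheduled with.
 */
static void sketch_cpu_clamps(const struct sketch_task *tasks, size_t nr,
			      unsigned int *cpu_min, unsigned int *cpu_max)
{
	size_t i;

	*cpu_min = 0;
	*cpu_max = 0;
	for (i = 0; i < nr; i++) {
		if (tasks[i].util_min > *cpu_min)
			*cpu_min = tasks[i].util_min;
		if (tasks[i].util_max > *cpu_max)
			*cpu_max = tasks[i].util_max;
	}

	/* Assumption: with no tasks, fall back to "no capping". */
	if (nr == 0)
		*cpu_max = SKETCH_CAPACITY_SCALE;
}
```

With one task boosted to 512 (and uncapped) next to one task capped at
200, the CPU ends up with util_min = 512 and util_max = 1024, so the
boosted task still gets its minimum frequency.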

For FAIR tasks both the utilization as well as the IOWait boost values
are clamped according to the CPU aggregated utilization clamp
constraints.

The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.
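
The effect of the default values can be checked with a tiny userspace
model of the clamping step. sketch_clamp_util() is a hypothetical name
used only here; it is an open-coded equivalent of a min/max clamp, not
the kernel helper itself.

```c
/*
 * Clamp a utilization value into [min_util, max_util].
 * With min_util = 0 and max_util = 1024 every value in the
 * 0..1024 range is returned unchanged.
 */
static unsigned int sketch_clamp_util(unsigned int util,
				      unsigned int min_util,
				      unsigned int max_util)
{
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```

For example, a utilization of 100 becomes 512 under a util_min of 512,
and a utilization of 900 becomes 200 under a util_max of 200. In the
schedutil change of this patch the clamp is applied only when the CFS
utilization is non-zero.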

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <CAKfTPtC2adLupg7wy1JU9zxKx1466Sza6fSCcr92wcawm1OYkg@mail.gmail.com>
- use *rq instead of cpu for both uclamp_util() and uclamp_value()
Message-ID: <20180816135300.GC2960@e110439-lin>
- remove uclamp_value() which is never used outside CONFIG_UCLAMP_TASK
Others:
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/cpufreq_schedutil.c | 23 +++++++++++++--
kernel/sched/sched.h | 50 ++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3bc8a8..949082555ee8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS utilization can be boosted or capped, depending on utilization
+ * clamp constraints configured for currently RUNNABLE tasks.
*/
util = cpu_util_cfs(rq);
+ if (util)
+ util = uclamp_util(rq, util);
util += cpu_util_rt(rq);

/*
@@ -307,6 +312,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
unsigned int flags)
{
bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+ unsigned int max_boost;

/* Reset boost if the CPU appears to have been idle enough */
if (sg_cpu->iowait_boost &&
@@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
return;
sg_cpu->iowait_boost_pending = true;

+ /*
+ * Boost FAIR tasks only up to the CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since RT tasks are not utilization clamped, we don't want
+ * to apply clamping on IO boost while there is blocked RT
+ * utilization.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
return;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 411635c4c09a..1b05b38b1081 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2293,6 +2293,56 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}

+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_value: get the current CPU's utilization clamp value
+ * @rq: the CPU's RQ to consider
+ * @clamp_id: the utilization clamp index (i.e. min or max utilization)
+ *
+ * The utilization clamp value for a CPU depends on its set of currently
+ * RUNNABLE tasks and their specific util_{min,max} constraints.
+ * A max aggregated value is tracked for each CPU and returned by this
+ * function.
+ *
+ * Return: the current value for the specified CPU and clamp index
+ */
+static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &rq->uclamp;
+
+ if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
+ return uclamp_none(clamp_id);
+
+ return uc_cpu->value[clamp_id];
+}
+
+/**
+ * uclamp_util: clamp a utilization value for a specified CPU
+ * @rq: the CPU's RQ to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently RUNNABLE tasks. Given a utilization signal, i.e. a signal in
+ * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = uclamp_value(rq, UCLAMP_MIN);
+ unsigned int max_util = uclamp_value(rq, UCLAMP_MAX);
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.18.0


2018-08-28 13:58:40

by Patrick Bellasi

Subject: [PATCH v4 09/16] sched/core: uclamp: map TG's clamp values into CPU's clamp groups

Utilization clamping requires mapping each distinct clamp value
into one of the available clamp groups used by the scheduler's fast-path
to account for RUNNABLE tasks. Thus, each time a TG's clamp value
sysfs attribute is updated via:
cpu_util_{min,max}_write_u64()
we need to get (if possible) a reference to the new value's clamp group
and release the reference to the previous one.

Let's ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time).
We do that by slightly refactoring uclamp_group_get() to make the
*task_struct parameter optional. This allows us to re-use the code already
available to support the per-task API.
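
The "get a reference to the new group, release the previous one" step
can be sketched as a small refcounted map in userspace C. Everything
named sketch_* is hypothetical: the table size, the struct layout and
the -1 error value are assumptions of the example and do not reproduce
uclamp_group_find(), uclamp_group_get() or uclamp_group_put().

```c
#define SKETCH_GROUPS 4
#define SKETCH_NOT_VALID (-1)

/* One slot per clamp group: the value it maps and its users. */
struct sketch_map {
	int value;
	int refs;
};

/* A scheduling entity (task or task group) and its clamp group. */
struct sketch_se {
	int value;
	int group_id;
};

static struct sketch_map sketch_maps[SKETCH_GROUPS];

/* Find the group already mapping @value, else the first free one. */
static int sketch_group_find(int value)
{
	int free_id = SKETCH_NOT_VALID;
	int id;

	for (id = 0; id < SKETCH_GROUPS; id++) {
		if (sketch_maps[id].refs > 0 && sketch_maps[id].value == value)
			return id;
		if (sketch_maps[id].refs == 0 && free_id == SKETCH_NOT_VALID)
			free_id = id;
	}
	return free_id;
}

/* Take a reference on the new group, then drop the previous one. */
static int sketch_se_update(struct sketch_se *se, int value)
{
	int prev_id = se->group_id;
	int next_id = sketch_group_find(value);

	if (next_id == SKETCH_NOT_VALID)
		return -1; /* no clamp group available */

	sketch_maps[next_id].value = value;
	sketch_maps[next_id].refs++;
	se->value = value;
	se->group_id = next_id;

	if (prev_id != SKETCH_NOT_VALID)
		sketch_maps[prev_id].refs--;
	return 0;
}
```

Two entities requesting the same value share one group, and moving one
of them to a different value drops its reference on the old group. The
new reference is taken before the old one is released, so updating an
entity to the value it already has never frees its group in between.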

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Others:
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add explicit calls to uclamp_group_find(), which is no longer
part of uclamp_group_get()
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 11 +++--
kernel/sched/core.c | 95 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2da130d17e70..4e5522ed57e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -587,17 +587,22 @@ struct sched_dl_entity {
* The same "group_id" can be used by multiple scheduling entities, i.e.
* either tasks or task groups, to enforce the same clamp "value" for a given
* clamp index.
+ *
+ * Scheduling entity's specific clamp group index can be different
+ * from the effective clamp group index used at enqueue time since
+ * task groups' clamps can be restricted by their parent task group.
*/
struct uclamp_se {
unsigned int value;
unsigned int group_id;
/*
- * Effective task (group) clamp value.
- * For task groups is the value (eventually) enforced by a parent task
- * group.
+ * Effective task (group) clamp value and group index.
+ * For task groups it's the value (eventually) enforced by a parent
+ * task group.
*/
struct {
unsigned int value;
+ unsigned int group_id;
} effective;
};

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b2d438b6484b..e617a7b18f2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1250,24 +1250,51 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
struct uclamp_se *uc_se;
+ int next_group_id;
int clamp_id;

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &tg->uclamp[clamp_id];
+
uc_se->effective.value =
parent->uclamp[clamp_id].effective.value;
- uc_se->value = parent->uclamp[clamp_id].value;
- uc_se->group_id = parent->uclamp[clamp_id].group_id;
+ uc_se->effective.group_id =
+ parent->uclamp[clamp_id].effective.group_id;
+
+ next_group_id = parent->uclamp[clamp_id].group_id;
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
+ parent->uclamp[clamp_id].value);
}

return 1;
}
+
+/**
+ * free_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all the reference
+ * counting to clamp groups.
+ */
+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+ uclamp_group_put(clamp_id, uc_se->group_id);
+ }
+}
#else /* CONFIG_UCLAMP_TASK_GROUP */
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
return 1;
}
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
#endif /* CONFIG_UCLAMP_TASK_GROUP */

static inline int __setscheduler_uclamp(struct task_struct *p,
@@ -1417,9 +1444,18 @@ static void __init init_uclamp(void)
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp group */
uc_se = &root_task_group.uclamp[clamp_id];
+
uc_se->effective.value = uclamp_none(clamp_id);
- uc_se->value = uclamp_none(clamp_id);
- uc_se->group_id = 0;
+ uc_se->effective.group_id = 0;
+
+ /*
+ * The max utilization is always allowed for both clamps.
+ * This is required to not force a null minimum utilization on
+ * all child groups.
+ */
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_none(UCLAMP_MAX));
#endif
}
}
@@ -1427,6 +1463,7 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -6984,6 +7021,7 @@ static DEFINE_SPINLOCK(task_group_lock);

static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -7234,6 +7272,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
* @css: the task group to update
* @clamp_id: the clamp index to update
* @value: the new task group clamp value
+ * @group_id: the group index mapping the new task clamp value
*
* The effective clamp for a TG is expected to track the most restrictive
* value between the TG's clamp value and its parent effective clamp value.
@@ -7252,9 +7291,12 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
* be propagated down to all the descendants. When a subgroup is found which
* has already its effective clamp value matching its clamp value, then we can
* safely skip all its descendants which are granted to be already in sync.
+ *
+ * The TG's group_id is also updated to ensure it tracks the effective clamp
+ * value.
*/
static void cpu_util_update_hier(struct cgroup_subsys_state *css,
- int clamp_id, int value)
+ int clamp_id, int value, int group_id)
{
struct cgroup_subsys_state *top_css = css;
struct uclamp_se *uc_se, *uc_parent;
@@ -7282,24 +7324,30 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
}

/* Propagate the most restrictive effective value */
- if (uc_parent->effective.value < value)
+ if (uc_parent->effective.value < value) {
value = uc_parent->effective.value;
+ group_id = uc_parent->effective.group_id;
+ }
if (uc_se->effective.value == value)
continue;

uc_se->effective.value = value;
+ uc_se->effective.group_id = group_id;
}
}

static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;
+ int group_id;

if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7310,11 +7358,25 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MAX].value < min_value)
goto out;

+ /* Find a valid group_id */
+ ret = uclamp_group_find(UCLAMP_MIN, min_value);
+ if (ret == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MIN");
+ goto out;
+ }
+ group_id = ret;
+ ret = 0;
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value, group_id);
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MIN];
+ uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value);

out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
@@ -7322,12 +7384,15 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 max_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;
+ int group_id;

if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7338,11 +7403,25 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MIN].value > max_value)
goto out;

+ /* Find a valid group_id */
+ ret = uclamp_group_find(UCLAMP_MAX, max_value);
+ if (ret == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MAX");
+ goto out;
+ }
+ group_id = ret;
+ ret = 0;
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value, group_id);
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MAX];
+ uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value);

out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
--
2.18.0


2018-08-28 14:04:22

by Patrick Bellasi

Subject: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

Utilization clamping allows the utilization of a CPU to be clamped within a
[util_min, util_max] range. This range depends on the set of currently
RUNNABLE tasks on a CPU, where each task references two "clamp groups"
defining the util_min and the util_max clamp values to be considered for
that task. The clamp value mapped by a clamp group applies to a CPU only
when there is at least one task RUNNABLE referencing that clamp group.

When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
the "aggregated" clamp value to apply on that CPU may have to be
re-computed.

Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
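As a rough illustration of the MAX aggregation rule above, here is a
user-space sketch. It is not kernel code: struct task_clamp,
cpu_clamp_aggregate() and CAPACITY_SCALE are hypothetical names, and the
"no RUNNABLE task means no clamp" behaviour is an assumption made only for
this example.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY_SCALE 1024U /* stand-in for SCHED_CAPACITY_SCALE */

/* Hypothetical per-task clamp pair; not a kernel structure. */
struct task_clamp {
	unsigned int util_min;
	unsigned int util_max;
};

/*
 * MAX-aggregate util_min and util_max over the RUNNABLE tasks of one CPU.
 * Assumption: with no RUNNABLE task no clamp applies, i.e. the result is
 * [0, CAPACITY_SCALE].
 */
static struct task_clamp cpu_clamp_aggregate(const struct task_clamp *tasks,
					     size_t nr_tasks)
{
	struct task_clamp cpu = { 0, nr_tasks ? 0 : CAPACITY_SCALE };
	size_t i;

	for (i = 0; i < nr_tasks; i++) {
		/* Each task must carry a sane [util_min, util_max] range */
		assert(tasks[i].util_min <= tasks[i].util_max);
		assert(tasks[i].util_max <= CAPACITY_SCALE);

		if (tasks[i].util_min > cpu.util_min)
			cpu.util_min = tasks[i].util_min;
		if (tasks[i].util_max > cpu.util_max)
			cpu.util_max = tasks[i].util_max;
	}
	return cpu;
}
```

With two tasks capped at 300 and 512, the aggregated util_max is 512, so the
less capped task is not penalized by the more capped one.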

Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.

Tasks have a:
task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group in
which they should refcount at enqueue time.

CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index.

The clamp value of each clamp group is tracked by
rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an
unordered array of clamp values. However, the MAX aggregation of the
currently active clamp groups is implemented to minimize the number of
times we need to scan the complete (unordered) clamp group array to
figure out the new max value. This operation indeed happens only when we
dequeue the last task of the clamp group corresponding to the current max
clamp, and thus the CPU is either entering IDLE or going to schedule a
less boosted or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect fewer than
10 different values.
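
The refcounting and lazy rescan described above can be modelled in user
space roughly as follows. This is a simplified sketch for a single clamp
index, not the patch itself: struct cpu_clamp, the cpu_clamp_*() helpers,
NR_GROUPS and the rescans counter are hypothetical names introduced only to
make the "scan only when the last task of the max group leaves" behaviour
observable.

```c
#include <assert.h>

#define NR_GROUPS 5	/* stand-in for CONFIG_UCLAMP_GROUPS_COUNT + 1 */
#define NOT_VALID (-1)	/* stand-in for UCLAMP_NOT_VALID */

struct group {
	int value;	/* clamp value mapped by this group */
	int tasks;	/* RUNNABLE tasks refcounting this group */
};

struct cpu_clamp {
	struct group group[NR_GROUPS];	/* unordered array of clamp groups */
	int value;			/* current MAX aggregated clamp */
	int rescans;			/* full scans done (illustration only) */
};

static void cpu_clamp_init(struct cpu_clamp *c, const int *values)
{
	int id;

	for (id = 0; id < NR_GROUPS; id++) {
		c->group[id].value = values[id];
		c->group[id].tasks = 0;
	}
	c->value = NOT_VALID;
	c->rescans = 0;
}

/* Full scan: find the max value among the active groups */
static void cpu_clamp_update(struct cpu_clamp *c)
{
	int id, max_value = NOT_VALID;

	for (id = 0; id < NR_GROUPS; id++) {
		if (c->group[id].tasks == 0)
			continue;
		if (c->group[id].value > max_value)
			max_value = c->group[id].value;
	}
	c->value = max_value;
	c->rescans++;
}

/* Enqueue: refcount the group, raise the CPU clamp if needed */
static void cpu_clamp_get(struct cpu_clamp *c, int id)
{
	assert(id >= 0 && id < NR_GROUPS);
	c->group[id].tasks++;
	if (c->value < c->group[id].value)
		c->value = c->group[id].value;
}

/* Dequeue: rescan only if the last task of the max group is leaving */
static void cpu_clamp_put(struct cpu_clamp *c, int id)
{
	assert(id >= 0 && id < NR_GROUPS);
	assert(c->group[id].tasks > 0);
	if (--c->group[id].tasks > 0)
		return;
	if (c->group[id].value >= c->value)
		cpu_clamp_update(c);
}
```

Dequeuing a task from a group below the current max, or a non-last task of
the max group, never triggers a scan in this model.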

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
- add another WARN on the unexpected condition of releasing a refcount
from a CPU which has a lower clamp value active
Other:
- ensure (and check) that all tasks have a valid group_id at
uclamp_cpu_get_id()
- rework uclamp_cpu layout to better fit into just 2x64B cache lines
- fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Other:
- rebased on v4.18-rc4
- improved documentation to make more explicit some concepts.
---
kernel/sched/core.c | 207 ++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 67 ++++++++++++++
2 files changed, 273 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2668990b96d1..8f908035701f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -829,9 +829,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
unsigned int clamp_value)
{
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ struct uclamp_cpu *uc_cpu;
+ int cpu;

+ /* Set clamp group map */
uc_map[group_id].value = clamp_value;
uc_map[group_id].se_count = 0;
+
+ /* Set clamp groups on all CPUs */
+ for_each_possible_cpu(cpu) {
+ uc_cpu = &cpu_rq(cpu)->uclamp;
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ uc_cpu->group[clamp_id][group_id].tasks = 0;
+ }
}

/**
@@ -886,6 +896,190 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
return -ENOSPC;
}

+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @rq: the CPU's rq whose utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups is subject to change. Since each clamp group enforces a
+ * different utilization clamp value, once the set of these groups changes it
+ * can be required to re-compute what is the new clamp value to apply for that
+ * CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
+ */
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
+ int max_value = UCLAMP_NOT_VALID;
+ unsigned int group_id;
+
+ for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
+ if (!uclamp_group_active(uc_grp, group_id))
+ continue;
+
+ /* Both min and max clamp are MAX aggregated */
+ max_value = max(max_value, uc_grp[group_id].value);
+
+ /* Stop if we reach the max possible clamp */
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ rq->uclamp.value[clamp_id] = max_value;
+}
+
+/**
+ * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @rq: the CPU's rq where the clamp group has to be reference counted
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
+ *
+ * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
+ * the task's uclamp.group_id is reference counted on that CPU.
+ */
+static inline void uclamp_cpu_get_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ int clamp_value;
+ int group_id;
+
+ /* Every task must reference a clamp group */
+ group_id = p->uclamp[clamp_id].group_id;
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(group_id == UCLAMP_NOT_VALID)) {
+ WARN(1, "invalid task [%d:%s] clamp group\n",
+ p->pid, p->comm);
+ return;
+ }
+#endif
+
+ /* Reference count the task into its current group_id */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+ uc_grp[group_id].tasks += 1;
+
+ /*
+ * If this is the new max utilization clamp value, then we can update
+ * straight away the CPU clamp value. Otherwise, the current CPU clamp
+ * value is still valid and we are done.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (uc_cpu->value[clamp_id] < clamp_value)
+ uc_cpu->value[clamp_id] = clamp_value;
+}
+
+/**
+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @rq: the CPU's rq from where the clamp group has to be released
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
+ *
+ * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
+ * counted by the task is decreased.
+ * If this was the last task defining the current max clamp group, then the
+ * CPU clamping is updated to find the new max for the specified clamp
+ * index.
+ */
+static inline void uclamp_cpu_put_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ unsigned int clamp_value;
+ int group_id;
+
+ /* New tasks don't have a previous clamp group */
+ group_id = p->uclamp[clamp_id].group_id;
+ if (group_id == UCLAMP_NOT_VALID)
+ return;
+
+ /* Decrement the task's reference counted group index */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+ if (likely(uc_grp[group_id].tasks))
+ uc_grp[group_id].tasks -= 1;
+#ifdef CONFIG_SCHED_DEBUG
+ else {
+ WARN(1, "invalid CPU[%d] clamp group [%d:%d] refcount\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+
+ /* If this is not the last task, no updates are required */
+ if (uc_grp[group_id].tasks > 0)
+ return;
+
+ /*
+ * Update the CPU only if this was the last task of the group
+ * defining the current clamp value.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = uc_grp[group_id].value;
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(clamp_value > uc_cpu->value[clamp_id])) {
+ WARN(1, "invalid CPU[%d] clamp group [%d:%d] value\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+ if (clamp_value >= uc_cpu->value[clamp_id])
+ uclamp_cpu_update(rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_get(): increase CPU's clamp group refcount
+ * @rq: the CPU's rq where the clamp group has to be refcounted
+ * @p: the task being enqueued
+ *
+ * Once a task is enqueued on a CPU's rq, all the clamp groups currently
+ * enforced on a task are reference counted on that rq.
+ * Not all scheduling classes have utilization clamping support, their tasks
+ * will be silently ignored.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_get_id(p, rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_put(): decrease CPU's clamp group refcount
+ * @rq: the CPU's rq where the clamp group refcount has to be decreased
+ * @p: the task being dequeued
+ *
+ * When a task is dequeued from a CPU's rq, all the clamp groups the task has
+ * been reference counted at task's enqueue time have to be decreased for that
+ * CPU.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_put_id(p, rq, clamp_id);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -908,7 +1102,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
if (likely(uc_map[group_id].se_count))
uc_map[group_id].se_count -= 1;
-#ifdef SCHED_DEBUG
+#ifdef CONFIG_SCHED_DEBUG
else {
WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
clamp_id, group_id);
@@ -1073,9 +1267,16 @@ static void __init init_uclamp(void)
{
struct uclamp_se *uc_se;
int clamp_id;
+ int cpu;

mutex_init(&uclamp_mutex);

+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ memset(uc_cpu, UCLAMP_NOT_VALID, sizeof(struct uclamp_cpu));
+ }
+
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
int group_id = 0;
@@ -1093,6 +1294,8 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1110,6 +1313,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);

+ uclamp_cpu_get(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -1121,6 +1325,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);

+ uclamp_cpu_put(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72df2dc779bc..513608ae4908 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -764,6 +764,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * struct uclamp_group - Utilization clamp Group
+ * @value: utilization clamp value for tasks on this clamp group
+ * @tasks: number of RUNNABLE tasks on this clamp group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ int value;
+ int tasks;
+};
+
+/**
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @group: utilization clamp groups affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+ int value[UCLAMP_CNT];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -801,6 +845,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -2145,6 +2194,24 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_group_active: check if a clamp group is active on a CPU
+ * @uc_grp: the clamp groups for a CPU
+ * @group_id: the clamp group to check
+ *
+ * A clamp group affects a CPU if it has at least one RUNNABLE task.
+ *
+ * Return: true if the specified CPU has at least one RUNNABLE task
+ * for the specified clamp group.
+ */
+static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
+ int group_id)
+{
+ return uc_grp[group_id].tasks > 0;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef CONFIG_CPU_FREQ
DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

--
2.18.0


2018-08-28 14:05:08

by Patrick Bellasi

Subject: [PATCH v4 04/16] sched/core: uclamp: update CPU's refcount on clamp changes

Utilization clamp values enforced on a CPU by a task can be updated at
run-time, for example via a sched_setattr syscall, while a task is
currently RUNNABLE on that CPU. In these cases, the task can be already
refcounting a clamp group for its CPU and thus we need to update this
reference to ensure the new constraints are immediately enforced.

Since a clamp value change always implies a clamp group refcount update,
this patch hooks into the clamp group refcount getter to trigger a CPU
refcount syncup. Such a syncup is required only by currently RUNNABLE
tasks which are also referencing at least one valid clamp group.
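
The syncup described above can be sketched in user space as a
"release old group, take new group" step that is applied only to queued
tasks. This is a simplified model under stated assumptions, not the patch
code: struct task, task_enqueue(), task_set_group() and the cpu_*() helpers
are hypothetical, only one clamp index is modelled, and no locking is shown
(the patch serializes this with task_rq_lock()).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_GROUPS 4	/* stand-in for CONFIG_UCLAMP_GROUPS_COUNT + 1 */
#define NOT_VALID (-1)	/* stand-in for UCLAMP_NOT_VALID */

struct group { int value; int tasks; };
struct cpu_clamp { struct group group[NR_GROUPS]; int value; };
struct task { int group_id; bool queued; };

/* Full scan for the max value among the active groups */
static void cpu_recompute(struct cpu_clamp *c)
{
	int id, max_value = NOT_VALID;

	for (id = 0; id < NR_GROUPS; id++)
		if (c->group[id].tasks > 0 && c->group[id].value > max_value)
			max_value = c->group[id].value;
	c->value = max_value;
}

static void cpu_get(struct cpu_clamp *c, int id)
{
	assert(id >= 0 && id < NR_GROUPS);
	c->group[id].tasks++;
	if (c->value < c->group[id].value)
		c->value = c->group[id].value;
}

static void cpu_put(struct cpu_clamp *c, int id)
{
	assert(id >= 0 && id < NR_GROUPS && c->group[id].tasks > 0);
	if (--c->group[id].tasks == 0 && c->group[id].value >= c->value)
		cpu_recompute(c);
}

static void task_enqueue(struct cpu_clamp *c, struct task *t)
{
	t->queued = true;
	cpu_get(c, t->group_id);
}

/* Change a task's clamp group; sync the CPU refcounts only if queued */
static void task_set_group(struct cpu_clamp *c, struct task *t, int new_id)
{
	int old_id = t->group_id;

	assert(new_id >= 0 && new_id < NR_GROUPS);
	t->group_id = new_id;

	/* Sleeping tasks pick up the new group at their next enqueue */
	if (!t->queued)
		return;

	cpu_put(c, old_id);	/* release the previously referenced group */
	cpu_get(c, new_id);	/* refcount the new group */
}
```

In this model a clamp change on a queued task takes effect on the CPU
immediately, while a sleeping task only has its group index updated.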

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v4:
Message-ID: <20180816132249.GA2960@e110439-lin>
- inline uclamp_task_active() code into uclamp_task_update_active()
- get rid of the now unused uclamp_task_active()
Other:
- allow to call uclamp_group_get() without a task pointer, which is
used to refcount the initial clamp group for all the global objects
(init_task, root_task_group and system_defaults)
- rebased on v4.19-rc1

Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not required at this stage where we have only per-task
clamping support. It will be introduced later when CGroups support is
added.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 65 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 16 +++++++++++
2 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f908035701f..64e5c96bfdaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1080,6 +1080,54 @@ static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
uclamp_cpu_put_id(p, rq, clamp_id);
}

+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task which clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ * @group_id: the clamp group to update
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups have to be updated for each CPU containing a RUNNABLE task belonging
+ * to this task group. Sleeping tasks are not updated since they will be
+ * enqueued with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) RQ of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * The setting of the clamp group is serialized by task_rq_lock().
+ * Thus, if the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp group, then the next time it's going to be
+ * enqueued it will already see the updated clamp group value.
+ */
+ if (!task_on_rq_queued(p) && !p->on_cpu)
+ goto done;
+ if (!uclamp_task_affects(p, clamp_id))
+ goto done;
+
+ /* Release p's currently referenced clamp group */
+ uclamp_cpu_put_id(p, rq, clamp_id);
+
+ /* Get p's new clamp group */
+ uclamp_cpu_get_id(p, rq, clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -1115,6 +1163,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)

/**
* uclamp_group_get: increase the reference count for a clamp group
+ * @p: the task which clamp value must be tracked
* @clamp_id: the clamp index affected by the task
* @next_group_id: the clamp group to refcount
* @uc_se: the utilization clamp data for the task
@@ -1125,7 +1174,8 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
* this new clamp value. The corresponding clamp group index will be used by
* the task to reference count the clamp value on CPUs while enqueued.
*/
-static inline void uclamp_group_get(int clamp_id, int next_group_id,
+static inline void uclamp_group_get(struct task_struct *p,
+ int clamp_id, int next_group_id,
struct uclamp_se *uc_se,
unsigned int clamp_value)
{
@@ -1144,6 +1194,10 @@ static inline void uclamp_group_get(int clamp_id, int next_group_id,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);

+ /* Update CPU's clamp group refcounts of RUNNABLE task */
+ if (p)
+ uclamp_task_update_active(p, clamp_id, next_group_id);
+
/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);
}
@@ -1202,12 +1256,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
- uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, attr->sched_util_min);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
- uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, attr->sched_util_max);
}

@@ -1255,7 +1309,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
}

p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(clamp_id, next_group_id, uc_se,
+ uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
p->uclamp[clamp_id].value);
}
}
@@ -1289,7 +1343,8 @@ static void __init init_uclamp(void)
/* Init init_task's clamp group */
uc_se = &init_task.uclamp[clamp_id];
uc_se->group_id = UCLAMP_NOT_VALID;
- uclamp_group_get(clamp_id, 0, uc_se, uclamp_none(clamp_id));
+ uclamp_group_get(NULL, clamp_id, 0, uc_se,
+ uclamp_none(clamp_id));
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 513608ae4908..25d1d218ae10 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2210,6 +2210,22 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
{
return uc_grp[group_id].tasks > 0;
}
+
+/**
+ * uclamp_task_affects: check if a task affects a utilization clamp
+ * @p: the task to consider
+ * @clamp_id: the utilization clamp to check
+ *
+ * A task affects a clamp index if:
+ * - it's currently enqueued on a CPU
+ * - it references a valid clamp group index for the specified clamp index
+ *
+ * Return: true if p currently affects the specified clamp_id
+ */
+static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
+{
+ return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID);
+}
#endif /* CONFIG_UCLAMP_TASK */

#ifdef CONFIG_CPU_FREQ
--
2.18.0


2018-08-28 18:32:32

by Randy Dunlap

Subject: Re: [PATCH v4 07/16] sched/core: uclamp: extend cpu's cgroup controller

On 08/28/2018 06:53 AM, Patrick Bellasi wrote:
> +config UCLAMP_TASK_GROUP
> + bool "Utilization clamping per group of tasks"
> + depends on CGROUP_SCHED
> + depends on UCLAMP_TASK
> + default n
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> + When this option is enabled, the user can specify a min and max
> + CPU bandwidth which is allowed for each single task in a group.
> + The max bandwidth allows to clamp the maximum frequency a task
> + can use, while the min bandwidth allows to define a minimum
> + frequency a task will always use.
> +
> + When task group based utilization clamping is enabled, an eventually
> + specified task-specific clamp value is constrained by the cgroup
> + specified clamp value. Both minimum and maximum task clamping cannot
> + be bigger than the corresponding clamping defined at task group level.

The 4 lines above should all be indented the same (one tab + 2 spaces).

> +
> + If in doubt, say N.
> +


--
~Randy

2018-08-29 08:55:53

by Patrick Bellasi

Subject: Re: [PATCH v4 07/16] sched/core: uclamp: extend cpu's cgroup controller

On 28-Aug 11:29, Randy Dunlap wrote:
> On 08/28/2018 06:53 AM, Patrick Bellasi wrote:
> > +config UCLAMP_TASK_GROUP
> > + bool "Utilization clamping per group of tasks"
> > + depends on CGROUP_SCHED
> > + depends on UCLAMP_TASK
> > + default n
> > + help
> > + This feature enables the scheduler to track the clamped utilization
> > + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> > +
> > + When this option is enabled, the user can specify a min and max
> > + CPU bandwidth which is allowed for each single task in a group.
> > + The max bandwidth allows to clamp the maximum frequency a task
> > + can use, while the min bandwidth allows to define a minimum
> > + frequency a task will always use.
> > +
> > + When task group based utilization clamping is enabled, an eventually
> > + specified task-specific clamp value is constrained by the cgroup
> > + specified clamp value. Both minimum and maximum task clamping cannot
> > + be bigger than the corresponding clamping defined at task group level.
>
> The 4 lines above should all be indented the same (one tab + 2 spaces).

Right... then there's definitely something broken with my vim
reformat shortcut, which sometimes uses spaces instead of tabs :(

Unfortunately this pattern is not covered by checkpatch, which
returns no errors/warnings on this patch.

> > +
> > + If in doubt, say N.
> > +

Anyway, thanks for spotting it... easy fix for the next respin.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-04 13:49:16

by Juri Lelli

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

Hi,

On 28/08/18 14:53, Patrick Bellasi wrote:
> The number of clamp groups supported is limited and defined at compile
> time. However, a malicious user can currently ask for many different

Even if not malicious.. :-)

> clamp values thus consuming all the available clamp groups.
>
> Since on properly configured systems we expect only a limited set of
> different clamp values, the previous problem can be mitigated by
> allowing access to clamp groups configuration only to privileged tasks.
> This should still allow a System Management Software to properly
> pre-configure the system.
>
> Let's restrict the tuning of utilization clamp values, by default, to
> tasks with CAP_SYS_ADMIN capabilities.
>
> Whenever this should be considered too restrictive and/or not required
> for a specific platform, a kernel boot option is provided to change
> this default behavior thus allowing non privileged tasks to change their
> utilization clamp values.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Others:
> - new patch added in this version
> - rebased on v4.19-rc1
> ---
> .../admin-guide/kernel-parameters.txt | 3 +++
> kernel/sched/core.c | 22 ++++++++++++++++---
> 2 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9871e649ffef..481f8214ea9a 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4561,6 +4561,9 @@
> <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
> See also Documentation/input/devices/joystick-parport.rst
>
> + uclamp_user [KNL] Enable task-specific utilization clamping tuning
> + also from tasks without CAP_SYS_ADMIN capability.
> +
> udbg-immortal [PPC] When debugging early kernel crashes that
> happen after console_init() and before a proper
> console driver takes over, this boot options might
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 222397edb8a7..8341ce580a9a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> static inline void free_uclamp_sched_group(struct task_group *tg) { }
> #endif /* CONFIG_UCLAMP_TASK_GROUP */
>
> +static bool uclamp_user_allowed __read_mostly;
> +static int __init uclamp_user_allow(char *str)
> +{
> + uclamp_user_allowed = true;
> +
> + return 0;
> +}
> +early_param("uclamp_user", uclamp_user_allow);
> +
> static inline int __setscheduler_uclamp(struct task_struct *p,
> - const struct sched_attr *attr)
> + const struct sched_attr *attr,
> + bool user)

Wondering if you want to fold the check below inside the

if (user && !capable(CAP_SYS_NICE)) {
...
}

block. It would also save you from adding another parameter to the
function.

> {
> int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> int lower_bound, upper_bound;
> struct uclamp_se *uc_se;
> int result = 0;
>
> + if (!capable(CAP_SYS_ADMIN) &&
> + user && !uclamp_user_allowed) {
> + return -EPERM;
> + }
> +

Best,

- Juri

2018-09-05 10:47:22

by Juri Lelli

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi,

On 28/08/18 14:53, Patrick Bellasi wrote:

[...]

> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> - if (attr->sched_util_min > attr->sched_util_max)
> - return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> - return -EINVAL;
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;
>
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);

This is going to get called from an rcu_read_lock() section, which is a
no-go for using mutexes:

sys_sched_setattr ->
rcu_read_lock()
...
sched_setattr() ->
__sched_setscheduler() ->
...
__setscheduler_uclamp() ->
...
mutex_lock()

Guess you could fix the issue by getting the task struct after
find_process_by_pid() in sys_sched_setattr() and then calling
sched_setattr() after rcu_read_unlock() (putting the task struct at the
end). Peter actually suggested this mod to solve a different issue.

Best,

- Juri

2018-09-05 11:02:38

by Juri Lelli

Subject: Re: [PATCH v4 01/16] sched/core: uclamp: extend sched_setattr to support utilization clamping

Hi,

On 28/08/18 14:53, Patrick Bellasi wrote:

[...]

> Let's introduce a new API to set utilization clamping values for a
> specified task by extending sched_setattr, a syscall which already
> allows to define task specific properties for different scheduling
> classes.
> Specifically, a new pair of attributes allows to specify a minimum and
> maximum utilization which the scheduler should consider for a task.

AFAIK sched_setattr currently mandates that a policy is always specified
[1]. I was wondering if relaxing this requirement might be handy. Since
util clamp is a cross-class feature, it might be cumbersome to always
have to get the current policy/params and use those with the new
umin/umax just to change the latter.

sched_setparam already uses the in-kernel SETPARAM_POLICY thing, maybe
we could extend that to sched_setattr? Not sure exposing this to
userspace is a good idea though. :-/

Best,

- Juri

1 - https://elixir.bootlin.com/linux/v4.19-rc2/source/kernel/sched/core.c#L4564

2018-09-06 08:19:25

by Juri Lelli

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 28/08/18 14:53, Patrick Bellasi wrote:

[...]

> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> - if (attr->sched_util_min > attr->sched_util_max)
> - return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> - return -EINVAL;
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;
>
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);
>
> - return 0;
> + /* Find a valid group_id for each required clamp value */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto done;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }

Don't you have to reset result to 0 here (it seems that what follows
cannot fail anymore)? Otherwise this function will return the latest
uclamp_group_find() return value, which will be interpreted as an error
if not 0.
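
To make the concern concrete, here is a minimal userspace sketch (not the kernel code). `model_group_find()` is a hypothetical stand-in for uclamp_group_find() that returns a group index >= 0 or -ENOSPC, and its value-to-index mapping is arbitrary. The two setters differ only in whether `result` is cleared before the common exit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_GROUPS 4

/* Hypothetical stand-in for uclamp_group_find(): index >= 0 or -ENOSPC. */
static int model_group_find(unsigned int value)
{
	unsigned int id = value / 256;	/* arbitrary mapping for the demo */

	return id < MODEL_GROUPS ? (int)id : -ENOSPC;
}

/* Mirrors the quoted flow: the group index leaks out as the return value. */
static int set_clamp_buggy(unsigned int value, int *group_id)
{
	int result;

	assert(group_id != NULL);
	result = model_group_find(value);
	if (result == -ENOSPC)
		goto done;
	*group_id = result;
done:
	return result;
}

/* Same flow, with result cleared once the lookup has succeeded. */
static int set_clamp_fixed(unsigned int value, int *group_id)
{
	int result;

	assert(group_id != NULL);
	result = model_group_find(value);
	if (result == -ENOSPC)
		goto done;
	*group_id = result;
	result = 0;
done:
	return result;
}
```

With this model the buggy variant only reports success when the value happens to map to group 0, while the fixed variant returns 0 for every successful lookup and still propagates -ENOSPC.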

> +
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, attr->sched_util_min);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, attr->sched_util_max);
> + }
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}

Best,

- Juri

2018-09-06 13:51:22

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi Juri!

On 05-Sep 12:45, Juri Lelli wrote:
> Hi,
>
> On 28/08/18 14:53, Patrick Bellasi wrote:
>
> [...]
>
> > static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> > {
> > - if (attr->sched_util_min > attr->sched_util_max)
> > - return -EINVAL;
> > - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > - return -EINVAL;
> > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > + int lower_bound, upper_bound;
> > + struct uclamp_se *uc_se;
> > + int result = 0;
> >
> > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > + mutex_lock(&uclamp_mutex);
>
> This is going to get called from an rcu_read_lock() section, which is a
> no-go for using mutexes:
>
> sys_sched_setattr ->
> rcu_read_lock()
> ...
> sched_setattr() ->
> __sched_setscheduler() ->
> ...
> __setscheduler_uclamp() ->
> ...
> mutex_lock()

Right, great catch, thanks!

> Guess you could fix the issue by getting the task struct after
> find_process_by_pid() in sys_sched_setattr() and then calling
> sched_setattr() after rcu_read_unlock() (putting the task struct at the
> end). Peter actually suggested this mod to solve a different issue.

I guess you mean something like this?

---8<---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5792,10 +5792,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();

+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}
---8<---
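
The same pattern (look up under the read-side section, pin the object, leave the section, only then do work that may sleep) can be modelled in plain userspace C. This is just a sketch: every `model_*` name is hypothetical, the "RCU section" is a boolean flag, and the "mutex" only asserts that it is never taken inside that section.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	int pid;
	int usage;	/* stands in for the task_struct reference count */
};

static bool in_rcu_read_section;	/* models being inside rcu_read_lock() */

static void model_rcu_read_lock(void)   { in_rcu_read_section = true; }
static void model_rcu_read_unlock(void) { in_rcu_read_section = false; }

/* A sleeping lock must never be taken inside the read-side section. */
static void model_mutex_lock(void)      { assert(!in_rcu_read_section); }

static int model_sched_setattr(struct model_task *p)
{
	(void)p;
	model_mutex_lock();	/* what __setscheduler_uclamp() would do */
	return 0;
}

static int model_sys_sched_setattr(struct model_task *table, size_t n, int pid)
{
	struct model_task *p = NULL;
	int retval = -ESRCH;
	size_t i;

	model_rcu_read_lock();
	for (i = 0; i < n; i++)		/* find_process_by_pid() */
		if (table[i].pid == pid)
			p = &table[i];
	if (p)
		p->usage++;		/* get_task_struct() */
	model_rcu_read_unlock();

	if (p) {
		retval = model_sched_setattr(p);
		p->usage--;		/* put_task_struct() */
	}
	return retval;
}
```

Calling `model_sched_setattr()` before `model_rcu_read_unlock()` instead would trip the assertion, which is the userspace analogue of the problem reported above.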

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-06 14:03:22

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 06-Sep 10:17, Juri Lelli wrote:
> On 28/08/18 14:53, Patrick Bellasi wrote:
>
> [...]
>
> > static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> > {
> > - if (attr->sched_util_min > attr->sched_util_max)
> > - return -EINVAL;
> > - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > - return -EINVAL;
> > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > + int lower_bound, upper_bound;
> > + struct uclamp_se *uc_se;
> > + int result = 0;
> >
> > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > + mutex_lock(&uclamp_mutex);
> >
> > - return 0;
> > + /* Find a valid group_id for each required clamp value */
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > + ? attr->sched_util_max
> > + : p->uclamp[UCLAMP_MAX].value;
> > +
> > + if (upper_bound == UCLAMP_NOT_VALID)
> > + upper_bound = SCHED_CAPACITY_SCALE;
> > + if (attr->sched_util_min > upper_bound) {
> > + result = -EINVAL;
> > + goto done;
> > + }
> > +
> > + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> > + if (result == -ENOSPC) {
> > + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> > + goto done;
> > + }
> > + group_id[UCLAMP_MIN] = result;
> > + }
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> > + ? attr->sched_util_min
> > + : p->uclamp[UCLAMP_MIN].value;
> > +
> > + if (lower_bound == UCLAMP_NOT_VALID)
> > + lower_bound = 0;
> > + if (attr->sched_util_max < lower_bound ||
> > + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> > + result = -EINVAL;
> > + goto done;
> > + }
> > +
> > + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> > + if (result == -ENOSPC) {
> > + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> > + goto done;
> > + }
> > + group_id[UCLAMP_MAX] = result;
> > + }
>
> Don't you have to reset result to 0 here (it seems what follows cannot
> fail anymore)? Otherwise this function will return latest
> uclamp_group_find return value, which will be interpreted as error if
> not 0.

Yes, weird... it used to work in my tests just because I return from
the caller on !0 and:
- that's just good enough to set my uclamps
- my current rt-app integration does not check the syscall return value
- my tests never request the clamp_group mapped on group_id=0

:(

Will fix this, the rt-app integration and the tests!

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-06 14:42:50

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 04-Sep 15:47, Juri Lelli wrote:
> Hi,
>
> On 28/08/18 14:53, Patrick Bellasi wrote:
> > The number of clamp groups supported is limited and defined at compile
> > time. However, a malicious user can currently ask for many different
>
> Even if not malicious.. :-)

Yeah... I should have written "ambitious" :D
... I'll get rid of all those geeks in the next respin ;)

> > clamp values thus consuming all the available clamp groups.
> >
> > Since on properly configured systems we expect only a limited set of
> > different clamp values, the previous problem can be mitigated by
> > allowing access to clamp groups configuration only to privileged tasks.
> > This should still allow a System Management Software to properly
> > pre-configure the system.
> >
> > Let's restrict the tuning of utilization clamp values, by default, to
> > tasks with CAP_SYS_ADMIN capabilities.
> >
> > Whenever this should be considered too restrictive and/or not required
> > for a specific platforms, a kernel boot option is provided to change
> > this default behavior thus allowing non privileged tasks to change their
> > utilization clamp values.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Paul Turner <[email protected]>
> > Cc: Suren Baghdasaryan <[email protected]>
> > Cc: Todd Kjos <[email protected]>
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Steve Muckle <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Quentin Perret <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Morten Rasmussen <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> >
> > ---
> > Changes in v4:
> > Others:
> > - new patch added in this version
> > - rebased on v4.19-rc1
> > ---
> > .../admin-guide/kernel-parameters.txt | 3 +++
> > kernel/sched/core.c | 22 ++++++++++++++++---
> > 2 files changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 9871e649ffef..481f8214ea9a 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4561,6 +4561,9 @@
> > <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
> > See also Documentation/input/devices/joystick-parport.rst
> >
> > + uclamp_user [KNL] Enable task-specific utilization clamping tuning
> > + also from tasks without CAP_SYS_ADMIN capability.
> > +
> > udbg-immortal [PPC] When debugging early kernel crashes that
> > happen after console_init() and before a proper
> > console driver takes over, this boot options might
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 222397edb8a7..8341ce580a9a 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1510,14 +1510,29 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> > static inline void free_uclamp_sched_group(struct task_group *tg) { }
> > #endif /* CONFIG_UCLAMP_TASK_GROUP */
> >
> > +static bool uclamp_user_allowed __read_mostly;
> > +static int __init uclamp_user_allow(char *str)
> > +{
> > + uclamp_user_allowed = true;
> > +
> > + return 0;
> > +}
> > +early_param("uclamp_user", uclamp_user_allow);
> > +
> > static inline int __setscheduler_uclamp(struct task_struct *p,
> > - const struct sched_attr *attr)
> > + const struct sched_attr *attr,
> > + bool user)
>
> Wondering if you want to fold the check below inside the
>
> if (user && !capable(CAP_SYS_NICE)) {
> ...
> }
>
> block. It would also save you from adding another parameter to the
> function.

So, there are two reasons for that:

1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
instead on capable(CAP_SYS_ADMIN)

Does that make sense ?

If yes, then I cannot fold it in the block you reported above
because we will not check for users with CAP_SYS_NICE.

2) Then we could move it after that block, where there is another
set of checks with just:

if (user) {

We can potentially add the check there yes... but when uclamp is
not enabled we will still perform those checks or we have to add
some compiler guards...

3) ... or at least check for:

if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)

Which is what I'm doing right after the block above (2).

But, at this point, by passing in the parameter to the
__setscheduler_uclamp() call, I get the benefits of:

a) reducing uclamp specific code in the caller
b) avoiding the checks on !CONFIG_UCLAMP_TASK build
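
For reference, the check being discussed boils down to a three-input predicate. The sketch below is a userspace model only: `has_cap_sys_admin` stands in for capable(CAP_SYS_ADMIN) and `user_allowed` for the `uclamp_user` boot option, both passed as plain parameters (an assumption of this sketch, as is the function name).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace model of the permission check in the quoted hunk:
 * only a userspace request from a task without CAP_SYS_ADMIN, on a
 * system booted without uclamp_user, is rejected.
 */
static int model_uclamp_perm_check(bool has_cap_sys_admin, bool user,
				   bool user_allowed)
{
	if (!has_cap_sys_admin && user && !user_allowed)
		return -EPERM;
	return 0;
}
```

Keeping the predicate inside the uclamp function, rather than in the caller, is what lets a !CONFIG_UCLAMP_TASK build compile it out together with the rest of the function.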

> > {
> > int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > int lower_bound, upper_bound;
> > struct uclamp_se *uc_se;
> > int result = 0;
> >
> > + if (!capable(CAP_SYS_ADMIN) &&
> > + user && !uclamp_user_allowed) {
> > + return -EPERM;
> > + }
> > +

Does all the above make sense?

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-06 15:02:51

by Juri Lelli

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 06/09/18 15:40, Patrick Bellasi wrote:
> On 04-Sep 15:47, Juri Lelli wrote:

[...]

> > Wondering if you want to fold the check below inside the
> >
> > if (user && !capable(CAP_SYS_NICE)) {
> > ...
> > }
> >
> > block. It would also save you from adding another parameter to the
> > function.
>
> So, there are two reasons for that:
>
> 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> instead on capable(CAP_SYS_ADMIN)
>
> Does that make sense ?
>
> If yes, then I cannot fold it in the block you reported above
> because we will not check for users with CAP_SYS_NICE.

Ah, right, not sure though. Looks like CAP_SYS_NICE is used for settings
that relate to priorities, affinity, etc.: CPU related stuff. Since
here you are also dealing with something that seems to fall into the
same realm, it might actually fit better than CAP_SYS_ADMIN?

Now that I think more about it, would it actually make sense to allow
unprivileged users to lower their assigned umin/umax properties if they
want? Something like what happens for nice values or RT priorities.

> 2) Then we could move it after that block, where there is another
> set of checks with just:
>
> if (user) {
>
> We can potentially add the check there yes... but when uclamp is
> not enabled we will still perform those checks or we have to add
> some compiler guards...
>
> 3) ... or at least check for:
>
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
>
> Which is what I'm doing right after the block above (2).
>
> But, at this point, by passing in the parameter to the
> __setscheduler_uclamp() call, I get the benefits of:
>
> a) reducing uclamp specific code in the caller
> b) avoiding the checks on !CONFIG_UCLAMP_TASK build
>
> > > {
> > > int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > int lower_bound, upper_bound;
> > > struct uclamp_se *uc_se;
> > > int result = 0;
> > >
> > > + if (!capable(CAP_SYS_ADMIN) &&
> > > + user && !uclamp_user_allowed) {
> > > + return -EPERM;
> > > + }
> > > +
>
> Does all the above make sense?

If we agree on CAP_SYS_ADMIN, however, your approach looks cleaner yes.

2018-09-06 17:23:31

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 06-Sep 16:59, Juri Lelli wrote:
> On 06/09/18 15:40, Patrick Bellasi wrote:
> > On 04-Sep 15:47, Juri Lelli wrote:
>
> [...]
>
> > > Wondering if you want to fold the check below inside the
> > >
> > > if (user && !capable(CAP_SYS_NICE)) {
> > > ...
> > > }
> > >
> > > block. It would also save you from adding another parameter to the
> > > function.
> >
> > So, there are two reasons for that:
> >
> > 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> > instead on capable(CAP_SYS_ADMIN)
> >
> > Does that make sense ?
> >
> > If yes, then I cannot fold it in the block you reported above
> > because we will not check for users with CAP_SYS_NICE.
>
> Ah, right, not sure though. Looks like CAP_SYS_NICE is used for settings
> that relate to priorities, affinity, etc.: CPU related stuff. Since
> here you are also dealing with something that seems to fall into the
> same realm, it might actually fit better than CAP_SYS_ADMIN?

Yes and no... from the functional standpoint if a task is running in
the root cgroup, or cgroups are not in use at all, with this API a
task can always ask for the max OPP. Which is what CAP_SYS_NICE is
there for AFAIU... but...

... this check was meant also to fix the issue of the limited number
of clamp groups. That's why I'm now asking for CAP_SYS_ADMIN.

However, I would say that if we also consider taking in the
discretization support introduced in:

[PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support
https://lore.kernel.org/lkml/[email protected]/

then yes, we remain with the "nice" semantics to cover, and
CAP_SYS_NICE could be just enough.

> Now that I think more about it, would it actually make sense to allow
> unprivileged users to lower their assigned umin/umax properties if they
> want? Something like what happens for nice values or RT priorities.

Yes... if we fix the issue with the limited clamp groups, i.e. we take
discretization in.
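
Purely as an illustration of the "nice-like" semantics being discussed, and not something this series implements, such a rule could be modelled in userspace as below. The function name is hypothetical and `privileged` stands in for whichever capability check ends up being used.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical policy, not part of the series: a task may always keep
 * or lower a clamp value it already holds, but raising it requires
 * privilege, much like nice values.
 */
static int model_uclamp_request_check(bool privileged, unsigned int cur,
				      unsigned int req)
{
	if (privileged || req <= cur)
		return 0;
	return -EPERM;
}
```

The sketch treats each clamp value independently; how min and max would interact under such a rule is left open here.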

> > 2) Then we could move it after that block, where there is another
> > set of checks with just:
> >
> > if (user) {
> >
> > We can potentially add the check there yes... but when uclamp is
> > not enabled we will still perform those checks or we have to add
> > some compiler guards...
> >
> > 3) ... or at least check for:
> >
> > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> >
> > Which is what I'm doing right after the block above (2).
> >
> > But, at this point, by passing in the parameter to the
> > __setscheduler_uclamp() call, I get the benefits of:
> >
> > a) reducing uclamp specific code in the caller
> > b) avoiding the checks on !CONFIG_UCLAMP_TASK build
> >
> > > > {
> > > > int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > > int lower_bound, upper_bound;
> > > > struct uclamp_se *uc_se;
> > > > int result = 0;
> > > >
> > > > + if (!capable(CAP_SYS_ADMIN) &&
> > > > + user && !uclamp_user_allowed) {
> > > > + return -EPERM;
> > > > + }
> > > > +
> >
> > Does all the above make sense?
>
> If we agree on CAP_SYS_ADMIN, however, your approach looks cleaner yes.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-06 20:28:19

by Juri Lelli

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 06/09/18 14:48, Patrick Bellasi wrote:
> Hi Juri!
>
> On 05-Sep 12:45, Juri Lelli wrote:
> > Hi,
> >
> > On 28/08/18 14:53, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > static inline int __setscheduler_uclamp(struct task_struct *p,
> > > const struct sched_attr *attr)
> > > {
> > > - if (attr->sched_util_min > attr->sched_util_max)
> > > - return -EINVAL;
> > > - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > > - return -EINVAL;
> > > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > + int lower_bound, upper_bound;
> > > + struct uclamp_se *uc_se;
> > > + int result = 0;
> > >
> > > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > > + mutex_lock(&uclamp_mutex);
> >
> > This is going to get called from an rcu_read_lock() section, which is a
> > no-go for using mutexes:
> >
> > sys_sched_setattr ->
> > rcu_read_lock()
> > ...
> > sched_setattr() ->
> > __sched_setscheduler() ->
> > ...
> > __setscheduler_uclamp() ->
> > ...
> > mutex_lock()
>
> Right, great catch, thanks!
>
> > Guess you could fix the issue by getting the task struct after
> > find_process_by_pid() in sys_sched_setattr() and then calling
> > sched_setattr() after rcu_read_unlock() (putting the task struct at the
> > end). Peter actually suggested this mod to solve a different issue.
>
> I guess you mean something like this ?
>
> ---8<---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5792,10 +5792,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> rcu_read_lock();
> retval = -ESRCH;
> p = find_process_by_pid(pid);
> - if (p != NULL)
> - retval = sched_setattr(p, &attr);
> + if (likely(p))
> + get_task_struct(p);
> rcu_read_unlock();
>
> + if (likely(p)) {
> + retval = sched_setattr(p, &attr);
> + put_task_struct(p);
> + }
> +
> return retval;
> }
> ---8<---

This should do the job yes.

2018-09-08 23:51:14

by Suren Baghdasaryan

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi Patrick!

On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
<[email protected]> wrote:
> Utilization clamping requires each CPU to know which clamp values are
> assigned to tasks that are currently RUNNABLE on that CPU.
> Multiple tasks can be assigned the same clamp value and tasks with
> different clamp values can be concurrently active on the same CPU.
> Thus, a proper data structure is required to support a fast and
> efficient aggregation of the clamp values required by the currently
> RUNNABLE tasks.
>
> For this purpose we use a per-CPU array of reference counters,
> where each slot is used to account how many tasks requiring a certain
> clamp value are currently RUNNABLE on each CPU.
> Each clamp value corresponds to a "clamp index" which identifies the
> position within the array of reference counters.
>
> :
> (user-space changes) : (kernel space / scheduler)
> :
> SLOW PATH : FAST PATH
> :
> task_struct::uclamp::value : sched/core::enqueue/dequeue
> : cpufreq_schedutil
> :
> +----------------+ +--------------------+ +-------------------+
> | TASK | | CLAMP GROUP | | CPU CLAMPS |
> +----------------+ +--------------------+ +-------------------+
> | | | clamp_{min,max} | | clamp_{min,max} |
> | util_{min,max} | | se_count | | tasks count |
> +----------------+ +--------------------+ +-------------------+
> :
> +------------------> : +------------------->
> group_id = map(clamp_value) : ref_count(group_id)
> :
> :
>
> Let's introduce the support to map tasks to "clamp groups".
> Specifically we introduce the required functions to translate a
> "clamp value" into a clamp's "group index" (group_id).
>
> Only a limited number of (different) clamp values are supported since:
> 1. there are usually only few classes of workloads for which it makes
> sense to boost/limit to different frequencies,
> e.g. background vs foreground, interactive vs low-priority
> 2. it allows a simpler and more memory/time efficient tracking of
> the per-CPU clamp values in the fast path.
>
> The number of possible different clamp values is currently defined at
> compile time. Thus, setting a new clamp value for a task can result into
> a -ENOSPC error in case this will exceed the number of maximum different
> clamp values supported.
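
As a rough userspace model of the slow-path mapping described above (not the code from this patch), a fixed-size table of reference-counted slots is enough to show where -ENOSPC comes from. All `model_*` names are hypothetical; the slot count mirrors the Kconfig default in the hunk below, and the lookup rule (reuse a slot already tracking the value, else take a free one) is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_GROUPS_COUNT 5	/* mirrors the Kconfig default in this patch */
#define MODEL_NOT_VALID (-1)

struct map_model {
	int value;	/* clamp value tracked by this slot */
	int se_count;	/* how many entities currently reference it */
};

static void model_map_init(struct map_model *map)
{
	int id;

	for (id = 0; id < MODEL_GROUPS_COUNT; id++) {
		map[id].value = MODEL_NOT_VALID;
		map[id].se_count = 0;
	}
}

/* Return the slot already tracking value, else a free slot, else -ENOSPC. */
static int model_group_find(const struct map_model *map, int value)
{
	int free_id = -ENOSPC;
	int id;

	for (id = 0; id < MODEL_GROUPS_COUNT; id++) {
		if (map[id].se_count > 0 && map[id].value == value)
			return id;
		if (free_id < 0 && map[id].se_count == 0)
			free_id = id;
	}
	return free_id;
}

/* Take a reference on slot id, binding it to value. */
static void model_group_get(struct map_model *map, int id, int value)
{
	assert(id >= 0 && id < MODEL_GROUPS_COUNT);
	map[id].value = value;
	map[id].se_count++;
}

/* Drop a reference; the slot becomes reusable when the count hits zero. */
static void model_group_put(struct map_model *map, int id)
{
	assert(id >= 0 && id < MODEL_GROUPS_COUNT && map[id].se_count > 0);
	if (--map[id].se_count == 0)
		map[id].value = MODEL_NOT_VALID;
}
```

Once five distinct values are referenced, a sixth distinct value gets -ENOSPC until one of the existing slots is fully released, while a request for an already tracked value keeps succeeding.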
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Message-ID: <[email protected]>
> - add uclamp_exit_task() to release clamp refcount from do_exit()
> Message-ID: <20180816133249.GA2964@e110439-lin>
> - keep the WARN but beautify that code a bit
> Message-ID: <[email protected]>
> - move uclamp_enabled at the top of sched_class to keep it on the same
> cache line of other main wakeup time callbacks
> Others:
> - init uclamp for the init_task and refcount its clamp groups
> - add uclamp specific fork time code into uclamp_fork
> - add support for SCHED_FLAG_RESET_ON_FORK
> default clamps are now set for init_task and inherited/reset at
> fork time (when the flag is set for the parent)
> - enable uclamp only for FAIR tasks, RT class will be enabled only
> by a following patch which also integrate the class to schedutil
> - define uclamp_maps ____cacheline_aligned_in_smp
> - in uclamp_group_get() ensure to include uclamp_group_available() and
> uclamp_group_init() into the atomic section defined by:
> uc_map[next_group_id].se_lock
> - do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task
> which is also not needed since refcounting is already guarded by
> the uc_map[group_id].se_lock spinlock
> - rebased on v4.19-rc1
>
> Changes in v3:
> Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
> - rename UCLAMP_NONE into UCLAMP_NOT_VALID
> - remove not necessary checks in uclamp_group_find()
> - add WARN on unlikely un-referenced decrement in uclamp_group_put()
> - make __setscheduler_uclamp() able to set just one clamp value
> - make __setscheduler_uclamp() failing if both clamps are required but
> there is no clamp groups available for one of them
> - remove uclamp_group_find() from uclamp_group_get() which now takes a
> group_id as a parameter
> Others:
> - rebased on tip/sched/core
> Changes in v2:
> - rebased on v4.18-rc4
> - set UCLAMP_GROUPS_COUNT=2 by default
> which allows to fit all the hot-path CPU clamps data, partially
> introduced also by the following patches, into a single cache line
> while still supporting up to 2 different {min,max}_util clamps.
> ---
> include/linux/sched.h | 16 +-
> include/linux/sched/task.h | 6 +
> include/uapi/linux/sched.h | 6 +-
> init/Kconfig | 20 ++
> init/init_task.c | 4 -
> kernel/exit.c | 1 +
> kernel/sched/core.c | 395 +++++++++++++++++++++++++++++++++++--
> kernel/sched/fair.c | 4 +
> kernel/sched/sched.h | 28 ++-
> 9 files changed, 456 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 880a0c5c1f87..7385f0b1a7c0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -279,6 +279,9 @@ struct vtime {
> u64 gtime;
> };
>
> +/* Clamp not valid, i.e. group not assigned or invalid value */
> +#define UCLAMP_NOT_VALID -1
> +
> enum uclamp_id {
> UCLAMP_MIN = 0, /* Minimum utilization */
> UCLAMP_MAX, /* Maximum utilization */
> @@ -575,6 +578,17 @@ struct sched_dl_entity {
> struct hrtimer inactive_timer;
> };
>
> +/**
> + * Utilization's clamp group
> + *
> + * A utilization clamp group maps a "clamp value" (value), i.e.
> + * util_{min,max}, to a "clamp group index" (group_id).
> + */
> +struct uclamp_se {
> + unsigned int value;
> + unsigned int group_id;
> +};
> +
> union rcu_special {
> struct {
> u8 blocked;
> @@ -659,7 +673,7 @@ struct task_struct {
>
> #ifdef CONFIG_UCLAMP_TASK
> /* Utlization clamp values for this task */
> - int uclamp[UCLAMP_CNT];
> + struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 108ede99e533..36c81c364112 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -68,6 +68,12 @@ static inline void exit_thread(struct task_struct *tsk)
> #endif
> extern void do_group_exit(int);
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern void uclamp_exit_task(struct task_struct *p);
> +#else
> +static inline void uclamp_exit_task(struct task_struct *p) { }
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> extern void exit_files(struct task_struct *);
> extern void exit_itimers(struct signal_struct *);
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index c27d6e81517b..ae7e12de32ca 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -50,7 +50,11 @@
> #define SCHED_FLAG_RESET_ON_FORK 0x01
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> -#define SCHED_FLAG_UTIL_CLAMP 0x08
> +
> +#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10
> +#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20
> +#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
> + SCHED_FLAG_UTIL_CLAMP_MAX)
>
> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> SCHED_FLAG_RECLAIM | \
> diff --git a/init/Kconfig b/init/Kconfig
> index 738974c4f628..10536cb83295 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -633,7 +633,27 @@ config UCLAMP_TASK
>
> If in doubt, say N.
>
> +config UCLAMP_GROUPS_COUNT
> + int "Number of different utilization clamp values supported"
> + range 0 32
> + default 5
> + depends on UCLAMP_TASK
> + help
> + This defines the maximum number of different utilization clamp
> + values which can be concurrently enforced for each utilization
> + clamp index (i.e. minimum and maximum utilization).
> +
> + Only a limited number of clamp values are supported because:
> + 1. there are usually only few classes of workloads for which it
> + makes sense to boost/cap for different frequencies,
> + e.g. background vs foreground, interactive vs low-priority.
> + 2. it allows a simpler and more memory/time efficient tracking of
> + the per-CPU clamp values.
> +
> + If in doubt, use the default value.
> +
> endmenu
> +
> #
> # For architectures that want to enable the support for NUMA-affine scheduler
> # balancing logic:
> diff --git a/init/init_task.c b/init/init_task.c
> index 5bfdcc3fb839..7f77741b6a9b 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -92,10 +92,6 @@ struct task_struct init_task
> #endif
> #ifdef CONFIG_CGROUP_SCHED
> .sched_task_group = &root_task_group,
> -#endif
> -#ifdef CONFIG_UCLAMP_TASK
> - .uclamp[UCLAMP_MIN] = 0,
> - .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
> #endif
> .ptraced = LIST_HEAD_INIT(init_task.ptraced),
> .ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 0e21e6d21f35..feb540558051 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -877,6 +877,7 @@ void __noreturn do_exit(long code)
>
> sched_autogroup_exit_task(tsk);
> cgroup_exit(tsk);
> + uclamp_exit_task(tsk);
>
> /*
> * FIXME: do that only when needed, using sched_exit tracepoint
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 16d3544c7ffa..2668990b96d1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -717,25 +717,389 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
>
> #ifdef CONFIG_UCLAMP_TASK
> +/**
> + * uclamp_mutex: serializes updates of utilization clamp values
> + *
> + * A utilization clamp value update is usually triggered from a user-space
> + * process (slow-path) but it requires a synchronization with the scheduler's
> + * (fast-path) enqueue/dequeue operations.
> + * While the fast-path synchronization is protected by RQs spinlock, this
> + * mutex ensures that we sequentially serve user-space requests.
> + */
> +static DEFINE_MUTEX(uclamp_mutex);
> +
> +/**
> + * uclamp_map: reference counts a utilization "clamp value"
> + * @value: the utilization "clamp value" required
> + * @se_count: the number of scheduling entities requiring the "clamp value"
> + * @se_lock: serialize reference count updates by protecting se_count
> + */
> +struct uclamp_map {
> + int value;
> + int se_count;
> + raw_spinlock_t se_lock;
> +};
> +
> +/**
> + * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
> + *
> + * Since only a limited number of different "clamp values" are supported, we
> + * need to map each different clamp value into a "clamp group" (group_id) to
> + * be used by the per-CPU accounting in the fast-path, when tasks are
> + * enqueued and dequeued.
> + * We also support different kind of utilization clamping, min and max
> + * utilization for example, each representing what we call a "clamp index"
> + * (clamp_id).
> + *
> + * A matrix is thus required to map "clamp values" to "clamp groups"
> + * (group_id), for each "clamp index" (clamp_id), where:
> + * - rows are indexed by clamp_id and they collect the clamp groups for a
> + * given clamp index
> + * - columns are indexed by group_id and they collect the clamp values which
> + * maps to that clamp group
> + *
> + * Thus, the column index of a given (clamp_id, value) pair represents the
> + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> + *
> + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> + * clamped tasks. Thus we allocate one more slot than the value of
> + * CONFIG_UCLAMP_GROUPS_COUNT.
> + *
> + * Here is the map layout and, right below, how entries are accessed by the
> + * following code.
> + *
> + * uclamp_maps is a matrix of
> + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> + * | |
> + * | /---------------+---------------\
> + * | +------------+ +------------+
> + * | / UCLAMP_MIN | value | | value |
> + * | | | se_count |...... | se_count |
> + * | | +------------+ +------------+
> + * +--+ +------------+ +------------+
> + * | | value | | value |
> + * \ UCLAMP_MAX | se_count |...... | se_count |
> + * +-----^------+ +----^-------+
> + * | |
> + * uc_map = + |
> + * &uclamp_maps[clamp_id][0] +
> + * clamp_value =
> + * uc_map[group_id].value
> + */
> +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> + ____cacheline_aligned_in_smp;
> +
> +#define UCLAMP_ENOSPC_FMT "Cannot allocate more than " \
> + __stringify(CONFIG_UCLAMP_GROUPS_COUNT) " UTIL_%s clamp groups\n"
> +
> +/**
> + * uclamp_group_available: checks if a clamp group is available
> + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
> + * @group_id: the group index in the given clamp_id
> + *
> + * A clamp group is not free if there is at least one SE which is sing a clamp

Typo in the sentence above: "sing" should be "using".

> + * value mapped on the specified clamp_id. These SEs are reference counted by
> + * the se_count of a uclamp_map entry.
> + *
> + * Return: true if there are no SE's mapped on the specified clamp
> + * index and group
> + */
> +static inline bool uclamp_group_available(int clamp_id, int group_id)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> +
> + return (uc_map[group_id].value == UCLAMP_NOT_VALID);
> +}
> +
> +/**
> + * uclamp_group_init: maps a clamp value on a specified clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
> + * @group_id: the group index to map a given clamp_value
> + * @clamp_value: the utilization clamp value to map
> + *
> + * Initializes a clamp group to track tasks from the fast-path.
> + * Each different clamp value, for a given clamp index (i.e. min/max
> + * utilization clamp), is mapped by a clamp group which index is used by the
> + * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp
> + * value.
> + *
> + */
> +static inline void uclamp_group_init(int clamp_id, int group_id,
> + unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> +
> + uc_map[group_id].value = clamp_value;
> + uc_map[group_id].se_count = 0;
> +}
> +
> +/**
> + * uclamp_group_reset: resets a specified clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
> + * @group_id: the group index to release
> + *
> + * A clamp group can be reset every time there are no more task groups using
> + * the clamp value it maps for a given clamp index.
> + */
> +static inline void uclamp_group_reset(int clamp_id, int group_id)
> +{
> + uclamp_group_init(clamp_id, group_id, UCLAMP_NOT_VALID);
> +}
> +
> +/**
> + * uclamp_group_find: finds the group index of a utilization clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
> + * @clamp_value: the utilization clamping value lookup for
> + *
> + * Verify if a group has been assigned to a certain clamp value and return
> + * its index to be used for accounting.
> + *
> + * Since only a limited number of utilization clamp groups are allowed, if no
> + * groups have been assigned for the specified value, a new group is assigned,
> + * if possible.
> + * Otherwise an error is returned, meaning that an additional clamp value is
> + * not (currently) supported.
> + */
> +static int
> +uclamp_group_find(int clamp_id, unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int free_group_id = UCLAMP_NOT_VALID;
> + unsigned int group_id = 0;
> +
> + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> + /* Keep track of first free clamp group */
> + if (uclamp_group_available(clamp_id, group_id)) {
> + if (free_group_id == UCLAMP_NOT_VALID)
> + free_group_id = group_id;
> + continue;
> + }

Not a big improvement, but reordering the two conditions in this loop
would avoid finding and recording free_group_id when the very first
group is the one we are looking for.
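
To make the suggestion concrete, here is a minimal standalone sketch of the
reordered loop. It is not the patch code: group_find_sketch(),
struct uclamp_map_sketch, GROUPS_COUNT and the value chosen for
UCLAMP_NOT_VALID are stand-ins assumed for illustration, and locking is
left out. It relies on a valid clamp value never being equal to
UCLAMP_NOT_VALID, so checking for a match first gives the same result.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the kernel definitions; the values are assumptions. */
#define UCLAMP_NOT_VALID	(-1)
#define GROUPS_COUNT		5	/* role of CONFIG_UCLAMP_GROUPS_COUNT */

struct uclamp_map_sketch {
	int value;	/* clamp value of this group, or UCLAMP_NOT_VALID */
	int se_count;	/* unused here, kept to mirror the patch layout */
};

/* uc_map is one row of the map: the groups of a single clamp index. */
static int group_find_sketch(const struct uclamp_map_sketch *uc_map,
			     int clamp_value)
{
	int free_group_id = UCLAMP_NOT_VALID;
	int group_id;

	/* A lookup for the "free" marker itself would be meaningless */
	assert(clamp_value != UCLAMP_NOT_VALID);

	for (group_id = 0; group_id <= GROUPS_COUNT; ++group_id) {
		/* Match check first: a hit returns without further work */
		if (uc_map[group_id].value == clamp_value)
			return group_id;

		/* Otherwise keep track of the first free clamp group */
		if (free_group_id == UCLAMP_NOT_VALID &&
		    uc_map[group_id].value == UCLAMP_NOT_VALID)
			free_group_id = group_id;
	}

	if (free_group_id != UCLAMP_NOT_VALID)
		return free_group_id;

	return -ENOSPC;
}
```

With this ordering a match in group 0 returns on the first comparison, and
free_group_id is only recorded for groups that did not match.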

> + /* Return index of first group with same clamp value */
> + if (uc_map[group_id].value == clamp_value)
> + return group_id;
> + }
> +
> + if (likely(free_group_id != UCLAMP_NOT_VALID))
> + return free_group_id;
> +
> + return -ENOSPC;
> +}
> +
> +/**
> + * uclamp_group_put: decrease the reference count for a clamp group
> + * @clamp_id: the clamp index which was affected by a task group
> + * @uc_se: the utilization clamp data for that task group
> + *
> + * When the clamp value for a task group is changed we decrease the reference
> + * count for the clamp group mapping its current clamp value. A clamp group is
> + * released when there are no more task groups referencing its clamp value.
> + */

Are the size and the number of invocations of this function small
enough for inlining? The same goes for uclamp_group_get() and especially
for __setscheduler_uclamp().

> +static inline void uclamp_group_put(int clamp_id, int group_id)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + unsigned long flags;
> +
> + /* Ignore SE's not yet attached */
> + if (group_id == UCLAMP_NOT_VALID)
> + return;
> +
> + /* Remove SE from this clamp group */
> + raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
> + if (likely(uc_map[group_id].se_count))
> + uc_map[group_id].se_count -= 1;
> +#ifdef SCHED_DEBUG
> + else {

nit: no need for braces

> + WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
> + clamp_id, group_id);
> + }
> +#endif
> + if (uc_map[group_id].se_count == 0)
> + uclamp_group_reset(clamp_id, group_id);
> + raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
> +}
> +
> +/**
> + * uclamp_group_get: increase the reference count for a clamp group
> + * @clamp_id: the clamp index affected by the task
> + * @next_group_id: the clamp group to refcount
> + * @uc_se: the utilization clamp data for the task
> + * @clamp_value: the new clamp value for the task
> + *
> + * Each time a task changes its utilization clamp value, for a specified clamp
> + * index, we need to find an available clamp group which can be used to track
> + * this new clamp value. The corresponding clamp group index will be used by
> + * the task to reference count the clamp value on CPUs while enqueued.
> + */
> +static inline void uclamp_group_get(int clamp_id, int next_group_id,
> + struct uclamp_se *uc_se,
> + unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int prev_group_id = uc_se->group_id;
> + unsigned long flags;
> +
> + /* Allocate new clamp group for this clamp value */
> + raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
> + if (uclamp_group_available(clamp_id, next_group_id))
> + uclamp_group_init(clamp_id, next_group_id, clamp_value);
> +
> + /* Update SE's clamp values and attach it to new clamp group */
> + uc_se->value = clamp_value;
> + uc_se->group_id = next_group_id;
> + uc_map[next_group_id].se_count += 1;
> + raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
> +
> + /* Release the previous clamp group */
> + uclamp_group_put(clamp_id, prev_group_id);
> +}
> +
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> - if (attr->sched_util_min > attr->sched_util_max)
> - return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> - return -EINVAL;
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;
>
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);
>
> - return 0;
> + /* Find a valid group_id for each required clamp value */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto done;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, attr->sched_util_min);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, attr->sched_util_max);
> + }
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}
> +
> +/**
> + * uclamp_exit_task: release referenced clamp groups
> + * @p: the task exiting
> + *
> + * When a task terminates, release all its (eventually) refcounted
> + * task-specific clamp groups.
> + */
> +void uclamp_exit_task(struct task_struct *p)
> +{
> + struct uclamp_se *uc_se;
> + int clamp_id;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_se = &p->uclamp[clamp_id];
> + uclamp_group_put(clamp_id, uc_se->group_id);
> + }
> +}
> +
> +/**
> + * uclamp_fork: refcount task-specific clamp values for a new task
> + */
> +static void uclamp_fork(struct task_struct *p, bool reset)
> +{
> + int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + int next_group_id = p->uclamp[clamp_id].group_id;
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];

Might be easier to read if after the above assignment you use
uc_se->xxx instead of p->uclamp[clamp_id].xxx in the code below.
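
Something along these lines is what I have in mind. This is only a
standalone sketch: struct task_sketch, uclamp_fork_sketch() and
uclamp_group_get_stub() are hypothetical stand-ins (the stub just records
the attachment and does no refcounting), the sched_class->uclamp_enabled
check is omitted, and the constants are assumed values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed stand-ins for the kernel definitions */
#define UCLAMP_NOT_VALID	((unsigned int)-1)
#define SCHED_CAPACITY_SCALE	1024U
enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

struct uclamp_se {
	unsigned int value;
	unsigned int group_id;
};

/* Hypothetical reduced task: only the fields this sketch needs */
struct task_sketch {
	struct uclamp_se uclamp[UCLAMP_CNT];
};

static unsigned int uclamp_none(int clamp_id)
{
	if (clamp_id == UCLAMP_MIN)
		return 0;
	return SCHED_CAPACITY_SCALE;
}

/* Stub: attaches the SE to the group; the real code also refcounts */
static void uclamp_group_get_stub(int clamp_id, int next_group_id,
				  struct uclamp_se *uc_se,
				  unsigned int clamp_value)
{
	(void)clamp_id;
	uc_se->value = clamp_value;
	uc_se->group_id = next_group_id;
}

static void uclamp_fork_sketch(struct task_sketch *p, bool reset)
{
	int clamp_id;

	assert(p != NULL);

	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
		/* Take the pointer once, then use uc_se-> everywhere */
		struct uclamp_se *uc_se = &p->uclamp[clamp_id];
		int next_group_id = uc_se->group_id;

		if (reset) {
			next_group_id = 0;
			uc_se->value = uclamp_none(clamp_id);
		}

		uc_se->group_id = UCLAMP_NOT_VALID;
		uclamp_group_get_stub(clamp_id, next_group_id, uc_se,
				      uc_se->value);
	}
}
```

The behaviour is meant to be unchanged; only the repeated
p->uclamp[clamp_id] accesses are replaced by uc_se.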

> +
> + if (unlikely(reset)) {
> + next_group_id = 0;
> + p->uclamp[clamp_id].value = uclamp_none(clamp_id);
> + }
> +
> + p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
> + uclamp_group_get(clamp_id, next_group_id, uc_se,
> + p->uclamp[clamp_id].value);
> + }
> +}
> +
> +/**
> + * init_uclamp: initialize data structures required for utilization clamping
> + */
> +static void __init init_uclamp(void)
> +{
> + struct uclamp_se *uc_se;
> + int clamp_id;
> +
> + mutex_init(&uclamp_mutex);
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int group_id = 0;
> +
> + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> + uc_map[group_id].value = UCLAMP_NOT_VALID;
> + raw_spin_lock_init(&uc_map[group_id].se_lock);
> + }
> +
> + /* Init init_task's clamp group */
> + uc_se = &init_task.uclamp[clamp_id];
> + uc_se->group_id = UCLAMP_NOT_VALID;
> + uclamp_group_get(clamp_id, 0, uc_se, uclamp_none(clamp_id));
> + }
> }
> +
> #else /* CONFIG_UCLAMP_TASK */
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> return -EINVAL;
> }
> +static inline void uclamp_fork(struct task_struct *p, bool reset) { }
> +static inline void init_uclamp(void) { }
> #endif /* CONFIG_UCLAMP_TASK */
>
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> @@ -2314,6 +2678,7 @@ static inline void init_schedstats(void) {}
> int sched_fork(unsigned long clone_flags, struct task_struct *p)
> {
> unsigned long flags;
> + bool reset;
>
> __sched_fork(clone_flags, p);
> /*
> @@ -2331,7 +2696,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> /*
> * Revert to default priority/policy on fork if requested.
> */
> - if (unlikely(p->sched_reset_on_fork)) {
> + reset = p->sched_reset_on_fork;
> + if (unlikely(reset)) {
> if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
> p->policy = SCHED_NORMAL;
> p->static_prio = NICE_TO_PRIO(0);
> @@ -2342,11 +2708,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->prio = p->normal_prio = __normal_prio(p);
> set_load_weight(p, false);
>
> -#ifdef CONFIG_UCLAMP_TASK
> - p->uclamp[UCLAMP_MIN] = 0;
> - p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
> -#endif
> -
> /*
> * We don't need the reset flag anymore after the fork. It has
> * fulfilled its duty:
> @@ -2363,6 +2724,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>
> init_entity_runnable_average(&p->se);
>
> + uclamp_fork(p, reset);
> +
> /*
> * The child is not yet in the pid-hash so no cgroup attach races,
> * and the cgroup is pinned to this child due to cgroup_fork()
> @@ -4756,8 +5119,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> attr.sched_nice = task_nice(p);
>
> #ifdef CONFIG_UCLAMP_TASK
> - attr.sched_util_min = p->uclamp[UCLAMP_MIN];
> - attr.sched_util_max = p->uclamp[UCLAMP_MAX];
> + attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
> + attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
> #endif
>
> rcu_read_unlock();
> @@ -6107,6 +6470,8 @@ void __init sched_init(void)
>
> init_schedstats();
>
> + init_uclamp();
> +
> scheduler_running = 1;
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b39fb596f6c1..dab0405386c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10055,6 +10055,10 @@ const struct sched_class fair_sched_class = {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> .task_change_group = task_change_group_fair,
> #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> + .uclamp_enabled = 1,
> +#endif
> };
>
> #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4a2e8cae63c4..72df2dc779bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1501,10 +1501,12 @@ extern const u32 sched_prio_to_wmult[40];
> struct sched_class {
> const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> +
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> - void (*yield_task) (struct rq *rq);
> - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
>
> void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1537,7 +1539,6 @@ struct sched_class {
> void (*set_curr_task)(struct rq *rq);
> void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> void (*task_fork)(struct task_struct *p);
> - void (*task_dead)(struct task_struct *p);
>
> /*
> * The switched_from() call is allowed to drop rq->lock, therefore we
> @@ -1554,12 +1555,17 @@ struct sched_class {
>
> void (*update_curr)(struct rq *rq);
>
> + void (*yield_task) (struct rq *rq);
> + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> +
> #define TASK_SET_GROUP 0
> #define TASK_MOVE_GROUP 1
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_change_group)(struct task_struct *p, int type);
> #endif
> +
> + void (*task_dead)(struct task_struct *p);
> };
>
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> @@ -2177,6 +2183,22 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
> static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> #endif /* CONFIG_CPU_FREQ */
>
> +/**
> + * uclamp_none: default value for a clamp
> + *
> + * This returns the default value for each clamp
> + * - 0 for a min utilization clamp
> + * - SCHED_CAPACITY_SCALE for a max utilization clamp
> + *
> + * Return: the default value for a given utilization clamp
> + */
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> #ifdef arch_scale_freq_capacity
> # ifndef arch_scale_freq_invariant
> # define arch_scale_freq_invariant() true
> --
> 2.18.0
>

Thanks,
Suren.

2018-09-09 03:07:00

by Suren Baghdasaryan

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
<[email protected]> wrote:
> In order to properly support hierarchical resources control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires to properly propagate and aggregate parent
> attributes down to its descendants.
>
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This represent also the clamp value which is actually
> used to clamp tasks in each task group.
>
> Since it can be interesting for tasks in a cgroup to know exactly what
> is the currently propagated/enforced configuration, the effective clamp
> values are exposed to user-space by means of a new pair of read-only
> attributes: cpu.util.{min,max}.effective.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Message-ID: <20180816140731.GD2960@e110439-lin>
> - add ".effective" attributes to the default hierarchy
> Others:
> - small documentation fixes
> - rebased on v4.19-rc1
>
> Changes in v3:
> Message-ID: <[email protected]>
> - new patch in v3, to implement a suggestion from v1 review
> ---
> Documentation/admin-guide/cgroup-v2.rst | 25 +++++-
> include/linux/sched.h | 8 ++
> kernel/sched/core.c | 112 +++++++++++++++++++++++-
> 3 files changed, 139 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 80ef7bdc517b..72272f58d304 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -976,22 +976,43 @@ All time durations are in microseconds.
> A read-write single value file which exists on non-root cgroups.
> The default is "0", i.e. no bandwidth boosting.
>
> - The minimum utilization in the range [0, 1023].
> + The requested minimum utilization in the range [0, 1023].
>
> This interface allows reading and setting minimum utilization clamp
> values similar to the sched_setattr(2). This minimum utilization
> value is used to clamp the task specific minimum utilization clamp.
>
> + cpu.util.min.effective
> + A read-only single value file which exists on non-root cgroups and
> + reports minimum utilization clamp value currently enforced on a task
> + group.
> +
> + The actual minimum utilization in the range [0, 1023].
> +
> + This value can be lower then cpu.util.min in case a parent cgroup
> + is enforcing a more restrictive clamping on minimum utilization.

IMHO, if cpu.util.min=0 means "no restrictions" on UCLAMP_MIN, then
calling a parent's lower cpu.util.min value "more restrictive clamping"
is confusing. I would suggest rephrasing this to something like "...in
case a parent cgroup requires lower cpu.util.min clamping."
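
For reference, my reading of the rule from the changelog (the effective
value is the smaller of the group's own clamp value and the parent's
effective value) can be written as the small sketch below. The helper name
effective_clamp_sketch() is made up for illustration and is not part of the
patch.

```c
#include <assert.h>

/*
 * Effective clamp of a task group, as described in the changelog:
 * the smaller of the group's requested value and the parent's
 * effective value. The same rule is applied to both the min and the
 * max clamp index.
 */
static unsigned int effective_clamp_sketch(unsigned int requested,
					   unsigned int parent_effective)
{
	/* Both inputs are expected to be in the documented range */
	assert(requested <= 1023 && parent_effective <= 1023);

	return requested < parent_effective ? requested : parent_effective;
}
```

So a child writing cpu.util.min=500 under a parent whose effective minimum
is 200 ends up with cpu.util.min.effective=200: the parent lowers the
child's boost rather than "restricting" it in the usual sense, which is why
the wording reads oddly for the min clamp.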

> +
> cpu.util.max
> A read-write single value file which exists on non-root cgroups.
> The default is "1023". i.e. no bandwidth clamping
>
> - The maximum utilization in the range [0, 1023].
> + The requested maximum utilization in the range [0, 1023].
>
> This interface allows reading and setting maximum utilization clamp
> values similar to the sched_setattr(2). This maximum utilization
> value is used to clamp the task specific maximum utilization clamp.
>
> + cpu.util.max.effective
> + A read-only single value file which exists on non-root cgroups and
> + reports maximum utilization clamp value currently enforced on a task
> + group.
> +
> + The actual maximum utilization in the range [0, 1023].
> +
> + This value can be lower then cpu.util.max in case a parent cgroup
> + is enforcing a more restrictive clamping on max utilization.
> +
> +
> Memory
> ------
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc39b67a366a..2da130d17e70 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -591,6 +591,14 @@ struct sched_dl_entity {
> struct uclamp_se {
> unsigned int value;
> unsigned int group_id;
> + /*
> + * Effective task (group) clamp value.
> + * For task groups is the value (eventually) enforced by a parent task
> + * group.
> + */
> + struct {
> + unsigned int value;
> + } effective;
> };
>
> union rcu_special {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index dcbf22abd0bf..b2d438b6484b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1254,6 +1254,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> uc_se = &tg->uclamp[clamp_id];
> + uc_se->effective.value =
> + parent->uclamp[clamp_id].effective.value;
> uc_se->value = parent->uclamp[clamp_id].value;
> uc_se->group_id = parent->uclamp[clamp_id].group_id;
> }
> @@ -1415,6 +1417,7 @@ static void __init init_uclamp(void)
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> /* Init root TG's clamp group */
> uc_se = &root_task_group.uclamp[clamp_id];
> + uc_se->effective.value = uclamp_none(clamp_id);
> uc_se->value = uclamp_none(clamp_id);
> uc_se->group_id = 0;
> #endif
> @@ -7226,6 +7229,68 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> }
>
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> +/**
> + * cpu_util_update_hier: propagete effective clamp down the hierarchy

typo: propagate

> + * @css: the task group to update
> + * @clamp_id: the clamp index to update
> + * @value: the new task group clamp value
> + *
> + * The effective clamp for a TG is expected to track the most restrictive
> + * value between the TG's clamp value and it's parent effective clamp value.
> + * This method achieve that:
> + * 1. updating the current TG effective value
> + * 2. walking all the descendant task group that needs an update
> + *
> + * A TG's effective clamp needs to be updated when its current value is not
> + * matching the TG's clamp value. In this case indeed either:
> + * a) the parent has got a more relaxed clamp value
> + * thus potentially we can relax the effective value for this group
> + * b) the parent has got a more strict clamp value
> + * thus potentially we have to restrict the effective value of this group
> + *
> + * Restriction and relaxation of current TG's effective clamp values needs to
> + * be propagated down to all the descendants. When a subgroup is found which
> + * has already its effective clamp value matching its clamp value, then we can
> + * safely skip all its descendants which are granted to be already in sync.
> + */
> +static void cpu_util_update_hier(struct cgroup_subsys_state *css,
> + int clamp_id, int value)
> +{
> + struct cgroup_subsys_state *top_css = css;
> + struct uclamp_se *uc_se, *uc_parent;
> +
> + css_for_each_descendant_pre(css, top_css) {
> + /*
> + * The first visited task group is top_css, which clamp value
> + * is the one passed as parameter. For descendent task
> + * groups we consider their current value.
> + */
> + uc_se = &css_tg(css)->uclamp[clamp_id];
> + if (css != top_css)
> + value = uc_se->value;
> + /*
> + * Skip the whole subtrees if the current effective clamp is
> + * alredy matching the TG's clamp value.

typo: already

> + * In this case, all the subtrees already have top_value, or a
> + * more restrictive, as effective clamp.
> + */
> + uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
> + if (uc_se->effective.value == value &&
> + uc_parent->effective.value >= value) {
> + css = css_rightmost_descendant(css);
> + continue;
> + }
> +
> + /* Propagate the most restrictive effective value */
> + if (uc_parent->effective.value < value)
> + value = uc_parent->effective.value;
> + if (uc_se->effective.value == value)
> + continue;
> +
> + uc_se->effective.value = value;
> + }
> +}
> +
> static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 min_value)
> {
> @@ -7245,6 +7310,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> if (tg->uclamp[UCLAMP_MAX].value < min_value)
> goto out;
>
> + /* Update effective clamps to track the most restrictive value */
> + cpu_util_update_hier(css, UCLAMP_MIN, min_value);
> +
> out:
> rcu_read_unlock();
>
> @@ -7270,6 +7338,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> if (tg->uclamp[UCLAMP_MIN].value > max_value)
> goto out;
>
> + /* Update effective clamps to track the most restrictive value */
> + cpu_util_update_hier(css, UCLAMP_MAX, max_value);
> +
> out:
> rcu_read_unlock();
>
> @@ -7277,14 +7348,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> }
>
> static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> - enum uclamp_id clamp_id)
> + enum uclamp_id clamp_id,
> + bool effective)
> {
> struct task_group *tg;
> u64 util_clamp;
>
> rcu_read_lock();
> tg = css_tg(css);
> - util_clamp = tg->uclamp[clamp_id].value;
> + util_clamp = effective
> + ? tg->uclamp[clamp_id].effective.value
> + : tg->uclamp[clamp_id].value;
> rcu_read_unlock();
>
> return util_clamp;
> @@ -7293,13 +7367,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> struct cftype *cft)
> {
> - return cpu_uclamp_read(css, UCLAMP_MIN);
> + return cpu_uclamp_read(css, UCLAMP_MIN, false);
> }
>
> static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
> struct cftype *cft)
> {
> - return cpu_uclamp_read(css, UCLAMP_MAX);
> + return cpu_uclamp_read(css, UCLAMP_MAX, false);
> +}
> +
> +static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MIN, true);
> +}
> +
> +static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MAX, true);
> }
> #endif /* CONFIG_UCLAMP_TASK_GROUP */
>
> @@ -7647,11 +7733,19 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_util_min_read_u64,
> .write_u64 = cpu_util_min_write_u64,
> },
> + {
> + .name = "util.min.effective",
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
> {
> .name = "util.max",
> .read_u64 = cpu_util_max_read_u64,
> .write_u64 = cpu_util_max_write_u64,
> },
> + {
> + .name = "util.max.effective",
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
> #endif
> { } /* Terminate */
> };
> @@ -7827,12 +7921,22 @@ static struct cftype cpu_files[] = {
> .read_u64 = cpu_util_min_read_u64,
> .write_u64 = cpu_util_min_write_u64,
> },
> + {
> + .name = "util.min.effective",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
> {
> .name = "util_max",
> .flags = CFTYPE_NOT_ON_ROOT,
> .read_u64 = cpu_util_max_read_u64,
> .write_u64 = cpu_util_max_write_u64,
> },
> + {
> + .name = "util.max.effective",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
> #endif
> { } /* terminate */
> };
> --
> 2.18.0
>

2018-09-09 18:53:42

by Suren Baghdasaryan

Subject: Re: [PATCH v4 09/16] sched/core: uclamp: map TG's clamp values into CPU's clamp groups

On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
<[email protected]> wrote:
> Utilization clamping requires to map each different clamp value
> into one of the available clamp groups used by the scheduler's fast-path
> to account for RUNNABLE tasks. Thus, each time a TG's clamp value
> sysfs attribute is updated via:
> cpu_util_{min,max}_write_u64()
> we need to get (if possible) a reference to the new value's clamp group
> and release the reference to the previous one.
>
> Let's ensure that, whenever a task group is assigned a specific
> clamp_value, this is properly translated into a unique clamp group to be
> used in the fast-path (i.e. at enqueue/dequeue time).
> We do that by slightly refactoring uclamp_group_get() to make the
> *task_struct parameter optional. This allows to re-use the code already
> available to support the per-task API.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Others:
> - rebased on v4.19-rc1
>
> Changes in v3:
> Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
> - add explicit calls to uclamp_group_find(), which is no longer
> part of uclamp_group_get()
> Others:
> - rebased on tip/sched/core
> Changes in v2:
> - rebased on v4.18-rc4
> - this code has been split from a previous patch to simplify the review
> ---
> include/linux/sched.h | 11 +++--
> kernel/sched/core.c | 95 +++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 95 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2da130d17e70..4e5522ed57e0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -587,17 +587,22 @@ struct sched_dl_entity {
> * The same "group_id" can be used by multiple scheduling entities, i.e.
> * either tasks or task groups, to enforce the same clamp "value" for a given
> * clamp index.
> + *
> + * Scheduling entity's specific clamp group index can be different
> + * from the effective clamp group index used at enqueue time since
> + * task groups's clamps can be restricted by their parent task group.
> */
> struct uclamp_se {
> unsigned int value;
> unsigned int group_id;
> /*
> - * Effective task (group) clamp value.
> - * For task groups is the value (eventually) enforced by a parent task
> - * group.
> + * Effective task (group) clamp value and group index.
> + * For task groups it's the value (eventually) enforced by a parent
> + * task group.
> */
> struct {
> unsigned int value;
> + unsigned int group_id;
> } effective;
> };
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b2d438b6484b..e617a7b18f2d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1250,24 +1250,51 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> struct task_group *parent)
> {
> struct uclamp_se *uc_se;
> + int next_group_id;
> int clamp_id;
>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> uc_se = &tg->uclamp[clamp_id];
> +
> uc_se->effective.value =
> parent->uclamp[clamp_id].effective.value;
> - uc_se->value = parent->uclamp[clamp_id].value;
> - uc_se->group_id = parent->uclamp[clamp_id].group_id;
> + uc_se->effective.group_id =
> + parent->uclamp[clamp_id].effective.group_id;
> +
> + next_group_id = parent->uclamp[clamp_id].group_id;
> + uc_se->group_id = UCLAMP_NOT_VALID;
> + uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
> + parent->uclamp[clamp_id].value);
> }
>
> return 1;
> }
> +
> +/**
> + * release_uclamp_sched_group: release utilization clamp references of a TG

free_uclamp_sched_group

> + * @tg: the task group being removed
> + *
> + * An empty task group can be removed only when it has no more tasks or child
> + * groups. This means that we can also safely release all the reference
> + * counting to clamp groups.
> + */
> +static inline void free_uclamp_sched_group(struct task_group *tg)
> +{
> + struct uclamp_se *uc_se;
> + int clamp_id;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_se = &tg->uclamp[clamp_id];
> + uclamp_group_put(clamp_id, uc_se->group_id);
> + }
> +}
> #else /* CONFIG_UCLAMP_TASK_GROUP */
> static inline int alloc_uclamp_sched_group(struct task_group *tg,
> struct task_group *parent)
> {
> return 1;
> }
> +static inline void free_uclamp_sched_group(struct task_group *tg) { }
> #endif /* CONFIG_UCLAMP_TASK_GROUP */
>
> static inline int __setscheduler_uclamp(struct task_struct *p,
> @@ -1417,9 +1444,18 @@ static void __init init_uclamp(void)
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> /* Init root TG's clamp group */
> uc_se = &root_task_group.uclamp[clamp_id];
> +
> uc_se->effective.value = uclamp_none(clamp_id);
> - uc_se->value = uclamp_none(clamp_id);
> - uc_se->group_id = 0;
> + uc_se->effective.group_id = 0;
> +
> + /*
> + * The max utilization is always allowed for both clamps.
> + * This is required to not force a null minimum utilization on
> + * all child groups.
> + */
> + uc_se->group_id = UCLAMP_NOT_VALID;
> + uclamp_group_get(NULL, clamp_id, 0, uc_se,
> + uclamp_none(UCLAMP_MAX));

I don't quite get why you are using uclamp_none(UCLAMP_MAX) for both
UCLAMP_MIN and UCLAMP_MAX clamps. I assume the comment above is to
explain this but I'm still unclear why this is done. Maybe expand the
comment to explain the intention? With this I think all TGs will get
boosted by default, won't they?

> #endif
> }
> }
> @@ -1427,6 +1463,7 @@ static void __init init_uclamp(void)
> #else /* CONFIG_UCLAMP_TASK */
> static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
> +static inline void free_uclamp_sched_group(struct task_group *tg) { }
> static inline int alloc_uclamp_sched_group(struct task_group *tg,
> struct task_group *parent)
> {
> @@ -6984,6 +7021,7 @@ static DEFINE_SPINLOCK(task_group_lock);
>
> static void sched_free_group(struct task_group *tg)
> {
> + free_uclamp_sched_group(tg);
> free_fair_sched_group(tg);
> free_rt_sched_group(tg);
> autogroup_free(tg);
> @@ -7234,6 +7272,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> * @css: the task group to update
> * @clamp_id: the clamp index to update
> * @value: the new task group clamp value
> + * @group_id: the group index mapping the new task clamp value
> *
> * The effective clamp for a TG is expected to track the most restrictive
> * value between the TG's clamp value and it's parent effective clamp value.
> @@ -7252,9 +7291,12 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> * be propagated down to all the descendants. When a subgroup is found which
> * has already its effective clamp value matching its clamp value, then we can
> * safely skip all its descendants which are granted to be already in sync.
> + *
> + * The TG's group_id is also updated to ensure it tracks the effective clamp
> + * value.
> */
> static void cpu_util_update_hier(struct cgroup_subsys_state *css,
> - int clamp_id, int value)
> + int clamp_id, int value, int group_id)
> {
> struct cgroup_subsys_state *top_css = css;
> struct uclamp_se *uc_se, *uc_parent;
> @@ -7282,24 +7324,30 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
> }
>
> /* Propagate the most restrictive effective value */
> - if (uc_parent->effective.value < value)
> + if (uc_parent->effective.value < value) {
> value = uc_parent->effective.value;
> + group_id = uc_parent->effective.group_id;
> + }
> if (uc_se->effective.value == value)
> continue;
>
> uc_se->effective.value = value;
> + uc_se->effective.group_id = group_id;
> }
> }
>
> static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 min_value)
> {
> + struct uclamp_se *uc_se;
> struct task_group *tg;
> int ret = -EINVAL;
> + int group_id;
>
> if (min_value > SCHED_CAPACITY_SCALE)
> return -ERANGE;
>
> + mutex_lock(&uclamp_mutex);
> rcu_read_lock();
>
> tg = css_tg(css);
> @@ -7310,11 +7358,25 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> if (tg->uclamp[UCLAMP_MAX].value < min_value)
> goto out;
>
> + /* Find a valid group_id */
> + ret = uclamp_group_find(UCLAMP_MIN, min_value);
> + if (ret == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto out;
> + }
> + group_id = ret;
> + ret = 0;
> +
> /* Update effective clamps to track the most restrictive value */
> - cpu_util_update_hier(css, UCLAMP_MIN, min_value);
> + cpu_util_update_hier(css, UCLAMP_MIN, min_value, group_id);
> +
> + /* Update TG's reference count */
> + uc_se = &tg->uclamp[UCLAMP_MIN];
> + uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value);
>
> out:
> rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
>
> return ret;
> }
> @@ -7322,12 +7384,15 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 max_value)
> {
> + struct uclamp_se *uc_se;
> struct task_group *tg;
> int ret = -EINVAL;
> + int group_id;
>
> if (max_value > SCHED_CAPACITY_SCALE)
> return -ERANGE;
>
> + mutex_lock(&uclamp_mutex);
> rcu_read_lock();
>
> tg = css_tg(css);
> @@ -7338,11 +7403,25 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> if (tg->uclamp[UCLAMP_MIN].value > max_value)
> goto out;
>
> + /* Find a valid group_id */
> + ret = uclamp_group_find(UCLAMP_MAX, max_value);
> + if (ret == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto out;
> + }
> + group_id = ret;
> + ret = 0;
> +
> /* Update effective clamps to track the most restrictive value */
> - cpu_util_update_hier(css, UCLAMP_MAX, max_value);
> + cpu_util_update_hier(css, UCLAMP_MAX, max_value, group_id);
> +
> + /* Update TG's reference count */
> + uc_se = &tg->uclamp[UCLAMP_MAX];
> + uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value);
>
> out:
> rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
>
> return ret;
> }
> --
> 2.18.0
>

2018-09-10 16:22:17

by Suren Baghdasaryan

Subject: Re: [PATCH v4 11/16] sched/core: uclamp: add system default clamps

On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
<[email protected]> wrote:
> Clamp values cannot be tuned at the root cgroup level. Moreover, because
> of the delegation model requirements and how the parent clamps
> propagation works, if we want to enable subgroups to set a non null
> util.min, we need to be able to configure the root group util.min to the
> allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
>
> Unfortunately this setup will also mean that all tasks running in the
> root group, will always get a maximum util.min clamp, unless they have a
> lower task specific clamp which is definitively not a desirable default
> configuration.
>
> Let's fix this by explicitly adding a system default configuration
> (sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
> for all tasks running on the root group.
>
> This interface is available independently from cgroups, thus providing a
> complete solution for system wide utilization clamping configuration.
>
> Each task has now by default:
> task_struct::uclamp::value = UCLAMP_NOT_VALID
> unless:
> - the task has been forked from a parent with a valid clamp and
> !SCHED_FLAG_RESET_ON_FORK
> - the task has got a task-specific value set via sched_setattr()
>
> A task with a UCLAMP_NOT_VALID clamp value is refcounted considering the
> system default clamps if either we do not have task group support or
> they are part of the root_task_group.
> Tasks without a task specific clamp value in a child task group will be
> refcounted instead considering the task group clamps.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Message-ID: <20180820122728.GM2960@e110439-lin>
> - fix unwanted reset of clamp values on refcount success
> Others:
> - by default all tasks have a UCLAMP_NOT_VALID task specific clamp
> - always use:
> p->uclamp[clamp_id].effective.value
> to track the actual clamp value the task has been refcounted into.
> This matches with the usage of
> p->uclamp[clamp_id].effective.group_id
> - rebased on v4.19-rc1
> ---
> include/linux/sched/sysctl.h | 11 +++
> kernel/sched/core.c | 147 +++++++++++++++++++++++++++++++++--
> kernel/sysctl.c | 16 ++++
> 3 files changed, 168 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index a9c32daeb9d8..445fb54eaeff 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
> extern unsigned int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern unsigned int sysctl_sched_uclamp_util_min;
> +extern unsigned int sysctl_sched_uclamp_util_max;
> +#endif
> +
> #ifdef CONFIG_CFS_BANDWIDTH
> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> #endif
> @@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp,
> loff_t *ppos);
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern int sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos);
> +#endif
> +
> extern int sysctl_numa_balancing(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp,
> loff_t *ppos);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da0b3bd41e96..fbc8d9fdfdbb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -728,6 +728,20 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> */
> static DEFINE_MUTEX(uclamp_mutex);
>
> +/*
> + * Minimum utilization for tasks in the root cgroup
> + * default: 0
> + */
> +unsigned int sysctl_sched_uclamp_util_min;
> +
> +/*
> + * Maximum utilization for tasks in the root cgroup
> + * default: 1024
> + */
> +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> +
> +static struct uclamp_se uclamp_default[UCLAMP_CNT];
> +
> /**
> * uclamp_map: reference counts a utilization "clamp value"
> * @value: the utilization "clamp value" required
> @@ -961,11 +975,16 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> *
> * This method returns the effective group index for a task, depending on its
> * status and a proper aggregation of the clamp values listed above.
> + * Moreover, it ensures that the task's effective value:
> + * task_struct::uclamp::effective::value
> + * is updated to represent the clamp value corresponding to the task's effective
> + * group index.
> */
> static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
> {
> struct uclamp_se *uc_se;
> int clamp_value;
> + bool unclamped;
> int group_id;
>
> /* Taks currently accounted into a clamp group */
> @@ -977,13 +996,40 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
> clamp_value = uc_se->value;
> group_id = uc_se->group_id;
>
> + unclamped = (clamp_value == UCLAMP_NOT_VALID);
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> + /*
> + * Tasks in the root group, which do not have a task specific clamp
> + * value, get the system default clamp value.
> + */
> + if (unclamped && (task_group_is_autogroup(task_group(p)) ||
> + task_group(p) == &root_task_group)) {
> + p->uclamp[clamp_id].effective.value =
> + uclamp_default[clamp_id].value;
> +
> + return uclamp_default[clamp_id].group_id;
> + }
> +
> /* Use TG's clamp value to limit task specific values */
> uc_se = &task_group(p)->uclamp[clamp_id];
> - if (clamp_value > uc_se->effective.value)
> - group_id = uc_se->effective.group_id;
> + if (unclamped || clamp_value > uc_se->effective.value) {
> + p->uclamp[clamp_id].effective.value =
> + uc_se->effective.value;
> +
> + return uc_se->effective.group_id;
> + }
> +#else
> + /* By default, all tasks get the system default clamp value */
> + if (unclamped) {
> + p->uclamp[clamp_id].effective.value =
> + uclamp_default[clamp_id].value;
> +
> + return uclamp_default[clamp_id].group_id;
> + }
> #endif
>
> + p->uclamp[clamp_id].effective.value = clamp_value;
> +
> return group_id;
> }
>
> @@ -999,9 +1045,10 @@ static inline int uclamp_group_value(int clamp_id, int group_id)
>
> static inline int uclamp_task_value(struct task_struct *p, int clamp_id)
> {
> - int group_id = uclamp_task_group_id(p, clamp_id);
> + /* Ensure effective task's clamp value is update */

typo: is updated

> + uclamp_task_group_id(p, clamp_id);
>
> - return uclamp_group_value(clamp_id, group_id);
> + return p->uclamp[clamp_id].effective.value;
> }
>
> /**
> @@ -1047,7 +1094,7 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
>
> /* Reset clamp holds on idle exit */
> uc_cpu = &rq->uclamp;
> - clamp_value = p->uclamp[clamp_id].value;
> + clamp_value = p->uclamp[clamp_id].effective.value;
> if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
> /*
> * This function is called for both UCLAMP_MIN (before) and
> @@ -1300,6 +1347,77 @@ static inline void uclamp_group_get(struct task_struct *p,
> uclamp_group_put(clamp_id, prev_group_id);
> }
>
> +int sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + struct uclamp_se *uc_se;
> + int old_min, old_max;
> + unsigned int value;
> + int result;
> +
> + mutex_lock(&uclamp_mutex);
> +
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + result = -EINVAL;
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> + goto undo;
> + if (sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE)
> + goto undo;
> +
> + /* Find a valid group_id for each required clamp value */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + value = sysctl_sched_uclamp_util_min;
> + result = uclamp_group_find(UCLAMP_MIN, value);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto undo;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + value = sysctl_sched_uclamp_util_max;
> + result = uclamp_group_find(UCLAMP_MAX, value);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto undo;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> + result = 0;
> +
> + /* Update each required clamp group */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uc_se = &uclamp_default[UCLAMP_MIN];
> + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, sysctl_sched_uclamp_util_min);
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + uc_se = &uclamp_default[UCLAMP_MAX];
> + uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, sysctl_sched_uclamp_util_max);
> + }
> + goto done;
> +
> +undo:
> + sysctl_sched_uclamp_util_min = old_min;
> + sysctl_sched_uclamp_util_max = old_max;
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}
> +
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> /**
> * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
> @@ -1468,6 +1586,8 @@ static void uclamp_fork(struct task_struct *p, bool reset)
> if (unlikely(reset)) {
> next_group_id = 0;
> p->uclamp[clamp_id].value = uclamp_none(clamp_id);
> + p->uclamp[clamp_id].effective.value =
> + uclamp_none(clamp_id);
> }
> /* Forked tasks are not yet enqueued */
> p->uclamp[clamp_id].effective.group_id = UCLAMP_NOT_VALID;
> @@ -1475,6 +1595,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
> p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
> uclamp_group_get(NULL, clamp_id, next_group_id, uc_se,
> p->uclamp[clamp_id].value);
> +
> + /* By default we do not want task-specific clamp values */
> + if (unlikely(reset))
> + p->uclamp[clamp_id].value = UCLAMP_NOT_VALID;
> }
> }
>
> @@ -1509,12 +1633,17 @@ static void __init init_uclamp(void)
> uc_se->group_id = UCLAMP_NOT_VALID;
> uclamp_group_get(NULL, clamp_id, 0, uc_se,
> uclamp_none(clamp_id));
> + /*
> + * By default we do not want task-specific clamp values,
> + * so that system default values apply.
> + */
> + uc_se->value = UCLAMP_NOT_VALID;
>
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> /* Init root TG's clamp group */
> uc_se = &root_task_group.uclamp[clamp_id];
>
> - uc_se->effective.value = uclamp_none(clamp_id);
> + uc_se->effective.value = uclamp_none(UCLAMP_MAX);

Both clamps are initialized with 1024 because children can go lower
but can't go higher? A comment might be helpful.
I saw this pattern of using uclamp_none(UCLAMP_MAX) for both clamps in
a couple of places. Maybe it would be better to have something like:

static inline int tg_uclamp_none(int clamp_id) {
/* TG's min and max clamps default to SCHED_CAPACITY_SCALE to
allow children to tighten the restriction */
return SCHED_CAPACITY_SCALE;
}

and use tg_uclamp_none(clamp_id) instead of uclamp_none(UCLAMP_MAX)?
Functionally the same but much more readable.

> uc_se->effective.group_id = 0;
>
> /*
> @@ -1526,6 +1655,12 @@ static void __init init_uclamp(void)
> uclamp_group_get(NULL, clamp_id, 0, uc_se,
> uclamp_none(UCLAMP_MAX));
> #endif
> +
> + /* Init system defaul's clamp group */

typo: default

> + uc_se = &uclamp_default[clamp_id];
> + uc_se->group_id = UCLAMP_NOT_VALID;
> + uclamp_group_get(NULL, clamp_id, 0, uc_se,
> + uclamp_none(clamp_id));
> }
> }
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cc02050fd0c4..378ea57e5fc5 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -445,6 +445,22 @@ static struct ctl_table kern_table[] = {
> .mode = 0644,
> .proc_handler = sched_rr_handler,
> },
> +#ifdef CONFIG_UCLAMP_TASK
> + {
> + .procname = "sched_uclamp_util_min",
> + .data = &sysctl_sched_uclamp_util_min,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = sched_uclamp_handler,
> + },
> + {
> + .procname = "sched_uclamp_util_max",
> + .data = &sysctl_sched_uclamp_util_max,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = sched_uclamp_handler,
> + },
> +#endif
> #ifdef CONFIG_SCHED_AUTOGROUP
> {
> .procname = "sched_autogroup_enabled",
> --
> 2.18.0
>

2018-09-11 15:19:14

by Tejun Heo

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

Hello, Patrick.

Can we first concentrate on getting in the non-cgroup part first? The
feature has to make sense without cgroup too and I think it'd be a lot
easier to discuss cgroup details once the scheduler core side is
settled.

Thanks.

--
tejun

2018-09-11 16:27:34

by Patrick Bellasi

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

On 11-Sep 08:18, Tejun Heo wrote:
> Hello, Patrick.

Hi Tejun,

> Can we first concentrate on getting in the non-cgroup part first?

That's the reason why I've reordered (as per your request) the series
to have all the core and non-cgroup related bits at the beginning.

There are a couple of patches at the end of this series which could be
moved earlier but, apart from those, the cgroup code is well
self-contained within patches 7-12.

> The feature has to make sense without cgroup too

Indeed, this is what I worked on since you pointed out in v1 that
there must be a meaningful non-cgroup API and that's what we have
since v2.

> and I think it'd be a lot easier to discuss cgroup details once the
> scheduler core side is settled.

IMHO, developing the cgroup interface on top of the core bits is quite
important to ensure that we have effective data structures and
implementation which can satisfy both worlds.

My question is: IF the scheduler maintainers are going to be happy
with the overall design for the core bits, are you happy to start the
review of the cgroups bits before the core ones are (eventually) merged?

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-11 16:29:17

by Tejun Heo

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

Hello, Patrick.

On Tue, Sep 11, 2018 at 05:26:24PM +0100, Patrick Bellasi wrote:
> My question is: IF the scheduler maintainers are going to be happy
> with the overall design for the core bits, are you happy to start the
> review of the cgroups bits before the core ones are (eventually) merged?

Yeah, sure, once the feature is more or less agreed on the scheduler
core side, we can delve into how it should be represented in cgroup.

Thanks.

--
tejun

2018-09-11 16:47:23

by Patrick Bellasi

Subject: Re: [PATCH v4 11/16] sched/core: uclamp: add system default clamps

On 10-Sep 09:20, Suren Baghdasaryan wrote:
> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > @@ -1509,12 +1633,17 @@ static void __init init_uclamp(void)
> > uc_se->group_id = UCLAMP_NOT_VALID;
> > uclamp_group_get(NULL, clamp_id, 0, uc_se,
> > uclamp_none(clamp_id));
> > + /*
> > + * By default we do not want task-specific clamp values,
> > + * so that system default values apply.
> > + */
> > + uc_se->value = UCLAMP_NOT_VALID;
> >
> > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > /* Init root TG's clamp group */
> > uc_se = &root_task_group.uclamp[clamp_id];
> >
> > - uc_se->effective.value = uclamp_none(clamp_id);
> > + uc_se->effective.value = uclamp_none(UCLAMP_MAX);
>
> Both clamps are initialized with 1024 because children can go lower
> but can't go higher? Comment might be helpful.

Yes, that's because with CGroups we set the max allowed value, which
is also the one used for a clamp IFF:
- the task is not part of a more restrictive group
- the task does not have a more restrictive task-specific value

I'll improve this comment on the next respin.
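
To make the rule above concrete, here is a minimal standalone sketch (plain
userspace C, not the kernel code) of how an effective clamp can be derived
from a group's requested value and its parent's effective value. The names
tg_sketch and effective_clamp() are hypothetical; only the "most restrictive
value wins" rule and SCHED_CAPACITY_SCALE = 1024 are taken from the patch.

```c
#define SCHED_CAPACITY_SCALE 1024u

/* Hypothetical, simplified stand-in for a task group's clamp state. */
struct tg_sketch {
	unsigned int value;     /* value requested for this group */
	unsigned int effective; /* value actually enforced */
};

/*
 * The effective clamp tracks the most restrictive (i.e. lowest) value
 * between the group's own request and its parent's effective clamp.
 */
static unsigned int effective_clamp(const struct tg_sketch *parent,
				    unsigned int requested)
{
	if (parent->effective < requested)
		return parent->effective;
	return requested;
}
```

With a root effective value of 1024, a child asking for util.min = 512 keeps
512. With a root effective value of 0, the same request would be forced down
to 0, which is why the root group defaults to the maximum for both clamps.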

> I saw this pattern of using uclamp_none(UCLAMP_MAX) for both clamps in
> couple places.

The other place is to define / initialize "uclamp_default_perf", which
is the default clamps used for RT tasks, introduced by the last patch:

https://lore.kernel.org/lkml/[email protected]/

So, RT tasks and root task group are the only two exceptions for
which, by default, we want a maximum boosting.

> Maybe would be better to have smth like:
>
> static inline int tg_uclamp_none(int clamp_id) {
> /* TG's min and max clamps default to SCHED_CAPACITY_SCALE to
> allow children to tighten the restriction */
> return SCHED_CAPACITY_SCALE;
> }
>
> and use tg_uclamp_none(clamp_id) instead of uclamp_none(UCLAMP_MAX)?
> Functionally the same but much more readable.

Not entirely convinced, maybe because of the name you suggest: it
cannot contain tg, because it applies also to RT tasks when TG are not
in use.

Maybe something like: uclamp_max_boost(clamp_id) could work instead ?

It would make it more explicit that the configuration maps to:

util.min = util.max = SCHED_CAPACITY_SCALE
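
A possible shape for such a helper, as a standalone sketch only: the name
uclamp_max_boost() is the one floated above, but its body, the enum and the
simplified uclamp_none() are assumptions made for illustration and are not
taken from the series.

```c
#define SCHED_CAPACITY_SCALE 1024

/* Clamp indexes, mirroring the UCLAMP_MIN/UCLAMP_MAX names used in the series. */
enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/*
 * Simplified uclamp_none() (assumed behaviour): no boost for the min
 * clamp, no cap for the max clamp.
 */
static inline int uclamp_none(int clamp_id)
{
	return clamp_id == UCLAMP_MIN ? 0 : SCHED_CAPACITY_SCALE;
}

/*
 * Hypothetical helper: both clamps default to SCHED_CAPACITY_SCALE, i.e.
 * util.min = util.max = SCHED_CAPACITY_SCALE, whatever the clamp index.
 */
static inline int uclamp_max_boost(int clamp_id)
{
	(void)clamp_id; /* kept only for symmetry with uclamp_none() */
	return SCHED_CAPACITY_SCALE;
}
```

Call sites would then read uclamp_max_boost(clamp_id) where they currently
read uclamp_none(UCLAMP_MAX), with no functional change.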

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-11 19:27:08

by Suren Baghdasaryan

Subject: Re: [PATCH v4 11/16] sched/core: uclamp: add system default clamps

On Tue, Sep 11, 2018 at 9:46 AM, Patrick Bellasi
<[email protected]> wrote:
> On 10-Sep 09:20, Suren Baghdasaryan wrote:
>> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > @@ -1509,12 +1633,17 @@ static void __init init_uclamp(void)
>> > uc_se->group_id = UCLAMP_NOT_VALID;
>> > uclamp_group_get(NULL, clamp_id, 0, uc_se,
>> > uclamp_none(clamp_id));
>> > + /*
>> > + * By default we do not want task-specific clamp values,
>> > + * so that system default values apply.
>> > + */
>> > + uc_se->value = UCLAMP_NOT_VALID;
>> >
>> > #ifdef CONFIG_UCLAMP_TASK_GROUP
>> > /* Init root TG's clamp group */
>> > uc_se = &root_task_group.uclamp[clamp_id];
>> >
>> > - uc_se->effective.value = uclamp_none(clamp_id);
>> > + uc_se->effective.value = uclamp_none(UCLAMP_MAX);
>>
>> Both clamps are initialized with 1024 because children can go lower
>> but can't go higher? Comment might be helpful.
>
> Yes, that's because with CGroups we set the max allowed value, which
> is also the one used for a clamp IFF:
> - the task is not part of a more restrictive group
> - the task has not a more restrictive task specific value
>
> I'll improve this comment on the next respin.
>
>> I saw this pattern of using uclamp_none(UCLAMP_MAX) for both clamps in
>> couple places.
>
> The other place is to define / initialize "uclamp_default_perf", which
> is the default clamps used for RT tasks, introduce by the last patch:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> So, RT tasks and root task group are the only two exceptions for
> which, by default, we want a maximum boosting.
>
>> Maybe would be better to have smth like:
>>
>> static inline int tg_uclamp_none(int clamp_id) {
>> /* TG's min and max clamps default to SCHED_CAPACITY_SCALE to
>> allow children to tighten the restriction */
>> return SCHED_CAPACITY_SCALE;
>> }
>>
>> and use tg_uclamp_none(clamp_id) instead of uclamp_none(UCLAMP_MAX)?
>> Functionally the same but much more readable.
>
> Not entirely convinced, maybe because of the name you suggest: it
> cannot contain tg, because it applies also to RT tasks when TG are not
> in use.
>
> Maybe something like: uclamp_max_boost(clamp_id) could work instead ?

Sounds good to me.

>
> It will make more explicit that the configuration will maps into a:
>
> util.min = util.max = SCHED_CAPACITY_SCALE
>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-09-12 10:32:44

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi Suren,

On 08-Sep 16:47, Suren Baghdasaryan wrote:

[...]

> > + * A clamp group is not free if there is at least one SE which is sing a clamp
>
> typo in the sentence

Right, s/is sing/is using/

+1

[...]

> > +static int
> > +uclamp_group_find(int clamp_id, unsigned int clamp_value)
> > +{
> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> > + int free_group_id = UCLAMP_NOT_VALID;
> > + unsigned int group_id = 0;
> > +
> > + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> > + /* Keep track of first free clamp group */
> > + if (uclamp_group_available(clamp_id, group_id)) {
> > + if (free_group_id == UCLAMP_NOT_VALID)
> > + free_group_id = group_id;
> > + continue;
> > + }
>
> Not a big improvement but reordering the two conditions in this loop
> would avoid finding and recording free_group_id if the very first
> group is the one we are looking for.

Right, indeed with:

uclamp_group_put()
uclamp_group_reset()
uclamp_group_init()

we always ensure that:

uc_map[group_id].value == UCLAMP_NOT_VALID

for free groups. Thus, it should be safe to swap these two checks and
we will likely save a few instructions in the likely common case of
non-clamped tasks.

+1

I'll also take the chance to remove the two simple comments. ;)
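
For reference, the reordered loop could look roughly like the standalone
sketch below. This is not the patch code: uclamp_map_sketch and
group_find_sketch() are hypothetical names, the map is passed in explicitly,
and a free group is detected only through value == UCLAMP_NOT_VALID, relying
on the invariant described above. It also assumes callers never look up
UCLAMP_NOT_VALID itself.

```c
#include <errno.h>

#define UCLAMP_NOT_VALID (-1)

/* Hypothetical, trimmed-down clamp group map entry. */
struct uclamp_map_sketch {
	int value; /* clamp value, UCLAMP_NOT_VALID while the group is free */
};

static int group_find_sketch(const struct uclamp_map_sketch *uc_map,
			     int nr_groups, int clamp_value)
{
	int free_group_id = UCLAMP_NOT_VALID;
	int group_id;

	for (group_id = 0; group_id < nr_groups; ++group_id) {
		/* Match first: a hit on the very first group returns at once */
		if (uc_map[group_id].value == clamp_value)
			return group_id;
		/* Otherwise remember the first free group as a fallback */
		if (free_group_id == UCLAMP_NOT_VALID &&
		    uc_map[group_id].value == UCLAMP_NOT_VALID)
			free_group_id = group_id;
	}

	if (free_group_id != UCLAMP_NOT_VALID)
		return free_group_id;

	return -ENOSPC;
}
```

Since a free group can never hold a valid clamp value, checking for a match
before checking for a free slot returns the same result as the original
ordering.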

> > + /* Return index of first group with same clamp value */
> > + if (uc_map[group_id].value == clamp_value)
> > + return group_id;
> > + }
> > +
> > + if (likely(free_group_id != UCLAMP_NOT_VALID))
> > + return free_group_id;
> > +
> > + return -ENOSPC;
> > +}

[...]

> > +static inline void uclamp_group_put(int clamp_id, int group_id)

> Is the size and the number of invocations of this function small
> enough for inlining? Same goes for uclamp_group_get() and especially
> for __setscheduler_uclamp().

Right... yes, we could let the compiler do its job and remove inline
from these functions... at least for those not in the critical path.

+1

[...]

> > + if (likely(uc_map[group_id].se_count))
> > + uc_map[group_id].se_count -= 1;
> > +#ifdef SCHED_DEBUG
> > + else {
>
> nit: no need for braces
>
> > + WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
> > + clamp_id, group_id);

Since the above statement is multi-line, we actually need them for
code-style requirements.

> > + }
> > +#endif

[...]

> > +static void uclamp_fork(struct task_struct *p, bool reset)
> > +{
> > + int clamp_id;
> > +
> > + if (unlikely(!p->sched_class->uclamp_enabled))
> > + return;
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + int next_group_id = p->uclamp[clamp_id].group_id;
> > + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
>
> Might be easier to read if after the above assignment you use
> uc_se->xxx instead of p->uclamp[clamp_id].xxx in the code below.

Yes, that's actually the intent of the above assignment... but I
forgot a couple of usages! +1

> > +
> > + if (unlikely(reset)) {
> > + next_group_id = 0;
> > + p->uclamp[clamp_id].value = uclamp_none(clamp_id);
> > + }
> > +
> > + p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
> > + uclamp_group_get(clamp_id, next_group_id, uc_se,
> > + p->uclamp[clamp_id].value);
> > + }
> > +}

[...]

> Thanks,
> Suren.

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 12:52:12

by Patrick Bellasi

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

On 08-Sep 20:02, Suren Baghdasaryan wrote:
> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > + cpu.util.min.effective
> > + A read-only single value file which exists on non-root cgroups and
> > + reports minimum utilization clamp value currently enforced on a task
> > + group.
> > +
> > + The actual minimum utilization in the range [0, 1023].
> > +
> > + This value can be lower then cpu.util.min in case a parent cgroup
> > + is enforcing a more restrictive clamping on minimum utilization.
>
> IMHO if cpu.util.min=0 means "no restrictions" on UCLAMP_MIN then
> calling parent's lower cpu.util.min value "more restrictive clamping"
> is confusing. I would suggest to rephrase this to smth like "...in
> case a parent cgroup requires lower cpu.util.min clamping."

Right, it's slightly confusing... still I would like to call out that
a parent group can enforce something on its children. What about:

"... a parent cgroup allows only smaller minimum utilization values."

Is that less confusing ?

Otherwise I think your proposal could work too.
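
To make the relation between cpu.util.min and cpu.util.min.effective
concrete, here is a small sketch. It assumes the effective value is
simply the requested one capped by every ancestor's value; the struct
and function names are illustrative and not taken from the series.

```c
#include <assert.h>
#include <stddef.h>

struct tg_sketch {
	const struct tg_sketch *parent;	/* NULL for the root group */
	unsigned int util_min;		/* value written to cpu.util.min */
};

/* Value that cpu.util.min.effective would report under this assumption. */
static unsigned int util_min_effective_sketch(const struct tg_sketch *tg)
{
	unsigned int effective = tg->util_min;

	/* Every ancestor can only lower the value, never raise it. */
	for (tg = tg->parent; tg; tg = tg->parent) {
		if (tg->util_min < effective)
			effective = tg->util_min;
	}

	return effective;
}
```

For example, a child requesting 700 under a parent set to 300 would
report 300, i.e. the parent allows only smaller minimum utilization
values.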

[...]

> > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > +/**
> > + * cpu_util_update_hier: propagete effective clamp down the hierarchy
>
> typo: propagate

+1

[...]

> > + * Skip the whole subtrees if the current effective clamp is
> > + * alredy matching the TG's clamp value.
>
> typo: already

+1


Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 13:50:43

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> +/**
> + * Utilization's clamp group
> + *
> + * A utilization clamp group maps a "clamp value" (value), i.e.
> + * util_{min,max}, to a "clamp group index" (group_id).
> + */
> +struct uclamp_se {
> + unsigned int value;
> + unsigned int group_id;
> +};

> +/**
> + * uclamp_map: reference counts a utilization "clamp value"
> + * @value: the utilization "clamp value" required
> + * @se_count: the number of scheduling entities requiring the "clamp value"
> + * @se_lock: serialize reference count updates by protecting se_count

Why do you have a spinlock to serialize a single value? Don't we have
atomics for that?

> + */
> +struct uclamp_map {
> + int value;
> + int se_count;
> + raw_spinlock_t se_lock;
> +};
> +
> +/**
> + * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
> + *
> + * Since only a limited number of different "clamp values" are supported, we
> + * need to map each different clamp value into a "clamp group" (group_id) to
> + * be used by the per-CPU accounting in the fast-path, when tasks are
> + * enqueued and dequeued.
> + * We also support different kind of utilization clamping, min and max
> + * utilization for example, each representing what we call a "clamp index"
> + * (clamp_id).
> + *
> + * A matrix is thus required to map "clamp values" to "clamp groups"
> + * (group_id), for each "clamp index" (clamp_id), where:
> + * - rows are indexed by clamp_id and they collect the clamp groups for a
> + * given clamp index
> + * - columns are indexed by group_id and they collect the clamp values which
> + * maps to that clamp group
> + *
> + * Thus, the column index of a given (clamp_id, value) pair represents the
> + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> + *
> + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> + * clamped tasks. Thus we allocate one more slot than the value of
> + * CONFIG_UCLAMP_GROUPS_COUNT.
> + *
> + * Here is the map layout and, right below, how entries are accessed by the
> + * following code.
> + *
> + * uclamp_maps is a matrix of
> + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> + * | |
> + * | /---------------+---------------\
> + * | +------------+ +------------+
> + * | / UCLAMP_MIN | value | | value |
> + * | | | se_count |...... | se_count |
> + * | | +------------+ +------------+
> + * +--+ +------------+ +------------+
> + * | | value | | value |
> + * \ UCLAMP_MAX | se_count |...... | se_count |
> + * +-----^------+ +----^-------+
> + * | |
> + * uc_map = + |
> + * &uclamp_maps[clamp_id][0] +
> + * clamp_value =
> + * uc_map[group_id].value
> + */
> +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> + ____cacheline_aligned_in_smp;
> +

I'm still completely confused by all this.

sizeof(uclamp_map) = 12

that array is 2*6=12 of those, so the whole thing is 144 bytes, which is
more than 2 (64 byte) cachelines. What's the purpose of that cacheline
align statement?

Note that without that apparently superfluous lock, it would be 8*12 =
96 bytes, which is 1.5 lines and would indeed suggest you default to
GROUP_COUNT=7 to fill 2 lines.

Why are the min and max things torn up like that? I'm fairly sure I
asked some of that last time; but the above comments only try to explain
what, not why.
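
The arithmetic above can be checked with a small userspace sketch. It
assumes a 4-byte int, a 4-byte stand-in for raw_spinlock_t (i.e. no
lock debugging), 64-byte cache lines and CONFIG_UCLAMP_GROUPS_COUNT=5;
all the _sketch names and helpers are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES	64
#define UCLAMP_CNT	2	/* UCLAMP_MIN and UCLAMP_MAX */

struct uclamp_map_locked_sketch {
	int value;
	int se_count;
	unsigned int se_lock;	/* 4-byte stand-in for raw_spinlock_t */
};

struct uclamp_map_lockless_sketch {
	int value;
	int se_count;
};

/* Bytes used by a UCLAMP_CNT x (groups_count + 1) matrix of entries. */
static size_t uclamp_maps_bytes(size_t entry_size, size_t groups_count)
{
	return UCLAMP_CNT * (groups_count + 1) * entry_size;
}

/* Number of cache lines touched by an aligned object of the given size. */
static size_t cachelines_spanned(size_t bytes)
{
	return (bytes + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}
```

Under these assumptions the locked layout takes 144 bytes (three
lines), the lockless one 96 bytes with 5 groups and exactly 128 bytes
(two full lines) with 7 groups.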

2018-09-12 14:20:43

by Patrick Bellasi

Subject: Re: [PATCH v4 09/16] sched/core: uclamp: map TG's clamp values into CPU's clamp groups

On 09-Sep 11:52, Suren Baghdasaryan wrote:
> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > +/**
> > + * release_uclamp_sched_group: release utilization clamp references of a TG
>
> free_uclamp_sched_group

+1

> > + * @tg: the task group being removed
> > + *
> > + * An empty task group can be removed only when it has no more tasks or child
> > + * groups. This means that we can also safely release all the reference
> > + * counting to clamp groups.
> > + */
> > +static inline void free_uclamp_sched_group(struct task_group *tg)
> > +{

[...]

> > @@ -1417,9 +1444,18 @@ static void __init init_uclamp(void)
> > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > /* Init root TG's clamp group */
> > uc_se = &root_task_group.uclamp[clamp_id];
> > +
> > uc_se->effective.value = uclamp_none(clamp_id);
> > - uc_se->value = uclamp_none(clamp_id);
> > - uc_se->group_id = 0;
> > + uc_se->effective.group_id = 0;
> > +
> > + /*
> > + * The max utilization is always allowed for both clamps.
> > + * This is required to not force a null minimum utiliation on
> > + * all child groups.
> > + */
> > + uc_se->group_id = UCLAMP_NOT_VALID;
> > + uclamp_group_get(NULL, clamp_id, 0, uc_se,
> > + uclamp_none(UCLAMP_MAX));
>
> I don't quite get why you are using uclamp_none(UCLAMP_MAX) for both
> UCLAMP_MIN and UCLAMP_MAX clamps. I assume the comment above is to
> explain this but I'm still unclear why this is done.

That's maybe a bit tricky to get but, this will not happen since for
root group tasks we apply the system default values... which however
are introduced by one of the following patches 11/16.

So, my understanding of the "delegation model" is that for cgroups we
have to ensure each TG is a "restriction" of its parent. Thus:

tg::util_min <= tg_parent::util_min

This is required to ensure that a tg_parent can always restrict
resources on its descendants. I guess that's required to have a sane
usage of CGroups for VMs where the Host can transparently control its
Guests.

According to the above rule, and considering that root task group
cannot be modified, to allow boosting on TG we are forced to set the
root group with util_min = SCHED_CAPACITY_SCALE.

Moreover, Tejun pointed out that if we need tuning at root TG level,
it means that we need system wide tunable, which should be available
also when CGroups are not in use.

That's why on patch:

[PATCH v4 11/16] sched/core: uclamp: add system default clamps
https://lore.kernel.org/lkml/[email protected]/

we add the concept of system default clamps which are actually
initialized with util_min=0, i.e. 0% boost by default.

These system default clamp values apply to tasks which are running
either in the root task group or in an autogroup, which also cannot be
tuned at run-time, whenever the task does not have a task-specific
clamp value specified.
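
As a rough sketch of the fallback rule just described: it assumes that
a task-specific clamp, when present, is used as is, and that tasks in
any other group use their group's clamp. That last branch is an
assumption, since this mail only covers the root group and autogroups.
All names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define UCLAMP_NOT_VALID	(-1)

struct task_sketch {
	int util_min;			/* UCLAMP_NOT_VALID if no task-specific clamp */
	bool in_root_or_autogroup;	/* group cannot be tuned at run-time */
	int tg_util_min;		/* clamp of the task's own group otherwise */
};

static int task_util_min_sketch(const struct task_sketch *p,
				int sysdefault_util_min)
{
	/* Assumption: a task-specific value, when set, is used as is. */
	if (p->util_min != UCLAMP_NOT_VALID)
		return p->util_min;
	/* Root task group and autogroups fall back to the system default. */
	if (p->in_root_or_autogroup)
		return sysdefault_util_min;
	/* Assumption: other groups provide their own clamp value. */
	return p->tg_util_min;
}
```

With a system default of util_min=0, a root-group task without its own
clamp is therefore not boosted, even though the root group itself is
initialized to the max utilization.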

All that considered, the code above is still confusing and I should
consider moving to patch 11/16 the initialization to UCLAMP_MAX for
util_min...

> Maybe expand the comment to explain the intention?

... and add there something like:

/*
* The max utilization is always allowed for both clamps.
* This satisfies the "delegation model" required by CGroups
* v2, where a child task group cannot have more resources than
* its parent, thus allowing the creation of child groups with
* a non null util_min.
* For tasks within the root_task_group we will use the system
* default clamp values anyway, thus they will not be boosted
* to the max utilization by default.
*/

Is this more clear ?


> With this I think all TGs will get boosted by default, won't they?

You're right, at cgroup creation time we clone parent's clamps... thus,
all root_task_group's child groups will get max boosting at creation
time. However, since we don't have tasks within a newly created task
group, the system management software can still refine the clamps
before starting to move tasks in there.

Do you think we should initialize root task group children
differently ? I would prefer to avoid special cases if not strictly
required...

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 15:54:29

by Suren Baghdasaryan

Subject: Re: [PATCH v4 09/16] sched/core: uclamp: map TG's clamp values into CPU's clamp groups

On Wed, Sep 12, 2018 at 7:19 AM, Patrick Bellasi
<[email protected]> wrote:
> On 09-Sep 11:52, Suren Baghdasaryan wrote:
>> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > +/**
>> > + * release_uclamp_sched_group: release utilization clamp references of a TG
>>
>> free_uclamp_sched_group
>
> +1
>
>> > + * @tg: the task group being removed
>> > + *
>> > + * An empty task group can be removed only when it has no more tasks or child
>> > + * groups. This means that we can also safely release all the reference
>> > + * counting to clamp groups.
>> > + */
>> > +static inline void free_uclamp_sched_group(struct task_group *tg)
>> > +{
>
> [...]
>
>> > @@ -1417,9 +1444,18 @@ static void __init init_uclamp(void)
>> > #ifdef CONFIG_UCLAMP_TASK_GROUP
>> > /* Init root TG's clamp group */
>> > uc_se = &root_task_group.uclamp[clamp_id];
>> > +
>> > uc_se->effective.value = uclamp_none(clamp_id);
>> > - uc_se->value = uclamp_none(clamp_id);
>> > - uc_se->group_id = 0;
>> > + uc_se->effective.group_id = 0;
>> > +
>> > + /*
>> > + * The max utilization is always allowed for both clamps.
>> > + * This is required to not force a null minimum utiliation on
>> > + * all child groups.
>> > + */
>> > + uc_se->group_id = UCLAMP_NOT_VALID;
>> > + uclamp_group_get(NULL, clamp_id, 0, uc_se,
>> > + uclamp_none(UCLAMP_MAX));
>>
>> I don't quite get why you are using uclamp_none(UCLAMP_MAX) for both
>> UCLAMP_MIN and UCLAMP_MAX clamps. I assume the comment above is to
>> explain this but I'm still unclear why this is done.
>
> That's maybe a bit tricky to get but, this will not happen since for
> root group tasks we apply the system default values... which however
> are introduced by one of the following patches 11/16.
>
> So, my understanding of the "delegation model" is that for cgroups we
> have to ensure each TG is a "restriction" of its parent. Thus:
>
> tg::util_min <= tg_parent::util_min
>
> This is required to ensure that a tg_parent can always restrict
> resources on its descendants. I guess that's required to have a sane
> usage of CGroups for VMs where the Host can transparently control its
> Guests.
>
> According to the above rule, and considering that root task group
> cannot be modified, to allow boosting on TG we are forced to set the
> root group with util_min = SCHED_CAPACITY_SCALE.
>
> Moreover, Tejun pointed out that if we need tuning at root TG level,
> it means that we need system wide tunable, which should be available
> also when CGroups are not in use.
>
> That's why on patch:
>
> [PATCH v4 11/16] sched/core: uclamp: add system default clamps
> https://lore.kernel.org/lkml/[email protected]/
>
> we add the concept of system default clamps which are actually
> initialized with util_min=0, i.e. 0% boost by default.
>
> These system default clamp values applies to tasks which are running
> either in the root task group on in an autogroup, which also cannot be
> tuned at run-time, whenever the task has not a task specific clamp
> value specified.
>
> All that considered, the code above is still confusing and I should
> consider moving to patch 11/16 the initialization to UCLAMP_MAX for
> util_min...
>
>> Maybe expand the comment to explain the intention?
>
> ... and add there something like:
>
> /*
> * The max utilization is always allowed for both clamps.
> * This satisfies the "delegation model" required by CGroups
> * v2, where a child task group cannot have more resources then
> * its father, thus allowing the creation of child groups with
> * a non null util_min.
> * For tasks within the root_task_group we will use the system
> * default clamp values anyway, thus they will not be boosted
> * to the max utilization by default.
> */
>
> It this more clear ?

Yes, I think so. Thanks for covering that.

>
>
>> With this I think all TGs will get boosted by default, won't they?
>
> You right, at cgroup creation time we clone parent's clamps... thus,
> all root_task_group's children group will get max boosting at creation
> time. However, since we don't have task within a newly created task
> group, the system management software can still refine the clamps
> before staring to move tasks in there.
>
> Do you think we should initialize root task group childrens
> differently ? I would prefer to avoid special cases if not strictly
> required...

I don't see a problem with the current approach.

>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-09-12 15:57:48

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 12-Sep 15:49, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> > +/**
> > + * Utilization's clamp group
> > + *
> > + * A utilization clamp group maps a "clamp value" (value), i.e.
> > + * util_{min,max}, to a "clamp group index" (group_id).
> > + */
> > +struct uclamp_se {
> > + unsigned int value;
> > + unsigned int group_id;
> > +};
>
> > +/**
> > + * uclamp_map: reference counts a utilization "clamp value"
> > + * @value: the utilization "clamp value" required
> > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > + * @se_lock: serialize reference count updates by protecting se_count
>
> Why do you have a spinlock to serialize a single value? Don't we have
> atomics for that?

There are some code paths where it's used to protect clamp groups
mapping and initialization, e.g.

uclamp_group_get()
spin_lock()
// initialize clamp group (if required) and then...
se_count += 1
spin_unlock()

Almost all these paths are triggered from user-space and protected
by a global uclamp_mutex, except for the fork/exit paths.

To serialize these paths I'm using the spinlock above, does it make
sense ? Can we use the global uclamp_mutex on forks/exit too ?

One additional observation is that, if in the future we want to add a
kernel space API, (e.g. driver asking for a new clamp value), maybe we
will need to have a serialized non-sleeping uclamp_group_get() API ?
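
For reference, the locked section sketched above could look like the
following in userspace C. An atomic_flag spin loop stands in for the
kernel raw_spinlock_t, and the group is (re)initialized whenever its
refcount is zero; both choices, and all names, are assumptions of this
sketch.

```c
#include <assert.h>
#include <stdatomic.h>

struct uclamp_map_sketch {
	int value;
	int se_count;
	atomic_flag se_lock;	/* userspace stand-in for raw_spinlock_t */
};

static void sketch_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Take a reference on a group, initializing it on first use. */
static void uclamp_group_get_sketch(struct uclamp_map_sketch *uc_map,
				    int clamp_value)
{
	sketch_lock(&uc_map->se_lock);
	if (uc_map->se_count == 0)
		uc_map->value = clamp_value;	/* initialize (if required) */
	uc_map->se_count += 1;
	sketch_unlock(&uc_map->se_lock);
}
```

Note that the lock covers both the value initialization and the
refcount update, and never sleeps.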

> > + */
> > +struct uclamp_map {
> > + int value;
> > + int se_count;
> > + raw_spinlock_t se_lock;
> > +};
> > +
> > +/**
> > + * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
> > + *
> > + * Since only a limited number of different "clamp values" are supported, we
> > + * need to map each different clamp value into a "clamp group" (group_id) to
> > + * be used by the per-CPU accounting in the fast-path, when tasks are
> > + * enqueued and dequeued.
> > + * We also support different kind of utilization clamping, min and max
> > + * utilization for example, each representing what we call a "clamp index"
> > + * (clamp_id).
> > + *
> > + * A matrix is thus required to map "clamp values" to "clamp groups"
> > + * (group_id), for each "clamp index" (clamp_id), where:
> > + * - rows are indexed by clamp_id and they collect the clamp groups for a
> > + * given clamp index
> > + * - columns are indexed by group_id and they collect the clamp values which
> > + * maps to that clamp group
> > + *
> > + * Thus, the column index of a given (clamp_id, value) pair represents the
> > + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> > + *
> > + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> > + * clamped tasks. Thus we allocate one more slot than the value of
> > + * CONFIG_UCLAMP_GROUPS_COUNT.
> > + *
> > + * Here is the map layout and, right below, how entries are accessed by the
> > + * following code.
> > + *
> > + * uclamp_maps is a matrix of
> > + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> > + * | |
> > + * | /---------------+---------------\
> > + * | +------------+ +------------+
> > + * | / UCLAMP_MIN | value | | value |
> > + * | | | se_count |...... | se_count |
> > + * | | +------------+ +------------+
> > + * +--+ +------------+ +------------+
> > + * | | value | | value |
> > + * \ UCLAMP_MAX | se_count |...... | se_count |
> > + * +-----^------+ +----^-------+
> > + * | |
> > + * uc_map = + |
> > + * &uclamp_maps[clamp_id][0] +
> > + * clamp_value =
> > + * uc_map[group_id].value
> > + */
> > +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> > + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> > + ____cacheline_aligned_in_smp;
> > +
>
> I'm still completely confused by all this.
>
> sizeof(uclamp_map) = 12
>
> that array is 2*6=12 of those, so the whole thing is 144 bytes. which is
> more than 2 (64 byte) cachelines.

This data structure is *not* used in the hot-path, that's why I did not
care about fitting it exactly into few cache lines.

It's used to map a user-space "clamp value" into a kernel-space "clamp
group" when user-space:
- changes a task specific clamp value
- changes a cgroup clamp value
- a task forks/exits

I assume we can consider all those as "slow" code paths, is that correct ?

At enqueue/dequeue time we use instead struct uclamp_cpu, introduced
by the next patch:

[PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting
https://lore.kernel.org/lkml/[email protected]/

That's where we refcount RUNNABLE tasks and we have to figure out
the current clamp value for a CPU.

That data structure, with CONFIG_UCLAMP_GROUPS_COUNT=5, is:

struct uclamp_cpu {
struct uclamp_group group[2][6]; /* 0 96 */
/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
int value[2]; /* 96 8 */
int flags; /* 104 4 */

/* size: 108, cachelines: 2, members: 3 */
/* last cacheline: 44 bytes */
};

and we fit into 2 cache lines with this data layout:

util_min[0..5] | util_max[0..5] | other data

> What's the purpose of that cacheline align statement?

In uclamp_maps, we still need to scan the array when a clamp value is
changed from user-space, i.e. the cases reported above. Thus, that
alignment is just to ensure that we minimize the number of cache lines
used. Does that make sense ?

Maybe that alignment is implicitly generated by the compiler ?

> Note that without that apparently superfluous lock, it would be 8*12 =
> 96 bytes, which is 1.5 lines and would indeed suggest you default to
> GROUP_COUNT=7 by default to fill 2 lines.

Yes, I'll check more carefully whether we can rely on just the uclamp_mutex.

> Why are the min and max things torn up like that? I'm fairly sure I
> asked some of that last time; but the above comments only try to explain
> what, not why.

We use that organization to speed up scanning for clamp values of the
same clamp_id. That's more important in the hot-path than above, where
we need to scan struct uclamp_cpu when a new aggregated clamp value
has to be computed. This is done in:

[PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting
https://lore.kernel.org/lkml/[email protected]/

Specifically:

dequeue_task()
uclamp_cpu_put()
uclamp_cpu_put_id(clamp_id)
uclamp_cpu_update(clamp_id)
// Here we have an array scan by clamp_id

With the given data layout I reported above, when we update the
min_clamp value (boost) we have all the data required in a single
cache line.

If that makes sense, I can certainly improve the comment above to
justify its layout.
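
A sketch of the array scan mentioned above, for illustration only. It
assumes the aggregated clamp of a CPU is the maximum value among groups
with RUNNABLE tasks and uses -1 to mean "no active group"; both are
assumptions of this sketch, as is the function name. The struct mirrors
the layout reported above for CONFIG_UCLAMP_GROUPS_COUNT=5.

```c
#include <assert.h>

#define UCLAMP_MIN	0
#define UCLAMP_MAX	1
#define UCLAMP_CNT	2
#define UCLAMP_GROUPS	6	/* CONFIG_UCLAMP_GROUPS_COUNT=5, plus group 0 */

struct uclamp_group {
	int value;	/* clamp value of this group */
	int tasks;	/* RUNNABLE tasks currently in this group */
};

struct uclamp_cpu {
	struct uclamp_group group[UCLAMP_CNT][UCLAMP_GROUPS];
	int value[UCLAMP_CNT];
	int flags;
};

/* Recompute the aggregated clamp of one clamp_id by scanning its row. */
static void uclamp_cpu_update_sketch(struct uclamp_cpu *uc_cpu, int clamp_id)
{
	const struct uclamp_group *row = uc_cpu->group[clamp_id];
	int max_value = -1;	/* -1 means: no active group found */

	for (int group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
		if (row[group_id].tasks == 0)
			continue;	/* groups without RUNNABLE tasks do not count */
		if (row[group_id].value > max_value)
			max_value = row[group_id].value;
	}

	uc_cpu->value[clamp_id] = max_value;
}
```

Each row is 6 * 8 = 48 bytes, so the scan for a single clamp_id stays
within one 64-byte cache line when the structure is cacheline aligned.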

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 15:58:35

by Suren Baghdasaryan

Subject: Re: [PATCH v4 08/16] sched/core: uclamp: propagate parent clamps

On Wed, Sep 12, 2018 at 5:51 AM, Patrick Bellasi
<[email protected]> wrote:
> On 08-Sep 20:02, Suren Baghdasaryan wrote:
>> On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > + cpu.util.min.effective
>> > + A read-only single value file which exists on non-root cgroups and
>> > + reports minimum utilization clamp value currently enforced on a task
>> > + group.
>> > +
>> > + The actual minimum utilization in the range [0, 1023].
>> > +
>> > + This value can be lower then cpu.util.min in case a parent cgroup
>> > + is enforcing a more restrictive clamping on minimum utilization.
>>
>> IMHO if cpu.util.min=0 means "no restrictions" on UCLAMP_MIN then
>> calling parent's lower cpu.util.min value "more restrictive clamping"
>> is confusing. I would suggest to rephrase this to smth like "...in
>> case a parent cgroup requires lower cpu.util.min clamping."
>
> Right, it's slightly confusing... still I would like to call out that
> a parent group can enforce something on its children. What about:
>
> "... a parent cgroup allows only smaller minimum utilization values."
>
> Is that less confusing ?

SGTM.

>
> Otherwise I think your proposal could work too.
>
> [...]
>
>> > #ifdef CONFIG_UCLAMP_TASK_GROUP
>> > +/**
>> > + * cpu_util_update_hier: propagete effective clamp down the hierarchy
>>
>> typo: propagate
>
> +1
>
> [...]
>
>> > + * Skip the whole subtrees if the current effective clamp is
>> > + * alredy matching the TG's clamp value.
>>
>> typo: already
>
> +1
>
>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-09-12 16:15:08

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Wed, Sep 12, 2018 at 04:56:19PM +0100, Patrick Bellasi wrote:
> On 12-Sep 15:49, Peter Zijlstra wrote:
> > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:

> > > +/**
> > > + * uclamp_map: reference counts a utilization "clamp value"
> > > + * @value: the utilization "clamp value" required
> > > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > > + * @se_lock: serialize reference count updates by protecting se_count
> >
> > Why do you have a spinlock to serialize a single value? Don't we have
> > atomics for that?
>
> There are some code paths where it's used to protect clamp groups
> mapping and initialization, e.g.
>
> uclamp_group_get()
> spin_lock()
> // initialize clamp group (if required) and then...
> se_count += 1
> spin_unlock()
>
> Almost all these paths are triggered from user-space and protected
> by a global uclamp_mutex, but fork/exit paths.
>
> To serialize these paths I'm using the spinlock above, does it make
> sense ? Can we use the global uclamp_mutex on forks/exit too ?

OK, then your comment is misleading; it serializes both fields.

> One additional observations is that, if in the future we want to add a
> kernel space API, (e.g. driver asking for a new clamp value), maybe we
> will need to have a serialized non-sleeping uclamp_group_get() API ?

No idea; but if you want to go all fancy you can replace the whole
uclamp_map thing with something like:

struct uclamp_map {
union {
struct {
unsigned long v : 10;
unsigned long c : BITS_PER_LONG - 10;
};
atomic_long_t s;
};
};

And use uclamp_map::c == 0 as unused (as per normal refcount
semantics) and atomic_long_cmpxchg() the whole thing using
uclamp_map::s.
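
A userspace approximation of that idea, as a sketch only: it uses C11
atomics with explicit shifts and masks instead of the bit-field union,
and a compare-exchange loop in place of atomic_long_cmpxchg(). The
get/put helper names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define UCLAMP_VALUE_BITS	10
#define UCLAMP_VALUE_MASK	((1UL << UCLAMP_VALUE_BITS) - 1)
#define UCLAMP_COUNT_ONE	(1UL << UCLAMP_VALUE_BITS)

struct uclamp_map_sketch {
	atomic_ulong s;	/* low 10 bits: value, remaining bits: refcount */
};

/*
 * Try to take a reference on @map for @value. Succeeds if the entry is
 * unused (count == 0) or already tracks the same value.
 */
static bool uclamp_map_get_sketch(struct uclamp_map_sketch *map,
				  unsigned long value)
{
	unsigned long old = atomic_load(&map->s);
	unsigned long next;

	do {
		if ((old >> UCLAMP_VALUE_BITS) == 0)
			next = (value & UCLAMP_VALUE_MASK) | UCLAMP_COUNT_ONE;
		else if ((old & UCLAMP_VALUE_MASK) == value)
			next = old + UCLAMP_COUNT_ONE;
		else
			return false;	/* in use by a different clamp value */
		/* On failure, old is reloaded and the decision is redone. */
	} while (!atomic_compare_exchange_weak(&map->s, &old, next));

	return true;
}

/* Drop a reference; callers must hold one. Count 0 means unused again. */
static void uclamp_map_put_sketch(struct uclamp_map_sketch *map)
{
	atomic_fetch_sub(&map->s, UCLAMP_COUNT_ONE);
}
```

Value and refcount change together in a single atomic word, so no
separate lock is needed to keep the two fields consistent.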

> > > + * uclamp_maps is a matrix of
> > > + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> > > + * | |
> > > + * | /---------------+---------------\
> > > + * | +------------+ +------------+
> > > + * | / UCLAMP_MIN | value | | value |
> > > + * | | | se_count |...... | se_count |
> > > + * | | +------------+ +------------+
> > > + * +--+ +------------+ +------------+
> > > + * | | value | | value |
> > > + * \ UCLAMP_MAX | se_count |...... | se_count |
> > > + * +-----^------+ +----^-------+
> > > + * | |
> > > + * uc_map = + |
> > > + * &uclamp_maps[clamp_id][0] +
> > > + * clamp_value =
> > > + * uc_map[group_id].value
> > > + */
> > > +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> > > + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> > > + ____cacheline_aligned_in_smp;
> > > +
> >
> > I'm still completely confused by all this.
> >
> > sizeof(uclamp_map) = 12
> >
> > that array is 2*6=12 of those, so the whole thing is 144 bytes. which is
> > more than 2 (64 byte) cachelines.
>
> This data structure is *not* used in the hot-path, that's why I did not
> care about fitting it exactly into few cache lines.
>
> It's used to map a user-space "clamp value" into a kernel-space "clamp
> group" when user-space:
> - changes a task specific clamp value
> - changes a cgroup clamp value
> - a task forks/exits
>
> I assume we can consider all those as "slow" code paths, is that correct ?

Yep.

> > What's the purpose of that cacheline align statement?
>
> In uclamp_maps, we still need to scan the array when a clamp value is
> changed from user-space, i.e. the cases reported above. Thus, that
> alignment is just to ensure that we minimize the number of cache lines
> used. Does that make sense ?
>
> Maybe that alignment implicitly generated by the compiler ?

It is not, but if it really is a slow path, we shouldn't care about
alignment.

> > Note that without that apparently superfluous lock, it would be 8*12 =
> > 96 bytes, which is 1.5 lines and would indeed suggest you default to
> > GROUP_COUNT=7 by default to fill 2 lines.
>
> Yes, will check better if we can count on just the uclamp_mutex

Well, if we don't care about performance (slow path) then keeping the
lock is fine, just the comment and alignment are misleading.

> > Why are the min and max things torn up like that? I'm fairly sure I
> > asked some of that last time; but the above comments only try to explain
> > what, not why.
>
> We use that organization to speedup scanning for clamp values of the
> same clamp_id. That's more important in the hot-path than above, where
> we need to scan struct uclamp_cpu when a new aggregated clamp value
> has to be computed. This is done in:
>
> [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting
> https://lore.kernel.org/lkml/[email protected]/
>
> Specifically:
>
> dequeue_task()
> uclamp_cpu_put()
> uclamp_cpu_put_id(clamp_id)
> uclamp_cpu_update(clamp_id)
> // Here we have an array scan by clamp_id
>
> With the given data layout I reported above, when we update the
> min_clamp value (boost) we have all the data required in a single
> cache line.
>
> If that makes sense, I can certainly improve the comment above to
> justify its layout.

OK, let me read on.

2018-09-12 16:26:28

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)

Bit large for inline now.

> {
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;

I think the thing would become much more readable if you set
lower/upper_bound right here.
>
> + mutex_lock(&uclamp_mutex);
>
> + /* Find a valid group_id for each required clamp value */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");

AFAICT this is an unpriv part of the syscall; and you can spam the log
without limits. Not good.

> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto done;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, attr->sched_util_min);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, attr->sched_util_max);
> + }
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}
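
To illustrate the "set lower/upper_bound right here" remark, a sketch
of the validation with both bounds resolved up front. It is userspace
C; the flag values, the attr_sketch struct and the function name are
illustrative, and the group lookup and refcounting are left out.

```c
#include <assert.h>
#include <errno.h>

#define SCHED_CAPACITY_SCALE		1024
#define UCLAMP_NOT_VALID		(-1)
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x1	/* illustrative flag values */
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x2

struct attr_sketch {
	unsigned int sched_flags;
	unsigned int sched_util_min;
	unsigned int sched_util_max;
};

/* Validate a request against the task's current clamps (cur_min, cur_max). */
static int uclamp_validate_sketch(const struct attr_sketch *attr,
				  int cur_min, int cur_max)
{
	/* Resolve both bounds first: requested value if flagged, else current. */
	unsigned int lower_bound =
		(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
		? attr->sched_util_min
		: (cur_min == UCLAMP_NOT_VALID ? 0 : (unsigned int)cur_min);
	unsigned int upper_bound =
		(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
		? attr->sched_util_max
		: (cur_max == UCLAMP_NOT_VALID ? SCHED_CAPACITY_SCALE
					       : (unsigned int)cur_max);

	/* A single check then covers both the MIN and the MAX request. */
	if (lower_bound > upper_bound || upper_bound > SCHED_CAPACITY_SCALE)
		return -EINVAL;

	return 0;
}
```

The two per-flag blocks of the patch then only need to look up the
clamp group, since the range checks have already been done.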

2018-09-12 17:35:00

by Peter Zijlstra

Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 72df2dc779bc..513608ae4908 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -764,6 +764,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp Group
> + * @value: utilization clamp value for tasks on this clamp group
> + * @tasks: number of RUNNABLE tasks on this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> + int value;
> + int tasks;
> +};

We could use bit fields again to compress in 10:22 bits I suppose. 4M
tasks is quite a lot to have runnable on a single CPU ;-)

2018-09-12 17:37:03

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 12-Sep 18:12, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 04:56:19PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 15:49, Peter Zijlstra wrote:
> > > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
>
> > > > +/**
> > > > + * uclamp_map: reference counts a utilization "clamp value"
> > > > + * @value: the utilization "clamp value" required
> > > > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > > > + * @se_lock: serialize reference count updates by protecting se_count
> > >
> > > Why do you have a spinlock to serialize a single value? Don't we have
> > > atomics for that?
> >
> > There are some code paths where it's used to protect clamp groups
> > mapping and initialization, e.g.
> >
> > uclamp_group_get()
> > spin_lock()
> > // initialize clamp group (if required) and then...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is actually a couple of function calls

> > se_count += 1
> > spin_unlock()
> >
> > Almost all these paths are triggered from user-space and protected
> > by a global uclamp_mutex, except for the fork/exit paths.
> >
> > To serialize these paths I'm using the spinlock above, does it make
> > sense ? Can we use the global uclamp_mutex on forks/exit too ?
>
> OK, then your comment is misleading; it serializes both fields.

Yes... that definitely needs an update.

> > One additional observation is that, if in the future we want to add a
> > kernel space API, (e.g. driver asking for a new clamp value), maybe we
> > will need to have a serialized non-sleeping uclamp_group_get() API ?
>
> No idea; but if you want to go all fancy you can replace the whole
> uclamp_map thing with something like:
>
> struct uclamp_map {
> union {
> struct {
> unsigned long v : 10;
> unsigned long c : BITS_PER_LONG - 10;
> };
> atomic_long_t s;
> };
> };

That sounds really cool and scary at the same time :)

The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
or that we use it to track a percentage value (i.e. [0..100]).

One of the last patches introduces percentage values to userspace.
But, I was considering that in kernel space we should always track
full scale utilization values.

The c:(BITS_PER_LONG-10) restricts the number of concurrently active
SEs refcounting the same clamp value, which on 32-bit systems is
only about 4 million among tasks and cgroups... maybe still reasonable...


> And use uclamp_map::c == 0 as unused (as per normal refcount
> semantics) and atomic_long_cmpxchg() the whole thing using
> uclamp_map::s.

Yes... that could work for the uclamp_map updates, but as I noted
above, I think I have other calls serialized by that lock... will look
better into what you suggest, thanks!
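
The single-word idea discussed above can be sketched in user space with
C11 atomics. This is only an illustration under stated assumptions, not
the code of this series: uclamp_map_sketch, map_get() and map_put() are
hypothetical names, shifts and masks stand in for the bitfields so that
the layout does not depend on the compiler, 11 value bits are picked
arbitrarily, and atomic_compare_exchange_weak() plays the role that
atomic_long_cmpxchg() would play in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define VALUE_BITS	11UL				/* assumption: room for 0..2047 */
#define VALUE_MASK	((1UL << VALUE_BITS) - 1)
#define COUNT_MAX	(~0UL >> VALUE_BITS)		/* refcount ceiling */

/* Hypothetical stand-in for uclamp_map: value and refcount in one word. */
struct uclamp_map_sketch {
	atomic_ulong s;
};

static unsigned long map_value(unsigned long s) { return s & VALUE_MASK; }
static unsigned long map_count(unsigned long s) { return s >> VALUE_BITS; }

static unsigned long map_pack(unsigned long value, unsigned long count)
{
	assert(value <= VALUE_MASK && count <= COUNT_MAX);
	return (count << VALUE_BITS) | value;
}

/*
 * Take a reference on @value. Succeeds if the slot is unused (count == 0)
 * or already tracks the same value; fails otherwise or on refcount overflow.
 */
static bool map_get(struct uclamp_map_sketch *m, unsigned long value)
{
	unsigned long old = atomic_load(&m->s);
	unsigned long next;

	do {
		if (map_count(old) && map_value(old) != value)
			return false;
		if (map_count(old) == COUNT_MAX)
			return false;
		next = map_pack(value, map_count(old) + 1);
		/* On failure, 'old' is reloaded with the current word. */
	} while (!atomic_compare_exchange_weak(&m->s, &old, next));

	return true;
}

/* Drop one reference; a count of 0 marks the slot as unused again. */
static bool map_put(struct uclamp_map_sketch *m)
{
	unsigned long old = atomic_load(&m->s);
	unsigned long next;

	do {
		if (!map_count(old))
			return false;
		next = map_pack(map_value(old), map_count(old) - 1);
	} while (!atomic_compare_exchange_weak(&m->s, &old, next));

	return true;
}
```

Both fields are always read and written through the single atomic word,
which is the property the thread relies on. Anything else that the
spinlock currently serializes (for example, clamp group initialization)
is outside the scope of this sketch.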


[...]

> > > What's the purpose of that cacheline align statement?
> >
> > In uclamp_maps, we still need to scan the array when a clamp value is
> > changed from user-space, i.e. the cases reported above. Thus, that
> > alignment is just to ensure that we minimize the number of cache lines
> > used. Does that make sense ?
> >
> > Maybe that alignment is implicitly generated by the compiler ?
>
> It is not, but if it really is a slow path, we shouldn't care about
> alignment.

Ok, will remove it.

> > > Note that without that apparently superfluous lock, it would be 8*12 =
> > > 96 bytes, which is 1.5 lines and would indeed suggest you default to
> > > GROUP_COUNT=7 by default to fill 2 lines.
> >
> > Yes, will check better if we can count on just the uclamp_mutex
>
> Well, if we don't care about performance (slow path) then keeping the
> lock is fine, just the comment and alignment are misleading.

Ok

[...]

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 17:43:03

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 12-Sep 18:24, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> > static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
>
> Bit large for inline now.

Yes, Suren also already pointed that out... already gone in my v5 ;)

> > {
> > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > + int lower_bound, upper_bound;
> > + struct uclamp_se *uc_se;
> > + int result = 0;
>
> I think the thing would become much more readable if you set
> lower/upper_bound right here.

Do you mean the bits I've ---8<---ed below ?

> >
> > + mutex_lock(&uclamp_mutex);
> >
> > + /* Find a valid group_id for each required clamp value */
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {

---8<---
> > + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > + ? attr->sched_util_max
> > + : p->uclamp[UCLAMP_MAX].value;
> > +
> > + if (upper_bound == UCLAMP_NOT_VALID)
> > + upper_bound = SCHED_CAPACITY_SCALE;
> > + if (attr->sched_util_min > upper_bound) {
> > + result = -EINVAL;
> > + goto done;
> > + }
---8<---

Actually it could also make sense to have them before the mutex ;)

> > +
> > + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> > + if (result == -ENOSPC) {
> > + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
>
> AFAICT this is an unpriv part of the syscall; and you can spam the log
> without limits. Not good.

Good point, will better check this.
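
One way to bound the log output from an unprivileged path is to rate
limit the message. The kernel provides pr_err_ratelimited() for that
purpose; whether the series ends up using it is not stated in this
thread. The user-space sketch below only illustrates the interval/burst
idea behind such helpers. The names log_ratelimit and log_allowed() are
hypothetical, and the caller passes the current time explicitly to keep
the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter: at most @burst messages per @interval time units. */
struct log_ratelimit {
	long interval;	/* window length */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
};

/* Return true if a message may be emitted at time @now. */
static bool log_allowed(struct log_ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst > 0);

	/* Open a new window once the previous one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed >= rl->burst)
		return false;

	rl->printed++;
	return true;
}
```

With such a guard around the -ENOSPC message, a task looping on the
syscall can emit only a bounded number of lines per window.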

[...]

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 17:43:29

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> On 12-Sep 18:12, Peter Zijlstra wrote:

> > No idea; but if you want to go all fancy you can replace the whole
> > uclamp_map thing with something like:
> >
> > struct uclamp_map {
> > union {
> > struct {
> > unsigned long v : 10;
> > unsigned long c : BITS_PER_LONG - 10;
> > };
> > atomic_long_t s;
> > };
> > };
>
> That sounds really cool and scary at the same time :)
>
> The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> or that we use it to track a percentage value (i.e. [0..100]).

Or we pick 11 bits, it seems unlikely that capacity be larger than 2k.

> One of the last patches introduces percentage values to userspace.
> But, I was considering that in kernel space we should always track
> full scale utilization values.
>
> The c:(BITS_PER_LONG-10) restricts the number of concurrently active
> SEs refcounting the same clamp value, which on 32-bit systems is
> only about 4 million among tasks and cgroups... maybe still reasonable...

Yeah, on 32bit having 4M tasks seems 'unlikely'.
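
The bit budget discussed here is easy to check. The sketch below assumes
SCHED_CAPACITY_SCALE is 1024 (1 << 10); the helper names field_max(),
value_fits() and count_max() are hypothetical. An n-bit unsigned field
holds 0..2^n - 1, so 10 bits stop at 1023 and 11 bits are needed if 1024
itself must be representable, which leaves 21 bits (about 2M references)
for the counter in a 32-bit word.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024UL	/* assumption: 1 << 10 */

/* Largest value that fits in an unsigned field of @bits bits. */
static unsigned long field_max(unsigned int bits)
{
	assert(bits > 0 && bits < 8 * sizeof(unsigned long));
	return (1UL << bits) - 1;
}

/* Non-zero if every clamp value in [0..SCHED_CAPACITY_SCALE] fits. */
static int value_fits(unsigned int bits)
{
	return field_max(bits) >= SCHED_CAPACITY_SCALE;
}

/* Refcount ceiling when @value_bits of a @word_bits word hold the value. */
static unsigned long count_max(unsigned int word_bits, unsigned int value_bits)
{
	assert(word_bits > value_bits);
	return field_max(word_bits - value_bits);
}
```

For a 32-bit word, count_max(32, 10) is 4194303 (the "4M" quoted above)
and count_max(32, 11) is 2097151.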

2018-09-12 17:47:18

by Patrick Bellasi

Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On 12-Sep 19:34, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 72df2dc779bc..513608ae4908 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -764,6 +764,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> > #endif
> > #endif /* CONFIG_SMP */
> >
> > +#ifdef CONFIG_UCLAMP_TASK
> > +/**
> > + * struct uclamp_group - Utilization clamp Group
> > + * @value: utilization clamp value for tasks on this clamp group
> > + * @tasks: number of RUNNABLE tasks on this clamp group
> > + *
> > + * Keep track of how many tasks are RUNNABLE for a given utilization
> > + * clamp value.
> > + */
> > +struct uclamp_group {
> > + int value;
> > + int tasks;
> > +};
>
> We could use bit fields again to compress in 10:22 bits I suppose. 4M
> tasks is quite a lot to have runnable on a single CPU ;-)

If you say so, it's OK with me... my Android phone can't (yet) run so
many ;)

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-12 17:54:12

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 12-Sep 19:42, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 18:12, Peter Zijlstra wrote:
>
> > > No idea; but if you want to go all fancy you can replace the whole
> > > uclamp_map thing with something like:
> > >
> > > struct uclamp_map {
> > > union {
> > > struct {
> > > unsigned long v : 10;
> > > unsigned long c : BITS_PER_LONG - 10;
> > > };
> > > atomic_long_t s;
> > > };
> > > };
> >
> > That sounds really cool and scary at the same time :)
> >
> > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > or that we use it to track a percentage value (i.e. [0..100]).
>
> Or we pick 11 bits, it seems unlikely that capacity be larger than 2k.

Just remembered a past experience where we had unaligned access traps
on some machine because... don't you see any potential issue in
using bitfields like you suggest above ?

I'm thinking of:

commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")

--
#include <best/regards.h>

Patrick Bellasi

2018-09-13 19:13:35

by Peter Zijlstra

Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> +static inline void uclamp_cpu_get_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + int clamp_value;
> + int group_id;
> +
> + /* Every task must reference a clamp group */
> + group_id = p->uclamp[clamp_id].group_id;

> +}
> +
> +static inline void uclamp_cpu_put_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + unsigned int clamp_value;
> + int group_id;
> +
> + /* New tasks don't have a previous clamp group */
> + group_id = p->uclamp[clamp_id].group_id;
> + if (group_id == UCLAMP_NOT_VALID)
> + return;

*confused*, so on enqueue they must have a group_id, but then on dequeue
they might no longer have?

> +}

> @@ -1110,6 +1313,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & ENQUEUE_RESTORE))
> sched_info_queued(rq, p);
>
> + uclamp_cpu_get(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -1121,6 +1325,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & DEQUEUE_SAVE))
> sched_info_dequeued(rq, p);
>
> + uclamp_cpu_put(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }

The ordering, is that right? We get while the task isn't enqueued yet,
which would suggest we put when the task is dequeued.

2018-09-13 19:15:24

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Wed, Sep 12, 2018 at 06:52:02PM +0100, Patrick Bellasi wrote:
> On 12-Sep 19:42, Peter Zijlstra wrote:
> > On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > > On 12-Sep 18:12, Peter Zijlstra wrote:
> >
> > > > No idea; but if you want to go all fancy you can replace the whole
> > > > uclamp_map thing with something like:
> > > >
> > > > struct uclamp_map {
> > > > union {
> > > > struct {
> > > > unsigned long v : 10;
> > > > unsigned long c : BITS_PER_LONG - 10;
> > > > };
> > > > atomic_long_t s;
> > > > };
> > > > };
> > >
> > > That sounds really cool and scary at the same time :)
> > >
> > > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > > or that we use it to track a percentage value (i.e. [0..100]).
> >
> > Or we pick 11 bits, it seems unlikely that capacity be larger than 2k.
>
> Just remembered a past experience where we had unaligned access traps
> on some machine because... don't you see any potential issue in
> using bitfields like you suggest above ?
>
> I'm thinking of:
>
> commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")

There should not be (and I'm still confused by that particular commit
you reference). If we access everything through the uclamp_map::s, and
only use the bitfields to interpret the results, it all 'works'.

The tricky thing we did earlier was trying to use u64 accesses for 2
u32 variables. And somehow ia64 didn't get the alignment right.

2018-09-13 19:31:33

by Peter Zijlstra

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Wed, Sep 12, 2018 at 06:42:09PM +0100, Patrick Bellasi wrote:
> On 12-Sep 18:24, Peter Zijlstra wrote:
> > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:

> > > {
> > > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > + int lower_bound, upper_bound;
> > > + struct uclamp_se *uc_se;
> > > + int result = 0;
> >
> > I think the thing would become much more readable if you set
> > lower/upper_bound right here.

> Actually it could also make sense to have them before the mutex ;)

Indeed.

+ upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ ? attr->sched_util_max
+ : p->uclamp[UCLAMP_MAX].value;
+
+ if (upper_bound == UCLAMP_NOT_VALID)
+ upper_bound = SCHED_CAPACITY_SCALE;
+ if (attr->sched_util_min > upper_bound) {
+ result = -EINVAL;
+ goto done;
+ }
+
+ result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
+ if (result == -ENOSPC) {
+ pr_err(UCLAMP_ENOSPC_FMT, "MIN");
+ goto done;
+ }
+ group_id[UCLAMP_MIN] = result;
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ ? attr->sched_util_min
+ : p->uclamp[UCLAMP_MIN].value;
+
+ if (lower_bound == UCLAMP_NOT_VALID)
+ lower_bound = 0;
+ if (attr->sched_util_max < lower_bound ||
+ attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto done;
+ }

That would end up something like:

unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;

if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
lower_bound = attr->sched_util_min;

if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
upper_bound = attr->sched_util_max;

if (lower_bound > upper_bound ||
upper_bound > SCHED_CAPACITY_MAX)
return -EINVAL;

mutex_lock(...);
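
The "compute both bounds first, validate once, then lock" shape can be
tried out in isolation. The sketch below is a user-space illustration,
not the code of the series: struct clamp_request, the FLAG_* values and
clamp_request_check() are hypothetical, UCLAMP_NOT_VALID is assumed to
be -1, and SCHED_CAPACITY_SCALE (assumed 1024) is used as the upper
limit. It also keeps the original patch's fallback for clamps that were
never set, which the snippet above leaves out.

```c
#include <assert.h>
#include <errno.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_NOT_VALID	(-1)	/* assumption: "no clamp set" marker */

/* Hypothetical flag values, for illustration only. */
#define FLAG_UTIL_CLAMP_MIN	0x1u
#define FLAG_UTIL_CLAMP_MAX	0x2u

/* Hypothetical stand-in for the relevant sched_attr fields. */
struct clamp_request {
	unsigned int flags;
	int util_min;
	int util_max;
};

/*
 * Validate a request against the task's current clamps. No lock is
 * needed here: only the request and the two current values are read.
 */
static int clamp_request_check(const struct clamp_request *req,
			       int cur_min, int cur_max)
{
	int lower = cur_min;
	int upper = cur_max;

	assert(req);

	/* Clamps never set before default to the widest range. */
	if (lower == UCLAMP_NOT_VALID)
		lower = 0;
	if (upper == UCLAMP_NOT_VALID)
		upper = SCHED_CAPACITY_SCALE;

	/* Requested values override the current ones. */
	if (req->flags & FLAG_UTIL_CLAMP_MIN)
		lower = req->util_min;
	if (req->flags & FLAG_UTIL_CLAMP_MAX)
		upper = req->util_max;

	if (lower < 0 || lower > upper || upper > SCHED_CAPACITY_SCALE)
		return -EINVAL;

	return 0;
}
```

A caller would take the mutex only after this check succeeds, so invalid
requests never contend on it.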



2018-09-14 08:48:48

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 13-Sep 21:20, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:42:09PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 18:24, Peter Zijlstra wrote:
> > > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
>
> > > > {
> > > > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > > + int lower_bound, upper_bound;
> > > > + struct uclamp_se *uc_se;
> > > > + int result = 0;
> > >
> > > I think the thing would become much more readable if you set
> > > lower/upper_bound right here.
>
> > Actually it could also make sense to have them before the mutex ;)
>
> Indeed.
>
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
>
> That would end up something like:
>
> unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
> unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
>
> if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> lower_bound = attr->sched_util_min;
>
> if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> upper_bound = attr->sched_util_max;
>
> if (lower_bound > upper_bound ||
> upper_bound > SCHED_CAPACITY_MAX)
> return -EINVAL;
>
> mutex_lock(...);

Yes... much cleaner, thanks.

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 08:52:10

by Patrick Bellasi

Subject: Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 13-Sep 21:14, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:52:02PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 19:42, Peter Zijlstra wrote:
> > > On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > > > On 12-Sep 18:12, Peter Zijlstra wrote:
> > >
> > > > > No idea; but if you want to go all fancy you can replace the whole
> > > > > uclamp_map thing with something like:
> > > > >
> > > > > struct uclamp_map {
> > > > > union {
> > > > > struct {
> > > > > unsigned long v : 10;
> > > > > unsigned long c : BITS_PER_LONG - 10;
> > > > > };
> > > > > atomic_long_t s;
> > > > > };
> > > > > };
> > > >
> > > > That sounds really cool and scary at the same time :)
> > > >
> > > > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > > > or that we use it to track a percentage value (i.e. [0..100]).
> > >
> > > Or we pick 11 bits, it seems unlikely that capacity be larger than 2k.
> >
> > Just remembered a past experience where we had unaligned access traps
> > on some machine because... don't you see any potential issue in
> > using bitfields like you suggest above ?
> >
> > I'm thinking of:
> >
> > commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")
>
> There should not be (and I'm still confused by that particular commit
> you reference). If we access everything through the uclamp_map::s, and
> only use the bitfields to interpret the results, it all 'works'.

Yes, the problem above was different... still I was wondering if there
could be a bitfield-related alignment issue lurking somewhere.
But, as you say, if we always R/W atomically via uclamp_map::s there
should be none.

> The tricky thing we did earlier was trying to use u64 accesses for 2
> u32 variables. And somehow ia64 didn't get the alignment right.

Right, np... sorry for the noise.

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 09:09:53

by Patrick Bellasi

Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On 13-Sep 21:12, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> > +static inline void uclamp_cpu_get_id(struct task_struct *p,
> > + struct rq *rq, int clamp_id)
> > +{
> > + struct uclamp_group *uc_grp;
> > + struct uclamp_cpu *uc_cpu;
> > + int clamp_value;
> > + int group_id;
> > +
> > + /* Every task must reference a clamp group */
> > + group_id = p->uclamp[clamp_id].group_id;
>
> > +}
> > +
> > +static inline void uclamp_cpu_put_id(struct task_struct *p,
> > + struct rq *rq, int clamp_id)
> > +{
> > + struct uclamp_group *uc_grp;
> > + struct uclamp_cpu *uc_cpu;
> > + unsigned int clamp_value;
> > + int group_id;
> > +
> > + /* New tasks don't have a previous clamp group */
> > + group_id = p->uclamp[clamp_id].group_id;
> > + if (group_id == UCLAMP_NOT_VALID)
> > + return;
>
> *confused*, so on enqueue they must have a group_id, but then on dequeue
> they might no longer have?

Why not ?

Tasks always have a (task-specific) group_id, once set the first time.

IOW, init_task::group_id is UCLAMP_NOT_VALID, as is that of all the tasks
forked under reset_on_fork; otherwise they get the group_id of the
parent.

Actually, I just noted that the reset_on_fork path is now setting
p::group_id=0, which is semantically equivalent to UCLAMP_NOT_VALID...
but will update that assignment for consistency in v5.

The only way to set a !UCLAMP_NOT_VALID value for p::group_id is via
the syscall, which will either fail or set a new valid group_id.

Thus, if we have a valid p::group_id @enqueue time, we will have one
@dequeue time too. It could possibly be a different one, because
while RUNNABLE we do a syscall... but this case is addressed by the
following patch:

[PATCH v4 04/16] sched/core: uclamp: update CPU's refcount on clamp changes
https://lore.kernel.org/lkml/[email protected]/

Does that make sense ?

> > +}
>
> > @@ -1110,6 +1313,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> > if (!(flags & ENQUEUE_RESTORE))
> > sched_info_queued(rq, p);
> >
> > + uclamp_cpu_get(rq, p);
> > p->sched_class->enqueue_task(rq, p, flags);
> > }
> >
> > @@ -1121,6 +1325,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > if (!(flags & DEQUEUE_SAVE))
> > sched_info_dequeued(rq, p);
> >
> > + uclamp_cpu_put(rq, p);
> > p->sched_class->dequeue_task(rq, p, flags);
> > }
>
> The ordering, is that right? We get while the task isn't enqueued yet,
> which would suggest we put when the task is dequeued.


That's the "usual trick" required for correct schedutil updates.

The scheduler class code will likely poke schedutil and thus we
wanna be sure to have updated CPU clamps by the time we have to
compute the next OPP.
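
The ordering argument can be made concrete with a small user-space
model. Everything here is a sketch under stated assumptions: rq_sketch,
class_hook() and the *_sketch() helpers are hypothetical names,
UCLAMP_NOT_VALID is assumed to be -1, and the CPU clamp is assumed to be
the maximum value among groups with RUNNABLE tasks. class_hook() stands
in for the sched_class callback that may poke schedutil.

```c
#include <assert.h>

#define UCLAMP_NOT_VALID	(-1)	/* assumption: "no active clamp" marker */
#define GROUP_COUNT		4	/* hypothetical number of clamp groups */

struct uclamp_group {
	int value;	/* clamp value tracked by this group */
	int tasks;	/* RUNNABLE tasks refcounting this group */
};

struct rq_sketch {
	struct uclamp_group group[GROUP_COUNT];
	int clamp_min;		/* aggregated CPU clamp, or UCLAMP_NOT_VALID */
	int governor_min;	/* value the governor saw at its last update */
};

/* Recompute the CPU clamp as the max over groups with RUNNABLE tasks. */
static void rq_update_clamp(struct rq_sketch *rq)
{
	int max = UCLAMP_NOT_VALID;

	for (int i = 0; i < GROUP_COUNT; i++) {
		if (rq->group[i].tasks > 0 && rq->group[i].value > max)
			max = rq->group[i].value;
	}
	rq->clamp_min = max;
}

/* Stand-in for the class callback: the governor samples the clamp here. */
static void class_hook(struct rq_sketch *rq)
{
	rq->governor_min = rq->clamp_min == UCLAMP_NOT_VALID ? 0 : rq->clamp_min;
}

static void enqueue_sketch(struct rq_sketch *rq, int group_id)
{
	assert(group_id >= 0 && group_id < GROUP_COUNT);

	/* Account the clamp first, so the hook below sees it. */
	rq->group[group_id].tasks++;
	rq_update_clamp(rq);
	class_hook(rq);
}

static void dequeue_sketch(struct rq_sketch *rq, int group_id)
{
	assert(group_id >= 0 && group_id < GROUP_COUNT);
	assert(rq->group[group_id].tasks > 0);

	/* Release the clamp first, so the hook sees the reduced value. */
	rq->group[group_id].tasks--;
	rq_update_clamp(rq);
	class_hook(rq);
}
```

If the accounting ran after the hook instead, the governor would pick
its next OPP from a stale clamp on both paths.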

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 09:33:11

by Peter Zijlstra

Subject: Re: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

On Tue, Aug 28, 2018 at 02:53:14PM +0100, Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3fffad3bc8a8..949082555ee8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> * CFS tasks and we use the same metric to track the effective
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> + *
> + * CFS utilization can be boosted or capped, depending on utilization
> + * clamp constraints configured for currently RUNNABLE tasks.
> */
> util = cpu_util_cfs(rq);
> + if (util)
> + util = uclamp_util(rq, util);

Should that not be:

util = uclamp_util(rq, cpu_util_cfs(rq));

Because if !util might we not still want to enforce the min clamp?

> util += cpu_util_rt(rq);
>
> /*

> @@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> return;
> sg_cpu->iowait_boost_pending = true;
>
> + /*
> + * Boost FAIR tasks only up to the CPU clamped utilization.
> + *
> + * Since DL tasks have a much more advanced bandwidth control, it's
> + * safe to assume that IO boost does not apply to those tasks.
> + * Instead, since RT tasks are not utilization clamped, we don't want
> + * to apply clamping on IO boost while there is blocked RT
> + * utilization.
> + */
> + max_boost = sg_cpu->iowait_boost_max;
> + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);

OK I suppose.

> +
> /* Double the boost at each request */
> if (sg_cpu->iowait_boost) {
> sg_cpu->iowait_boost <<= 1;
> - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> + if (sg_cpu->iowait_boost > max_boost)
> + sg_cpu->iowait_boost = max_boost;
> return;
> }
>


> +static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
> +{
> + struct uclamp_cpu *uc_cpu = &rq->uclamp;
> +
> + if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
> + return uclamp_none(clamp_id);
> +
> + return uc_cpu->value[clamp_id];
> +}

Would that not be more readable as:

static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
{
unsigned int val = rq->uclamp.value[clamp_id];

if (unlikely(val == UCLAMP_NOT_VALID))
val = uclamp_none(clamp_id);

return val;
}

And how come NOT_VALID is possible? I thought the idea was to always
have all things a valid value.


2018-09-14 11:10:40

by Peter Zijlstra

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Thu, Sep 06, 2018 at 03:40:53PM +0100, Patrick Bellasi wrote:
> 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> instead on capable(CAP_SYS_ADMIN)
>
> Does that make sense ?

Neither of them really makes sense to me.

The max clamp makes a task 'consume' less and you should always be able
to reduce yourself.

The min clamp doesn't avoid while(1); and is therefore also not a
problem.

So I think setting clamps on a task should not be subject to additional
capabilities.

Now, of course, there is a problem of clamp resources, which are
limited. Consuming those _is_ a problem.

I think the problem here is that the two are conflated in the very same
interface.

Would it make sense to move the available clamp values out to some sysfs
interface like thing and guard that with a capability, while keeping the
task interface unprivileged?

Another thing that has me 'worried' about this interface is the direct
tie to CPU capacity (not that I have a better suggestion). But it does
raise the point of how userspace is going to discover the relevant
values of the platform.

2018-09-14 11:53:02

by Peter Zijlstra

Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On Fri, Sep 14, 2018 at 10:07:51AM +0100, Patrick Bellasi wrote:
> On 13-Sep 21:12, Peter Zijlstra wrote:
> > On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> > > +static inline void uclamp_cpu_get_id(struct task_struct *p,
> > > + struct rq *rq, int clamp_id)
> > > +{
> > > + struct uclamp_group *uc_grp;
> > > + struct uclamp_cpu *uc_cpu;
> > > + int clamp_value;
> > > + int group_id;
> > > +
> > > + /* Every task must reference a clamp group */
> > > + group_id = p->uclamp[clamp_id].group_id;
> >
> > > +}
> > > +
> > > +static inline void uclamp_cpu_put_id(struct task_struct *p,
> > > + struct rq *rq, int clamp_id)
> > > +{
> > > + struct uclamp_group *uc_grp;
> > > + struct uclamp_cpu *uc_cpu;
> > > + unsigned int clamp_value;
> > > + int group_id;
> > > +
> > > + /* New tasks don't have a previous clamp group */
> > > + group_id = p->uclamp[clamp_id].group_id;
> > > + if (group_id == UCLAMP_NOT_VALID)
> > > + return;
> >
> > *confused*, so on enqueue they must have a group_id, but then on dequeue
> > they might no longer have?
>
> Why not ?

That's what it says on the tin, right? enqueue: "every task must reference clamp
group" while on dequeue: "new tasks don't have a (previous) clamp group"
and we check for NOT_VALID.


2018-09-14 13:21:39

by Patrick Bellasi

Subject: Re: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

On 14-Sep 11:32, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:14PM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 3fffad3bc8a8..949082555ee8 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > * CFS tasks and we use the same metric to track the effective
> > * utilization (PELT windows are synchronized) we can directly add them
> > * to obtain the CPU's actual utilization.
> > + *
> > + * CFS utilization can be boosted or capped, depending on utilization
> > + * clamp constraints configured for currently RUNNABLE tasks.
> > */
> > util = cpu_util_cfs(rq);
> > + if (util)
> > + util = uclamp_util(rq, util);
>
> Should that not be:
>
> util = uclamp_util(rq, cpu_util_cfs(rq));
>
> Because if !util might we not still want to enforce the min clamp?

If !util, CFS tasks should have been gone for a long time
(proportional to their estimated utilization) and thus it probably
makes sense not to further affect energy efficiency for tasks of other
classes.

IOW, the blocked utilization of a class gives us a bit of "hysteresis" in
case its tasks have a relatively small period and thus they are likely
to wake up soonish.

This "hysteresis" so far is based on the specific PELT decay rate,
which is not very tunable... what I would like instead, but that's for
a future update, is a dedicated (per-task) attribute which allows
defining for how long a clamp has to last since the last task enqueue
time.

This will make for a much more flexible mechanism, which allows
completely decoupling a clamp's duration from PELT, enabling scenarios
quite similar to the 0-lag we have in DL:
- a small task with a relatively long period which gets an ensured
boost up to its next activation
- a big task which has important things to do just at the beginning
but can complete at a more energy-efficient lower OPP

We already have this "boost holding" feature in Android and we found
it quite useful, especially for RT tasks, where it guarantees that an RT
task does not risk waking up on a lower OPP when that feature is
required (which may not always be the case).

Furthermore, based on such a generic "clamp holding mechanism" we can
also think about replacing the IOWAIT boost with a more tunable and
task-specific boosting based on util_min.

... but again, if the above makes any sense, it's for a future series
once we are happy with at least these bits.


> > util += cpu_util_rt(rq);
> >
> > /*
>
> > @@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > return;
> > sg_cpu->iowait_boost_pending = true;
> >
> > + /*
> > + * Boost FAIR tasks only up to the CPU clamped utilization.
> > + *
> > + * Since DL tasks have a much more advanced bandwidth control, it's
> > + * safe to assume that IO boost does not apply to those tasks.
> > + * Instead, since RT tasks are not utilization clamped, we don't want
> > + * to apply clamping on IO boost while there is blocked RT
> > + * utilization.
> > + */
> > + max_boost = sg_cpu->iowait_boost_max;
> > + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> > + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
>
> OK I suppose.

Yes, if we have a task constraint it should apply to boost too...
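
The capped doubling in the hunk above can be modelled in a few lines.
This is a user-space sketch with hypothetical helper names; it assumes
uclamp_util() simply restricts a value to the CPU's [clamp_min,
clamp_max] range, and it passes the RT utilization in as a plain number.

```c
#include <assert.h>

/* Restrict @util to [@min, @max]; assumed behaviour of uclamp_util(). */
static unsigned int clamp_util_sketch(unsigned int util,
				      unsigned int min, unsigned int max)
{
	assert(min <= max);
	if (util < min)
		return min;
	if (util > max)
		return max;
	return util;
}

/*
 * Ceiling for the IO boost: clamped only while there is no (blocked)
 * RT utilization, since RT tasks are not utilization clamped here.
 */
static unsigned int iowait_boost_cap(unsigned int boost_max,
				     unsigned int rt_util,
				     unsigned int clamp_min,
				     unsigned int clamp_max)
{
	if (rt_util)
		return boost_max;
	return clamp_util_sketch(boost_max, clamp_min, clamp_max);
}

/* Double the boost at each request, saturating at @cap. */
static unsigned int iowait_boost_step(unsigned int boost, unsigned int cap)
{
	boost <<= 1;
	if (boost > cap)
		boost = cap;
	return boost;
}
```

With a max clamp of 300 and no RT utilization, a boost of 128 grows to
256 and then saturates at 300 instead of reaching iowait_boost_max.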

> > +
> > /* Double the boost at each request */
> > if (sg_cpu->iowait_boost) {
> > sg_cpu->iowait_boost <<= 1;
> > - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> > - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> > + if (sg_cpu->iowait_boost > max_boost)
> > + sg_cpu->iowait_boost = max_boost;
> > return;
> > }
> >
>
>
> > +static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
> > +{
> > + struct uclamp_cpu *uc_cpu = &rq->uclamp;
> > +
> > + if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
> > + return uclamp_none(clamp_id);
> > +
> > + return uc_cpu->value[clamp_id];
> > +}
>
> Would that not be more readable as:
>
> static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
> {
> unsigned int val = rq->uclamp.value[clamp_id];
>
> if (unlikely(val == UCLAMP_NOT_VALID))
> val = uclamp_none(clamp_id);
>
> return val;
> }

I'm trying to keep variable naming consistent by always
accessing the rq's clamps via a *uc_cpu, to make it easy to grep the
code. Does this argument make sense ?

On the other hand, what you propose above is easier to read
when looking just at that function... so, if you prefer it, I'll
update it in v5.

> And how come NOT_VALID is possible? I thought the idea was to always
> have all things a valid value.

When we update the CPU's clamp for a "newly idle" CPU, there are no
tasks refcounting clamps and thus we end up with UCLAMP_NOT_VALID for
that CPU. That's how uclamp_cpu_update() is currently coded.

Perhaps we can set the value to uclamp_none(clamp_id) from that
function, but I was thinking that perhaps it could be useful to track
explicitly that the CPU is now idle.
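
To make the fallback concrete, here is a self-contained user-space
sketch of the single-load variant discussed above. The struct layouts,
the UINT_MAX sentinel and the uclamp_none() defaults (0 for the min
clamp, full scale for the max clamp) are assumptions made for
illustration, and the kernel-only unlikely() hint is omitted.

```c
#include <assert.h>
#include <limits.h>

#define SCHED_CAPACITY_SCALE	1024U
#define UCLAMP_NOT_VALID	UINT_MAX	/* assumed sentinel value */

enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/* Simplified stand-ins for the kernel structures */
struct uclamp_cpu { unsigned int value[UCLAMP_CNT]; };
struct rq { struct uclamp_cpu uclamp; };

/* Assumed "no clamp" defaults: 0 for util_min, full scale for util_max */
unsigned int uclamp_none(int clamp_id)
{
	return clamp_id == UCLAMP_MIN ? 0 : SCHED_CAPACITY_SCALE;
}

/* Load the CPU clamp once; fall back to "no clamp" if it is not valid */
unsigned int uclamp_value(const struct rq *rq, int clamp_id)
{
	unsigned int val;

	assert(clamp_id >= 0 && clamp_id < UCLAMP_CNT);

	val = rq->uclamp.value[clamp_id];
	if (val == UCLAMP_NOT_VALID)
		val = uclamp_none(clamp_id);

	return val;
}
```

If uclamp_cpu_update() stored uclamp_none(clamp_id) directly for an idle
CPU, the NOT_VALID branch above would not be needed at all.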

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 13:37:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

On Fri, Sep 14, 2018 at 02:19:19PM +0100, Patrick Bellasi wrote:
> On 14-Sep 11:32, Peter Zijlstra wrote:

> > Should that not be:
> >
> > util = clamp_util(rq, cpu_util_cfs(rq));
> >
> > Because if !util might we not still want to enforce the min clamp?
>
> If !util CFS tasks should have been gone since a long time
> (proportional to their estimated utilization) and thus it probably
> makes sense to not affect further energy efficiency for tasks of other
> classes.

I don't remember what we do for util for new tasks; but weren't we
talking about setting that to 0 recently? IIRC the problem was that if
we start at 1 with util we'll always run new tasks on big cores, or
something along those lines.

So new tasks would still trigger this case until they'd accrued enough
history.

Either way around, I don't much care at this point except I think it
would be good to have a comment to record the assumptions.

> > Would that not be more readable as:
> >
> > static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
> > {
> > unsigned int val = rq->uclamp.value[clamp_id];
> >
> > if (unlikely(val == UCLAMP_NOT_VALID))
> > val = uclamp_none(clamp_id);
> >
> > return val;
> > }
>
> I'm trying to keep consistency in variable names usages by always
> accessing the rq's clamps via a *uc_cpu to make it easy grepping the
> code. Does this argument make sense ?
>
> On the other side, what you propose above is more easy to read
> by looking just at that function.... so, if you prefer it better, I'll
> update it on v5.

I prefer my version, also because it has a single load of the value (yes
I know about CSE passes). I figure one can always grep for uclamp or
something.

> > And how come NOT_VALID is possible? I thought the idea was to always
> > have all things a valid value.
>
> When we update the CPU's clamp for a "newly idle" CPU, there are not
> tasks refcounting clamps and thus we end up with UCLAMP_NOT_VALID for
> that CPU. That's how uclamp_cpu_update() is currently encoded.
>
> Perhaps we can set the value to uclamp_none(clamp_id) from that
> function, but I was thinking that perhaps it could be useful to track
> explicitly that the CPU is now idle.

IIRC you added an explicit flag to track idle somewhere.. to keep the
last max clamp in effect or something.

I think, but haven't overly thought about this, that if you always
ensure these things are valid you can avoid a bunch of NOT_VALID
conditions. And less conditions is always good, right? :-)

2018-09-14 13:42:18

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v4 03/16] sched/core: uclamp: add CPU's clamp groups accounting

On 14-Sep 13:52, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 10:07:51AM +0100, Patrick Bellasi wrote:
> > On 13-Sep 21:12, Peter Zijlstra wrote:
> > > On Tue, Aug 28, 2018 at 02:53:11PM +0100, Patrick Bellasi wrote:
> > > > +static inline void uclamp_cpu_get_id(struct task_struct *p,
> > > > + struct rq *rq, int clamp_id)
> > > > +{
> > > > + struct uclamp_group *uc_grp;
> > > > + struct uclamp_cpu *uc_cpu;
> > > > + int clamp_value;
> > > > + int group_id;
> > > > +
> > > > + /* Every task must reference a clamp group */
> > > > + group_id = p->uclamp[clamp_id].group_id;
> > >
> > > > +}
> > > > +
> > > > +static inline void uclamp_cpu_put_id(struct task_struct *p,
> > > > + struct rq *rq, int clamp_id)
> > > > +{
> > > > + struct uclamp_group *uc_grp;
> > > > + struct uclamp_cpu *uc_cpu;
> > > > + unsigned int clamp_value;
> > > > + int group_id;
> > > > +
> > > > + /* New tasks don't have a previous clamp group */
> > > > + group_id = p->uclamp[clamp_id].group_id;
> > > > + if (group_id == UCLAMP_NOT_VALID)
> > > > + return;
> > >
> > > *confused*, so on enqueue they must have a group_id, but then on dequeue
> > > they might no longer have?
> >
> > Why not ?
>
> That's what it says on the tin, right? enqueue: "every task must reference clamp
> group" while on dequeue: "new tasks don't have a (previous) clamp group"
> and we check for NOT_VALID.

Oh, right... I got confused too, since I was looking at the enqueue path.

You're right, at dequeue time we always have a group_id. The check at
dequeue time was required only before v3 because of the way defaults
(i.e. no clamps) were tracked. Will remove it in v5, thanks!

BTW, my previous explanation was also incorrect, since the logic for
init_task initialization is:

uc_se->group_id = UCLAMP_NOT_VALID;
uclamp_group_get(value)

where the first assignment is required just to inform uclamp_group_get()
that we don't need to uclamp_group_put() a previous (non existing at
init time) clamp group.

Thus, at the end of the two instructions above we end up with an
init_task which has a !UCLAMP_NOT_VALID group id, which is then
cloned by forked tasks.
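
A rough user-space model of that get/put sequence, just to illustrate
why the initial UCLAMP_NOT_VALID assignment matters. The pool size, the
struct layouts and the lookup policy are hypothetical simplifications,
not the code from the series.

```c
#include <assert.h>

#define UCLAMP_NOT_VALID	(-1)
#define UCLAMP_GROUPS		4	/* hypothetical small pool */

struct uclamp_group { int value; unsigned int refcount; };
struct uclamp_se { int value; int group_id; };

struct uclamp_group uclamp_groups[UCLAMP_GROUPS];

/* Drop one reference; a group with no users becomes free for reuse */
void uclamp_group_put(int group_id)
{
	assert(group_id >= 0 && group_id < UCLAMP_GROUPS);
	assert(uclamp_groups[group_id].refcount > 0);
	uclamp_groups[group_id].refcount--;
}

/*
 * Reference the group tracking @value and release the previous group,
 * if there is one. Returns 0 on success, -1 if no group is available.
 */
int uclamp_group_get(struct uclamp_se *uc_se, int value)
{
	int free_id = UCLAMP_NOT_VALID;
	int id;

	/* Look for a group already tracking @value, remember a free one */
	for (id = 0; id < UCLAMP_GROUPS; id++) {
		if (uclamp_groups[id].refcount &&
		    uclamp_groups[id].value == value)
			break;
		if (!uclamp_groups[id].refcount && free_id == UCLAMP_NOT_VALID)
			free_id = id;
	}
	if (id == UCLAMP_GROUPS) {
		if (free_id == UCLAMP_NOT_VALID)
			return -1;
		id = free_id;
		uclamp_groups[id].value = value;
	}
	uclamp_groups[id].refcount++;

	/* Only a valid previous group needs to be released */
	if (uc_se->group_id != UCLAMP_NOT_VALID)
		uclamp_group_put(uc_se->group_id);

	uc_se->value = value;
	uc_se->group_id = id;
	return 0;
}
```

Starting from group_id == UCLAMP_NOT_VALID, the first get skips the put
and leaves the entity with a valid group id.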

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 13:57:50

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

On 14-Sep 15:36, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 02:19:19PM +0100, Patrick Bellasi wrote:
> > On 14-Sep 11:32, Peter Zijlstra wrote:
>
> > > Should that not be:
> > >
> > > util = clamp_util(rq, cpu_util_cfs(rq));
> > >
> > > Because if !util might we not still want to enforce the min clamp?
> >
> > If !util CFS tasks should have been gone since a long time
> > (proportional to their estimated utilization) and thus it probably
> > makes sense to not affect further energy efficiency for tasks of other
> > classes.
>
> I don't remember what we do for util for new tasks; but weren't we
> talking about setting that to 0 recently? IIRC the problem was that if
> we start at 1 with util we'll always run new tasks on big cores, or
> something along those lines.

Mmm.. could have been in a recent discussion with Quentin, but I
think I've missed it. I know we have something similar on Android for
similar reasons.

> So new tasks would still trigger this case until they'd accrued enough
> history.

Well, yes and no. New tasks will be clamped, which means that if they
are generated from a capped parent (or within a cgroup with a
suitable util_max) they can still run on a smaller capacity CPU
despite their utilization being 1024. Thus, to a certain extent,
UtilClamp could be a fix for the above misbehavior whenever needed.
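
As an illustration of the idea only (task placement biasing is not part
of this series), a clamped "fits" check could look like the sketch
below. Both helpers are hypothetical, and the plain <= comparison
ignores any capacity margin a real implementation would likely apply.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024U

/* Clamp @util into the [util_min, util_max] range */
unsigned int uclamp_util_sketch(unsigned int util,
				unsigned int util_min,
				unsigned int util_max)
{
	assert(util_min <= util_max);

	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}

/* A task "fits" if its clamped utilization is within the CPU capacity */
int task_fits_cpu(unsigned int util, unsigned int util_min,
		  unsigned int util_max, unsigned int cpu_capacity)
{
	return uclamp_util_sketch(util, util_min, util_max) <= cpu_capacity;
}
```

For example, a new task with util 1024 does not fit a CPU of capacity
446 when unclamped, but it does when capped with util_max = 400.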

NOTE: this series does not include tasks biasing bits.

> Either way around, I don't much care at this point except I think it
> would be good to have a comment to record the assumptions.

Sure, will add a comment on that and a warning about possible side
effects on task placement.

> > > Would that not be more readable as:
> > >
> > > static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
> > > {
> > > unsigned int val = rq->uclamp.value[clamp_id];
> > >
> > > if (unlikely(val == UCLAMP_NOT_VALID))
> > > val = uclamp_none(clamp_id);
> > >
> > > return val;
> > > }
> >
> > I'm trying to keep consistency in variable names usages by always
> > accessing the rq's clamps via a *uc_cpu to make it easy grepping the
> > code. Does this argument make sense ?
> >
> > On the other side, what you propose above is more easy to read
> > by looking just at that function.... so, if you prefer it better, I'll
> > update it on v5.
>
> I prefer my version, also because it has a single load of the value (yes
> I know about CSE passes). I figure one can always grep for uclamp or
> something.

+1

> > > And how come NOT_VALID is possible? I thought the idea was to always
> > > have all things a valid value.
> >
> > When we update the CPU's clamp for a "newly idle" CPU, there are not
> > tasks refcounting clamps and thus we end up with UCLAMP_NOT_VALID for
> > that CPU. That's how uclamp_cpu_update() is currently encoded.
> >
> > Perhaps we can set the value to uclamp_none(clamp_id) from that
> > function, but I was thinking that perhaps it could be useful to track
> > explicitly that the CPU is now idle.
>
> IIRC you added an explicit flag to track idle somewhere.. to keep the
> last max clamp in effect or something.

Right... that patch was after this one in v3, but now that I've moved
it before this one we can probably simplify this path.

> I think, but haven't overly thought about this, that if you always
> ensure these things are valid you can avoid a bunch of NOT_VALID
> conditions. And less conditions is always good, right? :-)

Right, I'll check all the usages more carefully and remove them when
not strictly required.

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 14:08:23

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 14-Sep 13:10, Peter Zijlstra wrote:
> On Thu, Sep 06, 2018 at 03:40:53PM +0100, Patrick Bellasi wrote:
> > 1) _I think_ we don't want to depend on capable(CAP_SYS_NICE) but
> > instead on capable(CAP_SYS_ADMIN)
> >
> > Does that make sense ?
>
> Neither of them really makes sense to me.
>
> The max clamp makes a task 'consume' less and you should always be able
> to reduce yourself.
>
> The min clamp doesn't avoid while(1); and is therefore also not a
> problem.
>
> So I think setting clamps on a task should not be subject to additional
> capabilities.
>
> Now, of course, there is a problem of clamp resources, which are
> limited. Consuming those _is_ a problem.

Right, that problem could be solved if we convince ourselves that the
quantization approach proposed in:

[PATCH v4 15/16] sched/core: uclamp: add clamp group discretization support
https://lore.kernel.org/lkml/[email protected]/

could make sense and, specifically, that the other limitation it imposes
(i.e. the quantization) is within reasonable rounding errors.
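
To give a feeling for the rounding involved, here is a hypothetical
sketch of such a discretization. The bucket count and helper names are
assumptions for illustration and are not taken from patch 15/16.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024U
#define UCLAMP_BUCKETS		20U	/* hypothetical number of buckets */
#define UCLAMP_BUCKET_DELTA	(SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS)

/* Map any clamp value in [0, 1024] to one of UCLAMP_BUCKETS buckets */
unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);

	id = clamp_value / UCLAMP_BUCKET_DELTA;
	/* The topmost values fall into the last bucket */
	return id < UCLAMP_BUCKETS ? id : UCLAMP_BUCKETS - 1;
}

/* Clamp value effectively enforced for a request: the bucket's base */
unsigned int uclamp_bucket_base(unsigned int clamp_value)
{
	return uclamp_bucket_id(clamp_value) * UCLAMP_BUCKET_DELTA;
}
```

With 20 buckets each one spans 51 capacity units (about 5%), so every
possible request maps to a bucket and, except for the wider last
bucket, the enforced value is at most 50 units below the request.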

> I think the problem here is that the two are conflated in the very same
> interface.
>
> Would it make sense to move the available clamp values out to some sysfs
> interface like thing and guard that with a capability, while keeping the
> task interface unprivilidged?

You mean something like:

$ cat /proc/sys/kernel/sched_uclamp_min_utils
0 10 20 ... 100

to notify users about the set of clamp values which are available ?

> Another thing that has me 'worried' about this interface is the direct
> tie to CPU capacity (not that I have a better suggestion). But it does
> raise the point of how userspace is going to discover the relevant
> values of the platform.

This point worries me too, and that's what I think is addressed in a
sane way in:

[PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
https://lore.kernel.org/lkml/[email protected]/

IMHO percentages are a reasonably safe and generic API to expose to
user-space. Don't you think this should address your concern above ?

--
#include <best/regards.h>

Patrick Bellasi

2018-09-14 14:30:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default


Just a quick reply because I have to run..

On Fri, Sep 14, 2018 at 03:07:32PM +0100, Patrick Bellasi wrote:
> On 14-Sep 13:10, Peter Zijlstra wrote:

> > I think the problem here is that the two are conflated in the very same
> > interface.
> >
> > Would it make sense to move the available clamp values out to some sysfs
> > interface like thing and guard that with a capability, while keeping the
> > task interface unprivilidged?
>
> You mean something like:
>
> $ cat /proc/sys/kernel/sched_uclamp_min_utils
> 0 10 20 ... 100
>
> to notify users about the set of clamp values which are available ?
>
> > Another thing that has me 'worried' about this interface is the direct
> > tie to CPU capacity (not that I have a better suggestion). But it does
> > raise the point of how userspace is going to discover the relevant
> > values of the platform.
>
> This point worries me too, and that's what I think is addressed in a
> sane way in:
>
> [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
> https://lore.kernel.org/lkml/[email protected]/
>
> IMHO percentages are a reasonably safe and generic API to expose to
> user-space. Don't you think this should address your concern above ?

Not at all what I meant, and no, percentages don't help.

The thing is, the values you'd want to use are for example the capacity
of the little CPUs, or the capacity of the most energy efficient OPP
(the knee).

Similarly for boosting, how are we 'easily' going to find the values
that correspond to the various available OPPs.

The EAS thing might have these around; but I forgot if/how they're
exposed to userspace (I'll have to soon look at the latest posting).

But changing the clamp metric to something different than these values
is going to be pain.

2018-09-17 12:30:10

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 14-Sep 16:28, Peter Zijlstra wrote:
> Just a quick reply because I have to run..
>
> On Fri, Sep 14, 2018 at 03:07:32PM +0100, Patrick Bellasi wrote:
> > On 14-Sep 13:10, Peter Zijlstra wrote:
>
> > > I think the problem here is that the two are conflated in the very same
> > > interface.
> > >
> > > Would it make sense to move the available clamp values out to some sysfs
> > > interface like thing and guard that with a capability, while keeping the
> > > task interface unprivilidged?
> >
> > You mean something like:
> >
> > $ cat /proc/sys/kernel/sched_uclamp_min_utils
> > 0 10 20 ... 100
> >
> > to notify users about the set of clamp values which are available ?
> >
> > > Another thing that has me 'worried' about this interface is the direct
> > > tie to CPU capacity (not that I have a better suggestion). But it does
> > > raise the point of how userspace is going to discover the relevant
> > > values of the platform.
> >
> > This point worries me too, and that's what I think is addressed in a
> > sane way in:
> >
> > [PATCH v4 13/16] sched/core: uclamp: use percentage clamp values
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > IMHO percentages are a reasonably safe and generic API to expose to
> > user-space. Don't you think this should address your concern above ?
>
> Not at all what I meant, and no, percentages don't help.
>
> The thing is, the values you'd want to use are for example the capacity
> of the little CPUs. or the capacity of the most energy efficient OPP
> (the knee).

I don't think so.

On the knee topic, we gave it some thought and on most platforms it
seems to be a rather arbitrary decision.

On sane platforms, the Energy Efficiency (EE) is monotonically
decreasing with frequency increase. Maybe we can define a threshold
for a "EE derivative ratio", but it will still be quite arbitrary.
Moreover, it could be that in certain use-cases we want to push for
higher energy efficiency (i.e. lower derivatives) than others.

> Similarly for boosting, how are we 'easily' going to find the values
> that correspond to the various available OPPs.

In our experience with SchedTune on Android, we found that we
generally focus on a small set of representative use-cases and then
run an exploration, by tuning the percentage of boost, to identify the
optimal trade-off between Performance and Energy.
The value you get could be something which does not exactly match an
OPP but still, since we (will) bias not only OPP selection but also task
placement, it's the one which makes most sense.

Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
something we don't care to specify exactly, since:

- schedutil will round the util request up to the next frequency anyway

- capacity by itself is a loosely defined metric, since it's usually
measured considering a specific kind of instructions mix, which
can be very different from the actual instruction mix (e.g. integer
vs floating point)

- certain platforms don't even expose OPPs, but just "performance
levels"... which ultimately are a "percentage"

- there are so many rounding errors in utilization tracking
and its aggregation that being exact on an OPP is of "relative"
importance
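
The first point in the list above can be sketched as follows: whatever
value is requested, the selection ends up on the lowest OPP able to
serve it. This is a simplified model with a hypothetical helper and an
assumed ascending table of OPP capacities; it ignores any headroom the
governor may add on top of the requested utilization.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the capacity of the lowest OPP able to serve @util.
 * @opp_caps must be sorted in ascending order; requests above the
 * highest capacity saturate at the last OPP.
 */
unsigned int opp_capacity_for_util(const unsigned int *opp_caps,
				   size_t nr_opps, unsigned int util)
{
	size_t i;

	assert(nr_opps > 0);

	for (i = 0; i < nr_opps; i++) {
		if (opp_caps[i] >= util)
			return opp_caps[i];
	}

	return opp_caps[nr_opps - 1];
}
```

With OPP capacities { 256, 512, 654, 1024 }, any request in the range
513..654 selects the 654 OPP, so the request does not need to name that
capacity exactly.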

Do you see specific use-cases where an exact OPP capacity is much
better than a percentage value ?

Of course there can be scenarios in which we want to clamp to a
specific OPP. But still, why should it be difficult for a platform
integrator to express it as a close enough percentage value ?

> The EAS thing might have these around; but I forgot if/how they're
> exposed to userspace (I'll have to soon look at the latest posting).

The new "Energy Model Management" framework can certainly be used to
get the list of OPPs for each frequency domain. IMO this could be
used to identify the maximum number of clamp groups we can have.
In this case, the discretization patch can translate a generic
percentage clamp into the closest OPP capacity...

... but to me that's an internal detail which I'm not convinced we
need to expose to user-space.

IMHO we should instead focus just on defining a usable and generic
userspace interface. Then, platform specific tuning is something
user-space can do, either offline or on-line.

> But changing the clamp metric to something different than these values
> is going to be pain.

Maybe I don't completely get what you mean here... are you saying that
not using exact capacity values to define clamps is difficult ?
If that's the case, why ? Can you elaborate with an example ?

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-21 09:15:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:
> On 14-Sep 16:28, Peter Zijlstra wrote:

> > The thing is, the values you'd want to use are for example the capacity
> > of the little CPUs. or the capacity of the most energy efficient OPP
> > (the knee).
>
> I don't think so.
>
> On the knee topic, we had some thinking and on most platforms it seems
> to be a rather arbitrary decision.
>
> On sane platforms, the Energy Efficiency (EE) is monotonically
> decreasing with frequency increase. Maybe we can define a threshold
> for a "EE derivative ratio", but it will still be quite arbitrary.
> Moreover, it could be that in certain use-cases we want to push for
> higher energy efficiency (i.e. lower derivatives) then others.

I remember IBM-power folks asking for knee related features a number of
years ago (Dusseldorf IIRC) because after some point their chips start
to _really_ suck power. Sure, the curve is monotonic, but the perf/watt
takes a nose dive.

And given that: P = CfV^2, that seems like a fairly generic observation.
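
As a rough worked version of that observation, assuming (this is an
assumption, not something stated in this thread) that performance
scales with f and that V has to grow roughly linearly with f in the
upper OPP range:

```latex
\begin{align*}
P &= C f V^2, \qquad \text{perf} \propto f
  \;\Rightarrow\; \frac{\text{perf}}{P} \propto \frac{1}{C V^2} \\
V &\propto f
  \;\Rightarrow\; P \propto C f^3, \qquad
  \frac{\text{perf}}{P} \propto \frac{1}{C f^2}
\end{align*}
```

Under that assumption perf/watt falls off quadratically with frequency
once the voltage starts to rise, while it stays flat over any range
where the voltage is constant.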

However, maybe, due to the very limited thermal capacity of these mobile
things, the issue doesn't really arise in them.

Laptops with active cooling however...

> > Similarly for boosting, how are we 'easily' going to find the values
> > that correspond to the various available OPPs.
>
> In our experience with SchedTune on Android, we found that we
> generally focus on a small set of representative use-cases and then
> run an exploration, by tuning the percentage of boost, to identify the
> optimal trade-off between Performance and Energy.

So you basically do an automated optimization for a benchmark?

> The value you get could be something which do not match exactly an OPP
> but still, since we (will) bias not only OPP selection but also tasks
> placement, it's the one which makes most sense.

*groan*, so how exactly does that work? By limiting the task capacity,
we allow some stacking on the CPUs before we switch to regular
load-balancing?

> Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> something we don't care to specify exactly, since:
>
> - schedutil will top the util request to the next frequency anyway
>
> - capacity by itself is a loosely defined metric, since it's usually
> measured considering a specific kind of instructions mix, which
> can be very different from the actual instruction mix (e.g. integer
> vs floating point)

Sure, things like pure SIMD workloads can skew things pretty bad, but on
average it should not drastically change the overall shape of the curve
and the knee point should not move around a lot.

> - certain platforms don't even expose OPPs, but just "performance
> levels"... which ultimately are a "percentage"

Well, the whole capacity thing is a 'percentage', it's just that 1024 is
much nicer to work with (for computers) than 100 is (also it provides a
wee bit more resolution).

But even the platforms with hidden OPPs (can) have knee points, and if
you measure their power to capacity curve you can place a workload
around the knee by capping capacity.

But yes, this gets tricky real fast :/

> - there are so many rounding errors around on utilization tracking
> and it aggregation that being exact on an OPP if of "relative"
> importance

I'm not sure I understand that argument; sure the measurement is subject
to 'issues', but if we hard clip the result, that will exactly match the
fixed points for OPP selection. Any issues on the measurement are lost
after clipping.

> Do you see specific use-cases where an exact OPP capacity is much
> better then a percentage value ?

If I don't have algorithmic optimization available, hand selecting an
OPP is the 'obvious' thing to do.

> Of course there can be scenarios in which wa want to clamp to a
> specific OPP. But still, why should it be difficult for a platform
> integrator to express it as a close enough percentage value ?

But why put him through the trouble of finding the capacity value in the
EAS exposed data, converting that to a percentage that will work and
then feeding it back in.

I don't see the point or benefit of percentages, there's nothing magical
about 1/100, _any_ other fraction works exactly the same.

So why bother changing it around?

> > The EAS thing might have these around; but I forgot if/how they're
> > exposed to userspace (I'll have to soon look at the latest posting).
>
> The new "Energy Model Management" framework can certainly be use to
> get the list of OPPs for each frequency domain. IMO this could be
> used to identify the maximum number of clamp groups we can have.
> In this case, the discretization patch can translate a generic
> percentage clamp into the closest OPP capacity...
>
> ... but to me that's an internal detail which I'm not convinced we
> don't need to expose to user-space.
>
> IMHO we should instead focus just on defining a usable and generic
> userspace interface. Then, platform specific tuning is something
> user-space can do, either offline or on-line.

The thing I worry about is how do we determine the value to put in in
the first place.

How are we expecting people to determine what to put into the interface?
Knee points, little capacity, those things make 'obvious' sense.

> > But changing the clamp metric to something different than these values
> > is going to be pain.
>
> Maybe I don't completely get what you mean here... are you saying that
> not using exact capacity values to defined clamps is difficult ?
> If that's the case why? Can you elaborate with an example ?

I meant changing the unit around, 1/1024 is what we use throughout and
is what EAS is also exposing IIRC, so why make things complicated again
and use 1/100 (which is a shit fraction for computers).



2018-09-24 15:15:29

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 21-Sep 11:13, Peter Zijlstra wrote:
> On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:
> > On 14-Sep 16:28, Peter Zijlstra wrote:
>
> > > The thing is, the values you'd want to use are for example the capacity
> > > of the little CPUs. or the capacity of the most energy efficient OPP
> > > (the knee).
> >
> > I don't think so.
> >
> > On the knee topic, we had some thinking and on most platforms it seems
> > to be a rather arbitrary decision.
> >
> > On sane platforms, the Energy Efficiency (EE) is monotonically
> > decreasing with frequency increase. Maybe we can define a threshold
> > for a "EE derivative ratio", but it will still be quite arbitrary.
> > Moreover, it could be that in certain use-cases we want to push for
> > higher energy efficiency (i.e. lower derivatives) then others.
>
> I remember IBM-power folks asking for knee related features a number of
> years ago (Dusseldorf IIRC) because after some point their chips start
> to _really_ suck power. Sure, the curve is monotonic, but the perf/watt
> takes a nose dive.
>
> And given that: P = CfV^2, that seems like a fairly generic observation.
>
> However, maybe, due to the very limited thermal capacity of these mobile
> things, the issue doesn't really arrise in them.

The curve still follows the equation above for mobile devices too,
and it's exactly by looking at that curve that defining a knee point
appears rather arbitrary (more on that later)...

> Laptops with active cooling however...

How do you see active cooling playing a role ?

Are you thinking, for example, of reduced fan noise if we remain below
a certain OPP ?

Are you factoring the fans' power consumption into the overall power
consumption ?

> > > Similarly for boosting, how are we 'easily' going to find the values
> > > that correspond to the various available OPPs.
> >
> > In our experience with SchedTune on Android, we found that we
> > generally focus on a small set of representative use-cases and then
> > run an exploration, by tuning the percentage of boost, to identify the
> > optimal trade-off between Performance and Energy.
>
> So you basically do an automated optimization for a benchmark?

Not on one single benchmark, we consider a set of interesting
use-cases. We mostly focus on:
- interactivity: no jank frames while scrolling fast on list/views
- power efficiency: on common video/audio playback scenarios

The exploration of some optimization parameters can be automated.
However, at the end there is always a rather arbitrary decision to
take: you can be slightly more oriented towards interactive
performance or energy efficiency.

Maybe (in the future) you can also see AI/ML, used from user-space, to
figure out the fine tuning based on user's usage patterns for different
apps... ;)

> > The value you get could be something which do not match exactly an OPP
> > but still, since we (will) bias not only OPP selection but also tasks
> > placement, it's the one which makes most sense.
>
> *groan*, so how exactly does that work? By limiting the task capacity,
> we allow some stacking on the CPUs before we switch to regular
> load-balancing?

This is a big topic in itself, and one of the reasons why we did not add
it in this series. We will need dedicated discussions to figure out
something reasonable.

In principle, however, by capping the utilization of tasks and their
CPUs we can aim at a way to remain in energy_aware mode, i.e. below
the tipping point, and thus with load-balancing disabled.

Utilization clamping can be used to bias the CPUs selection from the
EA code paths. Other mechanisms, e.g. bandwidth control, can also be
exploited to keep CPU utilization under control.

> > Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> > something we don't care to specify exactly, since:
> >
> > - schedutil will top the util request to the next frequency anyway
> >
> > - capacity by itself is a loosely defined metric, since it's usually
> > measured considering a specific kind of instructions mix, which
> > can be very different from the actual instruction mix (e.g. integer
> > vs floating point)
>
> Sure, things like pure SIMD workloads can skew things pretty bad, but on
> average it should not drastically change the overall shape of the curve
> and the knee point should not move around a lot.

There can be quite consistent skews based not just on instruction
type but also on "app phases", e.g. memory-vs-cpu bound.
It's also true that this is more likely a shift up/down along the
capacity axis than a change in shape.

However, I think my point here is that the actual capacity of each OPP
can be very different wrt the one reported by the EM.

> > - certain platforms don't even expose OPPs, but just "performance
> > levels"... which ultimately are a "percentage"
>
> Well, the whole capacity thing is a 'percentage', it's just that 1024 is
> much nicer to work with (for computers) than 100 is (also it provides a
> wee bit more resolution).

Right, indeed in kernel-space we still use 1024 based values, we just
convert them at the syscall interface...

> But even the platforms with hidden OPPs (can) have knee points, and if
> you measure their power to capacity curve you can place a workload
> around the knee by capping capacity.

... still it's difficult to give a precise definition of knee point,
unless you know about platforms which have a sharp change in energy
efficiency.

The only cases we know about are those where:

A) multiple frequencies use the same voltage, e.g.


[ASCII plot, layout lost in extraction: "Energy efficiency" (y axis)
vs. "Frequency" (x axis), with three marked frequencies L, M and H.
The range from L to M is labelled "Same V" and the range from M to H
is labelled "Increasing V".]

B) there is a big frequency gap between low frequency OPPs and high
frequency OPPs, e.g.

[ASCII plot, layout lost in extraction: "Energy efficiency" (y axis)
vs. "Frequency" (x axis), with three marked frequencies L, M and H.
A few OPPs are clustered between L and M, followed by a large gap up
to a single OPP at H.]


In case A, all the OPPs left of M are dominated by M in terms
of energy efficiency and normally they should never be used, unless
you are under thermal constraints and still want to keep your code
running, even if at a lower rate and energy efficiency.
At this point, however, you have already invalidated all the OPPs right
of M and, on the remaining ones, you still struggle to define the knee
point.

In case B... I'm wondering if such a configuration even makes sense ;)
Is there really some platform out there with such a "non homogeneously
distributed" set of available frequencies ?

> But yes, this gets trick real fast :/
>
> > - there are so many rounding errors around on utilization tracking
> > and it aggregation that being exact on an OPP if of "relative"
> > importance
>
> I'm not sure I understand that argument; sure the measurement is subject
> to 'issues', but if we hard clip the result, that will exactly match the
> fixed points for OPP selection. Any issues on the measurement are lost
> after clipping.

My point is just that the clipping value does not need to be specified
as the actual capacity of an OPP.

If we give a "performance level" P then schedutil already knows how to
translate it into an OPP: it will pick the min capacity OPP whose
capacity is greater than P.

However, given that the knee definition is fuzzy anyway, selecting
either that OPP or the previous one should rarely make a big
difference from a "knee" standpoint, should it ?

> > Do you see specific use-cases where an exact OPP capacity is much
> > better then a percentage value ?
>
> If I don't have algorithmic optimization available, hand selecting an
> OPP is the 'obvious' thing to do.

Agree, but that's still possible by passing in a percentage value.

> > Of course there can be scenarios in which wa want to clamp to a
> > specific OPP. But still, why should it be difficult for a platform
> > integrator to express it as a close enough percentage value ?
>
> But why put him through the trouble of finding the capacity value in the
> EAS exposed data, converting that to a percentage that will work and
> then feeding it back in.
>
> I don't see the point or benefit of percentages, there's nothing magical
> about 1/100, _any_ other fraction works exactly the same.

If you want to pass in an exact capacity value, you still need to look
at the EAS exposed data, don't you?

I think the main problem is picking the right capacity, more than
converting it to a percentage.

Once you have a capacity, let's say 654, its conversion to a percentage
is as simple as dropping the units, i.e. 654 ~= 65%, which is quite sure
to pick the 654 capacity OPP anyway.


> So why bother changing it around?

For two main reasons:

1) to expose a more generic interface to userspace:
a "performance percentage" is more generic than a "capacity value",
while we keep translating to and using a 1024 based value in kernel space

2) to reduce the configuration space:
it quite likely doesn't make sense to use, in the same system, 100
different clamp values... it makes even less sense to use 1024
different clamp values, does it?

> > > The EAS thing might have these around; but I forgot if/how they're
> > > exposed to userspace (I'll have to soon look at the latest posting).
> >
> > The new "Energy Model Management" framework can certainly be use to
> > get the list of OPPs for each frequency domain. IMO this could be
> > used to identify the maximum number of clamp groups we can have.
> > In this case, the discretization patch can translate a generic
> > percentage clamp into the closest OPP capacity...
> >
> > ... but to me that's an internal detail which I'm not convinced we
> > don't need to expose to user-space.
> >
> > IMHO we should instead focus just on defining a usable and generic
> > userspace interface. Then, platform specific tuning is something
> > user-space can do, either offline or on-line.
>
> The thing I worry about is how do we determine the value to put in in
> the first place.

I agree that's the main problem, but I also think that's outside of
the kernel-space mechanism.

Isn't all that quite similar to the configuration of DEADLINE tasks?

Given a DL task solving a certain issue, you can certainly define its
deadline (or period) in a completely platform-independent way, by just
looking at the problem space. But when it comes to the run-time, we
always have to profile the task in a platform specific way.

In the DL case, we figure out a bandwidth requirement from user-space.

In the clamping case, it's still user-space that needs to figure
out an optimal clamp value, while considering your performance and
energy efficiency goals. This can be based on an automated profiling
process which comes up with "optimal" clamp values.

In the DL case, we are perfectly fine to have a running time
parameter, although we don't give any precise and deterministic
formula to quantify it. It's up to user-space to figure out the
required running time for a given app and platform.
It's also not unrealistic that you need to close a control loop
with user-space to keep updating this requirement.

Why can't the same hold for clamp values?

> How are expecting people to determine what to put into the interface?
> Knee points, little capacity, those things make 'obvious' sense.

IMHO, they make "obvious" sense from a kernel-space perspective
exactly because they are implementation details and platform specific
concepts.

At the same time, I struggle to provide a definition of knee point and
I struggle to find a use-case where I can certainly say that a task
should be clamped exactly to the little capacity for example.

I'm more of the opinion that the right clamp value is something a bit
fuzzy and possibly subject to change over time depending on the
specific application phase (e.g. CPU-bound vs memory-bound) and/or
optimization goals (e.g. performance vs energy efficiency).

Here we are thus trying to define and agree on a "generic and abstract"
interface which allows user-space to feed input to kernel-space.
For this purpose, I think platform specific details and/or internal
implementation details are not "a bonus".

> > > But changing the clamp metric to something different than these values
> > > is going to be pain.
> >
> > Maybe I don't completely get what you mean here... are you saying that
> > not using exact capacity values to defined clamps is difficult ?
> > If that's the case why? Can you elaborate with an example ?
>
> I meant changing the unit around, 1/1024 is what we use throughout and
> is what EAS is also exposing IIRC, so why make things complicated again
> and use 1/100 (which is a shit fraction for computers).

Internally, in kernel space, we use 1024 units. It's just the
user-space interface that speaks percentages but, as soon as a
percentage value is used to configure a clamp, it's translated into a
[0..1024] range value.

Is this not an acceptable compromise? We have a generic user-space
interface and an effective/consistent kernel-space implementation.
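
The translation itself could be as simple as the following sketch
(Python, illustrative only; the function name and the round-down
behaviour are my assumptions, not necessarily what the patchset does):

```python
SCHED_CAPACITY_SCALE = 1024  # full-scale capacity used in kernel space


def percent_to_capacity(percent):
    """Map a 0..100 percentage onto the 0..1024 capacity scale."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must be within 0..100")
    # Integer math, rounding down; the rounding mode is an assumption.
    return percent * SCHED_CAPACITY_SCALE // 100
```

For example, 50% becomes 512 and 80% becomes 819, so each percentage
step is roughly 10 capacity units wide.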

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-24 15:57:40

by Peter Zijlstra

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
> On 21-Sep 11:13, Peter Zijlstra wrote:
> > Laptops with active cooling however...
>
> How do you see active cooling playing a role ?
>
> Are you thinking, for example, at reduced fan noise if we remain below
> a certain OPP ?
>
> Are you factoring fans power consumptions into the overall P consumption ?

Nothing as fancy as that; I just figured that with a larger cooling
capacity, you can push chips higher onto that curve past the optimal
IPC/Watt point. Make it go fast etc..

2018-09-24 16:27:43

by Peter Zijlstra

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

> ... still it's difficult to give a precise definition of knee point,
> unless you know about platforms which have a sharp change in energy
> efficiency.
>
> The only cases we know about are those where:
>
> A) multiple frequencies uses the same voltage, e.g.
>
>
> ^ *
> | Energy O
> | efficiency O+
> | O |
> | O* |
> | O** |
> | O** O*** |
> | + O** O**** |
> | | O** O***** |
> | | O** |
> | | + |
> | | Same V | Increasing V |
> +---+----------+----------------------+----------->
> | | | Frequency
> L M H
>
> B) there is a big frequency gap between low frequency OPPs and high
> frequency OPPs, e.g.
>
> O
> ^ **+
> | Energy ** |
> | efficiency ** |
> | ** |
> | ** |
> | ** |
> | ** |
> | ** |
> | O** |
> | O******+ |
> |O******* | |
> | | |
> ++--------------+------------------+------>
> | | | Frequency
> L M H
>
>
> In case A, all the OPPs left of M are dominated by M in terms
> of energy efficiency and normally they should be never used.
> Unless you are under thermal constraints and you still want to keep
> your code running even if at a lower rate and energy efficiency.
> At this point, however, you already invalidated all the OPPs right of
> M and, on the remaining, you still struggle do define the knee point.
>
> In case B... I'm wondering it such a conf even makes sense ;)
> Is there really some platform out there with such a "non homogeneously
> distributed" set of available frequencies ?

Well, the curve is a second or third order polynomial (when V~f -> fV^2
-> f^3), so it shoots up at some point. There's not really anything you
can do about that. But if you're willing to put in active cooling and
lots of energy, you can make it go fast :-)
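
For reference, the scaling argument above spelled out, assuming the
usual CMOS dynamic power model (switched capacitance C, static leakage
ignored) and a voltage that scales linearly with frequency:

```latex
\begin{align*}
P_{\mathrm{dyn}} &= C\,V^{2} f \\
V \propto f \;&\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3} \\
E_{\mathrm{cycle}} &= \frac{P_{\mathrm{dyn}}}{f} = C\,V^{2} \propto f^{2}
\end{align*}
```

Where V stays constant across several frequencies, P grows only
linearly with f and, under this model, the energy per cycle is flat.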

Therefore I was thinking:

> Maybe we can define a threshold
> for a "EE derivative ratio", but it will still be quite arbitrary.

Because up until de/df=.5 we gain more performance than we lose ee. But
I might not have appreciated the fact that when we work with imaginary
cost units that skews the .5.


2018-09-24 17:20:30

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 24-Sep 18:26, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
>
> > ... still it's difficult to give a precise definition of knee point,
> > unless you know about platforms which have a sharp change in energy
> > efficiency.
> >
> > The only cases we know about are those where:
> >
> > A) multiple frequencies uses the same voltage, e.g.
> >

On a side note, the following plots represent ee^-1 or, in the end,
P on the y axis... my bad... but you got the meaning anyway ;)

> >
> > ^ *
> > | Energy O
> > | efficiency O+
> > | O |
> > | O* |
> > | O** |
> > | O** O*** |
> > | + O** O**** |
> > | | O** O***** |
> > | | O** |
> > | | + |
> > | | Same V | Increasing V |
> > +---+----------+----------------------+----------->
> > | | | Frequency
> > L M H
> >
> > B) there is a big frequency gap between low frequency OPPs and high
> > frequency OPPs, e.g.
> >
> > O
> > ^ **+
> > | Energy ** |
> > | efficiency ** |
> > | ** |
> > | ** |
> > | ** |
> > | ** |
> > | ** |
> > | O** |
> > | O******+ |
> > |O******* | |
> > | | |
> > ++--------------+------------------+------>
> > | | | Frequency
> > L M H
> >
> >
> > In case A, all the OPPs left of M are dominated by M in terms
> > of energy efficiency and normally they should be never used.
> > Unless you are under thermal constraints and you still want to keep
> > your code running even if at a lower rate and energy efficiency.
> > At this point, however, you already invalidated all the OPPs right of
> > M and, on the remaining, you still struggle do define the knee point.
> >
> > In case B... I'm wondering it such a conf even makes sense ;)
> > Is there really some platform out there with such a "non homogeneously
> > distributed" set of available frequencies ?
>
> Well, the curve is a second or third order polynomial (when V~f -> fV^2
> -> f^3), so it shoots up at some point. There's not really anything you
> can do about that. But if you're willing to put in active cooling and
> lots of energy, you can make it go fast :-)

Sure... as long as you don't melt the silicon you can push the frequency.

However, if you are going for such aggressive active cooling, perhaps
energy efficiency is already a very low priority goal for you.

> Therefore I was thinking:
>
> > Maybe we can define a threshold
> > for a "EE derivative ratio", but it will still be quite arbitrary.
>
> Because up until de/df=.5 we gain more performance than we loose ee.

You mean up until de < df ?

IOW... the threshold should be de == df => 45deg tangent ?

> But I might not have appreciated the fact that when we work with
> imaginary cost units that skews the .5.

The main skew IMO comes from the fact that the energy efficiency
"tipping point" is very much application / user specific...
and it can also change depending on the usage scenario for the
same user and platform.

--
#include <best/regards.h>

Patrick Bellasi

2018-09-24 17:24:29

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 24-Sep 17:56, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:
> > On 21-Sep 11:13, Peter Zijlstra wrote:
> > > Laptops with active cooling however...
> >
> > How do you see active cooling playing a role ?
> >
> > Are you thinking, for example, at reduced fan noise if we remain below
> > a certain OPP ?
> >
> > Are you factoring fans power consumptions into the overall P consumption ?
>
> Nothing as fancy as that; I just figured that with a larger cooling
> capacity, you can push chips higher onto that curve past the optimal
> IPC/Watt point. Make it go fast etc..

That very concept of "optimal IPC/Watt" point is not something easy to
define considering a monotonic function and without considering the
specific optimization goals.

It really sounds like saying that an LP problem has a unique solution
independently from the optimization function.

Do you think we should "mandate" an optimization function from kernel
space? I'm not saying it does not make sense, but I find it at least a
strong constraint on the implementation.

--
#include <best/regards.h>

Patrick Bellasi

2018-09-25 15:52:37

by Peter Zijlstra

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

> > So why bother changing it around?
>
> For two main reasons:
>
> 1) to expose userspace a more generic interface:
> a "performance percentage" is more generic then a "capacity value"
> while keep translating and using a 1024 based value in kernel space

The unit doesn't make it more or less generic. It's the exact same thing
in the end.

> 2) to reduce the configuration space:
> it quite likely doesn't make sense to use, in the same system, 100
> difference clamp values... it makes even more sense to use 1024
> different clamp values, does it ?

I'd tend to agree with you that 1024 is probably too big a
configuration space, OTOH I also don't want to end up with a "640KB is
enough for everybody" situation.

And 100 really isn't that much better either way around.

> > The thing I worry about is how do we determine the value to put in in
> > the first place.
>
> I agree that's the main problem, but I also think that's outside of
> the kernel-space mechanism.
>
> Is not all that quite similar to DEADLINE tasks configuration?

Well, with DL there are well defined rules for what to put in and what
to then expect.

For this thing, not so much I feel.

> Given a DL task solving a certain issue, you can certainly define its
> deadline (or period) on a completely platform independent way, by just
> looking at the problem space. But when it comes to the run-time, we
> always have to profile the task in a platform specific way.
>
> In the DL case from user-space we figure out a bandwidth requirement.

Most likely, although you can compute it in a number of cases. But yes, it
is always platform specific.

> In the clamping case, it's still the user-space that needs to figure
> our an optimal clamp value, while considering your performance and
> energy efficiency goals. This can be based on an automated profiling
> process which comes up with "optimal" clamp values.
>
> In the DL case, we are perfectly fine to have a running time
> parameter, although we don't give any precise and deterministic
> formula to quantify it. It's up to user-space to figure out the
> required running time for a given app and platform.
> It's also not unrealistic the case you need to close a control loop
> with user-space to keep updating this requirement.
>
> Why the same cannot hold for clamp values ?

The big difference is that if I request (and am granted) a runtime quota
of a given amount, then that is what I'm guaranteed to get.

Irrespective of the amount being sufficient for the work in question --
which is where the platform dependency comes in.

But what am I getting when I set a specific clamp value? What does it
mean to set the value to 80%?

So far the only real meaning is when combined with the EAS OPP data, we
get a direct translation to OPPs. Irrespective of how the utilization is
measured and the capacity:OPP mapping established, once that's set, we
can map a clamp value to an OPP and get meaning.

But without that, it doesn't mean anything much at all. And that is my
complaint. It seems to get presented as: 'random knob that might do
something'. The range it takes as input doesn't change a thing.

> > How are expecting people to determine what to put into the interface?
> > Knee points, little capacity, those things make 'obvious' sense.
>
> IMHO, they make "obvious" sense from a kernel-space perspective
> exactly because they are implementation details and platform specific
> concepts.
>
> At the same time, I struggle to provide a definition of knee point and
> I struggle to find a use-case where I can certainly say that a task
> should be clamped exactly to the little capacity for example.
>
> I'm more of the idea that the right clamp value is something a bit
> fuzzy and possibly subject to change over time depending on the
> specific application phase (e.g. cpu-vs-memory bounded) and/or
> optimization goals (e.g. performance vs energy efficiency).
>
> Here we are thus at defining and agree about a "generic and abstract"
> interface which allows user-space to feed input to kernel-space.
> To this purpose, I think platform specific details and/or internal
> implementation details are not "a bonus".

But unlike DL, which has well specified behaviour and where, once I know
my platform, I can compute a usable value, this doesn't seem to gain
meaning when I know the platform.

Or does it? If you say yes, then we need to be able to correlate to the
platform data that gives it meaning; which would be the OPP states. And
those come with capacity numbers.

> > > > But changing the clamp metric to something different than these values
> > > > is going to be pain.
> > >
> > > Maybe I don't completely get what you mean here... are you saying that
> > > not using exact capacity values to defined clamps is difficult ?
> > > If that's the case why? Can you elaborate with an example ?
> >
> > I meant changing the unit around, 1/1024 is what we use throughout and
> > is what EAS is also exposing IIRC, so why make things complicated again
> > and use 1/100 (which is a shit fraction for computers).
>
> Internally, in kernel space, we use 1024 units. It's just the
> user-space interface that speaks percentages but, as soon as a
> percentage value is used to configure a clamp, it's translated into a
> [0..1024] range value.
>
> Is this not an acceptable compromise? We have a generic user-space
> interface and an effective/consistent kernel-space implementation.

I really don't see how changing the unit changes anything. Either you
want to relate to OPPs and those are exposed in 1/1024 unit capacity
through the EAS files, or you don't and then the knob has no meaning.

And how the heck are we supposed to assign a value for something that
has no meaning?

Again, with DL we ask for time, once I know the platform I can convert
my work into instructions and time and all makes sense.

With this, you seem reluctant to allow us to close that loop. Why is
that? Why not directly relate to the EAS OPPs, because that is directly
what they end up mapping to.

When I know the platform, I can convert my work into instructions and
obtain time, I can convert my clamp into an OPP and time*OPP gives an
energy consumption.
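
As a rough illustration of that chain (work -> time, clamp -> OPP,
time * power -> energy), here is a sketch in Python. The OPP table is
made up, the function name is hypothetical, and it assumes a single
task that runs at exactly the selected OPP for the whole time, with
capacity proportional to frequency:

```python
# Hypothetical OPP table: (frequency in Hz, power in watts), ascending.
OPPS = [(500_000_000, 0.25), (1_000_000_000, 0.75), (2_000_000_000, 3.0)]


def energy_for_work(cycles, clamp, opps=OPPS):
    """Estimate the energy (joules) to run `cycles` at the OPP picked by `clamp`."""
    max_freq = opps[-1][0]
    for freq, power in opps:
        capacity = freq * 1024 // max_freq  # capacity on the 0..1024 scale
        if capacity >= clamp:
            break
    # If no OPP satisfies the clamp, freq/power still hold the highest OPP.
    time_s = cycles / freq
    return time_s * power
```

With these numbers, 10^9 cycles cost 0.5 J at clamp 256, 0.75 J at
clamp 512 and 1.5 J at clamp 1024.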

Why muddle things up and make it complicated?

2018-09-26 10:44:49

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On 25-Sep 17:49, Peter Zijlstra wrote:
> On Mon, Sep 24, 2018 at 04:14:00PM +0100, Patrick Bellasi wrote:

[...]

> Well, with DL there are well defined rules for what to put in and what
> to then expect.
>
> For this thing, not so much I feel.

Maybe you'll prove me wrong, but isn't that already happening for
things like priorities?

When you set a prio for a CFS task you don't really know how much
more/less CPU time a CFS task will get, because it depends on other
tasks prios and tasks from higher prio classes.

The priority is thus a knob which defines an "intended behavior", a
preference, without being "legally binding" like in the case of DL
bandwidth... nevertheless we can still make good use of prios.

[...]

> > In the clamping case, it's still the user-space that needs to figure
> > our an optimal clamp value, while considering your performance and
> > energy efficiency goals. This can be based on an automated profiling
> > process which comes up with "optimal" clamp values.
> >
> > In the DL case, we are perfectly fine to have a running time
> > parameter, although we don't give any precise and deterministic
> > formula to quantify it. It's up to user-space to figure out the
> > required running time for a given app and platform.
> > It's also not unrealistic the case you need to close a control loop
> > with user-space to keep updating this requirement.
> >
> > Why the same cannot hold for clamp values ?
>
> The big difference is that if I request (and am granted) a runtime quota
> of a given amount, then that is what I'm guaranteed to get.

(I think not always... but that's a detail for a different discussion)

> Irrespective of the amount being sufficient for the work in question --
> which is where the platform dependency comes in.
>
> But what am I getting when I set a specific clamp value? What does it
> mean to set the value to 80%

Exactly, that's a good point: what "expectations" can we set for users
based on a given value?

> So far the only real meaning is when combined with the EAS OPP data, we
> get a direct translation to OPPs. Irrespective of how the utilization is
> measured and the capacity:OPP mapping established, once that's set, we
> can map a clamp value to an OPP and get meaning.

If we strictly follow this line of reasoning then we should probably
set a frequency value directly... but still we will not be saying
anything about "expectations".

Given the current patchset, right now we can't do much more than
_biasing_ an OPP selection.
It's actually just a bias, we cannot really guarantee anything to users
based on clamping. For example, if you require util_min=1024 you are
not really guaranteed to run at the maximum capacity,
especially in the current patchset where we are not biasing task
placement.

My fear then is, since we are not really granting/enforcing anything,
why should we base such an interface on an internal implementation
detail and/or platform specific values ?

Why is a slightly more abstract interface so much more confusing?

> But without that, it doesn't mean anything much at all. And that is my
> complaint. It seems to get presented as: 'random knob that might do
> something'. The range it takes as input doesn't change a thing.

Can't the "range" help in defining the perceived expectations?

If we use a percentage, IMHO it's clearer that it's a _relative_ and
_not mandatory_ interface:

relative: because, for example, a 50% capped task is expected
(normally) to run slower than a 50% boosted task, although
we don't know, or care to know, the exact frequency or cpu
capacity

not mandatory: because, for example, the 50% boosted task is not
guaranteed to always run at an OPP whose capacity is
not smaller than 512

> > > How are expecting people to determine what to put into the interface?

The same way people define priorities. Which means, with increasing
level of complexity:

a) by guessing (or using the default, i.e. no clamps)

b) by making an educated choice
i.e. profiling your app to pick the value which makes more sense
given the platform and a set of optimization goals

c) by controlling in a closed feedback loop
i.e. by periodically measuring some app specific power/perf metric
and tuning the clamp values to close a gap with respect to a given
power/perf
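
Option c) could look like the following sketch (Python, illustrative
only). The `measure` callback stands for whatever app specific metric
is available, e.g. frames per second; it, the function names, the fixed
step and the starting value are all assumptions of mine:

```python
def next_clamp(clamp, measured, target, step=32):
    """One iteration: raise the clamp if we miss the target, lower it if we overshoot."""
    if measured < target:
        clamp += step
    elif measured > target:
        clamp -= step
    # Keep the value within the 0..1024 range.
    return max(0, min(1024, clamp))


def tune(measure, target, clamp=512, iterations=50):
    """Run the loop; `measure(clamp)` is a hypothetical app specific metric."""
    for _ in range(iterations):
        clamp = next_clamp(clamp, measure(clamp), target)
    return clamp
```

Note that the loop only needs the knob to be monotonic and consistent;
it works the same whether the scale is read as a capacity or as a
percentage.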

I think that the complexity of both b) and c) is not really impacted
by the scale/range used... but it also does not benefit much in
"clarity" if we use capacity values instead of percentages.

> > > Knee points, little capacity, those things make 'obvious' sense.

> > IMHO, they make "obvious" sense from a kernel-space perspective
> > exactly because they are implementation details and platform specific
> > concepts.
> >
> > At the same time, I struggle to provide a definition of knee point and
> > I struggle to find a use-case where I can certainly say that a task
> > should be clamped exactly to the little capacity for example.
> >
> > I'm more of the idea that the right clamp value is something a bit
> > fuzzy and possibly subject to change over time depending on the
> > specific application phase (e.g. cpu-vs-memory bounded) and/or
> > optimization goals (e.g. performance vs energy efficiency).

What do you think about this last sentence of mine above?

> > Here we are thus at defining and agree about a "generic and abstract"
> > interface which allows user-space to feed input to kernel-space.
> > To this purpose, I think platform specific details and/or internal
> > implementation details are not "a bonus".
>
> But unlike DL, which has well specified behaviour, and when I know my
> platform I can compute a usable value. This doesn't seem to gain meaning
> when I know the platform.
>
> Or does it?

... or we don't really care about a platform specific meaning.

> If you say yes, then we need to be able to correlate to the platform
> data that gives it meaning; which would be the OPP states. And those
> come with capacity numbers.

The meaning, strictly speaking, should be just:

I figured out (somehow) that if I set value X my app is now working
as expected in terms of the acceptable power/performance
optimization goal.

I know that value X could require tuning over time depending on
possible changes in platform status or tasks composition.

[...]

> > Internally, in kernel space, we use 1024 units. It's just the
> > user-space interface that speaks percentages but, as soon as a
> > percentage value is used to configure a clamp, it's translated into a
> > [0..1024] range value.
> >
> > Is this not an acceptable compromise? We have a generic user-space
> > interface and an effective/consistent kernel-space implementation.
>
> I really don't see how changing the unit changes anything. Either you
> want to relate to OPPs and those are exposed in 1/1024 unit capacity
> through the EAS files, or you don't and then the knob has no meaning.
>
> And how the heck are we supposed to assign a value for something that
> has no meaning.
>
> Again, with DL we ask for time, once I know the platform I can convert
> my work into instructions and time and all makes sense.
>
> With this, you seem reluctant to allow us to close that loop. Why is
> that? Why not directly relate to the EAS OPPs, because that is directly
> what they end up mapping to.

I'm not really fighting against that: if people find the usage of
capacities more intuitive, we can certainly go for them.

My reluctance is really just about tossing out a possibly different
perspective, considering we are adding a user-space API which
certainly sets "expectations" for users.

Provided the concept is clear that it's a _non mandatory_ and
_relative_ API, then any scale should be ok... I just personally
prefer a percentage for the reasons described above.

In both cases, whoever uses the interface can certainly close a
loop... especially an automated profiling or run-time control loop.

> When I know the platform, I can convert my work into instructions and
> obtain time, I can convert my clamp into an OPP and time*OPP gives an
> energy consumption.

What you describe makes sense, it can definitely help the human user
to set a value. I'm just not convinced this will be the main usage of
such an interface... or that a single value could fit all the
run-time scenarios.

I think in real workload scenarios we will have so many tasks, some
competing, others cooperating, that it will not be possible to do the
exercise you describe above.

What we will do instead will probably be to close a profiling/control
loop from user-space and let the system figure out the optimal
value. In these cases, the platform details are just "informative"
and what we need is really just "random knob which can do
something"... provided that "something" is a consistent mapping of the
knob values on certain scheduler actions.

> Why muddle things up and make it complicated?

I'll not push this further, really: if you are strongly of the
opinion we should use capacity, I'll drop percentages in the next v5.

Otherwise, if you, like me, are still a bit unsure about what could
be the best API, we can hope for more feedback from other folks...
maybe I can ping someone in CC?

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-09-26 17:52:10

by Patrick Bellasi

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

Hi Peter,

On 21-Sep 11:13, Peter Zijlstra wrote:
> On Mon, Sep 17, 2018 at 01:27:23PM +0100, Patrick Bellasi wrote:

[...]

While going back to one of our previous conversations, I noticed these
comments:

> > Thus, the capacity of little CPUs, or the exact capacity of an OPP, is
> > something we don't care to specify exactly, since:

[...]

> > - certain platforms don't even expose OPPs, but just "performance
> > levels"... which ultimately are a "percentage"
>
> Well, the whole capacity thing is a 'percentage', it's just that 1024 is
> much nicer to work with (for computers) than 100 is (also it provides a
> wee bit more resolution).

Above I was referring to Intel's HWP support [1],
specifically to the:

Ability of HWP to allow software to set an energy/performance
preference hint in the IA32_HWP_REQUEST MSR.

which is detailed in section "14.4.4 Managing HWP".

The {Minimum,Maximum}_Performance registers represent what I consider
the best semantics for UtilClamp.

In the HWP case we use 256 range values, and thus for UtilClamp as
well it would make more sense to use a 1024 scale as suggested by
Peter, even just to have a bit more room, while still considering the
clamp values _as a percentage_, with just one decimal digit of
resolution.

I think the important bit here is the abstraction between what
the user can require and what the platform can provide.

If HWP does not allow the OS to pinpoint a specific frequency, why
should a user-space interface be designed to pinpoint a specific
capacity ?

Can we find here a common ground around the idea that UtilClamp values
represent a 1024 range percentage of minimum/maximum performance
expected by a task ?

It would be really nice to know what Rafael thinks about all that...

Cheers Patrick

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf

--
#include <best/regards.h>

Patrick Bellasi

2018-09-27 10:02:25

by Quentin Perret

Subject: Re: [PATCH v4 14/16] sched/core: uclamp: request CAP_SYS_ADMIN by default

On Tuesday 25 Sep 2018 at 17:49:56 (+0200), Peter Zijlstra wrote:
> I really don't see how changing the unit changes anything. Either you
> want to relate to OPPs and those are exposed in 1/1024 unit capacity
> through the EAS files, or you don't and then the knob has no meaning.

FWIW, with the latest versions of the EAS series, we don't expose the
capacity of the various OPPs directly any more. You have 'frequency',
'power' and a more abstract thing called 'cost' (which is useful for
energy computations) in the sysfs files of the EM.

We decided to remove the 'capacity' file to simplify things quite a bit,
and because it wasn't helping much. I'm talking about this discussion w/
Dietmar on v4:

https://lore.kernel.org/lkml/[email protected]/

But it is also true that we could add back that 'capacity' file if it
makes sense for uclamp. It's just an additional show() function in
energy_model.c, so no problem on that side per se.

The only problem is if you want to use uclamp w/o EAS, I guess ...

Thanks,
Quentin

2018-09-27 10:24:30

by Quentin Perret

Subject: Re: [PATCH v4 06/16] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

On Friday 14 Sep 2018 at 14:57:12 (+0100), Patrick Bellasi wrote:
> On 14-Sep 15:36, Peter Zijlstra wrote:
> > On Fri, Sep 14, 2018 at 02:19:19PM +0100, Patrick Bellasi wrote:
> > > On 14-Sep 11:32, Peter Zijlstra wrote:
> >
> > > > Should that not be:
> > > >
> > > > util = clamp_util(rq, cpu_util_cfs(rq));
> > > >
> > > > Because if !util might we not still want to enforce the min clamp?
> > >
> > > If !util CFS tasks should have been gone since a long time
> > > (proportional to their estimated utilization) and thus it probably
> > > makes sense to not affect further energy efficiency for tasks of other
> > > classes.
> >
> > I don't remember what we do for util for new tasks; but weren't we
> > talking about setting that to 0 recently? IIRC the problem was that if
> > we start at 1 with util we'll always run new tasks on big cores, or
> > something along those lines.

I guess you're referring to that discussion ?

https://lore.kernel.org/lkml/CAKfTPtDcoySXK0fBkDNy4wp1vsRxmiuAGT3CDZBh6Vnwyep2BA@mail.gmail.com/

If yes, then the outcome was that we'll see later what we do with new
tasks :-)

Setting the util of new tasks to 0 surely can help power, but that can
also harm performance pretty badly, I think. You'd be stuck at min freq
for a while w/ sugov in case of a fork bomb for example.

> Mmm.. could have been in a recent discussion with Quentin, but I
> think I've missed it. I know we have something similar on Android for
> similar reasons.

I don't think PELT is different in Android (we still set the initial
util of new tasks as half of the spare cap of the CPU), but there are
other tweaks that influence the first task placement, though. And WALT
sets the util of new tasks to 0 IIRC (but I'm not sure it's relevant
since its signal ramps up a lot faster than PELT's).
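
To illustrate the "half of the spare cap" rule mentioned above, a
deliberately simplified sketch in Python (the function name is
hypothetical and the real kernel code takes more details into account):

```python
def initial_util(cpu_capacity, cpu_util):
    """Initial utilization of a new task: half of the CPU's spare capacity."""
    # A fully (or over-) utilized CPU has no spare capacity left.
    spare = max(cpu_capacity - cpu_util, 0)
    return spare // 2
```

On an idle CPU of capacity 1024 a new task would thus start at 512,
i.e. well above 0 but without immediately looking like a "big" task.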

Thanks,
Quentin