Hi all, this is a respin of:
https://lore.kernel.org/lkml/[email protected]/
which has been rebased on v4.19.
Thanks for all the valuable comments collected so far!
This version will be presented and discussed at the upcoming LPC.
Meanwhile, any comments and feedback are more than welcome!
Cheers,
Patrick
Main changes in v5
==================
.:: Better split core bits from task groups support
---------------------------------------------------
As per Tejun request:
https://lore.kernel.org/lkml/[email protected]/
we now have _all and only_ the core scheduler bits at the beginning of
the series. The first 10 patches provide:
- per task utilization clamping API,
via sched_setattr
- system default clamps for both FAIR and RT tasks,
via /proc/sys/kernel/sched_uclamp_util_{min,max}
- schedutil integration for both FAIR and RT tasks
cgroups v1 and v2 support comes as an extension of the CPU controller in the
last 5 patches of this series. These bits are kept together with the core
scheduler ones to give a better view of the overall solution we are proposing.
Moreover, it helps to ensure that the core data structures and concepts
properly fit cgroups usage too.
.:: Spinlock removal in favour of atomic operations
---------------------------------------------------
As suggested by Peter:
https://lore.kernel.org/lkml/[email protected]/
the main data structures, i.e. clamp maps and clamp groups, have been
re-defined as bitfields mainly to:
a) compress those data structures to use less memory
b) avoid usage of spinlocks in favor of atomic operations
As an additional bonus, some spare bits can now be dedicated to tracking and
flagging special conditions which were previously encoded in a more confusing
way. The final code looks (hopefully) much cleaner and easier to read.
.:: Improve consistency and enforce invariant conditions
--------------------------------------------------------
We now ensure that every scheduling entity always has a valid clamp group and
value assigned. This made it possible to remove the previously confusing usage
of the UCLAMP_NOT_VALID=-1 special value and various scattered checks.
Data types for clamp groups and values are consistently defined and always used
as "unsigned", while spare bits are used whenever a special condition still
needs to be tracked.
.:: Use of bucketization since the beginning
--------------------------------------------
In the previous version we added a couple of patches to deal with the limited
number of clamp groups. Those patches introduced the idea of using buckets for
the per-CPU clamp groups used to refcount RUNNABLE tasks. However, as pointed
out by Peter:
https://lore.kernel.org/lkml/[email protected]/
the previous implementation was misleading and it also introduced some checks
which are not really required, e.g. a privileged API to request a clamp, which
should instead always be possible.
Since the fundamental idea of bucketization still seems to be sound and
acceptable, those bits have been used from the start of this patch-set.
This made it possible to simplify the overall series, thanks to the removal of
code previously required just to deal with the limited number of clamp groups.
The new implementation should now match what Peter proposed above,
specifically:
a) we no longer need to search for all the required groups before actually
refcounting them. A clamp group is now guaranteed to always be available for
each possible requested clamp value.
b) the userspace APIs to set scheduling entity specific clamp values are no
longer privileged. Userspace can always ask for a clamp value and the system
will always assign it to the most appropriate "effective" clamp group
which matches all the task group and system default clamp constraints.
.:: Series Organization
-----------------------
The series is organized into these main sections:
- Patches [01-08]: Per task (primary) API
- Patches [09-10]: Schedutil integration for CFS and RT tasks
- Patches [11-15]: Per task group (secondary) API
Newcomer's Short Abstract (Updated)
===================================
The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [1].
When the schedutil cpufreq governor is in use, the utilization signal also
allows the Linux scheduler to drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.
However, the current translation of utilization values into a frequency
selection is pretty simple: we just go to max for RT tasks or to the minimum
frequency which can accommodate the utilization of DL+FAIR tasks.
Utilization, instead, is of limited use for task placement, since its value
alone is not enough to properly describe what the _expected_ power/performance
behavior of each task really is from a userspace standpoint.
In general, for both RT and FAIR tasks we can aim at better task placement and
frequency selection policies if we take hints coming from user-space into
consideration.
Utilization clamping is a mechanism which "clamps" (i.e. filters) the
utilization generated by RT and FAIR tasks within a range defined by
user-space. The clamped utilization value can then be used, for example, to
enforce a minimum and/or maximum frequency depending on which tasks are
currently active on a CPU.
The main use-cases for utilization clamping are:
- boosting: better interactive response for small tasks which
are affecting the user experience.
Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). In this case, from the task
utilization alone the scheduler does not have a complete view of what the task
requirements are and, if it's a small utilization task, it keeps selecting a
more energy efficient CPU, with smaller capacity and lower frequency, thus
increasing the overall time required to complete the task's activations.
- capping: increase energy efficiency for background tasks not directly
affecting the user experience.
Since running on a lower capacity CPU at a lower frequency is in general
more energy efficient, capping the utilization considered for certain (maybe
big) tasks, whenever completion time is not a main goal, can have positive
effects on both energy consumption and thermal headroom.
Moreover, this feature also allows making RT tasks more energy friendly on
mobile systems, where running them on high capacity CPUs and at the maximum
frequency is not strictly required.
From these two use-cases it's worth noticing that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler on tasks placement decisions.
Utilization is (also) a task specific property which is used by the scheduler
to know how much CPU bandwidth a task requires, at least as long as there is
idle time.
Thus, the utilization clamp values, defined either per-task or per-taskgroup,
can be used to represent tasks to the scheduler as being bigger (or smaller)
than what they really are.
Utilization clamping thus ultimately enables interesting additional
optimizations, especially on asymmetric capacity systems like Arm
big.LITTLE and DynamIQ CPUs, where:
- boosting: small/foreground tasks are preferably scheduled on
higher-capacity CPUs where, despite being less energy efficient, they are
expected to complete faster.
- capping: big/background tasks are preferably scheduled on low-capacity CPUs
where, being more energy efficient, they can still run but save power and
thermal headroom for more important tasks.
This additional usage of utilization clamping is not presented in this series
but it's an integral part of the EAS feature set, of which [1] is one of the
main components. A solution similar to utilization clamping, namely SchedTune, is
already used on Android kernels to bias both 'frequency selection' and 'task
placement'.
This series provides the foundation bits to add similar features to mainline
while focusing, for the time being, just on schedutil integration.
[1] https://lore.kernel.org/lkml/[email protected]/
Detailed Changelog
==================
Changes in v5:
Message-ID: <[email protected]>
- use bitfields and atomic_long_cmpxchg() operations to both compress
  the clamp maps and avoid the usage of spinlocks
- remove enforced __cacheline_aligned_in_smp on uclamp_map since it's
accessed from the slow path only and we don't care about performance
- better describe the usage of uclamp_map::se_lock
Message-ID: <[email protected]>
- remove inline from uclamp_group_{get,put}() and __setscheduler_uclamp()
- set lower/upper bounds at the beginning of __setscheduler_uclamp()
- avoid usage of pr_err from unprivileged syscall paths
in __setscheduler_uclamp(), replaced by ratelimited version
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove/limit usage of UCLAMP_NOT_VALID whenever not strictly required
Message-ID: <[email protected]>
- allow sched_setattr() syscall to sleep on mutex
- fix return value for successful uclamp syscalls
Message-ID: <CAJuCfpF36-VZm0JVVNnOnGm-ukVejzxbPhH33X3z9gAQ06t9gQ@mail.gmail.com>
- reorder conditions in uclamp_group_find() loop
- use uc_se->xxx in uclamp_fork()
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove not required check for (group_id == UCLAMP_NOT_VALID)
in uclamp_cpu_put_id
- remove not required uclamp_task_affects() since now all tasks always
have a valid clamp group assigned
Message-ID: <20180912174456.GJ1413@e110439-lin>
- use bitfields to compress uclamp_group
Message-ID: <[email protected]>
- added patch 02/15 which allows changing clamp values without affecting
  the current policy
Message-ID: <[email protected]>
- add a comment to justify the assumptions on util clamping for FAIR tasks
Message-ID: <[email protected]>
- removed uclamp_value and use inline access to data structures
Message-ID: <20180914135712.GQ1413@e110439-lin>
- the (unlikely(val == UCLAMP_NOT_VALID)) check is no longer required
  since we now ensure we always have a valid value configured
Message-ID: <20180912125133.GE1413@e110439-lin>
- make more clear the definition of cpu.util.min.effective
- small typos fixed
Others:
- renamed uclamp_round into uclamp_group_value to better represent
what this function returns
- reduce usage of alias local variables whenever the global ones can
still be used without affecting code readability
- consistently use "unsigned int" for both clamp_id and group_id
- fixup documentation
- reduced usage of inline comments
- use UCLAMP_GROUPS to track (CONFIG_UCLAMP_GROUPS_COUNT+1)
- rebased on v4.19
Changes in v4:
Message-ID: <20180809152313.lewfhufidhxb2qrk@darkstar>
- implements the idea discussed in this thread
Message-ID: <[email protected]>
- remove not required default setting
- fixed some tabs/spaces
Message-ID: <[email protected]>
- replace/rephrase "bandwidth" references to use "capacity"
- better stress that this do not enforce any bandwidth requirement
but "just" give hints to the scheduler
- fixed some typos
Message-ID: <[email protected]>
- add uclamp_exit_task() to release clamp refcount from do_exit()
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN but beautify a bit that code
- keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
- add another WARN on the unexpected condition of releasing a refcount
from a CPU which has a lower clamp value active
Message-ID: <[email protected]>
- move uclamp_enabled at the top of sched_class to keep it on the same
cache line of other main wakeup time callbacks
Message-ID: <20180816132249.GA2960@e110439-lin>
- inline uclamp_task_active() code into uclamp_task_update_active()
- get rid of the now unused uclamp_task_active()
Message-ID: <20180816172016.GG2960@e110439-lin>
- ensure to always reset clamp holding on wakeup from IDLE
Message-ID: <CAKfTPtC2adLupg7wy1JU9zxKx1466Sza6fSCcr92wcawm1OYkg@mail.gmail.com>
- use *rq instead of cpu for both uclamp_util() and uclamp_value()
Message-ID: <20180816135300.GC2960@e110439-lin>
- remove uclamp_value() which is never used outside CONFIG_UCLAMP_TASK
Message-ID: <20180816140731.GD2960@e110439-lin>
- add ".effective" attributes to the default hierarchy
- reuse already existing:
task_struct::uclamp::effective::group_id
instead of adding:
task_struct::uclamp_group_id
to back annotate the effective clamp group in which a task has been
refcounted
Message-ID: <20180820122728.GM2960@e110439-lin>
- fix unwanted reset of clamp values on refcount success
Other:
- by default all tasks have a UCLAMP_NOT_VALID task specific clamp
- always use:
p->uclamp[clamp_id].effective.value
to track the actual clamp value the task has been refcounted into.
This matches with the usage of
p->uclamp[clamp_id].effective.group_id
- allow to call uclamp_group_get() without a task pointer, which is
used to refcount the initial clamp group for all the global objects
(init_task, root_task_group and system_defaults)
- ensure (and check) that all tasks have a valid group_id at
uclamp_cpu_get_id()
- rework uclamp_cpu layout to better fit into just 2x64B cache lines
- fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
- init uclamp for the init_task and refcount its clamp groups
- add uclamp specific fork time code into uclamp_fork
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- enable uclamp only for FAIR tasks, RT class will be enabled only
by a following patch which also integrate the class to schedutil
- define uclamp_maps ____cacheline_aligned_in_smp
- in uclamp_group_get() ensure to include uclamp_group_available() and
uclamp_group_init() into the atomic section defined by:
uc_map[next_group_id].se_lock
- do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task
which is also not needed since refcounting is already guarded by
the uc_map[group_id].se_lock spinlock
- consolidate init_uclamp_sched_group() into init_uclamp()
- refcount root_task_group's clamp groups from init_uclamp()
- small documentation fixes
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
- remove not necessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() failing if both clamps are required but
there is no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
- add explicit calls to uclamp_group_find()
which is no longer part of uclamp_group_get()
- fixed a not required override
- fixed some typos in comments and changelog
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
- added new patches: 09 and 12
Other changes:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not required at this stage, where we have only per-task
clamping support. It will be introduced later when cgroup support is
added.
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system wide API seems of no immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Other changes:
- improved documentation to make more explicit some concepts
- set UCLAMP_GROUPS_COUNT=2 by default
which allows fitting all the hot-path CPU clamps data into a single cache
line while still supporting up to 2 different {min,max}_util clamps.
- use -ERANGE as range violation error
- add attributes to the default hierarchy as well as the legacy one
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values,
i.e. tasks running on a TG are only allowed to demote themselves.
- patches re-ordering in top-down way
- rebased on v4.18-rc4
Patrick Bellasi (15):
sched/core: uclamp: extend sched_setattr to support utilization
clamping
sched/core: make sched_setattr able to tune the current policy
sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
sched/core: uclamp: add CPU's clamp groups refcounting
sched/core: uclamp: update CPU's refcount on clamp changes
sched/core: uclamp: enforce last task UCLAMP_MAX
sched/core: uclamp: add clamp group bucketing support
sched/core: uclamp: add system default clamps
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
sched/cpufreq: uclamp: add utilization clamping for RT tasks
sched/core: uclamp: extend CPU's cgroup controller
sched/core: uclamp: propagate parent clamps
sched/core: uclamp: map TG's clamp values into CPU's clamp groups
sched/core: uclamp: use TG's clamps to restrict TASK's clamps
sched/core: uclamp: update CPU's refcount on TG's clamp changes
Documentation/admin-guide/cgroup-v2.rst | 46 +
include/linux/sched.h | 78 ++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/task.h | 6 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 11 +-
include/uapi/linux/sched/types.h | 67 +-
init/Kconfig | 63 ++
init/init_task.c | 1 +
kernel/exit.c | 1 +
kernel/sched/core.c | 1112 ++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 31 +-
kernel/sched/fair.c | 4 +
kernel/sched/features.h | 5 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 121 ++-
kernel/sysctl.c | 16 +
17 files changed, 1553 insertions(+), 30 deletions(-)
--
2.18.0
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can be translated into proper
decisions for both task placement and frequency selection.
Other classes have a more simplified model which is essentially based on
the relatively simple concept of POSIX priorities.
Such a simple priority based model, however, does not allow exploiting
some of the most advanced features of the Linux scheduler like, for
example, driving frequency selection via the schedutil cpufreq
governor. However, also for non SCHED_DEADLINE tasks, it's still
interesting to define task properties which can be used to better
support certain scheduler decisions.
Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization for a task.
This will allow implementing a more advanced per-task frequency control
mechanism which is not based just on a "passive" measured task
utilization but on a more "proactive" approach. For example, it could be
possible to boost interactive tasks, thus getting better performance, or
cap background tasks, thus being more energy/thermal efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq
powersave, performance and userspace governors but with much finer
grained and per-task control.
Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr, a syscall which already
allows defining task specific properties for different scheduling
classes. Specifically, a new pair of attributes allows specifying a
minimum and maximum utilization which the scheduler should consider for
a task.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Message-ID: <[email protected]>
- remove not required default setting
- fixed some tabs/spaces
Message-ID: <[email protected]>
- replace/rephrase "bandwidth" references to use "capacity"
- better stress that this does not enforce any bandwidth requirement
  but "just" gives hints to the scheduler
- fixed some typos
Others:
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- move at the head of the series
As discussed at OSPM, using a [0..SCHED_CAPACITY_SCALE] range seems to
be acceptable. However, an additional patch has been added at the end of
the series which introduces a simple abstraction to use a more
generic [0..100] range.
At OSPM we also discarded the idea of "recycling" the usage of
sched_runtime and sched_period, which would have made the API too
complex for limited benefit.
---
include/linux/sched.h | 13 +++++++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 67 +++++++++++++++++++++++++++-----
init/Kconfig | 21 ++++++++++
init/init_task.c | 5 +++
kernel/sched/core.c | 39 +++++++++++++++++++
6 files changed, 139 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..880a0c5c1f87 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,14 @@ struct vtime {
u64 gtime;
};
+enum uclamp_id {
+ UCLAMP_MIN = 0, /* Minimum utilization */
+ UCLAMP_MAX, /* Maximum utilization */
+
+ /* Utilization clamping constraints count */
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -649,6 +657,11 @@ struct task_struct {
#endif
struct sched_dl_entity dl;
+#ifdef CONFIG_UCLAMP_TASK
+	/* Utilization clamp values for this task */
+ int uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..c27d6e81517b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_UTIL_CLAMP 0x08
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_UTIL_CLAMP)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..fde7301ed28c 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -9,6 +9,7 @@ struct sched_param {
};
#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
/*
* Extended scheduling parameters data structure.
@@ -21,8 +22,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +60,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +73,30 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the utilization which
+ * should be expected by a task. These attributes allow to inform the
+ * scheduler about the utilization boundaries within which it is expected to
+ * schedule the task. These boundaries are valuable hints to support scheduler
+ * decisions on both task placement and frequencies selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. Thus, for
+ * example, a 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger than 0 is more likely to be
+ * scheduled on a CPU which has a capacity big enough to fit the specified
+ * minimum utilization value.
+ * A task with a max utilization value smaller than 1024 is more likely to be
+ * scheduled on a CPU which does not necessarily have more capacity than the
+ * specified max utilization value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +114,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..738974c4f628 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -613,6 +613,27 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool
+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max CPU
+ utilization which is allowed for RUNNABLE tasks.
+ The max utilization allows to request a maximum frequency a task should
+ use, while the min utilization allows to request a minimum frequency a
+ task should use.
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+endmenu
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..5bfdcc3fb839 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/topology.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -91,6 +92,10 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp[UCLAMP_MIN] = 0,
+ .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad97f3ba5ec5..3701bb1e6698 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -716,6 +716,28 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}
+#ifdef CONFIG_UCLAMP_TASK
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
+ p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+
+ return 0;
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2320,6 +2342,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);
+#ifdef CONFIG_UCLAMP_TASK
+ p->uclamp[UCLAMP_MIN] = 0;
+ p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif
+
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -4215,6 +4242,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4721,6 +4755,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);
+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN];
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+#endif
+
rcu_read_unlock();
retval = sched_read_attr(uattr, &attr, size);
--
2.18.0
Currently, sched_setattr mandates that a policy is always specified.
Since utilization clamp attributes could apply across different
scheduling policies (i.e. all but SCHED_DEADLINE), this also requires
always knowing which policy a task has before changing its clamp values.
This is not just cumbersome, it's also racy: we cannot be sure that a
task's policy has not been changed between reading it and the actual
clamp value update. Sometimes, however, this is exactly the use-case:
we want to change the clamps without affecting the policy.
Let's fix this by adding an additional attribute flag
(SCHED_FLAG_TUNE_POLICY) which, when specified, forces the use of the
task's current policy. This is done by re-using the SETPARAM_POLICY
mechanism we already have for the sched_setparam syscall, thus extending
its usage to the non-POSIX sched_setattr while not exposing that
internal concept to user-space.
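As a sketch of the intended user-space usage, assuming this patch is
applied: the flag values and the sched_attr layout below follow this
series (they are not part of the mainline UAPI), and the struct is
defined locally as the sched_setattr(2) manpage suggests, since glibc
does not export it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values from this patch series: not part of the mainline UAPI. */
#define SCHED_FLAG_TUNE_POLICY	0x08
#define SCHED_FLAG_UTIL_CLAMP	0x10

/* Minimal sched_attr mirror, following the sched_setattr(2) manpage,
 * extended with the util clamp fields added by this series. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Fill an attr which updates only the clamps: with TUNE_POLICY set,
 * sched_policy is ignored and the task keeps its current policy. */
static void attr_tune_clamps(struct sched_attr *attr,
			     uint32_t util_min, uint32_t util_max)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->sched_flags = SCHED_FLAG_TUNE_POLICY | SCHED_FLAG_UTIL_CLAMP;
	attr->sched_util_min = util_min;
	attr->sched_util_max = util_max;
}
```

A real caller would then pass the attr to the kernel via
syscall(__NR_sched_setattr, pid, &attr, 0), without having to read (and
race against) the task's current policy first.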
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <[email protected]>
- allow to change clamp values without affecting current policy
---
include/uapi/linux/sched.h | 6 +++++-
kernel/sched/core.c | 11 ++++++++++-
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index c27d6e81517b..62498d749bec 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,6 +40,8 @@
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+/* Must be the last entry: used to check attr.policy values */
+#define SCHED_POLICY_MAX 7
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -50,11 +52,13 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
-#define SCHED_FLAG_UTIL_CLAMP 0x08
+#define SCHED_FLAG_TUNE_POLICY 0x08
+#define SCHED_FLAG_UTIL_CLAMP 0x10
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_TUNE_POLICY | \
SCHED_FLAG_UTIL_CLAMP)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3701bb1e6698..9a2e12eaa377 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4595,8 +4595,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
if (retval)
return retval;
- if ((int)attr.sched_policy < 0)
+ /*
+ * A valid policy is always required from user-space, unless
+ * SCHED_FLAG_TUNE_POLICY is set; in that case the task's current
+ * policy is enforced for this call.
+ */
+ if (attr.sched_policy >= SCHED_POLICY_MAX &&
+ !(attr.sched_flags & SCHED_FLAG_TUNE_POLICY)) {
return -EINVAL;
+ }
+ if (attr.sched_flags & SCHED_FLAG_TUNE_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;
rcu_read_lock();
retval = -ESRCH;
--
2.18.0
Utilization clamping requires each CPU to know which clamp values are
assigned to tasks that are currently RUNNABLE on that CPU: multiple
tasks can be assigned the same clamp value and tasks with different
clamp values can be concurrently active on the same CPU.
A proper data structure is required to support a fast and efficient
aggregation of the clamp values required by the currently RUNNABLE
tasks. For this purpose, a per-CPU array of reference counters can be
used, where each slot tracks how many tasks requiring the same clamp
value are currently RUNNABLE on that CPU.
Thus we need a mechanism to map each "clamp value" into a corresponding
"group index", which identifies the position within the reference
counters array used to track RUNNABLE tasks.
Let's introduce the support to map tasks to "clamp groups".
Specifically we introduce the required functions to translate a
"clamp value" (clamp_value) into a clamp's "group index" (group_id).
:
(user-space changes) : (kernel space / scheduler)
:
SLOW PATH : FAST PATH
:
task_struct::uclamp::value : sched/core::enqueue/dequeue
: cpufreq_schedutil
:
+----------------+ +--------------------+ +-------------------+
| TASK | | CLAMP GROUP | | CPU CLAMPS |
+----------------+ +--------------------+ +-------------------+
| | | clamp_{min,max} | | clamp_{min,max} |
| util_{min,max} | | se_count | | tasks count |
+----------------+ +--------------------+ +-------------------+
:
+------------------> : +------------------->
group_id = map(clamp_value) : ref_count(group_id)
:
:
Only a limited number of (different) clamp values are supported since:
1. there are usually only a few classes of workloads for which it makes
sense to boost/limit to different frequencies,
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path.
The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can fail to
map it to a dedicated clamp index.
Such tasks will be flagged as "not mapped" and not tracked at
enqueue/dequeue time.
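The value-to-group mapping described above can be sketched as a small
user-space model. This is a simplified, non-atomic illustration of the
clamp map idea (names and the group count are illustrative, not the
actual kernel implementation, which uses atomic cmpxchg on packed
bitfields):

```c
#include <assert.h>

#define UCLAMP_GROUPS 6	/* illustrative: CONFIG_UCLAMP_GROUPS_COUNT + 1 */

struct clamp_group {
	unsigned int value;	/* clamp value tracked by this group */
	unsigned int se_count;	/* scheduling entities using this value */
};

static struct clamp_group groups[UCLAMP_GROUPS];

/*
 * Map a clamp value to a group index: reuse a group already tracking
 * this value, otherwise grab a free (unreferenced) one. Returns -1 when
 * no group is available, mirroring the "not mapped" condition above.
 */
static int clamp_group_get(unsigned int clamp_value)
{
	int free_id = -1;

	for (int id = 0; id < UCLAMP_GROUPS; id++) {
		if (free_id < 0 && !groups[id].se_count)
			free_id = id;
		if (groups[id].se_count && groups[id].value == clamp_value) {
			groups[id].se_count++;
			return id;
		}
	}
	if (free_id < 0)
		return -1;
	groups[free_id].value = clamp_value;
	groups[free_id].se_count = 1;
	return free_id;
}
```

Two tasks requesting the same clamp value thus share one group_id and
one per-CPU refcount slot, which is what keeps the fast-path tracking
cheap.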
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
A following patch:
sched/core: uclamp: add clamp group bucketing support
will fix the "not mapped" tasks not being tracked.
Changes in v5:
Message-ID: <[email protected]>
- use bitfields and atomic_long_cmpxchg() operations to both compress
the clamp maps and avoid usage of spinlock.
- remove enforced __cacheline_aligned_in_smp on uclamp_map since it's
accessed from the slow path only and we don't care about performance
- better describe the usage of uclamp_map::se_lock
Message-ID: <[email protected]>
- remove inline from uclamp_group_{get,put}() and __setscheduler_uclamp()
- set lower/upper bounds at the beginning of __setscheduler_uclamp()
- avoid usage of pr_err from unprivileged syscall paths
in __setscheduler_uclamp(), replaced by ratelimited version
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove/limit usage of UCLAMP_NOT_VALID whenever not strictly required
Message-ID: <[email protected]>
- allow sched_setattr() syscall to sleep on mutex
- fix return value for successful uclamp syscalls
Message-ID: <CAJuCfpF36-VZm0JVVNnOnGm-ukVejzxbPhH33X3z9gAQ06t9gQ@mail.gmail.com>
- reorder conditions in uclamp_group_find() loop
- use uc_se->xxx in uclamp_fork()
Others:
- use UCLAMP_GROUPS to track (CONFIG_UCLAMP_GROUPS_COUNT+1)
- rebased on v4.19
Changes in v4:
Message-ID: <[email protected]>
- add uclamp_exit_task() to release clamp refcount from do_exit()
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN but beautify that code a bit
Message-ID: <[email protected]>
- move uclamp_enabled at the top of sched_class to keep it on the same
cache line of other main wakeup time callbacks
Others:
- init uclamp for the init_task and refcount its clamp groups
- add uclamp specific fork time code into uclamp_fork
- add support for SCHED_FLAG_RESET_ON_FORK
default clamps are now set for init_task and inherited/reset at
fork time (when the flag is set for the parent)
- enable uclamp only for FAIR tasks, RT class will be enabled only
by a following patch which also integrates the class with schedutil
- define uclamp_maps ____cacheline_aligned_in_smp
- in uclamp_group_get() ensure to include uclamp_group_available() and
uclamp_group_init() into the atomic section defined by:
uc_map[next_group_id].se_lock
- do not use mutex_lock(&uclamp_mutex) in uclamp_exit_task
which is also not needed since refcounting is already guarded by
the uc_map[group_id].se_lock spinlock
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- remove not necessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() failing if both clamps are required but
there is no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- set UCLAMP_GROUPS_COUNT=2 by default
which allows all the hot-path CPU clamp data, partially
introduced also by the following patches, to fit into a single cache
line while still supporting up to 2 different {min,max}_util clamps.
---
include/linux/sched.h | 39 ++++-
include/linux/sched/task.h | 6 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 5 +-
init/Kconfig | 20 +++
init/init_task.c | 4 -
kernel/exit.c | 1 +
kernel/sched/core.c | 283 ++++++++++++++++++++++++++++++---
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 28 +++-
10 files changed, 363 insertions(+), 33 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 880a0c5c1f87..facace271ea1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -318,6 +318,12 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
+/*
+ * Increase resolution of cpu_capacity calculations
+ */
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -575,6 +581,37 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};
+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Number of utilization clamp groups
+ *
+ * The first clamp group (group_id=0) is used to track non clamped tasks, i.e.
+ * util_{min,max} (0,SCHED_CAPACITY_SCALE). Thus we allocate one more group in
+ * addition to the configured number.
+ */
+#define UCLAMP_GROUPS (CONFIG_UCLAMP_GROUPS_COUNT + 1)
+
+/**
+ * Utilization clamp group
+ *
+ * A utilization clamp group maps a:
+ * clamp value (value), i.e.
+ * util_{min,max} value requested from userspace
+ * to a:
+ * clamp group index (group_id), i.e.
+ * index of the per-cpu RUNNABLE tasks refcounting array
+ *
+ * The mapped bit is set whenever a task has been mapped on a clamp group for
+ * the first time. When this bit is set, any clamp group get (for a new clamp
+ * value) will be matched by a clamp group put (for the old clamp value).
+ */
+struct uclamp_se {
+ unsigned int value : SCHED_CAPACITY_SHIFT + 1;
+ unsigned int group_id : order_base_2(UCLAMP_GROUPS);
+ unsigned int mapped : 1;
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -659,7 +696,7 @@ struct task_struct {
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values for this task */
- int uclamp[UCLAMP_CNT];
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif
#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 108ede99e533..36c81c364112 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -68,6 +68,12 @@ static inline void exit_thread(struct task_struct *tsk)
#endif
extern void do_group_exit(int);
+#ifdef CONFIG_UCLAMP_TASK
+extern void uclamp_exit_task(struct task_struct *p);
+#else
+static inline void uclamp_exit_task(struct task_struct *p) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..350043d203db 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@
#include <linux/sched/idle.h>
-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 62498d749bec..e6f2453eb5a5 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -53,7 +53,10 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_TUNE_POLICY 0x08
-#define SCHED_FLAG_UTIL_CLAMP 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
diff --git a/init/Kconfig b/init/Kconfig
index 738974c4f628..4c5475030286 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -633,7 +633,27 @@ config UCLAMP_TASK
If in doubt, say N.
+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 32
+ default 5
+ depends on UCLAMP_TASK
+ help
+ This defines the maximum number of different utilization clamp
+ values which can be concurrently enforced for each utilization
+ clamp index (i.e. minimum and maximum utilization).
+
+ Only a limited number of clamp values are supported because:
+ 1. there are usually only a few classes of workloads for which it
+ makes sense to boost/cap to different frequencies,
+ e.g. background vs foreground, interactive vs low-priority.
+ 2. it allows a simpler and more memory/time efficient tracking of
+ per-CPU clamp values.
+
+ If in doubt, use the default value.
+
endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5bfdcc3fb839..7f77741b6a9b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -92,10 +92,6 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
-#endif
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp[UCLAMP_MIN] = 0,
- .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/exit.c b/kernel/exit.c
index 0e21e6d21f35..feb540558051 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -877,6 +877,7 @@ void __noreturn do_exit(long code)
sched_autogroup_exit_task(tsk);
cgroup_exit(tsk);
+ uclamp_exit_task(tsk);
/*
* FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2e12eaa377..654327d7f212 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -717,25 +717,266 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
#ifdef CONFIG_UCLAMP_TASK
-static inline int __setscheduler_uclamp(struct task_struct *p,
- const struct sched_attr *attr)
+/**
+ * uclamp_mutex: serializes updates of utilization clamp values
+ *
+ * Utilization clamp value updates are triggered from user-space (slow-path)
+ * but require refcounting updates on data structures used by scheduler's
+ * enqueue/dequeue operations (fast-path).
+ * While fast-path refcounting is enforced by atomic operations, this mutex
+ * ensures that we serialize user-space requests thus avoiding the risk of
+ * conflicting updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * uclamp_map: reference count utilization clamp groups
+ * @value: the utilization "clamp value" tracked by this clamp group
+ * @se_count: the number of scheduling entities using this "clamp value"
+ */
+union uclamp_map {
+ struct {
+ unsigned long value : SCHED_CAPACITY_SHIFT + 1;
+ unsigned long se_count : BITS_PER_LONG -
+ SCHED_CAPACITY_SHIFT - 1;
+ };
+ unsigned long data;
+ atomic_long_t adata;
+};
+
+/**
+ * uclamp_maps: map SEs "clamp value" into CPUs "clamp group"
+ *
+ * Since only a limited number of different "clamp values" are supported, we
+ * map each value into a "clamp group" (group_id) used at tasks {en,de}queued
+ * time to update a per-CPU refcounter tracking the number of RUNNABLE tasks
+ * requesting that clamp value.
+ * A "clamp index" (clamp_id) is used to define the kind of clamping, i.e. min
+ * and max utilization.
+ *
+ * A matrix is thus required to map "clamp values" (value) to "clamp groups"
+ * (group_id), for each "clamp index" (clamp_id), where:
+ * - rows are indexed by clamp_id and they collect the clamp groups for a
+ * given clamp index
+ * - columns are indexed by group_id and they collect the clamp values which
+ * map to that clamp group
+ *
+ * Thus, the column index of a given (clamp_id, value) pair represents the
+ * clamp group (group_id) used by the fast-path's per-CPU refcounter.
+ *
+ * uclamp_maps is a matrix of
+ * +------- UCLAMP_CNT by UCLAMP_GROUPS entries
+ * | |
+ * | /---------------+---------------\
+ * | +------------+ +------------+
+ * | / UCLAMP_MIN | value | | value |
+ * | | | se_count |...... | se_count |
+ * | | +------------+ +------------+
+ * +--+ +------------+ +------------+
+ * | | value | | value |
+ * \ UCLAMP_MAX | se_count |...... | se_count |
+ * +-----^------+ +----^-------+
+ * |
+ * |
+ * +
+ * uclamp_maps[clamp_id][group_id].value
+ */
+static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @group_id: the clamp group to release
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value.
+ */
+static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
{
- if (attr->sched_util_min > attr->sched_util_max)
- return -EINVAL;
- if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
+ union uclamp_map uc_map_old, uc_map_new;
+ long res;
+
+retry:
+
+ uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
+#ifdef CONFIG_SCHED_DEBUG
+#define UCLAMP_GRPERR "invalid SE clamp group [%u:%u] refcount\n"
+ if (unlikely(!uc_map_old.se_count)) {
+ pr_err_ratelimited(UCLAMP_GRPERR, clamp_id, group_id);
+ return;
+ }
+#endif
+ uc_map_new = uc_map_old;
+ uc_map_new.se_count -= 1;
+ res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
+ uc_map_old.data, uc_map_new.data);
+ if (res != uc_map_old.data)
+ goto retry;
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @uc_se: the utilization clamp data for the task
+ * @clamp_id: the clamp index affected by the task
+ * @clamp_value: the new clamp value for the task
+ *
+ * Each time a task changes its utilization clamp value, for a specified clamp
+ * index, we need to find an available clamp group which can be used to track
+ * this new clamp value. The corresponding clamp group index will be used to
+ * reference count the corresponding clamp value while the task is enqueued on
+ * a CPU.
+ */
+static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
+ unsigned int clamp_value)
+{
+ union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
+ unsigned int prev_group_id = uc_se->group_id;
+ union uclamp_map uc_map_old, uc_map_new;
+ unsigned int free_group_id;
+ unsigned int group_id;
+ unsigned long res;
+
+retry:
+
+ free_group_id = UCLAMP_GROUPS;
+ for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
+ uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
+ if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
+ free_group_id = group_id;
+ if (uc_map_old.value == clamp_value)
+ break;
+ }
+ if (group_id >= UCLAMP_GROUPS) {
+#ifdef CONFIG_SCHED_DEBUG
+#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
+ if (unlikely(free_group_id == UCLAMP_GROUPS)) {
+ pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
+ return;
+ }
+#endif
+ group_id = free_group_id;
+ uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
+ }
+
+ uc_map_new.se_count = uc_map_old.se_count + 1;
+ uc_map_new.value = clamp_value;
+ res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
+ uc_map_old.data, uc_map_new.data);
+ if (res != uc_map_old.data)
+ goto retry;
+
+ /* Update SE's clamp values and attach it to new clamp group */
+ uc_se->value = clamp_value;
+ uc_se->group_id = group_id;
+
+ /* Release the previous clamp group */
+ if (uc_se->mapped)
+ uclamp_group_put(clamp_id, prev_group_id);
+ uc_se->mapped = true;
+}
+
+static int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
+ unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
+ int result = 0;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ lower_bound = attr->sched_util_min;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ upper_bound = attr->sched_util_max;
+
+ if (lower_bound > upper_bound ||
+ upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;
- p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
- p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+ mutex_lock(&uclamp_mutex);
- return 0;
+ /* Update each required clamp group */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ uclamp_group_get(&p->uclamp[UCLAMP_MIN],
+ UCLAMP_MIN, lower_bound);
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ uclamp_group_get(&p->uclamp[UCLAMP_MAX],
+ UCLAMP_MAX, upper_bound);
+ }
+
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
+/**
+ * uclamp_exit_task: release referenced clamp groups
+ * @p: the task exiting
+ *
+ * When a task terminates, release all the task-specific clamp groups it
+ * may have refcounted.
+ */
+void uclamp_exit_task(struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (!p->uclamp[clamp_id].mapped)
+ continue;
+ uclamp_group_put(clamp_id, p->uclamp[clamp_id].group_id);
+ }
+}
+
+/**
+ * uclamp_fork: refcount task-specific clamp values for a new task
+ */
+static void uclamp_fork(struct task_struct *p, bool reset)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = p->uclamp[clamp_id].value;
+
+ if (unlikely(reset))
+ clamp_value = uclamp_none(clamp_id);
+
+ p->uclamp[clamp_id].mapped = false;
+ uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
+ }
+}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static void __init init_uclamp(void)
+{
+ struct uclamp_se *uc_se;
+ unsigned int clamp_id;
+
+ mutex_init(&uclamp_mutex);
+
+ memset(uclamp_maps, 0, sizeof(uclamp_maps));
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &init_task.uclamp[clamp_id];
+ uclamp_group_get(uc_se, clamp_id, uclamp_none(clamp_id));
+ }
}
+
#else /* CONFIG_UCLAMP_TASK */
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
return -EINVAL;
}
+static inline void uclamp_fork(struct task_struct *p, bool reset) { }
+static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2314,6 +2555,7 @@ static inline void init_schedstats(void) {}
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
+ bool reset;
__sched_fork(clone_flags, p);
/*
@@ -2331,7 +2573,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
/*
* Revert to default priority/policy on fork if requested.
*/
- if (unlikely(p->sched_reset_on_fork)) {
+ reset = p->sched_reset_on_fork;
+ if (unlikely(reset)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
@@ -2342,11 +2585,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);
-#ifdef CONFIG_UCLAMP_TASK
- p->uclamp[UCLAMP_MIN] = 0;
- p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
-#endif
-
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -2363,6 +2601,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
init_entity_runnable_average(&p->se);
+ uclamp_fork(p, reset);
+
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
@@ -4610,10 +4850,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();
+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}
@@ -4765,8 +5010,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);
#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN];
- attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
#endif
rcu_read_unlock();
@@ -6116,6 +6361,8 @@ void __init sched_init(void)
init_schedstats();
+ init_uclamp();
+
scheduler_running = 1;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 908c9cdae2f0..6c92cd2d637a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10157,6 +10157,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9683f458aec7..947ab14d3d5b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1504,10 +1504,12 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;
+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*yield_task) (struct rq *rq);
- bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
@@ -1540,7 +1542,6 @@ struct sched_class {
void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
- void (*task_dead)(struct task_struct *p);
/*
* The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1557,12 +1558,17 @@ struct sched_class {
void (*update_curr)(struct rq *rq);
+ void (*yield_task) (struct rq *rq);
+ bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+ void (*task_dead)(struct task_struct *p);
};
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
@@ -2180,6 +2186,22 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */
+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.18.0
Utilization clamping allows clamping the utilization of a CPU within a
[util_min, util_max] range which depends on the set of tasks currently
RUNNABLE on that CPU.
Each task references two "clamp groups" defining the minimum and maximum
utilization "clamp values" to be considered for that task. Each clamp
value is mapped to a clamp group which is enforced on a CPU only when
there is at least one RUNNABLE task referencing that clamp group.
When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
the new "aggregated" clamp value to apply on that CPU must be
re-computed.
Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.
Tasks have a:
task_struct::uclamp[clamp_idx]::group_id
indexing, for each "clamp index" (i.e. util_{min,max}), the "group
index" of the clamp group they have to refcount at enqueue time.
CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index.
The clamp value of each clamp group is tracked by
rq::uclamp::group[clamp_idx][group_idx].value
thus making rq::uclamp::group[][] an unordered array of clamp values.
The MAX aggregation of the currently active clamp groups is implemented
to minimize the number of times we need to scan the complete (unordered)
clamp group array to figure out the new max value.
This operation indeed happens only when we dequeue the last task of the
clamp group corresponding to the current max clamp, and thus the CPU is
either entering IDLE or going to schedule a less boosted or more capped
task. Moreover, the expected number of different clamp values, which can
be configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect fewer than
10 different clamp values for each clamp index.
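The MAX aggregation described above amounts to one linear scan of the
(unordered) per-CPU clamp group array. A user-space sketch of that scan
(illustrative names and group count; the kernel version operates on
rq::uclamp under the rq lock):

```c
#include <assert.h>

#define UCLAMP_GROUPS 6	/* illustrative */

struct cpu_clamp_group {
	unsigned int value;	/* clamp value enforced by this group */
	unsigned int tasks;	/* RUNNABLE tasks refcounting this group */
};

/*
 * MAX-aggregate the clamp groups of one clamp index: the enforced clamp
 * value is the maximum among groups with at least one RUNNABLE task.
 * Returns the default value (dflt) when no group is active, e.g. on an
 * idle CPU.
 */
static unsigned int uclamp_cpu_update(const struct cpu_clamp_group *grp,
				      unsigned int dflt)
{
	unsigned int max_value = dflt;
	int active = 0;

	for (int id = 0; id < UCLAMP_GROUPS; id++) {
		if (!grp[id].tasks)
			continue;
		if (!active || grp[id].value > max_value)
			max_value = grp[id].value;
		active = 1;
	}
	return max_value;
}
```

MAX aggregation keeps the scan result conservative: a RUNNABLE task with
a higher util_min (more boosted) or a higher util_max (less capped)
always wins, so co-scheduled tasks cannot degrade its performance.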
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove not required check for (group_id == UCLAMP_NOT_VALID)
in uclamp_cpu_put_id
Message-ID: <20180912174456.GJ1413@e110439-lin>
- use bitfields to compress uclamp_group
Others:
- consistently use "unsigned int" for both clamp_id and group_id
- fixup documentation
- reduced usage of inline comments
- rebased on v4.19
Changes in v4:
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN in uclamp_cpu_put_id() but beautify that code a bit
- add another WARN on the unexpected condition of releasing a refcount
from a CPU which has a lower clamp value active
Other:
- ensure (and check) that all tasks have a valid group_id at
uclamp_cpu_get_id()
- rework uclamp_cpu layout to better fit into just 2x64B cache lines
- fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Other:
- rebased on v4.18-rc4
- improved documentation to make some concepts more explicit.
---
include/linux/sched.h | 5 ++
kernel/sched/core.c | 185 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 49 +++++++++++
3 files changed, 239 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index facace271ea1..3ab1cbd4e3b1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,11 +604,16 @@ struct sched_dl_entity {
* The mapped bit is set whenever a task has been mapped on a clamp group for
* the first time. When this bit is set, any clamp group get (for a new clamp
* value) will be matched by a clamp group put (for the old clamp value).
+ *
+ * The active bit is set whenever a task has an effective clamp group
+ * and value assigned, which can be different from the user requested ones.
+ * This makes it possible to know whether a task is actually refcounting a
+ * CPU's clamp group.
*/
struct uclamp_se {
unsigned int value : SCHED_CAPACITY_SHIFT + 1;
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
+ unsigned int active : 1;
};
#endif /* CONFIG_UCLAMP_TASK */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 654327d7f212..a98a96a7d9f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -783,6 +783,159 @@ union uclamp_map {
*/
static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @rq: the CPU's rq whose utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups can change. Since each clamp group enforces a different
+ * utilization clamp value, once the set of active groups changes it can be
+ * necessary to re-compute the new clamp value to apply for that CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of active clamp groups.
+ */
+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+{
+ unsigned int group_id;
+ int max_value = 0;
+
+ for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
+ if (!rq->uclamp.group[clamp_id][group_id].tasks)
+ continue;
+ /* Both min and max clamps are MAX aggregated */
+ if (max_value < rq->uclamp.group[clamp_id][group_id].value)
+ max_value = rq->uclamp.group[clamp_id][group_id].value;
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ rq->uclamp.value[clamp_id] = max_value;
+}
+
+/**
+ * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @rq: the CPU's rq where the clamp group has to be reference counted
+ * @clamp_id: the clamp index to update
+ *
+ * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
+ * the task's uclamp::group_id is reference counted on that CPU.
+ */
+static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int group_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ group_id = p->uclamp[clamp_id].group_id;
+ p->uclamp[clamp_id].active = true;
+
+ rq->uclamp.group[clamp_id][group_id].tasks += 1;
+
+ if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
+ rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
+}
+
+/**
+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @rq: the CPU's rq from where the clamp group has to be released
+ * @clamp_id: the clamp index to update
+ *
+ * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
+ * counted by the task is released.
+ * If this was the last task refcounting the current max clamp group,
+ * then the CPU clamping is updated to find the new max for the specified
+ * clamp index.
+ */
+static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value;
+ unsigned int group_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ group_id = p->uclamp[clamp_id].group_id;
+ p->uclamp[clamp_id].active = false;
+
+ if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
+ rq->uclamp.group[clamp_id][group_id].tasks -= 1;
+#ifdef CONFIG_SCHED_DEBUG
+ else {
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+
+ if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
+ return;
+
+ clamp_value = rq->uclamp.group[clamp_id][group_id].value;
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+ if (clamp_value >= rq->uclamp.value[clamp_id])
+ uclamp_cpu_update(rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_get(): increase CPU's clamp group refcount
+ * @rq: the CPU's rq where the task is enqueued
+ * @p: the task being enqueued
+ *
+ * When a task is enqueued on a CPU's rq, all the clamp groups currently
+ * enforced on a task are reference counted on that rq. Since not all
+ * scheduling classes have utilization clamping support, tasks of those
+ * classes are silently ignored.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_get_id(p, rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_put(): decrease CPU's clamp group refcount
+ * @rq: the CPU's rq from where the task is dequeued
+ * @p: the task being dequeued
+ *
+ * When a task is dequeued from a CPU's rq, all the clamp groups the task has
+ * reference counted at enqueue time are now released.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_put_id(p, rq, clamp_id);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
unsigned int free_group_id;
unsigned int group_id;
unsigned long res;
+ int cpu;
retry:
@@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
if (res != uc_map_old.data)
goto retry;
+ /* Ensure each CPU tracks the correct value for this clamp group */
+ if (likely(uc_map_new.se_count > 1))
+ goto done;
+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ /* Refcounts are expected to always be 0 for free groups */
+ if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
+ uc_cpu->group[clamp_id][group_id].tasks = 0;
+#ifdef CONFIG_SCHED_DEBUG
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
+ cpu, clamp_id, group_id);
+#endif
+ }
+
+ if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
+ continue;
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ }
+
+done:
+
/* Update SE's clamp values and attach it to new clamp group */
uc_se->value = clamp_value;
uc_se->group_id = group_id;
@@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
clamp_value = uclamp_none(clamp_id);
p->uclamp[clamp_id].mapped = false;
+ p->uclamp[clamp_id].active = false;
uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
}
}
@@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
{
struct uclamp_se *uc_se;
unsigned int clamp_id;
+ int cpu;
mutex_init(&uclamp_mutex);
+ for_each_possible_cpu(cpu)
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
+
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
@@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
}
#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
+ uclamp_cpu_get(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);
+ uclamp_cpu_put(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 947ab14d3d5b..94c4f2f410ad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * struct uclamp_group - Utilization clamp Group
+ * @value: utilization clamp value for tasks on this clamp group
+ * @tasks: number of RUNNABLE tasks on this clamp group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ unsigned long value : SCHED_CAPACITY_SHIFT + 1;
+ unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
+};
+
+/**
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @group: utilization clamp groups affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ struct uclamp_group group[UCLAMP_CNT][UCLAMP_GROUPS];
+ int value[UCLAMP_CNT];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -804,6 +848,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;
+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
--
2.18.0
Utilization clamp values enforced on a CPU by a task can be updated at
run-time, for example via a sched_setattr syscall, while a task is
currently RUNNABLE on that CPU. In these cases, the task is already
refcounting a clamp group for its CPU and thus we need to update this
reference to ensure the new constraints are immediately enforced.
Since a clamp value change always implies a clamp group refcount update,
this patch hooks into uclamp_group_get() to trigger a CPU refcount
syncup, via uclamp_cpu_{put,get}_id(), whenever a task is RUNNABLE.
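As a rough userspace model (types, locking and names here are illustrative assumptions, not the kernel implementation), the syncup boils down to a put on the old clamp group followed by a get on the new one, both done while holding the task's rq lock:

```c
#include <assert.h>

#define UCLAMP_GROUPS	4

struct cpu_clamp {
	unsigned int tasks[UCLAMP_GROUPS];	/* per-group refcounts */
};

/*
 * Release the reference a RUNNABLE task holds on its old clamp group
 * and take one on its new group, as uclamp_cpu_{put,get}_id() are
 * called back to back while holding the task's rq lock.
 */
static void clamp_group_syncup(struct cpu_clamp *cc,
			       unsigned int old_group,
			       unsigned int new_group)
{
	if (cc->tasks[old_group])
		cc->tasks[old_group] -= 1;	/* put: old clamp group */
	cc->tasks[new_group] += 1;		/* get: new clamp group */
}
```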
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove not required uclamp_task_affects() since now all tasks always
have a valid clamp group assigned
Others:
- consistently use "unsigned int" for both clamp_id and group_id
- fixup documentation
- reduced usage of inline comments
- rebased on v4.19
Changes in v4:
Message-ID: <20180816132249.GA2960@e110439-lin>
- inline uclamp_task_active() code into uclamp_task_update_active()
- get rid of the now unused uclamp_task_active()
Other:
- allow to call uclamp_group_get() without a task pointer, which is
used to refcount the initial clamp group for all the global objects
(init_task, root_task_group and system_defaults)
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not required at this stage, where we have only per-task
clamping support. It will be introduced later when cgroups support is
added.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 60 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 54 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a98a96a7d9f1..21f6251a1d44 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -936,6 +936,48 @@ static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
uclamp_cpu_put_id(p, rq, clamp_id);
}
+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task whose clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups must be updated for each CPU containing a RUNNABLE task belonging to
+ * that task group. Sleeping tasks are not updated since they will be enqueued
+ * with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * The setting of the clamp group is serialized by task_rq_lock().
+ * Thus, if the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp group, then the next time it's going to be
+ * enqueued it will already see the updated clamp group value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_cpu_put_id(p, rq, clamp_id);
+ uclamp_cpu_get_id(p, rq, clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -970,6 +1012,7 @@ static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
/**
* uclamp_group_get: increase the reference count for a clamp group
+ * @p: the task whose clamp value must be tracked
* @uc_se: the utilization clamp data for the task
* @clamp_id: the clamp index affected by the task
* @clamp_value: the new clamp value for the task
@@ -980,8 +1023,8 @@ static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
* reference count the corresponding clamp value while the task is enqueued on
* a CPU.
*/
-static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
- unsigned int clamp_value)
+static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
+ unsigned int clamp_id, unsigned int clamp_value)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
unsigned int prev_group_id = uc_se->group_id;
@@ -1046,6 +1089,10 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
uc_se->value = clamp_value;
uc_se->group_id = group_id;
+ /* Update CPU's clamp group refcounts of RUNNABLE task */
+ if (p)
+ uclamp_task_update_active(p, clamp_id);
+
/* Release the previous clamp group */
if (uc_se->mapped)
uclamp_group_put(clamp_id, prev_group_id);
@@ -1073,11 +1120,11 @@ static int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
- uclamp_group_get(&p->uclamp[UCLAMP_MIN],
+ uclamp_group_get(p, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
- uclamp_group_get(&p->uclamp[UCLAMP_MAX],
+ uclamp_group_get(p, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
@@ -1125,7 +1172,8 @@ static void uclamp_fork(struct task_struct *p, bool reset)
p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
- uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
+ uclamp_group_get(NULL, &p->uclamp[clamp_id],
+ clamp_id, clamp_value);
}
}
@@ -1146,7 +1194,7 @@ static void __init init_uclamp(void)
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
- uclamp_group_get(uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
}
}
--
2.18.0
When a util_max clamped task sleeps, its clamp constraints are removed
from the CPU. However, the blocked utilization on that CPU can still be
higher than the max clamp value enforced while that task was running.
The release of a util_max clamp when a CPU is going to be idle could
thus allow unwanted CPU frequency increases while tasks are not
running. This can happen, for example, when a frequency update is
triggered from another CPU of the same frequency domain.
In this case, when we aggregate the utilization of all the CPUs in a
shared frequency domain, schedutil can still see the full, unclamped
blocked utilization of all the CPUs and thus, eventually, increase the
frequency.
Let's fix this by using:
uclamp_cpu_put_id(UCLAMP_MAX)
uclamp_cpu_update(last_clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition. Thus, while a CPU is idle, we can still enforce the last used
clamp value for it.
On the contrary, we do not track any UCLAMP_MIN since, while a CPU is
idle, we don't want to enforce any minimum frequency. Indeed, in this
case, we rely just on the decay of the blocked utilization to smoothly
reduce the CPU frequency.
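The idle clamp holding described above can be modeled with a small userspace C sketch; the struct, constants and helper names below are simplified assumptions rather than the kernel code:

```c
#include <assert.h>

#define UCLAMP_MIN		0
#define UCLAMP_MAX		1
#define UCLAMP_FLAG_IDLE	0x01

struct cpu_clamp_state {
	int value[2];	/* currently enforced util_{min,max} */
	int flags;
};

/*
 * Called when the last RUNNABLE clamped task is dequeued: hold the
 * last util_max value while the CPU is idle, but let util_min drop
 * to zero so the blocked utilization can decay.
 */
static void clamp_idle_hold(struct cpu_clamp_state *cs, int clamp_id,
			    int last_clamp_value)
{
	if (clamp_id == UCLAMP_MAX) {
		cs->flags |= UCLAMP_FLAG_IDLE;
		cs->value[UCLAMP_MAX] = last_clamp_value;
	} else {
		cs->value[UCLAMP_MIN] = 0;
	}
}

/*
 * Called at enqueue after idle: reset the hold using the clamp value
 * of the waking task. The flag is cleared only on the UCLAMP_MAX
 * update, i.e. once UCLAMP_MIN has already been refreshed.
 */
static void clamp_idle_release(struct cpu_clamp_state *cs, int clamp_id,
			       int task_clamp_value)
{
	if (!(cs->flags & UCLAMP_FLAG_IDLE))
		return;
	if (clamp_id == UCLAMP_MAX)
		cs->flags &= ~UCLAMP_FLAG_IDLE;
	cs->value[clamp_id] = task_clamp_value;
}
```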
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- reduce usage of alias local variables whenever the global ones can
still be used without affecting code readability
- reduced usage of inline comments
- rebased on v4.19
Changes in v4:
Message-ID: <20180816172016.GG2960@e110439-lin>
- ensure to always reset clamp holding on wakeup from IDLE
Others:
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Changes in v2:
- rebased on v4.18-rc4
- new patch to improve a specific issue
---
kernel/sched/core.c | 41 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 11 +++++++++++
2 files changed, 49 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21f6251a1d44..b23f80c07be9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -796,10 +796,11 @@ static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
* For the specified clamp index, this method computes the new CPU utilization
* clamp to use until the next change on the set of active clamp groups.
*/
-static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
+ unsigned int last_clamp_value)
{
unsigned int group_id;
- int max_value = 0;
+ int max_value = -1;
for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
if (!rq->uclamp.group[clamp_id][group_id].tasks)
@@ -810,6 +811,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
if (max_value >= SCHED_CAPACITY_SCALE)
break;
}
+
+ /*
+ * Just for the UCLAMP_MAX value, when there are no more RUNNABLE
+ * tasks, we want to keep the CPU clamped to the last task's clamp
+ * value. This avoids frequency spikes to MAX when one CPU, with
+ * a high blocked utilization, sleeps and another CPU, in the same
+ * frequency domain, no longer sees the clamp on the first CPU.
+ *
+ * The UCLAMP_FLAG_IDLE is set whenever we detect, from the above
+ * loop, that there are no more RUNNABLE tasks on that CPU.
+ * In this case we enforce the CPU util_max to that of the last
+ * dequeued task.
+ */
+ if (max_value < 0) {
+ if (clamp_id == UCLAMP_MAX) {
+ rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
+ max_value = last_clamp_value;
+ } else {
+ max_value = uclamp_none(UCLAMP_MIN);
+ }
+ }
+
rq->uclamp.value[clamp_id] = max_value;
}
@@ -835,6 +858,18 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
rq->uclamp.group[clamp_id][group_id].tasks += 1;
+ if (unlikely(rq->uclamp.flags & UCLAMP_FLAG_IDLE)) {
+ /*
+ * Reset clamp holds on idle exit.
+ * This function is called for both UCLAMP_MIN (before) and
+ * UCLAMP_MAX (after). Let's reset the flag only the second
+ * time, once we know that UCLAMP_MIN has already been updated.
+ */
+ if (clamp_id == UCLAMP_MAX)
+ rq->uclamp.flags &= ~UCLAMP_FLAG_IDLE;
+ rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
+ }
+
if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
}
@@ -883,7 +918,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
}
#endif
if (clamp_value >= rq->uclamp.value[clamp_id])
- uclamp_cpu_update(rq, clamp_id);
+ uclamp_cpu_update(rq, clamp_id, clamp_value);
}
/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 94c4f2f410ad..859192ec492c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -807,6 +807,17 @@ struct uclamp_group {
struct uclamp_cpu {
struct uclamp_group group[UCLAMP_CNT][UCLAMP_GROUPS];
int value[UCLAMP_CNT];
+/*
+ * Idle clamp holding
+ * Whenever a CPU is idle, we enforce the util_max clamp value of the last
+ * task running on that CPU. This bit is used to flag a clamp holding
+ * currently active for a CPU. This flag is:
+ * - set when we update the clamp value of a CPU at the time of dequeuing the
+ * last task before entering idle
+ * - reset when we enqueue the first task after a CPU wakeup from IDLE
+ */
+#define UCLAMP_FLAG_IDLE 0x01
+ int flags;
};
#endif /* CONFIG_UCLAMP_TASK */
--
2.18.0
The limited number of clamp groups is required for effective and
efficient run-time tracking of the clamp groups needed by RUNNABLE
tasks. However, we must ensure we can always know which clamp group to
use in the fast path (task enqueue/dequeue time) to refcount each task,
whatever its clamp value is.
For this purpose, we can trade off CPU clamping precision for efficiency
by turning CPU's clamp groups into buckets, each one representing a
range of possible clamp values.
The number of clamp groups configured at compile time defines the range
of utilization clamp values tracked by each CPU clamp group.
For example, with the default configuration:
CONFIG_UCLAMP_GROUPS_COUNT 5
we will have 5 clamp groups tracking 20% utilization each. In this case,
a task with util_min=25% will have group_id=1.
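The bucketing math can be sketched as follows; note that with integer arithmetic each bucket actually spans SCHED_CAPACITY_SCALE/CONFIG_UCLAMP_GROUPS_COUNT = 204 utilization units (~20%). The direct index computation shown here is a simplification: in the series, group values are mapped to group IDs via uclamp_maps.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_GROUPS_COUNT	5
#define UCLAMP_GROUP_DELTA	(SCHED_CAPACITY_SCALE / UCLAMP_GROUPS_COUNT)
#define UCLAMP_GROUP_UPPER	(UCLAMP_GROUP_DELTA * UCLAMP_GROUPS_COUNT)

/*
 * Nominal "group value": the clamp value rounded down to the start of
 * its bucket, mirroring what uclamp_group_value() does in the patch.
 */
static unsigned int group_value(unsigned int clamp_value)
{
	if (clamp_value >= UCLAMP_GROUP_UPPER)
		return SCHED_CAPACITY_SCALE;
	return UCLAMP_GROUP_DELTA * (clamp_value / UCLAMP_GROUP_DELTA);
}

/* Bucket index for a clamp value (illustrative only) */
static unsigned int group_index(unsigned int clamp_value)
{
	if (clamp_value >= UCLAMP_GROUP_UPPER)
		return UCLAMP_GROUPS_COUNT - 1;
	return clamp_value / UCLAMP_GROUP_DELTA;
}
```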
This bucketing mechanism applies only on the fast path, when tasks are
refcounted into per-CPU clamp groups at enqueue/dequeue time, while
tasks keep tracking their task-specific clamp value requested from
user-space. This allows tracking, within each bucket, the maximum
task-specific clamp value for tasks refcounted in that bucket.
In the example above, a 25% boosted task will be refcounted in the
[20..39]% bucket and will set the bucket clamp effective value to 25%.
If a second 30% boosted task is co-scheduled on the same CPU, that
task will be refcounted in the same bucket as the first task and it
will boost the bucket clamp effective value to 30%.
The clamp effective value of a bucket will be reset to its nominal value
(named the "group_value", 20% in the example above) when there are
no more tasks refcounted in that bucket.
On a real system we expect a limited number of sufficiently different
clamp values; thus, this simple bucketing mechanism is still effective
at tracking tasks' effective clamp values quite closely.
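A minimal userspace model of the per-bucket effective value tracking (names and types are illustrative assumptions, not the kernel code):

```c
#include <assert.h>

struct bucket {
	unsigned int nominal;	/* "group value": start of the range */
	unsigned int value;	/* max effective clamp of queued tasks */
	unsigned int tasks;	/* RUNNABLE tasks refcounted here */
};

/* Enqueue: track the max task-specific clamp value in the bucket. */
static void bucket_get(struct bucket *b, unsigned int task_clamp)
{
	b->tasks += 1;
	if (task_clamp > b->value)
		b->value = task_clamp;
}

/* Dequeue: once the bucket empties, reset it to its nominal value. */
static void bucket_put(struct bucket *b)
{
	if (b->tasks)
		b->tasks -= 1;
	if (!b->tasks)
		b->value = b->nominal;
}
```

Note that, as in the example above, the effective value is not lowered while the bucket still holds tasks: that is the boost/capping margin the text describes.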
An additional boost/capping margin can be added to some tasks, in the
example above the 25% task will be boosted to 30% until it exits the
CPU.
If that is considered not acceptable on certain systems, it's
always possible to reduce the margin by increasing the bucketing
resolution. Indeed, by properly setting the number of
CONFIG_UCLAMP_GROUPS_COUNT, we can trade off memory efficiency for
resolution.
The already existing mechanism to map "clamp values" into "clamp groups"
ensures that only the minimal set of clamp groups actually required is used.
For example, if we have only 20% and 25% clamped tasks, by setting:
CONFIG_UCLAMP_GROUPS_COUNT 20
we will allocate 20 buckets with 5% resolution each; however, we will use
only two of them in the fast path, since their 5% resolution is enough
to always distinguish them.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- renamed uclamp_round into uclamp_group_value to better represent
what this function returns
- rebased on v4.19
Changes in v4:
Message-ID: <20180809152313.lewfhufidhxb2qrk@darkstar>
- implements the idea discussed in this thread
Others:
- new patch added in this version
- rebased on v4.19-rc1
---
kernel/sched/core.c | 48 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 43 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b23f80c07be9..9b49062439f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -783,6 +783,27 @@ union uclamp_map {
*/
static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
+/*
+ * uclamp_group_value: get the "group value" for a given "clamp value"
+ * @value: the utilization "clamp value" to translate
+ *
+ * The number of clamp groups, which is defined at compile time, allows
+ * tracking a finite number of different clamp values. Thus clamp values are
+ * grouped into bins, each one representing a different "group value".
+ * This method returns the "group value" corresponding to the specified
+ * "clamp value".
+ */
+static inline unsigned int uclamp_group_value(unsigned int clamp_value)
+{
+#define UCLAMP_GROUP_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_GROUPS_COUNT)
+#define UCLAMP_GROUP_UPPER (UCLAMP_GROUP_DELTA * CONFIG_UCLAMP_GROUPS_COUNT)
+
+ if (clamp_value >= UCLAMP_GROUP_UPPER)
+ return SCHED_CAPACITY_SCALE;
+
+ return UCLAMP_GROUP_DELTA * (clamp_value / UCLAMP_GROUP_DELTA);
+}
+
/**
* uclamp_cpu_update: updates the utilization clamp of a CPU
* @rq: the CPU's rq whose utilization clamp has to be updated
@@ -848,6 +869,7 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
{
+ unsigned int clamp_value;
unsigned int group_id;
if (unlikely(!p->uclamp[clamp_id].mapped))
@@ -870,6 +892,11 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
}
+ /* CPU's clamp groups track the max effective clamp value */
+ clamp_value = p->uclamp[clamp_id].value;
+ if (clamp_value > rq->uclamp.group[clamp_id][group_id].value)
+ rq->uclamp.group[clamp_id][group_id].value = clamp_value;
+
if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
}
@@ -917,8 +944,16 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
cpu_of(rq), clamp_id, group_id);
}
#endif
- if (clamp_value >= rq->uclamp.value[clamp_id])
+ if (clamp_value >= rq->uclamp.value[clamp_id]) {
+ /*
+ * Each CPU's clamp group value is reset to its nominal group
+ * value whenever there are no more RUNNABLE tasks refcounting
+ * that clamp group.
+ */
+ rq->uclamp.group[clamp_id][group_id].value =
+ uclamp_maps[clamp_id][group_id].value;
uclamp_cpu_update(rq, clamp_id, clamp_value);
+ }
}
/**
@@ -1065,10 +1100,13 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
unsigned int prev_group_id = uc_se->group_id;
union uclamp_map uc_map_old, uc_map_new;
unsigned int free_group_id;
+ unsigned int group_value;
unsigned int group_id;
unsigned long res;
int cpu;
+ group_value = uclamp_group_value(clamp_value);
+
retry:
free_group_id = UCLAMP_GROUPS;
@@ -1076,7 +1114,7 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
free_group_id = group_id;
- if (uc_map_old.value == clamp_value)
+ if (uc_map_old.value == group_value)
break;
}
if (group_id >= UCLAMP_GROUPS) {
@@ -1092,7 +1130,7 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
}
uc_map_new.se_count = uc_map_old.se_count + 1;
- uc_map_new.value = clamp_value;
+ uc_map_new.value = group_value;
res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
uc_map_old.data, uc_map_new.data);
if (res != uc_map_old.data)
@@ -1113,9 +1151,9 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
#endif
}
- if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
+ if (uc_cpu->group[clamp_id][group_id].value == group_value)
continue;
- uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ uc_cpu->group[clamp_id][group_id].value = group_value;
}
done:
--
2.18.0
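The hunk above maps exact clamp values to their clamp group's nominal value. A minimal user-space sketch of that bucketing (assuming a hypothetical UCLAMP_GROUP_DELTA of 64; in the series the delta is derived from the build-time configured number of clamp groups):

```c
#include <assert.h>

/* Assumed bucket size: in the series this is derived from the build-time
 * group count, here it is just an illustrative value. */
#define UCLAMP_GROUP_DELTA 64

/* Mirror of uclamp_group_value(): round a clamp value down to the nominal
 * value of its clamp group, so that "close enough" clamp values can share
 * a single clamp group. */
static unsigned int uclamp_group_value(unsigned int clamp_value)
{
	return UCLAMP_GROUP_DELTA * (clamp_value / UCLAMP_GROUP_DELTA);
}
```

All clamp values within the same delta-wide bucket thus refcount the same group.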
Currently schedutil enforces the maximum frequency whenever RT tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace, thus allowing, for example, to define a max frequency which is
still reasonable for the execution of a specific RT workload.
This will help make the RT class more friendly for power/energy
sensitive use-cases.
This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the util required is
defined by its task specific clamp value. However, we still want to run
at maximum capacity RT tasks which do not have task specific clamp
values.
Let's add uclamp_default_perf, a special set of clamp values to be used
for tasks that require maximum performance. This set of clamps is then
used whenever the above conditions match for an RT task being enqueued
on a CPU.
Since utilization clamping now applies to both CFS and RT tasks, we
clamp the combined utilization of these two classes.
This approach, contrary to combining individually clamped utilizations,
is more power efficient. Indeed, it selects lower frequencies when we
have both RT and CFS clamped tasks.
However, it could also affect performance of the lower priority CFS
class, since the CFS's minimum utilization clamp could be completely
eclipsed by the RT workloads.
The IOWait boost value is also subject to clamping for RT tasks.
This is to ensure that RT tasks as well as CFS ones are always subject
to the set of current utilization clamping constraints.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Message-ID: <20180813150112.GE2605@e110439-lin>
- remove UCLAMP_SCHED_CLASS policy since the current implementation does
  not have proper per-sched_class clamp tracking support
Message-ID: <20180809155551.bp46sixk4u3ilcnh@queper01-lin>
- add default boost for not clamped RT tasks
Others:
- rebased on v4.19-rc1
Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/core.c | 19 +++++++++++++++----
kernel/sched/cpufreq_schedutil.c | 22 ++++++++++------------
kernel/sched/rt.c | 4 ++++
3 files changed, 29 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8421ef96ec97..b9dd1980ec93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -745,6 +745,7 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
* Tasks's clamp values are required to be within this range
*/
static struct uclamp_se uclamp_default[UCLAMP_CNT];
+static struct uclamp_se uclamp_default_perf[UCLAMP_CNT];
/**
* uclamp_map: reference count utilization clamp groups
@@ -895,6 +896,7 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
unsigned int clamp_id)
{
+ struct uclamp_se *default_clamp;
unsigned int clamp_value;
unsigned int group_id;
@@ -906,15 +908,20 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
clamp_value = p->uclamp[clamp_id].value;
group_id = p->uclamp[clamp_id].group_id;
+ /* RT tasks have different default values */
+ default_clamp = task_has_rt_policy(p)
+ ? uclamp_default_perf
+ : uclamp_default;
+
/* System default restriction */
- if (unlikely(clamp_value < uclamp_default[UCLAMP_MIN].value ||
- clamp_value > uclamp_default[UCLAMP_MAX].value)) {
+ if (unlikely(clamp_value < default_clamp[UCLAMP_MIN].value ||
+ clamp_value > default_clamp[UCLAMP_MAX].value)) {
/*
* Unconditionally enforce system defaults, which is a simpler
* solution compared to a proper clamping.
*/
- clamp_value = uclamp_default[clamp_id].value;
- group_id = uclamp_default[clamp_id].group_id;
+ clamp_value = default_clamp[clamp_id].value;
+ group_id = default_clamp[clamp_id].group_id;
}
p->uclamp[clamp_id].effective.value = clamp_value;
@@ -1381,6 +1388,10 @@ static void __init init_uclamp(void)
uc_se = &uclamp_default[clamp_id];
uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+
+ /* RT tasks by default will go to max frequency */
+ uc_se = &uclamp_default_perf[clamp_id];
+ uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
}
}
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index fd3fe55d605b..1156c7117fc2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -205,9 +205,6 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq);
- if (rt_rq_is_runnable(&rq->rt))
- return max;
-
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
@@ -223,13 +220,16 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
- * CFS utilization can be boosted or capped, depending on utilization
- * clamp constraints requested by currently RUNNABLE tasks.
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and OPPs
* will be gracefully reduced with the utilization decay.
*/
- util = uclamp_util(rq, cpu_util_cfs(rq));
- util += cpu_util_rt(rq);
+ util = cpu_util_rt(rq) + cpu_util_cfs(rq);
+ util = uclamp_util(rq, util);
+ if (unlikely(util >= max))
+ return max;
/*
* We do not make cpu_util_dl() a permanent part of this sum because we
@@ -333,13 +333,11 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since RT tasks are not utilization clamped, we don't want
- * to apply clamping on IO boost while there is blocked RT
- * utilization.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
max_boost = sg_cpu->iowait_boost_max;
- if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
- max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2e2955a8cf8f..06ec33467dd9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2404,6 +2404,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,
.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_RT_GROUP_SCHED
--
2.18.0
When a task group refcounts a new clamp group, we need to ensure that
the new clamp values are immediately enforced to all its tasks which are
currently RUNNABLE. This is to ensure that all currently RUNNABLE tasks
are boosted and/or clamped as requested as soon as possible.
Let's ensure that, whenever a new clamp group is refcounted by a task
group, all its RUNNABLE tasks are correctly accounted in their
respective CPUs. We do that by slightly refactoring uclamp_group_get()
to take an additional *cgroup_subsys_state parameter which, when
provided, is used to walk the list of tasks in the corresponding TGs
and update the RUNNABLE ones.
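As a rough user-space sketch of that update walk (the struct and helper names are illustrative stand-ins for css_task_iter and uclamp_task_update_active(); only RUNNABLE tasks get the new clamp applied immediately, sleeping ones pick it up at their next enqueue):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for task_struct. */
struct task {
	int runnable;		/* task currently RUNNABLE on some CPU */
	unsigned int effective;	/* clamp value currently enforced */
};

/* Walk all tasks of a group and re-apply the clamp to RUNNABLE ones,
 * mirroring what uclamp_group_get_tg() does via css_task_iter. */
static void group_update_runnable(struct task *tasks, size_t n,
				  unsigned int new_clamp)
{
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].runnable)
			tasks[i].effective = new_clamp;
	}
}

/* Demo: three tasks, the middle one sleeping; return the effective clamp
 * of task `idx` after the group's clamp is raised from 100 to 300. */
static unsigned int demo(int idx)
{
	struct task tasks[3] = {
		{ .runnable = 1, .effective = 100 },
		{ .runnable = 0, .effective = 100 },
		{ .runnable = 1, .effective = 100 },
	};

	group_update_runnable(tasks, 3, 300);
	return tasks[idx].effective;
}
```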
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Others:
- rebased on v4.19-rc1
Changes in v3:
- rebased on tip/sched/core
- fixed some typos
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 65 ++++++++++++++++++++++++++++++++---------
kernel/sched/features.h | 5 ++++
2 files changed, 56 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2e35b0a1f0c..06f0c98a1b32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1197,9 +1197,30 @@ static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
goto retry;
}
+static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int group_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /*
+ * In lazy update mode, tasks will be accounted into the right clamp
+ * group the next time they will be requeued.
+ */
+ if (unlikely(sched_feat(UCLAMP_LAZY_UPDATE)))
+ return;
+
+ /* Update clamp groups for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
/**
* uclamp_group_get: increase the reference count for a clamp group
* @p: the task which clamp value must be tracked
+ * @css: the task group which clamp value must be tracked
* @uc_se: the utilization clamp data for the task
* @clamp_id: the clamp index affected by the task
* @clamp_value: the new clamp value for the task
@@ -1210,7 +1231,9 @@ static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
* reference count the corresponding clamp value while the task is enqueued on
* a CPU.
*/
-static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
+static void uclamp_group_get(struct task_struct *p,
+ struct cgroup_subsys_state *css,
+ struct uclamp_se *uc_se,
unsigned int clamp_id, unsigned int clamp_value)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
@@ -1279,6 +1302,10 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
uc_se->value = clamp_value;
uc_se->group_id = group_id;
+ /* Newly created TGs don't have tasks assigned */
+ if (css)
+ uclamp_group_get_tg(css, clamp_id, group_id);
+
/* Update CPU's clamp group refcounts of RUNNABLE task */
if (p)
uclamp_task_update_active(p, clamp_id);
@@ -1314,11 +1341,11 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
}
if (old_min != sysctl_sched_uclamp_util_min) {
- uclamp_group_get(NULL, &uclamp_default[UCLAMP_MIN],
+ uclamp_group_get(NULL, NULL, &uclamp_default[UCLAMP_MIN],
UCLAMP_MIN, sysctl_sched_uclamp_util_min);
}
if (old_max != sysctl_sched_uclamp_util_max) {
- uclamp_group_get(NULL, &uclamp_default[UCLAMP_MAX],
+ uclamp_group_get(NULL, NULL, &uclamp_default[UCLAMP_MAX],
UCLAMP_MAX, sysctl_sched_uclamp_util_max);
}
goto done;
@@ -1355,12 +1382,12 @@ static int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
p->uclamp[UCLAMP_MIN].user_defined = true;
- uclamp_group_get(p, &p->uclamp[UCLAMP_MIN],
+ uclamp_group_get(p, NULL, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
p->uclamp[UCLAMP_MAX].user_defined = true;
- uclamp_group_get(p, &p->uclamp[UCLAMP_MAX],
+ uclamp_group_get(p, NULL, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
@@ -1410,7 +1437,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
- uclamp_group_get(NULL, &p->uclamp[clamp_id],
+ uclamp_group_get(NULL, NULL, &p->uclamp[clamp_id],
clamp_id, clamp_value);
}
}
@@ -1432,19 +1459,23 @@ static void __init init_uclamp(void)
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
- uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_group_get(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(clamp_id));
uc_se = &uclamp_default[clamp_id];
- uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_group_get(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(clamp_id));
/* RT tasks by default will go to max frequency */
uc_se = &uclamp_default_perf[clamp_id];
- uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uclamp_group_get(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(UCLAMP_MAX));
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp group */
uc_se = &root_task_group.uclamp[clamp_id];
- uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uclamp_group_get(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(UCLAMP_MAX));
uc_se->effective.group_id = uc_se->group_id;
uc_se->effective.value = uc_se->value;
#endif
@@ -7053,8 +7084,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
int clamp_id;
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- uclamp_group_get(NULL, &tg->uclamp[clamp_id], clamp_id,
- parent->uclamp[clamp_id].value);
+ uclamp_group_get(NULL, NULL, &tg->uclamp[clamp_id],
+ clamp_id, parent->uclamp[clamp_id].value);
tg->uclamp[clamp_id].effective.value =
parent->uclamp[clamp_id].effective.value;
tg->uclamp[clamp_id].effective.group_id =
@@ -7390,6 +7421,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
uc_se->effective.value = value;
uc_se->effective.group_id = group_id;
+
+ /* Immediately update descendants' active tasks */
+ if (css != top_css)
+ uclamp_group_get_tg(css, clamp_id, group_id);
}
}
@@ -7414,7 +7449,8 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
}
/* Update TG's reference count */
- uclamp_group_get(NULL, &tg->uclamp[UCLAMP_MIN], UCLAMP_MIN, min_value);
+ uclamp_group_get(NULL, css, &tg->uclamp[UCLAMP_MIN],
+ UCLAMP_MIN, min_value);
/* Update effective clamps to track the most restrictive value */
cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].group_id,
@@ -7448,7 +7484,8 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}
/* Update TG's reference count */
- uclamp_group_get(NULL, &tg->uclamp[UCLAMP_MAX], UCLAMP_MAX, max_value);
+ uclamp_group_get(NULL, css, &tg->uclamp[UCLAMP_MAX],
+ UCLAMP_MAX, max_value);
/* Update effective clamps to track the most restrictive value */
cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].group_id,
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..aad826aa55f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,8 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Utilization clamping lazy update.
+ */
+SCHED_FEAT(UCLAMP_LAZY_UPDATE, false)
--
2.18.0
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider userspace utilization clamping hints.
This will allow, for example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
frequency selection thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
This adds the required support to clamp the utilization generated by
RUNNABLE FAIR tasks within the boundaries defined by their aggregated
utilization clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each task. This max aggregation
responds to the goal of not penalizing, for example, high boosted (i.e.
more important for the user-experience) CFS tasks which happens to be
co-scheduled with high capped (i.e. less important for the
user-experience) CFS tasks.
For FAIR tasks both the utilization and the IOWait boost values
are clamped according to the CPU aggregated utilization clamp
constraints.
The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.
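A user-space sketch of that behaviour (mirroring the uclamp_util() helper this patch adds to kernel/sched/sched.h, but with the clamp bounds passed in explicitly instead of being read from the rq):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* User-space mirror of the patch's uclamp_util(): clamp a utilization
 * signal into the CPU's aggregated [min_util, max_util] range. */
static unsigned int uclamp_util(unsigned int util,
				unsigned int min_util, unsigned int max_util)
{
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```

With the default clamps (0, SCHED_CAPACITY_SCALE) the function is a no-op, so frequency selection keeps following the actual utilization; a non-default util_min boosts the request while a non-default util_max caps it.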
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <[email protected]>
- add a comment to justify the assumptions on util clamping for FAIR tasks
Message-ID: <[email protected]>
- removed uclamp_value and use inline access to data structures
Message-ID: <20180914135712.GQ1413@e110439-lin>
- the (unlikely(val == UCLAMP_NOT_VALID)) check is no longer required
  since we now ensure we always have a valid value configured
Others:
- rebased on v4.19
Changes in v4:
Message-ID: <CAKfTPtC2adLupg7wy1JU9zxKx1466Sza6fSCcr92wcawm1OYkg@mail.gmail.com>
- use *rq instead of cpu for both uclamp_util() and uclamp_value()
Message-ID: <20180816135300.GC2960@e110439-lin>
- remove uclamp_value() which is never used outside CONFIG_UCLAMP_TASK
Others:
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/cpufreq_schedutil.c | 25 ++++++++++++++++++++++---
kernel/sched/sched.h | 28 ++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3bc8a8..fd3fe55d605b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS utilization can be boosted or capped, depending on utilization
+ * clamp constraints requested by currently RUNNABLE tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and OPPs
+ * will be gracefully reduced with the utilization decay.
*/
- util = cpu_util_cfs(rq);
+ util = uclamp_util(rq, cpu_util_cfs(rq));
util += cpu_util_rt(rq);
/*
@@ -307,6 +312,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
unsigned int flags)
{
bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+ unsigned int max_boost;
/* Reset boost if the CPU appears to have been idle enough */
if (sg_cpu->iowait_boost &&
@@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
return;
sg_cpu->iowait_boost_pending = true;
+ /*
+ * Boost FAIR tasks only up to the CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since RT tasks are not utilization clamped, we don't want
+ * to apply clamping on IO boost while there is blocked RT
+ * utilization.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
return;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 859192ec492c..a7e9b7041ea5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2262,6 +2262,34 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * clamp_util: clamp a utilization value for a specified CPU
+ * @rq: the CPU's RQ to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently RUNNABLE tasks. Given a utilization signal, i.e a signal in
+ * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = rq->uclamp.value[UCLAMP_MIN];
+ unsigned int max_util = rq->uclamp.value[UCLAMP_MAX];
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.18.0
Utilization clamping allows to clamp the utilization of a CPU within a
[util_min, util_max] range which depends on the set of currently
RUNNABLE tasks on that CPU.
Each task references two "clamp groups" defining the minimum and maximum
utilization clamp values to be considered for that task. Each clamp
value is mapped to a clamp group, which is enforced on a CPU only when
there is at least one RUNNABLE task referencing that clamp group.
When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
it's required to re-compute the new "aggregated" clamp value to
apply on that CPU.
Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.
Tasks have a:
task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group
they have to refcount at enqueue time.
CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index.
The clamp value of each clamp group is tracked by
rq::uclamp::group[][].value
thus making rq::uclamp::group[][] an unordered array of clamp values.
However, the MAX aggregation of the currently active clamp groups is
implemented to minimize the number of times we need to scan the complete
(unordered) clamp group array to figure out the new max value. This
operation indeed happens only when we dequeue the last task of the clamp
group corresponding to the current max clamp, and thus the CPU is either
entering IDLE or going to schedule a less boosted or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect fewer than
10 different clamp values for each clamp index.
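The refcounting and lazy MAX re-aggregation described above can be sketched in user space as follows (a simplified model, not the patch's code: the real implementation also guards the refcount decrement and handles the clamp group's nominal vs effective values):

```c
#include <assert.h>

#define UCLAMP_GROUPS 4

struct uclamp_group {
	unsigned int value;	/* clamp value enforced by this group */
	unsigned int tasks;	/* RUNNABLE tasks refcounting it */
};

struct cpu_clamp {
	struct uclamp_group group[UCLAMP_GROUPS];
	unsigned int value;	/* current MAX-aggregated clamp */
};

/* Enqueue: refcount the task's clamp group and raise the CPU clamp if
 * this group's value is higher than the current aggregate. */
static void clamp_get(struct cpu_clamp *cc, unsigned int group_id)
{
	cc->group[group_id].tasks += 1;
	if (cc->value < cc->group[group_id].value)
		cc->value = cc->group[group_id].value;
}

/* Dequeue: drop the refcount; only when the last task of the group
 * holding the current max leaves do we rescan the (unordered) group
 * array to find the new max. */
static void clamp_put(struct cpu_clamp *cc, unsigned int group_id)
{
	cc->group[group_id].tasks -= 1;
	if (cc->group[group_id].tasks)
		return;
	if (cc->group[group_id].value >= cc->value) {
		unsigned int max_value = 0;

		for (int g = 0; g < UCLAMP_GROUPS; g++) {
			if (!cc->group[g].tasks)
				continue;
			if (cc->group[g].value > max_value)
				max_value = cc->group[g].value;
		}
		cc->value = max_value;
	}
}

/* Demo: two groups with values 100 and 300; enqueue one task in each,
 * then dequeue both; return the CPU clamp after `steps` events. */
static unsigned int demo(int steps)
{
	struct cpu_clamp cc = { .value = 0 };

	cc.group[0].value = 100;
	cc.group[1].value = 300;

	if (steps > 0) clamp_get(&cc, 0);	/* task A enqueues */
	if (steps > 1) clamp_get(&cc, 1);	/* task B enqueues */
	if (steps > 2) clamp_put(&cc, 1);	/* task B dequeues: rescan */
	if (steps > 3) clamp_put(&cc, 0);	/* CPU goes idle */
	return cc.value;
}
```

Note how the full array scan only happens on the dequeue of the last task of the max-value group, matching the changelog's argument that a more advanced ordering structure is unnecessary.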
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <20180914134128.GP1413@e110439-lin>
- remove unneeded check for (group_id == UCLAMP_NOT_VALID)
in uclamp_cpu_put_id
Message-ID: <20180912174456.GJ1413@e110439-lin>
- use bitfields to compress uclamp_group
Others:
- consistently use "unsigned int" for both clamp_id and group_id
- fixup documentation
- reduced usage of inline comments
- rebased on v4.19.0
Changes in v4:
Message-ID: <20180816133249.GA2964@e110439-lin>
- keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
- add another WARN on the unexpected condition of releasing a refcount
from a CPU which has a lower clamp value active
Other:
- ensure (and check) that all tasks have a valid group_id at
uclamp_cpu_get_id()
- rework uclamp_cpu layout to better fit into just 2x64B cache lines
- fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Other:
- rebased on v4.18-rc4
- improved documentation to make more explicit some concepts.
---
include/linux/sched.h | 5 ++
kernel/sched/core.c | 185 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 49 +++++++++++
3 files changed, 239 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index facace271ea1..3ab1cbd4e3b1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,11 +604,16 @@ struct sched_dl_entity {
* The mapped bit is set whenever a task has been mapped on a clamp group for
* the first time. When this bit is set, any clamp group get (for a new clamp
+ * value) will be matched by a clamp group put (for the old clamp value).
+ *
+ * The active bit is set whenever a task has an effective clamp group
+ * and value assigned, which can be different from the user-requested ones.
+ * This makes it possible to know whether a task is actually refcounting a
+ * CPU's clamp group.
*/
struct uclamp_se {
unsigned int value : SCHED_CAPACITY_SHIFT + 1;
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
+ unsigned int active : 1;
};
#endif /* CONFIG_UCLAMP_TASK */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 654327d7f212..a98a96a7d9f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -783,6 +783,159 @@ union uclamp_map {
*/
static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @rq: the CPU's rq which utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups can change. Since each clamp group enforces a different
+ * utilization clamp value, once the set of active groups changes it can be
+ * required to re-compute the new clamp value to apply for that CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of active clamp groups.
+ */
+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+{
+ unsigned int group_id;
+ int max_value = 0;
+
+ for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
+ if (!rq->uclamp.group[clamp_id][group_id].tasks)
+ continue;
+ /* Both min and max clamps are MAX aggregated */
+ if (max_value < rq->uclamp.group[clamp_id][group_id].value)
+ max_value = rq->uclamp.group[clamp_id][group_id].value;
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ rq->uclamp.value[clamp_id] = max_value;
+}
+
+/**
+ * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @rq: the CPU's rq where the clamp group has to be reference counted
+ * @clamp_id: the clamp index to update
+ *
+ * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
+ * the task's uclamp::group_id is reference counted on that CPU.
+ */
+static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int group_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ group_id = p->uclamp[clamp_id].group_id;
+ p->uclamp[clamp_id].active = true;
+
+ rq->uclamp.group[clamp_id][group_id].tasks += 1;
+
+ if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
+ rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
+}
+
+/**
+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @rq: the CPU's rq from where the clamp group has to be released
+ * @clamp_id: the clamp index to update
+ *
+ * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
+ * counted by the task is released.
+ * If this was the last task reference counting the current max clamp group,
+ * then the CPU clamping is updated to find the new max for the specified
+ * clamp index.
+ */
+static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value;
+ unsigned int group_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ group_id = p->uclamp[clamp_id].group_id;
+ p->uclamp[clamp_id].active = false;
+
+ if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
+ rq->uclamp.group[clamp_id][group_id].tasks -= 1;
+#ifdef CONFIG_SCHED_DEBUG
+ else {
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+
+ if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
+ return;
+
+ clamp_value = rq->uclamp.group[clamp_id][group_id].value;
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
+ cpu_of(rq), clamp_id, group_id);
+ }
+#endif
+ if (clamp_value >= rq->uclamp.value[clamp_id])
+ uclamp_cpu_update(rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_get(): increase CPU's clamp group refcount
+ * @rq: the CPU's rq where the task is enqueued
+ * @p: the task being enqueued
+ *
+ * When a task is enqueued on a CPU's rq, all the clamp groups currently
+ * enforced on a task are reference counted on that rq. Since not all
+ * scheduling classes support utilization clamping, tasks belonging to
+ * those classes are silently ignored.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_get_id(p, rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_put(): decrease CPU's clamp group refcount
+ * @rq: the CPU's rq from where the task is dequeued
+ * @p: the task being dequeued
+ *
+ * When a task is dequeued from a CPU's rq, all the clamp groups the task has
+ * reference counted at enqueue time are now released.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_put_id(p, rq, clamp_id);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
unsigned int free_group_id;
unsigned int group_id;
unsigned long res;
+ int cpu;
retry:
@@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
if (res != uc_map_old.data)
goto retry;
+ /* Ensure each CPU tracks the correct value for this clamp group */
+ if (likely(uc_map_new.se_count > 1))
+ goto done;
+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ /* The refcount is expected to be 0 for free groups */
+ if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
+ uc_cpu->group[clamp_id][group_id].tasks = 0;
+#ifdef CONFIG_SCHED_DEBUG
+ WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
+ cpu, clamp_id, group_id);
+#endif
+ }
+
+ if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
+ continue;
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ }
+
+done:
+
/* Update SE's clamp values and attach it to new clamp group */
uc_se->value = clamp_value;
uc_se->group_id = group_id;
@@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
clamp_value = uclamp_none(clamp_id);
p->uclamp[clamp_id].mapped = false;
+ p->uclamp[clamp_id].active = false;
uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
}
}
@@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
{
struct uclamp_se *uc_se;
unsigned int clamp_id;
+ int cpu;
mutex_init(&uclamp_mutex);
+ for_each_possible_cpu(cpu)
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
+
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
@@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
}
#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
+ uclamp_cpu_get(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);
+ uclamp_cpu_put(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 947ab14d3d5b..1755c9c9f4f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * struct uclamp_group - Utilization clamp Group
+ * @value: utilization clamp value for tasks on this clamp group
+ * @tasks: number of RUNNABLE tasks on this clamp group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ unsigned long value : SCHED_CAPACITY_SHIFT + 1;
+ unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
+};
+
+/**
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @group: utilization clamp groups affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+ int value[UCLAMP_CNT];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -804,6 +848,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;
+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
--
2.18.0
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can be any value in the
[0..SCHED_CAPACITY_SCALE] range. Tasks with a user-defined clamp value
are allowed to request any value in that range, and we currently
unconditionally enforce the required clamps.
However, a "System Management Software" could be interested in
unconditionally limiting the range of clamp values allowed for all
tasks.
Let's fix this by explicitly adding a privileged interface to define a
system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
If a task specific value is not compliant with the system default range,
it will be forced to the corresponding system default value.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
The current restriction could be too aggressive since, for example, if a
task has a util_min which is higher than the system default max, it
will be unconditionally forced to the system default min.
We should probably restrict util_min to the maximum system default
value instead, but that would make the code more complex, so we defer
it to a future update.
Changes in v5:
Other:
- rebased on v4.19
Changes in v4:
Message-ID: <20180820122728.GM2960@e110439-lin>
- fix unwanted reset of clamp values on refcount success
Others:
- by default all tasks have a UCLAMP_NOT_VALID task specific clamp
- always use:
p->uclamp[clamp_id].effective.value
to track the actual clamp value the task has been refcounted into.
This matches with the usage of
p->uclamp[clamp_id].effective.group_id
- rebased on v4.19-rc1
---
include/linux/sched.h | 5 ++
include/linux/sched/sysctl.h | 11 +++
kernel/sched/core.c | 131 ++++++++++++++++++++++++++++++++---
kernel/sysctl.c | 16 +++++
4 files changed, 154 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3ab1cbd4e3b1..ec6783ea4e7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -614,6 +614,11 @@ struct uclamp_se {
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
unsigned int active : 1;
+ /* Clamp group and value actually used by a RUNNABLE task */
+ struct {
+ unsigned int value : SCHED_CAPACITY_SHIFT + 1;
+ unsigned int group_id : order_base_2(UCLAMP_GROUPS);
+ } effective;
};
#endif /* CONFIG_UCLAMP_TASK */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a9c32daeb9d8..445fb54eaeff 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
+#ifdef CONFIG_UCLAMP_TASK
+extern int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9b49062439f3..8421ef96ec97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -729,6 +729,23 @@ static void set_load_weight(struct task_struct *p, bool update_load)
*/
static DEFINE_MUTEX(uclamp_mutex);
+/*
+ * Minimum utilization for all tasks
+ * default: 0
+ */
+unsigned int sysctl_sched_uclamp_util_min;
+
+/*
+ * Maximum utilization for all tasks
+ * default: 1024
+ */
+unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+
+/*
+ * Tasks's clamp values are required to be within this range
+ */
+static struct uclamp_se uclamp_default[UCLAMP_CNT];
+
/**
* uclamp_map: reference count utilization clamp groups
* @value: the utilization "clamp value" tracked by this clamp group
@@ -857,6 +874,55 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
rq->uclamp.value[clamp_id] = max_value;
}
+/**
+ * uclamp_effective_group_id: get the effective clamp group index of a task
+ * @p: the task to get the effective clamp value for
+ * @clamp_id: the clamp index to consider
+ *
+ * The effective clamp group index of a task depends on:
+ * - the task specific clamp value, explicitly requested from userspace
+ * - the system default clamp value, defined by the sysadmin
+ * Task specific clamp values are always restricted by the system
+ * default clamp values.
+ *
+ * This method returns the effective group index for a task, depending on its
+ * status and a proper aggregation of the clamp values listed above.
+ * Moreover, it ensures that the task's effective value:
+ * task_struct::uclamp::effective::value
+ * is updated to represent the clamp value corresponding to the task's effective
+ * group index.
+ */
+static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value;
+ unsigned int group_id;
+
+ /* Task currently refcounted into a CPU clamp group */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].effective.group_id;
+
+ /* Task specific clamp value */
+ clamp_value = p->uclamp[clamp_id].value;
+ group_id = p->uclamp[clamp_id].group_id;
+
+ /* System default restriction */
+ if (unlikely(clamp_value < uclamp_default[UCLAMP_MIN].value ||
+ clamp_value > uclamp_default[UCLAMP_MAX].value)) {
+ /*
+ * Unconditionally enforce system defaults, which is a simpler
+ * solution compared to a proper clamping.
+ */
+ clamp_value = uclamp_default[clamp_id].value;
+ group_id = uclamp_default[clamp_id].group_id;
+ }
+
+ p->uclamp[clamp_id].effective.value = clamp_value;
+ p->uclamp[clamp_id].effective.group_id = group_id;
+
+ return group_id;
+}
+
/**
* uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
* @p: the task being enqueued on a CPU
@@ -869,16 +935,17 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
{
- unsigned int clamp_value;
+ unsigned int effective;
unsigned int group_id;
if (unlikely(!p->uclamp[clamp_id].mapped))
return;
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = uclamp_effective_group_id(p, clamp_id);
p->uclamp[clamp_id].active = true;
rq->uclamp.group[clamp_id][group_id].tasks += 1;
+ effective = p->uclamp[clamp_id].effective.value;
if (unlikely(rq->uclamp.flags & UCLAMP_FLAG_IDLE)) {
/*
@@ -889,16 +956,15 @@ static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
*/
if (clamp_id == UCLAMP_MAX)
rq->uclamp.flags &= ~UCLAMP_FLAG_IDLE;
- rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
+ rq->uclamp.value[clamp_id] = effective;
}
/* CPU's clamp groups track the max effective clamp value */
- clamp_value = p->uclamp[clamp_id].value;
- if (clamp_value > rq->uclamp.group[clamp_id][group_id].value)
- rq->uclamp.group[clamp_id][group_id].value = clamp_value;
+ if (effective > rq->uclamp.group[clamp_id][group_id].value)
+ rq->uclamp.group[clamp_id][group_id].value = effective;
- if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
- rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
+ if (rq->uclamp.value[clamp_id] < effective)
+ rq->uclamp.value[clamp_id] = effective;
}
/**
@@ -922,7 +988,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
if (unlikely(!p->uclamp[clamp_id].mapped))
return;
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = uclamp_effective_group_id(p, clamp_id);
p->uclamp[clamp_id].active = false;
if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
@@ -1172,6 +1238,50 @@ static void uclamp_group_get(struct task_struct *p, struct uclamp_se *uc_se,
uc_se->mapped = true;
}
+int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int old_min, old_max;
+ int result = 0;
+
+ mutex_lock(&uclamp_mutex);
+
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto undo;
+ }
+
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uclamp_group_get(NULL, &uclamp_default[UCLAMP_MIN],
+ UCLAMP_MIN, sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uclamp_group_get(NULL, &uclamp_default[UCLAMP_MAX],
+ UCLAMP_MAX, sysctl_sched_uclamp_util_max);
+ }
+ goto done;
+
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
static int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1268,6 +1378,9 @@ static void __init init_uclamp(void)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+
+ uc_se = &uclamp_default[clamp_id];
+ uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
}
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050fd0c4..378ea57e5fc5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -445,6 +445,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.18.0
The cgroup's CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However, this bandwidth is
defined and enforced only on a time basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can vary greatly
depending on the frequency the CPU runs at while executing that task.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Given the above mechanisms, it is now possible to extend the cpu
controller to specify the minimum (or maximum) utilization which
a task is expected (or allowed) to generate.
Constraints on minimum and maximum utilization allowed for tasks in a
CPU cgroup can improve control over the actual amount of CPU bandwidth
consumed by tasks.
Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can be
conveniently used to bias the placement of important interactive tasks
towards more capable CPUs, or to run low priority and background tasks
on more energy efficient CPUs.
The ultimate goal of utilization clamping is thus to enable:
- boosting: by selecting a higher capacity CPU and/or higher execution
frequency for small tasks which are affecting the user
interactive experience.
- capping: by selecting more energy efficient CPUs or lower execution
frequency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.
Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich hints to the scheduler
regarding the tasks it's going to schedule.
This patch extends the CPU controller by adding a couple of new
attributes, util.min and util.max, which can be used to enforce
utilization boosting and capping of tasks. Specifically:
- util.min: defines the minimum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run at least at a minimum frequency which
corresponds to the min_util utilization
- util.max: defines the maximum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run up to a maximum frequency which
corresponds to the max_util utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups
b) do not enforce any constraints and/or dependency between the parent
and its child nodes, thus relying:
- on permission settings defined by the system management software,
to define if subgroups can configure their clamp values
- on the delegation model, to ensure that effective clamps are
updated to consider both subgroup requests and parent group
constraints
c) allow further restricting, where needed, task-specific clamps defined
via sched_setattr(2)
This patch provides the basic support to expose the two new attributes
and to validate their run-time updates, while not yet actually
allocating clamp groups.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Others:
- consolidate init_uclamp_sched_group() into init_uclamp()
- refcount root_task_group's clamp groups from init_uclamp()
- small documentation fixes
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
Others
- rebased on v4.19-rc1
Changes in v2:
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system wide API seems of not immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed for the new attributes
Others:
- rebased on v4.18-rc4
- reduced code to simplify the review of this patch
which now provides just the basic code for CGroups integration
- add attributes to the default hierarchy as well as the legacy one
- use -ERANGE as range violation error
These additional bits:
- refcounting of clamp groups
- RUNNABLE tasks refcount updates
- aggregation of per-task and per-task_group utilization constraints
are provided in separate and following patches to make it more clear and
documented how they are performed.
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++
init/Kconfig | 22 ++++
kernel/sched/core.c | 149 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
4 files changed, 201 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..a6907266e52f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -907,6 +907,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+By default, cycles distribution is time based and does not account for
+the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum
+bandwidth, which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -966,6 +972,25 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no bandwidth boosting.
+
+ The minimum utilization in the range [0, 1024].
+
+ This interface allows reading and setting minimum utilization clamp
+ values, similar to sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1024", i.e. no bandwidth capping.
+
+ The maximum utilization in the range [0, 1024].
+
+ This interface allows reading and setting maximum utilization clamp
+ values, similar to sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index 4c5475030286..83e4c987641e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -827,6 +827,28 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping the maximum frequency a task
+ can use, while the min bandwidth allows defining a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, a task-specific
+ clamp value, if specified, is constrained by the cgroup's clamp value.
+ Neither the minimum nor the maximum task clamp can be bigger than the
+ corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b9dd1980ec93..9b06e8f156f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1392,6 +1392,13 @@ static void __init init_uclamp(void)
/* RT tasks by default will go to max frequency */
uc_se = &uclamp_default_perf[clamp_id];
uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Init root TG's clamp group */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->group_id = 0;
+#endif
}
}
@@ -6962,6 +6969,41 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * alloc_uclamp_sched_group: initialize a new TG for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same clamp group index as its parent.
+ *
+ * Return: 0 on error
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ tg->uclamp[clamp_id].value =
+ parent->uclamp[clamp_id].value;
+ tg->uclamp[clamp_id].group_id =
+ parent->uclamp[clamp_id].group_id;
+ }
+
+ return 1;
+}
+#else
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6985,6 +7027,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;
err:
@@ -7205,6 +7250,84 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7542,6 +7665,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7709,6 +7844,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7e9b7041ea5..c766abf0721d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -391,6 +391,11 @@ struct task_group {
#endif
struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.18.0
Utilization clamping requires mapping each different clamp value into one
of the available clamp groups used in the scheduler fast-path to account
for RUNNABLE tasks. Thus, each time a TG's clamp value sysfs attribute
is updated via:
cpu_util_{min,max}_write_u64()
we need to update the task group reference to the new value's clamp
group and release the reference to the previous one.
Let's ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time).
We do that by slightly refactoring uclamp_group_get() to make the
*task_struct parameter optional. This allows reusing the code already
available to support the per-task API.
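To illustrate the get-new/put-old pattern described above, here is a standalone sketch (not kernel code: find_group_for(), group_get(), group_put() and tg_set_clamp() are made-up stand-ins for uclamp_group_find()/uclamp_group_get()/uclamp_group_put() and the sysfs write handlers):

```c
#define NGROUPS 4

/* Per-group count of scheduling entities referencing each clamp group. */
static int refcount[NGROUPS];

/* Map a clamp value in [0..1024] to a group slot: illustrative only,
 * one group per quarter of the range. */
static int find_group_for(int value)
{
	return value * NGROUPS / 1025;
}

static int group_get(int value)
{
	int group_id = find_group_for(value);

	refcount[group_id]++;
	return group_id;
}

static void group_put(int group_id)
{
	refcount[group_id]--;
}

/* A TG attribute write: reference the new value's clamp group first,
 * then release the reference to the previous one. */
static int tg_set_clamp(int old_group, int new_value)
{
	int new_group = group_get(new_value);

	group_put(old_group);
	return new_group;
}
```

The ordering (get before put) mirrors the invariant the patch enforces: a task group always holds a valid clamp group reference.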
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Others:
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add explicit calls to uclamp_group_find(), which is no longer
part of uclamp_group_get()
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 7 +++--
kernel/sched/core.c | 64 +++++++++++++++++++++++++++++++++++--------
2 files changed, 56 insertions(+), 15 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d3f6bf62ab3f..7698e7554892 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,9 +601,10 @@ struct sched_dl_entity {
* clamp group index (group_id), i.e.
* index of the per-cpu RUNNABLE tasks refcounting array
*
- * The mapped bit is set whenever a task has been mapped on a clamp group for
- * the first time. When this bit is set, any clamp group get (for a new clamp
- * value) will be matches by a clamp group put (for the old clamp value).
+ * The mapped bit is set whenever a scheduling entity has been mapped on a
+ * clamp group for the first time. When this bit is set, any clamp group get
+ * (for a new clamp value) will be matched by a clamp group put (for the old
+ * clamp value).
*
* The active bit is set whenever a task has got an effective clamp group
* and value assigned, which can be different from the user requested ones.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb49bffb3da8..3dcd1c17a244 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1396,9 +1396,9 @@ static void __init init_uclamp(void)
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp group */
uc_se = &root_task_group.uclamp[clamp_id];
- uc_se->value = uclamp_none(clamp_id);
- uc_se->group_id = 0;
- uc_se->effective.value = uclamp_none(clamp_id);
+ uclamp_group_get(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uc_se->effective.group_id = uc_se->group_id;
+ uc_se->effective.value = uc_se->value;
#endif
}
}
@@ -6971,6 +6971,22 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
static DEFINE_SPINLOCK(task_group_lock);
#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * free_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all of its references
+ * to clamp groups.
+ */
+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_group_put(clamp_id, tg->uclamp[clamp_id].group_id);
+}
+
/**
* alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
* @tg: the newly created task group
@@ -6989,17 +7005,18 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
int clamp_id;
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- tg->uclamp[clamp_id].value =
- parent->uclamp[clamp_id].value;
- tg->uclamp[clamp_id].group_id =
- parent->uclamp[clamp_id].group_id;
+ uclamp_group_get(NULL, &tg->uclamp[clamp_id], clamp_id,
+ parent->uclamp[clamp_id].value);
tg->uclamp[clamp_id].effective.value =
parent->uclamp[clamp_id].effective.value;
+ tg->uclamp[clamp_id].effective.group_id =
+ parent->uclamp[clamp_id].effective.group_id;
}
return 1;
}
#else
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -7009,6 +7026,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -7258,6 +7276,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
* cpu_util_update_hier: propagate effective clamp down the hierarchy
* @css: the task group to update
* @clamp_id: the clamp index to update
+ * @group_id: the group index mapping the new task clamp value
* @value: the new task group clamp value
*
* The effective clamp for a TG is expected to track the most restrictive
@@ -7277,9 +7296,13 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
* be propagated down to all the descendants. When a subgroup is found which
* has already its effective clamp value matching its clamp value, then we can
* safely skip all its descendants which are guaranteed to be already in sync.
+ *
+ * The TG's group_id is also updated to ensure it tracks the effective clamp
+ * value.
*/
static void cpu_util_update_hier(struct cgroup_subsys_state *css,
- int clamp_id, unsigned int value)
+ unsigned int clamp_id, unsigned int group_id,
+ unsigned int value)
{
struct cgroup_subsys_state *top_css = css;
struct uclamp_se *uc_se, *uc_parent;
@@ -7291,8 +7314,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
* groups we consider their current value.
*/
uc_se = &css_tg(css)->uclamp[clamp_id];
- if (css != top_css)
+ if (css != top_css) {
value = uc_se->value;
+ group_id = uc_se->effective.group_id;
+ }
/*
* Skip the whole subtrees if the current effective clamp is
@@ -7308,12 +7333,15 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
}
/* Propagate the most restrictive effective value */
- if (uc_parent->effective.value < value)
+ if (uc_parent->effective.value < value) {
value = uc_parent->effective.value;
+ group_id = uc_parent->effective.group_id;
+ }
if (uc_se->effective.value == value)
continue;
uc_se->effective.value = value;
+ uc_se->effective.group_id = group_id;
}
}
@@ -7326,6 +7354,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
+ mutex_lock(&uclamp_mutex);
rcu_read_lock();
tg = css_tg(css);
@@ -7336,11 +7365,16 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
goto out;
}
+ /* Update TG's reference count */
+ uclamp_group_get(NULL, &tg->uclamp[UCLAMP_MIN], UCLAMP_MIN, min_value);
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+ cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].group_id,
+ min_value);
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
return ret;
}
@@ -7354,6 +7388,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
+ mutex_lock(&uclamp_mutex);
rcu_read_lock();
tg = css_tg(css);
@@ -7364,11 +7399,16 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
goto out;
}
+ /* Update TG's reference count */
+ uclamp_group_get(NULL, &tg->uclamp[UCLAMP_MAX], UCLAMP_MAX, max_value);
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+ cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].group_id,
+ max_value);
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
return ret;
}
--
2.18.0
When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated to those of the CPU's
controller's Task Group (TG) in which the task is currently living.
Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG effective value.
This also mimics what already happens for a task's CPU affinity mask when
the task is also living in a cpuset. The overall idea is that cgroup
attributes are always used to restrict the per-task attributes.
Thus, this implementation allows us to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG
For this mechanism to work properly, we exploit the concept of
"effective" clamp, which is already used by a TG to track parent
enforced restrictions.
In this patch we re-use the same variable:
task_struct::uclamp::effective::group_id
to track the currently most restrictive clamp group each task is
subject to, and thus is also currently refcounted into.
This solution also allows better decoupling of the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.
For consistency purposes, as well as to properly inform userspace, the
sched_getattr(2) call is updated to always return the properly
aggregated constraints as described above. This will also make
sched_getattr(2) a convenient userspace API to know the utilization
constraints enforced on a task by the cgroup's CPU controller.
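The aggregation rule above can be sketched in a few lines (illustrative only; struct clamp and effective_clamp() are not the kernel's data structures, just a model of the "restrict, but allow self-demotion" semantics):

```c
/* A scheduling entity's clamp request: the value and whether it was
 * explicitly set from userspace via sched_setattr(2). */
struct clamp {
	int value;
	int user_defined;
};

/* Compute the clamp value a task is actually accounted with, given the
 * effective value of the task group it lives in. */
static int effective_clamp(struct clamp task, int tg_effective)
{
	/* No task-specific request: the TG's effective value applies. */
	if (!task.user_defined)
		return tg_effective;
	/* "Nice" semantics: a task may only demote itself, i.e. ask for
	 * less than the TG grants, never for more. */
	return task.value > tg_effective ? tg_effective : task.value;
}
```

This is the value sched_getattr(2) would report back to userspace after aggregation.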
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v4:
Message-ID: <20180816140731.GD2960@e110439-lin>
- reuse already existing:
task_struct::uclamp::effective::group_id
instead of adding:
task_struct::uclamp_group_id
to back annotate the effective clamp group in which a task has been
refcounted
Others:
- small documentation fixes
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- fix not required override
- fix typos in changelog
Others:
- clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
clamp group_id/value code into dedicated getter functions:
uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
- rebased on tip/sched/core
Changes in v2:
OSPM discussion:
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values, i.e. tasks running on a
TG are only allowed to demote themself.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 9 +++++++
kernel/sched/core.c | 58 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7698e7554892..4b61fbcb0797 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -609,12 +609,21 @@ struct sched_dl_entity {
* The active bit is set whenever a task has got an effective clamp group
* and value assigned, which can be different from the user requested ones.
* This allows to know a task is actually refcounting a CPU's clamp group.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this
+ * task just as a restriction. This allows relaxing a TG's clamps when a less
+ * restrictive task-specific value has been defined, thus making it possible to
+ * implement a "nice" semantic when both task group and task-specific values
+ * are used. For example, a task running on a 20% boosted TG can still drop
+ * its own boosting to 0%.
*/
struct uclamp_se {
unsigned int value : SCHED_CAPACITY_SHIFT + 1;
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
unsigned int active : 1;
+ unsigned int user_defined : 1;
/*
* Clamp group and value actually used by a scheduling entity,
* i.e. a (RUNNABLE) task or a task group.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2292c698e3b..2ce84d22ab17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -875,6 +875,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
rq->uclamp.value[clamp_id] = max_value;
}
+/**
+ * uclamp_apply_defaults: check if p is subject to system default clamps
+ * @p: the task to check
+ *
+ * Tasks in the root group or autogroups are always and only limited by system
+ * defaults. All other tasks are instead limited by their TG's specific value.
+ * This method checks the conditions under which a task is subject to system
+ * default clamps.
+ */
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline bool uclamp_apply_defaults(struct task_struct *p)
+{
+ if (task_group_is_autogroup(task_group(p)))
+ return true;
+ if (task_group(p) == &root_task_group)
+ return true;
+ return false;
+}
+#else
+#define uclamp_apply_defaults(p) true
+#endif
+
/**
* uclamp_effective_group_id: get the effective clamp group index of a task
* @p: the task to get the effective clamp value for
@@ -882,9 +904,11 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
*
* The effective clamp group index of a task depends on:
* - the task specific clamp value, explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not in the root group or
+ * in an autogroup
* - the system default clamp value, defined by the sysadmin
- * and tasks specific's clamp values are always restricted by system
- * defaults clamp values.
+ * and task-specific clamp values are always restricted, with increasing
+ * priority, by their task group first and by the system defaults after.
*
* This method returns the effective group index for a task, depending on its
* status and a proper aggregation of the clamp values listed above.
@@ -908,6 +932,22 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
clamp_value = p->uclamp[clamp_id].value;
group_id = p->uclamp[clamp_id].group_id;
+ if (!uclamp_apply_defaults(p)) {
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ unsigned int clamp_max =
+ task_group(p)->uclamp[clamp_id].effective.value;
+ unsigned int group_max =
+ task_group(p)->uclamp[clamp_id].effective.group_id;
+
+ if (!p->uclamp[clamp_id].user_defined ||
+ clamp_value > clamp_max) {
+ clamp_value = clamp_max;
+ group_id = group_max;
+ }
+#endif
+ goto done;
+ }
+
/* RT tasks have different default values */
default_clamp = task_has_rt_policy(p)
? uclamp_default_perf
@@ -924,6 +964,8 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
group_id = default_clamp[clamp_id].group_id;
}
+done:
+
p->uclamp[clamp_id].effective.value = clamp_value;
p->uclamp[clamp_id].effective.group_id = group_id;
@@ -936,8 +978,10 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
* @rq: the CPU's rq where the clamp group has to be reference counted
* @clamp_id: the clamp index to update
*
- * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
- * the task's uclamp::group_id is reference counted on that CPU.
+ * Once a task is enqueued on a CPU's rq, with increasing priority, we
+ * reference count the most restrictive clamp group among the task-specific
+ * clamp value, the clamp value of its task group, and the system default clamp
+ * value.
*/
static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
@@ -1312,10 +1356,12 @@ static int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ p->uclamp[UCLAMP_MIN].user_defined = true;
uclamp_group_get(p, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ p->uclamp[UCLAMP_MAX].user_defined = true;
uclamp_group_get(p, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
@@ -1359,8 +1405,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = p->uclamp[clamp_id].value;
- if (unlikely(reset))
+ if (unlikely(reset)) {
clamp_value = uclamp_none(clamp_id);
+ p->uclamp[clamp_id].user_defined = false;
+ }
p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
--
2.18.0
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are (potentially) constrained based on parent's assigned
resources. This requires to properly propagate and aggregate parent
attributes down to its descendants.
Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This also represents the clamp value which is actually
used to clamp tasks in each task group.
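A minimal sketch of that top-down min-propagation (illustrative: propagate() and the array-based tree are not the kernel implementation, which walks css descendants in preorder and skips already-synced subtrees):

```c
#define MAXG 8

/* Effective clamp value per task group, filled in by propagate(). */
static int eff[MAXG];

/*
 * value[i]:  the clamp value requested for group i
 * parent[i]: index of i's parent, -1 for the root
 * Groups are assumed ordered so a parent index is always smaller than
 * any of its children (preorder), matching a top-down walk.
 */
static void propagate(const int *value, const int *parent, int n)
{
	for (int i = 0; i < n; i++) {
		int v = value[i];

		/* Effective value: the smaller of the group's own value
		 * and its parent's effective value. */
		if (parent[i] >= 0 && eff[parent[i]] < v)
			v = eff[parent[i]];
		eff[i] = v;
	}
}
```

For example, a group requesting 800 under a parent whose effective value is 500 ends up with an effective value of 500, which is what cpu.util.max.effective would report.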
Since it can be interesting for tasks in a cgroup to know exactly what
the currently propagated/enforced configuration is, the effective clamp
values are exposed to user-space by means of a new pair of read-only
attributes: cpu.util.{min,max}.effective.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Message-ID: <20180912125133.GE1413@e110439-lin>
- make more clear the definition of cpu.util.min.effective
- small typos fixed
Others:
- rebased on v4.19
Changes in v4:
Message-ID: <20180816140731.GD2960@e110439-lin>
- add ".effective" attributes to the default hierarchy
Others:
- small documentation fixes
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <[email protected]>
- new patch in v3, to implement a suggestion from v1 review
---
Documentation/admin-guide/cgroup-v2.rst | 25 +++++-
include/linux/sched.h | 10 ++-
kernel/sched/core.c | 113 +++++++++++++++++++++++-
3 files changed, 141 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a6907266e52f..56bc56513721 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,22 +976,43 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.
- The minimum utilization in the range [0, 1024].
+ The requested minimum utilization in the range [0, 1024].
This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.
+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1024". i.e. no bandwidth capping
- The maximum utilization in the range [0, 1024].
+ The requested maximum utilization in the range [0, 1024].
This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.
+
+
Memory
------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec6783ea4e7d..d3f6bf62ab3f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -614,7 +614,15 @@ struct uclamp_se {
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
unsigned int active : 1;
- /* Clamp group and value actually used by a RUNNABLE task */
+ /*
+ * Clamp group and value actually used by a scheduling entity,
+ * i.e. a (RUNNABLE) task or a task group.
+ * For task groups, this is the value (possibly) enforced by a
+ * parent task group.
+ * For a task, this is the value (possibly) enforced by the
+ * task group the task is currently part of or by the system
+ * default clamp values, whichever is the most restrictive.
+ */
struct {
unsigned int value : SCHED_CAPACITY_SHIFT + 1;
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9b06e8f156f8..cb49bffb3da8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1398,6 +1398,7 @@ static void __init init_uclamp(void)
uc_se = &root_task_group.uclamp[clamp_id];
uc_se->value = uclamp_none(clamp_id);
uc_se->group_id = 0;
+ uc_se->effective.value = uclamp_none(clamp_id);
#endif
}
}
@@ -6992,6 +6993,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
parent->uclamp[clamp_id].value;
tg->uclamp[clamp_id].group_id =
parent->uclamp[clamp_id].group_id;
+ tg->uclamp[clamp_id].effective.value =
+ parent->uclamp[clamp_id].effective.value;
}
return 1;
@@ -7251,6 +7254,69 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * cpu_util_update_hier: propagate effective clamp down the hierarchy
+ * @css: the task group to update
+ * @clamp_id: the clamp index to update
+ * @value: the new task group clamp value
+ *
+ * The effective clamp for a TG is expected to track the most restrictive
+ * value between the TG's clamp value and its parent's effective clamp value.
+ * This method achieves that by:
+ * 1. updating the current TG effective value
+ * 2. walking all the descendant task groups that need an update
+ *
+ * A TG's effective clamp needs to be updated when its current value is not
+ * matching the TG's clamp value. In this case indeed either:
+ * a) the parent has got a more relaxed clamp value
+ * thus potentially we can relax the effective value for this group
+ * b) the parent has got a more strict clamp value
+ * thus potentially we have to restrict the effective value of this group
+ *
+ * Restrictions and relaxations of the current TG's effective clamp value must
+ * be propagated down to all the descendants. When a subgroup is found which
+ * has already its effective clamp value matching its clamp value, then we can
+ * safely skip all its descendants which are guaranteed to be already in sync.
+ */
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int value)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+
+ css_for_each_descendant_pre(css, top_css) {
+ /*
+ * The first visited task group is top_css, whose clamp value
+ * is the one passed as a parameter. For descendant task
+ * groups we consider their current value.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (css != top_css)
+ value = uc_se->value;
+
+ /*
+ * Skip the whole subtrees if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive one, as their effective clamp.
+ */
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+ if (uc_se->effective.value == value &&
+ uc_parent->effective.value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent->effective.value < value)
+ value = uc_parent->effective.value;
+ if (uc_se->effective.value == value)
+ continue;
+
+ uc_se->effective.value = value;
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -7270,6 +7336,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
goto out;
}
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+
out:
rcu_read_unlock();
@@ -7295,6 +7364,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
goto out;
}
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+
out:
rcu_read_unlock();
@@ -7302,14 +7374,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}
static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;
rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].effective.value
+ : tg->uclamp[clamp_id].value;
rcu_read_unlock();
return util_clamp;
@@ -7318,13 +7393,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}
static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */
@@ -7672,11 +7759,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7852,12 +7947,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
--
2.18.0
When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. However, when cgroups are
also in use, the per-task clamp values could be restricted by the TG in
which the task is currently living.
Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG effective value.
This also mimics what already happens for a task's CPU affinity mask when
the task is also living in a cpuset. The overall idea is that cgroup
attributes are always used to restrict per-task attributes.
Thus, this implementation allows us to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG
For this mechanism to work properly, we exploit the concept of
"effective" clamp, which is already used by a TG to track parent
enforced restrictions. Here we re-use the same variable:
task_struct::uclamp::effective::group_id
to track the currently most restrictive clamp group each task is
subject to, and thus currently refcounted into.
This solution also allows better decoupling of the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v5:
Others:
- rebased on v4.19
Changes in v4:
Message-ID: <20180816140731.GD2960@e110439-lin>
- reuse already existing:
task_struct::uclamp::effective::group_id
instead of adding:
task_struct::uclamp_group_id
to back annotate the effective clamp group in which a task has been
refcounted
Others:
- small documentation fixes
- rebased on v4.19-rc1
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- fix not required override
- fix typos in changelog
Others:
- clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
clamp group_id/value code into dedicated getter functions:
uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
- rebased on tip/sched/core
Changes in v2:
OSPM discussion:
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values, i.e. tasks running on a
TG are only allowed to demote themselves.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 9 +++++++
kernel/sched/core.c | 58 +++++++++++++++++++++++++++++++++++++++----
2 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7698e7554892..4b61fbcb0797 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -609,12 +609,21 @@ struct sched_dl_entity {
* The active bit is set whenever a task has got an effective clamp group
* and value assigned, which can be different from the user requested ones.
* This allows to know a task is actually refcounting a CPU's clamp group.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this
+ * task just as a restriction. This allows relaxing a TG's clamps when a less
+ * restrictive task-specific value has been defined, thus allowing to
+ * implement a "nice" semantic when both task group and task specific values
+ * are used. For example, a task running on a 20% boosted TG can still drop
+ * its own boosting to 0%.
*/
struct uclamp_se {
unsigned int value : SCHED_CAPACITY_SHIFT + 1;
unsigned int group_id : order_base_2(UCLAMP_GROUPS);
unsigned int mapped : 1;
unsigned int active : 1;
+ unsigned int user_defined : 1;
/*
* Clamp group and value actually used by a scheduling entity,
* i.e. a (RUNNABLE) task or a task group.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3dcd1c17a244..f2e35b0a1f0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -875,6 +875,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
rq->uclamp.value[clamp_id] = max_value;
}
+/**
+ * uclamp_apply_defaults: check if p is subject to system default clamps
+ * @p: the task to check
+ *
+ * Tasks in the root group or autogroups are always and only limited by system
+ * defaults. All other tasks are instead limited by their TG's specific value.
+ * This method checks the conditions under which a task is subject to system
+ * default clamps.
+ */
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline bool uclamp_apply_defaults(struct task_struct *p)
+{
+ if (task_group_is_autogroup(task_group(p)))
+ return true;
+ if (task_group(p) == &root_task_group)
+ return true;
+ return false;
+}
+#else
+#define uclamp_apply_defaults(p) true
+#endif
+
/**
* uclamp_effective_group_id: get the effective clamp group index of a task
* @p: the task to get the effective clamp value for
@@ -882,9 +904,11 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
*
* The effective clamp group index of a task depends on:
* - the task specific clamp value, explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not in the root group or
+ * in an autogroup
* - the system default clamp value, defined by the sysadmin
- * and tasks specific's clamp values are always restricted by system
- * defaults clamp values.
+ * and task-specific clamp values are always restricted, with increasing
+ * priority, by their task group first and by the system defaults after.
*
* This method returns the effective group index for a task, depending on its
* status and a proper aggregation of the clamp values listed above.
@@ -908,6 +932,22 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
clamp_value = p->uclamp[clamp_id].value;
group_id = p->uclamp[clamp_id].group_id;
+ if (!uclamp_apply_defaults(p)) {
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ unsigned int clamp_max =
+ task_group(p)->uclamp[clamp_id].effective.value;
+ unsigned int group_max =
+ task_group(p)->uclamp[clamp_id].effective.group_id;
+
+ if (!p->uclamp[clamp_id].user_defined ||
+ clamp_value > clamp_max) {
+ clamp_value = clamp_max;
+ group_id = group_max;
+ }
+#endif
+ goto done;
+ }
+
/* RT tasks have different default values */
default_clamp = task_has_rt_policy(p)
? uclamp_default_perf
@@ -924,6 +964,8 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
group_id = default_clamp[clamp_id].group_id;
}
+done:
+
p->uclamp[clamp_id].effective.value = clamp_value;
p->uclamp[clamp_id].effective.group_id = group_id;
@@ -936,8 +978,10 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
* @rq: the CPU's rq where the clamp group has to be reference counted
* @clamp_id: the clamp index to update
*
- * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
- * the task's uclamp::group_id is reference counted on that CPU.
+ * Once a task is enqueued on a CPU's rq, with increasing priority, we
+ * reference count the most restrictive clamp group among the task-specific
+ * clamp value, the clamp value of its task group and the system default clamp
+ * value.
*/
static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
@@ -1310,10 +1354,12 @@ static int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ p->uclamp[UCLAMP_MIN].user_defined = true;
uclamp_group_get(p, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ p->uclamp[UCLAMP_MAX].user_defined = true;
uclamp_group_get(p, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
@@ -1357,8 +1403,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = p->uclamp[clamp_id].value;
- if (unlikely(reset))
+ if (unlikely(reset)) {
clamp_value = uclamp_none(clamp_id);
+ p->uclamp[clamp_id].user_defined = false;
+ }
p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
--
2.18.0
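As an aside on the struct uclamp_se bitfields touched by the patch above, the
packing can be modelled in isolation; the group-bit count below is an
assumption matching a 32-group build, and the struct name is hypothetical:

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT 10		/* capacity scale: 0..1024 */
#define UCLAMP_GROUP_BITS 5		/* assumption: order_base_2(32) */

/*
 * Model of struct uclamp_se's bitfields: the clamp value needs
 * SCHED_CAPACITY_SHIFT + 1 bits to represent 0..1024 inclusive,
 * leaving spare bits for the mapped/active/user_defined flags.
 */
struct uclamp_se_model {
	unsigned int value        : SCHED_CAPACITY_SHIFT + 1;
	unsigned int group_id     : UCLAMP_GROUP_BITS;
	unsigned int mapped       : 1;
	unsigned int active       : 1;
	unsigned int user_defined : 1;
};
```

With 11 + 5 + 3 = 19 bits, the whole descriptor still fits in a single 32-bit
word, which is what makes the atomic handling mentioned in the cover letter
practical.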
Slightly older version posted in error along with the correct one.
Please comment on:
Message-ID: <[email protected]>
Sorry for the noise.
On 29-Oct 18:32, Patrick Bellasi wrote:
> Utilization clamping allows to clamp the utilization of a CPU within a
> [util_min, util_max] range which depends on the set of currently
> RUNNABLE tasks on that CPU.
> Each task references two "clamp groups" defining the minimum and maximum
> utilization clamp values to be considered for that task. These clamp
> values are mapped to a clamp group which is enforced on a CPU only when
> there is at least one RUNNABLE task referencing that clamp group.
>
> When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
> active on that CPU can change. Since each clamp group enforces a
> different utilization clamp value, once the set of these groups changes
> it's required to re-compute what is the new "aggregated" clamp value to
> apply on that CPU.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This is to ensure that no tasks can affect the performance of other
> co-scheduled tasks which are either more boosted (i.e. with higher
> util_min clamp) or less capped (i.e. with higher util_max clamp).
>
> Here we introduce the required support to properly reference count clamp
> groups at each task enqueue/dequeue time.
>
> Tasks have a:
> task_struct::uclamp::group_id[clamp_idx]
> indexing, for each clamp index (i.e. util_{min,max}), the clamp group
> they have to refcount at enqueue time.
>
> CPUs rq have a:
> rq::uclamp::group[clamp_idx][group_idx].tasks
> which is used to reference count how many tasks are currently RUNNABLE on
> that CPU for each clamp group of each clamp index.
>
> The clamp value of each clamp group is tracked by
> rq::uclamp::group[][].value
> thus making rq::uclamp::group[][] an unordered array of clamp values.
> However, the MAX aggregation of the currently active clamp groups is
> implemented to minimize the number of times we need to scan the complete
> (unordered) clamp group array to figure out the new max value. This
> operation indeed happens only when we dequeue the last task of the clamp
> group corresponding to the current max clamp, and thus the CPU is either
> entering IDLE or going to schedule a less boosted or more clamped task.
> Moreover, the expected number of different clamp values, which can be
> configured at build time, is usually so small that a more advanced
> ordering algorithm is not needed. In real use-cases we expect less than
> 10 different clamp values for each clamp index.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v5:
> Message-ID: <20180914134128.GP1413@e110439-lin>
> - remove not required check for (group_id == UCLAMP_NOT_VALID)
> in uclamp_cpu_put_id
> Message-ID: <20180912174456.GJ1413@e110439-lin>
> - use bitfields to compress uclamp_group
> Others:
> - consistently use "unsigned int" for both clamp_id and group_id
> - fixup documentation
> - reduced usage of inline comments
> - rebased on v4.19.0
>
> Changes in v4:
> Message-ID: <20180816133249.GA2964@e110439-lin>
> - keep the WARN in uclamp_cpu_put_id() but beautify a bit that code
> - add another WARN on the unexpected condition of releasing a refcount
> from a CPU which has a lower clamp value active
> Other:
> - ensure (and check) that all tasks have a valid group_id at
> uclamp_cpu_get_id()
> - rework uclamp_cpu layout to better fit into just 2x64B cache lines
> - fix some s/SCHED_DEBUG/CONFIG_SCHED_DEBUG/
> - rebased on v4.19-rc1
> Changes in v3:
> Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
> - add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
> - rename UCLAMP_NONE into UCLAMP_NOT_VALID
> Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
> - few typos fixed
> Other:
> - rebased on tip/sched/core
> Changes in v2:
> Message-ID: <[email protected]>
> - refactored struct rq::uclamp_cpu to be more cache efficient
> no more holes, re-arranged vectors to match cache lines with expected
> data locality
> Message-ID: <[email protected]>
> - use *rq as parameter whenever already available
> - add scheduling class's uclamp_enabled marker
> - get rid of the "confusing" single callback uclamp_task_update()
> and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
> - fix/remove "bad" comments
> Message-ID: <20180413113337.GU14248@e110439-lin>
> - remove inline from init_uclamp, flag it __init
> Other:
> - rebased on v4.18-rc4
> - improved documentation to make more explicit some concepts.
> ---
> include/linux/sched.h | 5 ++
> kernel/sched/core.c | 185 ++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 49 +++++++++++
> 3 files changed, 239 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index facace271ea1..3ab1cbd4e3b1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -604,11 +604,16 @@ struct sched_dl_entity {
> * The mapped bit is set whenever a task has been mapped on a clamp group for
> * the first time. When this bit is set, any clamp group get (for a new clamp
> * value) will be matched by a clamp group put (for the old clamp value).
> + *
> + * The active bit is set whenever a task has got an effective clamp group
> + * and value assigned, which can be different from the user requested ones.
> + * This allows to know a task is actually refcounting a CPU's clamp group.
> */
> struct uclamp_se {
> unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> unsigned int group_id : order_base_2(UCLAMP_GROUPS);
> unsigned int mapped : 1;
> + unsigned int active : 1;
> };
> #endif /* CONFIG_UCLAMP_TASK */
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 654327d7f212..a98a96a7d9f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -783,6 +783,159 @@ union uclamp_map {
> */
> static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_GROUPS];
>
> +/**
> + * uclamp_cpu_update: updates the utilization clamp of a CPU
> + * @rq: the CPU's rq which utilization clamp has to be updated
> + * @clamp_id: the clamp index to update
> + *
> + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
> + * clamp groups can change. Since each clamp group enforces a different
> + * utilization clamp value, once the set of active groups changes it can be
> + * required to re-compute what is the new clamp value to apply for that CPU.
> + *
> + * For the specified clamp index, this method computes the new CPU utilization
> + * clamp to use until the next change on the set of active clamp groups.
> + */
> +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> +{
> + unsigned int group_id;
> + int max_value = 0;
> +
> + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> + if (!rq->uclamp.group[clamp_id][group_id].tasks)
> + continue;
> + /* Both min and max clamps are MAX aggregated */
> + if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> + max_value = rq->uclamp.group[clamp_id][group_id].value;
> + if (max_value >= SCHED_CAPACITY_SCALE)
> + break;
> + }
> + rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the clamp index to update
> + *
> + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> + * the task's uclamp::group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int group_id;
> +
> + if (unlikely(!p->uclamp[clamp_id].mapped))
> + return;
> +
> + group_id = p->uclamp[clamp_id].group_id;
> + p->uclamp[clamp_id].active = true;
> +
> + rq->uclamp.group[clamp_id][group_id].tasks += 1;
> +
> + if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> + rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @rq: the CPU's rq from where the clamp group has to be released
> + * @clamp_id: the clamp index to update
> + *
> + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> + * counted by the task is released.
> + * If this was the last task reference counting the current max clamp group,
> + * then the CPU clamping is updated to find the new max for the specified
> + * clamp index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int clamp_value;
> + unsigned int group_id;
> +
> + if (unlikely(!p->uclamp[clamp_id].mapped))
> + return;
> +
> + group_id = p->uclamp[clamp_id].group_id;
> + p->uclamp[clamp_id].active = false;
> +
> + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> + rq->uclamp.group[clamp_id][group_id].tasks -= 1;
> +#ifdef CONFIG_SCHED_DEBUG
> + else {
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> + cpu_of(rq), clamp_id, group_id);
> + }
> +#endif
> +
> + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> + return;
> +
> + clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> +#ifdef CONFIG_SCHED_DEBUG
> + if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> + cpu_of(rq), clamp_id, group_id);
> + }
> +#endif
> + if (clamp_value >= rq->uclamp.value[clamp_id])
> + uclamp_cpu_update(rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_get(): increase CPU's clamp group refcount
> + * @rq: the CPU's rq where the task is enqueued
> + * @p: the task being enqueued
> + *
> + * When a task is enqueued on a CPU's rq, all the clamp groups currently
> + * enforced on a task are reference counted on that rq. Since not all
> + * scheduling classes have utilization clamping support, their tasks will
> + * be silently ignored.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_cpu_get_id(p, rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_put(): decrease CPU's clamp group refcount
> + * @rq: the CPU's rq from where the task is dequeued
> + * @p: the task being dequeued
> + *
> + * When a task is dequeued from a CPU's rq, all the clamp groups the task has
> + * reference counted at enqueue time are now released.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_cpu_put_id(p, rq, clamp_id);
> +}
> +
> /**
> * uclamp_group_put: decrease the reference count for a clamp group
> * @clamp_id: the clamp index which was affected by a task group
> @@ -836,6 +989,7 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> unsigned int free_group_id;
> unsigned int group_id;
> unsigned long res;
> + int cpu;
>
> retry:
>
> @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> if (res != uc_map_old.data)
> goto retry;
>
> + /* Ensure each CPU tracks the correct value for this clamp group */
> + if (likely(uc_map_new.se_count > 1))
> + goto done;
> + for_each_possible_cpu(cpu) {
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> + /* Refcounting is expected to be always 0 for free groups */
> + if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> + uc_cpu->group[clamp_id][group_id].tasks = 0;
> +#ifdef CONFIG_SCHED_DEBUG
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> + cpu, clamp_id, group_id);
> +#endif
> + }
> +
> + if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> + continue;
> + uc_cpu->group[clamp_id][group_id].value = clamp_value;
> + }
> +
> +done:
> +
> /* Update SE's clamp values and attach it to new clamp group */
> uc_se->value = clamp_value;
> uc_se->group_id = group_id;
> @@ -948,6 +1124,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
> clamp_value = uclamp_none(clamp_id);
>
> p->uclamp[clamp_id].mapped = false;
> + p->uclamp[clamp_id].active = false;
> uclamp_group_get(&p->uclamp[clamp_id], clamp_id, clamp_value);
> }
> }
> @@ -959,9 +1136,13 @@ static void __init init_uclamp(void)
> {
> struct uclamp_se *uc_se;
> unsigned int clamp_id;
> + int cpu;
>
> mutex_init(&uclamp_mutex);
>
> + for_each_possible_cpu(cpu)
> + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
> +
> memset(uclamp_maps, 0, sizeof(uclamp_maps));
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> uc_se = &init_task.uclamp[clamp_id];
> @@ -970,6 +1151,8 @@ static void __init init_uclamp(void)
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> @@ -987,6 +1170,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & ENQUEUE_RESTORE))
> sched_info_queued(rq, p);
>
> + uclamp_cpu_get(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -998,6 +1182,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & DEQUEUE_SAVE))
> sched_info_dequeued(rq, p);
>
> + uclamp_cpu_put(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 947ab14d3d5b..1755c9c9f4f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -766,6 +766,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp Group
> + * @value: utilization clamp value for tasks on this clamp group
> + * @tasks: number of RUNNABLE tasks on this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> + unsigned long value : SCHED_CAPACITY_SHIFT + 1;
> + unsigned long tasks : BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1;
> +};
> +
> +/**
> + * struct uclamp_cpu - CPU's utilization clamp
> + * @value: currently active clamp values for a CPU
> + * @group: utilization clamp groups affecting a CPU
> + *
> + * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
> + * A clamp value affects a CPU when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, which are
> + * currently only two: minimum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + * utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + * maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
> + * array to track the metrics required to compute all the per-CPU utilization
> + * clamp values. The additional slot is used to track the default clamp
> + * values, i.e. no min/max clamping at all.
> + */
> +struct uclamp_cpu {
> + struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> + int value[UCLAMP_CNT];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -804,6 +848,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_cpu uclamp ____cacheline_aligned;
> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> --
> 2.18.0
>
--
#include <best/regards.h>
Patrick Bellasi
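The MAX aggregation described in the quoted changelog can be sketched as a
small standalone model; the group count and structure names below are
illustrative, not the kernel's:

```c
#include <assert.h>

#define UCLAMP_GROUPS 5			/* assumption: small build-time bound */
#define SCHED_CAPACITY_SCALE 1024

struct clamp_group {
	unsigned int tasks;		/* RUNNABLE tasks refcounted here */
	unsigned int value;		/* clamp value tracked by this group */
};

/*
 * Scan the (unordered) clamp group array and return the max value
 * among groups with at least one RUNNABLE task, mirroring the logic
 * of uclamp_cpu_update(); the scan can stop early once the maximum
 * possible capacity is reached.
 */
static unsigned int cpu_clamp_value(const struct clamp_group *group)
{
	unsigned int max_value = 0;

	for (unsigned int id = 0; id < UCLAMP_GROUPS; id++) {
		if (!group[id].tasks)
			continue;
		if (max_value < group[id].value)
			max_value = group[id].value;
		if (max_value >= SCHED_CAPACITY_SCALE)
			break;
	}
	return max_value;
}
```

With groups holding {2 tasks @ 100, 0 tasks @ 900, 1 task @ 300}, the CPU
clamp is 300: the empty group's value is ignored.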
Slightly older version posted in error along with the correct one.
Please comment on:
Message-ID: <[email protected]>
Sorry for the noise.
On 29-Oct 18:33, Patrick Bellasi wrote:
> When a task's util_clamp value is configured via sched_setattr(2), this
> value has to be properly accounted in the corresponding clamp group
> every time the task is enqueued and dequeued. When cgroups are also in
> use, per-task clamp values have to be aggregated to those of the CPU's
> controller's Task Group (TG) in which the task is currently living.
>
> Let's update uclamp_cpu_get() to provide aggregation between the task
> and the TG clamp values. Every time a task is enqueued, it will be
> accounted in the clamp_group which defines the smaller clamp between the
> task specific value and its TG effective value.
>
> This also mimics what already happens for a task's CPU affinity mask when
> the task is also living in a cpuset. The overall idea is that cgroup
> attributes are always used to restrict the per-task attributes.
>
> Thus, this implementation allows to:
>
> 1. ensure cgroup clamps are always used to restrict task specific
> requests, i.e. boosted only up to the effective granted value or
> clamped at least to a certain value
> 2. implement a "nice-like" policy, where tasks are still allowed to
> request less than what is enforced by their current TG
>
> For this mechanism to work properly, we exploit the concept of
> "effective" clamp, which is already used by a TG to track parent
> enforced restrictions.
> In this patch we re-use the same variable:
> task_struct::uclamp::effective::group_id
> to track the most restrictive clamp group each task is currently
> subject to and thus also currently refcounted into.
>
> This solution also allows to better decouple the slow-path, where task
> and task group clamp values are updated, from the fast-path, where the
> most appropriate clamp value is tracked by refcounting clamp groups.
>
> For consistency purposes, as well as to properly inform userspace, the
> sched_getattr(2) call is updated to always return the properly
> aggregated constraints as described above. This will also make
> sched_getattr(2) a convenient userspace API to know the utilization
> constraints enforced on a task by the cgroup's CPU controller.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes in v4:
> Message-ID: <20180816140731.GD2960@e110439-lin>
> - reuse already existing:
> task_struct::uclamp::effective::group_id
> instead of adding:
> task_struct::uclamp_group_id
> to back annotate the effective clamp group in which a task has been
> refcounted
> Others:
> - small documentation fixes
> - rebased on v4.19-rc1
>
> Changes in v3:
> Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
> - rename UCLAMP_NONE into UCLAMP_NOT_VALID
> - fix not required override
> - fix typos in changelog
> Others:
> - clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
> clamp group_id/value code into dedicated getter functions:
> uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
> - rebased on tip/sched/core
> Changes in v2:
> OSPM discussion:
> - implement a "nice" semantics where cgroup clamp values are always
> used to restrict task specific clamp values, i.e. tasks running on a
> TG are only allowed to demote themselves.
> Other:
> - rebased on v4.18-rc4
> - this code has been split from a previous patch to simplify the review
> ---
> include/linux/sched.h | 9 +++++++
> kernel/sched/core.c | 58 +++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 62 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7698e7554892..4b61fbcb0797 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -609,12 +609,21 @@ struct sched_dl_entity {
> * The active bit is set whenever a task has got an effective clamp group
> * and value assigned, which can be different from the user requested ones.
> * This allows to know a task is actually refcounting a CPU's clamp group.
> + *
> + * The user_defined bit is set whenever a task has got a task-specific clamp
> + * value requested from userspace, i.e. the system defaults apply to this
> + * task just as a restriction. This allows relaxing a TG's clamps when a less
> + * restrictive task-specific value has been defined, thus allowing to
> + * implement a "nice" semantic when both task group and task specific values
> + * are used. For example, a task running on a 20% boosted TG can still drop
> + * its own boosting to 0%.
> */
> struct uclamp_se {
> unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> unsigned int group_id : order_base_2(UCLAMP_GROUPS);
> unsigned int mapped : 1;
> unsigned int active : 1;
> + unsigned int user_defined : 1;
> /*
> * Clamp group and value actually used by a scheduling entity,
> * i.e. a (RUNNABLE) task or a task group.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e2292c698e3b..2ce84d22ab17 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -875,6 +875,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
> rq->uclamp.value[clamp_id] = max_value;
> }
>
> +/**
> + * uclamp_apply_defaults: check if p is subject to system default clamps
> + * @p: the task to check
> + *
> + * Tasks in the root group or autogroups are always and only limited by system
> + * defaults. All other tasks are instead limited by their TG's specific value.
> + * This method checks the conditions under which a task is subject to system
> + * default clamps.
> + */
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +static inline bool uclamp_apply_defaults(struct task_struct *p)
> +{
> + if (task_group_is_autogroup(task_group(p)))
> + return true;
> + if (task_group(p) == &root_task_group)
> + return true;
> + return false;
> +}
> +#else
> +#define uclamp_apply_defaults(p) true
> +#endif
> +
> /**
> * uclamp_effective_group_id: get the effective clamp group index of a task
> * @p: the task to get the effective clamp value for
> @@ -882,9 +904,11 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
> *
> * The effective clamp group index of a task depends on:
> * - the task specific clamp value, explicitly requested from userspace
> + * - the task group effective clamp value, for tasks not in the root group or
> + * in an autogroup
> * - the system default clamp value, defined by the sysadmin
> - * and tasks specific's clamp values are always restricted by system
> - * defaults clamp values.
> + * and task-specific clamp values are always restricted, with increasing
> + * priority, by their task group first and by the system defaults after.
> *
> * This method returns the effective group index for a task, depending on its
> * status and a proper aggregation of the clamp values listed above.
> @@ -908,6 +932,22 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
> clamp_value = p->uclamp[clamp_id].value;
> group_id = p->uclamp[clamp_id].group_id;
>
> + if (!uclamp_apply_defaults(p)) {
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + unsigned int clamp_max =
> + task_group(p)->uclamp[clamp_id].effective.value;
> + unsigned int group_max =
> + task_group(p)->uclamp[clamp_id].effective.group_id;
> +
> + if (!p->uclamp[clamp_id].user_defined ||
> + clamp_value > clamp_max) {
> + clamp_value = clamp_max;
> + group_id = group_max;
> + }
> +#endif
> + goto done;
> + }
> +
> /* RT tasks have different default values */
> default_clamp = task_has_rt_policy(p)
> ? uclamp_default_perf
> @@ -924,6 +964,8 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
> group_id = default_clamp[clamp_id].group_id;
> }
>
> +done:
> +
> p->uclamp[clamp_id].effective.value = clamp_value;
> p->uclamp[clamp_id].effective.group_id = group_id;
>
> @@ -936,8 +978,10 @@ static inline unsigned int uclamp_effective_group_id(struct task_struct *p,
> * @rq: the CPU's rq where the clamp group has to be reference counted
> * @clamp_id: the clamp index to update
> *
> - * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> - * the task's uclamp::group_id is reference counted on that CPU.
> + * Once a task is enqueued on a CPU's rq, with increasing priority, we
> + * reference count the most restrictive clamp group between the task specific
> + * clamp value, the clamp value of its task group and the system default clamp
> + * value.
> */
> static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> unsigned int clamp_id)
> @@ -1312,10 +1356,12 @@ static int __setscheduler_uclamp(struct task_struct *p,
>
> /* Update each required clamp group */
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + p->uclamp[UCLAMP_MIN].user_defined = true;
> uclamp_group_get(p, &p->uclamp[UCLAMP_MIN],
> UCLAMP_MIN, lower_bound);
> }
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + p->uclamp[UCLAMP_MAX].user_defined = true;
> uclamp_group_get(p, &p->uclamp[UCLAMP_MAX],
> UCLAMP_MAX, upper_bound);
> }
> @@ -1359,8 +1405,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> unsigned int clamp_value = p->uclamp[clamp_id].value;
>
> - if (unlikely(reset))
> + if (unlikely(reset)) {
> clamp_value = uclamp_none(clamp_id);
> + p->uclamp[clamp_id].user_defined = false;
> + }
>
> p->uclamp[clamp_id].mapped = false;
> p->uclamp[clamp_id].active = false;
> --
> 2.18.0
>
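For reference, the aggregation rules in the hunk above can be condensed into a
standalone sketch (plain C mock-up: a single clamp dimension, values only, no
group indexes; uclamp_effective() and its parameters are illustrative names,
not the kernel API):

```c
#include <assert.h>

/*
 * Effective clamp for a task: restricted first by its task group
 * (when not using system defaults), then by the system-wide
 * default, with increasing priority.
 */
static unsigned int uclamp_effective(unsigned int task_value,
                                     int user_defined,
                                     int apply_defaults,
                                     unsigned int tg_value,
                                     unsigned int sys_default)
{
        if (!apply_defaults) {
                /* Task group clamp wins unless the task asked for less */
                if (!user_defined || task_value > tg_value)
                        return tg_value;
                return task_value;
        }

        /* The system default always restricts the task's own request */
        if (task_value > sys_default)
                return sys_default;
        return task_value;
}
```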
--
#include <best/regards.h>
Patrick Bellasi
On 10/29/18 11:32 AM, Patrick Bellasi wrote:
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 10fbb8031930..fde7301ed28c 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -53,6 +73,30 @@ struct sched_param {
> * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
> * only user of this new interface. More information about the algorithm
> * available in the scheduling class file or in Documentation/.
> + *
> + * Task Utilization Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes allows to specify the utilization which
> + * should be expected by a task. These attributes allow to inform the
> + * scheduler about the utilization boundaries within which it is expected to
> + * schedule the task. These boundaries are valuable hints to support scheduler
> + * decisions on both task placement and frequencies selection.
> + *
> + * @sched_util_min represents the minimum utilization
> + * @sched_util_max represents the maximum utilization
> + *
> + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> + * represents the percentage of CPU time used by a task when running at the
> + * maximum frequency on the highest capacity CPU of the system. Thus, for
> + * example, a 20% utilization task is a task running for 2ms every 10ms.
> + *
> + * A task with a min utilization value bigger then 0 is more likely to be
s/then/than/
> + * scheduled on a CPU which has a capacity big enough to fit the specified
> + * minimum utilization value.
> + * A task with a max utilization value smaller then 1024 is more likely to be
> + * scheduled on a CPU which do not necessarily have more capacity then the
s/do not/does not/, s/then/than/
> + * specified max utilization value.
> */
> struct sched_attr {
> __u32 size;
cheers.
--
~Randy
On 29-Oct 18:33, Patrick Bellasi wrote:
[...]
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * clamp_util: clamp a utilization value for a specified CPU
> + * @rq: the CPU's RQ to get the clamp values from
> + * @util: the utilization signal to clamp
> + *
> + * Each CPU tracks util_{min,max} clamp values depending on the set of its
> + * currently RUNNABLE tasks. Given a utilization signal, i.e a signal in
> + * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
> + * utilization signal considering the current clamp values for the
> + * specified CPU.
> + *
> + * Return: a clamped utilization signal for a given CPU.
> + */
> +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
> +{
> + unsigned int min_util = rq->uclamp.value[UCLAMP_MIN];
> + unsigned int max_util = rq->uclamp.value[UCLAMP_MAX];
Just notice here we can have an issue.
For each scheduling entity, we always ensure that:
util_min <= util_max
However, since a CPU's {min,max}_util clamps are always MAX aggregated
across the corresponding clamps of RUNNABLE tasks with _different_
clamp values, we can end up with CPU clamps where:
util_min > util_max
Thus, we need to add the following sanity check here:
+ if (unlikely(min_util > max_util))
+ return min_util;
> +
> + return clamp(util, min_util, max_util);
> +}
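A standalone mock-up of the function with the proposed sanity check applied
(userspace C, with struct rq reduced to just the per-CPU clamp values; not the
kernel code):

```c
#include <assert.h>

#define UCLAMP_MIN 0
#define UCLAMP_MAX 1

/* Reduced stand-in for the per-CPU rq uclamp state. */
struct rq_uclamp { unsigned int value[2]; };

static unsigned int clamp_val(unsigned int v, unsigned int lo, unsigned int hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Clamp @util within the CPU's current [min, max] clamps; if the MAX
 * aggregation across differently-clamped RUNNABLE tasks produced
 * min > max, the min clamp wins.
 */
static unsigned int uclamp_util(struct rq_uclamp *rq, unsigned int util)
{
        unsigned int min_util = rq->value[UCLAMP_MIN];
        unsigned int max_util = rq->value[UCLAMP_MAX];

        if (min_util > max_util)
                return min_util;

        return clamp_val(util, min_util, max_util);
}
```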
--
#include <best/regards.h>
Patrick Bellasi
On Mon, Oct 29, 2018 at 06:32:55PM +0000, Patrick Bellasi wrote:
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 22627f80063e..c27d6e81517b 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -50,9 +50,11 @@
> #define SCHED_FLAG_RESET_ON_FORK 0x01
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> +#define SCHED_FLAG_UTIL_CLAMP 0x08
>
> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> SCHED_FLAG_RECLAIM | \
> - SCHED_FLAG_DL_OVERRUN)
> + SCHED_FLAG_DL_OVERRUN | \
> + SCHED_FLAG_UTIL_CLAMP)
>
> #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 10fbb8031930..fde7301ed28c 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -9,6 +9,7 @@ struct sched_param {
> };
>
> #define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
> +#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
>
> /*
> * Extended scheduling parameters data structure.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4533,6 +4533,10 @@ static int sched_copy_attr(struct sched_
if (ret)
return -EFAULT;
+ if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
+ size < SCHED_ATTR_SIZE_VER1)
+ return -EINVAL;
+
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
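The new size check can be illustrated with a tiny standalone validator
(userspace sketch; check_uclamp_size() is an illustrative name, not the kernel
function):

```c
#include <assert.h>
#include <errno.h>

#define SCHED_FLAG_UTIL_CLAMP   0x08u
#define SCHED_ATTR_SIZE_VER0    48      /* sizeof first published struct */
#define SCHED_ATTR_SIZE_VER1    56      /* add: util_{min,max} */

/*
 * A userspace sched_attr shorter than VER1 cannot carry the new
 * util clamp fields, so asking for UTIL_CLAMP with a VER0-sized
 * struct is rejected with -EINVAL.
 */
static int check_uclamp_size(unsigned int flags, unsigned int size)
{
        if ((flags & SCHED_FLAG_UTIL_CLAMP) && size < SCHED_ATTR_SIZE_VER1)
                return -EINVAL;
        return 0;
}
```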
On Mon, Oct 29, 2018 at 06:32:56PM +0000, Patrick Bellasi wrote:
> @@ -50,11 +52,13 @@
> #define SCHED_FLAG_RESET_ON_FORK 0x01
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> -#define SCHED_FLAG_UTIL_CLAMP 0x08
> +#define SCHED_FLAG_TUNE_POLICY 0x08
> +#define SCHED_FLAG_UTIL_CLAMP 0x10
That seems to suggest you want to do this patch first, but you didn't..
On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> +struct uclamp_se {
> + unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> + unsigned int group_id : order_base_2(UCLAMP_GROUPS);
Are you sure about ob2() ? seems weird we'll end up with 0 for
UCLAMP_GROUPS==1.
> + unsigned int mapped : 1;
> +};
On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> +static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
> {
> + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> + union uclamp_map uc_map_old, uc_map_new;
> + long res;
> +
> +retry:
> +
> + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> + uc_map_new = uc_map_old;
> + uc_map_new.se_count -= 1;
> + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> + uc_map_old.data, uc_map_new.data);
> + if (res != uc_map_old.data)
> + goto retry;
> +}
Please write cmpxchg loops in the form:
atomic_long_t *ptr = &uclamp_maps[clamp_id][group_id].adata;
union uclamp_map old, new;
old.data = atomic_long_read(ptr);
do {
new.data = old.data;
new.se_count--;
} while (!atomic_long_try_cmpxchg(ptr, &old.data, new.data));
(same for all the others of course)
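For reference, the suggested form maps directly onto C11 atomics, where
atomic_compare_exchange_weak() updates the expected value on failure just like
try_cmpxchg() (userspace sketch with an illustrative 32-bit layout, not the
kernel types):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative packing: clamp value plus refcount in one word. */
union uclamp_map {
        struct {
                unsigned int value    : 11;
                unsigned int se_count : 21;
        };
        unsigned int data;
};

/* Drop one reference, retrying if a concurrent update raced with us. */
static void uclamp_map_put(_Atomic unsigned int *ptr)
{
        union uclamp_map old, new;

        old.data = atomic_load(ptr);
        do {
                new.data = old.data;
                new.se_count--;
        } while (!atomic_compare_exchange_weak(ptr, &old.data, new.data));
}
```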
On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> +/**
> + * uclamp_group_get: increase the reference count for a clamp group
> + * @uc_se: the utilization clamp data for the task
> + * @clamp_id: the clamp index affected by the task
> + * @clamp_value: the new clamp value for the task
> + *
> + * Each time a task changes its utilization clamp value, for a specified clamp
> + * index, we need to find an available clamp group which can be used to track
> + * this new clamp value. The corresponding clamp group index will be used to
> + * reference count the corresponding clamp value while the task is enqueued on
> + * a CPU.
> + */
> +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> + unsigned int clamp_value)
> +{
> + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> + unsigned int prev_group_id = uc_se->group_id;
> + union uclamp_map uc_map_old, uc_map_new;
> + unsigned int free_group_id;
> + unsigned int group_id;
> + unsigned long res;
> +
> +retry:
> +
> + free_group_id = UCLAMP_GROUPS;
> + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> + free_group_id = group_id;
> + if (uc_map_old.value == clamp_value)
> + break;
> + }
> + if (group_id >= UCLAMP_GROUPS) {
> +#ifdef CONFIG_SCHED_DEBUG
> +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> + return;
> + }
> +#endif
> + group_id = free_group_id;
> + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> + }
You forgot to check for refcount overflow here ;-)
And I'm not really a fan of hiding that error in a define like you keep
doing.
What's wrong with something like:
if (SCHED_WARN(free_group_id == UCLAMP_GROUPS))
return;
and
> + uc_map_new.se_count = uc_map_old.se_count + 1;
if (SCHED_WARN(!new.se_count))
new.se_count = -1;
> + uc_map_new.value = clamp_value;
> + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> + uc_map_old.data, uc_map_new.data);
> + if (res != uc_map_old.data)
> + goto retry;
> +
> + /* Update SE's clamp values and attach it to new clamp group */
> + uc_se->value = clamp_value;
> + uc_se->group_id = group_id;
> +
> + /* Release the previous clamp group */
> + if (uc_se->mapped)
> + uclamp_group_put(clamp_id, prev_group_id);
> + uc_se->mapped = true;
> +}
On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> +/**
> + * uclamp_group_get: increase the reference count for a clamp group
> + * @uc_se: the utilization clamp data for the task
> + * @clamp_id: the clamp index affected by the task
> + * @clamp_value: the new clamp value for the task
> + *
> + * Each time a task changes its utilization clamp value, for a specified clamp
> + * index, we need to find an available clamp group which can be used to track
> + * this new clamp value. The corresponding clamp group index will be used to
> + * reference count the corresponding clamp value while the task is enqueued on
> + * a CPU.
> + */
> +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> + unsigned int clamp_value)
> +{
> + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> + unsigned int prev_group_id = uc_se->group_id;
> + union uclamp_map uc_map_old, uc_map_new;
> + unsigned int free_group_id;
> + unsigned int group_id;
> + unsigned long res;
> +
> +retry:
> +
> + free_group_id = UCLAMP_GROUPS;
> + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> + free_group_id = group_id;
> + if (uc_map_old.value == clamp_value)
> + break;
> + }
> + if (group_id >= UCLAMP_GROUPS) {
> +#ifdef CONFIG_SCHED_DEBUG
> +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> + return;
> + }
> +#endif
Can you please put in a comment, either here or on top, on why this can
not in fact happen? And we're always guaranteed a free group.
> + group_id = free_group_id;
> + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> + }
> +
> + uc_map_new.se_count = uc_map_old.se_count + 1;
> + uc_map_new.value = clamp_value;
> + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> + uc_map_old.data, uc_map_new.data);
> + if (res != uc_map_old.data)
> + goto retry;
> +
> + /* Update SE's clamp values and attach it to new clamp group */
> + uc_se->value = clamp_value;
> + uc_se->group_id = group_id;
> +
> + /* Release the previous clamp group */
> + if (uc_se->mapped)
> + uclamp_group_put(clamp_id, prev_group_id);
> + uc_se->mapped = true;
> +}
On 07-Nov 13:11, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:56PM +0000, Patrick Bellasi wrote:
>
> > @@ -50,11 +52,13 @@
> > #define SCHED_FLAG_RESET_ON_FORK 0x01
> > #define SCHED_FLAG_RECLAIM 0x02
> > #define SCHED_FLAG_DL_OVERRUN 0x04
> > -#define SCHED_FLAG_UTIL_CLAMP 0x08
> > +#define SCHED_FLAG_TUNE_POLICY 0x08
> > +#define SCHED_FLAG_UTIL_CLAMP 0x10
>
> That seems to suggest you want to do this patch first, but you didn't..
I've kept it here just to better highlight this change, suggested by
Juri, since we were not entirely sure you were fine with it...
If you think it's ok adding a SCHED_FLAG_TUNE_POLICY behavior to the
sched_setattr syscall, I can certainly squash into the previous patch,
which gives more context on why we need it.
Otherwise, if we want to keep these bits better isolated for possible
future bisects, I can also move it at the beginning of the series.
What do you like best ?
Since we are at that, are we supposed to document some{where,how}
these API changes ?
--
#include <best/regards.h>
Patrick Bellasi
On 07-Nov 14:16, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
>
> > +static void uclamp_group_put(unsigned int clamp_id, unsigned int group_id)
> > {
> > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > + union uclamp_map uc_map_old, uc_map_new;
> > + long res;
> > +
> > +retry:
> > +
> > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > + uc_map_new = uc_map_old;
> > + uc_map_new.se_count -= 1;
> > + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> > + uc_map_old.data, uc_map_new.data);
> > + if (res != uc_map_old.data)
> > + goto retry;
> > +}
>
> Please write cmpxchg loops in the form:
>
> atomic_long_t *ptr = &uclamp_maps[clamp_id][group_id].adata;
> union uclamp_map old, new;
>
> old.data = atomic_long_read(ptr);
> do {
> new.data = old.data;
> new.se_count--;
> } while (!atomic_long_try_cmpxchg(ptr, &old.data, new.data));
>
>
> (same for all the others of course)
Ok, I did that to save some indentation, but actually it's most
commonly used in a while loop... will update in v6.
Out of curiosity, apart from code consistency, is that form also
required to avoid some possible compiler-related (mis)behavior ?
--
#include <best/regards.h>
Patrick Bellasi
On Wed, Nov 07, 2018 at 01:50:39PM +0000, Patrick Bellasi wrote:
> On 07-Nov 13:11, Peter Zijlstra wrote:
> > On Mon, Oct 29, 2018 at 06:32:56PM +0000, Patrick Bellasi wrote:
> >
> > > @@ -50,11 +52,13 @@
> > > #define SCHED_FLAG_RESET_ON_FORK 0x01
> > > #define SCHED_FLAG_RECLAIM 0x02
> > > #define SCHED_FLAG_DL_OVERRUN 0x04
> > > -#define SCHED_FLAG_UTIL_CLAMP 0x08
> > > +#define SCHED_FLAG_TUNE_POLICY 0x08
> > > +#define SCHED_FLAG_UTIL_CLAMP 0x10
> >
> > That seems to suggest you want to do this patch first, but you didn't..
>
> I've kept it here just to better highlight this change, suggested by
> Juri, since we were not entirely sure you were fine with it...
>
> If you think it's ok adding a SCHED_FLAG_TUNE_POLICY behavior to the
> sched_setattr syscall, I can certainly squash into the previous patch,
> which gives more context on why we need it.
I'm fine with the idea I think. It's the details I worry about. Which
fields in particular are not updated with this? Are the flags?
Also, I'm not too keen on the name; since it explicitly does not modify
the policy and its related parameters, so TUNE_POLICY is actively wrong.
But the thing that confused me most is how you fiddled the numbers to
fit this before UTIL_CLAMP.
> Since we are at that, are we supposed to document some{where,how}
> these API changes ?
I'm pretty sure there's a manpage somewhere... SCHED_SETATTR(2) seems to
exist on my machine. So that wants updates.
On Wed, Nov 07, 2018 at 01:57:38PM +0000, Patrick Bellasi wrote:
> On 07-Nov 14:16, Peter Zijlstra wrote:
> > Please write cmpxchg loops in the form:
> >
> > atomic_long_t *ptr = &uclamp_maps[clamp_id][group_id].adata;
> > union uclamp_map old, new;
> >
> > old.data = atomic_long_read(ptr);
> > do {
> > new.data = old.data;
> > new.se_count--;
> > } while (!atomic_long_try_cmpxchg(ptr, &old.data, new.data));
> >
> >
> > (same for all the others of course)
>
> Ok, I did that to save some indentation, but actually it's most
> commonly used in a while loop... will update in v6.
>
> Out of curiosity, apart from code consistency, is that form also
> required to avoid some possible compiler-related (mis)behavior ?
No; it is just the 'normal' form my brain likes :-)
And the try_cmpxchg() thing is slightly more efficient on x86 vs the
traditional form:
while (cmpxchg(ptr, old, new) != old)
On 07-Nov 13:19, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > +struct uclamp_se {
> > + unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> > + unsigned int group_id : order_base_2(UCLAMP_GROUPS);
>
> Are you sure about ob2() ? seems weird we'll end up with 0 for
> UCLAMP_GROUPS==1.
So, we have:
#define UCLAMP_GROUPS (CONFIG_UCLAMP_GROUPS_COUNT + 1)
because one clamp group is always reserved for defaults.
Thus, ob2(n) is always called with n >= 2.
... should be safe no ?
However, will check better again on v6 for possible corner-cases...
> > + unsigned int mapped : 1;
> > +};
--
#include <best/regards.h>
Patrick Bellasi
On 07-Nov 14:44, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > +/**
> > + * uclamp_group_get: increase the reference count for a clamp group
> > + * @uc_se: the utilization clamp data for the task
> > + * @clamp_id: the clamp index affected by the task
> > + * @clamp_value: the new clamp value for the task
> > + *
> > + * Each time a task changes its utilization clamp value, for a specified clamp
> > + * index, we need to find an available clamp group which can be used to track
> > + * this new clamp value. The corresponding clamp group index will be used to
> > + * reference count the corresponding clamp value while the task is enqueued on
> > + * a CPU.
> > + */
> > +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> > + unsigned int clamp_value)
> > +{
> > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > + unsigned int prev_group_id = uc_se->group_id;
> > + union uclamp_map uc_map_old, uc_map_new;
> > + unsigned int free_group_id;
> > + unsigned int group_id;
> > + unsigned long res;
> > +
> > +retry:
> > +
> > + free_group_id = UCLAMP_GROUPS;
> > + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> > + free_group_id = group_id;
> > + if (uc_map_old.value == clamp_value)
> > + break;
> > + }
> > + if (group_id >= UCLAMP_GROUPS) {
> > +#ifdef CONFIG_SCHED_DEBUG
> > +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> > + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> > + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> > + return;
> > + }
> > +#endif
>
> Can you please put in a comment, either here or on top, on why this can
> not in fact happen? And we're always guaranteed a free group.
You're right, that's confusing, especially because up to this point we
are not guaranteed one. We are always guaranteed a free group once we add:
sched/core: uclamp: add clamp group bucketing support
I've kept it separated to better document how we introduce that
support.
Is it ok for you if I call out in the change log that the
guarantee comes from a following patch... and add the comment in
that later patch ?
>
> > + group_id = free_group_id;
> > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > + }
> > +
> > + uc_map_new.se_count = uc_map_old.se_count + 1;
> > + uc_map_new.value = clamp_value;
> > + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> > + uc_map_old.data, uc_map_new.data);
> > + if (res != uc_map_old.data)
> > + goto retry;
> > +
> > + /* Update SE's clamp values and attach it to new clamp group */
> > + uc_se->value = clamp_value;
> > + uc_se->group_id = group_id;
> > +
> > + /* Release the previous clamp group */
> > + if (uc_se->mapped)
> > + uclamp_group_put(clamp_id, prev_group_id);
> > + uc_se->mapped = true;
> > +}
--
#include <best/regards.h>
Patrick Bellasi
On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> +static int __setscheduler_uclamp(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
> + unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
> + int result = 0;
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + lower_bound = attr->sched_util_min;
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + upper_bound = attr->sched_util_max;
> +
> + if (lower_bound > upper_bound ||
> + upper_bound > SCHED_CAPACITY_SCALE)
> return -EINVAL;
>
> + mutex_lock(&uclamp_mutex);
>
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uclamp_group_get(&p->uclamp[UCLAMP_MIN],
> + UCLAMP_MIN, lower_bound);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uclamp_group_get(&p->uclamp[UCLAMP_MAX],
> + UCLAMP_MAX, upper_bound);
> + }
> +
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}
I'm missing where we error due to lack of groups.
On Wed, Nov 07, 2018 at 02:19:49PM +0000, Patrick Bellasi wrote:
> On 07-Nov 13:19, Peter Zijlstra wrote:
> > On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > > +struct uclamp_se {
> > > + unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> > > + unsigned int group_id : order_base_2(UCLAMP_GROUPS);
> >
> > Are you sure about ob2() ? seems weird we'll end up with 0 for
> > UCLAMP_GROUPS==1.
>
> So, we have:
>
> #define UCLAMP_GROUPS (CONFIG_UCLAMP_GROUPS_COUNT + 1)
>
> because one clamp group is always reserved for defaults.
> Thus, ob2(n) is always called with n >= 2.
>
> ... should be safe no ?
+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 32
+ default 5
0+1 == 1
Increase the min range and you should be good I think.
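The corner case is easy to check with a userspace re-implementation of
order_base_2() (ceil(log2(n)), matching the kernel's
`(n) > 1 ? ilog2((n) - 1) + 1 : 0` definition): with UCLAMP_GROUPS == 1 the
group_id bitfield would end up zero bits wide.

```c
#include <assert.h>

/* ceil(log2(n)); 0 for n <= 1 -- mirrors the kernel's order_base_2() */
static unsigned int order_base_2(unsigned int n)
{
        unsigned int order = 0;

        while ((1u << order) < n)
                order++;
        return order;
}
```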
On Wed, Nov 07, 2018 at 02:24:28PM +0000, Patrick Bellasi wrote:
> On 07-Nov 14:44, Peter Zijlstra wrote:
> > On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > > +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> > > + unsigned int clamp_value)
> > > +{
> > > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > > + unsigned int prev_group_id = uc_se->group_id;
> > > + union uclamp_map uc_map_old, uc_map_new;
> > > + unsigned int free_group_id;
> > > + unsigned int group_id;
> > > + unsigned long res;
> > > +
> > > +retry:
> > > +
> > > + free_group_id = UCLAMP_GROUPS;
> > > + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> > > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > > + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> > > + free_group_id = group_id;
> > > + if (uc_map_old.value == clamp_value)
> > > + break;
> > > + }
> > > + if (group_id >= UCLAMP_GROUPS) {
> > > +#ifdef CONFIG_SCHED_DEBUG
> > > +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> > > + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> > > + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> > > + return;
> > > + }
> > > +#endif
> >
> > Can you please put in a comment, either here or on top, on why this can
> > not in fact happen? And we're always guaranteed a free group.
>
> You're right, that's confusing, especially because up to this point we
> are not guaranteed one. We are always guaranteed a free group once we add:
>
> sched/core: uclamp: add clamp group bucketing support
>
> I've kept it separated to better document how we introduce that
> support.
>
> Is it ok for you if I call out in the change log that the
> guarantee comes from a following patch... and add the comment in
> that later patch ?
Urgh.. that is mighty confusing and since this stuff actually 'works'
might result in bisection issues too, right?
I would really rather prefer a series that makes sense in the order you
read it.
On 07-Nov 14:35, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > +/**
> > + * uclamp_group_get: increase the reference count for a clamp group
> > + * @uc_se: the utilization clamp data for the task
> > + * @clamp_id: the clamp index affected by the task
> > + * @clamp_value: the new clamp value for the task
> > + *
> > + * Each time a task changes its utilization clamp value, for a specified clamp
> > + * index, we need to find an available clamp group which can be used to track
> > + * this new clamp value. The corresponding clamp group index will be used to
> > + * reference count the corresponding clamp value while the task is enqueued on
> > + * a CPU.
> > + */
> > +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> > + unsigned int clamp_value)
> > +{
> > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > + unsigned int prev_group_id = uc_se->group_id;
> > + union uclamp_map uc_map_old, uc_map_new;
> > + unsigned int free_group_id;
> > + unsigned int group_id;
> > + unsigned long res;
> > +
> > +retry:
> > +
> > + free_group_id = UCLAMP_GROUPS;
> > + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> > + free_group_id = group_id;
> > + if (uc_map_old.value == clamp_value)
> > + break;
> > + }
> > + if (group_id >= UCLAMP_GROUPS) {
> > +#ifdef CONFIG_SCHED_DEBUG
> > +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> > + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> > + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> > + return;
> > + }
> > +#endif
> > + group_id = free_group_id;
> > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > + }
>
> You forgot to check for refcount overflow here ;-)
You mean se_count overflow ?
That se_count is (BITS_PER_LONG - SCHED_CAPACITY_SHIFT - 1) bits wide,
which makes it able to track up to:
more than 2 million tasks/task_groups on 32-bit systems (with SCHED_CAPACITY_SHIFT 10)
more than 10^12 tasks/task_groups on 64-bit systems (with SCHED_CAPACITY_SHIFT 20)
I don't expect overflow on 64bit systems, do we ?
It's more likely on 32bit systems, especially if in the future we
should increase SCHED_CAPACITY_SHIFT.
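Back-of-the-envelope check of those limits (assuming the value field takes
SCHED_CAPACITY_SHIFT + 1 bits and one spare bit is used for flags, as in the
layout above; se_count_max() is just an illustrative helper):

```c
#include <assert.h>

/* Max refcount trackable by se_count for a given word size and shift */
static unsigned long long se_count_max(unsigned int bits_per_long,
                                       unsigned int capacity_shift)
{
        unsigned int se_bits = bits_per_long - capacity_shift - 1;

        return (1ULL << se_bits) - 1;
}
```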
> And I'm not really a fan of hiding that error in a define like you keep
> doing.
The #define is not there to mask an overflow, it's there to catch the
case in which the refcount gets corrupted and we end up violating
the invariant: "there is always a clamp group available".
NOTE: that invariant is guaranteed once we add
sched/core: uclamp: add clamp group bucketing support
The warning reports the issue only on CONFIG_SCHED_DEBUG, but...
it makes sense to keep it always enabled.
Moreover, in case of data corruption, we should just return, thus not
setting the scheduling entity as "mapped" towards the end of the
function... which makes me realize that it's actually wrong to
conditionally compile the above "return".
> What's wrong with something like:
>
> if (SCHED_WARN(free_group_id == UCLAMP_GROUPS))
> return;
Right, the flow should be:
1. try to find a valid clamp group
2. if you don't find one, the data structures are corrupted
warn once for data corruption
do not map this scheduling entity and return
3. map the scheduling entity
Is that ok ?
> and
>
> > + uc_map_new.se_count = uc_map_old.se_count + 1;
>
> if (SCHED_WARN(!new.se_count))
> new.se_count = -1;
Mmm... not sure we can recover from a corrupted refcount or an
overflow.
What should we do in these cases, disable uclamp completely ?
> > + uc_map_new.value = clamp_value;
> > + res = atomic_long_cmpxchg(&uc_maps[group_id].adata,
> > + uc_map_old.data, uc_map_new.data);
> > + if (res != uc_map_old.data)
> > + goto retry;
> > +
> > + /* Update SE's clamp values and attach it to new clamp group */
> > + uc_se->value = clamp_value;
> > + uc_se->group_id = group_id;
> > +
> > + /* Release the previous clamp group */
> > + if (uc_se->mapped)
> > + uclamp_group_put(clamp_id, prev_group_id);
> > + uc_se->mapped = true;
> > +}
--
#include <best/regards.h>
Patrick Bellasi
On 07-Nov 15:37, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > +static int __setscheduler_uclamp(struct task_struct *p,
> > + const struct sched_attr *attr)
> > +{
> > + unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
> > + unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
> > + int result = 0;
> > +
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> > + lower_bound = attr->sched_util_min;
> > +
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > + upper_bound = attr->sched_util_max;
> > +
> > + if (lower_bound > upper_bound ||
> > + upper_bound > SCHED_CAPACITY_SCALE)
> > return -EINVAL;
> >
> > + mutex_lock(&uclamp_mutex);
> >
> > + /* Update each required clamp group */
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > + uclamp_group_get(&p->uclamp[UCLAMP_MIN],
> > + UCLAMP_MIN, lower_bound);
> > + }
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > + uclamp_group_get(&p->uclamp[UCLAMP_MAX],
> > + UCLAMP_MAX, upper_bound);
> > + }
> > +
> > + mutex_unlock(&uclamp_mutex);
> > +
> > + return result;
> > +}
>
> I'm missing where we error due to lack of groups.
Again, after:
sched/core: uclamp: add clamp group bucketing support
we should never error.
Perhaps I should really squash the bucketing into these first patches
directly... lemme see what your take is on those bits; if you'd rather
not have them in a separate patch, I'll squash them in v6.
--
#include <best/regards.h>
Patrick Bellasi
On Wed, Nov 07, 2018 at 02:48:09PM +0000, Patrick Bellasi wrote:
> On 07-Nov 14:35, Peter Zijlstra wrote:
> You mean se_count overflow ?
Yah..
> > And I'm not really a fan of hiding that error in a define like you keep
> > doing.
>
> The #define is not there to mask an overflow, it's there to catch the
+#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
Is what I was talking about.
> > What's wrong with something like:
> >
> > if (SCHED_WARN(free_group_id == UCLAMP_GROUPS))
> > return;
>
> Right, the flow should be:
>
> 1. try to find a valid clamp group
> 2. if you don't find one, the data structures are corrupted
> warn once for data corruption
> do not map this scheduling entity and return
> 3. map the scheduling entity
>
> Is that ok ?
That's what the proposed code does.
> > and
> >
> > > + uc_map_new.se_count = uc_map_old.se_count + 1;
> >
> > if (SCHED_WARN(!new.se_count))
> > new.se_count = -1;
>
> Mmm... not sure we can recover from a corrupted refcount or an
> overflow.
>
> What should we do on these cases, disable uclamp completely ?
You can teach put to never decrement -1 (aka. all 1s).
But its all SCHED_DEBUG stuff anyway, so who really cares.
On 07-Nov 15:42, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 02:19:49PM +0000, Patrick Bellasi wrote:
> > On 07-Nov 13:19, Peter Zijlstra wrote:
> > > On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
> > > > +struct uclamp_se {
> > > > + unsigned int value : SCHED_CAPACITY_SHIFT + 1;
> > > > + unsigned int group_id : order_base_2(UCLAMP_GROUPS);
> > >
> > > Are you sure about ob2() ? seems weird we'll end up with 0 for
> > > UCLAMP_GROUPS==1.
> >
> > So, we have:
> >
> > #define UCLAMP_GROUPS (CONFIG_UCLAMP_GROUPS_COUNT + 1)
> >
> > because one clamp group is always reserved for defaults.
> > Thus, ob2(n) is always called with n >= 2.
> >
> > ... should be safe no ?
>
> +config UCLAMP_GROUPS_COUNT
> + int "Number of different utilization clamp values supported"
> + range 0 32
> + default 5
>
> 0+1 == 1
Seems so... :)
> Increase the min range and you should be good I think.
... dunno why I was absolutely convinced that was already 1, since 0
groups does not make a lot of sense. :/
--
#include <best/regards.h>
Patrick Bellasi
On 07-Nov 15:44, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 02:24:28PM +0000, Patrick Bellasi wrote:
> > On 07-Nov 14:44, Peter Zijlstra wrote:
> > > On Mon, Oct 29, 2018 at 06:32:57PM +0000, Patrick Bellasi wrote:
>
> > > > +static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> > > > + unsigned int clamp_value)
> > > > +{
> > > > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > > > + unsigned int prev_group_id = uc_se->group_id;
> > > > + union uclamp_map uc_map_old, uc_map_new;
> > > > + unsigned int free_group_id;
> > > > + unsigned int group_id;
> > > > + unsigned long res;
> > > > +
> > > > +retry:
> > > > +
> > > > + free_group_id = UCLAMP_GROUPS;
> > > > + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> > > > + uc_map_old.data = atomic_long_read(&uc_maps[group_id].adata);
> > > > + if (free_group_id == UCLAMP_GROUPS && !uc_map_old.se_count)
> > > > + free_group_id = group_id;
> > > > + if (uc_map_old.value == clamp_value)
> > > > + break;
> > > > + }
> > > > + if (group_id >= UCLAMP_GROUPS) {
> > > > +#ifdef CONFIG_SCHED_DEBUG
> > > > +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
> > > > + if (unlikely(free_group_id == UCLAMP_GROUPS)) {
> > > > + pr_err_ratelimited(UCLAMP_MAPERR, clamp_value);
> > > > + return;
> > > > + }
> > > > +#endif
> > >
> > > Can you please put in a comment, either here or on top, on why this can
> > > not in fact happen? And we're always guaranteed a free group.
> >
> > You're right, that's confusing, especially because up to this point
> > we are not guaranteed one. We are always guaranteed a free group once we add:
> >
> > sched/core: uclamp: add clamp group bucketing support
> >
> > I've kept it separated to better document how we introduce that
> > support.
> >
> > Is it ok for you if I call out in the change log that the
> > guarantee comes from a following patch... and add the comment in
> > that later patch?
>
> Urgh.. that is mighty confusing and since this stuff actually 'works'
> might result in bisection issues too, right?
True...
> I would really rather prefer a series that makes sense in the order you
> read it.
... yes, bisects can be a problem, if we run functional tests on them.
Ok, let's see what you think about the bucketing support, and then we
can see if it's better to keep them separate by adding here some checks
to remove afterwards... or just squash them in from the beginning.
--
#include <best/regards.h>
Patrick Bellasi
On 07-Nov 15:55, Peter Zijlstra wrote:
> On Wed, Nov 07, 2018 at 02:48:09PM +0000, Patrick Bellasi wrote:
> > On 07-Nov 14:35, Peter Zijlstra wrote:
> > You mean se_count overflow ?
>
> Yah..
>
> > > And I'm not really a fan of hiding that error in a define like you keep
> > > doing.
> >
> > The #define is not there to mask an overflow, it's there to catch the
>
> +#define UCLAMP_MAPERR "clamp value [%u] mapping to clamp group failed\n"
>
> Is what I was talking about.
>
> > > What's wrong with something like:
> > >
> > > if (SCHED_WARN(free_group_id == UCLAMP_GROUPS))
> > > return;
> >
> > Right, the flow should be:
> >
> > 1. try to find a valid clamp group
> > 2. if you don't find one, the data structures are corrupted
> > warn once for data corruption
> > do not map this scheduling entity and return
> > 3. map the scheduling entity
> >
> > Is that ok ?
>
> That's what the proposed code does.
>
> > > and
> > >
> > > > + uc_map_new.se_count = uc_map_old.se_count + 1;
> > >
> > > if (SCHED_WARN(!new.se_count))
> > > new.se_count = -1;
> >
> > Mmm... not sure we can recover from a corrupted refcount or an
> > overflow.
> >
> > What should we do on these cases, disable uclamp completely ?
>
> You can teach put to never decrement -1 (aka. all 1s).
Still, we don't know when to re-enable -1 again...
But, with bucketization, this will eventually turn into a small
performance penalty at run-time when a CPU clamp group becomes
empty (since we will end up scanning an array with some holes to find
the new max)... maybe still acceptable...
Will look into that for v6!
Thanks
> But its all SCHED_DEBUG stuff anyway, so who really cares.
--
#include <best/regards.h>
Patrick Bellasi
On Mon, Oct 29, 2018 at 06:32:59PM +0000, Patrick Bellasi wrote:
> +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> +{
> + unsigned int group_id;
> + int max_value = 0;
> +
> + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> + if (!rq->uclamp.group[clamp_id][group_id].tasks)
> + continue;
> + /* Both min and max clamps are MAX aggregated */
> + if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> + max_value = rq->uclamp.group[clamp_id][group_id].value;
max_value = max(max_value, rq->uclamp.group[clamp_id][group_id].value);
> + if (max_value >= SCHED_CAPACITY_SCALE)
> + break;
> + }
> + rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the clamp index to update
> + *
> + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> + * the task's uclamp::group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int group_id;
> +
> + if (unlikely(!p->uclamp[clamp_id].mapped))
> + return;
> +
> + group_id = p->uclamp[clamp_id].group_id;
> + p->uclamp[clamp_id].active = true;
> +
> + rq->uclamp.group[clamp_id][group_id].tasks += 1;
++
> +
> + if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> + rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
rq->uclamp.value[clamp_id] = max(rq->uclamp.value[clamp_id],
p->uclamp[clamp_id].value);
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @rq: the CPU's rq from where the clamp group has to be released
> + * @clamp_id: the clamp index to update
> + *
> + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> + * counted by the task is released.
> + * If this was the last task reference counting the current max clamp group,
> + * then the CPU clamping is updated to find the new max for the specified
> + * clamp index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int clamp_value;
> + unsigned int group_id;
> +
> + if (unlikely(!p->uclamp[clamp_id].mapped))
> + return;
> +
> + group_id = p->uclamp[clamp_id].group_id;
> + p->uclamp[clamp_id].active = false;
> +
SCHED_WARN_ON(!rq->uclamp.group[clamp_id][group_id].tasks);
> + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> + rq->uclamp.group[clamp_id][group_id].tasks -= 1;
--
> +#ifdef CONFIG_SCHED_DEBUG
> + else {
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> + cpu_of(rq), clamp_id, group_id);
> + }
> +#endif
> +
> + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> + return;
> +
> + clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> +#ifdef CONFIG_SCHED_DEBUG
> + if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> + cpu_of(rq), clamp_id, group_id);
> + }
> +#endif
SCHED_WARN_ON(clamp_value > rq->uclamp.value[clamp_id]);
> + if (clamp_value >= rq->uclamp.value[clamp_id])
> + uclamp_cpu_update(rq, clamp_id);
> +}
> @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> if (res != uc_map_old.data)
> goto retry;
>
> + /* Ensure each CPU tracks the correct value for this clamp group */
> + if (likely(uc_map_new.se_count > 1))
> + goto done;
> + for_each_possible_cpu(cpu) {
yuck yuck yuck.. why!?
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> + /* Refcounting is expected to be always 0 for free groups */
> + if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> + uc_cpu->group[clamp_id][group_id].tasks = 0;
> +#ifdef CONFIG_SCHED_DEBUG
> + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> + cpu, clamp_id, group_id);
> +#endif
SCHED_WARN_ON();
> + }
> +
> + if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> + continue;
> + uc_cpu->group[clamp_id][group_id].value = clamp_value;
> + }
> +
> +done:
> +
> /* Update SE's clamp values and attach it to new clamp group */
> uc_se->value = clamp_value;
> uc_se->group_id = group_id;
On Mon, Oct 29, 2018 at 06:33:01PM +0000, Patrick Bellasi wrote:
> When a util_max clamped task sleeps, its clamp constraints are removed
> from the CPU. However, the blocked utilization on that CPU can still be
> higher than the max clamp value enforced while that task was running.
>
> The release of a util_max clamp when a CPU is going to be idle could
> thus allow unwanted CPU frequency increases while tasks are not
> running. This can happen, for example, when a frequency update is
> triggered from another CPU of the same frequency domain.
> In this case, when we aggregate the utilization of all the CPUs in a
> shared frequency domain, schedutil can still see the full not clamped
> blocked utilization of all the CPUs and thus, eventually, increase the
> frequency.
> @@ -810,6 +811,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> if (max_value >= SCHED_CAPACITY_SCALE)
> break;
> }
> +
> + /*
> + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> + * tasks, we want to keep the CPU clamped to the last task's clamp
> + * value. This is to avoid frequency spikes to MAX when one CPU, with
> + * a high blocked utilization, sleeps and another CPU, in the same
> + * frequency domain, no longer sees the clamp on the first CPU.
> + *
> + * The UCLAMP_FLAG_IDLE is set whenever we detect, from the above
> + * loop, that there are no more RUNNABLE tasks on that CPU.
> + * In this case we enforce the CPU util_max to that of the last
> + * dequeued task.
> + */
> + if (max_value < 0) {
> + if (clamp_id == UCLAMP_MAX) {
> + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> + max_value = last_clamp_value;
> + } else {
> + max_value = uclamp_none(UCLAMP_MIN);
> + }
> + }
> +
> rq->uclamp.value[clamp_id] = max_value;
> }
*groan*, so it could be jet-lag, but I find the comment really hard to
understand.
Would not something like:
/*
* Avoid blocked utilization pushing up the frequency when we go
* idle (which drops the max-clamp) by retaining the last known
* max-clamp.
*/
Be more clear?
On Mon, Oct 29, 2018 at 06:33:02PM +0000, Patrick Bellasi wrote:
> The number of clamp groups configured at compile time defines the range
> of utilization clamp values tracked by each CPU clamp group.
> For example, with the default configuration:
> CONFIG_UCLAMP_GROUPS_COUNT 5
> we will have 5 clamp groups tracking 20% utilization each. In this case,
> a task with util_min=25% will have group_id=1.
OK I suppose; but should we not do a wholesale s/group/bucket/ at this
point?
We should probably raise the minimum number of buckets from 1 though :-)
> +/*
> + * uclamp_group_value: get the "group value" for a given "clamp value"
> + * @value: the utilization "clamp value" to translate
> + *
> + * The number of clamp groups, which is defined at compile time, allows us to
> + * track a finite number of different clamp values. Thus clamp values are
> + * grouped into bins each one representing a different "group value".
> + * This method returns the "group value" corresponding to the specified
> + * "clamp value".
> + */
> +static inline unsigned int uclamp_group_value(unsigned int clamp_value)
> +{
> +#define UCLAMP_GROUP_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_GROUPS_COUNT)
> +#define UCLAMP_GROUP_UPPER (UCLAMP_GROUP_DELTA * CONFIG_UCLAMP_GROUPS_COUNT)
> +
> + if (clamp_value >= UCLAMP_GROUP_UPPER)
> + return SCHED_CAPACITY_SCALE;
> +
> + return UCLAMP_GROUP_DELTA * (clamp_value / UCLAMP_GROUP_DELTA);
> +}
Can't we further simplify; I mean, at this point all we really need to
know is the rq's highest group_id that is in use. We don't need to
actually track the value anymore.
On 11-Nov 17:47, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:32:59PM +0000, Patrick Bellasi wrote:
> > +static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > + unsigned int group_id;
> > + int max_value = 0;
> > +
> > + for (group_id = 0; group_id < UCLAMP_GROUPS; ++group_id) {
> > + if (!rq->uclamp.group[clamp_id][group_id].tasks)
> > + continue;
> > + /* Both min and max clamps are MAX aggregated */
> > + if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> > + max_value = rq->uclamp.group[clamp_id][group_id].value;
>
> max_value = max(max_value, rq->uclamp.group[clamp_id][group_id].value);
Right, I'm used to this pattern to avoid write instructions.
I guess that here, since it's just a function-local variable, we don't
really care much...
> > + if (max_value >= SCHED_CAPACITY_SCALE)
> > + break;
> > + }
> > + rq->uclamp.value[clamp_id] = max_value;
> > +}
> > +
> > +/**
> > + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> > + * @p: the task being enqueued on a CPU
> > + * @rq: the CPU's rq where the clamp group has to be reference counted
> > + * @clamp_id: the clamp index to update
> > + *
> > + * Once a task is enqueued on a CPU's rq, the clamp group currently defined by
> > + * the task's uclamp::group_id is reference counted on that CPU.
> > + */
> > +static inline void uclamp_cpu_get_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int group_id;
> > +
> > + if (unlikely(!p->uclamp[clamp_id].mapped))
> > + return;
> > +
> > + group_id = p->uclamp[clamp_id].group_id;
> > + p->uclamp[clamp_id].active = true;
> > +
> > + rq->uclamp.group[clamp_id][group_id].tasks += 1;
>
> ++
> > +
> > + if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> > + rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
>
> rq->uclamp.value[clamp_id] = max(rq->uclamp.value[clamp_id],
> p->uclamp[clamp_id].value);
In this case instead, since we are updating a variable visible from
other CPUs, shouldn't we prefer to avoid the assignment when it's not
required?
Is the compiler smart enough to optimize the code above?
... will check more carefully.
> > +}
> > +
> > +/**
> > + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> > + * @p: the task being dequeued from a CPU
> > + * @rq: the CPU's rq from where the clamp group has to be released
> > + * @clamp_id: the clamp index to update
> > + *
> > + * When a task is dequeued from a CPU's rq, the CPU's clamp group reference
> > + * counted by the task is released.
> > + * If this was the last task reference counting the current max clamp group,
> > + * then the CPU clamping is updated to find the new max for the specified
> > + * clamp index.
> > + */
> > +static inline void uclamp_cpu_put_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int clamp_value;
> > + unsigned int group_id;
> > +
> > + if (unlikely(!p->uclamp[clamp_id].mapped))
> > + return;
> > +
> > + group_id = p->uclamp[clamp_id].group_id;
> > + p->uclamp[clamp_id].active = false;
> > +
> SCHED_WARN_ON(!rq->uclamp.group[clamp_id][group_id].tasks);
>
> > + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> > + rq->uclamp.group[clamp_id][group_id].tasks -= 1;
>
> --
>
> > +#ifdef CONFIG_SCHED_DEBUG
> > + else {
> > + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> > + cpu_of(rq), clamp_id, group_id);
> > + }
> > +#endif
>
> > +
> > + if (likely(rq->uclamp.group[clamp_id][group_id].tasks))
> > + return;
> > +
> > + clamp_value = rq->uclamp.group[clamp_id][group_id].value;
> > +#ifdef CONFIG_SCHED_DEBUG
> > + if (unlikely(clamp_value > rq->uclamp.value[clamp_id])) {
> > + WARN(1, "invalid CPU[%d] clamp group [%u:%u] value\n",
> > + cpu_of(rq), clamp_id, group_id);
> > + }
> > +#endif
>
> SCHED_WARN_ON(clamp_value > rq->uclamp.value[clamp_id]);
>
> > + if (clamp_value >= rq->uclamp.value[clamp_id])
> > + uclamp_cpu_update(rq, clamp_id);
> > +}
>
> > @@ -866,6 +1020,28 @@ static void uclamp_group_get(struct uclamp_se *uc_se, unsigned int clamp_id,
> > if (res != uc_map_old.data)
> > goto retry;
> >
> > + /* Ensure each CPU tracks the correct value for this clamp group */
> > + if (likely(uc_map_new.se_count > 1))
> > + goto done;
> > + for_each_possible_cpu(cpu) {
>
> yuck yuck yuck.. why!?
When a clamp group is released, i.e. no more SEs refcount it, that
group could be mapped in the future to a different clamp value.
Thus, when this happens, a different clamp value can be assigned to
that clamp group and we need to update the value tracked in the
CPU-side data structures. That's the value actually used to figure out
the min/max clamps at enqueue/dequeue times.
However, since here we are in the slow-path, this should not be an
issue, should it?
>
> > + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> > +
> > + /* Refcounting is expected to be always 0 for free groups */
> > + if (unlikely(uc_cpu->group[clamp_id][group_id].tasks)) {
> > + uc_cpu->group[clamp_id][group_id].tasks = 0;
> > +#ifdef CONFIG_SCHED_DEBUG
> > + WARN(1, "invalid CPU[%d] clamp group [%u:%u] refcount\n",
> > + cpu, clamp_id, group_id);
> > +#endif
>
> SCHED_WARN_ON();
>
> > + }
> > +
> > + if (uc_cpu->group[clamp_id][group_id].value == clamp_value)
> > + continue;
> > + uc_cpu->group[clamp_id][group_id].value = clamp_value;
> > + }
> > +
> > +done:
> > +
> > /* Update SE's clamp values and attach it to new clamp group */
> > uc_se->value = clamp_value;
> > uc_se->group_id = group_id;
--
#include <best/regards.h>
Patrick Bellasi
On 11-Nov 18:08, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:33:01PM +0000, Patrick Bellasi wrote:
> > When a util_max clamped task sleeps, its clamp constraints are removed
> > from the CPU. However, the blocked utilization on that CPU can still be
> > higher than the max clamp value enforced while that task was running.
> >
> > The release of a util_max clamp when a CPU is going to be idle could
> > thus allow unwanted CPU frequency increases while tasks are not
> > running. This can happen, for example, when a frequency update is
> > triggered from another CPU of the same frequency domain.
> > In this case, when we aggregate the utilization of all the CPUs in a
> > shared frequency domain, schedutil can still see the full not clamped
> > blocked utilization of all the CPUs and thus, eventually, increase the
> > frequency.
>
> > @@ -810,6 +811,28 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
> > if (max_value >= SCHED_CAPACITY_SCALE)
> > break;
> > }
> > +
> > + /*
> > + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> > + * tasks, we want to keep the CPU clamped to the last task's clamp
> > + * value. This is to avoid frequency spikes to MAX when one CPU, with
> > + * a high blocked utilization, sleeps and another CPU, in the same
> > + * frequency domain, no longer sees the clamp on the first CPU.
> > + *
> > + * The UCLAMP_FLAG_IDLE is set whenever we detect, from the above
> > + * loop, that there are no more RUNNABLE tasks on that CPU.
> > + * In this case we enforce the CPU util_max to that of the last
> > + * dequeued task.
> > + */
> > + if (max_value < 0) {
> > + if (clamp_id == UCLAMP_MAX) {
> > + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> > + max_value = last_clamp_value;
> > + } else {
> > + max_value = uclamp_none(UCLAMP_MIN);
> > + }
> > + }
> > +
> > rq->uclamp.value[clamp_id] = max_value;
> > }
>
> *groan*, so it could be jet-lag, but I find the comment really hard to
> understand.
>
> Would not something like:
>
> /*
> * Avoid blocked utilization pushing up the frequency when we go
> * idle (which drops the max-clamp) by retaining the last known
> * max-clamp.
> */
>
> Be more clear?
It works: short and effective... will update in v6.
Thanks.
--
#include <best/regards.h>
Patrick Bellasi
On 12-Nov 01:09, Peter Zijlstra wrote:
> On Mon, Oct 29, 2018 at 06:33:02PM +0000, Patrick Bellasi wrote:
> > The number of clamp groups configured at compile time defines the range
> > of utilization clamp values tracked by each CPU clamp group.
> > For example, with the default configuration:
> > CONFIG_UCLAMP_GROUPS_COUNT 5
> > we will have 5 clamp groups tracking 20% utilization each. In this case,
> > a task with util_min=25% will have group_id=1.
>
> OK I suppose; but should we not do a wholesale s/group/bucket/ at this
> point?
Yes, if bucketization is acceptable, we should probably rename.
Question is: are you ok with renaming in this patch, or would you
prefer I use that naming from the beginning?
If we want to use "bucket" from the beginning, then we should also
probably squash the entire patch into the previous ones and drop this
one.
I personally prefer to keep this concept in a separate patch, but at
the same time I don't much like the idea of a massive renaming in this
patch.
>
> We should probably raise the minimum number of buckets from 1 though :-)
Mmm... the default is already set to what fits into a single cache
line... perhaps we can use that as a minimum too?
But technically we can (partially) track different clamp values with
just one bucket too... (explanation in the following comment).
> > +/*
> > + * uclamp_group_value: get the "group value" for a given "clamp value"
> > + * @value: the utilization "clamp value" to translate
> > + *
> > + * The number of clamp groups, which is defined at compile time, allows us to
> > + * track a finite number of different clamp values. Thus clamp values are
> > + * grouped into bins each one representing a different "group value".
> > + * This method returns the "group value" corresponding to the specified
> > + * "clamp value".
> > + */
> > +static inline unsigned int uclamp_group_value(unsigned int clamp_value)
> > +{
> > +#define UCLAMP_GROUP_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_GROUPS_COUNT)
> > +#define UCLAMP_GROUP_UPPER (UCLAMP_GROUP_DELTA * CONFIG_UCLAMP_GROUPS_COUNT)
> > +
> > + if (clamp_value >= UCLAMP_GROUP_UPPER)
> > + return SCHED_CAPACITY_SCALE;
> > +
> > + return UCLAMP_GROUP_DELTA * (clamp_value / UCLAMP_GROUP_DELTA);
> > +}
>
> Can't we further simplify; I mean, at this point all we really need to
> know is the rq's highest group_id that is in use. We don't need to
> actually track the value anymore.
This would force us to track each clamp value as the exact bucket
value. Instead, by tracking the actual clamp value within a bucket, we
have the chance to update the bucket value to the actual (max) clamp
value of the RUNNABLE tasks in that bucket.
In a properly configured system, this allows tracking exact clamp
values with a minimum number of buckets.
--
#include <best/regards.h>
Patrick Bellasi
On 13-Nov 07:11, Patrick Bellasi wrote:
> On 11-Nov 17:47, Peter Zijlstra wrote:
> > On Mon, Oct 29, 2018 at 06:32:59PM +0000, Patrick Bellasi wrote:
[...]
> > > + /* Both min and max clamps are MAX aggregated */
> > > + if (max_value < rq->uclamp.group[clamp_id][group_id].value)
> > > + max_value = rq->uclamp.group[clamp_id][group_id].value;
> >
> > max_value = max(max_value, rq->uclamp.group[clamp_id][group_id].value);
>
> Right, I'm used to this pattern to avoid write instructions.
> I guess that here, since it's just a function-local variable, we don't
> really care much...
The above does not work either, because we now use bitfields:
In file included from ./include/linux/list.h:9:0,
from ./include/linux/rculist.h:10,
from ./include/linux/pid.h:5,
from ./include/linux/sched.h:14,
from kernel/sched/sched.h:5,
from kernel/sched/core.c:8:
kernel/sched/core.c: In function ‘uclamp_cpu_update’:
kernel/sched/core.c:867:5: error: ‘typeof’ applied to a bit-field
rq->uclamp.group[clamp_id][group_id].value);
^
[...]
> > > + if (rq->uclamp.value[clamp_id] < p->uclamp[clamp_id].value)
> > > + rq->uclamp.value[clamp_id] = p->uclamp[clamp_id].value;
> >
> > rq->uclamp.value[clamp_id] = max(rq->uclamp.value[clamp_id],
> > p->uclamp[clamp_id].value);
>
> In this case instead, since we are updating a variable visible from
> other CPUs, shouldn't we prefer to avoid the assignment when it's not
> required?
And what about this ?
> Is the compiler smart enough to optimize the code above?
> ... will check more carefully.
I did not really check what the compiler does in the two cases but,
given also the above, for consistency I would probably prefer to keep
both max aggregations as originally defined.
What do you think?
--
#include <best/regards.h>
Patrick Bellasi