This is a respin of:
https://lore.kernel.org/lkml/[email protected]
Which has been rebased on today's tip/sched/core:
commit 1b6266ebe3da ("watchdog: Reduce message verbosity")
and addresses all the comments from Tejun and Suren, thanks for your feedback!
Further comments and feedback are more than welcome!
Cheers Patrick
Main changes
============
.:: Properly implemented the cgroup delegation model
----------------------------------------------------
As Tejun pointed out in:
https://lore.kernel.org/lkml/[email protected]
the cgroup delegation model requires that a parent group can always restrict
the resources of a child group. To properly support this behavior we need to
add a concept of "effective" clamp values beside the one of "requested" clamp
values.
This means that a child group can always ask for certain clamp values, but the
effective clamps it gets depend on the parent group's configuration. More
specifically, the effective clamps for a group are defined as the most
restrictive value (i.e. the minimum) between the task group's clamp value and
the effective value of its parent task group.
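In pseudo-code terms, the intended semantics can be sketched as follows; the
struct and names below are illustrative only and do not match the actual
kernel data structures used by the series:

struct group {
        struct group *parent;
        unsigned int requested;         /* requested clamp value */
};

static unsigned int effective_clamp(const struct group *g)
{
        unsigned int value = g->requested;

        /* A parent can only further restrict, never relax, its children */
        for (g = g->parent; g; g = g->parent)
                if (g->requested < value)
                        value = g->requested;

        return value;
}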
This new feature is introduced by this new patch in this series:
[PATCH v3 09/14] sched/core: uclamp: propagate parent clamps
.:: Added support for system defaults
-------------------------------------
The introduction of the previous patch implies that the root task group must be
configured with a default min utilization clamping which corresponds to the
maximum value, i.e. root_task_group::uclamp[UTIL_MIN].value = 100%.
Otherwise, subgroups will always have an effective 0% minimum clamp.
To fix this misbehavior, as well as to overcome the (cgroup imposed) limitation
of non configurable attributes of the root task group, in this new patch:
[PATCH v3 12/14] sched/core: uclamp: add system default clamps
we add sysfs support for system wide defaults.
This should satisfy another comment by Tejun and it also provides a convenient
system wide configuration API, which is available independently from cgroup.
.:: Improved syscall API semantics
----------------------------------
As pointed out and suggested by Suren, the __sched_setscheduler() syscall
semantics have been improved to support:
- single attribute configuration:
  by using a new set of dedicated sched_attr::sched_flags we can now specify
  which clamp value we want to configure (a usage sketch follows below)
- atomic setting, or failure, when both attributes are configured at the
  same time.
These changes affect mainly:
[PATCH v3 01/14] sched/core: uclamp: extend sched_setattr to support utilization clamping
[PATCH v3 02/14] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
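For reference, here is a minimal usage sketch of the improved semantics; the
per-clamp flag names are the ones used later in the series, while the clamp
value and the error handling are purely illustrative (note also that glibc
provides no sched_setattr() wrapper, so a real caller goes through the raw
syscall):

struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN,    /* touch only util_min */
        .sched_util_min = 256,                          /* requested boost */
};

/*
 * This updates only the min clamp of the target task; passing both the
 * _MIN and the _MAX flags updates both clamps, or fails without changing
 * either one if no clamp group is available.
 */
sched_setattr(pid, &attr, 0);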
.:: Other notes
---------------
The rest of the series is similar to v2 and split into these sections:
- Patches [01-04]: Per task (primary) API
- Patches [05-06]: Schedutil integration
- Patches [08-13]: Per task group (secondary) API
- Patches [07,14]: Additional improvements
Newcomer's Short Abstract (Updated)
===================================
The Linux scheduler is able to drive frequency selection, when the schedutil
cpufreq governor is in use, based on task utilization aggregated at CPU
level. The CPU utilization is then used to select the frequency which best
fits the task's generated workload. The current translation of utilization
values into a frequency selection is pretty simple: we just go to max for RT
tasks or to the minimum frequency which can accommodate the utilization of
DL+FAIR tasks.
While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks
we can aim at better frequency selection, which can take into consideration
hints coming from user-space.
Utilization clamping is a mechanism which allows "clamping" (i.e. filtering) the
utilization generated by RT and FAIR tasks within a range defined by
user-space. The clamped utilization value can then be used to enforce a minimum
and/or maximum frequency depending on which tasks are currently active on a
CPU.
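In other words, the utilization value eventually used for frequency selection
is, roughly, the following (a minimal sketch of the effect, not the in-kernel
code; util_min and util_max stand for the clamp values currently enforced on
the CPU by its RUNNABLE tasks):

/* Utilization actually considered by schedutil for a CPU */
static unsigned int clamped_util(unsigned int util,
                                 unsigned int util_min,
                                 unsigned int util_max)
{
        if (util < util_min)
                return util_min;
        if (util > util_max)
                return util_max;
        return util;
}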
The main use-cases for utilization clamping are:
- boosting: better interactive response for small tasks which
are affecting the user experience.
Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). In this case, from its
utilization alone the scheduler does not have a complete view of the task's
requirements and, since it's a small utilization task, schedutil will keep
selecting a more energy efficient CPU, with smaller capacity and lower
frequency, thus affecting the overall time required to complete its
activations.
- capping: increase energy efficiency for background tasks not directly
affecting the user experience.
Since running on a lower capacity CPU at a lower frequency is in general
more energy efficient, when completion time is not a main goal, capping the
utilization considered for certain (possibly big) tasks can have positive
effects, both on energy consumption and thermal stress.
Moreover, this support also allows RT tasks to be made more energy friendly
on mobile systems, where running them on high capacity CPUs at the maximum
frequency is not strictly required.
From these two use-cases, it's worth noticing that frequency selection
biasing, introduced by patches 5 and 6 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler on tasks placement decisions.
Utilization is a task specific property which is used by the scheduler to know
how much CPU bandwidth a task requires (under certain conditions).
Thus, the utilization clamp values defined either per-task or via the CPU
controller, can be used to represent tasks to the scheduler as being bigger
(or smaller) than what they really are.
Utilization clamping thus ultimately enables interesting additional
optimizations, especially on asymmetric capacity systems like Arm
big.LITTLE and DynamIQ CPUs, where:
- boosting: small tasks are preferably scheduled on higher-capacity CPUs
where, despite being less energy efficient, they can complete faster
- capping: big/background tasks are preferably scheduled on low-capacity CPUs
where, being more energy efficient, they can still run but save power and
thermal headroom for more important tasks.
This additional usage of utilization clamping is not presented in this
series but it's an integral part of the Energy Aware Scheduler (EAS) feature
set. A similar solution (SchedTune), which targets the biasing of both
'frequency selection' and 'task placement', is already used on Android
kernels.
This series provides the foundation bits to add similar features in mainline
and its first simple client with the schedutil integration.
Detailed Changelog
==================
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
- remove unnecessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() fail if both clamps are required but
  there are no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
  group_id as a parameter
- add explicit calls to uclamp_group_find()
  which is no longer part of uclamp_group_get()
- fixed an unnecessary override
- fixed some typos in comments and changelog
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
- added new patches: 09 and 12
Other changes:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Message-ID: <[email protected]>
- get rid of the group_id back annotation
  which is not required at this stage, where we have only per-task
  clamping support. It will be introduced later when cgroup support is
  added.
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
  a system wide API seems not of immediate interest and thus it's no
  longer supported
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Other changes:
- improved documentation to make more explicit some concepts
- set UCLAMP_GROUPS_COUNT=2 by default
  which allows fitting all the hot-path CPU clamp data into a single cache
  line while still supporting up to 2 different {min,max}_util clamps.
- use -ERANGE as range violation error
- add attributes to the default hierarchy as well as the legacy one
- implement a "nice" semantics where cgroup clamp values are always
  used to restrict task specific clamp values,
  i.e. tasks running on a TG are only allowed to demote themselves.
- patches re-ordering in top-down way
- rebased on v4.18-rc4
Patrick Bellasi (14):
sched/core: uclamp: extend sched_setattr to support utilization
clamping
sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
sched/core: uclamp: add CPU's clamp groups accounting
sched/core: uclamp: update CPU's refcount on clamp changes
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
sched/cpufreq: uclamp: add utilization clamping for RT tasks
sched/core: uclamp: enforce last task UCLAMP_MAX
sched/core: uclamp: extend cpu's cgroup controller
sched/core: uclamp: propagate parent clamps
sched/core: uclamp: map TG's clamp values into CPU's clamp groups
sched/core: uclamp: use TG's clamps to restrict Task's clamps
sched/core: uclamp: add system default clamps
sched/core: uclamp: update CPU's refcount on TG's clamp changes
sched/core: uclamp: use percentage clamp values
Documentation/admin-guide/cgroup-v2.rst | 46 +
include/linux/sched.h | 58 ++
include/linux/sched/sysctl.h | 11 +
include/uapi/linux/sched.h | 8 +-
include/uapi/linux/sched/types.h | 66 +-
init/Kconfig | 61 ++
kernel/sched/core.c | 1186 +++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 38 +-
kernel/sched/fair.c | 4 +
kernel/sched/features.h | 10 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 194 ++++
kernel/sysctl.c | 16 +
13 files changed, 1688 insertions(+), 14 deletions(-)
--
2.18.0
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can be translated into proper
decisions for both task placement and frequency selection.
Other classes have a simpler model which is essentially based on the
concept of POSIX priorities.
Such a simple priority based model however does not allow exploiting
some of the most advanced features of the Linux scheduler like, for
example, driving frequency selection via the schedutil cpufreq
governor. Nevertheless, also for non SCHED_DEADLINE tasks, it's still
interesting to define task properties which can be used to better
support certain scheduler decisions.
Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization for a task.
This will allow the implementation of a more advanced per-task frequency
control mechanism which is not based just on a "passive" measured task
utilization but on a more "active" approach. For example, it could be
possible to boost interactive tasks, thus getting better performance, or
cap background tasks, thus being more energy efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq
powersave, performance and userspace governors, but with much finer
grained and per-task control.
Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr(), a syscall which already
allows defining task specific properties for different scheduling
classes.
Specifically, a new pair of attributes allows specifying a minimum and
maximum utilization which the scheduler should consider for a task.
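As a usage sketch (not part of the patch): glibc provides no sched_setattr()
wrapper, so a user-space caller goes through the raw syscall and mirrors the
extended attribute layout locally; the clamp values below are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the extended uapi struct sched_attr */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;        /* new: utilization hints */
        uint32_t sched_util_max;
};

#define SCHED_FLAG_UTIL_CLAMP   0x08    /* as defined by this patch */

int main(void)
{
        struct sched_attr attr = {
                .size           = sizeof(attr),
                .sched_policy   = 0,    /* SCHED_OTHER */
                .sched_flags    = SCHED_FLAG_UTIL_CLAMP,
                .sched_util_min = 256,  /* boost hint */
                .sched_util_max = 1024, /* no cap (SCHED_CAPACITY_SCALE) */
        };

        /* pid 0: apply the clamps to the calling task */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                perror("sched_setattr");

        return 0;
}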
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- removed UCLAMP_NONE not used by this patch
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- move at the head of the series
As discussed at OSPM, using a [0..SCHED_CAPACITY_SCALE] range seems to
be acceptable. However, an additional patch has been added at the end of
the series which introduces a simple abstraction to use a more
generic [0..100] range.
At OSPM we also discarded the idea of "recycling" the usage of
sched_runtime and sched_period, which would have made the API too
complex for limited benefit.
---
include/linux/sched.h | 13 +++++++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 64 +++++++++++++++++++++++++++-----
init/Kconfig | 19 ++++++++++
kernel/sched/core.c | 39 +++++++++++++++++++
5 files changed, 129 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e0f4f56c9310..42f6439378e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,14 @@ struct vtime {
u64 gtime;
};
+enum uclamp_id {
+ UCLAMP_MIN = 0, /* Minimum utilization */
+ UCLAMP_MAX, /* Maximum utilization */
+
+ /* Utilization clamping constraints count */
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -649,6 +657,11 @@ struct task_struct {
#endif
struct sched_dl_entity dl;
+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values for this task */
+ int uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..c27d6e81517b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_UTIL_CLAMP 0x08
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_UTIL_CLAMP)
#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..7421cd25354d 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -21,8 +21,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows defining additional attributes to
+ * improve the scheduler's knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows describing a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +59,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +72,28 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows specifying the utilization
+ * expected from a task. These attributes allow informing the scheduler
+ * about the utilization boundaries within which it is safe to schedule the
+ * task. These utilization boundaries are valuable information to support
+ * scheduler decisions on both task placement and frequency selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. Thus, for
+ * example, a 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger than 0 is more likely to be
+ * scheduled on a CPU which can provide that bandwidth.
+ * A task with a max utilization value smaller than 1024 is more likely to be
+ * scheduled on a CPU which does not provide more than the required bandwidth.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +111,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..1d45a6877d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool
+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max CPU
+ bandwidth which is allowed for a task.
+ The max bandwidth allows clamping the maximum frequency a task can
+ use, while the min bandwidth allows defining a minimum frequency a
+ task will always use.
+
+ If in doubt, say N.
+
+endmenu
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index deafa9fe602b..2cabbbcaa447 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -716,6 +716,28 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}
+#ifdef CONFIG_UCLAMP_TASK
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
+ p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+
+ return 0;
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2156,6 +2178,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif
+#ifdef CONFIG_UCLAMP_TASK
+ p->uclamp[UCLAMP_MIN] = 0;
+ p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -4218,6 +4245,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4724,6 +4758,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);
+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN];
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+#endif
+
rcu_read_unlock();
retval = sched_read_attr(uattr, &attr, size);
--
2.18.0
Utilization clamping allows to clamp the utilization of a CPU within a
[util_min, util_max] range. This range depends on the set of currently
RUNNABLE tasks on a CPU, where each task references two "clamp groups"
defining the util_min and the util_max clamp values to be considered for
that task. The clamp value mapped by a clamp group applies to a CPU only
when there is at least one task RUNNABLE referencing that clamp group.
When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
it can be required to re-compute what is the new "aggregated" clamp
value to apply on that CPU.
Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).
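For example (illustrative values): with two RUNNABLE tasks on a CPU, one with
util_min=512 and one with util_min=0, the CPU's min clamp is max(512, 0) = 512,
so the non-boosted task cannot reduce the boost granted to the other one;
likewise, with util_max values of 512 and 1024, the CPU's max clamp is
max(512, 1024) = 1024, so the capped task cannot cap the other one.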
Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.
Tasks have a:
task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group in
which they should refcount at enqueue time.
CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index.
The clamp value of each clamp group is tracked by
rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an
unordered array of clamp values. However, the MAX aggregation of the
currently active clamp groups is implemented to minimize the number of
times we need to scan the complete (unordered) clamp group array to
figure out the new max value. This operation indeed happens only when we
dequeue the last task of the clamp group corresponding to the current max
clamp, and thus the CPU is either entering IDLE or going to schedule a
less boosted or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect fewer than
10 different values.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add WARN on unlikely un-referenced decrement in uclamp_cpu_put_id()
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <CAJuCfpGaKvxKcO=RLcmveHRB9qbMrvFs2yFVrk=k-v_m7JkxwQ@mail.gmail.com>
- few typos fixed
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality
Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments
Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init
Other:
- rebased on v4.18-rc4
- improved documentation to make more explicit some concepts.
---
kernel/sched/core.c | 195 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 4 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 71 ++++++++++++++++
4 files changed, 274 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4caa2686644b..f058ceb14d25 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -825,9 +825,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
unsigned int clamp_value)
{
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ struct uclamp_cpu *uc_cpu;
+ int cpu;
+ /* Set clamp group map */
uc_map[group_id].value = clamp_value;
uc_map[group_id].se_count = 0;
+
+ /* Set clamp groups on all CPUs */
+ for_each_possible_cpu(cpu) {
+ uc_cpu = &cpu_rq(cpu)->uclamp;
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ uc_cpu->group[clamp_id][group_id].tasks = 0;
+ }
}
/**
@@ -882,6 +892,179 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
return -ENOSPC;
}
+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @cpu: the CPU which utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups is subject to change. Since each clamp group enforces a
+ * different utilization clamp value, once the set of these groups changes it
+ * can be required to re-compute what is the new clamp value to apply for that
+ * CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
+ */
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
+ int max_value = UCLAMP_NOT_VALID;
+ unsigned int group_id;
+
+ for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
+ if (!uclamp_group_active(uc_grp, group_id))
+ continue;
+
+ /* Both min and max clamp are MAX aggregated */
+ max_value = max(max_value, uc_grp[group_id].value);
+
+ /* Stop if we reach the max possible clamp */
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ rq->uclamp.value[clamp_id] = max_value;
+}
+
+/**
+ * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @rq: the CPU's rq where the clamp group has to be reference counted
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
+ *
+ * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
+ * the task's uclamp.group_id is reference counted on that CPU.
+ */
+static inline void uclamp_cpu_get_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ int clamp_value;
+ int group_id;
+
+ /* No task specific clamp values: nothing to do */
+ group_id = p->uclamp[clamp_id].group_id;
+ if (group_id == UCLAMP_NOT_VALID)
+ return;
+
+ /* Reference count the task into its current group_id */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+ uc_grp[group_id].tasks += 1;
+
+ /*
+ * If this is the new max utilization clamp value, then we can update
+ * straight away the CPU clamp value. Otherwise, the current CPU clamp
+ * value is still valid and we are done.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (uc_cpu->value[clamp_id] < clamp_value)
+ uc_cpu->value[clamp_id] = clamp_value;
+}
+
+/**
+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @cpu: the CPU from where the clamp group has to be released
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
+ *
+ * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
+ * counted by the task is decreased.
+ * If this was the last task defining the current max clamp group, then the
+ * CPU clamping is updated to find the new max for the specified clamp
+ * index.
+ */
+static inline void uclamp_cpu_put_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ unsigned int clamp_value;
+ int group_id;
+
+ /* No task specific clamp values: nothing to do */
+ group_id = p->uclamp[clamp_id].group_id;
+ if (group_id == UCLAMP_NOT_VALID)
+ return;
+
+ /* Decrement the task's reference counted group index */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(uc_grp[group_id].tasks == 0)) {
+ WARN(1, "invalid CPU[%d] clamp group [%d:%d] refcount\n",
+ cpu_of(rq), clamp_id, group_id);
+ uc_grp[group_id].tasks = 1;
+ }
+#endif
+ uc_grp[group_id].tasks -= 1;
+
+ /* If this is not the last task, no updates are required */
+ if (uc_grp[group_id].tasks > 0)
+ return;
+
+ /*
+ * Update the CPU only if this was the last task of the group
+ * defining the current clamp value.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = uc_grp[group_id].value;
+ if (clamp_value >= uc_cpu->value[clamp_id])
+ uclamp_cpu_update(rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_get(): increase CPU's clamp group refcount
+ * @rq: the CPU's rq where the clamp group has to be refcounted
+ * @p: the task being enqueued
+ *
+ * Once a task is enqueued on a CPU's rq, all the clamp groups currently
+ * enforced on a task are reference counted on that rq.
+ * Not all scheduling classes have utilization clamping support; their tasks
+ * will be silently ignored.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_get_id(p, rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_put(): decrease CPU's clamp group refcount
+ * @cpu: the CPU's rq where the clamp group refcount has to be decreased
+ * @p: the task being dequeued
+ *
+ * When a task is dequeued from a CPU's rq, all the clamp group refcounts the
+ * task has taken at its enqueue time have to be decreased for that
+ * CPU.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_put_id(p, rq, clamp_id);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -1026,9 +1209,17 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
static void __init init_uclamp(void)
{
int clamp_id;
+ int cpu;
mutex_init(&uclamp_mutex);
+ /* Init CPU's clamp groups */
+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ memset(uc_cpu, UCLAMP_NOT_VALID, sizeof(struct uclamp_cpu));
+ }
+
/* Init SE's clamp map */
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
@@ -1042,6 +1233,8 @@ static void __init init_uclamp(void)
}
#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1058,6 +1251,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);
+ uclamp_cpu_get(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}
@@ -1069,6 +1263,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);
+ uclamp_cpu_put(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93fcc604..c3d87780764c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10054,6 +10054,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2e2955a8cf8f..06ec33467dd9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2404,6 +2404,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,
.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..8f96926703fc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -764,6 +764,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * struct uclamp_group - Utilization clamp Group
+ * @value: utilization clamp value for tasks on this clamp group
+ * @tasks: number of RUNNABLE tasks on this clamp group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ int value;
+ int tasks;
+};
+
+/**
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @group: utilization clamp groups affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ int value[UCLAMP_CNT];
+ struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -801,6 +845,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;
+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1560,6 +1609,10 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
};
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
@@ -2139,6 +2192,24 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_group_active: check if a clamp group is active on a CPU
+ * @uc_grp: the clamp groups for a CPU
+ * @group_id: the clamp group to check
+ *
+ * A clamp group affects a CPU if it has at least one RUNNABLE task.
+ *
+ * Return: true if the specified CPU has at least one RUNNABLE task
+ * for the specified clamp group.
+ */
+static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
+ int group_id)
+{
+ return uc_grp[group_id].tasks > 0;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef CONFIG_CPU_FREQ
DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
--
2.18.0
Utilization clamp values enforced on a CPU by a task can be updated at
run-time, for example via a sched_setattr syscall, while a task is
currently RUNNABLE on that CPU. In these cases, the task may already be
refcounting a clamp group on its CPU and thus we need to update this
reference to ensure the new constraints are immediately enforced.
Since a clamp value change always implies a clamp group refcount update,
this patch hooks into the clamp group refcount getter to trigger a CPU
refcount syncup. Such a syncup is required only by currently RUNNABLE
tasks which are also referencing at least one valid clamp group.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Other:
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not required at this stage, where we have only per-task
clamping support. It will be introduced later when CGroups support is
added.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 61 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 45 ++++++++++++++++++++++++++++++++
2 files changed, 101 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f058ceb14d25..bc2beedec7bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1065,6 +1065,52 @@ static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
uclamp_cpu_put_id(p, rq, clamp_id);
}
+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task which clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ * @group_id: the clamp group to update
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups have to be updated for each CPU containing a RUNNABLE task belonging
+ * to this task's group. Sleeping tasks are not updated since they will be
+ * enqueued with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) RQ of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * The setting of the clamp group is serialized by task_rq_lock().
+ * Thus, if the task's task_struct is not referencing a valid group
+ * index, then that task is not yet RUNNABLE and it's going to be
+ * enqueued with the proper clamp group value.
+ */
+ if (!uclamp_task_active(p))
+ goto done;
+
+ /* Release p's currently referenced clamp group */
+ uclamp_cpu_put_id(p, rq, clamp_id);
+
+ /* Get p's new clamp group */
+ uclamp_cpu_get_id(p, rq, clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -1100,6 +1146,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
/**
* uclamp_group_get: increase the reference count for a clamp group
+ * @p: the task which clamp value must be tracked
* @clamp_id: the clamp index affected by the task
* @next_group_id: the clamp group to refcount
* @uc_se: the utilization clamp data for the task
@@ -1110,9 +1157,10 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
* this new clamp value. The corresponding clamp group index will be used by
* the task to reference count the clamp value on CPUs while enqueued.
*/
-static inline void uclamp_group_get(int clamp_id, int next_group_id,
- struct uclamp_se *uc_se,
- unsigned int clamp_value)
+static inline void uclamp_group_get(struct task_struct *p,
+ int clamp_id, int next_group_id,
+ struct uclamp_se *uc_se,
+ unsigned int clamp_value)
{
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
int prev_group_id = uc_se->group_id;
@@ -1129,6 +1177,9 @@ static inline void uclamp_group_get(int clamp_id, int next_group_id,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+ /* Update CPU's clamp group refcounts of RUNNABLE task */
+ uclamp_task_update_active(p, clamp_id, next_group_id);
+
/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);
}
@@ -1188,12 +1239,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
- uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, attr->sched_util_min);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
- uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, attr->sched_util_max);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f96926703fc..c9df61d39a03 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2208,6 +2208,51 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
{
return uc_grp[group_id].tasks > 0;
}
+
+/**
+ * uclamp_task_affects: check if a task affects a utilization clamp
+ * @p: the task to consider
+ * @clamp_id: the utilization clamp to check
+ *
+ * A task affects a clamp index if:
+ * - it's currently enqueued on a CPU
+ * - it references a valid clamp group index for the specified clamp index
+ *
+ * Return: true if p currently affects the specified clamp_id
+ */
+static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
+{
+ return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID);
+}
+
+/**
+ * uclamp_task_active: check if a task is currently clamping a CPU
+ * @p: the task to check
+ *
+ * A task actively affects the utilization clamp of a CPU if:
+ * - it's currently enqueued or running on that CPU
+ * - it's refcounted in at least one clamp group of that CPU
+ *
+ * Return: true if p is currently clamping the utilization of its CPU.
+ */
+static inline bool uclamp_task_active(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ int clamp_id;
+
+ lockdep_assert_held(&p->pi_lock);
+ lockdep_assert_held(&rq->lock);
+
+ if (!task_on_rq_queued(p) && !p->on_cpu)
+ return false;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (uclamp_task_affects(p, clamp_id))
+ return true;
+ }
+
+ return false;
+}
#endif /* CONFIG_UCLAMP_TASK */
#ifdef CONFIG_CPU_FREQ
--
2.18.0
Currently schedutil enforces a maximum frequency when RT tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace, thus allowing, for example, a max frequency to be defined
which is still reasonable for the execution of a specific RT workload.
This will contribute to making the RT class more friendly to
power/energy sensitive use-cases.
This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the util required is
defined by the constraints of the CPU control group the task belongs to.
Since utilization clamping now applies to both CFS and RT tasks, there
can be two alternative approaches:
A) clamp the combined utilization
B) combine the clamped utilizations
which have pros and cons.
Approach A) is more power efficient, since it generally selects lower
frequencies when we have both RT and CFS utilization. However, this
could affect performance of the lower priority CFS class, since the
minimum utilization clamp could be completely eclipsed by the RT
utilization.
Approach B) is fairer to the lower priority CFS class since it always
adds the required minimum utilization to that class too. For that reason
it could be less power efficient and, since we do not distinguish clamp
values based on the scheduling class, it could also end up boosting CFS
tasks more than required (e.g. when the current min utilization of a CPU
is required by an RT task). That's why this approach is masked behind a
sched feature.
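As a concrete example (made up numbers): with cpu_util_cfs = 100,
cpu_util_rt = 300 and a CPU util_min of 512, approach A) yields
clamp(100 + 300) = 512, while approach B) yields
clamp(100) + clamp(300) = 512 + 512 = 1024. B) thus guarantees the CFS
minimum on top of the RT request, but it selects a higher frequency and,
when the min clamp comes from the RT task, it boosts the CFS task beyond
what was actually asked for.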
The IO wait boost value is thus subject to clamping for RT tasks too.
This is to ensure that RT tasks as well as CFS ones are always subject
to the set of current utilization clamping constraints.
It's worth noticing that, by default, clamp values are
min_util, max_util = (0, SCHED_CAPACITY_SCALE)
and thus, RT tasks always run at the maximum OPP if not otherwise
constrained by userspace.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/cpufreq_schedutil.c | 33 +++++++++++++++++++++-----------
kernel/sched/features.h | 5 +++++
2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index a7affc729c25..bb25ef66c2d3 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -200,6 +200,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util_cfs, util_rt;
unsigned long util, irq, max;
sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
@@ -223,13 +224,25 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
- * CFS utilization can be boosted or capped, depending on utilization
- * clamp constraints configured for currently RUNNABLE tasks.
+ * CFS and RT utilizations can be boosted or capped, depending on
+ * utilization constraints enforced by currently RUNNABLE tasks.
+ * They are individually clamped to ensure fairness across classes,
+ * meaning that CFS always gets (if possible) the (minimum) required
+ * bandwidth on top of that required by higher priority classes.
*/
- util = cpu_util_cfs(rq);
- if (util)
- util = uclamp_util(cpu_of(rq), util);
- util += cpu_util_rt(rq);
+ util_cfs = cpu_util_cfs(rq);
+ util_rt = cpu_util_rt(rq);
+ if (sched_feat(UCLAMP_SCHED_CLASS)) {
+ util = 0;
+ if (util_cfs)
+ util += uclamp_util(cpu_of(rq), util_cfs);
+ if (util_rt)
+ util += uclamp_util(cpu_of(rq), util_rt);
+ } else {
+ util = util_cfs;
+ util += util_rt;
+ util = uclamp_util(cpu_of(rq), util);
+ }
/*
* We do not make cpu_util_dl() a permanent part of this sum because we
@@ -333,13 +346,11 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since RT tasks are currently not utilization clamped,
- * we don't want to apply clamping on IO boost while there is
- * blocked RT utilization.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
max_boost = sg_cpu->iowait_boost_max;
- if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
- max_boost = uclamp_util(sg_cpu->cpu, max_boost);
+ max_boost = uclamp_util(sg_cpu->cpu, max_boost);
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..a3ca449e36c1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,8 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Per class CPU's utilization clamping.
+ */
+SCHED_FEAT(UCLAMP_SCHED_CLASS, false)
--
2.18.0
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider the requirements suggested by userspace, for
example, to:
- boost tasks which are directly affecting the user experience
by running them at least at a minimum "required" frequency
- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency
These constraints are meant to support a per-task based tuning of the
frequency selection, thus allowing a fine grained definition of
performance boosting vs energy saving strategies in kernel space.
Let's add the required support to clamp the utilization generated by
FAIR tasks within the boundaries defined by their aggregated utilization
clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each task. This max aggregation
responds to the goal of not penalizing, for example, high boosted (i.e.
more important for the user-experience) CFS tasks which happen to be
co-scheduled with high capped (i.e. less important for the
user-experience) CFS tasks.
For FAIR tasks, both the utilization and the IOWait boost values
are clamped according to the CPU's aggregated utilization clamp
constraints.
The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
---
kernel/sched/cpufreq_schedutil.c | 23 ++++++++++-
kernel/sched/sched.h | 71 ++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3bc8a8..a7affc729c25 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS utilization can be boosted or capped, depending on utilization
+ * clamp constraints configured for currently RUNNABLE tasks.
*/
util = cpu_util_cfs(rq);
+ if (util)
+ util = uclamp_util(cpu_of(rq), util);
util += cpu_util_rt(rq);
/*
@@ -307,6 +312,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
unsigned int flags)
{
bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+ unsigned int max_boost;
/* Reset boost if the CPU appears to have been idle enough */
if (sg_cpu->iowait_boost &&
@@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
return;
sg_cpu->iowait_boost_pending = true;
+ /*
+ * Boost FAIR tasks only up to the CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since RT tasks are currently not utilization clamped,
+ * we don't want to apply clamping on IO boost while there is
+ * blocked RT utilization.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
+ max_boost = uclamp_util(sg_cpu->cpu, max_boost);
+
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
return;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9df61d39a03..bb305e3d5737 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2293,6 +2293,77 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */
+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_value: get the current CPU's utilization clamp value
+ * @cpu: the CPU to consider
+ * @clamp_id: the utilization clamp index (i.e. min or max utilization)
+ *
+ * The utilization clamp value for a CPU depends on its set of currently
+ * RUNNABLE tasks and their specific util_{min,max} constraints.
+ * A max aggregated value is tracked for each CPU and returned by this
+ * function.
+ *
+ * Return: the current value for the specified CPU and clamp index
+ */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
+ return uclamp_none(clamp_id);
+
+ return uc_cpu->value[clamp_id];
+}
+
+/**
+ * uclamp_util: clamp a utilization value for a specified CPU
+ * @cpu: the CPU to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently RUNNABLE tasks. Given a utilization signal, i.e. a signal in
+ * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ unsigned int min_util = uclamp_value(cpu, UCLAMP_MIN);
+ unsigned int max_util = uclamp_value(cpu, UCLAMP_MAX);
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ return uclamp_none(clamp_id);
+}
+
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.18.0
When a util_max clamped task sleeps, its clamp constraints are removed
from the CPU. However, the blocked utilization on that CPU can still be
higher than the max clamp value enforced while that task was running.
Removing the max clamp when a CPU is going idle could thus allow
unwanted CPU frequency increases while the task is not running.
This can happen, for example, when another (smaller) task is running
on a different CPU of the same frequency domain.
In this case, when we aggregate the utilization of all the CPUs in a
shared frequency domain, schedutil can still see the full non clamped
blocked utilization of all the CPUs and thus eventually increase the
frequency.
Let's fix this by using:
uclamp_cpu_put_id(UCLAMP_MAX)
uclamp_cpu_update(last_clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition. Thus, while a CPU is idle, we can still enforce the last used
clamp value for it.
By contrast, we do not track any UCLAMP_MIN since, while a CPU is
idle, we don't want to enforce any minimum frequency.
Indeed, we rely just on the blocked load decay to smoothly reduce the
frequency.
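To make the intended behavior easier to follow, here is a minimal, stand-alone C model of the idea (illustrative only, not the kernel code: names and structure are simplified): when the last max-clamped RUNNABLE task leaves a CPU we keep its clamp value and set an "idle clamped" flag; the flag is cleared, and the held value overwritten, at the next enqueue.

/* Simplified user-space model of the "hold last UCLAMP_MAX while idle"
 * behavior described above; names and structure are illustrative only.
 */
#include <stdio.h>

#define UCLAMP_FLAG_IDLE  0x01

struct cpu_clamp {
	int flags;
	int max_value;   /* currently enforced max clamp */
};

/* Called when the last max-clamped RUNNABLE task leaves the CPU */
static void cpu_max_clamp_put(struct cpu_clamp *cc, int last_clamp_value)
{
	/* No RUNNABLE clamped tasks left: keep the last value and flag it */
	cc->flags |= UCLAMP_FLAG_IDLE;
	cc->max_value = last_clamp_value;
}

/* Called when a clamped task is enqueued on the CPU */
static void cpu_max_clamp_get(struct cpu_clamp *cc, int clamp_value)
{
	if (cc->flags & UCLAMP_FLAG_IDLE) {
		/* Idle exit: overwrite the held value unconditionally */
		cc->flags &= ~UCLAMP_FLAG_IDLE;
		cc->max_value = clamp_value;
		return;
	}
	if (clamp_value > cc->max_value)
		cc->max_value = clamp_value;
}

int main(void)
{
	struct cpu_clamp cc = { .flags = 0, .max_value = 800 };

	cpu_max_clamp_put(&cc, 800);   /* task sleeps: 800 is retained  */
	printf("idle max clamp: %d\n", cc.max_value);
	cpu_max_clamp_get(&cc, 300);   /* next enqueue resets the clamp */
	printf("after enqueue : %d\n", cc.max_value);
	return 0;
}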
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Changes in v2:
- rebased on v4.18-rc4
- new patch to improve a specific issue
---
kernel/sched/core.c | 35 +++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
2 files changed, 33 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bc2beedec7bf..ff76b000bbe8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -906,7 +906,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
* For the specified clamp index, this method computes the new CPU utilization
* clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
*/
-static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
+ unsigned int last_clamp_value)
{
struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
int max_value = UCLAMP_NOT_VALID;
@@ -924,6 +925,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
if (max_value >= SCHED_CAPACITY_SCALE)
break;
}
+
+ /*
+ * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
+ * tasks, we keep the CPU clamped to the last task's clamp value.
+ * This avoids frequency spikes to MAX when one CPU, with a high
+ * blocked utilization, sleeps while another CPU, in the same frequency
+ * domain, no longer sees the clamp on the first CPU.
+ */
+ if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
+ rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
+ max_value = last_clamp_value;
+ }
+
rq->uclamp.value[clamp_id] = max_value;
}
@@ -953,13 +967,26 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;
+ /* Force clamp update on idle exit */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
+ /*
+ * This function is called for both UCLAMP_MIN (before) and
+ * UCLAMP_MAX (after). Let's reset the flag only the second
+ * time, once we know that UCLAMP_MIN has already been updated.
+ */
+ if (clamp_id == UCLAMP_MAX)
+ uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
+ uc_cpu->value[clamp_id] = clamp_value;
+ return;
+ }
+
/*
* If this is the new max utilization clamp value, then we can update
* straight away the CPU clamp value. Otherwise, the current CPU clamp
* value is still valid and we are done.
*/
- uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
if (uc_cpu->value[clamp_id] < clamp_value)
uc_cpu->value[clamp_id] = clamp_value;
}
@@ -1011,7 +1038,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
uc_cpu = &rq->uclamp;
clamp_value = uc_grp[group_id].value;
if (clamp_value >= uc_cpu->value[clamp_id])
- uclamp_cpu_update(rq, clamp_id);
+ uclamp_cpu_update(rq, clamp_id, clamp_value);
}
/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb305e3d5737..d5855babb9c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -803,6 +803,8 @@ struct uclamp_group {
* values, i.e. no min/max clamping at all.
*/
struct uclamp_cpu {
+#define UCLAMP_FLAG_IDLE 0x01
+ int flags;
int value[UCLAMP_CNT];
struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
};
--
2.18.0
The utilization is a well defined property of tasks and CPUs with an
in-kernel representation based on power-of-two values.
The current representation, in the [0..SCHED_CAPACITY_SCALE] range,
allows efficient computations in hot-paths and a sufficient fixed point
arithmetic precision.
However, the range of utilization values is still an implementation detail
which may also change in the future.
Since we don't want to commit new user-space APIs to any in-kernel
implementation detail, let's add an abstraction layer on top of the APIs
used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's
cpu.util_{min,max} attributes.
We do that by adding a couple of conversion functions which can be used
to conveniently transform utilization/capacity values from/to the internal
SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage
in the standard [0..100] range.
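As a quick sanity check of the conversion scheme described above, the following stand-alone C sketch (assuming the same 1024 fixed-point scale; the rounding test is simplified with respect to the patch below) verifies that every percentage in [0..100] survives the round-trip through the internal representation:

/* Stand-alone model of the percentage <-> fixed-point conversion
 * discussed above (scale assumed to be 1024); illustrative only.
 */
#include <assert.h>
#include <stdio.h>

#define FIXEDPOINT_SCALE 1024U

static unsigned int scale_from_pct(unsigned int pct)
{
	return (FIXEDPOINT_SCALE * pct) / 100;
}

static unsigned int scale_to_pct(unsigned int value)
{
	unsigned int rounding = 0;

	/* Values which are not multiples of 256 lose precision in the
	 * division below: compensate so that the round-trip is exact. */
	if (value & 0xFF)
		rounding = 1;

	return rounding + (100 * value) / FIXEDPOINT_SCALE;
}

int main(void)
{
	unsigned int pct;

	for (pct = 0; pct <= 100; pct++)
		assert(scale_to_pct(scale_from_pct(pct)) == pct);

	printf("20%% maps to %u internally\n", scale_from_pct(20));
	return 0;
}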
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
- rebased on tip/sched/core
Changes in v2:
- none: this is a new patch
---
Documentation/admin-guide/cgroup-v2.rst | 10 +++----
include/linux/sched.h | 20 +++++++++++++
include/uapi/linux/sched/types.h | 14 ++++++----
kernel/sched/core.c | 37 +++++++++++++++----------
4 files changed, 55 insertions(+), 26 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c73ceaf496b2..6055e4524dc6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -973,7 +973,7 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.
- The requested minimum utilization in the range [0, 1023].
+ The requested minimum percentage of utilization in the range [0, 100].
This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
@@ -984,16 +984,16 @@ All time durations are in microseconds.
reports minimum utilization clamp value currently enforced on a task
group.
- The actual minimum utilization in the range [0, 1023].
+ The actual minimum percentage of utilization in the range [0, 100].
This value can be lower than cpu.util.min in case a parent cgroup
is enforcing a more restrictive clamping on minimum utilization.
cpu.util.max
A read-write single value file which exists on non-root cgroups.
- The default is "1023". i.e. no bandwidth clamping
+ The default is "100". i.e. no bandwidth clamping
- The requested maximum utilization in the range [0, 1023].
+ The requested maximum percentage of utilization in the range [0, 100].
This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
@@ -1004,7 +1004,7 @@ All time durations are in microseconds.
reports maximum utilization clamp value currently enforced on a task
group.
- The actual maximum utilization in the range [0, 1023].
+ The actual maximum percentage of utilization in the range [0, 100].
This value can be lower than cpu.util.max in case a parent cgroup
is enforcing a more restrictive clamping on max utilization.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 753d10cd25f1..1d48453e8d4c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -321,6 +321,26 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
+static inline unsigned int scale_from_percent(unsigned int pct)
+{
+ WARN_ON(pct > 100);
+
+ return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
+}
+
+static inline unsigned int scale_to_percent(unsigned int value)
+{
+ unsigned int rounding = 0;
+
+ WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
+
+ /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
+ if (likely((value & 0xFF) && ~(value & 0x700)))
+ rounding = 1;
+
+ return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE));
+}
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 7421cd25354d..e2c2acb1c6af 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -84,15 +84,17 @@ struct sched_param {
*
* @sched_util_min represents the minimum utilization
* @sched_util_max represents the maximum utilization
+ * @sched_util_min represents the minimum utilization percentage
+ * @sched_util_max represents the maximum utilization percentage
*
- * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
- * represents the percentage of CPU time used by a task when running at the
- * maximum frequency on the highest capacity CPU of the system. Thus, for
- * example, a 20% utilization task is a task running for 2ms every 10ms.
+ * Utilization is a value in the range [0..100] which represents the
+ * percentage of CPU time used by a task when running at the maximum frequency
+ * on the highest capacity CPU of the system. Thus, for example, a 20%
+ * utilization task is a task running for 2ms every 10ms.
*
- * A task with a min utilization value bigger then 0 is more likely to be
+ * A task with a min utilization value bigger than 0% is more likely to be
* scheduled on a CPU which can provide that bandwidth.
- * A task with a max utilization value smaller then 1024 is more likely to be
+ * A task with a max utilization value smaller than 100% is more likely to be
* scheduled on a CPU which does not provide more than the required bandwidth.
*/
struct sched_attr {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6db307803047..09dc550a4174 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -730,15 +730,15 @@ static DEFINE_MUTEX(uclamp_mutex);
/*
* Minimum utilization for tasks in the root cgroup
- * default: 0
+ * default: 0%
*/
unsigned int sysctl_sched_uclamp_util_min;
/*
* Maximum utilization for tasks in the root cgroup
- * default: 1024
+ * default: 100%
*/
-unsigned int sysctl_sched_uclamp_util_max = 1024;
+unsigned int sysctl_sched_uclamp_util_max = 100;
static struct uclamp_se uclamp_default[UCLAMP_CNT];
@@ -1329,6 +1329,7 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
struct uclamp_se *uc_se;
int old_min, old_max;
+ unsigned int value;
int result;
mutex_lock(&uclamp_mutex);
@@ -1344,7 +1345,7 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
goto undo;
- if (sysctl_sched_uclamp_util_max > 1024)
+ if (sysctl_sched_uclamp_util_max > 100)
goto undo;
/* Find a valid group_id for each required clamp value */
@@ -1370,13 +1371,15 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
/* Update each required clamp group */
if (old_min != sysctl_sched_uclamp_util_min) {
uc_se = &uclamp_default[UCLAMP_MIN];
+ value = scale_from_percent(sysctl_sched_uclamp_util_min);
uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
- uc_se, sysctl_sched_uclamp_util_min);
+ uc_se, value);
}
if (old_max != sysctl_sched_uclamp_util_max) {
uc_se = &uclamp_default[UCLAMP_MAX];
+ value = scale_from_percent(sysctl_sched_uclamp_util_max);
uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
- uc_se, sysctl_sched_uclamp_util_max);
+ uc_se, value);
}
if (result) {
@@ -1519,7 +1522,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
: p->uclamp[UCLAMP_MAX].value;
if (upper_bound == UCLAMP_NOT_VALID)
- upper_bound = SCHED_CAPACITY_SCALE;
+ upper_bound = 100;
if (attr->sched_util_min > upper_bound) {
result = -EINVAL;
goto done;
@@ -1541,7 +1544,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
if (lower_bound == UCLAMP_NOT_VALID)
lower_bound = 0;
if (attr->sched_util_max < lower_bound ||
- attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+ attr->sched_util_max > 100) {
result = -EINVAL;
goto done;
}
@@ -1559,12 +1562,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
- uc_se, attr->sched_util_min);
+ uc_se, scale_from_percent(attr->sched_util_min));
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
- uc_se, attr->sched_util_max);
+ uc_se, scale_from_percent(attr->sched_util_max));
}
done:
@@ -5648,8 +5651,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);
#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN);
- attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX);
+ attr.sched_util_min = scale_to_percent(uclamp_task_value(p, UCLAMP_MIN));
+ attr.sched_util_max = scale_to_percent(uclamp_task_value(p, UCLAMP_MAX));
#endif
rcu_read_unlock();
@@ -7509,8 +7512,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
int ret = -EINVAL;
int group_id;
- if (min_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (min_value > 100)
return -ERANGE;
+ min_value = scale_from_percent(min_value);
mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7555,8 +7560,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
int ret = -EINVAL;
int group_id;
- if (max_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (max_value > 100)
return -ERANGE;
+ max_value = scale_from_percent(max_value);
mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7607,7 +7614,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
: tg->uclamp[clamp_id].value;
rcu_read_unlock();
- return util_clamp;
+ return scale_to_percent(util_clamp);
}
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
--
2.18.0
Utilization clamping requires mapping each different clamp value
into one of the available clamp groups used by the scheduler's fast-path
to account for RUNNABLE tasks. Thus, each time a TG's clamp value is
updated we need to get a reference to the new value's clamp group and
release a reference to the previous one.
Let's ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time).
We do that by slightly refactoring uclamp_group_get() to make the
*task_struct parameter optional. This allows us to re-use the code already
available to support the per-task API.
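The following is a minimal user-space model of the refcounting pattern just described (illustrative names, not the kernel API): a task group takes a reference on the clamp group mapping its new value before dropping the reference on the previous one, and a group is released once its count reaches zero.

/* Minimal model of the get/put refcounting pattern described above:
 * attach to the new clamp group first, then detach from the old one,
 * and release a group once its count reaches zero. Illustrative only.
 */
#include <stdio.h>

static int se_count[3];                /* one refcount per clamp group */

static void group_put(int group_id)
{
	if (group_id < 0)
		return;                /* TG not attached yet */
	if (--se_count[group_id] == 0)
		printf("group %d released\n", group_id);
}

static int group_get(int new_group_id, int prev_group_id)
{
	se_count[new_group_id]++;      /* attach to the new group first */
	group_put(prev_group_id);      /* then detach from the old one  */
	return new_group_id;
}

int main(void)
{
	int tg_group = -1;             /* freshly created task group    */

	tg_group = group_get(1, tg_group);   /* cpu.util.max updated    */
	tg_group = group_get(2, tg_group);   /* updated again: 1 freed  */
	printf("TG now accounted in group %d\n", tg_group);
	return 0;
}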
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- add explicit calls to uclamp_group_find(), which is no longer
part of uclamp_group_get()
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 2 +
kernel/sched/core.c | 114 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 111 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3fac2d098084..04f3b47a31bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -583,6 +583,8 @@ struct sched_dl_entity {
*
* A utilization clamp group maps a "clamp value" (value), i.e.
* util_{min,max}, to a "clamp group index" (group_id).
+ * The same "group_id" can be used by multiple TG's to enforce the same
+ * clamp "value" for a given clamp index.
*/
struct uclamp_se {
/* Utilization constraint for tasks in this group */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f692df3787bd..01229864fd93 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1205,7 +1205,8 @@ static inline void uclamp_group_get(struct task_struct *p,
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
/* Update CPU's clamp group refcounts of RUNNABLE task */
- uclamp_task_update_active(p, clamp_id, next_group_id);
+ if (p)
+ uclamp_task_update_active(p, clamp_id, next_group_id);
/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);
@@ -1262,22 +1263,60 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
{
struct uclamp_se *uc_se;
int clamp_id;
+ int group_id;
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &tg->uclamp[clamp_id];
uc_se->value = parent->uclamp[clamp_id].value;
uc_se->group_id = UCLAMP_NOT_VALID;
+
uc_se->effective.value =
parent->uclamp[clamp_id].effective.value;
uc_se->effective.group_id =
parent->uclamp[clamp_id].effective.group_id;
+
+ /*
+ * Find a valid group_id.
+ * Since it's a parent clone this will never fail.
+ */
+ group_id = uclamp_group_find(clamp_id, uc_se->value);
+#ifdef SCHED_DEBUG
+ if (unlikely(group_id == -ENOSPC)) {
+ WARN(1, "invalid clamp group [%d:%d] cloning\n",
+ clamp_id, parent->uclamp[clamp_id].group_id);
+ return 0;
+ }
+#endif
+ uclamp_group_get(NULL, clamp_id, group_id, uc_se,
+ parent->uclamp[clamp_id].value);
}
return 1;
}
+
+/**
+ * free_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all of its clamp group
+ * references.
+ */
+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+ uclamp_group_put(clamp_id, uc_se->group_id);
+ }
+}
+
#else /* CONFIG_UCLAMP_TASK_GROUP */
static inline void init_uclamp_sched_group(void) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -1389,6 +1428,7 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -6958,6 +6998,7 @@ static DEFINE_SPINLOCK(task_group_lock);
static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -7203,8 +7244,36 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * cpu_util_update_hier: propagate effective clamps down the hierarchy
+ * @css: the task group to update
+ * @clamp_id: the clamp index to update
+ * @value: the new task group clamp value
+ * @group_id: the group index mapping the new task clamp value
+ *
+ * The effective clamp for a TG is expected to track the most restrictive
+ * value between the TG's clamp value and its parent's effective clamp value.
+ * This method achieves that by:
+ * 1. updating the current TG's effective value
+ * 2. walking all the descendant task groups that need an update
+ *
+ * A TG's effective clamp needs to be updated when its current value does not
+ * match the TG's clamp value. Indeed, in this case either:
+ * a) the parent has got a more relaxed clamp value
+ * thus potentially we can relax the effective value for this group
+ * b) the parent has got a stricter clamp value
+ * thus potentially we have to restrict the effective value of this group
+ *
+ * Restriction and relaxation of the current TG's effective clamp values need
+ * to be propagated down to all its descendants. When a subgroup is found whose
+ * effective clamp value already matches its clamp value, then we can safely
+ * skip all its descendants, which are guaranteed to be already in sync.
+ *
+ * The TG's group_id is also updated to ensure it tracks the effective clamp
+ * value.
+ */
static void cpu_util_update_hier(struct cgroup_subsys_state *css,
- int clamp_id, int value)
+ int clamp_id, int value, int group_id)
{
struct cgroup_subsys_state *top_css = css;
struct uclamp_se *uc_se, *uc_parent;
@@ -7232,20 +7301,25 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
}
/* Propagate the most restrictive effective value */
- if (uc_parent->effective.value < value)
+ if (uc_parent->effective.value < value) {
value = uc_parent->effective.value;
+ group_id = uc_parent->effective.group_id;
+ }
if (uc_se->effective.value == value)
continue;
uc_se->effective.value = value;
+ uc_se->effective.group_id = group_id;
}
}
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;
+ int group_id;
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
@@ -7261,8 +7335,22 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MAX].value < min_value)
goto out;
+ /* Find a valid group_id */
+ ret = uclamp_group_find(UCLAMP_MIN, min_value);
+ if (ret == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto out;
+ }
+ group_id = ret;
+ ret = 0;
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value, group_id);
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MIN];
+ uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value);
out:
rcu_read_unlock();
@@ -7274,8 +7362,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 max_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;
+ int group_id;
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;
@@ -7291,8 +7381,22 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MIN].value > max_value)
goto out;
+ /* Find a valid group_id */
+ ret = uclamp_group_find(UCLAMP_MAX, max_value);
+ if (ret == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto out;
+ }
+ group_id = ret;
+ ret = 0;
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value, group_id);
+
+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MAX];
+ uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value);
out:
rcu_read_unlock();
--
2.18.0
When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated with those of the CPU
controller's task group (TG) in which the task is currently living.
Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG effective value.
This also mimics what already happens for a task's CPU affinity mask when
the task is also living in a cpuset. The overall idea is that cgroup
attributes are always used to restrict the per-task attributes.
Thus, this implementation allows us to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG
For this mechanism to work properly, we add the concept of "active"
clamp group, which is used to track the currently most restrictive clamp
value each task is subject to.
The active clamp is computed at enqueue time, by using an additional
task_struct::uclamp_group_id
to keep track of the clamp group in which each task is currently
accounted. This allows updating task constraints on
demand, only when a task becomes RUNNABLE, thus always using the most
restrictive clamp depending on the current TG's settings.
This solution also allows us to better decouple the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.
For consistency purposes, as well as to properly inform userspace, the
sched_getattr(2) call is updated to always return the properly
aggregated constraints as described above. This will also make
sched_getattr(2) a convenient userspace API to know the utilization
constraints enforced on a task by the cgroup's CPU controller.
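A minimal sketch of the aggregation rule described above (illustrative only, values assumed in the internal [0..1024] scale): the clamp actually accounted at enqueue time is the smaller value between the task's request and its TG's effective value.

/* Minimal model of the task/TG clamp aggregation described above:
 * the cgroup value always restricts the task-specific request.
 * Illustrative only, assuming both values use the same scale.
 */
#include <stdio.h>

static unsigned int effective_clamp(unsigned int task_value,
				    unsigned int tg_effective)
{
	/* "nice-like" policy: a task may only request less than its TG */
	return task_value < tg_effective ? task_value : tg_effective;
}

int main(void)
{
	/* task asks for an 80% boost, but its group grants at most 50% */
	printf("util_min: %u\n", effective_clamp(819, 512));
	/* task asks for less than granted: its own request is used */
	printf("util_min: %u\n", effective_clamp(205, 512));
	return 0;
}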
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- fix not required override
- fix typos in changelog
Others:
- clean up uclamp_cpu_get_id()/sched_getattr() code by moving task's
clamp group_id/value code into dedicated getter functions:
uclamp_task_group_id(), uclamp_group_value() and uclamp_task_value()
- rebased on tip/sched/core
Changes in v2:
OSPM discussion:
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values, i.e. tasks running on a
TG are only allowed to demote themselves.
Other:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 78 ++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 +-
3 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04f3b47a31bc..753d10cd25f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -681,6 +681,8 @@ struct task_struct {
struct sched_dl_entity dl;
#ifdef CONFIG_UCLAMP_TASK
+ /* Clamp group the task is currently accounted into */
+ int uclamp_group_id[UCLAMP_CNT];
/* Utilization clamp values for this task */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01229864fd93..f54fd9bda9a7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -941,14 +941,65 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
rq->uclamp.value[clamp_id] = max_value;
}
+static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
+{
+ struct uclamp_se *uc_se;
+ int clamp_value;
+ int group_id;
+
+ /* Task currently accounted into a clamp group */
+ if (uclamp_task_affects(p, clamp_id))
+ return p->uclamp_group_id[clamp_id];
+
+ /* Task specific clamp value */
+ uc_se = &p->uclamp[clamp_id];
+ clamp_value = uc_se->value;
+ group_id = uc_se->group_id;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Use TG's clamp value to limit task specific values */
+ uc_se = &task_group(p)->uclamp[clamp_id];
+ if (group_id == UCLAMP_NOT_VALID ||
+ clamp_value > uc_se->effective.value) {
+ group_id = uc_se->effective.group_id;
+ }
+#endif
+
+ return group_id;
+}
+
+static inline int uclamp_group_value(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ if (group_id == UCLAMP_NOT_VALID)
+ return uclamp_none(clamp_id);
+
+ return uc_map[group_id].value;
+}
+
+static inline int uclamp_task_value(struct task_struct *p, int clamp_id)
+{
+ int group_id = uclamp_task_group_id(p, clamp_id);
+
+ return uclamp_group_value(clamp_id, group_id);
+}
+
/**
* uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
* @p: the task being enqueued on a CPU
* @rq: the CPU's rq where the clamp group has to be reference counted
* @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
*
- * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
- * the task's uclamp.group_id is reference counted on that CPU.
+ * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group,
+ * among the task specific one and that of the task's cgroup, is reference
+ * counted on that CPU.
+ *
+ * Since the CPU's reference counted clamp group can be either that of the task
+ * or of its cgroup, we keep track of the reference counted clamp group by
+ * storing its index (group_id) into the task's task_struct::uclamp_group_id.
+ * This group index will then be used at task's dequeue time to release the
+ * correct refcount.
*/
static inline void uclamp_cpu_get_id(struct task_struct *p,
struct rq *rq, int clamp_id)
@@ -959,17 +1010,20 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
int group_id;
/* No task specific clamp values: nothing to do */
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = uclamp_task_group_id(p, clamp_id);
if (group_id == UCLAMP_NOT_VALID)
return;
+ clamp_value = uclamp_group_value(clamp_id, group_id);
/* Reference count the task into its current group_id */
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;
+ /* Track the effective clamp group */
+ p->uclamp_group_id[clamp_id] = group_id;
+
/* Force clamp update on idle exit */
uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
/*
* This function is called for both UCLAMP_MIN (before) and
@@ -1012,7 +1066,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
int group_id;
/* No task specific clamp values: nothing to do */
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = p->uclamp_group_id[clamp_id];
if (group_id == UCLAMP_NOT_VALID)
return;
@@ -1027,6 +1081,9 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
#endif
uc_grp[group_id].tasks -= 1;
+ /* Flag the task as not affecting any clamp index */
+ p->uclamp_group_id[clamp_id] = UCLAMP_NOT_VALID;
+
/* If this is not the last task, no updates are required */
if (uc_grp[group_id].tasks > 0)
return;
@@ -2885,6 +2942,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_UCLAMP_TASK
+ memset(&p->uclamp_group_id, UCLAMP_NOT_VALID,
+ sizeof(p->uclamp_group_id));
p->uclamp[UCLAMP_MIN].value = 0;
p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NOT_VALID;
p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
@@ -5467,8 +5526,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);
#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
- attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+ attr.sched_util_min = uclamp_task_value(p, UCLAMP_MIN);
+ attr.sched_util_max = uclamp_task_value(p, UCLAMP_MAX);
#endif
rcu_read_unlock();
@@ -7285,8 +7344,11 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
* groups we consider their current value.
*/
uc_se = &css_tg(css)->uclamp[clamp_id];
- if (css != top_css)
+ if (css != top_css) {
value = uc_se->value;
+ group_id = uc_se->effective.group_id;
+ }
+
/*
* Skip the whole subtrees if the current effective clamp is
* already matching the TG's clamp value.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a443b2c22cb7..a296b6463f1e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2229,7 +2229,7 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
*/
static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
{
- return (p->uclamp[clamp_id].group_id != UCLAMP_NOT_VALID);
+ return (p->uclamp_group_id[clamp_id] != UCLAMP_NOT_VALID);
}
/**
--
2.18.0
In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained based on the parent's assigned
resources. This requires properly propagating and aggregating parent
attributes down to its descendants.
Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This also represents the clamp value which is actually
used to clamp tasks in each task group.
Since it can be interesting for tasks in a cgroup to know exactly which
configuration is currently propagated/enforced, the effective clamp
values are exposed to user-space by means of a new pair of read-only
attributes: cpu.util.{min,max}.effective.
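As a reference, here is a small stand-alone C sketch of the propagation rule described above (illustrative only): each group's effective value is the minimum between its requested value and its parent's effective value, with the root initialized to the maximum.

/* Sketch of the "effective" clamp propagation described above:
 * a group's effective value is the minimum of its requested value
 * and its parent's effective value. Illustrative only.
 */
#include <stdio.h>

struct tg {
	const char *name;
	unsigned int value;        /* requested clamp        */
	unsigned int effective;    /* clamp actually applied */
	struct tg *parent;
};

static void update_effective(struct tg *tg)
{
	unsigned int value = tg->value;

	if (tg->parent && tg->parent->effective < value)
		value = tg->parent->effective;
	tg->effective = value;
}

int main(void)
{
	struct tg root   = { "root",   1024, 1024, NULL    };
	struct tg parent = { "parent",  512,    0, &root   };
	struct tg child  = { "child",   800,    0, &parent };

	update_effective(&parent);
	update_effective(&child);   /* capped by parent: 512, not 800 */

	printf("%s: requested=%u effective=%u\n",
	       child.name, child.value, child.effective);
	return 0;
}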
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <[email protected]>
- new patch in v3, to implement a suggestion from v1 review
---
Documentation/admin-guide/cgroup-v2.rst | 25 +++++++-
include/linux/sched.h | 5 ++
kernel/sched/core.c | 81 +++++++++++++++++++++++--
3 files changed, 105 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 71244b55d901..c73ceaf496b2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -973,22 +973,43 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.
- The minimum utilization in the range [0, 1023].
+ The requested minimum utilization in the range [0, 1023].
This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.
+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization in the range [0, 1023].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ is enforcing a more restrictive clamping on minimum utilization.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1023". i.e. no bandwidth clamping
- The maximum utilization in the range [0, 1023].
+ The requested maximum utilization in the range [0, 1023].
This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization in the range [0, 1023].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.
+
+
Memory
------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f48e64fb8a6..3fac2d098084 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -589,6 +589,11 @@ struct uclamp_se {
unsigned int value;
/* Utilization clamp group for this constraint */
unsigned int group_id;
+ /* Effective clamp for tasks in this group */
+ struct {
+ unsigned int value;
+ unsigned int group_id;
+ } effective;
};
union rcu_special {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2ba55a4afffb..f692df3787bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1237,6 +1237,8 @@ static inline void init_uclamp_sched_group(void)
uc_se = &root_task_group.uclamp[clamp_id];
uc_se->value = uclamp_none(clamp_id);
uc_se->group_id = group_id;
+ uc_se->effective.value = uclamp_none(clamp_id);
+ uc_se->effective.group_id = group_id;
/* Attach root TG's clamp group */
uc_map[group_id].se_count = 1;
@@ -1266,6 +1268,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
uc_se->value = parent->uclamp[clamp_id].value;
uc_se->group_id = UCLAMP_NOT_VALID;
+ uc_se->effective.value =
+ parent->uclamp[clamp_id].effective.value;
+ uc_se->effective.group_id =
+ parent->uclamp[clamp_id].effective.group_id;
}
return 1;
@@ -7197,6 +7203,44 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}
#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ int clamp_id, int value)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+
+ css_for_each_descendant_pre(css, top_css) {
+ /*
+ * The first visited task group is top_css, whose clamp value
+ * is the one passed as a parameter. For descendant task
+ * groups we consider their current value.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (css != top_css)
+ value = uc_se->value;
+ /*
+ * Skip the whole subtrees if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive value, as effective clamp.
+ */
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+ if (uc_se->effective.value == value &&
+ uc_parent->effective.value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent->effective.value < value)
+ value = uc_parent->effective.value;
+ if (uc_se->effective.value == value)
+ continue;
+
+ uc_se->effective.value = value;
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -7217,6 +7261,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MAX].value < min_value)
goto out;
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+
out:
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);
@@ -7244,6 +7291,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MIN].value > max_value)
goto out;
+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+
out:
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);
@@ -7252,14 +7302,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}
static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;
rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].effective.value
+ : tg->uclamp[clamp_id].value;
rcu_read_unlock();
return util_clamp;
@@ -7268,13 +7321,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}
static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */
@@ -7622,11 +7687,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
--
2.18.0
Utilization clamping requires each CPU to know which clamp values are
assigned to tasks that are currently RUNNABLE on that CPU.
Multiple tasks can be assigned the same clamp value and tasks with
different clamp values can be concurrently active on the same CPU.
Thus, a proper data structure is required to support a fast and
efficient aggregation of the clamp values required by the currently
RUNNABLE tasks.
For this purpose we use a per-CPU array of reference counters,
where each slot is used to account for how many tasks requiring a certain
clamp value are currently RUNNABLE on that CPU.
Each clamp value corresponds to a "group index" which identifies the
position within the array of reference counters.
:
(user-space changes) : (kernel space / scheduler)
:
SLOW PATH : FAST PATH
:
task_struct::uclamp::value : sched/core::enqueue/dequeue
: cpufreq_schedutil
:
+----------------+ +--------------------+ +-------------------+
| TASK | | CLAMP GROUP | | CPU CLAMPS |
+----------------+ +--------------------+ +-------------------+
| | | clamp_{min,max} | | clamp_{min,max} |
| util_{min,max} | | se_count | | tasks count |
+----------------+ +--------------------+ +-------------------+
:
+------------------> : +------------------->
group_id = map(clamp_value) : ref_count(group_id)
:
:
Let's introduce the support to map tasks to "clamp groups".
Specifically we introduce the required functions to translate a
"clamp value" into a clamp's "group index" (group_id).
Only a limited number of (different) clamp values are supported since:
1. there are usually only a few classes of workloads for which it makes
sense to boost/limit to different frequencies,
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path.
The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can result in
a -ENOSPC error in case this would exceed the maximum number of different
clamp values supported.
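To illustrate the value-to-group mapping and the -ENOSPC case described above, here is a compressed user-space sketch (illustrative only; in the actual patch the lookup and the refcounting are split between uclamp_group_find() and uclamp_group_get(), while here finding a free slot also claims it), assuming only 2 clamp groups are configured:

/* User-space sketch of the clamp value -> clamp group mapping described
 * above: a small, fixed-size table where each slot tracks one clamp
 * value and new values fail with -ENOSPC once the table is full.
 * Names and sizes are illustrative only.
 */
#include <errno.h>
#include <stdio.h>

#define GROUPS_COUNT 2          /* assumed CONFIG_UCLAMP_GROUPS_COUNT */
#define NOT_VALID   -1

static int map_value[GROUPS_COUNT + 1] = { NOT_VALID, NOT_VALID, NOT_VALID };

static int group_find(int clamp_value)
{
	int free_id = NOT_VALID, id;

	for (id = 0; id <= GROUPS_COUNT; id++) {
		if (map_value[id] == NOT_VALID) {
			if (free_id == NOT_VALID)
				free_id = id;       /* first free slot */
			continue;
		}
		if (map_value[id] == clamp_value)
			return id;                  /* value already mapped */
	}
	if (free_id != NOT_VALID) {
		map_value[free_id] = clamp_value;   /* claim a new group */
		return free_id;
	}
	return -ENOSPC;
}

int main(void)
{
	printf("512  -> group %d\n", group_find(512));
	printf("1024 -> group %d\n", group_find(1024));
	printf("256  -> group %d\n", group_find(256));
	printf("128  -> %d (no free groups)\n", group_find(128));
	return 0;
}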
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpF6=L=0LrmNnJrTNPazT4dWKqNv+thhN0dwpKCgUzs9sg@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
- remove unnecessary checks in uclamp_group_find()
- add WARN on unlikely un-referenced decrement in uclamp_group_put()
- make __setscheduler_uclamp() able to set just one clamp value
- make __setscheduler_uclamp() fail if both clamps are required but
there are no clamp groups available for one of them
- remove uclamp_group_find() from uclamp_group_get() which now takes a
group_id as a parameter
Others:
- rebased on tip/sched/core
Changes in v2:
- rebased on v4.18-rc4
- set UCLAMP_GROUPS_COUNT=2 by default
which allows fitting all the hot-path CPU clamps data, also partially
introduced by the following patches, into a single cache line
while still supporting up to 2 different util_{min,max} clamps.
---
include/linux/sched.h | 18 +-
include/uapi/linux/sched.h | 6 +-
init/Kconfig | 20 +++
kernel/sched/core.c | 338 +++++++++++++++++++++++++++++++++++--
4 files changed, 369 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f6439378e1..8f48e64fb8a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,9 @@ struct vtime {
u64 gtime;
};
+/* Clamp not valid, i.e. group not assigned or invalid value */
+#define UCLAMP_NOT_VALID -1
+
enum uclamp_id {
UCLAMP_MIN = 0, /* Minimum utilization */
UCLAMP_MAX, /* Maximum utilization */
@@ -575,6 +578,19 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};
+/**
+ * Utilization's clamp group
+ *
+ * A utilization clamp group maps a "clamp value" (value), i.e.
+ * util_{min,max}, to a "clamp group index" (group_id).
+ */
+struct uclamp_se {
+ /* Utilization constraint for tasks in this group */
+ unsigned int value;
+ /* Utilization clamp group for this constraint */
+ unsigned int group_id;
+};
+
union rcu_special {
struct {
u8 blocked;
@@ -659,7 +675,7 @@ struct task_struct {
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values for this task */
- int uclamp[UCLAMP_CNT];
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif
#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index c27d6e81517b..ae7e12de32ca 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,7 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
-#define SCHED_FLAG_UTIL_CLAMP 0x08
+
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)
#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
diff --git a/init/Kconfig b/init/Kconfig
index 1d45a6877d6f..701300e8f0eb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -601,7 +601,27 @@ config UCLAMP_TASK
If in doubt, say N.
+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 32
+ default 5
+ depends on UCLAMP_TASK
+ help
+ This defines the maximum number of different utilization clamp
+ values which can be concurrently enforced for each utilization
+ clamp index (i.e. minimum and maximum utilization).
+
+ Only a limited number of clamp values are supported because:
+ 1. there are usually only few classes of workloads for which it
+ makes sense to boost/cap for different frequencies,
+ e.g. background vs foreground, interactive vs low-priority.
+ 2. it allows a simpler and more memory/time efficient tracking of
+ the per-CPU clamp values.
+
+ If in doubt, use the default value.
+
endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2cabbbcaa447..4caa2686644b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -717,25 +717,337 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_mutex: serializes updates of utilization clamp values
+ *
+ * A utilization clamp value update is usually triggered from a user-space
+ * process (slow-path) but it requires a synchronization with the scheduler's
+ * (fast-path) enqueue/dequeue operations.
+ * While the fast-path synchronization is protected by RQs spinlock, this
+ * mutex ensures that we sequentially serve user-space requests.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * uclamp_map: reference counts a utilization "clamp value"
+ * @value: the utilization "clamp value" required
+ * @se_count: the number of scheduling entities requiring the "clamp value"
+ * @se_lock: serialize reference count updates by protecting se_count
+ */
+struct uclamp_map {
+ int value;
+ int se_count;
+ raw_spinlock_t se_lock;
+};
+
+/**
+ * uclamp_maps: maps each SE's "clamp value" into a CPU's "clamp group"
+ *
+ * Since only a limited number of different "clamp values" are supported, we
+ * need to map each different clamp value into a "clamp group" (group_id) to
+ * be used by the per-CPU accounting in the fast-path, when tasks are
+ * enqueued and dequeued.
+ * We also support different kind of utilization clamping, min and max
+ * utilization for example, each representing what we call a "clamp index"
+ * (clamp_id).
+ *
+ * A matrix is thus required to map "clamp values" to "clamp groups"
+ * (group_id), for each "clamp index" (clamp_id), where:
+ * - rows are indexed by clamp_id and they collect the clamp groups for a
+ * given clamp index
+ * - columns are indexed by group_id and they collect the clamp values which
+ * map to that clamp group
+ *
+ * Thus, the column index of a given (clamp_id, value) pair represents the
+ * clamp group (group_id) used by the fast-path's per-CPU accounting.
+ *
+ * NOTE: first clamp group (group_id=0) is reserved for tracking of non
+ * clamped tasks. Thus we allocate one more slot than the value of
+ * CONFIG_UCLAMP_GROUPS_COUNT.
+ *
+ * Here is the map layout and, right below, how entries are accessed by the
+ * following code.
+ *
+ * uclamp_maps is a matrix of
+ * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
+ * | |
+ * | /---------------+---------------\
+ * | +------------+ +------------+
+ * | / UCLAMP_MIN | value | | value |
+ * | | | se_count |...... | se_count |
+ * | | +------------+ +------------+
+ * +--+ +------------+ +------------+
+ * | | value | | value |
+ * \ UCLAMP_MAX | se_count |...... | se_count |
+ * +-----^------+ +----^-------+
+ * | |
+ * uc_map = + |
+ * &uclamp_maps[clamp_id][0] +
+ * clamp_value =
+ * uc_map[group_id].value
+ */
+static struct uclamp_map uclamp_maps[UCLAMP_CNT]
+ [CONFIG_UCLAMP_GROUPS_COUNT + 1];
+
+/**
+ * uclamp_group_available: checks if a clamp group is available
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index in the given clamp_id
+ *
+ * A clamp group is not free if there is at least one SE which is using a clamp
+ * value mapped on the specified clamp_id. These SEs are reference counted by
+ * the se_count of a uclamp_map entry.
+ *
+ * Return: true if there are no SE's mapped on the specified clamp
+ * index and group
+ */
+static inline bool uclamp_group_available(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ return (uc_map[group_id].value == UCLAMP_NOT_VALID);
+}
+
+/**
+ * uclamp_group_init: maps a clamp value on a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index to map a given clamp_value
+ * @clamp_value: the utilization clamp value to map
+ *
+ * Initializes a clamp group to track tasks from the fast-path.
+ * Each different clamp value, for a given clamp index (i.e. min/max
+ * utilization clamp), is mapped by a clamp group which index is used by the
+ * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp
+ * value.
+ *
+ */
+static inline void uclamp_group_init(int clamp_id, int group_id,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ uc_map[group_id].value = clamp_value;
+ uc_map[group_id].se_count = 0;
+}
+
+/**
+ * uclamp_group_reset: resets a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @group_id: the group index to release
+ *
+ * A clamp group can be reset every time there are no more task groups using
+ * the clamp value it maps for a given clamp index.
+ */
+static inline void uclamp_group_reset(int clamp_id, int group_id)
+{
+ uclamp_group_init(clamp_id, group_id, UCLAMP_NOT_VALID);
+}
+
+/**
+ * uclamp_group_find: finds the group index of a utilization clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @clamp_value: the utilization clamp value to look up
+ *
+ * Verify if a group has been assigned to a certain clamp value and return
+ * its index to be used for accounting.
+ *
+ * Since only a limited number of utilization clamp groups are allowed, if no
+ * groups have been assigned for the specified value, a new group is assigned,
+ * if possible.
+ * Otherwise an error is returned, meaning that an additional clamp value is
+ * not (currently) supported.
+ */
+static int
+uclamp_group_find(int clamp_id, unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int free_group_id = UCLAMP_NOT_VALID;
+ unsigned int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Keep track of first free clamp group */
+ if (uclamp_group_available(clamp_id, group_id)) {
+ if (free_group_id == UCLAMP_NOT_VALID)
+ free_group_id = group_id;
+ continue;
+ }
+ /* Return index of first group with same clamp value */
+ if (uc_map[group_id].value == clamp_value)
+ return group_id;
+ }
+
+ if (likely(free_group_id != UCLAMP_NOT_VALID))
+ return free_group_id;
+
+ return -ENOSPC;
+}
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @group_id: the clamp group to release
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value. A clamp group is
+ * released when there are no more task groups referencing its clamp value.
+ */
+static inline void uclamp_group_put(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ unsigned long flags;
+
+ /* Ignore SE's not yet attached */
+ if (group_id == UCLAMP_NOT_VALID)
+ return;
+
+ /* Remove SE from this clamp group */
+ raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
+#ifdef CONFIG_SCHED_DEBUG
+ if (unlikely(uc_map[group_id].se_count == 0)) {
+ WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
+ clamp_id, group_id);
+ uc_map[group_id].se_count = 1;
+ }
+#endif
+ uc_map[group_id].se_count -= 1;
+ if (uc_map[group_id].se_count == 0)
+ uclamp_group_reset(clamp_id, group_id);
+ raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @clamp_id: the clamp index affected by the task
+ * @next_group_id: the clamp group to refcount
+ * @uc_se: the utilization clamp data for the task
+ * @clamp_value: the new clamp value for the task
+ *
+ * Each time a task changes its utilization clamp value, for a specified clamp
+ * index, we need to find an available clamp group which can be used to track
+ * this new clamp value. The corresponding clamp group index will be used by
+ * the task to reference count the clamp value on CPUs while enqueued.
+ */
+static inline void uclamp_group_get(int clamp_id, int next_group_id,
+ struct uclamp_se *uc_se,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int prev_group_id = uc_se->group_id;
+ unsigned long flags;
+
+ /* Allocate new clamp group for this clamp value */
+ if (uclamp_group_available(clamp_id, next_group_id))
+ uclamp_group_init(clamp_id, next_group_id, clamp_value);
+
+ /* Update SE's clamp values and attach it to new clamp group */
+ raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
+ uc_se->value = clamp_value;
+ uc_se->group_id = next_group_id;
+ uc_map[next_group_id].se_count += 1;
+ raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+
+ /* Release the previous clamp group */
+ uclamp_group_put(clamp_id, prev_group_id);
+}
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
- if (attr->sched_util_min > attr->sched_util_max)
- return -EINVAL;
- if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
- return -EINVAL;
+ int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
+ struct uclamp_se *uc_se;
+ int result = 0;
- p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
- p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+ mutex_lock(&uclamp_mutex);
- return 0;
+ /* Find a valid group_id for each required clamp value */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ int upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ ? attr->sched_util_max
+ : p->uclamp[UCLAMP_MAX].value;
+
+ if (upper_bound == UCLAMP_NOT_VALID)
+ upper_bound = SCHED_CAPACITY_SCALE;
+ if (attr->sched_util_min > upper_bound) {
+ result = -EINVAL;
+ goto done;
+ }
+
+ result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
+ if (result == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto done;
+ }
+ group_id[UCLAMP_MIN] = result;
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ int lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ ? attr->sched_util_min
+ : p->uclamp[UCLAMP_MIN].value;
+
+ if (lower_bound == UCLAMP_NOT_VALID)
+ lower_bound = 0;
+ if (attr->sched_util_max < lower_bound ||
+ attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto done;
+ }
+
+ result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
+ if (result == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto done;
+ }
+ group_id[UCLAMP_MAX] = result;
+ }
+
+ /* Update each required clamp group */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ uc_se = &p->uclamp[UCLAMP_MIN];
+ uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uc_se, attr->sched_util_min);
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ uc_se = &p->uclamp[UCLAMP_MAX];
+ uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uc_se, attr->sched_util_max);
+ }
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static void __init init_uclamp(void)
+{
+ int clamp_id;
+
+ mutex_init(&uclamp_mutex);
+
+ /* Init SE's clamp map */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ uc_map[group_id].value = UCLAMP_NOT_VALID;
+ raw_spin_lock_init(&uc_map[group_id].se_lock);
+ }
+ }
+}
+
#else /* CONFIG_UCLAMP_TASK */
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
return -EINVAL;
}
+static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2179,8 +2491,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#ifdef CONFIG_UCLAMP_TASK
- p->uclamp[UCLAMP_MIN] = 0;
- p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+ p->uclamp[UCLAMP_MIN].value = 0;
+ p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NOT_VALID;
+ p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
+ p->uclamp[UCLAMP_MAX].group_id = UCLAMP_NOT_VALID;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -4759,8 +5073,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);
#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN];
- attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
#endif
rcu_read_unlock();
@@ -6117,6 +6431,8 @@ void __init sched_init(void)
init_schedstats();
+ init_uclamp();
+
scheduler_running = 1;
}
--
2.18.0
Hi,
On 08/06/2018 09:39 AM, Patrick Bellasi wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index 041f3a022122..1d45a6877d6f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> config GENERIC_SCHED_CLOCK
> bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> + bool "Enable utilization clamping for RT/FAIR tasks"
> + depends on CPU_FREQ_GOV_SCHEDUTIL
> + default false
default n
but just omit the line completely since "n" is already the default.
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> + When this option is enabled, the user can specify a min and max CPU
> + bandwidth which is allowed for a task.
> + The max bandwidth allows to clamp the maximum frequency a task can
> + use, while the min bandwidth allows to define a minimum frequency a
> + task will always use.
Please clean up the indentation above to use one tab + 2 spaces on all lines.
> +
> + If in doubt, say N.
> +
> +endmenu
thanks,
--
~Randy
The cgroup's CPU controller allows assigning a specified (maximum)
bandwidth to the tasks of a group. However, this bandwidth is defined and
enforced only on a temporal basis, without considering the actual
frequency a CPU is running at. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency the CPU is running at while executing that task.
The amount of computation can also be affected by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.
With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.
Given the above mechanisms, it is now possible to extend the cpu
controller to specify the minimum (or maximum) utilization which
a task is expected (or allowed) to generate.
Constraints on minimum and maximum utilization allowed for tasks in a
CPU cgroup can improve the control on the actual amount of CPU bandwidth
consumed by tasks.
Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can be
conveniently used to place important interactive tasks on more capable
CPUs or to run low priority and background tasks on more energy
efficient CPUs.
The ultimate goal of utilization clamping is thus to enable:
- boosting: by selecting a higher capacity CPU and/or higher execution
frequency for small tasks which are affecting the user
interactive experience.
- capping: by selecting more energy efficient CPUs or a lower execution
frequency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.
Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich information hints to the
scheduler regarding the tasks it's going to schedule.
This patch extends the CPU controller by adding a couple of new
attributes, util.min and util.max, which can be used to enforce per-task
utilization boosting and capping. Specifically:
- util.min: defines the minimum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run at least at a minimum frequency which
corresponds to the min_util utilization
- util.max: defines the maximum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run up to a maximum frequency which
corresponds to the max_util utilization
These attributes:
a) are available only for non-root nodes, both on default and legacy
hierarchies
b) do not enforce any constraints and/or dependency between the parent
and its child nodes, thus relying on the delegation model and
permission settings defined by the system management software
c) can be used to further restrict task-specific clamps defined
via sched_setattr(2) (see the usage sketch below)
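As a purely illustrative sketch of the intended usage once the whole series
is applied (this patch alone stubs the writes out to -EINVAL), a minimal
userspace snippet could drive the new attributes as follows; the cgroup
mount point, the "background" group name and the clamp values are
assumptions, not part of this patch:

/*
 * Illustration only: assumes a cgroup v2 hierarchy mounted at
 * /sys/fs/cgroup, an existing "background" group with the cpu
 * controller enabled, and the attribute names documented above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_attr(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* Cap background tasks to ~25% of the highest capacity CPU */
	if (write_attr("/sys/fs/cgroup/background/cpu.util.max", "256"))
		perror("cpu.util.max");
	/* Keep the minimum clamp at its default, i.e. no boosting */
	if (write_attr("/sys/fs/cgroup/background/cpu.util.min", "0"))
		perror("cpu.util.min");
	return 0;
}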
This patch provides the basic support to expose the two new attributes
and to validate their run-time updates. However, we do not actually
allocate clamp groups and thus the write calls added by this patch
always return -EINVAL. Following patches will provide the missing bits.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
Message-ID: <CAJuCfpFnj2g3+ZpR4fP4yqfxs0zd=c-Zehr2XM7m_C+WdL9jNA@mail.gmail.com>
- rename UCLAMP_NONE into UCLAMP_NOT_VALID
Message-ID: <[email protected]>
- use "." notation for attributes naming
i.e. s/util_{min,max}/util.{min,max}/
Others
- rebased on tip/sched/core
Changes in v2:
Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system wide API seems to be of no immediate interest and thus it's
no longer supported
- remove implicit parent-child constraints and dependencies
Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes
Others:
- rebased on v4.18-rc4
- reduced code to simplify the review of this patch
which now provides just the basic code for CGroups integration
- add attributes to the default hierarchy as well as the legacy one
- use -ERANGE as range violation error
These additional bits:
- refcounting of clamp groups
- RUNNABLE tasks refcount updates
- aggregation of per-task and per-task_group utilization constraints
are provided in separate follow-up patches to make it clearer and better
documented how they are performed.
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++
init/Kconfig | 22 +++
kernel/sched/core.c | 186 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
4 files changed, 238 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..71244b55d901 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -904,6 +904,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.
+Cycles distribution is based, by default, on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -963,6 +969,25 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no bandwidth boosting.
+
+ The minimum utilization in the range [0, 1023].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1023". i.e. no bandwidth clamping
+
+ The maximum utilization in the range [0, 1023].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index 701300e8f0eb..592164e0b117 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -790,6 +790,28 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any task-specific
+ clamp value is restricted by the clamp value specified for its cgroup.
+ Neither the minimum nor the maximum task clamp can exceed the
+ corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ff76b000bbe8..2ba55a4afffb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1211,6 +1211,74 @@ static inline void uclamp_group_get(struct task_struct *p,
uclamp_group_put(clamp_id, prev_group_id);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * init_uclamp_sched_group: initialize data structures required for TG's
+ * utilization clamping
+ */
+static inline void init_uclamp_sched_group(void)
+{
+ struct uclamp_map *uc_map;
+ struct uclamp_se *uc_se;
+ int group_id;
+ int clamp_id;
+
+ /* Root TG is statically assigned to the first clamp group */
+ group_id = 0;
+
+ /* Initialize root TG to default (none) clamp values */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_map = &uclamp_maps[clamp_id][0];
+
+ /* Map root TG's clamp value */
+ uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
+
+ /* Init root TG's clamp group */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->group_id = group_id;
+
+ /* Attach root TG's clamp group */
+ uc_map[group_id].se_count = 1;
+ }
+}
+
+/**
+ * alloc_uclamp_sched_group: initialize a new TG for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same parent's group index.
+ *
+ * Return: 1 on success, 0 on error
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+
+ uc_se->value = parent->uclamp[clamp_id].value;
+ uc_se->group_id = UCLAMP_NOT_VALID;
+ }
+
+ return 1;
+}
+#else /* CONFIG_UCLAMP_TASK_GROUP */
+static inline void init_uclamp_sched_group(void) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1308,11 +1376,18 @@ static void __init init_uclamp(void)
raw_spin_lock_init(&uc_map[group_id].se_lock);
}
}
+
+ init_uclamp_sched_group();
}
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -6898,6 +6973,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;
err:
@@ -7118,6 +7196,88 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MIN].value == min_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MAX].value < min_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MAX].value == max_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MIN].value > max_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7455,6 +7615,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7622,6 +7794,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util_max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d5855babb9c9..a443b2c22cb7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,11 @@ struct task_group {
#endif
struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};
#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.18.0
Clamp values cannot be tuned at the root cgroup level. Moreover, because
of the delegation model requirements and how the parent clamps
propagation works, if we want to enable subgroups to set a non-zero
util.min, we need to be able to configure the root group util.min to
allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
Unfortunately this setup also means that all tasks running in the
root group will always get a maximum util.min clamp, unless they have a
lower task-specific clamp, which is definitely not a desirable default
configuration.
Let's fix this by explicitly adding a system default configuration
(sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
for all tasks running on the root group.
This interface is available independently from cgroups, thus providing a
complete solution for system wide utilization clamping configuration.
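Assuming the usual /proc/sys mapping for kern_table entries, the new knobs
should show up as /proc/sys/kernel/sched_uclamp_util_{min,max}. A minimal,
purely illustrative sketch of how a system integrator could raise the
default boost for otherwise unclamped tasks (the 205 value, roughly 20% of
SCHED_CAPACITY_SCALE, is just an example):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_uclamp_util_min", "w");

	if (!f) {
		perror("sched_uclamp_util_min");
		return 1;
	}
	/* Boost all otherwise unclamped tasks to at least ~20% utilization */
	fprintf(f, "205\n");
	fclose(f);

	return 0;
}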
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched/sysctl.h | 11 ++++
kernel/sched/core.c | 102 +++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 16 ++++++
3 files changed, 126 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 913488d828cb..c46346d3cc69 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -55,6 +55,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -74,6 +79,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
+#ifdef CONFIG_UCLAMP_TASK
+extern int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f54fd9bda9a7..48458fea2d5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -728,6 +728,20 @@ static void set_load_weight(struct task_struct *p, bool update_load)
*/
static DEFINE_MUTEX(uclamp_mutex);
+/*
+ * Minimum utilization for tasks in the root cgroup
+ * default: 0
+ */
+unsigned int sysctl_sched_uclamp_util_min;
+
+/*
+ * Maximum utilization for tasks in the root cgroup
+ * default: 1024
+ */
+unsigned int sysctl_sched_uclamp_util_max = 1024;
+
+static struct uclamp_se uclamp_default[UCLAMP_CNT];
+
/**
* uclamp_map: reference counts a utilization "clamp value"
* @value: the utilization "clamp value" required
@@ -957,12 +971,25 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
group_id = uc_se->group_id;
#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /*
+ * Tasks in the root group, which do not have a task specific clamp
+ * value, get the system default clamp value.
+ */
+ if (group_id == UCLAMP_NOT_VALID &&
+ task_group(p) == &root_task_group) {
+ return uclamp_default[clamp_id].group_id;
+ }
+
/* Use TG's clamp value to limit task specific values */
uc_se = &task_group(p)->uclamp[clamp_id];
if (group_id == UCLAMP_NOT_VALID ||
clamp_value > uc_se->effective.value) {
group_id = uc_se->effective.group_id;
}
+#else
+ /* By default, all tasks get the system default clamp value */
+ if (group_id == UCLAMP_NOT_VALID)
+ return uclamp_default[clamp_id].group_id;
#endif
return group_id;
@@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct *p,
uclamp_group_put(clamp_id, prev_group_id);
}
+int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
+ struct uclamp_se *uc_se;
+ int old_min, old_max;
+ int result;
+
+ mutex_lock(&uclamp_mutex);
+
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
+     sysctl_sched_uclamp_util_max > 1024) {
+ result = -EINVAL;
+ goto undo;
+ }
+
+ /* Find a valid group_id for each required clamp value */
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ result = uclamp_group_find(UCLAMP_MIN, sysctl_sched_uclamp_util_min);
+ if (result == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto undo;
+ }
+ group_id[UCLAMP_MIN] = result;
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ result = uclamp_group_find(UCLAMP_MAX, sysctl_sched_uclamp_util_max);
+ if (result == -ENOSPC) {
+ pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ goto undo;
+ }
+ group_id[UCLAMP_MAX] = result;
+ }
+
+ /* Update each required clamp group */
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uc_se = &uclamp_default[UCLAMP_MIN];
+ uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uc_se, sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uc_se = &uclamp_default[UCLAMP_MAX];
+ uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uc_se, sysctl_sched_uclamp_util_max);
+ }
+
+ if (result) {
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+ }
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
/**
* init_uclamp_sched_group: initialize data structures required for TG's
@@ -1291,11 +1387,11 @@ static inline void init_uclamp_sched_group(void)
/* Map root TG's clamp value */
uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
- /* Init root TG's clamp group */
+ /* Init root TG's clamp group: max values always enabled */
uc_se = &root_task_group.uclamp[clamp_id];
- uc_se->value = uclamp_none(clamp_id);
+ uc_se->value = uclamp_none(UCLAMP_MAX);
uc_se->group_id = group_id;
- uc_se->effective.value = uclamp_none(clamp_id);
+ uc_se->effective.value = uclamp_none(UCLAMP_MAX);
uc_se->effective.group_id = group_id;
/* Attach root TG's clamp group */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f22f76b7a138..051d6da237e0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -442,6 +442,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.18.0
When a task group refcounts a new clamp group, we need to ensure that
the new clamp values are immediately enforced for all its tasks which are
currently RUNNABLE. This is to ensure that all currently RUNNABLE tasks
are boosted and/or clamped as requested as soon as possible.
Let's ensure that, whenever a new clamp group is refcounted by a task
group, all its RUNNABLE tasks are correctly accounted on their
respective CPUs. We do that by slightly refactoring uclamp_group_get()
to take an additional cgroup_subsys_state pointer which, when
provided, is used to walk the list of tasks in the corresponding TGs
and update the RUNNABLE ones.
This is a "brute force" solution which allows to reuse the same refcount
update code already used by the per-task API. That's also the only way
to ensure a prompt enforcement of new clamp constraints on RUNNABLE
tasks, as soon as a task group attribute is tweaked.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v3:
- rebased on tip/sched/core
- fixed some typos
Changes in v2:
- rebased on v4.18-rc4
- this code has been split from a previous patch to simplify the review
---
kernel/sched/core.c | 44 ++++++++++++++++++++++++++++++++++-------
kernel/sched/features.h | 5 +++++
2 files changed, 42 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48458fea2d5e..6db307803047 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1255,9 +1255,30 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
}
+static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int group_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /*
+ * In lazy update mode, tasks will be accounted into the right clamp
+ * group the next time they will be requeued.
+ */
+ if (unlikely(sched_feat(UCLAMP_LAZY_UPDATE)))
+ return;
+
+ /* Update clamp groups for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id, group_id);
+ css_task_iter_end(&it);
+}
+
/**
* uclamp_group_get: increase the reference count for a clamp group
* @p: the task which clamp value must be tracked
+ * @css: the task group which clamp value must be tracked
* @clamp_id: the clamp index affected by the task
* @next_group_id: the clamp group to refcount
* @uc_se: the utilization clamp data for the task
@@ -1269,6 +1290,7 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
* the task to reference count the clamp value on CPUs while enqueued.
*/
static inline void uclamp_group_get(struct task_struct *p,
+ struct cgroup_subsys_state *css,
int clamp_id, int next_group_id,
struct uclamp_se *uc_se,
unsigned int clamp_value)
@@ -1288,6 +1310,10 @@ static inline void uclamp_group_get(struct task_struct *p,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+ /* Newly created TGs don't have tasks assigned */
+ if (css)
+ uclamp_group_get_tg(css, clamp_id, next_group_id);
+
/* Update CPU's clamp group refcounts of RUNNABLE task */
if (p)
uclamp_task_update_active(p, clamp_id, next_group_id);
@@ -1344,12 +1370,12 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
/* Update each required clamp group */
if (old_min != sysctl_sched_uclamp_util_min) {
uc_se = &uclamp_default[UCLAMP_MIN];
- uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(NULL, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, sysctl_sched_uclamp_util_min);
}
if (old_max != sysctl_sched_uclamp_util_max) {
uc_se = &uclamp_default[UCLAMP_MAX];
- uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(NULL, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, sysctl_sched_uclamp_util_max);
}
@@ -1441,7 +1467,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
return 0;
}
#endif
- uclamp_group_get(NULL, clamp_id, group_id, uc_se,
+ uclamp_group_get(NULL, NULL, clamp_id, group_id, uc_se,
parent->uclamp[clamp_id].value);
}
@@ -1532,12 +1558,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update each required clamp group */
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uc_se = &p->uclamp[UCLAMP_MIN];
- uclamp_group_get(p, UCLAMP_MIN, group_id[UCLAMP_MIN],
+ uclamp_group_get(p, NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
uc_se, attr->sched_util_min);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
uc_se = &p->uclamp[UCLAMP_MAX];
- uclamp_group_get(p, UCLAMP_MAX, group_id[UCLAMP_MAX],
+ uclamp_group_get(p, NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
uc_se, attr->sched_util_max);
}
@@ -7468,6 +7494,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
uc_se->effective.value = value;
uc_se->effective.group_id = group_id;
+
+ /* Immediately update descendants' active tasks */
+ if (css != top_css)
+ uclamp_group_get_tg(css, clamp_id, group_id);
}
}
@@ -7508,7 +7538,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MIN];
- uclamp_group_get(NULL, UCLAMP_MIN, group_id, uc_se, min_value);
+ uclamp_group_get(NULL, css, UCLAMP_MIN, group_id, uc_se, min_value);
out:
rcu_read_unlock();
@@ -7554,7 +7584,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MAX];
- uclamp_group_get(NULL, UCLAMP_MAX, group_id, uc_se, max_value);
+ uclamp_group_get(NULL, css, UCLAMP_MAX, group_id, uc_se, max_value);
out:
rcu_read_unlock();
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ca449e36c1..ced86cfd8fcd 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -91,6 +91,11 @@ SCHED_FEAT(WA_BIAS, true)
*/
SCHED_FEAT(UTIL_EST, true)
+/*
+ * Utilization clamping lazy update.
+ */
+SCHED_FEAT(UCLAMP_LAZY_UPDATE, false)
+
/*
* Per class CPU's utilization clamping.
*/
--
2.18.0
Hi,
Minor comments below.
On 06/08/18 17:39, Patrick Bellasi wrote:
[...]
> + *
> + * Task Utilization Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes allows to specify the utilization which
> + * should be expected by a task. These attributes allows to inform the
^
allow
> + * scheduler about the utilization boundaries within which is safe to schedule
Isn't all this more about providing hints than safety?
> + * the task. These utilization boundaries are valuable information to support
> + * scheduler decisions on both task placement and frequencies selection.
> + *
> + * @sched_util_min represents the minimum utilization
> + * @sched_util_max represents the maximum utilization
> + *
> + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> + * represents the percentage of CPU time used by a task when running at the
> + * maximum frequency on the highest capacity CPU of the system. Thus, for
> + * example, a 20% utilization task is a task running for 2ms every 10ms.
> + *
> + * A task with a min utilization value bigger then 0 is more likely to be
> + * scheduled on a CPU which can provide that bandwidth.
> + * A task with a max utilization value smaller then 1024 is more likely to be
> + * scheduled on a CPU which do not provide more then the required bandwidth.
Isn't s/bandwidth/capacity/ here, above, and in general where you use
the term "bandwidth" more appropriate? I wonder if overloading this term
(w.r.t. how it is used with DEADLINE) might create confusion. In this case
we are not providing any sort of guarantees, it's a hint.
Best,
- Juri
On 06/08/18 17:39, Patrick Bellasi wrote:
[...]
> @@ -4218,6 +4245,13 @@ static int __sched_setscheduler(struct task_struct *p,
> return retval;
> }
>
> + /* Configure utilization clamps for the task */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> + retval = __setscheduler_uclamp(p, attr);
> + if (retval)
> + return retval;
> + }
> +
IIUC, this is available to root and non-root users. In the latter case,
how do we cope with the fact that some user might occupy all the
available clamping groups configured for the system?
Best,
- Juri
Hi Patrick,
On Monday 06 Aug 2018 at 17:39:38 (+0100), Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index a7affc729c25..bb25ef66c2d3 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -200,6 +200,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> + unsigned long util_cfs, util_rt;
> unsigned long util, irq, max;
>
> sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
IIUC, not far below this you should still have something like:
if (rt_rq_is_runnable(&rq->rt))
return max;
So you won't reach the actual clamping code at the end of the function
when a RT task is runnable no?
Also, I always had the feeling that the default for RT should be
util_min == 1024, and then users could decide to lower the bar if they
want to. For the specific case of RT, that feels more natural than
applying a max util clamp IMO. What do you think ?
Thanks,
Quentin
Hi,
On 06/08/18 17:39, Patrick Bellasi wrote:
[...]
> @@ -223,13 +224,25 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> *
> - * CFS utilization can be boosted or capped, depending on utilization
> - * clamp constraints configured for currently RUNNABLE tasks.
> + * CFS and RT utilizations can be boosted or capped, depending on
> + * utilization constraints enforce by currently RUNNABLE tasks.
> + * They are individually clamped to ensure fairness across classes,
> + * meaning that CFS always gets (if possible) the (minimum) required
> + * bandwidth on top of that required by higher priority classes.
Is this a stale comment written before UCLAMP_SCHED_CLASS was
introduced? It seems to apply to the below if branch only.
> */
> - util = cpu_util_cfs(rq);
> - if (util)
> - util = uclamp_util(cpu_of(rq), util);
> - util += cpu_util_rt(rq);
> + util_cfs = cpu_util_cfs(rq);
> + util_rt = cpu_util_rt(rq);
> + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> + util = 0;
> + if (util_cfs)
> + util += uclamp_util(cpu_of(rq), util_cfs);
> + if (util_rt)
> + util += uclamp_util(cpu_of(rq), util_rt);
> + } else {
> + util = cpu_util_cfs(rq);
> + util += cpu_util_rt(rq);
> + util = uclamp_util(cpu_of(rq), util);
> + }
Best,
- Juri
Hi Patrick,
On Mon, 6 Aug 2018 at 18:40, Patrick Bellasi <[email protected]> wrote:
>
> @@ -222,8 +222,13 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> util = cpu_util_cfs(rq);
> + if (util)
> + util = uclamp_util(cpu_of(rq), util);
> util += cpu_util_rt(rq);
>
> /*
> @@ -322,11 +328,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> + max_boost = sg_cpu->iowait_boost_max;
> + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> + max_boost = uclamp_util(sg_cpu->cpu, max_boost);
> +
> +static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
> +{
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> + if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
> + return uclamp_none(clamp_id);
> +
> + return uc_cpu->value[clamp_id];
> +}
> +
> +static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
using struct *rq rq instead of cpu as parameter would align
uclamp_util() interface with other cpu_util_*() interface and remove
some cpu_of(rq) and cpu_rq(cpu)
> +{
> + unsigned int min_util = uclamp_value(cpu, UCLAMP_MIN);
> + unsigned int max_util = uclamp_value(cpu, UCLAMP_MAX);
> +
> + return clamp(util, min_util, max_util);
> +}
On 06-Aug 09:50, Randy Dunlap wrote:
> Hi,
Hi Randy,
> On 08/06/2018 09:39 AM, Patrick Bellasi wrote:
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 041f3a022122..1d45a6877d6f 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> > config GENERIC_SCHED_CLOCK
> > bool
> >
> > +menu "Scheduler features"
> > +
> > +config UCLAMP_TASK
> > + bool "Enable utilization clamping for RT/FAIR tasks"
> > + depends on CPU_FREQ_GOV_SCHEDUTIL
> > + default false
>
> default n
> but just omit the line completely since "n" is already the default.
Right, will update for next posting!
Is there a strict rule to omit this line when it's already the
default?
>
> > + help
> > + This feature enables the scheduler to track the clamped utilization
> > + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> > +
> > + When this option is enabled, the user can specify a min and max CPU
> > + bandwidth which is allowed for a task.
> > + The max bandwidth allows to clamp the maximum frequency a task can
> > + use, while the min bandwidth allows to define a minimum frequency a
> > + task will always use.
>
> Please clean up the indentation above to use one tab + 2 spaces on all lines.
Sure, my bad I did not notice it... although I'm quite sure the patch
passed a checkpatch... will check better next time.
> > +
> > + If in doubt, say N.
> > +
> > +endmenu
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 07-Aug 14:35, Juri Lelli wrote:
> On 06/08/18 17:39, Patrick Bellasi wrote:
>
> [...]
>
> > @@ -4218,6 +4245,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > return retval;
> > }
> >
> > + /* Configure utilization clamps for the task */
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> > + retval = __setscheduler_uclamp(p, attr);
> > + if (retval)
> > + return retval;
> > + }
> > +
>
> IIUC, this is available to root and non-root users. In the latter case,
> how do we cope with the fact that some user might occupy all the
> available clamping groups configured for the system?
That's a very good point, glad you noticed it.
What concerns me most is that we set constraints on the cgroups
delegation model. If all clamp groups have been used it could be
impossible for a parent group to shrink resources for its subgroups.
In both cases however, in principle, I think we can live with the idea
that the "System Management Software" (SMS) can pre-allocate all the
required boost groups at boot time; malicious tasks and dependent
groups will eventually get an -ENOSPC error.
These are the main reasons why I did not post a more "safe" solution:
this series is already big enough, a properly (pre)configured system
is still reasonably functionally safe and this feature can be added in
a second step.
However, I already have a couple of possible extensions/fixes which I
can add on top on the next respin. They are along these lines:
1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
time parameter to relax this check
2) add discretization support to clamp groups allocation
This second feature specifically, will ensure that clamp values are
always mapped into one of the available clamp groups. While the exact
clamp value can always be used for tasks placement biasing, when it
comes to frequency selection biasing, depending on concurrently
running tasks, you can end up with an effective clamp value which is
rounded up.
Will likely add a couple of additional patches on v4 posting.
Do you have any other possible idea?
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 09/08/18 10:14, Patrick Bellasi wrote:
> On 07-Aug 14:35, Juri Lelli wrote:
> > On 06/08/18 17:39, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > @@ -4218,6 +4245,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > > return retval;
> > > }
> > >
> > > + /* Configure utilization clamps for the task */
> > > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> > > + retval = __setscheduler_uclamp(p, attr);
> > > + if (retval)
> > > + return retval;
> > > + }
> > > +
> >
> > IIUC, this is available to root and non-root users. In the latter case,
> > how do we cope with the fact that some user might occupy all the
> > available clamping groups configured for the system?
>
> That's a very good point, glad you noticed it.
>
> What concerns me most is that we set constraints on the cgroups
> delegation model. If all clamp groups have been used it could be
> impossible for a parent group to shrink resources for its subgroups.
Right, when groups are in use the problem might actually be even more
serious.
> In both cases however, in principle, I think we can live with the idea
> that the "System Management Software" (SMS) can pre-allocate all the
> required boost groups at boot time; malicious tasks and dependent
> groups will eventually get an -ENOSPC error.
>
> These are the main reasons why I did not post a more "safe" solution:
> this series is already big enough, a properly (pre)configured system
> is still reasonably functionally safe and this feature can be added in
> a second step.
I see, but I also fear that there will be times and usages of this new
interface where no SMS is present.
> However, I already have a couple of possible extensions/fixes which I
> can add on top on the next respin. They are along these lines:
These are exactly what I was thinking about as well. :-)
> 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> time parameter to relax this check
It seems to me that this might work well with the intended usage of
the interface that you depict above. SMS only (or any privileged user)
will be in control of how groups are configured, so no problem for
normal users.
> 2) add discretization support to clamp groups allocation
And this might also work well if we feel that we don't want to restrict
usage of the interface to admin only, however...
> This second feature specifically, will ensure that clamp values are
> always mapped into one of the available clamp groups. While the exact
> clamp value can always be used for tasks placement biasing, when it
> comes to frequency selection biasing, depending on concurrently
> running tasks, you can end up with an effective clamp value which is
> rounded up.
what I'm not so sure about is that we might lose in flexibility if the
number of available discrete clamp groups is too small compared to the
number of available OPP on the platform.
> Will likely add a couple of additional patches on v4 posting.
> Do you have any other possible idea?
As said, I though as well about the two options you mentioned.
On 08/09/2018 01:39 AM, Patrick Bellasi wrote:
> On 06-Aug 09:50, Randy Dunlap wrote:
>> Hi,
>
> Hi Randy,
>
>> On 08/06/2018 09:39 AM, Patrick Bellasi wrote:
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index 041f3a022122..1d45a6877d6f 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
>>> config GENERIC_SCHED_CLOCK
>>> bool
>>>
>>> +menu "Scheduler features"
>>> +
>>> +config UCLAMP_TASK
>>> + bool "Enable utilization clamping for RT/FAIR tasks"
>>> + depends on CPU_FREQ_GOV_SCHEDUTIL
>>> + default false
>>
>> default n
>> but just omit the line completely since "n" is already the default.
>
>
> Right, will update for next posting!
> Is there a strict rule to omit this line when it's already the
> default?
It's not documented AFAIK, but it's commonly repeated on LKML.
>>> + help
>>> + This feature enables the scheduler to track the clamped utilization
>>> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
>>> +
>>> + When this option is enabled, the user can specify a min and max CPU
>>> + bandwidth which is allowed for a task.
>>> + The max bandwidth allows to clamp the maximum frequency a task can
>>> + use, while the min bandwidth allows to define a minimum frequency a
>>> + task will always use.
>>
>> Please clean up the indentation above to use one tab + 2 spaces on all lines.
>
> Sure, my bad I did not notice it... although I'm quite sure the patch
> passed a checkpatch... will check better next time.
Thanks.
>>> +
>>> + If in doubt, say N.
>>> +
>>> +endmenu
--
~Randy
On 09-Aug 11:50, Juri Lelli wrote:
> On 09/08/18 10:14, Patrick Bellasi wrote:
> > On 07-Aug 14:35, Juri Lelli wrote:
> > > On 06/08/18 17:39, Patrick Bellasi wrote:
[...]
> > 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> > time parameter to relax this check
>
> It seems to me that this might work well with the intended usage of
> the interface that you depict above. SMS only (or any privileged user)
> will be in control of how groups are configured, so no problem for
> normal users.
Yes, well... apart from normal users still getting an -ENOSPC if they are
requesting one of the not pre-configured clamp values. Which is why
the following bits can be helpful.
> > 2) add discretization support to clamp groups allocation
>
> And this might also work well if we feel that we don't want to restrict
> usage of the interface to admin only, however...
>
> > This second feature specifically, will ensure that clamp values are
> > always mapped into one of the available clamp groups. While the exact
> > clamp value can always be used for tasks placement biasing, when it
> > comes to frequency selection biasing, depending on concurrently
> > running tasks, you can end up with an effective clamp value which is
> > rounded up.
>
> what I'm not so sure about is that we might lose in flexibility if the
> number of available discrete clamp groups is too small compared to the
> number of available OPP on the platform.
Regarding this concern, I would say that we should consider that, for
frequency biasing, we are in general not interested in nailing down
the single 1% difference and/or exact OPP capacities
A certain coarse grained resolution is usually acceptable for many
different reasons:
a) schedutil already uses a 20% margin which can potentially eclipse
few OPP when we scale up/down
b) tasks/CPUs utilization are good enough but never exact and precise
values
c) reducing the number of OPP switches could have some benefits on
stability/latencies
d) clamping is actually defining minimum/maximum preferred values, is
not to be considered a tool for "precise control"
All that considered, I would say that maybe a 5% resolution could
still be considered an acceptable _worst case_ rounding since we don't
have always to round up to the next 5%.
For example, if we have:
- TaskA: util_min=41%
- TaskB: util_min=44%
they will be both accounted in the 40-45% clamp group but the clamp
group value can be modulated at run-time depending on RUNNABLE
tasks. When TaskA is running alone, we can still set util_min to
41%, while we will use 44% (not 45%) when TaskB is (also) running.
It's worth noticing that we pre-allocate 20 clamp groups at compile
time, but not necessarily all of them will be used at run-time.
Indeed, we will still use a policy where only the actual required
values are allocated at the beginning of the clamps map, thus
optimizing max updates.
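Just to make the rounding explicit, here is a rough sketch of the mapping
being discussed (hypothetical, not part of the posted series), assuming 20
buckets over SCHED_CAPACITY_SCALE:

/* Hypothetical sketch, not part of the posted series */
#define UCLAMP_BUCKETS		20
#define UCLAMP_BUCKET_DELTA	(SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS)

static inline int uclamp_bucket_id(unsigned int clamp_value)
{
	/* e.g. 41% (~420) and 44% (~450) both land in the 40-45% bucket */
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

The effective value of each bucket would then track the maximum value
requested by the RUNNABLE tasks currently accounted in it, as in the
TaskA/TaskB example above.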
--
#include <best/regards.h>
Patrick Bellasi
On 08-Aug 15:18, Vincent Guittot wrote:
> Hi Patrick,
Hi VIncent,
> On Mon, 6 Aug 2018 at 18:40, Patrick Bellasi <[email protected]> wrote:
[...]
> > +static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
>
> using struct *rq rq instead of cpu as parameter would align
> uclamp_util() interface with other cpu_util_*() interface and remove
> some cpu_of(rq) and cpu_rq(cpu)
Right, I've tried to keep consistency within the "uclamp_*" APIs...
but what you suggest also makes sense and I've already
switched some other APIs to use *rq.
I'll look into better aligning these APIs for the next posting.
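For reference, a minimal sketch of what the rq-based variant could look
like, directly derived from the helpers quoted above:

static inline unsigned int uclamp_value(struct rq *rq, int clamp_id)
{
	struct uclamp_cpu *uc_cpu = &rq->uclamp;

	if (uc_cpu->value[clamp_id] == UCLAMP_NOT_VALID)
		return uclamp_none(clamp_id);

	return uc_cpu->value[clamp_id];
}

static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
{
	unsigned int min_util = uclamp_value(rq, UCLAMP_MIN);
	unsigned int max_util = uclamp_value(rq, UCLAMP_MAX);

	return clamp(util, min_util, max_util);
}

Callers like sugov_get_util() could then pass the rq they already have,
e.g. uclamp_util(rq, util_cfs), avoiding the cpu_of()/cpu_rq() round trips.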
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 07-Aug 15:26, Juri Lelli wrote:
> Hi,
>
> On 06/08/18 17:39, Patrick Bellasi wrote:
>
> [...]
>
> > @@ -223,13 +224,25 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > * utilization (PELT windows are synchronized) we can directly add them
> > * to obtain the CPU's actual utilization.
> > *
> > - * CFS utilization can be boosted or capped, depending on utilization
> > - * clamp constraints configured for currently RUNNABLE tasks.
> > + * CFS and RT utilizations can be boosted or capped, depending on
> > + * utilization constraints enforce by currently RUNNABLE tasks.
> > + * They are individually clamped to ensure fairness across classes,
> > + * meaning that CFS always gets (if possible) the (minimum) required
> > + * bandwidth on top of that required by higher priority classes.
>
> Is this a stale comment written before UCLAMP_SCHED_CLASS was
> introduced? It seems to apply to the below if branch only.
Yes, you're right... I'll update the comment.
> > */
> > - util = cpu_util_cfs(rq);
> > - if (util)
> > - util = uclamp_util(cpu_of(rq), util);
> > - util += cpu_util_rt(rq);
> > + util_cfs = cpu_util_cfs(rq);
> > + util_rt = cpu_util_rt(rq);
> > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > + util = 0;
> > + if (util_cfs)
> > + util += uclamp_util(cpu_of(rq), util_cfs);
> > + if (util_rt)
> > + util += uclamp_util(cpu_of(rq), util_rt);
> > + } else {
> > + util = cpu_util_cfs(rq);
> > + util += cpu_util_rt(rq);
> > + util = uclamp_util(cpu_of(rq), util);
> > + }
Regarding the two policies, do you have any comment?
We had an internal discussion and we found pros/cons for both... but
I'm not sure keeping the sched_feat is a good solution in the long
run, i.e. mainline merge ;)
--
#include <best/regards.h>
Patrick Bellasi
On 07-Aug 14:54, Quentin Perret wrote:
> Hi Patrick,
Hi Quentin!
> On Monday 06 Aug 2018 at 17:39:38 (+0100), Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index a7affc729c25..bb25ef66c2d3 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -200,6 +200,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > + unsigned long util_cfs, util_rt;
> > unsigned long util, irq, max;
> >
> > sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
>
> IIUC, not far below this you should still have something like:
>
> if (rt_rq_is_runnable(&rq->rt))
> return max;
Do you mean that when RT tasks are RUNNABLE we still want to go to
MAX? Not sure I understand... since this patch is actually to clamp
the RT class...
> So you won't reach the actual clamping code at the end of the function
> when a RT task is runnable no?
... mmm... no... this patch is to clamp RT tasks... Am I missing
something?
> Also, I always had the feeling that the default for RT should be
> util_min == 1024, and then users could decide to lower the bar if they
> want to.
Mmm... good point! This would keep the policy unaltered for RT tasks.
I want to keep sched class specific code in uclamp to a minimum, but
likely this should be achievable by just properly initializing the
task-specific util_min for RT tasks, if the original task has
UCLAMP_NOT_VALID.
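To give a rough idea, the initialization I have in mind is something
like the following sketch (hypothetical helper; it also glosses over
the clamp group refcounting, which would still need to be wired up):

	/*
	 * Sketch: RT tasks get a maximum util_min by default, unless a
	 * task-specific value has already been requested.
	 */
	static void uclamp_set_rt_default(struct task_struct *p)
	{
		struct uclamp_se *uc_se = &p->uclamp[UCLAMP_MIN];

		if (!rt_task(p))
			return;

		if (uc_se->group_id == UCLAMP_NOT_VALID)
			uc_se->value = SCHED_CAPACITY_SCALE;
	}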
> For the specific case of RT, that feels more natural than
> applying a max util clamp IMO. What do you think ?
I'll look better into this for the next posting!
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Thursday 09 Aug 2018 at 16:41:56 (+0100), Patrick Bellasi wrote:
> > IIUC, not far below this you should still have something like:
> >
> > if (rt_rq_is_runnable(&rq->rt))
> > return max;
>
> Do you mean that when RT tasks are RUNNABLE we still want to got to
> MAX? Not sure to understand... since this patch is actually to clamp
> the RT class...
Argh, reading my message again it wasn't very clear indeed. Sorry about
that ...
What I'm trying to say is that your patch does _not_ remove the snippet of code
above from sugov_get_util(). So I think that when a RT task is runnable,
you will not reach the end of the function where the clamping is done.
And this is not what you want AFAICT.
Does that make any sense ?
>
> > So you won't reach the actual clamping code at the end of the function
> > when a RT task is runnable no?
>
> ... mmm... no... this patch is to clamp RT tasks... Am I missing
> something?
>
> > Also, I always had the feeling that the default for RT should be
> > util_min == 1024, and then users could decide to lower the bar if they
> > want to.
>
> Mmm... good point! This would keep the policy unaltered for RT tasks.
>
> I want to keep sched class specific code in uclamp at minimum but
> likely this should be achievable by just properly initializing the
> task-specific util_min for RT tasks, if the original task has
> UCLAM_NOT_VALID.
+1, it'd be nice to keep the cross-class mess to a minimum IMO. But
hopefully this RT thing isn't too ugly to implement ...
>
> > For the specific case of RT, that feels more natural than
> > applying a max util clamp IMO. What do you think ?
>
> I'll look better into this for the next posting!
Sounds good :-)
Thanks,
Quentin
Hi Patrick,
On Thu, 9 Aug 2018 at 17:34, Patrick Bellasi <[email protected]> wrote:
>
> On 07-Aug 15:26, Juri Lelli wrote:
> > Hi,
> >
> > On 06/08/18 17:39, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > @@ -223,13 +224,25 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > > * utilization (PELT windows are synchronized) we can directly add them
> > > * to obtain the CPU's actual utilization.
> > > *
> > > - * CFS utilization can be boosted or capped, depending on utilization
> > > - * clamp constraints configured for currently RUNNABLE tasks.
> > > + * CFS and RT utilizations can be boosted or capped, depending on
> > > + * utilization constraints enforce by currently RUNNABLE tasks.
> > > + * They are individually clamped to ensure fairness across classes,
> > > + * meaning that CFS always gets (if possible) the (minimum) required
> > > + * bandwidth on top of that required by higher priority classes.
> >
> > Is this a stale comment written before UCLAMP_SCHED_CLASS was
> > introduced? It seems to apply to the below if branch only.
>
> Yes, you right... I'll update the comment.
>
> > > */
> > > - util = cpu_util_cfs(rq);
> > > - if (util)
> > > - util = uclamp_util(cpu_of(rq), util);
> > > - util += cpu_util_rt(rq);
> > > + util_cfs = cpu_util_cfs(rq);
> > > + util_rt = cpu_util_rt(rq);
> > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > + util = 0;
> > > + if (util_cfs)
> > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > + if (util_rt)
> > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > + } else {
> > > + util = cpu_util_cfs(rq);
> > > + util += cpu_util_rt(rq);
> > > + util = uclamp_util(cpu_of(rq), util);
> > > + }
>
> Regarding the two policies, do you have any comment?
Does the policy for (sched_feat(UCLAMP_SCHED_CLASS) == true) really
make sense as it is?
I mean, uclamp_util() doesn't make any difference between rt and cfs
tasks when clamping the utilization, so why should we add the returned
value twice?
IMHO, this policy would make sense if there were something like a
uclamp_util_rt() and a uclamp_util_cfs()
>
> We had an internal discussion and we found pro/cons for both... but
> I'm not sure keeping the sched_feat is a good solution on the long
> run, i.e. mainline merge ;)
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 09/08/18 16:23, Patrick Bellasi wrote:
> On 09-Aug 11:50, Juri Lelli wrote:
> > On 09/08/18 10:14, Patrick Bellasi wrote:
> > > On 07-Aug 14:35, Juri Lelli wrote:
> > > > On 06/08/18 17:39, Patrick Bellasi wrote:
>
> [...]
>
> > > 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> > > time parameter to relax this check
> >
> > It seems to me that this might work well with that the intended usage of
> > the interface that you depict above. SMS only (or any privileged user)
> > will be in control of how groups are configured, so no problem for
> > normal users.
>
> Yes, well... apart from normal users still getting a -ENOSPC if they are
> requesting one of the not pre-configured clamp values. Which is why
> the following bits can be helpful.
>
> > > 2) add discretization support to clamp groups allocation
> >
> > And this might also work well if we feel that we don't want to restrict
> > usage of the interface to admin only, however...
> >
> > > This second feature specifically, will ensure that clamp values are
> > > always mapped into one of the available clamp groups. While the exact
> > > clamp value can always be used for tasks placement biasing, when it
> > > comes to frequency selection biasing, depending on concurrently
> running tasks, you can end up with an effective clamp value which is
> rounded up.
> >
> > what I'm not so sure about is that we might lose in flexibility if the
> > number of available discrete clamp groups is too small compared to the
> > number of available OPP on the platform.
>
> Regarding this concern, I would say that we should consider that, for
> frequency biasing, we are in general not interested in nailing down
> the single 1% difference and/or exact OPP capacities
True.
> A certain coarse grained resolution is usually acceptable for many
> different reasons:
> a) schedutil already uses a 20% margin which can potentially eclipse
> few OPP when we scale up/down
> b) tasks/CPUs utilization are good enough but never exact and precise
> values
> c) reducing the number of OPP switches could have some benefits on
> stability/latencies
> d) clamping is actually defining minimum/maximum preferred values, is
> not to be considered a tool for "precise control"
>
> All that considered, I would say that maybe a 5% resolution could
> still be considered an acceptable _worst case_ rounding since we don't
> have always to round up to the next 5%.
>
> For example, if we have:
> - TaskA: util_min=41%
> - TaskB: util_min=44%
> they will be both accounted in the 40-45% clamp group but the clamp
> group value can be modulated at run-time depending on RUNNABLE
> tasks. When TaskA is running alone, we can still set util_min to
> 41%, while we will use 44% (not 45%) when TaskB is (also) running.
>
> It's worth to notice that we pre-allocated at compile time 20 clamp
> groups, but not necessarily all of them will be used at run-time.
> Indeed, we will still use a policy where only the actual required
> values are allocated at the beginning of the clamps map, thus
> optimizing max updates.
OK, so you'll still only iterate over the groups that are actually in
use, which is hopefully fewer than 20 and should keep the overhead low.
Makes sense to me.
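FWIW, a tiny self-contained toy to visualize the bucketing you describe
(the 5% step and the helper name are purely illustrative, nothing of
this is in the series):

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE	1024
	#define UCLAMP_BUCKET_PCT	5	/* illustrative resolution */
	#define UCLAMP_BUCKET_DELTA	(SCHED_CAPACITY_SCALE * UCLAMP_BUCKET_PCT / 100)

	/* Map an exact clamp value to the base value of its bucket. */
	static unsigned int uclamp_bucket_base(unsigned int clamp_value)
	{
		return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
	}

	int main(void)
	{
		/* TaskA util_min=41%, TaskB util_min=44%: both land in the
		 * same bucket, while the exact requested values can still
		 * be tracked and used at run-time. */
		unsigned int ta = 41 * SCHED_CAPACITY_SCALE / 100;
		unsigned int tb = 44 * SCHED_CAPACITY_SCALE / 100;

		printf("TaskA: value=%u bucket_base=%u\n", ta, uclamp_bucket_base(ta));
		printf("TaskB: value=%u bucket_base=%u\n", tb, uclamp_bucket_base(tb));
		return 0;
	}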
Hi Vincent!
On 09-Aug 18:03, Vincent Guittot wrote:
> > On 07-Aug 15:26, Juri Lelli wrote:
[...]
> > > > + util_cfs = cpu_util_cfs(rq);
> > > > + util_rt = cpu_util_rt(rq);
> > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > + util = 0;
> > > > + if (util_cfs)
> > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > + if (util_rt)
> > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > + } else {
> > > > + util = cpu_util_cfs(rq);
> > > > + util += cpu_util_rt(rq);
> > > > + util = uclamp_util(cpu_of(rq), util);
> > > > + }
> >
> > Regarding the two policies, do you have any comment?
>
> Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> make sense as it is ?
> I mean, uclamp_util doesn't make any difference between rt and cfs
> tasks when clamping the utilization so why should be add twice the
> returned value ?
> IMHO, this policy would make sense if there were something like
> uclamp_util_rt() and a uclamp_util_cfs()
The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
low-priority classes, especially when we have high RT utilization.
Let's say we have:
  util_rt  = 40%, util_min=0%
  util_cfs = 10%, util_min=50%
the two policies will select:
  UCLAMP_SCHED_CLASS:  util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
  !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclamp(50) = 50%
Which means that, even though the CPU's util_min will be set to 50% when
CFS is running, these tasks will get almost no boost at all, since
their bandwidth margin is eclipsed by RT tasks.
> > We had an internal discussion and we found pro/cons for both... but
The UCLAMP_SCHED_CLASS policy is thus less energy efficient, but it
should grant better "isolation" in terms of the expected speed-up a
task will get at run-time, independently from higher priority classes.
Does that make sense?
> > I'm not sure keeping the sched_feat is a good solution on the long
> > run, i.e. mainline merge ;)
This problem still stands...
--
#include <best/regards.h>
Patrick Bellasi
Hi Quentin!
On 09-Aug 16:55, Quentin Perret wrote:
> Hi Patrick,
>
> On Thursday 09 Aug 2018 at 16:41:56 (+0100), Patrick Bellasi wrote:
> > > IIUC, not far below this you should still have something like:
> > >
> > > if (rt_rq_is_runnable(&rq->rt))
> > > return max;
> >
> > Do you mean that when RT tasks are RUNNABLE we still want to got to
> > MAX? Not sure to understand... since this patch is actually to clamp
> > the RT class...
>
> Argh, reading my message again it wasn't very clear indeed. Sorry about
> that ...
>
> What I'm try to say is that your patch does _not_ remove the snippet of code
> above from sugov_get_util(). So I think that when a RT task is runnable,
> you will not reach the end of the function where the clamping is done.
> And this is not what you want AFAICT.
>
> Does that make any sense ?
Oh gotcha... you're right, I've missed that bit when I rebased on tip.
Will fix on the next iteration!
--
#include <best/regards.h>
Patrick Bellasi
On 13/08/18 11:12, Patrick Bellasi wrote:
> Hi Vincent!
>
> On 09-Aug 18:03, Vincent Guittot wrote:
> > > On 07-Aug 15:26, Juri Lelli wrote:
>
> [...]
>
> > > > > + util_cfs = cpu_util_cfs(rq);
> > > > > + util_rt = cpu_util_rt(rq);
> > > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > > + util = 0;
> > > > > + if (util_cfs)
> > > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > > + if (util_rt)
> > > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > > + } else {
> > > > > + util = cpu_util_cfs(rq);
> > > > > + util += cpu_util_rt(rq);
> > > > > + util = uclamp_util(cpu_of(rq), util);
> > > > > + }
> > >
> > > Regarding the two policies, do you have any comment?
> >
> > Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> > make sense as it is ?
> > I mean, uclamp_util doesn't make any difference between rt and cfs
> > tasks when clamping the utilization so why should be add twice the
> > returned value ?
> > IMHO, this policy would make sense if there were something like
> > uclamp_util_rt() and a uclamp_util_cfs()
>
> The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
> low-priority classese, especially when we have high RT utilization.
>
> Let say we have:
>
> util_rt = 40%, util_min=0%
> util_cfs = 10%, util_min=50%
>
> the two policies will select:
>
> UCLAMP_SCHED_CLASS: util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
> !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclmp(50) = 50%
>
> Which means that, despite the CPU's util_min will be set to 50% when
> CFS is running, these tasks will have almost no boost at all, since
> their bandwidth margin is eclipsed by RT tasks.
Ah, right. But isn't it possible to distinguish between classes? I mean,
if you knew that only CFS is clamped (boosted) in this case, you could
have:
util = util_rt + uclamp(util_cfs) = 40 + 50 = 90%
which should do what one expects w/o energy side effects?
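Just to make it concrete on top of the hunk quoted above, I'm thinking
of something like this in sugov_get_util() (a sketch only; with the
current class-agnostic aggregation it would still apply the combined
clamp, so it only approximates the idea):

	/* Sketch: boost/cap only the CFS contribution, leave RT alone */
	util_cfs = cpu_util_cfs(rq);
	util_rt  = cpu_util_rt(rq);

	util  = util_rt;
	util += uclamp_util(cpu_of(rq), util_cfs);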
On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
>
> Hi Vincent!
>
> On 09-Aug 18:03, Vincent Guittot wrote:
> > > On 07-Aug 15:26, Juri Lelli wrote:
>
> [...]
>
> > > > > + util_cfs = cpu_util_cfs(rq);
> > > > > + util_rt = cpu_util_rt(rq);
> > > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > > + util = 0;
> > > > > + if (util_cfs)
> > > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > > + if (util_rt)
> > > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > > + } else {
> > > > > + util = cpu_util_cfs(rq);
> > > > > + util += cpu_util_rt(rq);
> > > > > + util = uclamp_util(cpu_of(rq), util);
> > > > > + }
> > >
> > > Regarding the two policies, do you have any comment?
> >
> > Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> > make sense as it is ?
> > I mean, uclamp_util doesn't make any difference between rt and cfs
> > tasks when clamping the utilization so why should be add twice the
> > returned value ?
> > IMHO, this policy would make sense if there were something like
> > uclamp_util_rt() and a uclamp_util_cfs()
>
> The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
> low-priority classese, especially when we have high RT utilization.
>
> Let say we have:
>
> util_rt = 40%, util_min=0%
> util_cfs = 10%, util_min=50%
>
> the two policies will select:
>
> UCLAMP_SCHED_CLASS: util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
> !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclmp(50) = 50%
>
> Which means that, despite the CPU's util_min will be set to 50% when
> CFS is running, these tasks will have almost no boost at all, since
> their bandwidth margin is eclipsed by RT tasks.
Hmm ... on the other hand, even if there is no running rt task but only
some remaining blocked rt utilization, say
  util_rt  = 10%, util_min=0%
  util_cfs = 40%, util_min=50%,
the UCLAMP_SCHED_CLASS policy gives: util = uclamp(10) + uclamp(40) = 50 + 50 = 100%
So a cfs task can get double boosted by a small rt task.
Furthermore, if there is no rt task but 2 cfs tasks of 40% and 10%
the UCLAMP_SCHED_CLASS: util = uclamp(0) + uclamp(40) = 50 = 50%
So in this case cfs tasks don't get more boost and have to share the
bandwidth, and you don't ensure 50% for each, unlike what you try to do
for rt.
You create a difference in the behavior depending on the class of the
other co-scheduled tasks, which is not sane IMHO.
>
> > > We had an internal discussion and we found pro/cons for both... but
>
> The UCLAMP_SCHED_CLASS policy is thus less energy efficiency but it
> should grant a better "isolation" in terms of what is the expected
> speed-up a task will get at run-time, independently from higher
> priority classes.
>
> Does that make sense?
>
> > > I'm not sure keeping the sched_feat is a good solution on the long
> > > run, i.e. mainline merge ;)
>
> This problem still stands...
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On Mon, 13 Aug 2018 at 14:07, Vincent Guittot
<[email protected]> wrote:
>
> On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
> >
> > Hi Vincent!
> >
> > On 09-Aug 18:03, Vincent Guittot wrote:
> > > > On 07-Aug 15:26, Juri Lelli wrote:
> >
> > [...]
> >
> > > > > > + util_cfs = cpu_util_cfs(rq);
> > > > > > + util_rt = cpu_util_rt(rq);
> > > > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > > > + util = 0;
> > > > > > + if (util_cfs)
> > > > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > > > + if (util_rt)
> > > > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > > > + } else {
> > > > > > + util = cpu_util_cfs(rq);
> > > > > > + util += cpu_util_rt(rq);
> > > > > > + util = uclamp_util(cpu_of(rq), util);
> > > > > > + }
> > > >
> > > > Regarding the two policies, do you have any comment?
> > >
> > > Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> > > make sense as it is ?
> > > I mean, uclamp_util doesn't make any difference between rt and cfs
> > > tasks when clamping the utilization so why should be add twice the
> > > returned value ?
> > > IMHO, this policy would make sense if there were something like
> > > uclamp_util_rt() and a uclamp_util_cfs()
> >
> > The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
> > low-priority classese, especially when we have high RT utilization.
> >
> > Let say we have:
> >
> > util_rt = 40%, util_min=0%
> > util_cfs = 10%, util_min=50%
> >
> > the two policies will select:
> >
> > UCLAMP_SCHED_CLASS: util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
> > !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclmp(50) = 50%
> >
> > Which means that, despite the CPU's util_min will be set to 50% when
> > CFS is running, these tasks will have almost no boost at all, since
> > their bandwidth margin is eclipsed by RT tasks.
>
> Hmm ... At the opposite, even if there is no running rt task but only
> some remaining blocked rt utilization,
> even if util_rt = 10%, util_min=0%
> and util_cfs = 40%, util_min=50%
> the UCLAMP_SCHED_CLASS: util = uclamp(10) + uclamp(40) = 50 + 50 = 100%
>
> So cfs task can get double boosted by a small rt task.
>
> Furthermore, if there is no rt task but 2 cfs tasks of 40% and 10%
> the UCLAMP_SCHED_CLASS: util = uclamp(0) + uclamp(40) = 50 = 50%
s/uclamp(40)/uclamp(50)/
>
> So in this case cfs tasks don't get more boost and have to share the
> bandwidth and you don't ensure 50% for each unlike what you try to do
> for rt.
> You create a difference in the behavior depending of the class of the
> others co-schedule tasks which is not sane IMHO
>
>
> >
> > > > We had an internal discussion and we found pro/cons for both... but
> >
> > The UCLAMP_SCHED_CLASS policy is thus less energy efficiency but it
> > should grant a better "isolation" in terms of what is the expected
> > speed-up a task will get at run-time, independently from higher
> > priority classes.
> >
> > Does that make sense?
> >
> > > > I'm not sure keeping the sched_feat is a good solution on the long
> > > > run, i.e. mainline merge ;)
> >
> > This problem still stands...
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi
On 07-Aug 11:59, Juri Lelli wrote:
> Hi,
>
> Minor comments below.
>
> On 06/08/18 17:39, Patrick Bellasi wrote:
>
> [...]
>
> > + *
> > + * Task Utilization Attributes
> > + * ===========================
> > + *
> > + * A subset of sched_attr attributes allows to specify the utilization which
> > + * should be expected by a task. These attributes allows to inform the
> ^
> allow
>
> > + * scheduler about the utilization boundaries within which is safe to schedule
>
> Isn't all this more about providing hints than safety?
Yes, it's "just" hints... will rephrase to make it more clear.
> > + * the task. These utilization boundaries are valuable information to support
> > + * scheduler decisions on both task placement and frequencies selection.
> > + *
> > + * @sched_util_min represents the minimum utilization
> > + * @sched_util_max represents the maximum utilization
> > + *
> > + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> > + * represents the percentage of CPU time used by a task when running at the
> > + * maximum frequency on the highest capacity CPU of the system. Thus, for
> > + * example, a 20% utilization task is a task running for 2ms every 10ms.
> > + *
> > + * A task with a min utilization value bigger then 0 is more likely to be
> > + * scheduled on a CPU which can provide that bandwidth.
> > + * A task with a max utilization value smaller then 1024 is more likely to be
> > + * scheduled on a CPU which do not provide more then the required bandwidth.
>
> Isn't s/bandwidth/capacity/ here, above, and in general where you use
> the term "bandwidth" more appropriate? I wonder if overloading this term
> (w.r.t. how is used with DEADLINE) might create confusion. In this case
> we are not providing any sort of guarantees, it's a hint.
Yes, you're right... here we are not really granting any bandwidth but
just "improving" the bandwidth provisioning by hinting the scheduler
about a certain min/max required capacity.
The problem related to using capacity is that, from kernel space,
capacity is defined as a static quantity/property of CPUs. Still, I
think it makes sense to argue that util_{min,max} are hints on the
min/max capacity required for a task.
I'll update comments and text to avoid using bandwidth in favour of
capacity.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 13/08/18 13:14, Patrick Bellasi wrote:
> On 07-Aug 11:59, Juri Lelli wrote:
> > Hi,
> >
> > Minor comments below.
> >
> > On 06/08/18 17:39, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > + *
> > > + * Task Utilization Attributes
> > > + * ===========================
> > > + *
> > > + * A subset of sched_attr attributes allows to specify the utilization which
> > > + * should be expected by a task. These attributes allows to inform the
> > ^
> > allow
> >
> > > + * scheduler about the utilization boundaries within which is safe to schedule
> >
> > Isn't all this more about providing hints than safety?
>
> Yes, it's "just" hints... will rephrase to make it more clear.
>
> > > + * the task. These utilization boundaries are valuable information to support
> > > + * scheduler decisions on both task placement and frequencies selection.
> > > + *
> > > + * @sched_util_min represents the minimum utilization
> > > + * @sched_util_max represents the maximum utilization
> > > + *
> > > + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> > > + * represents the percentage of CPU time used by a task when running at the
> > > + * maximum frequency on the highest capacity CPU of the system. Thus, for
> > > + * example, a 20% utilization task is a task running for 2ms every 10ms.
> > > + *
> > > + * A task with a min utilization value bigger then 0 is more likely to be
> > > + * scheduled on a CPU which can provide that bandwidth.
> > > + * A task with a max utilization value smaller then 1024 is more likely to be
> > > + * scheduled on a CPU which do not provide more then the required bandwidth.
> >
> > Isn't s/bandwidth/capacity/ here, above, and in general where you use
> > the term "bandwidth" more appropriate? I wonder if overloading this term
> > (w.r.t. how is used with DEADLINE) might create confusion. In this case
> > we are not providing any sort of guarantees, it's a hint.
>
> Yes, you right... here we are not really granting any bandwidth but
> just "improving" the bandwidth provisioning by hinting the scheduler
> about a certain min/max capacity required.
>
> The problem related to using capacity is that, from kernel space,
> capacity is defined as a static quantity/property of CPUs. Still, I
Looks like it's also more in line with EAS terminology (i.e., capacity
states).
On 13-Aug 14:07, Vincent Guittot wrote:
> On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
> >
> > Hi Vincent!
> >
> > On 09-Aug 18:03, Vincent Guittot wrote:
> > > > On 07-Aug 15:26, Juri Lelli wrote:
> >
> > [...]
> >
> > > > > > + util_cfs = cpu_util_cfs(rq);
> > > > > > + util_rt = cpu_util_rt(rq);
> > > > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > > > + util = 0;
> > > > > > + if (util_cfs)
> > > > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > > > + if (util_rt)
> > > > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > > > + } else {
> > > > > > + util = cpu_util_cfs(rq);
> > > > > > + util += cpu_util_rt(rq);
> > > > > > + util = uclamp_util(cpu_of(rq), util);
> > > > > > + }
> > > >
> > > > Regarding the two policies, do you have any comment?
> > >
> > > Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> > > make sense as it is ?
> > > I mean, uclamp_util doesn't make any difference between rt and cfs
> > > tasks when clamping the utilization so why should be add twice the
> > > returned value ?
> > > IMHO, this policy would make sense if there were something like
> > > uclamp_util_rt() and a uclamp_util_cfs()
> >
> > The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
> > low-priority classese, especially when we have high RT utilization.
> >
> > Let say we have:
> >
> > util_rt = 40%, util_min=0%
> > util_cfs = 10%, util_min=50%
> >
> > the two policies will select:
> >
> > UCLAMP_SCHED_CLASS: util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
> > !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclmp(50) = 50%
> >
> > Which means that, despite the CPU's util_min will be set to 50% when
> > CFS is running, these tasks will have almost no boost at all, since
> > their bandwidth margin is eclipsed by RT tasks.
>
> Hmm ... At the opposite, even if there is no running rt task but only
> some remaining blocked rt utilization,
> even if util_rt = 10%, util_min=0%
> and util_cfs = 40%, util_min=50%
> the UCLAMP_SCHED_CLASS: util = uclamp(10) + uclamp(40) = 50 + 50 = 100%
Yes, that's true... since right now I clamp util_rt whenever it's
non-zero. Perhaps this can be fixed by clamping util_rt only when:
	if (rt_rq_is_runnable(&rq->rt))
?
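Something along these lines in the UCLAMP_SCHED_CLASS branch (just a
sketch of the possible fix, still to be properly validated):

	/*
	 * Sketch: clamp the RT contribution only when RT is actually
	 * RUNNABLE, so that blocked RT utilization cannot trigger a boost.
	 */
	util = 0;
	if (util_cfs)
		util += uclamp_util(cpu_of(rq), util_cfs);
	if (rt_rq_is_runnable(&rq->rt))
		util += uclamp_util(cpu_of(rq), util_rt);
	else
		util += util_rt;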
> So cfs task can get double boosted by a small rt task.
Well, in principle we don't know if the 50% clamp was asserted by the
RT or the CFS task, since in the current implementation we max
aggregate clamp values across all RT and CFS tasks.
> Furthermore, if there is no rt task but 2 cfs tasks of 40% and 10%
> the UCLAMP_SCHED_CLASS: util = uclamp(0) + uclamp(40) = 50 = 50%
True, but here we are within the same class, and what utilization
clamping aims to do is to define the minimum capacity to run _all_
the RUNNABLE tasks... not the minimum capacity for _each_ one of them.
> So in this case cfs tasks don't get more boost and have to share the
> bandwidth and you don't ensure 50% for each unlike what you try to do
> for rt.
Above I'm not trying to fix a per-task issue. The UCLAMP_SCHED_CLASS
policy is just "trying" to fix a cross-class issue... if we agree
there can be a cross-class issue worth fixing.
> You create a difference in the behavior depending of the class of the
> others co-schedule tasks which is not sane IMHO
Yes, I agree that the current behavior is not completely clean... still
the question is: do you acknowledge the problem I depicted above, i.e. RT
workloads eclipsing the min_util required by lower priority classes?
To a certain extent I see this problem as similar to the rt/dl/irq
pressure in defining cpu_capacity, don't you?
Maybe we can make use of (cpu_capacity_orig - cpu_capacity) to factor
in a util_min compensation for CFS tasks?
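Something like this strawman (the field names are the existing rq ones,
but the compensation formula itself is just an idea to discuss):

	/*
	 * Strawman: bump the CFS util_min by the capacity currently eaten
	 * by rt/dl/irq pressure, so boosted CFS tasks keep their margin.
	 */
	unsigned long pressure = rq->cpu_capacity_orig - rq->cpu_capacity;
	unsigned long min_util = uclamp_value(cpu_of(rq), UCLAMP_MIN);

	min_util = min(min_util + pressure, rq->cpu_capacity_orig);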
--
#include <best/regards.h>
Patrick Bellasi
On Mon, 13 Aug 2018 at 14:49, Patrick Bellasi <[email protected]> wrote:
>
> On 13-Aug 14:07, Vincent Guittot wrote:
> > On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
> > >
> > > Hi Vincent!
> > >
> > > On 09-Aug 18:03, Vincent Guittot wrote:
> > > > > On 07-Aug 15:26, Juri Lelli wrote:
> > >
> > > [...]
> > >
> > > > > > > + util_cfs = cpu_util_cfs(rq);
> > > > > > > + util_rt = cpu_util_rt(rq);
> > > > > > > + if (sched_feat(UCLAMP_SCHED_CLASS)) {
> > > > > > > + util = 0;
> > > > > > > + if (util_cfs)
> > > > > > > + util += uclamp_util(cpu_of(rq), util_cfs);
> > > > > > > + if (util_rt)
> > > > > > > + util += uclamp_util(cpu_of(rq), util_rt);
> > > > > > > + } else {
> > > > > > > + util = cpu_util_cfs(rq);
> > > > > > > + util += cpu_util_rt(rq);
> > > > > > > + util = uclamp_util(cpu_of(rq), util);
> > > > > > > + }
> > > > >
> > > > > Regarding the two policies, do you have any comment?
> > > >
> > > > Does the policy for (sched_feat(UCLAMP_SCHED_CLASS)== true) really
> > > > make sense as it is ?
> > > > I mean, uclamp_util doesn't make any difference between rt and cfs
> > > > tasks when clamping the utilization so why should be add twice the
> > > > returned value ?
> > > > IMHO, this policy would make sense if there were something like
> > > > uclamp_util_rt() and a uclamp_util_cfs()
> > >
> > > The idea for the UCLAMP_SCHED_CLASS policy is to improve fairness on
> > > low-priority classese, especially when we have high RT utilization.
> > >
> > > Let say we have:
> > >
> > > util_rt = 40%, util_min=0%
> > > util_cfs = 10%, util_min=50%
> > >
> > > the two policies will select:
> > >
> > > UCLAMP_SCHED_CLASS: util = uclamp(40) + uclamp(10) = 50 + 50 = 100%
> > > !UCLAMP_SCHED_CLASS: util = uclamp(40 + 10) = uclmp(50) = 50%
> > >
> > > Which means that, despite the CPU's util_min will be set to 50% when
> > > CFS is running, these tasks will have almost no boost at all, since
> > > their bandwidth margin is eclipsed by RT tasks.
> >
> > Hmm ... At the opposite, even if there is no running rt task but only
> > some remaining blocked rt utilization,
> > even if util_rt = 10%, util_min=0%
> > and util_cfs = 40%, util_min=50%
> > the UCLAMP_SCHED_CLASS: util = uclamp(10) + uclamp(40) = 50 + 50 = 100%
>
> Yes, that's true... since now I clamp util_rt if it's non zero.
> Perhaps this can be fixed by clamping util_rt only:
> if (rt_rq_is_runnable(&rq->rt))
> ?
>
> > So cfs task can get double boosted by a small rt task.
>
> Well, in principle we don't know if the 50% clamp was asserted by the
> RT or the CFS task, since in the current implementation we max
> aggregate clamp values across all RT and CFS tasks.
Yes, it was just the assumption of your example above.
IMHO, having util = 100% for your use case looks more like a bug than a
feature. As you said below: "what utilization clamping aims to do is to
define the minimum capacity to run _all_ the RUNNABLE tasks... not the
minimum capacity for _each_ one of them".
>
> > Furthermore, if there is no rt task but 2 cfs tasks of 40% and 10%
> > the UCLAMP_SCHED_CLASS: util = uclamp(0) + uclamp(40) = 50 = 50%
>
> True, but here we are within the same class and what utilization
> clamping aims to do is to defined the minimum capacity to run _all_
> the RUNNABLE tasks... not the minimum capacity for _each_ one of them.
I fully agree, and that's exactly what I want to highlight: with the
UCLAMP_SCHED_CLASS policy, you try (but fail, because the clamping is
not done per class) to distinguish rt and cfs as different kinds of
runnable tasks.
>
> > So in this case cfs tasks don't get more boost and have to share the
> > bandwidth and you don't ensure 50% for each unlike what you try to do
> > for rt.
>
> Above I'm not trying to fix a per-task issue. The UCLAMP_SCHED_CLASS
> policy is just "trying" to fix a cross-class issue... if we agree
> there can be a cross-class issue worth to be fixed.
But the cross-class issue that you are describing can also exist
between cfs tasks with different uclamp_min values.
So I'm not sure that there is more of a cross-class issue than an
in-class issue.
>
> > You create a difference in the behavior depending of the class of the
> > others co-schedule tasks which is not sane IMHO
>
> Yes I agree that the current behavior is not completely clean... still
> the question is: do you reckon the problem I depicted above, i.e. RT
> workloads eclipsing the min_util required by lower priority classes?
As said above, I don't think that there is a problem that is specific
to cross class scheduling that can't also happen in the same class.
Regarding your example:
task TA util=40% with uclamp_min 50%
task TB util=10% with uclamp_min 0%
If TA and TB are cfs, util=50% and it doesn't seem to be a problem,
even though TB will steal some bandwidth from TA and delay it (and I'm
not even speaking about the impact of the nice priority of TB).
If TA is cfs and TB is rt, why is util=50% now a problem for TA?
>
> To a certain extend I see this problem similar to the rt/dl/irq pressure
> in defining cpu_capacity, isn't it?
>
> Maybe we can make use of (cpu_capacity_orig - cpu_capacity) to factor
> in a util_min compensation for CFS tasks?
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 13-Aug 16:06, Vincent Guittot wrote:
> On Mon, 13 Aug 2018 at 14:49, Patrick Bellasi <[email protected]> wrote:
> > On 13-Aug 14:07, Vincent Guittot wrote:
> > > On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
[...]
> > Yes I agree that the current behavior is not completely clean... still
> > the question is: do you reckon the problem I depicted above, i.e. RT
> > workloads eclipsing the min_util required by lower priority classes?
>
> As said above, I don't think that there is a problem that is specific
> to cross class scheduling that can't also happen in the same class.
>
> Regarding your example:
> task TA util=40% with uclamp_min 50%
> task TB util=10% with uclamp_min 0%
>
> If TA and TB are cfs, util=50% and it doesn't seem to be a problem
> whereas TB will steal some bandwidth to TA and delay it (and i even
> don't speak about the impact of the nice priority of TB)
> If TA is cfs and TB is rt, Why util=50% is now a problem for TA ?
You're right: in the current implementation, where we _do not_
distinguish among scheduling classes, it's not possible to get a
reasonable implementation of per sched class clamping.
> > To a certain extend I see this problem similar to the rt/dl/irq pressure
> > in defining cpu_capacity, isn't it?
However, I still think that higher priority classes eclipsing the
clamping of lower priority classes can be a problem.
In your example above, the main difference between TA and TB being in
the same class or in different classes is that in the second case TB is
guaranteed to always preempt TA. We can end up with a non boosted RT
task consuming all the boosted bandwidth required by a CFS task.
This does not happen, apart maybe from the corner case of really
different nice values, if the tasks are both CFS, since the fair
scheduler will grant some progress to both of them.
Thus, given the current implementation, I think it makes sense to drop
the UCLAMP_SCHED_CLASS policy and stick with a cleaner and more
consistent design.
I'll then see if it makes sense to add a dedicated patch on top of the
series introducing proper per-class clamp tracking.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Mon, Aug 06, 2018 at 05:39:34PM +0100, Patrick Bellasi wrote:
> Utilization clamping requires each CPU to know which clamp values are
> assigned to tasks that are currently RUNNABLE on that CPU.
> Multiple tasks can be assigned the same clamp value and tasks with
> different clamp values can be concurrently active on the same CPU.
> Thus, a proper data structure is required to support a fast and
> efficient aggregation of the clamp values required by the currently
> RUNNABLE tasks.
>
> For this purpose we use a per-CPU array of reference counters,
> where each slot is used to account how many tasks require a certain
> clamp value are currently RUNNABLE on each CPU.
> Each clamp value corresponds to a "clamp index" which identifies the
> position within the array of reference counters.
>
> :
> (user-space changes) : (kernel space / scheduler)
> :
> SLOW PATH : FAST PATH
> :
> task_struct::uclamp::value : sched/core::enqueue/dequeue
> : cpufreq_schedutil
> :
> +----------------+ +--------------------+ +-------------------+
> | TASK | | CLAMP GROUP | | CPU CLAMPS |
> +----------------+ +--------------------+ +-------------------+
> | | | clamp_{min,max} | | clamp_{min,max} |
> | util_{min,max} | | se_count | | tasks count |
> +----------------+ +--------------------+ +-------------------+
> :
> +------------------> : +------------------->
> group_id = map(clamp_value) : ref_count(group_id)
> :
> :
>
> Let's introduce the support to map tasks to "clamp groups".
> Specifically we introduce the required functions to translate a
> "clamp value" into a clamp's "group index" (group_id).
>
> Only a limited number of (different) clamp values are supported since:
> 1. there are usually only few classes of workloads for which it makes
> sense to boost/limit to different frequencies,
> e.g. background vs foreground, interactive vs low-priority
> 2. it allows a simpler and more memory/time efficient tracking of
> the per-CPU clamp values in the fast path.
>
> The number of possible different clamp values is currently defined at
> compile time. Thus, setting a new clamp value for a task can result into
> a -ENOSPC error in case this will exceed the number of maximum different
> clamp values supported.
>
I see that we drop the reference on the previous clamp group when a task
changes its clamp limits. What about exiting tasks which claimed clamp
groups? Shouldn't we drop the reference there too?
Thanks,
Pavan
Hi Pavan,
On 14-Aug 16:55, Pavan Kondeti wrote:
> On Mon, Aug 06, 2018 at 05:39:34PM +0100, Patrick Bellasi wrote:
> I see that we drop reference on the previous clamp group when a task changes
> its clamp limits. What about exiting tasks which claimed clamp groups? should
> not we drop the reference?
Yes, you're right... when a task exits we are not currently releasing
the reference to its (possibly defined) task-specific clamp value!
Thanks for pointing this out... I'll fix this in the next posting!
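The fix I have in mind is to release the task's clamp group references
from the exit path, roughly like this (hypothetical helper name and
hook point, still to be finalized):

	/* Sketch: release the clamp groups refcounted by an exiting task. */
	static void uclamp_exit_task(struct task_struct *p)
	{
		int clamp_id;

		for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
			if (p->uclamp[clamp_id].group_id == UCLAMP_NOT_VALID)
				continue;
			uclamp_group_put(clamp_id, p->uclamp[clamp_id].group_id);
			p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
		}
	}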
--
#include <best/regards.h>
Patrick Bellasi
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @cpu: the CPU from where the clamp group has to be released
> + * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
> + *
> + * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
> + * counted by the task is decreased.
> + * If this was the last task defining the current max clamp group, then the
> + * CPU clamping is updated to find the new max for the specified clamp
> + * index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + unsigned int clamp_value;
> + int group_id;
> +
> + /* No task specific clamp values: nothing to do */
> + group_id = p->uclamp[clamp_id].group_id;
> + if (group_id == UCLAMP_NOT_VALID)
> + return;
> +
> + /* Decrement the task's reference counted group index */
> + uc_grp = &rq->uclamp.group[clamp_id][0];
> +#ifdef SCHED_DEBUG
> + if (unlikely(uc_grp[group_id].tasks == 0)) {
> + WARN(1, "invalid CPU[%d] clamp group [%d:%d] refcount\n",
> + cpu_of(rq), clamp_id, group_id);
> + uc_grp[group_id].tasks = 1;
> + }
> +#endif
This one indicates that there are some holes in your ref-counting. It's
probably easier to debug that there is still a task but the
uc_grp[group_id].tasks value == 0 (A). I assume the other problem exists
as well, i.e. last task and uc_grp[group_id].tasks > 1 (B)?
You have uclamp_cpu_[get/put](_id)() in [enqueue/dequeue]_task.
Patch 04/14 introduces its use in uclamp_task_update_active().
Do you know why (A) (and (B)) are happening?
> + uc_grp[group_id].tasks -= 1;
> +
> + /* If this is not the last task, no updates are required */
> + if (uc_grp[group_id].tasks > 0)
> + return;
> +
> + /*
> + * Update the CPU only if this was the last task of the group
> + * defining the current clamp value.
> + */
> + uc_cpu = &rq->uclamp;
> + clamp_value = uc_grp[group_id].value;
> + if (clamp_value >= uc_cpu->value[clamp_id])
'clamp_value > uc_cpu->value[clamp_id]' should indicate another
inconsistency in the uclamp machinery, right?
[...]
Hi Dietmar!
On 14-Aug 17:44, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
> >+/**
> >+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> >+ * @p: the task being dequeued from a CPU
> >+ * @cpu: the CPU from where the clamp group has to be released
> >+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
> >+ *
> >+ * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
> >+ * counted by the task is decreased.
> >+ * If this was the last task defining the current max clamp group, then the
> >+ * CPU clamping is updated to find the new max for the specified clamp
> >+ * index.
> >+ */
> >+static inline void uclamp_cpu_put_id(struct task_struct *p,
> >+ struct rq *rq, int clamp_id)
> >+{
> >+ struct uclamp_group *uc_grp;
> >+ struct uclamp_cpu *uc_cpu;
> >+ unsigned int clamp_value;
> >+ int group_id;
> >+
> >+ /* No task specific clamp values: nothing to do */
> >+ group_id = p->uclamp[clamp_id].group_id;
> >+ if (group_id == UCLAMP_NOT_VALID)
> >+ return;
> >+
> >+ /* Decrement the task's reference counted group index */
> >+ uc_grp = &rq->uclamp.group[clamp_id][0];
> >+#ifdef SCHED_DEBUG
> >+ if (unlikely(uc_grp[group_id].tasks == 0)) {
> >+ WARN(1, "invalid CPU[%d] clamp group [%d:%d] refcount\n",
> >+ cpu_of(rq), clamp_id, group_id);
> >+ uc_grp[group_id].tasks = 1;
> >+ }
> >+#endif
>
> This one indicates that there are some holes in your ref-counting.
Not really, this has been added not because I've detected a refcount
issue... but because it was suggested as a possible safety check in a
previous code review comment:
https://lore.kernel.org/lkml/20180720151156.GA31421@e110439-lin/
> It's probably easier to debug that there is still a task but the
> uc_grp[group_id].tasks value == 0 (A). I assume the other problem exists as
> well, i.e. last task and uc_grp[group_id].tasks > 1 (B)?
>
> You have uclamp_cpu_[get/put](_id)() in [enqueue/dequeue]_task.
>
> Patch 04/14 introduces its use in uclamp_task_update_active().
>
> Do you know why (A) (and (B)) are happening?
I've never seen that warning in my tests so far, so, again, the warning
is there just to support testing/debugging when the refcounting code
is/will be touched in the future. That's also the reason why it is
SCHED_DEBUG protected.
> >+ uc_grp[group_id].tasks -= 1;
> >+
> >+ /* If this is not the last task, no updates are required */
> >+ if (uc_grp[group_id].tasks > 0)
> >+ return;
> >+
> >+ /*
> >+ * Update the CPU only if this was the last task of the group
> >+ * defining the current clamp value.
> >+ */
> >+ uc_cpu = &rq->uclamp;
> >+ clamp_value = uc_grp[group_id].value;
> >+ if (clamp_value >= uc_cpu->value[clamp_id])
>
> 'clamp_value > uc_cpu->value[clamp_id]' should indicate another
> inconsistency in the uclamp machinery, right?
Here you're right, I would say that it should always be:
clamp_value <= uc_cpu->value[clamp_id]
since this matches the update done at the end of uclamp_cpu_get_id():
if (uc_cpu->value[clamp_id] < clamp_value)
uc_cpu->value[clamp_id] = clamp_value;
Perhaps we can add another safety check here, similar to the one
above, to have something like:
	clamp_value = uc_grp[group_id].value;
#ifdef SCHED_DEBUG
	if (unlikely(clamp_value > uc_cpu->value[clamp_id])) {
		WARN(1, "invalid CPU[%d] clamp group [%d:%d] value\n",
		     cpu_of(rq), clamp_id, group_id);
	}
#endif
	if (clamp_value == uc_cpu->value[clamp_id])
		uclamp_cpu_update(rq, clamp_id);
--
#include <best/regards.h>
Patrick Bellasi
On 08/14/2018 06:49 PM, Patrick Bellasi wrote:
> Hi Dietmar!
>
> On 14-Aug 17:44, Dietmar Eggemann wrote:
>> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
>> This one indicates that there are some holes in your ref-counting.
>
> Not really, this has been added not because I've detected a refcount
> issue... but because it was suggested as a possible safety check in a
> previous code review comment:
>
> https://lore.kernel.org/lkml/20180720151156.GA31421@e110439-lin/
>
>> It's probably easier to debug that there is still a task but the
>> uc_grp[group_id].tasks value == 0 (A). I assume the other problem exists as
>> well, i.e. last task and uc_grp[group_id].tasks > 1 (B)?
>>
>> You have uclamp_cpu_[get/put](_id)() in [enqueue/dequeue]_task.
>>
>> Patch 04/14 introduces its use in uclamp_task_update_active().
>>
>> Do you know why (A) (and (B)) are happening?
>
> I've never saw that warning in my tests so far so, again, the warning
> is there just to support testing/debugging when refcounting code
> is/will be touched in the future. That's also the reason why is
> SCHED_DEBUG protected.
Ah, OK, I thought you were really seeing it more often and that it also
related to Pavan's comment on 02/14 about the missing treatment of
exiting tasks.
If this is only for testing/debugging, I would suggest a simple one-line
BUG_ON().
You find CONFIG_SCHED_DEBUG=y in production kernels as well.
[...]
> Here you right, I would say that it should always be:
>
> clamp_value <= uc_cpu->value[clamp_id]
>
> since this matches the update done at the end of uclamp_cpu_get_id():
>
> if (uc_cpu->value[clamp_id] < clamp_value)
> uc_cpu->value[clamp_id] = clamp_value;
>
> Perhaps we can add another safety check here, similar to the one
> above, to have something like:
>
> clamp_value = uc_grp[group_id].value;
> #ifdef SCHED_DEBUG
> if (unlikely(clamp_value > uc_cpu->value[clamp_id])) {
> WARN(1, "invalid CPU[%d] clamp group [%d:%d] value\n",
> cpu_of(rq), clamp_id, group_id);
> #endif
> if (clamp_value == uc_cpu->value[clamp_id])
> uclamp_cpu_update(rq, clamp_id);
Yes, but I would prefer a BUG_ON() one liner.
On 15-Aug 11:37, Dietmar Eggemann wrote:
> On 08/14/2018 06:49 PM, Patrick Bellasi wrote:
> >Hi Dietmar!
> >
> >On 14-Aug 17:44, Dietmar Eggemann wrote:
> >>On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>
> [...]
>
> >>This one indicates that there are some holes in your ref-counting.
> >
> >Not really, this has been added not because I've detected a refcount
> >issue... but because it was suggested as a possible safety check in a
> >previous code review comment:
> >
> > https://lore.kernel.org/lkml/20180720151156.GA31421@e110439-lin/
> >
> >>It's probably easier to debug that there is still a task but the
> >>uc_grp[group_id].tasks value == 0 (A). I assume the other problem exists as
> >>well, i.e. last task and uc_grp[group_id].tasks > 1 (B)?
> >>
> >>You have uclamp_cpu_[get/put](_id)() in [enqueue/dequeue]_task.
> >>
> >>Patch 04/14 introduces its use in uclamp_task_update_active().
> >>
> >>Do you know why (A) (and (B)) are happening?
> >
> >I've never saw that warning in my tests so far so, again, the warning
> >is there just to support testing/debugging when refcounting code
> >is/will be touched in the future. That's also the reason why is
> >SCHED_DEBUG protected.
>
> Ah, OK, I thought you really see it more often and that it also relate to
> Pavan's comment on 02/14 about the missing treatment of exiting tasks.
>
> If this is only for testing/debugging, I would suggest a simple one line
> BUG_ON()
These are (possibly) considered recoverable errors... thus, AFAIK,
using BUG_ON is overkill and discouraged:
https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bug.h#L42
> You find CONFIG_SCHED_DEBUG=y in production kernels as well.
AFAIK, that setting is discouraged for production kernels...
Moreover, it's still better to WARN sometimes on a production kernel
than to crash the device, isn't it?
--
#include <best/regards.h>
Patrick Bellasi
On 08/15/2018 12:54 PM, Patrick Bellasi wrote:
> On 15-Aug 11:37, Dietmar Eggemann wrote:
>> On 08/14/2018 06:49 PM, Patrick Bellasi wrote:
>>> Hi Dietmar!
>>>
>>> On 14-Aug 17:44, Dietmar Eggemann wrote:
>>>> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[..]
>> If this is only for testing/debugging, I would suggest a simple one line
>> BUG_ON()
>
> These are (eventually) considered as recoverable errors... thus,
> AFAIK, using BUG_ON is overkilling and discouraged:
> https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bug.h#L42
Not sure about that. If this refcounting is out of sync, that indicates
a serious issue to me which should be fixed.
>> You find CONFIG_SCHED_DEBUG=y in production kernels as well.
>
> AFAIK, that setting is discouraged for production kernels...
> Moreover, it's still better to WARN sometimes on a production kernel
> the crash the device, isnt't it?
IMHO, if this is something which should not happen at all, a BUG_ON() is
the right thing to do here. And you get the call stack to investigate
why it hit.
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
> +/**
> + * uclamp_task_active: check if a task is currently clamping a CPU
> + * @p: the task to check
> + *
> + * A task actively affects the utilization clamp of a CPU if:
> + * - it's currently enqueued or running on that CPU
> + * - it's refcounted in at least one clamp group of that CPU
> + *
> + * Return: true if p is currently clamping the utilization of its CPU.
> + */
> +static inline bool uclamp_task_active(struct task_struct *p)
> +{
> + struct rq *rq = task_rq(p);
> + int clamp_id;
> +
> + lockdep_assert_held(&p->pi_lock);
> + lockdep_assert_held(&rq->lock);
> +
> + if (!task_on_rq_queued(p) && !p->on_cpu)
> + return false;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + if (uclamp_task_affects(p, clamp_id))
> + return true;
> + }
> +
> + return false;
> +}
Looks like uclamp_task_active() is only used once (in
uclamp_task_update_active()). Can you not code the if condition and the
for loop directly in uclamp_task_update_active()? This would save code
(lockdep_assert_held() etc.) and comment lines.
[...]
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
> +{
> + return uclamp_none(clamp_id);
> +}
Looks like uclamp_value() is not used outside CONFIG_UCLAMP_TASK areas.
On Mon, Aug 06, 2018 at 05:39:41PM +0100, Patrick Bellasi wrote:
> In order to properly support hierarchical resources control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires to properly propagate and aggregate parent
> attributes down to its descendants.
>
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This represent also the clamp value which is actually
> used to clamp tasks in each task group.
>
> Since it can be interesting for tasks in a cgroup to know exactly what
> is the currently propagated/enforced configuration, the effective clamp
> values are exposed to user-space by means of a new pair of read-only
> attributes: cpu.util.{min,max}.effective.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
<snip>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8f48e64fb8a6..3fac2d098084 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -589,6 +589,11 @@ struct uclamp_se {
> unsigned int value;
> /* Utilization clamp group for this constraint */
> unsigned int group_id;
> + /* Effective clamp for tasks in this group */
> + struct {
> + unsigned int value;
> + unsigned int group_id;
> + } effective;
> };
Are these needed when CONFIG_UCLAMP_TASK_GROUP is disabled?
>
> union rcu_special {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2ba55a4afffb..f692df3787bd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1237,6 +1237,8 @@ static inline void init_uclamp_sched_group(void)
> uc_se = &root_task_group.uclamp[clamp_id];
> uc_se->value = uclamp_none(clamp_id);
> uc_se->group_id = group_id;
> + uc_se->effective.value = uclamp_none(clamp_id);
> + uc_se->effective.group_id = group_id;
>
> /* Attach root TG's clamp group */
> uc_map[group_id].se_count = 1;
>
<snip>
> @@ -7622,11 +7687,19 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_util_min_read_u64,
> .write_u64 = cpu_util_min_write_u64,
> },
> + {
> + .name = "util.min.effective",
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
> {
> .name = "util.max",
> .read_u64 = cpu_util_max_read_u64,
> .write_u64 = cpu_util_max_write_u64,
> },
> + {
> + .name = "util.max.effective",
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
> #endif
> { } /* Terminate */
> };
Is there any reason why these are not added for the default hierarchy?
Thanks,
Pavan
On Mon, Aug 06, 2018 at 05:39:44PM +0100, Patrick Bellasi wrote:
> Clamp values cannot be tuned at the root cgroup level. Moreover, because
> of the delegation model requirements and how the parent clamps
> propagation works, if we want to enable subgroups to set a non null
> util.min, we need to be able to configure the root group util.min to the
> allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
>
> Unfortunately this setup will also mean that all tasks running in the
> root group, will always get a maximum util.min clamp, unless they have a
> lower task specific clamp which is definitively not a desirable default
> configuration.
>
> Let's fix this by explicitly adding a system default configuration
> (sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
> for all tasks running on the root group.
>
> This interface is available independently from cgroups, thus providing a
> complete solution for system wide utilization clamping configuration.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
<snip>
> +/*
> + * Minimum utilization for tasks in the root cgroup
> + * default: 0
> + */
> +unsigned int sysctl_sched_uclamp_util_min;
> +
> +/*
> + * Maximum utilization for tasks in the root cgroup
> + * default: 1024
> + */
> +unsigned int sysctl_sched_uclamp_util_max = 1024;
> +
> +static struct uclamp_se uclamp_default[UCLAMP_CNT];
> +
The default group id for un-clamped root tasks is 0 because of
this declaration, correct?
> /**
> * uclamp_map: reference counts a utilization "clamp value"
> * @value: the utilization "clamp value" required
> @@ -957,12 +971,25 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
> group_id = uc_se->group_id;
>
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> + /*
> + * Tasks in the root group, which do not have a task specific clamp
> + * value, get the system default calmp value.
> + */
> + if (group_id == UCLAMP_NOT_VALID &&
> + task_group(p) == &root_task_group) {
> + return uclamp_default[clamp_id].group_id;
> + }
> +
> /* Use TG's clamp value to limit task specific values */
> uc_se = &task_group(p)->uclamp[clamp_id];
> if (group_id == UCLAMP_NOT_VALID ||
> clamp_value > uc_se->effective.value) {
> group_id = uc_se->effective.group_id;
> }
> +#else
> + /* By default, all tasks get the system default clamp value */
> + if (group_id == UCLAMP_NOT_VALID)
> + return uclamp_default[clamp_id].group_id;
> #endif
>
> return group_id;
> @@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct *p,
> uclamp_group_put(clamp_id, prev_group_id);
> }
>
> +int sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + struct uclamp_se *uc_se;
> + int old_min, old_max;
> + int result;
> +
> + mutex_lock(&uclamp_mutex);
> +
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> + goto undo;
> + if (sysctl_sched_uclamp_util_max > 1024)
> + goto undo;
> +
> + /* Find a valid group_id for each required clamp value */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + result = uclamp_group_find(UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
> + CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + result = uclamp_group_find(UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
> + CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uc_se = &uclamp_default[UCLAMP_MIN];
> + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, sysctl_sched_uclamp_util_min);
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + uc_se = &uclamp_default[UCLAMP_MAX];
> + uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, sysctl_sched_uclamp_util_max);
> + }
uclamp_group_get() also drops the reference on the previous group id.
The initial group id for uclamp_default[], i.e. 0, is never claimed by
us, so we end up releasing it. But the root group still points to group#0.
Is this a problem?
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
On 08/13/2018 05:01 PM, Patrick Bellasi wrote:
> On 13-Aug 16:06, Vincent Guittot wrote:
>> On Mon, 13 Aug 2018 at 14:49, Patrick Bellasi <[email protected]> wrote:
>>> On 13-Aug 14:07, Vincent Guittot wrote:
>>>> On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
>
> [...]
>
>>> Yes I agree that the current behavior is not completely clean... still
>>> the question is: do you reckon the problem I depicted above, i.e. RT
>>> workloads eclipsing the min_util required by lower priority classes?
>>
>> As said above, I don't think that there is a problem that is specific
>> to cross class scheduling that can't also happen in the same class.
>>
>> Regarding your example:
>> task TA util=40% with uclamp_min 50%
>> task TB util=10% with uclamp_min 0%
>>
>> If TA and TB are cfs, util=50% and it doesn't seem to be a problem
>> whereas TB will steal some bandwidth to TA and delay it (and i even
>> don't speak about the impact of the nice priority of TB)
>> If TA is cfs and TB is rt, Why util=50% is now a problem for TA ?
>
> You right, in the current implementation, where we _do not_
> distinguish among scheduling classes it's not possible to get a
> reasonable implementation of a per sched class clamping.
>
>>> To a certain extend I see this problem similar to the rt/dl/irq pressure
>>> in defining cpu_capacity, isn't it?
>
> However, I still think that higher priority classes eclipsing the
> clamping of lower priority classes can still be a problem.
>
> In your example above, the main difference between TA and TB being on
> the same class or different classes is that in the second case TB
> is granted to always preempt TA. We can end up with a non boosted RT
> task consuming all the boosted bandwidth required by a CFS task.
>
> This does not happen, apart maybe for the corner case of really
> different nice values, if the tasks are both CFS, since the fair
> scheduler will grant some progress for both of them.
>
> Thus, given the current implementation, I think it makes sense to drop
> the UCLAMP_SCHED_CLASS policy and stick with a more clean and
> consistent design.
I agree with everything said in this thread so far.
So in case you skip UCLAMP_SCHED_CLASS [(B) combine the clamped
utilizations] in v4, you will only provide [(A) clamp the combined
utilization]?
I assume that we don't have to guard the util clamping for rt tasks
behind a disabled-by-default sched feature because all runnable rt tasks
will have util_min = SCHED_CAPACITY_SCALE by default?
> I'll then see if it makes sense to add a dedicated patch on top of the
> series to add a proper per-class clamp tracking.
I assume if you introduce this per-class clamping you will switch to use
the UCLAMP_SCHED_CLASS approach?
Hi Dietmar!
On 15-Aug 12:59, Dietmar Eggemann wrote:
> On 08/15/2018 12:54 PM, Patrick Bellasi wrote:
> >On 15-Aug 11:37, Dietmar Eggemann wrote:
> >>On 08/14/2018 06:49 PM, Patrick Bellasi wrote:
[...]
> >>If this is only for testing/debugging, I would suggest a simple one line
> >>BUG_ON()
> >
> >These are (eventually) considered as recoverable errors... thus,
> >AFAIK, using BUG_ON is overkilling and discouraged:
> > https://elixir.bootlin.com/linux/latest/source/include/asm-generic/bug.h#L42
>
> Not sure about that. If this refcounting is out of sync, that's indicating a
> serious issue here for me which should be fixed.
Well, refcounting seems quite ok to me: we always inc/dec under RQ
locking and it's a per-CPU variable.
The warning is there to report issues during further testing, as well as to
be safe with respect to possible future modifications of the code.
> >>You find CONFIG_SCHED_DEBUG=y in production kernels as well.
> >
> >AFAIK, that setting is discouraged for production kernels...
> >Moreover, it's still better to WARN sometimes on a production kernel
> >than crash the device, isn't it?
>
> IMHO, if this is something which should not happen at all, a BUG_ON() is the
> right thing to do here.
I don't agree on that. I agree it should not happen, but since it's a
recoverable error I think we should not panic.
There are really few BUG_ON()s in core.c and they are all for much more
serious issues than an (eventually) broken refcount.
IMHO, an (unlikely) inconsistent refcount for an "optional
optimization" of "frequency selection" is instead not such a critical
failure that it's worth a device crash.
> And you get the call stack to investigate why it hit.
We can always add a stack dump if we notice the warning.
But, since we do not agree on that point, I would say we had better
wait and see what the maintainers prefer.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 15-Aug 17:30, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>
> [...]
>
> >+#else /* CONFIG_UCLAMP_TASK */
> >+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
> >+{
> >+ return uclamp_none(clamp_id);
> >+}
>
> Looks like that uclamp_value() is not used outside CONFIG_UCLAMP_TASK areas.
Yes, you're right... I use it in some debug patches I'm tracking on
top of the series... but I can definitely drop it in v4.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 16-Aug 14:39, Pavan Kondeti wrote:
> On Mon, Aug 06, 2018 at 05:39:41PM +0100, Patrick Bellasi wrote:
[...]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 8f48e64fb8a6..3fac2d098084 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -589,6 +589,11 @@ struct uclamp_se {
> > unsigned int value;
> > /* Utilization clamp group for this constraint */
> > unsigned int group_id;
> > + /* Effective clamp for tasks in this group */
> > + struct {
> > + unsigned int value;
> > + unsigned int group_id;
> > + } effective;
> > };
>
> Are these needed when CONFIG_UCLAMP_TASK_GROUP is disabled?
Mmm... not entirely, at least not the value.
While working on v4 I've noticed that:
(1) task_struct::uclamp::effective::group_id
can be used for the back annotation we add in:
[PATCH v3 11/14] sched/core: uclamp: use TG's clamps to restrict Task's clamps
using the additional field:
(2) task_struct::uclamp_group_id
So, I'm updating that patch to re-use (1) instead of adding (2).
Regarding:
(3) task_struct::uclamp::effective::value
it can be used to track the task's effective clamp value once I add
the discretization support discussed with Juri in:
https://lore.kernel.org/lkml/20180809152313.lewfhufidhxb2qrk@darkstar/
So, I would say that in v4 I can try to see if and how we can guard
(some of) the effective values on !CONFIG_UCLAMP_TASK_GROUP
configurations...
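Just to give an idea, something like the following could work (a rough
sketch, still to be verified while reworking the series, assuming nothing
else ends up needing the effective data when !CONFIG_UCLAMP_TASK_GROUP):
---8<---
struct uclamp_se {
	/* Utilization constraint requested for a task (or task group) */
	unsigned int value;
	/* Utilization clamp group for this constraint */
	unsigned int group_id;
#ifdef CONFIG_UCLAMP_TASK_GROUP
	/*
	 * Effective clamp: the most restrictive value between the one
	 * requested above and the constraint of the parent task group.
	 */
	struct {
		unsigned int value;
		unsigned int group_id;
	} effective;
#endif
};
---8<---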
>
> > union rcu_special {
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2ba55a4afffb..f692df3787bd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1237,6 +1237,8 @@ static inline void init_uclamp_sched_group(void)
> > uc_se = &root_task_group.uclamp[clamp_id];
> > uc_se->value = uclamp_none(clamp_id);
> > uc_se->group_id = group_id;
> > + uc_se->effective.value = uclamp_none(clamp_id);
> > + uc_se->effective.group_id = group_id;
> >
> > /* Attach root TG's clamp group */
> > uc_map[group_id].se_count = 1;
> >
>
> <snip>
>
> > @@ -7622,11 +7687,19 @@ static struct cftype cpu_legacy_files[] = {
> > .read_u64 = cpu_util_min_read_u64,
> > .write_u64 = cpu_util_min_write_u64,
> > },
> > + {
> > + .name = "util.min.effective",
> > + .read_u64 = cpu_util_min_effective_read_u64,
> > + },
> > {
> > .name = "util.max",
> > .read_u64 = cpu_util_max_read_u64,
> > .write_u64 = cpu_util_max_write_u64,
> > },
> > + {
> > + .name = "util.max.effective",
> > + .read_u64 = cpu_util_max_effective_read_u64,
> > + },
> > #endif
> > { } /* Terminate */
> > };
>
> Is there any reason why these are not added for the default hierarchy?
Not really, good point!
I think I've just forgotten them... which makes me notice that I still
have to improve the coverage of my tests on the default hierarchy.
I don't expect a big difference, since the behaviors should be the
same, but...
Thanks for pointing this out... I will add the attributes in the
upcoming v4.
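Something along these lines, mirroring the legacy hierarchy entries
quoted above (just a sketch; the CFTYPE_NOT_ON_ROOT flag is an assumption
on my side):
---8<---
	{
		.name = "util.min.effective",
		.flags = CFTYPE_NOT_ON_ROOT,
		.read_u64 = cpu_util_min_effective_read_u64,
	},
	{
		.name = "util.max.effective",
		.flags = CFTYPE_NOT_ON_ROOT,
		.read_u64 = cpu_util_max_effective_read_u64,
	},
---8<---
to be added to the default hierarchy's cpu_files[] as well.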
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Thursday 16 Aug 2018 at 15:45:45 (+0200), Dietmar Eggemann wrote:
> On 08/16/2018 03:37 PM, Quentin Perret wrote:
> > > > IMHO, if this is something which should not happen at all, a BUG_ON() is the
> > > > right thing to do here.
> > >
> > > I don't agree on that. I agree it should not happen but since it's a
> > > recoverable error it think we should not panic.
> >
> > FWIW, if this is a recoverable error, I think Linus will agree with
> > Patrick on this one :-)
> >
> > https://lkml.org/lkml/2016/10/4/1
>
> Yeah, not really agreeing here that this is a recoverable error.
A non-recoverable scenario could be, for example, if you corrupt your
stack and there is absolutely _nothing_ you can do to keep the system up
and running, because it's just too broken. I don't feel like we're
talking about such an extreme case here ...
> Besides, we
> only consider under-run here, what about over-run?
>
> Currently this warning doesn't hit and if the code will be changed and it
> hits, I still find a BUG_ON more appealing here ...
>
> So this error scenario can happen over and over again and we always recover
> from ? The important thing is that we find the culprit for this behaviour as
> fast as possible ...
Agreed, we want to debug that ASAP, but WARN should let us do that just
fine, I think.
Quentin
On 16-Aug 14:43, Pavan Kondeti wrote:
> On Mon, Aug 06, 2018 at 05:39:44PM +0100, Patrick Bellasi wrote:
> > Clamp values cannot be tuned at the root cgroup level. Moreover, because
> > of the delegation model requirements and how the parent clamps
> > propagation works, if we want to enable subgroups to set a non null
> > util.min, we need to be able to configure the root group util.min to the
> > allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
> >
> > Unfortunately this setup will also mean that all tasks running in the
> > root group, will always get a maximum util.min clamp, unless they have a
> > lower task specific clamp which is definitively not a desirable default
> > configuration.
> >
> > Let's fix this by explicitly adding a system default configuration
> > (sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
> > for all tasks running on the root group.
> >
> > This interface is available independently from cgroups, thus providing a
> > complete solution for system wide utilization clamping configuration.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> > Cc: Paul Turner <[email protected]>
> > Cc: Suren Baghdasaryan <[email protected]>
> > Cc: Todd Kjos <[email protected]>
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Steve Muckle <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Morten Rasmussen <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
>
> <snip>
>
> > +/*
> > + * Minimum utilization for tasks in the root cgroup
> > + * default: 0
> > + */
> > +unsigned int sysctl_sched_uclamp_util_min;
> > +
> > +/*
> > + * Maximum utilization for tasks in the root cgroup
> > + * default: 1024
> > + */
> > +unsigned int sysctl_sched_uclamp_util_max = 1024;
> > +
> > +static struct uclamp_se uclamp_default[UCLAMP_CNT];
> > +
>
> The default group id for un-clamped root tasks is 0 because of
> this declaration, correct?
Yes, the clamp group 0 is (should be) initialized to track the system
default values:
util_min = 0 and util_max = SCHED_CAPACITY_SCALE
but...
> > /**
> > * uclamp_map: reference counts a utilization "clamp value"
> > * @value: the utilization "clamp value" required
> > @@ -957,12 +971,25 @@ static inline int uclamp_task_group_id(struct task_struct *p, int clamp_id)
> > group_id = uc_se->group_id;
> >
> > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > + /*
> > + * Tasks in the root group, which do not have a task specific clamp
> > + * value, get the system default calmp value.
> > + */
> > + if (group_id == UCLAMP_NOT_VALID &&
> > + task_group(p) == &root_task_group) {
> > + return uclamp_default[clamp_id].group_id;
> > + }
> > +
> > /* Use TG's clamp value to limit task specific values */
> > uc_se = &task_group(p)->uclamp[clamp_id];
> > if (group_id == UCLAMP_NOT_VALID ||
> > clamp_value > uc_se->effective.value) {
> > group_id = uc_se->effective.group_id;
> > }
> > +#else
> > + /* By default, all tasks get the system default clamp value */
> > + if (group_id == UCLAMP_NOT_VALID)
> > + return uclamp_default[clamp_id].group_id;
> > #endif
> >
> > return group_id;
> > @@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct *p,
> > uclamp_group_put(clamp_id, prev_group_id);
> > }
> >
> > +int sched_uclamp_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *lenp,
> > + loff_t *ppos)
> > +{
> > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > + struct uclamp_se *uc_se;
> > + int old_min, old_max;
> > + int result;
> > +
> > + mutex_lock(&uclamp_mutex);
> > +
> > + old_min = sysctl_sched_uclamp_util_min;
> > + old_max = sysctl_sched_uclamp_util_max;
> > +
> > + result = proc_dointvec(table, write, buffer, lenp, ppos);
> > + if (result)
> > + goto undo;
> > + if (!write)
> > + goto done;
> > +
> > + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> > + goto undo;
> > + if (sysctl_sched_uclamp_util_max > 1024)
> > + goto undo;
> > +
> > + /* Find a valid group_id for each required clamp value */
> > + if (old_min != sysctl_sched_uclamp_util_min) {
> > + result = uclamp_group_find(UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> > + if (result == -ENOSPC) {
> > + pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
> > + CONFIG_UCLAMP_GROUPS_COUNT);
> > + goto undo;
> > + }
> > + group_id[UCLAMP_MIN] = result;
> > + }
> > + if (old_max != sysctl_sched_uclamp_util_max) {
> > + result = uclamp_group_find(UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> > + if (result == -ENOSPC) {
> > + pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
> > + CONFIG_UCLAMP_GROUPS_COUNT);
> > + goto undo;
> > + }
> > + group_id[UCLAMP_MAX] = result;
> > + }
> > +
> > + /* Update each required clamp group */
> > + if (old_min != sysctl_sched_uclamp_util_min) {
> > + uc_se = &uclamp_default[UCLAMP_MIN];
> > + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> > + uc_se, sysctl_sched_uclamp_util_min);
> > + }
> > + if (old_max != sysctl_sched_uclamp_util_max) {
> > + uc_se = &uclamp_default[UCLAMP_MAX];
> > + uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
> > + uc_se, sysctl_sched_uclamp_util_max);
> > + }
>
... you are right hereafter: there are still some loose ends.
> uclamp_group_get() also drops the reference on the previous group id.
> The initial group id for uclamp_default[] i.e 0 is never claimed by
> us. so we end up releasing it. But root group still points to group#0.
> is this a problem?
Indeed, it's not correct... and it's similar to the issue you already
raised about the refcounting decrement on task exit.
I'm working on v4 to review and fix up all the loose ends and missing
initializations (like in this case) related to the clamp group
mapping refcounting.
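For example, something along these lines called at init time (just a
sketch, not yet tested; the helper name is made up, the other identifiers
come from the posted patches) should make the system defaults claim their
own reference on clamp group 0:
---8<---
static inline void init_uclamp_default(void)
{
	int clamp_id;

	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
		/* Track system defaults in clamp group 0, like the root TG */
		uclamp_default[clamp_id].value = uclamp_none(clamp_id);
		uclamp_default[clamp_id].group_id = 0;
		/* Claim the reference the sysctl handler will later drop */
		uclamp_maps[clamp_id][0].se_count += 1;
	}
}
---8<---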
Thanks for pointing this out!
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 08/16/2018 04:21 PM, Quentin Perret wrote:
> On Thursday 16 Aug 2018 at 15:45:45 (+0200), Dietmar Eggemann wrote:
>> On 08/16/2018 03:37 PM, Quentin Perret wrote:
>>>>> IMHO, if this is something which should not happen at all, a BUG_ON() is the
>>>>> right thing to do here.
>>>>
>>>> I don't agree on that. I agree it should not happen but since it's a
>>>> recoverable error it think we should not panic.
>>>
>>> FWIW, if this is a recoverable error, I think Linus will agree with
>>> Patrick on this one :-)
>>>
>>> https://lkml.org/lkml/2016/10/4/1
>>
>> Yeah, not really agreeing here that this is a recoverable error.
>
> A non-recoverable scenario could be, for example, if you corrupt your
> stack and there is absolutely _nothing_ you can do to keep the system up
> and running, because it's just too broken. I don't feel like we're
> talking about such an extreme case here ...
Yeah, that's the extreme. But what about this lovely BUG_ON(busiest ==
env.dst_rq) in fair.c's load_balance()?
We could recover by just bailing out ;-)
I guess we know by now that there are different opinions here.
>
>> Besides, we
>> only consider under-run here, what about over-run?
The important thing is to also detect the over-run, i.e. adding the first
task when the task counter is already > 0.
>>
>> Currently this warning doesn't hit and if the code will be changed and it
>> hits, I still find a BUG_ON more appealing here ...
>>
>> So this error scenario can happen over and over again and we always recover
>> from ? The important thing is that we find the culprit for this behaviour as
>> fast as possible ...
>
> Agreed, we want to debug that ASAP, but WARN should let us do that just
> fine, I think.
+1.
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> When a util_max clamped task sleeps, its clamp constraints are removed
> from the CPU. However, the blocked utilization on that CPU can still be
> higher than the max clamp value enforced while that task was running.
> This max clamp removal when a CPU is going to be idle could thus allow
> unwanted CPU frequency increases, right while the task is not running.
So 'rq->uclamp.flags == UCLAMP_FLAG_IDLE' means the CPU is IDLE because
non-clamped tasks are tracked as well (group_id = 0).
Maybe this is worth mentioning here?
> This can happen, for example, where there is another (smaller) task
> running on a different CPU of the same frequency domain.
> In this case, when we aggregate the utilization of all the CPUs in a
> shared frequency domain, schedutil can still see the full non clamped
> blocked utilization of all the CPUs and thus eventually increase the
> frequency.
>
> Let's fix this by using:
>
> uclamp_cpu_put_id(UCLAMP_MAX)
> uclamp_cpu_update(last_clamp_value)
>
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition. Thus, while a CPU is idle, we can still enforce the last used
> clamp value for it.
>
> To the contrary, we do not track any UCLAMP_MIN since, while a CPU is
> idle, we don't want to enforce any minimum frequency
> Indeed, we rely just on blocked load decay to smoothly reduce the
> frequency.
[...]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bc2beedec7bf..ff76b000bbe8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -906,7 +906,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
> * For the specified clamp index, this method computes the new CPU utilization
> * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
> */
> -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> + unsigned int last_clamp_value)
> {
> struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
> int max_value = UCLAMP_NOT_VALID;
> @@ -924,6 +925,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
The condition:
if (!uclamp_group_active(uc_grp, group_id))
continue;
in 'for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT;
++group_id) {}' makes sure that 'max_value == UCLAMP_NOT_VALID' is true
for the if condition (*):
> if (max_value >= SCHED_CAPACITY_SCALE)
> break;
> }
> +
> + /*
> + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> + * task, we keep the CPU clamped to the last task's clamp value.
> + * This avoids frequency spikes to MAX when one CPU, with an high
> + * blocked utilization, sleeps and another CPU, in the same frequency
> + * domain, do not see anymore the clamp on the first CPU.
> + */
> + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
> + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> + max_value = last_clamp_value;
> + }
> +
(*): So the uc_grp[group_id].value stays last_clamp_value?
What do you do when the blocked utilization decays below this enforced
last_clamp_value on that CPU?
I assume there are plenty of corner cases of this kind because we have
blocked signals (including all tasks) and clamping (including runnable
tasks).
On 16-Aug 17:43, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> >When a util_max clamped task sleeps, its clamp constraints are removed
> >from the CPU. However, the blocked utilization on that CPU can still be
> >higher than the max clamp value enforced while that task was running.
> >This max clamp removal when a CPU is going to be idle could thus allow
> >unwanted CPU frequency increases, right while the task is not running.
>
> So 'rq->uclamp.flags == UCLAMP_FLAG_IDLE' means CPU is IDLE because
> non-clamped tasks are tracked as well ((group_id = 0)).
Right, but... with (group_id = 0) you mean that "non-clamped tasks are
tracked" in the first clamp group?
> Maybe this is worth mentioning here?
Maybe I can explicitly say that we detect that there are no RUNNABLE
tasks because all the clamp groups are in the UCLAMP_NOT_VALID status.
> >This can happen, for example, where there is another (smaller) task
> >running on a different CPU of the same frequency domain.
> >In this case, when we aggregate the utilization of all the CPUs in a
> >shared frequency domain, schedutil can still see the full non clamped
> >blocked utilization of all the CPUs and thus eventually increase the
> >frequency.
> >
> >Let's fix this by using:
> >
> > uclamp_cpu_put_id(UCLAMP_MAX)
> > uclamp_cpu_update(last_clamp_value)
> >
> >to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> >condition. Thus, while a CPU is idle, we can still enforce the last used
> >clamp value for it.
> >
> >To the contrary, we do not track any UCLAMP_MIN since, while a CPU is
> >idle, we don't want to enforce any minimum frequency
> >Indeed, we rely just on blocked load decay to smoothly reduce the
> >frequency.
>
> [...]
>
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index bc2beedec7bf..ff76b000bbe8 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -906,7 +906,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
> > * For the specified clamp index, this method computes the new CPU utilization
> > * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
> > */
> >-static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> >+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> >+ unsigned int last_clamp_value)
> > {
> > struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
> > int max_value = UCLAMP_NOT_VALID;
> >@@ -924,6 +925,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
>
> The condition:
>
> if (!uclamp_group_active(uc_grp, group_id))
> continue;
>
> in 'for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id)
> {}' makes sure that 'max_value == UCLAMP_NOT_VALID' is true for the if
> condition (*):
>
>
> > if (max_value >= SCHED_CAPACITY_SCALE)
> > break;
> > }
> >+
> >+ /*
> >+ * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> >+ * task, we keep the CPU clamped to the last task's clamp value.
> >+ * This avoids frequency spikes to MAX when one CPU, with an high
> >+ * blocked utilization, sleeps and another CPU, in the same frequency
> >+ * domain, do not see anymore the clamp on the first CPU.
> >+ */
> >+ if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
> >+ rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> >+ max_value = last_clamp_value;
> >+ }
> >+
>
> (*): So the uc_grp[group_id].value stays last_clamp_value?
A bit confusing... but I think you've got the point.
> What do you do when the blocked utilization decays below this enforced
> last_clamp_value on that CPU?
This is done _just_ for max_util:
- it clamps a blocked utilization bigger than last_clamp_value,
  thus avoiding the selection of an OPP bigger than the one enforced
  while the task was runnable
- it has no effect on a blocked utilization smaller than last_clamp_value,
  thus allowing the OPP to be reduced gracefully as the blocked
  utilization decays (see also the small sketch below)
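In other words, a minimal sketch of the intended effect (not the actual
schedutil code, just the idea):
---8<---
	/*
	 * The CPU's util_max clamp, held at last_clamp_value while the CPU
	 * is idle, caps a bigger blocked utilization but never raises a
	 * smaller, decaying one.
	 */
	util = min(util, rq->uclamp.value[UCLAMP_MAX]);
---8<---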
> I assume there are plenty of this kind of corner cases because we have
> blocked signals (including all tasks) and clamping (including runnable
> tasks).
This is a pretty compelling one I've noticed in my tests and thus
worth a fix... I don't have other similar corner cases at hand, do
you?
--
#include <best/regards.h>
Patrick Bellasi
On 06-Aug 17:39, Patrick Bellasi wrote:
> When a util_max clamped task sleeps, its clamp constraints are removed
> from the CPU. However, the blocked utilization on that CPU can still be
> higher than the max clamp value enforced while that task was running.
> This max clamp removal when a CPU is going to be idle could thus allow
> unwanted CPU frequency increases, right while the task is not running.
>
> This can happen, for example, where there is another (smaller) task
> running on a different CPU of the same frequency domain.
> In this case, when we aggregate the utilization of all the CPUs in a
> shared frequency domain, schedutil can still see the full non clamped
> blocked utilization of all the CPUs and thus eventually increase the
> frequency.
>
> Let's fix this by using:
>
> uclamp_cpu_put_id(UCLAMP_MAX)
> uclamp_cpu_update(last_clamp_value)
>
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition. Thus, while a CPU is idle, we can still enforce the last used
> clamp value for it.
>
> To the contrary, we do not track any UCLAMP_MIN since, while a CPU is
> idle, we don't want to enforce any minimum frequency
> Indeed, we rely just on blocked load decay to smoothly reduce the
> frequency.
[...]
> @@ -906,7 +906,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
> * For the specified clamp index, this method computes the new CPU utilization
> * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
> */
> -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> + unsigned int last_clamp_value)
> {
> struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
> int max_value = UCLAMP_NOT_VALID;
> @@ -924,6 +925,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> if (max_value >= SCHED_CAPACITY_SCALE)
> break;
> }
> +
> + /*
> + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> + * task, we keep the CPU clamped to the last task's clamp value.
> + * This avoids frequency spikes to MAX when one CPU, with an high
> + * blocked utilization, sleeps and another CPU, in the same frequency
> + * domain, do not see anymore the clamp on the first CPU.
> + */
> + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
> + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> + max_value = last_clamp_value;
> + }
> +
> rq->uclamp.value[clamp_id] = max_value;
> }
>
> @@ -953,13 +967,26 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
> uc_grp = &rq->uclamp.group[clamp_id][0];
> uc_grp[group_id].tasks += 1;
>
> + /* Force clamp update on idle exit */
> + uc_cpu = &rq->uclamp;
> + clamp_value = p->uclamp[clamp_id].value;
> + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
> + /*
> + * This function is called for both UCLAMP_MIN (before) and
> + * UCLAMP_MAX (after). Let's reset the flag only the second
> + * once we know that UCLAMP_MIN has been already updated.
> + */
> + if (clamp_id == UCLAMP_MAX)
> + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
> + uc_cpu->value[clamp_id] = clamp_value;
> + return;
> + }
Just noticed that the code block above is not reached when we enqueue a task
without a valid clamp group, i.e. an un-clamped task, which is the
default for all tasks.
The fix should be as simple as moving this block to the beginning of
uclamp_cpu_update() so that we always unconditionally release the
clamp holding as soon as we enqueue the first task after a CPU wakeup,
i.e. when the UCLAMP_FLAG_IDLE flag is set.
Will fix this in v4.
--
#include <best/regards.h>
Patrick Bellasi
On 16-Aug 19:10, Dietmar Eggemann wrote:
> On 08/16/2018 06:47 PM, Patrick Bellasi wrote:
> >On 16-Aug 17:43, Dietmar Eggemann wrote:
> >>On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> >>>When a util_max clamped task sleeps, its clamp constraints are removed
> >>>from the CPU. However, the blocked utilization on that CPU can still be
> >>>higher than the max clamp value enforced while that task was running.
> >>>This max clamp removal when a CPU is going to be idle could thus allow
> >>>unwanted CPU frequency increases, right while the task is not running.
> >>
> >>So 'rq->uclamp.flags == UCLAMP_FLAG_IDLE' means CPU is IDLE because
> >>non-clamped tasks are tracked as well ((group_id = 0)).
> >
> >Right, but... with (group_id = 0) you mean that "non-clamped tasks are
> >tracked" in the first clamp group?
>
> Yes. I was asking myself what will happen if there are only non-clamped
> tasks runnable ...
"Non-clamped tasks" is kind of ambiguous, since you can have:
a) tasks with util_max = UCLAMP_NOT_VALID (the default for all tasks)
b) tasks with util_max = SCHED_CAPACITY_SCALE as a task specific
clamp value
They are both technically not clamped, but for case b there should not
be an issue, since we will track SCHED_CAPACITY_SCALE as the idle hold value.
Case a instead is a bit different, especially when such tasks mix with
tasks with a valid task specific clamp value, as I've just commented
in this posting:
Message-ID: <20180816172016.GG2960@e110439-lin>
>
> >
> >>Maybe this is worth mentioning here?
> >
> >Maybe I can explicitely say that we detect that there are not RUNNABLE
> >tasks because all the clamp groups are in UCLAMP_NOT_VALID status.
>
> Yes, would have helped me the grasp this earlier ...
Right, I'm going to add a bit of text on that.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 08/16/2018 06:47 PM, Patrick Bellasi wrote:
> On 16-Aug 17:43, Dietmar Eggemann wrote:
>> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>>> When a util_max clamped task sleeps, its clamp constraints are removed
>> >from the CPU. However, the blocked utilization on that CPU can still be
>>> higher than the max clamp value enforced while that task was running.
>>> This max clamp removal when a CPU is going to be idle could thus allow
>>> unwanted CPU frequency increases, right while the task is not running.
>>
>> So 'rq->uclamp.flags == UCLAMP_FLAG_IDLE' means CPU is IDLE because
>> non-clamped tasks are tracked as well ((group_id = 0)).
>
> Right, but... with (group_id = 0) you mean that "non-clamped tasks are
> tracked" in the first clamp group?
Yes. I was asking myself what will happen if there are only non-clamped
tasks runnable ...
>
>> Maybe this is worth mentioning here?
>
> Maybe I can explicitely say that we detect that there are not RUNNABLE
> tasks because all the clamp groups are in UCLAMP_NOT_VALID status.
Yes, that would have helped me grasp this earlier ...
[...]
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index bc2beedec7bf..ff76b000bbe8 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -906,7 +906,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
>>> * For the specified clamp index, this method computes the new CPU utilization
>>> * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
>>> */
>>> -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
>>> +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
>>> + unsigned int last_clamp_value)
>>> {
>>> struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
>>> int max_value = UCLAMP_NOT_VALID;
>>> @@ -924,6 +925,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
>>
>> The condition:
>>
>> if (!uclamp_group_active(uc_grp, group_id))
>> continue;
>>
>> in 'for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id)
>> {}' makes sure that 'max_value == UCLAMP_NOT_VALID' is true for the if
>> condition (*):
>>
>>
>>> if (max_value >= SCHED_CAPACITY_SCALE)
>>> break;
>>> }
>>> +
>>> + /*
>>> + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
>>> + * task, we keep the CPU clamped to the last task's clamp value.
>>> + * This avoids frequency spikes to MAX when one CPU, with an high
>>> + * blocked utilization, sleeps and another CPU, in the same frequency
>>> + * domain, do not see anymore the clamp on the first CPU.
>>> + */
>>> + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NOT_VALID) {
>>> + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
>>> + max_value = last_clamp_value;
>>> + }
>>> +
>>
>> (*): So the uc_grp[group_id].value stays last_clamp_value?
>
> A bit confusing... but I think you've got the point.
OK.
>
>> What do you do when the blocked utilization decays below this enforced
>> last_clamp_value on that CPU?
>
> This is done _just_ for max_util:
> - it clamps a blocked utilization bigger then last_clamp_value
> thus avoiding the selection of an OPP bigger then the one enforced
> while the task was runnable
> - it has not effect on a blocked utilization smaller then last_clamp_value
> thus allowing to reduce gracefully the OPP as long as the blocked
> utilization is decayed
Ah correct, max_util is about capping, not boosting.
>
>> I assume there are plenty of this kind of corner cases because we have
>> blocked signals (including all tasks) and clamping (including runnable
>> tasks).
>
> This is a pretty compelling one I've noticed in my tests and thus
> worth a fix... I don't have on hand other similar corner cases, do
> you?
No, not right now; I will continue to watch out for them ...
On 15-Aug 17:02, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>
> [...]
>
> >+/**
> >+ * uclamp_task_active: check if a task is currently clamping a CPU
> >+ * @p: the task to check
> >+ *
> >+ * A task actively affects the utilization clamp of a CPU if:
> >+ * - it's currently enqueued or running on that CPU
> >+ * - it's refcounted in at least one clamp group of that CPU
> >+ *
> >+ * Return: true if p is currently clamping the utilization of its CPU.
> >+ */
> >+static inline bool uclamp_task_active(struct task_struct *p)
> >+{
> >+ struct rq *rq = task_rq(p);
> >+ int clamp_id;
> >+
> >+ lockdep_assert_held(&p->pi_lock);
> >+ lockdep_assert_held(&rq->lock);
> >+
> >+ if (!task_on_rq_queued(p) && !p->on_cpu)
> >+ return false;
> >+
> >+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> >+ if (uclamp_task_affects(p, clamp_id))
> >+ return true;
> >+ }
> >+
> >+ return false;
> >+}
>
> Looks like that uclamp_task_active() is only used once (in
> uclamp_task_update_active()). Can you not code the if condition and the for
> loop directly in uclamp_task_update_active()? This would save code
> (lockdep_assert_held() etc.) and comment lines.
Yes, that's possible... and we will have:
---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ae528a7b9bef..0c8bec892018 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1331,7 +1331,9 @@ uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
* index, then that task is not yet RUNNABLE and it's going to be
* enqueued with the proper clamp group value.
*/
- if (!uclamp_task_active(p))
+ if (!task_on_rq_queued(p) && !p->on_cpu)
+ goto done;
+ if (!uclamp_task_affects(p, clamp_id))
goto done;
/* Release p's currently referenced clamp group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 11d139faed1f..65fdf4abc6ff 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2231,35 +2231,6 @@ static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
{
return (p->uclamp[clamp_id].effective.group_id != UCLAMP_NOT_VALID);
}
-
-/**
- * uclamp_task_active: check if a task is currently clamping a CPU
- * @p: the task to check
- *
- * A task actively affects the utilization clamp of a CPU if:
- * - it's currently enqueued or running on that CPU
- * - it's refcounted in at least one clamp group of that CPU
- *
- * Return: true if p is currently clamping the utilization of its CPU.
- */
-static inline bool uclamp_task_active(struct task_struct *p)
-{
- struct rq *rq = task_rq(p);
- int clamp_id;
-
- lockdep_assert_held(&p->pi_lock);
- lockdep_assert_held(&rq->lock);
-
- if (!task_on_rq_queued(p) && !p->on_cpu)
- return false;
-
- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- if (uclamp_task_affects(p, clamp_id))
- return true;
- }
-
- return false;
-}
#endif /* CONFIG_UCLAMP_TASK */
#ifdef CONFIG_CPU_FREQ
---8<---
I'll add this to the next posting!
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
> > IMHO, if this is something which should not happen at all, a BUG_ON() is the
> > right thing to do here.
>
> I don't agree on that. I agree it should not happen but since it's a
> recoverable error it think we should not panic.
FWIW, if this is a recoverable error, I think Linus will agree with
Patrick on this one :-)
https://lkml.org/lkml/2016/10/4/1
On 08/16/2018 03:37 PM, Quentin Perret wrote:
>>> IMHO, if this is something which should not happen at all, a BUG_ON() is the
>>> right thing to do here.
>>
>> I don't agree on that. I agree it should not happen but since it's a
>> recoverable error it think we should not panic.
>
> FWIW, if this is a recoverable error, I think Linus will agree with
> Patrick on this one :-)
>
> https://lkml.org/lkml/2016/10/4/1
Yeah, not really agreeing here that this is a recoverable error.
Besides, we only consider under-run here, what about over-run?
Currently this warning doesn't hit, and if the code gets changed and
it hits, I still find a BUG_ON more appealing here ...
So this error scenario can happen over and over again and we always
recover from it? The important thing is that we find the culprit for this
behaviour as fast as possible ...
Warning or bug, at least a stack trace is necessary.
On 16-Aug 12:34, Dietmar Eggemann wrote:
> On 08/13/2018 05:01 PM, Patrick Bellasi wrote:
> >On 13-Aug 16:06, Vincent Guittot wrote:
> >>On Mon, 13 Aug 2018 at 14:49, Patrick Bellasi <[email protected]> wrote:
> >>>On 13-Aug 14:07, Vincent Guittot wrote:
> >>>>On Mon, 13 Aug 2018 at 12:12, Patrick Bellasi <[email protected]> wrote:
> >
> >[...]
> >
> >>>Yes I agree that the current behavior is not completely clean... still
> >>>the question is: do you reckon the problem I depicted above, i.e. RT
> >>>workloads eclipsing the min_util required by lower priority classes?
> >>
> >>As said above, I don't think that there is a problem that is specific
> >>to cross class scheduling that can't also happen in the same class.
> >>
> >>Regarding your example:
> >>task TA util=40% with uclamp_min 50%
> >>task TB util=10% with uclamp_min 0%
> >>
> >>If TA and TB are cfs, util=50% and it doesn't seem to be a problem
> >>whereas TB will steal some bandwidth to TA and delay it (and i even
> >>don't speak about the impact of the nice priority of TB)
> >>If TA is cfs and TB is rt, Why util=50% is now a problem for TA ?
> >
> >You right, in the current implementation, where we _do not_
> >distinguish among scheduling classes it's not possible to get a
> >reasonable implementation of a per sched class clamping.
> >
> >>>To a certain extend I see this problem similar to the rt/dl/irq pressure
> >>>in defining cpu_capacity, isn't it?
> >
> >However, I still think that higher priority classes eclipsing the
> >clamping of lower priority classes can still be a problem.
> >
> >In your example above, the main difference between TA and TB being on
> >the same class or different classes is that in the second case TB
> >is granted to always preempt TA. We can end up with a non boosted RT
> >task consuming all the boosted bandwidth required by a CFS task.
> >
> >This does not happen, apart maybe for the corner case of really
> >different nice values, if the tasks are both CFS, since the fair
> >scheduler will grant some progress for both of them.
> >
> >Thus, given the current implementation, I think it makes sense to drop
> >the UCLAMP_SCHED_CLASS policy and stick with a more clean and
> >consistent design.
>
> I agree with everything said in this thread so far.
Cool!
> So in case you skip UCLAMP_SCHED_CLASS [(B) combine the clamped
> utilizations] in v4, you will only provide [A) clamp the combined
> utilization]?
Right... unless I find time to add support for per-scheduling-class
tracking of clamp values. It should be relatively simple... but I
guess it's also something we can keep at a really low prio and propose
once the main bits are not controversial anymore.
> I assume that we don't have to guard the util clamping for rt tasks behind a
> disabled by default sched feature because all runnable rt tasks will have
> util_min = SCHED_CAPACITY_SCALE by default?
So, that's what Quentin also proposed in a previous discussion:
https://lore.kernel.org/lkml/20180809155551.bp46sixk4u3ilcnh@queper01-lin/
but yes, you're right: it's on my todo list to ensure that by default
RT tasks get a task-specific util_min set to SCHED_CAPACITY_SCALE.
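Just to sketch the idea (untested, the helper name is made up, locking via
uclamp_mutex is left out, the other identifiers come from the posted
patches), at policy change time we could do something like:
---8<---
static void uclamp_rt_default(struct task_struct *p)
{
	int group_id;

	/* Only RT tasks without a task-specific util_min request */
	if (!rt_task(p) || p->uclamp[UCLAMP_MIN].group_id != UCLAMP_NOT_VALID)
		return;

	/* Map SCHED_CAPACITY_SCALE to a clamp group and refcount it */
	group_id = uclamp_group_find(UCLAMP_MIN, SCHED_CAPACITY_SCALE);
	if (group_id < 0)
		return;

	uclamp_group_get(p, UCLAMP_MIN, group_id,
			 &p->uclamp[UCLAMP_MIN], SCHED_CAPACITY_SCALE);
}
---8<---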
> >I'll then see if it makes sense to add a dedicated patch on top of the
> >series to add a proper per-class clamp tracking.
>
> I assume if you introduce this per-class clamping you will switch to use the
> UCLAMP_SCHED_CLASS approach?
Likely... but at that point we probably don't need the sched feature
anymore and it could be just the default and unique aggregation
policy.
But let's see when we have the patches... and we don't necessarily
need them for v4.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Thursday 09 Aug 2018 at 16:23:13 (+0100), Patrick Bellasi wrote:
> On 09-Aug 11:50, Juri Lelli wrote:
> > On 09/08/18 10:14, Patrick Bellasi wrote:
> > > On 07-Aug 14:35, Juri Lelli wrote:
> > > > On 06/08/18 17:39, Patrick Bellasi wrote:
>
> [...]
>
> > > 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> > > time parameter to relax this check
> >
> > It seems to me that this might work well with that the intended usage of
> > the interface that you depict above. SMS only (or any privileged user)
> > will be in control of how groups are configured, so no problem for
> > normal users.
>
> Yes, well... apart normal users still getting a -ENOSPC is they are
> requesting one of the not pre-configured clamp values. Which is why
> the following bits can be helpful.
So IIUC, normal users would still be free to choose their clamp values
as long as they pick one from the list of pre-allocated ones? Is that
correct?
If yes, that would still let normal users make their tasks look bigger, no?
They could just choose the clamp group with the highest min_clamp or
something. Isn't this a problem too? I mean, if that can be abused easily,
I'm pretty sure people _will_ abuse it ...
Or maybe I misunderstood something?
Thanks,
Quentin
On 17-Aug 11:34, Quentin Perret wrote:
> Hi Patrick,
>
> On Thursday 09 Aug 2018 at 16:23:13 (+0100), Patrick Bellasi wrote:
> > On 09-Aug 11:50, Juri Lelli wrote:
> > > On 09/08/18 10:14, Patrick Bellasi wrote:
> > > > On 07-Aug 14:35, Juri Lelli wrote:
> > > > > On 06/08/18 17:39, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > > 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> > > > time parameter to relax this check
> > >
> > > It seems to me that this might work well with that the intended usage of
> > > the interface that you depict above. SMS only (or any privileged user)
> > > will be in control of how groups are configured, so no problem for
> > > normal users.
> >
> > Yes, well... apart normal users still getting a -ENOSPC is they are
> > requesting one of the not pre-configured clamp values. Which is why
> > the following bits can be helpful.
>
> So IIUC, normal users would still be free of choosing their clamp values
> as long as they choose one in the list of pre-allocated ones ? Is that
> correct ?
No, with the CAP_SYS_NICE/ADMIN guard in place, as discussed above in
point 1, the syscall will just fail for normal users.
Only privileged tasks (i.e. SMS control threads) can change clamp values.
> If yes, that would still let normal users make they tasks look bigger no ?
> They could just choose the clamp group with the highest min_clamp or
> something. Isn't this a problem too ? I mean, if that can be abused easily,
> I'm pretty sure people _will_ abuse it ...
It should not be possible with 1) in place.
However, if the system is booted with that check disabled (e.g. via a
kernel boot parameter), that probably means you trust/control your
userspace and don't want to impose restrictions on non-privileged
tasks. In that case, "abuses" are just "acceptable usages"...
--
#include <best/regards.h>
Patrick Bellasi
On 06-Aug 17:39, Patrick Bellasi wrote:
[...]
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
> + *
> + * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
> + * the task's uclamp.group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + int clamp_value;
> + int group_id;
> +
> + /* No task specific clamp values: nothing to do */
> + group_id = p->uclamp[clamp_id].group_id;
> + if (group_id == UCLAMP_NOT_VALID)
> + return;
This is broken for util_max aggregation.
By not refcounting tasks without a task specific clamp value, we end
up enforcing a max_util on these tasks when they are co-scheduled with
another max clamped task.
I need to fix this by removing this "optimization" (which works just for
util_min) and refcounting all the tasks.
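Something like this in uclamp_cpu_get_id() (just a sketch, assuming the
system default clamp groups introduced by the system defaults patch are
always valid):
---8<---
	/*
	 * Always refcount the task: fall back to the system default clamp
	 * group when there is no task-specific value, so that util_max
	 * aggregation sees un-clamped tasks too.
	 */
	group_id = p->uclamp[clamp_id].group_id;
	if (group_id == UCLAMP_NOT_VALID)
		group_id = uclamp_default[clamp_id].group_id;
---8<---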
> +
> + /* Reference count the task into its current group_id */
> + uc_grp = &rq->uclamp.group[clamp_id][0];
> + uc_grp[group_id].tasks += 1;
> +
> + /*
> + * If this is the new max utilization clamp value, then we can update
> + * straight away the CPU clamp value. Otherwise, the current CPU clamp
> + * value is still valid and we are done.
> + */
> + uc_cpu = &rq->uclamp;
> + clamp_value = p->uclamp[clamp_id].value;
> + if (uc_cpu->value[clamp_id] < clamp_value)
> + uc_cpu->value[clamp_id] = clamp_value;
> +}
> +
--
#include <best/regards.h>
Patrick Bellasi
On Friday 17 Aug 2018 at 11:57:31 (+0100), Patrick Bellasi wrote:
> On 17-Aug 11:34, Quentin Perret wrote:
> > Hi Patrick,
> >
> > On Thursday 09 Aug 2018 at 16:23:13 (+0100), Patrick Bellasi wrote:
> > > On 09-Aug 11:50, Juri Lelli wrote:
> > > > On 09/08/18 10:14, Patrick Bellasi wrote:
> > > > > On 07-Aug 14:35, Juri Lelli wrote:
> > > > > > On 06/08/18 17:39, Patrick Bellasi wrote:
> > >
> > > [...]
> > >
> > > > > 1) make CAP_SYS_NICE protected the clamp groups, with an optional boot
> > > > > time parameter to relax this check
> > > >
> > > > It seems to me that this might work well with that the intended usage of
> > > > the interface that you depict above. SMS only (or any privileged user)
> > > > will be in control of how groups are configured, so no problem for
> > > > normal users.
> > >
> > > Yes, well... apart normal users still getting a -ENOSPC is they are
> > > requesting one of the not pre-configured clamp values. Which is why
> > > the following bits can be helpful.
> >
> > So IIUC, normal users would still be free of choosing their clamp values
> > as long as they choose one in the list of pre-allocated ones ? Is that
> > correct ?
>
> No, with the CAP_SYS_NICE/ADMIN guard in place, as discussed above in
> point 1, the syscall will just fail for normal users.
Right, I just misunderstood then :-)
Sorry for the noise ...
Thanks,
Quentin
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
> Give the above mechanisms, it is now possible to extend the cpu
> controller to specify what is the minimum (or maximum) utilization which
> a task is expected (or allowed) to generate.
> Constraints on minimum and maximum utilization allowed for tasks in a
> CPU cgroup can improve the control on the actual amount of CPU bandwidth
> consumed by tasks.
I assume this is covered by the s/bandwidth/capacity/ you had w/ Juri on
01/14.
[...]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ff76b000bbe8..2ba55a4afffb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1211,6 +1211,74 @@ static inline void uclamp_group_get(struct task_struct *p,
> uclamp_group_put(clamp_id, prev_group_id);
> }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +/**
> + * init_uclamp_sched_group: initialize data structures required for TG's
> + * utilization clamping
> + */
> +static inline void init_uclamp_sched_group(void)
> +{
> + struct uclamp_map *uc_map;
> + struct uclamp_se *uc_se;
> + int group_id;
> + int clamp_id;
> +
> + /* Root TG's is statically assigned to the first clamp group */
> + group_id = 0;
> +
> + /* Initialize root TG's to default (none) clamp values */
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_map = &uclamp_maps[clamp_id][0];
> +
> + /* Map root TG's clamp value */
> + uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
> +
> + /* Init root TG's clamp group */
> + uc_se = &root_task_group.uclamp[clamp_id];
> + uc_se->value = uclamp_none(clamp_id);
> + uc_se->group_id = group_id;
> +
> + /* Attach root TG's clamp group */
> + uc_map[group_id].se_count = 1;
> + }
> +}
Can you not save some lines here:
static inline void init_uclamp_sched_group(void)
{
- struct uclamp_map *uc_map;
struct uclamp_se *uc_se;
int group_id;
int clamp_id;
@@ -1228,8 +1227,6 @@ static inline void init_uclamp_sched_group(void)
/* Initialize root TG's to default (none) clamp values */
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- uc_map = &uclamp_maps[clamp_id][0];
-
/* Map root TG's clamp value */
uclamp_group_init(clamp_id, group_id,
uclamp_none(clamp_id));
@@ -1239,7 +1236,7 @@ static inline void init_uclamp_sched_group(void)
uc_se->group_id = group_id;
/* Attach root TG's clamp group */
- uc_map[group_id].se_count = 1;
+ uclamp_maps[clamp_id][group_id].se_count = 1;
}
This would also make uclamp_group_available() smaller.
[...]
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> In order to properly support hierarchical resources control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires to properly propagate and aggregate parent
> attributes down to its descendants.
I don't understand the reason mentioned here:
IMHO, a write to a child's (tg1/tg11) cpu.rt_runtime_us can fail if the
value is restricted by the parent's value:
root@juno:/sys/fs/cgroup/cpu# cat cpu.rt_*
1000000
950000
root@juno:/sys/fs/cgroup/cpu# cat tg1/cpu.rt_*
1000000
0
root@juno:/sys/fs/cgroup/cpu# cat tg1/tg11/cpu.rt_*
1000000
0
root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
-bash: echo: write error: Invalid argument
root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/cpu.rt_runtime_us
root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
root@juno:/sys/fs/cgroup/cpu#
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This represent also the clamp value which is actually
> used to clamp tasks in each task group.
>
> Since it can be interesting for tasks in a cgroup to know exactly what
> is the currently propagated/enforced configuration, the effective clamp
> values are exposed to user-space by means of a new pair of read-only
> attributes: cpu.util.{min,max}.effective.
I assume here that the cpu.util.{min,max} of the child will not be used
any more because the 'effective' counterparts are taken instead.
I wonder if this propagation could not have been provided with only
cpu.util.{min,max}?
[...]
On 17-Aug 14:21, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>
> [...]
>
> >Give the above mechanisms, it is now possible to extend the cpu
> >controller to specify what is the minimum (or maximum) utilization which
> >a task is expected (or allowed) to generate.
> >Constraints on minimum and maximum utilization allowed for tasks in a
> >CPU cgroup can improve the control on the actual amount of CPU bandwidth
> >consumed by tasks.
>
> I assume this is covered by the s/bandwidth/capacity/ you had w/ Juri on
> 01/14.
Yes... I need to update this changelog too.
> [...]
>
> >diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >index ff76b000bbe8..2ba55a4afffb 100644
> >--- a/kernel/sched/core.c
> >+++ b/kernel/sched/core.c
> >@@ -1211,6 +1211,74 @@ static inline void uclamp_group_get(struct task_struct *p,
> > uclamp_group_put(clamp_id, prev_group_id);
> > }
> >+#ifdef CONFIG_UCLAMP_TASK_GROUP
> >+/**
> >+ * init_uclamp_sched_group: initialize data structures required for TG's
> >+ * utilization clamping
> >+ */
> >+static inline void init_uclamp_sched_group(void)
> >+{
> >+ struct uclamp_map *uc_map;
> >+ struct uclamp_se *uc_se;
> >+ int group_id;
> >+ int clamp_id;
> >+
> >+ /* Root TG's is statically assigned to the first clamp group */
> >+ group_id = 0;
> >+
> >+ /* Initialize root TG's to default (none) clamp values */
> >+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> >+ uc_map = &uclamp_maps[clamp_id][0];
> >+
> >+ /* Map root TG's clamp value */
> >+ uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
> >+
> >+ /* Init root TG's clamp group */
> >+ uc_se = &root_task_group.uclamp[clamp_id];
> >+ uc_se->value = uclamp_none(clamp_id);
> >+ uc_se->group_id = group_id;
> >+
> >+ /* Attach root TG's clamp group */
> >+ uc_map[group_id].se_count = 1;
> >+ }
> >+}
>
> Can you not save some lines here:
>
> static inline void init_uclamp_sched_group(void)
> {
> - struct uclamp_map *uc_map;
> struct uclamp_se *uc_se;
> int group_id;
> int clamp_id;
> @@ -1228,8 +1227,6 @@ static inline void init_uclamp_sched_group(void)
>
> /* Initialize root TG's to default (none) clamp values */
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> - uc_map = &uclamp_maps[clamp_id][0];
> -
> /* Map root TG's clamp value */
> uclamp_group_init(clamp_id, group_id,
> uclamp_none(clamp_id));
>
> @@ -1239,7 +1236,7 @@ static inline void init_uclamp_sched_group(void)
> uc_se->group_id = group_id;
>
> /* Attach root TG's clamp group */
> - uc_map[group_id].se_count = 1;
> + uclamp_maps[clamp_id][group_id].se_count = 1;
> }
>
> This would also make uclamp_group_available() smaller.
Unfortunately this variable name and its matrix indexes tend to become
quite long, and in some usages we would exceed 80 columns.
That's why, for consistency, I always shortcut a reference to a clamp
group into *uc_map and use that variable name consistently in the code.
I'll see if it's possible to get rid of the shortcut in the next version.
--
#include <best/regards.h>
Patrick Bellasi
On 17-Aug 15:43, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> >In order to properly support hierarchical resource control, the cgroup
> >delegation model requires that attribute writes from a child group never
> >fail but still are (potentially) constrained based on parent's assigned
> >resources. This requires properly propagating and aggregating parent
> >attributes down to its descendants.
>
> I don't understand the reason mentioned here:
>
> IMHO, a write to a child's (tg1/tg11) cpu.rt_runtime_us can fail if the
> value is restricted by the parent's value:
Well... that's my interpretation after this discussion:
https://lore.kernel.org/lkml/[email protected]/
AFAIU, what must not fail is a write to a parent which wants to enforce
more restrictive constraints on its child groups. Thus, if we have for example:
tg1: util_max=100%
tg1/tg11: util_max=80%
It should be possible without errors to set:
tg1: util_max=50%
and then enforce a 50% util_max on tg1/tg11 tasks too, eventually
using the "effective" attributes to expose the value actually used at each
level of the hierarchy.
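To make the aggregation explicit, here is a minimal user-space sketch
(illustration only, the names are made up and this is not the kernel code)
of how the effective value is derived at each level:

#include <stdio.h>

/*
 * Sketch: the effective clamp of a group is the most restrictive
 * (i.e. the minimum) between its requested value and the effective
 * value of its parent.
 */
static unsigned int effective_clamp(unsigned int requested,
                                    unsigned int parent_effective)
{
        return requested < parent_effective ? requested : parent_effective;
}

int main(void)
{
        /* tg1 requests util_max=50%, the root group is unconstrained (100%) */
        unsigned int tg1_eff  = effective_clamp(50, 100);
        /* tg1/tg11 still requests util_max=80%, but is capped by tg1 */
        unsigned int tg11_eff = effective_clamp(80, tg1_eff);

        printf("tg1: %u%%, tg1/tg11: %u%%\n", tg1_eff, tg11_eff); /* 50%, 50% */
        return 0;
}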
> root@juno:/sys/fs/cgroup/cpu# cat cpu.rt_*
> 1000000
> 950000
> root@juno:/sys/fs/cgroup/cpu# cat tg1/cpu.rt_*
> 1000000
> 0
> root@juno:/sys/fs/cgroup/cpu# cat tg1/tg11/cpu.rt_*
> 1000000
> 0
> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
> -bash: echo: write error: Invalid argument
> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/cpu.rt_runtime_us
> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
> root@juno:/sys/fs/cgroup/cpu#
This example is using the legacy hierarchy (cgroups v1).
AFAIK the default hierarchy (cgroups v2) has a much stricter set of
requirements for the "delegation model".
> >Let's implement this mechanism by adding a new "effective" clamp value
> >for each task group. The effective clamp value is defined as the smaller
> >value between the clamp value of a group and the effective clamp value
> >of its parent. This also represents the clamp value which is actually
> >used to clamp tasks in each task group.
> >
> >Since it can be interesting for tasks in a cgroup to know exactly what
> >is the currently propagated/enforced configuration, the effective clamp
> >values are exposed to user-space by means of a new pair of read-only
> >attributes: cpu.util.{min,max}.effective.
>
> I assume here that the cpu.util.{min,max} of the child will not be used any
> more because the 'effective' counterparts are taken instead.
Yes, the "effective" attributes are the one used in kernel space for
the actual clamping.
However, the cpu.util.{min,max} of a child are still required as soon
as the parent relax its constraints... when we use their value to
set the "effective" value.
> I wonder if this propagation could not have been provided with only cpu.util.{min,max}?
In the example above, if we used only those attributes we would miss the
opportunity to restore:
tg1/tg11: util_max=80%
as soon as tg1's util_max goes back to 100%.
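Just to illustrate why both values have to be kept per task group (the
field names below are hypothetical, not necessarily the ones used by the
patch):

/*
 * Illustration only: each task group tracks both the value written by
 * user-space ("requested") and the value actually enforced
 * ("effective"). When a parent relaxes its constraint, the child's
 * effective value can be recomputed from the still-stored requested
 * value, e.g. tg1/tg11 goes back to min(80%, 100%) = 80%.
 */
struct tg_clamp {
        unsigned int requested;   /* what was written to cpu.util.{min,max} */
        unsigned int effective;   /* min(requested, parent->effective) */
};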
--
#include <best/regards.h>
Patrick Bellasi
On 08/17/2018 04:45 PM, Patrick Bellasi wrote:
> On 17-Aug 15:43, Dietmar Eggemann wrote:
>> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>>> In order to properly support hierarchical resource control, the cgroup
>>> delegation model requires that attribute writes from a child group never
>>> fail but still are (potentially) constrained based on parent's assigned
>>> resources. This requires properly propagating and aggregating parent
>>> attributes down to its descendants.
>>
>> I don't understand the reason mentioned here:
>>
>> IMHO, a write to a child's (tg1/tg11) cpu.rt_runtime_us can fail if the
>> value is restricted by the parent's value:
>
> Well... that's my interpretation after this discussion:
>
> https://lore.kernel.org/lkml/[email protected]/
So cgroups v2 uses .effective files for config propagation. Didn't know
that.
> AFAIU, what must not fail is a write to a parent which wants to enforce
> more restrictive constraints on its child groups. Thus, if we have for example:
>
> tg1: util_max=100%
> tg1/tg11: util_max=80%
>
> It should be possible without errors to set:
>
> tg1: util_max=50%
>
> and then enforce a 50% util_max on tg1/tg11 tasks too, eventually
> using the "effective" attributes to expose the value actually used at each
> level of the hierarchy.
Ok, your example makes sense. But the text above says 'that attribute
writes from a child group never fail but still are ...'. So this is a
little bit different.
I guess with the knowledge that cgroups v2 is the default here, and that
config propagation is implemented via the .effective files, it's digestible.
>
>> root@juno:/sys/fs/cgroup/cpu# cat cpu.rt_*
>> 1000000
>> 950000
>> root@juno:/sys/fs/cgroup/cpu# cat tg1/cpu.rt_*
>> 1000000
>> 0
>> root@juno:/sys/fs/cgroup/cpu# cat tg1/tg11/cpu.rt_*
>> 1000000
>> 0
>> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
>> -bash: echo: write error: Invalid argument
>> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/cpu.rt_runtime_us
>> root@juno:/sys/fs/cgroup/cpu# echo 950000 > tg1/tg11/cpu.rt_runtime_us
>> root@juno:/sys/fs/cgroup/cpu#
>
> This example is using the legacy hierarchy (cgroups v1).
Yeah, so your patches take unified (v2) as default.
>
> AFAIK the default hierarchy (cgroups v2) has a much stricter set of
> requirements for the "delegation model".
Could be ... I guess I have to study this more.
[...]
>> I assume here that the cpu.util.{min,max} of the child will not be used any
>> more because the 'effective' counterparts are taken instead.
>
> Yes, the "effective" attributes are the one used in kernel space for
> the actual clamping.
>
> However, the cpu.util.{min,max} of a child are still required: as soon
> as the parent relaxes its constraints, we use their values to
> set the "effective" values.
Yes, with the new background this makes sense.
>> I wonder if this propagation could not have been provided with only cpu.util.{min,max}?
>
> In the example above, if we used only those attributes we would miss the
> opportunity to restore:
>
> tg1/tg11: util_max=80%
>
> as soon as tg1's util_max goes back to 100%.
Yes, from the config propagation point of view this should be pretty
close to the v2 cpuset controller from Waiman Long.
Maybe mentioning that these .effective files are the 'standard' way to
implement proper config propagation in cgroups v2 would help
understanding this patch.
On 08/17/2018 05:50 PM, Dietmar Eggemann wrote:
> On 08/17/2018 04:45 PM, Patrick Bellasi wrote:
>> On 17-Aug 15:43, Dietmar Eggemann wrote:
>>> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
[...]
>> AFAIK the default hierarchy (cgroups v2) has a much stricter set of
>> requirements for the "delegation model".
>
> Could be ... I guess I have to study this more.
So I see that this idea with the .effective files is used in Waiman's
new cpuset cgroupv2 controller proposal.
The cpu controller with your util clamp extension has to be usable under
cgroupv2 and cgroupv1, I assume?
[...]
On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> Clamp values cannot be tuned at the root cgroup level. Moreover, because
> of the delegation model requirements and how the parent clamps
> propagation works, if we want to enable subgroups to set a non null
> util.min, we need to be able to configure the root group util.min to
> allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
Why 1024 (100%)? Would any non-zero value work here?
[...]
> @@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct *p,
> uclamp_group_put(clamp_id, prev_group_id);
> }
>
> +int sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + struct uclamp_se *uc_se;
> + int old_min, old_max;
> + int result;
> +
> + mutex_lock(&uclamp_mutex);
> +
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> + goto undo;
> + if (sysctl_sched_uclamp_util_max > 1024)
> + goto undo;
> +
> + /* Find a valid group_id for each required clamp value */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + result = uclamp_group_find(UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
> + CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + result = uclamp_group_find(UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
> + CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uc_se = &uclamp_default[UCLAMP_MIN];
> + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> + uc_se, sysctl_sched_uclamp_util_min);
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + uc_se = &uclamp_default[UCLAMP_MAX];
> + uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
> + uc_se, sysctl_sched_uclamp_util_max);
> + }
> +
> + if (result) {
> +undo:
> + sysctl_sched_uclamp_util_min = old_min;
> + sysctl_sched_uclamp_util_max = old_max;
> + }
This looks strange! In case uclamp_group_find() returns free_group_id
instead of -ENOSPC, the sysctl min/max values are reset?
I was under the assumption that I could specify:
sysctl_sched_uclamp_util_min = 40 (for boosting)
sysctl_sched_uclamp_util_max = 80 (for clamping)
with an empty cpu controller hierarchy and then those values become the
.effective values of (a first level) task group?
[...]
On 20-Aug 12:18, Dietmar Eggemann wrote:
> On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
> >Clamp values cannot be tuned at the root cgroup level. Moreover, because
> >of the delegation model requirements and how the parent clamps
> >propagation works, if we want to enable subgroups to set a non null
> >util.min, we need to be able to configure the root group util.min to
> >allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
>
> Why 1024 (100%)? Would any non-zero value work here?
Something less than 100% will clamp the subgroups' util.min to that value.
If we want to allow the full span to subgroups, the root group should
not enforce boundaries... hence its util.min should be set to 100%.
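For example, assuming the same min-aggregation used for the other clamps:
with root util.min=50%, a subgroup requesting util.min=80% gets an
effective util.min of min(80%, 50%) = 50%; with root util.min=0% every
subgroup is stuck at an effective 0%; only with root util.min=100% does
the subgroup get its full 80%.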
> [...]
>
> >@@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct *p,
> > uclamp_group_put(clamp_id, prev_group_id);
> > }
> >+int sched_uclamp_handler(struct ctl_table *table, int write,
> >+ void __user *buffer, size_t *lenp,
> >+ loff_t *ppos)
> >+{
> >+ int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> >+ struct uclamp_se *uc_se;
> >+ int old_min, old_max;
> >+ int result;
> >+
> >+ mutex_lock(&uclamp_mutex);
> >+
> >+ old_min = sysctl_sched_uclamp_util_min;
> >+ old_max = sysctl_sched_uclamp_util_max;
> >+
> >+ result = proc_dointvec(table, write, buffer, lenp, ppos);
> >+ if (result)
> >+ goto undo;
> >+ if (!write)
> >+ goto done;
> >+
> >+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> >+ goto undo;
> >+ if (sysctl_sched_uclamp_util_max > 1024)
> >+ goto undo;
> >+
> >+ /* Find a valid group_id for each required clamp value */
> >+ if (old_min != sysctl_sched_uclamp_util_min) {
> >+ result = uclamp_group_find(UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> >+ if (result == -ENOSPC) {
> >+ pr_err("Cannot allocate more than %d UTIL_MIN clamp groups\n",
> >+ CONFIG_UCLAMP_GROUPS_COUNT);
> >+ goto undo;
> >+ }
> >+ group_id[UCLAMP_MIN] = result;
> >+ }
> >+ if (old_max != sysctl_sched_uclamp_util_max) {
> >+ result = uclamp_group_find(UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> >+ if (result == -ENOSPC) {
> >+ pr_err("Cannot allocate more than %d UTIL_MAX clamp groups\n",
> >+ CONFIG_UCLAMP_GROUPS_COUNT);
> >+ goto undo;
> >+ }
> >+ group_id[UCLAMP_MAX] = result;
> >+ }
> >+
> >+ /* Update each required clamp group */
> >+ if (old_min != sysctl_sched_uclamp_util_min) {
> >+ uc_se = &uclamp_default[UCLAMP_MIN];
> >+ uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> >+ uc_se, sysctl_sched_uclamp_util_min);
> >+ }
> >+ if (old_max != sysctl_sched_uclamp_util_max) {
> >+ uc_se = &uclamp_default[UCLAMP_MAX];
> >+ uclamp_group_get(NULL, UCLAMP_MAX, group_id[UCLAMP_MAX],
> >+ uc_se, sysctl_sched_uclamp_util_max);
> >+ }
> >+
> >+ if (result) {
> >+undo:
> >+ sysctl_sched_uclamp_util_min = old_min;
> >+ sysctl_sched_uclamp_util_max = old_max;
> >+ }
>
> This looks strange! In case uclamp_group_find() returns free_group_id
> instead of -ENOSPC, the sysctl min/max values are reset?
>
> I was under the assumption that I could specify:
>
> sysctl_sched_uclamp_util_min = 40 (for boosting)
> sysctl_sched_uclamp_util_max = 80 (for clamping)
>
> with an empty cpu controller hierarchy and then those values become the
> .effective values of (a first level) task group?
You're right, I forgot to reset result = 0 once we have passed the two
uclamp_group_find() calls. Will fix in v4.
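For reference, one possible shape of that fix (just a sketch against the
hunk quoted above, not the actual v4 code) is to clear result right after
the two lookups:

 		group_id[UCLAMP_MAX] = result;
 	}
 
+	/*
+	 * Both lookups succeeded (or were skipped): reset result so a
+	 * valid group_id does not trigger the undo path below.
+	 */
+	result = 0;
+
 	/* Update each required clamp group */
 	if (old_min != sysctl_sched_uclamp_util_min) {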
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 20-Aug 12:01, Dietmar Eggemann wrote:
> On 08/17/2018 05:50 PM, Dietmar Eggemann wrote:
> >On 08/17/2018 04:45 PM, Patrick Bellasi wrote:
> >>On 17-Aug 15:43, Dietmar Eggemann wrote:
> >>>On 08/06/2018 06:39 PM, Patrick Bellasi wrote:
>
> [...]
>
> >>AFAIK the default hierarchy (cgroups v2) has a much stricter set of
> >>requirements for the "delegation model".
> >
> >Could be ... I guess I have to study this more.
>
> So I see that this idea with the .effective files is used in Waiman's new
> cpuset cgroupv2 controller proposal.
>
> The cpu controller with your util clamp extension has to be usable under
> cgroupv2 and cgroupv1, I assume?
Right, and in general what works for v2 should work fine for v1 too.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi