2018-07-16 08:30:21

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 00/12] Add utilization clamping support

This is a respin of:

https://lore.kernel.org/lkml/[email protected]

which addresses all the feedbacks collected from the LKML discussion as well
as during the presentation at last OSPM Summit:

https://www.youtube.com/watch?v=0Yv9smm9i78

Further comments and feedbacks are more than welcome!

Cheers Patrick


Main changes
============

The main change of this version is an overall restructuring and polishing of
the entire series. The ultimate goals was to further optimize some data
structures as well as to (hopefully) make it more easy the review by both
reordering the patches and splitting some of them into smaller ones.

The series is now composed by the following described main sections.

.:: Per task (primary) API

[PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support
[PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into
[PATCH v2 03/12] sched/core: uclamp: add CPU's clamp groups
[PATCH v2 04/12] sched/core: uclamp: update CPU's refcount on clamp

This first subset adds all the main required data structures and mechanism to
support clamping in a per-task basis.

These bits are added in a top-down way:

01. adds the sched_setattr(2) syscall based API
02. adds the mapping from clamp values to clamp groups
03. adds the clamp group refcouting at {en,de}queue time
04. sync syscall changes with CPU's clamp group refcounts


.:: Schedutil integration

[PATCH v2 05/12] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
[PATCH v2 06/12] sched/cpufreq: uclamp: add utilization clamping for RT tasks

These couple of additional patches provides a first fully working solution of
utilization clamping by using the clamp values to bias frequency selection.

It's worth to notice that frequencies selection is just one of the possible
utilization clamping clients. We do not introduce other possible scheduler
integration to keep this series small enough and focused on the core bits.


.:: Per task group (secondary) API

[PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller
[PATCH v2 09/12] sched/core: uclamp: map TG's clamp values into CPU's clamp groups
[PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict
[PATCH v2 11/12] sched/core: uclamp: update CPU's refcount on TG's

These additional patches introduce the cgroup support, using the same top-down
approach of the first ones:

08. adds the cpu.util_{min,max} attributes
09. adds the mapping from clamp values to clamp groups
10. uses TG clamp value to restrict the task-specific API
11. sync TG's clamp value changes with CPU's clamp group refcounts


.:: Additional improvements

[PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX
[PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

A couple of functional improvements are provided by these few additional
patches. Although these bits are not strictly required for a fully functional
solution, they are still considered improvements worth to have.


Newcomer's Short Abstract (Updated)
===================================

The Linux scheduler is able to drive frequency selection, when the schedutil
cpufreq's governor is in use, based on task utilization aggregated at CPU
level. The CPU utilization is then used to select the frequency which better
fits the task's generated workload. The current translation of utilization
values into a frequency selection is pretty simple: we just go to max for RT
tasks or to the minimum frequency which can accommodate the utilization of
DL+FAIR tasks.

While this simple mechanism is good enough for DL tasks, for RT and FAIR tasks
we can aim at some better frequency driving which can take into consideration
hints coming from user-space.

Utilization clamping is a mechanism which allows to "clamp" (i.e. filter) the
utilization generated by RT and FAIR tasks within a range defined from
user-space. The clamped utilization value can then be used to enforce a minimum
and/or maximum frequency depending on which tasks are currently active on a
CPU.

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience. Consider for example the case of a
small control thread for an external accelerator (e.g. GPU, DSP, other
devices). In this case the scheduler does not have a complete view of
what are the task bandwidth requirements and, if it's a small task,
schedutil will keep selecting a lower frequency thus affecting the
overall time required to complete the task activations.

- clamping: increase energy efficiency for background tasks not directly
affecting the user experience. Since running at a lower frequency is in
general more energy efficient, when the completion time is not a main
goal then clamping the maximum frequency to use for certain (maybe big)
tasks can have positive effects, both on energy consumption and thermal
stress.
Moreover, this last support allows also to make RT tasks more energy
friendly on mobile systems, where running them at the maximum
frequency is not strictly required.

Frequency selection biasing, introduced by patches 5 and 6 of this series,
is just one possible usage of utilization clamping. Another compelling use
case this support is interesting for is in helping the scheduler on tasks
tasks placement decisions.

Indeed, utilization is a task specific property which is used by the scheduler
to know how much CPU bandwidth a task requires (under certain conditions).
Thus, the utilization clamp values defined either per-task or via the CPU
controller, can be used to represent tasks to the scheduler as being bigger (or
smaller) then what they really are.

Utilization clamping thus ultimately enable interesting additional
optimizations, especially on asymmetric capacity systems like Arm
big.LITTLE and DynamIQ CPUs, where:

- boosting: small tasks are preferably scheduled on higher-capacity CPUs
where, despite being less energy efficient, they can complete faster

- clamping: big/background tasks are preferably scheduler on low-capacity CPUs
where, being more energy efficient, they can still run but save power and
thermal headroom for more important tasks.

This additional usage of the utilization clamping is not presented in this
series but it's an integral part of the Energy Aware Scheduler (EAS) feature
set. A similar solution (SchedTune) is already used on Android kernels, which
targets both frequency selection and task placement biasing.
This series provides the foundation bits to add similar features in mainline
and its first simple client with the schedutil integration.


Detailed Changelog
==================

Changes in v2:

Message-ID: <[email protected]>
- refactored struct rq::uclamp_cpu to be more cache efficient
no more holes, re-arranged vectors to match cache lines with expected
data locality

Message-ID: <[email protected]>
- use *rq as parameter whenever already available
- add scheduling class's uclamp_enabled marker
- get rid of the "confusing" single callback uclamp_task_update()
and use uclamp_cpu_{get,put}() directly from {en,de}queue_task()
- fix/remove "bad" comments

Message-ID: <20180413113337.GU14248@e110439-lin>
- remove inline from init_uclamp, flag it __init

Message-ID: <[email protected]>
- get rid of the group_id back annotation
which is not requires at this stage where we have only per-task
clamping support. It will be introduce later when cgroup support is
added.

Message-ID: <[email protected]>
- make attributes available only on non-root nodes
a system wide API seems of not immediate interest and thus it's not
supported anymore
- remove implicit parent-child constraints and dependencies

Message-ID: <[email protected]>
- add some cgroup-v2 documentation for the new attributes
- (hopefully) better explain intended use-cases
the changelog above has been extended to better justify the naming
proposed by the new attributes

Other changes:
- improved documentation to make more explicit some concepts
- set UCLAMP_GROUPS_COUNT=2 by default
which allows to fit all the hot-path CPU clamps data into a single cache
line while still supporting up to 2 different {min,max}_utiql clamps.
- use -ERANGE as range violation error
- add attributes to the default hierarchy as well as the legacy one
- implement a "nice" semantics where cgroup clamp values are always
used to restrict task specific clamp values,
i.e. tasks running on a TG are only allowed to demote themself.
- patches re-ordering in top-down way
- rebased on v4.18-rc4


Patrick Bellasi (12):
sched/core: uclamp: extend sched_setattr to support utilization
clamping
sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
sched/core: uclamp: add CPU's clamp groups accounting
sched/core: uclamp: update CPU's refcount on clamp changes
sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
sched/cpufreq: uclamp: add utilization clamping for RT tasks
sched/core: uclamp: enforce last task UCLAMP_MAX
sched/core: uclamp: extend cpu's cgroup controller
sched/core: uclamp: map TG's clamp values into CPU's clamp groups
sched/core: uclamp: use TG's clamps to restrict Task's clamps
sched/core: uclamp: update CPU's refcount on TG's clamp changes
sched/core: uclamp: use percentage clamp values

Documentation/admin-guide/cgroup-v2.rst | 25 +
include/linux/sched.h | 53 ++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 66 +-
init/Kconfig | 63 ++
kernel/sched/core.c | 876 ++++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 51 +-
kernel/sched/fair.c | 4 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 194 ++++++
10 files changed, 1316 insertions(+), 24 deletions(-)

--
2.17.1



2018-07-16 08:30:32

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support utilization clamping

The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define tasks requirements which can be translated into proper
decisions for both task placements and frequencies selections.
Other classes have a more simplified model which is essentially based on
the relatively simple concept of POSIX priorities.

Such a simple priority based model however does not allow to exploit
some of the most advanced features of the Linux scheduler like, for
example, driving frequencies selection via the schedutil cpufreq
governor. However, also for non SCHED_DEADLINE tasks, it's still
interesting to define tasks properties which can be used to better
support certain scheduler decisions.

Utilization clamping aims at exposing to user-space a new set of
per-task attributes which can be used to provide the scheduler with some
hints about the expected/required utilization for a task.
This will allow to implement a more advanced per-task frequency control
mechanism which is not based just on a "passive" measured task
utilization but on a more "active" approach. For example, it could be
possible to boost interactive tasks, thus getting better performance, or
cap background tasks, thus being more energy efficient.
Ultimately, such a mechanism can be considered similar to the cpufreq's
powersave, performance and userspace governor but with a much fine
grained and per-task control.

Let's introduce a new API to set utilization clamping values for a
specified task by extending sched_setattr, a syscall which already
allows to define task specific properties for different scheduling
classes.
Specifically, a new pair of attributes allows to specify a minimum and
maximum utilization which the scheduler should consider for a task.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 16 ++++++++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 64 +++++++++++++++++++++++++++-----
init/Kconfig | 19 ++++++++++
kernel/sched/core.c | 39 +++++++++++++++++++
5 files changed, 132 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..fd8495723088 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -279,6 +279,17 @@ struct vtime {
u64 gtime;
};

+enum uclamp_id {
+ /* No utilization clamp group assigned */
+ UCLAMP_NONE = -1,
+
+ UCLAMP_MIN = 0, /* Minimum utilization */
+ UCLAMP_MAX, /* Maximum utilization */
+
+ /* Utilization clamping constraints count */
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -649,6 +660,11 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utlization clamp values for this task */
+ int uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..c27d6e81517b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -50,9 +50,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_UTIL_CLAMP 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_UTIL_CLAMP)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..7421cd25354d 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -21,8 +21,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +59,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +72,28 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the utilization which
+ * should be expected by a task. These attributes allows to inform the
+ * scheduler about the utilization boundaries within which is safe to schedule
+ * the task. These utilization boundaries are valuable information to support
+ * scheduler decisions on both task placement and frequencies selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. Thus, for
+ * example, a 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger then 0 is more likely to be
+ * scheduled on a CPU which can provide that bandwidth.
+ * A task with a max utilization value smaller then 1024 is more likely to be
+ * scheduled on a CPU which do not provide more then the required bandwidth.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +111,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/Kconfig b/init/Kconfig
index 041f3a022122..1d45a6877d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ default false
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max CPU
+ bandwidth which is allowed for a task.
+ The max bandwidth allows to clamp the maximum frequency a task can
+ use, while the min bandwidth allows to define a minimum frequency a
+ task will always use.
+
+ If in doubt, say N.
+
+endmenu
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe365c9a08e9..6a42cd86b6f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -739,6 +739,28 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
+ p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+
+ return 0;
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2176,6 +2198,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif

+#ifdef CONFIG_UCLAMP_TASK
+ p->uclamp[UCLAMP_MIN] = 0;
+ p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -4243,6 +4270,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4749,6 +4783,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);

+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN];
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+#endif
+
rcu_read_unlock();

retval = sched_read_attr(uattr, &attr, size);
--
2.17.1


2018-07-16 08:30:39

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Utilization clamping requires each CPU to know which clamp values are
assigned to tasks that are currently RUNNABLE on that CPU.
Multiple tasks can be assigned the same clamp value and tasks with
different clamp values can be concurrently active on the same CPU.
Thus, a proper data structure is required to support a fast and
efficient aggregation of the clamp values required by the currently
RUNNABLE tasks.

For this purpose we use a per-CPU array of reference counters,
where each slot is used to account how many tasks require a certain
clamp value are currently RUNNABLE on each CPU.
Each clamp value corresponds to a "clamp index" which identifies the
position within the array of reference couters.

:
(user-space changes) : (kernel space / scheduler)
:
SLOW PATH : FAST PATH
:
task_struct::uclamp::value : sched/core::enqueue/dequeue
: cpufreq_schedutil
:
+----------------+ +--------------------+ +-------------------+
| TASK | | CLAMP GROUP | | CPU CLAMPS |
+----------------+ +--------------------+ +-------------------+
| | | clamp_{min,max} | | clamp_{min,max} |
| util_{min,max} | | se_count | | tasks count |
+----------------+ +--------------------+ +-------------------+
:
+------------------> : +------------------->
group_id = map(clamp_value) : ref_count(group_id)
:
:

Let's introduce the support to map tasks to "clamp groups".
Specifically we introduce the required functions to translate a
"clamp value" into a clamp's "group index" (group_id).

Only a limited number of (different) clamp values are supported since:
1. there are usually only few classes of workloads for which it makes
sense to boost/limit to different frequencies,
e.g. background vs foreground, interactive vs low-priority
2. it allows a simpler and more memory/time efficient tracking of
the per-CPU clamp values in the fast path.

The number of possible different clamp values is currently defined at
compile time. Thus, setting a new clamp value for a task can result into
a -ENOSPC error in case this will exceed the number of maximum different
clamp values supported.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 15 ++-
init/Kconfig | 22 ++++
kernel/sched/core.c | 300 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 330 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fd8495723088..0635e8073cd3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -578,6 +578,19 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+/**
+ * Utilization's clamp group
+ *
+ * A utilization clamp group maps a "clamp value" (value), i.e.
+ * util_{min,max}, to a "clamp group index" (group_id).
+ */
+struct uclamp_se {
+ /* Utilization constraint for tasks in this group */
+ unsigned int value;
+ /* Utilization clamp group for this constraint */
+ unsigned int group_id;
+};
+
union rcu_special {
struct {
u8 blocked;
@@ -662,7 +675,7 @@ struct task_struct {

#ifdef CONFIG_UCLAMP_TASK
/* Utlization clamp values for this task */
- int uclamp[UCLAMP_CNT];
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/init/Kconfig b/init/Kconfig
index 1d45a6877d6f..0a377ad7c166 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -601,7 +601,29 @@ config UCLAMP_TASK

If in doubt, say N.

+config UCLAMP_GROUPS_COUNT
+ int "Number of different utilization clamp values supported"
+ range 0 127
+ default 2
+ depends on UCLAMP_TASK
+ help
+ This defines the maximum number of different utilization clamp
+ values which can be concurrently enforced for each utilization
+ clamp index (i.e. minimum and maximum utilization).
+
+ Only a limited number of clamp values are supported because:
+ 1. there are usually only few classes of workloads for which it
+ makes sense to boost/cap for different frequencies,
+ e.g. background vs foreground, interactive vs low-priority.
+ 2. it allows a simpler and more memory/time efficient tracking of
+ the per-CPU clamp values.
+
+ Set to 0 (default value) to disable the utilization clamping feature.
+
+ If in doubt, use the default value.
+
endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6a42cd86b6f3..50e749067df5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -740,25 +740,309 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_mutex: serializes updates of utilization clamp values
+ *
+ * A utilization clamp value update is usually triggered from a user-space
+ * process (slow-path) but it requires a synchronization with the scheduler's
+ * (fast-path) enqueue/dequeue operations.
+ * While the fast-path synchronization is protected by RQs spinlock, this
+ * mutex ensures that we sequentially serve user-space requests.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/**
+ * uclamp_map: reference counts a utilization "clamp value"
+ * @value: the utilization "clamp value" required
+ * @se_count: the number of scheduling entities requiring the "clamp value"
+ * @se_lock: serialize reference count updates by protecting se_count
+ */
+struct uclamp_map {
+ int value;
+ int se_count;
+ raw_spinlock_t se_lock;
+};
+
+/**
+ * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
+ *
+ * Since only a limited number of different "clamp values" are supported, we
+ * need to map each different clamp value into a "clamp group" (group_id) to
+ * be used by the per-CPU accounting in the fast-path, when tasks are
+ * enqueued and dequeued.
+ * We also support different kind of utilization clamping, min and max
+ * utilization for example, each representing what we call a "clamp index"
+ * (clamp_id).
+ *
+ * A matrix is thus required to map "clamp values" to "clamp groups"
+ * (group_id), for each "clamp index" (clamp_id), where:
+ * - rows are indexed by clamp_id and they collect the clamp groups for a
+ * given clamp index
+ * - columns are indexed by group_id and they collect the clamp values which
+ * maps to that clamp group
+ *
+ * Thus, the column index of a given (clamp_id, value) pair represents the
+ * clamp group (group_id) used by the fast-path's per-CPU accounting.
+ *
+ * NOTE: first clamp group (group_id=0) is reserved for tracking of non
+ * clamped tasks. Thus we allocate one more slot than the value of
+ * CONFIG_UCLAMP_GROUPS_COUNT.
+ *
+ * Here is the map layout and, right below, how entries are accessed by the
+ * following code.
+ *
+ * uclamp_maps is a matrix of
+ * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
+ * | |
+ * | /---------------+---------------\
+ * | +------------+ +------------+
+ * | / UCLAMP_MIN | value | | value |
+ * | | | se_count |...... | se_count |
+ * | | +------------+ +------------+
+ * +--+ +------------+ +------------+
+ * | | value | | value |
+ * \ UCLAMP_MAX | se_count |...... | se_count |
+ * +-----^------+ +----^-------+
+ * | |
+ * uc_map = + |
+ * &uclamp_maps[clamp_id][0] +
+ * clamp_value =
+ * uc_map[group_id].value
+ */
+static struct uclamp_map uclamp_maps[UCLAMP_CNT]
+ [CONFIG_UCLAMP_GROUPS_COUNT + 1];
+
+/**
+ * uclamp_group_available: checks if a clamp group is available
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index in the given clamp_id
+ *
+ * A clamp group is not free if there is at least one SE which is sing a clamp
+ * value mapped on the specified clamp_id. These SEs are reference counted by
+ * the se_count of a uclamp_map entry.
+ *
+ * Return: true if there are no SE's mapped on the specified clamp
+ * index and group
+ */
+static inline bool uclamp_group_available(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ return (uc_map[group_id].value == UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_init: maps a clamp value on a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamp)
+ * @group_id: the group index to map a given clamp_value
+ * @clamp_value: the utilization clamp value to map
+ *
+ * Initializes a clamp group to track tasks from the fast-path.
+ * Each different clamp value, for a given clamp index (i.e. min/max
+ * utilization clamp), is mapped by a clamp group which index is used by the
+ * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp
+ * value.
+ *
+ */
+static inline void uclamp_group_init(int clamp_id, int group_id,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+
+ uc_map[group_id].value = clamp_value;
+ uc_map[group_id].se_count = 0;
+}
+
+/**
+ * uclamp_group_reset: resets a specified clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @group_id: the group index to release
+ *
+ * A clamp group can be reset every time there are no more task groups using
+ * the clamp value it maps for a given clamp index.
+ */
+static inline void uclamp_group_reset(int clamp_id, int group_id)
+{
+ uclamp_group_init(clamp_id, group_id, UCLAMP_NONE);
+}
+
+/**
+ * uclamp_group_find: finds the group index of a utilization clamp group
+ * @clamp_id: the utilization clamp index (i.e. min or max clamping)
+ * @clamp_value: the utilization clamping value lookup for
+ *
+ * Verify if a group has been assigned to a certain clamp value and return
+ * its index to be used for accounting.
+ *
+ * Since only a limited number of utilization clamp groups are allowed, if no
+ * groups have been assigned for the specified value, a new group is assigned
+ * if possible. Otherwise an error is returned, meaning that an additional clamp
+ * value is not (currently) supported.
+ */
+static int
+uclamp_group_find(int clamp_id, unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int free_group_id = UCLAMP_NONE;
+ unsigned int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Keep track of first free clamp group */
+ if (uclamp_group_available(clamp_id, group_id)) {
+ if (free_group_id == UCLAMP_NONE)
+ free_group_id = group_id;
+ continue;
+ }
+ /* Return index of first group with same clamp value */
+ if (uc_map[group_id].value == clamp_value)
+ return group_id;
+ }
+ /* Default to first free clamp group */
+ if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
+ group_id = free_group_id;
+ /* All clamp group already track different clamp values */
+ if (group_id == UCLAMP_NONE)
+ return -ENOSPC;
+ return group_id;
+}
+
+/**
+ * uclamp_group_put: decrease the reference count for a clamp group
+ * @clamp_id: the clamp index which was affected by a task group
+ * @uc_se: the utilization clamp data for that task group
+ *
+ * When the clamp value for a task group is changed we decrease the reference
+ * count for the clamp group mapping its current clamp value. A clamp group is
+ * released when there are no more task groups referencing its clamp value.
+ */
+static inline void uclamp_group_put(int clamp_id, int group_id)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ unsigned long flags;
+
+ /* Ignore SE's not yet attached */
+ if (group_id == UCLAMP_NONE)
+ return;
+
+ /* Remove SE from this clamp group */
+ raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
+ uc_map[group_id].se_count -= 1;
+ if (uc_map[group_id].se_count == 0)
+ uclamp_group_reset(clamp_id, group_id);
+ raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
+}
+
+/**
+ * uclamp_group_get: increase the reference count for a clamp group
+ * @p: the task which clamp value must be tracked
+ * @clamp_id: the clamp index affected by the task
+ * @uc_se: the utilization clamp data for the task
+ * @clamp_value: the new clamp value for the task
+ *
+ * Each time a task changes its utilization clamp value, for a specified clamp
+ * index, we need to find an available clamp group which can be used to track
+ * this new clamp value. The corresponding clamp group index will be used by
+ * the task to reference count the clamp value on CPUs while enqueued.
+ *
+ * Return: -ENOSPC if there are no available clamp groups, 0 on success.
+ */
+static inline int uclamp_group_get(struct task_struct *p,
+ int clamp_id, struct uclamp_se *uc_se,
+ unsigned int clamp_value)
+{
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int prev_group_id = uc_se->group_id;
+ int next_group_id = UCLAMP_NONE;
+ unsigned long flags;
+
+ /* Lookup for a usable utilization clamp group */
+ next_group_id = uclamp_group_find(clamp_id, clamp_value);
+ if (next_group_id < 0) {
+ pr_err("Cannot allocate more than %d utilization clamp groups\n",
+ CONFIG_UCLAMP_GROUPS_COUNT);
+ return -ENOSPC;
+ }
+
+ /* Allocate new clamp group for this clamp value */
+ if (uclamp_group_available(clamp_id, next_group_id))
+ uclamp_group_init(clamp_id, next_group_id, clamp_value);
+
+ /* Update SE's clamp values and attach it to new clamp group */
+ raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
+ uc_se->value = clamp_value;
+ uc_se->group_id = next_group_id;
+ uc_map[next_group_id].se_count += 1;
+ raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
+
+ /* Release the previous clamp group */
+ uclamp_group_put(clamp_id, prev_group_id);
+
+ return 0;
+}
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
+ struct uclamp_se *uc_se;
+ int retval = 0;
+
if (attr->sched_util_min > attr->sched_util_max)
return -EINVAL;
if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
return -EINVAL;

- p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
- p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+ mutex_lock(&uclamp_mutex);
+
+ /* Update min utilization clamp */
+ uc_se = &p->uclamp[UCLAMP_MIN];
+ retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
+ attr->sched_util_min);
+
+ /* Update max utilization clamp */
+ uc_se = &p->uclamp[UCLAMP_MAX];
+ retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
+ attr->sched_util_max);
+
+ mutex_unlock(&uclamp_mutex);
+
+ /*
+ * If one of the two clamp values should fail,
+ * let the userspace know.
+ */
+ if (retval)
+ return -ENOSPC;

return 0;
}
+
+/**
+ * init_uclamp: initialize data structures required for utilization clamping
+ */
+static void __init init_uclamp(void)
+{
+ int clamp_id;
+
+ mutex_init(&uclamp_mutex);
+
+ /* Init SE's clamp map */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ int group_id = 0;
+
+ for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ uc_map[group_id].value = UCLAMP_NONE;
+ raw_spin_lock_init(&uc_map[group_id].se_lock);
+ }
+ }
+}
+
#else /* CONFIG_UCLAMP_TASK */
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
return -EINVAL;
}
+static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2199,8 +2483,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif

#ifdef CONFIG_UCLAMP_TASK
- p->uclamp[UCLAMP_MIN] = 0;
- p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+ p->uclamp[UCLAMP_MIN].value = 0;
+ p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NONE;
+ p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
+ p->uclamp[UCLAMP_MAX].group_id = UCLAMP_NONE;
#endif

#ifdef CONFIG_SCHEDSTATS
@@ -4784,8 +5070,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN];
- attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
#endif

rcu_read_unlock();
@@ -6151,6 +6437,8 @@ void __init sched_init(void)

init_schedstats();

+ init_uclamp();
+
scheduler_running = 1;
}

--
2.17.1


2018-07-16 08:30:49

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 03/12] sched/core: uclamp: add CPU's clamp groups accounting

Utilization clamping allows to clamp the utilization of a CPU within a
[util_min, util_max] range. This range depends on the set of currently
RUNNABLE tasks on a CPU, where each task references two "clamp groups"
defining the util_min and the util_max clamp values to be considered for
that task. The clamp value mapped by a clamp group applies to a CPU only
when there is at least one task RUNNABLE referencing that clamp group.

When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
active on that CPU can change. Since each clamp group enforces a
different utilization clamp value, once the set of these groups changes
it can be required to re-compute what is the new "aggregated" clamp
value to apply on that CPU.

Clamp values are always MAX aggregated for both util_min and util_max.
This is to ensure that no tasks can affect the performance of other
co-scheduled tasks which are either more boosted (i.e. with higher
util_min clamp) or less capped (i.e. with higher util_max clamp).

Here we introduce the required support to properly reference count clamp
groups at each task enqueue/dequeue time.

Tasks have a:
task_struct::uclamp::group_id[clamp_idx]
indexing, for each clamp index (i.e. util_{min,max}), the clamp group in
which they should refcount at enqueue time.

CPUs rq have a:
rq::uclamp::group[clamp_idx][group_idx].tasks
which is used to reference count how many tasks are currently RUNNABLE on
that CPU for each clamp group of each clamp index..

The clamp value of each clamp group is tracked by
rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an
unordered array of clamp values. However, the MAX aggregation of the
currently active clamp groups is implemented to minimize the number of
times we need to scan the complete (unordered) clamp group array to
figure out the new max value. This operation indeed happens only when we
dequeue last task of the clamp group corresponding to the current max
clamp, and thus the CPU is either entering IDLE or going to schedule a
less boosted or more clamped task.
Moreover, the expected number of different clamp values, which can be
configured at build time, is usually so small that a more advanced
ordering algorithm is not needed. In real use-cases we expect less then
10 different values.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 188 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 4 +
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 71 ++++++++++++++++
4 files changed, 267 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50e749067df5..d1969931fea6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -848,9 +848,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
unsigned int clamp_value)
{
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
+ struct uclamp_cpu *uc_cpu;
+ int cpu;

+ /* Set clamp group map */
uc_map[group_id].value = clamp_value;
uc_map[group_id].se_count = 0;
+
+ /* Set clamp groups on all CPUs */
+ for_each_possible_cpu(cpu) {
+ uc_cpu = &cpu_rq(cpu)->uclamp;
+ uc_cpu->group[clamp_id][group_id].value = clamp_value;
+ uc_cpu->group[clamp_id][group_id].tasks = 0;
+ }
}

/**
@@ -906,6 +916,172 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
return group_id;
}

+/**
+ * uclamp_cpu_update: updates the utilization clamp of a CPU
+ * @cpu: the CPU which utilization clamp has to be updated
+ * @clamp_id: the clamp index to update
+ *
+ * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
+ * clamp groups is subject to change. Since each clamp group enforces a
+ * different utilization clamp value, once the set of these groups changes it
+ * can be required to re-compute what is the new clamp value to apply for that
+ * CPU.
+ *
+ * For the specified clamp index, this method computes the new CPU utilization
+ * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
+ */
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
+ int max_value = UCLAMP_NONE;
+ unsigned int group_id;
+
+ for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
+ /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
+ if (!uclamp_group_active(uc_grp, group_id))
+ continue;
+
+ /* Both min and max clamp are MAX aggregated */
+ max_value = max(max_value, uc_grp[group_id].value);
+
+ /* Stop if we reach the max possible clamp */
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ rq->uclamp.value[clamp_id] = max_value;
+}
+
+/**
+ * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
+ * @p: the task being enqueued on a CPU
+ * @rq: the CPU's rq where the clamp group has to be reference counted
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
+ *
+ * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
+ * the task's uclamp.group_id is reference counted on that CPU.
+ */
+static inline void uclamp_cpu_get_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ int clamp_value;
+ int group_id;
+
+ /* No task specific clamp values: nothing to do */
+ group_id = p->uclamp[clamp_id].group_id;
+ if (group_id == UCLAMP_NONE)
+ return;
+
+ /* Reference count the task into its current group_id */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+ uc_grp[group_id].tasks += 1;
+
+ /*
+ * If this is the new max utilization clamp value, then we can update
+ * straight away the CPU clamp value. Otherwise, the current CPU clamp
+ * value is still valid and we are done.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (uc_cpu->value[clamp_id] < clamp_value)
+ uc_cpu->value[clamp_id] = clamp_value;
+}
+
+/**
+ * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
+ * @p: the task being dequeued from a CPU
+ * @cpu: the CPU from where the clamp group has to be released
+ * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
+ *
+ * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
+ * counted by the task is decreased.
+ * If this was the last task defining the current max clamp group, then the
+ * CPU clamping is updated to find the new max for the specified clamp
+ * index.
+ */
+static inline void uclamp_cpu_put_id(struct task_struct *p,
+ struct rq *rq, int clamp_id)
+{
+ struct uclamp_group *uc_grp;
+ struct uclamp_cpu *uc_cpu;
+ unsigned int clamp_value;
+ int group_id;
+
+ /* No task specific clamp values: nothing to do */
+ group_id = p->uclamp[clamp_id].group_id;
+ if (group_id == UCLAMP_NONE)
+ return;
+
+ /* Decrement the task's reference counted group index */
+ uc_grp = &rq->uclamp.group[clamp_id][0];
+ uc_grp[group_id].tasks -= 1;
+
+ /* If this is not the last task, no updates are required */
+ if (uc_grp[group_id].tasks > 0)
+ return;
+
+ /*
+ * Update the CPU only if this was the last task of the group
+ * defining the current clamp value.
+ */
+ uc_cpu = &rq->uclamp;
+ clamp_value = uc_grp[group_id].value;
+ if (clamp_value >= uc_cpu->value[clamp_id])
+ uclamp_cpu_update(rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_get(): increase CPU's clamp group refcount
+ * @rq: the CPU's rq where the clamp group has to be refcounted
+ * @p: the task being enqueued
+ *
+ * Once a task is enqueued on a CPU's rq, all the clamp groups currently
+ * enforced on a task are reference counted on that rq.
+ * Not all scheduling classes have utilization clamping support, their tasks
+ * will be silently ignored.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_get_id(p, rq, clamp_id);
+}
+
+/**
+ * uclamp_cpu_put(): decrease CPU's clamp group refcount
+ * @cpu: the CPU's rq where the clamp group refcount has to be decreased
+ * @p: the task being dequeued
+ *
+ * When a task is dequeued from a CPU's rq, all the clamp groups the task has
+ * been reference counted at task's enqueue time have to be decreased for that
+ * CPU.
+ *
+ * This method updates the utilization clamp constraints considering the
+ * requirements for the specified task. Thus, this update must be done before
+ * calling into the scheduling classes, which will eventually update schedutil
+ * considering the new task requirements.
+ */
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
+{
+ int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_put_id(p, rq, clamp_id);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -1021,9 +1197,17 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
static void __init init_uclamp(void)
{
int clamp_id;
+ int cpu;

mutex_init(&uclamp_mutex);

+ /* Init CPU's clamp groups */
+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
+ }
+
/* Init SE's clamp map */
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
@@ -1037,6 +1221,8 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1053,6 +1239,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & ENQUEUE_RESTORE))
sched_info_queued(rq, p);

+ uclamp_cpu_get(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -1064,6 +1251,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_SAVE))
sched_info_dequeued(rq, p);

+ uclamp_cpu_put(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f0a0be4d344..fd857440276c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10433,6 +10433,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 572567078b60..056a7e1bd529 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2391,6 +2391,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..65bf9ebacd83 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -744,6 +744,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * struct uclamp_group - Utilization clamp Group
+ * @value: utilization clamp value for tasks on this clamp group
+ * @tasks: number of RUNNABLE tasks on this clamp group
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_group {
+ int value;
+ int tasks;
+};
+
+/**
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @group: utilization clamp groups affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPUs to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp value, which are
+ * currently only two: minmum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ int value[UCLAMP_CNT];
+ struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -781,6 +825,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1535,6 +1584,10 @@ struct sched_class {
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
};

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
@@ -2130,6 +2183,24 @@ static inline u64 irq_time_read(int cpu)
}
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_group_active: check if a clamp group is active on a CPU
+ * @uc_grp: the clamp groups for a CPU
+ * @group_id: the clamp group to check
+ *
+ * A clamp group affects a CPU if it as at least one RUNNABLE task.
+ *
+ * Return: true if the specified CPU has at least one RUNNABLE task
+ * for the specified clamp group.
+ */
+static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
+ int group_id)
+{
+ return uc_grp[group_id].tasks > 0;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef CONFIG_CPU_FREQ
DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

--
2.17.1


2018-07-16 08:30:57

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 05/12] sched/cpufreq: uclamp: add utilization clamping for FAIR tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by the CFS
class. However, when utilization clamping is in use, the frequency
selection should consider the requirements suggested by userspace, for
example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "required" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support a per-task based tuning of the
frequency selection thus allowing to have a fine grained definition of
performance boosting vs energy saving strategies in kernel space.

Let's add the required support to clamp the utilization generated by
FAIR tasks within the boundaries defined by their aggregated utilization
clamp constraints.
On each CPU the aggregated clamp values are obtained by considering the
maximum of the {min,max}_util values for each task. This max aggregation
responds to the goal of not penalizing, for example, high boosted (i.e.
more important for the user-experience) CFS tasks which happens to be
co-scheduled with high capped (i.e. less important for the
user-experience) CFS tasks.

For FAIR tasks both the utilization as well as the IOWait boost values
are clamped according to the CPU aggregated utilization clamp
constraints.

The default values for boosting and capping are defined to be:
- util_min: 0
- util_max: SCHED_CAPACITY_SCALE
which means that by default no boosting/capping is enforced on FAIR
tasks, and thus the frequency will be selected considering the actual
utilization value of each CPU.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 21 ++++++++--
kernel/sched/sched.h | 71 ++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index c907fde01eaa..70aea6ec3c08 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -191,6 +191,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util_cfs;

if (rt_rq_is_runnable(&rq->rt))
return sg_cpu->max;
@@ -205,7 +206,9 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
+ util_cfs = uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);
+
+ return min(sg_cpu->max, (sg_cpu->util_dl + util_cfs));
}

/**
@@ -252,6 +255,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
unsigned int flags)
{
bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+ unsigned int max_boost;

/* Reset boost if the CPU appears to have been idle enough */
if (sg_cpu->iowait_boost &&
@@ -267,11 +271,22 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
return;
sg_cpu->iowait_boost_pending = true;

+ /*
+ * Boost FAIR tasks only up to the CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since for RT tasks we are already going to max, we don't
+ * rally care about clamping the IO boost max value for them too.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ max_boost = uclamp_util(sg_cpu->cpu, max_boost);
+
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
return;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b471d2222410..1207add36478 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2284,6 +2284,77 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+/**
+ * uclamp_none: default value for a clamp
+ *
+ * This returns the default value for each clamp
+ * - 0 for a min utilization clamp
+ * - SCHED_CAPACITY_SCALE for a max utilization clamp
+ *
+ * Return: the default value for a given utilization clamp
+ */
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+#ifdef CONFIG_UCLAMP_TASK
+/**
+ * uclamp_value: get the current CPU's utilization clamp value
+ * @cpu: the CPU to consider
+ * @clamp_id: the utilization clamp index (i.e. min or max utilization)
+ *
+ * The utilization clamp value for a CPU depends on its set of currently
+ * RUNNABLE tasks and their specific util_{min,max} constraints.
+ * A max aggregated value is tracked for each CPU and returned by this
+ * function.
+ *
+ * Return: the current value for the specified CPU and clamp index
+ */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
+
+ if (uc_cpu->value[clamp_id] == UCLAMP_NONE)
+ return uclamp_none(clamp_id);
+
+ return uc_cpu->value[clamp_id];
+}
+
+/**
+ * clamp_util: clamp a utilization value for a specified CPU
+ * @cpu: the CPU to get the clamp values from
+ * @util: the utilization signal to clamp
+ *
+ * Each CPU tracks util_{min,max} clamp values depending on the set of its
+ * currently RUNNABLE tasks. Given a utilization signal, i.e a signal in
+ * the [0..SCHED_CAPACITY_SCALE] range, this function returns a clamped
+ * utilization signal considering the current clamp values for the
+ * specified CPU.
+ *
+ * Return: a clamped utilization signal for a given CPU.
+ */
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ unsigned int min_util = uclamp_value(cpu, UCLAMP_MIN);
+ unsigned int max_util = uclamp_value(cpu, UCLAMP_MAX);
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_value(unsigned int cpu, int clamp_id)
+{
+ return uclamp_none(clamp_id);
+}
+
+static inline unsigned int uclamp_util(unsigned int cpu, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.17.1


2018-07-16 08:30:59

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 06/12] sched/cpufreq: uclamp: add utilization clamping for RT tasks

Currently schedutil enforces a maximum frequency when RT tasks are
RUNNABLE. Such a mandatory policy can be made more tunable from
userspace thus allowing for example to define a max frequency which is
still reasonable for the execution of a specific RT workload. This
will contribute to make the RT class more friendly for power/energy
sensitive use-cases.

This patch extends the usage of util_{min,max} to the RT scheduling
class. Whenever a task in this class is RUNNABLE, the util required is
defined by the constraints of the CPU control group the task belongs to.

The IO wait boost value is thus subject to clamping for RT tasks too.
This is to ensure that RT tasks as well as CFS ones are always subject
to the set of current utilization clamping constraints.
It's worth to notice that, by default, clamp values are
min_util, max_util = (0, SCHED_CAPACITY_SCALE)
and thus, RT tasks always run at the maximum OPP if not otherwise
constrained by userspace.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 42 +++++++++++++++++++-------------
1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 70aea6ec3c08..b05a63055e70 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -188,27 +188,35 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->util_dl = cpu_util_dl(rq);
}

+/**
+ * sugov_aggregate_util() - Aggregate scheduling classes requests.
+ * @sg_cpu: the sugov data for the CPU to get utilization from
+ *
+ * Utilization required by DEADLINE must always be granted while, for
+ * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
+ * gracefully reduce the frequency when no tasks show up for longer
+ * periods of time.
+ *
+ * Ideally we would like to set util_dl as min/guaranteed freq and
+ * util_cfs + util_dl as requested freq. However, cpufreq is not yet
+ * ready for such an interface. So, we only do the latter for now.
+ *
+ * RT and CFS utilization are clamped, according to the current CPU
+ * constrains. They are individually clamped to ensure fairness across
+ * classes, meaning that CFS always gets (if possible) the (minimum)
+ * required bandwidth on top of that required by higher priority
+ * classes.
+ */
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
+ unsigned long util = sg_cpu->util_dl;
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util_cfs;

if (rt_rq_is_runnable(&rq->rt))
- return sg_cpu->max;
-
- /*
- * Utilization required by DEADLINE must always be granted while, for
- * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
- * gracefully reduce the frequency when no tasks show up for longer
- * periods of time.
- *
- * Ideally we would like to set util_dl as min/guaranteed freq and
- * util_cfs + util_dl as requested freq. However, cpufreq is not yet
- * ready for such an interface. So, we only do the latter for now.
- */
- util_cfs = uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);
+ util += uclamp_util(sg_cpu->cpu, sg_cpu->max);
+ util += uclamp_util(sg_cpu->cpu, sg_cpu->util_cfs);

- return min(sg_cpu->max, (sg_cpu->util_dl + util_cfs));
+ return min(sg_cpu->max, util);
}

/**
@@ -276,8 +284,8 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since for RT tasks we are already going to max, we don't
- * rally care about clamping the IO boost max value for them too.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
max_boost = sg_cpu->iowait_boost_max;
max_boost = uclamp_util(sg_cpu->cpu, max_boost);
--
2.17.1


2018-07-16 08:31:07

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

The cgroup's CPU controller allows to assign a specified (maximum)
bandwidth to the tasks of a group. However this bandwidth is defined and
enforced only on a temporal base, without considering the actual
frequency a CPU is running on. Thus, the amount of computation completed
by a task within an allocated bandwidth can be very different depending
on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Give the above mechanisms, it is now possible to extend the cpu
controller to specify what is the minimum (or maximum) utilization which
a task is expected (or allowed) to generate.
Constraints on minimum and maximum utilization allowed for tasks in a
CPU cgroup can improve the control on the actual amount of CPU bandwidth
consumed by tasks.

Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can be
conveniently used to enforce important interactive tasks on more capable
CPUs or to run low priority and background tasks on more energy
efficient CPUs.

The ultimate goal of utilization clamping is thus to enable:

- boosting: by selecting an higher capacity CPU and/or higher execution
frequency for small tasks which are affecting the user
interactive experience.

- capping: by selecting more energy efficiency CPUs or lower execution
frequency, for big tasks which are mainly related to
background activities, and thus without a direct impact on
the user experience.

Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich information hints to the
scheduler regarding the tasks it's going to schedule.

This patch extends the CPU controller by adding a couple of new
attributes, util_min and util_max, which can be used to enforce task's
utilization boosting and capping. Specifically:

- util_min: defines the minimum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run at least at a minimum frequency which
corresponds to the min_util utilization

- util_max: defines the maximum utilization which should be considered,
e.g. when schedutil selects the frequency for a CPU while a
task in this group is RUNNABLE.
i.e. the task will run up to a maximum frequency which
corresponds to the max_util utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies
b) do not enforce any constraints and/or dependency between the parent
and its child nodes, thus relying on the delegation model and
permission settings defined by the system management software
c) allow to (eventually) further restrict task-specific clamps defined
via sched_setattr(2)

This patch provides the basic support to expose the two new attributes
and to validate their run-time updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++
init/Kconfig | 22 +++
kernel/sched/core.c | 186 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
4 files changed, 238 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8a2c52d5c53b..328c011cc105 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -904,6 +904,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is based, by default, on a temporal base and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to enforce a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -963,6 +969,25 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.

+ cpu.util_min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no bandwidth boosting.
+
+ The minimum utilization in the range [0, 1023].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util_max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1023". i.e. no bandwidth clamping
+
+ The maximum utilization in the range [0, 1023].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index 0a377ad7c166..d7e2b74637ff 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -792,6 +792,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows to clamp the maximum frequency a task
+ can use, while the min bandwidth allows to define a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, an eventually
+ specified task-specific clamp value is constrained by the cgroup
+ specified clamp value. Both minimum and maximum task clamping cannot
+ be bigger than the corresponding clamping defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0cb6e0aa4faa..30b1d894f978 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1227,6 +1227,74 @@ static inline int uclamp_group_get(struct task_struct *p,
return 0;
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+/**
+ * init_uclamp_sched_group: initialize data structures required for TG's
+ * utilization clamping
+ */
+static inline void init_uclamp_sched_group(void)
+{
+ struct uclamp_map *uc_map;
+ struct uclamp_se *uc_se;
+ int group_id;
+ int clamp_id;
+
+ /* Root TG's is statically assigned to the first clamp group */
+ group_id = 0;
+
+ /* Initialize root TG's to default (none) clamp values */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_map = &uclamp_maps[clamp_id][0];
+
+ /* Map root TG's clamp value */
+ uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
+
+ /* Init root TG's clamp group */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->group_id = group_id;
+
+ /* Attach root TG's clamp group */
+ uc_map[group_id].se_count = 1;
+ }
+}
+
+/**
+ * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
+ * @tg: the newly created task group
+ * @parent: its parent task group
+ *
+ * A newly created task group inherits its utilization clamp values, for all
+ * clamp indexes, from its parent task group.
+ * This ensures that its values are properly initialized and that the task
+ * group is accounted in the same parent's group index.
+ *
+ * Return: !0 on error
+ */
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+
+ uc_se->value = parent->uclamp[clamp_id].value;
+ uc_se->group_id = UCLAMP_NONE;
+ }
+
+ return 1;
+}
+#else /* CONFIG_UCLAMP_TASK_GROUP */
+static inline void init_uclamp_sched_group(void) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1289,11 +1357,18 @@ static void __init init_uclamp(void)
raw_spin_lock_init(&uc_map[group_id].se_lock);
}
}
+
+ init_uclamp_sched_group();
}

#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+ return 1;
+}
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -6890,6 +6965,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;

err:
@@ -7110,6 +7188,88 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MIN].value == min_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MAX].value < min_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = -EINVAL;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ mutex_lock(&uclamp_mutex);
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MAX].value == max_value) {
+ ret = 0;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MIN].value > max_value)
+ goto out;
+
+out:
+ rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7437,6 +7597,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util_min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util_max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7604,6 +7776,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util_max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e4f10c507b7..1471a23e8f57 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -389,6 +389,11 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.17.1


2018-07-16 08:31:08

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 09/12] sched/core: uclamp: map TG's clamp values into CPU's clamp groups

Utilization clamping requires to map each different clamp value
into one of the available clamp groups used by the scheduler's fast-path
to account for RUNNABLE tasks. Thus, each time a TG's clamp value is
updated we need to get a reference to the new value's clamp group and
release a reference to the previous one.

Let's ensure that, whenever a task group is assigned a specific
clamp_value, this is properly translated into a unique clamp group to be
used in the fast-path (i.e. at enqueue/dequeue time).
We do that by slightly refactoring uclamp_group_get() to make the
*task_struct parameter optional. This allows to re-use the code already
available to support the per-task API.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 46 +++++++++++++++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0635e8073cd3..260aa8d3fca9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -583,6 +583,8 @@ struct sched_dl_entity {
*
* A utilization clamp group maps a "clamp value" (value), i.e.
* util_{min,max}, to a "clamp group index" (group_id).
+ * The same "group_id" can be used by multiple TG's to enforce the same
+ * clamp "value" for a given clamp index.
*/
struct uclamp_se {
/* Utilization constraint for tasks in this group */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30b1d894f978..04e758224e22 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1219,7 +1219,8 @@ static inline int uclamp_group_get(struct task_struct *p,
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);

/* Update CPU's clamp group refcounts of RUNNABLE task */
- uclamp_task_update_active(p, clamp_id, next_group_id);
+ if (p)
+ uclamp_task_update_active(p, clamp_id, next_group_id);

/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);
@@ -1276,18 +1277,47 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
{
struct uclamp_se *uc_se;
int clamp_id;
+ int ret = 1;

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &tg->uclamp[clamp_id];

uc_se->value = parent->uclamp[clamp_id].value;
uc_se->group_id = UCLAMP_NONE;
+
+ if (uclamp_group_get(NULL, clamp_id, uc_se,
+ parent->uclamp[clamp_id].value)) {
+ ret = 0;
+ goto out;
+ }
}

- return 1;
+out:
+ return ret;
+}
+
+/**
+ * release_uclamp_sched_group: release utilization clamp references of a TG
+ * @tg: the task group being removed
+ *
+ * An empty task group can be removed only when it has no more tasks or child
+ * groups. This means that we can also safely release all the reference
+ * counting to clamp groups.
+ */
+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+ struct uclamp_se *uc_se;
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &tg->uclamp[clamp_id];
+ uclamp_group_put(clamp_id, uc_se->group_id);
+ }
}
+
#else /* CONFIG_UCLAMP_TASK_GROUP */
static inline void init_uclamp_sched_group(void) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -1364,6 +1394,7 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
+static inline void free_uclamp_sched_group(struct task_group *tg) { }
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -6944,6 +6975,7 @@ static DEFINE_SPINLOCK(task_group_lock);

static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -7192,6 +7224,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;

@@ -7209,6 +7242,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MAX].value < min_value)
goto out;

+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MIN];
+ ret = uclamp_group_get(NULL, UCLAMP_MIN, uc_se, min_value);
+
out:
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);
@@ -7219,6 +7256,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 max_value)
{
+ struct uclamp_se *uc_se;
struct task_group *tg;
int ret = -EINVAL;

@@ -7236,6 +7274,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (tg->uclamp[UCLAMP_MIN].value > max_value)
goto out;

+ /* Update TG's reference count */
+ uc_se = &tg->uclamp[UCLAMP_MAX];
+ ret = uclamp_group_get(NULL, UCLAMP_MAX, uc_se, max_value);
+
out:
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);
--
2.17.1


2018-07-16 08:31:13

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

When a task's util_clamp value is configured via sched_setattr(2), this
value has to be properly accounted in the corresponding clamp group
every time the task is enqueued and dequeued. When cgroups are also in
use, per-task clamp values have to be aggregated to those of the CPU's
controller's Task Group (TG) in which the task is currently living.

Let's update uclamp_cpu_get() to provide aggregation between the task
and the TG clamp values. Every time a task is enqueued, it will be
accounted in the clamp_group which defines the smaller clamp between the
task specific value and its TG value.

This also mimics what already happen for a task's CPU affinity mask when
the task is also living in a cpuset. he overall idea is that cgroup
attributes are always used to restrict the per-task attributes.

Thus, this implementation allows to:

1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to a granted value or clamped at least
to a certain value
2. implements a "nice-like" policy, where tasks are still allowed to
request less then what enforced by their current TG

For this mecanisms to work properly, we need to implement a concept of
"effective" clamp group, which is used to track the currently most
restrictive clamp value each task is subject to.
The effective clamp is computed at enqueue time, by using an additional
task_struct::uclamp_group_id
to keep track of the clamp group in which each task is currently
accounted into. This solution allows to update task constrains on
demand, only when they became RUNNABLE, to always get the least
restrictive clamp depending on the current TG's settings.

This solution allows also to better decouple the slow-path, where task
and task group clamp values are updated, from the fast-path, where the
most appropriate clamp value is tracked by refcounting clamp groups.

For consistency purposes, as well as to properly inform userspace, the
sched_getattr(2) call is updated to always return the properly
aggregated constrains as described above. This will also make
sched_getattr(2) a convenient userpace API to know the utilization
constraints enforced on a task by the cgroup's CPU controller.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 +-
3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 260aa8d3fca9..5dd76a27ec17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -676,6 +676,8 @@ struct task_struct {
struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
+ /* Clamp group the task is currently accounted into */
+ int uclamp_group_id[UCLAMP_CNT];
/* Utlization clamp values for this task */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04e758224e22..50613d3d5b83 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -971,8 +971,15 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
* @rq: the CPU's rq where the clamp group has to be reference counted
* @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
*
- * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
- * the task's uclamp.group_id is reference counted on that CPU.
+ * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group,
+ * among the task specific and that of the task's cgroup one, is reference
+ * counted on that CPU.
+ *
+ * Since the CPUs reference counted clamp group can be either that of the task
+ * or of its cgroup, we keep track of the reference counted clamp group by
+ * storing its index (group_id) into the task's task_struct::uclamp_group_id.
+ * This group index will then be used at task's dequeue time to release the
+ * correct refcount.
*/
static inline void uclamp_cpu_get_id(struct task_struct *p,
struct rq *rq, int clamp_id)
@@ -982,18 +989,30 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
int clamp_value;
int group_id;

- /* No task specific clamp values: nothing to do */
group_id = p->uclamp[clamp_id].group_id;
+ clamp_value = p->uclamp[clamp_id].value;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Use TG's clamp value to limit task specific values */
+ if (group_id == UCLAMP_NONE ||
+ clamp_value >= task_group(p)->uclamp[clamp_id].value) {
+ clamp_value = task_group(p)->uclamp[clamp_id].value;
+ group_id = task_group(p)->uclamp[clamp_id].group_id;
+ }
+#else
+ /* No task specific clamp values: nothing to do */
if (group_id == UCLAMP_NONE)
return;
+#endif

/* Reference count the task into its current group_id */
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;

+ /* Track the effective clamp group */
+ p->uclamp_group_id[clamp_id] = group_id;
+
/* Force clamp update on idle exit */
uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
if (clamp_id == UCLAMP_MAX)
uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
@@ -1031,7 +1050,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
int group_id;

/* No task specific clamp values: nothing to do */
- group_id = p->uclamp[clamp_id].group_id;
+ group_id = p->uclamp_group_id[clamp_id];
if (group_id == UCLAMP_NONE)
return;

@@ -1039,6 +1058,9 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks -= 1;

+ /* Flag the task as not affecting any clamp index */
+ p->uclamp_group_id[clamp_id] = UCLAMP_NONE;
+
/* If this is not the last task, no updates are required */
if (uc_grp[group_id].tasks > 0)
return;
@@ -2848,6 +2870,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif

#ifdef CONFIG_UCLAMP_TASK
+ memset(&p->uclamp_group_id, UCLAMP_NONE, sizeof(p->uclamp_group_id));
p->uclamp[UCLAMP_MIN].value = 0;
p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NONE;
p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
@@ -5437,6 +5460,13 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
#ifdef CONFIG_UCLAMP_TASK
attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Use cgroup enforced clamps to restrict task specific clamps */
+ if (task_group(p)->uclamp[UCLAMP_MIN].value < attr.sched_util_min)
+ attr.sched_util_min = task_group(p)->uclamp[UCLAMP_MIN].value;
+ if (task_group(p)->uclamp[UCLAMP_MAX].value < attr.sched_util_max)
+ attr.sched_util_max = task_group(p)->uclamp[UCLAMP_MAX].value;
+#endif
#endif

rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1471a23e8f57..e3d5a2bc2f6c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2220,7 +2220,7 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
*/
static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
{
- return (p->uclamp[clamp_id].group_id != UCLAMP_NONE);
+ return (p->uclamp_group_id[clamp_id] != UCLAMP_NONE);
}

/**
--
2.17.1


2018-07-16 08:31:17

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 11/12] sched/core: uclamp: update CPU's refcount on TG's clamp changes

When a task group refcounts a new clamp group, we need to ensure that
the new clamp values are immediately enforced to all its tasks which are
currently RUNNABLE. This is to ensure that all currently RUNNABLE task
are boosted and/or clamped as requested as soon as possible.

Let's ensure that, whenever a new clamp group is refcounted by a task
group, all its RUNNABLE tasks are correctly accounted in their
respective CPUs. We do that by slightly refactoring uclamp_group_get()
to get an additional parameter *cgroup_subsys_state which, when
provided, it's used to walk the list of tasks in the correspond TGs and
update the RUNNABLE ones.

This is a "brute force" solution which allows to reuse the same refcount
update code already used by the per-task API. That's also the only way
to ensure a prompt enforcement of new clamp constraints on RUNNABLE
tasks, as soon as a task group attribute is tweaked.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 42 ++++++++++++++++++++++++++++++++++--------
1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 50613d3d5b83..42cff5ffddae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1198,21 +1198,43 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
}

+static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int group_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /* Update clamp groups for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id, group_id);
+ css_task_iter_end(&it);
+}
+
/**
* uclamp_group_get: increase the reference count for a clamp group
* @p: the task which clamp value must be tracked
- * @clamp_id: the clamp index affected by the task
- * @uc_se: the utilization clamp data for the task
- * @clamp_value: the new clamp value for the task
+ * @css: the task group which clamp value must be tracked
+ * @clamp_id: the clamp index affected by the task (group)
+ * @uc_se: the utilization clamp data for the task (group)
+ * @clamp_value: the new clamp value for the task (group)
*
* Each time a task changes its utilization clamp value, for a specified clamp
* index, we need to find an available clamp group which can be used to track
* this new clamp value. The corresponding clamp group index will be used by
* the task to reference count the clamp value on CPUs while enqueued.
*
+ * When the cgroup's cpu controller utilization clamping support is enabled,
+ * each task group has a set of clamp values which are used to restrict the
+ * corresponding task specific clamp values.
+ * When a clamp value for a task group is changed, all the (active) tasks
+ * belonging to that task group must be update to ensure they are refcounting
+ * the correct CPU's clamp value.
+ *
* Return: -ENOSPC if there are no available clamp groups, 0 on success.
*/
static inline int uclamp_group_get(struct task_struct *p,
+ struct cgroup_subsys_state *css,
int clamp_id, struct uclamp_se *uc_se,
unsigned int clamp_value)
{
@@ -1240,6 +1262,10 @@ static inline int uclamp_group_get(struct task_struct *p,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);

+ /* Newly created TG don't have tasks assigned */
+ if (css)
+ uclamp_group_get_tg(css, clamp_id, next_group_id);
+
/* Update CPU's clamp group refcounts of RUNNABLE task */
if (p)
uclamp_task_update_active(p, clamp_id, next_group_id);
@@ -1307,7 +1333,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
uc_se->value = parent->uclamp[clamp_id].value;
uc_se->group_id = UCLAMP_NONE;

- if (uclamp_group_get(NULL, clamp_id, uc_se,
+ if (uclamp_group_get(NULL, NULL, clamp_id, uc_se,
parent->uclamp[clamp_id].value)) {
ret = 0;
goto out;
@@ -1362,12 +1388,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,

/* Update min utilization clamp */
uc_se = &p->uclamp[UCLAMP_MIN];
- retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
+ retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se,
attr->sched_util_min);

/* Update max utilization clamp */
uc_se = &p->uclamp[UCLAMP_MAX];
- retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
+ retval |= uclamp_group_get(p, NULL, UCLAMP_MAX, uc_se,
attr->sched_util_max);

mutex_unlock(&uclamp_mutex);
@@ -7274,7 +7300,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,

/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MIN];
- ret = uclamp_group_get(NULL, UCLAMP_MIN, uc_se, min_value);
+ ret = uclamp_group_get(NULL, css, UCLAMP_MIN, uc_se, min_value);

out:
rcu_read_unlock();
@@ -7306,7 +7332,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,

/* Update TG's reference count */
uc_se = &tg->uclamp[UCLAMP_MAX];
- ret = uclamp_group_get(NULL, UCLAMP_MAX, uc_se, max_value);
+ ret = uclamp_group_get(NULL, css, UCLAMP_MAX, uc_se, max_value);

out:
rcu_read_unlock();
--
2.17.1


2018-07-16 08:31:33

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 04/12] sched/core: uclamp: update CPU's refcount on clamp changes

Utilization clamp values enforced on a CPU by a task can be updated at
run-time, for example via a sched_setattr syscall, while a task is
currently RUNNABLE on that CPU. In these cases, the task can be already
refcounting a clamp group for its CPU and thus we need to update this
reference to ensure the new constraints are immediately enforced.

Since a clamp value change always implies a clamp group refcount update,
this patch hooks into the clamp group refcount getter to trigger a CPU
refcount syncup. Such a syncup is required only by currently RUNNABLE
tasks which are also referencing at least one valid clamp group.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 45 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 94 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d1969931fea6..b2424eea7990 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1082,6 +1082,52 @@ static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
uclamp_cpu_put_id(p, rq, clamp_id);
}

+/**
+ * uclamp_task_update_active: update the clamp group of a RUNNABLE task
+ * @p: the task which clamp groups must be updated
+ * @clamp_id: the clamp index to consider
+ * @group_id: the clamp group to update
+ *
+ * Each time the clamp value of a task group is changed, the old and new clamp
+ * groups have to be updated for each CPU containing a RUNNABLE task belonging
+ * to this tasks group. Sleeping tasks are not updated since they will be
+ * enqueued with the proper clamp group index at their next activation.
+ */
+static inline void
+uclamp_task_update_active(struct task_struct *p, int clamp_id, int group_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) RQ of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * The setting of the clamp group is serialized by task_rq_lock().
+ * Thus, if the task's task_struct is not referencing a valid group
+ * index, then that task is not yet RUNNABLE and it's going to be
+ * enqueued with the proper clamp group value.
+ */
+ if (!uclamp_task_active(p))
+ goto done;
+
+ /* Release p's currently referenced clamp group */
+ uclamp_cpu_put_id(p, rq, clamp_id);
+
+ /* Get p's new clamp group */
+ uclamp_cpu_get_id(p, rq, clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
/**
* uclamp_group_put: decrease the reference count for a clamp group
* @clamp_id: the clamp index which was affected by a task group
@@ -1150,6 +1196,9 @@ static inline int uclamp_group_get(struct task_struct *p,
uc_map[next_group_id].se_count += 1;
raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);

+ /* Update CPU's clamp group refcounts of RUNNABLE task */
+ uclamp_task_update_active(p, clamp_id, next_group_id);
+
/* Release the previous clamp group */
uclamp_group_put(clamp_id, prev_group_id);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65bf9ebacd83..b471d2222410 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2199,6 +2199,51 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
{
return uc_grp[group_id].tasks > 0;
}
+
+/**
+ * uclamp_task_affects: check if a task affects a utilization clamp
+ * @p: the task to consider
+ * @clamp_id: the utilization clamp to check
+ *
+ * A task affects a clamp index if:
+ * - it's currently enqueued on a CPU
+ * - it references a valid clamp group index for the specified clamp index
+ *
+ * Return: true if p currently affects the specified clamp_id
+ */
+static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
+{
+ return (p->uclamp[clamp_id].group_id != UCLAMP_NONE);
+}
+
+/**
+ * uclamp_task_active: check if a task is currently clamping a CPU
+ * @p: the task to check
+ *
+ * A task actively affects the utilization clamp of a CPU if:
+ * - it's currently enqueued or running on that CPU
+ * - it's refcounted in at least one clamp group of that CPU
+ *
+ * Return: true if p is currently clamping the utilization of its CPU.
+ */
+static inline bool uclamp_task_active(struct task_struct *p)
+{
+ struct rq *rq = task_rq(p);
+ int clamp_id;
+
+ lockdep_assert_held(&p->pi_lock);
+ lockdep_assert_held(&rq->lock);
+
+ if (!task_on_rq_queued(p) && !p->on_cpu)
+ return false;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (uclamp_task_affects(p, clamp_id))
+ return true;
+ }
+
+ return false;
+}
#endif /* CONFIG_UCLAMP_TASK */

#ifdef CONFIG_CPU_FREQ
--
2.17.1


2018-07-16 08:31:43

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX

When a util_max clamped task sleeps, its clamp constraints are removed
from the CPU. However, the blocked utilization on that CPU can still be
higher than the max clamp value enforced while that task was running.
This max clamp removal when a CPU is going to be idle could thus allow
unwanted CPU frequency increases, right while the task is not running.

This can happen, for example, where there is another (smaller) task
running on a different CPU of the same frequency domain.
In this case, when we aggregates the utilization of all the CPUs in a
shared frequency domain, schedutil can still see the full non clamped
blocked utilization of all the CPUs and thus eventually increase the
frequency.

Let's fix this by using:

uclamp_cpu_put_id(UCLAMP_MAX)
uclamp_cpu_update(last_clamp_value)

to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition. Thus, while a CPU is idle, we can still enforce the last used
clamp value for it.

To the contrary, we do not track any UCLAMP_MIN since, while a CPU is
idle, we don't want to enforce any minimum frequency
Indeed, we relay just on blocked load decay to smoothly reduce the
frequency.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b2424eea7990..0cb6e0aa4faa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -930,7 +930,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
* For the specified clamp index, this method computes the new CPU utilization
* clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
*/
-static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
+static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
+ unsigned int last_clamp_value)
{
struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
int max_value = UCLAMP_NONE;
@@ -948,6 +949,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
if (max_value >= SCHED_CAPACITY_SCALE)
break;
}
+
+ /*
+ * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
+ * task, we keep the CPU clamped to the last task's clamp value.
+ * This avoids frequency spikes to MAX when one CPU, with an high
+ * blocked utilization, sleeps and another CPU, in the same frequency
+ * domain, do not see anymore the clamp on the first CPU.
+ */
+ if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NONE) {
+ rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
+ max_value = last_clamp_value;
+ }
+
rq->uclamp.value[clamp_id] = max_value;
}

@@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
uc_grp = &rq->uclamp.group[clamp_id][0];
uc_grp[group_id].tasks += 1;

+ /* Force clamp update on idle exit */
+ uc_cpu = &rq->uclamp;
+ clamp_value = p->uclamp[clamp_id].value;
+ if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
+ if (clamp_id == UCLAMP_MAX)
+ uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
+ uc_cpu->value[clamp_id] = clamp_value;
+ return;
+ }
+
/*
* If this is the new max utilization clamp value, then we can update
* straight away the CPU clamp value. Otherwise, the current CPU clamp
* value is still valid and we are done.
*/
- uc_cpu = &rq->uclamp;
- clamp_value = p->uclamp[clamp_id].value;
if (uc_cpu->value[clamp_id] < clamp_value)
uc_cpu->value[clamp_id] = clamp_value;
}
@@ -1028,7 +1050,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
uc_cpu = &rq->uclamp;
clamp_value = uc_grp[group_id].value;
if (clamp_value >= uc_cpu->value[clamp_id])
- uclamp_cpu_update(rq, clamp_id);
+ uclamp_cpu_update(rq, clamp_id, clamp_value);
}

/**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1207add36478..7e4f10c507b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,8 @@ struct uclamp_group {
* values, i.e. no min/max clamping at all.
*/
struct uclamp_cpu {
+#define UCLAMP_FLAG_IDLE 0x01
+ int flags;
int value[UCLAMP_CNT];
struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
};
--
2.17.1


2018-07-16 08:32:28

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

The utilization is a well defined property of tasks and CPUs with an
in-kernel representation based on power-of-two values.
The current representation, in the [0..SCHED_CAPACITY_SCALE] range,
allows efficient computations in hot-paths and a sufficient fixed point
arithmetic precision.
However, the utilization values range is still an implementation detail
which is also possibly subject to changes in the future.

Since we don't want to commit new user-space APIs to any in-kernel
implementation detail, let's add an abstraction layer on top of the APIs
used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's
cpu.util_{min,max} attributes.

We do that by adding a couple of conversion function which can be used
to conveniently transform utilization/capacity values from/to the internal
SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage
in the standard [0..100] range.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/cgroup-v2.rst | 6 +++---
include/linux/sched.h | 20 ++++++++++++++++++++
include/uapi/linux/sched/types.h | 14 ++++++++------
kernel/sched/core.c | 18 ++++++++++++------
4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 328c011cc105..08b8062e55cd 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -973,7 +973,7 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no bandwidth boosting.

- The minimum utilization in the range [0, 1023].
+ The minimum percentage of utilization in the range [0, 100].

This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
@@ -981,9 +981,9 @@ All time durations are in microseconds.

cpu.util_max
A read-write single value file which exists on non-root cgroups.
- The default is "1023". i.e. no bandwidth clamping
+ The default is "100". i.e. no bandwidth clamping

- The maximum utilization in the range [0, 1023].
+ The maximum percentage of utilization in the range [0, 100].

This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dd76a27ec17..f5970903c187 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -321,6 +321,26 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+static inline unsigned int scale_from_percent(unsigned int pct)
+{
+ WARN_ON(pct > 100);
+
+ return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
+}
+
+static inline unsigned int scale_to_percent(unsigned int value)
+{
+ unsigned int rounding = 0;
+
+ WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
+
+ /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
+ if (likely((value & 0xFF) && ~(value & 0x700)))
+ rounding = 1;
+
+ return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE));
+}
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 7421cd25354d..e2c2acb1c6af 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -84,15 +84,17 @@ struct sched_param {
*
* @sched_util_min represents the minimum utilization
* @sched_util_max represents the maximum utilization
+ * @sched_util_min represents the minimum utilization percentage
+ * @sched_util_max represents the maximum utilization percentage
*
- * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
- * represents the percentage of CPU time used by a task when running at the
- * maximum frequency on the highest capacity CPU of the system. Thus, for
- * example, a 20% utilization task is a task running for 2ms every 10ms.
+ * Utilization is a value in the range [0..100] which represents the
+ * percentage of CPU time used by a task when running at the maximum frequency
+ * on the highest capacity CPU of the system. Thus, for example, a 20%
+ * utilization task is a task running for 2ms every 10ms.
*
- * A task with a min utilization value bigger then 0 is more likely to be
+ * A task with a min utilization value bigger then 0% is more likely to be
* scheduled on a CPU which can provide that bandwidth.
- * A task with a max utilization value smaller then 1024 is more likely to be
+ * A task with a max utilization value smaller then 100% is more likely to be
* scheduled on a CPU which do not provide more then the required bandwidth.
*/
struct sched_attr {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42cff5ffddae..da7b8630cc8d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1381,7 +1381,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,

if (attr->sched_util_min > attr->sched_util_max)
return -EINVAL;
- if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ if (attr->sched_util_max > 100)
return -EINVAL;

mutex_lock(&uclamp_mutex);
@@ -1389,12 +1389,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
/* Update min utilization clamp */
uc_se = &p->uclamp[UCLAMP_MIN];
retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se,
- attr->sched_util_min);
+ scale_from_percent(attr->sched_util_min));

/* Update max utilization clamp */
uc_se = &p->uclamp[UCLAMP_MAX];
retval |= uclamp_group_get(p, NULL, UCLAMP_MAX, uc_se,
- attr->sched_util_max);
+ scale_from_percent(attr->sched_util_max));

mutex_unlock(&uclamp_mutex);

@@ -5493,6 +5493,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
if (task_group(p)->uclamp[UCLAMP_MAX].value < attr.sched_util_max)
attr.sched_util_max = task_group(p)->uclamp[UCLAMP_MAX].value;
#endif
+ attr.sched_util_min = scale_to_percent(attr.sched_util_min);
+ attr.sched_util_max = scale_to_percent(attr.sched_util_max);
#endif

rcu_read_unlock();
@@ -7284,8 +7286,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct task_group *tg;
int ret = -EINVAL;

- if (min_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (min_value > 100)
return -ERANGE;
+ min_value = scale_from_percent(min_value);

mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7316,8 +7320,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
struct task_group *tg;
int ret = -EINVAL;

- if (max_value > SCHED_CAPACITY_SCALE)
+ /* Check range and scale to internal representation */
+ if (max_value > 100)
return -ERANGE;
+ max_value = scale_from_percent(max_value);

mutex_lock(&uclamp_mutex);
rcu_read_lock();
@@ -7352,7 +7358,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
util_clamp = tg->uclamp[clamp_id].value;
rcu_read_unlock();

- return util_clamp;
+ return scale_to_percent(util_clamp);
}

static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
--
2.17.1


2018-07-17 13:04:56

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Add utilization clamping support

On Mon, Jul 16, 2018 at 09:28:54AM +0100, Patrick Bellasi wrote:
> This is a respin of:
>
> https://lore.kernel.org/lkml/[email protected]
>
> which addresses all the feedbacks collected from the LKML discussion as well
> as during the presentation at last OSPM Summit:
>
> https://www.youtube.com/watch?v=0Yv9smm9i78
>
> Further comments and feedbacks are more than welcome!
>
[...]
>
> Patrick Bellasi (12):
> sched/core: uclamp: extend sched_setattr to support utilization
> clamping
> sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
> sched/core: uclamp: add CPU's clamp groups accounting
> sched/core: uclamp: update CPU's refcount on clamp changes
> sched/cpufreq: uclamp: add utilization clamping for FAIR tasks
> sched/cpufreq: uclamp: add utilization clamping for RT tasks
> sched/core: uclamp: enforce last task UCLAMP_MAX
> sched/core: uclamp: extend cpu's cgroup controller
> sched/core: uclamp: map TG's clamp values into CPU's clamp groups
> sched/core: uclamp: use TG's clamps to restrict Task's clamps
> sched/core: uclamp: update CPU's refcount on TG's clamp changes
> sched/core: uclamp: use percentage clamp values
>
> Documentation/admin-guide/cgroup-v2.rst | 25 +
> include/linux/sched.h | 53 ++
> include/uapi/linux/sched.h | 4 +-
> include/uapi/linux/sched/types.h | 66 +-
> init/Kconfig | 63 ++
> kernel/sched/core.c | 876 ++++++++++++++++++++++++

While I'm reviewing these patches, I had a quick thought. core.c is already
7k+ lines. Based on this diffstat, does it make sense for uclamp to be in its
own kernel/sched/uclamp.c file?

thanks,

- Joel


2018-07-17 13:42:44

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 00/12] Add utilization clamping support

On 17-Jul 06:03, Joel Fernandes wrote:
> On Mon, Jul 16, 2018 at 09:28:54AM +0100, Patrick Bellasi wrote:
> > Documentation/admin-guide/cgroup-v2.rst | 25 +
> > include/linux/sched.h | 53 ++
> > include/uapi/linux/sched.h | 4 +-
> > include/uapi/linux/sched/types.h | 66 +-
> > init/Kconfig | 63 ++
> > kernel/sched/core.c | 876 ++++++++++++++++++++++++
>
> While I'm reviewing these patches, I had a quick thought. core.c is already
> 7k+ lines. Based on this diffstat, does it make sense for uclamp to be in its
> own kernel/sched/uclamp.c file?

Good point.

I've added it to core.c because it's logically part of the core
scheduler and we have some calls which are part of the fast path and
thus we want to avoid function calls.

I guess that, provided we can rely on LTOs, we can try to move it into
a separate file. Let see what Ingo and Peter thing about this.

> thanks,
>
> - Joel

--
#include <best/regards.h>

Patrick Bellasi

2018-07-17 17:52:25

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support utilization clamping

On Mon, Jul 16, 2018 at 09:28:55AM +0100, Patrick Bellasi wrote:
> The SCHED_DEADLINE scheduling class provides an advanced and formal
> model to define tasks requirements which can be translated into proper
> decisions for both task placements and frequencies selections.
> Other classes have a more simplified model which is essentially based on
> the relatively simple concept of POSIX priorities.
>
> Such a simple priority based model however does not allow to exploit
> some of the most advanced features of the Linux scheduler like, for
> example, driving frequencies selection via the schedutil cpufreq
> governor. However, also for non SCHED_DEADLINE tasks, it's still
> interesting to define tasks properties which can be used to better
> support certain scheduler decisions.
>
> Utilization clamping aims at exposing to user-space a new set of
> per-task attributes which can be used to provide the scheduler with some
> hints about the expected/required utilization for a task.
> This will allow to implement a more advanced per-task frequency control
> mechanism which is not based just on a "passive" measured task
> utilization but on a more "active" approach. For example, it could be
> possible to boost interactive tasks, thus getting better performance, or
> cap background tasks, thus being more energy efficient.
> Ultimately, such a mechanism can be considered similar to the cpufreq's
> powersave, performance and userspace governor but with a much fine
> grained and per-task control.
>
> Let's introduce a new API to set utilization clamping values for a
> specified task by extending sched_setattr, a syscall which already
> allows to define task specific properties for different scheduling
> classes.
> Specifically, a new pair of attributes allows to specify a minimum and
> maximum utilization which the scheduler should consider for a task.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 16 ++++++++
> include/uapi/linux/sched.h | 4 +-
> include/uapi/linux/sched/types.h | 64 +++++++++++++++++++++++++++-----
> init/Kconfig | 19 ++++++++++
> kernel/sched/core.c | 39 +++++++++++++++++++
> 5 files changed, 132 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 43731fe51c97..fd8495723088 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -279,6 +279,17 @@ struct vtime {
> u64 gtime;
> };
>
> +enum uclamp_id {
> + /* No utilization clamp group assigned */
> + UCLAMP_NONE = -1,
> +
> + UCLAMP_MIN = 0, /* Minimum utilization */
> + UCLAMP_MAX, /* Maximum utilization */
> +
> + /* Utilization clamping constraints count */
> + UCLAMP_CNT
> +};
> +
> struct sched_info {
> #ifdef CONFIG_SCHED_INFO
> /* Cumulative counters: */
> @@ -649,6 +660,11 @@ struct task_struct {
> #endif
> struct sched_dl_entity dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utlization clamp values for this task */
> + int uclamp[UCLAMP_CNT];
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 22627f80063e..c27d6e81517b 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -50,9 +50,11 @@
> #define SCHED_FLAG_RESET_ON_FORK 0x01
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> +#define SCHED_FLAG_UTIL_CLAMP 0x08
> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> SCHED_FLAG_RECLAIM | \
> - SCHED_FLAG_DL_OVERRUN)
> + SCHED_FLAG_DL_OVERRUN | \
> + SCHED_FLAG_UTIL_CLAMP)
>
> #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 10fbb8031930..7421cd25354d 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -21,8 +21,33 @@ struct sched_param {
> * the tasks may be useful for a wide variety of application fields, e.g.,
> * multimedia, streaming, automation and control, and many others.
> *
> - * This variant (sched_attr) is meant at describing a so-called
> - * sporadic time-constrained task. In such model a task is specified by:
> + * This variant (sched_attr) allows to define additional attributes to
> + * improve the scheduler knowledge about task requirements.
> + *
> + * Scheduling Class Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes specifies the
> + * scheduling policy and relative POSIX attributes:
> + *
> + * @size size of the structure, for fwd/bwd compat.
> + *
> + * @sched_policy task's scheduling policy
> + * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
> + * @sched_priority task's static priority (SCHED_FIFO/RR)
> + *
> + * Certain more advanced scheduling features can be controlled by a
> + * predefined set of flags via the attribute:
> + *
> + * @sched_flags for customizing the scheduler behaviour
> + *
> + * Sporadic Time-Constrained Tasks Attributes
> + * ==========================================
> + *
> + * A subset of sched_attr attributes allows to describe a so-called
> + * sporadic time-constrained task.
> + *
> + * In such model a task is specified by:
> * - the activation period or minimum instance inter-arrival time;
> * - the maximum (or average, depending on the actual scheduling
> * discipline) computation time of all instances, a.k.a. runtime;
> @@ -34,14 +59,8 @@ struct sched_param {
> * than the runtime and must be completed by time instant t equal to
> * the instance activation time + the deadline.
> *
> - * This is reflected by the actual fields of the sched_attr structure:
> + * This is reflected by the following fields of the sched_attr structure:
> *
> - * @size size of the structure, for fwd/bwd compat.
> - *
> - * @sched_policy task's scheduling policy
> - * @sched_flags for customizing the scheduler behaviour
> - * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
> - * @sched_priority task's static priority (SCHED_FIFO/RR)
> * @sched_deadline representative of the task's deadline
> * @sched_runtime representative of the task's runtime
> * @sched_period representative of the task's period
> @@ -53,6 +72,28 @@ struct sched_param {
> * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
> * only user of this new interface. More information about the algorithm
> * available in the scheduling class file or in Documentation/.
> + *
> + * Task Utilization Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes allows to specify the utilization which
> + * should be expected by a task. These attributes allows to inform the
> + * scheduler about the utilization boundaries within which is safe to schedule
> + * the task. These utilization boundaries are valuable information to support
> + * scheduler decisions on both task placement and frequencies selection.
> + *
> + * @sched_util_min represents the minimum utilization
> + * @sched_util_max represents the maximum utilization
> + *
> + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> + * represents the percentage of CPU time used by a task when running at the
> + * maximum frequency on the highest capacity CPU of the system. Thus, for
> + * example, a 20% utilization task is a task running for 2ms every 10ms.
> + *
> + * A task with a min utilization value bigger then 0 is more likely to be
> + * scheduled on a CPU which can provide that bandwidth.
> + * A task with a max utilization value smaller then 1024 is more likely to be
> + * scheduled on a CPU which do not provide more then the required bandwidth.
> */
> struct sched_attr {
> __u32 size;
> @@ -70,6 +111,11 @@ struct sched_attr {
> __u64 sched_runtime;
> __u64 sched_deadline;
> __u64 sched_period;
> +
> + /* Utilization hints */
> + __u32 sched_util_min;
> + __u32 sched_util_max;
> +
> };
>
> #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 041f3a022122..1d45a6877d6f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> config GENERIC_SCHED_CLOCK
> bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> + bool "Enable utilization clamping for RT/FAIR tasks"
> + depends on CPU_FREQ_GOV_SCHEDUTIL

Does it make sense to depend on this? One could turn off schedutil and then
uclamp can't be used for any other purpose (big.LITTLE task placement etc)?

thanks,

- Joel


2018-07-17 18:05:34

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support utilization clamping

On Mon, Jul 16, 2018 at 09:28:55AM +0100, Patrick Bellasi wrote:
> The SCHED_DEADLINE scheduling class provides an advanced and formal
> model to define tasks requirements which can be translated into proper
> decisions for both task placements and frequencies selections.
> Other classes have a more simplified model which is essentially based on
> the relatively simple concept of POSIX priorities.
>
> Such a simple priority based model however does not allow to exploit
> some of the most advanced features of the Linux scheduler like, for
> example, driving frequencies selection via the schedutil cpufreq
> governor. However, also for non SCHED_DEADLINE tasks, it's still
> interesting to define tasks properties which can be used to better
> support certain scheduler decisions.
>
> Utilization clamping aims at exposing to user-space a new set of
> per-task attributes which can be used to provide the scheduler with some
> hints about the expected/required utilization for a task.
> This will allow to implement a more advanced per-task frequency control
> mechanism which is not based just on a "passive" measured task
> utilization but on a more "active" approach. For example, it could be
> possible to boost interactive tasks, thus getting better performance, or
> cap background tasks, thus being more energy efficient.
> Ultimately, such a mechanism can be considered similar to the cpufreq's
> powersave, performance and userspace governor but with a much fine
> grained and per-task control.
>
> Let's introduce a new API to set utilization clamping values for a
> specified task by extending sched_setattr, a syscall which already
> allows to define task specific properties for different scheduling
> classes.
> Specifically, a new pair of attributes allows to specify a minimum and
> maximum utilization which the scheduler should consider for a task.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 16 ++++++++
> include/uapi/linux/sched.h | 4 +-
> include/uapi/linux/sched/types.h | 64 +++++++++++++++++++++++++++-----
> init/Kconfig | 19 ++++++++++
> kernel/sched/core.c | 39 +++++++++++++++++++
> 5 files changed, 132 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 43731fe51c97..fd8495723088 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -279,6 +279,17 @@ struct vtime {
> u64 gtime;
> };
>
> +enum uclamp_id {
> + /* No utilization clamp group assigned */
> + UCLAMP_NONE = -1,
> +
> + UCLAMP_MIN = 0, /* Minimum utilization */
> + UCLAMP_MAX, /* Maximum utilization */
> +
> + /* Utilization clamping constraints count */
> + UCLAMP_CNT
> +};
> +
> struct sched_info {
> #ifdef CONFIG_SCHED_INFO
> /* Cumulative counters: */
> @@ -649,6 +660,11 @@ struct task_struct {
> #endif
> struct sched_dl_entity dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utlization clamp values for this task */
> + int uclamp[UCLAMP_CNT];
> +#endif

Seems a bit wasteful to me. Seems you need 2 values that are in the range
0..1024. Can we not do better with task_struct space usage?

thanks!

- Joel


2018-07-18 08:44:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support utilization clamping

On 17-Jul 10:50, Joel Fernandes wrote:
> On Mon, Jul 16, 2018 at 09:28:55AM +0100, Patrick Bellasi wrote:

[...]

> > diff --git a/init/Kconfig b/init/Kconfig
> > index 041f3a022122..1d45a6877d6f 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> > config GENERIC_SCHED_CLOCK
> > bool
> >
> > +menu "Scheduler features"
> > +
> > +config UCLAMP_TASK
> > + bool "Enable utilization clamping for RT/FAIR tasks"
> > + depends on CPU_FREQ_GOV_SCHEDUTIL
>
> Does it make sense to depend on this? One could turn off schedutil and then
> uclamp can't be used for any other purpose (big.LITTLE task placement etc)?

You right, utilization clamping is _going_ to target tasks placement.
But, the support currently posted in this series is just for OPP
biasing. Thus, it would not make sense to enabled it when schedutil
is not available.

My idea was to keep this dependency while we finalize these bits.
Once we move on to the tasks placement biasing, we will remove this
dependency too.

Does that makes sense?

> thanks,
>
> - Joel

--
#include <best/regards.h>

Patrick Bellasi

2018-07-18 17:03:51

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2 01/12] sched/core: uclamp: extend sched_setattr to support utilization clamping

On Wed, Jul 18, 2018 at 09:42:18AM +0100, Patrick Bellasi wrote:
> On 17-Jul 10:50, Joel Fernandes wrote:
> > On Mon, Jul 16, 2018 at 09:28:55AM +0100, Patrick Bellasi wrote:
>
> [...]
>
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 041f3a022122..1d45a6877d6f 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -583,6 +583,25 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> > > config GENERIC_SCHED_CLOCK
> > > bool
> > >
> > > +menu "Scheduler features"
> > > +
> > > +config UCLAMP_TASK
> > > + bool "Enable utilization clamping for RT/FAIR tasks"
> > > + depends on CPU_FREQ_GOV_SCHEDUTIL
> >
> > Does it make sense to depend on this? One could turn off schedutil and then
> > uclamp can't be used for any other purpose (big.LITTLE task placement etc)?
>
> You right, utilization clamping is _going_ to target tasks placement.
> But, the support currently posted in this series is just for OPP
> biasing. Thus, it would not make sense to enabled it when schedutil
> is not available.
>
> My idea was to keep this dependency while we finalize these bits.
> Once we move on to the tasks placement biasing, we will remove this
> dependency too.
>
> Does that makes sense?

Yes, that's fine with me.

2018-07-19 23:52:29

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi
<[email protected]> wrote:
> Utilization clamping requires each CPU to know which clamp values are
> assigned to tasks that are currently RUNNABLE on that CPU.
> Multiple tasks can be assigned the same clamp value and tasks with
> different clamp values can be concurrently active on the same CPU.
> Thus, a proper data structure is required to support a fast and
> efficient aggregation of the clamp values required by the currently
> RUNNABLE tasks.
>
> For this purpose we use a per-CPU array of reference counters,
> where each slot is used to account how many tasks require a certain
> clamp value are currently RUNNABLE on each CPU.
> Each clamp value corresponds to a "clamp index" which identifies the
> position within the array of reference couters.
>
> :
> (user-space changes) : (kernel space / scheduler)
> :
> SLOW PATH : FAST PATH
> :
> task_struct::uclamp::value : sched/core::enqueue/dequeue
> : cpufreq_schedutil
> :
> +----------------+ +--------------------+ +-------------------+
> | TASK | | CLAMP GROUP | | CPU CLAMPS |
> +----------------+ +--------------------+ +-------------------+
> | | | clamp_{min,max} | | clamp_{min,max} |
> | util_{min,max} | | se_count | | tasks count |
> +----------------+ +--------------------+ +-------------------+
> :
> +------------------> : +------------------->
> group_id = map(clamp_value) : ref_count(group_id)
> :
> :
>
> Let's introduce the support to map tasks to "clamp groups".
> Specifically we introduce the required functions to translate a
> "clamp value" into a clamp's "group index" (group_id).
>
> Only a limited number of (different) clamp values are supported since:
> 1. there are usually only few classes of workloads for which it makes
> sense to boost/limit to different frequencies,
> e.g. background vs foreground, interactive vs low-priority
> 2. it allows a simpler and more memory/time efficient tracking of
> the per-CPU clamp values in the fast path.
>
> The number of possible different clamp values is currently defined at
> compile time. Thus, setting a new clamp value for a task can result into
> a -ENOSPC error in case this will exceed the number of maximum different
> clamp values supported.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 15 ++-
> init/Kconfig | 22 ++++
> kernel/sched/core.c | 300 +++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 330 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fd8495723088..0635e8073cd3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -578,6 +578,19 @@ struct sched_dl_entity {
> struct hrtimer inactive_timer;
> };
>
> +/**
> + * Utilization's clamp group
> + *
> + * A utilization clamp group maps a "clamp value" (value), i.e.
> + * util_{min,max}, to a "clamp group index" (group_id).
> + */
> +struct uclamp_se {
> + /* Utilization constraint for tasks in this group */
> + unsigned int value;
> + /* Utilization clamp group for this constraint */
> + unsigned int group_id;
> +};
> +
> union rcu_special {
> struct {
> u8 blocked;
> @@ -662,7 +675,7 @@ struct task_struct {
>
> #ifdef CONFIG_UCLAMP_TASK
> /* Utlization clamp values for this task */
> - int uclamp[UCLAMP_CNT];
> + struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> diff --git a/init/Kconfig b/init/Kconfig
> index 1d45a6877d6f..0a377ad7c166 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -601,7 +601,29 @@ config UCLAMP_TASK
>
> If in doubt, say N.
>
> +config UCLAMP_GROUPS_COUNT
> + int "Number of different utilization clamp values supported"
> + range 0 127
> + default 2
> + depends on UCLAMP_TASK
> + help
> + This defines the maximum number of different utilization clamp
> + values which can be concurrently enforced for each utilization
> + clamp index (i.e. minimum and maximum utilization).
> +
> + Only a limited number of clamp values are supported because:
> + 1. there are usually only few classes of workloads for which it
> + makes sense to boost/cap for different frequencies,
> + e.g. background vs foreground, interactive vs low-priority.
> + 2. it allows a simpler and more memory/time efficient tracking of
> + the per-CPU clamp values.
> +
> + Set to 0 (default value) to disable the utilization clamping feature.
> +
> + If in doubt, use the default value.
> +
> endmenu
> +
> #
> # For architectures that want to enable the support for NUMA-affine scheduler
> # balancing logic:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6a42cd86b6f3..50e749067df5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -740,25 +740,309 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
>
> #ifdef CONFIG_UCLAMP_TASK
> +/**
> + * uclamp_mutex: serializes updates of utilization clamp values
> + *
> + * A utilization clamp value update is usually triggered from a user-space
> + * process (slow-path) but it requires a synchronization with the scheduler's
> + * (fast-path) enqueue/dequeue operations.
> + * While the fast-path synchronization is protected by RQs spinlock, this
> + * mutex ensures that we sequentially serve user-space requests.
> + */
> +static DEFINE_MUTEX(uclamp_mutex);
> +
> +/**
> + * uclamp_map: reference counts a utilization "clamp value"
> + * @value: the utilization "clamp value" required
> + * @se_count: the number of scheduling entities requiring the "clamp value"
> + * @se_lock: serialize reference count updates by protecting se_count
> + */
> +struct uclamp_map {
> + int value;
> + int se_count;
> + raw_spinlock_t se_lock;
> +};
> +
> +/**
> + * uclamp_maps: maps each SEs "clamp value" into a CPUs "clamp group"
> + *
> + * Since only a limited number of different "clamp values" are supported, we
> + * need to map each different clamp value into a "clamp group" (group_id) to
> + * be used by the per-CPU accounting in the fast-path, when tasks are
> + * enqueued and dequeued.
> + * We also support different kind of utilization clamping, min and max
> + * utilization for example, each representing what we call a "clamp index"
> + * (clamp_id).
> + *
> + * A matrix is thus required to map "clamp values" to "clamp groups"
> + * (group_id), for each "clamp index" (clamp_id), where:
> + * - rows are indexed by clamp_id and they collect the clamp groups for a
> + * given clamp index
> + * - columns are indexed by group_id and they collect the clamp values which
> + * maps to that clamp group
> + *
> + * Thus, the column index of a given (clamp_id, value) pair represents the
> + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> + *
> + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> + * clamped tasks. Thus we allocate one more slot than the value of
> + * CONFIG_UCLAMP_GROUPS_COUNT.
> + *
> + * Here is the map layout and, right below, how entries are accessed by the
> + * following code.
> + *
> + * uclamp_maps is a matrix of
> + * +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> + * | |
> + * | /---------------+---------------\
> + * | +------------+ +------------+
> + * | / UCLAMP_MIN | value | | value |
> + * | | | se_count |...... | se_count |
> + * | | +------------+ +------------+
> + * +--+ +------------+ +------------+
> + * | | value | | value |
> + * \ UCLAMP_MAX | se_count |...... | se_count |
> + * +-----^------+ +----^-------+
> + * | |
> + * uc_map = + |
> + * &uclamp_maps[clamp_id][0] +
> + * clamp_value =
> + * uc_map[group_id].value
> + */
> +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> + [CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +
> +/**
> + * uclamp_group_available: checks if a clamp group is available
> + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
> + * @group_id: the group index in the given clamp_id
> + *
> + * A clamp group is not free if there is at least one SE which is sing a clamp

Did you mean to say "single clamp"?

> + * value mapped on the specified clamp_id. These SEs are reference counted by
> + * the se_count of a uclamp_map entry.
> + *
> + * Return: true if there are no SE's mapped on the specified clamp
> + * index and group
> + */
> +static inline bool uclamp_group_available(int clamp_id, int group_id)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> +
> + return (uc_map[group_id].value == UCLAMP_NONE);

The usage of UCLAMP_NONE is very confusing to me. It was not used at
all in the patch where it was introduced [1/12], here it's used as a
clamp value and in uclamp_group_find() it's used as group_id. Please
clarify the usage. I also feel UCLAMP_NONE does not really belong to
the uclamp_id enum because other elements there are indexes in
uclamp_maps and this one is a special value. IMHO if both *group_id*
and *value* need a special value (-1) to represent
unused/uninitialized entry it would be better to use different
constants. Maybe UCLAMP_VAL_NONE and UCLAMP_GROUP_NONE?

> +}
> +
> +/**
> + * uclamp_group_init: maps a clamp value on a specified clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
> + * @group_id: the group index to map a given clamp_value
> + * @clamp_value: the utilization clamp value to map
> + *
> + * Initializes a clamp group to track tasks from the fast-path.
> + * Each different clamp value, for a given clamp index (i.e. min/max
> + * utilization clamp), is mapped by a clamp group which index is used by the
> + * fast-path code to keep track of RUNNABLE tasks requiring a certain clamp
> + * value.
> + *
> + */
> +static inline void uclamp_group_init(int clamp_id, int group_id,
> + unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> +
> + uc_map[group_id].value = clamp_value;
> + uc_map[group_id].se_count = 0;
> +}
> +
> +/**
> + * uclamp_group_reset: resets a specified clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
> + * @group_id: the group index to release
> + *
> + * A clamp group can be reset every time there are no more task groups using
> + * the clamp value it maps for a given clamp index.
> + */
> +static inline void uclamp_group_reset(int clamp_id, int group_id)
> +{
> + uclamp_group_init(clamp_id, group_id, UCLAMP_NONE);
> +}
> +
> +/**
> + * uclamp_group_find: finds the group index of a utilization clamp group
> + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
> + * @clamp_value: the utilization clamping value lookup for
> + *
> + * Verify if a group has been assigned to a certain clamp value and return
> + * its index to be used for accounting.
> + *
> + * Since only a limited number of utilization clamp groups are allowed, if no
> + * groups have been assigned for the specified value, a new group is assigned
> + * if possible. Otherwise an error is returned, meaning that an additional clamp
> + * value is not (currently) supported.
> + */
> +static int
> +uclamp_group_find(int clamp_id, unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int free_group_id = UCLAMP_NONE;
> + unsigned int group_id = 0;
> +
> + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> + /* Keep track of first free clamp group */
> + if (uclamp_group_available(clamp_id, group_id)) {
> + if (free_group_id == UCLAMP_NONE)
> + free_group_id = group_id;
> + continue;
> + }
> + /* Return index of first group with same clamp value */
> + if (uc_map[group_id].value == clamp_value)
> + return group_id;
> + }
> + /* Default to first free clamp group */
> + if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)

Is the condition above needed? I think it's always true if you got here.
Also AFAIKT after the for loop you can just do:

return (free_group_id != UCLAMP_NONE) ? free_group_id : -ENOSPC;

> + group_id = free_group_id;
> + /* All clamp group already track different clamp values */
> + if (group_id == UCLAMP_NONE)
> + return -ENOSPC;
> + return group_id;
> +}
> +
> +/**
> + * uclamp_group_put: decrease the reference count for a clamp group
> + * @clamp_id: the clamp index which was affected by a task group
> + * @uc_se: the utilization clamp data for that task group
> + *
> + * When the clamp value for a task group is changed we decrease the reference
> + * count for the clamp group mapping its current clamp value. A clamp group is
> + * released when there are no more task groups referencing its clamp value.
> + */
> +static inline void uclamp_group_put(int clamp_id, int group_id)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + unsigned long flags;
> +
> + /* Ignore SE's not yet attached */
> + if (group_id == UCLAMP_NONE)
> + return;
> +
> + /* Remove SE from this clamp group */
> + raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
> + uc_map[group_id].se_count -= 1;

If uc_map[group_id].se_count was 0 before decrement you end up with
se_count == -1 and no reset for the element.

> + if (uc_map[group_id].se_count == 0)
> + uclamp_group_reset(clamp_id, group_id);
> + raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
> +}
> +
> +/**
> + * uclamp_group_get: increase the reference count for a clamp group
> + * @p: the task which clamp value must be tracked
> + * @clamp_id: the clamp index affected by the task
> + * @uc_se: the utilization clamp data for the task
> + * @clamp_value: the new clamp value for the task
> + *
> + * Each time a task changes its utilization clamp value, for a specified clamp
> + * index, we need to find an available clamp group which can be used to track
> + * this new clamp value. The corresponding clamp group index will be used by
> + * the task to reference count the clamp value on CPUs while enqueued.
> + *
> + * Return: -ENOSPC if there are no available clamp groups, 0 on success.
> + */
> +static inline int uclamp_group_get(struct task_struct *p,
> + int clamp_id, struct uclamp_se *uc_se,
> + unsigned int clamp_value)
> +{
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int prev_group_id = uc_se->group_id;
> + int next_group_id = UCLAMP_NONE;
> + unsigned long flags;
> +
> + /* Lookup for a usable utilization clamp group */
> + next_group_id = uclamp_group_find(clamp_id, clamp_value);
> + if (next_group_id < 0) {
> + pr_err("Cannot allocate more than %d utilization clamp groups\n",
> + CONFIG_UCLAMP_GROUPS_COUNT);
> + return -ENOSPC;
> + }
> +
> + /* Allocate new clamp group for this clamp value */
> + if (uclamp_group_available(clamp_id, next_group_id))
> + uclamp_group_init(clamp_id, next_group_id, clamp_value);
> +
> + /* Update SE's clamp values and attach it to new clamp group */
> + raw_spin_lock_irqsave(&uc_map[next_group_id].se_lock, flags);
> + uc_se->value = clamp_value;
> + uc_se->group_id = next_group_id;
> + uc_map[next_group_id].se_count += 1;
> + raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
> +
> + /* Release the previous clamp group */
> + uclamp_group_put(clamp_id, prev_group_id);
> +
> + return 0;
> +}
> +
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> + struct uclamp_se *uc_se;
> + int retval = 0;
> +
> if (attr->sched_util_min > attr->sched_util_max)
> return -EINVAL;
> if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> return -EINVAL;
>
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);
> +
> + /* Update min utilization clamp */
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
> + attr->sched_util_min);
> +
> + /* Update max utilization clamp */
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
> + attr->sched_util_max);
> +
> + mutex_unlock(&uclamp_mutex);
> +
> + /*
> + * If one of the two clamp values should fail,
> + * let the userspace know.
> + */
> + if (retval)
> + return -ENOSPC;

Maybe a minor issue but this failure is ambiguous. It might mean:
1. no clamp value was updated
2. UCLAMP_MIN was updated but UCLAMP_MAX was not
3. UCLAMP_MAX was updated but UCLAMP_MIN was not

>
> return 0;
> }
> +
> +/**
> + * init_uclamp: initialize data structures required for utilization clamping
> + */
> +static void __init init_uclamp(void)
> +{
> + int clamp_id;
> +
> + mutex_init(&uclamp_mutex);
> +
> + /* Init SE's clamp map */
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + int group_id = 0;
> +
> + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> + uc_map[group_id].value = UCLAMP_NONE;
> + raw_spin_lock_init(&uc_map[group_id].se_lock);
> + }
> + }
> +}
> +
> #else /* CONFIG_UCLAMP_TASK */
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> return -EINVAL;
> }
> +static inline void init_uclamp(void) { }
> #endif /* CONFIG_UCLAMP_TASK */
>
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> @@ -2199,8 +2483,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> #endif
>
> #ifdef CONFIG_UCLAMP_TASK
> - p->uclamp[UCLAMP_MIN] = 0;
> - p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
> + p->uclamp[UCLAMP_MIN].value = 0;
> + p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NONE;
> + p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
> + p->uclamp[UCLAMP_MAX].group_id = UCLAMP_NONE;
> #endif
>
> #ifdef CONFIG_SCHEDSTATS
> @@ -4784,8 +5070,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> attr.sched_nice = task_nice(p);
>
> #ifdef CONFIG_UCLAMP_TASK
> - attr.sched_util_min = p->uclamp[UCLAMP_MIN];
> - attr.sched_util_max = p->uclamp[UCLAMP_MAX];
> + attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
> + attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
> #endif
>
> rcu_read_unlock();
> @@ -6151,6 +6437,8 @@ void __init sched_init(void)
>
> init_schedstats();
>
> + init_uclamp();
> +
> scheduler_running = 1;
> }
>
> --
> 2.17.1
>

2018-07-20 15:13:03

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi Suren,
thanks for the review, all good point... some more comments follow
inline.

On 19-Jul 16:51, Suren Baghdasaryan wrote:
> On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > +/**
> > + * uclamp_group_available: checks if a clamp group is available
> > + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
> > + * @group_id: the group index in the given clamp_id
> > + *
> > + * A clamp group is not free if there is at least one SE which is sing a clamp
>
> Did you mean to say "single clamp"?

No, it's "...at least one SE which is USING a clamp value..."

> > + * value mapped on the specified clamp_id. These SEs are reference counted by
> > + * the se_count of a uclamp_map entry.
> > + *
> > + * Return: true if there are no SE's mapped on the specified clamp
> > + * index and group
> > + */
> > +static inline bool uclamp_group_available(int clamp_id, int group_id)
> > +{
> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> > +
> > + return (uc_map[group_id].value == UCLAMP_NONE);
>
> The usage of UCLAMP_NONE is very confusing to me. It was not used at
> all in the patch where it was introduced [1/12], here it's used as a
> clamp value and in uclamp_group_find() it's used as group_id. Please
> clarify the usage.

Yes, it's meant to represent a "clamp not valid" condition, whatever
it's a "clamp group" or a "clamp value"... perhaps the name can be
improved.

> I also feel UCLAMP_NONE does not really belong to
> the uclamp_id enum because other elements there are indexes in
> uclamp_maps and this one is a special value.

Right, it looks a bit misplaced, I agree. I think I tried to set it
using a #define but there was some issues I don't remember now...
Anyway, I'll give it another go...


> IMHO if both *group_id*
> and *value* need a special value (-1) to represent
> unused/uninitialized entry it would be better to use different
> constants. Maybe UCLAMP_VAL_NONE and UCLAMP_GROUP_NONE?

Yes, maybe we can use a

#define UCLAMP_NOT_VALID -1

and get rid the confusing enum entry.

Will update it on v3.

> > +}

[...]

> > +/**
> > + * uclamp_group_find: finds the group index of a utilization clamp group
> > + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
> > + * @clamp_value: the utilization clamping value lookup for
> > + *
> > + * Verify if a group has been assigned to a certain clamp value and return
> > + * its index to be used for accounting.
> > + *
> > + * Since only a limited number of utilization clamp groups are allowed, if no
> > + * groups have been assigned for the specified value, a new group is assigned
> > + * if possible. Otherwise an error is returned, meaning that an additional clamp
> > + * value is not (currently) supported.
> > + */
> > +static int
> > +uclamp_group_find(int clamp_id, unsigned int clamp_value)
> > +{
> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> > + int free_group_id = UCLAMP_NONE;
> > + unsigned int group_id = 0;
> > +
> > + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> > + /* Keep track of first free clamp group */
> > + if (uclamp_group_available(clamp_id, group_id)) {
> > + if (free_group_id == UCLAMP_NONE)
> > + free_group_id = group_id;
> > + continue;
> > + }
> > + /* Return index of first group with same clamp value */
> > + if (uc_map[group_id].value == clamp_value)
> > + return group_id;
> > + }
> > + /* Default to first free clamp group */
> > + if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
>
> Is the condition above needed? I think it's always true if you got here.
> Also AFAIKT after the for loop you can just do:
>
> return (free_group_id != UCLAMP_NONE) ? free_group_id : -ENOSPC;

Yes, you right... the code above can be simplified!

>
> > + group_id = free_group_id;
> > + /* All clamp group already track different clamp values */
> > + if (group_id == UCLAMP_NONE)
> > + return -ENOSPC;
> > + return group_id;
> > +}

[...]

> > +static inline void uclamp_group_put(int clamp_id, int group_id)
> > +{
> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> > + unsigned long flags;
> > +
> > + /* Ignore SE's not yet attached */
> > + if (group_id == UCLAMP_NONE)
> > + return;
> > +
> > + /* Remove SE from this clamp group */
> > + raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
> > + uc_map[group_id].se_count -= 1;
>
> If uc_map[group_id].se_count was 0 before decrement you end up with
> se_count == -1 and no reset for the element.

Well... this should never happen, otherwise the refcounting is not
working as expected.

Maybe we can add (at least) a debug check and warning, something like:

#ifdef SCHED_DEBUG
if (unlikely(uc_map[group_id].se_count == 0)) {
WARN(1, "invalid clamp group [%d:%d] refcount\n",
clamp_id, group_id);
uc_map[group_id].se_count = 1;
}
#endif

> > + if (uc_map[group_id].se_count == 0)
> > + uclamp_group_reset(clamp_id, group_id);
> > + raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
> > +}
> > +

[...]

> > static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> > {
> > + struct uclamp_se *uc_se;
> > + int retval = 0;
> > +
> > if (attr->sched_util_min > attr->sched_util_max)
> > return -EINVAL;
> > if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > return -EINVAL;
> >
> > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > + mutex_lock(&uclamp_mutex);
> > +
> > + /* Update min utilization clamp */
> > + uc_se = &p->uclamp[UCLAMP_MIN];
> > + retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
> > + attr->sched_util_min);
> > +
> > + /* Update max utilization clamp */
> > + uc_se = &p->uclamp[UCLAMP_MAX];
> > + retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
> > + attr->sched_util_max);
> > +
> > + mutex_unlock(&uclamp_mutex);
> > +
> > + /*
> > + * If one of the two clamp values should fail,
> > + * let the userspace know.
> > + */
> > + if (retval)
> > + return -ENOSPC;
>
> Maybe a minor issue but this failure is ambiguous. It might mean:
> 1. no clamp value was updated
> 2. UCLAMP_MIN was updated but UCLAMP_MAX was not
> 3. UCLAMP_MAX was updated but UCLAMP_MIN was not

That's right, I put a bit of thought on that me too but at the end
I've been convinced that the possibility to use a single syscall to
set both clamps at the same time has some benefits for user-space.

Maybe the current solution can be improved by supporting an (optional)
strict semantic with an in-kernel roll-back in case one of the two
uclamp_group_get should fail.

The strict semantic with roll-back could be controller via an
additional flag, e.g. SCHED_FLAG_UTIL_CLAMP_STRICT.

When the flag is set, either we are able to set both the attributes or
we roll-back. Otherwise, when the flag is not set, we keep the current
behavior. i.e. we set what we can and report an error to notify
userspace that one constraints was not enforced.

The following snippet should implement this semantics:

---8<---

/* Uclamp flags */
#define SCHED_FLAG_UTIL_CLAMP_STRICT 0x11 /* Roll-back on failure */
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x12 /* Update util_min */
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x14 /* Update util_max */
#define SCHED_FLAG_UTIL_CLAMP ( \
SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)

static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
unsigned int uclamp_value_old = 0;
unsigned int uclamp_value;
struct uclamp_se *uc_se;
int retval = 0;

if (attr->sched_util_min > attr->sched_util_max)
return -EINVAL;
if (attr->sched_util_max > 100)
return -EINVAL;

mutex_lock(&uclamp_mutex);

if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN))
goto set_util_max;

uc_se = &p->uclamp[UCLAMP_MIN];
uclamp_value = scale_from_percent(attr->sched_util_min);
if (uc_se->value == uclamp_value)
goto set_util_max;

/* Update min utilization clamp */
uclamp_value_old = uc_se->value;
retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se, uclamp_value);
if (retval &&
attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT)
goto exit_failure;

set_util_max:

if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX))
goto exit_done;

uc_se = &p->uclamp[UCLAMP_MAX];
uclamp_value = scale_from_percent(attr->sched_util_max);
if (uc_se->value == uclamp_value)
goto exit_done;

/* Update max utilization clamp */
if (uclamp_group_get(p, NULL, UCLAMP_MAX,
uc_se, uclamp_value))
goto exit_rollback;

exit_done:
mutex_unlock(&uclamp_mutex);
return retval;

exit_rollback:
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT) {
uclamp_group_get(p, NULL, UCLAMP_MIN,
uc_se, uclamp_value_old);
}
exit_failure:
mutex_unlock(&uclamp_mutex);

return -ENOSPC;
}

---8<---


Could that work better?

The code is maybe a bit more convoluted... but perhaps it can be
improved by encoding it in a loop.


--
#include <best/regards.h>

Patrick Bellasi

2018-07-20 20:25:58

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 03/12] sched/core: uclamp: add CPU's clamp groups accounting

Hi Patrick,

On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi
<[email protected]> wrote:
> Utilization clamping allows to clamp the utilization of a CPU within a
> [util_min, util_max] range. This range depends on the set of currently
> RUNNABLE tasks on a CPU, where each task references two "clamp groups"
> defining the util_min and the util_max clamp values to be considered for
> that task. The clamp value mapped by a clamp group applies to a CPU only
> when there is at least one task RUNNABLE referencing that clamp group.
>
> When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups
> active on that CPU can change. Since each clamp group enforces a
> different utilization clamp value, once the set of these groups changes
> it can be required to re-compute what is the new "aggregated" clamp
> value to apply on that CPU.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This is to ensure that no tasks can affect the performance of other
> co-scheduled tasks which are either more boosted (i.e. with higher
> util_min clamp) or less capped (i.e. with higher util_max clamp).
>
> Here we introduce the required support to properly reference count clamp
> groups at each task enqueue/dequeue time.
>
> Tasks have a:
> task_struct::uclamp::group_id[clamp_idx]
> indexing, for each clamp index (i.e. util_{min,max}), the clamp group in
> which they should refcount at enqueue time.
>
> CPUs rq have a:
> rq::uclamp::group[clamp_idx][group_idx].tasks
> which is used to reference count how many tasks are currently RUNNABLE on
> that CPU for each clamp group of each clamp index..
>
> The clamp value of each clamp group is tracked by
> rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an
> unordered array of clamp values. However, the MAX aggregation of the
> currently active clamp groups is implemented to minimize the number of
> times we need to scan the complete (unordered) clamp group array to
> figure out the new max value. This operation indeed happens only when we
> dequeue last task of the clamp group corresponding to the current max
> clamp, and thus the CPU is either entering IDLE or going to schedule a
> less boosted or more clamped task.
> Moreover, the expected number of different clamp values, which can be
> configured at build time, is usually so small that a more advanced
> ordering algorithm is not needed. In real use-cases we expect less then
> 10 different values.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/core.c | 188 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 4 +
> kernel/sched/rt.c | 4 +
> kernel/sched/sched.h | 71 ++++++++++++++++
> 4 files changed, 267 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 50e749067df5..d1969931fea6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -848,9 +848,19 @@ static inline void uclamp_group_init(int clamp_id, int group_id,
> unsigned int clamp_value)
> {
> struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> + struct uclamp_cpu *uc_cpu;
> + int cpu;
>
> + /* Set clamp group map */
> uc_map[group_id].value = clamp_value;
> uc_map[group_id].se_count = 0;
> +
> + /* Set clamp groups on all CPUs */
> + for_each_possible_cpu(cpu) {
> + uc_cpu = &cpu_rq(cpu)->uclamp;
> + uc_cpu->group[clamp_id][group_id].value = clamp_value;
> + uc_cpu->group[clamp_id][group_id].tasks = 0;
> + }
> }
>
> /**
> @@ -906,6 +916,172 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
> return group_id;
> }
>
> +/**
> + * uclamp_cpu_update: updates the utilization clamp of a CPU
> + * @cpu: the CPU which utilization clamp has to be updated
> + * @clamp_id: the clamp index to update
> + *
> + * When tasks are enqueued/dequeued on/from a CPU, the set of currently active
> + * clamp groups is subject to change. Since each clamp group enforces a
> + * different utilization clamp value, once the set of these groups changes it
> + * can be required to re-compute what is the new clamp value to apply for that
> + * CPU.
> + *
> + * For the specified clamp index, this method computes the new CPU utilization
> + * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
> + */
> +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
> + int max_value = UCLAMP_NONE;
> + unsigned int group_id;
> +
> + for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> + /* Ignore inactive clamp groups, i.e. no RUNNABLE tasks */
> + if (!uclamp_group_active(uc_grp, group_id))
> + continue;
> +
> + /* Both min and max clamp are MAX aggregated */
> + max_value = max(max_value, uc_grp[group_id].value);
> +
> + /* Stop if we reach the max possible clamp */
> + if (max_value >= SCHED_CAPACITY_SCALE)
> + break;
> + }
> + rq->uclamp.value[clamp_id] = max_value;
> +}
> +
> +/**
> + * uclamp_cpu_get_id(): increase reference count for a clamp group on a CPU
> + * @p: the task being enqueued on a CPU
> + * @rq: the CPU's rq where the clamp group has to be reference counted
> + * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
> + *
> + * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
> + * the task's uclamp.group_id is reference counted on that CPU.
> + */
> +static inline void uclamp_cpu_get_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + int clamp_value;
> + int group_id;
> +
> + /* No task specific clamp values: nothing to do */
> + group_id = p->uclamp[clamp_id].group_id;
> + if (group_id == UCLAMP_NONE)
> + return;
> +
> + /* Reference count the task into its current group_id */
> + uc_grp = &rq->uclamp.group[clamp_id][0];
> + uc_grp[group_id].tasks += 1;
> +
> + /*
> + * If this is the new max utilization clamp value, then we can update
> + * straight away the CPU clamp value. Otherwise, the current CPU clamp
> + * value is still valid and we are done.
> + */
> + uc_cpu = &rq->uclamp;
> + clamp_value = p->uclamp[clamp_id].value;
> + if (uc_cpu->value[clamp_id] < clamp_value)
> + uc_cpu->value[clamp_id] = clamp_value;
> +}
> +
> +/**
> + * uclamp_cpu_put_id(): decrease reference count for a clamp group on a CPU
> + * @p: the task being dequeued from a CPU
> + * @cpu: the CPU from where the clamp group has to be released
> + * @clamp_id: the utilization clamp (e.g. min or max utilization) to release
> + *
> + * When a task is dequeued from a CPU's RQ, the CPU's clamp group reference
> + * counted by the task is decreased.
> + * If this was the last task defining the current max clamp group, then the
> + * CPU clamping is updated to find the new max for the specified clamp
> + * index.
> + */
> +static inline void uclamp_cpu_put_id(struct task_struct *p,
> + struct rq *rq, int clamp_id)
> +{
> + struct uclamp_group *uc_grp;
> + struct uclamp_cpu *uc_cpu;
> + unsigned int clamp_value;
> + int group_id;
> +
> + /* No task specific clamp values: nothing to do */
> + group_id = p->uclamp[clamp_id].group_id;
> + if (group_id == UCLAMP_NONE)
> + return;
> +
> + /* Decrement the task's reference counted group index */
> + uc_grp = &rq->uclamp.group[clamp_id][0];
> + uc_grp[group_id].tasks -= 1;
> +
> + /* If this is not the last task, no updates are required */
> + if (uc_grp[group_id].tasks > 0)
> + return;
> +
> + /*
> + * Update the CPU only if this was the last task of the group
> + * defining the current clamp value.
> + */
> + uc_cpu = &rq->uclamp;
> + clamp_value = uc_grp[group_id].value;
> + if (clamp_value >= uc_cpu->value[clamp_id])
> + uclamp_cpu_update(rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_get(): increase CPU's clamp group refcount
> + * @rq: the CPU's rq where the clamp group has to be refcounted
> + * @p: the task being enqueued
> + *
> + * Once a task is enqueued on a CPU's rq, all the clamp groups currently
> + * enforced on a task are reference counted on that rq.
> + * Not all scheduling classes have utilization clamping support, their tasks
> + * will be silently ignored.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p)
> +{
> + int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_cpu_get_id(p, rq, clamp_id);
> +}
> +
> +/**
> + * uclamp_cpu_put(): decrease CPU's clamp group refcount
> + * @cpu: the CPU's rq where the clamp group refcount has to be decreased
> + * @p: the task being dequeued
> + *
> + * When a task is dequeued from a CPU's rq, all the clamp groups the task has
> + * been reference counted at task's enqueue time have to be decreased for that
> + * CPU.
> + *
> + * This method updates the utilization clamp constraints considering the
> + * requirements for the specified task. Thus, this update must be done before
> + * calling into the scheduling classes, which will eventually update schedutil
> + * considering the new task requirements.
> + */
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p)
> +{
> + int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_cpu_put_id(p, rq, clamp_id);
> +}
> +
> /**
> * uclamp_group_put: decrease the reference count for a clamp group
> * @clamp_id: the clamp index which was affected by a task group
> @@ -1021,9 +1197,17 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
> static void __init init_uclamp(void)
> {
> int clamp_id;
> + int cpu;
>
> mutex_init(&uclamp_mutex);
>
> + /* Init CPU's clamp groups */
> + for_each_possible_cpu(cpu) {
> + struct uclamp_cpu *uc_cpu = &cpu_rq(cpu)->uclamp;
> +
> + memset(uc_cpu, UCLAMP_NONE, sizeof(struct uclamp_cpu));
> + }
> +
> /* Init SE's clamp map */
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> @@ -1037,6 +1221,8 @@ static void __init init_uclamp(void)
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> @@ -1053,6 +1239,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & ENQUEUE_RESTORE))
> sched_info_queued(rq, p);
>
> + uclamp_cpu_get(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -1064,6 +1251,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> if (!(flags & DEQUEUE_SAVE))
> sched_info_dequeued(rq, p);
>
> + uclamp_cpu_put(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f0a0be4d344..fd857440276c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10433,6 +10433,10 @@ const struct sched_class fair_sched_class = {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> .task_change_group = task_change_group_fair,
> #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> + .uclamp_enabled = 1,
> +#endif
> };
>
> #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 572567078b60..056a7e1bd529 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2391,6 +2391,10 @@ const struct sched_class rt_sched_class = {
> .switched_to = switched_to_rt,
>
> .update_curr = update_curr_rt,
> +
> +#ifdef CONFIG_UCLAMP_TASK
> + .uclamp_enabled = 1,
> +#endif
> };
>
> #ifdef CONFIG_RT_GROUP_SCHED
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c7742dcc136c..65bf9ebacd83 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -744,6 +744,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * struct uclamp_group - Utilization clamp Group
> + * @value: utilization clamp value for tasks on this clamp group
> + * @tasks: number of RUNNABLE tasks on this clamp group
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_group {
> + int value;
> + int tasks;
> +};
> +
> +/**
> + * struct uclamp_cpu - CPU's utilization clamp
> + * @value: currently active clamp values for a CPU
> + * @group: utilization clamp groups affecting a CPU
> + *
> + * Keep track of RUNNABLE tasks on a CPUs to aggregate their clamp values.
> + * A clamp value is affecting a CPU where there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp value, which are
> + * currently only two: minmum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + * utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + * maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (CONFIG_UCLAMP_GROUPS_COUNT), we use a simple
> + * array to track the metrics required to compute all the per-CPU utilization
> + * clamp values. The additional slot is used to track the default clamp
> + * values, i.e. no min/max clamping at all.
> + */
> +struct uclamp_cpu {
> + int value[UCLAMP_CNT];
> + struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -781,6 +825,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_cpu uclamp ____cacheline_aligned;
> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> @@ -1535,6 +1584,10 @@ struct sched_class {
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_change_group)(struct task_struct *p, int type);
> #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> };
>
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> @@ -2130,6 +2183,24 @@ static inline u64 irq_time_read(int cpu)
> }
> #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/**
> + * uclamp_group_active: check if a clamp group is active on a CPU
> + * @uc_grp: the clamp groups for a CPU
> + * @group_id: the clamp group to check
> + *
> + * A clamp group affects a CPU if it as at least one RUNNABLE task.

typo: "has at least"

> + *
> + * Return: true if the specified CPU has at least one RUNNABLE task
> + * for the specified clamp group.
> + */
> +static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
> + int group_id)
> +{
> + return uc_grp[group_id].tasks > 0;
> +}
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> #ifdef CONFIG_CPU_FREQ
> DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
>
> --
> 2.17.1
>

2018-07-21 00:26:36

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

Hi Patrick,

On Fri, Jul 20, 2018 at 8:11 AM, Patrick Bellasi
<[email protected]> wrote:
> Hi Suren,
> thanks for the review, all good point... some more comments follow
> inline.
>
> On 19-Jul 16:51, Suren Baghdasaryan wrote:
>> On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > +/**
>> > + * uclamp_group_available: checks if a clamp group is available
>> > + * @clamp_id: the utilization clamp index (i.e. min or max clamp)
>> > + * @group_id: the group index in the given clamp_id
>> > + *
>> > + * A clamp group is not free if there is at least one SE which is sing a clamp
>>
>> Did you mean to say "single clamp"?
>
> No, it's "...at least one SE which is USING a clamp value..."
>
>> > + * value mapped on the specified clamp_id. These SEs are reference counted by
>> > + * the se_count of a uclamp_map entry.
>> > + *
>> > + * Return: true if there are no SE's mapped on the specified clamp
>> > + * index and group
>> > + */
>> > +static inline bool uclamp_group_available(int clamp_id, int group_id)
>> > +{
>> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > +
>> > + return (uc_map[group_id].value == UCLAMP_NONE);
>>
>> The usage of UCLAMP_NONE is very confusing to me. It was not used at
>> all in the patch where it was introduced [1/12], here it's used as a
>> clamp value and in uclamp_group_find() it's used as group_id. Please
>> clarify the usage.
>
> Yes, it's meant to represent a "clamp not valid" condition, whatever
> it's a "clamp group" or a "clamp value"... perhaps the name can be
> improved.
>
>> I also feel UCLAMP_NONE does not really belong to
>> the uclamp_id enum because other elements there are indexes in
>> uclamp_maps and this one is a special value.
>
> Right, it looks a bit misplaced, I agree. I think I tried to set it
> using a #define but there was some issues I don't remember now...
> Anyway, I'll give it another go...
>
>
>> IMHO if both *group_id*
>> and *value* need a special value (-1) to represent
>> unused/uninitialized entry it would be better to use different
>> constants. Maybe UCLAMP_VAL_NONE and UCLAMP_GROUP_NONE?
>
> Yes, maybe we can use a
>
> #define UCLAMP_NOT_VALID -1
>
> and get rid the confusing enum entry.
>
> Will update it on v3.
>

Sounds good to me.

>> > +}
>
> [...]
>
>> > +/**
>> > + * uclamp_group_find: finds the group index of a utilization clamp group
>> > + * @clamp_id: the utilization clamp index (i.e. min or max clamping)
>> > + * @clamp_value: the utilization clamping value lookup for
>> > + *
>> > + * Verify if a group has been assigned to a certain clamp value and return
>> > + * its index to be used for accounting.
>> > + *
>> > + * Since only a limited number of utilization clamp groups are allowed, if no
>> > + * groups have been assigned for the specified value, a new group is assigned
>> > + * if possible. Otherwise an error is returned, meaning that an additional clamp
>> > + * value is not (currently) supported.
>> > + */
>> > +static int
>> > +uclamp_group_find(int clamp_id, unsigned int clamp_value)
>> > +{
>> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > + int free_group_id = UCLAMP_NONE;
>> > + unsigned int group_id = 0;
>> > +
>> > + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
>> > + /* Keep track of first free clamp group */
>> > + if (uclamp_group_available(clamp_id, group_id)) {
>> > + if (free_group_id == UCLAMP_NONE)
>> > + free_group_id = group_id;
>> > + continue;
>> > + }
>> > + /* Return index of first group with same clamp value */
>> > + if (uc_map[group_id].value == clamp_value)
>> > + return group_id;
>> > + }
>> > + /* Default to first free clamp group */
>> > + if (group_id > CONFIG_UCLAMP_GROUPS_COUNT)
>>
>> Is the condition above needed? I think it's always true if you got here.
>> Also AFAIKT after the for loop you can just do:
>>
>> return (free_group_id != UCLAMP_NONE) ? free_group_id : -ENOSPC;
>
> Yes, you right... the code above can be simplified!
>
>>
>> > + group_id = free_group_id;
>> > + /* All clamp group already track different clamp values */
>> > + if (group_id == UCLAMP_NONE)
>> > + return -ENOSPC;
>> > + return group_id;
>> > +}
>
> [...]
>
>> > +static inline void uclamp_group_put(int clamp_id, int group_id)
>> > +{
>> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
>> > + unsigned long flags;
>> > +
>> > + /* Ignore SE's not yet attached */
>> > + if (group_id == UCLAMP_NONE)
>> > + return;
>> > +
>> > + /* Remove SE from this clamp group */
>> > + raw_spin_lock_irqsave(&uc_map[group_id].se_lock, flags);
>> > + uc_map[group_id].se_count -= 1;
>>
>> If uc_map[group_id].se_count was 0 before decrement you end up with
>> se_count == -1 and no reset for the element.
>
> Well... this should never happen, otherwise the refcounting is not
> working as expected.
>
> Maybe we can add (at least) a debug check and warning, something like:
>
> #ifdef SCHED_DEBUG
> if (unlikely(uc_map[group_id].se_count == 0)) {
> WARN(1, "invalid clamp group [%d:%d] refcount\n",
> clamp_id, group_id);
> uc_map[group_id].se_count = 1;
> }
> #endif
>
>> > + if (uc_map[group_id].se_count == 0)
>> > + uclamp_group_reset(clamp_id, group_id);
>> > + raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
>> > +}
>> > +
>
> [...]
>
>> > static inline int __setscheduler_uclamp(struct task_struct *p,
>> > const struct sched_attr *attr)
>> > {
>> > + struct uclamp_se *uc_se;
>> > + int retval = 0;
>> > +
>> > if (attr->sched_util_min > attr->sched_util_max)
>> > return -EINVAL;
>> > if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
>> > return -EINVAL;
>> >
>> > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
>> > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
>> > + mutex_lock(&uclamp_mutex);
>> > +
>> > + /* Update min utilization clamp */
>> > + uc_se = &p->uclamp[UCLAMP_MIN];
>> > + retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
>> > + attr->sched_util_min);
>> > +
>> > + /* Update max utilization clamp */
>> > + uc_se = &p->uclamp[UCLAMP_MAX];
>> > + retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
>> > + attr->sched_util_max);
>> > +
>> > + mutex_unlock(&uclamp_mutex);
>> > +
>> > + /*
>> > + * If one of the two clamp values should fail,
>> > + * let the userspace know.
>> > + */
>> > + if (retval)
>> > + return -ENOSPC;
>>
>> Maybe a minor issue but this failure is ambiguous. It might mean:
>> 1. no clamp value was updated
>> 2. UCLAMP_MIN was updated but UCLAMP_MAX was not
>> 3. UCLAMP_MAX was updated but UCLAMP_MIN was not
>
> That's right, I put a bit of thought on that me too but at the end
> I've been convinced that the possibility to use a single syscall to
> set both clamps at the same time has some benefits for user-space.
>
> Maybe the current solution can be improved by supporting an (optional)
> strict semantic with an in-kernel roll-back in case one of the two
> uclamp_group_get should fail.
>
> The strict semantic with roll-back could be controller via an
> additional flag, e.g. SCHED_FLAG_UTIL_CLAMP_STRICT.
>
> When the flag is set, either we are able to set both the attributes or
> we roll-back. Otherwise, when the flag is not set, we keep the current
> behavior. i.e. we set what we can and report an error to notify
> userspace that one constraints was not enforced.
>
> The following snippet should implement this semantics:
>
> ---8<---
>
> /* Uclamp flags */
> #define SCHED_FLAG_UTIL_CLAMP_STRICT 0x11 /* Roll-back on failure */
> #define SCHED_FLAG_UTIL_CLAMP_MIN 0x12 /* Update util_min */
> #define SCHED_FLAG_UTIL_CLAMP_MAX 0x14 /* Update util_max */
> #define SCHED_FLAG_UTIL_CLAMP ( \
> SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)
>

Having ability to update only min or only max this way might be indeed
very useful.
Instead of rolling back on failure I would suggest to check both
inputs first to make sure there won't be any error before updating.
This would remove the need for SCHED_FLAG_UTIL_CLAMP_STRICT (which I
think any user would want to set to 1 anyway).
Looks like uclamp_group_get() can fail only if uclamp_group_find()
fails to find a slot for uclamp_value or a free slot. So one way to do
this search before update is to call uclamp_group_find() for both
UCLAMP_MIN and UCLAMP_MAX beforehand and if they succeed then pass
obtained next_group_ids into uclamp_group_get() to avoid doing the
same search twice. This requires some refactoring of
uclamp_group_get() but I think the end result would be a cleaner and
more predictable solution.

> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> unsigned int uclamp_value_old = 0;
> unsigned int uclamp_value;
> struct uclamp_se *uc_se;
> int retval = 0;
>
> if (attr->sched_util_min > attr->sched_util_max)
> return -EINVAL;
> if (attr->sched_util_max > 100)
> return -EINVAL;
>
> mutex_lock(&uclamp_mutex);
>
> if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN))
> goto set_util_max;
>
> uc_se = &p->uclamp[UCLAMP_MIN];
> uclamp_value = scale_from_percent(attr->sched_util_min);
> if (uc_se->value == uclamp_value)
> goto set_util_max;
>
> /* Update min utilization clamp */
> uclamp_value_old = uc_se->value;
> retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se, uclamp_value);
> if (retval &&
> attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT)
> goto exit_failure;
>
> set_util_max:
>
> if (!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX))
> goto exit_done;
>
> uc_se = &p->uclamp[UCLAMP_MAX];
> uclamp_value = scale_from_percent(attr->sched_util_max);
> if (uc_se->value == uclamp_value)
> goto exit_done;
>
> /* Update max utilization clamp */
> if (uclamp_group_get(p, NULL, UCLAMP_MAX,
> uc_se, uclamp_value))
> goto exit_rollback;
>
> exit_done:
> mutex_unlock(&uclamp_mutex);
> return retval;
>
> exit_rollback:
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN &&
> attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_STRICT) {
> uclamp_group_get(p, NULL, UCLAMP_MIN,
> uc_se, uclamp_value_old);
> }
> exit_failure:
> mutex_unlock(&uclamp_mutex);
>
> return -ENOSPC;
> }
>
> ---8<---
>
>
> Could that work better?
>
> The code is maybe a bit more convoluted... but perhaps it can be
> improved by encoding it in a loop.
>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-07-21 01:25:34

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX

Hi Patrick,

On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
<[email protected]> wrote:
> When a util_max clamped task sleeps, its clamp constraints are removed
> from the CPU. However, the blocked utilization on that CPU can still be
> higher than the max clamp value enforced while that task was running.
> This max clamp removal when a CPU is going to be idle could thus allow
> unwanted CPU frequency increases, right while the task is not running.
>
> This can happen, for example, where there is another (smaller) task
> running on a different CPU of the same frequency domain.
> In this case, when we aggregates the utilization of all the CPUs in a

typo: we aggregate

> shared frequency domain, schedutil can still see the full non clamped
> blocked utilization of all the CPUs and thus eventually increase the
> frequency.
>
> Let's fix this by using:
>
> uclamp_cpu_put_id(UCLAMP_MAX)
> uclamp_cpu_update(last_clamp_value)
>
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition. Thus, while a CPU is idle, we can still enforce the last used
> clamp value for it.
>
> To the contrary, we do not track any UCLAMP_MIN since, while a CPU is
> idle, we don't want to enforce any minimum frequency
> Indeed, we relay just on blocked load decay to smoothly reduce the

typo: We rely

> frequency.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/core.c | 30 ++++++++++++++++++++++++++----
> kernel/sched/sched.h | 2 ++
> 2 files changed, 28 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b2424eea7990..0cb6e0aa4faa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -930,7 +930,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value)
> * For the specified clamp index, this method computes the new CPU utilization
> * clamp to use until the next change on the set of RUNNABLE tasks on that CPU.
> */
> -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> + unsigned int last_clamp_value)
> {
> struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0];
> int max_value = UCLAMP_NONE;
> @@ -948,6 +949,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id)
> if (max_value >= SCHED_CAPACITY_SCALE)
> break;
> }
> +
> + /*
> + * Just for the UCLAMP_MAX value, in case there are no RUNNABLE
> + * task, we keep the CPU clamped to the last task's clamp value.
> + * This avoids frequency spikes to MAX when one CPU, with an high
> + * blocked utilization, sleeps and another CPU, in the same frequency
> + * domain, do not see anymore the clamp on the first CPU.
> + */
> + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NONE) {
> + rq->uclamp.flags |= UCLAMP_FLAG_IDLE;
> + max_value = last_clamp_value;
> + }
> +
> rq->uclamp.value[clamp_id] = max_value;
> }
>
> @@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
> uc_grp = &rq->uclamp.group[clamp_id][0];
> uc_grp[group_id].tasks += 1;
>
> + /* Force clamp update on idle exit */
> + uc_cpu = &rq->uclamp;
> + clamp_value = p->uclamp[clamp_id].value;
> + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {

The condition below is not needed because UCLAMP_FLAG_IDLE is set only
for UCLAMP_MAX clamp_id, therefore the above condition already covers
the one below.

> + if (clamp_id == UCLAMP_MAX)
> + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
> + uc_cpu->value[clamp_id] = clamp_value;
> + return;
> + }
> +
> /*
> * If this is the new max utilization clamp value, then we can update
> * straight away the CPU clamp value. Otherwise, the current CPU clamp
> * value is still valid and we are done.
> */
> - uc_cpu = &rq->uclamp;
> - clamp_value = p->uclamp[clamp_id].value;
> if (uc_cpu->value[clamp_id] < clamp_value)
> uc_cpu->value[clamp_id] = clamp_value;
> }
> @@ -1028,7 +1050,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
> uc_cpu = &rq->uclamp;
> clamp_value = uc_grp[group_id].value;
> if (clamp_value >= uc_cpu->value[clamp_id])
> - uclamp_cpu_update(rq, clamp_id);
> + uclamp_cpu_update(rq, clamp_id, clamp_value);
> }
>
> /**
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1207add36478..7e4f10c507b7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -783,6 +783,8 @@ struct uclamp_group {
> * values, i.e. no min/max clamping at all.
> */
> struct uclamp_cpu {
> +#define UCLAMP_FLAG_IDLE 0x01
> + int flags;
> int value[UCLAMP_CNT];
> struct uclamp_group group[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1];
> };
> --
> 2.17.1
>

2018-07-21 02:39:21

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
<[email protected]> wrote:
> The cgroup's CPU controller allows to assign a specified (maximum)
> bandwidth to the tasks of a group. However this bandwidth is defined and
> enforced only on a temporal base, without considering the actual
> frequency a CPU is running on. Thus, the amount of computation completed
> by a task within an allocated bandwidth can be very different depending
> on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.
>
> With the availability of schedutil, the scheduler is now able
> to drive frequency selections based on actual task utilization.
> Moreover, the utilization clamping support provides a mechanism to
> bias the frequency selection operated by schedutil depending on
> constraints assigned to the tasks currently RUNNABLE on a CPU.
>
> Give the above mechanisms, it is now possible to extend the cpu
> controller to specify what is the minimum (or maximum) utilization which
> a task is expected (or allowed) to generate.
> Constraints on minimum and maximum utilization allowed for tasks in a
> CPU cgroup can improve the control on the actual amount of CPU bandwidth
> consumed by tasks.
>
> Utilization clamping constraints are useful not only to bias frequency
> selection, when a task is running, but also to better support certain
> scheduler decisions regarding task placement. For example, on
> asymmetric capacity systems, a utilization clamp value can be
> conveniently used to enforce important interactive tasks on more capable
> CPUs or to run low priority and background tasks on more energy
> efficient CPUs.
>
> The ultimate goal of utilization clamping is thus to enable:
>
> - boosting: by selecting an higher capacity CPU and/or higher execution
> frequency for small tasks which are affecting the user
> interactive experience.
>
> - capping: by selecting more energy efficiency CPUs or lower execution
> frequency, for big tasks which are mainly related to
> background activities, and thus without a direct impact on
> the user experience.
>
> Thus, a proper extension of the cpu controller with utilization clamping
> support will make this controller even more suitable for integration
> with advanced system management software (e.g. Android).
> Indeed, an informed user-space can provide rich information hints to the
> scheduler regarding the tasks it's going to schedule.
>
> This patch extends the CPU controller by adding a couple of new
> attributes, util_min and util_max, which can be used to enforce task's
> utilization boosting and capping. Specifically:
>
> - util_min: defines the minimum utilization which should be considered,
> e.g. when schedutil selects the frequency for a CPU while a
> task in this group is RUNNABLE.
> i.e. the task will run at least at a minimum frequency which
> corresponds to the min_util utilization
>
> - util_max: defines the maximum utilization which should be considered,
> e.g. when schedutil selects the frequency for a CPU while a
> task in this group is RUNNABLE.
> i.e. the task will run up to a maximum frequency which
> corresponds to the max_util utilization
>
> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
> hierarchies
> b) do not enforce any constraints and/or dependency between the parent
> and its child nodes, thus relying on the delegation model and
> permission settings defined by the system management software
> c) allow to (eventually) further restrict task-specific clamps defined
> via sched_setattr(2)
>
> This patch provides the basic support to expose the two new attributes
> and to validate their run-time updates.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Documentation/admin-guide/cgroup-v2.rst | 25 ++++
> init/Kconfig | 22 +++
> kernel/sched/core.c | 186 ++++++++++++++++++++++++
> kernel/sched/sched.h | 5 +
> 4 files changed, 238 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8a2c52d5c53b..328c011cc105 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -904,6 +904,12 @@ controller implements weight and absolute bandwidth limit models for
> normal scheduling policy and absolute bandwidth allocation model for
> realtime scheduling policy.
>
> +Cycles distribution is based, by default, on a temporal base and it
> +does not account for the frequency at which tasks are executed.
> +The (optional) utilization clamping support allows to enforce a minimum
> +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
> +which should never be exceeded by a CPU.
> +
> WARNING: cgroup2 doesn't yet support control of realtime processes and
> the cpu controller can only be enabled when all RT processes are in
> the root cgroup. Be aware that system management software may already
> @@ -963,6 +969,25 @@ All time durations are in microseconds.
> $PERIOD duration. "max" for $MAX indicates no limit. If only
> one number is written, $MAX is updated.
>
> + cpu.util_min
> + A read-write single value file which exists on non-root cgroups.
> + The default is "0", i.e. no bandwidth boosting.
> +
> + The minimum utilization in the range [0, 1023].
> +
> + This interface allows reading and setting minimum utilization clamp
> + values similar to the sched_setattr(2). This minimum utilization
> + value is used to clamp the task specific minimum utilization clamp.
> +
> + cpu.util_max
> + A read-write single value file which exists on non-root cgroups.
> + The default is "1023". i.e. no bandwidth clamping
> +
> + The maximum utilization in the range [0, 1023].
> +
> + This interface allows reading and setting maximum utilization clamp
> + values similar to the sched_setattr(2). This maximum utilization
> + value is used to clamp the task specific maximum utilization clamp.
>
> Memory
> ------
> diff --git a/init/Kconfig b/init/Kconfig
> index 0a377ad7c166..d7e2b74637ff 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -792,6 +792,28 @@ config RT_GROUP_SCHED
>
> endif #CGROUP_SCHED
>
> +config UCLAMP_TASK_GROUP
> + bool "Utilization clamping per group of tasks"
> + depends on CGROUP_SCHED
> + depends on UCLAMP_TASK
> + default n
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> + When this option is enabled, the user can specify a min and max
> + CPU bandwidth which is allowed for each single task in a group.
> + The max bandwidth allows to clamp the maximum frequency a task
> + can use, while the min bandwidth allows to define a minimum
> + frequency a task will always use.
> +
> + When task group based utilization clamping is enabled, an eventually
> + specified task-specific clamp value is constrained by the cgroup
> + specified clamp value. Both minimum and maximum task clamping cannot
> + be bigger than the corresponding clamping defined at task group level.
> +
> + If in doubt, say N.
> +
> config CGROUP_PIDS
> bool "PIDs controller"
> help
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0cb6e0aa4faa..30b1d894f978 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1227,6 +1227,74 @@ static inline int uclamp_group_get(struct task_struct *p,
> return 0;
> }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +/**
> + * init_uclamp_sched_group: initialize data structures required for TG's
> + * utilization clamping
> + */
> +static inline void init_uclamp_sched_group(void)
> +{
> + struct uclamp_map *uc_map;
> + struct uclamp_se *uc_se;
> + int group_id;
> + int clamp_id;
> +
> + /* Root TG's is statically assigned to the first clamp group */
> + group_id = 0;
> +
> + /* Initialize root TG's to default (none) clamp values */
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_map = &uclamp_maps[clamp_id][0];
> +
> + /* Map root TG's clamp value */
> + uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
> +
> + /* Init root TG's clamp group */
> + uc_se = &root_task_group.uclamp[clamp_id];
> + uc_se->value = uclamp_none(clamp_id);
> + uc_se->group_id = group_id;
> +
> + /* Attach root TG's clamp group */
> + uc_map[group_id].se_count = 1;
> + }
> +}
> +
> +/**
> + * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
> + * @tg: the newly created task group
> + * @parent: its parent task group
> + *
> + * A newly created task group inherits its utilization clamp values, for all
> + * clamp indexes, from its parent task group.
> + * This ensures that its values are properly initialized and that the task
> + * group is accounted in the same parent's group index.
> + *
> + * Return: !0 on error
> + */
> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> + struct task_group *parent)
> +{
> + struct uclamp_se *uc_se;
> + int clamp_id;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_se = &tg->uclamp[clamp_id];
> +
> + uc_se->value = parent->uclamp[clamp_id].value;
> + uc_se->group_id = UCLAMP_NONE;
> + }
> +
> + return 1;
> +}
> +#else /* CONFIG_UCLAMP_TASK_GROUP */
> +static inline void init_uclamp_sched_group(void) { }
> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> + struct task_group *parent)
> +{
> + return 1;
> +}
> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
> +
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> @@ -1289,11 +1357,18 @@ static void __init init_uclamp(void)
> raw_spin_lock_init(&uc_map[group_id].se_lock);
> }
> }
> +
> + init_uclamp_sched_group();
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> + struct task_group *parent)
> +{
> + return 1;
> +}
> static inline int __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> @@ -6890,6 +6965,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> + if (!alloc_uclamp_sched_group(tg, parent))
> + goto err;
> +
> return tg;
>
> err:
> @@ -7110,6 +7188,88 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> sched_move_task(task);
> }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 min_value)
> +{
> + struct task_group *tg;
> + int ret = -EINVAL;
> +
> + if (min_value > SCHED_CAPACITY_SCALE)
> + return -ERANGE;
> +
> + mutex_lock(&uclamp_mutex);
> + rcu_read_lock();
> +
> + tg = css_tg(css);
> + if (tg->uclamp[UCLAMP_MIN].value == min_value) {
> + ret = 0;
> + goto out;
> + }
> + if (tg->uclamp[UCLAMP_MAX].value < min_value)
> + goto out;
> +

+ tg->uclamp[UCLAMP_MIN].value = min_value;
+ ret = 0;

Are these assignments missing or am I missing something? Same for
cpu_util_max_write_u64().

> +out:
> + rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
> +
> + return ret;
> +}
> +
> +static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 max_value)
> +{
> + struct task_group *tg;
> + int ret = -EINVAL;
> +
> + if (max_value > SCHED_CAPACITY_SCALE)
> + return -ERANGE;
> +
> + mutex_lock(&uclamp_mutex);
> + rcu_read_lock();
> +
> + tg = css_tg(css);
> + if (tg->uclamp[UCLAMP_MAX].value == max_value) {
> + ret = 0;
> + goto out;
> + }
> + if (tg->uclamp[UCLAMP_MIN].value > max_value)
> + goto out;
> +
> +out:
> + rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
> +
> + return ret;
> +}
> +
> +static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> + enum uclamp_id clamp_id)
> +{
> + struct task_group *tg;
> + u64 util_clamp;
> +
> + rcu_read_lock();
> + tg = css_tg(css);
> + util_clamp = tg->uclamp[clamp_id].value;
> + rcu_read_unlock();
> +
> + return util_clamp;
> +}
> +
> +static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MIN);
> +}
> +
> +static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MAX);
> +}
> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 shareval)
> @@ -7437,6 +7597,18 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_rt_period_read_uint,
> .write_u64 = cpu_rt_period_write_uint,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "util_min",
> + .read_u64 = cpu_util_min_read_u64,
> + .write_u64 = cpu_util_min_write_u64,
> + },
> + {
> + .name = "util_max",
> + .read_u64 = cpu_util_max_read_u64,
> + .write_u64 = cpu_util_max_write_u64,
> + },
> #endif
> { } /* Terminate */
> };
> @@ -7604,6 +7776,20 @@ static struct cftype cpu_files[] = {
> .seq_show = cpu_max_show,
> .write = cpu_max_write,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "util_min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_min_read_u64,
> + .write_u64 = cpu_util_min_write_u64,
> + },
> + {
> + .name = "util_max",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_max_read_u64,
> + .write_u64 = cpu_util_max_write_u64,
> + },
> #endif
> { } /* terminate */
> };
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7e4f10c507b7..1471a23e8f57 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -389,6 +389,11 @@ struct task_group {
> #endif
>
> struct cfs_bandwidth cfs_bandwidth;
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + struct uclamp_se uclamp[UCLAMP_CNT];
> +#endif
> +
> };
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.17.1
>

2018-07-21 03:17:55

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On Fri, Jul 20, 2018 at 7:37 PM, Suren Baghdasaryan <[email protected]> wrote:
> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> <[email protected]> wrote:
>> The cgroup's CPU controller allows to assign a specified (maximum)
>> bandwidth to the tasks of a group. However this bandwidth is defined and
>> enforced only on a temporal base, without considering the actual
>> frequency a CPU is running on. Thus, the amount of computation completed
>> by a task within an allocated bandwidth can be very different depending
>> on the actual frequency the CPU is running that task.
>> The amount of computation can be affected also by the specific CPU a
>> task is running on, especially when running on asymmetric capacity
>> systems like Arm's big.LITTLE.
>>
>> With the availability of schedutil, the scheduler is now able
>> to drive frequency selections based on actual task utilization.
>> Moreover, the utilization clamping support provides a mechanism to
>> bias the frequency selection operated by schedutil depending on
>> constraints assigned to the tasks currently RUNNABLE on a CPU.
>>
>> Give the above mechanisms, it is now possible to extend the cpu
>> controller to specify what is the minimum (or maximum) utilization which
>> a task is expected (or allowed) to generate.
>> Constraints on minimum and maximum utilization allowed for tasks in a
>> CPU cgroup can improve the control on the actual amount of CPU bandwidth
>> consumed by tasks.
>>
>> Utilization clamping constraints are useful not only to bias frequency
>> selection, when a task is running, but also to better support certain
>> scheduler decisions regarding task placement. For example, on
>> asymmetric capacity systems, a utilization clamp value can be
>> conveniently used to enforce important interactive tasks on more capable
>> CPUs or to run low priority and background tasks on more energy
>> efficient CPUs.
>>
>> The ultimate goal of utilization clamping is thus to enable:
>>
>> - boosting: by selecting an higher capacity CPU and/or higher execution
>> frequency for small tasks which are affecting the user
>> interactive experience.
>>
>> - capping: by selecting more energy efficiency CPUs or lower execution
>> frequency, for big tasks which are mainly related to
>> background activities, and thus without a direct impact on
>> the user experience.
>>
>> Thus, a proper extension of the cpu controller with utilization clamping
>> support will make this controller even more suitable for integration
>> with advanced system management software (e.g. Android).
>> Indeed, an informed user-space can provide rich information hints to the
>> scheduler regarding the tasks it's going to schedule.
>>
>> This patch extends the CPU controller by adding a couple of new
>> attributes, util_min and util_max, which can be used to enforce task's
>> utilization boosting and capping. Specifically:
>>
>> - util_min: defines the minimum utilization which should be considered,
>> e.g. when schedutil selects the frequency for a CPU while a
>> task in this group is RUNNABLE.
>> i.e. the task will run at least at a minimum frequency which
>> corresponds to the min_util utilization
>>
>> - util_max: defines the maximum utilization which should be considered,
>> e.g. when schedutil selects the frequency for a CPU while a
>> task in this group is RUNNABLE.
>> i.e. the task will run up to a maximum frequency which
>> corresponds to the max_util utilization
>>
>> These attributes:
>>
>> a) are available only for non-root nodes, both on default and legacy
>> hierarchies
>> b) do not enforce any constraints and/or dependency between the parent
>> and its child nodes, thus relying on the delegation model and
>> permission settings defined by the system management software
>> c) allow to (eventually) further restrict task-specific clamps defined
>> via sched_setattr(2)
>>
>> This patch provides the basic support to expose the two new attributes
>> and to validate their run-time updates.
>>
>> Signed-off-by: Patrick Bellasi <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Tejun Heo <[email protected]>
>> Cc: Rafael J. Wysocki <[email protected]>
>> Cc: Viresh Kumar <[email protected]>
>> Cc: Todd Kjos <[email protected]>
>> Cc: Joel Fernandes <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> Documentation/admin-guide/cgroup-v2.rst | 25 ++++
>> init/Kconfig | 22 +++
>> kernel/sched/core.c | 186 ++++++++++++++++++++++++
>> kernel/sched/sched.h | 5 +
>> 4 files changed, 238 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index 8a2c52d5c53b..328c011cc105 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -904,6 +904,12 @@ controller implements weight and absolute bandwidth limit models for
>> normal scheduling policy and absolute bandwidth allocation model for
>> realtime scheduling policy.
>>
>> +Cycles distribution is based, by default, on a temporal base and it
>> +does not account for the frequency at which tasks are executed.
>> +The (optional) utilization clamping support allows to enforce a minimum
>> +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
>> +which should never be exceeded by a CPU.
>> +
>> WARNING: cgroup2 doesn't yet support control of realtime processes and
>> the cpu controller can only be enabled when all RT processes are in
>> the root cgroup. Be aware that system management software may already
>> @@ -963,6 +969,25 @@ All time durations are in microseconds.
>> $PERIOD duration. "max" for $MAX indicates no limit. If only
>> one number is written, $MAX is updated.
>>
>> + cpu.util_min
>> + A read-write single value file which exists on non-root cgroups.
>> + The default is "0", i.e. no bandwidth boosting.
>> +
>> + The minimum utilization in the range [0, 1023].
>> +
>> + This interface allows reading and setting minimum utilization clamp
>> + values similar to the sched_setattr(2). This minimum utilization
>> + value is used to clamp the task specific minimum utilization clamp.
>> +
>> + cpu.util_max
>> + A read-write single value file which exists on non-root cgroups.
>> + The default is "1023". i.e. no bandwidth clamping
>> +
>> + The maximum utilization in the range [0, 1023].
>> +
>> + This interface allows reading and setting maximum utilization clamp
>> + values similar to the sched_setattr(2). This maximum utilization
>> + value is used to clamp the task specific maximum utilization clamp.
>>
>> Memory
>> ------
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 0a377ad7c166..d7e2b74637ff 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -792,6 +792,28 @@ config RT_GROUP_SCHED
>>
>> endif #CGROUP_SCHED
>>
>> +config UCLAMP_TASK_GROUP
>> + bool "Utilization clamping per group of tasks"
>> + depends on CGROUP_SCHED
>> + depends on UCLAMP_TASK
>> + default n
>> + help
>> + This feature enables the scheduler to track the clamped utilization
>> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
>> +
>> + When this option is enabled, the user can specify a min and max
>> + CPU bandwidth which is allowed for each single task in a group.
>> + The max bandwidth allows to clamp the maximum frequency a task
>> + can use, while the min bandwidth allows to define a minimum
>> + frequency a task will always use.
>> +
>> + When task group based utilization clamping is enabled, an eventually
>> + specified task-specific clamp value is constrained by the cgroup
>> + specified clamp value. Both minimum and maximum task clamping cannot
>> + be bigger than the corresponding clamping defined at task group level.
>> +
>> + If in doubt, say N.
>> +
>> config CGROUP_PIDS
>> bool "PIDs controller"
>> help
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0cb6e0aa4faa..30b1d894f978 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1227,6 +1227,74 @@ static inline int uclamp_group_get(struct task_struct *p,
>> return 0;
>> }
>>
>> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> +/**
>> + * init_uclamp_sched_group: initialize data structures required for TG's
>> + * utilization clamping
>> + */
>> +static inline void init_uclamp_sched_group(void)
>> +{
>> + struct uclamp_map *uc_map;
>> + struct uclamp_se *uc_se;
>> + int group_id;
>> + int clamp_id;
>> +
>> + /* Root TG's is statically assigned to the first clamp group */
>> + group_id = 0;
>> +
>> + /* Initialize root TG's to default (none) clamp values */
>> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>> + uc_map = &uclamp_maps[clamp_id][0];
>> +
>> + /* Map root TG's clamp value */
>> + uclamp_group_init(clamp_id, group_id, uclamp_none(clamp_id));
>> +
>> + /* Init root TG's clamp group */
>> + uc_se = &root_task_group.uclamp[clamp_id];
>> + uc_se->value = uclamp_none(clamp_id);
>> + uc_se->group_id = group_id;
>> +
>> + /* Attach root TG's clamp group */
>> + uc_map[group_id].se_count = 1;
>> + }
>> +}
>> +
>> +/**
>> + * alloc_uclamp_sched_group: initialize a new TG's for utilization clamping
>> + * @tg: the newly created task group
>> + * @parent: its parent task group
>> + *
>> + * A newly created task group inherits its utilization clamp values, for all
>> + * clamp indexes, from its parent task group.
>> + * This ensures that its values are properly initialized and that the task
>> + * group is accounted in the same parent's group index.
>> + *
>> + * Return: !0 on error
>> + */
>> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
>> + struct task_group *parent)
>> +{
>> + struct uclamp_se *uc_se;
>> + int clamp_id;
>> +
>> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>> + uc_se = &tg->uclamp[clamp_id];
>> +
>> + uc_se->value = parent->uclamp[clamp_id].value;
>> + uc_se->group_id = UCLAMP_NONE;
>> + }
>> +
>> + return 1;
>> +}
>> +#else /* CONFIG_UCLAMP_TASK_GROUP */
>> +static inline void init_uclamp_sched_group(void) { }
>> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
>> + struct task_group *parent)
>> +{
>> + return 1;
>> +}
>> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
>> +
>> static inline int __setscheduler_uclamp(struct task_struct *p,
>> const struct sched_attr *attr)
>> {
>> @@ -1289,11 +1357,18 @@ static void __init init_uclamp(void)
>> raw_spin_lock_init(&uc_map[group_id].se_lock);
>> }
>> }
>> +
>> + init_uclamp_sched_group();
>> }
>>
>> #else /* CONFIG_UCLAMP_TASK */
>> static inline void uclamp_cpu_get(struct rq *rq, struct task_struct *p) { }
>> static inline void uclamp_cpu_put(struct rq *rq, struct task_struct *p) { }
>> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
>> + struct task_group *parent)
>> +{
>> + return 1;
>> +}
>> static inline int __setscheduler_uclamp(struct task_struct *p,
>> const struct sched_attr *attr)
>> {
>> @@ -6890,6 +6965,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>> if (!alloc_rt_sched_group(tg, parent))
>> goto err;
>>
>> + if (!alloc_uclamp_sched_group(tg, parent))
>> + goto err;
>> +
>> return tg;
>>
>> err:
>> @@ -7110,6 +7188,88 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>> sched_move_task(task);
>> }
>>
>> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cftype, u64 min_value)
>> +{
>> + struct task_group *tg;
>> + int ret = -EINVAL;
>> +
>> + if (min_value > SCHED_CAPACITY_SCALE)
>> + return -ERANGE;
>> +
>> + mutex_lock(&uclamp_mutex);
>> + rcu_read_lock();
>> +
>> + tg = css_tg(css);
>> + if (tg->uclamp[UCLAMP_MIN].value == min_value) {
>> + ret = 0;
>> + goto out;
>> + }
>> + if (tg->uclamp[UCLAMP_MAX].value < min_value)
>> + goto out;
>> +
>
> + tg->uclamp[UCLAMP_MIN].value = min_value;
> + ret = 0;
>
> Are these assignments missing or am I missing something? Same for
> cpu_util_max_write_u64().
>

Ah, I see the assignments now in [9/12] patch...

>> +out:
>> + rcu_read_unlock();
>> + mutex_unlock(&uclamp_mutex);
>> +
>> + return ret;
>> +}
>> +
>> +static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cftype, u64 max_value)
>> +{
>> + struct task_group *tg;
>> + int ret = -EINVAL;
>> +
>> + if (max_value > SCHED_CAPACITY_SCALE)
>> + return -ERANGE;
>> +
>> + mutex_lock(&uclamp_mutex);
>> + rcu_read_lock();
>> +
>> + tg = css_tg(css);
>> + if (tg->uclamp[UCLAMP_MAX].value == max_value) {
>> + ret = 0;
>> + goto out;
>> + }
>> + if (tg->uclamp[UCLAMP_MIN].value > max_value)
>> + goto out;
>> +
>> +out:
>> + rcu_read_unlock();
>> + mutex_unlock(&uclamp_mutex);
>> +
>> + return ret;
>> +}
>> +
>> +static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
>> + enum uclamp_id clamp_id)
>> +{
>> + struct task_group *tg;
>> + u64 util_clamp;
>> +
>> + rcu_read_lock();
>> + tg = css_tg(css);
>> + util_clamp = tg->uclamp[clamp_id].value;
>> + rcu_read_unlock();
>> +
>> + return util_clamp;
>> +}
>> +
>> +static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cft)
>> +{
>> + return cpu_uclamp_read(css, UCLAMP_MIN);
>> +}
>> +
>> +static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
>> + struct cftype *cft)
>> +{
>> + return cpu_uclamp_read(css, UCLAMP_MAX);
>> +}
>> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
>> +
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
>> struct cftype *cftype, u64 shareval)
>> @@ -7437,6 +7597,18 @@ static struct cftype cpu_legacy_files[] = {
>> .read_u64 = cpu_rt_period_read_uint,
>> .write_u64 = cpu_rt_period_write_uint,
>> },
>> +#endif
>> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> + {
>> + .name = "util_min",
>> + .read_u64 = cpu_util_min_read_u64,
>> + .write_u64 = cpu_util_min_write_u64,
>> + },
>> + {
>> + .name = "util_max",
>> + .read_u64 = cpu_util_max_read_u64,
>> + .write_u64 = cpu_util_max_write_u64,
>> + },
>> #endif
>> { } /* Terminate */
>> };
>> @@ -7604,6 +7776,20 @@ static struct cftype cpu_files[] = {
>> .seq_show = cpu_max_show,
>> .write = cpu_max_write,
>> },
>> +#endif
>> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> + {
>> + .name = "util_min",
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + .read_u64 = cpu_util_min_read_u64,
>> + .write_u64 = cpu_util_min_write_u64,
>> + },
>> + {
>> + .name = "util_max",
>> + .flags = CFTYPE_NOT_ON_ROOT,
>> + .read_u64 = cpu_util_max_read_u64,
>> + .write_u64 = cpu_util_max_write_u64,
>> + },
>> #endif
>> { } /* terminate */
>> };
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 7e4f10c507b7..1471a23e8f57 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -389,6 +389,11 @@ struct task_group {
>> #endif
>>
>> struct cfs_bandwidth cfs_bandwidth;
>> +
>> +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> + struct uclamp_se uclamp[UCLAMP_CNT];
>> +#endif
>> +
>> };
>>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> --
>> 2.17.1
>>

2018-07-22 03:06:41

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
<[email protected]> wrote:
> When a task's util_clamp value is configured via sched_setattr(2), this
> value has to be properly accounted in the corresponding clamp group
> every time the task is enqueued and dequeued. When cgroups are also in
> use, per-task clamp values have to be aggregated to those of the CPU's
> controller's Task Group (TG) in which the task is currently living.
>
> Let's update uclamp_cpu_get() to provide aggregation between the task
> and the TG clamp values. Every time a task is enqueued, it will be
> accounted in the clamp_group which defines the smaller clamp between the
> task specific value and its TG value.

So choosing smallest for both UCLAMP_MIN and UCLAMP_MAX means the
least boosted value and the most clamped value between syscall and TG
will be used. My understanding is that boost means "at least this
much" and clamp means "at most this much". So to satisfy both TG and
syscall requirements I think you would need to choose the largest
value for UCLAMP_MIN and the smallest one for UCLAMP_MAX, meaning the
most boosted and most clamped range. Current implementation choses the
least boosted value, so effectively one of the UCLAMP_MIN requirements
(either from TG or from syscall) are being ignored...
Could you please clarify why this choice is made?

>
> This also mimics what already happen for a task's CPU affinity mask when
> the task is also living in a cpuset. he overall idea is that cgroup

typo: The overall...

> attributes are always used to restrict the per-task attributes.
>
> Thus, this implementation allows to:
>
> 1. ensure cgroup clamps are always used to restrict task specific
> requests, i.e. boosted only up to a granted value or clamped at least
> to a certain value
> 2. implements a "nice-like" policy, where tasks are still allowed to
> request less then what enforced by their current TG
>
> For this mecanisms to work properly, we need to implement a concept of
> "effective" clamp group, which is used to track the currently most
> restrictive clamp value each task is subject to.
> The effective clamp is computed at enqueue time, by using an additional
> task_struct::uclamp_group_id
> to keep track of the clamp group in which each task is currently
> accounted into. This solution allows to update task constrains on
> demand, only when they became RUNNABLE, to always get the least
> restrictive clamp depending on the current TG's settings.
>
> This solution allows also to better decouple the slow-path, where task
> and task group clamp values are updated, from the fast-path, where the
> most appropriate clamp value is tracked by refcounting clamp groups.
>
> For consistency purposes, as well as to properly inform userspace, the
> sched_getattr(2) call is updated to always return the properly
> aggregated constrains as described above. This will also make
> sched_getattr(2) a convenient userpace API to know the utilization
> constraints enforced on a task by the cgroup's CPU controller.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 2 ++
> kernel/sched/core.c | 40 +++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 2 +-
> 3 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 260aa8d3fca9..5dd76a27ec17 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -676,6 +676,8 @@ struct task_struct {
> struct sched_dl_entity dl;
>
> #ifdef CONFIG_UCLAMP_TASK
> + /* Clamp group the task is currently accounted into */
> + int uclamp_group_id[UCLAMP_CNT];
> /* Utlization clamp values for this task */
> struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 04e758224e22..50613d3d5b83 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -971,8 +971,15 @@ static inline void uclamp_cpu_update(struct rq *rq, int clamp_id,
> * @rq: the CPU's rq where the clamp group has to be reference counted
> * @clamp_id: the utilization clamp (e.g. min or max utilization) to reference
> *
> - * Once a task is enqueued on a CPU's RQ, the clamp group currently defined by
> - * the task's uclamp.group_id is reference counted on that CPU.
> + * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group,
> + * among the task specific and that of the task's cgroup one, is reference
> + * counted on that CPU.
> + *
> + * Since the CPUs reference counted clamp group can be either that of the task
> + * or of its cgroup, we keep track of the reference counted clamp group by
> + * storing its index (group_id) into the task's task_struct::uclamp_group_id.
> + * This group index will then be used at task's dequeue time to release the
> + * correct refcount.
> */
> static inline void uclamp_cpu_get_id(struct task_struct *p,
> struct rq *rq, int clamp_id)
> @@ -982,18 +989,30 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
> int clamp_value;
> int group_id;
>
> - /* No task specific clamp values: nothing to do */
> group_id = p->uclamp[clamp_id].group_id;
> + clamp_value = p->uclamp[clamp_id].value;
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + /* Use TG's clamp value to limit task specific values */
> + if (group_id == UCLAMP_NONE ||
> + clamp_value >= task_group(p)->uclamp[clamp_id].value) {

Not a big deal but do you need to override if (clamp_value ==
task_group(p)->uclamp[clamp_id].value)? Maybe:
- clamp_value >= task_group(p)->uclamp[clamp_id].value) {
+ clamp_value > task_group(p)->uclamp[clamp_id].value) {

> + clamp_value = task_group(p)->uclamp[clamp_id].value;
> + group_id = task_group(p)->uclamp[clamp_id].group_id;
> + }
> +#else
> + /* No task specific clamp values: nothing to do */
> if (group_id == UCLAMP_NONE)
> return;
> +#endif
>
> /* Reference count the task into its current group_id */
> uc_grp = &rq->uclamp.group[clamp_id][0];
> uc_grp[group_id].tasks += 1;
>
> + /* Track the effective clamp group */
> + p->uclamp_group_id[clamp_id] = group_id;
> +
> /* Force clamp update on idle exit */
> uc_cpu = &rq->uclamp;
> - clamp_value = p->uclamp[clamp_id].value;
> if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
> if (clamp_id == UCLAMP_MAX)
> uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
> @@ -1031,7 +1050,7 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
> int group_id;
>
> /* No task specific clamp values: nothing to do */
> - group_id = p->uclamp[clamp_id].group_id;
> + group_id = p->uclamp_group_id[clamp_id];
> if (group_id == UCLAMP_NONE)
> return;
>
> @@ -1039,6 +1058,9 @@ static inline void uclamp_cpu_put_id(struct task_struct *p,
> uc_grp = &rq->uclamp.group[clamp_id][0];
> uc_grp[group_id].tasks -= 1;
>
> + /* Flag the task as not affecting any clamp index */
> + p->uclamp_group_id[clamp_id] = UCLAMP_NONE;
> +
> /* If this is not the last task, no updates are required */
> if (uc_grp[group_id].tasks > 0)
> return;
> @@ -2848,6 +2870,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> #endif
>
> #ifdef CONFIG_UCLAMP_TASK
> + memset(&p->uclamp_group_id, UCLAMP_NONE, sizeof(p->uclamp_group_id));
> p->uclamp[UCLAMP_MIN].value = 0;
> p->uclamp[UCLAMP_MIN].group_id = UCLAMP_NONE;
> p->uclamp[UCLAMP_MAX].value = SCHED_CAPACITY_SCALE;
> @@ -5437,6 +5460,13 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> #ifdef CONFIG_UCLAMP_TASK
> attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
> attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + /* Use cgroup enforced clamps to restrict task specific clamps */
> + if (task_group(p)->uclamp[UCLAMP_MIN].value < attr.sched_util_min)
> + attr.sched_util_min = task_group(p)->uclamp[UCLAMP_MIN].value;
> + if (task_group(p)->uclamp[UCLAMP_MAX].value < attr.sched_util_max)
> + attr.sched_util_max = task_group(p)->uclamp[UCLAMP_MAX].value;
> +#endif
> #endif
>
> rcu_read_unlock();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1471a23e8f57..e3d5a2bc2f6c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2220,7 +2220,7 @@ static inline bool uclamp_group_active(struct uclamp_group *uc_grp,
> */
> static inline bool uclamp_task_affects(struct task_struct *p, int clamp_id)
> {
> - return (p->uclamp[clamp_id].group_id != UCLAMP_NONE);
> + return (p->uclamp_group_id[clamp_id] != UCLAMP_NONE);
> }
>
> /**
> --
> 2.17.1
>

2018-07-22 03:28:03

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 11/12] sched/core: uclamp: update CPU's refcount on TG's clamp changes

On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
<[email protected]> wrote:
> When a task group refcounts a new clamp group, we need to ensure that
> the new clamp values are immediately enforced to all its tasks which are
> currently RUNNABLE. This is to ensure that all currently RUNNABLE task

tasks

> are boosted and/or clamped as requested as soon as possible.
>
> Let's ensure that, whenever a new clamp group is refcounted by a task
> group, all its RUNNABLE tasks are correctly accounted in their
> respective CPUs. We do that by slightly refactoring uclamp_group_get()
> to get an additional parameter *cgroup_subsys_state which, when
> provided, it's used to walk the list of tasks in the correspond TGs and

corresponding TGs

> update the RUNNABLE ones.
>
> This is a "brute force" solution which allows to reuse the same refcount
> update code already used by the per-task API. That's also the only way
> to ensure a prompt enforcement of new clamp constraints on RUNNABLE
> tasks, as soon as a task group attribute is tweaked.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/core.c | 42 ++++++++++++++++++++++++++++++++++--------
> 1 file changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 50613d3d5b83..42cff5ffddae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1198,21 +1198,43 @@ static inline void uclamp_group_put(int clamp_id, int group_id)
> raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags);
> }
>
> +static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css,
> + int clamp_id, unsigned int group_id)
> +{
> + struct css_task_iter it;
> + struct task_struct *p;
> +
> + /* Update clamp groups for RUNNABLE tasks in this TG */
> + css_task_iter_start(css, 0, &it);
> + while ((p = css_task_iter_next(&it)))
> + uclamp_task_update_active(p, clamp_id, group_id);
> + css_task_iter_end(&it);
> +}
> +
> /**
> * uclamp_group_get: increase the reference count for a clamp group
> * @p: the task which clamp value must be tracked
> - * @clamp_id: the clamp index affected by the task
> - * @uc_se: the utilization clamp data for the task
> - * @clamp_value: the new clamp value for the task
> + * @css: the task group which clamp value must be tracked
> + * @clamp_id: the clamp index affected by the task (group)
> + * @uc_se: the utilization clamp data for the task (group)
> + * @clamp_value: the new clamp value for the task (group)
> *
> * Each time a task changes its utilization clamp value, for a specified clamp
> * index, we need to find an available clamp group which can be used to track
> * this new clamp value. The corresponding clamp group index will be used by
> * the task to reference count the clamp value on CPUs while enqueued.
> *
> + * When the cgroup's cpu controller utilization clamping support is enabled,
> + * each task group has a set of clamp values which are used to restrict the
> + * corresponding task specific clamp values.
> + * When a clamp value for a task group is changed, all the (active) tasks
> + * belonging to that task group must be update to ensure they are refcounting

must be updated

> + * the correct CPU's clamp value.
> + *
> * Return: -ENOSPC if there are no available clamp groups, 0 on success.
> */
> static inline int uclamp_group_get(struct task_struct *p,
> + struct cgroup_subsys_state *css,
> int clamp_id, struct uclamp_se *uc_se,
> unsigned int clamp_value)
> {
> @@ -1240,6 +1262,10 @@ static inline int uclamp_group_get(struct task_struct *p,
> uc_map[next_group_id].se_count += 1;
> raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags);
>
> + /* Newly created TG don't have tasks assigned */
> + if (css)
> + uclamp_group_get_tg(css, clamp_id, next_group_id);
> +
> /* Update CPU's clamp group refcounts of RUNNABLE task */
> if (p)
> uclamp_task_update_active(p, clamp_id, next_group_id);
> @@ -1307,7 +1333,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> uc_se->value = parent->uclamp[clamp_id].value;
> uc_se->group_id = UCLAMP_NONE;
>
> - if (uclamp_group_get(NULL, clamp_id, uc_se,
> + if (uclamp_group_get(NULL, NULL, clamp_id, uc_se,
> parent->uclamp[clamp_id].value)) {
> ret = 0;
> goto out;
> @@ -1362,12 +1388,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
>
> /* Update min utilization clamp */
> uc_se = &p->uclamp[UCLAMP_MIN];
> - retval |= uclamp_group_get(p, UCLAMP_MIN, uc_se,
> + retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se,
> attr->sched_util_min);
>
> /* Update max utilization clamp */
> uc_se = &p->uclamp[UCLAMP_MAX];
> - retval |= uclamp_group_get(p, UCLAMP_MAX, uc_se,
> + retval |= uclamp_group_get(p, NULL, UCLAMP_MAX, uc_se,
> attr->sched_util_max);
>
> mutex_unlock(&uclamp_mutex);
> @@ -7274,7 +7300,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
>
> /* Update TG's reference count */
> uc_se = &tg->uclamp[UCLAMP_MIN];
> - ret = uclamp_group_get(NULL, UCLAMP_MIN, uc_se, min_value);
> + ret = uclamp_group_get(NULL, css, UCLAMP_MIN, uc_se, min_value);
>
> out:
> rcu_read_unlock();
> @@ -7306,7 +7332,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>
> /* Update TG's reference count */
> uc_se = &tg->uclamp[UCLAMP_MAX];
> - ret = uclamp_group_get(NULL, UCLAMP_MAX, uc_se, max_value);
> + ret = uclamp_group_get(NULL, css, UCLAMP_MAX, uc_se, max_value);
>
> out:
> rcu_read_unlock();
> --
> 2.17.1
>

2018-07-22 04:06:17

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
<[email protected]> wrote:
> The utilization is a well defined property of tasks and CPUs with an
> in-kernel representation based on power-of-two values.
> The current representation, in the [0..SCHED_CAPACITY_SCALE] range,
> allows efficient computations in hot-paths and a sufficient fixed point
> arithmetic precision.
> However, the utilization values range is still an implementation detail
> which is also possibly subject to changes in the future.
>
> Since we don't want to commit new user-space APIs to any in-kernel
> implementation detail, let's add an abstraction layer on top of the APIs
> used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's
> cpu.util_{min,max} attributes.
>
> We do that by adding a couple of conversion function which can be used

couple of conversion functions

> to conveniently transform utilization/capacity values from/to the internal
> SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage
> in the standard [0..100] range.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> Documentation/admin-guide/cgroup-v2.rst | 6 +++---
> include/linux/sched.h | 20 ++++++++++++++++++++
> include/uapi/linux/sched/types.h | 14 ++++++++------
> kernel/sched/core.c | 18 ++++++++++++------
> 4 files changed, 43 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 328c011cc105..08b8062e55cd 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -973,7 +973,7 @@ All time durations are in microseconds.
> A read-write single value file which exists on non-root cgroups.
> The default is "0", i.e. no bandwidth boosting.
>
> - The minimum utilization in the range [0, 1023].
> + The minimum percentage of utilization in the range [0, 100].
>
> This interface allows reading and setting minimum utilization clamp
> values similar to the sched_setattr(2). This minimum utilization
> @@ -981,9 +981,9 @@ All time durations are in microseconds.
>
> cpu.util_max
> A read-write single value file which exists on non-root cgroups.
> - The default is "1023". i.e. no bandwidth clamping
> + The default is "100". i.e. no bandwidth clamping
>
> - The maximum utilization in the range [0, 1023].
> + The maximum percentage of utilization in the range [0, 100].
>
> This interface allows reading and setting maximum utilization clamp
> values similar to the sched_setattr(2). This maximum utilization
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5dd76a27ec17..f5970903c187 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -321,6 +321,26 @@ struct sched_info {
> # define SCHED_FIXEDPOINT_SHIFT 10
> # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
>
> +static inline unsigned int scale_from_percent(unsigned int pct)
> +{
> + WARN_ON(pct > 100);
> +
> + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
> +}
> +
> +static inline unsigned int scale_to_percent(unsigned int value)
> +{
> + unsigned int rounding = 0;
> +
> + WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
> +
> + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
> + if (likely((value & 0xFF) && ~(value & 0x700)))
> + rounding = 1;

Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I missing?

> +
> + return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE));
> +}
> +
> struct load_weight {
> unsigned long weight;
> u32 inv_weight;
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 7421cd25354d..e2c2acb1c6af 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -84,15 +84,17 @@ struct sched_param {
> *
> * @sched_util_min represents the minimum utilization
> * @sched_util_max represents the maximum utilization
> + * @sched_util_min represents the minimum utilization percentage
> + * @sched_util_max represents the maximum utilization percentage
> *
> - * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which
> - * represents the percentage of CPU time used by a task when running at the
> - * maximum frequency on the highest capacity CPU of the system. Thus, for
> - * example, a 20% utilization task is a task running for 2ms every 10ms.
> + * Utilization is a value in the range [0..100] which represents the
> + * percentage of CPU time used by a task when running at the maximum frequency
> + * on the highest capacity CPU of the system. Thus, for example, a 20%
> + * utilization task is a task running for 2ms every 10ms.
> *
> - * A task with a min utilization value bigger then 0 is more likely to be
> + * A task with a min utilization value bigger then 0% is more likely to be
> * scheduled on a CPU which can provide that bandwidth.
> - * A task with a max utilization value smaller then 1024 is more likely to be
> + * A task with a max utilization value smaller then 100% is more likely to be
> * scheduled on a CPU which do not provide more then the required bandwidth.
> */
> struct sched_attr {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 42cff5ffddae..da7b8630cc8d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1381,7 +1381,7 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
>
> if (attr->sched_util_min > attr->sched_util_max)
> return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> + if (attr->sched_util_max > 100)
> return -EINVAL;
>
> mutex_lock(&uclamp_mutex);
> @@ -1389,12 +1389,12 @@ static inline int __setscheduler_uclamp(struct task_struct *p,
> /* Update min utilization clamp */
> uc_se = &p->uclamp[UCLAMP_MIN];
> retval |= uclamp_group_get(p, NULL, UCLAMP_MIN, uc_se,
> - attr->sched_util_min);
> + scale_from_percent(attr->sched_util_min));
>
> /* Update max utilization clamp */
> uc_se = &p->uclamp[UCLAMP_MAX];
> retval |= uclamp_group_get(p, NULL, UCLAMP_MAX, uc_se,
> - attr->sched_util_max);
> + scale_from_percent(attr->sched_util_max));
>
> mutex_unlock(&uclamp_mutex);
>
> @@ -5493,6 +5493,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> if (task_group(p)->uclamp[UCLAMP_MAX].value < attr.sched_util_max)
> attr.sched_util_max = task_group(p)->uclamp[UCLAMP_MAX].value;
> #endif
> + attr.sched_util_min = scale_to_percent(attr.sched_util_min);
> + attr.sched_util_max = scale_to_percent(attr.sched_util_max);
> #endif
>
> rcu_read_unlock();
> @@ -7284,8 +7286,10 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> struct task_group *tg;
> int ret = -EINVAL;
>
> - if (min_value > SCHED_CAPACITY_SCALE)
> + /* Check range and scale to internal representation */
> + if (min_value > 100)
> return -ERANGE;
> + min_value = scale_from_percent(min_value);
>
> mutex_lock(&uclamp_mutex);
> rcu_read_lock();
> @@ -7316,8 +7320,10 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> struct task_group *tg;
> int ret = -EINVAL;
>
> - if (max_value > SCHED_CAPACITY_SCALE)
> + /* Check range and scale to internal representation */
> + if (max_value > 100)
> return -ERANGE;
> + max_value = scale_from_percent(max_value);
>
> mutex_lock(&uclamp_mutex);
> rcu_read_lock();
> @@ -7352,7 +7358,7 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> util_clamp = tg->uclamp[clamp_id].value;
> rcu_read_unlock();
>
> - return util_clamp;
> + return scale_to_percent(util_clamp);
> }
>
> static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> --
> 2.17.1
>

2018-07-23 13:37:49

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

On 20-Jul 17:25, Suren Baghdasaryan wrote:

[...]

> > ---8<---
> >
> > /* Uclamp flags */
> > #define SCHED_FLAG_UTIL_CLAMP_STRICT 0x11 /* Roll-back on failure */
> > #define SCHED_FLAG_UTIL_CLAMP_MIN 0x12 /* Update util_min */
> > #define SCHED_FLAG_UTIL_CLAMP_MAX 0x14 /* Update util_max */
> > #define SCHED_FLAG_UTIL_CLAMP ( \
> > SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)
> >
>
> Having ability to update only min or only max this way might be indeed
> very useful.
> Instead of rolling back on failure I would suggest to check both
> inputs first to make sure there won't be any error before updating.
> This would remove the need for SCHED_FLAG_UTIL_CLAMP_STRICT (which I
> think any user would want to set to 1 anyway).
> Looks like uclamp_group_get() can fail only if uclamp_group_find()
> fails to find a slot for uclamp_value or a free slot. So one way to do
> this search before update is to call uclamp_group_find() for both
> UCLAMP_MIN and UCLAMP_MAX beforehand and if they succeed then pass
> obtained next_group_ids into uclamp_group_get() to avoid doing the
> same search twice. This requires some refactoring of
> uclamp_group_get() but I think the end result would be a cleaner and
> more predictable solution.

Yes, that sound possible... provided we check all the groups under the
same uclamp_mutex, it should be possible to find the group_ids before
actually increasing the refcount.

... will look into this for the next reposting.

--
#include <best/regards.h>

Patrick Bellasi

2018-07-23 15:05:30

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX

On 20-Jul 18:23, Suren Baghdasaryan wrote:
> Hi Patrick,

Hi Sure,
thank!

> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > @@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
> > uc_grp = &rq->uclamp.group[clamp_id][0];
> > uc_grp[group_id].tasks += 1;
> >
> > + /* Force clamp update on idle exit */
> > + uc_cpu = &rq->uclamp;
> > + clamp_value = p->uclamp[clamp_id].value;
> > + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
>
> The condition below is not needed because UCLAMP_FLAG_IDLE is set only
> for UCLAMP_MAX clamp_id, therefore the above condition already covers
> the one below.

Not really, this function is called two times, the first time to
update UCLAMP_MIN and a second time to update UCLAMP_MAX.

For both clamp_id we want to force update uc_cpu->value[clamp_id],
thus the UCLAMP_FLAG_IDLE flag has to be cleared only the second time.

Maybe I can had the following comment to better explain the reason of
the check:

/*
* This function is called for both UCLAMP_MIN (before) and
* UCLAMP_MAX (after). Let's reset the flag only the when
* we know that UCLAMP_MIN has been already updated.
*/

> > + if (clamp_id == UCLAMP_MAX)
> > + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
> > + uc_cpu->value[clamp_id] = clamp_value;
> > + return;
> > + }

[...]

--
#include <best/regards.h>

Patrick Bellasi

2018-07-23 15:20:15

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On 20-Jul 19:37, Suren Baghdasaryan wrote:
> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> > + struct cftype *cftype, u64 min_value)
> > +{
> > + struct task_group *tg;
> > + int ret = -EINVAL;
> > +
> > + if (min_value > SCHED_CAPACITY_SCALE)
> > + return -ERANGE;
> > +
> > + mutex_lock(&uclamp_mutex);
> > + rcu_read_lock();
> > +
> > + tg = css_tg(css);
> > + if (tg->uclamp[UCLAMP_MIN].value == min_value) {
> > + ret = 0;
> > + goto out;
> > + }
> > + if (tg->uclamp[UCLAMP_MAX].value < min_value)
> > + goto out;
> > +
>
> + tg->uclamp[UCLAMP_MIN].value = min_value;
> + ret = 0;
>
> Are these assignments missing or am I missing something? Same for
> cpu_util_max_write_u64().

They are introduced in the following patch, to keep this one focus
just on CGroups integration.

I'm also returning -EINVAL at this stage since, with just this patch
in, we are not really providing any good service to user-space, i.e.
it's like clamp groups not being available...

Maybe I can call this out better in the change log ;)

> > +out:
> > + rcu_read_unlock();
> > + mutex_unlock(&uclamp_mutex);
> > +
> > + return ret;
> > +}

[...]

--
#include <best/regards.h>

Patrick Bellasi

2018-07-23 15:42:13

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

On 21-Jul 20:05, Suren Baghdasaryan wrote:
> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> <[email protected]> wrote:
> > When a task's util_clamp value is configured via sched_setattr(2), this
> > value has to be properly accounted in the corresponding clamp group
> > every time the task is enqueued and dequeued. When cgroups are also in
> > use, per-task clamp values have to be aggregated to those of the CPU's
> > controller's Task Group (TG) in which the task is currently living.
> >
> > Let's update uclamp_cpu_get() to provide aggregation between the task
> > and the TG clamp values. Every time a task is enqueued, it will be
> > accounted in the clamp_group which defines the smaller clamp between the
> > task specific value and its TG value.
>
> So choosing smallest for both UCLAMP_MIN and UCLAMP_MAX means the
> least boosted value and the most clamped value between syscall and TG
> will be used.

Right

> My understanding is that boost means "at least this much" and clamp
> means "at most this much".

Right

> So to satisfy both TG and syscall requirements I think you would
> need to choose the largest value for UCLAMP_MIN and the smallest one
> for UCLAMP_MAX, meaning the most boosted and most clamped range.
> Current implementation choses the least boosted value, so
> effectively one of the UCLAMP_MIN requirements (either from TG or
> from syscall) are being ignored... Could you please clarify why
> this choice is made?

The TG values are always used to specify a _restriction_ on
task-specific values.

Thus, if you look or example at the CPU mask for a task, you can have
a task with affinity to CPUs 0-1, currently running on a cgroup with
cpuset.cpus=0... then the task can run only on CPU 0 (althought its
affinity includes CPU1 too).

Same we do here: if a task has util_min=10, but it's running in a
cgroup with cpu.util_min=0, then it will not be boosted.

IOW, this allows to implement a "nice" policy at task level, where a
task (via syscall) can decide to be less boosted with respect to its
group but never more boosted. The same task can also decide to be more
clamped, but not less clamped then its current group.

[...]

> > @@ -982,18 +989,30 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
> > int clamp_value;
> > int group_id;
> >
> > - /* No task specific clamp values: nothing to do */
> > group_id = p->uclamp[clamp_id].group_id;
> > + clamp_value = p->uclamp[clamp_id].value;
> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > + /* Use TG's clamp value to limit task specific values */
> > + if (group_id == UCLAMP_NONE ||
> > + clamp_value >= task_group(p)->uclamp[clamp_id].value) {
>
> Not a big deal but do you need to override if (clamp_value ==
> task_group(p)->uclamp[clamp_id].value)? Maybe:
> - clamp_value >= task_group(p)->uclamp[clamp_id].value) {
> + clamp_value > task_group(p)->uclamp[clamp_id].value) {

Good point, yes... the override is not really changing anything here.
Will fix this!

> > + clamp_value = task_group(p)->uclamp[clamp_id].value;
> > + group_id = task_group(p)->uclamp[clamp_id].group_id;
> > + }
> > +#else
> > + /* No task specific clamp values: nothing to do */
> > if (group_id == UCLAMP_NONE)
> > return;
> > +#endif
> >
> > /* Reference count the task into its current group_id */
> > uc_grp = &rq->uclamp.group[clamp_id][0];
> > uc_grp[group_id].tasks += 1;
> >
> > + /* Track the effective clamp group */
> > + p->uclamp_group_id[clamp_id] = group_id;
> > +
> > /* Force clamp update on idle exit */
> > uc_cpu = &rq->uclamp;
> > - clamp_value = p->uclamp[clamp_id].value;
> > if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
> > if (clamp_id == UCLAMP_MAX)
> > uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;

[...]

--
#include <best/regards.h>

Patrick Bellasi

2018-07-23 16:00:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

Hello,

On Mon, Jul 16, 2018 at 09:29:02AM +0100, Patrick Bellasi wrote:
> The cgroup's CPU controller allows to assign a specified (maximum)
> bandwidth to the tasks of a group. However this bandwidth is defined and
> enforced only on a temporal base, without considering the actual
> frequency a CPU is running on. Thus, the amount of computation completed
> by a task within an allocated bandwidth can be very different depending
> on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.

One basic problem I have with this patchset is that what's being
described is way more generic than what actually got implemented.
What's described is computation bandwidth control but what's
implemented is just frequency clamping. So, there are fundamental
discrepancies between description+interface vs. what it actually does.

I really don't think that's something we can fix up later.

> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
> hierarchies
> b) do not enforce any constraints and/or dependency between the parent
> and its child nodes, thus relying on the delegation model and
> permission settings defined by the system management software

cgroup does host attributes which only concern the cgroup itself and
thus don't need any hierarchical behaviors on their own, but what's
being implemented does control resource allocation, and what you're
describing inherently breaks the delegation model.

> c) allow to (eventually) further restrict task-specific clamps defined
> via sched_setattr(2)

Thanks.

--
tejun

2018-07-23 16:41:38

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX

On Mon, Jul 23, 2018 at 8:02 AM, Patrick Bellasi
<[email protected]> wrote:
> On 20-Jul 18:23, Suren Baghdasaryan wrote:
>> Hi Patrick,
>
> Hi Sure,
> thank!
>
>> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > @@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
>> > uc_grp = &rq->uclamp.group[clamp_id][0];
>> > uc_grp[group_id].tasks += 1;
>> >
>> > + /* Force clamp update on idle exit */
>> > + uc_cpu = &rq->uclamp;
>> > + clamp_value = p->uclamp[clamp_id].value;
>> > + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
>>
>> The condition below is not needed because UCLAMP_FLAG_IDLE is set only
>> for UCLAMP_MAX clamp_id, therefore the above condition already covers
>> the one below.
>
> Not really, this function is called two times, the first time to
> update UCLAMP_MIN and a second time to update UCLAMP_MAX.
>
> For both clamp_id we want to force update uc_cpu->value[clamp_id],
> thus the UCLAMP_FLAG_IDLE flag has to be cleared only the second time.
>
> Maybe I can had the following comment to better explain the reason of
> the check:
>
> /*
> * This function is called for both UCLAMP_MIN (before) and
> * UCLAMP_MAX (after). Let's reset the flag only the when
> * we know that UCLAMP_MIN has been already updated.
> */
>

Ah, my bad. I missed the fact that uc_cpu->flags is shared for both
UCLAMP_MIN and UCLAMP_MAX. It's fine the way it originally was. Thanks
for explanation!

>> > + if (clamp_id == UCLAMP_MAX)
>> > + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
>> > + uc_cpu->value[clamp_id] = clamp_value;
>> > + return;
>> > + }
>
> [...]
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-07-23 17:13:11

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

On Mon, Jul 23, 2018 at 8:40 AM, Patrick Bellasi
<[email protected]> wrote:
> On 21-Jul 20:05, Suren Baghdasaryan wrote:
>> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
>> <[email protected]> wrote:
>> > When a task's util_clamp value is configured via sched_setattr(2), this
>> > value has to be properly accounted in the corresponding clamp group
>> > every time the task is enqueued and dequeued. When cgroups are also in
>> > use, per-task clamp values have to be aggregated to those of the CPU's
>> > controller's Task Group (TG) in which the task is currently living.
>> >
>> > Let's update uclamp_cpu_get() to provide aggregation between the task
>> > and the TG clamp values. Every time a task is enqueued, it will be
>> > accounted in the clamp_group which defines the smaller clamp between the
>> > task specific value and its TG value.
>>
>> So choosing smallest for both UCLAMP_MIN and UCLAMP_MAX means the
>> least boosted value and the most clamped value between syscall and TG
>> will be used.
>
> Right
>
>> My understanding is that boost means "at least this much" and clamp
>> means "at most this much".
>
> Right
>
>> So to satisfy both TG and syscall requirements I think you would
>> need to choose the largest value for UCLAMP_MIN and the smallest one
>> for UCLAMP_MAX, meaning the most boosted and most clamped range.
>> Current implementation choses the least boosted value, so
>> effectively one of the UCLAMP_MIN requirements (either from TG or
>> from syscall) are being ignored... Could you please clarify why
>> this choice is made?
>
> The TG values are always used to specify a _restriction_ on
> task-specific values.
>
> Thus, if you look or example at the CPU mask for a task, you can have
> a task with affinity to CPUs 0-1, currently running on a cgroup with
> cpuset.cpus=0... then the task can run only on CPU 0 (althought its
> affinity includes CPU1 too).
>
> Same we do here: if a task has util_min=10, but it's running in a
> cgroup with cpu.util_min=0, then it will not be boosted.
>
> IOW, this allows to implement a "nice" policy at task level, where a
> task (via syscall) can decide to be less boosted with respect to its
> group but never more boosted. The same task can also decide to be more
> clamped, but not less clamped then its current group.
>

The fact that boost means "at least this much" to me seems like we can
safely choose higher CPU bandwidth (as long as it's lower than
UCLAMP_MAX) but from your description sounds like TG's UCLAMP_MIN
means "at most this much boost" and it's not safe to use CPU bandwidth
higher than TG's UCLAMP_MIN. So instead of specifying min CPU
bandwidth for a task it specifies the max allowed boost. Seems like a
discrepancy to me but maybe there are compelling usecases when this
behavior is necessary? In that case would be good to spell them out to
explain why this choice is made.

> [...]
>
>> > @@ -982,18 +989,30 @@ static inline void uclamp_cpu_get_id(struct task_struct *p,
>> > int clamp_value;
>> > int group_id;
>> >
>> > - /* No task specific clamp values: nothing to do */
>> > group_id = p->uclamp[clamp_id].group_id;
>> > + clamp_value = p->uclamp[clamp_id].value;
>> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
>> > + /* Use TG's clamp value to limit task specific values */
>> > + if (group_id == UCLAMP_NONE ||
>> > + clamp_value >= task_group(p)->uclamp[clamp_id].value) {
>>
>> Not a big deal but do you need to override if (clamp_value ==
>> task_group(p)->uclamp[clamp_id].value)? Maybe:
>> - clamp_value >= task_group(p)->uclamp[clamp_id].value) {
>> + clamp_value > task_group(p)->uclamp[clamp_id].value) {
>
> Good point, yes... the override is not really changing anything here.
> Will fix this!
>
>> > + clamp_value = task_group(p)->uclamp[clamp_id].value;
>> > + group_id = task_group(p)->uclamp[clamp_id].group_id;
>> > + }
>> > +#else
>> > + /* No task specific clamp values: nothing to do */
>> > if (group_id == UCLAMP_NONE)
>> > return;
>> > +#endif
>> >
>> > /* Reference count the task into its current group_id */
>> > uc_grp = &rq->uclamp.group[clamp_id][0];
>> > uc_grp[group_id].tasks += 1;
>> >
>> > + /* Track the effective clamp group */
>> > + p->uclamp_group_id[clamp_id] = group_id;
>> > +
>> > /* Force clamp update on idle exit */
>> > uc_cpu = &rq->uclamp;
>> > - clamp_value = p->uclamp[clamp_id].value;
>> > if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) {
>> > if (clamp_id == UCLAMP_MAX)
>> > uc_cpu->flags &= ~UCLAMP_FLAG_IDLE;
>
> [...]
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-07-23 17:26:36

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On 23-Jul 08:30, Tejun Heo wrote:
> Hello,

Hi Tejun!

> On Mon, Jul 16, 2018 at 09:29:02AM +0100, Patrick Bellasi wrote:
> > The cgroup's CPU controller allows to assign a specified (maximum)
> > bandwidth to the tasks of a group. However this bandwidth is defined and
> > enforced only on a temporal base, without considering the actual
> > frequency a CPU is running on. Thus, the amount of computation completed
> > by a task within an allocated bandwidth can be very different depending
> > on the actual frequency the CPU is running that task.
> > The amount of computation can be affected also by the specific CPU a
> > task is running on, especially when running on asymmetric capacity
> > systems like Arm's big.LITTLE.
>
> One basic problem I have with this patchset is that what's being
> described is way more generic than what actually got implemented.
> What's described is computation bandwidth control but what's
> implemented is just frequency clamping.

What I meant to describe is that we already have a computation
bandwidth control mechanism which is working quite fine for the
scheduling classes it applies to, i.e. CFS and RT.

For these classes we are usually happy with just a _best effort_
allocation of the bandwidth: nothing enforced in strict terms. Indeed,
there is not (at least not in kernel space) a tracking of the actual
available and allocated bandwidth. If we need strict enforcement, we
already have DL with its CBS servers.

However, the "best effort" bandwidth control we have for CFS and RT
can be further improved if, instead of just looking at time spent on
CPUs, we provide some more hints to the scheduler to know at which
min/max "MIPS" we want to consume the (best effort) time we have been
allocated on a CPU.

Such a simple extension is still quite useful to satisfy many use-case
we have, mainly on mobile systems, like the ones I've described in the
"Newcomer's Short Abstract (Updated)"
section of the cover letter:
https://lore.kernel.org/lkml/[email protected]/T/#u

> So, there are fundamental discrepancies between
> description+interface vs. what it actually does.

Perhaps then I should just change the description to make it less
generic...

> I really don't think that's something we can fix up later.

... since, really, I don't think we can get to the point to extend
later this interface to provide the strict bandwidth enforcement you
are thinking about.

This would not be a fixup, but something really close to
re-implementing what we already have with the DL class.

> > These attributes:
> >
> > a) are available only for non-root nodes, both on default and legacy
> > hierarchies
> > b) do not enforce any constraints and/or dependency between the parent
> > and its child nodes, thus relying on the delegation model and
> > permission settings defined by the system management software
>
> cgroup does host attributes which only concern the cgroup itself and
> thus don't need any hierarchical behaviors on their own, but what's
> being implemented does control resource allocation,

I'm not completely sure to get your point here.

Maybe it all depends on what we mean by "control resource allocation".

AFAIU, currently both the CFS and RT bandwidth controllers allow you
to define how much CPU time a group of tasks can use. It does that by
looking just within the group: there is no enforced/required relation
between the bandwidth assigned to a group and the bandwidth assigned
to its parent, siblings and/or children.

The resource control allocation is eventually enforced "indirectly" by
means of the fact that, based on tasks priorities and cgroup shares,
the scheduler will prefer to pick and run "more frequently" and
"longer" certain tasks instead of others.

Thus I would say that the resource allocation control is already
performed by the combined action of:
A) priorities / shares to favor certain tasks over others
B) period & bandwidth to further bias the scheduler in _not_ selecting
tasks which already executed for the configured amount of time.

> and what you're describing inherently breaks the delegation model.

What I describe here is just an additional hint to the scheduler which
enrich the above described model. Provided A and B are already
satisfied, when a task gets a chance to run it will be executed at a
min/max configured frequency. That's really all... there is not
additional impact on "resources allocation".

I don't see why you say that this breaks the delegation model?

Maybe an example can help to better explain what you mean?

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-07-24 09:58:09

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

On 23-Jul 10:11, Suren Baghdasaryan wrote:
> On Mon, Jul 23, 2018 at 8:40 AM, Patrick Bellasi
> <[email protected]> wrote:
> > On 21-Jul 20:05, Suren Baghdasaryan wrote:
> >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi

[...]

> >> So to satisfy both TG and syscall requirements I think you would
> >> need to choose the largest value for UCLAMP_MIN and the smallest one
> >> for UCLAMP_MAX, meaning the most boosted and most clamped range.
> >> Current implementation choses the least boosted value, so
> >> effectively one of the UCLAMP_MIN requirements (either from TG or
> >> from syscall) are being ignored... Could you please clarify why
> >> this choice is made?
> >
> > The TG values are always used to specify a _restriction_ on
> > task-specific values.
> >
> > Thus, if you look or example at the CPU mask for a task, you can have
> > a task with affinity to CPUs 0-1, currently running on a cgroup with
> > cpuset.cpus=0... then the task can run only on CPU 0 (althought its
> > affinity includes CPU1 too).
> >
> > Same we do here: if a task has util_min=10, but it's running in a
> > cgroup with cpu.util_min=0, then it will not be boosted.
> >
> > IOW, this allows to implement a "nice" policy at task level, where a
> > task (via syscall) can decide to be less boosted with respect to its
> > group but never more boosted. The same task can also decide to be more
> > clamped, but not less clamped then its current group.
> >
>
> The fact that boost means "at least this much" to me seems like we can
> safely choose higher CPU bandwidth (as long as it's lower than
> UCLAMP_MAX)

I understand your view point, which actually is matching my first
implementation for util_min aggregation:

https://lore.kernel.org/lkml/[email protected]/


> but from your description sounds like TG's UCLAMP_MIN means "at most
> this much boost" and it's not safe to use CPU bandwidth higher than
> TG's UCLAMP_MIN.

Indeed, after this discussion with Tejun:

https://lore.kernel.org/lkml/[email protected]/

I've convinced myself that for the cgroup interface we have to got for
a "restrictive" interface where a parent value must set the upper
bound for all its descendants values. AFAIU, that's one of the basic
principles of the "delegation model" implemented by cgroups and the
common behavior implemented by all controllers.

> So instead of specifying min CPU bandwidth for a task it specifies
> the max allowed boost. Seems like a discrepancy to me but maybe
> there are compelling usecases when this behavior is necessary?

I don't think it's strictly related to use-cases, you can always
describe a give use-case in one model or the other. It all depends on
how you configure your hierarchy and where you place your tasks.

For our Android use cases, we are still happy to say that all tasks of
a CGroup can be boosted up to a certain value and then we can either:
- don't configure tasks: and thus get the CG defined boost
- configure a task: and explicitly give back what we don't need

This model works quite well with containers, where the parent want to
precisely control how much resources are (eventually) usable by a
given container.

> In that case would be good to spell them out to explain why this
> choice is made.

Yes, well... if I understand it correctly is really just the
recommended way cgroups must be used to re-partition resources.

I'll try to better explain this behavior in the changelog for this
patch.

[...]

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-07-24 13:30:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

Hello, Patrick.

On Mon, Jul 23, 2018 at 06:22:15PM +0100, Patrick Bellasi wrote:
> However, the "best effort" bandwidth control we have for CFS and RT
> can be further improved if, instead of just looking at time spent on
> CPUs, we provide some more hints to the scheduler to know at which
> min/max "MIPS" we want to consume the (best effort) time we have been
> allocated on a CPU.
>
> Such a simple extension is still quite useful to satisfy many use-case
> we have, mainly on mobile systems, like the ones I've described in the
> "Newcomer's Short Abstract (Updated)"
> section of the cover letter:
> https://lore.kernel.org/lkml/[email protected]/T/#u

So, that's all completely fine but then let's please not give it a
name which doesn't quite match what it does. We can just call it
e.g. cpufreq range control.

> > So, there are fundamental discrepancies between
> > description+interface vs. what it actually does.
>
> Perhaps then I should just change the description to make it less
> generic...

I think so, along with the interface itself.

> > I really don't think that's something we can fix up later.
>
> ... since, really, I don't think we can get to the point to extend
> later this interface to provide the strict bandwidth enforcement you
> are thinking about.

That's completley fine. The interface just has to match what's
implemented.

...
> > and what you're describing inherently breaks the delegation model.
>
> What I describe here is just an additional hint to the scheduler which
> enrich the above described model. Provided A and B are already
> satisfied, when a task gets a chance to run it will be executed at a
> min/max configured frequency. That's really all... there is not
> additional impact on "resources allocation".

So, if it's a cpufreq range controller. It'd have sth like
cpu.freq.min and cpu.freq.max, where min defines the maximum minimum
cpufreq its descendants can get and max defines the maximum cpufreq
allowed in the subtree. For an example, please refer to how
memory.min and memory.max are defined.

Thanks.

--
tejun

2018-07-24 15:30:11

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

Hi Patrick. Thanks for the explanation and links. No more questions
from me on this one :)

On Tue, Jul 24, 2018 at 2:56 AM, Patrick Bellasi
<[email protected]> wrote:
> On 23-Jul 10:11, Suren Baghdasaryan wrote:
>> On Mon, Jul 23, 2018 at 8:40 AM, Patrick Bellasi
>> <[email protected]> wrote:
>> > On 21-Jul 20:05, Suren Baghdasaryan wrote:
>> >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
>
> [...]
>
>> >> So to satisfy both TG and syscall requirements I think you would
>> >> need to choose the largest value for UCLAMP_MIN and the smallest one
>> >> for UCLAMP_MAX, meaning the most boosted and most clamped range.
>> >> Current implementation choses the least boosted value, so
>> >> effectively one of the UCLAMP_MIN requirements (either from TG or
>> >> from syscall) are being ignored... Could you please clarify why
>> >> this choice is made?
>> >
>> > The TG values are always used to specify a _restriction_ on
>> > task-specific values.
>> >
>> > Thus, if you look or example at the CPU mask for a task, you can have
>> > a task with affinity to CPUs 0-1, currently running on a cgroup with
>> > cpuset.cpus=0... then the task can run only on CPU 0 (althought its
>> > affinity includes CPU1 too).
>> >
>> > Same we do here: if a task has util_min=10, but it's running in a
>> > cgroup with cpu.util_min=0, then it will not be boosted.
>> >
>> > IOW, this allows to implement a "nice" policy at task level, where a
>> > task (via syscall) can decide to be less boosted with respect to its
>> > group but never more boosted. The same task can also decide to be more
>> > clamped, but not less clamped then its current group.
>> >
>>
>> The fact that boost means "at least this much" to me seems like we can
>> safely choose higher CPU bandwidth (as long as it's lower than
>> UCLAMP_MAX)
>
> I understand your view point, which actually is matching my first
> implementation for util_min aggregation:
>
> https://lore.kernel.org/lkml/[email protected]/
>
>
>> but from your description sounds like TG's UCLAMP_MIN means "at most
>> this much boost" and it's not safe to use CPU bandwidth higher than
>> TG's UCLAMP_MIN.
>
> Indeed, after this discussion with Tejun:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> I've convinced myself that for the cgroup interface we have to got for
> a "restrictive" interface where a parent value must set the upper
> bound for all its descendants values. AFAIU, that's one of the basic
> principles of the "delegation model" implemented by cgroups and the
> common behavior implemented by all controllers.
>
>> So instead of specifying min CPU bandwidth for a task it specifies
>> the max allowed boost. Seems like a discrepancy to me but maybe
>> there are compelling usecases when this behavior is necessary?
>
> I don't think it's strictly related to use-cases, you can always
> describe a give use-case in one model or the other. It all depends on
> how you configure your hierarchy and where you place your tasks.
>
> For our Android use cases, we are still happy to say that all tasks of
> a CGroup can be boosted up to a certain value and then we can either:
> - don't configure tasks: and thus get the CG defined boost
> - configure a task: and explicitly give back what we don't need
>
> This model works quite well with containers, where the parent want to
> precisely control how much resources are (eventually) usable by a
> given container.
>
>> In that case would be good to spell them out to explain why this
>> choice is made.
>
> Yes, well... if I understand it correctly is really just the
> recommended way cgroups must be used to re-partition resources.
>
> I'll try to better explain this behavior in the changelog for this
> patch.
>
> [...]
>
> Best,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-07-24 15:40:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

Hi Tejun,

I apologize in advance for the (yet another) long reply, however I did
my best hereafter to try to resume all the controversial points
discussed so far.

If you will have (one more time) the patience to go through the
following text you'll find a set of precise clarifications and
questions I have for you.

Thank you again for your time.

On 24-Jul 06:29, Tejun Heo wrote:

[...]

> > What I describe here is just an additional hint to the scheduler which
> > enrich the above described model. Provided A and B are already
> > satisfied, when a task gets a chance to run it will be executed at a
> > min/max configured frequency. That's really all... there is not
> > additional impact on "resources allocation".
>
> So, if it's a cpufreq range controller. It'd have sth like
> cpu.freq.min and cpu.freq.max, where min defines the maximum minimum
> cpufreq its descendants can get and max defines the maximum cpufreq
> allowed in the subtree. For an example, please refer to how
> memory.min and memory.max are defined.

I think you are still looking at just one usage of this interface,
which is likely mainly my fault also because of the long time between
posting. Sorry for that...

Let me re-propose here an abstract of the cover letter with some
additional notes inline.

--- Cover Letter Abstract START ---

> > [...] utilization is a task specific property which is used by the scheduler
> > to know how much CPU bandwidth a task requires (under certain conditions).
> > Thus, the utilization clamp values defined either per-task or via the
> > CPU controller, can be used to represent tasks to the scheduler as
> > being bigger (or smaller) then what they really are.
^^^^^^^^^^^^^^^^^^^

This is a fundamental feature added by utilization clamping: this is a
task property which can be useful in many different ways to the
scheduler and not "just" to bias frequency selection.

> > Utilization clamping thus ultimately enable interesting additional
> > optimizations, especially on asymmetric capacity systems like Arm
> > big.LITTLE and DynamIQ CPUs, where:
> >
> > - boosting: small tasks are preferably scheduled on higher-capacity CPUs
> > where, despite being less energy efficient, they can complete faster
> >
> > - clamping: big/background tasks are preferably scheduler on low-capacity CPUs
> > where, being more energy efficient, they can still run but save power and
> > thermal headroom for more important tasks.

These two point above are two examples of how we can use utilization
clamping which is not frequency selection.

> > This additional usage of the utilization clamping is not presented in this
^^^^^^^^^^^^^^^^^^^^^^^^

Is it acceptable to add a generic interface by properly and completely
describing, both in the cover letter and in the relative changelogs,
what will be the future bits we can add ?

> > series but it's an integral part of the Energy Aware Scheduler (EAS) feature
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The EAS scheduler, without the utilization clamping bits, does a great
job in scheduling tasks while saving energy. However, on every system,
we are interested also in other metrics, like for example: completion
time and power dissipation.

Whether certain tasks should be scheduled to optimize energy
efficiency, completion time and/or power dissipation is something we
can achieve only by:

1. adopting a proper tasks classification schema
=> that's why CGroups are of interest

2. using a generic enough mechanism to describe certain tasks
properties which affect all the metrics above,
i.e. energy, speed and power
=> that's why utilization and its clamping is of interest

> > set. A similar solution (SchedTune) is already used on Android kernels, which
^^^^^^^^^^^^^^^^^^^^^^^

This _complete support_ is already actively and successfully used on
many Android devices...

> > targets both frequency selection and task placement biasing.
^^^^ ^^^^^^^^^^^^^^^^^^

... to support _not only_ frequency selections.

> > This series provides the foundation bits to add similar features in mainline
^^^^^^^^^^^^^^^
> > and its first simple client with the schedutil integration.
^^^^^^^^^^^^^^^^^^^

The solution presented here shows only the integration with
cpufreq/schedutil. However, since we are adding a user-space
interface, we have to add this new interface in a generic way since
the beginning to support also the complete implementation we will have
at the end.

--- Cover Letter Abstract END ---


From my comments above I hope it's now more clear that "utilization
clamping" is not just a "cpufreq range controller" and, since we
will extend the internal usage of such interface, we cannot add now a
user-space interface which targets just frequency control.

To resume, here we are at proposing a generic interface which:

a) do not strictly enforce and/or grant any bandwidth to tasks and
do not directly define how the CPU resource has to be partitioned
among tasks

b) improves the way we can constraint bandwidth consumed by TGs, by
specifying a min/max "MIPS range" (in scheduler terms: utilization)
the bandwidth can be consumed at

c) it's based on a fundamental task scheduler metric: utilization
since the "MIPS range" can be affected by the "type of CPUs" and
not only by the "operating frequency"

d) can be used by the scheduler to bias "tasks placement" as well as
"frequency selection"

e) do not provide the full implementation here not only to keep the
initial patchset limited in size but also because of some
dependencies on other EAS bits which are currently under discussion
on LKML.
These different EAS features can still be progressed independently.

f) at our best, it aims at providing a complete use-case description
both in the cover-letter as well as in the relative changelogs

Going back to one of your previous comments, when you says:

> What's described is computation bandwidth control but what's
> implemented is just frequency clamping.

Do we agree now that:

1. what we propose is not a "computational bandwidth control"
mechanism and/or interface

2. what we implement is freq clamping but that's just one use case to
keep the series small enough

3. despite 2) we need to add an interface which is generic enough to
accommodate the other use-cases

4. the basic metric exposed (i.e. utilization) is used now for
frequency clamping but the same one will be used for task placement
biasing

?

And again, when you say:

> So, there are fundamental discrepancies between
> description+interface vs. what it actually does.

Is it acceptable to have a new interface which fits a wider
description?

With such a description, our aim is also to demonstrate that we are
_not_ adding a special case new user-space interface but a generic
enough interface which can be properly extended in the future without
breaking existing functionalities but just by keep improving them.

Best,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-07-24 15:51:31

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

On 24-Jul 08:28, Suren Baghdasaryan wrote:
> Hi Patrick. Thanks for the explanation and links. No more questions
> from me on this one :)

No problems at all!

The important question is instead: does it makes sense for you too?

I think the important bits are that we are all on the same page about
the end goals and features we like to have as well as the interface we use.
This last has to fits best our goals and features while still being
perfectly aligned with the frameworks we are integrating into... and
that's still under discussion with Tejun on PATCH 08/12.

Thanks again for your review!

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-07-24 16:44:24

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

On 21-Jul 21:04, Suren Baghdasaryan wrote:
> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> <[email protected]> wrote:

[...]

> > +static inline unsigned int scale_from_percent(unsigned int pct)
> > +{
> > + WARN_ON(pct > 100);
> > +
> > + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
> > +}
> > +
> > +static inline unsigned int scale_to_percent(unsigned int value)
> > +{
> > + unsigned int rounding = 0;
> > +
> > + WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
> > +
> > + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
> > + if (likely((value & 0xFF) && ~(value & 0x700)))
> > + rounding = 1;
>
> Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I missing?

So, 0x700 is the topmost 3 bits sets (111 0000 0000) which different
configuration corresponds to:

001 0000 0000 => 256
010 0000 0000 => 512
011 0000 0000 => 768
100 0000 0000 => 1024

Thus, if 0x700 matches then we have one of these values in input and
for these cases we have to add a unit to the percentage value.

For the case (value == 0) we translate it into 0% thanks to the check
on (value & 0xFF) to ensure rounding = 0.

Here is a small python snippet I've used to check the conversion of
all the possible percentage values:

---8<---
values = range(0, 101)
for pct in xrange(0, 101):
util = int((1024 * pct) / 100)
rounding = 1
if not ((util & 0xFF) and ~(util & 0x700)):
print "Fixing util_to_perc({:3d} => {:4d})".format(pct, util)
rounding = 0
pct2 = (rounding + ((100 * util) / 1024))
if pct2 in values:
values.remove(pct2)
if pct != pct2:
print "Convertion failed for: {:3d} => {:4d} => {:3d}".format(pct, util, pct2)
if values:
print "ERROR: not all percentage values converted"
---8<---

--
#include <best/regards.h>

Patrick Bellasi

2018-07-24 17:13:45

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

On Tue, Jul 24, 2018 at 9:43 AM, Patrick Bellasi
<[email protected]> wrote:
> On 21-Jul 21:04, Suren Baghdasaryan wrote:
>> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
>> <[email protected]> wrote:
>
> [...]
>
>> > +static inline unsigned int scale_from_percent(unsigned int pct)
>> > +{
>> > + WARN_ON(pct > 100);
>> > +
>> > + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
>> > +}
>> > +
>> > +static inline unsigned int scale_to_percent(unsigned int value)
>> > +{
>> > + unsigned int rounding = 0;
>> > +
>> > + WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
>> > +
>> > + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
>> > + if (likely((value & 0xFF) && ~(value & 0x700)))
>> > + rounding = 1;
>>
>> Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I missing?
>
> So, 0x700 is the topmost 3 bits sets (111 0000 0000) which different
> configuration corresponds to:
>
> 001 0000 0000 => 256
> 010 0000 0000 => 512
> 011 0000 0000 => 768
> 100 0000 0000 => 1024
>
> Thus, if 0x700 matches then we have one of these values in input and
> for these cases we have to add a unit to the percentage value.
>
> For the case (value == 0) we translate it into 0% thanks to the check
> on (value & 0xFF) to ensure rounding = 0.
>

I think just (value & 0xFF) is enough to get you the right behavior.
~(value & 0x700) is not needed, it's effectively a NoOp which always
yields TRUE. For any *value* (value & 0x700) == 0x...00 and ~(value &
0x700) == 0x...FF == TRUE.

> Here is a small python snippet I've used to check the conversion of
> all the possible percentage values:
>
> ---8<---
> values = range(0, 101)
> for pct in xrange(0, 101):
> util = int((1024 * pct) / 100)
> rounding = 1
> if not ((util & 0xFF) and ~(util & 0x700)):
> print "Fixing util_to_perc({:3d} => {:4d})".format(pct, util)
> rounding = 0
> pct2 = (rounding + ((100 * util) / 1024))
> if pct2 in values:
> values.remove(pct2)
> if pct != pct2:
> print "Convertion failed for: {:3d} => {:4d} => {:3d}".format(pct, util, pct2)
> if values:
> print "ERROR: not all percentage values converted"
> ---8<---
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-07-24 17:18:51

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values

On 24-Jul 10:11, Suren Baghdasaryan wrote:
> On Tue, Jul 24, 2018 at 9:43 AM, Patrick Bellasi
> <[email protected]> wrote:
> > On 21-Jul 21:04, Suren Baghdasaryan wrote:
> >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi
> >> <[email protected]> wrote:
> >
> > [...]
> >
> >> > +static inline unsigned int scale_from_percent(unsigned int pct)
> >> > +{
> >> > + WARN_ON(pct > 100);
> >> > +
> >> > + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100);
> >> > +}
> >> > +
> >> > +static inline unsigned int scale_to_percent(unsigned int value)
> >> > +{
> >> > + unsigned int rounding = 0;
> >> > +
> >> > + WARN_ON(value > SCHED_FIXEDPOINT_SCALE);
> >> > +
> >> > + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */
> >> > + if (likely((value & 0xFF) && ~(value & 0x700)))
> >> > + rounding = 1;
> >>
> >> Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I missing?
> >
> > So, 0x700 is the topmost 3 bits sets (111 0000 0000) which different
> > configuration corresponds to:
> >
> > 001 0000 0000 => 256
> > 010 0000 0000 => 512
> > 011 0000 0000 => 768
> > 100 0000 0000 => 1024
> >
> > Thus, if 0x700 matches then we have one of these values in input and
> > for these cases we have to add a unit to the percentage value.
> >
> > For the case (value == 0) we translate it into 0% thanks to the check
> > on (value & 0xFF) to ensure rounding = 0.
> >
>
> I think just (value & 0xFF) is enough to get you the right behavior.
> ~(value & 0x700) is not needed, it's effectively a NoOp which always
> yields TRUE. For any *value* (value & 0x700) == 0x...00 and ~(value &
> 0x700) == 0x...FF == TRUE.

And you are actually right! ;)

> > Here is a small python snippet I've used to check the conversion of
> > all the possible percentage values:
> >
> > ---8<---
> > values = range(0, 101)
> > for pct in xrange(0, 101):
> > util = int((1024 * pct) / 100)
> > rounding = 1
> > if not ((util & 0xFF) and ~(util & 0x700)):

... it works also by patching the line above!


> > print "Fixing util_to_perc({:3d} => {:4d})".format(pct, util)
> > rounding = 0
> > pct2 = (rounding + ((100 * util) / 1024))
> > if pct2 in values:
> > values.remove(pct2)
> > if pct != pct2:
> > print "Convertion failed for: {:3d} => {:4d} => {:3d}".format(pct, util, pct2)
> > if values:
> > print "ERROR: not all percentage values converted"
> > ---8<---

Good catch, thanks!

--
#include <best/regards.h>

Patrick Bellasi

2018-07-26 22:13:18

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps

Sorry for the delay. Overlooked this comment...

On Tue, Jul 24, 2018 at 8:49 AM, Patrick Bellasi <[email protected]>
wrote:

> On 24-Jul 08:28, Suren Baghdasaryan wrote:
> > Hi Patrick. Thanks for the explanation and links. No more questions
> > from me on this one :)
>
> No problems at all!
>
> The important question is instead: does it makes sense for you too?
>

Well, it still feels unnatural to me due to the definition of the boost (at
least this much CPU bandwidth but higher should be fine).
Say I have a task which normally has specific boost and clamp requirements
(say TG.UCLAMP_MIN=20, TG.UCLAMP_MAX=80) which I want to temporarily boost
using a syscall to UCLAMP_MIN=60 (let's say a process should handle some
request and temporarily needs more CPU bandwidth). With this interface we
can clamp more than TG.UCLAMP_MAX value but we can't boost more than
TG.UCLAMP_MIN. For this usecase I would have to set TG.UCLAMP_MIN=80, then
use syscall to set SYSCALL.UCLAMP_MIN=20 to get effective UCLAMP_MIN=20 and
then set SYSCALL.UCLAMP_MIN=60 when I need that temporary boost.
To summarize, while this API does not stop me from achieving the desired
result it requires some hoop-jumping :)


>
> I think the important bits are that we are all on the same page about
> the end goals and features we like to have as well as the interface we use.
> This last has to fits best our goals and features while still being
> perfectly aligned with the frameworks we are integrating into... and
> that's still under discussion with Tejun on PATCH 08/12.
>
> Thanks again for your review!
>
> Cheers Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
>

2018-07-27 00:40:31

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On Tue, Jul 24, 2018 at 06:29:02AM -0700, Tejun Heo wrote:
> Hello, Patrick.
>
> On Mon, Jul 23, 2018 at 06:22:15PM +0100, Patrick Bellasi wrote:
> > However, the "best effort" bandwidth control we have for CFS and RT
> > can be further improved if, instead of just looking at time spent on
> > CPUs, we provide some more hints to the scheduler to know at which
> > min/max "MIPS" we want to consume the (best effort) time we have been
> > allocated on a CPU.
> >
> > Such a simple extension is still quite useful to satisfy many use-case
> > we have, mainly on mobile systems, like the ones I've described in the
> > "Newcomer's Short Abstract (Updated)"
> > section of the cover letter:
> > https://lore.kernel.org/lkml/[email protected]/T/#u
>
> So, that's all completely fine but then let's please not give it a
> name which doesn't quite match what it does. We can just call it
> e.g. cpufreq range control.

But then what name can one give it if it does more than one thing, like
task-placement and CPU frequency control?

It doesn't make sense to name it cpufreq IMHO. Its a clamp on the utilization
of the task which can be used for many purposes.

thanks,

- Joel


2018-07-27 08:10:20

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller

On Thursday 26 Jul 2018 at 17:39:19 (-0700), Joel Fernandes wrote:
> On Tue, Jul 24, 2018 at 06:29:02AM -0700, Tejun Heo wrote:
> > Hello, Patrick.
> >
> > On Mon, Jul 23, 2018 at 06:22:15PM +0100, Patrick Bellasi wrote:
> > > However, the "best effort" bandwidth control we have for CFS and RT
> > > can be further improved if, instead of just looking at time spent on
> > > CPUs, we provide some more hints to the scheduler to know at which
> > > min/max "MIPS" we want to consume the (best effort) time we have been
> > > allocated on a CPU.
> > >
> > > Such a simple extension is still quite useful to satisfy many use-case
> > > we have, mainly on mobile systems, like the ones I've described in the
> > > "Newcomer's Short Abstract (Updated)"
> > > section of the cover letter:
> > > https://lore.kernel.org/lkml/[email protected]/T/#u
> >
> > So, that's all completely fine but then let's please not give it a
> > name which doesn't quite match what it does. We can just call it
> > e.g. cpufreq range control.
>
> But then what name can one give it if it does more than one thing, like
> task-placement and CPU frequency control?
>
> It doesn't make sense to name it cpufreq IMHO. Its a clamp on the utilization
> of the task which can be used for many purposes.

Indeed, the scheduler could use clamped utilization values in several
places. The capacity-awareness bits (mostly useful for big.LITTLE
platforms) could already use that today I guess.

And on the longer term, depending on where the EAS patches [1] end up,
utilization clamping might actually become very useful to bias task
placement decisions. EAS basically decides where to place tasks based on
their utilization, so util_clamp would make a lot of sense there IMO.

Thanks,
Quentin

[1] https://lkml.org/lkml/2018/7/24/420