2019-01-15 11:51:35

by Patrick Bellasi

Subject: [PATCH v6 00/16] Add utilization clamping support

Hi all, this is a respin of:

https://lore.kernel.org/lkml/[email protected]/

which addresses all the comments collected in the previous posting and during
the LPC presentation [1].

It's based on v5.0-rc2, the full tree is available here:

git://linux-arm.org/linux-pb.git lkml/utilclamp_v6
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v6

Changes in this version are:

- rebased on top of recently merged EAS code [3] and better integrated with it
- squashed bucketization patch into previous patches
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
- updated cmpxchg loops to look like "do { } while (cmpxchg(ptr, old, new) != old)"
- switched to usage of try_cmpxchg()
- use SCHED_WARN_ON() instead of CONFIG_SCHED_DEBUG guarded blocks
- moved UCLAMP_FLAG_IDLE management into dedicated functions, i.e.
uclamp_idle_value() and uclamp_idle_reset()
- switched from rq::uclamp::flags to rq::uclamp_flags,
since now rq::uclamp is a per-clamp_id array
- added size check in sched_copy_attr()
- ensure se_count will never underflow
- better comment invariant conditions
- consistently use unary (++/--) operators
- redefined UCLAMP_BUCKETS_COUNT range to be [5..20]
- added the bit_for() macro and made use of it
- replaced some ifdeffery with IS_ENABLED() checks
- overall documentation review to match new subsystem/maintainer
handbook for tip/sched/core

Thanks for all the valuable comments; hopefully this should be a reasonably
stable version for all the core scheduler bits. Thus, I hope we are in a
good position to unblock Tejun [2] so he can delve into the review of the
proposed cgroup integration, but let's see what Peter and Ingo think first.

Cheers Patrick


Series Organization
===================

The series is organized into these main sections:

- Patches [01-07]: Per task (primary) API
- Patches [08-09]: Schedutil integration for CFS and RT tasks
- Patches [10-11]: EAS's energy_compute() integration
- Patches [12-16]: Per task group (secondary) API


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [3].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviours of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorised user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows user-space to "clamp" (i.e.
filter) the utilization generated by RT and FAIR tasks within a specified range.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). Here, the task utilization alone
does not give the scheduler a complete view of the task's requirements and,
since it's a small-utilization task, the scheduler keeps selecting a more
energy-efficient CPU, with smaller capacity and lower frequency, thus
negatively impacting the overall time required to complete task activations.

- capping: increase energy efficiency for background tasks not affecting the
user experience.

Since running on a lower capacity CPU at a lower frequency is more energy
efficient, when the completion time is not a main goal, then capping the
utilization considered for certain (maybe big) tasks can have positive
effects, both on energy consumption and thermal headroom.
This feature also allows making RT tasks more energy friendly on mobile
systems where running them on high capacity CPUs and at the maximum
frequency is not required.

From these two use-cases, it's worth noting that frequency selection
biasing, introduced by patches 8 and 9 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is in helping the scheduler make task placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.

- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.

This series does not present this additional usage of utilization clamping,
but it is an integral part of the EAS feature set, one of whose main
components is presented in [1].

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] "Expressing per-task/per-cgroup performance hints"
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/

[2] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

[3] https://lore.kernel.org/lkml/[email protected]/


Patrick Bellasi (16):
sched/core: Allow sched_setattr() to use the current policy
sched/core: uclamp: Extend sched_setattr() to support utilization
clamping
sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets
sched/core: uclamp: Add CPU's clamp buckets refcounting
sched/core: uclamp: Update CPU's refcount on clamp changes
sched/core: uclamp: Enforce last task UCLAMP_MAX
sched/core: uclamp: Add system default clamps
sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks
sched/cpufreq: uclamp: Add utilization clamping for RT tasks
sched/core: Add uclamp_util_with()
sched/fair: Add uclamp support to energy_compute()
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Map TG's clamp values into CPU's clamp buckets
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Documentation/admin-guide/cgroup-v2.rst | 46 ++
include/linux/log2.h | 37 +
include/linux/sched.h | 87 +++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/task.h | 6 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 12 +-
include/uapi/linux/sched/types.h | 65 +-
init/Kconfig | 75 ++
init/init_task.c | 1 +
kernel/exit.c | 1 +
kernel/sched/core.c | 947 +++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 46 +-
kernel/sched/fair.c | 41 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 136 +++-
kernel/sysctl.c | 16 +
17 files changed, 1480 insertions(+), 57 deletions(-)

--
2.19.2



2019-01-15 10:18:37

by Patrick Bellasi

Subject: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

Utilization clamp values enforced on a CPU by a task can be updated, for
example via a sched_setattr() syscall, while a task is RUNNABLE on that
CPU. A clamp value change always implies a clamp bucket refcount update
to ensure the new constraints are enforced.

Hook into uclamp_bucket_inc() to trigger a CPU refcount syncup, via
uclamp_cpu_{inc,dec}_id(), whenever a task is RUNNABLE.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Other:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
- small documentation updates
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 190137cd7b3b..67f059ee0a05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -884,6 +884,38 @@ static inline void uclamp_cpu_dec(struct rq *rq, struct task_struct *p)
uclamp_cpu_dec_id(p, rq, clamp_id);
}

+static inline void
+uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the CPU where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_cpu_dec_id(p, rq, clamp_id);
+ uclamp_cpu_inc_id(p, rq, clamp_id);
+
+done:
+ task_rq_unlock(rq, p, &rf);
+}
+
static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
@@ -907,8 +939,8 @@ static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
&uc_map_old.data, uc_map_new.data));
}

-static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
- unsigned int clamp_value)
+static void uclamp_bucket_inc(struct task_struct *p, struct uclamp_se *uc_se,
+ unsigned int clamp_id, unsigned int clamp_value)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
unsigned int prev_bucket_id = uc_se->bucket_id;
@@ -979,6 +1011,9 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
uc_se->value = clamp_value;
uc_se->bucket_id = bucket_id;

+ if (p)
+ uclamp_task_update_active(p, clamp_id);
+
if (uc_se->mapped)
uclamp_bucket_dec(clamp_id, prev_bucket_id);

@@ -1008,11 +1043,11 @@ static int __setscheduler_uclamp(struct task_struct *p,

mutex_lock(&uclamp_mutex);
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
- uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
+ uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
- uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
+ uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
mutex_unlock(&uclamp_mutex);
@@ -1049,7 +1084,8 @@ static void uclamp_fork(struct task_struct *p, bool reset)

p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
- uclamp_bucket_inc(&p->uclamp[clamp_id], clamp_id, clamp_value);
+ uclamp_bucket_inc(NULL, &p->uclamp[clamp_id],
+ clamp_id, clamp_value);
}
}

@@ -1069,7 +1105,7 @@ static void __init init_uclamp(void)
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
- uclamp_bucket_inc(uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
}
}

--
2.19.2


2019-01-15 10:18:49

by Patrick Bellasi

Subject: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class. However, when utilization clamping is in use, the
frequency selection should consider userspace utilization clamping
hints. This makes it possible, for example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support per-task tuning of frequency
selection, thus supporting a fine-grained definition of performance boosting
vs energy saving strategies in kernel space.

Add support to clamp the utilization and IOWait boost of RUNNABLE FAIR
tasks within the boundaries defined by their aggregated utilization
clamp constraints.
Based on the max(min_util, max_util) of each task, the CPU clamp value is
max-aggregated in a way that gives boosted tasks the performance they need
when they happen to be co-scheduled with other capped tasks.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>

---
Changes in v6:
Message-ID: <20181107113849.GC14309@e110439-lin>
- sanity check util_max >= util_min
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
kernel/sched/cpufreq_schedutil.c | 27 ++++++++++++++++++++++++---
kernel/sched/sched.h | 23 +++++++++++++++++++++++
2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 033ec7c45f13..520ee2b785e7 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -218,8 +218,15 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS utilization can be boosted or capped, depending on utilization
+ * clamp constraints requested by currently RUNNABLE tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
+ util = (type == ENERGY_UTIL)
+ ? util_cfs
+ : uclamp_util(rq, util_cfs);
util += cpu_util_rt(rq);

dl_util = cpu_util_dl(rq);
@@ -327,6 +334,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
unsigned int flags)
{
bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
+ unsigned int max_boost;

/* Reset boost if the CPU appears to have been idle enough */
if (sg_cpu->iowait_boost &&
@@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
return;
sg_cpu->iowait_boost_pending = true;

+ /*
+ * Boost FAIR tasks only up to the CPU clamped utilization.
+ *
+ * Since DL tasks have a much more advanced bandwidth control, it's
+ * safe to assume that IO boost does not apply to those tasks.
+ * Instead, since RT tasks are not utilization clamped, we don't want
+ * to apply clamping on IO boost while there is blocked RT
+ * utilization.
+ */
+ max_boost = sg_cpu->iowait_boost_max;
+ if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+
/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
sg_cpu->iowait_boost <<= 1;
- if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
- sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
+ if (sg_cpu->iowait_boost > max_boost)
+ sg_cpu->iowait_boost = max_boost;
return;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7f3ee8ba164..95d62a2a0b44 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2267,6 +2267,29 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion, which we can fix at usage time.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.19.2


2019-01-15 10:19:00

by Patrick Bellasi

Subject: [PATCH v6 14/16] sched/core: uclamp: Map TG's clamp values into CPU's clamp buckets

Utilization clamping requires mapping each different clamp value into one
of the available clamp buckets used at {en,de}queue time (fast-path).
Each time a TG's clamp value sysfs attribute is updated via:
cpu_util_{min,max}_write_u64()
we need to update the task group reference to the new value's clamp
bucket and release the reference to the previous one.

Ensure that, whenever a task group is assigned a specific clamp_value,
this is properly translated into a unique clamp bucket to be used in the
fast-path. Do it by slightly refactoring uclamp_bucket_inc() to make the
(*task_struct) parameter optional and by reusing the code already
available for the per-task API.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v6:
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
include/linux/sched.h | 4 ++--
kernel/sched/core.c | 53 +++++++++++++++++++++++++++++++++----------
2 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05d286524d70..3f02128fe6b2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -617,8 +617,8 @@ struct sched_dl_entity {
* uclamp_bucket_dec() - for the old clamp value
*
* The active bit is set whenever a task has got an effective clamp bucket
- * and value assigned, which can be different from the user requested ones.
- * This allows to know a task is actually refcounting a CPU's clamp bucket.
+ * and value assigned, and it allows to know a task is actually refcounting a
+ * CPU's clamp bucket.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ddbd591b305c..734b769db2ca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1298,9 +1298,9 @@ static void __init init_uclamp(void)
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp bucket */
uc_se = &root_task_group.uclamp[clamp_id];
- uc_se->value = uclamp_none(clamp_id);
- uc_se->bucket_id = 0;
- uc_se->effective.value = uclamp_none(clamp_id);
+ uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uc_se->effective.bucket_id = uc_se->bucket_id;
+ uc_se->effective.value = uc_se->value;
#endif
}
}
@@ -6880,6 +6880,16 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline void free_uclamp_sched_group(struct task_group *tg)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_bucket_dec(clamp_id, tg->uclamp[clamp_id].bucket_id);
+#endif
+}
+
static inline int alloc_uclamp_sched_group(struct task_group *tg,
struct task_group *parent)
{
@@ -6887,12 +6897,12 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
int clamp_id;

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- tg->uclamp[clamp_id].value =
- parent->uclamp[clamp_id].value;
- tg->uclamp[clamp_id].bucket_id =
- parent->uclamp[clamp_id].bucket_id;
+ uclamp_bucket_inc(NULL, &tg->uclamp[clamp_id], clamp_id,
+ parent->uclamp[clamp_id].value);
tg->uclamp[clamp_id].effective.value =
parent->uclamp[clamp_id].effective.value;
+ tg->uclamp[clamp_id].effective.bucket_id =
+ parent->uclamp[clamp_id].effective.bucket_id;
}
#endif

@@ -6901,6 +6911,7 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,

static void sched_free_group(struct task_group *tg)
{
+ free_uclamp_sched_group(tg);
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
@@ -7147,7 +7158,8 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)

#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_hier(struct cgroup_subsys_state *css,
- int clamp_id, unsigned int value)
+ unsigned int clamp_id, unsigned int bucket_id,
+ unsigned int value)
{
struct cgroup_subsys_state *top_css = css;
struct uclamp_se *uc_se, *uc_parent;
@@ -7159,8 +7171,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
* groups we consider their current value.
*/
uc_se = &css_tg(css)->uclamp[clamp_id];
- if (css != top_css)
+ if (css != top_css) {
value = uc_se->value;
+ bucket_id = uc_se->effective.bucket_id;
+ }

/*
* Skip the whole subtrees if the current effective clamp is
@@ -7176,12 +7190,15 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,
}

/* Propagate the most restrictive effective value */
- if (uc_parent->effective.value < value)
+ if (uc_parent->effective.value < value) {
value = uc_parent->effective.value;
+ bucket_id = uc_parent->effective.bucket_id;
+ }
if (uc_se->effective.value == value)
continue;

uc_se->effective.value = value;
+ uc_se->effective.bucket_id = bucket_id;
}
}

@@ -7194,6 +7211,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7204,11 +7222,16 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
goto out;
}

+ /* Update TG's reference count */
+ uclamp_bucket_inc(NULL, &tg->uclamp[UCLAMP_MIN], UCLAMP_MIN, min_value);
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+ cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
+ min_value);

out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
@@ -7222,6 +7245,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7232,11 +7256,16 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
goto out;
}

+ /* Update TG's reference count */
+ uclamp_bucket_inc(NULL, &tg->uclamp[UCLAMP_MAX], UCLAMP_MAX, max_value);
+
/* Update effective clamps to track the most restrictive value */
- cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+ cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
+ max_value);

out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
--
2.19.2


2019-01-15 10:19:06

by Patrick Bellasi

Subject: [PATCH v6 16/16] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.

Do that by slightly refactoring uclamp_bucket_inc(): an additional
cgroup_subsys_state (css) parameter is used to walk the list of tasks in
the TG and update the RUNNABLE ones. The rq lock is taken for each task,
the same mechanism used for cpu affinity mask updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v6:
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
- small documentation updates
---
kernel/sched/core.c | 56 +++++++++++++++++++++++++++++++++------------
1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8d1fc9880ff..36866a1b9f9d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1111,7 +1111,22 @@ static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
&uc_map_old.data, uc_map_new.data));
}

-static void uclamp_bucket_inc(struct task_struct *p, struct uclamp_se *uc_se,
+static inline void uclamp_bucket_inc_tg(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int bucket_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ /* Update clamp buckets for RUNNABLE tasks in this TG */
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_task_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
+static void uclamp_bucket_inc(struct task_struct *p,
+ struct cgroup_subsys_state *css,
+ struct uclamp_se *uc_se,
unsigned int clamp_id, unsigned int clamp_value)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
@@ -1183,6 +1198,9 @@ static void uclamp_bucket_inc(struct task_struct *p, struct uclamp_se *uc_se,
uc_se->value = clamp_value;
uc_se->bucket_id = bucket_id;

+ if (css)
+ uclamp_bucket_inc_tg(css, clamp_id, bucket_id);
+
if (p)
uclamp_task_update_active(p, clamp_id);

@@ -1221,11 +1239,11 @@ int sched_uclamp_handler(struct ctl_table *table, int write,
}

if (old_min != sysctl_sched_uclamp_util_min) {
- uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MIN],
+ uclamp_bucket_inc(NULL, NULL, &uclamp_default[UCLAMP_MIN],
UCLAMP_MIN, sysctl_sched_uclamp_util_min);
}
if (old_max != sysctl_sched_uclamp_util_max) {
- uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MAX],
+ uclamp_bucket_inc(NULL, NULL, &uclamp_default[UCLAMP_MAX],
UCLAMP_MAX, sysctl_sched_uclamp_util_max);
}
goto done;
@@ -1260,12 +1278,12 @@ static int __setscheduler_uclamp(struct task_struct *p,
mutex_lock(&uclamp_mutex);
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
p->uclamp[UCLAMP_MIN].user_defined = true;
- uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
+ uclamp_bucket_inc(p, NULL, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
p->uclamp[UCLAMP_MAX].user_defined = true;
- uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
+ uclamp_bucket_inc(p, NULL, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
mutex_unlock(&uclamp_mutex);
@@ -1304,7 +1322,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)

p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
- uclamp_bucket_inc(NULL, &p->uclamp[clamp_id],
+ uclamp_bucket_inc(NULL, NULL, &p->uclamp[clamp_id],
clamp_id, clamp_value);
}
}
@@ -1326,19 +1344,23 @@ static void __init init_uclamp(void)
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
- uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_bucket_inc(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(clamp_id));

uc_se = &uclamp_default[clamp_id];
- uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+ uclamp_bucket_inc(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(clamp_id));

/* RT tasks by default will go to max frequency */
uc_se = &uclamp_default_perf[clamp_id];
- uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uclamp_bucket_inc(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(UCLAMP_MAX));

#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Init root TG's clamp bucket */
uc_se = &root_task_group.uclamp[clamp_id];
- uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+ uclamp_bucket_inc(NULL, NULL, uc_se, clamp_id,
+ uclamp_none(UCLAMP_MAX));
uc_se->effective.bucket_id = uc_se->bucket_id;
uc_se->effective.value = uc_se->value;
#endif
@@ -6937,8 +6959,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
int clamp_id;

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- uclamp_bucket_inc(NULL, &tg->uclamp[clamp_id], clamp_id,
- parent->uclamp[clamp_id].value);
+ uclamp_bucket_inc(NULL, NULL, &tg->uclamp[clamp_id],
+ clamp_id, parent->uclamp[clamp_id].value);
tg->uclamp[clamp_id].effective.value =
parent->uclamp[clamp_id].effective.value;
tg->uclamp[clamp_id].effective.bucket_id =
@@ -7239,6 +7261,10 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,

uc_se->effective.value = value;
uc_se->effective.bucket_id = bucket_id;
+
+ /* Immediately update descendants' active tasks */
+ if (css != top_css)
+ uclamp_bucket_inc_tg(css, clamp_id, bucket_id);
}
}

@@ -7263,7 +7289,8 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
}

/* Update TG's reference count */
- uclamp_bucket_inc(NULL, &tg->uclamp[UCLAMP_MIN], UCLAMP_MIN, min_value);
+ uclamp_bucket_inc(NULL, css, &tg->uclamp[UCLAMP_MIN],
+ UCLAMP_MIN, min_value);

/* Update effective clamps to track the most restrictive value */
cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
@@ -7297,7 +7324,8 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}

/* Update TG's reference count */
- uclamp_bucket_inc(NULL, &tg->uclamp[UCLAMP_MAX], UCLAMP_MAX, max_value);
+ uclamp_bucket_inc(NULL, css, &tg->uclamp[UCLAMP_MAX],
+ UCLAMP_MAX, max_value);

/* Update effective clamps to track the most restrictive value */
cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
--
2.19.2


2019-01-15 10:19:08

by Patrick Bellasi

Subject: [PATCH v6 13/16] sched/core: uclamp: Propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are (potentially) constrained based on the parent's assigned
resources. This requires properly propagating and aggregating parent
attributes down to its descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.

Since it can be interesting for userspace, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes
cpu.util.{min,max}.effective.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v6:
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
Documentation/admin-guide/cgroup-v2.rst | 25 ++++++-
include/linux/sched.h | 10 ++-
kernel/sched/core.c | 89 +++++++++++++++++++++++--
3 files changed, 117 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a059aaf7cce6..7aad2435e961 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -984,22 +984,43 @@ All time durations are in microseconds.
A read-write single value file which exists on non-root cgroups.
The default is "0", i.e. no utilization boosting.

- The minimum utilization in the range [0, 1024].
+ The requested minimum utilization in the range [0, 1024].

This interface allows reading and setting minimum utilization clamp
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.

+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the minimum utilization clamp value currently enforced on a
+ task group.
+
+ The actual minimum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1024". i.e. no utilization capping

- The maximum utilization in the range [0, 1024].
+ The requested maximum utilization in the range [0, 1024].

This interface allows reading and setting maximum utilization clamp
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.

+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the maximum utilization clamp value currently enforced on a
+ task group.
+
+ The actual maximum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.
+
+
Memory
------

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c8f391d1cdc5..05d286524d70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -625,7 +625,15 @@ struct uclamp_se {
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int mapped : 1;
unsigned int active : 1;
- /* Clamp bucket and value actually used by a RUNNABLE task */
+ /*
+ * Clamp bucket and value actually used by a scheduling entity,
+ * i.e. a (RUNNABLE) task or a task group.
+ * For task groups, this is the value (possibly) enforced by a
+ * parent task group.
+ * For a task, this is the value (possibly) enforced by the
+ * task group the task is currently part of or by the system
+ * default clamp values, whichever is the most restrictive.
+ */
struct {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 29ae83fb9786..ddbd591b305c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1300,6 +1300,7 @@ static void __init init_uclamp(void)
uc_se = &root_task_group.uclamp[clamp_id];
uc_se->value = uclamp_none(clamp_id);
uc_se->bucket_id = 0;
+ uc_se->effective.value = uclamp_none(clamp_id);
#endif
}
}
@@ -6890,6 +6891,8 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
parent->uclamp[clamp_id].value;
tg->uclamp[clamp_id].bucket_id =
parent->uclamp[clamp_id].bucket_id;
+ tg->uclamp[clamp_id].effective.value =
+ parent->uclamp[clamp_id].effective.value;
}
#endif

@@ -7143,6 +7146,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ int clamp_id, unsigned int value)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+
+ css_for_each_descendant_pre(css, top_css) {
+ /*
+ * The first visited task group is top_css, whose clamp value
+ * is the one passed as a parameter. For descendant task
+ * groups we consider their current value.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (css != top_css)
+ value = uc_se->value;
+
+ /*
+ * Skip the whole subtree if the current effective clamp
+ * already matches the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive value, as effective clamp.
+ */
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+ if (uc_se->effective.value == value &&
+ uc_parent->effective.value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent->effective.value < value)
+ value = uc_parent->effective.value;
+ if (uc_se->effective.value == value)
+ continue;
+
+ uc_se->effective.value = value;
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -7162,6 +7204,9 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
goto out;
}

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MIN, min_value);
+
out:
rcu_read_unlock();

@@ -7187,6 +7232,9 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
goto out;
}

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MAX, max_value);
+
out:
rcu_read_unlock();

@@ -7194,14 +7242,17 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
}

static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;

rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].effective.value
+ : tg->uclamp[clamp_id].value;
rcu_read_unlock();

return util_clamp;
@@ -7210,13 +7261,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}

static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */

@@ -7564,11 +7627,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7744,12 +7815,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
--
2.19.2


2019-01-15 10:19:15

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy

The sched_setattr() syscall mandates that a policy is always specified.
This requires always knowing which policy a task will have when
attributes are configured, and it makes it impossible to add more generic
task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Internally, this is implemented by extending the sched_setattr()
non-POSIX syscall with the SETPARAM_POLICY value already used by the
sched_setparam() POSIX syscall.
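
The resulting validation can be summarized by a small user-space model of
the check added in the hunk below (the constants are illustrative copies
of the uapi values and of the kernel-internal SETPARAM_POLICY marker, not
something user code should redefine):

```c
#include <assert.h>

#define SCHED_POLICY_MAX	7	/* one past SCHED_DEADLINE */
#define SCHED_FLAG_KEEP_POLICY	0x08
#define SETPARAM_POLICY		-1	/* kernel-internal marker */
#define EINVAL			22

/* Model of the sched_setattr() policy check: a valid policy is always
 * required, unless SCHED_FLAG_KEEP_POLICY asks to retain the current
 * one, in which case the policy is replaced by SETPARAM_POLICY. */
static int check_policy(unsigned int *policy, unsigned int flags)
{
	if (*policy >= SCHED_POLICY_MAX &&
	    !(flags & SCHED_FLAG_KEEP_POLICY))
		return -EINVAL;
	if (flags & SCHED_FLAG_KEEP_POLICY)
		*policy = (unsigned int)SETPARAM_POLICY;
	return 0;
}
```

So an out-of-range policy is rejected unless the caller sets
SCHED_FLAG_KEEP_POLICY, in which case the policy field is ignored.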

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Message-ID: <[email protected]>
- renamed SCHED_FLAG_TUNE_POLICY to SCHED_FLAG_KEEP_POLICY
- moved to the beginning of the series
---
include/uapi/linux/sched.h | 6 +++++-
kernel/sched/core.c | 11 ++++++++++-
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..43832a87016a 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,6 +40,8 @@
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+/* Must be the last entry: used to sanity check attr.policy values */
+#define SCHED_POLICY_MAX 7

/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -50,9 +52,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a674c7db2f29..a68763a4ccae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4560,8 +4560,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
if (retval)
return retval;

- if ((int)attr.sched_policy < 0)
+ /*
+ * A valid policy is always required from userspace, unless
+ * SCHED_FLAG_KEEP_POLICY is set and the current policy
+ * is enforced for this call.
+ */
+ if (attr.sched_policy >= SCHED_POLICY_MAX &&
+ !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
return -EINVAL;
+ }
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;

rcu_read_lock();
retval = -ESRCH;
--
2.19.2


2019-01-15 10:19:32

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 06/16] sched/core: uclamp: Enforce last task UCLAMP_MAX

When the task sleeps, it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full, unclamped blocked utilization of the
idle CPU.

Fix this by using
uclamp_cpu_dec_id(p, rq, UCLAMP_MAX)
uclamp_cpu_update(rq, UCLAMP_MAX, clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.

Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.
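
The max-clamp retention can be sketched with a simplified user-space
model (the flag name mirrors the patch; the per-bucket scan is reduced to
a single "last clamped task" transition, and the struct is hypothetical):

```c
#include <assert.h>

#define UCLAMP_FLAG_IDLE	0x01
#define SCHED_CAPACITY_SCALE	1024

/* Simplified stand-in for rq->uclamp[UCLAMP_MAX].value / rq->uclamp_flags */
struct cpu_clamp {
	unsigned int max_value;
	unsigned int flags;
};

/* Last clamped task goes to sleep: retain its max clamp and flag the
 * CPU as idle-clamped, instead of resetting to SCHED_CAPACITY_SCALE. */
static void cpu_max_clamp_sleep(struct cpu_clamp *c, unsigned int last_clamp)
{
	c->flags |= UCLAMP_FLAG_IDLE;
	c->max_value = last_clamp;
}

/* A clamped task wakes up: drop the retention and track its clamp. */
static void cpu_max_clamp_wakeup(struct cpu_clamp *c, unsigned int task_clamp)
{
	if (!(c->flags & UCLAMP_FLAG_IDLE))
		return;
	c->flags &= ~UCLAMP_FLAG_IDLE;
	c->max_value = task_clamp;
}
```

While the flag is set, a remote frequency update on the same frequency
domain sees the retained (small) max clamp rather than the unclamped
blocked utilization.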

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Others:
- moved UCLAMP_FLAG_IDLE management into dedicated functions:
uclamp_idle_value() and uclamp_idle_reset()
- switched from rq::uclamp::flags to rq::uclamp_flags, since now
rq::uclamp is a per-clamp_id array
---
kernel/sched/core.c | 51 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 2 ++
2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67f059ee0a05..b7ac516a70be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,9 +766,45 @@ static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
}

-static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+static inline unsigned int
+uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
+{
+ /*
+ * Avoid blocked utilization pushing up the frequency when we go
+ * idle (which drops the max-clamp) by retaining the last known
+ * max-clamp.
+ */
+ if (clamp_id == UCLAMP_MAX) {
+ rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
+ return clamp_value;
+ }
+
+ return uclamp_none(UCLAMP_MIN);
+}
+
+static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
+{
+ /* Reset max-clamp retention only on idle exit */
+ if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
+ return;
+
+ WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
+
+ /*
+ * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
+ * (after). The idle flag is reset only the second time, when we know
+ * that UCLAMP_MIN has been already updated.
+ */
+ if (clamp_id == UCLAMP_MAX)
+ rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+}
+
+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
{
unsigned int max_value = 0;
+ bool buckets_active = false;
unsigned int bucket_id;

for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
@@ -776,6 +812,7 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)

if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
continue;
+ buckets_active = true;

/* Both min and max clamps are MAX aggregated */
bucket_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
@@ -783,6 +820,10 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
if (max_value >= SCHED_CAPACITY_SCALE)
break;
}
+
+ if (unlikely(!buckets_active))
+ max_value = uclamp_idle_value(rq, clamp_id, clamp_value);
+
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

@@ -808,8 +849,11 @@ static inline void uclamp_cpu_inc_id(struct task_struct *p, struct rq *rq,

rq->uclamp[clamp_id].bucket[bucket_id].tasks++;

- /* CPU's clamp buckets track the max effective clamp value */
+ /* Reset clamp holds on idle exit */
tsk_clamp = p->uclamp[clamp_id].value;
+ uclamp_idle_reset(rq, clamp_id, tsk_clamp);
+
+ /* CPU's clamp buckets track the max effective clamp value */
grp_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
rq->uclamp[clamp_id].bucket[bucket_id].value = max(grp_clamp, tsk_clamp);

@@ -858,7 +902,7 @@ static inline void uclamp_cpu_dec_id(struct task_struct *p, struct rq *rq,
*/
rq->uclamp[clamp_id].bucket[bucket_id].value =
uclamp_maps[clamp_id][bucket_id].value;
- uclamp_cpu_update(rq, clamp_id);
+ uclamp_cpu_update(rq, clamp_id, clamp_value);
}
}

@@ -1100,6 +1144,7 @@ static void __init init_uclamp(void)
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
cpu_rq(cpu)->uclamp[UCLAMP_MAX].value = uclamp_none(UCLAMP_MAX);
+ cpu_rq(cpu)->uclamp_flags = 0;
}

memset(uclamp_maps, 0, sizeof(uclamp_maps));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 06ff7d890ff6..b7f3ee8ba164 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -882,6 +882,8 @@ struct rq {
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;
+ unsigned int uclamp_flags;
+#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
--
2.19.2


2019-01-15 10:19:39

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and we unconditionally enforce the required clamps.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

/proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

If a task specific value is not compliant with the system default range,
it will be forced to the corresponding system default value.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
The current restriction could be too aggressive since, for example, if a
task has a util_min higher than the system default max, it will be
forced to the system default min unconditionally.

Let's say we have:

Task Clamp: min=30, max=40
System Clamps: min=10, max=20

In principle we should set the task's min=20, since the system allows
boosts up to 20%. In the current implementation, however, since the
task's min exceeds the system max, we just go for task min=10.

We should probably restrict util_min to the maximum system default
value, but that would make the code more complex since it would require
tracking a cross-clamp_id dependency.
Let's keep this as a possible future extension whenever we should really
see the need for it.
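
The example above can be reproduced with a small model of the restriction
implemented by uclamp_effective_get() in the hunk below (the helper name
is illustrative): with task min=30/max=40 and system min=10/max=20, both
task values fall outside the system range and are replaced by the
corresponding system defaults, so the task min ends up at 10 rather
than 20.

```c
#include <assert.h>

enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/* Simplified uclamp_effective_get(): if a task-specific value is
 * outside the system default range, the corresponding system default
 * is enforced unconditionally ("keep it simple"). */
static unsigned int effective_clamp(const unsigned int task[UCLAMP_CNT],
				    const unsigned int sys[UCLAMP_CNT],
				    int clamp_id)
{
	unsigned int v = task[clamp_id];

	if (v < sys[UCLAMP_MIN] || v > sys[UCLAMP_MAX])
		return sys[clamp_id];
	return v;
}
```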

Changes in v6:
Others:
- wholesale s/group/bucket/
- make use of the bit_for() macro
---
include/linux/sched.h | 5 ++
include/linux/sched/sysctl.h | 11 +++
kernel/sched/core.c | 137 ++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 16 ++++
4 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84294925d006..c8f391d1cdc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -625,6 +625,11 @@ struct uclamp_se {
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int mapped : 1;
unsigned int active : 1;
+ /* Clamp bucket and value actually used by a RUNNABLE task */
+ struct {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+ } effective;
};
#endif /* CONFIG_UCLAMP_TASK */

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a9c32daeb9d8..445fb54eaeff 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_UCLAMP_TASK
+extern int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7ac516a70be..d1ea5825501a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -731,6 +731,23 @@ static void set_load_weight(struct task_struct *p, bool update_load)
static DEFINE_MUTEX(uclamp_mutex);

/*
+ * Minimum utilization for FAIR tasks
+ * default: 0
+ */
+unsigned int sysctl_sched_uclamp_util_min;
+
+/*
+ * Maximum utilization for FAIR tasks
+ * default: 1024
+ */
+unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+
+/*
+ * Task-specific clamp values are required to be within this range
+ */
+static struct uclamp_se uclamp_default[UCLAMP_CNT];
+
+/**
* Reference count utilization clamp buckets
* @value: the utilization "clamp value" tracked by this clamp bucket
* @se_count: the number of scheduling entities using this "clamp value"
@@ -827,6 +844,72 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

+/*
+ * The effective clamp bucket index of a task depends on, by increasing
+ * priority:
+ * - the task specific clamp value, explicitly requested from userspace
+ * - the system default clamp value, defined by the sysadmin
+ *
+ * As a side effect, update the task's effective value:
+ * task_struct::uclamp::effective::value
+ * to represent the clamp value of the task effective bucket index.
+ */
+static inline void
+uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
+ unsigned int *clamp_value, unsigned int *bucket_id)
+{
+ /* Task specific clamp value */
+ *clamp_value = p->uclamp[clamp_id].value;
+ *bucket_id = p->uclamp[clamp_id].bucket_id;
+
+ /* System default restriction */
+ if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
+ *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
+ /* Keep it simple: unconditionally enforce system defaults */
+ *clamp_value = uclamp_default[clamp_id].value;
+ *bucket_id = uclamp_default[clamp_id].bucket_id;
+ }
+}
+
+static inline void
+uclamp_effective_assign(struct task_struct *p, unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ p->uclamp[clamp_id].effective.value = clamp_value;
+ p->uclamp[clamp_id].effective.bucket_id = bucket_id;
+}
+
+static inline unsigned int uclamp_effective_bucket_id(struct task_struct *p,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ /* Task currently refcounted: use back-annotate effective value */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].effective.bucket_id;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ return bucket_id;
+}
+
+static unsigned int uclamp_effective_value(struct task_struct *p,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ /* Task currently refcounted: use back-annotate effective value */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].effective.value;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ return clamp_value;
+}
+
/*
* When a task is enqueued on a CPU's rq, the clamp bucket currently defined by
* the task's uclamp::bucket_id is reference counted on that CPU. This also
@@ -843,14 +926,15 @@ static inline void uclamp_cpu_inc_id(struct task_struct *p, struct rq *rq,

if (unlikely(!p->uclamp[clamp_id].mapped))
return;
+ uclamp_effective_assign(p, clamp_id);

- bucket_id = p->uclamp[clamp_id].bucket_id;
+ bucket_id = uclamp_effective_bucket_id(p, clamp_id);
p->uclamp[clamp_id].active = true;

rq->uclamp[clamp_id].bucket[bucket_id].tasks++;

/* Reset clamp holds on idle exit */
- tsk_clamp = p->uclamp[clamp_id].value;
+ tsk_clamp = uclamp_effective_value(p, clamp_id);
uclamp_idle_reset(rq, clamp_id, tsk_clamp);

/* CPU's clamp buckets track the max effective clamp value */
@@ -880,7 +964,7 @@ static inline void uclamp_cpu_dec_id(struct task_struct *p, struct rq *rq,
if (unlikely(!p->uclamp[clamp_id].mapped))
return;

- bucket_id = p->uclamp[clamp_id].bucket_id;
+ bucket_id = uclamp_effective_bucket_id(p, clamp_id);
p->uclamp[clamp_id].active = false;

SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
@@ -1068,6 +1152,50 @@ static void uclamp_bucket_inc(struct task_struct *p, struct uclamp_se *uc_se,
uc_se->mapped = true;
}

+int sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int old_min, old_max;
+ int result = 0;
+
+ mutex_lock(&uclamp_mutex);
+
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto undo;
+ }
+
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MIN],
+ UCLAMP_MIN, sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MAX],
+ UCLAMP_MAX, sysctl_sched_uclamp_util_max);
+ }
+ goto done;
+
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+
+done:
+ mutex_unlock(&uclamp_mutex);
+
+ return result;
+}
+
static int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -1151,6 +1279,9 @@ static void __init init_uclamp(void)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+
+ uc_se = &uclamp_default[clamp_id];
+ uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
}
}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e85feb8..b0fa4a883999 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -446,6 +446,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.19.2


2019-01-15 10:19:45

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:
a) the (active) power consumption defined for each CPU frequency
b) an estimation of which frequency will be used on each CPU
c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c) estimations. A CPU is
expected to run:
- on a higher than required frequency, but for a shorter time, in case
its estimated utilization will be smaller than the minimum utilization
enforced by uclamp
- on a lower than required frequency, but for a longer time, in case
its estimated utilization is bigger than the maximum utilization
enforced by uclamp

While effects on busy time for both boosted/capped tasks are already
considered by compute_energy(), clamping effects on frequency selection
are currently ignored by that function.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

While at it:
- rename schedutil_freq_util() to schedutil_cpu_util(),
since it is not only used for frequency selection.
- use "unsigned int" instead of "unsigned long" whenever the tracked
utilization value is not expected to overflow 32 bits.
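
The split between the two utilization types can be sketched with a toy
model of the compute_energy() loop below (everything here is a
simplification under stated assumptions: clamp_util() stands in for the
uclamp aggregation, and em_pd_energy() is reduced to a cost proportional
to max_util scaled by the busy ratio):

```c
#include <assert.h>

#define CPU_CAP 1024	/* assumed uniform capacity of the domain */

static unsigned int clamp_util(unsigned int util, unsigned int min,
			       unsigned int max)
{
	if (util < min)
		return min;
	if (util > max)
		return max;
	return util;
}

/* Toy compute_energy(): busy time (sum_util, ENERGY_UTIL) is taken
 * unclamped, while the frequency-driving term (max_util, FREQUENCY_UTIL)
 * goes through the clamps, since only frequency selection is affected
 * by uclamp. Energy is modeled as max_util scaled by the busy ratio. */
static unsigned long pd_energy(const unsigned int *cpu_util, int nr_cpus,
			       unsigned int min_clamp, unsigned int max_clamp)
{
	unsigned int max_util = 0, sum_util = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int f_util = clamp_util(cpu_util[cpu],
						 min_clamp, max_clamp);

		sum_util += cpu_util[cpu];	/* ENERGY_UTIL */
		if (f_util > max_util)		/* FREQUENCY_UTIL */
			max_util = f_util;
	}
	return (unsigned long)max_util * sum_util / CPU_CAP;
}
```

A min clamp raises the estimated energy (higher frequency for the same
busy time), which is exactly the effect compute_energy() was previously
ignoring.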

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 26 +++++++++++-----------
kernel/sched/fair.c | 37 ++++++++++++++++++++++++++------
kernel/sched/sched.h | 19 +++++-----------
3 files changed, 48 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 38a05a4f78cc..4e02b419c482 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -195,10 +195,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type)
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+ unsigned int max, enum schedutil_type type,
+ struct task_struct *p)
{
- unsigned long dl_util, util, irq;
+ unsigned int dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

/*
@@ -222,13 +223,9 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/
- util = cpu_util_rt(rq);
- if (type == FREQUENCY_UTIL) {
- util += cpu_util_cfs(rq);
- util = uclamp_util(rq, util);
- } else {
- util += util_cfs;
- }
+ util = cpu_util_rt(rq) + util_cfs;
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -282,13 +279,14 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util = cpu_util_cfs(rq);
- unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ unsigned int util_cfs = cpu_util_cfs(rq);
+ unsigned int cpu_cap = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);

- sg_cpu->max = max;
+ sg_cpu->max = cpu_cap;
sg_cpu->bw_dl = cpu_bw_dl(rq);

- return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+ return schedutil_cpu_util(sg_cpu->cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, NULL);
}

/**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5de061b055d2..7f8ca3b02dec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6424,11 +6424,20 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- long util, max_util, sum_util, energy = 0;
+ unsigned int max_util, cfs_util, cpu_util, cpu_cap;
+ unsigned long sum_util, energy = 0;
int cpu;

for (; pd; pd = pd->next) {
+ struct cpumask *pd_mask = perf_domain_span(pd);
+
+ /*
+ * The energy model mandates that all the CPUs of a performance
+ * domain have the same capacity.
+ */
+ cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
max_util = sum_util = 0;
+
/*
* The capacity state of CPUs of the current rd can be driven by
* CPUs of another rd if they belong to the same performance
@@ -6439,11 +6448,27 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* it will not appear in its pd list and will not be accounted
* by compute_energy().
*/
- for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
- util = schedutil_energy_util(cpu, util);
- max_util = max(util, max_util);
- sum_util += util;
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ cfs_util = cpu_util_next(cpu, p, dst_cpu);
+
+ /*
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
+ */
+ sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+ ENERGY_UTIL, NULL);
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ */
+ cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+ FREQUENCY_UTIL,
+ cpu == dst_cpu ? p : NULL);
+ max_util = max(max_util, cpu_util);
}

energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7ce3023d023..a70f4bf66285 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2317,7 +2317,6 @@ static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
# define arch_scale_freq_invariant() false
#endif

-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
@@ -2333,15 +2332,10 @@ enum schedutil_type {
ENERGY_UTIL,
};

-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type);
-
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
-
- return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+ unsigned int max, enum schedutil_type type,
+ struct task_struct *p);

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
@@ -2370,10 +2364,7 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- return cfs;
-}
+#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
#endif

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.19.2


2019-01-15 10:19:56

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 10/16] sched/core: Add uclamp_util_with()

Currently uclamp_util() allows clamping a specified utilization
according to the clamp values requested by the RUNNABLE tasks on a CPU.
Sometimes, however, it is useful to know how the clamp values would
change if a task were to run on a given CPU.
For example, the Energy Aware Scheduler (EAS) is interested in
evaluating and comparing the energy impact of different scheduling
decisions.

Add uclamp_util_with(), which clamps a given utilization while also
considering the impact a specified task would have on the CPU's clamp values.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/sched.h | 21 ++++++++++++++++++++-
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1ed01f381641..b41db1190d28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -904,8 +904,8 @@ static inline unsigned int uclamp_effective_bucket_id(struct task_struct *p,
return bucket_id;
}

-static unsigned int uclamp_effective_value(struct task_struct *p,
- unsigned int clamp_id)
+unsigned int uclamp_effective_value(struct task_struct *p,
+ unsigned int clamp_id)
{
unsigned int clamp_value, bucket_id;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 95d62a2a0b44..b7ce3023d023 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2268,11 +2268,20 @@ static inline unsigned int uclamp_none(int clamp_id)
}

#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+unsigned int uclamp_effective_value(struct task_struct *p, unsigned int clamp_id);
+
+static __always_inline
+unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
{
unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

+ if (p) {
+ min_util = max(min_util, uclamp_effective_value(p, UCLAMP_MIN));
+ max_util = max(max_util, uclamp_effective_value(p, UCLAMP_MAX));
+ }
+
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
* RUNNABLE tasks with _different_ clamps, we can end up with an
@@ -2283,7 +2292,17 @@ static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)

return clamp(util, min_util, max_util);
}
+
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return uclamp_util_with(rq, util, NULL);
+}
#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
+{
+ return util;
+}
static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
{
return util;
--
2.19.2


2019-01-15 10:20:29

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 12/16] sched/core: uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However, this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation a
task completes within its allocated bandwidth can vary widely depending
on the frequency the CPU is running that task at.
The amount of computation is also affected by the specific CPU a task
runs on, especially on asymmetric capacity systems like Arm's
big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for the tasks RUNNABLE on a CPU.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller, which is currently based only on time constraints.

Extend the CPU controller with two new attributes, util.{min,max},
which enforce utilization boosting and capping for all the
tasks in a group. Specifically:

- util.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the min_util
utilization

- util.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the max_util
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups

b) do not enforce any constraints and/or dependencies between the parent
and its child nodes, thus relying:
- on permission settings defined by the system management software,
to define if subgroups can configure their clamp values
- on the delegation model, to ensure that effective clamps are
updated to consider both subgroup requests and parent group
constraints

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus making it possible to control and restrict task requests

This patch provides the basic support to expose the two new attributes
and to validate their run-time updates, while clamp buckets are not
(yet) actually allocated.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---

NOTEs:

1) The delegation model described above is provided in one of the
following patches of this series.

2) Utilization clamping constraints are useful not only to bias frequency
selection, when a task is running, but also to better support certain
scheduler decisions regarding task placement. For example, on
asymmetric capacity systems, a utilization clamp value can
conveniently be used to run important interactive tasks on more capable
CPUs or to run low-priority and background tasks on more energy
efficient CPUs.

The ultimate goal of utilization clamping is thus to enable:

- boosting: by selecting a higher-capacity CPU and/or a higher execution
frequency for small tasks which affect the interactive
user experience.

- capping: by selecting more energy-efficient CPUs or a lower execution
frequency for big tasks which are mainly related to
background activities and thus have no direct impact on
the user experience.

Thus, a proper extension of the cpu controller with utilization clamping
support will make this controller even more suitable for integration
with advanced system management software (e.g. Android).
Indeed, an informed user-space can provide rich information hints to the
scheduler regarding the tasks it's going to schedule.

The bits related to task placement biasing are left for a further
extension once the basic support introduced by this series has been
merged; in any case, they will not affect the integration with cgroups.

Changes in v6:
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
Documentation/admin-guide/cgroup-v2.rst | 25 +++++
init/Kconfig | 22 ++++
kernel/sched/core.c | 131 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
4 files changed, 183 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f129c68b..a059aaf7cce6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is, by default, done on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows enforcing a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -974,6 +980,25 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.

+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The minimum utilization in the range [0, 1024].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1024", i.e. no utilization capping.
+
+ The maximum utilization in the range [0, 1024].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index e60950ec01c0..94abf368bd52 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -866,6 +866,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping the maximum frequency a task
+ can use, while the min bandwidth allows defining the minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any
+ task-specific clamp value is constrained by the cgroup-specified
+ clamp value. Neither the minimum nor the maximum task clamping can
+ be bigger than the corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b41db1190d28..29ae83fb9786 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1294,6 +1294,13 @@ static void __init init_uclamp(void)
/* RT tasks by default will go to max frequency */
uc_se = &uclamp_default_perf[clamp_id];
uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Init root TG's clamp bucket */
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->bucket_id = 0;
+#endif
}
}

@@ -6872,6 +6879,23 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ tg->uclamp[clamp_id].value =
+ parent->uclamp[clamp_id].value;
+ tg->uclamp[clamp_id].bucket_id =
+ parent->uclamp[clamp_id].bucket_id;
+ }
+#endif
+
+ return 1;
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6895,6 +6919,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;

err:
@@ -7115,6 +7142,84 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg->uclamp[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7452,6 +7557,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7619,6 +7736,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a70f4bf66285..eca7d1a6cd43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,11 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.19.2


2019-01-15 11:51:28

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

Utilization clamping requires each CPU to know which clamp values are
assigned to tasks RUNNABLE on that CPU. A per-CPU array of reference
counters can be used where each entry tracks how many RUNNABLE tasks
require the same clamp value on each CPU. However, the range of clamp
values is too wide to track all the possible values in a per-CPU array.

Trade-off clamping precision for run-time and space efficiency using a
"bucketization and mapping" mechanism to translate "clamp values" into
"clamp buckets", each one representing a range of possible clamp values.

While the bucketization allows using only a minimal set of clamp
buckets at run-time, the mapping ensures that the clamp buckets in use
are always at the beginning of the per-CPU array.

The minimum set of clamp buckets used at run-time depends on their
granularity and how many clamp values the target system expects to
use. Since on most systems we expect only a few different clamp
values, the bucketization and mapping mechanism increases the chances
of having all the required data fit in one cache line.

For example, if we have only 20% and 25% clamped tasks, by setting:
CONFIG_UCLAMP_BUCKETS_COUNT 20
we allocate 20 clamp buckets with 5% resolution each; however, we will
use only 2 of them at run-time, since their 5% resolution is enough to
always distinguish the clamp values in use, and they will both fit
into a single cache line for each CPU.

Introduce the "bucketization and mapping" mechanisms which are required
for the implementation of the per-CPU operations.

Add a new "uclamp_enabled" sched_class attribute to mark which class
will contribute to clamping the CPU utilization. Move a few callbacks
around to ensure that the most used callbacks are all in the same cache
line along with the new attribute.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Message-ID: <[email protected]>
- added bucketization support since the beginning to avoid
semi-functional code in this patch
Message-ID: <[email protected]>
- update cmpxchg loops to use "do { } while (cmpxchg(ptr, old, new) != old)"
- switch to usage of try_cmpxchg()
Message-ID: <[email protected]>
- use SCHED_WARN_ON() instead of CONFIG_SCHED_DEBUG guarded blocks
- ensure se_count never underflow
Message-ID: <20181112000910.GC3038@worktop>
- wholesale s/group/bucket/
Message-ID: <20181111164754.GA3038@worktop>
- consistently use unary (++/--) operators
Message-ID: <20181107142428.GG14309@e110439-lin>
- added some better comments for invariant conditions
Message-ID: <20181107145612.GJ14309@e110439-lin>
- ensure UCLAMP_BUCKETS_COUNT >= 1
Others:
- added and make use of the bit_for() macro
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
- documentation review and cleanup
---
include/linux/log2.h | 37 ++++++
include/linux/sched.h | 44 ++++++-
include/linux/sched/task.h | 6 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 6 +-
init/Kconfig | 32 +++++
init/init_task.c | 4 -
kernel/exit.c | 1 +
kernel/sched/core.c | 234 ++++++++++++++++++++++++++++++---
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 19 ++-
11 files changed, 362 insertions(+), 31 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 2af7f77866d0..e2db25734532 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
ilog2((n) - 1) + 1) : \
__order_base_2(n) \
)
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+ if (n < 2)
+ return 1;
+ if (is_power_of_2(n))
+ return order_base_2(n) + 1;
+ return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g. bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bf(0) = 1
+ * bf(1) = 1
+ * bf(2) = 2
+ * bf(3) = 2
+ * bf(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n) \
+( \
+ __builtin_constant_p(n) ? ( \
+ ((n) == 0 || (n) == 1) ? 1 : ( \
+ ((n) & (n - 1)) == 0 ? \
+ ilog2((n) - 1) + 2 : \
+ ilog2((n) - 1) + 1 \
+ ) \
+ ) : \
+ __bits_per(n) \
+)
#endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 65199309b866..4f72f956850f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -323,6 +323,12 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+/*
+ * Increase resolution of cpu_capacity calculations
+ */
+#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -580,6 +586,42 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Number of utilization clamp buckets.
+ *
+ * The first clamp bucket (bucket_id=0) is used to track non clamped tasks, i.e.
+ * util_{min,max} (0,SCHED_CAPACITY_SCALE). Thus we allocate one more bucket in
+ * addition to the compile time configured number.
+ */
+#define UCLAMP_BUCKETS (CONFIG_UCLAMP_BUCKETS_COUNT + 1)
+
+/*
+ * Utilization clamp bucket
+ * @value: clamp value tracked by a clamp bucket
+ * @bucket_id: the bucket index used by the fast-path
+ * @mapped: the bucket index is valid
+ *
+ * A utilization clamp bucket maps a:
+ * clamp value (value), i.e.
+ * util_{min,max} value requested from userspace
+ * to a:
+ * clamp bucket index (bucket_id), i.e.
+ * index of the per-cpu RUNNABLE tasks refcounting array
+ *
+ * The mapped bit is set whenever a task has been mapped on a clamp bucket for
+ * the first time. When this bit is set, any:
+ * uclamp_bucket_inc() - for a new clamp value
+ * is matched by a:
+ * uclamp_bucket_dec() - for the old clamp value
+ */
+struct uclamp_se {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+ unsigned int mapped : 1;
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -661,7 +703,7 @@ struct task_struct {
struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
- int uclamp[UCLAMP_CNT];
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 44c6f15800ff..c3a71698b6b8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -70,6 +70,12 @@ static inline void exit_thread(struct task_struct *tsk)
#endif
extern void do_group_exit(int);

+#ifdef CONFIG_UCLAMP_TASK
+extern void uclamp_exit_task(struct task_struct *p);
+#else
+static inline void uclamp_exit_task(struct task_struct *p) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c31d3a47a47c..04beadac6985 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@

#include <linux/sched/idle.h>

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 9ef6dad0f854..36c65da32b31 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -53,7 +53,11 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_KEEP_POLICY 0x08
-#define SCHED_FLAG_UTIL_CLAMP 0x10
+
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x20
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
diff --git a/init/Kconfig b/init/Kconfig
index ea7c928a177b..e60950ec01c0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -660,7 +660,39 @@ config UCLAMP_TASK

If in doubt, say N.

+config UCLAMP_BUCKETS_COUNT
+ int "Number of supported utilization clamp buckets"
+ range 5 20
+ default 5
+ depends on UCLAMP_TASK
+ help
+ Defines the number of clamp buckets to use. The range of each bucket
+ will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+ number of clamp buckets the finer their granularity and the higher
+ the precision of clamping aggregation and tracking at run-time.
+
+ For example, with the default configuration we will have 5 clamp
+ buckets tracking 20% utilization each. A 25% boosted task will be
+ refcounted in the [20..39]% bucket and will set the bucket clamp
+ effective value to 25%.
+ If a second 30% boosted task should be co-scheduled on the same CPU,
+ that task will be refcounted in the same bucket of the first task and
+ it will boost the bucket clamp effective value to 30%.
+ The clamp effective value of a bucket is reset to its nominal value
+ (20% in the example above) when there are no more tasks refcounted in
+ that bucket.
+
+ An additional boost/capping margin can be added to some tasks. In the
+ example above the 25% task will be boosted to 30% until it exits the
+ CPU. If that should be considered not acceptable on certain systems,
+ it's always possible to reduce the margin by increasing the number of
+ clamp buckets to trade off used memory for run-time tracking
+ precision.
+
+ If in doubt, use the default value.
+
endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5bfdcc3fb839..7f77741b6a9b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -92,10 +92,6 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
-#endif
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp[UCLAMP_MIN] = 0,
- .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d14979577ee..c2a4aa4463be 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -877,6 +877,7 @@ void __noreturn do_exit(long code)

sched_autogroup_exit_task(tsk);
cgroup_exit(tsk);
+ uclamp_exit_task(tsk);

/*
* FIXME: do that only when needed, using sched_exit tracepoint
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66ff83e115db..3f87898b13a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -718,25 +718,221 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
+/*
+ * Reference count utilization clamp buckets
+ * @value: the utilization "clamp value" tracked by this clamp bucket
+ * @se_count: the number of scheduling entities using this "clamp value"
+ * @data: accessor for value and se_count reading
+ * @adata: accessor for atomic operations on value and se_count
+ */
+union uclamp_map {
+ struct {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long se_count : BITS_PER_LONG -
+ bits_per(SCHED_CAPACITY_SCALE);
+ };
+ unsigned long data;
+ atomic_long_t adata;
+};
+
+/*
+ * Map SEs "clamp value" into CPUs "clamp bucket"
+ *
+ * Matrix mapping "clamp values" (value) to "clamp buckets" (bucket_id),
+ * for each "clamp index" (clamp_id)
+ */
+static union uclamp_map uclamp_maps[UCLAMP_CNT][UCLAMP_BUCKETS];
+
+static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
+{
+#define UCLAMP_BUCKET_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_BUCKETS_COUNT)
+#define UCLAMP_BUCKET_UPPER (UCLAMP_BUCKET_DELTA * CONFIG_UCLAMP_BUCKETS_COUNT)
+
+ if (clamp_value >= UCLAMP_BUCKET_UPPER)
+ return SCHED_CAPACITY_SCALE;
+
+ return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
+}
+
+static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
+{
+ union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
+ union uclamp_map uc_map_old, uc_map_new;
+
+ uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
+ do {
+ /*
+ * Refcounting consistency check. If we release a non
+ * referenced bucket: refcounting is broken and we warn.
+ */
+ if (unlikely(!uc_map_old.se_count)) {
+ SCHED_WARN_ON(!uc_map_old.se_count);
+ return;
+ }
+
+ uc_map_new = uc_map_old;
+ uc_map_new.se_count--;
+
+ } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
+ &uc_map_old.data, uc_map_new.data));
+}
+
+static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
+ unsigned int clamp_value)
+{
+ union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
+ unsigned int prev_bucket_id = uc_se->bucket_id;
+ union uclamp_map uc_map_old, uc_map_new;
+ unsigned int free_bucket_id;
+ unsigned int bucket_value;
+ unsigned int bucket_id;
+
+ bucket_value = uclamp_bucket_value(clamp_value);
+
+ do {
+ /* Find the bucket_id of an already mapped clamp bucket... */
+ free_bucket_id = UCLAMP_BUCKETS;
+ for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
+ uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
+ if (free_bucket_id == UCLAMP_BUCKETS && !uc_map_old.se_count)
+ free_bucket_id = bucket_id;
+ if (uc_map_old.value == bucket_value)
+ break;
+ }
+
+ /* ... or allocate a new clamp bucket */
+ if (bucket_id >= UCLAMP_BUCKETS) {
+ /*
+ * A valid clamp bucket must always be available.
+ * If we cannot find one: refcounting is broken and we
+ * warn once. The sched_entity will be tracked in the
+ * fast-path using its previous clamp bucket, or not
+ * tracked at all if not yet mapped (i.e. it's new).
+ */
+ if (unlikely(free_bucket_id == UCLAMP_BUCKETS)) {
+ SCHED_WARN_ON(free_bucket_id == UCLAMP_BUCKETS);
+ return;
+ }
+ bucket_id = free_bucket_id;
+ uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
+ }
+
+ uc_map_new.se_count = uc_map_old.se_count + 1;
+ uc_map_new.value = bucket_value;
+
+ } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
+ &uc_map_old.data, uc_map_new.data));
+
+ uc_se->value = clamp_value;
+ uc_se->bucket_id = bucket_id;
+
+ if (uc_se->mapped)
+ uclamp_bucket_dec(clamp_id, prev_bucket_id);
+
+ /*
+ * A task's sched_entity is refcounted in the fast-path only once it
+ * has a valid clamp_bucket assigned.
+ */
+ uc_se->mapped = true;
+}
+
static int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
- if (attr->sched_util_min > attr->sched_util_max)
- return -EINVAL;
- if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
+ unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
+ int result = 0;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ lower_bound = attr->sched_util_min;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ upper_bound = attr->sched_util_max;
+
+ if (lower_bound > upper_bound ||
+ upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;

- p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
- p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+ mutex_lock(&uclamp_mutex);
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
+ UCLAMP_MIN, lower_bound);
+ }
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
+ UCLAMP_MAX, upper_bound);
+ }
+ mutex_unlock(&uclamp_mutex);

- return 0;
+ return result;
+}
+
+void uclamp_exit_task(struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ if (!p->uclamp[clamp_id].mapped)
+ continue;
+ uclamp_bucket_dec(clamp_id, p->uclamp[clamp_id].bucket_id);
+ }
+}
+
+static void uclamp_fork(struct task_struct *p, bool reset)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = p->uclamp[clamp_id].value;
+
+ if (unlikely(reset))
+ clamp_value = uclamp_none(clamp_id);
+
+ p->uclamp[clamp_id].mapped = false;
+ uclamp_bucket_inc(&p->uclamp[clamp_id], clamp_id, clamp_value);
+ }
+}
+
+static void __init init_uclamp(void)
+{
+ struct uclamp_se *uc_se;
+ unsigned int clamp_id;
+
+ mutex_init(&uclamp_mutex);
+
+ memset(uclamp_maps, 0, sizeof(uclamp_maps));
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &init_task.uclamp[clamp_id];
+ uclamp_bucket_inc(uc_se, clamp_id, uclamp_none(clamp_id));
+ }
}
+
#else /* CONFIG_UCLAMP_TASK */
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
return -EINVAL;
}
+static inline void uclamp_fork(struct task_struct *p, bool reset) { }
+static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2320,6 +2516,7 @@ static inline void init_schedstats(void) {}
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
+ bool reset;

__sched_fork(clone_flags, p);
/*
@@ -2337,7 +2534,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
/*
* Revert to default priority/policy on fork if requested.
*/
- if (unlikely(p->sched_reset_on_fork)) {
+ reset = p->sched_reset_on_fork;
+ if (unlikely(reset)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
@@ -2348,11 +2546,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);

-#ifdef CONFIG_UCLAMP_TASK
- p->uclamp[UCLAMP_MIN] = 0;
- p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
-#endif
-
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -2369,6 +2562,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)

init_entity_runnable_average(&p->se);

+ uclamp_fork(p, reset);
+
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
@@ -4613,10 +4808,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();

+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}

@@ -4768,8 +4968,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
attr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
- attr.sched_util_min = p->uclamp[UCLAMP_MIN];
- attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
#endif

rcu_read_unlock();
@@ -6125,6 +6325,8 @@ void __init sched_init(void)

psi_init();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50aa2aba69bd..5de061b055d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10540,6 +10540,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d04530bf251f..a0b238156161 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1630,10 +1630,12 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*yield_task) (struct rq *rq);
- bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

@@ -1666,7 +1668,6 @@ struct sched_class {
void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
- void (*task_dead)(struct task_struct *p);

/*
* The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1683,12 +1684,17 @@ struct sched_class {

void (*update_curr)(struct rq *rq);

+ void (*yield_task) (struct rq *rq);
+ bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+ void (*task_dead)(struct task_struct *p);
};

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
@@ -2203,6 +2209,13 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.19.2


2019-01-15 11:51:37

by Patrick Bellasi

Subject: [PATCH v6 02/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements that can be translated into proper
decisions for both task placement and frequency selection. Other
classes have a simpler model based on the POSIX concept of
priorities.

Such a simple priority-based model, however, does not allow exploiting
the most advanced features of the Linux scheduler, for example driving
frequency selection via the schedutil cpufreq governor. However, also
for non-SCHED_DEADLINE tasks it is still interesting to define task
properties that support scheduler decisions.

Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This enables a "proactive" per-task frequency
control policy, more advanced than the current one based just on
"passively" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).

Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows
task-specific properties to be defined for different scheduling
classes. A new pair of attributes specifies the minimum and maximum
utilization the scheduler can consider for a task.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Message-ID: <[email protected]>
- add size check in sched_copy_attr()
Others:
- typos and changelog cleanups
---
include/linux/sched.h | 16 ++++++++
include/uapi/linux/sched.h | 4 +-
include/uapi/linux/sched/types.h | 65 +++++++++++++++++++++++++++-----
init/Kconfig | 21 +++++++++++
init/init_task.c | 5 +++
kernel/sched/core.c | 43 +++++++++++++++++++++
6 files changed, 144 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 224666226e87..65199309b866 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -280,6 +280,18 @@ struct vtime {
u64 gtime;
};

+/**
+ * enum uclamp_id - Utilization clamp constraints
+ * @UCLAMP_MIN: Minimum utilization
+ * @UCLAMP_MAX: Maximum utilization
+ * @UCLAMP_CNT: Utilization clamp constraints count
+ */
+enum uclamp_id {
+ UCLAMP_MIN = 0,
+ UCLAMP_MAX,
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -648,6 +660,10 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 43832a87016a..9ef6dad0f854 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -53,10 +53,12 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_KEEP_POLICY 0x08
+#define SCHED_FLAG_UTIL_CLAMP 0x10

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
- SCHED_FLAG_KEEP_POLICY)
+ SCHED_FLAG_KEEP_POLICY | \
+ SCHED_FLAG_UTIL_CLAMP)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..01439e07507c 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -9,6 +9,7 @@ struct sched_param {
};

#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */

/*
* Extended scheduling parameters data structure.
@@ -21,8 +22,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) defines additional attributes to improve
+ * the scheduler's knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes describes a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +60,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +73,28 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the utilization expected
+ * for a task. These attributes inform the scheduler about the utilization
+ * boundaries within which it should schedule tasks. Such boundaries are
+ * valuable hints to support scheduler decisions on both task placement
+ * and frequency selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. For example, a
+ * 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger than 0 is more likely to be
+ * scheduled on a CPU with a capacity big enough to fit the specified value.
+ * A task with a max utilization value smaller than 1024 is more likely to be
+ * scheduled on a CPU with no more capacity than the specified value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +112,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/init/Kconfig b/init/Kconfig
index d47cb77a220e..ea7c928a177b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,6 +640,27 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+ With this option, the user can specify the min and max CPU
+ utilization allowed for RUNNABLE tasks. The max utilization defines
+ the maximum frequency a task should be using, while the min
+ utilization defines the minimum frequency it should be using.
+
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+endmenu
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3be4d7c..5bfdcc3fb839 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -6,6 +6,7 @@
#include <linux/sched/sysctl.h>
#include <linux/sched/rt.h>
#include <linux/sched/task.h>
+#include <linux/sched/topology.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -91,6 +92,10 @@ struct task_struct init_task
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
+#endif
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp[UCLAMP_MIN] = 0,
+ .uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68763a4ccae..66ff83e115db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -717,6 +717,28 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+static int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (attr->sched_util_min > attr->sched_util_max)
+ return -EINVAL;
+ if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
+ p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
+
+ return 0;
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline int __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -2326,6 +2348,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = p->normal_prio = __normal_prio(p);
set_load_weight(p, false);

+#ifdef CONFIG_UCLAMP_TASK
+ p->uclamp[UCLAMP_MIN] = 0;
+ p->uclamp[UCLAMP_MAX] = SCHED_CAPACITY_SCALE;
+#endif
+
/*
* We don't need the reset flag anymore after the fork. It has
* fulfilled its duty:
@@ -4214,6 +4241,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ /* Configure utilization clamps for the task */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = __setscheduler_uclamp(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4499,6 +4533,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return -EFAULT;

+ if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
+ size < SCHED_ATTR_SIZE_VER1)
+ return -EINVAL;
+
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -4729,6 +4767,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);

+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN];
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX];
+#endif
+
rcu_read_unlock();

retval = sched_read_attr(uattr, &attr, size);
--
2.19.2


2019-01-15 11:52:28

by Patrick Bellasi

Subject: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

Utilization clamping allows the CPU's utilization to be clamped within
a [util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE task enqueued on
that CPU refcounting that bucket.

When a task is {en,de}queued {on,from} a CPU, the set of active clamp
buckets on that CPU can change. Since each clamp bucket enforces a
different utilization clamp value, when the set of active clamp buckets
changes, a new "aggregated" clamp value is computed for that CPU.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can degrade the performance of other
co-scheduled tasks which are more boosted (i.e. with a higher util_min
clamp) or less capped (i.e. with a higher util_max clamp).

Each task has a:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

Each CPU's rq has a:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many tasks, currently RUNNABLE on that CPU, refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).

Each CPU's rq has also a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The unordered array rq::uclamp[clamp_id]::bucket[] is scanned every time
we need to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when we dequeue the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In these cases,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.

The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line. In most use-cases we expect fewer than 10 different clamp values
for each clamp_id.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v6:
Message-ID: <20181113151127.GA7681@darkstar>
- use SCHED_WARN_ON() instead of CONFIG_SCHED_DEBUG guarded WARN()s
- add some better inline documentation to explain per-CPU initializations
- add some local variables to use library's max() for aggregation on
bitfield attributes
Message-ID: <20181112000910.GC3038@worktop>
- wholesale s/group/bucket/
Message-ID: <20181111164754.GA3038@worktop>
- consistently use unary (++/--) operators
Others:
- updated from rq::uclamp::group[clamp_id][group_id]
to rq::uclamp[clamp_id]::bucket[bucket_id]
which better matches the layout already used for tasks, i.e.
p::uclamp[clamp_id]::value
- use {WRITE,READ}_ONCE() for rq's clamp access
- update layout of rq::uclamp_cpu to better match that of tasks,
i.e now access CPU's clamp buckets as:
rq->uclamp[clamp_id]{.bucket[bucket_id].value}
which matches:
p->uclamp[clamp_id]
---
include/linux/sched.h | 6 ++
kernel/sched/core.c | 152 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 49 ++++++++++++++
3 files changed, 207 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f72f956850f..84294925d006 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -601,6 +601,7 @@ struct sched_dl_entity {
* @value: clamp value tracked by a clamp bucket
* @bucket_id: the bucket index used by the fast-path
* @mapped: the bucket index is valid
+ * @active: the se is currently refcounted in a CPU's clamp bucket
*
* A utilization clamp bucket maps a:
* clamp value (value), i.e.
@@ -614,11 +615,16 @@ struct sched_dl_entity {
* uclamp_bucket_inc() - for a new clamp value
* is matched by a:
* uclamp_bucket_dec() - for the old clamp value
+ *
+ * The active bit is set whenever a task has an effective clamp bucket
+ * and value assigned, which can be different from the user-requested ones.
+ * This tells whether a task is actually refcounting a CPU's clamp bucket.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int mapped : 1;
+ unsigned int active : 1;
};
#endif /* CONFIG_UCLAMP_TASK */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3f87898b13a0..190137cd7b3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -766,6 +766,124 @@ static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
}

+static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id)
+{
+ unsigned int max_value = 0;
+ unsigned int bucket_id;
+
+ for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
+ unsigned int bucket_value;
+
+ if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
+ continue;
+
+ /* Both min and max clamps are MAX aggregated */
+ bucket_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
+ max_value = max(max_value, bucket_value);
+ if (max_value >= SCHED_CAPACITY_SCALE)
+ break;
+ }
+ WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
+}
+
+/*
+ * When a task is enqueued on a CPU's rq, the clamp bucket currently defined by
+ * the task's uclamp::bucket_id is reference counted on that CPU. This also
+ * immediately updates the CPU's clamp value if required.
+ *
+ * Since tasks know their specific value requested from user-space, we track
+ * within each bucket the maximum value for tasks refcounted in that bucket.
+ */
+static inline void uclamp_cpu_inc_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int cpu_clamp, bkt_clamp, tsk_clamp;
+ unsigned int bucket_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ bucket_id = p->uclamp[clamp_id].bucket_id;
+ p->uclamp[clamp_id].active = true;
+
+ rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+
+ /* CPU's clamp buckets track the max effective clamp value */
+ tsk_clamp = p->uclamp[clamp_id].value;
+ bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+ rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
+
+ /* Update CPU clamp value if required */
+ cpu_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+ WRITE_ONCE(rq->uclamp[clamp_id].value, max(cpu_clamp, tsk_clamp));
+}
+
+/*
+ * When a task is dequeued from a CPU's rq, the CPU's clamp bucket reference
+ * counted by the task is released. If this is the last task reference
+ * counting the CPU's max active clamp value, then the CPU's clamp value is
+ * updated.
+ * Both the task's reference counter and the CPU's cached clamp values are
+ * expected to be always valid; if we detect they are not, we skip the
+ * updates, enforce a consistent state and warn.
+ */
+static inline void uclamp_cpu_dec_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value;
+ unsigned int bucket_id;
+
+ if (unlikely(!p->uclamp[clamp_id].mapped))
+ return;
+
+ bucket_id = p->uclamp[clamp_id].bucket_id;
+ p->uclamp[clamp_id].active = false;
+
+ SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
+ if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+ rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+
+ /* Accept a (possibly) over-boosted value for tasks still RUNNABLE */
+ if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+ return;
+ clamp_value = rq->uclamp[clamp_id].bucket[bucket_id].value;
+
+ /* The CPU's clamp value is expected to always track the max */
+ SCHED_WARN_ON(clamp_value > rq->uclamp[clamp_id].value);
+
+ if (clamp_value >= READ_ONCE(rq->uclamp[clamp_id].value)) {
+ /*
+ * Reset the CPU's clamp bucket value to its nominal value whenever
+ * there are no more RUNNABLE tasks refcounting it.
+ */
+ rq->uclamp[clamp_id].bucket[bucket_id].value =
+ uclamp_maps[clamp_id][bucket_id].value;
+ uclamp_cpu_update(rq, clamp_id);
+ }
+}
+
+static inline void uclamp_cpu_inc(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_cpu_dec(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_cpu_dec_id(p, rq, clamp_id);
+}
+
static void uclamp_bucket_dec(unsigned int clamp_id, unsigned int bucket_id)
{
union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
@@ -798,6 +916,7 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
unsigned int free_bucket_id;
unsigned int bucket_value;
unsigned int bucket_id;
+ int cpu;

bucket_value = uclamp_bucket_value(clamp_value);

@@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
} while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
&uc_map_old.data, uc_map_new.data));

+ /*
+ * Ensure each CPU tracks the correct value for this clamp bucket.
+ * This initialization of per-CPU variables is required only when a
+ * clamp value is requested for the first time from a slow-path.
+ */
+ if (unlikely(!uc_map_old.se_count)) {
+ for_each_possible_cpu(cpu) {
+ struct uclamp_cpu *uc_cpu =
+ &cpu_rq(cpu)->uclamp[clamp_id];
+
+ /* CPU's tasks count must be 0 for free buckets */
+ SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
+ if (unlikely(uc_cpu->bucket[bucket_id].tasks))
+ uc_cpu->bucket[bucket_id].tasks = 0;
+
+ /* Minimize cache lines invalidation */
+ if (uc_cpu->bucket[bucket_id].value == bucket_value)
+ continue;
+ uc_cpu->bucket[bucket_id].value = bucket_value;
+ }
+ }
+
uc_se->value = clamp_value;
uc_se->bucket_id = bucket_id;

@@ -907,6 +1048,7 @@ static void uclamp_fork(struct task_struct *p, bool reset)
clamp_value = uclamp_none(clamp_id);

p->uclamp[clamp_id].mapped = false;
+ p->uclamp[clamp_id].active = false;
uclamp_bucket_inc(&p->uclamp[clamp_id], clamp_id, clamp_value);
}
}
@@ -915,9 +1057,15 @@ static void __init init_uclamp(void)
{
struct uclamp_se *uc_se;
unsigned int clamp_id;
+ int cpu;

mutex_init(&uclamp_mutex);

+ for_each_possible_cpu(cpu) {
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_cpu));
+ cpu_rq(cpu)->uclamp[UCLAMP_MAX].value = uclamp_none(UCLAMP_MAX);
+ }
+
memset(uclamp_maps, 0, sizeof(uclamp_maps));
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uc_se = &init_task.uclamp[clamp_id];
@@ -926,6 +1074,8 @@ static void __init init_uclamp(void)
}

#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_cpu_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_cpu_dec(struct rq *rq, struct task_struct *p) { }
static inline int __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
@@ -945,6 +1095,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, flags & ENQUEUE_WAKEUP);
}

+ uclamp_cpu_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -958,6 +1109,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

+ uclamp_cpu_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a0b238156161..06ff7d890ff6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,50 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_cpu - CPU's utilization clamp
+ * @value: currently active clamp values for a CPU
+ * @bucket: utilization clamp buckets affecting a CPU
+ *
+ * Keep track of RUNNABLE tasks on a CPU to aggregate their clamp values.
+ * A clamp value is affecting a CPU where there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (CONFIG_UCLAMP_BUCKETS_COUNT), we use a simple
+ * array to track the metrics required to compute all the per-CPU utilization
+ * clamp values. The additional slot is used to track the default clamp
+ * values, i.e. no min/max clamping at all.
+ */
+struct uclamp_cpu {
+ unsigned int value;
+ struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -835,6 +879,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
--
2.19.2


2019-01-15 12:16:59

by Patrick Bellasi

Subject: [PATCH v6 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

When a task specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}queued. However, when cgroups are also in use, the task
specific clamp values could be restricted by the task_group (TG)
clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp_bucket defining the
smaller clamp between the task specific value and its TG effective
value. This allows us to:

1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value

2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group, while for tasks in the root group or in an autogroup only
system defaults are enforced.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v6:
Others:
- wholesale s/group/bucket/
---
include/linux/sched.h | 10 ++++++++++
kernel/sched/core.c | 42 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3f02128fe6b2..bb4e3b1085f9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -602,6 +602,7 @@ struct sched_dl_entity {
* @bucket_id: the bucket index used by the fast-path
* @mapped: the bucket index is valid
* @active: the se is currently refcounted in a CPU's clamp bucket
+ * @user_defined: clamp value explicitly required from user-space
*
* A utilization clamp bucket maps a:
* clamp value (value), i.e.
@@ -619,12 +620,21 @@ struct sched_dl_entity {
* The active bit is set whenever a task has got an effective clamp bucket
* and value assigned, and it allows to know a task is actually refcounting a
* CPU's clamp bucket.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this
+ * task just as a restriction. This allows relaxing a TG's clamps when a less
+ * restrictive task specific value has been defined, thus allowing to
+ * implement a "nice" semantic when both task group and task specific values
+ * are used. For example, a task running on a 20% boosted TG can still drop
+ * its own boosting to 0%.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int mapped : 1;
unsigned int active : 1;
+ unsigned int user_defined : 1;
/*
* Clamp bucket and value actually used by a scheduling entity,
* i.e. a (RUNNABLE) task or a task group.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 734b769db2ca..c8d1fc9880ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -845,10 +845,23 @@ static inline void uclamp_cpu_update(struct rq *rq, unsigned int clamp_id,
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

+static inline bool uclamp_apply_defaults(struct task_struct *p)
+{
+ if (!IS_ENABLED(CONFIG_UCLAMP_TASK_GROUP))
+ return true;
+ if (task_group_is_autogroup(task_group(p)))
+ return true;
+ if (task_group(p) == &root_task_group)
+ return true;
+ return false;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not in the root group or
+ * in an autogroup
* - the system default clamp value, defined by the sysadmin
*
* As a side effect, update the task's effective value:
@@ -865,6 +878,29 @@ uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
*clamp_value = p->uclamp[clamp_id].value;
*bucket_id = p->uclamp[clamp_id].bucket_id;

+ if (!uclamp_apply_defaults(p)) {
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ unsigned int clamp_max, bucket_max;
+ struct uclamp_se *tg_clamp;
+
+ tg_clamp = &task_group(p)->uclamp[clamp_id];
+ clamp_max = tg_clamp->effective.value;
+ bucket_max = tg_clamp->effective.bucket_id;
+
+ if (!p->uclamp[clamp_id].user_defined ||
+ *clamp_value > clamp_max) {
+ *clamp_value = clamp_max;
+ *bucket_id = bucket_max;
+ }
+#endif
+ /*
+ * If we have task groups and we are running in a child group,
+ * system default does not apply anymore since we assume task
+ * group clamps are properly configured.
+ */
+ return;
+ }
+
/* RT tasks have different default values */
default_clamp = task_has_rt_policy(p)
? uclamp_default_perf
@@ -1223,10 +1259,12 @@ static int __setscheduler_uclamp(struct task_struct *p,

mutex_lock(&uclamp_mutex);
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ p->uclamp[UCLAMP_MIN].user_defined = true;
uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
UCLAMP_MIN, lower_bound);
}
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ p->uclamp[UCLAMP_MAX].user_defined = true;
uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
UCLAMP_MAX, upper_bound);
}
@@ -1259,8 +1297,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = p->uclamp[clamp_id].value;

- if (unlikely(reset))
+ if (unlikely(reset)) {
clamp_value = uclamp_none(clamp_id);
+ p->uclamp[clamp_id].user_defined = false;
+ }

p->uclamp[clamp_id].mapped = false;
p->uclamp[clamp_id].active = false;
--
2.19.2


2019-01-15 12:16:59

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

Schedutil enforces a maximum frequency when RT tasks are RUNNABLE.
This mandatory policy can be made tunable from userspace to define a max
frequency which is still reasonable for the execution of a specific RT
workload while also being power/energy friendly.

Extend the usage of util_{min,max} to the RT scheduling class.

Add uclamp_default_perf, a special set of clamp values to be used
for tasks requiring maximum performance, i.e. by default all the non
clamped RT tasks.

Since utilization clamping applies now to both CFS and RT tasks,
schedutil clamps the combined utilization of these two classes.
The IOWait boost value is also subject to clamping for RT tasks.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>

---
Changes in v6:
Others:
- wholesale s/group/bucket/
- wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
---
kernel/sched/core.c | 20 ++++++++++++++++----
kernel/sched/cpufreq_schedutil.c | 27 +++++++++++++--------------
kernel/sched/rt.c | 4 ++++
3 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d1ea5825501a..1ed01f381641 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -746,6 +746,7 @@ unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
* Tasks specific clamp values are required to be within this range
*/
static struct uclamp_se uclamp_default[UCLAMP_CNT];
+static struct uclamp_se uclamp_default_perf[UCLAMP_CNT];

/**
* Reference count utilization clamp buckets
@@ -858,16 +859,23 @@ static inline void
uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
unsigned int *clamp_value, unsigned int *bucket_id)
{
+ struct uclamp_se *default_clamp;
+
/* Task specific clamp value */
*clamp_value = p->uclamp[clamp_id].value;
*bucket_id = p->uclamp[clamp_id].bucket_id;

+ /* RT tasks have different default values */
+ default_clamp = task_has_rt_policy(p)
+ ? uclamp_default_perf
+ : uclamp_default;
+
/* System default restriction */
- if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
- *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
+ if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
+ *clamp_value > default_clamp[UCLAMP_MAX].value)) {
/* Keep it simple: unconditionally enforce system defaults */
- *clamp_value = uclamp_default[clamp_id].value;
- *bucket_id = uclamp_default[clamp_id].bucket_id;
+ *clamp_value = default_clamp[clamp_id].value;
+ *bucket_id = default_clamp[clamp_id].bucket_id;
}
}

@@ -1282,6 +1290,10 @@ static void __init init_uclamp(void)

uc_se = &uclamp_default[clamp_id];
uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(clamp_id));
+
+ /* RT tasks by default will go to max frequency */
+ uc_se = &uclamp_default_perf[clamp_id];
+ uclamp_bucket_inc(NULL, uc_se, clamp_id, uclamp_none(UCLAMP_MAX));
}
}

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 520ee2b785e7..38a05a4f78cc 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -201,9 +201,6 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
- return max;
-
/*
* Early check to see if IRQ/steal time saturates the CPU, can be
* because of inaccuracies in how we track these -- see
@@ -219,15 +216,19 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
*
- * CFS utilization can be boosted or capped, depending on utilization
- * clamp constraints requested by currently RUNNABLE tasks.
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/
- util = (type == ENERGY_UTIL)
- ? util_cfs
- : uclamp_util(rq, util_cfs);
- util += cpu_util_rt(rq);
+ util = cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL) {
+ util += cpu_util_cfs(rq);
+ util = uclamp_util(rq, util);
+ } else {
+ util += util_cfs;
+ }

dl_util = cpu_util_dl(rq);

@@ -355,13 +356,11 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
*
* Since DL tasks have a much more advanced bandwidth control, it's
* safe to assume that IO boost does not apply to those tasks.
- * Instead, since RT tasks are not utilization clamped, we don't want
- * to apply clamping on IO boost while there is blocked RT
- * utilization.
+ * Instead, for CFS and RT tasks we clamp the IO boost max value
+ * considering the current constraints for the CPU.
*/
max_boost = sg_cpu->iowait_boost_max;
- if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
- max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
+ max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);

/* Double the boost at each request */
if (sg_cpu->iowait_boost) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e4f398ad9e73..614b0bc359cb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
--
2.19.2


2019-01-21 10:20:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

On Tue, Jan 15, 2019 at 10:15:00AM +0000, Patrick Bellasi wrote:
> +/*
> + * Number of utilization clamp buckets.
> + *
> + * The first clamp bucket (bucket_id=0) is used to track non clamped tasks, i.e.
> + * util_{min,max} (0,SCHED_CAPACITY_SCALE). Thus we allocate one more bucket in
> + * addition to the compile time configured number.
> + */
> +#define UCLAMP_BUCKETS (CONFIG_UCLAMP_BUCKETS_COUNT + 1)
> +
> +/*
> + * Utilization clamp bucket
> + * @value: clamp value tracked by a clamp bucket
> + * @bucket_id: the bucket index used by the fast-path
> + * @mapped: the bucket index is valid
> + *
> + * A utilization clamp bucket maps a:
> + * clamp value (value), i.e.
> + * util_{min,max} value requested from userspace
> + * to a:
> + * clamp bucket index (bucket_id), i.e.
> + * index of the per-cpu RUNNABLE tasks refcounting array
> + *
> + * The mapped bit is set whenever a task has been mapped on a clamp bucket for
> + * the first time. When this bit is set, any:
> + * uclamp_bucket_inc() - for a new clamp value
> + * is matched by a:
> + * uclamp_bucket_dec() - for the old clamp value
> + */
> +struct uclamp_se {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> + unsigned int mapped : 1;
> +};

Do we want something like:

BUILD_BUG_ON(sizeof(struct uclamp_se) == sizeof(unsigned int));

And/or put a limit on CONFIG_UCLAMP_BUCKETS_COUNT that guarantees that ?

2019-01-21 12:28:49

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

On 21-Jan 11:15, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:00AM +0000, Patrick Bellasi wrote:
> > +/*
> > + * Number of utilization clamp buckets.
> > + *
> > + * The first clamp bucket (bucket_id=0) is used to track non clamped tasks, i.e.
> > + * util_{min,max} (0,SCHED_CAPACITY_SCALE). Thus we allocate one more bucket in
> > + * addition to the compile time configured number.
> > + */
> > +#define UCLAMP_BUCKETS (CONFIG_UCLAMP_BUCKETS_COUNT + 1)
> > +
> > +/*
> > + * Utilization clamp bucket
> > + * @value: clamp value tracked by a clamp bucket
> > + * @bucket_id: the bucket index used by the fast-path
> > + * @mapped: the bucket index is valid
> > + *
> > + * A utilization clamp bucket maps a:
> > + * clamp value (value), i.e.
> > + * util_{min,max} value requested from userspace
> > + * to a:
> > + * clamp bucket index (bucket_id), i.e.
> > + * index of the per-cpu RUNNABLE tasks refcounting array
> > + *
> > + * The mapped bit is set whenever a task has been mapped on a clamp bucket for
> > + * the first time. When this bit is set, any:
> > + * uclamp_bucket_inc() - for a new clamp value
> > + * is matched by a:
> > + * uclamp_bucket_dec() - for the old clamp value
> > + */
> > +struct uclamp_se {
> > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > + unsigned int mapped : 1;
> > +};
>
> Do we want something like:
>
> BUILD_BUG_ON(sizeof(struct uclamp_se) == sizeof(unsigned int));

Mmm... isn't "!=" what you mean ?

We cannot use less then an unsigned int for the fields above... am I
missing something?

> And/or put a limit on CONFIG_UCLAMP_BUCKETS_COUNT that guarantees that ?

The number of buckets is currently KConfig limited to a max of 20, which gives:

UCLAMP_BUCKETS: 21 => 5bits

Thus, even on 32-bit targets and assuming 21 bits for an "extended"
SCHED_CAPACITY_SCALE range we should always fit into an unsigned int
and have at least 6 bits for flags.

Are you afraid of some compiler magic related to bitfields packing ?

--
#include <best/regards.h>

Patrick Bellasi

2019-01-21 12:53:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

On Mon, Jan 21, 2019 at 12:27:10PM +0000, Patrick Bellasi wrote:
> On 21-Jan 11:15, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:00AM +0000, Patrick Bellasi wrote:
> > > +/*
> > > + * Number of utilization clamp buckets.
> > > + *
> > > + * The first clamp bucket (bucket_id=0) is used to track non clamped tasks, i.e.
> > > + * util_{min,max} (0,SCHED_CAPACITY_SCALE). Thus we allocate one more bucket in
> > > + * addition to the compile time configured number.
> > > + */
> > > +#define UCLAMP_BUCKETS (CONFIG_UCLAMP_BUCKETS_COUNT + 1)
> > > +
> > > +/*
> > > + * Utilization clamp bucket
> > > + * @value: clamp value tracked by a clamp bucket
> > > + * @bucket_id: the bucket index used by the fast-path
> > > + * @mapped: the bucket index is valid
> > > + *
> > > + * A utilization clamp bucket maps a:
> > > + * clamp value (value), i.e.
> > > + * util_{min,max} value requested from userspace
> > > + * to a:
> > > + * clamp bucket index (bucket_id), i.e.
> > > + * index of the per-cpu RUNNABLE tasks refcounting array
> > > + *
> > > + * The mapped bit is set whenever a task has been mapped on a clamp bucket for
> > > + * the first time. When this bit is set, any:
> > > + * uclamp_bucket_inc() - for a new clamp value
> > > + * is matched by a:
> > > + * uclamp_bucket_dec() - for the old clamp value
> > > + */
> > > +struct uclamp_se {
> > > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > > + unsigned int mapped : 1;
> > > +};
> >
> > Do we want something like:
> >
> > BUILD_BUG_ON(sizeof(struct uclamp_se) == sizeof(unsigned int));
>
> Mmm... isn't "!=" what you mean ?

Quite.

> We cannot use less then an unsigned int for the fields above... am I
> missing something?

I wanted to ensure we don't accidentally use more.

> > And/or put a limit on CONFIG_UCLAMP_BUCKETS_COUNT that guarantees that ?
>
> The number of buckets is currently KConfig limited to a max of 20, which gives:
>
> UCLAMP_BUCKETS: 21 => 5bits
>
> Thus, even on 32 bit targets and assuming 21bits for an "extended"
> SCHED_CAPACITY_SCALE range we should always fit into an unsigned int
> and have at least 6 bits for flags.
>
> Are you afraid of some compiler magic related to bitfields packing ?

Nah, I missed the Kconfig limit and was afraid that some weird configs
would end up with massively huge structures.

2019-01-21 15:01:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> &uc_map_old.data, uc_map_new.data));
>
> + /*
> + * Ensure each CPU tracks the correct value for this clamp bucket.
> + * This initialization of per-CPU variables is required only when a
> + * clamp value is requested for the first time from a slow-path.
> + */

I'm confused; why is this needed?

> + if (unlikely(!uc_map_old.se_count)) {
> + for_each_possible_cpu(cpu) {
> + struct uclamp_cpu *uc_cpu =
> + &cpu_rq(cpu)->uclamp[clamp_id];
> +
> + /* CPU's tasks count must be 0 for free buckets */
> + SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
> + if (unlikely(uc_cpu->bucket[bucket_id].tasks))
> + uc_cpu->bucket[bucket_id].tasks = 0;
> +
> + /* Minimize cache lines invalidation */
> + if (uc_cpu->bucket[bucket_id].value == bucket_value)
> + continue;
> + uc_cpu->bucket[bucket_id].value = bucket_value;
> + }
> + }
> +
> uc_se->value = clamp_value;
> uc_se->bucket_id = bucket_id;
>

2019-01-21 15:07:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

On Tue, Jan 15, 2019 at 10:15:00AM +0000, Patrick Bellasi wrote:
> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> +#define UCLAMP_BUCKET_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_BUCKETS_COUNT)
> +#define UCLAMP_BUCKET_UPPER (UCLAMP_BUCKET_DELTA * CONFIG_UCLAMP_BUCKETS_COUNT)
> +
> + if (clamp_value >= UCLAMP_BUCKET_UPPER)
> + return SCHED_CAPACITY_SCALE;
> +
> + return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
> +}

> +static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> + unsigned int clamp_value)
> +{
> + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> + unsigned int prev_bucket_id = uc_se->bucket_id;
> + union uclamp_map uc_map_old, uc_map_new;
> + unsigned int free_bucket_id;
> + unsigned int bucket_value;
> + unsigned int bucket_id;
> +
> + bucket_value = uclamp_bucket_value(clamp_value);

Aahh!!

So why don't you do:

bucket_id = clamp_value / UCLAMP_BUCKET_DELTA;
bucket_value = bucket_id * UCLAMP_BUCKET_DELTA;

> + do {
> + /* Find the bucket_id of an already mapped clamp bucket... */
> + free_bucket_id = UCLAMP_BUCKETS;
> + for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
> + uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
> + if (free_bucket_id == UCLAMP_BUCKETS && !uc_map_old.se_count)
> + free_bucket_id = bucket_id;
> + if (uc_map_old.value == bucket_value)
> + break;
> + }
> +
> + /* ... or allocate a new clamp bucket */
> + if (bucket_id >= UCLAMP_BUCKETS) {
> + /*
> + * A valid clamp bucket must always be available.
> + * If we cannot find one: refcounting is broken and we
> + * warn once. The sched_entity will be tracked in the
> + * fast-path using its previous clamp bucket, or not
> + * tracked at all if not yet mapped (i.e. it's new).
> + */
> + if (unlikely(free_bucket_id == UCLAMP_BUCKETS)) {
> + SCHED_WARN_ON(free_bucket_id == UCLAMP_BUCKETS);
> + return;
> + }
> + bucket_id = free_bucket_id;
> + uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
> + }

And then skip all this?

> +
> + uc_map_new.se_count = uc_map_old.se_count + 1;
> + uc_map_new.value = bucket_value;
> +
> + } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> + &uc_map_old.data, uc_map_new.data));
> +
> + uc_se->value = clamp_value;
> + uc_se->bucket_id = bucket_id;
> +
> + if (uc_se->mapped)
> + uclamp_bucket_dec(clamp_id, prev_bucket_id);
> +
> + /*
> + * Task's sched_entity are refcounted in the fast-path only when they
> + * have got a valid clamp_bucket assigned.
> + */
> + uc_se->mapped = true;
> +}

2019-01-21 15:19:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> +#ifdef CONFIG_UCLAMP_TASK

> +struct uclamp_bucket {
> + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};

> +struct uclamp_cpu {
> + unsigned int value;

/* 4 byte hole */

> + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};

With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
hole).

> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -835,6 +879,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;

Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
64 byte cachelines.

Is that the best layout?

> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> --
> 2.19.2
>

2019-01-21 15:27:52

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 21-Jan 15:59, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > &uc_map_old.data, uc_map_new.data));
> >
> > + /*
> > + * Ensure each CPU tracks the correct value for this clamp bucket.
> > + * This initialization of per-CPU variables is required only when a
> > + * clamp value is requested for the first time from a slow-path.
> > + */
>
> I'm confused; why is this needed?

That's a lazy initialization of the per-CPU uclamp data for a given
bucket, i.e. the clamp value assigned to a bucket, which happens only
when new clamp values are requested... usually only at system
boot/configuration time.

For example, let say we have these buckets mapped to given clamp
values:

bucket_#0: clamp value: 10% (mapped)
bucket_#1: clamp value: 20% (mapped)
bucket_#2: clamp value: 30% (mapped)

and then let's assume all the users of bucket_#1 are "destroyed", i.e.
there are no more tasks, system defaults or cgroups asking for a
20% clamp value. The corresponding bucket will become free:

bucket_#0: clamp value: 10% (mapped)
bucket_#1: clamp value: 20% (free)
bucket_#2: clamp value: 30% (mapped)

If, in the future, we ask for a new clamp value, let say a task ask
for a 40% clamp value, then we need to map that value into a bucket.
Since bucket_#1 is free we can use it to fill up the hole and keep all
the buckets in use at the beginning of a cache line.

However, since now bucket_#1 tracks a different clamp value (40
instead of 20) we need to walk all the CPUs and update the cached
value:

bucket_#0: clamp value: 10% (mapped)
bucket_#1: clamp value: 40% (mapped)
bucket_#2: clamp value: 30% (mapped)

Is that more clear ?


In the following code:

>
> > + if (unlikely(!uc_map_old.se_count)) {

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This condition is matched by clamp buckets which need the
initialization described above. These are buckets without a client so
far that have been selected to map/track a new clamp value.
That's why we have an unlikely... quite likely tasks/cgroups will keep
asking for the same (limited number of) clamp values and thus we find
a bucket already properly initialized for them.


> > + for_each_possible_cpu(cpu) {
> > + struct uclamp_cpu *uc_cpu =
> > + &cpu_rq(cpu)->uclamp[clamp_id];
> > +
> > + /* CPU's tasks count must be 0 for free buckets */
> > + SCHED_WARN_ON(uc_cpu->bucket[bucket_id].tasks);
> > + if (unlikely(uc_cpu->bucket[bucket_id].tasks))
> > + uc_cpu->bucket[bucket_id].tasks = 0;

That's a safety check, we expect that (free) buckets do not refcount
any task. That's one of the conditions for a bucket to be considered
free. Here we do just a sanity check, which is why we use unlikely.
If the check matches there is data corruption, which is reported by
the previous SCHED_WARN_ON and "fixed" by the if branch.

In my tests I have s/SCHED_WARN_ON/BUG_ON/ and never hit that bug...
thus the refcounting code should be ok and this check is there just to
be more on the safe side for future changes.

> > +
> > + /* Minimize cache lines invalidation */
> > + if (uc_cpu->bucket[bucket_id].value == bucket_value)
> > + continue;

If by any chance we get a request for a new clamp value which happened
to be already used before... we can skip the initialization to avoid
a needless cache line invalidation.

> > + uc_cpu->bucket[bucket_id].value = bucket_value;
> > + }
> > + }
> > +
> > uc_se->value = clamp_value;
> > uc_se->bucket_id = bucket_id;
> >

--
#include <best/regards.h>

Patrick Bellasi

2019-01-21 15:34:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Tue, Jan 15, 2019 at 10:15:02AM +0000, Patrick Bellasi wrote:

> +static inline void
> +uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + /*
> + * Lock the task and the CPU where the task is (or was) queued.
> + *
> + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
> + * price to pay to safely serialize util_{min,max} updates with
> + * enqueues, dequeues and migration operations.
> + * This is the same locking schema used by __set_cpus_allowed_ptr().
> + */
> + rq = task_rq_lock(p, &rf);
> +
> + /*
> + * Setting the clamp bucket is serialized by task_rq_lock().
> + * If the task is not yet RUNNABLE and its task_struct is not
> + * affecting a valid clamp bucket, the next time it's enqueued,
> + * it will already see the updated clamp bucket value.
> + */
> + if (!p->uclamp[clamp_id].active)
> + goto done;
> +
> + uclamp_cpu_dec_id(p, rq, clamp_id);
> + uclamp_cpu_inc_id(p, rq, clamp_id);
> +
> +done:
> + task_rq_unlock(rq, p, &rf);
> +}

> @@ -1008,11 +1043,11 @@ static int __setscheduler_uclamp(struct task_struct *p,
>
> mutex_lock(&uclamp_mutex);
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> - uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
> + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
> UCLAMP_MIN, lower_bound);
> }
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> - uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
> + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
> UCLAMP_MAX, upper_bound);
> }
> mutex_unlock(&uclamp_mutex);


But.... __sched_setscheduler() actually does the whole dequeue + enqueue
thing already ?!? See where it does __setscheduler().



2019-01-21 15:36:25

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 03/16] sched/core: uclamp: Map TASK's clamp values into CPU's clamp buckets

On 21-Jan 16:05, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:00AM +0000, Patrick Bellasi wrote:
> > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > +{
> > +#define UCLAMP_BUCKET_DELTA (SCHED_CAPACITY_SCALE / CONFIG_UCLAMP_BUCKETS_COUNT)
> > +#define UCLAMP_BUCKET_UPPER (UCLAMP_BUCKET_DELTA * CONFIG_UCLAMP_BUCKETS_COUNT)
> > +
> > + if (clamp_value >= UCLAMP_BUCKET_UPPER)
> > + return SCHED_CAPACITY_SCALE;
> > +
> > + return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
> > +}
>
> > +static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > + unsigned int clamp_value)
> > +{
> > + union uclamp_map *uc_maps = &uclamp_maps[clamp_id][0];
> > + unsigned int prev_bucket_id = uc_se->bucket_id;
> > + union uclamp_map uc_map_old, uc_map_new;
> > + unsigned int free_bucket_id;
> > + unsigned int bucket_value;
> > + unsigned int bucket_id;
> > +
> > + bucket_value = uclamp_bucket_value(clamp_value);
>
> Aahh!!
>
> So why don't you do:
>
> bucket_id = clamp_value / UCLAMP_BUCKET_DELTA;
> bucket_value = bucket_id * UCLAMP_BUCKET_DELTA;

The mapping done here is meant to keep at the beginning of the cache
line all and only the buckets we use. Let say we have configured the
system to track 20 buckets, to have a 5% clamping resolution, but then
we use only two values at run-time, e.g. 13% and 87%.

With the mapping done here the per-CPU variables will have to consider
only 2 buckets:

bucket_#00: clamp value: 10% (mapped)
bucket_#01: clamp value: 85% (mapped)
bucket_#02: (free)
...
bucket_#20: (free)

While without the mapping we will have:

bucket_#00: (free)
bucket_#01: clamp value: 10 (mapped)
bucket_#02: (free)
... big hole crossing a cache line ....
bucket_#16: (free)
bucket_#17: clamp value: 85 (mapped)
bucket_#18: (free)
...
bucket_#20: (free)

Addressing is simple without mapping but we can have performance
issues in the hot-path, since sometimes we need to scan all the
buckets to figure out the new max.

The mapping done here is meant to keep all the used slots at the very
beginning of a cache line to speed up that max computation when
required.

>
> > + do {
> > + /* Find the bucket_id of an already mapped clamp bucket... */
> > + free_bucket_id = UCLAMP_BUCKETS;
> > + for (bucket_id = 0; bucket_id < UCLAMP_BUCKETS; ++bucket_id) {
> > + uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
> > + if (free_bucket_id == UCLAMP_BUCKETS && !uc_map_old.se_count)
> > + free_bucket_id = bucket_id;
> > + if (uc_map_old.value == bucket_value)
> > + break;
> > + }
> > +
> > + /* ... or allocate a new clamp bucket */
> > + if (bucket_id >= UCLAMP_BUCKETS) {
> > + /*
> > + * A valid clamp bucket must always be available.
> > + * If we cannot find one: refcounting is broken and we
> > + * warn once. The sched_entity will be tracked in the
> > + * fast-path using its previous clamp bucket, or not
> > + * tracked at all if not yet mapped (i.e. it's new).
> > + */
> > + if (unlikely(free_bucket_id == UCLAMP_BUCKETS)) {
> > + SCHED_WARN_ON(free_bucket_id == UCLAMP_BUCKETS);
> > + return;
> > + }
> > + bucket_id = free_bucket_id;
> > + uc_map_old.data = atomic_long_read(&uc_maps[bucket_id].adata);
> > + }
>
> And then skip all this?
> > +
> > + uc_map_new.se_count = uc_map_old.se_count + 1;
> > + uc_map_new.value = bucket_value;
> > +
> > + } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > + &uc_map_old.data, uc_map_new.data));
> > +
> > + uc_se->value = clamp_value;
> > + uc_se->bucket_id = bucket_id;
> > +
> > + if (uc_se->mapped)
> > + uclamp_bucket_dec(clamp_id, prev_bucket_id);
> > +
> > + /*
> > + * Task's sched_entity are refcounted in the fast-path only when they
> > + * have got a valid clamp_bucket assigned.
> > + */
> > + uc_se->mapped = true;
> > +}

--
#include <best/regards.h>

Patrick Bellasi

2019-01-21 15:46:22

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 21-Jan 16:33, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:02AM +0000, Patrick Bellasi wrote:
>
> > +static inline void
> > +uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct rq_flags rf;
> > + struct rq *rq;
> > +
> > + /*
> > + * Lock the task and the CPU where the task is (or was) queued.
> > + *
> > + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
> > + * price to pay to safely serialize util_{min,max} updates with
> > + * enqueues, dequeues and migration operations.
> > + * This is the same locking schema used by __set_cpus_allowed_ptr().
> > + */
> > + rq = task_rq_lock(p, &rf);
> > +
> > + /*
> > + * Setting the clamp bucket is serialized by task_rq_lock().
> > + * If the task is not yet RUNNABLE and its task_struct is not
> > + * affecting a valid clamp bucket, the next time it's enqueued,
> > + * it will already see the updated clamp bucket value.
> > + */
> > + if (!p->uclamp[clamp_id].active)
> > + goto done;
> > +
> > + uclamp_cpu_dec_id(p, rq, clamp_id);
> > + uclamp_cpu_inc_id(p, rq, clamp_id);
> > +
> > +done:
> > + task_rq_unlock(rq, p, &rf);
> > +}
>
> > @@ -1008,11 +1043,11 @@ static int __setscheduler_uclamp(struct task_struct *p,
> >
> > mutex_lock(&uclamp_mutex);
> > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
> > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
> > UCLAMP_MIN, lower_bound);
> > }
> > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
> > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
> > UCLAMP_MAX, upper_bound);
> > }
> > mutex_unlock(&uclamp_mutex);
>
>
> But.... __sched_setscheduler() actually does the whole dequeue + enqueue
> thing already ?!? See where it does __setscheduler().

This is slow-path accounting, not fast path.

There are two refcountings going on here:

1) mapped buckets:

clamp_value <--(M1)--> bucket_id

2) RUNNABLE tasks:

bucket_id <--(M2)--> RUNNABLE tasks in a bucket

What we fix here is the refcounting for the buckets mapping. If a task
does not have a task-specific clamp value, it does not refcount any
bucket. The moment we assign a task-specific clamp value, we need to
refcount the task in the bucket corresponding to that clamp value.

This keeps the bucket in use for at least as long as the task needs
that clamp value.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-21 15:56:52

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 21-Jan 16:17, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > +#ifdef CONFIG_UCLAMP_TASK
>
> > +struct uclamp_bucket {
> > + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > +};
>
> > +struct uclamp_cpu {
> > + unsigned int value;
>
> /* 4 byte hole */
>
> > + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > +};
>
> With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> hole).

Yes, that's dimensioned and configured to fit into a single cache line
for all the possible 5 (by default) clamp values of a clamp index
(i.e. min or max util).

>
> > +#endif /* CONFIG_UCLAMP_TASK */
> > +
> > /*
> > * This is the main, per-CPU runqueue data structure.
> > *
> > @@ -835,6 +879,11 @@ struct rq {
> > unsigned long nr_load_updates;
> > u64 nr_switches;
> >
> > +#ifdef CONFIG_UCLAMP_TASK
> > + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> > + struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;
>
> Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> 64 byte cachelines.

Right, we have 2 cache lines where:
- the first $L tracks 5 different util_min values
- the second $L tracks 5 different util_max values

> Is that the best layout?

It changed a few times and this is what I found most reasonable, both
for fitting the default configuration and for code readability.
Notice that we access RQ and SE clamp values with the same pattern,
for example:

{rq|p}->uclamp[clamp_idx].value

Are you worried about the holes, or about something else specific?

> > +#endif
> > +
> > struct cfs_rq cfs;
> > struct rt_rq rt;
> > struct dl_rq dl;
> > --
> > 2.19.2
> >

--
#include <best/regards.h>

Patrick Bellasi

2019-01-21 16:14:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
> On 21-Jan 15:59, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > > } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > > &uc_map_old.data, uc_map_new.data));
> > >
> > > + /*
> > > + * Ensure each CPU tracks the correct value for this clamp bucket.
> > > + * This initialization of per-CPU variables is required only when a
> > > + * clamp value is requested for the first time from a slow-path.
> > > + */
> >
> > I'm confused; why is this needed?
>
> That's a lazy initialization of the per-CPU uclamp data for a given
> bucket, i.e. the clamp value assigned to a bucket, which happens only
> when new clamp values are requested... usually only at system
> boot/configuration time.
>
> For example, let say we have these buckets mapped to given clamp
> values:
>
> bucket_#0: clamp value: 10% (mapped)
> bucket_#1: clamp value: 20% (mapped)
> bucket_#2: clamp value: 30% (mapped)
>
> and then let's assume all the users of bucket_#1 are "destroyed", i.e.
> there are no more tasks, system defaults or cgroups asking for a
> 20% clamp value. The corresponding bucket will become free:
>
> bucket_#0: clamp value: 10% (mapped)
> bucket_#1: clamp value: 20% (free)
> bucket_#2: clamp value: 30% (mapped)
>
> If, in the future, we ask for a new clamp value, let say a task ask
> for a 40% clamp value, then we need to map that value into a bucket.
> Since bucket_#1 is free we can use it to fill up the hold and keep all
> the buckets in use at the beginning of a cache line.
>
> However, since now bucket_#1 tracks a different clamp value (40
> instead of 20) we need to walk all the CPUs and updated the cached
> value:
>
> bucket_#0: clamp value: 10% (mapped)
> bucket_#1: clamp value: 40% (mapped)
> bucket_#2: clamp value: 30% (mapped)
>
> Is that more clear ?

Yes, and I realized this a little while after sending it; but I'm not
sure I have an answer as to why, though.

That is; why isn't the whole thing hard coded to have:

bucket_n: clamp value: n*UCLAMP_BUCKET_DELTA

We already do that division anyway (clamp_value / UCLAMP_BUCKET_DELTA),
and from that we instantly have the right bucket index. And that allows
us to initialize all this beforehand.

> and keep all
> the buckets in use at the beginning of a cache line.

That; is that the rationale for all this? Note that per the defaults
everything is in a single line already.



2019-01-21 16:35:19

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 21-Jan 17:12, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 15:59, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > > @@ -835,6 +954,28 @@ static void uclamp_bucket_inc(struct uclamp_se *uc_se, unsigned int clamp_id,
> > > > } while (!atomic_long_try_cmpxchg(&uc_maps[bucket_id].adata,
> > > > &uc_map_old.data, uc_map_new.data));
> > > >
> > > > + /*
> > > > + * Ensure each CPU tracks the correct value for this clamp bucket.
> > > > + * This initialization of per-CPU variables is required only when a
> > > > + * clamp value is requested for the first time from a slow-path.
> > > > + */
> > >
> > > I'm confused; why is this needed?
> >
> > That's a lazy initialization of the per-CPU uclamp data for a given
> > bucket, i.e. the clamp value assigned to a bucket, which happens only
> > when new clamp values are requested... usually only at system
> > boot/configuration time.
> >
> > For example, let say we have these buckets mapped to given clamp
> > values:
> >
> > bucket_#0: clamp value: 10% (mapped)
> > bucket_#1: clamp value: 20% (mapped)
> > bucket_#2: clamp value: 30% (mapped)
> >
> > and then let's assume all the users of bucket_#1 are "destroyed", i.e.
> > there are no more tasks, system defaults or cgroups asking for a
> > 20% clamp value. The corresponding bucket will become free:
> >
> > bucket_#0: clamp value: 10% (mapped)
> > bucket_#1: clamp value: 20% (free)
> > bucket_#2: clamp value: 30% (mapped)
> >
> > If, in the future, we ask for a new clamp value, let say a task ask
> > for a 40% clamp value, then we need to map that value into a bucket.
> > Since bucket_#1 is free we can use it to fill up the hold and keep all
> > the buckets in use at the beginning of a cache line.
> >
> > However, since now bucket_#1 tracks a different clamp value (40
> > instead of 20) we need to walk all the CPUs and updated the cached
> > value:
> >
> > bucket_#0: clamp value: 10% (mapped)
> > bucket_#1: clamp value: 40% (mapped)
> > bucket_#2: clamp value: 30% (mapped)
> >
> > Is that more clear ?
>
> Yes, and I realized this a little while after sending this; but I'm not
> sure I have an answer to why though.
>
> That is; why isn't the whole thing hard coded to have:
>
> bucket_n: clamp value: n*UCLAMP_BUCKET_DELTA
>
> We already do that division anyway (clamp_value / UCLAMP_BUCKET_DELTA),
> and from that we instantly have the right bucket index. And that allows
> us to initialize all this beforehand.
>
> > and keep all
> > the buckets in use at the beginning of a cache line.
>
> That; is that the rationale for all this? Note that per the defaults
> everything is in a single line already.

Yes, that's because of the loop in:

dequeue_task()
uclamp_cpu_dec()
uclamp_cpu_dec_id()
uclamp_cpu_update()

where buckets sometimes need to be scanned to find a new max.

Consider also that, with the mapping, we can more easily increase the
bucket count to 20 in order to have a finer clamping granularity if
needed, without worrying too much about the performance impact,
especially since we anyway use only a few different clamp values.

So, I agree that the mapping adds (code) complexity, but it can also
save a few cycles in the fast path... do you think it's not worth the
added complexity?

TBH I never did a proper profiling with/without the mapping... I'm just
worried in principle about a loop over 20 entries spanning 4 cache lines. :/

NOTE: the loop currently goes through all the entries anyway,
but we can later add a guard to bail out once we have covered
the number of active entries.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 09:39:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Mon, Jan 21, 2019 at 03:44:12PM +0000, Patrick Bellasi wrote:
> On 21-Jan 16:33, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:02AM +0000, Patrick Bellasi wrote:
> >
> > > +static inline void
> > > +uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
> > > +{
> > > + struct rq_flags rf;
> > > + struct rq *rq;
> > > +
> > > + /*
> > > + * Lock the task and the CPU where the task is (or was) queued.
> > > + *
> > > + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
> > > + * price to pay to safely serialize util_{min,max} updates with
> > > + * enqueues, dequeues and migration operations.
> > > + * This is the same locking schema used by __set_cpus_allowed_ptr().
> > > + */
> > > + rq = task_rq_lock(p, &rf);
> > > +
> > > + /*
> > > + * Setting the clamp bucket is serialized by task_rq_lock().
> > > + * If the task is not yet RUNNABLE and its task_struct is not
> > > + * affecting a valid clamp bucket, the next time it's enqueued,
> > > + * it will already see the updated clamp bucket value.
> > > + */
> > > + if (!p->uclamp[clamp_id].active)
> > > + goto done;
> > > +
> > > + uclamp_cpu_dec_id(p, rq, clamp_id);
> > > + uclamp_cpu_inc_id(p, rq, clamp_id);
> > > +
> > > +done:
> > > + task_rq_unlock(rq, p, &rf);
> > > +}
> >
> > > @@ -1008,11 +1043,11 @@ static int __setscheduler_uclamp(struct task_struct *p,
> > >
> > > mutex_lock(&uclamp_mutex);
> > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
> > > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
> > > UCLAMP_MIN, lower_bound);
> > > }
> > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
> > > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
> > > UCLAMP_MAX, upper_bound);
> > > }
> > > mutex_unlock(&uclamp_mutex);
> >
> >
> > But.... __sched_setscheduler() actually does the whole dequeue + enqueue
> > thing already ?!? See where it does __setscheduler().
>
> This is slow-path accounting, not fast path.

Sure; but that's still no reason for duplicate or unneeded code.

> There are two refcounting going on here:
>
> 1) mapped buckets:
>
> clamp_value <--(M1)--> bucket_id
>
> 2) RUNNABLE tasks:
>
> bucket_id <--(M2)--> RUNNABLE tasks in a bucket
>
> What we fix here is the refcounting for the buckets mapping. If a task
> does not have a task specific clamp value it does not refcount any
> bucket. The moment we assign a task specific clamp value, we need to
> refcount the task in the bucket corresponding to that clamp value.
>
> This will keep the bucket in use at least as long as the task will
> need that clamp value.

Sure, I get that. What I don't get is why you're adding that (2) here.
Like I said, __sched_setscheduler() already does a dequeue/enqueue under
rq->lock, which should already take care of that.

2019-01-22 09:48:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Mon, Jan 21, 2019 at 04:33:38PM +0000, Patrick Bellasi wrote:
> On 21-Jan 17:12, Peter Zijlstra wrote:
> > On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:

> > > and keep all
> > > the buckets in use at the beginning of a cache line.
> >
> > That; is that the rationale for all this? Note that per the defaults
> > everything is in a single line already.
>
> Yes, that's because of the loop in:
>
> dequeue_task()
> uclamp_cpu_dec()
> uclamp_cpu_dec_id()
> uclamp_cpu_update()
>
> where buckets needs sometimes to be scanned to find a new max.
>
> Consider also that, with mapping, we can more easily increase the
> buckets count to 20 in order to have a finer clamping granularity if
> needed without warring too much about performance impact especially
> when we use anyway few different clamp values.
>
> So, I agree that mapping adds (code) complexity but it can also save
> few cycles in the fast path... do you think it's not worth the added
> complexity?

Then maybe split this out in a separate patch? Do the trivial linear
bucket thing first and then do this smarty pants thing on top.

One problem with the scheme is that it doesn't defrag; so if you get a
peak usage, you can still end up with only two active buckets in
different lines.

Also; if it is it's own patch, you get a much better view of the
additional complexity and a chance to justify it ;-)

Also; would it make sense to do s/cpu/rq/ on much of this? All this
uclamp_cpu_*() stuff really is per rq and takes rq arguments, so why
does it have cpu in the name... no strong feelings, just noticed it and
thought it a tad inconsistent.

2019-01-22 10:05:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Mon, Jan 21, 2019 at 03:54:07PM +0000, Patrick Bellasi wrote:
> On 21-Jan 16:17, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > +#ifdef CONFIG_UCLAMP_TASK
> >
> > > +struct uclamp_bucket {
> > > + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > > + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > > +};
> >
> > > +struct uclamp_cpu {
> > > + unsigned int value;
> >
> > /* 4 byte hole */
> >
> > > + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > > +};
> >
> > With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> > ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> > hole).
>
> Yes, that's dimensioned and configured to fit into a single cache line
> for all the possible 5 (by default) clamp values of a clamp index
> (i.e. min or max util).

And I suppose you picked 5 because 20% is a 'nice' number? whereas
16./666/% is a bit odd?

> > > +#endif /* CONFIG_UCLAMP_TASK */
> > > +
> > > /*
> > > * This is the main, per-CPU runqueue data structure.
> > > *
> > > @@ -835,6 +879,11 @@ struct rq {
> > > unsigned long nr_load_updates;
> > > u64 nr_switches;
> > >
> > > +#ifdef CONFIG_UCLAMP_TASK
> > > + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> > > + struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;
> >
> > Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> > 64 byte cachelines.
>
> Right, we have 2 cache lines where:
> - the first $L tracks 5 different util_min values
> - the second $L tracks 5 different util_max values

Well, not quite; if you want that, you should put
____cacheline_aligned on struct uclamp_cpu, such that the individual
array entries are each aligned. The above only aligns the whole array,
so the second uclamp_cpu is spread over both lines.

But I think this is actually better, since you have to scan both
min/max anyway, and allowing one to straddle a line you have to touch
anyway allows for using fewer lines in total.

Consider for example the case where UCLAMP_BUCKETS=8, then each
uclamp_cpu would be 9 words or 72 bytes. If you force align the member,
then you end up with 4 lines, whereas now it would be 3.

> > Is that the best layout?
>
> It changed few times and that's what I found more reasonable for both
> for fitting the default configuration and also for code readability.
> Notice that we access RQ and SE clamp values with the same patter,
> for example:
>
> {rq|p}->uclamp[clamp_idx].value
>
> Are you worried about the holes or something else specific ?

Not sure; just mostly asking if this was by design or by accident.

One thing I did wonder though; since bucket[0] is counting the tasks
that are unconstrained and its bucket value is basically fixed (0 /
1024), can't we abuse that value field to store uclamp_cpu::value ?

OTOH, doing that might make the code really ugly with all them:

if (!bucket_id)

exceptions all over the place.

2019-01-22 10:32:54

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 22-Jan 10:45, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 04:33:38PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 17:12, Peter Zijlstra wrote:
> > > On Mon, Jan 21, 2019 at 03:23:11PM +0000, Patrick Bellasi wrote:
>
> > > > and keep all
> > > > the buckets in use at the beginning of a cache line.
> > >
> > > That; is that the rationale for all this? Note that per the defaults
> > > everything is in a single line already.
> >
> > Yes, that's because of the loop in:
> >
> > dequeue_task()
> > uclamp_cpu_dec()
> > uclamp_cpu_dec_id()
> > uclamp_cpu_update()
> >
> > where buckets needs sometimes to be scanned to find a new max.
> >
> > Consider also that, with mapping, we can more easily increase the
> > buckets count to 20 in order to have a finer clamping granularity if
> > needed without warring too much about performance impact especially
> > when we use anyway few different clamp values.
> >
> > So, I agree that mapping adds (code) complexity but it can also save
> > few cycles in the fast path... do you think it's not worth the added
> > complexity?
>
> Then maybe split this out in a separate patch? Do the trivial linear
> bucket thing first and then do this smarty pants thing on top.
>
> One problem with the scheme is that it doesn't defrag; so if you get a
> peak usage, you can still end up with only two active buckets in
> different lines.

You're right, that was saved for a later optimization. :/

Mainly in consideration of the fact that, at least for the main
use-case we have in mind on Android, we will likely configure all the
required clamps once and for all at boot time.

> Also; if it is it's own patch, you get a much better view of the
> additional complexity and a chance to justify it ;-)

What about ditching the mapping for the time being and seeing whether
we get a real overhead hit in the future?

At that point we will revamp the mapping patch with also a proper
defrag support.

> Also; would it make sense to do s/cpu/rq/ on much of this? All this
> uclamp_cpu_*() stuff really is per rq and takes rq arguments, so why
> does it have cpu in the name... no strong feelings, just noticed it and
> thought is a tad inconsistent.

The idea behind using "cpu" instead of "rq" was that we use those only at
the root rq level and the clamps are aggregated per-CPU.

I remember one of the first versions used "cpu" instead of "rq" as a
parameter name and you proposed to change it as an optimization since
we call it from dequeue_task() where we already have a *rq.

... but, since we have those uclamp data within struct rq, I think you
are right: it makes more sense to rename the functions.

Will do it in v7, thanks.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 10:40:57

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On Tuesday, January 15, 2019 11:15:05 AM CET Patrick Bellasi wrote:
> Each time a frequency update is required via schedutil, a frequency is
> selected to (possibly) satisfy the utilization reported by each
> scheduling class. However, when utilization clamping is in use, the
> frequency selection should consider userspace utilization clamping
> hints. This will allow, for example, to:
>
> - boost tasks which are directly affecting the user experience
> by running them at least at a minimum "requested" frequency
>
> - cap low priority tasks not directly affecting the user experience
> by running them only up to a maximum "allowed" frequency
>
> These constraints are meant to support a per-task based tuning of the
> frequency selection thus supporting a fine grained definition of
> performance boosting vs energy saving strategies in kernel space.
>
> Add support to clamp the utilization and IOWait boost of RUNNABLE FAIR
> tasks within the boundaries defined by their aggregated utilization
> clamp constraints.
> Based on the max(min_util, max_util) of each task, max-aggregated the
> CPU clamp value in a way to give the boosted tasks the performance they
> need when they happen to be co-scheduled with other capped tasks.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
>
> ---
> Changes in v6:
> Message-ID: <20181107113849.GC14309@e110439-lin>
> - sanity check util_max >= util_min
> Others:
> - wholesale s/group/bucket/
> - wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
> ---
> kernel/sched/cpufreq_schedutil.c | 27 ++++++++++++++++++++++++---
> kernel/sched/sched.h | 23 +++++++++++++++++++++++
> 2 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 033ec7c45f13..520ee2b785e7 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,8 +218,15 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> * CFS tasks and we use the same metric to track the effective
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> + *
> + * CFS utilization can be boosted or capped, depending on utilization
> + * clamp constraints requested by currently RUNNABLE tasks.
> + * When there are no CFS RUNNABLE tasks, clamps are released and
> + * frequency will be gracefully reduced with the utilization decay.
> */
> - util = util_cfs;
> + util = (type == ENERGY_UTIL)
> + ? util_cfs
> + : uclamp_util(rq, util_cfs);
> util += cpu_util_rt(rq);
>
> dl_util = cpu_util_dl(rq);
> @@ -327,6 +334,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> unsigned int flags)
> {
> bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> + unsigned int max_boost;
>
> /* Reset boost if the CPU appears to have been idle enough */
> if (sg_cpu->iowait_boost &&
> @@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> return;
> sg_cpu->iowait_boost_pending = true;
>
> + /*
> + * Boost FAIR tasks only up to the CPU clamped utilization.
> + *
> + * Since DL tasks have a much more advanced bandwidth control, it's
> + * safe to assume that IO boost does not apply to those tasks.
> + * Instead, since RT tasks are not utilization clamped, we don't want
> + * to apply clamping on IO boost while there is blocked RT
> + * utilization.
> + */
> + max_boost = sg_cpu->iowait_boost_max;
> + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
> +
> /* Double the boost at each request */
> if (sg_cpu->iowait_boost) {
> sg_cpu->iowait_boost <<= 1;
> - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> + if (sg_cpu->iowait_boost > max_boost)
> + sg_cpu->iowait_boost = max_boost;
> return;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b7f3ee8ba164..95d62a2a0b44 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2267,6 +2267,29 @@ static inline unsigned int uclamp_none(int clamp_id)
> return SCHED_CAPACITY_SCALE;
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
> +{
> + unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> + unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> +
> + /*
> + * Since CPU's {min,max}_util clamps are MAX aggregated considering
> + * RUNNABLE tasks with _different_ clamps, we can end up with an
> + * invertion, which we can fix at usage time.
> + */
> + if (unlikely(min_util >= max_util))
> + return min_util;
> +
> + return clamp(util, min_util, max_util);
> +}
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
> +{
> + return util;
> +}
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> #ifdef arch_scale_freq_capacity
> # ifndef arch_scale_freq_invariant
> # define arch_scale_freq_invariant() true
>

IMO it would be better to combine this patch with the next one.

At least some things in it I was about to ask about would go away
then. :-)

Besides, I don't really see a reason for the split here.



2019-01-22 10:45:48

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 22-Jan 10:37, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 03:44:12PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 16:33, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:02AM +0000, Patrick Bellasi wrote:
> > >
> > > > +static inline void
> > > > +uclamp_task_update_active(struct task_struct *p, unsigned int clamp_id)
> > > > +{
> > > > + struct rq_flags rf;
> > > > + struct rq *rq;
> > > > +
> > > > + /*
> > > > + * Lock the task and the CPU where the task is (or was) queued.
> > > > + *
> > > > + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
> > > > + * price to pay to safely serialize util_{min,max} updates with
> > > > + * enqueues, dequeues and migration operations.
> > > > + * This is the same locking schema used by __set_cpus_allowed_ptr().
> > > > + */
> > > > + rq = task_rq_lock(p, &rf);
> > > > +
> > > > + /*
> > > > + * Setting the clamp bucket is serialized by task_rq_lock().
> > > > + * If the task is not yet RUNNABLE and its task_struct is not
> > > > + * affecting a valid clamp bucket, the next time it's enqueued,
> > > > + * it will already see the updated clamp bucket value.
> > > > + */
> > > > + if (!p->uclamp[clamp_id].active)
> > > > + goto done;
> > > > +
> > > > + uclamp_cpu_dec_id(p, rq, clamp_id);
> > > > + uclamp_cpu_inc_id(p, rq, clamp_id);
> > > > +
> > > > +done:
> > > > + task_rq_unlock(rq, p, &rf);
> > > > +}
> > >
> > > > @@ -1008,11 +1043,11 @@ static int __setscheduler_uclamp(struct task_struct *p,
> > > >
> > > > mutex_lock(&uclamp_mutex);
> > > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > > > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MIN],
> > > > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MIN],
> > > > UCLAMP_MIN, lower_bound);
> > > > }
> > > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > > > - uclamp_bucket_inc(&p->uclamp[UCLAMP_MAX],
> > > > + uclamp_bucket_inc(p, &p->uclamp[UCLAMP_MAX],
> > > > UCLAMP_MAX, upper_bound);
> > > > }
> > > > mutex_unlock(&uclamp_mutex);
> > >
> > >
> > > But.... __sched_setscheduler() actually does the whole dequeue + enqueue
> > > thing already ?!? See where it does __setscheduler().
> >
> > This is slow-path accounting, not fast path.
>
> Sure; but that's still no reason for duplicate or unneeded code.
>
> > There are two refcounting going on here:
> >
> > 1) mapped buckets:
> >
> > clamp_value <--(M1)--> bucket_id
> >
> > 2) RUNNABLE tasks:
> >
> > bucket_id <--(M2)--> RUNNABLE tasks in a bucket
> >
> > What we fix here is the refcounting for the buckets mapping. If a task
> > does not have a task specific clamp value it does not refcount any
> > bucket. The moment we assign a task specific clamp value, we need to
> > refcount the task in the bucket corresponding to that clamp value.
> >
> > This will keep the bucket in use at least as long as the task will
> > need that clamp value.
>
> Sure, I get that. What I don't get is why you're adding that (2) here.
> Like said, __sched_setscheduler() already does a dequeue/enqueue under
> rq->lock, which should already take care of that.

Oh, ok... I got what you mean now.

With:

[PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy
<[email protected]>

we can call __sched_setscheduler() with:

attr->sched_flags & SCHED_FLAG_KEEP_POLICY

whenever we just want to change the clamp values of a task without
changing its class. Thus, we can end up returning from
__sched_setscheduler() without doing an actual dequeue/enqueue.

This is likely the most common use-case.

I'd better check whether I can propagate this info and avoid M2 when we
actually did a dequeue/enqueue.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 10:56:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 04/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 22-Jan 11:03, Peter Zijlstra wrote:
> On Mon, Jan 21, 2019 at 03:54:07PM +0000, Patrick Bellasi wrote:
> > On 21-Jan 16:17, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:01AM +0000, Patrick Bellasi wrote:
> > > > +#ifdef CONFIG_UCLAMP_TASK
> > >
> > > > +struct uclamp_bucket {
> > > > + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> > > > + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> > > > +};
> > >
> > > > +struct uclamp_cpu {
> > > > + unsigned int value;
> > >
> > > /* 4 byte hole */
> > >
> > > > + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> > > > +};
> > >
> > > With the default of 5, this UCLAMP_BUCKETS := 6, so struct uclamp_cpu
> > > ends up being 7 'unsigned long's, or 56 bytes on 64bit (with a 4 byte
> > > hole).
> >
> > Yes, that's dimensioned and configured to fit into a single cache line
> > for all the possible 5 (by default) clamp values of a clamp index
> > (i.e. min or max util).
>
> And I suppose you picked 5 because 20% is a 'nice' number? whereas
> 16./666/% is a bit odd?

Yes, UCLAMP_BUCKETS:=6 gives me 5 20% buckets:

0-19%, 20-39%, 40-59%, 60-79%, 80-99%

plus a 100% bucket to track the max boosted tasks.

Does that make sense?
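For illustration, the bucketization described above can be sketched as a small userspace model (hypothetical code based on this thread, not the actual kernel implementation; the rounding scheme is an assumption):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		6	/* 5 x 20% buckets + 1 bucket for 100% boosted tasks */

/*
 * Map a clamp value in [0..1024] to a bucket index in [0..5]:
 * 0-19% -> 0, 20-39% -> 1, ..., 80-99% -> 4, 100% -> 5.
 */
static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value * (UCLAMP_BUCKETS - 1) / SCHED_CAPACITY_SCALE;
}
```

e.g. a 50% clamp (512) lands in the 40-59% bucket (index 2), while a fully boosted task (1024) gets the dedicated top bucket (index 5).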

> > > > +#endif /* CONFIG_UCLAMP_TASK */
> > > > +
> > > > /*
> > > > * This is the main, per-CPU runqueue data structure.
> > > > *
> > > > @@ -835,6 +879,11 @@ struct rq {
> > > > unsigned long nr_load_updates;
> > > > u64 nr_switches;
> > > >
> > > > +#ifdef CONFIG_UCLAMP_TASK
> > > > + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> > > > + struct uclamp_cpu uclamp[UCLAMP_CNT] ____cacheline_aligned;
> > >
> > > Which makes this 112 bytes with 8 bytes in 2 holes, which is short of 2
> > > 64 byte cachelines.
> >
> > Right, we have 2 cache lines where:
> > - the first $L tracks 5 different util_min values
> > - the second $L tracks 5 different util_max values
>
> Well, not quite so, if you want that you should put
> ____cacheline_aligned on struct uclamp_cpu. Such that the individual
> array entries are each aligned, the above only aligns the whole array,
> so the second uclamp_cpu is spread over both lines.

That's true... I considered it more important to save space when the
number of buckets can fit in, let's say, 3 cache lines.
... but if you prefer the other way around I'll move it.

> But I think this is actually better, since you have to scan both
> min/max anyway, and allowing one the straddle a line you have to touch
> anyway, allows for using less lines in total.

Right.

> Consider for example the case where UCLAMP_BUCKETS=8, then each
> uclamp_cpu would be 9 words or 72 bytes. If you force align the member,
> then you end up with 4 lines, whereas now it would be 3.

Exactly :)
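The sizes being discussed can be checked with a minimal userspace mock-up of the two structs (assuming LP64, where bits_per(SCHED_CAPACITY_SCALE) is 11; this is a sketch, not the kernel code):

```c
#include <assert.h>

#define UCLAMP_BUCKETS	6

/* 11 value bits are enough for SCHED_CAPACITY_SCALE = 1024 */
struct uclamp_bucket {
	unsigned long value : 11;
	unsigned long tasks : 64 - 11;
};

struct uclamp_cpu {
	unsigned int value;	/* followed by a 4 byte hole on LP64 */
	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};
```

sizeof(struct uclamp_cpu) comes out at 56 bytes (7 unsigned longs), so the two-element rq::uclamp array spans 112 bytes, i.e. just short of two 64-byte cache lines, matching the numbers above.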

> > > Is that the best layout?
> >
> > It changed few times and that's what I found more reasonable for both
> > for fitting the default configuration and also for code readability.
> > Notice that we access RQ and SE clamp values with the same pattern,
> > for example:
> >
> > {rq|p}->uclamp[clamp_idx].value
> >
> > Are you worried about the holes or something else specific ?
>
> Not sure; just mostly asking if this was by design or by accident.
>
> One thing I did wonder though; since bucket[0] is counting the tasks
> that are unconstrained and its bucket value is basically fixed (0 /
> 1024), can't we abuse that value field to store uclamp_cpu::value ?

Mmm... should be possible, I'm just worried about adding special cases
which can make the code even more complex than it already is.

.... moreover, if we ditch the mapping, the 1024 will be indexed at
the top of the array... so...

> OTOH, doing that might make the code really ugly with all them:
>
> if (!bucket_id)
>
> exceptions all over the place.

Exactly... I should read all your comments before replying :)

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 11:03:46

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On 22-Jan 11:37, Rafael J. Wysocki wrote:
> On Tuesday, January 15, 2019 11:15:05 AM CET Patrick Bellasi wrote:
> > Each time a frequency update is required via schedutil, a frequency is
> > selected to (possibly) satisfy the utilization reported by each
> > scheduling class. However, when utilization clamping is in use, the
> > frequency selection should consider userspace utilization clamping
> > hints. This will allow, for example, to:
> >
> > - boost tasks which are directly affecting the user experience
> > by running them at least at a minimum "requested" frequency
> >
> > - cap low priority tasks not directly affecting the user experience
> > by running them only up to a maximum "allowed" frequency
> >
> > These constraints are meant to support a per-task based tuning of the
> > frequency selection thus supporting a fine grained definition of
> > performance boosting vs energy saving strategies in kernel space.
> >
> > Add support to clamp the utilization and IOWait boost of RUNNABLE FAIR
> > tasks within the boundaries defined by their aggregated utilization
> > clamp constraints.
> > Based on the max(min_util, max_util) of each task, max-aggregated the
> > CPU clamp value in a way to give the boosted tasks the performance they
> > need when they happen to be co-scheduled with other capped tasks.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> >
> > ---
> > Changes in v6:
> > Message-ID: <20181107113849.GC14309@e110439-lin>
> > - sanity check util_max >= util_min
> > Others:
> > - wholesale s/group/bucket/
> > - wholesale s/_{get,put}/_{inc,dec}/ to match refcount APIs
> > ---
> > kernel/sched/cpufreq_schedutil.c | 27 ++++++++++++++++++++++++---
> > kernel/sched/sched.h | 23 +++++++++++++++++++++++
> > 2 files changed, 47 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 033ec7c45f13..520ee2b785e7 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -218,8 +218,15 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > * CFS tasks and we use the same metric to track the effective
> > * utilization (PELT windows are synchronized) we can directly add them
> > * to obtain the CPU's actual utilization.
> > + *
> > + * CFS utilization can be boosted or capped, depending on utilization
> > + * clamp constraints requested by currently RUNNABLE tasks.
> > + * When there are no CFS RUNNABLE tasks, clamps are released and
> > + * frequency will be gracefully reduced with the utilization decay.
> > */
> > - util = util_cfs;
> > + util = (type == ENERGY_UTIL)
> > + ? util_cfs
> > + : uclamp_util(rq, util_cfs);
> > util += cpu_util_rt(rq);
> >
> > dl_util = cpu_util_dl(rq);
> > @@ -327,6 +334,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > unsigned int flags)
> > {
> > bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> > + unsigned int max_boost;
> >
> > /* Reset boost if the CPU appears to have been idle enough */
> > if (sg_cpu->iowait_boost &&
> > @@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > return;
> > sg_cpu->iowait_boost_pending = true;
> >
> > + /*
> > + * Boost FAIR tasks only up to the CPU clamped utilization.
> > + *
> > + * Since DL tasks have a much more advanced bandwidth control, it's
> > + * safe to assume that IO boost does not apply to those tasks.
> > + * Instead, since RT tasks are not utilization clamped, we don't want
> > + * to apply clamping on IO boost while there is blocked RT
> > + * utilization.
> > + */
> > + max_boost = sg_cpu->iowait_boost_max;
> > + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> > + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
> > +
> > /* Double the boost at each request */
> > if (sg_cpu->iowait_boost) {
> > sg_cpu->iowait_boost <<= 1;
> > - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> > - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> > + if (sg_cpu->iowait_boost > max_boost)
> > + sg_cpu->iowait_boost = max_boost;
> > return;
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b7f3ee8ba164..95d62a2a0b44 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2267,6 +2267,29 @@ static inline unsigned int uclamp_none(int clamp_id)
> > return SCHED_CAPACITY_SCALE;
> > }
> >
> > +#ifdef CONFIG_UCLAMP_TASK
> > +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
> > +{
> > + unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> > + unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> > +
> > + /*
> > + * Since CPU's {min,max}_util clamps are MAX aggregated considering
> > + * RUNNABLE tasks with _different_ clamps, we can end up with an
> > + * invertion, which we can fix at usage time.
> > + */
> > + if (unlikely(min_util >= max_util))
> > + return min_util;
> > +
> > + return clamp(util, min_util, max_util);
> > +}
> > +#else /* CONFIG_UCLAMP_TASK */
> > +static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
> > +{
> > + return util;
> > +}
> > +#endif /* CONFIG_UCLAMP_TASK */
> > +
> > #ifdef arch_scale_freq_capacity
> > # ifndef arch_scale_freq_invariant
> > # define arch_scale_freq_invariant() true
> >
>
> IMO it would be better to combine this patch with the next one.

Main reason was to better document in the changelog what we do for the
two different classes...

> At least some things in it I was about to ask about would go away
> then. :-)

... but if it creates confusion I can certainly merge them.

Or maybe I can better clarify what's unclear in this patch: may I ask
what your questions were?

> Besides, I don't really see a reason for the split here.

Was mainly to make the changes required for RT more self-contained.

For that class only, not for FAIR, we have additional code in the
following patch which adds uclamp_default_perf, the system
defaults used to track/account tasks requesting the maximum frequency.

Again, I can either better clarify the above patch or just merge the
two together: what do you prefer ?

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 11:07:24

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On Tue, Jan 22, 2019 at 12:02 PM Patrick Bellasi
<[email protected]> wrote:
>
> On 22-Jan 11:37, Rafael J. Wysocki wrote:
> > On Tuesday, January 15, 2019 11:15:05 AM CET Patrick Bellasi wrote:

[cut]

> >
> > IMO it would be better to combine this patch with the next one.
>
> Main reason was to better document in the changelog what we do for the
> two different classes...
>
> > At least some things in it I was about to ask about would go away
> > then. :-)
>
> ... but if it creates confusion I can certainly merge them.
>
> Or maybe clarify better in this patch what's not clear: may I ask what
> were your questions ?
>
> > Besides, I don't really see a reason for the split here.
>
> Was mainly to make the changes required for RT more self-contained.
>
> For that class only, not for FAIR, we have additional code in the
> following patch which add uclamp_default_perf which are system
> defaults used to track/account tasks requesting the maximum frequency.
>
> Again, I can either better clarify the above patch or just merge the
> two together: what do you prefer ?

Merge the two together, please.

2019-01-22 11:29:02

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On 22-Jan 12:04, Rafael J. Wysocki wrote:
> On Tue, Jan 22, 2019 at 12:02 PM Patrick Bellasi
> <[email protected]> wrote:
> >
> > On 22-Jan 11:37, Rafael J. Wysocki wrote:
> > > On Tuesday, January 15, 2019 11:15:05 AM CET Patrick Bellasi wrote:
>
> Merge the two together, please.

Ok, will do in v7, thanks.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 12:15:34

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> The Energy Aware Scheduler (AES) estimates the energy impact of waking

s/AES/EAS :-)

[...]
> + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> +
> + /*
> + * Busy time computation: utilization clamping is not
> + * required since the ratio (sum_util / cpu_capacity)
> + * is already enough to scale the EM reported power
> + * consumption at the (eventually clamped) cpu_capacity.
> + */

Right.

> + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> + ENERGY_UTIL, NULL);
> +
> + /*
> + * Performance domain frequency: utilization clamping
> + * must be considered since it affects the selection
> + * of the performance domain frequency.
> + */

So that actually affects the way we deal with RT I think. I assume the
idea is to say if you don't want to reflect the RT-go-to-max-freq thing
in EAS (which is what we do now) you should set the min cap for RT to 0.
Is that correct ?

I'm fine with this conceptually but maybe the specific case of RT should
be mentioned somewhere in the commit message or so ? I think it's
important to say that clearly since this patch changes the default
behaviour.

> + cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> + FREQUENCY_UTIL,
> + cpu == dst_cpu ? p : NULL);
> + max_util = max(max_util, cpu_util);
> }
>
> energy += em_pd_energy(pd->em_pd, max_util, sum_util);

Thanks,
Quentin

2019-01-22 12:31:38

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Tuesday 15 Jan 2019 at 10:15:06 (+0000), Patrick Bellasi wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 520ee2b785e7..38a05a4f78cc 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -201,9 +201,6 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> unsigned long dl_util, util, irq;
> struct rq *rq = cpu_rq(cpu);
>
> - if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
> - return max;
> -
> /*
> * Early check to see if IRQ/steal time saturates the CPU, can be
> * because of inaccuracies in how we track these -- see
> @@ -219,15 +216,19 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> *
> - * CFS utilization can be boosted or capped, depending on utilization
> - * clamp constraints requested by currently RUNNABLE tasks.
> + * CFS and RT utilization can be boosted or capped, depending on
> + * utilization clamp constraints requested by currently RUNNABLE
> + * tasks.
> * When there are no CFS RUNNABLE tasks, clamps are released and
> * frequency will be gracefully reduced with the utilization decay.
> */
> - util = (type == ENERGY_UTIL)
> - ? util_cfs
> - : uclamp_util(rq, util_cfs);
> - util += cpu_util_rt(rq);
> + util = cpu_util_rt(rq);
> + if (type == FREQUENCY_UTIL) {
> + util += cpu_util_cfs(rq);
> + util = uclamp_util(rq, util);

So with this we don't go to max anymore for CONFIG_UCLAMP_TASK=n, no ?

Thanks,
Quentin

2019-01-22 12:39:39

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 22-Jan 12:30, Quentin Perret wrote:
> On Tuesday 15 Jan 2019 at 10:15:06 (+0000), Patrick Bellasi wrote:
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 520ee2b785e7..38a05a4f78cc 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -201,9 +201,6 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > unsigned long dl_util, util, irq;
> > struct rq *rq = cpu_rq(cpu);
> >
> > - if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
> > - return max;
> > -
> > /*
> > * Early check to see if IRQ/steal time saturates the CPU, can be
> > * because of inaccuracies in how we track these -- see
> > @@ -219,15 +216,19 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > * utilization (PELT windows are synchronized) we can directly add them
> > * to obtain the CPU's actual utilization.
> > *
> > - * CFS utilization can be boosted or capped, depending on utilization
> > - * clamp constraints requested by currently RUNNABLE tasks.
> > + * CFS and RT utilization can be boosted or capped, depending on
> > + * utilization clamp constraints requested by currently RUNNABLE
> > + * tasks.
> > * When there are no CFS RUNNABLE tasks, clamps are released and
> > * frequency will be gracefully reduced with the utilization decay.
> > */
> > - util = (type == ENERGY_UTIL)
> > - ? util_cfs
> > - : uclamp_util(rq, util_cfs);
> > - util += cpu_util_rt(rq);
> > + util = cpu_util_rt(rq);
> > + if (type == FREQUENCY_UTIL) {
> > + util += cpu_util_cfs(rq);
> > + util = uclamp_util(rq, util);
>
> So with this we don't go to max anymore for CONFIG_UCLAMP_TASK=n, no ?

Mmm... good point!

I need to guard this change for the !CONFIG_UCLAMP_TASK case!

>
> Thanks,
> Quentin

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 12:49:01

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On 22-Jan 12:13, Quentin Perret wrote:
> On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > The Energy Aware Scheduler (AES) estimates the energy impact of waking

[...]

> > + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > +
> > + /*
> > + * Busy time computation: utilization clamping is not
> > + * required since the ratio (sum_util / cpu_capacity)
> > + * is already enough to scale the EM reported power
> > + * consumption at the (eventually clamped) cpu_capacity.
> > + */
>
> Right.
>
> > + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > + ENERGY_UTIL, NULL);
> > +
> > + /*
> > + * Performance domain frequency: utilization clamping
> > + * must be considered since it affects the selection
> > + * of the performance domain frequency.
> > + */
>
> So that actually affects the way we deal with RT I think. I assume the
> idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> in EAS (which is what we do now) you should set the min cap for RT to 0.
> Is that correct ?

With the default configuration, RT tasks still go to max when uclamp is
enabled, since they get a util_min=1024.

If we want to save power on RT tasks, we can set a smaller util_min...
but not necessarily 0. A util_min=0 for RT tasks means to use just
cpu_util_rt() for that class.

> I'm fine with this conceptually but maybe the specific case of RT should
> be mentioned somewhere in the commit message or so ? I think it's
> important to say that clearly since this patch changes the default
> behaviour.

Default behavior for RT should not be affected, while a capping is
still possible for those tasks... where do you see issues?
Here we are just figuring out the capacity the task will run at;
if we have clamped RT tasks it will not be the max: is that
a problem?

> > + cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > + FREQUENCY_UTIL,
> > + cpu == dst_cpu ? p : NULL);
> > + max_util = max(max_util, cpu_util);
> > }
> >
> > energy += em_pd_energy(pd->em_pd, max_util, sum_util);
>
> Thanks,
> Quentin

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 13:30:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Tue, Jan 22, 2019 at 10:43:05AM +0000, Patrick Bellasi wrote:
> On 22-Jan 10:37, Peter Zijlstra wrote:

> > Sure, I get that. What I don't get is why you're adding that (2) here.
> > Like said, __sched_setscheduler() already does a dequeue/enqueue under
> > rq->lock, which should already take care of that.
>
> Oh, ok... got it what you mean now.
>
> With:
>
> [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy
> <[email protected]>
>
> we can call __sched_setscheduler() with:
>
> attr->sched_flags & SCHED_FLAG_KEEP_POLICY
>
> whenever we want just to change the clamp values of a task without
> changing its class. Thus, we can end up returning from
> __sched_setscheduler() without doing an actual dequeue/enqueue.

I don't see that happening.. when KEEP_POLICY we set attr.sched_policy =
SETPARAM_POLICY. That is then checked again in __setscheduler_param(),
which is in the middle of that dequeue/enqueue.

Also, and this might be 'broken', SETPARAM_POLICY _does_ reset all the
other attributes, it only preserves policy, but it will (re)set nice
level for example (see that same function).

So maybe we want to introduce another (few?) FLAG_KEEP flag(s) that
preserve the other bits; I'm thinking at least KEEP_PARAM and KEEP_UTIL
or something.


2019-01-22 13:31:30

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> On 22-Jan 12:13, Quentin Perret wrote:
> > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > The Energy Aware Scheduler (AES) estimates the energy impact of waking
>
> [...]
>
> > > + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > +
> > > + /*
> > > + * Busy time computation: utilization clamping is not
> > > + * required since the ratio (sum_util / cpu_capacity)
> > > + * is already enough to scale the EM reported power
> > > + * consumption at the (eventually clamped) cpu_capacity.
> > > + */
> >
> > Right.
> >
> > > + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > + ENERGY_UTIL, NULL);
> > > +
> > > + /*
> > > + * Performance domain frequency: utilization clamping
> > > + * must be considered since it affects the selection
> > > + * of the performance domain frequency.
> > > + */
> >
> > So that actually affects the way we deal with RT I think. I assume the
> > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > Is that correct ?
>
> By default configuration, RT tasks still go to max when uclamp is
> enabled, since they get a util_min=1024.
>
> If we want to save power on RT tasks, we can set a smaller util_min...
> but not necessarily 0. A util_min=0 for RT tasks means to use just
> cpu_util_rt() for that class.

Ah, sorry, I guess my message was misleading. I'm saying this is
changing the way _EAS_ deals with RT tasks. Right now we don't actually
consider the RT-go-to-max thing at all in the EAS prediction. Your
patch is changing that AFAICT. It actually changes the way EAS sees RT
tasks even without uclamp ...

But I'm not hostile to the idea since it's possible to enable uclamp and
set the min cap to 0 for RT if you want. So there is a story there.
However, I think this needs to be documented somewhere, at the very least.

Thanks,
Quentin

2019-01-22 13:59:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On Tue, Jan 15, 2019 at 10:15:04AM +0000, Patrick Bellasi wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 84294925d006..c8f391d1cdc5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -625,6 +625,11 @@ struct uclamp_se {
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> unsigned int mapped : 1;
> unsigned int active : 1;
> + /* Clamp bucket and value actually used by a RUNNABLE task */
> + struct {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> + } effective;

I am confuzled by this thing.. so uclamp_se already has a value,bucket,
which per the prior code is the effective one.

Now; I think I see why you want another value; you need the second to
store the original value for when the system limits change and we must
re-evaluate.

So why are you not adding something like:

unsigned int orig_value : bits_per(SCHED_CAPACITY_SCALE);

> +unsigned int sysctl_sched_uclamp_util_min;

> +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;

> +static inline void
> +uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> + unsigned int *clamp_value, unsigned int *bucket_id)
> +{
> + /* Task specific clamp value */
> + *clamp_value = p->uclamp[clamp_id].value;
> + *bucket_id = p->uclamp[clamp_id].bucket_id;
> +
> + /* System default restriction */
> + if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> + *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> + /* Keep it simple: unconditionally enforce system defaults */
> + *clamp_value = uclamp_default[clamp_id].value;
> + *bucket_id = uclamp_default[clamp_id].bucket_id;
> + }
> +}

That would then turn into something like:

unsigned int high = READ_ONCE(sysctl_sched_uclamp_util_max);
unsigned int low = READ_ONCE(sysctl_sched_uclamp_util_min);

uclamp_se->orig_value = value;
uclamp_se->value = clamp(value, low, high);

And then determine bucket_id based on value.
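The scheme suggested here can be sketched in userspace C (field and function names below are illustrative, not the actual kernel API):

```c
#include <assert.h>

struct uclamp_se {
	unsigned int orig_value;	/* task-requested clamp value */
	unsigned int value;		/* effective value, after system defaults */
};

static unsigned int clamp_uint(unsigned int v, unsigned int lo, unsigned int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Re-evaluate the effective value against the current system defaults. */
static void uclamp_se_update(struct uclamp_se *se, unsigned int value,
			     unsigned int sysctl_min, unsigned int sysctl_max)
{
	se->orig_value = value;
	se->value = clamp_uint(value, sysctl_min, sysctl_max);
}
```

Keeping orig_value around lets the effective value be recomputed whenever the sysctl limits change, which is the re-evaluation case mentioned above.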

> +int sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + int old_min, old_max;
> + int result = 0;
> +
> + mutex_lock(&uclamp_mutex);
> +
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> + sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto undo;
> + }
> +
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MIN],
> + UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MAX],
> + UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> + }

Should you not update all tasks?


2019-01-22 14:03:13

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 22-Jan 14:28, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 10:43:05AM +0000, Patrick Bellasi wrote:
> > On 22-Jan 10:37, Peter Zijlstra wrote:
>
> > > Sure, I get that. What I don't get is why you're adding that (2) here.
> > > Like said, __sched_setscheduler() already does a dequeue/enqueue under
> > > rq->lock, which should already take care of that.
> >
> > Oh, ok... got it what you mean now.
> >
> > With:
> >
> > [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy
> > <[email protected]>
> >
> > we can call __sched_setscheduler() with:
> >
> > attr->sched_flags & SCHED_FLAG_KEEP_POLICY
> >
> > whenever we want just to change the clamp values of a task without
> > changing its class. Thus, we can end up returning from
> > __sched_setscheduler() without doing an actual dequeue/enqueue.
>
> I don't see that happening.. when KEEP_POLICY we set attr.sched_policy =
> SETPARAM_POLICY. That is then checked again in __setscheduler_param(),
> which is in the middle of that dequeue/enqueue.

Yes, I think I've forgotten a check before we actually dequeue the task.

The current code does:

---8<---
SYSCALL_DEFINE3(sched_setattr)

// A) request to keep the same policy
if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
attr.sched_policy = SETPARAM_POLICY;

sched_setattr()
// B) actually enforce the same policy
if (policy < 0)
policy = oldpolicy = p->policy;

// C) tune the clamp values
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
retval = __setscheduler_uclamp(p, attr);

// D) tune attributes if policy is the same
if (unlikely(policy == p->policy))
if (fair_policy(policy) && attr->sched_nice != task_nice(p))
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
return 0;
change:

// E) dequeue/enqueue task
---8<---

So, probably in D) I've missed a check on SCHED_FLAG_KEEP_POLICY to
enforce a return in that case...

> Also, and this might be 'broken', SETPARAM_POLICY _does_ reset all the
> other attributes, it only preserves policy, but it will (re)set nice
> level for example (see that same function).

Mmm... right... my bad! :/

> So maybe we want to introduce another (few?) FLAG_KEEP flag(s) that
> preserve the other bits; I'm thinking at least KEEP_PARAM and KEEP_UTIL
> or something.

Yes, I would say we have two options:

1) SCHED_FLAG_KEEP_POLICY keeps all the scheduling class specific
attributes, but not the cross class attributes (e.g. uclamp)

2) add SCHED_KEEP_NICE, SCHED_KEEP_PRIO, and SCHED_KEEP_PARAMS
and use them in the if conditions in D)

In both cases the goal should be to return from code block D).

What do you prefer?

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 14:27:38

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On 22-Jan 13:29, Quentin Perret wrote:
> On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > On 22-Jan 12:13, Quentin Perret wrote:
> > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > The Energy Aware Scheduler (AES) estimates the energy impact of waking
> >
> > [...]
> >
> > > > + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > > + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > > +
> > > > + /*
> > > > + * Busy time computation: utilization clamping is not
> > > > + * required since the ratio (sum_util / cpu_capacity)
> > > > + * is already enough to scale the EM reported power
> > > > + * consumption at the (eventually clamped) cpu_capacity.
> > > > + */
> > >
> > > Right.
> > >
> > > > + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > > + ENERGY_UTIL, NULL);
> > > > +
> > > > + /*
> > > > + * Performance domain frequency: utilization clamping
> > > > + * must be considered since it affects the selection
> > > > + * of the performance domain frequency.
> > > > + */
> > >
> > > So that actually affects the way we deal with RT I think. I assume the
> > > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > > Is that correct ?
> >
> > By default configuration, RT tasks still go to max when uclamp is
> > enabled, since they get a util_min=1024.
> >
> > If we want to save power on RT tasks, we can set a smaller util_min...
> > but not necessarily 0. A util_min=0 for RT tasks means to use just
> > cpu_util_rt() for that class.
>
> Ah, sorry, I guess my message was misleading. I'm saying this is
> changing the way _EAS_ deals with RT tasks. Right now we don't actually
> consider the RT-go-to-max thing at all in the EAS prediction. Your
> patch is changing that AFAICT. It actually changes the way EAS sees RT
> tasks even without uclamp ...

Lemme see if I get it right.

Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
always use cpu_util_rt() for RT tasks:

---8<---
util = util_cfs;
util += cpu_util_rt(rq);
util += dl_util;
---8<---

Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
running at the max capacity but just whatever is the aggregated
utilization across all the classes.

With uclamp, we have:

---8<---
util = cpu_util_rt(rq) + util_cfs;
if (type == FREQUENCY_UTIL)
util = uclamp_util_with(rq, util, p);
dl_util = cpu_util_dl(rq);
if (type == ENERGY_UTIL)
util += dl_util;
---8<---

So, I would say that, in terms of ENERGY_UTIL, we do the same both
w/ and w/o uclamp, don't we?


> But I'm not hostile to the idea since it's possible to enable uclamp and
> set the min cap to 0 for RT if you want. So there is a story there.
> However, I think this needs to be documented somewhere, at the very least.

The only difference I see is that the actual frequency could be
different (lower than max) when a clamped RT task is RUNNABLE.

Are you worried that running RT on a lower freq could have side
effects on the estimated busy time of the CPU?

I also still don't completely get why you say it could be useful to
"set the min cap to 0 for RT if you want"

IMO this will be an even bigger difference wrt mainline, since the RT
tasks will never have a guaranteed minimum freq but just whatever
utilization we measure for them.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 14:41:44

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On Tuesday 22 Jan 2019 at 14:26:06 (+0000), Patrick Bellasi wrote:
> On 22-Jan 13:29, Quentin Perret wrote:
> > On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > > On 22-Jan 12:13, Quentin Perret wrote:
> > > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > > The Energy Aware Scheduler (EAS) estimates the energy impact of waking
> > >
> > > [...]
> > >
> > > > > + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > > > + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > > > +
> > > > > + /*
> > > > > + * Busy time computation: utilization clamping is not
> > > > > + * required since the ratio (sum_util / cpu_capacity)
> > > > > + * is already enough to scale the EM reported power
> > > > > + * consumption at the (eventually clamped) cpu_capacity.
> > > > > + */
> > > >
> > > > Right.
> > > >
> > > > > + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > > > + ENERGY_UTIL, NULL);
> > > > > +
> > > > > + /*
> > > > > + * Performance domain frequency: utilization clamping
> > > > > + * must be considered since it affects the selection
> > > > > + * of the performance domain frequency.
> > > > > + */
> > > >
> > > > So that actually affects the way we deal with RT I think. I assume the
> > > > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > > > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > > > Is that correct ?
> > >
> > > With the default configuration, RT tasks still go to max when uclamp is
> > > enabled, since they get a util_min=1024.
> > >
> > > If we want to save power on RT tasks, we can set a smaller util_min...
> > > but not necessarily 0. A util_min=0 for RT tasks means to use just
> > > cpu_util_rt() for that class.
> >
> > Ah, sorry, I guess my message was misleading. I'm saying this is
> > changing the way _EAS_ deals with RT tasks. Right now we don't actually
> > consider the RT-go-to-max thing at all in the EAS prediction. Your
> > patch is changing that AFAICT. It actually changes the way EAS sees RT
> > tasks even without uclamp ...
>
> Lemme see if I get it right.
>
> Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
> always use cpu_util_rt() for RT tasks:
>
> ---8<---
> util = util_cfs;
> util += cpu_util_rt(rq);
> util += dl_util;
> ---8<---
>
> Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
> running at the max capacity but just whatever is the aggregated
> utilization across all the classes.
>
> With uclamp, we have:
>
> ---8<---
> util = cpu_util_rt(rq) + util_cfs;
> if (type == FREQUENCY_UTIL)
> util = uclamp_util_with(rq, util, p);
> dl_util = cpu_util_dl(rq);
> if (type == ENERGY_UTIL)
> util += dl_util;
> ---8<---
>
> So, I would say that, in terms of ENERGY_UTIL, we do the same both
> w/ and w/o uclamp, don't we?

Yes but now you use FREQUENCY_UTIL for computing 'max_util' in the EAS
prediction.

Let's take an example. You have a perf domain with two CPUs. One CPU is
busy running a RT task, the other CPU runs a CFS task. Right now in
compute_energy() we only use ENERGY_UTIL, so 'max_util' ends up being
the max between the utilization of the two tasks. So we don't predict
we're going to max freq.

With your patch, we use FREQUENCY_UTIL to compute 'max_util', so we
_will_ predict that we're going to max freq. And we will do that even if
CONFIG_UCLAMP_TASK=n.

The default EAS calculation will be different with your patch when there
are runnable RT tasks in the system. This needs to be documented, I
think.
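Quentin's two-CPU example can be sketched numerically. The helpers below are hypothetical, simplified stand-ins for the aggregation in compute_energy() (not the actual schedutil_cpu_util() code); they only show how switching 'max_util' from ENERGY_UTIL to FREQUENCY_UTIL changes the predicted frequency when an RT task with the default uclamp_min = 1024 is runnable:

```c
#define SCHED_CAPACITY_SCALE 1024

enum util_type { ENERGY_UTIL, FREQUENCY_UTIL };

/* Hypothetical stand-in for schedutil_cpu_util(), reduced to the part
 * that matters here: for FREQUENCY_UTIL the utilization is clamped up
 * to uclamp_min (1024 by default for RT); ENERGY_UTIL is used as-is. */
static unsigned long cpu_util(unsigned long util, unsigned long uclamp_min,
			      enum util_type type)
{
	if (type == FREQUENCY_UTIL && util < uclamp_min)
		util = uclamp_min;
	return util;
}

/* max_util over the two CPUs of the example perf domain: one CPU runs
 * an RT task (util rt, uclamp_min 1024), the other a CFS task (util
 * cfs). With type == ENERGY_UTIL (current behaviour) this is simply
 * max(rt, cfs); with type == FREQUENCY_UTIL (the patch) the RT side
 * is boosted to 1024, so EAS predicts the max frequency. */
static unsigned long pd_max_util(unsigned long rt, unsigned long cfs,
				 enum util_type type)
{
	unsigned long u0 = cpu_util(rt, SCHED_CAPACITY_SCALE, type);
	unsigned long u1 = cpu_util(cfs, 0, type);

	return u0 > u1 ? u0 : u1;
}
```

For rt = 300 and cfs = 400, the old path yields max_util = 400 while the new path yields 1024, which is exactly the behavioural change being discussed.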

> > But I'm not hostile to the idea since it's possible to enable uclamp and
> > set the min cap to 0 for RT if you want. So there is a story there.
> > However, I think this needs to be documented somewhere, at the very least.
>
> The only difference I see is that the actual frequency could be
> different (lower than max) when a clamped RT task is RUNNABLE.
>
> Are you worried that running RT on a lower freq could have side
> effects on the estimated busy time of the CPU?
>
> I also still don't completely get why you say it could be useful to
> "set the min cap to 0 for RT if you want"

I'm not saying it's useful, I'm saying userspace can decide to do that
if it thinks it is a good idea. The default should be min_cap = 1024 for
RT, no questions. But you _can_ change it at runtime if you want to.
That's my point. And doing that basically provides the same behaviour as
what we have right now in terms of EAS calculation (but it changes the
freq selection obviously) which is why I'm not fundamentally opposed to
your patch.

So in short, I'm fine with the behavioural change, but please at least
mention it somewhere :-)

Thanks,
Quentin

2019-01-22 14:45:33

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On 22-Jan 14:56, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:04AM +0000, Patrick Bellasi wrote:
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 84294925d006..c8f391d1cdc5 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -625,6 +625,11 @@ struct uclamp_se {
> > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > unsigned int mapped : 1;
> > unsigned int active : 1;
> > + /* Clamp bucket and value actually used by a RUNNABLE task */
> > + struct {
> > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > + } effective;
>
> I am confuzled by this thing.. so uclamp_se already has a value,bucket,
> which per the prior code is the effective one.
>
> Now; I think I see why you want another value; you need the second to
> store the original value for when the system limits change and we must
> re-evaluate.

Yes, that's one reason, the other one being to properly support
CGroups when we add them in the following patches.

Effective will always track the value/bucket in which the task has
been refcounted at enqueue time and it depends on the aggregated
value.

> So why are you not adding something like:
>
> unsigned int orig_value : bits_per(SCHED_CAPACITY_SCALE);

I would say that can be enough if we decide to ditch the mapping and
use a linear one. In that case the value will always be enough to find
the bucket in which a task has been accounted.

> > +unsigned int sysctl_sched_uclamp_util_min;
>
> > +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>
> > +static inline void
> > +uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > + unsigned int *clamp_value, unsigned int *bucket_id)
> > +{
> > + /* Task specific clamp value */
> > + *clamp_value = p->uclamp[clamp_id].value;
> > + *bucket_id = p->uclamp[clamp_id].bucket_id;
> > +
> > + /* System default restriction */
> > + if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > + *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > + /* Keep it simple: unconditionally enforce system defaults */
> > + *clamp_value = uclamp_default[clamp_id].value;
> > + *bucket_id = uclamp_default[clamp_id].bucket_id;
> > + }
> > +}
>
> That would then turn into something like:
>
> unsigned int high = READ_ONCE(sysctl_sched_uclamp_util_max);
> unsigned int low = READ_ONCE(sysctl_sched_uclamp_util_min);
>
> uclamp_se->orig_value = value;
> uclamp_se->value = clamp(value, low, high);
>
> And then determine bucket_id based on value.

Right... if I ditch the mapping that should work.
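For reference, a minimal sketch of what a linear (mapping-free) bucketization could look like. The bucket count and the rounding scheme here are assumptions for illustration, not the series' actual code:

```c
#define SCHED_CAPACITY_SCALE	1024
#define UCLAMP_BUCKETS		5	/* assumed; the series allows [5..20] */

/* With a fixed bucket size, the clamp value alone identifies its
 * bucket, so no se_count-based mapping is needed. The "+ 1" keeps
 * value == SCHED_CAPACITY_SCALE inside the last bucket. */
static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / (SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS + 1);
}

/* Peter's suggestion: keep the original value aside and store the one
 * clamped into the system-default range. */
static unsigned int uclamp_effective_value(unsigned int value,
					   unsigned int sys_min,
					   unsigned int sys_max)
{
	if (value < sys_min)
		return sys_min;
	if (value > sys_max)
		return sys_max;
	return value;
}
```

With this mapping, determining the bucket after clamping is just uclamp_bucket_id(uclamp_effective_value(value, low, high)).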


> > +int sched_uclamp_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *lenp,
> > + loff_t *ppos)
> > +{
> > + int old_min, old_max;
> > + int result = 0;
> > +
> > + mutex_lock(&uclamp_mutex);
> > +
> > + old_min = sysctl_sched_uclamp_util_min;
> > + old_max = sysctl_sched_uclamp_util_max;
> > +
> > + result = proc_dointvec(table, write, buffer, lenp, ppos);
> > + if (result)
> > + goto undo;
> > + if (!write)
> > + goto done;
> > +
> > + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> > + sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> > + result = -EINVAL;
> > + goto undo;
> > + }
> > +
> > + if (old_min != sysctl_sched_uclamp_util_min) {
> > + uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MIN],
> > + UCLAMP_MIN, sysctl_sched_uclamp_util_min);
> > + }
> > + if (old_max != sysctl_sched_uclamp_util_max) {
> > + uclamp_bucket_inc(NULL, &uclamp_default[UCLAMP_MAX],
> > + UCLAMP_MAX, sysctl_sched_uclamp_util_max);
> > + }
>
> Should you not update all tasks?

That's true, but it's also an expensive operation; that's why, for now,
I'm only doing lazy updates at the next enqueue time.

Do you think that could be acceptable?

Perhaps I can sanity check all the CPUs to ensure that they all have a
current clamp value within the newly enforced range. This kind-of
anticipates the idea of an in-kernel API which has higher priority
and allows setting clamp values across all the CPUs...

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 14:59:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Tue, Jan 22, 2019 at 02:01:15PM +0000, Patrick Bellasi wrote:
> On 22-Jan 14:28, Peter Zijlstra wrote:
> > On Tue, Jan 22, 2019 at 10:43:05AM +0000, Patrick Bellasi wrote:
> > > On 22-Jan 10:37, Peter Zijlstra wrote:
> >
> > > > Sure, I get that. What I don't get is why you're adding that (2) here.
> > > > Like said, __sched_setscheduler() already does a dequeue/enqueue under
> > > > rq->lock, which should already take care of that.
> > >
> > > Oh, ok... got it what you mean now.
> > >
> > > With:
> > >
> > > [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy
> > > <[email protected]>
> > >
> > > we can call __sched_setscheduler() with:
> > >
> > > attr->sched_flags & SCHED_FLAG_KEEP_POLICY
> > >
> > > whenever we want just to change the clamp values of a task without
> > > changing its class. Thus, we can end up returning from
> > > __sched_setscheduler() without doing an actual dequeue/enqueue.
> >
> > I don't see that happening.. when KEEP_POLICY we set attr.sched_policy =
> > SETPARAM_POLICY. That is then checked again in __setscheduler_param(),
> > which is in the middle of that dequeue/enqueue.
>
> Yes, I think I've forgotten a check before we actually dequeue the task.
>
> The current code does:
>
> ---8<---
> SYSCALL_DEFINE3(sched_setattr)
>
> // A) request to keep the same policy
> if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> attr.sched_policy = SETPARAM_POLICY;
>
> sched_setattr()
> // B) actually enforce the same policy
> if (policy < 0)
> policy = oldpolicy = p->policy;
>
> // C) tune the clamp values
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> retval = __setscheduler_uclamp(p, attr);
>
> // D) tune attributes if policy is the same
> if (unlikely(policy == p->policy))
> if (fair_policy(policy) && attr->sched_nice != task_nice(p))
> goto change;
> if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
> goto change;
> if (dl_policy(policy) && dl_param_changed(p, attr))
> goto change;

if (util_changed)
goto change;

?

> return 0;
> change:
>
> // E) dequeue/enqueue task
> ---8<---
>
> So, probably in D) I've missed a check on SCHED_FLAG_KEEP_POLICY to
> enforce a return in that case...
>
> > Also, and this might be 'broken', SETPARAM_POLICY _does_ reset all the
> > other attributes, it only preserves policy, but it will (re)set nice
> > level for example (see that same function).
>
> Mmm... right... my bad! :/
>
> > So maybe we want to introduce another (few?) FLAG_KEEP flag(s) that
> > preserve the other bits; I'm thinking at least KEEP_PARAM and KEEP_UTIL
> > or something.
>
> Yes, I would say we have two options:
>
> 1) SCHED_FLAG_KEEP_POLICY preserves all the scheduling class specific
> attributes, but not cross class attributes (e.g. uclamp)
>
> 2) add SCHED_KEEP_NICE, SCHED_KEEP_PRIO, and SCHED_KEEP_PARAMS
> and use them in the if conditions in D)

So the current KEEP_POLICY basically provides sched_setparam(), and
given we have that as a syscall, that is supposedly a useful
functionality.

Also, NICE/PRIO/DL* is all the same thing and depends on the policy,
KEEP_PARAM should cover the lot

And I suppose the UTIL_CLAMP is !KEEP_UTIL; we could go either way
around with that flag.

> In both cases the goal should be to return from code block D).

I don't think so; we really do want to 'goto change' for util changes
too I think. Why duplicate part of that logic?
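A sketch of what the decision in block D) could look like with Peter's extra check folded in. The struct and helper here are illustrative stand-ins, not the actual __sched_setscheduler() code:

```c
#include <stdbool.h>

/* Illustrative stand-in for the task attributes compared in block D). */
struct attrs {
	int nice;
	int rt_priority;
	unsigned int util_min;
	unsigned int util_max;
};

/* Returns true when __sched_setscheduler() should take the
 * 'goto change' path, i.e. dequeue/enqueue the task. The point under
 * discussion: a uclamp change alone should also requeue, so the rq
 * refcounts move to the new clamp bucket. */
static bool needs_requeue(const struct attrs *cur, const struct attrs *req)
{
	if (req->nice != cur->nice)
		return true;
	if (req->rt_priority != cur->rt_priority)
		return true;
	/* the extra 'if (util_changed) goto change;' check */
	if (req->util_min != cur->util_min || req->util_max != cur->util_max)
		return true;
	return false;
}
```

This keeps a single requeue path rather than duplicating the enqueue-side clamp accounting for the util-only case.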

2019-01-22 15:03:24

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On 22-Jan 14:39, Quentin Perret wrote:
> On Tuesday 22 Jan 2019 at 14:26:06 (+0000), Patrick Bellasi wrote:
> > On 22-Jan 13:29, Quentin Perret wrote:
> > > On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > > > On 22-Jan 12:13, Quentin Perret wrote:
> > > > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > > > The Energy Aware Scheduler (EAS) estimates the energy impact of waking

[...]

> > > Ah, sorry, I guess my message was misleading. I'm saying this is
> > > changing the way _EAS_ deals with RT tasks. Right now we don't actually
> > > consider the RT-go-to-max thing at all in the EAS prediction. Your
> > > patch is changing that AFAICT. It actually changes the way EAS sees RT
> > > tasks even without uclamp ...
> >
> > Lemme see if I get it right.
> >
> > Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
> > always use cpu_util_rt() for RT tasks:
> >
> > ---8<---
> > util = util_cfs;
> > util += cpu_util_rt(rq);
> > util += dl_util;
> > ---8<---
> >
> > Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
> > running at the max capacity but just whatever is the aggregated
> > utilization across all the classes.
> >
> > With uclamp, we have:
> >
> > ---8<---
> > util = cpu_util_rt(rq) + util_cfs;
> > if (type == FREQUENCY_UTIL)
> > util = uclamp_util_with(rq, util, p);
> > dl_util = cpu_util_dl(rq);
> > if (type == ENERGY_UTIL)
> > util += dl_util;
> > ---8<---
> >
> > So, I would say that, in terms of ENERGY_UTIL, we do the same both
> > w/ and w/o uclamp, don't we?
>
> Yes but now you use FREQUENCY_UTIL for computing 'max_util' in the EAS
> prediction.

Right, I overlooked that "little" detail... :/

> Let's take an example. You have a perf domain with two CPUs. One CPU is
> busy running a RT task, the other CPU runs a CFS task. Right now in
> compute_energy() we only use ENERGY_UTIL, so 'max_util' ends up being
> the max between the utilization of the two tasks. So we don't predict
> we're going to max freq.

+1

> With your patch, we use FREQUENCY_UTIL to compute 'max_util', so we
> _will_ predict that we're going to max freq.

Right, with the default conf yes.

> And we will do that even if CONFIG_UCLAMP_TASK=n.

This should not happen; as I wrote in the RT integration patch, it's
happening because I'm missing some compilation guard or similar. In
that configuration we should always go to max... will look into that.

> The default EAS calculation will be different with your patch when there
> are runnable RT tasks in the system. This needs to be documented, I
> think.

Sure...

> > > But I'm not hostile to the idea since it's possible to enable uclamp and
> > > set the min cap to 0 for RT if you want. So there is a story there.
> > > However, I think this needs to be documented somewhere, at the very least.
> >
> > The only difference I see is that the actual frequency could be
> > different (lower than max) when a clamped RT task is RUNNABLE.
> >
> > Are you worried that running RT on a lower freq could have side
> > effects on the estimated busy time of the CPU?
> >
> > I also still don't completely get why you say it could be useful to
> > "set the min cap to 0 for RT if you want"
>
> I'm not saying it's useful, I'm saying userspace can decide to do that
> if it thinks it is a good idea. The default should be min_cap = 1024 for
> RT, no questions. But you _can_ change it at runtime if you want to.
> That's my point. And doing that basically provides the same behaviour as
> what we have right now in terms of EAS calculation (but it changes the
> freq selection obviously) which is why I'm not fundamentally opposed to
> your patch.

Well, I think it's tricky to say whether the current or new approach
is better... it probably depends on the use-case.

> So in short, I'm fine with the behavioural change, but please at least
> mention it somewhere :-)

Anyway... agree, it's just that to add some documentation I need to
understand what you are pointing out ;)

Will come up with some additional text to be added to the changelog

Maybe we can add a more detailed explanation of the different
behaviors you can get to the EAS documentation which is coming to
mainline?

> Thanks,
> Quentin

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 15:14:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On Tue, Jan 22, 2019 at 02:43:29PM +0000, Patrick Bellasi wrote:
> On 22-Jan 14:56, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:04AM +0000, Patrick Bellasi wrote:
> >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 84294925d006..c8f391d1cdc5 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -625,6 +625,11 @@ struct uclamp_se {
> > > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > > unsigned int mapped : 1;
> > > unsigned int active : 1;
> > > + /* Clamp bucket and value actually used by a RUNNABLE task */
> > > + struct {
> > > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > > + } effective;
> >
> > I am confuzled by this thing.. so uclamp_se already has a value,bucket,
> > which per the prior code is the effective one.
> >
> > Now; I think I see why you want another value; you need the second to
> > store the original value for when the system limits change and we must
> > re-evaluate.
>
> Yes, that's one reason, the other one being to properly support
> CGroup when we add them in the following patches.
>
> Effective will always track the value/bucket in which the task has
> been refcounted at enqueue time and it depends on the aggregated
> value.

> > Should you not update all tasks?
>
> That's true, but that's also an expensive operation, that's why now
> I'm doing only lazy updates at next enqueue time.

Aaah, so you refcount on the original value, which allows you to skip
fixing up all tasks. I missed that bit.


> Do you think that could be acceptable?

Think so, it's a sysctl poke, 'nobody' ever does that.

2019-01-22 15:24:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -218,8 +218,15 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> * CFS tasks and we use the same metric to track the effective
> * utilization (PELT windows are synchronized) we can directly add them
> * to obtain the CPU's actual utilization.
> + *
> + * CFS utilization can be boosted or capped, depending on utilization
> + * clamp constraints requested by currently RUNNABLE tasks.
> + * When there are no CFS RUNNABLE tasks, clamps are released and
> + * frequency will be gracefully reduced with the utilization decay.
> */
> - util = util_cfs;
> + util = (type == ENERGY_UTIL)
> + ? util_cfs
> + : uclamp_util(rq, util_cfs);

That's pretty horrible; what's wrong with:

util = util_cfs;
if (type == FREQUENCY_UTIL)
util = uclamp_util(rq, util);

That should generate the same code, but is (IMO) far easier to read.

> util += cpu_util_rt(rq);
>
> dl_util = cpu_util_dl(rq);

2019-01-22 15:34:50

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 22-Jan 15:57, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 02:01:15PM +0000, Patrick Bellasi wrote:
> > On 22-Jan 14:28, Peter Zijlstra wrote:
> > > On Tue, Jan 22, 2019 at 10:43:05AM +0000, Patrick Bellasi wrote:
> > > > On 22-Jan 10:37, Peter Zijlstra wrote:
> > >
> > > > > Sure, I get that. What I don't get is why you're adding that (2) here.
> > > > > Like said, __sched_setscheduler() already does a dequeue/enqueue under
> > > > > rq->lock, which should already take care of that.
> > > >
> > > > Oh, ok... got it what you mean now.
> > > >
> > > > With:
> > > >
> > > > [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy
> > > > <[email protected]>
> > > >
> > > > we can call __sched_setscheduler() with:
> > > >
> > > > attr->sched_flags & SCHED_FLAG_KEEP_POLICY
> > > >
> > > > whenever we want just to change the clamp values of a task without
> > > > changing its class. Thus, we can end up returning from
> > > > __sched_setscheduler() without doing an actual dequeue/enqueue.
> > >
> > > I don't see that happening.. when KEEP_POLICY we set attr.sched_policy =
> > > SETPARAM_POLICY. That is then checked again in __setscheduler_param(),
> > > which is in the middle of that dequeue/enqueue.
> >
> > Yes, I think I've forgotten a check before we actually dequeue the task.
> >
> > The current code does:
> >
> > ---8<---
> > SYSCALL_DEFINE3(sched_setattr)
> >
> > // A) request to keep the same policy
> > if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> > attr.sched_policy = SETPARAM_POLICY;
> >
> > sched_setattr()
> > // B) actually enforce the same policy
> > if (policy < 0)
> > policy = oldpolicy = p->policy;
> >
> > // C) tune the clamp values
> > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> > retval = __setscheduler_uclamp(p, attr);
> >
> > // D) tune attributes if policy is the same
> > if (unlikely(policy == p->policy))
> > if (fair_policy(policy) && attr->sched_nice != task_nice(p))
> > goto change;
> > if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
> > goto change;
> > if (dl_policy(policy) && dl_param_changed(p, attr))
> > goto change;
>
> if (util_changed)
> goto change;
>
> ?
>
> > return 0;
> > change:
> >
> > // E) dequeue/enqueue task
> > ---8<---
> >
> > So, probably in D) I've missed a check on SCHED_FLAG_KEEP_POLICY to
> > enforce a return in that case...
> >
> > > Also, and this might be 'broken', SETPARAM_POLICY _does_ reset all the
> > > other attributes, it only preserves policy, but it will (re)set nice
> > > level for example (see that same function).
> >
> > Mmm... right... my bad! :/
> >
> > > So maybe we want to introduce another (few?) FLAG_KEEP flag(s) that
> > > preserve the other bits; I'm thinking at least KEEP_PARAM and KEEP_UTIL
> > > or something.
> >
> > Yes, I would say we have two options:
> >
> > 1) SCHED_FLAG_KEEP_POLICY preserves all the scheduling class specific
> > attributes, but not cross class attributes (e.g. uclamp)
> >
> > 2) add SCHED_KEEP_NICE, SCHED_KEEP_PRIO, and SCHED_KEEP_PARAMS
> > and use them in the if conditions in D)
>
> So the current KEEP_POLICY basically provides sched_setparam(), and

But it's not exposed to user-space.

> given we have that as a syscall, that is supposedly a useful
> functionality.

For uclamp it's definitely useful to change clamps without the need to
read the current policy params beforehand and use them in a following
set syscall... which is a racy pattern.

> Also, NICE/PRIO/DL* is all the same thing and depends on the policy,
> KEEP_PARAM should cover the lot

Right, that makes sense.

> And I suppose the UTIL_CLAMP is !KEEP_UTIL; we could go either way
> around with that flag.

What about getting rid of the racy case above by exposing to userspace
only the new UTIL_CLAMP flag and, on:

sched_setscheduler(flags: UTIL_CLAMP)

we enforce the other two flags from the syscall:

---8<---
SYSCALL_DEFINE3(sched_setattr)
if (attr.sched_flags & SCHED_FLAG_UTIL_CLAMP) {
attr.sched_policy = SETPARAM_POLICY;
attr.sched_flags |= (KEEP_POLICY|KEEP_PARAMS);
}
---8<---

This will not make it possible to change class and set clamps in one go,
but honestly that's likely a very limited use-case, isn't it?

> > In both cases the goal should be to return from code block D).
>
> I don't think so; we really do want to 'goto change' for util changes
> too I think. Why duplicate part of that logic?

But that will force a dequeue/enqueue... isn't that too much overhead
just to change a clamp value? Perhaps we can also end up with some
weird side-effects, like the task being preempted?

Consider also that the uclamp_task_update_active() added by this patch
not only has lower overhead but will also be used by cgroups, where
we want to force an update of all the tasks on a cgroup's clamp change.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 15:44:24

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On 22-Jan 16:13, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 02:43:29PM +0000, Patrick Bellasi wrote:
> > On 22-Jan 14:56, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:04AM +0000, Patrick Bellasi wrote:
> > >
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index 84294925d006..c8f391d1cdc5 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -625,6 +625,11 @@ struct uclamp_se {
> > > > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > > > unsigned int mapped : 1;
> > > > unsigned int active : 1;
> > > > + /* Clamp bucket and value actually used by a RUNNABLE task */
> > > > + struct {
> > > > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > > > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > > > + } effective;
> > >
> > > I am confuzled by this thing.. so uclamp_se already has a value,bucket,
> > > which per the prior code is the effective one.
> > >
> > > Now; I think I see why you want another value; you need the second to
> > > store the original value for when the system limits change and we must
> > > re-evaluate.
> >
> > Yes, that's one reason, the other one being to properly support
> > CGroup when we add them in the following patches.
> >
> > Effective will always track the value/bucket in which the task has
> > been refcounted at enqueue time and it depends on the aggregated
> > value.
>
> > > Should you not update all tasks?
> >
> > That's true, but that's also an expensive operation, that's why now
> > I'm doing only lazy updates at next enqueue time.
>
> Aaah, so you refcount on the original value, which allows you to skip
> fixing up all tasks. I missed that bit.

Right, effective is always tracking the bucket we refcounted at
enqueue time.

We can still argue that, the moment we change a clamp, a task should
be updated without waiting for a dequeue/enqueue cycle.

IMO, that could be a limitation only for tasks which never sleep, but
that's a very special case.

Instead, as you'll see, in the cgroup integration we force update all
RUNNABLE tasks. Although that's expensive, since we are in the domain
of the "delegation model" and "container resources control", the cost
is probably more worth paying there than here.

> > Do you think that could be acceptable?
>
> Think so, it's a sysctl poke, 'nobody' ever does that.

Cool, so... I'll keep lazy update for system default.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 15:48:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On 22-Jan 16:21, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -218,8 +218,15 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > * CFS tasks and we use the same metric to track the effective
> > * utilization (PELT windows are synchronized) we can directly add them
> > * to obtain the CPU's actual utilization.
> > + *
> > + * CFS utilization can be boosted or capped, depending on utilization
> > + * clamp constraints requested by currently RUNNABLE tasks.
> > + * When there are no CFS RUNNABLE tasks, clamps are released and
> > + * frequency will be gracefully reduced with the utilization decay.
> > */
> > - util = util_cfs;
> > + util = (type == ENERGY_UTIL)
> > + ? util_cfs
> > + : uclamp_util(rq, util_cfs);
>
> That's pretty horrible; what's wrong with:
>
> util = util_cfs;
> if (type == FREQUENCY_UTIL)
> util = uclamp_util(rq, util);
>
> That should generate the same code, but is (IMO) far easier to read.

Yes, right... and that's also the pattern we end up with in the
following patch on RT integration.

However, as suggested by Rafael, I'll squash these two patches
together and we will get rid of the above for free ;)

> > util += cpu_util_rt(rq);
> >
> > dl_util = cpu_util_dl(rq);

--
#include <best/regards.h>

Patrick Bellasi

2019-01-22 17:16:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:
> @@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> return;
> sg_cpu->iowait_boost_pending = true;
>
> + /*
> + * Boost FAIR tasks only up to the CPU clamped utilization.
> + *
> + * Since DL tasks have a much more advanced bandwidth control, it's
> + * safe to assume that IO boost does not apply to those tasks.

I'm not buying that argument. IO-boost isn't related to b/w management.

IO-boost is more about compensating for hidden dependencies, and those
don't get less hidden for using a different scheduling class.

Now, arguably DL should not be doing IO in the first place, but that's a
whole different discussion.

> + * Instead, since RT tasks are not utilization clamped, we don't want
> + * to apply clamping on IO boost while there is blocked RT
> + * utilization.
> + */
> + max_boost = sg_cpu->iowait_boost_max;
> + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
> +
> /* Double the boost at each request */
> if (sg_cpu->iowait_boost) {
> sg_cpu->iowait_boost <<= 1;
> - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> + if (sg_cpu->iowait_boost > max_boost)
> + sg_cpu->iowait_boost = max_boost;
> return;
> }

Hurmph... so I'm not sold on this bit.
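For clarity, the doubling-with-clamped-cap logic from the hunk above, reduced to a standalone helper (the surrounding sugov state is elided, and the helper name is ours, not the kernel's):

```c
/* One iowait-boost step, as in the hunk above: double the boost and
 * cap it at max_boost, where max_boost is either iowait_boost_max or,
 * when there is no blocked RT utilization, its uclamp'ed value. */
static unsigned int iowait_boost_step(unsigned int boost,
				      unsigned int max_boost)
{
	boost <<= 1;
	if (boost > max_boost)
		boost = max_boost;
	return boost;
}
```

The behavioural change is only in how max_boost is computed: with a clamped CPU the boost saturates below iowait_boost_max instead of at it.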

2019-01-22 17:30:43

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v6 11/16] sched/fair: Add uclamp support to energy_compute()

On Tuesday 22 Jan 2019 at 15:01:37 (+0000), Patrick Bellasi wrote:
> > I'm not saying it's useful, I'm saying userspace can decide to do that
> > if it thinks it is a good idea. The default should be min_cap = 1024 for
> > RT, no questions. But you _can_ change it at runtime if you want to.
> > That's my point. And doing that basically provides the same behaviour as
> > what we have right now in terms of EAS calculation (but it changes the
> > freq selection obviously) which is why I'm not fundamentally opposed to
> > your patch.
>
> Well, I think it's tricky to say whether the current or new approach
> is better... it probably depends on the use-case.

Agreed.

> > So in short, I'm fine with the behavioural change, but please at least
> > mention it somewhere :-)
>
> Anyway... agree, it's just that to add some documentation I need to
> get what you are pointing out ;)
>
> Will come up with some additional text to be added to the changelog

Sounds good.

> Maybe we can add a more detailed explanation of the different
> behaviors you can get in the EAS documentation which is coming to
> mainline ?

Yeah, if you feel like it, I guess that won't hurt :-)

Thanks,
Quentin

2019-01-22 18:21:21

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On 22-Jan 18:13, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:
> > @@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > return;
> > sg_cpu->iowait_boost_pending = true;
> >
> > + /*
> > + * Boost FAIR tasks only up to the CPU clamped utilization.
> > + *
> > + * Since DL tasks have a much more advanced bandwidth control, it's
> > + * safe to assume that IO boost does not apply to those tasks.
>
> I'm not buying that argument. IO-boost isn't related to b/w management.
>
> IO-boost is more about compensating for hidden dependencies, and those
> don't get less hidden for using a different scheduling class.
>
> Now, arguably DL should not be doing IO in the first place, but that's a
> whole different discussion.

My understanding is that IOBoost is there to help tasks doing many
and _frequent_ IO operations, which are relatively _not so_
computationally intensive on the CPU.

Those tasks generate a small utilization and, without IOBoost, will be
executed at a lower frequency and will add undesired latency in
triggering the next IO operation.

Isn't mainly that the reason for it?

IO operations also have to be _frequent_ since we don't go to max OPP
at the very first wakeup from IO. We double the frequency and get to
max only if we have a stable stream of IO operations.

IMHO, it makes perfect sense to use DL for these kinds of operations,
but I would expect that, since we care about latency, we should come
up with a proper description of the required bandwidth... eventually
accounting for some additional headroom to compensate for "hidden
dependencies"... without relying on a rather dumb policy like
IOBoost to get our DL tasks working.

In the end, DL is now quite good at driving the freq as high as it
needs... and by closing userspace feedback loops it can also
compensate for all sort of fluctuations and noise... as demonstrated
by Alessio during last OSPM:

http://retis.sssup.it/luca/ospm-summit/2018/Downloads/OSPM_deadline_audio.pdf

> > + * Instead, since RT tasks are not utilization clamped, we don't want
> > + * to apply clamping on IO boost while there is blocked RT
> > + * utilization.
> > + */
> > + max_boost = sg_cpu->iowait_boost_max;
> > + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> > + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
> > +
> > /* Double the boost at each request */
> > if (sg_cpu->iowait_boost) {
> > sg_cpu->iowait_boost <<= 1;
> > - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> > - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> > + if (sg_cpu->iowait_boost > max_boost)
> > + sg_cpu->iowait_boost = max_boost;
> > return;
> > }
>
> Hurmph... so I'm not sold on this bit.

If a task is not clamped we execute it at its required utilization or
even at max frequency in case of a wakeup from IO.

When a task is util_max clamped instead, we are saying that we don't
care to run it above the specified clamp value and, if possible, we
should run it below that capacity level.

If that's the case, why should this clamping hint not be enforced on
IO wakeups too?

In the end it's still a user-space decision: we basically allow
userspace to define the max IO boost they'd like to get.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 09:18:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Tue, Jan 22, 2019 at 03:33:15PM +0000, Patrick Bellasi wrote:
> On 22-Jan 15:57, Peter Zijlstra wrote:
> > On Tue, Jan 22, 2019 at 02:01:15PM +0000, Patrick Bellasi wrote:

> > > Yes, I would say we have two options:
> > >
> > > 1) SCHED_FLAG_KEEP_POLICY enforces all the scheduling class specific
> > > attributes, but cross class attributes (e.g. uclamp)
> > >
> > > 2) add SCHED_KEEP_NICE, SCHED_KEEP_PRIO, and SCHED_KEEP_PARAMS
> > > and use them in the if conditions in D)
> >
> > So the current KEEP_POLICY basically provides sched_setparam(), and
>
> But it's not exposed to user-space.

Correct; not until your first patch indeed.

> > given we have that as a syscall, that is supposedly a useful
> > functionality.
>
> For uclamp it's definitely useful to change clamps without the need to
> read the current policy params beforehand and use them in a following
> set syscall... which is a racy pattern.

Right; but my argument was mostly that if sched_setparam() is a useful
interface, a 'pure' KEEP_POLICY would be too and your (1) loses that.

> > And I suppose the UTIL_CLAMP is !KEEP_UTIL; we could go either way
> > around with that flag.
>
> What about getting rid of the racy case above by exposing userspace
> only the new UTIL_CLAMP and, on:
>
> sched_setscheduler(flags: UTIL_CLAMP)
>
> we enforce the other two flags from the syscall:
>
> ---8<---
> SYSCALL_DEFINE3(sched_setattr)
> 	if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY) {
> 		attr.sched_policy = SETPARAM_POLICY;
> 		attr.sched_flags |= (KEEP_POLICY|KEEP_PARAMS);
> 	}
> ---8<---
>
> This will not make it possible to change class and set flags in one go,
> but honestly that's likely a very limited use-case, isn't it?

So I must admit to not seeing much use for sched_setparam() (and its
equivalents) myself, but given it is an existing interface, I also think
it would be nice to cover that functionality in the sched_setattr()
call.

That is; I know of userspace priority-ceiling implementations using
sched_setparam(), but I don't see any reason why that wouldn't also work
with sched_setscheduler() (IOW always also set the policy).

> > > In both cases the goal should be to return from code block D).
> >
> > I don't think so; we really do want to 'goto change' for util changes
> > too I think. Why duplicate part of that logic?
>
> But that will force a dequeue/enqueue... isn't too much overhead just
> to change a clamp value?

These syscalls aren't what I consider fast paths anyway. However, there
are people that rely on the scheduler syscalls not to schedule
themselves, or rather be non-blocking (see for example that prio-ceiling
implementation).

And in that respect the newly introduced uclamp_mutex does appear to be
a problem.

Also; do you expect these clamp values to be changed often?

> Perhaps we can also end up with some wired

s/wired/weird/, right?

> side-effects like the task being preempted ?

Nothing worse than any other random reschedule would cause.

> Consider also that the uclamp_task_update_active() added by this patch
> not only has lower overhead but it will also be used by cgroups where
> we want to force-update all the tasks on a cgroup's clamp change.

I haven't gotten that far; but I would prefer not to have two different
'change' paths in __sched_setscheduler().

2019-01-23 09:23:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On Tue, Jan 22, 2019 at 03:41:29PM +0000, Patrick Bellasi wrote:
> On 22-Jan 16:13, Peter Zijlstra wrote:
> > On Tue, Jan 22, 2019 at 02:43:29PM +0000, Patrick Bellasi wrote:

> > > Do you think that could be acceptable?
> >
> > Think so, it's a sysctl poke, 'nobody' ever does that.
>
> Cool, so... I'll keep lazy update for system default.

Ah, I think I misunderstood. I meant to say that since nobody ever pokes
at sysctls it doesn't matter if it's a little more expensive and iterates
everything.

Also; if you always keep everything up-to-date, you can avoid doing that
duplicate accounting.

2019-01-23 09:54:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On Tue, Jan 22, 2019 at 06:18:31PM +0000, Patrick Bellasi wrote:
> On 22-Jan 18:13, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:
> > > @@ -342,11 +350,24 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > > return;
> > > sg_cpu->iowait_boost_pending = true;
> > >
> > > + /*
> > > + * Boost FAIR tasks only up to the CPU clamped utilization.
> > > + *
> > > + * Since DL tasks have a much more advanced bandwidth control, it's
> > > + * safe to assume that IO boost does not apply to those tasks.
> >
> > I'm not buying that argument. IO-boost isn't related to b/w management.
> >
> > IO-boost is more about compensating for hidden dependencies, and those
> > don't get less hidden for using a different scheduling class.
> >
> > Now, arguably DL should not be doing IO in the first place, but that's a
> > whole different discussion.
>
> My understanding is that IOBoost is there to help tasks doing many
> and _frequent_ IO operations, which are relatively _not so_
> computationally intensive on the CPU.
>
> Those tasks generate a small utilization and, without IOBoost, will be
> executed at a lower frequency and will add undesired latency in
> triggering the next IO operation.
>
> Isn't mainly that the reason for it?

http://lkml.kernel.org/r/[email protected]

Using a lower frequency will allow the IO device to go idle while we try
and get the next request going.

The connection between IO device and task/freq selection is hidden/lost.
We could potentially do better here, but fundamentally a completion
doesn't have an 'owner', there can be multiple waiters etc.

We lose (through our software architecture, and this we could possibly
improve, although it would be fairly invasive) the device busy state,
and it would be the device that raises the CPU frequency (to the point
where request submission is no longer the bottle neck to staying busy).

Currently all we do is mark a task as sleeping on IO and lose any
and all device relations/metrics.

So I don't think the task clamping should affect the IO boosting, as
that is meant to represent the device state, not the task utilization.

> IMHO, it makes perfect sense to use DL for these kinds of operations,
> but I would expect that, since we care about latency, we should come
> up with a proper description of the required bandwidth... eventually
> accounting for some additional headroom to compensate for "hidden
> dependencies"... without relying on a rather dumb policy like
> IOBoost to get our DL tasks working.

Deadline is about determinism; (file/disk) IO is typically the
antithesis of that.

> In the end, DL is now quite good at driving the freq as high as it
> needs... and by closing userspace feedback loops it can also
> compensate for all sort of fluctuations and noise... as demonstrated
> by Alessio during last OSPM:
>
> http://retis.sssup.it/luca/ospm-summit/2018/Downloads/OSPM_deadline_audio.pdf

Audio is special in that it is indeed a deterministic device; also, I
don't think ALSA touches the IO-wait code, that is typically all
filesystem stuff.

> > > + * Instead, since RT tasks are not utilization clamped, we don't want
> > > + * to apply clamping on IO boost while there is blocked RT
> > > + * utilization.
> > > + */
> > > + max_boost = sg_cpu->iowait_boost_max;
> > > + if (!cpu_util_rt(cpu_rq(sg_cpu->cpu)))
> > > + max_boost = uclamp_util(cpu_rq(sg_cpu->cpu), max_boost);
> > > +
> > > /* Double the boost at each request */
> > > if (sg_cpu->iowait_boost) {
> > > sg_cpu->iowait_boost <<= 1;
> > > - if (sg_cpu->iowait_boost > sg_cpu->iowait_boost_max)
> > > - sg_cpu->iowait_boost = sg_cpu->iowait_boost_max;
> > > + if (sg_cpu->iowait_boost > max_boost)
> > > + sg_cpu->iowait_boost = max_boost;
> > > return;
> > > }
> >
> > Hurmph... so I'm not sold on this bit.
>
> If a task is not clamped we execute it at its required utilization or
> even at max frequency in case of a wakeup from IO.
>
> When a task is util_max clamped instead, we are saying that we don't
> care to run it above the specified clamp value and, if possible, we
> should run it below that capacity level.
>
> If that's the case, why should this clamping hint not be enforced on
> IO wakeups too?
>
> In the end it's still a user-space decision: we basically allow
> userspace to define the max IO boost they'd like to get.

Because it is the wrong knob for it.

Ideally we'd extend the IO-wait state to include the device-busy state
at the time of sleep. At the very least double state io_schedule() state
space from 1 to 2 bits, where we not only indicate: yes this is an
IO-sleep, but also can indicate device saturation. When the device is
saturated, we don't need to boost further.

(this binary state will of course cause oscillations where we drop the
freq, drop device saturation, then ramp the freq, regain device
saturation etc..)

However, doing this is going to require fairly massive surgery on our
whole IO stack.

Also; how big of a problem is 'spurious' boosting really? Joel tried to
introduce a boost_max tunable, but the gradual boosting thing was good
enough at the time.

2019-01-23 10:31:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> - util = (type == ENERGY_UTIL)
> - ? util_cfs
> - : uclamp_util(rq, util_cfs);
> - util += cpu_util_rt(rq);
> + util = cpu_util_rt(rq);
> + if (type == FREQUENCY_UTIL) {
> + util += cpu_util_cfs(rq);
> + util = uclamp_util(rq, util);
> + } else {
> + util += util_cfs;
> + }

Or the combined thing:

- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL) {
+ util += cpu_util_cfs(rq);
+ util = uclamp_util(rq, util);
+ } else {
+ util += util_cfs;
+ }


Leaves me confused.

When type == FREQ, util_cfs should already be cpu_util_cfs(), per
sugov_get_util().

So should that not end up like:

util = util_cfs;
util += cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);

instead?



2019-01-23 10:51:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> @@ -858,16 +859,23 @@ static inline void
> uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> unsigned int *clamp_value, unsigned int *bucket_id)
> {
> + struct uclamp_se *default_clamp;
> +
> /* Task specific clamp value */
> *clamp_value = p->uclamp[clamp_id].value;
> *bucket_id = p->uclamp[clamp_id].bucket_id;
>
> + /* RT tasks have different default values */
> + default_clamp = task_has_rt_policy(p)
> + ? uclamp_default_perf
> + : uclamp_default;
> +
> /* System default restriction */
> - if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> - *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> + if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> + *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> /* Keep it simple: unconditionally enforce system defaults */
> - *clamp_value = uclamp_default[clamp_id].value;
> - *bucket_id = uclamp_default[clamp_id].bucket_id;
> + *clamp_value = default_clamp[clamp_id].value;
> + *bucket_id = default_clamp[clamp_id].bucket_id;
> }
> }

So I still don't much like the whole effective thing; but I think you
should use rt_task() instead of task_has_rt_policy().

2019-01-23 13:35:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 10/16] sched/core: Add uclamp_util_with()

On Tue, Jan 15, 2019 at 10:15:07AM +0000, Patrick Bellasi wrote:
> +static __always_inline
> +unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
> + struct task_struct *p)
> {
> unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
>
> + if (p) {
> + min_util = max(min_util, uclamp_effective_value(p, UCLAMP_MIN));
> + max_util = max(max_util, uclamp_effective_value(p, UCLAMP_MAX));
> + }
> +

Like I think you mentioned earlier; this doesn't look right at all.

Should that not be something like:

lo = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
hi = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

min_util = clamp(uclamp_effective(p, UCLAMP_MIN), lo, hi);
max_util = clamp(uclamp_effective(p, UCLAMP_MAX), lo, hi);



2019-01-23 14:15:52

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 23-Jan 10:16, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 03:33:15PM +0000, Patrick Bellasi wrote:
> > On 22-Jan 15:57, Peter Zijlstra wrote:
> > > On Tue, Jan 22, 2019 at 02:01:15PM +0000, Patrick Bellasi wrote:
>
> > > > Yes, I would say we have two options:
> > > >
> > > > 1) SCHED_FLAG_KEEP_POLICY enforces all the scheduling class specific
> > > > attributes, but cross class attributes (e.g. uclamp)
> > > >
> > > > 2) add SCHED_KEEP_NICE, SCHED_KEEP_PRIO, and SCHED_KEEP_PARAMS
> > > > and use them in the if conditions in D)
> > >
> > > So the current KEEP_POLICY basically provides sched_setparam(), and
> >
> > But it's not exposed user-space.
>
> Correct; not until your first patch indeed.
>
> > > given we have that as a syscall, that is supposedly a useful
> > > functionality.
> >
> > For uclamp it's definitely useful to change clamps without the need to
> > read the current policy params beforehand and use them in a following
> > set syscall... which is a racy pattern.
>
> Right; but my argument was mostly that if sched_setparam() is a useful
> interface, a 'pure' KEEP_POLICY would be too and your (1) loses that.

Ok, that's an argument in favour of option (2).

> > > And I suppose the UTIL_CLAMP is !KEEP_UTIL; we could go either way
> > > around with that flag.
> >
> > What about getting rid of the racy case above by exposing userspace
> > only the new UTIL_CLAMP and, on:
> >
> > sched_setscheduler(flags: UTIL_CLAMP)
> >
> > we enforce the other two flags from the syscall:
> >
> > ---8<---
> > SYSCALL_DEFINE3(sched_setattr)
> > 	if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY) {
> > 		attr.sched_policy = SETPARAM_POLICY;
> > 		attr.sched_flags |= (KEEP_POLICY|KEEP_PARAMS);
> > 	}
> > ---8<---
> >
> > This will not make it possible to change class and set flags in one go,
> > but honestly that's likely a very limited use-case, isn't it?
>
> So I must admit to not seeing much use for sched_setparam() (and its
> equivalents) myself, but given it is an existing interface, I also think
> it would be nice to cover that functionality in the sched_setattr()
> call.

Which will make them sort-of equivalent... meaning: both the POSIX
sched_setparam() and the !POSIX sched_setattr() will allow changing
params/attributes without changing the policy.

> That is; I know of userspace priority-ceiling implementations using
> sched_setparam(), but I don't see any reason why that wouldn't also work
> with sched_setscheduler() (IOW always also set the policy).

sched_setscheduler() requires a policy to be explicitly defined;
it's a mandatory parameter and has to be specified.

Unless an RT task could be blocked by a FAIR one and you need
sched_setscheduler() to boost both prio and class (which looks like a
poor RT design to begin with), why would you use sched_setscheduler()
instead of sched_setparam()?

They are both POSIX calls and, AFAIU, sched_setparam() seems to be
designed exactly for those kinds of use cases.

> > > > In both cases the goal should be to return from code block D).
> > >
> > > I don't think so; we really do want to 'goto change' for util changes
> > > too I think. Why duplicate part of that logic?
> >
> > But that will force a dequeue/enqueue... isn't too much overhead just
> > to change a clamp value?
>
> These syscalls aren't what I consider fast paths anyway. However, there
> are people that rely on the scheduler syscalls not to schedule
> themselves, or rather be non-blocking (see for example that prio-ceiling
> implementation).
>
> And in that respect the newly introduced uclamp_mutex does appear to be
> a problem.

Mmm... could be... I'll look into it more closely. It could be that the
mutex is not really required: we don't need to serialize task-specific
clamp changes, and anyway the protected code never sleeps and uses
atomic instructions.

> Also; do you expect these clamp values to be changed often?

Not really, the most common use cases are:
a) a resource manager (e.g. the Android run-time) sets clamps for a
bunch of tasks whenever you switch, for example, from one app to
another... but that will be done via cgroups (i.e. a different path)
b) a task can relax its constraints to save energy (something
conceptually similar to using a deferrable timer)

In both cases I expect a limited call frequency.

> > Perhaps we can also end up with some wired
>
> s/wired/weird/, right?

Right :)

> > side-effects like the task being preempted ?
>
> Nothing worse than any other random reschedule would cause.
>
> > Consider also that the uclamp_task_update_active() added by this patch
> > not only has lower overhead but it will also be used by cgroups where
> > we want to force-update all the tasks on a cgroup's clamp change.
>
> I haven't gotten that far; but I would prefer not to have two different
> 'change' paths in __sched_setscheduler().

Yes, I agree that two paths in __sched_setscheduler() could be
confusing. Still we have to consider that here we are adding
"not class specific" attributes.

What if we keep "not class specific" code completely outside of
__sched_setscheduler() and do something like:

---8<---
int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
{
	int retval;

	retval = __sched_setattr(p, attr);
	if (retval)
		return retval;

	return __sched_setscheduler(p, attr, true, true);
}
EXPORT_SYMBOL_GPL(sched_setattr);
---8<---

where __sched_setattr() will collect all the tunings which do not
require an enqueue/dequeue, so far only the new uclamp settings, while
the rest remains under __sched_setscheduler().

Thoughts ?

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 14:21:21

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On 23-Jan 10:22, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 03:41:29PM +0000, Patrick Bellasi wrote:
> > On 22-Jan 16:13, Peter Zijlstra wrote:
> > > On Tue, Jan 22, 2019 at 02:43:29PM +0000, Patrick Bellasi wrote:
>
> > > > Do you think that could be acceptable?
> > >
> > > Think so, it's a sysctl poke, 'nobody' ever does that.
> >
> > Cool, so... I'll keep lazy update for system default.
>
> Ah, I think I misunderstood. I meant to say that since nobody ever pokes
> at sysctls it doesn't matter if it's a little more expensive and iterates
> everything.

Here I was more worried about the code complexity/overhead... for
something actually not very used/useful.

> Also; if you always keep everything up-to-date, you can avoid doing that
> duplicate accounting.

To update everything we will have to walk all the CPUs and update all
the RUNNABLE tasks currently enqueued, which are either RT or CFS.

That's way more expensive both in code and time than what we do for
cgroups, where at least we have a limited scope since the cgroup
already provides a (usually limited) list of tasks to consider.

Do you think it's really worth having?

Perhaps we can add it in a second step, once we have the core bits in
and we really see a need for a specific use-case.


--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 14:27:48

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 08/16] sched/cpufreq: uclamp: Add utilization clamping for FAIR tasks

On 23-Jan 10:52, Peter Zijlstra wrote:
> On Tue, Jan 22, 2019 at 06:18:31PM +0000, Patrick Bellasi wrote:
> > On 22-Jan 18:13, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:05AM +0000, Patrick Bellasi wrote:

[...]

> > If a task is not clamped we execute it at its required utilization or
> > even at max frequency in case of a wakeup from IO.
> >
> > When a task is util_max clamped instead, we are saying that we don't
> > care to run it above the specified clamp value and, if possible, we
> > should run it below that capacity level.
> >
> > If that's the case, why should this clamping hint not be enforced on
> > IO wakeups too?
> >
> > In the end it's still a user-space decision: we basically allow
> > userspace to define the max IO boost they'd like to get.
>
> Because it is the wrong knob for it.
>
> Ideally we'd extend the IO-wait state to include the device-busy state
> at the time of sleep. At the very least double state io_schedule() state
> space from 1 to 2 bits, where we not only indicate: yes this is an
> IO-sleep, but also can indicate device saturation. When the device is
> saturated, we don't need to boost further.
>
> (this binary state will of course cause oscillations where we drop the
> freq, drop device saturation, then ramp the freq, regain device
> saturation etc..)
>
> However, doing this is going to require fairly massive surgery on our
> whole IO stack.
>
> Also; how big of a problem is 'spurious' boosting really? Joel tried to
> introduce a boost_max tunable, but the gradual boosting thing was good
> enough at the time.

Ok then, I'll drop the clamp on IOBoost... you're right, and moreover
we can always investigate a better solution in the future with a
real use-case in hand.

Cheers.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 14:35:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 23-Jan 11:28, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > - util = (type == ENERGY_UTIL)
> > - ? util_cfs
> > - : uclamp_util(rq, util_cfs);
> > - util += cpu_util_rt(rq);
> > + util = cpu_util_rt(rq);
> > + if (type == FREQUENCY_UTIL) {
> > + util += cpu_util_cfs(rq);
> > + util = uclamp_util(rq, util);
> > + } else {
> > + util += util_cfs;
> > + }
>
> Or the combined thing:
>
> - util = util_cfs;
> - util += cpu_util_rt(rq);
> + util = cpu_util_rt(rq);
> + if (type == FREQUENCY_UTIL) {
> + util += cpu_util_cfs(rq);
> + util = uclamp_util(rq, util);
> + } else {
> + util += util_cfs;
> + }
>
>
> Leaves me confused.
>
> When type == FREQ, util_cfs should already be cpu_util_cfs(), per
> sugov_get_util().
>
> So should that not end up like:
>
> util = util_cfs;
> util += cpu_util_rt(rq);
> + if (type == FREQUENCY_UTIL)
> + util = uclamp_util(rq, util);
>
> instead?

You're right, I got to that code after the patches which integrate
compute_energy(). The chunk above was the version before the EM got
merged, but I missed backporting the change once I rebased on
tip/sched/core with the EM in.

Sorry for the confusion, will fix in v7.

Cheers

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 14:43:30

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 23-Jan 11:49, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > @@ -858,16 +859,23 @@ static inline void
> > uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > unsigned int *clamp_value, unsigned int *bucket_id)
> > {
> > + struct uclamp_se *default_clamp;
> > +
> > /* Task specific clamp value */
> > *clamp_value = p->uclamp[clamp_id].value;
> > *bucket_id = p->uclamp[clamp_id].bucket_id;
> >
> > + /* RT tasks have different default values */
> > + default_clamp = task_has_rt_policy(p)
> > + ? uclamp_default_perf
> > + : uclamp_default;
> > +
> > /* System default restriction */
> > - if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > - *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > + if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > + *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> > /* Keep it simple: unconditionally enforce system defaults */
> > - *clamp_value = uclamp_default[clamp_id].value;
> > - *bucket_id = uclamp_default[clamp_id].bucket_id;
> > + *clamp_value = default_clamp[clamp_id].value;
> > + *bucket_id = default_clamp[clamp_id].bucket_id;
> > }
> > }
>
> So I still don't much like the whole effective thing;

:/

I find back-annotation useful in many cases since we have different
sources for possible clamp values:

1. task specific
2. cgroup defined
3. system defaults
4. system power default

I don't think we can avoid somehow back-annotating on which bucket a
task has been refcounted... it makes dequeue so much easier, helps in
ensuring that the refcounting is consistent, and enables lazy updates.

> but I think you should use rt_task() instead of
> task_has_rt_policy().

Right... will do.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 14:52:51

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 10/16] sched/core: Add uclamp_util_with()

On 23-Jan 14:33, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:07AM +0000, Patrick Bellasi wrote:
> > +static __always_inline
> > +unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
> > + struct task_struct *p)
> > {
> > unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> > unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> >
> > + if (p) {
> > + min_util = max(min_util, uclamp_effective_value(p, UCLAMP_MIN));
> > + max_util = max(max_util, uclamp_effective_value(p, UCLAMP_MAX));
> > + }
> > +
>
> Like I think you mentioned earlier; this doesn't look right at all.

What we want to do here is compute what the clamp values of a CPU
_will_ be if we enqueue *p on it.

The code above starts from the current CPU clamp value and mimics what
uclamp will do in case we move the task there... which is always a max
aggregation.

> Should that not be something like:
>
> lo = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> hi = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
>
> min_util = clamp(uclamp_effective(p, UCLAMP_MIN), lo, hi);
> max_util = clamp(uclamp_effective(p, UCLAMP_MAX), lo, hi);

Here you end up restricting the task's (effective) clamp values by
the CPU clamps... which is different.

Why do you think we should do that?... perhaps I'm missing something.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-23 19:00:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Wed, Jan 23, 2019 at 02:14:26PM +0000, Patrick Bellasi wrote:

> > > Consider also that the uclamp_task_update_active() added by this patch
> > > not only has lower overhead but it will also be used by cgroups where
> > > we want to force-update all the tasks on a cgroup's clamp change.
> >
> > I haven't gotten that far; but I would prefer not to have two different
> > 'change' paths in __sched_setscheduler().
>
> Yes, I agree that two paths in __sched_setscheduler() could be
> confusing. Still we have to consider that here we are adding
> "not class specific" attributes.

But that change thing is not class specific; the whole:


	rq = task_rq_lock(p, &rf);
	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);


	/* @p is in its invariant state; frob its state */


	if (queued)
		enqueue_task(rq, p, queue_flags);
	if (running)
		set_curr_task(rq, p);
	task_rq_unlock(rq, p, &rf);


pattern is all over the place; it is just because C sucks that that
isn't more explicitly shared (do_set_cpus_allowed(), rt_mutex_setprio(),
set_user_nice(), __sched_setscheduler(), sched_setnuma(),
sched_move_task()).

This is _the_ pattern for changing state and is not class specific at
all.


2019-01-23 19:10:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 07/16] sched/core: uclamp: Add system default clamps

On Wed, Jan 23, 2019 at 02:19:24PM +0000, Patrick Bellasi wrote:
> On 23-Jan 10:22, Peter Zijlstra wrote:
> > On Tue, Jan 22, 2019 at 03:41:29PM +0000, Patrick Bellasi wrote:
> > > On 22-Jan 16:13, Peter Zijlstra wrote:
> > > > On Tue, Jan 22, 2019 at 02:43:29PM +0000, Patrick Bellasi wrote:
> >
> > > > > Do you think that could be acceptable?
> > > >
> > > > Think so, it's a sysctl poke, 'nobody' ever does that.
> > >
> > > Cool, so... I'll keep lazy update for system default.
> >
> > Ah, I think I misunderstood. I meant to say that since nobody ever pokes
> at sysctls it doesn't matter if it's a little more expensive to iterate
> everything.
>
> Here I was more worried about the code complexity/overhead... for
> something actually not very used/useful.
>
> > Also; if you always keep everything up-to-date, you can avoid doing that
> > duplicate accounting.
>
> To update everything we will have to walk all the CPUs and update all
> the RUNNABLE tasks currently enqueued, which are either RT or CFS.
>
> That's way more expensive both in code and time than what we do for
> cgroups, where at least we have a limited scope since the cgroup
> already provides a (usually limited) list of tasks to consider.
>
> Do you think it's really worth having?

Dunno; the whole double bucket thing seems a bit weird to me; but maybe
it will all look better without the mapping stuff.

2019-01-23 19:22:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 10/16] sched/core: Add uclamp_util_with()

On Wed, Jan 23, 2019 at 02:51:06PM +0000, Patrick Bellasi wrote:
> On 23-Jan 14:33, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:07AM +0000, Patrick Bellasi wrote:
> > > +static __always_inline
> > > +unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
> > > + struct task_struct *p)
> > > {
> > > unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> > > unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> > >
> > > + if (p) {
> > > + min_util = max(min_util, uclamp_effective_value(p, UCLAMP_MIN));
> > > + max_util = max(max_util, uclamp_effective_value(p, UCLAMP_MAX));
> > > + }
> > > +
> >
> > Like I think you mentioned earlier; this doesn't look right at all.
>
> What we wanna do here is to compute what _will_ be the clamp values of
> a CPU if we enqueue *p on it.
>
> The code above starts from the current CPU clamp value and mimics what
> uclamp will do in case we move the task there... which is always a max
> aggregation.

Ah, then I misunderstood the purpose of this function.

> > Should that not be something like:
> >
> > lo = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
> > hi = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
> >
> > min_util = clamp(uclamp_effective(p, UCLAMP_MIN), lo, hi);
> > max_util = clamp(uclamp_effective(p, UCLAMP_MAX), lo, hi);
>
> Here you end up restricting the task's (effective) clamp values
> based on the CPU clamps... which is different.
>
> Why do you think we should do that?... perhaps I'm missing something.

I was left with the impression from patch 7 that we don't compose clamps
right and thought that was what this code was supposed to do.

I'll have another look at this patch.

2019-01-23 20:13:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> On 23-Jan 11:49, Peter Zijlstra wrote:
> > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > > @@ -858,16 +859,23 @@ static inline void
> > > uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > > unsigned int *clamp_value, unsigned int *bucket_id)
> > > {
> > > + struct uclamp_se *default_clamp;
> > > +
> > > /* Task specific clamp value */
> > > *clamp_value = p->uclamp[clamp_id].value;
> > > *bucket_id = p->uclamp[clamp_id].bucket_id;
> > >
> > > + /* RT tasks have different default values */
> > > + default_clamp = task_has_rt_policy(p)
> > > + ? uclamp_default_perf
> > > + : uclamp_default;
> > > +
> > > /* System default restriction */
> > > - if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > > - *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > > + if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > > + *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> > > /* Keep it simple: unconditionally enforce system defaults */
> > > - *clamp_value = uclamp_default[clamp_id].value;
> > > - *bucket_id = uclamp_default[clamp_id].bucket_id;
> > > + *clamp_value = default_clamp[clamp_id].value;
> > > + *bucket_id = default_clamp[clamp_id].bucket_id;
> > > }
> > > }
> >
> > So I still don't much like the whole effective thing;
>
> :/
>
> I find back-annotation useful in many cases since we have different
> sources for possible clamp values:
>
> 1. task specific
> 2. cgroup defined
> 3. system defaults
> 4. system power default

(I'm not sure I've seen 4 happen yet, what's that?)

Anyway, once you get range composition defined; that should be something
like:

R_p \Compose_g R_g

Where R_p is the range of task-p, and R_g is the range of the g'th
cgroup of p (where you can make an identity between the root cgroup and
the system default).

Now; as per the other email; I think the straight forward composition:

struct range compose(struct range a, struct range b)
{
	return (range){ .min = clamp(a.min, b.min, b.max),
			.max = clamp(a.max, b.min, b.max), };
}

(note that this is non-commutative, so we have to pay attention to
get the order 'right')

Works in this case; unlike the cpu/rq composition where we resort to a
pure max function for non-interference.

> I don't think we can avoid somehow back-annotating on which bucket a
> task has been refcounted... it makes dequeue so much easier, it helps
> in ensuring that the refcounting is consistent and enables lazy updates.

So I'll have to go over the code again, but I'm wondering why you're
changing uclamp_se::bucket_id on a runnable task.

Ideally you keep bucket_id invariant between enqueue and dequeue; then
dequeue knows where we put it.

Now I suppose actually determining bucket_id is 'expensive' (it
certainly is with the whole mapping scheme, but even that integer
division is not nice), so we'd like to precompute the bucket_id.

This then leads to the problem of how to change uclamp_se::value while
runnable (since per the other thread you don't want to always update all
runnable tasks).

So far so right?

I'm thinking that if we have a single bit, say:

struct uclamp_se {
	...
	unsigned int changed : 1;
};

We can update uclamp_se::value and set uclamp_se::changed, and then the
next enqueue will (unlikely) test-and-clear changed and recompute the
bucket_id.

Would that not be simpler?



2019-01-24 11:24:04

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On 23-Jan 19:59, Peter Zijlstra wrote:
> On Wed, Jan 23, 2019 at 02:14:26PM +0000, Patrick Bellasi wrote:
>
> > > > Consider also that the uclamp_task_update_active() added by this patch
> > > not only has lower overhead but it will be used also by cgroups where
> > > > we want to force update all the tasks on a cgroup's clamp change.
> > >
> > > I haven't gotten that far; but I would prefer not to have two different
> > > 'change' paths in __sched_setscheduler().
> >
> > Yes, I agree that two paths in __sched_setscheduler() could be
> > confusing. Still we have to consider that here we are adding
> > "not class specific" attributes.
>
> But that change thing is not class specific; the whole:
>
>
> 	rq = task_rq_lock(p, &rf);
> 	queued = task_on_rq_queued(p);
> 	running = task_current(rq, p);
> 	if (queued)
> 		dequeue_task(rq, p, queue_flags);
> 	if (running)
> 		put_prev_task(rq, p);
>
>
> 	/* @p is in its invariant state; frob its state */
>
>
> 	if (queued)
> 		enqueue_task(rq, p, queue_flags);
> 	if (running)
> 		set_curr_task(rq, p);
> 	task_rq_unlock(rq, p, &rf);
>
>
> pattern is all over the place; it is just because C sucks that that

Yes, understood; I don't want to start a language war :)

> isn't more explicitly shared (do_set_cpus_allowed(), rt_mutex_setprio(),
> set_user_nice(), __sched_setscheduler(), sched_setnuma(),
> sched_move_task()).
>
> This is _the_ pattern for changing state and is not class specific at
> all.

Right, that pattern is not "class specific", true, and I should not have
used that term to begin with.

What I was trying to point out is that all the calls above directly
affect the current scheduling decision and "requires" a
dequeue/enqueue pattern.

When a task-specific uclamp value is changed, instead, a
dequeue/enqueue is not needed. As long as we are doing a lazy update,
that just sounds like unnecessary overhead.

However, there could still be value in keeping code consistent and if
you prefer it this way what should I do?

---8<---
__sched_setscheduler()
...
	if (policy < 0)
		policy = oldpolicy = p->policy;
...
	if (unlikely(policy == p->policy)) {
		...
		if (uclamp_changed()) // Force dequeue/enqueue
			goto change;
	}
change:
...

	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);

	__setscheduler_uclamp();
	__setscheduler(rq, p, attr, pi);

	if (queued)
		enqueue_task(rq, p, queue_flags);
	if (running)
		set_curr_task(rq, p);
...
---8<---

Would something like that be ok with you?

Not sure about side effects on p->prio: for CFS it seems to be reset to
NORMAL in this case :/

--
#include <best/regards.h>

Patrick Bellasi

2019-01-24 12:30:46

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 23-Jan 21:11, Peter Zijlstra wrote:
> On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > > > @@ -858,16 +859,23 @@ static inline void
> > > > uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > > > unsigned int *clamp_value, unsigned int *bucket_id)
> > > > {
> > > > + struct uclamp_se *default_clamp;
> > > > +
> > > > /* Task specific clamp value */
> > > > *clamp_value = p->uclamp[clamp_id].value;
> > > > *bucket_id = p->uclamp[clamp_id].bucket_id;
> > > >
> > > > + /* RT tasks have different default values */
> > > > + default_clamp = task_has_rt_policy(p)
> > > > + ? uclamp_default_perf
> > > > + : uclamp_default;
> > > > +
> > > > /* System default restriction */
> > > > - if (unlikely(*clamp_value < uclamp_default[UCLAMP_MIN].value ||
> > > > - *clamp_value > uclamp_default[UCLAMP_MAX].value)) {
> > > > + if (unlikely(*clamp_value < default_clamp[UCLAMP_MIN].value ||
> > > > + *clamp_value > default_clamp[UCLAMP_MAX].value)) {
> > > > /* Keep it simple: unconditionally enforce system defaults */
> > > > - *clamp_value = uclamp_default[clamp_id].value;
> > > > - *bucket_id = uclamp_default[clamp_id].bucket_id;
> > > > + *clamp_value = default_clamp[clamp_id].value;
> > > > + *bucket_id = default_clamp[clamp_id].bucket_id;
> > > > }
> > > > }
> > >
> > > So I still don't much like the whole effective thing;
> >
> > :/
> >
> > I find back-annotation useful in many cases since we have different
> > sources for possible clamp values:
> >
> > 1. task specific
> > 2. cgroup defined
> > 3. system defaults
> > 4. system power default
>
> (I'm not sure I've seen 4 happen yet, what's that?)

Typo: that's "system s/power/perf/ default", i.e. uclamp_default_perf
introduced by this patch.

> Anyway, once you get range composition defined; that should be something
> like:
>
> R_p \Compose_g R_g
>
> Where R_p is the range of task-p, and R_g is the range of the g'th
> cgroup of p (where you can make an identity between the root cgroup and
> the system default).
>
> Now; as per the other email; I think the straight forward composition:
>
> struct range compose(struct range a, struct range b)
> {
> return (range){.min = clamp(a.min, b.min, b.max),
> .max = clamp(a.max, b.min, b.max), };
> }

This composition is done in uclamp_effective_get() but it's
slightly different, since we want to support a "nice policy" where
tasks can always ask for less than what they have been assigned.

Thus, from an abstract standpoint, if a task is in a cgroup:

task.min <= R_g.min
task.max <= R_g.max

While, for tasks in the root cgroup, the system default applies and we
enforce:

task.min >= R_0.min
task.max <= R_0.max

... where the "nice policy" is not currently supported, but
perhaps we can/should use the same for system defaults too.

> (note that this is non-commutative, so we have to pay attention to
> get the order 'right')
>
> Works in this case; unlike the cpu/rq composition where we resort to a
> pure max function for non-interference.

Right.

> > I don't think we can avoid somehow back-annotating on which bucket a
> > task has been refcounted... it makes dequeue so much easier, it helps
> > in ensuring that the refcounting is consistent and enables lazy updates.
>
> So I'll have to go over the code again, but I'm wondering why you're
> changing uclamp_se::bucket_id on a runnable task.

We change only the "requested" value, not the "effective" one.

> Ideally you keep bucket_id invariant between enqueue and dequeue; then
> dequeue knows where we put it.

Right, that's what we do for the "effective" value.

> Now I suppose actually determining bucket_id is 'expensive' (it
> certainly is with the whole mapping scheme, but even that integer
> division is not nice), so we'd like to precompute the bucket_id.

Yes, although the complexity is mostly in the composition logic
described above, not in the mapping at all. We have "mapping" overheads
only when we change a "requested" value, and that happens from slow paths.

The "effective" value is computed at each enqueue by composing the
precomputed bucket_ids representing the current "requested" task,
cgroup and system default values.

> This then leads to the problem of how to change uclamp_se::value while
> runnable (since per the other thread you don't want to always update all
> runnable tasks).
>
> So far so right?

Yes.

> I'm thinking that if we have a single bit, say:
>
> struct uclamp_se {
> ...
> unsigned int changed : 1;
> };
>
> We can update uclamp_se::value and set uclamp_se::changed, and then the
> next enqueue will (unlikely) test-and-clear changed and recompute the
> bucket_id.

This means we will lazily update the "requested" bucket_id by deferring
its computation to enqueue time. Which saves us a copy of the bucket_id,
i.e. we will have only the "effective" value updated at enqueue time.

But...

> Would that not be simpler?

... although being simpler it does not fully exploit the slow-path,
a syscall which is usually running from a different process context
(system management software).

It also fits better for lazy updates but, in the cgroup case, where we
wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
do the updates from the slow-path.

Will look better into this simplification while working on v7, perhaps
the linear mapping can really help in that too.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-01-24 12:38:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 05/16] sched/core: uclamp: Update CPU's refcount on clamp changes

On Thu, Jan 24, 2019 at 11:21:53AM +0000, Patrick Bellasi wrote:

> When a task-specific uclamp value is changed, instead, a
> dequeue/enqueue is not needed. As long as we are doing a lazy update,
> that just sounds like unnecessary overhead.

When that overhead is shown to be a problem, is when we'll look at that
:-)

> However, there could still be value in keeping code consistent and if
> you prefer it this way what should I do?
>
> ---8<---
> __sched_setscheduler()
> ...
> if (policy < 0)
> policy = oldpolicy = p->policy;
> ...
> if (unlikely(policy == p->policy)) {
> ...
> if (uclamp_changed()) // Force dequeue/enqueue
> goto change;
> }
> change:
> ...
>
> if (queued)
> dequeue_task(rq, p, queue_flags);
> if (running)
> put_prev_task(rq, p);
>
> __setscheduler_uclamp();
> __setscheduler(rq, p, attr, pi);
>
> if (queued)
> enqueue_task(rq, p, queue_flags);
> if (running)
> set_curr_task(rq, p);
> ...
> ---8<---
>
> Would something like that be ok with you?

Yes, that's about what I was expecting.

> Not sure about side-effects on p->prio(): for CFS seems to be reset to
> NORMAL in this case :/

That's what we need KEEP_PARAM for, right?

2019-01-24 12:39:35

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 24-Jan 12:30, Patrick Bellasi wrote:
> On 23-Jan 21:11, Peter Zijlstra wrote:
> > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:

[...]

> > I'm thinking that if we have a single bit, say:
> >
> > struct uclamp_se {
> > ...
> > unsigned int changed : 1;
> > };
> >
> > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > next enqueue will (unlikely) test-and-clear changed and recompute the
> > bucket_id.
>
> This means we will lazily update the "requested" bucket_id by deferring its
> computation to enqueue time. Which saves us a copy of the bucket_id,
> i.e. we will have only the "effective" value updated at enqueue time.
>
> But...
>
> > Would that not be simpler?
>
> ... although being simpler it does not fully exploit the slow-path,
> a syscall which is usually running from a different process context
> (system management software).
>
> It also fits better for lazy updates but, in the cgroup case, where we
> wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> do the updates from the slow-path.
>
> Will look better into this simplification while working on v7, perhaps
> the linear mapping can really help in that too.

Actually, I forgot to mention that:

uclamp_se::effective::{
value, bucket_id
}

will still be required to properly support the cgroup delegation model,
where a child group could be restricted by the parent but we want to
keep track of the original "requested" value for when the parent
should relax the restriction.

Thus, since effective values are already there, why not use them
also to pre-compute the new requested bucket_id from the slow path?

--
#include <best/regards.h>

Patrick Bellasi

2019-01-24 15:12:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Thu, Jan 24, 2019 at 12:38:35PM +0000, Patrick Bellasi wrote:
> On 24-Jan 12:30, Patrick Bellasi wrote:
> > On 23-Jan 21:11, Peter Zijlstra wrote:
> > > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
>
> [...]
>
> > > I'm thinking that if we have a single bit, say:
> > >
> > > struct uclamp_se {
> > > ...
> > > unsigned int changed : 1;
> > > };
> > >
> > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > bucket_id.
> >
> > This means we will lazily update the "requested" bucket_id by deferring its
> > computation to enqueue time. Which saves us a copy of the bucket_id,
> > i.e. we will have only the "effective" value updated at enqueue time.
> >
> > But...
> >
> > > Would that not be simpler?
> >
> > ... although being simpler it does not fully exploit the slow-path,
> > a syscall which is usually running from a different process context
> > (system management software).
> >
> > It also fits better for lazy updates but, in the cgroup case, where we
> > wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> > do the updates from the slow-path.
> >
> > Will look better into this simplification while working on v7, perhaps
> > the linear mapping can really help in that too.
>
> Actually, I forgot to mention that:
>
> uclamp_se::effective::{
> value, bucket_id
> }
>
> will still be required to properly support the cgroup delegation model,
> where a child group could be restricted by the parent but we want to
> keep track of the original "requested" value for when the parent
> should relax the restriction.
>
> Thus, since effective values are already there, why not use them
> also to pre-compute the new requested bucket_id from the slow path?

Well, we need the orig_value; but I'm still not sure why you need more
bucket_id's. Also, retaining orig_value is already required for the
system limits, there's nothing cgroup-y about this afaict.

2019-01-24 15:17:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> Add uclamp_default_perf, a special set of clamp values to be used
> for tasks requiring maximum performance, i.e. by default all the non
> clamped RT tasks.

Urgh... why though?

2019-01-24 15:33:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> > So I'll have to go over the code again, but I'm wondering why you're
> > changing uclamp_se::bucket_id on a runnable task.
>
> We change only the "requested" value, not the "effective" one.
>
> > Ideally you keep bucket_id invariant between enqueue and dequeue; then
> > dequeue knows where we put it.
>
> Right, that's what we do for the "effective" value.

So the problem I have is that you first introduce uclamp_se::value and
use that all over the code, and then introduce effective and change all
the usage sites.

That seems daft. Why not keep all the code as-is and add orig_value.

> > Now I suppose actually determining bucket_id is 'expensive' (it
> > certainly is with the whole mapping scheme, but even that integer
> > division is not nice), so we'd like to precompute the bucket_id.
>
> Yes, although the complexity is mostly in the composition logic
> described above not on mapping at all. We have "mapping" overheads
> only when we change a "request" value and that's from slow-paths.

It's weird though. Esp. when combined with that mapping logic, because
then you end up with additional maps that are not in fact ever used.

> > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > next enqueue will (unlikely) test-and-clear changed and recompute the
> > bucket_id.
>
> This means we will lazily update the "requested" bucket_id by deferring its
> computation to enqueue time. Which saves us a copy of the bucket_id,
> i.e. we will have only the "effective" value updated at enqueue time.
>
> But...
>
> > Would that not be simpler?
>
> ... although being simpler it does not fully exploit the slow-path,
> a syscall which is usually running from a different process context
> (system management software).
>
> It also fits better for lazy updates but, in the cgroup case, where we
> wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> do the updates from the slow-path.
>
> Will look better into this simplification while working on v7, perhaps
> the linear mapping can really help in that too.

OK. So mostly my complaint is that it seems to do things oddly for
ill-explained reasons.

2019-01-24 15:34:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> On 23-Jan 21:11, Peter Zijlstra wrote:

> > Anyway, once you get range composition defined; that should be something
> > like:
> >
> > R_p \Compose_g R_g
> >
> > Where R_p is the range of task-p, and R_g is the range of the g'th
> > cgroup of p (where you can make an identity between the root cgroup and
> > the system default).
> >
> > Now; as per the other email; I think the straight forward composition:
> >
> > struct range compose(struct range a, struct range b)
> > {
> > return (range){.min = clamp(a.min, b.min, b.max),
> > .max = clamp(a.max, b.min, b.max), };
> > }
>
> This composition is done in uclamp_effective_get() but it's
> slightly different, since we want to support a "nice policy" where
> tasks can always ask for less than what they have been assigned.

Not sure I follow..

> Thus, from an abstract standpoint, if a task is in a cgroup:
>
> task.min <= R_g.min
> task.max <= R_g.max
>
> While, for tasks in the root cgroup, the system default applies and we
> enforce:
>
> task.min >= R_0.min
> task.max <= R_0.max
>
> ... where the "nice policy" is not currently supported, but
> perhaps we can/should use the same for system defaults too.

That seems inconsistent at best.

OK, I'll go have another look. I never recognised that function for
doing that.



2019-01-24 16:00:39

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 24-Jan 16:12, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 12:38:35PM +0000, Patrick Bellasi wrote:
> > On 24-Jan 12:30, Patrick Bellasi wrote:
> > > On 23-Jan 21:11, Peter Zijlstra wrote:
> > > > On Wed, Jan 23, 2019 at 02:40:11PM +0000, Patrick Bellasi wrote:
> > > > > On 23-Jan 11:49, Peter Zijlstra wrote:
> > > > > > On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > > I'm thinking that if we have a single bit, say:
> > > >
> > > > struct uclamp_se {
> > > > ...
> > > > unsigned int changed : 1;
> > > > };
> > > >
> > > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > > bucket_id.
> > >
> > > This means we will lazily update the "requested" bucket_id by deferring its
> > > computation to enqueue time. Which saves us a copy of the bucket_id,
> > > i.e. we will have only the "effective" value updated at enqueue time.
> > >
> > > But...
> > >
> > > > Would that not be simpler?
> > >
> > > ... although being simpler it does not fully exploit the slow-path,
> > > a syscall which is usually running from a different process context
> > > (system management software).
> > >
> > > It also fits better for lazy updates but, in the cgroup case, where we
> > > wanna enforce an update ASAP for RUNNABLE tasks, we will still have to
> > > do the updates from the slow-path.
> > >
> > > Will look better into this simplification while working on v7, perhaps
> > > the linear mapping can really help in that too.
> >
> > Actually, I forgot to mention that:
> >
> > uclamp_se::effective::{
> > value, bucket_id
> > }
> >
> > will still be required to properly support the cgroup delegation model,
> > where a child group could be restricted by the parent but we want to
> > keep track of the original "requested" value for when the parent
> > should relax the restriction.
> >
> > Thus, since effective values are already there, why not use them
> > also to pre-compute the new requested bucket_id from the slow path?
>
> Well, we need the orig_value; but I'm still not sure why you need more
> bucket_id's. Also, retaining orig_value is already required for the
> system limits, there's nothing cgroup-y about this afaict.

Sure, the "effective" values are just a very convenient way (IMHO) to
know exactly which value/bucket_id is currently in use by a task while
keeping them well separated from the "requested" values.

So, you propose to add "orig_value"... but the end effect will be the
same... it's just that, looking at uclamp_se, you have two dual pieces
of information:

A) whatever a task or cgroup "requests" always lives in:

   uclamp_se::value
   uclamp_se::bucket_id

B) whatever a task or cgroup "gets" always lives in:

   uclamp_se::effective::value
   uclamp_se::effective::bucket_id

I find this partitioning useful and easy to use:

1) the slow-path updates only data in A

2) the fast-path updates only data in B,
   by composing A data in uclamp_effective_get() @enqueue time.

--
#include <best/regards.h>

Patrick Bellasi

2019-01-24 16:06:59

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 24-Jan 16:15, Peter Zijlstra wrote:
> On Tue, Jan 15, 2019 at 10:15:06AM +0000, Patrick Bellasi wrote:
> > Add uclamp_default_perf, a special set of clamp values to be used
> > for tasks requiring maximum performance, i.e. by default all the non
> > clamped RT tasks.
>
> Urgh... why though?

Because tasks without a task-specific clamp value assigned are subject
to system default... but those, by default, don't clamp at all:

uclamp_default = [0..1024]

For RT tasks however, we want to always execute them at max OPP, if
not otherwise requested (e.g. via a task-specific value). Thus we need a
different set of default clamps:

uclamp_perf_default = [1024..1024]

--
#include <best/regards.h>

Patrick Bellasi

2019-01-24 16:17:02

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v6 09/16] sched/cpufreq: uclamp: Add utilization clamping for RT tasks

On 24-Jan 16:31, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 12:30:09PM +0000, Patrick Bellasi wrote:
> > > So I'll have to go over the code again, but I'm wondering why you're
> > > changing uclamp_se::bucket_id on a runnable task.
> >
> > We change only the "requested" value, not the "effective" one.
> >
> > > Ideally you keep bucket_id invariant between enqueue and dequeue; then
> > > dequeue knows where we put it.
> >
> > Right, that's what we do for the "effective" value.
>
> So the problem I have is that you first introduce uclamp_se::value and
> use that all over the code, and then introduce effective and change all
> the usage sites.

Right, because the moment we introduce the combination/aggregation mechanism
is the moment the "effective" value makes sense to have. That's when the
code shows that:

a task cannot always get what it "requests"; an "effective" value is
computed by aggregation, and that's the value we now use for actual
clamp enforcement.

> That seems daft. Why not keep all the code as-is and add orig_value.

If you prefer, I can use effective values from the beginning and then
add the "requested" values later... but I fear the patchset will not
be any clearer to parse.


> > > Now I suppose actually determining bucket_id is 'expensive' (it
> > > certainly is with the whole mapping scheme, but even that integer
> > > division is not nice), so we'd like to precompute the bucket_id.
> >
> > Yes, although the complexity is mostly in the composition logic
> > described above not on mapping at all. We have "mapping" overheads
> > only when we change a "request" value and that's from slow-paths.
>
> It's weird though. Esp. when combined with that mapping logic, because
> then you get to use additional maps that are not in fact ever used.

Mmm... don't get this point...

AFAICS "mapping" and "effective" are two different concepts; that's
why I can probably get rid of the first but would prefer to keep the
second.

> > > We can update uclamp_se::value and set uclamp_se::changed, and then the
> > > next enqueue will (unlikely) test-and-clear changed and recompute the
> > > bucket_id.
> >
> > This means we will lazily update the "requested" bucket_id by deferring
> > its computation to enqueue time. Which saves us a copy of the bucket_id,
> > i.e. we will have only the "effective" value updated at enqueue time.
> >
> > But...
> >
> > > Would that not be simpler?
> >
> > ... although simpler, it does not fully exploit the slow path,
> > i.e. a syscall which usually runs from a different process context
> > (system management software).
> >
> > It also fits better with lazy updates but, in the cgroup case, where we
> > want to enforce an update ASAP for RUNNABLE tasks, we will still have to
> > do the updates from the slow path.
> >
> > Will look better into this simplification while working on v7, perhaps
> > the linear mapping can really help in that too.
>
> OK. So mostly my complaint is that it seems to do things oddly for
> ill-explained reasons.

:( I'm really sorry I'm not able to convey the overall design idea.

TBH however, despite being quite a "limited" feature, it has to
accommodate many different viewpoints: task-specific, cgroup and
system defaults.

I've really tried my best to come up with something reasonable but I
understand that, looking at the individual patches, the overall design
can be difficult to grasp... without considering that optimizations
are always possible, of course.

If you prefer, I can respin by just removing the mapping part, and
then we can go back and see if the remaining bits make more sense?
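
For reference, the lazy test-and-clear update discussed above could look
something like this (a sketch with hypothetical names and a simplified
linear mapping; field names, constants and the locking story all differ in
the real patches):

```c
#include <stdbool.h>

#define UCLAMP_BUCKETS		5
#define SCHED_CAPACITY_SCALE	1024

struct uclamp_se_sketch {
	unsigned int value;	/* "requested" clamp value, 0..1024 */
	unsigned int bucket_id;	/* precomputed bucket for fast-path use */
	bool changed;		/* set when value is updated from the slow path */
};

/* map a clamp value to its bucket (simplified linear mapping) */
static unsigned int uclamp_bucket_id_sketch(unsigned int value)
{
	return value * UCLAMP_BUCKETS / (SCHED_CAPACITY_SCALE + 1);
}

/* slow path (e.g. syscall): just record the new value and flag it */
static void uclamp_update_value_sketch(struct uclamp_se_sketch *se,
				       unsigned int value)
{
	se->value = value;
	se->changed = true;
}

/* fast path (enqueue): recompute bucket_id only when it changed */
static void uclamp_enqueue_sketch(struct uclamp_se_sketch *se)
{
	if (se->changed) {	/* would be unlikely() in kernel code */
		se->bucket_id = uclamp_bucket_id_sketch(se->value);
		se->changed = false;
	}
}
```

This keeps the expensive mapping out of the slow path caller and defers it
to the next enqueue, at the cost of the flag test on every fast-path pass.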

--
#include <best/regards.h>

Patrick Bellasi

2019-01-25 13:58:39

by Alessio Balsini

[permalink] [raw]
Subject: Re: [PATCH v6 01/16] sched/core: Allow sched_setattr() to use the current policy

Hello Patrick,

What do you think about the following minor changes:

On Tue, Jan 15, 2019 at 10:14:58AM +0000, Patrick Bellasi wrote:
> /* SCHED_ISO: reserved but not implemented yet */
> #define SCHED_IDLE 5
> #define SCHED_DEADLINE 6
> +/* Must be the last entry: used to sanity check attr.policy values */

I would remove this comment:
- the meaning of SCHED_POLICY_MAX is evident, and
- should we really hint at how the value is used? That comment will
become stale the next time SCHED_POLICY_MAX is used for something else.
This is what should also happen to the comment on SETPARAM_POLICY:
sched_setparam() is no longer the only code path accessing
SETPARAM_POLICY.

> +#define SCHED_POLICY_MAX 7

+#define SCHED_POLICY_MAX SCHED_DEADLINE

This would make it compliant with the definition of MAX.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4560,8 +4560,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> if (retval)
> return retval;
>
> - if ((int)attr.sched_policy < 0)
> + /*
> + * A valid policy is always required from userspace, unless
> + * SCHED_FLAG_KEEP_POLICY is set and the current policy
> + * is enforced for this call.
> + */
> + if (attr.sched_policy >= SCHED_POLICY_MAX &&

+ if (attr.sched_policy > SCHED_POLICY_MAX &&

In line with the previous update.

> + !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
> return -EINVAL;

Thanks,
Alessio