2019-04-02 10:43:07

by Patrick Bellasi

Subject: [PATCH v8 00/16] Add utilization clamping support

Hi all, this is a respin of:

https://lore.kernel.org/lkml/[email protected]/

which includes the following main changes:

- remove "bucket local boosting" code and move it into a dedicated patch
- refactor uclamp_rq_update() to make it cleaner
- s/uclamp_rq_update/uclamp_rq_max_value/ and move update into caller
- update changelog to clarify the configuration fitting in one cache line
- s/uclamp_bucket_value/uclamp_bucket_base_value/
- update UCLAMP_BUCKET_DELTA to use DIV_ROUND_CLOSEST()
- moved flag reset into uclamp_rq_inc()
- add "requested" values uclamp_se instance beside the existing "effective"
values instance
- rename uclamp_effective_{get,assign}() into uclamp_eff_{get,set}()
- make uclamp_eff_get() return the new "effective" values by copy
- run uclamp_fork() code independently from the class being supported
- add sysctl_sched_uclamp_handler()'s internal mutex to serialize concurrent
usages
- make schedutil_type visible on !CONFIG_CPU_FREQ_GOV_SCHEDUTIL
- drop optional renamings
- keep using unsigned long for utilization
- update first cgroup patch's changelog to make it more clear

Thanks for all the valuable comments, almost there... :?

Cheers Patrick


Series Organization
===================

The series is organized into these main sections:

- Patches [01-07]: Per task (primary) API
- Patches [08-09]: Schedutil integration for FAIR and RT tasks
- Patches [10-11]: Integration with EAS's energy_compute()
- Patches [12-16]: Per task group (secondary) API

It is based on today's tip/sched/core and the full tree is available here:

git://linux-arm.org/linux-pb.git lkml/utilclamp_v8
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v8


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, to support advanced
task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [3].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviours of each task as intended by user-space.
As such, they are not ideally suited for task placement decisions.
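As a rough sketch of that policy (illustrative userspace code, not the
kernel's: the actual logic lives in schedutil's get_next_freq(), which
applies a ~1.25x headroom margin):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024U

/*
 * Simplified frequency selection: RT goes straight to the maximum
 * frequency, otherwise scale the max frequency by the utilization,
 * plus a 25% headroom margin (as schedutil does).
 */
static unsigned int freq_for(unsigned int util, int is_rt,
                             unsigned int max_freq)
{
        if (is_rt)
                return max_freq;
        return (max_freq + (max_freq >> 2)) * util / SCHED_CAPACITY_SCALE;
}
```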

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorised user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows the utilization generated by
RT and FAIR tasks to be "clamped" (i.e. filtered) within a range defined by
user-space.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.
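At its core, the mechanism is a plain clamp of the utilization signal. A
minimal sketch (illustrative names, not the kernel API):

```c
#include <assert.h>

/* Clamp a utilization value into a [util_min, util_max] range. */
static unsigned int clamp_util(unsigned int util,
                               unsigned int util_min,
                               unsigned int util_max)
{
        if (util < util_min)
                return util_min;
        if (util > util_max)
                return util_max;
        return util;
}
```

A min clamp thus enforces a frequency floor for boosted tasks, while a max
clamp caps the frequency a task can drive.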

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which affect the
user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). Here, from the task utilization
alone the scheduler does not have a complete view of the task's requirements
and, since the task has a small utilization, it keeps selecting a more energy
efficient CPU, with smaller capacity and lower frequency, thus negatively
impacting the overall time required to complete the task's activations.

- capping: increase energy efficiency for background tasks not affecting the
user experience.

Since running on a lower capacity CPU at a lower frequency is more energy
efficient, when completion time is not a main goal, capping the
utilization considered for certain (maybe big) tasks can have positive
effects, both on energy consumption and thermal headroom.
This feature also allows making RT tasks more energy friendly on mobile
systems where running them on high capacity CPUs and at the maximum
frequency is not required.

From these two use-cases, it's worth noting that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
usage of utilization clamping. Another compelling extension of utilization
clamping is helping the scheduler make task placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.

- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.

This series does not present this additional usage of utilization clamping,
but it's an integral part of the EAS feature set, as presented in [1].

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] "Expressing per-task/per-cgroup performance hints"
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/

[2] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

[3] https://lore.kernel.org/lkml/[email protected]/


Patrick Bellasi (16):
sched/core: uclamp: Add CPU's clamp buckets refcounting
sched/core: Add bucket local max tracking
sched/core: uclamp: Enforce last task's UCLAMP_MAX
sched/core: uclamp: Add system default clamps
sched/core: Allow sched_setattr() to use the current policy
sched/core: uclamp: Extend sched_setattr() to support utilization
clamping
sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
sched/core: uclamp: Set default clamps for RT tasks
sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
sched/core: uclamp: Add uclamp_util_with()
sched/fair: uclamp: Add uclamp support to energy_compute()
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Propagate system defaults to root group
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Documentation/admin-guide/cgroup-v2.rst | 46 ++
include/linux/log2.h | 37 ++
include/linux/sched.h | 58 ++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 16 +-
include/uapi/linux/sched/types.h | 66 +-
init/Kconfig | 75 +++
kernel/sched/core.c | 791 +++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 22 +-
kernel/sched/fair.c | 44 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 123 +++-
kernel/sysctl.c | 16 +
14 files changed, 1270 insertions(+), 45 deletions(-)

--
2.20.1


2019-04-02 10:43:20

by Patrick Bellasi

Subject: [PATCH v8 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

Utilization clamping allows the CPU's utilization to be clamped within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE task enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets changes
for a CPU, a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).
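For instance, the MAX aggregation described above can be sketched as follows
(hypothetical userspace helper, not kernel code):

```c
#include <assert.h>

struct clamp { unsigned int min, max; };

/*
 * MAX-aggregate the clamps of the RUNNABLE tasks on a CPU: the most
 * boosted task wins for util_min and the least capped one (i.e. the
 * one with the highest util_max) wins for util_max.
 */
static struct clamp rq_clamp(const struct clamp *tasks, int n)
{
        struct clamp rq = { 0, 0 };
        int i;

        for (i = 0; i < n; i++) {
                if (tasks[i].min > rq.min)
                        rq.min = tasks[i].min;
                if (tasks[i].max > rq.max)
                        rq.max = tasks[i].max;
        }
        return rq;
}
```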

A task has:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp[clamp_id]::bucket[] array is scanned every time we need to
find a new MAX aggregated clamp value for a clamp_id. This operation is
required only when the last task of a clamp bucket tracking the current MAX
aggregated clamp value is dequeued. In this case, the CPU is either entering
IDLE or going to schedule a less boosted or more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.
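The single cache line claim can be checked against the bitfield layout this
patch introduces, assuming a 64-bit unsigned long and 64-byte cache lines
(the 11-bit width comes from bits_per(SCHED_CAPACITY_SCALE), per the log2.h
helper added below):

```c
#include <assert.h>

/* bits_per(1024) == 11 */
#define UCLAMP_VALUE_BITS 11
#define UCLAMP_BUCKETS 7 /* largest configuration fitting one cache line */

struct uclamp_bucket {
        unsigned long value : UCLAMP_VALUE_BITS;
        unsigned long tasks : 64 - UCLAMP_VALUE_BITS;
};

struct uclamp_rq {
        unsigned int value;
        struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};
```

Each bucket packs into a single 8-byte word, so 7 buckets plus the 4-byte
rq value (and padding) total exactly 64 bytes.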

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
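The linear mapping can be exercised in user space; this mirrors the
UCLAMP_BUCKET_DELTA and uclamp_bucket_id() definitions added below, with the
default 5 buckets and a simplified DIV_ROUND_CLOSEST stand-in:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024U
#define UCLAMP_BUCKETS 5U
/* simplified stand-in for the kernel's DIV_ROUND_CLOSEST() */
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

/* Integer rounded range for each bucket: 205 with 5 buckets */
#define UCLAMP_BUCKET_DELTA \
        DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)

static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
        return clamp_value / UCLAMP_BUCKET_DELTA;
}
```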

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Message-ID: <20190314111849.gx6bl6myfjtaan7r@e110439-lin>
- remove "bucket local boosting" code and move it into a dedicated
patch
Message-ID: <20190313161229.pkib2tmjass5chtb@e110439-lin>
- refactored uclamp_rq_update() code to make the code cleaner
Message-ID: <20190314122256.7wb3ydswpkfmntvf@e110439-lin>
- s/uclamp_rq_update/uclamp_rq_max_value/ and move update into caller
Message-ID: <CAJuCfpEWCcWj=B2SPai2pQt+wcjsAhEfVV1O+H0A+_fqLCnb8Q@mail.gmail.com>
- update changelog to clarify the configuration fitting in one cache line
Message-ID: <20190314145456.5qpxchfltfauqaem@e110439-lin>
- s/uclamp_bucket_value/uclamp_bucket_base_value/
Message-ID: <20190313113757.aeaksz5akv6y5uep@e110439-lin>
- update UCLAMP_BUCKET_DELTA to use DIV_ROUND_CLOSEST()
---
include/linux/log2.h | 37 ++++++++
include/linux/sched.h | 39 ++++++++
include/linux/sched/topology.h | 6 --
init/Kconfig | 53 +++++++++++
kernel/sched/core.c | 160 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 51 +++++++++++
6 files changed, 340 insertions(+), 6 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 2af7f77866d0..e2db25734532 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
ilog2((n) - 1) + 1) : \
__order_base_2(n) \
)
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+ if (n < 2)
+ return 1;
+ if (is_power_of_2(n))
+ return order_base_2(n) + 1;
+ return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g. bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bf(0) = 1
+ * bf(1) = 1
+ * bf(2) = 2
+ * bf(3) = 2
+ * bf(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n) \
+( \
+ __builtin_constant_p(n) ? ( \
+ ((n) == 0 || (n) == 1) ? 1 : ( \
+ ((n) & (n - 1)) == 0 ? \
+ ilog2((n) - 1) + 2 : \
+ ilog2((n) - 1) + 1 \
+ ) \
+ ) : \
+ __bits_per(n) \
+)
#endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18696a194e06..0c0dd7aac8e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,18 @@ struct vtime {
u64 gtime;
};

+/*
+ * Utilization clamp constraints.
+ * @UCLAMP_MIN: Minimum utilization
+ * @UCLAMP_MAX: Maximum utilization
+ * @UCLAMP_CNT: Utilization clamp constraints count
+ */
+enum uclamp_id {
+ UCLAMP_MIN = 0,
+ UCLAMP_MAX,
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -312,6 +324,10 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+/* Increase resolution of cpu_capacity calculations */
+# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -560,6 +576,25 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+#ifdef CONFIG_UCLAMP_TASK
+/* Number of utilization clamp buckets (shorter alias) */
+#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
+
+/*
+ * Utilization clamp for a scheduling entity
+ * @value: clamp value "assigned" to a se
+ * @bucket_id: bucket index corresponding to the "assigned" value
+ *
+ * The bucket_id is the index of the clamp bucket matching the clamp value
+ * which is pre-computed and stored to avoid expensive integer divisions from
+ * the fast path.
+ */
+struct uclamp_se {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -640,6 +675,10 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 57c7ed3fe465..bb5d77d45b09 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@

#include <linux/sched/idle.h>

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..7439cbf4d02e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -651,6 +651,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+ With this option, the user can specify the min and max CPU
+ utilization allowed for RUNNABLE tasks. The max utilization defines
+ the maximum frequency a task should use while the min utilization
+ defines the minimum frequency it should use.
+
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+config UCLAMP_BUCKETS_COUNT
+ int "Number of supported utilization clamp buckets"
+ range 5 20
+ default 5
+ depends on UCLAMP_TASK
+ help
+ Defines the number of clamp buckets to use. The range of each bucket
+ will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+ number of clamp buckets the finer their granularity and the higher
+ the precision of clamping aggregation and tracking at run-time.
+
+ For example, with the minimum configuration value we will have 5
+ clamp buckets tracking 20% utilization each. A 25% boosted task will
+ be refcounted in the [20..39]% bucket and will set the bucket clamp
+ effective value to 25%.
+ If a second 30% boosted task should be co-scheduled on the same CPU,
+ that task will be refcounted in the same bucket of the first task and
+ it will boost the bucket clamp effective value to 30%.
+ The clamp effective value of a bucket is reset to its nominal value
+ (20% in the example above) when there are no more tasks refcounted in
+ that bucket.
+
+ An additional boost/capping margin can be added to some tasks. In the
+ example above the 25% task will be boosted to 30% until it exits the
+ CPU. If that should be considered not acceptable on certain systems,
+ it's always possible to reduce the margin by increasing the number of
+ clamp buckets to trade off used memory for run-time tracking
+ precision.
+
+ If in doubt, use the default value.
+
+endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6b2c055564b5..032211b72110 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -732,6 +732,162 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+
+/* Integer rounded range for each bucket */
+#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
+
+static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
+{
+ return clamp_value / UCLAMP_BUCKET_DELTA;
+}
+
+static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
+{
+ return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
+}
+
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline
+unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
+{
+ struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
+ int bucket_id = UCLAMP_BUCKETS - 1;
+
+ /*
+ * Since both min and max clamps are max aggregated, find the
+ * top most bucket with tasks in.
+ */
+ for ( ; bucket_id >= 0; bucket_id--) {
+ if (!bucket[bucket_id].tasks)
+ continue;
+ return bucket[bucket_id].value;
+ }
+
+ /* No tasks -- default clamp values */
+ return uclamp_none(clamp_id);
+}
+
+/*
+ * When a task is enqueued on a rq, the clamp bucket currently defined by the
+ * task's uclamp::bucket_id is refcounted on that rq. This also immediately
+ * updates the rq's clamp value if required.
+ */
+static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ bucket->tasks++;
+
+ if (uc_se->value > READ_ONCE(uc_rq->value))
+ WRITE_ONCE(uc_rq->value, bucket->value);
+}
+
+/*
+ * When a task is dequeued from a rq, the clamp bucket refcounted by the task
+ * is released. If this is the last task reference counting the rq's max
+ * active clamp value, then the rq's clamp value is updated.
+ *
+ * Both refcounted tasks and rq's cached clamp values are expected to be
+ * always valid. If it's detected they are not, as defensive programming,
+ * enforce the expected state and warn.
+ */
+static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+ unsigned int rq_clamp;
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ SCHED_WARN_ON(!bucket->tasks);
+ if (likely(bucket->tasks))
+ bucket->tasks--;
+
+ if (likely(bucket->tasks))
+ return;
+
+ rq_clamp = READ_ONCE(uc_rq->value);
+ /*
+ * Defensive programming: this should never happen. If it happens,
+ * e.g. due to future modification, warn and fixup the expected value.
+ */
+ SCHED_WARN_ON(bucket->value > rq_clamp);
+ if (bucket->value >= rq_clamp)
+ WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_rq_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_rq_dec_id(p, rq, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+ unsigned int clamp_id;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct uclamp_bucket *bucket;
+ struct uclamp_rq *uc_rq;
+ unsigned int bucket_id;
+
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
+
+ bucket_id = 1;
+ while (bucket_id < UCLAMP_BUCKETS) {
+ bucket = &uc_rq->bucket[bucket_id];
+ bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
+ ++bucket_id;
+ }
+ }
+ }
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+
+ uc_se->value = uclamp_none(clamp_id);
+ uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -742,6 +898,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, flags & ENQUEUE_WAKEUP);
}

+ uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -755,6 +912,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

+ uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -6088,6 +6246,8 @@ void __init sched_init(void)

psi_init();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 71208b67e58a..c3d1ae1e7eec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value is affecting a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * There are up to UCLAMP_CNT possible different clamp values, currently there
+ * are only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+ unsigned int value;
+ struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -835,6 +877,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1649,6 +1696,10 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
--
2.20.1

2019-04-02 10:43:24

by Patrick Bellasi

Subject: [PATCH v8 02/16] sched/core: Add bucket local max tracking

Because of bucketization, different task-specific clamp values are
tracked in the same bucket. For example, with a 20% bucket size and
assuming we have:
Task1: util_min=25%
Task2: util_min=35%
both tasks will be refcounted in the [20..39]% bucket and always boosted
only up to 20%, thus implementing a simple floor aggregation normally
used in histograms.

In systems with only few and well-defined clamp values, it would be
useful to track the exact clamp value required by a task whenever
possible. For example, if a system requires only 23% and 47% boost
values then it's possible to track the exact boost required by each
task using only 3 buckets of ~33% size each.

Introduce a mechanism to max aggregate the requested clamp values of
RUNNABLE tasks in the same bucket. Keep it simple by resetting the
bucket value to its base value only when a bucket becomes inactive.
Allow a limited and controlled overboosting margin for tasks refcounted
in the same bucket.

In systems where the boost values are not known in advance, it is still
possible to control the maximum acceptable overboosting margin by tuning
the number of clamp buckets. For example, 20 buckets ensure a 5% maximum
overboost.
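As a quick sanity check of the 5% figure above (userspace sketch; simplified
DIV_ROUND_CLOSEST stand-in):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024U
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

/*
 * Worst-case overboost margin: a task sitting at a bucket's base value
 * can be boosted by (almost) a full bucket size when a co-scheduled
 * task requests a clamp near the top of the same bucket.
 */
static unsigned int overboost_margin(unsigned int buckets)
{
        return DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, buckets);
}
```

With 20 buckets the margin is 51, i.e. ~5% of SCHED_CAPACITY_SCALE; with the
default 5 buckets it grows to 205 (~20%).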

Remove the rq bucket initialization code since a correct bucket value
is now computed when a task is refcounted into a CPU's rq.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Message-ID: <[email protected]>
- split this code out from the previous patch
---
kernel/sched/core.c | 46 ++++++++++++++++++++++++++-------------------
1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 032211b72110..6e1beae5f348 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -778,6 +778,11 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
* When a task is enqueued on a rq, the clamp bucket currently defined by the
* task's uclamp::bucket_id is refcounted on that rq. This also immediately
* updates the rq's clamp value if required.
+ *
+ * Tasks can have a task-specific value requested from user-space, track
+ * within each bucket the maximum value for tasks refcounted in it.
+ * This "local max aggregation" allows to track the exact "requested" value
+ * for each bucket when all its RUNNABLE tasks require the same clamp.
*/
static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
@@ -789,8 +794,15 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
bucket = &uc_rq->bucket[uc_se->bucket_id];
bucket->tasks++;

+ /*
+ * Local max aggregation: rq buckets always track the max
+ * "requested" clamp value of its RUNNABLE tasks.
+ */
+ if (uc_se->value > bucket->value)
+ bucket->value = uc_se->value;
+
if (uc_se->value > READ_ONCE(uc_rq->value))
- WRITE_ONCE(uc_rq->value, bucket->value);
+ WRITE_ONCE(uc_rq->value, uc_se->value);
}

/*
@@ -815,6 +827,12 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
if (likely(bucket->tasks))
bucket->tasks--;

+ /*
+ * Keep "local max aggregation" simple and accept to (possibly)
+ * overboost some RUNNABLE tasks in the same bucket.
+ * The rq clamp bucket value is reset to its base value whenever
+ * there are no more RUNNABLE tasks refcounting it.
+ */
if (likely(bucket->tasks))
return;

@@ -824,8 +842,14 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
* e.g. due to future modification, warn and fixup the expected value.
*/
SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp)
+ if (bucket->value >= rq_clamp) {
+ /*
+ * Reset clamp bucket value to its nominal value whenever
+ * there are no more RUNNABLE tasks refcounting it.
+ */
+ bucket->value = uclamp_bucket_base_value(bucket->value);
WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
+ }
}

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
@@ -855,25 +879,9 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

- for_each_possible_cpu(cpu) {
- struct uclamp_bucket *bucket;
- struct uclamp_rq *uc_rq;
- unsigned int bucket_id;
-
+ for_each_possible_cpu(cpu)
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
-
- bucket_id = 1;
- while (bucket_id < UCLAMP_BUCKETS) {
- bucket = &uc_rq->bucket[bucket_id];
- bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
- ++bucket_id;
- }
- }
- }
-
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];

--
2.20.1

2019-04-02 10:43:28

by Patrick Bellasi

Subject: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX

When a task sleeps it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU in the same frequency domain triggers a frequency update, since
schedutil can now see the full, not clamped, blocked utilization of the
idle CPU.

Fix this by using
uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.

Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Message-ID: <20190314170619.rt6yhelj3y6dzypu@e110439-lin>
- moved flag reset into uclamp_rq_inc()
---
kernel/sched/core.c | 45 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
2 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e1beae5f348..046f61d33f00 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -754,8 +754,35 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}

+static inline unsigned int
+uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
+{
+ /*
+ * Avoid blocked utilization pushing up the frequency when we go
+ * idle (which drops the max-clamp) by retaining the last known
+ * max-clamp.
+ */
+ if (clamp_id == UCLAMP_MAX) {
+ rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
+ return clamp_value;
+ }
+
+ return uclamp_none(UCLAMP_MIN);
+}
+
+static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
+{
+ /* Reset max-clamp retention only on idle exit */
+ if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
+ return;
+
+ WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
+}
+
static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
+unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
{
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
}

/* No tasks -- default clamp values */
- return uclamp_none(clamp_id);
+ return uclamp_idle_value(rq, clamp_id, clamp_value);
}

/*
@@ -794,6 +821,8 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
bucket = &uc_rq->bucket[uc_se->bucket_id];
bucket->tasks++;

+ uclamp_idle_reset(rq, clamp_id, uc_se->value);
+
/*
* Local max aggregation: rq buckets always track the max
* "requested" clamp value of its RUNNABLE tasks.
@@ -820,6 +849,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;
+ unsigned int bkt_clamp;
unsigned int rq_clamp;

bucket = &uc_rq->bucket[uc_se->bucket_id];
@@ -848,7 +878,8 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
* there are anymore RUNNABLE tasks refcounting it.
*/
bucket->value = uclamp_bucket_base_value(bucket->value);
- WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
+ bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
+ WRITE_ONCE(uc_rq->value, bkt_clamp);
}
}

@@ -861,6 +892,10 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
uclamp_rq_inc_id(p, rq, clamp_id);
+
+ /* Reset clamp idle holding when there is one RUNNABLE task */
+ if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
+ rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
}

static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
@@ -879,8 +914,10 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+ cpu_rq(cpu)->uclamp_flags = 0;
+ }

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c3d1ae1e7eec..d8b182f1782c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -880,6 +880,8 @@ struct rq {
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+ unsigned int uclamp_flags;
+#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
--
2.20.1

2019-04-02 10:43:33

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

/proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and the
bucket the task has been refcounted in at enqueue time. This allows
lazily aggregating the "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Message-ID: <[email protected]>
- add "requested" values uclamp_se instance beside the existing
"effective" values instance
- rename uclamp_effective_{get,assign}() into uclamp_eff_{get,set}()
- make uclamp_eff_get() return the new "effective" values by copy
Message-ID: <20190318125844.ajhjpaqlcgxn7qkq@e110439-lin>
- run uclamp_fork() code independently from the class being supported.
Resetting active flag is not harmful and following patches will add
other code which still needs to be executed independently from class
support.
Message-ID: <[email protected]>
- add sysctl_sched_uclamp_handler()'s internal mutex to serialize
concurrent usages
---
include/linux/sched.h | 10 +++
include/linux/sched/sysctl.h | 11 +++
kernel/sched/core.c | 131 ++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 16 +++++
4 files changed, 167 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c0dd7aac8e9..d8491954e2e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -584,14 +584,21 @@ struct sched_dl_entity {
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
* @bucket_id: bucket index corresponding to the "assigned" value
+ * @active: the se is currently refcounted in a rq's bucket
*
* The bucket_id is the index of the clamp bucket matching the clamp value
* which is pre-computed and stored to avoid expensive integer divisions from
* the fast path.
+ *
+ * The active bit is set whenever a task has got an "effective" value assigned,
+ * which can be different from the clamp value "requested" from user-space.
+ * This allows to know a task is refcounted in the rq's bucket corresponding
+ * to the "effective" bucket_id.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+ unsigned int active : 1;
};
#endif /* CONFIG_UCLAMP_TASK */

@@ -676,6 +683,9 @@ struct task_struct {
struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
+ /* Clamp values requested for a scheduling entity */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a scheduling entity */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d728df7..d4f6215ee03f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_UCLAMP_TASK
+extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 046f61d33f00..d368ac26b8aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -733,6 +733,14 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/* Max allowed minimum utilization */
+unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
+
+/* Max allowed maximum utilization */
+unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+
+/* All clamps are required to be less or equal than these values */
+static struct uclamp_se uclamp_default[UCLAMP_CNT];

/* Integer rounded range for each bucket */
#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
@@ -801,6 +809,52 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+/*
+ * The effective clamp bucket index of a task depends on, by increasing
+ * priority:
+ * - the task specific clamp value, when explicitly requested from userspace
+ * - the system default clamp value, defined by the sysadmin
+ */
+static inline struct uclamp_se
+uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_max = uclamp_default[clamp_id];
+
+ /* System default restrictions always apply */
+ if (unlikely(uc_req.value > uc_max.value))
+ return uc_max;
+
+ return uc_req;
+}
+
+static inline unsigned int
+uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_eff;
+
+ /* Task currently refcounted: use back-annotated (effective) bucket */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].bucket_id;
+
+ uc_eff = uclamp_eff_get(p, clamp_id);
+
+ return uc_eff.bucket_id;
+}
+
+unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_eff;
+
+ /* Task currently refcounted: use back-annotated (effective) value */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ uc_eff = uclamp_eff_get(p, clamp_id);
+
+ return uc_eff.value;
+}
+
/*
* When a task is enqueued on a rq, the clamp bucket currently defined by the
* task's uclamp::bucket_id is refcounted on that rq. This also immediately
@@ -818,8 +872,12 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
struct uclamp_bucket *bucket;

+ /* Update task effective clamp */
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+
bucket = &uc_rq->bucket[uc_se->bucket_id];
bucket->tasks++;
+ uc_se->active = true;

uclamp_idle_reset(rq, clamp_id, uc_se->value);

@@ -856,6 +914,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
SCHED_WARN_ON(!bucket->tasks);
if (likely(bucket->tasks))
bucket->tasks--;
+ uc_se->active = false;

/*
* Keep "local max aggregation" simple and accept to (possibly)
@@ -909,8 +968,69 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int old_min, old_max;
+ static DEFINE_MUTEX(mutex);
+ int result;
+
+ mutex_lock(&mutex);
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto undo;
+ }
+
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uclamp_default[UCLAMP_MIN].value =
+ sysctl_sched_uclamp_util_min;
+ uclamp_default[UCLAMP_MIN].bucket_id =
+ uclamp_bucket_id(sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uclamp_default[UCLAMP_MAX].value =
+ sysctl_sched_uclamp_util_max;
+ uclamp_default[UCLAMP_MAX].bucket_id =
+ uclamp_bucket_id(sysctl_sched_uclamp_util_max);
+ }
+
+ /*
+ * Updating all the RUNNABLE task is expensive, keep it simple and do
+ * just a lazy update at each next enqueue time.
+ */
+ goto done;
+
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+done:
+ mutex_unlock(&mutex);
+
+ return result;
+}
+
+static void uclamp_fork(struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ p->uclamp[clamp_id].active = false;
+}
+
static void __init init_uclamp(void)
{
+ struct uclamp_se uc_max = {};
unsigned int clamp_id;
int cpu;

@@ -920,16 +1040,23 @@ static void __init init_uclamp(void)
}

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+ struct uclamp_se *uc_se = &init_task.uclamp_req[clamp_id];

uc_se->value = uclamp_none(clamp_id);
uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
}
+
+ /* System defaults allow max clamp values for both indexes */
+ uc_max.value = uclamp_none(UCLAMP_MAX);
+ uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_default[clamp_id] = uc_max;
}

#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

@@ -2530,6 +2657,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
*/
p->prio = current->normal_prio;

+ uclamp_fork(p);
+
/*
* Revert to default priority/policy on fork if requested.
*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 987ae08147bf..72277f09887d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -446,6 +446,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.20.1

2019-04-02 10:43:43

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 05/16] sched/core: Allow sched_setattr() to use the current policy

The sched_setattr() syscall mandates that a policy is always specified.
This requires always knowing which policy a task will have when
attributes are configured, which makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy, since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/uapi/linux/sched.h | 6 +++++-
kernel/sched/core.c | 11 ++++++++++-
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..075c610adf45 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,6 +40,8 @@
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+/* Must be the last entry: used to sanity check attr.policy values */
+#define SCHED_POLICY_MAX SCHED_DEADLINE

/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -50,9 +52,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d368ac26b8aa..20efb32e1a7e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4907,8 +4907,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
if (retval)
return retval;

- if ((int)attr.sched_policy < 0)
+ /*
+ * A valid policy is always required from userspace, unless
+ * SCHED_FLAG_KEEP_POLICY is set and the current policy
+ * is enforced for this call.
+ */
+ if (attr.sched_policy > SCHED_POLICY_MAX &&
+ !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
return -EINVAL;
+ }
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;

rcu_read_lock();
retval = -ESRCH;
--
2.20.1

2019-04-02 10:43:51

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 07/16] sched/core: uclamp: Reset uclamp values on RESET_ON_FORK

A forked task gets the same clamp values as its parent. However, when
the RESET_ON_FORK flag is set on the parent, e.g. via:

sys_sched_setattr()
sched_setattr()
__sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)

the new forked task is expected to start with all attributes reset to
default values.

Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 68aed32e8ec7..bdebdabe9bc4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1070,6 +1070,17 @@ static void uclamp_fork(struct task_struct *p)

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
p->uclamp[clamp_id].active = false;
+
+ if (likely(!p->sched_reset_on_fork))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ p->uclamp_req[clamp_id].user_defined = false;
+ p->uclamp_req[clamp_id].value = clamp_value;
+ p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
+ }
}

static void __init init_uclamp(void)
--
2.20.1

2019-04-02 10:44:05

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 10/16] sched/core: uclamp: Add uclamp_util_with()

So far uclamp_util() allows clamping a specified utilization considering
the clamp values requested by the RUNNABLE tasks on a CPU. For the
Energy Aware Scheduler (EAS) it is interesting to test how the clamp
values will change when a task becomes RUNNABLE on a given CPU.
For example, EAS is interested in comparing the energy impact of
different scheduling decisions and the clamp values can play a role on
that.

Add uclamp_util_with(), which clamps a given utilization while
considering the possible impact on CPU clamp values of a specified task.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Others:
- s/uclamp_effective_value()/uclamp_eff_value()/
---
kernel/sched/sched.h | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b4c13f3beb2f..a1ed3d94652a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2276,11 +2276,20 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id);
+
+static __always_inline
+unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
{
unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

+ if (p) {
+ min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
+ max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
+ }
+
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
* RUNNABLE tasks with _different_ clamps, we can end up with an
@@ -2291,7 +2300,17 @@ static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)

return clamp(util, min_util, max_util);
}
+
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return uclamp_util_with(rq, util, NULL);
+}
#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
+{
+ return util;
+}
static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
{
return util;
--
2.20.1

2019-04-02 10:44:07

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 11/16] sched/fair: uclamp: Add uclamp support to energy_compute()

The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:
a) an (active) power consumption defined for each CPU frequency
b) an estimation of which frequency will be used on each CPU
c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c).
A CPU is expected to run:
- at a higher-than-required frequency, but for a shorter time, when its
estimated utilization is smaller than the minimum utilization
enforced by uclamp
- at a lower-than-required frequency, but for a longer time, when its
estimated utilization is bigger than the maximum utilization
enforced by uclamp

While compute_energy() already accounts for clamping effects on busy
time, the clamping effects on frequency selection are currently ignored.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when an RT task is running on another CPU of
the same performance domain. This can have an impact on energy
estimation but:
- it's not easy to say which approach is better, since it depends on
the use case
- the original approach could still be obtained by setting a smaller
task-specific util_min whenever required

While at it:
- rename schedutil_freq_util() into schedutil_cpu_util(),
since it's not only used for frequency selection.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>

---
Changes in v8:
Message-ID: <20190318151900.p2lm2ys4qx7yfjhs@e110439-lin>
- make schedutil_type visible on !CONFIG_CPU_FREQ_GOV_SCHEDUTIL
- keep using unsigned long for utilization
- drop optional renamings
---
kernel/sched/cpufreq_schedutil.c | 9 +++----
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++-----
kernel/sched/sched.h | 20 +++++-----------
3 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 70a8b87fa29c..f206c7732acd 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -195,8 +195,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type)
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p)
{
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
@@ -229,7 +230,7 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
- util = uclamp_util(rq, util);
+ util = uclamp_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -289,7 +290,7 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);

- return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+ return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
}

/**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ac98204de86..9f9c680a6aee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6469,11 +6469,21 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- long util, max_util, sum_util, energy = 0;
+ unsigned int max_util, util_cfs, cpu_util, cpu_cap;
+ unsigned long sum_util, energy = 0;
+ struct task_struct *tsk;
int cpu;

for (; pd; pd = pd->next) {
+ struct cpumask *pd_mask = perf_domain_span(pd);
+
+ /*
+ * The energy model mandates all the CPUs of a performance
+ * domain have the same capacity.
+ */
+ cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
max_util = sum_util = 0;
+
/*
* The capacity state of CPUs of the current rd can be driven by
* CPUs of another rd if they belong to the same performance
@@ -6484,11 +6494,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* it will not appear in its pd list and will not be accounted
* by compute_energy().
*/
- for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
- util = schedutil_energy_util(cpu, util);
- max_util = max(util, max_util);
- sum_util += util;
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ util_cfs = cpu_util_next(cpu, p, dst_cpu);
+
+ /*
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
+ */
+ sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ ENERGY_UTIL, NULL);
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
+ */
+ tsk = cpu == dst_cpu ? p : NULL;
+ cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a1ed3d94652a..6ae3628248eb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2332,7 +2332,6 @@ static inline unsigned long capacity_orig_of(int cpu)
}
#endif

-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
@@ -2348,15 +2347,11 @@ enum schedutil_type {
ENERGY_UTIL,
};

-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type);
-
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

- return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p);

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
@@ -2385,11 +2380,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- return cfs;
-}
-#endif
+#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
+#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
static inline unsigned long cpu_util_irq(struct rq *rq)
--
2.20.1

2019-04-02 10:44:18

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v8 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

When a task-specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}queued. However, when cgroups are also in use, the
task-specific clamp values could be restricted by the task_group (TG)
clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp bucket defining the
smaller clamp between the task-specific value and its TG effective
value. This makes it possible to:

1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value

2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group; for tasks in the root group or in an autogroup, only system
defaults are enforced.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de8391bafe11..423d74ed47f6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -821,16 +821,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ * group or in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];

/* System default restrictions always apply */
--
2.20.1

2019-04-02 10:44:28

by Patrick Bellasi

Subject: [PATCH v8 14/16] sched/core: uclamp: Propagate system defaults to root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.

- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

- system defaults: which define the maximum possible clamp values
usable by tasks.

- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information related
to the values "requested" by them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
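
The propagation rule can be sketched as a minimal stand-alone helper
(illustrative only): each level's effective clamp is its own request
restricted by the parent's effective value, so tuning the system
default (the root's effective value) restricts or relaxes every
descendant.

```c
/*
 * Top-down "effective" value computation: the effective clamp at one
 * level is the requested value, capped by the parent's effective
 * value. The root's effective value is the system-wide default.
 */
static unsigned int effective_clamp(unsigned int requested,
				    unsigned int parent_effective)
{
	return requested < parent_effective ? requested : parent_effective;
}
```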

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e09cf8881567..de8391bafe11 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -980,6 +980,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id);
+#else
+#define cpu_util_update_eff(...)
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -1017,6 +1024,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
uclamp_bucket_id(sysctl_sched_uclamp_util_max);
}

+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MIN);
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MAX);
+
/*
* Updating all the RUNNABLE task is expensive, keep it simple and do
* just a lazy update at each next enqueue time.
--
2.20.1

2019-04-02 10:45:33

by Patrick Bellasi

Subject: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements that can translate into proper
decisions for both task placement and frequency selection. Other
classes have a more simplified model based on the POSIX concept of
priorities.

Such a simple priority-based model however does not allow exploiting
the most advanced features of the Linux scheduler like, for example,
driving frequency selection via the schedutil cpufreq governor.
Nevertheless, also for non SCHED_DEADLINE tasks it is still interesting
to define task properties that support scheduler decisions.

Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This allows implementing a "proactive" per-task
frequency control policy, more advanced than the current one based
just on "passive" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).

Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows
defining task-specific properties for different scheduling classes. A
new pair of attributes allows specifying a minimum and maximum
utilization the scheduler can consider for a task.

Do that by validating the required clamp values first and then applying
the required changes using the same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the
new clamp values.

Do not allow changing sched class specific params and non class
specific params (i.e. clamp values) at the same time. This keeps things
simple and still works for the most common cases, since we are usually
interested in just one of the two actions.
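
The validation step mirrors a simple rule, sketched here as a
stand-alone function (names are illustrative; the kernel's version
also merges the flags with the task's current requests):

```c
#include <errno.h>

#define SCHED_CAPACITY_SCALE 1024

/*
 * Sketch of the rule enforced by uclamp_validate(): the requested
 * clamps must describe a non-empty range within
 * [0..SCHED_CAPACITY_SCALE].
 */
static int validate_clamps(unsigned int util_min, unsigned int util_max)
{
	if (util_min > util_max)
		return -EINVAL;
	if (util_max > SCHED_CAPACITY_SCALE)
		return -EINVAL;
	return 0;
}
```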

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v8:
Others:
- using p->uclamp_req to track clamp values "requested" from userspace
---
include/linux/sched.h | 9 ++++
include/uapi/linux/sched.h | 12 ++++-
include/uapi/linux/sched/types.h | 66 ++++++++++++++++++++----
kernel/sched/core.c | 87 +++++++++++++++++++++++++++++++-
4 files changed, 162 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d8491954e2e1..c2b81a84985b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -585,6 +585,7 @@ struct sched_dl_entity {
* @value: clamp value "assigned" to a se
* @bucket_id: bucket index corresponding to the "assigned" value
* @active: the se is currently refcounted in a rq's bucket
+ * @user_defined: the requested clamp value comes from user-space
*
* The bucket_id is the index of the clamp bucket matching the clamp value
* which is pre-computed and stored to avoid expensive integer divisions from
@@ -594,11 +595,19 @@ struct sched_dl_entity {
* which can be different from the clamp value "requested" from user-space.
* This allows to know a task is refcounted in the rq's bucket corresponding
* to the "effective" bucket_id.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this task
+ * just as a restriction. This allows to relax default clamps when a less
+ * restrictive task-specific value has been requested, thus allowing to
+ * implement a "nice" semantic. For example, a task running with a 20%
+ * default boost can still drop its own boosting to 0%.
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
+ unsigned int user_defined : 1;
};
#endif /* CONFIG_UCLAMP_TASK */

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 075c610adf45..d2c65617a4a4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -53,10 +53,20 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_KEEP_POLICY 0x08
+#define SCHED_FLAG_KEEP_PARAMS 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+
+#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
+ SCHED_FLAG_KEEP_PARAMS)
+
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
- SCHED_FLAG_KEEP_POLICY)
+ SCHED_FLAG_KEEP_ALL | \
+ SCHED_FLAG_UTIL_CLAMP)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..c852153ddb0d 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -9,6 +9,7 @@ struct sched_param {
};

#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */

/*
* Extended scheduling parameters data structure.
@@ -21,8 +22,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows to define additional attributes to
+ * improve the scheduler knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and relative POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Task Attributes
+ * =========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such a model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +60,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +73,29 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the utilization
+ * expected for a task. These attributes allow to inform the scheduler about
+ * the utilization boundaries within which it should schedule the task. These
+ * boundaries are valuable hints to support scheduler decisions on both task
+ * placement and frequency selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. For example, a
+ * 20% utilization task is a task running for 2ms every 10ms at maximum
+ * frequency.
+ *
+ * A task with a min utilization value bigger than 0 is more likely scheduled
+ * on a CPU with a capacity big enough to fit the specified value.
+ * A task with a max utilization value smaller than 1024 is more likely
+ * scheduled on a CPU with no more capacity than the specified value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +113,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 20efb32e1a7e..68aed32e8ec7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1020,6 +1020,50 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
return result;
}

+static int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
+ unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ lower_bound = attr->sched_util_min;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ upper_bound = attr->sched_util_max;
+
+ if (lower_bound > upper_bound)
+ return -EINVAL;
+ if (upper_bound > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
+ return;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ unsigned int lower_bound = attr->sched_util_min;
+
+ p->uclamp_req[UCLAMP_MIN].user_defined = true;
+ p->uclamp_req[UCLAMP_MIN].value = lower_bound;
+ p->uclamp_req[UCLAMP_MIN].bucket_id =
+ uclamp_bucket_id(lower_bound);
+ }
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ unsigned int upper_bound = attr->sched_util_max;
+
+ p->uclamp_req[UCLAMP_MAX].user_defined = true;
+ p->uclamp_req[UCLAMP_MAX].value = upper_bound;
+ p->uclamp_req[UCLAMP_MAX].bucket_id =
+ uclamp_bucket_id(upper_bound);
+ }
+}
+
static void uclamp_fork(struct task_struct *p)
{
unsigned int clamp_id;
@@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -ENODEV;
+}
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
@@ -4424,6 +4475,13 @@ static void __setscheduler_params(struct task_struct *p,
static void __setscheduler(struct rq *rq, struct task_struct *p,
const struct sched_attr *attr, bool keep_boost)
{
+ /*
+ * If params can't change scheduling class changes aren't allowed
+ * either.
+ */
+ if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
+ return;
+
__setscheduler_params(p, attr);

/*
@@ -4561,6 +4619,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ /* Update task specific "requested" clamps */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = uclamp_validate(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4590,6 +4655,8 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
+ goto change;

p->sched_reset_on_fork = reset_on_fork;
task_rq_unlock(rq, p, &rf);
@@ -4670,7 +4737,9 @@ static int __sched_setscheduler(struct task_struct *p,
put_prev_task(rq, p);

prev_class = p->sched_class;
+
__setscheduler(rq, p, attr, pi);
+ __setscheduler_uclamp(p, attr);

if (queued) {
/*
@@ -4846,6 +4915,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return -EFAULT;

+ if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
+ size < SCHED_ATTR_SIZE_VER1)
+ return -EINVAL;
+
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -4922,10 +4995,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();

+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}

@@ -5076,6 +5154,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);

+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
+#endif
+
rcu_read_unlock();

retval = sched_read_attr(uattr, &attr, size);
--
2.20.1

2019-04-02 10:45:37

by Patrick Bellasi

Subject: [PATCH v8 16/16] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.

Do that by slightly refactoring uclamp_bucket_inc(). An additional
*cgroup_subsys_state (css) parameter is used to walk the list of tasks
in the TGs and update the RUNNABLE ones. Do that by taking the rq
lock for each task, the same mechanism used for CPU affinity mask
updates.
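
The per-task refresh boils down to a dec/inc pair on the rq's clamp
buckets, which can be modelled with this stand-alone sketch (the array
and names are illustrative, not the kernel's data structures):

```c
#define NBUCKETS 5

static unsigned int bucket_refcnt[NBUCKETS];

/*
 * Model of the body of uclamp_update_active(): an active task is
 * refcounted in exactly one bucket per clamp index, so refreshing its
 * clamp means dropping the old refcount and adding one under the new
 * bucket_id. A !active task just records the new bucket; the next
 * enqueue will account it.
 */
static void update_active_bucket(unsigned int *task_bucket,
				 unsigned int new_bucket, int active)
{
	if (!active) {
		*task_bucket = new_bucket;
		return;
	}
	bucket_refcnt[*task_bucket]--;	/* uclamp_rq_dec_id() */
	*task_bucket = new_bucket;
	bucket_refcnt[*task_bucket]++;	/* uclamp_rq_inc_id() */
}
```

In the kernel the whole sequence runs under task_rq_lock(), which is
what serializes it against enqueues, dequeues and migrations.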

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 423d74ed47f6..d686b2f1c0e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1006,6 +1006,51 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(p, rq, clamp_id);
+ uclamp_rq_inc_id(p, rq, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css, int clamp_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css,
unsigned int clamp_id);
@@ -7072,6 +7117,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css,

uc_se->value = value;
uc_se->bucket_id = uclamp_bucket_id(value);
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamp_id);
}
}

--
2.20.1

2019-04-02 11:35:23

by Patrick Bellasi

Subject: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks

By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks, which instead are
expected to run at the maximum available frequency unless otherwise
required by explicitly capping them.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

1. the task is switched to the RT class and it does not already have a
user-defined clamp value assigned.

2. an RT task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.
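
The default selection described above can be sketched stand-alone
(names are illustrative, not the kernel's):

```c
#include <stdbool.h>

enum { UCLAMP_MIN, UCLAMP_MAX };

/* uclamp_none(): the default clamp per index, 0 for min, 1024 for max. */
static unsigned int uclamp_none_demo(int clamp_id)
{
	return clamp_id == UCLAMP_MIN ? 0 : 1024;
}

/*
 * Model of the rules above: user-defined values survive class changes
 * and forks; otherwise RT tasks get util_min raised to the maximum
 * (100% boost), everything else gets the plain default.
 */
static unsigned int default_clamp(int clamp_id, bool is_rt,
				  bool user_defined,
				  unsigned int current_value)
{
	if (user_defined)
		return current_value;
	if (is_rt && clamp_id == UCLAMP_MIN)
		return uclamp_none_demo(UCLAMP_MAX);
	return uclamp_none_demo(clamp_id);
}
```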

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bdebdabe9bc4..71c9dd6487b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
+ unsigned int clamp_id;
+
+ /*
+ * On scheduling class change, reset to default clamps for tasks
+ * without a task-specific value.
+ */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ /* Keep using defined clamps across class changes */
+ if (uc_se->user_defined)
+ continue;
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uc_se->bucket_id = uclamp_bucket_id(clamp_value);
+ uc_se->value = clamp_value;
+ }
+
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

@@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = uclamp_none(clamp_id);

+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
p->uclamp_req[clamp_id].user_defined = false;
p->uclamp_req[clamp_id].value = clamp_value;
p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
--
2.20.1

2019-04-02 11:42:22

by Patrick Bellasi

Subject: [PATCH v8 09/16] sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and IRQs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints. This makes it possible, for example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support per-task tuning of the
frequency selection, thus supporting a fine-grained definition of
performance boosting vs energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.
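
The clamping rule, including the inversion handling in uclamp_util(),
can be reproduced stand-alone:

```c
/*
 * Sketch of uclamp_util(): when the MAX-aggregated min and max clamps
 * invert (a boosted task co-scheduled with a capped one), min wins,
 * so boosted tasks keep the performance they asked for; otherwise the
 * utilization is simply clamped into [min_util, max_util].
 */
static unsigned long uclamp_util_demo(unsigned long util,
				      unsigned long min_util,
				      unsigned long max_util)
{
	if (min_util >= max_util)
		return min_util;
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```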

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/fair.c | 4 ++++
kernel/sched/rt.c | 4 ++++
kernel/sched/sched.h | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 033ec7c45f13..70a8b87fa29c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -201,8 +201,10 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
+ if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -218,9 +220,16 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = util_cfs + cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);

dl_util = cpu_util_dl(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8213ff6e365d..1ac98204de86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10617,6 +10617,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 90fa23d36565..d968f7209656 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d8b182f1782c..b4c13f3beb2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2275,6 +2275,29 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.20.1

2019-04-02 11:42:22

by Patrick Bellasi

Subject: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU runs that task at.
The amount of computation can also be affected by the specific CPU a
task is running on, especially on asymmetric capacity systems like
Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller, which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, util.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group. Specifically:

- util.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the util.min
utilization

- util.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the util.max
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system-wide clamps are defined by a generic
interface which does not depend on cgroups. This system-wide
interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing to control and restrict task requests

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing util.min <= util.max.
Keep it simple by not caring, for now, about "effective" values
computation and propagation along the hierarchy.
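
Points b) and c) above describe a top-down restriction, which can be
sketched over a linear hierarchy (array indexed by depth; illustrative
only, the kernel walks the css tree instead):

```c
/*
 * Sketch of hierarchical "effective" clamp computation: each level
 * requests whatever value it likes, but the effective value at depth
 * d is that request restricted by every ancestor's effective value.
 * The root's request comes from the system-wide interface.
 */
static void compute_effective(const unsigned int *requested,
			      unsigned int *effective, int depth)
{
	int d;

	effective[0] = requested[0];
	for (d = 1; d < depth; d++) {
		effective[d] = requested[d] < effective[d - 1] ?
			       requested[d] : effective[d - 1];
	}
}
```

A child requesting more than its parent's effective value simply gets
the parent's value, without losing its own request.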

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

--
Changes in v8:
Message-ID: <[email protected]>
- update changelog description for points b), c) and following paragraph
---
Documentation/admin-guide/cgroup-v2.rst | 27 +++++
init/Kconfig | 22 ++++
kernel/sched/core.c | 142 +++++++++++++++++++++++-
kernel/sched/sched.h | 6 +
4 files changed, 196 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f129c68b..47710a77f4fa 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is based, by default, on a temporal base and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to enforce a minimum
+bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
+which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -974,6 +980,27 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.

+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization in the range [0, 1024].
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to the sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1024". i.e. no utilization capping
+
+ The requested maximum utilization in the range [0, 1024].
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to the sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index 7439cbf4d02e..33006e8de996 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -877,6 +877,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping the maximum frequency a task
+ can use, while the min bandwidth allows defining a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any
+ task-specific clamp value is constrained by the clamp value
+ specified for the cgroup. Neither the minimum nor the maximum task
+ clamp can exceed the corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71c9dd6487b1..aeed2dd315cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1130,8 +1130,12 @@ static void __init init_uclamp(void)
/* System defaults allow max clamp values for both indexes */
uc_max.value = uclamp_none(UCLAMP_MAX);
uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+ }
}

#else /* CONFIG_UCLAMP_TASK */
@@ -6720,6 +6724,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+#endif
+
+ return 1;
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6743,6 +6760,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;

err:
@@ -6963,6 +6983,100 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Update tg's "requested" clamp value */
+ tg->uclamp_req[UCLAMP_MIN].value = min_value;
+ tg->uclamp_req[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Update tg's "requested" clamp value */
+ tg->uclamp_req[UCLAMP_MAX].value = max_value;
+ tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp_req[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7300,6 +7414,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7467,6 +7593,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ae3628248eb..b46b6912beba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,12 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Clamp values requested for a task group */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.20.1

2019-04-02 11:42:24

by Patrick Bellasi

Subject: [PATCH v8 13/16] sched/core: uclamp: Propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are (potentially) constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to its descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.

Since it can be interesting for userspace, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes
cpu.util.{min,max}.effective.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v8:
Message-ID: <[email protected]>
- to be aligned with the changes in the above message:
add "effective" values uclamp_se instance beside the existing
"requested" values instance
Message-ID: <20190318165430.n222vuq4tv3ntbod@e110439-lin>
- s/cpu_util_update_hier/cpu_util_update_eff/
---
Documentation/admin-guide/cgroup-v2.rst | 19 +++++
kernel/sched/core.c | 108 ++++++++++++++++++++++--
kernel/sched/sched.h | 2 +
3 files changed, 124 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 47710a77f4fa..7aad2435e961 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -990,6 +990,16 @@ All time durations are in microseconds.
values similarly to sched_setattr(2). This minimum utilization
value is used to clamp the task-specific minimum utilization clamp.

+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the minimum utilization clamp value currently enforced on a
+ task group.
+
+ The actual minimum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1024", i.e. no utilization capping.
@@ -1000,6 +1010,15 @@ All time durations are in microseconds.
values similarly to sched_setattr(2). This maximum utilization
value is used to clamp the task-specific maximum utilization clamp.

+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports the maximum utilization clamp value currently enforced on a
+ task group.
+
+ The actual maximum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.


Memory
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aeed2dd315cc..e09cf8881567 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -733,6 +733,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;

@@ -1115,6 +1127,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1134,6 +1148,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
#ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+ root_task_group.uclamp[clamp_id] = uc_max;
#endif
}
}
@@ -6730,8 +6745,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
#ifdef CONFIG_UCLAMP_TASK_GROUP
int clamp_id;

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
+ tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
+ }
#endif

return 1;
@@ -6984,6 +7001,44 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+ unsigned int value;
+
+ css_for_each_descendant_pre(css, top_css) {
+ value = css_tg(css)->uclamp_req[clamp_id].value;
+
+ uc_parent = NULL;
+ if (css_tg(css)->parent)
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+
+ /*
+ * Skip the whole subtree if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the descendants already have this value, or
+ * a more restrictive one, as their effective clamp.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (uc_se->value == value &&
+ uc_parent && uc_parent->value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent && uc_parent->value < value)
+ value = uc_parent->value;
+ if (uc_se->value == value)
+ continue;
+
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -6993,6 +7048,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7011,8 +7067,12 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
tg->uclamp_req[UCLAMP_MIN].value = min_value;
tg->uclamp_req[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(css, UCLAMP_MIN);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
@@ -7026,6 +7086,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7044,21 +7105,28 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
tg->uclamp_req[UCLAMP_MAX].value = max_value;
tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(css, UCLAMP_MAX);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}

static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;

rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp_req[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].value
+ : tg->uclamp_req[clamp_id].value;
rcu_read_unlock();

return util_clamp;
@@ -7067,13 +7135,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}

static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */

@@ -7421,11 +7501,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7601,12 +7689,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b46b6912beba..b0d4c5e98305 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -403,6 +403,8 @@ struct task_group {
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a task group */
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

};
--
2.20.1

2019-04-06 23:58:21

by Suren Baghdasaryan

Subject: Re: [PATCH v8 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> Utilization clamping allows to clamp the CPU's utilization within a
> [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> that CPU. Each task references two "clamp buckets" defining its minimum
> and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> bucket is active if there is at least one RUNNABLE task enqueued on
> that CPU and refcounting that bucket.
>
> When a task is {en,de}queued {on,from} a rq, the set of active clamp
> buckets on that CPU can change. If the set of active clamp buckets
> changes for a CPU a new "aggregated" clamp value is computed for that
> CPU. This is because each clamp bucket enforces a different utilization
> clamp value.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This ensures that no task can affect the performance of other
> co-scheduled tasks which are more boosted (i.e. with higher util_min
> clamp) or less capped (i.e. with higher util_max clamp).
>
> A tasks has:

A task has | Tasks have

> task_struct::uclamp[clamp_id]::bucket_id
> to track the "bucket index" of the CPU's clamp bucket it refcounts while
> enqueued, for each clamp index (clamp_id).
>
> A runqueue has:
> rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> to track how many RUNNABLE tasks on that CPU refcount each
> clamp bucket (bucket_id) of a clamp index (clamp_id).
> It also has a:
> rq::uclamp[clamp_id]::bucket[bucket_id].value
> to track the clamp value of each clamp bucket (bucket_id) of a clamp
> index (clamp_id).
>
> The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
> needed to find a new MAX aggregated clamp value for a clamp_id. This
> operation is required only when the last task of a clamp bucket
> tracking the current MAX aggregated clamp value is dequeued. In this case,
> the CPU is either entering IDLE or going to schedule a less boosted or
> more clamped task.
> The expected number of different clamp values configured at build time
> is small enough to fit the full unordered array into a single cache
> line, for configurations of up to 7 buckets.
>
> Add to struct rq the basic data structures required to refcount the
> number of RUNNABLE tasks for each clamp bucket. Add also the max
> aggregation required to update the rq's clamp value at each
> enqueue/dequeue event.
>
> Use a simple linear mapping of clamp values into clamp buckets.
> Pre-compute and cache bucket_id to avoid integer divisions at
> enqueue/dequeue time.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> ---
> Changes in v8:
> Message-ID: <20190314111849.gx6bl6myfjtaan7r@e110439-lin>
> - remove "bucket local boosting" code and move it into a dedicated
> patch
> Message-ID: <20190313161229.pkib2tmjass5chtb@e110439-lin>
> - refactored uclamp_rq_update() code to make code cleaner
> Message-ID: <20190314122256.7wb3ydswpkfmntvf@e110439-lin>
> - s/uclamp_rq_update/uclamp_rq_max_value/ and move update into caller
> Message-ID: <CAJuCfpEWCcWj=B2SPai2pQt+wcjsAhEfVV1O+H0A+_fqLCnb8Q@mail.gmail.com>
> - update changelog to clarify the configuration fitting in one cache line
> Message-ID: <20190314145456.5qpxchfltfauqaem@e110439-lin>
> - s/uclamp_bucket_value/uclamp_bucket_base_value/
> Message-ID: <20190313113757.aeaksz5akv6y5uep@e110439-lin>
> - update UCLAMP_BUCKET_DELTA to use DIV_ROUND_CLOSEST()
> ---
> include/linux/log2.h | 37 ++++++++
> include/linux/sched.h | 39 ++++++++
> include/linux/sched/topology.h | 6 --
> init/Kconfig | 53 +++++++++++
> kernel/sched/core.c | 160 +++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 51 +++++++++++
> 6 files changed, 340 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/log2.h b/include/linux/log2.h
> index 2af7f77866d0..e2db25734532 100644
> --- a/include/linux/log2.h
> +++ b/include/linux/log2.h
> @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
> ilog2((n) - 1) + 1) : \
> __order_base_2(n) \
> )
> +
> +static inline __attribute__((const))
> +int __bits_per(unsigned long n)
> +{
> + if (n < 2)
> + return 1;
> + if (is_power_of_2(n))
> + return order_base_2(n) + 1;
> + return order_base_2(n);
> +}
> +
> +/**
> + * bits_per - calculate the number of bits required for the argument
> + * @n: parameter
> + *
> + * This is constant-capable and can be used for compile time
> + * initiaizations, e.g bitfields.

s/initiaizations/initializations

> + *
> + * The first few values calculated by this routine:
> + * bf(0) = 1
> + * bf(1) = 1
> + * bf(2) = 2
> + * bf(3) = 2
> + * bf(4) = 3
> + * ... and so on.
> + */
> +#define bits_per(n) \
> +( \
> + __builtin_constant_p(n) ? ( \
> + ((n) == 0 || (n) == 1) ? 1 : ( \
> + ((n) & (n - 1)) == 0 ? \

missing braces around 'n'
- ((n) & (n - 1)) == 0 ? \
+ ((n) & ((n) - 1)) == 0 ? \

> + ilog2((n) - 1) + 2 : \
> + ilog2((n) - 1) + 1 \

Isn't this "((n) & ((n) - 1)) == 0 ? ilog2((n) - 1) + 2 : ilog2((n) -
1) + 1" expression equivalent to a simple "ilog2(n) + 1"?

> + ) \
> + ) : \
> + __bits_per(n) \
> +)
> #endif /* _LINUX_LOG2_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18696a194e06..0c0dd7aac8e9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -281,6 +281,18 @@ struct vtime {
> u64 gtime;
> };
>
> +/*
> + * Utilization clamp constraints.
> + * @UCLAMP_MIN: Minimum utilization
> + * @UCLAMP_MAX: Maximum utilization
> + * @UCLAMP_CNT: Utilization clamp constraints count
> + */
> +enum uclamp_id {
> + UCLAMP_MIN = 0,
> + UCLAMP_MAX,
> + UCLAMP_CNT
> +};
> +
> struct sched_info {
> #ifdef CONFIG_SCHED_INFO
> /* Cumulative counters: */
> @@ -312,6 +324,10 @@ struct sched_info {
> # define SCHED_FIXEDPOINT_SHIFT 10
> # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
>
> +/* Increase resolution of cpu_capacity calculations */
> +# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> +# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> +
> struct load_weight {
> unsigned long weight;
> u32 inv_weight;
> @@ -560,6 +576,25 @@ struct sched_dl_entity {
> struct hrtimer inactive_timer;
> };
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/* Number of utilization clamp buckets (shorter alias) */
> +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
> +
> +/*
> + * Utilization clamp for a scheduling entity
> + * @value: clamp value "assigned" to a se
> + * @bucket_id: bucket index corresponding to the "assigned" value
> + *
> + * The bucket_id is the index of the clamp bucket matching the clamp value
> + * which is pre-computed and stored to avoid expensive integer divisions from
> + * the fast path.
> + */
> +struct uclamp_se {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> union rcu_special {
> struct {
> u8 blocked;
> @@ -640,6 +675,10 @@ struct task_struct {
> #endif
> struct sched_dl_entity dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + struct uclamp_se uclamp[UCLAMP_CNT];
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 57c7ed3fe465..bb5d77d45b09 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -6,12 +6,6 @@
>
> #include <linux/sched/idle.h>
>
> -/*
> - * Increase resolution of cpu_capacity calculations
> - */
> -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> -
> /*
> * sched-domains (multiprocessor balancing) declarations:
> */
> diff --git a/init/Kconfig b/init/Kconfig
> index c9386a365eea..7439cbf4d02e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -651,6 +651,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> config GENERIC_SCHED_CLOCK
> bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> + bool "Enable utilization clamping for RT/FAIR tasks"
> + depends on CPU_FREQ_GOV_SCHEDUTIL
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks scheduled on that CPU.
> +
> + With this option, the user can specify the min and max CPU
> + utilization allowed for RUNNABLE tasks. The max utilization defines
> + the maximum frequency a task should use while the min utilization
> + defines the minimum frequency it should use.
> +
> + Both min and max utilization clamp values are hints to the scheduler,
> + aiming at improving its frequency selection policy, but they do not
> + enforce or grant any specific bandwidth for tasks.
> +
> + If in doubt, say N.
> +
> +config UCLAMP_BUCKETS_COUNT
> + int "Number of supported utilization clamp buckets"
> + range 5 20
> + default 5
> + depends on UCLAMP_TASK
> + help
> + Defines the number of clamp buckets to use. The range of each bucket
> + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> + number of clamp buckets the finer their granularity and the higher
> + the precision of clamping aggregation and tracking at run-time.
> +
> + For example, with the minimum configuration value we will have 5
> + clamp buckets tracking 20% utilization each. A 25% boosted task will
> + be refcounted in the [20..39]% bucket and will set the bucket clamp
> + effective value to 25%.
> + If a second 30% boosted task should be co-scheduled on the same CPU,
> + that task will be refcounted in the same bucket of the first task and
> + it will boost the bucket clamp effective value to 30%.
> + The clamp effective value of a bucket is reset to its nominal value
> + (20% in the example above) when there are no more tasks refcounted in
> + that bucket.
> +
> + An additional boost/capping margin can be added to some tasks. In the
> + example above the 25% task will be boosted to 30% until it exits the
> + CPU. If that should be considered not acceptable on certain systems,
> + it's always possible to reduce the margin by increasing the number of
> + clamp buckets to trade off used memory for run-time tracking
> + precision.
> +
> + If in doubt, use the default value.
> +
> +endmenu
> +
> #
> # For architectures that want to enable the support for NUMA-affine scheduler
> # balancing logic:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6b2c055564b5..032211b72110 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -732,6 +732,162 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +
> +/* Integer rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
> +
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> + return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)

Where are you using uclamp_bucket_base_value()? I would expect its
usage somewhere inside uclamp_rq_dec_id() when the last task in the
bucket is dequeued but I don't see it...

> +{
> + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline
> +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + int bucket_id = UCLAMP_BUCKETS - 1;
> +
> + /*
> + * Since both min and max clamps are max aggregated, find the
> + * top most bucket with tasks in.
> + */
> + for ( ; bucket_id >= 0; bucket_id--) {
> + if (!bucket[bucket_id].tasks)
> + continue;
> + return bucket[bucket_id].value;
> + }
> +
> + /* No tasks -- default clamp values */
> + return uclamp_none(clamp_id);
> +}
> +
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> + * updates the rq's clamp value if required.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> + struct uclamp_bucket *bucket;
> +
> + bucket = &uc_rq->bucket[uc_se->bucket_id];
> + bucket->tasks++;
> +
> + if (uc_se->value > READ_ONCE(uc_rq->value))
> + WRITE_ONCE(uc_rq->value, bucket->value);
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket refcounted by the task
> + * is released. If this is the last task reference counting the rq's max
> + * active clamp value, then the rq's clamp value is updated.
> + *
> + * Both refcounted tasks and rq's cached clamp values are expected to be
> + * always valid. If it's detected they are not, as defensive programming,
> + * enforce the expected state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> + struct uclamp_bucket *bucket;
> + unsigned int rq_clamp;
> +
> + bucket = &uc_rq->bucket[uc_se->bucket_id];
> + SCHED_WARN_ON(!bucket->tasks);
> + if (likely(bucket->tasks))
> + bucket->tasks--;
> +
> + if (likely(bucket->tasks))

Shouldn't you adjust bucket->value if the remaining tasks in the
bucket have a lower clamp value than the task that was just removed?

> + return;
> +
> + rq_clamp = READ_ONCE(uc_rq->value);
> + /*
> + * Defensive programming: this should never happen. If it happens,
> + * e.g. due to future modification, warn and fixup the expected value.
> + */
> + SCHED_WARN_ON(bucket->value > rq_clamp);
> + if (bucket->value >= rq_clamp)
> + WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> +}
> +
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_inc_id(p, rq, clamp_id);
> +}
> +
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_dec_id(p, rq, clamp_id);
> +}
> +
> +static void __init init_uclamp(void)
> +{
> + unsigned int clamp_id;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct uclamp_bucket *bucket;
> + struct uclamp_rq *uc_rq;
> + unsigned int bucket_id;
> +
> + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> +
> + bucket_id = 1;
> + while (bucket_id < UCLAMP_BUCKETS) {
> + bucket = &uc_rq->bucket[bucket_id];
> + bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> + ++bucket_id;
> + }
> + }
> + }
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> +
> + uc_se->value = uclamp_none(clamp_id);
> + uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> + }
> +}
> +
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void init_uclamp(void) { }
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -742,6 +898,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_enqueue(p, flags & ENQUEUE_WAKEUP);
> }
>
> + uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -755,6 +912,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + uclamp_rq_dec(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
> @@ -6088,6 +6246,8 @@ void __init sched_init(void)
>
> psi_init();
>
> + init_uclamp();
> +
> scheduler_running = 1;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 71208b67e58a..c3d1ae1e7eec 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * struct uclamp_bucket - Utilization clamp bucket
> + * @value: utilization clamp value for tasks on this clamp bucket
> + * @tasks: number of RUNNABLE tasks on this clamp bucket
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_bucket {
> + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};
> +
> +/*
> + * struct uclamp_rq - rq's utilization clamp
> + * @value: currently active clamp values for a rq
> + * @bucket: utilization clamp buckets affecting a rq
> + *
> + * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
> + * A clamp value is affecting a rq when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * There are up to UCLAMP_CNT possible different clamp values, currently there
> + * are only two: minmum utilization and maximum utilization.

s/minmum/minimum

> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + * utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + * maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
> + * the metrics required to compute all the per-rq utilization clamp values.
> + */
> +struct uclamp_rq {
> + unsigned int value;
> + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -835,6 +877,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> @@ -1649,6 +1696,10 @@ extern const u32 sched_prio_to_wmult[40];
> struct sched_class {
> const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> +
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*yield_task) (struct rq *rq);
> --
> 2.20.1
>

2019-04-08 11:50:33

by Patrick Bellasi

Subject: Re: [PATCH v8 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 06-Apr 16:51, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:

[...]

> > + * The first few values calculated by this routine:
> > + * bf(0) = 1
> > + * bf(1) = 1
> > + * bf(2) = 2
> > + * bf(3) = 2
> > + * bf(4) = 3
> > + * ... and so on.
> > + */
> > +#define bits_per(n) \
> > +( \
> > + __builtin_constant_p(n) ? ( \
> > + ((n) == 0 || (n) == 1) ? 1 : ( \
> > + ((n) & (n - 1)) == 0 ? \
>
> missing braces around 'n'
> - ((n) & (n - 1)) == 0 ? \
> + ((n) & ((n) - 1)) == 0 ? \
>
> > + ilog2((n) - 1) + 2 : \
> > + ilog2((n) - 1) + 1 \
>
> Isn't this "((n) & ((n) - 1)) == 0 ? ilog2((n) - 1) + 2 : ilog2((n) -
> 1) + 1" expression equivalent to a simple "ilog2(n) + 1"?

Right, since we already have n=0 and n=1 as special cases, what you
propose should work for all n>=2.

>
> > + ) \
> > + ) : \
> > + __bits_per(n) \
> > +)
> > #endif /* _LINUX_LOG2_H */

[...]

> > +static inline unsigned int uclamp_bucket_base_value(unsigned int clamp_value)
>
> Where are you using uclamp_bucket_base_value()? I would expect its
> usage somewhere inside uclamp_rq_dec_id() when the last task in the
> bucket is dequeued but I don't see it...

This behavior is now moved into a dedicated patch, as per Peter's
request:

Message-ID: <20190314111849.gx6bl6myfjtaan7r@e110439-lin>

This function was left here to support the initialization code in
init_uclamp() but... I notice now that I'm doing the initialization in
a different way, thus I'll move it into the following patch.

> > +{
> > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > +}
> > +

[...]

> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > + struct uclamp_bucket *bucket;
> > + unsigned int rq_clamp;
> > +
> > + bucket = &uc_rq->bucket[uc_se->bucket_id];
> > + SCHED_WARN_ON(!bucket->tasks);
> > + if (likely(bucket->tasks))
> > + bucket->tasks--;
> > +
> > + if (likely(bucket->tasks))
>
> Shouldn't you adjust bucket->value if the remaining tasks in the
> bucket have a lower clamp value than the task that was just removed?

No, this is never done. As long as a bucket is not empty/idle we never
reset it to its nominal value. In this patch specifically, the value
is never changed since we moved the "local max tracking" bits into a
dedicated patch.

> > + return;
> > +
> > + rq_clamp = READ_ONCE(uc_rq->value);
> > + /*
> > + * Defensive programming: this should never happen. If it happens,
> > + * e.g. due to future modification, warn and fixup the expected value.
> > + */
> > + SCHED_WARN_ON(bucket->value > rq_clamp);
> > + if (bucket->value >= rq_clamp)
> > + WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> > +}

[...]

> > +static void __init init_uclamp(void)
> > +{
> > + unsigned int clamp_id;
> > + int cpu;
> > +
> > + for_each_possible_cpu(cpu) {
> > + struct uclamp_bucket *bucket;
> > + struct uclamp_rq *uc_rq;
> > + unsigned int bucket_id;
> > +
> > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> > +
> > + bucket_id = 1;
> > + while (bucket_id < UCLAMP_BUCKETS) {
> > + bucket = &uc_rq->bucket[bucket_id];
> > + bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> > + ++bucket_id;
> > + }
> > + }
> > + }

All the initialization code above is no longer required after the next
patch, which introduces "local max tracking".

> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> > +
> > + uc_se->value = uclamp_none(clamp_id);
> > + uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> > + }
> > +}

[...]

--
#include <best/regards.h>

Patrick Bellasi

2019-04-15 14:53:37

by Patrick Bellasi

Subject: Re: [PATCH v8 02/16] sched/core: Add bucket local max tracking

On 02-Apr 11:41, Patrick Bellasi wrote:
> Because of bucketization, different task-specific clamp values are
> tracked in the same bucket. For example, with 20% bucket size and
> assuming to have:
> Task1: util_min=25%
> Task2: util_min=35%
> both tasks will be refcounted in the [20..39]% bucket and always boosted
> only up to 20% thus implementing a simple floor aggregation normally
> used in histograms.
>
> In systems with only few and well-defined clamp values, it would be
> useful to track the exact clamp value required by a task whenever
> possible. For example, if a system requires only 23% and 47% boost
> values then it's possible to track the exact boost required by each
> task using only 3 buckets of ~33% size each.
>
> Introduce a mechanism to max aggregate the requested clamp values of
> RUNNABLE tasks in the same bucket. Keep it simple by resetting the
> bucket value to its base value only when a bucket becomes inactive.
> Allow a limited and controlled overboosting margin for tasks recounted
> in the same bucket.
>
> In systems where the boost values are not known in advance, it is still
> possible to control the maximum acceptable overboosting margin by tuning
> the number of clamp groups. For example, 20 groups ensure a 5% maximum
> overboost.
>
> Remove the rq bucket initialization code since a correct bucket value
> is now computed when a task is refcounted into a CPU's rq.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> --
> Changes in v8:
> Message-ID: <[email protected]>
> - split this code out from the previous patch
> ---
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++-------------------
> 1 file changed, 27 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 032211b72110..6e1beae5f348 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -778,6 +778,11 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> * When a task is enqueued on a rq, the clamp bucket currently defined by the
> * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> * updates the rq's clamp value if required.
> + *
> + * Tasks can have a task-specific value requested from user-space, track
> + * within each bucket the maximum value for tasks refcounted in it.
> + * This "local max aggregation" allows to track the exact "requested" value
> + * for each bucket when all its RUNNABLE tasks require the same clamp.
> */
> static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> unsigned int clamp_id)
> @@ -789,8 +794,15 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> bucket = &uc_rq->bucket[uc_se->bucket_id];
> bucket->tasks++;
>
> + /*
> + * Local max aggregation: rq buckets always track the max
> + * "requested" clamp value of its RUNNABLE tasks.
> + */
> + if (uc_se->value > bucket->value)
> + bucket->value = uc_se->value;
> +
> if (uc_se->value > READ_ONCE(uc_rq->value))
> - WRITE_ONCE(uc_rq->value, bucket->value);
> + WRITE_ONCE(uc_rq->value, uc_se->value);
> }
>
> /*
> @@ -815,6 +827,12 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> if (likely(bucket->tasks))
> bucket->tasks--;
>
> + /*
> + * Keep "local max aggregation" simple and accept to (possibly)
> + * overboost some RUNNABLE tasks in the same bucket.
> + * The rq clamp bucket value is reset to its base value whenever
> + * there are no more RUNNABLE tasks refcounting it.
> + */
> if (likely(bucket->tasks))
> return;
>
> @@ -824,8 +842,14 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> * e.g. due to future modification, warn and fixup the expected value.
> */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> - if (bucket->value >= rq_clamp)
> + if (bucket->value >= rq_clamp) {
> + /*
> + * Reset clamp bucket value to its nominal value whenever
> + * there are anymore RUNNABLE tasks refcounting it.
> + */
> + bucket->value = uclamp_bucket_base_value(bucket->value);

While running tests on Android, I found that the snippet above would
be better done in uclamp_rq_inc_id(), for two main reasons:
1. because of the early return in this function, we skip the reset in
case the task is not the last one running on a CPU, thus
triggering the above SCHED_WARN_ON
2. since a non active bucket is already ignored, we don't care about
resetting its "local max value"

I will move the "local max update" into uclamp_rq_inc_id(), with
something like:


/*
* Local max aggregation: rq buckets always track the max
* "requested" clamp value of its RUNNABLE tasks.
*/
if (bucket->tasks == 1 || uc_se->value > bucket->value)
bucket->value = uc_se->value;


This should guarantee an up-to-date local max every time a bucket is
active.


> WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> + }
> }
>
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> @@ -855,25 +879,9 @@ static void __init init_uclamp(void)
> unsigned int clamp_id;
> int cpu;
>
> - for_each_possible_cpu(cpu) {
> - struct uclamp_bucket *bucket;
> - struct uclamp_rq *uc_rq;
> - unsigned int bucket_id;
> -
> + for_each_possible_cpu(cpu)
> memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> - uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
> -
> - bucket_id = 1;
> - while (bucket_id < UCLAMP_BUCKETS) {
> - bucket = &uc_rq->bucket[bucket_id];
> - bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
> - ++bucket_id;
> - }
> - }
> - }
> -
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
>
> --
> 2.20.1
>

--
#include <best/regards.h>

Patrick Bellasi

2019-04-17 20:37:58

by Suren Baghdasaryan

Subject: Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX

Hi Patrick,

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> When a task sleeps it removes its max utilization clamp from its CPU.
> However, the blocked utilization on that CPU can be higher than the max
> clamp value enforced while the task was running. This allows undesired
> CPU frequency increases while a CPU is idle, for example, when another
> CPU on the same frequency domain triggers a frequency update, since
> schedutil can now see the full not clamped blocked utilization of the
> idle CPU.
>
> Fix this by using
> uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
> uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition.
>

If I understand the intent correctly, you are trying to exclude idle
CPUs from affecting calculations of the rq UCLAMP_MAX value. If that
is true, I think the description can be simplified a bit :) In
particular, it took me some time to understand what "blocked
utilization" means; however, if it's a widely accepted term then feel
free to ignore my input.

> Don't track any minimum utilization clamps since an idle CPU never
> requires a minimum frequency. The decay of the blocked utilization is
> good enough to reduce the CPU frequency.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> --
> Changes in v8:
> Message-ID: <20190314170619.rt6yhelj3y6dzypu@e110439-lin>
> - moved flag reset into uclamp_rq_inc()
> ---
> kernel/sched/core.c | 45 ++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 2 ++
> 2 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6e1beae5f348..046f61d33f00 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -754,8 +754,35 @@ static inline unsigned int uclamp_none(int clamp_id)
> return SCHED_CAPACITY_SCALE;
> }
>
> +static inline unsigned int
> +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> +{
> + /*
> + * Avoid blocked utilization pushing up the frequency when we go
> + * idle (which drops the max-clamp) by retaining the last known
> + * max-clamp.
> + */
> + if (clamp_id == UCLAMP_MAX) {
> + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> + return clamp_value;
> + }
> +
> + return uclamp_none(UCLAMP_MIN);
> +}
> +
> +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> + unsigned int clamp_value)
> +{
> + /* Reset max-clamp retention only on idle exit */
> + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> + return;
> +
> + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> +}
> +
> static inline
> -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> + unsigned int clamp_value)

IMHO the name of uclamp_rq_max_value() is a bit misleading because:
1. It does not imply that it has to be called only when there are no
more runnable tasks on a CPU. This is currently the case because it's
called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but
nothing in the name of this function indicates that it can't be called
from other places.
2. It does not imply that it marks rq UCLAMP_FLAG_IDLE.

> {
> struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> int bucket_id = UCLAMP_BUCKETS - 1;
> @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> }
>
> /* No tasks -- default clamp values */
> - return uclamp_none(clamp_id);
> + return uclamp_idle_value(rq, clamp_id, clamp_value);
> }
>
> /*
> @@ -794,6 +821,8 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> bucket = &uc_rq->bucket[uc_se->bucket_id];
> bucket->tasks++;
>
> + uclamp_idle_reset(rq, clamp_id, uc_se->value);
> +
> /*
> * Local max aggregation: rq buckets always track the max
> * "requested" clamp value of its RUNNABLE tasks.
> @@ -820,6 +849,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> struct uclamp_bucket *bucket;
> + unsigned int bkt_clamp;
> unsigned int rq_clamp;
>
> bucket = &uc_rq->bucket[uc_se->bucket_id];
> @@ -848,7 +878,8 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> * there are anymore RUNNABLE tasks refcounting it.
> */
> bucket->value = uclamp_bucket_base_value(bucket->value);
> - WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> + bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> + WRITE_ONCE(uc_rq->value, bkt_clamp);
> }
> }
>
> @@ -861,6 +892,10 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> uclamp_rq_inc_id(p, rq, clamp_id);
> +
> + /* Reset clamp idle holding when there is one RUNNABLE task */
> + if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
> }
>
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> @@ -879,8 +914,10 @@ static void __init init_uclamp(void)
> unsigned int clamp_id;
> int cpu;
>
> - for_each_possible_cpu(cpu)
> + for_each_possible_cpu(cpu) {
> memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> + cpu_rq(cpu)->uclamp_flags = 0;
> + }
>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c3d1ae1e7eec..d8b182f1782c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -880,6 +880,8 @@ struct rq {
> #ifdef CONFIG_UCLAMP_TASK
> /* Utilization clamp values based on CPU's RUNNABLE tasks */
> struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> + unsigned int uclamp_flags;
> +#define UCLAMP_FLAG_IDLE 0x01
> #endif
>
> struct cfs_rq cfs;
> --
> 2.20.1
>

2019-04-17 22:56:49

by Suren Baghdasaryan

Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> The SCHED_DEADLINE scheduling class provides an advanced and formal
> model to define tasks requirements that can translate into proper
> decisions for both task placements and frequencies selections. Other
> classes have a more simplified model based on the POSIX concept of
> priorities.
>
> Such a simple priority based model however does not allow to exploit
> most advanced features of the Linux scheduler like, for example, driving
> frequencies selection via the schedutil cpufreq governor. However, also
> for non SCHED_DEADLINE tasks, it's still interesting to define tasks
> properties to support scheduler decisions.
>
> Utilization clamping exposes to user-space a new set of per-task
> attributes the scheduler can use as hints about the expected/required
> utilization for a task. This allows to implement a "proactive" per-task
> frequency control policy, a more advanced policy than the current one
> based just on "passive" measured task utilization. For example, it's
> possible to boost interactive tasks (e.g. to get better performance) or
> cap background tasks (e.g. to be more energy/thermal efficient).
>
> Introduce a new API to set utilization clamping values for a specified
> task by extending sched_setattr(), a syscall which already allows to
> define task specific properties for different scheduling classes. A new
> pair of attributes allows to specify a minimum and maximum utilization
> the scheduler can consider for a task.
>
> Do that by validating the required clamp values before and then applying
> the required changes using _the_ same pattern already in use for
> __setscheduler(). This ensures that the task is re-enqueued with the new
> clamp values.
>
> Do not allow to change sched class specific params and non class
> specific params (i.e. clamp values) at the same time. This keeps things
> simple and still works for the most common cases since we are usually
> interested in just one of the two actions.

Sorry, I can't find where you are checking to eliminate the
possibility of simultaneous changes to both sched class specific
params and non class specific params... Am I too tired, or are they
indeed missing?

>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> ---
> Changes in v8:
> Others:
> - using p->uclamp_req to track clamp values "requested" from userspace
> ---
> include/linux/sched.h | 9 ++++
> include/uapi/linux/sched.h | 12 ++++-
> include/uapi/linux/sched/types.h | 66 ++++++++++++++++++++----
> kernel/sched/core.c | 87 +++++++++++++++++++++++++++++++-
> 4 files changed, 162 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d8491954e2e1..c2b81a84985b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -585,6 +585,7 @@ struct sched_dl_entity {
> * @value: clamp value "assigned" to a se
> * @bucket_id: bucket index corresponding to the "assigned" value
> * @active: the se is currently refcounted in a rq's bucket
> + * @user_defined: the requested clamp value comes from user-space
> *
> * The bucket_id is the index of the clamp bucket matching the clamp value
> * which is pre-computed and stored to avoid expensive integer divisions from
> @@ -594,11 +595,19 @@ struct sched_dl_entity {
> * which can be different from the clamp value "requested" from user-space.
> * This allows to know a task is refcounted in the rq's bucket corresponding
> * to the "effective" bucket_id.
> + *
> + * The user_defined bit is set whenever a task has got a task-specific clamp
> + * value requested from userspace, i.e. the system defaults apply to this task
> + * just as a restriction. This allows to relax default clamps when a less
> + * restrictive task-specific value has been requested, thus allowing to
> + * implement a "nice" semantic. For example, a task running with a 20%
> + * default boost can still drop its own boosting to 0%.
> */
> struct uclamp_se {
> unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> unsigned int active : 1;
> + unsigned int user_defined : 1;
> };
> #endif /* CONFIG_UCLAMP_TASK */
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 075c610adf45..d2c65617a4a4 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -53,10 +53,20 @@
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> #define SCHED_FLAG_KEEP_POLICY 0x08
> +#define SCHED_FLAG_KEEP_PARAMS 0x10
> +#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
> +#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
> +
> +#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
> + SCHED_FLAG_KEEP_PARAMS)
> +
> +#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
> + SCHED_FLAG_UTIL_CLAMP_MAX)
>
> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> SCHED_FLAG_RECLAIM | \
> SCHED_FLAG_DL_OVERRUN | \
> - SCHED_FLAG_KEEP_POLICY)
> + SCHED_FLAG_KEEP_ALL | \
> + SCHED_FLAG_UTIL_CLAMP)
>
> #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 10fbb8031930..c852153ddb0d 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -9,6 +9,7 @@ struct sched_param {
> };
>
> #define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
> +#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */
>
> /*
> * Extended scheduling parameters data structure.
> @@ -21,8 +22,33 @@ struct sched_param {
> * the tasks may be useful for a wide variety of application fields, e.g.,
> * multimedia, streaming, automation and control, and many others.
> *
> - * This variant (sched_attr) is meant at describing a so-called
> - * sporadic time-constrained task. In such model a task is specified by:
> + * This variant (sched_attr) allows to define additional attributes to
> + * improve the scheduler knowledge about task requirements.
> + *
> + * Scheduling Class Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes specifies the
> + * scheduling policy and relative POSIX attributes:
> + *
> + * @size size of the structure, for fwd/bwd compat.
> + *
> + * @sched_policy task's scheduling policy
> + * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
> + * @sched_priority task's static priority (SCHED_FIFO/RR)
> + *
> + * Certain more advanced scheduling features can be controlled by a
> + * predefined set of flags via the attribute:
> + *
> + * @sched_flags for customizing the scheduler behaviour
> + *
> + * Sporadic Time-Constrained Task Attributes
> + * =========================================
> + *
> + * A subset of sched_attr attributes allows to describe a so-called
> + * sporadic time-constrained task.
> + *
> + * In such a model a task is specified by:
> * - the activation period or minimum instance inter-arrival time;
> * - the maximum (or average, depending on the actual scheduling
> * discipline) computation time of all instances, a.k.a. runtime;
> @@ -34,14 +60,8 @@ struct sched_param {
> * than the runtime and must be completed by time instant t equal to
> * the instance activation time + the deadline.
> *
> - * This is reflected by the actual fields of the sched_attr structure:
> + * This is reflected by the following fields of the sched_attr structure:
> *
> - * @size size of the structure, for fwd/bwd compat.
> - *
> - * @sched_policy task's scheduling policy
> - * @sched_flags for customizing the scheduler behaviour
> - * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
> - * @sched_priority task's static priority (SCHED_FIFO/RR)
> * @sched_deadline representative of the task's deadline
> * @sched_runtime representative of the task's runtime
> * @sched_period representative of the task's period
> @@ -53,6 +73,29 @@ struct sched_param {
> * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
> * only user of this new interface. More information about the algorithm
> * available in the scheduling class file or in Documentation/.
> + *
> + * Task Utilization Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes allows to specify the utilization
> + * expected for a task. These attributes allow to inform the scheduler about
> + * the utilization boundaries within which it should schedule the task. These
> + * boundaries are valuable hints to support scheduler decisions on both task
> + * placement and frequency selection.
> + *
> + * @sched_util_min represents the minimum utilization
> + * @sched_util_max represents the maximum utilization
> + *
> + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
> + * represents the percentage of CPU time used by a task when running at the
> + * maximum frequency on the highest capacity CPU of the system. For example, a
> + * 20% utilization task is a task running for 2ms every 10ms at maximum
> + * frequency.
> + *
> + * A task with a min utilization value bigger than 0 is more likely scheduled
> + * on a CPU with a capacity big enough to fit the specified value.
> + * A task with a max utilization value smaller than 1024 is more likely
> + * scheduled on a CPU with no more capacity than the specified value.
> */
> struct sched_attr {
> __u32 size;
> @@ -70,6 +113,11 @@ struct sched_attr {
> __u64 sched_runtime;
> __u64 sched_deadline;
> __u64 sched_period;
> +
> + /* Utilization hints */
> + __u32 sched_util_min;
> + __u32 sched_util_max;
> +
> };
>
> #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20efb32e1a7e..68aed32e8ec7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1020,6 +1020,50 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> return result;
> }
>
> +static int uclamp_validate(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
> + unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + lower_bound = attr->sched_util_min;
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + upper_bound = attr->sched_util_max;
> +
> + if (lower_bound > upper_bound)
> + return -EINVAL;
> + if (upper_bound > SCHED_CAPACITY_SCALE)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static void __setscheduler_uclamp(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> + return;
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + unsigned int lower_bound = attr->sched_util_min;
> +
> + p->uclamp_req[UCLAMP_MIN].user_defined = true;
> + p->uclamp_req[UCLAMP_MIN].value = lower_bound;
> + p->uclamp_req[UCLAMP_MIN].bucket_id =
> + uclamp_bucket_id(lower_bound);
> + }
> +
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + unsigned int upper_bound = attr->sched_util_max;
> +
> + p->uclamp_req[UCLAMP_MAX].user_defined = true;
> + p->uclamp_req[UCLAMP_MAX].value = upper_bound;
> + p->uclamp_req[UCLAMP_MAX].bucket_id =
> + uclamp_bucket_id(upper_bound);
> + }
> +}
> +
> static void uclamp_fork(struct task_struct *p)
> {
> unsigned int clamp_id;
> @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> #else /* CONFIG_UCLAMP_TASK */
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline int uclamp_validate(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + return -ENODEV;

ENOSYS might be more appropriate?

> +}
> +static void __setscheduler_uclamp(struct task_struct *p,
> + const struct sched_attr *attr) { }
> static inline void uclamp_fork(struct task_struct *p) { }
> static inline void init_uclamp(void) { }
> #endif /* CONFIG_UCLAMP_TASK */
> @@ -4424,6 +4475,13 @@ static void __setscheduler_params(struct task_struct *p,
> static void __setscheduler(struct rq *rq, struct task_struct *p,
> const struct sched_attr *attr, bool keep_boost)
> {
> + /*
> + * If params can't change scheduling class changes aren't allowed
> + * either.
> + */
> + if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
> + return;
> +
> __setscheduler_params(p, attr);
>
> /*
> @@ -4561,6 +4619,13 @@ static int __sched_setscheduler(struct task_struct *p,
> return retval;
> }
>
> + /* Update task specific "requested" clamps */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> + retval = uclamp_validate(p, attr);
> + if (retval)
> + return retval;
> + }
> +
> /*
> * Make sure no PI-waiters arrive (or leave) while we are
> * changing the priority of the task:
> @@ -4590,6 +4655,8 @@ static int __sched_setscheduler(struct task_struct *p,
> goto change;
> if (dl_policy(policy) && dl_param_changed(p, attr))
> goto change;
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> + goto change;
>
> p->sched_reset_on_fork = reset_on_fork;
> task_rq_unlock(rq, p, &rf);
> @@ -4670,7 +4737,9 @@ static int __sched_setscheduler(struct task_struct *p,
> put_prev_task(rq, p);
>
> prev_class = p->sched_class;
> +
> __setscheduler(rq, p, attr, pi);
> + __setscheduler_uclamp(p, attr);
>
> if (queued) {
> /*
> @@ -4846,6 +4915,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
> if (ret)
> return -EFAULT;
>
> + if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
> + size < SCHED_ATTR_SIZE_VER1)
> + return -EINVAL;
> +
> /*
> * XXX: Do we want to be lenient like existing syscalls; or do we want
> * to be strict and return an error on out-of-bounds values?
> @@ -4922,10 +4995,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> rcu_read_lock();
> retval = -ESRCH;
> p = find_process_by_pid(pid);
> - if (p != NULL)
> - retval = sched_setattr(p, &attr);
> + if (likely(p))
> + get_task_struct(p);
> rcu_read_unlock();
>
> + if (likely(p)) {
> + retval = sched_setattr(p, &attr);
> + put_task_struct(p);
> + }
> +
> return retval;
> }
>
> @@ -5076,6 +5154,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
> else
> attr.sched_nice = task_nice(p);
>
> +#ifdef CONFIG_UCLAMP_TASK
> + attr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
> + attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
> +#endif
> +
> rcu_read_unlock();
>
> retval = sched_read_attr(uattr, &attr, size);
> --
> 2.20.1
>

2019-04-17 23:10:21

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> By default FAIR tasks start without clamps, i.e. neither boosted nor
> capped, and they run at the best frequency matching their utilization
> demand. This default behavior does not fit RT tasks which instead are
> expected to run at the maximum available frequency, if not otherwise
> required by explicitly capping them.
>
> Enforce the correct behavior for RT tasks by setting util_min to max
> whenever:
>
> 1. the task is switched to the RT class and it does not already have a
> user-defined clamp value assigned.
>
> 2. an RT task is forked from a parent with RESET_ON_FORK set.
>
> NOTE: utilization clamp values are cross scheduling class attributes and
> thus they are never changed/reset once a value has been explicitly
> defined from user-space.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/core.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bdebdabe9bc4..71c9dd6487b1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p,
> static void __setscheduler_uclamp(struct task_struct *p,
> const struct sched_attr *attr)
> {
> + unsigned int clamp_id;
> +
> + /*
> + * On scheduling class change, reset to default clamps for tasks
> + * without a task-specific value.
> + */
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
> + unsigned int clamp_value = uclamp_none(clamp_id);
> +
> + /* Keep using defined clamps across class changes */
> + if (uc_se->user_defined)
> + continue;
> +
> + /* By default, RT tasks always get 100% boost */
> + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> + clamp_value = uclamp_none(UCLAMP_MAX);
> +
> + uc_se->bucket_id = uclamp_bucket_id(clamp_value);
> + uc_se->value = clamp_value;

Is it possible for p->uclamp_req[UCLAMP_MAX].value to be less than
uclamp_none(UCLAMP_MAX) for this RT task? If that's a possibility then
I think we will end up with a case of p->uclamp_req[UCLAMP_MIN].value
> p->uclamp_req[UCLAMP_MAX].value after these assignments are done.

> + }
> +
> if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> return;
>
> @@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p)
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> unsigned int clamp_value = uclamp_none(clamp_id);
>
> + /* By default, RT tasks always get 100% boost */
> + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> + clamp_value = uclamp_none(UCLAMP_MAX);
> +
> p->uclamp_req[clamp_id].user_defined = false;
> p->uclamp_req[clamp_id].value = clamp_value;
> p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> --
> 2.20.1
>

2019-04-18 00:13:40

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller

On Tue, Apr 2, 2019 at 3:43 AM Patrick Bellasi <[email protected]> wrote:
>
> The cgroup CPU bandwidth controller allows to assign a specified
> (maximum) bandwidth to the tasks of a group. However this bandwidth is
> defined and enforced only on a temporal base, without considering the
> actual frequency a CPU is running on. Thus, the amount of computation
> completed by a task within an allocated bandwidth can be very different
> depending on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.
>
> With the availability of schedutil, the scheduler is now able
> to drive frequency selections based on actual task utilization.
> Moreover, the utilization clamping support provides a mechanism to
> bias the frequency selection operated by schedutil depending on
> constraints assigned to the tasks currently RUNNABLE on a CPU.
>
> Given the mechanisms described above, it is now possible to extend the
> cpu controller to specify the minimum (or maximum) utilization which
> should be considered for tasks RUNNABLE on a cpu.
> This makes it possible to better define the actual computational
> power assigned to task groups, thus improving the cgroup CPU bandwidth
> controller which is currently based just on time constraints.
>
> Extend the CPU controller with a couple of new attributes util.{min,max}
> which allows to enforce utilization boosting and capping for all the
> tasks in a group. Specifically:
>
> - util.min: defines the minimum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run at least at a
> minimum frequency which corresponds to the util.min
> utilization
>
> - util.max: defines the maximum utilization which should be considered
> i.e. the RUNNABLE tasks of this group will run up to a
> maximum frequency which corresponds to the util.max
> utilization
>
> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
> hierarchies, while system wide clamps are defined by a generic
> interface which does not depend on cgroups. This system wide
> interface enforces constraints on tasks in the root node.
>
> b) enforce effective constraints at each level of the hierarchy which
> are a restriction of the group requests considering its parent's
> effective constraints. Root group effective constraints are defined
> by the system wide interface.
> This mechanism allows each (non-root) level of the hierarchy to:
> - request whatever clamp values it would like to get
> - effectively get only up to the maximum amount allowed by its parent
>
> c) have higher priority than task-specific clamps, defined via
> sched_setattr(), thus allowing to control and restrict task requests
>
> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Validate local consistency by enforcing util.min < util.max.
> Keep it simple by not caring, for now, about "effective" values computation
> and propagation along the hierarchy.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
>
> --
> Changes in v8:
> Message-ID: <[email protected]>
> - update changelog description for points b), c) and following paragraph
> ---
> Documentation/admin-guide/cgroup-v2.rst | 27 +++++
> init/Kconfig | 22 ++++
> kernel/sched/core.c | 142 +++++++++++++++++++++++-
> kernel/sched/sched.h | 6 +
> 4 files changed, 196 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 7bf3f129c68b..47710a77f4fa 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
> normal scheduling policy and absolute bandwidth allocation model for
> realtime scheduling policy.
>
> +Cycles distribution is based, by default, on a temporal criterion and it
> +does not account for the frequency at which tasks are executed.
> +The (optional) utilization clamping support allows to enforce a minimum
> +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
> +which should never be exceeded by a CPU.
> +
> WARNING: cgroup2 doesn't yet support control of realtime processes and
> the cpu controller can only be enabled when all RT processes are in
> the root cgroup. Be aware that system management software may already
> @@ -974,6 +980,27 @@ All time durations are in microseconds.
> Shows pressure stall information for CPU. See
> Documentation/accounting/psi.txt for details.
>
> + cpu.util.min
> + A read-write single value file which exists on non-root cgroups.
> + The default is "0", i.e. no utilization boosting.
> +
> + The requested minimum utilization in the range [0, 1024].
> +
> + This interface allows reading and setting minimum utilization clamp
> + values similar to the sched_setattr(2). This minimum utilization
> + value is used to clamp the task specific minimum utilization clamp.
> +
> + cpu.util.max
> + A read-write single value file which exists on non-root cgroups.
> + The default is "1024", i.e. no utilization capping.
> +
> + The requested maximum utilization in the range [0, 1024].
> +
> + This interface allows reading and setting maximum utilization clamp
> + values similar to the sched_setattr(2). This maximum utilization
> + value is used to clamp the task specific maximum utilization clamp.
> +
> +
>
> Memory
> ------
> diff --git a/init/Kconfig b/init/Kconfig
> index 7439cbf4d02e..33006e8de996 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -877,6 +877,28 @@ config RT_GROUP_SCHED
>
> endif #CGROUP_SCHED
>
> +config UCLAMP_TASK_GROUP
> + bool "Utilization clamping per group of tasks"
> + depends on CGROUP_SCHED
> + depends on UCLAMP_TASK
> + default n
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> + When this option is enabled, the user can specify a min and max
> + CPU bandwidth which is allowed for each single task in a group.
> + The max bandwidth allows to clamp the maximum frequency a task
> + can use, while the min bandwidth allows to define a minimum
> + frequency a task will always use.
> +
> + When task group based utilization clamping is enabled, any
> + task-specific clamp value specified is constrained by the cgroup
> + specified clamp value. Both minimum and maximum task clamping cannot
> + be bigger than the corresponding clamping defined at task group level.
> +
> + If in doubt, say N.
> +
> config CGROUP_PIDS
> bool "PIDs controller"
> help
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71c9dd6487b1..aeed2dd315cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1130,8 +1130,12 @@ static void __init init_uclamp(void)
> /* System defaults allow max clamp values for both indexes */
> uc_max.value = uclamp_none(UCLAMP_MAX);
> uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> uclamp_default[clamp_id] = uc_max;
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + root_task_group.uclamp_req[clamp_id] = uc_max;
> +#endif
> + }
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> @@ -6720,6 +6724,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
> /* task_group_lock serializes the addition/removal of task groups */
> static DEFINE_SPINLOCK(task_group_lock);
>
> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> + struct task_group *parent)
> +{
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + int clamp_id;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
> +#endif
> +
> + return 1;

Looks like you never return anything other than 1, neither here nor in
the following patches, I think...

> +}
> +
> static void sched_free_group(struct task_group *tg)
> {
> free_fair_sched_group(tg);
> @@ -6743,6 +6760,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> + if (!alloc_uclamp_sched_group(tg, parent))
> + goto err;
> +
> return tg;
>
> err:
> @@ -6963,6 +6983,100 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> sched_move_task(task);
> }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 min_value)
> +{
> + struct task_group *tg;
> + int ret = 0;
> +
> + if (min_value > SCHED_CAPACITY_SCALE)
> + return -ERANGE;
> +
> + rcu_read_lock();
> +
> + tg = css_tg(css);
> + if (tg == &root_task_group) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
> + goto out;
> + if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Update tg's "requested" clamp value */
> + tg->uclamp_req[UCLAMP_MIN].value = min_value;
> + tg->uclamp_req[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
> +
> +out:
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> + struct cftype *cftype, u64 max_value)
> +{
> + struct task_group *tg;
> + int ret = 0;
> +
> + if (max_value > SCHED_CAPACITY_SCALE)
> + return -ERANGE;
> +
> + rcu_read_lock();
> +
> + tg = css_tg(css);
> + if (tg == &root_task_group) {
> + ret = -EINVAL;
> + goto out;
> + }
> + if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
> + goto out;
> + if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Update tg's "requested" clamp value */
> + tg->uclamp_req[UCLAMP_MAX].value = max_value;
> + tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
> +
> +out:
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> + enum uclamp_id clamp_id)
> +{
> + struct task_group *tg;
> + u64 util_clamp;
> +
> + rcu_read_lock();
> + tg = css_tg(css);
> + util_clamp = tg->uclamp_req[clamp_id].value;
> + rcu_read_unlock();
> +
> + return util_clamp;
> +}
> +
> +static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MIN);
> +}
> +
> +static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MAX);
> +}
> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 shareval)
> @@ -7300,6 +7414,18 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_rt_period_read_uint,
> .write_u64 = cpu_rt_period_write_uint,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "util.min",
> + .read_u64 = cpu_util_min_read_u64,
> + .write_u64 = cpu_util_min_write_u64,
> + },
> + {
> + .name = "util.max",
> + .read_u64 = cpu_util_max_read_u64,
> + .write_u64 = cpu_util_max_write_u64,
> + },
> #endif
> { } /* Terminate */
> };
> @@ -7467,6 +7593,20 @@ static struct cftype cpu_files[] = {
> .seq_show = cpu_max_show,
> .write = cpu_max_write,
> },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + {
> + .name = "util.min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_min_read_u64,
> + .write_u64 = cpu_util_min_write_u64,
> + },
> + {
> + .name = "util.max",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_max_read_u64,
> + .write_u64 = cpu_util_max_write_u64,
> + },
> #endif
> { } /* terminate */
> };
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6ae3628248eb..b46b6912beba 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -399,6 +399,12 @@ struct task_group {
> #endif
>
> struct cfs_bandwidth cfs_bandwidth;
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> + /* Clamp values requested for a task group */
> + struct uclamp_se uclamp_req[UCLAMP_CNT];
> +#endif
> +
> };
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.20.1
>

2019-04-18 00:53:13

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> Tasks without a user-defined clamp value are considered not clamped
> and by default their utilization can have any value in the
> [0..SCHED_CAPACITY_SCALE] range.
>
> Tasks with a user-defined clamp value are allowed to request any value
> in that range, and the required clamp is unconditionally enforced.
> However, a "System Management Software" could be interested in limiting
> the range of clamp values allowed for all tasks.
>
> Add a privileged interface to define a system default configuration via:
>
> /proc/sys/kernel/sched_uclamp_util_{min,max}
>
> which works as an unconditional clamp range restriction for all tasks.
>
> With the default configuration, the full SCHED_CAPACITY_SCALE range of
> values is allowed for each clamp index. Otherwise, the task-specific
> clamp is capped by the corresponding system default value.
>
> Do that by tracking, for each task, the "effective" clamp value and
> bucket the task has been refcounted in at enqueue time. This
> allows lazily aggregating "requested" and "system default" values at
> enqueue time and simplifies refcounting updates at dequeue time.
>
> The cached bucket ids are used to avoid (relatively) more expensive
> integer divisions every time a task is enqueued.
>
> An active flag is used to report when the "effective" value is valid and
> thus the task is actually refcounted in the corresponding rq's bucket.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> ---
> Changes in v8:
> Message-ID: <[email protected]>
> - add "requested" values uclamp_se instance beside the existing
> "effective" values instance
> - rename uclamp_effective_{get,assign}() into uclamp_eff_{get,set}()
> - make uclamp_eff_get() return the new "effective" values by copy
> Message-ID: <20190318125844.ajhjpaqlcgxn7qkq@e110439-lin>
> - run uclamp_fork() code independently from the class being supported.
> Resetting active flag is not harmful and following patches will add
> other code which still needs to be executed independently from class
> support.
> Message-ID: <[email protected]>
> - add sysctl_sched_uclamp_handler()'s internal mutex to serialize
> concurrent usages
> ---
> include/linux/sched.h | 10 +++
> include/linux/sched/sysctl.h | 11 +++
> kernel/sched/core.c | 131 ++++++++++++++++++++++++++++++++++-
> kernel/sysctl.c | 16 +++++
> 4 files changed, 167 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0c0dd7aac8e9..d8491954e2e1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -584,14 +584,21 @@ struct sched_dl_entity {
> * Utilization clamp for a scheduling entity
> * @value: clamp value "assigned" to a se
> * @bucket_id: bucket index corresponding to the "assigned" value
> + * @active: the se is currently refcounted in a rq's bucket
> *
> * The bucket_id is the index of the clamp bucket matching the clamp value
> * which is pre-computed and stored to avoid expensive integer divisions from
> * the fast path.
> + *
> + * The active bit is set whenever a task has got an "effective" value assigned,
> + * which can be different from the clamp value "requested" from user-space.
> + * This allows to know a task is refcounted in the rq's bucket corresponding
> + * to the "effective" bucket_id.
> */
> struct uclamp_se {
> unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> + unsigned int active : 1;
> };
> #endif /* CONFIG_UCLAMP_TASK */
>
> @@ -676,6 +683,9 @@ struct task_struct {
> struct sched_dl_entity dl;
>
> #ifdef CONFIG_UCLAMP_TASK
> + /* Clamp values requested for a scheduling entity */
> + struct uclamp_se uclamp_req[UCLAMP_CNT];
> + /* Effective clamp values used for a scheduling entity */
> struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 99ce6d728df7..d4f6215ee03f 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
> extern unsigned int sysctl_sched_rt_period;
> extern int sysctl_sched_rt_runtime;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern unsigned int sysctl_sched_uclamp_util_min;
> +extern unsigned int sysctl_sched_uclamp_util_max;
> +#endif
> +
> #ifdef CONFIG_CFS_BANDWIDTH
> extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> #endif
> @@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp,
> loff_t *ppos);
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos);
> +#endif
> +
> extern int sysctl_numa_balancing(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp,
> loff_t *ppos);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 046f61d33f00..d368ac26b8aa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -733,6 +733,14 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
>
> #ifdef CONFIG_UCLAMP_TASK
> +/* Max allowed minimum utilization */
> +unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> +
> +/* Max allowed maximum utilization */
> +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> +
> +/* All clamps are required to be less or equal than these values */
> +static struct uclamp_se uclamp_default[UCLAMP_CNT];
>
> /* Integer rounded range for each bucket */
> #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
> @@ -801,6 +809,52 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> return uclamp_idle_value(rq, clamp_id, clamp_value);
> }
>
> +/*
> + * The effective clamp bucket index of a task depends on, by increasing
> + * priority:
> + * - the task specific clamp value, when explicitly requested from userspace
> + * - the system default clamp value, defined by the sysadmin
> + */
> +static inline struct uclamp_se
> +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> + struct uclamp_se uc_max = uclamp_default[clamp_id];
> +
> + /* System default restrictions always apply */
> + if (unlikely(uc_req.value > uc_max.value))
> + return uc_max;
> +
> + return uc_req;
> +}
> +
> +static inline unsigned int
> +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)

This function is not used anywhere AFAICT. uclamp_eff_bucket_id() and
uclamp_eff_value() look very similar; maybe they can be combined into
one function returning struct uclamp_se?

> +{
> + struct uclamp_se uc_eff;
> +
> + /* Task currently refcounted: use back-annotated (effective) bucket */
> + if (p->uclamp[clamp_id].active)
> + return p->uclamp[clamp_id].bucket_id;
> +
> + uc_eff = uclamp_eff_get(p, clamp_id);
> +
> + return uc_eff.bucket_id;
> +}
> +
> +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct uclamp_se uc_eff;
> +
> + /* Task currently refcounted: use back-annotated (effective) value */
> + if (p->uclamp[clamp_id].active)
> + return p->uclamp[clamp_id].value;
> +
> + uc_eff = uclamp_eff_get(p, clamp_id);
> +
> + return uc_eff.value;
> +}
> +
> /*
> * When a task is enqueued on a rq, the clamp bucket currently defined by the
> * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> @@ -818,8 +872,12 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> struct uclamp_bucket *bucket;
>
> + /* Update task effective clamp */
> + p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
> +
> bucket = &uc_rq->bucket[uc_se->bucket_id];
> bucket->tasks++;
> + uc_se->active = true;
>
> uclamp_idle_reset(rq, clamp_id, uc_se->value);
>
> @@ -856,6 +914,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> SCHED_WARN_ON(!bucket->tasks);
> if (likely(bucket->tasks))
> bucket->tasks--;
> + uc_se->active = false;
>
> /*
> * Keep "local max aggregation" simple and accept to (possibly)
> @@ -909,8 +968,69 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> uclamp_rq_dec_id(p, rq, clamp_id);
> }
>
> +int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp,
> + loff_t *ppos)
> +{
> + int old_min, old_max;
> + static DEFINE_MUTEX(mutex);
> + int result;
> +
> + mutex_lock(&mutex);
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> + sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto undo;
> + }
> +
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uclamp_default[UCLAMP_MIN].value =
> + sysctl_sched_uclamp_util_min;
> + uclamp_default[UCLAMP_MIN].bucket_id =
> + uclamp_bucket_id(sysctl_sched_uclamp_util_min);
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + uclamp_default[UCLAMP_MAX].value =
> + sysctl_sched_uclamp_util_max;
> + uclamp_default[UCLAMP_MAX].bucket_id =
> + uclamp_bucket_id(sysctl_sched_uclamp_util_max);
> + }
> +
> + /*
> + * Updating all the RUNNABLE task is expensive, keep it simple and do
> + * just a lazy update at each next enqueue time.
> + */
> + goto done;
> +
> +undo:
> + sysctl_sched_uclamp_util_min = old_min;
> + sysctl_sched_uclamp_util_max = old_max;
> +done:
> + mutex_unlock(&mutex);
> +
> + return result;
> +}
> +
> +static void uclamp_fork(struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + p->uclamp[clamp_id].active = false;
> +}
> +
> static void __init init_uclamp(void)
> {
> + struct uclamp_se uc_max = {};
> unsigned int clamp_id;
> int cpu;
>
> @@ -920,16 +1040,23 @@ static void __init init_uclamp(void)
> }
>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> - struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> + struct uclamp_se *uc_se = &init_task.uclamp_req[clamp_id];
>
> uc_se->value = uclamp_none(clamp_id);
> uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> }
> +
> + /* System defaults allow max clamp values for both indexes */
> + uc_max.value = uclamp_none(UCLAMP_MAX);
> + uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_default[clamp_id] = uc_max;
> }
>
> #else /* CONFIG_UCLAMP_TASK */
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_fork(struct task_struct *p) { }
> static inline void init_uclamp(void) { }
> #endif /* CONFIG_UCLAMP_TASK */
>
> @@ -2530,6 +2657,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> */
> p->prio = current->normal_prio;
>
> + uclamp_fork(p);
> +
> /*
> * Revert to default priority/policy on fork if requested.
> */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 987ae08147bf..72277f09887d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -446,6 +446,22 @@ static struct ctl_table kern_table[] = {
> .mode = 0644,
> .proc_handler = sched_rr_handler,
> },
> +#ifdef CONFIG_UCLAMP_TASK
> + {
> + .procname = "sched_uclamp_util_min",
> + .data = &sysctl_sched_uclamp_util_min,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = sysctl_sched_uclamp_handler,
> + },
> + {
> + .procname = "sched_uclamp_util_max",
> + .data = &sysctl_sched_uclamp_util_max,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = sysctl_sched_uclamp_handler,
> + },
> +#endif
> #ifdef CONFIG_SCHED_AUTOGROUP
> {
> .procname = "sched_autogroup_enabled",
> --
> 2.20.1
>

2019-05-07 10:12:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX

On 17-Apr 13:36, Suren Baghdasaryan wrote:
> Hi Patrick,
>
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
> >
> > When a task sleeps it removes its max utilization clamp from its CPU.
> > However, the blocked utilization on that CPU can be higher than the max
> > clamp value enforced while the task was running. This allows undesired
> > CPU frequency increases while a CPU is idle, for example, when another
> > CPU on the same frequency domain triggers a frequency update, since
> > schedutil can now see the full not clamped blocked utilization of the
> > idle CPU.
> >
> > Fix this by using
> > uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
> > uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
> > to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> > condition.
> >
>
> If I understand the intent correctly, you are trying to exclude idle
> CPUs from affecting calculations of the rq UCLAMP_MAX value. If that is
> true I think the description can be simplified a bit :)

That's not entirely correct. What I want to avoid is an OPP increase
because of an idle CPU. Maybe an example can explain it better,
consider this sequence:

1. A task is running unconstrained on a CPUx and it generates a 100%
utilization
2. The task is now constrained by setting util_max=20
3. We now select an OPP which provides 20% capacity on CPUx

In this scenario the task is still running flat out on that CPUx, which
will keep its util_avg at 1024. Note that after Vincent's PELT rewrite
we don't converge down to the current capacity.

4. The task sleeps: it's removed from CPUx but the "blocked
utilization" is still 1024

After this point the CPU is idle and its "blocked utilization" starts
to "slowly" decay, but we have _already_ removed the 20% util_max
constraint on that CPU since there are no RUNNABLE tasks (i.e. no
active buckets).

At this point in time, if there is a schedutil update requested from
another CPU of the same frequency domain, by looking at CPUx we will
see its full "blocked utilization" signal, which can be above 20%.

> In particular it took me some time to understand what "blocked
> utilization" means, however if it's a widely accepted term then feel
> free to ignore my input.

Yes, "blocked utilization" is a commonly used term: it refers to the
utilization still accounted to a CPU by tasks which recently ran there
but are currently sleeping (blocked), and it decays over time while the
CPU is idle.

[...]

> > +static inline unsigned int
> > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> > +{
> > + /*
> > + * Avoid blocked utilization pushing up the frequency when we go
> > + * idle (which drops the max-clamp) by retaining the last known
> > + * max-clamp.
> > + */
> > + if (clamp_id == UCLAMP_MAX) {
> > + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> > + return clamp_value;
> > + }
> > +
> > + return uclamp_none(UCLAMP_MIN);
> > +}
> > +
> > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> > + unsigned int clamp_value)
> > +{
> > + /* Reset max-clamp retention only on idle exit */
> > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> > + return;
> > +
> > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> > +}
> > +
> > static inline
> > -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> > +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> > + unsigned int clamp_value)
>
> IMHO the name of uclamp_rq_max_value() is a bit misleading because:

That's very similar to what you proposed in:

https://lore.kernel.org/lkml/20190314122256.7wb3ydswpkfmntvf@e110439-lin/

> 1. It does not imply that it has to be called only when there are no
> more runnable tasks on a CPU. This is currently the case because it's
> called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but
> nothing in the name of this function indicates that it can't be called
> from other places.
> 2. It does not imply that it marks rq UCLAMP_FLAG_IDLE.

Even if you call it from other places, which is not required, it does
no harm. That function still returns the current max clamp for a CPU
given its current state. If the CPU is idle we set the flag one more
time, but that's not a problem either.

However, do you have any other proposal for a better name?

> > {
> > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > int bucket_id = UCLAMP_BUCKETS - 1;
> > @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> > }
> >
> > /* No tasks -- default clamp values */
> > - return uclamp_none(clamp_id);
> > + return uclamp_idle_value(rq, clamp_id, clamp_value);
> > }

[...]

--
#include <best/regards.h>

Patrick Bellasi

2019-05-07 10:39:56

by Patrick Bellasi

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On 17-Apr 17:51, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:

[...]

> > +/*
> > + * The effective clamp bucket index of a task depends on, by increasing
> > + * priority:
> > + * - the task specific clamp value, when explicitly requested from userspace
> > + * - the system default clamp value, defined by the sysadmin
> > + */
> > +static inline struct uclamp_se
> > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > + struct uclamp_se uc_max = uclamp_default[clamp_id];
> > +
> > + /* System default restrictions always apply */
> > + if (unlikely(uc_req.value > uc_max.value))
> > + return uc_max;
> > +
> > + return uc_req;
> > +}
> > +
> > +static inline unsigned int
> > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
>
> This function is not used anywhere AFAIKT.

Right, this is the dual of uclamp_eff_value() but, since we don't
actually use it in the current code, let's remove it and keep only the
latter.

> uclamp_eff_bucket_id() and
> uclamp_eff_value() look very similar, maybe they can be combined into
> one function returning struct uclamp_se?

I would prefer not to, since at the callsites of uclamp_eff_value() we
actually need just the value.

> > +{
> > + struct uclamp_se uc_eff;
> > +
> > + /* Task currently refcounted: use back-annotated (effective) bucket */
> > + if (p->uclamp[clamp_id].active)
> > + return p->uclamp[clamp_id].bucket_id;
> > +
> > + uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > + return uc_eff.bucket_id;
> > +}
> > +
> > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct uclamp_se uc_eff;
> > +
> > + /* Task currently refcounted: use back-annotated (effective) value */
> > + if (p->uclamp[clamp_id].active)
> > + return p->uclamp[clamp_id].value;
> > +
> > + uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > + return uc_eff.value;
> > +}
> > +

--
#include <best/regards.h>

Patrick Bellasi

2019-05-07 11:14:58

by Patrick Bellasi

Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On 17-Apr 15:26, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:

[...]

> > Do not allow to change sched class specific params and non class
> > specific params (i.e. clamp values) at the same time. This keeps things
> > simple and still works for the most common cases since we are usually
> > interested in just one of the two actions.
>
> Sorry, I can't find where you are checking to eliminate the
> possibility of simultaneous changes to both sched class specific
> params and non class specific params... Am I too tired or they are
> indeed missing?

No, you're right... that limitation has been removed in v8 :)

I'll remove the above paragraph in v9, thanks for spotting it.

[...]

> > +static int uclamp_validate(struct task_struct *p,
> > + const struct sched_attr *attr)
> > +{
> > + unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
> > + unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
> > +
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> > + lower_bound = attr->sched_util_min;
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > + upper_bound = attr->sched_util_max;
> > +
> > + if (lower_bound > upper_bound)
> > + return -EINVAL;
> > + if (upper_bound > SCHED_CAPACITY_SCALE)
> > + return -EINVAL;
> > +
> > + return 0;
> > +}

[...]

> > static void uclamp_fork(struct task_struct *p)
> > {
> > unsigned int clamp_id;
> > @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> > #else /* CONFIG_UCLAMP_TASK */
> > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > +static inline int uclamp_validate(struct task_struct *p,
> > + const struct sched_attr *attr)
> > +{
> > + return -ENODEV;
>
> ENOSYS might be more appropriate?

Yep, agree, thanks!

>
> > +}
> > +static void __setscheduler_uclamp(struct task_struct *p,
> > + const struct sched_attr *attr) { }
> > static inline void uclamp_fork(struct task_struct *p) { }
> > static inline void init_uclamp(void) { }
> > #endif /* CONFIG_UCLAMP_TASK */
> > @@ -4424,6 +4475,13 @@ static void __setscheduler_params(struct task_struct *p,
> > static void __setscheduler(struct rq *rq, struct task_struct *p,
> > const struct sched_attr *attr, bool keep_boost)
> > {
> > + /*
> > + * If params can't change scheduling class changes aren't allowed
> > + * either.
> > + */
> > + if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
> > + return;
> > +
> > __setscheduler_params(p, attr);
> >
> > /*
> > @@ -4561,6 +4619,13 @@ static int __sched_setscheduler(struct task_struct *p,
> > return retval;
> > }
> >
> > + /* Update task specific "requested" clamps */
> > + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> > + retval = uclamp_validate(p, attr);
> > + if (retval)
> > + return retval;
> > + }
> > +
> > /*
> > * Make sure no PI-waiters arrive (or leave) while we are
> > * changing the priority of the task:

[...]

--
#include <best/regards.h>

Patrick Bellasi

2019-05-07 11:26:35

by Patrick Bellasi

Subject: Re: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks

On 17-Apr 16:07, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
> >
> > By default FAIR tasks start without clamps, i.e. neither boosted nor
> > capped, and they run at the best frequency matching their utilization
> > demand. This default behavior does not fit RT tasks which instead are
> > expected to run at the maximum available frequency, if not otherwise
> > required by explicitly capping them.
> >
> > Enforce the correct behavior for RT tasks by setting util_min to max
> > whenever:
> >
> > 1. the task is switched to the RT class and it does not already have a
> > user-defined clamp value assigned.
> >
> > 2. an RT task is forked from a parent with RESET_ON_FORK set.
> >
> > NOTE: utilization clamp values are cross scheduling class attributes and
> > thus they are never changed/reset once a value has been explicitly
> > defined from user-space.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > ---
> > kernel/sched/core.c | 26 ++++++++++++++++++++++++++
> > 1 file changed, 26 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bdebdabe9bc4..71c9dd6487b1 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p,
> > static void __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> > {
> > + unsigned int clamp_id;
> > +
> > + /*
> > + * On scheduling class change, reset to default clamps for tasks
> > + * without a task-specific value.
> > + */
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
> > + unsigned int clamp_value = uclamp_none(clamp_id);
> > +
> > + /* Keep using defined clamps across class changes */
> > + if (uc_se->user_defined)
> > + continue;
> > +
> > + /* By default, RT tasks always get 100% boost */
> > + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > + clamp_value = uclamp_none(UCLAMP_MAX);
> > +
> > + uc_se->bucket_id = uclamp_bucket_id(clamp_value);
> > + uc_se->value = clamp_value;
>
> Is it possible for p->uclamp_req[UCLAMP_MAX].value to be less than
> uclamp_none(UCLAMP_MAX) for this RT task? If that's a possibility then
> I think we will end up with a case of
> p->uclamp_req[UCLAMP_MIN].value > p->uclamp_req[UCLAMP_MAX].value
> after these assignments are done.
>

The util_max of an RT task can be less than uclamp_none(UCLAMP_MAX);
however, requesting a task-specific util_max which is smaller than the
current util_min will fail in:

__sched_setscheduler()
uclamp_validate()

since we only allow util_min <= util_max for all task specific values.

> > + }
> > +
> > if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> > return;
> >
> > @@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p)
> > for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > unsigned int clamp_value = uclamp_none(clamp_id);
> >
> > + /* By default, RT tasks always get 100% boost */
> > + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> > + clamp_value = uclamp_none(UCLAMP_MAX);
> > +
> > p->uclamp_req[clamp_id].user_defined = false;
> > p->uclamp_req[clamp_id].value = clamp_value;
> > p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> > --
> > 2.20.1
> >

--
#include <best/regards.h>

Patrick Bellasi

2019-05-07 11:45:21

by Patrick Bellasi

Subject: Re: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller

On 17-Apr 17:12, Suren Baghdasaryan wrote:
> On Tue, Apr 2, 2019 at 3:43 AM Patrick Bellasi <[email protected]> wrote:
> >
> > The cgroup CPU bandwidth controller allows to assign a specified
> > (maximum) bandwidth to the tasks of a group. However this bandwidth is
> > defined and enforced only on a temporal base, without considering the
> > actual frequency a CPU is running on. Thus, the amount of computation
> > completed by a task within an allocated bandwidth can be very different
> > depending on the actual frequency the CPU is running that task.
> > The amount of computation can be affected also by the specific CPU a
> > task is running on, especially when running on asymmetric capacity
> > systems like Arm's big.LITTLE.
> >
> > With the availability of schedutil, the scheduler is now able
> > to drive frequency selections based on actual task utilization.
> > Moreover, the utilization clamping support provides a mechanism to
> > bias the frequency selection operated by schedutil depending on
> > constraints assigned to the tasks currently RUNNABLE on a CPU.
> >
> > Giving the mechanisms described above, it is now possible to extend the
> > cpu controller to specify the minimum (or maximum) utilization which
> > should be considered for tasks RUNNABLE on a cpu.
> > This makes it possible to better defined the actual computational
> > power assigned to task groups, thus improving the cgroup CPU bandwidth
> > controller which is currently based just on time constraints.
> >
> > Extend the CPU controller with a couple of new attributes util.{min,max}
> > which allows to enforce utilization boosting and capping for all the
> > tasks in a group. Specifically:
> >
> > - util.min: defines the minimum utilization which should be considered
> > i.e. the RUNNABLE tasks of this group will run at least at a
> > minimum frequency which corresponds to the util.min
> > utilization
> >
> > - util.max: defines the maximum utilization which should be considered
> > i.e. the RUNNABLE tasks of this group will run up to a
> > maximum frequency which corresponds to the util.max
> > utilization
> >
> > These attributes:
> >
> > a) are available only for non-root nodes, both on default and legacy
> > hierarchies, while system wide clamps are defined by a generic
> > interface which does not depends on cgroups. This system wide
> > interface enforces constraints on tasks in the root node.
> >
> > b) enforce effective constraints at each level of the hierarchy which
> > are a restriction of the group requests considering its parent's
> > effective constraints. Root group effective constraints are defined
> > by the system wide interface.
> > This mechanism allows each (non-root) level of the hierarchy to:
> > - request whatever clamp values it would like to get
> > - effectively get only up to the maximum amount allowed by its parent
> >
> > c) have higher priority than task-specific clamps, defined via
> > sched_setattr(), thus allowing to control and restrict task requests
> >
> > Add two new attributes to the cpu controller to collect "requested"
> > clamp values. Allow that at each non-root level of the hierarchy.
> > Validate local consistency by enforcing util.min < util.max.
> > Keep it simple by do not caring now about "effective" values computation
> > and propagation along the hierarchy.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> >
> > --
> > Changes in v8:
> > Message-ID: <[email protected]>
> > - update changelog description for points b), c) and following paragraph
> > ---
> > Documentation/admin-guide/cgroup-v2.rst | 27 +++++
> > init/Kconfig | 22 ++++
> > kernel/sched/core.c | 142 +++++++++++++++++++++++-
> > kernel/sched/sched.h | 6 +
> > 4 files changed, 196 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 7bf3f129c68b..47710a77f4fa 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
> > normal scheduling policy and absolute bandwidth allocation model for
> > realtime scheduling policy.
> >
> > +Cycles distribution is based, by default, on a temporal base and it
> > +does not account for the frequency at which tasks are executed.
> > +The (optional) utilization clamping support allows to enforce a minimum
> > +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
> > +which should never be exceeded by a CPU.
> > +
> > WARNING: cgroup2 doesn't yet support control of realtime processes and
> > the cpu controller can only be enabled when all RT processes are in
> > the root cgroup. Be aware that system management software may already
> > @@ -974,6 +980,27 @@ All time durations are in microseconds.
> > Shows pressure stall information for CPU. See
> > Documentation/accounting/psi.txt for details.
> >
> > + cpu.util.min
> > + A read-write single value file which exists on non-root cgroups.
> > + The default is "0", i.e. no utilization boosting.
> > +
> > + The requested minimum utilization in the range [0, 1024].
> > +
> > + This interface allows reading and setting minimum utilization clamp
> > + values similar to the sched_setattr(2). This minimum utilization
> > + value is used to clamp the task specific minimum utilization clamp.
> > +
> > + cpu.util.max
> > + A read-write single value file which exists on non-root cgroups.
> > + The default is "1024". i.e. no utilization capping
> > +
> > + The requested maximum utilization in the range [0, 1024].
> > +
> > + This interface allows reading and setting maximum utilization clamp
> > + values similar to the sched_setattr(2). This maximum utilization
> > + value is used to clamp the task specific maximum utilization clamp.
> > +
> > +
> >
> > Memory
> > ------
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 7439cbf4d02e..33006e8de996 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -877,6 +877,28 @@ config RT_GROUP_SCHED
> >
> > endif #CGROUP_SCHED
> >
> > +config UCLAMP_TASK_GROUP
> > + bool "Utilization clamping per group of tasks"
> > + depends on CGROUP_SCHED
> > + depends on UCLAMP_TASK
> > + default n
> > + help
> > + This feature enables the scheduler to track the clamped utilization
> > + of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> > +
> > + When this option is enabled, the user can specify a min and max
> > + CPU bandwidth which is allowed for each single task in a group.
> > + The max bandwidth allows to clamp the maximum frequency a task
> > + can use, while the min bandwidth allows to define a minimum
> > + frequency a task will always use.
> > +
> > + When task group based utilization clamping is enabled, an eventually
> > + specified task-specific clamp value is constrained by the cgroup
> > + specified clamp value. Both minimum and maximum task clamping cannot
> > + be bigger than the corresponding clamping defined at task group level.
> > +
> > + If in doubt, say N.
> > +
> > config CGROUP_PIDS
> > bool "PIDs controller"
> > help
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 71c9dd6487b1..aeed2dd315cc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1130,8 +1130,12 @@ static void __init init_uclamp(void)
> > /* System defaults allow max clamp values for both indexes */
> > uc_max.value = uclamp_none(UCLAMP_MAX);
> > uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> > - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > uclamp_default[clamp_id] = uc_max;
> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > + root_task_group.uclamp_req[clamp_id] = uc_max;
> > +#endif
> > + }
> > }
> >
> > #else /* CONFIG_UCLAMP_TASK */
> > @@ -6720,6 +6724,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
> > /* task_group_lock serializes the addition/removal of task groups */
> > static DEFINE_SPINLOCK(task_group_lock);
> >
> > +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> > + struct task_group *parent)
> > +{
> > +#ifdef CONFIG_UCLAMP_TASK_GROUP
> > + int clamp_id;
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > + tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
> > +#endif
> > +
> > + return 1;
>
> Looks like you never return anything else neither here nor in the
> following patches I think...

That's right, I just preferred to keep the same structure at the
callsite below...

> > +}
> > +
> > static void sched_free_group(struct task_group *tg)
> > {
> > free_fair_sched_group(tg);
> > @@ -6743,6 +6760,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> > if (!alloc_rt_sched_group(tg, parent))
> > goto err;
> >
> > + if (!alloc_uclamp_sched_group(tg, parent))
> > + goto err;
> > +

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

... under the assumption the compiler is smart enough to optimize that.

But perhaps it's less confusing to just use void, will update in v9.

> > return tg;
> >
> > err:
--
#include <best/regards.h>

Patrick Bellasi

2019-05-08 19:20:33

by Peter Zijlstra

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps


There was a bunch of repetition that seemed fragile; does something like
the below make sense?


Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -770,6 +770,9 @@ unsigned int sysctl_sched_uclamp_util_ma
/* All clamps are required to be less or equal than these values */
static struct uclamp_se uclamp_default[UCLAMP_CNT];

+#define for_each_clamp_id(clamp_id) \
+ for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
+
/* Integer rounded range for each bucket */
#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)

@@ -790,6 +793,12 @@ static inline unsigned int uclamp_none(i
return SCHED_CAPACITY_SCALE;
}

+static inline void uclamp_se_set(struct uclamp_se *uc_se, unsigned int value)
+{
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+}
+
static inline unsigned int
uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
{
@@ -977,7 +986,7 @@ static inline void uclamp_rq_inc(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ for_each_clamp_id(clamp_id)
uclamp_rq_inc_id(p, rq, clamp_id);

/* Reset clamp idle holding when there is one RUNNABLE task */
@@ -992,7 +1001,7 @@ static inline void uclamp_rq_dec(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ for_each_clamp_id(clamp_id)
uclamp_rq_dec_id(p, rq, clamp_id);
}

@@ -1021,16 +1030,13 @@ int sysctl_sched_uclamp_handler(struct c
}

if (old_min != sysctl_sched_uclamp_util_min) {
- uclamp_default[UCLAMP_MIN].value =
- sysctl_sched_uclamp_util_min;
- uclamp_default[UCLAMP_MIN].bucket_id =
- uclamp_bucket_id(sysctl_sched_uclamp_util_min);
+ uclamp_se_set(&uclamp_default[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min);
}
+
if (old_max != sysctl_sched_uclamp_util_max) {
- uclamp_default[UCLAMP_MAX].value =
- sysctl_sched_uclamp_util_max;
- uclamp_default[UCLAMP_MAX].bucket_id =
- uclamp_bucket_id(sysctl_sched_uclamp_util_max);
+ uclamp_se_set(&uclamp_default[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max);
}

/*
@@ -1052,7 +1058,7 @@ static void uclamp_fork(struct task_stru
{
unsigned int clamp_id;

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;
}

@@ -1067,17 +1073,12 @@ static void __init init_uclamp(void)
cpu_rq(cpu)->uclamp_flags = 0;
}

- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- struct uclamp_se *uc_se = &init_task.uclamp_req[clamp_id];
-
- uc_se->value = uclamp_none(clamp_id);
- uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
- }
+ for_each_clamp_id(clamp_id)
+ uclamp_se_set(&init_task.uclamp_req[clamp_id], uclamp_none(clamp_id));

/* System defaults allow max clamp values for both indexes */
- uc_max.value = uclamp_none(UCLAMP_MAX);
- uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
- for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX));
+ for_each_clamp_id(clamp_id)
uclamp_default[clamp_id] = uc_max;
}


2019-05-08 19:22:24

by Peter Zijlstra

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> +static inline struct uclamp_se
> +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> + struct uclamp_se uc_max = uclamp_default[clamp_id];
> +
> + /* System default restrictions always apply */
> + if (unlikely(uc_req.value > uc_max.value))
> + return uc_max;
> +
> + return uc_req;
> +}
> +
> +static inline unsigned int
> +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct uclamp_se uc_eff;
> +
> + /* Task currently refcounted: use back-annotated (effective) bucket */
> + if (p->uclamp[clamp_id].active)
> + return p->uclamp[clamp_id].bucket_id;
> +
> + uc_eff = uclamp_eff_get(p, clamp_id);
> +
> + return uc_eff.bucket_id;
> +}
> +
> +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> +{
> + struct uclamp_se uc_eff;
> +
> + /* Task currently refcounted: use back-annotated (effective) value */
> + if (p->uclamp[clamp_id].active)
> + return p->uclamp[clamp_id].value;
> +
> + uc_eff = uclamp_eff_get(p, clamp_id);
> +
> + return uc_eff.value;
> +}

This is 'wrong' because:

uclamp_eff_value(p,id) := uclamp_eff(p,id).value

Which seems to suggest the uclamp_eff_*() functions want another name.

Also, suppose the above would be true; does GCC really generate better
code for the LHS compared to the RHS?

2019-05-08 19:42:59

by Peter Zijlstra

Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On Tue, Apr 02, 2019 at 11:41:42AM +0100, Patrick Bellasi wrote:
> @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> #else /* CONFIG_UCLAMP_TASK */
> static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline int uclamp_validate(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + return -ENODEV;

Does that maybe want to be -EOPNOTSUPP ?

> +}

2019-05-08 19:46:07

by Peter Zijlstra

Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On Tue, May 07, 2019 at 12:13:47PM +0100, Patrick Bellasi wrote:
> On 17-Apr 15:26, Suren Baghdasaryan wrote:
> > On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:

> > > @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> > > #else /* CONFIG_UCLAMP_TASK */
> > > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > > static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > > +static inline int uclamp_validate(struct task_struct *p,
> > > + const struct sched_attr *attr)
> > > +{
> > > + return -ENODEV;
> >
> > ENOSYS might be more appropriate?
>
> Yep, agree, thanks!

No, -ENOSYS (see the comment) is special in that it indicates the whole
system call is unavailable; that is most certainly not the case!

2019-05-08 20:02:01

by Peter Zijlstra

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> Add a privileged interface to define a system default configuration via:
>
> /proc/sys/kernel/sched_uclamp_util_{min,max}

Isn't the 'u' in "uclamp" already for util?

2019-05-08 20:02:15

by Peter Zijlstra

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On Wed, May 08, 2019 at 09:07:33PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> > +static inline struct uclamp_se
> > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > + struct uclamp_se uc_max = uclamp_default[clamp_id];
> > +
> > + /* System default restrictions always apply */
> > + if (unlikely(uc_req.value > uc_max.value))
> > + return uc_max;
> > +
> > + return uc_req;
> > +}
> > +
> > +static inline unsigned int
> > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct uclamp_se uc_eff;
> > +
> > + /* Task currently refcounted: use back-annotated (effective) bucket */
> > + if (p->uclamp[clamp_id].active)
> > + return p->uclamp[clamp_id].bucket_id;
> > +
> > + uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > + return uc_eff.bucket_id;
> > +}
> > +
> > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > +{
> > + struct uclamp_se uc_eff;
> > +
> > + /* Task currently refcounted: use back-annotated (effective) value */
> > + if (p->uclamp[clamp_id].active)
> > + return p->uclamp[clamp_id].value;
> > +
> > + uc_eff = uclamp_eff_get(p, clamp_id);
> > +
> > + return uc_eff.value;
> > +}
>
> This is 'wrong' because:
>
> uclamp_eff_value(p,id) := uclamp_eff(p,id).value

Clearly I meant to say the above does not hold with the given
implementation, while the naming would suggest it does.

> Which seems to suggest the uclamp_eff_*() functions want another name.
>
> Also, suppose the above would be true; does GCC really generate better
> code for the LHS compared to the RHS?

2019-05-08 20:14:05

by Peter Zijlstra

Subject: Re: [PATCH v8 05/16] sched/core: Allow sched_setattr() to use the current policy

On Tue, Apr 02, 2019 at 11:41:41AM +0100, Patrick Bellasi wrote:
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 22627f80063e..075c610adf45 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -40,6 +40,8 @@
> /* SCHED_ISO: reserved but not implemented yet */
> #define SCHED_IDLE 5
> #define SCHED_DEADLINE 6
> +/* Must be the last entry: used to sanity check attr.policy values */
> +#define SCHED_POLICY_MAX SCHED_DEADLINE

This is a wee bit sad to put in a uapi header; but yeah, where else :/

Another option would be something like:

enum {
SCHED_NORMAL = 0,
SCHED_FIFO = 1,
SCHED_RR = 2,
SCHED_BATCH = 3,
/* SCHED_ISO = 4, reserved */
SCHED_IDLE = 5,
SCHED_DEADLINE = 6,
SCHED_POLICY_NR
};

> /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
> #define SCHED_RESET_ON_FORK 0x40000000
> @@ -50,9 +52,11 @@
> #define SCHED_FLAG_RESET_ON_FORK 0x01
> #define SCHED_FLAG_RECLAIM 0x02
> #define SCHED_FLAG_DL_OVERRUN 0x04
> +#define SCHED_FLAG_KEEP_POLICY 0x08
>
> #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> SCHED_FLAG_RECLAIM | \
> - SCHED_FLAG_DL_OVERRUN)
> + SCHED_FLAG_DL_OVERRUN | \
> + SCHED_FLAG_KEEP_POLICY)
>
> #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d368ac26b8aa..20efb32e1a7e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4907,8 +4907,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> if (retval)
> return retval;
>
> - if ((int)attr.sched_policy < 0)
> + /*
> + * A valid policy is always required from userspace, unless
> + * SCHED_FLAG_KEEP_POLICY is set and the current policy
> + * is enforced for this call.
> + */
> + if (attr.sched_policy > SCHED_POLICY_MAX &&
> + !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
> return -EINVAL;
> + }

And given I just looked at those darn SCHED_* things, I now note the
above does 'funny' things when passed: attr.policy=4.

> + if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> + attr.sched_policy = SETPARAM_POLICY;
>
> rcu_read_lock();
> retval = -ESRCH;

2019-05-09 08:44:29

by Patrick Bellasi

Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On 08-May 20:42, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> > Add a privileged interface to define a system default configuration via:
> >
> > /proc/sys/kernel/sched_uclamp_util_{min,max}
>
> Isn't the 'u' in "uclamp" already for util?

Yes, right... I just wanted to keep the same "uclamp" prefix used by
all related kernel symbols. But, since that's a user-space API, we can
certainly drop it and go for either:

/proc/sys/kernel/sched_clamp_util_{min,max}
/proc/sys/kernel/sched_util_clamp_{min,max}

Preference?

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 08:47:10

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On 08-May 21:00, Peter Zijlstra wrote:
>
> There was a bunch of repetition that seemed fragile; does something like
> the below make sense?

Absolutely yes... will add to v9, thanks.

> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -770,6 +770,9 @@ unsigned int sysctl_sched_uclamp_util_ma
> /* All clamps are required to be less or equal than these values */
> static struct uclamp_se uclamp_default[UCLAMP_CNT];
>
> +#define for_each_clamp_id(clamp_id) \
> + for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
> +
> /* Integer rounded range for each bucket */
> #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
>
> @@ -790,6 +793,12 @@ static inline unsigned int uclamp_none(i
> return SCHED_CAPACITY_SCALE;
> }
>
> +static inline void uclamp_se_set(struct uclamp_se *uc_se, unsigned int value)
> +{
> + uc_se->value = value;
> + uc_se->bucket_id = uclamp_bucket_id(value);
> +}
> +
> static inline unsigned int
> uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> {
> @@ -977,7 +986,7 @@ static inline void uclamp_rq_inc(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + for_each_clamp_id(clamp_id)
> uclamp_rq_inc_id(p, rq, clamp_id);
>
> /* Reset clamp idle holding when there is one RUNNABLE task */
> @@ -992,7 +1001,7 @@ static inline void uclamp_rq_dec(struct
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + for_each_clamp_id(clamp_id)
> uclamp_rq_dec_id(p, rq, clamp_id);
> }
>
> @@ -1021,16 +1030,13 @@ int sysctl_sched_uclamp_handler(struct c
> }
>
> if (old_min != sysctl_sched_uclamp_util_min) {
> - uclamp_default[UCLAMP_MIN].value =
> - sysctl_sched_uclamp_util_min;
> - uclamp_default[UCLAMP_MIN].bucket_id =
> - uclamp_bucket_id(sysctl_sched_uclamp_util_min);
> + uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> + sysctl_sched_uclamp_util_min);
> }
> +
> if (old_max != sysctl_sched_uclamp_util_max) {
> - uclamp_default[UCLAMP_MAX].value =
> - sysctl_sched_uclamp_util_max;
> - uclamp_default[UCLAMP_MAX].bucket_id =
> - uclamp_bucket_id(sysctl_sched_uclamp_util_max);
> + uclamp_se_set(&uclamp_default[UCLAMP_MAX],
> + sysctl_sched_uclamp_util_max);
> }
>
> /*
> @@ -1052,7 +1058,7 @@ static void uclamp_fork(struct task_stru
> {
> unsigned int clamp_id;
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + for_each_clamp_id(clamp_id)
> p->uclamp[clamp_id].active = false;
> }
>
> @@ -1067,17 +1073,12 @@ static void __init init_uclamp(void)
> cpu_rq(cpu)->uclamp_flags = 0;
> }
>
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> - struct uclamp_se *uc_se = &init_task.uclamp_req[clamp_id];
> -
> - uc_se->value = uclamp_none(clamp_id);
> - uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
> - }
> + for_each_clamp_id(clamp_id)
> + uclamp_se_set(&init_task.uclamp_req[clamp_id], uclamp_none(clamp_id));
>
> /* System defaults allow max clamp values for both indexes */
> - uc_max.value = uclamp_none(UCLAMP_MAX);
> - uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> - for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX));
> + for_each_clamp_id(clamp_id)
> uclamp_default[clamp_id] = uc_max;
> }
>
>

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 09:13:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On 08-May 21:15, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 09:07:33PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> > > +static inline struct uclamp_se
> > > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > > +{
> > > + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > > + struct uclamp_se uc_max = uclamp_default[clamp_id];
> > > +
> > > + /* System default restrictions always apply */
> > > + if (unlikely(uc_req.value > uc_max.value))
> > > + return uc_max;
> > > +
> > > + return uc_req;
> > > +}
> > > +
> > > +static inline unsigned int
> > > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> > > +{
> > > + struct uclamp_se uc_eff;
> > > +
> > > + /* Task currently refcounted: use back-annotated (effective) bucket */
> > > + if (p->uclamp[clamp_id].active)
> > > + return p->uclamp[clamp_id].bucket_id;
> > > +
> > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > +
> > > + return uc_eff.bucket_id;
> > > +}
> > > +
> > > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > > +{
> > > + struct uclamp_se uc_eff;
> > > +
> > > + /* Task currently refcounted: use back-annotated (effective) value */
> > > + if (p->uclamp[clamp_id].active)
> > > + return p->uclamp[clamp_id].value;
> > > +
> > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > +
> > > + return uc_eff.value;
> > > +}
> >
> > This is 'wrong' because:
> >
> > uclamp_eff_value(p,id) := uclamp_eff(p,id).value
>
> Clearly I means to say the above does not hold with the given
> implementation, while the naming would suggest it does.

Not sure I completely get your point...

AFAIU, what you call uclamp_eff(p, id).value is the "value" member of
the struct returned by uclamp_eff_get(p,id), which is back-annotated
by uclamp_rq_inc_id(p, rq, id) in:

p->uclamp[clamp_id].value

when a task becomes RUNNABLE.

> > Which seems to suggest the uclamp_eff_*() functions want another name.

That function returns the effective value of a task, which is either:
1. the back-annotated value for a RUNNABLE task
or
2. the aggregation of the task-specific, system-default and cgroup values
for a non-RUNNABLE task.

> > Also, suppose the above would be true; does GCC really generate better
> > code for the LHS compared to the RHS?

It generates "sane" code which implements the above logic and lets us
know that whenever we call uclamp_eff_value(p,id) we get the most
up-to-date effective value for a task, independently from its
{!}RUNNABLE state.

I would keep the function but, since Suren also complained about
the name... perhaps I should come up with a better one? Proposals?

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 09:19:35

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 05/16] sched/core: Allow sched_setattr() to use the current policy

On 08-May 21:21, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:41AM +0100, Patrick Bellasi wrote:
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 22627f80063e..075c610adf45 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -40,6 +40,8 @@
> > /* SCHED_ISO: reserved but not implemented yet */
> > #define SCHED_IDLE 5
> > #define SCHED_DEADLINE 6
> > +/* Must be the last entry: used to sanity check attr.policy values */
> > +#define SCHED_POLICY_MAX SCHED_DEADLINE
>
> This is a wee bit sad to put in a uapi header; but yeah, where else :/
>
> Another option would be something like:
>
> enum {
> SCHED_NORMAL = 0,
> SCHED_FIFO = 1,
> SCHED_RR = 2,
> SCHED_BATCH = 3,
> /* SCHED_ISO = 4, reserved */
> SCHED_IDLE = 5,
> SCHED_DEADLINE = 6,
> SCHED_POLICY_NR
> };

I just wanted to minimize the changes by keeping the same structure...
If you prefer the above I can add a refactoring patch just to update
existing definitions before adding this patch...

>
> > /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
> > #define SCHED_RESET_ON_FORK 0x40000000
> > @@ -50,9 +52,11 @@
> > #define SCHED_FLAG_RESET_ON_FORK 0x01
> > #define SCHED_FLAG_RECLAIM 0x02
> > #define SCHED_FLAG_DL_OVERRUN 0x04
> > +#define SCHED_FLAG_KEEP_POLICY 0x08
> >
> > #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> > SCHED_FLAG_RECLAIM | \
> > - SCHED_FLAG_DL_OVERRUN)
> > + SCHED_FLAG_DL_OVERRUN | \
> > + SCHED_FLAG_KEEP_POLICY)
> >
> > #endif /* _UAPI_LINUX_SCHED_H */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d368ac26b8aa..20efb32e1a7e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4907,8 +4907,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> > if (retval)
> > return retval;
> >
> > - if ((int)attr.sched_policy < 0)
> > + /*
> > + * A valid policy is always required from userspace, unless
> > + * SCHED_FLAG_KEEP_POLICY is set and the current policy
> > + * is enforced for this call.
> > + */
> > + if (attr.sched_policy > SCHED_POLICY_MAX &&
> > + !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
> > return -EINVAL;
> > + }
>
> And given I just looked at those darn SCHED_* things, I now note the
> above does 'funny' things when passed: attr.policy=4.

... and maybe factor in the same refactoring patch a check on
SCHED_ISO being not yet supported.

>
> > + if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> > + attr.sched_policy = SETPARAM_POLICY;
> >
> > rcu_read_lock();
> > retval = -ESRCH;

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 09:25:02

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On 08-May 21:41, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:42AM +0100, Patrick Bellasi wrote:
> > @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> > #else /* CONFIG_UCLAMP_TASK */
> > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > +static inline int uclamp_validate(struct task_struct *p,
> > + const struct sched_attr *attr)
> > +{
> > + return -ENODEV;
>
> Does that maybe want to be -EOPNOTSUPP ?

Suren proposed ENOSYS for another similar case, i.e. the
!CONFIG_UCLAMP_TASK definitions.

But EOPNOTSUPP seems more appropriate to me too.

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 09:25:57

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

On 08-May 21:44, Peter Zijlstra wrote:
> On Tue, May 07, 2019 at 12:13:47PM +0100, Patrick Bellasi wrote:
> > On 17-Apr 15:26, Suren Baghdasaryan wrote:
> > > On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <[email protected]> wrote:
>
> > > > @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
> > > > #else /* CONFIG_UCLAMP_TASK */
> > > > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > > > static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > > > +static inline int uclamp_validate(struct task_struct *p,
> > > > + const struct sched_attr *attr)
> > > > +{
> > > > + return -ENODEV;
> > >
> > > ENOSYS might be more appropriate?
> >
> > Yep, agree, thanks!
>
> No, -ENOSYS (see the comment) is special in that it indicates the whole
> system call is unavailable; that is most certainly not the case!

Yep, noted. Thanks.

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 11:54:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On Thu, May 09, 2019 at 10:10:57AM +0100, Patrick Bellasi wrote:
> On 08-May 21:15, Peter Zijlstra wrote:
> > On Wed, May 08, 2019 at 09:07:33PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> > > > +static inline struct uclamp_se
> > > > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > > > + struct uclamp_se uc_max = uclamp_default[clamp_id];
> > > > +
> > > > + /* System default restrictions always apply */
> > > > + if (unlikely(uc_req.value > uc_max.value))
> > > > + return uc_max;
> > > > +
> > > > + return uc_req;
> > > > +}
> > > > +
> > > > +static inline unsigned int
> > > > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_se uc_eff;
> > > > +
> > > > + /* Task currently refcounted: use back-annotated (effective) bucket */
> > > > + if (p->uclamp[clamp_id].active)
> > > > + return p->uclamp[clamp_id].bucket_id;
> > > > +
> > > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > > +
> > > > + return uc_eff.bucket_id;
> > > > +}
> > > > +
> > > > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_se uc_eff;
> > > > +
> > > > + /* Task currently refcounted: use back-annotated (effective) value */
> > > > + if (p->uclamp[clamp_id].active)
> > > > + return p->uclamp[clamp_id].value;
> > > > +
> > > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > > +
> > > > + return uc_eff.value;
> > > > +}
> > >
> > > This is 'wrong' because:
> > >
> > > uclamp_eff_value(p,id) := uclamp_eff(p,id).value
> >
> > Clearly I means to say the above does not hold with the given
> > implementation, while the naming would suggest it does.
>
> Not sure to completely get your point...

the point is that uclamp_eff_get() doesn't do the back annotate thing
and therefore returns something entirely different from
uclamp_eff_{bucket_id,value}(), where the naming would suggest it in
fact returns the same thing.

> > > Which seems to suggest the uclamp_eff_*() functions want another name.
>
> That function returns the effective value of a task, which is either:
> 1. the back annotated value for a RUNNABLE task
> or
> 2. the aggregation of task-specific, system-default and cgroup values
> for a non RUNNABLE task.

Right, but uclamp_eff_get() doesn't do 1, while the other two do do it.
And that is confusing.

> > > Also, suppose the above would be true; does GCC really generate better
> > > code for the LHS compared to the RHS?
>
> It generate "sane" code which implements the above logic and allows
> to know that whenever we call uclamp_eff_value(p,id) we get the most
> updated effective value for a task, independently from its {!}RUNNABLE
> state.
>
> I would keep the function but, since Suren also complained also about
> the name... perhaps I should come up with a better name? Proposals?

Right, so they should move to the patch where they're needed, but I was
wondering why you'd not written something like:

static inline
struct uclamp_se uclamp_active(struct task_struct *p, unsigned int clamp_id)
{
if (p->uclamp[clamp_id].active)
return p->uclamp[clamp_id];

return uclamp_eff(p, clamp_id);
}

And then used:

uclamp_active(p, id).{value,bucket_id}

- OR -

have uclamp_eff() include the active thing, afaict the callsite in
uclamp_rq_inc_id() guarantees !active.

In any case, I'm thinking the foo().member notation saves us from having
to have two almost identical functions and the 'inline' part should get
GCC to generate sane code.

2019-05-09 11:57:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 05/16] sched/core: Allow sched_setattr() to use the current policy

On Thu, May 09, 2019 at 10:18:07AM +0100, Patrick Bellasi wrote:
> On 08-May 21:21, Peter Zijlstra wrote:
> > On Tue, Apr 02, 2019 at 11:41:41AM +0100, Patrick Bellasi wrote:
> > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > index 22627f80063e..075c610adf45 100644
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > > @@ -40,6 +40,8 @@
> > > /* SCHED_ISO: reserved but not implemented yet */
> > > #define SCHED_IDLE 5
> > > #define SCHED_DEADLINE 6
> > > +/* Must be the last entry: used to sanity check attr.policy values */
> > > +#define SCHED_POLICY_MAX SCHED_DEADLINE
> >
> > This is a wee bit sad to put in a uapi header; but yeah, where else :/
> >
> > Another option would be something like:
> >
> > enum {
> > SCHED_NORMAL = 0,
> > SCHED_FIFO = 1,
> > SCHED_RR = 2,
> > SCHED_BATCH = 3,
> > /* SCHED_ISO = 4, reserved */
> > SCHED_IDLE = 5,
> > SCHED_DEADLINE = 6,
> > SCHED_POLICY_NR
> > };
>
> I just wanted to minimize the changes by keeping the same structure...
> If you prefer the above I can add a refactoring patch just to update
> existing definitions before adding this patch...

Right; I've no idea really. The thing that started all this was adding
that define to UAPI. Maybe we can do without it and instead put in a
comment to check sched_setattr() any time we add a new policy and just
hard code the thing.

2019-05-09 12:52:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 11/16] sched/fair: uclamp: Add uclamp support to energy_compute()

On Tue, Apr 02, 2019 at 11:41:47AM +0100, Patrick Bellasi wrote:
> @@ -6484,11 +6494,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * it will not appear in its pd list and will not be accounted
> * by compute_energy().
> */
> - for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
> - util = cpu_util_next(cpu, p, dst_cpu);
> - util = schedutil_energy_util(cpu, util);
> - max_util = max(util, max_util);
> - sum_util += util;
> + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> + util_cfs = cpu_util_next(cpu, p, dst_cpu);
> +
> + /*
> + * Busy time computation: utilization clamping is not
> + * required since the ratio (sum_util / cpu_capacity)
> + * is already enough to scale the EM reported power
> + * consumption at the (eventually clamped) cpu_capacity.
> + */
> + sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> + ENERGY_UTIL, NULL);
> +
> + /*
> + * Performance domain frequency: utilization clamping
> + * must be considered since it affects the selection
> + * of the performance domain frequency.
> + * NOTE: in case RT tasks are running, by default the
> + * FREQUENCY_UTIL's utilization can be max OPP.
> + */
> + tsk = cpu == dst_cpu ? p : NULL;
> + cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> + FREQUENCY_UTIL, tsk);
> + max_util = max(max_util, cpu_util);
> }

That's a bit unfortunate; having to do both variants here, but I see
why. Nothing to be done about it I suppose.

2019-05-09 13:04:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 00/16] Add utilization clamping support

On Tue, Apr 02, 2019 at 11:41:36AM +0100, Patrick Bellasi wrote:
> Series Organization
> ===================
>
> The series is organized into these main sections:
>
> - Patches [01-07]: Per task (primary) API
> - Patches [08-09]: Schedutil integration for FAIR and RT tasks
> - Patches [10-11]: Integration with EAS's energy_compute()

Aside from the comments already provided, I think this is starting to
look really good.

Thanks!

> - Patches [12-16]: Per task group (secondary) API

I still have to stare at these, but maybe a little later...

2019-05-09 13:06:42

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps

On 09-May 13:53, Peter Zijlstra wrote:
> On Thu, May 09, 2019 at 10:10:57AM +0100, Patrick Bellasi wrote:
> > On 08-May 21:15, Peter Zijlstra wrote:
> > > On Wed, May 08, 2019 at 09:07:33PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Apr 02, 2019 at 11:41:40AM +0100, Patrick Bellasi wrote:
> > > > > +static inline struct uclamp_se
> > > > > +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > > > > +{
> > > > > + struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> > > > > + struct uclamp_se uc_max = uclamp_default[clamp_id];
> > > > > +
> > > > > + /* System default restrictions always apply */
> > > > > + if (unlikely(uc_req.value > uc_max.value))
> > > > > + return uc_max;
> > > > > +
> > > > > + return uc_req;
> > > > > +}
> > > > > +
> > > > > +static inline unsigned int
> > > > > +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)
> > > > > +{
> > > > > + struct uclamp_se uc_eff;
> > > > > +
> > > > > + /* Task currently refcounted: use back-annotated (effective) bucket */
> > > > > + if (p->uclamp[clamp_id].active)
> > > > > + return p->uclamp[clamp_id].bucket_id;
> > > > > +
> > > > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > > > +
> > > > > + return uc_eff.bucket_id;
> > > > > +}
> > > > > +
> > > > > +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> > > > > +{
> > > > > + struct uclamp_se uc_eff;
> > > > > +
> > > > > + /* Task currently refcounted: use back-annotated (effective) value */
> > > > > + if (p->uclamp[clamp_id].active)
> > > > > + return p->uclamp[clamp_id].value;
> > > > > +
> > > > > + uc_eff = uclamp_eff_get(p, clamp_id);
> > > > > +
> > > > > + return uc_eff.value;
> > > > > +}
> > > >
> > > > This is 'wrong' because:
> > > >
> > > > uclamp_eff_value(p,id) := uclamp_eff(p,id).value
> > >
> > > Clearly I means to say the above does not hold with the given
> > > implementation, while the naming would suggest it does.
> >
> > Not sure to completely get your point...
>
> the point is that uclamp_eff_get() doesn't do the back annotate thing
> and therefore returns something entirely different from
> uclamp_eff_{bucket_id,value}(), where the naming would suggest it in
> fact returns the same thing.
>
> > > > Which seems to suggest the uclamp_eff_*() functions want another name.
> >
> > That function returns the effective value of a task, which is either:
> > 1. the back annotated value for a RUNNABLE task
> > or
> > 2. the aggregation of task-specific, system-default and cgroup values
> > for a non RUNNABLE task.
>
> Right, but uclamp_eff_get() doesn't do 1, while the other two do do it.
> And that is confusing.

I see, right.

> > > > Also, suppose the above would be true; does GCC really generate better
> > > > code for the LHS compared to the RHS?
> >
> > It generate "sane" code which implements the above logic and allows
> > to know that whenever we call uclamp_eff_value(p,id) we get the most
> > updated effective value for a task, independently from its {!}RUNNABLE
> > state.
> >
> > I would keep the function but, since Suren also complained also about
> > the name... perhaps I should come up with a better name? Proposals?
>
> Right, so they should move to the patch where they're needed, but I was

Yes, I'll move _value() to 10/16:

sched/core: uclamp: Add uclamp_util_with()

where we actually need to access the clamp value and...

> wondering why you'd not written something like:
>
> static inline
> struct uclamp_se uclamp_active(struct task_struct *p, unsigned int clamp_id)
> {
> if (p->uclamp[clamp_id].active)
> return p->uclamp[clamp_id];
>
> return uclamp_eff(p, clamp_id);
> }
>
> And then used:
>
> uclamp_active(p, id).{value,bucket_id}
>
> - OR -
>
> have uclamp_eff() include the active thing, afaict the callsite in
> uclamp_rq_inc_id() guarantees !active.
>
> In any case, I'm thinking the foo().member notation saves us from having
> to have two almost identical functions and the 'inline' part should get
> GCC to generate sane code.

... look into this approach, seems reasonable and actually better to read.

Thanks

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 13:11:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 00/16] Add utilization clamping support

On 09-May 15:02, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:36AM +0100, Patrick Bellasi wrote:
> > Series Organization
> > ===================
> >
> > The series is organized into these main sections:
> >
> > - Patches [01-07]: Per task (primary) API
> > - Patches [08-09]: Schedutil integration for FAIR and RT tasks
> > - Patches [10-11]: Integration with EAS's energy_compute()
>
> Aside from the comments already provided, I think this is starting to
> look really good.

Thanks Peter for the very useful review...

> Thanks!
>
> > - Patches [12-16]: Per task group (secondary) API
>
> I still have to stare at these, but maybe a little later...

... I'll soon post a v9 to factor in all the comments from this
round, so that you have a better base for when you want to start
looking at the cgroup bits.

--
#include <best/regards.h>

Patrick Bellasi

2019-05-09 15:01:42

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v8 05/16] sched/core: Allow sched_setattr() to use the current policy

On 08-May 21:21, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 11:41:41AM +0100, Patrick Bellasi wrote:
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 22627f80063e..075c610adf45 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -40,6 +40,8 @@
> > /* SCHED_ISO: reserved but not implemented yet */
> > #define SCHED_IDLE 5
> > #define SCHED_DEADLINE 6
> > +/* Must be the last entry: used to sanity check attr.policy values */
> > +#define SCHED_POLICY_MAX SCHED_DEADLINE
>
> This is a wee bit sad to put in a uapi header; but yeah, where else :/
>
> Another option would be something like:
>
> enum {
> SCHED_NORMAL = 0,
> SCHED_FIFO = 1,
> SCHED_RR = 2,
> SCHED_BATCH = 3,
> /* SCHED_ISO = 4, reserved */
> SCHED_IDLE = 5,
> SCHED_DEADLINE = 6,
> SCHED_POLICY_NR
> };
>
> > /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
> > #define SCHED_RESET_ON_FORK 0x40000000
> > @@ -50,9 +52,11 @@
> > #define SCHED_FLAG_RESET_ON_FORK 0x01
> > #define SCHED_FLAG_RECLAIM 0x02
> > #define SCHED_FLAG_DL_OVERRUN 0x04
> > +#define SCHED_FLAG_KEEP_POLICY 0x08
> >
> > #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
> > SCHED_FLAG_RECLAIM | \
> > - SCHED_FLAG_DL_OVERRUN)
> > + SCHED_FLAG_DL_OVERRUN | \
> > + SCHED_FLAG_KEEP_POLICY)
> >
> > #endif /* _UAPI_LINUX_SCHED_H */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d368ac26b8aa..20efb32e1a7e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4907,8 +4907,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
> > if (retval)
> > return retval;
> >
> > - if ((int)attr.sched_policy < 0)
> > + /*
> > + * A valid policy is always required from userspace, unless
> > + * SCHED_FLAG_KEEP_POLICY is set and the current policy
> > + * is enforced for this call.
> > + */
> > + if (attr.sched_policy > SCHED_POLICY_MAX &&
> > + !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
> > return -EINVAL;
> > + }
>
> And given I just looked at those darn SCHED_* things, I now note the
> above does 'funny' things when passed: attr.policy=4.

Looking more closely at the code, I see now that we don't really need
that check anymore. Indeed, v8 introduced support for changing
policy-specific and policy-independent attributes at the same time. Thus:

1. the policy validity is already checked in:

sched_setattr()
  __sched_setscheduler()
    valid_policy()

which knows how to deal with attr.policy=4 (i.e. -EINVAL)

2. when we pass in SCHED_FLAG_KEEP_POLICY we force the current policy
by setting attr.sched_policy = SETPARAM_POLICY, so we just need a
non-negative policy being defined (usually 0 by default).

Thus, I'll remove the new #define and update the check above to be just:

if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
attr.sched_policy = SETPARAM_POLICY;
else if ((int)attr.sched_policy < 0)
return -EINVAL;

which should also cover the additional case:

a syscall with just SCHED_FLAG_KEEP_POLICY set, when you want to
change only cross-policy attributes.

> > + if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
> > + attr.sched_policy = SETPARAM_POLICY;
> >
> > rcu_read_lock();
> > retval = -ESRCH;

--
#include <best/regards.h>

Patrick Bellasi