2019-02-08 10:08:06

by Patrick Bellasi

Subject: [PATCH v7 00/15] Add utilization clamping support

Hi all, this is a respin of:

https://lore.kernel.org/lkml/[email protected]/

which includes the following main changes:

- remove the mapping code and use a simple linear mapping of
clamp values into buckets
- move core bits and main data structures to the beginning,
in a further attempt to make the overall series easier to digest
- update the mapping logic to use exactly UCLAMP_BUCKETS_COUNT buckets,
i.e. no more "special" bucket for default values
- update uclamp_rq_update() to do a top-to-bottom max search
- make system defaults support a "nice" policy where a task, for
each clamp index, can get only "up to" what is allowed by the system
default setting, i.e. tasks are always allowed to request less
- get rid of "perf" system defaults and initialize RT tasks as max boosted
- fix definition of SCHED_POLICY_MAX
- split sched_setattr()'s validation code from actual state changing code
- for sched_setattr()'s state changing code, use _the_ same pattern
__setscheduler() and other code already use,
i.e. dequeue-change-enqueue
- add SCHED_FLAG_KEEP_PARAMS and use it to skip __setscheduler() when
policy and params are not specified
- schedutil: add FAIR and RT integration with a single patch
- drop clamping for IOWait boost
- fix go-to-max for RT tasks on !CONFIG_UCLAMP_TASK
- add a note on side-effects due to the usage of FREQUENCY_UTIL for
performance domain frequency estimation and add a similar note to this
changelog
- ensure clamp values are not tunable at root cgroup level
- propagate system defaults to root group's effective value

Thanks for all the valuable comments, let's see where we stand now ;)

Cheers, Patrick


Series Organization
===================

The series is organized into these main sections:

- Patches [01-07]: Per task (primary) API
- Patch [08]: Schedutil integration for FAIR and RT tasks
- Patches [09-10]: Integration with EAS's energy_compute()
- Patches [11-15]: Per task group (secondary) API

It is based on today's tip/sched/core and the full tree is available here:

git://linux-arm.org/linux-pb.git lkml/utilclamp_v7
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v7


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [3].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviours of each task as intended by user-space.
As such, they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorised user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism for "clamping" (i.e. filtering) the
utilization generated by RT and FAIR tasks within a range defined by user-space.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.
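As a minimal illustrative sketch (not code from this series; names are hypothetical), the filtering step amounts to constraining a utilization value into a user-space-defined range:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024	/* utilization units, as in the kernel */

/*
 * Illustrative sketch only: constrain a utilization value into the
 * [util_min, util_max] range before it drives frequency selection.
 */
static unsigned int clamp_util(unsigned int util,
			       unsigned int util_min,
			       unsigned int util_max)
{
	if (util < util_min)
		return util_min;
	if (util > util_max)
		return util_max;
	return util;
}
```

A min clamp of 100 "boosts" a 50-unit task's apparent demand, while a max clamp of 800 "caps" a 900-unit task.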

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). Here, the task's utilization
alone does not give the scheduler a complete view of the task's
requirements and, since it is a small-utilization task, the scheduler
keeps selecting a more energy-efficient CPU, with smaller capacity and
lower frequency, thus negatively impacting the overall time required to
complete the task's activations.

- capping: increase energy efficiency for background tasks not affecting the
user experience.

Since running on a lower-capacity CPU at a lower frequency is more
energy efficient, capping the utilization considered for certain (maybe
big) tasks, when their completion time is not a main goal, can have
positive effects both on energy consumption and thermal headroom.
This feature also allows making RT tasks more energy friendly on mobile
systems, where running them on high-capacity CPUs at the maximum
frequency is not required.

From these two use-cases, it's worth noting that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one
possible usage of utilization clamping. Another compelling extension of
utilization clamping is in helping the scheduler make task placement
decisions.

Utilization is (also) a task-specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.

- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.

This series does not present this additional usage of utilization clamping,
but it's an integral part of the EAS feature set, of which [1] is one of the
main components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] "Expressing per-task/per-cgroup performance hints"
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/

[2] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

[3] https://lore.kernel.org/lkml/[email protected]/


Patrick Bellasi (15):
sched/core: uclamp: Add CPU's clamp buckets refcounting
sched/core: uclamp: Enforce last task UCLAMP_MAX
sched/core: uclamp: Add system default clamps
sched/core: Allow sched_setattr() to use the current policy
sched/core: uclamp: Extend sched_setattr() to support utilization
clamping
sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
sched/core: uclamp: Set default clamps for RT tasks
sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
sched/core: uclamp: Add uclamp_util_with()
sched/fair: uclamp: Add uclamp support to energy_compute()
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Propagate system defaults to root group
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Documentation/admin-guide/cgroup-v2.rst | 46 ++
include/linux/log2.h | 37 +
include/linux/sched.h | 69 ++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 16 +-
include/uapi/linux/sched/types.h | 65 +-
init/Kconfig | 75 +++
kernel/sched/core.c | 862 +++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 31 +-
kernel/sched/fair.c | 43 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 126 +++-
kernel/sysctl.c | 16 +
14 files changed, 1355 insertions(+), 52 deletions(-)

--
2.20.1



2019-02-08 10:07:03

by Patrick Bellasi

Subject: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

Utilization clamping allows clamping the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE task enqueued on
that CPU refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. Since each clamp bucket enforces a
different utilization clamp value, when the set of active clamp buckets
changes, a new "aggregated" clamp value is computed for that CPU.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

Each task has a:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

Each CPU's rq has a:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many tasks, currently RUNNABLE on that CPU, refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).

Each CPU's rq has also a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp[clamp_id]::bucket[] array is scanned every time we need
to find a new MAX aggregated clamp value for a clamp_id. This operation
is required only when we dequeue the last task of a clamp bucket
tracking the current MAX aggregated clamp value. In these cases, the CPU
is either entering IDLE or going to schedule a less boosted or more
clamped task.
The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line.
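The top-to-bottom max search described above can be sketched in user-space as follows (illustrative only: plain arrays stand in for the rq's per-bucket state and the helper name is hypothetical):

```c
#include <assert.h>

#define UCLAMP_BUCKETS 5	/* default CONFIG_UCLAMP_BUCKETS_COUNT */

/*
 * Illustrative sketch only: scan the bucket array from the topmost
 * bucket down and return the clamp value of the first bucket with
 * RUNNABLE tasks. Since bucket index grows with clamp value, the
 * topmost active bucket holds the MAX aggregated clamp value. Fall
 * back to the clamp_id's "none" value when no bucket is active.
 */
static unsigned int rq_max_clamp(const unsigned int tasks[UCLAMP_BUCKETS],
				 const unsigned int value[UCLAMP_BUCKETS],
				 unsigned int none_value)
{
	int bucket_id;

	for (bucket_id = UCLAMP_BUCKETS - 1; bucket_id >= 0; bucket_id--) {
		if (tasks[bucket_id])
			return value[bucket_id];
	}
	return none_value;
}
```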

Add the basic data structures required to refcount, in each CPU's rq,
the number of RUNNABLE tasks for each clamp bucket. Also add the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
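The linear mapping helpers introduced by this patch boil down to the following arithmetic; here is a user-space sketch, assuming the default 5-bucket configuration:

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024
#define UCLAMP_BUCKETS 5	/* default CONFIG_UCLAMP_BUCKETS_COUNT */

/*
 * Integer ceil-rounded bucket span: with 5 buckets this is 205, so the
 * topmost clamp value (1024) still maps into the last bucket (id 4)
 * and exactly UCLAMP_BUCKETS buckets are used.
 */
#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

/* Linear mapping of a clamp value into its bucket index */
static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

/* Nominal clamp value of the bucket a clamp value falls into */
static unsigned int uclamp_bucket_value(unsigned int clamp_value)
{
	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}
```

Caching bucket_id in the task struct avoids re-doing this division on every enqueue/dequeue.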

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v7:
Message-ID: <[email protected]>
- removed buckets mapping code
- use a simpler linear mapping of clamp values into buckets
Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin>
- move this patch to the beginning of the series,
in an attempt to make the overall series easier to digest by moving
the core bits and main data structures to the very beginning
Others:
- update the mapping logic to use exactly and only
UCLAMP_BUCKETS_COUNT buckets, i.e. no more "special" bucket
- update uclamp_rq_update() to do a top-to-bottom max search
---
include/linux/log2.h | 37 ++++++++
include/linux/sched.h | 39 ++++++++
include/linux/sched/topology.h | 6 --
init/Kconfig | 53 +++++++++++
kernel/sched/core.c | 165 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 59 +++++++++++-
6 files changed, 350 insertions(+), 9 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 2af7f77866d0..e2db25734532 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
ilog2((n) - 1) + 1) : \
__order_base_2(n) \
)
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+ if (n < 2)
+ return 1;
+ if (is_power_of_2(n))
+ return order_base_2(n) + 1;
+ return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g. bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bits_per(0) = 1
+ * bits_per(1) = 1
+ * bits_per(2) = 2
+ * bits_per(3) = 2
+ * bits_per(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n) \
+( \
+ __builtin_constant_p(n) ? ( \
+ ((n) == 0 || (n) == 1) ? 1 : ( \
+ ((n) & (n - 1)) == 0 ? \
+ ilog2((n) - 1) + 2 : \
+ ilog2((n) - 1) + 1 \
+ ) \
+ ) : \
+ __bits_per(n) \
+)
#endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4112639c2a85..45460e7a3eee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,18 @@ struct vtime {
u64 gtime;
};

+/*
+ * Utilization clamp constraints.
+ * @UCLAMP_MIN: Minimum utilization
+ * @UCLAMP_MAX: Maximum utilization
+ * @UCLAMP_CNT: Utilization clamp constraints count
+ */
+enum uclamp_id {
+ UCLAMP_MIN = 0,
+ UCLAMP_MAX,
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -312,6 +324,10 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+/* Increase resolution of cpu_capacity calculations */
+# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -560,6 +576,25 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+#ifdef CONFIG_UCLAMP_TASK
+/* Number of utilization clamp buckets (shorter alias) */
+#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
+
+/*
+ * Utilization clamp for a scheduling entity
+ * @value: clamp value "requested" by a se
+ * @bucket_id: clamp bucket corresponding to the "requested" value
+ *
+ * The bucket_id is the index of the clamp bucket matching the clamp value
+ * which is pre-computed and stored to avoid expensive integer divisions from
+ * the fast path.
+ */
+struct uclamp_se {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -640,6 +675,10 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c31d3a47a47c..04beadac6985 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@

#include <linux/sched/idle.h>

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/init/Kconfig b/init/Kconfig
index 513fa544a134..34e23d5d95d1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -640,6 +640,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+ With this option, the user can specify the min and max CPU
+ utilization allowed for RUNNABLE tasks. The max utilization defines
+ the maximum frequency a task should use while the min utilization
+ defines the minimum frequency it should use.
+
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+config UCLAMP_BUCKETS_COUNT
+ int "Number of supported utilization clamp buckets"
+ range 5 20
+ default 5
+ depends on UCLAMP_TASK
+ help
+ Defines the number of clamp buckets to use. The range of each bucket
+ will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+ number of clamp buckets the finer their granularity and the higher
+ the precision of clamping aggregation and tracking at run-time.
+
+ For example, with the default configuration we will have 5 clamp
+ buckets tracking 20% utilization each. A 25% boosted task will be
+ refcounted in the [20..39]% bucket and will set the bucket clamp
+ effective value to 25%.
+ If a second 30% boosted task should be co-scheduled on the same CPU,
+ that task will be refcounted in the same bucket as the first task and
+ it will boost the bucket clamp effective value to 30%.
+ The clamp effective value of a bucket is reset to its nominal value
+ (20% in the example above) when there are no more tasks refcounted in
+ that bucket.
+
+ An additional boost/capping margin can be added to some tasks. In the
+ example above the 25% task will be boosted to 30% until it exits the
+ CPU. If that is not acceptable on certain systems,
+ it's always possible to reduce the margin by increasing the number of
+ clamp buckets to trade off used memory for run-time tracking
+ precision.
+
+ If in doubt, use the default value.
+
+endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1b67a195cc..8ecf5470058c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -719,6 +719,167 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+
+/* Integer ceil-rounded range for each bucket */
+#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
+
+static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
+{
+ return clamp_value / UCLAMP_BUCKET_DELTA;
+}
+
+static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
+{
+ return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
+}
+
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
+{
+ struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
+ unsigned int max_value = uclamp_none(clamp_id);
+ unsigned int bucket_id;
+
+ /*
+ * Both min and max clamps are MAX aggregated, thus the topmost
+ * bucket with some tasks defines the rq's clamp value.
+ */
+ bucket_id = UCLAMP_BUCKETS;
+ do {
+ --bucket_id;
+ if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
+ continue;
+ max_value = bucket[bucket_id].value;
+ break;
+ } while (bucket_id);
+
+ WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
+}
+
+/*
+ * When a task is enqueued on a rq, the clamp bucket currently defined by the
+ * task's uclamp::bucket_id is reference counted on that rq. This also
+ * immediately updates the rq's clamp value if required.
+ *
+ * Since tasks know their specific value requested from user-space, we track
+ * within each bucket the maximum value for tasks refcounted in that bucket.
+ * This provide a further aggregation (local clamping) which allows to track
+ * within each bucket the exact "requested" clamp value whenever all tasks
+ * RUNNABLE in that bucket require the same clamp.
+ */
+static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+ unsigned int rq_clamp, bkt_clamp, tsk_clamp;
+
+ rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+
+ /*
+ * Local clamping: rq's buckets always track the max "requested"
+ * clamp value from all RUNNABLE tasks in that bucket.
+ */
+ tsk_clamp = p->uclamp[clamp_id].value;
+ bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+ rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
+
+ rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+ WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
+}
+
+/*
+ * When a task is dequeued from a rq, the clamp bucket reference counted by
+ * the task is released. If this is the last task reference counting the rq's
+ * max active clamp value, then the rq's clamp value is updated.
+ * Both the tasks reference counter and the rq's cached clamp values are
+ * expected to always be valid; if we detect they are not, we skip the updates,
+ * enforce a consistent state and warn.
+ */
+static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
+ unsigned int clamp_id)
+{
+ unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+ unsigned int rq_clamp, bkt_clamp;
+
+ SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
+ if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+ rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+
+ /*
+ * Keep "local clamping" simple and accept to (possibly) overboost
+ * still RUNNABLE tasks in the same bucket.
+ */
+ if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
+ return;
+ bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
+
+ /* The rq's clamp value is expected to always track the max */
+ rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
+ SCHED_WARN_ON(bkt_clamp > rq_clamp);
+ if (bkt_clamp >= rq_clamp) {
+ /*
+ * Reset rq's clamp bucket value to its nominal value whenever
+ * there are no more RUNNABLE tasks refcounting it.
+ */
+ rq->uclamp[clamp_id].bucket[bucket_id].value =
+ uclamp_bucket_value(rq_clamp);
+ uclamp_rq_update(rq, clamp_id);
+ }
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_rq_inc_id(p, rq, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ uclamp_rq_dec_id(p, rq, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+ unsigned int clamp_id;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = uclamp_none(clamp_id);
+ unsigned int bucket_id = uclamp_bucket_id(clamp_value);
+ struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+
+ uc_se->bucket_id = bucket_id;
+ uc_se->value = clamp_value;
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -729,6 +890,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, flags & ENQUEUE_WAKEUP);
}

+ uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -742,6 +904,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

+ uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -6075,6 +6238,8 @@ void __init sched_init(void)

psi_init();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c688ef5012e5..ea9e28723946 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value is affecting a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * We have up to UCLAMP_CNT possible different clamp values, which are
+ * currently only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+ unsigned int value;
+ struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -835,6 +877,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1649,10 +1696,12 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
- void (*yield_task) (struct rq *rq);
- bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

@@ -1685,7 +1734,6 @@ struct sched_class {
void (*set_curr_task)(struct rq *rq);
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
void (*task_fork)(struct task_struct *p);
- void (*task_dead)(struct task_struct *p);

/*
* The switched_from() call is allowed to drop rq->lock, therefore we
@@ -1702,12 +1750,17 @@ struct sched_class {

void (*update_curr)(struct rq *rq);

+ void (*yield_task) (struct rq *rq);
+ bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
+
#define TASK_SET_GROUP 0
#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_change_group)(struct task_struct *p, int type);
#endif
+
+ void (*task_dead)(struct task_struct *p);
};

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
--
2.20.1


2019-02-08 10:07:08

by Patrick Bellasi

Subject: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and we unconditionally enforce the required clamps.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

/proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

The default configuration allows the full range of SCHED_CAPACITY_SCALE
values for each clamp index. If otherwise configured, a task-specific
clamp is always capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has actually been refcounted in at enqueue time. This
allows lazily aggregating "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid relatively expensive integer
divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
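As a user-space sketch (hypothetical helper name, not code from this patch), the "nice" restriction amounts to taking, for each clamp index, the minimum of the task's requested value and the system default:

```c
#include <assert.h>

/*
 * Illustrative sketch only: under the "nice" policy a task's effective
 * clamp is its requested value capped by the system default, i.e. a
 * task can always ask for less than the default, never for more.
 */
static unsigned int uclamp_effective(unsigned int requested,
				     unsigned int sys_default)
{
	return requested > sys_default ? sys_default : requested;
}
```

In the kernel this aggregation happens lazily at enqueue time, alongside the corresponding bucket_id selection.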

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v7:
Message-ID: <20190124123009.2yulcf25ld66popd@e110439-lin>
- make system defaults support a "nice" policy where a task, for
each clamp index, can get only "up to" what is allowed by the system
default setting, i.e. tasks are always allowed to request less
---
include/linux/sched.h | 24 ++++-
include/linux/sched/sysctl.h | 11 +++
kernel/sched/core.c | 169 +++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 16 ++++
4 files changed, 210 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 45460e7a3eee..447261cd23ba 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -584,14 +584,32 @@ struct sched_dl_entity {
* Utilization clamp for a scheduling entity
* @value: clamp value "requested" by a se
* @bucket_id: clamp bucket corresponding to the "requested" value
+ * @effective: clamp value and bucket actually "assigned" to the se
+ * @active: the se is currently refcounted in a rq's bucket
*
- * The bucket_id is the index of the clamp bucket matching the clamp value
- * which is pre-computed and stored to avoid expensive integer divisions from
- * the fast path.
+ * Both bucket_id and effective::bucket_id are the index of the clamp bucket
+ * matching the corresponding clamp value which are pre-computed and stored to
+ * avoid expensive integer divisions from the fast path.
+ *
+ * The active bit is set whenever a task has an effective::value assigned,
+ * which can be different from the user requested clamp value. This allows us
+ * to know a task is actually refcounting the rq's effective::bucket_id bucket.
*/
struct uclamp_se {
+ /* Clamp value "requested" by a scheduling entity */
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+ unsigned int active : 1;
+ /*
+ * Clamp value "obtained" by a scheduling entity.
+ *
+ * This caches the actual clamp value, possibly restricted by the system
+ * default clamps, which a task is subject to while enqueued in a rq.
+ */
+ struct {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+ } effective;
};
#endif /* CONFIG_UCLAMP_TASK */

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d728df7..d4f6215ee03f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

+#ifdef CONFIG_UCLAMP_TASK
+extern unsigned int sysctl_sched_uclamp_util_min;
+extern unsigned int sysctl_sched_uclamp_util_max;
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif
@@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);

+#ifdef CONFIG_UCLAMP_TASK
+extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
+#endif
+
extern int sysctl_numa_balancing(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e4f5e8c426ab..c0429bcbe09a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -720,6 +720,14 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/* Max allowed minimum utilization */
+unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
+
+/* Max allowed maximum utilization */
+unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
+
+/* All clamps are required to be not greater than these values */
+static struct uclamp_se uclamp_default[UCLAMP_CNT];

/* Integer ceil-rounded range for each bucket */
#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
@@ -803,6 +811,70 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

+/*
+ * The effective clamp bucket index of a task depends on, by increasing
+ * priority:
+ * - the task specific clamp value, when explicitly requested from userspace
+ * - the system default clamp value, defined by the sysadmin
+ *
+ * As a side effect, update the task's effective value:
+ * task_struct::uclamp::effective::value
+ * to represent the clamp value of the task's effective bucket index.
+ */
+static inline void
+uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
+ unsigned int *clamp_value, unsigned int *bucket_id)
+{
+ /* Task specific clamp value */
+ *bucket_id = p->uclamp[clamp_id].bucket_id;
+ *clamp_value = p->uclamp[clamp_id].value;
+
+ /* Always apply system default restrictions */
+ if (unlikely(*clamp_value > uclamp_default[clamp_id].value)) {
+ *clamp_value = uclamp_default[clamp_id].value;
+ *bucket_id = uclamp_default[clamp_id].bucket_id;
+ }
+}
+
+static inline void
+uclamp_effective_assign(struct task_struct *p, unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ p->uclamp[clamp_id].effective.value = clamp_value;
+ p->uclamp[clamp_id].effective.bucket_id = bucket_id;
+}
+
+static inline unsigned int uclamp_effective_bucket_id(struct task_struct *p,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ /* Task currently refcounted: use back-annotated (effective) bucket */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].effective.bucket_id;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ return bucket_id;
+}
+
+unsigned int uclamp_effective_value(struct task_struct *p,
+ unsigned int clamp_id)
+{
+ unsigned int clamp_value, bucket_id;
+
+ /* Task currently refcounted: use back-annotated (effective) value */
+ if (p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].effective.value;
+
+ uclamp_effective_get(p, clamp_id, &clamp_value, &bucket_id);
+
+ return clamp_value;
+}
+
/*
* When a task is enqueued on a rq, the clamp bucket currently defined by the
* task's uclamp::bucket_id is reference counted on that rq. This also
@@ -817,12 +889,17 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
{
- unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
unsigned int rq_clamp, bkt_clamp, tsk_clamp;
+ unsigned int bucket_id;
+
+ uclamp_effective_assign(p, clamp_id);
+ bucket_id = uclamp_effective_bucket_id(p, clamp_id);

rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+ p->uclamp[clamp_id].active = true;
+
/* Reset clamp holds on idle exit */
- tsk_clamp = p->uclamp[clamp_id].value;
+ tsk_clamp = uclamp_effective_value(p, clamp_id);
uclamp_idle_reset(rq, clamp_id, tsk_clamp);

/*
@@ -847,12 +924,15 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
unsigned int clamp_id)
{
- unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
unsigned int rq_clamp, bkt_clamp;
+ unsigned int bucket_id;
+
+ bucket_id = uclamp_effective_bucket_id(p, clamp_id);

SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
+ p->uclamp[clamp_id].active = false;

/*
* Keep "local clamping" simple and accept to (possibly) overboost
@@ -898,9 +978,72 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int old_min, old_max;
+ int result = 0;
+
+ old_min = sysctl_sched_uclamp_util_min;
+ old_max = sysctl_sched_uclamp_util_max;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ goto done;
+
+ if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ result = -EINVAL;
+ goto undo;
+ }
+
+ if (old_min != sysctl_sched_uclamp_util_min) {
+ uclamp_default[UCLAMP_MIN].value =
+ sysctl_sched_uclamp_util_min;
+ uclamp_default[UCLAMP_MIN].bucket_id =
+ uclamp_bucket_id(sysctl_sched_uclamp_util_min);
+ }
+ if (old_max != sysctl_sched_uclamp_util_max) {
+ uclamp_default[UCLAMP_MAX].value =
+ sysctl_sched_uclamp_util_max;
+ uclamp_default[UCLAMP_MAX].bucket_id =
+ uclamp_bucket_id(sysctl_sched_uclamp_util_max);
+ }
+
+ /*
+ * Updating all the RUNNABLE tasks is expensive; keep it simple and just
+ * do a lazy update at the next enqueue time.
+ */
+ goto done;
+
+undo:
+ sysctl_sched_uclamp_util_min = old_min;
+ sysctl_sched_uclamp_util_max = old_max;
+done:
+
+ return result;
+}
+
+static void uclamp_fork(struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
+ p->uclamp[clamp_id].active = false;
+}
+
static void __init init_uclamp(void)
{
+ struct uclamp_se *uc_se;
+ unsigned int bucket_id;
unsigned int clamp_id;
+ unsigned int value;
int cpu;

for_each_possible_cpu(cpu) {
@@ -909,18 +1052,28 @@ static void __init init_uclamp(void)
}

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
- unsigned int clamp_value = uclamp_none(clamp_id);
- unsigned int bucket_id = uclamp_bucket_id(clamp_value);
- struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
+ value = uclamp_none(clamp_id);
+ bucket_id = uclamp_bucket_id(value);
+
+ uc_se = &init_task.uclamp[clamp_id];
+ uc_se->bucket_id = bucket_id;
+ uc_se->value = value;
+ }

+ /* System defaults allow max clamp values for both indexes */
+ value = uclamp_none(UCLAMP_MAX);
+ bucket_id = uclamp_bucket_id(value);
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ uc_se = &uclamp_default[clamp_id];
uc_se->bucket_id = bucket_id;
- uc_se->value = clamp_value;
+ uc_se->value = value;
}
}

#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

@@ -2551,6 +2704,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)

init_entity_runnable_average(&p->se);

+ uclamp_fork(p);
+
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 987ae08147bf..72277f09887d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -446,6 +446,22 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sched_rr_handler,
},
+#ifdef CONFIG_UCLAMP_TASK
+ {
+ .procname = "sched_uclamp_util_min",
+ .data = &sysctl_sched_uclamp_util_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
+ {
+ .procname = "sched_uclamp_util_max",
+ .data = &sysctl_sched_uclamp_util_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
+#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
.procname = "sched_autogroup_enabled",
--
2.20.1


2019-02-08 10:07:16

by Patrick Bellasi

Subject: [PATCH v7 04/15] sched/core: Allow sched_setattr() to use the current policy

The sched_setattr() syscall mandates that a policy is always specified.
This requires always knowing which policy a task will have when its
attributes are configured, and it makes it impossible to add more generic
task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy, since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the use of the current policy.

This is implemented by extending the sched_setattr() non-POSIX syscall
with the SETPARAM_POLICY policy already used by the sched_setparam()
POSIX syscall.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v7:
Message-ID: <[email protected]>
- fix definition of SCHED_POLICY_MAX
---
include/uapi/linux/sched.h | 6 +++++-
kernel/sched/core.c | 11 ++++++++++-
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..075c610adf45 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -40,6 +40,8 @@
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
+/* Must be the last entry: used to sanity check attr.policy values */
+#define SCHED_POLICY_MAX SCHED_DEADLINE

/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -50,9 +52,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c0429bcbe09a..26c5ede8ebca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4924,8 +4924,17 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
if (retval)
return retval;

- if ((int)attr.sched_policy < 0)
+ /*
+ * A valid policy is always required from userspace, unless
+ * SCHED_FLAG_KEEP_POLICY is set, in which case the current
+ * policy is enforced for this call.
+ */
+ if (attr.sched_policy > SCHED_POLICY_MAX &&
+ !(attr.sched_flags & SCHED_FLAG_KEEP_POLICY)) {
return -EINVAL;
+ }
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;

rcu_read_lock();
retval = -ESRCH;
--
2.20.1


2019-02-08 10:07:20

by Patrick Bellasi

Subject: [PATCH v7 05/15] sched/core: uclamp: Extend sched_setattr() to support utilization clamping

The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define task requirements which can translate into proper
decisions for both task placement and frequency selection. Other
classes have a simpler model based on the POSIX concept of
priorities.

Such a simple priority based model however does not allow exploiting
the most advanced features of the Linux scheduler, like, for example,
driving frequency selection via the schedutil cpufreq governor. However,
also for non SCHED_DEADLINE tasks, it is still interesting to define
task properties to support scheduler decisions.

Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization of a task. This allows implementing a "proactive" per-task
frequency control policy, more advanced than the current one based
just on "passively" measured task utilization. For example, it becomes
possible to boost interactive tasks (e.g. to get better performance) or
to cap background tasks (e.g. to be more energy/thermal efficient).

Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows
defining task specific properties for different scheduling classes. A
new pair of attributes allows specifying a minimum and maximum
utilization the scheduler can consider for a task.

Do that by first validating the requested clamp values and then
applying the required changes using the same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the
new clamp values.

Do not allow changing scheduling class specific params and class
independent params (i.e. clamp values) at the same time. This keeps
things simple and still works for the most common cases, since we are
usually interested in just one of the two actions.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>

---
Changes in v7:
Message-ID: <[email protected]>
- split validation code from actual state changing code
- for state changing code, use _the_ same pattern __setscheduler() and
other code already use, i.e. dequeue-change-enqueue
- add SCHED_FLAG_KEEP_PARAMS and use it to skip __setscheduler() when
policy and params are not specified
---
include/linux/sched.h | 9 ++++
include/uapi/linux/sched.h | 12 ++++-
include/uapi/linux/sched/types.h | 65 ++++++++++++++++++++----
kernel/sched/core.c | 87 +++++++++++++++++++++++++++++++-
4 files changed, 161 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 447261cd23ba..711ea303f4c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -586,6 +586,7 @@ struct sched_dl_entity {
* @bucket_id: clamp bucket corresponding to the "requested" value
* @effective: clamp value and bucket actually "assigned" to the se
* @active: the se is currently refcounted in a rq's bucket
+ * @user_defined: the requested clamp value comes from user-space
*
* Both bucket_id and effective::bucket_id are the indexes of the clamp
* buckets matching the corresponding clamp values; they are pre-computed
@@ -594,12 +595,20 @@ struct sched_dl_entity {
* The active bit is set whenever a task has got an effective::value assigned,
* which can be different from the user requested clamp value. This allows
* knowing that a task is actually refcounting the rq's effective::bucket_id bucket.
+ *
+ * The user_defined bit is set whenever a task has got a task-specific clamp
+ * value requested from userspace, i.e. the system defaults apply to this task
+ * just as a restriction. This allows relaxing the default clamps when a
+ * less restrictive task-specific value has been requested, thus
+ * implementing a "nice" semantic. For example, a task running with a 20%
+ * default boost can still drop its own boosting to 0%.
*/
struct uclamp_se {
/* Clamp value "requested" by a scheduling entity */
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
+ unsigned int user_defined : 1;
/*
* Clamp value "obtained" by a scheduling entity.
*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 075c610adf45..d2c65617a4a4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -53,10 +53,20 @@
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
#define SCHED_FLAG_KEEP_POLICY 0x08
+#define SCHED_FLAG_KEEP_PARAMS 0x10
+#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
+#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+
+#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
+ SCHED_FLAG_KEEP_PARAMS)
+
+#define SCHED_FLAG_UTIL_CLAMP (SCHED_FLAG_UTIL_CLAMP_MIN | \
+ SCHED_FLAG_UTIL_CLAMP_MAX)

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
SCHED_FLAG_DL_OVERRUN | \
- SCHED_FLAG_KEEP_POLICY)
+ SCHED_FLAG_KEEP_ALL | \
+ SCHED_FLAG_UTIL_CLAMP)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 10fbb8031930..01439e07507c 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -9,6 +9,7 @@ struct sched_param {
};

#define SCHED_ATTR_SIZE_VER0 48 /* sizeof first published struct */
+#define SCHED_ATTR_SIZE_VER1 56 /* add: util_{min,max} */

/*
* Extended scheduling parameters data structure.
@@ -21,8 +22,33 @@ struct sched_param {
* the tasks may be useful for a wide variety of application fields, e.g.,
* multimedia, streaming, automation and control, and many others.
*
- * This variant (sched_attr) is meant at describing a so-called
- * sporadic time-constrained task. In such model a task is specified by:
+ * This variant (sched_attr) allows defining additional attributes to
+ * improve the scheduler's knowledge about task requirements.
+ *
+ * Scheduling Class Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes specifies the
+ * scheduling policy and related POSIX attributes:
+ *
+ * @size size of the structure, for fwd/bwd compat.
+ *
+ * @sched_policy task's scheduling policy
+ * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
+ * @sched_priority task's static priority (SCHED_FIFO/RR)
+ *
+ * Certain more advanced scheduling features can be controlled by a
+ * predefined set of flags via the attribute:
+ *
+ * @sched_flags for customizing the scheduler behaviour
+ *
+ * Sporadic Time-Constrained Tasks Attributes
+ * ==========================================
+ *
+ * A subset of sched_attr attributes allows to describe a so-called
+ * sporadic time-constrained task.
+ *
+ * In such model a task is specified by:
* - the activation period or minimum instance inter-arrival time;
* - the maximum (or average, depending on the actual scheduling
* discipline) computation time of all instances, a.k.a. runtime;
@@ -34,14 +60,8 @@ struct sched_param {
* than the runtime and must be completed by time instant t equal to
* the instance activation time + the deadline.
*
- * This is reflected by the actual fields of the sched_attr structure:
+ * This is reflected by the following fields of the sched_attr structure:
*
- * @size size of the structure, for fwd/bwd compat.
- *
- * @sched_policy task's scheduling policy
- * @sched_flags for customizing the scheduler behaviour
- * @sched_nice task's nice value (SCHED_NORMAL/BATCH)
- * @sched_priority task's static priority (SCHED_FIFO/RR)
* @sched_deadline representative of the task's deadline
* @sched_runtime representative of the task's runtime
* @sched_period representative of the task's period
@@ -53,6 +73,28 @@ struct sched_param {
* As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
* only user of this new interface. More information about the algorithm
* available in the scheduling class file or in Documentation/.
+ *
+ * Task Utilization Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows specifying the utilization
+ * expected for a task. These attributes allow informing the scheduler about
+ * the utilization boundaries within which it should schedule tasks. These
+ * boundaries are valuable hints to support scheduler decisions on both task
+ * placement and frequency selection.
+ *
+ * @sched_util_min represents the minimum utilization
+ * @sched_util_max represents the maximum utilization
+ *
+ * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
+ * represents the percentage of CPU time used by a task when running at the
+ * maximum frequency on the highest capacity CPU of the system. For example, a
+ * 20% utilization task is a task running for 2ms every 10ms.
+ *
+ * A task with a min utilization value bigger than 0 is more likely scheduled
+ * on a CPU with a capacity big enough to fit the specified value.
+ * A task with a max utilization value smaller than 1024 is more likely
+ * scheduled on a CPU with no more capacity than the specified value.
*/
struct sched_attr {
__u32 size;
@@ -70,6 +112,11 @@ struct sched_attr {
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
+
+ /* Utilization hints */
+ __u32 sched_util_min;
+ __u32 sched_util_max;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26c5ede8ebca..070caa1f72eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1027,6 +1027,50 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
return result;
}

+static int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
+ unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+ lower_bound = attr->sched_util_min;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+ upper_bound = attr->sched_util_max;
+
+ if (lower_bound > upper_bound)
+ return -EINVAL;
+ if (upper_bound > SCHED_CAPACITY_SCALE)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
+ return;
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
+ unsigned int lower_bound = attr->sched_util_min;
+
+ p->uclamp[UCLAMP_MIN].user_defined = true;
+ p->uclamp[UCLAMP_MIN].value = lower_bound;
+ p->uclamp[UCLAMP_MIN].bucket_id =
+ uclamp_bucket_id(lower_bound);
+ }
+
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+ unsigned int upper_bound = attr->sched_util_max;
+
+ p->uclamp[UCLAMP_MAX].user_defined = true;
+ p->uclamp[UCLAMP_MAX].value = upper_bound;
+ p->uclamp[UCLAMP_MAX].bucket_id =
+ uclamp_bucket_id(upper_bound);
+ }
+}
+
static void uclamp_fork(struct task_struct *p)
{
unsigned int clamp_id;
@@ -1073,6 +1117,13 @@ static void __init init_uclamp(void)
#else /* CONFIG_UCLAMP_TASK */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline int uclamp_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return -ENODEV;
+}
+static void __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */
@@ -4441,6 +4492,13 @@ static void __setscheduler_params(struct task_struct *p,
static void __setscheduler(struct rq *rq, struct task_struct *p,
const struct sched_attr *attr, bool keep_boost)
{
+ /*
+ * If params cannot change we don't allow scheduling class
+ * changes either.
+ */
+ if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
+ return;
+
__setscheduler_params(p, attr);

/*
@@ -4578,6 +4636,13 @@ static int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ /* Update task specific "requested" clamps */
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
+ retval = uclamp_validate(p, attr);
+ if (retval)
+ return retval;
+ }
+
/*
* Make sure no PI-waiters arrive (or leave) while we are
* changing the priority of the task:
@@ -4607,6 +4672,8 @@ static int __sched_setscheduler(struct task_struct *p,
goto change;
if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
+ if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
+ goto change;

p->sched_reset_on_fork = reset_on_fork;
task_rq_unlock(rq, p, &rf);
@@ -4687,7 +4754,9 @@ static int __sched_setscheduler(struct task_struct *p,
put_prev_task(rq, p);

prev_class = p->sched_class;
+
__setscheduler(rq, p, attr, pi);
+ __setscheduler_uclamp(p, attr);

if (queued) {
/*
@@ -4863,6 +4932,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
if (ret)
return -EFAULT;

+ if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
+ size < SCHED_ATTR_SIZE_VER1)
+ return -EINVAL;
+
/*
* XXX: Do we want to be lenient like existing syscalls; or do we want
* to be strict and return an error on out-of-bounds values?
@@ -4939,10 +5012,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
- if (p != NULL)
- retval = sched_setattr(p, &attr);
+ if (likely(p))
+ get_task_struct(p);
rcu_read_unlock();

+ if (likely(p)) {
+ retval = sched_setattr(p, &attr);
+ put_task_struct(p);
+ }
+
return retval;
}

@@ -5093,6 +5171,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
else
attr.sched_nice = task_nice(p);

+#ifdef CONFIG_UCLAMP_TASK
+ attr.sched_util_min = p->uclamp[UCLAMP_MIN].value;
+ attr.sched_util_max = p->uclamp[UCLAMP_MAX].value;
+#endif
+
rcu_read_unlock();

retval = sched_read_attr(uattr, &attr, size);
--
2.20.1


2019-02-08 10:07:33

by Patrick Bellasi

Subject: [PATCH v7 08/15] sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class. However, when utilization clamping is in use, the
frequency selection should consider userspace utilization clamping
hints. This will allow, for example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support per-task tuning of the
frequency selection, thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>

---
Changes in v7:
Message-ID: <CAJZ5v0j2NQY_gKJOAy=rP5_1Dk9TODKNhW0vuvsynTN3BUmYaQ@mail.gmail.com>
- merged FAIR and RT integration patches in this one
Message-ID: <20190123142455.454u4w253xaxzar3@e110439-lin>
- dropped clamping for IOWait boost
Message-ID: <20190122123704.6rb3xemvxbp5yfjq@e110439-lin>
- fixed go to max for RT tasks on !CONFIG_UCLAMP_TASK
---
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/fair.c | 4 ++++
kernel/sched/rt.c | 4 ++++
kernel/sched/sched.h | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 033ec7c45f13..70a8b87fa29c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -201,8 +201,10 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
+ if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -218,9 +220,16 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = util_cfs + cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);

dl_util = cpu_util_dl(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffd1ae7237e7..8c0aa76af90a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10587,6 +10587,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 90fa23d36565..d968f7209656 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b3274b2423f8..f07048a0e845 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2277,6 +2277,29 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion, which we can fix at usage time.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.20.1


2019-02-08 10:07:46

by Patrick Bellasi

Subject: [PATCH v7 11/15] sched/core: uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running at while executing
that task.
The amount of computation can also be affected by the specific CPU the
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for the tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, util.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group. Specifically:

- util.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the min_util
utilization

- util.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the max_util
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups

b) do not enforce any constraints and/or dependencies between the parent
and its child nodes, thus relying:
- on permission settings defined by the system management software,
to define if subgroups can configure their clamp values
- on the delegation model, to ensure that effective clamps are
updated to consider both subgroup requests and parent group
constraints

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing task requests to be controlled and restricted

This patch provides the basic support to expose the two new attributes
and to validate their run-time updates, while clamp buckets are not
(yet) actually allocated.
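
The validation rules for run-time updates of these attributes can be
sketched in plain C. The following is an illustrative userspace model
(struct and function names are made up for this sketch), not the kernel
code itself:

```c
#include <assert.h>
#include <errno.h>

#define SCHED_CAPACITY_SCALE 1024

/* Illustrative model of a task group's clamp state. */
struct tg_clamps {
	unsigned int util_min;
	unsigned int util_max;
};

/* Mirrors the checks in cpu_util_min_write_u64(): range, then min <= max. */
static int tg_set_util_min(struct tg_clamps *tg, unsigned long long value)
{
	if (value > SCHED_CAPACITY_SCALE)
		return -ERANGE;		/* outside [0..1024] */
	if (value > tg->util_max)
		return -EINVAL;		/* would invert min/max */
	tg->util_min = (unsigned int)value;
	return 0;
}

/* Mirrors the checks in cpu_util_max_write_u64(). */
static int tg_set_util_max(struct tg_clamps *tg, unsigned long long value)
{
	if (value > SCHED_CAPACITY_SCALE)
		return -ERANGE;
	if (value < tg->util_min)
		return -EINVAL;		/* would invert min/max */
	tg->util_max = (unsigned int)value;
	return 0;
}
```

A group starting from the defaults (0, 1024) can thus never end up with
util.min above util.max via this interface.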

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 27 +++++
include/linux/sched.h | 7 +-
init/Kconfig | 22 ++++
kernel/sched/core.c | 148 ++++++++++++++++++++++++
kernel/sched/sched.h | 5 +
5 files changed, 207 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 7bf3f129c68b..47710a77f4fa 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is based, by default, on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows the enforcement of a
+minimum bandwidth, which should always be provided by a CPU, and a maximum
+bandwidth, which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -974,6 +980,27 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.

+ cpu.util.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization in the range [0, 1024].
+
+ This interface allows reading and setting minimum utilization clamp
+ values, similar to sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.util.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "1024", i.e. no utilization capping.
+
+ The requested maximum utilization in the range [0, 1024].
+
+ This interface allows reading and setting maximum utilization clamp
+ values, similar to sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+

Memory
------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 711ea303f4c7..9d38fd588bbb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -612,8 +612,11 @@ struct uclamp_se {
/*
* Clamp value "obtained" by a scheduling entity.
*
- * This cache the actual clamp value, possibly enforced by system
- * default clamps, a task is subject to while enqueued in a rq.
+ * For a task, this is the value (possibly) enforced by the
+ * task group the task is currently part of or by the system
+ * default clamp values, whichever is the most restrictive.
+ * For task groups, this is the value (possibly) enforced by a
+ * parent task group.
*/
struct {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
diff --git a/init/Kconfig b/init/Kconfig
index 34e23d5d95d1..87bd962ed848 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -866,6 +866,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping of the maximum frequency a task
+ can use, while the min bandwidth allows defining a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, a
+ task-specific clamp value, if specified, is constrained by the
+ cgroup's clamp value. Neither the minimum nor the maximum task
+ clamp can be bigger than the corresponding clamp defined at task
+ group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 569564012ddc..122ab069ade5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1148,6 +1148,14 @@ static void __init init_uclamp(void)
uc_se = &uclamp_default[clamp_id];
uc_se->bucket_id = bucket_id;
uc_se->value = value;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ uc_se = &root_task_group.uclamp[clamp_id];
+ uc_se->bucket_id = bucket_id;
+ uc_se->value = value;
+ uc_se->effective.bucket_id = bucket_id;
+ uc_se->effective.value = value;
+#endif
}
}

@@ -6739,6 +6747,23 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline int alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ tg->uclamp[clamp_id].value =
+ parent->uclamp[clamp_id].value;
+ tg->uclamp[clamp_id].bucket_id =
+ parent->uclamp[clamp_id].bucket_id;
+ }
+#endif
+
+ return 1;
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6762,6 +6787,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ if (!alloc_uclamp_sched_group(tg, parent))
+ goto err;
+
return tg;

err:
@@ -6982,6 +7010,100 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 min_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Update tg's "requested" clamp value */
+ tg->uclamp[UCLAMP_MIN].value = min_value;
+ tg->uclamp[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 max_value)
+{
+ struct task_group *tg;
+ int ret = 0;
+
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(css);
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Update tg's "requested" clamp value */
+ tg->uclamp[UCLAMP_MAX].value = max_value;
+ tg->uclamp[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
+
+out:
+ rcu_read_unlock();
+
+ return ret;
+}
+
+static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+
+ rcu_read_lock();
+ tg = css_tg(css);
+ util_clamp = tg->uclamp[clamp_id].value;
+ rcu_read_unlock();
+
+ return util_clamp;
+}
+
+static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN);
+}
+
+static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX);
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7319,6 +7441,18 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7486,6 +7620,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "util.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_read_u64,
+ .write_u64 = cpu_util_min_write_u64,
+ },
+ {
+ .name = "util.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_read_u64,
+ .write_u64 = cpu_util_max_write_u64,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b9acef080d99..a97396295b47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -399,6 +399,11 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.20.1


2019-02-08 10:07:49

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 13/15] sched/core: uclamp: Propagate system defaults to root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.

- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

- system defaults: which define the maximum possible clamp values
usable by tasks.

- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information related
to the values "requested" by them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
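
The propagation described above boils down to a min() aggregation down
the hierarchy. A minimal sketch, assuming a simplified model where the
system defaults act as the root group's effective values:

```c
#include <assert.h>

/*
 * Illustrative sketch: a group's effective clamp is its own "requested"
 * value capped by the parent's effective value. Since system defaults
 * act as the root group's effective values, tuning them re-clamps the
 * whole hierarchy top-down.
 */
static unsigned int effective_clamp(unsigned int requested,
				    unsigned int parent_effective)
{
	return requested < parent_effective ? requested : parent_effective;
}
```

Requests are never lost: when the parent later relaxes its constraint,
re-running the same aggregation restores the child's requested value.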

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e54517acd58..35e9f06af08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -990,6 +990,14 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ unsigned int clamp_id, unsigned int bucket_id,
+ unsigned int value);
+#else
+#define cpu_util_update_hier(...)
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -1025,6 +1033,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
uclamp_bucket_id(sysctl_sched_uclamp_util_max);
}

+ cpu_util_update_hier(&root_task_group.css, UCLAMP_MIN,
+ uclamp_default[UCLAMP_MIN].bucket_id,
+ uclamp_default[UCLAMP_MIN].value);
+ cpu_util_update_hier(&root_task_group.css, UCLAMP_MAX,
+ uclamp_default[UCLAMP_MAX].bucket_id,
+ uclamp_default[UCLAMP_MAX].value);
+
/*
 * Updating all the RUNNABLE tasks is expensive, keep it simple and do
* just a lazy update at each next enqueue time.
--
2.20.1


2019-02-08 10:08:07

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 07/15] sched/core: uclamp: Set default clamps for RT tasks

By default, FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks, which instead are
expected to run at the maximum available frequency unless they are
explicitly capped.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

1. a task is switched to the RT class and it does not already have a
user-defined clamp value assigned.

2. a task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.
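
The default selection above can be modeled in a few lines. This is an
illustrative sketch (the helper names mirror the kernel ones, but the
function itself is invented for this example):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

enum { UCLAMP_MIN, UCLAMP_MAX };

/* Default clamp when nothing is requested: no boost, no cap. */
static unsigned int uclamp_none(int clamp_id)
{
	return clamp_id == UCLAMP_MIN ? 0 : SCHED_CAPACITY_SCALE;
}

/*
 * Illustrative model of the default selection: RT tasks without a
 * user-defined value get util_min = 1024 (100% boost) instead of the
 * FAIR default of 0; explicit user requests are never reset.
 */
static unsigned int default_clamp(int clamp_id, int is_rt,
				  int user_defined, unsigned int user_value)
{
	if (user_defined)
		return user_value;
	if (is_rt && clamp_id == UCLAMP_MIN)
		return uclamp_none(UCLAMP_MAX);
	return uclamp_none(clamp_id);
}
```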

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b282616e9c9..569564012ddc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1049,6 +1049,28 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
+ unsigned int clamp_id;
+
+ /*
+ * On scheduling class change, reset to default clamps for tasks
+ * without a task-specific value.
+ */
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = uclamp_none(clamp_id);
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+
+ /* Keep using defined clamps across class changes */
+ if (uc_se->user_defined)
+ continue;
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uc_se->bucket_id = uclamp_bucket_id(clamp_value);
+ uc_se->value = clamp_value;
+ }
+
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

@@ -1087,6 +1109,10 @@ static void uclamp_fork(struct task_struct *p, bool reset)
for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = uclamp_none(clamp_id);

+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
p->uclamp[clamp_id].user_defined = false;
p->uclamp[clamp_id].value = clamp_value;
p->uclamp[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
--
2.20.1


2019-02-08 10:08:12

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 09/15] sched/core: uclamp: Add uclamp_util_with()

Currently uclamp_util() clamps a specified utilization considering the
clamp values requested by RUNNABLE tasks in a CPU.
Sometimes however, it could be interesting to verify how clamp values
will change when a task is going to be running on a given CPU.
For example, the Energy Aware Scheduler (EAS) is interested in
evaluating and comparing the energy impact of different scheduling
decisions.

Add uclamp_util_with(), which clamps a given utilization while
considering the possible impact of a specified task on the CPU clamp
values.
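
A rough userspace model of the intended semantics follows; the function
name and flat parameter list are simplifications for illustration, and
the min/max inversion is resolved in favor of min, as in the kernel
comment:

```c
#include <assert.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Illustrative model of uclamp_util_with(): the task's clamps are
 * max-aggregated with the CPU's clamps (as if the task were RUNNABLE
 * there), then the utilization is clamped. On a min/max inversion,
 * min wins.
 */
static unsigned int clamp_util(unsigned int util,
			       unsigned int rq_min, unsigned int rq_max,
			       unsigned int p_min, unsigned int p_max)
{
	unsigned int min_util = max_u(rq_min, p_min);
	unsigned int max_util = max_u(rq_max, p_max);

	if (min_util >= max_util)
		return min_util;
	if (util < min_util)
		return min_util;
	if (util > max_util)
		return max_util;
	return util;
}
```

Passing zeros for the task's clamps models the p == NULL case, which
reduces to plain uclamp_util().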

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/sched.h | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f07048a0e845..de181b8a3a2a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2278,11 +2278,20 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

#ifdef CONFIG_UCLAMP_TASK
-static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+unsigned int uclamp_effective_value(struct task_struct *p, unsigned int clamp_id);
+
+static __always_inline
+unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
{
unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

+ if (p) {
+ min_util = max(min_util, uclamp_effective_value(p, UCLAMP_MIN));
+ max_util = max(max_util, uclamp_effective_value(p, UCLAMP_MAX));
+ }
+
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
* RUNNABLE tasks with _different_ clamps, we can end up with an
@@ -2293,7 +2302,17 @@ static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)

return clamp(util, min_util, max_util);
}
+
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return uclamp_util_with(rq, util, NULL);
+}
#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
+ struct task_struct *p)
+{
+ return util;
+}
static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
{
return util;
--
2.20.1


2019-02-08 10:08:20

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

When the task sleeps, it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full, unclamped blocked utilization of the
idle CPU.

Fix this by using
uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
uclamp_rq_update(rq, UCLAMP_MAX, clamp_value)
to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.

Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.
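
The flag handling can be sketched as follows; the struct and function
names here are invented for illustration and only model the
UCLAMP_FLAG_IDLE retain/reset behavior:

```c
#include <assert.h>

#define UCLAMP_FLAG_IDLE 0x01

enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

/* Illustrative per-CPU state; not the kernel's struct rq. */
struct rq_model {
	unsigned int clamp[UCLAMP_CNT];
	unsigned int flags;
};

/* Last RUNNABLE clamped task sleeps: retain only the max clamp. */
static void model_go_idle(struct rq_model *rq, unsigned int last_max_clamp)
{
	rq->clamp[UCLAMP_MIN] = 0;		/* idle CPUs need no boost */
	rq->clamp[UCLAMP_MAX] = last_max_clamp;	/* keep capping in place */
	rq->flags |= UCLAMP_FLAG_IDLE;
}

/* First task enqueued after idle: drop the hold, use the task's clamp. */
static void model_idle_reset(struct rq_model *rq, unsigned int clamp_id,
			     unsigned int task_clamp)
{
	if (!(rq->flags & UCLAMP_FLAG_IDLE))
		return;
	rq->clamp[clamp_id] = task_clamp;
	/* Called for MIN first, then MAX: clear the flag on the second call. */
	if (clamp_id == UCLAMP_MAX)
		rq->flags &= ~UCLAMP_FLAG_IDLE;
}
```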

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 52 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ecf5470058c..e4f5e8c426ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -741,11 +741,47 @@ static inline unsigned int uclamp_none(int clamp_id)
return SCHED_CAPACITY_SCALE;
}

-static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
+static inline unsigned int
+uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
+{
+ /*
+ * Avoid blocked utilization pushing up the frequency when we go
+ * idle (which drops the max-clamp) by retaining the last known
+ * max-clamp.
+ */
+ if (clamp_id == UCLAMP_MAX) {
+ rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
+ return clamp_value;
+ }
+
+ return uclamp_none(UCLAMP_MIN);
+}
+
+static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
+{
+ /* Reset max-clamp retention only on idle exit */
+ if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
+ return;
+
+ WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
+
+ /*
+ * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
+ * (after). The idle flag is reset only the second time, when we know
+ * that UCLAMP_MIN has been already updated.
+ */
+ if (clamp_id == UCLAMP_MAX)
+ rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+}
+
+static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
+ unsigned int clamp_value)
{
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
unsigned int max_value = uclamp_none(clamp_id);
unsigned int bucket_id;
+ bool active = false;

/*
* Both min and max clamps are MAX aggregated, thus the topmost
@@ -757,9 +793,13 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
continue;
max_value = bucket[bucket_id].value;
+ active = true;
break;
} while (bucket_id);

+ if (unlikely(!active))
+ max_value = uclamp_idle_value(rq, clamp_id, clamp_value);
+
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

@@ -781,12 +821,14 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
unsigned int rq_clamp, bkt_clamp, tsk_clamp;

rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
+ /* Reset clamp holds on idle exit */
+ tsk_clamp = p->uclamp[clamp_id].value;
+ uclamp_idle_reset(rq, clamp_id, tsk_clamp);

/*
* Local clamping: rq's buckets always track the max "requested"
* clamp value from all RUNNABLE tasks in that bucket.
*/
- tsk_clamp = p->uclamp[clamp_id].value;
bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);

@@ -830,7 +872,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
*/
rq->uclamp[clamp_id].bucket[bucket_id].value =
uclamp_bucket_value(rq_clamp);
- uclamp_rq_update(rq, clamp_id);
+ uclamp_rq_update(rq, clamp_id, bkt_clamp);
}
}

@@ -861,8 +903,10 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+ cpu_rq(cpu)->uclamp_flags = 0;
+ }

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
unsigned int clamp_value = uclamp_none(clamp_id);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ea9e28723946..b3274b2423f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -880,6 +880,8 @@ struct rq {
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+ unsigned int uclamp_flags;
+#define UCLAMP_FLAG_IDLE 0x01
#endif

struct cfs_rq cfs;
--
2.20.1


2019-02-08 10:08:56

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 10/15] sched/fair: uclamp: Add uclamp support to energy_compute()

The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:
a) the (active) power consumption defined for each CPU frequency
b) an estimation of which frequency will be used on each CPU
c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c) estimations. A CPU is
expected to run:
- on a higher than required frequency, but for a shorter time, in case
its estimated utilization is smaller than the minimum utilization
enforced by uclamp
- on a smaller than required frequency, but for a longer time, in case
its estimated utilization is bigger than the maximum utilization
enforced by uclamp

While effects on busy time for both boosted/capped tasks are already
considered by compute_energy(), clamping effects on frequency selection
are currently ignored by that function.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

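The two aggregations can be sketched with a toy model; the cost
function below is a stand-in for em_pd_energy(), not the real energy
model:

```c
#include <assert.h>

/*
 * Toy model of the compute_energy() aggregation: per performance
 * domain, busy time is the sum of the (unclamped, ENERGY_UTIL)
 * per-CPU contributions, while the domain frequency follows the
 * maximum of the clamped (FREQUENCY_UTIL) ones. The "energy" here is
 * just max * sum, a stand-in for em_pd_energy().
 */
static unsigned long pd_energy(const unsigned int *busy_util,
			       const unsigned int *freq_util, int nr_cpus)
{
	unsigned int max_util = 0;
	unsigned long sum_util = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		sum_util += busy_util[cpu];	/* ENERGY_UTIL path */
		if (freq_util[cpu] > max_util)	/* FREQUENCY_UTIL path */
			max_util = freq_util[cpu];
	}
	return (unsigned long)max_util * sum_util;
}
```

A single boosted CPU raising freq_util drives up max_util, and hence
the estimated energy of the whole domain, even if sum_util is unchanged.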
Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when an RT task is running on another CPU of
the same performance domain. This can have an impact on energy
estimation but:
- it's not easy to say which approach is better, since it quite likely
depends on the use case
- the original approach could still be obtained by setting a smaller
task-specific util_min whenever required

While at it:
- rename schedutil_freq_util() into schedutil_cpu_util(),
since it's not only used for frequency selection.
- use "unsigned int" instead of "unsigned long" whenever the tracked
utilization value is not expected to overflow 32bit.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>

---
Changes in v7:
Message-ID: <20190122151404.5rtosic6puixado3@queper01-lin>
- add a note on side-effects due to the usage of FREQUENCY_UTIL for
performance domain frequency estimation
- add a similar note to this changelog
---
kernel/sched/cpufreq_schedutil.c | 18 ++++++++-------
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++-----
kernel/sched/sched.h | 18 ++++-----------
3 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 70a8b87fa29c..fdad719fca8b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -195,10 +195,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type)
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+ unsigned int max, enum schedutil_type type,
+ struct task_struct *p)
{
- unsigned long dl_util, util, irq;
+ unsigned int dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
@@ -229,7 +230,7 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
- util = uclamp_util(rq, util);
+ util = uclamp_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -283,13 +284,14 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util = cpu_util_cfs(rq);
- unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+ unsigned int util_cfs = cpu_util_cfs(rq);
+ unsigned int cpu_cap = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);

- sg_cpu->max = max;
+ sg_cpu->max = cpu_cap;
sg_cpu->bw_dl = cpu_bw_dl(rq);

- return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+ return schedutil_cpu_util(sg_cpu->cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, NULL);
}

/**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c0aa76af90a..f6b0808e01ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6453,11 +6453,20 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- long util, max_util, sum_util, energy = 0;
+ unsigned int max_util, cfs_util, cpu_util, cpu_cap;
+ unsigned long sum_util, energy = 0;
int cpu;

for (; pd; pd = pd->next) {
+ struct cpumask *pd_mask = perf_domain_span(pd);
+
+ /*
+ * The energy model mandates that all the CPUs of a performance
+ * domain have the same capacity.
+ */
+ cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
max_util = sum_util = 0;
+
/*
* The capacity state of CPUs of the current rd can be driven by
* CPUs of another rd if they belong to the same performance
@@ -6468,11 +6477,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* it will not appear in its pd list and will not be accounted
* by compute_energy().
*/
- for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
- util = schedutil_energy_util(cpu, util);
- max_util = max(util, max_util);
- sum_util += util;
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ cfs_util = cpu_util_next(cpu, p, dst_cpu);
+
+ /*
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
+ */
+ sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+ ENERGY_UTIL, NULL);
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
+ */
+ cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+ FREQUENCY_UTIL,
+ cpu == dst_cpu ? p : NULL);
+ max_util = max(max_util, cpu_util);
}

energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de181b8a3a2a..b9acef080d99 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2335,6 +2335,7 @@ static inline unsigned long capacity_orig_of(int cpu)
#endif

#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
@@ -2350,15 +2351,9 @@ enum schedutil_type {
ENERGY_UTIL,
};

-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type);
-
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
-
- return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+ unsigned int max, enum schedutil_type type,
+ struct task_struct *p);

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
@@ -2387,10 +2382,7 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- return cfs;
-}
+#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
#endif

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.20.1


2019-02-08 10:09:03

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 14/15] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

When a task-specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}queued. However, when cgroups are also in use, the
task-specific clamp values could be restricted by the task_group (TG)
clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp bucket defining
the smaller clamp between the task-specific value and its TG effective
value. This makes it possible to:

1. ensure cgroup clamps are always used to restrict task-specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value

2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. For tasks in the root group or in an autogroup, only system
defaults are enforced.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 42 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 35e9f06af08d..6f8f68d18d0f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -823,10 +823,44 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
}

+static inline bool
+uclamp_tg_restricted(struct task_struct *p, unsigned int clamp_id,
+ unsigned int *clamp_value, unsigned int *bucket_id)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ unsigned int clamp_max, bucket_max;
+ struct uclamp_se *tg_clamp;
+
+ /*
+ * Tasks in an autogroup or the root task group are restricted by
+ * system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return false;
+ if (task_group(p) == &root_task_group)
+ return false;
+
+ tg_clamp = &task_group(p)->uclamp[clamp_id];
+ bucket_max = tg_clamp->effective.bucket_id;
+ clamp_max = tg_clamp->effective.value;
+
+ if (!p->uclamp[clamp_id].user_defined || *clamp_value > clamp_max) {
+ *clamp_value = clamp_max;
+ *bucket_id = bucket_max;
+ }
+
+ return true;
+#else
+ return false;
+#endif
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
 * - the task group effective clamp value, for tasks neither in the root
 * group nor in an autogroup
* - the system default clamp value, defined by the sysadmin
*
* As a side effect, update the task's effective value:
@@ -841,7 +875,13 @@ uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
*bucket_id = p->uclamp[clamp_id].bucket_id;
*clamp_value = p->uclamp[clamp_id].value;

- /* Always apply system default restrictions */
+ /*
+ * If we have task groups and we are running in a child group, system
+ * defaults are already affecting the group's clamp values.
+ */
+ if (uclamp_tg_restricted(p, clamp_id, clamp_value, bucket_id))
+ return;
+
if (unlikely(*clamp_value > uclamp_default[clamp_id].value)) {
*clamp_value = uclamp_default[clamp_id].value;
*bucket_id = uclamp_default[clamp_id].bucket_id;
--
2.20.1


2019-02-08 10:09:10

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 15/15] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.

Do that by slightly refactoring uclamp_bucket_inc(). A new
cgroup_subsys_state (css) parameter is used to walk the list of tasks
in the TGs and update the RUNNABLE ones. This is done by taking the rq
lock for each task, the same mechanism used for CPU affinity mask
updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f8f68d18d0f..e0fdc98b3663 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1030,6 +1030,51 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(p, rq, clamp_id);
}

+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(p, rq, clamp_id);
+ uclamp_rq_inc_id(p, rq, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css, int clamp_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_hier(struct cgroup_subsys_state *css,
unsigned int clamp_id, unsigned int bucket_id,
@@ -7128,6 +7173,7 @@ static void cpu_util_update_hier(struct cgroup_subsys_state *css,

uc_se->effective.value = value;
uc_se->effective.bucket_id = bucket_id;
+ uclamp_update_active_tasks(css, clamp_id);
}
}

--
2.20.1


2019-02-08 10:09:15

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 06/15] sched/core: uclamp: Reset uclamp values on RESET_ON_FORK

A forked task gets the same clamp values as its parent. However, when
the RESET_ON_FORK flag is set on the parent, e.g. via:

sys_sched_setattr()
sched_setattr()
__sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)

the new forked task is expected to start with all attributes reset to
default values.

Do that for utilization clamp values too by caching the reset request
and propagating it into the existing uclamp_fork() call which already
provides the required initialization for other uclamp related bits.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 070caa1f72eb..8b282616e9c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1071,7 +1071,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
}
}

-static void uclamp_fork(struct task_struct *p)
+static void uclamp_fork(struct task_struct *p, bool reset)
{
unsigned int clamp_id;

@@ -1080,6 +1080,17 @@ static void uclamp_fork(struct task_struct *p)

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
p->uclamp[clamp_id].active = false;
+
+ if (likely(!reset))
+ return;
+
+ for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ p->uclamp[clamp_id].user_defined = false;
+ p->uclamp[clamp_id].value = clamp_value;
+ p->uclamp[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
+ }
}

static void __init init_uclamp(void)
@@ -1124,7 +1135,7 @@ static inline int uclamp_validate(struct task_struct *p,
}
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr) { }
-static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_fork(struct task_struct *p, bool reset) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

@@ -2711,6 +2722,7 @@ static inline void init_schedstats(void) {}
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
+ bool reset;

__sched_fork(clone_flags, p);
/*
@@ -2728,7 +2740,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
/*
* Revert to default priority/policy on fork if requested.
*/
- if (unlikely(p->sched_reset_on_fork)) {
+ reset = p->sched_reset_on_fork;
+ if (unlikely(reset)) {
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->policy = SCHED_NORMAL;
p->static_prio = NICE_TO_PRIO(0);
@@ -2755,7 +2768,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)

init_entity_runnable_average(&p->se);

- uclamp_fork(p);
+ uclamp_fork(p, reset);

/*
* The child is not yet in the pid-hash so no cgroup attach races,
--
2.20.1


2019-02-08 10:09:25

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v7 12/15] sched/core: uclamp: Propagate parent clamps

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained based on the parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to the descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.

Since it can be interesting for userspace, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes
cpu.util.{min,max}.effective.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v7:
Others:
- ensure clamp values are not tunable at root cgroup level
---
Documentation/admin-guide/cgroup-v2.rst | 19 ++++
kernel/sched/core.c | 118 +++++++++++++++++++++++-
2 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 47710a77f4fa..7aad2435e961 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -990,6 +990,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.

+ cpu.util.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.util.max
A read-write single value file which exists on non-root cgroups.
The default is "1024". i.e. no utilization capping
@@ -1000,6 +1010,15 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.

+ cpu.util.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization in the range [0, 1024].
+
+ This value can be lower than cpu.util.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.


Memory
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 122ab069ade5..1e54517acd58 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -720,6 +720,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;

@@ -1127,6 +1139,8 @@ static void __init init_uclamp(void)
unsigned int value;
int cpu;

+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -6758,6 +6772,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
parent->uclamp[clamp_id].value;
tg->uclamp[clamp_id].bucket_id =
parent->uclamp[clamp_id].bucket_id;
+ tg->uclamp[clamp_id].effective.value =
+ parent->uclamp[clamp_id].effective.value;
+ tg->uclamp[clamp_id].effective.bucket_id =
+ parent->uclamp[clamp_id].effective.bucket_id;
}
#endif

@@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_hier(struct cgroup_subsys_state *css,
+ unsigned int clamp_id, unsigned int bucket_id,
+ unsigned int value)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+
+ css_for_each_descendant_pre(css, top_css) {
+ /*
+ * The first visited task group is top_css, whose clamp value
+ * is the one passed as a parameter. For descendant task
+ * groups we consider their current value.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (css != top_css) {
+ value = uc_se->value;
+ bucket_id = uc_se->effective.bucket_id;
+ }
+ uc_parent = NULL;
+ if (css_tg(css)->parent)
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+
+ /*
+ * Skip the whole subtrees if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive value, as effective clamp.
+ */
+ if (uc_se->effective.value == value &&
+ uc_parent && uc_parent->effective.value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent && uc_parent->effective.value < value) {
+ value = uc_parent->effective.value;
+ bucket_id = uc_parent->effective.bucket_id;
+ }
+ if (uc_se->effective.value == value)
+ continue;
+
+ uc_se->effective.value = value;
+ uc_se->effective.bucket_id = bucket_id;
+ }
+}
+
static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 min_value)
{
@@ -7020,6 +7085,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7038,8 +7104,13 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
tg->uclamp[UCLAMP_MIN].value = min_value;
tg->uclamp[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
+ min_value);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}
@@ -7053,6 +7124,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(css);
@@ -7071,21 +7143,29 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
tg->uclamp[UCLAMP_MAX].value = max_value;
tg->uclamp[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
+ max_value);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return ret;
}

static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;

rcu_read_lock();
tg = css_tg(css);
- util_clamp = tg->uclamp[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].effective.value
+ : tg->uclamp[clamp_id].value;
rcu_read_unlock();

return util_clamp;
@@ -7094,13 +7174,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MIN);
+ return cpu_uclamp_read(css, UCLAMP_MIN, false);
}

static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- return cpu_uclamp_read(css, UCLAMP_MAX);
+ return cpu_uclamp_read(css, UCLAMP_MAX, false);
+}
+
+static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MIN, true);
+}
+
+static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return cpu_uclamp_read(css, UCLAMP_MAX, true);
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */

@@ -7448,11 +7540,19 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* Terminate */
};
@@ -7628,12 +7728,22 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_util_min_read_u64,
.write_u64 = cpu_util_min_write_u64,
},
+ {
+ .name = "util.min.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_min_effective_read_u64,
+ },
{
.name = "util.max",
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = cpu_util_max_read_u64,
.write_u64 = cpu_util_max_write_u64,
},
+ {
+ .name = "util.max.effective",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = cpu_util_max_effective_read_u64,
+ },
#endif
{ } /* terminate */
};
--
2.20.1


2019-02-15 00:48:15

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v7 11/15] sched/core: uclamp: Extend CPU's cgroup controller

Hello,

On Fri, Feb 08, 2019 at 10:05:50AM +0000, Patrick Bellasi wrote:
> a) are available only for non-root nodes, both on default and legacy
> hierarchies, while system wide clamps are defined by a generic
> interface which does not depends on cgroups
>
> b) do not enforce any constraints and/or dependencies between the parent
> and its child nodes, thus relying:
> - on permission settings defined by the system management software,
> to define if subgroups can configure their clamp values
> - on the delegation model, to ensure that effective clamps are
> updated to consider both subgroup requests and parent group
> constraints

I'm not sure about this hierarchical behavior.

> c) have higher priority than task-specific clamps, defined via
> sched_setattr(), thus allowing to control and restrict task requests

and I have some other concerns about the interface, but let's discuss
them once the !cgroup portion is settled.

Thanks.

--
tejun

2019-03-06 18:33:42

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH v7 10/15] sched/fair: uclamp: Add uclamp support to energy_compute()

Hi Patrick,

On Friday 08 Feb 2019 at 10:05:49 (+0000), Patrick Bellasi wrote:
> The Energy Aware Scheduler (AES) estimates the energy impact of waking

s/AES/EAS

> up a task on a given CPU. This estimation is based on:
> a) an (active) power consumptions defined for each CPU frequency

s/consumptions/consumption

> b) an estimation of which frequency will be used on each CPU
> c) an estimation of the busy time (utilization) of each CPU
>
> Utilization clamping can affect both b) and c) estimations. A CPU is
> expected to run:
> - on an higher than required frequency, but for a shorter time, in case
> its estimated utilization will be smaller then the minimum utilization

s/then/than

> enforced by uclamp
> - on a smaller than required frequency, but for a longer time, in case
> its estimated utilization is bigger then the maximum utilization

s/then/than

> enforced by uclamp
>
> While effects on busy time for both boosted/capped tasks are already
> considered by compute_energy(), clamping effects on frequency selection
> are currently ignored by that function.
>
> Fix it by considering how CPU clamp values will be affected by a
> task waking up and being RUNNABLE on that CPU.
>
> Do that by refactoring schedutil_freq_util() to take an additional
> task_struct* which allows EAS to evaluate the impact on clamp values of
> a task being eventually queued in a CPU. Clamp values are applied to the
> RT+CFS utilization only when a FREQUENCY_UTIL is required by
> compute_energy().
>
> Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
> computation of cpu_util signal implies that we are more likely to
> estimate the higherst OPP when a RT task is running in another CPU of

s/higherst/highest

> the same performance domain. This can have an impact on energy
> estimation but:
> - it's not easy to say which approach is better, since it quite likely
> depends on the use case
> - the original approach could still be obtained by setting a smaller
> task-specific util_min whenever required
>
> Since we are at that:
> - rename schedutil_freq_util() into schedutil_cpu_util(),
> since it's not only used for frequency selection.
> - use "unsigned int" instead of "unsigned long" whenever the tracked
> utilization value is not expected to overflow 32bit.

We use unsigned long all over the place right ? All the task_util*()
functions return unsigned long, the capacity-related functions too, and
util_avg is an unsigned long in sched_avg. So I'm not sure if we want to
do this TBH.

> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
>
> ---
> Changes in v7:
> Message-ID: <20190122151404.5rtosic6puixado3@queper01-lin>
> - add a note on side-effects due to the usage of FREQUENCY_UTIL for
> performance domain frequency estimation
> - add a similer note to this changelog
> ---
> kernel/sched/cpufreq_schedutil.c | 18 ++++++++-------
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++-----
> kernel/sched/sched.h | 18 ++++-----------
> 3 files changed, 48 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 70a8b87fa29c..fdad719fca8b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -195,10 +195,11 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> * based on the task model parameters and gives the minimal utilization
> * required to meet deadlines.
> */
> -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> - unsigned long max, enum schedutil_type type)
> +unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
> + unsigned int max, enum schedutil_type type,
> + struct task_struct *p)
> {
> - unsigned long dl_util, util, irq;
> + unsigned int dl_util, util, irq;
> struct rq *rq = cpu_rq(cpu);
>
> if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
> @@ -229,7 +230,7 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> */
> util = util_cfs + cpu_util_rt(rq);
> if (type == FREQUENCY_UTIL)
> - util = uclamp_util(rq, util);
> + util = uclamp_util_with(rq, util, p);
>
> dl_util = cpu_util_dl(rq);
>
> @@ -283,13 +284,14 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> - unsigned long util = cpu_util_cfs(rq);
> - unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> + unsigned int util_cfs = cpu_util_cfs(rq);
> + unsigned int cpu_cap = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);

Do you really need this one ? What's wrong with 'max' :-) ?

> - sg_cpu->max = max;
> + sg_cpu->max = cpu_cap;
> sg_cpu->bw_dl = cpu_bw_dl(rq);
>
> - return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
> + return schedutil_cpu_util(sg_cpu->cpu, util_cfs, cpu_cap,
> + FREQUENCY_UTIL, NULL);
> }
>
> /**
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8c0aa76af90a..f6b0808e01ad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6453,11 +6453,20 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> static long
> compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> {
> - long util, max_util, sum_util, energy = 0;
> + unsigned int max_util, cfs_util, cpu_util, cpu_cap;
> + unsigned long sum_util, energy = 0;
> int cpu;
>
> for (; pd; pd = pd->next) {
> + struct cpumask *pd_mask = perf_domain_span(pd);
> +
> + /*
> + * The energy model mandate all the CPUs of a performance

s/mandate/mandates

> + * domain have the same capacity.
> + */
> + cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
> max_util = sum_util = 0;
> +
> /*
> * The capacity state of CPUs of the current rd can be driven by
> * CPUs of another rd if they belong to the same performance
> @@ -6468,11 +6477,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * it will not appear in its pd list and will not be accounted
> * by compute_energy().
> */
> - for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
> - util = cpu_util_next(cpu, p, dst_cpu);
> - util = schedutil_energy_util(cpu, util);
> - max_util = max(util, max_util);
> - sum_util += util;
> + for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> + cfs_util = cpu_util_next(cpu, p, dst_cpu);
> +
> + /*
> + * Busy time computation: utilization clamping is not
> + * required since the ratio (sum_util / cpu_capacity)
> + * is already enough to scale the EM reported power
> + * consumption at the (eventually clamped) cpu_capacity.
> + */
> + sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> + ENERGY_UTIL, NULL);
> +
> + /*
> + * Performance domain frequency: utilization clamping
> + * must be considered since it affects the selection
> + * of the performance domain frequency.
> + * NOTE: in case RT tasks are running, by default the
> + * FREQUENCY_UTIL's utilization can be max OPP.
> + */
> + cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> + FREQUENCY_UTIL,
> + cpu == dst_cpu ? p : NULL);
> + max_util = max(max_util, cpu_util);
> }
>
> energy += em_pd_energy(pd->em_pd, max_util, sum_util);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de181b8a3a2a..b9acef080d99 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2335,6 +2335,7 @@ static inline unsigned long capacity_orig_of(int cpu)
> #endif
>
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +
> /**
> * enum schedutil_type - CPU utilization type

Since you're using this enum unconditionally in fair.c, you should
move it out of the #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL block, I think.

> * @FREQUENCY_UTIL: Utilization used to select frequency
> @@ -2350,15 +2351,9 @@ enum schedutil_type {
> ENERGY_UTIL,
> };
>
> -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> - unsigned long max, enum schedutil_type type);
> -
> -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> -{
> - unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
> -
> - return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
> -}
> +unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
> + unsigned int max, enum schedutil_type type,
> + struct task_struct *p);
>
> static inline unsigned long cpu_bw_dl(struct rq *rq)
> {
> @@ -2387,10 +2382,7 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> return READ_ONCE(rq->avg_rt.util_avg);
> }
> #else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> -{
> - return cfs;
> -}
> +#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
> #endif
>
> #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
> --
> 2.20.1
>

Thanks,
Quentin

2019-03-12 12:55:02

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> +config UCLAMP_BUCKETS_COUNT
> + int "Number of supported utilization clamp buckets"
> + range 5 20
> + default 5
> + depends on UCLAMP_TASK
> + help
> + Defines the number of clamp buckets to use. The range of each bucket
> + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> + number of clamp buckets the finer their granularity and the higher
> + the precision of clamping aggregation and tracking at run-time.
> +
> + For example, with the default configuration we will have 5 clamp
> + buckets tracking 20% utilization each. A 25% boosted tasks will be
> + refcounted in the [20..39]% bucket and will set the bucket clamp
> + effective value to 25%.
> + If a second 30% boosted task should be co-scheduled on the same CPU,
> + that task will be refcounted in the same bucket of the first task and
> + it will boost the bucket clamp effective value to 30%.
> + The clamp effective value of a bucket is reset to its nominal value
> + (20% in the example above) when there are anymore tasks refcounted in

this sounds weird.

[...]

> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}

Something like uclamp_bucket_nominal_value() should be clearer.

> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + unsigned int max_value = uclamp_none(clamp_id);
> + unsigned int bucket_id;

unsigned int bucket_id = UCLAMP_BUCKETS;

> +
> + /*
> + * Both min and max clamps are MAX aggregated, thus the topmost
> + * bucket with some tasks defines the rq's clamp value.
> + */
> + bucket_id = UCLAMP_BUCKETS;

to get rid of this line?

> + do {
> + --bucket_id;
> + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)

if (!bucket[bucket_id].tasks)

[...]

> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> + * This provide a further aggregation (local clamping) which allows to track

s/This provide/This provides

> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + unsigned int rq_clamp, bkt_clamp, tsk_clamp;

Wouldn't it be easier to have a pointer to the task's and rq's uclamp
structures, as well as to the bucket?

- unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example
becomes much more readable.

[...]

> struct sched_class {
> const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> +
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> - void (*yield_task) (struct rq *rq);
> - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
>
> void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1685,7 +1734,6 @@ struct sched_class {
> void (*set_curr_task)(struct rq *rq);
> void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> void (*task_fork)(struct task_struct *p);
> - void (*task_dead)(struct task_struct *p);
>
> /*
> * The switched_from() call is allowed to drop rq->lock, therefore we
> @@ -1702,12 +1750,17 @@ struct sched_class {
>
> void (*update_curr)(struct rq *rq);
>
> + void (*yield_task) (struct rq *rq);
> + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> +
> #define TASK_SET_GROUP 0
> #define TASK_MOVE_GROUP 1
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_change_group)(struct task_struct *p, int type);
> #endif
> +
> + void (*task_dead)(struct task_struct *p);

Why do you move yield_task, yield_to_task and task_dead here?

[...]

2019-03-12 15:22:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +/* Integer ceil-rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y)?

The above would give 4 for 9/3, which is clearly buggered.

2019-03-12 15:51:45

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 12-Mar 16:20, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/* Integer ceil-rounded range for each bucket */
> > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
>
> Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y)?

Well, there is certainly some rounding to be done...

> The above would give 4 for 9/3, which is clearly buggered.

.. still the math above should work fine within the boundaries we
define for UCLAMP_BUCKETS_COUNT (5..20 buckets) and considering that
SCHED_CAPACITY_SCALE will never be smaller than 1024.

The above is designed to shrink the topmost bucket wrt all the others,
but the shrinking will never be more than ~30%.

Here are the start values computed for each bucket using the math
above and the computed shrinking percentage for the topmost bucket:

buckets size: 205, top bucket start@820 (err: 0.49%), buckets: {0: 0, 1: 205, 2: 410, 3: 615, 4: 820}
buckets size: 171, top bucket start@855 (err: 1.17%), buckets: {0: 0, 1: 171, 2: 342, 3: 513, 4: 684, 5: 855}
buckets size: 147, top bucket start@882 (err: 3.40%), buckets: {0: 0, 1: 147, 2: 294, 3: 441, 4: 588, 5: 735, 6: 882}
buckets size: 129, top bucket start@903 (err: 6.20%), buckets: {0: 0, 1: 129, 2: 258, 3: 387, 4: 516, 5: 645, 6: 774, 7: 903}
buckets size: 114, top bucket start@912 (err: 1.75%), buckets: {0: 0, 1: 114, 2: 228, 3: 342, 4: 456, 5: 570, 6: 684, 7: 798, 8: 912}
buckets size: 103, top bucket start@927 (err: 5.83%), buckets: {0: 0, 1: 103, 2: 206, 3: 309, 4: 412, 5: 515, 6: 618, 7: 721, 8: 824, 9: 927}
buckets size: 94, top bucket start@940 (err: 10.64%), buckets: {0: 0, 1: 94, 2: 188, 3: 282, 4: 376, 5: 470, 6: 564, 7: 658, 8: 752, 9: 846, 10: 940}
buckets size: 86, top bucket start@946 (err: 9.30%), buckets: {0: 0, 1: 86, 2: 172, 3: 258, 4: 344, 5: 430, 6: 516, 7: 602, 8: 688, 9: 774, 10: 860, 11: 946}
buckets size: 79, top bucket start@948 (err: 3.80%), buckets: {0: 0, 1: 79, 2: 158, 3: 237, 4: 316, 5: 395, 6: 474, 7: 553, 8: 632, 9: 711, 10: 790, 11: 869, 12: 948}
buckets size: 74, top bucket start@962 (err: 16.22%), buckets: {0: 0, 1: 74, 2: 148, 3: 222, 4: 296, 5: 370, 6: 444, 7: 518, 8: 592, 9: 666, 10: 740, 11: 814, 12: 888, 13: 962}
buckets size: 69, top bucket start@966 (err: 15.94%), buckets: {0: 0, 1: 69, 2: 138, 3: 207, 4: 276, 5: 345, 6: 414, 7: 483, 8: 552, 9: 621, 10: 690, 11: 759, 12: 828, 13: 897, 14: 966}
buckets size: 65, top bucket start@975 (err: 24.62%), buckets: {0: 0, 1: 65, 2: 130, 3: 195, 4: 260, 5: 325, 6: 390, 7: 455, 8: 520, 9: 585, 10: 650, 11: 715, 12: 780, 13: 845, 14: 910, 15: 975}
buckets size: 61, top bucket start@976 (err: 21.31%), buckets: {0: 0, 1: 61, 2: 122, 3: 183, 4: 244, 5: 305, 6: 366, 7: 427, 8: 488, 9: 549, 10: 610, 11: 671, 12: 732, 13: 793, 14: 854, 15: 915, 16: 976}
buckets size: 57, top bucket start@969 (err: 3.51%), buckets: {0: 0, 1: 57, 2: 114, 3: 171, 4: 228, 5: 285, 6: 342, 7: 399, 8: 456, 9: 513, 10: 570, 11: 627, 12: 684, 13: 741, 14: 798, 15: 855, 16: 912, 17: 969}
buckets size: 54, top bucket start@972 (err: 3.70%), buckets: {0: 0, 1: 54, 2: 108, 3: 162, 4: 216, 5: 270, 6: 324, 7: 378, 8: 432, 9: 486, 10: 540, 11: 594, 12: 648, 13: 702, 14: 756, 15: 810, 16: 864, 17: 918, 18: 972}
buckets size: 52, top bucket start@988 (err: 30.77%), buckets: {0: 0, 1: 52, 2: 104, 3: 156, 4: 208, 5: 260, 6: 312, 7: 364, 8: 416, 9: 468, 10: 520, 11: 572, 12: 624, 13: 676, 14: 728, 15: 780, 16: 832, 17: 884, 18: 936, 19: 988}

Does that make sense?

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 08:21:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote:
> On 12-Mar 16:20, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +/* Integer ceil-rounded range for each bucket */

^^^^^^^^^^^^^^^^^^^^^^^^^^^

> > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

simply do not match.

> > Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y)?
>
> Well, there is certainly some rounding to be done...
>
> > The above would give 4 for 9/3, which is clearly buggered.
>
> .. still the math above should work fine within the boundaries we
> define for UCLAMP_BUCKETS_COUNT (5..20 buckets) and considering that
> SCHED_CAPACITY_SCALE will never be smaller than 1024.

That's a very poor reason to write utter nonsense :-)

> The above is designed to shrink the topmost bucket wrt all the others,
> but the shrinking will never be more than ~30%.

30% sounds like a lot, esp. for this range.

> Here are the start values computed for each bucket using the math
> above and the computed shrinking percentage for the topmost bucket:

If you use a regular rounding, the error is _much_ smaller:

$ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done
5 205 820 204 0%
6 171 855 169 1%
7 146 876 148 -1%
8 128 896 128 0%
9 114 912 112 1%
10 102 918 106 -3%
11 93 930 94 -1%
12 85 935 89 -4%
13 79 948 76 3%
14 73 949 75 -2%
15 68 952 72 -5%
16 64 960 64 0%
17 60 960 64 -6%
18 57 969 55 3%
19 54 972 52 3%
20 51 969 55 -7%

Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST().

Now, if we go further, the error will obviously increase because we run
out of precision, but even there, regular rounding will be better than
either floor or ceil.

2019-03-13 11:38:43

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 09:19, Peter Zijlstra wrote:
> On Tue, Mar 12, 2019 at 03:50:43PM +0000, Patrick Bellasi wrote:
> > On 12-Mar 16:20, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > > +/* Integer ceil-rounded range for each bucket */
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> > > > +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> simply do not match.

Right, they don't match when UCLAMP_BUCKETS is a divisor of
SCHED_CAPACITY_SCALE, i.e. when we use 8 or 16 buckets.

> > > Uhm, should that not be ((x+y-1)/y), aka. DIV_ROUND_UP(x,y)?
> >
> > Well, there is certainly some rounding to be done...
> >
> > > The above would give 4 for 9/3, which is clearly buggered.
> >
> > .. still the math above should work fine within the boundaries we
> > define for UCLAMP_BUCKETS_COUNT (5..20 buckets) and considering that
> > SCHED_CAPACITY_SCALE will never be smaller than 1024.
>
> That's a very poor reason to write utter nonsense :-)
>
> > The above is designed to shrink the topmost bucket wrt all the others,
> > but the shrinking will never be more than ~30%.
>
> 30% sounds like a lot, esp. for this range.

Well, that 30% is really just ~16 utilization units on a scale of 1024
when buckets have a size of 52.

Still, yes, we can argue that's big but that's also the same error
generated by DIV_ROUND_UP() when UCLAMP_BUCKETS is not 8 or 16.

> > Here are the start values computed for each bucket using the math
> > above and the computed shrinking percentage for the topmost bucket:
>
> If you use a regular rounding, the error is _much_ smaller:
>
> $ for ((x=5;x<21;x++)) ; do let d=(1024+x/2)/x; let s=(x-1)*d; let e=1024-s; let p=100*(d-e)/d; echo $x $d $s $e $p%; done
^^^^^^^^^^^^^
> 5 205 820 204 0%
> 6 171 855 169 1%
> 7 146 876 148 -1%
> 8 128 896 128 0%
> 9 114 912 112 1%
> 10 102 918 106 -3%
> 11 93 930 94 -1%
> 12 85 935 89 -4%
> 13 79 948 76 3%
> 14 73 949 75 -2%
> 15 68 952 72 -5%
> 16 64 960 64 0%
> 17 60 960 64 -6%
> 18 57 969 55 3%
> 19 54 972 52 3%
> 20 51 969 55 -7%
>
> Funnily enough, we have a helper for that too: DIV_ROUND_CLOSEST().
^^^^^^^^^^^^^^^^^^^^
This is different from DIV_ROUND_UP() and actually better across the
full range.

> Now, if we go further, the error will obviously increase because we run
> out of precision, but even there, regular rounding will be better than
> either floor or ceil.

I don't think we will have to cover other values in the future but I
agree that this "closest rounding" is definitely better.

Thanks for spotting it, will update in v8.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 13:41:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> + return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);

return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);

might generate better code; just a single division, instead of a div and
mult.

> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + unsigned int max_value = uclamp_none(clamp_id);
> + unsigned int bucket_id;
> +
> + /*
> + * Both min and max clamps are MAX aggregated, thus the topmost
> + * bucket with some tasks defines the rq's clamp value.
> + */
> + bucket_id = UCLAMP_BUCKETS;
> + do {
> + --bucket_id;
> + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> + continue;
> + max_value = bucket[bucket_id].value;
> + break;

If you flip the if condition the code will be nicer.

> + } while (bucket_id);

But you can also use a for loop:

for (i = UCLAMP_BUCKETS-1; i>=0; i--) {
if (rq->uclamp[clamp_id].bucket[i].tasks) {
max_value = bucket[i].value;
break;
}
}

> + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}

2019-03-13 13:54:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> + * This provide a further aggregation (local clamping) which allows to track
> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> +
> + rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> +
> + /*
> + * Local clamping: rq's buckets always track the max "requested"
> + * clamp value from all RUNNABLE tasks in that bucket.
> + */
> + tsk_clamp = p->uclamp[clamp_id].value;
> + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);

So, if I read this correct:

- here we track a max value in a bucket,

> + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket reference counted by
> + * the task is released. If this is the last task reference counting the rq's
> + * max active clamp value, then the rq's clamp value is updated.
> + * Both the tasks reference counter and the rq's cached clamp values are
> + * expected to be always valid, if we detect they are not we skip the updates,
> + * enforce a consistent state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + unsigned int rq_clamp, bkt_clamp;
> +
> + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> +
> + /*
> + * Keep "local clamping" simple and accept to (possibly) overboost
> + * still RUNNABLE tasks in the same bucket.
> + */
> + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> + return;

(Oh man, I hope that generates semi sane code; long live CSE passes I
suppose)

But we never decrement that bkt_clamp value on dequeue.

> + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +
> + /* The rq's clamp value is expected to always track the max */
> + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> + SCHED_WARN_ON(bkt_clamp > rq_clamp);
> + if (bkt_clamp >= rq_clamp) {

head hurts, this reads ==, how can this ever not be so?

> + /*
> + * Reset rq's clamp bucket value to its nominal value whenever
> + * there are anymore RUNNABLE tasks refcounting it.

-ENOPARSE

> + */
> + rq->uclamp[clamp_id].bucket[bucket_id].value =
> + uclamp_bucket_value(rq_clamp);

But basically you decrement the bucket value to the nominal value.

> + uclamp_rq_update(rq, clamp_id);
> + }
> +}

Given all that, what is to stop the bucket value to climbing to
uclamp_bucket_value(+1)-1 and staying there (provided there's someone
runnable)?

Why are we doing this... ?

2019-03-13 14:08:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static void __init init_uclamp(void)
> +{
> + unsigned int clamp_id;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +

Is that really needed? Doesn't rq come from .bss ?

2019-03-13 14:11:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + unsigned int max_value = uclamp_none(clamp_id);

That's 1024 for uclamp_max

> + unsigned int bucket_id;
> +
> + /*
> + * Both min and max clamps are MAX aggregated, thus the topmost
> + * bucket with some tasks defines the rq's clamp value.
> + */
> + bucket_id = UCLAMP_BUCKETS;
> + do {
> + --bucket_id;
> + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> + continue;
> + max_value = bucket[bucket_id].value;

but this will then _lower_ it. That's not a MAX aggregate.

> + break;
> + } while (bucket_id);
> +
> + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}

2019-03-13 14:11:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> +{
> + /*
> + * Avoid blocked utilization pushing up the frequency when we go
> + * idle (which drops the max-clamp) by retaining the last known
> + * max-clamp.
> + */
> + if (clamp_id == UCLAMP_MAX) {
> + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> + return clamp_value;
> + }
> +
> + return uclamp_none(UCLAMP_MIN);

That's a very complicated way or writing: return 0, right?

> +}

2019-03-13 14:15:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> + unsigned int clamp_value)
> +{
> + /* Reset max-clamp retention only on idle exit */
> + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> + return;
> +
> + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> +
> + /*
> + * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
> + * (after). The idle flag is reset only the second time, when we know
> + * that UCLAMP_MIN has been already updated.

Why do we care? That is, what is this comment trying to tell us.

> + */
> + if (clamp_id == UCLAMP_MAX)
> + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
> +}

2019-03-13 14:33:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 45460e7a3eee..447261cd23ba 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -584,14 +584,32 @@ struct sched_dl_entity {
> * Utilization clamp for a scheduling entity
> * @value: clamp value "requested" by a se
> * @bucket_id: clamp bucket corresponding to the "requested" value
> + * @effective: clamp value and bucket actually "assigned" to the se
> + * @active: the se is currently refcounted in a rq's bucket
> *
> + * Both bucket_id and effective::bucket_id are the index of the clamp bucket
> + * matching the corresponding clamp value which are pre-computed and stored to
> + * avoid expensive integer divisions from the fast path.
> + *
> + * The active bit is set whenever a task has got an effective::value assigned,
> + * which can be different from the user requested clamp value. This allows to
> + * know a task is actually refcounting the rq's effective::bucket_id bucket.
> */
> struct uclamp_se {
> + /* Clamp value "requested" by a scheduling entity */
> unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> + unsigned int active : 1;
> + /*
> + * Clamp value "obtained" by a scheduling entity.
> + *
> + * This cache the actual clamp value, possibly enforced by system
> + * default clamps, a task is subject to while enqueued in a rq.
> + */
> + struct {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> + } effective;

I still think that this effective thing is backwards.

The existing code already used @value and @bucket_id as 'effective' and
you're now changing all that again. This really doesn't make sense to
me.

Also; if you don't add it inside struct uclamp_se, but add a second
instance,

> };
> #endif /* CONFIG_UCLAMP_TASK */
>


> @@ -803,6 +811,70 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
> WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> }
>
> +/*
> + * The effective clamp bucket index of a task depends on, by increasing
> + * priority:
> + * - the task specific clamp value, when explicitly requested from userspace
> + * - the system default clamp value, defined by the sysadmin
> + *
> + * As a side effect, update the task's effective value:
> + * task_struct::uclamp::effective::value
> + * to represent the clamp value of the task effective bucket index.
> + */
> +static inline void
> +uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> + unsigned int *clamp_value, unsigned int *bucket_id)
> +{
> + /* Task specific clamp value */
> + *bucket_id = p->uclamp[clamp_id].bucket_id;
> + *clamp_value = p->uclamp[clamp_id].value;
> +
> + /* Always apply system default restrictions */
> + if (unlikely(*clamp_value > uclamp_default[clamp_id].value)) {
> + *clamp_value = uclamp_default[clamp_id].value;
> + *bucket_id = uclamp_default[clamp_id].bucket_id;
> + }
> +}

you can avoid horrors like this and simply return a struct uclamp_se by
value.



2019-03-13 15:17:40

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 12-Mar 13:52, Dietmar Eggemann wrote:
> On 2/8/19 11:05 AM, Patrick Bellasi wrote:
>
> [...]
>
> > +config UCLAMP_BUCKETS_COUNT
> > + int "Number of supported utilization clamp buckets"
> > + range 5 20
> > + default 5
> > + depends on UCLAMP_TASK
> > + help
> > + Defines the number of clamp buckets to use. The range of each bucket
> > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> > + number of clamp buckets the finer their granularity and the higher
> > + the precision of clamping aggregation and tracking at run-time.
> > +
> > + For example, with the default configuration we will have 5 clamp
> > + buckets tracking 20% utilization each. A 25% boosted tasks will be
> > + refcounted in the [20..39]% bucket and will set the bucket clamp
> > + effective value to 25%.
> > + If a second 30% boosted task should be co-scheduled on the same CPU,
> > + that task will be refcounted in the same bucket of the first task and
> > + it will boost the bucket clamp effective value to 30%.
> > + The clamp effective value of a bucket is reset to its nominal value
> > + (20% in the example above) when there are anymore tasks refcounted in
>
> this sounds weird.

Why?

>
> [...]
>
> > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > +{
> > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > +}
>
> Soemthing like uclamp_bucket_nominal_value() should be clearer.

Maybe... can update it in v8

> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > + unsigned int max_value = uclamp_none(clamp_id);
> > + unsigned int bucket_id;
>
> unsigned int bucket_id = UCLAMP_BUCKETS;
>
> > +
> > + /*
> > + * Both min and max clamps are MAX aggregated, thus the topmost
> > + * bucket with some tasks defines the rq's clamp value.
> > + */
> > + bucket_id = UCLAMP_BUCKETS;
>
> to get rid of this line?

I put it on a different line as a justification for the loop variable
initialization described in the comment above.

>
> > + do {
> > + --bucket_id;
> > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
>
> if (!bucket[bucket_id].tasks)

Right... that's some leftover from the last refactoring!

[...]

> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
>
> Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> structure as well to the bucket?
>
> - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

I think I went back and forth a couple of times between using pointers
and the extended version, which both have pros and cons.

I personally prefer the pointers, as you suggest, but I got the
impression in the past that, since everybody has cleared "basic C
training", the code above is not so difficult to read either.

> The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> much more readable.

Agree... let's try to switch once again in v8 and see ;)

> [...]
>
> > struct sched_class {
> > const struct sched_class *next;
> > +#ifdef CONFIG_UCLAMP_TASK
> > + int uclamp_enabled;
> > +#endif
> > +
> > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> > - void (*yield_task) (struct rq *rq);
> > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
> > @@ -1685,7 +1734,6 @@ struct sched_class {
> > void (*set_curr_task)(struct rq *rq);
> > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> > void (*task_fork)(struct task_struct *p);
> > - void (*task_dead)(struct task_struct *p);
> > /*
> > * The switched_from() call is allowed to drop rq->lock, therefore we
> > @@ -1702,12 +1750,17 @@ struct sched_class {
> > void (*update_curr)(struct rq *rq);
> > + void (*yield_task) (struct rq *rq);
> > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > +
> > #define TASK_SET_GROUP 0
> > #define TASK_MOVE_GROUP 1
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > void (*task_change_group)(struct task_struct *p, int type);
> > #endif
> > +
> > + void (*task_dead)(struct task_struct *p);
>
> Why do you move yield_task, yield_to_task and task_dead here?

Since I'm adding a new field at the beginning of the struct, which is
used at enqueue/dequeue time, this is to ensure that all the
callbacks used in these paths are grouped together and don't fall
across a cache line... but yes, that's supposed to be a
micro-optimization which I can skip in this patch.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 15:25:21

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 15:09, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static inline unsigned int uclamp_none(int clamp_id)
> > +{
> > + if (clamp_id == UCLAMP_MIN)
> > + return 0;
> > + return SCHED_CAPACITY_SCALE;
> > +}
> > +
> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > + unsigned int max_value = uclamp_none(clamp_id);
>
> That's 1024 for uclamp_max
>
> > + unsigned int bucket_id;
> > +
> > + /*
> > + * Both min and max clamps are MAX aggregated, thus the topmost
> > + * bucket with some tasks defines the rq's clamp value.
> > + */
> > + bucket_id = UCLAMP_BUCKETS;
> > + do {
> > + --bucket_id;
> > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > + continue;
> > + max_value = bucket[bucket_id].value;
>
> but this will then _lower_ it. That's not a MAX aggregate.

For uclamp_max we want max_value=1024 when there are no active tasks,
which means: no max clamp enforced on CFS/RT "idle" cpus.

If instead there are active RT/CFS tasks then we want the clamp value
of the max group, which means: MAX aggregate active clamps.

That's what the code above does and the comment says.

> > + break;
> > + } while (bucket_id);
> > +
> > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> > +}

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 15:29:19

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 15:06, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static void __init init_uclamp(void)
> > +{
> > + unsigned int clamp_id;
> > + int cpu;
> > +
> > + for_each_possible_cpu(cpu)
> > + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> > +
>
> Is that really needed? Doesn't rq come from .bss ?

Will check better... I've just assumed not since in sched_init() we
already have tons of rq's related zero initializations.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 16:01:16

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:52, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/*
> > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > + * immediately updates the rq's clamp value if required.
> > + *
> > + * Since tasks know their specific value requested from user-space, we track
> > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > + * This provide a further aggregation (local clamping) which allows to track
> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > +
> > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > +
> > + /*
> > + * Local clamping: rq's buckets always track the max "requested"
> > + * clamp value from all RUNNABLE tasks in that bucket.
> > + */
> > + tsk_clamp = p->uclamp[clamp_id].value;
> > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
>
> So, if I read this correct:
>
> - here we track a max value in a bucket,
>
> > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > +}
> > +
> > +/*
> > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > + * the task is released. If this is the last task reference counting the rq's
> > + * max active clamp value, then the rq's clamp value is updated.
> > + * Both the tasks reference counter and the rq's cached clamp values are
> > + * expected to be always valid, if we detect they are not we skip the updates,
> > + * enforce a consistent state and warn.
> > + */
> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + unsigned int rq_clamp, bkt_clamp;
> > +
> > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > +
> > + /*
> > + * Keep "local clamping" simple and accept to (possibly) overboost
> > + * still RUNNABLE tasks in the same bucket.
> > + */
> > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > + return;
>
> (Oh man, I hope that generates semi sane code; long live CSE passes I
> suppose)

What do you mean?

> But we never decrement that bkt_clamp value on dequeue.

We decrement the bkt_clamp value only when the bucket becomes empty
and thus we pass the condition above. That's what the comment above is
there to call out.


> > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +
> > + /* The rq's clamp value is expected to always track the max */
> > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > + SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > + if (bkt_clamp >= rq_clamp) {
>
> head hurts, this reads ==, how can this ever not be so?

Never, given the current code, that's just defensive programming.

If in the future the accounting should be accidentally broken by some
refactoring we warn and fix the corrupted data structures at first
chance.

Is that so bad?

> > + /*
> > + * Reset rq's clamp bucket value to its nominal value whenever
> > + * there are anymore RUNNABLE tasks refcounting it.
>
> -ENOPARSE

That's related to the comment above, when you say we don't decrement
the bkt_clamp.

Because of bucketization, we potentially end up tracking tasks with
different requested clamp values in the same bucket.

For example, with 20% bucket size, we can have:
Task1: util_min=25%
Task2: util_min=35%
accounted in the same bucket.

This ensures that while both are running we boost to 35%. If Task1
runs longer than Task2, Task1 will be "overboosted" until it
completes. The bucket value will be reset to 20% (its nominal value)
when both tasks are idle.


> > + */
> > + rq->uclamp[clamp_id].bucket[bucket_id].value =
> > + uclamp_bucket_value(rq_clamp);
>
> But basically you decrement the bucket value to the nominal value.

Yes, at this point we know there are no more tasks in this bucket and
we reset its value.

>
> > + uclamp_rq_update(rq, clamp_id);
> > + }
> > +}
>
> Given all that, what is to stop the bucket value to climbing to
> uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> runnable)?

Nothing... but that's an expected consequence of bucketization.

> Why are we doing this... ?

You can either decide to:

a) always boost tasks to just the bucket nominal value
thus always penalizing both Task1 and Task2 of the example above

b) always boost tasks to the bucket "max" value
thus always overboosting both Task1 and Task2 of the example above

The solution above instead has a very good property: in systems
where you have only a few well-defined clamp values we always
provide the exact boost.

For example, if your system requires only 23% and 47% boost values
(totally random numbers), then you can always get the exact boost
required using just 3 buckets of ~33% size each.

In systems where you don't know which boost values you will have, you
can still define the maximum overboost granularity you accept for
each task by just tuning the number of clamp buckets. For example,
with 20 buckets you can have a 5% max overboost.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 16:13:58

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:40, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > +{
> > + return clamp_value / UCLAMP_BUCKET_DELTA;
> > +}
> > +
> > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > +{
> > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
>
> return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
>
> might generate better code; just a single division, instead of a div and
> mult.

Wondering if compilers cannot do these optimizations... but yes, looks
cool and will do it in v8, thanks.

> > +}
> > +
> > +static inline unsigned int uclamp_none(int clamp_id)
> > +{
> > + if (clamp_id == UCLAMP_MIN)
> > + return 0;
> > + return SCHED_CAPACITY_SCALE;
> > +}
> > +
> > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > +{
> > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > + unsigned int max_value = uclamp_none(clamp_id);
> > + unsigned int bucket_id;
> > +
> > + /*
> > + * Both min and max clamps are MAX aggregated, thus the topmost
> > + * bucket with some tasks defines the rq's clamp value.
> > + */
> > + bucket_id = UCLAMP_BUCKETS;
> > + do {
> > + --bucket_id;
> > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > + continue;
> > + max_value = bucket[bucket_id].value;
> > + break;
>
> If you flip the if condition the code will be nicer.
>
> > + } while (bucket_id);
>
> But you can also use a for loop:
>
> for (i = UCLAMP_BUCKETS-1; i>=0; i--) {
> if (rq->uclamp[clamp_id].bucket[i].tasks) {
> max_value = bucket[i].value;
> break;
> }
> }

Yes, the for looks better, but perhaps like that:

unsigned int bucket_id = UCLAMP_BUCKETS;

/*
 * Both min and max clamps are MAX aggregated, thus the topmost
 * bucket with some tasks defines the rq's clamp value.
 */
for (; bucket_id >= 0; --bucket_id) {
	if (!bucket[bucket_id].tasks)
		continue;
	max_value = bucket[bucket_id].value;
	break;
}

... just to save a {} block.


> > + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> > +}

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 16:17:31

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On 13-Mar 15:12, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> > + unsigned int clamp_value)
> > +{
> > + /* Reset max-clamp retention only on idle exit */
> > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> > + return;
> > +
> > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> > +
> > + /*
> > + * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
> > + * (after). The idle flag is reset only the second time, when we know
> > + * that UCLAMP_MIN has been already updated.
>
> Why do we care? That is, what is this comment trying to tell us.

Right, the code is clear enough, I'll remove this comment.

>
> > + */
> > + if (clamp_id == UCLAMP_MAX)
> > + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
> > +}

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 16:23:13

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On 13-Mar 15:10, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> > +{
> > + /*
> > + * Avoid blocked utilization pushing up the frequency when we go
> > + * idle (which drops the max-clamp) by retaining the last known
> > + * max-clamp.
> > + */
> > + if (clamp_id == UCLAMP_MAX) {
> > + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> > + return clamp_value;
> > + }
> > +
> > + return uclamp_none(UCLAMP_MIN);
>
> That's a very complicated way of writing: return 0, right?

In my mind it's just a simple way to hardcode values in just one place.

In the current implementation uclamp_none(UCLAMP_MIN) is 0 and the
compiler has no trouble inlining a 0 there.

Is it really so disgusting ?

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 17:10:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On 13-Mar 15:32, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
>
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 45460e7a3eee..447261cd23ba 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -584,14 +584,32 @@ struct sched_dl_entity {
> > * Utilization clamp for a scheduling entity
> > * @value: clamp value "requested" by a se
> > * @bucket_id: clamp bucket corresponding to the "requested" value
> > + * @effective: clamp value and bucket actually "assigned" to the se
> > + * @active: the se is currently refcounted in a rq's bucket
> > *
> > + * Both bucket_id and effective::bucket_id are the index of the clamp bucket
> > + * matching the corresponding clamp value which are pre-computed and stored to
> > + * avoid expensive integer divisions from the fast path.
> > + *
> > + * The active bit is set whenever a task has got an effective::value assigned,
> > + * which can be different from the user requested clamp value. This allows to
> > + * know a task is actually refcounting the rq's effective::bucket_id bucket.
> > */
> > struct uclamp_se {
> > + /* Clamp value "requested" by a scheduling entity */
> > unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > + unsigned int active : 1;
> > + /*
> > + * Clamp value "obtained" by a scheduling entity.
> > + *
> > + * This cache the actual clamp value, possibly enforced by system
> > + * default clamps, a task is subject to while enqueued in a rq.
> > + */
> > + struct {
> > + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > + } effective;
>
> I still think that this effective thing is backwards.
>
> The existing code already used @value and @bucket_id as 'effective' and
> you're now changing all that again. This really doesn't make sense to
> me.

With respect to the previous v6, I've now moved this concept to the
patch where we actually use it for the first time.

In this patch we add system default values, thus a task is now subject
to two possible constraints: the task specific (TS) one or the system
default (SD) one.

The most restrictive of the two must be enforced, but we also want to
keep track of the task-specific value while the system default is
enforced, so it can be restored when the system default is relaxed.

For example:

TS:   |.............. 20 .................|
SD:   |... 0 ....|..... 40 .....|... 0 ...|
Time: |..........|..............|.........|
      t0         t1             t2        t3

Despite the task asking always only for a 20% boost:
- in [t1,t2] we want to boost it to 40% but, right after...
- in [t2,t3] we want to go back to the 20% boost.

The "effective" value allows us to efficiently enforce the most
restrictive clamp value for a task at enqueue time by:
- not losing track of the original request
- not caring about updating non-runnable tasks

> Also; if you don't add it inside struct uclamp_se, but add a second
> instance,
>
> > };
> > #endif /* CONFIG_UCLAMP_TASK */
> >
>
>
> > @@ -803,6 +811,70 @@ static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id,
> > WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> > }
> >
> > +/*
> > + * The effective clamp bucket index of a task depends on, by increasing
> > + * priority:
> > + * - the task specific clamp value, when explicitly requested from userspace
> > + * - the system default clamp value, defined by the sysadmin
> > + *
> > + * As a side effect, update the task's effective value:
> > + * task_struct::uclamp::effective::value
> > + * to represent the clamp value of the task effective bucket index.
> > + */
> > +static inline void
> > +uclamp_effective_get(struct task_struct *p, unsigned int clamp_id,
> > + unsigned int *clamp_value, unsigned int *bucket_id)
> > +{
> > + /* Task specific clamp value */
> > + *bucket_id = p->uclamp[clamp_id].bucket_id;
> > + *clamp_value = p->uclamp[clamp_id].value;
> > +
> > + /* Always apply system default restrictions */
> > + if (unlikely(*clamp_value > uclamp_default[clamp_id].value)) {
> > + *clamp_value = uclamp_default[clamp_id].value;
> > + *bucket_id = uclamp_default[clamp_id].bucket_id;
> > + }
> > +}
>
> you can avoid horrors like this and simply return a struct uclamp_se by
> value.

Yes, that should be possible... will look into splitting this out in
v8 to have something like:

---8<---
struct uclamp_req {
	/* Clamp value "requested" by a scheduling entity */
	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
	unsigned int active		: 1;
	unsigned int user_defined	: 1;
};

struct uclamp_eff {
	unsigned int value		: bits_per(SCHED_CAPACITY_SCALE);
	unsigned int bucket_id		: bits_per(UCLAMP_BUCKETS);
};

struct task_struct {
	// ...
#ifdef CONFIG_UCLAMP_TASK
	struct uclamp_req uclamp_req[UCLAMP_CNT];
	struct uclamp_eff uclamp_eff[UCLAMP_CNT];
#endif
	// ...
};

static inline struct uclamp_eff
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
	struct uclamp_eff uc_eff = p->uclamp_eff[clamp_id];

	uc_eff.bucket_id = p->uclamp_req[clamp_id].bucket_id;
	uc_eff.value = p->uclamp_req[clamp_id].value;

	if (unlikely(uc_eff.value > uclamp_default[clamp_id].value)) {
		uc_eff.value = uclamp_default[clamp_id].value;
		uc_eff.bucket_id = uclamp_default[clamp_id].bucket_id;
	}

	return uc_eff;
}

static inline void
uclamp_eff_set(struct task_struct *p, unsigned int clamp_id)
{
	p->uclamp_eff[clamp_id] = uclamp_eff_get(p, clamp_id);
}
---8<---

Is that what you mean ?

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 17:23:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> Yes, the for looks better, but perhaps like that:
>
> unsigned int bucket_id = UCLAMP_BUCKETS;
>
> /*
> * Both min and max clamps are MAX aggregated, thus the topmost
> * bucket with some tasks defines the rq's clamp value.
> */
> for (; bucket_id >= 0; --bucket_id) {

GCC will be clever and figure that unsigned will never be smaller than 0
and turn the above into an infinite loop or something daft.

That is; use 'int', not 'unsigned int' for bucket_id.

2019-03-13 17:31:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On Wed, Mar 13, 2019 at 04:20:51PM +0000, Patrick Bellasi wrote:
> On 13-Mar 15:10, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> > > +{
> > > + /*
> > > + * Avoid blocked utilization pushing up the frequency when we go
> > > + * idle (which drops the max-clamp) by retaining the last known
> > > + * max-clamp.
> > > + */
> > > + if (clamp_id == UCLAMP_MAX) {
> > > + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> > > + return clamp_value;
> > > + }
> > > +
> > > + return uclamp_none(UCLAMP_MIN);
> >
> > That's a very complicated way of writing: return 0, right?
>
> In my mind it's just a simple way to hardcode values in just one place.
>
> In the current implementation uclamp_none(UCLAMP_MIN) is 0 and the
> compiler has no trouble inlining a 0 there.
>
> Is it really so disgusting ?

Not disgusting per se, just complicated. It had me go back and check
wth uclamp_none() did again.

2019-03-13 18:23:33

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 18:22, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> > Yes, the for looks better, but perhaps like that:
> >
> > unsigned int bucket_id = UCLAMP_BUCKETS;
> >
> > /*
> > * Both min and max clamps are MAX aggregated, thus the topmost
> > * bucket with some tasks defines the rq's clamp value.
> > */
> > for (; bucket_id >= 0; --bucket_id) {
>
> GCC will be clever and figure that unsigned will never be smaller than 0
> and turn the above into an infinite loop or something daft.
>
> That is; use 'int', not 'unsigned int' for bucket_id.

Right, which reminds me now why I originally went for a do { } while();

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 18:32:13

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On 13-Mar 18:29, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 04:20:51PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 15:10, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > > > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> > > > +{
> > > > + /*
> > > > + * Avoid blocked utilization pushing up the frequency when we go
> > > > + * idle (which drops the max-clamp) by retaining the last known
> > > > + * max-clamp.
> > > > + */
> > > > + if (clamp_id == UCLAMP_MAX) {
> > > > + rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> > > > + return clamp_value;
> > > > + }
> > > > +
> > > > + return uclamp_none(UCLAMP_MIN);
> > >
> > > That's a very complicated way of writing: return 0, right?
> >
> > In my mind it's just a simple way to hardcode values in just one place.
> >
> > In the current implementation uclamp_none(UCLAMP_MIN) is 0 and the
> > compiler has no trouble inlining a 0 there.
> >
> > Is it really so disgusting ?
>
> Not disgusting per se, just complicated. It had me go back and check
> wth uclamp_none() did again.

Yes, I see... every time I read it I just consider that uclamp_none()
is simply returning whatever is (or will be) the "unclamped" value
for the specified clamp index.

If it's ok with you, I would keep the code above as it is now.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-13 19:32:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:52, Peter Zijlstra wrote:
> > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > + unsigned int clamp_id)
> > > +{
> > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > + unsigned int rq_clamp, bkt_clamp;
> > > +
> > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > +
> > > + /*
> > > + * Keep "local clamping" simple and accept to (possibly) overboost
> > > + * still RUNNABLE tasks in the same bucket.
> > > + */
> > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > + return;
> >
> > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > suppose)
>
> What do you mean ?

that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in
a row. And yes the compiler _should_ dtrt, but....

2019-03-13 19:40:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:52, Peter Zijlstra wrote:

> Because of bucketization, we potentially end up tracking tasks with
> different requested clamp values in the same bucket.
>
> For example, with 20% bucket size, we can have:
> Task1: util_min=25%
> Task2: util_min=35%
> accounted in the same bucket.

> > Given all that, what is to stop the bucket value to climbing to
> > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > runnable)?
>
> Nothing... but that's an expected consequence of bucketization.

No, it is not.

> > Why are we doing this... ?
>
> You can either decide to:
>
> a) always boost tasks to just the bucket nominal value
> thus always penalizing both Task1 and Task2 of the example above

This is the expected behaviour. When was the last time your histogram
did something like b?

> b) always boost tasks to the bucket "max" value
> thus always overboosting both Task1 and Task2 of the example above
>
> The solution above instead has a very good property: in systems
> where you have only few and well defined clamp values we always
> provide the exact boost.
>
> For example, if your system requires only 23% and 47% boost values
> (totally random numbers), then you can always get the exact boost
> required using just 3 buckets of ~33% size each.
>
> In systems where you don't know which boost values you will have, you
> can still define the maximum overboost granularity you accept for
> each task by just tuning the number of clamp buckets. For example,
> with 20 buckets you can have a 5% max overboost.

Maybe, but this is not a direct consequence of buckets, but an
additional heuristic that might work well in this case.

Maybe split this out in a separate patch? So start with the trivial
bucket, and then do this change on top with the above few paragraphs as
changelog?

2019-03-13 19:47:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> On 13-Mar 15:09, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:

> > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > +{
> > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > + unsigned int max_value = uclamp_none(clamp_id);
> >
> > That's 1024 for uclamp_max
> >
> > > + unsigned int bucket_id;
> > > +
> > > + /*
> > > + * Both min and max clamps are MAX aggregated, thus the topmost
> > > + * bucket with some tasks defines the rq's clamp value.
> > > + */
> > > + bucket_id = UCLAMP_BUCKETS;
> > > + do {
> > > + --bucket_id;
> > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > + continue;
> > > + max_value = bucket[bucket_id].value;
> >
> > but this will then _lower_ it. That's not a MAX aggregate.
>
> For uclamp_max we want max_value=1024 when there are no active tasks,
> which means: no max clamp enforced on CFS/RT "idle" cpus.
>
> If instead there are active RT/CFS tasks then we want the clamp value
> of the max group, which means: MAX aggregate active clamps.
>
> That's what the code above does and the comment says.

That's (obviously) not how I read it... maybe something like:

static inline unsigned int uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	int i;

	/*
	 * Since both min and max clamps are max aggregated, find the
	 * top most bucket with tasks in.
	 */
	for (i = UCLAMP_BUCKETS-1; i >= 0; i--) {
		if (!bucket[i].tasks)
			continue;
		return bucket[i].value;
	}

	/* No tasks -- default clamp values */
	return uclamp_none(clamp_id);
}

would make it clearer?

2019-03-13 19:49:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> On 13-Mar 14:40, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > > +{
> > > + return clamp_value / UCLAMP_BUCKET_DELTA;
> > > +}
> > > +
> > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > +{
> > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> >
> > return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
> >
> > might generate better code; just a single division, instead of a div and
> > mult.
>
> Wondering if compilers cannot do these optimizations... but yes, looks
> cool and will do it in v8, thanks.

I'd be most impressed if they pull this off. Check the generated code
and see I suppose :-)

2019-03-13 20:00:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Wed, Mar 13, 2019 at 05:09:40PM +0000, Patrick Bellasi wrote:
> On 13-Mar 15:32, Peter Zijlstra wrote:

> > I still think that this effective thing is backwards.

> With respect to the previous v6, I've now moved this concept to the
> patch where we actually use it for the first time.

> The "effective" value allows us to efficiently enforce the most
> restrictive clamp value for a task at enqueue time by:
> - not losing track of the original request
> - not caring about updating non-runnable tasks

My point is that you already had an effective value; namely p->uclamp[],
since patch 1.

This patch then changes that into something else, instead of adding
p->uclamp_orig[], and consequently has to update all sites that
previously used p->uclamp[], which is a lot of pointless churn.



2019-03-13 20:10:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Wed, Mar 13, 2019 at 05:09:40PM +0000, Patrick Bellasi wrote:

> Yes, that should be possible... will look into splitting this out in
> v8 to have something like:
>
> ---8<---
> struct uclamp_req {
> /* Clamp value "requested" by a scheduling entity */
> unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> unsigned int active : 1;
> unsigned int user_defined : 1;
> }
>
> struct uclamp_eff {
> unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> }

No, have _1_ type. There is no point what so ever to splitting this.

Also, what's @user_defined about, I don't think I've seen that in the
parent patch.

> struct task_struct {
> // ...
> #ifdef CONFIG_UCLAMP_TASK
> struct uclamp_req uclamp_req[UCLAMP_CNT];
> struct uclamp_eff uclamp_eff[UCLAMP_CNT];

struct uclamp_se uclamp[UCLAMP_CNT];
struct uclamp_se uclamp_req[UCLAMP_CNT];

Where the first is the very same introduced in patch #1, and leaving it
in place avoids having to update the sites already using that (or start
#1 with the _eff name to avoid having to change things around?).

> #endif
> // ...
> }
>
> static inline struct uclamp_eff
> uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> {
> struct uclamp_eff uc_eff = p->uclamp_eff[clamp_id];

just this ^, these lines seem like a superfluous duplication:

> uc_eff.bucket_id = p->uclamp_req[clamp_id].bucket_id;
> uc_eff.value = p->uclamp_req[clamp_id].value;


> 	if (unlikely(uc_eff.value > uclamp_default[clamp_id].value)) {
> 		uc_eff.value = uclamp_default[clamp_id].value;
> 		uc_eff.bucket_id = uclamp_default[clamp_id].bucket_id;

and:

uc = uclamp_default[clamp_id];

> }
>
> return uc_eff;
> }
>
> static inline void
> uclamp_eff_set(struct task_struct *p, unsigned int clamp_id)
> {
> p->uclamp_eff[clamp_id] = uclamp_eff_get(p, clamp_id);
> }
> ---8<---
>
> Is that what you mean ?

Getting there :-)

2019-03-13 20:15:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> +int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> +				 void __user *buffer, size_t *lenp,
> +				 loff_t *ppos)
> +{
> +	int old_min, old_max;
> +	int result = 0;

Should this not have an internal mutex to serialize concurrent usage?
See for example sched_rt_handler().

> +
> +	old_min = sysctl_sched_uclamp_util_min;
> +	old_max = sysctl_sched_uclamp_util_max;
> +
> +	result = proc_dointvec(table, write, buffer, lenp, ppos);
> +	if (result)
> +		goto undo;
> +	if (!write)
> +		goto done;
> +
> +	if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> +	    sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> +		result = -EINVAL;
> +		goto undo;
> +	}
> +
> +	if (old_min != sysctl_sched_uclamp_util_min) {
> +		uclamp_default[UCLAMP_MIN].value =
> +			sysctl_sched_uclamp_util_min;
> +		uclamp_default[UCLAMP_MIN].bucket_id =
> +			uclamp_bucket_id(sysctl_sched_uclamp_util_min);
> +	}
> +	if (old_max != sysctl_sched_uclamp_util_max) {
> +		uclamp_default[UCLAMP_MAX].value =
> +			sysctl_sched_uclamp_util_max;
> +		uclamp_default[UCLAMP_MAX].bucket_id =
> +			uclamp_bucket_id(sysctl_sched_uclamp_util_max);
> +	}
> +
> +	/*
> +	 * Updating all the RUNNABLE task is expensive, keep it simple and do
> +	 * just a lazy update at each next enqueue time.
> +	 */
> +	goto done;
> +
> +undo:
> +	sysctl_sched_uclamp_util_min = old_min;
> +	sysctl_sched_uclamp_util_max = old_max;
> +done:
> +
> +	return result;
> +}

2019-03-13 20:20:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> +static void uclamp_fork(struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + p->uclamp[clamp_id].active = false;
> +}

Because in that case .active == false, and copy_process() will have done
the right thing?

2019-03-13 20:53:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 06/15] sched/core: uclamp: Reset uclamp values on RESET_ON_FORK

On Fri, Feb 08, 2019 at 10:05:45AM +0000, Patrick Bellasi wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 070caa1f72eb..8b282616e9c9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1071,7 +1071,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
> }
> }
>
> -static void uclamp_fork(struct task_struct *p)
> +static void uclamp_fork(struct task_struct *p, bool reset)
> {
> unsigned int clamp_id;
>
> @@ -1080,6 +1080,17 @@ static void uclamp_fork(struct task_struct *p)

IIRC there's an early return here if the class doesn't have uclamp
support, which I think is wrong now. You want the reset irrespective of
whether the class supports it, no?

>
> for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> p->uclamp[clamp_id].active = false;
> +
> + if (likely(!reset))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + unsigned int clamp_value = uclamp_none(clamp_id);
> +
> + p->uclamp[clamp_id].user_defined = false;
> + p->uclamp[clamp_id].value = clamp_value;
> + p->uclamp[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> + }
> }



2019-03-13 21:02:44

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <[email protected]> wrote:
>
> On 12-Mar 13:52, Dietmar Eggemann wrote:
> > On 2/8/19 11:05 AM, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > +config UCLAMP_BUCKETS_COUNT
> > > + int "Number of supported utilization clamp buckets"
> > > + range 5 20
> > > + default 5
> > > + depends on UCLAMP_TASK
> > > + help
> > > + Defines the number of clamp buckets to use. The range of each bucket
> > > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> > > + number of clamp buckets the finer their granularity and the higher
> > > + the precision of clamping aggregation and tracking at run-time.
> > > +
> > > + For example, with the default configuration we will have 5 clamp
> > > + buckets tracking 20% utilization each. A 25% boosted tasks will be
> > > + refcounted in the [20..39]% bucket and will set the bucket clamp
> > > + effective value to 25%.
> > > + If a second 30% boosted task should be co-scheduled on the same CPU,
> > > + that task will be refcounted in the same bucket of the first task and
> > > + it will boost the bucket clamp effective value to 30%.
> > > + The clamp effective value of a bucket is reset to its nominal value
> > > + (20% in the example above) when there are anymore tasks refcounted in
> >
> > this sounds weird.
>
> Why ?

Should probably be "when there are no more tasks refcounted"

> >
> > [...]
> >
> > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > +{
> > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > > +}
> >
> > Something like uclamp_bucket_nominal_value() should be clearer.
>
> Maybe... can update it in v8
>

uclamp_bucket_base_value is a little shorter, just to consider :)

> > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > +{
> > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > + unsigned int max_value = uclamp_none(clamp_id);
> > > + unsigned int bucket_id;
> >
> > unsigned int bucket_id = UCLAMP_BUCKETS;
> >
> > > +
> > > + /*
> > > + * Both min and max clamps are MAX aggregated, thus the topmost
> > > + * bucket with some tasks defines the rq's clamp value.
> > > + */
> > > + bucket_id = UCLAMP_BUCKETS;
> >
> > to get rid of this line?
>
> I put it on a different line as a justification for the loop variable
> initialization described in the comment above.
>
> >
> > > + do {
> > > + --bucket_id;
> > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> >
> > if (!bucket[bucket_id].tasks)
>
> Right... that's some leftover from the last refactoring!
>
> [...]
>
> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > + unsigned int clamp_id)
> > > +{
> > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> >
> > Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> > structure as well to the bucket?
> >
> > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
>
> I think I went back/forth a couple of times in using pointer or the
> extended version, which both have pros and cons.
>
> I personally prefer the pointers, as you suggest, but I got the
> impression in the past that, since everybody has passed "basic C
> training", the code above is not so difficult to read either.
>
> > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> > much more readable.
>
> Agree... let's try to switch once again in v8 and see ;)
>
> > [...]
> >
> > > struct sched_class {
> > > const struct sched_class *next;
> > > +#ifdef CONFIG_UCLAMP_TASK
> > > + int uclamp_enabled;
> > > +#endif
> > > +
> > > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> > > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> > > - void (*yield_task) (struct rq *rq);
> > > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
> > > @@ -1685,7 +1734,6 @@ struct sched_class {
> > > void (*set_curr_task)(struct rq *rq);
> > > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> > > void (*task_fork)(struct task_struct *p);
> > > - void (*task_dead)(struct task_struct *p);
> > > /*
> > > * The switched_from() call is allowed to drop rq->lock, therefore we
> > > @@ -1702,12 +1750,17 @@ struct sched_class {
> > > void (*update_curr)(struct rq *rq);
> > > + void (*yield_task) (struct rq *rq);
> > > + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> > > +
> > > #define TASK_SET_GROUP 0
> > > #define TASK_MOVE_GROUP 1
> > > #ifdef CONFIG_FAIR_GROUP_SCHED
> > > void (*task_change_group)(struct task_struct *p, int type);
> > > #endif
> > > +
> > > + void (*task_dead)(struct task_struct *p);
> >
> > Why do you move yield_task, yield_to_task and task_dead here?
>
> Since I'm adding a new field at the beginning of the struct, which is
> used at enqueue/dequeue time, this is to ensure that all the
> callbacks used in these paths are grouped together and don't fall
> across a cache line... but yes, that's supposed to be a
> micro-optimization which I can skip in this patch.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2019-03-13 21:09:14

by Suren Baghdasaryan

Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
>
> > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > + unsigned int max_value = uclamp_none(clamp_id);
> > >
> > > That's 1024 for uclamp_max
> > >
> > > > + unsigned int bucket_id;
> > > > +
> > > > + /*
> > > > + * Both min and max clamps are MAX aggregated, thus the topmost
> > > > + * bucket with some tasks defines the rq's clamp value.
> > > > + */
> > > > + bucket_id = UCLAMP_BUCKETS;
> > > > + do {
> > > > + --bucket_id;
> > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > + continue;
> > > > + max_value = bucket[bucket_id].value;
> > >
> > > but this will then _lower_ it. That's not a MAX aggregate.
> >
> > For uclamp_max we want max_value=1024 when there are no active tasks,
> > which means: no max clamp enforced on CFS/RT "idle" cpus.
> >
> > If instead there are active RT/CFS tasks then we want the clamp value
> > of the max group, which means: MAX aggregate active clamps.
> >
> > That's what the code above does and the comment says.
>
> That's (obviously) not how I read it... maybe something like:
>
> static inline unsigned int uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> {
> 	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> 	int i;
>
> 	/*
> 	 * Since both min and max clamps are max aggregated, find the
> 	 * top most bucket with tasks in.
> 	 */
> 	for (i = UCLAMP_BUCKETS - 1; i >= 0; i--) {
> 		if (!bucket[i].tasks)
> 			continue;
> 		return bucket[i].value;
> 	}
>
> 	/* No tasks -- default clamp values */
> 	return uclamp_none(clamp_id);
> }
>
> would make it clearer?

This way it's also more readable/obvious when it's used inside
uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like
get_max_rq_uclamp.

2019-03-13 21:24:24

by Suren Baghdasaryan

Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > +/*
> > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > + * immediately updates the rq's clamp value if required.
> > + *
> > + * Since tasks know their specific value requested from user-space, we track
> > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > + * This provides a further aggregation (local clamping) which allows tracking
> > + * within each bucket the exact "requested" clamp value whenever all tasks
> > + * RUNNABLE in that bucket require the same clamp.
> > + */
> > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > +
> > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > +
> > + /*
> > + * Local clamping: rq's buckets always track the max "requested"
> > + * clamp value from all RUNNABLE tasks in that bucket.
> > + */
> > + tsk_clamp = p->uclamp[clamp_id].value;
> > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
>
> So, if I read this correct:
>
> - here we track a max value in a bucket,
>
> > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > +}
> > +
> > +/*
> > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > + * the task is released. If this is the last task reference counting the rq's
> > + * max active clamp value, then the rq's clamp value is updated.
> > + * Both the tasks reference counter and the rq's cached clamp values are
> > + * expected to be always valid, if we detect they are not we skip the updates,
> > + * enforce a consistent state and warn.
> > + */
> > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > + unsigned int clamp_id)
> > +{
> > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + unsigned int rq_clamp, bkt_clamp;
> > +
> > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > +
> > + /*
> > + * Keep "local clamping" simple and accept to (possibly) overboost
> > + * still RUNNABLE tasks in the same bucket.
> > + */
> > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > + return;
>
> (Oh man, I hope that generates semi sane code; long live CSE passes I
> suppose)
>
> But we never decrement that bkt_clamp value on dequeue.
>
> > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > +
> > + /* The rq's clamp value is expected to always track the max */
> > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > + SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > + if (bkt_clamp >= rq_clamp) {
>
> head hurts, this reads ==, how can this ever not be so?
>
> > + /*
> > + * Reset rq's clamp bucket value to its nominal value whenever
> > + * there are anymore RUNNABLE tasks refcounting it.
>
> -ENOPARSE
>
> > + */
> > + rq->uclamp[clamp_id].bucket[bucket_id].value =
> > + uclamp_bucket_value(rq_clamp);
>
> But basically you decrement the bucket value to the nominal value.
>
> > + uclamp_rq_update(rq, clamp_id);
> > + }
> > +}
>
> Given all that, what is to stop the bucket value to climbing to
> uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> runnable)?
>
> Why are we doing this... ?

I agree with Peter, this part of the patch was the hardest to read.
SCHED_WARN_ON line makes sense to me. The condition that follows and
the following comment are a little baffling. Condition seems to
indicate that the code that follows should be executed only if we are
in the top-most occupied bucket (the bucket which has tasks and has
the highest uclamp value). So this bucket just lost its last task and
we should update rq->uclamp[clamp_id].value. However that's not
exactly what the code does... It also resets
rq->uclamp[clamp_id].bucket[bucket_id].value. So if I understand
correctly, unless the bucket that just lost its last task is the
top-most one its value will not be reset to nominal value. That looks
like a bug to me. Am I missing something?

Side note: some more explanation would be very helpful.

2019-03-13 21:33:28

by Suren Baghdasaryan

Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
>
> Utilization clamping allows clamping the CPU's utilization within a
> [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> that CPU. Each task references two "clamp buckets" defining its minimum
> and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> bucket is active if there is at least one RUNNABLE task enqueued on
> that CPU and refcounting that bucket.
>
> When a task is {en,de}queued {on,from} a rq, the set of active clamp
> buckets on that CPU can change. Since each clamp bucket enforces a
> different utilization clamp value, when the set of active clamp buckets
> changes, a new "aggregated" clamp value is computed for that CPU.
>
> Clamp values are always MAX aggregated for both util_min and util_max.
> This ensures that no tasks can affect the performance of other
> co-scheduled tasks which are more boosted (i.e. with higher util_min
> clamp) or less capped (i.e. with higher util_max clamp).
>
> Each task has a:
> task_struct::uclamp[clamp_id]::bucket_id
> to track the "bucket index" of the CPU's clamp bucket it refcounts while
> enqueued, for each clamp index (clamp_id).
>
> Each CPU's rq has a:
> rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> to track how many tasks, currently RUNNABLE on that CPU, refcount each
> clamp bucket (bucket_id) of a clamp index (clamp_id).
>
> Each CPU's rq has also a:
> rq::uclamp[clamp_id]::bucket[bucket_id].value
> to track the clamp value of each clamp bucket (bucket_id) of a clamp
> index (clamp_id).
>
> The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> to find a new MAX aggregated clamp value for a clamp_id. This operation
> is required only when we dequeue the last task of a clamp bucket
> tracking the current MAX aggregated clamp value. In these cases, the CPU
> is either entering IDLE or going to schedule a less boosted or more
> clamped task.
> The expected number of different clamp values, configured at build time,
> is small enough to fit the full unordered array into a single cache
> line.

I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
here. uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up
to 20. sizeof(long)*20 is already more than 64 bytes. What am I
missing?

> Add the basic data structures required to refcount, in each CPU's rq,
> the number of RUNNABLE tasks for each clamp bucket. Add also the max
> aggregation required to update the rq's clamp value at each
> enqueue/dequeue event.
>
> Use a simple linear mapping of clamp values into clamp buckets.
> Pre-compute and cache bucket_id to avoid integer divisions at
> enqueue/dequeue time.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> ---
> Changes in v7:
> Message-ID: <[email protected]>
> - removed buckets mapping code
> - use a simpler linear mapping of clamp values into buckets
> Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin>
> - move this patch at the beginning of the series,
> in the attempt to make the overall series easier to digest by moving
> at the very beginning the core bits and main data structures
> Others:
> - update the mapping logic to use exactly and only
> UCLAMP_BUCKETS_COUNT buckets, i.e. no more "special" bucket
> - update uclamp_rq_update() to do top-bottom max search
> ---
> include/linux/log2.h | 37 ++++++++
> include/linux/sched.h | 39 ++++++++
> include/linux/sched/topology.h | 6 --
> init/Kconfig | 53 +++++++++++
> kernel/sched/core.c | 165 +++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 59 +++++++++++-
> 6 files changed, 350 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/log2.h b/include/linux/log2.h
> index 2af7f77866d0..e2db25734532 100644
> --- a/include/linux/log2.h
> +++ b/include/linux/log2.h
> @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n)
> ilog2((n) - 1) + 1) : \
> __order_base_2(n) \
> )
> +
> +static inline __attribute__((const))
> +int __bits_per(unsigned long n)
> +{
> + if (n < 2)
> + return 1;
> + if (is_power_of_2(n))
> + return order_base_2(n) + 1;
> + return order_base_2(n);
> +}
> +
> +/**
> + * bits_per - calculate the number of bits required for the argument
> + * @n: parameter
> + *
> + * This is constant-capable and can be used for compile time
> > + * initializations, e.g. bitfields.
> + *
> + * The first few values calculated by this routine:
> + * bf(0) = 1
> + * bf(1) = 1
> + * bf(2) = 2
> + * bf(3) = 2
> + * bf(4) = 3
> + * ... and so on.
> + */
> +#define bits_per(n) \
> +( \
> + __builtin_constant_p(n) ? ( \
> + ((n) == 0 || (n) == 1) ? 1 : ( \
> + ((n) & (n - 1)) == 0 ? \
> + ilog2((n) - 1) + 2 : \
> + ilog2((n) - 1) + 1 \
> + ) \
> + ) : \
> + __bits_per(n) \
> +)
> #endif /* _LINUX_LOG2_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4112639c2a85..45460e7a3eee 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -281,6 +281,18 @@ struct vtime {
> u64 gtime;
> };
>
> +/*
> + * Utilization clamp constraints.
> + * @UCLAMP_MIN: Minimum utilization
> + * @UCLAMP_MAX: Maximum utilization
> + * @UCLAMP_CNT: Utilization clamp constraints count
> + */
> +enum uclamp_id {
> + UCLAMP_MIN = 0,
> + UCLAMP_MAX,
> + UCLAMP_CNT
> +};
> +
> struct sched_info {
> #ifdef CONFIG_SCHED_INFO
> /* Cumulative counters: */
> @@ -312,6 +324,10 @@ struct sched_info {
> # define SCHED_FIXEDPOINT_SHIFT 10
> # define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)
>
> +/* Increase resolution of cpu_capacity calculations */
> +# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> +# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> +
> struct load_weight {
> unsigned long weight;
> u32 inv_weight;
> @@ -560,6 +576,25 @@ struct sched_dl_entity {
> struct hrtimer inactive_timer;
> };
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/* Number of utilization clamp buckets (shorter alias) */
> +#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
> +
> +/*
> + * Utilization clamp for a scheduling entity
> + * @value: clamp value "requested" by a se
> + * @bucket_id: clamp bucket corresponding to the "requested" value
> + *
> + * The bucket_id is the index of the clamp bucket matching the clamp value
> + * which is pre-computed and stored to avoid expensive integer divisions from
> + * the fast path.
> + */
> +struct uclamp_se {
> + unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> union rcu_special {
> struct {
> u8 blocked;
> @@ -640,6 +675,10 @@ struct task_struct {
> #endif
> struct sched_dl_entity dl;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + struct uclamp_se uclamp[UCLAMP_CNT];
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c31d3a47a47c..04beadac6985 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -6,12 +6,6 @@
>
> #include <linux/sched/idle.h>
>
> -/*
> - * Increase resolution of cpu_capacity calculations
> - */
> -#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
> -#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
> -
> /*
> * sched-domains (multiprocessor balancing) declarations:
> */
> diff --git a/init/Kconfig b/init/Kconfig
> index 513fa544a134..34e23d5d95d1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -640,6 +640,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
> config GENERIC_SCHED_CLOCK
> bool
>
> +menu "Scheduler features"
> +
> +config UCLAMP_TASK
> + bool "Enable utilization clamping for RT/FAIR tasks"
> + depends on CPU_FREQ_GOV_SCHEDUTIL
> + help
> + This feature enables the scheduler to track the clamped utilization
> + of each CPU based on RUNNABLE tasks scheduled on that CPU.
> +
> + With this option, the user can specify the min and max CPU
> + utilization allowed for RUNNABLE tasks. The max utilization defines
> + the maximum frequency a task should use while the min utilization
> + defines the minimum frequency it should use.
> +
> + Both min and max utilization clamp values are hints to the scheduler,
> + aiming at improving its frequency selection policy, but they do not
> + enforce or grant any specific bandwidth for tasks.
> +
> + If in doubt, say N.
> +
> +config UCLAMP_BUCKETS_COUNT
> + int "Number of supported utilization clamp buckets"
> + range 5 20
> + default 5
> + depends on UCLAMP_TASK
> + help
> + Defines the number of clamp buckets to use. The range of each bucket
> + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
> + number of clamp buckets the finer their granularity and the higher
> + the precision of clamping aggregation and tracking at run-time.
> +
> + For example, with the default configuration we will have 5 clamp
> > + buckets tracking 20% utilization each. A 25% boosted task will be
> + refcounted in the [20..39]% bucket and will set the bucket clamp
> + effective value to 25%.
> + If a second 30% boosted task should be co-scheduled on the same CPU,
> + that task will be refcounted in the same bucket of the first task and
> + it will boost the bucket clamp effective value to 30%.
> + The clamp effective value of a bucket is reset to its nominal value
> + (20% in the example above) when there are anymore tasks refcounted in
> + that bucket.
> +
> + An additional boost/capping margin can be added to some tasks. In the
> + example above the 25% task will be boosted to 30% until it exits the
> + CPU. If that should be considered not acceptable on certain systems,
> + it's always possible to reduce the margin by increasing the number of
> + clamp buckets to trade off used memory for run-time tracking
> + precision.
> +
> + If in doubt, use the default value.
> +
> +endmenu
> +
> #
> # For architectures that want to enable the support for NUMA-affine scheduler
> # balancing logic:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ec1b67a195cc..8ecf5470058c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -719,6 +719,167 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +
> +/* Integer ceil-rounded range for each bucket */
> +#define UCLAMP_BUCKET_DELTA ((SCHED_CAPACITY_SCALE / UCLAMP_BUCKETS) + 1)
> +
> +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> +{
> + return clamp_value / UCLAMP_BUCKET_DELTA;
> +}
> +
> +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> +{
> + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> +}
> +
> +static inline unsigned int uclamp_none(int clamp_id)
> +{
> + if (clamp_id == UCLAMP_MIN)
> + return 0;
> + return SCHED_CAPACITY_SCALE;
> +}
> +
> +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> +{
> + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> + unsigned int max_value = uclamp_none(clamp_id);
> + unsigned int bucket_id;
> +
> + /*
> + * Both min and max clamps are MAX aggregated, thus the topmost
> + * bucket with some tasks defines the rq's clamp value.
> + */
> + bucket_id = UCLAMP_BUCKETS;
> + do {
> + --bucket_id;
> + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> + continue;
> + max_value = bucket[bucket_id].value;
> + break;
> + } while (bucket_id);
> +
> + WRITE_ONCE(rq->uclamp[clamp_id].value, max_value);
> +}
> +
> +/*
> + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> + * task's uclamp::bucket_id is reference counted on that rq. This also
> + * immediately updates the rq's clamp value if required.
> + *
> + * Since tasks know their specific value requested from user-space, we track
> + * within each bucket the maximum value for tasks refcounted in that bucket.
> > + * This provides a further aggregation (local clamping) which allows tracking
> + * within each bucket the exact "requested" clamp value whenever all tasks
> + * RUNNABLE in that bucket require the same clamp.
> + */
> +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> +
> + rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> +
> + /*
> + * Local clamping: rq's buckets always track the max "requested"
> + * clamp value from all RUNNABLE tasks in that bucket.
> + */
> + tsk_clamp = p->uclamp[clamp_id].value;
> + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
> +
> + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> +}
> +
> +/*
> + * When a task is dequeued from a rq, the clamp bucket reference counted by
> + * the task is released. If this is the last task reference counting the rq's
> + * max active clamp value, then the rq's clamp value is updated.
> + * Both the tasks reference counter and the rq's cached clamp values are
> + * expected to be always valid, if we detect they are not we skip the updates,
> + * enforce a consistent state and warn.
> + */
> +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> + unsigned int clamp_id)
> +{
> + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> + unsigned int rq_clamp, bkt_clamp;
> +
> + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> +
> + /*
> + * Keep "local clamping" simple and accept to (possibly) overboost
> + * still RUNNABLE tasks in the same bucket.
> + */
> + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> + return;
> + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> +
> + /* The rq's clamp value is expected to always track the max */
> + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> + SCHED_WARN_ON(bkt_clamp > rq_clamp);
> + if (bkt_clamp >= rq_clamp) {
> + /*
> + * Reset rq's clamp bucket value to its nominal value whenever
> + * there are anymore RUNNABLE tasks refcounting it.
> + */
> + rq->uclamp[clamp_id].bucket[bucket_id].value =
> + uclamp_bucket_value(rq_clamp);
> + uclamp_rq_update(rq, clamp_id);
> + }
> +}
> +
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_inc_id(p, rq, clamp_id);
> +}
> +
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> +{
> + unsigned int clamp_id;
> +
> + if (unlikely(!p->sched_class->uclamp_enabled))
> + return;
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> + uclamp_rq_dec_id(p, rq, clamp_id);
> +}
> +
> +static void __init init_uclamp(void)
> +{
> + unsigned int clamp_id;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +
> + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> + unsigned int clamp_value = uclamp_none(clamp_id);
> + unsigned int bucket_id = uclamp_bucket_id(clamp_value);
> + struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> +
> + uc_se->bucket_id = bucket_id;
> + uc_se->value = clamp_value;
> + }
> +}
> +
> +#else /* CONFIG_UCLAMP_TASK */
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void init_uclamp(void) { }
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (!(flags & ENQUEUE_NOCLOCK))
> @@ -729,6 +890,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_enqueue(p, flags & ENQUEUE_WAKEUP);
> }
>
> + uclamp_rq_inc(rq, p);
> p->sched_class->enqueue_task(rq, p, flags);
> }
>
> @@ -742,6 +904,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> psi_dequeue(p, flags & DEQUEUE_SLEEP);
> }
>
> + uclamp_rq_dec(rq, p);
> p->sched_class->dequeue_task(rq, p, flags);
> }
>
> @@ -6075,6 +6238,8 @@ void __init sched_init(void)
>
> psi_init();
>
> + init_uclamp();
> +
> scheduler_running = 1;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c688ef5012e5..ea9e28723946 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -797,6 +797,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
> #endif
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * struct uclamp_bucket - Utilization clamp bucket
> + * @value: utilization clamp value for tasks on this clamp bucket
> + * @tasks: number of RUNNABLE tasks on this clamp bucket
> + *
> + * Keep track of how many tasks are RUNNABLE for a given utilization
> + * clamp value.
> + */
> +struct uclamp_bucket {
> + unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> + unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> +};
> +
> +/*
> + * struct uclamp_rq - rq's utilization clamp
> + * @value: currently active clamp values for a rq
> + * @bucket: utilization clamp buckets affecting a rq
> + *
> + * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
> + * A clamp value is affecting a rq when there is at least one task RUNNABLE
> + * (or actually running) with that value.
> + *
> + * We have up to UCLAMP_CNT possible different clamp values, which are
> > + * currently only two: minimum utilization and maximum utilization.
> + *
> + * All utilization clamping values are MAX aggregated, since:
> + * - for util_min: we want to run the CPU at least at the max of the minimum
> + * utilization required by its currently RUNNABLE tasks.
> + * - for util_max: we want to allow the CPU to run up to the max of the
> + * maximum utilization allowed by its currently RUNNABLE tasks.
> + *
> + * Since on each system we expect only a limited number of different
> + * utilization clamp values (UCLAMP_BUCKETS), we use a simple array to track
> + * the metrics required to compute all the per-rq utilization clamp values.
> + */
> +struct uclamp_rq {
> + unsigned int value;
> + struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> +};
> +#endif /* CONFIG_UCLAMP_TASK */
> +
> /*
> * This is the main, per-CPU runqueue data structure.
> *
> @@ -835,6 +877,11 @@ struct rq {
> unsigned long nr_load_updates;
> u64 nr_switches;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + /* Utilization clamp values based on CPU's RUNNABLE tasks */
> + struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +#endif
> +
> struct cfs_rq cfs;
> struct rt_rq rt;
> struct dl_rq dl;
> @@ -1649,10 +1696,12 @@ extern const u32 sched_prio_to_wmult[40];
> struct sched_class {
> const struct sched_class *next;
>
> +#ifdef CONFIG_UCLAMP_TASK
> + int uclamp_enabled;
> +#endif
> +
> void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
> void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
> - void (*yield_task) (struct rq *rq);
> - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
>
> void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
>
> @@ -1685,7 +1734,6 @@ struct sched_class {
> void (*set_curr_task)(struct rq *rq);
> void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
> void (*task_fork)(struct task_struct *p);
> - void (*task_dead)(struct task_struct *p);
>
> /*
> * The switched_from() call is allowed to drop rq->lock, therefore we
> @@ -1702,12 +1750,17 @@ struct sched_class {
>
> void (*update_curr)(struct rq *rq);
>
> + void (*yield_task) (struct rq *rq);
> + bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
> +
> #define TASK_SET_GROUP 0
> #define TASK_MOVE_GROUP 1
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void (*task_change_group)(struct task_struct *p, int type);
> #endif
> +
> + void (*task_dead)(struct task_struct *p);
> };
>
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> --
> 2.20.1
>

2019-03-14 00:31:21

by Suren Baghdasaryan

Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On Wed, Mar 13, 2019 at 9:16 AM Patrick Bellasi <[email protected]> wrote:
>
> On 13-Mar 15:12, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> > > + unsigned int clamp_value)
> > > +{
> > > + /* Reset max-clamp retention only on idle exit */
> > > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> > > + return;
> > > +
> > > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> > > +
> > > + /*
> > > + * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
> > > + * (after). The idle flag is reset only the second time, when we know
> > > + * that UCLAMP_MIN has been already updated.
> >
> > Why do we care? That is, what is this comment trying to tell us.
>
> Right, the code is clear enough, I'll remove this comment.

It would probably be even clearer if rq->uclamp_flags &=
~UCLAMP_FLAG_IDLE was done from inside uclamp_rq_inc, after
uclamp_rq_inc_id has been called for both clamps.

> >
> > > + */
> > > + if (clamp_id == UCLAMP_MAX)
> > > + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
> > > +}
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2019-03-14 11:05:04

by Patrick Bellasi

Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 20:30, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:52, Peter Zijlstra wrote:
> > > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > > + unsigned int clamp_id)
> > > > +{
> > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > > + unsigned int rq_clamp, bkt_clamp;
> > > > +
> > > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > > +
> > > > + /*
> > > > + * Keep "local clamping" simple and accept to (possibly) overboost
> > > > + * still RUNNABLE tasks in the same bucket.
> > > > + */
> > > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > > + return;
> > >
> > > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > > suppose)
> >
> > What do you mean ?
>
> that does: 'rq->uclamp[clamp_id].bucket[bucket_id].tasks' three times in
> a row. And yes the compiler _should_ dtrt, but....

Sorry, don't follow you here... but it's an interesting point. :)

The code above becomes:

if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
	return;

Are you referring to the resolution of the memory references, i.e.
1) rq->uclamp
2) rq->uclamp[clamp_id]
3) rq->uclamp[clamp_id].bucket[bucket_id]
?

By playing with:

https://godbolt.org/z/OPLpyR

I can see that this simplified version:

---8<---
#define BUCKETS 5
#define CLAMPS 2

struct uclamp {
	unsigned int value;
	struct bucket {
		unsigned int value;
		unsigned int tasks;
	} bucket[BUCKETS];
};

struct rq {
	struct uclamp uclamp[CLAMPS];
};

void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) {
	if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
		return;
	rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
}
---8<---

generates something like:

---8<---
uclamp_rq_dec_id:
sxtw x1, w1
add x3, x1, x1, lsl 1
lsl x3, x3, 2
sub x3, x3, x1
lsl x3, x3, 2
add x2, x3, x2, sxtw 3
add x0, x0, x2
ldr w1, [x0, 8]
cbz w1, .L4
ret
.L4:
mov w1, -1
str w1, [x0, 8]
ret
---8<---


which looks "sane" and quite expected, isn't it?
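For completeness, the usual way to stop worrying about the compiler's CSE pass is to hoist the common subexpression into a pointer by hand. A hedged sketch using the same toy structs as above (the decrement is guarded here, rather than reproducing the toy's inverted test):

```c
#include <assert.h>

#define BUCKETS 5
#define CLAMPS 2

struct bucket {
	unsigned int value;
	unsigned int tasks;
};

struct uclamp {
	unsigned int value;
	struct bucket bucket[BUCKETS];
};

struct rq {
	struct uclamp uclamp[CLAMPS];
};

/* Compute &rq->uclamp[clamp_id].bucket[bucket_id] exactly once, so the
 * generated code no longer depends on the compiler spotting the
 * repeated member accesses.
 */
static void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id)
{
	struct bucket *bkt = &rq->uclamp[clamp_id].bucket[bucket_id];

	if (bkt->tasks)
		bkt->tasks--;
}
```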

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 11:19:54

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 20:39, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:59:54PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:52, Peter Zijlstra wrote:
>
> > Because of bucketization, we potentially end up tracking tasks with
> > different requested clamp values in the same bucket.
> >
> > For example, with 20% bucket size, we can have:
> > Task1: util_min=25%
> > Task2: util_min=35%
> > accounted in the same bucket.
>
> > > Given all that, what is to stop the bucket value to climbing to
> > > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > > runnable)?
> >
> > Nothing... but that's an expected consequence of bucketization.
>
> No, it is not.
>
> > > Why are we doing this... ?
> >
> > You can either decide to:
> >
> > a) always boost tasks to just the bucket nominal value
> > thus always penalizing both Task1 and Task2 of the example above
>
> This is the expected behaviour. When was the last time your histogram
> did something like b?

Right, I see what you mean... strictly speaking histograms always do a
floor aggregation.

> > b) always boost tasks to the bucket "max" value
> > thus always overboosting both Task1 and Task2 of the example above
> >
> > The solution above instead has a very good property: in systems
> > where you have only few and well defined clamp values we always
> > provide the exact boost.
> >
> > For example, if your system requires only 23% and 47% boost values
> > (totally random numbers), then you can always get the exact boost
> > required using just 3 buckets of ~33% size each.
> >
> > In systems where you don't know which boost values you will have, you
> > can still define the maximum overboost granularity you accept for
> > each task by just tuning the number of clamp groups. For example, with
> > 20 groups you can have a 5% max overboost.
>
> Maybe, but this is not a direct consequence of buckets, but an
> additional heuristic that might work well in this case.

Right... that's the point.

We started with mapping to be able to track exact clamp values.
Then we switched to linear mapping to remove the complexity of
mapping, but we would like to still have the possibility to track
exact numbers whenever possible.
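The granularity trade-off quoted above can be sketched with two toy helpers (illustrative only, not kernel code; integer division mirrors the linear mapping, and "max overboost" means the largest gap between a task's requested clamp and its bucket's nominal value):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024

/* With `buckets` linearly mapped buckets, each bucket spans
 * SCHED_CAPACITY_SCALE / buckets utilization units...
 */
static unsigned int bucket_size(unsigned int buckets)
{
	return SCHED_CAPACITY_SCALE / buckets;
}

/* ...so two RUNNABLE tasks sharing a bucket can differ by at most
 * (bucket_size - 1) units: the worst-case overboost a task can get
 * from the bucket's max-tracked value.
 */
static unsigned int max_overboost(unsigned int buckets)
{
	return bucket_size(buckets) - 1;
}
```

E.g. 20 buckets gives a 51-unit bucket, i.e. a ~5% worst-case overboost on the 1024 scale, matching the figure above.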

> Maybe split this out in a separate patch? So start with the trivial
> bucket, and then do this change on top with the above few paragraphs as
> changelog?

That's doable; otherwise maybe we can just add the above paragraphs to
the changelog of this patch. But given your comment above I assume you
prefer to split it out... just let me know otherwise.


--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 11:46:55

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 20:46, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
>
> > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > +{
> > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > + unsigned int max_value = uclamp_none(clamp_id);
> > >
> > > That's 1024 for uclamp_max
> > >
> > > > + unsigned int bucket_id;
> > > > +
> > > > + /*
> > > > + * Both min and max clamps are MAX aggregated, thus the topmost
> > > > + * bucket with some tasks defines the rq's clamp value.
> > > > + */
> > > > + bucket_id = UCLAMP_BUCKETS;
> > > > + do {
> > > > + --bucket_id;
> > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > + continue;
> > > > + max_value = bucket[bucket_id].value;
> > >
> > > but this will then _lower_ it. That's not a MAX aggregate.
> >
> > For uclamp_max we want max_value=1024 when there are no active tasks,
> > which means: no max clamp enforced on CFS/RT "idle" cpus.
> >
> > If instead there are active RT/CFS tasks then we want the clamp value
> > of the max group, which means: MAX aggregate active clamps.
> >
> > That's what the code above does and the comment says.
>
> That's (obviously) not how I read it.... maybe something like:
>
> static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> {
> 	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> 	int i;
>
> 	/*
> 	 * Since both min and max clamps are max aggregated, find the
> 	 * top most bucket with tasks in.
> 	 */
> 	for (i = UCLMAP_BUCKETS-1; i>=0; i--) {
> 		if (!bucket[i].tasks)
> 			continue;
> 		return bucket[i].value;
> 	}
>
> 	/* No tasks -- default clamp value */
> 	return uclamp_none(clamp_id);
> }
>
> would make it clearer?

Fine for me. I'll then rename it to something else, since it's no
longer an "_update" once the WRITE_ONCE moves into the caller.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 12:16:03

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 20:48, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 04:12:29PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 14:40, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > > +static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
> > > > +{
> > > > + return clamp_value / UCLAMP_BUCKET_DELTA;
> > > > +}
> > > > +
> > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > > +{
> > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > >
> > > return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
> > >
> > > might generate better code; just a single division, instead of a div and
> > > mult.
> >
> > Wondering if compilers cannot do these optimizations... but yes, looks
> > cool and will do it in v8, thanks.
>
> I'd be most impressed if they pull this off. Check the generated code
> and see I suppose :-)

On x86 the code generated looks exactly the same:

https://godbolt.org/z/PjmA7k

While on arm64 it seems the difference boils down to:
- one single "mul" instruction
vs
- two instructions: "sub" _plus_ one "multiply subtract"

https://godbolt.org/z/0shU0S

So, if I didn't get something wrong... perhaps the original version is
even better, isn't it?


Test code:

---8<---
#define UCLAMP_BUCKET_DELTA 52

static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	return clamp_value / UCLAMP_BUCKET_DELTA;
}

static inline unsigned int uclamp_bucket_value1(unsigned int clamp_value)
{
	return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
}

static inline unsigned int uclamp_bucket_value2(unsigned int clamp_value)
{
	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
}

int test1(int argc, char *argv[]) {
	return uclamp_bucket_value1(argc);
}

int test2(int argc, char *argv[]) {
	return uclamp_bucket_value2(argc);
}

int test3(int argc, char *argv[]) {
	return uclamp_bucket_value1(argc) - uclamp_bucket_value2(argc);
}
---8<---

which gives on arm64:

---8<---
test1:
mov w1, 60495
movk w1, 0x4ec4, lsl 16
umull x0, w0, w1
lsr x0, x0, 36
mov w1, 52
mul w0, w0, w1
ret
test2:
mov w1, 60495
movk w1, 0x4ec4, lsl 16
umull x1, w0, w1
lsr x1, x1, 36
mov w2, 52
msub w1, w1, w2, w0
sub w0, w0, w1
ret
test3:
mov w0, 0
ret
---8<---
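Whichever form generates nicer code per architecture, the two formulations are semantically equivalent, which a quick stand-alone check confirms (test3 compiling down to `mov w0, 0` above is the compiler proving the same thing):

```c
#include <assert.h>

#define UCLAMP_BUCKET_DELTA 52

/* Round down to the bucket's nominal value via div-then-mul... */
static unsigned int uclamp_bucket_value_mul(unsigned int clamp_value)
{
	return UCLAMP_BUCKET_DELTA * (clamp_value / UCLAMP_BUCKET_DELTA);
}

/* ...or by subtracting the remainder: same result for every input. */
static unsigned int uclamp_bucket_value_mod(unsigned int clamp_value)
{
	return clamp_value - (clamp_value % UCLAMP_BUCKET_DELTA);
}
```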


--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 12:26:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:08, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Mar 13, 2019 at 03:23:59PM +0000, Patrick Bellasi wrote:
> > > On 13-Mar 15:09, Peter Zijlstra wrote:
> > > > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> >
> > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > > > > +{
> > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > > > > + unsigned int max_value = uclamp_none(clamp_id);
> > > >
> > > > That's 1024 for uclamp_max
> > > >
> > > > > + unsigned int bucket_id;
> > > > > +
> > > > > + /*
> > > > > + * Both min and max clamps are MAX aggregated, thus the topmost
> > > > > + * bucket with some tasks defines the rq's clamp value.
> > > > > + */
> > > > > + bucket_id = UCLAMP_BUCKETS;
> > > > > + do {
> > > > > + --bucket_id;
> > > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks)
> > > > > + continue;
> > > > > + max_value = bucket[bucket_id].value;
> > > >
> > > > but this will then _lower_ it. That's not a MAX aggregate.
> > >
> > > For uclamp_max we want max_value=1024 when there are no active tasks,
> > > which means: no max clamp enforced on CFS/RT "idle" cpus.
> > >
> > > If instead there are active RT/CFS tasks then we want the clamp value
> > > of the max group, which means: MAX aggregate active clamps.
> > >
> > > That's what the code above does and the comment says.
> >
> > That's (obviously) not how I read it... maybe something like:
> >
> > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
> > {
> > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
> > int i;
> >
> > /*
> > * Since both min and max clamps are max aggregated, find the
> > * top most bucket with tasks in.
> > */
> > for (i = UCLMAP_BUCKETS-1; i>=0; i--) {
> > if (!bucket[i].tasks)
> > continue;
> > return bucket[i].value;
> > }
> >
> > /* No tasks -- default clamp values */
> > return uclamp_none(clamp_id);
> > }
> >
> > would make it clearer?
>
> This way it's also more readable/obvious when it's used inside
> uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like
> get_max_rq_uclamp.

Right, I now have something like this:

---8<---
static inline unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	int bucket_id;

	/*
	 * Since both min and max clamps are max aggregated, find the
	 * top most bucket with tasks in.
	 */
	for (bucket_id = UCLAMP_BUCKETS-1; bucket_id >= 0; bucket_id--) {
		if (!bucket[bucket_id].tasks)
			continue;
		return bucket[bucket_id].value;
	}

	/* No tasks -- default clamp value */
	return uclamp_none(clamp_id);
}

static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
	//...
	if (bucket->value >= rq_clamp) {
		/*
		 * Reset rq's clamp bucket value to its nominal value whenever
		 * there are no more RUNNABLE tasks refcounting it.
		 */
		bucket->value = uclamp_bucket_nominal_value(rq_clamp);
		WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
	}
}
---8<---
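As a stand-alone model of the top-down scan above (toy types; uclamp_none() reduced to the UCLAMP_MAX case, where it returns the full capacity scale):

```c
#include <assert.h>

#define UCLAMP_BUCKETS		5
#define SCHED_CAPACITY_SCALE	1024

struct bucket {
	unsigned int value;
	unsigned int tasks;
};

/* Simplified: for UCLAMP_MAX the "no clamp" default is max capacity */
static unsigned int uclamp_none(void)
{
	return SCHED_CAPACITY_SCALE;
}

/* Scan buckets top to bottom and return the value of the highest one
 * with RUNNABLE tasks; fall back to the default when the rq is idle.
 */
static unsigned int rq_max_value(struct bucket *bucket)
{
	int bucket_id;

	for (bucket_id = UCLAMP_BUCKETS - 1; bucket_id >= 0; bucket_id--) {
		if (!bucket[bucket_id].tasks)
			continue;
		return bucket[bucket_id].value;
	}

	/* No tasks -- default clamp value */
	return uclamp_none();
}
```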

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 12:44:19

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:23, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Feb 08, 2019 at 10:05:40AM +0000, Patrick Bellasi wrote:
> > > +/*
> > > + * When a task is enqueued on a rq, the clamp bucket currently defined by the
> > > + * task's uclamp::bucket_id is reference counted on that rq. This also
> > > + * immediately updates the rq's clamp value if required.
> > > + *
> > > + * Since tasks know their specific value requested from user-space, we track
> > > + * within each bucket the maximum value for tasks refcounted in that bucket.
> > > + * This provide a further aggregation (local clamping) which allows to track
> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > + unsigned int clamp_id)
> > > +{
> > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> > > +
> > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++;
> > > +
> > > + /*
> > > + * Local clamping: rq's buckets always track the max "requested"
> > > + * clamp value from all RUNNABLE tasks in that bucket.
> > > + */
> > > + tsk_clamp = p->uclamp[clamp_id].value;
> > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, tsk_clamp);
> >
> > So, if I read this correct:
> >
> > - here we track a max value in a bucket,
> >
> > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp));
> > > +}
> > > +
> > > +/*
> > > + * When a task is dequeued from a rq, the clamp bucket reference counted by
> > > + * the task is released. If this is the last task reference counting the rq's
> > > + * max active clamp value, then the rq's clamp value is updated.
> > > + * Both the tasks reference counter and the rq's cached clamp values are
> > > + * expected to be always valid, if we detect they are not we skip the updates,
> > > + * enforce a consistent state and warn.
> > > + */
> > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
> > > + unsigned int clamp_id)
> > > +{
> > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > + unsigned int rq_clamp, bkt_clamp;
> > > +
> > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks);
> > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> > > +
> > > + /*
> > > + * Keep "local clamping" simple and accept to (possibly) overboost
> > > + * still RUNNABLE tasks in the same bucket.
> > > + */
> > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks))
> > > + return;
> >
> > (Oh man, I hope that generates semi sane code; long live CSE passes I
> > suppose)
> >
> > But we never decrement that bkt_clamp value on dequeue.
> >
> > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value;
> > > +
> > > + /* The rq's clamp value is expected to always track the max */
> > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
> > > + SCHED_WARN_ON(bkt_clamp > rq_clamp);
> > > + if (bkt_clamp >= rq_clamp) {
> >
> > head hurts, this reads ==, how can this ever not be so?
> >
> > > + /*
> > > + * Reset rq's clamp bucket value to its nominal value whenever
> > > + * there are anymore RUNNABLE tasks refcounting it.
> >
> > -ENOPARSE
> >
> > > + */
> > > + rq->uclamp[clamp_id].bucket[bucket_id].value =
> > > + uclamp_bucket_value(rq_clamp);
> >
> > But basically you decrement the bucket value to the nominal value.
> >
> > > + uclamp_rq_update(rq, clamp_id);
> > > + }
> > > +}
> >
> > Given all that, what is to stop the bucket value to climbing to
> > uclamp_bucket_value(+1)-1 and staying there (provided there's someone
> > runnable)?
> >
> > Why are we doing this... ?
>
> I agree with Peter, this part of the patch was the hardest to read.
> SCHED_WARN_ON line makes sense to me. The condition that follows and
> the following comment are a little baffling. Condition seems to
> indicate that the code that follows should be executed only if we are
> in the top-most occupied bucket (the bucket which has tasks and has
> the highest uclamp value).
> So this bucket just lost its last task and we should update
> rq->uclamp[clamp_id].value.

Right.

> However that's not exactly what the code does... It also resets
> rq->uclamp[clamp_id].bucket[bucket_id].value.

Right...

> So if I understand correctly, unless the bucket that just lost its
> last task is the top-most one its value will not be reset to nominal
> value. That looks like a bug to me. Am I missing something?

... and I think you've got a point here!

The reset to nominal value line should be done unconditionally.
I'll move it outside its current block. Thanks for spotting it.

> Side note: some more explanation would be very helpful.

Will move that "bucket local max" management code into a separate
patch as suggested by Peter. Hopefully that should make the logic
clearer and allow me to add some notes in the changelog.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 13:29:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Thu, Mar 14, 2019 at 11:03:30AM +0000, Patrick Bellasi wrote:

> void uclamp_rq_dec_id(struct rq *rq, int clamp_id, int bucket_id) {
> if (__builtin_expect(!!(rq->uclamp[clamp_id].bucket[bucket_id].tasks), 1))
> return;
> rq->uclamp[clamp_id].bucket[bucket_id].tasks--;
> }
> ---8<---
>
> generates something like:
>
> ---8<---
> uclamp_rq_dec_id:
> sxtw x1, w1
> add x3, x1, x1, lsl 1
> lsl x3, x3, 2
> sub x3, x3, x1
> lsl x3, x3, 2
> add x2, x3, x2, sxtw 3
> add x0, x0, x2
> ldr w1, [x0, 8]
> cbz w1, .L4
> ret
> .L4:
> mov w1, -1
> str w1, [x0, 8]
> ret
> ---8<---
>
>
> which looks "sane" and quite expected, isn't it?

Yep, thanks! Sometimes I worry about silly things.

2019-03-14 13:33:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > I'd be most impressed if they pull this off. Check the generated code
> > and see I suppose :-)
>
> On x86 the code generated looks exactly the same:
>
> https://godbolt.org/z/PjmA7k

Argh, they do mult by inverse to avoid the division, and can do this
because its a constant.

And then yes, your arm version looks worse.

It does what I expected with -Os, but as Rutland said the other day,
that stands for Optimize for Sadness.

2019-03-14 14:47:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:32, Suren Baghdasaryan wrote:
> On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
> >
> > Utilization clamping allows to clamp the CPU's utilization within a
> > [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> > that CPU. Each task references two "clamp buckets" defining its minimum
> > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> > bucket is active if there is at least one RUNNABLE tasks enqueued on
> > that CPU and refcounting that bucket.
> >
> > When a task is {en,de}queued {on,from} a rq, the set of active clamp
> > buckets on that CPU can change. Since each clamp bucket enforces a
> > different utilization clamp value, when the set of active clamp buckets
> > changes, a new "aggregated" clamp value is computed for that CPU.
> >
> > Clamp values are always MAX aggregated for both util_min and util_max.
> > This ensures that no tasks can affect the performance of other
> > co-scheduled tasks which are more boosted (i.e. with higher util_min
> > clamp) or less capped (i.e. with higher util_max clamp).
> >
> > Each task has a:
> > task_struct::uclamp[clamp_id]::bucket_id
> > to track the "bucket index" of the CPU's clamp bucket it refcounts while
> > enqueued, for each clamp index (clamp_id).
> >
> > Each CPU's rq has a:
> > rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> > to track how many tasks, currently RUNNABLE on that CPU, refcount each
> > clamp bucket (bucket_id) of a clamp index (clamp_id).
> >
> > Each CPU's rq has also a:
> > rq::uclamp[clamp_id]::bucket[bucket_id].value
> > to track the clamp value of each clamp bucket (bucket_id) of a clamp
> > index (clamp_id).
> >
> > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > is required only when we dequeue the last task of a clamp bucket
> > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > is either entering IDLE or going to schedule a less boosted or more
> > clamped task.
> > The expected number of different clamp values, configured at build time,
> > is small enough to fit the full unordered array into a single cache
> > line.
>
> I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
> here.

No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an
array of:

struct uclamp_bucket {
	unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
	unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
};

defined as part of:

struct uclamp_rq {
	unsigned int value;
	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};


So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs.

> uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up
> to 20. sizeof(long)*20 is already more than 64 bytes. What am I
> missing?

Right, the comment above refers to the default configuration, which is
5 buckets. With that configuration we have:


$> pahole kernel/sched/core.o

---8<---
struct uclamp_bucket {
	long unsigned int value:11;          /* 0:53   8 */
	long unsigned int tasks:53;          /* 0: 0   8 */

	/* size: 8, cachelines: 1, members: 2 */
	/* last cacheline: 8 bytes */
};

struct uclamp_rq {
	unsigned int value;                  /*   0    4 */

	/* XXX 4 bytes hole, try to pack */

	struct uclamp_bucket bucket[5];      /*   8   40 */

	/* size: 48, cachelines: 1, members: 2 */
	/* sum members: 44, holes: 1, sum holes: 4 */
	/* last cacheline: 48 bytes */
};

struct rq {
	// ...
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct uclamp_rq uclamp[2];          /* 128   96 */
	/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
	// ...
};
---8<---

Where you see the array fits into a single cache line.

Actually I notice now that, since we removed the bucket dedicated
to the default values, we have some spare space and can probably
increase the default (and minimum) value of UCLAMP_BUCKETS to 7.

This will use two full cache lines in struct rq, one for each clamp
index... although 7 is a bit of an odd number and gives by default
buckets of ~14% size instead of ~20%.

Thoughts ?
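For reference, the bitfield layout above can be checked outside the kernel with a small mock-up (bits_per(SCHED_CAPACITY_SCALE) hard-coded to 11 for the sketch; the kernel's helpers and pahole remain authoritative):

```c
#include <assert.h>
#include <limits.h>

#define SCHED_CAPACITY_SCALE	1024
/* bits_per(1024) == 11 in the kernel; hard-coded here for the sketch */
#define CAPACITY_BITS		11
#define UCLAMP_BUCKETS		5

/* value and tasks split one unsigned long between them, so each
 * (value, tasks) pair costs exactly one word.
 */
struct uclamp_bucket {
	unsigned long value : CAPACITY_BITS;
	unsigned long tasks : (sizeof(unsigned long) * CHAR_BIT) - CAPACITY_BITS;
};

struct uclamp_rq {
	unsigned int value;
	struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};
```

On LP64 this gives the 48-byte uclamp_rq shown by pahole above, so two clamp indexes fit in 96 bytes.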

[...]

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 14:56:06

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 14:01, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi <[email protected]> wrote:
> >
> > On 12-Mar 13:52, Dietmar Eggemann wrote:
> > > On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value)
> > > > +{
> > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
> > > > +}
> > >
> > > Soemthing like uclamp_bucket_nominal_value() should be clearer.
> >
> > Maybe... can update it in v8
> >
>
> uclamp_bucket_base_value is a little shorter, just to consider :)

Right, I also like it better ;)

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 15:03:19

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 13-Mar 15:15, Patrick Bellasi wrote:
> On 12-Mar 13:52, Dietmar Eggemann wrote:
> > On 2/8/19 11:05 AM, Patrick Bellasi wrote:

[...]

> > > + * within each bucket the exact "requested" clamp value whenever all tasks
> > > + * RUNNABLE in that bucket require the same clamp.
> > > + */
> > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
> > > + unsigned int clamp_id)
> > > +{
> > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp;
> >
> > Wouldn't it be easier to have a pointer to the task's and rq's uclamp
> > structure as well to the bucket?
> >
> > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
> > + struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
> > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];
>
> I think I went back/forth a couple of times in using pointer or the
> extended version, which both have pros and cons.
>
> I personally prefer the pointers as you suggest but I've got the
> impression in the past that since everybody cleared "basic C trainings"
> it's not so difficult to read the code above too.
>
> > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes
> > much more readable.
>
> Agree... let's try to switch once again in v8 and see ;)

This is not as straightforward as I thought.

We either still need local variables to use with max(), which does not
play well with bitfield values, or we have to avoid using it and go
back to conditional updates:

---8<---
static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_req *uc_se = &p->uclamp_se[clamp_id];
	struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id];

	bucket->tasks++;

	/*
	 * Local clamping: rq's buckets always track the max "requested"
	 * clamp value from all RUNNABLE tasks in that bucket.
	 */
	if (uc_se->value > bucket->value)
		bucket->value = uc_se->value;

	if (uc_se->value > READ_ONCE(uc_rq->value))
		WRITE_ONCE(uc_rq->value, uc_se->value);
}
---8<---

I remember Peter asking for max() in one of the past reviews... but the
code above looks simpler to me too. Let's see if this time it can be
accepted. :)

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 15:09:03

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 14-Mar 14:32, Peter Zijlstra wrote:
> On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > > I'd be most impressed if they pull this off. Check the generated code
> > > and see I suppose :-)
> >
> > On x86 the code generated looks exactly the same:
> >
> > https://godbolt.org/z/PjmA7k
>
> Argh, they do mult by inverse to avoid the division, and can do this
> because its a constant.
>
> And then yes, your arm version looks worse.

Is the "arm version" worse than x86, or is "my version" the worse one?

IOW, should I keep the code as the original? Do you prefer your
version? Or... we don't really care...

> It does what I expected with -Os, but as Rutland said the other day,
> that stands for Optimize for Sadness.

Yes, I guess we cannot optimize for all flags... however, just let me
know what you prefer and I'll put that version in ;)

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 15:31:23

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <[email protected]> wrote:
>
> On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
> > >
> > > Utilization clamping allows to clamp the CPU's utilization within a
> > > [util_min, util_max] range, depending on the set of RUNNABLE tasks on
> > > that CPU. Each task references two "clamp buckets" defining its minimum
> > > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
> > > bucket is active if there is at least one RUNNABLE tasks enqueued on
> > > that CPU and refcounting that bucket.
> > >
> > > When a task is {en,de}queued {on,from} a rq, the set of active clamp
> > > buckets on that CPU can change. Since each clamp bucket enforces a
> > > different utilization clamp value, when the set of active clamp buckets
> > > changes, a new "aggregated" clamp value is computed for that CPU.
> > >
> > > Clamp values are always MAX aggregated for both util_min and util_max.
> > > This ensures that no tasks can affect the performance of other
> > > co-scheduled tasks which are more boosted (i.e. with higher util_min
> > > clamp) or less capped (i.e. with higher util_max clamp).
> > >
> > > Each task has a:
> > > task_struct::uclamp[clamp_id]::bucket_id
> > > to track the "bucket index" of the CPU's clamp bucket it refcounts while
> > > enqueued, for each clamp index (clamp_id).
> > >
> > > Each CPU's rq has a:
> > > rq::uclamp[clamp_id]::bucket[bucket_id].tasks
> > > to track how many tasks, currently RUNNABLE on that CPU, refcount each
> > > clamp bucket (bucket_id) of a clamp index (clamp_id).
> > >
> > > Each CPU's rq has also a:
> > > rq::uclamp[clamp_id]::bucket[bucket_id].value
> > > to track the clamp value of each clamp bucket (bucket_id) of a clamp
> > > index (clamp_id).
> > >
> > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > is required only when we dequeue the last task of a clamp bucket
> > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > is either entering IDLE or going to schedule a less boosted or more
> > > clamped task.
> > > The expected number of different clamp values, configured at build time,
> > > is small enough to fit the full unordered array into a single cache
> > > line.
> >
> > I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]"
> > here.
>
> No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an
> array of:
>
> struct uclamp_bucket {
> unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
> unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
> };
>
> defined as part of:
>
> struct uclamp_rq {
> unsigned int value;
> struct uclamp_bucket bucket[UCLAMP_BUCKETS];
> };
>
>
> So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs.
>
> > uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up
> > to 20. sizeof(long)*20 is already more than 64 bytes. What am I
> > missing?
>
> Right, the comment above refers to the default configuration, which is
> 5 buckets. With that configuration we have:
>
>
> $> pahole kernel/sched/core.o
>
> ---8<---
> struct uclamp_bucket {
> long unsigned int value:11; /* 0:53 8 */
> long unsigned int tasks:53; /* 0: 0 8 */
>
> /* size: 8, cachelines: 1, members: 2 */
> /* last cacheline: 8 bytes */
> };
>
> struct uclamp_rq {
> unsigned int value; /* 0 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> struct uclamp_bucket bucket[5]; /* 8 40 */
>
> /* size: 48, cachelines: 1, members: 2 */
> /* sum members: 44, holes: 1, sum holes: 4 */
> /* last cacheline: 48 bytes */
> };
>
> struct rq {
> // ...
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct uclamp_rq uclamp[2]; /* 128 96 */
> /* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
> // ...
> };
> ---8<---
>
> Where you see the array fits into a single cache line.
>
> Actually I notice now that, since we removed the bucket dedicated
> to the default values, we have some spare space and can probably
> increase the default (and minimum) value of UCLAMP_BUCKETS to 7.
>
> This will use two full cache lines in struct rq, one for each clamp
> index... although 7 is a bit of an odd number and by default gives
> buckets of ~14% width instead of ~20%.
>
> Thoughts ?

Got it. From reading the documentation at the beginning my impression
was that whatever value I chose within the allowed 5-20 range would
still fit in a cache line. To disambiguate, it might be worth
mentioning that this is true only for the default value, or for values
up to 7. Thanks!

> [...]
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2019-03-14 15:42:03

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On 14-Mar 08:29, Suren Baghdasaryan wrote:
> On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <[email protected]> wrote:
> > On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:

[...]

> > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > > is required only when we dequeue the last task of a clamp bucket
> > > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > > is either entering IDLE or going to schedule a less boosted or more
> > > > clamped task.

The following:

> > > > The expected number of different clamp values, configured at build time,
> > > > is small enough to fit the full unordered array into a single cache
> > > > line.

will read:

The expected number of different clamp values, configured at build time,
is small enough to fit the full unordered array into a single cache
line for the default UCLAMP_BUCKETS configuration of 7 buckets.

[...]

> Got it. From reading the documentation at the beginning my impression
> was that whatever value I chose within the allowed 5-20 range would
> still fit in a cache line. To disambiguate, it might be worth
> mentioning that this is true only for the default value, or for values
> up to 7. Thanks!

Right, I hope the above proposed change helps to clarify that.

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 16:20:35

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 12/15] sched/core: uclamp: Propagate parent clamps

On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
>
> In order to properly support hierarchical resources control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires to properly propagate and aggregate parent
> attributes down to its descendants.
>
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This is the actual clamp value enforced on tasks in a
> task group.

In patch 10 in this series you mentioned "b) do not enforce any
constraints and/or dependencies between the parent and its child
nodes"

This patch seems to change that behavior. If so, should it be documented?

> Since it can be interesting for userspace, e.g. system management
> software, to know exactly what the currently propagated/enforced
> configuration is, the effective clamp values are exposed to user-space
> by means of a new pair of read-only attributes
> cpu.util.{min,max}.effective.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Tejun Heo <[email protected]>
>
> ---
> Changes in v7:
> Others:
> - ensure clamp values are not tunable at root cgroup level
> ---
> Documentation/admin-guide/cgroup-v2.rst | 19 ++++
> kernel/sched/core.c | 118 +++++++++++++++++++++++-
> 2 files changed, 133 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 47710a77f4fa..7aad2435e961 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -990,6 +990,16 @@ All time durations are in microseconds.
> values similar to the sched_setattr(2). This minimum utilization
> value is used to clamp the task specific minimum utilization clamp.
>
> + cpu.util.min.effective
> + A read-only single value file which exists on non-root cgroups and
> + reports minimum utilization clamp value currently enforced on a task
> + group.
> +
> + The actual minimum utilization in the range [0, 1024].
> +
> + This value can be lower than cpu.util.min in case a parent cgroup
> + allows only smaller minimum utilization values.
> +
> cpu.util.max
> A read-write single value file which exists on non-root cgroups.
> The default is "1024". i.e. no utilization capping
> @@ -1000,6 +1010,15 @@ All time durations are in microseconds.
> values similar to the sched_setattr(2). This maximum utilization
> value is used to clamp the task specific maximum utilization clamp.
>
> + cpu.util.max.effective
> + A read-only single value file which exists on non-root cgroups and
> + reports maximum utilization clamp value currently enforced on a task
> + group.
> +
> + The actual maximum utilization in the range [0, 1024].
> +
> + This value can be lower than cpu.util.max in case a parent cgroup
> + is enforcing a more restrictive clamping on max utilization.
>
>
> Memory
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 122ab069ade5..1e54517acd58 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -720,6 +720,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
> }
>
> #ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Serializes updates of utilization clamp values
> + *
> + * The (slow-path) user-space triggers utilization clamp value updates which
> + * can require updates on (fast-path) scheduler's data structures used to
> + * support enqueue/dequeue operations.
> + * While the per-CPU rq lock protects fast-path update operations, user-space
> + * requests are serialized using a mutex to reduce the risk of conflicting
> + * updates or API abuses.
> + */
> +static DEFINE_MUTEX(uclamp_mutex);
> +
> /* Max allowed minimum utilization */
> unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>
> @@ -1127,6 +1139,8 @@ static void __init init_uclamp(void)
> unsigned int value;
> int cpu;
>
> + mutex_init(&uclamp_mutex);
> +
> for_each_possible_cpu(cpu) {
> memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> cpu_rq(cpu)->uclamp_flags = 0;
> @@ -6758,6 +6772,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
> parent->uclamp[clamp_id].value;
> tg->uclamp[clamp_id].bucket_id =
> parent->uclamp[clamp_id].bucket_id;
> + tg->uclamp[clamp_id].effective.value =
> + parent->uclamp[clamp_id].effective.value;
> + tg->uclamp[clamp_id].effective.bucket_id =
> + parent->uclamp[clamp_id].effective.bucket_id;
> }
> #endif
>
> @@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> }
>
> #ifdef CONFIG_UCLAMP_TASK_GROUP
> +static void cpu_util_update_hier(struct cgroup_subsys_state *css,

s/cpu_util_update_hier/cpu_util_update_heir ?

> + unsigned int clamp_id, unsigned int bucket_id,
> + unsigned int value)
> +{
> + struct cgroup_subsys_state *top_css = css;
> + struct uclamp_se *uc_se, *uc_parent;
> +
> + css_for_each_descendant_pre(css, top_css) {
> + /*
> + * The first visited task group is top_css, which clamp value
> + * is the one passed as parameter. For descendent task
> + * groups we consider their current value.
> + */
> + uc_se = &css_tg(css)->uclamp[clamp_id];
> + if (css != top_css) {
> + value = uc_se->value;
> + bucket_id = uc_se->effective.bucket_id;
> + }
> + uc_parent = NULL;
> + if (css_tg(css)->parent)
> + uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
> +
> + /*
> + * Skip the whole subtrees if the current effective clamp is
> + * already matching the TG's clamp value.
> + * In this case, all the subtrees already have top_value, or a
> + * more restrictive value, as effective clamp.
> + */
> + if (uc_se->effective.value == value &&
> + uc_parent && uc_parent->effective.value >= value) {
> + css = css_rightmost_descendant(css);
> + continue;
> + }
> +
> + /* Propagate the most restrictive effective value */
> + if (uc_parent && uc_parent->effective.value < value) {
> + value = uc_parent->effective.value;
> + bucket_id = uc_parent->effective.bucket_id;
> + }
> + if (uc_se->effective.value == value)
> + continue;
> +
> + uc_se->effective.value = value;
> + uc_se->effective.bucket_id = bucket_id;
> + }
> +}
> +
> static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> struct cftype *cftype, u64 min_value)
> {
> @@ -7020,6 +7085,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> if (min_value > SCHED_CAPACITY_SCALE)
> return -ERANGE;
>
> + mutex_lock(&uclamp_mutex);
> rcu_read_lock();
>
> tg = css_tg(css);
> @@ -7038,8 +7104,13 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> tg->uclamp[UCLAMP_MIN].value = min_value;
> tg->uclamp[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
>
> + /* Update effective clamps to track the most restrictive value */
> + cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
> + min_value);
> +
> out:
> rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
>
> return ret;
> }
> @@ -7053,6 +7124,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> if (max_value > SCHED_CAPACITY_SCALE)
> return -ERANGE;
>
> + mutex_lock(&uclamp_mutex);
> rcu_read_lock();
>
> tg = css_tg(css);
> @@ -7071,21 +7143,29 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> tg->uclamp[UCLAMP_MAX].value = max_value;
> tg->uclamp[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>
> + /* Update effective clamps to track the most restrictive value */
> + cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
> + max_value);
> +
> out:
> rcu_read_unlock();
> + mutex_unlock(&uclamp_mutex);
>
> return ret;
> }
>
> static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> - enum uclamp_id clamp_id)
> + enum uclamp_id clamp_id,
> + bool effective)
> {
> struct task_group *tg;
> u64 util_clamp;
>
> rcu_read_lock();
> tg = css_tg(css);
> - util_clamp = tg->uclamp[clamp_id].value;
> + util_clamp = effective
> + ? tg->uclamp[clamp_id].effective.value
> + : tg->uclamp[clamp_id].value;
> rcu_read_unlock();
>
> return util_clamp;
> @@ -7094,13 +7174,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> struct cftype *cft)
> {
> - return cpu_uclamp_read(css, UCLAMP_MIN);
> + return cpu_uclamp_read(css, UCLAMP_MIN, false);
> }
>
> static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
> struct cftype *cft)
> {
> - return cpu_uclamp_read(css, UCLAMP_MAX);
> + return cpu_uclamp_read(css, UCLAMP_MAX, false);
> +}
> +
> +static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MIN, true);
> +}
> +
> +static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return cpu_uclamp_read(css, UCLAMP_MAX, true);
> }
> #endif /* CONFIG_UCLAMP_TASK_GROUP */
>
> @@ -7448,11 +7540,19 @@ static struct cftype cpu_legacy_files[] = {
> .read_u64 = cpu_util_min_read_u64,
> .write_u64 = cpu_util_min_write_u64,
> },
> + {
> + .name = "util.min.effective",
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
> {
> .name = "util.max",
> .read_u64 = cpu_util_max_read_u64,
> .write_u64 = cpu_util_max_write_u64,
> },
> + {
> + .name = "util.max.effective",
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
> #endif
> { } /* Terminate */
> };
> @@ -7628,12 +7728,22 @@ static struct cftype cpu_files[] = {
> .read_u64 = cpu_util_min_read_u64,
> .write_u64 = cpu_util_min_write_u64,
> },
> + {
> + .name = "util.min.effective",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
> {
> .name = "util.max",
> .flags = CFTYPE_NOT_ON_ROOT,
> .read_u64 = cpu_util_max_read_u64,
> .write_u64 = cpu_util_max_write_u64,
> },
> + {
> + .name = "util.max.effective",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
> #endif
> { } /* terminate */
> };
> --
> 2.20.1
>

2019-03-14 16:40:18

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Thu, Mar 14, 2019 at 8:41 AM Patrick Bellasi <[email protected]> wrote:
>
> On 14-Mar 08:29, Suren Baghdasaryan wrote:
> > On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi <[email protected]> wrote:
> > > On 13-Mar 14:32, Suren Baghdasaryan wrote:
> > > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
>
> [...]
>
> > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
> > > > > to find a new MAX aggregated clamp value for a clamp_id. This operation
> > > > > is required only when we dequeue the last task of a clamp bucket
> > > > > tracking the current MAX aggregated clamp value. In these cases, the CPU
> > > > > is either entering IDLE or going to schedule a less boosted or more
> > > > > clamped task.
>
> The following:
>
> > > > > The expected number of different clamp values, configured at build time,
> > > > > is small enough to fit the full unordered array into a single cache
> > > > > line.
>
> will read:
>
> The expected number of different clamp values, configured at build time,
> is small enough to fit the full unordered array into a single cache
> line for the default UCLAMP_BUCKETS configuration of 7 buckets.

I think keeping the default at 5 is good. As you mentioned, it's a nice
round number, and keeping it at the minimum also hints that this is not
a free resource: the more buckets you use, the more you pay.
Documentation might say "to fit the full unordered array into a single
cache line for configurations of up to 7 buckets".

> [...]
>
> > Got it. From reading the documentation at the beginning my impression
> > was that whatever value I choose within allowed 5-20 range it would
> > still fit in a cache line. To disambiguate it might be worse
> > mentioning that this is true for the default value or for values up to
> > 7. Thanks!
>
> Right, I hope the above proposed change helps to clarify that.
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2019-03-14 17:07:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX

On 13-Mar 17:29, Suren Baghdasaryan wrote:
> On Wed, Mar 13, 2019 at 9:16 AM Patrick Bellasi <[email protected]> wrote:
> >
> > On 13-Mar 15:12, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:41AM +0000, Patrick Bellasi wrote:
> > > > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> > > > + unsigned int clamp_value)
> > > > +{
> > > > + /* Reset max-clamp retention only on idle exit */
> > > > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> > > > + return;
> > > > +
> > > > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> > > > +
> > > > + /*
> > > > + * This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX
> > > > + * (after). The idle flag is reset only the second time, when we know
> > > > + * that UCLAMP_MIN has been already updated.
> > >
> > > Why do we care? That is, what is this comment trying to tell us.
> >
> > Right, the code is clear enough, I'll remove this comment.
>
> It would be probably even clearer if rq->uclamp_flags &=
> ~UCLAMP_FLAG_IDLE is done from inside uclamp_rq_inc after
> uclamp_rq_inc_id for both clamps is called.

Good point! I'll move it there to have something like:

---8<---
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
unsigned int clamp_id;

if (unlikely(!p->sched_class->uclamp_enabled))
return;

for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
uclamp_rq_inc_id(p, rq, clamp_id);

/* Reset clamp holding when we have at least one RUNNABLE task */
if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
}
---8<---

--
#include <best/regards.h>

Patrick Bellasi

2019-03-14 19:20:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting

On Thu, Mar 14, 2019 at 03:07:53PM +0000, Patrick Bellasi wrote:
> On 14-Mar 14:32, Peter Zijlstra wrote:
> > On Thu, Mar 14, 2019 at 12:13:15PM +0000, Patrick Bellasi wrote:
> > > > I'd be most impressed if they pull this off. Check the generated code
> > > > and see I suppose :-)
> > >
> > > On x86 the code generated looks exactly the same:
> > >
> > > https://godbolt.org/z/PjmA7k
> >
> > Argh, they do mult by inverse to avoid the division, and can do this
> > because its a constant.
> >
> > And then yes, your arm version looks worse.
>
> your "arm version" is worse than x86, or "your version" is worse?
>
> IOW, should I keep the code as the original? Do you prefer your
> version? Or... we don't really care...

Yeah, keep the original, it didn't matter on x86 and arm regressed with
my version.

> > It does what I expected with -Os, but as Rutland said the other day,
> > that stands for Optimize for Sadness.
>
> Yes, I guess we cannot optimize for all flags... however, just let me
> know what you prefer and I'll put that version in ;)

Yeah, don't bother optimizing for -Os, it generally creates atrocious
crap, hence Rutland calling it 'Optimize for Sadness'.

2019-03-15 13:42:47

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On 13-Mar 21:10, Peter Zijlstra wrote:
> On Wed, Mar 13, 2019 at 05:09:40PM +0000, Patrick Bellasi wrote:
>
> > Yes, that should be possible... will look into splitting this out in
> > v8 to have something like:
> >
> > ---8<---
> > struct uclamp_req {
> > /* Clamp value "requested" by a scheduling entity */
> > unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > unsigned int active : 1;
> > unsigned int user_defined : 1;
> > }
> >
> > struct uclamp_eff {
> > unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
> > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
> > }
>
> No, have _1_ type. There is no point what so ever to splitting this.
>
> Also, what's @user_defined about, I don't think I've seen that in the
> parent patch.

That's a flag added by one of the following patches, but with the
change you are suggesting below...

> > struct task_struct {
> > // ...
> > #ifdef CONFIG_UCLAMP_TASK
> > struct uclamp_req uclamp_req[UCLAMP_CNT];
> > struct uclamp_eff uclamp_eff[UCLAMP_CNT];
>
> struct uclamp_se uclamp[UCLAMP_CNT];
> struct uclamp_se uclamp_req[UCLAMP_CNT];
>
> Where the first is the very same introduced in patch #1, and leaving it
> in place avoids having to update the sites already using that (or start
> #1 with the _eff name to avoid having to change things around?).
>
> > #endif
> > // ...
> > }
> >
> > static inline struct uclamp_eff
> > uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> > {
> > struct uclamp_eff uc_eff = p->uclamp_eff[clamp_id];
>
> just this ^, these lines seem like a superfluous duplication:
>
> > uc_eff.bucket_id = p->uclamp_req[clamp_id].bucket_id;
> > uc_eff.value = p->uclamp_req[clamp_id].value;
>
>
> > if (unlikely(uc_eff.value > uclamp_default[clamp_id].value)) {
> > uc_eff.value = uclamp_default[clamp_id].value;
> > uc_eff.bucket_id = uclamp_default[clamp_id].bucket_id;
>
> and:
>
> uc = uclamp_default[clamp_id];

... with things like the above line it becomes quite tricky to exploit
the uclamp_se bitfield to track additional flags.

I'll try to remove the need for the "user_defined" flag; as long as we
have only the "active" flag we can still manage to keep it in uclamp_se.

If instead we really need more flags, we will likely have to move them
into a separate bitfield. :/

>
> > }
> >
> > return uc_eff;
> > }
> >
> > static inline void
> > uclamp_eff_set(struct task_struct *p, unsigned int clamp_id)
> > {
> > p->uclamp_eff[clamp_id] = uclamp_eff_get(p, clamp_id);
> > }
> > ---8<---
> >
> > Is that what you mean ?
>
> Getting there :-)

Yeah... let see :)


--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 12:19:08

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On 13-Mar 21:18, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> > +static void uclamp_fork(struct task_struct *p)
> > +{
> > + unsigned int clamp_id;
> > +
> > + if (unlikely(!p->sched_class->uclamp_enabled))
> > + return;
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > + p->uclamp[clamp_id].active = false;
> > +}
>
> Because in that case .active == false, and copy_process() will have done
> the right thing?

Don't really get what you mean here? :/


--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 13:01:11

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 06/15] sched/core: uclamp: Reset uclamp values on RESET_ON_FORK

On 13-Mar 21:52, Peter Zijlstra wrote:
> On Fri, Feb 08, 2019 at 10:05:45AM +0000, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 070caa1f72eb..8b282616e9c9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1071,7 +1071,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
> > }
> > }
> >
> > -static void uclamp_fork(struct task_struct *p)
> > +static void uclamp_fork(struct task_struct *p, bool reset)
> > {
> > unsigned int clamp_id;
> >
> > @@ -1080,6 +1080,17 @@ static void uclamp_fork(struct task_struct *p)
>
> IIRC there's an early return here if the class doesn't have uclamp
> support, which I think is wrong now. You want the reset irrespective of
> whether the class supports it, no?

Yep, actually... since in this method we are always and only resetting
certain values, it's probably better to just remove the check on class
support and unconditionally do the required resets.

> > for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > p->uclamp[clamp_id].active = false;
> > +
> > + if (likely(!reset))
> > + return;
> > +
> > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > + unsigned int clamp_value = uclamp_none(clamp_id);
> > +
> > + p->uclamp[clamp_id].user_defined = false;
> > + p->uclamp[clamp_id].value = clamp_value;
> > + p->uclamp[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> > + }
> > }
>
>

--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 13:12:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Mon, Mar 18, 2019 at 12:18:04PM +0000, Patrick Bellasi wrote:
> On 13-Mar 21:18, Peter Zijlstra wrote:
> > On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> > > +static void uclamp_fork(struct task_struct *p)
> > > +{
> > > + unsigned int clamp_id;
> > > +
> > > + if (unlikely(!p->sched_class->uclamp_enabled))
> > > + return;
> > > +
> > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > > + p->uclamp[clamp_id].active = false;
> > > +}
> >
> > Because in that case .active == false, and copy_process() will have done
> > the right thing?
>
> Don't really get what you mean here? :/

Why don't we have to set .active=false when
!sched_class->uclamp_enabled?

2019-03-18 14:23:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On 18-Mar 14:10, Peter Zijlstra wrote:
> On Mon, Mar 18, 2019 at 12:18:04PM +0000, Patrick Bellasi wrote:
> > On 13-Mar 21:18, Peter Zijlstra wrote:
> > > On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> > > > +static void uclamp_fork(struct task_struct *p)
> > > > +{
> > > > + unsigned int clamp_id;
> > > > +
> > > > + if (unlikely(!p->sched_class->uclamp_enabled))
> > > > + return;
> > > > +
> > > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > > > + p->uclamp[clamp_id].active = false;
> > > > +}
> > >
> > > Because in that case .active == false, and copy_process() will have done
> > > the right thing?
> >
> > Don't really get what you mean here? :/
>
> Why don't we have to set .active=false when
> !sched_class->uclamp_enabled?

Ok, got it.

In principle because:
- FAIR and RT will have uclamp_enabled
- DL cannot fork

... thus, yes, it seems that the check above is not necessary anyway.

Moreover, as per one of your comments in another message, we still need
to cover the "reset on fork" case for FAIR and RT. Thus, I'm going to
completely remove the support check in uclamp_fork and always reset
active for all classes.

Cheers!

--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 14:30:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 03/15] sched/core: uclamp: Add system default clamps

On Mon, Mar 18, 2019 at 02:21:52PM +0000, Patrick Bellasi wrote:
> On 18-Mar 14:10, Peter Zijlstra wrote:
> > On Mon, Mar 18, 2019 at 12:18:04PM +0000, Patrick Bellasi wrote:
> > > On 13-Mar 21:18, Peter Zijlstra wrote:
> > > > On Fri, Feb 08, 2019 at 10:05:42AM +0000, Patrick Bellasi wrote:
> > > > > +static void uclamp_fork(struct task_struct *p)
> > > > > +{
> > > > > + unsigned int clamp_id;
> > > > > +
> > > > > + if (unlikely(!p->sched_class->uclamp_enabled))
> > > > > + return;
> > > > > +
> > > > > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> > > > > + p->uclamp[clamp_id].active = false;
> > > > > +}
> > > >
> > > > Because in that case .active == false, and copy_process() will have done
> > > > the right thing?
> > >
> > > Don't really get what you mean here? :/
> >
> > Why don't we have to set .active=false when
> > !sched_class->uclamp_enabled?
>
> Ok, got it.
>
> In principle because:
> - FAIR and RT will have uclamp_enabled
> - DL cannot fork
>
> ... thus, yes, it seems that the check above is not necessary anyway.
>
> Moreover, as per one of your comments in another message, we still need
> to cover the "reset on fork" case for FAIR and RT. Thus, I'm going to
> completely remove the support check in uclamp_fork and always reset
> active for all classes.

Right, thanks!

2019-03-18 15:20:01

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 10/15] sched/fair: uclamp: Add uclamp support to energy_compute()

On 06-Mar 17:21, Quentin Perret wrote:

[...]

> > Since we are at that:
> > - rename schedutil_freq_util() into schedutil_cpu_util(),
> > since it's not only used for frequency selection.
> > - use "unsigned int" instead of "unsigned long" whenever the tracked
> > utilization value is not expected to overflow 32bit.
>
> We use unsigned long all over the place right ? All the task_util*()
> functions return unsigned long, the capacity-related functions too, and
> util_avg is an unsigned long in sched_avg. So I'm not sure if we want to
> do this TBH.

For utilization we never need more than an "unsigned int" as storage
class: even at RQ level, 32 bits allows for over 4 million tasks.

However, we started with unsigned long and kept going with that; this
was just an attempt to incrementally fix it whenever we change or add
some code. Perhaps a single wholesale update patch would fit this job
better, in case we really want to do it at some point.

I'll drop this change in v8 and keep this patch focused on functional
bits, don't want to risk to sidetrack the discussion again.

[...]

> > @@ -283,13 +284,14 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > - unsigned long util = cpu_util_cfs(rq);
> > - unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> > + unsigned int util_cfs = cpu_util_cfs(rq);
> > + unsigned int cpu_cap = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
>
> Do you really need this one ? What's wrong with 'max' :-) ?

Isn't being a pretty "generic" and thus confusing name reason enough? :)

Anyway, same reasoning as above and same conclusions: I'll drop the
renaming so that we don't sidetrack the discussion on v8.

[...]

> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index de181b8a3a2a..b9acef080d99 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2335,6 +2335,7 @@ static inline unsigned long capacity_orig_of(int cpu)
> > #endif
> >
> > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +
> > /**
> > * enum schedutil_type - CPU utilization type
>
> Since you're using this enum unconditionally in fair.c, you should to
> move it out of the #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL block, I think.
>
> > * @FREQUENCY_UTIL: Utilization used to select frequency
> > @@ -2350,15 +2351,9 @@ enum schedutil_type {
> > ENERGY_UTIL,
> > };

Good point, will do!

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 16:56:12

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 12/15] sched/core: uclamp: Propagate parent clamps

On 14-Mar 09:17, Suren Baghdasaryan wrote:
> On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
> >
> > In order to properly support hierarchical resources control, the cgroup
> > delegation model requires that attribute writes from a child group never
> > fail but still are (potentially) constrained based on parent's assigned
> > resources. This requires to properly propagate and aggregate parent
> > attributes down to its descendants.
> >
> > Let's implement this mechanism by adding a new "effective" clamp value
> > for each task group. The effective clamp value is defined as the smaller
> > value between the clamp value of a group and the effective clamp value
> > of its parent. This is the actual clamp value enforced on tasks in a
> > task group.
>
> In patch 10 in this series you mentioned "b) do not enforce any
> constraints and/or dependencies between the parent and its child
> nodes"
>
> This patch seems to change that behavior. If so, should it be documented?

No, but I actually have to update the changelog of that patch.

What I mean is that we do not enforce constraints among "requested"
values, thus ensuring that each sub-group can always request a clamp
value. Of course, whether it gets that value or not depends on parent
constraints, which are propagated down the hierarchy in the form of
"effective" values by cpu_util_update_hier().

I'll fix the changelog in patch 10 which seems to be confusing for
Tejun too.
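A minimal stand-alone model of that propagation (a hypothetical Python sketch, not the kernel implementation) illustrates the min-aggregation between requested and parent effective values:

```python
# Hypothetical model of the "effective" clamp propagation: a group's
# effective value is the smaller of its own requested value and its
# parent's effective value, so parent constraints cap descendants
# while requests themselves are never rejected.
class TaskGroup:
    def __init__(self, value, parent=None):
        self.value = value      # "requested" clamp value, always accepted
        self.parent = parent

    @property
    def effective(self):
        if self.parent is None:
            return self.value
        return min(self.value, self.parent.effective)

root = TaskGroup(1024)
child = TaskGroup(512, parent=root)
leaf = TaskGroup(800, parent=child)  # requests 800 ...

assert leaf.value == 800             # ... the request is kept
assert leaf.effective == 512         # ... but the parent caps it
```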

[...]

> > @@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> > }
> >
> > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > +static void cpu_util_update_hier(struct cgroup_subsys_state *css,
>
> s/cpu_util_update_hier/cpu_util_update_heir ?

Mmm... why?

That "_hier" stands for "hierarchical".

However, since we update the effective values there, maybe I should
rename it to "_eff"?

> > + unsigned int clamp_id, unsigned int bucket_id,
> > + unsigned int value)
> > +{
> > + struct cgroup_subsys_state *top_css = css;
> > + struct uclamp_se *uc_se, *uc_parent;
> > +
> > + css_for_each_descendant_pre(css, top_css) {
> > + /*
> > + * The first visited task group is top_css, whose clamp value
> > + * is the one passed as a parameter. For descendant task
> > + * groups we consider their current value.
> > + */
> > + uc_se = &css_tg(css)->uclamp[clamp_id];
> > + if (css != top_css) {
> > + value = uc_se->value;
> > + bucket_id = uc_se->effective.bucket_id;
> > + }
> > + uc_parent = NULL;
> > + if (css_tg(css)->parent)
> > + uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
> > +
> > + /*
> > + * Skip the whole subtree if the current effective clamp already
> > + * matches the TG's clamp value.
> > + * In this case, the entire subtree already has top_value, or a
> > + * more restrictive value, as its effective clamp.
> > + */
> > + if (uc_se->effective.value == value &&
> > + uc_parent && uc_parent->effective.value >= value) {
> > + css = css_rightmost_descendant(css);
> > + continue;
> > + }
> > +
> > + /* Propagate the most restrictive effective value */
> > + if (uc_parent && uc_parent->effective.value < value) {
> > + value = uc_parent->effective.value;
> > + bucket_id = uc_parent->effective.bucket_id;
> > + }
> > + if (uc_se->effective.value == value)
> > + continue;
> > +
> > + uc_se->effective.value = value;
> > + uc_se->effective.bucket_id = bucket_id;
> > + }
> > +}
> > +
> > static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> > struct cftype *cftype, u64 min_value)
> > {

[...]

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-03-18 17:00:11

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v7 12/15] sched/core: uclamp: Propagate parent clamps

On Mon, Mar 18, 2019 at 9:54 AM Patrick Bellasi <[email protected]> wrote:
>
> On 14-Mar 09:17, Suren Baghdasaryan wrote:
> > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <[email protected]> wrote:
> > >
> > > In order to properly support hierarchical resources control, the cgroup
> > > delegation model requires that attribute writes from a child group never
> > > fail but still are (potentially) constrained based on parent's assigned
> > > resources. This requires to properly propagate and aggregate parent
> > > attributes down to its descendants.
> > >
> > > Let's implement this mechanism by adding a new "effective" clamp value
> > > for each task group. The effective clamp value is defined as the smaller
> > > value between the clamp value of a group and the effective clamp value
> > > of its parent. This is the actual clamp value enforced on tasks in a
> > > task group.
> >
> > In patch 10 in this series you mentioned "b) do not enforce any
> > constraints and/or dependencies between the parent and its child
> > nodes"
> >
> > This patch seems to change that behavior. If so, should it be documented?
>
> No, I actually have to update the changelog of that patch.
>
> What I mean is that we do not enforce constraints among "requested"
> values, thus ensuring that each sub-group can always request a clamp
> value.
> Of course, whether or not it gets that value depends on parent
> constraints, which are propagated down the hierarchy in the form of
> "effective" values by cpu_util_update_hier().
>
> I'll fix the changelog in patch 10 which seems to be confusing for
> Tejun too.
>
> [...]
>
> > > @@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
> > > }
> > >
> > > #ifdef CONFIG_UCLAMP_TASK_GROUP
> > > +static void cpu_util_update_hier(struct cgroup_subsys_state *css,
> >
> > s/cpu_util_update_hier/cpu_util_update_heir ?
>
> Mmm... why?
>
> That "_hier" stands for "hierarchical".

Yeah, I realized that later on but did not want to create more
chatter. _hier seems fine.

> However, since there we update the effective values, maybe I can
> better rename it in "_eff" ?
>
> > > + unsigned int clamp_id, unsigned int bucket_id,
> > > + unsigned int value)
> > > +{
> > > + struct cgroup_subsys_state *top_css = css;
> > > + struct uclamp_se *uc_se, *uc_parent;
> > > +
> > > + css_for_each_descendant_pre(css, top_css) {
> > > + /*
> > > + * The first visited task group is top_css, whose clamp value
> > > + * is the one passed as a parameter. For descendant task
> > > + * groups we consider their current value.
> > > + */
> > > + uc_se = &css_tg(css)->uclamp[clamp_id];
> > > + if (css != top_css) {
> > > + value = uc_se->value;
> > > + bucket_id = uc_se->effective.bucket_id;
> > > + }
> > > + uc_parent = NULL;
> > > + if (css_tg(css)->parent)
> > > + uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
> > > +
> > > + /*
> > > + * Skip the whole subtree if the current effective clamp already
> > > + * matches the TG's clamp value.
> > > + * In this case, the entire subtree already has top_value, or a
> > > + * more restrictive value, as its effective clamp.
> > > + */
> > > + if (uc_se->effective.value == value &&
> > > + uc_parent && uc_parent->effective.value >= value) {
> > > + css = css_rightmost_descendant(css);
> > > + continue;
> > > + }
> > > +
> > > + /* Propagate the most restrictive effective value */
> > > + if (uc_parent && uc_parent->effective.value < value) {
> > > + value = uc_parent->effective.value;
> > > + bucket_id = uc_parent->effective.bucket_id;
> > > + }
> > > + if (uc_se->effective.value == value)
> > > + continue;
> > > +
> > > + uc_se->effective.value = value;
> > > + uc_se->effective.bucket_id = bucket_id;
> > > + }
> > > +}
> > > +
> > > static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> > > struct cftype *cftype, u64 min_value)
> > > {
>
> [...]
>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2019-03-19 10:02:20

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v7 11/15] sched/core: uclamp: Extend CPU's cgroup controller

On 14-Feb 07:48, Tejun Heo wrote:
> Hello,

Hi Tejun,

> On Fri, Feb 08, 2019 at 10:05:50AM +0000, Patrick Bellasi wrote:
> > a) are available only for non-root nodes, both on default and legacy
> > hierarchies, while system wide clamps are defined by a generic
> > interface which does not depend on cgroups
> >
> > b) do not enforce any constraints and/or dependencies between the parent
> > and its child nodes, thus relying:
> > - on permission settings defined by the system management software,
> > to define if subgroups can configure their clamp values
> > - on the delegation model, to ensure that effective clamps are
> > updated to consider both subgroup requests and parent group
> > constraints
>
> I'm not sure about this hierarchical behavior.

Yes, the above paragraph is misleading and it fails to properly
document what the code really does.

I'll update it in v8 to be clearer.

>
> > c) have higher priority than task-specific clamps, defined via
> > sched_setattr(), thus making it possible to control and restrict task requests
>
> and I have some other concerns about the interface, but let's discuss
> them once the !cgroup portion is settled.

Sure, looking forward to getting more feedback from you on that
part.

For the time being I'll keep adding the cgroup bits on top of
the core ones just to follow the principle of "share early, share
often" and to give interested people a more complete picture of our
final goal.

I'm sure Peter will let you know when it's worthwhile for you to get
more actively involved in the review ;)

> Thanks.

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi