2019-06-21 08:43:45

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 00/16] Add utilization clamping support

Hi all, this is a respin of:

https://lore.kernel.org/lkml/[email protected]/

which addresses all Tejun's concerns:

- rename cgroup attributes to be cpu.uclamp.{min,max}
- update initialization of subgroups clamps to be "no clamps" by default
- use percentage rational numbers for clamp attributes, e.g. "12.34" for 12.34%.

by introducing modifications impacting only patches:

[PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller
[PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps

The rest of the patches are the same as in v9; they have just been rebased
on top of:

tj/cgroup.git for-5.3
tip/tip.git sched/core

AFAIU, the first 11 patches have all been code reviewed and should be at a
"ready to merge" quality level. Please let me know if I'm wrong and there
is something else I need (or can) do on those patches.

Otherwise, now that we should have settled all the behavioral aspects, I'm
looking forward to a code review of the last set of patches in this series.

Cheers,
Patrick


Series Organization
===================

The series is organized into these main sections:

- Patches [01-07]: Per task (primary) API
- Patches [08-09]: Schedutil integration for FAIR and RT tasks
- Patches [10-11]: Integration with EAS's energy_compute()
- Patches [12-16]: Per task group (secondary) API

It is based on today's tj/cgroup/for-5.3 and tip/sched/core and the full tree
is available here:

git://linux-arm.org/linux-pb.git lkml/utilclamp_v10
http://www.linux-arm.org/git?p=linux-pb.git;a=shortlog;h=refs/heads/lkml/utilclamp_v10


Newcomer's Short Abstract
=========================

The Linux scheduler tracks a "utilization" signal for each scheduling entity
(SE), e.g. tasks, to know how much CPU time they use. This signal allows the
scheduler to know how "big" a task is and, in principle, it can support
advanced task placement strategies by selecting the best CPU to run a task.
Some of these strategies are represented by the Energy Aware Scheduler [3].

When the schedutil cpufreq governor is in use, the utilization signal allows
the Linux scheduler to also drive frequency selection. The CPU utilization
signal, which represents the aggregated utilization of tasks scheduled on that
CPU, is used to select the frequency which best fits the workload generated by
the tasks.

The current translation of utilization values into a frequency selection is
simple: we go to max for RT tasks or to the minimum frequency which can
accommodate the utilization of DL+FAIR tasks.
However, utilization values by themselves cannot convey the desired
power/performance behaviors of each task as intended by user-space.
As such they are not ideally suited for task placement decisions.

Task placement and frequency selection policies in the kernel can be improved
by taking into consideration hints coming from authorized user-space elements,
like for example the Android middleware or more generally any "System
Management Software" (SMS) framework.

Utilization clamping is a mechanism which allows the utilization generated by
RT and FAIR tasks to be "clamped" (i.e. filtered) within a range defined by
user-space.
The clamped utilization value can then be used, for example, to enforce a
minimum and/or maximum frequency depending on which tasks are active on a CPU.
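
As a minimal illustration (a userspace sketch with made-up numbers, not the
actual kernel code), the clamping applied on top of a CPU's utilization can
be thought of as:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024

static unsigned int clamp_util(unsigned int util,
                               unsigned int util_min, unsigned int util_max)
{
        /* Apply the [util_min, util_max] clamps around the CPU utilization */
        if (util < util_min)
                return util_min;
        if (util > util_max)
                return util_max;
        return util;
}

int main(void)
{
        /* A small (20%) task boosted with util_min=512 looks 50% big */
        printf("%u\n", clamp_util(205, 512, SCHED_CAPACITY_SCALE)); /* 512 */
        /* An 80% task capped with util_max=307 looks at most 30% big */
        printf("%u\n", clamp_util(820, 0, 307));                    /* 307 */
        return 0;
}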

The main use-cases for utilization clamping are:

- boosting: better interactive response for small tasks which
are affecting the user experience.

Consider for example the case of a small control thread for an external
accelerator (e.g. GPU, DSP, other devices). Here, the task utilization alone
does not give the scheduler a complete view of the task's requirements:
since its utilization is small, the scheduler keeps selecting a more energy
efficient CPU, with smaller capacity and lower frequency, thus negatively
impacting the overall time required to complete the task's activations.

- capping: increase energy efficiency for background tasks not affecting the
user experience.

Since running on a lower capacity CPU at a lower frequency is more energy
efficient when completion time is not a main goal, capping the utilization
considered for certain (maybe big) tasks can have positive effects, both on
energy consumption and thermal headroom.
This feature also allows making RT tasks more energy friendly on mobile
systems, where running them on high capacity CPUs at the maximum frequency
is not always required.

From these two use-cases it's worth noticing that frequency selection
biasing, introduced by patches 9 and 10 of this series, is just one possible
use of utilization clamping. Another compelling extension is helping the
scheduler make task placement decisions.

Utilization is (also) a task specific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

- boosting: try to run small/foreground tasks on higher-capacity CPUs to
complete them faster despite being less energy efficient.

- capping: try to run big/background tasks on low-capacity CPUs to save power
and thermal headroom for more important tasks.

This series does not introduce this additional use of utilization clamping,
but it's an integral part of the EAS feature set, of which [4] is one of the
main components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

[2] Message-ID: <[email protected]>
https://lore.kernel.org/lkml/[email protected]/

[3] Energy Aware Scheduling
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[4] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (16):
sched/core: uclamp: Add CPU's clamp buckets refcounting
sched/core: uclamp: Add bucket local max tracking
sched/core: uclamp: Enforce last task's UCLAMP_MAX
sched/core: uclamp: Add system default clamps
sched/core: Allow sched_setattr() to use the current policy
sched/core: uclamp: Extend sched_setattr() to support utilization
clamping
sched/core: uclamp: Reset uclamp values on RESET_ON_FORK
sched/core: uclamp: Set default clamps for RT tasks
sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks
sched/core: uclamp: Add uclamp_util_with()
sched/fair: uclamp: Add uclamp support to energy_compute()
sched/core: uclamp: Extend CPU's cgroup controller
sched/core: uclamp: Propagate parent clamps
sched/core: uclamp: Propagate system defaults to root group
sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Documentation/admin-guide/cgroup-v2.rst | 50 ++
include/linux/log2.h | 34 +
include/linux/sched.h | 58 ++
include/linux/sched/sysctl.h | 11 +
include/linux/sched/topology.h | 6 -
include/uapi/linux/sched.h | 14 +-
include/uapi/linux/sched/types.h | 66 +-
init/Kconfig | 75 +++
kernel/sched/core.c | 797 +++++++++++++++++++++++-
kernel/sched/cpufreq_schedutil.c | 22 +-
kernel/sched/fair.c | 44 +-
kernel/sched/rt.c | 4 +
kernel/sched/sched.h | 123 +++-
kernel/sysctl.c | 16 +
14 files changed, 1276 insertions(+), 44 deletions(-)

--
2.21.0


2019-06-21 08:43:48

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 11/16] sched/fair: uclamp: Add uclamp support to energy_compute()

The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:
a) an (active) power consumption defined for each CPU frequency
b) an estimation of which frequency will be used on each CPU
c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c).
A CPU is expected to run:
- at a higher than required frequency, but for a shorter time, in case
its estimated utilization is smaller than the minimum utilization
enforced by uclamp
- at a lower than required frequency, but for a longer time, in case
its estimated utilization is bigger than the maximum utilization
enforced by uclamp

While compute_energy() already accounts for clamping effects on busy time,
the clamping effects on frequency selection are currently ignored.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct *, which allows EAS to evaluate the impact on clamp values of a
task potentially being queued on a CPU. Clamp values are applied to the
RT+CFS utilization only when FREQUENCY_UTIL is required by
compute_energy().
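
For illustration only, here is a self-contained toy model of the per-domain
estimation, with made-up OPP capacities and power numbers (the kernel uses
the Energy Model's em_pd_energy() instead):

#include <stdio.h>

/* Made-up OPPs: capacity reachable at each OPP and its active power (mW) */
static const unsigned int opp_cap[]   = {  256,  512, 1024 };
static const unsigned int opp_power[] = {  100,  300, 1000 };

static unsigned int pd_energy(unsigned int max_util, unsigned int sum_util)
{
        unsigned int i;

        /* b) frequency estimate: lowest OPP which fits the clamped max_util */
        for (i = 0; i < 2 && opp_cap[i] < max_util; i++)
                ;
        /* c) busy time estimate: scale the OPP power by (sum_util / capacity) */
        return opp_power[i] * sum_util / opp_cap[i];
}

int main(void)
{
        /* Same busy time, but a util_min=600 boost raises the OPP estimate */
        printf("%u\n", pd_energy(300, 400));    /* OPP @512:  300*400/512 */
        printf("%u\n", pd_energy(600, 400));    /* OPP @1024: 1000*400/1024 */
        return 0;
}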

Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when an RT task is running on another CPU of
the same performance domain. This can have an impact on energy
estimation but:
- it's not easy to say which approach is better, since it depends on
the use case
- the original approach could still be obtained by setting a smaller
task-specific util_min whenever required

While at it:
- rename schedutil_freq_util() to schedutil_cpu_util(),
since it's no longer used only for frequency selection.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 9 +++----
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++-----
kernel/sched/sched.h | 20 +++++-----------
3 files changed, 45 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 35cdb4a4d802..9c0419087260 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -196,8 +196,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type)
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p)
{
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
@@ -230,7 +231,7 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
- util = uclamp_util(rq, util);
+ util = uclamp_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -290,7 +291,7 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);

- return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+ return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
}

/**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6de1547c2c13..964ebab863a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6203,11 +6203,21 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- long util, max_util, sum_util, energy = 0;
+ unsigned int max_util, util_cfs, cpu_util, cpu_cap;
+ unsigned long sum_util, energy = 0;
+ struct task_struct *tsk;
int cpu;

for (; pd; pd = pd->next) {
+ struct cpumask *pd_mask = perf_domain_span(pd);
+
+ /*
+ * The energy model mandates all the CPUs of a performance
+ * domain have the same capacity.
+ */
+ cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
max_util = sum_util = 0;
+
/*
* The capacity state of CPUs of the current rd can be driven by
* CPUs of another rd if they belong to the same performance
@@ -6218,11 +6228,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* it will not appear in its pd list and will not be accounted
* by compute_energy().
*/
- for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
- util = schedutil_energy_util(cpu, util);
- max_util = max(util, max_util);
- sum_util += util;
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ util_cfs = cpu_util_next(cpu, p, dst_cpu);
+
+ /*
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
+ */
+ sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ ENERGY_UTIL, NULL);
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
+ */
+ tsk = cpu == dst_cpu ? p : NULL;
+ cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce2da8b9ff8c..f81e8930ff19 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2322,7 +2322,6 @@ static inline unsigned long capacity_orig_of(int cpu)
}
#endif

-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
@@ -2338,15 +2337,11 @@ enum schedutil_type {
ENERGY_UTIL,
};

-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type);
-
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

- return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p);

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
@@ -2375,11 +2370,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- return cfs;
-}
-#endif
+#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
+#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
static inline unsigned long cpu_util_irq(struct rq *rq)
--
2.21.0

2019-06-21 08:43:56

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps

In order to properly support hierarchical resource control, the cgroup
delegation model requires that attribute writes from a child group never
fail but are still (potentially) constrained based on the parent's assigned
resources. This requires parent attributes to be properly propagated and
aggregated down to descendants.

Let's implement this mechanism by adding a new "effective" clamp value
for each task group. The effective clamp value is defined as the smaller
value between the clamp value of a group and the effective clamp value
of its parent. This is the actual clamp value enforced on tasks in a
task group.

Since it can be interesting for user-space, e.g. system management
software, to know exactly what the currently propagated/enforced
configuration is, the effective clamp values are exposed to user-space
by means of a new pair of read-only attributes:
cpu.uclamp.{min,max}.effective.
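
Conceptually, the effective value of each group is just the minimum between
its own request and its parent's effective value. A minimal userspace sketch
of this rule (the kernel implementation below additionally walks the css
hierarchy and skips subtrees that are already up to date):

#include <stdio.h>

struct tg {
        unsigned int req;       /* clamp value requested for this group */
        unsigned int eff;       /* clamp value actually enforced (effective) */
        struct tg *parent;
};

static void update_eff(struct tg *tg)
{
        unsigned int value = tg->req;

        /* Propagate the most restrictive (smallest) value from the parent */
        if (tg->parent && tg->parent->eff < value)
                value = tg->parent->eff;
        tg->eff = value;
}

int main(void)
{
        struct tg root  = { .req = 1024, .eff = 1024, .parent = NULL };
        struct tg child = { .req =  800, .parent = &root };

        update_eff(&child);
        printf("%u\n", child.eff);      /* 800: request within parent's limit */

        root.eff = 512;                 /* the parent gets more restrictive */
        update_eff(&child);
        printf("%u\n", child.eff);      /* 512: clamped by the parent */
        return 0;
}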

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v10:
Message-ID: <https://lore.kernel.org/lkml/20190603122422.GA19426@darkstar/>
- rename cgroup attributes to be cpu.uclamp.{min,max}
Message-ID: <https://lore.kernel.org/lkml/[email protected]/>
- use percentage rational numbers for clamp attributes
---
Documentation/admin-guide/cgroup-v2.rst | 21 +++++
kernel/sched/core.c | 103 +++++++++++++++++++++++-
kernel/sched/sched.h | 2 +
3 files changed, 122 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4761d20c5cad..d4407c40bc64 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1033,6 +1033,17 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This minimum utilization
value is used to clamp the task specific minimum utilization clamp.

+ cpu.uclamp.min.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports minimum utilization clamp value currently enforced on a task
+ group.
+
+ The actual minimum utilization as a percentage rational number,
+ e.g. 12.34 for 12.34%.
+
+ This value can be lower than cpu.uclamp.min in case a parent cgroup
+ allows only smaller minimum utilization values.
+
cpu.uclamp.max
A read-write single value file which exists on non-root cgroups.
The default is "max". i.e. no utilization capping
@@ -1044,6 +1055,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.

+ cpu.uclamp.max.effective
+ A read-only single value file which exists on non-root cgroups and
+ reports maximum utilization clamp value currently enforced on a task
+ group.
+
+ The actual maximum utilization as a percentage rational number,
+ e.g. 98.76 for 98.76%.
+
+ This value can be lower than cpu.uclamp.max in case a parent cgroup
+ is enforcing a more restrictive clamping on max utilization.


Memory
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0975f832066e..2b4d0b9bd6b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -762,6 +762,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}

#ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
/* Max allowed minimum utilization */
unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;

@@ -1126,6 +1138,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;

+ mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1142,6 +1156,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
#ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+ root_task_group.uclamp[clamp_id] = uc_max;
#endif
}
}
@@ -6727,6 +6742,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
#endif
}
@@ -6977,6 +6993,44 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
}

#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id)
+{
+ struct cgroup_subsys_state *top_css = css;
+ struct uclamp_se *uc_se, *uc_parent;
+ unsigned int value;
+
+ css_for_each_descendant_pre(css, top_css) {
+ value = css_tg(css)->uclamp_req[clamp_id].value;
+
+ uc_parent = NULL;
+ if (css_tg(css)->parent)
+ uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
+
+ /*
+ * Skip the whole subtrees if the current effective clamp is
+ * already matching the TG's clamp value.
+ * In this case, all the subtrees already have top_value, or a
+ * more restrictive value, as effective clamp.
+ */
+ uc_se = &css_tg(css)->uclamp[clamp_id];
+ if (uc_se->value == value &&
+ uc_parent && uc_parent->value >= value) {
+ css = css_rightmost_descendant(css);
+ continue;
+ }
+
+ /* Propagate the most restrictive effective value */
+ if (uc_parent && uc_parent->value < value)
+ value = uc_parent->value;
+ if (uc_se->value == value)
+ continue;
+
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+ }
+}
+
static inline int uclamp_scale_from_percent(char *buf, u64 *value)
{
*value = SCHED_CAPACITY_SCALE;
@@ -7016,6 +7070,7 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
if (min_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(of_css(of));
@@ -7032,8 +7087,12 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,

uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(of_css(of), UCLAMP_MIN);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return nbytes;
}
@@ -7052,6 +7111,7 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
if (max_value > SCHED_CAPACITY_SCALE)
return -ERANGE;

+ mutex_lock(&uclamp_mutex);
rcu_read_lock();

tg = css_tg(of_css(of));
@@ -7068,14 +7128,19 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,

uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);

+ /* Update effective clamps to track the most restrictive value */
+ cpu_util_update_eff(of_css(of), UCLAMP_MAX);
+
out:
rcu_read_unlock();
+ mutex_unlock(&uclamp_mutex);

return nbytes;
}

static inline void cpu_uclamp_print(struct seq_file *sf,
- enum uclamp_id clamp_id)
+ enum uclamp_id clamp_id,
+ bool effective)
{
struct task_group *tg;
u64 util_clamp;
@@ -7083,7 +7148,9 @@ static inline void cpu_uclamp_print(struct seq_file *sf,

rcu_read_lock();
tg = css_tg(seq_css(sf));
- util_clamp = tg->uclamp_req[clamp_id].value;
+ util_clamp = effective
+ ? tg->uclamp[clamp_id].value
+ : tg->uclamp_req[clamp_id].value;
rcu_read_unlock();

if (util_clamp == SCHED_CAPACITY_SCALE) {
@@ -7097,13 +7164,25 @@ static inline void cpu_uclamp_print(struct seq_file *sf,

static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
{
- cpu_uclamp_print(sf, UCLAMP_MIN);
+ cpu_uclamp_print(sf, UCLAMP_MIN, false);
return 0;
}

static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
{
- cpu_uclamp_print(sf, UCLAMP_MAX);
+ cpu_uclamp_print(sf, UCLAMP_MAX, false);
+ return 0;
+}
+
+static int cpu_uclamp_min_effective_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MIN, true);
+ return 0;
+}
+
+static int cpu_uclamp_max_effective_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MAX, true);
return 0;
}
#endif /* CONFIG_UCLAMP_TASK_GROUP */
@@ -7460,12 +7539,20 @@ static struct cftype cpu_legacy_files[] = {
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
+ {
+ .name = "uclamp.min.effective",
+ .seq_show = cpu_uclamp_min_effective_show,
+ },
{
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
+ {
+ .name = "uclamp.max.effective",
+ .seq_show = cpu_uclamp_max_effective_show,
+ },
#endif
{ } /* Terminate */
};
@@ -7641,12 +7728,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_uclamp_min_show,
.write = cpu_uclamp_min_write,
},
+ {
+ .name = "uclamp.min.effective",
+ .seq_show = cpu_uclamp_min_effective_show,
+ },
{
.name = "uclamp.max",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
+ {
+ .name = "uclamp.max.effective",
+ .seq_show = cpu_uclamp_max_effective_show,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bdbefd50ff46..507c99898a2a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
#ifdef CONFIG_UCLAMP_TASK_GROUP
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
+ /* Effective clamp values used for a task group */
+ struct uclamp_se uclamp[UCLAMP_CNT];
#endif

};
--
2.21.0

2019-06-21 08:44:03

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 14/16] sched/core: uclamp: Propagate system defaults to root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.

- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

- system defaults: which define the maximum possible clamp values
usable by tasks.

- effective clamps: which allow a parent cgroup to constrain (maybe
temporarily) its descendants without losing the information related
to the values "requested" by them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
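
A minimal userspace sketch of the resulting behavior, where the system
default acts as the root group's effective value and restricts or relaxes a
child's request (names and values are made up):

#include <stdio.h>

static unsigned int sysctl_util_max = 1024;     /* system-wide default */
static unsigned int root_eff;                   /* root group effective value */
static unsigned int child_req = 800;            /* child group's request */

static unsigned int child_eff(void)
{
        /* Same rule as for nested groups: the most restrictive value wins */
        return child_req < root_eff ? child_req : root_eff;
}

int main(void)
{
        root_eff = sysctl_util_max;
        printf("%u\n", child_eff());    /* 800: the child's own request */

        sysctl_util_max = 512;          /* tuning the system default... */
        root_eff = sysctl_util_max;
        printf("%u\n", child_eff());    /* 512: ...restricts all descendants */

        sysctl_util_max = 1024;         /* relaxing it... */
        root_eff = sysctl_util_max;
        printf("%u\n", child_eff());    /* 800: ...restores the request */
        return 0;
}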

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b4d0b9bd6b9..f09712f65017 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1006,6 +1006,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id);
+#else
+#define cpu_util_update_eff(...)
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -1039,6 +1046,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
sysctl_sched_uclamp_util_max, false);
}

+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MIN);
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MAX);
+
/*
* Updating all the RUNNABLE task is expensive, keep it simple and do
* just a lazy update at each next enqueue time.
--
2.21.0

2019-06-21 08:44:11

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 16/16] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.

Do that by slightly refactoring uclamp_bucket_inc(): an additional
cgroup_subsys_state (css) parameter is used to walk the list of tasks in
the TG and update the RUNNABLE ones. This is done by taking the rq lock
for each task, the same locking scheme used for CPU affinity mask
updates.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8070e11cafa0..1ddc37320f3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1032,6 +1032,51 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}

+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(rq, p, clamp_id);
+ uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css, int clamp_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css,
unsigned int clamp_id);
@@ -7064,6 +7109,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css,

uc_se->value = value;
uc_se->bucket_id = uclamp_bucket_id(value);
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamp_id);
}
}

--
2.21.0

2019-06-21 08:44:35

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 05/16] sched/core: Allow sched_setattr() to use the current policy

The sched_setattr() syscall mandates that a policy is always specified.
This requires always knowing which policy a task will have when
attributes are configured, which makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy, since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.
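
A userspace usage sketch (assuming the flag value added by this patch;
there is no glibc wrapper for sched_setattr, so the raw syscall is used):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_FLAG_KEEP_POLICY  0x08

struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        /* sched_policy can be left as 0: the current policy is kept */
        attr.sched_flags = SCHED_FLAG_KEEP_POLICY;
        attr.sched_nice = 5;    /* update only the generic attribute */

        /* 0 == calling thread; last argument is the syscall's flags */
        return syscall(SYS_sched_setattr, 0, &attr, 0);
}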

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/uapi/linux/sched.h | 4 +++-
kernel/sched/core.c | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..58b2368d3634 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -51,9 +51,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4b8bb6678f16..0cf6d9270868 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4886,6 +4886,8 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,

if ((int)attr.sched_policy < 0)
return -EINVAL;
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;

rcu_read_lock();
retval = -ESRCH;
--
2.21.0

2019-06-21 08:44:39

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 01/16] sched/core: uclamp: Add CPU's clamp buckets refcounting

Utilization clamping allows the CPU's utilization to be clamped within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp[clamp_id]::bucket[] array is scanned every time a new MAX
aggregated clamp value is needed for a clamp_id. This operation is required
only when the last task of the clamp bucket tracking the current MAX
aggregated clamp value is dequeued. In that case, the CPU is either
entering IDLE or going to schedule a less boosted or more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
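
A minimal userspace sketch of this mapping, using the default 5-bucket
configuration (the macros mirror the ones added by this patch):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024
#define UCLAMP_BUCKETS          5
/* Integer rounded range for each bucket: DIV_ROUND_CLOSEST(1024, 5) == 205 */
#define UCLAMP_BUCKET_DELTA \
        ((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

static unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
        return clamp_value / UCLAMP_BUCKET_DELTA;
}

int main(void)
{
        /* A 25% boost (256/1024) maps into the second bucket, [20-40)% */
        printf("%u\n", uclamp_bucket_id(256));                   /* 1 */
        /* The maximum clamp value maps into the last bucket */
        printf("%u\n", uclamp_bucket_id(SCHED_CAPACITY_SCALE));  /* 4 */
        return 0;
}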

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
include/linux/log2.h | 34 +++++++
include/linux/sched.h | 39 ++++++++
include/linux/sched/topology.h | 6 --
init/Kconfig | 53 +++++++++++
kernel/sched/core.c | 166 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 51 ++++++++++
6 files changed, 343 insertions(+), 6 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 1aec01365ed4..83a4a3ca3e8a 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -220,4 +220,38 @@ int __order_base_2(unsigned long n)
ilog2((n) - 1) + 1) : \
__order_base_2(n) \
)
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+ if (n < 2)
+ return 1;
+ if (is_power_of_2(n))
+ return order_base_2(n) + 1;
+ return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g. bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bf(0) = 1
+ * bf(1) = 1
+ * bf(2) = 2
+ * bf(3) = 2
+ * bf(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n) \
+( \
+ __builtin_constant_p(n) ? ( \
+ ((n) == 0 || (n) == 1) \
+ ? 1 : ilog2(n) + 1 \
+ ) : \
+ __bits_per(n) \
+)
#endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b2590a8d038..0027b379fa9b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -281,6 +281,18 @@ struct vtime {
u64 gtime;
};

+/*
+ * Utilization clamp constraints.
+ * @UCLAMP_MIN: Minimum utilization
+ * @UCLAMP_MAX: Maximum utilization
+ * @UCLAMP_CNT: Utilization clamp constraints count
+ */
+enum uclamp_id {
+ UCLAMP_MIN = 0,
+ UCLAMP_MAX,
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -312,6 +324,10 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+/* Increase resolution of cpu_capacity calculations */
+# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -560,6 +576,25 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+#ifdef CONFIG_UCLAMP_TASK
+/* Number of utilization clamp buckets (shorter alias) */
+#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
+
+/*
+ * Utilization clamp for a scheduling entity
+ * @value: clamp value "assigned" to a se
+ * @bucket_id: bucket index corresponding to the "assigned" value
+ *
+ * The bucket_id is the index of the clamp bucket matching the clamp value
+ * which is pre-computed and stored to avoid expensive integer divisions from
+ * the fast path.
+ */
+struct uclamp_se {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -640,6 +675,10 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 53afbe07354a..2f8317e55c12 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@

#include <linux/sched/idle.h>

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/init/Kconfig b/init/Kconfig
index 086b72a99d60..bf96faf3fe43 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -677,6 +677,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+ With this option, the user can specify the min and max CPU
+ utilization allowed for RUNNABLE tasks. The max utilization defines
+ the maximum frequency a task should use while the min utilization
+ defines the minimum frequency it should use.
+
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+config UCLAMP_BUCKETS_COUNT
+ int "Number of supported utilization clamp buckets"
+ range 5 20
+ default 5
+ depends on UCLAMP_TASK
+ help
+ Defines the number of clamp buckets to use. The range of each bucket
+ will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+ number of clamp buckets the finer their granularity and the higher
+ the precision of clamping aggregation and tracking at run-time.
+
+ For example, with the minimum configuration value we will have 5
+ clamp buckets tracking 20% utilization each. A 25% boosted task will
+ be refcounted in the [20..39]% bucket and will set the bucket clamp
+ effective value to 25%.
+ If a second 30% boosted task should be co-scheduled on the same CPU,
+ that task will be refcounted in the same bucket of the first task and
+ it will boost the bucket clamp effective value to 30%.
+ The clamp effective value of a bucket is reset to its nominal value
+ (20% in the example above) when there are no more tasks refcounted in
+ that bucket.
+
+ An additional boost/capping margin can be added to some tasks. In the
+ example above the 25% task will be boosted to 30% until it exits the
+ CPU. If that should be considered not acceptable on certain systems,
+ it's always possible to reduce the margin by increasing the number of
+ clamp buckets to trade off used memory for run-time tracking
+ precision.
+
+ If in doubt, use the default value.
+
+endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83bd6bb32a34..bc0389d3e77d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -761,6 +761,168 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+
+/* Integer rounded range for each bucket */
+#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
+
+#define for_each_clamp_id(clamp_id) \
+ for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
+
+static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
+{
+ return clamp_value / UCLAMP_BUCKET_DELTA;
+}
+
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_se_set(struct uclamp_se *uc_se, unsigned int value)
+{
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+}
+
+static inline
+unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
+{
+ struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
+ int bucket_id = UCLAMP_BUCKETS - 1;
+
+ /*
+ * Since both min and max clamps are max aggregated, find the
+ * top most bucket with tasks in.
+ */
+ for ( ; bucket_id >= 0; bucket_id--) {
+ if (!bucket[bucket_id].tasks)
+ continue;
+ return bucket[bucket_id].value;
+ }
+
+ /* No tasks -- default clamp values */
+ return uclamp_none(clamp_id);
+}
+
+/*
+ * When a task is enqueued on a rq, the clamp bucket currently defined by the
+ * task's uclamp::bucket_id is refcounted on that rq. This also immediately
+ * updates the rq's clamp value if required.
+ */
+static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+
+ lockdep_assert_held(&rq->lock);
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ bucket->tasks++;
+
+ if (uc_se->value > READ_ONCE(uc_rq->value))
+ WRITE_ONCE(uc_rq->value, bucket->value);
+}
+
+/*
+ * When a task is dequeued from a rq, the clamp bucket refcounted by the task
+ * is released. If this is the last task reference counting the rq's max
+ * active clamp value, then the rq's clamp value is updated.
+ *
+ * Both refcounted tasks and rq's cached clamp values are expected to be
+ * always valid. If it's detected they are not, as defensive programming,
+ * enforce the expected state and warn.
+ */
+static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+ unsigned int rq_clamp;
+
+ lockdep_assert_held(&rq->lock);
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ SCHED_WARN_ON(!bucket->tasks);
+ if (likely(bucket->tasks))
+ bucket->tasks--;
+
+ if (likely(bucket->tasks))
+ return;
+
+ rq_clamp = READ_ONCE(uc_rq->value);
+ /*
+ * Defensive programming: this should never happen. If it happens,
+ * e.g. due to future modification, warn and fixup the expected value.
+ */
+ SCHED_WARN_ON(bucket->value > rq_clamp);
+ if (bucket->value >= rq_clamp)
+ WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for_each_clamp_id(clamp_id)
+ uclamp_rq_inc_id(rq, p, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for_each_clamp_id(clamp_id)
+ uclamp_rq_dec_id(rq, p, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+ unsigned int clamp_id;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct uclamp_bucket *bucket;
+ struct uclamp_rq *uc_rq;
+ unsigned int bucket_id;
+
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+ for_each_clamp_id(clamp_id) {
+ uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
+
+ bucket_id = 1;
+ while (bucket_id < UCLAMP_BUCKETS) {
+ bucket = &uc_rq->bucket[bucket_id];
+ bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
+ ++bucket_id;
+ }
+ }
+ }
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&init_task.uclamp[clamp_id],
+ uclamp_none(clamp_id));
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -771,6 +933,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, flags & ENQUEUE_WAKEUP);
}

+ uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -784,6 +947,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

+ uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -6082,6 +6246,8 @@ void __init sched_init(void)

psi_init();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..9149d90b8b7c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -791,6 +791,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value is affecting a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * There are up to UCLAMP_CNT possible different clamp values, currently there
+ * are only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+ unsigned int value;
+ struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -825,6 +867,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1639,6 +1686,10 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
--
2.21.0

2019-06-21 08:44:47

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 08/16] sched/core: uclamp: Set default clamps for RT tasks

By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks, which instead are
expected to run at the maximum available frequency unless they are
explicitly capped.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

1. the task is switched to the RT class and it does not already have a
user-defined clamp value assigned.

2. an RT task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2c49b23efc87..ab75e874fdcc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1051,6 +1051,27 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
+ unsigned int clamp_id;
+
+ /*
+ * On scheduling class change, reset to default clamps for tasks
+ * without a task-specific value.
+ */
+ for_each_clamp_id(clamp_id) {
+ struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ /* Keep using defined clamps across class changes */
+ if (uc_se->user_defined)
+ continue;
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uclamp_se_set(uc_se, clamp_value, false);
+ }
+
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

@@ -1076,8 +1097,13 @@ static void uclamp_fork(struct task_struct *p)
return;

for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
}
}

--
2.21.0

2019-06-21 08:44:49

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 09/16] sched/cpufreq: uclamp: Add clamps for FAIR and RT tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints. This will allow, for example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support per-task tuning of frequency
selection, thus enabling a fine-grained definition of performance boosting
vs. energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.
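
For example, if MAX aggregation across differently clamped RUNNABLE tasks
leaves the rq with min_util=512 and max_util=307, the inversion is resolved
in favor of the boost. A minimal userspace sketch of this policy (mirroring
the uclamp_util() logic added below):

#include <stdio.h>

static unsigned int uclamp_util(unsigned int util,
                                unsigned int min_util, unsigned int max_util)
{
        /* MAX aggregation of different clamps can invert min/max: boost wins */
        if (min_util >= max_util)
                return min_util;
        if (util < min_util)
                return min_util;
        if (util > max_util)
                return max_util;
        return util;
}

int main(void)
{
        /* Inverted rq clamps (min=512 > max=307): the boost wins */
        printf("%u\n", uclamp_util(400, 512, 307));     /* 512 */
        /* Only a capped task RUNNABLE (min=0, max=307) */
        printf("%u\n", uclamp_util(400, 0, 307));       /* 307 */
        return 0;
}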

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/fair.c | 4 ++++
kernel/sched/rt.c | 4 ++++
kernel/sched/sched.h | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 962cf343f798..35cdb4a4d802 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -202,8 +202,10 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
+ if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -219,9 +221,16 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = util_cfs + cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);

dl_util = cpu_util_dl(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c11dcdedcbc..6de1547c2c13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10361,6 +10361,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 63ad7c90822c..a532558a5176 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cd36002436fc..c33a57f14743 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2265,6 +2265,29 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true
--
2.21.0

2019-06-21 08:45:00

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

When a task-specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}queued. However, when cgroups are also in use, the
task-specific clamp values could be restricted by the task_group (TG)
clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp_bucket defining the
smaller clamp between the task specific value and its TG effective
value. This makes it possible to:

1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value

2. implement a "nice-like" policy, where tasks are still allowed to
request less than what is enforced by their current TG

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group; for tasks in the root group or in an autogroup, only system
defaults are enforced.
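
The resulting aggregation can be summarized with a minimal standalone
sketch (a simplified view only: effective_clamp() is a hypothetical
helper, and bucket refcounting as well as the root group/autogroup
special cases are omitted):

  #include <stdbool.h>

  /* Effective clamp of a task, given its TG's effective value */
  static inline unsigned int
  effective_clamp(unsigned int task_req, bool user_defined,
                  unsigned int tg_effective, unsigned int sys_default)
  {
          unsigned int value = task_req;

          /* The TG effective value always restricts the task request */
          if (value > tg_effective || !user_defined)
                  value = tg_effective;

          /* System default restrictions always apply */
          return value < sys_default ? value : sys_default;
  }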

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f09712f65017..8070e11cafa0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -862,16 +862,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks neither in the root
+ * group nor in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];

/* System default restrictions always apply */
--
2.21.0

2019-06-21 08:45:29

by Patrick Bellasi

[permalink] [raw]
Subject: [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal basis, without considering the
actual frequency a CPU is running at. Thus, the amount of computation
completed by a task within an allocated bandwidth can vary widely
depending on the actual frequency the CPU runs that task at.
The amount of computation can also be affected by the specific CPU a
task is running on, especially on asymmetric capacity systems like
Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow enforcing utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run at least at a
minimum frequency which corresponds to the uclamp.min
utilization

- uclamp.max: defines the maximum utilization which should be considered
i.e. the RUNNABLE tasks of this group will run up to a
maximum frequency which corresponds to the uclamp.max
utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
hierarchies, while system wide clamps are defined by a generic
interface which does not depend on cgroups. This system wide
interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
are a restriction of the group requests considering its parent's
effective constraints. Root group effective constraints are defined
by the system wide interface.
This mechanism allows each (non-root) level of the hierarchy to:
- request whatever clamp values it would like to get
- effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
sched_setattr(), thus allowing task requests to be controlled and restricted.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing uclamp.min < uclamp.max.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.
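
As a usage sketch (the cgroup path and the values below are
hypothetical, assuming the unified hierarchy is mounted at
/sys/fs/cgroup), a group can be boosted and capped from user-space as
follows:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Write a percentage string into one of the group's clamp attributes */
  static int set_uclamp(const char *attr, const char *value)
  {
          char path[256];
          ssize_t ret;
          int fd;

          snprintf(path, sizeof(path), "/sys/fs/cgroup/mygroup/%s", attr);
          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          ret = write(fd, value, strlen(value));
          close(fd);
          return ret < 0 ? -1 : 0;
  }

  int main(void)
  {
          /* Boost "mygroup" to at least 20% and cap it at 80% utilization */
          set_uclamp("cpu.uclamp.min", "20.0");
          set_uclamp("cpu.uclamp.max", "80.0");
          return 0;
  }

Internally "20.0" is parsed with two decimal places (i.e. as 2000) and
scaled to capacity units: 2000 * 1024 / 10000 ~= 205 out of
SCHED_CAPACITY_SCALE.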

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>

---
Changes in v10:
Message-ID: <https://lore.kernel.org/lkml/20190603122422.GA19426@darkstar/>
- rename cgroup attributes to be cpu.uclamp.{min,max}
Message-ID: <https://lore.kernel.org/lkml/[email protected]/>
- use percentage rational numbers for clamp attributes
Message-ID: <https://lore.kernel.org/lkml/[email protected]/>
- update initialization of subgroups clamps to be none by default
---
Documentation/admin-guide/cgroup-v2.rst | 29 ++++
init/Kconfig | 22 +++
kernel/sched/core.c | 181 +++++++++++++++++++++++-
kernel/sched/sched.h | 6 +
4 files changed, 237 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a5c845338d6d..4761d20c5cad 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,12 @@ controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

+Cycles distribution is based, by default, on a temporal basis and it
+does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support makes it possible to enforce
+a minimum bandwidth, which should always be provided by a CPU, and a maximum
+bandwidth, which should never be exceeded by a CPU.
+
WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup. Be aware that system management software may already
@@ -1016,6 +1022,29 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.

+ cpu.uclamp.min
+ A read-write single value file which exists on non-root cgroups.
+ The default is "0", i.e. no utilization boosting.
+
+ The requested minimum utilization as a percentage rational number,
+ e.g. 12.34 for 12.34%.
+
+ This interface allows reading and setting minimum utilization clamp
+ values similar to sched_setattr(2). This minimum utilization
+ value is used to clamp the task specific minimum utilization clamp.
+
+ cpu.uclamp.max
+ A read-write single value file which exists on non-root cgroups.
+ The default is "max". i.e. no utilization capping
+
+ The requested maximum utilization as a percentage rational number,
+ e.g. 98.76 for 98.76%.
+
+ This interface allows reading and setting maximum utilization clamp
+ values similar to sched_setattr(2). This maximum utilization
+ value is used to clamp the task specific maximum utilization clamp.
+
+

Memory
------
diff --git a/init/Kconfig b/init/Kconfig
index bf96faf3fe43..68a21188786c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -903,6 +903,28 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config UCLAMP_TASK_GROUP
+ bool "Utilization clamping per group of tasks"
+ depends on CGROUP_SCHED
+ depends on UCLAMP_TASK
+ default n
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
+
+ When this option is enabled, the user can specify a min and max
+ CPU bandwidth which is allowed for each single task in a group.
+ The max bandwidth allows clamping the maximum frequency a task
+ can use, while the min bandwidth allows defining a minimum
+ frequency a task will always use.
+
+ When task group based utilization clamping is enabled, any
+ task-specific clamp value is constrained by the cgroup-specified
+ clamp value. Neither the minimum nor the maximum task clamp can
+ be bigger than the corresponding clamp defined at task group level.
+
+ If in doubt, say N.
+
config CGROUP_PIDS
bool "PIDs controller"
help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2226ddd1de04..0975f832066e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1138,8 +1138,12 @@ static void __init init_uclamp(void)

/* System defaults allow max clamp values for both indexes */
uclamp_se_set(&uc_max, uclamp_none(UCLAMP_MAX), false);
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
uclamp_default[clamp_id] = uc_max;
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ root_task_group.uclamp_req[clamp_id] = uc_max;
+#endif
+ }
}

#else /* CONFIG_UCLAMP_TASK */
@@ -6714,6 +6718,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
/* task_group_lock serializes the addition/removal of task groups */
static DEFINE_SPINLOCK(task_group_lock);

+static inline void alloc_uclamp_sched_group(struct task_group *tg,
+ struct task_group *parent)
+{
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ int clamp_id;
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&tg->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
+#endif
+}
+
static void sched_free_group(struct task_group *tg)
{
free_fair_sched_group(tg);
@@ -6737,6 +6754,8 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+ alloc_uclamp_sched_group(tg, parent);
+
return tg;

err:
@@ -6957,6 +6976,138 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
sched_move_task(task);
}

+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+{
+ *value = SCHED_CAPACITY_SCALE;
+
+ buf = strim(buf);
+ if (strncmp("max", buf, 4)) {
+ s64 percent;
+ int ret;
+
+ ret = cgroup_parse_float(buf, 2, &percent);
+ if (ret)
+ return ret;
+
+ percent <<= SCHED_CAPACITY_SHIFT;
+ *value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+ }
+
+ return 0;
+}
+
+static inline u64 uclamp_percent_from_scale(u64 value)
+{
+ return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
+}
+
+static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct task_group *tg;
+ u64 min_value;
+ int ret;
+
+ ret = uclamp_scale_from_percent(buf, &min_value);
+ if (ret)
+ return ret;
+ if (min_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(of_css(of));
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN], min_value, false);
+
+out:
+ rcu_read_unlock();
+
+ return nbytes;
+}
+
+static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes,
+ loff_t off)
+{
+ struct task_group *tg;
+ u64 max_value;
+ int ret;
+
+ ret = uclamp_scale_from_percent(buf, &max_value);
+ if (ret)
+ return ret;
+ if (max_value > SCHED_CAPACITY_SCALE)
+ return -ERANGE;
+
+ rcu_read_lock();
+
+ tg = css_tg(of_css(of));
+ if (tg == &root_task_group) {
+ ret = -EINVAL;
+ goto out;
+ }
+ if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
+ goto out;
+ if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX], max_value, false);
+
+out:
+ rcu_read_unlock();
+
+ return nbytes;
+}
+
+static inline void cpu_uclamp_print(struct seq_file *sf,
+ enum uclamp_id clamp_id)
+{
+ struct task_group *tg;
+ u64 util_clamp;
+ u64 percent;
+
+ rcu_read_lock();
+ tg = css_tg(seq_css(sf));
+ util_clamp = tg->uclamp_req[clamp_id].value;
+ rcu_read_unlock();
+
+ if (util_clamp == SCHED_CAPACITY_SCALE) {
+ seq_puts(sf, "max\n");
+ return;
+ }
+
+ percent = uclamp_percent_from_scale(util_clamp);
+ seq_printf(sf, "%llu.%02llu\n", percent / 100, percent % 100);
+}
+
+static int cpu_uclamp_min_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MIN);
+ return 0;
+}
+
+static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
+{
+ cpu_uclamp_print(sf, UCLAMP_MAX);
+ return 0;
+}
+#endif /* CONFIG_UCLAMP_TASK_GROUP */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
@@ -7301,6 +7452,20 @@ static struct cftype cpu_legacy_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
#endif
{ } /* Terminate */
};
@@ -7468,6 +7633,20 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_max_show,
.write = cpu_max_write,
},
+#endif
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ {
+ .name = "uclamp.min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_min_show,
+ .write = cpu_uclamp_min_write,
+ },
+ {
+ .name = "uclamp.max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = cpu_uclamp_max_show,
+ .write = cpu_uclamp_max_write,
+ },
#endif
{ } /* terminate */
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f81e8930ff19..bdbefd50ff46 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -393,6 +393,12 @@ struct task_group {
#endif

struct cfs_bandwidth cfs_bandwidth;
+
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* Clamp values requested for a task group */
+ struct uclamp_se uclamp_req[UCLAMP_CNT];
+#endif
+
};

#ifdef CONFIG_FAIR_GROUP_SCHED
--
2.21.0

2019-06-21 14:03:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v10 11/16] sched/fair: uclamp: Add uclamp support to energy_compute()

On Fri, Jun 21, 2019 at 09:42:12AM +0100, Patrick Bellasi wrote:

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ce2da8b9ff8c..f81e8930ff19 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2322,7 +2322,6 @@ static inline unsigned long capacity_orig_of(int cpu)
> }
> #endif
>
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> /**
> * enum schedutil_type - CPU utilization type
> * @FREQUENCY_UTIL: Utilization used to select frequency
> @@ -2338,15 +2337,11 @@ enum schedutil_type {
> ENERGY_UTIL,
> };
>
> -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> - unsigned long max, enum schedutil_type type);
> -
> -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> -{
> - unsigned long max = arch_scale_cpu_capacity(NULL, cpu);

That conflicts with the patch I have removing that NULL argument, fixed
it up.

> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>
> - return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
> -}
> +unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> + unsigned long max, enum schedutil_type type,
> + struct task_struct *p);
>
> static inline unsigned long cpu_bw_dl(struct rq *rq)
> {
> @@ -2375,11 +2370,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> return READ_ONCE(rq->avg_rt.util_avg);
> }
> #else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> -{
> - return cfs;
> -}
> -#endif
> +#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0

Was there a good reason for this to be a macro and not an inline
function? I've changed it, if it explodes in 0day, it's all my fault ;-)

> +#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */

2019-06-21 14:48:33

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v10 11/16] sched/fair: uclamp: Add uclamp support to energy_compute()

On 21-Jun 16:01, Peter Zijlstra wrote:
> On Fri, Jun 21, 2019 at 09:42:12AM +0100, Patrick Bellasi wrote:
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index ce2da8b9ff8c..f81e8930ff19 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2322,7 +2322,6 @@ static inline unsigned long capacity_orig_of(int cpu)
> > }
> > #endif
> >
> > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > /**
> > * enum schedutil_type - CPU utilization type
> > * @FREQUENCY_UTIL: Utilization used to select frequency
> > @@ -2338,15 +2337,11 @@ enum schedutil_type {
> > ENERGY_UTIL,
> > };
> >
> > -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > - unsigned long max, enum schedutil_type type);
> > -
> > -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> > -{
> > - unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
>
> That conflicts with the patch I have removing that NULL argument, fixed
> it up.

Ok, I noticed only now that you have this:

commit 119fd437f412 ("sched/topology: Remove unused sd param from arch_scale_cpu_capacity()")

from Vincent on your queue. :/


> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >
> > - return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
> > -}
> > +unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
> > + unsigned long max, enum schedutil_type type,
> > + struct task_struct *p);
> >
> > static inline unsigned long cpu_bw_dl(struct rq *rq)
> > {
> > @@ -2375,11 +2370,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> > return READ_ONCE(rq->avg_rt.util_avg);
> > }
> > #else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> > -static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
> > -{
> > - return cfs;
> > -}
> > -#endif
> > +#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
>
> Was there a good reason for this to be a macro and not an inline
> function?

Mmm... not really, apart from perhaps saving some lines.

I notice sometimes we use macros (e.g. perf_domain_span), but it's
certainly not the most common pattern.

> I've changed it, if it explodes in 0day, it's all my fault ;-)

Sure, I guess if 0day explodes it will not be for that change. :)


--
#include <best/regards.h>

Patrick Bellasi

2019-06-21 14:57:25

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Add utilization clamping support

On 21-Jun 09:42, Patrick Bellasi wrote:
> Hi all, this is a respin of:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> which addresses all Tejun's concerns:
>
> - rename cgroup attributes to be cpu.uclamp.{min,max}
> - update initialization of subgroups clamps to be "no clamps" by default
> - use percentage rational numbers for clamp attributes, e.g. "12.34" for 12.34%.
>
> by introducing modifications impacting only patches:
>
> [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller
> [PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps
>
> The rest of the patches are the same as per in v9, they have been just rebased
> on top of:
>
> tj/cgroup.git for-5.3
> tip/tip.git sched/core
>
> AFAIU, all the first 11 patches have been code reviewed and should be at a
> "ready to merge" quality level. Please let me know if I'm wrong and there
> is something else I need/can to do on those patches.
>
> Otherwise, now that we should have settled all the behavioral aspects, I'm

Regarding the behavioral aspects, here I have a report with some
simple tests for the current implementation:

https://gist.github.com/derkling/519459b5a2be35d8681fbaf1d6efe225

There are a couple of sections at the end to test the "Delegation
Model" with both CGroups v1 and v2.

I'm sharing the link just in case it can be helpful to verify whether
what has been implemented actually matches what Tejun expects as a sane
cgroups interface.

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-06-22 15:06:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller

Hello,

Generally looks good to me. Some nitpicks.

On Fri, Jun 21, 2019 at 09:42:13AM +0100, Patrick Bellasi wrote:
> @@ -951,6 +951,12 @@ controller implements weight and absolute bandwidth limit models for
> normal scheduling policy and absolute bandwidth allocation model for
> realtime scheduling policy.
>
> +Cycles distribution is based, by default, on a temporal basis and it
> +does not account for the frequency at which tasks are executed.
> +The (optional) utilization clamping support makes it possible to enforce
> +a minimum bandwidth, which should always be provided by a CPU, and a maximum
> +bandwidth, which should never be exceeded by a CPU.

I kinda wonder whether the term bandwidth is a bit confusing because
it's also used for cpu.max/min. Would just calling it frequency be
clearer?

> +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes,
> + loff_t off)
> +{
> + struct task_group *tg;
> + u64 min_value;
> + int ret;
> +
> + ret = uclamp_scale_from_percent(buf, &min_value);
> + if (ret)
> + return ret;
> + if (min_value > SCHED_CAPACITY_SCALE)
> + return -ERANGE;
> +
> + rcu_read_lock();
> +
> + tg = css_tg(of_css(of));
> + if (tg == &root_task_group) {
> + ret = -EINVAL;
> + goto out;
> + }

I don't think you need the above check.

> + if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
> + goto out;
> + if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
> + ret = -EINVAL;

So, uclamp.max limits the maximum freq% the subtree can get and
uclamp.min limits the maximum freq% protection it can get. Let's say
uclamp.max is 50% and uclamp.min is 100%. It means that protection is
not limited but the actual freq% is limited up to 50%, which isn't
necessarily invalid. For a simple example, a user might be saying
that they want to get whatever protection they can get from their parent
but wanna limit the eventual freq at 50%, and it isn't too difficult to
imagine cases where the two knobs are configured separately, especially
when configuration is being managed hierarchically / automatically.

tl;dr is that we don't need the above restriction and shouldn't
generally be restricting configurations when they don't need to.

Thanks.

--
tejun

2019-06-22 15:10:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps

Hello,

On Fri, Jun 21, 2019 at 09:42:14AM +0100, Patrick Bellasi wrote:
> Since it can be interesting for userspace, e.g. system management
> software, to know exactly what the currently propagated/enforced
> configuration is, the effective clamp values are exposed to user-space
> by means of a new pair of read-only attributes
> cpu.util.{min,max}.effective.

Can we not add the effective interface file for now? I don't think
it's a bad idea but would like to think more about it. For cpuset, it
was needed because configuration was so interwoven with the effective
masks, but we don't generally do this for other min/max or weight
knobs, all of which have effective hierarchical values and I'm not
quite sure about adding .effective for all of them. It could be that
that's what we end up doing eventually but I'd like to think a bit
more about it.

Thanks.

--
tejun

2019-06-24 20:43:56

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller

On 22-Jun 08:03, Tejun Heo wrote:
> Hello,

Hi,

> Generally looks good to me. Some nitpicks.
>
> On Fri, Jun 21, 2019 at 09:42:13AM +0100, Patrick Bellasi wrote:
> > @@ -951,6 +951,12 @@ controller implements weight and absolute bandwidth limit models for
> > normal scheduling policy and absolute bandwidth allocation model for
> > realtime scheduling policy.
> >
> > +Cycles distribution is based, by default, on a temporal basis and it
> > +does not account for the frequency at which tasks are executed.
> > +The (optional) utilization clamping support makes it possible to enforce
> > +a minimum bandwidth, which should always be provided by a CPU, and a maximum
> > +bandwidth, which should never be exceeded by a CPU.
>
> I kinda wonder whether the term bandwidth is a bit confusing because
> it's also used for cpu.max/min. Would just calling it frequency be
> clearer?

Maybe I should find a better way to express the concept above.

I agree that bandwidth is already used by cpu.{max,min}; what I want
to call out is that clamps allow enriching that concept.

By hinting the scheduler about the min/max required utilization we can
better define the amount of actual CPU cycles required/allowed.
That's a bit more precise bandwidth control compared to just relying on
temporal runtime/period limits.

> > +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> > + char *buf, size_t nbytes,
> > + loff_t off)
> > +{
> > + struct task_group *tg;
> > + u64 min_value;
> > + int ret;
> > +
> > + ret = uclamp_scale_from_percent(buf, &min_value);
> > + if (ret)
> > + return ret;
> > + if (min_value > SCHED_CAPACITY_SCALE)
> > + return -ERANGE;
> > +
> > + rcu_read_lock();
> > +
> > + tg = css_tg(of_css(of));
> > + if (tg == &root_task_group) {
> > + ret = -EINVAL;
> > + goto out;
> > + }
>
> I don't think you need the above check.

Don't we want to forbid attribute tuning from the root group?

> > + if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
> > + goto out;
> > + if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
> > + ret = -EINVAL;
>
> So, uclamp.max limits the maximum freq% the subtree can get and
> uclamp.min limits the maximum freq% protection it can get. Let's say
> uclamp.max is 50% and uclamp.min is 100%.

That's not possible, in the current implementation we always enforce
the limit (uclamp.max) to be _not smaller_ than the protection
(uclamp.min).

Indeed, in principle, it does not make sense to ask for a minimum
utilization (i.e. frequency boosting) which is higher than the
maximum allowed utilization (i.e. frequency capping).


> It means that protection is not limited but the actual freq% is
> limited up to 50%, which isn't necessarily invalid.
> For a simple example, a user might be saying
> that they want to get whatever protection they can get from their parent
> but wanna limit the eventual freq at 50%, and it isn't too difficult to
> imagine cases where the two knobs are configured separately, especially
> when configuration is being managed hierarchically / automatically.

That's not my understanding, in v10 by default when we create a
subgroup we assign it uclamp.min=0%, meaning that we don't boost
frequencies.

It seems instead that you are asking to set uclamp.min=100% by
default, so that the effective value will give us whatever the parent
allows. Is that correct?

> tl;dr is that we don't need the above restriction and shouldn't
> generally be restricting configurations when they don't need to.
>
> Thanks.
>
> --
> tejun

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-06-24 20:53:32

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps

On 22-Jun 08:07, Tejun Heo wrote:
> Hello,

Hi,

> On Fri, Jun 21, 2019 at 09:42:14AM +0100, Patrick Bellasi wrote:
> > Since it can be interesting for userspace, e.g. system management
> > software, to know exactly what the currently propagated/enforced
> > configuration is, the effective clamp values are exposed to user-space
> > by means of a new pair of read-only attributes
> > cpu.util.{min,max}.effective.
>
> Can we not add the effective interface file for now?

You mean just the (read-only) user-space API right?

I found it quite convenient, even just for debugging.
Moreover it allows a container to know exactly what it's getting...

> I don't think it's a bad idea but would like to think more about it.
> For cpuset, it was needed because configuration was so interwoven
> with the effective masks, but we don't generally do this for other
> min/max or weight knobs, all of which have effective hierarchical
> values and I'm not quite sure about adding .effective for all of
> them.
> It could be that that's what we end up doing eventually but
> I'd like to think a bit more about it.

... but I see your point and, since it's not strictly required, I
think we can drop it in v11. Will check better if it's of any use
apart from debugging/testing support.

> Thanks.
>
> --
> tejun

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi

2019-06-24 21:05:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v10 13/16] sched/core: uclamp: Propagate parent clamps

Hello, Patrick.

On Mon, Jun 24, 2019 at 06:34:05PM +0100, Patrick Bellasi wrote:
> > On Fri, Jun 21, 2019 at 09:42:14AM +0100, Patrick Bellasi wrote:
> > > Since it can be interesting for userspace, e.g. system management
> > > software, to know exactly what the currently propagated/enforced
> > > configuration is, the effective clamp values are exposed to user-space
> > > by means of a new pair of read-only attributes
> > > cpu.util.{min,max}.effective.
> >
> > Can we not add the effective interface file for now?
>
> You mean just the (read-only) user-space API right?

Yeah.

> I found it quite convenient, even just for debugging.
> Moreover it allows a container to know exactly what it's getting...

I fully agree.

> > I don't think it's a bad idea but would like to think more about it.
> > For cpuset, it was needed because configuration was so interwoven
> > with the effective masks, but we don't generally do this for other
> > min/max or weight knobs, all of which have effective hierarchical
> > values and I'm not quite sure about adding .effective for all of
> > them.
> > It could be that that's what we end up doing eventually but
> > I'd like to think a bit more about it.
>
> ... but I see your point and, since it's not strictly required, I
> think we can drop it in v11. Will check better if it's of any use
> apart from debugging/testing support.

Yeah, I just wanna figure out a plan which works for other controllers
too. It could be that the right thing to do is just adding .effective
to everything but idk I need to think more about it.

Thanks.

--
tejun

2019-06-24 21:16:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller

Hey, Patrick.

On Mon, Jun 24, 2019 at 06:29:06PM +0100, Patrick Bellasi wrote:
> > I kinda wonder whether the term bandwidth is a bit confusing because
> > it's also used for cpu.max/min. Would just calling it frequency be
> > clearer?
>
> Maybe I should find a better way to express the concept above.
>
> I agree that bandwidth is already used by cpu.{max,min}; what I want
> to call out is that clamps allow enriching that concept.
>
> By hinting the scheduler about the min/max required utilization we can
> better define the amount of actual CPU cycles required/allowed.
> That's a bit more precise bandwidth control compared to just relying on
> temporal runtime/period limits.

I see. I wonder whether it's overloading the same term too subtly
tho. It's great to document how they interact but it *might* be
easier for readers if a different term is used even if the meaning is
essentially the same. Anyways, it's a nitpick. Please feel free to
ignore.

> > > + tg = css_tg(of_css(of));
> > > + if (tg == &root_task_group) {
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> >
> > I don't think you need the above check.
>
> Don't we want to forbid attribute tuning from the root group?

Yeah, that's enforced by NOT_ON_ROOT flag, right?

> > So, uclamp.max limits the maximum freq% the subtree can get and
> > uclamp.min limits the maximum freq% protection it can get. Let's say
> > uclamp.max is 50% and uclamp.min is 100%.
>
> That's not possible, in the current implementation we always enforce
> the limit (uclamp.max) to be _not smaller_ than the protection
> (uclamp.min).
>
> Indeed, in principle, it does not make sense to ask for a minimum
> utilization (i.e. frequency boosting) which is higher than the
> maximum allowed utilization (i.e. frequency capping).

Yeah, I'm trying to explain that actually it does.

> > It means that protection is not limited but the actual freq% is
> > limited up to 50%, which isn't necessarily invalid.
> > For a simple example, a user might be saying
> > that they want to get whatever protection they can get from their parent
> > but wanna limit the eventual freq at 50%, and it isn't too difficult to
> > imagine cases where the two knobs are configured separately, especially
> > when configuration is being managed hierarchically / automatically.
>
> That's not my understanding, in v10 by default when we create a
> subgroup we assign it uclamp.min=0%, meaning that we don't boost
> frequencies.
>
> It seems instead that you are asking to set uclamp.min=100% by
> default, so that the effective value will give us whatever the parent
> allows. Is that correct?

No, the defaults are fine. I'm trying to say that min/max
configurations don't need to be coupled like this and there are valid
use cases where the configured min is higher than max when
configurations are nested and managed automatically.

Limits always trump protection in effect of course but please don't
limit what can be configured.

Thanks.

--
tejun

Subject: [tip:sched/core] sched/uclamp: Add CPU's clamp buckets refcounting

Commit-ID: 69842cba9ace84849bb9b8edcdf2cefccd97901c
Gitweb: https://git.kernel.org/tip/69842cba9ace84849bb9b8edcdf2cefccd97901c
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 21 Jun 2019 09:42:02 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:23:44 +0200

sched/uclamp: Add CPU's clamp buckets refcounting

Utilization clamping makes it possible to clamp the CPU's utilization
within a [util_min, util_max] range, depending on the set of RUNNABLE
tasks on that CPU. Each task references two "clamp buckets" defining
its minimum and maximum (util_{min,max}) utilization "clamp values". A
CPU's clamp bucket is active if there is at least one RUNNABLE task
enqueued on that CPU refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU, a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp::bucket[clamp_id][] array is scanned every time we need
to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when the last task of a clamp bucket
tracking the current MAX aggregated clamp value is dequeued. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Also add the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.
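
A minimal user-space sketch of that linear mapping (the constants
mirror the patch below with the default Kconfig value; the main()
driver is purely illustrative):

  #include <stdio.h>

  #define SCHED_CAPACITY_SCALE 1024
  #define UCLAMP_BUCKETS 5 /* CONFIG_UCLAMP_BUCKETS_COUNT */

  /* Integer rounded range for each bucket: DIV_ROUND_CLOSEST(1024, 5) */
  #define UCLAMP_BUCKET_DELTA \
          ((SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS / 2) / UCLAMP_BUCKETS)

  static unsigned int uclamp_bucket_id(unsigned int clamp_value)
  {
          return clamp_value / UCLAMP_BUCKET_DELTA;
  }

  int main(void)
  {
          /* A 25% boost (value 256) refcounts bucket 1, the [20..39]% range */
          printf("bucket(256)  = %u\n", uclamp_bucket_id(256));
          /* The maximum clamp value (1024) lands in the last bucket (id 4) */
          printf("bucket(1024) = %u\n", uclamp_bucket_id(1024));
          return 0;
  }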

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/log2.h | 34 +++++++++
include/linux/sched.h | 39 ++++++++++
include/linux/sched/topology.h | 6 --
init/Kconfig | 53 +++++++++++++
kernel/sched/core.c | 166 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 51 +++++++++++++
6 files changed, 343 insertions(+), 6 deletions(-)

diff --git a/include/linux/log2.h b/include/linux/log2.h
index 1aec01365ed4..83a4a3ca3e8a 100644
--- a/include/linux/log2.h
+++ b/include/linux/log2.h
@@ -220,4 +220,38 @@ int __order_base_2(unsigned long n)
ilog2((n) - 1) + 1) : \
__order_base_2(n) \
)
+
+static inline __attribute__((const))
+int __bits_per(unsigned long n)
+{
+ if (n < 2)
+ return 1;
+ if (is_power_of_2(n))
+ return order_base_2(n) + 1;
+ return order_base_2(n);
+}
+
+/**
+ * bits_per - calculate the number of bits required for the argument
+ * @n: parameter
+ *
+ * This is constant-capable and can be used for compile time
+ * initializations, e.g bitfields.
+ *
+ * The first few values calculated by this routine:
+ * bf(0) = 1
+ * bf(1) = 1
+ * bf(2) = 2
+ * bf(3) = 2
+ * bf(4) = 3
+ * ... and so on.
+ */
+#define bits_per(n) \
+( \
+ __builtin_constant_p(n) ? ( \
+ ((n) == 0 || (n) == 1) \
+ ? 1 : ilog2(n) + 1 \
+ ) : \
+ __bits_per(n) \
+)
#endif /* _LINUX_LOG2_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 044c023875e8..80235bcd05f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -283,6 +283,18 @@ struct vtime {
u64 gtime;
};

+/*
+ * Utilization clamp constraints.
+ * @UCLAMP_MIN: Minimum utilization
+ * @UCLAMP_MAX: Maximum utilization
+ * @UCLAMP_CNT: Utilization clamp constraints count
+ */
+enum uclamp_id {
+ UCLAMP_MIN = 0,
+ UCLAMP_MAX,
+ UCLAMP_CNT
+};
+
struct sched_info {
#ifdef CONFIG_SCHED_INFO
/* Cumulative counters: */
@@ -314,6 +326,10 @@ struct sched_info {
# define SCHED_FIXEDPOINT_SHIFT 10
# define SCHED_FIXEDPOINT_SCALE (1L << SCHED_FIXEDPOINT_SHIFT)

+/* Increase resolution of cpu_capacity calculations */
+# define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
+# define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -562,6 +578,25 @@ struct sched_dl_entity {
struct hrtimer inactive_timer;
};

+#ifdef CONFIG_UCLAMP_TASK
+/* Number of utilization clamp buckets (shorter alias) */
+#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
+
+/*
+ * Utilization clamp for a scheduling entity
+ * @value: clamp value "assigned" to a se
+ * @bucket_id: bucket index corresponding to the "assigned" value
+ *
+ * The bucket_id is the index of the clamp bucket matching the clamp value
+ * which is pre-computed and stored to avoid expensive integer divisions from
+ * the fast path.
+ */
+struct uclamp_se {
+ unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
union rcu_special {
struct {
u8 blocked;
@@ -642,6 +677,10 @@ struct task_struct {
#endif
struct sched_dl_entity dl;

+#ifdef CONFIG_UCLAMP_TASK
+ struct uclamp_se uclamp[UCLAMP_CNT];
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index e445d3767cdd..7863bb62d2ab 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -6,12 +6,6 @@

#include <linux/sched/idle.h>

-/*
- * Increase resolution of cpu_capacity calculations
- */
-#define SCHED_CAPACITY_SHIFT SCHED_FIXEDPOINT_SHIFT
-#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
-
/*
* sched-domains (multiprocessor balancing) declarations:
*/
diff --git a/init/Kconfig b/init/Kconfig
index 0e2344389501..c88289c18d59 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -677,6 +677,59 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GENERIC_SCHED_CLOCK
bool

+menu "Scheduler features"
+
+config UCLAMP_TASK
+ bool "Enable utilization clamping for RT/FAIR tasks"
+ depends on CPU_FREQ_GOV_SCHEDUTIL
+ help
+ This feature enables the scheduler to track the clamped utilization
+ of each CPU based on RUNNABLE tasks scheduled on that CPU.
+
+ With this option, the user can specify the min and max CPU
+ utilization allowed for RUNNABLE tasks. The max utilization defines
+ the maximum frequency a task should use while the min utilization
+ defines the minimum frequency it should use.
+
+ Both min and max utilization clamp values are hints to the scheduler,
+ aiming at improving its frequency selection policy, but they do not
+ enforce or grant any specific bandwidth for tasks.
+
+ If in doubt, say N.
+
+config UCLAMP_BUCKETS_COUNT
+ int "Number of supported utilization clamp buckets"
+ range 5 20
+ default 5
+ depends on UCLAMP_TASK
+ help
+ Defines the number of clamp buckets to use. The range of each bucket
+ will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
+ number of clamp buckets the finer their granularity and the higher
+ the precision of clamping aggregation and tracking at run-time.
+
+ For example, with the minimum configuration value we will have 5
+ clamp buckets tracking 20% utilization each. A 25% boosted task will
+ be refcounted in the [20..39]% bucket and will set the bucket clamp
+ effective value to 25%.
+ If a second 30% boosted task should be co-scheduled on the same CPU,
+ that task will be refcounted in the same bucket as the first task and
+ it will boost the bucket clamp effective value to 30%.
+ The clamp effective value of a bucket is reset to its nominal value
+ (20% in the example above) when there are no more tasks refcounted in
+ that bucket.
+
+ An additional boost/capping margin can be added to some tasks. In the
+ example above the 25% task will be boosted to 30% until it exits the
+ CPU. If that is considered unacceptable on certain systems,
+ it's always possible to reduce the margin by increasing the number of
+ clamp buckets to trade off used memory for run-time tracking
+ precision.
+
+ If in doubt, use the default value.
+
+endmenu
+
#
# For architectures that want to enable the support for NUMA-affine scheduler
# balancing logic:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e5e02d23e693..d8c1e67afd82 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -772,6 +772,168 @@ static void set_load_weight(struct task_struct *p, bool update_load)
}
}

+#ifdef CONFIG_UCLAMP_TASK
+
+/* Integer rounded range for each bucket */
+#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
+
+#define for_each_clamp_id(clamp_id) \
+ for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)
+
+static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
+{
+ return clamp_value / UCLAMP_BUCKET_DELTA;
+}
+
+static inline unsigned int uclamp_none(int clamp_id)
+{
+ if (clamp_id == UCLAMP_MIN)
+ return 0;
+ return SCHED_CAPACITY_SCALE;
+}
+
+static inline void uclamp_se_set(struct uclamp_se *uc_se, unsigned int value)
+{
+ uc_se->value = value;
+ uc_se->bucket_id = uclamp_bucket_id(value);
+}
+
+static inline
+unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
+{
+ struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
+ int bucket_id = UCLAMP_BUCKETS - 1;
+
+ /*
+ * Since both min and max clamps are max aggregated, find the
+ * top most bucket with tasks in.
+ */
+ for ( ; bucket_id >= 0; bucket_id--) {
+ if (!bucket[bucket_id].tasks)
+ continue;
+ return bucket[bucket_id].value;
+ }
+
+ /* No tasks -- default clamp values */
+ return uclamp_none(clamp_id);
+}
+
+/*
+ * When a task is enqueued on a rq, the clamp bucket currently defined by the
+ * task's uclamp::bucket_id is refcounted on that rq. This also immediately
+ * updates the rq's clamp value if required.
+ */
+static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+
+ lockdep_assert_held(&rq->lock);
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ bucket->tasks++;
+
+ if (uc_se->value > READ_ONCE(uc_rq->value))
+ WRITE_ONCE(uc_rq->value, bucket->value);
+}
+
+/*
+ * When a task is dequeued from a rq, the clamp bucket refcounted by the task
+ * is released. If this is the last task reference counting the rq's max
+ * active clamp value, then the rq's clamp value is updated.
+ *
+ * Both refcounted tasks and rq's cached clamp values are expected to be
+ * always valid. If it's detected they are not, as defensive programming,
+ * enforce the expected state and warn.
+ */
+static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
+ unsigned int clamp_id)
+{
+ struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
+ struct uclamp_se *uc_se = &p->uclamp[clamp_id];
+ struct uclamp_bucket *bucket;
+ unsigned int rq_clamp;
+
+ lockdep_assert_held(&rq->lock);
+
+ bucket = &uc_rq->bucket[uc_se->bucket_id];
+ SCHED_WARN_ON(!bucket->tasks);
+ if (likely(bucket->tasks))
+ bucket->tasks--;
+
+ if (likely(bucket->tasks))
+ return;
+
+ rq_clamp = READ_ONCE(uc_rq->value);
+ /*
+ * Defensive programming: this should never happen. If it happens,
+ * e.g. due to future modification, warn and fixup the expected value.
+ */
+ SCHED_WARN_ON(bucket->value > rq_clamp);
+ if (bucket->value >= rq_clamp)
+ WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
+}
+
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for_each_clamp_id(clamp_id)
+ uclamp_rq_inc_id(rq, p, clamp_id);
+}
+
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+{
+ unsigned int clamp_id;
+
+ if (unlikely(!p->sched_class->uclamp_enabled))
+ return;
+
+ for_each_clamp_id(clamp_id)
+ uclamp_rq_dec_id(rq, p, clamp_id);
+}
+
+static void __init init_uclamp(void)
+{
+ unsigned int clamp_id;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct uclamp_bucket *bucket;
+ struct uclamp_rq *uc_rq;
+ unsigned int bucket_id;
+
+ memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
+
+ for_each_clamp_id(clamp_id) {
+ uc_rq = &cpu_rq(cpu)->uclamp[clamp_id];
+
+ bucket_id = 1;
+ while (bucket_id < UCLAMP_BUCKETS) {
+ bucket = &uc_rq->bucket[bucket_id];
+ bucket->value = bucket_id * UCLAMP_BUCKET_DELTA;
+ ++bucket_id;
+ }
+ }
+ }
+
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&init_task.uclamp[clamp_id],
+ uclamp_none(clamp_id));
+ }
+}
+
+#else /* CONFIG_UCLAMP_TASK */
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
+static inline void init_uclamp(void) { }
+#endif /* CONFIG_UCLAMP_TASK */
+
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@@ -782,6 +944,7 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, flags & ENQUEUE_WAKEUP);
}

+ uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
}

@@ -795,6 +958,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}

+ uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}

@@ -6093,6 +6257,8 @@ void __init sched_init(void)

psi_init();

+ init_uclamp();
+
scheduler_running = 1;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e58ab597ec88..cecc6baaba93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -791,6 +791,48 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif
#endif /* CONFIG_SMP */

+#ifdef CONFIG_UCLAMP_TASK
+/*
+ * struct uclamp_bucket - Utilization clamp bucket
+ * @value: utilization clamp value for tasks on this clamp bucket
+ * @tasks: number of RUNNABLE tasks on this clamp bucket
+ *
+ * Keep track of how many tasks are RUNNABLE for a given utilization
+ * clamp value.
+ */
+struct uclamp_bucket {
+ unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
+ unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
+};
+
+/*
+ * struct uclamp_rq - rq's utilization clamp
+ * @value: currently active clamp values for a rq
+ * @bucket: utilization clamp buckets affecting a rq
+ *
+ * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
+ * A clamp value is affecting a rq when there is at least one task RUNNABLE
+ * (or actually running) with that value.
+ *
+ * There are up to UCLAMP_CNT possible different clamp values, currently there
+ * are only two: minimum utilization and maximum utilization.
+ *
+ * All utilization clamping values are MAX aggregated, since:
+ * - for util_min: we want to run the CPU at least at the max of the minimum
+ * utilization required by its currently RUNNABLE tasks.
+ * - for util_max: we want to allow the CPU to run up to the max of the
+ * maximum utilization allowed by its currently RUNNABLE tasks.
+ *
+ * Since on each system we expect only a limited number of different
+ * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
+ * the metrics required to compute all the per-rq utilization clamp values.
+ */
+struct uclamp_rq {
+ unsigned int value;
+ struct uclamp_bucket bucket[UCLAMP_BUCKETS];
+};
+#endif /* CONFIG_UCLAMP_TASK */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -825,6 +867,11 @@ struct rq {
unsigned long nr_load_updates;
u64 nr_switches;

+#ifdef CONFIG_UCLAMP_TASK
+ /* Utilization clamp values based on CPU's RUNNABLE tasks */
+ struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
+#endif
+
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
@@ -1639,6 +1686,10 @@ extern const u32 sched_prio_to_wmult[40];
struct sched_class {
const struct sched_class *next;

+#ifdef CONFIG_UCLAMP_TASK
+ int uclamp_enabled;
+#endif
+
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);

Subject: [tip:sched/core] sched/core: Allow sched_setattr() to use the current policy

Commit-ID: 1d6362fa0cfc8c7b243fa92924429d826599e691
Gitweb: https://git.kernel.org/tip/1d6362fa0cfc8c7b243fa92924429d826599e691
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 21 Jun 2019 09:42:06 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:23:46 +0200

sched/core: Allow sched_setattr() to use the current policy

The sched_setattr() syscall mandates that a policy is always specified.
This requires always knowing which policy a task will have when
attributes are configured, and this makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic task attributes is racy since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.
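
A minimal user-space sketch of the new flag (there is no glibc wrapper
for sched_setattr(2), so the raw syscall is used; the struct below is a
trimmed copy of the uapi layout and the nice value is just an example):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define SCHED_FLAG_KEEP_POLICY 0x08

  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;
          uint64_t sched_deadline;
          uint64_t sched_period;
  };

  int main(void)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          /* Update attributes without changing the current policy */
          attr.sched_flags = SCHED_FLAG_KEEP_POLICY;
          attr.sched_nice = 5;

          /* pid 0: operate on the calling task */
          return syscall(SYS_sched_setattr, 0, &attr, 0);
  }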

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/sched.h | 4 +++-
kernel/sched/core.c | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..58b2368d3634 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -51,9 +51,11 @@
#define SCHED_FLAG_RESET_ON_FORK 0x01
#define SCHED_FLAG_RECLAIM 0x02
#define SCHED_FLAG_DL_OVERRUN 0x04
+#define SCHED_FLAG_KEEP_POLICY 0x08

#define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK | \
SCHED_FLAG_RECLAIM | \
- SCHED_FLAG_DL_OVERRUN)
+ SCHED_FLAG_DL_OVERRUN | \
+ SCHED_FLAG_KEEP_POLICY)

#endif /* _UAPI_LINUX_SCHED_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b74de86b68c7..6d519f3f9789 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4897,6 +4897,8 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,

if ((int)attr.sched_policy < 0)
return -EINVAL;
+ if (attr.sched_flags & SCHED_FLAG_KEEP_POLICY)
+ attr.sched_policy = SETPARAM_POLICY;

rcu_read_lock();
retval = -ESRCH;

Subject: [tip:sched/core] sched/uclamp: Set default clamps for RT tasks

Commit-ID: 1a00d999971c78ab024a17b0efc37d78404dd120
Gitweb: https://git.kernel.org/tip/1a00d999971c78ab024a17b0efc37d78404dd120
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 21 Jun 2019 09:42:09 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:23:47 +0200

sched/uclamp: Set default clamps for RT tasks

By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand. This default behavior does not fit RT tasks, which instead are
expected to run at the maximum available frequency unless they are
explicitly capped.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

1. the task is switched to the RT class and it does not already have a
user-defined clamp value assigned.

2. an RT task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.
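
The resulting default rule can be sketched as a pure function (a
simplified, hypothetical helper for illustration; the actual hooks are
in the diff below):

  #define SCHED_CAPACITY_SCALE 1024
  enum { UCLAMP_MIN, UCLAMP_MAX };

  /* Default requested clamp for a task without a user-defined value */
  static inline unsigned int uclamp_default_value(int clamp_id, int is_rt)
  {
          /* By default, RT tasks always get a 100% boost... */
          if (is_rt && clamp_id == UCLAMP_MIN)
                  return SCHED_CAPACITY_SCALE;
          /* ...while FAIR tasks start unclamped: [0, SCHED_CAPACITY_SCALE] */
          return clamp_id == UCLAMP_MIN ? 0 : SCHED_CAPACITY_SCALE;
  }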

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ecc304ab906f..6cd5133f0c2a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1062,6 +1062,27 @@ static int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
+ unsigned int clamp_id;
+
+ /*
+ * On scheduling class change, reset to default clamps for tasks
+ * without a task-specific value.
+ */
+ for_each_clamp_id(clamp_id) {
+ struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ /* Keep using defined clamps across class changes */
+ if (uc_se->user_defined)
+ continue;
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uclamp_se_set(uc_se, clamp_value, false);
+ }
+
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

@@ -1087,8 +1108,13 @@ static void uclamp_fork(struct task_struct *p)
return;

for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ unsigned int clamp_value = uclamp_none(clamp_id);
+
+ /* By default, RT tasks always get 100% boost */
+ if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+ clamp_value = uclamp_none(UCLAMP_MAX);
+
+ uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false);
}
}

Subject: [tip:sched/core] sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks

Commit-ID: 982d9cdc22c9f6df5ad790caa229ff74fb1d95e7
Gitweb: https://git.kernel.org/tip/982d9cdc22c9f6df5ad790caa229ff74fb1d95e7
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 21 Jun 2019 09:42:10 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:23:48 +0200

sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks

Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should also consider userspace utilization
clamping hints. This makes it possible, for example, to:

- boost tasks which are directly affecting the user experience
by running them at least at a minimum "requested" frequency

- cap low priority tasks not directly affecting the user experience
by running them only up to a maximum "allowed" frequency

These constraints are meant to support per-task tuning of the frequency
selection, thus allowing a fine-grained definition of performance
boosting vs energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util), i.e. by resolving
clamp inversions in favor of the boost, to give boosted tasks the
performance they need even when they happen to be co-scheduled with
other capped tasks.
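
As a worked example (illustrative values, utilization on the usual
0..1024 capacity scale): consider a boosted task (util_min=512,
util_max=1024) co-scheduled with a capped task (util_min=0, util_max=200)
on a CPU whose raw utilization is 100. Max-aggregating the clamps of the
RUNNABLE tasks gives:

  min_util = max(512, 0)    = 512
  max_util = max(1024, 200) = 1024
  clamp(100, 512, 1024)     = 512

so the boosted task still obtains its requested minimum performance
level. Whenever aggregation produces min_util >= max_util, the inversion
is resolved in favor of min_util, i.e. the boost wins.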

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 15 ++++++++++++---
kernel/sched/fair.c | 4 ++++
kernel/sched/rt.c | 4 ++++
kernel/sched/sched.h | 23 +++++++++++++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7c4ce69067c4..d84e036a7536 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -202,8 +202,10 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);

- if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
+ if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
+ type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
+ }

/*
* Early check to see if IRQ/steal time saturates the CPU, can be
@@ -219,9 +221,16 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
* CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization.
+ *
+ * CFS and RT utilization can be boosted or capped, depending on
+ * utilization clamp constraints requested by currently RUNNABLE
+ * tasks.
+ * When there are no CFS RUNNABLE tasks, clamps are released and
+ * frequency will be gracefully reduced with the utilization decay.
*/
- util = util_cfs;
- util += cpu_util_rt(rq);
+ util = util_cfs + cpu_util_rt(rq);
+ if (type == FREQUENCY_UTIL)
+ util = uclamp_util(rq, util);

dl_util = cpu_util_dl(rq);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bdcd3c718bc..28db7ce5c3a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10393,6 +10393,10 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 63ad7c90822c..a532558a5176 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2400,6 +2400,10 @@ const struct sched_class rt_sched_class = {
.switched_to = switched_to_rt,

.update_curr = update_curr_rt,
+
+#ifdef CONFIG_UCLAMP_TASK
+ .uclamp_enabled = 1,
+#endif
};

#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0d2ba8bb2cb3..9b0c77a99346 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2265,6 +2265,29 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif /* CONFIG_CPU_FREQ */

+#ifdef CONFIG_UCLAMP_TASK
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
+ unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
+
+ /*
+ * Since CPU's {min,max}_util clamps are MAX aggregated considering
+ * RUNNABLE tasks with _different_ clamps, we can end up with an
+ * inversion. Fix it now when the clamps are applied.
+ */
+ if (unlikely(min_util >= max_util))
+ return min_util;
+
+ return clamp(util, min_util, max_util);
+}
+#else /* CONFIG_UCLAMP_TASK */
+static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
+{
+ return util;
+}
+#endif /* CONFIG_UCLAMP_TASK */
+
#ifdef arch_scale_freq_capacity
# ifndef arch_scale_freq_invariant
# define arch_scale_freq_invariant() true

Subject: [tip:sched/core] sched/uclamp: Add uclamp support to energy_compute()

Commit-ID: af24bde8df2029f067dc46aff0393c8f18ff6e2f
Gitweb: https://git.kernel.org/tip/af24bde8df2029f067dc46aff0393c8f18ff6e2f
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 21 Jun 2019 09:42:12 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 24 Jun 2019 19:23:49 +0200

sched/uclamp: Add uclamp support to energy_compute()

The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:

a) an (active) power consumption defined for each CPU frequency
b) an estimation of which frequency will be used on each CPU
c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c).

A CPU is expected to run:

- at a higher-than-required frequency, but for a shorter time, when its
estimated utilization is smaller than the minimum utilization
enforced by uclamp
- at a lower-than-required frequency, but for a longer time, when its
estimated utilization is bigger than the maximum utilization
enforced by uclamp
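
As an illustrative example (capacity scale 1024): a task with estimated
utilization 200 and util_min=512 drives the CPU to the OPP matching 512,
so its activations complete in a shorter busy time; conversely, a task
with estimated utilization 400 and util_max=200 keeps the CPU at the OPP
matching 200, stretching its busy time.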

While compute_energy() already accounts for clamping effects on busy time,
the clamping effects on frequency selection are currently ignored.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued on a CPU. Clamp values are applied to the
RT+CFS utilization only when FREQUENCY_UTIL is required by
compute_energy().
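
The uclamp_util_with() helper used below is introduced earlier in the
series; a sketch of its expected behavior, extending uclamp_util() from
the previous patch to also account for the clamps of a (possibly not yet
enqueued) task p, with uclamp_eff_value() resolving a task's effective
clamp as defined earlier in the series:

static inline
unsigned int uclamp_util_with(struct rq *rq, unsigned int util,
			      struct task_struct *p)
{
	unsigned int min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
	unsigned int max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);

	/* Account for the clamps p would contribute once RUNNABLE here */
	if (p) {
		min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
		max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
	}

	/* As in uclamp_util(): resolve clamp inversions in favor of boost */
	if (unlikely(min_util >= max_util))
		return min_util;

	return clamp(util, min_util, max_util);
}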

Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when an RT task is running on another CPU of
the same performance domain. This can have an impact on energy
estimation but:

- it's not easy to say which approach is better, since it depends on
the use case
- the original approach could still be obtained by setting a smaller
task-specific util_min whenever required

While at it:

- rename schedutil_freq_util() to schedutil_cpu_util(),
since it's not only used for frequency selection.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alessio Balsini <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 9 +++++----
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 21 +++++++++------------
3 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index d84e036a7536..636ca6f88c8e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -196,8 +196,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type)
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p)
{
unsigned long dl_util, util, irq;
struct rq *rq = cpu_rq(cpu);
@@ -230,7 +231,7 @@ unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
*/
util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL)
- util = uclamp_util(rq, util);
+ util = uclamp_util_with(rq, util, p);

dl_util = cpu_util_dl(rq);

@@ -290,7 +291,7 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
sg_cpu->max = max;
sg_cpu->bw_dl = cpu_bw_dl(rq);

- return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+ return schedutil_cpu_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL, NULL);
}

/**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28db7ce5c3a6..b798fe7ff7cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6231,11 +6231,21 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
static long
compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
{
- long util, max_util, sum_util, energy = 0;
+ unsigned int max_util, util_cfs, cpu_util, cpu_cap;
+ unsigned long sum_util, energy = 0;
+ struct task_struct *tsk;
int cpu;

for (; pd; pd = pd->next) {
+ struct cpumask *pd_mask = perf_domain_span(pd);
+
+ /*
+ * The energy model mandates all the CPUs of a performance
+ * domain have the same capacity.
+ */
+ cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
max_util = sum_util = 0;
+
/*
* The capacity state of CPUs of the current rd can be driven by
* CPUs of another rd if they belong to the same performance
@@ -6246,11 +6256,29 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* it will not appear in its pd list and will not be accounted
* by compute_energy().
*/
- for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
- util = cpu_util_next(cpu, p, dst_cpu);
- util = schedutil_energy_util(cpu, util);
- max_util = max(util, max_util);
- sum_util += util;
+ for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+ util_cfs = cpu_util_next(cpu, p, dst_cpu);
+
+ /*
+ * Busy time computation: utilization clamping is not
+ * required since the ratio (sum_util / cpu_capacity)
+ * is already enough to scale the EM reported power
+ * consumption at the (eventually clamped) cpu_capacity.
+ */
+ sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ ENERGY_UTIL, NULL);
+
+ /*
+ * Performance domain frequency: utilization clamping
+ * must be considered since it affects the selection
+ * of the performance domain frequency.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
+ */
+ tsk = cpu == dst_cpu ? p : NULL;
+ cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+ FREQUENCY_UTIL, tsk);
+ max_util = max(max_util, cpu_util);
}

energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1783f6b4c2e0..802b1f3405f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2322,7 +2322,6 @@ static inline unsigned long capacity_orig_of(int cpu)
}
#endif

-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
/**
* enum schedutil_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
@@ -2338,15 +2337,11 @@ enum schedutil_type {
ENERGY_UTIL,
};

-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
- unsigned long max, enum schedutil_type type);
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
- unsigned long max = arch_scale_cpu_capacity(cpu);
-
- return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p);

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
@@ -2375,11 +2370,13 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
return READ_ONCE(rq->avg_rt.util_avg);
}
#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
+static inline unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
+ unsigned long max, enum schedutil_type type,
+ struct task_struct *p)
{
- return cfs;
+ return 0;
}
-#endif
+#endif /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */

#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
static inline unsigned long cpu_util_irq(struct rq *rq)

2019-06-25 12:01:04

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH v10 12/16] sched/core: uclamp: Extend CPU's cgroup controller

On 24-Jun 10:52, Tejun Heo wrote:

> Hey, Patrick.

Hi,

> On Mon, Jun 24, 2019 at 06:29:06PM +0100, Patrick Bellasi wrote:
> > > I kinda wonder whether the term bandwidth is a bit confusing because
> > > it's also used for cpu.max/min. Would just calling it frequency be
> > > clearer?
> >
> > Maybe I should find a better way to express the concept above.
> >
> > I agree that bandwidth is already used by cpu.{max,min}; what I want
> > to call out is that clamps allow us to enrich that concept.
> >
> > By hinting the scheduler about the min/max required utilization we can
> > better define the amount of actual CPU cycles required/allowed.
> > That's a slightly more precise bandwidth control compared to relying
> > only on temporal runnable/period limits.
>
> I see. I wonder whether it's overloading the same term too subtly
> tho. It's great to document how they interact but it *might* be
> easier for readers if a different term is used even if the meaning is
> essentially the same. Anyways, it's a nitpick. Please feel free to
> ignore.

Got it, I will try to come up with a better description in v11 to avoid
confusion and better explain the "improvements" without polluting the
original concept too much.
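
To make the distinction concrete, a hypothetical configuration (values
are only illustrative; the cpu.uclamp attributes use the percentage
format of this series):

  tg1/cpu.max        : "20000 100000"  (at most 20ms CPU time every 100ms)
  tg1/cpu.uclamp.min : "10.00"         (performance protected up to 10%)
  tg1/cpu.uclamp.max : "40.00"         (performance capped at 40%)

The first knob bounds how long tasks can run in each period; the clamps
bound how fast the CPU runs while they do.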

> > > > + tg = css_tg(of_css(of));
> > > > + if (tg == &root_task_group) {
> > > > + ret = -EINVAL;
> > > > + goto out;
> > > > + }
> > >
> > > I don't think you need the above check.
> >
> > Don't we want to forbid attributes tuning from the root group?
>
> Yeah, that's enforced by the NOT_ON_ROOT flag, right?

Oh right, since we don't show them we can't write them :)

> > > So, uclamp.max limits the maximum freq% the subtree can get and
> > > uclamp.min limits the maximum freq% protection the subtree can get. Let's say
> > > uclamp.max is 50% and uclamp.min is 100%.
> >
> > That's not possible: in the current implementation we always enforce
> > the limit (uclamp.max) to be _not smaller_ than the protection
> > (uclamp.min).
> >
> > Indeed, in principle, it does not make sense to ask for a minimum
> > utilization (i.e. frequency boosting) which is higher than the
> > maximum allowed utilization (i.e. frequency capping).
>
> Yeah, I'm trying to explain actually it does.
>
> > > It means that protection is not limited but the actual freq% is
> > > limited up to 50%, which isn't necessarily invalid.
> > > For a simple example, a user might be saying
> > > that they want to get whatever protection they can get from their parent
> > > but want to limit the eventual freq at 50%, and it isn't too difficult to
> > > imagine cases where the two knobs are configured separately, especially
> > > when configuration is being managed hierarchically / automatically.
> >
> > That's not my understanding: in v10, by default, when we create a
> > subgroup we assign it uclamp.min=0%, meaning that we don't boost
> > frequencies.
> >
> > It seems instead that you are asking to set uclamp.min=100% by
> > default, so that the effective value will give us whatever the parent
> > allows. Is that correct?
>
> No, the defaults are fine. I'm trying to say that min/max
> configurations don't need to be coupled like this and there are valid
> use cases where the configured min is higher than max when
> configurations are nested and managed automatically.
>
> Limits always trump protection in effect of course but please don't
> limit what can be configured.

Got it, thanks!

Will fix it in v11.

> Thanks.
>
> --
> tejun

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi