2020-07-16 11:07:55

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v7 0/3] sched/uclamp: new sysctl for default RT boost value

Changes in v7:

* Rebase on top of tip/sched/core
* Hold task_rq_lock() instead of using RCU.
* Better document that changes to p->uclamp_* require task_rq_lock()
* Remove smp_{w,r}mb()
* Hold the tasklist_lock with smp_mb__after_spinlock()
* Add patch 3 which addresses a splat I've seen while testing.
static_branch_enable() in __setscheduler_uclamp() was causing it.
Move the call outside of the critical section to fix it.


*** v6 cover-letter ***

This series introduces a new sysctl_sched_uclamp_util_min_rt_default to control
at runtime the default boost value of RT tasks.

Full rationale is in patch 1 commit message.

v6 changed the approach taken in v5 [1] and earlier by moving away from the
lazy update that touched the fast path to a synchronous update that is
performed when the write to the procfs entry is done.

for_each_process_thread() is used to update all existing RT tasks now. And to
handle the race with a concurrent fork() we introduce sched_post_fork() in
_do_fork() to ensure a concurrently forked RT task gets the right update.

To ensure the race condition is handled correctly, I wrote this small (simple!)
test program:

https://github.com/qais-yousef/uclamp_test.git
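
A stripped-down sketch of the kind of check the test performs is below.
This is illustrative only, not the actual test: it assumes the procfs
knob name added in patch 1 and the uclamp-extended UAPI layout of
struct sched_attr, and it doesn't reproduce the timing race itself (the
real test forks RT tasks concurrently with the sysctl write):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed to match the uclamp-extended UAPI struct sched_attr layout */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

int main(void)
{
	unsigned int expected = 337;
	struct sched_attr attr = { .size = sizeof(attr) };
	struct sched_param sp = { .sched_priority = 1 };
	FILE *f;
	pid_t pid;

	/* Change the default RT boost value */
	f = fopen("/proc/sys/kernel/sched_util_clamp_min_rt_default", "w");
	if (!f)
		return 1;
	fprintf(f, "%u\n", expected);
	fclose(f);

	pid = fork();
	if (!pid) {
		/* Child: become an RT task, then just sit there */
		sched_setscheduler(0, SCHED_FIFO, &sp);
		pause();
		_exit(0);
	}

	sleep(1);	/* crude: wait for the child to switch class */

	/* Read back the child's requested uclamp.min */
	syscall(SYS_sched_getattr, pid, &attr, sizeof(attr), 0);
	if (attr.sched_util_min != expected)
		printf("pid %d has %u but default should be %u\n",
		       (int)pid, attr.sched_util_min, expected);

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return 0;
}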

And ran it on a 4-core x86 system and an 8-core big.LITTLE Juno-r2 system.

From the Juno-r2 run, 10 iterations per run:

Without sched_post_fork()

# ./run.sh
pid 3105 has 336 but default should be 337
pid 13162 has 336 but default should be 337
pid 23256 has 338 but default should be 339
All forked RT tasks had the correct uclamp.min
pid 10638 has 334 but default should be 335
All forked RT tasks had the correct uclamp.min
pid 30683 has 335 but default should be 336
pid 8247 has 336 but default should be 337
pid 18170 has 1024 but default should be 334
pid 28274 has 336 but default should be 337

With sched_post_fork()

# ./run.sh
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min
All forked RT tasks had the correct uclamp.min

Thanks

--
Qais Yousef

[1] https://lore.kernel.org/lkml/[email protected]/

CC: Jonathan Corbet <[email protected]>
CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Luis Chamberlain <[email protected]>
CC: Kees Cook <[email protected]>
CC: Iurii Zaikin <[email protected]>
CC: Quentin Perret <[email protected]>
CC: Valentin Schneider <[email protected]>
CC: Patrick Bellasi <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]


Qais Yousef (3):
sched/uclamp: Add a new sysctl to control RT default boost value
Documentation/sysctl: Document uclamp sysctl knobs
sched/uclamp: Fix a deadlock when enabling uclamp static key

Documentation/admin-guide/sysctl/kernel.rst | 54 +++++++
include/linux/sched.h | 10 +-
include/linux/sched/sysctl.h | 1 +
include/linux/sched/task.h | 1 +
kernel/fork.c | 1 +
kernel/sched/core.c | 149 ++++++++++++++++++--
kernel/sysctl.c | 7 +
7 files changed, 208 insertions(+), 15 deletions(-)

--
2.17.1


2020-07-16 11:08:03

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To allow system admins and device integrators to control the default
behavior globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.
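
As a purely hypothetical illustration (the procfs name comes from the
kernel/sysctl.c hunk below; the value 512 is an arbitrary example), the
init code would do the equivalent of:

#include <stdio.h>

int main(void)
{
	/* Cap the default RT boost for this device */
	FILE *f = fopen("/proc/sys/kernel/sched_util_clamp_min_rt_default", "w");

	if (!f)
		return 1;
	fprintf(f, "%d\n", 512);
	fclose(f);
	return 0;
}

An actual init script would typically just echo the value into that file
instead.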

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing the sched_post_fork() function, which ensures
the racy fork gets the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Luis Chamberlain <[email protected]>
CC: Kees Cook <[email protected]>
CC: Iurii Zaikin <[email protected]>
CC: Quentin Perret <[email protected]>
CC: Valentin Schneider <[email protected]>
CC: Patrick Bellasi <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
include/linux/sched.h | 10 ++-
include/linux/sched/sysctl.h | 1 +
include/linux/sched/task.h | 1 +
kernel/fork.c | 1 +
kernel/sched/core.c | 119 +++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 7 +++
6 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 683372943093..86d90bc40e0b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,9 +686,15 @@ struct task_struct {
struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
- /* Clamp values requested for a scheduling entity */
+ /*
+ * Clamp values requested for a scheduling entity.
+ * Must be updated with task_rq_lock() held.
+ */
struct uclamp_se uclamp_req[UCLAMP_CNT];
- /* Effective clamp values used for a scheduling entity */
+ /*
+ * Effective clamp values used for a scheduling entity.
+ * Must be updated with task_rq_lock() held.
+ */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 24be30a40814..3c31ba88aca5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -67,6 +67,7 @@ extern unsigned int sysctl_sched_dl_period_min;
#ifdef CONFIG_UCLAMP_TASK
extern unsigned int sysctl_sched_uclamp_util_min;
extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
#endif

#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236a..e7ddab095baf 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -55,6 +55,7 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);

extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
extern void sched_dead(struct task_struct *p);

void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..e75c2e41f3d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2304,6 +2304,7 @@ static __latent_entropy struct task_struct *copy_process(
write_unlock_irq(&tasklist_lock);

proc_fork_connector(p);
+ sched_post_fork(p);
cgroup_post_fork(p, args);
perf_event_fork(p);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 31c07e6c00b1..e1578c3ad40c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,6 +793,23 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
/* Max allowed maximum utilization */
unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;

+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
/* All clamps are required to be less or equal than these values */
static struct uclamp_se uclamp_default[UCLAMP_CNT];

@@ -895,6 +912,64 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
+{
+ unsigned int default_util_min;
+ struct uclamp_se *uc_se;
+
+ lockdep_assert_held(&p->pi_lock);
+
+ uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+ /* Only sync if user didn't override the default */
+ if (uc_se->user_defined)
+ return;
+
+ default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+ uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!rt_task(p))
+ return;
+
+ /* Protect updates to p->uclamp_* */
+ rq = task_rq_lock(p, &rf);
+ __uclamp_sync_util_min_rt_default_locked(p);
+ task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+ struct task_struct *g, *p;
+
+ /*
+ * copy_process() sysctl_uclamp
+ * uclamp_min_rt = X;
+ * write_lock(&tasklist_lock) read_lock(&tasklist_lock)
+ * // link thread smp_mb__after_spinlock()
+ * write_unlock(&tasklist_lock) read_unlock(&tasklist_lock);
+ * sched_post_fork() for_each_process_thread()
+ * __uclamp_sync_rt() __uclamp_sync_rt()
+ *
+ * Ensures that either sched_post_fork() will observe the new
+ * uclamp_min_rt or for_each_process_thread() will observe the new
+ * task.
+ */
+ read_lock(&tasklist_lock);
+ smp_mb__after_spinlock();
+ read_unlock(&tasklist_lock);
+
+ rcu_read_lock();
+ for_each_process_thread(g, p)
+ __uclamp_sync_util_min_rt_default(p);
+ rcu_read_unlock();
+}
+
static inline struct uclamp_se
uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
@@ -1193,12 +1268,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
bool update_root_tg = false;
- int old_min, old_max;
+ int old_min, old_max, old_min_rt;
int result;

mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;
+ old_min_rt = sysctl_sched_uclamp_util_min_rt_default;

result = proc_dointvec(table, write, buffer, lenp, ppos);
if (result)
@@ -1207,7 +1283,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
goto done;

if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
- sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE ||
+ sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
result = -EINVAL;
goto undo;
}
@@ -1228,6 +1306,11 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
uclamp_update_root_tg();
}

+ if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+ static_branch_enable(&sched_uclamp_used);
+ uclamp_sync_util_min_rt_default();
+ }
+
/*
* We update all RUNNABLE tasks only when task groups are in use.
* Otherwise, keep it simple and do just a lazy update at each next
@@ -1239,6 +1322,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
undo:
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
+ sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
done:
mutex_unlock(&uclamp_mutex);

@@ -1275,17 +1359,20 @@ static void __setscheduler_uclamp(struct task_struct *p,
*/
for_each_clamp_id(clamp_id) {
struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
- unsigned int clamp_value = uclamp_none(clamp_id);

/* Keep using defined clamps across class changes */
if (uc_se->user_defined)
continue;

- /* By default, RT tasks always get 100% boost */
+ /*
+ * RT tasks by default have a 100% boost value that could be modified
+ * at runtime.
+ */
if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
- clamp_value = uclamp_none(UCLAMP_MAX);
+ __uclamp_sync_util_min_rt_default_locked(p);
+ else
+ uclamp_se_set(uc_se, uclamp_none(clamp_id), false);

- uclamp_se_set(uc_se, clamp_value, false);
}

if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1308,6 +1395,10 @@ static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;

+ /*
+ * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+ * as the task is still at its early fork stages.
+ */
for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;

@@ -1320,6 +1411,11 @@ static void uclamp_fork(struct task_struct *p)
}
}

+static void uclamp_post_fork(struct task_struct *p)
+{
+ __uclamp_sync_util_min_rt_default(p);
+}
+
static void __init init_uclamp_rq(struct rq *rq)
{
enum uclamp_id clamp_id;
@@ -1372,6 +1468,7 @@ static inline int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

@@ -3074,6 +3171,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
return 0;
}

+void sched_post_fork(struct task_struct *p)
+{
+ uclamp_post_fork(p);
+}
+
unsigned long to_ratio(u64 period, u64 runtime)
{
if (runtime == RUNTIME_INF)
@@ -5597,6 +5699,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
kattr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
+ /*
+ * This could race with another potential updater, but this is fine
+ * because it'll correctly read the old or the new value. We don't need
+ * to guarantee who wins the race as long as it doesn't return garbage.
+ */
kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aea67d3d552..1b4d2dc270a5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1815,6 +1815,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sysctl_sched_uclamp_handler,
},
+ {
+ .procname = "sched_util_clamp_min_rt_default",
+ .data = &sysctl_sched_uclamp_util_min_rt_default,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{
--
2.17.1

2020-07-16 11:10:12

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

The following splats were caught when setting the uclamp value of a task:
a sleeping-function BUG, followed by a circular locking dependency warning.

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 731, name: l_3-1
INFO: lockdep is turned off.
irq event stamp: 150
 el0_svc_common.constprop.4+0xe4/0x200
 _raw_spin_lock_irqsave+0x38/0xa8
 copy_process+0x620/0x18f0
 0x0
CPU: 5 PID: 731 Comm: l_3-1 Not tainted 5.8.0-rc4-00040-g6345b3305877-dirty #864
Hardware name: ARM Juno development board (r2) (DT)
Call trace:
 dump_backtrace+0x0/0x1f0
 show_stack+0x2c/0x38
 dump_stack+0xf0/0x170
 ___might_sleep+0x144/0x200
 __might_sleep+0x54/0x88
 cpus_read_lock+0x2c/0x130
 static_key_enable+0x1c/0x38
 __sched_setscheduler+0x900/0xad8
 __arm64_sys_sched_setattr+0x2e0/0x4f8
 el0_svc_common.constprop.4+0x84/0x200
 do_el0_svc+0x34/0xa0
 el0_sync_handler+0x16c/0x340
 el0_sync+0x140/0x180

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc4-00040-g6345b3305877-dirty #864 Not tainted
------------------------------------------------------
l_0-0/730 is trying to acquire lock:
ffff80001343dea0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x1c/0x38

but task is already holding lock:
ffff00097ef4ca58 (&rq->lock){-.-.}-{2:2}, at: task_rq_lock+0x60/0xf0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&rq->lock){-.-.}-{2:2}:
 _raw_spin_lock+0x64/0x80
 __schedule+0x108/0x910
 schedule+0x7c/0x108
 schedule_timeout+0x2b0/0x448
 wait_for_completion_killable+0xb8/0x1a8
 __kthread_create_on_node+0xe0/0x1c0
 kthread_create_on_node+0x8c/0xb8
 create_worker+0xd0/0x1b8
 workqueue_prepare_cpu+0x5c/0xa0
 cpuhp_invoke_callback+0xe8/0xe30
 _cpu_up+0xf4/0x1c0
 cpu_up+0xa0/0xc0
 bringup_nonboot_cpus+0x88/0xc0
 smp_init+0x34/0x90
 kernel_init_freeable+0x1b8/0x338
 kernel_init+0x18/0x118
 ret_from_fork+0x10/0x18

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
 __lock_acquire+0x11a0/0x1550
 lock_acquire+0xf8/0x460
 cpus_read_lock+0x68/0x130
 static_key_enable+0x1c/0x38
 __sched_setscheduler+0x900/0xad8
 __arm64_sys_sched_setattr+0x2e0/0x4f8
 el0_svc_common.constprop.4+0x84/0x200
 do_el0_svc+0x34/0xa0
 el0_sync_handler+0x16c/0x340
 el0_sync+0x140/0x180

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(cpu_hotplug_lock);
                               lock(&rq->lock);
  lock(cpu_hotplug_lock);

 *** DEADLOCK ***

3 locks held by l_0-0/730:
 #0: ffff80001345b4d0 (&cpuset_rwsem){++++}-{0:0}, at: __sched_setscheduler+0x4c0/0xad8
 #1: ffff00096e83b718 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x44/0xf0
 #2: ffff00097ef4ca58 (&rq->lock){-.-.}-{2:2}, at: task_rq_lock+0x60/0xf0

stack backtrace:
CPU: 1 PID: 730 Comm: l_0-0 Tainted: G W 5.8.0-rc4-00040-g6345b3305877-dirty #864
Hardware name: ARM Juno development board (r2) (DT)
Call trace:
 dump_backtrace+0x0/0x1f0
 show_stack+0x2c/0x38
 dump_stack+0xf0/0x170
 print_circular_bug.isra.40+0x228/0x280
 check_noncircular+0x14c/0x1b0
 __lock_acquire+0x11a0/0x1550
 lock_acquire+0xf8/0x460
 cpus_read_lock+0x68/0x130
 static_key_enable+0x1c/0x38
 __sched_setscheduler+0x900/0xad8
 __arm64_sys_sched_setattr+0x2e0/0x4f8
 el0_svc_common.constprop.4+0x84/0x200
 do_el0_svc+0x34/0xa0
 el0_sync_handler+0x16c/0x340
 el0_sync+0x140/0x180

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
CC: Patrick Bellasi <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: [email protected]
---
kernel/sched/core.c | 30 +++++++++++++++++++++++-------
1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1578c3ad40c..947a1f4fa112 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1348,7 +1348,7 @@ static int uclamp_validate(struct task_struct *p,
return 0;
}

-static void __setscheduler_uclamp(struct task_struct *p,
+static bool __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr)
{
enum uclamp_id clamp_id;
@@ -1376,9 +1376,7 @@ static void __setscheduler_uclamp(struct task_struct *p,
}

if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
- return;
-
- static_branch_enable(&sched_uclamp_used);
+ return false;

if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
@@ -1389,6 +1387,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ return true;
}

static void uclamp_fork(struct task_struct *p)
@@ -1465,8 +1465,11 @@ static inline int uclamp_validate(struct task_struct *p,
{
return -EOPNOTSUPP;
}
-static void __setscheduler_uclamp(struct task_struct *p,
- const struct sched_attr *attr) { }
+static bool __setscheduler_uclamp(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ return false;
+}
static inline void uclamp_fork(struct task_struct *p) { }
static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
@@ -5305,7 +5308,8 @@ static int __sched_setscheduler(struct task_struct *p,
prev_class = p->sched_class;

__setscheduler(rq, p, attr, pi);
- __setscheduler_uclamp(p, attr);
+
+ retval = __setscheduler_uclamp(p, attr);

if (queued) {
/*
@@ -5335,6 +5339,18 @@ static int __sched_setscheduler(struct task_struct *p,
balance_callback(rq);
preempt_enable();

+#ifdef CONFIG_UCLAMP_TASK
+ /*
+ * Enable uclamp static key outside the critical section.
+ * static_branch_enable() will hold cpu_hotplug_lock; if done from
+ * critical section above which holds other locks (rq->lock namely),
+ * it could lead to deadlock scenarios as both are popular locks and
+ * could be acquired from different paths in different orders.
+ */
+ if (retval)
+ static_branch_enable(&sched_uclamp_used);
+#endif
+
return 0;

unlock:
--
2.17.1

2020-07-16 11:10:24

by Qais Yousef

[permalink] [raw]
Subject: [PATCH v7 2/3] Documentation/sysctl: Document uclamp sysctl knobs

Uclamp exposes 3 sysctl knobs:

* sched_util_clamp_min
* sched_util_clamp_max
* sched_util_clamp_min_rt_default

Document them in sysctl/kernel.rst.

Signed-off-by: Qais Yousef <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Luis Chamberlain <[email protected]>
CC: Kees Cook <[email protected]>
CC: Iurii Zaikin <[email protected]>
CC: Quentin Perret <[email protected]>
CC: Valentin Schneider <[email protected]>
CC: Patrick Bellasi <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
Documentation/admin-guide/sysctl/kernel.rst | 54 +++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 83acf5025488..55bf6b4de4ec 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1062,6 +1062,60 @@ Enables/disables scheduler statistics. Enabling this feature
incurs a small amount of overhead in the scheduler but is
useful for debugging and performance tuning.

+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default:
+================================
+
+By default Linux is tuned for performance, which means that RT tasks always run
+at the highest frequency and on the most capable (highest capacity) CPU (in
+heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob is only effective for RT tasks whose requested uclamp.min value the
+user hasn't modified via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+
+For example, if
+
+ sched_util_clamp_min_rt_default = 800
+ sched_util_clamp_min = 600
+
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode
+temporarily restricts all boosts by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.

seccomp
=======
--
2.17.1

2020-07-16 11:14:05

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On 07/16/20 12:03, Qais Yousef wrote:

[...]

> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
>
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <[email protected]>

I am assuming this Fixes tag is still valid given the patch is only in
tip/sched/core, and that the hash value won't change after Linus merges
it.

Thanks

--
Qais Yousef

2020-07-23 15:52:19

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On 07/16/20 12:13, Qais Yousef wrote:
> On 07/16/20 12:03, Qais Yousef wrote:
>
> [...]
>
> > Fix by ensuring we enable the key outside of the critical section in
> > __sched_setscheduler()
> >
> > Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> > Signed-off-by: Qais Yousef <[email protected]>
>
> I am assuming this Fixes tag is still valid given the patch is only in
> tip/sched/core, and that the hash value won't change after Linus merges
> it.

Gentle ping. This patch is actually a fix for an issue I happened to uncover
while testing this series.

Happy to send it separately. It should apply on its own too, as it has no
dependency on the previous patches.

Thanks

--
Qais Yousef

2020-07-24 08:56:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value

On Thu, Jul 16, 2020 at 12:03:45PM +0100, Qais Yousef wrote:

Would you mind terribly if I rename things like so?

I tried and failed to come up with a shorter name in general, these
function names are somewhat unwieldy. I considered s/_default//.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -914,7 +914,7 @@ unsigned int uclamp_rq_max_value(struct
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

-static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
struct uclamp_se *uc_se;
@@ -931,7 +931,7 @@ static void __uclamp_sync_util_min_rt_de
uclamp_se_set(uc_se, default_util_min, false);
}

-static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
{
struct rq_flags rf;
struct rq *rq;
@@ -941,7 +941,7 @@ static void __uclamp_sync_util_min_rt_de

/* Protect updates to p->uclamp_* */
rq = task_rq_lock(p, &rf);
- __uclamp_sync_util_min_rt_default_locked(p);
+ __uclamp_update_util_min_rt_default(p);
task_rq_unlock(rq, p, &rf);
}

@@ -968,7 +968,7 @@ static void uclamp_sync_util_min_rt_defa

rcu_read_lock();
for_each_process_thread(g, p)
- __uclamp_sync_util_min_rt_default(p);
+ uclamp_update_util_min_rt_default(p);
rcu_read_unlock();
}

@@ -1360,7 +1360,7 @@ static void __setscheduler_uclamp(struct
* at runtime.
*/
if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
- __uclamp_sync_util_min_rt_default_locked(p);
+ __uclamp_update_util_min_rt_default(p);
else
uclamp_se_set(uc_se, uclamp_none(clamp_id), false);

@@ -1404,7 +1404,7 @@ static void uclamp_fork(struct task_stru

static void uclamp_post_fork(struct task_struct *p)
{
- __uclamp_sync_util_min_rt_default(p);
+ uclamp_update_util_min_rt_default(p);
}

static void __init init_uclamp_rq(struct rq *rq)

2020-07-24 09:13:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:

I've trimmed the Changelog to read like:

---
Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
From: Qais Yousef <[email protected]>
Date: Thu, 16 Jul 2020 12:03:47 +0100

From: Qais Yousef <[email protected]>

The following splat was caught when setting uclamp value of a task:

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

cpus_read_lock+0x68/0x130
static_key_enable+0x1c/0x38
__sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---

And placed this patch first in the series

That said; don't we have a slight problem with enabling the key late
like this? It means the uclamp will not actually take effect immediately
and we'll have to wait for the next context switch ... whenever that
might be.

Should we not have enabled the key early, instead of late?

something like so perhaps?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a53..c6499b2696f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1282,8 +1282,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

- static_branch_enable(&sched_uclamp_used);
-
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
attr->sched_util_min, true);
@@ -5074,6 +5072,7 @@ static int __sched_setscheduler(struct task_struct *p,
retval = uclamp_validate(p, attr);
if (retval)
return retval;
+ static_branch_enable(&sched_uclamp_used);
}

if (pi)

2020-07-24 09:17:42

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] sched/uclamp: Add a new sysctl to control RT default boost value

On 07/24/20 10:54, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 12:03:45PM +0100, Qais Yousef wrote:
>
> Would you mind terribly if I rename things like so?

Nope, the new name is good for me too.

>
> I tried and failed to come up with a shorter name in general, these
> function names are somewhat unwieldy. I considered s/_default//.

Can do. I'm thinking that maybe we need to sprinkle more comments then. But if
I feel the need for more comments, I can always post another patch on top :-)

If you'd like a shorter name, a slightly shorter one would be

update_uclamp_min_rt()

Thanks

--
Qais Yousef

>
> ---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -914,7 +914,7 @@ unsigned int uclamp_rq_max_value(struct
> return uclamp_idle_value(rq, clamp_id, clamp_value);
> }
>
> -static void __uclamp_sync_util_min_rt_default_locked(struct task_struct *p)
> +static void __uclamp_update_util_min_rt_default(struct task_struct *p)
> {
> unsigned int default_util_min;
> struct uclamp_se *uc_se;
> @@ -931,7 +931,7 @@ static void __uclamp_sync_util_min_rt_de
> uclamp_se_set(uc_se, default_util_min, false);
> }
>
> -static void __uclamp_sync_util_min_rt_default(struct task_struct *p)
> +static void uclamp_update_util_min_rt_default(struct task_struct *p)
> {
> struct rq_flags rf;
> struct rq *rq;
> @@ -941,7 +941,7 @@ static void __uclamp_sync_util_min_rt_de
>
> /* Protect updates to p->uclamp_* */
> rq = task_rq_lock(p, &rf);
> - __uclamp_sync_util_min_rt_default_locked(p);
> + __uclamp_update_util_min_rt_default(p);
> task_rq_unlock(rq, p, &rf);
> }
>
> @@ -968,7 +968,7 @@ static void uclamp_sync_util_min_rt_defa
>
> rcu_read_lock();
> for_each_process_thread(g, p)
> - __uclamp_sync_util_min_rt_default(p);
> + uclamp_update_util_min_rt_default(p);
> rcu_read_unlock();
> }
>
> @@ -1360,7 +1360,7 @@ static void __setscheduler_uclamp(struct
> * at runtime.
> */
> if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> - __uclamp_sync_util_min_rt_default_locked(p);
> + __uclamp_update_util_min_rt_default(p);
> else
> uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
>
> @@ -1404,7 +1404,7 @@ static void uclamp_fork(struct task_stru
>
> static void uclamp_post_fork(struct task_struct *p)
> {
> - __uclamp_sync_util_min_rt_default(p);
> + uclamp_update_util_min_rt_default(p);
> }
>
> static void __init init_uclamp_rq(struct rq *rq)

2020-07-24 09:47:29

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On 07/24/20 11:12, Peter Zijlstra wrote:
> On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
>
> I've trimmed the Changelog to read like:

+1

Should we mention the ordering issue too? Or maybe I misinterpreted the
'Possible unsafe locking scenario' part?

>
> ---
> Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
> From: Qais Yousef <[email protected]>
> Date: Thu, 16 Jul 2020 12:03:47 +0100
>
> From: Qais Yousef <[email protected]>
>
> The following splat was caught when setting uclamp value of a task:
>
> BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
>
> cpus_read_lock+0x68/0x130
> static_key_enable+0x1c/0x38
> __sched_setscheduler+0x900/0xad8
>
> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
>
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> ---
>
> And placed this patch first in the series
>
> That said; don't we have a slight problem with enabling the key late
> like this? It means the uclamp will not actually take effect immediately
> and we'll have to wait for the next context switch ... whenever that
> might be.

The enabling is inherently racy. Your suggestion below is better though.
I should have scrolled more screens up!

>
> Should we not have enabled the key early, instead of late?
>
> something like so perhaps?

This should work, but you'll need to sprinkle ifdef around the key. Or move it
to uclamp_validate()

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1578c3ad40c..cd3b7a25ac59 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1345,6 +1345,21 @@ static int uclamp_validate(struct task_struct *p,
if (upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;

+ /*
+ * Enable uclamp static key outside the critical section of
+ * __sched_setscheduler().
+ *
+ * static_branch_enable() will hold cpu_hotplug_lock; if done from
+ * critical section, __setscheduler_uclamp(), which holds other locks
+ * (rq->lock namely), it could lead to deadlock scenarios as both are
+ * popular locks and could be acquired from different paths in
+ * different orders.
+ *
+ * Besides cpu_hotplug_lock could sleep, which is not allowed inside
+ * the critical section of __sched_setscheduler().
+ */
+ static_branch_enable(&sched_uclamp_used);
+
return 0;
}

@@ -1378,8 +1393,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

- static_branch_enable(&sched_uclamp_used);
-
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
attr->sched_util_min, true);

2020-07-24 10:42:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On Fri, Jul 24, 2020 at 10:46:50AM +0100, Qais Yousef wrote:
> On 07/24/20 11:12, Peter Zijlstra wrote:
> > On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
> >
> > I've trimmed the Changelog to read like:
>
> +1
>
> Should we mention the ordering issue too? Or maybe I misinterpreted the
> 'Possible unsafe locking scenario' part?

The lock inversion was, imo, secondary. It only existed because of the
impossible lock ordering -- taking a blocking lock inside an atomic
lock. Fixing the first, avoids the second etc.. So I left it out.

> This should work, but you'll need to sprinkle ifdef around the key. Or move it
> to uclamp_validate()

Indeed, the patch now reads like:

---
Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
From: Qais Yousef <[email protected]>
Date: Thu, 16 Jul 2020 12:03:47 +0100

From: Qais Yousef <[email protected]>

The following splat was caught when setting uclamp value of a task:

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

cpus_read_lock+0x68/0x130
static_key_enable+0x1c/0x38
__sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1252,6 +1252,15 @@ static int uclamp_validate(struct task_s
if (upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;

+ /*
+ * We have valid uclamp attributes; make sure uclamp is enabled.
+ *
+ * We need to do that here, because enabling static branches is a
+ * blocking operation which obviously cannot be done while holding
+ * scheduler locks.
+ */
+ static_branch_enable(&sched_uclamp_used);
+
return 0;
}

@@ -1282,8 +1291,6 @@ static void __setscheduler_uclamp(struct
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

- static_branch_enable(&sched_uclamp_used);
-
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
attr->sched_util_min, true);

2020-07-24 10:52:29

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v7 3/3] sched/uclamp: Fix a deadlock when enabling uclamp static key

On 07/24/20 12:41, Peter Zijlstra wrote:
> On Fri, Jul 24, 2020 at 10:46:50AM +0100, Qais Yousef wrote:
> > On 07/24/20 11:12, Peter Zijlstra wrote:
> > > On Thu, Jul 16, 2020 at 12:03:47PM +0100, Qais Yousef wrote:
> > >
> > > I've trimmed the Changelog to read like:
> >
> > +1
> >
> > Should we mention the ordering issue too? Or maybe I misinterpreted the
> > 'Possible unsafe locking scenario' part?
>
> The lock inversion was, imo, secondary. It only existed because of the
> impossible lock ordering -- taking a blocking lock inside an atomic
> lock. Fixing the first, avoids the second etc.. So I left it out.
>
> > This should work, but you'll need to sprinkle ifdef around the key. Or move it
> > to uclamp_validate()
>
> Indeed, the patch now reads like:

Maybe s/deadlock/splat/ in the subject now? I clearly focused on the secondary
thing..

Sorry you had to modify the patch that much yourself.

Thanks!

--
Qais Yousef

>
> ---
> Subject: sched/uclamp: Fix a deadlock when enabling uclamp static key
> From: Qais Yousef <[email protected]>
> Date: Thu, 16 Jul 2020 12:03:47 +0100
>
> From: Qais Yousef <[email protected]>
>
> The following splat was caught when setting uclamp value of a task:
>
> BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
>
> cpus_read_lock+0x68/0x130
> static_key_enable+0x1c/0x38
> __sched_setscheduler+0x900/0xad8
>
> Fix by ensuring we enable the key outside of the critical section in
> __sched_setscheduler()
>
> Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
> Signed-off-by: Qais Yousef <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> ---
> kernel/sched/core.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1252,6 +1252,15 @@ static int uclamp_validate(struct task_s
> if (upper_bound > SCHED_CAPACITY_SCALE)
> return -EINVAL;
>
> + /*
> + * We have valid uclamp attributes; make sure uclamp is enabled.
> + *
> + * We need to do that here, because enabling static branches is a
> + * blocking operation which obviously cannot be done while holding
> + * scheduler locks.
> + */
> + static_branch_enable(&sched_uclamp_used);
> +
> return 0;
> }
>
> @@ -1282,8 +1291,6 @@ static void __setscheduler_uclamp(struct
> if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> return;
>
> - static_branch_enable(&sched_uclamp_used);
> -
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> attr->sched_util_min, true);

Subject: [tip: sched/core] Documentation/sysctl: Document uclamp sysctl knobs

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 1f73d1abe5836bd8ffe747ff5cb7561b17ce5bc6
Gitweb: https://git.kernel.org/tip/1f73d1abe5836bd8ffe747ff5cb7561b17ce5bc6
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 16 Jul 2020 12:03:46 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 29 Jul 2020 13:51:48 +02:00

Documentation/sysctl: Document uclamp sysctl knobs

Uclamp exposes 3 sysctl knobs:

* sched_util_clamp_min
* sched_util_clamp_max
* sched_util_clamp_min_rt_default

Document them in sysctl/kernel.rst.

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
Documentation/admin-guide/sysctl/kernel.rst | 54 ++++++++++++++++++++-
1 file changed, 54 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 83acf50..55bf6b4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1062,6 +1062,60 @@ Enables/disables scheduler statistics. Enabling this feature
incurs a small amount of overhead in the scheduler but is
useful for debugging and performance tuning.

+sched_util_clamp_min:
+=====================
+
+Max allowed *minimum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.min value cannot be greater than
+sched_util_clamp_min, i.e., it is restricted to the range
+[0:sched_util_clamp_min].
+
+sched_util_clamp_max:
+=====================
+
+Max allowed *maximum* utilization.
+
+Default value is 1024, which is the maximum possible value.
+
+It means that any requested uclamp.max value cannot be greater than
+sched_util_clamp_max, i.e., it is restricted to the range
+[0:sched_util_clamp_max].
+
+sched_util_clamp_min_rt_default:
+================================
+
+By default Linux is tuned for performance, which means that RT tasks always run
+at the highest frequency and on the most capable (highest capacity) CPU (in
+heterogeneous systems).
+
+Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
+1024 by default, which effectively boosts the tasks to run at the highest
+frequency and biases them to run on the biggest CPU.
+
+This knob allows admins to change the default behavior when uclamp is being
+used. In battery powered devices particularly, running at the maximum
+capacity and frequency will increase energy consumption and shorten the battery
+life.
+
+This knob is only effective for RT tasks whose requested uclamp.min value the
+user hasn't modified via the sched_setattr() syscall.
+
+This knob will not escape the range constraint imposed by sched_util_clamp_min
+defined above.
+
+For example, if
+
+ sched_util_clamp_min_rt_default = 800
+ sched_util_clamp_min = 600
+
+Then the boost will be clamped to 600 because 800 is outside of the permissible
+range of [0:600]. This could happen for instance if a powersave mode
+temporarily restricts all boosts by modifying sched_util_clamp_min. As soon as
+this restriction is lifted, the requested sched_util_clamp_min_rt_default
+will take effect.

seccomp
=======

Subject: [tip: sched/core] sched/uclamp: Fix a deadlock when enabling uclamp static key

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e65855a52b479f98674998cb23b21ef5a8144b04
Gitweb: https://git.kernel.org/tip/e65855a52b479f98674998cb23b21ef5a8144b04
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 16 Jul 2020 12:03:47 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 29 Jul 2020 13:51:47 +02:00

sched/uclamp: Fix a deadlock when enabling uclamp static key

The following splat was caught when setting uclamp value of a task:

BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

cpus_read_lock+0x68/0x130
static_key_enable+0x1c/0x38
__sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6782534..e44d83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1346,6 +1346,15 @@ static int uclamp_validate(struct task_struct *p,
if (upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;

+ /*
+ * We have valid uclamp attributes; make sure uclamp is enabled.
+ *
+ * We need to do that here, because enabling static branches is a
+ * blocking operation which obviously cannot be done while holding
+ * scheduler locks.
+ */
+ static_branch_enable(&sched_uclamp_used);
+
return 0;
}

@@ -1376,8 +1385,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
return;

- static_branch_enable(&sched_uclamp_used);
-
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
attr->sched_util_min, true);

Subject: [tip: sched/core] sched/uclamp: Add a new sysctl to control RT default boost value

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 13685c4a08fca9dd76bf53bfcbadc044ab2a08cb
Gitweb: https://git.kernel.org/tip/13685c4a08fca9dd76bf53bfcbadc044ab2a08cb
Author: Qais Yousef <[email protected]>
AuthorDate: Thu, 16 Jul 2020 12:03:45 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 29 Jul 2020 13:51:47 +02:00

sched/uclamp: Add a new sysctl to control RT default boost value

RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, for a mobile device manufacturer where the big.LITTLE
architecture is dominant, the performance of the little cores varies across
SoCs, and on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufacturers to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which are created by the Android framework.

To allow system admins and device integrators to control the default
behavior globally, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing the sched_post_fork() function, which ensures
the racy fork gets the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 10 ++-
include/linux/sched/sysctl.h | 1 +-
include/linux/sched/task.h | 1 +-
kernel/fork.c | 1 +-
kernel/sched/core.c | 119 ++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 7 ++-
6 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index adf0125..a6bf77c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -686,9 +686,15 @@ struct task_struct {
struct sched_dl_entity dl;

#ifdef CONFIG_UCLAMP_TASK
- /* Clamp values requested for a scheduling entity */
+ /*
+ * Clamp values requested for a scheduling entity.
+ * Must be updated with task_rq_lock() held.
+ */
struct uclamp_se uclamp_req[UCLAMP_CNT];
- /* Effective clamp values used for a scheduling entity */
+ /*
+ * Effective clamp values used for a scheduling entity.
+ * Must be updated with task_rq_lock() held.
+ */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 24be30a..3c31ba8 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -67,6 +67,7 @@ extern unsigned int sysctl_sched_dl_period_min;
#ifdef CONFIG_UCLAMP_TASK
extern unsigned int sysctl_sched_uclamp_util_min;
extern unsigned int sysctl_sched_uclamp_util_max;
+extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
#endif

#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 3835907..e7ddab0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -55,6 +55,7 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);

extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
+extern void sched_post_fork(struct task_struct *p);
extern void sched_dead(struct task_struct *p);

void __noreturn do_task_dead(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493..e75c2e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2304,6 +2304,7 @@ static __latent_entropy struct task_struct *copy_process(
write_unlock_irq(&tasklist_lock);

proc_fork_connector(p);
+ sched_post_fork(p);
cgroup_post_fork(p, args);
perf_event_fork(p);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e44d83f..12e1f3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -889,6 +889,23 @@ unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
/* Max allowed maximum utilization */
unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;

+/*
+ * By default RT tasks run at the maximum performance point/capacity of the
+ * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
+ * SCHED_CAPACITY_SCALE.
+ *
+ * This knob allows admins to change the default behavior when uclamp is being
+ * used. In battery powered devices, particularly, running at the maximum
+ * capacity and frequency will increase energy consumption and shorten the
+ * battery life.
+ *
+ * This knob only affects RT tasks whose uclamp_se->user_defined == false.
+ *
+ * This knob will not override the system default sched_util_clamp_min defined
+ * above.
+ */
+unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
+
/* All clamps are required to be less or equal than these values */
static struct uclamp_se uclamp_default[UCLAMP_CNT];

@@ -991,6 +1008,64 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}

+static void __uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+ unsigned int default_util_min;
+ struct uclamp_se *uc_se;
+
+ lockdep_assert_held(&p->pi_lock);
+
+ uc_se = &p->uclamp_req[UCLAMP_MIN];
+
+ /* Only sync if user didn't override the default */
+ if (uc_se->user_defined)
+ return;
+
+ default_util_min = sysctl_sched_uclamp_util_min_rt_default;
+ uclamp_se_set(uc_se, default_util_min, false);
+}
+
+static void uclamp_update_util_min_rt_default(struct task_struct *p)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!rt_task(p))
+ return;
+
+ /* Protect updates to p->uclamp_* */
+ rq = task_rq_lock(p, &rf);
+ __uclamp_update_util_min_rt_default(p);
+ task_rq_unlock(rq, p, &rf);
+}
+
+static void uclamp_sync_util_min_rt_default(void)
+{
+ struct task_struct *g, *p;
+
+ /*
+ * copy_process() sysctl_uclamp
+ * uclamp_min_rt = X;
+ * write_lock(&tasklist_lock) read_lock(&tasklist_lock)
+ * // link thread smp_mb__after_spinlock()
+ * write_unlock(&tasklist_lock) read_unlock(&tasklist_lock);
+ * sched_post_fork() for_each_process_thread()
+ * __uclamp_sync_rt() __uclamp_sync_rt()
+ *
+ * Ensures that either sched_post_fork() will observe the new
+ * uclamp_min_rt or for_each_process_thread() will observe the new
+ * task.
+ */
+ read_lock(&tasklist_lock);
+ smp_mb__after_spinlock();
+ read_unlock(&tasklist_lock);
+
+ rcu_read_lock();
+ for_each_process_thread(g, p)
+ uclamp_update_util_min_rt_default(p);
+ rcu_read_unlock();
+}
+
static inline struct uclamp_se
uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
{
@@ -1278,12 +1353,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
bool update_root_tg = false;
- int old_min, old_max;
+ int old_min, old_max, old_min_rt;
int result;

mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;
+ old_min_rt = sysctl_sched_uclamp_util_min_rt_default;

result = proc_dointvec(table, write, buffer, lenp, ppos);
if (result)
@@ -1292,7 +1368,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
goto done;

if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
- sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
+ sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE ||
+ sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
+
result = -EINVAL;
goto undo;
}
@@ -1313,6 +1391,11 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
uclamp_update_root_tg();
}

+ if (old_min_rt != sysctl_sched_uclamp_util_min_rt_default) {
+ static_branch_enable(&sched_uclamp_used);
+ uclamp_sync_util_min_rt_default();
+ }
+
/*
* We update all RUNNABLE tasks only when task groups are in use.
* Otherwise, keep it simple and do just a lazy update at each next
@@ -1324,6 +1407,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
undo:
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
+ sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
done:
mutex_unlock(&uclamp_mutex);

@@ -1369,17 +1453,20 @@ static void __setscheduler_uclamp(struct task_struct *p,
*/
for_each_clamp_id(clamp_id) {
struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
- unsigned int clamp_value = uclamp_none(clamp_id);

/* Keep using defined clamps across class changes */
if (uc_se->user_defined)
continue;

- /* By default, RT tasks always get 100% boost */
+ /*
+ * RT tasks by default have a 100% boost value that could be modified
+ * at runtime.
+ */
if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
- clamp_value = uclamp_none(UCLAMP_MAX);
+ __uclamp_update_util_min_rt_default(p);
+ else
+ uclamp_se_set(uc_se, uclamp_none(clamp_id), false);

- uclamp_se_set(uc_se, clamp_value, false);
}

if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1400,6 +1487,10 @@ static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;

+ /*
+ * We don't need to hold task_rq_lock() when updating p->uclamp_* here
+ * as the task is still at its early fork stages.
+ */
for_each_clamp_id(clamp_id)
p->uclamp[clamp_id].active = false;

@@ -1412,6 +1503,11 @@ static void uclamp_fork(struct task_struct *p)
}
}

+static void uclamp_post_fork(struct task_struct *p)
+{
+ uclamp_update_util_min_rt_default(p);
+}
+
static void __init init_uclamp_rq(struct rq *rq)
{
enum uclamp_id clamp_id;
@@ -1462,6 +1558,7 @@ static inline int uclamp_validate(struct task_struct *p,
static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr) { }
static inline void uclamp_fork(struct task_struct *p) { }
+static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
#endif /* CONFIG_UCLAMP_TASK */

@@ -3205,6 +3302,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
return 0;
}

+void sched_post_fork(struct task_struct *p)
+{
+ uclamp_post_fork(p);
+}
+
unsigned long to_ratio(u64 period, u64 runtime)
{
if (runtime == RUNTIME_INF)
@@ -5724,6 +5826,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
kattr.sched_nice = task_nice(p);

#ifdef CONFIG_UCLAMP_TASK
+ /*
+ * This could race with another potential updater, but this is fine
+ * because it'll correctly read the old or the new value. We don't need
+ * to guarantee who wins the race as long as it doesn't return garbage.
+ */
kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4aea67d..1b4d2dc 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1815,6 +1815,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = sysctl_sched_uclamp_handler,
},
+ {
+ .procname = "sched_util_clamp_min_rt_default",
+ .data = &sysctl_sched_uclamp_util_min_rt_default,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_uclamp_handler,
+ },
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
{