2020-05-20 13:45:07

by Dietmar Eggemann

Subject: [PATCH v3 0/5] Capacity awareness for SCHED_DEADLINE

The SCHED_DEADLINE (DL) Admission Control (AC) and task placement do
not work correctly on heterogeneous (asymmetric CPU capacity) systems
such as Arm big.LITTLE or DynamIQ.

Let's fix this by explicitly considering CPU capacity in AC and task
placement.

The DL sched class now attempts to avoid missed task deadlines caused by
a smaller CPU (CPU capacity < 1024) not being capable enough to finish a
task in time. It does so by trying to place a task such that its
CPU-capacity-scaled deadline is not smaller than its runtime.

This patch-set only supports capacity awareness in the idle scenario
(cpudl::free_cpus not empty). Capacity awareness for the non-idle
case should be added in a later series.

Changes v2 [1] -> v3:

Discussion about whether, in case 'rq->rd == def_root_domain', AC should
be performed against the capacity of the CPU the task is running on
rather than the rd CPU capacity sum.
Since this issue already exists w/o capacity awareness, an 'XXX Fix:'
comment was added for now.

Per-patch changes:

(1) Patch 'sched/topology: Store root domain CPU capacity sum' removed
since rd->sum_cpu_capacity is not needed anymore [v2 patch 1/6]

(2) Redesign of dl_bw_capacity() and 'XXX Fix:' comment (mentioned
above) added [patch 2/5]

(3) Favor task_cpu(p) if it has max capacity of !fitting CPUs
[patch 5/5]

Changes v1 [2] -> v2:

Discussion about capacity awareness in idle and non-idle scenarios
indicated that the current patch-set only supports the former.

Per-patch changes:

(1) Use rq->cpu_capacity_orig or capacity_orig_of() instead of
arch_scale_cpu_capacity() [patch 1,6/6]

(2) Optimize dl_bw_cpus(), i.e. return weight of rd->span if rd->span
⊆ cpu_active_mask [patch 2/6]

(3) Replace rd_capacity() with dl_bw_capacity() [patch 3/6]

Changes RFC [3] -> v1:

Only use static values for CPU bandwidth (sched_dl_entity::dl_runtime,
::dl_deadline) and CPU capacity (arch_scale_cpu_capacity()) to fix AC.

Dynamic values for CPU bandwidth (sched_dl_entity::runtime, ::deadline)
and CPU capacity (capacity_of()) are considered to be more related to
energy trade-off calculations which could be later introduced using the
Energy Model.

Since the design of the DL and RT sched classes is very similar, the
implementation follows the overall design of RT capacity awareness
(commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")).

Per-patch changes:

(1) Store CPU capacity sum in the root domain during
build_sched_domains() [patch 1/4]

(2) Adjust to RT capacity awareness design [patch 3/4]

(3) Remove CPU capacity aware placement in switched_to_dl()
(dl_migrate callback) [RFC patch 3/6]

Balance callbacks (push, pull) run only in schedule_tail(),
__schedule(), rt_mutex_setprio() or __sched_setscheduler().
DL throttling leads to a call to __dequeue_task_dl() which is not a
full task dequeue. The task is still enqueued and only removed from
the rq.
So a queue_balance_callback() call in update_curr_dl()->
__dequeue_task_dl() will not be followed by a balance_callback()
call in one of the 4 functions mentioned above.

(4) Remove 'dynamic CPU bandwidth' consideration and only support
'static CPU bandwidth' (ratio between sched_dl_entity::dl_runtime
and ::dl_deadline) [RFC patch 4/6]

(5) Remove modification to migration logic which tried to schedule
small tasks on LITTLE CPUs [RFC patch 6/6]

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]
[3] https://lore.kernel.org/r/[email protected]

The following rt-app testcase, tailored to the Arm64 Hikey960:

root@h960:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024

shows the expected behavior.

According to the following condition in dl_task_fits_capacity()

cap_scale(dl_deadline, arch_scale_cpu_capacity(cpu)) >= dl_runtime

thread0-[0-3] are placed on big CPUs whereas thread1-[0-3] run on
LITTLE CPUs respectively.

The 'delay' parameter for the little tasks makes sure that they start
later than the big tasks, allowing the big tasks to choose big CPUs.

...
"tasks" : {
"thread0" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"timer" : { "ref" : "unique0", "period" : 16000, "mode" : "absolute" },
"run" : 10000,
"dl-runtime" : 11000,
"dl-period" : 16000,
"dl-deadline" : 16000
},
"thread1" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"delay" : 1000,
"timer" : { "ref" : "unique1", "period" : 16000, "mode" : "absolute" },
"run" : 5500,
"dl-runtime" : 6500
"dl-period" : 16000,
"dl-deadline" : 16000
}
...
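
A minimal user-space sketch (not part of the patch set; fits() is a
stand-in for the kernel's cap_scale()/dl_task_fits_capacity()) applying
the condition above to the rt-app parameters:

#include <stdio.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SHIFT	10

/* Runtime/deadline in us, capacity on the 0..1024 scale. */
static bool fits(unsigned long long dl_deadline, unsigned long long dl_runtime,
		 unsigned long cap)
{
	/* cap_scale(dl_deadline, cap) >= dl_runtime */
	return ((dl_deadline * cap) >> SCHED_CAPACITY_SHIFT) >= dl_runtime;
}

int main(void)
{
	/* thread0 (11000/16000): 16000 * 462 / 1024 = 7218 < 11000,
	 * so it only fits a big (1024) CPU. */
	printf("thread0: LITTLE %d big %d\n",
	       fits(16000, 11000, 462), fits(16000, 11000, 1024));

	/* thread1 (6500/16000): 16000 * 462 / 1024 = 7218 >= 6500,
	 * so it also fits a LITTLE (462) CPU. */
	printf("thread1: LITTLE %d big %d\n",
	       fits(16000, 6500, 462), fits(16000, 6500, 1024));

	return 0;
}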

Tests were run with the Performance CPUfreq governor so that the
Schedutil CPUfreq governor's DL threads (sugov:[0,4]), necessary on a
slow-switching platform like the Hikey960, do not interfere with the
rt-app test tasks. Using Schedutil would require lowering the number of
tasks to 3 instances each.

Dietmar Eggemann (2):
sched/deadline: Optimize dl_bw_cpus()
sched/deadline: Add dl_bw_capacity()

Luca Abeni (3):
sched/deadline: Improve admission control for asymmetric CPU
capacities
sched/deadline: Make DL capacity-aware
sched/deadline: Implement fallback mechanism for !fit case

kernel/sched/cpudeadline.c | 24 ++++++++++
kernel/sched/deadline.c | 89 ++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 21 +++++++--
3 files changed, 113 insertions(+), 21 deletions(-)

--
2.17.1


2020-05-20 13:45:14

by Dietmar Eggemann

Subject: [PATCH v3 2/5] sched/deadline: Add dl_bw_capacity()

Capacity-aware SCHED_DEADLINE Admission Control (AC) needs root domain
(rd) CPU capacity sum.

Introduce dl_bw_capacity() which for a symmetric rd w/ a CPU capacity
of SCHED_CAPACITY_SCALE simply relies on dl_bw_cpus() to return #CPUs
multiplied by SCHED_CAPACITY_SCALE.

For an asymmetric rd or a CPU capacity < SCHED_CAPACITY_SCALE it
computes the CPU capacity sum over rd span and cpu_active_mask.

An 'XXX Fix:' comment was added to highlight that if 'rq->rd ==
def_root_domain' AC should be performed against the capacity of the
CPU the task is running on rather than the rd CPU capacity sum. This
issue already exists w/o capacity awareness.
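
For illustration (numbers from the cover letter, not part of the patch):
on a symmetric rd of e.g. 8 CPUs w/ capacity 1024 the fast path returns
8 << SCHED_CAPACITY_SHIFT = 8192, whereas on the Hikey960 rd (4 CPUs w/
capacity 462 plus 4 w/ 1024) the slow path returns
4 * 462 + 4 * 1024 = 5944.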

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ae22bfc37ae..ea7282ce484c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -69,6 +69,34 @@ static inline int dl_bw_cpus(int i)

return cpus;
}
+
+static inline unsigned long __dl_bw_capacity(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ unsigned long cap = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask)
+ cap += capacity_orig_of(i);
+
+ return cap;
+}
+
+/*
+ * XXX Fix: If 'rq->rd == def_root_domain' perform AC against capacity
+ * of the CPU the task is running on rather rd's \Sum CPU capacity.
+ */
+static inline unsigned long dl_bw_capacity(int i)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
+ capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+ return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
+ } else {
+ return __dl_bw_capacity(i);
+ }
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -79,6 +107,11 @@ static inline int dl_bw_cpus(int i)
{
return 1;
}
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+ return SCHED_CAPACITY_SCALE;
+}
#endif

static inline
--
2.17.1

2020-05-20 13:45:33

by Dietmar Eggemann

Subject: [PATCH v3 4/5] sched/deadline: Make DL capacity-aware

From: Luca Abeni <[email protected]>

The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems where DL tasks are guaranteed
to have a bounded tardiness but presents issues on heterogeneous
systems.

A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).

Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:

CPU original capacity / 1024 >= task runtime / task deadline

Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.

On homogeneous systems the original behavior of the DL admission
control should be retained.
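
Worked example (using the 70ms runtime / 100ms period task above,
assuming deadline == period): cap_scale(100ms, 512) = 100ms * 512 / 1024
= 50ms < 70ms, i.e. 512/1024 = 0.5 < 70/100 = 0.7, so the task does not
fit a 512-capacity CPU; on a 1024-capacity CPU, 100ms >= 70ms and it
fits.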

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 14 +++++++++++++-
kernel/sched/deadline.c | 18 ++++++++++++++----
kernel/sched/sched.h | 15 +++++++++++++++
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5cc4012572ec..8630f2a40a3f 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- return 1;
+ int cpu;
+
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 1;
+
+ /* Ensure the capacity of the CPUs fits the task. */
+ for_each_cpu(cpu, later_mask) {
+ if (!dl_task_fits_capacity(p, cpu))
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ if (!cpumask_empty(later_mask))
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fa8566517715..f2e8f5a36707 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1643,6 +1643,7 @@ static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
+ bool select_rq;
struct rq *rq;

if (sd_flag != SD_BALANCE_WAKE)
@@ -1662,10 +1663,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- if (unlikely(dl_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) {
+ select_rq = unlikely(dl_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 ||
+ !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ p->nr_cpus_allowed > 1;
+
+ /*
+ * Take the capacity of the CPU into account to
+ * ensure it fits the requirement of the task.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ select_rq |= !dl_task_fits_capacity(p, cpu);
+
+ if (select_rq) {
int target = find_later_rq(p);

if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14cb6a97e2d2..6ebbb1f353c4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+ return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
+}
+
extern void init_dl_bw(struct dl_bw *dl_b);
extern int sched_dl_global_validate(void);
extern void sched_dl_do_global(void);
--
2.17.1

2020-05-20 13:47:22

by Dietmar Eggemann

Subject: [PATCH v3 3/5] sched/deadline: Improve admission control for asymmetric CPU capacities

From: Luca Abeni <[email protected]>

The current SCHED_DEADLINE (DL) admission control ensures that

sum of reserved CPU bandwidth < x * M

where

x = /proc/sys/kernel/sched_rt_{runtime,period}_us
M = # CPUs in root domain.

DL admission control works well for homogeneous systems where the
capacity of all CPUs is equal (1024), i.e. bounded tardiness for DL
tasks and non-starvation of non-DL tasks are guaranteed.

But on heterogeneous systems, where CPU capacities differ, it can fail
by over-allocating CPU time on smaller-capacity CPUs.

On an Arm big.LITTLE/DynamIQ system DL tasks can easily starve other
tasks, making it unusable.

Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.
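
As a worked example (assuming the default
sched_rt_runtime_us/sched_rt_period_us ratio x = 0.95 and the Hikey960
capacities from the cover letter): the old test admitted DL bandwidth up
to 0.95 * 8 = 7.6 CPUs worth of runtime, while the new test admits up to
0.95 * (4 * 462 + 4 * 1024) / 1024 ~= 5.51, matching what the four
LITTLE plus four big CPUs can actually serve.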

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 30 +++++++++++++++++-------------
kernel/sched/sched.h | 6 +++---
2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ea7282ce484c..fa8566517715 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2590,11 +2590,12 @@ void sched_dl_do_global(void)
int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
- int cpus, err = -1;
+ int cpus, err = -1, cpu = task_cpu(p);
+ struct dl_bw *dl_b = dl_bw_of(cpu);
+ unsigned long cap;

if (attr->sched_flags & SCHED_FLAG_SUGOV)
return 0;
@@ -2609,15 +2610,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container.
*/
raw_spin_lock(&dl_b->lock);
- cpus = dl_bw_cpus(task_cpu(p));
+ cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, cap, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
/*
* XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total
@@ -2753,19 +2756,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
#ifdef CONFIG_SMP
int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
{
+ unsigned long flags, cap;
unsigned int dest_cpu;
struct dl_bw *dl_b;
bool overflow;
- int cpus, ret;
- unsigned long flags;
+ int ret;

dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);

rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ cap = dl_bw_capacity(dest_cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
if (overflow) {
ret = -EBUSY;
} else {
@@ -2775,6 +2778,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
+ int cpus = dl_bw_cpus(dest_cpu);
+
__dl_add(dl_b, p->dl.dl_bw, cpus);
ret = 0;
}
@@ -2807,16 +2812,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,

bool dl_cpu_busy(unsigned int cpu)
{
- unsigned long flags;
+ unsigned long flags, cap;
struct dl_bw *dl_b;
bool overflow;
- int cpus;

rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ cap = dl_bw_capacity(cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 21416b30c520..14cb6a97e2d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,11 +310,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
__dl_update(dl_b, -((s32)tsk_bw / cpus));
}

-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
+ u64 old_bw, u64 new_bw)
{
return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

extern void init_dl_bw(struct dl_bw *dl_b);
--
2.17.1

2020-05-20 13:48:04

by Dietmar Eggemann

Subject: [PATCH v3 5/5] sched/deadline: Implement fallback mechanism for !fit case

From: Luca Abeni <[email protected]>

When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPUs (later_mask), the task is doomed to
miss its deadline.

This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case, try to select the idle CPU with the largest CPU capacity
to minimize tardiness.

Favor task_cpu(p) if it has the max capacity among the !fitting CPUs so
that find_later_rq() can potentially still return it (most likely
cache-hot) early.
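
For illustration (hypothetical numbers, Hikey960 capacities from the
cover letter): a task w/ 12ms runtime and 16ms deadline whose idle
candidates are all LITTLE CPUs fits none of them (16ms * 462 / 1024 ~=
7.2ms < 12ms), so later_mask ends up empty and the fallback sets the
max-capacity candidate instead; since all candidates tie at 462,
task_cpu(p) is preferred if it is among them.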

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 8630f2a40a3f..8cb06c8c7eb1 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,19 +121,31 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- int cpu;
+ unsigned long cap, max_cap = 0;
+ int cpu, max_cpu = -1;

if (!static_branch_unlikely(&sched_asym_cpucapacity))
return 1;

/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
- if (!dl_task_fits_capacity(p, cpu))
+ if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
+
+ cap = capacity_orig_of(cpu);
+
+ if (cap > max_cap ||
+ (cpu == task_cpu(p) && cap == max_cap)) {
+ max_cap = cap;
+ max_cpu = cpu;
+ }
+ }
}

- if (!cpumask_empty(later_mask))
- return 1;
+ if (cpumask_empty(later_mask))
+ cpumask_set_cpu(max_cpu, later_mask);
+
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

--
2.17.1

2020-05-22 15:00:14

by Juri Lelli

Subject: Re: [PATCH v3 2/5] sched/deadline: Add dl_bw_capacity()

On 20/05/20 15:42, Dietmar Eggemann wrote:
> Capacity-aware SCHED_DEADLINE Admission Control (AC) needs root domain
> (rd) CPU capacity sum.
>
> Introduce dl_bw_capacity() which for a symmetric rd w/ a CPU capacity
> of SCHED_CAPACITY_SCALE simply relies on dl_bw_cpus() to return #CPUs
> multiplied by SCHED_CAPACITY_SCALE.
>
> For an asymmetric rd or a CPU capacity < SCHED_CAPACITY_SCALE it
> computes the CPU capacity sum over rd span and cpu_active_mask.
>
> A 'XXX Fix:' comment was added to highlight that if 'rq->rd ==
> def_root_domain' AC should be performed against the capacity of the
> CPU the task is running on rather the rd CPU capacity sum. This
> issue already exists w/o capacity awareness.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4ae22bfc37ae..ea7282ce484c 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -69,6 +69,34 @@ static inline int dl_bw_cpus(int i)
>
> return cpus;
> }
> +
> +static inline unsigned long __dl_bw_capacity(int i)
> +{
> + struct root_domain *rd = cpu_rq(i)->rd;
> + unsigned long cap = 0;
> +
> + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> + "sched RCU must be held");
> +
> + for_each_cpu_and(i, rd->span, cpu_active_mask)
> + cap += capacity_orig_of(i);
> +
> + return cap;
> +}
> +
> +/*
> + * XXX Fix: If 'rq->rd == def_root_domain' perform AC against capacity
> + * of the CPU the task is running on rather rd's \Sum CPU capacity.
> + */
> +static inline unsigned long dl_bw_capacity(int i)
> +{
> + if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
> + capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
> + return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
> + } else {
> + return __dl_bw_capacity(i);
> + }
> +}
> #else
> static inline struct dl_bw *dl_bw_of(int i)
> {
> @@ -79,6 +107,11 @@ static inline int dl_bw_cpus(int i)
> {
> return 1;
> }
> +
> +static inline unsigned long dl_bw_capacity(int i)
> +{
> + return SCHED_CAPACITY_SCALE;
> +}
> #endif
>
> static inline
> --

Acked-by: Juri Lelli <[email protected]>

2020-05-22 15:00:57

by Juri Lelli

Subject: Re: [PATCH v3 4/5] sched/deadline: Make DL capacity-aware

On 20/05/20 15:42, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
> algorithm w/o considering CPU capacity or task utilization.
> This works well on homogeneous systems where DL tasks are guaranteed
> to have a bounded tardiness but presents issues on heterogeneous
> systems.
>
> A DL task can migrate to a CPU which does not have enough CPU capacity
> to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
> period on a CPU w/ 512 capacity).
>
> Add the DL fitness function dl_task_fits_capacity() for DL admission
> control on heterogeneous systems. A task fits onto a CPU if:
>
> CPU original capacity / 1024 >= task runtime / task deadline
>
> Use this function on heterogeneous systems to try to find a CPU which
> meets this criterion during task wakeup, push and offline migration.
>
> On homogeneous systems the original behavior of the DL admission
> control should be retained.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 14 +++++++++++++-
> kernel/sched/deadline.c | 18 ++++++++++++++----
> kernel/sched/sched.h | 15 +++++++++++++++
> 3 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5cc4012572ec..8630f2a40a3f 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - return 1;
> + int cpu;
> +
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fits the task. */
> + for_each_cpu(cpu, later_mask) {
> + if (!dl_task_fits_capacity(p, cpu))
> + cpumask_clear_cpu(cpu, later_mask);
> + }
> +
> + if (!cpumask_empty(later_mask))
> + return 1;
> } else {
> int best_cpu = cpudl_maximum(cp);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index fa8566517715..f2e8f5a36707 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1643,6 +1643,7 @@ static int
> select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> + bool select_rq;
> struct rq *rq;
>
> if (sd_flag != SD_BALANCE_WAKE)
> @@ -1662,10 +1663,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> * other hand, if it has a shorter deadline, we
> * try to make it stay here, it might be important.
> */
> - if (unlikely(dl_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> - (p->nr_cpus_allowed > 1)) {
> + select_rq = unlikely(dl_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 ||
> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
> + p->nr_cpus_allowed > 1;
> +
> + /*
> + * Take the capacity of the CPU into account to
> + * ensure it fits the requirement of the task.
> + */
> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> + select_rq |= !dl_task_fits_capacity(p, cpu);
> +
> + if (select_rq) {
> int target = find_later_rq(p);
>
> if (target != -1 &&
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 14cb6a97e2d2..6ebbb1f353c4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
> cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
> }
>
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the
> + * CPU original capacity and the runtime/deadline ratio of the task.
> + *
> + * The function will return true if the CPU original capacity of the
> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> + * task and false otherwise.
> + */
> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> + return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> +}
> +
> extern void init_dl_bw(struct dl_bw *dl_b);
> extern int sched_dl_global_validate(void);
> extern void sched_dl_do_global(void);
> --

Acked-by: Juri Lelli <[email protected]>

2020-05-22 15:03:19

by Juri Lelli

Subject: Re: [PATCH v3 3/5] sched/deadline: Improve admission control for asymmetric CPU capacities

On 20/05/20 15:42, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> The current SCHED_DEADLINE (DL) admission control ensures that
>
> sum of reserved CPU bandwidth < x * M
>
> where
>
> x = /proc/sys/kernel/sched_rt_{runtime,period}_us
> M = # CPUs in root domain.
>
> DL admission control works well for homogeneous systems where the
> capacity of all CPUs are equal (1024). I.e. bounded tardiness for DL
> and non-starvation of non-DL tasks is guaranteed.
>
> But on heterogeneous systems where capacity of CPUs are different it
> could fail by over-allocating CPU time on smaller capacity CPUs.
>
> On an Arm big.LITTLE/DynamIQ system DL tasks can easily starve other
> tasks making it unusable.
>
> Fix this by explicitly considering the CPU capacity in the DL admission
> test by replacing M with the root domain CPU capacity sum.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/deadline.c | 30 +++++++++++++++++-------------
> kernel/sched/sched.h | 6 +++---
> 2 files changed, 20 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index ea7282ce484c..fa8566517715 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2590,11 +2590,12 @@ void sched_dl_do_global(void)
> int sched_dl_overflow(struct task_struct *p, int policy,
> const struct sched_attr *attr)
> {
> - struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> u64 period = attr->sched_period ?: attr->sched_deadline;
> u64 runtime = attr->sched_runtime;
> u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> - int cpus, err = -1;
> + int cpus, err = -1, cpu = task_cpu(p);
> + struct dl_bw *dl_b = dl_bw_of(cpu);
> + unsigned long cap;
>
> if (attr->sched_flags & SCHED_FLAG_SUGOV)
> return 0;
> @@ -2609,15 +2610,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
> * allocated bandwidth of the container.
> */
> raw_spin_lock(&dl_b->lock);
> - cpus = dl_bw_cpus(task_cpu(p));
> + cpus = dl_bw_cpus(cpu);
> + cap = dl_bw_capacity(cpu);
> +
> if (dl_policy(policy) && !task_has_dl_policy(p) &&
> - !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> + !__dl_overflow(dl_b, cap, 0, new_bw)) {
> if (hrtimer_active(&p->dl.inactive_timer))
> __dl_sub(dl_b, p->dl.dl_bw, cpus);
> __dl_add(dl_b, new_bw, cpus);
> err = 0;
> } else if (dl_policy(policy) && task_has_dl_policy(p) &&
> - !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> + !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
> /*
> * XXX this is slightly incorrect: when the task
> * utilization decreases, we should delay the total
> @@ -2753,19 +2756,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
> #ifdef CONFIG_SMP
> int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
> {
> + unsigned long flags, cap;
> unsigned int dest_cpu;
> struct dl_bw *dl_b;
> bool overflow;
> - int cpus, ret;
> - unsigned long flags;
> + int ret;
>
> dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
>
> rcu_read_lock_sched();
> dl_b = dl_bw_of(dest_cpu);
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - cpus = dl_bw_cpus(dest_cpu);
> - overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
> + cap = dl_bw_capacity(dest_cpu);
> + overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
> if (overflow) {
> ret = -EBUSY;
> } else {
> @@ -2775,6 +2778,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
> * We will free resources in the source root_domain
> * later on (see set_cpus_allowed_dl()).
> */
> + int cpus = dl_bw_cpus(dest_cpu);
> +
> __dl_add(dl_b, p->dl.dl_bw, cpus);
> ret = 0;
> }
> @@ -2807,16 +2812,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,
>
> bool dl_cpu_busy(unsigned int cpu)
> {
> - unsigned long flags;
> + unsigned long flags, cap;
> struct dl_bw *dl_b;
> bool overflow;
> - int cpus;
>
> rcu_read_lock_sched();
> dl_b = dl_bw_of(cpu);
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - cpus = dl_bw_cpus(cpu);
> - overflow = __dl_overflow(dl_b, cpus, 0, 0);
> + cap = dl_bw_capacity(cpu);
> + overflow = __dl_overflow(dl_b, cap, 0, 0);
> raw_spin_unlock_irqrestore(&dl_b->lock, flags);
> rcu_read_unlock_sched();
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 21416b30c520..14cb6a97e2d2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -310,11 +310,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
> __dl_update(dl_b, -((s32)tsk_bw / cpus));
> }
>
> -static inline
> -bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
> +static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
> + u64 old_bw, u64 new_bw)
> {
> return dl_b->bw != -1 &&
> - dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> + cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
> }
>
> extern void init_dl_bw(struct dl_bw *dl_b);
> --

Acked-by: Juri Lelli <[email protected]>

2020-05-22 15:03:27

by Juri Lelli

Subject: Re: [PATCH v3 5/5] sched/deadline: Implement fallback mechanism for !fit case

On 20/05/20 15:42, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> When a task has a runtime that cannot be served within the scheduling
> deadline by any of the idle CPU (later_mask) the task is doomed to miss
> its deadline.
>
> This can happen since the SCHED_DEADLINE admission control guarantees
> only bounded tardiness and not the hard respect of all deadlines.
> In this case try to select the idle CPU with the largest CPU capacity
> to minimize tardiness.
>
> Favor task_cpu(p) if it has max capacity of !fitting CPUs so that
> find_later_rq() can potentially still return it (most likely cache-hot)
> early.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 8630f2a40a3f..8cb06c8c7eb1 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,19 +121,31 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - int cpu;
> + unsigned long cap, max_cap = 0;
> + int cpu, max_cpu = -1;
>
> if (!static_branch_unlikely(&sched_asym_cpucapacity))
> return 1;
>
> /* Ensure the capacity of the CPUs fits the task. */
> for_each_cpu(cpu, later_mask) {
> - if (!dl_task_fits_capacity(p, cpu))
> + if (!dl_task_fits_capacity(p, cpu)) {
> cpumask_clear_cpu(cpu, later_mask);
> +
> + cap = capacity_orig_of(cpu);
> +
> + if (cap > max_cap ||
> + (cpu == task_cpu(p) && cap == max_cap)) {
> + max_cap = cap;
> + max_cpu = cpu;
> + }
> + }
> }
>
> - if (!cpumask_empty(later_mask))
> - return 1;
> + if (cpumask_empty(later_mask))
> + cpumask_set_cpu(max_cpu, later_mask);
> +
> + return 1;
> } else {
> int best_cpu = cpudl_maximum(cp);
>
> --

Acked-by: Juri Lelli <[email protected]>

2020-06-10 15:13:52

by Peter Zijlstra

Subject: Re: [PATCH v3 0/5] Capacity awareness for SCHED_DEADLINE

On Wed, May 20, 2020 at 03:42:38PM +0200, Dietmar Eggemann wrote:
> Dietmar Eggemann (2):
> sched/deadline: Optimize dl_bw_cpus()
> sched/deadline: Add dl_bw_capacity()
>
> Luca Abeni (3):
> sched/deadline: Improve admission control for asymmetric CPU
> capacities
> sched/deadline: Make DL capacity-aware
> sched/deadline: Implement fallback mechanism for !fit case
>
> kernel/sched/cpudeadline.c | 24 ++++++++++
> kernel/sched/deadline.c | 89 ++++++++++++++++++++++++++++++--------
> kernel/sched/sched.h | 21 +++++++--
> 3 files changed, 113 insertions(+), 21 deletions(-)

Thanks!

Subject: [tip: sched/core] sched/deadline: Implement fallback mechanism for !fit case

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 23e71d8ba42933bff12e453858fd68c073bc5258
Gitweb: https://git.kernel.org/tip/23e71d8ba42933bff12e453858fd68c073bc5258
Author: Luca Abeni <[email protected]>
AuthorDate: Wed, 20 May 2020 15:42:43 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 15 Jun 2020 14:10:05 +02:00

sched/deadline: Implement fallback mechanism for !fit case

When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPUs (later_mask), the task is doomed to
miss its deadline.

This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case, try to select the idle CPU with the largest CPU capacity
to minimize tardiness.

Favor task_cpu(p) if it has the max capacity among the !fitting CPUs so
that find_later_rq() can potentially still return it (most likely
cache-hot) early.

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/cpudeadline.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 8630f2a..8cb06c8 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,19 +121,31 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- int cpu;
+ unsigned long cap, max_cap = 0;
+ int cpu, max_cpu = -1;

if (!static_branch_unlikely(&sched_asym_cpucapacity))
return 1;

/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
- if (!dl_task_fits_capacity(p, cpu))
+ if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
+
+ cap = capacity_orig_of(cpu);
+
+ if (cap > max_cap ||
+ (cpu == task_cpu(p) && cap == max_cap)) {
+ max_cap = cap;
+ max_cpu = cpu;
+ }
+ }
}

- if (!cpumask_empty(later_mask))
- return 1;
+ if (cpumask_empty(later_mask))
+ cpumask_set_cpu(max_cpu, later_mask);
+
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

Subject: [tip: sched/core] sched/deadline: Make DL capacity-aware

The following commit has been merged into the sched/core branch of tip:

Commit-ID: b4118988fdcb4554ea6687dd8ff68bcab690b8ea
Gitweb: https://git.kernel.org/tip/b4118988fdcb4554ea6687dd8ff68bcab690b8ea
Author: Luca Abeni <[email protected]>
AuthorDate: Wed, 20 May 2020 15:42:42 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 15 Jun 2020 14:10:05 +02:00

sched/deadline: Make DL capacity-aware

The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems where DL tasks are guaranteed
to have a bounded tardiness but presents issues on heterogeneous
systems.

A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).

Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:

CPU original capacity / 1024 >= task runtime / task deadline

Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.

On homogeneous systems the original behavior of the DL admission
control should be retained.

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/cpudeadline.c | 14 +++++++++++++-
kernel/sched/deadline.c | 18 ++++++++++++++----
kernel/sched/sched.h | 15 +++++++++++++++
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5cc4012..8630f2a 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- return 1;
+ int cpu;
+
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 1;
+
+ /* Ensure the capacity of the CPUs fits the task. */
+ for_each_cpu(cpu, later_mask) {
+ if (!dl_task_fits_capacity(p, cpu))
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ if (!cpumask_empty(later_mask))
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9ebd0a9..84e84ba 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1643,6 +1643,7 @@ static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
+ bool select_rq;
struct rq *rq;

if (sd_flag != SD_BALANCE_WAKE)
@@ -1662,10 +1663,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- if (unlikely(dl_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) {
+ select_rq = unlikely(dl_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 ||
+ !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ p->nr_cpus_allowed > 1;
+
+ /*
+ * Take the capacity of the CPU into account to
+ * ensure it fits the requirement of the task.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ select_rq |= !dl_task_fits_capacity(p, cpu);
+
+ if (select_rq) {
int target = find_later_rq(p);

if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91b250f..3368876 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+ return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
+}
+
extern void init_dl_bw(struct dl_bw *dl_b);
extern int sched_dl_global_validate(void);
extern void sched_dl_do_global(void);

Subject: [tip: sched/core] sched/deadline: Improve admission control for asymmetric CPU capacities

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 60ffd5edc5e4fa69622c125c54ef8e7d5d894af8
Gitweb: https://git.kernel.org/tip/60ffd5edc5e4fa69622c125c54ef8e7d5d894af8
Author: Luca Abeni <[email protected]>
AuthorDate: Wed, 20 May 2020 15:42:41 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 15 Jun 2020 14:10:05 +02:00

sched/deadline: Improve admission control for asymmetric CPU capacities

The current SCHED_DEADLINE (DL) admission control ensures that

sum of reserved CPU bandwidth < x * M

where

x = /proc/sys/kernel/sched_rt_{runtime,period}_us
M = # CPUs in root domain.

DL admission control works well for homogeneous systems where the
capacity of all CPUs is equal (1024), i.e. bounded tardiness for DL
tasks and non-starvation of non-DL tasks are guaranteed.

But on heterogeneous systems, where CPU capacities differ, it can fail
by over-allocating CPU time on smaller-capacity CPUs.

On an Arm big.LITTLE/DynamIQ system DL tasks can easily starve other
tasks, making it unusable.

Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/deadline.c | 30 +++++++++++++++++-------------
kernel/sched/sched.h | 6 +++---
2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 01f474a..9ebd0a9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2590,11 +2590,12 @@ void sched_dl_do_global(void)
int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
- int cpus, err = -1;
+ int cpus, err = -1, cpu = task_cpu(p);
+ struct dl_bw *dl_b = dl_bw_of(cpu);
+ unsigned long cap;

if (attr->sched_flags & SCHED_FLAG_SUGOV)
return 0;
@@ -2609,15 +2610,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container.
*/
raw_spin_lock(&dl_b->lock);
- cpus = dl_bw_cpus(task_cpu(p));
+ cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, cap, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
/*
* XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total
@@ -2772,19 +2775,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
#ifdef CONFIG_SMP
int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
{
+ unsigned long flags, cap;
unsigned int dest_cpu;
struct dl_bw *dl_b;
bool overflow;
- int cpus, ret;
- unsigned long flags;
+ int ret;

dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);

rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ cap = dl_bw_capacity(dest_cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
if (overflow) {
ret = -EBUSY;
} else {
@@ -2794,6 +2797,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
+ int cpus = dl_bw_cpus(dest_cpu);
+
__dl_add(dl_b, p->dl.dl_bw, cpus);
ret = 0;
}
@@ -2826,16 +2831,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,

bool dl_cpu_busy(unsigned int cpu)
{
- unsigned long flags;
+ unsigned long flags, cap;
struct dl_bw *dl_b;
bool overflow;
- int cpus;

rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ cap = dl_bw_capacity(cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d5d068..91b250f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,11 +310,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
__dl_update(dl_b, -((s32)tsk_bw / cpus));
}

-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
+ u64 old_bw, u64 new_bw)
{
return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

extern void init_dl_bw(struct dl_bw *dl_b);

Subject: [tip: sched/core] sched/deadline: Add dl_bw_capacity()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: fc9dc698472aa460a8b3b036d9b1d0b751f12f58
Gitweb: https://git.kernel.org/tip/fc9dc698472aa460a8b3b036d9b1d0b751f12f58
Author: Dietmar Eggemann <[email protected]>
AuthorDate: Wed, 20 May 2020 15:42:40 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Mon, 15 Jun 2020 14:10:05 +02:00

sched/deadline: Add dl_bw_capacity()

Capacity-aware SCHED_DEADLINE Admission Control (AC) needs root domain
(rd) CPU capacity sum.

Introduce dl_bw_capacity() which for a symmetric rd w/ a CPU capacity
of SCHED_CAPACITY_SCALE simply relies on dl_bw_cpus() to return #CPUs
multiplied by SCHED_CAPACITY_SCALE.

For an asymmetric rd or a CPU capacity < SCHED_CAPACITY_SCALE it
computes the CPU capacity sum over rd span and cpu_active_mask.

An 'XXX Fix:' comment was added to highlight that if 'rq->rd ==
def_root_domain' AC should be performed against the capacity of the
CPU the task is running on rather than the rd CPU capacity sum. This
issue already exists w/o capacity awareness.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ec90265..01f474a 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -69,6 +69,34 @@ static inline int dl_bw_cpus(int i)

return cpus;
}
+
+static inline unsigned long __dl_bw_capacity(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ unsigned long cap = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask)
+ cap += capacity_orig_of(i);
+
+ return cap;
+}
+
+/*
+ * XXX Fix: If 'rq->rd == def_root_domain' perform AC against capacity
+ * of the CPU the task is running on rather rd's \Sum CPU capacity.
+ */
+static inline unsigned long dl_bw_capacity(int i)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
+ capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+ return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
+ } else {
+ return __dl_bw_capacity(i);
+ }
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -79,6 +107,11 @@ static inline int dl_bw_cpus(int i)
{
return 1;
}
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+ return SCHED_CAPACITY_SCALE;
+}
#endif

static inline