2020-04-27 08:39:15

by Dietmar Eggemann

Subject: [PATCH v2 0/6] Capacity awareness for SCHED_DEADLINE

The SCHED_DEADLINE (DL) admission control does not work correctly on
heterogeneous (asymmetric CPU capacity) systems such as Arm big.LITTLE
or DynamIQ.

Let's fix this by explicitly considering CPU capacity in DL admission
control and task migration.

The DL sched class now attempts to avoid missing task deadlines due to
a smaller CPU (CPU capacity < 1024) not being capable enough to finish
a task in time. It does so by trying to place a task so that its
CPU-capacity-scaled deadline is not smaller than its runtime.
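Expressed with the task's static parameters this is (cf.
dl_task_fits_capacity() introduced in patch 5/6):

cap_scale(dl_deadline, arch_scale_cpu_capacity(cpu)) >= dl_runtime

i.e. (CPU capacity / 1024) * dl_deadline >= dl_runtime.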

This patch-set only supports capacity awareness in idle scenarios
(cpudl::free_cpus not empty). Capacity awareness for non-idle cases
should be added in a later series.

Changes v1 [1] -> v2:

Discussion about capacity awareness in idle and non-idle scenarios
indicated that the current patch-set only supports the former.

Per-patch changes:

(1) Use rq->cpu_capacity_orig or capacity_orig_of() instead of
arch_scale_cpu_capacity() [patch 1,6/6]

(2) Optimize dl_bw_cpus(), i.e. return the weight of rd->span if rd->span
is a subset of cpu_active_mask [patch 2/6]

(3) Replace rd_capacity() with dl_bw_capacity(). [patch 3/6]

Changes RFC [2] -> v1:

Only use static values for CPU bandwidth (sched_dl_entity::dl_runtime,
::dl_deadline) and CPU capacity (arch_scale_cpu_capacity()) to fix DL
admission control.

Dynamic values for CPU bandwidth (sched_dl_entity::runtime, ::deadline)
and CPU capacity (capacity_of()) are considered to be more related to
energy trade-off calculations which could be later introduced using the
Energy Model.

Since the design of the DL and RT sched classes is very similar, the
implementation follows the overall design of RT capacity awareness
(commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")).

Per-patch changes:

(1) Store CPU capacity sum in the root domain during
build_sched_domains() [patch 1/4]

(2) Adjust to RT capacity awareness design [patch 3/4]

(3) Remove CPU capacity aware placement in switched_to_dl()
(dl_migrate callback) [RFC patch 3/6]

Balance callbacks (push, pull) run only in schedule_tail(),
__schedule(), rt_mutex_setprio() or __sched_setscheduler().
DL throttling leads to a call to __dequeue_task_dl() which is not a
full task dequeue: the task is still enqueued and is only removed from
the rq.
So a queue_balance_callback() call in update_curr_dl()->
__dequeue_task_dl() will not be followed by a balance_callback()
call in one of the 4 functions mentioned above.

(4) Remove 'dynamic CPU bandwidth' consideration and only support
'static CPU bandwidth' (ratio between sched_dl_entity::dl_runtime
and ::dl_deadline) [RFC patch 4/6]

(5) Remove modification to migration logic which tried to schedule
small tasks on LITTLE CPUs [RFC patch 6/6]

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]

The following rt-app testcase, tailored to the Arm64 Hikey960:

root@h960:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024

shows the expected behavior.

According to the following condition in dl_task_fits_capacity()

cap_scale(dl_deadline, arch_scale_cpu_capacity(cpu)) >= dl_runtime

thread0-[0-3] are each placed on a big CPU whereas thread1-[0-3] each
run on a LITTLE CPU.

...
"tasks" : {
"thread0" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"timer" : { "ref" : "unique0", "period" : 16000, "mode" : "absolute" },
"run" : 10000,
"dl-runtime" : 11000,
"dl-period" : 16000,
"dl-deadline" : 16000
},
"thread1" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"delay" : 1000,
"timer" : { "ref" : "unique1", "period" : 16000, "mode" : "absolute" },
"run" : 5500,
"dl-runtime" : 6500
"dl-period" : 16000,
"dl-deadline" : 16000
}
...
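
Plugging these parameters into the condition above gives (worked
numbers, rounded):

thread0: cap_scale(16000, 462) = 16000 * 462 / 1024 ~= 7218 < 11000
         -> does not fit a LITTLE CPU
         cap_scale(16000, 1024) = 16000 >= 11000
         -> fits a big CPU

thread1: cap_scale(16000, 462) ~= 7218 >= 6500
         -> fits a LITTLE CPU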

Tests were run with the Performance CPUfreq governor so that the Schedutil
CPUfreq governor's DL threads (sugov:[0,4]), which are necessary on a
slow-switching platform like the Hikey960, do not interfere with the
rt-app test tasks. Using Schedutil would require lowering the number of
tasks to 3 instances each.
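
Not part of the patch-set, but for reference: a minimal standalone C
sketch of how a task with the thread0 parameters could be started as
SCHED_DEADLINE directly via sched_setattr(2) instead of rt-app. Values
are in nanoseconds, struct sched_attr is defined locally as in the
sched_setattr(2) man page (there is no glibc wrapper), and the program
needs root:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define SCHED_DEADLINE	6

/* Subset of the uapi struct; the kernel accepts it via the size field. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* ns */
	uint64_t sched_deadline;	/* ns */
	uint64_t sched_period;		/* ns */
};

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	= 11000 * 1000ULL,	/* dl-runtime  11ms */
		.sched_deadline	= 16000 * 1000ULL,	/* dl-deadline 16ms */
		.sched_period	= 16000 * 1000ULL,	/* dl-period   16ms */
	};

	/* pid 0 == current task, flags == 0 */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	/* Burn CPU; the runtime/period reservation throttles the task. */
	for (;;)
		;
}

chrt -d, as used later in this thread, is the equivalent command line
tool.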

Dietmar Eggemann (3):
sched/topology: Store root domain CPU capacity sum
sched/deadline: Optimize dl_bw_cpus()
sched/deadline: Add dl_bw_capacity()

Luca Abeni (3):
sched/deadline: Improve admission control for asymmetric CPU
capacities
sched/deadline: Make DL capacity-aware
sched/deadline: Implement fallback mechanism for !fit case

kernel/sched/cpudeadline.c | 23 +++++++++++
kernel/sched/deadline.c | 80 +++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 22 +++++++++--
kernel/sched/topology.c | 15 +++++--
4 files changed, 115 insertions(+), 25 deletions(-)

--
2.17.1


2020-04-27 08:39:23

by Dietmar Eggemann

Subject: [PATCH v2 1/6] sched/topology: Store root domain CPU capacity sum

Add sum of (original) CPU capacity of all member CPUs to root domain.

This is needed for capacity-aware SCHED_DEADLINE admission control.

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 15 +++++++++++----
2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..58e1d3903ab9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -803,6 +803,7 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;

+ unsigned long sum_cpu_capacity;
unsigned long max_cpu_capacity;

/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..fbb20b7a80c0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2052,12 +2052,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ unsigned long cap;
+
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
+ cap = rq->cpu_capacity_orig;

/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+ if (cap > READ_ONCE(d.rd->max_cpu_capacity))
+ WRITE_ONCE(d.rd->max_cpu_capacity, cap);
+
+ WRITE_ONCE(d.rd->sum_cpu_capacity,
+ READ_ONCE(d.rd->sum_cpu_capacity) + cap);

cpu_attach_domain(sd, d.rd, i);
}
@@ -2067,8 +2073,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);

if (rq && sched_debug_enabled) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+ pr_info("root domain span: %*pbl (capacity = %lu max cpu_capacity = %lu)\n",
+ cpumask_pr_args(cpu_map), rq->rd->sum_cpu_capacity,
+ rq->rd->max_cpu_capacity);
}

ret = 0;
--
2.17.1

2020-04-27 08:39:33

by Dietmar Eggemann

Subject: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()

Return the weight of the rd (root domain) span in case it is a subset
of the cpu_active_mask.

Continue to compute the number of CPUs over the rd span and cpu_active_mask
during hotplug.

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f51b0d6..4ae22bfc37ae 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
static inline int dl_bw_cpus(int i)
{
struct root_domain *rd = cpu_rq(i)->rd;
- int cpus = 0;
+ int cpus;

RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
"sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ return cpumask_weight(rd->span);
+
+ cpus = 0;
+
for_each_cpu_and(i, rd->span, cpu_active_mask)
cpus++;

--
2.17.1

2020-04-27 08:39:55

by Dietmar Eggemann

Subject: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

Capacity-aware SCHED_DEADLINE admission control needs rd (root domain)
CPU capacity sum.

The design of this function follows that of dl_bw_cpus().

That is, return the rd CPU capacity sum in case the rd span is a subset
of the cpu_active_mask.

Compute the CPU capacity sum over the rd span and cpu_active_mask during
hotplug.

Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ae22bfc37ae..eb23e6921d94 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)

return cpus;
}
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ unsigned long cap;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ return rd->sum_cpu_capacity;
+
+ cap = 0;
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask)
+ cap += capacity_orig_of(i);
+
+ return cap;
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -79,6 +98,11 @@ static inline int dl_bw_cpus(int i)
{
return 1;
}
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+ return SCHED_CAPACITY_SCALE;
+}
#endif

static inline
--
2.17.1

2020-04-27 08:40:14

by Dietmar Eggemann

Subject: [PATCH v2 4/6] sched/deadline: Improve admission control for asymmetric CPU capacities

From: Luca Abeni <[email protected]>

The current SCHED_DEADLINE (DL) admission control ensures that

sum of reserved CPU bandwidth < x * M

where

x = sched_rt_runtime_us / sched_rt_period_us
    (/proc/sys/kernel/sched_rt_{runtime,period}_us)
M = # CPUs in the root domain.

DL admission control works well for homogeneous systems where the
capacity of all CPUs is equal (1024). I.e. bounded tardiness for DL
and non-starvation of non-DL tasks are guaranteed.

But on heterogeneous systems, where the capacities of the CPUs differ,
it can fail by over-allocating CPU time on smaller-capacity CPUs.

On an Arm big.LITTLE/DynamIQ system DL tasks can easily starve other
tasks, making the system unusable.

Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.
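
A worked example (not part of the original changelog): with the default
/proc/sys/kernel/sched_rt_{runtime,period}_us of 950000/1000000
(x = 0.95) and a Hikey960-like root domain of 4 big (1024) and
4 LITTLE (462) CPUs:

old bound: x * M = 0.95 * 8 = 7.6
new bound: x * (4 * 1024 + 4 * 462) / 1024 = 0.95 * 5944 / 1024 ~= 5.5

The old bound (7.6) exceeds what these CPUs can actually provide
(5944 / 1024 ~= 5.8), which is exactly the over-allocation described
above.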

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 30 +++++++++++++++++-------------
kernel/sched/sched.h | 6 +++---
2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index eb23e6921d94..08ab28e1cefc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2581,11 +2581,12 @@ void sched_dl_do_global(void)
int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
- int cpus, err = -1;
+ int cpus, err = -1, cpu = task_cpu(p);
+ struct dl_bw *dl_b = dl_bw_of(cpu);
+ unsigned long cap;

if (attr->sched_flags & SCHED_FLAG_SUGOV)
return 0;
@@ -2600,15 +2601,17 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container.
*/
raw_spin_lock(&dl_b->lock);
- cpus = dl_bw_cpus(task_cpu(p));
+ cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, cap, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ !__dl_overflow(dl_b, cap, p->dl.dl_bw, new_bw)) {
/*
* XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total
@@ -2744,19 +2747,19 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
#ifdef CONFIG_SMP
int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
{
+ unsigned long flags, cap;
unsigned int dest_cpu;
struct dl_bw *dl_b;
bool overflow;
- int cpus, ret;
- unsigned long flags;
+ int ret;

dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);

rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ cap = dl_bw_capacity(dest_cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, p->dl.dl_bw);
if (overflow) {
ret = -EBUSY;
} else {
@@ -2766,6 +2769,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
+ int cpus = dl_bw_cpus(dest_cpu);
+
__dl_add(dl_b, p->dl.dl_bw, cpus);
ret = 0;
}
@@ -2798,16 +2803,15 @@ int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur,

bool dl_cpu_busy(unsigned int cpu)
{
- unsigned long flags;
+ unsigned long flags, cap;
struct dl_bw *dl_b;
bool overflow;
- int cpus;

rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ cap = dl_bw_capacity(cpu);
+ overflow = __dl_overflow(dl_b, cap, 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58e1d3903ab9..511edacc2282 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,11 +310,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
__dl_update(dl_b, -((s32)tsk_bw / cpus));
}

-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
+ u64 old_bw, u64 new_bw)
{
return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+ cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

extern void init_dl_bw(struct dl_bw *dl_b);
--
2.17.1

2020-04-27 08:40:58

by Dietmar Eggemann

Subject: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

From: Luca Abeni <[email protected]>

The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems where DL tasks are guaranteed
to have a bounded tardiness but presents issues on heterogeneous
systems.

A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).

Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:

CPU original capacity / 1024 >= task runtime / task deadline
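
(Worked check for the example above, assuming an implicit deadline equal
to the period: 512 / 1024 = 0.5 < 70ms / 100ms = 0.7, so the task does
not fit the 512-capacity CPU; it would need a CPU capacity of at least
0.7 * 1024 ~= 717.)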

Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.

On homogeneous systems the original behavior of the DL admission
control should be retained.

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 14 +++++++++++++-
kernel/sched/deadline.c | 18 ++++++++++++++----
kernel/sched/sched.h | 15 +++++++++++++++
3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5cc4012572ec..8630f2a40a3f 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- return 1;
+ int cpu;
+
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 1;
+
+ /* Ensure the capacity of the CPUs fits the task. */
+ for_each_cpu(cpu, later_mask) {
+ if (!dl_task_fits_capacity(p, cpu))
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ if (!cpumask_empty(later_mask))
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 08ab28e1cefc..575b7d88d839 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1634,6 +1634,7 @@ static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
+ bool select_rq;
struct rq *rq;

if (sd_flag != SD_BALANCE_WAKE)
@@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- if (unlikely(dl_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) {
+ select_rq = unlikely(dl_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 ||
+ !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ p->nr_cpus_allowed > 1;
+
+ /*
+ * Take the capacity of the CPU into account to
+ * ensure it fits the requirement of the task.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ select_rq |= !dl_task_fits_capacity(p, cpu);
+
+ if (select_rq) {
int target = find_later_rq(p);

if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 511edacc2282..ec0efd99497b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
}

+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+ return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
+}
+
extern void init_dl_bw(struct dl_bw *dl_b);
extern int sched_dl_global_validate(void);
extern void sched_dl_do_global(void);
--
2.17.1

2020-04-27 08:42:43

by Dietmar Eggemann

Subject: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

From: Luca Abeni <[email protected]>

When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPUs (later_mask), the task is doomed to
miss its deadline.

This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case, try to select the idle CPU with the largest CPU capacity
to minimize tardiness.

Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 8630f2a40a3f..b6c7a0bc0880 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- int cpu;
+ unsigned long cap, max_cap = 0;
+ int cpu, max_cpu = -1;

if (!static_branch_unlikely(&sched_asym_cpucapacity))
return 1;

/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
- if (!dl_task_fits_capacity(p, cpu))
+ if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
+
+ cap = capacity_orig_of(cpu);
+
+ if (cap > max_cap) {
+ max_cap = cap;
+ max_cpu = cpu;
+ }
+ }
}

- if (!cpumask_empty(later_mask))
- return 1;
+ if (cpumask_empty(later_mask))
+ cpumask_set_cpu(max_cpu, later_mask);
+
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);

--
2.17.1

2020-04-27 13:38:50

by Juri Lelli

Subject: Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

Hi,

On 27/04/20 10:37, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> When a task has a runtime that cannot be served within the scheduling
> deadline by any of the idle CPU (later_mask) the task is doomed to miss
> its deadline.
>
> This can happen since the SCHED_DEADLINE admission control guarantees
> only bounded tardiness and not the hard respect of all deadlines.
> In this case try to select the idle CPU with the largest CPU capacity
> to minimize tardiness.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 8630f2a40a3f..b6c7a0bc0880 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - int cpu;
> + unsigned long cap, max_cap = 0;
> + int cpu, max_cpu = -1;
>
> if (!static_branch_unlikely(&sched_asym_cpucapacity))
> return 1;
>
> /* Ensure the capacity of the CPUs fits the task. */
> for_each_cpu(cpu, later_mask) {
> - if (!dl_task_fits_capacity(p, cpu))
> + if (!dl_task_fits_capacity(p, cpu)) {
> cpumask_clear_cpu(cpu, later_mask);
> +
> + cap = capacity_orig_of(cpu);
> +
> + if (cap > max_cap) {
> + max_cap = cap;
> + max_cpu = cpu;
> + }
> + }
> }
>
> - if (!cpumask_empty(later_mask))
> - return 1;
> + if (cpumask_empty(later_mask))
> + cpumask_set_cpu(max_cpu, later_mask);

Think we touched upon this during v1 review, but I'm (still?) wondering
if we can do a little better, still considering only free cpus.

Can't we get into a situation that some of the (once free) big cpus have
been occupied by small tasks and now a big task enters the system and it
only finds small cpus available, were it could have fit into bigs if
small tasks were put onto small cpus?

I.e., shouldn't we always try to best fit among free cpus?

Thanks,

Juri

2020-04-27 14:22:00

by luca abeni

Subject: Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

Hi Juri,

On Mon, 27 Apr 2020 15:34:38 +0200
Juri Lelli <[email protected]> wrote:

> Hi,
>
> On 27/04/20 10:37, Dietmar Eggemann wrote:
> > From: Luca Abeni <[email protected]>
> >
> > When a task has a runtime that cannot be served within the
> > scheduling deadline by any of the idle CPU (later_mask) the task is
> > doomed to miss its deadline.
> >
> > This can happen since the SCHED_DEADLINE admission control
> > guarantees only bounded tardiness and not the hard respect of all
> > deadlines. In this case try to select the idle CPU with the largest
> > CPU capacity to minimize tardiness.
> >
> > Signed-off-by: Luca Abeni <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
[...]
> > - if (!cpumask_empty(later_mask))
> > - return 1;
> > + if (cpumask_empty(later_mask))
> > + cpumask_set_cpu(max_cpu, later_mask);
>
> Think we touched upon this during v1 review, but I'm (still?)
> wondering if we can do a little better, still considering only free
> cpus.
>
> Can't we get into a situation that some of the (once free) big cpus
> have been occupied by small tasks and now a big task enters the
> system and it only finds small cpus available, were it could have fit
> into bigs if small tasks were put onto small cpus?
>
> I.e., shouldn't we always try to best fit among free cpus?

Yes; there was an additional patch that tried to schedule each task on the
slowest core where it can fit, to address this issue.
But I think it will go in a second round of patches.



Luca

2020-04-29 17:43:54

by Dietmar Eggemann

Subject: Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

On 27/04/2020 16:17, luca abeni wrote:
> Hi Juri,
>
> On Mon, 27 Apr 2020 15:34:38 +0200
> Juri Lelli <[email protected]> wrote:
>
>> Hi,
>>
>> On 27/04/20 10:37, Dietmar Eggemann wrote:
>>> From: Luca Abeni <[email protected]>
>>>
>>> When a task has a runtime that cannot be served within the
>>> scheduling deadline by any of the idle CPU (later_mask) the task is
>>> doomed to miss its deadline.
>>>
>>> This can happen since the SCHED_DEADLINE admission control
>>> guarantees only bounded tardiness and not the hard respect of all
>>> deadlines. In this case try to select the idle CPU with the largest
>>> CPU capacity to minimize tardiness.
>>>
>>> Signed-off-by: Luca Abeni <[email protected]>
>>> Signed-off-by: Dietmar Eggemann <[email protected]>
> [...]
>>> - if (!cpumask_empty(later_mask))
>>> - return 1;
>>> + if (cpumask_empty(later_mask))
>>> + cpumask_set_cpu(max_cpu, later_mask);
>>
>> Think we touched upon this during v1 review, but I'm (still?)
>> wondering if we can do a little better, still considering only free
>> cpus.
>>
>> Can't we get into a situation that some of the (once free) big cpus
>> have been occupied by small tasks and now a big task enters the
>> system and it only finds small cpus available, were it could have fit
>> into bigs if small tasks were put onto small cpus?
>>
>> I.e., shouldn't we always try to best fit among free cpus?
>
> Yes; there was an additional patch that tried schedule each task on the
> slowest core where it can fit, to address this issue.
> But I think it will go in a second round of patches.

Yes, we can run into this situation in DL, but also in CFS or RT.

IMHO, this patch is aligned with the Capacity Awareness implementation
in CFS and RT.

Capacity Awareness so far is 'find a CPU which fits the requirement of
the task (Req)'. It's not (yet) 'find the best CPU'.

CFS - select_idle_capacity() -> task_fits_capacity()

Req: util(p) * 1.25 < capacity_of(cpu)

RT - select_task_rq_rt(), cpupri_find_fitness() ->
rt_task_fits_capacity()

Req: uclamp_eff_value(p) <= capacity_orig_of(cpu)

DL - select_task_rq_dl(), cpudl_find() -> dl_task_fits_capacity()

Req: dl_runtime(p)/dl_deadline(p) * 1024 <= capacity_orig_of(cpu)


There has to be an "idle" (from the viewpoint of the task) CPU available
with a fitting capacity. Otherwise a fallback mechanism applies.

CFS - best capacity handling in select_idle_capacity().

RT - Non-fitting lowest mask

DL - This patch

You did spot the rt-app 'delay' for the small tasks in the test case ;-)

2020-04-30 10:57:22

by Pavankumar Kondeti

Subject: Re: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()

On Mon, Apr 27, 2020 at 10:37:05AM +0200, Dietmar Eggemann wrote:
> Return the weight of the rd (root domain) span in case it is a subset
> of the cpu_active_mask.
>
> Continue to compute the number of CPUs over rd span and cpu_active_mask
> when in hotplug.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/deadline.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 504d2f51b0d6..4ae22bfc37ae 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
> static inline int dl_bw_cpus(int i)
> {
> struct root_domain *rd = cpu_rq(i)->rd;
> - int cpus = 0;
> + int cpus;
>
> RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> "sched RCU must be held");
> +
> + if (cpumask_subset(rd->span, cpu_active_mask))
> + return cpumask_weight(rd->span);
> +

Looks good to me. This is a nice optimization.

> + cpus = 0;
> +
> for_each_cpu_and(i, rd->span, cpu_active_mask)
> cpus++;
>
Do you know why this check is in place? Is it only to cover
the case of cpuset_cpu_inactive()->dl_cpu_busy()?

Thanks,
Pavan
> --
> 2.17.1
>

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-04-30 11:03:12

by Pavankumar Kondeti

Subject: Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

On Wed, Apr 29, 2020 at 07:39:50PM +0200, Dietmar Eggemann wrote:
> On 27/04/2020 16:17, luca abeni wrote:
> > Hi Juri,
> >
> > On Mon, 27 Apr 2020 15:34:38 +0200
> > Juri Lelli <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> On 27/04/20 10:37, Dietmar Eggemann wrote:
> >>> From: Luca Abeni <[email protected]>
> >>>
> >>> When a task has a runtime that cannot be served within the
> >>> scheduling deadline by any of the idle CPU (later_mask) the task is
> >>> doomed to miss its deadline.
> >>>
> >>> This can happen since the SCHED_DEADLINE admission control
> >>> guarantees only bounded tardiness and not the hard respect of all
> >>> deadlines. In this case try to select the idle CPU with the largest
> >>> CPU capacity to minimize tardiness.
> >>>
> >>> Signed-off-by: Luca Abeni <[email protected]>
> >>> Signed-off-by: Dietmar Eggemann <[email protected]>
> > [...]
> >>> - if (!cpumask_empty(later_mask))
> >>> - return 1;
> >>> + if (cpumask_empty(later_mask))
> >>> + cpumask_set_cpu(max_cpu, later_mask);
> >>
> >> Think we touched upon this during v1 review, but I'm (still?)
> >> wondering if we can do a little better, still considering only free
> >> cpus.
> >>
> >> Can't we get into a situation that some of the (once free) big cpus
> >> have been occupied by small tasks and now a big task enters the
> >> system and it only finds small cpus available, were it could have fit
> >> into bigs if small tasks were put onto small cpus?
> >>
> >> I.e., shouldn't we always try to best fit among free cpus?
> >
> > Yes; there was an additional patch that tried schedule each task on the
> > slowest core where it can fit, to address this issue.
> > But I think it will go in a second round of patches.
>
> Yes, we can run into this situation in DL, but also in CFS or RT.
>
In the CFS case, the misfit task handling in the load balancer should help
pull the BIG task running on the little CPUs. I get your point that we can
run into the same scenario with tasks of the other scheduling classes.

> IMHO, this patch is aligned with the Capacity Awareness implementation
> in CFS and RT.
>
> Capacity Awareness so far is 'find a CPU which fits the requirement of
> the task (Req)'. It's not (yet) find the best CPU.
>
> CFS - select_idle_capacity() -> task_fits_capacity()
>
> Req: util(p) * 1.25 < capacity_of(cpu)
>
> RT - select_task_rq_rt(), cpupri_find_fitness() ->
> rt_task_fits_capacity()
>
> Req: uclamp_eff_value(p) <= capacity_orig_of(cpu)
>
> DL - select_task_rq_dl(), cpudl_find() -> dl_task_fits_capacity()
>
> Req: dl_runtime(p)/dl_deadline(p) * 1024 <= capacity_orig_of(cpu)
>
>
> There has to be an "idle" (from the viewpoint of the task) CPU available
> with a fitting capacity. Otherwise a fallback mechanism applies.
>
> CFS - best capacity handling in select_idle_capacity().
>
> RT - Non-fitting lowest mask
>
> DL - This patch
>
> You did spot the rt-app 'delay' for the small tasks in the test case ;-)

Thanks for the hint. It was not clear to me why a 1 msec delay is given for
the small tasks in the rt-app json description in the cover letter.
I get it now :-)

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-04-30 13:13:30

by Pavankumar Kondeti

Subject: Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
> algorithm w/o considering CPU capacity or task utilization.
> This works well on homogeneous systems where DL tasks are guaranteed
> to have a bounded tardiness but presents issues on heterogeneous
> systems.
>
> A DL task can migrate to a CPU which does not have enough CPU capacity
> to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
> period on a CPU w/ 512 capacity).
>
> Add the DL fitness function dl_task_fits_capacity() for DL admission
> control on heterogeneous systems. A task fits onto a CPU if:
>
> CPU original capacity / 1024 >= task runtime / task deadline
>
> Use this function on heterogeneous systems to try to find a CPU which
> meets this criterion during task wakeup, push and offline migration.
>
> On homogeneous systems the original behavior of the DL admission
> control should be retained.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 14 +++++++++++++-
> kernel/sched/deadline.c | 18 ++++++++++++++----
> kernel/sched/sched.h | 15 +++++++++++++++
> 3 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5cc4012572ec..8630f2a40a3f 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - return 1;
> + int cpu;
> +
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fits the task. */
> + for_each_cpu(cpu, later_mask) {
> + if (!dl_task_fits_capacity(p, cpu))
> + cpumask_clear_cpu(cpu, later_mask);
> + }
> +
> + if (!cpumask_empty(later_mask))
> + return 1;
> } else {
> int best_cpu = cpudl_maximum(cp);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 08ab28e1cefc..575b7d88d839 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1634,6 +1634,7 @@ static int
> select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> + bool select_rq;
> struct rq *rq;
>
> if (sd_flag != SD_BALANCE_WAKE)
> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> * other hand, if it has a shorter deadline, we
> * try to make it stay here, it might be important.
> */
> - if (unlikely(dl_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> - (p->nr_cpus_allowed > 1)) {
> + select_rq = unlikely(dl_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 ||
> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
> + p->nr_cpus_allowed > 1;
> +
> + /*
> + * Take the capacity of the CPU into account to
> + * ensure it fits the requirement of the task.
> + */
> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> + select_rq |= !dl_task_fits_capacity(p, cpu);
> +
> + if (select_rq) {
> int target = find_later_rq(p);

I see that find_later_rq() checks if the previous CPU is part of
later_mask and returns it immediately. So we don't migrate the
task in the case where the previous CPU can't fit the task and
there are no idle CPUs on which the task can fit. LGTM.

>
> if (target != -1 &&
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 511edacc2282..ec0efd99497b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap,
> cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
> }
>
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the
> + * CPU original capacity and the runtime/deadline ratio of the task.
> + *
> + * The function will return true if the CPU original capacity of the
> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> + * task and false otherwise.
> + */
> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> + return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> +}
> +

This is the same as

return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap

Correct? If yes, would it be better to use this?

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-01 16:14:03

by Dietmar Eggemann

Subject: Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

On 30/04/2020 15:10, Pavan Kondeti wrote:
> On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
>> From: Luca Abeni <[email protected]>

[...]

>> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>> * other hand, if it has a shorter deadline, we
>> * try to make it stay here, it might be important.
>> */
>> - if (unlikely(dl_task(curr)) &&
>> - (curr->nr_cpus_allowed < 2 ||
>> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> - (p->nr_cpus_allowed > 1)) {
>> + select_rq = unlikely(dl_task(curr)) &&
>> + (curr->nr_cpus_allowed < 2 ||
>> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> + p->nr_cpus_allowed > 1;
>> +
>> + /*
>> + * Take the capacity of the CPU into account to
>> + * ensure it fits the requirement of the task.
>> + */
>> + if (static_branch_unlikely(&sched_asym_cpucapacity))
>> + select_rq |= !dl_task_fits_capacity(p, cpu);
>> +
>> + if (select_rq) {
>> int target = find_later_rq(p);
>
> I see that find_later_rq() checks if the previous CPU is part of
> later_mask and returns it immediately. So we don't migrate the
> task in the case where there previous CPU can't fit the task and
> there are no idle CPUs on which the task can fit. LGTM.

Hope I understand you here. I don't think that [patch 6/6] provides this
already.

In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
otherwise empty 'later_mask'. But 'max_cpu' is not necessarily task_cpu(p).

Example on Juno [L b b L L L] with thread0-0 (big task)

cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0

A tweak could be added to favor task_cpu(p) in case it is amongst the CPUs
with the maximum capacity in cpudl_find() for the !fit case.

[...]

>> +/*
>> + * Verify the fitness of task @p to run on @cpu taking into account the
>> + * CPU original capacity and the runtime/deadline ratio of the task.
>> + *
>> + * The function will return true if the CPU original capacity of the
>> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
>> + * task and false otherwise.
>> + */
>> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
>> +{
>> + unsigned long cap = arch_scale_cpu_capacity(cpu);
>> +
>> + return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
>> +}
>> +
>
> This is same as
>
> return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
>
> Correct? If yes, would it be better to use this?

We could use sched_dl_entity::dl_density (dl_runtime / dl_deadline) but
then I would have to export BW_SHIFT.
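
For completeness, a sketch of why the two forms agree (assuming the
mainline definitions dl_density = to_ratio(dl_deadline, dl_runtime) =
(dl_runtime << BW_SHIFT) / dl_deadline, BW_SHIFT = 20 and
SCHED_CAPACITY_SHIFT = 10; integer rounding ignored):

dl_density >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
<=> (dl_runtime * 2^20 / dl_deadline) / 2^10 <= cap
<=> dl_runtime * 1024 <= cap * dl_deadline
<=> cap_scale(dl_deadline, cap) >= dl_runtime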

2020-05-01 16:14:46

by Dietmar Eggemann

Subject: Re: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()

On 30/04/2020 12:55, Pavan Kondeti wrote:
> On Mon, Apr 27, 2020 at 10:37:05AM +0200, Dietmar Eggemann wrote:

[..]

>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 504d2f51b0d6..4ae22bfc37ae 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
>> static inline int dl_bw_cpus(int i)
>> {
>> struct root_domain *rd = cpu_rq(i)->rd;
>> - int cpus = 0;
>> + int cpus;
>>
>> RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
>> "sched RCU must be held");
>> +
>> + if (cpumask_subset(rd->span, cpu_active_mask))
>> + return cpumask_weight(rd->span);
>> +
>
> Looks good to me. This is a nice optimization.
>
>> + cpus = 0;
>> +
>> for_each_cpu_and(i, rd->span, cpu_active_mask)
>> cpus++;
>>
> Do you know why this check is in place? Is it only to cover
> the case of cpuset_cpu_inactive()->dl_cpu_busy()?

It should cover:

(1) Preventing CPU hotplug (hp) when DL detects a possible overflow w/o the CPU:

sched_cpu_deactivate() -> cpuset_cpu_inactive() -> dl_cpu_busy() ->
dl_bw_cpus() [now replaced by dl_bw_capacity()].

(2) DL Admission Control in CPU HP:

__sched_setscheduler() -> sched_dl_overflow() -> dl_bw_cpus()
[now + -> dl_bw_capacity()]

(3) In create/destroy exclusive cpuset scenarios (see the comment in
set_cpus_allowed_dl()), although I wasn't able to provoke this so
far:

do_set_cpus_allowed() -> p->sched_class->set_cpus_allowed() was
never called when I ran a DL testcase and created/destroyed exclusive
cpusets at the same time?

2020-05-01 16:16:42

by Dietmar Eggemann

Subject: Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

On 30/04/2020 13:00, Pavan Kondeti wrote:
> On Wed, Apr 29, 2020 at 07:39:50PM +0200, Dietmar Eggemann wrote:
>> On 27/04/2020 16:17, luca abeni wrote:

[...]

>>> On Mon, 27 Apr 2020 15:34:38 +0200
>>> Juri Lelli <[email protected]> wrote:

[...]

>>>> On 27/04/20 10:37, Dietmar Eggemann wrote:
>>>>> From: Luca Abeni <[email protected]>

[...]

>>>>> - if (!cpumask_empty(later_mask))
>>>>> - return 1;
>>>>> + if (cpumask_empty(later_mask))
>>>>> + cpumask_set_cpu(max_cpu, later_mask);
>>>>
>>>> Think we touched upon this during v1 review, but I'm (still?)
>>>> wondering if we can do a little better, still considering only free
>>>> cpus.
>>>>
>>>> Can't we get into a situation that some of the (once free) big cpus
>>>> have been occupied by small tasks and now a big task enters the
>>>> system and it only finds small cpus available, were it could have fit
>>>> into bigs if small tasks were put onto small cpus?
>>>>
>>>> I.e., shouldn't we always try to best fit among free cpus?
>>>
>>> Yes; there was an additional patch that tried schedule each task on the
>>> slowest core where it can fit, to address this issue.
>>> But I think it will go in a second round of patches.
>>
>> Yes, we can run into this situation in DL, but also in CFS or RT.
>>
> In CFS case, the misfit task handling in load balancer should help pulling
> the BIG task running on the little CPUs. I get your point that we can run
> into the same scenario with other scheduling class tasks.

Yes, the CPU stopper (i.e. CFS's active load balance) can help here.
IMHO, using the CPU stopper in RT/DL for moving the running task (in
addition to using a best-fit rather than a just-fitting CPU) is
considered future work.
AFAICS, push/pull is not designed for migration of running tasks.

[...]

>> You did spot the rt-app 'delay' for the small tasks in the test case ;-)
>
> Thanks for the hint. It was not clear to me why 1 msec delay is given for
> the small tasks in the rt-app json description in the cover letter.
> I get it now :-)

So far, capacity awareness in RT/DL means that as long as there are CPUs
available which fit the task, one of them is used. This is already
beneficial for a lot of use cases on CPU asymmetric systems since it
offers more predictable behavior.

I'll add a note to the cover letter in the next version about the reason
for the rt-app 'delay'.

2020-05-04 04:42:24

by Pavankumar Kondeti

Subject: Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

On Fri, May 01, 2020 at 06:12:07PM +0200, Dietmar Eggemann wrote:
> On 30/04/2020 15:10, Pavan Kondeti wrote:
> > On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> >> From: Luca Abeni <[email protected]>
>
> [...]
>
> >> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >> * other hand, if it has a shorter deadline, we
> >> * try to make it stay here, it might be important.
> >> */
> >> - if (unlikely(dl_task(curr)) &&
> >> - (curr->nr_cpus_allowed < 2 ||
> >> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> - (p->nr_cpus_allowed > 1)) {
> >> + select_rq = unlikely(dl_task(curr)) &&
> >> + (curr->nr_cpus_allowed < 2 ||
> >> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> + p->nr_cpus_allowed > 1;
> >> +
> >> + /*
> >> + * Take the capacity of the CPU into account to
> >> + * ensure it fits the requirement of the task.
> >> + */
> >> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> >> + select_rq |= !dl_task_fits_capacity(p, cpu);
> >> +
> >> + if (select_rq) {
> >> int target = find_later_rq(p);
> >
> > I see that find_later_rq() checks if the previous CPU is part of
> > later_mask and returns it immediately. So we don't migrate the
> > task in the case where there previous CPU can't fit the task and
> > there are no idle CPUs on which the task can fit. LGTM.
>
> Hope I understand you here. I don't think that [patch 6/6] provides this
> already.
>
> In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
> otherwise empty 'later_mask'. But 'max_cpu' is not necessary task_cpu(p).
>
> Example on Juno [L b b L L L] with thread0-0 (big task)
>
> cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
> find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0
>
> A tweak could be added favor task_cpu(p) in case it is amongst the CPUs
> with the maximum capacity in cpudl_find() for the !fit case.
>

You are right. max_cpu can be other than task_cpu(p) in which case we
migrate the task though it won't fit on the new CPU. While introducing
capacity awareness in RT, Qais made the below change to avoid the
migration. We can do something similar here also.

commit b28bc1e002c2 (sched/rt: Re-instate old behavior in select_task_rq_rt())

> [...]
>
> >> +/*
> >> + * Verify the fitness of task @p to run on @cpu taking into account the
> >> + * CPU original capacity and the runtime/deadline ratio of the task.
> >> + *
> >> + * The function will return true if the CPU original capacity of the
> >> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> >> + * task and false otherwise.
> >> + */
> >> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> >> +{
> >> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> >> +
> >> + return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> >> +}
> >> +
> >
> > This is same as
> >
> > return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
> >
> > Correct? If yes, would it be better to use this?
>
> We could use sched_dl_entity::dl_density (dl_runtime / dl_deadline) but
> then I would have to export BW_SHIFT.

Yeah, I meant dl_density not dl_bw. Thanks for correcting me. I see that
BW_SHIFT is defined in kernel/sched/sched.h

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

2020-05-05 18:07:28

by Dietmar Eggemann

Subject: Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

On 04/05/2020 05:58, Pavan Kondeti wrote:
> On Fri, May 01, 2020 at 06:12:07PM +0200, Dietmar Eggemann wrote:
>> On 30/04/2020 15:10, Pavan Kondeti wrote:
>>> On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
>>>> From: Luca Abeni <[email protected]>
>>
>> [...]
>>
>>>> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>>>> * other hand, if it has a shorter deadline, we
>>>> * try to make it stay here, it might be important.
>>>> */
>>>> - if (unlikely(dl_task(curr)) &&
>>>> - (curr->nr_cpus_allowed < 2 ||
>>>> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
>>>> - (p->nr_cpus_allowed > 1)) {
>>>> + select_rq = unlikely(dl_task(curr)) &&
>>>> + (curr->nr_cpus_allowed < 2 ||
>>>> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
>>>> + p->nr_cpus_allowed > 1;
>>>> +
>>>> + /*
>>>> + * Take the capacity of the CPU into account to
>>>> + * ensure it fits the requirement of the task.
>>>> + */
>>>> + if (static_branch_unlikely(&sched_asym_cpucapacity))
>>>> + select_rq |= !dl_task_fits_capacity(p, cpu);
>>>> +
>>>> + if (select_rq) {
>>>> int target = find_later_rq(p);
>>>
>>> I see that find_later_rq() checks if the previous CPU is part of
>>> later_mask and returns it immediately. So we don't migrate the
>>> task in the case where there previous CPU can't fit the task and
>>> there are no idle CPUs on which the task can fit. LGTM.
>>
>> Hope I understand you here. I don't think that [patch 6/6] provides this
>> already.
>>
>> In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
>> otherwise empty 'later_mask'. But 'max_cpu' is not necessary task_cpu(p).
>>
>> Example on Juno [L b b L L L] with thread0-0 (big task)
>>
>> cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
>> find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0
>>
>> A tweak could be added favor task_cpu(p) in case it is amongst the CPUs
>> with the maximum capacity in cpudl_find() for the !fit case.
>>
>
> You are right. max_cpu can be other than task_cpu(p) in which case we
> migrate the task though it won't fit on the new CPU. While introducing
> capacity awareness in RT, Quais made the below change to avoid the
> migration. We can do something similar here also.
>
> commit b28bc1e002c2 (sched/rt: Re-instate old behavior in select_task_rq_rt())

I'm not sure something like this is necessary here.

With DL capacity awareness we reduce the later_mask returned by
cpudl_find() in find_later_rq() to those idle CPUs which can handle p
or, in case there is none, to (the/a) 'non-fitting CPU w/ max capacity'.

We just have to favor task_cpu(p) in [patch 6/6] in case it is part
of the initial later_mask and among these 'non-fitting CPUs w/ max
capacity'.

This will make sure that it gets chosen so find_later_rq() returns it
before the 'for_each_domain()' loop.

And I guess we still want to migrate if there is a non-fitting CPU w/ a
higher CPU capacity than task_cpu() (tri-gear).

@@ -137,7 +137,8 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

cap = capacity_orig_of(cpu);

- if (cap > max_cap) {
+ if (cap > max_cap ||
+ (cpu == task_cpu(p) && cap == max_cap)) {
max_cap = cap;
max_cpu = cpu;

In case task_cpu() is not part of later_mask, the 'max CPU capacity CPU' is
returned as 'best_cpu'.

[...]

2020-05-06 11:14:20

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 4ae22bfc37ae..eb23e6921d94 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)
>
> return cpus;
> }
> +
> +static inline unsigned long dl_bw_capacity(int i)
> +{
> + struct root_domain *rd = cpu_rq(i)->rd;
> + unsigned long cap;
> +
> + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> + "sched RCU must be held");
> +
> + if (cpumask_subset(rd->span, cpu_active_mask))
> + return rd->sum_cpu_capacity;
> +
> + cap = 0;
> +
> + for_each_cpu_and(i, rd->span, cpu_active_mask)
> + cap += capacity_orig_of(i);
> +
> + return cap;
> +}

There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
latter is needed to demonstrate the problem since DL task affinity can't
be altered.

A CPU in such a cpuset has its rq attached to def_root_domain which does
not have its 'sum_cpu_capacity' properly set.

root@juno:~# bash
root@juno:~# ps -eo comm,pid,class | grep bash
bash 1661 TS
bash 2040 TS
bash 2176 TS <--

root@juno:~# echo 2176 > /sys/fs/cgroup/cpuset/B/tasks

root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 2176
chrt: failed to set pid 2176's policy: Device or resource busy

...
sched_dl_overflow: [bash 2176] task_cpu=4 cpus_ptr=2,4-5
dl_bw_capacity() CPU4 dflt_rd->sum_cpu_capacity=0 <-- !!! dflt_rd->span=2,4-5 cpu_active_mask=0-5
...

OTOH, rd->span is properly set due to 'cpumask_clear_cpu(rq->cpu,
old_rd->span)' and 'cpumask_set_cpu(rq->cpu, rd->span)' in rq_attach_root().

It's not possible to treat 'rd->sum_cpu_capacity' like 'rd->span' since
the former changes between sched domain teardown/bringup w/ asymmetric
CPU capacity.

What could be done is to return 'dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT'
w/ symmetric CPU capacity (of 1024) and to loop over rd->span otherwise.
The latter includes symmetric cpusets w/ only little CPUs.

---8<---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 575b7d88d839..6d17748cb7a1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -70,24 +70,28 @@ static inline int dl_bw_cpus(int i)
return cpus;
}

-static inline unsigned long dl_bw_capacity(int i)
-{
+static inline unsigned long __dl_bw_capacity(int i) {
struct root_domain *rd = cpu_rq(i)->rd;
- unsigned long cap;
+ unsigned long cap = 0;

RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
"sched RCU must be held");

- if (cpumask_subset(rd->span, cpu_active_mask))
- return rd->sum_cpu_capacity;
-
- cap = 0;
-
for_each_cpu_and(i, rd->span, cpu_active_mask)
cap += capacity_orig_of(i);

return cap;
}
+
+static inline unsigned long dl_bw_capacity(int i)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity) &&
+ capacity_orig_of(i) == SCHED_CAPACITY_SCALE) {
+ return dl_bw_cpus(i) << SCHED_CAPACITY_SHIFT;
+ } else {
+ return __dl_bw_capacity(i);
+ }
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
--
2.17.1

2020-05-06 12:39:41

by Juri Lelli

Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 06/05/20 12:54, Dietmar Eggemann wrote:
> On 27/04/2020 10:37, Dietmar Eggemann wrote:
>
> [...]
>
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 4ae22bfc37ae..eb23e6921d94 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -69,6 +69,25 @@ static inline int dl_bw_cpus(int i)
> >
> > return cpus;
> > }
> > +
> > +static inline unsigned long dl_bw_capacity(int i)
> > +{
> > + struct root_domain *rd = cpu_rq(i)->rd;
> > + unsigned long cap;
> > +
> > + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> > + "sched RCU must be held");
> > +
> > + if (cpumask_subset(rd->span, cpu_active_mask))
> > + return rd->sum_cpu_capacity;
> > +
> > + cap = 0;
> > +
> > + for_each_cpu_and(i, rd->span, cpu_active_mask)
> > + cap += capacity_orig_of(i);
> > +
> > + return cap;
> > +}
>
> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
> latter is needed to demonstrate the problem since DL task affinity can't
> be altered.
>
> A CPU in such a cpuset has its rq attached to def_root_domain which does
> not have its 'sum_cpu_capacity' properly set.

Hummm, but if sched_load_balance is disabled it means that we've now got
a subset of CPUs which (from a DL AC PoV) are partitioned. So, I'd tend
to say that we actually want to check a new task's bw requirement against
the available bandwidth of the particular CPU it happens to be running
on (and will continue to run on) when setscheduler is called.

If load balance is then enabled again, the AC check we did above should
still be valid for all tasks admitted in the meantime, no?

2020-05-06 15:14:33

by Dietmar Eggemann

Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 06/05/2020 14:37, Juri Lelli wrote:
> On 06/05/20 12:54, Dietmar Eggemann wrote:
>> On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

>> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
>> latter is needed to demonstrate the problem since DL task affinity can't
>> be altered.
>>
>> A CPU in such a cpuset has its rq attached to def_root_domain which does
>> not have its 'sum_cpu_capacity' properly set.
>
> Hummm, but if sched_load_balance is disabled it means that we've now got
> a subset of CPUs which (from a DL AC pow) are partitioned. So, I'd tend

Yes, the CPUs of the cpuset w/ cpuset.sched_load_balance=0 (cpuset B in
the example).

> to say that we actually want to check new tasks bw requirement against
> the available bandwidth of the particular CPU they happen to be running
> (and will continue to run) when setscheduler is called.

By 'available bandwidth of the particular CPU' you refer to
'\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?

This is what this fix tries to achieve. Regardless of whether cpu_rq(i)->rd
is a 'real' rd or the def_root_domain, dl_bw_capacity() will now always
return '\Sum_{cpu_rq(i)->rd->span} CPU capacity'

> If then load balance is enabled again, AC check we did above should
> still be valid for all tasks admitted in the meantime, no?

Example (w/ this fix) on Juno [L b b L L L], capacity_orig_of(L)=446 :

mkdir /sys/fs/cgroup/cpuset/A
echo 0 > /sys/fs/cgroup/cpuset/A/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/A/cpuset.cpu_exclusive
echo 0-2 > /sys/fs/cgroup/cpuset/A/cpuset.cpus

mkdir /sys/fs/cgroup/cpuset/B
echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.cpu_exclusive
echo 3-5 > /sys/fs/cgroup/cpuset/B/cpuset.cpus

echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

echo $$ > /sys/fs/cgroup/cpuset/B/tasks
chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$

...
[ 144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
[ 144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
[ 144.932841] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
...

echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance

echo $$ > /sys/fs/cgroup/cpuset/B/tasks
chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$

...
[ 254.367982] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
[ 254.373487] sched_dl_overflow: [bash 2052] task_cpu(p)=5 cap=1338 cpus_ptr=3-5
[ 254.380721] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
...

Regardless of 'B/cpuset.sched_load_balance',
'\Sum_{cpu_rq(i)->rd->span} CPU capacity' stays 1338 (3*446).

So IMHO, the DL AC check stays intact.
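
To put rough numbers on it (assuming the default 95% DL/RT bandwidth
limit; these figures are illustrative, not taken from the logs above):

  task bw    = 8000/16000   = 0.5   (512 in 1024-based units)
  allowed bw = 0.95 * 1338  ~ 1271  (in the same units)

so two such tasks fit (1024 <= 1271) while a third (1536) would be
rejected, independent of 'B/cpuset.sched_load_balance'.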

2020-05-11 08:04:24

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 06/05/20 17:09, Dietmar Eggemann wrote:
> On 06/05/2020 14:37, Juri Lelli wrote:
> > On 06/05/20 12:54, Dietmar Eggemann wrote:
> >> On 27/04/2020 10:37, Dietmar Eggemann wrote:
>
> [...]
>
> >> There is an issue w/ excl. cpusets and cpuset.sched_load_balance=0. The
> >> latter is needed to demonstrate the problem since DL task affinity can't
> >> be altered.
> >>
> >> A CPU in such a cpuset has its rq attached to def_root_domain which does
> >> not have its 'sum_cpu_capacity' properly set.
> >
> > Hummm, but if sched_load_balance is disabled it means that we've now got
> > a subset of CPUs which (from a DL AC pow) are partitioned. So, I'd tend
>
> Yes, the CPUs of the cpuset w/ cpuset.sched_load_balance=0 (cpuset B in
> the example).
>
> > to say that we actually want to check new tasks bw requirement against
> > the available bandwidth of the particular CPU they happen to be running
> > (and will continue to run) when setscheduler is called.
>
> By 'available bandwidth of the particular CPU' you refer to
> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?

No. I was referring to the single CPU capacity: the capacity of the CPU
where a task is running when setscheduler is called for it (and DL AC is
performed). See below; maybe it's clearer why I wondered about this case.

> This is what this fix tries to achieve. Regardless whether cpu_rq(i)->rd
> is a 'real' rd or the def_root_domain, dl_bw_capacity() will now always
> return '\Sum_{cpu_rq(i)->rd->span} CPU capacity'
>
> > If then load balance is enabled again, AC check we did above should
> > still be valid for all tasks admitted in the meantime, no?
>
> Example (w/ this fix) on Juno [L b b L L L], capacity_orig_of(L)=446 :
>
> mkdir /sys/fs/cgroup/cpuset/A
> echo 0 > /sys/fs/cgroup/cpuset/A/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpuset/A/cpuset.cpu_exclusive
> echo 0-2 > /sys/fs/cgroup/cpuset/A/cpuset.cpus
>
> mkdir /sys/fs/cgroup/cpuset/B
> echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.mems
> echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.cpu_exclusive
> echo 3-5 > /sys/fs/cgroup/cpuset/B/cpuset.cpus
>
> echo 0 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
> echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
>
> echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
>
> ...
> [ 144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> [ 144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5

So, here you are checking the new task's bw against 1338, which is 3*L
capacity. However, since load balance is disabled at this point for 3-5,
once admitted the task will only be able to run on CPU 3. Now, if more
tasks on CPU 3 are admitted the same way (up to 1338), I believe they
will start to experience deadline misses because only 446 will actually
be available to them until load balance is enabled below and they
are then free to migrate to CPUs 4 and 5.
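
(In rough numbers: two such 8000/16000 tasks ask for 2 * 512 = 1024
capacity units, while CPU 3 alone only offers 446, so roughly half of
the requested runtime cannot be served while they are stuck there.)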

Does it make sense?

> [ 144.932841] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> ...
>
> echo 1 > /sys/fs/cgroup/cpuset/B/cpuset.sched_load_balance
>
> echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
>
> ...
> [ 254.367982] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
> [ 254.373487] sched_dl_overflow: [bash 2052] task_cpu(p)=5 cap=1338 cpus_ptr=3-5
> [ 254.380721] __dl_bw_capacity CPU5 rd->span=3-5 return 1338
> ...
>
> Regardless of 'B/cpuset.sched_load_balance'
> '\Sum_{cpu_rq(i)->rd->span} CPU capacity' stays 1338 (3*446)
>
> So IMHO, DL AC check stays intact.
>

2020-05-12 12:41:28

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 11/05/2020 10:01, Juri Lelli wrote:
> On 06/05/20 17:09, Dietmar Eggemann wrote:
>> On 06/05/2020 14:37, Juri Lelli wrote:
>>> On 06/05/20 12:54, Dietmar Eggemann wrote:
>>>> On 27/04/2020 10:37, Dietmar Eggemann wrote:

[...]

>>> to say that we actually want to check new tasks bw requirement against
>>> the available bandwidth of the particular CPU they happen to be running
>>> (and will continue to run) when setscheduler is called.
>>
>> By 'available bandwidth of the particular CPU' you refer to
>> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?
>
> No. I was referring to the single CPU capacity. The capacity of the CPU
> where a task is running when setscheduler is called for it (and DL AC
> performed). See below, maybe more clear why I wondered about this case..

OK, got it! I was just confused since I don't think that this patch
introduced the issue.

Before the patch, 'int cpus = dl_bw_cpus(task_cpu(p))' was used, which
returns the number of CPUs in the (default) rd (n). So for a single CPU
(capacity 1024) we used n*1024.

I wonder if a fix for that should be part of this patch-set?
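
Schematically, the change under discussion (a sketch of the idea only,
not the actual diff):

	/* before: every CPU in the rd assumed to be a full 1024 */
	cap = dl_bw_cpus(cpu) << SCHED_CAPACITY_SHIFT;	/* n * 1024 */

	/* after: asymmetric capacities summed over the rd span */
	cap = dl_bw_capacity(cpu);			/* e.g. 3 * 446 = 1338 */

Either way the value is rd-wide, which is why the 'task effectively
pinned to one CPU' problem predates this series.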

[...]

>> ...
>> [ 144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
>> [ 144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
>
> So, here you are checking new task bw against 1338 which is 3*L
> capacity. However, since load balance is disabled at this point for 3-5,
> once admitted the task will only be able to run on CPU 3. Now, if more
> tasks on CPU 3 are admitted the same way (up to 1338) I believe they
> will start to experience deadline misses because only 446 will be
> actually available to them until load balance is enabled below and they
> are then free to migrate to CPUs 4 and 5.
>
> Does it makes sense?

Yes, it does.

So my first idea was to only consider the CPU (i.e. its CPU capacity) in
case we detect 'cpu_rq(cpu)->rd == def_root_domain'?
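
A sketch of what that could look like in dl_bw_capacity() (my
illustration, not tested code):

	if (cpu_rq(i)->rd == &def_root_domain)
		return capacity_orig_of(i);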

If I re-enable load balancing on cpuset '/', we can't make a task
in cpuset 'B' DL since we hit this check in __sched_setscheduler():

4931 /*
4932 * Don't allow tasks with an affinity mask smaller than
4933 * the entire root_domain to become SCHED_DEADLINE.
...
4935 */
4936 if (!cpumask_subset(span, p->cpus_ptr) || ...

root@juno:~# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
root@juno:~# echo $$ > /sys/fs/cgroup/cpuset/B/tasks
root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
chrt: failed to set pid 2316's policy: Operation not permitted

So this task has to leave 'B' first I assume.

[...]

2020-05-15 12:31:11

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH v2 3/6] sched/deadline: Add dl_bw_capacity()

On 12/05/20 14:39, Dietmar Eggemann wrote:
> On 11/05/2020 10:01, Juri Lelli wrote:
> > On 06/05/20 17:09, Dietmar Eggemann wrote:
> >> On 06/05/2020 14:37, Juri Lelli wrote:
> >>> On 06/05/20 12:54, Dietmar Eggemann wrote:
> >>>> On 27/04/2020 10:37, Dietmar Eggemann wrote:
>
> [...]
>
> >>> to say that we actually want to check new tasks bw requirement against
> >>> the available bandwidth of the particular CPU they happen to be running
> >>> (and will continue to run) when setscheduler is called.
> >>
> >> By 'available bandwidth of the particular CPU' you refer to
> >> '\Sum_{cpu_rq(i)->rd->span} CPU capacity', right?
> >
> > No. I was referring to the single CPU capacity. The capacity of the CPU
> > where a task is running when setscheduler is called for it (and DL AC
> > performed). See below, maybe more clear why I wondered about this case..
>
> OK, got it! I was just confused since I don't think that this patch
> introduced the issue.
>
> Before the patch 'int cpus = dl_bw_cpus(task_cpu(p))' was used which
> returns the number of cpus on the (default) rd (n). So for a single CPU
> (1024) we use n*1024.
>
> I wonder if a fix for that should be part of this patch-set?

Not really, I guess. As you said, the issue was there already. We can
fix both situations with a subsequent patch. I just realized while
reviewing this set that we have a problem, but it's not this set's job
to fix it.

While you are changing this part, it might be good to put a comment
(XXX fix this, or something) about the issue, so that we don't forget.
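
Something along these lines, perhaps (my wording, only to sketch the
idea):

	/*
	 * XXX: when the rq is attached to def_root_domain (e.g. load
	 * balancing disabled in an exclusive cpuset), summing the
	 * capacities of the whole span over-estimates what a task that
	 * effectively stays on a single CPU can actually use.
	 */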

> [...]
>
> >> ...
> >> [ 144.920102] __dl_bw_capacity CPU3 rd->span=3-5 return 1338
> >> [ 144.925607] sched_dl_overflow: [bash 1999] task_cpu(p)=3 cap=1338 cpus_ptr=3-5
> >
> > So, here you are checking new task bw against 1338 which is 3*L
> > capacity. However, since load balance is disabled at this point for 3-5,
> > once admitted the task will only be able to run on CPU 3. Now, if more
> > tasks on CPU 3 are admitted the same way (up to 1338) I believe they
> > will start to experience deadline misses because only 446 will be
> > actually available to them until load balance is enabled below and they
> > are then free to migrate to CPUs 4 and 5.
> >
> > Does it makes sense?
>
> Yes, it does.
>
> So my first idea was to only consider the CPU (i.e. its CPU capacity) in
> case we detect 'cpu_rq(cpu)->rd == def_root_domain'?
>
> In case I re-enable load-balancing on cpuset '/', we can't make a task
> in cpuset 'B' DL since we hit this in __sched_setscheduler():
>
> 4931 /*
> 4932 * Don't allow tasks with an affinity mask smaller than
> 4933 * the entire root_domain to become SCHED_DEADLINE.
> ...
> 4935 */
> 4936 if (!cpumask_subset(span, p->cpus_ptr) || ...
>
> root@juno:~# echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> root@juno:~# echo $$ > /sys/fs/cgroup/cpuset/B/tasks
> root@juno:~# chrt -d --sched-runtime 8000 --sched-period 16000 -p 0 $$
> chrt: failed to set pid 2316's policy: Operation not permitted
>
> So this task has to leave 'B' first I assume.

Right, because the span is back to containing all CPUs (load balancing
enabled at root level), but tasks in 'B' still have their affinity set
to a subset of them.
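
Put differently (a rough trace of the state in this example, not actual
kernel output):

	rd->span    = 0-5	/* root rd after re-enabling load balance */
	p->cpus_ptr = 3-5	/* task still confined to cpuset 'B' */

so cpumask_subset(span, p->cpus_ptr) is false, the check in
__sched_setscheduler() bails out, and chrt reports 'Operation not
permitted'.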