The SCHED_DEADLINE (DL) admission control does not work correctly on
heterogeneous (asymmetric CPU capacity) systems such as Arm big.LITTLE
or DynamIQ.
Let's fix this by explicitly considering CPU capacity in DL admission
control and task migration.
The DL sched class now attempts to avoid missing task deadlines caused by
a smaller CPU (CPU capacity < 1024) not being capable enough to finish a
task in time. It does so by trying to place a task so that its
CPU-capacity-scaled deadline is not smaller than its runtime.
Changes RFC [1] -> v1:
Only use static values for CPU bandwidth (sched_dl_entity::dl_runtime,
::dl_deadline) and CPU capacity (arch_scale_cpu_capacity()) to fix DL
admission control.
Dynamic values for CPU bandwidth (sched_dl_entity::runtime, ::deadline)
and CPU capacity (capacity_of()) are considered to be more related to
energy trade-off calculations which could be later introduced using the
Energy Model.
Since the designs of the DL and RT sched classes are very similar, the
implementation follows the overall design of RT capacity awareness
(commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware")).
Per-patch changes:
(1) Store CPU capacity sum in the root domain during
build_sched_domains() [patch 1/4]
(2) Adjust to RT capacity awareness design [patch 3/4]
(3) Remove CPU capacity aware placement in switched_to_dl()
(dl_migrate callback) [RFC patch 3/6]
Balance callbacks (push, pull) run only in schedule_tail(),
__schedule(), rt_mutex_setprio() or __sched_setscheduler().
DL throttling leads to a call to __dequeue_task_dl(), which is not a
full task dequeue: the task is still enqueued and is only removed from
the rq.
So a queue_balance_callback() call in update_curr_dl()->
__dequeue_task_dl() will not be followed by a balance_callback()
call in one of the 4 functions mentioned above.
(4) Remove 'dynamic CPU bandwidth' consideration and only support
'static CPU bandwidth' (ratio between sched_dl_entity::dl_runtime
and ::dl_deadline) [RFC patch 4/6]
(5) Remove modification to migration logic which tried to schedule
small tasks on LITTLE CPUs [RFC patch 6/6]
[1] https://lore.kernel.org/r/[email protected]
The following rt-app testcase, tailored to an Arm64 Hikey960,
root@h960:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024
shows the expected behavior.
According to the following condition in dl_task_fits_capacity()
cap_scale(dl_deadline, arch_scale_cpu_capacity(cpu)) >= dl_runtime
thread0-[0-3] are each placed on a big CPU whereas thread1-[0-3] each run
on a LITTLE CPU (see the quick cross-check after the JSON snippet below).
...
"tasks" : {
"thread0" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"timer" : { "ref" : "unique0", "period" : 16000, "mode" : "absolute" },
"run" : 10000,
"dl-runtime" : 11000,
"dl-period" : 16000,
"dl-deadline" : 16000
},
"thread1" : {
"policy" : "SCHED_DEADLINE",
"instance" : 4,
"delay" : 1000,
"timer" : { "ref" : "unique1", "period" : 16000, "mode" : "absolute" },
"run" : 5500,
"dl-runtime" : 6500
"dl-period" : 16000,
"dl-deadline" : 16000
}
...
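The placement can be cross-checked with a minimal user-space sketch of the
fitness condition (assuming cap_scale(v, c) = (v * c) >> SCHED_CAPACITY_SHIFT
with SCHED_CAPACITY_SHIFT = 10, as in the kernel; helper names are made up):

  #include <stdio.h>

  #define SCHED_CAPACITY_SHIFT 10
  #define cap_scale(v, c) (((v) * (c)) >> SCHED_CAPACITY_SHIFT)

  /* dl_task_fits_capacity() condition from above */
  static int fits(unsigned long dl_runtime, unsigned long dl_deadline,
                  unsigned long cpu_cap)
  {
          return cap_scale(dl_deadline, cpu_cap) >= dl_runtime;
  }

  int main(void)
  {
          /* Hikey960 capacities and the rt-app parameters above */
          printf("thread0 on LITTLE: %d\n", fits(11000, 16000, 462));  /* 0: 7218 < 11000 */
          printf("thread0 on big   : %d\n", fits(11000, 16000, 1024)); /* 1: 16000 >= 11000 */
          printf("thread1 on LITTLE: %d\n", fits(6500, 16000, 462));   /* 1: 7218 >= 6500 */
          return 0;
  }

I.e. thread0 only fits a big CPU (16000 * 462 / 1024 = 7218 < 11000) while
thread1 also fits a LITTLE CPU (7218 >= 6500).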
Tests were run with the Performance CPUfreq governor so that the
Schedutil CPUfreq governor DL threads (sugov:[0,4]), necessary on a
slow-switching platform like the Hikey960, do not interfere with the
rt-app test tasks. Using Schedutil would require lowering the number of
tasks to 3 instances each.
Dietmar Eggemann (1):
sched/topology: Store root domain CPU capacity sum
Luca Abeni (3):
sched/deadline: Improve admission control for asymmetric CPU
capacities
sched/deadline: Make DL capacity-aware
sched/deadline: Implement fallback mechanism for !fit case
kernel/sched/cpudeadline.c | 23 +++++++++++++++++++++
kernel/sched/deadline.c | 41 +++++++++++++++++++++++---------------
kernel/sched/sched.h | 33 ++++++++++++++++++++++++++++--
kernel/sched/topology.c | 14 +++++++++----
4 files changed, 89 insertions(+), 22 deletions(-)
--
2.17.1
Add the sum of (original) CPU capacity of all member CPUs to the root
domain.
This is needed for capacity-aware SCHED_DEADLINE admission control.
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/sched.h | 11 +++++++++++
kernel/sched/topology.c | 14 ++++++++++----
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1e72d1b3d3ce..91bd0cb0c529 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -797,6 +797,7 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;
+ unsigned long sum_cpu_capacity;
unsigned long max_cpu_capacity;
/*
@@ -2393,6 +2394,16 @@ static inline unsigned long capacity_orig_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity_orig;
}
+
+static inline unsigned long rd_capacity(int cpu)
+{
+ return cpu_rq(cpu)->rd->sum_cpu_capacity;
+}
+#else
+static inline unsigned long rd_capacity(int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
#endif
/**
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8344757bba6e..74b0c0fa4b1b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
+ unsigned long cap = arch_scale_cpu_capacity(i);
+
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
+ if (cap > READ_ONCE(d.rd->max_cpu_capacity))
+ WRITE_ONCE(d.rd->max_cpu_capacity, cap);
+
+ WRITE_ONCE(d.rd->sum_cpu_capacity,
+ READ_ONCE(d.rd->sum_cpu_capacity) + cap);
cpu_attach_domain(sd, d.rd, i);
}
@@ -2067,8 +2072,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
if (rq && sched_debug_enabled) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+ pr_info("root domain span: %*pbl (capacity = %lu max cpu_capacity = %lu)\n",
+ cpumask_pr_args(cpu_map), rq->rd->sum_cpu_capacity,
+ rq->rd->max_cpu_capacity);
}
ret = 0;
--
2.17.1
From: Luca Abeni <[email protected]>
When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPUs (later_mask), the task is doomed to miss
its deadline.
This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case try to select the idle CPU with the largest CPU capacity
to minimize tardiness.
Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 8630f2a40a3f..8525d73e3de4 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- int cpu;
+ unsigned long cap, max_cap = 0;
+ int cpu, max_cpu = -1;
if (!static_branch_unlikely(&sched_asym_cpucapacity))
return 1;
/* Ensure the capacity of the CPUs fits the task. */
for_each_cpu(cpu, later_mask) {
- if (!dl_task_fits_capacity(p, cpu))
+ if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);
+
+ cap = arch_scale_cpu_capacity(cpu);
+
+ if (cap > max_cap) {
+ max_cap = cap;
+ max_cpu = cpu;
+ }
+ }
}
- if (!cpumask_empty(later_mask))
- return 1;
+ if (cpumask_empty(later_mask))
+ cpumask_set_cpu(max_cpu, later_mask);
+
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);
--
2.17.1
From: Luca Abeni <[email protected]>
The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems where DL tasks are guaranteed
to have a bounded tardiness but presents issues on heterogeneous
systems.
A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).
Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:
CPU original capacity / 1024 >= task runtime / task deadline
Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.
On homogeneous systems the original behavior of the DL admission
control should be retained.
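To avoid a division, the condition can be rearranged (this is just a
restatement of the criterion above; the code below does it via cap_scale()):

  capacity / 1024 >= runtime / deadline
  <=> (deadline * capacity) >> 10 >= runtime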
Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/cpudeadline.c | 14 +++++++++++++-
kernel/sched/deadline.c | 18 ++++++++++++++----
kernel/sched/sched.h | 15 +++++++++++++++
3 files changed, 42 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 5cc4012572ec..8630f2a40a3f 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
- return 1;
+ int cpu;
+
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 1;
+
+ /* Ensure the capacity of the CPUs fits the task. */
+ for_each_cpu(cpu, later_mask) {
+ if (!dl_task_fits_capacity(p, cpu))
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ if (!cpumask_empty(later_mask))
+ return 1;
} else {
int best_cpu = cpudl_maximum(cp);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 53b34a95e29e..e10adf1e3c27 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1604,6 +1604,7 @@ static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
+ bool select_rq;
struct rq *rq;
if (sd_flag != SD_BALANCE_WAKE)
@@ -1623,10 +1624,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- if (unlikely(dl_task(curr)) &&
- (curr->nr_cpus_allowed < 2 ||
- !dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) {
+ select_rq = unlikely(dl_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 ||
+ !dl_entity_preempt(&p->dl, &curr->dl)) &&
+ p->nr_cpus_allowed > 1;
+
+ /*
+ * We take into account the capacity of the CPU to
+ * ensure it fits the requirement of the task.
+ */
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ select_rq |= !dl_task_fits_capacity(p, cpu);
+
+ if (select_rq) {
int target = find_later_rq(p);
if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6cbdf7a342a6..598b58c68639 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -306,6 +306,21 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
static inline unsigned long rd_capacity(int cpu);
+/*
+ * Verify the fitness of task @p to run on @cpu taking into account the
+ * CPU original capacity and the runtime/deadline ratio of the task.
+ *
+ * The function will return true if the CPU original capacity of the
+ * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
+ * task and false otherwise.
+ */
+static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
+{
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+
+ return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
+}
+
static inline
bool __dl_overflow(struct dl_bw *dl_b, int cpu, u64 old_bw, u64 new_bw)
{
--
2.17.1
From: Luca Abeni <[email protected]>
The current SCHED_DEADLINE (DL) admission control ensures that
sum of reserved CPU bandwidth < x * M
where
x = /proc/sys/kernel/sched_rt_{runtime,period}_us
M = # CPUs in root domain.
DL admission control works well for homogeneous systems where the
capacity of all CPUs is equal (1024), i.e. bounded tardiness for DL tasks
and non-starvation of non-DL tasks are guaranteed.
But on heterogeneous systems, where the capacities of the CPUs differ, it
can fail by over-allocating CPU time on smaller-capacity CPUs.
On an Arm big.LITTLE/DynamIQ system, DL tasks can easily starve other
tasks, making the system unusable.
Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.
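As a rough illustration, assuming the default
x = sched_rt_runtime_us / sched_rt_period_us = 950000 / 1000000 = 0.95 and
the Hikey960 capacities from the cover letter (4x462 + 4x1024 = 5944):

  old bound: 0.95 * 8           = 7.60 CPUs worth of DL bandwidth
  new bound: 0.95 * 5944 / 1024 ~ 5.51

i.e. considerably less DL bandwidth is admitted on the asymmetric system.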
Signed-off-by: Luca Abeni <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
---
kernel/sched/deadline.c | 23 +++++++++++------------
kernel/sched/sched.h | 7 +++++--
2 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 504d2f51b0d6..53b34a95e29e 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2551,11 +2551,11 @@ void sched_dl_do_global(void)
int sched_dl_overflow(struct task_struct *p, int policy,
const struct sched_attr *attr)
{
- struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
- int cpus, err = -1;
+ int cpus, err = -1, cpu = task_cpu(p);
+ struct dl_bw *dl_b = dl_bw_of(cpu);
if (attr->sched_flags & SCHED_FLAG_SUGOV)
return 0;
@@ -2570,15 +2570,15 @@ int sched_dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container.
*/
raw_spin_lock(&dl_b->lock);
- cpus = dl_bw_cpus(task_cpu(p));
+ cpus = dl_bw_cpus(cpu);
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, cpu, 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ !__dl_overflow(dl_b, cpu, p->dl.dl_bw, new_bw)) {
/*
* XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total
@@ -2715,18 +2715,17 @@ bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr)
int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed)
{
unsigned int dest_cpu;
+ unsigned long flags;
struct dl_bw *dl_b;
bool overflow;
- int cpus, ret;
- unsigned long flags;
+ int ret;
dest_cpu = cpumask_any_and(cpu_active_mask, cs_cpus_allowed);
rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ overflow = __dl_overflow(dl_b, dest_cpu, 0, p->dl.dl_bw);
if (overflow) {
ret = -EBUSY;
} else {
@@ -2736,6 +2735,8 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
* We will free resources in the source root_domain
* later on (see set_cpus_allowed_dl()).
*/
+ int cpus = dl_bw_cpus(dest_cpu);
+
__dl_add(dl_b, p->dl.dl_bw, cpus);
ret = 0;
}
@@ -2771,13 +2772,11 @@ bool dl_cpu_busy(unsigned int cpu)
unsigned long flags;
struct dl_bw *dl_b;
bool overflow;
- int cpus;
rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ overflow = __dl_overflow(dl_b, cpu, 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91bd0cb0c529..6cbdf7a342a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -304,11 +304,14 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
__dl_update(dl_b, -((s32)tsk_bw / cpus));
}
+static inline unsigned long rd_capacity(int cpu);
+
static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+bool __dl_overflow(struct dl_bw *dl_b, int cpu, u64 old_bw, u64 new_bw)
{
return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+ cap_scale(dl_b->bw, rd_capacity(cpu)) <
+ dl_b->total_bw - old_bw + new_bw;
}
extern void init_dl_bw(struct dl_bw *dl_b);
--
2.17.1
On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
>
> Add the sum of (original) CPU capacity of all member CPUs to the root
> domain.
>
> This is needed for capacity-aware SCHED_DEADLINE admission control.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/sched.h | 11 +++++++++++
> kernel/sched/topology.c | 14 ++++++++++----
> 2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1e72d1b3d3ce..91bd0cb0c529 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -797,6 +797,7 @@ struct root_domain {
> cpumask_var_t rto_mask;
> struct cpupri cpupri;
>
> + unsigned long sum_cpu_capacity;
> unsigned long max_cpu_capacity;
>
> /*
> @@ -2393,6 +2394,16 @@ static inline unsigned long capacity_orig_of(int cpu)
> {
> return cpu_rq(cpu)->cpu_capacity_orig;
> }
> +
> +static inline unsigned long rd_capacity(int cpu)
> +{
> + return cpu_rq(cpu)->rd->sum_cpu_capacity;
> +}
> +#else
> +static inline unsigned long rd_capacity(int cpu)
> +{
> + return SCHED_CAPACITY_SCALE;
> +}
> #endif
>
> /**
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 8344757bba6e..74b0c0fa4b1b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> /* Attach the domains */
> rcu_read_lock();
> for_each_cpu(i, cpu_map) {
> + unsigned long cap = arch_scale_cpu_capacity(i);
Why do you replace the use of rq->cpu_capacity_orig by
arch_scale_cpu_capacity(i) ?
There is nothing about this change in the commit message
> +
> rq = cpu_rq(i);
> sd = *per_cpu_ptr(d.sd, i);
>
> /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
> - if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
> - WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
> + if (cap > READ_ONCE(d.rd->max_cpu_capacity))
> + WRITE_ONCE(d.rd->max_cpu_capacity, cap);
> +
> + WRITE_ONCE(d.rd->sum_cpu_capacity,
> + READ_ONCE(d.rd->sum_cpu_capacity) + cap);
>
> cpu_attach_domain(sd, d.rd, i);
> }
> @@ -2067,8 +2072,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
>
> if (rq && sched_debug_enabled) {
> - pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
> - cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> + pr_info("root domain span: %*pbl (capacity = %lu max cpu_capacity = %lu)\n",
> + cpumask_pr_args(cpu_map), rq->rd->sum_cpu_capacity,
> + rq->rd->max_cpu_capacity);
> }
>
> ret = 0;
> --
> 2.17.1
>
On 08/04/20 14:30, luca abeni wrote:
>>
>> I don't think this is strictly equivalent to what we have now for the
>> SMP case. 'cpus' used to come from dl_bw_cpus(), which is an ugly way
>> of writing
>>
>> cpumask_weight(rd->span AND cpu_active_mask);
>>
>> The rd->cpu_capacity_orig field you added gets set once per domain
>> rebuild, so it also happens in sched_cpu_(de)activate() but is
>> separate from touching cpu_active_mask. AFAICT this mean we can
>> observe a CPU as !active but still see its capacity_orig accounted in
>> a root_domain.
>
> Sorry, I suspect this is my fault, because the bug comes from my
> original patch.
> When I wrote the original code, I believed that when a CPU is
> deactivated it is also removed from its root domain.
>
> I now see that I was wrong.
>
Well it is indeed the case, but sadly it's not an atomic step - AFAICT with
cpusets we do hold some cpuset lock when calling __dl_overflow() and when
rebuilding the domains, but not when fiddling with the active mask.
I just realized it's even more obvious for dl_cpu_busy(): IIUC it is meant
to prevent the removal of a CPU if it would lead to a DL overflow - it
works now because the active mask is modified before it gets called, but
here it breaks because it's called before the sched_domain rebuild.
Perhaps re-computing the root domain capacity sum at every dl_bw_cpus()
call would be simpler. It's a bit more work, but then we already have a
for_each_cpu_*() loop, and we only rely on the masks being correct.
>
> Luca
On 04/08/20 15:30, luca abeni wrote:
> Hi Valentin,
>
> On Wed, 08 Apr 2020 11:42:14 +0100
> Valentin Schneider <[email protected]> wrote:
>
> > On 08/04/20 10:50, Dietmar Eggemann wrote:
> > > +++ b/kernel/sched/sched.h
> > > @@ -304,11 +304,14 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw,
> > > int cpus) __dl_update(dl_b, -((s32)tsk_bw / cpus));
> > > }
> > >
> > > +static inline unsigned long rd_capacity(int cpu);
> > > +
> > > static inline
> > > -bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64
> > > new_bw) +bool __dl_overflow(struct dl_bw *dl_b, int cpu, u64
> > > old_bw, u64 new_bw) {
> > > return dl_b->bw != -1 &&
> > > - dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> > > + cap_scale(dl_b->bw, rd_capacity(cpu)) <
> > > + dl_b->total_bw - old_bw + new_bw;
> > > }
> > >
> >
> > I don't think this is strictly equivalent to what we have now for the
> > SMP case. 'cpus' used to come from dl_bw_cpus(), which is an ugly way
> > of writing
> >
> > cpumask_weight(rd->span AND cpu_active_mask);
> >
> > The rd->cpu_capacity_orig field you added gets set once per domain
> > rebuild, so it also happens in sched_cpu_(de)activate() but is
> > separate from touching cpu_active_mask. AFAICT this mean we can
> > observe a CPU as !active but still see its capacity_orig accounted in
> > a root_domain.
>
> Sorry, I suspect this is my fault, because the bug comes from my
> original patch.
> When I wrote the original code, I believed that when a CPU is
> deactivated it is also removed from its root domain.
>
> I now see that I was wrong.
Shouldn't rd->online be equivalent to (rd->span & cpu_active_mask)?
--
Qais Yousef
On 08.04.20 14:29, Vincent Guittot wrote:
> On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
[...]
>> /**
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 8344757bba6e..74b0c0fa4b1b 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> /* Attach the domains */
>> rcu_read_lock();
>> for_each_cpu(i, cpu_map) {
>> + unsigned long cap = arch_scale_cpu_capacity(i);
>
> Why do you replace the use of rq->cpu_capacity_orig by
> arch_scale_cpu_capacity(i) ?
> There is nothing about this change in the commit message
True. And I can change this back.
It seems though that the solution is not sufficient because of the
'rd->span ⊄ cpu_active_mask' issue discussed under patch 2/4.
But this reminds me of another question I have.
Currently we use arch_scale_cpu_capacity() more often (16 times) than
capacity_orig_of()/rq->cpu_capacity_orig .
What's hindering us to remove rq->cpu_capacity_orig and the code around
it and solely rely on arch_scale_cpu_capacity()? I mean the arch
implementation should be fast.
On Wed, 8 Apr 2020 at 18:31, Dietmar Eggemann <[email protected]> wrote:
>
> On 08.04.20 14:29, Vincent Guittot wrote:
> > On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
>
> [...]
>
> >> /**
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index 8344757bba6e..74b0c0fa4b1b 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >> /* Attach the domains */
> >> rcu_read_lock();
> >> for_each_cpu(i, cpu_map) {
> >> + unsigned long cap = arch_scale_cpu_capacity(i);
> >
> > Why do you replace the use of rq->cpu_capacity_orig by
> > arch_scale_cpu_capacity(i) ?
> > There is nothing about this change in the commit message
>
> True. And I can change this back.
>
> It seems though that the solution is not sufficient because of the
>> 'rd->span ⊄ cpu_active_mask' issue discussed under patch 2/4.
>>
>> But this reminds me of another question I have.
>
> Currently we use arch_scale_cpu_capacity() more often (16 times) than
> capacity_orig_of()/rq->cpu_capacity_orig .
>
> What's hindering us to remove rq->cpu_capacity_orig and the code around
> it and solely rely on arch_scale_cpu_capacity()? I mean the arch
> implementation should be fast.
Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
Is there a case where the max cpu capacity changes over time ? So I
would prefer to use cpu_capacity_orig which is a field of scheduler
instead of always calling an external arch specific function
On 04/08/20 11:50, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> When a task has a runtime that cannot be served within the scheduling
> deadline by any of the idle CPU (later_mask) the task is doomed to miss
> its deadline.
>
> This can happen since the SCHED_DEADLINE admission control guarantees
> only bounded tardiness and not the hard respect of all deadlines.
> In this case try to select the idle CPU with the largest CPU capacity
> to minimize tardiness.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
Outside of the scope of this series. But does it make sense to make
sched_setattr() fail to create a new deadline task if the system will be
overcommitted, hence causing some dl tasks to miss their deadlines?
If some overcommitting is fine (some deadlines are soft and are okay to fail
every once in a while), does it make sense for this to be a tunable of how much
the system can be overcommitted before disallowing new DL tasks to be created?
Just thinking out loud. This fallback is fine, but it made me wonder why
we have to end up in a situation where we can fail in the first place,
since the same info is available when a new DL task is created; being
preventive might be a better approach.
Thanks
--
Qais Yousef
> kernel/sched/cpudeadline.c | 19 +++++++++++++++----
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 8630f2a40a3f..8525d73e3de4 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - int cpu;
> + unsigned long cap, max_cap = 0;
> + int cpu, max_cpu = -1;
>
> if (!static_branch_unlikely(&sched_asym_cpucapacity))
> return 1;
>
> /* Ensure the capacity of the CPUs fits the task. */
> for_each_cpu(cpu, later_mask) {
> - if (!dl_task_fits_capacity(p, cpu))
> + if (!dl_task_fits_capacity(p, cpu)) {
> cpumask_clear_cpu(cpu, later_mask);
> +
> + cap = arch_scale_cpu_capacity(cpu);
> +
> + if (cap > max_cap) {
> + max_cap = cap;
> + max_cpu = cpu;
> + }
> + }
> }
>
> - if (!cpumask_empty(later_mask))
> - return 1;
> + if (cpumask_empty(later_mask))
> + cpumask_set_cpu(max_cpu, later_mask);
> +
> + return 1;
> } else {
> int best_cpu = cpudl_maximum(cp);
>
> --
> 2.17.1
>
Hi,
On Thu, 9 Apr 2020 11:25:58 +0100
Qais Yousef <[email protected]> wrote:
> On 04/08/20 11:50, Dietmar Eggemann wrote:
> > From: Luca Abeni <[email protected]>
> >
> > When a task has a runtime that cannot be served within the
> > scheduling deadline by any of the idle CPU (later_mask) the task is
> > doomed to miss its deadline.
> >
> > This can happen since the SCHED_DEADLINE admission control
> > guarantees only bounded tardiness and not the hard respect of all
> > deadlines. In this case try to select the idle CPU with the largest
> > CPU capacity to minimize tardiness.
> >
> > Signed-off-by: Luca Abeni <[email protected]>
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > ---
>
> Outside of the scope of this series. But does it make sense to make
> sched_setattr() fail to create a new deadline task if the system will
> be overcommitted, hence causing some dl tasks to miss their deadlines?
The problem is that with multiple processors/cores it is not easy to
know in advance if any task will miss a deadline (see section 3.3 of
Documentation/scheduler/sched-deadline.rst).
The admission control we are currently using should prevent
SCHED_DEADLINE tasks from overloading the system (starving non-deadline
tasks); proving hard deadline guarantees with global EDF scheduling is
much more difficult (and could be probably done in user-space, I think).
> If some overcommitting is fine (some deadlines are soft and are okay
> to fail every once in a while), does it make sense for this to be a
> tunable of how much the system can be overcommitted before
> disallowing new DL tasks to be created?
There is already a tunable for the SCHED_DEADLINE admission test
(/proc/sys/kernel/sched_rt_{runtime,period}_us, if I understand well
what you are suggesting). The problem is that it is not easy to find a
value for this tunable that guarantees the hard respect of all
deadlines.
But IMHO if someone really wants hard deadline guarantees it is better
to use partitioned scheduling (see Section 5 of the SCHED_DEADLINE
documentation).
Luca
>
> Just thinking out loud. This fallback is fine, but it made me wonder
> why we have to end up in a situation where we can fail in the first
> place, since the same info is available when a new DL task is created;
> being preventive might be a better approach.
>
> Thanks
>
> --
> Qais Yousef
>
> > kernel/sched/cpudeadline.c | 19 +++++++++++++++----
> > 1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 8630f2a40a3f..8525d73e3de4 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -121,19 +121,30 @@ int cpudl_find(struct cpudl *cp, struct
> > task_struct *p,
> > if (later_mask &&
> > cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> > - int cpu;
> > + unsigned long cap, max_cap = 0;
> > + int cpu, max_cpu = -1;
> >
> > if
> > (!static_branch_unlikely(&sched_asym_cpucapacity)) return 1;
> >
> > /* Ensure the capacity of the CPUs fits the task.
> > */ for_each_cpu(cpu, later_mask) {
> > - if (!dl_task_fits_capacity(p, cpu))
> > + if (!dl_task_fits_capacity(p, cpu)) {
> > cpumask_clear_cpu(cpu, later_mask);
> > +
> > + cap = arch_scale_cpu_capacity(cpu);
> > +
> > + if (cap > max_cap) {
> > + max_cap = cap;
> > + max_cpu = cpu;
> > + }
> > + }
> > }
> >
> > - if (!cpumask_empty(later_mask))
> > - return 1;
> > + if (cpumask_empty(later_mask))
> > + cpumask_set_cpu(max_cpu, later_mask);
> > +
> > + return 1;
> > } else {
> > int best_cpu = cpudl_maximum(cp);
> >
> > --
> > 2.17.1
> >
On 08.04.20 19:03, Vincent Guittot wrote:
> On Wed, 8 Apr 2020 at 18:31, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 08.04.20 14:29, Vincent Guittot wrote:
>>> On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
>>
>> [...]
>>
>>>> /**
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index 8344757bba6e..74b0c0fa4b1b 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>> /* Attach the domains */
>>>> rcu_read_lock();
>>>> for_each_cpu(i, cpu_map) {
>>>> + unsigned long cap = arch_scale_cpu_capacity(i);
>>>
>>> Why do you replace the use of rq->cpu_capacity_orig by
>>> arch_scale_cpu_capacity(i) ?
>>> There is nothing about this change in the commit message
>>
>> True. And I can change this back.
>>
>> It seems though that the solution is not sufficient because of the
>> 'rd->span ⊄ cpu_active_mask' issue discussed under patch 2/4.
>>
>> But this reminds me of another question I have.
>>
>> Currently we use arch_scale_cpu_capacity() more often (16 times) than
>> capacity_orig_of()/rq->cpu_capacity_orig .
>>
>> What's hindering us to remove rq->cpu_capacity_orig and the code around
>> it and solely rely on arch_scale_cpu_capacity()? I mean the arch
>> implementation should be fast.
>
> Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
>
> Is there a case where the max cpu capacity changes over time ? So I
> would prefer to use cpu_capacity_orig which is a field of scheduler
> instead of always calling an external arch specific function
I see. So far it only changes during startup.
And it looks like that asym_cpu_capacity_level() [topology.c] would fail
if we would use capacity_orig_of() instead of arch_scale_cpu_capacity().
post_init_entity_util_avg() [fair.c] and sugov_get_util()
[cpufreq_schedutil.c] would be temporarily off until
update_cpu_capacity() has updated cpu_rq(cpu)->cpu_capacity_orig.
compute_energy() [fair.c] is guarded by sched_energy_enabled() from
being used at startup.
scale_rt_capacity() could be changed in case we call it after the
cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu) in
update_cpu_capacity().
The Energy Model (and CPUfreq cooling) code would need
capacity_orig_of() exported. arch_scale_cpu_capacity() currently is
exported via include/linux/sched/topology.h.
I guess Pelt and 'scale invariant Deadline bandwidth enforcement' should
continue using arch_scale_cpu_capacity() in sync with
arch_scale_freq_capacity().
IMHO it's hard to give clear advice when to use the one or the other.
We probably don't want to set cpu_rq(cpu)->cpu_capacity_orig in the arch
cpu scale setter. We have arch_scale_cpu_capacity() to decouple that.
On Thu, 9 Apr 2020 at 15:50, Dietmar Eggemann <[email protected]> wrote:
>
> On 08.04.20 19:03, Vincent Guittot wrote:
> > On Wed, 8 Apr 2020 at 18:31, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> On 08.04.20 14:29, Vincent Guittot wrote:
> >>> On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
> >>
> >> [...]
> >>
> >>>> /**
> >>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >>>> index 8344757bba6e..74b0c0fa4b1b 100644
> >>>> --- a/kernel/sched/topology.c
> >>>> +++ b/kernel/sched/topology.c
> >>>> @@ -2052,12 +2052,17 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >>>> /* Attach the domains */
> >>>> rcu_read_lock();
> >>>> for_each_cpu(i, cpu_map) {
> >>>> + unsigned long cap = arch_scale_cpu_capacity(i);
> >>>
> >>> Why do you replace the use of rq->cpu_capacity_orig by
> >>> arch_scale_cpu_capacity(i) ?
> >>> There is nothing about this change in the commit message
> >>
> >> True. And I can change this back.
> >>
> >> It seems though that the solution is not sufficient because of the
> >> 'rd->span ⊄ cpu_active_mask' issue discussed under patch 2/4.
> >>
> >> But this reminds me of another question I have.
> >>
> >> Currently we use arch_scale_cpu_capacity() more often (16 times) than
> >> capacity_orig_of()/rq->cpu_capacity_orig .
> >>
> >> What's hindering us to remove rq->cpu_capacity_orig and the code around
> >> it and solely rely on arch_scale_cpu_capacity()? I mean the arch
> >> implementation should be fast.
> >
> > Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
> >
> > Is there a case where the max cpu capacity changes over time ? So I
> > would prefer to use cpu_capacity_orig which is a field of scheduler
> > instead of always calling an external arch specific function
>
> I see. So far it only changes during startup.
>
> And it looks like that asym_cpu_capacity_level() [topology.c] would fail
> if we would use capacity_orig_of() instead of arch_scale_cpu_capacity().
Yes I agree. See below
>
> post_init_entity_util_avg() [fair.c] and sugov_get_util()
> [cpufreq_schedutil.c] would be temporarily off until
> update_cpu_capacity() has updated cpu_rq(cpu)->cpu_capacity_orig.
I think that we could even get rid of this update in
update_cpu_capacity(). cpu_capacity_orig should be set while building
the sched_domain topology because the topology itself is built based
on this max cpu capacity with asym_cpu_capacity_level(). So changing
the capacity without rebuilding the domain could break the
sched_domain topology correctness.
And we can't really set cpu_capacity_orig earlier during the boot
because the capacity of b.L is set late during the boot and a rebuild
of the sched_domain topology is then triggered.
>
> compute_energy() [fair.c] is guarded by sched_energy_enabled() from
> being used at startup.
>
> scale_rt_capacity() could be changed in case we call it after the
> cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu) in
> update_cpu_capacity().
With the removal of the update in update_cpu_capacity(), we don't have
a problem anymore, isn't it ?
>
> The Energy Model (and CPUfreq cooling) code would need
> capacity_orig_of() exported. arch_scale_cpu_capacity() currently is
> exported via include/linux/sched/topology.h.
Not sure that we need to export it outside scheduler, they can still
use arch_scale_cpu_capacity()
>
> I guess Pelt and 'scale invariant Deadline bandwidth enforcement' should
> continue using arch_scale_cpu_capacity() in sync with
> arch_scale_freq_capacity().
Why can't they use capacity_orig_of ?
we keep using arch_scale_freq_capacity() because it's dynamic but we
don't really need to keep using arch_scale_cpu_capacity()
>
> IMHO it's hard to give clear advice when to use the one or the other.
>
> We probably don't want to set cpu_rq(cpu)->cpu_capacity_orig in the arch
> cpu scale setter. We have arch_scale_cpu_capacity() to decouple that.
Yes I agree
Hi Luca
On 04/09/20 15:00, luca abeni wrote:
> > Outside of the scope of this series. But does it make sense to make
> > sched_setattr() fail to create a new deadline task if the system will
> > be overcommitted, hence causing some dl tasks to miss their deadlines?
>
> The problem is that with multiple processors/cores it is not easy to
> know in advance if any task will miss a deadline (see section 3.3 of
> Documentation/scheduler/sched-deadline.rst).
>
> The admission control we are currently using should prevent
> SCHED_DEADLINE tasks from overloading the system (starving non-deadline
> tasks); proving hard deadline guarantees with global EDF scheduling is
> much more difficult (and could be probably done in user-space, I think).
I see. I'll dig through the docs, thanks for the reference.
> > If some overcommitting is fine (some deadlines are soft and are okay
> > to fail every once in a while), does it make sense for this to be a
> > tunable of how much the system can be overcommitted before
> > disallowing new DL tasks to be created?
>
> There is already a tunable for the SCHED_DEADLINE admission test
> (/proc/sys/kernel/sched_rt_{runtime,period}_us, if I understand well
> what you are suggesting). The problem is that it is not easy to find a
> value for this tunable that guarantees the hard respect of all
> deadlines.
I don't think it's similar to what I was referring to. But my limited
knowledge of DL might have kept me from fully appreciating what you're saying.
This tunable for RT prevents a single task from using 100% CPU time. I think
DL uses it in a similar manner.
What I meant by overcommitting is allowing more DL tasks than the system can
guarantee to meet their deadlines.
For example, in the context of capacity awareness, if you have 2 big cores, but
4 DL tasks request a bandwidth that can only be satisfied by the big cores,
then 2 of them will miss their deadlines (almost) consistently, IIUC.
This can be generalized on SMP (I believe). But judging from your earlier
response, it's not as straightforward as it seems :)
>
> But IMHO if someone really wants hard deadline guarantees it is better
> to use partitioned scheduling (see Section 5 of the SCHED_DEADLINE
> documentation).
RT is the same. So this makes sense. Though one would hope to be able to
improve on this in the future. Something for me to ponder over :)
Thanks
--
Qais Yousef
On 08.04.20 17:01, Valentin Schneider wrote:
>
> On 08/04/20 14:30, luca abeni wrote:
>>>
>>> I don't think this is strictly equivalent to what we have now for the
>>> SMP case. 'cpus' used to come from dl_bw_cpus(), which is an ugly way
>>> of writing
>>>
>>> cpumask_weight(rd->span AND cpu_active_mask);
>>>
>>> The rd->cpu_capacity_orig field you added gets set once per domain
>>> rebuild, so it also happens in sched_cpu_(de)activate() but is
>>> separate from touching cpu_active_mask. AFAICT this mean we can
>>> observe a CPU as !active but still see its capacity_orig accounted in
>>> a root_domain.
>>
>> Sorry, I suspect this is my fault, because the bug comes from my
>> original patch.
>> When I wrote the original code, I believed that when a CPU is
>> deactivated it is also removed from its root domain.
>>
>> I now see that I was wrong.
>>
>
> Well it is indeed the case, but sadly it's not an atomic step - AFAICT with
> cpusets we do hold some cpuset lock when calling __dl_overflow() and when
> rebuilding the domains, but not when fiddling with the active mask.
>
> I just realized it's even more obvious for dl_cpu_busy(): IIUC it is meant
> to prevent the removal of a CPU if it would lead to a DL overflow - it
> works now because the active mask is modified before it gets called, but
> here it breaks because it's called before the sched_domain rebuild.
>
> Perhaps re-computing the root domain capacity sum at every dl_bw_cpus()
> call would be simpler. It's a bit more work, but then we already have a
> for_each_cpu_*() loop, and we only rely on the masks being correct.
Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
with the help of an extra per-cpu cpumask we could just
DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
dl_bw_cpus(int i) {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
...
cpumask_and(cpus, rd->span, cpu_active_mask);
return cpumask_weight(cpus);
}
and
dl_bw_capacity(int i) {
struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
...
cpumask_and(cpus, rd->span, cpu_active_mask);
if (cpumask_equal(cpus, rd->span))
return rd->sum_cpu_capacity;
for_each_cpu(i, cpus)
cap += capacity_orig_of(i);
return cap;
}
So only in cases in which rd->span and cpu_active_mask differ we would
have to sum up again.
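A possible consumer-side shape (untested sketch only, with dl_bw_capacity()
as proposed above; the final interface might look different):

  static inline
  bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
  {
          /* compare against the capacity actually available, not a CPU count */
          return dl_b->bw != -1 &&
                 cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
  }

and e.g. in dl_cpu_busy(), with dl_b->lock held:

          overflow = __dl_overflow(dl_b, dl_bw_capacity(cpu), 0, 0);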
On 09.04.20 16:55, Qais Yousef wrote:
> Hi Luca
>
> On 04/09/20 15:00, luca abeni wrote:
[...]
>> There is already a tunable for the SCHED_DEADLINE admission test
>> (/proc/sys/kernel/sched_rt_{runtime,period}_us, if I understand well
>> what you are suggesting). The problem is that it is not easy to find a
>> value for this tunable that guarantees the hard respect of all
>> deadlines.
>
> I don't think it's similar to what I was referring to. But my limited
> knowledge of DL might have kept me from fully appreciating what you're saying.
>
> This tunable for RT prevents a single task from using 100% CPU time. I think
> DL uses it in a similar manner.
>
> What I meant by overcommitting is allowing more DL tasks than the system can
> guarantee to meet their deadlines.
>
> For example, in the context of capacity awareness, if you have 2 big cores, but
> 4 DL tasks request a bandwidth that can only be satisfied by the big cores,
> then 2 of them will miss their deadlines (almost) consistently, IIUC.
__dl_overflow() now uses
X = cap_scale(dl_b->bw, rd_capacity(cpu)) instead of X = dl_b->bw * cpus
in
X < dl_b->total_bw - old_bw + new_bw;
As an example, an Arm64 Hikey960 with 4 LITTLE and 4 big CPUs now uses
(4x462 + 4x1024)/1024 = 5944/1024 ~ 5.80 instead of 8 CPUs.
On mainline, the rt-app tests:
"tasks" : {
"thread0" : {
"policy" : "SCHED_DEADLINE",
"instance" : 8,
"timer" : {"ref" : "unique0", "period" : 16000, "mode" : "absolute" },
"run" : 10000,
"dl-runtime" :11000,
"dl-period" : 16000,
"dl-deadline" : 16000
}
}
is admitted on this board whereas with the patchset some of the tasks
are rejected:
[rt-app] <crit> [7] sched_setattr returned -1
sched_setattr: failed to set deadline attributes: Device or resource busy
---
IMHO, one of the 3 places where DL admission control hooks in is:
__sched_setscheduler() -> sched_dl_overflow() -> __dl_overflow()
[...]
Hi,
On 08/04/20 11:50, Dietmar Eggemann wrote:
> From: Luca Abeni <[email protected]>
>
> The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
> algorithm w/o considering CPU capacity or task utilization.
> This works well on homogeneous systems where DL tasks are guaranteed
> to have a bounded tardiness but presents issues on heterogeneous
> systems.
>
> A DL task can migrate to a CPU which does not have enough CPU capacity
> to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
> period on a CPU w/ 512 capacity).
>
> Add the DL fitness function dl_task_fits_capacity() for DL admission
> control on heterogeneous systems. A task fits onto a CPU if:
>
> CPU original capacity / 1024 >= task runtime / task deadline
>
> Use this function on heterogeneous systems to try to find a CPU which
> meets this criterion during task wakeup, push and offline migration.
>
> On homogeneous systems the original behavior of the DL admission
> control should be retained.
>
> Signed-off-by: Luca Abeni <[email protected]>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 14 +++++++++++++-
> kernel/sched/deadline.c | 18 ++++++++++++++----
> kernel/sched/sched.h | 15 +++++++++++++++
> 3 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5cc4012572ec..8630f2a40a3f 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>
> if (later_mask &&
> cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - return 1;
> + int cpu;
> +
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fits the task. */
> + for_each_cpu(cpu, later_mask) {
> + if (!dl_task_fits_capacity(p, cpu))
> + cpumask_clear_cpu(cpu, later_mask);
> + }
> +
> + if (!cpumask_empty(later_mask))
> + return 1;
> } else {
> int best_cpu = cpudl_maximum(cp);
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 53b34a95e29e..e10adf1e3c27 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1604,6 +1604,7 @@ static int
> select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> + bool select_rq;
> struct rq *rq;
>
> if (sd_flag != SD_BALANCE_WAKE)
> @@ -1623,10 +1624,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> * other hand, if it has a shorter deadline, we
> * try to make it stay here, it might be important.
> */
> - if (unlikely(dl_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> - (p->nr_cpus_allowed > 1)) {
> + select_rq = unlikely(dl_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 ||
> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
> + p->nr_cpus_allowed > 1;
> +
> + /*
> + * We take into account the capacity of the CPU to
> + * ensure it fits the requirement of the task.
> + */
> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> + select_rq |= !dl_task_fits_capacity(p, cpu);
I'm thinking that, while dl_task_fits_capacity() works well when
selecting idle cpus, in this case we should consider the fact that curr
might be deadline as well and already consuming some of the rq capacity.
Do you think we should try to take that into account, maybe using
dl_rq->this_bw ?
Thanks,
Juri
On Wednesday 08 Apr 2020 at 19:03:57 (+0200), Vincent Guittot wrote:
> Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
>
> Is there a case where the max cpu capacity changes over time ? So I
> would prefer to use cpu_capacity_orig which is a field of scheduler
> instead of always calling an external arch specific function
Note however that using arch_scale_cpu_capacity() would be more
efficient, especially on non-arm/arm64 systems where it is a
compile-time constant.
It's probably a matter of personal taste, but I find rq->cpu_capacity_orig
superfluous. It wastes space without actually giving you anything no?
Anybody remembers why it was introduced in the first place?
Thanks,
Quentin
On Tue, 14 Apr 2020 at 17:27, Dietmar Eggemann <[email protected]> wrote:
>
> On 14.04.20 14:45, Quentin Perret wrote:
> > On Wednesday 08 Apr 2020 at 19:03:57 (+0200), Vincent Guittot wrote:
> >> Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
> >>
> >> Is there a case where the max cpu capacity changes over time ? So I
> >> would prefer to use cpu_capacity_orig which is a field of scheduler
> >> instead of always calling an external arch specific function
> >
> > Note however that using arch_scale_cpu_capacity() would be more
> > efficient, especially on non-arm/arm64 systems where it is a
> > compile-time constant.
>
> or essentially all ARCHs not defining it.
>
> >
> > It's probably a matter of personal taste, but I find rq->cpu_capacity_orig
> > superfluous. It wastes space without actually giving you anything no?
> > Anybody remembers why it was introduced in the first place?
>
> v3.18 arm providing arch_scale_cpu_capacity()
> v4.1 introduction of rq->cpu_capacity_orig
> v4.10 arm64 providing arch_scale_cpu_capacity()
> ...
>
> So it's down to the question of 'minimizing the scheduler calls to an
> external arch function' vs 'efficiency'.
Using cpu_capacity_orig will provide more consistency because it will
be set while setting the sched_domain topology
On 09.04.20 16:13, Vincent Guittot wrote:
> On Thu, 9 Apr 2020 at 15:50, Dietmar Eggemann <[email protected]> wrote:
>>
>> On 08.04.20 19:03, Vincent Guittot wrote:
>>> On Wed, 8 Apr 2020 at 18:31, Dietmar Eggemann <[email protected]> wrote:
>>>>
>>>> On 08.04.20 14:29, Vincent Guittot wrote:
>>>>> On Wed, 8 Apr 2020 at 11:50, Dietmar Eggemann <[email protected]> wrote:
[...]
>> And it looks like that asym_cpu_capacity_level() [topology.c] would fail
>> if we would use capacity_orig_of() instead of arch_scale_cpu_capacity().
>
> Yes I agree. See below
>
>> post_init_entity_util_avg() [fair.c] and sugov_get_util()
>> [cpufreq_schedutil.c] would be temporarily off until
>> update_cpu_capacity() has updated cpu_rq(cpu)->cpu_capacity_orig.
>
> I think that we could even get rid of this update in
> update_cpu_capacity(). cpu_capacity_orig should be set while building
> the sched_domain topology because the topology itself is built based
> on this max cpu capacity with asym_cpu_capacity_level(). So changing
> the capacity without rebuilding the domain could break the
> sched_domain topology correctness.
True. rq->cpu_capacity_orig could be set early in build_sched_domains(),
before the call to asym_cpu_capacity_level() or within this function.
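Something like this (untested sketch, exact placement to be decided):

  for_each_cpu(i, cpu_map)
          cpu_rq(i)->cpu_capacity_orig = arch_scale_cpu_capacity(i);

early in build_sched_domains(), so that asym_cpu_capacity_level() and
everything after it could rely on capacity_orig_of().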
> And we can't really set cpu_capacity_orig earlier during the boot
> because the capacity of b.L is set late during the boot and a rebuild
> of the sched_domain topology is then triggered.
>
>>
>> compute_energy() [fair.c] is guarded by sched_energy_enabled() from
>> being used at startup.
>>
>> scale_rt_capacity() could be changed in case we call it after the
>> cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(cpu) in
>> update_cpu_capacity().
>
> With the removal of the update in update_cpu_capacity(), we don't have
> a problem anymore, isn't it ?
True.
>> The Energy Model (and CPUfreq cooling) code would need
>> capacity_orig_of() exported. arch_scale_cpu_capacity() currently is
>> exported via include/linux/sched/topology.h.
>
> Not sure that we need to export it outside scheduler, they can still
> use arch_scale_cpu_capacity()
OK, let's change this for the task scheduler only.
>> I guess Pelt and 'scale invariant Deadline bandwidth enforcement' should
>> continue using arch_scale_cpu_capacity() in sync with
>> arch_scale_freq_capacity().
>
> Why can't they use capacity_orig_of ?
> we keep using arch_scale_freq_capacity() because it's dynamic but we
> don't really need to keep using arch_scale_cpu_capacity()
OK, Pelt is task scheduler so it can be changed here as well.
I'm going to create a patch following these ideas.
[...]
On 04/09/20 19:29, Dietmar Eggemann wrote:
[...]
> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
> with the help of an extra per-cpu cpumask we could just
>
> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
>
> dl_bw_cpus(int i) {
>
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
>
> return cpumask_weight(cpus);
> }
>
> and
>
> dl_bw_capacity(int i) {
>
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
> if (cpumask_equal(cpus, rd->span))
> return rd->sum_cpu_capacity;
>
> for_each_cpu(i, cpus)
> cap += capacity_orig_of(i);
>
> return cap;
> }
>
> So only in cases in which rd->span and cpu_active_mask differ we would
> have to sum up again.
I haven't followed this discussion closely, so I could be missing something
here.
In sched_cpu_dying() we call set_rq_offline() which clears the cpu in
rq->rd->online.
So the way I read the code
rd->online = cpumask_and(rd->span, cpu_active_mask)
But I could have easily missed some detail.
Regardless, it seems to me that DL is working around something not right in the
definition of rd->span or using the wrong variable.
My 2p :-). I have to go back and read the discussion in more detail.
Thanks
--
Qais Yousef
On 04/09/20 20:43, Dietmar Eggemann wrote:
> On 09.04.20 16:55, Qais Yousef wrote:
> > Hi Luca
> >
> > On 04/09/20 15:00, luca abeni wrote:
>
> [...]
>
> >> There is already a tunable for the SCHED_DEADLINE admission test
> >> (/proc/sys/kernel/sched_rt_{runtime,period}_us, if I understand well
> >> what you are suggesting). The problem is that it is not easy to find a
> >> value for this tunable that guarantees the hard respect of all
> >> deadlines.
> >
> > I don't think it's similar to what I was referring to. But my limited
> > knowledge of DL might have kept me from fully appreciating what you're saying.
> >
> > This tunable for RT prevents a single task from using 100% CPU time. I think
> > DL uses it in a similar manner.
> >
> > What I meant by overcommitting is allowing more DL tasks than the system can
> > guarantee to meet their deadlines.
> >
> > For example, in the context of capacity awareness, if you have 2 big cores, but
> > 4 DL tasks request a bandwidth that can only be satisfied by the big cores,
> > then 2 of them will miss their deadlines (almost) consistently, IIUC.
>
> __dl_overflow() now uses
>
> X = cap_scale(dl_b->bw, rd_capacity(cpu)) instead of X = dl_b->bw * cpus
>
> in
>
> X < dl_b->total_bw - old_bw + new_bw;
>
>
> As an example, an Arm64 Hikey960 with 4 LITTLE and 4 big CPUs now uses
> (4x462 + 4x1024)/1024 = 5944/1024 ~ 5.80 instead of 8 CPUs.
>
> On mainline, the rt-app tests:
>
> "tasks" : {
> "thread0" : {
> "policy" : "SCHED_DEADLINE",
> "instance" : 8,
> "timer" : {"ref" : "unique0", "period" : 16000, "mode" : "absolute" },
> "run" : 10000,
> "dl-runtime" :11000,
> "dl-period" : 16000,
> "dl-deadline" : 16000
> }
> }
>
> is admitted on this board whereas with the patchset some of the tasks
> are rejected:
>
> [rt-app] <crit> [7] sched_setattr returned -1
> sched_setattr: failed to set deadline attributes: Device or resource busy
>
> ---
>
> IMHO, one of the 3 places where DL admission control hooks in is:
>
> __sched_setscheduler() -> sched_dl_overflow() -> __dl_overflow()
>
> [...]
Thanks Dietmar! I should have looked at the code with more intent.
--
Qais Yousef
On 09/04/20 18:29, Dietmar Eggemann wrote:
>> Well it is indeed the case, but sadly it's not an atomic step - AFAICT with
>> cpusets we do hold some cpuset lock when calling __dl_overflow() and when
>> rebuilding the domains, but not when fiddling with the active mask.
>>
>> I just realized it's even more obvious for dl_cpu_busy(): IIUC it is meant
>> to prevent the removal of a CPU if it would lead to a DL overflow - it
>> works now because the active mask is modified before it gets called, but
>> here it breaks because it's called before the sched_domain rebuild.
>>
>> Perhaps re-computing the root domain capacity sum at every dl_bw_cpus()
>> call would be simpler. It's a bit more work, but then we already have a
>> for_each_cpu_*() loop, and we only rely on the masks being correct.
>
> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
> with the help of an extra per-cpu cpumask we could just
>
> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
>
> dl_bw_cpus(int i) {
>
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
>
> return cpumask_weight(cpus);
+1 on making this use cpumask_weight() :)
> }
>
> and
>
> dl_bw_capacity(int i) {
>
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
> if (cpumask_equal(cpus, rd->span))
> return rd->sum_cpu_capacity;
>
> for_each_cpu(i, cpus)
> cap += capacity_orig_of(i);
>
> return cap;
> }
>
> So only in cases in which rd->span and cpu_active_mask differ we would
> have to sum up again.
I think this might just work. In the "stable" case (i.e. not racing with
hotplug), we can use the value cached in the root_domain. Otherwise we'll
detect the mismatch between the cpumask and the root_domain (i.e. CPU
active but not yet included in root_domain, or CPU !active but still
included in root_domain).
On 14/04/20 12:40, Qais Yousef wrote:
>
> I haven't followed this discussion closely, so I could be missing something
> here.
>
> In sched_cpu_dying() we call set_rq_offline() which clears the cpu in
> rq->rd->online.
>
> So the way I read the code
>
> rd->online = cpumask_and(rd->span, cpu_active_mask)
>
> But I could have easily missed some detail.
>
sched_cpu_dying() is wayyyy below sched_cpu_deactivate(). This doesn't help
at all for the dl_cpu_busy() check in sched_cpu_deactivate().
> Regardless, it seems to me that DL is working around something not right in the
> definition of rd->span or using the wrong variable.
>
What DL is doing now is fine, it only needs to be aligned with the active
mask (which it is). We're making things a bit trickier by adding capacity
values into the mix.
> My 2p :-). I have to go back and read the discussion in more detail.
>
> Thanks
On 04/14/20 15:29, Valentin Schneider wrote:
>
> On 14/04/20 12:40, Qais Yousef wrote:
> >
> > I haven't followed this discussion closely, so I could be missing something
> > here.
> >
> > In sched_cpu_dying() we call set_rq_offline() which clears the cpu in
> > rq->rd->online.
> >
> > So the way I read the code
> >
> > rd->online = cpumask_and(rd->span, cpu_active_mask)
> >
> > But I could have easily missed some detail.
> >
>
> sched_cpu_dying() is wayyyy below sched_cpu_deactivate(). This doesn't help
> at all for the dl_cpu_busy() check in sched_cpu_deactivate().
It'd make more sense to introduce rd->active then, rather than a per-CPU global
variable, IMHO.
>
> > Regardless, it seems to me that DL is working around something not right in the
> > definition of rd->span or using the wrong variable.
> >
>
> What DL is doing now is fine, it only needs to be aligned with the active
> mask (which it is). We're making things a bit trickier by adding capacity
> values into the mix.
I'm still worried that others might be affected by a similar problem.
Will dig deeper next week when I'm back from hols.
Thanks
--
Qais Yousef
On 14.04.20 14:45, Quentin Perret wrote:
> On Wednesday 08 Apr 2020 at 19:03:57 (+0200), Vincent Guittot wrote:
>> Or we can do the opposite and only use capacity_orig_of()/rq->cpu_capacity_orig.
>>
>> Is there a case where the max cpu capacity changes over time ? So I
>> would prefer to use cpu_capacity_orig which is a field of scheduler
>> instead of always calling an external arch specific function
>
> Note however that using arch_scale_cpu_capacity() would be more
> efficient, especially on non-arm/arm64 systems where it is a
> compile-time constant.
i.e. on essentially all ARCHs not defining it.
>
> It's probably a matter of personal taste, but I find rq->cpu_capacity_orig
> superfluous. It wastes space without actually giving you anything no?
> Anybody remembers why it was introduced in the first place?
v3.18 arm providing arch_scale_cpu_capacity()
v4.1 introduction of rq->cpu_capacity_orig
v4.10 arm64 providing arch_scale_cpu_capacity()
...
So it's down to the question of 'minimizing the scheduler calls to an
external arch function' vs 'efficiency'.
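For reference, on architectures that do not provide their own implementation,
arch_scale_cpu_capacity() falls back to something like the stub below, so the
'external arch function' is effectively free there:

	#ifndef arch_scale_cpu_capacity
	/* Default: a compile-time constant (1024) when no arch override exists. */
	static __always_inline
	unsigned long arch_scale_cpu_capacity(int cpu)
	{
		return SCHED_CAPACITY_SCALE;
	}
	#endif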
On 10.04.20 14:52, Juri Lelli wrote:
> Hi,
>
> On 08/04/20 11:50, Dietmar Eggemann wrote:
>> From: Luca Abeni <[email protected]>
[...]
>> @@ -1623,10 +1624,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>> * other hand, if it has a shorter deadline, we
>> * try to make it stay here, it might be important.
>> */
>> - if (unlikely(dl_task(curr)) &&
>> - (curr->nr_cpus_allowed < 2 ||
>> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> - (p->nr_cpus_allowed > 1)) {
>> + select_rq = unlikely(dl_task(curr)) &&
>> + (curr->nr_cpus_allowed < 2 ||
>> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> + p->nr_cpus_allowed > 1;
>> +
>> + /*
>> + * We take into account the capacity of the CPU to
>> + * ensure it fits the requirement of the task.
>> + */
>> + if (static_branch_unlikely(&sched_asym_cpucapacity))
>> + select_rq |= !dl_task_fits_capacity(p, cpu);
>
> I'm thinking that, while dl_task_fits_capacity() works well when
> selecting idle cpus, in this case we should consider the fact that curr
> might be deadline as well and already consuming some of the rq capacity.
>
> Do you think we should try to take that into account, maybe using
> dl_rq->this_bw ?
So you're saying that cpudl_find(..., later_mask) could return 1 (w/
best_cpu (cp->elements[0].cpu) in later_mask).
And that this best_cpu could be a non-fitting CPU for p.
This could happen if cp->free_cpus is empty (no idle CPUs) so we take
cpudl_find()'s else path and in case p's deadline < cp->elements[0]
deadline.
We could condition the 'return 1' on best_cpu fitting p.
But should we do this for cpudl_find(..., NULL) calls from
check_preempt_equal_dl() as well or will this break GEDF?
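A rough sketch of conditioning the 'return 1' on best_cpu fitting p, using the
static-bandwidth fitness check from this series; the exact placement inside
cpudl_find()'s else path is simplified here and only meant as an illustration:

	/* Fitness check used by the series: capacity-scaled deadline vs runtime. */
	static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);

		return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
	}

	/* Sketch of cpudl_find()'s else path with the extra condition: */
	int best_cpu = cpudl_maximum(cp);

	if (cpumask_test_cpu(best_cpu, p->cpus_ptr) &&
	    dl_time_before(p->dl.deadline, cp->elements[0].dl) &&
	    dl_task_fits_capacity(p, best_cpu)) {	/* new condition */
		if (later_mask)
			cpumask_set_cpu(best_cpu, later_mask);

		return 1;
	}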
On 15/04/20 11:39, Dietmar Eggemann wrote:
> On 10.04.20 14:52, Juri Lelli wrote:
> > Hi,
> >
> > On 08/04/20 11:50, Dietmar Eggemann wrote:
> >> From: Luca Abeni <[email protected]>
>
> [...]
>
> >> @@ -1623,10 +1624,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >> * other hand, if it has a shorter deadline, we
> >> * try to make it stay here, it might be important.
> >> */
> >> - if (unlikely(dl_task(curr)) &&
> >> - (curr->nr_cpus_allowed < 2 ||
> >> - !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> - (p->nr_cpus_allowed > 1)) {
> >> + select_rq = unlikely(dl_task(curr)) &&
> >> + (curr->nr_cpus_allowed < 2 ||
> >> + !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> + p->nr_cpus_allowed > 1;
> >> +
> >> + /*
> >> + * We take into account the capacity of the CPU to
> >> + * ensure it fits the requirement of the task.
> >> + */
> >> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> >> + select_rq |= !dl_task_fits_capacity(p, cpu);
> >
> > I'm thinking that, while dl_task_fits_capacity() works well when
> > selecting idle cpus, in this case we should consider the fact that curr
> > might be deadline as well and already consuming some of the rq capacity.
> >
> > Do you think we should try to take that into account, maybe using
> > dl_rq->this_bw ?
>
> So you're saying that cpudl_find(..., later_mask) could return 1 (w/
> best_cpu (cp->elements[0].cpu) in later_mask).
>
> And that this best_cpu could be a non-fitting CPU for p.
>
> This could happen if cp->free_cpus is empty (no idle CPUs) so we take
> cpudl_find()'s else path and in case p's deadline < cp->elements[0]
> deadline.
>
> We could condition the 'return 1' on best_cpu fitting p.
>
> But should we do this for cpudl_find(..., NULL) calls from
> check_preempt_equal_dl() as well or will this break GEDF?
So, even by not returning best_cpu, as above, if it doesn't fit p's bw
requirement, I think we would be breaking GEDF, which however doesn't
take asym capacities into account. OTOH, if we let p migrate to a cpu
that can't suit it, it will still be missing its deadlines (plus it
would be causing deadline misses on the task that was running on
best_cpu).
check_preempt_equal_dl() worries me less, as it is there to service
corner cases (hopefully not so frequent).
Hi Juri,
On Wed, 15 Apr 2020 15:20:04 +0200
Juri Lelli <[email protected]> wrote:
[...]
> > > I'm thinking that, while dl_task_fits_capacity() works well when
> > > selecting idle cpus, in this case we should consider the fact
> > > that curr might be deadline as well and already consuming some of
> > > the rq capacity.
> > >
> > > Do you think we should try to take that into account, maybe using
> > > dl_rq->this_bw ?
> >
> > So you're saying that cpudl_find(..., later_mask) could return 1 (w/
> > best_cpu (cp->elements[0].cpu) in later_mask).
> >
> > And that this best_cpu could be a non-fitting CPU for p.
> >
> > This could happen if cp->free_cpus is empty (no idle CPUs) so we
> > take cpudl_find()'s else path and in case p's deadline <
> > cp->elements[0] deadline.
> >
> > We could condition the 'return 1' on best_cpu fitting p.
> >
> > But should we do this for cpudl_find(..., NULL) calls from
> > check_preempt_equal_dl() as well or will this break GEDF?
>
> So, even by not returning best_cpu, as above, if it doesn't fit p's bw
> requirement, I think we would be breaking GEDF, which however doesn't
> take asym capacities into account.
Well, gEDF could take asymmetric capacities into account by scheduling
the earliest deadline task on the fastest CPU (and the task with the
second earliest deadline on the second fastest CPU, and so on...)
But this could cause a lot of unneeded migrations (I tried to discuss
this issue in a previous OSPM presentation). My original approach to
work around this issue was to schedule a task on the slowest core on
which the task can fit (some experiments revealed that this heuristic
can approximate the gEDF behaviour without causing too many
migrations)... But this patch is not included on the current patchset,
and will be proposed later, after the most important patches have been
merged.
> OTOH, if we let p migrate to a cpu
> that can't suit it, it will still be missing its deadlines (plus it
> would be causing deadline misses on the task that was running on
> best_cpu).
In theory, if the task is scheduled on a core that is too slow for it
then we must allow faster cores to pull it later (when tasks with
earlier deadlines block). But this might be problematic, because it can
require migrating a currently scheduled task.
Luca
>
> check_preempt_equal_dl() worries me less, as it is there to service
> corner cases (hopefully not so frequent).
>
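For illustration only, the 'slowest CPU the task still fits on' heuristic Luca
describes could be sketched roughly as below; this is not part of the posted
patchset, and the helper name dl_slowest_fitting_cpu() is made up for the
example:

	/*
	 * Sketch only: walk the candidate CPUs and pick the one with the
	 * smallest capacity that still satisfies
	 *   cap_scale(dl_deadline, capacity) >= dl_runtime.
	 * Fall back to the biggest CPU seen if nothing fits.
	 */
	static int dl_slowest_fitting_cpu(struct task_struct *p,
					  const struct cpumask *candidates)
	{
		unsigned long cap, best_cap = ULONG_MAX, max_cap = 0;
		int cpu, best = -1, max_cpu = -1;

		for_each_cpu(cpu, candidates) {
			cap = arch_scale_cpu_capacity(cpu);

			if (cap > max_cap) {
				max_cap = cap;
				max_cpu = cpu;
			}

			if (dl_task_fits_capacity(p, cpu) && cap < best_cap) {
				best_cap = cap;
				best = cpu;
			}
		}

		return best != -1 ? best : max_cpu;
	}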
On 15/04/20 18:42, luca abeni wrote:
> Hi Juri,
>
> On Wed, 15 Apr 2020 15:20:04 +0200
> Juri Lelli <[email protected]> wrote:
> [...]
> > > > I'm thinking that, while dl_task_fits_capacity() works well when
> > > > selecting idle cpus, in this case we should consider the fact
> > > > that curr might be deadline as well and already consuming some of
> > > > the rq capacity.
> > > >
> > > > Do you think we should try to take that into account, maybe using
> > > > dl_rq->this_bw ?
> > >
> > > So you're saying that cpudl_find(..., later_mask) could return 1 (w/
> > > best_cpu (cp->elements[0].cpu) in later_mask).
> > >
> > > And that this best_cpu could be a non-fitting CPU for p.
> > >
> > > This could happen if cp->free_cpus is empty (no idle CPUs) so we
> > > take cpudl_find()'s else path and in case p's deadline <
> > > cp->elements[0] deadline.
> > >
> > > We could condition the 'return 1' on best_cpu fitting p.
> > >
> > > But should we do this for cpudl_find(..., NULL) calls from
> > > check_preempt_equal_dl() as well or will this break GEDF?
> >
> > So, even by not returning best_cpu, as above, if it doesn't fit p's bw
> > requirement, I think we would be breaking GEDF, which however doesn't
> > take asym capacities into account.
>
> Well, gEDF could take asymmetric capacities into account by scheduling
> the earliest deadline task on the fastest CPU (and the task with the
> second earliest deadline on the second fastest CPU, and so on...)
>
> But this could cause a lot of unneeded migrations (I tried to discuss
> this issue in a previous OSPM presentation). My original approach to
> work around this issue was to schedule a task on the slowest core on
> which the task can fit (some experiments revealed that this heuristic
> can approximate the gEDF behaviour without causing too many
> migrations)... But this patch is not included in the current patchset,
> and will be proposed later, after the most important patches have been
> merged.
OK, makes sense to me. And I'm also OK with a 2-step approach: asym
idle now and asym busy with a later series.
Best,
Juri
On 09/04/20 19:29, Dietmar Eggemann wrote:
[...]
>
> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
> with the help of an extra per-cpu cpumask we could just
Hummm, I like the idea, but
> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
>
> dl_bw_cpus(int i) {
This works if calls are always local to the rd we are interested in
(the 'i' argument isn't used). Are we always doing that?
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
>
> return cpumask_weight(cpus);
> }
>
> and
>
> dl_bw_capacity(int i) {
>
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> ...
> cpumask_and(cpus, rd->span, cpu_active_mask);
> if (cpumask_equal(cpus, rd->span))
> return rd->sum_cpu_capacity;
What if capacities change between invocations (with the same span)?
Can that happen?
>
> for_each_cpu(i, cpus)
> cap += capacity_orig_of(i);
>
> return cap;
> }
>
> So only in cases in which rd->span and cpu_active_mask differ we would
> have to sum up again.
Thanks,
Juri
On 17.04.20 14:19, Juri Lelli wrote:
> On 09/04/20 19:29, Dietmar Eggemann wrote:
[...]
>> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
>> with the help of an extra per-cpu cpumask we could just
>
> Hummm, I like the idea, but
>
>> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
>>
>> dl_bw_cpus(int i) {
>
> This works if calls are always local to the rd we are interested in
> (the 'i' argument isn't used). Are we always doing that?
I thought so. The existing dl_bw_cpus(int i) implementation already
assumes this by using:
struct root_domain *rd = cpu_rq(i)->rd;
...
for_each_cpu_and(i, rd->span, cpu_active_mask)
Or did you refer to something else here?
And the patch would not introduce new places in which we call
dl_bw_cpus(). It will just replace some with a dl_bw_capacity() call.
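(For reference, the pre-series helper being discussed here reads roughly as
follows; this is just the existing counting logic spelled out:)

	/* Existing dl_bw_cpus(): count i's root-domain CPUs that are active. */
	static inline int dl_bw_cpus(int i)
	{
		struct root_domain *rd = cpu_rq(i)->rd;
		int cpus = 0;

		RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
				 "sched RCU must be held");

		for_each_cpu_and(i, rd->span, cpu_active_mask)
			cpus++;

		return cpus;
	}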
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
>> ...
>> cpumask_and(cpus, rd->span, cpu_active_mask);
>>
>> return cpumask_weight(cpus);
>> }
>>
>> and
>>
>> dl_bw_capacity(int i) {
>>
>> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
>> ...
>> cpumask_and(cpus, rd->span, cpu_active_mask);
>> if (cpumask_equal(cpus, rd->span))
>> return rd->sum_cpu_capacity;
>
> What if capacities change between invocations (with the same span)?
> Can that happen?
The CPU capacity should only change during initial bring-up. On
asymmetric CPU capacity systems we have to re-create the Sched Domain
(SD) topology after CPUfreq becomes available.
After the initial build and this first rebuild of the SD topology, the
CPU capacity should be stable.
Everything which might follow afterwards (starting EAS, exclusive
cpusets or CPU hp) will not change the CPU capacity.
Obviously, if you defer loading the CPUfreq driver until after you have started
DL scheduling, you can break things. But this is not considered a valid
environment here.
[...]
On 17/04/20 16:55, Dietmar Eggemann wrote:
> On 17.04.20 14:19, Juri Lelli wrote:
> > On 09/04/20 19:29, Dietmar Eggemann wrote:
>
> [...]
>
> >> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
> >> with the help of an extra per-cpu cpumask we could just
> >
> > Hummm, I like the idea, but
> >
> >> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
> >>
> >> dl_bw_cpus(int i) {
> >
> > This works if calls are always local to the rd we are interested in
> > (the 'i' argument isn't used). Are we always doing that?
>
> I thought so. The existing dl_bw_cpus(int i) implementation already
> assumes this by using:
>
> struct root_domain *rd = cpu_rq(i)->rd;
Hummm, can't dl_task_can_attach() call it with a dest_cpu different from
this_cpu?
The current implementation uses the 'i' argument to get to the right root_domain
(e.g., when moving tasks between exclusive cpusets).
> ...
>
> for_each_cpu_and(i, rd->span, cpu_active_mask)
>
> Or did you refer to something else here?
>
> And the patch would not introduce new places in which we call
> dl_bw_cpus(). It will just replace some with a dl_bw_capacity() call.
>
> >> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> >> ...
> >> cpumask_and(cpus, rd->span, cpu_active_mask);
> >>
> >> return cpumask_weight(cpus);
> >> }
> >>
> >> and
> >>
> >> dl_bw_capacity(int i) {
> >>
> >> struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
> >> ...
> >> cpumask_and(cpus, rd->span, cpu_active_mask);
> >> if (cpumask_equal(cpus, rd->span))
> >> return rd->sum_cpu_capacity;
> >
> > What if capacities change between invocations (with the same span)?
> > Can that happen?
>
> The CPU capacity should only change during initial bring-up. On
> asymmetric CPU capacity systems we have to re-create the Sched Domain
> (SD) topology after CPUfreq becomes available.
>
> After the initial build and this first rebuild of the SD topology, the
> CPU capacity should be stable.
>
> Everything which might follow afterwards (starting EAS, exclusive
> cpusets or CPU hp) will not change the CPU capacity.
>
> Obviously, if you defer loading the CPUfreq driver until after you have started
> DL scheduling, you can break things. But this is not considered a valid
> environment here.
OK. Makes sense.
On 17/04/20 17:08, Juri Lelli wrote:
> On 17/04/20 16:55, Dietmar Eggemann wrote:
> > On 17.04.20 14:19, Juri Lelli wrote:
> > > On 09/04/20 19:29, Dietmar Eggemann wrote:
> >
> > [...]
> >
> > >> Maybe we can do a hybrid. We have rd->span and rd->sum_cpu_capacity and
> > >> with the help of an extra per-cpu cpumask we could just
> > >
> > > Hummm, I like the idea, but
> > >
> > >> DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);
> > >>
> > >> dl_bw_cpus(int i) {
> > >
> > > This works if calls are always local to the rd we are interested in
> > > (the 'i' argument isn't used). Are we always doing that?
> >
> > I thought so. The existing dl_bw_cpus(int i) implementation already
> > assumes this by using:
> >
> > struct root_domain *rd = cpu_rq(i)->rd;
>
> Hummm, can't dl_task_can_attach() call it with a dest_cpu different from
> this_cpu?
>
> The current implementation uses the 'i' argument to get to the right root_domain
> (e.g., when moving tasks between exclusive cpusets).
>
> > ...
Bah, forget that. If everything else stays the same (get rd using 'i')
this should work ok.
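Putting this sub-thread together, the two helpers could end up roughly like the
sketch below, with the root domain taken from cpu_rq(i)->rd as agreed above.
The per-CPU dl_bw_mask and rd->sum_cpu_capacity are the ones proposed earlier
in the thread; details of the final patch may differ.

	static DEFINE_PER_CPU(cpumask_var_t, dl_bw_mask);

	/* Number of active CPUs in i's root domain. */
	static inline int dl_bw_cpus(int i)
	{
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
		struct root_domain *rd = cpu_rq(i)->rd;

		cpumask_and(cpus, rd->span, cpu_active_mask);

		return cpumask_weight(cpus);
	}

	/*
	 * Capacity sum of the active CPUs in i's root domain. Use the cached
	 * rd->sum_cpu_capacity unless the active mask and rd->span disagree
	 * (i.e. we are racing with hotplug), in which case sum up again.
	 */
	static inline unsigned long dl_bw_capacity(int i)
	{
		struct cpumask *cpus = this_cpu_cpumask_var_ptr(dl_bw_mask);
		struct root_domain *rd = cpu_rq(i)->rd;
		unsigned long cap = 0;

		cpumask_and(cpus, rd->span, cpu_active_mask);

		if (cpumask_equal(cpus, rd->span))
			return rd->sum_cpu_capacity;

		for_each_cpu(i, cpus)
			cap += capacity_orig_of(i);

		return cap;
	}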