2019-05-06 05:52:12

by luca abeni

Subject: [RFC PATCH 0/6] Capacity awareness for SCHED_DEADLINE

From: luca abeni <[email protected]>

The SCHED_DEADLINE scheduling policy currently has some issues with
asymmetric CPU capacity architectures (such as ARM big.LITTLE).
In particular, the admission control mechanism does not work correctly
(because it considers all cores to have the same speed as the fastest
core), and the scheduler does not consider the cores' speeds when
migrating tasks.

This patchset fixes these issues by explicitly considering the cores'
capacities in the admission control mechanism and when migrating tasks.

In particular, the scheduler is improved so that "heavyweight tasks"
(processes or threads whose runtime, when rescaled for a LITTLE core,
becomes larger than the relative deadline) are scheduled on big cores.
This allows respecting deadlines that the current scheduler misses.

Moreover, the migration logic is modified so that "lightweight tasks"
(processes or threads whose runtime, even when rescaled for a LITTLE
core, stays smaller than the relative deadline) are scheduled on LITTLE
cores when possible. This allows saving some energy without missing
deadlines.
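
As a rough illustration of this heavyweight/lightweight distinction, here
is a minimal standalone sketch (not kernel code; the 1024-based capacity
scale follows the kernel's convention, while the example task parameters
and the LITTLE capacity of 512 are assumptions):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10	/* capacities live on a 0..1024 scale */

/*
 * A task "fits" on a core if its runtime, rescaled by the core's
 * capacity, still fits within its relative deadline.
 */
static int fits_on(uint64_t runtime_ns, uint64_t deadline_ns, uint64_t cap)
{
	return ((deadline_ns * cap) >> SCHED_CAPACITY_SHIFT) >= runtime_ns;
}

int main(void)
{
	/* hypothetical task: runtime 70ms, relative deadline 100ms */
	uint64_t runtime = 70000000ULL, deadline = 100000000ULL;

	/* "heavyweight" for the LITTLE core, fine for the big one */
	printf("big (cap 1024):   %s\n",
	       fits_on(runtime, deadline, 1024) ? "fits" : "does not fit");
	printf("LITTLE (cap 512): %s\n",
	       fits_on(runtime, deadline, 512) ? "fits" : "does not fit");
	return 0;
}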


luca abeni (6):
sched/dl: Improve deadline admission control for asymmetric CPU
capacities
sched/dl: Capacity-aware migrations
sched/dl: Try better placement even for deadline tasks that do not
block
sched/dl: Improve capacity-aware wakeup
sched/dl: If the task does not fit anywhere, select the fastest core
sched/dl: Try not to select a too fast core

drivers/base/arch_topology.c | 1 +
include/linux/sched.h | 1 +
include/linux/sched/topology.h | 3 ++
kernel/sched/core.c | 2 +
kernel/sched/cpudeadline.c | 53 ++++++++++++++++++++++--
kernel/sched/deadline.c | 76 ++++++++++++++++++++++++++++------
kernel/sched/sched.h | 16 +++++--
kernel/sched/topology.c | 9 ++++
8 files changed, 143 insertions(+), 18 deletions(-)

--
2.20.1


2019-05-06 05:52:12

by luca abeni

Subject: [RFC PATCH 5/6] sched/dl: If the task does not fit anywhere, select the fastest core

From: luca abeni <[email protected]>

When a task has a remaining runtime that cannot be served within the
scheduling deadline by any of the idle cores, the task is doomed to
miss its deadline (this can happen because the admission control only
guarantees a bounded tardiness, not that every deadline is respected).
In this case, try to select the fastest idle core, to minimize the
tardiness.
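
For example (capacities and parameters assumed): if the idle cores
currently offer effective capacities of 215, 215 and 512 (a big core
running at half frequency) and the task still needs 70ms of runtime with
100ms to its deadline, the fit test (100ms * cap) >> 10 yields 21ms, 21ms
and 50ms, all below 70ms; no core fits, so the big core (the one with the
highest capacity) is picked to keep the tardiness as small as possible.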

Signed-off-by: luca abeni <[email protected]>
---
kernel/sched/cpudeadline.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 111dd9ac837b..2a4ac7b529b7 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -123,7 +123,7 @@ static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
}

if (c)
- *c = cap;
+ *c = arch_scale_cpu_capacity(NULL, cpu);

if ((rel_deadline * cap) >> SCHED_CAPACITY_SHIFT < rem_runtime)
return 0;
@@ -146,14 +146,22 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
- int cpu;
+ int cpu, max_cpu = -1;
+ u64 max_cap = 0;

for_each_cpu(cpu, later_mask) {
u64 cap;

if (!dl_task_fit(&p->dl, cpu, &cap))
cpumask_clear_cpu(cpu, later_mask);
+
+ if (cap > max_cap) {
+ max_cap = cap;
+ max_cpu = cpu;
+ }
}
+ if (cpumask_empty(later_mask) && max_cap)
+ cpumask_set_cpu(max_cpu, later_mask);

if (!cpumask_empty(later_mask))
return 1;
--
2.20.1

2019-05-06 05:52:12

by luca abeni

Subject: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

From: luca abeni <[email protected]>

Currently, the SCHED_DEADLINE scheduler uses a global EDF scheduling
algorithm, migrating tasks to CPU cores without considering the core
capacity and the task utilization. This works well on homogeneous
systems (SCHED_DEADLINE tasks are guaranteed to have a bounded
tardiness), but presents some issues on heterogeneous systems. For
example, a SCHED_DEADLINE task might be migrated to a core that does
not have enough processing capacity to correctly serve the task (think
about a task with runtime 70ms and period 100ms migrated to a core with
processing capacity 0.5).

This commit is a first step to address the issue: When a task wakes
up or migrates away from a CPU core, the scheduler tries to find an
idle core having enough processing capacity to serve the task.
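
As a worked example of the fit test introduced below (numbers assumed):
for the 70ms/100ms task above, a core whose capacity scaled by its
current frequency is 512 (i.e. 0.5) gives (100ms * 512) >> 10 = 50ms,
which is smaller than 70ms, so the core is dropped from the candidate
mask; a full-capacity core (1024) gives 100ms >= 70ms and is kept.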

Signed-off-by: luca abeni <[email protected]>
---
kernel/sched/cpudeadline.c | 31 +++++++++++++++++++++++++++++--
kernel/sched/deadline.c | 8 ++++++--
kernel/sched/sched.h | 7 ++++++-
3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 50316455ea66..d21f7905b9c1 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -110,6 +110,22 @@ static inline int cpudl_maximum(struct cpudl *cp)
return cp->elements[0].cpu;
}

+static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
+ int cpu, u64 *c)
+{
+ u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
+ s64 rel_deadline = dl_se->dl_deadline;
+ u64 rem_runtime = dl_se->dl_runtime;
+
+ if (c)
+ *c = cap;
+
+ if ((rel_deadline * cap) >> SCHED_CAPACITY_SHIFT < rem_runtime)
+ return 0;
+
+ return 1;
+}
+
/*
* cpudl_find - find the best (later-dl) CPU in the system
* @cp: the cpudl max-heap context
@@ -125,8 +141,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,

if (later_mask &&
cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
- return 1;
- } else {
+ int cpu;
+
+ for_each_cpu(cpu, later_mask) {
+ u64 cap;
+
+ if (!dl_task_fit(&p->dl, cpu, &cap))
+ cpumask_clear_cpu(cpu, later_mask);
+ }
+
+ if (!cpumask_empty(later_mask))
+ return 1;
+ }
+ {
int best_cpu = cpudl_maximum(cp);

WARN_ON(best_cpu != -1 && !cpu_present(best_cpu));
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5b981eeeb944..3436f3d8fa8f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1584,6 +1584,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
if (sd_flag != SD_BALANCE_WAKE)
goto out;

+ if (dl_entity_is_special(&p->dl))
+ goto out;
+
rq = cpu_rq(cpu);

rcu_read_lock();
@@ -1598,10 +1601,11 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
* other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important.
*/
- if (unlikely(dl_task(curr)) &&
+ if ((unlikely(dl_task(curr)) &&
(curr->nr_cpus_allowed < 2 ||
!dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) {
+ (p->nr_cpus_allowed > 1)) ||
+ static_branch_unlikely(&sched_asym_cpucapacity)) {
int target = find_later_rq(p);

if (target != -1 &&
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 32d242694863..e5f9fd3aee80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2367,7 +2367,12 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,

static inline unsigned long cpu_bw_dl(struct rq *rq)
{
- return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+ unsigned long res;
+
+ res = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+ return (res << SCHED_CAPACITY_SHIFT) /
+ arch_scale_cpu_capacity(NULL, rq->cpu);
}

static inline unsigned long cpu_util_dl(struct rq *rq)
--
2.20.1

2019-05-06 05:52:12

by luca abeni

Subject: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

From: luca abeni <[email protected]>

Currently, the SCHED_DEADLINE admission control ensures that the
sum of reserved CPU bandwidths is smaller than x * M, where the
reserved CPU bandwidth of a SCHED_DEADLINE task is defined as the
ratio between its runtime and its period, x is a user-definable
percentage (95% by default, can be controlled through
/proc/sys/kernel/sched_rt_{runtime,period}_us) and M is the number
of CPU cores.

This admission control works well (guaranteeing bounded tardiness
for SCHED_DEADLINE tasks and non-starvation of non SCHED_DEADLINE
tasks) for homogeneous systems (where all the CPU cores have the
same computing capacity), but it ends up over-allocating the CPU time
in the presence of less-powerful CPU cores.

It can easily be shown that, on asymmetric CPU capacity architectures
(such as ARM big.LITTLE), SCHED_DEADLINE tasks can starve all the
other tasks, making the system unusable.

This commit fixes the issue by explicitly considering the cores'
capacities in the admission test (where "M" is replaced by the sum
of all the capacities of cores in the root domain).
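
A standalone sketch of the difference (the example capacities, bandwidths
and the 95% limit are assumptions; BW_SHIFT and SCHED_CAPACITY_SHIFT
mirror the kernel's fixed-point conventions):

#include <stdio.h>
#include <stdint.h>

#define BW_SHIFT		20	/* deadline bandwidths: 2^20 == one full CPU */
#define SCHED_CAPACITY_SHIFT	10	/* capacities: 1024 == one full-speed CPU */

/* old test: reserved bandwidth must stay below limit * (number of CPUs) */
static int overflow_by_cpus(uint64_t limit, int cpus, uint64_t total_bw)
{
	return limit * cpus < total_bw;
}

/* new test: compare against limit * (sum of the CPU capacities) instead */
static int overflow_by_capacity(uint64_t limit, uint64_t rd_capacity, uint64_t total_bw)
{
	return ((limit * rd_capacity) >> SCHED_CAPACITY_SHIFT) < total_bw;
}

int main(void)
{
	/* assumed big.LITTLE root domain: 4 big (1024) + 4 LITTLE (430) */
	uint64_t rd_capacity = 4 * 1024 + 4 * 430;		/* ~5.7 "full CPUs" */
	uint64_t limit = ((uint64_t)95 << BW_SHIFT) / 100;	/* 95% per capacity unit */
	uint64_t total_bw = (uint64_t)6 << BW_SHIFT;		/* 6 CPUs' worth of tasks */

	/* the count-based test admits this, the capacity-based one rejects it */
	printf("count-based overflow:    %d\n", overflow_by_cpus(limit, 8, total_bw));
	printf("capacity-based overflow: %d\n", overflow_by_capacity(limit, rd_capacity, total_bw));
	return 0;
}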

Signed-off-by: luca abeni <[email protected]>
---
drivers/base/arch_topology.c | 1 +
include/linux/sched/topology.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/deadline.c | 19 +++++++++++++------
kernel/sched/sched.h | 7 +++++--
kernel/sched/topology.c | 9 +++++++++
6 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d982e4..646d6d349d53 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
{
+ topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu), capacity);
per_cpu(cpu_scale, cpu) = capacity;
}

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2bf680b42f3c..a4898e42f368 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -233,4 +233,7 @@ static inline int task_node(const struct task_struct *p)
return cpu_to_node(task_cpu(p));
}

+void topology_update_cpu_capacity(unsigned int cpu, unsigned long old,
+ unsigned long new);
+
#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8456801caa3..a37bbb246802 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6271,6 +6271,7 @@ void set_rq_online(struct rq *rq)
if (class->rq_online)
class->rq_online(rq);
}
+ rq->rd->rd_capacity += arch_scale_cpu_capacity(NULL, cpu_of(rq));
}
}

@@ -6286,6 +6287,7 @@ void set_rq_offline(struct rq *rq)

cpumask_clear_cpu(rq->cpu, rq->rd->online);
rq->online = 0;
+ rq->rd->rd_capacity -= arch_scale_cpu_capacity(NULL, cpu_of(rq));
}
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6a73e41a2016..5b981eeeb944 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -63,6 +63,10 @@ static inline int dl_bw_cpus(int i)

return cpus;
}
+static inline unsigned long rd_capacity(struct rq *rq)
+{
+ return rq->rd->rd_capacity;
+}
#else
static inline struct dl_bw *dl_bw_of(int i)
{
@@ -73,6 +77,11 @@ static inline int dl_bw_cpus(int i)
{
return 1;
}
+
+static inline unsigned long rd_capacity(struct rq *rq)
+{
+ return SCHED_CAPACITY_SCALE;
+}
#endif

static inline
@@ -2535,13 +2544,13 @@ int sched_dl_overflow(struct task_struct *p, int policy,
raw_spin_lock(&dl_b->lock);
cpus = dl_bw_cpus(task_cpu(p));
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, rd_capacity(task_rq(p)), 0, new_bw)) {
if (hrtimer_active(&p->dl.inactive_timer))
__dl_sub(dl_b, p->dl.dl_bw, cpus);
__dl_add(dl_b, new_bw, cpus);
err = 0;
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
+ !__dl_overflow(dl_b, rd_capacity(task_rq(p)), p->dl.dl_bw, new_bw)) {
/*
* XXX this is slightly incorrect: when the task
* utilization decreases, we should delay the total
@@ -2689,7 +2698,7 @@ int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allo
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ overflow = __dl_overflow(dl_b, rd_capacity(cpu_rq(dest_cpu)), 0, p->dl.dl_bw);
if (overflow) {
ret = -EBUSY;
} else {
@@ -2734,13 +2743,11 @@ bool dl_cpu_busy(unsigned int cpu)
unsigned long flags;
struct dl_bw *dl_b;
bool overflow;
- int cpus;

rcu_read_lock_sched();
dl_b = dl_bw_of(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ overflow = __dl_overflow(dl_b, rd_capacity(cpu_rq(cpu)), 0, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
rcu_read_unlock_sched();

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69114842d640..32d242694863 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -305,10 +305,11 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus)
}

static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
+bool __dl_overflow(struct dl_bw *dl_b, unsigned long cap, u64 old_bw, u64 new_bw)
{
return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
+ (dl_b->bw * cap) >> SCHED_CAPACITY_SHIFT <
+ dl_b->total_bw - old_bw + new_bw;
}

extern void dl_change_utilization(struct task_struct *p, u64 new_bw);
@@ -785,6 +786,8 @@ struct root_domain {
unsigned long max_cpu_capacity;
unsigned long min_cpu_capacity;

+ unsigned long rd_capacity;
+
/*
* NULL-terminated list of performance domains intersecting with the
* CPUs of the rd. Protected by RCU.
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b1c4b68a4d17..1cd20a2e5bc8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2255,3 +2255,12 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

mutex_unlock(&sched_domains_mutex);
}
+
+void topology_update_cpu_capacity(unsigned int cpu, unsigned long old,
+ unsigned long new)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ rq->rd->rd_capacity -= old;
+ rq->rd->rd_capacity += new;
+}
--
2.20.1

2019-05-06 05:53:04

by luca abeni

Subject: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

From: luca abeni <[email protected]>

Instead of considering the "static CPU bandwidth" allocated to
a SCHED_DEADLINE task (the ratio between its maximum runtime and
reservation period), try to use the remaining runtime and the time
to the scheduling deadline.
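
A standalone sketch of the intended check (the numbers are assumptions):
use the remaining runtime and the time to the current scheduling
deadline, but fall back to the static parameters when the deadline is in
the past or the remaining density exceeds Q/P (in that case the CBS will
generate a new deadline and replenish the runtime anyway):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

static int fits(int64_t rel_deadline, uint64_t rem_runtime,
		uint64_t dl_deadline, uint64_t dl_runtime, uint64_t cap)
{
	/* stale deadline, or q/(d - t) > Q/P: the static pair is what matters */
	if (rel_deadline < 0 ||
	    (uint64_t)rel_deadline * dl_runtime < dl_deadline * rem_runtime) {
		rel_deadline = dl_deadline;
		rem_runtime = dl_runtime;
	}
	return (((uint64_t)rel_deadline * cap) >> SCHED_CAPACITY_SHIFT) >= rem_runtime;
}

int main(void)
{
	/* assumed reservation: Q = 70ms, P = D = 100ms */
	uint64_t Q = 70000000ULL, D = 100000000ULL;

	/* statically the task does not fit on a capacity-512 core... */
	printf("static fit on 512:  %d\n", fits(D, Q, D, Q, 512));
	/* ...but with 10ms left and 40ms to its deadline it currently does */
	printf("dynamic fit on 512: %d\n", fits(40000000LL, 10000000ULL, D, Q, 512));
	return 0;
}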

Signed-off-by: luca abeni <[email protected]>
---
kernel/sched/cpudeadline.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index d21f7905b9c1..111dd9ac837b 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -114,8 +114,13 @@ static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
int cpu, u64 *c)
{
u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
- s64 rel_deadline = dl_se->dl_deadline;
- u64 rem_runtime = dl_se->dl_runtime;
+ s64 rel_deadline = dl_se->deadline - sched_clock_cpu(smp_processor_id());
+ u64 rem_runtime = dl_se->runtime;
+
+ if ((rel_deadline < 0) || (rel_deadline * dl_se->dl_runtime < dl_se->dl_deadline * rem_runtime)) {
+ rel_deadline = dl_se->dl_deadline;
+ rem_runtime = dl_se->dl_runtime;
+ }

if (c)
*c = cap;
--
2.20.1

2019-05-06 05:53:04

by luca abeni

Subject: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

From: luca abeni <[email protected]>

Currently, the scheduler tries to find a proper placement for
SCHED_DEADLINE tasks when they are pushed out of a core or when
they wake up. Hence, if there is a single SCHED_DEADLINE task
that never blocks and wakes up, such a task is never migrated to
an appropriate CPU core, but continues to execute on its original
core.

This commit addresses the issue by trying to migrate a SCHED_DEADLINE
task (searching for an appropriate CPU core) the first time it is
throttled.
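
In outline, the mechanism added below works roughly as follows (a sketch
of the flow, not the exact code):

  switch to SCHED_DEADLINE, or wakeup that selects a different CPU
      -> p->dl.dl_adjust = 1
  update_curr_dl(): runtime exhausted, the task gets throttled
      -> if dl_adjust is set and no migration is already pending:
             rq->migrating_task = current
             queue_balance_callback(..., migrate_dl_task)
  migrate_dl_task() runs from the balance callback:
      -> cpu = find_later_rq(p)   /* capacity-aware since patch 2 */
      -> if a better CPU was found, move the task's bandwidth
         (sub/add running_bw and rq_bw) and set_task_cpu() to it
      -> clear rq->migrating_task and dl_adjust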

Signed-off-by: luca abeni <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/deadline.c | 53 ++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 ++
3 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 863f70843875..5e322c8a94e0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -560,6 +560,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1;
+ unsigned int dl_adjust : 1;

/*
* Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3436f3d8fa8f..db471889196b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -515,6 +515,7 @@ static inline bool need_pull_dl_task(struct rq *rq, struct task_struct *prev)
return dl_task(prev);
}

+static DEFINE_PER_CPU(struct callback_head, dl_migrate_head);
static DEFINE_PER_CPU(struct callback_head, dl_push_head);
static DEFINE_PER_CPU(struct callback_head, dl_pull_head);

@@ -1149,6 +1150,32 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
return (delta * u_act) >> BW_SHIFT;
}

+#ifdef CONFIG_SMP
+static int find_later_rq(struct task_struct *task);
+
+static void migrate_dl_task(struct rq *rq)
+{
+ struct task_struct *t = rq->migrating_task;
+ struct sched_dl_entity *dl_se = &t->dl;
+ int cpu = find_later_rq(t);
+
+ if ((cpu != -1) && (cpu != rq->cpu)) {
+ struct rq *later_rq;
+
+ later_rq = cpu_rq(cpu);
+
+ double_lock_balance(rq, later_rq);
+ sub_running_bw(&t->dl, &rq->dl);
+ sub_rq_bw(&t->dl, &rq->dl);
+ set_task_cpu(t, later_rq->cpu);
+ add_rq_bw(&t->dl, &later_rq->dl);
+ add_running_bw(&t->dl, &later_rq->dl);
+ double_unlock_balance(rq, later_rq);
+ }
+ rq->migrating_task = NULL;
+ dl_se->dl_adjust = 0;
+}
+#endif
/*
* Update the current task's runtime statistics (provided it is still
* a -deadline task and has not been removed from the dl_rq).
@@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
dl_se->dl_overrun = 1;

__dequeue_task_dl(rq, curr, 0);
- if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
+ if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
+#ifdef CONFIG_SMP
+ } else if (dl_se->dl_adjust) {
+ if (rq->migrating_task == NULL) {
+ queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);
+ rq->migrating_task = current;
+ } else
+ printk_deferred("Throttled task before migrating the previous one???\n");
+#endif
+ }

if (!is_leftmost(curr, &rq->dl))
resched_curr(rq);
@@ -1573,13 +1609,12 @@ static void yield_task_dl(struct rq *rq)

#ifdef CONFIG_SMP

-static int find_later_rq(struct task_struct *task);
-
static int
select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
+ bool het;

if (sd_flag != SD_BALANCE_WAKE)
goto out;
@@ -1591,6 +1626,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)

rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */
+ het = static_branch_unlikely(&sched_asym_cpucapacity);

/*
* If we are dealing with a -deadline task, we must
@@ -1604,15 +1640,17 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
if ((unlikely(dl_task(curr)) &&
(curr->nr_cpus_allowed < 2 ||
!dl_entity_preempt(&p->dl, &curr->dl)) &&
- (p->nr_cpus_allowed > 1)) ||
- static_branch_unlikely(&sched_asym_cpucapacity)) {
+ (p->nr_cpus_allowed > 1)) || het) {
int target = find_later_rq(p);

if (target != -1 &&
(dl_time_before(p->dl.deadline,
cpu_rq(target)->dl.earliest_dl.curr) ||
- (cpu_rq(target)->dl.dl_nr_running == 0)))
+ (cpu_rq(target)->dl.dl_nr_running == 0))) {
+ if (het && (target != cpu))
+ p->dl.dl_adjust = 1;
cpu = target;
+ }
}
rcu_read_unlock();

@@ -2369,6 +2407,9 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
else
resched_curr(rq);
}
+
+ if (static_branch_unlikely(&sched_asym_cpucapacity))
+ p->dl.dl_adjust = 1;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5f9fd3aee80..1a8f75338ac2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -963,6 +963,8 @@ struct rq {

/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
+
+ struct task_struct *migrating_task;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
--
2.20.1

2019-05-06 05:53:34

by luca abeni

Subject: [RFC PATCH 6/6] sched/dl: Try not to select a too fast core

From: luca abeni <[email protected]>

When a task can fit on multiple CPU cores, try to select the slowest
core that is able to properly serve the task. This avoids useless
future migrations, leaving the "fast cores" idle for more heavyweight
tasks.
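
A minimal standalone sketch of the selection rule (the capacities and
task parameters are assumptions, and the kernel code below operates on
cpumasks rather than arrays):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

static int fits(uint64_t runtime, uint64_t deadline, uint64_t cap)
{
	return ((deadline * cap) >> SCHED_CAPACITY_SHIFT) >= runtime;
}

int main(void)
{
	/* assumed idle cores: two LITTLEs and two bigs */
	uint64_t caps[] = { 430, 430, 1024, 1024 };
	/* task: runtime 30ms, relative deadline 100ms: fits everywhere */
	uint64_t runtime = 30000000ULL, deadline = 100000000ULL;
	uint64_t min_cap = UINT64_MAX;
	int cpu, best = -1;

	/* keep the slowest core that still fits the task */
	for (cpu = 0; cpu < 4; cpu++) {
		if (fits(runtime, deadline, caps[cpu]) && caps[cpu] < min_cap) {
			min_cap = caps[cpu];
			best = cpu;
		}
	}
	if (best != -1)
		printf("slowest fitting core: CPU %d (cap %llu)\n",
		       best, (unsigned long long)min_cap);
	return 0;
}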

Signed-off-by: luca abeni <[email protected]>
---
kernel/sched/cpudeadline.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 2a4ac7b529b7..897ed71af515 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -143,17 +143,24 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
struct cpumask *later_mask)
{
const struct sched_dl_entity *dl_se = &p->dl;
+ struct cpumask tmp_mask;

if (later_mask &&
- cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
+ cpumask_and(&tmp_mask, cp->free_cpus, &p->cpus_allowed)) {
int cpu, max_cpu = -1;
- u64 max_cap = 0;
+ u64 max_cap = 0, min_cap = SCHED_CAPACITY_SCALE * SCHED_CAPACITY_SCALE;

- for_each_cpu(cpu, later_mask) {
+ cpumask_clear(later_mask);
+ for_each_cpu(cpu, &tmp_mask) {
u64 cap;

- if (!dl_task_fit(&p->dl, cpu, &cap))
- cpumask_clear_cpu(cpu, later_mask);
+ if (dl_task_fit(&p->dl, cpu, &cap) && (cap <= min_cap)) {
+ if (cap < min_cap) {
+ min_cap = cap;
+ cpumask_clear(later_mask);
+ }
+ cpumask_set_cpu(cpu, later_mask);
+ }

if (cap > max_cap) {
max_cap = cap;
--
2.20.1

2019-05-07 13:38:03

by Quentin Perret

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

Hi Luca,

On Monday 06 May 2019 at 06:48:32 (+0200), Luca Abeni wrote:
> +static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> + int cpu, u64 *c)
> +{
> + u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;

I'm a little bit confused by this use of arch_scale_freq_capacity()
here. IIUC this means you would say a big DL task doesn't fit on a big
CPU just because it happens to be running at a low frequency when this
function is called. Is this what we want ?

If the frequency is low, we can (probably) raise it to accommodate this
DL task so perhaps we should say it fits ?

> + s64 rel_deadline = dl_se->dl_deadline;
> + u64 rem_runtime = dl_se->dl_runtime;
> +
> + if (c)
> + *c = cap;
> +
> + if ((rel_deadline * cap) >> SCHED_CAPACITY_SHIFT < rem_runtime)
> + return 0;
> +
> + return 1;
> +}

Thanks,
Quentin

2019-05-07 13:50:08

by Quentin Perret

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

Hi Luca,

On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index edfcf8d982e4..646d6d349d53 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
>
> void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
> {
> + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu), capacity);

Why is that one needed ? Don't you end up re-building the sched domains
after this anyways ?

> per_cpu(cpu_scale, cpu) = capacity;
> }

Thanks,
Quentin

2019-05-07 13:57:25

by Vincent Guittot

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Tue, 7 May 2019 at 15:48, Quentin Perret <[email protected]> wrote:
>
> Hi Luca,
>
> On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > index edfcf8d982e4..646d6d349d53 100644
> > --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
> >
> > void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
> > {
> > + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu), capacity);
>
> Why is that one needed ? Don't you end up re-building the sched domains
> after this anyways ?

I was looking at the same point.
Also this doesn't take into account if the cpu is offline

Do we also need of the line below in set_rq_online
+ rq->rd->rd_capacity += arch_scale_cpu_capacity(NULL,
cpu_of(rq));

building the sched_domain seems a better place to set rq->rd->rd_capacity

>
> > per_cpu(cpu_scale, cpu) = capacity;
> > }
>
> Thanks,
> Quentin

2019-05-07 14:04:47

by Quentin Perret

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Tuesday 07 May 2019 at 15:55:37 (+0200), Vincent Guittot wrote:
> On Tue, 7 May 2019 at 15:48, Quentin Perret <[email protected]> wrote:
> >
> > Hi Luca,
> >
> > On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> > > diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> > > index edfcf8d982e4..646d6d349d53 100644
> > > --- a/drivers/base/arch_topology.c
> > > +++ b/drivers/base/arch_topology.c
> > > @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
> > >
> > > void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
> > > {
> > > + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu), capacity);
> >
> > Why is that one needed ? Don't you end up re-building the sched domains
> > after this anyways ?
>
> I was looking at the same point.
> Also this doesn't take into account if the cpu is offline
>
> Do we also need of the line below in set_rq_online
> + rq->rd->rd_capacity += arch_scale_cpu_capacity(NULL,
> cpu_of(rq));
>
> building the sched_domain seems a better place to set rq->rd->rd_capacity

Perhaps this could hook directly in rq_attach_root() ? We don't really
need the decrement part no ? That is, in case of hotplug the old rd
should be destroyed anyways.

Thanks,
Quentin

>
> >
> > > per_cpu(cpu_scale, cpu) = capacity;
> > > }
> >
> > Thanks,
> > Quentin

2019-05-07 14:13:40

by Quentin Perret

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On Monday 06 May 2019 at 06:48:32 (+0200), Luca Abeni wrote:
> static inline unsigned long cpu_bw_dl(struct rq *rq)
> {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long res;
> +
> + res = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> +
> + return (res << SCHED_CAPACITY_SHIFT) /
> + arch_scale_cpu_capacity(NULL, rq->cpu);

The only user of cpu_bw_dl() is schedutil right ? If yes, we probably
don't want to scale things here -- schedutil already does this I
believe.

Thanks,
Quentin

> }
>
> static inline unsigned long cpu_util_dl(struct rq *rq)
> --
> 2.20.1
>

2019-05-07 14:16:13

by Quentin Perret

Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On Monday 06 May 2019 at 06:48:33 (+0200), Luca Abeni wrote:
> @@ -1591,6 +1626,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>
> rcu_read_lock();
> curr = READ_ONCE(rq->curr); /* unlocked access */
> + het = static_branch_unlikely(&sched_asym_cpucapacity);

Nit: not sure how the generated code looks like but I wonder if this
could potentially make you loose the benefit of the static key ?

Thanks,
Quentin

2019-05-07 14:19:10

by luca abeni

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

Hi Quentin,

On Tue, 7 May 2019 14:35:28 +0100
Quentin Perret <[email protected]> wrote:

> Hi Luca,
>
> On Monday 06 May 2019 at 06:48:32 (+0200), Luca Abeni wrote:
> > +static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> > + int cpu, u64 *c)
> > +{
> > + u64 cap = (arch_scale_cpu_capacity(NULL, cpu) *
> > arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
>
> I'm a little bit confused by this use of arch_scale_freq_capacity()
> here. IIUC this means you would say a big DL task doesn't fit on a big
> CPU just because it happens to be running at a low frequency when this
> function is called. Is this what we want ?

The idea of this approach was to avoid frequency switches when
possible; so, I wanted to check if the task fits on a CPU core at its
current operating frequency.
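
For instance (numbers assumed): a big core with capacity 1024 currently
running at half of its maximum frequency contributes
(1024 * 512) >> 10 = 512 to the fit test, so at that moment it looks no
better than a LITTLE core at full speed.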


> If the frequency is low, we can (probably) raise it to accommodate
> this DL task so perhaps we should say it fits ?

In a later patch, if the task does not fit on any core (at its current
frequency), the task is moved to the core having the maximum capacity
(without considering the operating frequency) --- at least, this was my
intention when I wrote the patches :)



Luca

>
> > + s64 rel_deadline = dl_se->dl_deadline;
> > + u64 rem_runtime = dl_se->dl_runtime;
> > +
> > + if (c)
> > + *c = cap;
> > +
> > + if ((rel_deadline * cap) >> SCHED_CAPACITY_SHIFT <
> > rem_runtime)
> > + return 0;
> > +
> > + return 1;
> > +}
>
> Thanks,
> Quentin

2019-05-07 14:28:45

by luca abeni

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Tue, 7 May 2019 14:48:52 +0100
Quentin Perret <[email protected]> wrote:

> Hi Luca,
>
> On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> > diff --git a/drivers/base/arch_topology.c
> > b/drivers/base/arch_topology.c index edfcf8d982e4..646d6d349d53
> > 100644 --- a/drivers/base/arch_topology.c
> > +++ b/drivers/base/arch_topology.c
> > @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) =
> > SCHED_CAPACITY_SCALE;
> > void topology_set_cpu_scale(unsigned int cpu, unsigned long
> > capacity) {
> > + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu),
> > capacity);
>
> Why is that one needed ? Don't you end up re-building the sched
> domains after this anyways ?

If I remember correctly, this function was called at boot time when the
capacities are assigned to the CPU cores.

I do not remember if the sched domain was re-built after this call, but
I admit I do not know this part of the kernel very well...

This achieved the effect of correctly setting up the "rd_capacity"
field, but I do not know if there is a better/simpler way to achieve
the same result :)



Thanks,
Luca



>
> > per_cpu(cpu_scale, cpu) = capacity;
> > }
>
> Thanks,
> Quentin

2019-05-07 14:32:43

by Quentin Perret

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Tuesday 07 May 2019 at 16:25:23 (+0200), luca abeni wrote:
> On Tue, 7 May 2019 14:48:52 +0100
> Quentin Perret <[email protected]> wrote:
>
> > Hi Luca,
> >
> > On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> > > diff --git a/drivers/base/arch_topology.c
> > > b/drivers/base/arch_topology.c index edfcf8d982e4..646d6d349d53
> > > 100644 --- a/drivers/base/arch_topology.c
> > > +++ b/drivers/base/arch_topology.c
> > > @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) =
> > > SCHED_CAPACITY_SCALE;
> > > void topology_set_cpu_scale(unsigned int cpu, unsigned long
> > > capacity) {
> > > + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale, cpu),
> > > capacity);
> >
> > Why is that one needed ? Don't you end up re-building the sched
> > domains after this anyways ?
>
> If I remember correctly, this function was called at boot time when the
> capacities are assigned to the CPU cores.
>
> I do not remember if the sched domain was re-built after this call, but
> I admit I do not know this part of the kernel very well...

Right and things moved recently in this area, see bb1fbdd3c3fd
("sched/topology, drivers/base/arch_topology: Rebuild the sched_domain
hierarchy when capacities change")

> This achieved the effect of correctly setting up the "rd_capacity"
> field, but I do not know if there is a better/simpler way to achieve
> the same result :)

OK, that's really an implementation detail, so no need to worry too
much about it at the RFC stage I suppose :-)

Thanks,
Quentin

2019-05-07 14:42:52

by luca abeni

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

Hi,

On Tue, 7 May 2019 15:10:50 +0100
Quentin Perret <[email protected]> wrote:

> On Monday 06 May 2019 at 06:48:32 (+0200), Luca Abeni wrote:
> > static inline unsigned long cpu_bw_dl(struct rq *rq)
> > {
> > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >>
> > BW_SHIFT;
> > + unsigned long res;
> > +
> > + res = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >>
> > BW_SHIFT; +
> > + return (res << SCHED_CAPACITY_SHIFT) /
> > + arch_scale_cpu_capacity(NULL, rq->cpu);
>
> The only user of cpu_bw_dl() is schedutil right ? If yes, we probably
> don't want to scale things here -- schedutil already does this I
> believe.

I think I added this modification because I arrived at the conclusion
that schedutil does not perform this rescaling (at least, I think it
did not perform it when I wrote the patch :)


BTW, while re-reading the patch I do not understand why this change was
added in this specific patch... I suspect it should have gone in a
separate patch.



Luca

>
> Thanks,
> Quentin
>
> > }
> >
> > static inline unsigned long cpu_util_dl(struct rq *rq)
> > --
> > 2.20.1
> >

2019-05-07 14:47:14

by luca abeni

Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Tue, 7 May 2019 15:31:27 +0100
Quentin Perret <[email protected]> wrote:

> On Tuesday 07 May 2019 at 16:25:23 (+0200), luca abeni wrote:
> > On Tue, 7 May 2019 14:48:52 +0100
> > Quentin Perret <[email protected]> wrote:
> >
> > > Hi Luca,
> > >
> > > On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
> > > > diff --git a/drivers/base/arch_topology.c
> > > > b/drivers/base/arch_topology.c index edfcf8d982e4..646d6d349d53
> > > > 100644 --- a/drivers/base/arch_topology.c
> > > > +++ b/drivers/base/arch_topology.c
> > > > @@ -36,6 +36,7 @@ DEFINE_PER_CPU(unsigned long, cpu_scale) =
> > > > SCHED_CAPACITY_SCALE;
> > > > void topology_set_cpu_scale(unsigned int cpu, unsigned long
> > > > capacity) {
> > > > + topology_update_cpu_capacity(cpu, per_cpu(cpu_scale,
> > > > cpu), capacity);
> > >
> > > Why is that one needed ? Don't you end up re-building the sched
> > > domains after this anyways ?
> >
> > If I remember correctly, this function was called at boot time when
> > the capacities are assigned to the CPU cores.
> >
> > I do not remember if the sched domain was re-built after this call,
> > but I admit I do not know this part of the kernel very well...
>
> Right and things moved recently in this area, see bb1fbdd3c3fd
> ("sched/topology, drivers/base/arch_topology: Rebuild the sched_domain
> hierarchy when capacities change")

Ah, thanks! I missed this change when rebasing the patchset.
I guess this part of the patch has to be updated (and probably became
useless?), then.


Thanks,
Luca



>
> > This achieved the effect of correctly setting up the "rd_capacity"
> > field, but I do not know if there is a better/simpler way to achieve
> > the same result :)
>
> OK, that's really an implementation detail, so no need to worry too
> much about it at the RFC stage I suppose :-)
>
> Thanks,
> Quentin

2019-05-07 15:06:04

by Quentin Perret

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On Tuesday 07 May 2019 at 16:41:27 (+0200), luca abeni wrote:
> I think I added this modification because I arrived to the conclusion
> that schedutils does not perform this rescaling (at least, I think it
> did not perform it when I wrote the patch :)

Right, I'm not sure what was the behaviour before, but schedutil should
actually do that scaling now, though pretty far down the line (see
map_util_freq()). So I think we're good on that end ;-)
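
(Roughly, the request ends up being computed as
next_freq = 1.25 * max_freq * util / max_capacity in get_next_freq(), so
the capacity division already happens there; doing it in cpu_bw_dl() as
well would apply it twice.)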

Thanks,
Quentin

2019-05-07 15:06:07

by Quentin Perret

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On Tuesday 07 May 2019 at 16:17:33 (+0200), luca abeni wrote:
> Hi Quentin,
>
> On Tue, 7 May 2019 14:35:28 +0100
> Quentin Perret <[email protected]> wrote:
>
> > Hi Luca,
> >
> > On Monday 06 May 2019 at 06:48:32 (+0200), Luca Abeni wrote:
> > > +static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> > > + int cpu, u64 *c)
> > > +{
> > > + u64 cap = (arch_scale_cpu_capacity(NULL, cpu) *
> > > arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> >
> > I'm a little bit confused by this use of arch_scale_freq_capacity()
> > here. IIUC this means you would say a big DL task doesn't fit on a big
> > CPU just because it happens to be running at a low frequency when this
> > function is called. Is this what we want ?
>
> The idea of this approach was to avoid frequency switches when
> possible; so, I wanted to check if the task fits on a CPU core at its
> current operating frequency.
>
>
> > If the frequency is low, we can (probably) raise it to accommodate
> > this DL task so perhaps we should say it fits ?
>
> In a later patch, if the task does not fit on any core (at its current
> frequency), the task is moved to the core having the maximum capacity
> (without considering the operating frequency --- at least, this was my
> intention when I wrote the patches :)

Ah, I see, patches 05-06. I'll go have a look then !

Thanks,
Quentin

2019-05-07 15:59:05

by Quentin Perret

Subject: Re: [RFC PATCH 6/6] sched/dl: Try not to select a too fast core

On Monday 06 May 2019 at 06:48:36 (+0200), Luca Abeni wrote:
> From: luca abeni <[email protected]>
>
> When a task can fit on multiple CPU cores, try to select the slowest
> core that is able to properly serve the task. This avoids useless
> future migrations, leaving the "fast cores" idle for more heavyweight
> tasks.

But only if the _current_ capacity of big CPUs (at the current freq) is
higher than the current capacity of the littles, is that right ? So we
don't really have a guarantee to pack small tasks on little cores ...

What is the rationale for looking at the current freq in dl_task_fit() ?
Energy reasons ? If so, I'd argue you should look at the energy model to
break the tie between CPU candidates ... ;)

And in the mean time, you could just look at arch_scale_cpu_capacity()
to check if a task fits ?

> Signed-off-by: luca abeni <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 2a4ac7b529b7..897ed71af515 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -143,17 +143,24 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
> struct cpumask *later_mask)
> {
> const struct sched_dl_entity *dl_se = &p->dl;
> + struct cpumask tmp_mask;

Hmm, these can get pretty big, so not sure about having one on the stack ...

>
> if (later_mask &&
> - cpumask_and(later_mask, cp->free_cpus, &p->cpus_allowed)) {
> + cpumask_and(&tmp_mask, cp->free_cpus, &p->cpus_allowed)) {
> int cpu, max_cpu = -1;
> - u64 max_cap = 0;
> + u64 max_cap = 0, min_cap = SCHED_CAPACITY_SCALE * SCHED_CAPACITY_SCALE;
>
> - for_each_cpu(cpu, later_mask) {
> + cpumask_clear(later_mask);
> + for_each_cpu(cpu, &tmp_mask) {
> u64 cap;
>
> - if (!dl_task_fit(&p->dl, cpu, &cap))
> - cpumask_clear_cpu(cpu, later_mask);
> + if (dl_task_fit(&p->dl, cpu, &cap) && (cap <= min_cap)) {
> + if (cap < min_cap) {
> + min_cap = cap;
> + cpumask_clear(later_mask);
> + }
> + cpumask_set_cpu(cpu, later_mask);
> + }
>
> if (cap > max_cap) {
> max_cap = cap;
> --
> 2.20.1
>

Thanks,
Quentin

2019-05-07 16:02:03

by Morten Rasmussen

Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On Tue, May 07, 2019 at 03:13:40PM +0100, Quentin Perret wrote:
> On Monday 06 May 2019 at 06:48:33 (+0200), Luca Abeni wrote:
> > @@ -1591,6 +1626,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> >
> > rcu_read_lock();
> > curr = READ_ONCE(rq->curr); /* unlocked access */
> > + het = static_branch_unlikely(&sched_asym_cpucapacity);
>
> Nit: not sure how the generated code looks like but I wonder if this
> could potentially make you loose the benefit of the static key ?

I have to take the blame for this bit :-)

I would be surprised if the static_key gives us anything here, but that is
actually not the point here. It is purely to know whether we have to be
capacity aware or not. I don't think we are in a critical path and the
variable providing the necessary condition just happened to be a
static_key.

We might be able to make better use of it if we refactor the code a bit.

Morten

2019-05-08 06:27:52

by luca abeni

Subject: Re: [RFC PATCH 6/6] sched/dl: Try not to select a too fast core

Hi Quentin,

On Tue, 7 May 2019 16:57:34 +0100
Quentin Perret <[email protected]> wrote:

> On Monday 06 May 2019 at 06:48:36 (+0200), Luca Abeni wrote:
> > From: luca abeni <[email protected]>
> >
> > When a task can fit on multiple CPU cores, try to select the slowest
> > core that is able to properly serve the task. This avoids useless
> > future migrations, leaving the "fast cores" idle for more
> > heavyweight tasks.
>
> But only if the _current_ capacity of big CPUs (at the current freq)
> is higher than the current capacity of the littles, is that right ?
> So we don't really have a guarantee to pack small tasks on little
> cores ...

Yes, the capacity is estimated at the current frequency, so this is a
potential problem.


> What is the rationale for looking at the current freq in
> dl_task_fit() ?

Mainly two reasons: the first one is to try to reduce frequency
switches (I did not perform measurements on the hikey960, but I
remember that on other CPUs a frequency switch can take a considerable
amount of time).

Then, I wanted to have a mechanism that can work with all the possible
cpufreq governors... So, I did not assume that the frequency can change
(for example, I remember that without considering the current
frequency I had issues when using the "userspace" governor).


Maybe I just do not know this kernel subsystem well enough, but I did
not find any way to know the maximum frequency that the current
governor can set (I mean, I found a "maximum frequency" field that
tells me the maximum frequency that the cpufreq driver can set, but I
do not know if the governor will be able to set it --- again, consider
the "userspace" governor).

If there is a way to know this value, then I can use it for checking if
a task can fit in a core.



Thanks,
Luca



> Energy reasons ? If so, I'd argue you should look at
> the energy model to break the tie between CPU candidates ... ;)
>
> And in the mean time, you could just look at arch_scale_cpu_capacity()
> to check if a task fits ?
>
> > Signed-off-by: luca abeni <[email protected]>
> > ---
> > kernel/sched/cpudeadline.c | 17 ++++++++++++-----
> > 1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 2a4ac7b529b7..897ed71af515 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -143,17 +143,24 @@ int cpudl_find(struct cpudl *cp, struct
> > task_struct *p, struct cpumask *later_mask)
> > {
> > const struct sched_dl_entity *dl_se = &p->dl;
> > + struct cpumask tmp_mask;
>
> Hmm, these can get pretty big, so not sure about having one on the
> stack ...
>
> >
> > if (later_mask &&
> > - cpumask_and(later_mask, cp->free_cpus,
> > &p->cpus_allowed)) {
> > + cpumask_and(&tmp_mask, cp->free_cpus,
> > &p->cpus_allowed)) { int cpu, max_cpu = -1;
> > - u64 max_cap = 0;
> > + u64 max_cap = 0, min_cap = SCHED_CAPACITY_SCALE *
> > SCHED_CAPACITY_SCALE;
> > - for_each_cpu(cpu, later_mask) {
> > + cpumask_clear(later_mask);
> > + for_each_cpu(cpu, &tmp_mask) {
> > u64 cap;
> >
> > - if (!dl_task_fit(&p->dl, cpu, &cap))
> > - cpumask_clear_cpu(cpu, later_mask);
> > + if (dl_task_fit(&p->dl, cpu, &cap) && (cap
> > <= min_cap)) {
> > + if (cap < min_cap) {
> > + min_cap = cap;
> > + cpumask_clear(later_mask);
> > + }
> > + cpumask_set_cpu(cpu, later_mask);
> > + }
> >
> > if (cap > max_cap) {
> > max_cap = cap;
> > --
> > 2.20.1
> >
>
> Thanks,
> Quentin

2019-05-08 08:04:39

by Juri Lelli

Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

Hi Luca,

On 06/05/19 06:48, Luca Abeni wrote:
> From: luca abeni <[email protected]>
>
> Currently, the scheduler tries to find a proper placement for
> SCHED_DEADLINE tasks when they are pushed out of a core or when
> they wake up. Hence, if there is a single SCHED_DEADLINE task
> that never blocks and wakes up, such a task is never migrated to
> an appropriate CPU core, but continues to execute on its original
> core.
>
> This commit addresses the issue by trying to migrate a SCHED_DEADLINE
> task (searching for an appropriate CPU core) the first time it is
> throttled.

Why did we fail to put the task on a CPU with enough (max) capacity
right after it passed admission control? The very first time the task
was scheduled, I mean.

Thanks,

- Juri

2019-05-08 08:06:43

by Juri Lelli

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

Hi Luca,

On 06/05/19 06:48, Luca Abeni wrote:
> From: luca abeni <[email protected]>
>
> Currently, the SCHED_DEADLINE scheduler uses a global EDF scheduling
> algorithm, migrating tasks to CPU cores without considering the core
> capacity and the task utilization. This works well on homogeneous
> systems (SCHED_DEADLINE tasks are guaranteed to have a bounded
> tardiness), but presents some issues on heterogeneous systems. For
> example, a SCHED_DEADLINE task might be migrated on a core that has not
> enough processing capacity to correctly serve the task (think about a
> task with runtime 70ms and period 100ms migrated to a core with
> processing capacity 0.5)
>
> This commit is a first step to address the issue: When a task wakes
> up or migrates away from a CPU core, the scheduler tries to find an
> idle core having enough processing capacity to serve the task.
>
> Signed-off-by: luca abeni <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 31 +++++++++++++++++++++++++++++--
> kernel/sched/deadline.c | 8 ++++++--
> kernel/sched/sched.h | 7 ++++++-
> 3 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 50316455ea66..d21f7905b9c1 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -110,6 +110,22 @@ static inline int cpudl_maximum(struct cpudl *cp)
> return cp->elements[0].cpu;
> }
>
> +static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> + int cpu, u64 *c)
> +{
> + u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> + s64 rel_deadline = dl_se->dl_deadline;
> + u64 rem_runtime = dl_se->dl_runtime;

This is not the dynamic remaining one, is it?

I see however 4/6.. lemme better look at that.

Best,

- Juri

2019-05-08 08:29:34

by luca abeni

Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On Wed, 8 May 2019 10:04:36 +0200
Juri Lelli <[email protected]> wrote:

> Hi Luca,
>
> On 06/05/19 06:48, Luca Abeni wrote:
> > From: luca abeni <[email protected]>
> >
> > Currently, the SCHED_DEADLINE scheduler uses a global EDF scheduling
> > algorithm, migrating tasks to CPU cores without considering the core
> > capacity and the task utilization. This works well on homogeneous
> > systems (SCHED_DEADLINE tasks are guaranteed to have a bounded
> > tardiness), but presents some issues on heterogeneous systems. For
> > example, a SCHED_DEADLINE task might be migrated on a core that has
> > not enough processing capacity to correctly serve the task (think
> > about a task with runtime 70ms and period 100ms migrated to a core
> > with processing capacity 0.5)
> >
> > This commit is a first step to address the issue: When a task wakes
> > up or migrates away from a CPU core, the scheduler tries to find an
> > idle core having enough processing capacity to serve the task.
> >
> > Signed-off-by: luca abeni <[email protected]>
> > ---
> > kernel/sched/cpudeadline.c | 31 +++++++++++++++++++++++++++++--
> > kernel/sched/deadline.c | 8 ++++++--
> > kernel/sched/sched.h | 7 ++++++-
> > 3 files changed, 41 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 50316455ea66..d21f7905b9c1 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -110,6 +110,22 @@ static inline int cpudl_maximum(struct cpudl
> > *cp) return cp->elements[0].cpu;
> > }
> >
> > +static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> > + int cpu, u64 *c)
> > +{
> > + u64 cap = (arch_scale_cpu_capacity(NULL, cpu) *
> > arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> > + s64 rel_deadline = dl_se->dl_deadline;
> > + u64 rem_runtime = dl_se->dl_runtime;
>
> This is not the dynamic remaining one, is it?

Right; I preferred to split this into two patches so that, if we decide
to use only the static task parameters (dl_deadline and dl_runtime), I
can simply drop a patch ;-)


Luca

>
> I see however 4/6.. lemme better look at that.
>
> Best,
>
> - Juri

2019-05-08 08:58:00

by luca abeni

Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

Hi Juri,

On Wed, 8 May 2019 10:01:16 +0200
Juri Lelli <[email protected]> wrote:

> Hi Luca,
>
> On 06/05/19 06:48, Luca Abeni wrote:
> > From: luca abeni <[email protected]>
> >
> > Currently, the scheduler tries to find a proper placement for
> > SCHED_DEADLINE tasks when they are pushed out of a core or when
> > they wake up. Hence, if there is a single SCHED_DEADLINE task
> > that never blocks and wakes up, such a task is never migrated to
> > an appropriate CPU core, but continues to execute on its original
> > core.
> >
> > This commit addresses the issue by trying to migrate a
> > SCHED_DEADLINE task (searching for an appropriate CPU core) the
> > first time it is throttled.
>
> Why we failed to put the task on a CPU with enough (max) capacity
> right after it passed admission control? The very first time the task
> was scheduled I mean.

I think the currently executing task cannot be pushed out of a
CPU/core, right?

So, if a task switches from SCHED_OTHER to SCHED_DEADLINE while it is
executing on a fast core, the only way to migrate it would be to
preempt it (by using the stop_sched_class, I think), no?
(the typical situation here is a "cpu hog" task that switches from
SCHED_OTHER to SCHED_DEADLINE, and it is the only SCHED_DEADLINE
task... The task never blocks, so push/pull functions are never invoked)

Or am I missing something?



Thanks,
Luca

2019-05-08 09:10:53

by Juri Lelli

Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

On 06/05/19 06:48, Luca Abeni wrote:
> From: luca abeni <[email protected]>
>
> Instead of considering the "static CPU bandwidth" allocated to
> a SCHED_DEADLINE task (ratio between its maximum runtime and
> reservation period), try to use the remaining runtime and time
> to scheduling deadline.
>
> Signed-off-by: luca abeni <[email protected]>
> ---
> kernel/sched/cpudeadline.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index d21f7905b9c1..111dd9ac837b 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -114,8 +114,13 @@ static inline int dl_task_fit(const struct sched_dl_entity *dl_se,
> int cpu, u64 *c)
> {
> u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> - s64 rel_deadline = dl_se->dl_deadline;
> - u64 rem_runtime = dl_se->dl_runtime;
> + s64 rel_deadline = dl_se->deadline - sched_clock_cpu(smp_processor_id());
> + u64 rem_runtime = dl_se->runtime;
> +
> + if ((rel_deadline < 0) || (rel_deadline * dl_se->dl_runtime < dl_se->dl_deadline * rem_runtime)) {
> + rel_deadline = dl_se->dl_deadline;
> + rem_runtime = dl_se->dl_runtime;
> + }

So, are you basically checking if current remaining bw can be consumed
safely?

I'm not actually sure if looking at dynamic values is what we need to do
at this stage. By considering static values we fix admission control
(and scheduling). Aren't dynamic values more to do with energy tradeoffs
(and so to be introduced when starting to look at the energy model)?

Another angle maybe is to look at the dynamic spare bw of CPUs
(to check that we don't overload CPUs).

Thanks,

- Juri

2019-05-08 09:24:18

by Juri Lelli

Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On 08/05/19 10:14, luca abeni wrote:
> Hi Juri,
>
> On Wed, 8 May 2019 10:01:16 +0200
> Juri Lelli <[email protected]> wrote:
>
> > Hi Luca,
> >
> > On 06/05/19 06:48, Luca Abeni wrote:
> > > From: luca abeni <[email protected]>
> > >
> > > Currently, the scheduler tries to find a proper placement for
> > > SCHED_DEADLINE tasks when they are pushed out of a core or when
> > > they wake up. Hence, if there is a single SCHED_DEADLINE task
> > > that never blocks and wakes up, such a task is never migrated to
> > > an appropriate CPU core, but continues to execute on its original
> > > core.
> > >
> > > This commit addresses the issue by trying to migrate a
> > > SCHED_DEADLINE task (searching for an appropriate CPU core) the
> > > first time it is throttled.
> >
> > Why we failed to put the task on a CPU with enough (max) capacity
> > right after it passed admission control? The very first time the task
> > was scheduled I mean.
>
> I think the currently executing task cannot be pushed out of a
> CPU/core, right?
>
> So, if a task switches from SCHED_OTHER to SCHED_DEADLINE while it is
> executing on a fast core, the only way to migrate it would be to
> preempt it (by using the stop_sched_class, I think), no?
> (the typical situation here is a "cpu hog" task that switches from
> SCHED_OTHER to SCHED_DEADLINE, and it is the only SCHED_DEADLINE
> task... The task never blocks, so push/pull functions are never invoked)
>
> Or am I missing something?

OK, but "ideally" we should not be waiting for it to be throttled, right?

I wonder if you could queue a balance callback in switched_to_dl() (from
check_class_changed()), so that it is picked up by balance_callback()
before setscheduler returns.

2019-05-08 09:26:57

by luca abeni

Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

On Wed, 8 May 2019 11:08:55 +0200
Juri Lelli <[email protected]> wrote:

> On 06/05/19 06:48, Luca Abeni wrote:
> > From: luca abeni <[email protected]>
> >
> > Instead of considering the "static CPU bandwidth" allocated to
> > a SCHED_DEADLINE task (ratio between its maximum runtime and
> > reservation period), try to use the remaining runtime and time
> > to scheduling deadline.
> >
> > Signed-off-by: luca abeni <[email protected]>
> > ---
> > kernel/sched/cpudeadline.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index d21f7905b9c1..111dd9ac837b 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -114,8 +114,13 @@ static inline int dl_task_fit(const struct
> > sched_dl_entity *dl_se, int cpu, u64 *c)
> > {
> > u64 cap = (arch_scale_cpu_capacity(NULL, cpu) *
> > arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> > - s64 rel_deadline = dl_se->dl_deadline;
> > - u64 rem_runtime = dl_se->dl_runtime;
> > + s64 rel_deadline = dl_se->deadline -
> > sched_clock_cpu(smp_processor_id());
> > + u64 rem_runtime = dl_se->runtime;
> > +
> > + if ((rel_deadline < 0) || (rel_deadline *
> > dl_se->dl_runtime < dl_se->dl_deadline * rem_runtime)) {
> > + rel_deadline = dl_se->dl_deadline;
> > + rem_runtime = dl_se->dl_runtime;
> > + }
>
> So, are you basically checking if current remaining bw can be consumed
> safely?

I check if the current runtime (rescaled based on the capacity) is
smaller than the time to the current scheduling deadline (basically, if
it can be consumed in time).

However, if
q / (d - t) > Q / P
(where "q" is the current runtime, "d" is the scheduling deadline, "Q"
is the maximum runtime, and "P" is the CBS period), then a new
scheduling deadline will be generated (later), and the runtime will be
reset to Q... So, I need to use the maximum budget and CBS period for
checking if the task fits in the core.
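
In code, the intended check is roughly the following (standalone,
untested sketch of the logic with an illustrative dl_fits() helper, not
the actual kernel function; capacities are assumed to be on the usual
0..1024 scale):

#include <stdint.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SHIFT	10

/*
 * Can "rem_runtime" be consumed before "rel_deadline" on a CPU of
 * capacity "cpu_cap"? Fall back to the static (runtime, deadline) pair
 * when the CBS would generate a new deadline and reset the runtime
 * anyway (i.e., when q / (d - t) > Q / P, as discussed above).
 */
static bool dl_fits(int64_t rel_deadline, uint64_t rem_runtime,
		    uint64_t dl_deadline, uint64_t dl_runtime,
		    uint64_t cpu_cap)
{
	if (rel_deadline < 0 ||
	    (uint64_t)rel_deadline * dl_runtime < dl_deadline * rem_runtime) {
		rel_deadline = dl_deadline;
		rem_runtime = dl_runtime;
	}

	/* The runtime, rescaled by the CPU capacity, must fit in the deadline. */
	return (((uint64_t)rel_deadline * cpu_cap) >> SCHED_CAPACITY_SHIFT) >=
	       rem_runtime;
}

This is the same check as in the patch, just without the
arch_scale_*() plumbing.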

>
> I'm not actually sure if looking at dynamic values is what we need to
> do at this stage. By considering static values we fix admission
> control (and scheduling). Aren't dynamic values more to do with
> energy tradeoffs (and so to be introduced when starting to look at
> the energy model)?

Using the current runtime and scheduling deadline might allow migrating
a task to SMALL cores (if its remaining runtime is small enough), even
if the rescaled Q is larger than P.
So, in theory, it might help reduce the load on big cores.

If we decide that this is overkill, I can just drop the patch.



Luca

> Another option maybe is to look at the dynamic spare bw of CPUs
> (to check that we don't overload CPUs).
>
> Thanks,
>
> - Juri

2019-05-08 12:09:29

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

On 08/05/19 11:24, luca abeni wrote:
> On Wed, 8 May 2019 11:08:55 +0200
> Juri Lelli <[email protected]> wrote:
>
> > On 06/05/19 06:48, Luca Abeni wrote:
> > > From: luca abeni <[email protected]>
> > >
> > > Instead of considering the "static CPU bandwidth" allocated to
> > > a SCHED_DEADLINE task (ratio between its maximum runtime and
> > > reservation period), try to use the remaining runtime and time
> > > to scheduling deadline.
> > >
> > > Signed-off-by: luca abeni <[email protected]>
> > > ---
> > > kernel/sched/cpudeadline.c | 9 +++++++--
> > > 1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > > index d21f7905b9c1..111dd9ac837b 100644
> > > --- a/kernel/sched/cpudeadline.c
> > > +++ b/kernel/sched/cpudeadline.c
> > > @@ -114,8 +114,13 @@ static inline int dl_task_fit(const struct sched_dl_entity *dl_se, int cpu, u64 *c)
> > > {
> > > u64 cap = (arch_scale_cpu_capacity(NULL, cpu) * arch_scale_freq_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;
> > > - s64 rel_deadline = dl_se->dl_deadline;
> > > - u64 rem_runtime = dl_se->dl_runtime;
> > > + s64 rel_deadline = dl_se->deadline - sched_clock_cpu(smp_processor_id());
> > > + u64 rem_runtime = dl_se->runtime;
> > > +
> > > + if ((rel_deadline < 0) || (rel_deadline * dl_se->dl_runtime < dl_se->dl_deadline * rem_runtime)) {
> > > + rel_deadline = dl_se->dl_deadline;
> > > + rem_runtime = dl_se->dl_runtime;
> > > + }
> >
> > So, are you basically checking if current remaining bw can be consumed
> > safely?
>
> I check if the current runtime (rescaled based on the capacity) is
> smaller than the time to the current scheduling deadline (basically, if
> it can be consumed in time).
>
> However, if
> q / (d - t) > Q / P
> (where "q" is the current runtime, "d" is the scheduling deadline, "Q"
> is the maximum runtime, and "P" is the CBS period), then a new
> scheduling deadline will be generated (later), and the runtime will be
> reset to Q... So, I need to use the maximum budget and CBS period for
> checking if the task fits in the core.

OK. I'd add a comment about it.

> > I'm not actually sure if looking at dynamic values is what we need to
> > do at this stage. By considering static values we fix admission
> > control (and scheduling). Aren't dynamic values more to do with
> > energy tradeoffs (and so to be introduced when starting to look at
> > the energy model)?
>
> Using the current runtime and scheduling deadline might allow migrating
> a task to SMALL cores (if its remaining runtime is small enough), even
> if the rescaled Q is larger than P.
> So, in theory, it might help reduce the load on big cores.
>
> If we decide that this is overkill, I can just drop the patch.

So, my first impression was that we shouldn't be too clever until we
start using info from the energy model (which should tell us whether,
for example, reducing load on big cores is a winning power/perf
decision).

However, I was also wondering if we should actually compare dynamic
parameters with {running,this}_bw (per-rq) the moment we search for
potential migration candidates (so that we are not overloading rqs).

2019-05-08 13:28:27

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

Hi Juri,

On Wed, 8 May 2019 14:05:26 +0200
Juri Lelli <[email protected]> wrote:
[...]
> > > > + if ((rel_deadline < 0) || (rel_deadline * dl_se->dl_runtime < dl_se->dl_deadline * rem_runtime)) {
> > > > + rel_deadline = dl_se->dl_deadline;
> > > > + rem_runtime = dl_se->dl_runtime;
> > > > + }
> > >
> > > So, are you basically checking if current remaining bw can be
> > > consumed safely?
> >
> > I check if the current runtime (rescaled based on the capacity) is
> > smaller than the time to the current scheduling deadline
> > (basically, if it can be consumed in time).
> >
> > However, if
> > q / (d - t) > Q / P
> > (where "q" is the current runtime, "d" is the scheduling deadline,
> > "Q" is the maximum runtime, and "P" is the CBS period), then a new
> > scheduling deadline will be generated (later), and the runtime will
> > be reset to Q... So, I need to use the maximum budget and CBS
> > period for checking if the task fits in the core.
>
> OK. I'd add a comment about it.

Ok; I'll add a comment in the next version of the patchset.


[...]
> > > I'm not actually sure if looking at dynamic values is what we
> > > need to do at this stage. By considering static values we fix
> > > admission control (and scheduling). Aren't dynamic values more to
> > > do with energy tradeoffs (and so to be introduced when starting
> > > to look at the energy model)?
> >
> > Using the current runtime and scheduling deadline might allow migrating
> > a task to SMALL cores (if its remaining runtime is small enough), even
> > if the rescaled Q is larger than P.
> > So, in theory, it might help reduce the load on big cores.
> >
> > If we decide that this is overkill, I can just drop the patch.
>
> So, my first impression was that we shouldn't be too clever until we
> start using info from the energy model (which should tell us whether,
> for example, reducing load on big cores is a winning power/perf
> decision).

Ok.


> However, I was also wondering if we should actually compare dynamic
> parameters with {running,this}_bw (per-rq) the moment we search for
> potential migration candidates (so that we are not overloading rqs).

Notice that all this logic is used only to select one of the idle cores
(instead of picking the first idle core, we select the least powerful
core on which the task "fits").
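
Just to make the selection explicit, conceptually it is something like
this (standalone sketch with made-up types, not the actual cpudl code):

#include <stdint.h>

/* Illustrative only: one entry per idle CPU. */
struct idle_cpu {
	int		id;
	uint64_t	capacity;	/* arch capacity of the CPU */
	int		fits;		/* does the task fit here?  */
};

/*
 * Return the id of the lowest-capacity idle CPU on which the task
 * still fits, or -1 if no idle CPU fits.
 */
static int pick_idle_cpu(const struct idle_cpu *cpus, int n)
{
	uint64_t best_cap = UINT64_MAX;
	int best = -1, i;

	for (i = 0; i < n; i++) {
		if (cpus[i].fits && cpus[i].capacity < best_cap) {
			best_cap = cpus[i].capacity;
			best = cpus[i].id;
		}
	}

	return best;
}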

So, running_bw does not provide any useful information, in this case;
maybe this_bw (which also accounts for the bandwidth of blocked tasks
still assigned to the runqueue) can be more useful.



Luca

2019-05-08 14:13:35

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

Hi Juri,

On Wed, 8 May 2019 15:10:18 +0200
Juri Lelli <[email protected]> wrote:

> On 08/05/19 14:47, luca abeni wrote:
>
> [...]
>
> > Notice that all this logic is used only to select one of the idle
> > cores (instead of picking the first idle core, we select the less
> > powerful core on which the task "fits").
> >
> > So, running_bw does not provide any useful information, in this
> > case; maybe this_bw can be more useful.
>
> Ah, indeed.
>
> However, what happens when cores are all busy? Can load balancing
> decisions based on deadline values only break capacity awareness? (I
> understand this is an RFC, so a "yes, something we need to look at" is
> OK :-)

I have no definitive answer about this :)

If we do not want to break EDF and its properties, migrations have to
be performed so that the "global EDF invariant" (schedule the m tasks
with the earliest deadlines) is not broken (and capacity-based migrations
for non-idle cores risk breaking this invariant).

In theory, we should do something like "schedule the earliest-deadline
task on the fastest core", but this would require too many migrations
(and would require migrating a task while it is executing, etc...)

I am convinced that doing the capacity-aware migrations when we have
idle cores helps to reduce migrations (avoiding "useless migrations")
when there are no idle cores, but I have no proof of this (just some
simple examples); this is why I say that I have no definitive answer...



Luca

2019-05-08 14:45:29

by Juri Lelli

[permalink] [raw]
Subject: Re: [RFC PATCH 4/6] sched/dl: Improve capacity-aware wakeup

On 08/05/19 14:47, luca abeni wrote:

[...]

> Notice that all this logic is used only to select one of the idle cores
> (instead of picking the first idle core, we select the less powerful
> core on which the task "fits").
>
> So, running_bw does not provide any useful information, in this case;
> maybe this_bw can be more useful.

Ah, indeed.

However, what happens when cores are all busy? Can load balancing
decisions based on deadline values only break capacity awareness? (I
understand this is an RFC, so a "yes, something we need to look at" is
OK :-)

2019-05-09 13:47:51

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 6/6] sched/dl: Try not to select a too fast core

Hi Luca,

On Wednesday 08 May 2019 at 08:26:05 (+0200), luca abeni wrote:
> Mainly two reasons: the first one is to try to reduce frequency
> switches (I did not perform measurements on the hikey960, but I remember
> that on other CPUs a frequency switch can take a considerable amount of
> time).

Right, though if you want to use DL and scale frequency on these systems
the only 'safe' thing to do is probably to account for the frequency
switch delay in the runtime parameter of your DL tasks ...
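
(For instance, with made-up numbers: if the task's WCET at the frequency
you plan for is C = 8ms, the worst-case switch latency is S = 1.5ms and
the period/deadline is P = 100ms, you would admit it with

	Q = C + S = 8ms + 1.5ms = 9.5ms, i.e. Q/P = 0.095 instead of 0.08,

trading a bit of extra utilization for robustness against the switch
delay.)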

> Then, I wanted to have a mechanism that can work with all the possible
> cpufreq governors... So, I did not assume that the frequency can change
> (for example, I remember that without considering the current
> frequency I had issues when using the "userspace" governor).

But this is a problem, I agree. OTOH it is also true that 'userspace' has
a much weaker use-case than sugov for asymmetric systems, where we
typically care about energy and do 'proper' DVFS. So, I'd say we really
shouldn't assume the frequencies won't change.

DL is already sub-optimal on SMP if the user sets arbitrary frequencies
for the CPUs, no? So perhaps this problem could be solved in a different
patch series?

> Maybe I just do not know this kernel subsystem well enough, but I did
> not find any way to know the maximum frequency that the current
> governor can set (I mean, I found a "maximum frequency" field that
> tells me the maximum frequency that the cpufreq driver can set, but I
> do not know if the governor will be able to set it --- again, consider
> the "userspace" governor).

Right, we'd probably need more infrastructure to do this, but I'm not
sure how deep we'll need to go ...

Thanks,
Quentin

2019-06-18 16:42:55

by Alessio Balsini

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

Hi Luca,

On Mon, May 06, 2019 at 06:48:31AM +0200, Luca Abeni wrote:
> From: luca abeni <[email protected]>
>
> extern void dl_change_utilization(struct task_struct *p, u64 new_bw);
> @@ -785,6 +786,8 @@ struct root_domain {
> unsigned long max_cpu_capacity;
> unsigned long min_cpu_capacity;
>
> + unsigned long rd_capacity;
>

An easy one: we are already in root_domain, can't we
s/rd_capacity/capacity/g ?

Thanks for the series, looking forward to seeing an updated version
soon. In the meantime, I started backporting and testing it.

Cheers,
Alessio

2019-07-04 12:07:58

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On 5/6/19 6:48 AM, Luca Abeni wrote:

[...]

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 5b981eeeb944..3436f3d8fa8f 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1584,6 +1584,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> if (sd_flag != SD_BALANCE_WAKE)
> goto out;
>
> + if (dl_entity_is_special(&p->dl))
> + goto out;

I wonder if this is really required. The if condition

1591 if (unlikely(dl_task(curr)) &&
1592 (curr->nr_cpus_allowed < 2 ||
1593 !dl_entity_preempt(&p->dl, &curr->dl)) &&
1594 (p->nr_cpus_allowed > 1)) {

further below uses '!dl_entity_preempt(&p->dl, &curr->dl))' which
returns 'dl_entity_is_special(a) || ...'

A BUG_ON(dl_entity_is_special(&p->dl)) in this if condition hasn't
triggered on my platform yet.
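
For reference, that helper is defined roughly as follows (quoting from
memory, so double-check against your tree):

static inline bool dl_entity_preempt(struct sched_dl_entity *a,
				     struct sched_dl_entity *b)
{
	return dl_entity_is_special(a) ||
	       dl_time_before(a->deadline, b->deadline);
}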

[...]

2019-07-08 11:57:18

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

Hi Dietmar,

On Thu, 4 Jul 2019 14:05:22 +0200
Dietmar Eggemann <[email protected]> wrote:

> On 5/6/19 6:48 AM, Luca Abeni wrote:
>
> [...]
>
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 5b981eeeb944..3436f3d8fa8f 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1584,6 +1584,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > if (sd_flag != SD_BALANCE_WAKE)
> > goto out;
> >
> > + if (dl_entity_is_special(&p->dl))
> > + goto out;
>
> I wonder if this is really required. The if condition
>
> 1591 if (unlikely(dl_task(curr)) &&
> 1592 (curr->nr_cpus_allowed < 2 ||
> 1593 !dl_entity_preempt(&p->dl, &curr->dl)) &&
> 1594 (p->nr_cpus_allowed > 1)) {
>
> further below uses '!dl_entity_preempt(&p->dl, &curr->dl))' which
> returns 'dl_entity_is_special(a) || ...'

Uhm... I do not remember the details; I remember that the check fixed
something during the development of the patchset, but I did not check
if it was still needed when I forward-ported the patches...

So, maybe it worked around some bugs in previous versions of the
kernel, but is not needed now.



Luca

2019-07-08 15:09:51

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] sched/dl: Capacity-aware migrations

On 7/8/19 9:41 AM, luca abeni wrote:
> Hi Dietmar,
>
> On Thu, 4 Jul 2019 14:05:22 +0200
> Dietmar Eggemann <[email protected]> wrote:
>
>> On 5/6/19 6:48 AM, Luca Abeni wrote:
>>
>> [...]
>>
>>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>>> index 5b981eeeb944..3436f3d8fa8f 100644
>>> --- a/kernel/sched/deadline.c
>>> +++ b/kernel/sched/deadline.c
>>> @@ -1584,6 +1584,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>>> if (sd_flag != SD_BALANCE_WAKE)
>>> goto out;
>>>
>>> + if (dl_entity_is_special(&p->dl))
>>> + goto out;
>>
>> I wonder if this is really required. The if condition
>>
>> 1591 if (unlikely(dl_task(curr)) &&
>> 1592 (curr->nr_cpus_allowed < 2 ||
>> 1593 !dl_entity_preempt(&p->dl, &curr->dl)) &&
>> 1594 (p->nr_cpus_allowed > 1)) {
>>
>> further below uses '!dl_entity_preempt(&p->dl, &curr->dl))' which
>> returns 'dl_entity_is_special(a) || ...'
>
> Uhm... I do not remember the details; I remember that the check fixed
> something during the development of the patchset, but I did not check
> if it was still needed when I forward-ported the patches...
>
> So, maybe it worked around some bugs in previous versions of the
> kernel, but is not needed now.

I figured it out in the meantime ... you added a ' ... ||
static_branch_unlikely(&sched_asym_cpucapacity)' in this patch to this
if condition, which lets special dl tasks (like sugov:X) enter this
condition, e.g. on big.LITTLE.

2019-07-08 16:33:13

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On 5/7/19 4:43 PM, luca abeni wrote:
> On Tue, 7 May 2019 15:31:27 +0100
> Quentin Perret <[email protected]> wrote:
>
>> On Tuesday 07 May 2019 at 16:25:23 (+0200), luca abeni wrote:
>>> On Tue, 7 May 2019 14:48:52 +0100
>>> Quentin Perret <[email protected]> wrote:
>>>
>>>> Hi Luca,
>>>>
>>>> On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:

[...]

>> Right and things moved recently in this area, see bb1fbdd3c3fd
>> ("sched/topology, drivers/base/arch_topology: Rebuild the sched_domain
>> hierarchy when capacities change")
>
> Ah, thanks! I missed this change when rebasing the patchset.
> I guess this part of the patch has to be updated (and probably became
> useless?), then.

[...]

>>> This achieved the effect of correctly setting up the "rd_capacity"
>>> field, but I do not know if there is a better/simpler way to achieve
>>> the same result :)
>>
>> OK, that's really an implementation detail, so no need to worry too
>> much about it at the RFC stage I suppose :-)

What about integrating the code to calculate the rd capacity into
build_sched_domains() (next to the code that establishes the rd
max_cpu_capacity)?

root@juno:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
446
1024
1024
446
446
446

root@juno:~# dmesg | grep "rd capacity"
/* before CPUfreq max CPU freq calibration */
[ 0.749389] rd span: 0-5 rd capacity: 4360 max cpu_capacity: 1024
/* after CPUfreq max CPU freq calibration */
[ 3.372759] rd span: 0-5 rd capacity: 3832 max cpu_capacity: 1024

/* 2*1024 + 4*446 = 3832 */

root@juno:~# echo 0 > /sys/devices/system/cpu/cpu5/online

root@juno:~# dmesg | grep "rd capacity"
...
[ 2715.068198] rd span: 0-4 rd capacity: 3386 max cpu_capacity: 1024

root@juno:~# echo 1 > /sys/devices/system/cpu/cpu5/online

root@juno:~# dmesg | grep "rd capacity"
...
[ 2807.200662] rd span: 0-5 rd capacity: 3832 max cpu_capacity: 1024


--->8---

@@ -768,6 +769,7 @@ struct root_domain {
cpumask_var_t rto_mask;
struct cpupri cpupri;

+ unsigned long capacity;
unsigned long max_cpu_capacity;

/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f751ce0b783e..68acdca27eaf 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2000,6 +2000,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);

+ WRITE_ONCE(d.rd->capacity,
+ READ_ONCE(d.rd->capacity) + rq->cpu_capacity_orig);
+
cpu_attach_domain(sd, d.rd, i);
}
rcu_read_unlock();
@@ -2008,8 +2011,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
static_branch_enable_cpuslocked(&sched_asym_cpucapacity);

if (rq && sched_debug_enabled) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+ pr_info("rd span: %*pbl rd capacity: %lu max cpu_capacity: %lu\n",
+ cpumask_pr_args(cpu_map), rq->rd->capacity,
+ rq->rd->max_cpu_capacity);
}

2019-07-08 19:03:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On Mon, May 06, 2019 at 06:48:33AM +0200, Luca Abeni wrote:
> @@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
> dl_se->dl_overrun = 1;
>
> __dequeue_task_dl(rq, curr, 0);
> - if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> + if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
> enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> +#ifdef CONFIG_SMP
> + } else if (dl_se->dl_adjust) {
> + if (rq->migrating_task == NULL) {
> + queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);

I'm not entirely sure about this one.

That is, we only do those callbacks from:

schedule_tail()
__schedule()
rt_mutex_setprio()
__sched_setscheduler()

and the above looks like it can happen outside of those.

The pattern in those sites is:

rq_lock();
... do crap that leads to queue_balance_callback()
rq_unlock()
if (rq->balance_callback) {
raw_spin_lock_irqsave(rq->lock, flags);
... do callbacks
raw_spin_unlock_irqrestore(rq->lock, flags);
}

So I suppose we can catch abuse of this API by doing something like the
below; can you validate?

---

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aaca0e743776..89e615f1eae6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1134,6 +1134,14 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
rf->cookie = lockdep_pin_lock(&rq->lock);

#ifdef CONFIG_SCHED_DEBUG
+#ifdef CONFIG_SMP
+ /*
+ * There should not be pending callbacks at the start of rq_lock();
+ * all sites that handle them flush them at the end.
+ */
+ WARN_ON_ONCE(rq->balance_callback);
+#endif
+
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
#endif

2019-07-08 19:28:29

by Quentin Perret

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] sched/dl: Improve deadline admission control for asymmetric CPU capacities

On Monday 08 Jul 2019 at 13:22:35 (+0200), Dietmar Eggemann wrote:
> On 5/7/19 4:43 PM, luca abeni wrote:
> > On Tue, 7 May 2019 15:31:27 +0100
> > Quentin Perret <[email protected]> wrote:
> >
> >> On Tuesday 07 May 2019 at 16:25:23 (+0200), luca abeni wrote:
> >>> On Tue, 7 May 2019 14:48:52 +0100
> >>> Quentin Perret <[email protected]> wrote:
> >>>
> >>>> Hi Luca,
> >>>>
> >>>> On Monday 06 May 2019 at 06:48:31 (+0200), Luca Abeni wrote:
>
> [...]
>
> >> Right and things moved recently in this area, see bb1fbdd3c3fd
> >> ("sched/topology, drivers/base/arch_topology: Rebuild the sched_domain
> >> hierarchy when capacities change")
> >
> > Ah, thanks! I missed this change when rebasing the patchset.
> > I guess this part of the patch has to be updated (and probably became
> > useless?), then.
>
> [...]
>
> >>> This achieved the effect of correctly setting up the "rd_capacity"
> >>> field, but I do not know if there is a better/simpler way to achieve
> >>> the same result :)
> >>
> >> OK, that's really an implementation detail, so no need to worry too
> >> much about it at the RFC stage I suppose :-)
>
> What about integrating the code to calculate the rd capacity into
> build_sched_domains() (next to the code that establishes the rd
> max_cpu_capacity)?

Right, that's also what Vincent suggested IIRC. I think that'd work :)

Thanks,
Quentin

2019-07-09 13:26:15

by luca abeni

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

Hi Peter,

On Mon, 8 Jul 2019 15:55:36 +0200
Peter Zijlstra <[email protected]> wrote:

> On Mon, May 06, 2019 at 06:48:33AM +0200, Luca Abeni wrote:
> > @@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
> > dl_se->dl_overrun = 1;
> >
> > __dequeue_task_dl(rq, curr, 0);
> > - if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> > + if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
> > enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> > +#ifdef CONFIG_SMP
> > + } else if (dl_se->dl_adjust) {
> > + if (rq->migrating_task == NULL) {
> > + queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);
>
> I'm not entirely sure about this one.
>
> That is, we only do those callbacks from:
>
> schedule_tail()
> __schedule()
> rt_mutex_setprio()
> __sched_setscheduler()
>
> and the above looks like it can happen outside of those.

Sorry, I did not know the constraints or requirements for using
queue_balance_callback()...

I used it because I wanted to trigger a migration from
update_curr_dl(), but invoking double_lock_balance() from this function
obviously resulted in a warning. So, I probably misunderstood the
purpose of the balance callback API, and I misused it.

What would have been the "right way" to trigger a migration for a task
when it is throttled?


>
> The pattern in those sites is:
>
> rq_lock();
> ... do crap that leads to queue_balance_callback()
> rq_unlock()
> if (rq->balance_callback) {
> raw_spin_lock_irqsave(rq->lock, flags);
> ... do callbacks
> raw_spin_unlock_irqrestore(rq->lock, flags);
> }
>
> So I suppose we can catch abuse of this API by doing something like
> the below; can you validate?

Sorry; right now I cannot run tests on big.LITTLE machines...
Maybe Dietmar (added in cc), who is working on mainlining this patchset,
can test?



Thanks,
Luca

>
> ---
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aaca0e743776..89e615f1eae6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1134,6 +1134,14 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
> rf->cookie = lockdep_pin_lock(&rq->lock);
>
> #ifdef CONFIG_SCHED_DEBUG
> +#ifdef CONFIG_SMP
> + /*
> + * There should not be pending callbacks at the start of rq_lock();
> + * all sites that handle them flush them at the end.
> + */
> + WARN_ON_ONCE(rq->balance_callback);
> +#endif
> +
> rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
> rf->clock_update_flags = 0;
> #endif

2019-07-09 13:45:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On Tue, Jul 09, 2019 at 03:24:36PM +0200, luca abeni wrote:
> Hi Peter,
>
> On Mon, 8 Jul 2019 15:55:36 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, May 06, 2019 at 06:48:33AM +0200, Luca Abeni wrote:
> > > @@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
> > > dl_se->dl_overrun = 1;
> > >
> > > __dequeue_task_dl(rq, curr, 0);
> > > - if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> > > + if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
> > > enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
> > > +#ifdef CONFIG_SMP
> > > + } else if (dl_se->dl_adjust) {
> > > + if (rq->migrating_task == NULL) {
> > > + queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);
> >
> > I'm not entirely sure about this one.
> >
> > That is, we only do those callbacks from:
> >
> > schedule_tail()
> > __schedule()
> > rt_mutex_setprio()
> > __sched_setscheduler()
> >
> > and the above looks like it can happen outside of those.
>
> Sorry, I did not know the constraints or requirements for using
> queue_balance_callback()...
>
> I used it because I wanted to trigger a migration from
> update_curr_dl(), but invoking double_lock_balance() from this function
> obviously resulted in a warning. So, I probably misunderstood the
> purpose of the balance callback API, and I misused it.
>
> What would have been the "right way" to trigger a migration for a task
> when it is throttled?

I'm thinking we'll end up in schedule() pretty soon after a throttle to
make 'current' go away, right? We could put the queue_balance_callback()
in dequeue_task_dl() or something.

2019-07-09 14:45:10

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On 7/9/19 3:24 PM, luca abeni wrote:
> Hi Peter,
>
> On Mon, 8 Jul 2019 15:55:36 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, May 06, 2019 at 06:48:33AM +0200, Luca Abeni wrote:
>>> @@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
>>> dl_se->dl_overrun = 1;
>>>
>>> __dequeue_task_dl(rq, curr, 0);
>>> - if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
>>> + if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
>>> enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>>> +#ifdef CONFIG_SMP
>>> + } else if (dl_se->dl_adjust) {
>>> + if (rq->migrating_task == NULL) {
>>> + queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);
>>
>> I'm not entirely sure about this one.
>>
>> That is, we only do those callbacks from:
>>
>> schedule_tail()
>> __schedule()
>> rt_mutex_setprio()
>> __sched_setscheduler()
>>
>> and the above looks like it can happen outside of those.
>
> Sorry, I did not know the constraints or requirements for using
> queue_balance_callback()...
>
> I used it because I wanted to trigger a migration from
> update_curr_dl(), but invoking double_lock_balance() from this function
> obviously resulted in a warning. So, I probably misunderstood the
> purpose of the balance callback API, and I misused it.
>
> What would have been the "right way" to trigger a migration for a task
> when it is throttled?
>
>
>>
>> The pattern in those sites is:
>>
>> rq_lock();
>> ... do crap that leads to queue_balance_callback()
>> rq_unlock()
>> if (rq->balance_callback) {
>> raw_spin_lock_irqsave(rq->lock, flags);
>> ... do callbacks
>> raw_spin_unlock_irqrestore(rq->lock, flags);
>> }
>>
>> So I suppose we can catch abuse of this API by doing something like
>> the below; can you validate?
>
> Sorry; right now I cannot run tests on big.LITTLE machines...
> Maybe Dietmar (added in cc), who is working on mainlining this patcset,
> can test?

I do see this one triggering on ARM64 (Juno, 2 big/4 LITTLE,
performance CPUfreq gov, CPU_IDLE disabled):

1 deadline task (12000, 100000, 100000)

but the warnings come out of the pi, CFS and tick code?

[ 70.190812] WARNING: CPU: 0 PID: 3550 at kernel/sched/sched.h:1145
task_rq_lock+0xe8/0xf0
...
[ 70.310931] Call trace:
[ 70.313352] task_rq_lock+0xe8/0xf0
[ 70.316808] inactive_task_timer+0x48/0x4f0
[ 70.320951] __hrtimer_run_queues+0x11c/0x3d0
[ 70.325265] hrtimer_interrupt+0xd8/0x248
[ 70.329236] arch_timer_handler_phys+0x38/0x58
[ 70.333637] handle_percpu_devid_irq+0x90/0x2b8
[ 70.338123] generic_handle_irq+0x34/0x50
[ 70.342093] __handle_domain_irq+0x68/0xc0
[ 70.346149] gic_handle_irq+0x60/0xb0
[ 70.349773] el1_irq+0xbc/0x180
[ 70.352884] _raw_spin_unlock_irqrestore+0x64/0x90
[ 70.357629] rt_mutex_adjust_pi+0x4c/0xb0
[ 70.361599] __sched_setscheduler+0x49c/0x830
[ 70.365912] _sched_setscheduler+0x98/0xc0
[ 70.369967] do_sched_setscheduler+0xb4/0x118
[ 70.374281] __arm64_sys_sched_setscheduler+0x28/0x40
[ 70.379285] el0_svc_common.constprop.0+0x7c/0x178
[ 70.384029] el0_svc_handler+0x34/0x90
[ 70.387739] el0_svc+0x8/0xc
...
[ 70.395177] WARNING: CPU: 4 PID: 43 at kernel/sched/sched.h:1145
update_blocked_averages+0x924/0x998
...
[ 70.523815] Call trace:
[ 70.526236] update_blocked_averages+0x924/0x998
[ 70.530807] update_nohz_stats+0x78/0xa0
[ 70.534690] find_busiest_group+0x5f0/0xc18
[ 70.538831] load_balance+0x174/0xbc0
[ 70.542456] pick_next_task_fair+0x34c/0x740
[ 70.546683] __schedule+0x130/0x690
[ 70.550136] schedule+0x38/0xc0
[ 70.553246] worker_thread+0xc8/0x458
[ 70.556872] kthread+0x130/0x138
[ 70.560067] ret_from_fork+0x10/0x1c
...
[ 70.568191] WARNING: CPU: 0 PID: 3550 at kernel/sched/sched.h:1145
scheduler_tick+0x110/0x118
...
[ 70.690607] Call trace:
[ 70.693029] scheduler_tick+0x110/0x118
[ 70.696826] update_process_times+0x48/0x60
[ 70.700968] tick_sched_handle.isra.5+0x44/0x68
[ 70.705451] tick_sched_timer+0x50/0xa0
[ 70.709249] __hrtimer_run_queues+0x11c/0x3d0
[ 70.713562] hrtimer_interrupt+0xd8/0x248
[ 70.717531] arch_timer_handler_phys+0x38/0x58
[ 70.721930] handle_percpu_devid_irq+0x90/0x2b8
[ 70.726416] generic_handle_irq+0x34/0x50
[ 70.730385] __handle_domain_irq+0x68/0xc0
[ 70.734439] gic_handle_irq+0x60/0xb0
[ 70.738063] el1_irq+0xbc/0x180
[ 70.741172] _raw_spin_unlock_irqrestore+0x64/0x90
[ 70.745916] rt_mutex_adjust_pi+0x4c/0xb0
[ 70.749885] __sched_setscheduler+0x49c/0x830
[ 70.754198] _sched_setscheduler+0x98/0xc0
[ 70.758253] do_sched_setscheduler+0xb4/0x118
[ 70.762567] __arm64_sys_sched_setscheduler+0x28/0x40
[ 70.767569] el0_svc_common.constprop.0+0x7c/0x178
[ 70.772312] el0_svc_handler+0x34/0x90
[ 70.776022] el0_svc+0x8/0xc

2019-07-11 11:19:20

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On 7/9/19 3:42 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2019 at 03:24:36PM +0200, luca abeni wrote:
>> Hi Peter,
>>
>> On Mon, 8 Jul 2019 15:55:36 +0200
>> Peter Zijlstra <[email protected]> wrote:
>>
>>> On Mon, May 06, 2019 at 06:48:33AM +0200, Luca Abeni wrote:
>>>> @@ -1223,8 +1250,17 @@ static void update_curr_dl(struct rq *rq)
>>>> dl_se->dl_overrun = 1;
>>>>
>>>> __dequeue_task_dl(rq, curr, 0);
>>>> - if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
>>>> + if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr))) {
>>>> enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>>>> +#ifdef CONFIG_SMP
>>>> + } else if (dl_se->dl_adjust) {
>>>> + if (rq->migrating_task == NULL) {
>>>> + queue_balance_callback(rq, &per_cpu(dl_migrate_head, rq->cpu), migrate_dl_task);
>>>
>>> I'm not entirely sure about this one.
>>>
>>> That is, we only do those callbacks from:
>>>
>>> schedule_tail()
>>> __schedule()
>>> rt_mutex_setprio()
>>> __sched_setscheduler()
>>>
>>> and the above looks like it can happen outside of those.
>>
>> Sorry, I did not know the constraints or requirements for using
>> queue_balance_callback()...
>>
>> I used it because I wanted to trigger a migration from
>> update_curr_dl(), but invoking double_lock_balance() from this function
>> obviously resulted in a warning. So, I probably misunderstood the
>> purpose of the balance callback API, and I misused it.
>>
>> What would have been the "right way" to trigger a migration for a task
>> when it is throttled?
>
> I'm thinking we'll end up in schedule() pretty soon after a throttle to
> make 'current' go away, right? We could put the queue_balance_callback()
> in dequeue_task_dl() or something.

Is this what you are concerned about?

(2 CPUs (CPU1, CPU2), 4 deadline tasks (thread0-X)) with

@@ -1137,6 +1137,13 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
rf->cookie = lockdep_pin_lock(&rq->lock);

#ifdef CONFIG_SCHED_DEBUG
+#ifdef CONFIG_SMP
+ /*
+ * There should not be pending callbacks at the start of rq_lock();
+ * all sites that handle them flush them at the end.
+ */
+ WARN_ON_ONCE(rq->balance_callback);
+#endif


[ 87.251233] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-2 3626] on CPU1
[ 87.251237] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-3 3627] on CPU2
[ 87.251261] WARNING: CPU: 2 PID: 3627 at kernel/sched/sched.h:1145 __schedule+0x56c/0x690
[ 87.259173] WARNING: CPU: 1 PID: 3626 at kernel/sched/sched.h:1145 try_to_wake_up+0x68c/0x778
[ 87.615871] WARNING: CPU: 1 PID: 3626 at kernel/sched/sched.h:1145 scheduler_tick+0x110/0x118
[ 87.615882] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 task_rq_lock+0xe8/0xf0
[ 88.012571] WARNING: CPU: 1 PID: 3626 at kernel/sched/sched.h:1145 update_blocked_averages+0x924/0x998
[ 88.176844] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x4d0/0xbc0
[ 88.381905] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x7d8/0xbc0
[ 88.586991] *** ---> migrate_dl_task() p=[thread0-3 3627] to CPU1
[ 88.587008] *** ---> migrate_dl_task() p=[thread0-2 3626] to CPU1
[ 89.335307] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-0 3624] on CPU2
[ 89.343252] *** ---> migrate_dl_task() p=[thread0-0 3624] to CPU1
[ 89.379309] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-1 3625] on CPU2
[ 89.387249] *** ---> migrate_dl_task() p=[thread0-1 3625] to CPU1
[ 89.591323] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-3 3627] on CPU2
[ 89.599263] *** ---> migrate_dl_task() p=[thread0-3 3627] to CPU2

2019-07-11 12:03:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On Thu, Jul 11, 2019 at 01:17:17PM +0200, Dietmar Eggemann wrote:
> On 7/9/19 3:42 PM, Peter Zijlstra wrote:

> >>> That is, we only do those callbacks from:
> >>>
> >>> schedule_tail()
> >>> __schedule()
> >>> rt_mutex_setprio()
> >>> __sched_setscheduler()
> >>>
> >>> and the above looks like it can happen outside of those.

> Is this what you are concerned about?
>
> (2 Cpus (CPU1, CPU2), 4 deadline task (thread0-X)) with
>
> @@ -1137,6 +1137,13 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
> rf->cookie = lockdep_pin_lock(&rq->lock);
>
> #ifdef CONFIG_SCHED_DEBUG
> +#ifdef CONFIG_SMP
> + /*
> + * There should not be pending callbacks at the start of rq_lock();
> + * all sites that handle them flush them at the end.
> + */
> + WARN_ON_ONCE(rq->balance_callback);
> +#endif
>
>
> [ 87.251237] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-3 3627] on CPU2
> [ 87.251261] WARNING: CPU: 2 PID: 3627 at kernel/sched/sched.h:1145 __schedule+0x56c/0x690
> [ 87.615882] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 task_rq_lock+0xe8/0xf0
> [ 88.176844] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x4d0/0xbc0
> [ 88.381905] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x7d8/0xbc0

I'm not sure how we get 4 warns; I was thinking that as soon as we exit
__schedule() we'd process the callback, so further warns would be
avoided.

> [ 88.586991] *** ---> migrate_dl_task() p=[thread0-3 3627] to CPU1

But yes, something like this. Basically I want to avoid calling
queue_balance_callback() from a context where we'll not follow up with
balance_callback().

2019-07-11 15:49:50

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

On 7/11/19 2:00 PM, Peter Zijlstra wrote:
> On Thu, Jul 11, 2019 at 01:17:17PM +0200, Dietmar Eggemann wrote:
>> On 7/9/19 3:42 PM, Peter Zijlstra wrote:
>
>>>>> That is, we only do those callbacks from:
>>>>>
>>>>> schedule_tail()
>>>>> __schedule()
>>>>> rt_mutex_setprio()
>>>>> __sched_setscheduler()
>>>>>
>>>>> and the above looks like it can happen outside of those.
>
>> Is this what you are concerned about?
>>
>> (2 Cpus (CPU1, CPU2), 4 deadline task (thread0-X)) with
>>
>> @@ -1137,6 +1137,13 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
>> rf->cookie = lockdep_pin_lock(&rq->lock);
>>
>> #ifdef CONFIG_SCHED_DEBUG
>> +#ifdef CONFIG_SMP
>> + /*
>> + * There should not be pending callbacks at the start of rq_lock();
>> + * all sites that handle them flush them at the end.
>> + */
>> + WARN_ON_ONCE(rq->balance_callback);
>> +#endif
>>
>>
>> [ 87.251237] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-3 3627] on CPU2
>> [ 87.251261] WARNING: CPU: 2 PID: 3627 at kernel/sched/sched.h:1145 __schedule+0x56c/0x690
>> [ 87.615882] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 task_rq_lock+0xe8/0xf0
>> [ 88.176844] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x4d0/0xbc0
>> [ 88.381905] WARNING: CPU: 2 PID: 3616 at kernel/sched/sched.h:1145 load_balance+0x7d8/0xbc0
>
> I'm not sure how we get 4 warns; I was thinking that as soon as we exit
> __schedule() we'd process the callback, so further warns would be
> avoided.

Reducing the warning to only fire on CPU1, I got another test run:

[ 6688.373607] *** <--- queue_balance_callback(migrate_dl_task) p=[thread0-3 4343] on CPU1
[ 6688.381557] WARNING: CPU: 1 PID: 4343 at kernel/sched/sched.h:1146 try_to_wake_up+0x614/0x788
...
[ 6688.505000] try_to_wake_up+0x614/0x788
[ 6688.508794] default_wake_function+0x34/0x48
[ 6688.513017] autoremove_wake_function+0x3c/0x68
[ 6688.517497] __wake_up_common+0x90/0x158
[ 6688.521374] __wake_up_common_lock+0x88/0xd0
[ 6688.525595] __wake_up+0x40/0x50
[ 6688.528787] wake_up_klogd_work_func+0x4c/0x88
[ 6688.533184] irq_work_run_list+0x8c/0xd8
[ 6688.537063] irq_work_tick+0x48/0x60
[ 6688.540598] update_process_times+0x44/0x60
[ 6688.544735] tick_sched_handle.isra.5+0x44/0x68
[ 6688.549215] tick_sched_timer+0x50/0xa0
[ 6688.553007] __hrtimer_run_queues+0x11c/0x3d0
[ 6688.557316] hrtimer_interrupt+0xd8/0x248
[ 6688.561282] arch_timer_handler_phys+0x38/0x58
[ 6688.565678] handle_percpu_devid_irq+0x90/0x2b8
[ 6688.570160] generic_handle_irq+0x34/0x50
[ 6688.574124] __handle_domain_irq+0x68/0xc0
[ 6688.578175] gic_handle_irq+0x60/0xb0
...
[ 6688.589909] WARNING: CPU: 1 PID: 4343 at kernel/sched/sched.h:1146 scheduler_tick+0xe8/0x128
...
[ 6688.714463] scheduler_tick+0xe8/0x128
[ 6688.718170] update_process_times+0x48/0x60
[ 6688.722306] tick_sched_handle.isra.5+0x44/0x68
[ 6688.726786] tick_sched_timer+0x50/0xa0
[ 6688.730579] __hrtimer_run_queues+0x11c/0x3d0
[ 6688.734887] hrtimer_interrupt+0xd8/0x248
[ 6688.738852] arch_timer_handler_phys+0x38/0x58
[ 6688.743246] handle_percpu_devid_irq+0x90/0x2b8
[ 6688.747727] generic_handle_irq+0x34/0x50
[ 6688.751692] __handle_domain_irq+0x68/0xc0
[ 6688.755741] gic_handle_irq+0x60/0xb0
...
[ 6688.767476] WARNING: CPU: 1 PID: 4343 at kernel/sched/sched.h:1146 task_rq_lock+0xc0/0x100
...
[ 6688.891511] task_rq_lock+0xc0/0x100
[ 6688.895046] dl_task_timer+0x48/0x2c8
[ 6688.898666] __hrtimer_run_queues+0x11c/0x3d0
[ 6688.902975] hrtimer_interrupt+0xd8/0x248
[ 6688.906939] arch_timer_handler_phys+0x38/0x58
[ 6688.911334] handle_percpu_devid_irq+0x90/0x2b8
[ 6688.915815] generic_handle_irq+0x34/0x50
[ 6688.919779] __handle_domain_irq+0x68/0xc0
[ 6688.923828] gic_handle_irq+0x60/0xb0
...
[ 6688.944618] WARNING: CPU: 1 PID: 4343 at kernel/sched/sched.h:1146 update_blocked_averages+0x84c/0x9a0
...
[ 6689.071664] update_blocked_averages+0x84c/0x9a0
[ 6689.076231] run_rebalance_domains+0x74/0xb0
[ 6689.080452] __do_softirq+0x154/0x3f0
[ 6689.084074] irq_exit+0xf0/0xf8
[ 6689.087178] __handle_domain_irq+0x6c/0xc0
[ 6689.091228] gic_handle_irq+0x60/0xb0
...
[ 6689.303143] WARNING: CPU: 1 PID: 4343 at kernel/sched/sched.h:1146 __schedule+0x478/0x698
...
[ 6689.440861] __schedule+0x478/0x698
[ 6689.444310] schedule+0x38/0xc0
[ 6689.447416] do_notify_resume+0x88/0x380
[ 6689.451294] work_pending+0x8/0x14
...
[ 6689.459256] *** ---> migrate_dl_task() p=[thread0-3 4343] to CPU-1

>
>> [ 88.586991] *** ---> migrate_dl_task() p=[thread0-3 3627] to CPU1
>
> But yes, something like this. Basically I want to avoid calling
> queue_balance_callback() from a context where we'll not follow up with
> balance_callback().

Understood.