2016-11-10 08:08:22

by Peter Zijlstra

Subject: [RFD] sched/deadline: Support single CPU affinity



Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
wanting single CPU affinity is better QoS than provided by G-EDF.

Therefore the aim is to provide harder guarantees, similar to UP, for single
CPU affine tasks. This then leads to a mixed criticality scheduling
requirement for the CPU scheduler. G-EDF like for the non-affine (global)
tasks and UP like for the single CPU tasks.



ADMISSION CONTROL

Do simple UP admission control on the CPU local tasks, and subtract the
admitted bandwidth from the global total when doing global admission control.

single cpu: U[n] := \Sum tl_u,n <= 1
global: \Sum tg_u <= N - \Sum U[n]
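
For illustration only, here is a minimal user-space sketch of those two tests
(not the kernel implementation; the names NR_CPUS_DEMO, cpu_bw[] and
global_bw are made up, and bandwidth is a fixed-point fraction of one CPU as
in to_ratio()):

#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_DEMO	4
#define BW_UNIT		(1ull << 20)		/* 100% of one CPU */

static uint64_t cpu_bw[NR_CPUS_DEMO];		/* U[n] = \Sum tl_u,n */
static uint64_t global_bw;			/* \Sum tg_u */

/* single cpu: U[n] := \Sum tl_u,n <= 1 */
static bool admit_local(int cpu, uint64_t new_bw)
{
	if (cpu_bw[cpu] + new_bw > BW_UNIT)
		return false;
	cpu_bw[cpu] += new_bw;
	return true;
}

/* global: \Sum tg_u <= N - \Sum U[n] */
static bool admit_global(uint64_t new_bw)
{
	uint64_t avail = (uint64_t)NR_CPUS_DEMO * BW_UNIT;
	int n;

	for (n = 0; n < NR_CPUS_DEMO; n++)
		avail -= cpu_bw[n];	/* subtract locally admitted bandwidth */

	if (global_bw + new_bw > avail)
		return false;
	global_bw += new_bw;
	return true;
}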



MIXED CRITICALITY SCHEDULING

Since we want to provide better guarantees for single CPU affine tasks than
the G-EDF scheduler provides for the single CPU tasks, we need to somehow
alter the scheduling algorithm.

The trivial layered EDF/G-EDF approach is obviously flawed in that it will
result in many unnecessary deadline misses. The trivial example is having a
single CPU task with a deadline after a runnable global task. By always
running single CPU tasks over global tasks we can make the global task miss
its deadline even though we could easily have run both within the allotted
time.

Therefore we must use a more complicated scheme. By adding a second measure
present in the sporadic task model to the scheduling function we can try and
distinguish between the constraints of handling the two cases in a single
scheduler.

We define the time to fail as:

ttf(t) := t_d - t_b; where

t_d is t's absolute deadline
t_b is t's remaining budget

This is the last possible moment we must schedule this task such that it can
complete its work and not miss its deadline.

We then augment the regular EDF rules by, for local tasks, considering the
time to fail and letting this measure override the regular EDF pick when the
time to fail would be overrun by the EDF pick.

That is, when:

now + left_b > min(ttf)
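
As a worked example with made-up numbers: suppose the EDF pick is a global
task with its deadline at now + 5ms and left_b = 4ms of budget remaining,
while a local task has its deadline at now + 10ms with 7ms of budget left,
so ttf = now + 3ms. Letting the EDF pick run would push the local task past
now + 3ms, after which it can no longer finish in time, so the rule above
overrides EDF and runs the local task first. Had the EDF pick only 2ms of
budget left, now + 2ms <= now + 3ms and the normal EDF order would stand.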

Use an augmented RB-tree to store the absolute deadlines of all rq-local tasks
and keep the heap sorted on the earliest time to fail of a locally affine task.



TODO

- finish patch, this only sketches the outlines
- formal analysis of the proposed scheduling function; this is only a hunch.



---
include/linux/sched.h | 1 +
kernel/sched/core.c | 75 ++++++++++++++++++-------
kernel/sched/deadline.c | 142 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 12 ++--
4 files changed, 191 insertions(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3762fe4e3a80..32f948615d4c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1412,6 +1412,7 @@ struct sched_rt_entity {

struct sched_dl_entity {
struct rb_node rb_node;
+ u64 __subtree_ttf;

/*
* Original scheduling parameters. Copied here from sched_attr
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bee18baa603a..46995c060a89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2469,6 +2469,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
}

#ifdef CONFIG_SMP
+DEFINE_PER_CPU(u64, dl_cpu_bw);
+
inline struct dl_bw *dl_bw_of(int i)
{
RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
@@ -2476,17 +2478,39 @@ inline struct dl_bw *dl_bw_of(int i)
return &cpu_rq(i)->rd->dl_bw;
}

-static inline int dl_bw_cpus(int i)
+static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
+ int new_cpu, u64 new_bw)
{
- struct root_domain *rd = cpu_rq(i)->rd;
- int cpus = 0;
+ struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
+ u64 total, avail, used;
+ int i;

- RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
- "sched RCU must be held");
- for_each_cpu_and(i, rd->span, cpu_active_mask)
- cpus++;
+ if (dl_b->bw == -1) /* admission control disabled */
+ return false;
+
+ avail = 0;
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ total = dl_b->bw;
+ used = per_cpu(dl_cpu_bw, i);
+
+ if (old_cpu == i)
+ used -= old_bw;
+ if (new_cpu == i)
+ used += new_bw;
+
+ if (used > total)
+ return true;
+
+ avail += total - used;
+ }
+
+ total = dl_b->total_bw;
+ if (old_cpu == -1)
+ total -= old_bw;
+ if (new_cpu == -1)
+ total += new_bw;

- return cpus;
+ return avail < total;
}
#else
inline struct dl_bw *dl_bw_of(int i)
@@ -2494,12 +2518,21 @@ inline struct dl_bw *dl_bw_of(int i)
return &cpu_rq(i)->dl.dl_bw;
}

-static inline int dl_bw_cpus(int i)
+static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
+ int new_cpu, u64 new_bw)
{
- return 1;
+ u64 avail;
+
+ if (dl_b->bw == -1) /* admission control disabled */
+ return false;
+
+ avail = dl_b->bw;
+
+ return avail < dl_b->total_bw - old_bw + new_bw;
}
#endif

+
/*
* We must be sure that accepting a new task (or allowing changing the
* parameters of an existing one) is consistent with the bandwidth
@@ -2519,7 +2552,7 @@ static int dl_overflow(struct task_struct *p, int policy,
u64 period = attr->sched_period ?: attr->sched_deadline;
u64 runtime = attr->sched_runtime;
u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
- int cpus, err = -1;
+ int err = -1;

/* !deadline task may carry old deadline bandwidth */
if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
@@ -2531,18 +2564,22 @@ static int dl_overflow(struct task_struct *p, int policy,
* allocated bandwidth of the container.
*/
raw_spin_lock(&dl_b->lock);
- cpus = dl_bw_cpus(task_cpu(p));
+ /* new DL task */
if (dl_policy(policy) && !task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, 0, new_bw)) {
+ !__dl_overflow(dl_b, -1, 0, -1, new_bw)) {
__dl_add(dl_b, new_bw);
err = 0;
+
+ /* changed DL task */
} else if (dl_policy(policy) && task_has_dl_policy(p) &&
- !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ !__dl_overflow(dl_b, -1, p->dl.dl_bw, -1, new_bw)) {
+ __dl_sub(dl_b, p->dl.dl_bw);
__dl_add(dl_b, new_bw);
err = 0;
+
+ /* leaves DL */
} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
- __dl_clear(dl_b, p->dl.dl_bw);
+ __dl_sub(dl_b, p->dl.dl_bw);
err = 0;
}
raw_spin_unlock(&dl_b->lock);
@@ -5390,8 +5427,7 @@ int task_can_attach(struct task_struct *p,
rcu_read_lock_sched();
dl_b = dl_bw_of(dest_cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(dest_cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
+ overflow = __dl_overflow(dl_b, -1, 0, -1, p->dl.dl_bw);
if (overflow)
ret = -EBUSY;
else {
@@ -7309,8 +7345,7 @@ static int cpuset_cpu_inactive(unsigned int cpu)
dl_b = dl_bw_of(cpu);

raw_spin_lock_irqsave(&dl_b->lock, flags);
- cpus = dl_bw_cpus(cpu);
- overflow = __dl_overflow(dl_b, cpus, 0, 0);
+ overflow = __dl_overflow(dl_b, -1, 0, -1, 0);
raw_spin_unlock_irqrestore(&dl_b->lock, flags);

rcu_read_unlock_sched();
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 37e2449186c4..4eb65ea9c100 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -778,6 +778,11 @@ static void update_curr_dl(struct rq *rq)
}
}

+static inline struct sched_dl_entity *dl_of(struct rb_node *node)
+{
+ return rb_entry(node, struct sched_dl_entity, rb_node);
+}
+
#ifdef CONFIG_SMP

static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
@@ -805,9 +810,8 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
cpudl_clear(&rq->rd->cpudl, rq->cpu);
} else {
struct rb_node *leftmost = dl_rq->rb_leftmost;
- struct sched_dl_entity *entry;
+ struct sched_dl_entity *entry = dl_of(leftmost);

- entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);
dl_rq->earliest_dl.curr = entry->deadline;
cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline);
}
@@ -848,6 +852,69 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
dec_dl_migration(dl_se, dl_rq);
}

+static inline u64 entity_ttf(struct sched_dl_entity *dl_se)
+{
+ u64 ttf = U64_MAX;
+
+ if (dl_se->flags & DL_AFFINE)
+ ttf = dl_se->deadline - dl_se->runtime;
+
+ return ttf;
+}
+
+static inline u64
+compute_subtree_ttf(struct sched_dl_entity *dl_se)
+{
+ u64 subtree_ttf, ttf = entity_ttf(dl_se);
+
+ if (dl_se->rb_node.rb_left) {
+ subtree_ttf = dl_of(dl_se->rb_node.rb_left)->__subtree_ttf;
+ ttf = min(ttf, subtree_ttf);
+ }
+ if (dl_se->rb_node.rb_right) {
+ subtree_ttf = dl_of(dl_se->rb_node.rb_right)->__subtree_ttf;
+ ttf = min(ttf, subtree_ttf);
+ }
+
+ return ttf;
+}
+
+static void ttf_propagate(struct rb_node *rb, struct rb_node *stop)
+{
+ while (rb != stop) {
+ struct sched_dl_entity *dl_se = dl_of(rb);
+ u64 subtree_ttf = compute_subtree_ttf(dl_se);
+
+ if (dl_se->__subtree_ttf == subtree_ttf)
+ break;
+ dl_se->__subtree_ttf = subtree_ttf;
+ rb = rb_parent(rb);
+ }
+}
+
+static void ttf_copy(struct rb_node *rb_old, struct rb_node *rb_new)
+{
+ struct sched_dl_entity *dl_old = dl_of(rb_old);
+ struct sched_dl_entity *dl_new = dl_of(rb_new);
+
+ dl_new->__subtree_ttf = dl_old->__subtree_ttf;
+}
+
+static void ttf_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
+{
+ struct sched_dl_entity *dl_old = dl_of(rb_old);
+ struct sched_dl_entity *dl_new = dl_of(rb_new);
+
+ dl_new->__subtree_ttf = dl_old->__subtree_ttf;
+ dl_old->__subtree_ttf = compute_subtree_ttf(dl_old);
+}
+
+static const struct rb_augment_callbacks ttf_callbacks = {
+ .propagate = ttf_propagate,
+ .copy = ttf_copy,
+ .rotate = ttf_rotate,
+};
+
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@@ -855,15 +922,16 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
struct rb_node *parent = NULL;
struct sched_dl_entity *entry;
int leftmost = 1;
+ u64 ttf;

BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));

while (*link) {
parent = *link;
- entry = rb_entry(parent, struct sched_dl_entity, rb_node);
- if (dl_time_before(dl_se->deadline, entry->deadline))
+ entry = dl_of(parent);
+ if (dl_time_before(dl_se->deadline, entry->deadline)) {
link = &parent->rb_left;
- else {
+ } else {
link = &parent->rb_right;
leftmost = 0;
}
@@ -872,8 +940,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
if (leftmost)
dl_rq->rb_leftmost = &dl_se->rb_node;

+ dl_se->__subtree_ttf = ttf = entity_ttf(dl_se);
rb_link_node(&dl_se->rb_node, parent, link);
- rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
+ rb_insert_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
+
+ if (dl_of(dl_rq->rb_root.rb_node)->__subtree_ttf == ttf)
+ dl_rq->rb_least_ttf = dl_rq->rb_root.rb_node;

inc_dl_tasks(dl_se, dl_rq);
}
@@ -892,9 +964,41 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
dl_rq->rb_leftmost = next_node;
}

- rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
+ rb_erase_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
RB_CLEAR_NODE(&dl_se->rb_node);

+ if (dl_rq->rb_least_ttf == &dl_se->rb_node) {
+ struct rb_node **link = &dl_rq->rb_root.rb_node;
+
+ while (*link) {
+ struct rb_node *node = *link;
+ struct sched_dl_entity *entry = dl_of(node);
+ u64 subtree_ttf = entry->__subtree_ttf;
+
+ if (subtree_ttf == U64_MAX) {
+ dl_rq->rb_least_ttf = NULL;
+ goto no_least_ttf;
+ }
+
+ if (node->rb_left &&
+ dl_of(node->rb_left)->__subtree_ttf == subtree_ttf) {
+ link = &node->rb_left;
+ continue;
+ }
+
+ if (node->rb_right &&
+ dl_of(node->rb_right)->__subtree_ttf == subtree_ttf) {
+ link = &node->rb_right;
+ continue;
+ }
+
+ break;
+ }
+
+ dl_rq->rb_least_ttf = *link;
+no_least_ttf: ;
+ }
+
dec_dl_tasks(dl_se, dl_rq);
}

@@ -1109,12 +1213,28 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
struct dl_rq *dl_rq)
{
- struct rb_node *left = dl_rq->rb_leftmost;
+ struct rb_node *rb_left = dl_rq->rb_leftmost;
+ struct sched_dl_entity *left, *least;
+ u64 now;

- if (!left)
+ if (!rb_left)
return NULL;

- return rb_entry(left, struct sched_dl_entity, rb_node);
+ left = dl_of(rb_left);
+
+ /*
+ * Only local tasks have TTF set, if one is present and it would miss
+ * its deadline because of the (G)EDF task selection, override that
+ * and select the local task.
+ */
+ if (dl_rq->rb_least_ttf) {
+ least = dl_of(dl_rq->rb_least_ttf);
+ now = rq_clock_task(rq);
+ if (now + left->runtime > entity_ttf(least))
+ return least;
+ }
+
+ return left;
}

struct task_struct *
@@ -1645,7 +1765,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
* until we complete the update.
*/
raw_spin_lock(&src_dl_b->lock);
- __dl_clear(src_dl_b, p->dl.dl_bw);
+ __dl_sub(src_dl_b, p->dl.dl_bw);
raw_spin_unlock(&src_dl_b->lock);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935d4421..55059d2568bc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -198,13 +198,15 @@ static inline int dl_bandwidth_enabled(void)

extern struct dl_bw *dl_bw_of(int i);

+extern DECLARE_PER_CPU(u64, dl_cpu_bw);
+
struct dl_bw {
raw_spinlock_t lock;
u64 bw, total_bw;
};

static inline
-void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
+void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
{
dl_b->total_bw -= tsk_bw;
}
@@ -215,13 +217,6 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
dl_b->total_bw += tsk_bw;
}

-static inline
-bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
-{
- return dl_b->bw != -1 &&
- dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
-}
-
extern struct mutex sched_domains_mutex;

#ifdef CONFIG_CGROUP_SCHED
@@ -507,6 +502,7 @@ struct dl_rq {
/* runqueue is an rbtree, ordered by deadline */
struct rb_root rb_root;
struct rb_node *rb_leftmost;
+ struct rb_node *rb_least_ttf;

unsigned long dl_nr_running;



2016-11-10 09:06:17

by Luca Abeni

Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi Peter,

On Thu, 10 Nov 2016 09:08:07 +0100
Peter Zijlstra <[email protected]> wrote:

> Add support for single CPU affinity to SCHED_DEADLINE; the supposed
> reason for wanting single CPU affinity is better QoS than provided by
> G-EDF.
This looks very interesting, thanks for sharing!
I have some "theoretical" comments; I'll look at the code later.

> Therefore the aim is to provide harder guarantees, similar to UP, for
> single CPU affine tasks. This then leads to a mixed criticality
> scheduling requirement for the CPU scheduler. G-EDF like for the
> non-affine (global) tasks and UP like for the single CPU tasks.
The goal for single CPU affine tasks is clear; which kind of guarantees
do you want to provide for the other (global) tasks? The tardiness
guarantees?


[...]
> MIXED CRITICALITY SCHEDULING
>
> Since we want to provide better guarantees for single CPU affine
> tasks than the G-EDF scheduler provides for the single CPU tasks, we
> need to somehow alter the scheduling algorithm.
>
> The trivial layered EDF/G-EDF approach is obviously flawed in that it
> will result in many unnecessary deadline misses. The trivial example
> is having a single CPU task with a deadline after a runnable global
> task. By always running single CPU tasks over global tasks we can
> make the global task miss its deadline even though we could easily
> have ran both within the alloted time.
Ok; the layered approach clearly causes some unneeded deadline misses
on global tasks... But those tasks risk missing deadlines anyway (with
the current scheduler, they are guaranteed to have limited tardiness,
not to respect all of the deadlines). I think this is related to the
question about which guarantees are provided to the global tasks.


> Therefore we must use a more complicated scheme. By adding a second
> measure present in the sporadic task model to the scheduling function
> we can try and distinguish between the constraints of handling the
> two cases in a single scheduler.
>
> We define the time to fail as:
>
> ttf(t) := t_d - t_b; where
>
> t_d is t's absolute deadline
> t_b is t's remaining budget
Ok; I think this is similar to what is called "pseudo-deadline" in some
papers.
If I understand well, it is equal to the current time + the task
"laxity" (or slack time). So, scheduling the task with the smaller ttf
is equivalent to the "least laxity first" (LLF) algorithm.
Giving precedence to tasks with 0 laxity is a technique that is often
used to improve the schedulability on multi-processor systems.

> This is the last possible moment we must schedule this task such that
> it can complete its work and not miss its deadline.
>
> If we then augment the regular EDF rules by, for local tasks,
> considering the time to fail and let this measure override the
> regular EDF pick when the time to fail can be overran by the EDF pick.
>
> That is, when:
>
> now + left_b > min(ttf)
Ok, this looks interesting... I have to better understand this rule. If
I am not wrong, this is equivalent to
now + left_budget > min(now + laxity) =>
=> left_budget > min(laxity)
So, if I understand well when the remaining budget of the current task
is larger than the minimum laxity you switch from EDF to LLF (which is
equivalent to local time to fail)? Is this understanding correct?

I _suspect_ that the migration rules must also be changed (when a task
migrates from a runqueue, it currently selects as a target the runqueue
having the largest earliest deadline... Maybe it should also consider
the presence of rq-local tasks - or the LLF-ordered heap - on the
target?)

Did you find this scheduling strategy on some paper? Otherwise, I think
we need to develop some theoretical analysis for it... (which is of
course another interesting thing! :)



Luca

2016-11-10 10:01:56

by Tommaso Cucinotta

Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi,

On 10/11/2016 09:08, Peter Zijlstra wrote:
> Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
> wanting single CPU affinity is better QoS than provided by G-EDF.
>
> Therefore the aim is to provide harder guarantees, similar to UP, for single
> CPU affine tasks. This then leads to a mixed criticality scheduling
> requirement for the CPU scheduler. G-EDF like for the non-affine (global)
> tasks and UP like for the single CPU tasks.
>
>
>
> ADMISSION CONTROL
>
> Do simple UP admission control on the CPU local tasks, and subtract the
> admitted bandwidth from the global total when doing global admission control.
>
> single cpu: U[n] := \Sum tl_u,n <= 1
> global: \Sum tg_u <= N - \Sum U[n]

+1, even with the current G-EDF: in the kernel we need a minimal, permissive admission control that is simple enough to just reject ill-formed workloads that would pile up forever with no hope of recovering (as opposed to an AC that runs a complex test and doesn't allow you to deploy a task unless it's absolutely guaranteed to complete its runtime by its deadline), even though it won't be perfect in terms of hard RT guarantees. The latter would anyway require more complex analysis techniques, also considering the frequency of interrupts and the like, and can be done in user space by proper middleware, libraries, or even at design time for static embedded systems where everything is known upfront and doesn't change often.

That said, it's good if in addition the mechanism behaves well from an analysis viewpoint (and we have a tardiness bound); the only problem is that there are a zillion proposals in research (see upcoming reply to Luca's).

Just a note: if you want to recover arbitrary task affinities, you can re-cast your above test like this:

for_each_processor(cpu)
\sum U[t]/A[t] \leq 1 (or U_max), for each task t on cpu, with utilization U[t] and A[t] CPUs overall in its affinity mask

(I'm not claiming we need scenarios with overlapping cpusets and G-EDF tasks, it's just in case it simplifies code)
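
As a hedged sketch of what that per-CPU test could look like in isolation
(the struct, the bitmask affinity and BW_UNIT are invented for illustration;
they are not the kernel's data structures):

#include <stdbool.h>
#include <stdint.h>

#define BW_UNIT		(1ull << 20)	/* 100% of one CPU, as in to_ratio() */

struct demo_task {
	uint64_t util;			/* U[t], fixed-point utilization */
	unsigned long affinity;		/* bitmask of allowed CPUs, A[t] = popcount */
};

static unsigned int mask_weight(unsigned long m)
{
	unsigned int w = 0;

	for (; m; m >>= 1)
		w += m & 1;
	return w;
}

/* \Sum U[t]/A[t] <= 1 over all tasks t that may run on @cpu */
static bool cpu_fits(const struct demo_task *tasks, int nr_tasks, int cpu)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < nr_tasks; i++) {
		if (!(tasks[i].affinity & (1ul << cpu)))
			continue;
		sum += tasks[i].util / mask_weight(tasks[i].affinity);
	}
	return sum <= BW_UNIT;
}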

T.
--
Tommaso Cucinotta, Computer Engineering PhD
Associate Professor at the Real-Time Systems Laboratory (ReTiS)
Scuola Superiore Sant'Anna, Pisa, Italy
http://retis.sssup.it/people/tommaso

2016-11-10 10:59:26

by Peter Zijlstra

Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 10:06:02AM +0100, luca abeni wrote:
> Hi Peter,
>
> On Thu, 10 Nov 2016 09:08:07 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > Add support for single CPU affinity to SCHED_DEADLINE; the supposed
> > reason for wanting single CPU affinity is better QoS than provided by
> > G-EDF.
> This looks very interesting, thanks for sharing!
> I have some "theoretical" comments; I'll look at the code later.
>
> > Therefore the aim is to provide harder guarantees, similar to UP, for
> > single CPU affine tasks. This then leads to a mixed criticality
> > scheduling requirement for the CPU scheduler. G-EDF like for the
> > non-affine (global) tasks and UP like for the single CPU tasks.
> The goal for single CPU affine tasks is clear; which kind of guarantees
> do you want to provide for the other (global) tasks? The tardiness
> guarantees?

I wanted to provide as close to regular G-EDF guarantees as possible. So
yes, the bounded tardiness, but also try and meet deadlines when
possible.

>
> [...]
> > MIXED CRITICALITY SCHEDULING
> >
> > Since we want to provide better guarantees for single CPU affine
> > tasks than the G-EDF scheduler provides for the single CPU tasks, we
> > need to somehow alter the scheduling algorithm.
> >
> > The trivial layered EDF/G-EDF approach is obviously flawed in that it
> > will result in many unnecessary deadline misses. The trivial example
> > is having a single CPU task with a deadline after a runnable global
> > task. By always running single CPU tasks over global tasks we can
> > make the global task miss its deadline even though we could easily
> > have ran both within the alloted time.

> Ok; the layered approach clearly causes some unneeded deadline misses
> on global tasks... But those tasks risk to miss deadlines anyway (with
> the corrent scheduler, they are guaranteed to have a limited tardiness,
> not to respect all of the deadlines). I think this is related to the
> question about which guarantees are provided to the global tasks.

Right, so I wanted to limit the impact. Basically, given a stronger
admission test for the GEDF scheduler that would guarantee deadlines, I
would like the new scheduling function to still guarantee all deadlines.

That is, I didn't want to let the scheduling function be the weak spot,
only the admission test.

> > Therefore we must use a more complicated scheme. By adding a second
> > measure present in the sporadic task model to the scheduling function
> > we can try and distinguish between the constraints of handling the
> > two cases in a single scheduler.
> >
> > We define the time to fail as:
> >
> > ttf(t) := t_d - t_b; where
> >
> > t_d is t's absolute deadline
> > t_b is t's remaining budget

> Ok; I think this is similar to what is called "pseudo-deadline" in some
> papers.
> If I understand well, it is equal to the current time + the task
> "laxity" (or slack time). So, scheduling the task with the smaller ttf
> is equivalent to the "least laxity first" (LLF) algorithm.
> Giving precedence to tasks with 0 laxity is a technique that is often
> used to improve the schedulability on multi-processor systems.

Right, similar to LLF. Initially I was using the term laxity here, but
Henrik convinced me that this is called time-to-fail. I'll let him
explain :-)

But yes, a pure TTF based scheduler should be equivalent to LLF which
itself is fairly similar to EDF in that it's optimal for UP etc.. (I
think).

> > This is the last possible moment we must schedule this task such that
> > it can complete its work and not miss its deadline.
> >
> > If we then augment the regular EDF rules by, for local tasks,
> > considering the time to fail and let this measure override the
> > regular EDF pick when the time to fail can be overran by the EDF pick.
> >
> > That is, when:
> >
> > now + left_b > min(ttf)

> Ok, this looks interesting... I have to better understand this rule. If
> I am not wrong, this is equivalent to

> now + left_budget > min(now + laxity) =>
> => left_budget > min(laxity)

> So, if I understand well when the remaining budget of the current task
> is larger than the minimum laxity you switch from EDF to LLF (which is
> equivalent to local time to fail)? Is this understanding correct?

I think so, but I don't have the exact laxity definitions at hand, I'll
need to go dig out that paper. I should have it somewhere.

> I _suspect_ that the migration rules must also be changed (when a task
> migrates from a runqueue, it currently selects as a target the runqueue
> having the largest earliest deadline... Maybe it should also consider
> the presence of rq-local tasks - or the LLF-ordered heap - on the
> target?)

Quite possible, I didn't think about that.

> Did you find this scheduling strategy on some paper? Otherwise, I think
> we need to develop some theoretical analysis for it... (which is of
> course another interesting thing! :)

No, I cooked this up myself. There might of course still be a paper on
this, but if so, I'm blissfully unaware.


2016-11-10 11:03:41

by Tommaso Cucinotta

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On 10/11/2016 10:06, luca abeni wrote:
> is equivalent to the "least laxity first" (LLF) algorithm.
> Giving precedence to tasks with 0 laxity is a technique that is often
> used to improve the schedulability on multi-processor systems.

EDZL (EDF / Zero Laxity first), right? AFAICR, there's quite a lot of
analysis on EDZL for multi-cores... eg, Insik Shin et al....

http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6374195

But, before going the EDZL way, isn't it worthwhile to consider
just splitting tasks among 2 cpus

https://people.mpi-sws.org/~bbb/papers/pdf/rtss16b.pdf

? ... we're working at RETIS on simpler ways to make the AC for
these split-task cases (cc-ing Alessandro) that don't need
complex demand-bound analysis...

My2c,

T.
--
Tommaso Cucinotta, Computer Engineering PhD
Associate Professor at the Real-Time Systems Laboratory (ReTiS)
Scuola Superiore Sant'Anna, Pisa, Italy
http://retis.sssup.it/people/tommaso

2016-11-10 12:21:24

by Henrik Austad

Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
>
>
> Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
> wanting single CPU affinity is better QoS than provided by G-EDF.
>
> Therefore the aim is to provide harder guarantees, similar to UP, for single
> CPU affine tasks. This then leads to a mixed criticality scheduling
> requirement for the CPU scheduler. G-EDF like for the non-affine (global)
> tasks and UP like for the single CPU tasks.
>
>
>
> ADMISSION CONTROL
>
> Do simple UP admission control on the CPU local tasks, and subtract the
> admitted bandwidth from the global total when doing global admission control.
>
> single cpu: U[n] := \Sum tl_u,n <= 1
> global: \Sum tg_u <= N - \Sum U[n]
>
>
>
> MIXED CRITICALITY SCHEDULING
>
> Since we want to provide better guarantees for single CPU affine tasks than
> the G-EDF scheduler provides for the single CPU tasks, we need to somehow
> alter the scheduling algorithm.
>
> The trivial layered EDF/G-EDF approach is obviously flawed in that it will
> result in many unnecessary deadline misses. The trivial example is having a
> single CPU task with a deadline after a runnable global task. By always
> running single CPU tasks over global tasks we can make the global task miss
> its deadline even though we could easily have ran both within the alloted
> time.
>
> Therefore we must use a more complicated scheme. By adding a second measure
> present in the sporadic task model to the scheduling function we can try and
> distinguish between the constraints of handling the two cases in a single
> scheduler.
>
> We define the time to fail as:
>
> ttf(t) := t_d - t_b; where
>
> t_d is t's absolute deadline
> t_b is t's remaining budget
>
> This is the last possible moment we must schedule this task such that it can
> complete its work and not miss its deadline.

To elaborate a bit on this (this is a modified LLF approach if my memory
serves):

You have the dynamic time-to-failure (TtF), i.e. as the task progresses
(scheduled to run), the relative time-to-failure will remain constant. This
can be used to compare tasks to a running task and should minimize the
number of calculations required.

Then you have the static Time-of-failure (ToF), which is the absolute time
when a task will no longer be able to meet its deadline. This is what you
use for keeping a sorted list of tasks in the runqueue. As this is a fixed
point in time, you do not have to dynamically update or do crazy
calculations when inserting/removing threads from the rq.
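
(To make the distinction concrete with made-up numbers: a task with its
absolute deadline at t = 100ms and 30ms of budget left has ToF = 70ms. While
it waits in the runqueue both quantities are frozen, so ToF stays at 70ms and
a structure sorted on it never needs re-keying; once the task runs, its budget
drains at the same rate the clock advances, so it is the relative TtF, i.e.
ToF - now, that stays constant instead.)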

> If we then augment the regular EDF rules by, for local tasks, considering the
> time to fail and let this measure override the regular EDF pick when the
> time to fail can be overran by the EDF pick.

Then, if you do this - do you need to restrict this to a local CPU? I
*think* that if you use ToF/TtF for all deadline tasks in a global scheduler,
you should be able to meet deadlines.

I had a rant about this way back [1,2 Sec 11.4], I need to sit down and
re-read most of it, it has been a few too many years, but the idea was to
minimize the number of context-switches (which LLF is prone to get a lot
of) as well as minimize the computational overhead by avoiding
re-calculating time-of-failure/time-to-failure a lot.

> That is, when:
>
> now + left_b > min(ttf)

Why not just use ttf/tof for all deadline-tasks? We have all the
information available anyway and it would probably make the internal logic
easier?

> Use augmented RB-tree to store absolute deadlines of all rq local tasks and
> keep the heap sorted on the earliest time to fail of a locally affine task.
>
> TODO
>
> - finish patch, this only sketches the outlines
> - formal analysis of the proposed scheduling function; this is only a hunch.

I think you are on the right track, but then again, you agree with some of
the stuff I messed around with a while ago, so no wonder I think you're
right :)

1) https://lkml.org/lkml/2009/7/10/380
2) https://brage.bibsys.no/xmlui/bitstream/handle/11250/259744/347756_FULLTEXT01.pdf

-Henrik

> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 75 ++++++++++++++++++-------
> kernel/sched/deadline.c | 142 ++++++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h | 12 ++--
> 4 files changed, 191 insertions(+), 39 deletions(-)
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3762fe4e3a80..32f948615d4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,7 @@ struct sched_rt_entity {
>
> struct sched_dl_entity {
> struct rb_node rb_node;
> + u64 __subtree_ttf;

Didn't you say augmented rb-tree?

> /*
> * Original scheduling parameters. Copied here from sched_attr
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee18baa603a..46995c060a89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2469,6 +2469,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
> }
>
> #ifdef CONFIG_SMP
> +DEFINE_PER_CPU(u64, dl_cpu_bw);
> +
> inline struct dl_bw *dl_bw_of(int i)
> {
> RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> @@ -2476,17 +2478,39 @@ inline struct dl_bw *dl_bw_of(int i)
> return &cpu_rq(i)->rd->dl_bw;
> }
>
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
> + int new_cpu, u64 new_bw)
> {
> - struct root_domain *rd = cpu_rq(i)->rd;
> - int cpus = 0;
> + struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
> + u64 total, avail, used;
> + int i;
>
> - RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> - "sched RCU must be held");
> - for_each_cpu_and(i, rd->span, cpu_active_mask)
> - cpus++;
> + if (dl_b->bw == -1) /* admission control disabled */
> + return false;
> +
> + avail = 0;
> + for_each_cpu_and(i, rd->span, cpu_active_mask) {
> + total = dl_b->bw;
> + used = per_cpu(dl_cpu_bw, i);
> +
> + if (old_cpu == i)
> + used -= old_bw;
> + if (new_cpu == i)
> + used += new_bw;
> +
> + if (used > total)
> + return true;

What happens in this case? Wouldn't it make sense to test to see if the
new_cpu has the capacity to swap tasks _before_ updating the value?

> + avail += total - used;
> + }

So some scenarios:
1) Global task to affiniy of a single core:
The total bandwidth is ok, so we only have to check if the local CPU can
accommodate the task. Then we must decrement the global dl_bw pool.

2) Local task to global: we only need to check if the global pool has room
for the task, then we update it and decrease the used bw for the local CPU

3) Local to local (move cpu). We need only check used for the new CPU, and
if that is ok, the move is straightforward.

I am probably missing out on a whole lot of details here, but I cannot help
but feel that this is a bit too complex for the task at hand.

> + total = dl_b->total_bw;
> + if (old_cpu == -1)
> + total -= old_bw;
> + if (new_cpu == -1)
> + total += new_bw;

Why not keep the globally used and available bw in the root-domain? That
would avoid iterating over all the active CPUs every time you do a
bw-change.

CPU hotplug would probably be interesting though...

> - return cpus;
> + return avail < total;
> }
> #else
> inline struct dl_bw *dl_bw_of(int i)
> @@ -2494,12 +2518,21 @@ inline struct dl_bw *dl_bw_of(int i)
> return &cpu_rq(i)->dl.dl_bw;
> }
>
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
> + int new_cpu, u64 new_bw)
> {
> - return 1;
> + u64 avail;
> +
> + if (dl_b->bw == -1) /* admission control disabled */
> + return false;
> +
> + avail = dl_b->bw;
> +
> + return avail < dl_b->total_bw - old_bw + new_bw;
> }
> #endif
>
> +
> /*
> * We must be sure that accepting a new task (or allowing changing the
> * parameters of an existing one) is consistent with the bandwidth
> @@ -2519,7 +2552,7 @@ static int dl_overflow(struct task_struct *p, int policy,
> u64 period = attr->sched_period ?: attr->sched_deadline;
> u64 runtime = attr->sched_runtime;
> u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> - int cpus, err = -1;
> + int err = -1;
>
> /* !deadline task may carry old deadline bandwidth */
> if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
> @@ -2531,18 +2564,22 @@ static int dl_overflow(struct task_struct *p, int policy,
> * allocated bandwidth of the container.
> */
> raw_spin_lock(&dl_b->lock);
> - cpus = dl_bw_cpus(task_cpu(p));
> + /* new DL task */
> if (dl_policy(policy) && !task_has_dl_policy(p) &&
> - !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> + !__dl_overflow(dl_b, -1, 0, -1, new_bw)) {
> __dl_add(dl_b, new_bw);
> err = 0;
> +
> + /* changed DL task */
> } else if (dl_policy(policy) && task_has_dl_policy(p) &&
> - !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> - __dl_clear(dl_b, p->dl.dl_bw);
> + !__dl_overflow(dl_b, -1, p->dl.dl_bw, -1, new_bw)) {
> + __dl_sub(dl_b, p->dl.dl_bw);
> __dl_add(dl_b, new_bw);
> err = 0;
> +
> + /* leaves DL */
> } else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> - __dl_clear(dl_b, p->dl.dl_bw);
> + __dl_sub(dl_b, p->dl.dl_bw);
> err = 0;
> }
> raw_spin_unlock(&dl_b->lock);
> @@ -5390,8 +5427,7 @@ int task_can_attach(struct task_struct *p,
> rcu_read_lock_sched();
> dl_b = dl_bw_of(dest_cpu);
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - cpus = dl_bw_cpus(dest_cpu);
> - overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
> + overflow = __dl_overflow(dl_b, -1, 0, -1, p->dl.dl_bw);
> if (overflow)
> ret = -EBUSY;
> else {
> @@ -7309,8 +7345,7 @@ static int cpuset_cpu_inactive(unsigned int cpu)
> dl_b = dl_bw_of(cpu);
>
> raw_spin_lock_irqsave(&dl_b->lock, flags);
> - cpus = dl_bw_cpus(cpu);
> - overflow = __dl_overflow(dl_b, cpus, 0, 0);
> + overflow = __dl_overflow(dl_b, -1, 0, -1, 0);
> raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>
> rcu_read_unlock_sched();
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 37e2449186c4..4eb65ea9c100 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -778,6 +778,11 @@ static void update_curr_dl(struct rq *rq)
> }
> }
>
> +static inline struct sched_dl_entity *dl_of(struct rb_node *node)
> +{
> + return rb_entry(node, struct sched_dl_entity, rb_node);
> +}
> +
> #ifdef CONFIG_SMP
>
> static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -805,9 +810,8 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> cpudl_clear(&rq->rd->cpudl, rq->cpu);
> } else {
> struct rb_node *leftmost = dl_rq->rb_leftmost;
> - struct sched_dl_entity *entry;
> + struct sched_dl_entity *entry = dl_of(leftmost);
>
> - entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);

split this out perhaps as a general cleanup?

> dl_rq->earliest_dl.curr = entry->deadline;
> cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline);
> }
> @@ -848,6 +852,69 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> dec_dl_migration(dl_se, dl_rq);
> }
>
> +static inline u64 entity_ttf(struct sched_dl_entity *dl_se)
> +{
> + u64 ttf = U64_MAX;
> +
> + if (dl_se->flags & DL_AFFINE)
> + ttf = dl_se->deadline - dl_se->runtime;
> +
> + return ttf;
> +}
> +
> +static inline u64
> +compute_subtree_ttf(struct sched_dl_entry *dl_se)
> +{
> + u64 subtree_ttd, ttf = entity_ttf(dl_se);
> +
> + if (dl_se->rb_node.rb_left) {
> + subtree_ttf = dl_of(dl_se->rb_node.rb_left)->__subtree_ttf;
> + ttf = min(ttf, subtree_ttf);
> + }
> + if (dl_se->rb_node.rb_right) {
> + subtree_ttf = dl_of(dl_se->rb_node.rb_right)->__subtree_ttf;
> + ttf = min(ttf, subtree_ttf);
> + }
> +
> + return ttf;
> +}
> +
> +static void ttf_propagate(struct rb_node *rb, struct rb_node *stop)
> +{
> + while (rb != stop) {
> + struct sched_dl_entity *dl_se = dl_of(rb);
> + u64 subtree_ttf = compute_subtree_ttf(dl_se);
> +
> + if (dl_se->__subtree_ttf == subtree_ttf)
> + break;
> +
> + rb = rb_parent(rb);
> + }
> +}
> +
> +static void ttf_copy(struct rb_node *rb_old, struct rb_node *rb_new)
> +{
> + struct sched_dl_entity *dl_old = dl_of(rb_old);
> + struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> + new->__subtree_ttf = old->__subtree_ttf;
> +}
> +
> +static void ttf_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
> +{
> + struct sched_dl_entity *dl_old = dl_of(rb_old);
> + struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> + new->__subtree_ttf = old->__subtree_ttf;
> + old->__subtree_ttf = compute_subtree_ttf(old);
> +}
> +
> +static const struct rb_augment_callbacks ttf_callbacks = {
> + .propagate = ttf_propagate,
> + .copy = ttf_copy,
> + .rotate = ttf_rotate,
> +};
> +
> static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> {
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -855,15 +922,16 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> struct rb_node *parent = NULL;
> struct sched_dl_entity *entry;
> int leftmost = 1;
> + u64 ttf;
>
> BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
>
> while (*link) {
> parent = *link;
> - entry = rb_entry(parent, struct sched_dl_entity, rb_node);
> - if (dl_time_before(dl_se->deadline, entry->deadline))
> + entry = dl_of(parent);
> + if (dl_time_before(dl_se->deadline, entry->deadline)) {
> link = &parent->rb_left;
> - else {
> + } else {
> link = &parent->rb_right;
> leftmost = 0;
> }
> @@ -872,8 +940,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
> if (leftmost)
> dl_rq->rb_leftmost = &dl_se->rb_node;
>
> + dl_se->__subtree_ttf = ttf = entity_ttf(dl_se);
> rb_link_node(&dl_se->rb_node, parent, link);
> - rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> + rb_insert_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
> +
> + if (dl_of(dl_rq->rb_root.rb_node)->__subtree_ttf == ttf)
> + dl_rb->rb_least_ttf = dl_rq->rb_root.rb_node;
>
> inc_dl_tasks(dl_se, dl_rq);
> }
> @@ -892,9 +964,41 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
> dl_rq->rb_leftmost = next_node;
> }
>
> - rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> + rb_erase_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
> RB_CLEAR_NODE(&dl_se->rb_node);
>
> + if (dl_rq->rb_least_ttf == &dl_se->rb_node) {
> + struct rb_node **link = &dl_rq->rb_root.rb_node;
> +
> + while (*link) {
> + struct rb_node *node = *link;
> + struct sched_dl_entity *entry = dl_of(node);
> + u64 subtree_ttf = entry->__subtree_ttf;
> +
> + if (subtree_ttf = U64_MAX) {
> + dl_rq->rb_least_ttf = NULL;
> + goto no_least_ttf;
> + }
> +
> + if (node->rb_left &&
> + dl_of(node->rb_left)->__subtree_ttf == subtree_ttf) {
> + link = &node->rb_left;
> + continue;
> + }
> +
> + if (node->rb_right &&
> + dl_of(node->rb_right)->__subtree_ttf == subtree_ttf) {
> + link = &node->rb_right;
> + continue;
> + }
> +
> + break;
> + }
> +
> + dl_rq->rb_least_ttf = *link;
> +no_least_ttf:
> + }
> +
> dec_dl_tasks(dl_se, dl_rq);
> }
>
> @@ -1109,12 +1213,28 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
> static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
> struct dl_rq *dl_rq)
> {
> - struct rb_node *left = dl_rq->rb_leftmost;
> + struct rb_node *rb_left = dl_rq->rb_leftmost;
> + struct sched_dl_entity *left, *least;
> + u64 now;
>
> - if (!left)
> + if (!rb_left)
> return NULL;
>
> - return rb_entry(left, struct sched_dl_entity, rb_node);
> + left = dl_of(rb_left);
> +
> + /*
> + * Only local tasks have TTF set, if one is present and it would miss
> + * its deadline because of the (G)EDF task selection, override that
> + * and select the local task.
> + */
> + if (dl_rq->rb_least_ttf) {
> + least = dl_of(dl_rq->rb_least_ttf);
> + now = rq_clock_task(rq);
> + if (now + left->runtime > entity_ttf(least))
> + return least;
> + }

What happens if TtF for the currently enqueued task is 0 or very small?

(this is the problem with unmodified LLF, as you risk getting a storm of
sched_switches when 2 or a few threads have very small relative
time-to-failure, and the reason why the approach I mentioned up at the
beginning makes this a bit easier to reason about).

You really, really want to minimize the number of context-switches

In general, once a task's relative time to failure is smaller than the
smallest TtF of all the running tasks, you should switch it in with the
task with the largest TtF. Since you are doing a partitioned approach, this
global approach won't work so well, so to avoid the storm, you need to set
a threshold and once you've swapped, you should let that task run until
completion and let the other task fail before swapping.
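
One possible reading of that suggestion, as a purely illustrative sketch (the
names and the state handling are invented; this is not what the RFD patch
does):

#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to keep preferring the least-TTF (local) task. Once we
 * have overridden EDF for it, stick with it until its budget is gone
 * instead of re-evaluating on every pick, to avoid flip-flopping between
 * two tasks whose margins are both tiny.
 */
static bool use_least_ttf(bool *overriding, uint64_t now, uint64_t edf_left,
			  uint64_t least_ttf, uint64_t least_left)
{
	if (*overriding) {
		if (least_left > 0)		/* run it to completion */
			return true;
		*overriding = false;		/* budget exhausted, re-arm */
	}

	if (now + edf_left > least_ttf) {	/* the override rule from the RFD */
		*overriding = true;
		return true;
	}

	return false;
}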

> + return left;
> }
>
> struct task_struct *
> @@ -1645,7 +1765,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
> * until we complete the update.
> */
> raw_spin_lock(&src_dl_b->lock);
> - __dl_clear(src_dl_b, p->dl.dl_bw);
> + __dl_sub(src_dl_b, p->dl.dl_bw);
> raw_spin_unlock(&src_dl_b->lock);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 055f935d4421..55059d2568bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -198,13 +198,15 @@ static inline int dl_bandwidth_enabled(void)
>
> extern struct dl_bw *dl_bw_of(int i);
>
> +extern DECLARE_PER_CPU(u64, dl_cpu_bw);
> +
> struct dl_bw {
> raw_spinlock_t lock;
> u64 bw, total_bw;
> };
>
> static inline
> -void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
> +void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
> {
> dl_b->total_bw -= tsk_bw;
> }
> @@ -215,13 +217,6 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
> dl_b->total_bw += tsk_bw;
> }
>
> -static inline
> -bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
> -{
> - return dl_b->bw != -1 &&
> - dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> -}
> -
> extern struct mutex sched_domains_mutex;
>
> #ifdef CONFIG_CGROUP_SCHED
> @@ -507,6 +502,7 @@ struct dl_rq {
> /* runqueue is an rbtree, ordered by deadline */
> struct rb_root rb_root;
> struct rb_node *rb_leftmost;
> + struct rb_node *rb_least_ttf;
>
> unsigned long dl_nr_running;

Some nitpicks, and probably a lot of misunderstandings from my side, but all
in all, I'm enthusiastic about the approach of using time spent on the CPU,
remaining time and laxity to enhance the deadline scheduler.

-Henrik



2016-11-10 12:27:14

by Luca Abeni

Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi Peter,

On Thu, 10 Nov 2016 11:59:18 +0100
Peter Zijlstra <[email protected]> wrote:

[...]
> > > MIXED CRITICALITY SCHEDULING
> > >
> > > Since we want to provide better guarantees for single CPU affine
> > > tasks than the G-EDF scheduler provides for the single CPU tasks,
> > > we need to somehow alter the scheduling algorithm.
> > >
> > > The trivial layered EDF/G-EDF approach is obviously flawed in
> > > that it will result in many unnecessary deadline misses. The
> > > trivial example is having a single CPU task with a deadline after
> > > a runnable global task. By always running single CPU tasks over
> > > global tasks we can make the global task miss its deadline even
> > > though we could easily have ran both within the alloted time.
>
> > Ok; the layered approach clearly causes some unneeded deadline
> > misses on global tasks... But those tasks risk missing deadlines
> > anyway (with the current scheduler, they are guaranteed to have a
> > limited tardiness, not to respect all of the deadlines). I think
> > this is related to the question about which guarantees are provided
> > to the global tasks.
>
> Right, so I wanted to limit the impact. Basically, given a stronger
> admission test for the GEDF scheduler that would guarantee deadlines,
> I would like the new scheduling function to still guarantee all
> deadlines.
Ok; this is interesting... I do not know if it is possible, but it is
definitely something interesting to look at.


> > > Therefore we must use a more complicated scheme. By adding a
> > > second measure present in the sporadic task model to the
> > > scheduling function we can try and distinguish between the
> > > constraints of handling the two cases in a single scheduler.
> > >
> > > We define the time to fail as:
> > >
> > > ttf(t) := t_d - t_b; where
> > >
> > > t_d is t's absolute deadline
> > > t_b is t's remaining budget
>
> > Ok; I think this is similar to what is called "pseudo-deadline" in
> > some papers.
> > If I understand well, it is equal to the current time + the task
> > "laxity" (or slack time). So, scheduling the task with the smaller
> > ttf is equivalent to the "least laxity first" (LLF) algorithm.
> > Giving precedence to tasks with 0 laxity is a technique that is
> > often used to improve the schedulability on multi-processor
> > systems.
>
> Right, similar to LLF. Initially I was using the term laxity here, but
> Henrik convinced me that this is called time-to-fail. I'll let him
> explain :-)
Well, if I understand well "time-to-fail" is equal to "laxity + current
time"... So, they are two different quantities but the final scheduling
algorithm is the same (and using ttf simplifies the implementation a
lot :).


> But yes, a pure TTF-based scheduler should be equivalent to LLF, which
> itself is fairly similar to EDF in that it's optimal on UP etc. (I
> think).
Right

> > I _suspect_ that the migration rules must also be changed (when a
> > task migrates from a runqueue, it currently selects as a target the
> > runqueue having the largest earliest deadline... Maybe it should
> > also consider the presence of rq-local tasks - or the LLF-ordered
> > heap - on the target?)
>
> Quite possible, I didn't think about that.
>
> > Did you find this scheduling strategy on some paper? Otherwise, I
> > think we need to develop some theoretical analysis for it... (which
> > is of course another interesting thing! :)
>
> No, I cooked this up myself. There might of course still be a paper on
> this, but if so, I'm blissfully unaware.
Ok; I'll try to have a look at the theoretical analysis.



Thanks,
Luca

2016-11-10 12:38:51

by Luca Abeni

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi Henrik,

On Thu, 10 Nov 2016 13:21:00 +0100
Henrik Austad <[email protected]> wrote:
> On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
[...]
> > We define the time to fail as:
> >
> > ttf(t) := t_d - t_b; where
> >
> > t_d is t's absolute deadline
> > t_b is t's remaining budget
> >
> > This is the last possible moment we must schedule this task such
> > that it can complete its work and not miss its deadline.
>
> To elaborate a bit on this (this is a modified LLF approach if my
> memory serves):
>
> You have the dynamic time-to-failure (TtF), i.e. as the task
> progresses (scheduled to run), the relative time-to-failure will
> remain constant. This can be used to compare thasks to a running task
> and should minimize the number of calculations required.
>
> Then you have the static Time-of-failure (ToF), which is the absolute
> time when a task will no longer be able to meet its deadline. This is
> what you use for keeping a sorted list of tasks in the runqueue. As
> this is a fixed point in time, you do not have to dynamically update
> or do crazy calculation when inserting/removing threads from the rq.

Sorry, I am missing something here: if ttf is defined as
ttf_i = d_i - q_i
> (where d_i is the deadline of task i and q_i is its remaining budget),
then it also is the time before which you have to schedule task i if
you do not want to miss the deadline... No? So, I do not understand the
difference with tof.


> > If we then augment the regular EDF rules by, for local tasks,
> > considering the time to fail and let this measure override the
> > regular EDF pick when the time to fail can be overran by the EDF
> > pick.
>
> Then, if you do this - do you need to constrict this to a local CPU?
> I *think* you could do this in a global scheduler if you use ToF/TtF
> for all deadline-tasks, I think you should be able to meet deadlines.
I think the ToF/TtF scheduler will be equivalent to LLF (see previous
emails)... Or am I misunderstanding something? (see above)
And LLF is not optimal on multiple CPUs, so I do not think it will be
able to meet deadlines if you use it as a global scheduler.

>
> I had a rant about this way back [1,2 Sec 11.4], I need to sit down
> and re-read most of it, it has been a few too many years, but the
> idea was to minimize the number of context-switches (which LLF is
> prone to get a lot of) as well as minimize the computational overhead
> by avoiding re-calculating time-of-failre/time-to-failre a lot.
>
> > That is, when:
> >
> > now + left_b > min(ttf)
>
> Why not just use ttf/tof for all deadline-tasks? We have all the
> information available anyway and it would probably make the internal
> logic easier?
I think LLF causes more preemptions and migrations than EDF.


Luca

2016-11-10 12:56:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 01:21:00PM +0100, Henrik Austad wrote:
> On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:

> > We define the time to fail as:
> >
> > ttf(t) := t_d - t_b; where
> >
> > t_d is t's absolute deadline
> > t_b is t's remaining budget
> >
> > This is the last possible moment we must schedule this task such that it can
> > complete its work and not miss its deadline.
>
> To elaborate a bit on this (this is a modified LLF approach if my memory
> serves):
>
> You have the dynamic time-to-failure (TtF), i.e. as the task progresses
> (scheduled to run), the relative time-to-failure will remain constant. This
> can be used to compare tasks to a running task and should minimize the
> number of calculations required.
>
> Then you have the static Time-of-failure (ToF), which is the absolute time
> when a task will no longer be able to meet its deadline. This is what you
> use for keeping a sorted list of tasks in the runqueue. As this is a fixed
> point in time, you do not have to dynamically update or do crazy
> calculation when inserting/removing threads from the rq.
>
> > If we then augment the regular EDF rules by, for local tasks, considering the
> > time to fail and let this measure override the regular EDF pick when the
> > time to fail can be overran by the EDF pick.
>
> Then, if you do this - do you need to constrict this to a local CPU? I
> *think* you could do this in a global scheduler if you use ToF/TtF for all
> deadline-tasks, I think you should be able to meet deadlines.
>
> I had a rant about this way back [1,2 Sec 11.4], I need to sit down and
> re-read most of it, it has been a few too many years, but the idea was to
> minimize the number of context-switches (which LLF is prone to get a lot
> of) as well as minimize the computational overhead by avoiding
> re-calculating time-of-failure/time-to-failure a lot.
>
> > That is, when:
> >
> > now + left_b > min(ttf)
>
> Why not just use ttf/tof for all deadline-tasks? We have all the
> information available anyway and it would probably make the internal logic
> easier?

So the point is that the total task set isn't able to meet its deadlines
per se. Therefore, no matter which scheduling function you pick, be it
EDF, LLF, TTF or EDZL, you'll miss deadlines.

Consider the trivial example of 2 CPUs and 3 tasks with u=2/3. That's
just not going to work (with static job-priority schedulers, so let's not
do Pfair etc. ;-).
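
To make that concrete (numbers chosen purely for illustration): take three
tasks with runtime 2 and period = relative deadline = 3, all released at
t=0. Total utilization is 3 * 2/3 = 2 = N, yet under any fixed job-priority
order two of the jobs occupy both CPUs over [0,2], leaving the third at
most one unit of service in [2,3] when it still needs two. Only a scheme
that splits jobs across CPUs (Pfair-like) fits all three.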

The point here is to allow the local tasks to operate UP-like and
therefore get UP-like guarantees. We know all these scheduling functions
(EDF, LLF, TTF, EDZL) are optimal on UP.

At the same time, we'd like the global task set to not unduly suffer,
that is, if the total task set is schedulable under G-EDF, then we'd
like to actually achieve this.

So we have to distinguish between the local and global tasks from
principle, as we have different criticality requirements for them. We
want to always meet the local deadlines, but the global tasks are
subject to tardiness depending on the AC function.

Now, mixed criticality in general is 'proven' NP hard. But since the
sporadic task model has at least 2 variables and we need to distinguish
between 2 levels only, I feel this should be tractable.

And this is where and why I introduced TTF; it's a second measure, next
to the existing earliest deadline, to differentiate the criticality
levels. This then also means we should only use TTF for local tasks,
otherwise it cannot differentiate.

Furthermore, as you mentioned, we don't immediately use the 0-laxity
point as per EDZL (I've only read the link Tommaso provided; couldn't get
the article because of the paywall), since 0-laxity is a very fragile
point: _any_ system disturbance after it means a miss, and detecting it
requires a timer. So I slightly relaxed that to the last natural
scheduling point before it, which is where the:

now + leftmost_budget > min(ttf)

constraint comes from.
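
As a worked example (numbers invented): at now = 100ms the EDF-leftmost
global task still has 5ms of budget, while a local task has an absolute
deadline of 107ms and 3ms of budget left, i.e. ttf = 104ms. Since
100 + 5 = 105 > 104, letting the global task drain its budget would push
the local task past its last possible start, so the override picks the
local task. With only 4ms of global budget left, 100 + 4 = 104 would not
exceed the ttf and the plain EDF pick would stand (the local task can
still start at 104 and finish exactly at its deadline).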

2016-11-10 12:57:06

by Henrik Austad

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 01:38:40PM +0100, luca abeni wrote:
> Hi Henrik,

Hi Luca,

> On Thu, 10 Nov 2016 13:21:00 +0100
> Henrik Austad <[email protected]> wrote:
> > On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> [...]
> > > We define the time to fail as:
> > >
> > > ttf(t) := t_d - t_b; where
> > >
> > > t_d is t's absolute deadline
> > > t_b is t's remaining budget
> > >
> > > This is the last possible moment we must schedule this task such
> > > that it can complete its work and not miss its deadline.
> >
> > To elaborate a bit on this (this is a modified LLF approach if my
> > memory serves):
> >
> > You have the dynamic time-to-failure (TtF), i.e. as the task
> > progresses (scheduled to run), the relative time-to-failure will
> > remain constant. This can be used to compare tasks to a running task
> > and should minimize the number of calculations required.
> >
> > Then you have the static Time-of-failure (ToF), which is the absolute
> > time when a task will no longer be able to meet its deadline. This is
> > what you use for keeping a sorted list of tasks in the runqueue. As
> > this is a fixed point in time, you do not have to dynamically update
> > or do crazy calculation when inserting/removing threads from the rq.
>
> Sorry, I am missing something here: if ttf is defined as
> ttf_i = d_i - q_i

So I picked the naming somewhat independently of Peter, his approach is
the _absolute_ time of failure, the actual time X, irrespective of the task
running or not.

I added 2 different measures for the same thing:

* ToF:
The absolute time of failure is the point in time when the task will no
longer be able to meet its deadline. If a task is scheduled and is running
on a CPU, this value will move forward at the speed of execution. i.e. when
the task is running, this value is changing. When the task is waiting in
the runqueue, this value is constant.

TtF:
The relative time to failure is the value that is tied to the local CPU so
to speak. When a task is running, this value is constant as it is the
remaining time until the task is no longer able to meet its deadline. When
the task is enqueued, this value will steadily decrease as it draws closer
to the time when it will fail.

So when a task is running on a CPU you use TtF; when it is in the runqueue
you compare ToF.

> (where d_i is the deadline of task i and q_i is its remaining budget),
> then it also is the time before which you have to schedule task i if
> you do not want to miss the deadline... No? So, I do not understand the
> difference with tof.

So you can calculate one from the other given the absolute deadline and
remaining budget (or consumed CPU time). But it is handy to use both, as it
removes a lot of duplication and, once you get the hang of the terms, makes
it a bit easier to reason about the system.
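
A tiny standalone sketch of the two measures as described here (made-up
names; note that Peter's ttf() in the RFD, and entity_ttf() in the patch,
correspond to ToF in this terminology):

	#include <stdint.h>

	struct dl_sketch_entity {
		uint64_t deadline;	/* absolute deadline d_i, ns */
		uint64_t runtime;	/* remaining budget q_i, ns */
	};

	/*
	 * Absolute time of failure (ToF): the last instant at which the task
	 * can still start and finish its remaining budget by the deadline.
	 * Constant while the task waits in the runqueue; moves forward at
	 * execution speed while it runs (runtime shrinks as time advances).
	 */
	static uint64_t task_tof(const struct dl_sketch_entity *t)
	{
		return t->deadline - t->runtime;
	}

	/*
	 * Relative time to failure (TtF): how long from 'now' until the task
	 * can no longer make its deadline, i.e. its laxity/slack at 'now'.
	 * Constant while the task runs; shrinks while it waits.
	 */
	static uint64_t task_ttf(const struct dl_sketch_entity *t, uint64_t now)
	{
		return task_tof(t) - now;
	}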

> > > If we then augment the regular EDF rules by, for local tasks,
> > > considering the time to fail and let this measure override the
> > > regular EDF pick when the time to fail can be overran by the EDF
> > > pick.
> >
> > Then, if you do this - do you need to constrict this to a local CPU?
> > I *think* you could do this in a global scheduler if you use ToF/TtF
> > for all deadline-tasks, I think you should be able to meet deadlines.
> I think the ToF/TtF scheduler will be equivalent to LLF (see previous
> emails)... Or am I misunderstanding something? (see above)
> And LLF is not optimal on multiple CPUs, so I do not think it will be
> able to meet deadlines if you use it as a global scheduler.

I think I called it Earliest Failure First (I really wanted to call it
failure-driven scheduling but that implied a crappy scheduler ;)

LLF is prone to high task-switch count when multiple threads get close to
0 laxity. But as I said, it's been a while since I last worked through the
theory, so I have some homework to do before arguing too hard about this.

> > I had a rant about this way back [1,2 Sec 11.4], I need to sit down
> > and re-read most of it, it has been a few too many years, but the
> > idea was to minimize the number of context-switches (which LLF is
> > prone to get a lot of) as well as minimize the computational overhead
> > by avoiding re-calculating time-of-failure/time-to-failure a lot.
> >
> > > That is, when:
> > >
> > > now + left_b > min(ttf)
> >
> > Why not just use ttf/tof for all deadline-tasks? We have all the
> > information available anyway and it would probably make the internal
> > logic easier?
> I think LLF causes more preemptions and migrations than EDF.

yes, it does, which is why you need to adjust LLF to minimize the number of
task-switches.

-Henrik



2016-11-10 14:34:23

by Luca Abeni

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, 10 Nov 2016 12:03:47 +0100
Tommaso Cucinotta <[email protected]> wrote:

> On 10/11/2016 10:06, luca abeni wrote:
> > is equivalent to the "least laxity first" (LLF) algorithm.
> > Giving precedence to tasks with 0 laxity is a technique that is
> > often used to improve the schedulability on multi-processor
> > systems.
>
> EDZL (EDF / Zero Laxity first), right?
Yes, basically all the "ZL" algorithms (EDZL, but I think I've also
seen something like RMZL or similar).

> AFAICR, there's quite a lot of
> analysis on EDZL for multi-cores... eg, Insik Shin et al....
>
> http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6374195
Yes, this is why I mentioned the 0-laxity thing... Of course, here the
situation is different (there are tasks that can be migrated, and tasks
that cannot), but maybe the 0-laxity analysis can be adapted to this
case?


> But, before going the EDZL way, isn't it worthwhile to consider
> just splitting tasks among 2 cpus
>
> https://people.mpi-sws.org/~bbb/papers/pdf/rtss16b.pdf
Yes, there are many possible different strategies that can be tested (I
think somewhere I saw some semi-partitioned algorithm that was even
optimal). I suspect everything depends on the trade-off between
implementation complexity and scheduling efficiency.



Luca


>
> ? ... we're working at RETIS on simpler ways to make the AC for
> these split tasks cases (cc-ing Alessandro) that doesn't need
> demand-bound complex analysis...
>
> My2c,
>
> T.

2016-11-10 14:36:06

by Luca Abeni

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi Henrik,

On Thu, 10 Nov 2016 13:56:35 +0100
Henrik Austad <[email protected]> wrote:

> On Thu, Nov 10, 2016 at 01:38:40PM +0100, luca abeni wrote:
> > Hi Henrik,
>
> Hi Luca,
>
> > On Thu, 10 Nov 2016 13:21:00 +0100
> > Henrik Austad <[email protected]> wrote:
> > > On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> > [...]
> > > > We define the time to fail as:
> > > >
> > > > ttf(t) := t_d - t_b; where
> > > >
> > > > t_d is t's absolute deadline
> > > > t_b is t's remaining budget
> > > >
> > > > This is the last possible moment we must schedule this task such
> > > > that it can complete its work and not miss its deadline.
> > >
> > > To elaborate a bit on this (this is a modified LLF approach if my
> > > memory serves):
> > >
> > > You have the dynamic time-to-failure (TtF), i.e. as the task
> > > progresses (scheduled to run), the relative time-to-failure will
> > > remain constant. This can be used to compare tasks to a running
> > > task and should minimize the number of calculations required.
> > >
> > > Then you have the static Time-of-failure (ToF), which is the
> > > absolute time when a task will no longer be able to meet its
> > > deadline. This is what you use for keeping a sorted list of tasks
> > > in the runqueue. As this is a fixed point in time, you do not
> > > have to dynamically update or do crazy calculation when
> > > inserting/removing threads from the rq.
> >
> > Sorry, I am missing something here: if ttf is defined as
> > ttf_i = d_i - q_i
>
> So I picked the naming somewhat independently of Peter, his approach
> is the _absolute_ time of failure, the actual time X, irrespective of
> the task running or not.
>
> I added 2 different measures for the same thing:
>
> * ToF:
> The absolute time of failure is the point in time when the task will
> no longer be able to meet its deadline. If a task is scheduled and is
> running on a CPU, this value will move forward at the speed of
> execution. i.e. when the task is running, this value is changing.
> When the task is waiting in the runqueue, this value is constant.
Ah, ok... So, if I understand well, your ToF is Peter's ttf... Right?


> TtF:
> The relative time to failure is the value that is tied to the local
> CPU so to speak. When a task is running, this value is constant as it
> is the remaining time until the task is no longer able to meet its
> deadline. When the task is enqueued, this value will steadily
> decrease as it draws closer to the time when it will fail.
So, if I understand well, TtF = ToF - current time... Right? I think
this is often called "laxity" or "slack time". No?

[...]
> > > > If we then augment the regular EDF rules by, for local tasks,
> > > > considering the time to fail and let this measure override the
> > > > regular EDF pick when the time to fail can be overran by the EDF
> > > > pick.
> > >
> > > Then, if you do this - do you need to constrict this to a local
> > > CPU? I *think* you could do this in a global scheduler if you use
> > > ToF/TtF for all deadline-tasks, I think you should be able to
> > > meet deadlines.
> > I think the ToF/TtF scheduler will be equivalent to LLF (see
> > previous emails)... Or am I misunderstanding something? (see above)
> > And LLF is not optimal on multiple CPUs, so I do not think it will
> > be able to meet deadlines if you use it as a global scheduler.
>
> I think I called it Earliest Failure First (I really wanted to call
> it failure-driven scheduling but that implied a crappy scheduler ;)
Ok, but... How is it different from LLF?
In my understanding (but, again, maybe I am missing something) ToF and
TtF are just a way to implement LLF more efficiently (because, as you
notice, ToF does not change while a task is waiting and TtF does not
change while the task is executing).


Luca



> LLF is prone to high task-switch count when multiple threads get
> close to 0 laxity. But as I said, it's been a while since I last
> worked through the theory, so I have some homework to do before
> arguing too hard about this.
>
> > > I had a rant about this way back [1,2 Sec 11.4], I need to sit
> > > down and re-read most of it, it has been a few too many years,
> > > but the idea was to minimize the number of context-switches
> > > (which LLF is prone to get a lot of) as well as minimize the
> > > computational overhead by avoiding re-calculating
> > > time-of-failure/time-to-failure a lot.
> > > > That is, when:
> > > >
> > > > now + left_b > min(ttf)
> > >
> > > Why not just use ttf/tof for all deadline-tasks? We have all the
> > > information available anyway and it would probably make the
> > > internal logic easier?
> > I think LLF causes more preemptions and migrations than EDF.
>
> yes, it does, which is why you need to adjust LLF to minimize the
> number of task-switches.
>
> -Henrik

2016-12-13 10:21:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 11:01:59AM +0100, Tommaso Cucinotta wrote:
>
> Just a note: if you want to recover arbitrary task affinities, you can re-cast your above test like this:
>
> for_each_processor(cpu)
> \sum U[t]/A[t] \leq 1 (or U_max), for each task t on cpu, with utilization U[t] and A[t] tasks overall in its affinity mask
>

Do I read it correct when I interpret A[t] as the number of CPUs in its
affinity mask?

For A[t] == 1, that reduces to the UP case:

\Sum U[t] \leq 1

and for A[t] == N it reduces to the G-EDF case:

\Sum U[t] \leq N
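
A userspace sketch of that per-CPU test as I read it (all names are
hypothetical and floating point is used only for readability):

	#include <stdbool.h>
	#include <stdint.h>

	struct dl_ac_task {
		double u;		/* utilization U[t] = runtime / period */
		uint64_t affinity;	/* bit i set iff the task may run on CPU i */
	};

	static int mask_weight(uint64_t mask)	/* A[t]: CPUs in the mask */
	{
		int w = 0;

		for (; mask; mask >>= 1)
			w += mask & 1;
		return w;
	}

	/*
	 * For every CPU, sum U[t]/A[t] over the tasks whose affinity mask
	 * contains that CPU and require the sum to stay <= 1.  With A[t] == 1
	 * this degenerates to per-CPU UP admission; with A[t] == nr_cpus it
	 * becomes \Sum U[t] <= N spread evenly over the CPUs.
	 */
	static bool dl_admit_all(const struct dl_ac_task *t, int nr_tasks,
				 int nr_cpus)
	{
		for (int cpu = 0; cpu < nr_cpus && cpu < 64; cpu++) {
			double sum = 0.0;

			for (int i = 0; i < nr_tasks; i++) {
				if (t[i].affinity & (1ULL << cpu))
					sum += t[i].u / mask_weight(t[i].affinity);
			}
			if (sum > 1.0)
				return false;
		}
		return true;
	}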


Also, does recoverable mean bounded tardiness, or is that something
weaker still?

2016-12-15 11:30:47

by Tommaso Cucinotta

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

Hi Peter,

On 13/12/2016 11:21, Peter Zijlstra wrote:
> On Thu, Nov 10, 2016 at 11:01:59AM +0100, Tommaso Cucinotta wrote:
>> Just a note: if you want to recover arbitrary task affinities, you can re-cast your above test like this:
>>
>> for_each_processor(cpu)
>> \sum U[t]/A[t] \leq 1 (or U_max), for each task t on cpu, with utilization U[t] and A[t] tasks overall in its affinity mask
>>
> Do I read it correct when I interpret A[t] as the number of CPUs in its
> affinity mask?

yes, exactly, A[t] is the number of CPUs in the task affinity mask (sorry for my bad write-up)

> Also, does recoverable mean bounded tardiness, or is that something
> weaker still?

nope, nothing exact -- it just meant providing flexible but simple & consistent (ie, towards recovering affinity masks) options from the kernel/scheduler side, leaving more complex & exact tests to user-space, or future add-ons to the kernel.

Thanks,

T.

2016-12-15 12:16:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Dec 15, 2016 at 12:30:43PM +0100, Tommaso Cucinotta wrote:
> Hi Peter,
>
> On 13/12/2016 11:21, Peter Zijlstra wrote:
> >On Thu, Nov 10, 2016 at 11:01:59AM +0100, Tommaso Cucinotta wrote:
> >>Just a note: if you want to recover arbitrary task affinities, you can re-cast your above test like this:
> >>
> >>for_each_processor(cpu)
> >> \sum U[t]/A[t] \leq 1 (or U_max), for each task t on cpu, with utilization U[t] and A[t] tasks overall in its affinity mask
> >>
> >Do I read it correct when I interpret A[t] as the number of CPUs in its
> >affinity mask?
>
> yes, exactly, A[t] is the number of CPUs in the task affinity mask (sorry for my bad write-up)

n/p, I got it ;-)

> >Also, does recoverable mean bounded tardiness, or is that something
> >weaker still?
>
> nope, nothing exact -- it just meant providing flexible but simple &
> consistent (ie, towards recovering affinity masks) options from the
> kernel/scheduler side, leaving more complex & exact tests to
> user-space, or future add-ons to the kernel.

So it would be good to get a more exact answer on what 'recoverable'
means. It cannot mean unbounded tardiness, since that implies runaway
state. It clearly doesn't mean no tardiness, as proven by the G-EDF
special case.

So I was hoping it would mean bounded, but possibly with a worse bound
than regular G-EDF.

In any case, it does provide a way to look at admission control that
might be useful. I'll have to play around with it a bit.