Date: Thu, 10 Nov 2016 10:06:02 +0100
From: luca abeni <luca.abeni@unitn.it>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
        Juri Lelli <juri.lelli@gmail.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Claudio Scordino <claudio@evidence.eu.com>,
        Tommaso Cucinotta <tommaso.cucinotta@sssup.it>,
        Daniel Bistrot de Oliveira <danielbristot@gmail.com>,
        Henrik Austad <henrik@austad.us>, linux-kernel@vger.kernel.org
Subject: Re: [RFD] sched/deadline: Support single CPU affinity
Message-ID: <20161110100602.2781af72@sweethome>
In-Reply-To: <20161110080807.GD11311@worktop.programming.kicks-ass.net>
References: <20161110080807.GD11311@worktop.programming.kicks-ass.net>
Organization: university of trento
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 16568
Lines: 540

Hi Peter,

On Thu, 10 Nov 2016 09:08:07 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Add support for single CPU affinity to SCHED_DEADLINE; the supposed
> reason for wanting single CPU affinity is better QoS than provided by
> G-EDF.
This looks very interesting, thanks for sharing!
I have some "theoretical" comments; I'll look at the code later.

> Therefore the aim is to provide harder guarantees, similar to UP, for
> single CPU affine tasks. This then leads to a mixed criticality
> scheduling requirement for the CPU scheduler. G-EDF like for the
> non-affine (global) tasks and UP like for the single CPU tasks.
The goal for single CPU affine tasks is clear; which kind of guarantees
do you want to provide for the other (global) tasks? The tardiness
guarantees?


[...]
> MIXED CRITICALITY SCHEDULING
> 
> Since we want to provide better guarantees for single CPU affine
> tasks than the G-EDF scheduler provides for the single CPU tasks, we
> need to somehow alter the scheduling algorithm.
> 
> The trivial layered EDF/G-EDF approach is obviously flawed in that it
> will result in many unnecessary deadline misses. The trivial example
> is having a single CPU task with a deadline after a runnable global
> task. By always running single CPU tasks over global tasks we can
> make the global task miss its deadline even though we could easily
> have ran both within the alloted time.
Ok; the layered approach clearly causes some unneeded deadline misses
on global tasks... But those tasks risk to miss deadlines anyway (with
the corrent scheduler, they are guaranteed to have a limited tardiness,
not to respect all of the deadlines). I think this is related to the
question about which guarantees are provided to the global tasks.


> Therefore we must use a more complicated scheme. By adding a second
> measure present in the sporadic task model to the scheduling function
> we can try and distinguish between the constraints of handling the
> two cases in a single scheduler.
> 
> We define the time to fail as:
> 
>   ttf(t) := t_d - t_b; where
> 
> 	t_d is t's absolute deadline
> 	t_b is t's remaining budget
Ok; I think this is similar to what is called "pseudo-deadline" in some
papers.
If I understand well, it is equal to the current time + the task
"laxity" (or slack time). So, scheduling the task with the smaller ttf
is equivalent to the "least laxity first" (LLF) algorithm.
Giving precedence to tasks with 0 laxity is a technique that is often
used to improve the schedulability on multi-processor systems.

> This is the last possible moment we must schedule this task such that
> it can complete its work and not miss its deadline.
> 
> If we then augment the regular EDF rules by, for local tasks,
> considering the time to fail and let this measure override the
> regular EDF pick when the time to fail can be overran by the EDF pick.
> 
> That is, when:
> 
> 	now + left_b > min(ttf)
Ok, this looks interesting... I have to better understand this rule. If
I am not wrong, this is equivalent to
	now + left_budget > min(now + laxity) =>
	=> left_budget > min(laxity)
So, if I understand well when the remaining budget of the current task
is larger than the minimum laxity you switch from EDF to LLF (which is
equivalent to local time to fail)? Is this understanding correct?

I _suspect_ that the migration rules must also be changed (when a task
migrates from a runqueue, it currently selects as a target the runqueue
having the largest earliest deadline... Maybe it should also consider
the presence of rq-local tasks - or the LLF-ordered heap - on the
target?)

Did you find this scheduling strategy on some paper? Otherwise, I think
we need to develop some theoretical analysis for it... (which is of
course another interesting thing! :)


				Luca
> 
> Use augmented RB-tree to store absolute deadlines of all rq local
> tasks and keep the heap sorted on the earliest time to fail of a
> locally affine task.
> 
> 
> 
> TODO
> 
>  - finish patch, this only sketches the outlines
>  - formal analysis of the proposed scheduling function; this is only
> a hunch.
> 
> 
> 
> ---
>  include/linux/sched.h   |   1 +
>  kernel/sched/core.c     |  75 ++++++++++++++++++-------
>  kernel/sched/deadline.c | 142
> ++++++++++++++++++++++++++++++++++++++++++++----
> kernel/sched/sched.h    |  12 ++-- 4 files changed, 191
> insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3762fe4e3a80..32f948615d4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,7 @@ struct sched_rt_entity {
>  
>  struct sched_dl_entity {
>  	struct rb_node	rb_node;
> +	u64 __subtree_ttf;
>  
>  	/*
>  	 * Original scheduling parameters. Copied here from
> sched_attr diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee18baa603a..46995c060a89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2469,6 +2469,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
>  }
>  
>  #ifdef CONFIG_SMP
> +DEFINE_PER_CPU(u64, dl_cpu_bw);
> +
>  inline struct dl_bw *dl_bw_of(int i)
>  {
>  	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> @@ -2476,17 +2478,39 @@ inline struct dl_bw *dl_bw_of(int i)
>  	return &cpu_rq(i)->rd->dl_bw;
>  }
>  
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64
> old_bw,
> +					      int new_cpu, u64
> new_bw) {
> -	struct root_domain *rd = cpu_rq(i)->rd;
> -	int cpus = 0;
> +	struct root_domain *rd = container_of(dl_b, struct
> root_domain, dl_bw);
> +	u64 total, avail, used;
> +	int i;
>  
> -	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> -			 "sched RCU must be held");
> -	for_each_cpu_and(i, rd->span, cpu_active_mask)
> -		cpus++;
> +	if (dl_b->bw == -1) /* admission control disabled */
> +		return false;
> +
> +	avail = 0;
> +	for_each_cpu_and(i, rd->span, cpu_active_mask) {
> +		total = dl_b->bw;
> +		used = per_cpu(dl_cpu_bw, i);
> +
> +		if (old_cpu == i)
> +			used -= old_bw;
> +		if (new_cpu == i)
> +			used += new_bw;
> +
> +		if (used > total)
> +			return true;
> +
> +		avail += total - used;
> +	}
> +
> +	total = dl_b->total_bw;
> +	if (old_cpu == -1)
> +		total -= old_bw;
> +	if (new_cpu == -1)
> +		total += new_bw;
>  
> -	return cpus;
> +	return avail < total;
>  }
>  #else
>  inline struct dl_bw *dl_bw_of(int i)
> @@ -2494,12 +2518,21 @@ inline struct dl_bw *dl_bw_of(int i)
>  	return &cpu_rq(i)->dl.dl_bw;
>  }
>  
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64
> old_bw,
> +					      int new_cpu, u64
> new_bw) {
> -	return 1;
> +	u64 avail;
> +
> +	if (dl_b->bw == -1) /* admission control disabled */
> +		return false;
> +
> +	avail = dl_b->bw;
> +
> +	return avail < dl_b->total_bw - old_bw + new_bw;
>  }
>  #endif
>  
> +
>  /*
>   * We must be sure that accepting a new task (or allowing changing
> the
>   * parameters of an existing one) is consistent with the bandwidth
> @@ -2519,7 +2552,7 @@ static int dl_overflow(struct task_struct *p,
> int policy, u64 period = attr->sched_period ?: attr->sched_deadline;
>  	u64 runtime = attr->sched_runtime;
>  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) :
> 0;
> -	int cpus, err = -1;
> +	int err = -1;
>  
>  	/* !deadline task may carry old deadline bandwidth */
>  	if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
> @@ -2531,18 +2564,22 @@ static int dl_overflow(struct task_struct *p,
> int policy,
>  	 * allocated bandwidth of the container.
>  	 */
>  	raw_spin_lock(&dl_b->lock);
> -	cpus = dl_bw_cpus(task_cpu(p));
> +	/* new DL task */
>  	if (dl_policy(policy) && !task_has_dl_policy(p) &&
> -	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> +	    !__dl_overflow(dl_b, -1, 0, -1, new_bw)) {
>  		__dl_add(dl_b, new_bw);
>  		err = 0;
> +
> +	/* changed DL task */
>  	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
> -		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> -		__dl_clear(dl_b, p->dl.dl_bw);
> +		   !__dl_overflow(dl_b, -1, p->dl.dl_bw, -1,
> new_bw)) {
> +		__dl_sub(dl_b, p->dl.dl_bw);
>  		__dl_add(dl_b, new_bw);
>  		err = 0;
> +
> +	/* leaves DL */
>  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> -		__dl_clear(dl_b, p->dl.dl_bw);
> +		__dl_sub(dl_b, p->dl.dl_bw);
>  		err = 0;
>  	}
>  	raw_spin_unlock(&dl_b->lock);
> @@ -5390,8 +5427,7 @@ int task_can_attach(struct task_struct *p,
>  		rcu_read_lock_sched();
>  		dl_b = dl_bw_of(dest_cpu);
>  		raw_spin_lock_irqsave(&dl_b->lock, flags);
> -		cpus = dl_bw_cpus(dest_cpu);
> -		overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
> +		overflow = __dl_overflow(dl_b, -1, 0, -1,
> p->dl.dl_bw); if (overflow)
>  			ret = -EBUSY;
>  		else {
> @@ -7309,8 +7345,7 @@ static int cpuset_cpu_inactive(unsigned int cpu)
>  		dl_b = dl_bw_of(cpu);
>  
>  		raw_spin_lock_irqsave(&dl_b->lock, flags);
> -		cpus = dl_bw_cpus(cpu);
> -		overflow = __dl_overflow(dl_b, cpus, 0, 0);
> +		overflow = __dl_overflow(dl_b, -1, 0, -1, 0);
>  		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>  
>  		rcu_read_unlock_sched();
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 37e2449186c4..4eb65ea9c100 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -778,6 +778,11 @@ static void update_curr_dl(struct rq *rq)
>  	}
>  }
>  
> +static inline struct sched_dl_entity *dl_of(struct rb_node *node)
> +{
> +	return rb_entry(node, struct sched_dl_entity, rb_node);
> +}
> +
>  #ifdef CONFIG_SMP
>  
>  static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -805,9 +810,8 @@ static void dec_dl_deadline(struct dl_rq *dl_rq,
> u64 deadline) cpudl_clear(&rq->rd->cpudl, rq->cpu);
>  	} else {
>  		struct rb_node *leftmost = dl_rq->rb_leftmost;
> -		struct sched_dl_entity *entry;
> +		struct sched_dl_entity *entry = dl_of(leftmost);
>  
> -		entry = rb_entry(leftmost, struct sched_dl_entity,
> rb_node); dl_rq->earliest_dl.curr = entry->deadline;
>  		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline);
>  	}
> @@ -848,6 +852,69 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se,
> struct dl_rq *dl_rq) dec_dl_migration(dl_se, dl_rq);
>  }
>  
> +static inline u64 entity_ttf(struct sched_dl_entity *dl_se)
> +{
> +	u64 ttf = U64_MAX;
> +
> +	if (dl_se->flags & DL_AFFINE)
> +		ttf = dl_se->deadline - dl_se->runtime;
> +
> +	return ttf;
> +}
> +
> +static inline u64
> +compute_subtree_ttf(struct sched_dl_entry *dl_se)
> +{
> +	u64 subtree_ttd, ttf = entity_ttf(dl_se);
> +
> +	if (dl_se->rb_node.rb_left) {
> +		subtree_ttf =
> dl_of(dl_se->rb_node.rb_left)->__subtree_ttf;
> +		ttf = min(ttf, subtree_ttf);
> +	}
> +	if (dl_se->rb_node.rb_right) {
> +		subtree_ttf =
> dl_of(dl_se->rb_node.rb_right)->__subtree_ttf;
> +		ttf = min(ttf, subtree_ttf);
> +	}
> +
> +	return ttf;
> +}
> +
> +static void ttf_propagate(struct rb_node *rb, struct rb_node *stop)
> +{
> +	while (rb != stop) {
> +		struct sched_dl_entity *dl_se = dl_of(rb);
> +		u64 subtree_ttf = compute_subtree_ttf(dl_se);
> +
> +		if (dl_se->__subtree_ttf == subtree_ttf)
> +			break;
> +
> +		rb = rb_parent(rb);
> +	}
> +}
> +
> +static void ttf_copy(struct rb_node *rb_old, struct rb_node *rb_new)
> +{
> +	struct sched_dl_entity *dl_old = dl_of(rb_old);
> +	struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> +	new->__subtree_ttf = old->__subtree_ttf;
> +}
> +
> +static void ttf_rotate(struct rb_node *rb_old, struct rb_node
> *rb_new) +{
> +	struct sched_dl_entity *dl_old = dl_of(rb_old);
> +	struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> +	new->__subtree_ttf = old->__subtree_ttf;
> +	old->__subtree_ttf = compute_subtree_ttf(old);
> +}
> +
> +static const struct rb_augment_callbacks ttf_callbacks = {
> +	.propagate	= ttf_propagate,
> +	.copy		= ttf_copy,
> +	.rotate		= ttf_rotate,
> +};
> +
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -855,15 +922,16 @@ static void __enqueue_dl_entity(struct
> sched_dl_entity *dl_se) struct rb_node *parent = NULL;
>  	struct sched_dl_entity *entry;
>  	int leftmost = 1;
> +	u64 ttf;
>  
>  	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
>  
>  	while (*link) {
>  		parent = *link;
> -		entry = rb_entry(parent, struct sched_dl_entity,
> rb_node);
> -		if (dl_time_before(dl_se->deadline, entry->deadline))
> +		entry = dl_of(parent);
> +		if (dl_time_before(dl_se->deadline,
> entry->deadline)) { link = &parent->rb_left;
> -		else {
> +		} else {
>  			link = &parent->rb_right;
>  			leftmost = 0;
>  		}
> @@ -872,8 +940,12 @@ static void __enqueue_dl_entity(struct
> sched_dl_entity *dl_se) if (leftmost)
>  		dl_rq->rb_leftmost = &dl_se->rb_node;
>  
> +	dl_se->__subtree_ttf = ttf = entity_ttf(dl_se);
>  	rb_link_node(&dl_se->rb_node, parent, link);
> -	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> +	rb_insert_augmented(&dl_se->rb_node, &dl_rq->rb_root,
> &ttf_callbacks); +
> +	if (dl_of(dl_rq->rb_root.rb_node)->__subtree_ttf == ttf)
> +		dl_rb->rb_least_ttf = dl_rq->rb_root.rb_node;
>  
>  	inc_dl_tasks(dl_se, dl_rq);
>  }
> @@ -892,9 +964,41 @@ static void __dequeue_dl_entity(struct
> sched_dl_entity *dl_se) dl_rq->rb_leftmost = next_node;
>  	}
>  
> -	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> +	rb_erase_augmented(&dl_se->rb_node, &dl_rq->rb_root,
> &ttf_callbacks); RB_CLEAR_NODE(&dl_se->rb_node);
>  
> +	if (dl_rq->rb_least_ttf == &dl_se->rb_node) {
> +		struct rb_node **link = &dl_rq->rb_root.rb_node;
> +
> +		while (*link) {
> +			struct rb_node *node = *link;
> +			struct sched_dl_entity *entry = dl_of(node);
> +			u64 subtree_ttf = entry->__subtree_ttf;
> +
> +			if (subtree_ttf = U64_MAX) {
> +				dl_rq->rb_least_ttf = NULL;
> +				goto no_least_ttf;
> +			}
> +
> +			if (node->rb_left &&
> +			    dl_of(node->rb_left)->__subtree_ttf ==
> subtree_ttf) {
> +				link = &node->rb_left;
> +				continue;
> +			} 
> +
> +			if (node->rb_right &&
> +			    dl_of(node->rb_right)->__subtree_ttf ==
> subtree_ttf) {
> +				link = &node->rb_right;
> +				continue;
> +			}
> +
> +			break;
> +		}
> +
> +		dl_rq->rb_least_ttf = *link;
> +no_least_ttf:
> +	}
> +
>  	dec_dl_tasks(dl_se, dl_rq);
>  }
>  
> @@ -1109,12 +1213,28 @@ static void start_hrtick_dl(struct rq *rq,
> struct task_struct *p) static struct sched_dl_entity
> *pick_next_dl_entity(struct rq *rq, struct dl_rq *dl_rq)
>  {
> -	struct rb_node *left = dl_rq->rb_leftmost;
> +	struct rb_node *rb_left = dl_rq->rb_leftmost;
> +	struct sched_dl_entity *left, *least;
> +	u64 now;
>  
> -	if (!left)
> +	if (!rb_left)
>  		return NULL;
>  
> -	return rb_entry(left, struct sched_dl_entity, rb_node);
> +	left = dl_of(rb_left);
> +
> +	/*
> +	 * Only local tasks have TTF set, if one is present and it
> would miss
> +	 * its deadline because of the (G)EDF task selection,
> override that
> +	 * and select the local task.
> +	 */
> +	if (dl_rq->rb_least_ttf) {
> +		least = dl_of(dl_rq->rb_least_ttf);
> +		now = rq_clock_task(rq);
> +		if (now + left->runtime > entity_ttf(least))
> +			return least;
> +	}
> +
> +	return left;
>  }
>  
>  struct task_struct *
> @@ -1645,7 +1765,7 @@ static void set_cpus_allowed_dl(struct
> task_struct *p,
>  		 * until we complete the update.
>  		 */
>  		raw_spin_lock(&src_dl_b->lock);
> -		__dl_clear(src_dl_b, p->dl.dl_bw);
> +		__dl_sub(src_dl_b, p->dl.dl_bw);
>  		raw_spin_unlock(&src_dl_b->lock);
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 055f935d4421..55059d2568bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -198,13 +198,15 @@ static inline int dl_bandwidth_enabled(void)
>  
>  extern struct dl_bw *dl_bw_of(int i);
>  
> +extern DECLARE_PER_CPU(u64, dl_cpu_bw);
> +
>  struct dl_bw {
>  	raw_spinlock_t lock;
>  	u64 bw, total_bw;
>  };
>  
>  static inline
> -void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
> +void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
>  {
>  	dl_b->total_bw -= tsk_bw;
>  }
> @@ -215,13 +217,6 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
>  	dl_b->total_bw += tsk_bw;
>  }
>  
> -static inline
> -bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64
> new_bw) -{
> -	return dl_b->bw != -1 &&
> -	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> -}
> -
>  extern struct mutex sched_domains_mutex;
>  
>  #ifdef CONFIG_CGROUP_SCHED
> @@ -507,6 +502,7 @@ struct dl_rq {
>  	/* runqueue is an rbtree, ordered by deadline */
>  	struct rb_root rb_root;
>  	struct rb_node *rb_leftmost;
> +	struct rb_node *rb_least_ttf;
>  
>  	unsigned long dl_nr_running;
>