This patchset introduces a group-level EDF scheduler that extends the
current throttling mechanism to support generic period assignments.
With it, the rt_runtime and rt_period parameters can be used to specify
arbitrary CPU reservations for RT tasks.
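As a quick worked example (numbers chosen only for illustration):
setting rt_runtime_us=10000 and rt_period_us=100000 on a group reserves
10ms of CPU time every 100ms, i.e. 10% of a CPU, and different groups
can now use different periods.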
The first two patches fix two problems in the current implementation.
This is an early RFC; I'm interested in what people think about this
feature, whether it's worth working on, what could be improved in the
design, and so on.
The main design issues involved:
- It is no longer possible to specify RUNTIME_INF for a task group
  when throttling is enabled.  Rationale: supporting both throttled
  and unthrottled groups would have required too much extra complexity
  (I didn't find anything simpler than two parallel runqueues, one for
  throttled and one for unthrottled groups).
- Since it is not easy to mix tasks and groups on the same scheduler
  queue (tasks have no deadlines), the bandwidth reserved for the tasks
  in a group is controlled with two additional cgroup attributes:
  rt_task_runtime_us and rt_task_period_us.  These attributes control,
  within a cgroup, how much bandwidth is reserved for the tasks it
  contains.
- Shared resources are still handled using boosting.  When a group
  contains a task inside a critical section, it is scheduled according
  to the highest priority among those of the tasks it contains.
  In this way, the same group has two modes: when it is not boosted
  it is scheduled according to its deadline; when it is boosted, it
  is scheduled according to its priority.  Boosted groups are always
  favored over non-boosted ones.
- The old priority array is now gone.  To use only a single data
  structure for entities using both priorities and deadlines (due
  to boosting), the only reasonable choice was an rb-tree; the
  comparison function used to order the keys takes into account the
  prioritization described above: boosted entities, ordered by
  priority, are favored over non-boosted ones, ordered by increasing
  deadline (a simplified sketch follows this list).
- Given that the various rt_rq's belonging to the same task group
  are activated independently, each rt_rq needs its own period timer.
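As a rough illustration of that ordering (this is not the exact code
from the patches; rt_se_is_task(), rt_se_is_boosted() and
rt_se_deadline() are made-up helpers standing for the kind of entity,
its boosting state and its deadline), the comparison between two
entities sharing a runqueue boils down to:

	/*
	 * Sketch only: boosted entities are compared by (inherited)
	 * priority and always precede non-boosted ones, which are
	 * compared by deadline.  The real code also special-cases
	 * RUNTIME_INF and uses a wraparound-safe deadline comparison.
	 */
	static int rt_entity_precedes(struct sched_rt_entity *a,
				      struct sched_rt_entity *b)
	{
		/* Task-only queues hold only tasks: plain priority order. */
		if (rt_se_is_task(a) && rt_se_is_task(b))
			return rt_se_prio(a) < rt_se_prio(b);

		/* Boosted groups first, ordered by inherited priority. */
		if (rt_se_is_boosted(a) && rt_se_is_boosted(b))
			return rt_se_prio(a) < rt_se_prio(b);
		if (rt_se_is_boosted(a) || rt_se_is_boosted(b))
			return rt_se_is_boosted(a);

		/* Otherwise, earliest deadline first. */
		return rt_se_deadline(a) < rt_se_deadline(b);
	}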
The patchset is against sched-devel, and (temporarily) depends on
CONFIG_RT_GROUP_SCHED.
Any kind of feedback is welcome.
Fabio Checconi (8):
Fix rt_rq->pushable_tasks initialization in init_rt_rq()
Fix hrtick handling
Replace struct prio_array with an RB tree
Remove the balancing logic
Use EDF to throttle RT tasks hierarchically
Modify the curr/next priority tracking
Reprogram timers only when necessary
Use hrtick when available
include/linux/init_task.h | 1 -
include/linux/sched.h | 2 +-
kernel/sched.c | 553 +++++++++++++++++++++++++------------
kernel/sched_debug.c | 4 +-
kernel/sched_rt.c | 672 +++++++++++++++++++++++++++------------------
5 files changed, 781 insertions(+), 451 deletions(-)
init_rt_rq() initializes only rq->rt.pushable_tasks, and not the
pushable_tasks field of the passed rt_rq.  The plist is never used
uninitialized, since the only pushable_tasks plists actually used are
those of the root rt_rqs; however, reinitializing the root plist on
every group creation corrupts it, losing its previous contents.
Signed-off-by: Fabio Checconi <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6032f51..873b252 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9075,7 +9075,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
#ifdef CONFIG_SMP
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
- plist_head_init(&rq->rt.pushable_tasks, &rq->lock);
+ plist_head_init(&rt_rq->pushable_tasks, &rq->lock);
#endif
rt_rq->rt_time = 0;
--
1.6.2.2
The current hrtick implementation can end up asking the hrtimer code
to wake up ksoftirqd while holding rq->lock, causing a spinlock
recursion.  This patch uses __hrtimer_start_range_ns() to disable the
wakeup in the reprogramming path.
Signed-off-by: Fabio Checconi <[email protected]>
---
kernel/sched.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 873b252..b8d432b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1090,6 +1090,19 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
}
#ifdef CONFIG_SMP
+static void hrtick_restart(struct hrtimer *timer)
+{
+ ktime_t hard, soft;
+ unsigned long delta;
+
+ soft = hrtimer_get_softexpires(timer);
+ hard = hrtimer_get_expires(timer);
+ delta = ktime_to_ns(ktime_sub(hard, soft));
+
+ __hrtimer_start_range_ns(timer, soft, delta,
+ HRTIMER_MODE_ABS_PINNED, 0);
+}
+
/*
* called from hardirq (IPI) context
*/
@@ -1098,7 +1111,7 @@ static void __hrtick_start(void *arg)
struct rq *rq = arg;
spin_lock(&rq->lock);
- hrtimer_restart(&rq->hrtick_timer);
+ hrtick_restart(&rq->hrtick_timer);
rq->hrtick_csd_pending = 0;
spin_unlock(&rq->lock);
}
@@ -1116,7 +1129,7 @@ static void hrtick_start(struct rq *rq, u64 delay)
hrtimer_set_expires(timer, time);
if (rq == this_rq()) {
- hrtimer_restart(timer);
+ hrtick_restart(timer);
} else if (!rq->hrtick_csd_pending) {
__smp_call_function_single(cpu_of(rq), &rq->hrtick_csd, 0);
rq->hrtick_csd_pending = 1;
@@ -1173,7 +1186,8 @@ static void init_rq_hrtick(struct rq *rq)
rq->hrtick_csd.info = rq;
#endif
- hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL_PINNED);
rq->hrtick_timer.function = hrtick;
}
#else /* CONFIG_SCHED_HRTICK */
--
1.6.2.2
The old prio_array is not enough to store groups in deadline order.
To keep things simple, use the same data structure (an rb-tree) to
store tasks in priority order too.
---
include/linux/init_task.h | 1 -
include/linux/sched.h | 2 +-
kernel/sched.c | 25 ++-----
kernel/sched_rt.c | 157 +++++++++++++++++++++++++++++++--------------
4 files changed, 117 insertions(+), 68 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 28b1f30..5c698b4 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -139,7 +139,6 @@ extern struct cred init_cred;
.group_node = LIST_HEAD_INIT(tsk.se.group_node), \
}, \
.rt = { \
- .run_list = LIST_HEAD_INIT(tsk.rt.run_list), \
.time_slice = HZ, \
.nr_cpus_allowed = NR_CPUS, \
}, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d08b3f7..a115067 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1128,7 +1128,7 @@ struct sched_entity {
};
struct sched_rt_entity {
- struct list_head run_list;
+ struct rb_node rb_node;
unsigned long timeout;
unsigned int time_slice;
int nr_cpus_allowed;
diff --git a/kernel/sched.c b/kernel/sched.c
index b8d432b..675ea96 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -157,11 +157,11 @@ static inline int task_has_rt_policy(struct task_struct *p)
}
/*
- * This is the priority-queue data structure of the RT scheduling class:
+ * This is the EDF queue data structure of the RT scheduling class:
*/
-struct rt_prio_array {
- DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
- struct list_head queue[MAX_RT_PRIO];
+struct rt_edf_tree {
+ struct rb_root rb_root;
+ struct rb_node *rb_leftmost;
};
struct rt_bandwidth {
@@ -481,7 +481,7 @@ struct cfs_rq {
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
- struct rt_prio_array active;
+ struct rt_edf_tree active;
unsigned long rt_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
@@ -2595,7 +2595,7 @@ static void __sched_fork(struct task_struct *p)
p->se.wait_max = 0;
#endif
- INIT_LIST_HEAD(&p->rt.run_list);
+ RB_CLEAR_NODE(&p->rt.rb_node);
p->se.on_rq = 0;
INIT_LIST_HEAD(&p->se.group_node);
@@ -9069,16 +9069,7 @@ static void init_cfs_rq(struct cfs_rq *cfs_rq, struct rq *rq)
static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
{
- struct rt_prio_array *array;
- int i;
-
- array = &rt_rq->active;
- for (i = 0; i < MAX_RT_PRIO; i++) {
- INIT_LIST_HEAD(array->queue + i);
- __clear_bit(i, array->bitmap);
- }
- /* delimiter for bitsearch: */
- __set_bit(MAX_RT_PRIO, array->bitmap);
+ rt_rq->active.rb_root = RB_ROOT;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
rt_rq->highest_prio.curr = MAX_RT_PRIO;
@@ -9158,7 +9149,7 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
rt_se->my_q = rt_rq;
rt_se->parent = parent;
- INIT_LIST_HEAD(&rt_se->run_list);
+ RB_CLEAR_NODE(&rt_se->rb_node);
}
#endif
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 9bf0d2a..7ae1212 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -136,7 +136,7 @@ void dec_rt_migration(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
static inline int on_rt_rq(struct sched_rt_entity *rt_se)
{
- return !list_empty(&rt_se->run_list);
+ return !RB_EMPTY_NODE(&rt_se->rb_node);
}
#ifdef CONFIG_RT_GROUP_SCHED
@@ -668,6 +668,14 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
#endif /* CONFIG_SMP */
+static inline struct sched_rt_entity *__rt_edf_first(struct rt_edf_tree *tree)
+{
+ if (!tree->rb_leftmost)
+ return NULL;
+
+ return rb_entry(tree->rb_leftmost, struct sched_rt_entity, rb_node);
+}
+
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
static void
inc_rt_prio(struct rt_rq *rt_rq, int prio)
@@ -683,6 +691,7 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
static void
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
+ struct sched_rt_entity *rt_se;
int prev_prio = rt_rq->highest_prio.curr;
if (rt_rq->rt_nr_running) {
@@ -694,10 +703,8 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
* we may have some recomputation to do
*/
if (prio == prev_prio) {
- struct rt_prio_array *array = &rt_rq->active;
-
- rt_rq->highest_prio.curr =
- sched_find_first_bit(array->bitmap);
+ rt_se = __rt_edf_first(&rt_rq->active);
+ rt_rq->highest_prio.curr = rt_se_prio(rt_se);
}
} else
@@ -772,12 +779,71 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
dec_rt_group(rt_se, rt_rq);
}
+static inline int rt_entity_before(struct sched_rt_entity *a,
+ struct sched_rt_entity *b)
+{
+ return rt_se_prio(a) < rt_se_prio(b);
+}
+
+static void __rt_entity_insert(struct rt_edf_tree *tree,
+ struct sched_rt_entity *rt_se)
+{
+ struct rb_node **link = &tree->rb_root.rb_node;
+ struct rb_node *parent = NULL;
+ struct sched_rt_entity *entry;
+ int leftmost = 1;
+
+ while (*link) {
+ parent = *link;
+ entry = rb_entry(parent, struct sched_rt_entity, rb_node);
+ if (rt_entity_before(rt_se, entry))
+ link = &parent->rb_left;
+ else {
+ link = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ tree->rb_leftmost = &rt_se->rb_node;
+
+ rb_link_node(&rt_se->rb_node, parent, link);
+ rb_insert_color(&rt_se->rb_node, &tree->rb_root);
+}
+
+/*
+ * Attempt to insert rt_se as the first element of tree. This is used
+ * when requeueing a task at the head of a rt_rq before a reschedule, so
+ * it makes sense only if it is at the highest prio.
+ */
+static int __rt_entity_insert_head(struct rt_edf_tree *tree,
+ struct sched_rt_entity *rt_se)
+{
+ struct rb_node **link;
+ struct rb_node *parent;
+ struct sched_rt_entity *entry;
+
+ if (tree->rb_leftmost) {
+ entry = __rt_edf_first(tree);
+ if (rt_se_prio(entry) < rt_se_prio(rt_se))
+ return 1;
+ }
+
+ parent = rb_first(&tree->rb_root);
+ link = &parent->rb_left;
+
+ tree->rb_leftmost = &rt_se->rb_node;
+
+ rb_link_node(&rt_se->rb_node, parent, link);
+ rb_insert_color(&rt_se->rb_node, &tree->rb_root);
+
+ return 0;
+}
+
static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
- struct rt_prio_array *array = &rt_rq->active;
struct rt_rq *group_rq = group_rt_rq(rt_se);
- struct list_head *queue = array->queue + rt_se_prio(rt_se);
/*
* Don't enqueue the group if its throttled, or when empty.
@@ -788,21 +854,25 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
return;
- list_add_tail(&rt_se->run_list, queue);
- __set_bit(rt_se_prio(rt_se), array->bitmap);
-
+ __rt_entity_insert(&rt_rq->active, rt_se);
inc_rt_tasks(rt_se, rt_rq);
}
+static void __rt_entity_remove(struct rt_edf_tree *tree,
+ struct sched_rt_entity *rt_se)
+{
+ if (tree->rb_leftmost == &rt_se->rb_node)
+ tree->rb_leftmost = rb_next(&rt_se->rb_node);
+
+ rb_erase(&rt_se->rb_node, &tree->rb_root);
+ RB_CLEAR_NODE(&rt_se->rb_node);
+}
+
static void __dequeue_rt_entity(struct sched_rt_entity *rt_se)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
- struct rt_prio_array *array = &rt_rq->active;
-
- list_del_init(&rt_se->run_list);
- if (list_empty(array->queue + rt_se_prio(rt_se)))
- __clear_bit(rt_se_prio(rt_se), array->bitmap);
+ __rt_entity_remove(&rt_rq->active, rt_se);
dec_rt_tasks(rt_se, rt_rq);
}
@@ -881,14 +951,13 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
static void
requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se, int head)
{
- if (on_rt_rq(rt_se)) {
- struct rt_prio_array *array = &rt_rq->active;
- struct list_head *queue = array->queue + rt_se_prio(rt_se);
+ struct rt_edf_tree *tree;
- if (head)
- list_move(&rt_se->run_list, queue);
- else
- list_move_tail(&rt_se->run_list, queue);
+ if (on_rt_rq(rt_se)) {
+ tree = &rt_rq->active;
+ __rt_entity_remove(tree, rt_se);
+ if (!head || __rt_entity_insert_head(tree, rt_se))
+ __rt_entity_insert(tree, rt_se);
}
}
@@ -1000,18 +1069,12 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync
static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
struct rt_rq *rt_rq)
{
- struct rt_prio_array *array = &rt_rq->active;
- struct sched_rt_entity *next = NULL;
- struct list_head *queue;
- int idx;
+ struct rb_node *left = rt_rq->active.rb_leftmost;
- idx = sched_find_first_bit(array->bitmap);
- BUG_ON(idx >= MAX_RT_PRIO);
-
- queue = array->queue + idx;
- next = list_entry(queue->next, struct sched_rt_entity, run_list);
+ if (!left)
+ return NULL;
- return next;
+ return rb_entry(left, struct sched_rt_entity, rb_node);
}
static struct task_struct *_pick_next_task_rt(struct rq *rq)
@@ -1083,31 +1146,27 @@ static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
/* Return the second highest RT task, NULL otherwise */
static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)
{
- struct task_struct *next = NULL;
+ struct task_struct *p, *next = NULL;
struct sched_rt_entity *rt_se;
- struct rt_prio_array *array;
struct rt_rq *rt_rq;
- int idx;
+ struct rt_edf_tree *tree;
+ struct rb_node *node;
for_each_leaf_rt_rq(rt_rq, rq) {
- array = &rt_rq->active;
- idx = sched_find_first_bit(array->bitmap);
- next_idx:
- if (idx >= MAX_RT_PRIO)
- continue;
- if (next && next->prio < idx)
- continue;
- list_for_each_entry(rt_se, array->queue + idx, run_list) {
- struct task_struct *p = rt_task_of(rt_se);
+ tree = &rt_rq->active;
+ for (node = tree->rb_leftmost; node; node = rb_next(node)) {
+ rt_se = rb_entry(node, struct sched_rt_entity, rb_node);
+ if (group_rt_rq(rt_se))
+ continue;
+
+ p = rt_task_of(rt_se);
+ if (next && next->prio < p->prio)
+ break;
if (pick_rt_task(rq, p, cpu)) {
next = p;
break;
}
}
- if (!next) {
- idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1);
- goto next_idx;
- }
}
return next;
@@ -1706,7 +1765,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
* Requeue to the end of queue if we are not the only element
* on the queue:
*/
- if (p->rt.run_list.prev != p->rt.run_list.next) {
+ if (rb_next(&p->rt.rb_node)) {
requeue_task_rt(rq, p, 0);
set_tsk_need_resched(p);
}
--
1.6.2.2
The balancing logic would need to run an admission test at each step
it takes while redistributing the bandwidth assigned to each group.
It currently does not, so it can produce unschedulable assignments,
which in turn would make EDF scheduling on each single CPU pointless.
Temporarily remove it; it can be added back later, once the rest of
the system proves to be stable enough.
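For reference, this is the kind of per-CPU admission test that every
runtime transfer would have to preserve; a minimal, illustrative sketch
(fixed point with 20 fractional bits, in the spirit of to_ratio(); it
is not code from this series):

	static int edf_admissible(const unsigned long long *runtime,
				  const unsigned long long *period, int n)
	{
		unsigned long long sum = 0;
		int i;

		/* Sum the utilizations of the reservations sharing one CPU. */
		for (i = 0; i < n; i++)
			sum += (runtime[i] << 20) / period[i];

		/* EDF keeps its guarantees only if the CPU is not oversubscribed. */
		return sum <= (1ULL << 20);
	}

In the kernel the division would go through to_ratio()/div64_u64(), and
the bound would be the globally allowed RT share rather than a full CPU.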
---
kernel/sched_rt.c | 142 -----------------------------------------------------
1 files changed, 0 insertions(+), 142 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 7ae1212..c23f3ad 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -287,137 +287,16 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
#ifdef CONFIG_SMP
/*
- * We ran out of runtime, see if we can borrow some from our neighbours.
- */
-static int do_balance_runtime(struct rt_rq *rt_rq)
-{
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
- int i, weight, more = 0;
- u64 rt_period;
-
- weight = cpumask_weight(rd->span);
-
- spin_lock(&rt_b->rt_runtime_lock);
- rt_period = ktime_to_ns(rt_b->rt_period);
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- if (iter == rt_rq)
- continue;
-
- spin_lock(&iter->rt_runtime_lock);
- /*
- * Either all rqs have inf runtime and there's nothing to steal
- * or __disable_runtime() below sets a specific rq to inf to
- * indicate its been disabled and disalow stealing.
- */
- if (iter->rt_runtime == RUNTIME_INF)
- goto next;
-
- /*
- * From runqueues with spare time, take 1/n part of their
- * spare time, but no more than our period.
- */
- diff = iter->rt_runtime - iter->rt_time;
- if (diff > 0) {
- diff = div_u64((u64)diff, weight);
- if (rt_rq->rt_runtime + diff > rt_period)
- diff = rt_period - rt_rq->rt_runtime;
- iter->rt_runtime -= diff;
- rt_rq->rt_runtime += diff;
- more = 1;
- if (rt_rq->rt_runtime == rt_period) {
- spin_unlock(&iter->rt_runtime_lock);
- break;
- }
- }
-next:
- spin_unlock(&iter->rt_runtime_lock);
- }
- spin_unlock(&rt_b->rt_runtime_lock);
-
- return more;
-}
-
-/*
* Ensure this RQ takes back all the runtime it lend to its neighbours.
*/
static void __disable_runtime(struct rq *rq)
{
- struct root_domain *rd = rq->rd;
struct rt_rq *rt_rq;
- if (unlikely(!scheduler_running))
- return;
-
for_each_leaf_rt_rq(rt_rq, rq) {
- struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
- s64 want;
- int i;
-
- spin_lock(&rt_b->rt_runtime_lock);
- spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * Either we're all inf and nobody needs to borrow, or we're
- * already disabled and thus have nothing to do, or we have
- * exactly the right amount of runtime to take out.
- */
- if (rt_rq->rt_runtime == RUNTIME_INF ||
- rt_rq->rt_runtime == rt_b->rt_runtime)
- goto balanced;
- spin_unlock(&rt_rq->rt_runtime_lock);
-
- /*
- * Calculate the difference between what we started out with
- * and what we current have, that's the amount of runtime
- * we lend and now have to reclaim.
- */
- want = rt_b->rt_runtime - rt_rq->rt_runtime;
-
- /*
- * Greedy reclaim, take back as much as we can.
- */
- for_each_cpu(i, rd->span) {
- struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
- s64 diff;
-
- /*
- * Can't reclaim from ourselves or disabled runqueues.
- */
- if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
- continue;
-
- spin_lock(&iter->rt_runtime_lock);
- if (want > 0) {
- diff = min_t(s64, iter->rt_runtime, want);
- iter->rt_runtime -= diff;
- want -= diff;
- } else {
- iter->rt_runtime -= want;
- want -= want;
- }
- spin_unlock(&iter->rt_runtime_lock);
-
- if (!want)
- break;
- }
-
spin_lock(&rt_rq->rt_runtime_lock);
- /*
- * We cannot be left wanting - that would mean some runtime
- * leaked out of the system.
- */
- BUG_ON(want);
-balanced:
- /*
- * Disable all the borrow logic by pretending we have inf
- * runtime - in which case borrowing doesn't make sense.
- */
rt_rq->rt_runtime = RUNTIME_INF;
spin_unlock(&rt_rq->rt_runtime_lock);
- spin_unlock(&rt_b->rt_runtime_lock);
}
}
@@ -461,24 +340,6 @@ static void enable_runtime(struct rq *rq)
__enable_runtime(rq);
spin_unlock_irqrestore(&rq->lock, flags);
}
-
-static int balance_runtime(struct rt_rq *rt_rq)
-{
- int more = 0;
-
- if (rt_rq->rt_time > rt_rq->rt_runtime) {
- spin_unlock(&rt_rq->rt_runtime_lock);
- more = do_balance_runtime(rt_rq);
- spin_lock(&rt_rq->rt_runtime_lock);
- }
-
- return more;
-}
-#else /* !CONFIG_SMP */
-static inline int balance_runtime(struct rt_rq *rt_rq)
-{
- return 0;
-}
#endif /* CONFIG_SMP */
static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
@@ -500,8 +361,6 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
u64 runtime;
spin_lock(&rt_rq->rt_runtime_lock);
- if (rt_rq->rt_throttled)
- balance_runtime(rt_rq);
runtime = rt_rq->rt_runtime;
rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
@@ -544,7 +403,6 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
if (sched_rt_runtime(rt_rq) >= sched_rt_period(rt_rq))
return 0;
- balance_runtime(rt_rq);
runtime = sched_rt_runtime(rt_rq);
if (runtime == RUNTIME_INF)
return 0;
--
1.6.2.2
Switch to EDF to implement throttling in sched_rt.  This patch:
  - introduces a struct task_rt_group to abstract an EDF runqueue,
    capable of holding both tasks and groups;
  - adds two task_rt_groups to each task_group, one to store tasks,
    and another to store subgroups;
  - modifies the sched_rt class hooks in order to use the two
    task_rt_groups;
  - modifies the admission control code, to take into account the
    new task-only subgroup.
Using two different runqueues is necessary because tasks have no
deadlines and cannot be mixed with groups, which are now scheduled by
deadline (assigned implicitly by the throttling algorithm).
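To make the resulting per-CPU nesting explicit, the following sanity
check is an illustration only (it is not part of the patch); it merely
restates the wiring done by set_task_rq() and init_tg_rt_entry() below
for a task p in group tg:

	static void check_rt_wiring(struct task_struct *p)
	{
		struct task_group *tg = task_group(p);
		int cpu = task_cpu(p);

		/* Tasks queue on the task-only runqueue of their group... */
		BUG_ON(p->rt.rt_rq != tg->rt_task_group.rt_rq[cpu]);

		/* ...whose entity queues on the group's own EDF runqueue... */
		BUG_ON(tg->rt_task_group.rt_se[cpu]->rt_rq !=
		       tg->rt_rq_group.rt_rq[cpu]);

		/* ...whose entity, in turn, queues on the parent's EDF runqueue. */
		BUG_ON(tg->parent &&
		       tg->rt_rq_group.rt_se[cpu]->rt_rq !=
		       tg->parent->rt_rq_group.rt_rq[cpu]);
	}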
---
kernel/sched.c | 508 ++++++++++++++++++++++++++++++++++----------------
kernel/sched_debug.c | 4 +-
kernel/sched_rt.c | 288 +++++++++++++++++++++--------
3 files changed, 559 insertions(+), 241 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 675ea96..bdae263 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -169,34 +169,10 @@ struct rt_bandwidth {
spinlock_t rt_runtime_lock;
ktime_t rt_period;
u64 rt_runtime;
- struct hrtimer rt_period_timer;
};
static struct rt_bandwidth def_rt_bandwidth;
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
-
-static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
-{
- struct rt_bandwidth *rt_b =
- container_of(timer, struct rt_bandwidth, rt_period_timer);
- ktime_t now;
- int overrun;
- int idle = 0;
-
- for (;;) {
- now = hrtimer_cb_get_time(timer);
- overrun = hrtimer_forward(timer, now, rt_b->rt_period);
-
- if (!overrun)
- break;
-
- idle = do_sched_rt_period_timer(rt_b, overrun);
- }
-
- return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
-}
-
static
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
@@ -204,10 +180,6 @@ void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
rt_b->rt_runtime = runtime;
spin_lock_init(&rt_b->rt_runtime_lock);
-
- hrtimer_init(&rt_b->rt_period_timer,
- CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- rt_b->rt_period_timer.function = sched_rt_period_timer;
}
static inline int rt_bandwidth_enabled(void)
@@ -215,43 +187,6 @@ static inline int rt_bandwidth_enabled(void)
return sysctl_sched_rt_runtime >= 0;
}
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
- ktime_t now;
-
- if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
- return;
-
- if (hrtimer_active(&rt_b->rt_period_timer))
- return;
-
- spin_lock(&rt_b->rt_runtime_lock);
- for (;;) {
- unsigned long delta;
- ktime_t soft, hard;
-
- if (hrtimer_active(&rt_b->rt_period_timer))
- break;
-
- now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
- hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
- soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
- hard = hrtimer_get_expires(&rt_b->rt_period_timer);
- delta = ktime_to_ns(ktime_sub(hard, soft));
- __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
- HRTIMER_MODE_ABS_PINNED, 0);
- }
- spin_unlock(&rt_b->rt_runtime_lock);
-}
-
-#ifdef CONFIG_RT_GROUP_SCHED
-static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
-{
- hrtimer_cancel(&rt_b->rt_period_timer);
-}
-#endif
-
/*
* sched_domains_mutex serializes calls to arch_init_sched_domains,
* detach_destroy_domains and partition_sched_domains.
@@ -266,6 +201,18 @@ struct cfs_rq;
static LIST_HEAD(task_groups);
+#ifdef CONFIG_RT_GROUP_SCHED
+struct task_group;
+
+struct task_rt_group {
+ struct sched_rt_entity **rt_se;
+ struct rt_rq **rt_rq;
+
+ struct rt_bandwidth rt_bandwidth;
+ struct task_group *tg;
+};
+#endif
+
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
@@ -285,10 +232,8 @@ struct task_group {
#endif
#ifdef CONFIG_RT_GROUP_SCHED
- struct sched_rt_entity **rt_se;
- struct rt_rq **rt_rq;
-
- struct rt_bandwidth rt_bandwidth;
+ struct task_rt_group rt_rq_group;
+ struct task_rt_group rt_task_group;
#endif
struct rcu_head rcu;
@@ -324,9 +269,17 @@ static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
#ifdef CONFIG_RT_GROUP_SCHED
static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
static DEFINE_PER_CPU(struct rt_rq, init_rt_rq) ____cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_task_entity);
+static DEFINE_PER_CPU(struct rt_rq, init_rt_task_rq) \
+ ____cacheline_aligned_in_smp;
#endif /* CONFIG_RT_GROUP_SCHED */
#else /* !CONFIG_USER_SCHED */
#define root_task_group init_task_group
+#ifdef CONFIG_RT_GROUP_SCHED
+static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_task_entity);
+static DEFINE_PER_CPU(struct rt_rq, init_rt_task_rq) \
+ ____cacheline_aligned_in_smp;
+#endif
#endif /* CONFIG_USER_SCHED */
/* task_group_lock serializes add/remove of task groups and also changes to
@@ -394,8 +347,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
#ifdef CONFIG_RT_GROUP_SCHED
- p->rt.rt_rq = task_group(p)->rt_rq[cpu];
- p->rt.parent = task_group(p)->rt_se[cpu];
+ p->rt.rt_rq = task_group(p)->rt_task_group.rt_rq[cpu];
+ p->rt.parent = task_group(p)->rt_task_group.rt_se[cpu];
#endif
}
@@ -496,22 +449,31 @@ struct rt_rq {
int overloaded;
struct plist_head pushable_tasks;
#endif
- int rt_throttled;
+ int rt_flags;
+
+ u64 rt_deadline;
u64 rt_time;
u64 rt_runtime;
+ ktime_t rt_period;
+
+ struct hrtimer rt_period_timer;
+
/* Nests inside the rq lock: */
spinlock_t rt_runtime_lock;
-#ifdef CONFIG_RT_GROUP_SCHED
unsigned long rt_nr_boosted;
+#ifdef CONFIG_RT_GROUP_SCHED
struct rq *rq;
struct list_head leaf_rt_rq_list;
- struct task_group *tg;
+ struct task_rt_group *rt_tg;
struct sched_rt_entity *rt_se;
#endif
};
+#define RT_RQ_THROTTLED 1
+#define RT_RQ_NEEDS_UPDATE 2
+
#ifdef CONFIG_SMP
/*
@@ -6176,7 +6138,7 @@ recheck:
* assigned.
*/
if (rt_bandwidth_enabled() && rt_policy(policy) &&
- task_group(p)->rt_bandwidth.rt_runtime == 0)
+ task_group(p)->rt_task_group.rt_bandwidth.rt_runtime == 0)
return -EPERM;
#endif
@@ -9084,7 +9046,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
#endif
rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
+ rt_rq->rt_flags = 0;
rt_rq->rt_runtime = 0;
spin_lock_init(&rt_rq->rt_runtime_lock);
@@ -9124,21 +9086,57 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
#endif
#ifdef CONFIG_RT_GROUP_SCHED
-static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
- struct sched_rt_entity *rt_se, int cpu, int add,
- struct sched_rt_entity *parent)
+static void init_tg_rt_rq(struct rt_rq *rt_rq, struct task_rt_group *rt_tg,
+ int cpu, struct sched_rt_entity *rt_se,
+ struct sched_rt_entity *parent)
{
struct rq *rq = cpu_rq(cpu);
- tg->rt_rq[cpu] = rt_rq;
+ rt_tg->rt_rq[cpu] = rt_rq;
init_rt_rq(rt_rq, rq);
- rt_rq->tg = tg;
+ rt_rq->rt_tg = rt_tg;
rt_rq->rt_se = rt_se;
- rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
+
+ rt_rq->rt_runtime = rt_tg->rt_bandwidth.rt_runtime;
+ rt_rq->rt_period = rt_tg->rt_bandwidth.rt_period;
+
+ hrtimer_init(&rt_rq->rt_period_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ rt_rq->rt_period_timer.function = sched_rt_period_timer;
+
+ if (!rt_se)
+ return;
+
+ rt_tg->rt_se[cpu] = rt_se;
+ rt_se->parent = parent;
+ rt_se->my_q = rt_rq;
+ RB_CLEAR_NODE(&rt_se->rb_node);
+}
+
+static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
+ struct sched_rt_entity *rt_se, struct rt_rq *rt_task_rq,
+ struct sched_rt_entity *rt_task_se, int cpu, int add,
+ struct sched_rt_entity *parent)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ init_tg_rt_rq(rt_rq, &tg->rt_rq_group, cpu, rt_se, parent);
+
if (add)
list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
- tg->rt_se[cpu] = rt_se;
+ if (rt_task_rq) {
+ init_tg_rt_rq(rt_task_rq, &tg->rt_task_group,
+ cpu, rt_task_se, rt_se);
+
+ if (add) {
+ list_add(&rt_task_rq->leaf_rt_rq_list,
+ &rq->leaf_rt_rq_list);
+ }
+
+ rt_task_se->rt_rq = rt_rq;
+ }
+
if (!rt_se)
return;
@@ -9147,9 +9145,6 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
else
rt_se->rt_rq = parent->my_q;
- rt_se->my_q = rt_rq;
- rt_se->parent = parent;
- RB_CLEAR_NODE(&rt_se->rb_node);
}
#endif
@@ -9162,7 +9157,7 @@ void __init sched_init(void)
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
- alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+ alloc_size += 4 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_USER_SCHED
alloc_size *= 2;
@@ -9193,17 +9188,32 @@ void __init sched_init(void)
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
- init_task_group.rt_se = (struct sched_rt_entity **)ptr;
+ init_task_group.rt_rq_group.rt_se =
+ (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
- init_task_group.rt_rq = (struct rt_rq **)ptr;
+ init_task_group.rt_rq_group.rt_rq = (struct rt_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
+ init_task_group.rt_task_group.rt_se =
+ (struct sched_rt_entity **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
+
+ init_task_group.rt_task_group.rt_rq = (struct rt_rq **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
#ifdef CONFIG_USER_SCHED
- root_task_group.rt_se = (struct sched_rt_entity **)ptr;
+ root_task_group.rt_rq_group.rt_se =
+ (struct sched_rt_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
- root_task_group.rt_rq = (struct rt_rq **)ptr;
+ root_task_group.rt_rq_group.rt_rq = (struct rt_rq **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
+
+ root_task_group.rt_task_group.rt_se =
+ (struct sched_rt_entity **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
+
+ root_task_group.rt_task_group.rt_rq = (struct rt_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */
@@ -9223,11 +9233,19 @@ void __init sched_init(void)
global_rt_period(), global_rt_runtime());
#ifdef CONFIG_RT_GROUP_SCHED
- init_rt_bandwidth(&init_task_group.rt_bandwidth,
+ init_rt_bandwidth(&init_task_group.rt_rq_group.rt_bandwidth,
+ global_rt_period(), global_rt_runtime());
+ init_rt_bandwidth(&init_task_group.rt_task_group.rt_bandwidth,
global_rt_period(), global_rt_runtime());
+ init_task_group.rt_rq_group.tg = &init_task_group;
+ init_task_group.rt_task_group.tg = &init_task_group;
#ifdef CONFIG_USER_SCHED
- init_rt_bandwidth(&root_task_group.rt_bandwidth,
+ init_rt_bandwidth(&root_task_group.rt_rq_group.rt_bandwidth,
+ global_rt_period(), RUNTIME_INF);
+ init_rt_bandwidth(&root_task_group.rt_task_group.rt_bandwidth,
global_rt_period(), RUNTIME_INF);
+ root_task_group.rt_rq_group.tg = &root_task_group;
+ root_task_group.rt_task_group.tg = &root_task_group;
#endif /* CONFIG_USER_SCHED */
#endif /* CONFIG_RT_GROUP_SCHED */
@@ -9302,13 +9320,19 @@ void __init sched_init(void)
#ifdef CONFIG_RT_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
#ifdef CONFIG_CGROUP_SCHED
- init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
+ init_tg_rt_entry(&init_task_group, &rq->rt, NULL,
+ &per_cpu(init_rt_task_rq, i),
+ &per_cpu(init_sched_rt_task_entity, i),
+ i, 1, NULL);
#elif defined CONFIG_USER_SCHED
- init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, 0, NULL);
+ init_tg_rt_entry(&root_task_group, &rq->rt, NULL,
+ NULL, NULL, i, 0, NULL);
init_tg_rt_entry(&init_task_group,
- &per_cpu(init_rt_rq, i),
- &per_cpu(init_sched_rt_entity, i), i, 1,
- root_task_group.rt_se[i]);
+ &per_cpu(init_rt_rq, i),
+ &per_cpu(init_sched_rt_entity, i),
+ &per_cpu(init_rt_task_rq, i),
+ &per_cpu(init_sched_rt_task_entity, i),
+ i, 1, root_task_group.rt_rq_group.rt_se[i]);
#endif
#endif
@@ -9601,40 +9625,61 @@ static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
-static void free_rt_sched_group(struct task_group *tg)
+
+static void free_rt_task_group(struct task_rt_group *rt_tg)
{
int i;
- destroy_rt_bandwidth(&tg->rt_bandwidth);
-
for_each_possible_cpu(i) {
- if (tg->rt_rq)
- kfree(tg->rt_rq[i]);
- if (tg->rt_se)
- kfree(tg->rt_se[i]);
+ if (rt_tg->rt_rq[i]) {
+ hrtimer_cancel(&rt_tg->rt_rq[i]->rt_period_timer);
+ kfree(rt_tg->rt_rq[i]);
+ }
+ if (rt_tg->rt_se[i])
+ kfree(rt_tg->rt_se[i]);
}
- kfree(tg->rt_rq);
- kfree(tg->rt_se);
+ kfree(rt_tg->rt_rq);
+ kfree(rt_tg->rt_se);
+}
+
+static void free_rt_sched_group(struct task_group *tg)
+{
+ free_rt_task_group(&tg->rt_rq_group);
+ free_rt_task_group(&tg->rt_task_group);
}
static
int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
{
- struct rt_rq *rt_rq;
- struct sched_rt_entity *rt_se;
+ struct rt_rq *rt_rq, *rt_task_rq;
+ struct sched_rt_entity *rt_se, *rt_task_se;
struct rq *rq;
int i;
- tg->rt_rq = kzalloc(sizeof(rt_rq) * nr_cpu_ids, GFP_KERNEL);
- if (!tg->rt_rq)
+ tg->rt_rq_group.rt_rq = kzalloc(sizeof(rt_rq) * nr_cpu_ids, GFP_KERNEL);
+ if (!tg->rt_rq_group.rt_rq)
+ goto err;
+ tg->rt_rq_group.rt_se = kzalloc(sizeof(rt_se) * nr_cpu_ids, GFP_KERNEL);
+ if (!tg->rt_rq_group.rt_se)
+ goto err;
+
+ tg->rt_task_group.rt_rq = kzalloc(sizeof(rt_rq) * nr_cpu_ids,
+ GFP_KERNEL);
+ if (!tg->rt_task_group.rt_rq)
goto err;
- tg->rt_se = kzalloc(sizeof(rt_se) * nr_cpu_ids, GFP_KERNEL);
- if (!tg->rt_se)
+ tg->rt_task_group.rt_se = kzalloc(sizeof(rt_se) * nr_cpu_ids,
+ GFP_KERNEL);
+ if (!tg->rt_task_group.rt_se)
goto err;
- init_rt_bandwidth(&tg->rt_bandwidth,
+ init_rt_bandwidth(&tg->rt_rq_group.rt_bandwidth,
+ ktime_to_ns(def_rt_bandwidth.rt_period), 0);
+ tg->rt_rq_group.tg = tg;
+
+ init_rt_bandwidth(&tg->rt_task_group.rt_bandwidth,
ktime_to_ns(def_rt_bandwidth.rt_period), 0);
+ tg->rt_task_group.tg = tg;
for_each_possible_cpu(i) {
rq = cpu_rq(i);
@@ -9647,26 +9692,49 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
GFP_KERNEL, cpu_to_node(i));
if (!rt_se)
- goto err;
+ goto free_rt_rq;
+
+ rt_task_rq = kzalloc_node(sizeof(struct rt_rq),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!rt_task_rq)
+ goto free_rt_se;
+
+ rt_task_se = kzalloc_node(sizeof(struct sched_rt_entity),
+ GFP_KERNEL, cpu_to_node(i));
+ if (!rt_task_se)
+ goto free_rt_task_rq;
- init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent->rt_se[i]);
+ init_tg_rt_entry(tg, rt_rq, rt_se, rt_task_rq, rt_task_se,
+ i, 0, parent->rt_rq_group.rt_se[i]);
}
return 1;
+ free_rt_task_rq:
+ kfree(rt_task_rq);
+
+ free_rt_se:
+ kfree(rt_se);
+
+ free_rt_rq:
+ kfree(rt_rq);
+
err:
return 0;
}
static inline void register_rt_sched_group(struct task_group *tg, int cpu)
{
- list_add_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list,
+ list_add_rcu(&tg->rt_rq_group.rt_rq[cpu]->leaf_rt_rq_list,
+ &cpu_rq(cpu)->leaf_rt_rq_list);
+ list_add_rcu(&tg->rt_task_group.rt_rq[cpu]->leaf_rt_rq_list,
&cpu_rq(cpu)->leaf_rt_rq_list);
}
static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
{
- list_del_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list);
+ list_del_rcu(&tg->rt_rq_group.rt_rq[cpu]->leaf_rt_rq_list);
+ list_del_rcu(&tg->rt_task_group.rt_rq[cpu]->leaf_rt_rq_list);
}
#else /* !CONFIG_RT_GROUP_SCHED */
static inline void free_rt_sched_group(struct task_group *tg)
@@ -9911,7 +9979,7 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
struct task_struct *g, *p;
do_each_thread(g, p) {
- if (rt_task(p) && rt_rq_of_se(&p->rt)->tg == tg)
+ if (rt_task(p) && rt_rq_of_se(&p->rt)->rt_tg->tg == tg)
return 1;
} while_each_thread(g, p);
@@ -9922,19 +9990,58 @@ struct rt_schedulable_data {
struct task_group *tg;
u64 rt_period;
u64 rt_runtime;
+ int rt_task_group;
};
+static inline void rt_tg_parameters(struct task_rt_group *rt_tg,
+ u64 *period, u64 *runtime)
+{
+ struct rt_bandwidth *rt_b = &rt_tg->rt_bandwidth;
+
+ *period = ktime_to_ns(rt_b->rt_period);
+ *runtime = rt_b->rt_runtime;
+
+}
+
+static unsigned long tg_utilization(struct task_group *tg,
+ struct rt_schedulable_data *d)
+{
+ struct task_group *child;
+ unsigned long sum;
+ u64 period, runtime;
+
+ if (d && tg == d->tg && d->rt_task_group) {
+ period = d->rt_period;
+ runtime = d->rt_runtime;
+ } else
+ rt_tg_parameters(&tg->rt_task_group, &period, &runtime);
+
+ sum = to_ratio(period, runtime);
+
+ list_for_each_entry_rcu(child, &tg->children, siblings) {
+ if (d && child == d->tg && !d->rt_task_group) {
+ period = d->rt_period;
+ runtime = d->rt_runtime;
+ } else
+ rt_tg_parameters(&child->rt_rq_group,
+ &period, &runtime);
+
+ sum += to_ratio(period, runtime);
+ }
+
+ return sum;
+}
+
static int tg_schedulable(struct task_group *tg, void *data)
{
struct rt_schedulable_data *d = data;
- struct task_group *child;
- unsigned long total, sum = 0;
+ unsigned long total, sum;
u64 period, runtime;
- period = ktime_to_ns(tg->rt_bandwidth.rt_period);
- runtime = tg->rt_bandwidth.rt_runtime;
+ period = ktime_to_ns(tg->rt_rq_group.rt_bandwidth.rt_period);
+ runtime = tg->rt_rq_group.rt_bandwidth.rt_runtime;
- if (tg == d->tg) {
+ if (tg == d->tg && !d->rt_task_group) {
period = d->rt_period;
runtime = d->rt_runtime;
}
@@ -9949,7 +10056,7 @@ static int tg_schedulable(struct task_group *tg, void *data)
/*
* Cannot have more runtime than the period.
*/
- if (runtime > period && runtime != RUNTIME_INF)
+ if (runtime > period)
return -EINVAL;
/*
@@ -9969,17 +10076,7 @@ static int tg_schedulable(struct task_group *tg, void *data)
/*
* The sum of our children's runtime should not exceed our own.
*/
- list_for_each_entry_rcu(child, &tg->children, siblings) {
- period = ktime_to_ns(child->rt_bandwidth.rt_period);
- runtime = child->rt_bandwidth.rt_runtime;
-
- if (child == d->tg) {
- period = d->rt_period;
- runtime = d->rt_runtime;
- }
-
- sum += to_ratio(period, runtime);
- }
+ sum = tg_utilization(tg, d);
if (sum > total)
return -EINVAL;
@@ -9987,40 +10084,54 @@ static int tg_schedulable(struct task_group *tg, void *data)
return 0;
}
-static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
+static int __rt_schedulable(struct task_group *tg, u64 period,
+ u64 runtime, int rt_task_group)
{
struct rt_schedulable_data data = {
.tg = tg,
.rt_period = period,
.rt_runtime = runtime,
+ .rt_task_group = rt_task_group,
};
return walk_tg_tree(tg_schedulable, tg_nop, &data);
}
-static int tg_set_bandwidth(struct task_group *tg,
- u64 rt_period, u64 rt_runtime)
+static void rt_rq_set_bandwidth(struct task_rt_group *rt_tg, int cpu,
+ u64 rt_period, u64 rt_runtime)
{
+ struct rt_rq *rt_rq = rt_tg->rt_rq[cpu];
+
+ spin_lock(&rt_rq->rt_runtime_lock);
+ rt_rq->rt_runtime = rt_runtime;
+ rt_rq->rt_period = ns_to_ktime(rt_period);
+ spin_unlock(&rt_rq->rt_runtime_lock);
+}
+
+static int tg_set_bandwidth(struct task_group *tg, u64 rt_period,
+ u64 rt_runtime, int rt_task_group)
+{
+ struct task_rt_group *rt_tg;
int i, err = 0;
mutex_lock(&rt_constraints_mutex);
read_lock(&tasklist_lock);
- err = __rt_schedulable(tg, rt_period, rt_runtime);
+ err = __rt_schedulable(tg, rt_period, rt_runtime, rt_task_group);
if (err)
goto unlock;
- spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
- tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
- tg->rt_bandwidth.rt_runtime = rt_runtime;
+ if (rt_task_group)
+ rt_tg = &tg->rt_task_group;
+ else
+ rt_tg = &tg->rt_rq_group;
- for_each_possible_cpu(i) {
- struct rt_rq *rt_rq = tg->rt_rq[i];
+ spin_lock_irq(&rt_tg->rt_bandwidth.rt_runtime_lock);
+ rt_tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
+ rt_tg->rt_bandwidth.rt_runtime = rt_runtime;
- spin_lock(&rt_rq->rt_runtime_lock);
- rt_rq->rt_runtime = rt_runtime;
- spin_unlock(&rt_rq->rt_runtime_lock);
- }
- spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
+ for_each_possible_cpu(i)
+ rt_rq_set_bandwidth(rt_tg, i, rt_period, rt_runtime);
+ spin_unlock_irq(&rt_tg->rt_bandwidth.rt_runtime_lock);
unlock:
read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
@@ -10032,22 +10143,22 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
{
u64 rt_runtime, rt_period;
- rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period = ktime_to_ns(tg->rt_rq_group.rt_bandwidth.rt_period);
rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;
- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_bandwidth(tg, rt_period, rt_runtime, 0);
}
long sched_group_rt_runtime(struct task_group *tg)
{
u64 rt_runtime_us;
- if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+ if (tg->rt_rq_group.rt_bandwidth.rt_runtime == RUNTIME_INF)
return -1;
- rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+ rt_runtime_us = tg->rt_rq_group.rt_bandwidth.rt_runtime;
do_div(rt_runtime_us, NSEC_PER_USEC);
return rt_runtime_us;
}
@@ -10057,19 +10168,65 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
u64 rt_runtime, rt_period;
rt_period = (u64)rt_period_us * NSEC_PER_USEC;
- rt_runtime = tg->rt_bandwidth.rt_runtime;
+ rt_runtime = tg->rt_rq_group.rt_bandwidth.rt_runtime;
if (rt_period == 0)
return -EINVAL;
- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_bandwidth(tg, rt_period, rt_runtime, 0);
}
long sched_group_rt_period(struct task_group *tg)
{
u64 rt_period_us;
- rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+ rt_period_us = ktime_to_ns(tg->rt_rq_group.rt_bandwidth.rt_period);
+ do_div(rt_period_us, NSEC_PER_USEC);
+ return rt_period_us;
+}
+
+int sched_group_set_task_rt_runtime(struct task_group *tg, long rt_runtime_us)
+{
+ u64 rt_runtime, rt_period;
+
+ rt_period = ktime_to_ns(tg->rt_task_group.rt_bandwidth.rt_period);
+ rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
+ if (rt_runtime_us < 0)
+ rt_runtime = RUNTIME_INF;
+
+ return tg_set_bandwidth(tg, rt_period, rt_runtime, 1);
+}
+
+long sched_group_task_rt_runtime(struct task_group *tg)
+{
+ u64 rt_runtime_us;
+
+ if (tg->rt_task_group.rt_bandwidth.rt_runtime == RUNTIME_INF)
+ return -1;
+
+ rt_runtime_us = tg->rt_task_group.rt_bandwidth.rt_runtime;
+ do_div(rt_runtime_us, NSEC_PER_USEC);
+ return rt_runtime_us;
+}
+
+int sched_group_set_task_rt_period(struct task_group *tg, long rt_period_us)
+{
+ u64 rt_runtime, rt_period;
+
+ rt_period = (u64)rt_period_us * NSEC_PER_USEC;
+ rt_runtime = tg->rt_task_group.rt_bandwidth.rt_runtime;
+
+ if (rt_period == 0)
+ return -EINVAL;
+
+ return tg_set_bandwidth(tg, rt_period, rt_runtime, 1);
+}
+
+long sched_group_task_rt_period(struct task_group *tg)
+{
+ u64 rt_period_us;
+
+ rt_period_us = ktime_to_ns(tg->rt_task_group.rt_bandwidth.rt_period);
do_div(rt_period_us, NSEC_PER_USEC);
return rt_period_us;
}
@@ -10093,7 +10250,7 @@ static int sched_rt_global_constraints(void)
mutex_lock(&rt_constraints_mutex);
read_lock(&tasklist_lock);
- ret = __rt_schedulable(NULL, 0, 0);
+ ret = __rt_schedulable(NULL, 0, 0, 0);
read_unlock(&tasklist_lock);
mutex_unlock(&rt_constraints_mutex);
@@ -10103,7 +10260,7 @@ static int sched_rt_global_constraints(void)
int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
{
/* Don't accept realtime tasks when there is no way for them to run */
- if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
+ if (rt_task(tsk) && tg->rt_task_group.rt_bandwidth.rt_runtime == 0)
return 0;
return 1;
@@ -10131,6 +10288,7 @@ static int sched_rt_global_constraints(void)
spin_lock(&rt_rq->rt_runtime_lock);
rt_rq->rt_runtime = global_rt_runtime();
+ rt_rq->rt_period = ns_to_ktime(global_rt_period());
spin_unlock(&rt_rq->rt_runtime_lock);
}
spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags);
@@ -10254,6 +10412,17 @@ static s64 cpu_rt_runtime_read(struct cgroup *cgrp, struct cftype *cft)
return sched_group_rt_runtime(cgroup_tg(cgrp));
}
+static int cpu_rt_task_runtime_write(struct cgroup *cgrp, struct cftype *cft,
+ s64 val)
+{
+ return sched_group_set_task_rt_runtime(cgroup_tg(cgrp), val);
+}
+
+static s64 cpu_rt_task_runtime_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ return sched_group_task_rt_runtime(cgroup_tg(cgrp));
+}
+
static int cpu_rt_period_write_uint(struct cgroup *cgrp, struct cftype *cftype,
u64 rt_period_us)
{
@@ -10264,6 +10433,17 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
{
return sched_group_rt_period(cgroup_tg(cgrp));
}
+
+static int cpu_rt_task_period_write_uint(struct cgroup *cgrp,
+ struct cftype *cftype, u64 rt_period_us)
+{
+ return sched_group_set_task_rt_period(cgroup_tg(cgrp), rt_period_us);
+}
+
+static u64 cpu_rt_task_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
+{
+ return sched_group_task_rt_period(cgroup_tg(cgrp));
+}
#endif /* CONFIG_RT_GROUP_SCHED */
static struct cftype cpu_files[] = {
@@ -10285,6 +10465,16 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_rt_period_read_uint,
.write_u64 = cpu_rt_period_write_uint,
},
+ {
+ .name = "rt_task_runtime_us",
+ .read_s64 = cpu_rt_task_runtime_read,
+ .write_s64 = cpu_rt_task_runtime_write,
+ },
+ {
+ .name = "rt_task_period_us",
+ .read_u64 = cpu_rt_task_period_read_uint,
+ .write_u64 = cpu_rt_task_period_write_uint,
+ },
#endif
};
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 467ca72..895b2c7 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -222,7 +222,7 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
{
#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_RT_GROUP_SCHED)
char path[128];
- struct task_group *tg = rt_rq->tg;
+ struct task_group *tg = rt_rq->rt_tg->tg;
task_group_path(tg, path, sizeof(path));
@@ -238,7 +238,7 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x))
P(rt_nr_running);
- P(rt_throttled);
+ P(rt_flags);
PN(rt_time);
PN(rt_runtime);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index c23f3ad..5d2353f 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -143,7 +143,7 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
{
- if (!rt_rq->tg)
+ if (!rt_rq->rt_tg->tg)
return RUNTIME_INF;
return rt_rq->rt_runtime;
@@ -151,7 +151,7 @@ static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
static inline u64 sched_rt_period(struct rt_rq *rt_rq)
{
- return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
+ return ktime_to_ns(rt_rq->rt_tg->rt_bandwidth.rt_period);
}
#define for_each_leaf_rt_rq(rt_rq, rq) \
@@ -191,19 +191,7 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
static inline int rt_rq_throttled(struct rt_rq *rt_rq)
{
- return rt_rq->rt_throttled && !rt_rq->rt_nr_boosted;
-}
-
-static int rt_se_boosted(struct sched_rt_entity *rt_se)
-{
- struct rt_rq *rt_rq = group_rt_rq(rt_se);
- struct task_struct *p;
-
- if (rt_rq)
- return !!rt_rq->rt_nr_boosted;
-
- p = rt_task_of(rt_se);
- return p->prio != p->normal_prio;
+ return (rt_rq->rt_flags & RT_RQ_THROTTLED) && !rt_rq->rt_nr_boosted;
}
#ifdef CONFIG_SMP
@@ -221,12 +209,13 @@ static inline const struct cpumask *sched_rt_period_mask(void)
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
- return container_of(rt_b, struct task_group, rt_bandwidth)->rt_rq[cpu];
+ return container_of(rt_b, struct task_rt_group,
+ rt_bandwidth)->rt_rq[cpu];
}
static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
{
- return &rt_rq->tg->rt_bandwidth;
+ return &rt_rq->rt_tg->rt_bandwidth;
}
#else /* !CONFIG_RT_GROUP_SCHED */
@@ -264,7 +253,7 @@ static inline void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
static inline int rt_rq_throttled(struct rt_rq *rt_rq)
{
- return rt_rq->rt_throttled;
+ return rt_rq->rt_flags & RT_RQ_THROTTLED;
}
static inline const struct cpumask *sched_rt_period_mask(void)
@@ -285,6 +274,18 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
#endif /* CONFIG_RT_GROUP_SCHED */
+static int rt_se_boosted(struct sched_rt_entity *rt_se)
+{
+ struct rt_rq *rt_rq = group_rt_rq(rt_se);
+ struct task_struct *p;
+
+ if (rt_rq)
+ return !!rt_rq->rt_nr_boosted;
+
+ p = rt_task_of(rt_se);
+ return p->prio != p->normal_prio;
+}
+
#ifdef CONFIG_SMP
/*
* Ensure this RQ takes back all the runtime it lend to its neighbours.
@@ -326,7 +327,7 @@ static void __enable_runtime(struct rq *rq)
spin_lock(&rt_rq->rt_runtime_lock);
rt_rq->rt_runtime = rt_b->rt_runtime;
rt_rq->rt_time = 0;
- rt_rq->rt_throttled = 0;
+ rt_rq->rt_flags = 0;
spin_unlock(&rt_rq->rt_runtime_lock);
spin_unlock(&rt_b->rt_runtime_lock);
}
@@ -342,43 +343,97 @@ static void enable_runtime(struct rq *rq)
}
#endif /* CONFIG_SMP */
-static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
+static int __do_sched_rt_period_timer(struct rt_rq *rt_rq, int overrun)
+{
+ int idle = 1, enqueue = 0;
+
+ if (rt_rq->rt_time) {
+ u64 runtime;
+
+ runtime = rt_rq->rt_runtime;
+ rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
+ rt_rq->rt_deadline += overrun*ktime_to_ns(rt_rq->rt_period);
+ if ((rt_rq->rt_flags & RT_RQ_THROTTLED) &&
+ rt_rq->rt_time < runtime) {
+ rt_rq->rt_flags &= ~RT_RQ_THROTTLED;
+ enqueue = 1;
+ }
+ if (rt_rq->rt_time || rt_rq->rt_nr_running)
+ idle = 0;
+ } else if (rt_rq->rt_nr_running)
+ idle = 0;
+
+ if (enqueue)
+ sched_rt_rq_enqueue(rt_rq);
+
+ return idle;
+}
+
+static int do_sched_rt_period_timer(struct rt_rq *rt_rq, int overrun)
{
- int i, idle = 1;
- const struct cpumask *span;
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ int idle;
- if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
+ if (!rt_bandwidth_enabled() || rt_rq->rt_runtime == RUNTIME_INF)
return 1;
- span = sched_rt_period_mask();
- for_each_cpu(i, span) {
- int enqueue = 0;
- struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
- struct rq *rq = rq_of_rt_rq(rt_rq);
+ spin_lock(&rq->lock);
+ spin_lock(&rt_rq->rt_runtime_lock);
+ idle = __do_sched_rt_period_timer(rt_rq, overrun);
+ spin_unlock(&rt_rq->rt_runtime_lock);
+ spin_unlock(&rq->lock);
- spin_lock(&rq->lock);
- if (rt_rq->rt_time) {
- u64 runtime;
+ return idle;
+}
- spin_lock(&rt_rq->rt_runtime_lock);
- runtime = rt_rq->rt_runtime;
- rt_rq->rt_time -= min(rt_rq->rt_time, overrun*runtime);
- if (rt_rq->rt_throttled && rt_rq->rt_time < runtime) {
- rt_rq->rt_throttled = 0;
- enqueue = 1;
- }
- if (rt_rq->rt_time || rt_rq->rt_nr_running)
- idle = 0;
- spin_unlock(&rt_rq->rt_runtime_lock);
- } else if (rt_rq->rt_nr_running)
- idle = 0;
+static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
+{
+ struct rt_rq *rt_rq = container_of(timer, struct rt_rq,
+ rt_period_timer);
+ int overrun;
+ int idle = 0;
+
+ for (;;) {
+ overrun = hrtimer_forward_now(timer, rt_rq->rt_period);
+
+ if (!overrun)
+ break;
- if (enqueue)
- sched_rt_rq_enqueue(rt_rq);
- spin_unlock(&rq->lock);
+ idle = do_sched_rt_period_timer(rt_rq, overrun);
}
- return idle;
+ return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static void start_rt_period_timer(struct rt_rq *rt_rq)
+{
+ ktime_t now, delta, soft, hard, dline = ns_to_ktime(rt_rq->rt_deadline);
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ unsigned long range;
+ int overrun;
+
+ now = hrtimer_cb_get_time(&rt_rq->rt_period_timer);
+ delta = ktime_sub_ns(now, rq->clock);
+ dline = ktime_add(dline, delta);
+
+ hrtimer_set_expires(&rt_rq->rt_period_timer, dline);
+
+ for (;;) {
+ if (hrtimer_active(&rt_rq->rt_period_timer))
+ return;
+
+ overrun = hrtimer_forward_now(&rt_rq->rt_period_timer,
+ rt_rq->rt_period);
+
+ if (overrun)
+ __do_sched_rt_period_timer(rt_rq, overrun);
+
+ soft = hrtimer_get_softexpires(&rt_rq->rt_period_timer);
+ hard = hrtimer_get_expires(&rt_rq->rt_period_timer);
+ range = ktime_to_ns(ktime_sub(hard, soft));
+ __hrtimer_start_range_ns(&rt_rq->rt_period_timer, soft,
+ range, HRTIMER_MODE_ABS, 0);
+ }
}
static inline int rt_se_prio(struct sched_rt_entity *rt_se)
@@ -397,7 +452,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
{
u64 runtime = sched_rt_runtime(rt_rq);
- if (rt_rq->rt_throttled)
+ if (rt_rq->rt_flags & RT_RQ_THROTTLED)
return rt_rq_throttled(rt_rq);
if (sched_rt_runtime(rt_rq) >= sched_rt_period(rt_rq))
@@ -408,7 +463,8 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
return 0;
if (rt_rq->rt_time > runtime) {
- rt_rq->rt_throttled = 1;
+ rt_rq->rt_flags |= RT_RQ_THROTTLED;
+ start_rt_period_timer(rt_rq);
if (rt_rq_throttled(rt_rq)) {
sched_rt_rq_dequeue(rt_rq);
return 1;
@@ -460,6 +516,24 @@ static void update_curr_rt(struct rq *rq)
}
}
+static inline struct sched_rt_entity *__rt_edf_first(struct rt_edf_tree *tree)
+{
+ if (!tree->rb_leftmost)
+ return NULL;
+
+ return rb_entry(tree->rb_leftmost, struct sched_rt_entity, rb_node);
+}
+
+static inline struct sched_rt_entity *__rt_edf_next(struct sched_rt_entity *se)
+{
+ struct rb_node *next = rb_next(&se->rb_node);
+
+ if (!next)
+ return NULL;
+
+ return rb_entry(next, struct sched_rt_entity, rb_node);
+}
+
#if defined CONFIG_SMP
static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu);
@@ -526,14 +600,6 @@ void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
#endif /* CONFIG_SMP */
-static inline struct sched_rt_entity *__rt_edf_first(struct rt_edf_tree *tree)
-{
- if (!tree->rb_leftmost)
- return NULL;
-
- return rb_entry(tree->rb_leftmost, struct sched_rt_entity, rb_node);
-}
-
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
static void
inc_rt_prio(struct rt_rq *rt_rq, int prio)
@@ -578,16 +644,11 @@ static inline void dec_rt_prio(struct rt_rq *rt_rq, int prio) {}
#endif /* CONFIG_SMP || CONFIG_RT_GROUP_SCHED */
-#ifdef CONFIG_RT_GROUP_SCHED
-
static void
inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
if (rt_se_boosted(rt_se))
rt_rq->rt_nr_boosted++;
-
- if (rt_rq->tg)
- start_rt_bandwidth(&rt_rq->tg->rt_bandwidth);
}
static void
@@ -599,19 +660,6 @@ dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
WARN_ON(!rt_rq->rt_nr_running && rt_rq->rt_nr_boosted);
}
-#else /* CONFIG_RT_GROUP_SCHED */
-
-static void
-inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
-{
- start_rt_bandwidth(&def_rt_bandwidth);
-}
-
-static inline
-void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
-
-#endif /* CONFIG_RT_GROUP_SCHED */
-
static inline
void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
{
@@ -637,10 +685,36 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
dec_rt_group(rt_se, rt_rq);
}
+static inline int rt_time_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
static inline int rt_entity_before(struct sched_rt_entity *a,
struct sched_rt_entity *b)
{
- return rt_se_prio(a) < rt_se_prio(b);
+ struct rt_rq *rqa = group_rt_rq(a), *rqb = group_rt_rq(b);
+
+ /*
+ * Schedule by priority if:
+ * - both a and b are tasks;
+ * - both a and b are boosted;
+ * - throttling is disabled system-wide.
+ */
+ if ((!rqa && !rqb) || (rqa->rt_nr_boosted && rqb->rt_nr_boosted) ||
+ global_rt_runtime() == RUNTIME_INF)
+ return rt_se_prio(a) < rt_se_prio(b);
+
+ /* Only a is boosted, choose it. */
+ if (rqa->rt_nr_boosted)
+ return 1;
+
+ /* Only b is boosted, choose it. */
+ if (rqb->rt_nr_boosted)
+ return 0;
+
+ /* Use the deadlines to order entities. */
+ return rt_time_before(rqa->rt_deadline, rqb->rt_deadline);
}
static void __rt_entity_insert(struct rt_edf_tree *tree,
@@ -698,6 +772,44 @@ static int __rt_entity_insert_head(struct rt_edf_tree *tree,
return 0;
}
+static void rt_rq_update_deadline(struct rt_rq *rt_rq)
+{
+ struct rq *rq = rq_of_rt_rq(rt_rq);
+ u64 period = ktime_to_ns(rt_rq->rt_period);
+ u64 runtime = sched_rt_runtime(rt_rq);
+ u64 left, right;
+
+ /*
+ * Update the deadline to the current time only if:
+ * - it is in the past;
+ * - using it would lead to a timeframe during which the
+ * group would exceed its allocated bandwidth.
+ *
+ * For the second condition to hold, we check that in the
+ * time left before the deadline, using the residual budget,
+ * the group would exceed its runtime / period share.
+ * In formula:
+ * rt_time / (deadline - rq->clock) >= runtime / period
+ *
+ * left and right are the two sides of the equation, after a bit
+ * of shuffling to use multiplications instead of divisions.
+ */
+ left = period * rt_rq->rt_time;
+ right = (rt_rq->rt_deadline - rq->clock) * rt_rq->rt_runtime;
+
+ if (rt_time_before(rt_rq->rt_deadline, rq->clock) ||
+ rt_time_before(right, left)) {
+ rt_rq->rt_deadline = rq->clock;
+
+ while (rt_rq->rt_time > period) {
+ rt_rq->rt_deadline += period;
+ rt_rq->rt_time -= min(rt_rq->rt_time, runtime);
+ }
+ }
+
+ rt_rq->rt_flags &= ~RT_RQ_NEEDS_UPDATE;
+}
+
static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
@@ -709,8 +821,13 @@ static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
* get throttled and the current group doesn't have any other
* active members.
*/
- if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
- return;
+ if (group_rq) {
+ if (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running)
+ return;
+
+ if (group_rq->rt_flags & RT_RQ_NEEDS_UPDATE)
+ rt_rq_update_deadline(group_rq);
+ }
__rt_entity_insert(&rt_rq->active, rt_se);
inc_rt_tasks(rt_se, rt_rq);
@@ -734,6 +851,14 @@ static void __dequeue_rt_entity(struct sched_rt_entity *rt_se)
dec_rt_tasks(rt_se, rt_rq);
}
+static inline void __rt_entity_update_flags(struct sched_rt_entity *rt_se)
+{
+ struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+ if (rt_rq && !on_rt_rq(rt_se))
+ rt_rq->rt_flags |= RT_RQ_NEEDS_UPDATE;
+}
+
/*
* Because the prio of an upper entry depends on the lower
* entries, we must remove entries top - down.
@@ -748,6 +873,8 @@ static void dequeue_rt_stack(struct sched_rt_entity *rt_se)
}
for (rt_se = back; rt_se; rt_se = rt_se->back) {
+ __rt_entity_update_flags(rt_se);
+
if (on_rt_rq(rt_se))
__dequeue_rt_entity(rt_se);
}
@@ -899,9 +1026,10 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync)
+static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p,
+ int sync)
{
- if (p->prio < rq->curr->prio) {
+ if (rt_entity_before(&p->rt, &rq->curr->rt)) {
resched_task(rq->curr);
return;
}
--
1.6.2.2
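To illustrate the recharge rule encoded in rt_rq_update_deadline() in the
patch above, here is a minimal user-space sketch. The demo_rq structure,
its field names and the values in main() are illustrative stand-ins for
the rt_rq fields, not kernel code; the update logic mirrors the patch.

#include <stdio.h>
#include <stdint.h>

struct demo_rq {
	uint64_t clock;      /* current time, ns */
	uint64_t deadline;   /* current group deadline, ns */
	uint64_t time;       /* budget consumed in the current period, ns */
	uint64_t runtime;    /* budget available per period, ns */
	uint64_t period;     /* period length, ns */
};

static int time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

static void update_deadline(struct demo_rq *rq)
{
	/*
	 * Keep the old deadline only if, with the residual budget, the
	 * group stays within its runtime/period share, i.e. only if
	 *   time / (deadline - clock) < runtime / period
	 * which, cross-multiplied, becomes
	 *   period * time < (deadline - clock) * runtime.
	 */
	uint64_t left = rq->period * rq->time;
	uint64_t right = (rq->deadline - rq->clock) * rq->runtime;

	if (time_before(rq->deadline, rq->clock) || time_before(right, left)) {
		rq->deadline = rq->clock;
		while (rq->time > rq->period) {
			rq->deadline += rq->period;
			rq->time -= rq->time < rq->runtime ? rq->time : rq->runtime;
		}
	}
}

int main(void)
{
	/* 10ms of budget every 100ms; the deadline is already in the past. */
	struct demo_rq rq = {
		.clock    = 1000000000ULL,
		.deadline =  900000000ULL,
		.time     =    5000000ULL,
		.runtime  =   10000000ULL,
		.period   =  100000000ULL,
	};

	update_deadline(&rq);
	printf("new deadline: %llu ns\n", (unsigned long long)rq.deadline);
	return 0;
}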
Synchronize the code that tracks the highest and the second-highest
priority in each group with the data structures used for EDF throttling.
---
kernel/sched_rt.c | 82 ++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 63 insertions(+), 19 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 5d2353f..f43ce7b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -536,16 +536,57 @@ static inline struct sched_rt_entity *__rt_edf_next(struct sched_rt_entity *se)
#if defined CONFIG_SMP
-static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu);
+static int is_task_rq(struct rt_rq *rt_rq)
+{
+ struct task_rt_group *rt_tg = rt_rq->rt_tg;
+ return rt_tg->tg == container_of(rt_tg, struct task_group,
+ rt_task_group);
+}
-static inline int next_prio(struct rq *rq)
+static void find_highest_prios(struct rt_rq *rt_rq, int *curr, int *next)
{
- struct task_struct *next = pick_next_highest_task_rt(rq, rq->cpu);
+ struct sched_rt_entity *se, *next_se;
+ struct rt_rq *child;
+ struct rb_node *node;
+ int curp = MAX_RT_PRIO, nextp = MAX_RT_PRIO;
+
+ if (is_task_rq(rt_rq)) {
+ se = __rt_edf_first(&rt_rq->active);
+ if (se) {
+ curp = rt_task_of(se)->prio;
+ next_se = __rt_edf_next(se);
+ if (next_se)
+ nextp = rt_task_of(next_se)->prio;
+ }
+ } else {
+ for (node = rb_first(&rt_rq->active.rb_root);
+ node; node = rb_next(node)) {
+ se = rb_entry(node, struct sched_rt_entity, rb_node);
+ child = group_rt_rq(se);
+
+ if (child->highest_prio.next < curp) {
+ curp = child->highest_prio.curr;
+ nextp = child->highest_prio.next;
+ } else if (child->highest_prio.curr < curp) {
+ nextp = curp;
+ curp = child->highest_prio.curr;
+ if (child->highest_prio.next < nextp)
+ nextp = child->highest_prio.next;
+ } else if (child->highest_prio.curr < nextp)
+ nextp = child->highest_prio.curr;
+ }
+ }
- if (next && rt_prio(next->prio))
- return next->prio;
- else
- return MAX_RT_PRIO;
+ *curr = curp;
+ *next = nextp;
+}
+
+static inline int next_prio(struct rt_rq *rt_rq)
+{
+ int curp, nextp;
+
+ find_highest_prios(rt_rq, &curp, &nextp);
+ return nextp;
}
static void
@@ -554,7 +595,6 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
struct rq *rq = rq_of_rt_rq(rt_rq);
if (prio < prev_prio) {
-
/*
* If the new task is higher in priority than anything on the
* run-queue, we know that the previous high becomes our
@@ -576,18 +616,22 @@ inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
/*
* Otherwise, we need to recompute next-highest
*/
- rt_rq->highest_prio.next = next_prio(rq);
+ rt_rq->highest_prio.next = next_prio(rt_rq);
}
static void
-dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
+dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prevp, int nextp)
{
struct rq *rq = rq_of_rt_rq(rt_rq);
- if (rt_rq->rt_nr_running && (prio <= rt_rq->highest_prio.next))
- rt_rq->highest_prio.next = next_prio(rq);
+ if (rt_rq->rt_nr_running && (prio <= rt_rq->highest_prio.next)) {
+ if (nextp == -1)
+ rt_rq->highest_prio.next = next_prio(rt_rq);
+ else
+ rt_rq->highest_prio.next = nextp;
+ }
- if (rq->online && rt_rq->highest_prio.curr != prev_prio)
+ if (rq->online && rt_rq->highest_prio.curr != prevp)
cpupri_set(&rq->rd->cpupri, rq->cpu, rt_rq->highest_prio.curr);
}
@@ -596,7 +640,8 @@ dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
static inline
void inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
static inline
-void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio) {}
+void dec_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio,
+ int next_prio) {}
#endif /* CONFIG_SMP */
@@ -615,8 +660,7 @@ inc_rt_prio(struct rt_rq *rt_rq, int prio)
static void
dec_rt_prio(struct rt_rq *rt_rq, int prio)
{
- struct sched_rt_entity *rt_se;
- int prev_prio = rt_rq->highest_prio.curr;
+ int prev_prio = rt_rq->highest_prio.curr, curr_prio, next_prio = -1;
if (rt_rq->rt_nr_running) {
@@ -627,14 +671,14 @@ dec_rt_prio(struct rt_rq *rt_rq, int prio)
* we may have some recomputation to do
*/
if (prio == prev_prio) {
- rt_se = __rt_edf_first(&rt_rq->active);
- rt_rq->highest_prio.curr = rt_se_prio(rt_se);
+ find_highest_prios(rt_rq, &curr_prio, &next_prio);
+ rt_rq->highest_prio.curr = curr_prio;
}
} else
rt_rq->highest_prio.curr = MAX_RT_PRIO;
- dec_rt_prio_smp(rt_rq, prio, prev_prio);
+ dec_rt_prio_smp(rt_rq, prio, prev_prio, next_prio);
}
#else
--
1.6.2.2
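The (curr, next) merge performed by find_highest_prios() in the patch
above can be tricky to read, so here is a small stand-alone sketch of the
same idea (lower value means higher priority). The prio_pair type and the
sample values are illustrative only, not kernel code.

#include <stdio.h>

#define MAX_RT_PRIO 100

struct prio_pair {
	int curr;	/* highest (numerically lowest) priority */
	int next;	/* second-highest priority */
};

/* Fold one child's (curr, next) pair into the running best two. */
static void merge_child(struct prio_pair *acc, struct prio_pair child)
{
	if (child.next < acc->curr) {
		/* Both of the child's priorities beat our current best. */
		*acc = child;
	} else if (child.curr < acc->curr) {
		/* The child's best becomes the new best; the old best is
		 * the new runner-up (child.next cannot beat it here). */
		acc->next = acc->curr;
		acc->curr = child.curr;
	} else if (child.curr < acc->next) {
		acc->next = child.curr;
	}
}

int main(void)
{
	struct prio_pair children[] = { { 30, 45 }, { 10, 20 }, { 15, 99 } };
	struct prio_pair acc = { MAX_RT_PRIO, MAX_RT_PRIO };
	unsigned int i;

	for (i = 0; i < sizeof(children) / sizeof(children[0]); i++)
		merge_child(&acc, children[i]);

	/* Expect curr=10 next=15: the two best priorities overall. */
	printf("curr=%d next=%d\n", acc.curr, acc.next);
	return 0;
}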
Avoid unnecessary timer reprogramming when a group is at full capacity
and already enqueued. Such a group will not need further budget recharges
until it actually consumes its budget, so do not rearm its timer
immediately.
---
kernel/sched_rt.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index f43ce7b..a9db16b 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -386,6 +386,19 @@ static int do_sched_rt_period_timer(struct rt_rq *rt_rq, int overrun)
return idle;
}
+static inline int rt_rq_needs_recharge(struct rt_rq *rt_rq)
+{
+ struct sched_rt_entity *rt_se = rt_rq->rt_se;
+
+ if (rt_rq->rt_time)
+ return 1;
+
+ if (!rt_se)
+ return 0;
+
+ return !on_rt_rq(rt_rq->rt_se);
+}
+
static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
{
struct rt_rq *rt_rq = container_of(timer, struct rt_rq,
@@ -402,7 +415,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
idle = do_sched_rt_period_timer(rt_rq, overrun);
}
- return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+ return idle || !rt_rq_needs_recharge(rt_rq) ?
+ HRTIMER_NORESTART : HRTIMER_RESTART;
}
static void start_rt_period_timer(struct rt_rq *rt_rq)
@@ -428,6 +442,9 @@ static void start_rt_period_timer(struct rt_rq *rt_rq)
if (overrun)
__do_sched_rt_period_timer(rt_rq, overrun);
+ if (!rt_rq_needs_recharge(rt_rq))
+ return;
+
soft = hrtimer_get_softexpires(&rt_rq->rt_period_timer);
hard = hrtimer_get_expires(&rt_rq->rt_period_timer);
range = ktime_to_ns(ktime_sub(hard, soft));
--
1.6.2.2
For more precise budget enforcement, use the hrtick when available.
This comes at the cost of additional overhead, but it is disabled by
default.
---
kernel/sched_rt.c | 32 +++++++++++++++++++++++++++++++-
1 files changed, 31 insertions(+), 1 deletions(-)
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index a9db16b..38a7be0 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -1150,14 +1150,41 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return p;
}
+static void rt_start_hrtick(struct rq *rq, struct task_struct *p)
+{
+ struct sched_rt_entity *rt_se = &p->rt;
+ struct rt_rq *rt_rq;
+ s64 delta, expires = global_rt_period();
+
+ for_each_sched_rt_entity(rt_se) {
+ rt_rq = rt_rq_of_se(rt_se);
+ delta = rt_rq->rt_runtime - rt_rq->rt_time;
+ expires = min(expires, delta);
+ }
+
+ /*
+ * Given the current handling of hrtick in sched_fair, this
+ * may interfere with sched_fair tasks, generating a spurious
+ * tick when switching from an RT to a SCHED_OTHER task.
+ * This is due to the fact that sched_fair reprograms the
+ * hrtick only in its enqueue/dequeue codepaths.
+ */
+ if (expires > 10000)
+ hrtick_start(rq, expires);
+}
+
static struct task_struct *pick_next_task_rt(struct rq *rq)
{
struct task_struct *p = _pick_next_task_rt(rq);
/* The running task is never eligible for pushing */
- if (p)
+ if (p) {
dequeue_pushable_task(rq, p);
+ if (hrtick_enabled(rq))
+ rt_start_hrtick(rq, p);
+ }
+
return p;
}
@@ -1794,6 +1821,9 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
update_curr_rt(rq);
+ if (queued)
+ return;
+
watchdog(rq, p);
/*
--
1.6.2.2
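To make the budget walk in rt_start_hrtick() above concrete, this is a
hedged user-space sketch: the smallest residual budget along the entity's
group chain bounds how long the task may run before some level throttles
it. The level structure and the numbers below are hypothetical, not
kernel data.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct level {
	int64_t runtime;	/* budget per period, ns */
	int64_t time;		/* budget already consumed, ns */
	struct level *parent;
};

/* Walk up the group chain and return the smallest residual budget. */
static int64_t residual_budget(const struct level *l, int64_t limit)
{
	int64_t expires = limit;

	for (; l; l = l->parent) {
		int64_t delta = l->runtime - l->time;

		if (delta < expires)
			expires = delta;
	}
	return expires;
}

int main(void)
{
	struct level root  = { 950000000LL, 100000000LL, NULL };  /* 850ms left */
	struct level group = {  10000000LL,   8000000LL, &root }; /*   2ms left */

	int64_t expires = residual_budget(&group, 1000000000LL);

	/* Mirror the patch: only arm the tick for a meaningful interval. */
	if (expires > 10000)
		printf("arm hrtick in %lld ns\n", (long long)expires);
	return 0;
}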
Hi Fabio,
Sorry for the delay, but we already spoke in person about this last
week; here is a bit more detail.
On Mon, 2009-06-15 at 20:55 +0200, Fabio Checconi wrote:
> This patchset introduces a group level EDF scheduler to extend the
> current throttling mechanism, in order to make it support generic
> period assignments. With this patch, the rt_runtime and rt_period
> parameters can be used to specify arbitrary CPU reservations for
> RT tasks.
> This is an early RFC, I'm interested in having an idea of what people
> think about this feature, if it's worth working on it, what can be
> improved in the design, etc.
>
> The main design issues involved:
>
> - It is no more possible to specify RUNTIME_INF for a task group
> when throttling is enabled. Rationale: supporting both throttled
> and unthrottled groups would have required too much extra complexity
> (I didn't find anything simpler than two parallel runqueues, one for
> throttled and one for unthrottled groups).
I would think this is more than reasonable to the point that I wonder if
it was possible to begin with. I was assuming the current bandwidth
validation stuff would already exclude this possibility.
> - Since it is not easy to mix tasks and groups on the same scheduler
> queue (tasks have no deadlines), the bandwidth reserved to the tasks
> in a group is controlled with two additional cgroup attributes:
> rt_task_runtime_us and rt_task_period_us. These attrinutes control,
> within a cgroup, how much bandwidth is reserved to the tasks it
> contains.
Agreed.
> - Shared resources are still handled using boosting. When a group
> contains a task inside a critical section it is scheduled according
> the highest priority among the ones of the tasks it contains.
> In this way, the same group has two modes: when it is not boosted
> it is scheduled according to its deadline; when it is boosted, it
> is scheduled according its priority. Boosted groups are always
> favored over non-boosted ones.
Yeah, for now this PCP-like solution is the best we have. Preferably
we'll come up with something really smart soon :-)
> - The old priority array is now gone. To use only a single data
> structure for entities using both priorities and deadlines (due
> to boosting), the only possible choice was to use an rb-tree;
> the function used to order the keys takes into account the
> prioritization described above (boosted tasks, ordered by
> priority are favored to non-boosted tasks, ordered by increasing
> deadline).
>
> - Given that the various rt_rq's belonging to the same task group
> are activated independently, there is the need of a timer per
> each rt_rq.
Like I suggested last week, we could flatten the full hierarchy to a
two-level one, the top level being an EDF-like scheduler which purely
schedules the 'phantom' task groups created by the new rt_task_*_us
parameters.
Once such a task group is selected we use the regular FIFO rules to pick
a task.
Further, as mentioned, you remove the bandwidth distribution between
cpus in patch 4. You do this because you schedule the groups using PEDF;
however, I was thinking that when we use GEDF to schedule the groups we
can use the theory from:
H. Leontyev and J. Anderson, "A Hierarchical Multiprocessor Bandwidth
Reservation Scheme with Timing Guarantees", Proceedings of the 20th
Euromicro Conference on Real-Time Systems, pp. 191-200, July 2008.
http://www.cs.unc.edu/~anderson/papers/ecrts08c.pdf
This would allow us to lift this constraint.
As to load-balancing: the current scheme of load-balancing the tasks
utterly ignores the deadline settings and would also need some suitable
changes. I fear this part will be a little more challenging.
I would think that by using GEDF and minimal concurrency group
scheduling we'd get the desired deadline behaviour. After that we'd be
left with the task of selecting the FIFO tasks within the dynamic vcpu
range resulting from the deadline server.
We could simply implement global-fifo to make it work, and then move
towards adapting the current (!group) load-balancer to work within these
more dynamic constraints.
Thoughts?
On Mon, 2009-06-15 at 21:05 +0200, Fabio Checconi wrote:
> The current hrtick implementation can end up asking to the hrtimer code
> to wake up ksoftirqd while holding rq->lock, causing a spinlock recursion.
> This patch uses __hrtimer_start_range_ns() to disable the wakeup in the
> reprogramming path.
Right. A while back I did the following patch and handed it to someone
on IRC to continue with, but I guess that never happened.
The thing that currently gets us into trouble is that when we
hrtimer_start() an already expired timer we try to run it in place.
However we cannot do that in context because we might start the timer
under locks that the timer callback also wants, leading to a deadlock.
The current solution is to queue it in a softirq and kick the softirq.
This however has the problem that we cannot do a wakeup when holding the
rq->lock, which is where __hrtimer_start*() enters.
We can avoid all of this mess by simply failing the hrtimer_start*()
with -ETIME when we pass it an already expired timer, which is what the
below patch does.
All that needs doing is (testing and possibly forward-porting this patch,
it's a few weeks old now) auditing all hrtimer usage sites to make sure
the new behaviour works as expected, etc.
Not-Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/hrtimer.h | 5 ---
include/linux/interrupt.h | 1 -
kernel/hrtimer.c | 75 ++++++++++++++++++---------------------------
kernel/perf_counter.c | 27 ++++++++++++----
kernel/sched.c | 15 +++------
5 files changed, 55 insertions(+), 68 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index dd44de5..46a9951 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -339,11 +339,6 @@ extern int hrtimer_start(struct hrtimer *timer, ktime_t tim,
const enum hrtimer_mode mode);
extern int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
unsigned long range_ns, const enum hrtimer_mode mode);
-extern int
-__hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- unsigned long delta_ns,
- const enum hrtimer_mode mode, int wakeup);
-
extern int hrtimer_cancel(struct hrtimer *timer);
extern int hrtimer_try_to_cancel(struct hrtimer *timer);
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index dd574d5..62d3073 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -337,7 +337,6 @@ enum
BLOCK_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
- HRTIMER_SOFTIRQ,
RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
NR_SOFTIRQS
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 050382e..ba85ffd 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -699,19 +699,10 @@ static inline void hrtimer_init_timer_hres(struct hrtimer *timer)
* and expiry check is done in the hrtimer_interrupt or in the softirq.
*/
static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
- struct hrtimer_clock_base *base,
- int wakeup)
+ struct hrtimer_clock_base *base)
{
- if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base)) {
- if (wakeup) {
- spin_unlock(&base->cpu_base->lock);
- raise_softirq_irqoff(HRTIMER_SOFTIRQ);
- spin_lock(&base->cpu_base->lock);
- } else
- __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
-
- return 1;
- }
+ if (base->cpu_base->hres_active && hrtimer_reprogram(timer, base))
+ return -ETIME;
return 0;
}
@@ -941,18 +932,31 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
return 0;
}
-int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- unsigned long delta_ns, const enum hrtimer_mode mode,
- int wakeup)
+/**
+ * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
+ * @timer: the timer to be added
+ * @tim: expiry time
+ * @delta_ns: "slack" range for the timer
+ * @mode: expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
+ *
+ * Returns:
+ * 0 on success
+ * -EEXIST when the timer was active
+ * -ETIME when the timer has already expired
+ */
+int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
+ unsigned long delta_ns, const enum hrtimer_mode mode)
{
struct hrtimer_clock_base *base, *new_base;
unsigned long flags;
- int ret, leftmost;
+ int tmp, leftmost, ret = 0;
base = lock_hrtimer_base(timer, &flags);
/* Remove an active timer from the queue: */
- ret = remove_hrtimer(timer, base);
+ tmp = remove_hrtimer(timer, base);
+ if (tmp)
+ ret = -EEXIST;
/* Switch the timer base, if necessary: */
new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
@@ -983,30 +987,19 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
*
* XXX send_remote_softirq() ?
*/
- if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases))
- hrtimer_enqueue_reprogram(timer, new_base, wakeup);
+ if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)) {
+ tmp = hrtimer_enqueue_reprogram(timer, new_base);
+ if (tmp < 0) {
+ __remove_hrtimer(timer, base, HRTIMER_STATE_INACTIVE,
+ reprogram);
+ ret = tmp;
+ }
+ }
unlock_hrtimer_base(timer, &flags);
return ret;
}
-
-/**
- * hrtimer_start_range_ns - (re)start an hrtimer on the current CPU
- * @timer: the timer to be added
- * @tim: expiry time
- * @delta_ns: "slack" range for the timer
- * @mode: expiry mode: absolute (HRTIMER_ABS) or relative (HRTIMER_REL)
- *
- * Returns:
- * 0 on success
- * 1 when the timer was active
- */
-int hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
- unsigned long delta_ns, const enum hrtimer_mode mode)
-{
- return __hrtimer_start_range_ns(timer, tim, delta_ns, mode, 1);
-}
EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
/**
@@ -1022,7 +1015,7 @@ EXPORT_SYMBOL_GPL(hrtimer_start_range_ns);
int
hrtimer_start(struct hrtimer *timer, ktime_t tim, const enum hrtimer_mode mode)
{
- return __hrtimer_start_range_ns(timer, tim, 0, mode, 1);
+ return hrtimer_start_range_ns(timer, tim, 0, mode);
}
EXPORT_SYMBOL_GPL(hrtimer_start);
@@ -1361,11 +1354,6 @@ void hrtimer_peek_ahead_timers(void)
local_irq_restore(flags);
}
-static void run_hrtimer_softirq(struct softirq_action *h)
-{
- hrtimer_peek_ahead_timers();
-}
-
#else /* CONFIG_HIGH_RES_TIMERS */
static inline void __hrtimer_peek_ahead_timers(void) { }
@@ -1705,9 +1693,6 @@ void __init hrtimers_init(void)
hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
(void *)(long)smp_processor_id());
register_cpu_notifier(&hrtimers_nb);
-#ifdef CONFIG_HIGH_RES_TIMERS
- open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
-#endif
}
/**
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index ef5d8a5..a5c0ac3 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3308,15 +3308,22 @@ static int cpu_clock_perf_counter_enable(struct perf_counter *counter)
{
struct hw_perf_counter *hwc = &counter->hw;
int cpu = raw_smp_processor_id();
+ u64 period;
+ int ret;
atomic64_set(&hwc->prev_count, cpu_clock(cpu));
hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hwc->hrtimer.function = perf_swcounter_hrtimer;
if (hwc->sample_period) {
- u64 period = max_t(u64, 10000, hwc->sample_period);
- __hrtimer_start_range_ns(&hwc->hrtimer,
+again:
+ period = max_t(u64, 10000, hwc->sample_period);
+ ret = hrtimer_start_range_ns(&hwc->hrtimer,
ns_to_ktime(period), 0,
- HRTIMER_MODE_REL, 0);
+ HRTIMER_MODE_REL);
+ if (ret == -ETIME) {
+ hwc->sample_period += period;
+ goto again;
+ }
}
return 0;
@@ -3357,7 +3364,8 @@ static void task_clock_perf_counter_update(struct perf_counter *counter, u64 now
static int task_clock_perf_counter_enable(struct perf_counter *counter)
{
struct hw_perf_counter *hwc = &counter->hw;
- u64 now;
+ u64 period, now;
+ int ret;
now = counter->ctx->time;
@@ -3365,10 +3373,15 @@ static int task_clock_perf_counter_enable(struct perf_counter *counter)
hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
hwc->hrtimer.function = perf_swcounter_hrtimer;
if (hwc->sample_period) {
- u64 period = max_t(u64, 10000, hwc->sample_period);
- __hrtimer_start_range_ns(&hwc->hrtimer,
+again:
+ period = max_t(u64, 10000, hwc->sample_period);
+ ret = hrtimer_start_range_ns(&hwc->hrtimer,
ns_to_ktime(period), 0,
- HRTIMER_MODE_REL, 0);
+ HRTIMER_MODE_REL);
+ if (ret == -ETIME) {
+ hwc->sample_period += period;
+			goto again;
+ }
}
return 0;
diff --git a/kernel/sched.c b/kernel/sched.c
index 92dbda2..4157b1c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -228,20 +228,15 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
spin_lock(&rt_b->rt_runtime_lock);
for (;;) {
- unsigned long delta;
- ktime_t soft, hard;
if (hrtimer_active(&rt_b->rt_period_timer))
break;
now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
- hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
- soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
- hard = hrtimer_get_expires(&rt_b->rt_period_timer);
- delta = ktime_to_ns(ktime_sub(hard, soft));
- __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
- HRTIMER_MODE_ABS_PINNED, 0);
+ hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
+ hrtimer_start_expires(&rt_b->rt_period_timer,
+ HRTIMER_MODE_ABS_PINNED);
}
spin_unlock(&rt_b->rt_runtime_lock);
}
@@ -1155,8 +1150,8 @@ static __init void init_hrtick(void)
*/
static void hrtick_start(struct rq *rq, u64 delay)
{
- __hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0,
- HRTIMER_MODE_REL_PINNED, 0);
+ hrtimer_start_range_ns(&rq->hrtick_timer, ns_to_ktime(delay), 0,
+ HRTIMER_MODE_REL_PINNED);
}
static inline void init_hrtick(void)
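As a toy illustration of the calling convention the patch above
introduces, hrtimer_start*() would return -ETIME instead of running an
already-expired timer in place, and the caller retries with a later
expiry (as the perf_counter hunks do). The stub below only mimics that
contract in user space; it is not the kernel API.

#include <stdio.h>
#include <stdint.h>
#include <errno.h>

/* Pretend "current time"; a real caller would read the clock. */
static const uint64_t now_ns = 1000000000ULL;

/* Stub with the proposed contract: refuse to arm a timer whose
 * expiry is not in the future and report -ETIME to the caller. */
static int stub_timer_start(uint64_t expires_ns)
{
	if (expires_ns <= now_ns)
		return -ETIME;
	printf("timer armed for %llu ns\n", (unsigned long long)expires_ns);
	return 0;
}

int main(void)
{
	uint64_t expires = now_ns - 1;	/* deliberately already expired */

	/* Retry with a later expiry until the timer can be armed. */
	while (stub_timer_start(expires) == -ETIME)
		expires += 10000;

	return 0;
}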
On Mon, 2009-06-15 at 20:56 +0200, Fabio Checconi wrote:
> init_rt_rq() initializes only rq->rt.pushable_tasks, and not the
> pushable_tasks field of the passed rt_rq. The plist is not used
> uninitialized since the only pushable_tasks plists used are the
> ones of root rt_rqs; anyway reinitializing the list on every group
> creation corrupts the root plist, losing its previous contents.
>
> Signed-off-by: Fabio Checconi <[email protected]>
Good catch, I'll queue this up.
> ---
> kernel/sched.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6032f51..873b252 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9075,7 +9075,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
> #ifdef CONFIG_SMP
> rt_rq->rt_nr_migratory = 0;
> rt_rq->overloaded = 0;
> - plist_head_init(&rq->rt.pushable_tasks, &rq->lock);
> + plist_head_init(&rt_rq->pushable_tasks, &rq->lock);
> #endif
>
> rt_rq->rt_time = 0;
> From: Peter Zijlstra <[email protected]>
> Date: Thu, Jul 09, 2009 12:36:13PM +0200
>
> Hi Fabio,
>
> Sorry for the delay, but we already spoke in person about this last
> week, a bit more detail here.
>
No problem, thank you for looking at this.
> On Mon, 2009-06-15 at 20:55 +0200, Fabio Checconi wrote:
> > This patchset introduces a group level EDF scheduler to extend the
> > current throttling mechanism, in order to make it support generic
> > period assignments. With this patch, the rt_runtime and rt_period
> > parameters can be used to specify arbitrary CPU reservations for
> > RT tasks.
>
> > This is an early RFC, I'm interested in having an idea of what people
> > think about this feature, if it's worth working on it, what can be
> > improved in the design, etc.
> >
> > The main design issues involved:
> >
> > - It is no more possible to specify RUNTIME_INF for a task group
> > when throttling is enabled. Rationale: supporting both throttled
> > and unthrottled groups would have required too much extra complexity
> > (I didn't find anything simpler than two parallel runqueues, one for
> > throttled and one for unthrottled groups).
>
> I would think this is more than reasonable to the point that I wonder if
> it was possible to begin with. I was assuming the current bandwidth
> validation stuff would already exclude this possibility.
>
From the scheduling code it seemed possible, but I think I overlooked
the admission control part, sorry about that.
> > - Since it is not easy to mix tasks and groups on the same scheduler
> > queue (tasks have no deadlines), the bandwidth reserved to the tasks
> > in a group is controlled with two additional cgroup attributes:
> > rt_task_runtime_us and rt_task_period_us. These attrinutes control,
> > within a cgroup, how much bandwidth is reserved to the tasks it
> > contains.
>
> Agreed.
>
> > - Shared resources are still handled using boosting. When a group
> > contains a task inside a critical section it is scheduled according
> > the highest priority among the ones of the tasks it contains.
> > In this way, the same group has two modes: when it is not boosted
> > it is scheduled according to its deadline; when it is boosted, it
> > is scheduled according its priority. Boosted groups are always
> > favored over non-boosted ones.
>
> Yeah, for now this PCP like solution is the best we have. Preferably
> we'll come up with something really smart soon :-)
>
> > - The old priority array is now gone. To use only a single data
> > structure for entities using both priorities and deadlines (due
> > to boosting), the only possible choice was to use an rb-tree;
> > the function used to order the keys takes into account the
> > prioritization described above (boosted tasks, ordered by
> > priority are favored to non-boosted tasks, ordered by increasing
> > deadline).
> >
> > - Given that the various rt_rq's belonging to the same task group
> > are activated independently, there is the need of a timer per
> > each rt_rq.
>
> Like I suggested last week, we could flatten the full hierarchy to a 2
> level one, the top level being a EDF like scheduler which purely
> schedules the 'phantom' task groups as created by the new rt_task_*_us
> parameters.
>
I really like this idea, it would simplify things a lot, with almost
no changes from the scheduling theory POV. I'll give it a try as soon
as possible.
> Once such a task group is selected we use the regular FIFO rules to pick
> a task.
>
> Further, like mentioned, you remove the bandwidth distribution between
> cpus in patch 4. You do this because you schedule the groups using PEDF,
> however I was thinking that when we use GEDF to schedule the groups we
> can use the theory from:
>
> H. Leontyev and J. Anderson, " A Hierarchical Multiprocessor Bandwidth
> Reservation Scheme with Timing Guarantees ", Proceedings of the 20th
> Euromicro Conference on Real-Time Systems, pp. 191-200, July 200
>
> http://www.cs.unc.edu/~anderson/papers/ecrts08c.pdf
>
> This would allow us to lift this constraint.
>
> As to load-balancing. The current scheme of load-balancing the tasks
> utterly ignores the deadline settings and would also need some suitable
> changes. I fear this part will be a little more challenging.
>
I was thinking about doing things gradually: first extend throttling
to handle generic periods, then extend the push-pull logic (I think this
is what you refer to as load-balancing) to fully support it, and then
think about global EDF. I think it would be difficult to do everything
at once.
I'm not sure that a pure GEDF approach would scale well given the
number of CPUs Linux can support in extreme configurations; maybe a
clustered approach would be more suitable (Anderson's group has published
interesting results on both GEDF scalability and C-EDF). The focus
on bounded tardiness for soft real-time tasks in GEDF would also give
a slightly different meaning to the runtime/period assignments done at
the interface level.
So my proposal was in the direction of introducing group EDF scheduling
consistently with the current throttling implementation, to obtain a more
flexible way of assigning CPU reservations. GEDF can be implemented on
top of it by adding push-pull logic to the PEDF queues; anyway, I would
not do this without extensive benchmarking on various hardware
configurations. (Other schemes to implement GEDF would require major
surgery to the scheduler code, which I wanted to avoid.)
I understand your concern with the removal of the bandwidth distribution
logic; I'll try to reintroduce it in my next post, taking into account
the PEDF schedulability constraints.
> I would think that by using the GEDF and minimal concurrency group
> scheduling we'd get the desired deadline behaviour. After that we'd get
> the task of selecting the FIFO tasks within the dynamic vcpu range
> resulting from the deadline server.
>
> We could simply implement global-fifo to make it work, and then move
> towards adapting the current (!group) load-balancer to work within these
> more dynamic constraints.
>
> Thoughts?
About minimal concurrency group scheduling, I am not sure how we
would handle CPU hot insertion/removal, or how the allocation could
be done efficiently (avoiding bin-packing issues) online inside the kernel.
To adapt the current load-balancer to the choices of the deadline-based
scheduler I was thinking of using a cpupri-like structure per task_group,
but I'm not yet able to estimate the resulting overhead...
Do you think that this gradual approach makes sense?
Commit-ID: c20b08e3986c2dbfa6df1e880bf4f7159994d199
Gitweb: http://git.kernel.org/tip/c20b08e3986c2dbfa6df1e880bf4f7159994d199
Author: Fabio Checconi <[email protected]>
AuthorDate: Mon, 15 Jun 2009 20:56:38 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 10 Jul 2009 10:43:30 +0200
sched: Fix rt_rq->pushable_tasks initialization in init_rt_rq()
init_rt_rq() initializes only rq->rt.pushable_tasks, and not the
pushable_tasks field of the passed rt_rq. The plist is not used
uninitialized since the only pushable_tasks plists used are the
ones of root rt_rqs; anyway reinitializing the list on every group
creation corrupts the root plist, losing its previous contents.
Signed-off-by: Fabio Checconi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
CC: Gregory Haskins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index c4549bd..efecfda 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9093,7 +9093,7 @@ static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
#ifdef CONFIG_SMP
rt_rq->rt_nr_migratory = 0;
rt_rq->overloaded = 0;
- plist_head_init(&rq->rt.pushable_tasks, &rq->lock);
+ plist_head_init(&rt_rq->pushable_tasks, &rq->lock);
#endif
rt_rq->rt_time = 0;
On Thu, 2009-07-09 at 15:51 +0200, Fabio Checconi wrote:
> I was thinking about doing things gradually: first extend throttling
> to handle generic periods, then extend the push-pull logic (I think you
> are referring to it with load-balancing) to fully support it, and then
> think about global EDF. I think it would be difficult to do all the
> things at one time.
Agreed.
> About minimal concurrency group scheduling, I am not sure of how we
> would handle CPUs hot insertion/extraction, or how the allocation would
> be done efficiently (avoiding bin-packing issues) online inside the kernel.
Right, since the current interface specifies bandwidth in a single-cpu
normalized fashion, adding/removing cpus will only affect the total
bandwidth available, but should not affect the bandwidth calculations.
So it should not break anything, but it sure might surprise; then again,
hotplug is an explicit action on behalf of the admin, so he pretty much
gets what he asked for :-)
I might have to re-read that min-concurrency G-EDF paper, but I
failed to spot the bin-packing issue.
> To adapt the current load-balancer to the choices of the deadline-based
> scheduler I was thinking about using a cpupri-like structure per task_group,
> but now I'm not able to estimate the resulting overhead...
Right, per task_group sounds about the right level for the FIFO
balancer. It gets a little more complicated due to having a dynamic
number of vcpus being served at any one time though.
This will also lead to extra task migrations, but sure, whatever works
first, then make it better.
> Do you think that this gradual approach makes sense?
Yeah it does ;-)
> From: Peter Zijlstra <[email protected]>
> Date: Wed, Jul 15, 2009 09:50:12AM +0200
>
> On Thu, 2009-07-09 at 15:51 +0200, Fabio Checconi wrote:
>
> > I was thinking about doing things gradually: first extend throttling
> > to handle generic periods, then extend the push-pull logic (I think you
> > are referring to it with load-balancing) to fully support it, and then
> > think about global EDF. I think it would be difficult to do all the
> > things at one time.
>
> Agreed.
>
> > About minimal concurrency group scheduling, I am not sure of how we
> > would handle CPUs hot insertion/extraction, or how the allocation would
> > be done efficiently (avoiding bin-packing issues) online inside the kernel.
>
> Right, since the current interface specifies bandwidth in a single-cpu
> normalized fashion, adding/removing cpus will only affect the total
> bandwidth available, but should not affect the bandwidth calculations.
>
> So it should not break anything, but it sure might surprise, then again,
> hotplug is an explicit action on behalf of the admin, so he pretty much
> gets what he asked for :-)
>
Even the proper assignment of runtimes and periods is in the hands of
whoever runs the system, so the effectiveness of admission control
already depends on userspace.
> I might have to re-read that mim-concurrency G-EDF paper again, but I
> failed to spot the bin-packing issue.
>
In the paper you cited, the Conclusion section lists support for
dynamic systems and for tasks joining/leaving as future work; I
think that handling (de-)fragmentation and allocation of cpu bandwidth
when tasks and groups are created/destroyed might be quite complex from
within the kernel.
I'd prefer to have the mechanisms enforcing the bandwidth allocation
inside the kernel and, eventually, an interface allowing userspace
to specify nontrivial allocation schemes, like the one in the paper.
> > To adapt the current load-balancer to the choices of the deadline-based
> > scheduler I was thinking about using a cpupri-like structure per task_group,
> > but now I'm not able to estimate the resulting overhead...
>
> Right, per task_group sounds about the right level for the FIFO
> balancer. It gets a little more complicated due to having a dynamic
> number of vcpus being served at any one time though.
>
> This will also lead to extra task migrations, but sure, whatever works
> first, then make it better.
>
It will also be interesting to evaluate the cost of other hierarchical
global schedulers, since implementing them with partitioned queues will
require similar techniques to migrate tasks...
> > Do you think that this gradual approach makes sense?
>
> Yeah it does ;-)
On Wed, 2009-07-15 at 14:08 +0200, Fabio Checconi wrote:
>
> > I might have to re-read that mim-concurrency G-EDF paper again, but I
> > failed to spot the bin-packing issue.
> >
> In the paper you cited, the Conclusion section lists the support for
> dynamic systems and for joining/leaving of tasks as a future work; I
> think that handling (de-)fragmentation and allocation of cpu bandwidth
> when tasks and groups are created/destroyed might be quite complex from
> within the kernel.
Hmm, right, so I was thinking that we could simply create int(w_i)
full-cpu server tasks plus one frac(w_i) server task and let the
single-level G-EDF sort it out.
It looks to me that by only scheduling the leaves of the hierarchy you
side-step a lot of issues, but then maybe it generates other issues :-)
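To make that decomposition concrete, here is a hedged sketch assuming a
group's weight w_i is expressed as runtime per period (possibly more than
one CPU worth): it is split into int(w_i) full-CPU servers plus one
fractional server, which a single-level G-EDF could then schedule. The
decompose() helper and its types are purely illustrative, not part of
any patch.

#include <stdio.h>
#include <stdint.h>

struct server {
	uint64_t runtime;	/* ns of budget per period */
	uint64_t period;	/* period length, ns */
};

/* Split a total bandwidth of 'runtime' every 'period' into int(w)
 * full-CPU servers plus one frac(w) server. */
static int decompose(uint64_t runtime, uint64_t period,
		     struct server *out, int max_servers)
{
	int n = 0;

	while (runtime >= period && n < max_servers) {
		out[n].runtime = period;	/* a full-CPU server */
		out[n].period = period;
		runtime -= period;
		n++;
	}
	if (runtime && n < max_servers) {
		out[n].runtime = runtime;	/* the fractional remainder */
		out[n].period = period;
		n++;
	}
	return n;
}

int main(void)
{
	struct server srv[8];
	int i, n;

	/* 2.5 CPUs worth of bandwidth: 250ms every 100ms. */
	n = decompose(250000000ULL, 100000000ULL, srv, 8);

	for (i = 0; i < n; i++)
		printf("server %d: %llu/%llu ns\n", i,
		       (unsigned long long)srv[i].runtime,
		       (unsigned long long)srv[i].period);
	return 0;
}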
> I'd prefer to have the mechanims enforcing the bandwidth allocation
> inside the kernel, and, eventually, an interface allowing the userspace
> to specify nontrivial allocation schemes, like the one in the paper.
Right.