A runqueue may suffer extra RT overload because of task affinity, so
when the affinity of any runnable RT task is changed, we should check
whether balancing needs to be triggered; otherwise the change can cause
unnecessary delay in real-time response. Unfortunately, the current RT
global scheduler triggers nothing in this case.
For example: take a 2-CPU system with two runnable FIFO tasks (of the
same rt_priority) bound to CPU0; call them rt1 (running) and rt2
(runnable). CPU1 has no RT tasks. Someone then sets the affinity of rt2
to 0x3 (i.e. CPU0 and CPU1), but rt2 still cannot be scheduled until
rt1 enters schedule(), which causes real response latency for rt2.
So, when doing set_cpus_allowed_rt(), detect such cases and trigger a
push if needed.
Signed-off-by: Xunlei Pang <[email protected]>
---
v2, v3:
Refined according to Steven Rostedt's comments.
kernel/sched/rt.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 68 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f4d4b07..04c58b7 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1428,10 +1428,9 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
return next;
}
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *peek_next_task_rt(struct rq *rq)
{
struct sched_rt_entity *rt_se;
- struct task_struct *p;
struct rt_rq *rt_rq = &rq->rt;
do {
@@ -1440,7 +1439,14 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
rt_rq = group_rt_rq(rt_se);
} while (rt_rq);
- p = rt_task_of(rt_se);
+ return rt_task_of(rt_se);
+}
+
+static inline struct task_struct *_pick_next_task_rt(struct rq *rq)
+{
+ struct task_struct *p;
+
+ p = peek_next_task_rt(rq);
p->se.exec_start = rq_clock_task(rq);
return p;
@@ -1886,28 +1892,74 @@ static void set_cpus_allowed_rt(struct task_struct *p,
const struct cpumask *new_mask)
{
struct rq *rq;
- int weight;
+ int old_weight, new_weight;
+ int preempt_push = 0, direct_push = 0;
BUG_ON(!rt_task(p));
if (!task_on_rq_queued(p))
return;
- weight = cpumask_weight(new_mask);
+ old_weight = p->nr_cpus_allowed;
+ new_weight = cpumask_weight(new_mask);
+
+ rq = task_rq(p);
+
+ if (new_weight > 1 &&
+ rt_task(rq->curr) &&
+ !test_tsk_need_resched(rq->curr)) {
+ /*
+ * We own p->pi_lock and rq->lock. rq->lock might
+ * get released when doing direct pushing, however
+ * p->pi_lock is always held, so it's safe to assign
+ * the new_mask and new_weight to p below.
+ */
+ if (!task_running(rq, p)) {
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->nr_cpus_allowed = new_weight;
+ direct_push = 1;
+ } else if (cpumask_test_cpu(task_cpu(p), new_mask)) {
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->nr_cpus_allowed = new_weight;
+ if (!cpupri_find(&rq->rd->cpupri, p, NULL))
+ goto update;
+
+ /*
+ * At this point, current task gets migratable most
+ * likely due to the change of its affinity, let's
+ * figure out if we can migrate it.
+ *
+ * Is there any task with the same priority as that
+ * of current task? If found one, we should resched.
+ * NOTE: The target may be unpushable.
+ */
+ if (p->prio == rq->rt.highest_prio.next) {
+ /* One target just in pushable_tasks list. */
+ requeue_task_rt(rq, p, 0);
+ preempt_push = 1;
+ } else if (rq->rt.rt_nr_total > 1) {
+ struct task_struct *next;
+
+ requeue_task_rt(rq, p, 0);
+ next = peek_next_task_rt(rq);
+ if (next != p && next->prio == p->prio)
+ preempt_push = 1;
+ }
+ }
+ }
+update:
/*
* Only update if the process changes its state from whether it
* can migrate or not.
*/
- if ((p->nr_cpus_allowed > 1) == (weight > 1))
- return;
-
- rq = task_rq(p);
+ if ((old_weight > 1) == (new_weight > 1))
+ goto out;
/*
* The process used to be able to migrate OR it can now migrate
*/
- if (weight <= 1) {
+ if (new_weight <= 1) {
if (!task_current(rq, p))
dequeue_pushable_task(rq, p);
BUG_ON(!rq->rt.rt_nr_migratory);
@@ -1919,6 +1971,12 @@ static void set_cpus_allowed_rt(struct task_struct *p,
}
update_rt_migration(&rq->rt);
+
+out:
+ if (direct_push)
+ push_rt_tasks(rq);
+ else if (preempt_push)
+ resched_curr(rq);
}
/* Assumes rq->lock is held */
--
1.9.1
check_preempt_curr() doesn't invoke sched_class::check_preempt_curr()
when the woken task belongs to a lower class than current. So if a DL
task is running when an RT task wakes up, check_preempt_equal_prio() is
never reached for the RT class, which may cause response latency for
that RT task if it is pinned while some same-priority migratable RT
tasks are already queued.
We should do something similar in pick_next_task_rt() when picking the
first RT task after the DL tasks have run out.
This patch tackles the issue by peeking at the next RT task (RT1); if
RT1 is found to be migratable, it is simply requeued to the tail of the
rq using requeue_task_rt(rq, p, 0). In this way:
- If there is another RT task (RT2) with the same priority as RT1,
  RT2 will be picked as the running task, while RT1 will be pushed
  onto another CPU via RT1's post_schedule(), as RT1 is migratable.
  The difference from check_preempt_equal_prio() here is that we
  simply don't care whether RT2 is migratable.
- Otherwise, if there is no RT task with the same priority as RT1,
  RT1 will still be picked as the running task after the requeueing.
Signed-off-by: Xunlei Pang <[email protected]>
---
kernel/sched/rt.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 04c58b7..26114f5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1482,6 +1482,22 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
put_prev_task(rq, prev);
+#ifdef CONFIG_SMP
+ /*
+ * If there's a running higher class task, check_preempt_curr()
+ * doesn't invoke check_preempt_equal_prio() for rt tasks, so
+ * we can do the similar thing here.
+ */
+ if (rq->rt.rt_nr_total > 1 &&
+ (prev->sched_class == &dl_sched_class ||
+ prev->sched_class == &stop_sched_class)) {
+ p = peek_next_task_rt(rq);
+ if (p->nr_cpus_allowed != 1 &&
+ cpupri_find(&rq->rd->cpupri, p, NULL))
+ requeue_task_rt(rq, p, 0);
+ }
+#endif
+
p = _pick_next_task_rt(rq);
/* The running task is never eligible for pushing */
--
1.9.1
On Sun, 8 Feb 2015 23:51:25 +0800
Xunlei Pang <[email protected]> wrote:
> + if (new_weight > 1 &&
> + rt_task(rq->curr) &&
> + !test_tsk_need_resched(rq->curr)) {
> + /*
> + * We own p->pi_lock and rq->lock. rq->lock might
> + * get released when doing direct pushing, however
> + * p->pi_lock is always held, so it's safe to assign
> + * the new_mask and new_weight to p below.
> + */
> + if (!task_running(rq, p)) {
> + cpumask_copy(&p->cpus_allowed, new_mask);
> + p->nr_cpus_allowed = new_weight;
> + direct_push = 1;
> + } else if (cpumask_test_cpu(task_cpu(p), new_mask)) {
> + cpumask_copy(&p->cpus_allowed, new_mask);
> + p->nr_cpus_allowed = new_weight;
> + if (!cpupri_find(&rq->rd->cpupri, p, NULL))
> + goto update;
> +
> + /*
> + * At this point, current task gets migratable most
> + * likely due to the change of its affinity, let's
> + * figure out if we can migrate it.
> + *
> + * Is there any task with the same priority as that
> + * of current task? If found one, we should resched.
> + * NOTE: The target may be unpushable.
> + */
> + if (p->prio == rq->rt.highest_prio.next) {
> + /* One target just in pushable_tasks list. */
> + requeue_task_rt(rq, p, 0);
What's the purpose of the requeue_task_rt() here?
> + preempt_push = 1;
> + } else if (rq->rt.rt_nr_total > 1) {
> + struct task_struct *next;
> +
> + requeue_task_rt(rq, p, 0);
And here? It may just be late and I'm tired, but it's not obvious to me.
Thanks,
-- Steve
> + next = peek_next_task_rt(rq);
> + if (next != p && next->prio == p->prio)
> + preempt_push = 1;
> + }
> + }
> + }
>
On Sun, 8 Feb 2015 23:51:26 +0800
Xunlei Pang <[email protected]> wrote:
> check_preempt_curr() doesn't call sched_class::check_preempt_curr
> when the class of current is a higher level.
The above sentence does not make sense.
> So if there is a DL
> task running when doing this for RT, check_preempt_equal_prio()
Doing what for RT?
> will definitely miss, which may result in some response latency
Miss what?
> for this RT task if it is pinned and there're some same-priority
> migratable rt tasks already queued.
>
> We should do the similar thing in select_task_rq_rt() when first
> picking rt tasks after running out of DL tasks.
>
> This patch tackles the issue by peeking the next rt task(RT1), and
> if find RT1 migratable, just requeue it to the tail of the rq using
> requeue_task_rt(rq, p, 0). In this way:
> - If there do have another rt task(RT2) with the same priority as
> RT1, RT2 will finally be picked as the running task. While RT1
> will be pushed onto another cpu via RT1's post_schedule(), as
> RT1 is migratable. The difference from check_preempt_equal_prio()
> here is that we just don't care whether RT2 is migratable.
>
> - Otherwise, if there's no rt task with the same priority as RT1,
> RT1 will still be picked as the running task after the requeuing.
What happens if there are three RT tasks of the same prio: RT1 is ready
to run and is migratable, RT2 is pinned, RT3 is migratable.
RT1 just got pushed behind RT3 and it is now not the next one to run.
RT2 will get this rq, RT3 will be pushed off, but say there's no more
rq's available to run RT1.
You just broke FIFO.
I'm sorry, but I think this is trying too hard to fix the user's poor
management of RT tasks.
If you have 2 or more RT tasks of the same prio, you had better be damn
aware that if one is pinned, it will block the others, even from
migrating. You should not have pinned tasks of the same prio as those
that can migrate.
And if your system depends on DL tasks working nicely with RT tasks on
the same CPU, it's even more broken by design.
-- Steve
On 13 February 2015 at 07:31, Steven Rostedt <[email protected]> wrote:
> On Sun, 8 Feb 2015 23:51:25 +0800
> Xunlei Pang <[email protected]> wrote:
>
>
>> + if (new_weight > 1 &&
>> + rt_task(rq->curr) &&
>> + !test_tsk_need_resched(rq->curr)) {
>> + /*
>> + * We own p->pi_lock and rq->lock. rq->lock might
>> + * get released when doing direct pushing, however
>> + * p->pi_lock is always held, so it's safe to assign
>> + * the new_mask and new_weight to p below.
>> + */
>> + if (!task_running(rq, p)) {
>> + cpumask_copy(&p->cpus_allowed, new_mask);
>> + p->nr_cpus_allowed = new_weight;
>> + direct_push = 1;
>> + } else if (cpumask_test_cpu(task_cpu(p), new_mask)) {
>> + cpumask_copy(&p->cpus_allowed, new_mask);
>> + p->nr_cpus_allowed = new_weight;
>> + if (!cpupri_find(&rq->rd->cpupri, p, NULL))
>> + goto update;
>> +
>> + /*
>> + * At this point, current task gets migratable most
>> + * likely due to the change of its affinity, let's
>> + * figure out if we can migrate it.
>> + *
>> + * Is there any task with the same priority as that
>> + * of current task? If found one, we should resched.
>> + * NOTE: The target may be unpushable.
>> + */
>> + if (p->prio == rq->rt.highest_prio.next) {
>> + /* One target just in pushable_tasks list. */
>> + requeue_task_rt(rq, p, 0);
>
> What's the purpose of the requeue_task_rt() here?
>
>> + preempt_push = 1;
>> + } else if (rq->rt.rt_nr_total > 1) {
>> + struct task_struct *next;
>> +
>> + requeue_task_rt(rq, p, 0);
>
> And here? It may just be late and I'm tired, but it's not obvious to me.
If we're changing the affinity of the current running task, and there
are also other tasks with the same prio on the same CPU, we do
something similar to check_preempt_equal_prio(). But yes, this may
have the same problem you pointed out with the 2nd patch.
Thanks,
Xunlei
Hi steve,
On 13 February 2015 at 08:04, Steven Rostedt <[email protected]> wrote:
> On Sun, 8 Feb 2015 23:51:26 +0800
> Xunlei Pang <[email protected]> wrote:
>
>> check_preempt_curr() doesn't call sched_class::check_preempt_curr
>> when the class of current is a higher level.
>
> The above sentence does not make sense.
>
>> So if there is a DL
>> task running when doing this for RT, check_preempt_equal_prio()
>
> Doing what for RT?
>
>> will definitely miss, which may result in some response latency
>
> Miss what?
Sorry, this lacks some information; I need to explain it further in detail.
>
>> for this RT task if it is pinned and there're some same-priority
>> migratable rt tasks already queued.
>>
>> We should do the similar thing in select_task_rq_rt() when first
>> picking rt tasks after running out of DL tasks.
>>
>> This patch tackles the issue by peeking the next rt task(RT1), and
>> if find RT1 migratable, just requeue it to the tail of the rq using
>> requeue_task_rt(rq, p, 0). In this way:
>> - If there do have another rt task(RT2) with the same priority as
>> RT1, RT2 will finally be picked as the running task. While RT1
>> will be pushed onto another cpu via RT1's post_schedule(), as
>> RT1 is migratable. The difference from check_preempt_equal_prio()
>> here is that we just don't care whether RT2 is migratable.
>>
>> - Otherwise, if there's no rt task with the same priority as RT1,
>> RT1 will still be picked as the running task after the requeuing.
>
> What happens if there's three RT tasks of the same prio, RT1 is ready
> to run and is migratable, RT2 is pinned, RT3 is migratable
>
> RT1 just got pushed behind RT3 and it is now not the next one to run.
> RT2 will get this rq, RT3 will be pushed off, but say there's no more
> rq's available to run RT1.
>
> You just broke FIFO.
Yes, I had also thought about this point before.
If this is a problem, the same thing can happen in the current
check_preempt_equal_prio() code: a pinned waking task preempts
current successfully because cpupri_find() says current is migratable.
But by the time the resched happens things may have changed, i.e.
current has become non-migratable, so the waking task gets to run
while the previously running task gets stuck. See, that also breaks
FIFO.
Thanks,
Xunlei
>
> I'm sorry, I'm thinking this is trying too hard to fix the users poor
> management of RT tasks.
>
> If you have 2 or more RT tasks of the same prio, you had better be damn
> aware that if one is pinned, it will block the others, even from
> migrating. You should not have pinned tasks of the same prio as those
> that can migrate.
>
> And if your system depends on DL tasks working nicely with RT tasks on
> the same CPU, it's even more broken by design.
>
> -- Steve
>
Hi Steve,
On 13 February 2015 at 11:55, Xunlei Pang <[email protected]> wrote:
> Hi steve,
>
> On 13 February 2015 at 08:04, Steven Rostedt <[email protected]> wrote:
>> On Sun, 8 Feb 2015 23:51:26 +0800
>> Xunlei Pang <[email protected]> wrote:
>>
>>> check_preempt_curr() doesn't call sched_class::check_preempt_curr
>>> when the class of current is a higher level.
>>
>> The above sentence does not make sense.
>>
>>> So if there is a DL
>>> task running when doing this for RT, check_preempt_equal_prio()
>>
>> Doing what for RT?
>>
>>> will definitely miss, which may result in some response latency
>>
>> Miss what?
>
> Sorry, this may lack some information I need to further explain in detail.
>
>>
>>> for this RT task if it is pinned and there're some same-priority
>>> migratable rt tasks already queued.
>>>
>>> We should do the similar thing in select_task_rq_rt() when first
>>> picking rt tasks after running out of DL tasks.
>>>
>>> This patch tackles the issue by peeking the next rt task(RT1), and
>>> if find RT1 migratable, just requeue it to the tail of the rq using
>>> requeue_task_rt(rq, p, 0). In this way:
>>> - If there do have another rt task(RT2) with the same priority as
>>> RT1, RT2 will finally be picked as the running task. While RT1
>>> will be pushed onto another cpu via RT1's post_schedule(), as
>>> RT1 is migratable. The difference from check_preempt_equal_prio()
>>> here is that we just don't care whether RT2 is migratable.
>>>
>>> - Otherwise, if there's no rt task with the same priority as RT1,
>>> RT1 will still be picked as the running task after the requeuing.
>>
>> What happens if there's three RT tasks of the same prio, RT1 is ready
>> to run and is migratable, RT2 is pinned, RT3 is migratable
>>
>> RT1 just got pushed behind RT3 and it is now not the next one to run.
>> RT2 will get this rq, RT3 will be pushed off, but say there's no more
>> rq's available to run RT1.
>>
>> You just broke FIFO.
>
> Yes, I've also thought of this point before.
>
> If this is a problem, we may have the same thing happening in
> current check_preempt_equal_prio() code:
> When a pinned waking task preempts the current successfully,
> because it thinks current is migratable via cpupri_find().
>
> But when resched happens, things may change, i.e. current
> becomes non-migratable, so the waking task gets running, while
> the previous running task gets stuck. See, it also broke FIFO.
Aside from this, please ignore this patch: waking RT tasks will also be
pushed via task_woken_rt() when current is DL, which I missed before.
Thanks,
Xunlei
On Fri, 13 Feb 2015 11:55:11 +0800
Xunlei Pang <[email protected]> wrote:
> > RT1 just got pushed behind RT3 and it is now not the next one to run.
> > RT2 will get this rq, RT3 will be pushed off, but say there's no more
> > rq's available to run RT1.
> >
> > You just broke FIFO.
>
> Yes, I've also thought of this point before.
>
> If this is a problem, we may have the same thing happening in
> current check_preempt_equal_prio() code:
> When a pinned waking task preempts the current successfully,
> because it thinks current is migratable via cpupri_find().
>
> But when resched happens, things may change, i.e. current
> becomes non-migratable, so the waking task gets running, while
> the previous running task gets stuck. See, it also broke FIFO.
It breaks FIFO if the state of the system changes before the current
task found another queue to run on, sure, and that probably should be
fixed. And technically, that case does not break FIFO from a state
point of view. Think of the timing: if that task was able to migrate to
another CPU but suddenly could not, that means the CPU it was going
to migrate to had a higher priority task that started to run on that
CPU. It still fits the FIFO design. That's because if that task
succeeded to migrate to that CPU, just before the high priority task
ran, that high priority task would have bumped it anyway.
Now if it couldn't migrate because a same priority task started, then,
well yeah, it broke FIFO, and maybe that should be fixed.
But your patch breaks FIFO if the system is in just one
particular state. That's much worse, and it shouldn't be added.
-- Steve
On Sun, 15 Feb 2015 10:54:25 +0800
[email protected] wrote:
> I think this can also happen for check_preempt_equal_prio():
> When RT1(current task) gets preempted by RT2, if there is a
> migratable RT3 with same prio, RT3 will be pushed away instead
> of RT1 afterwards, because RT1 will be enqueued to the tail of
> the pushable list via succeeding put_prev_task_rt() triggered
> by resched.
>
> There seems some trouble involved in rt equal prio cases.
Hmm, you may be right, and that should be fixed. If a task is running
and gets preempted by a higher priority task (or even of same priority
for migrating), then it should stay at the front of the queue to be
migrated. It should only be placed after other FIFO tasks of the same
priority if that task calls schedule() directly (not preempted).
SMP always cracks a few rotten eggs in the RT omelet.
-- Steve