This is a new on_rq state for the case when a task is migrating
from one src_rq to another dst_rq, and the locks of both RQs
are unlocked.

We will use the state this way:

raw_spin_lock(&src_rq->lock);
dequeue_task(src_rq, p, 0);
p->on_rq = ONRQ_MIGRATING;
set_task_cpu(p, dst_cpu);
raw_spin_unlock(&src_rq->lock);

raw_spin_lock(&dst_rq->lock);
p->on_rq = ONRQ_QUEUED;
enqueue_task(dst_rq, p, 0);
raw_spin_unlock(&dst_rq->lock);

The benefit is that double_rq_lock() is not needed now,
and this may reduce latencies in some situations.

The logic of try_to_wake_up() remains the same as it
was. Its behaviour changes only in a small subset of cases
(when a preempted task in a !TASK_RUNNING state is queued
on a rq and we are migrating it to another one).
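For reference, lock holders are expected to interpret the two values
roughly like this (task_queued() comes from an earlier patch of the
series and is assumed to check for ONRQ_QUEUED; a task_migrating()
counterpart is only a possible helper, sketched here, not part of this
patch):

static inline int task_queued(struct task_struct *p)
{
        return p->on_rq == ONRQ_QUEUED;         /* queued on its rq */
}

static inline int task_migrating(struct task_struct *p)
{
        return p->on_rq == ONRQ_MIGRATING;      /* between rqs, both unlocked */
}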
Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/sched/core.c | 25 ++++++++++++++++++-------
kernel/sched/sched.h | 1 +
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 205f99a..78388b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1214,7 +1214,7 @@ static int migration_cpu_stop(void *data);
unsigned long wait_task_inactive(struct task_struct *p, long match_state)
{
unsigned long flags;
- int running, queued;
+ int running, on_rq;
unsigned long ncsw;
struct rq *rq;
@@ -1252,7 +1252,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
rq = task_rq_lock(p, &flags);
trace_sched_wait_task(p);
running = task_running(rq, p);
- queued = task_queued(p);
+ on_rq = p->on_rq;
ncsw = 0;
if (!match_state || p->state == match_state)
ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
@@ -1284,7 +1284,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
* running right now), it's preempted, and we should
* yield - it could be a while.
*/
- if (unlikely(queued)) {
+ if (unlikely(on_rq)) {
ktime_t to = ktime_set(0, NSEC_PER_SEC/HZ);
set_current_state(TASK_UNINTERRUPTIBLE);
@@ -1491,10 +1491,14 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
static void
ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
- check_preempt_curr(rq, p, wake_flags);
trace_sched_wakeup(p, true);
p->state = TASK_RUNNING;
+
+ if (!task_queued(p))
+ return;
+
+ check_preempt_curr(rq, p, wake_flags);
#ifdef CONFIG_SMP
if (p->sched_class->task_woken)
p->sched_class->task_woken(rq, p);
@@ -1537,7 +1541,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
int ret = 0;
rq = __task_rq_lock(p);
- if (task_queued(p)) {
+ if (p->on_rq) {
/* check_preempt_curr() may use rq clock */
update_rq_clock(rq);
ttwu_do_wakeup(rq, p, wake_flags);
@@ -1678,7 +1682,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
success = 1; /* we're going to change ->state */
cpu = task_cpu(p);
- if (task_queued(p) && ttwu_remote(p, wake_flags))
+ if (p->on_rq && ttwu_remote(p, wake_flags))
goto stat;
#ifdef CONFIG_SMP
@@ -1693,6 +1697,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_rmb();
+ BUG_ON(p->on_rq);
+
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
@@ -4623,9 +4629,14 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
struct rq *rq;
unsigned int dest_cpu;
int ret = 0;
-
+again:
rq = task_rq_lock(p, &flags);
+ if (unlikely(p->on_rq) == ONRQ_MIGRATING) {
+ task_rq_unlock(rq, p, &flags);
+ goto again;
+ }
+
if (cpumask_equal(&p->cpus_allowed, new_mask))
goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5a9b6d..9b00e9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -17,6 +17,7 @@ struct rq;
/* .on_rq states of struct task_struct: */
#define ONRQ_QUEUED 1
+#define ONRQ_MIGRATING 2
extern __read_mostly int scheduler_running;
On Tue, Jul 22, 2014 at 03:30:16PM +0400, Kirill Tkhai wrote:
>
> This is a new on_rq state for the case when a task is migrating
> from one src_rq to another dst_rq, and the locks of both RQs
> are unlocked.
>
> We will use the state this way:
>
> raw_spin_lock(&src_rq->lock);
> dequeue_task(src_rq, p, 0);
> p->on_rq = ONRQ_MIGRATING;
> set_task_cpu(p, dst_cpu);
> raw_spin_unlock(&src_rq->lock);
>
> raw_spin_lock(&dst_rq->lock);
> p->on_rq = ONRQ_QUEUED;
> enqueue_task(dst_rq, p, 0);
> raw_spin_unlock(&dst_rq->lock);
>
> The benefit is that double_rq_lock() is not needed now,
> and this may reduce latencies in some situations.
>
> The logic of try_to_wake_up() remains the same as it
> was. Its behaviour changes only in a small subset of cases
> (when a preempted task in a !TASK_RUNNING state is queued
> on a rq and we are migrating it to another one).
more details is better ;-) Also, I think Oleg enjoys these kind of
things, so I've added him to the CC.
A few questions, haven't really thought about things yet.
> @@ -1491,10 +1491,14 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
> static void
> ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> - check_preempt_curr(rq, p, wake_flags);
> trace_sched_wakeup(p, true);
>
> p->state = TASK_RUNNING;
> +
> + if (!task_queued(p))
> + return;
How can this happen? We're in the middle of a wakeup, we've just added
the task to the rq and are still holding the appropriate rq->lock.
> @@ -4623,9 +4629,14 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
> struct rq *rq;
> unsigned int dest_cpu;
> int ret = 0;
> -
> +again:
> rq = task_rq_lock(p, &flags);
>
> + if (unlikely(p->on_rq) == ONRQ_MIGRATING) {
> + task_rq_unlock(rq, p, &flags);
> + goto again;
> + }
> +
> if (cpumask_equal(&p->cpus_allowed, new_mask))
> goto out;
>
That looks like a non-deterministic spin loop, 'waiting' for the
migration to finish. Not particularly nice and something I think we
should avoid for it has bad (TM) worst case behaviour.
Also, why only this site and not all task_rq_lock() sites?
On Tue, 22/07/2014 at 13:45 +0200, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 03:30:16PM +0400, Kirill Tkhai wrote:
> >
> > This is a new on_rq state for the case when a task is migrating
> > from one src_rq to another dst_rq, and the locks of both RQs
> > are unlocked.
> >
> > We will use the state this way:
> >
> > raw_spin_lock(&src_rq->lock);
> > dequeue_task(src_rq, p, 0);
> > p->on_rq = ONRQ_MIGRATING;
> > set_task_cpu(p, dst_cpu);
> > raw_spin_unlock(&src_rq->lock);
> >
> > raw_spin_lock(&dst_rq->lock);
> > p->on_rq = ONRQ_QUEUED;
> > enqueue_task(dst_rq, p, 0);
> > raw_spin_unlock(&dst_rq->lock);
> >
> > The benefit is that double_rq_lock() is not needed now,
> > and this may reduce latencies in some situations.
> >
> > The logic of try_to_wake_up() remains the same as it
> > was. Its behaviour changes only in a small subset of cases
> > (when a preempted task in a !TASK_RUNNING state is queued
> > on a rq and we are migrating it to another one).
>
> more details is better ;-) Also, I think Oleg enjoys these kind of
> things, so I've added him to the CC.
try_to_wake_up() wakes tasks in particular states; nobody calls it
with a TASK_RUNNING argument. So our logic only deals with tasks in
a !TASK_RUNNING state.

If such a task is on the rq and !TASK_RUNNING, this means it was
preempted via preempt_schedule{,_irq}. And if it was preempted in a
!TASK_RUNNING state, this happened in one of the cases like:

set_current_state(TASK_INTERRUPTIBLE);
(actions)
schedule();

And someone is migrating this task at the same time.

A really small subset of cases :)
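For illustration only, a sketch of such a window (the function and the
completion variable are made up; set_current_state()/schedule() is the
real pattern):

static int example_wait(struct completion *done)
{
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * If preempt_schedule() (or preempt_schedule_irq()) fires here,
         * the task is preempted while it is still queued on the rq, but
         * with ->state == TASK_INTERRUPTIBLE.  A load balancer migrating
         * it in this window gives exactly the case described above.
         */
        if (!completion_done(done))
                schedule();

        __set_current_state(TASK_RUNNING);
        return 0;
}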
> A few questions, haven't really thought about things yet.
>
> > @@ -1491,10 +1491,14 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
> > static void
> > ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > {
> > - check_preempt_curr(rq, p, wake_flags);
> > trace_sched_wakeup(p, true);
> >
> > p->state = TASK_RUNNING;
> > +
> > + if (!task_queued(p))
> > + return;
>
> How can this happen? We're in the middle of a wakeup, we've just added
> the task to the rq and are still holding the appropriate rq->lock.
try_to_wake_up()->ttwu_remote()->ttwu_do_wakeup():

The task is migrating at the moment, and the only thing we do is
change its state to TASK_RUNNING. It will be queued with the new
state (TASK_RUNNING) by the code which is migrating it.

This situation does not look very frequent in general. It only
happens when we are migrating a task which was preempted in a
!TASK_RUNNING state.
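For illustration, the hunk could carry a comment along these lines
(the wording is only a suggestion):

        p->state = TASK_RUNNING;

        /*
         * A task with p->on_rq == ONRQ_MIGRATING is not queued on this
         * rq.  Only mark it TASK_RUNNING here; the code performing the
         * migration will enqueue it on the destination rq with the new
         * state.
         */
        if (!task_queued(p))
                return;

        check_preempt_curr(rq, p, wake_flags);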
> > @@ -4623,9 +4629,14 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
> > struct rq *rq;
> > unsigned int dest_cpu;
> > int ret = 0;
> > -
> > +again:
> > rq = task_rq_lock(p, &flags);
> >
> > + if (unlikely(p->on_rq) == ONRQ_MIGRATING) {
> > + task_rq_unlock(rq, p, &flags);
> > + goto again;
> > + }
> > +
> > if (cpumask_equal(&p->cpus_allowed, new_mask))
> > goto out;
> >
>
> That looks like a non-deterministic spin loop, 'waiting' for the
> migration to finish. Not particularly nice and something I think we
> should avoid for it has bad (TM) worst case behaviour.
>
> Also, why only this site and not all task_rq_lock() sites?
All other places test for task_queued(p) under the rq's lock. I went
through all of them and did not find a place which needs that. For
example, rt_mutex_setprio() enqueues only if the task was really queued
before. Patch [1/5] made all the preparation for that. Hopefully,
nothing was skipped by /me.
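E.g. the usual shape of such a place, simplified from rt_mutex_setprio()
and similar code (enqueue flags omitted):

        queued = task_queued(p);        /* false for ONRQ_MIGRATING */
        running = task_current(rq, p);
        if (queued)
                dequeue_task(rq, p, 0);
        if (running)
                p->sched_class->put_prev_task(rq, p);

        /* ... change the task's parameters (prio, class, ...) ... */

        if (running)
                p->sched_class->set_curr_task(rq);
        if (queued)
                enqueue_task(rq, p, 0);

A migrating task is simply treated as not queued; the code which is
doing the migration enqueues it later, when the updated parameters are
already in place.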
About set_cpus_allowed_ptr(): we could return -EAGAIN if a task is
migrating, but set_cpus_allowed_ptr() must not fail on kernel threads.
We would miss something important that way, softirq affinity for
example.
Going to the again label looks like a manual spinlock to me. We used
to spin there before: the time the lock was held was similar, and we
also competed with other users of the rq's lock. The lock was simply
held the whole time, so we did not have the overhead of repeated
spin_{lock,unlock} here.

I don't see what to do instead...
Thanks,
Kirill
On Tue, 22 Jul 2014 13:45:42 +0200
Peter Zijlstra <[email protected]> wrote:
> > @@ -1491,10 +1491,14 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
> > static void
> > ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > {
> > - check_preempt_curr(rq, p, wake_flags);
> > trace_sched_wakeup(p, true);
> >
> > p->state = TASK_RUNNING;
> > +
> > + if (!task_queued(p))
> > + return;
>
> How can this happen? We're in the middle of a wakeup, we've just added
> the task to the rq and are still holding the appropriate rq->lock.
I believe it can be in the migrating state. A comment would be useful
here.
>
> > @@ -4623,9 +4629,14 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
> > struct rq *rq;
> > unsigned int dest_cpu;
> > int ret = 0;
> > -
> > +again:
> > rq = task_rq_lock(p, &flags);
> >
> > + if (unlikely(p->on_rq) == ONRQ_MIGRATING) {
> > + task_rq_unlock(rq, p, &flags);
> > + goto again;
> > + }
> > +
> > if (cpumask_equal(&p->cpus_allowed, new_mask))
> > goto out;
> >
>
> That looks like a non-deterministic spin loop, 'waiting' for the
> migration to finish. Not particularly nice and something I think we
> should avoid for it has bad (TM) worst case behaviour.
As this patch doesn't introduce the MIGRATING getting set yet, I'd be
interested in this too. I'm assuming that the MIGRATING flag is only
set and then cleared within an interrupts disabled section, such that
the time is no more than a spinlock being taken.
I would also add a cpu_relax() there too.
>
> Also, why only this site and not all task_rq_lock() sites?
I'm assuming that it's because set_cpus_allowed_ptr() is supposed to
return with the task already migrated to the CPUs it is allowed on, and
not before.
-- Steve
On Tue, 22/07/2014 at 08:25 -0400, Steven Rostedt wrote:
> On Tue, 22 Jul 2014 13:45:42 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>
> > > @@ -1491,10 +1491,14 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
> > > static void
> > > ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> > > {
> > > - check_preempt_curr(rq, p, wake_flags);
> > > trace_sched_wakeup(p, true);
> > >
> > > p->state = TASK_RUNNING;
> > > +
> > > + if (!task_queued(p))
> > > + return;
> >
> > How can this happen? We're in the middle of a wakeup, we've just added
> > the task to the rq and are still holding the appropriate rq->lock.
>
> I believe it can be in the migrating state. A comment would be useful
> here.
Sure, I'll update. A stupid question: should I resend the whole series,
or is one message in this thread enough?
>
> > > @@ -4623,9 +4629,14 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
> > > struct rq *rq;
> > > unsigned int dest_cpu;
> > > int ret = 0;
> > > -
> > > +again:
> > > rq = task_rq_lock(p, &flags);
> > >
> > > + if (unlikely(p->on_rq) == ONRQ_MIGRATING) {
> > > + task_rq_unlock(rq, p, &flags);
> > > + goto again;
> > > + }
> > > +
> > > if (cpumask_equal(&p->cpus_allowed, new_mask))
> > > goto out;
> > >
> >
> > That looks like a non-deterministic spin loop, 'waiting' for the
> > migration to finish. Not particularly nice and something I think we
> > should avoid for it has bad (TM) worst case behaviour.
>
> As this patch doesn't introduce the MIGRATING getting set yet, I'd be
> interested in this too. I'm assuming that the MIGRATING flag is only
> set and then cleared within an interrupts disabled section, such that
> the time is no more than a spinlock being taken.
>
> I would also add a cpu_relax() there too.
Sadly, I didn't completely understand what you mean. Could you please
explain what has to be changed?

(I see the misplaced unlikely(); that is an error. Is the other thing
that there is no task_migrating() helper?)
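I.e. for v2 that place would become something like this (task_migrating()
would be a new helper, name not final; the cpu_relax() is from your
suggestion):

again:
        rq = task_rq_lock(p, &flags);

        if (unlikely(task_migrating(p))) {
                task_rq_unlock(rq, p, &flags);
                cpu_relax();
                goto again;
        }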
Thanks,
Kirill
Oh, sorry for the delay, I didn't even try to read most of my emails
for several days.
On 07/22, Peter Zijlstra wrote:
>
> more details is better ;-) Also, I think Oleg enjoys these kind of
> things, so I've added him to the CC.
Thanks. Trust me, at least I like this much more than what I had to
do this week ;)
This change depends on the previous patches and I am too lazy to try
to find/download the full series.
Kirill, since you are going to send v2, could you cc me? No, it is not
that I think I can help or even review (or maybe even understand ;)
But this looks interesting to me, and I'd like to know about these changes.
Oleg.
On Thu, 24/07/2014 at 21:03 +0200, Oleg Nesterov wrote:
> Oh, sorry for the delay, I didn't even try to read most of my emails
> for several days.
>
> On 07/22, Peter Zijlstra wrote:
> >
> > more details is better ;-) Also, I think Oleg enjoys these kind of
> > things, so I've added him to the CC.
>
> Thanks. Trust me, at least I like this much more than what I had to
> do this week ;)
>
> This change depends on the previous patches and I am too lazy to try
> to find/download the full series.
>
> Kirill, since you are going to send v2, could you cc me? No, it is not
> that I think I can help or even review (or maybe even understand ;)
> But this looks interesting to me, and I'd like to know about these changes.
I'm going to update the series over the weekend, and of course you'll
be CCed :)
Regards,
Kirill