2013-07-13 15:45:56

by Kirill Tkhai

Subject: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

This patch optimizes try_to_wake_up() for the case when the system
is doing parallel wake-ups of the same task on different cpus. It
also adds accounting of statistics for these situations.

We check the state of the task we want to wake up. If it is
TASK_WAKING, the task is already being handled by try_to_wake_up()
on another cpu, and shortly after this check it will be queued and
its state will be TASK_RUNNING. So we can return early once we are
sure the task will become TASK_RUNNING in the near future (possibly
right after the check). The benefit is that we do not spin waiting
for the spinlock.

Returning early should not cause any problems, because the
scheduler already does the same thing when it queues a task on the
wake_list to be woken up on another cpu.
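
In distilled form, the proposed fast path looks roughly like this
(a simplified sketch of the core.c hunk in the diff below):

	if (p->state == TASK_WAKING) {
		/*
		 * Another cpu is already waking p, so it is guaranteed
		 * to become TASK_RUNNING shortly; skip taking pi_lock.
		 */
		smp_rmb();
		inc_nr_parallel_wakeups(p);
		return 1;
	}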

Parallel wake-ups are not an unrealistic situation. Here are
statistics from my 2-cpu laptop:

~# grep 'nr_wakeups_parallel.*[1-9]' -B100 -h /proc/*/sched | grep 'threads\|parallel\|wakeups ' | sed 's/(.*)//g'

rcu_sched
se.statistics.nr_wakeups : 102
se.statistics.nr_wakeups_parallel : 2
Xorg
se.statistics.nr_wakeups : 36030
se.statistics.nr_wakeups_parallel : 56
gnome-terminal
se.statistics.nr_wakeups : 70573
se.statistics.nr_wakeups_parallel : 55
rcu_preempt
se.statistics.nr_wakeups : 68101
se.statistics.nr_wakeups_parallel : 1368

These numbers were taken shortly after boot (5-10 minutes of uptime).
Later the ratio goes down:

rcu_sched
se.statistics.nr_wakeups : 102
se.statistics.nr_wakeups_parallel : 2
Xorg
se.statistics.nr_wakeups : 49057
se.statistics.nr_wakeups_parallel : 74
gnome-terminal
se.statistics.nr_wakeups : 1495463
se.statistics.nr_wakeups_parallel : 99
rcu_preempt
se.statistics.nr_wakeups : 2015010
se.statistics.nr_wakeups_parallel : 3738

Signed-off-by: Kirill Tkhai <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 29 +++++++++++++++++++++++++----
kernel/sched/debug.c | 7 +++++++
kernel/sched/stats.h | 16 ++++++++++++++++
4 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc09d21..235a466 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -964,6 +964,7 @@ struct sched_statistics {
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
+ atomic_t nr_wakeups_parallel;
};
#endif

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d06ad6..1e1475f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)

p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
+ /*
+ * Pair bracket with TASK_WAKING check it try_to_wake_up()
+ */
+ smp_wmb();
+
if (p->sched_class->task_woken)
p->sched_class->task_woken(rq, p);

@@ -1487,20 +1492,37 @@ static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
- int cpu, success = 0;
+ int cpu, success = 1;

+ /*
+ * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.
+ */
smp_wmb();
+#ifdef CONFIG_SMP
+ if (p->state == TASK_WAKING) {
+ /*
+ * Pairs with sets of p->state: below and in ttwu_do_wakeup().
+ */
+ smp_rmb();
+ inc_nr_parallel_wakeups(p);
+ return success;
+ }
+#endif
raw_spin_lock_irqsave(&p->pi_lock, flags);
- if (!(p->state & state))
+ if (!(p->state & state)) {
+ success = 0;
goto out;
+ }

- success = 1; /* we're going to change ->state */
cpu = task_cpu(p);

if (p->on_rq && ttwu_remote(p, wake_flags))
goto stat;

#ifdef CONFIG_SMP
+ p->state = TASK_WAKING;
+ smp_wmb();
+
/*
* If the owning (remote) cpu is still in the middle of schedule() with
* this task as prev, wait until its done referencing the task.
@@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
smp_rmb();

p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;

if (p->sched_class->task_waking)
p->sched_class->task_waking(p);
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e076bdd..f18736d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -542,6 +542,13 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.statistics.nr_wakeups_idle);

{
+ int nr = get_nr_parallel_wakeups(p);
+
+ SEQ_printf(m, "%-45s:%21d\n",
+ "se.statistics.nr_wakeups_parallel", nr);
+ }
+
+ {
u64 avg_atom, avg_per_cpu;

avg_atom = p->se.sum_exec_runtime;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 5aef494..dbbc6e9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -155,6 +155,22 @@ sched_info_switch(struct task_struct *prev, struct task_struct *next)
#define sched_info_switch(t, next) do { } while (0)
#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */

+#if defined(CONFIG_SCHEDSTATS) && defined(CONFIG_SMP)
+static inline void
+inc_nr_parallel_wakeups(struct task_struct *t)
+{
+ atomic_inc(&t->se.statistics.nr_wakeups_parallel);
+}
+static inline int
+get_nr_parallel_wakeups(struct task_struct *t)
+{
+ return atomic_read(&t->se.statistics.nr_wakeups_parallel);
+}
+#else
+#define inc_nr_parallel_wakeups(t) do { } while (0)
+#define get_nr_parallel_wakeups(t) (0)
+#endif /* CONFIG_SCHEDSTATS && CONFIG_SMP */
+
/*
* The following are functions that support scheduler-internal time accounting.
* These functions are generally called at the timer tick. None of this depends


2013-07-14 05:55:54

by Mike Galbraith

Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

On Sat, 2013-07-13 at 19:45 +0400, Kirill Tkhai wrote:
> This patch optimizes try_to_wake_up() for the case when the system
> is doing parallel wake-ups of the same task on different cpus. It
> also adds accounting of statistics for these situations.
>
> We check the state of the task we want to wake up. If it is
> TASK_WAKING, the task is already being handled by try_to_wake_up()
> on another cpu, and shortly after this check it will be queued and
> its state will be TASK_RUNNING. So we can return early once we are
> sure the task will become TASK_RUNNING in the near future (possibly
> right after the check). The benefit is that we do not spin waiting
> for the spinlock.

Hm, you're adding cycles to the common case to shave spin cycles in the
very rare case, then spending some to note that a collision happened.
What makes recording worth even 1 cycle? What am I gonna do with the
knowledge that $dinkynum wakeups intersected at some random task?

-Mike

2013-07-15 06:32:05

by Peter Zijlstra

Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

On Sat, Jul 13, 2013 at 07:45:49PM +0400, Kirill Tkhai wrote:
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 29 +++++++++++++++++++++++++----
> kernel/sched/debug.c | 7 +++++++
> kernel/sched/stats.h | 16 ++++++++++++++++
> 4 files changed, 49 insertions(+), 4 deletions(-)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fc09d21..235a466 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -964,6 +964,7 @@ struct sched_statistics {
> u64 nr_wakeups_affine_attempts;
> u64 nr_wakeups_passive;
> u64 nr_wakeups_idle;
> + atomic_t nr_wakeups_parallel;

bad idea.. atomics are expensive and stats aren't generally _that_
important.

> };
> #endif
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9d06ad6..1e1475f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>
> p->state = TASK_RUNNING;
> #ifdef CONFIG_SMP
> + /*
> + * Pair bracket with TASK_WAKING check it try_to_wake_up()
> + */
> + smp_wmb();
> +

Like Mike said, you're making the common case more expensive, this is
not appreciated.

> if (p->sched_class->task_woken)
> p->sched_class->task_woken(rq, p);
>
> @@ -1487,20 +1492,37 @@ static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
> unsigned long flags;
> - int cpu, success = 0;
> + int cpu, success = 1;
>
> + /*
> + * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.

That must be the worst barrier comment ever.. If you want it there, just
copy/paste the commit log here. It is also completely unrelated to the
rest of the patch.

> + */
> smp_wmb();
> +#ifdef CONFIG_SMP
> + if (p->state == TASK_WAKING) {
> + /*
> + * Pairs with sets of p->state: below and in ttwu_do_wakeup().
> + */
> + smp_rmb();
> + inc_nr_parallel_wakeups(p);
> + return success;

This is wrong, you didn't change p->state, therefore returning 1 is not
an option. If you want to do something like this, treat it like waking
an already running task.

Also, the barrier comments fail to explain the exact problem they're
solving.


> + }
> +#endif
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> - if (!(p->state & state))
> + if (!(p->state & state)) {
> + success = 0;
> goto out;
> + }
>
> - success = 1; /* we're going to change ->state */
> cpu = task_cpu(p);
>
> if (p->on_rq && ttwu_remote(p, wake_flags))
> goto stat;
>
> #ifdef CONFIG_SMP
> + p->state = TASK_WAKING;
> + smp_wmb();
> +

This too is broken; the loop below needs to be completed first,
otherwise we change p->state while the task is still on the CPU and it
might read the wrong p->state.

> /*
> * If the owning (remote) cpu is still in the middle of schedule() with
> * this task as prev, wait until its done referencing the task.
> @@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> smp_rmb();
>
> p->sched_contributes_to_load = !!task_contributes_to_load(p);
> - p->state = TASK_WAKING;
>
> if (p->sched_class->task_waking)
> p->sched_class->task_waking(p);
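
For context, the "waking an already running task" path mentioned above is
ttwu_remote(); in kernels of this vintage it looks roughly like the sketch
below (paraphrased, not an exact copy of kernel/sched/core.c). Success is
only reported after the rq lock has been taken and the task has actually
been found on its runqueue, and the statistics bump done afterwards in
ttwu_stat() uses plain u64 counters, without atomics.

	static int ttwu_remote(struct task_struct *p, int wake_flags)
	{
		struct rq *rq;
		int ret = 0;

		rq = __task_rq_lock(p);		/* serializes against schedule() */
		if (p->on_rq) {
			/* flips p->state back to TASK_RUNNING under the lock */
			ttwu_do_wakeup(rq, p, wake_flags);
			ret = 1;
		}
		__task_rq_unlock(rq);

		return ret;
	}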

2013-07-15 14:14:39

by Kirill Tkhai

Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

Hi, Peter,

15.07.2013, 10:32, "Peter Zijlstra" <[email protected]>:
> On Sat, Jul 13, 2013 at 07:45:49PM +0400, Kirill Tkhai wrote:
>
>>  ---
>>   include/linux/sched.h |    1 +
>>   kernel/sched/core.c   |   29 +++++++++++++++++++++++++----
>>   kernel/sched/debug.c  |    7 +++++++
>>   kernel/sched/stats.h  |   16 ++++++++++++++++
>>   4 files changed, 49 insertions(+), 4 deletions(-)
>>  diff --git a/include/linux/sched.h b/include/linux/sched.h
>>  index fc09d21..235a466 100644
>>  --- a/include/linux/sched.h
>>  +++ b/include/linux/sched.h
>>  @@ -964,6 +964,7 @@ struct sched_statistics {
>>           u64 nr_wakeups_affine_attempts;
>>           u64 nr_wakeups_passive;
>>           u64 nr_wakeups_idle;
>>  + atomic_t nr_wakeups_parallel;
>
> bad idea.. atomics are expensive and stats aren't generally _that_
> important.
>
>>  };
>>  #endif
>>
>>  diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>  index 9d06ad6..1e1475f 100644
>>  --- a/kernel/sched/core.c
>>  +++ b/kernel/sched/core.c
>>  @@ -1336,6 +1336,11 @@ ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>>
>>           p->state = TASK_RUNNING;
>>  #ifdef CONFIG_SMP
>>  + /*
>>  + * Pair bracket with TASK_WAKING check it try_to_wake_up()
>>  + */
>>  + smp_wmb();
>>  +
>
> Like Mike said, you're making the common case more expensive, this is
> not appreciated.
>
>>           if (p->sched_class->task_woken)
>>                   p->sched_class->task_woken(rq, p);
>>
>>  @@ -1487,20 +1492,37 @@ static int
>>  try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>  {
>>           unsigned long flags;
>>  - int cpu, success = 0;
>>  + int cpu, success = 1;
>>
>>  + /*
>>  + * See commentary for commit 04e2f1741d235ba599037734878d72e57cb302b5.
>
> That must be the worst barrier comment ever.. If you want it there, just
> copy/paste the commit log here. It is also completely unrelated to the
> rest of the patch.
>
>>  + */
>>           smp_wmb();
>>  +#ifdef CONFIG_SMP
>>  + if (p->state == TASK_WAKING) {
>>  + /*
>>  + * Pairs with sets of p->state: below and in ttwu_do_wakeup().
>>  + */
>>  + smp_rmb();
>>  + inc_nr_parallel_wakeups(p);
>>  + return success;
>
> This is wrong, you didn't change p->state, therefore returning 1 is not
> an option. If you want to do something like this, treat it like waking
> an already running task.
>
> Also, the barrier comments fail to explain the exact problem they're
> solving.
>
>>  + }
>>  +#endif
>>           raw_spin_lock_irqsave(&p->pi_lock, flags);
>>  - if (!(p->state & state))
>>  + if (!(p->state & state)) {
>>  + success = 0;
>>                   goto out;
>>  + }
>>
>>  - success = 1; /* we're going to change ->state */
>>           cpu = task_cpu(p);
>>
>>           if (p->on_rq && ttwu_remote(p, wake_flags))
>>                   goto stat;
>>
>>  #ifdef CONFIG_SMP
>>  + p->state = TASK_WAKING;
>>  + smp_wmb();
>>  +
>
> This too is broken; the loop below needs to be completed first,
> otherwise we change p->state while the task is still on the CPU and it
> might read the wrong p->state.

This place is below the (on_rq && ttwu_remote) check, so the task
is either 'dequeued and on_cpu == 0' or in the middle of schedule()
on an arch that wants an unlocked context switch.

Nothing in the scheduler probes p->state between prepare_lock_switch()
and finish_lock_switch(). Archs with unlocked ctx switch (mips and ia64)
don't change or probe the state of the previous process during
context_switch().

The only exception is the TASK_DEAD case, but it is handled above by
the 'p->state & state' check. Nobody passes TASK_DEAD to try_to_wake_up().

p->state might also be changed outside the scheduler, by ptrace. But
that doesn't look like it would be affected by the earlier TASK_WAKING set.

So I don't see any problem with it. Peter, what exactly do you mean?

The other problems are easy to fix, all except the common-case overhead.
If it's really significant in scheduler terms, I'll take that into
account in the future.

>>           /*
>>            * If the owning (remote) cpu is still in the middle of schedule() with
>>            * this task as prev, wait until its done referencing the task.
>>  @@ -1513,7 +1535,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>           smp_rmb();
>>
>>           p->sched_contributes_to_load = !!task_contributes_to_load(p);
>>  - p->state = TASK_WAKING;
>>
>>           if (p->sched_class->task_waking)
>>                   p->sched_class->task_waking(p);

2013-07-15 20:19:54

by Peter Zijlstra

Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

On Mon, Jul 15, 2013 at 06:14:34PM +0400, Kirill Tkhai wrote:
> >>
> >>  #ifdef CONFIG_SMP
> >> + p->state = TASK_WAKING;
> >> + smp_wmb();
> >> +
> >
> > This too is broken; the loop below needs to be completed first,
> > otherwise we change p->state while the task is still on the CPU and it
> > might read the wrong p->state.
>
> This place is below the (on_rq && ttwu_remote) check, so the task
> is either 'dequeued and on_cpu == 0' or in the middle of schedule()
> on an arch that wants an unlocked context switch.
>
> Nothing in the scheduler probes p->state between prepare_lock_switch()
> and finish_lock_switch(). Archs with unlocked ctx switch (mips and ia64)
> don't change or probe the state of the previous process during
> context_switch().

It means it's after deactivate_task(), but before context_switch(). It so
happens that context_switch()->prepare_task_switch()->trace_sched_switch()
inspects p->state.

Even if this were not the case, touching a task that is 'live' on another
CPU is very _very_ bad practice.
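
The inspection referred to above happens via the sched_switch tracepoint:
roughly (a paraphrased sketch of the 3.10-era code paths, not an exact
copy), prepare_task_switch() fires the tracepoint, and the event records
prev->state, so a TASK_WAKING store from a remote cpu at that point would
show up as the "previous state" of the task being switched out.

	static inline void
	prepare_task_switch(struct rq *rq, struct task_struct *prev,
			    struct task_struct *next)
	{
		trace_sched_switch(prev, next);	/* records prev->state */
		/* ... perf, preempt notifiers, prepare_lock_switch() ... */
	}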

2013-07-15 20:29:07

by Kirill Tkhai

Subject: Re: [PATCH] sched: Add logic to handle parallel try_to_wake_up() of the same task

16.07.2013, 00:19, "Peter Zijlstra" <[email protected]>:
> On Mon, Jul 15, 2013 at 06:14:34PM +0400, Kirill Tkhai wrote:
>
>>>>  #ifdef CONFIG_SMP
>>>> + p->state = TASK_WAKING;
>>>> + smp_wmb();
>>>> +
>>> This too is broken; the loop below needs to be completed first,
>>> otherwise we change p->state while the task is still on the CPU and it
>>> might read the wrong p->state.
>> This place is below the (on_rq && ttwu_remote) check, so the task
>> is either 'dequeued and on_cpu == 0' or in the middle of schedule()
>> on an arch that wants an unlocked context switch.
>>
>> Nothing in the scheduler probes p->state between prepare_lock_switch()
>> and finish_lock_switch(). Archs with unlocked ctx switch (mips and ia64)
>> don't change or probe the state of the previous process during
>> context_switch().
>
> It means it's after deactivate_task(), but before context_switch(). It so
> happens that
> context_switch()->prepare_task_switch()->trace_sched_switch() inspects
> p->state.
>
> Even if this were not the case, touching a task that is 'live' on another
> CPU is very _very_ bad practice.

Thanks for the explanation.

Kirill