2010-12-16 15:14:21

by Peter Zijlstra

Subject: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2

Hi, here's a new posting of my scary patch(es) ;-)

These actually survive a sembench run (and everything else I threw at it).
The discussion between Mike and Frank over the task_running() check made me
realize what was wrong with the previous one.

As it turns out, what was needed (p->oncpu) was something Thomas wanted me
to do for an entirely different reason (see patch #2).

Frank's patch, while encouraging me to poke at it again, has a number of
very fundamental problems with it, the most serious one being that it
completely wrecks the wake-up load-balancing.

I'll try to come up with a way to unwreck the 32-bit select_task_rq_fair()
problem, but for now inspiration in that department seems lacking; still, I
wanted to share these patches so that others can have a go at them.

If all you care about is the numbers, skip to patch #5.


2010-12-16 19:12:31

by Frank Rowand

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2

On 12/16/10 06:56, Peter Zijlstra wrote:
> Hi, here's a new posting of my scary patch(es) ;-)
>
> These actually survive a sembench run (and everything else I threw at it).
> The discussion between Mike and Frank over the task_running() check made me
> realize what was wrong with the previous one.
>
> As it turns out, what was needed (p->oncpu) was something Thomas wanted me
> to do for an entirely different reason (see patch #2).
>
> Frank's patch, while encouraging me to poke at it again, has a number of
> very fundamental problems with it, the most serious one being that it
> completely wrecks the wake-up load-balancing.

Also, as Peter pointed out when I posted the patch (thank you, Peter), I
did not properly handle the return value of try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived. I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
- Ensure the return value of try_to_wake_up() is correct, even when
  queueing the wake up on a different cpu.
- rq->lock contention reduction is not as good as in the first version

patch 1

The core changes. All the scary lock related stuff.

select_task_rq() runs on the old task_cpu(p), so it uses that cpu's
smp_processor_id() instead of the waking cpu's smp_processor_id().

patch 2

select_task_rq() uses the waking cpu's smp_processor_id().

Limitations
- x86 only

Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 75%
rq->lock contention wait time reduced by ~ 70%
test duration reduction is in the noise
rq->lock contention improvement is slightly better with just patch 1
applied, but the difference is in the noise

Todo
- handle cpu being offlined
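
To make the queued wake-up protocol concrete, here is a minimal userspace
model of the wake-list idea from patch 1. C11 atomics stand in for the
kernel's cmpxchg/xchg and printf stands in for the IPI; the names mirror
the patch, but everything else is illustrative, not the kernel code:

#include <stdatomic.h>
#include <stdio.h>

struct task {
        int pid;
        struct task *ttwu_queue_wake_entry;     /* link in a cpu's wake list */
};

/* stand-in for the per-cpu rq->wake_list */
static _Atomic(struct task *) wake_list[2];

/* waker side: the lock-free push done by ttwu_queue_wake_up() */
static void queue_wakeup(struct task *p, int cpu)
{
        struct task *next = atomic_load(&wake_list[cpu]);

        do {
                p->ttwu_queue_wake_entry = next;
        } while (!atomic_compare_exchange_weak(&wake_list[cpu], &next, p));

        /* only the push that finds an empty list needs to interrupt */
        if (!next)
                printf("would smp_send_reschedule(%d)\n", cpu);
}

/* target side: as in sched_ttwu_pending(), one xchg takes the whole list */
static void drain_wakeups(int cpu)
{
        struct task *p = atomic_exchange(&wake_list[cpu], NULL);

        while (p) {
                printf("cpu %d: waking pid %d\n", cpu, p->pid);
                p = p->ttwu_queue_wake_entry;
        }
}

int main(void)
{
        struct task a = { .pid = 100 }, b = { .pid = 101 };

        queue_wakeup(&a, 1);    /* empty list: this push raises the IPI */
        queue_wakeup(&b, 1);    /* piggybacks on the already-pending IPI */
        drain_wakeups(1);       /* wakes b then a (LIFO order) */
        return 0;
}

The single xchg in the drain means the target cpu never loops against
concurrent pushers, and the empty-to-non-empty transition is the only
point that needs an interrupt.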


-Frank

2010-12-16 19:36:39

by Frank Rowand

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2



patch 1 of 2

Signed-off-by: Frank Rowand <[email protected]>

---
arch/x86/kernel/smp.c | 1 1 + 0 - 0 !
include/linux/sched.h | 5 5 + 0 - 0 !
kernel/sched.c | 105 99 + 6 - 0 !
3 files changed, 105 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
 	/*
 	 * KVM uses this interrupt to force a cpu out of guest mode
 	 */
+	sched_ttwu_pending();
 }

 void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
  */
 #define WF_SYNC		0x01	/* waker goes to sleep after wakup */
 #define WF_FORK		0x02	/* child wakeup after fork */
+#define WF_LOAD		0x04	/* for queued try_to_wake_up() */

 #define ENQUEUE_WAKEUP	1
 #define ENQUEUE_WAKING	2
@@ -1193,6 +1194,8 @@ struct task_struct {
 	int lock_depth;		/* BKL lock depth */

 #ifdef CONFIG_SMP
+	struct task_struct *ttwu_queue_wake_entry;
+	int ttwu_queue_wake_flags;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
@@ -2017,6 +2020,7 @@ extern void release_uids(struct user_nam

 extern void do_timer(unsigned long ticks);

+extern void sched_ttwu_pending(void);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
 	u64 age_stamp;
 	u64 idle_stamp;
 	u64 avg_idle;
+
+	struct task_struct *wake_list;
 #endif

 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -2332,6 +2334,28 @@ static inline void ttwu_post_activation(
 		wq_worker_waking_up(p, cpu_of(rq));
 }

+#ifdef CONFIG_SMP
+static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+{
+	struct task_struct *next = NULL;
+	struct rq *rq = cpu_rq(cpu);
+
+	p->ttwu_queue_wake_flags = wake_flags;
+
+	for (;;) {
+		struct task_struct *old = next;
+
+		p->ttwu_queue_wake_entry = next;
+		next = cmpxchg(&rq->wake_list, old, p);
+		if (next == old)
+			break;
+	}
+
+	if (!next)
+		smp_send_reschedule(cpu);
+}
+#endif
+
 /**
  * try_to_wake_up - wake up a thread
  * @p: the thread to be awakened
@@ -2350,20 +2374,88 @@ static inline void ttwu_post_activation(
 static int try_to_wake_up(struct task_struct *p, unsigned int state,
			  int wake_flags)
 {
+/*
+ * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+ * todo
+ *   - pass waking cpu with queued wake up, to be used in call to
+ *     select_task_rq().
+ *   - handle cpu being offlined
+ * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+ */
 	int cpu, orig_cpu, this_cpu, success = 0;
 	unsigned long flags;
 	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
+#ifdef CONFIG_SMP
+	int load;
+#endif

 	this_cpu = get_cpu();

-	smp_wmb();
-	rq = task_rq_lock(p, &flags);
-	if (!(p->state & state))
-		goto out;
+	local_irq_save(flags);

-	if (p->se.on_rq)
-		goto out_running;
+	for (;;) {
+		unsigned int task_state = p->state;
+
+		if (!(task_state & state))
+			goto out_nolock;
+		/*
+		 * task_contributes_to_load() tests p->state
+		 */
+		load = task_contributes_to_load(p);
+
+		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
+			if (state == TASK_WAKING)
+				load = wake_flags & WF_LOAD;
+			break;
+		}
+	}
+
+	/*
+	 * Avoid a possible cross cpu rq lock attempt until we know that a
+	 * lock must be acquired.  The rq lock protects interaction with
+	 * schedule().
+	 *
+	 * p->state == TASK_WAKING protects against any other try_to_wake_up()
+	 * setting p->se.on_rq true after this test.
+	 */
+	if (unlikely(p->se.on_rq)) {
+		smp_wmb();
+		rq = __task_rq_lock(p);
+		if (p->se.on_rq)
+			goto out_running;
+		__task_rq_unlock(rq);
+	}
+
+#ifdef CONFIG_SMP
+	/*
+	 * If task_cpu(p) != this_cpu then the attempt to lock the rq on the
+	 * other cpu can result in rq lock contention.  Queueing this wake up
+	 * on the other cpu may reduce rq lock contention.
+	 *
+	 * All tests that could have led to returning 0 have been completed
+	 * before this point, so the return value will be 1.  The return
+	 * value of the try_to_wake_up() executed after unqueueing the wake
+	 * request cannot be returned to the current caller, so we must know
+	 * here what the return value of the queued request will be.
+	 */
+	cpu = task_cpu(p);
+	if (cpu != this_cpu) {
+		if (load)
+			wake_flags |= WF_LOAD;
+		ttwu_queue_wake_up(p, cpu, wake_flags);
+		success = 1;
+		goto out_nolock;
+	}
+#endif
+
+	/*
+	 * task_cpu(p) may have changed since it was checked, because rq->lock
+	 * is not held.  Thus we may still end up with cross cpu rq lock
+	 * contention.  Encountering this race should be very rare.
+	 */
+	smp_wmb();
+	rq = __task_rq_lock(p);

 	cpu = task_cpu(p);
 	orig_cpu = cpu;
@@ -2378,13 +2470,12 @@ static int try_to_wake_up(struct task_st
	 *
	 * First fix up the nr_uninterruptible count:
	 */
-	if (task_contributes_to_load(p)) {
+	if (load) {
 		if (likely(cpu_online(orig_cpu)))
 			rq->nr_uninterruptible--;
 		else
 			this_rq()->nr_uninterruptible--;
 	}
-	p->state = TASK_WAKING;

 	if (p->sched_class->task_waking) {
 		p->sched_class->task_waking(rq, p);
@@ -2394,6 +2485,10 @@ static int try_to_wake_up(struct task_st
 	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
+	/*
+	 * Protected against concurrent wakeups while rq->lock released because
+	 * p is in TASK_WAKING state.
+	 */
 	__task_rq_unlock(rq);

 	rq = cpu_rq(cpu);
@@ -2430,13 +2525,30 @@ out_activate:
 	success = 1;
 out_running:
 	ttwu_post_activation(p, rq, wake_flags, success);
-out:
-	task_rq_unlock(rq, &flags);
+	__task_rq_unlock(rq);
+out_nolock:
+	local_irq_restore(flags);
 	put_cpu();

 	return success;
 }

+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+	struct rq *rq = this_rq();
+	struct task_struct *p = xchg(&rq->wake_list, NULL);
+
+	if (!p)
+		return;
+
+	while (p) {
+		try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags);
+		p = p->ttwu_queue_wake_entry;
+	}
+}
+#endif
+
 /**
  * try_to_wake_up_local - try to wake up a local task with rq lock held
  * @p: the thread to be awakened

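Aside from the wake list itself, the other key move in this patch is
claiming the task by flipping p->state to TASK_WAKING with cmpxchg, rather
than taking the (possibly remote) rq->lock up front. Below is a standalone
sketch of just that claim loop; the state constants and the load test are
simplified stand-ins and do not match the kernel's definitions:

#include <stdatomic.h>
#include <stdbool.h>

#define TASK_INTERRUPTIBLE	1u
#define TASK_UNINTERRUPTIBLE	2u
#define TASK_WAKING		4u	/* illustrative values only */

struct task {
        _Atomic unsigned int state;
};

/* stand-in for task_contributes_to_load(): tests the sampled state */
static bool contributes_to_load(unsigned int state)
{
        return state == TASK_UNINTERRUPTIBLE;
}

/*
 * Returns true if this caller won the right to wake p. *load is sampled
 * under the same cmpxchg that claims the state, so a waker that loses
 * the race can never count the task's load contribution as well.
 */
static bool claim_task(struct task *p, unsigned int wanted, bool *load)
{
        for (;;) {
                unsigned int task_state = atomic_load(&p->state);

                if (!(task_state & wanted))
                        return false;   /* not in a state we may wake */

                *load = contributes_to_load(task_state);

                /* claim the task and validate the sample in one step */
                if (atomic_compare_exchange_strong(&p->state, &task_state,
                                                   TASK_WAKING))
                        return true;
                /* lost a race with another waker or the task; resample */
        }
}

int main(void)
{
        struct task t = { .state = TASK_UNINTERRUPTIBLE };
        bool load;

        return claim_task(&t, TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE,
                          &load) ? 0 : 1;
}

In the patch, a queued wakeup re-enters try_to_wake_up() with
state == TASK_WAKING, which is why the sampled load bit has to travel
along with the request as WF_LOAD.
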
2010-12-16 19:36:54

by Frank Rowand

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2



patch 2 of 2

Signed-off-by: Frank Rowand <[email protected]>

---
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1060,7 +1060,7 @@ struct sched_class {

 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct rq *rq, struct task_struct *p,
-			       int sd_flag, int flags);
+			       int sd_flag, int flags, int waking_cpu);

 	void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
 	void (*post_schedule) (struct rq *this_rq);
@@ -1196,6 +1196,7 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct task_struct *ttwu_queue_wake_entry;
 	int ttwu_queue_wake_flags;
+	int ttwu_waking_cpu;
 #ifdef __ARCH_WANT_UNLOCKED_CTXSW
 	int oncpu;
 #endif
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2262,9 +2262,11 @@ static int select_fallback_rq(int cpu, s
  * The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
  */
 static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags,
+		   int wake_flags, int waking_cpu)
 {
-	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+	int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags,
+						 waking_cpu);

 	/*
 	 * In order not to call set_task_cpu() on a blocking task we need
@@ -2335,12 +2337,14 @@ static inline void ttwu_post_activation(
 }

 #ifdef CONFIG_SMP
-static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags)
+static void ttwu_queue_wake_up(struct task_struct *p, int this_cpu, int p_cpu,
+			       int wake_flags)
 {
 	struct task_struct *next = NULL;
-	struct rq *rq = cpu_rq(cpu);
+	struct rq *rq = cpu_rq(p_cpu);

 	p->ttwu_queue_wake_flags = wake_flags;
+	p->ttwu_waking_cpu = this_cpu;

 	for (;;) {
 		struct task_struct *old = next;
@@ -2352,7 +2356,7 @@ static void ttwu_queue_wake_up(struct ta
 	}

 	if (!next)
-		smp_send_reschedule(cpu);
+		smp_send_reschedule(p_cpu);
 }
 #endif

@@ -2377,8 +2381,6 @@ static int try_to_wake_up(struct task_st
 /*
  * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  * todo
- *   - pass waking cpu with queued wake up, to be used in call to
- *     select_task_rq().
  *   - handle cpu being offlined
  * xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  */
@@ -2387,7 +2389,7 @@ static int try_to_wake_up(struct task_st
 	unsigned long en_flags = ENQUEUE_WAKEUP;
 	struct rq *rq;
 #ifdef CONFIG_SMP
-	int load;
+	int load, waking_cpu;
 #endif

 	this_cpu = get_cpu();
@@ -2405,8 +2407,12 @@ static int try_to_wake_up(struct task_st
 		load = task_contributes_to_load(p);

 		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
-			if (state == TASK_WAKING)
+			if (state == TASK_WAKING) {
 				load = wake_flags & WF_LOAD;
+				waking_cpu = p->ttwu_waking_cpu;
+			} else {
+				waking_cpu = this_cpu;
+			}
 			break;
 		}
 	}
@@ -2443,7 +2449,7 @@ static int try_to_wake_up(struct task_st
 	if (cpu != this_cpu) {
 		if (load)
 			wake_flags |= WF_LOAD;
-		ttwu_queue_wake_up(p, cpu, wake_flags);
+		ttwu_queue_wake_up(p, this_cpu, cpu, wake_flags);
 		success = 1;
 		goto out_nolock;
 	}
@@ -2482,7 +2488,7 @@ static int try_to_wake_up(struct task_st
 		en_flags |= ENQUEUE_WAKING;
 	}

-	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
+	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags, waking_cpu);
 	if (cpu != orig_cpu)
 		set_task_cpu(p, cpu);
 	/*
@@ -2728,7 +2734,7 @@ void wake_up_new_task(struct task_struct
	 * We set TASK_WAKING so that select_task_rq() can drop rq->lock
	 * without people poking at ->cpus_allowed.
	 */
-	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+	cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0, cpu);
 	set_task_cpu(p, cpu);

 	p->state = TASK_RUNNING;
@@ -3327,7 +3333,8 @@ void sched_exec(void)
 	int dest_cpu;

 	rq = task_rq_lock(p, &flags);
-	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+	dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0,
+						  smp_processor_id());
 	if (dest_cpu == smp_processor_id())
 		goto unlock;

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1606,10 +1606,10 @@ static int select_idle_sibling(struct ta
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag,
+		    int wake_flags, int cpu)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
-	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
Index: linux-2.6/kernel/sched_idletask.c
===================================================================
--- linux-2.6.orig/kernel/sched_idletask.c
+++ linux-2.6/kernel/sched_idletask.c
@@ -7,7 +7,8 @@

 #ifdef CONFIG_SMP
 static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag,
+		    int flags, int waking_cpu)
 {
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -973,7 +973,8 @@ static void yield_task_rt(struct rq *rq)
 static int find_lowest_rq(struct task_struct *task);

 static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags,
+		  int waking_cpu)
 {
 	if (sd_flag != SD_BALANCE_WAKE)
 		return smp_processor_id();
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- linux-2.6.orig/kernel/sched_stoptask.c
+++ linux-2.6/kernel/sched_stoptask.c
@@ -10,7 +10,7 @@
 #ifdef CONFIG_SMP
 static int
 select_task_rq_stop(struct rq *rq, struct task_struct *p,
-		    int sd_flag, int flags)
+		    int sd_flag, int flags, int waking_cpu)
 {
 	return task_cpu(p); /* stop tasks as never migrate */
 }

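The threading of the waking cpu through the queued wake up is mechanical,
so a compact sketch of just that flow may help. The names mirror the
patch, but the bodies are illustrative stubs rather than the kernel code:

#include <stdio.h>

struct task {
        int ttwu_waking_cpu;            /* added by this patch */
        int ttwu_queue_wake_flags;
};

/* queue side, as in ttwu_queue_wake_up(): record who initiated the wakeup */
static void queue_side(struct task *p, int this_cpu, int wake_flags)
{
        p->ttwu_queue_wake_flags = wake_flags;
        p->ttwu_waking_cpu = this_cpu;
        /* ... cmpxchg push onto rq->wake_list and the IPI, as in patch 1 ... */
}

/*
 * drain side, as in try_to_wake_up(): a queued wakeup (state ==
 * TASK_WAKING) recovers the original waker; a direct wakeup just uses
 * the current cpu.
 */
static int waking_cpu_for_select(struct task *p, int this_cpu, int queued)
{
        return queued ? p->ttwu_waking_cpu : this_cpu;
}

int main(void)
{
        struct task t;

        queue_side(&t, 0, 0);   /* cpu 0 queues the wake up ... */
        /* ... cpu 1 drains it; select_task_rq() is now biased toward cpu 0 */
        printf("bias cpu: %d\n", waking_cpu_for_select(&t, 1, 1));
        return 0;
}
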
2010-12-16 19:43:12

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2

On Thu, 2010-12-16 at 11:39 -0800, Frank Rowand wrote:
> select_task_rq() runs on the old task_cpu(p), so it uses that cpu's
> smp_processor_id() instead of the waking cpu's smp_processor_id().

So first you bounce the wakeup to the previous task cpu, where you do
select_task_rq(), and then you bounce it to the new target?

2010-12-16 19:45:25

by Frank Rowand

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2

On 12/16/10 11:36, Frank Rowand wrote:
>
>
> patch 1 of 2

The email that explains the context for this does not seem to have gotten
through to the list. Here is the email that this patch should have been
a reply to:

On 12/16/10 06:56, Peter Zijlstra wrote:
> Hi, here's a new posting of my scary patch(es) ;-)
>
> These actually survive a sembench run (and everything else I threw at it).
> The discussion between Mike and Frank over the task_running() check made me
> realize what was wrong with the previous one.
>
> As it turns out, what was needed (p->oncpu) was something Thomas wanted me
> to do for an entirely different reason (see patch #2).
>
> Frank's patch, while encouraging me to poke at it again, has a number of
> very fundamental problems with it, the most serious one being that it
> completely wrecks the wake-up load-balancing.

Also, as Peter pointed out when I posted the patch (thank you, Peter), I
did not properly handle the return value of try_to_wake_up() - a rather
fatal flaw.

By coincidence, I was about to post a new version of my scary patch when
this email arrived. I'll post my patches as a reply to this email, then
read through Peter's.


Frank's patch, Version 2

Changes from Version 1:
- Ensure the return value of try_to_wake_up() is correct, even when
  queueing the wake up on a different cpu.
- rq->lock contention reduction is not as good as in the first version

patch 1

The core changes. All the scary lock related stuff.

select_task_rq() runs on the old task_cpu(p), so it uses that cpu's
smp_processor_id() instead of the waking cpu's smp_processor_id().

patch 2

select_task_rq() uses the waking cpu's smp_processor_id().

Limitations
- x86 only

Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 75%
rq->lock contention wait time reduced by ~ 70%
test duration reduction is in the noise
rq->lock contention improvement is slightly better with just patch 1
applied, but the difference is in the noise

Todo
- handle cpu being offlined


-Frank

2010-12-16 20:46:47

by Frank Rowand

Subject: Re: [RFC][PATCH 0/5] Reduce runqueue lock contention -v2

On 12/16/10 11:42, Peter Zijlstra wrote:
> On Thu, 2010-12-16 at 11:39 -0800, Frank Rowand wrote:
>> select_task_rq() runs on the old task_cpu(p), so it uses that cpu's
>> smp_processor_id() instead of the waking cpu's smp_processor_id().
>
> So first you bounce the wakeup to the previous task cpu, where you do
> select_task_rq(), and then you bounce it to the new target?

Yes. (This is the answer when only the first patch is applied.)

The wakeup is bounced to the previous task cpu. The bounce adds extra
overhead (including an IRQ on the previous task cpu) in return for a
reduction in the number of rq->lock contentions.

Then the try_to_wake_up() on the previous task cpu may wake the task on a
different cpu, but the bias should be toward the cpu that try_to_wake_up()
is now running on.

The second patch in my series instead feeds the waking cpu number to
select_task_rq(), so the bias will be toward waking the task on the
waking cpu.

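For a task whose previous cpu is 1, woken from cpu 0, the bias that
reaches select_task_rq_fair() works out roughly as below (an illustrative
summary with hypothetical cpu numbers, not kernel code):

/*
 * Illustrative only; cpu numbers are hypothetical.
 *
 *   mainline:      ttwu runs on cpu 0 -> bias = smp_processor_id() = 0
 *   patch 1 only:  ttwu queued to cpu 1 via IPI -> bias = 1
 *   patch 1 + 2:   ttwu queued to cpu 1, waking cpu carried along
 *                  -> bias = waking_cpu = 0
 */
static int queued_wake_bias(int waking_cpu, int prev_cpu, int have_patch2)
{
        return have_patch2 ? waking_cpu : prev_cpu;
}
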
-Frank