2010-06-21 10:02:19

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

To return the favour, here's a scary patch that renders your machine a
doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no triple
fault.

It loses the ttwu task_running() check, as I must admit I'm not quite
sure what it does.. Ingo?

It makes the whole TASK_WAKING thing very interesting again :-)

It also re-introduces a bunch of races because we now need to run
->select_task_rq() without holding the rq->lock. We cannot defer running
that because it really wants to know the cpu the wakeup originated from
(affine wakeups and the like).
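
For orientation, here's a minimal user-space sketch of the structure the
diff below aims for (illustration only, not kernel code): the waker flips
->state to TASK_WAKING with cmpxchg(), picks a cpu, and pushes the task onto
that cpu's wake_list; that cpu drains the list from its reschedule interrupt
and activates the task under its own rq->lock. C11 atomics stand in for the
kernel's cmpxchg()/xchg(), and activate_locally() is a stand-in for
activate_task() plus the wakeup-preemption check.

#include <stdatomic.h>
#include <stdio.h>

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_WAKING             256

struct task {
        const char *name;
        _Atomic int state;
        struct task *wake_entry;        /* next pending wakeup */
};

/* one pending-wakeup list per cpu; it lives in struct rq in the patch */
static _Atomic(struct task *) wake_list;

/* stand-in for activate_task() plus the wakeup-preemption check */
static void activate_locally(struct task *p)
{
        p->state = TASK_RUNNING;
        printf("%s activated on the local rq\n", p->name);
}

/* waker side: no remote rq->lock taken, just a lockless push (plus an IPI) */
static void ttwu_queue(struct task *p)
{
        struct task *old = atomic_load(&wake_list);

        do {
                p->wake_entry = old;
        } while (!atomic_compare_exchange_weak(&wake_list, &old, p));

        if (!old) {
                /* list was empty: the patch sends a reschedule IPI here */
        }
}

static int try_to_wake_up(struct task *p, int state)
{
        int cur = atomic_load(&p->state);

        /* claim the wakeup by flipping ->state with cmpxchg() */
        do {
                if (!(cur & state))
                        return 0;
        } while (!atomic_compare_exchange_weak(&p->state, &cur, TASK_WAKING));

        /* select_task_rq() would pick the target cpu here, then: */
        ttwu_queue(p);
        return 1;
}

/* target cpu, from smp_reschedule_interrupt(): drain the list and activate */
static void sched_ttwu_pending(void)
{
        struct task *list = atomic_exchange(&wake_list, NULL);

        while (list) {
                struct task *p = list;
                list = list->wake_entry;
                activate_locally(p);
        }
}

int main(void)
{
        struct task a = { .name = "a", .state = TASK_INTERRUPTIBLE };

        try_to_wake_up(&a, TASK_INTERRUPTIBLE);
        sched_ttwu_pending();
        return 0;
}

The real patch also has to deal with a task that is still running on its old
cpu, which is what the task_running() discussion below is about.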


---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 225 ++++++++++++++++++++++++-----------------------
kernel/sched_fair.c | 10 +--
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 127 insertions(+), 122 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}

void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */

+void sched_ttwu_pending(void);

struct io_context; /* See blkdev.h */

@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);

void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */

#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..ef90585 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif

/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2330,111 +2332,6 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
}

/**
- * try_to_wake_up - wake up a thread
- * @p: the thread to be awakened
- * @state: the mask of task states that can be woken
- * @wake_flags: wake modifier flags (WF_*)
- *
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
- *
- * Returns %true if @p was woken up, %false if it was already running
- * or @state didn't match @p's state.
- */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
-{
- int cpu, orig_cpu, this_cpu, success = 0;
- unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
-
- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
-
- if (p->se.on_rq)
- goto out_running;
-
- cpu = task_cpu(p);
- orig_cpu = cpu;
-
-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
-
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
-
- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
-
- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
-
- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
-
- /*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
- */
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
- }
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
- put_cpu();
-
- return success;
-}
-
-/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
*
@@ -2465,6 +2362,112 @@ static void try_to_wake_up_local(struct task_struct *p)
ttwu_post_activation(p, rq, 0, success);
}

+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ ttwu_post_activation(p, rq, 0, true);
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq));
+ ttwu_activate(p, rq, false, !local, local,
+ ENQUEUE_WAKEUP|ENQUEUE_WAKING);
+ ttwu_post_activation(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+ struct task_struct *list = xchg(&this_rq()->wake_list, NULL);
+ struct rq *rq = this_rq();
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+}
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (cpu == smp_processor_id()) {
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+ return;
+ }
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ resched_cpu(cpu);
+}
+
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+{
+ unsigned long flags;
+ int success = 0;
+ int cpu;
+
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out;
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }
+
+ success = 1;
+
+ if (unlikely(p->se.on_rq) && ttwu_running(p, wake_flags))
+ goto out;
+
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
+
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ ttwu_queue(p, cpu);
+out:
+ local_irq_restore(flags);
+ return success;
+}
+
/**
* wake_up_process - Wake up a specific process
* @p: The process to be woken up.
@@ -2604,7 +2607,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);

p->state = TASK_RUNNING;
@@ -3200,7 +3203,7 @@ void sched_exec(void)
int dest_cpu;

rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)

#ifdef CONFIG_SMP

-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);

+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}

@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;

- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif

diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();


2010-06-21 10:54:23

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On Mon, 2010-06-21 at 12:02 +0200, Peter Zijlstra wrote:
> To return the favour, here's a scary patch that renders your machine a
> doorstop :-) Sadly it just hangs, no splat, no sysrq, no nmi, no triple
> fault.
>
> It loses the ttwu task_running() check, as I must admit I'm not quite
> sure what it does.. Ingo?
>
> It makes the whole TASK_WAKING thing very interesting again :-)
>
> It also re-introduces a bunch of races because we now need to run
> ->select_task_rq() without holding the rq->lock. We cannot defer running
> that because it really wants to know the cpu the wakeup originated from
> (affine wakeups and the like).


Progress, this one gave spectacular fireworks ;-)

---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 247 +++++++++++++++++++++++++----------------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 143 insertions(+), 128 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}

void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */

+void sched_ttwu_pending(void);

struct io_context; /* See blkdev.h */

@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);

void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */

#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..5fe442c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif

/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2285,9 +2287,8 @@ static void update_avg(u64 *avg, u64 sample)
}
#endif

-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static inline void
+ttwu_stat(struct task_struct *p, bool is_sync, bool is_migrate, bool is_local)
{
schedstat_inc(p, se.statistics.nr_wakeups);
if (is_sync)
@@ -2298,8 +2299,6 @@ static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
schedstat_inc(p, se.statistics.nr_wakeups_local);
else
schedstat_inc(p, se.statistics.nr_wakeups_remote);
-
- activate_task(rq, p, en_flags);
}

static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
@@ -2330,111 +2329,6 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
}

/**
- * try_to_wake_up - wake up a thread
- * @p: the thread to be awakened
- * @state: the mask of task states that can be woken
- * @wake_flags: wake modifier flags (WF_*)
- *
- * Put it on the run-queue if it's not already there. The "current"
- * thread is always on the run-queue (except when the actual
- * re-schedule is in progress), and as such you're allowed to do
- * the simpler "current->state = TASK_RUNNING" to mark yourself
- * runnable without the overhead of this.
- *
- * Returns %true if @p was woken up, %false if it was already running
- * or @state didn't match @p's state.
- */
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
-{
- int cpu, orig_cpu, this_cpu, success = 0;
- unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
-
- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
-
- if (p->se.on_rq)
- goto out_running;
-
- cpu = task_cpu(p);
- orig_cpu = cpu;
-
-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
-
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
-
- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
-
- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
-
- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
-
- /*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
- */
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
- }
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
- put_cpu();
-
- return success;
-}
-
-/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
*
@@ -2459,12 +2353,131 @@ static void try_to_wake_up_local(struct task_struct *p)
schedstat_inc(rq, ttwu_count);
schedstat_inc(rq, ttwu_local);
}
- ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+ ttwu_stat(p, false, false, true);
+ activate_task(rq, p, ENQUEUE_WAKEUP);
success = true;
}
ttwu_post_activation(p, rq, 0, success);
}

+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ ttwu_post_activation(p, rq, 0, true);
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq));
+ activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+ | ENQUEUE_WAKING
+#endif
+ );
+ ttwu_post_activation(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+ struct rq *rq = this_rq();
+ struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+#endif
+}
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (cpu == smp_processor_id()) {
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+ return;
+ }
+
+#ifdef CONFIG_SMP
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+#endif
+}
+
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+{
+ unsigned long flags;
+ int success = 0;
+ int cpu = task_cpu(p);
+
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out;
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }
+
+ success = 1;
+
+ if (unlikely(p->se.on_rq) && ttwu_running(p, wake_flags))
+ goto out;
+
+#ifdef CONFIG_SMP
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);
+
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+#endif
+ ttwu_stat(p, wake_flags & WF_SYNC, /* sync */
+ cpu != task_cpu(p), /* migrate */
+ cpu == smp_processor_id()); /* local */
+ ttwu_queue(p, cpu);
+out:
+ local_irq_restore(flags);
+ return success;
+}
+
/**
* wake_up_process - Wake up a specific process
* @p: The process to be woken up.
@@ -2604,7 +2617,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);

p->state = TASK_RUNNING;
@@ -3200,7 +3213,7 @@ void sched_exec(void)
int dest_cpu;

rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)

#ifdef CONFIG_SMP

-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);

+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}

@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;

- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif

diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();

2010-06-21 13:04:54

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
> > It loses the ttwu task_running() check, as I must admit I'm not quite
> > sure what it does.. Ingo?

I think I figured out what it's for: it's for when p is prev in schedule()
after deactivate_task(). We have to call activate_task() again, but
we cannot migrate the task because the CPU it's on is still referencing
it.

I added it back, but still fireworks :-)

2010-06-22 13:33:25

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

So this one boots and builds a kernel on a dual-socket nehalem.

There's still quite a number of XXXs to fix, but I don't think any of the
races have crashing potential; mostly wrong accounting and scheduling
iffiness.

But give it a go.. see what it does for you (x86 only for now).

Ingo, any comments other than, eew, scary? :-)

Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/smp.c | 1 +
include/linux/sched.h | 7 +-
kernel/sched.c | 271 ++++++++++++++++++++++++++++++++---------------
kernel/sched_fair.c | 10 +-
kernel/sched_idletask.c | 2 +-
kernel/sched_rt.c | 4 +-
6 files changed, 198 insertions(+), 97 deletions(-)

diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210..1e487d6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -202,6 +202,7 @@ void smp_reschedule_interrupt(struct pt_regs *regs)
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}

void smp_call_function_interrupt(struct pt_regs *regs)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a61c08c..9ae9fdb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1003,6 +1003,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
}
#endif /* !CONFIG_SMP */

+void sched_ttwu_pending(void);

struct io_context; /* See blkdev.h */

@@ -1046,12 +1047,11 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
- int (*select_task_rq)(struct rq *rq, struct task_struct *p,
- int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);

void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
void (*post_schedule) (struct rq *this_rq);
- void (*task_waking) (struct rq *this_rq, struct task_struct *task);
+ void (*task_waking) (struct task_struct *task);
void (*task_woken) (struct rq *this_rq, struct task_struct *task);

void (*set_cpus_allowed)(struct task_struct *p,
@@ -1174,6 +1174,7 @@ struct task_struct {
int lock_depth; /* BKL lock depth */

#ifdef CONFIG_SMP
+ struct task_struct *wake_entry;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index b697606..9a5f9ea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -518,6 +518,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif

/* calc_load related fields */
@@ -944,7 +946,7 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock(&rq->lock);
}
@@ -964,7 +966,7 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
local_irq_save(*flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p)))
+ if (likely(rq == task_rq(p)) && !task_is_waking(p))
return rq;
raw_spin_unlock_irqrestore(&rq->lock, *flags);
}
@@ -2257,9 +2259,9 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
* The caller (fork, wakeup) owns TASK_WAKING, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct rq *rq, struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(rq, p, sd_flags, wake_flags);
+ int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -2285,24 +2287,38 @@ static void update_avg(u64 *avg, u64 sample)
}
#endif

-static inline void ttwu_activate(struct task_struct *p, struct rq *rq,
- bool is_sync, bool is_migrate, bool is_local,
- unsigned long en_flags)
+static inline void ttwu_stat(struct task_struct *p, bool sync, int cpu)
{
+#ifdef CONFIG_SCHEDSTATS
schedstat_inc(p, se.statistics.nr_wakeups);
- if (is_sync)
+
+ if (sync)
schedstat_inc(p, se.statistics.nr_wakeups_sync);
- if (is_migrate)
+
+ if (cpu != task_cpu(p))
schedstat_inc(p, se.statistics.nr_wakeups_migrate);
- if (is_local)
+
+ if (cpu == smp_processor_id()) {
schedstat_inc(p, se.statistics.nr_wakeups_local);
- else
- schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ schedstat_inc(this_rq(), ttwu_local);
+ } else {
+ struct sched_domain *sd;

- activate_task(rq, p, en_flags);
+ schedstat_inc(p, se.statistics.nr_wakeups_remote);
+ for_each_domain(smp_processor_id(), sd) {
+ if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+ schedstat_inc(sd, ttwu_wake_remote);
+ break;
+ }
+ }
+ }
+#endif
}

-static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
+/*
+ * Mark the task runnable and perform wakeup-preemption.
+ */
+static inline void ttwu_do_wakeup(struct task_struct *p, struct rq *rq,
int wake_flags, bool success)
{
trace_sched_wakeup(p, success);
@@ -2329,6 +2345,104 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
wq_worker_waking_up(p, cpu_of(rq));
}

+/*
+ * 'Wake' a still running task.
+ */
+static int ttwu_running(struct task_struct *p, int wake_flags)
+{
+ struct rq *rq;
+
+ /* open-coded __task_rq_lock() to avoid the TASK_WAKING check. */
+ for (;;) {
+ rq = task_rq(p);
+ raw_spin_lock(&rq->lock);
+ if (likely(rq == task_rq(p)))
+ break;
+ raw_spin_unlock(&rq->lock);
+ }
+
+ if (!p->se.on_rq) {
+ raw_spin_unlock(&rq->lock);
+ return 0;
+ }
+
+ /*
+ * XXX like below, I think this should be an actual wakeup
+ */
+ ttwu_do_wakeup(p, rq, 0, false);
+ raw_spin_unlock(&rq->lock);
+
+ return 1;
+}
+
+static void ttwu_do_activate(struct rq *rq, struct task_struct *p, bool local)
+{
+ set_task_cpu(p, cpu_of(rq)); /* XXX conditional */
+ activate_task(rq, p, ENQUEUE_WAKEUP
+#ifdef CONFIG_SMP
+ | ENQUEUE_WAKING
+#endif
+ );
+ ttwu_do_wakeup(p, rq, 0, true);
+}
+
+void sched_ttwu_pending(void)
+{
+#ifdef CONFIG_SMP
+ struct rq *rq = this_rq();
+ struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+ if (!list)
+ return;
+
+ raw_spin_lock(&rq->lock);
+
+ while (list) {
+ struct task_struct *p = list;
+ list = list->wake_entry;
+ ttwu_do_activate(rq, p, false);
+ }
+
+ raw_spin_unlock(&rq->lock);
+#endif
+}
+
+#ifdef CONFIG_SMP
+static void ttwu_queue_remote(struct task_struct *p, int cpu)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+}
+#endif
+
+static void ttwu_queue(struct task_struct *p, int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+#ifdef CONFIG_SMP
+ if (cpu != smp_processor_id()) {
+ ttwu_queue_remote(p, cpu);
+ return;
+ }
+#endif
+
+ raw_spin_lock(&rq->lock);
+ ttwu_do_activate(rq, p, true);
+ raw_spin_unlock(&rq->lock);
+}
+
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
@@ -2344,92 +2458,76 @@ static inline void ttwu_post_activation(struct task_struct *p, struct rq *rq,
* Returns %true if @p was woken up, %false if it was already running
* or @state didn't match @p's state.
*/
-static int try_to_wake_up(struct task_struct *p, unsigned int state,
- int wake_flags)
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
- int cpu, orig_cpu, this_cpu, success = 0;
+ int cpu = task_cpu(p);
unsigned long flags;
- unsigned long en_flags = ENQUEUE_WAKEUP;
- struct rq *rq;
+ int success = 0;
+ int load;

- this_cpu = get_cpu();
-
- smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
+ local_irq_save(flags);
+ for (;;) {
+ unsigned int task_state = p->state;

- if (p->se.on_rq)
- goto out_running;
+ if (!(task_state & state))
+ goto out;

- cpu = task_cpu(p);
- orig_cpu = cpu;
+ /*
+ * We've got to store the contributes_to_load state before
+ * modifying the task state.
+ */
+ load = task_contributes_to_load(p);

-#ifdef CONFIG_SMP
- if (unlikely(task_running(rq, p)))
- goto out_activate;
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
+ break;
+ }

/*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
+ * If p is prev in schedule() before deactivate_task() we
+ * simply need to set p->state = TASK_RUNNING while
+ * holding the rq->lock.
*
- * First fix up the nr_uninterruptible count:
+ * Also, before deactivate_task() we will not yet have accounted
+ * the contributes_to_load state, so ignore that.
*/
- if (task_contributes_to_load(p)) {
- if (likely(cpu_online(orig_cpu)))
- rq->nr_uninterruptible--;
- else
- this_rq()->nr_uninterruptible--;
- }
- p->state = TASK_WAKING;
+ if (p->se.on_rq && ttwu_running(p, wake_flags))
+ goto out;

- if (p->sched_class->task_waking) {
- p->sched_class->task_waking(rq, p);
- en_flags |= ENQUEUE_WAKING;
- }
+ /*
+ * XXX: I would consider the above to be a proper wakeup too, so
+ * success should be true for anything that changes ->state to RUNNING
+ * but preserve this funny from the previous implementation.
+ */
+ success = 1;

- cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
- __task_rq_unlock(rq);
+ /*
+ * Now that we'll do an actual wakeup, account the
+ * contribute_to_load state.
+ */
+ if (load) // XXX racy
+ this_rq()->nr_uninterruptible--;

- rq = cpu_rq(cpu);
- raw_spin_lock(&rq->lock);
+#ifdef CONFIG_SMP
+ if (p->sched_class->task_waking)
+ p->sched_class->task_waking(p);

/*
- * We migrated the task without holding either rq->lock, however
- * since the task is not on the task list itself, nobody else
- * will try and migrate the task, hence the rq should match the
- * cpu we just moved it to.
+ * If p is prev in schedule() after deactivate_task() we must
+ * call activate_task(), but we cannot migrate the task.
*/
- WARN_ON(task_cpu(p) != cpu);
- WARN_ON(p->state != TASK_WAKING);
-
-#ifdef CONFIG_SCHEDSTATS
- schedstat_inc(rq, ttwu_count);
- if (cpu == this_cpu)
- schedstat_inc(rq, ttwu_local);
- else {
- struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
- if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
- schedstat_inc(sd, ttwu_wake_remote);
- break;
- }
- }
+ if (likely(!task_running(task_rq(p), p))) {
+ /*
+ * XXX: by having set TASK_WAKING outside of rq->lock, there
+ * could be an in-flight change to p->cpus_allowed..
+ */
+ cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
}
-#endif /* CONFIG_SCHEDSTATS */
-
-out_activate:
-#endif /* CONFIG_SMP */
- ttwu_activate(p, rq, wake_flags & WF_SYNC, orig_cpu != cpu,
- cpu == this_cpu, en_flags);
- success = 1;
-out_running:
- ttwu_post_activation(p, rq, wake_flags, success);
+#endif
+ ttwu_stat(p, wake_flags & WF_SYNC, cpu);
+ ttwu_queue(p, cpu);
out:
- task_rq_unlock(rq, &flags);
- put_cpu();
+ local_irq_restore(flags);

return success;
}
@@ -2459,10 +2557,11 @@ static void try_to_wake_up_local(struct task_struct *p)
schedstat_inc(rq, ttwu_count);
schedstat_inc(rq, ttwu_local);
}
- ttwu_activate(p, rq, false, false, true, ENQUEUE_WAKEUP);
+ ttwu_stat(p, false, smp_processor_id());
+ activate_task(rq, p, ENQUEUE_WAKEUP);
success = true;
}
- ttwu_post_activation(p, rq, 0, success);
+ ttwu_do_wakeup(p, rq, 0, success);
}

/**
@@ -2604,7 +2703,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
* We set TASK_WAKING so that select_task_rq() can drop rq->lock
* without people poking at ->cpus_allowed.
*/
- cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
+ cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
set_task_cpu(p, cpu);

p->state = TASK_RUNNING;
@@ -3200,7 +3299,7 @@ void sched_exec(void)
int dest_cpu;

rq = task_rq_lock(p, &flags);
- dest_cpu = p->sched_class->select_task_rq(rq, p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5e8f98c..2bd145f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1128,11 +1128,12 @@ static void yield_task_fair(struct rq *rq)

#ifdef CONFIG_SMP

-static void task_waking_fair(struct rq *rq, struct task_struct *p)
+static void task_waking_fair(struct task_struct *p)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);

+ // XXX racy again,.. we need to read cfs_rq->min_vruntime atomically
se->vruntime -= cfs_rq->min_vruntime;
}

@@ -1443,7 +1444,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
@@ -1516,11 +1517,8 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
tmp = affine_sd;

- if (tmp) {
- raw_spin_unlock(&rq->lock);
+ if (tmp)
update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
}
#endif

diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..efaed8c 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -7,7 +7,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index d10c80e..1011cdc 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -949,8 +949,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
{
+ struct rq *rq = task_rq(p); // XXX racy
+
if (sd_flag != SD_BALANCE_WAKE)
return smp_processor_id();

2010-06-22 21:11:58

by Ingo Molnar

Subject: Re: [PATCH RFC] reduce runqueue lock contention


* Peter Zijlstra <[email protected]> wrote:

> So this one boots and builds a kernel on a dual-socket nehalem.
>
> There's still quite a number of XXXs to fix, but I don't think any of the
> races have crashing potential; mostly wrong accounting and scheduling
> iffiness.
>
> But give it a go.. see what it does for you (x86 only for now).
>
> Ingo, any comments other than, eew, scary? :-)

None, other than a question: which future kernel do you aim it for? I'd prefer
v2.6.50 or later ;-)

This is a truly scary patch.

Ingo

2010-06-23 09:11:17

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > So this one boots and builds a kernel on a dual-socket nehalem.
> >
> > There's still quite a number of XXXs to fix, but I don't think any of the
> > races have crashing potential; mostly wrong accounting and scheduling
> > iffiness.
> >
> > But give it a go.. see what it does for you (x86 only for now).
> >
> > Ingo, any comments other than, eew, scary? :-)
>
> None, other than a question: which future kernel do you aim it for? I'd prefer
> v2.6.50 or later ;-)

Well, assuming it all works out and actually reduces runqueue lock
contention, we still need to sort out all those XXXs in there; I'd say at
the soonest somewhere near .38/.39 or so.

It's definitely not something we should rush in.

2010-12-01 23:13:56

by Frank Rowand

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On 06/23/10 02:10, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote:
>> * Peter Zijlstra <[email protected]> wrote:
>>
>>> So this one boots and builds a kernel on a dual-socket nehalem.
>>>
>>> There's still quite a number of XXXs to fix, but I don't think any of the
>>> races have crashing potential; mostly wrong accounting and scheduling
>>> iffiness.
>>>
>>> But give it a go.. see what it does for you (x86 only for now).
>>>
>>> Ingo, any comments other than, eew, scary? :-)
>>
>> None, other than a question: which future kernel do you aim it for? I'd prefer
>> v2.6.50 or later ;-)
>
> Well, assuming it all works out and actually reduces runqueue lock
> contention, we still need to sort out all those XXXs in there; I'd say at
> the soonest somewhere near .38/.39 or so.
>
> It's definitely not something we should rush in.

This thread was started by Chris Mason back in May:
http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html

The problem he was trying to fix is:

> Many different workloads end
> up hammering very hard on try_to_wake_up, to the point where the
> runqueue locks dominate CPU profiles.
>
> This includes benchmarking really fast IO subsystems, fast networking,
> fast pipes...well anywhere that we have lots of procs on lots of cpus waiting
> for short periods on lots of different things.

Chris provided some code as a starting point for a solution.

Peter Zijlstra had some good ideas, and came up with some alternate code,
culminating with:

http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html

Building on this previous work, I have another patch to try to address
the problem. I have taken some of Peter's code (the cmpxchg() based
queueing and unqueueing, plus the cross cpu interrupt), but taken a
simpler (and hopefully less scary) approach otherwise:

If the task to be woken is on a run queue on a different cpu then use
cmpxchg() to put it onto a pending try_to_wake_up list on the different
cpu. Then send an interrupt to the different cpu to cause that cpu to
call try_to_wake_up() for each process on the try_to_wake_up list.

The result is that the initial run queue lock acquired by try_to_wake_up()
will be on the cpu we are currently executing on, not a different cpu.
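
For orientation, here is a minimal user-space sketch of the mechanism just
described (illustration only, not the patch itself): the waker saves the
wake information on the task and pushes it onto the target cpu's pending
list with cmpxchg(); the target cpu then finishes the wakeup itself from its
reschedule interrupt. The field names mirror the patch; the C11 atomics and
the printf() standing in for the local try_to_wake_up() call are
simplifications.

#include <stdatomic.h>
#include <stdio.h>

struct task {
        const char *name;
        int ttwu_queue_wake_flags;              /* saved wake_flags */
        int ttwu_queue_load;                    /* saved contributes-to-load */
        struct task *ttwu_queue_wake_entry;     /* next pending wakeup */
};

/* one pending list per cpu; it lives in struct rq in the patch */
static _Atomic(struct task *) wake_list;

/* waking cpu: queue p for the cpu that owns p's runqueue */
static void ttwu_queue_wake_up(struct task *p, int wake_flags, int load)
{
        struct task *old = atomic_load(&wake_list);

        p->ttwu_queue_wake_flags = wake_flags;
        p->ttwu_queue_load = load;

        do {
                p->ttwu_queue_wake_entry = old;
        } while (!atomic_compare_exchange_weak(&wake_list, &old, p));

        if (!old) {
                /* list was empty: the patch sends a reschedule IPI here */
        }
}

/* target cpu, from the reschedule interrupt: finish each wakeup locally */
static void sched_ttwu_pending(void)
{
        struct task *p = atomic_exchange(&wake_list, NULL);

        while (p) {
                struct task *next = p->ttwu_queue_wake_entry;

                printf("local try_to_wake_up(%s, flags=%d, load=%d)\n",
                       p->name, p->ttwu_queue_wake_flags, p->ttwu_queue_load);
                p = next;
        }
}

int main(void)
{
        struct task a = { .name = "a" }, b = { .name = "b" };

        ttwu_queue_wake_up(&a, 0, 1);
        ttwu_queue_wake_up(&b, 0, 0);
        sched_ttwu_pending();
        return 0;
}

The difference from the earlier patches above is that select_task_rq() and
the rest of try_to_wake_up() run on the target cpu rather than on the waker.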

This patch does not address the run queue lock contention that may occur
if try_to_wake_up() chooses to move the waking process to another cpu,
based on the result returned by select_task_rq().

The patch was created on the -top tree.

Signed-off-by: Frank Rowand <[email protected]>


Chris, can you check the performance of this patch on your large system?


Limitations
x86 only

Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 95%
rq->lock contention wait time reduced by ~ 90%
test duration reduced by ~ 0.5% - 4% (in the noise)

Review goals:
(1) performance results
(2) architectural comments

Review non-goal:
code style, etc (but will be a goal in a future review round)

Todo:
- add support for additional architectures
- polish code style
- add a schedule feature to control whether to use the new algorithm
- verify that smp_wmb() is implied by cmpxchg() on x86, so that the explicit
smp_wmb() in ttwu_queue_wake_up() can be removed.

---
arch/x86/kernel/smp.c | 1 1 + 0 - 0 !
include/linux/sched.h | 5 5 + 0 - 0 !
kernel/sched.c | 105 99 + 6 - 0 !
3 files changed, 105 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}

void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_LOAD 0x04 /* for queued try_to_wake_up() */

#define ENQUEUE_WAKEUP 1
#define ENQUEUE_WAKING 2
@@ -1193,6 +1194,9 @@ struct task_struct {
int lock_depth; /* BKL lock depth */

#ifdef CONFIG_SMP
+ struct task_struct *ttwu_queue_wake_entry;
+ int ttwu_queue_load;
+ int ttwu_queue_wake_flags;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
@@ -2017,6 +2021,7 @@ extern void release_uids(struct user_nam

extern void do_timer(unsigned long ticks);

+extern void sched_ttwu_pending(void);
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -2332,6 +2334,39 @@ static inline void ttwu_post_activation(
wq_worker_waking_up(p, cpu_of(rq));
}

+#ifdef CONFIG_SMP
+static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags,
+ int load)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (load)
+ wake_flags |= WF_LOAD;
+ p->ttwu_queue_load = load;
+ p->ttwu_queue_wake_flags = wake_flags;
+ /* xxx
+ * smp_wmb() is implied by cmpxchg()
+ * (see Documentation/memory-barriers.txt).
+ * It is the case for arm.
+ * I don't know about x86, so do it explicitly until I know for sure.
+ */
+ smp_wmb();
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->ttwu_queue_wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+}
+#endif
+
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
@@ -2354,13 +2389,51 @@ static int try_to_wake_up(struct task_st
unsigned long flags;
unsigned long en_flags = ENQUEUE_WAKEUP;
struct rq *rq;
+#ifdef CONFIG_SMP
+ int load;
+#endif

this_cpu = get_cpu();

+ local_irq_save(flags);
+
+#ifdef CONFIG_SMP
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out_nolock;
+
+ /*
+ * We've got to store the contributes_to_load state before
+ * modifying the task state.
+ */
+ load = task_contributes_to_load(p);
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
+ if (state == TASK_WAKING)
+ load = (wake_flags & WF_LOAD) ? 1 : 0;
+ break;
+ }
+ }
+
+ /*
+ * There is a race where task_cpu could be set to
+ * this_cpu while task_state is TASK_WAKING?
+ *
+ * That's ok, the destination cpu will just send it back here when
+ * it calls try_to_wake_up() of this process.
+ */
+
+ cpu = task_cpu(p);
+ if (cpu != this_cpu) {
+ ttwu_queue_wake_up(p, cpu, wake_flags, load);
+ goto out_nolock;
+ }
+#endif
+
smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
+ rq = __task_rq_lock(p);

if (p->se.on_rq)
goto out_running;
@@ -2373,18 +2446,16 @@ static int try_to_wake_up(struct task_st
goto out_activate;

/*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
+ * Can handle concurrent wakeups and release the rq->lock
+ * since we put the task in TASK_WAKING state.
*/
- if (task_contributes_to_load(p)) {
+
+ if (load) {
if (likely(cpu_online(orig_cpu)))
rq->nr_uninterruptible--;
else
this_rq()->nr_uninterruptible--;
}
- p->state = TASK_WAKING;

if (p->sched_class->task_waking) {
p->sched_class->task_waking(rq, p);
@@ -2430,13 +2501,32 @@ out_activate:
success = 1;
out_running:
ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
+ __task_rq_unlock(rq);
+#ifdef CONFIG_SMP
+out_nolock:
+#endif
+ local_irq_restore(flags);
put_cpu();

return success;
}

+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+ struct rq *rq = this_rq();
+ struct task_struct *p = xchg(&rq->wake_list, NULL);
+
+ if (!p)
+ return;
+
+ while (p) {
+ try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags);
+ p = p->ttwu_queue_wake_entry;
+ }
+}
+#endif
+
/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened

2010-12-02 01:17:51

by Frank Rowand

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On 12/01/10 15:13, Frank Rowand wrote:

< snip >

> The patch was created on the -top tree.

Fat-fingered typing. That should be the -tip tree....

-Frank

2010-12-02 07:37:04

by Peter Zijlstra

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On Wed, 2010-12-01 at 15:13 -0800, Frank Rowand wrote:
>
> If the task to be woken is on a run queue on a different cpu then use
> cmpxchg() to put it onto a pending try_to_wake_up list on the different
> cpu. Then send an interrupt to the different cpu to cause that cpu to
> call try_to_wake_up() for each process on the try_to_wake_up list.

Without having looked at the actual code, the described thing cannot
work: try_to_wake_up() has a return value that needs to be passed back.

Also, try_to_wake_up() does load-balancing; you really want to do that
before queueing it on a remote cpu.

2010-12-14 02:42:28

by Frank Rowand

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On 06/21/10 06:04, Peter Zijlstra wrote:
> On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
>>> It loses the ttwu task_running() check, as I must admit I'm not quite
>>> sure what it does.. Ingo?
>
> I think I figured out what it's for: it's for when p is prev in schedule()
> after deactivate_task(). We have to call activate_task() again, but
> we cannot migrate the task because the CPU it's on is still referencing
> it.

I have not been able to make sense of the task_running() check in
try_to_wake_up(), even with that clue. The try_to_wake_up() code in
question is:

rq = task_rq_lock(p, &flags);
if (!(p->state & state))
goto out;

if (p->se.on_rq)
goto out_running;

cpu = task_cpu(p);
orig_cpu = cpu;

#ifdef CONFIG_SMP
if (unlikely(task_running(rq, p)))
goto out_activate;


The relevant code in schedule() executes with the rq lock held (many
lines left out to emphasize the key lines):

raw_spin_lock_irq(&rq->lock);

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

deactivate_task(rq, prev, DEQUEUE_SLEEP);

if (likely(prev != next)) {
rq->curr = next;
context_switch(rq, prev, next); /* unlocks the rq */
} else
raw_spin_unlock_irq(&rq->lock);


If (p->se.on_rq) can become false due to deactivate_task()
then task_running() will also become false while the rq lock is still
held (either via "rq->curr = next" or via context_switch() updating
p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).

I haven't been able to find any case where task_running() can be true
when (p->se.on_rq) is false, while the rq lock is not being held. Thus
I don't think the branch to out_activate will ever be taken.

What am I missing, or is the task_running() test not needed?

Thanks!

Frank

2010-12-14 03:42:32

by Mike Galbraith

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On Mon, 2010-12-13 at 18:41 -0800, Frank Rowand wrote:
> On 06/21/10 06:04, Peter Zijlstra wrote:
> > On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
> >>> It loses the ttwu task_running() check, as I must admit I'm not quite
> >>> sure what it does.. Ingo?
> >
> > I think I figured out what it's for: it's for when p is prev in schedule()
> > after deactivate_task(). We have to call activate_task() again, but
> > we cannot migrate the task because the CPU it's on is still referencing
> > it.
>
> I have not been able to make sense of the task_running() check in
> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
> question is:
>
> rq = task_rq_lock(p, &flags);
> if (!(p->state & state))
> goto out;
>
> if (p->se.on_rq)
> goto out_running;
>
> cpu = task_cpu(p);
> orig_cpu = cpu;
>
> #ifdef CONFIG_SMP
> if (unlikely(task_running(rq, p)))
> goto out_activate;
>
>
> The relevant code in schedule() executes with the rq lock held (many
> lines left out to emphasize the key lines):
>
> raw_spin_lock_irq(&rq->lock);
>
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>
> deactivate_task(rq, prev, DEQUEUE_SLEEP);
>
> if (likely(prev != next)) {
> rq->curr = next;
> context_switch(rq, prev, next); /* unlocks the rq */
> } else
> raw_spin_unlock_irq(&rq->lock);
>
>
> If (p->se.on_rq) can become false due to deactivate_task()
> then task_running() will also become false while the rq lock is still
> held (either via "rq->curr = next" or via context_switch() updating
> p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
>
> I haven't been able to find any case where task_running() can be true
> when (p->se.on_rq) is false, while the rq lock is not being held. Thus
> I don't think the branch to out_activate will ever be taken.
>
> What am I missing, or is the task_running() test not needed?

Say the last runnable task passes through schedule(), is deactivated.
We hit idle_balance(), which drops/retakes rq->lock _before_ the task
schedules off. ttwu() can acquire rq->lock in that window, p->se.on_rq
is false, p->state is true, as is task_current(rq, p).

We have to check whether the task is still current, but not enqueued,
lest the wakeup be a noop, and the wakee possibly then sleep forever.
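
A minimal sketch of that interleaving (illustration only, not kernel code;
the flags model p->se.on_rq, p->state and task_current(), and the racy
ordering is replayed single-threaded):

#include <stdbool.h>
#include <stdio.h>

static bool on_rq = true;       /* p->se.on_rq */
static bool sleeping = false;   /* p->state != TASK_RUNNING */
static bool is_curr = true;     /* task_current(rq, p): p is still rq->curr */

/* the waker, running while idle_balance() has dropped rq->lock */
static void cpu1_try_to_wake_up(void)
{
        if (on_rq) {
                printf("ttwu: already queued, just mark it running\n");
        } else if (is_curr) {
                /* the case the task_running() check catches: p has been
                 * deactivated but has not scheduled off yet, so re-activate
                 * it in place and do not migrate it */
                printf("ttwu: dequeued but still current -> re-activate here\n");
                on_rq = true;
                sleeping = false;
        } else {
                printf("ttwu: fully off the cpu -> free to pick another rq\n");
        }
}

int main(void)
{
        /* cpu0: schedule() for the last runnable task */
        sleeping = true;                /* prev->state set to a sleep state */
        on_rq = false;                  /* deactivate_task(rq, prev, ...) */

        /* cpu0: idle_balance() drops rq->lock here ... */
        cpu1_try_to_wake_up();          /* ... and the waker slips in */
        /* cpu0: ... retakes rq->lock and only then context-switches */

        printf("result: on_rq=%d sleeping=%d -> the wakeup was not lost\n",
               on_rq, sleeping);
        return 0;
}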

-Mike

2010-12-14 21:43:16

by Frank Rowand

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On 12/13/10 19:42, Mike Galbraith wrote:
> On Mon, 2010-12-13 at 18:41 -0800, Frank Rowand wrote:
>> On 06/21/10 06:04, Peter Zijlstra wrote:
>>> On Mon, 2010-06-21 at 12:54 +0200, Peter Zijlstra wrote:
>>>>> It loses the ttwu task_running() check, as I must admit I'm not quite
>>>>> sure what it does.. Ingo?
>>>
>>> I think I figured out what it's for: it's for when p is prev in schedule()
>>> after deactivate_task(). We have to call activate_task() again, but
>>> we cannot migrate the task because the CPU it's on is still referencing
>>> it.
>>
>> I have not been able to make sense of the task_running() check in
>> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
>> question is:
>>
>> rq = task_rq_lock(p, &flags);
>> if (!(p->state & state))
>> goto out;
>>
>> if (p->se.on_rq)
>> goto out_running;
>>
>> cpu = task_cpu(p);
>> orig_cpu = cpu;
>>
>> #ifdef CONFIG_SMP
>> if (unlikely(task_running(rq, p)))
>> goto out_activate;
>>
>>
>> The relevant code in schedule() executes with the rq lock held (many
>> lines left out to emphasize the key lines):

Additional lines added here to show the function call that Mike pointed out:

>>
>> raw_spin_lock_irq(&rq->lock);
>>
>> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>>
>> deactivate_task(rq, prev, DEQUEUE_SLEEP);

}
pre_schedule(rq, prev);
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

>>
>> if (likely(prev != next)) {
>> rq->curr = next;
>> context_switch(rq, prev, next); /* unlocks the rq */
>> } else
>> raw_spin_unlock_irq(&rq->lock);
>>
>>
>> If (p->se.on_rq) can become false due to deactivate_task()
>> then task_running() will also become false while the rq lock is still
>> held (either via "rq->curr = next" or via context_switch() updating
>> p->oncpu -- which one matters depends on #ifdef __ARCH_WANT_UNLOCKED_CTXSW).
>>
>> I haven't been able to find any case where task_running() can be true
>> when (p->se.on_rq) is false, while the rq lock is not being held. Thus
>> I don't think the branch to out_activate will ever be taken.
>>
>> What am I missing, or is the task_running() test not needed?
>
> Say the last runnable task passes through schedule(), is deactivated.
> We hit idle_balance(), which drops/retakes rq->lock _before_ the task
> schedules off. ttwu() can acquire rq->lock in that window, p->se.on_rq
> is false, p->state is true, as is task_current(rq, p).
>
> We have to check whether the task is still current, but not enqueued,
> lest the wakeup be a noop, and the wakee possibly then sleep forever.

Thanks Mike, that's just the cluebat I needed!

And for the lkml archives, in case anyone has this question again in the
future, with Mike's clue in hand I found another case in this window where
the rq->lock can be dropped then reacquired. Just before idle_balance()
is called, pre_schedule() is called:

pre_schedule()
prev->sched_class->pre_schedule(rq, prev)
[pre_schedule_rt()]
pull_rt_task(rq)
pull_rt_task[this_rq]
for_each_cpu(cpu, this_rq->rd->rto_mask)
double_lock_balance(this_rq, src_rq)
raw_spin_unlock(&this_rq->lock) <-----
double_rq_lock(this_rq, busiest)

-Frank

2010-12-15 19:06:47

by Oleg Nesterov

Subject: Re: [PATCH RFC] reduce runqueue lock contention

On 12/13, Frank Rowand wrote:
>
> I have not been able to make sense of the task_running() check in
> try_to_wake_up(), even with that clue. The try_to_wake_up() code in
> question is:
> ...
>
> What am I missing, or is the task_running() test not needed?

I am afraid I may have misunderstood all this, including the question ;)

But, with __ARCH_WANT_UNLOCKED_CTXSW, task_running() checks ->oncpu.
However, schedule() drops rq->lock after prev was deactivated but
before it clears prev->oncpu.

Oleg.