I get a slew of these on boot.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
[ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
[ 4.595443] Call Trace:
[ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
[ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
[ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
[ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
[ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
[ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
[ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
[ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
[ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
[ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
f685ceacab07d3f6c236f04803e2f2f0dbcc5afb is first bad commit
commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
Author: Mike Galbraith <[email protected]>
Date: Fri Oct 23 23:09:22 2009 +0200
sched: Strengthen buddies and mitigate buddy induced latencies
This patch restores the effectiveness of LAST_BUDDY in preventing
pgsql+oltp from collapsing due to wakeup preemption. It also
switches LAST_BUDDY to exclusively do what it does best, namely
mitigate the effects of aggressive wakeup preemption, which
improves vmark throughput markedly, and restores mysql+oltp
scalability.
[snip]
On Wed, 2009-10-28 at 22:42 -0400, Eric Paris wrote:
> I get a slew of these on boot.
Ouch. This fix it up for you?
sched: protect task_hot() buddy check.
Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
causes boot time PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
[ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
[ 4.595443] Call Trace:
[ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
[ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
[ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
[ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
[ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
[ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
[ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
[ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
[ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
[ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
Don't use this_rq() when preemptible.
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Eric Paris <[email protected]>
LKML-Reference: <new-submission>
diff --git a/kernel/sched.c b/kernel/sched.c
index 91ffb01..21f52c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2008,7 +2008,8 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
/*
* Buddy candidates are cache hot:
*/
- if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+ if (sched_feat(CACHE_HOT_BUDDY) &&
+ (preempt_count() ? this_rq()->nr_running : 1) &&
(&p->se == cfs_rq_of(&p->se)->next ||
&p->se == cfs_rq_of(&p->se)->last))
return 1;
* Mike Galbraith <[email protected]> wrote:
> On Wed, 2009-10-28 at 22:42 -0400, Eric Paris wrote:
> > I get a slew of these on boot.
>
> Ouch. This fix it up for you?
>
> sched: protect task_hot() buddy check.
>
> Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
> causes boot time PREEMPT_DEBUG complaints.
>
> [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
> [ 4.593043] caller is task_hot+0x86/0xd0
> [ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
> [ 4.595443] Call Trace:
> [ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
> [ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
> [ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
> [ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
> [ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
> [ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
> [ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
> [ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
> [ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
> [ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
>
> Don't use this_rq() when preemptible.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Eric Paris <[email protected]>
> LKML-Reference: <new-submission>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 91ffb01..21f52c4 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2008,7 +2008,8 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
> /*
> * Buddy candidates are cache hot:
> */
> - if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
> + if (sched_feat(CACHE_HOT_BUDDY) &&
> + (preempt_count() ? this_rq()->nr_running : 1) &&
> (&p->se == cfs_rq_of(&p->se)->next ||
> &p->se == cfs_rq_of(&p->se)->last))
> return 1;
hm, the problem is kthread_bind(). It is rummaging around in scheduler
internals without holding the runqueue lock - and this now got exposed.
Even though it is operating on (supposedly ...) inactive tasks, the guts
of that function should be moved into sched.c and it should be fixed to
have proper locking.
Ingo
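For context on why this path trips PREEMPT_DEBUG at all: task_hot() uses
this_rq(), i.e. smp_processor_id(), and the debug wrapper only stays quiet
when the caller provably cannot migrate. A paraphrased sketch of the
2.6.32-era check in lib/smp_processor_id.c (from memory, not a verbatim
copy):

notrace unsigned int debug_smp_processor_id(void)
{
	unsigned long preempt_count = preempt_count();
	int this_cpu = raw_smp_processor_id();

	if (likely(preempt_count))	/* preemption disabled: cannot migrate */
		goto out;

	if (irqs_disabled())		/* hard irqs off: cannot migrate */
		goto out;

	/* A task allowed on exactly one cpu cannot migrate either. */
	if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
		goto out;

	/* Early boot is effectively single-threaded. */
	if (system_state != SYSTEM_RUNNING)
		goto out;

	/* Otherwise: print the "BUG: using smp_processor_id() in
	 * preemptible" warning and dump the stack, ratelimited. */
out:
	return this_cpu;
}

rmmod is neither preempt-disabled, nor irq-disabled, nor pinned to a single
cpu when it reaches set_task_cpu() via kthread_bind(), so none of the escape
hatches apply and the warning fires.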
On Thu, 2009-10-29 at 10:14 +0100, Ingo Molnar wrote:
> hm, the problem is kthread_bind(). It is rummaging around in scheduler
> internals without holding the runqueue lock - and this now got exposed.
> Even though it is operating on (supposedly ...) inactive tasks, the guts
> of that function should be moved into sched.c and it should be fixed to
> have proper locking.
Yeah, I was thinking that nobody should ever be able to hit that without
it being a bug.. but wimped out.
-Mike
On Thu, 2009-10-29 at 10:19 +0100, Mike Galbraith wrote:
> On Thu, 2009-10-29 at 10:14 +0100, Ingo Molnar wrote:
>
> > hm, the problem is kthread_bind(). It is rummaging around in scheduler
> > internals without holding the runqueue lock - and this now got exposed.
> > Even though it is operating on (supposedly ...) inactive tasks, the guts
> > of that function should be moved into sched.c and it should be fixed to
> > have proper locking.
>
> Yeah, I was thinking that nobody should ever be able to hit that without
> it being a bug.. but wimped out.
How about so?
sched: Move the body of kthread_bind() to sched.c.
Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
causes boot time PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
[ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
[ 4.595443] Call Trace:
[ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
[ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
[ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
[ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
[ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
[ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
[ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
[ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
[ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
[ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
Since kthread_bind() messes with scheduler internals, move the body to sched.c,
and lock the runqueue.
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Eric Paris <[email protected]>
LKML-Reference: <new-submission>
---
kernel/kthread.c | 15 ++++++---------
kernel/sched.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+), 9 deletions(-)
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -149,6 +149,8 @@ struct task_struct *kthread_create(int (
}
EXPORT_SYMBOL(kthread_create);
+extern void sched_kthread_bind(struct task_struct *k, unsigned int cpu);
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @k: thread created by kthread_create().
@@ -157,18 +159,13 @@ EXPORT_SYMBOL(kthread_create);
* Description: This function is equivalent to set_cpus_allowed(),
* except that @cpu doesn't need to be online, and the thread must be
* stopped (i.e., just returned from kthread_create()).
+ *
+ * The runqueue must be locked, ergo move the body of this function
+ * to sched.c
*/
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- set_task_cpu(k, cpu);
- k->cpus_allowed = cpumask_of_cpu(cpu);
- k->rt.nr_cpus_allowed = 1;
- k->flags |= PF_THREAD_BOUND;
+ sched_kthread_bind(k, cpu);
}
EXPORT_SYMBOL(kthread_bind);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2063,6 +2063,37 @@ void set_task_cpu(struct task_struct *p,
__set_task_cpu(p, new_cpu);
}
+/**
+ * sched_kthread_bind - bind a just-created kthread to a cpu.
+ * @k: thread created by kthread_create().
+ * @cpu: cpu (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @cpu doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ *
+ * Function lives here instead of kthread.c because it messes with
+ * scheduler internals which require locking.
+ */
+void sched_kthread_bind(struct task_struct *p, unsigned int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ spin_lock_irqsave(&rq->lock, flags);
+ set_task_cpu(p, cpu);
+ p->cpus_allowed = cpumask_of_cpu(cpu);
+ p->rt.nr_cpus_allowed = 1;
+ p->flags |= PF_THREAD_BOUND;
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
struct migration_req {
struct list_head list;
On Thu, 2009-10-29 at 11:48 +0100, Mike Galbraith wrote:
> On Thu, 2009-10-29 at 10:19 +0100, Mike Galbraith wrote:
> > On Thu, 2009-10-29 at 10:14 +0100, Ingo Molnar wrote:
> >
> > > hm, the problem is kthread_bind(). It is rummaging around in scheduler
> > > internals without holding the runqueue lock - and this now got exposed.
> > > Even though it is operating on (supposedly ...) inactive tasks, the guts
> > > of that function should be moved into sched.c and it should be fixed to
> > > have proper locking.
> >
> > Yeah, I was thinking that nobody should ever be able to hit that without
> > it being a bug.. but wimped out.
>
> How about so?
>
> sched: Move the body of kthread_bind() to sched.c.
>
> Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
> causes boot time PREEMPT_DEBUG complaints.
>
> [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
> [ 4.593043] caller is task_hot+0x86/0xd0
> [ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
> [ 4.595443] Call Trace:
> [ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
> [ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
> [ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
> [ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
> [ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
> [ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
> [ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
> [ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
> [ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
> [ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
>
> Since kthread_bind() messes with scheduler internals, move the body to sched.c,
> and lock the runqueue.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Eric Paris <[email protected]>
Nothing complains.
Tested-by: Eric Paris <[email protected]>
Now if I just knew why -next kernels under kvm seem to randomly corrupt
memory and eventually start taking NMI's, then I'd be happy.
-Eric
FYI, non-SMP builds broke:
kernel/built-in.o: In function `kthread_bind':
(.text+0x1d328): undefined reference to `sched_kthread_bind'
make: *** [.tmp_vmlinux1] Error 1
Ingo
Commit-ID: 48f69b95afd3b8f3d0dab1b822f12ed4f5bdb247
Gitweb: http://git.kernel.org/tip/48f69b95afd3b8f3d0dab1b822f12ed4f5bdb247
Author: Mike Galbraith <[email protected]>
AuthorDate: Thu, 29 Oct 2009 11:48:30 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 2 Nov 2009 19:37:57 +0100
sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
Eric Paris reported that commit
f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time
PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.
Reported-by: Eric Paris <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Eric Paris <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/kthread.c | 23 -----------------------
kernel/sched.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5fe7099..ab7ae57 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -150,29 +150,6 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
EXPORT_SYMBOL(kthread_create);
/**
- * kthread_bind - bind a just-created kthread to a cpu.
- * @k: thread created by kthread_create().
- * @cpu: cpu (might not be online, must be possible) for @k to run on.
- *
- * Description: This function is equivalent to set_cpus_allowed(),
- * except that @cpu doesn't need to be online, and the thread must be
- * stopped (i.e., just returned from kthread_create()).
- */
-void kthread_bind(struct task_struct *k, unsigned int cpu)
-{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- set_task_cpu(k, cpu);
- k->cpus_allowed = cpumask_of_cpu(cpu);
- k->rt.nr_cpus_allowed = 1;
- k->flags |= PF_THREAD_BOUND;
-}
-EXPORT_SYMBOL(kthread_bind);
-
-/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
diff --git a/kernel/sched.c b/kernel/sched.c
index bf21adb..a7df695 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1996,6 +1996,37 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
p->sched_class->prio_changed(rq, p, oldprio, running);
}
+/**
+ * kthread_bind - bind a just-created kthread to a cpu.
+ * @k: thread created by kthread_create().
+ * @cpu: cpu (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @cpu doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ *
+ * Function lives here instead of kthread.c because it messes with
+ * scheduler internals which require locking.
+ */
+void kthread_bind(struct task_struct *p, unsigned int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ spin_lock_irqsave(&rq->lock, flags);
+ set_task_cpu(p, cpu);
+ p->cpus_allowed = cpumask_of_cpu(cpu);
+ p->rt.nr_cpus_allowed = 1;
+ p->flags |= PF_THREAD_BOUND;
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
#ifdef CONFIG_SMP
/*
* Is this task likely cache-hot:
On Mon, 2009-11-02 at 19:28 +0100, Ingo Molnar wrote:
> FYI, non-SMP builds broke:
>
> kernel/built-in.o: In function `kthread_bind':
> (.text+0x1d328): undefined reference to `sched_kthread_bind'
> make: *** [.tmp_vmlinux1] Error 1
Oops. Outside the SMP block might work a little better.
sched: Move the body of kthread_bind() to sched.c.
Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
causes boot time PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
[ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
[ 4.595443] Call Trace:
[ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
[ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
[ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
[ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
[ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
[ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
[ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
[ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
[ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
[ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
Since kthread_bind() messes with scheduler internals, move the body to sched.c,
and lock the runqueue.
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reported-by: Eric Paris <[email protected]>
LKML-Reference: <new-submission>
---
kernel/kthread.c | 15 ++++++---------
kernel/sched.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+), 9 deletions(-)
Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c
+++ linux-2.6/kernel/kthread.c
@@ -149,6 +149,8 @@ struct task_struct *kthread_create(int (
}
EXPORT_SYMBOL(kthread_create);
+extern void sched_kthread_bind(struct task_struct *k, unsigned int cpu);
+
/**
* kthread_bind - bind a just-created kthread to a cpu.
* @k: thread created by kthread_create().
@@ -157,18 +159,13 @@ EXPORT_SYMBOL(kthread_create);
* Description: This function is equivalent to set_cpus_allowed(),
* except that @cpu doesn't need to be online, and the thread must be
* stopped (i.e., just returned from kthread_create()).
+ *
+ * The runqueue must be locked, ergo move the body of this function
+ * to sched.c
*/
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- set_task_cpu(k, cpu);
- k->cpus_allowed = cpumask_of_cpu(cpu);
- k->rt.nr_cpus_allowed = 1;
- k->flags |= PF_THREAD_BOUND;
+ sched_kthread_bind(k, cpu);
}
EXPORT_SYMBOL(kthread_bind);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1992,6 +1992,37 @@ static inline void check_class_changed(s
p->sched_class->prio_changed(rq, p, oldprio, running);
}
+/**
+ * sched_kthread_bind - bind a just-created kthread to a cpu.
+ * @k: thread created by kthread_create().
+ * @cpu: cpu (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @cpu doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ *
+ * Function lives here instead of kthread.c because it messes with
+ * scheduler internals which require locking.
+ */
+void sched_kthread_bind(struct task_struct *p, unsigned int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ spin_lock_irqsave(&rq->lock, flags);
+ set_task_cpu(p, cpu);
+ p->cpus_allowed = cpumask_of_cpu(cpu);
+ p->rt.nr_cpus_allowed = 1;
+ p->flags |= PF_THREAD_BOUND;
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+
#ifdef CONFIG_SMP
/*
* Is this task likely cache-hot:
* Mike Galbraith <[email protected]> wrote:
> On Mon, 2009-11-02 at 19:28 +0100, Ingo Molnar wrote:
> > FYI, non-SMP builds broke:
> >
> > kernel/built-in.o: In function `kthread_bind':
> > (.text+0x1d328): undefined reference to `sched_kthread_bind'
> > make: *** [.tmp_vmlinux1] Error 1
>
> Oops. Outside the SMP block might work a little better.
>
> sched: Move the body of kthread_bind() to sched.c.
Check the -tip commit notification, i did a somewhat cleaner approach by
moving all of kthread_bind() into sched.c. No need to keep that small
wrapper in kthread.c.
Thanks,
Ingo
On Mon, 2009-11-02 at 21:01 +0100, Ingo Molnar wrote:
> * Mike Galbraith <[email protected]> wrote:
>
> > On Mon, 2009-11-02 at 19:28 +0100, Ingo Molnar wrote:
> > > FYI, non-SMP builds broke:
> > >
> > > kernel/built-in.o: In function `kthread_bind':
> > > (.text+0x1d328): undefined reference to `sched_kthread_bind'
> > > make: *** [.tmp_vmlinux1] Error 1
> >
> > Oops. Outside the SMP block might work a little better.
> >
> > sched: Move the body of kthread_bind() to sched.c.
>
> Check the -tip commit notification, i did a somewhat cleaner approach by
> moving all of kthread_bind() into sched.c. No need to keep that small
> wrapper in kthread.c.
Thanks, and sorry for the fumble.
-Mike
Commit-ID: b84ff7d6f1b7f8a43414e74d972ec4c8f3361db4
Gitweb: http://git.kernel.org/tip/b84ff7d6f1b7f8a43414e74d972ec4c8f3361db4
Author: Mike Galbraith <[email protected]>
AuthorDate: Thu, 29 Oct 2009 11:48:30 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 3 Nov 2009 07:25:00 +0100
sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
Eric Paris reported that commit
f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time
PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.
Reported-by: Eric Paris <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
Tested-by: Eric Paris <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/kthread.c | 23 -----------------------
kernel/sched.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 32 insertions(+), 23 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5fe7099..ab7ae57 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -150,29 +150,6 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
EXPORT_SYMBOL(kthread_create);
/**
- * kthread_bind - bind a just-created kthread to a cpu.
- * @k: thread created by kthread_create().
- * @cpu: cpu (might not be online, must be possible) for @k to run on.
- *
- * Description: This function is equivalent to set_cpus_allowed(),
- * except that @cpu doesn't need to be online, and the thread must be
- * stopped (i.e., just returned from kthread_create()).
- */
-void kthread_bind(struct task_struct *k, unsigned int cpu)
-{
- /* Must have done schedule() in kthread() before we set_task_cpu */
- if (!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)) {
- WARN_ON(1);
- return;
- }
- set_task_cpu(k, cpu);
- k->cpus_allowed = cpumask_of_cpu(cpu);
- k->rt.nr_cpus_allowed = 1;
- k->flags |= PF_THREAD_BOUND;
-}
-EXPORT_SYMBOL(kthread_bind);
-
-/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
diff --git a/kernel/sched.c b/kernel/sched.c
index bf21adb..5cb7d63 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1996,6 +1996,38 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
p->sched_class->prio_changed(rq, p, oldprio, running);
}
+/**
+ * kthread_bind - bind a just-created kthread to a cpu.
+ * @k: thread created by kthread_create().
+ * @cpu: cpu (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @cpu doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ *
+ * Function lives here instead of kthread.c because it messes with
+ * scheduler internals which require locking.
+ */
+void kthread_bind(struct task_struct *p, unsigned int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ spin_lock_irqsave(&rq->lock, flags);
+ set_task_cpu(p, cpu);
+ p->cpus_allowed = cpumask_of_cpu(cpu);
+ p->rt.nr_cpus_allowed = 1;
+ p->flags |= PF_THREAD_BOUND;
+ spin_unlock_irqrestore(&rq->lock, flags);
+}
+EXPORT_SYMBOL(kthread_bind);
+
#ifdef CONFIG_SMP
/*
* Is this task likely cache-hot:
Hello, Ingo
Mike Galbraith's patch didn't work.
There is something wrong with the scheduler.
I still get this bug message:
BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
caller is vmstat_update+0x2a/0x3e
Pid: 10, comm: events/1 Not tainted 2.6.32-rc6-tip-01796-gd995f1d-dirty #118
Call Trace:
[<c02a3871>] debug_smp_processor_id+0xa5/0xbc
[<c01a229e>] vmstat_update+0x2a/0x3e
[<c014d6df>] worker_thread+0x134/0x1c2
[<c01a2274>] ? vmstat_update+0x0/0x3e
[<c0151361>] ? autoremove_wake_function+0x0/0x38
[<c014d5ab>] ? worker_thread+0x0/0x1c2
[<c0151298>] kthread+0x66/0x6e
[<c0151232>] ? kthread+0x0/0x6e
[<c0102e97>] kernel_thread_helper+0x7/0x10
Ftrace shows events/1 was run on cpu#0
<idle>-0 [000] 947.573031: set_next_entity <-pick_next_task_fair
<idle>-0 [000] 947.573032: update_stats_wait_end <-set_next_entity
<idle>-0 [000] 947.573033: __dequeue_entity <-set_next_entity
<idle>-0 [000] 947.573034: clear_buddies <-pick_next_task_fair
<idle>-0 [000] 947.573034: set_next_entity <-pick_next_task_fair
<idle>-0 [000] 947.573035: update_stats_wait_end <-set_next_entity
<idle>-0 [000] 947.573036: __dequeue_entity <-set_next_entity
<idle>-0 [000] 947.573037: hrtick_start_fair <-pick_next_task_fair
<idle>-0 [000] 947.573038: perf_event_task_sched_out <-schedule
<idle>-0 [000] 947.573039: memcpy <-tracing_record_cmdline
<idle>-0 [000] 947.573040: __switch_to <-schedule
events/1-10 [000] 947.573050: finish_task_switch <-schedule
events/1-10 [000] 947.573051: perf_event_task_sched_in <-finish_task_switch
events/1-10 [000] 947.573051: _spin_unlock_irq <-finish_task_switch
events/1-10 [000] 947.573052: finish_wait <-worker_thread
events/1-10 [000] 947.573053: kthread_should_stop <-worker_thread
events/1-10 [000] 947.573054: _spin_lock_irq <-worker_thread
events/1-10 [000] 947.573055: _spin_lock_irqsave <-probe_workqueue_execution
events/1-10 [000] 947.573056: _spin_unlock_irqrestore <-probe_workqueue_execution
events/1-10 [000] 947.573057: _spin_unlock_irq <-worker_thread
events/1-10 [000] 947.573058: flush_to_ldisc <-worker_thread
events/1-10 [000] 947.573059: tty_ldisc_ref <-flush_to_ldisc
events/1-10 [000] 947.573059: tty_ldisc_try <-tty_ldisc_ref
events/1-10 [000] 947.573060: _spin_lock_irqsave <-tty_ldisc_try
events/1-10 [000] 947.573061: _spin_unlock_irqrestore <-tty_ldisc_try
events/1 should run on cpu#1, but the [000] column shows it was run on cpu#0
events/1's cpus_allowed is correct:
# taskset -p 10
pid 10's current affinity mask: 2
Thanks
Lai
Mike Galbraith wrote:
> On Mon, 2009-11-02 at 19:28 +0100, Ingo Molnar wrote:
>> FYI, non-SMP builds broke:
>>
>> kernel/built-in.o: In function `kthread_bind':
>> (.text+0x1d328): undefined reference to `sched_kthread_bind'
>> make: *** [.tmp_vmlinux1] Error 1
>
> Oops. Outside the SMP block might work a little better.
>
> sched: Move the body of kthread_bind() to sched.c.
>
> Eric Paris reported that commit f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
> causes boot time PREEMPT_DEBUG complaints.
>
> [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
> [ 4.593043] caller is task_hot+0x86/0xd0
> [ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-fanotify #127
> [ 4.595443] Call Trace:
> [ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
> [ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
> [ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
> [ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
> [ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
> [ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
> [ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
> [ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
> [ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
> [ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
>
> Since kthread_bind() messes with scheduler internals, move the body to sched.c,
> and lock the runqueue.
>
> Signed-off-by: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Eric Paris <[email protected]>
> LKML-Reference: <new-submission>
>
> ---
> kernel/kthread.c | 15 ++++++---------
> kernel/sched.c | 31 +++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+), 9 deletions(-)
>
> Index: linux-2.6/kernel/kthread.c
> ===================================================================
> --- linux-2.6.orig/kernel/kthread.c
> +++ linux-2.6/kernel/kthread.c
> @@ -149,6 +149,8 @@ struct task_struct *kthread_create(int (
> }
> EXPORT_SYMBOL(kthread_create);
>
> +extern void sched_kthread_bind(struct task_struct *k, unsigned int cpu);
> +
> /**
> * kthread_bind - bind a just-created kthread to a cpu.
> * @k: thread created by kthread_create().
> @@ -157,18 +159,13 @@ EXPORT_SYMBOL(kthread_create);
> * Description: This function is equivalent to set_cpus_allowed(),
> * except that @cpu doesn't need to be online, and the thread must be
> * stopped (i.e., just returned from kthread_create()).
> + *
> + * The runqueue must be locked, ergo move the body of this function
> + * to sched.c
> */
> void kthread_bind(struct task_struct *k, unsigned int cpu)
> {
> - /* Must have done schedule() in kthread() before we set_task_cpu */
> - if (!wait_task_inactive(k, TASK_UNINTERRUPTIBLE)) {
> - WARN_ON(1);
> - return;
> - }
> - set_task_cpu(k, cpu);
> - k->cpus_allowed = cpumask_of_cpu(cpu);
> - k->rt.nr_cpus_allowed = 1;
> - k->flags |= PF_THREAD_BOUND;
> + sched_kthread_bind(k, cpu);
> }
> EXPORT_SYMBOL(kthread_bind);
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -1992,6 +1992,37 @@ static inline void check_class_changed(s
> p->sched_class->prio_changed(rq, p, oldprio, running);
> }
>
> +/**
> + * sched_kthread_bind - bind a just-created kthread to a cpu.
> + * @k: thread created by kthread_create().
> + * @cpu: cpu (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @cpu doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + *
> + * Function lives here instead of kthread.c because it messes with
> + * scheduler internals which require locking.
> + */
> +void sched_kthread_bind(struct task_struct *p, unsigned int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + unsigned long flags;
> +
> + /* Must have done schedule() in kthread() before we set_task_cpu */
> + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + spin_lock_irqsave(&rq->lock, flags);
> + set_task_cpu(p, cpu);
> + p->cpus_allowed = cpumask_of_cpu(cpu);
> + p->rt.nr_cpus_allowed = 1;
> + p->flags |= PF_THREAD_BOUND;
> + spin_unlock_irqrestore(&rq->lock, flags);
> +}
> +
> #ifdef CONFIG_SMP
> /*
> * Is this task likely cache-hot:
>
>
On Thu, 2009-11-05 at 18:42 +0800, Lai Jiangshan wrote:
> Hello, Ingo
>
> Mike Galbraith's patch didn't work.
>
> There is something wrong with the scheduler.
Looks that way, see below.
> I still get this bug message:
>
> BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
> caller is vmstat_update+0x2a/0x3e
> Pid: 10, comm: events/1 Not tainted 2.6.32-rc6-tip-01796-gd995f1d-dirty #118
> Call Trace:
> [<c02a3871>] debug_smp_processor_id+0xa5/0xbc
> [<c01a229e>] vmstat_update+0x2a/0x3e
> [<c014d6df>] worker_thread+0x134/0x1c2
> [<c01a2274>] ? vmstat_update+0x0/0x3e
> [<c0151361>] ? autoremove_wake_function+0x0/0x38
> [<c014d5ab>] ? worker_thread+0x0/0x1c2
> [<c0151298>] kthread+0x66/0x6e
> [<c0151232>] ? kthread+0x0/0x6e
> [<c0102e97>] kernel_thread_helper+0x7/0x10
>
>
> Ftrace shows events/1 was run on cpu#0
I think we may have a problem. It appears to me that selecting a runqueue
without holding the runqueue lock is still unsafe wrt migration.
One problem I see is that we were doing set_task_cpu() without the lock
held, which is what the kthread_bind() patch fixed. However, fixing that
did.. not much at all.
Probably because once we release the lock, we can be preempted. Lots of
the things we're looking at can change underneath us without that lock
held. Anyway, below is what I think is proof that this is indeed unsafe.
Virgin source running netperf UDP_STREAM _doubles_ throughput with only
1b9508f applied. No way in hell that patch can do that. It needs some
serious help. Like maybe using more than the two CPUs assigned, running
this and that all over the box (but too fast to _see_ in top)?
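For reference, the window in question, condensed from the pre-patch
try_to_wake_up() lines that the hunk below puts back under the lock
(comments added; not a verbatim copy):

	p->state = TASK_WAKING;
	task_rq_unlock(rq, &flags);	/* rq->lock dropped here */

	/* preemptible from this point on: the waker can be scheduled out,
	 * and p's placement is inspected and updated with no lock held */
	cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
	if (cpu != orig_cpu)
		set_task_cpu(p, cpu);	/* task struct diddled unlocked */

	rq = task_rq_lock(p, &flags);	/* only now is a runqueue lock retaken */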
sched: restore runqueue locking during runqueue selection.
Selecting runqueue with the runqueue unlocked is not migration safe,
restore old locking.
Running netperf UDP_STREAM:
git v2.6.32-rc6-26-g91d3f9b virgin
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7340005 0 4008.62
65536 60.00 7320453 3997.94
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 15018541 0 8202.12
65536 60.00 15018232 8201.96
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f + this patch
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7816923 0 4269.08
65536 60.00 7816663 4268.94
4200 vs 4000 is believable, 8000 vs 4000? Not.
Not-Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
kernel/sched.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
Index: linux-2.6.32.git/kernel/sched.c
===================================================================
--- linux-2.6.32.git.orig/kernel/sched.c
+++ linux-2.6.32.git/kernel/sched.c
@@ -2333,25 +2333,30 @@ static int try_to_wake_up(struct task_st
if (unlikely(task_running(rq, p)))
goto out_activate;
- /*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
- */
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
- p->state = TASK_WAKING;
- task_rq_unlock(rq, &flags);
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
+ if (cpu != orig_cpu) {
set_task_cpu(p, cpu);
+ task_rq_unlock(rq, &flags);
+ /* might preempt at this point */
+ rq = task_rq_lock(p, &flags);
+
+ if (rq != orig_rq)
+ update_rq_clock(rq);
+
+ if (!(p->state & state))
+ goto out;
+ if (p->se.on_rq)
+ goto out_running;
- rq = task_rq_lock(p, &flags);
+ this_cpu = smp_processor_id();
+ cpu = task_cpu(p);
+ }
- if (rq != orig_rq)
- update_rq_clock(rq);
+ if (cpu != orig_cpu)
+ set_task_cpu(p, cpu);
if (rq->idle_stamp) {
u64 delta = rq->clock - rq->idle_stamp;
@@ -2364,9 +2369,6 @@ static int try_to_wake_up(struct task_st
rq->idle_stamp = 0;
}
- WARN_ON(p->state != TASK_WAKING);
- cpu = task_cpu(p);
-
#ifdef CONFIG_SCHEDSTATS
schedstat_inc(rq, ttwu_count);
if (cpu == this_cpu)
BTW, with (only) the full scheduler pull request applied, it also acts
sane, and this little hunk is responsible... somehow. But if that's the
problem, why does locking the runqueue fix it too??
Index: linux-2.6.32.git/kernel/sched.c
===================================================================
--- linux-2.6.32.git.orig/kernel/sched.c
+++ linux-2.6.32.git/kernel/sched.c
@@ -9552,13 +9584,13 @@ void __init sched_init(void)
current->sched_class = &fair_sched_class;
/* Allocate the nohz_cpu_mask if CONFIG_CPUMASK_OFFSTACK */
- alloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
+ zalloc_cpumask_var(&nohz_cpu_mask, GFP_NOWAIT);
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
- alloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
+ zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
alloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);
#endif
- alloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
+ zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT);
#endif /* SMP */
perf_event_init();
A bit of late night cut/paste fixed it right up, so tomorrow, I can redo
benchmarks etc etc.
Lai, mind giving this a try? I believe this will fix your problem as
well as mine.
sched: fix runqueue locking buglet.
Calling set_task_cpu() with the runqueue unlocked is unsafe. Add cpu_rq_lock()
locking primitive, and lock the runqueue. Also, update rq->clock before calling
set_task_cpu(), as it could be stale.
Running netperf UDP_STREAM with two pinned tasks with tip 1b9508f applied emitted
the thoroughly unbelievable result that ratelimiting newidle could produce twice
the throughput of the virgin kernel. Reverting to locking the runqueue prior to
runqueue selection restored benchmarking sanity, as finally did this patchlet.
Before:
git v2.6.32-rc6-26-g91d3f9b virgin
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7340005 0 4008.62
65536 60.00 7320453 3997.94
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 15018541 0 8202.12
65536 60.00 15018232 8201.96
After:
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f + this patch
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7780289 0 4249.07
65536 60.00 7779832 4248.82
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
kernel/sched.c | 38 +++++++++++++++++++++++++++++++-------
1 file changed, 31 insertions(+), 7 deletions(-)
Index: linux-2.6.32.git/kernel/sched.c
===================================================================
--- linux-2.6.32.git.orig/kernel/sched.c
+++ linux-2.6.32.git/kernel/sched.c
@@ -1011,6 +1011,32 @@ static struct rq *this_rq_lock(void)
return rq;
}
+/*
+ * cpu_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the cpu_rq without
+ * explicitly disabling preemption.
+ */
+static struct rq *cpu_rq_lock(int cpu, unsigned long *flags)
+ __acquires(rq->lock)
+{
+ struct rq *rq;
+
+ for (;;) {
+ local_irq_save(*flags);
+ rq = cpu_rq(cpu);
+ spin_lock(&rq->lock);
+ if (likely(rq == cpu_rq(cpu)))
+ return rq;
+ spin_unlock_irqrestore(&rq->lock, *flags);
+ }
+}
+
+static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
+ __releases(rq->lock)
+{
+ spin_unlock_irqrestore(&rq->lock, *flags);
+}
+
#ifdef CONFIG_SCHED_HRTICK
/*
* Use HR-timers to deliver accurate preemption points.
@@ -2345,13 +2371,12 @@ static int try_to_wake_up(struct task_st
task_rq_unlock(rq, &flags);
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
-
- rq = task_rq_lock(p, &flags);
-
- if (rq != orig_rq)
+ if (cpu != orig_cpu) {
+ rq = cpu_rq_lock(cpu, &flags);
update_rq_clock(rq);
+ set_task_cpu(p, cpu);
+ } else
+ rq = task_rq_lock(p, &flags);
if (rq->idle_stamp) {
u64 delta = rq->clock - rq->idle_stamp;
@@ -2365,7 +2390,6 @@ static int try_to_wake_up(struct task_st
}
WARN_ON(p->state != TASK_WAKING);
- cpu = task_cpu(p);
#ifdef CONFIG_SCHEDSTATS
schedstat_inc(rq, ttwu_count);
Mike Galbraith wrote:
> A bit of late night cut/paste fixed it right up, so tomorrow, I can redo
> benchmarks etc etc.
>
> Lai, mind giving this a try? I believe this will fix your problem as
> well as mine.
My problem: a bound task is run on a different cpu. You haven't described
how it happens, so how do you think this patch will fix my problem?
>
> sched: fix runqueue locking buglet.
>
> Calling set_task_cpu() with the runqueue unlocked is unsafe. Add cpu_rq_lock()
> locking primitive, and lock the runqueue. Also, update rq->clock before calling
> set_task_cpu(), as it could be stale.
>
> Running netperf UDP_STREAM with two pinned tasks with tip 1b9508f applied emitted
> the thoroughly unbelievable result that ratelimiting newidle could produce twice
> the throughput of the virgin kernel. Reverting to locking the runqueue prior to
> runqueue selection restored benchmarking sanity, as finally did this patchlet.
>
[...]
> ---
> kernel/sched.c | 38 +++++++++++++++++++++++++++++++-------
> 1 file changed, 31 insertions(+), 7 deletions(-)
>
> Index: linux-2.6.32.git/kernel/sched.c
> ===================================================================
> --- linux-2.6.32.git.orig/kernel/sched.c
> +++ linux-2.6.32.git/kernel/sched.c
> @@ -1011,6 +1011,32 @@ static struct rq *this_rq_lock(void)
> return rq;
> }
>
> +/*
> + * cpu_rq_lock - lock the runqueue a given task resides on and disable
> + * interrupts. Note the ordering: we can safely lookup the cpu_rq without
> + * explicitly disabling preemption.
> + */
> +static struct rq *cpu_rq_lock(int cpu, unsigned long *flags)
> + __acquires(rq->lock)
> +{
> + struct rq *rq;
> +
> + for (;;) {
> + local_irq_save(*flags);
> + rq = cpu_rq(cpu);
> + spin_lock(&rq->lock);
> + if (likely(rq == cpu_rq(cpu)))
> + return rq;
> + spin_unlock_irqrestore(&rq->lock, *flags);
> + }
> +}
> +
> +static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
> + __releases(rq->lock)
> +{
> + spin_unlock_irqrestore(&rq->lock, *flags);
> +}
> +
The above code is totally garbage, cpu_rq(cpu) is constant.
> #ifdef CONFIG_SCHED_HRTICK
> /*
> * Use HR-timers to deliver accurate preemption points.
> @@ -2345,13 +2371,12 @@ static int try_to_wake_up(struct task_st
> task_rq_unlock(rq, &flags);
>
> cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> - if (cpu != orig_cpu)
> - set_task_cpu(p, cpu);
> -
> - rq = task_rq_lock(p, &flags);
> -
> - if (rq != orig_rq)
> + if (cpu != orig_cpu) {
> + rq = cpu_rq_lock(cpu, &flags);
> update_rq_clock(rq);
> + set_task_cpu(p, cpu);
Process p's runqueue may not have been locked.
> + } else
> + rq = task_rq_lock(p, &flags);
>
> if (rq->idle_stamp) {
> u64 delta = rq->clock - rq->idle_stamp;
> @@ -2365,7 +2390,6 @@ static int try_to_wake_up(struct task_st
> }
>
> WARN_ON(p->state != TASK_WAKING);
> - cpu = task_cpu(p);
>
> #ifdef CONFIG_SCHEDSTATS
> schedstat_inc(rq, ttwu_count);
>
>
>
>
On Fri, 2009-11-06 at 10:31 +0800, Lai Jiangshan wrote:
> Mike Galbraith wrote:
> > A bit of late night cut/paste fixed it right up, so tomorrow, I can redo
> > benchmarks etc etc.
> >
> > Lai, mind giving this a try? I believe this will fix your problem as
> > well as mine.
>
> My problem: a bound task is run on a different cpu. You haven't described
> how it happens, so how do you think this patch will fix my problem?
That's an easy one. I haven't figured out exactly how it happens, made
8x10 color glossies with circles and arrows etc, but read on anyway.
My problem is bound tasks doubling their throughput. How do they do
that while staying bound to CPUs that do not share a cache? If I make a
shortcut in select_rq() for bound tasks, so we don't waste time looking
for a place to put a pinned task, why does throughput decrease? I
haven't made film and analyzed it, but that's what is happening. We are
diddling the task struct with zero protection, and that is causing the
bad thing to happen, which is what I care about.
Now that I've fixed the problem, here's the throughput of the same
netperf run when bound to cache affine CPUs.
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 15102830 0 8248.16
65536 60.00 15101813 8247.60
Convinced enough now to try the patch, despite its beauty mark?
> > + for (;;) {
> > + local_irq_save(*flags);
> > + rq = cpu_rq(cpu);
> > + spin_lock(&rq->lock);
> > + if (likely(rq == cpu_rq(cpu)))
> > + return rq;
> > + spin_unlock_irqrestore(&rq->lock, *flags);
> > + }
> > +}
> > +
> > +static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
> > + __releases(rq->lock)
> > +{
> > + spin_unlock_irqrestore(&rq->lock, *flags);
> > +}
> > +
>
> The above code is totally garbage, cpu_rq(cpu) is constant.
Yeah, the hazards of late late night cut/paste sessions.
> > #ifdef CONFIG_SCHED_HRTICK
> > /*
> > * Use HR-timers to deliver accurate preemption points.
> > @@ -2345,13 +2371,12 @@ static int try_to_wake_up(struct task_st
> > task_rq_unlock(rq, &flags);
> >
> > cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
> > - if (cpu != orig_cpu)
> > - set_task_cpu(p, cpu);
> > -
> > - rq = task_rq_lock(p, &flags);
> > -
> > - if (rq != orig_rq)
> > + if (cpu != orig_cpu) {
> > + rq = cpu_rq_lock(cpu, &flags);
> > update_rq_clock(rq);
> > + set_task_cpu(p, cpu);
>
> Process p's runqueue may not have been locked.
Process p is being held hostage by TASK_WAKING. It's going nowhere but
where select_task_rq() tells us to put it, and that's the runqueue we
must lock. If you lock p's old runqueue, and it's not the runqueue
you're about to queue it to, nothing good is gonna happen.
In fact, now that I think about it more, it seems I want to disable preempt
across the call to select_task_rq(). Concurrency sounds nice, but when
the waker is preempted, the hostage, who may well have earned the right
to instant cpu access, will wait until the waker returns and finishes
looking for a runqueue. We want to get the wakee onto the runqueue asap.
What happens if say a SCHED_IDLE task gets CPU on a busy box long enough
to wake kjournald?
-Mike
On Fri, 2009-11-06 at 10:31 +0800, Lai Jiangshan wrote:
>
> > +/*
> > + * cpu_rq_lock - lock the runqueue a given task resides on and disable
> > + * interrupts. Note the ordering: we can safely lookup the cpu_rq without
> > + * explicitly disabling preemption.
> > + */
> > +static struct rq *cpu_rq_lock(int cpu, unsigned long *flags)
> > + __acquires(rq->lock)
> > +{
> > + struct rq *rq;
> > +
> > + for (;;) {
> > + local_irq_save(*flags);
> > + rq = cpu_rq(cpu);
> > + spin_lock(&rq->lock);
> > + if (likely(rq == cpu_rq(cpu)))
> > + return rq;
> > + spin_unlock_irqrestore(&rq->lock, *flags);
> > + }
> > +}
> > +
> > +static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
> > + __releases(rq->lock)
> > +{
> > + spin_unlock_irqrestore(&rq->lock, *flags);
> > +}
> > +
>
> The above code is totally garbage, cpu_rq(cpu) is constant.
No, that's not the garbage bit. The true hazard of late late night is
that you can't _see_ anymore. cpu_rq_lock + spin_unlock :)))))
Now I'm _really_ puzzled. Embarrassing, but funny.
-Mike
On Fri, 2009-11-06 at 05:27 +0100, Mike Galbraith wrote:
> In fact, now that I think about it more, it seems I want to disable preempt
> across the call to select_task_rq(). Concurrency sounds nice, but when
> the waker is preempted, the hostage, who may well have earned the right
> to instant cpu access, will wait until the waker returns and finishes
> looking for a runqueue. We want to get the wakee onto the runqueue asap.
> What happens if say a SCHED_IDLE task gets CPU on a busy box long enough
> to wake kjournald?
So, here's the 6 A.M. no java yet version. Now to go _make_ some java,
and settle in for a long test session.
sched: fix runqueue locking buglet.
Calling set_task_cpu() with the runqueue unlocked is unsafe. Add cpu_rq_lock()
locking primitive, and lock the runqueue. Also, update rq->clock before calling
set_task_cpu(), as it could be stale.
Running netperf UDP_STREAM with two pinned tasks with tip 1b9508f applied emitted
the thoroughly unbelievable result that ratelimiting newidle could produce twice
the throughput of the virgin kernel. Reverting to locking the runqueue prior to
runqueue selection restored benchmarking sanity, as did this patchlet.
Before:
git v2.6.32-rc6-26-g91d3f9b virgin
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7340005 0 4008.62
65536 60.00 7320453 3997.94
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 15018541 0 8202.12
65536 60.00 15018232 8201.96
After:
git v2.6.32-rc6-26-g91d3f9b with only 1b9508f + this patch
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
65536 4096 60.00 7780289 0 4249.07
65536 60.00 7779832 4248.82
Signed-off-by: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
kernel/sched.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)
Index: linux-2.6.32.git/kernel/sched.c
===================================================================
--- linux-2.6.32.git.orig/kernel/sched.c
+++ linux-2.6.32.git/kernel/sched.c
@@ -1011,6 +1011,24 @@ static struct rq *this_rq_lock(void)
return rq;
}
+/*
+ * cpu_rq_lock - lock the runqueue a given cpu and disable interrupts.
+ */
+static struct rq *cpu_rq_lock(int cpu, unsigned long *flags)
+ __acquires(rq->lock)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ spin_lock_irqsave(&rq->lock, *flags);
+ return rq;
+}
+
+static inline void cpu_rq_unlock(struct rq *rq, unsigned long *flags)
+ __releases(rq->lock)
+{
+ spin_unlock_irqrestore(&rq->lock, *flags);
+}
+
#ifdef CONFIG_SCHED_HRTICK
/*
* Use HR-timers to deliver accurate preemption points.
@@ -2342,16 +2360,17 @@ static int try_to_wake_up(struct task_st
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
p->state = TASK_WAKING;
+ preempt_disable();
task_rq_unlock(rq, &flags);
cpu = p->sched_class->select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
- if (cpu != orig_cpu)
- set_task_cpu(p, cpu);
-
- rq = task_rq_lock(p, &flags);
-
- if (rq != orig_rq)
+ if (cpu != orig_cpu) {
+ rq = cpu_rq_lock(cpu, &flags);
update_rq_clock(rq);
+ set_task_cpu(p, cpu);
+ } else
+ rq = task_rq_lock(p, &flags);
+ preempt_enable_no_resched();
if (rq->idle_stamp) {
u64 delta = rq->clock - rq->idle_stamp;
@@ -2365,7 +2384,6 @@ static int try_to_wake_up(struct task_st
}
WARN_ON(p->state != TASK_WAKING);
- cpu = task_cpu(p);
#ifdef CONFIG_SCHEDSTATS
schedstat_inc(rq, ttwu_count);
> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Eric Paris
> Sent: Wednesday, October 28, 2009 9:43 PM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: [regression bisect -next] BUG: using smp_processor_id() in
> preemptible [00000000] code: rmmod
>
> I get a slew of these on boot.
>
> [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000]
> code: rmmod/1314
> [ 4.593043] caller is task_hot+0x86/0xd0
> [ 4.593872] Pid: 1314, comm: rmmod Tainted: G W 2.6.32-rc3-
> fanotify #127
> [ 4.595443] Call Trace:
> [ 4.596177] [<ffffffff812ad35b>] debug_smp_processor_id+0x11b/0x120
> [ 4.597337] [<ffffffff81051d66>] task_hot+0x86/0xd0
> [ 4.598320] [<ffffffff81066275>] set_task_cpu+0x115/0x270
> [ 4.599368] [<ffffffff810985ab>] kthread_bind+0x6b/0x100
> [ 4.600354] [<ffffffff810914f0>] start_workqueue_thread+0x30/0x60
> [ 4.601545] [<ffffffff810941dd>] __create_workqueue_key+0x18d/0x2f0
> [ 4.602526] [<ffffffff810d9bee>] stop_machine_create+0x4e/0xd0
> [ 4.603811] [<ffffffff810c5818>] sys_delete_module+0x98/0x250
> [ 4.604922] [<ffffffff810e2505>] ? audit_syscall_entry+0x205/0x290
> [ 4.606202] [<ffffffff81013202>] system_call_fastpath+0x16/0x1b
>
> f685ceacab07d3f6c236f04803e2f2f0dbcc5afb is first bad commit commit
> f685ceacab07d3f6c236f04803e2f2f0dbcc5afb
> Author: Mike Galbraith <[email protected]>
> Date: Fri Oct 23 23:09:22 2009 +0200
>
> sched: Strengthen buddies and mitigate buddy induced latencies
>
> This patch restores the effectiveness of LAST_BUDDY in preventing
> pgsql+oltp from collapsing due to wakeup preemption. It also
> switches LAST_BUDDY to exclusively do what it does best, namely
> mitigate the effects of aggressive wakeup preemption, which
> improves vmark throughput markedly, and restores mysql+oltp
> scalability.
>
> [snip]
Hi -
I have the same oops (different cause-vector). Details here: http://lkml.org/lkml/2009/11/23/548
Can anyone tell me whether they are related and/or recommend any workarounds or troubleshooting steps?
Could this be related to the scheduler trying to reschedule a task/thread on another processor, and possibly a false alarm?
A snip from the trace is below.
-Bob
Nov 23 17:24:15 localhost kernel: BUG: using smp_processor_id() in preemptible
[00000000] code: smash/3062
Nov 23 17:24:15 localhost kernel: caller is __schedule+0x93/0x7d4
Nov 23 17:24:15 localhost kernel: Pid: 3062, comm: smash Not tainted
2.6.31.6-rt19 #1
Nov 23 17:24:15 localhost kernel: Call Trace:
Nov 23 17:24:15 localhost kernel: [<c06dc3df>] ? printk+0xf/0x18
Nov 23 17:24:15 localhost kernel: [<c05499a6>] debug_smp_processor_id+0xa6/0xbc
Nov 23 17:24:15 localhost kernel: [<c06dc70e>] __schedule+0x93/0x7d4
Nov 23 17:24:15 localhost kernel: [<c046621f>] ? audit_syscall_exit+0xfa/0x10f
Nov 23 17:24:15 localhost kernel: [<c040a955>] ? syscall_trace_leave+0xc8/0xef
Nov 23 17:24:15 localhost kernel: [<c040338a>] work_resched+0x5/0x19
Nov 23 17:24:15 localhost kernel: __ratelimit: 6 callbacks suppressed
Nov 23 17:24:15 localhost kernel: BUG: using smp_processor_id() in preemptible
[00000000] code: smash/3062
On Thu, 2009-11-26 at 12:09 -0500, Leyendecker, Robert wrote:
>
> I have the same oops (different cause-vector). Details here:
> http://lkml.org/lkml/2009/11/23/548
Which is an -rt kernel and very likely totally unrelated.
> Can anyone tell me if they are related and/or recommend any
> work-arounds or troubleshooting steps?
>
> Could this be related to scheduler trying to reschedule a task/thread
> on other proc
No, the warning says you are trying to get the current cpu number while the code
in question is in no way tied to a particular cpu; therefore the cpu
number obtained is totally useless.
> and possibly a false alarm?
Most likely not.
Look at your particular source tree and try to find the
smp_processor_id() call that triggers this, then look at what it needs the
cpu number for and fix the code so that it makes sense ;-)
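As a general illustration only (not taken from the -rt code in question,
and the function name below is made up): the two usual legitimate patterns
look like this.

#include <linux/smp.h>
#include <linux/preempt.h>

static void example_percpu_access(void)	/* hypothetical example */
{
	int cpu;

	/* Case 1: the cpu number must stay valid while it is used;
	 * get_cpu() disables preemption so the task cannot migrate. */
	cpu = get_cpu();
	/* ... touch per-cpu state belonging to 'cpu' ... */
	put_cpu();

	/* Case 2: any cpu number will do (e.g. a statistical hint),
	 * so the preemption-safety check is deliberately bypassed. */
	cpu = raw_smp_processor_id();
	(void)cpu;
}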
> Nov 23 17:24:15 localhost kernel: BUG: using smp_processor_id() in preemptible [00000000] code: smash/3062
> Nov 23 17:24:15 localhost kernel: caller is __schedule+0x93/0x7d4
> Nov 23 17:24:15 localhost kernel: Pid: 3062, comm: smash Not tainted 2.6.31.6-rt19 #1
> Nov 23 17:24:15 localhost kernel: Call Trace:
> Nov 23 17:24:15 localhost kernel: [<c06dc3df>] ? printk+0xf/0x18
> Nov 23 17:24:15 localhost kernel: [<c05499a6>] debug_smp_processor_id+0xa6/0xbc
> Nov 23 17:24:15 localhost kernel: [<c06dc70e>] __schedule+0x93/0x7d4
> Nov 23 17:24:15 localhost kernel: [<c046621f>] ? audit_syscall_exit+0xfa/0x10f
> Nov 23 17:24:15 localhost kernel: [<c040a955>] ? syscall_trace_leave+0xc8/0xef
> Nov 23 17:24:15 localhost kernel: [<c040338a>] work_resched+0x5/0x19
> Nov 23 17:24:15 localhost kernel: __ratelimit: 6 callbacks suppressed