2011-04-28 14:20:53

by KOSAKI Motohiro

Subject: [RFC PATCH] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

Oleg, Peter,

I apologize if I've misunderstood the code. I'm afraid we need the following
patch. Is there any guarantee that a task governed by cpusets is never an RT task?
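
For context on why that matters: the RT scheduler's push/pull logic uses
p->rt.nr_cpus_allowed (rather than re-reading the cpumask) to decide whether a
task may be migrated, so the two fields have to stay in sync. A minimal sketch
of that pattern (assumed and simplified, not a literal quote of kernel/sched_rt.c):

static int find_lowest_rq(struct task_struct *task)
{
        /* nr_cpus_allowed == 1 is taken to mean "pinned": no push attempted */
        if (task->rt.nr_cpus_allowed == 1)
                return -1;

        /* otherwise the task is a push candidate and a per-cpu scratch
         * cpumask (local_cpu_mask) is consulted */
        (snip)
}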



From 400e24f7b6e15d1397dc82d74fdd1e0a631493ef Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Thu, 28 Apr 2011 16:14:16 +0900
Subject: [PATCH] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

The rule is: if we change tsk->cpus_allowed, we also have to update
tsk->rt.nr_cpus_allowed.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/cpuset.h | 1 +
kernel/cpuset.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index f20eb8f..42dcbdc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
return cpumask_any(cpu_active_mask);
}

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1ceeb04..6e5bbe8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
cpu = cpumask_any(cpu_active_mask);
}
+ tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);

return cpu;
}
--
1.7.3.1



2011-04-28 16:13:08

by Oleg Nesterov

Subject: Re: [RFC PATCH] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On 04/28, KOSAKI Motohiro wrote:
>
> Oleg, Peter,
>
> I apologize if I misunderstand a code.

Heh, I bet you understand it better than me ;)

> index f20eb8f..42dcbdc 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
> static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> {
> cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> + p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> return cpumask_any(cpu_active_mask);
> }
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 1ceeb04..6e5bbe8 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> cpu = cpumask_any(cpu_active_mask);
> }
> + tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);

I think you are right...

But, don't we need sched_class->set_cpus_allowed() in this case? Only for
consistency, iiuc currently it is not needed because the task is not active.

IOW, perhaps cpuset_cpus_allowed_fallback() should do


if (p->sched_class->set_cpus_allowed)
        p->sched_class->set_cpus_allowed(p, cpu_possible_mask);
else {
        cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
        p->rt.nr_cpus_allowed = cpumask_weight(cpu_possible_mask);
}

?

If yes, probably the new do_set_cpus_allowed(p, mask) helper makes sense,
it can be used by set_cpus_allowed_ptr() too.

Oleg.

2011-05-02 10:42:44

by KOSAKI Motohiro

Subject: Re: [RFC PATCH] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

> On 04/28, KOSAKI Motohiro wrote:
> >
> > Oleg, Peter,
> >
> > I apologize if I misunderstand a code.
>
> Heh, I bet you understand it better than me ;)
>
> > index f20eb8f..42dcbdc 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
> > static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> > {
> > cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > + p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> > return cpumask_any(cpu_active_mask);
> > }
> >
> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > index 1ceeb04..6e5bbe8 100644
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> > cpu = cpumask_any(cpu_active_mask);
> > }
> > + tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
>
> I think you are right...
>
> But, don't we need sched_class->set_cpus_allowed() in this case? Only for
> consistency, iiuc currently it is not needed because the task is not active.
>
> IOW, perhaps cpuset_cpus_allowed_fallback() should do
>
>
> if (p->sched_class->set_cpus_allowed)
>         p->sched_class->set_cpus_allowed(p, cpu_possible_mask);
> else {
>         cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
>         p->rt.nr_cpus_allowed = cpumask_weight(cpu_possible_mask);
> }
>
> ?
>
> If yes, probably the new do_set_cpus_allowed(p, mask) helper makes sense,
> it can be used by set_cpus_allowed_ptr() too.

Absolutely. I have a very similar patch, but I thought we should keep them
separate, for two reasons.

1. Keeping this a simple one-line patch may help reduce a backporter's headache.
2. We now have 6 tsk->cpus_allowed writers:

1) sched_rt.c: set_cpus_allowed_rt()  no problem, it's the scheduler.
2) sched.c: set_cpus_allowed_ptr()  ditto.
3) sched.c: init_idle()  no lock, but no competitor; that's init.
4) kthread.c: kthread_bind()  no lock, but no competitor; the kthread hasn't started yet.
5) cpuset.c: cpuset_cpus_allowed_fallback()  p->pi_lock held.
6) arch/bfin/../process.c: bfin_clone()  crazy, but I already sent a patch to the blackfin folks.

OK, (2)-(5) can use do_set_cpus_allowed(), but that is a slightly larger refactoring.

But, hmmm...
I've found that my patch has one issue. Sigh. The RCU subsystem recently
introduced the rcuc FIFO kthread, and it starts to run before the secondary
cpu is up. Then cpuset_cpus_allowed_fallback() sets its cpus_allowed to
cpu_possible_mask and the kernel will crash.

Will fix that too.

2011-05-02 10:55:30

by KOSAKI Motohiro

Subject: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

The rule is: if we change tsk->cpus_allowed, we also have to update
tsk->rt.nr_cpus_allowed. Otherwise the RT scheduler may get confused.

This patch fixes it.

btw, the system_state check is very important. The current boot sequence is
(1) smp_init (ie secondary cpus brought up and cpu-bound kthreads created),
then (2) sched_init_smp(). So the following bad scenario can happen (see also
the sketch further below):

(1) cpuup call notifier(CPU_UP_PREPARE)
(2) A cpu notifier consumer create FIFO kthread
(3) It call kthread_bind()
... but, now secondary cpu haven't ONLINE
(3) schedule() makes fallback and cpuset_cpus_allowed_fallback
change task->cpus_allowed
(4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed != 1,
but it haven't been initialized.

The RCU folks plan to introduce such a FIFO kthread and our testing hit the
above issue, so this patch also protects against it.
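
A minimal sketch of where that smp_init()/sched_init_smp() ordering comes from
(assumed from init/main.c of this kernel era; simplified, not a literal quote):

static int __init kernel_init(void *unused)
{
        (snip)
        smp_init();             /* secondary cpus come up; CPU_UP_PREPARE and
                                   CPU_ONLINE notifiers create per-cpu kthreads */
        sched_init_smp();       /* scheduler SMP state is only initialized here */
        (snip)
}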

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/cpuset.h | 1 +
kernel/cpuset.c | 1 +
kernel/sched.c | 4 ++++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index f20eb8f..42dcbdc 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
return cpumask_any(cpu_active_mask);
}

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1ceeb04..6e5bbe8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
cpu = cpumask_any(cpu_active_mask);
}
+ tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);

return cpu;
}
diff --git a/kernel/sched.c b/kernel/sched.c
index fd4625f..bfcd219 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
if (dest_cpu < nr_cpu_ids)
return dest_cpu;

+ /* Don't worry. It's temporary mismatch. */
+ if (system_state < SYSTEM_RUNNING)
+ return cpu;
+
/* No more Mr. Nice Guy. */
dest_cpu = cpuset_cpus_allowed_fallback(p);
/*
--
1.7.3.1


2011-05-02 10:56:07

by KOSAKI Motohiro

Subject: [PATCH 2/2] sched, cpuset: introduce do_set_cpus_allowed() helper function

Now, we have five task->cpus_allowed writers:

1) sched_rt.c: set_cpus_allowed_rt()
2) sched.c: set_cpus_allowed_ptr()
3) sched.c: init_idle()
4) kthread.c kthread_bind()
5) cpuset.c: cpuset_cpus_allowed_fallback()

Right now (3), (4) and (5) don't check p->sched_class->set_cpus_allowed.
That's ok; we have an implicit guarantee that it's safe. However, it is
theoretically a slightly bad habit: if any scheduler class implements
->set_cpus_allowed in the future, these callers may break.

So it would be nice to add a proper helper function and clean this up.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/cpuset.h | 3 +--
include/linux/sched.h | 7 +++++++
kernel/cpuset.c | 5 ++---
kernel/kthread.c | 4 ++--
kernel/sched.c | 19 ++++++++++++-------
5 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 42dcbdc..e9eaec5 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -146,8 +146,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,

static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
- cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
- p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
+ do_set_cpus_allowed(p, cpu_possible_mask);
return cpumask_any(cpu_active_mask);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3f7d3f9..fc7964d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1823,9 +1823,16 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif

#ifdef CONFIG_SMP
+extern void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask);
+
extern int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask);
#else
+static inline void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+}
static inline int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask)
{
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 6e5bbe8..9c9b754 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2190,7 +2190,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
rcu_read_lock();
cs = task_cs(tsk);
if (cs)
- cpumask_copy(&tsk->cpus_allowed, cs->cpus_allowed);
+ do_set_cpus_allowed(tsk, cs->cpus_allowed);
rcu_read_unlock();

/*
@@ -2217,10 +2217,9 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
* Like above we can temporary set any mask and rely on
* set_cpus_allowed_ptr() as synchronization point.
*/
- cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
+ do_set_cpus_allowed(tsk, cpu_possible_mask);
cpu = cpumask_any(cpu_active_mask);
}
- tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);

return cpu;
}
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3b34d27..4ba7ccc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -202,8 +202,8 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
return;
}

- p->cpus_allowed = cpumask_of_cpu(cpu);
- p->rt.nr_cpus_allowed = 1;
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);
diff --git a/kernel/sched.c b/kernel/sched.c
index bfcd219..7867e47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5819,7 +5819,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();

- cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
+ do_set_cpus_allowed(idle, cpumask_of(cpu));
/*
* We're having a chicken and egg problem, even though we are
* holding rq->lock, the cpu isn't yet set to this cpu so the
@@ -5910,6 +5910,16 @@ static inline void sched_init_granularity(void)
}

#ifdef CONFIG_SMP
+void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
+{
+ if (p->sched_class && p->sched_class->set_cpus_allowed)
+ p->sched_class->set_cpus_allowed(p, new_mask);
+ else {
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+ }
+}
+
/*
* This is how migration works:
*
@@ -5953,12 +5963,7 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
goto out;
}

- if (p->sched_class->set_cpus_allowed)
- p->sched_class->set_cpus_allowed(p, new_mask);
- else {
- cpumask_copy(&p->cpus_allowed, new_mask);
- p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
- }
+ do_set_cpus_allowed(p, new_mask);

/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), new_mask))
--
1.7.3.1


2011-05-02 12:59:01

by Paul E. McKenney

Subject: Re: [RFC PATCH] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Mon, May 02, 2011 at 07:42:40PM +0900, KOSAKI Motohiro wrote:
> > On 04/28, KOSAKI Motohiro wrote:
> > >
> > > Oleg, Peter,
> > >
> > > I apologize if I misunderstand a code.
> >
> > Heh, I bet you understand it better than me ;)
> >
> > > index f20eb8f..42dcbdc 100644
> > > --- a/include/linux/cpuset.h
> > > +++ b/include/linux/cpuset.h
> > > @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
> > > static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> > > {
> > > cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > > + p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> > > return cpumask_any(cpu_active_mask);
> > > }
> > >
> > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > > index 1ceeb04..6e5bbe8 100644
> > > --- a/kernel/cpuset.c
> > > +++ b/kernel/cpuset.c
> > > @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > > cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> > > cpu = cpumask_any(cpu_active_mask);
> > > }
> > > + tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
> >
> > I think you are right...
> >
> > But, don't we need sched_class->set_cpus_allowed() in this case? Only for
> > consistency, iiuc currently it is not needed because the task is not active.
> >
> > IOW, perhaps cpuset_cpus_allowed_fallback() should do
> >
> >
> > if (p->sched_class->set_cpus_allowed)
> >         p->sched_class->set_cpus_allowed(p, cpu_possible_mask);
> > else {
> >         cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> >         p->rt.nr_cpus_allowed = cpumask_weight(cpu_possible_mask);
> > }
> >
> > ?
> >
> > If yes, probably the new do_set_cpus_allowed(p, mask) helper makes sense,
> > it can be used by set_cpus_allowed_ptr() too.
>
> Absolutely. I have very similar patch. but I though we should keep them
> separated. The reasons are two.
>
> 1. To keep simple one line patch may help to reduce a backport guy's headache.
> 2. now we have 6 tsk->cpu_allowed writer.
>
> 1) sched_rt.c: set_cpus_allowed_rt() no problem. it's shceduler.
> 2) sched.c: set_cpus_allowed_ptr() ditto.
> 3) sched.c: init_idle() no lock, but no competitor. that's init.
> 4) kthread.c kthread_bind() no lock, but no competitor. kthread haven't started yet
> 5) cpuset.c: cpuset_cpus_allowed_fallback() p->pi_lock held
> 6) arch/bfin/../process.c: bfin_clone() crazy. but I alread sent a patch to blackfin folks.
>
> ok, (2)-(5) can use do_set_cpus_allowed(). but It slightly large refactoring.
>
> But, Hmmm...
> I've found my patch have one issue. Sigh. Recently RCU sub system introduced
> rcuc FIFO kthread and it start to run before secondary cpu is up. then,
> cpuset_cpus_allowed_fallback set its cpus_allowed to cpu_possible_mask and
> kernel will crash.

If there is something different that I should be doing with the rcuc FIFO
kthread, please do let me know!

Thanx, Paul

> Will fix it too.
>
>

2011-05-11 16:02:32

by Peter Zijlstra

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Mon, 2011-05-02 at 19:55 +0900, KOSAKI Motohiro wrote:
> The rule is, we have to update tsk->rt.nr_cpus_allowed too if we change
> tsk->cpus_allowed. Otherwise RT scheduler may confuse.
>
> This patch fixes it.
>
> btw, system_state checking is very important. current boot sequence is (1) smp_init
> (ie secondary cpus up and created cpu bound kthreads). (2) sched_init_smp().
> Then following bad scenario can be happen,
>
> (1) cpuup call notifier(CPU_UP_PREPARE)
> (2) A cpu notifier consumer create FIFO kthread
> (3) It call kthread_bind()
> ... but, now secondary cpu haven't ONLINE

isn't

> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
> change task->cpus_allowed

I'm failing to see how this is happening, surely that kthread isn't
actually running that early?

> (4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed != 1,
> but it haven't been initialized.
>
> RCU folks plan to introduce such FIFO kthread and our testing hitted the
> above issue. Then this patch also protect it.

I'm fairly sure it doesn't, normal cpu-hotplug doesn't poke at
system_state.

>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> ---
> include/linux/cpuset.h | 1 +
> kernel/cpuset.c | 1 +
> kernel/sched.c | 4 ++++
> 3 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index f20eb8f..42dcbdc 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
> static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> {
> cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> + p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> return cpumask_any(cpu_active_mask);
> }
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 1ceeb04..6e5bbe8 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> cpu = cpumask_any(cpu_active_mask);
> }
> + tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
>
> return cpu;
> }

I don't really see the point of doing this separately from your second
patch, please fold them.

> diff --git a/kernel/sched.c b/kernel/sched.c
> index fd4625f..bfcd219 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> if (dest_cpu < nr_cpu_ids)
> return dest_cpu;
>
> + /* Don't worry. It's temporary mismatch. */
> + if (system_state < SYSTEM_RUNNING)
> + return cpu;
> +
> /* No more Mr. Nice Guy. */
> dest_cpu = cpuset_cpus_allowed_fallback(p);
> /*

Like explained, I don't believe this actually fixes your problem (it's
also disgusting).

2011-05-13 05:46:42

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

Hi

> On Mon, 2011-05-02 at 19:55 +0900, KOSAKI Motohiro wrote:
>> The rule is, we have to update tsk->rt.nr_cpus_allowed too if we change
>> tsk->cpus_allowed. Otherwise RT scheduler may confuse.
>>
>> This patch fixes it.
>>
>> btw, system_state checking is very important. current boot sequence is (1) smp_init
>> (ie secondary cpus up and created cpu bound kthreads). (2) sched_init_smp().
>> Then following bad scenario can be happen,
>>
>> (1) cpuup call notifier(CPU_UP_PREPARE)
>> (2) A cpu notifier consumer create FIFO kthread
>> (3) It call kthread_bind()
>> ... but, now secondary cpu haven't ONLINE
>
> isn't

Thanks for the correction.

>
>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
>> change task->cpus_allowed
>
> I'm failing to see how this is happening, surely that kthread isn't
> actually running that early?

If I understand correctly, the current call graph is as below.

kernel_init()
        smp_init();
                cpu_up()
                        ... cpu hotplug notification
                                kthread_create()
        sched_init_smp();


So, cpu hotplug events happen before sched_init_smp(). The old rule was that
all kthreads created from a cpu-up notification have to use kthread_bind();
that protected them from the sched load balancer.

But now cpuset_cpus_allowed_fallback() forcibly changes the kthread's cpumask.
Why does this currently work? There are two points.

- rcu_cpu_kthread_should_stop() calls set_cpus_allowed_ptr() again periodically
  (see the sketch below), so it can reset the cpumask if
  cpuset_cpus_allowed_fallback() changed it. My debug printk observed the
  following cpumask changes at boot time:
     1) kthread_bind: bind cpu1
     2) cpuset_cpus_allowed_fallback: bind possible cpu
     3) rcu_cpu_kthread_should_stop: rebind cpu1
- while tsk->rt.nr_cpus_allowed == 1, the sched load balancer never crashes.
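
For reference, a simplified sketch of that periodic rebind check (assumed, not
the exact kernel/rcutree.c code):

static int rcu_cpu_kthread_should_stop(int cpu)
{
        while (cpu_is_offline(cpu) ||
               !cpumask_equal(&current->cpus_allowed, cpumask_of(cpu)) ||
               smp_processor_id() != cpu) {
                if (kthread_should_stop())
                        return 1;
                local_bh_enable();
                schedule_timeout_uninterruptible(1);
                /* undo whatever moved us, e.g. the cpuset fallback */
                if (!cpumask_equal(&current->cpus_allowed, cpumask_of(cpu)))
                        set_cpus_allowed_ptr(current, cpumask_of(cpu));
                local_bh_disable();
        }
        return 0;
}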


>
>> (4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed != 1,
>> but it haven't been initialized.
>>
>> RCU folks plan to introduce such FIFO kthread and our testing hitted the
>> above issue. Then this patch also protect it.
>
> I'm fairly sure it doesn't, normal cpu-hotplug doesn't poke at
> system_state.

If I understand correctly, it's a pure scheduler issue, because

- rcuc keeps the old rule (ie an early-spawned kthread has to call kthread_bind())
- cpuset_cpus_allowed_fallback() is called from scheduler internals
- the crash happens in find_lowest_rq() (at the line below; see also the sketch
  after the snippet)


static int find_lowest_rq(struct task_struct *task)
{
        (snip)
        if (cpumask_test_cpu(cpu, lowest_mask))   // HERE
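
The lowest_mask dereferenced above is a per-cpu scratch cpumask that is only
allocated once sched_init_smp() has run; a sketch of the relevant pieces
(assumed from kernel/sched_rt.c and kernel/sched.c of this era, simplified):

/* kernel/sched_rt.c */
static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);

static __init void init_sched_rt_class(void)
{
        unsigned int i;

        for_each_possible_cpu(i)
                zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
                                        GFP_KERNEL, cpu_to_node(i));
}

/* kernel/sched.c */
void __init sched_init_smp(void)
{
        (snip)
        init_sched_rt_class();  /* runs only after smp_init() */
}

So any RT push that reaches find_lowest_rq() before sched_init_smp() works on
an unallocated mask.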


IOW, I think cpuset_cpus_allowed_fallback() should NOT change
tsk->cpus_allowed until sched_init_smp() has finished.

Do you have any alternative idea for this?


>> Signed-off-by: KOSAKI Motohiro<[email protected]>
>> Cc: Oleg Nesterov<[email protected]>
>> Cc: Peter Zijlstra<[email protected]>
>> Cc: Ingo Molnar<[email protected]>
>> ---
>> include/linux/cpuset.h | 1 +
>> kernel/cpuset.c | 1 +
>> kernel/sched.c | 4 ++++
>> 3 files changed, 6 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index f20eb8f..42dcbdc 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
>> static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
>> {
>> cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
>> + p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
>> return cpumask_any(cpu_active_mask);
>> }
>>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index 1ceeb04..6e5bbe8 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>> cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
>> cpu = cpumask_any(cpu_active_mask);
>> }
>> + tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
>>
>> return cpu;
>> }
>
> I don't really see the point of doing this separately from your second
> patch, please fold them.

Ok. Will do.

>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index fd4625f..bfcd219 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>> if (dest_cpu< nr_cpu_ids)
>> return dest_cpu;
>>
>> + /* Don't worry. It's temporary mismatch. */
>> + if (system_state< SYSTEM_RUNNING)
>> + return cpu;
>> +
>> /* No more Mr. Nice Guy. */
>> dest_cpu = cpuset_cpus_allowed_fallback(p);
>> /*
>
> Like explained, I don't believe this actually fixes your problem (its
> also disgusting).

If anybody has an alternative idea, I have no reason to refuse it.

Thanks.

2011-05-13 06:42:50

by Yong Zhang

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Fri, May 13, 2011 at 1:48 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hi
>
>> On Mon, 2011-05-02 at 19:55 +0900, KOSAKI Motohiro wrote:
>>>
>>> The rule is, we have to update tsk->rt.nr_cpus_allowed too if we change
>>> tsk->cpus_allowed. Otherwise RT scheduler may confuse.
>>>
>>> This patch fixes it.
>>>
>>> btw, system_state checking is very important. current boot sequence is
>>> (1) smp_init
>>> (ie secondary cpus up and created cpu bound kthreads). (2)
>>> sched_init_smp().
>>> Then following bad scenario can be happen,
>>>
>>> (1) cpuup call notifier(CPU_UP_PREPARE)
>>> (2) A cpu notifier consumer create FIFO kthread
>>> (3) It call kthread_bind()
>>>    ... but, now secondary cpu haven't ONLINE
>>
>> isn't
>
> thanks, correction.
>
>>
>>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
>>>     change task->cpus_allowed
>>
>> I'm failing to see how this is happening, surely that kthread isn't
>> actually running that early?
>
> If my understand correctly, current call graph is below.
>
> kernel_init()
>        smp_init();
>                cpu_up()
>                        ... cpu hotplug notification
>                                kthread_create()
>        sched_init_smp();
>
>
> So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
> all kthreads of using cpu-up notification have to use kthread_bind(). It
> protected from sched load balance.
>
> but, now cpuset_cpus_allowed_fallback() forcedly change kthread's cpumask.
> Why is this works? the point are two.
>
> - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
> periodically.
>  then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
>  my debug print obseve following cpumask change occur at boot time.
>     1) kthread_bind: bind cpu1
>     2) cpuset_cpus_allowed_fallback: bind possible cpu
>     3) rcu_cpu_kthread_should_stop: rebind cpu1
> - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.

It seems rcu_spawn_one_cpu_kthread() calls wake_up_process() directly,
which happens under the CPU_UP_PREPARE hotplug event. Maybe it should be
done under CPU_ONLINE.

Thanks,
Yong

>
>
>>
>>> (4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed !=
>>> 1,
>>>     but it haven't been initialized.
>>>
>>> RCU folks plan to introduce such FIFO kthread and our testing hitted the
>>> above issue. Then this patch also protect it.
>>
>> I'm fairly sure it doesn't, normal cpu-hotplug doesn't poke at
>> system_state.
>
> If my understand correctly. it's pure scheduler issue. because
>
> - rcuc keep the old rule (ie an early spawned kthread have to call
> kthread_bind)
> - cpuset_cpus_allowed_fallback() is called from scheduler internal
> - crash is happen in find_lowest_rq(). (following line)
>
>
> static int find_lowest_rq(struct task_struct *task)
> {
>  (snip)
>        if (cpumask_test_cpu(cpu, lowest_mask))   // HERE
>
>
> IOW, I think cpuset_cpus_allowed_fallback() should NOT be changed
> tsk->cpus_allowed until to finish sched_init_smp().
>
> Do you have an any alternative idea for this?
>
>
>>> Signed-off-by: KOSAKI Motohiro<[email protected]>
>>> Cc: Oleg Nesterov<[email protected]>
>>> Cc: Peter Zijlstra<[email protected]>
>>> Cc: Ingo Molnar<[email protected]>
>>> ---
>>>  include/linux/cpuset.h |    1 +
>>>  kernel/cpuset.c        |    1 +
>>>  kernel/sched.c         |    4 ++++
>>>  3 files changed, 6 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>>> index f20eb8f..42dcbdc 100644
>>> --- a/include/linux/cpuset.h
>>> +++ b/include/linux/cpuset.h
>>> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct
>>> task_struct *p,
>>>  static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
>>>  {
>>>        cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
>>> +       p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
>>>        return cpumask_any(cpu_active_mask);
>>>  }
>>>
>>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>>> index 1ceeb04..6e5bbe8 100644
>>> --- a/kernel/cpuset.c
>>> +++ b/kernel/cpuset.c
>>> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct
>>> *tsk)
>>>                cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
>>>                cpu = cpumask_any(cpu_active_mask);
>>>        }
>>> +       tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
>>>
>>>        return cpu;
>>>  }
>>
>> I don't really see the point of doing this separately from your second
>> patch, please fold them.
>
> Ok. Will do.
>
>>
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index fd4625f..bfcd219 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct
>>> task_struct *p)
>>>        if (dest_cpu<  nr_cpu_ids)
>>>                return dest_cpu;
>>>
>>> +       /* Don't worry. It's temporary mismatch. */
>>> +       if (system_state<  SYSTEM_RUNNING)
>>> +               return cpu;
>>> +
>>>        /* No more Mr. Nice Guy. */
>>>        dest_cpu = cpuset_cpus_allowed_fallback(p);
>>>        /*
>>
>> Like explained, I don't believe this actually fixes your problem (its
>> also disgusting).
>
> If anybody have an alternative idea, I have no reason to refuse it.
>
> Thanks.
>
>
>



--
Only stand for myself

2011-05-13 07:31:48

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

Hello,

>>>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
>>>> change task->cpus_allowed
>>>
>>> I'm failing to see how this is happening, surely that kthread isn't
>>> actually running that early?
>>
>> If my understand correctly, current call graph is below.
>>
>> kernel_init()
>>        smp_init();
>>                cpu_up()
>>                        ... cpu hotplug notification
>>                                kthread_create()
>>        sched_init_smp();
>>
>>
>> So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
>> all kthreads of using cpu-up notification have to use kthread_bind(). It
>> protected from sched load balance.
>>
>> but, now cpuset_cpus_allowed_fallback() forcedly change kthread's cpumask.
>> Why is this works? the point are two.
>>
>> - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
>> periodically.
>> then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
>> my debug print obseve following cpumask change occur at boot time.
>> 1) kthread_bind: bind cpu1
>> 2) cpuset_cpus_allowed_fallback: bind possible cpu
>> 3) rcu_cpu_kthread_should_stop: rebind cpu1
>> - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>
> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> under CPU_ONLINE.

Hmm..

I haven't caught your point. cpu_up() calls both the CPU_UP_PREPARE and
CPU_ONLINE notifications, so CPU_ONLINE is still called before sched_init_smp().
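
For reference, a sketch of the ordering inside _cpu_up() (assumed from
kernel/cpu.c of this era, simplified):

static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
{
        (snip)
        ret = __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls);
        (snip)
        ret = __cpu_up(cpu);            /* arch code actually boots the cpu */
        (snip)
        cpu_notify(CPU_ONLINE | mod, hcpu);     /* cpu is online/active here */
        (snip)
}

Both notifications run from smp_init(), i.e. before sched_init_smp().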

Am I missing something?

2011-05-13 07:44:02

by Yong Zhang

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Fri, May 13, 2011 at 3:33 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Hello,
>
>>>>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
>>>>>     change task->cpus_allowed
>>>>
>>>> I'm failing to see how this is happening, surely that kthread isn't
>>>> actually running that early?
>>>
>>> If my understand correctly, current call graph is below.
>>>
>>> kernel_init()
>>>        smp_init();
>>>                cpu_up()
>>>                        ... cpu hotplug notification
>>>                                kthread_create()
>>>        sched_init_smp();
>>>
>>>
>>> So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
>>> all kthreads of using cpu-up notification have to use kthread_bind(). It
>>> protected from sched load balance.
>>>
>>> but, now cpuset_cpus_allowed_fallback() forcedly change kthread's
>>> cpumask.
>>> Why is this works? the point are two.
>>>
>>> - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
>>> periodically.
>>>  then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
>>>  my debug print obseve following cpumask change occur at boot time.
>>>     1) kthread_bind: bind cpu1
>>>     2) cpuset_cpus_allowed_fallback: bind possible cpu
>>>     3) rcu_cpu_kthread_should_stop: rebind cpu1
>>> - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>>
>> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
>> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
>> under CPU_ONLINE.
>
> Hmm..
>
> I haven't catch your point. cpu_up() call both CPU_UP_PREPARE and CPU_ONLINE
> notification. Thus, CPU_ONLINE still be called before sched_init_smp().

The difference is that by the time CPU_ONLINE runs, cpu_online_mask and
cpu_active_mask have already been updated, so
wake_up_process()->ttwu()->select_task_rq() will not fall back to
cpuset_cpus_allowed_fallback().
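
In other words, once the cpu is in cpu_active_mask, select_fallback_rq() finds
it in its normal loops and never reaches the cpuset fallback. A simplified
sketch (assumed from kernel/sched.c of this era, not a literal quote):

static int select_fallback_rq(int cpu, struct task_struct *p)
{
        int dest_cpu;
        const struct cpumask *nodemask = cpumask_of_node(cpu_to_node(cpu));

        /* Look for an allowed, active CPU in the same node. */
        for_each_cpu_and(dest_cpu, nodemask, cpu_active_mask)
                if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
                        return dest_cpu;

        /* Any allowed, active CPU? */
        for_each_cpu_and(dest_cpu, &p->cpus_allowed, cpu_active_mask)
                if (dest_cpu < nr_cpu_ids)
                        return dest_cpu;

        /* Before CPU_ONLINE the target cpu is not active, so a kthread
         * bound to it ends up here instead. */
        dest_cpu = cpuset_cpus_allowed_fallback(p);
        (snip)
}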

But the comments for rcu_cpu_kthread_should_stop() shot that idea down, since
rcu_spawn_one_cpu_kthread() must be called under CPU_UP_PREPARE
for some reason.

Maybe we should find a more suitable way for your issue :)

Thanks,
Yong

>
> Am I missing something?
>
>
>



--
Only stand for myself

2011-05-13 09:33:12

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

(2011/05/13 16:43), Yong Zhang wrote:
> On Fri, May 13, 2011 at 3:33 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>> Hello,
>>
>>>>>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
>>>>>> change task->cpus_allowed
>>>>>
>>>>> I'm failing to see how this is happening, surely that kthread isn't
>>>>> actually running that early?
>>>>
>>>> If my understand correctly, current call graph is below.
>>>>
>>>> kernel_init()
>>>>        smp_init();
>>>>                cpu_up()
>>>>                        ... cpu hotplug notification
>>>>                                kthread_create()
>>>>        sched_init_smp();
>>>>
>>>>
>>>> So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
>>>> all kthreads of using cpu-up notification have to use kthread_bind(). It
>>>> protected from sched load balance.
>>>>
>>>> but, now cpuset_cpus_allowed_fallback() forcedly change kthread's
>>>> cpumask.
>>>> Why is this works? the point are two.
>>>>
>>>> - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
>>>> periodically.
>>>> then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
>>>> my debug print obseve following cpumask change occur at boot time.
>>>> 1) kthread_bind: bind cpu1
>>>> 2) cpuset_cpus_allowed_fallback: bind possible cpu
>>>> 3) rcu_cpu_kthread_should_stop: rebind cpu1
>>>> - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>>>
>>> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
>>> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
>>> under CPU_ONLINE.
>>
>> Hmm..
>>
>> I haven't catch your point. cpu_up() call both CPU_UP_PREPARE and CPU_ONLINE
>> notification. Thus, CPU_ONLINE still be called before sched_init_smp().
>
> The difference is that CPU_ONLINE will set cpu_online_mask and cpu_active_mask,
> so wake_up_process()->ttwu()->select_task_rq() will not go back to
> cpuset_cpus_allowed_fallback().

Right. Thank you. Now I got your point!

> But the comments for rcu_cpu_kthread_should_stop() shot me, since
> rcu_spawn_one_cpu_kthread() must be called under CPU_UP_PREPARE
> for some reason.
>
> Maybe we should find a more suitable way for your issue :)

OK, I'll try to read and understand the rcu-kthread design and work it out.


2011-05-13 17:02:36

by Peter Zijlstra

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Fri, 2011-05-13 at 14:42 +0800, Yong Zhang wrote:
> > - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
> > periodically.
> > then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
> > my debug print obseve following cpumask change occur at boot time.
> > 1) kthread_bind: bind cpu1
> > 2) cpuset_cpus_allowed_fallback: bind possible cpu
> > 3) rcu_cpu_kthread_should_stop: rebind cpu1
> > - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>
> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> under CPU_ONLINE.

IIRC I talked to Paul about this a while back and ONLINE is too late,
however STARTING should work. At the time he couldn't quite get that to
work, but the above situation is indeed the root cause of our problems.

We shouldn't try to run a cpu affine thread before the cpu in question
is actually able to run stuff.

I did me a little hackery and with the below patch my kernel still
boots...

Would that sort your issue?

---
kernel/rcutree.c | 44 ++++++++++++++++++++++++++++++++++++++------
kernel/rcutree_plugin.h | 1 -
2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5616b17..e0218ed 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1656,7 +1656,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
- wake_up_process(t);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
return 0;
@@ -1764,13 +1763,33 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
raw_spin_lock_irqsave(&rnp->lock, flags);
rnp->node_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- wake_up_process(t);
sp.sched_priority = 99;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
}
return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

+static void __cpuinit rcu_wake_cpu_kthread(int cpu)
+{
+ struct task_struct *p = per_cpu(rcu_cpu_kthread_task, cpu);
+
+ if (p)
+ wake_up_process(p);
+}
+
+static void __cpuinit rcu_wake_node_kthread(struct rcu_node *rnp)
+{
+ if (!rnp)
+ return;
+
+ if (rnp->node_kthread_task)
+ wake_up_process(rnp->node_kthread_task);
+#ifdef CONFIG_RCU_BOOST
+ if (rnp->boost_kthread_task)
+ wake_up_process(rnp->boost_kthread_task);
+#endif
+}
+
/*
* Spawn all kthreads -- called as soon as the scheduler is running.
*/
@@ -1783,19 +1802,24 @@ static int __init rcu_spawn_kthreads(void)
for_each_possible_cpu(cpu) {
init_waitqueue_head(&per_cpu(rcu_cpu_wq, cpu));
per_cpu(rcu_cpu_has_work, cpu) = 0;
- if (cpu_online(cpu))
+ if (cpu_online(cpu)) {
(void)rcu_spawn_one_cpu_kthread(cpu);
+ rcu_wake_cpu_kthread(cpu);
+ }
}
rnp = rcu_get_root(rcu_state);
init_waitqueue_head(&rnp->node_wq);
rcu_init_boost_waitqueue(rnp);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- if (NUM_RCU_NODES > 1)
+ rcu_wake_node_kthread(rnp);
+ if (NUM_RCU_NODES > 1) {
rcu_for_each_leaf_node(rcu_state, rnp) {
init_waitqueue_head(&rnp->node_wq);
rcu_init_boost_waitqueue(rnp);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ rcu_wake_node_kthread(rnp);
}
+ }
return 0;
}
early_initcall(rcu_spawn_kthreads);
@@ -2206,7 +2230,7 @@ static void __cpuinit rcu_online_cpu(int cpu)
rcu_preempt_init_percpu_data(cpu);
}

-static void __cpuinit rcu_online_kthreads(int cpu)
+static void __cpuinit rcu_prepare_kthreads(int cpu)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
struct rcu_node *rnp = rdp->mynode;
@@ -2233,7 +2257,15 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
rcu_online_cpu(cpu);
- rcu_online_kthreads(cpu);
+ rcu_prepare_kthreads(cpu);
+ break;
+ case CPU_STARTING:
+ rcu_wake_cpu_kthread(cpu);
+ do {
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
+ if (rdp)
+ rcu_wake_node_kthread(rdp->mynode);
+ } while (0);
break;
case CPU_ONLINE:
case CPU_DOWN_FAILED:
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index ed339702..961a316 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1306,7 +1306,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
raw_spin_lock_irqsave(&rnp->lock, flags);
rnp->boost_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- wake_up_process(t);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
return 0;

2011-05-14 11:17:38

by KOSAKI Motohiro

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

2011/5/14 Peter Zijlstra <[email protected]>:
> On Fri, 2011-05-13 at 14:42 +0800, Yong Zhang wrote:
>> > - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
>> > periodically.
>> >  then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
>> >  my debug print obseve following cpumask change occur at boot time.
>> >     1) kthread_bind: bind cpu1
>> >     2) cpuset_cpus_allowed_fallback: bind possible cpu
>> >     3) rcu_cpu_kthread_should_stop: rebind cpu1
>> > - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>>
>> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
>> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
>> under CPU_ONLINE.
>
> IIRC I talked to Paul about this a while back and ONLINE is too late,
> however STARTING should work. At the time he couldn't quite get that to
> work, but the above situation is indeed the root cause of our problems.
>
> We shouldn't try to run a cpu affine thread before the cpu in question
> is actually able to run stuff.
>
> I did me a little hackery and with the below patch my kernel still
> boots...
>
> Would that sort your issue?

Great!!
Unfortunately, I can't test this until next Wednesday, but I'll do it
as soon as possible.

Thanks.

2011-05-15 21:34:17

by Paul E. McKenney

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Fri, May 13, 2011 at 02:42:48PM +0800, Yong Zhang wrote:
> On Fri, May 13, 2011 at 1:48 PM, KOSAKI Motohiro
> <[email protected]> wrote:
> > Hi
> >
> >> On Mon, 2011-05-02 at 19:55 +0900, KOSAKI Motohiro wrote:
> >>>
> >>> The rule is, we have to update tsk->rt.nr_cpus_allowed too if we change
> >>> tsk->cpus_allowed. Otherwise RT scheduler may confuse.
> >>>
> >>> This patch fixes it.
> >>>
> >>> btw, system_state checking is very important. current boot sequence is
> >>> (1) smp_init
> >>> (ie secondary cpus up and created cpu bound kthreads). (2)
> >>> sched_init_smp().
> >>> Then following bad scenario can be happen,
> >>>
> >>> (1) cpuup call notifier(CPU_UP_PREPARE)
> >>> (2) A cpu notifier consumer create FIFO kthread
> >>> (3) It call kthread_bind()
> >>>    ... but, now secondary cpu haven't ONLINE
> >>
> >> isn't
> >
> > thanks, correction.
> >
> >>
> >>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
> >>>     change task->cpus_allowed
> >>
> >> I'm failing to see how this is happening, surely that kthread isn't
> >> actually running that early?
> >
> > If my understand correctly, current call graph is below.
> >
> > kernel_init()
> >        smp_init();
> >                cpu_up()
> >                        ... cpu hotplug notification
> >                                kthread_create()
> >        sched_init_smp();
> >
> >
> > So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
> > all kthreads of using cpu-up notification have to use kthread_bind(). It
> > protected from sched load balance.
> >
> > but, now cpuset_cpus_allowed_fallback() forcedly change kthread's cpumask.
> > Why is this works? the point are two.
> >
> > - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
> > periodically.
> >  then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
> >  my debug print obseve following cpumask change occur at boot time.
> >     1) kthread_bind: bind cpu1
> >     2) cpuset_cpus_allowed_fallback: bind possible cpu
> >     3) rcu_cpu_kthread_should_stop: rebind cpu1
> > - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>
> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> under CPU_ONLINE.

Sorry, but this does not work. The kthread must be running by the time
the CPU appears, otherwise RCU grace periods in CPU_ONLINE notifiers
will never complete.

This did turn out to be a scheduler bug -- see Cheng Xu's recent patch.
([email protected])

Thanx, Paul

> Thanks,
> Yong
>
> >
> >
> >>
> >>> (4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed !=
> >>> 1,
> >>>     but it haven't been initialized.
> >>>
> >>> RCU folks plan to introduce such FIFO kthread and our testing hitted the
> >>> above issue. Then this patch also protect it.
> >>
> >> I'm fairly sure it doesn't, normal cpu-hotplug doesn't poke at
> >> system_state.
> >
> > If my understand correctly. it's pure scheduler issue. because
> >
> > - rcuc keep the old rule (ie an early spawned kthread have to call
> > kthread_bind)
> > - cpuset_cpus_allowed_fallback() is called from scheduler internal
> > - crash is happen in find_lowest_rq(). (following line)
> >
> >
> > static int find_lowest_rq(struct task_struct *task)
> > {
> >  (snip)
> >        if (cpumask_test_cpu(cpu, lowest_mask))   // HERE
> >
> >
> > IOW, I think cpuset_cpus_allowed_fallback() should NOT be changed
> > tsk->cpus_allowed until to finish sched_init_smp().
> >
> > Do you have an any alternative idea for this?
> >
> >
> >>> Signed-off-by: KOSAKI Motohiro<[email protected]>
> >>> Cc: Oleg Nesterov<[email protected]>
> >>> Cc: Peter Zijlstra<[email protected]>
> >>> Cc: Ingo Molnar<[email protected]>
> >>> ---
> >>>  include/linux/cpuset.h |    1 +
> >>>  kernel/cpuset.c        |    1 +
> >>>  kernel/sched.c         |    4 ++++
> >>>  3 files changed, 6 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> >>> index f20eb8f..42dcbdc 100644
> >>> --- a/include/linux/cpuset.h
> >>> +++ b/include/linux/cpuset.h
> >>> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct
> >>> task_struct *p,
> >>>  static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> >>>  {
> >>>        cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> >>> +       p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> >>>        return cpumask_any(cpu_active_mask);
> >>>  }
> >>>
> >>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> >>> index 1ceeb04..6e5bbe8 100644
> >>> --- a/kernel/cpuset.c
> >>> +++ b/kernel/cpuset.c
> >>> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct
> >>> *tsk)
> >>>                cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> >>>                cpu = cpumask_any(cpu_active_mask);
> >>>        }
> >>> +       tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
> >>>
> >>>        return cpu;
> >>>  }
> >>
> >> I don't really see the point of doing this separately from your second
> >> patch, please fold them.
> >
> > Ok. Will do.
> >
> >>
> >>> diff --git a/kernel/sched.c b/kernel/sched.c
> >>> index fd4625f..bfcd219 100644
> >>> --- a/kernel/sched.c
> >>> +++ b/kernel/sched.c
> >>> @@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct
> >>> task_struct *p)
> >>>        if (dest_cpu < nr_cpu_ids)
> >>>                return dest_cpu;
> >>>
> >>> +       /* Don't worry. It's temporary mismatch. */
> >>> +       if (system_state < SYSTEM_RUNNING)
> >>> +               return cpu;
> >>> +
> >>>        /* No more Mr. Nice Guy. */
> >>>        dest_cpu = cpuset_cpus_allowed_fallback(p);
> >>>        /*
> >>
> >> Like explained, I don't believe this actually fixes your problem (its
> >> also disgusting).
> >
> > If anybody have an alternative idea, I have no reason to refuse it.
> >
> > Thanks.
> >
> >
> >
>
>
>
> --
> Only stand for myself

2011-05-16 13:26:36

by Yong Zhang

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Sun, May 15, 2011 at 11:55:47AM -0700, Paul E. McKenney wrote:
> > Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> > which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> > under CPU_ONLINE.
>
> Sorry, but this does not work. The kthread must be running by the time

Yup, I also noticed that from the comments of rcu_spawn_one_cpu_kthread().

> the CPU appears, otherwise RCU grace periods in CPU_ONLINE notifiers
> will never complete.
>
> This did turn out to be a scheduler bug -- see Cheng Xu's recent patch.
> ([email protected])

Hmmm, I don't think they are the same issue.

The patch sent by Cheng Xu is about rt runtime normalization during
CPU hotplug.

But this one is about rt_rq load balancing happening too early.

Let's consider this:

on boot CPU
...
kernel_init();
  smp_init();
    _cpu_up();
      __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls);
        rcu_cpu_notify();
          rcu_online_kthreads();
            rcu_spawn_one_node_kthread();
              wake_up_process();
                try_to_wake_up();
                  select_task_rq();
                    select_fallback_rq();
                      cpuset_cpus_allowed_fallback();
                      /* here the rcu_thread's cpus_allowed will be set
                       * to cpu_possible_mask, but now we only have the
                       * boot cpu online, so it will run on the boot cpu.
                       * So if we have KOSAKI's patch applied,
                       * p->rt.nr_cpus_allowed will be set to
                       * cpumask_weight(cpu_possible_mask);
                       */
              sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
                __sched_setscheduler();
                  check_class_changed();
                    p->sched_class->switched_to(rq, p); /* rt_class */
                      push_rt_task();
                        find_lock_lowest_rq();
                          find_lowest_rq();
                          /* crash here because local_cpu_mask is uninitialized */


Thanks,
Yong

2011-05-16 13:38:00

by Yong Zhang

Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Fri, May 13, 2011 at 07:02:15PM +0200, Peter Zijlstra wrote:
> On Fri, 2011-05-13 at 14:42 +0800, Yong Zhang wrote:
> > > - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
> > > periodically.
> > > then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
> > > my debug print obseve following cpumask change occur at boot time.
> > > 1) kthread_bind: bind cpu1
> > > 2) cpuset_cpus_allowed_fallback: bind possible cpu
> > > 3) rcu_cpu_kthread_should_stop: rebind cpu1
> > > - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
> >
> > Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> > which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> > under CPU_ONLINE.
>
> IIRC I talked to Paul about this a while back and ONLINE is too late,
> however STARTING should work. At the time he couldn't quite get that to
> work, but the above situation is indeed the root cause of our problems.
>
> We shouldn't try to run a cpu affine thread before the cpu in question
> is actually able to run stuff.

But I'm afraid this patch still doesn't help.
If I understand your patch correctly, you just moved the wake-up to CPU_STARTING,
but that is still before CPU_ONLINE.
Please check my mail to Paul in this thread.

Thanks,
Yong

>
> I did me a little hackery and with the below patch my kernel still
> boots...
>
> Would that sort your issue?
>
> ---
> kernel/rcutree.c | 44 ++++++++++++++++++++++++++++++++++++++------
> kernel/rcutree_plugin.h | 1 -
> 2 files changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 5616b17..e0218ed 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1656,7 +1656,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
> per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
> WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
> per_cpu(rcu_cpu_kthread_task, cpu) = t;
> - wake_up_process(t);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> return 0;
> @@ -1764,13 +1763,33 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
> raw_spin_lock_irqsave(&rnp->lock, flags);
> rnp->node_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - wake_up_process(t);
> sp.sched_priority = 99;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> }
> return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
> }
>
> +static void __cpuinit rcu_wake_cpu_kthread(int cpu)
> +{
> + struct task_struct *p = per_cpu(rcu_cpu_kthread_task, cpu);
> +
> + if (p)
> + wake_up_process(p);
> +}
> +
> +static void __cpuinit rcu_wake_node_kthread(struct rcu_node *rnp)
> +{
> + if (!rnp)
> + return;
> +
> + if (rnp->node_kthread_task)
> + wake_up_process(rnp->node_kthread_task);
> +#ifdef CONFIG_RCU_BOOST
> + if (rnp->boost_kthread_task)
> + wake_up_process(rnp->boost_kthread_task);
> +#endif
> +}
> +
> /*
> * Spawn all kthreads -- called as soon as the scheduler is running.
> */
> @@ -1783,19 +1802,24 @@ static int __init rcu_spawn_kthreads(void)
> for_each_possible_cpu(cpu) {
> init_waitqueue_head(&per_cpu(rcu_cpu_wq, cpu));
> per_cpu(rcu_cpu_has_work, cpu) = 0;
> - if (cpu_online(cpu))
> + if (cpu_online(cpu)) {
> (void)rcu_spawn_one_cpu_kthread(cpu);
> + rcu_wake_cpu_kthread(cpu);
> + }
> }
> rnp = rcu_get_root(rcu_state);
> init_waitqueue_head(&rnp->node_wq);
> rcu_init_boost_waitqueue(rnp);
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> - if (NUM_RCU_NODES > 1)
> + rcu_wake_node_kthread(rnp);
> + if (NUM_RCU_NODES > 1) {
> rcu_for_each_leaf_node(rcu_state, rnp) {
> init_waitqueue_head(&rnp->node_wq);
> rcu_init_boost_waitqueue(rnp);
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> + rcu_wake_node_kthread(rnp);
> }
> + }
> return 0;
> }
> early_initcall(rcu_spawn_kthreads);
> @@ -2206,7 +2230,7 @@ static void __cpuinit rcu_online_cpu(int cpu)
> rcu_preempt_init_percpu_data(cpu);
> }
>
> -static void __cpuinit rcu_online_kthreads(int cpu)
> +static void __cpuinit rcu_prepare_kthreads(int cpu)
> {
> struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> struct rcu_node *rnp = rdp->mynode;
> @@ -2233,7 +2257,15 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> rcu_online_cpu(cpu);
> - rcu_online_kthreads(cpu);
> + rcu_prepare_kthreads(cpu);
> + break;
> + case CPU_STARTING:
> + rcu_wake_cpu_kthread(cpu);
> + do {
> + struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> + if (rdp)
> + rcu_wake_node_kthread(rdp->mynode);
> + } while (0);
> break;
> case CPU_ONLINE:
> case CPU_DOWN_FAILED:
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index ed339702..961a316 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -1306,7 +1306,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> raw_spin_lock_irqsave(&rnp->lock, flags);
> rnp->boost_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - wake_up_process(t);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> return 0;

2011-05-19 06:06:28

by KOSAKI Motohiro

Subject: [PATCH v2 1/2] rcu: don't bind offline cpu

Hi Paul,

I've made a new patch. Is this acceptable to you?


==============================================================
While discussing the cpuset_cpus_allowed_fallback() fix, we found that
the rcu subsystem doesn't use kthread_bind() correctly.

In detail: a typical subsystem wakes its kthread up from the CPU_ONLINE
notifier (i.e. _after_ the cpu is onlined), but the rcu subsystem wakes
its kthread up from the CPU_UP_PREPARE notifier (i.e. _before_ the cpu
is onlined), because otherwise RCU grace periods started from CPU_ONLINE
notifiers would never complete.

This makes a big difference. If we fix cpuset_cpus_allowed_fallback(),
the scheduler's RT load balancer can run before the scheduler's SMP
initialization is finished and crash the kernel (see below).

kernel_init();
  smp_init();
    _cpu_up();
      __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls);
        rcu_cpu_notify();
          rcu_online_kthreads();
            rcu_spawn_one_node_kthread();
              wake_up_process();
                try_to_wake_up();
                  select_task_rq();
                    select_fallback_rq();
                      cpuset_cpus_allowed_fallback();
                        /* here the rcu_thread's cpus_allowed will
                         * be set to cpu_possible_mask, but now
                         * we only have the boot cpu online, so it
                         * will run on the boot cpu; p->rt.nr_cpus_allowed
                         * will be set to cpumask_weight(cpu_possible_mask)
                         */
              sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
                __sched_setscheduler();
                  check_class_changed();
                    p->sched_class->switched_to(rq, p); /* rt_class */
                      push_rt_task();
                        find_lock_lowest_rq();
                          find_lowest_rq();
                            /* crash here because local_cpu_mask is uninitialized */

The right way is an explicit two-phase cpu binding: (1) bind to the boot
(or any other online) cpu at CPU_UP_PREPARE, then (2) bind to the correct
target cpu once it is online. This patch does exactly that.
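
(Illustrative aside, not part of the changelog: a minimal sketch of the
two-phase pattern with hypothetical names; it is not the actual rcutree
code. Phase 2 runs inside the kthread itself, on the assumption that a
task may change its own affinity even after kthread_bind() has set
PF_THREAD_BOUND, which is what the patch below relies on via
rcu_cpu_kthread_should_stop().)

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

/* Phase 2: the kthread migrates itself once its real cpu is online. */
static int example_percpu_kthread(void *arg)
{
        int target = (long)arg;
        bool rebound = false;

        while (!kthread_should_stop()) {
                if (!rebound && cpu_online(target)) {
                        set_cpus_allowed_ptr(current, cpumask_of(target));
                        rebound = true;
                }
                /* ... per-cpu work would go here ... */
                schedule_timeout_interruptible(HZ);
        }
        return 0;
}

/* Phase 1: called from CPU_UP_PREPARE while 'cpu' is still offline,
 * so park the kthread on whatever cpu is already online. */
static struct task_struct *example_spawn_percpu_kthread(int cpu)
{
        struct task_struct *t;

        t = kthread_create(example_percpu_kthread, (void *)(long)cpu,
                           "example/%d", cpu);
        if (IS_ERR(t))
                return t;
        kthread_bind(t, cpumask_any(cpu_online_mask));
        wake_up_process(t);
        return t;
}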

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Yong Zhang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
kernel/rcutree.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5616b17..924b0cd 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1645,6 +1645,7 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
struct sched_param sp;
struct task_struct *t;
+ int bound_cpu;

if (!rcu_kthreads_spawnable ||
per_cpu(rcu_cpu_kthread_task, cpu) != NULL)
@@ -1652,8 +1653,14 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
if (IS_ERR(t))
return PTR_ERR(t);
- kthread_bind(t, cpu);
- per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
+ /*
+ * The target cpu isn't online yet, so the rcuc kthread can't be bound
+ * to it. Thus we bind it to another online cpu temporarily;
+ * rcu_cpu_kthread_should_stop() rebinds it to the target cpu later.
+ */
+ bound_cpu = cpumask_any(cpu_online_mask);
+ kthread_bind(t, bound_cpu);
+ per_cpu(rcu_cpu_kthread_cpu, cpu) = bound_cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
wake_up_process(t);
--
1.7.3.1


2011-05-19 06:09:11

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH v2 2/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

The rule is that we have to update tsk->rt.nr_cpus_allowed whenever we
change tsk->cpus_allowed. Otherwise the RT scheduler may get confused.

This patch fixes it.
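
(Illustrative aside, not part of the changelog: a sketch of the consumer
side that makes the rule matter. It is modelled loosely on the RT
scheduler of that era, which decides migratability from the cached
weight rather than from cpus_allowed itself; the helper name is made up.)

/* The RT push/pull logic trusts the cached weight, so it must never go
 * stale relative to p->cpus_allowed. */
static inline int example_task_is_migratable(struct task_struct *p)
{
        return p->rt.nr_cpus_allowed > 1;
}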

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/cpuset.h | 2 +-
include/linux/sched.h | 7 +++++++
kernel/cpuset.c | 4 ++--
kernel/kthread.c | 4 ++--
kernel/sched.c | 19 ++++++++++++-------
5 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index f20eb8f..e9eaec5 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -146,7 +146,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,

static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
- cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ do_set_cpus_allowed(p, cpu_possible_mask);
return cpumask_any(cpu_active_mask);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12211e1..fa8fe23 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1799,9 +1799,16 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif

#ifdef CONFIG_SMP
+extern void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask);
+
extern int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask);
#else
+static inline void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+}
static inline int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask)
{
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2bb8c2e..04ceb53 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2195,7 +2195,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
rcu_read_lock();
cs = task_cs(tsk);
if (cs)
- cpumask_copy(&tsk->cpus_allowed, cs->cpus_allowed);
+ do_set_cpus_allowed(tsk, cs->cpus_allowed);
rcu_read_unlock();

/*
@@ -2222,7 +2222,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
* Like above we can temporary set any mask and rely on
* set_cpus_allowed_ptr() as synchronization point.
*/
- cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
+ do_set_cpus_allowed(tsk, cpu_possible_mask);
cpu = cpumask_any(cpu_active_mask);
}

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3b34d27..4ba7ccc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -202,8 +202,8 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
return;
}

- p->cpus_allowed = cpumask_of_cpu(cpu);
- p->rt.nr_cpus_allowed = 1;
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);
diff --git a/kernel/sched.c b/kernel/sched.c
index 8efe3b7..52f1b5b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5826,7 +5826,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();

- cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
+ do_set_cpus_allowed(idle, cpumask_of(cpu));
/*
* We're having a chicken and egg problem, even though we are
* holding rq->lock, the cpu isn't yet set to this cpu so the
@@ -5914,6 +5914,16 @@ static inline void sched_init_granularity(void)
}

#ifdef CONFIG_SMP
+void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
+{
+ if (p->sched_class && p->sched_class->set_cpus_allowed)
+ p->sched_class->set_cpus_allowed(p, new_mask);
+ else {
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+ }
+}
+
/*
* This is how migration works:
*
@@ -5959,12 +5969,7 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
goto out;
}

- if (p->sched_class->set_cpus_allowed)
- p->sched_class->set_cpus_allowed(p, new_mask);
- else {
- cpumask_copy(&p->cpus_allowed, new_mask);
- p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
- }
+ do_set_cpus_allowed(p, new_mask);

/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), new_mask))
--
1.7.3.1


2011-05-19 08:35:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

On Thu, 2011-05-19 at 15:06 +0900, KOSAKI Motohiro wrote:
>
> The right way is, explicit two phase cpu bindings (1) bind boot
> (or any other online) cpu at CPU_UP_PREPARE (2) bind correct
> target cpu at CPU_ONLINE. This patch does it.

I'm not sure that is in-fact correct. From what I understood, RCU could
have need of this thread before the ONLINE callback and expects it to be
affine at that time.

What was wrong with delaying the wakeup to STARTING?

2011-05-19 08:45:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Mon, 2011-05-16 at 21:37 +0800, Yong Zhang wrote:
>
> But I'm afraid this patch still doesn't help.
> If I understand your patch correctly, you just put the wake up to CPU_STARTING,
> but it's still before CPU_ONLINE.

Well, that's the whole point really, CPU_ONLINE is terribly late. But
did you perhaps mean to say it's before we mark the cpu online?

> Please check my mail to Paul's in this thread group.

Not sure that made things any clearer.

2011-05-19 08:51:11

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

(2011/05/19 17:34), Peter Zijlstra wrote:
> On Thu, 2011-05-19 at 15:06 +0900, KOSAKI Motohiro wrote:
>>
>> The right way is, explicit two phase cpu bindings (1) bind boot
>> (or any other online) cpu at CPU_UP_PREPARE (2) bind correct
>> target cpu at CPU_ONLINE. This patch does it.
>
> I'm not sure that is in-fact correct. From what I understood, RCU could
> have need of this thread before the ONLINE callback and expects it to be
> affine at that time.
>
> What was wrong with delaying the wakeup to STARTING?

Grr, I misparsed Paul's mail. I thought he said "Sorry, but this does not work."
to you, but in fact he said it to Yong. I have to look at and test your patch.

I'm sorry. Please give me a bit time.

2011-05-19 08:54:42

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don't update tsk->rt.nr_cpus_allowed

On Thu, May 19, 2011 at 4:45 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2011-05-16 at 21:37 +0800, Yong Zhang wrote:
>>
>> But I'm afraid this patch still doesn't help.
>> If I understand your patch correctly, you just put the wake up to CPU_STARTING,
>> but it's still before CPU_ONLINE.
>
> Well, that's the whole point really, CPU_ONLINE is terribly late. But
> did you perhaps mean to say it's before we mark the cpu online?

I always try to express myself correctly (poor English :)

Precisely, it's before we mark the cpu active. sched_cpu_active()
is called at CPU_ONLINE time.

Thanks,
Yong

>
>> Please check my mail to Paul's in this thread group.
>
> Not sure that made things any clearer.
>



--
Only stand for myself

2011-05-19 08:55:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

On Thu, 2011-05-19 at 10:34 +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-19 at 15:06 +0900, KOSAKI Motohiro wrote:
> >
> > The right way is, explicit two phase cpu bindings (1) bind boot
> > (or any other online) cpu at CPU_UP_PREPARE (2) bind correct
> > target cpu at CPU_ONLINE. This patch does it.
>
> I'm not sure that is in-fact correct. From what I understood, RCU could
> have need of this thread before the ONLINE callback and expects it to be
> affine at that time.

Right, so we're waking that thread from the timer tick, so we must have
that thread set-up and ready _before_ we enable interrupts for the first
time. CPU_ONLINE is _WAAAAAY_ too late for that.

> What was wrong with delaying the wakeup to STARTING?

Hrmm right, so I thought STARTING was right after we marked the cpu
online, but it's right before. Bummer.

Also Paul:

rcu_cpu_kthread()
  rcu_process_callbacks()
    __rcu_process_callbacks()
      rcu_do_batch()
        invoke_rcu_cpu_kthread()

Why wake the thread again if it's already running?

2011-05-19 09:42:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

On Thu, 2011-05-19 at 17:50 +0900, KOSAKI Motohiro wrote:
> I'm sorry. Please give me a bit time.

N/P, as I've already found out CPU_STARTING isn't a suitable location
either.

How about something like the below...


---
Subject: rcu: Remove waitqueue usage for cpu,node and boost kthreads
From: Peter Zijlstra <[email protected]>
Date: Thu May 19 11:27:39 CEST 2011

Since wait-queue usage is inconsistent (node) and generally not needed
since we know exactly which thread to wake, rip it out.

The problem is that wake_up() only issues an actual wake-up when there's
a thread waiting on the queue which means we need an extra, explicit,
wakeup to kick-start the kthread itself (they're born sleeping).

Now, since we already know which exact thread we want to wake up,
there's no point really in enqueuing it on a waitqueue. So provide Paul
with a nice rcu_wait() macro to replace wait_event_interruptible() with,
and rip out all the waitqueue stuff.

This moves us to an explicit, unconditional, wake_up_process() which can
be used to wake us from both conditions.
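
(Illustrative aside, not part of the changelog: how the two sides pair up
once the waitqueue is gone. 'has_work' stands in for rcu_cpu_has_work.
No wakeup is lost because rcu_wait() sets TASK_INTERRUPTIBLE before
testing the condition, while the waker sets the condition before calling
wake_up_process().)

/* Waker side -- roughly what invoke_rcu_cpu_kthread() boils down to. */
static void example_kick(struct task_struct *t, char *has_work)
{
        *has_work = 1;          /* publish the condition first ...       */
        wake_up_process(t);     /* ... then wake; harmless if 't' is     */
                                /* already running                       */
}

/* Waiter side -- the top of the kthread's loop. */
static void example_wait(char *has_work)
{
        rcu_wait(*has_work != 0 || kthread_should_stop());
}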

Signed-off-by: Peter Zijlstra <[email protected]>
---
Index: linux-2.6/kernel/rcutree.c
===================================================================
--- linux-2.6.orig/kernel/rcutree.c
+++ linux-2.6/kernel/rcutree.c
@@ -94,7 +94,6 @@ static DEFINE_PER_CPU(struct task_struct
DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_status);
DEFINE_PER_CPU(int, rcu_cpu_kthread_cpu);
DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops);
-static DEFINE_PER_CPU(wait_queue_head_t, rcu_cpu_wq);
DEFINE_PER_CPU(char, rcu_cpu_has_work);
static char rcu_kthreads_spawnable;

@@ -1475,7 +1474,7 @@ static void invoke_rcu_cpu_kthread(void)
local_irq_restore(flags);
return;
}
- wake_up(&__get_cpu_var(rcu_cpu_wq));
+ wake_up_process(__this_cpu_read(rcu_cpu_kthread_task));
local_irq_restore(flags);
}

@@ -1598,14 +1597,12 @@ static int rcu_cpu_kthread(void *arg)
unsigned long flags;
int spincnt = 0;
unsigned int *statusp = &per_cpu(rcu_cpu_kthread_status, cpu);
- wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
char work;
char *workp = &per_cpu(rcu_cpu_has_work, cpu);

for (;;) {
*statusp = RCU_KTHREAD_WAITING;
- wait_event_interruptible(*wqp,
- *workp != 0 || kthread_should_stop());
+ rcu_wait(*workp != 0 || kthread_should_stop());
local_bh_disable();
if (rcu_cpu_kthread_should_stop(cpu)) {
local_bh_enable();
@@ -1656,7 +1653,6 @@ static int __cpuinit rcu_spawn_one_cpu_k
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
- wake_up_process(t);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
return 0;
@@ -1679,7 +1675,7 @@ static int rcu_node_kthread(void *arg)

for (;;) {
rnp->node_kthread_status = RCU_KTHREAD_WAITING;
- wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0);
+ rcu_wait(rnp->wakemask != 0);
rnp->node_kthread_status = RCU_KTHREAD_RUNNING;
raw_spin_lock_irqsave(&rnp->lock, flags);
mask = rnp->wakemask;
@@ -1764,7 +1760,6 @@ static int __cpuinit rcu_spawn_one_node_
raw_spin_lock_irqsave(&rnp->lock, flags);
rnp->node_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- wake_up_process(t);
sp.sched_priority = 99;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
}
@@ -1781,21 +1776,16 @@ static int __init rcu_spawn_kthreads(voi

rcu_kthreads_spawnable = 1;
for_each_possible_cpu(cpu) {
- init_waitqueue_head(&per_cpu(rcu_cpu_wq, cpu));
per_cpu(rcu_cpu_has_work, cpu) = 0;
if (cpu_online(cpu))
(void)rcu_spawn_one_cpu_kthread(cpu);
}
rnp = rcu_get_root(rcu_state);
- init_waitqueue_head(&rnp->node_wq);
- rcu_init_boost_waitqueue(rnp);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- if (NUM_RCU_NODES > 1)
- rcu_for_each_leaf_node(rcu_state, rnp) {
- init_waitqueue_head(&rnp->node_wq);
- rcu_init_boost_waitqueue(rnp);
+ if (NUM_RCU_NODES > 1) {
+ rcu_for_each_leaf_node(rcu_state, rnp)
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
- }
+ }
return 0;
}
early_initcall(rcu_spawn_kthreads);
Index: linux-2.6/kernel/rcutree.h
===================================================================
--- linux-2.6.orig/kernel/rcutree.h
+++ linux-2.6/kernel/rcutree.h
@@ -28,6 +28,17 @@
#include <linux/cpumask.h>
#include <linux/seqlock.h>

+#define rcu_wait(cond) \
+do { \
+ for (;;) { \
+ set_current_state(TASK_INTERRUPTIBLE); \
+ if (cond) \
+ break; \
+ schedule(); \
+ } \
+ __set_current_state(TASK_RUNNING); \
+} while (0)
+
/*
* Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
* In theory, it should be possible to add more levels straightforwardly.
@@ -157,9 +168,6 @@ struct rcu_node {
struct task_struct *boost_kthread_task;
/* kthread that takes care of priority */
/* boosting for this rcu_node structure. */
- wait_queue_head_t boost_wq;
- /* Wait queue on which to park the boost */
- /* kthread. */
unsigned int boost_kthread_status;
/* State of boost_kthread_task for tracing. */
unsigned long n_tasks_boosted;
@@ -186,9 +194,6 @@ struct rcu_node {
/* kthread that takes care of this rcu_node */
/* structure, for example, awakening the */
/* per-CPU kthreads as needed. */
- wait_queue_head_t node_wq;
- /* Wait queue on which to park the per-node */
- /* kthread. */
unsigned int node_kthread_status;
/* State of node_kthread_task for tracing. */
} ____cacheline_internodealigned_in_smp;
@@ -443,7 +448,6 @@ static void __cpuinit rcu_preempt_init_p
static void rcu_preempt_send_cbs_to_online(void);
static void __init __rcu_init_preempt(void);
static void rcu_needs_cpu_flush(void);
-static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp);
static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,
cpumask_var_t cm);
Index: linux-2.6/kernel/rcutree_plugin.h
===================================================================
--- linux-2.6.orig/kernel/rcutree_plugin.h
+++ linux-2.6/kernel/rcutree_plugin.h
@@ -1196,8 +1196,7 @@ static int rcu_boost_kthread(void *arg)

for (;;) {
rnp->boost_kthread_status = RCU_KTHREAD_WAITING;
- wait_event_interruptible(rnp->boost_wq, rnp->boost_tasks ||
- rnp->exp_tasks);
+ rcu_wait(rnp->boost_tasks || rnp->exp_tasks);
rnp->boost_kthread_status = RCU_KTHREAD_RUNNING;
more2boost = rcu_boost(rnp);
if (more2boost)
@@ -1275,14 +1274,6 @@ static void rcu_preempt_boost_start_gp(s
}

/*
- * Initialize the RCU-boost waitqueue.
- */
-static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
-{
- init_waitqueue_head(&rnp->boost_wq);
-}
-
-/*
* Create an RCU-boost kthread for the specified node if one does not
* already exist. We only create this kthread for preemptible RCU.
* Returns zero if all is well, a negated errno otherwise.
@@ -1306,7 +1297,6 @@ static int __cpuinit rcu_spawn_one_boost
raw_spin_lock_irqsave(&rnp->lock, flags);
rnp->boost_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
- wake_up_process(t);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
return 0;
@@ -1328,10 +1318,6 @@ static void rcu_preempt_boost_start_gp(s
{
}

-static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
-{
-}
-
static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
struct rcu_node *rnp,
int rnp_index)

2011-05-19 10:12:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

(2011/05/19 18:41), Peter Zijlstra wrote:
> On Thu, 2011-05-19 at 17:50 +0900, KOSAKI Motohiro wrote:
>> I'm sorry. Please give me a bit time.
>
> N/P, as I've already found out CPU_STARTING isn't a suitable location
> either.
>
> How about something like the below...

latest -tip + my cpuset_cpus_allowed_fallback() fix + this patch works completely!

Thanks.

2011-05-19 11:41:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

On Thu, 2011-05-19 at 19:12 +0900, KOSAKI Motohiro wrote:
> latest -tip + my cpuset_cpus_allowed_fallback() fix + this patch works
> completely!

Great, let me push these two patches Ingo-wards.

Thanks!

2011-05-20 22:47:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] rcu: don't bind offline cpu

On Thu, May 19, 2011 at 11:41:53AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-05-19 at 17:50 +0900, KOSAKI Motohiro wrote:
> > I'm sorry. Please give me a bit time.
>
> N/P, as I've already found out CPU_STARTING isn't a suitable location
> either.
>
> How about something like the below...
>
>
> ---
> Subject: rcu: Remove waitqueue usage for cpu,node and boost kthreads
> From: Peter Zijlstra <[email protected]>
> Date: Thu May 19 11:27:39 CEST 2011
>
> Since wait-queue usage is inconsistent (node) and generally not needed
> since we know exactly which thread to wake, rip it out.
>
> The problem is that wake_up() only issues an actual wake-up when there's
> a thread waiting on the queue which means we need an extra, explicit,
> wakeup to kick-start the kthread itself (they're born sleeping).
>
> Now, since we already know which exact thread we want to wake up,
> there's no point really in enqueuing it on a waitqueue. So provide Paul
> with a nice rcu_wait() macro to replace wait_event_interruptible() with,
> and rip out all the waitqueue stuff.
>
> This moves us to an explicit, unconditional, wake_up_process() which can
> be used to wake us from both conditions.

Hello, Peter,

Thank you! I have queued this and will test it.

I am having some difficulty understanding the synchronization in the
rcu_node_kthread() case, where I want to hold the rcu_node structure's
->lock over the "if (cond) break;" in rcu_wait(), but cannot prove that
it is really necessary.

Will look it over again after the jetlag has worn off a bit.
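
(Illustrative aside only, not something either side proposed in this
thread: if the condition really did have to be sampled under the
rcu_node ->lock, a lock-aware variant of the macro could look roughly
like this. The lock is dropped before schedule(), and TASK_INTERRUPTIBLE
is set before the condition is tested, so a waker that sets the
condition under the same lock and then calls wake_up_process() cannot be
missed.)

#define rcu_wait_locked(cond, lockp, flags)                             \
do {                                                                    \
        for (;;) {                                                      \
                set_current_state(TASK_INTERRUPTIBLE);                  \
                raw_spin_lock_irqsave(lockp, flags);                    \
                if (cond) {                                             \
                        raw_spin_unlock_irqrestore(lockp, flags);       \
                        break;                                          \
                }                                                       \
                raw_spin_unlock_irqrestore(lockp, flags);               \
                schedule();                                             \
        }                                                               \
        __set_current_state(TASK_RUNNING);                              \
} while (0)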

Thanx, Paul

> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> Index: linux-2.6/kernel/rcutree.c
> ===================================================================
> --- linux-2.6.orig/kernel/rcutree.c
> +++ linux-2.6/kernel/rcutree.c
> @@ -94,7 +94,6 @@ static DEFINE_PER_CPU(struct task_struct
> DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_status);
> DEFINE_PER_CPU(int, rcu_cpu_kthread_cpu);
> DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops);
> -static DEFINE_PER_CPU(wait_queue_head_t, rcu_cpu_wq);
> DEFINE_PER_CPU(char, rcu_cpu_has_work);
> static char rcu_kthreads_spawnable;
>
> @@ -1475,7 +1474,7 @@ static void invoke_rcu_cpu_kthread(void)
> local_irq_restore(flags);
> return;
> }
> - wake_up(&__get_cpu_var(rcu_cpu_wq));
> + wake_up_process(__this_cpu_read(rcu_cpu_kthread_task));
> local_irq_restore(flags);
> }
>
> @@ -1598,14 +1597,12 @@ static int rcu_cpu_kthread(void *arg)
> unsigned long flags;
> int spincnt = 0;
> unsigned int *statusp = &per_cpu(rcu_cpu_kthread_status, cpu);
> - wait_queue_head_t *wqp = &per_cpu(rcu_cpu_wq, cpu);
> char work;
> char *workp = &per_cpu(rcu_cpu_has_work, cpu);
>
> for (;;) {
> *statusp = RCU_KTHREAD_WAITING;
> - wait_event_interruptible(*wqp,
> - *workp != 0 || kthread_should_stop());
> + rcu_wait(*workp != 0 || kthread_should_stop());
> local_bh_disable();
> if (rcu_cpu_kthread_should_stop(cpu)) {
> local_bh_enable();
> @@ -1656,7 +1653,6 @@ static int __cpuinit rcu_spawn_one_cpu_k
> per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
> WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
> per_cpu(rcu_cpu_kthread_task, cpu) = t;
> - wake_up_process(t);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> return 0;
> @@ -1679,7 +1675,7 @@ static int rcu_node_kthread(void *arg)
>
> for (;;) {
> rnp->node_kthread_status = RCU_KTHREAD_WAITING;
> - wait_event_interruptible(rnp->node_wq, rnp->wakemask != 0);
> + rcu_wait(rnp->wakemask != 0);
> rnp->node_kthread_status = RCU_KTHREAD_RUNNING;
> raw_spin_lock_irqsave(&rnp->lock, flags);
> mask = rnp->wakemask;
> @@ -1764,7 +1760,6 @@ static int __cpuinit rcu_spawn_one_node_
> raw_spin_lock_irqsave(&rnp->lock, flags);
> rnp->node_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - wake_up_process(t);
> sp.sched_priority = 99;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> }
> @@ -1781,21 +1776,16 @@ static int __init rcu_spawn_kthreads(voi
>
> rcu_kthreads_spawnable = 1;
> for_each_possible_cpu(cpu) {
> - init_waitqueue_head(&per_cpu(rcu_cpu_wq, cpu));
> per_cpu(rcu_cpu_has_work, cpu) = 0;
> if (cpu_online(cpu))
> (void)rcu_spawn_one_cpu_kthread(cpu);
> }
> rnp = rcu_get_root(rcu_state);
> - init_waitqueue_head(&rnp->node_wq);
> - rcu_init_boost_waitqueue(rnp);
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> - if (NUM_RCU_NODES > 1)
> - rcu_for_each_leaf_node(rcu_state, rnp) {
> - init_waitqueue_head(&rnp->node_wq);
> - rcu_init_boost_waitqueue(rnp);
> + if (NUM_RCU_NODES > 1) {
> + rcu_for_each_leaf_node(rcu_state, rnp)
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> - }
> + }
> return 0;
> }
> early_initcall(rcu_spawn_kthreads);
> Index: linux-2.6/kernel/rcutree.h
> ===================================================================
> --- linux-2.6.orig/kernel/rcutree.h
> +++ linux-2.6/kernel/rcutree.h
> @@ -28,6 +28,17 @@
> #include <linux/cpumask.h>
> #include <linux/seqlock.h>
>
> +#define rcu_wait(cond) \
> +do { \
> + for (;;) { \
> + set_current_state(TASK_INTERRUPTIBLE); \
> + if (cond) \
> + break; \
> + schedule(); \
> + } \
> + __set_current_state(TASK_RUNNING); \
> +} while (0)
> +
> /*
> * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT.
> * In theory, it should be possible to add more levels straightforwardly.
> @@ -157,9 +168,6 @@ struct rcu_node {
> struct task_struct *boost_kthread_task;
> /* kthread that takes care of priority */
> /* boosting for this rcu_node structure. */
> - wait_queue_head_t boost_wq;
> - /* Wait queue on which to park the boost */
> - /* kthread. */
> unsigned int boost_kthread_status;
> /* State of boost_kthread_task for tracing. */
> unsigned long n_tasks_boosted;
> @@ -186,9 +194,6 @@ struct rcu_node {
> /* kthread that takes care of this rcu_node */
> /* structure, for example, awakening the */
> /* per-CPU kthreads as needed. */
> - wait_queue_head_t node_wq;
> - /* Wait queue on which to park the per-node */
> - /* kthread. */
> unsigned int node_kthread_status;
> /* State of node_kthread_task for tracing. */
> } ____cacheline_internodealigned_in_smp;
> @@ -443,7 +448,6 @@ static void __cpuinit rcu_preempt_init_p
> static void rcu_preempt_send_cbs_to_online(void);
> static void __init __rcu_init_preempt(void);
> static void rcu_needs_cpu_flush(void);
> -static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp);
> static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
> static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,
> cpumask_var_t cm);
> Index: linux-2.6/kernel/rcutree_plugin.h
> ===================================================================
> --- linux-2.6.orig/kernel/rcutree_plugin.h
> +++ linux-2.6/kernel/rcutree_plugin.h
> @@ -1196,8 +1196,7 @@ static int rcu_boost_kthread(void *arg)
>
> for (;;) {
> rnp->boost_kthread_status = RCU_KTHREAD_WAITING;
> - wait_event_interruptible(rnp->boost_wq, rnp->boost_tasks ||
> - rnp->exp_tasks);
> + rcu_wait(rnp->boost_tasks || rnp->exp_tasks);
> rnp->boost_kthread_status = RCU_KTHREAD_RUNNING;
> more2boost = rcu_boost(rnp);
> if (more2boost)
> @@ -1275,14 +1274,6 @@ static void rcu_preempt_boost_start_gp(s
> }
>
> /*
> - * Initialize the RCU-boost waitqueue.
> - */
> -static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
> -{
> - init_waitqueue_head(&rnp->boost_wq);
> -}
> -
> -/*
> * Create an RCU-boost kthread for the specified node if one does not
> * already exist. We only create this kthread for preemptible RCU.
> * Returns zero if all is well, a negated errno otherwise.
> @@ -1306,7 +1297,6 @@ static int __cpuinit rcu_spawn_one_boost
> raw_spin_lock_irqsave(&rnp->lock, flags);
> rnp->boost_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> - wake_up_process(t);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> return 0;
> @@ -1328,10 +1318,6 @@ static void rcu_preempt_boost_start_gp(s
> {
> }
>
> -static void __init rcu_init_boost_waitqueue(struct rcu_node *rnp)
> -{
> -}
> -
> static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> struct rcu_node *rnp,
> int rnp_index)
>

2011-05-28 16:36:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: [tip:sched/urgent] cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

Commit-ID: 1e1b6c511d1b23cb7c3b619d82fc7bd9f620565d
Gitweb: http://git.kernel.org/tip/1e1b6c511d1b23cb7c3b619d82fc7bd9f620565d
Author: KOSAKI Motohiro <[email protected]>
AuthorDate: Thu, 19 May 2011 15:08:58 +0900
Committer: Ingo Molnar <[email protected]>
CommitDate: Sat, 28 May 2011 17:02:57 +0200

cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

The rule is that we have to update tsk->rt.nr_cpus_allowed whenever we
change tsk->cpus_allowed. Otherwise the RT scheduler may get confused.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/cpuset.h | 2 +-
include/linux/sched.h | 7 +++++++
kernel/cpuset.c | 4 ++--
kernel/kthread.c | 4 ++--
kernel/sched.c | 19 ++++++++++++-------
5 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index f20eb8f..e9eaec5 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -146,7 +146,7 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,

static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
- cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+ do_set_cpus_allowed(p, cpu_possible_mask);
return cpumask_any(cpu_active_mask);
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc88712..8da84b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1841,9 +1841,16 @@ static inline void rcu_copy_process(struct task_struct *p)
#endif

#ifdef CONFIG_SMP
+extern void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask);
+
extern int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask);
#else
+static inline void do_set_cpus_allowed(struct task_struct *p,
+ const struct cpumask *new_mask)
+{
+}
static inline int set_cpus_allowed_ptr(struct task_struct *p,
const struct cpumask *new_mask)
{
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1ceeb04..9c9b754 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2190,7 +2190,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
rcu_read_lock();
cs = task_cs(tsk);
if (cs)
- cpumask_copy(&tsk->cpus_allowed, cs->cpus_allowed);
+ do_set_cpus_allowed(tsk, cs->cpus_allowed);
rcu_read_unlock();

/*
@@ -2217,7 +2217,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct *tsk)
* Like above we can temporary set any mask and rely on
* set_cpus_allowed_ptr() as synchronization point.
*/
- cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
+ do_set_cpus_allowed(tsk, cpu_possible_mask);
cpu = cpumask_any(cpu_active_mask);
}

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3b34d27..4ba7ccc 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -202,8 +202,8 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
return;
}

- p->cpus_allowed = cpumask_of_cpu(cpu);
- p->rt.nr_cpus_allowed = 1;
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);
diff --git a/kernel/sched.c b/kernel/sched.c
index a80ee91..cbb3a0e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5860,7 +5860,7 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();

- cpumask_copy(&idle->cpus_allowed, cpumask_of(cpu));
+ do_set_cpus_allowed(idle, cpumask_of(cpu));
/*
* We're having a chicken and egg problem, even though we are
* holding rq->lock, the cpu isn't yet set to this cpu so the
@@ -5948,6 +5948,16 @@ static inline void sched_init_granularity(void)
}

#ifdef CONFIG_SMP
+void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
+{
+ if (p->sched_class && p->sched_class->set_cpus_allowed)
+ p->sched_class->set_cpus_allowed(p, new_mask);
+ else {
+ cpumask_copy(&p->cpus_allowed, new_mask);
+ p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
+ }
+}
+
/*
* This is how migration works:
*
@@ -5993,12 +6003,7 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
goto out;
}

- if (p->sched_class->set_cpus_allowed)
- p->sched_class->set_cpus_allowed(p, new_mask);
- else {
- cpumask_copy(&p->cpus_allowed, new_mask);
- p->rt.nr_cpus_allowed = cpumask_weight(new_mask);
- }
+ do_set_cpus_allowed(p, new_mask);

/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), new_mask))

2011-06-20 10:21:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/urgent] cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

On Sat, 2011-05-28 at 16:35 +0000, tip-bot for KOSAKI Motohiro wrote:
> +++ b/kernel/kthread.c
> @@ -202,8 +202,8 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> return;
> }
>
> - p->cpus_allowed = cpumask_of_cpu(cpu);
> - p->rt.nr_cpus_allowed = 1;
> + /* It's safe because the task is inactive. */
> + do_set_cpus_allowed(p, cpumask_of(cpu));
> p->flags |= PF_THREAD_BOUND;
> }


I just happened to be staring at this stuff again, and I'm wondering
how and why this is correct. After kthread_create() the thread exists
and is exposed in the pid-hash, therefore userspace can come and do
sys_sched_setaffinity() on it, and since we're not holding any locks and
set PF_THREAD_BOUND _after_ setting cpus_allowed, things can end up
funny.

Hmm?

2011-06-21 09:55:04

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [tip:sched/urgent] cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed

(2011/06/20 19:20), Peter Zijlstra wrote:
> On Sat, 2011-05-28 at 16:35 +0000, tip-bot for KOSAKI Motohiro wrote:
>> +++ b/kernel/kthread.c
>> @@ -202,8 +202,8 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
>> return;
>> }
>>
>> - p->cpus_allowed = cpumask_of_cpu(cpu);
>> - p->rt.nr_cpus_allowed = 1;
>> + /* It's safe because the task is inactive. */
>> + do_set_cpus_allowed(p, cpumask_of(cpu));
>> p->flags |= PF_THREAD_BOUND;
>> }
>
>
> I just happened to be staring at this stuff again, and I'm wondering
> how and why this is correct. After kthread_create() the thread exists
> and is exposed in the pid-hash, therefore userspace can come and do
> sys_sched_setaffinity() on it, and since we're not holding any locks and
> set PF_THREAD_BOUND _after_ setting cpus_allowed, things can end up
> funny.
>
> Hmm?

Can't we just take either the rq lock or pi_lock? Or would that be a layering violation?



>From 1c0874b9157f47e22d0f6499612f0f78b830f018 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 21 Jun 2011 18:15:19 +0900
Subject: [PATCH] kthread: fix kthread_bind() race

Peter Zijlstra reported that kthread_bind() has a race. It doesn't hold
any locks and sets PF_THREAD_BOUND _after_ setting cpus_allowed, so the
following race can happen:

	CPU0					CPU1
	----------------------------------------------------
	do_set_cpus_allowed()
						sys_sched_setaffinity()
	p->flags |= PF_THREAD_BOUND;

The solution is to take either the rq lock or pi_lock; either one
prevents the race because set_cpus_allowed_ptr() takes both locks.
This patch chooses the latter.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
kernel/kthread.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4ba7ccc..92e3083 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -196,15 +196,20 @@ EXPORT_SYMBOL(kthread_create_on_node);
*/
void kthread_bind(struct task_struct *p, unsigned int cpu)
{
+ unsigned long flags;
+
/* Must have done schedule() in kthread() before we set_task_cpu */
if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
WARN_ON(1);
return;
}

+ /* protect from a race against set_cpus_allowed_ptr() */
+ raw_spin_lock_irqsave(&p->pi_lock, flags);
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
+ raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}
EXPORT_SYMBOL(kthread_bind);

--
1.7.3.1