2011-05-30 06:10:05

by Damien Wyart

Subject: Very high CPU load when idle with 3.0-rc1

Hi,

Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
when idle. No process is shown running or in "D state" in htop. The box
is behaving normally, with no impression of lag or slowness.

Not sure what other info to include, I guess this should be quite easy
to reproduce.

Best,
--
Damien Wyart


2011-05-30 11:35:45

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 07:59 +0200, Damien Wyart wrote:
> Hi,
>
> Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> when idle. No process is shown running or in "D state" in htop. The box
> is behaving normally, with no impression of lag or slowness.
>
> Not sure what other info to include, I guess this should be quite easy
> to reproduce.


---
Subject: rcu: Cure load woes

Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
state) fudges a sleeping task's state, resulting in the scheduler seeing
a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
task waking up. The result is an unbalanced load calculation.

The problem that patch tried to address is that the RCU threads could
stay in UNINTERRUPTIBLE state for quite a while and trigger the hung
task detector, since they are only woken on demand.

Cure the problem differently by always giving the tasks at least one
wake-up once the CPU is fully up and running; this will kick them out of
the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
wait state.

The alternative would be teaching kthread_create() to start threads as
INTERRUPTIBLE, but that needs a tad more thought.

Reported-by: Damien Wyart <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/rcutree.c | 54 ++++++++++++++++++++++++++++++++++++++++-------
kernel/rcutree_plugin.h | 11 ++++++++-
2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 77a7671..89419ff 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1648,7 +1648,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
if (IS_ERR(t))
return PTR_ERR(t);
kthread_bind(t, cpu);
- set_task_state(t, TASK_INTERRUPTIBLE);
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
@@ -1756,7 +1755,6 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
if (IS_ERR(t))
return PTR_ERR(t);
raw_spin_lock_irqsave(&rnp->lock, flags);
- set_task_state(t, TASK_INTERRUPTIBLE);
rnp->node_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = 99;
@@ -1765,6 +1763,8 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

+static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
+
/*
* Spawn all kthreads -- called as soon as the scheduler is running.
*/
@@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
{
int cpu;
struct rcu_node *rnp;
+ struct task_struct *t;

rcu_kthreads_spawnable = 1;
for_each_possible_cpu(cpu) {
per_cpu(rcu_cpu_has_work, cpu) = 0;
- if (cpu_online(cpu))
+ if (cpu_online(cpu)) {
(void)rcu_spawn_one_cpu_kthread(cpu);
+ t = per_cpu(rcu_cpu_kthread_task, cpu);
+ if (t)
+ wake_up_process(t);
+ }
}
rnp = rcu_get_root(rcu_state);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ if (rnp->node_kthread_task)
+ wake_up_process(rnp->node_kthread_task);
if (NUM_RCU_NODES > 1) {
- rcu_for_each_leaf_node(rcu_state, rnp)
+ rcu_for_each_leaf_node(rcu_state, rnp) {
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ t = rnp->node_kthread_task;
+ if (t)
+ wake_up_process(t);
+ rcu_wake_one_boost_kthread(rnp);
+ }
}
return 0;
}
@@ -2188,14 +2200,14 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
}

-static void __cpuinit rcu_online_cpu(int cpu)
+static void __cpuinit rcu_prepare_cpu(int cpu)
{
rcu_init_percpu_data(cpu, &rcu_sched_state, 0);
rcu_init_percpu_data(cpu, &rcu_bh_state, 0);
rcu_preempt_init_percpu_data(cpu);
}

-static void __cpuinit rcu_online_kthreads(int cpu)
+static void __cpuinit rcu_prepare_kthreads(int cpu)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
struct rcu_node *rnp = rdp->mynode;
@@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
}

/*
+ * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
+ * but the RCU threads are woken on demand, and if demand is low this
+ * could be a while triggering the hung task watchdog.
+ *
+ * In order to avoid this, poke all tasks once the CPU is fully
+ * up and running.
+ */
+static void __cpuinit rcu_online_kthreads(int cpu)
+{
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode;
+ struct task_struct *t;
+
+ t = per_cpu(rcu_cpu_kthread_task, cpu);
+ if (t)
+ wake_up_process(t);
+
+ t = rnp->node_kthread_task;
+ if (t)
+ wake_up_process(t);
+
+ rcu_wake_one_boost_kthread(rnp);
+}
+
+/*
* Handle CPU online/offline notification events.
*/
static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
@@ -2221,10 +2258,11 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- rcu_online_cpu(cpu);
- rcu_online_kthreads(cpu);
+ rcu_prepare_cpu(cpu);
+ rcu_prepare_kthreads(cpu);
break;
case CPU_ONLINE:
+ rcu_online_kthreads(cpu);
case CPU_DOWN_FAILED:
rcu_node_kthread_setaffinity(rnp, -1);
rcu_cpu_kthread_setrt(cpu, 1);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a767b7d..2910de7 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1295,7 +1295,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
if (IS_ERR(t))
return PTR_ERR(t);
raw_spin_lock_irqsave(&rnp->lock, flags);
- set_task_state(t, TASK_INTERRUPTIBLE);
rnp->boost_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = RCU_KTHREAD_PRIO;
@@ -1303,6 +1302,12 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

+static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
+{
+ if (rnp->boost_kthread_task)
+ wake_up_process(rnp->boost_thread_task);
+}
+
#else /* #ifdef CONFIG_RCU_BOOST */

static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
@@ -1326,6 +1331,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

+static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
+{
+}
+
#endif /* #else #ifdef CONFIG_RCU_BOOST */

#ifndef CONFIG_SMP
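
To make the accounting the changelog describes concrete: the scheduler
counts a task that goes to sleep in TASK_UNINTERRUPTIBLE toward the
load average, and takes it back out of that count when a
TASK_UNINTERRUPTIBLE task wakes up. The toy program below (an
illustration of that bookkeeping only, not actual kernel code; the
names mirror the kernel's, everything else is simplified) shows how
flipping a sleeping task's state behind the scheduler's back leaks one
unit of load per kthread. With one rcuc thread per CPU on a 4-core/HT
box, plus the node and boost threads, that adds up to roughly the
reported 9-10.

#include <stdio.h>

/*
 * Toy model of the load-accounting imbalance: increment on a
 * TASK_UNINTERRUPTIBLE sleep, decrement on a TASK_UNINTERRUPTIBLE
 * wakeup. Flip the state in between and the decrement is skipped.
 */
enum state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

struct task { enum state state; };

static long nr_uninterruptible;         /* feeds the load average */

static void sleep_in(struct task *t, enum state s)
{
        t->state = s;
        if (s == TASK_UNINTERRUPTIBLE)
                nr_uninterruptible++;
}

static void wake(struct task *t)
{
        if (t->state == TASK_UNINTERRUPTIBLE)
                nr_uninterruptible--;
        t->state = TASK_RUNNING;
}

int main(void)
{
        struct task rcuc = { TASK_RUNNING };

        sleep_in(&rcuc, TASK_UNINTERRUPTIBLE); /* kthread parks itself */
        rcuc.state = TASK_INTERRUPTIBLE;       /* the fudged set_task_state() */
        wake(&rcuc);                           /* decrement is skipped */

        printf("leaked load contribution: %ld\n", nr_uninterruptible);
        return 0;
}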

2011-05-30 11:50:47

by Damien Wyart

Subject: Re: Very high CPU load when idle with 3.0-rc1

> Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> when idle. No process is shown running or in "D state" in htop. The box
> is behaving normally, with no impression of lag or slowness.

> Not sure what other info to include, I guess this should be quite easy
> to reproduce.

Here is the .config in case it is needed for further testing...

--
Damien Wyart


Attachment: config.2640 (52.42 kB)

2011-05-30 12:18:09

by Ingo Molnar

Subject: Re: Very high CPU load when idle with 3.0-rc1


* Peter Zijlstra <[email protected]> wrote:

> The problem that patch tried to address is that the RCU threads
> could stay in UNINTERRUPTIBLE state for quite a while and trigger
> the hung task detector, since they are only woken on demand.

and it's not just the hung task detector: obviously the load average
itself would be skewed in this original bug case as well.

Thanks,

Ingo
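
The skew is easy to see in the load-average math itself: the avenrun
figures are exponentially damped moving averages of the number of
"active" (runnable plus uninterruptible) tasks. The toy recomputation
below (an illustration using the classic 1-minute formula and 5-second
sample period; not the kernel's code, and the task count is taken from
the report above) shows nine permanently leaked "active" tasks driving
an idle box's 1-minute load toward 9.0:

#include <math.h>
#include <stdio.h>

/* Toy avenrun update: load = load * e + active * (1 - e), sampled
 * every 5 seconds with a 1-minute damping window. Build with -lm. */
int main(void)
{
        double load = 0.0;
        double e = exp(-5.0 / 60.0);    /* 1-minute decay per 5 s tick */
        long active = 9;                /* leaked "active" tasks */
        int i;

        for (i = 0; i < 60; i++)        /* ~5 minutes of samples */
                load = load * e + active * (1.0 - e);

        printf("1-min load average: %.2f\n", load); /* ~8.94 and rising */
        return 0;
}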

2011-05-30 12:27:58

by Morten Stevens

Subject: Re: Very high CPU load when idle with 3.0-rc1

2011/5/30 Damien Wyart <[email protected]>:
> Hi,
>
> Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> when idle. No process is shown running or in "D state" in htop. The box
> is behaving normally, with no impression of lag or slowness.
>
> Not sure what other info to include, I guess this should be quite easy
> to reproduce.

Same problem here.

# uname -r
3.0.0-rc1
# uptime
14:18:39 up 30 min, 2 users, load average: 10.00, 10.15, 9.68

Best regards,

Morten

2011-05-30 13:10:49

by Mike Galbraith

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 13:34 +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 07:59 +0200, Damien Wyart wrote:
> > Hi,
> >
> > Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> > when idle. No process is shown running or in "D state" in htop. The box
> > is behaving normally, with no impression of lag or slowness.
> >
> > Not sure what other info to include, I guess this should be quite easy
> > to reproduce.
>
>
> ---
> Subject: rcu: Cure load woes
>
> Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
> state) fudges a sleeping task's state, resulting in the scheduler seeing
> a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
> task waking up. The result is an unbalanced load calculation.

(darn, 'which' poked me in the eye, but 'how' didn't)

Yup, all better.

-Mike

2011-05-30 16:24:07

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 01:34:51PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 07:59 +0200, Damien Wyart wrote:
> > Hi,
> >
> > Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> > when idle. No process is shown running or in "D state" in htop. The box
> > is behaving normally, with no impression of lag or slowness.
> >
> > Not sure what other info to include, I guess this should be quite easy
> > to reproduce.
>
>
> ---
> Subject: rcu: Cure load woes
>
> Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
> state) fudges a sleeping task's state, resulting in the scheduler seeing
> a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
> task waking up. The result is an unbalanced load calculation.
>
> The problem that patch tried to address is that the RCU threads could
> stay in UNINTERRUPTIBLE state for quite a while and trigger the hung
> task detector, since they are only woken on demand.
>
> Cure the problem differently by always giving the tasks at least one
> wake-up once the CPU is fully up and running; this will kick them out of
> the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
> wait state.
>
> The alternative would be teaching kthread_create() to start threads as
> INTERRUPTIBLE, but that needs a tad more thought.
>
> Reported-by: Damien Wyart <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>

Very cool! I do have a few questions below, but am queuing and testing
this in the meantime.

> ---
> kernel/rcutree.c | 54 ++++++++++++++++++++++++++++++++++++++++-------
> kernel/rcutree_plugin.h | 11 ++++++++-
> 2 files changed, 56 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 77a7671..89419ff 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1648,7 +1648,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
> if (IS_ERR(t))
> return PTR_ERR(t);
> kthread_bind(t, cpu);
> - set_task_state(t, TASK_INTERRUPTIBLE);
> per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
> WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
> per_cpu(rcu_cpu_kthread_task, cpu) = t;
> @@ -1756,7 +1755,6 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
> if (IS_ERR(t))
> return PTR_ERR(t);
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - set_task_state(t, TASK_INTERRUPTIBLE);
> rnp->node_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> sp.sched_priority = 99;
> @@ -1765,6 +1763,8 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
> return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
> }
>
> +static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
> +
> /*
> * Spawn all kthreads -- called as soon as the scheduler is running.
> */
> @@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
> {
> int cpu;
> struct rcu_node *rnp;
> + struct task_struct *t;
>
> rcu_kthreads_spawnable = 1;
> for_each_possible_cpu(cpu) {
> per_cpu(rcu_cpu_has_work, cpu) = 0;
> - if (cpu_online(cpu))
> + if (cpu_online(cpu)) {
> (void)rcu_spawn_one_cpu_kthread(cpu);
> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> + if (t)
> + wake_up_process(t);
> + }

Would it be OK to simplify the code a bit by doing this initial wakeup
in rcu_spawn_one_cpu_kthread() itself? My thought would be to rearrange
rcu_spawn_one_cpu_kthread() as follows:

static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
struct sched_param sp;
struct task_struct *t;

if (!rcu_kthreads_spawnable ||
per_cpu(rcu_cpu_kthread_task, cpu) != NULL)
return 0;
t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
if (IS_ERR(t))
return PTR_ERR(t);
kthread_bind(t, cpu);
set_task_state(t, TASK_INTERRUPTIBLE);
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
sp.sched_priority = RCU_KTHREAD_PRIO;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
wake_up_process(t);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
return 0;
}

> }
> rnp = rcu_get_root(rcu_state);
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> + if (rnp->node_kthread_task)
> + wake_up_process(rnp->node_kthread_task);

Ditto here -- can this wake_up_process() be pushed into
rcu_spawn_one_node_kthread()?

> if (NUM_RCU_NODES > 1) {
> - rcu_for_each_leaf_node(rcu_state, rnp)
> + rcu_for_each_leaf_node(rcu_state, rnp) {
> (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> + t = rnp->node_kthread_task;
> + if (t)
> + wake_up_process(t);
> + rcu_wake_one_boost_kthread(rnp);
> + }

Analogous question here for rcu_wake_one_boost_kthread being eliminated
in favor of doing the wake_up_process() in rcu_spawn_one_node_kthread().

> }
> return 0;
> }
> @@ -2188,14 +2200,14 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
> raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> }
>
> -static void __cpuinit rcu_online_cpu(int cpu)
> +static void __cpuinit rcu_prepare_cpu(int cpu)
> {
> rcu_init_percpu_data(cpu, &rcu_sched_state, 0);
> rcu_init_percpu_data(cpu, &rcu_bh_state, 0);
> rcu_preempt_init_percpu_data(cpu);
> }
>
> -static void __cpuinit rcu_online_kthreads(int cpu)
> +static void __cpuinit rcu_prepare_kthreads(int cpu)

Indeed, this naming is much better than mine. ;-)

> {
> struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> struct rcu_node *rnp = rdp->mynode;
> @@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
> }
>
> /*
> + * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
> + * but the RCU threads are woken on demand, and if demand is low this
> + * could be a while triggering the hung task watchdog.
> + *
> + * In order to avoid this, poke all tasks once the CPU is fully
> + * up and running.
> + */
> +static void __cpuinit rcu_online_kthreads(int cpu)
> +{
> + struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> + struct rcu_node *rnp = rdp->mynode;
> + struct task_struct *t;
> +
> + t = per_cpu(rcu_cpu_kthread_task, cpu);
> + if (t)
> + wake_up_process(t);
> +
> + t = rnp->node_kthread_task;
> + if (t)
> + wake_up_process(t);
> +
> + rcu_wake_one_boost_kthread(rnp);

Interesting... So we are really awakening them twice, once at creation
time to get them to sleep interruptibly, and a second time when the CPU
comes online.

What does this second set of wake_up_process() calls do?

> +}
> +
> +/*
> * Handle CPU online/offline notification events.
> */
> static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> @@ -2221,10 +2258,11 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> switch (action) {
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> - rcu_online_cpu(cpu);
> - rcu_online_kthreads(cpu);
> + rcu_prepare_cpu(cpu);
> + rcu_prepare_kthreads(cpu);
> break;
> case CPU_ONLINE:
> + rcu_online_kthreads(cpu);
> case CPU_DOWN_FAILED:
> rcu_node_kthread_setaffinity(rnp, -1);
> rcu_cpu_kthread_setrt(cpu, 1);
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index a767b7d..2910de7 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -1295,7 +1295,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> if (IS_ERR(t))
> return PTR_ERR(t);
> raw_spin_lock_irqsave(&rnp->lock, flags);
> - set_task_state(t, TASK_INTERRUPTIBLE);
> rnp->boost_kthread_task = t;
> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> @@ -1303,6 +1302,12 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> return 0;
> }
>
> +static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
> +{
> + if (rnp->boost_kthread_task)
> + wake_up_process(rnp->boost_thread_task);
> +}
> +
> #else /* #ifdef CONFIG_RCU_BOOST */
>
> static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
> @@ -1326,6 +1331,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> return 0;
> }
>
> +static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
> +{
> +}
> +
> #endif /* #else #ifdef CONFIG_RCU_BOOST */
>
> #ifndef CONFIG_SMP
>

2011-05-30 16:41:45

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 09:23:54AM -0700, Paul E. McKenney wrote:
> On Mon, May 30, 2011 at 01:34:51PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-05-30 at 07:59 +0200, Damien Wyart wrote:
> > > Hi,
> > >
> > > Testing 3.0-rc1 on a core i7 (4 cores + HT), I get a load average of 9.0
> > > when idle. No process is shown running or in "D state" in htop. The box
> > > is behaving normally, with no impression of lag or slowness.
> > >
> > > Not sure what other info to include, I guess this should be quite easy
> > > to reproduce.
> >
> >
> > ---
> > Subject: rcu: Cure load woes
> >
> > Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
> > state) fudges a sleeping task's state, resulting in the scheduler seeing
> > a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
> > task waking up. The result is an unbalanced load calculation.
> >
> > The problem that patch tried to address is that the RCU threads could
> > stay in UNINTERRUPTIBLE state for quite a while and trigger the hung
> > task detector, since they are only woken on demand.
> >
> > Cure the problem differently by always giving the tasks at least one
> > wake-up once the CPU is fully up and running; this will kick them out of
> > the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
> > wait state.
> >
> > The alternative would be teaching kthread_create() to start threads as
> > INTERRUPTIBLE, but that needs a tad more thought.
> >
> > Reported-by: Damien Wyart <[email protected]>
> > Signed-off-by: Peter Zijlstra <[email protected]>
>
> Very cool! I do have a few questions below, but am queuing and testing
> this in the meantime.
>
> > ---
> > kernel/rcutree.c | 54 ++++++++++++++++++++++++++++++++++++++++-------
> > kernel/rcutree_plugin.h | 11 ++++++++-
> > 2 files changed, 56 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 77a7671..89419ff 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1648,7 +1648,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
> > if (IS_ERR(t))
> > return PTR_ERR(t);
> > kthread_bind(t, cpu);
> > - set_task_state(t, TASK_INTERRUPTIBLE);
> > per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
> > WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
> > per_cpu(rcu_cpu_kthread_task, cpu) = t;
> > @@ -1756,7 +1755,6 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
> > if (IS_ERR(t))
> > return PTR_ERR(t);
> > raw_spin_lock_irqsave(&rnp->lock, flags);
> > - set_task_state(t, TASK_INTERRUPTIBLE);
> > rnp->node_kthread_task = t;
> > raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > sp.sched_priority = 99;
> > @@ -1765,6 +1763,8 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
> > return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
> > }
> >
> > +static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
> > +
> > /*
> > * Spawn all kthreads -- called as soon as the scheduler is running.
> > */
> > @@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
> > {
> > int cpu;
> > struct rcu_node *rnp;
> > + struct task_struct *t;
> >
> > rcu_kthreads_spawnable = 1;
> > for_each_possible_cpu(cpu) {
> > per_cpu(rcu_cpu_has_work, cpu) = 0;
> > - if (cpu_online(cpu))
> > + if (cpu_online(cpu)) {
> > (void)rcu_spawn_one_cpu_kthread(cpu);
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t)
> > + wake_up_process(t);
> > + }
>
> Would it be OK to simplify the code a bit by doing this initial wakeup
> in rcu_spawn_one_cpu_kthread() itself? My thought would be to rearrange
> rcu_spawn_one_cpu_kthread() as follows:
>
> static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
> {
> struct sched_param sp;
> struct task_struct *t;
>
> if (!rcu_kthreads_spawnable ||
> per_cpu(rcu_cpu_kthread_task, cpu) != NULL)
> return 0;
> t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
> if (IS_ERR(t))
> return PTR_ERR(t);
> kthread_bind(t, cpu);
> set_task_state(t, TASK_INTERRUPTIBLE);
> per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
> WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> wake_up_process(t);
> per_cpu(rcu_cpu_kthread_task, cpu) = t;
> return 0;
> }
>
> > }
> > rnp = rcu_get_root(rcu_state);
> > (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> > + if (rnp->node_kthread_task)
> > + wake_up_process(rnp->node_kthread_task);
>
> Ditto here -- can this wake_up_process() be pushed into
> rcu_spawn_one_node_kthread()?
>
> > if (NUM_RCU_NODES > 1) {
> > - rcu_for_each_leaf_node(rcu_state, rnp)
> > + rcu_for_each_leaf_node(rcu_state, rnp) {
> > (void)rcu_spawn_one_node_kthread(rcu_state, rnp);
> > + t = rnp->node_kthread_task;
> > + if (t)
> > + wake_up_process(t);
> > + rcu_wake_one_boost_kthread(rnp);
> > + }
>
> Analogous question here for rcu_wake_one_boost_kthread being eliminated
> in favor of doing the wake_up_process() in rcu_spawn_one_node_kthread().
>
> > }
> > return 0;
> > }
> > @@ -2188,14 +2200,14 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
> > raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> > }
> >
> > -static void __cpuinit rcu_online_cpu(int cpu)
> > +static void __cpuinit rcu_prepare_cpu(int cpu)
> > {
> > rcu_init_percpu_data(cpu, &rcu_sched_state, 0);
> > rcu_init_percpu_data(cpu, &rcu_bh_state, 0);
> > rcu_preempt_init_percpu_data(cpu);
> > }
> >
> > -static void __cpuinit rcu_online_kthreads(int cpu)
> > +static void __cpuinit rcu_prepare_kthreads(int cpu)
>
> Indeed, this naming is much better than mine. ;-)
>
> > {
> > struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> > struct rcu_node *rnp = rdp->mynode;
> > @@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
> > }
> >
> > /*
> > + * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
> > + * but the RCU threads are woken on demand, and if demand is low this
> > + * could be a while triggering the hung task watchdog.
> > + *
> > + * In order to avoid this, poke all tasks once the CPU is fully
> > + * up and running.
> > + */
> > +static void __cpuinit rcu_online_kthreads(int cpu)
> > +{
> > + struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> > + struct rcu_node *rnp = rdp->mynode;
> > + struct task_struct *t;
> > +
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t)
> > + wake_up_process(t);
> > +
> > + t = rnp->node_kthread_task;
> > + if (t)
> > + wake_up_process(t);
> > +
> > + rcu_wake_one_boost_kthread(rnp);
>
> Interesting... So we are really awakening them twice, once at creation
> time to get them to sleep interruptibly, and a second time when the CPU
> comes online.
>
> What does this second set of wake_up_process() calls do?
>
> > +}
> > +
> > +/*
> > * Handle CPU online/offline notification events.
> > */
> > static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> > @@ -2221,10 +2258,11 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
> > switch (action) {
> > case CPU_UP_PREPARE:
> > case CPU_UP_PREPARE_FROZEN:
> > - rcu_online_cpu(cpu);
> > - rcu_online_kthreads(cpu);
> > + rcu_prepare_cpu(cpu);
> > + rcu_prepare_kthreads(cpu);
> > break;
> > case CPU_ONLINE:
> > + rcu_online_kthreads(cpu);
> > case CPU_DOWN_FAILED:
> > rcu_node_kthread_setaffinity(rnp, -1);
> > rcu_cpu_kthread_setrt(cpu, 1);
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index a767b7d..2910de7 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -1295,7 +1295,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> > if (IS_ERR(t))
> > return PTR_ERR(t);
> > raw_spin_lock_irqsave(&rnp->lock, flags);
> > - set_task_state(t, TASK_INTERRUPTIBLE);
> > rnp->boost_kthread_task = t;
> > raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > sp.sched_priority = RCU_KTHREAD_PRIO;
> > @@ -1303,6 +1302,12 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> > return 0;
> > }
> >
> > +static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
> > +{
> > + if (rnp->boost_kthread_task)
> > + wake_up_process(rnp->boost_thread_task);

And this needs to be:

wake_up_process(rnp->boost_kthread_task);

I fixed this in my tree, continuing testing.

Thanx, Paul

> > +}
> > +
> > #else /* #ifdef CONFIG_RCU_BOOST */
> >
> > static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
> > @@ -1326,6 +1331,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
> > return 0;
> > }
> >
> > +static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
> > +{
> > +}
> > +
> > #endif /* #else #ifdef CONFIG_RCU_BOOST */
> >
> > #ifndef CONFIG_SMP
> >

2011-05-30 16:47:08

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:

> > @@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
> > {
> > int cpu;
> > struct rcu_node *rnp;
> > + struct task_struct *t;
> >
> > rcu_kthreads_spawnable = 1;
> > for_each_possible_cpu(cpu) {
> > per_cpu(rcu_cpu_has_work, cpu) = 0;
> > - if (cpu_online(cpu))
> > + if (cpu_online(cpu)) {
> > (void)rcu_spawn_one_cpu_kthread(cpu);
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t)
> > + wake_up_process(t);
> > + }
>
> Would it be OK to simplify the code a bit by doing this initial wakeup
> in rcu_spawn_one_cpu_kthread() itself? My thought would be to rearrange
> rcu_spawn_one_cpu_kthread() as follows:
>

well, no, that would get us back to waking a task affine to an offline
CPU :-)

> > @@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
> > }
> >
> > /*
> > + * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
> > + * but the RCU threads are woken on demand, and if demand is low this
> > + * could be a while triggering the hung task watchdog.
> > + *
> > + * In order to avoid this, poke all tasks once the CPU is fully
> > + * up and running.
> > + */
> > +static void __cpuinit rcu_online_kthreads(int cpu)
> > +{
> > + struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> > + struct rcu_node *rnp = rdp->mynode;
> > + struct task_struct *t;
> > +
> > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > + if (t)
> > + wake_up_process(t);
> > +
> > + t = rnp->node_kthread_task;
> > + if (t)
> > + wake_up_process(t);
> > +
> > + rcu_wake_one_boost_kthread(rnp);
>
> Interesting... So we are really awakening them twice, once at creation
> time to get them to sleep interruptibly, and a second time when the CPU
> comes online.
>
> What does this second set of wake_up_process() calls do?

Ah, not so: the initial one is conditional on cpu_online() and will
fail in the CPU_UP_PREPARE case; this new function will be run from
CPU_ONLINE to then issue the first wakeup.

The distinction is between initializing while the CPUs are already
running and hotplug.
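
To make the two paths concrete, here is a condensed toy of the control
flow (an illustration only, not kernel code; the comments map each
step back to the patch above):

#include <stdbool.h>
#include <stdio.h>

/* Boot: rcu_spawn_kthreads() runs with the CPUs already online, so the
 * first wakeup follows the spawn immediately. Hotplug: the spawn runs
 * at CPU_UP_PREPARE with the CPU still offline (no wakeup allowed),
 * and rcu_online_kthreads() issues the first wakeup at CPU_ONLINE. */
static bool spawned, woken;

static void spawn_kthread(bool cpu_is_online)
{
        spawned = true;         /* kthread_create() + kthread_bind() */
        if (cpu_is_online)
                woken = true;   /* boot path: wake right away */
}

static void cpu_online_event(void)
{
        if (spawned && !woken)
                woken = true;   /* hotplug path: first wakeup here */
}

int main(void)
{
        spawn_kthread(true);    /* boot-time initialization */
        printf("boot:    woken=%d\n", woken);

        spawned = woken = false;
        spawn_kthread(false);   /* CPU_UP_PREPARE */
        cpu_online_event();     /* CPU_ONLINE */
        printf("hotplug: woken=%d\n", woken);
        return 0;
}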

2011-05-30 16:48:13

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 09:41 -0700, Paul E. McKenney wrote:
> > > +static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
> > > +{
> > > + if (rnp->boost_kthread_task)
> > > + wake_up_process(rnp->boost_thread_task);
>
> And this needs to be:
>
> wake_up_process(rnp->boost_kthread_task);
>
> I fixed this in my tree, continuing testing.

d'oh yes.

2011-05-30 17:16:22

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:
> sp.sched_priority = RCU_KTHREAD_PRIO;
> sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);

Why are those things RT tasks anyway? The old ksoftirq runs as a regular
task. And once you start boosting things you can boost this into FIFO as
well...

just wondering..

2011-05-30 21:28:40

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 07:19:49PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:
> > sp.sched_priority = RCU_KTHREAD_PRIO;
> > sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
>
> Why are those things RT tasks anyway? The old ksoftirq runs as a regular
> task. And once you start boosting things you can boost this into FIFO as
> well...
>
> just wondering..

Because priority boosting doesn't help unless the callbacks also run
at RT priority.

I could make it so that they ran as normal tasks if !RCU_BOOST, but
they would still need to run as RT tasks for RCU_BOOST. I figured
running them the same way in both cases would be simpler.

Thanx, Paul

2011-05-30 21:30:04

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 06:46:21PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:
>
> > > @@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
> > > {
> > > int cpu;
> > > struct rcu_node *rnp;
> > > + struct task_struct *t;
> > >
> > > rcu_kthreads_spawnable = 1;
> > > for_each_possible_cpu(cpu) {
> > > per_cpu(rcu_cpu_has_work, cpu) = 0;
> > > - if (cpu_online(cpu))
> > > + if (cpu_online(cpu)) {
> > > (void)rcu_spawn_one_cpu_kthread(cpu);
> > > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > > + if (t)
> > > + wake_up_process(t);
> > > + }
> >
> > Would it be OK to simplify the code a bit by doing this initial wakeup
> > in rcu_spawn_one_cpu_kthread() itself? My thought would be to rearrange
> > rcu_spawn_one_cpu_kthread() as follows:
>
> well, no, that would get us back to waking a task affine to an offline
> CPU :-)

My turn to say d'oh, then!

But I should be able to move them back in under "if (cpu_online(cpu))",
right?

> > > @@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
> > > }
> > >
> > > /*
> > > + * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
> > > + * but the RCU threads are woken on demand, and if demand is low this
> > > + * could be a while triggering the hung task watchdog.
> > > + *
> > > + * In order to avoid this, poke all tasks once the CPU is fully
> > > + * up and running.
> > > + */
> > > +static void __cpuinit rcu_online_kthreads(int cpu)
> > > +{
> > > + struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
> > > + struct rcu_node *rnp = rdp->mynode;
> > > + struct task_struct *t;
> > > +
> > > + t = per_cpu(rcu_cpu_kthread_task, cpu);
> > > + if (t)
> > > + wake_up_process(t);
> > > +
> > > + t = rnp->node_kthread_task;
> > > + if (t)
> > > + wake_up_process(t);
> > > +
> > > + rcu_wake_one_boost_kthread(rnp);
> >
> > Interesting... So we are really awakening them twice, once at creation
> > time to get them to sleep interruptibly, and a second time when the CPU
> > comes online.
> >
> > What does this second set of wake_up_process() calls do?
>
> Ah, not so: the initial one is conditional on cpu_online() and will
> fail in the CPU_UP_PREPARE case; this new function will be run from
> CPU_ONLINE to then issue the first wakeup.
>
> The distinction is between initializing while the CPUs are already
> running and hotplug.

OK, got it.

Thanx, Paul

2011-05-30 21:34:25

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 14:28 -0700, Paul E. McKenney wrote:
> On Mon, May 30, 2011 at 07:19:49PM +0200, Peter Zijlstra wrote:
> > On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:
> > > sp.sched_priority = RCU_KTHREAD_PRIO;
> > > sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> >
> > Why are those things RT tasks anyway? The old ksoftirq runs as a regular
> > task. And once you start boosting things you can boost this into FIFO as
> > well...
> >
> > just wondering..
>
> Because priority boosting doesn't help unless the callbacks also run
> at RT priority.
>
> I could make it so that they ran as normal tasks if !RCU_BOOST, but
> they would still need to run as RT tasks for RCU_BOOST. I figured
> running them the same way in both cases would be simpler.

Ah, I thought you'd boost the threads along with the waiters, to the
same prio so that they wouldn't disturb higher priority tasks for no
reason.

2011-05-30 21:36:16

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 14:29 -0700, Paul E. McKenney wrote:
> But I should be able to move them back in under "if (cpu_online(cpu))",
> right?

Yeah, you should be able to put the wakeup in the spawn function when
made conditional on cpu_online(), although I'm not quite sure how to do
that for the rnp case, but I guess you'll have a better grasp on that.
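
For the per-CPU kthread that could look like the sketch below, which
is just the rearrangement proposed earlier in this thread with the
cpu_online() check folded in (hypothetical code for discussion, not
what was merged):

static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
{
        struct sched_param sp;
        struct task_struct *t;

        if (!rcu_kthreads_spawnable ||
            per_cpu(rcu_cpu_kthread_task, cpu) != NULL)
                return 0;
        t = kthread_create(rcu_cpu_kthread, (void *)(long)cpu, "rcuc%d", cpu);
        if (IS_ERR(t))
                return PTR_ERR(t);
        kthread_bind(t, cpu);
        per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
        WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
        per_cpu(rcu_cpu_kthread_task, cpu) = t;
        sp.sched_priority = RCU_KTHREAD_PRIO;
        sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
        /*
         * Only safe to issue the first wakeup here if the CPU is
         * already online; in the hotplug case the first wakeup must
         * still come from the CPU_ONLINE notifier.
         */
        if (cpu_online(cpu))
                wake_up_process(t);
        return 0;
}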

2011-05-31 01:45:51

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 11:33:39PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 14:28 -0700, Paul E. McKenney wrote:
> > On Mon, May 30, 2011 at 07:19:49PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-05-30 at 09:23 -0700, Paul E. McKenney wrote:
> > > > sp.sched_priority = RCU_KTHREAD_PRIO;
> > > > sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
> > >
> > > Why are those things RT tasks anyway? The old ksoftirq runs as a regular
> > > task. And once you start boosting things you can boost this into FIFO as
> > > well...
> > >
> > > just wondering..
> >
> > Because priority boosting doesn't help unless the callbacks also run
> > at RT priority.
> >
> > I could make it so that they ran as normal tasks if !RCU_BOOST, but
> > they would still need to run as RT tasks for RCU_BOOST. I figured
> > running them the same way in both cases would be simpler.
>
> Ah, I thought you'd boost the threads along with the waiters, to the
> same prio so that they wouldn't disturb higher priority tasks for no
> reason.

I considered that, but working out when it is OK to deboost them is
decidedly non-trivial.

Thanx, Paul

2011-05-31 01:46:06

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, May 30, 2011 at 11:35:28PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 14:29 -0700, Paul E. McKenney wrote:
> > But I should be able to move them back in under "if (cpu_online(cpu))",
> > right?
>
> Yeah, you should be able to put the wakeup in the spawn function when
> made conditional on cpu_online(), although I'm not quite sure how to do
> that for the rnp case, but I guess you'll have a better grasp on that.

OK, will give it a go.

Thanx, Paul

2011-05-31 12:31:37

by Peter Zijlstra

Subject: [tip:core/urgent] rcu: Cure load woes

Commit-ID: d72bce0e67e8afc6eb959f656013cbb577426f1e
Gitweb: http://git.kernel.org/tip/d72bce0e67e8afc6eb959f656013cbb577426f1e
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 30 May 2011 13:34:51 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 31 May 2011 10:01:48 +0200

rcu: Cure load woes

Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
state) fudges a sleeping task's state, resulting in the scheduler seeing
a TASK_UNINTERRUPTIBLE task going to sleep, but a TASK_INTERRUPTIBLE
task waking up. The result is an unbalanced load calculation.

The problem that patch tried to address is that the RCU threads could
stay in UNINTERRUPTIBLE state for quite a while and trigger the hung
task detector, since they are only woken on demand.

Cure the problem differently by always giving the tasks at least one
wake-up once the CPU is fully up and running; this will kick them out of
the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
wait state.

[ The alternative would be teaching kthread_create() to start threads as
INTERRUPTIBLE, but that needs a tad more thought. ]

Reported-by: Damien Wyart <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/rcutree.c | 54 ++++++++++++++++++++++++++++++++++++++++-------
kernel/rcutree_plugin.h | 11 ++++++++-
2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 77a7671..89419ff 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1648,7 +1648,6 @@ static int __cpuinit rcu_spawn_one_cpu_kthread(int cpu)
if (IS_ERR(t))
return PTR_ERR(t);
kthread_bind(t, cpu);
- set_task_state(t, TASK_INTERRUPTIBLE);
per_cpu(rcu_cpu_kthread_cpu, cpu) = cpu;
WARN_ON_ONCE(per_cpu(rcu_cpu_kthread_task, cpu) != NULL);
per_cpu(rcu_cpu_kthread_task, cpu) = t;
@@ -1756,7 +1755,6 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
if (IS_ERR(t))
return PTR_ERR(t);
raw_spin_lock_irqsave(&rnp->lock, flags);
- set_task_state(t, TASK_INTERRUPTIBLE);
rnp->node_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = 99;
@@ -1765,6 +1763,8 @@ static int __cpuinit rcu_spawn_one_node_kthread(struct rcu_state *rsp,
return rcu_spawn_one_boost_kthread(rsp, rnp, rnp_index);
}

+static void rcu_wake_one_boost_kthread(struct rcu_node *rnp);
+
/*
* Spawn all kthreads -- called as soon as the scheduler is running.
*/
@@ -1772,18 +1772,30 @@ static int __init rcu_spawn_kthreads(void)
{
int cpu;
struct rcu_node *rnp;
+ struct task_struct *t;

rcu_kthreads_spawnable = 1;
for_each_possible_cpu(cpu) {
per_cpu(rcu_cpu_has_work, cpu) = 0;
- if (cpu_online(cpu))
+ if (cpu_online(cpu)) {
(void)rcu_spawn_one_cpu_kthread(cpu);
+ t = per_cpu(rcu_cpu_kthread_task, cpu);
+ if (t)
+ wake_up_process(t);
+ }
}
rnp = rcu_get_root(rcu_state);
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ if (rnp->node_kthread_task)
+ wake_up_process(rnp->node_kthread_task);
if (NUM_RCU_NODES > 1) {
- rcu_for_each_leaf_node(rcu_state, rnp)
+ rcu_for_each_leaf_node(rcu_state, rnp) {
(void)rcu_spawn_one_node_kthread(rcu_state, rnp);
+ t = rnp->node_kthread_task;
+ if (t)
+ wake_up_process(t);
+ rcu_wake_one_boost_kthread(rnp);
+ }
}
return 0;
}
@@ -2188,14 +2200,14 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
}

-static void __cpuinit rcu_online_cpu(int cpu)
+static void __cpuinit rcu_prepare_cpu(int cpu)
{
rcu_init_percpu_data(cpu, &rcu_sched_state, 0);
rcu_init_percpu_data(cpu, &rcu_bh_state, 0);
rcu_preempt_init_percpu_data(cpu);
}

-static void __cpuinit rcu_online_kthreads(int cpu)
+static void __cpuinit rcu_prepare_kthreads(int cpu)
{
struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
struct rcu_node *rnp = rdp->mynode;
@@ -2209,6 +2221,31 @@ static void __cpuinit rcu_online_kthreads(int cpu)
}

/*
+ * kthread_create() creates threads in TASK_UNINTERRUPTIBLE state,
+ * but the RCU threads are woken on demand, and if demand is low this
+ * could be a while triggering the hung task watchdog.
+ *
+ * In order to avoid this, poke all tasks once the CPU is fully
+ * up and running.
+ */
+static void __cpuinit rcu_online_kthreads(int cpu)
+{
+ struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
+ struct rcu_node *rnp = rdp->mynode;
+ struct task_struct *t;
+
+ t = per_cpu(rcu_cpu_kthread_task, cpu);
+ if (t)
+ wake_up_process(t);
+
+ t = rnp->node_kthread_task;
+ if (t)
+ wake_up_process(t);
+
+ rcu_wake_one_boost_kthread(rnp);
+}
+
+/*
* Handle CPU online/offline notification events.
*/
static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
@@ -2221,10 +2258,11 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- rcu_online_cpu(cpu);
- rcu_online_kthreads(cpu);
+ rcu_prepare_cpu(cpu);
+ rcu_prepare_kthreads(cpu);
break;
case CPU_ONLINE:
+ rcu_online_kthreads(cpu);
case CPU_DOWN_FAILED:
rcu_node_kthread_setaffinity(rnp, -1);
rcu_cpu_kthread_setrt(cpu, 1);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a767b7d..c8bff30 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1295,7 +1295,6 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
if (IS_ERR(t))
return PTR_ERR(t);
raw_spin_lock_irqsave(&rnp->lock, flags);
- set_task_state(t, TASK_INTERRUPTIBLE);
rnp->boost_kthread_task = t;
raw_spin_unlock_irqrestore(&rnp->lock, flags);
sp.sched_priority = RCU_KTHREAD_PRIO;
@@ -1303,6 +1302,12 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

+static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
+{
+ if (rnp->boost_kthread_task)
+ wake_up_process(rnp->boost_kthread_task);
+}
+
#else /* #ifdef CONFIG_RCU_BOOST */

static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
@@ -1326,6 +1331,10 @@ static int __cpuinit rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
return 0;
}

+static void __cpuinit rcu_wake_one_boost_kthread(struct rcu_node *rnp)
+{
+}
+
#endif /* #else #ifdef CONFIG_RCU_BOOST */

#ifndef CONFIG_SMP

2011-06-01 11:17:01

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Mon, 2011-05-30 at 18:45 -0700, Paul E. McKenney wrote:
> > > Because priority boosting doesn't help unless the callbacks also run
> > > at RT priority.
> > >
> > > I could make it so that they ran as normal tasks if !RCU_BOOST, but
> > > they would still need to run as RT tasks for RCU_BOOST. I figured
> > > running them the same way in both cases would be simpler.
> >
> > Ah, I thought you'd boost the threads along with the waiters, to the
> > same prio so that they wouldn't disturb higher priority tasks for no
> > reason.
>
> I considered that, but working out when it is OK to deboost them is
> decidedly non-trivial.

Where exactly is the problem there? The boost lasts for as long as it
takes to finish the grace period, right? There's a distinct set of
callbacks associated with each grace-period, right? In which case you
can de-boost your thread the moment you're done processing that set.

Or am I simply confused about how all this is supposed to work?

2011-06-01 14:37:54

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Wed, Jun 01, 2011 at 01:05:39PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 18:45 -0700, Paul E. McKenney wrote:
> > > > Because priority boosting doesn't help unless the callbacks also run
> > > > at RT priority.
> > > >
> > > > I could make it so that they ran as normal tasks if !RCU_BOOST, but
> > > > they would still need to run as RT tasks for RCU_BOOST. I figured
> > > > running them the same way in both cases would be simpler.
> > >
> > > Ah, I thought you'd boost the threads along with the waiters, to the
> > > same prio so that they wouldn't disturb higher priority tasks for no
> > > reason.
> >
> > I considered that, but working out when it is OK to deboost them is
> > decidedly non-trivial.
>
> Where exactly is the problem there? The boost lasts for as long as it
> takes to finish the grace period, right? There's a distinct set of
> callbacks associated with each grace-period, right? In which case you
> can de-boost your thread the moment you're done processing that set.
>
> Or am I simply confused about how all this is supposed to work?

The main complications are: (1) the fact that it is hard to tell exactly
which grace period to wait for, this one or the next one, and (2) the
fact that callbacks get shuffled when CPUs go offline.

That said, it might be possible if we are willing to live with some
approximate behavior. For example, always waiting for the next grace
period (rather than the current one) to finish, and boosting through the
extra callbacks in the case where a given CPU "adopts" callbacks that must
be boosted when that CPU also has some callbacks whose priority must be
boosted and some that need not be.

The reason I am not all that excited about taking this approach is that
it doesn't help worst-case latency.

Plus the current implementation is just a less-precise approximation.
(Sorry, couldn't resist!)

Thanx, Paul

2011-06-01 16:55:09

by Peter Zijlstra

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Wed, 2011-06-01 at 07:37 -0700, Paul E. McKenney wrote:

> > > I considered that, but working out when it is OK to deboost them is
> > > decidedly non-trivial.
> >
> > Where exactly is the problem there? The boost lasts for as long as it
> > takes to finish the grace period, right? There's a distinct set of
> > callbacks associated with each grace-period, right? In which case you
> > can de-boost your thread the moment you're done processing that set.
> >
> > Or am I simply confused about how all this is supposed to work?
>
> The main complications are: (1) the fact that it is hard to tell exactly
> which grace period to wait for, this one or the next one, and (2) the
> fact that callbacks get shuffled when CPUs go offline.

I can't say I would worry too much about 2; hotplug and RT don't really
go hand-in-hand anyway.

On 1, however, is that due to the boost condition?

I must admit that my thought there is somewhat fuzzy, since I just
realized I don't actually know the exact condition to start boosting;
but suppose we boost because the queue is too large: then waiting for
the current grace period might not reduce the queue length, as most
callbacks might actually be for the next.

If, however, the condition is grace period duration, then completion of
the current grace period is sufficient, since the whole boost condition
is defined as such. [ if the next is also exceeding the time limit,
that's a whole next boost ]

> That said, it might be possible if we are willing to live with some
> approximate behavior. For example, always waiting for the next grace
> period (rather than the current one) to finish, and boosting through the
> extra callbacks in the case where a given CPU "adopts" callbacks that must
> be boosted when that CPU also has some callbacks whose priority must be
> boosted and some that need not be.

That might make sense, but I must admit to not fully understanding the
whole current/next thing yet.

> The reason I am not all that excited about taking this approach is that
> it doesn't help worst-case latency.

Well, not running at the topmost prio does help those tasks running at
a higher priority, so in that regard it does reduce the jitter for a
number of tasks.

Also, I guess there's the whole question of what prio to boost to,
which I somehow totally forgot about; that's a non-trivial thing in its
own right, since there isn't really someone blocked on grace period
completion (although in the special case of someone calling sync_rcu it
is clear).

> Plus the current implementation is just a less-precise approximation.
> (Sorry, couldn't resist!)

Appreciated. On a similar note, I still need to actually look at all this
(preempt) tree-rcu stuff to learn how exactly it works.

2011-06-01 18:19:08

by Paul E. McKenney

Subject: Re: Very high CPU load when idle with 3.0-rc1

On Wed, Jun 01, 2011 at 06:58:33PM +0200, Peter Zijlstra wrote:
> On Wed, 2011-06-01 at 07:37 -0700, Paul E. McKenney wrote:
>
> > > > I considered that, but working out when it is OK to deboost them is
> > > > decidedly non-trivial.
> > >
> > > Where exactly is the problem there? The boost lasts for as long as it
> > > takes to finish the grace period, right? There's a distinct set of
> > > callbacks associated with each grace-period, right? In which case you
> > > can de-boost your thread the moment you're done processing that set.
> > >
> > > Or am I simply confused about how all this is supposed to work?
> >
> > The main complications are: (1) the fact that it is hard to tell exactly
> > which grace period to wait for, this one or the next one, and (2) the
> > fact that callbacks get shuffled when CPUs go offline.
>
> I can't say I would worry too much about 2; hotplug and RT don't really
> go hand-in-hand anyway.

Perhaps not, but I do need to handle the combination.

> On 1, however, is that due to the boost condition?

The boost condition is straightforward. By default, if a grace period
lasts for more than 500 milliseconds, boosting starts. So the obvious
answer is "deboost when the grace period ends", but different CPUs become
aware of the end at different times, so it is still a bit fuzzy.

> I must admit that my thought there is somewhat fuzzy, since I just
> realized I don't actually know the exact condition to start boosting;
> but suppose we boost because the queue is too large: then waiting for
> the current grace period might not reduce the queue length, as most
> callbacks might actually be for the next.
>
> If, however, the condition is grace period duration, then completion of
> the current grace period is sufficient, since the whole boost condition
> is defined as such. [ if the next is also exceeding the time limit,
> that's a whole next boost ]

Don't get me wrong -- it can be done. Just a bit ugly due to the
fact that different CPUs have different views of when the grace period
ends.

> > That said, it might be possible if we are willing to live with some
> > approximate behavior. For example, always waiting for the next grace
> > period (rather than the current one) to finish, and boosting through the
> > extra callbacks in the case where a given CPU "adopts" callbacks that must
> > be boosted when that CPU also has some callbacks whose priority must be
> > boosted and some that need not be.
>
> That might make sense, but I must admit to not fully understanding the
> whole current/next thing yet.

And I cannot claim to have thought it through thoroughly, for that
matter.

> > The reason I am not all that excited about taking this approach is that
> > it doesn't help worst-case latency.
>
> Well, not running at the topmost prio does help those tasks running at
> a higher priority, so in that regard it does reduce the jitter for a
> number of tasks.

By default, boosting is to RT prio 1, so shouldn't bother most RT
processes.

> Also, I guess there's the whole question of what prio to boost to,
> which I somehow totally forgot about; that's a non-trivial thing in its
> own right, since there isn't really someone blocked on grace period
> completion (although in the special case of someone calling sync_rcu it
> is clear).

I am not all that excited about synchronize_rcu() controlling the boost
priority, but having synchronize_rcu_expedited() do so might make sense.
But I would want someone to come up with a situation needing this first.

Other than that, it is similar to working out what priority softirq
should run at in PREEMPT_RT.

> > Plus the current implementation is just a less-precise approximation.
> > (Sorry, couldn't resist!)
>
> Appreciated. On a similar note, I still need to actually look at all this
> (preempt) tree-rcu stuff to learn how exactly it works.

And I do need to document it. For one thing, I usually find a few bugs
when I do that. For another, the previous documentation is getting quite
dated.

Thanx, Paul
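
As a closing illustration of the boost condition discussed above
(boosting starts only once the current grace period has run for more
than 500 milliseconds by default), the check reduces to a simple
deadline comparison; a toy sketch with made-up scaffolding, not the
kernel's actual logic:

#include <stdio.h>

/* Duration-based boosting: record when the grace period started and
 * boost blocked readers only after the 500 ms default threshold. */
#define BOOST_DELAY_MS 500

static int should_boost(long gp_start_ms, long now_ms)
{
        return now_ms - gp_start_ms > BOOST_DELAY_MS;
}

int main(void)
{
        long t;

        for (t = 0; t <= 800; t += 200)
                printf("t=%3ldms: %s\n", t,
                       should_boost(0, t) ? "boost blocked readers"
                                          : "grace period still young");
        return 0;
}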