2012-06-11 10:06:51

by Mike Galbraith

Subject: rcu: endless stalls

Greetings,

I received a report of a 48 core UV box hitting a gripe, taking longer
than timeout to emit same, so box griped endlessly, forcing reboot.

The below might prevent that.. and bust other stuff for free :)

rcu: one gripe at a time please

Not-compiled-by:
Not-signed-off-by:
Not-etc-by:

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 0da7b88..6462056d6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -818,10 +818,25 @@ static void print_cpu_stall(struct rcu_state *rsp)
set_need_resched(); /* kick ourselves to get things going. */
}

+/**
+ * rcu_cpu_stall_reset - prevent further stall warnings in current grace period
+ *
+ * Set the stall-warning timeout way off into the future, thus preventing
+ * any RCU CPU stall-warning messages from appearing in the current set of
+ * RCU grace periods.
+ *
+ * The caller must disable hard irqs.
+ */
+void rcu_cpu_stall_reset(void)
+{
+ rcu_sched_state.jiffies_stall = jiffies + ULONG_MAX / 2;
+ rcu_bh_state.jiffies_stall = jiffies + ULONG_MAX / 2;
+ rcu_preempt_stall_reset();
+}
+
static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
{
- unsigned long j;
- unsigned long js;
+ unsigned long j, js, flags;
struct rcu_node *rnp;

if (rcu_cpu_stall_suppress)
@@ -832,13 +847,23 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
if ((ACCESS_ONCE(rnp->qsmask) & rdp->grpmask) && ULONG_CMP_GE(j, js)) {

/* We haven't checked in, so go dump stack. */
+ rcu_cpu_stall_suppress = 1;
print_cpu_stall(rsp);
+ local_irq_save(flags);
+ rcu_cpu_stall_reset();
+ local_irq_restore(flags);
+ rcu_cpu_stall_suppress = 0;

} else if (rcu_gp_in_progress(rsp) &&
ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY)) {

/* They had a few time units to dump stack, so complain. */
+ rcu_cpu_stall_suppress = 1;
print_other_cpu_stall(rsp);
+ local_irq_save(flags);
+ rcu_cpu_stall_reset();
+ local_irq_restore(flags);
+ rcu_cpu_stall_suppress = 0;
}
}

@@ -848,22 +873,6 @@ static int rcu_panic(struct notifier_block *this, unsigned long ev, void *ptr)
return NOTIFY_DONE;
}

-/**
- * rcu_cpu_stall_reset - prevent further stall warnings in current grace period
- *
- * Set the stall-warning timeout way off into the future, thus preventing
- * any RCU CPU stall-warning messages from appearing in the current set of
- * RCU grace periods.
- *
- * The caller must disable hard irqs.
- */
-void rcu_cpu_stall_reset(void)
-{
- rcu_sched_state.jiffies_stall = jiffies + ULONG_MAX / 2;
- rcu_bh_state.jiffies_stall = jiffies + ULONG_MAX / 2;
- rcu_preempt_stall_reset();
-}
-
static struct notifier_block rcu_panic_block = {
.notifier_call = rcu_panic,
};


2012-06-11 13:40:17

by Paul E. McKenney

Subject: Re: rcu: endless stalls

On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> Greetings,
>
> I received a report of a 48 core UV box hitting a gripe, taking longer
> than timeout to emit same, so box griped endlessly, forcing reboot.

So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?

If so, yow!!!

My guess is that rcu_cpu_stall_suppress must be manipulated atomically
for this to work reliably, for example, using xchg().
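
Roughly what that could look like -- a minimal sketch only, with
hypothetical helper names rather than existing rcutree.c code, and
assuming rcu_cpu_stall_suppress stays a plain int:

static int stall_warning_quiesce(void)
{
	/* xchg() returns the previous value of the flag. */
	return xchg(&rcu_cpu_stall_suppress, 1);
}

static void stall_warning_unquiesce(int old)
{
	/* Put back whatever value was there before we quiesced. */
	xchg(&rcu_cpu_stall_suppress, old);
}

/*
 * Sketch of a call site in check_cpu_stall():
 *
 *	old = stall_warning_quiesce();
 *	print_cpu_stall(rsp);
 *	stall_warning_unquiesce(old);
 */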

Thanx, Paul

> The below might prevent that.. and bust other stuff for free :)
>
> rcu: one gripe at a time please
>
> Not-compiled-by:
> Not-signed-off-by:
> Not-etc-by:
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 0da7b88..6462056d6 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -818,10 +818,25 @@ static void print_cpu_stall(struct rcu_state *rsp)
> set_need_resched(); /* kick ourselves to get things going. */
> }
>
> +/**
> + * rcu_cpu_stall_reset - prevent further stall warnings in current grace period
> + *
> + * Set the stall-warning timeout way off into the future, thus preventing
> + * any RCU CPU stall-warning messages from appearing in the current set of
> + * RCU grace periods.
> + *
> + * The caller must disable hard irqs.
> + */
> +void rcu_cpu_stall_reset(void)
> +{
> + rcu_sched_state.jiffies_stall = jiffies + ULONG_MAX / 2;
> + rcu_bh_state.jiffies_stall = jiffies + ULONG_MAX / 2;
> + rcu_preempt_stall_reset();
> +}
> +
> static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
> {
> - unsigned long j;
> - unsigned long js;
> + unsigned long j, js, flags;
> struct rcu_node *rnp;
>
> if (rcu_cpu_stall_suppress)
> @@ -832,13 +847,23 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
> if ((ACCESS_ONCE(rnp->qsmask) & rdp->grpmask) && ULONG_CMP_GE(j, js)) {
>
> /* We haven't checked in, so go dump stack. */
> + rcu_cpu_stall_suppress = 1;
> print_cpu_stall(rsp);
> + local_irq_save(flags);
> + rcu_cpu_stall_reset();
> + local_irq_restore(flags);
> + rcu_cpu_stall_suppress = 0;
>
> } else if (rcu_gp_in_progress(rsp) &&
> ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY)) {
>
> /* They had a few time units to dump stack, so complain. */
> + rcu_cpu_stall_suppress = 1;
> print_other_cpu_stall(rsp);
> + local_irq_save(flags);
> + rcu_cpu_stall_reset();
> + local_irq_restore(flags);
> + rcu_cpu_stall_suppress = 0;
> }
> }
>
> @@ -848,22 +873,6 @@ static int rcu_panic(struct notifier_block *this, unsigned long ev, void *ptr)
> return NOTIFY_DONE;
> }
>
> -/**
> - * rcu_cpu_stall_reset - prevent further stall warnings in current grace period
> - *
> - * Set the stall-warning timeout way off into the future, thus preventing
> - * any RCU CPU stall-warning messages from appearing in the current set of
> - * RCU grace periods.
> - *
> - * The caller must disable hard irqs.
> - */
> -void rcu_cpu_stall_reset(void)
> -{
> - rcu_sched_state.jiffies_stall = jiffies + ULONG_MAX / 2;
> - rcu_bh_state.jiffies_stall = jiffies + ULONG_MAX / 2;
> - rcu_preempt_stall_reset();
> -}
> -
> static struct notifier_block rcu_panic_block = {
> .notifier_call = rcu_panic,
> };
>
>

2012-06-11 14:22:29

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Mon, 2012-06-11 at 06:39 -0700, Paul E. McKenney wrote:
> On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> > Greetings,
> >
> > I received a report of a 48 core UV box hitting a gripe, taking longer
> > than timeout to emit same, so box griped endlessly, forcing reboot.
>
> So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?
>
> If so, yow!!!

The report was a tad fuzzy on details, but serial console can take a
while at low baud rate, and the box was apparently perma-stalled.

> My guess is that rcu_cpu_stall_suppress must be manipulated atomically
> for this to work reliably, for example, using xchg().

Ah, thanks.

-Mike

2012-06-11 16:55:34

by Paul E. McKenney

Subject: Re: rcu: endless stalls

On Mon, Jun 11, 2012 at 04:22:23PM +0200, Mike Galbraith wrote:
> On Mon, 2012-06-11 at 06:39 -0700, Paul E. McKenney wrote:
> > On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> > > Greetings,
> > >
> > > I received a report of a 48 core UV box hitting a gripe, taking longer
> > > than timeout to emit same, so box griped endlessly, forcing reboot.
> >
> > So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?
> >
> > If so, yow!!!
>
> The report was a tad fuzzy on details, but serial console can take a
> while at low baud rate, and the box was apparently perma-stalled.

Which clearly demands the question "why not increase the serial-console
baud rate"... ;-)

Thanx, Paul

> > My guess is that rcu_cpu_stall_suppress must be manipulated atomically
> > for this to work reliably, for example, using xchg().
>
> Ah, thanks.
>
> -Mike
>

2012-06-11 17:20:36

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Mon, 2012-06-11 at 16:22 +0200, Mike Galbraith wrote:
> On Mon, 2012-06-11 at 06:39 -0700, Paul E. McKenney wrote:
> > On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> > > Greetings,
> > >
> > > I received a report of a 48 core UV box hitting a gripe, taking longer
> > > than timeout to emit same, so box griped endlessly, forcing reboot.
> >
> > So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?
> >
> > If so, yow!!!
>
> The report was a tad fuzzy on details, but serial console can take a
> while at low baud rate, and the box was apparently perma-stalled.

Gee, no wonder it took a while to emit the gripe, there were a few more
than 48 CPUs actually.. 4048 more to be precise.

4096 tasks on 4096 CPUs all hammering wait_lock at once is !pretty.

2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode

-Mike

2012-06-11 18:01:33

by Paul E. McKenney

Subject: Re: rcu: endless stalls

On Mon, Jun 11, 2012 at 07:20:14PM +0200, Mike Galbraith wrote:
> On Mon, 2012-06-11 at 16:22 +0200, Mike Galbraith wrote:
> > On Mon, 2012-06-11 at 06:39 -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> > > > Greetings,
> > > >
> > > > I received a report of a 48 core UV box hitting a gripe, taking longer
> > > > than timeout to emit same, so box griped endlessly, forcing reboot.
> > >
> > > So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?
> > >
> > > If so, yow!!!
> >
> > The report was a tad fuzzy on details, but serial console can take a
> > while at low baud rate, and the box was apparently perma-stalled.
>
> Gee, no wonder it took a while to emit the gripe, there were a few more
> than 48 CPUs actually.. 4048 more to be precise.
>
> 4096 tasks on 4096 CPUs all hammering wait_lock at once is !pretty.

That would be bad...

BTW, one other thing on the patch -- it needs to use a separate flag,
otherwise there are races with sysfs and panic() updates to the flag.
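
For example (hypothetical names, a sketch rather than existing code):
a dedicated flag that only the stall-reporting path touches, so that
printing a report never writes rcu_cpu_stall_suppress and therefore
cannot race with the sysfs or panic() updates to that flag:

/* Hypothetical: nonzero while some CPU is printing a stall report. */
static int rcu_stall_print_in_flight;

	/* ... in check_cpu_stall(), wrapped around the print calls ... */
	if (xchg(&rcu_stall_print_in_flight, 1))
		return;	/* Someone else is already reporting. */
	print_cpu_stall(rsp);
	xchg(&rcu_stall_print_in_flight, 0);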

> 2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode

I confess, you lost me on this one. You believe that this commit is
the cause of the RCU CPU stall warnings?

Thanx, Paul

2012-06-11 18:10:47

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Mon, 2012-06-11 at 11:01 -0700, Paul E. McKenney wrote:
> On Mon, Jun 11, 2012 at 07:20:14PM +0200, Mike Galbraith wrote:
> > On Mon, 2012-06-11 at 16:22 +0200, Mike Galbraith wrote:
> > > On Mon, 2012-06-11 at 06:39 -0700, Paul E. McKenney wrote:
> > > > On Mon, Jun 11, 2012 at 12:06:16PM +0200, Mike Galbraith wrote:
> > > > > Greetings,
> > > > >
> > > > > I received a report of a 48 core UV box hitting a gripe, taking longer
> > > > > than timeout to emit same, so box griped endlessly, forcing reboot.
> > > >
> > > > So it took minutes to print an RCU CPU stall warning? On only 48 CPUs?
> > > >
> > > > If so, yow!!!
> > >
> > > The report was a tad fuzzy on details, but serial console can take a
> > > while at low baud rate, and the box was apparently perma-stalled.
> >
> > Gee, no wonder it took a while to emit the gripe, there were a few more
> > than 48 CPUs actually.. 4048 more to be precise.
> >
> > 4096 tasks on 4096 CPUs all hammering wait_lock at once is !pretty.
>
> That would be bad...
>
> BTW, one other thing on the patch -- it needs to use a separate flag,
> otherwise there are races with sysfs and panic() updates to the flag.
>
> > 2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode
>
> I confess, you lost me on this one. You believe that this commit is
> the cause of the RCU CPU stall warnings?

4096 tasks on 4096 CPUs exit (well, try to) simultaneously.

Call Trace:
__mutex_lock_slowpath+0x94/0x150
mutex_lock+0x1a/0x40
unlink_file_vma+0x3f/0xf0
free_pgtables+0x40/0x100
exit_mmap+0xb0/0x120
mmput+0x49/0x120
exit_mm+0x122/0x160
do_exit+0x179/0x8d0
do_group_exit+0x3d/0xb0
sys_exit_group+0x12/0x20

Monster box dies screaming.

-Mike

2012-06-13 03:35:15

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Mon, 2012-06-11 at 20:10 +0200, Mike Galbraith wrote:
> On Mon, 2012-06-11 at 11:01 -0700, Paul E. McKenney wrote:

> > > 2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode
> >
> > I confess, you lost me on this one. You believe that this commit is
> > the cause of the RCU CPU stall warnings?
>
> 4096 tasks on 4096 CPUs exit (well, try to) simultaneously.
>
> Call Trace:
> __mutex_lock_slowpath+0x94/0x150
> mutex_lock+0x1a/0x40
> unlink_file_vma+0x3f/0xf0
> free_pgtables+0x40/0x100
> exit_mmap+0xb0/0x120
> mmput+0x49/0x120
> exit_mm+0x122/0x160
> do_exit+0x179/0x8d0
> do_group_exit+0x3d/0xb0
> sys_exit_group+0x12/0x20
>
> Monster box dies screaming.

That commit landed in stable, box with way too many cores (NR_CPUS=4096!!)
chokes instantly with loads of spinners. Ok, so zillion CPUs grabbing a
mutex in lockstep is a bad idea (_having_ zillion?), but is there pilot
error involved in a logjam like this?

-Mike

2012-06-13 04:32:18

by Hugh Dickins

Subject: Re: rcu: endless stalls

On Wed, 13 Jun 2012, Mike Galbraith wrote:
> On Mon, 2012-06-11 at 20:10 +0200, Mike Galbraith wrote:
> > On Mon, 2012-06-11 at 11:01 -0700, Paul E. McKenney wrote:
>
> > > > 2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode
> > >
> > > I confess, you lost me on this one. You believe that this commit is
> > > the cause of the RCU CPU stall warnings?
> >
> > 4096 tasks on 4096 CPUs exit (well, try to) simultaneously.
> >
> > Call Trace:
> > __mutex_lock_slowpath+0x94/0x150
> > mutex_lock+0x1a/0x40
> > unlink_file_vma+0x3f/0xf0
> > free_pgtables+0x40/0x100
> > exit_mmap+0xb0/0x120
> > mmput+0x49/0x120
> > exit_mm+0x122/0x160
> > do_exit+0x179/0x8d0
> > do_group_exit+0x3d/0xb0
> > sys_exit_group+0x12/0x20
> >
> > Monster box dies screaming.
>
> That commit landed in stable, box with way too many cores (NR_CPUS=4096!!)
> chokes instantly with loads of spinners. Ok, so zillion CPUs grabbing a
> mutex in lockstep is a bad idea (_having_ zillion?), but is there pilot
> error involved in a logjam like this?

Surely some mistake...

I can't find any mention of which kernel release you're talking about.

But Miklos's 2aa15890 unmap_mutex was introduced in 2.6.38 and removed
in 3.0, when PeterZ converted i_mmap_lock to i_mmap_mutex, and removed
the need for the additional unmap_mutex.

The unmap_mutex would never have been taken in unlink_file_vma(),
shown in your stacktrace above: it was for truncation and invalidation.

The likely mutex in unlink_file_vma() would be the i_mmap_mutex.
So I expect you're talking about a 3.0 or later kernel.

But then why would someone "backport" Miklos's patch to stable for it?

You lost me too!

Hugh

2012-06-13 05:56:20

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Tue, 2012-06-12 at 21:31 -0700, Hugh Dickins wrote:
> On Wed, 13 Jun 2012, Mike Galbraith wrote:
> > On Mon, 2012-06-11 at 20:10 +0200, Mike Galbraith wrote:
> > > On Mon, 2012-06-11 at 11:01 -0700, Paul E. McKenney wrote:
> >
> > > > > 2aa15890 - mm: prevent concurrent unmap_mapping_range() on the same inode
> > > >
> > > > I confess, you lost me on this one. You believe that this commit is
> > > > the cause of the RCU CPU stall warnings?
> > >
> > > 4096 tasks on 4096 CPUs exit (well, try to) simultaneously.
> > >
> > > Call Trace:
> > > __mutex_lock_slowpath+0x94/0x150
> > > mutex_lock+0x1a/0x40
> > > unlink_file_vma+0x3f/0xf0
> > > free_pgtables+0x40/0x100
> > > exit_mmap+0xb0/0x120
> > > mmput+0x49/0x120
> > > exit_mm+0x122/0x160
> > > do_exit+0x179/0x8d0
> > > do_group_exit+0x3d/0xb0
> > > sys_exit_group+0x12/0x20
> > >
> > > Monster box dies screaming.
> >
> > That commit landed in stable, box with way too many cores (NR_CPUS=4096!!)
> > chokes instantly with loads of spinners. Ok, so zillion CPUs grabbing a
> > mutex in lockstep is a bad idea (_having_ zillion?), but is there pilot
> > error involved in a logjam like this?
>
> Surely some mistake...

Yup.

> I can't find any mention of which kernel release you're talking about.
>
> But Miklos's 2aa15890 unmap_mutex was introduced in 2.6.38 and removed
> in 3.0, when PeterZ converted i_mmap_lock to i_mmap_mutex, and removed
> the need for the additional unmap_mutex.
>
> The unmap_mutex would never have been taken in unlink_file_vma(),
> shown in your stacktrace above: it was for truncation and invalidation.
>
> The likely mutex in unlink_file_vma() would be the i_mmap_mutex.
> So I expect you're talking about a 3.0 or later kernel.

Yeah.

> But then why would someone "backport" Miklos's patch to stable for it?
>
> You lost me too!

Sorry, /me is hopping around vigorously between too many darn trees,
it's already there, not a stable addition.

Question remains though. Maybe the box hit some other problem that led
to death by RCU gripage, but the info I received indicated the box was
in the midst of a major spin-fest.

-Mike

2012-06-13 07:13:01

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Wed, 2012-06-13 at 07:56 +0200, Mike Galbraith wrote:

> Question remains though. Maybe the box hit some other problem that led
> to death by RCU gripage, but the info I received indicated the box was
> in the midst of a major spin-fest.

To (maybe) speak more clearly, since it's a mutex like any other mutex
that loads of CPUs can hit if you've got loads of CPUs, did huge box
driver do something that we don't expect so many CPUs to be doing, thus
instigate simultaneous exit trouble (ie shoot self in foot), or did that
mutex addition create the exit trouble which box appeared to be having?

Too bad box RCU griped itself to death, would have been nice to have
more than a snippet.

-Mike

2012-06-14 07:45:38

by Mike Galbraith

[permalink] [raw]
Subject: Re: rcu: endless stalls

On Wed, 2012-06-13 at 09:12 +0200, Mike Galbraith wrote:
> On Wed, 2012-06-13 at 07:56 +0200, Mike Galbraith wrote:
>
> > Question remains though. Maybe the box hit some other problem that led
> > to death by RCU gripage, but the info I received indicated the box was
> > in the midst of a major spin-fest.
>
> To (maybe) speak more clearly, since it's a mutex like any other mutex
> that loads of CPUs can hit if you've got loads of CPUs, did huge box
> driver do something that we don't expect so many CPUs to be doing, thus
> instigate simultaneous exit trouble (ie shoot self in foot), or did that
> mutex addition create the exit trouble which box appeared to be having?

Crickets chirping.. I know what _that_ means: "tsk tsk, you dummy" :)

I suspected that would happen, but asked anyway because I couldn't
imagine even 4096 CPUs getting tangled up for an _eternity_ trying to go
to sleep, but the lock which landed after 2.6.32-stable where these beasts
earn their daily fuel rods was splattered all over the event. Oh well.

So, I can forget that and just make the thing not gripe itself to death
should a stall for whatever reason be encountered again.

Rather than mucking about with rcu_cpu_stall_suppress, how about
adjusting the timeout as you proceed, and blocking the report
functions? That way, there's no fiddling with things used elsewhere,
and it shouldn't matter how badly the console is being hammered: you
get a full report, and maybe even only one.

Hm, maybe I should forget hoping to keep check_cpu_stall() happy too,
and only silently ignore it when busy.

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 0da7b88..e9dd654 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -727,24 +727,29 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
rsp->jiffies_stall = jiffies + jiffies_till_stall_check();
}

+int rcu_stall_report_in_progress;
+
static void print_other_cpu_stall(struct rcu_state *rsp)
{
int cpu;
long delta;
unsigned long flags;
int ndetected;
- struct rcu_node *rnp = rcu_get_root(rsp);
+ struct rcu_node *root = rcu_get_root(rsp);
+ struct rcu_node *rnp;

/* Only let one CPU complain about others per time interval. */

- raw_spin_lock_irqsave(&rnp->lock, flags);
+ raw_spin_lock_irqsave(&root->lock, flags);
delta = jiffies - rsp->jiffies_stall;
- if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp)) {
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp) ||
+ rcu_stall_report_in_progress) {
+ raw_spin_unlock_irqrestore(&root->lock, flags);
return;
}
rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ rcu_stall_report_in_progress++;
+ raw_spin_unlock_irqrestore(&root->lock, flags);

/*
* OK, time to rat on our buddy...
@@ -765,16 +770,23 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
print_cpu_stall_info(rsp, rnp->grplo + cpu);
ndetected++;
}
+
+ /*
+ * Push the timeout back as we go. With a slow serial
+ * console on a large machine, this may take a while.
+ */
+ raw_spin_lock_irqsave(&root->lock, flags);
+ rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
+ raw_spin_unlock_irqrestore(&root->lock, flags);
}

/*
* Now rat on any tasks that got kicked up to the root rcu_node
* due to CPU offlining.
*/
- rnp = rcu_get_root(rsp);
- raw_spin_lock_irqsave(&rnp->lock, flags);
- ndetected = rcu_print_task_stall(rnp);
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ raw_spin_lock_irqsave(&root->lock, flags);
+ ndetected = rcu_print_task_stall(root);
+ raw_spin_unlock_irqrestore(&root->lock, flags);

print_cpu_stall_info_end();
printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
@@ -784,6 +796,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
else if (!trigger_all_cpu_backtrace())
dump_stack();

+ raw_spin_lock_irqsave(&root->lock, flags);
+ rcu_stall_report_in_progress--;
+ raw_spin_unlock_irqrestore(&root->lock, flags);
+
/* If so configured, complain about tasks blocking the grace period. */

rcu_print_detail_task_stall(rsp);
@@ -796,6 +812,17 @@ static void print_cpu_stall(struct rcu_state *rsp)
unsigned long flags;
struct rcu_node *rnp = rcu_get_root(rsp);

+ raw_spin_lock_irqsave(&rnp->lock, flags);
+ if (rcu_stall_report_in_progress) {
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+ return;
+ }
+
+ /* Reset timeout, dump_stack() may take a while on large machines. */
+ rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
+ rcu_stall_report_in_progress++;
+ raw_spin_unlock_irqrestore(&rnp->lock, flags);
+
/*
* OK, time to rat on ourselves...
* See Documentation/RCU/stallwarn.txt for info on how to debug
@@ -813,6 +840,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
if (ULONG_CMP_GE(jiffies, rsp->jiffies_stall))
rsp->jiffies_stall = jiffies +
3 * jiffies_till_stall_check() + 3;
+ rcu_stall_report_in_progress--;
raw_spin_unlock_irqrestore(&rnp->lock, flags);

set_need_resched(); /* kick ourselves to get things going. */

-Mike

2012-06-14 16:51:28

by Paul E. McKenney

Subject: Re: rcu: endless stalls

On Thu, Jun 14, 2012 at 09:45:32AM +0200, Mike Galbraith wrote:
> On Wed, 2012-06-13 at 09:12 +0200, Mike Galbraith wrote:
> > On Wed, 2012-06-13 at 07:56 +0200, Mike Galbraith wrote:
> >
> > > Question remains though. Maybe the box hit some other problem that led
> > > to death by RCU gripage, but the info I received indicated the box was
> > > in the midst of a major spin-fest.
> >
> > To (maybe) speak more clearly, since it's a mutex like any other mutex
> > that loads of CPUs can hit if you've got loads of CPUs, did huge box
> > driver do something that we don't expect so many CPUs to be doing, thus
> > instigate simultaneous exit trouble (ie shoot self in foot), or did that
> > mutex addition create the exit trouble which box appeared to be having?
>
> Crickets chirping.. I know what _that_ means: "tsk tsk, you dummy" :)

In my case, it means that I suspect you would rather me continue working
on my series of patches to further reduce RCU grace-period-initialization
latencies on large systems than worry about this patch.

If I understand correctly, your patch did what you wanted in the
situation at hand. I have some quibbles, please see below.

> I suspected that would happen, but asked anyway because I couldn't
> imagine even 4096 CPUs getting tangled up for an _eternity_ trying to go
> to sleep, but the lock which landed after 2.6.32-stable where these beasts
> earn their daily fuel rods was splattered all over the event. Oh well.
>
> So, I can forget that and just make the thing not gripe itself to death
> should a stall for whatever reason be encountered again.
>
> Rather than mucking about with rcu_cpu_stall_suppress, how about
> adjusting the timeout as you proceed, and blocking the report
> functions? That way, there's no fiddling with things used elsewhere,
> and it shouldn't matter how badly the console is being hammered: you
> get a full report, and maybe even only one.

That sounds like a very good interim approach to me!

> Hm, maybe I should forget hoping to keep check_cpu_stall() happy too,
> and only silently ignore it when busy.

But this is a good way to accumulate a variety of stalls, so not
recommended.

Thanx, Paul

> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 0da7b88..e9dd654 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -727,24 +727,29 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
> rsp->jiffies_stall = jiffies + jiffies_till_stall_check();
> }
>
> +int rcu_stall_report_in_progress;
> +
> static void print_other_cpu_stall(struct rcu_state *rsp)
> {
> int cpu;
> long delta;
> unsigned long flags;
> int ndetected;
> - struct rcu_node *rnp = rcu_get_root(rsp);
> + struct rcu_node *root = rcu_get_root(rsp);

s/root/rnp_root/, please -- consistency with other places in the code.
But see below.

> + struct rcu_node *rnp;
>
> /* Only let one CPU complain about others per time interval. */
>
> - raw_spin_lock_irqsave(&rnp->lock, flags);
> + raw_spin_lock_irqsave(&root->lock, flags);

On a 4096-CPU system, I bet that this needs to be trylock, with
a bare "return" on failure to acquire the lock.

Of course, that fails if someone is holding this lock for some other
reason. So I believe that you need a separate ->stalllock in the
rcu_state structure for this purpose. You won't need to disable irqs
when acquiring the lock.
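
Something along these lines, perhaps -- a sketch that assumes a new
raw_spinlock_t ->stalllock member added to the rcu_state structure and
initialized at boot; none of this is existing code:

	/* Only one CPU gets to print; everyone else just leaves. */
	if (!raw_spin_trylock(&rsp->stalllock))
		return;
	/*
	 * ... print the report, pushing rsp->jiffies_stall ahead
	 * periodically so that other CPUs stay quiet meanwhile ...
	 */
	raw_spin_unlock(&rsp->stalllock);
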

> delta = jiffies - rsp->jiffies_stall;
> - if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp)) {
> - raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + if (delta < RCU_STALL_RAT_DELAY || !rcu_gp_in_progress(rsp) ||
> + rcu_stall_report_in_progress) {

If you conditionally acquire the new ->stalllock, you shouldn't need
this added check.

> + raw_spin_unlock_irqrestore(&root->lock, flags);
> return;
> }
> rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
> - raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + rcu_stall_report_in_progress++;

And with the new ->stalllock, rcu_stall_report_in_progress can go away.

> + raw_spin_unlock_irqrestore(&root->lock, flags);
>
> /*
> * OK, time to rat on our buddy...
> @@ -765,16 +770,23 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
> print_cpu_stall_info(rsp, rnp->grplo + cpu);
> ndetected++;
> }
> +
> + /*
> + * Push the timeout back as we go. With a slow serial
> + * console on a large machine, this may take a while.
> + */
> + raw_spin_lock_irqsave(&root->lock, flags);
> + rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
> + raw_spin_unlock_irqrestore(&root->lock, flags);

And the separate ->stalllock should make this unnecessary as well.
However, updating rsp->jiffies_stall periodically is a good idea
in order to decrease cache thrashing on ->stalllock.

> }
>
> /*
> * Now rat on any tasks that got kicked up to the root rcu_node
> * due to CPU offlining.
> */
> - rnp = rcu_get_root(rsp);
> - raw_spin_lock_irqsave(&rnp->lock, flags);
> - ndetected = rcu_print_task_stall(rnp);
> - raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + raw_spin_lock_irqsave(&root->lock, flags);
> + ndetected = rcu_print_task_stall(root);
> + raw_spin_unlock_irqrestore(&root->lock, flags);

And the separate lock makes this change unnecessary, also.

> print_cpu_stall_info_end();
> printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
> @@ -784,6 +796,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
> else if (!trigger_all_cpu_backtrace())
> dump_stack();
>
> + raw_spin_lock_irqsave(&root->lock, flags);
> + rcu_stall_report_in_progress--;
> + raw_spin_unlock_irqrestore(&root->lock, flags);

Ditto here.

> +
> /* If so configured, complain about tasks blocking the grace period. */
>
> rcu_print_detail_task_stall(rsp);
> @@ -796,6 +812,17 @@ static void print_cpu_stall(struct rcu_state *rsp)
> unsigned long flags;
> struct rcu_node *rnp = rcu_get_root(rsp);
>
> + raw_spin_lock_irqsave(&rnp->lock, flags);
> + if (rcu_stall_report_in_progress) {
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);
> + return;
> + }
> +
> + /* Reset timeout, dump_stack() may take a while on large machines. */
> + rsp->jiffies_stall = jiffies + 3 * jiffies_till_stall_check() + 3;
> + rcu_stall_report_in_progress++;
> + raw_spin_unlock_irqrestore(&rnp->lock, flags);

And do trylock (without disabling irqs) here as well. No need for
rcu_stall_report_in_progress. You can update rsp->jiffies_stall
to reduce cache thrashing on ->stalllock.

> +
> /*
> * OK, time to rat on ourselves...
> * See Documentation/RCU/stallwarn.txt for info on how to debug
> @@ -813,6 +840,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
> if (ULONG_CMP_GE(jiffies, rsp->jiffies_stall))
> rsp->jiffies_stall = jiffies +
> 3 * jiffies_till_stall_check() + 3;
> + rcu_stall_report_in_progress--;

And ->stalllock makes this unnecessary as well.

> raw_spin_unlock_irqrestore(&rnp->lock, flags);
>
> set_need_resched(); /* kick ourselves to get things going. */
>
> -Mike
>

2012-06-14 21:34:24

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Thu, 2012-06-14 at 09:47 -0700, Paul E. McKenney wrote:
> On Thu, Jun 14, 2012 at 09:45:32AM +0200, Mike Galbraith wrote:
> > Crickets chirping.. I know what _that_ means: "tsk tsk, you dummy" :)
>
> In my case, it means that I suspect you would rather me continue working
> on my series of patches to further reduce RCU grace-period-initialization
> latencies on large systems than worry about this patch.

Well of course you, being vastly more qualified (duh), doing that
instead of me doing that is highly desirable, but no, I'm not into
slavery, either expressed or implied, so if it happens before I get to
poking at it again, great, if not, oh well.

Really, I don't care. I trivially fixed the regression. The folks who
need this are highly capable of piddling with it, and you know who they
are. They ain't doing shit but trying to milk _me_ like a damn cow, so
I can't be arsed to give a shit, I'm way too busy to have that luxury.

-Mike

2012-06-15 07:49:22

by Mike Galbraith

Subject: Re: rcu: endless stalls

On Thu, 2012-06-14 at 09:47 -0700, Paul E. McKenney wrote:

> In my case, it means that I suspect you would rather me continue working
> on my series of patches to further reduce RCU grace-period-initialization
> latencies on large systems than worry about this patch.

Hm, grumpy man may have misinterpreted that. He's off for a weekend of
therapeutic bike riding and flower smelling in any case.

-Mike