2010-07-08 20:58:47

by Andres Freund

Subject: INFO: rcu_sched_state detected stall on CPU

Hi all,

I recently got a dual-socket E5520 system (only one CPU installed right
now, but the problems were the same with both) where I regularly get
errors like

[ 288.281073] INFO: rcu_sched_state detected stall on CPU 1 (t=5890 jiffies)
[ 288.281086] INFO: rcu_sched_state detected stall on CPU 5 (t=5890 jiffies)
[ 288.281087] sending NMI to all CPUs:
[ 288.281096] sending NMI to all CPUs:
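
For anyone sifting through the attached dmesg files, a minimal sketch of
pulling the stall reports out of a log might look like the following. The
regular expression is based only on the two lines quoted above; the
function names and the hz parameter are hypothetical, and the tick rate
(CONFIG_HZ) is not stated in this message, so the caller has to supply it.

```python
import re

# Pattern derived from the two stall lines quoted above, e.g.
# [ 288.281073] INFO: rcu_sched_state detected stall on CPU 1 (t=5890 jiffies)
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: (?P<state>\w+) detected stall on CPU "
    r"(?P<cpu>\d+) \(t=(?P<jiffies>\d+) jiffies\)"
)


def parse_stalls(text):
    """Return one dict per stall report found in a dmesg dump."""
    stalls = []
    for line in text.splitlines():
        match = STALL_RE.search(line)
        if match:
            stalls.append({
                "timestamp": float(match.group("ts")),
                "state": match.group("state"),
                "cpu": int(match.group("cpu")),
                "jiffies": int(match.group("jiffies")),
            })
    return stalls


def stall_seconds(jiffies, hz):
    """Convert a jiffies count to seconds.

    hz is the kernel tick rate (CONFIG_HZ); it is an assumption here and
    must be taken from the .config of the kernel that produced the log.
    """
    return jiffies / hz


if __name__ == "__main__":
    sample = (
        "[ 288.281073] INFO: rcu_sched_state detected stall on CPU 1 (t=5890 jiffies)\n"
        "[ 288.281086] INFO: rcu_sched_state detected stall on CPU 5 (t=5890 jiffies)\n"
        "[ 288.281087] sending NMI to all CPUs:\n"
    )
    for stall in parse_stalls(sample):
        print(stall["timestamp"], stall["state"], stall["cpu"], stall["jiffies"])
```

Grouping the results by CPU or by timestamp would show whether the stalls
cluster around particular cores or particular kinds of load.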

After deactivating all power saving mechanisms it seems to have gotten
a bit more stable, but it still crashes pretty reliably under
I/O load. Graphics-intensive work also seems to be able to trigger it
reliably. The crashes also occurred with the cheap on-board Intel
graphics card.

Without the RCU debugging that produces the messages above, I pretty
regularly get hangs or missed inputs, at times ending fatally
(no sysrq, no keyboard reaction).

Normally I would try to do a bisect, but in this case I am in the
unfortunate situation that with earlier kernels I get problems with
other hardware (particularly the SAS controller, which currently holds
the only disks). So I have no known good version to start from.
Perhaps you have an idea?

dmesg of different, likely related crashes, lspci -v and my latest
.config are attached.

As I am not sure what kernel code is actually causing the problem -
the backtraces looked innocent enough on a short, clueless glance - I
don't know who to explicitly CC.

As small additional data points: using latencytop I see latencies in
the range of seconds for various things (creating md request, creating
block layer request, radeon_fence_wait).
The problems seem to have become more frequent after I enabled lockdep
and RCU debugging - possibly simply making the race more likely?

Thanks,

Andres


Attachments:
(No filename) (1.68 kB)
dmesg (148.95 kB)
dmesg (143.72 kB)
lspci (23.22 kB)
.config (70.80 kB)

2010-07-08 21:34:06

by Andres Freund

Subject: Re: INFO: rcu_sched_state detected stall on CPU

Err,

> After deactivating all power saving mechanisms it seems to have gotten
> a bit more stable - it still crashes pretty reliably under
> io-load. Graphics-intensive work also seems to be able to trigger it
> reliably. The crashes also occurred with the cheap on-board intel
> graphics card.
It's not an Intel one, but an ASPEED... I remembered the wrong system. Sorry.

Also, so that you don't have to read the full dmesg: it's 2.6.35-rc4
(I reproduced it with 2.6.32 onwards).

Andres

2010-08-06 20:10:34

by Andres Freund

Subject: Re: INFO: rcu_sched_state detected stall on CPU

On Thursday 08 July 2010 22:51:13 Andres Freund wrote:
> Hi all,
>
> I recently got a dual-socket E5520 (only one cpu attached right now,
> problems were the same with both though) system where I regularly get
> errors like
The errors (attached in the other message) still occur with 2.6.35. Is
there anything I can do to help?

Andres