2011-05-29 16:37:52

by Whit Blauvelt

Subject: recursive fault in 2.6.35.5

Hi,

This isn't a most-recent kernel, so we should upgrade the systems with it,
but it could also be useful to know why the fault occurred. If someone here
can easily decode the final messages when the system froze....

This is vanilla 2.6.35.5, built from source, running with Ubuntu Server
10.04.2. Two similar systems have been running stably for months, then
yesterday and today both froze up - one twice. On the one where I was able
to get a remote console before rebooting the final messages are in a screen
capture at

http://www.transpect.com/jpg/sb2crash.jpg

The final lines are

[3521437.065988] RIP [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
[3521437.065993] RSP <ffff8801b60b1748>
[3521437.065994] CR2: 0000000000000038
[3521437.065997] ---[ end trace 5a40c5f226029029 ]---
[3521437.065999] Fixing recursive fault but reboot is needed!

These are basically file servers running NFS, samba, and some Python. I know
there are recent improvements to the kernel's NFS functions. Does this point
in that direction as the cause of the recursive fault?

TIA,
Whit


2011-05-30 02:48:34

by Mike Galbraith

Subject: Re: recursive fault in 2.6.35.5

On Sun, 2011-05-29 at 12:27 -0400, Whit Blauvelt wrote:
> Hi,
>
> This isn't a most-recent kernel, so we should upgrade the systems with it,
> but it could also be useful to know why the fault occurred. If someone here
> can easily decode the final messages when the system froze....
>
> This is vanilla 2.6.35.5, built from source, running with Ubuntu Server
> 10.04.2. Two similar systems have been running stably for months, then
> yesterday and today both froze up - one twice. On the one where I was able
> to get a remote console before rebooting the final messages are in a screen
> capture at
>
> http://www.transpect.com/jpg/sb2crash.jpg
>
> The final lines are
>
> [3521437.065988] RIP [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
> [3521437.065993] RSP <ffff8801b60b1748>
> [3521437.065994] CR2: 0000000000000038
> [3521437.065997] ---[ end trace 5a40c5f226029029 ]---
> [3521437.065999] Fixing recursive fault but reboot is needed!
>
> These are basically file servers running NFS, samba, and some Python. I know
> there are recent improvements to the kernel's NFS functions. Does this point
> in that direction as the cause of the recursive fault?

No, you've been bitten by an annoyingly elusive load balancing bug.

-Mike

2011-05-31 14:24:17

by Whit Blauvelt

Subject: Re: recursive fault in 2.6.35.5

On Mon, May 30, 2011 at 04:48:29AM +0200, Mike Galbraith wrote:

> No, you've been bitten by an annoyingly elusive load balancing bug.

Thanks, Mike. Can that bug be avoided by leaving out some kernel option? The
system this happened on had its identical twin fail the day before. For
both, it was a time of relatively more load (although not excessive). On the
twin we didn't look at the console before rebooting, though.

On the other hand, we'd run for months with no problem up until this.

Regards,
Whit

2011-06-01 02:01:51

by Mike Galbraith

Subject: Re: recursive fault in 2.6.35.5

On Tue, 2011-05-31 at 10:24 -0400, Whit Blauvelt wrote:
> On Mon, May 30, 2011 at 04:48:29AM +0200, Mike Galbraith wrote:
>
> > No, you've been bitten by an annoyingly elusive load balancing bug.
>
> Thanks, Mike. Can that bug be avoided by leaving out some kernel option? The
> system this happened on had its identical twin fail the day before. For
> both, it was a time of relatively more load (although not excessive). On the
> twin we didn't look at the console before rebooting, though.
>
> On the other hand, we'd run for months with no problem up until this.

No earthly notion. I never figured out exactly how it happens. Setting
traps for the critter didn't work out. I did receive some diagnostic
info from a group of ppc64 boxen that indicated that the clock went
backward, but when I zeroed in on it, they went silent. All other
machines with traps set have been totally silent for months (that's a
lot of machines too).

Bug seems to be dead upstream, at least I haven't noticed any reports
with a recent kernel.

-Mike
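
For anyone reading the final lines of a trace like the one quoted at the
top of this thread, here is a minimal Python sketch that pulls the fields
apart. It is only an illustration: the function name decode_oops_tail, the
regular expressions, and the 4096-byte page size are assumptions made for
this example and are not part of any kernel tool. On x86, CR2 holds the
address whose access faulted, so a very small value such as 0x38 is
consistent with reading a structure member through a NULL pointer. That
locates the faulting access inside set_next_entity (a scheduler function),
but it does not by itself explain how the pointer came to be NULL.

```python
import re

# The last lines of the trace, copied from the original report.
OOPS_TAIL = """\
[3521437.065988] RIP [<ffffffff81054ddc>] set_next_entity+0xc/0xa0
[3521437.065993] RSP <ffff8801b60b1748>
[3521437.065994] CR2: 0000000000000038
"""

# "RIP [<address>] symbol+0xoffset/0xsize"
RIP_RE = re.compile(
    r"RIP\s+\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)
# "CR2: faulting address"
CR2_RE = re.compile(r"CR2:\s+([0-9a-f]+)")


def decode_oops_tail(text, page_size=4096):
    """Extract the instruction pointer and faulting address from oops text.

    page_size is an assumption (4 KiB pages); an address below it is
    treated as a likely NULL-pointer dereference at a small member offset.
    """
    rip = RIP_RE.search(text)
    cr2 = CR2_RE.search(text)
    if rip is None or cr2 is None:
        raise ValueError("no RIP or CR2 line found")

    rip_addr = int(rip.group(1), 16)
    offset = int(rip.group(3), 16)
    fault_addr = int(cr2.group(1), 16)
    return {
        "symbol": rip.group(2),               # function containing RIP
        "offset": offset,                     # bytes into that function
        "size": int(rip.group(4), 16),        # total size of the function
        "function_start": rip_addr - offset,  # address of the symbol itself
        "fault_addr": fault_addr,             # address that could not be read
        "looks_like_null_deref": fault_addr < page_size,
    }


if __name__ == "__main__":
    for key, value in decode_oops_tail(OOPS_TAIL).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

With a vmlinux built with debug info from the same source tree, the
symbol+offset pair can then be looked up in gdb (for example with
"list *(set_next_entity+0xc)") to find the source line involved.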