2007-12-20 18:14:19

by Volker Armin Hemmann

Subject: Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

On Thursday, 20 December 2007, you wrote:
> On Thu, 2007-12-20 at 06:53 +0100, Hemmann, Volker Armin wrote:
> > On Thursday, 20 December 2007, you wrote:
> > > On Thu, 2007-12-20 at 03:13 +0100, Hemmann, Volker Armin wrote:
> > > > On Monday, 17 December 2007, you wrote:
> > > >
> > > > and another one, this time tainted with the nvidia module:
> > > > [ 5194.130985] Unable to handle kernel paging request at
> > > > 0000030000000000 RIP:
> > >
> > > This really sounds like bad hardware. Either memory or the mobo/riser
> > > card the memory is on. You might try lowering the memory timings of
> > > your memory in BIOS. Try removing 1/2 of your memory. If it still
> > > fails, remove the other 1/2 and put the first 1/2 back and try again.
> >
> > if this is bad hardware why:
> >
> > - didn't this show up earlier?
> >
> > - did a several-hour memtest run a couple of weeks ago not show
> > anything?
> >
> > - and does stuff like compiling all of kde 3.5.8 or the latest kde4 rc
> > finish without any problems?
>
> Because bad hardware can be highly sensitive to exact load patterns.
> Don't be so skeptical of the suggestion that your hardware may be
> flakey. In the last 30+ years as a hardware guy in the design lab and in
> the field, I've seen plenty of hardware which passed extensive
> diagnostics but turned out to be flakey nonetheless.
>
> I would suggest that you rearrange your ram modules, and see if the bit
> pattern changes. Memtest may not show a problem with bitflips... if
> that's happening. I would also suggest that you check your case
> temperature as someone else suggested - lmsensors may say that the CPU
> temperature is fine, but that isn't the whole picture, by a long shot.
>
> -Mike

case temp: 25°C measured near a warm harddisk by a digital thermometer.
mainboard temp: 31°C measured by lmsensors (mobo bios agrees)
cpu temp: 29-50°C (load dependent) measured by lmsensors, the bios reads two
degrees higher.

I have 4 'big' fans installed to have a constant air stream in the case.

This really does not look like overheating. And I did have flaky ram in the
past. The thing is, apart from the oops the system is completely, perfectly
stable. That really does not smell like flaky hardware. At least not in my
experience. Flaky PSU = sudden reboots, boot problems, crashes under load.
Flaky mobo = see flaky psu. Flaky Ram: crashes, crashes, more crashes,
segfaults here and there, especially when updating glibc, qt or kde.

And I don't get this. I only get oopses.

It is just.. It could be the hardware - but I should have seen the
same 'problem' with earlier kernels - and the 'almost daily oops' only
started with 2.6.23.
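
As an aside on the "bit pattern" Mike mentions: the faulting address from the quoted oops, 0000030000000000, can be decoded to see which bits are set. The following is only a minimal sketch; the helper name set_bits is hypothetical and not from the thread, and a sparse pattern like this is merely consistent with flipped bits, it does not prove them.

```python
from typing import List


def set_bits(value: int) -> List[int]:
    """Return the positions (0 = least significant) of all bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# Faulting address copied from the oops quoted in the thread.
addr = int("0000030000000000", 16)

# Prints [40, 41]: only two adjacent bits are set in the bad address.
print(set_bits(addr))
```

If later oopses keep showing the same bit positions after the ram modules have been rearranged, that would point away from a particular module; if the positions move with a module, it would point towards it.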


2007-12-21 07:47:19

by Mike Galbraith

Subject: Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well


On Thu, 2007-12-20 at 19:14 +0100, Hemmann, Volker Armin wrote:

> It is just.. It could be the hardware - but I should have seen the
> same 'problem' with earlier kernels - and the 'almost daily oops' only
> started with 2.6.23.

Nonetheless, the oopsen _suggest_ hardware. If it were my box, I'd move
ram modules as a first step. It costs about two minutes to eliminate
that possibility, but you seem reluctant to take that step. Heck, I'd
_hope_ it's something as simple as bad ram, because otherwise, the quest for
stability could become a time consuming and/or expensive undertaking...

If that didn't change anything, I'd go back and stress test a previously
stable configuration to gain confidence in my hardware. If 'uhoh, not
as stable as I thought' happened, and nothing is getting obviously hot
[1], I'd pray that it's an electrically noisy power supply, because
that's also easy and cheap. In any case, once I was very very confident
that my hardware was indeed sound, I'd move on to an agonizingly tedious
bisection, with no out of tree modules ever loaded, to narrow down when
this memory corruption that nobody else appears to be hitting appeared.

-Mike

1. Crappy heatsink compound can dry out and fracture, leaving a hot chip
under a relatively cool heatsink. This is exactly what I found when I
disassembled my suddenly unstable under heavy load P4 box a while back.
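
The bisection Mike describes is a binary search between a known-good and a known-bad kernel. A minimal sketch of the idea follows, under stated assumptions: the function first_bad, the predicate is_bad, and the build labels are all hypothetical, and a real run would test actual kernel builds rather than list entries.

```python
from typing import Callable, Sequence, Tuple


def first_bad(builds: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Binary-search an ordered list of builds for the first bad one.

    Assumes builds[0] is known good, builds[-1] is known bad, and that
    every build after the first bad one is also bad.
    Returns the first bad build and the number of builds tested.
    """
    lo, hi = 0, len(builds) - 1  # lo is always good, hi is always bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi], tests


# Hypothetical example: 1024 candidate builds, regression introduced at #700.
builds = [f"build-{i}" for i in range(1024)]
culprit, tests = first_bad(builds, lambda b: int(b.split("-")[1]) >= 700)
print(culprit, tests)
```

The number of test builds grows only logarithmically with the number of candidates, but with an oops that shows up "almost daily", each "good" verdict needs a long oops-free run before it can be trusted, which is presumably why the bisection would be so tedious.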

2007-12-21 12:54:47

by Volker Armin Hemmann

Subject: Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

On Friday, 21 December 2007, Mike Galbraith wrote:
> On Thu, 2007-12-20 at 19:14 +0100, Hemmann, Volker Armin wrote:
> > It is just.. It could be the hardware - but I should have seen the
> > same 'problem' with earlier kernels - and the 'almost daily oops' only
> > started with 2.6.23.
>
> Nonetheless, the oopsen _suggest_ hardware. If it were my box, I'd move
> ram modules as a first step. It costs about two minutes to eliminate
> that possibility, but you seem reluctant to take that step. Heck, I'd
> _hope_ it's something as simple as bad ram, because otherwise, the quest for
> stability could become a time consuming and/or expensive undertaking...
>

It costs a little bit more, but it will be part of the 'post-holiday special'.
As an intermediate step I increased the voltages of the ram - looks good so
far.

> If that didn't change anything, I'd go back and stress test a previously
> stable configuration to gain confidence in my hardware.

you mean like playing ut2004 with reduced fans or several instances of
cpuburn, or compiling something big like kdepim with kdeenablefinal? Done all
that ...

> If 'uhoh, not
> as stable as I thought' happened, and nothing is getting obviously hot
> [1], I'd pray that it's an electrically noisy power supply, because
> that's also easy and cheap.

yeah, it would be the least annoying variant. One PSU already ate two
computers, and this one is just two and a half months old - I have had my
share of bad PSUs. That is why I increased the voltages. Maybe it helps.

> In any case, once I was very very confident
> that my hardware was indeed sound, I'd move on to an agonizingly tedious
> bisection, with no out of tree modules ever loaded, to narrow down when
> this memory corruption that nobody else appears to be hitting appeared.
>
> -Mike
>
> 1. Crappy heatsink compound can dry out and fracture, leaving a hot chip
> under a relatively cool heatsink. This is exactly what I found when I
> disassembled my suddenly unstable under heavy load P4 box a while back.

Still, this would show up in the temps. And the temps are ok.

Glück Auf,
Volker