2006-10-12 16:53:28

by David Johnson

Subject: Hardware bug or kernel bug?

Hi,

I'm having a major problem on a system that I've been unable to track down.
When using scp to transfer a large file (a few gig) over the network
(@100Mbit/s) the system will reboot after about 5-10 minutes of transfer. No
errors, just a reboot. I have another identical system which exhibits the
same behaviour.

The system is a Supermicro P4SCT+ with a hyperthreading P4. I've posted the
dmesg here:
http://www.david-web.co.uk/download/dmesg

I initially tried a different NIC in case that was at fault, but the results
were the same.

Changing the interrupt timer frequency in the kernel makes a difference:
100Hz - system reboots instantly when transfer is started
250Hz - reboots after a few seconds
1000Hz - reboots after 5-10 minutes

As the problem appears to be interrupt-related, I disabled the I/O APIC in the
BIOS (after first having to disable hyperthreading) which resulted in the
system lasting a bit longer before it reboots. I then tried disabling the
Local APIC as well but this made no difference.

I've tested with Centos' 2.6.9 kernel and with a vanilla 2.6.17.13 kernel and
the results are the same with both.

Does anyone have any idea whether this is likely to be a hardware problem or a
kernel problem?
Any suggestions for more ways to debug this would be gratefully received.

Thanks,
David.


2006-10-12 17:20:21

by Arjan van de Ven

Subject: Re: Hardware bug or kernel bug?

On Thu, 2006-10-12 at 17:53 +0100, David Johnson wrote:
> Hi,
>
> I'm having a major problem on a system that I've been unable to track down.
> When using scp to transfer a large file (a few gig) over the network
> (@100Mbit/s) the system will reboot after about 5-10 minutes of transfer. No
> errors, just a reboot. I have another identical system which exhibits the
> same behaviour.


could be a heat issue.... although.. the rest of what you describe
doesn't quite match that. Still.. just opening the case and using an
external fan to blow air in for 10 minutes should entirely disprove that
I suppose..

2006-10-12 19:13:35

by Linus Torvalds

Subject: Re: Hardware bug or kernel bug?



On Thu, 12 Oct 2006, David Johnson wrote:
>
> I'm having a major problem on a system that I've been unable to track down.
> When using scp to transfer a large file (a few gig) over the network
> (@100Mbit/s) the system will reboot after about 5-10 minutes of transfer. No
> errors, just a reboot. I have another identical system which exhibits the
> same behaviour.

A reboot usually indicates a serious hardware problem - it could be an
overheating sensor tripping, but it could be some serious corruption
causing a triple-fault or something like that too.

But the _most_ likely problem is just the power supply. If your power
supply is border-line, having something that stresses CPU, disk,
southbridge and networking at the same time may be just the way to cause a
power-fail signal, which usually causes an instant reboot.

> The system is a Supermicro P4SCT+ with a hyperthreading P4. I've posted the
> dmesg here:
> http://www.david-web.co.uk/download/dmesg
>
> I initially tried a different NIC in case that was at fault, but the results
> were the same.
>
> Changing the interrupt timer frequency in the kernel makes a difference:
> 100Hz - system reboots instantly when transfer is started
> 250Hz - reboots after a few seconds
> 1000Hz - reboots after 5-10 minutes

I think it just changes timings, and there is something timing-related
going on - like just instant power draw. The timer frequency should not
have any serious impact on heat, so I doubt it's about overheating, but
it's certainly worth opening the case and using one of those
compressed-air things to cool down the CPU and/or southbridge chips.

> As the problem appears to be interrupt-related, I disabled the I/O APIC in the
> BIOS (after first having to disable hyperthreading) which resulted in the
> system lasting a bit longer before it reboots. I then tried disabling the
> Local APIC as well but this made no difference.

Interrupts generally aren't problematic, I'd be more likely to suspect CPU
overclocking or similar (does the cpuinfo match the frequency claimed by
the BIOS?) or just some strange motherboard problem (which could be
firmware: bad programming of memory timings etc). So a BIOS upgrade is
worth looking into.
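
The cpuinfo check suggested above could be scripted roughly as follows. This is only a sketch: it assumes the `cpu MHz` field that x86 kernels print in /proc/cpuinfo, and the helper names and the 5% tolerance are illustrative choices, not anything from this thread.

```python
import re


def cpu_mhz_from_cpuinfo(text):
    """Return the 'cpu MHz' value of every logical CPU listed in cpuinfo text."""
    return [float(v) for v in re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", text, re.MULTILINE)]


def matches_bios(mhz_values, bios_mhz, tolerance=0.05):
    """True if at least one value was found and all are within `tolerance` of the BIOS speed."""
    return bool(mhz_values) and all(
        abs(v - bios_mhz) <= tolerance * bios_mhz for v in mhz_values
    )
```

On the affected machine one would pass the contents of /proc/cpuinfo to `cpu_mhz_from_cpuinfo` and the speed shown on the BIOS screen (typed in by hand) to `matches_bios`.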

Sometimes issues like this can be worked around - for example, maybe the
problem is the chipset having issues with concurrent DMA or something, so
turning off DMA on the disk drives could possibly at least _hide_ the
problem.

> Does anyone have any idea whether this is likely to be a hardware problem or a
> kernel problem?

Anything is possible, and it certainly _could_ be a kernel bug. There are
situations that cause triple-faults and insta-reboots. If the stack
pointer gets whacked in kernel space, you can get some bad bad stuff
happening.

But check the power supply first. And check to see if there is a BIOS
upgrade available. You can double-check the cooling: check that all
heat-sinks are properly seated and have appropriate amounts of thermal
grease. And blowing air from a compressed-air can on top of the things
until you see them frost over is certainly a good spot-check.

In other words, I'd almost bet on bad hardware.

Linus

2006-10-13 08:51:14

by Jarek Poplawski

Subject: Re: Hardware bug or kernel bug?

On 12-10-2006 18:53, David Johnson wrote:
> Hi,
>
> I'm having a major problem on a system that I've been unable to track down.
> When using scp to transfer a large file (a few gig) over the network
> (@100Mbit/s) the system will reboot after about 5-10 minutes of transfer. No
> errors, just a reboot. I have another identical system which exhibits the
> same behaviour.
...
> I've tested with Centos' 2.6.9 kernel and with a vanilla 2.6.17.13 kernel and
> the results are the same with both.
...
> Any suggestions for more ways to debug this would be gratefully received.

I'd try with this:
- minimal workable config with a lot of debugging turned on (e.g. no:
smp, floppy, parport, mouse, ipv6, video, clock modulation, apm, acpi
buttons, thermal etc. - only base acpi or no if possible),
- 2.4 kernel,
- other distro e.g. live-cd knoppix,
- other transfer method like ftp (all superfluous services turned off).

Jarek P.

2006-10-13 09:20:56

by David Johnson

Subject: Re: Hardware bug or kernel bug?

On Thursday 12 October 2006 20:13, you wrote:
>
> A reboot usually indicates a serious hardware problem - it could be an
> overheating sensor tripping, but it could be some serious corruption
> causing a triple-fault or something like that too.
>
> But the _most_ likely problem is just the power supply. If your power
> supply is border-line, having something that stresses CPU, disk,
> southbridge and networking at the same time may be just the way to cause a
> power-fail signal, which usually causes an instant reboot.

The power supplies in both machines on which I'm seeing the problem are brand
new, supposedly good quality and from different manufacturers. Could it be
that the motherboard has some fault which causes it to overload even good
power supplies?

> I think it just changes timings, and there is something timing-related
> going on - like just instant power draw. The timer frequency should not
> have any serious impact on heat, so I doubt it's about overheating, but
> it's certainly worth opening the case and using one of those
> compressed-air things to cool down the CPU and/or southbridge chips.

The motherboard has all the usual heat sensors and will alarm if something
gets too hot - I suspected overheating the first time this happened and
checked the temps in the BIOS, but everything was well within limits.

> Interrupts generally aren't problematic, I'd be more likely to suspect CPU
> overclocking or similar (does the cpuinfo match the frequency claimed by
> the BIOS?) or just some strange motherboard problem (which could be
> firmware: bad programming of memory timings etc). So a BIOS upgrade is
> worth looking into.

The cpuinfo does indeed match the reported BIOS speed. The boards are already
running the latest BIOS, so if it is a BIOS issue the motherboard
manufacturer isn't aware of it...

> Sometimes issues like this can be worked around - for example, maybe the
> problem is the chipset having issues with concurrent DMA or something, so
> turning off DMA on the disk drives could possibly at least _hide_ the
> problem.

I should have mentioned that of the two machines that are having the problem,
one is using IDE and the other SATA. The SATA machine seems worst affected by
it.

> But check the power supply first. And check to see if there is a BIOS
> upgrade available. You can double-check the cooling: check that all
> heat-sinks are properly seated and have appropriate amounts of thermal
> grease. And blowing air from a compressed-air can on top of the things
> until you see them frost over is certainly a good spot-check.

OK I'll give all that a go.

Thanks for your help,
David.

2006-10-13 09:20:54

by David Johnson

Subject: Re: Hardware bug or kernel bug?

On Friday 13 October 2006 09:56, Jarek Poplawski wrote:
>
> I'd try with this:
> - minimal workable config with a lot of debugging turned on (e.g. no:
> smp, floppy, parport, mouse, ipv6, video, clock modulation, apm, acpi
> buttons, thermal etc. - only base acpi or no if possible),
> - 2.4 kernel,
> - other distro e.g. live-cd knoppix,
> - other transfer method like ftp (all superfluous services turned off).
>

I'll give that a go and I guess I should also see whether I can reproduce it
under Windows too...

Cheers,
David.

2006-10-13 10:53:16

by Jarek Poplawski

Subject: Re: Hardware bug or kernel bug?

On Fri, Oct 13, 2006 at 10:20:48AM +0100, David Johnson wrote:
> On Friday 13 October 2006 09:56, Jarek Poplawski wrote:
> >
> > I'd try with this:
> > - minimal workable config with a lot of debugging turned on (e.g. no:
> > smp, floppy, parport, mouse, ipv6, video, clock modulation, apm, acpi
> > buttons, thermal etc. - only base acpi or no if possible),
> > - 2.4 kernel,
> > - other distro e.g. live-cd knoppix,
> > - other transfer method like ftp (all superfluous services turned off).
> >
>
> I'll give that a go and I guess I should also see whether I can reproduce it
> under Windows too...

Sure! After all we shouldn't be system nazis and let others do
some secondary jobs...

Regards,

Jarek P.

PS: I hope you tested it also under internal stress (heavy
copying plus computing).

2006-10-13 11:56:59

by David Johnson

Subject: Re: Hardware bug or kernel bug?

On Friday 13 October 2006 11:58, Jarek Poplawski wrote:
>
> PS: I hope you tested it also under internal stress (heavy
> copying plus computing).

Yes, I did. No individual factor triggers the bug (high CPU load, lots of disk
activity, high network load, etc.) nor does any other combination of factors
other than what I mentioned before (high network load, some disk activity,
some CPU load).

Both scp and rsync trigger it reliably, but FTP does not trigger it at all. So
CPU load (which scp and rsync generate but FTP does not) must be a key part
of the equation...

Regards,
David.

2006-10-13 13:02:04

by Jarek Poplawski

Subject: Re: Hardware bug or kernel bug?

On Fri, Oct 13, 2006 at 12:56:53PM +0100, David Johnson wrote:
> On Friday 13 October 2006 11:58, Jarek Poplawski wrote:
> >
> > PS: I hope you tested it also under internal stress (heavy
> > copying plus computing).
>
> Yes, I did. No individual factor triggers the bug (high CPU load, lots of disk
> activity, high network load, etc.) nor does any other combination of factors
> other than what I mentioned before (high network load, some disk activity,
> some CPU load).
>
> Both scp and rsync trigger it reliably, but FTP does not trigger it at all. So
> CPU load (which scp and rsync generate but FTP does not) must be a key part
> of the equation...

Probably - but only together with networking. So I'd try the debugging
from my first reply, plus maybe 2.6.19-rc1 (e1000 and locking are
improved there - btw. I hope the other card you tested was a different
model) and resend the conclusions to [email protected].

Cheers,

Jarek P.

2006-10-13 16:24:46

by David Johnson

Subject: Re: Hardware bug or kernel bug?

On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
>
> Probably - but only together with networking. So I'd try the debugging
> from my first reply, plus maybe 2.6.19-rc1 (e1000 and locking are
> improved there - btw. I hope the other card you tested was a different
> model) and resend the conclusions to [email protected].
>

OK I built a 2.6.19-rc1 kernel with a minimal config as you describe and I
cannot reproduce the reboots with this kernel. My .config:
http://www.david-web.co.uk/download/config

The other NIC I tried was a D-Link DL10050-based card which I think uses the
dl2k module.

I tried to reproduce the problem under Windows (2k), which didn't reboot but
did still suffer from it I believe. Randomly during an scp transfer (using
the PuTTY scp client) Windows will lock-up for about 30 seconds, making an
entry in the event log indicating that there was a time-out talking to the
IDE controller, then continuing. Could the same thing be happening in Linux?
If Linux can't talk to the IDE controller when trying to write to disk, how
does it handle that?

Regards,
David.

2006-10-13 16:45:50

by Alan

Subject: Re: Hardware bug or kernel bug?

On Fri, 2006-10-13 at 17:24 +0100, David Johnson wrote:
> IDE controller, then continuing. Could the same thing be happening in Linux?
> If Linux can't talk to the IDE controller when trying to write to disk, how
> does it handle that?

It will time out and then retry the command. It's not the most ideal
situation to end up in but I'd expect to see a DMA timeout and a retry
or two in the log not a crash.
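
To look for such entries, the kernel log could be filtered with something like the Python sketch below. The pattern is an assumption about how the messages are worded (DMA mentioned together with a timeout, or a lost interrupt) and would need adjusting to whatever the driver in use actually prints; the function name is illustrative.

```python
import re

# Assumed wording: a line that mentions DMA together with a timeout,
# or a lost interrupt. Adjust to what the driver in use really logs.
SUSPECT = re.compile(
    r"dma.*(timeout|timed out|timer_expiry)|lost interrupt", re.IGNORECASE
)


def suspect_lines(log_text):
    """Return the log lines that look like disk DMA timeouts or retries."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]
```

Feeding it the saved output of dmesg (or the syslog file) from just before a reboot would show whether any timeouts were logged at all; an empty result fits the "instant reboot, no errors" description.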

2006-10-16 10:20:06

by Jarek Poplawski

Subject: Re: Hardware bug or kernel bug?

On Fri, Oct 13, 2006 at 05:24:39PM +0100, David Johnson wrote:
> On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
> >
> > Probably - but only together with networking. So I'd try the debugging
> > from my first reply, plus maybe 2.6.19-rc1 (e1000 and locking are
> > improved there - btw. I hope the other card you tested was a different
> > model) and resend the conclusions to [email protected].
> >
>
> OK I built a 2.6.19-rc1 kernel with a minimal config as you describe and I
> cannot reproduce the reboots with this kernel. My .config:
> http://www.david-web.co.uk/download/config

I've seen more minimal minimal configs but if it works
it is 50% of success.

> The other NIC I tried was a D-Link DL10050-based card which I think uses the
> dl2k module.
>
> I tried to reproduce the problem under Windows (2k), which didn't reboot but
> did still suffer from it I believe. Randomly during an scp transfer (using
> the PuTTY scp client) Windows will lock-up for about 30 seconds, making an
> entry in the event log indicating that there was a time-out talking to the
> IDE controller, then continuing. Could the same thing be happening in Linux?
> If Linux can't talk to the IDE controller when trying to write to disk, how
> does it handle that?

Was this lock-up effect visible during above 2.6.19-rc1 tests?
If not I'd try to continue linux debugging:
- is 2.6.19-rc1 working with "normal" config (use make oldconfig
to "upgrade" .config),
- is 2.6.17 working with "minimal" config (use make oldconfig),
- changing one or two options at a time, try to find which one makes
the effect return (acpi, smp...).
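
To decide which options to flip, it may help to list where the "minimal" and "normal" .config files differ. A rough Python sketch, assuming the usual .config syntax (`CONFIG_FOO=y` and `# CONFIG_FOO is not set`); the function names are illustrative, and treating an option that is absent from a file as "n" is a simplifying assumption:

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option in a kernel .config to its value ('n' if 'is not set')."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(working, failing):
    """Return {option: (value_in_working, value_in_failing)} where the two differ."""
    a, b = parse_config(working), parse_config(failing)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))
        for name in sorted(set(a) | set(b))
        if a.get(name, "n") != b.get(name, "n")
    }
```

The resulting list is the set of candidates to enable one or two at a time in the working config until the reboots come back.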

Regards,
Jarek P.

PS: Sorry for late reply - I was offline.

2006-10-16 14:32:44

by David Johnson

Subject: Re: Hardware bug or kernel bug?

On Monday 16 October 2006 11:25, Jarek Poplawski wrote:
>
> Was this lock-up effect visible during above 2.6.19-rc1 tests?

No, I've not seen anything in Linux other than the reboots, which are instant
without any preceding lock-up.

> If not I'd try to continue linux debugging:
> - is 2.6.19-rc1 working with "normal" config (use make oldconfig
> to "upgrade" .config),

With 2.6.19-rc1 and a normal config, I get the reboots as usual.

> - is 2.6.17 working with "minimal" config (use make oldconfig),

Yes.

> - changing one or two options at a time, try to find which one makes
> the effect return (acpi, smp...).

I've found the culprit - CPU Frequency Scaling.
With it enabled I get the reboots, with it disabled I don't. That's the same
with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 & Centos'
2.6.9) The system was using the p4-clockmod driver and the ondemand governor.

I'm still not sure exactly what the problem is - the reboots only happen in
the circumstances I've mentioned and are not triggered by changes in clock
speed alone - but disabling cpufreq seems to make it go away...
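
For anyone checking whether a machine is in the same configuration, the active cpufreq driver and governor can be read from sysfs. A minimal Python sketch, assuming the standard cpufreq sysfs directory for CPU 0; the function name is illustrative:

```python
import os


def cpufreq_status(cpufreq_dir="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Read the active cpufreq driver, governor and current frequency (kHz).

    A value is None when its file is missing, e.g. when cpufreq is disabled.
    """
    status = {}
    for name in ("scaling_driver", "scaling_governor", "scaling_cur_freq"):
        try:
            with open(os.path.join(cpufreq_dir, name)) as f:
                status[name] = f.read().strip()
        except OSError:
            status[name] = None
    return status
```

If all three values come back as None, frequency scaling is most likely not active on that kernel, which is the state in which the reboots stopped here.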

Thanks for your help,
David.

2006-10-17 07:05:08

by Jarek Poplawski

Subject: Re: Hardware bug or kernel bug?

On Mon, Oct 16, 2006 at 03:32:38PM +0100, David Johnson wrote:
...
> I've found the culprit - CPU Frequency Scaling.
> With it enabled I get the reboots, with it disabled I don't. That's the same
> with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 & Centos'
> 2.6.9) The system was using the p4-clockmod driver and the ondemand governor.
>
> I'm still not sure exactly what the problem is - the reboots only happen in
> the circumstances I've mentioned and are not triggered by changes in clock
> speed alone - but disabling cpufreq seems to make it go away...

I see you devoted a lot of work and time to this testing,
and it will surely help people who read this to
diagnose similar problems. But I think it could be even
more valuable if you'd try (after some rest!) turning on
"Enable CPUfreq debugging" and adding cpufreq.debug=<value>
to the kernel command line (according to the help
screen), to see whether it returns any error messages that
could be sent to bugzilla and/or the cpufreq maintainer.

Best regards,

Jarek P.