2005-03-28 18:57:06

by Olivier Fourdan

Subject: Various issues after rebooting

Hi all,

I'm facing various odd issues with an AMD64-based laptop (Compaq
R3480EA) I bought recently.

On first boot, everything is all right. The laptop runs flawlessly. But
if I shut down the laptop and restart it, I can see all kinds of strange
things happening.

1) the system clock runs 3 times faster,
2) the system is unable to mount cdroms,
3) modprobing ndiswrapper causes a whole-system freeze with the following
message:

CPU 0: Machine Check Exception: 0000000000000004
Bank 4: b200000000070f0f
Kernel panic - not syncing: CPU context corrupt
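
The two hex values in that message can be decoded by hand. The sketch below assumes the first number is the global machine-check status register (MCG_STATUS) and the second is the status register of bank 4, and it uses only the architectural x86 machine-check bit positions; the helper name `set_flags` is made up for illustration, and the model-specific field is extracted but not interpreted.

```python
# Architectural bit positions of the x86 machine-check status registers.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
MCI_STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                   59: "MISCV", 58: "ADDRV", 57: "PCC"}


def set_flags(value, names):
    """Return the names of the bits set in value, highest bit first.

    Hypothetical helper for illustration; not a kernel function.
    """
    return [name for bit, name in sorted(names.items(), reverse=True)
            if (value >> bit) & 1]


# Values copied from the panic message (assumed to be MCG_STATUS and
# the bank 4 status register respectively).
mcg_status = 0x0000000000000004
bank_status = 0xb200000000070f0f

print(set_flags(mcg_status, MCG_STATUS_BITS))    # global status flags
print(set_flags(bank_status, MCI_STATUS_BITS))   # bank status flags
print(hex(bank_status & 0xffff))                 # MCA error code (bits 0-15)
print(hex((bank_status >> 16) & 0xffff))         # model-specific code (bits 16-31)
```

Under those assumptions the bank reports a valid, uncorrected, enabled error with PCC (processor context corrupt) set, which is consistent with the "CPU context corrupt" panic text.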

I've tried various kernels and distributions in 32-bit and 64-bit
modes, but that makes no difference.

I also tried disabling ACPI, setting clock=[tsc|pmtmr|pit], disabling APIC,
etc. No luck. No matter how many reboots I do, the problem remains. The
only way to fix the problem is to keep the laptop off for a couple of
hours.

I thought of a hardware issue, but in WinXP, everything is fine. And in
the case of a hardware issue, I guess the problem would always show, not
just in Linux after a reboot.

My guess is that the BIOS doesn't re-initialize the hardware correctly
in case of a quick shutdown/reboot but WinXP might be initializing the
things by itself (it's a guess, I'm probably completely wrong).

Does that make any sense to someone? How could I help track down this
issue?

Thanks in advance,

Best regards,
Olivier.



2005-03-28 19:21:02

by Willy Tarreau

Subject: Re: Various issues after rebooting

Hi,

On Mon, Mar 28, 2005 at 09:56:39PM +0200, Olivier Fourdan wrote:
(...)
> I thought of a hardware issue, but in WinXP, everything is fine. And in
> the case of a hardware issue, I guess the problem would always show, not
> just in Linux after a reboot.
>
> My guess is that the BIOS doesn't re-initialize the hardware correctly
> in case of a quick shutdown/reboot but WinXP might be initializing the
> things by itself (it's a guess, I'm probably completely wrong).

I had the same sort of problems with my crappy VAIO (which, fortunately, is
dead now). The bios did not initialize anything, and there were many
situations where it would not recover after a reboot. The most common one
was the local APIC. It was guaranteed that if I rebooted while I had used
local APIC, the BIOS would not detect the hard disk at next boot ! And if
I booted 2.6 and used the frame buffer, then I would have no screen at
next boot, which was not really a problem since it would also timeout on
the disk 10 seconds later...

> Does that make any sense to someone? How could I help track down this
> issue?

Now I have a compaq (nc8000) which does not exhibit such buggy behaviour,
but you can try disabling the APIC too just in case it's a similar problem
(at least in 32 bits, I don't know if you can disable it in 64 bits mode).

Regards,
Willy

2005-03-28 19:30:41

by Olivier Fourdan

Subject: Re: Various issues after rebooting

Hi Willy

On Mon, 2005-03-28 at 21:20 +0200, Willy Tarreau wrote:
> Now I have a compaq (nc8000) which does not exhibit such buggy behaviour,
> but you can try disabling the APIC too just in case it's a similar problem
> (at least in 32 bits, I don't know if you can disable it in 64 bits mode).

Thanks for the hint, but unfortunately, it's one of the first things I
tried, and that makes no difference.

Regards,
Olivier.



2005-03-28 19:39:41

by Willy Tarreau

Subject: Re: Various issues after rebooting

On Mon, Mar 28, 2005 at 09:30:26PM +0200, Olivier Fourdan wrote:
> Hi Willy
>
> On Mon, 2005-03-28 at 21:20 +0200, Willy Tarreau wrote:
> > Now I have a compaq (nc8000) which does not exhibit such buggy behaviour,
> > but you can try disabling the APIC too just in case it's a similar problem
> > (at least in 32 bits, I don't know if you can disable it in 64 bits mode).
>
> Thanks for the hint, but unfortunately, it's one of the first things I
> tried, and that makes no difference.

Sorry, at first I only noticed ACPI in your mail, but after reading it
again, I also noticed APIC. So now, you can only try not to initialize
some peripherals (IDE, network, display, etc...) by removing their drivers
from the kernel. You may end up with a kernel panic, but that does not
matter if you boot it with "panic=5" so that it automatically reboots
5 seconds after the panic. You should then finally identify the subsystem
which is responsible for your problems. Perhaps you'll even need to remove
PCI support :-(

Regards,
Willy

2005-03-28 20:10:45

by Olivier Fourdan

Subject: Re: Various issues after rebooting

Hi Willy,

On Mon, 2005-03-28 at 21:39 +0200, Willy Tarreau wrote:
> Sorry, at first I only noticed ACPI in your mail, but after reading it
> again, I also noticed APIC. So now, you can only try not to initialize
> some peripherals (IDE, network, display, etc...) by removing their drivers
> from the kernel. You may end up with a kernel panic, but that does not
> matter if you boot it with "panic=5" so that it automatically reboots
> 5 seconds after the panic. You should then finally identify the subsystem
> which is responsible for your problems. Perhaps you'll even need to remove
> PCI support :-(

Well, actually, the system runs (at least) unless I try to load
"ndiswrapper" which leads to a kernel panic.

I tried to bring the issue to the ndiswrapper ML but I doubt that
ndiswrapper is faulty.

I can reliably predict the crash. If the clock (and all other time based
events) are too fast, then modprobing ndiswrapper will lead to a system
crash, just like mounting a CDROM will fail.

I think the clock speed and other effects are just signs, not the cause
of the problem. What I'd like to determine is what would need to be done
to avoid the root cause, or maybe if there is anything that can be done
in Linux to avoid that?

I just tried "acpi_fake_ecdt" but that leads to an immediate kernel
panic.

Ps: Given the crash (Machine check exception), the sleep option seems to
have no effect.

Thanks,
Olivier.




2005-03-29 21:29:14

by Olivier Fourdan

Subject: Clock 3x too fast on AMD64 laptop [WAS Re: Various issues after rebooting]

Hi all

Following my own thread, I found the following error in dmesg:

PM-Timer running at invalid rate: 33% of normal - aborting.

I found that interesting because 33% is 1/3 and the clock runs exactly
3x faster than normal...
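
The arithmetic behind that observation can be sketched as follows. The model assumes the kernel validates the PM timer by counting its ticks during a window timed by the PIT, and rejects it when the count is more than about 5% off the expected value; the two frequencies are the nominal hardware rates, while the window length, the tolerance, and the function names are illustrative assumptions rather than the kernel's actual code.

```python
PMTMR_HZ = 3579545   # nominal ACPI PM timer frequency
PIT_HZ = 1193182     # nominal PIT input clock


def measured_percent(pit_speedup, window_pit_ticks=11932):
    """PM-timer ticks seen in a PIT-timed window, as % of the expected count.

    pit_speedup is how many times faster than nominal the PIT really runs.
    The window length (about 10 ms of PIT ticks) is an assumed value.
    """
    expected = window_pit_ticks * PMTMR_HZ // PIT_HZ
    # A PIT running too fast ends the window early in real time.
    real_seconds = window_pit_ticks / (PIT_HZ * pit_speedup)
    counted = int(real_seconds * PMTMR_HZ)
    return 100 * counted // expected


def rate_ok(percent, tolerance=5):
    """Accept the PM timer only within +/- tolerance percent (assumed 5%)."""
    return abs(percent - 100) <= tolerance


for speedup in (1, 3):
    pct = measured_percent(speedup)
    print(speedup, pct, rate_ok(pct))
```

With a healthy PIT the PM timer measures 100% of the expected rate; with a PIT that runs three times too fast, a correct PM timer appears to run at 33%, which is the figure in the dmesg line.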

A bit of searching on Google gave me several links to posts from other
people with the exact same problem on similar hardware (AMD64 laptops),
but I could find neither the cause nor a fix for that issue. I think it
might be related to the other issues I observe when the clock goes too
fast.

Does that PM-Timer message make sense to someone knowledgeable?

Thanks in advance,

Cheers,
Olivier.


2005-03-29 22:02:18

by Olivier Fourdan

Subject: Re: Clock 3x too fast on AMD64 laptop [WAS Re: Various issues after rebooting]

Hi,

A quick look at the source shows that the error is triggered in
arch/i386/kernel/timers/timer_pm.c by the verify_pmtmr_rate() function.

My guess is that the pmtmr timer is right and the pit is wrong in my
case. That would explain why the clock is wrong when being based on pit
(like when forced with "clock=pit")

Maybe, if I can prove my guesses, a fix could be to "trust" the pmtmr
clock when the user has passed a "clock=pmtmr" argument ? Does that make
any sense ?

TIA
Olivier.





2005-03-29 22:11:30

by Dominik Brodowski

Subject: Re: Clock 3x too fast on AMD64 laptop [WAS Re: Various issues after rebooting]

On Wed, Mar 30, 2005 at 12:02:11AM +0200, Olivier Fourdan wrote:
> Hi,
>
> A quick look at the source shows that the error is triggered in
> arch/i386/kernel/timers/timer_pm.c by the verify_pmtmr_rate() function.
>
> My guess is that the pmtmr timer is right and the pit is wrong in my
> case. That would explain why the clock is wrong when being based on pit
> (like when forced with "clock=pit")
>
> Maybe, if I can prove my guesses, a fix could be to "trust" the pmtmr
> clock when the user has passed a "clock=pmtmr" argument ? Does that make
> any sense ?

This would make a lot of sense, IMHO. John, what do you think?

Dominik

2005-03-29 22:12:13

by john stultz

Subject: Re: Clock 3x too fast on AMD64 laptop [WAS Re: Various issues after rebooting]

On Wed, 2005-03-30 at 00:02 +0200, Olivier Fourdan wrote:
> A quick look at the source shows that the error is triggered in
> arch/i386/kernel/timers/timer_pm.c by the verify_pmtmr_rate() function.
>
> My guess is that the pmtmr timer is right and the pit is wrong in my
> case. That would explain why the clock is wrong when being based on pit
> (like when forced with "clock=pit")

Yea. From your description this is most likely the cause of the issue.
Currently the time of day is still tick-based, using the tsc/pmtmr/hpet
only for interpolating between ticks.
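
A toy model of tick-based timekeeping shows why the wall clock follows the interrupt rate. It assumes each timer interrupt advances kernel time by exactly 1/HZ seconds and ignores the sub-tick interpolation, which can only add less than one tick; the HZ value and the function name are assumptions for illustration.

```python
HZ = 1000  # assumed tick rate: one timer interrupt per millisecond


def kernel_seconds(real_seconds, interrupt_speedup):
    """Time the kernel believes has passed when ticks arrive too fast.

    Every interrupt is counted as 1/HZ seconds, whatever its real spacing.
    """
    ticks = int(real_seconds * HZ * interrupt_speedup)
    return ticks / HZ


print(kernel_seconds(60, 1))  # correct interrupt rate
print(kernel_seconds(60, 3))  # interrupts arriving three times too often
```

In this model a minute of real time is accounted as three minutes when interrupts arrive three times too often, regardless of which counter is used between ticks.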

> Maybe, if I can prove my guesses, a fix could be to "trust" the pmtmr
> clock when the user has passed a "clock=pmtmr" argument ? Does that make
> any sense ?

Well, if you tried the time of day re-work I've been working on it would
mask the issue somewhat, but you'd still have the problem that you are
taking too many timer interrupts.

One thing you could try is playing with the CLOCK_TICK_RATE value to see
if you just have very unique hardware.
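
To make that suggestion concrete, here is a rough sketch of how the PIT reload value follows from CLOCK_TICK_RATE. It assumes the usual rounding formula LATCH = (CLOCK_TICK_RATE + HZ/2) / HZ and HZ = 1000; the function names and the tripled input frequency are hypothetical values for illustration.

```python
def latch(clock_tick_rate, hz=1000):
    """PIT reload value for a given assumed input clock and tick rate."""
    return (clock_tick_rate + hz // 2) // hz


def interrupt_rate(actual_pit_hz, clock_tick_rate, hz=1000):
    """Real interrupts per second when the PIT's true input clock is
    actual_pit_hz but the kernel was built for clock_tick_rate."""
    return actual_pit_hz / latch(clock_tick_rate, hz)


NOMINAL = 1193182        # standard PIT input clock
TRIPLED = 3 * NOMINAL    # hypothetical: PIT clocked three times too fast

print(latch(NOMINAL))                    # reload value for the nominal clock
print(interrupt_rate(TRIPLED, NOMINAL))  # mismatch: about 3000 ticks/s
print(interrupt_rate(TRIPLED, TRIPLED))  # constant adjusted: about 1000 ticks/s
```

If the timer input really were running at three times the nominal frequency, a matching CLOCK_TICK_RATE would bring the interrupt rate back to roughly HZ.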

A similar sounding issue has also been reported here:
http://bugme.osdl.org/show_bug.cgi?id=3927

thanks
-john




2005-03-31 19:13:11

by Olivier Fourdan

Subject: Re: Clock 3x too fast on AMD64 laptop [WAS Re: Various issues after rebooting]

Hi John, Dominik,


On Tue, 2005-03-29 at 14:11 -0800, john stultz wrote:
> Yea. From your description this is most likely the cause of the issue.
> Currently the time of day is still tick-based, using the tsc/pmtmr/hpet
> only for interpolating between ticks.

Sorry for the late follow up. Unfortunately, a quick hack to disable the
"pmtmr" check shows that even when "trusting" the PM-Timer, the clock
and interrupts still run 3x too fast. That makes no difference.

> Well, if you tried the time of day re-work I've been working on it would
> mask the issue somewhat, but you'd still have the problem that you are
> taking too many timer interrupts.

Where could I get that patch from ? I'd be glad to do some testing for
you if you need it.

> One thing you could try is playing with the CLOCK_TICK_RATE value to see
> if you just have very unique hardware.

Problem is that the issue shows exactly after one quick power off/power
on sequence. It doesn't show after a real cold start (leaving the laptop
off for a couple of hours) or even after a reboot.

> A similar sounding issue has also been reported here:
> http://bugme.osdl.org/show_bug.cgi?id=3927

Not sure if that's the exact same problem. What I can say, after reading
that bug report, is that disabling ACPI and/or APIC makes no difference.
Specifying the clock=... makes no difference either. It doesn't seem
related to the AMD64 part of the kernel since it shows equally when
using a 64bit kernel and a 32bit kernel.

Moreover, when that bug shows, other problems show up as well (such as
the cdrom not being able to mount anything, or ndiswrapper crashing the
system with an MCE error).

At first, I thought the issue might be related to the nforce3, but the
bug refers to an ATI chipset so I guess it's not related to the nforce.

Anyway, it doesn't seem to be an uncommon issue with AMD64 based
hardware. I don't know where to start from though.

Cheers,
Olivier.