Hi all-
I have a problem, possibly HPET related, that's gotten a lot worse in
2.6.35 and later. It is present in the latest Fedora 14 kernel
(2.6.35-blahblahblah) and in kernel.org's 2.6.36.
After several minutes of uptime, my system freezes in a strange way.
All of userspace and some kernel services freeze hard (no mouse, no
typing, etc.) Caps-lock works intermittently, as does (I think)
pinging. The only debug output I've been able to get is this (typed
from a blurry photo because netconsole dies along with everything
else):
[ 5478.244722] Clocksource tsc unstable (delta = 101399051316 ns)
[ 5478.367958] Switching to clocksource hpet
[ 5543.729698] BUG: soft lockup - CPU #1 stuck for 61s! [migration/1:6]
[ 5543.729743] Modules linked in: [...]
At this point, even SysRq only works some of the time (it'll work for
a few seconds then die for awhile). Banging on the keyboard doesn't
appear to help, and I'd expect it to if all that was wrong was that I
was losing wakeups.
IIRC I got the tsc unstable messages on older kernels with similarly
insanely large deltas but the system kept working.
Booting with hpet=disable not only makes the system stable but
prevents the clocksource tsc unstable message. ntpd doesn't seem to
complain about my clock, either (and on a different machine with a (I
think) particularly crappy onboard clock, ntpd complains loudly, so I
assume that ntpd would tell me if my clock didn't work right).
CPU is Intel Xeon W5320 at 2.67GHz and /proc/cpuinfo says tsc, rdtscp,
constant_tsc, nonstop_tsc, so the TSC should be good.
Motherboard is Asus P6T WS PRO rev 1.xx running a BIOS from 02/26/2009
(version 0603).
My guess is that my hpet (or its driver) is broken enough to bring
down userspace and that the kernel incorrectly trusts the hpet more
than the TSC.
--Andy
Andrew Lutomirski wrote:
> After several minutes of uptime, my system freezes in a strange way.
> All of userspace and some kernel services freeze hard (no mouse, no
> typing, etc.) Caps-lock works intermittently, as does (I think)
> pinging. The only debug output I've been able to get is this (typed
> from a blurry photo because netconsole dies along with everything
> else):
>
> [ 5478.244722] Clocksource tsc unstable (delta = 101399051316 ns)
> [ 5478.367958] Switching to clocksource hpet
> [ 5543.729698] BUG: soft lockup - CPU #1 stuck for 61s! [migration/1:6]
Looks like this one:
http://lkml.org/lkml/2010/10/26/126
Regards,
Clemens
On Tue, Nov 09, 2010 at 04:20:56PM +0100, Clemens Ladisch wrote:
> Andrew Lutomirski wrote:
> > After several minutes of uptime, my system freezes in a strange way.
> > All of userspace and some kernel services freeze hard (no mouse, no
> > typing, etc.) Caps-lock works intermittently, as does (I think)
> > pinging. The only debug output I've been able to get is this (typed
> > from a blurry photo because netconsole dies along with everything
> > else):
> >
> > [ 5478.244722] Clocksource tsc unstable (delta = 101399051316 ns)
> > [ 5478.367958] Switching to clocksource hpet
> > [ 5543.729698] BUG: soft lockup - CPU #1 stuck for 61s! [migration/1:6]
>
> Looks like this one:
> http://lkml.org/lkml/2010/10/26/126
I don't think so: the one at the URL above is 37-rc1 specific (issue
came in during the merge window) while Andrew's problem can be
reproduced with 35 and 36, reportedly.
Anyway, adding more people to Cc.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Tue, Nov 9, 2010 at 10:38 AM, Borislav Petkov <[email protected]> wrote:
> On Tue, Nov 09, 2010 at 04:20:56PM +0100, Clemens Ladisch wrote:
>> Andrew Lutomirski wrote:
>> > After several minutes of uptime, my system freezes in a strange way.
>> > All of userspace and some kernel services freeze hard (no mouse, no
>> > typing, etc.) ?Caps-lock works intermittently, as does (I think)
>> > pinging. ?The only debug output I've been able to get is this (typed
>> > from a blurry photo because netconsole dies along with everything
>> > else):
>> >
>> > [ 5478.244722] Clocksource tsc unstable (delta = 101399051316 ns)
>> > [ 5478.367958] Switching to clocksource hpet
>> > [ 5543.729698] BUG: soft lockup - CPU #1 stuck for 61s! [migration/1:6]
>>
>> Looks like this one:
>> http://lkml.org/lkml/2010/10/26/126
>
> I don't think so: the one at the URL above is 37-rc1 specific (issue
> came in during the merge window) while Andrew's problem can be
> reproduced with 35 and 36, reportedly.
>
> Anyway, adding more people to Cc.
I may have just reproduced after a couple of days of uptime it in
2.6.36 with hpet=disable (in any case, the system hung and netconsole
didn't work).
I'm currently testing the hpet_min_tick stuff, backported to 2.6.36,
with hpet on. Haven't had a problem yet, but I'll see what happens
after a few hours. So far, it said:
ACPI: HPET id: 0x8086a301 base: 0xfed00000 min tick: 128
Is there any way that a tsc delta of >100 seconds occurring awhile
after bootup on Nehalem with a single socket can real (as opposed to a
bug making the kernel think that there was a delta when there really
wasn't)?
--Andy
>
> --
> Regards/Gruss,
> Boris.
>
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
>
>
On Tue, Nov 09, 2010 at 12:44:48PM -0500, Andrew Lutomirski wrote:
> I may have just reproduced after a couple of days of uptime it in
> 2.6.36 with hpet=disable (in any case, the system hung and netconsole
> didn't work).
>
> I'm currently testing the hpet_min_tick stuff, backported to 2.6.36,
> with hpet on. Haven't had a problem yet, but I'll see what happens
> after a few hours. So far, it said:
>
> ACPI: HPET id: 0x8086a301 base: 0xfed00000 min tick: 128
This doesn't say what your BIOS has set it to originally. For that, do
hexdump -C /sys/firmware/acpi/tables/HPET
and check what does the u16 little endian value at offset 0x35 say.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Tue, Nov 9, 2010 at 1:01 PM, Borislav Petkov <[email protected]> wrote:
> On Tue, Nov 09, 2010 at 12:44:48PM -0500, Andrew Lutomirski wrote:
>> I may have just reproduced after a couple of days of uptime it in
>> 2.6.36 with hpet=disable (in any case, the system hung and netconsole
>> didn't work).
>>
>> I'm currently testing the hpet_min_tick stuff, backported to 2.6.36,
>> with hpet on. ?Haven't had a problem yet, but I'll see what happens
>> after a few hours. ?So far, it said:
>>
>> ACPI: HPET id: 0x8086a301 base: 0xfed00000 min tick: 128
>
> This doesn't say what your BIOS has set it to originally. For that, do
>
> hexdump -C /sys/firmware/acpi/tables/HPET
>
> and check what does the u16 little endian value at offset 0x35 say.
0x37ee, which people have seen before. In any case, even with the
patch applied my system hung, and I still got this off netconsole
sometime around when the hang happened:
Clocksource: tsc unstable (delta = -34355296774 ns)
Switching: to clocksource hpet
Another odd symptom: none of the SysRq hotkeys I could think of did
anything *except for SysRq-b*. I have no idea what could cause that.
My system appears to pass ingo's time-warp-test (when compiled with
-O0 -- it segfaults with -O2), so I think my TSC is okay.
--Andy
On Tue, Nov 9, 2010 at 4:29 PM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, Nov 9, 2010 at 1:01 PM, Borislav Petkov <[email protected]> wrote:
>> On Tue, Nov 09, 2010 at 12:44:48PM -0500, Andrew Lutomirski wrote:
>>> I may have just reproduced after a couple of days of uptime it in
>>> 2.6.36 with hpet=disable (in any case, the system hung and netconsole
>>> didn't work).
>>>
>>> I'm currently testing the hpet_min_tick stuff, backported to 2.6.36,
>>> with hpet on. ?Haven't had a problem yet, but I'll see what happens
>>> after a few hours. ?So far, it said:
>>>
>>> ACPI: HPET id: 0x8086a301 base: 0xfed00000 min tick: 128
>>
>> This doesn't say what your BIOS has set it to originally. For that, do
>>
>> hexdump -C /sys/firmware/acpi/tables/HPET
>>
>> and check what does the u16 little endian value at offset 0x35 say.
>
> 0x37ee, which people have seen before. ?In any case, even with the
> patch applied my system hung, and I still got this off netconsole
> sometime around when the hang happened:
>
> Clocksource: tsc unstable (delta = -34355296774 ns)
> Switching: to clocksource hpet
Please disregard -- this is a bug in nouveau (or drm) not hpet. I'll
send a bug report to the maintainers.
--Andy
>
> Another odd symptom: none of the SysRq hotkeys I could think of did
> anything *except for SysRq-b*. ?I have no idea what could cause that.
>
> My system appears to pass ingo's time-warp-test (when compiled with
> -O0 -- it segfaults with -O2), so I think my TSC is okay.
>
> --Andy
>
On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
> > Clocksource: tsc unstable (delta = -34355296774 ns)
> > Switching: to clocksource hpet
>
> Please disregard -- this is a bug in nouveau (or drm) not hpet. I'll
> send a bug report to the maintainers.
Interesting! Joerg was complaining about similar symptoms with .36 today
too.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
On Wed, Nov 10, 2010 at 1:50 PM, Borislav Petkov <[email protected]> wrote:
> On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
>> > Clocksource: tsc unstable (delta = -34355296774 ns)
>> > Switching: to clocksource hpet
>>
>> Please disregard -- this is a bug in nouveau (or drm) not hpet. ?I'll
>> send a bug report to the maintainers.
>
> Interesting! Joerg was complaining about similar symptoms with .36 today
> too.
Well, there is a clocksource sort-of-bug that could cause confusion:
when something totally unrelated to clocksources goes out to lunch,
the clocksource watchdog decides that the clocksource is unstable and
complains, steering everyone toward filing the wrong bug.
tglx?
>
> --
> Regards/Gruss,
> Boris.
>
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
>
On Wed, 10 Nov 2010, Andrew Lutomirski wrote:
> On Wed, Nov 10, 2010 at 1:50 PM, Borislav Petkov <[email protected]> wrote:
> > On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
> >> > Clocksource: tsc unstable (delta = -34355296774 ns)
> >> > Switching: to clocksource hpet
> >>
> >> Please disregard -- this is a bug in nouveau (or drm) not hpet. ?I'll
> >> send a bug report to the maintainers.
> >
> > Interesting! Joerg was complaining about similar symptoms with .36 today
> > too.
>
> Well, there is a clocksource sort-of-bug that could cause confusion:
> when something totally unrelated to clocksources goes out to lunch,
> the clocksource watchdog decides that the clocksource is unstable and
> complains, steering everyone toward filing the wrong bug.
How should the clocksource watchdog code know that something went to
lunch? The fact that we need to monitor TSC at all is horrible enough,
adding further heuristics to detect extended lunch breaks would be
just a PITA.
Maybe we could print a different warning when we see large negative
deltas, which is the main indicator for the system being stuck for
quite a time while TSC advances happily.
Thanks,
tglx
On Wed, Nov 10, 2010 at 3:52 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 10 Nov 2010, Andrew Lutomirski wrote:
>
>> On Wed, Nov 10, 2010 at 1:50 PM, Borislav Petkov <[email protected]> wrote:
>> > On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
>> >> > Clocksource: tsc unstable (delta = -34355296774 ns)
>> >> > Switching: to clocksource hpet
>> >>
>> >> Please disregard -- this is a bug in nouveau (or drm) not hpet. ?I'll
>> >> send a bug report to the maintainers.
>> >
>> > Interesting! Joerg was complaining about similar symptoms with .36 today
>> > too.
>>
>> Well, there is a clocksource sort-of-bug that could cause confusion:
>> when something totally unrelated to clocksources goes out to lunch,
>> the clocksource watchdog decides that the clocksource is unstable and
>> complains, steering everyone toward filing the wrong bug.
>
> How should the clocksource watchdog code know that something went to
> lunch? The fact that we need to monitor TSC at all is horrible enough,
> adding further heuristics to detect extended lunch breaks would be
> just a PITA.
Could the clocksource watchdog detect when it gets woken up (i.e. when
the hardware timer fires) instead of when it gets scheduled? It is
internal to the timing code, after all...
Alternatively, maybe the watchdog could just compare the TSC timestamp
to the current value according to some other clock (PIT? whatever
clockevent is in use?) instead of just using the time passed into the
delayed work in the first place.
>
> Maybe we could print a different warning when we see large negative
> deltas, which is the main indicator for the system being stuck for
> quite a time while TSC advances happily.
That would be an easy fix.
--Andy
On Wed, 10 Nov 2010, Andrew Lutomirski wrote:
> On Wed, Nov 10, 2010 at 3:52 PM, Thomas Gleixner <[email protected]> wrote:
> > On Wed, 10 Nov 2010, Andrew Lutomirski wrote:
> >
> >> On Wed, Nov 10, 2010 at 1:50 PM, Borislav Petkov <[email protected]> wrote:
> >> > On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
> >> >> > Clocksource: tsc unstable (delta = -34355296774 ns)
> >> >> > Switching: to clocksource hpet
> >> >>
> >> >> Please disregard -- this is a bug in nouveau (or drm) not hpet. ?I'll
> >> >> send a bug report to the maintainers.
> >> >
> >> > Interesting! Joerg was complaining about similar symptoms with .36 today
> >> > too.
> >>
> >> Well, there is a clocksource sort-of-bug that could cause confusion:
> >> when something totally unrelated to clocksources goes out to lunch,
> >> the clocksource watchdog decides that the clocksource is unstable and
> >> complains, steering everyone toward filing the wrong bug.
> >
> > How should the clocksource watchdog code know that something went to
> > lunch? The fact that we need to monitor TSC at all is horrible enough,
> > adding further heuristics to detect extended lunch breaks would be
> > just a PITA.
>
> Could the clocksource watchdog detect when it gets woken up (i.e. when
> the hardware timer fires) instead of when it gets scheduled? It is
> internal to the timing code, after all...
>
> Alternatively, maybe the watchdog could just compare the TSC timestamp
> to the current value according to some other clock (PIT? whatever
> clockevent is in use?) instead of just using the time passed into the
> delayed work in the first place.
It compares to some other clock. i.e. HPET in your case, but the
extended lunch break was larger than the HPET wrap around time :(
Thanks,
tglx
On Wed, Nov 10, 2010 at 01:50:31PM -0500, Borislav Petkov wrote:
> On Wed, Nov 10, 2010 at 01:48:00PM -0500, Andrew Lutomirski wrote:
> > > Clocksource: tsc unstable (delta = -34355296774 ns)
> > > Switching: to clocksource hpet
> >
> > Please disregard -- this is a bug in nouveau (or drm) not hpet. I'll
> > send a bug report to the maintainers.
>
> Interesting! Joerg was complaining about similar symptoms with .36 today
> too.
The issue I have seen was on two boxes with 890FX chipset and Phenom II
X6 CPUs on a recent avi/master which was 2.6.36 + some kvm patches.
The problem was that the box became extremly sluggish and slow. I stated
dstat and it reported a lot of missed ticks. After some time the
clocksource tsc became unstable and the system became responsive again.
I have not yet investigated this further but can do any requested tests.
Joerg
--
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632