2009-03-13 21:06:53

by Michael Tokarev

Subject: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?

Today (Friday the 13th) I had a very bad sequence of failures
with our servers, leading to data loss and almost a whole day
of very hard work. And now I'm *really* interested in where
the fault(s) lie.

What I have here is an AMD780G-based system (Asus M3A-H/HDMI
motherboard, latest BIOS) with an AMD Phenom 9750 CPU and 8 GB of
ECC memory. The system is built for KVM (Kernel-based Virtual
Machine) work, and is running several guests, but I'm not sure
anymore that KVM is related to the problem at hand.

The problem is that - it seems - timekeeping on this machine is
quite unreliable.

It's a Phenom, so the TSC should be synchronized across cores,
and it is indeed chosen as the clocksource at bootup. But
regardless of current_clocksource (tsc), the kernel keeps
increasing hpet min_delta_ns, like this:

Mar 13 19:58:02 gate kernel: CE: hpet increasing min_delta_ns to 15000 nsec
Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 22500 nsec
Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 33750 nsec
Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 50624 nsec
Mar 13 20:47:02 gate kernel: CE: hpet increasing min_delta_ns to 75936 nsec
Mar 13 20:48:17 gate kernel: CE: hpet increasing min_delta_ns to 113904 nsec
Mar 13 21:02:23 gate kernel: CE: hpet increasing min_delta_ns to 170856 nsec
Mar 13 21:05:27 gate kernel: CE: hpet increasing min_delta_ns to 256284 nsec
Mar 13 21:07:28 gate kernel: Clocksource tsc unstable (delta = 751920452 ns)
Mar 13 21:09:12 gate kernel: CE: hpet increasing min_delta_ns to 384426 nsec

and finally it declares the TSC unstable (second-to-last line) and
switches to the (unstable) HPET.
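
For reference, this is how I watch what the kernel is actually
using (a sketch; sysfs paths and /proc/timer_list as on my 2.6.28
box):

  # clocksources the kernel registered, and the active one
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  # per-CPU clock event devices; hpet shows up here even when
  # tsc is the clocksource
  grep "Clock Event Device" /proc/timer_list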

The HPET min_delta_ns keeps increasing further and further; I've
seen it go up to 576638 and beyond.

And, no surprise, the system is wildly unstable with KVM,
especially under load, even light load.

Today I was copying a relatively large amount of data over the
network from another machine to this one (to the host itself, not
to any virtual guest), and had numerous guest and host stalls and
lockups. At times the host stops doing anything at all, all guests
stall too, the load average jumps to 80 and more, and nothing
happens. I can still do things on the console, like running
top/strace, but nothing interesting shows up. I captured a SysRq+T
trace of this situation here: http://www.corpit.ru/mjt/host-high-la
-- everything I was able to find in kern.log.
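
(For the record, the trace was captured roughly like this -- a
sketch, assuming the magic SysRq interface is compiled in, which
it is on this box:)

  # make sure the magic SysRq interface is enabled
  echo 1 > /proc/sys/kernel/sysrq
  # dump the state of all tasks into the kernel log
  echo t > /proc/sysrq-trigger
  # the dump ends up in dmesg / kern.log
  dmesg | tail -n 500 > host-high-la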

After some time, sometimes several seconds, sometimes up to 10
minutes, the thing gets "unstuck" and continues working. Today it
happened after about 10 minutes. But after it resumed, 2 of the KVM
guests were eating 100% CPU and did not respond at all. The SysRq+T
trace of this is available at http://www.corpit.ru/mjt/guest-stuck
-- two KVM guests were unresponsive.

There's more - the system started showing sporadic, random I/O
errors unrelated to the disks. For example, one of the software
RAID5 arrays started behaving so oddly that finally, after a
reboot, I had to re-create the array and some of the filesystems
on it (something I have never seen in the ~10 years I've been
using software RAID on Linux, on many different systems and disks
and with various failure cases).

Now I have switched to the acpi_pm clocksource, and also tried
disabling nested page tables in KVM (kvm_amd npt=0). With that,
everything is slow and sluggish, but I was finally able to copy
the data without errors while the guests were running.
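
Concretely, what I did was roughly this (a sketch; the sysfs path
and module option are as on my box, and all guests have to be shut
down before kvm_amd can be reloaded):

  # switch the clocksource at runtime
  echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
  # reload kvm_amd with nested page tables disabled
  rmmod kvm_amd
  modprobe kvm_amd npt=0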

It was about to get stuck as before, but I noticed it had switched
to hpet (see "tsc unstable" above), forced it to use acpi_pm
instead, and it survived.

So, to hell with it all, and ignoring the magical Friday the 13th --
whose fault is it?

o Why does it declare the TSC unstable, when a Phenom is supposed
to keep it in sync?
o Why is the HPET malfunctioning?
o Why is the system time on this machine dog slow without special
adjtimex adjustments (see the sketch below), while it worked
before (circa 2.6.26) and Windows works OK here?
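
The "special adjtimex adjustments" are along these lines (a sketch
using the Debian adjtimex(8) utility; the exact values are from
memory and box-specific, so treat them as illustrative only):

  # show the kernel's current time variables (tick, freq, ...)
  adjtimex --print
  # the nominal tick is 10000 usec; raising it makes the software
  # clock run faster (roughly 8.64 s/day per unit)
  adjtimex --tick 10001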

For reference:

https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2351676&group_id=180599
-- kvm bug on sourceforge, without any visible interest in even looking at it

http://www.google.com/search?q=CE%3A+hpet+increasing+min_delta_ns
-- numerous references to that "CE: hpet increasing min_delta_ns" on the 'net,
mostly for C2Ds, mentioning various lockup issues

http://marc.info/?t=123246270000002&r=1&w=2 --
"slow clock on AMD 740G chipset" -- it's about the clock issue, also without
any visible interest.

What's the next thing to do here? I for one don't want to see
today's failures again; it was a very, and I mean *very*, difficult
day restoring the functionality of this system (and it isn't fully
restored yet because of the slowness of its current state).

Thanks!

/tired mjt.


2009-03-23 08:06:06

by Ingo Molnar

Subject: Re: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?


* Michael Tokarev <[email protected]> wrote:

> Today (Friday the 13th) I had a very bad sequence of failures
> with our servers, leading to data loss and almost a whole day
> of very hard work. And now I'm *really* interested in where
> the fault(s) lie.
>
> What I have here is an AMD780G-based system (Asus M3A-H/HDMI
> motherboard, latest BIOS) with an AMD Phenom 9750 CPU and 8 GB of
> ECC memory. The system is built for KVM (Kernel-based Virtual
> Machine) work, and is running several guests, but I'm not sure
> anymore that KVM is related to the problem at hand.
>
> The problem is that - it seems - timekeeping on this machine is
> quite unreliable.
>
> It's a Phenom, so the TSC should be synchronized across cores,
> and it is indeed chosen as the clocksource at bootup. But
> regardless of current_clocksource (tsc), the kernel keeps
> increasing hpet min_delta_ns, like this:
>
> Mar 13 19:58:02 gate kernel: CE: hpet increasing min_delta_ns to 15000 nsec
> Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 22500 nsec
> Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 33750 nsec
> Mar 13 19:59:16 gate kernel: CE: hpet increasing min_delta_ns to 50624 nsec
> Mar 13 20:47:02 gate kernel: CE: hpet increasing min_delta_ns to 75936 nsec
> Mar 13 20:48:17 gate kernel: CE: hpet increasing min_delta_ns to 113904 nsec
> Mar 13 21:02:23 gate kernel: CE: hpet increasing min_delta_ns to 170856 nsec
> Mar 13 21:05:27 gate kernel: CE: hpet increasing min_delta_ns to 256284 nsec
> Mar 13 21:07:28 gate kernel: Clocksource tsc unstable (delta = 751920452 ns)
> Mar 13 21:09:12 gate kernel: CE: hpet increasing min_delta_ns to 384426 nsec
>
> and finally it declares the TSC unstable (second-to-last line) and
> switches to the (unstable) HPET.
>
> The HPET min_delta_ns keeps increasing further and further; I've
> seen it go up to 576638 and beyond.
>
> And, no surprise, the system is wildly unstable with KVM,
> especially under load, even light load.
>
> Today I was copying a relatively large amount of data over the
> network from another machine to this one (to the host itself, not
> to any virtual guest), and had numerous guest and host stalls and
> lockups. At times the host stops doing anything at all, all guests
> stall too, the load average jumps to 80 and more, and nothing
> happens. I can still do things on the console, like running
> top/strace, but nothing interesting shows up. I captured a SysRq+T
> trace of this situation here: http://www.corpit.ru/mjt/host-high-la
> -- everything I was able to find in kern.log.

403

> After some time, sometimes several seconds, sometimes up to 10
> minutes, the thing gets "unstuck" and continues working. Today it
> happened after about 10 minutes. But after it resumed, 2 of the KVM
> guests were eating 100% CPU and did not respond at all. The SysRq+T
> trace of this is available at http://www.corpit.ru/mjt/guest-stuck
> -- two KVM guests were unresponsive.

403 too.

> There's more - the system started showing sporadic, random I/O
> errors unrelated to the disks. For example, one of the software
> RAID5 arrays started behaving so oddly that finally, after a
> reboot, I had to re-create the array and some of the filesystems
> on it (something I have never seen in the ~10 years I've been
> using software RAID on Linux, on many different systems and disks
> and with various failure cases).
>
> Now I have switched to the acpi_pm clocksource, and also tried
> disabling nested page tables in KVM (kvm_amd npt=0). With that,
> everything is slow and sluggish, but I was finally able to copy
> the data without errors while the guests were running.
>
> It was about to get stuck as before, but I noticed it had switched
> to hpet (see "tsc unstable" above), forced it to use acpi_pm
> instead, and it survived.
>
> So, to hell with it all, and ignoring the magical Friday the 13th --
> whose fault is it?
>
> o Why does it declare the TSC unstable, when a Phenom is supposed
> to keep it in sync?

the TSC can drift slowly between cores, and it might not be in sync
at bootup time already. You can check the TSC from user-space (on
any kernel) via time-warp-test:

http://redhat.com/~mingo/time-warp-test/MINI-HOWTO
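
Roughly (a sketch, assuming the single-file time-warp-test.c that
sits next to that MINI-HOWTO):

  # fetch, build and run the test; it reports whenever the TSC
  # (or gettimeofday) is seen going backwards between CPUs
  wget http://redhat.com/~mingo/time-warp-test/time-warp-test.c
  gcc -O2 -o time-warp-test time-warp-test.c
  ./time-warp-test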

> o Why is the HPET malfunctioning?

That's a question for Thomas, I guess.

> o Why is the system time on this machine dog slow without special
> adjtimex adjustments, while it worked before (circa 2.6.26) and
> Windows works OK here?
>
> For reference:
>
> https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2351676&group_id=180599
> -- kvm bug on sourceforge, without any visible interest in even looking at it
>
> http://www.google.com/search?q=CE%3A+hpet+increasing+min_delta_ns
> -- numerous references to that "CE: hpet increasing min_delta_ns" on the 'net,
> mostly for C2Ds, mentioning various lockup issues
>
> http://marc.info/?t=123246270000002&r=1&w=2 --
> "slow clock on AMD 740G chipset" -- it's about the clock issue, also without
> any visible interest.
>
> What's the next thing to do here? I for one don't want to see
> today's failures again; it was a very, and I mean *very*, difficult
> day restoring the functionality of this system (and it isn't fully
> restored yet because of the slowness of its current state).

It's not clear which kernel you tried - if you tried a recent enough
one then I'd chalk this up as a yet-unfixed timekeeping problem -
which probably had ripple effects on KVM and the rest of the system.

What would be helpful is to debug the problem :-) First verify that
basic timekeeping is OK: does 'time sleep 10' really take precisely
10 seconds? Does 'date' advance precisely one second per physical
second?
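
I.e. something as quick-and-dirty as this (a sketch; the ntpdate
check only applies if it is installed and an NTP server is
reachable):

  # wall-clock sanity: 'real' should be almost exactly 10 seconds
  time sleep 10
  # the system clock should advance by exactly one minute here
  date; sleep 60; date
  # and it should not be drifting visibly against an NTP reference
  ntpdate -q pool.ntp.org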

A generic hw/sw state output of:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

would also help people taking a look at this problem.
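
I.e. something like this (assuming the script can simply be fetched
and run as root, with its output attached to a reply):

  # grab the script and capture whatever it prints
  wget http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
  sh cfs-debug-info.sh > cfs-debug-info.txt 2>&1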

If the problem persists, there might be a chance to debug it without
rebooting the system. Rebooting and trying out various patches won't
really work well for a server, I guess.

Ingo

2009-03-23 08:29:24

by Michael Tokarev

Subject: Re: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?

Ingo, I had already lost all hope of hearing anything about this
one.. Surprise! Thank you for replying!

Ingo Molnar wrote:
[]
>> top/strace, but nothing interesting shows up. I captured a SysRq+T
>> trace of this situation here: http://www.corpit.ru/mjt/host-high-la
>> -- everything I was able to find in kern.log.
>
> 403

Fixed both. Didn't notice it was mode 0640 (I copied the kern.log). I
just checked my Apache access.log - no one but a few bots had even
looked at those pages before you. Oh well.

[]
>> So, to hell with it all, and ignoring the magical Friday the 13th --
>> whose fault is it?
>>
>> o Why does it declare the TSC unstable, when a Phenom is supposed
>> to keep it in sync?
>
> the TSC can drift slowly between cores, and it might not be in sync
> at bootup time already. You check check the TSC from user-space (on
> any kernel) via time-warp-test:
>
> http://redhat.com/~mingo/time-warp-test/MINI-HOWTO

Aha. Will do. But see below.

>> o Why is the HPET malfunctioning?
>
> That's a question for Thomas, I guess.
>
>> o Why is the system time on this machine dog slow without special
>> adjtimex adjustments, while it worked before (circa 2.6.26) and
>> Windows works OK here?
>>
>> For reference:
>>
>> https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2351676&group_id=180599
>> -- kvm bug on sourceforge, without any visible interest in even looking at it
>>
>> http://www.google.com/search?q=CE%3A+hpet+increasing+min_delta_ns
>> -- numerous references to that "CE: hpet increasing min_delta_ns" on the 'net,
>> mostly for C2Ds, mentioning various lockup issues
>>
>> http://marc.info/?t=123246270000002&r=1&w=2 --
>> "slow clock on AMD 740G chipset" -- it's about the clock issue, also without
>> any visible interest.
>>
>> What's the next thing to do here? I for one don't want to see
>> today's failures again; it was a very, and I mean *very*, difficult
>> day restoring the functionality of this system (and it isn't fully
>> restored yet because of the slowness of its current state).
>
> It's not clear which kernel you tried - if you tried a recent enough
> one then I'd chalk this up as a yet-unfixed timekeeping problem -
> which probably had ripple effects on KVM and the rest of the system.

It is 2.6.28.7, compiled for x86-64 (64-bit).
The config is at http://www.corpit.ru/mjt/2.6.28.7-x86-64.config

> What would be helpful is to debug the problem :-) First verify that
> basic timekeeping is OK: does 'time sleep 10' really take precisely
> 10 seconds? Does 'date' advance precisely one second per physical
> second?
[...]

I'll try - maybe today. The thing is that this is a production machine
running quite a few (virtual) servers which make up our entire
infrastructure. When it started misbehaving on the 13th (just because
of high load, not because of failures or any other changes), our
whole office came to a stop... ;)

Now, after quite a bit of googling around, I tried disabling the HPET
by booting with the hpet=disable parameter. And that fixed all the
problems at once. 7 days of uptime, I have stress-tested it several
times, it has been running with the TSC as clocksource since boot
(there is still a problem within the guests, as they see an unstable
TSC anyway), and no issues have been logged. Even cpufreq works as
expected...
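
For completeness, this is all it took (a sketch; grub-legacy config
as on this box, kernel line abbreviated):

  # /boot/grub/menu.lst -- append hpet=disable to the kernel line:
  #   kernel /vmlinuz-2.6.28.7 root=... hpet=disable
  # after the reboot, verify what the kernel picked:
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource   # -> tsc
  dmesg | grep -i hpet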

Note that I tried disabling hpet as the clocksource several times,
but without any noticeable effect - the kernel still used hpet and
hpet2 for something, and printed that scary "increasing min_delta_ns"
message on a semi-regular basis, usually after the next 'stuck'
state....

> A generic hw/sw state output of:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> would also help people taking a look at this problem.
>
> If the problem persists, there might be a chance to debug it without
> rebooting the system. Rebooting and trying out various patches won't
> really work well for a server, I guess.

..so it has to be rebooted to re-enable the HPET. Hence I won't do
it before the evening. But I really want to debug and fix this
issue, as it has given me quite some headaches and I want to kill it
once and for all ;)

Thanks for noticing this!

/mjt

2009-03-23 15:42:22

by Ingo Molnar

Subject: Re: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?


* Michael Tokarev <[email protected]> wrote:

> Now, after quite a bit of googling around, I tried disabling the
> HPET by booting with the hpet=disable parameter. And that fixed
> all the problems at once. 7 days of uptime, I have stress-tested
> it several times, it has been running with the TSC as clocksource
> since boot (there is still a problem within the guests, as they
> see an unstable TSC anyway), and no issues have been logged. Even
> cpufreq works as expected...
>
> Note that I tried disabling hpet as the clocksource several times,
> but without any noticeable effect - the kernel still used hpet and
> hpet2 for something, and printed that scary "increasing
> min_delta_ns" message on a semi-regular basis, usually after the
> next 'stuck' state....

It could again go bad like it did before - those messages are signs
of HPET weirdnesses.

Probably your box's hpet needs to be blacklisted, so that it gets
disabled automatically on bootup.

Ingo

2009-03-23 16:03:09

by Michael Tokarev

Subject: Re: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?

Ingo Molnar wrote:
> * Michael Tokarev <[email protected]> wrote:
>
>> Now, after quite a bit of googling around, I tried disabling the
>> HPET by booting with the hpet=disable parameter. And that fixed
>> all the problems at once. 7 days of uptime, I have stress-tested
>> it several times, it has been running with the TSC as clocksource
>> since boot (there is still a problem within the guests, as they
>> see an unstable TSC anyway), and no issues have been logged.
>> Even cpufreq works as expected...
[]
> It could again go bad like it did before - those messages are signs
> of HPET weirdnesses.
>
> Probably your box's hpet needs to be blacklisted, so that it gets
> disabled automatically on bootup.

Well, I'm not convinced at all... at least not yet ;)

The reason is simple: this box was rock solid a few months back,
with 2.6.25 and 2.6.26 kernels I think. It had problems with KVM
(bugs), and hardware support in general was lacking (both the
chipset and the Phenom CPU were still too new to be fully
supported). At that time I set the machine up (it started as a
test install on a random hdd, then I added the real drives and
installed the real distro), with quite a lot of data copying back
and forth (we were rearranging partitions, RAID arrays, guests and
so on, copying data to another disk, to another machine and back).
There was not a single issue, not a single mention of TSC or HPET
instability, and the system time was stable too. But at some point
-- unfortunately I don't know when exactly, and it would sure be
interesting to know, I'll try to figure it out -- it first started
showing system clock weirdness, and finally it came to this Friday
the 13th incident.

That is all to say: it was stable with an earlier kernel, and now
it's not. Maybe, just maybe, at that time the HPET wasn't
supported, or wasn't used, or wasn't supported well enough to be
relied on - I've no idea. If that's the case, I'll just shut up
now because the whole point becomes moot.

Maybe it was due to a somehow-broken BIOS -- I did several BIOS
updates there, mostly because Linux complained about something
scary (something akin to "wasting that many megabytes of memory
due to the BIOS not setting something up (GART? IOMMU?)") and I
was hoping to fix that. And maybe it will be fixed in the BIOS
someday...

(By the way: how bad is the lack of an HPET? It's used for
something, and having it malfunctioning and disabled does
not sound good, especially on a machine which is running close
to its maximum... Maybe I should return the mobo? :)

Thanks!

/mjt

2009-03-23 16:14:16

by Ingo Molnar

Subject: Re: phenom, amd780g, tsc, hpet, kvm, kernel -- who's at fault?


* Michael Tokarev <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Michael Tokarev <[email protected]> wrote:
>>
>>> Now, after quite a bit of googling around, I tried disabling the
>>> HPET by booting with the hpet=disable parameter. And that fixed
>>> all the problems at once. 7 days of uptime, I have stress-tested
>>> it several times, it has been running with the TSC as clocksource
>>> since boot (there is still a problem within the guests, as they
>>> see an unstable TSC anyway), and no issues have been logged. Even
>>> cpufreq works as expected...
> []
>> It could again go bad like it did before - those messages are signs of
>> HPET weirdnesses.
>>
>> Probably your box's hpet needs to be blacklisted, so that it gets
>> disabled automatically on bootup.
>
> Well, I'm not convinced at all... at least not yet ;)
>
> The reason is simple: this box was rock solid a few months back,
> with 2.6.25 and 2.6.26 kernels I think. It had problems with KVM
> (bugs), and hardware support in general was lacking (both the
> chipset and the Phenom CPU were still too new to be fully
> supported). At that time I set the machine up (it started as a
> test install on a random hdd, then I added the real drives and
> installed the real distro), with quite a lot of data copying back
> and forth (we were rearranging partitions, RAID arrays, guests and
> so on, copying data to another disk, to another machine and back).
> There was not a single issue, not a single mention of TSC or HPET
> instability, and the system time was stable too. But at some point
> -- unfortunately I don't know when exactly, and it would sure be
> interesting to know, I'll try to figure it out -- it first started
> showing system clock weirdness, and finally it came to this Friday
> the 13th incident.
>
> That is all to say: it was stable with an earlier kernel, and now
> it's not. Maybe, just maybe, at that time the HPET wasn't
> supported, or wasn't used, or wasn't supported well enough to be
> relied on - I've no idea. If that's the case, I'll just shut up
> now because the whole point becomes moot.

We added force-enabling of the HPETs of certain boards over the past
few kernel releases. Do you have kernel logs from earlier kernels?
Do you know for sure that .28 was the first one that enabled the
hpet?

> Maybe it was due to a somehow-broken BIOS -- I did several BIOS
> updates there, mostly because Linux complained about something
> scary (something akin to "wasting that many megabytes of memory
> due to the BIOS not setting something up (GART? IOMMU?)") and I
> was hoping to fix that. And maybe it will be fixed in the BIOS
> someday...
>
> (By the way: how bad is the lack of an HPET? It's used for
> something, and having it malfunctioning and disabled does not
> sound good, especially on a machine which is running close to its
> maximum... Maybe I should return the mobo? :)

An HPET isn't really important for server workloads. It's useful in
terms of keeping dynticks timeouts long on the desktop - but on a
busy server it has little relevance.

Ingo