2016-04-12 00:58:11

by Shaohua Li

Subject: [RFC 1/2] time: workaround crappy hpet

Calvin found that 'perf record -a --call-graph dwarf -- sleep 5' caused the
clocksource to switch to hpet. We saw a similar symptom on another machine. Here
is an example:

[8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
[8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
[8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
[8224517.569849] Switched to clocksource hpet

On both machines, wd_now is 0xffffffff. The tsc time looks correct; the CPU runs at 2.5GHz:
(0x48ceac7013714e - 0x48ceac25be34ac) / 2500000000 = 0.4988s
0.4988s matches WATCHDOG_INTERVAL. Since the hpet reads 0xffffffff on both
machines, this doesn't look like a coincidence; the hpet is crappy.
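
A quick stand-alone check of that arithmetic (the 2.5GHz TSC frequency is an
assumption here, taken from the machine above):

#include <stdio.h>

int main(void)
{
	unsigned long long cs_now  = 0x48ceac7013714eULL;
	unsigned long long cs_last = 0x48ceac25be34acULL;
	double tsc_hz = 2.5e9;	/* assumed 2.5GHz TSC */

	/* prints ~0.4988, i.e. roughly WATCHDOG_INTERVAL (0.5s) */
	printf("%.4f s\n", (double)(cs_now - cs_last) / tsc_hz);
	return 0;
}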

This patch tries to work around the issue: we retry the read if the hpet returns
0xffffffff. On the affected machine, a retried read no longer returns 0xffffffff.
The chance that a good hpet counter reads exactly 0xffffffff is very small, so
this patch should have no impact on good hardware.

I'm open to a better solution if there is one.

Reported-by: Calvin Owens <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/kernel/hpet.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a1f0e4a..333b57c 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -763,7 +763,23 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
*/
static cycle_t read_hpet(struct clocksource *cs)
{
-	return (cycle_t)hpet_readl(HPET_COUNTER);
+	unsigned int ret;
+	static bool checked;
+	ret = hpet_readl(HPET_COUNTER);
+
+	if (unlikely(ret == 0xffffffff && !checked)) {
+		int i;
+		for (i = 0; i < 20; i++) {
+			ret = hpet_readl(HPET_COUNTER);
+			if (ret != 0xffffffff)
+				break;
+		}
+		if (i == 20) {
+			WARN_ONCE(true, "HPET counter value is abnormal\n");
+			checked = true;
+		}
+	}
+	return (cycle_t)ret;
}

static struct clocksource clocksource_hpet = {
--
2.8.0.rc2


2016-04-12 00:58:03

by Shaohua Li

Subject: [RFC 2/2] time: double check if watchdog clocksource is correct

We use the watchdog clocksource to detect an unstable clocksource. This
assumes the watchdog clocksource itself is correct, but it's possible the
watchdog clocksource is the crappy one; see the previous patch. Double-check
whether the watchdog interval is too long and, if so, bypass the potentially
wrong watchdog clocksource.

Signed-off-by: Shaohua Li <[email protected]>
---
kernel/time/clocksource.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 56ece14..36aff4e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
static void __clocksource_change_rating(struct clocksource *cs, int rating);

/*
- * Interval: 0.5sec Threshold: 0.0625s
+ * Interval: 0.5sec, Max Interval: 0.75sec, Threshold: 0.0625s
*/
#define WATCHDOG_INTERVAL (HZ >> 1)
+#define WATCHDOG_MAX_INTERVAL ((NSEC_PER_SEC >> 1) + (NSEC_PER_SEC >> 2))
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)

static void clocksource_watchdog_work(struct work_struct *work)
@@ -217,7 +218,8 @@ static void clocksource_watchdog(unsigned long data)
 			continue;
 
 		/* Check the deviation from the watchdog clocksource. */
-		if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
+		if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD &&
+		    wd_nsec < WATCHDOG_MAX_INTERVAL) {
 			pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
 				smp_processor_id(), cs->name);
 			pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
--
2.8.0.rc2

2016-04-18 17:05:27

by John Stultz

Subject: Re: [RFC 1/2] time: workaround crappy hpet

On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <[email protected]> wrote:
> Calvin found that 'perf record -a --call-graph dwarf -- sleep 5' caused the
> clocksource to switch to hpet. We saw a similar symptom on another machine. Here
> is an example:
>
> [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
> [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
> [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
> [8224517.569849] Switched to clocksource hpet
>
> On both machines, wd_now is 0xffffffff. The tsc time looks correct; the CPU runs at 2.5GHz:
> (0x48ceac7013714e - 0x48ceac25be34ac) / 2500000000 = 0.4988s
> 0.4988s matches WATCHDOG_INTERVAL. Since the hpet reads 0xffffffff on both
> machines, this doesn't look like a coincidence; the hpet is crappy.
>
> This patch tries to work around the issue: we retry the read if the hpet returns
> 0xffffffff. On the affected machine, a retried read no longer returns 0xffffffff.
> The chance that a good hpet counter reads exactly 0xffffffff is very small, so
> this patch should have no impact on good hardware.
>
> I'm open to a better solution if there is one.

Hrm..

So can you characterize this bad behavior a bit more for us? Does
every read of the HPET return 0xFFFFFFFF? Or does it just
occasionally start returning -1 values? Or once it trips and starts
returning -1, does it always return -1?

I'm trying to understand if there is a way to catch and disqualify
that clocksource earlier than in the watchdog logic.

thanks
-john

2016-04-18 17:31:11

by John Stultz

Subject: Re: [RFC 2/2] time: double check if watchdog clocksource is correct

On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <[email protected]> wrote:
> We use the watchdog clocksource to detect an unstable clocksource. This
> assumes the watchdog clocksource itself is correct, but it's possible the
> watchdog clocksource is the crappy one; see the previous patch. Double-check
> whether the watchdog interval is too long and, if so, bypass the potentially
> wrong watchdog clocksource.
>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> kernel/time/clocksource.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 56ece14..36aff4e 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -122,9 +122,10 @@ static int clocksource_watchdog_kthread(void *data);
> static void __clocksource_change_rating(struct clocksource *cs, int rating);
>
> /*
> - * Interval: 0.5sec Threshold: 0.0625s
> + * Interval: 0.5sec, Max Interval: 0.75sec, Threshold: 0.0625s
> */
> #define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_MAX_INTERVAL ((NSEC_PER_SEC >> 1) + (NSEC_PER_SEC >> 2))

Is there a reason this #define is so obtusely stated? It's all going to
be pre-computed by the compiler, so I'm not sure why make it more
opaque to the casual reader. Maybe something more like:
(750 * NSEC_PER_MSEC)
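
For what it's worth, a stand-alone compile-time check that the suggested form
and the original expression agree (the constants are copied from
include/linux/time64.h and are assumptions here):

#include <assert.h>

#define NSEC_PER_SEC	1000000000L
#define NSEC_PER_MSEC	1000000L

/* (0.5s + 0.25s) in nanoseconds equals 750 milliseconds in nanoseconds. */
static_assert((NSEC_PER_SEC >> 1) + (NSEC_PER_SEC >> 2) == 750 * NSEC_PER_MSEC,
	      "WATCHDOG_MAX_INTERVAL should spell 0.75s");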

> #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
>
> static void clocksource_watchdog_work(struct work_struct *work)
> @@ -217,7 +218,8 @@ static void clocksource_watchdog(unsigned long data)
> continue;
>
> /* Check the deviation from the watchdog clocksource. */
> - if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
> + if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD &&
> + wd_nsec < WATCHDOG_MAX_INTERVAL) {
> pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> smp_processor_id(), cs->name);
> pr_warn(" '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
> --


My main concern with these tweaks has been that we might have a case
where there's a slow TSC (the old cpufreq-affected style, or maybe halts
in deep idle), so everything is moving slowly and the timers are
firing late. Then the watchdog interval would *look* too long
and we wouldn't have a good tool for disqualifying that bad TSC.

That said, I am continuing to hear cases of problematic watchdog
disqualifications where the watchdog is to blame, due to wraparound
caused by VM delays, PREEMPT_RT, or softirq processing, so something
here would be nice.

Gratin (cc'ed) was also looking at this, and we had some thoughts
about trying to avoid the watchdog wrap-around issue by checking whether
the clocksource's counter, divided by the watchdog wraparound interval
(in nsecs), matched the watchdog's counter (again, in nsecs).
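
A rough, stand-alone sketch of that kind of check (the function name, the
slack value and the interface are all made up for illustration; this is not
proposing actual kernel code):

#include <stdbool.h>
#include <stdint.h>

/*
 * Toy illustration: if the watchdog counter wrapped N times between two
 * watchdog runs, the clocksource delta should exceed the watchdog delta
 * by roughly N whole wrap periods.  All values are in nanoseconds.
 */
static bool looks_like_watchdog_wrap(int64_t cs_ns, int64_t wd_ns,
				     int64_t wrap_ns, int64_t slack_ns)
{
	int64_t diff = cs_ns - wd_ns;
	int64_t rem;

	if (diff <= 0)
		return false;	/* clocksource not ahead: no wrap suspected */

	rem = diff % wrap_ns;
	/* Is the excess close to a whole number of wrap periods? */
	return rem < slack_ns || wrap_ns - rem < slack_ns;
}

For the 32-bit HPET above, wrap_ns would be the counter mask plus one,
converted to nanoseconds at the HPET frequency.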

But it seems like for your case where the watchdog hardware is bad,
that still wouldn't catch it. Something like a watchdog for the watchdog,
using something like the RTC, might be an approach, but I worry that we will
hit the same problematic "can't trust the RTC on hardware foo" issue
down the road. Having a watchdog quorum between multiple counters
might be the only way there, but on many devices there are really only
two main options. :/
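
To make the quorum idea concrete, here is a toy sketch (purely illustrative;
the threshold and interface are invented): with three interval measurements
over the same watchdog period, trust the median and flag whichever source
strays from it.

#include <stdint.h>

static int64_t median3(int64_t a, int64_t b, int64_t c)
{
	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;
	return c;
}

/* Returns the index of the outlier among three deltas (ns), or -1 if all agree. */
static int suspect_source(const int64_t ns[3], int64_t threshold_ns)
{
	int64_t ref = median3(ns[0], ns[1], ns[2]);
	int i;

	for (i = 0; i < 3; i++) {
		int64_t d = ns[i] - ref;

		if (d < 0)
			d = -d;
		if (d > threshold_ns)
			return i;
	}
	return -1;
}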

Thomas: Any other thoughts? Are these just oddball cases that have to
be dealt with via tsc=reliable boot arguments? Should we print
something informative about that option when we disqualify the TSC?

thanks
-john

2016-04-18 17:33:10

by Shaohua Li

Subject: Re: [RFC 1/2] time: workaround crappy hpet

On Mon, Apr 18, 2016 at 10:05:22AM -0700, John Stultz wrote:
> On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <[email protected]> wrote:
> > Calvin found that 'perf record -a --call-graph dwarf -- sleep 5' caused the
> > clocksource to switch to hpet. We saw a similar symptom on another machine. Here
> > is an example:
> >
> > [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
> > [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
> > [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
> > [8224517.569849] Switched to clocksource hpet
> >
> > On both machines, wd_now is 0xffffffff. The tsc time looks correct; the CPU runs at 2.5GHz:
> > (0x48ceac7013714e - 0x48ceac25be34ac) / 2500000000 = 0.4988s
> > 0.4988s matches WATCHDOG_INTERVAL. Since the hpet reads 0xffffffff on both
> > machines, this doesn't look like a coincidence; the hpet is crappy.
> >
> > This patch tries to work around the issue: we retry the read if the hpet returns
> > 0xffffffff. On the affected machine, a retried read no longer returns 0xffffffff.
> > The chance that a good hpet counter reads exactly 0xffffffff is very small, so
> > this patch should have no impact on good hardware.
> >
> > I'm open to a better solution if there is one.
>
> Hrm..
>
> So can you characterize this bad behavior a bit more for us? Does
> every read of the HPET return 0xFFFFFFFF? Or does it just
> occasionally start returning -1 values? Or once it trips and starts
> returning -1, does it always return -1?
>
> I'm trying to understand if there is a way to catch and disqualify
> that clocksource earlier than in the watchdog logic.

The HPET returns 0xffffffff occasionally and can still return a
normal value after it has returned -1. I have no idea when the issue happens
and when it doesn't.

Thanks,
Shaohua

2016-04-18 17:42:43

by John Stultz

Subject: Re: [RFC 1/2] time: workaround crappy hpet

On Mon, Apr 18, 2016 at 10:32 AM, Shaohua Li <[email protected]> wrote:
> On Mon, Apr 18, 2016 at 10:05:22AM -0700, John Stultz wrote:
>> On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <[email protected]> wrote:
>> > Calvin found that 'perf record -a --call-graph dwarf -- sleep 5' caused the
>> > clocksource to switch to hpet. We saw a similar symptom on another machine. Here
>> > is an example:
>> >
>> > [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
>> > [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
>> > [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
>> > [8224517.569849] Switched to clocksource hpet
>> >
>> > On both machines, wd_now is 0xffffffff. The tsc time looks correct; the CPU runs at 2.5GHz:
>> > (0x48ceac7013714e - 0x48ceac25be34ac) / 2500000000 = 0.4988s
>> > 0.4988s matches WATCHDOG_INTERVAL. Since the hpet reads 0xffffffff on both
>> > machines, this doesn't look like a coincidence; the hpet is crappy.
>> >
>> > This patch tries to work around the issue: we retry the read if the hpet returns
>> > 0xffffffff. On the affected machine, a retried read no longer returns 0xffffffff.
>> > The chance that a good hpet counter reads exactly 0xffffffff is very small, so
>> > this patch should have no impact on good hardware.
>> >
>> > I'm open to a better solution if there is one.
>>
>> Hrm..
>>
>> So can you characterize this bad behavior a bit more for us? Does
>> every read of the HPET return 0xFFFFFFFF? Or does it just
>> occasionally start returning -1 values? Or once it trips and starts
>> returning -1, does it always return -1?
>>
>> I'm trying to understand if there is a way to catch and disqualify
>> that clocksource earlier than in the watchdog logic.
>
> The HPET returns 0xffffffff occasionally and can still return a
> normal value after it has returned -1. I have no idea when the issue happens
> and when it doesn't.

So from the code, it seems like it occasionally recovers after 20
reads, but sometimes it doesn't? Do you have any sense of the max
bound on the number of reads for which it will give you the -1 value?

That's an ugly problem. Other than something like what you have, where we
re-read until we get a valid value (which could cause major unexpected
latencies), I'm not sure what to do other than try to add some logic
like we have with the TSC to mark it bad. Though even there, we don't
detect the issue until we're in a read, and there's no "good" value to
return w/o causing trouble. So it's really too late at that point.

I'm sort of on the edge of just adding a blacklist entry for the HPET
on this hardware. I'm not sure it's something that can be easily
handled generically. I *hope* you only see this issue on one sort of
hardware?
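
For illustration, such a blacklist entry could be wired up roughly like the
sketch below. The DMI strings are placeholders, keying off DMI at all is an
assumption, and it reuses the existing boot_hpet_disable flag (the one the
nohpet parameter sets); a real quirk would have to match the affected boards
and run before the HPET is registered:

#include <linux/dmi.h>
#include <linux/init.h>
#include <linux/printk.h>
#include <asm/hpet.h>

static int __init hpet_counter_broken(const struct dmi_system_id *d)
{
	pr_warn("HPET disabled on %s: counter intermittently reads 0xffffffff\n",
		d->ident);
	boot_hpet_disable = 1;
	return 1;
}

static const struct dmi_system_id hpet_blacklist[] __initconst = {
	{
		.callback = hpet_counter_broken,
		.ident    = "Hypothetical vendor board",
		.matches  = {
			DMI_MATCH(DMI_SYS_VENDOR, "Example Vendor"),
			DMI_MATCH(DMI_PRODUCT_NAME, "Example Product"),
		},
	},
	{ }
};

/* Somewhere early in setup: dmi_check_system(hpet_blacklist); */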

thanks
-john

2016-04-18 17:49:08

by Shaohua Li

Subject: Re: [RFC 1/2] time: workaround crappy hpet

On Mon, Apr 18, 2016 at 10:42:38AM -0700, John Stultz wrote:
> On Mon, Apr 18, 2016 at 10:32 AM, Shaohua Li <[email protected]> wrote:
> > On Mon, Apr 18, 2016 at 10:05:22AM -0700, John Stultz wrote:
> >> On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li <[email protected]> wrote:
> >> > Calvin found that 'perf record -a --call-graph dwarf -- sleep 5' caused the
> >> > clocksource to switch to hpet. We saw a similar symptom on another machine. Here
> >> > is an example:
> >> >
> >> > [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
> >> > [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
> >> > [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
> >> > [8224517.569849] Switched to clocksource hpet
> >> >
> >> > On both machines, wd_now is 0xffffffff. The tsc time looks correct; the CPU runs at 2.5GHz:
> >> > (0x48ceac7013714e - 0x48ceac25be34ac) / 2500000000 = 0.4988s
> >> > 0.4988s matches WATCHDOG_INTERVAL. Since the hpet reads 0xffffffff on both
> >> > machines, this doesn't look like a coincidence; the hpet is crappy.
> >> >
> >> > This patch tries to work around the issue: we retry the read if the hpet returns
> >> > 0xffffffff. On the affected machine, a retried read no longer returns 0xffffffff.
> >> > The chance that a good hpet counter reads exactly 0xffffffff is very small, so
> >> > this patch should have no impact on good hardware.
> >> >
> >> > I'm open to a better solution if there is one.
> >>
> >> Hrm..
> >>
> >> So can you characterize this bad behavior a bit more for us? Does
> >> every read of the HPET return 0xFFFFFFFF? Or does it just
> >> occasionally start returning -1 values? Or once it trips and starts
> >> returning -1, does it always return -1?
> >>
> >> I'm trying to understand if there is a way to catch and disqualify
> >> that clocksource earlier than in the watchdog logic.
> >
> > The HPET returns 0xffffffff occasionally and can still return a
> > normal value after it has returned -1. I have no idea when the issue happens
> > and when it doesn't.
>
> So from the code, it seems like it occasionally recovers after 20
> reads, but sometimes it doesn't? Do you have any sense of the max
> bound on the number of reads for which it will give you the -1 value?
>
> That's an ugly problem. Other than something like what you have, where we
> re-read until we get a valid value (which could cause major unexpected
> latencies), I'm not sure what to do other than try to add some logic
> like we have with the TSC to mark it bad. Though even there, we don't
> detect the issue until we're in a read, and there's no "good" value to
> return w/o causing trouble. So it's really too late at that point.
>
> I'm sort of on the edge of just adding a blacklist entry for the HPET
> on this hardware. I'm not sure it's something that can be easily
> handled generically. I *hope* you only see this issue on one sort of
> hardware?

A blacklist is an option for sure. We saw the issue on several machines,
but it seems they are all the same type. I hope we can have a defensive
way to handle such a problem if it happens on other hardware.

Thanks,
Shaohua

2016-04-18 17:55:00

by John Stultz

Subject: Re: [RFC 1/2] time: workaround crappy hpet

On Mon, Apr 18, 2016 at 10:48 AM, Shaohua Li <[email protected]> wrote:
> On Mon, Apr 18, 2016 at 10:42:38AM -0700, John Stultz wrote:
>> I'm sort of on the edge of just adding a blacklist entry for the HPET
>> on this hardware. I'm not sure it's something that can be easily
>> handled generically. I *hope* you only see this issue on one sort of
>> hardware?
>
> A blacklist is an option for sure. We saw the issue on several machines,
> but it seems they are all the same type. I hope we can have a defensive
> way to handle such a problem if it happens on other hardware.

Yea. I'm not sure what can be done. Filtering -1 values from the HPET
seems problematic for latency reasons. I'm not sure we should punish
all the fine hardware w/o the issue just to try to detect the case
where it's bad.

Especially as this is something particularly hard to detect
separately from other, more common, well-known bad hardware.

thanks
-john