Subject: Re: [RFC 1/2] time: workaround crappy hpet
From: John Stultz
To: Shaohua Li
Cc: lkml, Thomas Gleixner, calvinowens@fb.com
Date: Mon, 18 Apr 2016 10:42:38 -0700
List-ID: linux-kernel@vger.kernel.org

On Mon, Apr 18, 2016 at 10:32 AM, Shaohua Li wrote:
> On Mon, Apr 18, 2016 at 10:05:22AM -0700, John Stultz wrote:
>> On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li wrote:
>> > Calvin found 'perf record -a --call-graph dwarf -- sleep 5' making the clocksource
>> > switch to hpet. We found a similar symptom on another machine. Here is an example:
>> >
>> > [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
>> > [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
>> > [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
>> > [8224517.569849] Switched to clocksource hpet
>> >
>> > On both machines, wd_now is 0xffffffff. The tsc time looks correct; the cpu runs at 2.5GHz:
>> > (0x48ceac7013714e - 0x48ceac25be34ac)/2500000000 = 0.4988s
>> > and 0.4988s matches WATCHDOG_INTERVAL. Since hpet reads 0xffffffff on both
>> > machines, this doesn't look like a coincidence; the hpet is crappy.
>> >
>> > This patch tries to work around the issue.
>> > We retry the read if the hpet returns 0xffffffff.
>> > On the affected machine, the hpet counter doesn't read 0xffffffff on the retry.
>> > The chance of the hpet counter being 0xffffffff is very small, so this patch
>> > should have no impact on a good hpet.
>> >
>> > I'm open to better solutions if there are any.
>>
>> Hrm..
>>
>> So can you characterize this bad behavior a bit more for us? Does
>> every read of the HPET return 0xFFFFFFFF? Or does it just
>> occasionally start returning -1 values? Or once it trips and starts
>> returning -1, does it always return -1?
>>
>> I'm trying to understand if there is a way to catch and disqualify
>> that clocksource earlier than in the watchdog logic.
>
> The HPET returns 0xffffffff occasionally and can still return a
> normal value after it returns -1. I have no idea when the issue
> happens and when it doesn't.

So from the code, it seems like it occasionally recovers after 20
reads, but sometimes it doesn't? Do you have any sense of the max
bound on the number of reads that will give you the -1 value?

That's an ugly problem. Other than something like you have, where we
re-read until we get a valid value (which could cause major unexpected
latencies), I'm not sure what to do other than try to add some logic
like we have with the TSC to mark it bad. Though even there, we don't
detect the issue until we're in a read, and there's no "good" value to
return without causing trouble. So it's really too late at that point.

I'm sort of on the edge of just adding a blacklist entry for the HPET
on this hardware. I'm not sure it's something that can be easily
handled generically. I *hope* you only see this issue on one sort of
hardware?

thanks
-john