Subject: Re: [RFC 1/2] time: workaround crappy hpet
From: John Stultz
To: Shaohua Li
Cc: lkml, Thomas Gleixner, calvinowens@fb.com
Date: Mon, 18 Apr 2016 10:42:38 -0700
List-ID: linux-kernel@vger.kernel.org

On Mon, Apr 18, 2016 at 10:32 AM, Shaohua Li wrote:
> On Mon, Apr 18, 2016 at 10:05:22AM -0700, John Stultz wrote:
>> On Mon, Apr 11, 2016 at 5:57 PM, Shaohua Li wrote:
>> > Calvin found 'perf record -a --call-graph dwarf -- sleep 5' making the clocksource
>> > switch to hpet. We found a similar symptom on another machine. Here is an example:
>> >
>> > [8224517.520885] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
>> > [8224517.540032] 'hpet' wd_now: ffffffff wd_last: b39c0bd mask: ffffffff
>> > [8224517.553092] 'tsc' cs_now: 48ceac7013714e cs_last: 48ceac25be34ac mask: ffffffffffffffff
>> > [8224517.569849] Switched to clocksource hpet
>> >
>> > On both machines, wd_now is 0xffffffff. The tsc time looks correct; the cpu runs at 2.5GHz:
>> > (0x48ceac7013714e - 0x48ceac25be34ac)/2500000000 = 0.4988s
>> > and 0.4988s matches WATCHDOG_INTERVAL. Since hpet reads 0xffffffff on both
>> > machines, this doesn't look like a coincidence; the hpet is crappy.
>> >
>> > This patch tries to work around the issue.
>> > We retry the read if the hpet returns 0xffffffff.
>> > On the affected machine, the hpet counter doesn't read 0xffffffff on the retry.
>> > The chance of the hpet counter being 0xffffffff is very small, so this patch
>> > should have no impact on a good hpet.
>> >
>> > I'm open to better solutions if there are any.
>>
>> Hrm..
>>
>> So can you characterize this bad behavior a bit more for us? Does
>> every read of the HPET return 0xFFFFFFFF? Or does it just
>> occasionally start returning -1 values? Or once it trips and starts
>> returning -1, does it always return -1?
>>
>> I'm trying to understand if there is a way to catch and disqualify
>> that clocksource earlier than in the watchdog logic.
>
> The HPET returns 0xffffffff occasionally and can still return a
> normal value after it returns -1. I have no idea when the issue
> happens and when it doesn't.

So from the code, it seems like it occasionally recovers after 20
reads, but sometimes it doesn't? Do you have any sense of the max
bound on the number of reads that will give you the -1 value?

That's an ugly problem. Other than something like you have, where we
re-read until we get a valid value (which could cause major unexpected
latencies), I'm not sure what to do other than try to add some logic
like we have with the TSC to mark it bad. Though even there, we don't
detect the issue until we're in a read, and there's no "good" value to
return without causing trouble. So it's really too late at that point.

I'm sort of on the edge of just adding a blacklist entry for the HPET
on this hardware. I'm not sure it's something that can be easily
handled generically. I *hope* you only see this issue on one sort of
hardware?

thanks
-john