Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <8a9bed0d-c166-37e9-24c3-8cea7a336c76@redhat.com>
Date:   Tue, 20 Dec 2022 22:26:15 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.4.0
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when
 high read lantency detected
Content-Language: en-US
To:     Feng Tang <feng.tang@intel.com>,
        "Paul E. McKenney" <paulmck@kernel.org>
Cc:     John Stultz <jstultz@google.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Stephen Boyd <sboyd@kernel.org>, x86@kernel.org,
        Peter Zijlstra <peterz@infradead.org>,
        linux-kernel@vger.kernel.org, Tim Chen <tim.c.chen@intel.com>
References: <20221220082512.186283-1-feng.tang@intel.com>
 <6fb04ee9-ce77-4835-2ad1-b7f8419cfb77@redhat.com>
 <20221220183400.GY4001@paulmck-ThinkPad-P17-Gen-1>
 <Y6Ja+kYQAi4pppV6@feng-clx>
From:   Waiman Long <longman@redhat.com>
In-Reply-To: <Y6Ja+kYQAi4pppV6@feng-clx>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk


On 12/20/22 20:01, Feng Tang wrote:
> Using correct email address of John Stultz.
>
> On Tue, Dec 20, 2022 at 10:34:00AM -0800, Paul E. McKenney wrote:
>> On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
>>> On 12/20/22 03:25, Feng Tang wrote:
>>>> There were bug reported on 8 sockets x86 machines that TSC was wrongly
>>>> disabled when system is under heavy workload.
>>>>
>>>>    [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
>>>>    [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
>>>>    [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
>>>>    [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
>>>>    [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
>>>>    [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
>>>>    [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>>>>    [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
>>>>    [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
>>>>    [ 821.067990] clocksource: Switched to clocksource hpet
>>>>
>>>> This can be reproduced when system is running memory intensive 'stream'
>>>> test, or some stress-ng subcases like 'ioport'.
>>>>
>>>> The reason is when system is under heavy load, the read latency of
>>>> clocksource can be very high, it can be seen even with lightweight
>>>> TSC read, and is much worse on MMIO or IO port read based external
>>>> clocksource. Causing the watchdog check to be inaccurate.
>>>>
>>>> As the clocksource watchdog is a lifetime check with frequency of
>>>> twice a second, there is no need to rush doing it when the system
>>>> is under heavy load and the clocksource read latency is very high,
>>>> suspend the watchdog timer for 5 minutes.
>>>>
>>>> Signed-off-by: Feng Tang <feng.tang@intel.com>
>>>> ---
>>>>    kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
>>>>    1 file changed, 32 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>>>> index 9cf32ccda715..8cd74b89d577 100644
>>>> --- a/kernel/time/clocksource.c
>>>> +++ b/kernel/time/clocksource.c
>>>> @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
>>>> +static inline void clocksource_reset_watchdog(void)
>>>> +{
>>>> +	struct clocksource *cs;
>>>> +
>>>> +	list_for_each_entry(cs, &watchdog_list, wd_list)
>>>> +		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>>>> +}
>>>> +
>>>> +
>>>>    static void clocksource_watchdog(struct timer_list *unused)
>>>>    {
>>>>    	u64 csnow, wdnow, cslast, wdlast, delta;
>>>> @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>    	int64_t wd_nsec, cs_nsec;
>>>>    	struct clocksource *cs;
>>>>    	enum wd_read_status read_ret;
>>>> +	unsigned long extra_wait = 0;
>>>>    	u32 md;
>>>>    	spin_lock(&watchdog_lock);
>>>> @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>    		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
>>>> -		if (read_ret != WD_READ_SUCCESS) {
>>>> -			if (read_ret == WD_READ_UNSTABLE)
>>>> -				/* Clock readout unreliable, so give it up. */
>>>> -				__clocksource_unstable(cs);
>>>> +		if (read_ret == WD_READ_UNSTABLE) {
>>>> +			/* Clock readout unreliable, so give it up. */
>>>> +			__clocksource_unstable(cs);
>>>>    			continue;
>>>>    		}
>>>> +		/*
>>>> +		 * When WD_READ_SKIP is returned, it means the system is likely
>>>> +		 * under very heavy load, where the latency of reading
>>>> +		 * watchdog/clocksource is very big, and affect the accuracy of
>>>> +		 * watchdog check. So give system some space and suspend the
>>>> +		 * watchdog check for 5 minutes.
>>>> +		 */
>>>> +		if (read_ret == WD_READ_SKIP) {
>>>> +			/*
>>>> +			 * As the watchdog timer will be suspended, and
>>>> +			 * cs->last could keep unchanged for 5 minutes, reset
>>>> +			 * the counters.
>>>> +			 */
>>>> +			clocksource_reset_watchdog();
>>>> +			extra_wait = HZ * 300;
>>>> +			break;
>>>> +		}
>>>> +
>>>>    		/* Clocksource initialized ? */
>>>>    		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
>>>>    		    atomic_read(&watchdog_reset_pending)) {
>>>> @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>>>>    	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
>>>>    	 */
>>>>    	if (!timer_pending(&watchdog_timer)) {
>>>> -		watchdog_timer.expires += WATCHDOG_INTERVAL;
>>>> +		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
>>>>    		add_timer_on(&watchdog_timer, next_cpu);
>>>>    	}
>>>>    out:
>>>> @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
>>>>    	watchdog_running = 0;
>>>>    }
>>>> -static inline void clocksource_reset_watchdog(void)
>>>> -{
>>>> -	struct clocksource *cs;
>>>> -
>>>> -	list_for_each_entry(cs, &watchdog_list, wd_list)
>>>> -		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>>>> -}
>>>> -
>>>>    static void clocksource_resume_watchdog(void)
>>>>    {
>>>>    	atomic_inc(&watchdog_reset_pending);
>>> It looks reasonable to me. Thanks for the patch.
>>>
>>> Acked-by: Waiman Long <longman@redhat.com>
>> Queued, thank you both!
> Thanks for reviewing and queueing!
>
>> If you would like this to go in some other way:
>>
>> Acked-by: Paul E. McKenney <paulmck@kernel.org>
>>
>> And while I am remembering it...  Any objections to reversing the role of
>> TSC and the other timers on systems where TSC is believed to be accurate?
>> So that if there is clocksource skew, HPET is marked unstable rather than
>> TSC?
> For the bug in commit log, I think it's the 8 sockets system with
> hundreds of CPUs causing the big latency, while the HPET itself may
> not be broken, and if we switched to ACPI PM_TIMER as watchdog, we
> could see similar big latency.
>
> I used to only see this issue with stress tool like stress-ng, but
> seems with larger and larger system, even the momory intensive load
> can easily trigger this.
>
>> This would preserve the diagnostics without hammering performance
>> when skew is detected.  (Switching from TSC to HPET hammers performance
>> enough that our automation usually notices and reboots the system.)
> Yes, switching to HPET is a disaster for performance, we've seen
> from 30% to 90% drop in different benchmarks.

Switching to hpet is very bad for performance. That is the main reason 
why I posted clocksource patches in the past to avoid this as much as 
possible. I think your patch is also a good countermeasure to avoid this.

Thanks,
Longman