Date: Fri, 14 May 2021 10:49:08 -0700
From: "Paul E. McKenney"
To: Feng Tang
Cc: kernel test robot, 0day robot, Thomas Gleixner, John Stultz,
    Stephen Boyd, Jonathan Corbet, Mark Rutland, Marc Zyngier, Andi Kleen,
    Xing Zhengjun, LKML, lkp@lists.01.org, ying.huang@intel.com,
    zhengjun.xing@intel.com, kernel-team@fb.com, neeraju@codeaurora.org
Subject: Re: [clocksource] 388450c708: netperf.Throughput_tps -65.1% regression
Message-ID: <20210514174908.GI975577@paulmck-ThinkPad-P17-Gen-1>
Reply-To: paulmck@kernel.org
References: <20210501003247.2448287-4-paulmck@kernel.org>
    <20210513155515.GB23902@xsang-OptiPlex-9020>
    <20210513170707.GA975577@paulmck-ThinkPad-P17-Gen-1>
    <20210514074314.GB5384@shbuild999.sh.intel.com>
In-Reply-To: <20210514074314.GB5384@shbuild999.sh.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, May 14, 2021 at 03:43:14PM +0800, Feng Tang wrote:
> Hi Paul,
>
> On Thu, May 13, 2021 at 10:07:07AM -0700, Paul E. McKenney wrote:
> > On Thu, May 13, 2021 at 11:55:15PM +0800, kernel test robot wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -65.1% regression of netperf.Throughput_tps due to commit:
> > >
> > > commit: 388450c7081ded73432e2b7148c1bb9a0b039963 ("[PATCH v12 clocksource 4/5] clocksource: Reduce clocksource-skew threshold for TSC")
> > > url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Do-not-mark-clocks-unstable-due-to-delays-for-v5-13/20210501-083404
> > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2d036dfa5f10df9782f5278fc591d79d283c1fad
> > >
> > > in testcase: netperf
> > > on test machine: 96 threads 2 sockets Ice Lake with 256G memory
> > > with following parameters:
> > >
> > >     ip: ipv4
> > >     runtime: 300s
> > >     nr_threads: 25%
> > >     cluster: cs-localhost
> > >     test: UDP_RR
> > >     cpufreq_governor: performance
> > >     ucode: 0xb000280
> > >
> > > test-description: Netperf is a benchmark that can be use to measure various aspect of networking performance.
> > > test-url: http://www.netperf.org/netperf/
> > >
> > > If you fix the issue, kindly add following tag
> > > Reported-by: kernel test robot
> > >
> > > also as Feng Tang checked, this is a "unstable clocksource" case.
> > > attached dmesg FYI.
> >
> > Agreed, given the clock-skew event and the resulting switch to HPET,
> > performance regressions are expected behavior.
> >
> > That dmesg output does demonstrate the value of Feng Tang's patch!
> >
> > I don't see how to obtain the values of ->mult and ->shift that would
> > allow me to compute the delta.  So if you don't tell me otherwise, I
> > will assume that the skew itself was expected on this hardware, perhaps
> > somehow due to the tpm_tis_status warning immediately preceding the
> > clock-skew event.  If my assumption is incorrect, please let me know.
>
> I run the case with the debug patch applied, the info is:
>
> [ 13.796429] clocksource: timekeeping watchdog on CPU19: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 13.797413] clocksource: 'hpet' wd_nesc: 505192062 wd_now: 10657158 wd_last: fac6f97 mask: ffffffff
> [ 13.797413] clocksource: 'tsc' cs_nsec: 504008008 cs_now: 3445570292aa5 cs_last: 344551f0cad6f mask: ffffffffffffffff
> [ 13.797413] clocksource: 'tsc' is current clocksource.
> [ 13.797413] tsc: Marking TSC unstable due to clocksource watchdog
> [ 13.844513] clocksource: Checking clocksource tsc synchronization from CPU 50 to CPUs 0-1,12,22,32-33,60,65.
> [ 13.855080] clocksource: Switched to clocksource hpet
>
> So the delta is 1184 us (505192062 - 504008008), and I agree with
> you that it should be related with the tpm_tis_status warning stuff.
>
> But this re-trigger my old concerns, that if the margins calculated
> for tsc, hpet are too small?

If the error really did disturb either tsc or hpet, then we really do
not have a false positive, and nothing should change (aside from perhaps
documenting that TPM issues can disturb the clocks, or better yet treating
that perturbation as a separate bug that should be fixed).

But if this is yet another way to get a confused measurement, then it
would be better to work out a way to reject the confusion and keep the
tighter margins.  I cannot think right off of a way that this could
cause measurement confusion, but you never know.

So any thoughts on exactly how the tpm_tis_status warning might have
resulted in the skew?

> With current math algorithm, the 'uncertainty_margin' is
> calculated against the frequency, and those tsc/hpet/acpi_pm
> timer is multiple of MHz or GHz, which gives them to have margin of
> 100 us. It works with normal systems. But in the wild world, there
> could be some sparkles due to some immature HW components, their
> firmwares or drivers etc, just like this case.

Isn't diagnosing issues from immature hardware, firmware, and drivers
actually a benefit?  It would after all be quite unfortunate if some
issue that was visible only due to clock skew were to escape into
production.

							Thanx, Paul
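
For reference, the skew arithmetic quoted in this thread can be reproduced with a short sketch. It takes the two interval lengths from the quoted dmesg output (the hpet watchdog interval and the tsc interval, both in nanoseconds) and compares their difference against the 100 us per-clock margin mentioned above. The function names are hypothetical, and summing the two per-clock margins into a single threshold is an assumption about how the watchdog combines them, not something stated in the thread.

```python
NSEC_PER_USEC = 1_000


def skew_ns(wd_nsec: int, cs_nsec: int) -> int:
    """Absolute difference between the watchdog and clocksource intervals."""
    return abs(wd_nsec - cs_nsec)


def exceeds_margin(wd_nsec: int, cs_nsec: int,
                   wd_margin_ns: int, cs_margin_ns: int) -> bool:
    """True if the skew is larger than the combined margin.

    Assumption: the two per-clock margins are simply added together.
    """
    return skew_ns(wd_nsec, cs_nsec) > wd_margin_ns + cs_margin_ns


# Interval lengths taken from the quoted dmesg lines.
WD_NSEC = 505_192_062   # 'hpet' interval (printed as wd_nesc in the log)
CS_NSEC = 504_008_008   # 'tsc' interval (cs_nsec)

# Per-clock margin of 100 us, as described in the quoted message.
MARGIN_NS = 100 * NSEC_PER_USEC

delta = skew_ns(WD_NSEC, CS_NSEC)
print(f"skew: {delta} ns (~{delta // NSEC_PER_USEC} us)")
print("exceeds margin:", exceeds_margin(WD_NSEC, CS_NSEC, MARGIN_NS, MARGIN_NS))
```

Under these assumptions the skew of roughly 1184 us is several times larger than the combined margin, which is consistent with the watchdog marking tsc unstable in the quoted log.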