From: Andy Lutomirski
Date: Wed, 6 Apr 2016 21:58:07 -0700
Subject: Re: [PATCH] x86/hpet: Reduce HPET counter read contention
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    "linux-kernel@vger.kernel.org", X86 ML, Jiang Liu, Borislav Petkov,
    Andy Lutomirski, Scott J Norton, Douglas Hatch, Randy Wright
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 6, 2016 at 7:02 AM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on overall system performance for two
> reasons:
> 1) There is a single HPET counter shared by all the CPUs.
> 2) Reading the HPET counter is a very slow operation.
>
> HPET may end up as the default clock source when, for example, the TSC
> clock calibration exceeds the allowable tolerance. Sometimes the
> performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by exploiting the
> fact that if more than one task is trying to access the HPET at the
> same time, it is more efficient for one task in the group to read the
> HPET counter and share it with the rest of the group than for each
> group member to read the counter individually.
>
> This is done with a combination word containing a sequence number and
> a bit lock. The task that acquires the bit lock is responsible for
> reading the HPET counter and updating the sequence number. The others
> monitor the change in the sequence number and pick up the shared HPET
> value accordingly.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch produced the following performance numbers
> (with HPET or TSC as the clock source):
>
> TSC             =  646515 jobs/min
> HPET w/o patch  =  566708 jobs/min
> HPET with patch =  638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without the patch to 1.41% with it.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers for the same benchmark were:
>
> TSC             = 3145329 jobs/min
> HPET w/o patch  = 1108537 jobs/min
> HPET with patch = 3019934 jobs/min
>
> The corresponding perf profile showed a drop in CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long
> ---
>  arch/x86/kernel/hpet.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 109 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..9e3de73 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,11 +759,112 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>  #endif
>
>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delay and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC.
> + * For a really large system with hundreds of CPUs, the slowdown may
> + * be so severe that it can actually crash the system because of an
> + * NMI watchdog soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + *
> + * This mechanism is only activated on systems with a large number of CPUs.
> + * Currently, it is enabled when nr_cpus > 64.
> + */

Reading the HPET is so slow that all the atomic ops in the world won't
make a dent.  Why not just turn this optimization on unconditionally?

--Andy