From: Andy Lutomirski
Date: Mon, 11 Apr 2016 13:21:27 -0700
Subject: Re: [PATCH v3] x86/hpet: Reduce HPET counter read contention
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Jonathan Corbet,
    "linux-kernel@vger.kernel.org", linux-doc@kernel.org, X86 ML,
    Jiang Liu, Borislav Petkov, Andy Lutomirski, Scott J Norton,
    Douglas Hatch, Randy Wright
In-Reply-To: <1460405379-25294-1-git-send-email-Waiman.Long@hpe.com>
References: <1460405379-25294-1-git-send-email-Waiman.Long@hpe.com>

On Mon, Apr 11, 2016 at 1:09 PM, Waiman Long wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on overall system performance for the
> following reasons:
> 1) There is a single HPET counter shared by all the CPUs.
> 2) HPET counter reading is a very slow operation.
>
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by using the fact
> that if more than one CPU is trying to access the HPET at the same
> time, it is more efficient for one CPU in the group to read the HPET
> counter and share it with the rest of the group than for each group
> member to read the HPET counter individually.
>
> This is done by using a combination word with a sequence number and
> a bit lock. The CPU that gets the bit lock is responsible for reading
> the HPET counter and updating the sequence number. The others monitor
> the change in sequence number and pick up the saved HPET value
> accordingly. This change is enabled only on SMP configurations.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch gave the following performance numbers
> (with HPET or TSC as the clock source):
>
>   TSC             =  646515 jobs/min
>   HPET w/o patch  =  566708 jobs/min
>   HPET with patch =  638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without the patch to 1.41% with the patch.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers of the same benchmark were:
>
>   TSC             = 3145329 jobs/min
>   HPET w/o patch  = 1108537 jobs/min
>   HPET with patch = 3019934 jobs/min
>
> The corresponding perf profile showed a drop in CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long
> ---
> v2->v3:
>  - Make the hpet optimization the default for SMP configurations, so
>    no documentation change is needed.
>  - Remove the threshold checking code as it should not be necessary
>    and can be potentially unsafe.
>
> v1->v2:
>  - Reduce the CPU threshold to 32.
>  - Add a kernel parameter to explicitly enable or disable the hpet
>    optimization.
>  - Change hpet_save.hpet type to u32 to make sure that reads and
>    writes are atomic on i386.
>
>  arch/x86/kernel/hpet.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 83 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..ff1f140 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,12 +759,95 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>  #endif
>
>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delay and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC. For a
> + * really large system with hundreds of CPUs, the slowdown may be so
> + * severe that it may actually crash the system because of an NMI watchdog
> + * soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + */
> +static struct {
> +	/* Sequence number + bit lock */
> +	int seq ____cacheline_aligned_in_smp;
> +
> +	/* Current HPET value */
> +	u32 hpet ____cacheline_aligned_in_smp;
> +} hpet_save;
> +#define HPET_SEQ_LOCKED(seq)	((seq) & 1)	/* Odd == locked */
> +
> +/*
>   * Clock source related code
>   */
> +#ifdef CONFIG_SMP
> +static cycle_t read_hpet(struct clocksource *cs)
> +{
> +	int seq;
> +
> +	seq = READ_ONCE(hpet_save.seq);
> +	if (!HPET_SEQ_LOCKED(seq)) {
> +		int old, new = seq + 1;
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		/*
> +		 * Set the lock bit (lsb) to get the right to read the HPET
> +		 * counter directly. If successful, read the counter, save
> +		 * its value, and increment the sequence number. Otherwise,
> +		 * increment the sequence number to the expected locked value
> +		 * for comparison later on.
> +		 */
> +		old = cmpxchg(&hpet_save.seq, seq, new);
> +		if (old == seq) {
> +			u32 time = hpet_readl(HPET_COUNTER);
> +
> +			WRITE_ONCE(hpet_save.hpet, time);

I think a plain store is okay due to the smp_store_release below.

> +
> +			/* Unlock */
> +			smp_store_release(&hpet_save.seq, new + 1);
> +			local_irq_restore(flags);
> +			return (cycle_t)time;
> +		}
> +		local_irq_restore(flags);
> +		seq = new;
> +	}
> +
> +	/*
> +	 * Wait until the locked sequence number changes, which indicates
> +	 * that the saved HPET value is up-to-date.
> +	 */
> +	while (READ_ONCE(hpet_save.seq) == seq) {
> +		/*
> +		 * Since reading the HPET is much slower than a single
> +		 * cpu_relax() instruction, we use two here in an attempt
> +		 * to reduce the amount of cacheline contention in the
> +		 * hpet_save.seq cacheline.
> +		 */
> +		cpu_relax();
> +		cpu_relax();
> +	}
> +
> +	return (cycle_t)READ_ONCE(hpet_save.hpet);
> +}

I wonder if this could be simplified.
Pseudocode:

u32 time;
unsigned long flags;

local_irq_save(flags);
if (spin_trylock(&hpet_lock)) {
	time = hpet_readl(HPET_COUNTER);
	WRITE_ONCE(last_hpet_counter, time);
} else {
	spin_unlock_wait(&hpet_lock);
	/*
	 * When this function started, hpet_lock was locked.  Now it's
	 * unlocked, which means that time is at least as new as whatever
	 * the lock holder returned.
	 */
	time = READ_ONCE(last_hpet_counter);
}
local_irq_restore(flags);

return time;

Should be faster under heavy contention, too: spinlocks are very nicely
optimized.

I *think* this is sufficiently atomic.

--Andy
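
For reference, a minimal sketch of how the pseudocode above might look as
a complete clocksource read callback. This is only an illustration, not
code from the patch or from this thread: the hpet_lock and
last_hpet_counter names are taken from the pseudocode, the function shape
mirrors the patch's read_hpet(), and the spin_unlock() in the trylock path
is added here because the pseudocode leaves it implicit.

/*
 * Illustrative sketch only (not from the patch or this thread).
 * hpet_lock and last_hpet_counter follow the pseudocode above.
 */
static DEFINE_SPINLOCK(hpet_lock);
static u32 last_hpet_counter;

static cycle_t read_hpet(struct clocksource *cs)
{
	unsigned long flags;
	u32 time;

	local_irq_save(flags);
	if (spin_trylock(&hpet_lock)) {
		/* We won the lock: do the slow MMIO read and publish it. */
		time = hpet_readl(HPET_COUNTER);
		WRITE_ONCE(last_hpet_counter, time);
		spin_unlock(&hpet_lock);
	} else {
		/*
		 * Another CPU is already reading the counter.  Wait for it
		 * to drop the lock; the value it published is then at least
		 * as new as the moment this function was entered.
		 */
		spin_unlock_wait(&hpet_lock);
		time = READ_ONCE(last_hpet_counter);
	}
	local_irq_restore(flags);

	return (cycle_t)time;
}

Whether the else path sees the lock holder's store depends on the ordering
guarantees of spin_unlock_wait() against that WRITE_ONCE(), which is the
"sufficiently atomic" question raised above.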