From: Waiman Long
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, Scott J Norton, Douglas Hatch, Waiman Long
Subject: [PATCH] locking/qrwlock: Allow multiple spinning readers
Date: Sat, 19 Mar 2016 23:21:19 -0400
Message-Id: <1458444079-59601-1-git-send-email-Waiman.Long@hpe.com>
X-Mailer: git-send-email 1.7.1

In qrwlock, the reader that is spinning on the lock needs to notify the
next reader in the queue when the lock becomes free. That introduces a
reader-to-reader latency that is not present in the original rwlock.
That latency is the price for reducing lock cacheline contention, and it
also reduces the performance benefit of qrwlock on reader-heavy
workloads.

However, if we allow a limited number of readers to spin on the lock
simultaneously, we can eliminate some of the reader-to-reader latencies
at the expense of a bit more cacheline contention and probably more
power consumption.

This patch changes the reader slowpath to allow multiple readers to spin
on the lock. The maximum number of concurrent readers allowed is
currently set to 4 to limit the amount of additional cacheline
contention while still improving reader performance on most workloads.
If a writer reaches the queue head, however, it will stop additional
readers from leaving the queue to spin on the lock.

Using a multi-threaded locking microbenchmark on a 4-socket 40-core
Haswell-EX system, the locking rates of the 4.5-rc6 kernel with and
without the patch were as follows:

1) All locking threads run in the same socket

  r:w    # of     locking rate   locking rate   % change
  ratio  threads  w/o patch      with patch
  -----  -------  ------------   ------------   --------
   1:1      2       6867 Mop/s     6125 Mop/s    -10.8%
            4       5762 Mop/s     5065 Mop/s    -12.1%
            6       6211 Mop/s     6876 Mop/s    +10.7%
            8       6102 Mop/s     6954 Mop/s    +14.0%
  10:1      2       8808 Mop/s     9446 Mop/s    + 7.2%
            4       6034 Mop/s     6199 Mop/s    + 2.7%
            6       6862 Mop/s     7702 Mop/s    +12.2%
            8       6693 Mop/s     7808 Mop/s    +16.7%
  50:1      2      10530 Mop/s    11028 Mop/s    + 4.7%
            4       9065 Mop/s     9015 Mop/s    - 0.6%
            6       9659 Mop/s    10669 Mop/s    +10.5%
            8       8927 Mop/s    10441 Mop/s    +17.0%

2) 2 locking threads per socket (8 threads total)

  r:w    locking rate   locking rate   % change
  ratio  w/o patch      with patch
  -----  ------------   ------------   --------
  10:1     7027 Mop/s     8012 Mop/s    +14.0%
  50:1     9986 Mop/s    11573 Mop/s    +15.9%

With a moderately contended qrwlock (2 or 4 threads), the performance
benefit was mixed, and the microbenchmark also showed more run-to-run
variation at that contention level. At higher contention levels,
however, the patch gained more than 10% in locking throughput.

Signed-off-by: Waiman Long
---
 kernel/locking/qrwlock.c |   53 ++++++++++++++++++++++++++++-----------------
 1 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index fec0823..bec6d5f 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -23,6 +23,14 @@
 #include <asm/qrwlock.h>
 
 /*
+ * More than one reader will be allowed to spin on the lock waiting for the
+ * writer to exit. When more readers are allowed, it reduces the reader lock
+ * acquisition latency, but increases the amount of cacheline contention and
+ * probably power consumption.
+ */
+#define MAX_SPINNING_READERS	4
+
+/*
  * This internal data structure is used for optimizing access to some of
  * the subfields within the atomic_t cnts.
  */
@@ -43,29 +51,14 @@ struct __qrwlock {
 };
 
 /**
- * rspin_until_writer_unlock - inc reader count & spin until writer is gone
- * @lock : Pointer to queue rwlock structure
- * @writer: Current queue rwlock writer status byte
- *
- * In interrupt context or at the head of the queue, the reader will just
- * increment the reader count & wait until the writer releases the lock.
- */
-static __always_inline void
-rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
-{
-	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
-		cpu_relax_lowlatency();
-		cnts = atomic_read_acquire(&lock->cnts);
-	}
-}
-
-/**
  * queued_read_lock_slowpath - acquire read lock of a queue rwlock
  * @lock: Pointer to queue rwlock structure
  * @cnts: Current qrwlock lock value
  */
 void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 {
+	bool locked = true;
+
 	/*
 	 * Readers come here when they cannot get the lock without waiting
 	 */
@@ -78,7 +71,10 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 		 * semantics) until the lock is available without waiting in
 		 * the queue.
 		 */
-		rspin_until_writer_unlock(lock, cnts);
+		while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+			cpu_relax_lowlatency();
+			cnts = atomic_read_acquire(&lock->cnts);
+		}
 		return;
 	}
 	atomic_sub(_QR_BIAS, &lock->cnts);
@@ -92,14 +88,31 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
 	 * The ACQUIRE semantics of the following spinning code ensure
 	 * that accesses can't leak upwards out of our subsequent critical
 	 * section in the case that the lock is currently held for write.
+	 *
+	 * The reader increments the reader count & waits until the writer
+	 * releases the lock.
 	 */
 	cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
-	rspin_until_writer_unlock(lock, cnts);
+	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+		if (locked && ((cnts >> _QR_SHIFT) < MAX_SPINNING_READERS)) {
+			/*
+			 * Unlock the wait queue so that more readers can
+			 * come forward and wait for the writer to exit,
+			 * as long as no more than MAX_SPINNING_READERS
+			 * readers are present.
+			 */
+			arch_spin_unlock(&lock->wait_lock);
+			locked = false;
+		}
+		cpu_relax_lowlatency();
+		cnts = atomic_read_acquire(&lock->cnts);
+	}
 
 	/*
 	 * Signal the next one in queue to become queue head
 	 */
-	arch_spin_unlock(&lock->wait_lock);
+	if (locked)
+		arch_spin_unlock(&lock->wait_lock);
 }
 EXPORT_SYMBOL(queued_read_lock_slowpath);
 
-- 
1.7.1
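
As an aside for readers less familiar with the qrwlock internals, below is a
rough userspace sketch of what the modified reader slowpath does. It is not
part of the patch: the lock-word layout (low byte for the writer, upper bits
for the reader count) is a simplified assumption, a pthread mutex stands in
for the MCS-based wait_lock, sched_yield() stands in for
cpu_relax_lowlatency(), and the kernel's separate "writer waiting" state is
ignored. It only models how the queue-head reader now releases the wait queue
while it is still spinning, so that up to MAX_SPINNING_READERS readers can
spin on the lock concurrently instead of being released one at a time after
the writer exits.

/* toy_qrwlock.c - illustrative userspace model only, NOT the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>
#include <sched.h>

#define MAX_SPINNING_READERS	4
#define QW_LOCKED		0xffU		/* writer holds the lock */
#define QR_SHIFT		8		/* reader count starts here */
#define QR_BIAS			(1U << QR_SHIFT)

struct toy_qrwlock {
	atomic_uint	cnts;		/* packed reader count + writer byte */
	pthread_mutex_t	wait_lock;	/* stand-in for the waiter queue */
};

static struct toy_qrwlock toy_lock = {
	.cnts = 0,
	.wait_lock = PTHREAD_MUTEX_INITIALIZER,
};

/* Reader slowpath: queue up, then spin until the writer goes away. */
static void toy_read_lock_slowpath(struct toy_qrwlock *l)
{
	bool locked = true;
	unsigned int cnts;

	pthread_mutex_lock(&l->wait_lock);	/* wait for our turn in the queue */

	/* Add our reader bias, then spin while a writer still owns the lock. */
	cnts = atomic_fetch_add_explicit(&l->cnts, QR_BIAS, memory_order_acquire);
	while ((cnts & QW_LOCKED) == QW_LOCKED) {
		if (locked && (cnts >> QR_SHIFT) < MAX_SPINNING_READERS) {
			/*
			 * Release the queue early so the next waiting reader
			 * can start spinning too, up to the spinner limit.
			 */
			pthread_mutex_unlock(&l->wait_lock);
			locked = false;
		}
		sched_yield();
		cnts = atomic_load_explicit(&l->cnts, memory_order_acquire);
	}

	/* Pass queue ownership on only if we still hold it. */
	if (locked)
		pthread_mutex_unlock(&l->wait_lock);
}

static void toy_read_unlock(struct toy_qrwlock *l)
{
	atomic_fetch_sub_explicit(&l->cnts, QR_BIAS, memory_order_release);
}

int main(void)
{
	toy_read_lock_slowpath(&toy_lock);	/* no writer: acquires at once */
	toy_read_unlock(&toy_lock);
	return 0;
}

The sketch builds with "cc -std=c11 -pthread toy_qrwlock.c". The design point
it mirrors is that before the patch the queue-head reader kept the wait queue
locked for its whole spin, so the next reader could only start spinning after
the writer had already left; after the patch the hand-off happens while the
writer is still present, trading a little extra cacheline traffic for lower
reader-to-reader latency.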