From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Davidlohr Bueso,
    Linus Torvalds, Tim Chen, huang ying, Waiman Long
Subject: [PATCH v4 12/16] locking/rwsem: Enable time-based spinning on reader-owned rwsem
Date: Sat, 13 Apr 2019 13:22:55 -0400
Message-Id: <20190413172259.2740-13-longman@redhat.com>
In-Reply-To: <20190413172259.2740-1-longman@redhat.com>
References: <20190413172259.2740-1-longman@redhat.com>
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where the
readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for a writer to spin on a
reader-owned rwsem. It is a time-threshold-based spin where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem. When the time threshold is exceeded, a bit
will be set in the owner field to indicate that no more optimistic
spinning will be allowed on this rwsem until it becomes writer owned
again. For fairness, not even readers are then allowed to acquire the
reader-locked rwsem by optimistic spinning.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

         System                 Time for 16 Iterations
         ------                 ----------------------
   1-socket Skylake                    ~800ns
   4-socket Broadwell                  ~300ns
   2-socket ThunderX2 (arm64)          ~250ns

When the lock cacheline is contended, we can see up to an almost 10X
increase in elapsed time. So 25us corresponds to at most about 500,
1300 and 1600 iterations respectively on the above systems.

With a locking microbenchmark running on a 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch  Post-patch
   ------------  ---------  ----------
        2          1,759      6,684
        4          1,684      6,738
        8          1,074      7,222
       16            900      7,163
       32            458      7,316
       64            208        520
      128            168        425
      240            143        474

This patch gives a big boost in performance for mixed reader/writer
workloads.
With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more
threads went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c            | 121 ++++++++++++++++++++++++++----
 2 files changed, 107 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 333ed5fda333..f3550aa5866a 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer)	/* # of writer wakeups		*/
 LOCK_EVENT(rwsem_opt_rlock)	/* # of read locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_wlock)	/* # of write locks opt-spin acquired	*/
 LOCK_EVENT(rwsem_opt_fail)	/* # of failed opt-spinnings		*/
+LOCK_EVENT(rwsem_opt_nospin)	/* # of disabled reader opt-spinnings	*/
 LOCK_EVENT(rwsem_rlock)		/* # of read locks acquired		*/
 LOCK_EVENT(rwsem_rlock_fast)	/* # of fast read locks acquired	*/
 LOCK_EVENT(rwsem_rlock_fail)	/* # of failed read lock acquisitions	*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 3cf8355252d1..8b23009e6b2c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include <linux/sched/clock.h>
 #include
 #include
 #include
@@ -35,18 +36,20 @@
  * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
  * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
  *   i.e. the owner(s) cannot be readily determined. It can be reader
- *   owned or the owning writer is indeterminate.
+ *   owned or the owning writer is indeterminate. Optimistic spinning
+ *   should be disabled if this flag is set.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
+ * into the owner field or the count itself (64-bit only). It should
+ * be cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquired the rwsem. The
+ * anonymous bit may also be set to permanently disable optimistic
+ * spinning on a reader-owned rwsem until a writer comes along.
  *
  * That information may be helpful in debugging cases where the system
  * seems to hang on a reader owned rwsem especially if only one reader
@@ -138,8 +141,7 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
 static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
 					    struct task_struct *owner)
 {
-	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-						 | RWSEM_ANONYMOUSLY_OWNED;
+	unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
 
 	WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -164,6 +166,14 @@ static inline bool is_rwsem_owner_reader(struct task_struct *owner)
 	return (unsigned long)owner & RWSEM_READER_OWNED;
 }
 
+/*
+ * Return true if the rwsem is spinnable.
+ */
+static inline bool is_rwsem_spinnable(struct rw_semaphore *sem)
+{
+	return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
 /*
  * Return true if rwsem is owned by an anonymous writer or readers.
  */
@@ -193,6 +203,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 }
 #endif
 
+/*
+ * Set the RWSEM_ANONYMOUSLY_OWNED flag if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+	long owner = (long)READ_ONCE(sem->owner);
+
+	while (is_rwsem_owner_reader((struct task_struct *)owner)) {
+		if (!is_rwsem_owner_spinnable((struct task_struct *)owner))
+			break;
+		owner = cmpxchg((long *)&sem->owner, owner,
+				owner | RWSEM_ANONYMOUSLY_OWNED);
+	}
+}
+
 /*
  * Guide to the rw_semaphore's count field.
  *
@@ -507,7 +533,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 	owner = READ_ONCE(sem->owner);
 	if (owner) {
 		ret = is_rwsem_owner_spinnable(owner) &&
-		      owner_on_cpu(owner);
+		      (is_rwsem_owner_reader(owner) || owner_on_cpu(owner));
 	}
 	rcu_read_unlock();
 	preempt_enable();
@@ -532,7 +558,7 @@ enum owner_state {
 	OWNER_READER		= 1 << 2,
 	OWNER_NONSPINNABLE	= 1 << 3,
 };
-#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE		(OWNER_NULL | OWNER_WRITER | OWNER_READER)
 
 static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
@@ -543,7 +569,8 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 		return OWNER_NONSPINNABLE;
 
 	rcu_read_lock();
-	while (owner && (READ_ONCE(sem->owner) == owner)) {
+	while (owner && !is_rwsem_owner_reader(owner)
+	    && (READ_ONCE(sem->owner) == owner)) {
 		/*
 		 * Ensure we emit the owner->on_cpu, dereference _after_
 		 * checking sem->owner still matches owner, if that fails,
@@ -582,11 +609,47 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
 	return !owner ? OWNER_NULL : OWNER_READER;
 }
 
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * It is assumed that the more readers own the rwsem, the longer it will
+ * take for them to wind down and free the rwsem. So the formula to
+ * determine the actual spinning time limit is:
+ *
+ * 1) RWSEM_FLAG_WAITERS set
+ *    Spinning threshold = (10 + nr_readers/2)us
+ *
+ * 2) RWSEM_FLAG_WAITERS not set
+ *    Spinning threshold = 25us
+ *
+ * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
+ * become rwsem owner, so the threshold scales with the reader count,
+ * subject to a maximum value of 25us.
+ *
+ * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
+ * and become one of the owners. So assume the worst case and spin for
+ * at most 25us.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+	long count = atomic_long_read(&sem->count);
+	int reader_cnt = count >> RWSEM_READER_SHIFT;
+
+	if (reader_cnt > 30)
+		reader_cnt = 30;
+	return sched_clock() + ((count & RWSEM_FLAG_WAITERS)
+		? 10 * NSEC_PER_USEC + reader_cnt * NSEC_PER_USEC/2
+		: 25 * NSEC_PER_USEC);
+}
+
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
 	bool taken = false;
 	bool is_rt_task = rt_task(current);
 	int prev_owner_state = OWNER_NULL;
+	int loop = 0;
+	u64 rspin_threshold = 0;
 
 	preempt_disable();
 
@@ -598,8 +661,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 	 * Optimistically spin on the owner field and attempt to acquire the
 	 * lock whenever the owner changes. Spinning will be stopped when:
 	 *   1) the owning writer isn't running; or
-	 *   2) readers own the lock as we can't determine if they are
-	 *      actively running or not.
+	 *   2) readers own the lock and the spinning time has exceeded limit.
 	 */
 	for (;;) {
 		enum owner_state owner_state = rwsem_spin_on_owner(sem);
@@ -616,6 +678,35 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 		if (taken)
 			break;
 
+		/*
+		 * Time-based reader-owned rwsem optimistic spinning
+		 */
+		if (wlock && (owner_state == OWNER_READER)) {
+			/*
+			 * Initialize rspin_threshold when the owner
+			 * state changes from non-reader to reader.
+			 */
+			if (prev_owner_state != OWNER_READER) {
+				if (!is_rwsem_spinnable(sem))
+					break;
+				rspin_threshold = rwsem_rspin_threshold(sem);
+				loop = 0;
+			}
+
+			/*
+			 * Check time threshold every 16 iterations to
+			 * avoid calling sched_clock() too frequently.
+			 * This will make the actual spinning time a
+			 * bit more than that specified in the threshold.
+			 */
+			else if (!(++loop & 0xf) &&
+				 (sched_clock() > rspin_threshold)) {
+				rwsem_set_nonspinnable(sem);
+				lockevent_inc(rwsem_opt_nospin);
+				break;
+			}
+		}
+
 		/*
 		 * An RT task cannot do optimistic spinning if it cannot
 		 * be sure the lock holder is running or live-lock may
-- 
2.18.1