v8:
- Remove v7 patch 1 as it has been merged.
- Add patch 14 to make rwsem->owner an atomic_long_t as suggested
by PeterZ.
- Incorporate other changes suggested by PeterZ like removing
*_is_spinnable() and owner_without_flags() helpers.
v7:
- Fix a bug in patch 1 and include changes suggested by Linus.
- Refresh the other patches accordingly.
v6:
- Add a new patch 1 to fix an existing rwsem bug that allows both
readers and writer to become rwsem owners simultaneously.
- Fix a missed wakeup bug (in patch 9) of the v5 series.
- Fix boot up hang problem on some kernel configurations caused
by the patch that merges owner into count.
- Make the adaptive disabling of reader optimistic spinning more
aggressive to further improve workloads that don't benefit from
that.
- Add another patch to disable preemption during down_read() when
owner is merged into count to prevent the possibility of reader
count overflow.
- Don't allow merging of owner into count if NR_CPUS is greater
than the max supported reader count.
- Allow readers dropped out from the optimistic spinning loop to
attempt a trylock if they are still in the right read phase.
- Fix incorrect DEBUG_RWSEMS_WARN_ON() warning.
This is part 2 of a 3-part (0/1/2) series to rearchitect the internal
operation of rwsem. Both part 0 and part 1 are merged into tip.
This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It also implements the following 3
new features:
1) Waiter lock handoff
2) Reader optimistic spinning with adaptive disabling
3) Store write-lock owner in the atomic count (x86-64 only for now)
Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.
Reader optimistic spinning enables readers to acquire the lock more
quickly. So workloads that use a mix of readers and writers should
see an increase in performance as long as the reader critical sections
are short. For workloads that have long reader critical sections,
reader optimistic spinning may hurt performance, so an adaptive disabling
mechanism is also implemented to disable it when reader-owned lock
spinning timeouts happen.
Finally, storing the write-lock owner into the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminate the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.
Because multiple readers can share the same lock, there is a natural
preference for readers when measuring in terms of locking throughput,
as more readers are likely to get into the locking fast path than
writers. With waiter lock handoff, we are not going to starve the
writers.
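As a rough illustration of the handoff idea, the following user-space C11
sketch models the decision a waiting writer makes, loosely following the
rwsem_try_write_lock() logic in the attached diff. The ho_* names and all
bit positions are assumptions made for this sketch only (in particular,
the handoff flag is assumed to sit in bit 2), and the real code runs with
wait_lock held and also updates the owner field.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions, chosen for this sketch only. */
#define HO_WRITER_LOCKED	(1L << 0)
#define HO_FLAG_WAITERS		(1L << 1)
#define HO_FLAG_HANDOFF		(1L << 2)	/* assumed position */
#define HO_READER_BIAS		(1L << 8)
#define HO_READER_MASK		(~(HO_READER_BIAS - 1))
#define HO_LOCK_MASK		(HO_WRITER_LOCKED | HO_READER_MASK)

static_assert((HO_FLAG_HANDOFF & HO_LOCK_MASK) == 0,
	      "handoff bit must not overlap the lock bits");

enum ho_wstate {
	HO_WRITER_NOT_FIRST,	/* writer is not first in the wait list */
	HO_WRITER_FIRST,	/* writer is first in the wait list */
	HO_WRITER_HANDOFF,	/* first writer that has waited too long */
};

/*
 * Try to take the write lock on behalf of a queued writer.
 * Returns true if the lock was acquired. If the lock is busy and the
 * caller is in the HO_WRITER_HANDOFF state, the handoff bit is set
 * instead so that nobody else can steal the lock, and false is returned.
 */
static inline bool ho_try_write_lock(atomic_long *count,
				     enum ho_wstate wstate, bool only_waiter)
{
	long old = atomic_load_explicit(count, memory_order_relaxed);
	long new;

	do {
		bool has_handoff = (old & HO_FLAG_HANDOFF) != 0;

		/* Once handoff is requested, only the first waiter may proceed. */
		if (has_handoff && wstate == HO_WRITER_NOT_FIRST)
			return false;

		new = old;
		if (old & HO_LOCK_MASK) {
			/* Lock is held: at most request a handoff. */
			if (has_handoff || wstate != HO_WRITER_HANDOFF)
				return false;
			new |= HO_FLAG_HANDOFF;
		} else {
			/* Lock is free: take it and clear the handoff bit. */
			new |= HO_WRITER_LOCKED;
			new &= ~HO_FLAG_HANDOFF;
			if (only_waiter)
				new &= ~HO_FLAG_WAITERS;
		}
	} while (!atomic_compare_exchange_weak_explicit(count, &old, new,
			memory_order_acquire, memory_order_relaxed));

	/* Either the lock was acquired or the handoff bit was set. */
	return !(new & HO_FLAG_HANDOFF);
}
```

With the handoff bit set, lock stealers and non-first waiters back off,
so the first waiter gets the lock as soon as the current holders release
it.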
On a 2-socket 40-core 80-thread Skylake system with 40 reader and writer
locking threads, the min/mean/max locking operations done in a 5-second
testing window before the patchset were:
40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255
After the patchset, they became:
40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098
It can be seen that the performance improves for both readers and writers.
It also makes rwsem fairer to readers as well as among different threads
within the reader and writer groups.
Patch 1 makes owner a permanent member of the rw_semaphore structure and
sets it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.
Patch 2 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.
Patch 3 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write lock is done by atomic_cmpxchg() while read
lock is still being done by atomic_add().
Patch 4 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c just
like the mutex. The rwsem-xadd.c is removed and a bare-bones rwsem.h is
left for internal function declarations needed by percpu-rwsem.c.
Patch 5 optimizes the merged rwsem.c file to generate a smaller object
file and performs other miscellaneous code cleanups.
Patch 6 makes rwsem_spin_on_owner() return owner state.
Patch 7 implements lock handoff to prevent lock starvation. Throughput
is expected to be lower on workloads with highly contended rwsems in
exchange for better fairness.
Patch 8 makes sure that all wake_up_q() calls happen after dropping
the wait_lock.
Patch 9 makes RT task's handling of NULL owner more optimal.
Patch 10 makes reader wakeup wake up almost all the readers in the
wait queue instead of just those in the front.
Patch 11 renames the RWSEM_ANONYMOUSLY_OWNED bit to RWSEM_NONSPINNABLE.
Patch 12 enables readers to spin on a writer-owned rwsem.
Patch 13 makes rwsem->owner an atomic_long_t as it is now holding some
flags that may need to be atomically updated.
Patch 14 enables a writer to spin on a reader-owned rwsem for at most
25us and splits the RWSEM_NONSPINNABLE bit into 2 separate ones - one
for readers and one for writers.
Patch 15 implements the adaptive disabling of reader optimistic spinning
when the reader-owned rwsem spinning timeouts happen.
Patch 16 handles the case of too many readers by reserving the sign
bit to designate that a reader lock attempt will fail and the locking
reader will be put to sleep. This will ensure that we will not overflow
the reader count.
Patch 17 merges the write-lock owner task pointer into the count.
Only 64-bit count has enough space to provide a reasonable number of
bits for reader count. This is for x86-64 only for the time being.
Patch 18 eliminates redundant computation of the merged owner-count.
Patch 19 disables preemption during the down_read() call to eliminate
the remote possibility that the reader count may overflow.
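To make the 25us limit from patch 14 concrete, here is a small user-space
sketch of the spinning threshold formula given in the attached diff,
(10 + nr_readers/2)us with the reader count capped at 30. The th_* names
are hypothetical, and unlike rwsem_rspin_threshold() this helper returns
only the time delta instead of adding it to sched_clock().

```c
#include <assert.h>
#include <stdint.h>

#define TH_NSEC_PER_USEC	1000ULL
#define TH_MAX_READERS		30	/* cap that yields the 25us maximum */

/*
 * Return how long (in ns) a writer may spin on a reader-owned rwsem,
 * given the number of readers currently holding it.
 */
static inline uint64_t th_rspin_delta_ns(long nr_readers)
{
	assert(nr_readers >= 0);

	if (nr_readers > TH_MAX_READERS)
		nr_readers = TH_MAX_READERS;

	/* (10 + nr_readers/2)us, computed in ns to keep the half microsecond */
	return (20 + (uint64_t)nr_readers) * TH_NSEC_PER_USEC / 2;
}
```

For example, with no readers the delta is 10us, with 10 readers it is
15us, and with 30 or more readers it stays at the 25us cap.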
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on a 2-socket Skylake system with equal numbers
of readers and writers (mixed) before and after this patchset were:
   # of Threads   Before Patch      After Patch
   ------------   ------------      -----------
        2            2,618             4,193
        4            1,202             3,726
        8              802             3,622
       16              729             3,359
       32              319             2,826
       64              102             2,744
On workloads where the rwsem reader critical section is relatively long
(longer than the spinning period), optimistic spinning of writers on a
reader-owned rwsem may not be that helpful. In fact, the performance may
regress in some cases like the will-it-scale page_fault1 microbenchmark.
This is likely due to the fact that larger reader groups where the readers
acquire the lock together are broken into smaller ones. So more work
will be needed to better tune the rwsem code to that kind of workload.
The v7-to-v8 diff is attached at the end.
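Most of that diff deals with keeping the owner task pointer and its flag
bits in a single atomic word. The user-space C11 sketch below illustrates
the idea: set the owner, split the word into pointer and flags, and set
the nonspinnable bits with a compare-and-exchange loop. The own_* names
are hypothetical. Bit 0 as the reader-owned bit follows the
RWSEM_OWNER_UNKNOWN comment in the diff; the positions of the two
nonspinnable bits are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define OWN_READER_OWNED	((intptr_t)1 << 0)
#define OWN_RD_NONSPINNABLE	((intptr_t)1 << 1)	/* assumed position */
#define OWN_WR_NONSPINNABLE	((intptr_t)1 << 2)	/* assumed position */
#define OWN_NONSPINNABLE	(OWN_RD_NONSPINNABLE | OWN_WR_NONSPINNABLE)
#define OWN_FLAGS_MASK		(OWN_READER_OWNED | OWN_NONSPINNABLE)

/* Stand-in for task_struct; aligned so the low pointer bits are free. */
struct own_task {
	_Alignas(8) int on_cpu;
};

static_assert((intptr_t)_Alignof(struct own_task) > OWN_FLAGS_MASK,
	      "task pointers must leave the flag bits clear");

/* Writer takes ownership: plain pointer, all flags cleared. */
static inline void own_set_owner(atomic_intptr_t *owner, struct own_task *task)
{
	atomic_store(owner, (intptr_t)task);
}

/* Reader takes ownership, preserving the reader nonspinnable bit. */
static inline void own_set_reader_owned(atomic_intptr_t *owner,
					struct own_task *task)
{
	intptr_t val = (intptr_t)task | OWN_READER_OWNED |
		       (atomic_load(owner) & OWN_RD_NONSPINNABLE);

	atomic_store(owner, val);
}

/* Return the task pointer and store the embedded flags in *pflags. */
static inline struct own_task *own_read_owner_flags(atomic_intptr_t *owner,
						    intptr_t *pflags)
{
	intptr_t val = atomic_load(owner);

	*pflags = val & OWN_FLAGS_MASK;
	return (struct own_task *)(val & ~OWN_FLAGS_MASK);
}

/* Set the nonspinnable bits, but only while the lock is reader-owned. */
static inline void own_set_nonspinnable(atomic_intptr_t *owner)
{
	intptr_t val = atomic_load(owner);

	while ((val & OWN_READER_OWNED) && !(val & OWN_NONSPINNABLE)) {
		/* On failure, val is reloaded and the conditions rechecked. */
		if (atomic_compare_exchange_weak(owner, &val,
						 val | OWN_NONSPINNABLE))
			break;
	}
}
```

Keeping the pointer and flags in one atomic word lets a spinner read a
consistent snapshot of both with a single load.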
Waiman Long (19):
locking/rwsem: Make owner available even if
!CONFIG_RWSEM_SPIN_ON_OWNER
locking/rwsem: Remove rwsem_wake() wakeup optimization
locking/rwsem: Implement a new locking scheme
locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
locking/rwsem: Code cleanup after files merging
locking/rwsem: Make rwsem_spin_on_owner() return owner state
locking/rwsem: Implement lock handoff to prevent lock starvation
locking/rwsem: Always release wait_lock before waking up tasks
locking/rwsem: More optimal RT task handling of null owner
locking/rwsem: Wake up almost all readers in wait queue
locking/rwsem: Clarify usage of owner's nonspinaable bit
locking/rwsem: Enable readers spinning on writer
locking/rwsem: Make rwsem->owner an atomic_long_t
locking/rwsem: Enable time-based spinning on reader-owned rwsem
locking/rwsem: Adaptive disabling of reader optimistic spinning
locking/rwsem: Guard against making count negative
locking/rwsem: Merge owner into count on x86-64
locking/rwsem: Remove redundant computation of writer lock word
locking/rwsem: Disable preemption in down_read*() if owner in count
arch/x86/Kconfig | 6 +
include/linux/percpu-rwsem.h | 4 +-
include/linux/rwsem.h | 16 +-
include/linux/sched/wake_q.h | 5 +
kernel/Kconfig.locks | 12 +
kernel/locking/Makefile | 2 +-
kernel/locking/lock_events_list.h | 12 +-
kernel/locking/rwsem-xadd.c | 745 -------------
kernel/locking/rwsem.c | 1668 ++++++++++++++++++++++++++++-
kernel/locking/rwsem.h | 306 +-----
lib/Kconfig.debug | 8 +-
11 files changed, 1694 insertions(+), 1090 deletions(-)
delete mode 100644 kernel/locking/rwsem-xadd.c
--
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
lock_release(&sem->rw_sem.dep_map, 1, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+ atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
#endif
}
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = current;
+ atomic_long_set(&sem->rw_sem.owner, (long)current);
#endif
}
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
struct rw_semaphore {
atomic_long_t count;
/*
- * Write owner or one of the read owners. Can be used as a
- * speculative check to see if the owner is running on the cpu.
+ * Write owner or one of the read owners as well flags regarding
+ * the current state of the rwsem. Can be used as a speculative
+ * check to see if the write owner is running on the cpu.
*/
- struct task_struct *owner;
+ atomic_long_t owner;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
#endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
* Setting all bits of the owner field except bit 0 will indicate
* that the rwsem is writer-owned with an unknown owner.
*/
-#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN (-2L)
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#define __RWSEM_INITIALIZER(name) \
{ __RWSEM_INIT_COUNT(name), \
- .owner = NULL, \
+ .owner = ATOMIC_LONG_INIT(0), \
.wait_list = LIST_HEAD_INIT((name).wait_list), \
.wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
__RWSEM_OPT_INIT(name) \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 60783267b50d..cede2f99220b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -107,7 +107,7 @@
if (!debug_locks_silent && \
WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
#c, atomic_long_read(&(sem)->count), \
- (long)((sem)->owner), (long)current, \
+ atomic_long_read(&(sem)->owner), (long)current, \
list_empty(&(sem)->wait_list) ? "" : "not ")) \
debug_locks_off(); \
} while (0)
@@ -174,9 +174,9 @@
* atomic_long_cmpxchg() will be used to obtain writer lock.
*
* There are three places where the lock handoff bit may be set or cleared.
- * 1) __rwsem_mark_wake() for readers.
+ * 1) rwsem_mark_wake() for readers.
* 2) rwsem_try_write_lock() for writers.
- * 3) Error path of __rwsem_down_write_failed_common().
+ * 3) Error path of rwsem_down_write_slowpath().
*
* For all the above cases, wait_lock will be held. A writer must also
* be the first one in the wait_list to be eligible for setting the handoff
@@ -259,12 +259,20 @@ static inline unsigned long rwsem_count_owner(long count)
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, current);
+ atomic_long_set(&sem->owner, (long)current);
}
static inline void rwsem_clear_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, NULL);
+ atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+ return atomic_long_read(&sem->owner) & flags;
}
/*
@@ -280,11 +288,10 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
- ((unsigned long)READ_ONCE(sem->owner) &
- RWSEM_RD_NONSPINNABLE);
+ long val = (long)owner | RWSEM_READER_OWNED |
+ (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
+ atomic_long_set(&sem->owner, val);
}
static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -292,34 +299,6 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
__rwsem_set_reader_owned(sem, current);
}
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner, bool wr)
-{
- unsigned long bit = wr ? RWSEM_WR_NONSPINNABLE : RWSEM_RD_NONSPINNABLE;
-
- return !((unsigned long)owner & bit);
-}
-
-/*
- * Return true if the rwsem is spinnable.
- */
-static inline bool is_rwsem_spinnable(struct rw_semaphore *sem, bool wr)
-{
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
-}
-
-/*
- * Remove all the flag bits from owner.
- */
-static inline struct task_struct *owner_without_flags(struct task_struct *owner)
-{
- return (struct task_struct *)((long)owner & ~RWSEM_OWNER_FLAGS_MASK);
-}
-
/*
* Return true if the rwsem is owned by a reader.
*/
@@ -334,7 +313,7 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
if (count & RWSEM_WRITER_MASK)
return false;
#endif
- return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+ return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
}
#ifdef CONFIG_DEBUG_RWSEMS
@@ -346,11 +325,13 @@ static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
*/
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
- unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+ long val = atomic_long_read(&sem->owner);
- if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
- cmpxchg_relaxed((unsigned long *)&sem->owner, owner,
- owner & RWSEM_OWNER_FLAGS_MASK);
+ while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
+ if (atomic_long_try_cmpxchg(&sem->owner, &val,
+ val & RWSEM_OWNER_FLAGS_MASK))
+ return;
+ }
}
#else
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -364,13 +345,13 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
*/
static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
{
- long owner = (long)READ_ONCE(sem->owner);
+ long owner = atomic_long_read(&sem->owner);
while (owner & RWSEM_READER_OWNED) {
- if (!is_rwsem_owner_spinnable((void *)owner, false))
+ if (owner & RWSEM_NONSPINNABLE)
break;
- owner = cmpxchg((long *)&sem->owner, owner,
- owner | RWSEM_NONSPINNABLE);
+ owner = atomic_long_cmpxchg(&sem->owner, owner,
+ owner | RWSEM_NONSPINNABLE);
}
}
@@ -395,16 +376,22 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
/*
* Get the owner value from count to have early access to the task structure.
- * Owner from sem->count should includes the RWSEM_NONSPINNABLE bits
- * from sem->owner.
*/
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
{
- unsigned long cowner = rwsem_count_owner(atomic_long_read(&sem->count));
- unsigned long sowner = (unsigned long)READ_ONCE(sem->owner);
+ return (struct task_struct *)
+ rwsem_count_owner(atomic_long_read(&sem->count));
+}
- return (struct task_struct *) (cowner
- ? cowner | (sowner & RWSEM_NONSPINNABLE) : sowner);
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+ *pflags = atomic_long_read(&sem->owner) & RWSEM_OWNER_FLAGS_MASK;
+ return rwsem_read_owner(sem);
}
/*
@@ -448,15 +435,33 @@ static int __init rwsem_show_count_status(void)
return 0;
}
late_initcall(rwsem_show_count_status);
+
#else /* !MERGE_OWNER_INTO_COUNT */
#define rwsem_preempt_disable()
#define rwsem_preempt_enable()
#define rwsem_schedule_preempt_disabled() schedule()
-static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
{
- return READ_ONCE(sem->owner);
+ return (struct task_struct *)(atomic_long_read(&sem->owner) &
+ ~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+ long owner = atomic_long_read(&sem->owner);
+
+ *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+ return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
}
static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
@@ -499,7 +504,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
- sem->owner = NULL;
+ atomic_long_set(&sem->owner, 0L);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
osq_lock_init(&sem->osq);
#endif
@@ -516,7 +521,7 @@ struct rwsem_waiter {
struct task_struct *task;
enum rwsem_waiter_type type;
unsigned long timeout;
- unsigned long last_rowner;
+ long last_rowner;
};
#define rwsem_first_waiter(sem) \
list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -560,9 +565,9 @@ enum writer_wait_state {
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only marked woken if downgrading is false
*/
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type,
+ struct wake_q_head *wake_q)
{
struct rwsem_waiter *waiter, *tmp;
long oldcount, woken = 0, adjustment = 0;
@@ -722,25 +727,31 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* If wstate is WRITER_HANDOFF, it will make sure that either the handoff
* bit is set or the lock is acquired with handoff bit cleared.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
const long wlock,
enum writer_wait_state wstate)
{
- long new;
+ long count, new;
lockdep_assert_held(&sem->wait_lock);
+
+ count = atomic_long_read(&sem->count);
do {
bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
if (has_handoff && wstate == WRITER_NOT_FIRST)
return false;
+ new = count;
+
if (count & RWSEM_LOCK_MASK) {
if (has_handoff || (wstate != WRITER_HANDOFF))
return false;
- new = count | RWSEM_FLAG_HANDOFF;
+
+ new |= RWSEM_FLAG_HANDOFF;
} else {
- new = (count | wlock) & ~RWSEM_FLAG_HANDOFF;
+ new |= wlock;
+ new &= ~RWSEM_FLAG_HANDOFF;
if (list_is_singular(&sem->wait_list))
new &= ~RWSEM_FLAG_WAITERS;
@@ -804,11 +815,6 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
static inline bool owner_on_cpu(struct task_struct *owner)
{
- /*
- * Clear all the flag bits in owner
- */
- owner = owner_without_flags(owner);
-
/*
* As lock holder preemption issue, we both skip spinning if
* task is not on cpu or its cpu is preempted
@@ -816,12 +822,14 @@ static inline bool owner_on_cpu(struct task_struct *owner)
return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
}
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
struct task_struct *owner;
+ long flags;
bool ret = true;
- BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN, true));
+ BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
if (need_resched()) {
lockevent_inc(rwsem_opt_fail);
@@ -830,17 +838,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
preempt_disable();
rcu_read_lock();
-
- owner = rwsem_get_owner(sem);
- if (!is_rwsem_owner_spinnable(owner, wr))
+ owner = rwsem_read_owner_flags(sem, &flags);
+ if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
ret = false;
- else if ((unsigned long)owner & RWSEM_READER_OWNED)
- ret = true;
- else if ((owner = owner_without_flags(owner)))
- ret = owner_on_cpu(owner);
-
rcu_read_unlock();
preempt_enable();
+
lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
}
@@ -864,26 +867,27 @@ enum owner_state {
};
#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER | OWNER_READER)
-static inline enum owner_state rwsem_owner_state(unsigned long owner, bool wr)
+static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
+ long flags, long nonspinnable)
{
- if (!is_rwsem_owner_spinnable((void *)owner, wr))
+ if (flags & nonspinnable)
return OWNER_NONSPINNABLE;
- if (owner & RWSEM_READER_OWNED)
+ if (flags & RWSEM_READER_OWNED)
return OWNER_READER;
- if (!(owner & ~RWSEM_OWNER_FLAGS_MASK))
- return OWNER_NULL;
-
- return OWNER_WRITER;
+ return owner ? OWNER_WRITER : OWNER_NULL;
}
-static noinline enum owner_state
-rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
- struct task_struct *tmp, *owner = rwsem_get_owner(sem);
- enum owner_state state = rwsem_owner_state((unsigned long)owner, wr);
+ struct task_struct *new, *owner;
+ long flags, new_flags;
+ enum owner_state state;
+ owner = rwsem_read_owner_flags(sem, &flags);
+ state = rwsem_owner_state(owner, flags, nonspinnable);
if (state != OWNER_WRITER)
return state;
@@ -894,9 +898,9 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
break;
}
- tmp = rwsem_get_owner(sem);
- if (tmp != owner) {
- state = rwsem_owner_state((unsigned long)tmp, wr);
+ new = rwsem_read_owner_flags(sem, &new_flags);
+ if ((new != owner) || (new_flags != flags)) {
+ state = rwsem_owner_state(new, new_flags, nonspinnable);
break;
}
@@ -923,40 +927,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
/*
* Calculate reader-owned rwsem spinning threshold for writer
*
- * It is assumed that the more readers own the rwsem, the longer it will
- * take for them to wind down and free the rwsem. So the formula to
- * determine the actual spinning time limit is:
- *
- * 1) RWSEM_FLAG_WAITERS set
- * Spinning threshold = (10 + nr_readers/2)us
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
*
- * 2) RWSEM_FLAG_WAITERS not set
- * Spinning threshold = 25us
+ * Spinning threshold = (10 + nr_readers/2)us
*
- * In the first case when RWSEM_FLAG_WAITERS is set, no new reader can
- * become rwsem owner. It is assumed that the more readers own the rwsem,
- * the longer it will take for them to wind down and free the rwsem. In
- * addition, if it happens that a previous task that releases the lock
- * is in the process of waking up readers one-by-one, the process will
- * take longer when more readers needed to be woken up. This is subjected
- * to a maximum value of 25us.
- *
- * In the second case with RWSEM_FLAG_WAITERS off, new readers can join
- * and become one of the owners. So assuming for the worst case and spin
- * for at most 25us.
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subjected to change in the future.
*/
static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
{
long count = atomic_long_read(&sem->count);
- u64 delta = 25 * NSEC_PER_USEC;
+ int readers = count >> RWSEM_READER_SHIFT;
+ u64 delta;
- if (count & RWSEM_FLAG_WAITERS) {
- int readers = count >> RWSEM_READER_SHIFT;
-
- if (readers > 30)
- readers = 30;
- delta = (20 + readers) * NSEC_PER_USEC / 2;
- }
+ if (readers > 30)
+ readers = 30;
+ delta = (20 + readers) * NSEC_PER_USEC / 2;
return sched_clock() + delta;
}
@@ -967,6 +955,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
int prev_owner_state = OWNER_NULL;
int loop = 0;
u64 rspin_threshold = 0;
+ long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+ : RWSEM_RD_NONSPINNABLE;
preempt_disable();
@@ -981,8 +971,9 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
* 2) readers own the lock and spinning time has exceeded limit.
*/
for (;;) {
- enum owner_state owner_state = rwsem_spin_on_owner(sem, wlock);
+ enum owner_state owner_state;
+ owner_state = rwsem_spin_on_owner(sem, nonspinnable);
if (!(owner_state & OWNER_SPINNABLE))
break;
@@ -1007,7 +998,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
* the beginning of the 2nd reader phase.
*/
if (prev_owner_state != OWNER_READER) {
- if (!is_rwsem_spinnable(sem, wlock))
+ if (rwsem_test_oflags(sem, nonspinnable))
break;
rspin_threshold = rwsem_rspin_threshold(sem);
loop = 0;
@@ -1095,9 +1086,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
*/
static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
{
- if (!is_rwsem_spinnable(sem, true))
- atomic_long_andnot(RWSEM_WR_NONSPINNABLE,
- (atomic_long_t *)&sem->owner);
+ if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+ atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
}
/*
@@ -1120,9 +1110,9 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
* not be here at all.
*/
static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
- unsigned long last_rowner)
+ long last_rowner)
{
- unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
+ long owner = atomic_long_read(&sem->owner);
if (!(owner & RWSEM_READER_OWNED))
return false;
@@ -1137,7 +1127,8 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
return false;
}
#else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, bool wr)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
return false;
}
@@ -1180,7 +1171,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
* holding or trying to acquire the lock. So disable
* optimistic spinning and go directly into the wait list.
*/
- if (is_rwsem_spinnable(sem, false))
+ if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
rwsem_set_nonspinnable(sem);
goto queue;
}
@@ -1189,11 +1180,11 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
* Save the current read-owner of rwsem, if available, and the
* reader nonspinnable bit.
*/
- waiter.last_rowner = (long)READ_ONCE(sem->owner);
+ waiter.last_rowner = atomic_long_read(&sem->owner);
if (!(waiter.last_rowner & RWSEM_READER_OWNED))
waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
- if (!rwsem_can_spin_on_owner(sem, false))
+ if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
goto queue;
/*
@@ -1209,8 +1200,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
raw_spin_lock_irq(&sem->wait_lock);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
- &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+ &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
}
@@ -1261,7 +1252,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
}
if (wake || (!(count & RWSEM_WRITER_MASK) &&
(adjustment & RWSEM_FLAG_WAITERS)))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
@@ -1297,11 +1288,15 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
return ERR_PTR(-EINTR);
}
+/*
+ * This function is called by the a write lock owner. So the owner value
+ * won't get changed by others.
+ */
static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
bool disable)
{
if (unlikely(disable)) {
- *((unsigned long *)&sem->owner) |= RWSEM_RD_NONSPINNABLE;
+ atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
lockevent_inc(rwsem_opt_norspin);
}
}
@@ -1321,7 +1316,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
const long wlock = RWSEM_WRITER_LOCKED;
/* do optimistic spinning and steal lock if possible */
- if (rwsem_can_spin_on_owner(sem, true) &&
+ if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
rwsem_optimistic_spin(sem, wlock))
return sem;
@@ -1330,7 +1325,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* acquiring the write lock when the setting of the nonspinnable
* bits are observed.
*/
- disable_rspin = (long)READ_ONCE(sem->owner) & RWSEM_NONSPINNABLE;
+ disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
/*
* Optimistic spinning failed, proceed to the slowpath
@@ -1362,7 +1357,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
if (count & RWSEM_WRITER_MASK)
goto wait;
- __rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+ rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
? RWSEM_WAKE_READERS
: RWSEM_WAKE_ANY, &wake_q);
@@ -1375,25 +1370,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
wake_up_q(&wake_q);
wake_q_init(&wake_q); /* Used again, reinit */
raw_spin_lock_irq(&sem->wait_lock);
- /*
- * This waiter may have become first in the wait
- * list after re-acquring the wait_lock. The
- * rwsem_first_waiter() test in the main while
- * loop below will correctly detect that. We do
- * need to reload count to perform proper trylock
- * and avoid missed wakeup.
- */
- count = atomic_long_read(&sem->count);
}
} else {
- count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
}
wait:
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem, wlock, wstate))
+ if (rwsem_try_write_lock(sem, wlock, wstate))
break;
raw_spin_unlock_irq(&sem->wait_lock);
@@ -1434,7 +1420,6 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
}
raw_spin_lock_irq(&sem->wait_lock);
- count = atomic_long_read(&sem->count);
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
@@ -1455,7 +1440,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
if (list_empty(&sem->wait_list))
atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
lockevent_inc(rwsem_wlock_fail);
@@ -1475,7 +1460,7 @@ static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
@@ -1496,7 +1481,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
@@ -1572,7 +1557,12 @@ static inline void __down_write(struct rw_semaphore *sem)
else
rwsem_set_owner(sem);
#ifdef MERGE_OWNER_INTO_COUNT
- DEBUG_RWSEMS_WARN_ON(sem->owner != rwsem_get_owner(sem), sem);
+ /*
+ * Make sure that count<=>owner translation is correct.
+ */
+ DEBUG_RWSEMS_WARN_ON(
+ (atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK) !=
+ (long)rwsem_read_owner(sem), sem);
#endif
}
@@ -1631,8 +1621,8 @@ static inline void __up_write(struct rw_semaphore *sem)
* sem->owner may differ from current if the ownership is transferred
* to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
*/
- DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
- !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+ DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+ !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1653,7 +1643,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- DEBUG_RWSEMS_WARN_ON(owner_without_flags(sem->owner) != current, sem);
+ DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.
To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used. The atomic long count has the following
bit definitions:
Bit 0 - writer locked bit
Bit 1 - waiters present bit
Bits 2-7 - reserved for future extension
Bits 8-X - reader count (24/56 bits)
The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.
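As a rough userspace illustration of the layout above (not the kernel
code), the sketch below models the count with C11 atomics. The model_*
names are hypothetical, the macros only mirror the bit positions listed
above, and the reader path simply backs out on failure, whereas the real
slowpath would queue the reader instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit layout as described in the commit message. */
#define WRITER_LOCKED    (1UL << 0)
#define FLAG_WAITERS     (1UL << 1)
#define READER_SHIFT     8
#define READER_BIAS      (1UL << READER_SHIFT)
#define READER_MASK      (~(READER_BIAS - 1))
#define READ_FAILED_MASK (WRITER_LOCKED | FLAG_WAITERS)

/* Hypothetical stand-in for the count field of struct rw_semaphore. */
struct model_rwsem {
	atomic_ulong count;
};

/* Writer fast path: cmpxchg from the fully unlocked value (0). */
static bool model_write_trylock(struct model_rwsem *sem)
{
	unsigned long expected = 0;

	return atomic_compare_exchange_strong_explicit(&sem->count,
			&expected, WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed);
}

/* Reader fast path: xadd first, then inspect the old value. */
static bool model_read_trylock(struct model_rwsem *sem)
{
	unsigned long old = atomic_fetch_add_explicit(&sem->count,
			READER_BIAS, memory_order_acquire);

	if (old & READ_FAILED_MASK) {
		/* Model only: back out instead of entering a slowpath. */
		atomic_fetch_sub_explicit(&sem->count, READER_BIAS,
					  memory_order_relaxed);
		return false;
	}
	return true;
}

static void model_write_unlock(struct model_rwsem *sem)
{
	unsigned long old = atomic_fetch_sub_explicit(&sem->count,
			WRITER_LOCKED, memory_order_release);

	assert(old & WRITER_LOCKED);	/* must have been write-locked */
}

static void model_read_unlock(struct model_rwsem *sem)
{
	unsigned long old = atomic_fetch_sub_explicit(&sem->count,
			READER_BIAS, memory_order_release);

	assert(old & READER_MASK);	/* at least one reader was counted */
}

/* Number of readers currently accounted for in the count. */
static unsigned long model_reader_count(struct model_rwsem *sem)
{
	return atomic_load_explicit(&sem->count, memory_order_relaxed)
		>> READER_SHIFT;
}
```

Note that the reader adds its bias before it knows whether the lock is
available, so the reader bits can be transiently non-zero even while a
writer holds the lock.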
With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
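The reader-owned test described above can be sketched as a pure
function over a snapshot of the count and of the owner flags. This is
only an illustration: model_is_read_owned() is a hypothetical helper,
and placing the reader-owned flag at bit 0 of the owner field is an
assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Count bits, as listed in the commit message. */
#define WRITER_LOCKED  (1UL << 0)
#define READER_BIAS    (1UL << 8)
#define READER_MASK    (~(READER_BIAS - 1))

/* Owner flag; the bit position is assumed for this sketch. */
#define READER_OWNED   (1UL << 0)

static_assert((WRITER_LOCKED & READER_MASK) == 0,
	      "flag bits must not overlap the reader count");

/*
 * A lock is treated as reader-owned only when all three conditions
 * hold: no writer bit, some reader bits, and the reader-owned flag
 * in the owner field.
 */
static bool model_is_read_owned(unsigned long count,
				unsigned long owner_flags)
{
	return !(count & WRITER_LOCKED) &&
	       (count & READER_MASK) &&
	       (owner_flags & READER_OWNED);
}
```

A count of READER_BIAS with no reader-owned flag is the ambiguous case:
it may just be a failed reader that has not yet subtracted its bias.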
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) of the benchmark on an 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:
                 Before Patch      After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607
The locking rates of the benchmark on a Power8 system were as follows:
                 Before Patch      After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168
The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:
                 Before Patch      After Patch
  # of Threads   wlock    rlock    wlock    rlock
  ------------   -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680
The performance is roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 147 ++++++++++++------------------------
kernel/locking/rwsem.h | 74 +++++++++---------
2 files changed, 85 insertions(+), 136 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3083fdf50447..7d537b50a849 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
*
* Optimistic spinning by Tim Chen <[email protected]>
* and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
*/
#include <linux/rwsem.h>
#include <linux/init.h>
@@ -22,52 +24,20 @@
#include "rwsem.h"
/*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X (1) X readers active or attempting lock, no writer waiting
- * X = #active_readers + #readers attempting to lock
- * (X*ACTIVE_BIAS)
- *
- * 0x00000000 rwsem is unlocked, and no one is waiting for the lock or
- * attempting to read lock or write lock.
- *
- * 0xffff000X (1) X readers active or attempting lock, with waiters for lock
- * X = #active readers + # readers attempting lock
- * (X*ACTIVE_BIAS + WAITING_BIAS)
- * (2) 1 writer attempting lock, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- * (3) 1 writer active, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001 (1) 1 reader active or attempting lock, waiters for lock
- * (WAITING_BIAS + ACTIVE_BIAS)
- * (2) 1 writer active or attempting lock, no waiters for lock
- * (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
*
- * 0xffff0000 (1) There are writers or readers queued but none active
- * or in the process of attempting lock.
- * (WAITING_BIAS)
- * Note: writer can attempt to steal lock for this count by adding
- * ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
*
- * 0xfffe0001 (1) 1 writer active, or attempting lock. Waiters on queue.
- * (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- * the count becomes more than 0 for successful lock acquisition,
- * i.e. the case where there are only readers or nobody has lock.
- * (1st and 2nd case above).
- *
- * Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- * checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- * acquisition (i.e. nobody else has lock or attempts lock). If
- * unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- * are only waiters but none active (5th case above), and attempt to
- * steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
*
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
*/
/*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
/*
* handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- * - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- * - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ * have been set.
* - there must be someone on the queue
* - the wait_lock must be held by the caller
* - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* so we can bail out early if a writer stole the lock.
*/
if (wake_type != RWSEM_WAKE_READ_OWNED) {
- adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+ adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
- if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
- /*
- * If the count is still less than RWSEM_WAITING_BIAS
- * after removing the adjustment, it is assumed that
- * a writer has stolen the lock. We have to undo our
- * reader grant.
- */
- if (atomic_long_add_return(-adjustment, &sem->count) <
- RWSEM_WAITING_BIAS)
- return;
-
- /* Last active locker left. Retry waking readers. */
- goto try_reader_grant;
+ if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ atomic_long_sub(adjustment, &sem->count);
+ return;
}
/*
* Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}
list_cut_before(&wlist, &sem->wait_list, &waiter->list);
- adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+ adjustment = woken * RWSEM_READER_BIAS - adjustment;
lockevent_cond_inc(rwsem_wake_reader, woken);
if (list_empty(&sem->wait_list)) {
/* hit end of list above */
- adjustment -= RWSEM_WAITING_BIAS;
+ adjustment -= RWSEM_FLAG_WAITERS;
}
if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
*/
static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
{
- /*
- * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
- */
- if (count != RWSEM_WAITING_BIAS)
+ long new;
+
+ if (count & RWSEM_LOCK_MASK)
return false;
- /*
- * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
- * are other tasks on the wait list, we need to add on WAITING_BIAS.
- */
- count = list_is_singular(&sem->wait_list) ?
- RWSEM_ACTIVE_WRITE_BIAS :
- RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+ new = count + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
- if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
- == RWSEM_WAITING_BIAS) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
rwsem_set_owner(sem);
return true;
}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
{
long count = atomic_long_read(&sem->count);
- while (!count || count == RWSEM_WAITING_BIAS) {
+ while (!(count & RWSEM_LOCK_MASK)) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_ACTIVE_WRITE_BIAS)) {
+ count + RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
static inline struct rw_semaphore __sched *
__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
{
- long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+ long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
/*
* In case the wait queue is empty and the lock isn't owned
* by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_ACTIVE_READ_BIAS has already
- * been set in the count.
+ * immediately as its RWSEM_READER_BIAS has already been
+ * set in the count.
*/
- if (atomic_long_read(&sem->count) >= 0) {
+ if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
return sem;
}
- adjustment += RWSEM_WAITING_BIAS;
+ adjustment += RWSEM_FLAG_WAITERS;
}
list_add_tail(&waiter.list, &sem->wait_list);
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
* If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers !
*/
- if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS &&
- adjustment != -RWSEM_ACTIVE_READ_BIAS))
+ if (!(count & RWSEM_LOCK_MASK) ||
+ (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
out_nolock:
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
- /* undo write bias from down_write operation, stop active locking */
- count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
/* do optimistic spinning and steal lock if possible */
if (rwsem_optimistic_spin(sem))
return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
list_add_tail(&waiter.list, &sem->wait_list);
- /* we're now waiting on the lock, but no longer actively locking */
+ /* we're now waiting on the lock */
if (waiting) {
count = atomic_long_read(&sem->count);
/*
* If there were already threads queued before us and there are
- * no active writers, the lock must be read owned; so we try to
- * wake any read locks that were queued ahead of us.
+ * no active writers and some readers, the lock must be read
+ * owned; so we try to wake any read locks that were queued ahead
+ * of us.
*/
- if (count > RWSEM_WAITING_BIAS) {
+ if (!(count & RWSEM_WRITER_MASK) &&
+ (count & RWSEM_READER_MASK)) {
__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/*
* The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
wake_q_init(&wake_q);
}
- } else
- count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+ } else {
+ count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ }
/* wait until we successfully acquire the lock */
set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
- } while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+ count = atomic_long_read(&sem->count);
+ } while (count & RWSEM_LOCK_MASK);
raw_spin_lock_irq(&sem->wait_lock);
}
@@ -603,7 +554,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..499a9b2bda82 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,24 @@
#endif
/*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
*/
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_WRITER_LOCKED (1UL << 0)
+#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_READER_SHIFT 8
+#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +151,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
*/
static inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_failed(sem);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
@@ -162,7 +163,8 @@ static inline void __down_read(struct rw_semaphore *sem)
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +185,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
lockevent_inc(rwsem_rtrylock);
do {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
return 1;
}
- } while (tmp >= 0);
+ } while (!(tmp & RWSEM_READ_FAILED_MASK));
return 0;
}
@@ -196,22 +198,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline void __down_write(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
rwsem_down_write_failed(sem);
rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
if (IS_ERR(rwsem_down_write_failed_killable(sem)))
return -EINTR;
rwsem_set_owner(sem);
@@ -224,7 +220,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
lockevent_inc(rwsem_wtrylock);
tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
+ RWSEM_WRITER_LOCKED);
if (tmp == RWSEM_UNLOCKED_VALUE) {
rwsem_set_owner(sem);
return true;
@@ -242,8 +238,9 @@ static inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
sem);
rwsem_clear_reader_owned(sem);
- tmp = atomic_long_dec_return_release(&sem->count);
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+ == RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -254,8 +251,8 @@ static inline void __up_write(struct rw_semaphore *sem)
{
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count) < 0))
+ if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+ &sem->count) & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -274,8 +271,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* write side. As such, rely on RELEASE semantics.
*/
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ tmp = atomic_long_fetch_add_release(
+ -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp < 0)
+ if (tmp & RWSEM_FLAG_WAITERS)
rwsem_downgrade_wake(sem);
}
--
2.18.1
The rwsem->owner field holds not just the task structure pointer but
also some flags that store the current state of the rwsem. Some of
the flags may have to be atomically updated. To reflect the new reality,
the owner is now changed to an atomic_long_t type.
New helper functions are added to properly separate out the task
structure pointer and the embedded flags.
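A minimal userspace sketch of the pointer/flags split follows. It is
not the kernel implementation: the model_* names are hypothetical, the
two flag bit positions are assumptions, and atomic_intptr_t stands in
for atomic_long_t. It relies on the task structure being aligned so
that the low bits of its address are free to carry flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag positions in the low bits of the owner word. */
#define READER_OWNED      ((intptr_t)1 << 0)
#define NONSPINNABLE      ((intptr_t)1 << 1)
#define OWNER_FLAGS_MASK  (READER_OWNED | NONSPINNABLE)

/* Hypothetical task structure; alignment keeps the low bits clear. */
struct model_task {
	_Alignas(8) int pid;
};

/* Record a writer as owner: plain pointer, no flags. */
static void model_set_owner(atomic_intptr_t *owner, struct model_task *task)
{
	assert(((intptr_t)task & OWNER_FLAGS_MASK) == 0);
	atomic_store_explicit(owner, (intptr_t)task, memory_order_relaxed);
}

/* Record a reader as owner: pointer plus both flags, as in the patch. */
static void model_set_reader_owned(atomic_intptr_t *owner,
				   struct model_task *task)
{
	intptr_t val = (intptr_t)task | READER_OWNED | NONSPINNABLE;

	atomic_store_explicit(owner, val, memory_order_relaxed);
}

/* Test one or more flags in the owner word. */
static bool model_test_oflags(atomic_intptr_t *owner, intptr_t flags)
{
	return atomic_load_explicit(owner, memory_order_relaxed) & flags;
}

/* Return the task pointer and the flags from one consistent snapshot. */
static struct model_task *
model_read_owner_flags(atomic_intptr_t *owner, intptr_t *pflags)
{
	intptr_t val = atomic_load_explicit(owner, memory_order_relaxed);

	*pflags = val & OWNER_FLAGS_MASK;
	return (struct model_task *)(val & ~OWNER_FLAGS_MASK);
}
```

Reading the pointer and the flags from a single atomic load is the
point of the combined helper: two separate loads could observe the
owner word before and after a concurrent update.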
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/percpu-rwsem.h | 4 +-
include/linux/rwsem.h | 11 +--
kernel/locking/rwsem.c | 125 ++++++++++++++++++++++-------------
3 files changed, 88 insertions(+), 52 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
lock_release(&sem->rw_sem.dep_map, 1, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+ atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
#endif
}
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = current;
+ atomic_long_set(&sem->rw_sem.owner, (long)current);
#endif
}
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
struct rw_semaphore {
atomic_long_t count;
/*
- * Write owner or one of the read owners. Can be used as a
- * speculative check to see if the owner is running on the cpu.
+ * Write owner or one of the read owners as well as flags regarding
+ * the current state of the rwsem. Can be used as a speculative
+ * check to see if the write owner is running on the cpu.
*/
- struct task_struct *owner;
+ atomic_long_t owner;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
#endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
* Setting all bits of the owner field except bit 0 will indicate
* that the rwsem is writer-owned with an unknown owner.
*/
-#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN (-2L)
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#define __RWSEM_INITIALIZER(name) \
{ __RWSEM_INIT_COUNT(name), \
- .owner = NULL, \
+ .owner = ATOMIC_LONG_INIT(0), \
.wait_list = LIST_HEAD_INIT((name).wait_list), \
.wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
__RWSEM_OPT_INIT(name) \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9eb46ab9edaa..555da4868e54 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -64,7 +64,7 @@
if (!debug_locks_silent && \
WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
#c, atomic_long_read(&(sem)->count), \
- (long)((sem)->owner), (long)current, \
+ atomic_long_read(&(sem)->owner), (long)current, \
list_empty(&(sem)->wait_list) ? "" : "not ")) \
debug_locks_off(); \
} while (0)
@@ -114,12 +114,20 @@
*/
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, current);
+ atomic_long_set(&sem->owner, (long)current);
}
static inline void rwsem_clear_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, NULL);
+ atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+ return atomic_long_read(&sem->owner) & flags;
}
/*
@@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_NONSPINNABLE;
+ long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
+ atomic_long_set(&sem->owner, val);
}
static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
}
/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
+ * Return true if the rwsem is owned by a reader.
*/
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
{
- return !((unsigned long)owner & RWSEM_NONSPINNABLE);
+#ifdef CONFIG_DEBUG_RWSEMS
+ /*
+ * Check the count to see if it is write-locked.
+ */
+ long count = atomic_long_read(&sem->count);
+
+ if (count & RWSEM_WRITER_MASK)
+ return false;
+#endif
+ return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
}
#ifdef CONFIG_DEBUG_RWSEMS
@@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
*/
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
- unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_NONSPINNABLE;
- if (READ_ONCE(sem->owner) == (struct task_struct *)val)
- cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+ long val = atomic_long_read(&sem->owner);
+
+ while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
+ if (atomic_long_try_cmpxchg(&sem->owner, &val,
+ val & RWSEM_OWNER_FLAGS_MASK))
+ return;
+ }
}
#else
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
}
#endif
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+{
+ return (struct task_struct *)(atomic_long_read(&sem->owner) &
+ ~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+ long owner = atomic_long_read(&sem->owner);
+
+ *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+ return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
/*
* Guide to the rw_semaphore's count field.
*
@@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
- sem->owner = NULL;
+ atomic_long_set(&sem->owner, 0L);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
osq_lock_init(&sem->osq);
#endif
@@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner;
+ long flags;
bool ret = true;
- BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
+ BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
if (need_resched()) {
lockevent_inc(rwsem_opt_fail);
@@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
preempt_disable();
rcu_read_lock();
- owner = READ_ONCE(sem->owner);
- if (owner) {
- ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner);
- }
+ owner = rwsem_read_owner_flags(sem, &flags);
+ if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+ ret = false;
rcu_read_unlock();
preempt_enable();
@@ -553,25 +590,26 @@ enum owner_state {
};
#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
-static inline enum owner_state rwsem_owner_state(unsigned long owner)
+static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
+ long flags)
{
- if (!owner)
- return OWNER_NULL;
-
- if (owner & RWSEM_NONSPINNABLE)
+ if (flags & RWSEM_NONSPINNABLE)
return OWNER_NONSPINNABLE;
- if (owner & RWSEM_READER_OWNED)
+ if (flags & RWSEM_READER_OWNED)
return OWNER_READER;
- return OWNER_WRITER;
+ return owner ? OWNER_WRITER : OWNER_NULL;
}
static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
{
- struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
- enum owner_state state = rwsem_owner_state((unsigned long)owner);
+ struct task_struct *new, *owner;
+ long flags, new_flags;
+ enum owner_state state;
+ owner = rwsem_read_owner_flags(sem, &flags);
+ state = rwsem_owner_state(owner, flags);
if (state != OWNER_WRITER)
return state;
@@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
break;
}
- tmp = READ_ONCE(sem->owner);
- if (tmp != owner) {
- state = rwsem_owner_state((unsigned long)tmp);
+ new = rwsem_read_owner_flags(sem, &new_flags);
+ if ((new != owner) || (new_flags != flags)) {
+ state = rwsem_owner_state(new, new_flags);
break;
}
@@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
&sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
* sem->owner may differ from current if the ownership is transferred
* to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
*/
- DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
- !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+ DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+ !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
@@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
void up_read_non_owner(struct rw_semaphore *sem)
{
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
__up_read(sem);
}
EXPORT_SYMBOL(up_read_non_owner);
--
2.18.1
It is very unlikely that successive preemptions in the middle of
down_read's inc-check-dec sequence will cause the reader count to
overflow. For absolute correctness, however, we still need to prevent
that possibility from happening. So preemption will be disabled during
the down_read*() call.
For PREEMPT=n kernels, there isn't much overhead in doing that.
For PREEMPT=y kernels, there will be some additional cost. RT kernels
have their own rwsem code, so it will not be a problem for them.
If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry about
reader count overflow and so we don't need to disable preemption.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 38 ++++++++++++++++++++++++++++++++++----
1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 29f0e0e5b62e..cede2f99220b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -356,6 +356,24 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
}
#ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * It is very unlikely that successive preemption in the middle of
+ * down_read's inc-check-dec sequence will cause the reader count to
+ * overflow. For absolute correctness, we still need to prevent
+ * that possibility from happening. So preemption will be disabled
+ * during the down_read*() call.
+ *
+ * For PREEMPT=n kernels, there isn't much overhead in doing that.
+ * For PREEMPT=y kernels, there will be some additional cost.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't need to worry
+ * about reader count overflow and so we don't need to disable
+ * preemption.
+ */
+#define rwsem_preempt_disable() preempt_disable()
+#define rwsem_preempt_enable() preempt_enable()
+#define rwsem_schedule_preempt_disabled() schedule_preempt_disabled()
+
/*
* Get the owner value from count to have early access to the task structure.
*/
@@ -420,6 +438,10 @@ late_initcall(rwsem_show_count_status);
#else /* !MERGE_OWNER_INTO_COUNT */
+#define rwsem_preempt_disable()
+#define rwsem_preempt_enable()
+#define rwsem_schedule_preempt_disabled() schedule()
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -1247,7 +1269,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
raw_spin_unlock_irq(&sem->wait_lock);
break;
}
- schedule();
+ rwsem_schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_reader);
}
@@ -1472,28 +1494,36 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
*/
inline void __down_read(struct rw_semaphore *sem)
{
- long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+ long tmp, adjustment;
+ rwsem_preempt_disable();
+ adjustment = rwsem_read_trylock(sem, &tmp);
if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
+ rwsem_preempt_enable();
}
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+ long tmp, adjustment;
+ rwsem_preempt_disable();
+ adjustment = rwsem_read_trylock(sem, &tmp);
if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
- adjustment)))
+ adjustment))) {
+ rwsem_preempt_enable();
return -EINTR;
+ }
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
+ rwsem_preempt_enable();
return 0;
}
--
2.18.1
On 64-bit architectures, each rwsem writer will have its unique lock
word for acquiring the lock. Right now, the writer code recomputes the
lock word every time it tries to acquire the lock. This is a waste of
time. The lock word is now cached and reused when it is needed.
When CONFIG_RWSEM_OWNER_COUNT isn't defined, the extra constant argument
to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should
be optimized out by the compiler.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8196ace2d4a2..29f0e0e5b62e 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -706,6 +706,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* bit is set or the lock is acquired with handoff bit cleared.
*/
static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
+ const long wlock,
enum writer_wait_state wstate)
{
long count, new;
@@ -727,7 +728,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
new |= RWSEM_FLAG_HANDOFF;
} else {
- new |= RWSEM_WRITER_LOCKED;
+ new |= wlock;
new &= ~RWSEM_FLAG_HANDOFF;
if (list_is_singular(&sem->wait_list))
@@ -774,13 +775,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+ const long wlock)
{
long count = atomic_long_read(&sem->count);
while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count | RWSEM_WRITER_LOCKED)) {
+ count | wlock)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -925,7 +927,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
return sched_clock() + delta;
}
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
{
bool taken = false;
int prev_owner_state = OWNER_NULL;
@@ -956,7 +958,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
/*
* Try to acquire the lock
*/
- taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+ taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
: rwsem_try_read_lock_unqueued(sem);
if (taken)
@@ -1109,7 +1111,8 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
return false;
}
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+ const long wlock)
{
return false;
}
@@ -1288,10 +1291,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
+ const long wlock = RWSEM_WRITER_LOCKED;
/* do optimistic spinning and steal lock if possible */
if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
- rwsem_optimistic_spin(sem, true))
+ rwsem_optimistic_spin(sem, wlock))
return sem;
/*
@@ -1353,7 +1357,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(sem, wstate))
+ if (rwsem_try_write_lock(sem, wlock, wstate))
break;
raw_spin_unlock_irq(&sem->wait_lock);
--
2.18.1
The upper bits of the count field are used as the reader count. When
a sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps on piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for
the reader count is reduced because those bits are needed for other
purposes.
To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
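
The guard-bit check described above can be modelled in userspace with C11
atomics. This is only a sketch: the constant names mirror the patch, but
sketch_read_trylock() is a hypothetical stand-in for rwsem_read_trylock(),
and the writer, waiter and handoff bits are left out for brevity.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Illustrative constants modelled on the patch; not the kernel definitions. */
#define READER_SHIFT	8
#define READER_BIAS	(1L << READER_SHIFT)
#define FLAG_READFAIL	LONG_MIN	/* most significant bit of a long */

/*
 * Add one reader to the count. If the count was already negative (guard
 * bit set), take the increment back at once and return 0 so that the
 * slowpath knows there is nothing left to undo. Otherwise return the
 * adjustment the slowpath would have to add back.
 */
static long sketch_read_trylock(atomic_long *count, long *cnt)
{
	long adjustment = -READER_BIAS;

	*cnt = atomic_fetch_add_explicit(count, READER_BIAS,
					 memory_order_acquire);
	if (*cnt < 0) {
		atomic_fetch_sub_explicit(count, READER_BIAS,
					  memory_order_relaxed);
		adjustment = 0;
	}
	return adjustment;
}
```

Because the increment is undone right away, the window in which the count
holds an excess reader stays as short as possible.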
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 95 +++++++++++++++++++++++++++++++++++-------
1 file changed, 80 insertions(+), 15 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 743476f386b2..028f29b39045 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
#endif
/*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
*
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bit 2 - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit 63 - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit 31 - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purpose
+ * in the future.
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
#define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_READFAIL (1UL << (BITS_PER_LONG - 1))
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
- RWSEM_FLAG_HANDOFF)
+ RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -253,6 +269,28 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
}
}
+/*
+ * This function does a read trylock by incrementing the reader count
+ * and then decrementing it immediately if too many readers are present
+ * (count becomes negative) in order to prevent the remote possibility
+ * of overflowing the count with minimal delay between the increment
+ * and decrement.
+ *
+ * It returns the adjustment that should be added back to the count
+ * in the slowpath.
+ */
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+ long adjustment = -RWSEM_READER_BIAS;
+
+ *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+ if (unlikely(*cnt < 0)) {
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ adjustment = 0;
+ }
+ return adjustment;
+}
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -401,6 +439,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
return;
}
+ /*
+ * No reader wakeup if there are too many of them already.
+ */
+ if (unlikely(atomic_long_read(&sem->count) < 0))
+ return;
+
/*
* Writers might steal the lock before we grant it to the next reader.
* We prefer to do the first reader grant before counting readers
@@ -947,13 +991,30 @@ static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
* Wait for the read lock to be granted
*/
static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
{
- long count, adjustment = -RWSEM_READER_BIAS;
+ long count;
bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ if (unlikely(!adjustment)) {
+ /*
+ * This shouldn't happen. If it does, there is probably
+ * something wrong in the system.
+ */
+ WARN_ON_ONCE(1);
+
+ /*
+ * An adjustment of 0 means that there are too many readers
+ * holding or trying to acquire the lock. So disable
+ * optimistic spinning and go directly into the wait list.
+ */
+ if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
+ rwsem_set_nonspinnable(sem);
+ goto queue;
+ }
+
/*
* Save the current read-owner of rwsem, if available, and the
* reader nonspinnable bit.
@@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
*/
inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
+ long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+ if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+ rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
@@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
+ long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
+
+ if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
+ if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
+ adjustment)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
@@ -1360,6 +1424,7 @@ inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_wr_nonspinnable(sem);
--
2.18.1
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.
With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need
this bit to indicate that a rwsem is nonspinnable, but a clear bit
no longer implies that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.
This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/rwsem.h | 2 +-
kernel/locking/rwsem.c | 43 +++++++++++++++++++++---------------------
2 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
};
/*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
* that the rwsem is writer-owned with an unknown owner.
*/
#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index b8e209c5fa55..be939accd60c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
/*
* The least significant 2 bits of the owner value has the following
* meanings when set.
- * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- * i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
+ * - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ * - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ * The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ * readily determined. It can be reader owned or the owning writer
+ * is indeterminate.
*
* When a writer acquires a rwsem, it puts its task_struct pointer
* into the owner field. It is cleared after an unlock.
*
* When a reader acquires a rwsem, it will also puts its task_struct
* pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
* largely be left untouched. So for a free or reader-owned rwsem,
* the owner value may contain information about the last reader that
* acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
* a rwsem, but the overhead is simply too big.
*/
#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+#define RWSEM_NONSPINNABLE (1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
#ifdef CONFIG_DEBUG_RWSEMS
# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
+ | RWSEM_NONSPINNABLE;
WRITE_ONCE(sem->owner, (struct task_struct *)val);
}
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
/*
* Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
* N.B. !owner is considered spinnable.
*/
static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
{
- return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
- return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+ return !((unsigned long)owner & RWSEM_NONSPINNABLE);
}
#ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
+ | RWSEM_NONSPINNABLE;
if (READ_ONCE(sem->owner) == (struct task_struct *)val)
cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+ RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
}
#else
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -495,7 +489,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
struct task_struct *owner;
bool ret = true;
- BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+ BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
if (need_resched())
return false;
@@ -534,7 +528,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner)
if (!owner)
return OWNER_NULL;
- if (owner & RWSEM_ANONYMOUSLY_OWNED)
+ if (owner & RWSEM_NONSPINNABLE)
return OWNER_NONSPINNABLE;
if (owner & RWSEM_READER_OWNED)
@@ -1043,7 +1037,12 @@ static inline void __up_write(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ /*
+ * sem->owner may differ from current if the ownership is transferred
+ * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
+ */
+ DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
+ !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
--
2.18.1
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.
This patch provides a simple mechanism for a writer to spin on a
reader-owned rwsem. The spinning is bounded by a time threshold that
can vary from 10us to 25us depending on the condition of the rwsem.
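
As a rough illustration of that threshold, the sketch below computes the
spin budget from the number of readers using the (10 + nr_readers/2)us
formula given in the rwsem_rspin_threshold() comment of this patch. The
function name rspin_budget_ns() is hypothetical, and the sketch returns
only the budget, not an absolute sched_clock() deadline.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC		1000ULL
#define MAX_COUNTED_READERS	30	/* cap taken from the patch comment */

/* Spin budget in ns: (10 + readers/2)us, capped at 25us (30 readers). */
static uint64_t rspin_budget_ns(long readers)
{
	assert(readers >= 0);
	if (readers > MAX_COUNTED_READERS)
		readers = MAX_COUNTED_READERS;
	/* (20 + readers) / 2 us, computed in ns to keep the half-microsecond */
	return (20 + (uint64_t)readers) * NSEC_PER_USEC / 2;
}
```

With no readers the budget is 10us, and it grows by 0.5us per reader until
it saturates at 25us.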
When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer owned again. For
fairness, not even readers are allowed to acquire the reader-locked
rwsem by optimistic spinning.
We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for readers and one for writers. When optimistic spinning
is disabled, both bits will be set. When the reader count drops down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.
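
The life cycle of the two nonspinnable bits can be sketched in userspace
as follows. The flag values match the patch, but the sketch_*() helpers
and the plain long used in place of a task pointer are assumptions made
for illustration; this is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

#define READER_OWNED	(1L << 0)
#define RD_NONSPINNABLE	(1L << 1)
#define WR_NONSPINNABLE	(1L << 2)
#define NONSPINNABLE	(RD_NONSPINNABLE | WR_NONSPINNABLE)

/* Writer spin timed out: set both bits, but only while reader-owned. */
static void sketch_set_nonspinnable(atomic_long *owner)
{
	long old = atomic_load(owner);

	while ((old & READER_OWNED) && !(old & NONSPINNABLE)) {
		/* On failure, old is refreshed and the conditions rechecked. */
		if (atomic_compare_exchange_weak(owner, &old,
						 old | NONSPINNABLE))
			break;
	}
}

/* Reader count reached 0: let writers spin again, but not readers. */
static void sketch_clear_wr_nonspinnable(atomic_long *owner)
{
	if (atomic_load(owner) & WR_NONSPINNABLE)
		atomic_fetch_and(owner, ~WR_NONSPINNABLE);
}

/* Writer acquired the lock: store the task value, which clears all flags. */
static void sketch_set_writer_owner(atomic_long *owner, long task)
{
	assert(!(task & (READER_OWNED | NONSPINNABLE)));
	atomic_store(owner, task);
}
```

Calling the three helpers in order walks the owner word through the
sequence described above: both bits set, only the reader bit set, then
no flag bits once a writer owns the lock.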
The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.
  System                        Time for 16 Iterations
  ------                        ----------------------
  1-socket Skylake              ~800ns
  4-socket Broadwell            ~300ns
  2-socket ThunderX2 (arm64)    ~250ns
When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time. So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.
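
The iteration bounds quoted above follow from dividing the 25us budget by
the per-iteration time implied by the table. A small sketch of that
arithmetic, with max_spin_iterations() as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on spin-loop iterations that fit into budget_ns, given the
 * measured minimum time for 16 iterations of the loop.
 */
static uint64_t max_spin_iterations(uint64_t budget_ns,
				    uint64_t ns_per_16_iters)
{
	assert(ns_per_16_iters > 0);
	return budget_ns * 16 / ns_per_16_iters;
}
```

For the Broadwell figure the exact quotient is 1333, which the text
rounds to 1300.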
With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
  # of Threads  Pre-patch   Post-patch
  ------------  ---------   ----------
        2         1,759       6,684
        4         1,684       6,738
        8         1,074       7,222
       16           900       7,163
       32           458       7,316
       64           208         520
      128           168         425
      240           143         474
This patch gives a big boost in performance for mixed reader/writer
workloads.
With 32 locking threads, the rwsem lock event data were:
rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663
With 64 locking threads, the data looked like:
rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281
So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem.c | 173 ++++++++++++++++++++++++------
2 files changed, 144 insertions(+), 30 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
+LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 555da4868e54..ec4c26b353c9 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
#include <linux/sched/debug.h>
#include <linux/sched/wake_q.h>
#include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
#include <linux/export.h>
#include <linux/rwsem.h>
#include <linux/atomic.h>
@@ -31,24 +32,28 @@
#include "lock_events.h"
/*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
* meanings when set.
* - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- * - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- * The rwsem is anonymously owned, i.e. the owner(s) cannot be
- * readily determined. It can be reader owned or the owning writer
- * is indeterminate.
+ * - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ * - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
*
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the latter case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear it only to give writers preference
+ * to acquire the lock via optimistic spinning, but not readers. Similar
+ * action is also done in the reader slowpath.
+
* When a writer acquires a rwsem, it puts its task_struct pointer
* into the owner field. It is cleared after an unlock.
*
* When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem.
*
* That information may be helpful in debugging cases where the system
* seems to hang on a reader owned rwsem especially if only one reader
@@ -56,7 +61,9 @@
* a rwsem, but the overhead is simply too big.
*/
#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_NONSPINNABLE (1UL << 1)
+#define RWSEM_RD_NONSPINNABLE (1UL << 1)
+#define RWSEM_WR_NONSPINNABLE (1UL << 2)
+#define RWSEM_NONSPINNABLE (RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
#ifdef CONFIG_DEBUG_RWSEMS
@@ -141,7 +148,7 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
+ long val = (long)owner | RWSEM_READER_OWNED;
atomic_long_set(&sem->owner, val);
}
@@ -191,6 +198,22 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
}
#endif
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+ long owner = atomic_long_read(&sem->owner);
+
+ while (owner & RWSEM_READER_OWNED) {
+ if (owner & RWSEM_NONSPINNABLE)
+ break;
+ owner = atomic_long_cmpxchg(&sem->owner, owner,
+ owner | RWSEM_NONSPINNABLE);
+ }
+}
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -546,7 +569,8 @@ static inline bool owner_on_cpu(struct task_struct *owner)
return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
}
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
struct task_struct *owner;
long flags;
@@ -562,7 +586,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
preempt_disable();
rcu_read_lock();
owner = rwsem_read_owner_flags(sem, &flags);
- if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+ if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
ret = false;
rcu_read_unlock();
preempt_enable();
@@ -588,12 +612,12 @@ enum owner_state {
OWNER_READER = 1 << 2,
OWNER_NONSPINNABLE = 1 << 3,
};
-#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER | OWNER_READER)
static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
- long flags)
+ long flags, long nonspinnable)
{
- if (flags & RWSEM_NONSPINNABLE)
+ if (flags & nonspinnable)
return OWNER_NONSPINNABLE;
if (flags & RWSEM_READER_OWNED)
@@ -602,14 +626,15 @@ static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
return owner ? OWNER_WRITER : OWNER_NULL;
}
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
struct task_struct *new, *owner;
long flags, new_flags;
enum owner_state state;
owner = rwsem_read_owner_flags(sem, &flags);
- state = rwsem_owner_state(owner, flags);
+ state = rwsem_owner_state(owner, flags, nonspinnable);
if (state != OWNER_WRITER)
return state;
@@ -622,7 +647,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
new = rwsem_read_owner_flags(sem, &new_flags);
if ((new != owner) || (new_flags != flags)) {
- state = rwsem_owner_state(new, new_flags);
+ state = rwsem_owner_state(new, new_flags, nonspinnable);
break;
}
@@ -646,10 +671,39 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
return state;
}
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
+ *
+ * Spinning threshold = (10 + nr_readers/2)us
+ *
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subject to change in the future.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+ int readers = count >> RWSEM_READER_SHIFT;
+ u64 delta;
+
+ if (readers > 30)
+ readers = 30;
+ delta = (20 + readers) * NSEC_PER_USEC / 2;
+
+ return sched_clock() + delta;
+}
+
static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
bool taken = false;
int prev_owner_state = OWNER_NULL;
+ int loop = 0;
+ u64 rspin_threshold = 0;
+ long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+ : RWSEM_RD_NONSPINNABLE;
preempt_disable();
@@ -661,12 +715,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
* Optimistically spin on the owner field and attempt to acquire the
* lock whenever the owner changes. Spinning will be stopped when:
* 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
+ * 2) readers own the lock and spinning time has exceeded limit.
*/
for (;;) {
- enum owner_state owner_state = rwsem_spin_on_owner(sem);
+ enum owner_state owner_state;
+ owner_state = rwsem_spin_on_owner(sem, nonspinnable);
if (!(owner_state & OWNER_SPINNABLE))
break;
@@ -679,6 +733,39 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
if (taken)
break;
+ /*
+ * Time-based reader-owned rwsem optimistic spinning
+ */
+ if (wlock && (owner_state == OWNER_READER)) {
+ /*
+ * Re-initialize rspin_threshold every time when
+ * the owner state changes from non-reader to reader.
+ * This allows a writer to steal the lock in between
+ * 2 reader phases and have the threshold reset at
+ * the beginning of the 2nd reader phase.
+ */
+ if (prev_owner_state != OWNER_READER) {
+ if (rwsem_test_oflags(sem, nonspinnable))
+ break;
+ rspin_threshold = rwsem_rspin_threshold(sem);
+ loop = 0;
+ }
+
+ /*
+ * Check time threshold once every 16 iterations to
+ * avoid calling sched_clock() too frequently so
+ * as to reduce the average latency between the times
+ * when the lock becomes free and when the spinner
+ * is ready to do a trylock.
+ */
+ else if (!(++loop & 0xf) &&
+ (sched_clock() > rspin_threshold)) {
+ rwsem_set_nonspinnable(sem);
+ lockevent_inc(rwsem_opt_nospin);
+ break;
+ }
+ }
+
/*
* An RT task cannot do optimistic spinning if it cannot
* be sure the lock holder is running or live-lock may
@@ -733,8 +820,25 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This gives writers a better chance to acquire the rwsem before
+ * readers when the rwsem was being held by readers for a relatively long
+ * period of time. Race can happen that an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+ if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+ atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
+}
#else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ long nonspinnable)
{
return false;
}
@@ -743,6 +847,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
return false;
}
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
#endif
/*
@@ -752,10 +858,11 @@ static struct rw_semaphore __sched *
rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
+ bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
- if (!rwsem_can_spin_on_owner(sem))
+ if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
goto queue;
/*
@@ -815,8 +922,12 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
* If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers !
*/
- if (!(count & RWSEM_LOCK_MASK) ||
- (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+ if (!(count & RWSEM_LOCK_MASK)) {
+ clear_wr_nonspinnable(sem);
+ wake = true;
+ }
+ if (wake || (!(count & RWSEM_WRITER_MASK) &&
+ (adjustment & RWSEM_FLAG_WAITERS)))
rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
@@ -866,7 +977,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
- if (rwsem_can_spin_on_owner(sem) &&
+ if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
rwsem_optimistic_spin(sem, true))
return sem;
@@ -1124,8 +1235,10 @@ inline void __up_read(struct rw_semaphore *sem)
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
- RWSEM_FLAG_WAITERS))
+ RWSEM_FLAG_WAITERS)) {
+ clear_wr_nonspinnable(sem);
rwsem_wake(sem, tmp);
+ }
}
/*
--
2.18.1
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly. The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
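The unqueued reader trylock that a spinning reader uses can be sketched
in userspace with C11 atomics. This is only an illustration, not the
kernel code: struct toy_rwsem and the TOY_* constants are hypothetical
stand-ins that mirror the count bit layout used in this series.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constants mirroring the rwsem count layout. */
#define TOY_WRITER_LOCKED	(1L << 0)
#define TOY_FLAG_WAITERS	(1L << 1)
#define TOY_FLAG_HANDOFF	(1L << 2)
#define TOY_READER_BIAS		(1L << 8)

/* Hypothetical userspace stand-in for the rwsem count word. */
struct toy_rwsem {
	atomic_long count;
};

/*
 * Try to take a read lock without queuing. The attempt fails if a
 * writer holds the lock or a handoff is pending.
 */
static bool toy_try_read_lock_unqueued(struct toy_rwsem *sem)
{
	long count = atomic_load_explicit(&sem->count, memory_order_relaxed);

	if (count & (TOY_WRITER_LOCKED | TOY_FLAG_HANDOFF))
		return false;

	/* Optimistically add the reader bias, then recheck the old value. */
	count = atomic_fetch_add_explicit(&sem->count, TOY_READER_BIAS,
					  memory_order_acquire);
	if (!(count & (TOY_WRITER_LOCKED | TOY_FLAG_HANDOFF)))
		return true;

	/* A writer or handoff raced in: back out the reader bias. */
	atomic_fetch_sub_explicit(&sem->count, TOY_READER_BIAS,
				  memory_order_relaxed);
	return false;
}
```

The add-then-recheck shape means a failed attempt briefly inflates the
reader count, which is why the bias has to be backed out on failure.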
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:
   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143
The performance change wasn't significant in this case, but this change
is required by a follow-on patch.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem.c | 86 ++++++++++++++++++++++++++-----
2 files changed, 75 insertions(+), 12 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */
LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
+LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be939accd60c..9eb46ab9edaa 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -457,6 +457,30 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
}
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+ return false;
+
+ count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+ if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+ rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_opt_rlock);
+ return true;
+ }
+
+ /* Back out the change */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ return false;
+}
+
/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
@@ -491,9 +515,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
- if (need_resched())
+ if (need_resched()) {
+ lockevent_inc(rwsem_opt_fail);
return false;
+ }
+ preempt_disable();
rcu_read_lock();
owner = READ_ONCE(sem->owner);
if (owner) {
@@ -501,6 +528,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner_on_cpu(owner);
}
rcu_read_unlock();
+ preempt_enable();
+
+ lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
}
@@ -578,7 +608,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
return state;
}
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
bool taken = false;
int prev_owner_state = OWNER_NULL;
@@ -586,9 +616,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
preempt_disable();
/* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
if (!osq_lock(&sem->osq))
goto done;
@@ -608,10 +635,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Try to acquire the lock
*/
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
+ taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+ : rwsem_try_read_lock_unqueued(sem);
+
+ if (taken)
break;
- }
/*
* An RT task cannot do optimistic spinning if it cannot
@@ -668,7 +696,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
return taken;
}
#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
return false;
}
@@ -684,6 +717,31 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ if (!rwsem_can_spin_on_owner(sem))
+ goto queue;
+
+ /*
+ * Undo read bias from down_read() and do optimistic spinning.
+ */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ adjustment = 0;
+ if (rwsem_optimistic_spin(sem, false)) {
+ /*
+ * Wake up other readers in the wait list if the front
+ * waiter is a reader.
+ */
+ if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (!list_empty(&sem->wait_list))
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+ &wake_q);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ }
+ return sem;
+ }
+
+queue:
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -696,7 +754,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
* exit the slowpath and return immediately as its
* RWSEM_READER_BIAS has already been set in the count.
*/
- if (!(atomic_long_read(&sem->count) &
+ if (adjustment && !(atomic_long_read(&sem->count) &
(RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
@@ -708,7 +766,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
+ if (adjustment)
+ count = atomic_long_add_return(adjustment, &sem->count);
+ else
+ count = atomic_long_read(&sem->count);
/*
* If there are no active locks, wake the front queued process(es).
@@ -767,7 +828,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
+ if (rwsem_can_spin_on_owner(sem) &&
+ rwsem_optimistic_spin(sem, true))
return sem;
/*
--
2.18.1
This patch modifies rwsem_spin_on_owner() to return one of four possible
values that better reflect the state of the lock holder, which enables
us to make a better decision about what to do next.
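As a rough userspace sketch of the classification (not the kernel
code), the owner word can be mapped to one of the four states as shown
here. The TOY_READER_OWNED and TOY_ANONYMOUSLY_OWNED bit positions are
assumptions made for illustration only.

```c
#include <stdint.h>

/* Assumed flag bits in the low bits of the owner word (illustrative). */
#define TOY_READER_OWNED	((uintptr_t)1 << 0)
#define TOY_ANONYMOUSLY_OWNED	((uintptr_t)1 << 1)

/* One bit per state so that several states can be tested with a mask. */
enum toy_owner_state {
	TOY_OWNER_NULL		= 1 << 0,
	TOY_OWNER_WRITER	= 1 << 1,
	TOY_OWNER_READER	= 1 << 2,
	TOY_OWNER_NONSPINNABLE	= 1 << 3,
};
#define TOY_OWNER_SPINNABLE	(TOY_OWNER_NULL | TOY_OWNER_WRITER)

/* Classify an owner word; the anonymous flag wins over the reader flag. */
static enum toy_owner_state toy_owner_state(uintptr_t owner)
{
	if (!owner)
		return TOY_OWNER_NULL;
	if (owner & TOY_ANONYMOUSLY_OWNED)
		return TOY_OWNER_NONSPINNABLE;
	if (owner & TOY_READER_OWNED)
		return TOY_OWNER_READER;
	return TOY_OWNER_WRITER;
}

/* Spinning continues only for a NULL owner or a writer owner. */
static int toy_keep_spinning(uintptr_t owner)
{
	return (toy_owner_state(owner) & TOY_OWNER_SPINNABLE) != 0;
}
```

Encoding each state as a distinct bit is what lets the caller test
"NULL or writer" with a single mask instead of two comparisons.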
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 65 ++++++++++++++++++++++++++++++------------
1 file changed, 47 insertions(+), 18 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f56329240ef1..8d0f2acfe13d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -414,17 +414,54 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
}
/*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ * OWNER_NULL : owner is currently NULL
+ * OWNER_WRITER: when owner changes and is a writer
+ * OWNER_READER: when owner changes and the new owner may be a reader.
+ * OWNER_NONSPINNABLE:
+ * when optimistic spinning has to stop because either the
+ * owner stops running, is unknown, or its timeslice has
+ * been used up.
*/
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+ OWNER_NULL = 1 << 0,
+ OWNER_WRITER = 1 << 1,
+ OWNER_READER = 1 << 2,
+ OWNER_NONSPINNABLE = 1 << 3,
+};
+#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
{
- struct task_struct *owner = READ_ONCE(sem->owner);
+ if (!owner)
+ return OWNER_NULL;
- if (!is_rwsem_owner_spinnable(owner))
- return false;
+ if (owner & RWSEM_ANONYMOUSLY_OWNED)
+ return OWNER_NONSPINNABLE;
+
+ if (owner & RWSEM_READER_OWNED)
+ return OWNER_READER;
+
+ return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+ enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+ if (state != OWNER_WRITER)
+ return state;
rcu_read_lock();
- while (owner && (READ_ONCE(sem->owner) == owner)) {
+ for (;;) {
+ tmp = READ_ONCE(sem->owner);
+ if (tmp != owner) {
+ state = rwsem_owner_state((unsigned long)tmp);
+ break;
+ }
+
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
@@ -433,24 +470,16 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
*/
barrier();
- /*
- * abort spinning when need_resched or owner is not running or
- * owner's cpu is preempted.
- */
if (need_resched() || !owner_on_cpu(owner)) {
- rcu_read_unlock();
- return false;
+ state = OWNER_NONSPINNABLE;
+ break;
}
cpu_relax();
}
rcu_read_unlock();
- /*
- * If there is a new owner or the owner is not set, we continue
- * spinning.
- */
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+ return state;
}
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -473,7 +502,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem)) {
+ while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
/*
* Try to acquire the lock
*/
--
2.18.1
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that the high priority of the RT task may block forward
progress of the lock holder if it happens to reside on the same CPU.
This will lead to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.
When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:
1) The lock owner is in the process of releasing the lock, sem->owner
is cleared but the lock has not been released yet.
2) The lock was free and owner cleared, but another task just comes
in and acquires the lock before we try to get it. The new owner may
be a spinnable writer.
So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
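The retry rule can be sketched as a pure decision function. This is a
simplified model under stated assumptions, not the kernel code: the
spin_owner enum and spin_should_stop() are hypothetical names, and the
need_resched and is_rt arguments stand in for need_resched() and
rt_task(current).

```c
#include <stdbool.h>

/* Hypothetical owner states observed by the spinner. */
enum spin_owner {
	SPIN_OWNER_NULL,
	SPIN_OWNER_WRITER,
	SPIN_OWNER_READER,
	SPIN_OWNER_NONSPINNABLE,
};

/*
 * Decide whether the spinner should give up after a failed trylock.
 * cur is the state seen in this iteration, prev the state seen in the
 * previous one (SPIN_OWNER_NULL before the first iteration).
 */
static bool spin_should_stop(enum spin_owner cur, enum spin_owner prev,
			     bool need_resched, bool is_rt)
{
	/* A running writer owner is checked while spinning on it. */
	if (cur == SPIN_OWNER_WRITER)
		return false;

	if (need_resched)
		return true;

	/*
	 * An RT task gets one extra retry, and only when the previous
	 * owner was a writer, i.e. the owner was cleared just now.
	 */
	return is_rt && prev != SPIN_OWNER_WRITER;
}
```

Because prev starts out as NULL, an RT task that sees no owner on its
very first iteration still stops, which keeps the old behavior for that
case.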
When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contention. The table below shows the locking rates (in kops/s)
with various numbers of write locking threads before and after the patch.
   Locking threads  Pre-patch    Post-patch
   ---------------  ---------    ----------
          4           2,753        2,608
          8           2,529        2,520
         16           1,727        1,918
         32           1,263        1,956
         64             889        1,343
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 51 ++++++++++++++++++++++++++++++++++++------
1 file changed, 44 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 36aed5236bd2..eb43201b89b4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -566,6 +566,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
bool taken = false;
+ int prev_owner_state = OWNER_NULL;
preempt_disable();
@@ -583,7 +584,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+ for (;;) {
+ enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+ if (!(owner_state & OWNER_SPINNABLE))
+ break;
+
/*
* Try to acquire the lock
*/
@@ -593,13 +599,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
}
/*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
+ * An RT task cannot do optimistic spinning if it cannot
+ * be sure the lock holder is running or live-lock may
+ * happen if the current task and the lock holder happen
+ * to run on the same CPU. However, aborting optimistic
+ * spinning when a NULL owner is detected may miss some
+ * opportunities where spinning can continue without
+ * causing problems.
+ *
+ * There are 2 possible cases where an RT task may be able
+ * to continue spinning.
+ *
+ * 1) The lock owner is in the process of releasing the
+ * lock, sem->owner is cleared but the lock has not
+ * been released yet.
+ * 2) The lock was free and owner cleared, but another
+ * task just comes in and acquires the lock before
+ * we try to get it. The new owner may be a spinnable
+ * writer.
+ *
+ * To take advantage of the two scenarios listed above, the RT
+ * task is made to retry one more time to see if it can
+ * acquire the lock or continue spinning on the new owning
+ * writer. Of course, if the time lag is long enough or the
+ * new owner is not a writer or spinnable, the RT task will
+ * quit spinning.
+ *
+ * If the owner is a writer, the need_resched() check is
+ * done inside rwsem_spin_on_owner(). If the owner is not
+ * a writer, need_resched() check needs to be done here.
*/
- if (!sem->owner && (need_resched() || rt_task(current)))
- break;
+ if (owner_state != OWNER_WRITER) {
+ if (need_resched())
+ break;
+ if (rt_task(current) &&
+ (prev_owner_state != OWNER_WRITER))
+ break;
+ }
+ prev_owner_state = owner_state;
/*
* The cpu_relax() call is a compiler barrier which forces
--
2.18.1
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.
This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period unless it is an RT
writer task which doesn't need to wait. The waiting period keeps lock
stealing from being discouraged so much that performance suffers.
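How the 4ms minimum maps onto jiffies for different HZ values can be
checked with a small sketch. The helper names below are hypothetical;
only the DIV_ROUND_UP(HZ, 250) formula comes from this patch.

```c
/* Same rounding as the kernel's DIV_ROUND_UP(), redefined for userspace. */
#define TOY_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Handoff wait timeout in jiffies for a given tick rate. */
static unsigned long toy_wait_timeout_jiffies(unsigned long hz)
{
	return TOY_DIV_ROUND_UP(hz, 250UL);
}

/* Convert a jiffy count to microseconds (rounded down). */
static unsigned long toy_jiffies_to_usecs(unsigned long j, unsigned long hz)
{
	return j * 1000000UL / hz;
}
```

With HZ=1000 this gives 4 jiffies (4ms), with HZ=250 it gives 1 jiffy
(4ms), and with HZ=100 it gives 1 jiffy (10ms), so the wait is never
shorter than 4ms.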
The setting and clearing of the handoff bit are serialized by the
wait_lock, so racing is not possible.
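The writer side of the handoff protocol can be modeled in userspace
with C11 atomics. This is a minimal sketch, not the kernel function:
the TOY_* names are hypothetical, the only_waiter argument stands in
for list_is_singular(&sem->wait_list), and the caller is assumed to
hold whatever lock serializes waiters.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_WRITER_LOCKED	(1L << 0)
#define TOY_FLAG_WAITERS	(1L << 1)
#define TOY_FLAG_HANDOFF	(1L << 2)
#define TOY_READER_BIAS		(1L << 8)
#define TOY_READER_MASK		(~(TOY_READER_BIAS - 1))
#define TOY_LOCK_MASK		(TOY_WRITER_LOCKED | TOY_READER_MASK)

enum toy_wstate {
	TOY_WRITER_NOT_FIRST,	/* writer is not first in the wait list */
	TOY_WRITER_FIRST,	/* writer is first in the wait list */
	TOY_WRITER_HANDOFF,	/* writer is first and wants a handoff */
};

/*
 * Returns true if the write lock was acquired. When the lock is busy
 * and wstate is TOY_WRITER_HANDOFF, the handoff bit is set instead and
 * false is returned.
 */
static bool toy_try_write_lock(atomic_long *cnt, enum toy_wstate wstate,
			       bool only_waiter)
{
	long count = atomic_load_explicit(cnt, memory_order_relaxed);
	long new;

	do {
		bool has_handoff = count & TOY_FLAG_HANDOFF;

		/* A pending handoff is reserved for the first waiter. */
		if (has_handoff && wstate == TOY_WRITER_NOT_FIRST)
			return false;

		new = count;
		if (count & TOY_LOCK_MASK) {
			if (has_handoff || wstate != TOY_WRITER_HANDOFF)
				return false;
			new |= TOY_FLAG_HANDOFF;
		} else {
			new |= TOY_WRITER_LOCKED;
			new &= ~TOY_FLAG_HANDOFF;
			if (only_waiter)
				new &= ~TOY_FLAG_WAITERS;
		}
	} while (!atomic_compare_exchange_weak_explicit(cnt, &count, new,
							memory_order_acquire,
							memory_order_relaxed));

	/* Either the handoff bit was set or the lock was taken. */
	return !(new & TOY_FLAG_HANDOFF);
}
```

Once the handoff bit is set, an unqueued stealer that checks for it
backs off, so the first waiter is the only one that can take the lock
when it is released.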
A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1-based kernel and 240 write_lock
threads with a 5us sleep critical section.
Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. The rwsem became much fairer, though the
mean number of locking operations dropped by about 16%, which is the
price paid for the better fairness.
Making the waiter set the handoff bit right after the first wakeup can
impact performance, especially with a mixed reader/writer workload. With
the same microbenchmark with a short critical section and equal numbers
of reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:
40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081
By making waiter set handoff bit immediately after wakeup:
40 readers, Iterations Min/Mean/Max = 43/44/46
40 writers, Iterations Min/Mean/Max = 43/1,263/3,191
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 2 +
kernel/locking/rwsem.c | 225 +++++++++++++++++++++++-------
2 files changed, 173 insertions(+), 54 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
+LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
+LOCK_EVENT(rwsem_wlock_handoff) /* # of write lock handoffs */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8d0f2acfe13d..0c8aef065acb 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
* Optimistic spinning by Tim Chen <[email protected]>
* and Davidlohr Bueso <[email protected]>. Based on mutexes.
*
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long <[email protected]>.
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long <[email protected]> and
+ * Peter Zijlstra <[email protected]>.
*/
#include <linux/types.h>
@@ -74,20 +75,33 @@
*
* Bit 0 - writer locked bit
* Bit 1 - waiters present bit
- * Bits 2-7 - reserved
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
* Bits 8-X - 24-bit (32-bit) or 56-bit reader count
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of rwsem_down_write_slowpath().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
*/
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_FLAG_HANDOFF (1UL << 2)
+
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+ RWSEM_FLAG_HANDOFF)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+ unsigned long timeout;
};
+#define rwsem_first_waiter(sem) \
+ list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
enum rwsem_wake_type {
RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
@@ -224,6 +241,19 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
};
+enum writer_wait_state {
+ WRITER_NOT_FIRST, /* Writer is not first in wait list */
+ WRITER_FIRST, /* Writer is first in wait list */
+ WRITER_HANDOFF /* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
+ * queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -244,11 +274,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
long oldcount, woken = 0, adjustment = 0;
struct list_head wlist;
+ lockdep_assert_held(&sem->wait_lock);
+
/*
* Take a peek at the queue head waiter such that we can determine
* the wakeup(s) to perform.
*/
- waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+ waiter = rwsem_first_waiter(sem);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY) {
@@ -275,7 +307,18 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
- atomic_long_sub(adjustment, &sem->count);
+ /*
+ * When we've been waiting "too" long (for writers
+ * to give up the lock), request a HANDOFF to
+ * force the issue.
+ */
+ if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+ time_after(jiffies, waiter->timeout)) {
+ adjustment -= RWSEM_FLAG_HANDOFF;
+ lockevent_inc(rwsem_rlock_handoff);
+ }
+
+ atomic_long_add(-adjustment, &sem->count);
return;
}
/*
@@ -317,6 +360,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment -= RWSEM_FLAG_WAITERS;
}
+ /*
+ * When we've woken a reader, we no longer need to force writers
+ * to give up the lock and we can clear HANDOFF.
+ */
+ if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
+ adjustment -= RWSEM_FLAG_HANDOFF;
+
if (adjustment)
atomic_long_add(adjustment, &sem->count);
@@ -346,23 +396,48 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* This function must be called with the sem->wait_lock held to prevent
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired with handoff bit cleared.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+ enum writer_wait_state wstate)
{
long new;
- if (count & RWSEM_LOCK_MASK)
- return false;
+ lockdep_assert_held(&sem->wait_lock);
- new = count + RWSEM_WRITER_LOCKED -
- (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+ do {
+ bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
- rwsem_set_owner(sem);
- return true;
- }
+ if (has_handoff && wstate == WRITER_NOT_FIRST)
+ return false;
- return false;
+ new = count;
+
+ if (count & RWSEM_LOCK_MASK) {
+ if (has_handoff || (wstate != WRITER_HANDOFF))
+ return false;
+
+ new |= RWSEM_FLAG_HANDOFF;
+ } else {
+ new |= RWSEM_WRITER_LOCKED;
+ new &= ~RWSEM_FLAG_HANDOFF;
+
+ if (list_is_singular(&sem->wait_list))
+ new &= ~RWSEM_FLAG_WAITERS;
+ }
+ } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
+
+ /*
+ * We have either acquired the lock with handoff bit cleared or
+ * set the handoff bit.
+ */
+ if (new & RWSEM_FLAG_HANDOFF)
+ return false;
+
+ rwsem_set_owner(sem);
+ return true;
}
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
@@ -373,9 +448,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
{
long count = atomic_long_read(&sem->count);
- while (!(count & RWSEM_LOCK_MASK)) {
+ while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_WRITER_LOCKED)) {
+ count | RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -456,6 +531,11 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
rcu_read_lock();
for (;;) {
+ if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
+ state = OWNER_NONSPINNABLE;
+ break;
+ }
+
tmp = READ_ONCE(sem->owner);
if (tmp != owner) {
state = rwsem_owner_state((unsigned long)tmp);
@@ -553,16 +633,18 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list)) {
/*
* In case the wait queue is empty and the lock isn't owned
- * by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_READER_BIAS has already been
- * set in the count.
+ * by a writer or has the handoff bit set, this reader can
+ * exit the slowpath and return immediately as its
+ * RWSEM_READER_BIAS has already been set in the count.
*/
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ if (!(atomic_long_read(&sem->count) &
+ (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
@@ -609,8 +691,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
return sem;
out_nolock:
list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ if (list_empty(&sem->wait_list)) {
+ atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+ &sem->count);
+ }
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -624,7 +708,7 @@ static struct rw_semaphore *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
- bool waiting = true; /* any queued threads before us */
+ enum writer_wait_state wstate;
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
@@ -639,66 +723,95 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
*/
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
raw_spin_lock_irq(&sem->wait_lock);
/* account for this before adding a new element to the list */
- if (list_empty(&sem->wait_list))
- waiting = false;
+ wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock */
- if (waiting) {
+ if (wstate == WRITER_NOT_FIRST) {
count = atomic_long_read(&sem->count);
/*
- * If there were already threads queued before us and there are
- * no active writers and some readers, the lock must be read
- * owned; so we try to any read locks that were queued ahead
- * of us.
+ * If there were already threads queued before us and:
+ * 1) there are no active locks, wake the front
+ * queued process(es) as the handoff bit might be set.
+ * 2) there are no active writers and some readers, the lock
+ * must be read owned; so we try to wake any read lock
+ * waiters that were queued ahead of us.
*/
- if (!(count & RWSEM_WRITER_MASK) &&
- (count & RWSEM_READER_MASK)) {
- rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
+ if (count & RWSEM_WRITER_MASK)
+ goto wait;
- /*
- * Reinitialize wake_q after use.
- */
- wake_q_init(&wake_q);
- }
+ rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+ ? RWSEM_WAKE_READERS
+ : RWSEM_WAKE_ANY, &wake_q);
+ /*
+ * The wakeup is normally called _after_ the wait_lock
+ * is released, but given that we are proactively waking
+ * readers we can deal with the wake_q overhead as it is
+ * similar to releasing and taking the wait_lock again
+ * for attempting rwsem_try_write_lock().
+ */
+ wake_up_q(&wake_q);
+
+ /* We need wake_q again below, reinitialize */
+ wake_q_init(&wake_q);
} else {
count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
}
+wait:
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem))
+ if (rwsem_try_write_lock(count, sem, wstate))
break;
+
raw_spin_unlock_irq(&sem->wait_lock);
/* Block until there are no active lockers. */
- do {
+ for (;;) {
if (signal_pending_state(state, current))
goto out_nolock;
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
+ /*
+ * If HANDOFF bit is set, unconditionally do
+ * a trylock.
+ */
+ if (wstate == WRITER_HANDOFF)
+ break;
+
+ if ((wstate == WRITER_NOT_FIRST) &&
+ (rwsem_first_waiter(sem) == &waiter))
+ wstate = WRITER_FIRST;
+
count = atomic_long_read(&sem->count);
- } while (count & RWSEM_LOCK_MASK);
+ if (!(count & RWSEM_LOCK_MASK))
+ break;
+
+ /*
+ * The setting of the handoff bit is deferred
+ * until rwsem_try_write_lock() is called.
+ */
+ if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+ time_after(jiffies, waiter.timeout))) {
+ wstate = WRITER_HANDOFF;
+ lockevent_inc(rwsem_wlock_handoff);
+ break;
+ }
+ }
raw_spin_lock_irq(&sem->wait_lock);
+ count = atomic_long_read(&sem->count);
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
@@ -711,6 +824,10 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
+
+ if (unlikely(wstate == WRITER_HANDOFF))
+ atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+
if (list_empty(&sem->wait_list))
atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
@@ -726,7 +843,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
*/
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -859,7 +976,7 @@ inline void __up_read(struct rw_semaphore *sem)
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ rwsem_wake(sem, tmp);
}
/*
@@ -873,7 +990,7 @@ static inline void __up_write(struct rw_semaphore *sem)
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ rwsem_wake(sem, tmp);
}
/*
--
2.18.1
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:
1) Remove all the EXPORT_SYMBOL() calls for functions that are not
accessed elsewhere.
2) Remove all the __visible tags as none of the functions will be
called from assembly code anymore.
3) Make all the internal functions static.
4) Remove some unneeded blank lines.
5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
and rename __rwsem_down_{read|write}_failed_common() to
rwsem_down_{read|write}_slowpath().
6) Remove "__" prefix of __rwsem_mark_wake().
7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
are not that useful.
That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
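
As a rough userspace illustration of item 7, the sketch below contrasts the
two calling conventions using C11 atomics. It is not the kernel code:
cmpxchg_acquire(), try_cmpxchg_acquire(), write_trylock_old(),
write_trylock_new() and the two constants are hypothetical stand-ins for the
kernel's atomic_long_cmpxchg_acquire(), atomic_long_try_cmpxchg_acquire(),
RWSEM_UNLOCKED_VALUE and RWSEM_WRITER_LOCKED.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define UNLOCKED_VALUE	0L	/* stand-in for RWSEM_UNLOCKED_VALUE */
#define WRITER_LOCKED	1L	/* stand-in for RWSEM_WRITER_LOCKED */

/*
 * Old convention: return the value that was found in *v. The caller has
 * to compare it against the expected value to learn whether it succeeded.
 */
static long cmpxchg_acquire(atomic_long *v, long old, long new_val)
{
	/* On failure C11 writes the observed value back into 'old'. */
	atomic_compare_exchange_strong_explicit(v, &old, new_val,
						memory_order_acquire,
						memory_order_relaxed);
	return old;
}

static bool write_trylock_old(atomic_long *count)
{
	long tmp = cmpxchg_acquire(count, UNLOCKED_VALUE, WRITER_LOCKED);

	return tmp == UNLOCKED_VALUE;
}

/*
 * New convention: return success directly and refresh *old with the
 * observed value on failure, so no separate comparison is needed.
 */
static bool try_cmpxchg_acquire(atomic_long *v, long *old, long new_val)
{
	return atomic_compare_exchange_strong_explicit(v, old, new_val,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static bool write_trylock_new(atomic_long *count)
{
	long tmp = UNLOCKED_VALUE;

	return try_cmpxchg_acquire(count, &tmp, WRITER_LOCKED);
}
```

Both helpers take the lock only when the count is exactly the unlocked
value. The try_cmpxchg form lets the compiler branch on the result of the
compare-and-exchange itself instead of re-comparing the returned value.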
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 2 -
kernel/locking/rwsem.c | 135 ++++++++++--------------------
2 files changed, 42 insertions(+), 95 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
-LOCK_EVENT(rwsem_rtrylock) /* # of read trylock calls */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
-LOCK_EVENT(rwsem_wtrylock) /* # of write trylock calls */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8317bcdf063b..f56329240ef1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
osq_lock_init(&sem->osq);
#endif
}
-
EXPORT_SYMBOL(__init_rwsem);
enum rwsem_waiter_type {
@@ -237,9 +236,9 @@ enum rwsem_wake_type {
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only marked woken if downgrading is false
*/
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type,
+ struct wake_q_head *wake_q)
{
struct rwsem_waiter *waiter, *tmp;
long oldcount, woken = 0, adjustment = 0;
@@ -330,7 +329,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Ensure calling get_task_struct() before setting the reader
- * waiter to nil such that rwsem_down_read_failed() cannot
+ * waiter to nil such that rwsem_down_read_slowpath() cannot
* race with do_exit() by always holding a reference count
* to the task to wakeup.
*/
@@ -516,8 +515,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Wait for the read lock to be granted
*/
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
@@ -555,7 +554,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
*/
if (!(count & RWSEM_LOCK_MASK) ||
(!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
@@ -589,25 +588,11 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
return ERR_PTR(-EINTR);
}
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
/*
* Wait until we successfully acquire the write lock
*/
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
bool waiting = true; /* any queued threads before us */
@@ -646,7 +631,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
*/
if (!(count & RWSEM_WRITER_MASK) &&
(count & RWSEM_READER_MASK)) {
- __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/*
* The wakeup is normally called _after_ the wait_lock
* is released, but given that we are proactively waking
@@ -700,7 +685,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
if (list_empty(&sem->wait_list))
atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
lockevent_inc(rwsem_wlock_fail);
@@ -708,26 +693,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
return ERR_PTR(-EINTR);
}
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
/*
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
*/
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -735,22 +705,20 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
return sem;
}
-EXPORT_SYMBOL(rwsem_wake);
/*
* downgrade a write lock into a read lock
* - caller incremented waiting part of count and discovered it still negative
* - just wake up any readers at the front of the queue
*/
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -758,14 +726,13 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
return sem;
}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
/*
* lock for reading
@@ -774,7 +741,7 @@ inline void __down_read(struct rw_semaphore *sem)
{
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
- rwsem_down_read_failed(sem);
+ rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
} else {
@@ -786,7 +753,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
{
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
@@ -803,7 +770,6 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
long tmp = RWSEM_UNLOCKED_VALUE;
- lockevent_inc(rwsem_rtrylock);
do {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
tmp + RWSEM_READER_BIAS)) {
@@ -819,30 +785,33 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline void __down_write(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- rwsem_down_write_failed(sem);
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED)))
+ rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED))) {
+ if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
+ }
rwsem_set_owner(sem);
return 0;
}
static inline int __down_write_trylock(struct rw_semaphore *sem)
{
- long tmp;
+ long tmp = RWSEM_UNLOCKED_VALUE;
- lockevent_inc(rwsem_wtrylock);
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_WRITER_LOCKED);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
return true;
}
@@ -856,12 +825,11 @@ inline void __up_read(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
- if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
- == RWSEM_FLAG_WAITERS))
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+ RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -870,10 +838,12 @@ inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
+ long tmp;
+
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
- &sem->count) & RWSEM_FLAG_WAITERS))
+ tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -909,7 +879,6 @@ void __sched down_read(struct rw_semaphore *sem)
LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
-
EXPORT_SYMBOL(down_read);
int __sched down_read_killable(struct rw_semaphore *sem)
@@ -924,7 +893,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
return 0;
}
-
EXPORT_SYMBOL(down_read_killable);
/*
@@ -938,7 +906,6 @@ int down_read_trylock(struct rw_semaphore *sem)
rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
return ret;
}
-
EXPORT_SYMBOL(down_read_trylock);
/*
@@ -948,10 +915,8 @@ void __sched down_write(struct rw_semaphore *sem)
{
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(down_write);
/*
@@ -962,14 +927,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
- if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+ if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+ __down_write_killable)) {
rwsem_release(&sem->dep_map, 1, _RET_IP_);
return -EINTR;
}
return 0;
}
-
EXPORT_SYMBOL(down_write_killable);
/*
@@ -984,7 +949,6 @@ int down_write_trylock(struct rw_semaphore *sem)
return ret;
}
-
EXPORT_SYMBOL(down_write_trylock);
/*
@@ -993,10 +957,8 @@ EXPORT_SYMBOL(down_write_trylock);
void up_read(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
__up_read(sem);
}
-
EXPORT_SYMBOL(up_read);
/*
@@ -1005,10 +967,8 @@ EXPORT_SYMBOL(up_read);
void up_write(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
__up_write(sem);
}
-
EXPORT_SYMBOL(up_write);
/*
@@ -1017,10 +977,8 @@ EXPORT_SYMBOL(up_write);
void downgrade_write(struct rw_semaphore *sem)
{
lock_downgrade(&sem->dep_map, _RET_IP_);
-
__downgrade_write(sem);
}
-
EXPORT_SYMBOL(downgrade_write);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1029,40 +987,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
{
might_sleep();
rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
-
EXPORT_SYMBOL(down_read_nested);
void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
{
might_sleep();
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(_down_write_nest_lock);
void down_read_non_owner(struct rw_semaphore *sem)
{
might_sleep();
-
__down_read(sem);
__rwsem_set_reader_owned(sem, NULL);
}
-
EXPORT_SYMBOL(down_read_non_owner);
void down_write_nested(struct rw_semaphore *sem, int subclass)
{
might_sleep();
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(down_write_nested);
int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1070,14 +1020,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
might_sleep();
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
- if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+ if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+ __down_write_killable)) {
rwsem_release(&sem->dep_map, 1, _RET_IP_);
return -EINTR;
}
return 0;
}
-
EXPORT_SYMBOL(down_write_killable_nested);
void up_read_non_owner(struct rw_semaphore *sem)
@@ -1086,7 +1036,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
sem);
__up_read(sem);
}
-
EXPORT_SYMBOL(up_read_non_owner);
#endif
--
2.18.1
With the commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), the rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired and an optimistic
spinning locker is present. This can help performance by avoiding
spinning on the wait_lock when it is contended.
With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.
With a later patch that supports rwsem lock handoff, we can no
longer rely on the presence of an optimistic spinning locker to
ensure that the lock will be acquired by a task soon and that
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeups and application hangs. So commit 59aabfc7e959
("locking/rwsem: Reduce spinlock contention in wakeup after
up_read()/up_write()") has to be reverted.
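
The reasoning can be followed with a toy, single-threaded model. This is
only a sketch under stated assumptions, not the kernel logic: struct
toy_rwsem and both helpers are hypothetical, and the model assumes that a
set handoff bit prevents a spinner from taking the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state seen by the unlocker. */
struct toy_rwsem {
	bool wait_lock_contended;	/* a trylock on wait_lock would fail */
	bool spinner_present;		/* analogue of osq_is_locked() */
	bool handoff_set;		/* lock is reserved for the first waiter */
	int nr_waiters;			/* tasks sleeping on the wait list */
};

/*
 * Return the number of waiters woken. With use_shortcut set, the wakeup
 * is skipped when a spinner is present and wait_lock is contended, which
 * models the optimization being reverted.
 */
static int toy_rwsem_wake(const struct toy_rwsem *sem, bool use_shortcut)
{
	if (use_shortcut && sem->spinner_present && sem->wait_lock_contended)
		return 0;
	/* Otherwise wait for wait_lock and wake the head waiter, if any. */
	return sem->nr_waiters > 0 ? 1 : 0;
}

/*
 * After an unlock, somebody must be able to take the lock: either a
 * woken waiter or a spinner. Assumption of this model: a spinner cannot
 * take the lock while the handoff bit is set.
 */
static bool toy_unlock_makes_progress(const struct toy_rwsem *sem,
				      bool use_shortcut)
{
	bool spinner_can_take_lock = sem->spinner_present && !sem->handoff_set;

	return toy_rwsem_wake(sem, use_shortcut) > 0 || spinner_can_take_lock;
}
```

In this model the shortcut is harmless while spinners can still acquire
the lock. Once the handoff bit is set, skipping the wakeup leaves the
waiter asleep with nobody able to take the lock, which corresponds to the
missed wakeup described above.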
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 72 -------------------------------------
1 file changed, 72 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c0500679fd2f..3083fdf50447 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -411,25 +411,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return osq_is_locked(&sem->osq);
-}
-
#else
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
return false;
}
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return false;
-}
#endif
/*
@@ -651,65 +637,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
- /*
- * __rwsem_down_write_failed_common(sem)
- * rwsem_optimistic_spin(sem)
- * osq_unlock(sem->osq)
- * ...
- * atomic_long_add_return(&sem->count)
- *
- * - VS -
- *
- * __up_write()
- * if (atomic_long_sub_return_release(&sem->count) < 0)
- * rwsem_wake(sem)
- * osq_is_locked(&sem->osq)
- *
- * And __up_write() must observe !osq_is_locked() when it observes the
- * atomic_long_add_return() in order to not miss a wakeup.
- *
- * This boils down to:
- *
- * [S.rel] X = 1 [RmW] r0 = (Y += 0)
- * MB RMB
- * [RmW] Y += 1 [L] r1 = X
- *
- * exists (r0=1 /\ r1=0)
- */
- smp_rmb();
-
- /*
- * If a spinner is present, it is not necessary to do the wakeup.
- * Try to do wakeup only if the trylock succeeds to minimize
- * spinlock contention which may introduce too much delay in the
- * unlock operation.
- *
- * spinning writer up_write/up_read caller
- * --------------- -----------------------
- * [S] osq_unlock() [L] osq
- * MB RMB
- * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
- *
- * Here, it is important to make sure that there won't be a missed
- * wakeup while the rwsem is free and the only spinning writer goes
- * to sleep without taking the rwsem. Even when the spinning writer
- * is just going to break out of the waiting loop, it will still do
- * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
- * rwsem_has_spinner() is true, it will guarantee at least one
- * trylock attempt on the rwsem later on.
- */
- if (rwsem_has_spinner(sem)) {
- /*
- * The smp_rmb() here is to make sure that the spinner
- * state is consulted before reading the wait_lock.
- */
- smp_rmb();
- if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
- return sem;
- goto locked;
- }
raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
if (!list_empty(&sem->wait_list))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
--
2.18.1
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writer instead. So
the filename rwsem-xadd.c is not strictly correct. Also, no one outside
of the rwsem code needs to know the internal implementation other than
the function prototypes for two internal functions that are called
directly from percpu-rwsem.c.
So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:
<upper part of rwsem.h>
<rwsem-xadd.c>
<lower part of rwsem.h>
<rwsem.c>
The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().
This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/Makefile | 2 +-
kernel/locking/rwsem-xadd.c | 624 -------------------------
kernel/locking/rwsem.c | 884 ++++++++++++++++++++++++++++++++++++
kernel/locking/rwsem.h | 281 +-----------
4 files changed, 891 insertions(+), 900 deletions(-)
delete mode 100644 kernel/locking/rwsem-xadd.c
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
# and is generally not a function of system call inputs.
KCOV_INSTRUMENT := n
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 7d537b50a849..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,624 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells ([email protected]).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <[email protected]>
- * and Michel Lespinasse <[email protected]>
- *
- * Optimistic spinning by Tim Chen <[email protected]>
- * and Davidlohr Bueso <[email protected]>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READ_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- /*
- * Make sure we are not reinitializing a held semaphore:
- */
- debug_check_no_locks_freed((void *)sem, sizeof(*sem));
- lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
- atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
- raw_spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
- sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
- osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
- RWSEM_WAITING_FOR_WRITE,
- RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
- struct list_head list;
- struct task_struct *task;
- enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
- RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
- RWSEM_WAKE_READERS, /* Wake readers only */
- RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- * have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- * to actually wakeup the blocked task(s) and drop the reference count,
- * preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
-{
- struct rwsem_waiter *waiter, *tmp;
- long oldcount, woken = 0, adjustment = 0;
- struct list_head wlist;
-
- /*
- * Take a peek at the queue head waiter such that we can determine
- * the wakeup(s) to perform.
- */
- waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
- if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
- if (wake_type == RWSEM_WAKE_ANY) {
- /*
- * Mark writer at the front of the queue for wakeup.
- * Until the task is actually later awoken later by
- * the caller, other writers are able to steal it.
- * Readers, on the other hand, will block as they
- * will notice the queued writer.
- */
- wake_q_add(wake_q, waiter->task);
- lockevent_inc(rwsem_wake_writer);
- }
-
- return;
- }
-
- /*
- * Writers might steal the lock before we grant it to the next reader.
- * We prefer to do the first reader grant before counting readers
- * so we can bail out early if a writer stole the lock.
- */
- if (wake_type != RWSEM_WAKE_READ_OWNED) {
- adjustment = RWSEM_READER_BIAS;
- oldcount = atomic_long_fetch_add(adjustment, &sem->count);
- if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
- atomic_long_sub(adjustment, &sem->count);
- return;
- }
- /*
- * Set it to reader-owned to give spinners an early
- * indication that readers now have the lock.
- */
- __rwsem_set_reader_owned(sem, waiter->task);
- }
-
- /*
- * Grant an infinite number of read locks to the readers at the front
- * of the queue. We know that woken will be at least 1 as we accounted
- * for above. Note we increment the 'active part' of the count by the
- * number of readers before waking any processes up.
- *
- * We have to do wakeup in 2 passes to prevent the possibility that
- * the reader count may be decremented before it is incremented. It
- * is because the to-be-woken waiter may not have slept yet. So it
- * may see waiter->task got cleared, finish its critical section and
- * do an unlock before the reader count increment.
- *
- * 1) Collect the read-waiters in a separate list, count them and
- * fully increment the reader count in rwsem.
- * 2) For each waiters in the new list, clear waiter->task and
- * put them into wake_q to be woken up later.
- */
- list_for_each_entry(waiter, &sem->wait_list, list) {
- if (waiter->type == RWSEM_WAITING_FOR_WRITE)
- break;
-
- woken++;
- }
- list_cut_before(&wlist, &sem->wait_list, &waiter->list);
-
- adjustment = woken * RWSEM_READER_BIAS - adjustment;
- lockevent_cond_inc(rwsem_wake_reader, woken);
- if (list_empty(&sem->wait_list)) {
- /* hit end of list above */
- adjustment -= RWSEM_FLAG_WAITERS;
- }
-
- if (adjustment)
- atomic_long_add(adjustment, &sem->count);
-
- /* 2nd pass */
- list_for_each_entry_safe(waiter, tmp, &wlist, list) {
- struct task_struct *tsk;
-
- tsk = waiter->task;
- get_task_struct(tsk);
-
- /*
- * Ensure calling get_task_struct() before setting the reader
- * waiter to nil such that rwsem_down_read_failed() cannot
- * race with do_exit() by always holding a reference count
- * to the task to wakeup.
- */
- smp_store_release(&waiter->task, NULL);
- /*
- * Ensure issuing the wakeup (either by us or someone else)
- * after setting the reader waiter to nil.
- */
- wake_q_add_safe(wake_q, tsk);
- }
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
- long new;
-
- if (count & RWSEM_LOCK_MASK)
- return false;
-
- new = count + RWSEM_WRITER_LOCKED -
- (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
- rwsem_set_owner(sem);
- return true;
- }
-
- return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
- long count = atomic_long_read(&sem->count);
-
- while (!(count & RWSEM_LOCK_MASK)) {
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_WRITER_LOCKED)) {
- rwsem_set_owner(sem);
- lockevent_inc(rwsem_opt_wlock);
- return true;
- }
- }
- return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
- /*
- * As lock holder preemption issue, we both skip spinning if
- * task is not on cpu or its cpu is preempted
- */
- return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner;
- bool ret = true;
-
- BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
- if (need_resched())
- return false;
-
- rcu_read_lock();
- owner = READ_ONCE(sem->owner);
- if (owner) {
- ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner);
- }
- rcu_read_unlock();
- return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner = READ_ONCE(sem->owner);
-
- if (!is_rwsem_owner_spinnable(owner))
- return false;
-
- rcu_read_lock();
- while (owner && (READ_ONCE(sem->owner) == owner)) {
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
- */
- barrier();
-
- /*
- * abort spinning when need_resched or owner is not running or
- * owner's cpu is preempted.
- */
- if (need_resched() || !owner_on_cpu(owner)) {
- rcu_read_unlock();
- return false;
- }
-
- cpu_relax();
- }
- rcu_read_unlock();
-
- /*
- * If there is a new owner or the owner is not set, we continue
- * spinning.
- */
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
- bool taken = false;
-
- preempt_disable();
-
- /* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
- if (!osq_lock(&sem->osq))
- goto done;
-
- /*
- * Optimistically spin on the owner field and attempt to acquire the
- * lock whenever the owner changes. Spinning will be stopped when:
- * 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
- */
- while (rwsem_spin_on_owner(sem)) {
- /*
- * Try to acquire the lock
- */
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
- break;
- }
-
- /*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
- */
- if (!sem->owner && (need_resched() || rt_task(current)))
- break;
-
- /*
- * The cpu_relax() call is a compiler barrier which forces
- * everything in this loop to be re-loaded. We don't need
- * memory barriers as we'll eventually observe the right
- * values at the cost of a few extra spins.
- */
- cpu_relax();
- }
- osq_unlock(&sem->osq);
-done:
- preempt_enable();
- lockevent_cond_inc(rwsem_opt_fail, !taken);
- return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
- return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
- long count, adjustment = -RWSEM_READER_BIAS;
- struct rwsem_waiter waiter;
- DEFINE_WAKE_Q(wake_q);
-
- waiter.task = current;
- waiter.type = RWSEM_WAITING_FOR_READ;
-
- raw_spin_lock_irq(&sem->wait_lock);
- if (list_empty(&sem->wait_list)) {
- /*
- * In case the wait queue is empty and the lock isn't owned
- * by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_READER_BIAS has already been
- * set in the count.
- */
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
- raw_spin_unlock_irq(&sem->wait_lock);
- rwsem_set_reader_owned(sem);
- lockevent_inc(rwsem_rlock_fast);
- return sem;
- }
- adjustment += RWSEM_FLAG_WAITERS;
- }
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
-
- /*
- * If there are no active locks, wake the front queued process(es).
- *
- * If there are no writers and we are first in the queue,
- * wake our own waiter to join the existing active readers !
- */
- if (!(count & RWSEM_LOCK_MASK) ||
- (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
-
- /* wait to be given the lock */
- while (true) {
- set_current_state(state);
- if (!waiter.task)
- break;
- if (signal_pending_state(state, current)) {
- raw_spin_lock_irq(&sem->wait_lock);
- if (waiter.task)
- goto out_nolock;
- raw_spin_unlock_irq(&sem->wait_lock);
- break;
- }
- schedule();
- lockevent_inc(rwsem_sleep_reader);
- }
-
- __set_current_state(TASK_RUNNING);
- lockevent_inc(rwsem_rlock);
- return sem;
-out_nolock:
- list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
- raw_spin_unlock_irq(&sem->wait_lock);
- __set_current_state(TASK_RUNNING);
- lockevent_inc(rwsem_rlock_fail);
- return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
- long count;
- bool waiting = true; /* any queued threads before us */
- struct rwsem_waiter waiter;
- struct rw_semaphore *ret = sem;
- DEFINE_WAKE_Q(wake_q);
-
- /* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
- return sem;
-
- /*
- * Optimistic spinning failed, proceed to the slowpath
- * and block until we can acquire the sem.
- */
- waiter.task = current;
- waiter.type = RWSEM_WAITING_FOR_WRITE;
-
- raw_spin_lock_irq(&sem->wait_lock);
-
- /* account for this before adding a new element to the list */
- if (list_empty(&sem->wait_list))
- waiting = false;
-
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock */
- if (waiting) {
- count = atomic_long_read(&sem->count);
-
- /*
- * If there were already threads queued before us and there are
- * no active writers and some readers, the lock must be read
- * owned; so we try to any read locks that were queued ahead
- * of us.
- */
- if (!(count & RWSEM_WRITER_MASK) &&
- (count & RWSEM_READER_MASK)) {
- __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
-
- /*
- * Reinitialize wake_q after use.
- */
- wake_q_init(&wake_q);
- }
-
- } else {
- count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
- }
-
- /* wait until we successfully acquire the lock */
- set_current_state(state);
- while (true) {
- if (rwsem_try_write_lock(count, sem))
- break;
- raw_spin_unlock_irq(&sem->wait_lock);
-
- /* Block until there are no active lockers. */
- do {
- if (signal_pending_state(state, current))
- goto out_nolock;
-
- schedule();
- lockevent_inc(rwsem_sleep_writer);
- set_current_state(state);
- count = atomic_long_read(&sem->count);
- } while (count & RWSEM_LOCK_MASK);
-
- raw_spin_lock_irq(&sem->wait_lock);
- }
- __set_current_state(TASK_RUNNING);
- list_del(&waiter.list);
- raw_spin_unlock_irq(&sem->wait_lock);
- lockevent_inc(rwsem_wlock);
-
- return ret;
-
-out_nolock:
- __set_current_state(TASK_RUNNING);
- raw_spin_lock_irq(&sem->wait_lock);
- list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
- else
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
- lockevent_inc(rwsem_wlock_fail);
-
- return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
- unsigned long flags;
- DEFINE_WAKE_Q(wake_q);
-
- raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
- wake_up_q(&wake_q);
-
- return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
- unsigned long flags;
- DEFINE_WAKE_Q(wake_q);
-
- raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
- raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
- wake_up_q(&wake_q);
-
- return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..8317bcdf063b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,901 @@
*
* Written by David Howells ([email protected]).
* Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <[email protected]>
+ * and Michel Lespinasse <[email protected]>
+ *
+ * Optimistic spinning by Tim Chen <[email protected]>
+ * and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <[email protected]>.
*/
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
#include <linux/export.h>
#include <linux/rwsem.h>
#include <linux/atomic.h>
#include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value have the following
+ * meanings when set.
+ * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ * i.e. the owner(s) cannot be readily determined. It can be reader
+ * owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquires the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED (1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
+ if (!debug_locks_silent && \
+ WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+ #c, atomic_long_read(&(sem)->count), \
+ (long)((sem)->owner), (long)current, \
+ list_empty(&(sem)->wait_list) ? "" : "not ")) \
+ debug_locks_off(); \
+ } while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED (1UL << 0)
+#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_READER_SHIFT 8
+#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+ struct task_struct *owner)
+{
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+
+ WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+ __rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+ return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+ return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
+ * is a task pointer in owner of a reader-owned rwsem, it will be the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+ unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+ if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+ cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+ RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has the RWSEM_READER_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * Make sure we are not reinitializing a held semaphore:
+ */
+ debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+ lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+ atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ raw_spin_lock_init(&sem->wait_lock);
+ INIT_LIST_HEAD(&sem->wait_list);
+ sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+ osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+ RWSEM_WAITING_FOR_WRITE,
+ RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+ RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
+ RWSEM_WAKE_READERS, /* Wake readers only */
+ RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ * have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ * to actually wakeup the blocked task(s) and drop the reference count,
+ * preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if wake_type is RWSEM_WAKE_ANY
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type,
+ struct wake_q_head *wake_q)
+{
+ struct rwsem_waiter *waiter, *tmp;
+ long oldcount, woken = 0, adjustment = 0;
+ struct list_head wlist;
+
+ /*
+ * Take a peek at the queue head waiter such that we can determine
+ * the wakeup(s) to perform.
+ */
+ waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+ if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+ if (wake_type == RWSEM_WAKE_ANY) {
+ /*
+ * Mark writer at the front of the queue for wakeup.
+ * Until the task is actually awoken later by
+ * the caller, other writers are able to steal it.
+ * Readers, on the other hand, will block as they
+ * will notice the queued writer.
+ */
+ wake_q_add(wake_q, waiter->task);
+ lockevent_inc(rwsem_wake_writer);
+ }
+
+ return;
+ }
+
+ /*
+ * Writers might steal the lock before we grant it to the next reader.
+ * We prefer to do the first reader grant before counting readers
+ * so we can bail out early if a writer stole the lock.
+ */
+ if (wake_type != RWSEM_WAKE_READ_OWNED) {
+ adjustment = RWSEM_READER_BIAS;
+ oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+ if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ atomic_long_sub(adjustment, &sem->count);
+ return;
+ }
+ /*
+ * Set it to reader-owned to give spinners an early
+ * indication that readers now have the lock.
+ */
+ __rwsem_set_reader_owned(sem, waiter->task);
+ }
+
+ /*
+ * Grant an infinite number of read locks to the readers at the front
+ * of the queue. We know that woken will be at least 1 as we accounted
+ * for above. Note we increment the 'active part' of the count by the
+ * number of readers before waking any processes up.
+ *
+ * We have to do wakeup in 2 passes to prevent the possibility that
+ * the reader count may be decremented before it is incremented. It
+ * is because the to-be-woken waiter may not have slept yet. So it
+ * may see waiter->task got cleared, finish its critical section and
+ * do an unlock before the reader count increment.
+ *
+ * 1) Collect the read-waiters in a separate list, count them and
+ * fully increment the reader count in rwsem.
+ * 2) For each waiter in the new list, clear waiter->task and
+ * put it into wake_q to be woken up later.
+ */
+ list_for_each_entry(waiter, &sem->wait_list, list) {
+ if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+ break;
+
+ woken++;
+ }
+ list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+ adjustment = woken * RWSEM_READER_BIAS - adjustment;
+ lockevent_cond_inc(rwsem_wake_reader, woken);
+ if (list_empty(&sem->wait_list)) {
+ /* hit end of list above */
+ adjustment -= RWSEM_FLAG_WAITERS;
+ }
+
+ if (adjustment)
+ atomic_long_add(adjustment, &sem->count);
+
+ /* 2nd pass */
+ list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+ struct task_struct *tsk;
+
+ tsk = waiter->task;
+ get_task_struct(tsk);
+
+ /*
+ * Ensure calling get_task_struct() before setting the reader
+ * waiter to nil such that rwsem_down_read_failed() cannot
+ * race with do_exit() by always holding a reference count
+ * to the task to wakeup.
+ */
+ smp_store_release(&waiter->task, NULL);
+ /*
+ * Ensure issuing the wakeup (either by us or someone else)
+ * after setting the reader waiter to nil.
+ */
+ wake_q_add_safe(wake_q, tsk);
+ }
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+ long new;
+
+ if (count & RWSEM_LOCK_MASK)
+ return false;
+
+ new = count + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+
+ return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ while (!(count & RWSEM_LOCK_MASK)) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+ count + RWSEM_WRITER_LOCKED)) {
+ rwsem_set_owner(sem);
+ lockevent_inc(rwsem_opt_wlock);
+ return true;
+ }
+ }
+ return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+ /*
+ * Due to the lock holder preemption issue, we skip spinning if
+ * the task is not on a cpu or its cpu is preempted.
+ */
+ return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *owner;
+ bool ret = true;
+
+ BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+ if (need_resched())
+ return false;
+
+ rcu_read_lock();
+ owner = READ_ONCE(sem->owner);
+ if (owner) {
+ ret = is_rwsem_owner_spinnable(owner) &&
+ owner_on_cpu(owner);
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *owner = READ_ONCE(sem->owner);
+
+ if (!is_rwsem_owner_spinnable(owner))
+ return false;
+
+ rcu_read_lock();
+ while (owner && (READ_ONCE(sem->owner) == owner)) {
+ /*
+ * Ensure we emit the owner->on_cpu, dereference _after_
+ * checking sem->owner still matches owner, if that fails,
+ * owner might point to free()d memory, if it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+
+ /*
+ * abort spinning when need_resched or owner is not running or
+ * owner's cpu is preempted.
+ */
+ if (need_resched() || !owner_on_cpu(owner)) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * If there is a new owner or the owner is not set, we continue
+ * spinning.
+ */
+ return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ bool taken = false;
+
+ preempt_disable();
+
+ /* sem->wait_lock should not be held when doing optimistic spinning */
+ if (!rwsem_can_spin_on_owner(sem))
+ goto done;
+
+ if (!osq_lock(&sem->osq))
+ goto done;
+
+ /*
+ * Optimistically spin on the owner field and attempt to acquire the
+ * lock whenever the owner changes. Spinning will be stopped when:
+ * 1) the owning writer isn't running; or
+ * 2) readers own the lock as we can't determine if they are
+ * actively running or not.
+ */
+ while (rwsem_spin_on_owner(sem)) {
+ /*
+ * Try to acquire the lock
+ */
+ if (rwsem_try_write_lock_unqueued(sem)) {
+ taken = true;
+ break;
+ }
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!sem->owner && (need_resched() || rt_task(current)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ cpu_relax();
+ }
+ osq_unlock(&sem->osq);
+done:
+ preempt_enable();
+ lockevent_cond_inc(rwsem_opt_fail, !taken);
+ return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+ long count, adjustment = -RWSEM_READER_BIAS;
+ struct rwsem_waiter waiter;
+ DEFINE_WAKE_Q(wake_q);
+
+ waiter.task = current;
+ waiter.type = RWSEM_WAITING_FOR_READ;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (list_empty(&sem->wait_list)) {
+ /*
+ * In case the wait queue is empty and the lock isn't owned
+ * by a writer, this reader can exit the slowpath and return
+ * immediately as its RWSEM_READER_BIAS has already been
+ * set in the count.
+ */
+ if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ raw_spin_unlock_irq(&sem->wait_lock);
+ rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_rlock_fast);
+ return sem;
+ }
+ adjustment += RWSEM_FLAG_WAITERS;
+ }
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock, but no longer actively locking */
+ count = atomic_long_add_return(adjustment, &sem->count);
+
+ /*
+ * If there are no active locks, wake the front queued process(es).
+ *
+ * If there are no writers and we are first in the queue,
+ * wake our own waiter to join the existing active readers !
+ */
+ if (!(count & RWSEM_LOCK_MASK) ||
+ (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+
+ /* wait to be given the lock */
+ while (true) {
+ set_current_state(state);
+ if (!waiter.task)
+ break;
+ if (signal_pending_state(state, current)) {
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (waiter.task)
+ goto out_nolock;
+ raw_spin_unlock_irq(&sem->wait_lock);
+ break;
+ }
+ schedule();
+ lockevent_inc(rwsem_sleep_reader);
+ }
+
+ __set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock);
+ return sem;
+out_nolock:
+ list_del(&waiter.list);
+ if (list_empty(&sem->wait_list))
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ __set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock_fail);
+ return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+ long count;
+ bool waiting = true; /* any queued threads before us */
+ struct rwsem_waiter waiter;
+ struct rw_semaphore *ret = sem;
+ DEFINE_WAKE_Q(wake_q);
+
+ /* do optimistic spinning and steal lock if possible */
+ if (rwsem_optimistic_spin(sem))
+ return sem;
+
+ /*
+ * Optimistic spinning failed, proceed to the slowpath
+ * and block until we can acquire the sem.
+ */
+ waiter.task = current;
+ waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+
+ /* account for this before adding a new element to the list */
+ if (list_empty(&sem->wait_list))
+ waiting = false;
+
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock */
+ if (waiting) {
+ count = atomic_long_read(&sem->count);
+
+ /*
+ * If there were already threads queued before us and there are
+ * no active writers and some readers, the lock must be read
+ * owned; so we try to wake any read locks that were queued ahead
+ * of us.
+ */
+ if (!(count & RWSEM_WRITER_MASK) &&
+ (count & RWSEM_READER_MASK)) {
+ __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+ /*
+ * The wakeup is normally called _after_ the wait_lock
+ * is released, but given that we are proactively waking
+ * readers we can deal with the wake_q overhead as it is
+ * similar to releasing and taking the wait_lock again
+ * for attempting rwsem_try_write_lock().
+ */
+ wake_up_q(&wake_q);
+
+ /*
+ * Reinitialize wake_q after use.
+ */
+ wake_q_init(&wake_q);
+ }
+
+ } else {
+ count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ }
+
+ /* wait until we successfully acquire the lock */
+ set_current_state(state);
+ while (true) {
+ if (rwsem_try_write_lock(count, sem))
+ break;
+ raw_spin_unlock_irq(&sem->wait_lock);
+
+ /* Block until there are no active lockers. */
+ do {
+ if (signal_pending_state(state, current))
+ goto out_nolock;
+
+ schedule();
+ lockevent_inc(rwsem_sleep_writer);
+ set_current_state(state);
+ count = atomic_long_read(&sem->count);
+ } while (count & RWSEM_LOCK_MASK);
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ }
+ __set_current_state(TASK_RUNNING);
+ list_del(&waiter.list);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ lockevent_inc(rwsem_wlock);
+
+ return ret;
+
+out_nolock:
+ __set_current_state(TASK_RUNNING);
+ raw_spin_lock_irq(&sem->wait_lock);
+ list_del(&waiter.list);
+ if (list_empty(&sem->wait_list))
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ else
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ lockevent_inc(rwsem_wlock_fail);
+
+ return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+ return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+ return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+ unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
+
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+ if (!list_empty(&sem->wait_list))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ wake_up_q(&wake_q);
+
+ return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller found RWSEM_FLAG_WAITERS set in count when downgrading
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+ unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
+
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+ if (!list_empty(&sem->wait_list))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ wake_up_q(&wake_q);
+
+ return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ rwsem_down_read_failed(sem);
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED), sem);
+ } else {
+ rwsem_set_reader_owned(sem);
+ }
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ return -EINTR;
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED), sem);
+ } else {
+ rwsem_set_reader_owned(sem);
+ }
+ return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+ /*
+ * Optimize for the case when the rwsem is not locked at all.
+ */
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ lockevent_inc(rwsem_rtrylock);
+ do {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ tmp + RWSEM_READER_BIAS)) {
+ rwsem_set_reader_owned(sem);
+ return 1;
+ }
+ } while (!(tmp & RWSEM_READ_FAILED_MASK));
+ return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
+ rwsem_down_write_failed(sem);
+ rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
+ if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ return -EINTR;
+ rwsem_set_owner(sem);
+ return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ lockevent_inc(rwsem_wtrylock);
+ tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_WRITER_LOCKED);
+ if (tmp == RWSEM_UNLOCKED_VALUE) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+ sem);
+ rwsem_clear_reader_owned(sem);
+ tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+ == RWSEM_FLAG_WAITERS))
+ rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ rwsem_clear_owner(sem);
+ if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+ &sem->count) & RWSEM_FLAG_WAITERS))
+ rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ /*
+ * When downgrading from exclusive to shared ownership,
+ * anything inside the write-locked region cannot leak
+ * into the read side. In contrast, anything in the
+ * read-locked region is ok to be re-ordered into the
+ * write side. As such, rely on RELEASE semantics.
+ */
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ tmp = atomic_long_fetch_add_release(
+ -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+ rwsem_set_reader_owned(sem);
+ if (tmp & RWSEM_FLAG_WAITERS)
+ rwsem_downgrade_wake(sem);
+}
/*
* lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 499a9b2bda82..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,279 +1,10 @@
/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- * i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
-#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
- if (!debug_locks_silent && \
- WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
- #c, atomic_long_read(&(sem)->count), \
- (long)((sem)->owner), (long)current, \
- list_empty(&(sem)->wait_list) ? "" : "not ")) \
- debug_locks_off(); \
- } while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED (1UL << 0)
-#define RWSEM_FLAG_WAITERS (1UL << 1)
-#define RWSEM_READER_SHIFT 8
-#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
- struct task_struct *owner)
-{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
-
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
- __rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
- return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
- return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
- unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
- if (READ_ONCE(sem->owner) == (struct task_struct *)val)
- cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- rwsem_down_read_failed(sem);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
- } else {
- rwsem_set_reader_owned(sem);
- }
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
- } else {
- rwsem_set_reader_owned(sem);
- }
- return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- /*
- * Optimize for the case when the rwsem is not locked at all.
- */
- long tmp = RWSEM_UNLOCKED_VALUE;
-
- lockevent_inc(rwsem_rtrylock);
- do {
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_READER_BIAS)) {
- rwsem_set_reader_owned(sem);
- return 1;
- }
- } while (!(tmp & RWSEM_READ_FAILED_MASK));
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- rwsem_down_write_failed(sem);
- rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- rwsem_set_owner(sem);
- return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- lockevent_inc(rwsem_wtrylock);
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_WRITER_LOCKED);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
- rwsem_set_owner(sem);
- return true;
- }
- return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long tmp;
-
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
- rwsem_clear_reader_owned(sem);
- tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
- if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
- == RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- rwsem_clear_owner(sem);
- if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
- &sem->count) & RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- /*
- * When downgrading from exclusive to shared ownership,
- * anything inside the write-locked region cannot leak
- * into the read side. In contrast, anything in the
- * read-locked region is ok to be re-ordered into the
- * write side. As such, rely on RELEASE semantics.
- */
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- tmp = atomic_long_fetch_add_release(
- -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
- rwsem_set_reader_owned(sem);
- if (tmp & RWSEM_FLAG_WAITERS)
- rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
--
2.18.1
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.
1) The reader critical section is long, perhaps the task sleeps after
acquiring the read lock.
2) There are just too many readers contending the lock causing it to
take a while to service all of them.
In the former case, a long reader critical section will impede the progress
of writers, which is usually more important for system performance.
In the latter case, reader optimistic spinning tends to split the readers
into a larger number of smaller groups, where a group is a set of readers
that acquire the lock together. That may hurt performance in some cases. In
other words, the setting of the nonspinnable bits indicates that reader
optimistic spinning may not be helpful for the workloads that cause it.
Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.
Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinnable bit.
This is to discourage reader optimistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effects of
this feature.
In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.
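The phase check described above can be sketched in plain C as follows.
This is only an illustration of the eligibility test, not the patch code:
it leaves out the actual trylock, the helper names are hypothetical, and
the value used for the writer nonspinnable flag is an assumption (only the
reader-owned and reader nonspinnable bits are shown in this patch).

```c
#include <assert.h>
#include <stdbool.h>

/* Flag bits kept in the low bits of the owner word. */
#define SK_READER_OWNED     (1UL << 0)
#define SK_RD_NONSPINNABLE  (1UL << 1)
#define SK_WR_NONSPINNABLE  (1UL << 2)	/* assumed value, for illustration */
#define SK_OWNER_FLAGS_MASK \
	(SK_READER_OWNED | SK_RD_NONSPINNABLE | SK_WR_NONSPINNABLE)

/*
 * Snapshot taken at slowpath entry: keep the whole owner word if the
 * lock is reader-owned, otherwise keep only the reader nonspinnable bit.
 */
static inline unsigned long sk_snapshot_last_rowner(unsigned long owner)
{
	if (!(owner & SK_READER_OWNED))
		owner &= SK_RD_NONSPINNABLE;
	return owner;
}

/*
 * After optimistic spinning has failed: the reader is still in a valid
 * read phase if the lock is reader-owned now and the recorded reader
 * task differs from the one seen at slowpath entry.
 */
static inline bool sk_in_reader_phase(unsigned long owner_now,
				      unsigned long last_rowner)
{
	if (!(owner_now & SK_READER_OWNED))
		return false;

	return (owner_now & ~SK_OWNER_FLAGS_MASK) !=
	       (last_rowner & ~SK_OWNER_FLAGS_MASK);
}

/* Small self-check with two made-up task addresses. */
static inline void sk_phase_examples(void)
{
	unsigned long t1 = 0x1000, t2 = 0x2000;

	/* Another reader took the lock while we were spinning. */
	assert(sk_in_reader_phase(t2 | SK_READER_OWNED,
			sk_snapshot_last_rowner(t1 | SK_READER_OWNED)));
	/* Same last reader as at entry: not eligible. */
	assert(!sk_in_reader_phase(t1 | SK_READER_OWNED,
			sk_snapshot_last_rowner(t1 | SK_READER_OWNED)));
}
```

In the patch itself, a reader that passes this check still has to win
rwsem_try_read_lock_unqueued() before it owns the lock.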
On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various numbers of threads. The
numbers of operations done before the reader optimistic spinning patches,
before this patch, and after this patch were:
Threads  Before rspin  Before patch  After patch     %change
-------  ------------  ------------  -----------  -------------
   20       5541068       5345484       5455667    -3.5%/ +2.1%
   40      10185150       7292313       9219276   -28.5%/+26.4%
   60       8196733       6460517       7181209   -21.2%/+11.2%
   80       9508864       6739559       8107025   -29.1%/+20.3%
This patch doesn't recover all the lost performance, but it recovers
between about 42% and 67% of it, depending on the thread count. Given that
reader optimistic spinning does benefit some workloads, this is a good
compromise.
Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:
# of Threads  Pre-patch  Post-patch
------------  ---------  ----------
      2         4,730       4,969
      4         4,814       4,786
      8         4,866       4,815
     16         4,715       4,511
     32         3,338       3,500
     64         3,212       3,389
     80         3,110       3,044
When running the locking microbenchmark with 40 dedicated reader threads
and 40 dedicated writer threads, however, the reader performance is
curtailed to favor the writers.
Before patch:
40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644
After patch:
40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 10 ++-
kernel/locking/rwsem.c | 134 +++++++++++++++++++++++++++++-
2 files changed, 136 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index baa998401052..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,12 @@ LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */
LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
-LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
-LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
-LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
-LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */
+LOCK_EVENT(rwsem_opt_rlock) /* # of opt-acquired read locks */
+LOCK_EVENT(rwsem_opt_wlock) /* # of opt-acquired write locks */
+LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
+LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */
+LOCK_EVENT(rwsem_opt_norspin) /* # of disabled reader-only optspins */
+LOCK_EVENT(rwsem_opt_rlock2) /* # of opt-acquired 2ndary read locks */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ec4c26b353c9..743476f386b2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,42 @@
* seems to hang on a reader owned rwsem especially if only one reader
* is involved. Ideally we would like to track all the readers that own
* a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a reader-owned lock and sets the nonspinnable bits, there are two main
+ * reasons for that.
+ *
+ * 1) The reader critical section is long, perhaps the task sleeps after
+ * acquiring the read lock.
+ * 2) There are just too many readers contending the lock causing it to
+ * take a while to service all of them.
+ *
+ * In the former case, a long reader critical section will impede the progress
+ * of writers which is usually more important for system performance. In
+ * the latter case, reader optimistic spinning tends to make the reader
+ * groups that contain readers that acquire the lock together smaller
+ * leading to more of them. That may hurt performance in some cases. In
+ * other words, the setting of nonspinnable bits indicates that reader
+ * optimistic spinning may not be helpful for those workloads that cause
+ * it.
+ *
+ * Therefore, any writers that had observed the setting of the writer
+ * nonspinnable bit for a given rwsem after they fail to acquire the lock
+ * via optimistic spinning will set the reader nonspinnable bit once they
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bits when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinnable bit.
+ *
+ * This is to discourage reader optimistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effects of this
+ * feature.
*/
#define RWSEM_READER_OWNED (1UL << 0)
#define RWSEM_RD_NONSPINNABLE (1UL << 1)
@@ -144,11 +180,14 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
* Note that the owner value just indicates the task has owned the rwsem
* previously, it may not be the real owner or one of the real owners
* anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
*/
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- long val = (long)owner | RWSEM_READER_OWNED;
+ long val = (long)owner | RWSEM_READER_OWNED |
+ (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
atomic_long_set(&sem->owner, val);
}
@@ -286,6 +325,7 @@ struct rwsem_waiter {
struct task_struct *task;
enum rwsem_waiter_type type;
unsigned long timeout;
+ long last_rowner;
};
#define rwsem_first_waiter(sem) \
list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -367,6 +407,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* so we can bail out early if a writer stole the lock.
*/
if (wake_type != RWSEM_WAKE_READ_OWNED) {
+ struct task_struct *owner;
+
adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -387,8 +429,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Set it to reader-owned to give spinners an early
* indication that readers now have the lock.
+ * The reader nonspinnable bit seen at slowpath entry of
+ * the reader is copied over.
*/
- __rwsem_set_reader_owned(sem, waiter->task);
+ owner = waiter->task;
+ if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+ owner = (void *)((long)owner | RWSEM_RD_NONSPINNABLE);
+ lockevent_inc(rwsem_opt_norspin);
+ }
+ __rwsem_set_reader_owned(sem, owner);
}
/*
@@ -836,6 +885,43 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
}
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock is previously not reader-owned or the last read owner changes.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader non-spinnable bit wasn't set at time of entry or it will
+ * not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+ long last_rowner)
+{
+ long owner = atomic_long_read(&sem->owner);
+
+ if (!(owner & RWSEM_READER_OWNED))
+ return false;
+
+ owner &= ~RWSEM_OWNER_FLAGS_MASK;
+ last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
+ if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {
+ lockevent_inc(rwsem_opt_rlock2);
+ lockevent_add(rwsem_opt_fail, -1);
+ return true;
+ }
+ return false;
+}
#else
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
long nonspinnable)
@@ -849,6 +935,12 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
}
static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+ long last_rowner)
+{
+ return false;
+}
#endif
/*
@@ -862,6 +954,14 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ /*
+ * Save the current read-owner of rwsem, if available, and the
+ * reader nonspinnable bit.
+ */
+ waiter.last_rowner = atomic_long_read(&sem->owner);
+ if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+ waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
goto queue;
@@ -884,6 +984,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
wake_up_q(&wake_q);
}
return sem;
+ } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+ return sem;
}
queue:
@@ -964,6 +1066,19 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
return ERR_PTR(-EINTR);
}
+/*
+ * This function is called by a write lock owner. So the owner value
+ * won't get changed by others.
+ */
+static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
+ bool disable)
+{
+ if (unlikely(disable)) {
+ atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
+ lockevent_inc(rwsem_opt_norspin);
+ }
+}
+
/*
* Wait until we successfully acquire the write lock
*/
@@ -971,6 +1086,7 @@ static struct rw_semaphore *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
+ bool disable_rspin;
enum writer_wait_state wstate;
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
@@ -981,6 +1097,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
rwsem_optimistic_spin(sem, true))
return sem;
+ /*
+ * Disable reader optimistic spinning for this rwsem after
+ * acquiring the write lock when the setting of the nonspinnable
+ * bits is observed.
+ */
+ disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
+
/*
* Optimistic spinning failed, proceed to the slowpath
* and block until we can acquire the sem.
@@ -1077,6 +1200,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
+ rwsem_disable_reader_optspin(sem, disable_rspin);
raw_spin_unlock_irq(&sem->wait_lock);
lockevent_inc(rwsem_wlock);
@@ -1196,7 +1320,8 @@ static inline void __down_write(struct rw_semaphore *sem)
if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
RWSEM_WRITER_LOCKED)))
rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
- rwsem_set_owner(sem);
+ else
+ rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1207,8 +1332,9 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
RWSEM_WRITER_LOCKED))) {
if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
+ } else {
+ rwsem_set_owner(sem);
}
- rwsem_set_owner(sem);
return 0;
}
--
2.18.1
With separate count and owner, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, an RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.
On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, can have enough bits to do that. For
x86-64, the physical address can use up to 52 bits. That is 4PB of
memory. That leaves 12 bits available for other use. The task structure
pointer is aligned to the L1 cache line size. That means another 6 bits
(64-byte cacheline) will be available. Reserving 2 bits for status
flags, we will have 16 bits for the reader count and the read fail bit.
That can support up to (32k-1) readers. Without 5-level page tables,
we can support up to (2M-1) readers.
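The bit budget above can be checked with a few lines of C. This is a
standalone sketch that mirrors the __READER_SHIFT and __READER_COUNT_BITS
arithmetic in the patch; the sk_* names are hypothetical, and a 64-bit long
and a 64-byte cacheline (L1_CACHE_SHIFT of 6) are assumed.

```c
#include <assert.h>

#define SK_BITS_PER_LONG	64	/* assumed: 64-bit architecture */
#define SK_L1_CACHE_SHIFT	6	/* assumed: 64-byte cacheline */

/*
 * Bit position where the reader count starts: the compressed owner
 * needs (pa_bits - L1_CACHE_SHIFT) bits, plus 2 low bits for flags.
 */
static inline unsigned int sk_reader_shift(unsigned int pa_bits)
{
	return pa_bits - SK_L1_CACHE_SHIFT + 2;
}

/* Bits left for the reader count; the top bit is the read fail bit. */
static inline unsigned int sk_reader_count_bits(unsigned int pa_bits)
{
	return SK_BITS_PER_LONG - sk_reader_shift(pa_bits) - 1;
}

/* Largest reader count that fits in those bits. */
static inline unsigned long sk_reader_count_max(unsigned int pa_bits)
{
	return (1UL << sk_reader_count_bits(pa_bits)) - 1;
}

/* Check the two x86-64 configurations mentioned in the text. */
static inline void sk_bit_budget_examples(void)
{
	assert(sk_reader_count_bits(52) == 15);	/* 5-level page tables */
	assert(sk_reader_count_bits(46) == 21);	/* 4-level page tables */
}
```

With 52 physical address bits this gives 15 reader count bits, i.e.
(32k-1) readers, and with 46 bits it gives 21 bits, i.e. (2M-1) readers.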
The owner value will still be duplicated in the owner field as that will
ease debugging when looking at a core dump. There may be a slight overhead
in transforming the task pointer to fit into a smaller number of bits,
but that shouldn't be noticeable in real workloads.
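The pointer transformation used by this patch is
(owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2). A userspace sketch of the
round trip is shown below. The PAGE_OFFSET value is only an assumed
example for illustration, the sk_* names are hypothetical, and the special
handling of init_task in the patch is left out.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_OFFSET		0xffff888000000000ULL	/* assumed example */
#define SK_L1_CACHE_SHIFT	6			/* assumed: 64-byte cacheline */

/*
 * Compress a cacheline-aligned direct-map address. The low 6 bits of
 * the address are zero, so shifting right by 4 keeps the 2 lowest bits
 * of the result clear for the waiters and handoff flags.
 */
static inline uint64_t sk_compress_owner(uint64_t task_addr)
{
	assert((task_addr & ((1ULL << SK_L1_CACHE_SHIFT) - 1)) == 0);
	return (task_addr - SK_PAGE_OFFSET) >> (SK_L1_CACHE_SHIFT - 2);
}

/* Inverse transformation; a zero field means there is no writer. */
static inline uint64_t sk_decompress_owner(uint64_t field)
{
	if (!field)
		return 0;
	return (field << (SK_L1_CACHE_SHIFT - 2)) + SK_PAGE_OFFSET;
}

/* Round-trip check with a made-up aligned address. */
static inline void sk_compress_examples(void)
{
	uint64_t addr = SK_PAGE_OFFSET + 0x12340ULL;

	assert((sk_compress_owner(addr) & 3) == 0);
	assert(sk_decompress_owner(sk_compress_owner(addr)) == addr);
}
```

The cost is one subtraction and one shift in each direction, which is the
slight overhead mentioned above.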
This change is currently enabled for x86-64 only. Other 64-bit
architectures may be enabled in the future if the need arises.
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
writer-only locking threads and then equal numbers of readers and writers
(mixed) before this patch and after this and subsequent related patches
were as follows:
                Before Patch       After Patch
# of Threads   wlock    mixed    wlock    mixed
------------   -----    -----    -----    -----
      1       30,422   31,034   30,323   30,379
      2        6,427    6,684    7,804    9,436
      4        6,742    6,738    7,568    8,268
      8        7,092    7,222    5,679    7,041
     16        6,882    7,163    6,848    7,652
     32        7,458    7,316    7,975    2,189
     64        7,906      520    8,269      534
    128        1,680      425    8,047      448
In the single thread case, the complex write-locking operation does
introduce a little bit of overhead (about 0.3%). For the contended cases,
except for some anomalies in the data, there is no evidence that this
change will adversely impact performance.
When running the same microbenchmark with RT locking threads instead,
we got the following results:
                Before Patch       After Patch
# of Threads   wlock    mixed    wlock    mixed
------------   -----    -----    -----    -----
      2        4,065    3,642    4,756    5,062
      4        2,254    1,907    3,460    2,496
      8        2,386      964    3,012    1,964
     16        2,095    1,596    3,083    1,862
     32        2,388      530    3,717      359
     64        1,424      322    4,060      401
    128        1,642      510    4,488      628
RT tasks benefit significantly from this set of patches in most of the
configurations above.
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/Kconfig | 6 ++
kernel/Kconfig.locks | 12 +++
kernel/locking/rwsem.c | 162 ++++++++++++++++++++++++++++++++++++++---
3 files changed, 171 insertions(+), 9 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..141be11a3a6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_USE_RWSEM_OWNER_COUNT if X86_64
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP if X86_64
@@ -353,6 +354,11 @@ config PGTABLE_LEVELS
default 3 if X86_PAE
default 2
+config RWSEM_OWNER_COUNT_PA_BITS
+ int
+ default 52 if X86_5LEVEL
+ default 46 if X86_64
+
config CC_HAS_SANE_STACKPROTECTOR
bool
default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC)) if 64BIT
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index bf770d7556f7..9cd5f8547674 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -258,3 +258,15 @@ config ARCH_HAS_MMIOWB
config MMIOWB
def_bool y if ARCH_HAS_MMIOWB
depends on SMP
+
+#
+# A 64-bit architecture that wants to merge rwsem write-owner into
+# count should select ARCH_USE_RWSEM_OWNER_COUNT and define
+# RWSEM_OWNER_COUNT_PA_BITS as the correct number of physical address
+# bits. In addition, the number of bits available for reader count
+# should allow all the CPUs as defined in NR_CPUS to acquire the same
+# read lock without overflowing it.
+#
+config ARCH_USE_RWSEM_OWNER_COUNT
+ bool
+ depends on SMP && 64BIT
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 028f29b39045..8196ace2d4a2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,7 +116,38 @@
#endif
/*
- * On 64-bit architectures, the bit definitions of the count are:
+ * With separate count and owner, there are timing windows where the two
+ * values are inconsistent. That can cause problems when trying to figure
+ * out the exact state of the rwsem. That can be solved by combining
+ * the count and owner together in a single atomic value.
+ *
+ * On 64-bit architectures, the owner task structure pointer can be
+ * compressed and combined with reader count and other status flags.
+ * A simple compression method is to map the virtual address back to
+ * the physical address by subtracting PAGE_OFFSET. On 32-bit
+ * architectures, the long integer value just isn't big enough for
+ * combining owner and count. So they remain separate.
+ *
+ * For x86-64, the physical address can use up to 52 bits if
+ * CONFIG_X86_5LEVEL. That is 4PB of memory. That leaves 12 bits
+ * available for other use. The task structure pointer is also aligned
+ * to the L1 cache size. That means another 6 bits (64 bytes cacheline)
+ * will be available. Reserving 2 bits for status flags, we will have
+ * 16 bits for the reader count and read fail bit. That can support up
+ * to (32k-1) active readers. If 5-level page table support isn't
+ * configured, we can support up to (2M-1) active readers.
+ *
+ * On x86-64 with CONFIG_X86_5LEVEL and CONFIG_ARCH_USE_RWSEM_OWNER_COUNT,
+ * the bit definitions of the count are:
+ *
+ * Bit 0 - waiters present bit
+ * Bit 1 - lock handoff bit
+ * Bits 2-47 - compressed task structure pointer
+ * Bits 48-62 - 15-bit reader counts
+ * Bit 63 - read fail bit
+ *
+ * On other 64-bit architectures without MERGE_OWNER_INTO_COUNT, the bit
+ * definitions are:
*
* Bit 0 - writer locked bit
* Bit 1 - waiters present bit
@@ -151,26 +182,81 @@
* be the first one in the wait_list to be eligible for setting the handoff
* bit. So concurrent setting/clearing of handoff bit is not possible.
*/
-#define RWSEM_WRITER_LOCKED (1UL << 0)
-#define RWSEM_FLAG_WAITERS (1UL << 1)
-#define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_WAITERS (1UL << 0)
+#define RWSEM_FLAG_HANDOFF (1UL << 1)
#define RWSEM_FLAG_READFAIL (1UL << (BITS_PER_LONG - 1))
+/*
+ * The MERGE_OWNER_INTO_COUNT macro will only be defined if the following
+ * conditions are true:
+ * 1) Both CONFIG_ARCH_USE_RWSEM_OWNER_COUNT and
+ * CONFIG_RWSEM_OWNER_COUNT_PA_BITS are defined.
+ * 2) The number of reader count bits available is able to hold the
+ * maximum number of CPUs as defined in NR_CPUS.
+ */
+#if defined(CONFIG_ARCH_USE_RWSEM_OWNER_COUNT) && \
+ defined(CONFIG_RWSEM_OWNER_COUNT_PA_BITS)
+# define __READER_SHIFT (CONFIG_RWSEM_OWNER_COUNT_PA_BITS -\
+ L1_CACHE_SHIFT + 2)
+# define __READER_COUNT_BITS (BITS_PER_LONG - __READER_SHIFT - 1)
+# define __READER_COUNT_MAX ((1UL << __READER_COUNT_BITS) - 1)
+# if (NR_CPUS <= __READER_COUNT_MAX)
+# define MERGE_OWNER_INTO_COUNT
+# endif
+#endif
+
+#ifdef MERGE_OWNER_INTO_COUNT
+#define RWSEM_READER_SHIFT __READER_SHIFT
+#define RWSEM_WRITER_MASK ((1UL << RWSEM_READER_SHIFT) - 4)
+#define RWSEM_WRITER_LOCKED rwsem_owner_count(current)
+#else /* !MERGE_OWNER_INTO_COUNT */
#define RWSEM_READER_SHIFT 8
+#define RWSEM_WRITER_MASK (1UL << 7)
+#define RWSEM_WRITER_LOCKED RWSEM_WRITER_MASK
+#endif /* MERGE_OWNER_INTO_COUNT */
+
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
+/*
+ * Task structure pointer compression (64-bit only):
+ * (owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2)
+ *
+ * However, init_task may lie outside of the linearly mapped physical
+ * to virtual memory range and so has to be handled separately.
+ */
+static inline unsigned long rwsem_owner_count(struct task_struct *owner)
+{
+ if (unlikely(owner == &init_task))
+ return RWSEM_WRITER_MASK;
+
+ return ((unsigned long)owner - PAGE_OFFSET) >> (L1_CACHE_SHIFT - 2);
+}
+
+static inline unsigned long rwsem_count_owner(long count)
+{
+ unsigned long writer = (unsigned long)count & RWSEM_WRITER_MASK;
+
+ if (unlikely(writer == RWSEM_WRITER_MASK))
+ return (unsigned long)&init_task;
+
+ return writer ? (writer << (L1_CACHE_SHIFT - 2)) + PAGE_OFFSET : 0;
+}
+
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
* store tearing can't happen as optimistic spinners may read and use
* the owner value concurrently without lock. Read from owner, however,
* may not need READ_ONCE() as long as the pointer value is only used
* for comparison and isn't being dereferenced.
+ *
+ * With MERGE_OWNER_INTO_COUNT defined, the writer task structure pointer
+ * is written to the count as well in addition to the owner field.
*/
+
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
atomic_long_set(&sem->owner, (long)current);
@@ -269,6 +355,27 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
}
}
+#ifdef MERGE_OWNER_INTO_COUNT
+/*
+ * Get the owner value from count to have early access to the task structure.
+ */
+static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+{
+ return (struct task_struct *)
+ rwsem_count_owner(atomic_long_read(&sem->count));
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner.
+ */
+static inline struct task_struct *
+rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+{
+ *pflags = atomic_long_read(&sem->owner) & RWSEM_OWNER_FLAGS_MASK;
+ return rwsem_read_owner(sem);
+}
+
/*
* This function does a read trylock by incrementing the reader count
* and then decrementing it immediately if too many readers are present
@@ -276,6 +383,18 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
* of overflowing the count with minimal delay between the increment
* and decrement.
*
+ * When the owner task structure pointer is merged into count, fewer bits
+ * will be available for readers (down to 15 bits for x86-64). There is a
+ * very slight chance that preemption may happen in the middle of the
+ * inc-check-dec sequence leaving the reader count incremented for a
+ * certain period of time until the reader wakes up and moves on. Still
+ * the chance of having enough of these unfortunate sequences of events to
+ * overflow the reader count is infinitesimally small.
+ *
+ * If MERGE_OWNER_INTO_COUNT isn't defined, we don't really need to
+ * worry about the possibility of overflowing the reader counts even
+ * for 32-bit architectures which can support up to 8M readers.
+ *
* It returns the adjustment that should be added back to the count
* in the slowpath.
*/
@@ -291,6 +410,16 @@ static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
return adjustment;
}
+static int __init rwsem_show_count_status(void)
+{
+ pr_info("RW Semaphores: Write-owner in count & %d bits for readers.\n",
+ __READER_COUNT_BITS);
+ return 0;
+}
+late_initcall(rwsem_show_count_status);
+
+#else /* !MERGE_OWNER_INTO_COUNT */
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -313,14 +442,21 @@ rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
}
+static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
+{
+ *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+ return -RWSEM_READER_BIAS;
+}
+#endif /* MERGE_OWNER_INTO_COUNT */
+
/*
* Guide to the rw_semaphore's count field.
*
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
+ * When any of the RWSEM_WRITER_MASK bits in count is set, the lock is
+ * owned by a writer.
*
* The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (1) none of the RWSEM_WRITER_MASK bits is set in count,
* (2) some of the reader bits are set in count, and
* (3) the owner field has RWSEM_READ_OWNED bit set.
*
@@ -1386,6 +1522,14 @@ static inline void __down_write(struct rw_semaphore *sem)
rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
else
rwsem_set_owner(sem);
+#ifdef MERGE_OWNER_INTO_COUNT
+ /*
+ * Make sure that count<=>owner translation is correct.
+ */
+ DEBUG_RWSEMS_WARN_ON(
+ (atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK) !=
+ (long)rwsem_read_owner(sem), sem);
+#endif
}
static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1446,7 +1590,7 @@ static inline void __up_write(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
!rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
- tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+ tmp = atomic_long_fetch_and_release(~RWSEM_WRITER_MASK, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
rwsem_wake(sem, tmp);
}
--
2.18.1
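For readers following the task-pointer compression in the patch above, here is
a minimal user-space sketch of the round trip. The DEMO_* names and values are
assumptions made for illustration (a fixed base address and a 64-byte cache
line on a 64-bit build); the kernel derives PAGE_OFFSET and L1_CACHE_SHIFT per
architecture and configuration, and the init_task special case is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only, not taken from any kernel config. */
#define DEMO_PAGE_OFFSET    UINT64_C(0xffff888000000000)
#define DEMO_L1_CACHE_SHIFT 6

/* Compress: subtract the common base, then shift out the low alignment bits. */
static inline uint64_t demo_owner_count(uint64_t owner)
{
	/* The round trip is lossless only for cache-line-aligned addresses. */
	assert(owner >= DEMO_PAGE_OFFSET);
	assert((owner & ((UINT64_C(1) << DEMO_L1_CACHE_SHIFT) - 1)) == 0);
	return (owner - DEMO_PAGE_OFFSET) >> (DEMO_L1_CACHE_SHIFT - 2);
}

/* Decompress: zero means "no writer"; otherwise undo the shift and the base. */
static inline uint64_t demo_count_owner(uint64_t writer)
{
	return writer ? (writer << (DEMO_L1_CACHE_SHIFT - 2)) + DEMO_PAGE_OFFSET : 0;
}
```

Because the shift is two bits smaller than the assumed alignment, the two low
bits of the compressed value are always clear in this sketch.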
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, those readers
behind the writer will not be woken up.
Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.
Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.
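The selection policy described above can be sketched in user space as follows.
This is only a model under stated assumptions: the queue is a plain array
instead of the kernel wait list, and every demo_* / DEMO_* name is hypothetical.
It shows writers being skipped while readers behind them are still picked, up
to the cap.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_READERS_WAKEUP 0x100

enum demo_waiter_type { DEMO_WAITING_FOR_WRITE, DEMO_WAITING_FOR_READ };

/*
 * Walk a FIFO queue whose first waiter is a reader. Writers are skipped
 * and keep their queue position; readers are marked for wakeup until the
 * cap is reached. Returns the number of readers marked in picked[].
 */
static size_t demo_pick_readers(const enum demo_waiter_type *queue, size_t len,
				unsigned char *picked)
{
	size_t woken = 0;

	for (size_t i = 0; i < len; i++)
		picked[i] = 0;

	for (size_t i = 0; i < len; i++) {
		if (queue[i] == DEMO_WAITING_FOR_WRITE)
			continue;
		picked[i] = 1;
		/* Limit the number of readers woken per call. */
		if (++woken >= DEMO_MAX_READERS_WAKEUP)
			break;
	}
	return woken;
}
```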
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149
There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index eb43201b89b4..b8e209c5fa55 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
*/
#define RWSEM_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP 0x100
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -329,11 +337,17 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
}
/*
- * Grant an infinite number of read locks to the readers at the front
- * of the queue. We know that woken will be at least 1 as we accounted
+ * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+ * queue. We know that woken will be at least 1 as we accounted
* for above. Note we increment the 'active part' of the count by the
* number of readers before waking any processes up.
*
+ * This is an adaptation of the phase-fair R/W locks where at the
+ * reader phase (first waiter is a reader), all readers are eligible
+ * to acquire the lock at the same time irrespective of their order
+ * in the queue. The writers acquire the lock according to their
+ * order in the queue.
+ *
* We have to do wakeup in 2 passes to prevent the possibility that
* the reader count may be decremented before it is incremented. It
* is because the to-be-woken waiter may not have slept yet. So it
@@ -345,13 +359,20 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* 2) For each waiters in the new list, clear waiter->task and
* put them into wake_q to be woken up later.
*/
- list_for_each_entry(waiter, &sem->wait_list, list) {
+ INIT_LIST_HEAD(&wlist);
+ list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
if (waiter->type == RWSEM_WAITING_FOR_WRITE)
- break;
+ continue;
woken++;
+ list_move_tail(&waiter->list, &wlist);
+
+ /*
+ * Limit # of readers that can be woken up per wakeup call.
+ */
+ if (woken >= MAX_READERS_WAKEUP)
+ break;
}
- list_cut_before(&wlist, &sem->wait_list, &waiter->list);
adjustment = woken * RWSEM_READER_BIAS - adjustment;
lockevent_cond_inc(rwsem_wake_reader, woken);
--
2.18.1
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.
Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
using a stale count value.
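The unlock/wake/relock pattern described above can be modelled in a few lines
of single-threaded C. This is a sketch, not the kernel code: the lock is a
plain flag, the wake queue is just a counter, and all demo_* names are
hypothetical. It only illustrates that wakeups are issued while the lock is
dropped and that the round trip is skipped when nothing is queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_sem {
	bool   wait_lock_held;  /* stand-in for sem->wait_lock */
	size_t npending;        /* stand-in for the wake_q contents */
	size_t woken_unlocked;  /* wakeups issued while the lock was dropped */
};

static void demo_lock(struct demo_sem *s)
{
	assert(!s->wait_lock_held);
	s->wait_lock_held = true;
}

static void demo_unlock(struct demo_sem *s)
{
	assert(s->wait_lock_held);
	s->wait_lock_held = false;
}

/* Issue every queued wakeup and leave the queue empty for reuse. */
static void demo_wake_up_q(struct demo_sem *s)
{
	for (size_t i = 0; i < s->npending; i++)
		if (!s->wait_lock_held)
			s->woken_unlocked++;
	s->npending = 0;
}

/* Called with the lock held; returns with the lock held. */
static void demo_wake_waiters(struct demo_sem *s)
{
	assert(s->wait_lock_held);
	if (s->npending == 0)
		return;		/* nothing queued: skip the unlock/relock */
	demo_unlock(s);
	demo_wake_up_q(s);
	demo_lock(s);
}
```

Since the lock is dropped in the middle, any state read before the wakeups
has to be read again afterwards, which matches the reason given above for
having rwsem_try_write_lock() read the count itself.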
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched/wake_q.h | 5 +++++
kernel/locking/rwsem.c | 31 +++++++++++++++----------------
2 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
head->lastp = &head->first;
}
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+ return head->first == WAKE_Q_TAIL;
+}
+
extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0c8aef065acb..36aed5236bd2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -400,13 +400,14 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* If wstate is WRITER_HANDOFF, it will make sure that either the handoff
* bit is set or the lock is acquired with handoff bit cleared.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
enum writer_wait_state wstate)
{
- long new;
+ long count, new;
lockdep_assert_held(&sem->wait_lock);
+ count = atomic_long_read(&sem->count);
do {
bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -751,26 +752,25 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
? RWSEM_WAKE_READERS
: RWSEM_WAKE_ANY, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
-
- /* We need wake_q again below, reinitialize */
- wake_q_init(&wake_q);
+ if (!wake_q_empty(&wake_q)) {
+ /*
+ * We want to minimize wait_lock hold time especially
+ * when a large number of readers are to be woken up.
+ */
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q); /* Used again, reinit */
+ raw_spin_lock_irq(&sem->wait_lock);
+ }
} else {
- count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
}
wait:
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem, wstate))
+ if (rwsem_try_write_lock(sem, wstate))
break;
raw_spin_unlock_irq(&sem->wait_lock);
@@ -811,7 +811,6 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
}
raw_spin_lock_irq(&sem->wait_lock);
- count = atomic_long_read(&sem->count);
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
--
2.18.1
Hi Waiman,
On Tue, 21 May 2019 at 05:01, Waiman Long <[email protected]> wrote:
>
> Because of writer lock stealing, it is possible that a constant
> stream of incoming writers will cause a waiting writer or reader to
> wait indefinitely leading to lock starvation.
>
> This patch implements a lock handoff mechanism to disable lock stealing
> and force lock handoff to the first waiter or waiters (for readers)
> in the queue after at least a 4ms waiting period unless it is a RT
> writer task which doesn't need to wait. The waiting period is used to
> avoid discouraging lock stealing too much to affect performance.
I was working on a patchset to solve read-write lock deadlock
detection problem (https://lkml.org/lkml/2019/5/16/93).
One of the mistakes in that work is that I considered the following
case as deadlock:
T1 T2
-- --
down_read1 down_write2
down_write2 down_read1
So I was trying to understand what really went wrong, and found that the
problem is that, if I understand correctly, the current rwsem design
isn't showing real fairness but priority in favor of write locks, and
thus one of the bad effects is that read locks can be starved if write
locks keep coming.
Luckily, I noticed you are revamping rwsem and seem to have thought
about it already. I am not entirely sure what your work's
ramifications are for the above case, so I hope that you can shed some
light and perhaps share your thoughts on this.
Thanks,
Yuyang
On Tue, 4 Jun 2019 at 11:03, Yuyang Du <[email protected]> wrote:
>
> Hi Waiman,
>
> On Tue, 21 May 2019 at 05:01, Waiman Long <[email protected]> wrote:
> >
> > Because of writer lock stealing, it is possible that a constant
> > stream of incoming writers will cause a waiting writer or reader to
> > wait indefinitely leading to lock starvation.
> >
> > This patch implements a lock handoff mechanism to disable lock stealing
> > and force lock handoff to the first waiter or waiters (for readers)
> > in the queue after at least a 4ms waiting period unless it is a RT
> > writer task which doesn't need to wait. The waiting period is used to
> > avoid discouraging lock stealing too much to affect performance.
>
> I was working on a patchset to solve read-write lock deadlock
> detection problem (https://lkml.org/lkml/2019/5/16/93).
>
> One of the mistakes in that work is that I considered the following
> case as deadlock:
Sorry everyone, but let me rephrase:
One of the mistakes in that work is that I considered the following
case as no deadlock:
>
> T1 T2
> -- --
>
> down_read1 down_write2
>
> down_write2 down_read1
>
> So I was trying to understand what really went wrong and find the
> problem is that if I understand correctly the current rwsem design
> isn't showing real fairness but priority in favor of write locks, and
> thus one of the bad effects is that read locks can be starved if write
> locks keep coming.
>
> Luckily, I noticed you are revamping rwsem and seem to have thought
> about it already. I am not crystal sure what is your work's
> ramification on the above case, so hope that you can shed some light
> and perhaps share your thoughts on this.
>
> Thanks,
> Yuyang
On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
> +{
> + return (struct task_struct *)(atomic_long_read(&sem->owner) &
> + ~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
> +/*
> + * Return the real task structure pointer of the owner and the embedded
> + * flags in the owner. pflags must be non-NULL.
> + */
> +static inline struct task_struct *
> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
> +{
> + long owner = atomic_long_read(&sem->owner);
> +
> + *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
> + return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
> +}
I got confused by the 'read' part in those names; I initially thought
they paired with rwsem_set_reader_owned().
So I've done 's/rwsem_read_owner/rwsem_owner/g' on it.
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -194,10 +194,10 @@ static inline void rwsem_clear_reader_ow
/*
* Return just the real task structure pointer of the owner
*/
-static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
+static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
{
- return (struct task_struct *)(atomic_long_read(&sem->owner) &
- ~RWSEM_OWNER_FLAGS_MASK);
+ return (struct task_struct *)
+ (atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK);
}
/*
@@ -205,7 +205,7 @@ static inline struct task_struct *rwsem_
* flags in the owner. pflags must be non-NULL.
*/
static inline struct task_struct *
-rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
+rwsem_owner_flags(struct rw_semaphore *sem, long *pflags)
{
long owner = atomic_long_read(&sem->owner);
@@ -561,7 +561,7 @@ static inline bool rwsem_can_spin_on_own
preempt_disable();
rcu_read_lock();
- owner = rwsem_read_owner_flags(sem, &flags);
+ owner = rwsem_owner_flags(sem, &flags);
if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
ret = false;
rcu_read_unlock();
@@ -590,8 +590,8 @@ enum owner_state {
};
#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
-static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
- long flags)
+static inline enum owner_state
+rwsem_owner_state(struct task_struct *owner, long flags)
{
if (flags & RWSEM_NONSPINNABLE)
return OWNER_NONSPINNABLE;
@@ -608,7 +608,7 @@ static noinline enum owner_state rwsem_s
long flags, new_flags;
enum owner_state state;
- owner = rwsem_read_owner_flags(sem, &flags);
+ owner = rwsem_owner_flags(sem, &flags);
state = rwsem_owner_state(owner, flags);
if (state != OWNER_WRITER)
return state;
@@ -620,7 +620,7 @@ static noinline enum owner_state rwsem_s
break;
}
- new = rwsem_read_owner_flags(sem, &new_flags);
+ new = rwsem_owner_flags(sem, &new_flags);
if ((new != owner) || (new_flags != flags)) {
state = rwsem_owner_state(new, new_flags);
break;
@@ -1139,7 +1139,7 @@ static inline void __up_write(struct rw_
* sem->owner may differ from current if the ownership is transferred
* to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
*/
- DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
+ DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) &&
!rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
@@ -1161,7 +1161,7 @@ static inline void __downgrade_write(str
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
+ DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
On Mon, May 20, 2019 at 04:59:13PM -0400, Waiman Long wrote:
> +static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
> +{
> + long owner = atomic_long_read(&sem->owner);
> +
> + while (owner & RWSEM_READER_OWNED) {
> + if (owner & RWSEM_NONSPINNABLE)
> + break;
> + owner = atomic_long_cmpxchg(&sem->owner, owner,
> + owner | RWSEM_NONSPINNABLE);
> + }
> +}
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -206,12 +206,13 @@ static inline void rwsem_set_nonspinnabl
{
long owner = atomic_long_read(&sem->owner);
- while (owner & RWSEM_READER_OWNED) {
+ do {
+ if (!(owner & RWSEM_READER_OWNED))
+ break;
if (owner & RWSEM_NONSPINNABLE)
break;
- owner = atomic_long_cmpxchg(&sem->owner, owner,
- owner | RWSEM_NONSPINNABLE);
- }
+ } while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
+ owner | RWSEM_NONSPINNABLE));
}
/*
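For anyone unfamiliar with the try_cmpxchg idiom used in the rewrite above,
here is a user-space analogue using C11 atomics. It is a sketch under
assumptions: the flag bit positions and the demo_* names are made up for
illustration, and atomic_compare_exchange_strong() stands in for
atomic_long_try_cmpxchg(). Both update the expected value on failure, so the
loop needs no explicit reload.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits; the kernel's definitions may differ. */
#define DEMO_READER_OWNED  (1L << 0)
#define DEMO_NONSPINNABLE  (1L << 1)

static void demo_set_nonspinnable(atomic_long *owner_field)
{
	long owner = atomic_load(owner_field);

	do {
		/* Only a reader-owned lock gets the bit set. */
		if (!(owner & DEMO_READER_OWNED))
			break;
		/* Already set: nothing to do. */
		if (owner & DEMO_NONSPINNABLE)
			break;
		/* On failure, 'owner' is refreshed with the current value. */
	} while (!atomic_compare_exchange_strong(owner_field, &owner,
						 owner | DEMO_NONSPINNABLE));
}
```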
On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> Reader optimistic spinning is helpful when the reader critical section
> is short and there aren't that many readers around. It makes readers
> relatively more preferred than writers. When a writer times out spinning
> on a reader-owned lock and sets the nonspinnable bits, there are two main
> reasons for that.
>
> 1) The reader critical section is long, perhaps the task sleeps after
> acquiring the read lock.
> 2) There are just too many readers contending the lock causing it to
> take a while to service all of them.
>
> In the former case, long reader critical section will impede the progress
> of writers which is usually more important for system performance.
> In the latter case, reader optimistic spinning tends to make the reader
> groups that contain readers that acquire the lock together smaller
> leading to more of them. That may hurt performance in some cases. In
> other words, the setting of nonspinnable bits indicates that reader
> optimistic spinning may not be helpful for those workloads that cause it.
>
> Therefore, any writers that have observed the setting of the writer
> nonspinnable bit for a given rwsem after they fail to acquire the lock
> via optimistic spinning will set the reader nonspinnable bit once they
> acquire the write lock. Similarly, readers that observe the setting
> of reader nonspinnable bit at slowpath entry will also set the reader
> nonspinnable bit when they acquire the read lock via the wakeup path.
So both cases set the _reader_ nonspinnable bit?
On Tue, Jun 04, 2019 at 11:26:30AM +0800, Yuyang Du wrote:
> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <[email protected]> wrote:
> >
> > Hi Waiman,
> >
> > On Tue, 21 May 2019 at 05:01, Waiman Long <[email protected]> wrote:
> > >
> > > Because of writer lock stealing, it is possible that a constant
> > > stream of incoming writers will cause a waiting writer or reader to
> > > wait indefinitely leading to lock starvation.
> > >
> > > This patch implements a lock handoff mechanism to disable lock stealing
> > > and force lock handoff to the first waiter or waiters (for readers)
> > > in the queue after at least a 4ms waiting period unless it is a RT
> > > writer task which doesn't need to wait. The waiting period is used to
> > > avoid discouraging lock stealing too much to affect performance.
> >
> > I was working on a patchset to solve read-write lock deadlock
> > detection problem (https://lkml.org/lkml/2019/5/16/93).
> >
> > One of the mistakes in that work is that I considered the following
> > case as deadlock:
>
> Sorry everyone, but let me rephrase:
>
> One of the mistakes in that work is that I considered the following
> case as no deadlock:
>
> >
> > T1 T2
> > -- --
> >
> > down_read1 down_write2
> >
> > down_write2 down_read1
> >
Not sure I understand the whole context here, but doesn't adding a third
independent task make this a deadlock?
	T1		T2		T3
	--		--		--
	down_read1	down_write2
					down_write1
	down_write2	down_read1
From the perspective of lockdep, we cannot be sure whether there will
be a T3 or not.
In case I misunderstood you, maybe your point is whether, in the
above case, "down_read1" on T2 is *guaranteed* to steal (in the
sense of breaking the fairness) the read lock after Waiman's modification?
If so, I will wait for Waiman's response ;-)
Regards,
Boqun
> > So I was trying to understand what really went wrong and find the
> > problem is that if I understand correctly the current rwsem design
> > isn't showing real fairness but priority in favor of write locks, and
> > thus one of the bad effects is that read locks can be starved if write
> > locks keep coming.
> >
> > Luckily, I noticed you are revamping rwsem and seem to have thought
> > about it already. I am not crystal sure what is your work's
> > ramification on the above case, so hope that you can shed some light
> > and perhaps share your thoughts on this.
> >
> > Thanks,
> > Yuyang
On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
> the will-it-scale benchmark was run with various number of threads. The
> number of operations done before reader optimistic spinning patches,
> this patch and after this patch were:
>
> Threads Before rspin Before patch After patch %change
> ------- ------------ ------------ ----------- -------
> 20 5541068 5345484 5455667 -3.5%/ +2.1%
> 40 10185150 7292313 9219276 -28.5%/+26.4%
> 60 8196733 6460517 7181209 -21.2%/+11.2%
> 80 9508864 6739559 8107025 -29.1%/+20.3%
'rspin' is patch 12 in this series, right?
On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> @@ -286,6 +325,7 @@ struct rwsem_waiter {
> struct task_struct *task;
> enum rwsem_waiter_type type;
> unsigned long timeout;
> + long last_rowner;
> };
> #define rwsem_first_waiter(sem) \
> list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> + long last_rowner)
> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> + unsigned long last_rowner)
> + waiter.last_rowner = atomic_long_read(&sem->owner);
That's somewhat inconsistent wrt the type. I'll make it unsigned long,
as that is what makes most sense, given there's a pointer inside.
On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> With separate count and owner, there are timing windows where the two
> values are inconsistent. That can cause problem when trying to figure
> out the exact state of the rwsem. For instance, a RT task will stop
> optimistic spinning if the lock is acquired by a writer but the owner
> field isn't set yet. That can be solved by combining the count and
> owner together in a single atomic value.
I just realized we can use cmpxchg_double() here (where available of
course).
On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
> + long last_rowner)
> +{
> + long owner = atomic_long_read(&sem->owner);
> +
> + if (!(owner & RWSEM_READER_OWNED))
> + return false;
> +
> + owner &= ~RWSEM_OWNER_FLAGS_MASK;
> + last_rowner &= ~RWSEM_OWNER_FLAGS_MASK;
> + if ((owner != last_rowner) && rwsem_try_read_lock_unqueued(sem)) {
just because I'm struggling with sleep deprivation and the big picture
isn't making sense,.. you can write that like:
((owner ^ last_rowner) & ~RWSEM_OWNER_FLAGS_MASK)
> + lockevent_inc(rwsem_opt_rlock2);
> + lockevent_add(rwsem_opt_fail, -1);
> + return true;
> + }
> + return false;
> +}
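The XOR form suggested above gives the same answer as masking both values
and comparing them. A small user-space check, assuming a three-bit flags
mask (the DEMO_* value and the demo_* names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_OWNER_FLAGS_MASK 0x7L	/* assumed: low three bits hold flags */

/* Original form: strip the flags from both values, then compare. */
static bool demo_owner_changed_masked(long owner, long last_rowner)
{
	owner &= ~DEMO_OWNER_FLAGS_MASK;
	last_rowner &= ~DEMO_OWNER_FLAGS_MASK;
	return owner != last_rowner;
}

/* XOR form: any differing bit outside the flags means a different owner. */
static bool demo_owner_changed_xor(long owner, long last_rowner)
{
	return ((owner ^ last_rowner) & ~DEMO_OWNER_FLAGS_MASK) != 0;
}
```

The XOR version needs a single mask operation and leaves both inputs
unmodified.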
On 6/3/19 11:26 PM, Yuyang Du wrote:
> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <[email protected]> wrote:
>> Hi Waiman,
>>
>> On Tue, 21 May 2019 at 05:01, Waiman Long <[email protected]> wrote:
>>> Because of writer lock stealing, it is possible that a constant
>>> stream of incoming writers will cause a waiting writer or reader to
>>> wait indefinitely leading to lock starvation.
>>>
>>> This patch implements a lock handoff mechanism to disable lock stealing
>>> and force lock handoff to the first waiter or waiters (for readers)
>>> in the queue after at least a 4ms waiting period unless it is a RT
>>> writer task which doesn't need to wait. The waiting period is used to
>>> avoid discouraging lock stealing too much to affect performance.
>> I was working on a patchset to solve read-write lock deadlock
>> detection problem (https://lkml.org/lkml/2019/5/16/93).
>>
>> One of the mistakes in that work is that I considered the following
>> case as deadlock:
> Sorry everyone, but let me rephrase:
>
> One of the mistakes in that work is that I considered the following
> case as no deadlock:
>
>> T1 T2
>> -- --
>>
>> down_read1 down_write2
>>
>> down_write2 down_read1
Yes, that combination shouldn't cause a deadlock. However, the lockdep
code isn't able to recognize this case, so you may still see a splat
about a possible deadlock scenario when lockdep checking is enabled. So
the general advice is still to try to rearrange the lock ordering, if
possible.
>> So I was trying to understand what really went wrong and find the
>> problem is that if I understand correctly the current rwsem design
>> isn't showing real fairness but priority in favor of write locks, and
>> thus one of the bad effects is that read locks can be starved if write
>> locks keep coming.
>>
>> Luckily, I noticed you are revamping rwsem and seem to have thought
>> about it already. I am not crystal sure what is your work's
>> ramification on the above case, so hope that you can shed some light
>> and perhaps share your thoughts on this.
Lock starvation is certainly possible with the current rwsem code. Why
not try applying the patch to see if it can remedy your problem?
Cheers,
Longman
On 6/4/19 4:52 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
>> +{
>> + return (struct task_struct *)(atomic_long_read(&sem->owner) &
>> + ~RWSEM_OWNER_FLAGS_MASK);
>> +}
>> +
>> +/*
>> + * Return the real task structure pointer of the owner and the embedded
>> + * flags in the owner. pflags must be non-NULL.
>> + */
>> +static inline struct task_struct *
>> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
>> +{
>> + long owner = atomic_long_read(&sem->owner);
>> +
>> + *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
>> + return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
>> +}
> I got confused by the 'read' part in those names, I initially thought
> they paired with rwsem_set_reader_owned().
Sorry for the confusion. I initially used "get", but I was afraid that it
might be confused with the "get/put" of reference counting. So I used
"read" instead.
> So I've done 's/rwsem_read_owner/rwsem_owner/g on it.
I am fine with getting rid of the "read" from the function names.
Cheers,
Longman
On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
>> With separate count and owner, there are timing windows where the two
>> values are inconsistent. That can cause problem when trying to figure
>> out the exact state of the rwsem. For instance, a RT task will stop
>> optimistic spinning if the lock is acquired by a writer but the owner
>> field isn't set yet. That can be solved by combining the count and
>> owner together in a single atomic value.
> I just realized we can use cmpxchg_double() here (where available of
> course).
Do the 2 doubles need to be 128-bit aligned to use cmpxchg_double()? I
don't think we can guarantee that unless we explicitly set this alignment.
Cheers,
Longman
On 6/4/19 5:12 AM, Boqun Feng wrote:
> On Tue, Jun 04, 2019 at 11:26:30AM +0800, Yuyang Du wrote:
>> On Tue, 4 Jun 2019 at 11:03, Yuyang Du <[email protected]> wrote:
>>> Hi Waiman,
>>>
>>> On Tue, 21 May 2019 at 05:01, Waiman Long <[email protected]> wrote:
>>>> Because of writer lock stealing, it is possible that a constant
>>>> stream of incoming writers will cause a waiting writer or reader to
>>>> wait indefinitely leading to lock starvation.
>>>>
>>>> This patch implements a lock handoff mechanism to disable lock stealing
>>>> and force lock handoff to the first waiter or waiters (for readers)
>>>> in the queue after at least a 4ms waiting period unless it is a RT
>>>> writer task which doesn't need to wait. The waiting period is used to
>>>> avoid discouraging lock stealing too much to affect performance.
>>> I was working on a patchset to solve read-write lock deadlock
>>> detection problem (https://lkml.org/lkml/2019/5/16/93).
>>>
>>> One of the mistakes in that work is that I considered the following
>>> case as deadlock:
>> Sorry everyone, but let me rephrase:
>>
>> One of the mistakes in that work is that I considered the following
>> case as no deadlock:
>>
>>> T1 T2
>>> -- --
>>>
>>> down_read1 down_write2
>>>
>>> down_write2 down_read1
>>>
> Not sure I understand the whole context here, but isn't adding a third
> independent task makes this a deadlock?
>
> T1 T2 T3
> -- -- --
>
> down_read1 down_write2
> down_write1
> down_write2 down_read1
>
> from the perspective of lockdep, we cannot be sure whether there will
> a T3 or not.
Yes, that will be a deadlock even with my rwsem patch applied, as it
will still try to preserve the reader-writer ordering. So it will
certainly be safer to have the same lock ordering for both tasks.
>
> In case that I mis-understood you, maybe your point is about in the
> above case whether "down_read1" on T2 can *guaranteedly* steal (in the
> sense of breaking the fairness) the read lock after Waiman modification?
> If so, I will wait for Waiman's response ;-)
With my patchset applied, the reader-writer ordering is still supposed
to be preserved. Of course, there can be exceptions depending on the
exact timing, but we can't rely on that to prevent deadlock.
Cheers,
Longman
On 6/4/19 5:03 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:13PM -0400, Waiman Long wrote:
>> +static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
>> +{
>> + long owner = atomic_long_read(&sem->owner);
>> +
>> + while (owner & RWSEM_READER_OWNED) {
>> + if (owner & RWSEM_NONSPINNABLE)
>> + break;
>> + owner = atomic_long_cmpxchg(&sem->owner, owner,
>> + owner | RWSEM_NONSPINNABLE);
>> + }
>> +}
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -206,12 +206,13 @@ static inline void rwsem_set_nonspinnabl
> {
> long owner = atomic_long_read(&sem->owner);
>
> - while (owner & RWSEM_READER_OWNED) {
> + do {
> + if (!(owner & RWSEM_READER_OWNED))
> + break;
> if (owner & RWSEM_NONSPINNABLE)
> break;
> - owner = atomic_long_cmpxchg(&sem->owner, owner,
> - owner | RWSEM_NONSPINNABLE);
> - }
> + } while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
> + owner | RWSEM_NONSPINNABLE));
> }
>
> /*
Sure.
Thanks,
Longman
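For readers outside the kernel tree, the loop shape suggested above can be sketched in portable C11. This is only an illustration: `READER_OWNED`, `NONSPINNABLE` and `set_nonspinnable()` are stand-in names with assumed bit values, and `atomic_compare_exchange_weak()` plays the role of `atomic_long_try_cmpxchg()` in that it writes the current value back into the expected variable on failure, so no explicit re-read is needed.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define READER_OWNED  (1L << 0)
#define NONSPINNABLE  (1L << 1)

/*
 * Set NONSPINNABLE in *owner, but only while READER_OWNED is set and
 * NONSPINNABLE is not yet set. A failed compare-exchange refreshes
 * 'old' with the current value, and the conditions are then rechecked.
 */
static void set_nonspinnable(atomic_long *owner)
{
	long old = atomic_load(owner);

	do {
		if (!(old & READER_OWNED))
			break;
		if (old & NONSPINNABLE)
			break;
	} while (!atomic_compare_exchange_weak(owner, &old,
					       old | NONSPINNABLE));
}
```

The do/while form keeps both exit conditions next to each other and leaves the compare-exchange as the only place where the shared word is written.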
On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> > On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> >> With separate count and owner, there are timing windows where the two
> >> values are inconsistent. That can cause problem when trying to figure
> >> out the exact state of the rwsem. For instance, a RT task will stop
> >> optimistic spinning if the lock is acquired by a writer but the owner
> >> field isn't set yet. That can be solved by combining the count and
> >> owner together in a single atomic value.
> > I just realized we can use cmpxchg_double() here (where available of
> > course).
>
> Does the 2 doubles need to be 128-bit aligned to use cmpxchg_double()? I
> don't think we can guarantee that unless we explicitly set this alignment.
It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
and such.
On 6/4/19 1:02 PM, Peter Zijlstra wrote:
> On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
>> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
>>> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
>>>> With separate count and owner, there are timing windows where the two
>>>> values are inconsistent. That can cause problem when trying to figure
>>>> out the exact state of the rwsem. For instance, a RT task will stop
>>>> optimistic spinning if the lock is acquired by a writer but the owner
>>>> field isn't set yet. That can be solved by combining the count and
>>>> owner together in a single atomic value.
>>> I just realized we can use cmpxchg_double() here (where available of
>>> course).
>> Does the 2 doubles need to be 128-bit aligned to use cmpxchg_double()? I
>> don't think we can guarantee that unless we explicitly set this alignment.
> It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
> and such.
So do you want this as an option now as it will be x86 specific? Or we
can do that as a follow-up if we want to.
Cheers,
Longman
On Tue, Jun 04, 2019 at 01:06:11PM -0400, Waiman Long wrote:
> On 6/4/19 1:02 PM, Peter Zijlstra wrote:
> > On Tue, Jun 04, 2019 at 11:47:21AM -0400, Waiman Long wrote:
> >> On 6/4/19 5:45 AM, Peter Zijlstra wrote:
> >>> On Mon, May 20, 2019 at 04:59:16PM -0400, Waiman Long wrote:
> >>>> With separate count and owner, there are timing windows where the two
> >>>> values are inconsistent. That can cause problem when trying to figure
> >>>> out the exact state of the rwsem. For instance, a RT task will stop
> >>>> optimistic spinning if the lock is acquired by a writer but the owner
> >>>> field isn't set yet. That can be solved by combining the count and
> >>>> owner together in a single atomic value.
> >>> I just realized we can use cmpxchg_double() here (where available of
> >>> course).
> >> Does the 2 doubles need to be 128-bit aligned to use cmpxchg_double()? I
> >> don't think we can guarantee that unless we explicitly set this alignment.
> > It does :/ and yes, we'd need to play games with __align(2*sizeof(long))
> > and such.
>
> So do you want this as an option now as it will be x86 specific? Or we
> can do that as a follow-up if we want to.
x86, s390 and arm64 have cmpxchg_double().
I was going to have a look (but like I wrote, I'm pretty useless today
so I didn't actually get anywhere) at the exact race that's a problem
here and see if there's not another solution too.
On 6/4/19 5:10 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> Reader optimistic spinning is helpful when the reader critical section
>> is short and there aren't that many readers around. It makes readers
>> relatively more preferred than writers. When a writer times out spinning
>> on a reader-owned lock and sets the nonspinnable bits, there are two main
>> reasons for that.
>>
>> 1) The reader critical section is long, perhaps the task sleeps after
>> acquiring the read lock.
>> 2) There are just too many readers contending the lock causing it to
>> take a while to service all of them.
>>
>> In the former case, long reader critical section will impede the progress
>> of writers which is usually more important for system performance.
>> In the latter case, reader optimistic spinning tends to make the reader
>> groups that contain readers that acquire the lock together smaller
>> leading to more of them. That may hurt performance in some cases. In
>> other words, the setting of nonspinnable bits indicates that reader
>> optimistic spinning may not be helpful for those workloads that cause it.
>>
>> Therefore, any writers that have observed the setting of the writer
>> nonspinnable bit for a given rwsem after they fail to acquire the lock
>> via optimistic spinning will set the reader nonspinnable bit once they
>> acquire the write lock. Similarly, readers that observe the setting
>> of reader nonspinnable bit at slowpath entry will also set the reader
>> nonspinnable bit when they acquire the read lock via the wakeup path.
> So both cases set the _reader_ nonspinnable bit?
Yes.
-Longman
On 6/4/19 5:14 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
>> the will-it-scale benchmark was run with various number of threads. The
>> number of operations done before reader optimistic spinning patches,
>> this patch and after this patch were:
>>
>> Threads  Before rspin  Before patch  After patch        %change
>> -------  ------------  ------------  -----------  -------------
>>      20       5541068       5345484      5455667   -3.5%/ +2.1%
>>      40      10185150       7292313      9219276  -28.5%/+26.4%
>>      60       8196733       6460517      7181209  -21.2%/+11.2%
>>      80       9508864       6739559      8107025  -29.1%/+20.3%
> 'rspin' is patch 12 in this series, right?
Yes, I should have spelled out the patch name.
-Longman
On 6/4/19 5:20 AM, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:14PM -0400, Waiman Long wrote:
>> @@ -286,6 +325,7 @@ struct rwsem_waiter {
>> struct task_struct *task;
>> enum rwsem_waiter_type type;
>> unsigned long timeout;
>> + long last_rowner;
>> };
>> #define rwsem_first_waiter(sem) \
>> list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
>> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>> + long last_rowner)
>> +static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
>> + unsigned long last_rowner)
>> + waiter.last_rowner = atomic_long_read(&sem->owner);
> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> as that is what makes most sense, given there's a pointer inside.
Thanks for spotting that, I will fix it.
-Longman
On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
> > That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> > as that is what makes most sense, given there's a pointer inside.
>
> Thank for spotting that, I will fix it.
I fixed a whole bunch of them; please find the modified patches here:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core
On 6/4/19 1:38 PM, Peter Zijlstra wrote:
> On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
>>> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
>>> as that is what makes most sense, given there's a pointer inside.
>> Thank for spotting that, I will fix it.
> I fixed a whole bunch of them; please find the modified patches here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core
Thanks for reviewing the patches.
So what do you think about the overall state of this patchset? Do you
think it is mature enough to go into 5.3?
Or if you want more time to think about solving the RT thread issue, we
can merge just patches 1-16 and play with the last three for some more
time. I am fine with that too as improving RT tasks is not my main
focus. I like patch 16 as it led me to discover the rwsem reader wakeup
bug as I hit the negative dentry count WARN_ON message in my testing.
I worked on this owner merging patch mainly to alleviate the need to use
cmpxchg for reader lock. cmpxchg_double() is certainly one possible
solution though it won't work on older CPUs. We can have a config option
to use cmpxchg_double as it may increase the size of other structures
that embed an rwsem and impose an additional alignment constraint.
Cheers,
Longman
On Tue, Jun 04, 2019 at 02:04:34PM -0400, Waiman Long wrote:
> On 6/4/19 1:38 PM, Peter Zijlstra wrote:
> > On Tue, Jun 04, 2019 at 01:30:00PM -0400, Waiman Long wrote:
> >>> That's somewhat inconsistent wrt the type. I'll make it unsigned long,
> >>> as that is what makes most sense, given there's a pointer inside.
> >> Thank for spotting that, I will fix it.
> > I fixed a whole bunch of them; please find the modified patches here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=locking/core
>
> Thanks for reviewing the patches.
>
> So how do you think about the overall state of this patchset? Do you
> think it is mature enough to go into 5.3?
So far so good :-)
> Or if you want more time to think about solving the RT thread issue, we
> can merge just patches 1-16 and play with the last threes for some more
> time. I am fine with that too as improving RT tasks is not my main
> focus. I like patch 16 as it led me to discover the rwsem reader wakeup
> bug as I hit the negative dentry count WARN_ON message in my testing.
My brain gave out around patch 14.. I'll try again tomorrow. But I'm
thinking we should be able to do as you suggest and get -16 merged.
> I worked on this owner merging patch mainly to alleviate the need to use
> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
> solution though it won't work on older CPUs. We can have a config option
> to use cmpxchg_double as it may increase the size of other structures
> that embedded rwsem and impose additional alignment constraint.
cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.
On 6/4/19 2:14 PM, Peter Zijlstra wrote:
>> I worked on this owner merging patch mainly to alleviate the need to use
>> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
>> solution though it won't work on older CPUs. We can have a config option
>> to use cmpxchg_double as it may increase the size of other structures
>> that embedded rwsem and impose additional alignment constraint.
> cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
> atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
> thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.
Thanks for the clarification. I actually didn't check when cmpxchg8b was
introduced. I know it is a bit slower than regular cmpxchg. So we may
still need to do some performance analysis to see how it compares with
my current approach.
Cheers,
Longman
Hi Waiman,
On Wed, 5 Jun 2019 at 00:00, Waiman Long <[email protected]> wrote:
> With my patchset applied, the reader-writer ordering is still supposed
> to be preserved. Of course, there can be exceptions depending on the
> exact timing, but we can't rely on that to prevent deadlock.
This is exactly what I want to know. Thanks for the reply.
Thanks,
Yuyang
On 6/4/19 2:21 PM, Waiman Long wrote:
> On 6/4/19 2:14 PM, Peter Zijlstra wrote:
>>> I worked on this owner merging patch mainly to alleviate the need to use
>>> cmpxchg for reader lock. cmpxchg_double() is certainly one possible
>>> solution though it won't work on older CPUs. We can have a config option
>>> to use cmpxchg_double as it may increase the size of other structures
>>> that embedded rwsem and impose additional alignment constraint.
>> cmpxchg8b was introduced with the Pentium (for PAE IIRC, it enabled
>> atomic 64bit PTEs, but Linux never used it for that) and every Intel/AMD
>> thereafter has had it. AFAIK there's no x86_64 chip without cmpxchg16b.
> Thank for the clarification. I actually didn't check when cmpxch8b was
> introduced. I know it is a bit slower than regular cmpxchg. So we may
> still need to do some performance analysis to see how it compares with
> my current approach.
Using cmpxchg_double is actually more risky than I thought. I have been
trying to use cmpxchg_double for down_write, but I kept getting
kernel panics because the rwsem wasn't 16b-aligned. As rwsem is embedded
in quite a large number of structures, they all have to align properly
to make that work or the kernel will panic. That does seem too risky to
me. So I am dropping the idea of trying to use it.
Cheers,
Longman
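The cost being weighed here can be made concrete with a small userspace sketch. It is not the rwsem layout: `struct count_owner`, `struct embeds_pair` and `struct embeds_plain` are hypothetical types, and C11 `alignas` stands in for the kernel's alignment attributes. It only shows that requesting 2*sizeof(long) alignment for the count/owner pair propagates to every structure that embeds it and can add padding.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical pair that a double-word cmpxchg would operate on. */
struct count_owner {
	alignas(2 * sizeof(long)) long count;
	long owner;
};

/* A small structure embedding the pair inherits its alignment. */
struct embeds_pair {
	int id;
	struct count_owner co;
};

/* The same fields without the alignment request, for comparison. */
struct embeds_plain {
	int id;
	long count;
	long owner;
};

/* Extra bytes the embedding structure pays for the stricter alignment. */
static size_t alignment_padding_cost(void)
{
	return sizeof(struct embeds_pair) - sizeof(struct embeds_plain);
}
```

The compiler honours the alignment for static and stack objects; heap objects still depend on what the allocator guarantees, which is where the panics described above would come from.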
On Wed, Jun 05, 2019 at 02:13:27PM -0400, Waiman Long wrote:
> Using cmpxchg_double is actually more risky than I thought. I have been
> trying to try to use cmpxchg_double for down_write, but I kept getting
> kernel panics because the rwsem wasn't 16b-aligned. As rwsem is embedded
> in quite a large number of structures, they all have to align properly
> to make that work or the kernel will panic. That does seem too risky to
> me. So I am dropping the idea of trying to use it.
Urgh, that's another thing that's been on the TODO list for a long long
time, write code to verify the alignment of allocations :/ I'm
suspecting quite a lot of that goes wrong all over the place.
On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <[email protected]> wrote:
>
> Urgh, that's another things that's been on the TODO list for a long long
> time, write code to verify the alignment of allocations :/ I'm
> suspecting quite a lot of that goes wrong all over the place.
On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.
That ends up actually being a useful thing for small allocations,
which do happen.
On the whole, I would suggest against cmpxchg2 unless it's something
_really_ special. And would definitely strongly suggest against it for
something like a rwsem. Requiring 16-byte alignment just because your
data structure has a lock is nasty. Of course, we could probably
fairly easily change our kmalloc alignment rules to be "still just 8
bytes for small allocations, 16 bytes for anything that is >=64 bytes"
or whatever.
At least nobody is hopefully crazy enough to put one of those things
on the stack, where we *definitely* don't want to increase alignment
issues.
And before people say "surely small allocations aren't normal" - take
a look at slaballoc. Small allocations (<= 32 bytes) are actually not
all that uncommon, and you want them dense in the cache and dense in
memory to not waste either. arm64 has some insane alignment issues
(128 byte alignment due to DMA coherency issues, iirc), and it hurts
them badly.
Right now my machine has 400k 8-byte allocations, if I read things right.
You also find some core slab caches that are small and that don't need
16-byte alignment. A quick script finds things like
ext4_extent_status, which is 40 bytes, not horribly uncommon (I've
apparently got 250k of those things on my system), and currently fits
102 entries per page *because* it's not excessively aligned. Or
Acpi-Parse, which I apparently have 350k of, and is 56 bytes, and fits
73 per page exactly because it only needs 8-byte alignment (but
admittedly a 16-byte alignment would waste some memory, but guarantee
it doesn't cross a cacheline, so _maybe_ it would be ok).
16-byte alignment really isn't a good idea when you have data sizes
that are clearly smaller than even a cacheline.
So I *really* don't want to force excessive alignment. We'd have to
add some special static tooling to say "this kmalloc is assigned to a
pointer which requires 16-byte alignment" and make it use a separate
slab cache with that explicit alignment for that.
Linus
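The density figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes a 4096-byte page used as one slab with no per-slab metadata, which is a simplification; `round_up_pow2()` and `objs_per_slab()` are illustrative helpers, not kernel functions.

```c
#include <stddef.h>

/* Round size up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/* Objects of obj_size, padded to align, that fit in slab_bytes. */
static size_t objs_per_slab(size_t slab_bytes, size_t obj_size, size_t align)
{
	return slab_bytes / round_up_pow2(obj_size, align);
}
```

Under those assumptions a 40-byte object fits 102 times per page at 8-byte alignment but only 85 times at 16-byte alignment, and a 56-byte object drops from 73 to 64 per page.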
On Wed, Jun 05, 2019 at 01:52:15PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <[email protected]> wrote:
> >
> > Urgh, that's another things that's been on the TODO list for a long long
> > time, write code to verify the alignment of allocations :/ I'm
> > suspecting quite a lot of that goes wrong all over the place.
>
> On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.
Oh sure, and I'm not proposing to change that. I was more thinking of
having a GCC plugin that verifies, for every ptr assignment:
ptr = foo;
that the actual alignment matches:
assert(!((uintptr_t)ptr % __alignof(*ptr)));
That would catch bugs like:
struct bar {
	int ponies;
	int peaches __smp_cacheline_aligned;
};
struct bar *barp = kmalloc(sizeof(*barp), GFP_KERNEL);
Blatantly violating alignment can't be right; either the alignment
constraints put on the data structures are not important and they should
be fixed, or we should respect them and fix the allocation, either way,
we should not silently violate things like we do today.
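As a rough userspace illustration of the check being proposed (not the GCC plugin itself), the test can be written as a helper plus a macro. Everything here is an assumption made for the sketch: the 64-byte cacheline size, the `is_aligned()` helper and the `PTR_ALIGNED_AS()` macro are hypothetical, and C11 `alignas`/`alignof` stand in for the kernel attributes.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative type with an over-aligned member; 64 is an assumed cacheline size. */
struct bar {
	int ponies;
	alignas(64) int peaches;
};

/* Nonzero if p satisfies the given power-of-two alignment. */
static int is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* The check a tool might emit where a void pointer is cast to 'type *'. */
#define PTR_ALIGNED_AS(p, type) is_aligned((p), alignof(type))
```

An allocator that only promises 8-byte alignment can hand back a pointer for which `PTR_ALIGNED_AS(ptr, struct bar)` is false, which is the kind of silent violation described above.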
On Thu, Jun 06, 2019 at 10:03:15AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 05, 2019 at 01:52:15PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 1:19 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > Urgh, that's another things that's been on the TODO list for a long long
> > > time, write code to verify the alignment of allocations :/ I'm
> > > suspecting quite a lot of that goes wrong all over the place.
> >
> > On x86, we only guarantee 8-byte alignment from things like kmalloc(), iirc.
>
> Oh sure, and I'm not proposing to change that. I was more thinking of
> having a GCC plugin that verifies, for every ptr assignment:
>
> ptr = foo;
To better qualify: 'for every ptr assignment that includes a type cast',
and since allocators return 'void *' and (typically/eventually) assign
to a typed pointer, that would be the place to check.
This avoids having to instrument every single pointer assignment.
> that the actual alignment matches:
>
> assert(!((uintptr_t)ptr % __alignof(*ptr)));
>
> That would catch bugs like:
>
> struct bar {
> int ponies;
> int peaches __smp_cacheline_aligned;
> };
>
> struct bar *barp = kmalloc(sizeof(*barp), GFP_KERNEL);
>
> Blatantly violating alignment can't be right; either the alignment
> constraints put on the data structures are not important and they should
> be fixed, or we should respect them and fix the allocation, either way,
> we should not silently violate things like we do today.
>
>
On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:
> +static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> +{
> + long adjustment = -RWSEM_READER_BIAS;
> +
> + *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
I'm thinking we'd actually want add_return_acquire() here.
> + if (unlikely(*cnt < 0)) {
> + atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> + adjustment = 0;
> + }
> + return adjustment;
> +}
> @@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
> */
> inline void __down_read(struct rw_semaphore *sem)
> {
> + long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> +
> + if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> + rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
> DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> } else {
> rwsem_set_reader_owned(sem);
> @@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
>
> static inline int __down_read_killable(struct rw_semaphore *sem)
> {
> + long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> +
> + if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> + if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
> + adjustment)))
> return -EINTR;
> DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> } else {
I'm confused by the need for @tmp; isn't that returning the exact same
state !adjustment is?
Also; half the patch seems to do cnt<0, while the other half (above)
does &READ_FAILED, what gives?
On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:
> static struct rw_semaphore __sched *
> +rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long adjustment)
> {
> + long count;
> bool wake = false;
> struct rwsem_waiter waiter;
> DEFINE_WAKE_Q(wake_q);
>
> + if (unlikely(!adjustment)) {
> + /*
> + * This shouldn't happen. If it does, there is probably
> + * something wrong in the system.
> + */
> + WARN_ON_ONCE(1);
if (WARN_ON_ONCE(!adjustment)) {
> +
> + /*
> + * An adjustment of 0 means that there are too many readers
> + * holding or trying to acquire the lock. So disable
> + * optimistic spinning and go directly into the wait list.
> + */
> + if (rwsem_test_oflags(sem, RWSEM_RD_NONSPINNABLE))
> + rwsem_set_nonspinnable(sem);
ISTR rwsem_set_nonspinnable() already does that test, so no need to do
it again, right?
> + goto queue;
> + }
> +
> /*
> * Save the current read-owner of rwsem, if available, and the
> * reader nonspinnable bit.
On Tue, Jun 11, 2019 at 03:11:31PM +0200, Peter Zijlstra wrote:
> On Mon, May 20, 2019 at 04:59:15PM -0400, Waiman Long wrote:
>
> > +static inline long rwsem_read_trylock(struct rw_semaphore *sem, long *cnt)
> > +{
> > + long adjustment = -RWSEM_READER_BIAS;
> > +
> > + *cnt = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
>
> I'm thinking we'd actually want add_return_acquire() here.
>
> > + if (unlikely(*cnt < 0)) {
> > + atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
> > + adjustment = 0;
> > + }
> > + return adjustment;
> > +}
>
> > @@ -1271,9 +1332,10 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
> > */
> > inline void __down_read(struct rw_semaphore *sem)
> > {
> > + long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> > +
> > + if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> > + rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE, adjustment);
> > DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> > } else {
> > rwsem_set_reader_owned(sem);
> > @@ -1282,9 +1344,11 @@ inline void __down_read(struct rw_semaphore *sem)
> >
> > static inline int __down_read_killable(struct rw_semaphore *sem)
> > {
> > + long tmp, adjustment = rwsem_read_trylock(sem, &tmp);
> > +
> > + if (unlikely(tmp & RWSEM_READ_FAILED_MASK)) {
> > + if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE,
> > + adjustment)))
> > return -EINTR;
> > DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> > } else {
>
> I'm confused by the need for @tmp; isn't that returning the exact same
> state !adjustment is?
Argh.. READ_FAILED_MASK isn't just the MSB. Bah, this is confusing.
Maybe something like so?
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
#endif
/*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
*
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bit 2 - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit 63 - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit 31 - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purpose
+ * in the future.
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
#define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_READFAIL (1UL << (BITS_PER_LONG - 1))
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
- RWSEM_FLAG_HANDOFF)
+ RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -254,6 +270,14 @@ static inline void rwsem_set_nonspinnabl
owner | RWSEM_NONSPINNABLE));
}
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
+{
+ long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+ WARN_ON_ONCE(cnt < 0);
+ return !(cnt & RWSEM_READ_FAILED_MASK);
+
+}
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -403,6 +427,12 @@ static void rwsem_mark_wake(struct rw_se
}
/*
+ * No reader wakeup if there are too many of them already.
+ */
+ if (unlikely(atomic_long_read(&sem->count) < 0))
+ return;
+
+ /*
* Writers might steal the lock before we grant it to the next reader.
* We prefer to do the first reader grant before counting readers
* so we can bail out early if a writer stole the lock.
@@ -949,9 +979,9 @@ static struct rw_semaphore __sched *
rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
- bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ bool wake = false;
/*
* Save the current read-owner of rwsem, if available, and the
@@ -1270,8 +1300,7 @@ static struct rw_semaphore *rwsem_downgr
*/
inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ if (!rwsem_read_trylock(sem)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
@@ -1281,9 +1310,8 @@ inline void __down_read(struct rw_semaph
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
+ if (!rwsem_read_trylock(sem)) {
+ if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
@@ -1359,6 +1387,7 @@ inline void __up_read(struct rw_semaphor
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_wr_nonspinnable(sem);
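The reader fast path in the draft above can be modelled in userspace with C11 atomics. This is a sketch under assumptions, not the kernel code: the macro names drop the RWSEM_ prefix, the count is kept unsigned to avoid signed-overflow concerns in portable C, and `read_trylock()` and `reader_count()` are hypothetical helpers. As in the draft, a failed attempt leaves the reader bias in the count for the slow path to deal with.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_LONG     (sizeof(long) * CHAR_BIT)
#define WRITER_LOCKED     (1UL << 0)
#define FLAG_WAITERS      (1UL << 1)
#define FLAG_HANDOFF      (1UL << 2)
#define FLAG_READFAIL     (1UL << (BITS_PER_LONG - 1))
#define READER_SHIFT      8
#define READER_BIAS       (1UL << READER_SHIFT)
#define READ_FAILED_MASK  (WRITER_LOCKED | FLAG_WAITERS | \
			   FLAG_HANDOFF | FLAG_READFAIL)

/*
 * Add one reader and report whether the fast path succeeded. The new
 * value is computed from the pre-add value so it matches what an
 * add-and-return primitive would yield.
 */
static bool read_trylock(atomic_ulong *count)
{
	unsigned long cnt = atomic_fetch_add_explicit(count, READER_BIAS,
						      memory_order_acquire);

	cnt += READER_BIAS;
	return !(cnt & READ_FAILED_MASK);
}

/* Number of readers currently accounted for in the count. */
static unsigned long reader_count(atomic_ulong *count)
{
	return (atomic_load(count) & ~FLAG_READFAIL) >> READER_SHIFT;
}
```

Testing the whole mask on the post-add value means a single branch covers the writer, waiters, handoff and guard bits.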
Commit-ID: 5c1ec49b60cdb31e51010f8a647f3189b774bddf
Gitweb: https://git.kernel.org/tip/5c1ec49b60cdb31e51010f8a647f3189b774bddf
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:01 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:55 +0200
locking/rwsem: Remove rwsem_wake() wakeup optimization
After the following commit:
59aabfc7e959 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")
the rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
acquired and an optimistic spinning locker is present. This can help performance
by avoiding spinning on the wait_lock when it is contended.
With the later commit:
133e89ef5ef3 ("locking/rwsem: Enable lockless waiter wakeup(s)")
the performance advantage of the above optimization diminishes as the average
wait_lock hold time becomes much shorter.
With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeup and application hang.
So the original 59aabfc7e959 commit has to be reverted.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem-xadd.c | 72 ---------------------------------------------
1 file changed, 72 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index c0500679fd2f..3083fdf50447 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -411,25 +411,11 @@ done:
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return osq_is_locked(&sem->osq);
-}
-
#else
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
return false;
}
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
- return false;
-}
#endif
/*
@@ -651,65 +637,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
- /*
- * __rwsem_down_write_failed_common(sem)
- * rwsem_optimistic_spin(sem)
- * osq_unlock(sem->osq)
- * ...
- * atomic_long_add_return(&sem->count)
- *
- * - VS -
- *
- * __up_write()
- * if (atomic_long_sub_return_release(&sem->count) < 0)
- * rwsem_wake(sem)
- * osq_is_locked(&sem->osq)
- *
- * And __up_write() must observe !osq_is_locked() when it observes the
- * atomic_long_add_return() in order to not miss a wakeup.
- *
- * This boils down to:
- *
- * [S.rel] X = 1 [RmW] r0 = (Y += 0)
- * MB RMB
- * [RmW] Y += 1 [L] r1 = X
- *
- * exists (r0=1 /\ r1=0)
- */
- smp_rmb();
-
- /*
- * If a spinner is present, it is not necessary to do the wakeup.
- * Try to do wakeup only if the trylock succeeds to minimize
- * spinlock contention which may introduce too much delay in the
- * unlock operation.
- *
- * spinning writer up_write/up_read caller
- * --------------- -----------------------
- * [S] osq_unlock() [L] osq
- * MB RMB
- * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
- *
- * Here, it is important to make sure that there won't be a missed
- * wakeup while the rwsem is free and the only spinning writer goes
- * to sleep without taking the rwsem. Even when the spinning writer
- * is just going to break out of the waiting loop, it will still do
- * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
- * rwsem_has_spinner() is true, it will guarantee at least one
- * trylock attempt on the rwsem later on.
- */
- if (rwsem_has_spinner(sem)) {
- /*
- * The smp_rmb() here is to make sure that the spinner
- * state is consulted before reading the wait_lock.
- */
- smp_rmb();
- if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
- return sem;
- goto locked;
- }
raw_spin_lock_irqsave(&sem->wait_lock, flags);
-locked:
if (!list_empty(&sem->wait_list))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
Commit-ID: 64489e78004cb5623211c75790cac90bd25ff5e9
Gitweb: https://git.kernel.org/tip/64489e78004cb5623211c75790cac90bd25ff5e9
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:02 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:56 +0200
locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.
To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used. The atomic long count has the following
bit definitions:
Bit 0 - writer locked bit
Bit 1 - waiters present bit
Bits 2-7 - reserved for future extension
Bits 8-X - reader count (24/56 bits)
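The bit layout above can be sketched in userspace C as follows. This is
only an illustration: the macro names are taken from the patch, but the
sketch_*() helpers are hypothetical and a plain unsigned long stands in
for the kernel's atomic_long_t count.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Macro names follow the patch; the values match its rwsem.h hunk. */
#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))

/* The flag bits must never overlap the reader count field. */
static_assert(((RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS) &
	       RWSEM_READER_MASK) == 0, "flags overlap reader count");

/* Hypothetical helpers that decode a snapshot of the count. */
static inline bool sketch_write_locked(unsigned long count)
{
	return count & RWSEM_WRITER_LOCKED;
}

static inline bool sketch_has_waiters(unsigned long count)
{
	return count & RWSEM_FLAG_WAITERS;
}

static inline unsigned long sketch_reader_count(unsigned long count)
{
	return count >> RWSEM_READER_SHIFT;
}

/* Width of the reader count: 24 bits for a 32-bit long, 56 for 64-bit. */
static inline unsigned int sketch_reader_bits(void)
{
	return sizeof(unsigned long) * CHAR_BIT - RWSEM_READER_SHIFT;
}
```

For example, a count of 3 * RWSEM_READER_BIAS + RWSEM_FLAG_WAITERS decodes
to three readers with waiters queued and no writer.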
The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with the xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.
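A minimal sketch of the two acquisition styles, written with C11 atomics
rather than the kernel's atomic_long_*() API. The sketch_*() functions are
hypothetical. In particular, a failed reader here simply backs out, whereas
the kernel sends it to the rwsem_down_read_failed() slowpath to queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_FLAG_WAITERS	(1UL << 1)
#define RWSEM_READER_BIAS	(1UL << 8)
#define RWSEM_READ_FAILED_MASK	(RWSEM_WRITER_LOCKED | RWSEM_FLAG_WAITERS)

/*
 * Reader: unconditionally add one reader with fetch-and-add (typically
 * xadd on x86), then check the old value for a writer or waiters.
 * Assumption: on failure we just undo the increment instead of queueing.
 */
static inline bool sketch_read_lock(atomic_ulong *count)
{
	unsigned long old = atomic_fetch_add_explicit(count,
			RWSEM_READER_BIAS, memory_order_acquire);

	if (old & RWSEM_READ_FAILED_MASK) {
		atomic_fetch_sub_explicit(count, RWSEM_READER_BIAS,
					  memory_order_relaxed);
		return false;
	}
	return true;
}

static inline void sketch_read_unlock(atomic_ulong *count)
{
	atomic_fetch_sub_explicit(count, RWSEM_READER_BIAS,
				  memory_order_release);
}

/* Writer: cmpxchg from the fully unlocked value (0) to the writer bit. */
static inline bool sketch_write_trylock(atomic_ulong *count)
{
	unsigned long expected = 0;

	return atomic_compare_exchange_strong_explicit(count, &expected,
			RWSEM_WRITER_LOCKED,
			memory_order_acquire, memory_order_relaxed);
}

static inline void sketch_write_unlock(atomic_ulong *count)
{
	atomic_fetch_sub_explicit(count, RWSEM_WRITER_LOCKED,
				  memory_order_release);
}
```

The asymmetry is the point of the scheme: readers never loop on the fast
path, while a writer only succeeds when it observes the exact unlocked
value, so the writer bit can be trusted whenever it is set.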
With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
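The backing-out window described above can be replayed on a single thread.
The following is a sketch under stated assumptions: struct sketch_rwsem,
SKETCH_READER_OWNED and the sketch_*() functions are hypothetical stand-ins
for the kernel's rw_semaphore, its reader-owned bit in the owner field and
its helpers, and the interleaving is forced by hand instead of by real
concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RWSEM_WRITER_LOCKED	(1UL << 0)
#define RWSEM_READER_BIAS	(1UL << 8)
#define RWSEM_READER_MASK	(~(RWSEM_READER_BIAS - 1))
#define SKETCH_READER_OWNED	(1UL << 0)	/* hypothetical owner flag */

struct sketch_rwsem {
	atomic_ulong count;
	atomic_ulong owner;	/* task pointer plus flag bits in the kernel */
};

/* What the count alone suggests: no writer bit, some reader bits set. */
static inline bool sketch_count_looks_read_locked(struct sketch_rwsem *sem)
{
	unsigned long c = atomic_load(&sem->count);

	return !(c & RWSEM_WRITER_LOCKED) && (c & RWSEM_READER_MASK);
}

/* The reliable test also requires the reader-owned bit in owner. */
static inline bool sketch_is_reader_owned(struct sketch_rwsem *sem)
{
	return sketch_count_looks_read_locked(sem) &&
	       (atomic_load(&sem->owner) & SKETCH_READER_OWNED);
}

/*
 * Replay the window: returns true if, between the writer's release and
 * the reader's back-out, the count alone misreported a read-locked rwsem.
 */
static inline bool sketch_replay_backout_window(void)
{
	struct sketch_rwsem sem;
	unsigned long old;
	bool misleading;

	atomic_init(&sem.count, RWSEM_WRITER_LOCKED);	/* writer holds it */
	atomic_init(&sem.owner, 0UL);

	/* Reader fast path adds its bias and sees the writer bit. */
	old = atomic_fetch_add(&sem.count, RWSEM_READER_BIAS);

	/* The writer releases before the reader has backed out. */
	atomic_fetch_sub(&sem.count, RWSEM_WRITER_LOCKED);

	misleading = (old & RWSEM_WRITER_LOCKED) &&
		     sketch_count_looks_read_locked(&sem) &&
		     !sketch_is_reader_owned(&sem);

	/* The reader finally removes its bias; the rwsem is free again. */
	atomic_fetch_sub(&sem.count, RWSEM_READER_BIAS);

	return misleading && atomic_load(&sem.count) == 0;
}
```

During the window the reader bits are non-zero and the writer bit is clear,
yet no reader owns the lock, which is why the owner field has to be
consulted as well.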
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) of the benchmark on an 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:
                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607
The locking rates of the benchmark on a Power8 system were as follows:
                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168
The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:
                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680
The performance is roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem-xadd.c | 147 +++++++++++++++-----------------------------
kernel/locking/rwsem.h | 74 +++++++++++-----------
2 files changed, 85 insertions(+), 136 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3083fdf50447..7d537b50a849 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
*
* Optimistic spinning by Tim Chen <[email protected]>
* and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
*/
#include <linux/rwsem.h>
#include <linux/init.h>
@@ -22,52 +24,20 @@
#include "rwsem.h"
/*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x0000000X (1) X readers active or attempting lock, no writer waiting
- * X = #active_readers + #readers attempting to lock
- * (X*ACTIVE_BIAS)
- *
- * 0x00000000 rwsem is unlocked, and no one is waiting for the lock or
- * attempting to read lock or write lock.
- *
- * 0xffff000X (1) X readers active or attempting lock, with waiters for lock
- * X = #active readers + # readers attempting lock
- * (X*ACTIVE_BIAS + WAITING_BIAS)
- * (2) 1 writer attempting lock, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- * (3) 1 writer active, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0xffff0001 (1) 1 reader active or attempting lock, waiters for lock
- * (WAITING_BIAS + ACTIVE_BIAS)
- * (2) 1 writer active or attempting lock, no waiters for lock
- * (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
*
- * 0xffff0000 (1) There are writers or readers queued but none active
- * or in the process of attempting lock.
- * (WAITING_BIAS)
- * Note: writer can attempt to steal lock for this count by adding
- * ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
*
- * 0xfffe0001 (1) 1 writer active, or attempting lock. Waiters on queue.
- * (ACTIVE_WRITE_BIAS + WAITING_BIAS)
- *
- * Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
- * the count becomes more than 0 for successful lock acquisition,
- * i.e. the case where there are only readers or nobody has lock.
- * (1st and 2nd case above).
- *
- * Writers attempt to lock by adding ACTIVE_WRITE_BIAS in down_write and
- * checking the count becomes ACTIVE_WRITE_BIAS for successful lock
- * acquisition (i.e. nobody else has lock or attempts lock). If
- * unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- * are only waiters but none active (5th case above), and attempt to
- * steal the lock.
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READ_OWNED bit set.
*
+ * Having some reader bits set is not enough to guarantee a readers owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
*/
/*
@@ -113,9 +83,8 @@ enum rwsem_wake_type {
/*
* handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then:
- * - the 'active part' of count (&0x0000ffff) reached 0 (but may have changed)
- * - the 'waiting part' of count (&0xffff0000) is -ve (and will still be so)
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ * have been set.
* - there must be someone on the queue
* - the wait_lock must be held by the caller
* - tasks are marked for wakeup, the caller must later invoke wake_up_q()
@@ -160,22 +129,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
* so we can bail out early if a writer stole the lock.
*/
if (wake_type != RWSEM_WAKE_READ_OWNED) {
- adjustment = RWSEM_ACTIVE_READ_BIAS;
- try_reader_grant:
+ adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
- if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
- /*
- * If the count is still less than RWSEM_WAITING_BIAS
- * after removing the adjustment, it is assumed that
- * a writer has stolen the lock. We have to undo our
- * reader grant.
- */
- if (atomic_long_add_return(-adjustment, &sem->count) <
- RWSEM_WAITING_BIAS)
- return;
-
- /* Last active locker left. Retry waking readers. */
- goto try_reader_grant;
+ if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ atomic_long_sub(adjustment, &sem->count);
+ return;
}
/*
* Set it to reader-owned to give spinners an early
@@ -209,11 +167,11 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}
list_cut_before(&wlist, &sem->wait_list, &waiter->list);
- adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
+ adjustment = woken * RWSEM_READER_BIAS - adjustment;
lockevent_cond_inc(rwsem_wake_reader, woken);
if (list_empty(&sem->wait_list)) {
/* hit end of list above */
- adjustment -= RWSEM_WAITING_BIAS;
+ adjustment -= RWSEM_FLAG_WAITERS;
}
if (adjustment)
@@ -248,22 +206,15 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
*/
static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
{
- /*
- * Avoid trying to acquire write lock if count isn't RWSEM_WAITING_BIAS.
- */
- if (count != RWSEM_WAITING_BIAS)
+ long new;
+
+ if (count & RWSEM_LOCK_MASK)
return false;
- /*
- * Acquire the lock by trying to set it to ACTIVE_WRITE_BIAS. If there
- * are other tasks on the wait list, we need to add on WAITING_BIAS.
- */
- count = list_is_singular(&sem->wait_list) ?
- RWSEM_ACTIVE_WRITE_BIAS :
- RWSEM_ACTIVE_WRITE_BIAS + RWSEM_WAITING_BIAS;
+ new = count + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
- if (atomic_long_cmpxchg_acquire(&sem->count, RWSEM_WAITING_BIAS, count)
- == RWSEM_WAITING_BIAS) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
rwsem_set_owner(sem);
return true;
}
@@ -279,9 +230,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
{
long count = atomic_long_read(&sem->count);
- while (!count || count == RWSEM_WAITING_BIAS) {
+ while (!(count & RWSEM_LOCK_MASK)) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_ACTIVE_WRITE_BIAS)) {
+ count + RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -424,7 +375,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
static inline struct rw_semaphore __sched *
__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
{
- long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+ long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
@@ -436,16 +387,16 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
/*
* In case the wait queue is empty and the lock isn't owned
* by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_ACTIVE_READ_BIAS has already
- * been set in the count.
+ * immediately as its RWSEM_READER_BIAS has already been
+ * set in the count.
*/
- if (atomic_long_read(&sem->count) >= 0) {
+ if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
return sem;
}
- adjustment += RWSEM_WAITING_BIAS;
+ adjustment += RWSEM_FLAG_WAITERS;
}
list_add_tail(&waiter.list, &sem->wait_list);
@@ -458,9 +409,8 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
* If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers !
*/
- if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS &&
- adjustment != -RWSEM_ACTIVE_READ_BIAS))
+ if (!(count & RWSEM_LOCK_MASK) ||
+ (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
@@ -488,7 +438,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
out_nolock:
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -521,9 +471,6 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
- /* undo write bias from down_write operation, stop active locking */
- count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
-
/* do optimistic spinning and steal lock if possible */
if (rwsem_optimistic_spin(sem))
return sem;
@@ -543,16 +490,18 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
list_add_tail(&waiter.list, &sem->wait_list);
- /* we're now waiting on the lock, but no longer actively locking */
+ /* we're now waiting on the lock */
if (waiting) {
count = atomic_long_read(&sem->count);
/*
* If there were already threads queued before us and there are
- * no active writers, the lock must be read owned; so we try to
- * wake any read locks that were queued ahead of us.
+ * no active writers and some readers, the lock must be read
+ * owned; so we try to any read locks that were queued ahead
+ * of us.
*/
- if (count > RWSEM_WAITING_BIAS) {
+ if (!(count & RWSEM_WRITER_MASK) &&
+ (count & RWSEM_READER_MASK)) {
__rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/*
* The wakeup is normally called _after_ the wait_lock
@@ -569,8 +518,9 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
wake_q_init(&wake_q);
}
- } else
- count = atomic_long_add_return(RWSEM_WAITING_BIAS, &sem->count);
+ } else {
+ count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ }
/* wait until we successfully acquire the lock */
set_current_state(state);
@@ -587,7 +537,8 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
- } while ((count = atomic_long_read(&sem->count)) & RWSEM_ACTIVE_MASK);
+ count = atomic_long_read(&sem->count);
+ } while (count & RWSEM_LOCK_MASK);
raw_spin_lock_irq(&sem->wait_lock);
}
@@ -603,7 +554,7 @@ out_nolock:
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
if (list_empty(&sem->wait_list))
- atomic_long_add(-RWSEM_WAITING_BIAS, &sem->count);
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index eb9c8534299b..499a9b2bda82 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -42,24 +42,24 @@
#endif
/*
- * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-/*
- * the semaphore definition
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
*/
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_WRITER_LOCKED (1UL << 0)
+#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_READER_SHIFT 8
+#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -151,7 +151,8 @@ extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
*/
static inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_failed(sem);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
@@ -162,7 +163,8 @@ static inline void __down_read(struct rw_semaphore *sem)
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_failed_killable(sem)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
@@ -183,11 +185,11 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
lockevent_inc(rwsem_rtrylock);
do {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
return 1;
}
- } while (tmp >= 0);
+ } while (!(tmp & RWSEM_READ_FAILED_MASK));
return 0;
}
@@ -196,22 +198,16 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline void __down_write(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
rwsem_down_write_failed(sem);
rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
{
- long tmp;
-
- tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count);
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
if (IS_ERR(rwsem_down_write_failed_killable(sem)))
return -EINTR;
rwsem_set_owner(sem);
@@ -224,7 +220,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem)
lockevent_inc(rwsem_wtrylock);
tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
+ RWSEM_WRITER_LOCKED);
if (tmp == RWSEM_UNLOCKED_VALUE) {
rwsem_set_owner(sem);
return true;
@@ -242,8 +238,9 @@ static inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
sem);
rwsem_clear_reader_owned(sem);
- tmp = atomic_long_dec_return_release(&sem->count);
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+ == RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -254,8 +251,8 @@ static inline void __up_write(struct rw_semaphore *sem)
{
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
- &sem->count) < 0))
+ if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+ &sem->count) & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -274,8 +271,9 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* write side. As such, rely on RELEASE semantics.
*/
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
+ tmp = atomic_long_fetch_add_release(
+ -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
- if (tmp < 0)
+ if (tmp & RWSEM_FLAG_WAITERS)
rwsem_downgrade_wake(sem);
}
Commit-ID: 5dec94d4923683b1dd6a09dc62427a24d79ee7b4
Gitweb: https://git.kernel.org/tip/5dec94d4923683b1dd6a09dc62427a24d79ee7b4
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:03 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:57 +0200
locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writers instead. So
the filename rwsem-xadd.c is not strictly correct. Also, no one outside
of the rwsem code needs to know the internal implementation other than
function prototypes for two internal functions that are called directly
from percpu-rwsem.c.
So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:
<upper part of rwsem.h>
<rwsem-xadd.c>
<lower part of rwsem.h>
<rwsem.c>
The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().
This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/Makefile | 2 +-
kernel/locking/rwsem-xadd.c | 624 -------------------------------
kernel/locking/rwsem.c | 884 ++++++++++++++++++++++++++++++++++++++++++++
kernel/locking/rwsem.h | 281 +-------------
4 files changed, 891 insertions(+), 900 deletions(-)
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
# and is generally not a function of system call inputs.
KCOV_INSTRUMENT := n
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 7d537b50a849..000000000000
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,624 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells ([email protected]).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi <[email protected]>
- * and Michel Lespinasse <[email protected]>
- *
- * Optimistic spinning by Tim Chen <[email protected]>
- * and Davidlohr Bueso <[email protected]>. Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long <[email protected]>.
- */
-#include <linux/rwsem.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#include <linux/sched/signal.h>
-#include <linux/sched/rt.h>
-#include <linux/sched/wake_q.h>
-#include <linux/sched/debug.h>
-#include <linux/osq_lock.h>
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READ_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- /*
- * Make sure we are not reinitializing a held semaphore:
- */
- debug_check_no_locks_freed((void *)sem, sizeof(*sem));
- lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
- atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
- raw_spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
- sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
- osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
- RWSEM_WAITING_FOR_WRITE,
- RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
- struct list_head list;
- struct task_struct *task;
- enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
- RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
- RWSEM_WAKE_READERS, /* Wake readers only */
- RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- * have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- * to actually wakeup the blocked task(s) and drop the reference count,
- * preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
-{
- struct rwsem_waiter *waiter, *tmp;
- long oldcount, woken = 0, adjustment = 0;
- struct list_head wlist;
-
- /*
- * Take a peek at the queue head waiter such that we can determine
- * the wakeup(s) to perform.
- */
- waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
-
- if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
- if (wake_type == RWSEM_WAKE_ANY) {
- /*
- * Mark writer at the front of the queue for wakeup.
- * Until the task is actually later awoken later by
- * the caller, other writers are able to steal it.
- * Readers, on the other hand, will block as they
- * will notice the queued writer.
- */
- wake_q_add(wake_q, waiter->task);
- lockevent_inc(rwsem_wake_writer);
- }
-
- return;
- }
-
- /*
- * Writers might steal the lock before we grant it to the next reader.
- * We prefer to do the first reader grant before counting readers
- * so we can bail out early if a writer stole the lock.
- */
- if (wake_type != RWSEM_WAKE_READ_OWNED) {
- adjustment = RWSEM_READER_BIAS;
- oldcount = atomic_long_fetch_add(adjustment, &sem->count);
- if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
- atomic_long_sub(adjustment, &sem->count);
- return;
- }
- /*
- * Set it to reader-owned to give spinners an early
- * indication that readers now have the lock.
- */
- __rwsem_set_reader_owned(sem, waiter->task);
- }
-
- /*
- * Grant an infinite number of read locks to the readers at the front
- * of the queue. We know that woken will be at least 1 as we accounted
- * for above. Note we increment the 'active part' of the count by the
- * number of readers before waking any processes up.
- *
- * We have to do wakeup in 2 passes to prevent the possibility that
- * the reader count may be decremented before it is incremented. It
- * is because the to-be-woken waiter may not have slept yet. So it
- * may see waiter->task got cleared, finish its critical section and
- * do an unlock before the reader count increment.
- *
- * 1) Collect the read-waiters in a separate list, count them and
- * fully increment the reader count in rwsem.
- * 2) For each waiters in the new list, clear waiter->task and
- * put them into wake_q to be woken up later.
- */
- list_for_each_entry(waiter, &sem->wait_list, list) {
- if (waiter->type == RWSEM_WAITING_FOR_WRITE)
- break;
-
- woken++;
- }
- list_cut_before(&wlist, &sem->wait_list, &waiter->list);
-
- adjustment = woken * RWSEM_READER_BIAS - adjustment;
- lockevent_cond_inc(rwsem_wake_reader, woken);
- if (list_empty(&sem->wait_list)) {
- /* hit end of list above */
- adjustment -= RWSEM_FLAG_WAITERS;
- }
-
- if (adjustment)
- atomic_long_add(adjustment, &sem->count);
-
- /* 2nd pass */
- list_for_each_entry_safe(waiter, tmp, &wlist, list) {
- struct task_struct *tsk;
-
- tsk = waiter->task;
- get_task_struct(tsk);
-
- /*
- * Ensure calling get_task_struct() before setting the reader
- * waiter to nil such that rwsem_down_read_failed() cannot
- * race with do_exit() by always holding a reference count
- * to the task to wakeup.
- */
- smp_store_release(&waiter->task, NULL);
- /*
- * Ensure issuing the wakeup (either by us or someone else)
- * after setting the reader waiter to nil.
- */
- wake_q_add_safe(wake_q, tsk);
- }
-}
-
-/*
- * This function must be called with the sem->wait_lock held to prevent
- * race conditions between checking the rwsem wait list and setting the
- * sem->count accordingly.
- */
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
-{
- long new;
-
- if (count & RWSEM_LOCK_MASK)
- return false;
-
- new = count + RWSEM_WRITER_LOCKED -
- (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
-
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
- rwsem_set_owner(sem);
- return true;
- }
-
- return false;
-}
-
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-/*
- * Try to acquire write lock before the writer has been put on wait queue.
- */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
-{
- long count = atomic_long_read(&sem->count);
-
- while (!(count & RWSEM_LOCK_MASK)) {
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_WRITER_LOCKED)) {
- rwsem_set_owner(sem);
- lockevent_inc(rwsem_opt_wlock);
- return true;
- }
- }
- return false;
-}
-
-static inline bool owner_on_cpu(struct task_struct *owner)
-{
- /*
- * As lock holder preemption issue, we both skip spinning if
- * task is not on cpu or its cpu is preempted
- */
- return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
-}
-
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner;
- bool ret = true;
-
- BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
-
- if (need_resched())
- return false;
-
- rcu_read_lock();
- owner = READ_ONCE(sem->owner);
- if (owner) {
- ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner);
- }
- rcu_read_unlock();
- return ret;
-}
-
-/*
- * Return true only if we can still spin on the owner field of the rwsem.
- */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
-{
- struct task_struct *owner = READ_ONCE(sem->owner);
-
- if (!is_rwsem_owner_spinnable(owner))
- return false;
-
- rcu_read_lock();
- while (owner && (READ_ONCE(sem->owner) == owner)) {
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
- */
- barrier();
-
- /*
- * abort spinning when need_resched or owner is not running or
- * owner's cpu is preempted.
- */
- if (need_resched() || !owner_on_cpu(owner)) {
- rcu_read_unlock();
- return false;
- }
-
- cpu_relax();
- }
- rcu_read_unlock();
-
- /*
- * If there is a new owner or the owner is not set, we continue
- * spinning.
- */
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
-}
-
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
- bool taken = false;
-
- preempt_disable();
-
- /* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
- if (!osq_lock(&sem->osq))
- goto done;
-
- /*
- * Optimistically spin on the owner field and attempt to acquire the
- * lock whenever the owner changes. Spinning will be stopped when:
- * 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
- */
- while (rwsem_spin_on_owner(sem)) {
- /*
- * Try to acquire the lock
- */
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
- break;
- }
-
- /*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
- */
- if (!sem->owner && (need_resched() || rt_task(current)))
- break;
-
- /*
- * The cpu_relax() call is a compiler barrier which forces
- * everything in this loop to be re-loaded. We don't need
- * memory barriers as we'll eventually observe the right
- * values at the cost of a few extra spins.
- */
- cpu_relax();
- }
- osq_unlock(&sem->osq);
-done:
- preempt_enable();
- lockevent_cond_inc(rwsem_opt_fail, !taken);
- return taken;
-}
-#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
-{
- return false;
-}
-#endif
-
-/*
- * Wait for the read lock to be granted
- */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
-{
- long count, adjustment = -RWSEM_READER_BIAS;
- struct rwsem_waiter waiter;
- DEFINE_WAKE_Q(wake_q);
-
- waiter.task = current;
- waiter.type = RWSEM_WAITING_FOR_READ;
-
- raw_spin_lock_irq(&sem->wait_lock);
- if (list_empty(&sem->wait_list)) {
- /*
- * In case the wait queue is empty and the lock isn't owned
- * by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_READER_BIAS has already been
- * set in the count.
- */
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
- raw_spin_unlock_irq(&sem->wait_lock);
- rwsem_set_reader_owned(sem);
- lockevent_inc(rwsem_rlock_fast);
- return sem;
- }
- adjustment += RWSEM_FLAG_WAITERS;
- }
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
-
- /*
- * If there are no active locks, wake the front queued process(es).
- *
- * If there are no writers and we are first in the queue,
- * wake our own waiter to join the existing active readers !
- */
- if (!(count & RWSEM_LOCK_MASK) ||
- (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
-
- /* wait to be given the lock */
- while (true) {
- set_current_state(state);
- if (!waiter.task)
- break;
- if (signal_pending_state(state, current)) {
- raw_spin_lock_irq(&sem->wait_lock);
- if (waiter.task)
- goto out_nolock;
- raw_spin_unlock_irq(&sem->wait_lock);
- break;
- }
- schedule();
- lockevent_inc(rwsem_sleep_reader);
- }
-
- __set_current_state(TASK_RUNNING);
- lockevent_inc(rwsem_rlock);
- return sem;
-out_nolock:
- list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
- raw_spin_unlock_irq(&sem->wait_lock);
- __set_current_state(TASK_RUNNING);
- lockevent_inc(rwsem_rlock_fail);
- return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
-/*
- * Wait until we successfully acquire the write lock
- */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
-{
- long count;
- bool waiting = true; /* any queued threads before us */
- struct rwsem_waiter waiter;
- struct rw_semaphore *ret = sem;
- DEFINE_WAKE_Q(wake_q);
-
- /* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
- return sem;
-
- /*
- * Optimistic spinning failed, proceed to the slowpath
- * and block until we can acquire the sem.
- */
- waiter.task = current;
- waiter.type = RWSEM_WAITING_FOR_WRITE;
-
- raw_spin_lock_irq(&sem->wait_lock);
-
- /* account for this before adding a new element to the list */
- if (list_empty(&sem->wait_list))
- waiting = false;
-
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock */
- if (waiting) {
- count = atomic_long_read(&sem->count);
-
- /*
- * If there were already threads queued before us and there are
- * no active writers and some readers, the lock must be read
- * owned; so we try to any read locks that were queued ahead
- * of us.
- */
- if (!(count & RWSEM_WRITER_MASK) &&
- (count & RWSEM_READER_MASK)) {
- __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
-
- /*
- * Reinitialize wake_q after use.
- */
- wake_q_init(&wake_q);
- }
-
- } else {
- count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
- }
-
- /* wait until we successfully acquire the lock */
- set_current_state(state);
- while (true) {
- if (rwsem_try_write_lock(count, sem))
- break;
- raw_spin_unlock_irq(&sem->wait_lock);
-
- /* Block until there are no active lockers. */
- do {
- if (signal_pending_state(state, current))
- goto out_nolock;
-
- schedule();
- lockevent_inc(rwsem_sleep_writer);
- set_current_state(state);
- count = atomic_long_read(&sem->count);
- } while (count & RWSEM_LOCK_MASK);
-
- raw_spin_lock_irq(&sem->wait_lock);
- }
- __set_current_state(TASK_RUNNING);
- list_del(&waiter.list);
- raw_spin_unlock_irq(&sem->wait_lock);
- lockevent_inc(rwsem_wlock);
-
- return ret;
-
-out_nolock:
- __set_current_state(TASK_RUNNING);
- raw_spin_lock_irq(&sem->wait_lock);
- list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
- else
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
- lockevent_inc(rwsem_wlock_fail);
-
- return ERR_PTR(-EINTR);
-}
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read/up_write has decremented the active part of count if we come here
- */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
- unsigned long flags;
- DEFINE_WAKE_Q(wake_q);
-
- raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
- wake_up_q(&wake_q);
-
- return sem;
-}
-EXPORT_SYMBOL(rwsem_wake);
-
-/*
- * downgrade a write lock into a read lock
- * - caller incremented waiting part of count and discovered it still negative
- * - just wake up any readers at the front of the queue
- */
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
-{
- unsigned long flags;
- DEFINE_WAKE_Q(wake_q);
-
- raw_spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
-
- raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
- wake_up_q(&wake_q);
-
- return sem;
-}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ccbf18f560ff..8317bcdf063b 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -3,17 +3,901 @@
*
* Written by David Howells ([email protected]).
* Derived from asm-i386/semaphore.h
+ *
+ * Writer lock-stealing by Alex Shi <[email protected]>
+ * and Michel Lespinasse <[email protected]>
+ *
+ * Optimistic spinning by Tim Chen <[email protected]>
+ * and Davidlohr Bueso <[email protected]>. Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition and rwsem rearchitecture
+ * by Waiman Long <[email protected]>.
*/
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/sched/task.h>
#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched/signal.h>
#include <linux/export.h>
#include <linux/rwsem.h>
#include <linux/atomic.h>
#include "rwsem.h"
+#include "lock_events.h"
+
+/*
+ * The least significant 2 bits of the owner value have the following
+ * meanings when set.
+ * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
+ * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
+ * i.e. the owner(s) cannot be readily determined. It can be reader
+ * owned or the owning writer is indeterminate.
+ *
+ * When a writer acquires a rwsem, it puts its task_struct pointer
+ * into the owner field. It is cleared after an unlock.
+ *
+ * When a reader acquires a rwsem, it will also put its task_struct
+ * pointer into the owner field with both the RWSEM_READER_OWNED and
+ * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * largely be left untouched. So for a free or reader-owned rwsem,
+ * the owner value may contain information about the last reader that
+ * acquired the rwsem. The anonymous bit is set because that particular
+ * reader may or may not still own the lock.
+ *
+ * That information may be helpful in debugging cases where the system
+ * seems to hang on a reader owned rwsem especially if only one reader
+ * is involved. Ideally we would like to track all the readers that own
+ * a rwsem, but the overhead is simply too big.
+ */
+#define RWSEM_READER_OWNED (1UL << 0)
+#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+
+#ifdef CONFIG_DEBUG_RWSEMS
+# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
+ if (!debug_locks_silent && \
+ WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
+ #c, atomic_long_read(&(sem)->count), \
+ (long)((sem)->owner), (long)current, \
+ list_empty(&(sem)->wait_list) ? "" : "not ")) \
+ debug_locks_off(); \
+ } while (0)
+#else
+# define DEBUG_RWSEMS_WARN_ON(c, sem)
+#endif
+
+/*
+ * The definition of the atomic counter in the semaphore:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bits 2-7 - reserved
+ * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ *
+ * atomic_long_fetch_add() is used to obtain reader lock, whereas
+ * atomic_long_cmpxchg() will be used to obtain writer lock.
+ */
+#define RWSEM_WRITER_LOCKED (1UL << 0)
+#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_READER_SHIFT 8
+#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
+#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
+#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
+#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+
+/*
+ * All writes to owner are protected by WRITE_ONCE() to make sure that
+ * store tearing can't happen as optimistic spinners may read and use
+ * the owner value concurrently without lock. Read from owner, however,
+ * may not need READ_ONCE() as long as the pointer value is only used
+ * for comparison and isn't being dereferenced.
+ */
+static inline void rwsem_set_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, current);
+}
+
+static inline void rwsem_clear_owner(struct rw_semaphore *sem)
+{
+ WRITE_ONCE(sem->owner, NULL);
+}
+
+/*
+ * The task_struct pointer of the last owning reader will be left in
+ * the owner field.
+ *
+ * Note that the owner value just indicates the task has owned the rwsem
+ * previously, it may not be the real owner or one of the real owners
+ * anymore when that field is examined, so take it with a grain of salt.
+ */
+static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
+ struct task_struct *owner)
+{
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+
+ WRITE_ONCE(sem->owner, (struct task_struct *)val);
+}
+
+static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
+{
+ __rwsem_set_reader_owned(sem, current);
+}
+
+/*
+ * Return true if a rwsem waiter can spin on the rwsem's owner
+ * and steal the lock, i.e. the lock is not anonymously owned.
+ * N.B. !owner is considered spinnable.
+ */
+static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+{
+ return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
+}
+
+/*
+ * Return true if rwsem is owned by an anonymous writer or readers.
+ */
+static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
+{
+ return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+}
+
+#ifdef CONFIG_DEBUG_RWSEMS
+/*
+ * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
+ * is a task pointer in owner of a reader-owned rwsem, it will be the
+ * real owner or one of the real owners. The only exception is when the
+ * unlock is done by up_read_non_owner().
+ */
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+ unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
+ | RWSEM_ANONYMOUSLY_OWNED;
+ if (READ_ONCE(sem->owner) == (struct task_struct *)val)
+ cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+ RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+}
+#else
+static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
+{
+}
+#endif
+
+/*
+ * Guide to the rw_semaphore's count field.
+ *
+ * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
+ * by a writer.
+ *
+ * The lock is owned by readers when
+ * (1) the RWSEM_WRITER_LOCKED isn't set in count,
+ * (2) some of the reader bits are set in count, and
+ * (3) the owner field has RWSEM_READER_OWNED bit set.
+ *
+ * Having some reader bits set is not enough to guarantee a reader-owned
+ * lock as the readers may be in the process of backing out from the count
+ * and a writer has just released the lock. So another writer may steal
+ * the lock immediately after that.
+ */
+
+/*
+ * Initialize an rwsem:
+ */
+void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ /*
+ * Make sure we are not reinitializing a held semaphore:
+ */
+ debug_check_no_locks_freed((void *)sem, sizeof(*sem));
+ lockdep_init_map(&sem->dep_map, name, key, 0);
+#endif
+ atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
+ raw_spin_lock_init(&sem->wait_lock);
+ INIT_LIST_HEAD(&sem->wait_list);
+ sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+ osq_lock_init(&sem->osq);
+#endif
+}
+
+EXPORT_SYMBOL(__init_rwsem);
+
+enum rwsem_waiter_type {
+ RWSEM_WAITING_FOR_WRITE,
+ RWSEM_WAITING_FOR_READ
+};
+
+struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ enum rwsem_waiter_type type;
+};
+
+enum rwsem_wake_type {
+ RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
+ RWSEM_WAKE_READERS, /* Wake readers only */
+ RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
+};
+
+/*
+ * handle the lock release when processes blocked on it can now run
+ * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
+ * have been set.
+ * - there must be someone on the queue
+ * - the wait_lock must be held by the caller
+ * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
+ * to actually wakeup the blocked task(s) and drop the reference count,
+ * preferably when the wait_lock is released
+ * - woken process blocks are discarded from the list after having task zeroed
+ * - writers are only marked woken if wake_type is RWSEM_WAKE_ANY
+ */
+static void __rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type,
+ struct wake_q_head *wake_q)
+{
+ struct rwsem_waiter *waiter, *tmp;
+ long oldcount, woken = 0, adjustment = 0;
+ struct list_head wlist;
+
+ /*
+ * Take a peek at the queue head waiter such that we can determine
+ * the wakeup(s) to perform.
+ */
+ waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+
+ if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+ if (wake_type == RWSEM_WAKE_ANY) {
+ /*
+ * Mark writer at the front of the queue for wakeup.
+ * Until the task is actually awoken later by
+ * the caller, other writers are able to steal it.
+ * Readers, on the other hand, will block as they
+ * will notice the queued writer.
+ */
+ wake_q_add(wake_q, waiter->task);
+ lockevent_inc(rwsem_wake_writer);
+ }
+
+ return;
+ }
+
+ /*
+ * Writers might steal the lock before we grant it to the next reader.
+ * We prefer to do the first reader grant before counting readers
+ * so we can bail out early if a writer stole the lock.
+ */
+ if (wake_type != RWSEM_WAKE_READ_OWNED) {
+ adjustment = RWSEM_READER_BIAS;
+ oldcount = atomic_long_fetch_add(adjustment, &sem->count);
+ if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
+ atomic_long_sub(adjustment, &sem->count);
+ return;
+ }
+ /*
+ * Set it to reader-owned to give spinners an early
+ * indication that readers now have the lock.
+ */
+ __rwsem_set_reader_owned(sem, waiter->task);
+ }
+
+ /*
+ * Grant an infinite number of read locks to the readers at the front
+ * of the queue. We know that woken will be at least 1 as we accounted
+ * for above. Note we increment the 'active part' of the count by the
+ * number of readers before waking any processes up.
+ *
+ * We have to do wakeup in 2 passes to prevent the possibility that
+ * the reader count may be decremented before it is incremented. It
+ * is because the to-be-woken waiter may not have slept yet. So it
+ * may see waiter->task got cleared, finish its critical section and
+ * do an unlock before the reader count increment.
+ *
+ * 1) Collect the read-waiters in a separate list, count them and
+ * fully increment the reader count in rwsem.
+ * 2) For each waiter in the new list, clear waiter->task and
+ * put it into wake_q to be woken up later.
+ */
+ list_for_each_entry(waiter, &sem->wait_list, list) {
+ if (waiter->type == RWSEM_WAITING_FOR_WRITE)
+ break;
+
+ woken++;
+ }
+ list_cut_before(&wlist, &sem->wait_list, &waiter->list);
+
+ adjustment = woken * RWSEM_READER_BIAS - adjustment;
+ lockevent_cond_inc(rwsem_wake_reader, woken);
+ if (list_empty(&sem->wait_list)) {
+ /* hit end of list above */
+ adjustment -= RWSEM_FLAG_WAITERS;
+ }
+
+ if (adjustment)
+ atomic_long_add(adjustment, &sem->count);
+
+ /* 2nd pass */
+ list_for_each_entry_safe(waiter, tmp, &wlist, list) {
+ struct task_struct *tsk;
+
+ tsk = waiter->task;
+ get_task_struct(tsk);
+
+ /*
+ * Ensure calling get_task_struct() before setting the reader
+ * waiter to nil such that rwsem_down_read_failed() cannot
+ * race with do_exit() by always holding a reference count
+ * to the task to wakeup.
+ */
+ smp_store_release(&waiter->task, NULL);
+ /*
+ * Ensure issuing the wakeup (either by us or someone else)
+ * after setting the reader waiter to nil.
+ */
+ wake_q_add_safe(wake_q, tsk);
+ }
+}
+
+/*
+ * This function must be called with the sem->wait_lock held to prevent
+ * race conditions between checking the rwsem wait list and setting the
+ * sem->count accordingly.
+ */
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+{
+ long new;
+
+ if (count & RWSEM_LOCK_MASK)
+ return false;
+
+ new = count + RWSEM_WRITER_LOCKED -
+ (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+
+ return false;
+}
+
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire write lock before the writer has been put on wait queue.
+ */
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ while (!(count & RWSEM_LOCK_MASK)) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+ count + RWSEM_WRITER_LOCKED)) {
+ rwsem_set_owner(sem);
+ lockevent_inc(rwsem_opt_wlock);
+ return true;
+ }
+ }
+ return false;
+}
+
+static inline bool owner_on_cpu(struct task_struct *owner)
+{
+ /*
+ * Because of the lock holder preemption issue, we skip spinning
+ * if the task is not on a cpu or its cpu is preempted
+ */
+ return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
+}
+
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *owner;
+ bool ret = true;
+
+ BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+
+ if (need_resched())
+ return false;
+
+ rcu_read_lock();
+ owner = READ_ONCE(sem->owner);
+ if (owner) {
+ ret = is_rwsem_owner_spinnable(owner) &&
+ owner_on_cpu(owner);
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * Return true only if we can still spin on the owner field of the rwsem.
+ */
+static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *owner = READ_ONCE(sem->owner);
+
+ if (!is_rwsem_owner_spinnable(owner))
+ return false;
+
+ rcu_read_lock();
+ while (owner && (READ_ONCE(sem->owner) == owner)) {
+ /*
+ * Ensure we emit the owner->on_cpu dereference _after_
+ * checking sem->owner still matches owner. If that fails,
+ * owner might point to free()d memory; if it still matches,
+ * the rcu_read_lock() ensures the memory stays valid.
+ */
+ barrier();
+
+ /*
+ * abort spinning when need_resched or owner is not running or
+ * owner's cpu is preempted.
+ */
+ if (need_resched() || !owner_on_cpu(owner)) {
+ rcu_read_unlock();
+ return false;
+ }
+
+ cpu_relax();
+ }
+ rcu_read_unlock();
+
+ /*
+ * If there is a new owner or the owner is not set, we continue
+ * spinning.
+ */
+ return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+}
+
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ bool taken = false;
+
+ preempt_disable();
+
+ /* sem->wait_lock should not be held when doing optimistic spinning */
+ if (!rwsem_can_spin_on_owner(sem))
+ goto done;
+
+ if (!osq_lock(&sem->osq))
+ goto done;
+
+ /*
+ * Optimistically spin on the owner field and attempt to acquire the
+ * lock whenever the owner changes. Spinning will be stopped when:
+ * 1) the owning writer isn't running; or
+ * 2) readers own the lock as we can't determine if they are
+ * actively running or not.
+ */
+ while (rwsem_spin_on_owner(sem)) {
+ /*
+ * Try to acquire the lock
+ */
+ if (rwsem_try_write_lock_unqueued(sem)) {
+ taken = true;
+ break;
+ }
+
+ /*
+ * When there's no owner, we might have preempted between the
+ * owner acquiring the lock and setting the owner field. If
+ * we're an RT task, that will live-lock because we won't let
+ * the owner complete.
+ */
+ if (!sem->owner && (need_resched() || rt_task(current)))
+ break;
+
+ /*
+ * The cpu_relax() call is a compiler barrier which forces
+ * everything in this loop to be re-loaded. We don't need
+ * memory barriers as we'll eventually observe the right
+ * values at the cost of a few extra spins.
+ */
+ cpu_relax();
+ }
+ osq_unlock(&sem->osq);
+done:
+ preempt_enable();
+ lockevent_cond_inc(rwsem_opt_fail, !taken);
+ return taken;
+}
+#else
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+{
+ return false;
+}
+#endif
+
+/*
+ * Wait for the read lock to be granted
+ */
+static inline struct rw_semaphore __sched *
+__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+{
+ long count, adjustment = -RWSEM_READER_BIAS;
+ struct rwsem_waiter waiter;
+ DEFINE_WAKE_Q(wake_q);
+
+ waiter.task = current;
+ waiter.type = RWSEM_WAITING_FOR_READ;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (list_empty(&sem->wait_list)) {
+ /*
+ * In case the wait queue is empty and the lock isn't owned
+ * by a writer, this reader can exit the slowpath and return
+ * immediately as its RWSEM_READER_BIAS has already been
+ * set in the count.
+ */
+ if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ raw_spin_unlock_irq(&sem->wait_lock);
+ rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_rlock_fast);
+ return sem;
+ }
+ adjustment += RWSEM_FLAG_WAITERS;
+ }
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock, but no longer actively locking */
+ count = atomic_long_add_return(adjustment, &sem->count);
+
+ /*
+ * If there are no active locks, wake the front queued process(es).
+ *
+ * If there are no writers and we are first in the queue,
+ * wake our own waiter to join the existing active readers !
+ */
+ if (!(count & RWSEM_LOCK_MASK) ||
+ (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+
+ /* wait to be given the lock */
+ while (true) {
+ set_current_state(state);
+ if (!waiter.task)
+ break;
+ if (signal_pending_state(state, current)) {
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (waiter.task)
+ goto out_nolock;
+ raw_spin_unlock_irq(&sem->wait_lock);
+ break;
+ }
+ schedule();
+ lockevent_inc(rwsem_sleep_reader);
+ }
+
+ __set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock);
+ return sem;
+out_nolock:
+ list_del(&waiter.list);
+ if (list_empty(&sem->wait_list))
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ __set_current_state(TASK_RUNNING);
+ lockevent_inc(rwsem_rlock_fail);
+ return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_read_failed_killable(struct rw_semaphore *sem)
+{
+ return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_read_failed_killable);
+
+/*
+ * Wait until we successfully acquire the write lock
+ */
+static inline struct rw_semaphore *
+__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+{
+ long count;
+ bool waiting = true; /* any queued threads before us */
+ struct rwsem_waiter waiter;
+ struct rw_semaphore *ret = sem;
+ DEFINE_WAKE_Q(wake_q);
+
+ /* do optimistic spinning and steal lock if possible */
+ if (rwsem_optimistic_spin(sem))
+ return sem;
+
+ /*
+ * Optimistic spinning failed, proceed to the slowpath
+ * and block until we can acquire the sem.
+ */
+ waiter.task = current;
+ waiter.type = RWSEM_WAITING_FOR_WRITE;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+
+ /* account for this before adding a new element to the list */
+ if (list_empty(&sem->wait_list))
+ waiting = false;
+
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock */
+ if (waiting) {
+ count = atomic_long_read(&sem->count);
+
+ /*
+ * If there were already threads queued before us and there are
+ * no active writers and some readers, the lock must be read
+ * owned; so we try to wake any read locks that were queued ahead
+ * of us.
+ */
+ if (!(count & RWSEM_WRITER_MASK) &&
+ (count & RWSEM_READER_MASK)) {
+ __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+ /*
+ * The wakeup is normally called _after_ the wait_lock
+ * is released, but given that we are proactively waking
+ * readers we can deal with the wake_q overhead as it is
+ * similar to releasing and taking the wait_lock again
+ * for attempting rwsem_try_write_lock().
+ */
+ wake_up_q(&wake_q);
+
+ /*
+ * Reinitialize wake_q after use.
+ */
+ wake_q_init(&wake_q);
+ }
+
+ } else {
+ count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ }
+
+ /* wait until we successfully acquire the lock */
+ set_current_state(state);
+ while (true) {
+ if (rwsem_try_write_lock(count, sem))
+ break;
+ raw_spin_unlock_irq(&sem->wait_lock);
+
+ /* Block until there are no active lockers. */
+ do {
+ if (signal_pending_state(state, current))
+ goto out_nolock;
+
+ schedule();
+ lockevent_inc(rwsem_sleep_writer);
+ set_current_state(state);
+ count = atomic_long_read(&sem->count);
+ } while (count & RWSEM_LOCK_MASK);
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ }
+ __set_current_state(TASK_RUNNING);
+ list_del(&waiter.list);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ lockevent_inc(rwsem_wlock);
+
+ return ret;
+
+out_nolock:
+ __set_current_state(TASK_RUNNING);
+ raw_spin_lock_irq(&sem->wait_lock);
+ list_del(&waiter.list);
+ if (list_empty(&sem->wait_list))
+ atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ else
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ lockevent_inc(rwsem_wlock_fail);
+
+ return ERR_PTR(-EINTR);
+}
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed(struct rw_semaphore *sem)
+{
+ return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed);
+
+__visible struct rw_semaphore * __sched
+rwsem_down_write_failed_killable(struct rw_semaphore *sem)
+{
+ return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(rwsem_down_write_failed_killable);
+
+/*
+ * handle waking up a waiter on the semaphore
+ * - up_read/up_write has decremented the active part of count if we come here
+ */
+__visible
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+{
+ unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
+
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+ if (!list_empty(&sem->wait_list))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ wake_up_q(&wake_q);
+
+ return sem;
+}
+EXPORT_SYMBOL(rwsem_wake);
+
+/*
+ * downgrade a write lock into a read lock
+ * - caller incremented waiting part of count and discovered it still negative
+ * - just wake up any readers at the front of the queue
+ */
+__visible
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+{
+ unsigned long flags;
+ DEFINE_WAKE_Q(wake_q);
+
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+
+ if (!list_empty(&sem->wait_list))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ wake_up_q(&wake_q);
+
+ return sem;
+}
+EXPORT_SYMBOL(rwsem_downgrade_wake);
+
+/*
+ * lock for reading
+ */
+inline void __down_read(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ rwsem_down_read_failed(sem);
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED), sem);
+ } else {
+ rwsem_set_reader_owned(sem);
+ }
+}
+
+static inline int __down_read_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
+ &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ return -EINTR;
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
+ RWSEM_READER_OWNED), sem);
+ } else {
+ rwsem_set_reader_owned(sem);
+ }
+ return 0;
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+ /*
+ * Optimize for the case when the rwsem is not locked at all.
+ */
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ lockevent_inc(rwsem_rtrylock);
+ do {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ tmp + RWSEM_READER_BIAS)) {
+ rwsem_set_reader_owned(sem);
+ return 1;
+ }
+ } while (!(tmp & RWSEM_READ_FAILED_MASK));
+ return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
+ rwsem_down_write_failed(sem);
+ rwsem_set_owner(sem);
+}
+
+static inline int __down_write_killable(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
+ RWSEM_WRITER_LOCKED)))
+ if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ return -EINTR;
+ rwsem_set_owner(sem);
+ return 0;
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ lockevent_inc(rwsem_wtrylock);
+ tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_WRITER_LOCKED);
+ if (tmp == RWSEM_UNLOCKED_VALUE) {
+ rwsem_set_owner(sem);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * unlock after reading
+ */
+inline void __up_read(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
+ sem);
+ rwsem_clear_reader_owned(sem);
+ tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
+ == RWSEM_FLAG_WAITERS))
+ rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ rwsem_clear_owner(sem);
+ if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
+ &sem->count) & RWSEM_FLAG_WAITERS))
+ rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ /*
+ * When downgrading from exclusive to shared ownership,
+ * anything inside the write-locked region cannot leak
+ * into the read side. In contrast, anything in the
+ * read-locked region is ok to be re-ordered into the
+ * write side. As such, rely on RELEASE semantics.
+ */
+ DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ tmp = atomic_long_fetch_add_release(
+ -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
+ rwsem_set_reader_owned(sem);
+ if (tmp & RWSEM_FLAG_WAITERS)
+ rwsem_downgrade_wake(sem);
+}
/*
* lock for reading
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 499a9b2bda82..2534ce49f648 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -1,279 +1,10 @@
/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * The least significant 2 bits of the owner value has the following
- * meanings when set.
- * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- * i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
- *
- * When a writer acquires a rwsem, it puts its task_struct pointer
- * into the owner field. It is cleared after an unlock.
- *
- * When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
- *
- * That information may be helpful in debugging cases where the system
- * seems to hang on a reader owned rwsem especially if only one reader
- * is involved. Ideally we would like to track all the readers that own
- * a rwsem, but the overhead is simply too big.
- */
-#include "lock_events.h"
-#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+#ifndef __INTERNAL_RWSEM_H
+#define __INTERNAL_RWSEM_H
+#include <linux/rwsem.h>
-#ifdef CONFIG_DEBUG_RWSEMS
-# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
- if (!debug_locks_silent && \
- WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
- #c, atomic_long_read(&(sem)->count), \
- (long)((sem)->owner), (long)current, \
- list_empty(&(sem)->wait_list) ? "" : "not ")) \
- debug_locks_off(); \
- } while (0)
-#else
-# define DEBUG_RWSEMS_WARN_ON(c, sem)
-#endif
+extern void __down_read(struct rw_semaphore *sem);
+extern void __up_read(struct rw_semaphore *sem);
-/*
- * The definition of the atomic counter in the semaphore:
- *
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bits 2-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
- *
- * atomic_long_fetch_add() is used to obtain reader lock, whereas
- * atomic_long_cmpxchg() will be used to obtain writer lock.
- */
-#define RWSEM_WRITER_LOCKED (1UL << 0)
-#define RWSEM_FLAG_WAITERS (1UL << 1)
-#define RWSEM_READER_SHIFT 8
-#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
-#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
-#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
-#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
-
-/*
- * All writes to owner are protected by WRITE_ONCE() to make sure that
- * store tearing can't happen as optimistic spinners may read and use
- * the owner value concurrently without lock. Read from owner, however,
- * may not need READ_ONCE() as long as the pointer value is only used
- * for comparison and isn't being dereferenced.
- */
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, current);
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
- WRITE_ONCE(sem->owner, NULL);
-}
-
-/*
- * The task_struct pointer of the last owning reader will be left in
- * the owner field.
- *
- * Note that the owner value just indicates the task has owned the rwsem
- * previously, it may not be the real owner or one of the real owners
- * anymore when that field is examined, so take it with a grain of salt.
- */
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
- struct task_struct *owner)
-{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
-
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
- __rwsem_set_reader_owned(sem, current);
-}
-
-/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
- * N.B. !owner is considered spinnable.
- */
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
-{
- return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
- return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
-}
-
-#ifdef CONFIG_DEBUG_RWSEMS
-/*
- * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
- * is a task pointer in owner of a reader-owned rwsem, it will be the
- * real owner or one of the real owners. The only exception is when the
- * unlock is done by up_read_non_owner().
- */
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
- unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
- if (READ_ONCE(sem->owner) == (struct task_struct *)val)
- cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
-}
-#else
-static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_read_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed_killable(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- rwsem_down_read_failed(sem);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
- } else {
- rwsem_set_reader_owned(sem);
- }
-}
-
-static inline int __down_read_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
- return -EINTR;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
- } else {
- rwsem_set_reader_owned(sem);
- }
- return 0;
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- /*
- * Optimize for the case when the rwsem is not locked at all.
- */
- long tmp = RWSEM_UNLOCKED_VALUE;
-
- lockevent_inc(rwsem_rtrylock);
- do {
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
- tmp + RWSEM_READER_BIAS)) {
- rwsem_set_reader_owned(sem);
- return 1;
- }
- } while (!(tmp & RWSEM_READ_FAILED_MASK));
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- rwsem_down_write_failed(sem);
- rwsem_set_owner(sem);
-}
-
-static inline int __down_write_killable(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
- return -EINTR;
- rwsem_set_owner(sem);
- return 0;
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long tmp;
-
- lockevent_inc(rwsem_wtrylock);
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_WRITER_LOCKED);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
- rwsem_set_owner(sem);
- return true;
- }
- return false;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long tmp;
-
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
- rwsem_clear_reader_owned(sem);
- tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
- if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
- == RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- rwsem_clear_owner(sem);
- if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
- &sem->count) & RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long tmp;
-
- /*
- * When downgrading from exclusive to shared ownership,
- * anything inside the write-locked region cannot leak
- * into the read side. In contrast, anything in the
- * read-locked region is ok to be re-ordered into the
- * write side. As such, rely on RELEASE semantics.
- */
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
- tmp = atomic_long_fetch_add_release(
- -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
- rwsem_set_reader_owned(sem);
- if (tmp & RWSEM_FLAG_WAITERS)
- rwsem_downgrade_wake(sem);
-}
+#endif /* __INTERNAL_RWSEM_H */
Commit-ID: 3f6d517a3ece6e6ced7abcbe798ff332ac5ca586
Gitweb: https://git.kernel.org/tip/3f6d517a3ece6e6ced7abcbe798ff332ac5ca586
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:05 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:59 +0200
locking/rwsem: Make rwsem_spin_on_owner() return owner state
This patch modifies rwsem_spin_on_owner() to return one of four possible
values that better reflect the state of the lock holder, which enables
a better decision about what to do next.
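As a rough illustration of the four-state classification described above,
the following user-space sketch mirrors the decision order used in the
diff below. The names classify_owner() and should_keep_spinning() are
hypothetical, and the owner word is modelled as a plain unsigned long
rather than a task_struct pointer; this is not the kernel code itself.

```c
/* Flag bits assumed to occupy the two low bits of the owner word. */
#define READER_OWNED		(1UL << 0)
#define ANONYMOUSLY_OWNED	(1UL << 1)

enum owner_state {
	OWNER_NULL		= 1 << 0,
	OWNER_WRITER		= 1 << 1,
	OWNER_READER		= 1 << 2,
	OWNER_NONSPINNABLE	= 1 << 3,
};
/* States in which an optimistic spinner may keep going. */
#define OWNER_SPINNABLE	(OWNER_NULL | OWNER_WRITER)

/* Hypothetical helper: map a raw owner word to one of the four states. */
enum owner_state classify_owner(unsigned long owner)
{
	if (!owner)
		return OWNER_NULL;
	if (owner & ANONYMOUSLY_OWNED)
		return OWNER_NONSPINNABLE;
	if (owner & READER_OWNED)
		return OWNER_READER;
	return OWNER_WRITER;
}

/* Hypothetical helper: the test a spin loop would apply to the state. */
int should_keep_spinning(unsigned long owner)
{
	return (classify_owner(owner) & OWNER_SPINNABLE) != 0;
}
```

Encoding the states as distinct bits lets a caller test for a set of
states with a single mask, as the OWNER_SPINNABLE check does.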
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 65 ++++++++++++++++++++++++++++++++++++--------------
1 file changed, 47 insertions(+), 18 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index f56329240ef1..8d0f2acfe13d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -414,17 +414,54 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
}
/*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ * OWNER_NULL : owner is currently NULL
+ * OWNER_WRITER: when owner changes and is a writer
+ * OWNER_READER: when owner changes and the new owner may be a reader.
+ * OWNER_NONSPINNABLE:
+ * when optimistic spinning has to stop because either the
+ * owner stops running, is unknown, or its timeslice has
+ * been used up.
*/
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+ OWNER_NULL = 1 << 0,
+ OWNER_WRITER = 1 << 1,
+ OWNER_READER = 1 << 2,
+ OWNER_NONSPINNABLE = 1 << 3,
+};
+#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
{
- struct task_struct *owner = READ_ONCE(sem->owner);
+ if (!owner)
+ return OWNER_NULL;
- if (!is_rwsem_owner_spinnable(owner))
- return false;
+ if (owner & RWSEM_ANONYMOUSLY_OWNED)
+ return OWNER_NONSPINNABLE;
+
+ if (owner & RWSEM_READER_OWNED)
+ return OWNER_READER;
+
+ return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+ struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+ enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+ if (state != OWNER_WRITER)
+ return state;
rcu_read_lock();
- while (owner && (READ_ONCE(sem->owner) == owner)) {
+ for (;;) {
+ tmp = READ_ONCE(sem->owner);
+ if (tmp != owner) {
+ state = rwsem_owner_state((unsigned long)tmp);
+ break;
+ }
+
/*
* Ensure we emit the owner->on_cpu, dereference _after_
* checking sem->owner still matches owner, if that fails,
@@ -433,24 +470,16 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
*/
barrier();
- /*
- * abort spinning when need_resched or owner is not running or
- * owner's cpu is preempted.
- */
if (need_resched() || !owner_on_cpu(owner)) {
- rcu_read_unlock();
- return false;
+ state = OWNER_NONSPINNABLE;
+ break;
}
cpu_relax();
}
rcu_read_unlock();
- /*
- * If there is a new owner or the owner is not set, we continue
- * spinning.
- */
- return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+ return state;
}
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -473,7 +502,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem)) {
+ while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
/*
* Try to acquire the lock
*/
Commit-ID: 4f23dbc1e657951e5d94c60369bc1db065961fb3
Gitweb: https://git.kernel.org/tip/4f23dbc1e657951e5d94c60369bc1db065961fb3
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:06 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:59 +0200
locking/rwsem: Implement lock handoff to prevent lock starvation
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.
This patch implements a lock handoff mechanism that disables lock stealing
and forces lock handoff to the first waiter or waiters (for readers)
in the queue after a waiting period of at least 4ms. An RT writer task
is the exception and doesn't need to wait. The waiting period keeps the
handoff from discouraging lock stealing so much that performance suffers.
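To make the 4ms waiting period concrete, here is a small user-space
sketch of how a DIV_ROUND_UP(HZ, 250) timeout (the RWSEM_WAIT_TIMEOUT
definition in the diff below) translates into jiffies and milliseconds
for a few tick rates. The helper names are hypothetical, and
time_after_ul() is a simplified stand-in for the kernel's time_after().

```c
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Hypothetical helper: timeout in jiffies for a given tick rate. */
unsigned long wait_timeout_jiffies(unsigned long hz)
{
	return DIV_ROUND_UP(hz, 250);
}

/* Hypothetical helper: the same timeout in whole milliseconds. */
unsigned long wait_timeout_ms(unsigned long hz)
{
	return wait_timeout_jiffies(hz) * 1000 / hz;
}

/*
 * Wraparound-safe "a is after b" comparison on a free-running tick
 * counter. Relies on the usual two's complement conversion to long.
 */
int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}
```

With HZ=1000 this gives 4 jiffies (4ms), with HZ=250 it gives 1 jiffy
(4ms), and with HZ=100 it gives 1 jiffy (10ms), so the wait is never
shorter than 4ms.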
The setting and clearing of the handoff bit are serialized by the
wait_lock, so racing is not possible.
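The handoff decision for a queued writer can be sketched in user space
with C11 atomics. This is a simplified transliteration of the
rwsem_try_write_lock() logic in the diff below, under the assumption
that the caller already holds the lock that serializes handoff-bit
updates. The function name try_write_lock() and the only_waiter
parameter (standing in for list_is_singular()) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WRITER_LOCKED	(1L << 0)
#define FLAG_WAITERS	(1L << 1)
#define FLAG_HANDOFF	(1L << 2)
#define READER_BIAS	(1L << 8)
#define READER_MASK	(~(READER_BIAS - 1))
#define LOCK_MASK	(WRITER_LOCKED | READER_MASK)

enum writer_wait_state { WRITER_NOT_FIRST, WRITER_FIRST, WRITER_HANDOFF };

/*
 * Returns true if the write lock was taken. Returns false if the lock
 * is busy; in the WRITER_HANDOFF state the handoff bit is set first so
 * that later lock stealers back off.
 */
bool try_write_lock(atomic_long *count, enum writer_wait_state wstate,
		    bool only_waiter)
{
	long old = atomic_load(count);
	long next;

	do {
		bool has_handoff = (old & FLAG_HANDOFF) != 0;

		/* Someone ahead of us has requested a handoff. */
		if (has_handoff && wstate == WRITER_NOT_FIRST)
			return false;

		next = old;
		if (old & LOCK_MASK) {
			/* Lock is held: at most request a handoff. */
			if (has_handoff || wstate != WRITER_HANDOFF)
				return false;
			next |= FLAG_HANDOFF;
		} else {
			/* Lock is free: take it and clear the handoff bit. */
			next |= WRITER_LOCKED;
			next &= ~FLAG_HANDOFF;
			if (only_waiter)
				next &= ~FLAG_WAITERS;
		}
	} while (!atomic_compare_exchange_weak_explicit(count, &old, next,
			memory_order_acquire, memory_order_relaxed));

	/* Either the handoff bit was just set, or the lock was acquired. */
	return !(next & FLAG_HANDOFF);
}
```

Once the handoff bit is set, a writer that is not first in the queue
fails immediately, which is what stops a stream of newcomers from
starving the head waiter.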
A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.
Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. The rwsem became much fairer, at the cost of
a drop of about 16% in the mean number of locking operations.
Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:
40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081
By making the waiter set the handoff bit immediately after wakeup:
40 readers, Iterations Min/Mean/Max = 43/44/46
40 writers, Iterations Min/Mean/Max = 43/1,263/3,191
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/lock_events_list.h | 2 +
kernel/locking/rwsem.c | 225 +++++++++++++++++++++++++++++---------
2 files changed, 173 insertions(+), 54 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
+LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
+LOCK_EVENT(rwsem_wlock_handoff) /* # of write lock handoffs */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8d0f2acfe13d..decda9fb8c6d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
* Optimistic spinning by Tim Chen <[email protected]>
* and Davidlohr Bueso <[email protected]>. Based on mutexes.
*
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long <[email protected]>.
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long <[email protected]> and
+ * Peter Zijlstra <[email protected]>.
*/
#include <linux/types.h>
@@ -74,20 +75,33 @@
*
* Bit 0 - writer locked bit
* Bit 1 - waiters present bit
- * Bits 2-7 - reserved
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
* Bits 8-X - 24-bit (32-bit) or 56-bit reader count
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of rwsem_down_write_slowpath().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
*/
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_FLAG_HANDOFF (1UL << 2)
+
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
#define RWSEM_READER_MASK (~(RWSEM_READER_BIAS - 1))
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+ RWSEM_FLAG_HANDOFF)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+ unsigned long timeout;
};
+#define rwsem_first_waiter(sem) \
+ list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
enum rwsem_wake_type {
RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
@@ -224,6 +241,19 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
};
+enum writer_wait_state {
+ WRITER_NOT_FIRST, /* Writer is not first in wait list */
+ WRITER_FIRST, /* Writer is first in wait list */
+ WRITER_HANDOFF /* Writer is first & handoff needed */
+};
+
+/*
+ * The typical HZ value is either 250 or 1000. So set the minimum waiting
+ * time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
+ * queue before initiating the handoff protocol.
+ */
+#define RWSEM_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -244,11 +274,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
long oldcount, woken = 0, adjustment = 0;
struct list_head wlist;
+ lockdep_assert_held(&sem->wait_lock);
+
/*
* Take a peek at the queue head waiter such that we can determine
* the wakeup(s) to perform.
*/
- waiter = list_first_entry(&sem->wait_list, struct rwsem_waiter, list);
+ waiter = rwsem_first_waiter(sem);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY) {
@@ -275,7 +307,18 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
- atomic_long_sub(adjustment, &sem->count);
+ /*
+ * When we've been waiting "too" long (for writers
+ * to give up the lock), request a HANDOFF to
+ * force the issue.
+ */
+ if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
+ time_after(jiffies, waiter->timeout)) {
+ adjustment -= RWSEM_FLAG_HANDOFF;
+ lockevent_inc(rwsem_rlock_handoff);
+ }
+
+ atomic_long_add(-adjustment, &sem->count);
return;
}
/*
@@ -317,6 +360,13 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment -= RWSEM_FLAG_WAITERS;
}
+ /*
+ * When we've woken a reader, we no longer need to force writers
+ * to give up the lock and we can clear HANDOFF.
+ */
+ if (woken && (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF))
+ adjustment -= RWSEM_FLAG_HANDOFF;
+
if (adjustment)
atomic_long_add(adjustment, &sem->count);
@@ -346,23 +396,48 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* This function must be called with the sem->wait_lock held to prevent
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
+ *
+ * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * bit is set or the lock is acquired with handoff bit cleared.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+ enum writer_wait_state wstate)
{
long new;
- if (count & RWSEM_LOCK_MASK)
- return false;
+ lockdep_assert_held(&sem->wait_lock);
- new = count + RWSEM_WRITER_LOCKED -
- (list_is_singular(&sem->wait_list) ? RWSEM_FLAG_WAITERS : 0);
+ do {
+ bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
- if (atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)) {
- rwsem_set_owner(sem);
- return true;
- }
+ if (has_handoff && wstate == WRITER_NOT_FIRST)
+ return false;
- return false;
+ new = count;
+
+ if (count & RWSEM_LOCK_MASK) {
+ if (has_handoff || (wstate != WRITER_HANDOFF))
+ return false;
+
+ new |= RWSEM_FLAG_HANDOFF;
+ } else {
+ new |= RWSEM_WRITER_LOCKED;
+ new &= ~RWSEM_FLAG_HANDOFF;
+
+ if (list_is_singular(&sem->wait_list))
+ new &= ~RWSEM_FLAG_WAITERS;
+ }
+ } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
+
+ /*
+ * We have either acquired the lock with handoff bit cleared or
+ * set the handoff bit.
+ */
+ if (new & RWSEM_FLAG_HANDOFF)
+ return false;
+
+ rwsem_set_owner(sem);
+ return true;
}
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
@@ -373,9 +448,9 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
{
long count = atomic_long_read(&sem->count);
- while (!(count & RWSEM_LOCK_MASK)) {
+ while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
- count + RWSEM_WRITER_LOCKED)) {
+ count | RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -456,6 +531,11 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
rcu_read_lock();
for (;;) {
+ if (atomic_long_read(&sem->count) & RWSEM_FLAG_HANDOFF) {
+ state = OWNER_NONSPINNABLE;
+ break;
+ }
+
tmp = READ_ONCE(sem->owner);
if (tmp != owner) {
state = rwsem_owner_state((unsigned long)tmp);
@@ -553,16 +633,18 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list)) {
/*
* In case the wait queue is empty and the lock isn't owned
- * by a writer, this reader can exit the slowpath and return
- * immediately as its RWSEM_READER_BIAS has already been
- * set in the count.
+ * by a writer or has the handoff bit set, this reader can
+ * exit the slowpath and return immediately as its
+ * RWSEM_READER_BIAS has already been set in the count.
*/
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ if (!(atomic_long_read(&sem->count) &
+ (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
lockevent_inc(rwsem_rlock_fast);
@@ -609,8 +691,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
return sem;
out_nolock:
list_del(&waiter.list);
- if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
+ if (list_empty(&sem->wait_list)) {
+ atomic_long_andnot(RWSEM_FLAG_WAITERS|RWSEM_FLAG_HANDOFF,
+ &sem->count);
+ }
raw_spin_unlock_irq(&sem->wait_lock);
__set_current_state(TASK_RUNNING);
lockevent_inc(rwsem_rlock_fail);
@@ -624,7 +708,7 @@ static struct rw_semaphore *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
- bool waiting = true; /* any queued threads before us */
+ enum writer_wait_state wstate;
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
@@ -639,66 +723,95 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
*/
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
+ waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
raw_spin_lock_irq(&sem->wait_lock);
/* account for this before adding a new element to the list */
- if (list_empty(&sem->wait_list))
- waiting = false;
+ wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock */
- if (waiting) {
+ if (wstate == WRITER_NOT_FIRST) {
count = atomic_long_read(&sem->count);
/*
- * If there were already threads queued before us and there are
- * no active writers and some readers, the lock must be read
- * owned; so we try to any read locks that were queued ahead
- * of us.
+ * If there were already threads queued before us and:
+ * 1) there are no active locks, wake the front
+ * queued process(es) as the handoff bit might be set.
+ * 2) there are no active writers and some readers, the lock
+ * must be read owned; so we try to wake any read lock
+ * waiters that were queued ahead of us.
*/
- if (!(count & RWSEM_WRITER_MASK) &&
- (count & RWSEM_READER_MASK)) {
- rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
+ if (count & RWSEM_WRITER_MASK)
+ goto wait;
- /*
- * Reinitialize wake_q after use.
- */
- wake_q_init(&wake_q);
- }
+ rwsem_mark_wake(sem, (count & RWSEM_READER_MASK)
+ ? RWSEM_WAKE_READERS
+ : RWSEM_WAKE_ANY, &wake_q);
+ /*
+ * The wakeup is normally called _after_ the wait_lock
+ * is released, but given that we are proactively waking
+ * readers we can deal with the wake_q overhead as it is
+ * similar to releasing and taking the wait_lock again
+ * for attempting rwsem_try_write_lock().
+ */
+ wake_up_q(&wake_q);
+
+ /* We need wake_q again below, reinitialize */
+ wake_q_init(&wake_q);
} else {
count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
}
+wait:
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem))
+ if (rwsem_try_write_lock(count, sem, wstate))
break;
+
raw_spin_unlock_irq(&sem->wait_lock);
/* Block until there are no active lockers. */
- do {
+ for (;;) {
if (signal_pending_state(state, current))
goto out_nolock;
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
+ /*
+ * If HANDOFF bit is set, unconditionally do
+ * a trylock.
+ */
+ if (wstate == WRITER_HANDOFF)
+ break;
+
+ if ((wstate == WRITER_NOT_FIRST) &&
+ (rwsem_first_waiter(sem) == &waiter))
+ wstate = WRITER_FIRST;
+
count = atomic_long_read(&sem->count);
- } while (count & RWSEM_LOCK_MASK);
+ if (!(count & RWSEM_LOCK_MASK))
+ break;
+
+ /*
+ * The setting of the handoff bit is deferred
+ * until rwsem_try_write_lock() is called.
+ */
+ if ((wstate == WRITER_FIRST) && (rt_task(current) ||
+ time_after(jiffies, waiter.timeout))) {
+ wstate = WRITER_HANDOFF;
+ lockevent_inc(rwsem_wlock_handoff);
+ break;
+ }
+ }
raw_spin_lock_irq(&sem->wait_lock);
+ count = atomic_long_read(&sem->count);
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
@@ -711,6 +824,10 @@ out_nolock:
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
list_del(&waiter.list);
+
+ if (unlikely(wstate == WRITER_HANDOFF))
+ atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+
if (list_empty(&sem->wait_list))
atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
@@ -726,7 +843,7 @@ out_nolock:
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
*/
-static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem, long count)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -859,7 +976,7 @@ inline void __up_read(struct rw_semaphore *sem)
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ rwsem_wake(sem, tmp);
}
/*
@@ -873,7 +990,7 @@ static inline void __up_write(struct rw_semaphore *sem)
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
- rwsem_wake(sem);
+ rwsem_wake(sem, tmp);
}
/*
Commit-ID: 6cef7ff6e43cbdb9fa8eb91eb9a6b25d45ae11e3
Gitweb: https://git.kernel.org/tip/6cef7ff6e43cbdb9fa8eb91eb9a6b25d45ae11e3
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:04 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:27:58 +0200
locking/rwsem: Code cleanup after files merging
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:
1) Remove all the EXPORT_SYMBOL() calls for functions that are not
accessed elsewhere.
2) Remove all the __visible tags as none of the functions will be
called from assembly code anymore.
3) Make all the internal functions static.
4) Remove some unneeded blank lines.
5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
and rename __rwsem_down_{read|write}_failed_common() to
rwsem_down_{read|write}_slowpath().
6) Remove "__" prefix of __rwsem_mark_wake().
7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
are not that useful.
That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
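As a user-space illustration of item 7, the sketch below mimics the
try_cmpxchg calling shape with C11 atomic_compare_exchange_strong_explicit(),
which likewise returns a bool and writes the observed value back into the
"expected" variable on failure. The demo_* names and DEMO_* constants are
hypothetical stand-ins chosen for this example, not kernel identifiers, and
the code is only an analogy for the kernel's atomic_long_try_cmpxchg_acquire().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for RWSEM_UNLOCKED_VALUE / RWSEM_WRITER_LOCKED. */
#define DEMO_UNLOCKED		0L
#define DEMO_WRITER_LOCKED	1L

struct demo_sem {
	atomic_long count;
};

/*
 * User-space analogue of the reworked __down_write_trylock() shape.
 * Returns true if the lock was taken. On failure, *observed receives
 * the count value that was actually seen, with no second load needed.
 */
static bool demo_write_trylock(struct demo_sem *sem, long *observed)
{
	long tmp = DEMO_UNLOCKED;

	if (atomic_compare_exchange_strong_explicit(&sem->count, &tmp,
						    DEMO_WRITER_LOCKED,
						    memory_order_acquire,
						    memory_order_relaxed))
		return true;

	/* The failed compare-exchange already updated tmp for us. */
	*observed = tmp;
	return false;
}
```

With the older cmpxchg shape the caller has to compare the returned value
against the expected one itself; here the success test and the refreshed
value both come out of a single call.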
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/lock_events_list.h | 2 -
kernel/locking/rwsem.c | 135 ++++++++++++--------------------------
2 files changed, 42 insertions(+), 95 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
-LOCK_EVENT(rwsem_rtrylock) /* # of read trylock calls */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
-LOCK_EVENT(rwsem_wtrylock) /* # of write trylock calls */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8317bcdf063b..f56329240ef1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
osq_lock_init(&sem->osq);
#endif
}
-
EXPORT_SYMBOL(__init_rwsem);
enum rwsem_waiter_type {
@@ -237,9 +236,9 @@ enum rwsem_wake_type {
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only marked woken if downgrading is false
*/
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
+static void rwsem_mark_wake(struct rw_semaphore *sem,
+ enum rwsem_wake_type wake_type,
+ struct wake_q_head *wake_q)
{
struct rwsem_waiter *waiter, *tmp;
long oldcount, woken = 0, adjustment = 0;
@@ -330,7 +329,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Ensure calling get_task_struct() before setting the reader
- * waiter to nil such that rwsem_down_read_failed() cannot
+ * waiter to nil such that rwsem_down_read_slowpath() cannot
* race with do_exit() by always holding a reference count
* to the task to wakeup.
*/
@@ -516,8 +515,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Wait for the read lock to be granted
*/
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
@@ -555,7 +554,7 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
*/
if (!(count & RWSEM_LOCK_MASK) ||
(!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
@@ -589,25 +588,11 @@ out_nolock:
return ERR_PTR(-EINTR);
}
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
/*
* Wait until we successfully acquire the write lock
*/
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
bool waiting = true; /* any queued threads before us */
@@ -646,7 +631,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
*/
if (!(count & RWSEM_WRITER_MASK) &&
(count & RWSEM_READER_MASK)) {
- __rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READERS, &wake_q);
/*
* The wakeup is normally called _after_ the wait_lock
* is released, but given that we are proactively waking
@@ -700,7 +685,7 @@ out_nolock:
if (list_empty(&sem->wait_list))
atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
else
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
lockevent_inc(rwsem_wlock_fail);
@@ -708,26 +693,11 @@ out_nolock:
return ERR_PTR(-EINTR);
}
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
- return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
/*
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
*/
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -735,22 +705,20 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
return sem;
}
-EXPORT_SYMBOL(rwsem_wake);
/*
* downgrade a write lock into a read lock
* - caller incremented waiting part of count and discovered it still negative
* - just wake up any readers at the front of the queue
*/
-__visible
-struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
{
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
@@ -758,14 +726,13 @@ struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
raw_spin_lock_irqsave(&sem->wait_lock, flags);
if (!list_empty(&sem->wait_list))
- __rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
wake_up_q(&wake_q);
return sem;
}
-EXPORT_SYMBOL(rwsem_downgrade_wake);
/*
* lock for reading
@@ -774,7 +741,7 @@ inline void __down_read(struct rw_semaphore *sem)
{
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
- rwsem_down_read_failed(sem);
+ rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
} else {
@@ -786,7 +753,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
{
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
- if (IS_ERR(rwsem_down_read_failed_killable(sem)))
+ if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
RWSEM_READER_OWNED), sem);
@@ -803,7 +770,6 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
long tmp = RWSEM_UNLOCKED_VALUE;
- lockevent_inc(rwsem_rtrylock);
do {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
tmp + RWSEM_READER_BIAS)) {
@@ -819,30 +785,33 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline void __down_write(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- rwsem_down_write_failed(sem);
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED)))
+ rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_cmpxchg_acquire(&sem->count, 0,
- RWSEM_WRITER_LOCKED)))
- if (IS_ERR(rwsem_down_write_failed_killable(sem)))
+ long tmp = RWSEM_UNLOCKED_VALUE;
+
+ if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED))) {
+ if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
+ }
rwsem_set_owner(sem);
return 0;
}
static inline int __down_write_trylock(struct rw_semaphore *sem)
{
- long tmp;
+ long tmp = RWSEM_UNLOCKED_VALUE;
- lockevent_inc(rwsem_wtrylock);
- tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_WRITER_LOCKED);
- if (tmp == RWSEM_UNLOCKED_VALUE) {
+ if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+ RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
return true;
}
@@ -856,12 +825,11 @@ inline void __up_read(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
- if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS))
- == RWSEM_FLAG_WAITERS))
+ if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
+ RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -870,10 +838,12 @@ inline void __up_read(struct rw_semaphore *sem)
*/
static inline void __up_write(struct rw_semaphore *sem)
{
+ long tmp;
+
DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
rwsem_clear_owner(sem);
- if (unlikely(atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED,
- &sem->count) & RWSEM_FLAG_WAITERS))
+ tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
+ if (unlikely(tmp & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
}
@@ -909,7 +879,6 @@ void __sched down_read(struct rw_semaphore *sem)
LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
-
EXPORT_SYMBOL(down_read);
int __sched down_read_killable(struct rw_semaphore *sem)
@@ -924,7 +893,6 @@ int __sched down_read_killable(struct rw_semaphore *sem)
return 0;
}
-
EXPORT_SYMBOL(down_read_killable);
/*
@@ -938,7 +906,6 @@ int down_read_trylock(struct rw_semaphore *sem)
rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
return ret;
}
-
EXPORT_SYMBOL(down_read_trylock);
/*
@@ -948,10 +915,8 @@ void __sched down_write(struct rw_semaphore *sem)
{
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(down_write);
/*
@@ -962,14 +927,14 @@ int __sched down_write_killable(struct rw_semaphore *sem)
might_sleep();
rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
- if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+ if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+ __down_write_killable)) {
rwsem_release(&sem->dep_map, 1, _RET_IP_);
return -EINTR;
}
return 0;
}
-
EXPORT_SYMBOL(down_write_killable);
/*
@@ -984,7 +949,6 @@ int down_write_trylock(struct rw_semaphore *sem)
return ret;
}
-
EXPORT_SYMBOL(down_write_trylock);
/*
@@ -993,10 +957,8 @@ EXPORT_SYMBOL(down_write_trylock);
void up_read(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
__up_read(sem);
}
-
EXPORT_SYMBOL(up_read);
/*
@@ -1005,10 +967,8 @@ EXPORT_SYMBOL(up_read);
void up_write(struct rw_semaphore *sem)
{
rwsem_release(&sem->dep_map, 1, _RET_IP_);
-
__up_write(sem);
}
-
EXPORT_SYMBOL(up_write);
/*
@@ -1017,10 +977,8 @@ EXPORT_SYMBOL(up_write);
void downgrade_write(struct rw_semaphore *sem)
{
lock_downgrade(&sem->dep_map, _RET_IP_);
-
__downgrade_write(sem);
}
-
EXPORT_SYMBOL(downgrade_write);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -1029,40 +987,32 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
{
might_sleep();
rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
-
EXPORT_SYMBOL(down_read_nested);
void _down_write_nest_lock(struct rw_semaphore *sem, struct lockdep_map *nest)
{
might_sleep();
rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(_down_write_nest_lock);
void down_read_non_owner(struct rw_semaphore *sem)
{
might_sleep();
-
__down_read(sem);
__rwsem_set_reader_owned(sem, NULL);
}
-
EXPORT_SYMBOL(down_read_non_owner);
void down_write_nested(struct rw_semaphore *sem, int subclass)
{
might_sleep();
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
-
LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}
-
EXPORT_SYMBOL(down_write_nested);
int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
@@ -1070,14 +1020,14 @@ int __sched down_write_killable_nested(struct rw_semaphore *sem, int subclass)
might_sleep();
rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
- if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock, __down_write_killable)) {
+ if (LOCK_CONTENDED_RETURN(sem, __down_write_trylock,
+ __down_write_killable)) {
rwsem_release(&sem->dep_map, 1, _RET_IP_);
return -EINTR;
}
return 0;
}
-
EXPORT_SYMBOL(down_write_killable_nested);
void up_read_non_owner(struct rw_semaphore *sem)
@@ -1086,7 +1036,6 @@ void up_read_non_owner(struct rw_semaphore *sem)
sem);
__up_read(sem);
}
-
EXPORT_SYMBOL(up_read_non_owner);
#endif
Commit-ID: 990fa7384a3057a3298bcf493651c6e14416c47c
Gitweb: https://git.kernel.org/tip/990fa7384a3057a3298bcf493651c6e14416c47c
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:08 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:01 +0200
locking/rwsem: More optimal RT task handling of null owner
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, the
high priority of the RT task may block forward progress of the lock
holder if both happen to reside on the same CPU. This will lead to
deadlock. So we have to make sure that an RT task will not spin on a
reader-owned rwsem.
When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:
1) The lock owner is in the process of releasing the lock, sem->owner
is cleared but the lock has not been released yet.
2) The lock was free and owner cleared, but another task just comes
in and acquires the lock before we try to get it. The new owner may
be a spinnable writer.
So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
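The retry rule can be sketched as a pure decision function. This is a
simplified model, not the kernel code: the demo_* names are hypothetical,
the owner states are reduced to three plain enum values, and the earlier
"owner is not spinnable" exit from the loop is assumed to have been
handled before this function is called.

```c
#include <stdbool.h>

/* Hypothetical, simplified mirror of the owner states in the patch. */
enum demo_owner_state {
	DEMO_OWNER_NULL,
	DEMO_OWNER_WRITER,
	DEMO_OWNER_READER,
};

/*
 * Decide whether a spinner should give up after a failed trylock.
 *
 * cur:  owner state seen in this loop iteration
 * prev: owner state seen in the previous iteration
 *
 * Returns true if spinning should stop.
 */
static bool demo_should_stop_spinning(enum demo_owner_state cur,
				      enum demo_owner_state prev,
				      bool resched_needed, bool is_rt_task)
{
	/* While spinning on a writer, the resched check is done elsewhere. */
	if (cur == DEMO_OWNER_WRITER)
		return false;

	if (resched_needed)
		return true;

	/*
	 * An RT task may retry once, and only if the previous iteration
	 * saw a writer, i.e. the owner has just been cleared or replaced.
	 */
	return is_rt_task && prev != DEMO_OWNER_WRITER;
}
```

Because prev is updated on every iteration, an RT task that sees a
non-writer owner twice in a row stops spinning, which bounds the extra
retry to a single pass.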
When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contention. The table below shows the locking rates (in kops/s)
for various numbers of write locking threads before and after the patch.
  Locking threads    Pre-patch    Post-patch
  ---------------    ---------    ----------
          4            2,753        2,608
          8            2,529        2,520
         16            1,727        1,918
         32            1,263        1,956
         64              889        1,343
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 51 +++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 44 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 5532304406f7..e1840b7c5310 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -566,6 +566,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
bool taken = false;
+ int prev_owner_state = OWNER_NULL;
preempt_disable();
@@ -583,7 +584,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+ for (;;) {
+ enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+ if (!(owner_state & OWNER_SPINNABLE))
+ break;
+
/*
* Try to acquire the lock
*/
@@ -593,13 +599,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
}
/*
- * When there's no owner, we might have preempted between the
- * owner acquiring the lock and setting the owner field. If
- * we're an RT task that will live-lock because we won't let
- * the owner complete.
+ * An RT task cannot do optimistic spinning if it cannot
+ * be sure the lock holder is running or live-lock may
+ * happen if the current task and the lock holder happen
+ * to run in the same CPU. However, aborting optimistic
+ * spinning while a NULL owner is detected may miss some
+ * opportunity where spinning can continue without causing
+ * problem.
+ *
+ * There are 2 possible cases where an RT task may be able
+ * to continue spinning.
+ *
+ * 1) The lock owner is in the process of releasing the
+ * lock, sem->owner is cleared but the lock has not
+ * been released yet.
+ * 2) The lock was free and owner cleared, but another
+ * task just comes in and acquires the lock before
+ * we try to get it. The new owner may be a spinnable
+ * writer.
+ *
+ * To take advantage of two scenarios listed above, the RT
+ * task is made to retry one more time to see if it can
+ * acquire the lock or continue spinning on the new owning
+ * writer. Of course, if the time lag is long enough or the
+ * new owner is not a writer or spinnable, the RT task will
+ * quit spinning.
+ *
+ * If the owner is a writer, the need_resched() check is
+ * done inside rwsem_spin_on_owner(). If the owner is not
+ * a writer, need_resched() check needs to be done here.
*/
- if (!sem->owner && (need_resched() || rt_task(current)))
- break;
+ if (owner_state != OWNER_WRITER) {
+ if (need_resched())
+ break;
+ if (rt_task(current) &&
+ (prev_owner_state != OWNER_WRITER))
+ break;
+ }
+ prev_owner_state = owner_state;
/*
* The cpu_relax() call is a compiler barrier which forces
Commit-ID: 00f3c5a3df2c1e3dab14d0dd2b71f852d46be97f
Gitweb: https://git.kernel.org/tip/00f3c5a3df2c1e3dab14d0dd2b71f852d46be97f
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:07 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:00 +0200
locking/rwsem: Always release wait_lock before waking up tasks
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though: when the
writer in the slowpath detects that there are waiters ahead but the
rwsem is not held by a writer, it wakes them up with the wait_lock held.
This can lead to a long wait_lock hold time, especially when a large
number of readers are to be woken up.
Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
using a stale count value.
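The unlock/wake/relock pattern can be modelled in user space as follows.
This is a single-threaded sketch under stated assumptions: an atomic_bool
stands in for the raw wait_lock spinlock, a small array stands in for
wake_q, and "waking" a waiter just sets a flag. All demo_* names are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_WAKE 8

struct demo_waiter {
	bool woken;
	bool lock_held_at_wake;	/* recorded for illustration only */
};

struct demo_sem {
	atomic_bool wait_lock;				/* stand-in for sem->wait_lock */
	struct demo_waiter *pending[DEMO_MAX_WAKE];	/* stand-in for wake_q */
	size_t npending;
};

static void demo_lock(struct demo_sem *sem)
{
	bool expected = false;

	/* Spin until the flag flips from false to true. */
	while (!atomic_compare_exchange_weak_explicit(&sem->wait_lock,
						      &expected, true,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = false;
}

static void demo_unlock(struct demo_sem *sem)
{
	atomic_store_explicit(&sem->wait_lock, false, memory_order_release);
}

/* Called with wait_lock held; returns with wait_lock held. */
static void demo_wake_pending(struct demo_sem *sem)
{
	struct demo_waiter *batch[DEMO_MAX_WAKE];
	size_t i, n = sem->npending;

	/* Analogue of the wake_q_empty() check: keep the lock if idle. */
	if (n == 0)
		return;

	/* Detach the pending waiters while still holding the lock. */
	for (i = 0; i < n; i++)
		batch[i] = sem->pending[i];
	sem->npending = 0;

	/* Do the potentially long wakeup work without the lock. */
	demo_unlock(sem);
	for (i = 0; i < n; i++) {
		batch[i]->lock_held_at_wake = atomic_load(&sem->wait_lock);
		batch[i]->woken = true;
	}
	demo_lock(sem);
}
```

Since the lock is dropped in the middle, any state read before the call
may be stale afterwards, which is the reason for re-reading the count
inside the trylock helper rather than passing it in.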
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched/wake_q.h | 5 +++++
kernel/locking/rwsem.c | 31 +++++++++++++++----------------
2 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
head->lastp = &head->first;
}
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+ return head->first == WAKE_Q_TAIL;
+}
+
extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct *task);
extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index decda9fb8c6d..5532304406f7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -400,13 +400,14 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* If wstate is WRITER_HANDOFF, it will make sure that either the handoff
* bit is set or the lock is acquired with handoff bit cleared.
*/
-static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
enum writer_wait_state wstate)
{
- long new;
+ long count, new;
lockdep_assert_held(&sem->wait_lock);
+ count = atomic_long_read(&sem->count);
do {
bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
@@ -751,26 +752,25 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
? RWSEM_WAKE_READERS
: RWSEM_WAKE_ANY, &wake_q);
- /*
- * The wakeup is normally called _after_ the wait_lock
- * is released, but given that we are proactively waking
- * readers we can deal with the wake_q overhead as it is
- * similar to releasing and taking the wait_lock again
- * for attempting rwsem_try_write_lock().
- */
- wake_up_q(&wake_q);
-
- /* We need wake_q again below, reinitialize */
- wake_q_init(&wake_q);
+ if (!wake_q_empty(&wake_q)) {
+ /*
+ * We want to minimize wait_lock hold time especially
+ * when a large number of readers are to be woken up.
+ */
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ wake_q_init(&wake_q); /* Used again, reinit */
+ raw_spin_lock_irq(&sem->wait_lock);
+ }
} else {
- count = atomic_long_add_return(RWSEM_FLAG_WAITERS, &sem->count);
+ atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
}
wait:
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
- if (rwsem_try_write_lock(count, sem, wstate))
+ if (rwsem_try_write_lock(sem, wstate))
break;
raw_spin_unlock_irq(&sem->wait_lock);
@@ -811,7 +811,6 @@ wait:
}
raw_spin_lock_irq(&sem->wait_lock);
- count = atomic_long_read(&sem->count);
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
Commit-ID: d3681e269fff84048c94012342c3434b227c4706
Gitweb: https://git.kernel.org/tip/d3681e269fff84048c94012342c3434b227c4706
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:09 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:02 +0200
locking/rwsem: Wake up almost all readers in wait queue
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, those readers
behind the writer will not be woken up.
Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.
Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.
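The selection policy can be sketched on a plain array. This is an
illustrative model, not the kernel list code: the demo_* names are
hypothetical, the queue is an array instead of a list_head, and the
caller is assumed to have already checked that the first waiter is a
reader.

```c
#include <stddef.h>

/* Same value as MAX_READERS_WAKEUP in the patch. */
#define DEMO_MAX_READERS_WAKEUP 0x100

enum demo_waiter_type {
	DEMO_WAITING_FOR_READ,
	DEMO_WAITING_FOR_WRITE,
};

struct demo_waiter {
	int id;
	enum demo_waiter_type type;
};

/*
 * Move up to max_wake readers from queue[] into woken[], preserving
 * order. Writers are skipped rather than ending the scan, and they
 * keep their relative order in the queue. Readers beyond the limit
 * also stay queued. Returns the number of readers selected and
 * updates *nqueue to the number of waiters left behind.
 */
static size_t demo_collect_readers(struct demo_waiter *queue, size_t *nqueue,
				   struct demo_waiter *woken, size_t max_wake)
{
	size_t i, kept = 0, nwoken = 0;

	for (i = 0; i < *nqueue; i++) {
		if (queue[i].type == DEMO_WAITING_FOR_READ && nwoken < max_wake)
			woken[nwoken++] = queue[i];
		else
			queue[kept++] = queue[i];
	}
	*nqueue = kept;
	return nwoken;
}
```

A caller modelling the patch would pass DEMO_MAX_READERS_WAKEUP as
max_wake; a smaller limit is handy for seeing the cap take effect.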
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
  # of Threads    Pre-patch    Post-patch
  ------------    ---------    ----------
        4           1,641        1,674
        8             731        1,062
       16             564          924
       32              78          300
       64              38          195
      240              50          149
There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e1840b7c5310..ded96023f4dc 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
*/
#define RWSEM_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP 0x100
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
@@ -329,11 +337,17 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
}
/*
- * Grant an infinite number of read locks to the readers at the front
- * of the queue. We know that woken will be at least 1 as we accounted
+ * Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+ * queue. We know that the woken will be at least 1 as we accounted
* for above. Note we increment the 'active part' of the count by the
* number of readers before waking any processes up.
*
+ * This is an adaptation of the phase-fair R/W locks where at the
+ * reader phase (first waiter is a reader), all readers are eligible
+ * to acquire the lock at the same time irrespective of their order
+ * in the queue. The writers acquire the lock according to their
+ * order in the queue.
+ *
* We have to do wakeup in 2 passes to prevent the possibility that
* the reader count may be decremented before it is incremented. It
* is because the to-be-woken waiter may not have slept yet. So it
@@ -345,13 +359,20 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* 2) For each waiters in the new list, clear waiter->task and
* put them into wake_q to be woken up later.
*/
- list_for_each_entry(waiter, &sem->wait_list, list) {
+ INIT_LIST_HEAD(&wlist);
+ list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
if (waiter->type == RWSEM_WAITING_FOR_WRITE)
- break;
+ continue;
woken++;
+ list_move_tail(&waiter->list, &wlist);
+
+ /*
+ * Limit # of readers that can be woken up per wakeup call.
+ */
+ if (woken >= MAX_READERS_WAKEUP)
+ break;
}
- list_cut_before(&wlist, &sem->wait_list, &waiter->list);
adjustment = woken * RWSEM_READER_BIAS - adjustment;
lockevent_cond_inc(rwsem_wake_reader, woken);
Commit-ID: cf69482d62d996d3ce840eeead8e160de281ac6c
Gitweb: https://git.kernel.org/tip/cf69482d62d996d3ce840eeead8e160de281ac6c
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:11 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:05 +0200
locking/rwsem: Enable readers spinning on writer
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly. The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
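The unqueued reader trylock that the spinning loop relies on follows a
check, add, re-check, back-out sequence. Below is a user-space sketch of
that sequence with C11 atomics. The demo_* name and the DEMO_* bit
layout are assumptions made for illustration, and the owner bookkeeping
and lock-event counting done by the kernel helper are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit layout chosen for illustration only. */
#define DEMO_WRITER_LOCKED	(1L << 0)
#define DEMO_FLAG_HANDOFF	(1L << 2)
#define DEMO_READER_BIAS	(1L << 8)

#define DEMO_READ_BLOCKED	(DEMO_WRITER_LOCKED | DEMO_FLAG_HANDOFF)

/*
 * Try to take a read lock without queueing.
 * Returns true if the reader count was bumped and the lock is held.
 */
static bool demo_try_read_lock_unqueued(atomic_long *count)
{
	long c = atomic_load_explicit(count, memory_order_relaxed);

	/* Cheap early exit: a writer holds the lock or a handoff is pending. */
	if (c & DEMO_READ_BLOCKED)
		return false;

	/* Optimistically add the reader bias and look at the old value. */
	c = atomic_fetch_add_explicit(count, DEMO_READER_BIAS,
				      memory_order_acquire);
	if (!(c & DEMO_READ_BLOCKED))
		return true;

	/* A writer or handoff slipped in between: back out the change. */
	atomic_fetch_sub_explicit(count, DEMO_READER_BIAS,
				  memory_order_relaxed);
	return false;
}
```

The initial plain load avoids a read-modify-write on the shared count
when the attempt is certain to fail, which matters when many spinners
poll the same cache line.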
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:
  # of Threads    Pre-patch    Post-patch
  ------------    ---------    ----------
        4           1,674        1,684
        8           1,062        1,074
       16             924          900
       32             300          458
       64             195          208
      128             164          168
      240             149          143
The performance change wasn't significant in this case, but this change
is required by a follow-on patch.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem.c | 86 +++++++++++++++++++++++++++++++++------
2 files changed, 75 insertions(+), 12 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */
LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
+LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 180455b6b0d4..985a03ad3f8c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -457,6 +457,30 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
}
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+ return false;
+
+ count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, &sem->count);
+ if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+ rwsem_set_reader_owned(sem);
+ lockevent_inc(rwsem_opt_rlock);
+ return true;
+ }
+
+ /* Back out the change */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ return false;
+}
+
/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
@@ -491,9 +515,12 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
- if (need_resched())
+ if (need_resched()) {
+ lockevent_inc(rwsem_opt_fail);
return false;
+ }
+ preempt_disable();
rcu_read_lock();
owner = READ_ONCE(sem->owner);
if (owner) {
@@ -501,6 +528,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner_on_cpu(owner);
}
rcu_read_unlock();
+ preempt_enable();
+
+ lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
}
@@ -578,7 +608,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
return state;
}
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
bool taken = false;
int prev_owner_state = OWNER_NULL;
@@ -586,9 +616,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
preempt_disable();
/* sem->wait_lock should not be held when doing optimistic spinning */
- if (!rwsem_can_spin_on_owner(sem))
- goto done;
-
if (!osq_lock(&sem->osq))
goto done;
@@ -608,10 +635,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Try to acquire the lock
*/
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
+ taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+ : rwsem_try_read_lock_unqueued(sem);
+
+ if (taken)
break;
- }
/*
* An RT task cannot do optimistic spinning if it cannot
@@ -668,7 +696,12 @@ done:
return taken;
}
#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+{
+ return false;
+}
+
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
return false;
}
@@ -684,6 +717,31 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ if (!rwsem_can_spin_on_owner(sem))
+ goto queue;
+
+ /*
+ * Undo read bias from down_read() and do optimistic spinning.
+ */
+ atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+ adjustment = 0;
+ if (rwsem_optimistic_spin(sem, false)) {
+ /*
+ * Wake up other readers in the wait list if the front
+ * waiter is a reader.
+ */
+ if ((atomic_long_read(&sem->count) & RWSEM_FLAG_WAITERS)) {
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (!list_empty(&sem->wait_list))
+ rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
+ &wake_q);
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ }
+ return sem;
+ }
+
+queue:
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
@@ -696,7 +754,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
* exit the slowpath and return immediately as its
* RWSEM_READER_BIAS has already been set in the count.
*/
- if (!(atomic_long_read(&sem->count) &
+ if (adjustment && !(atomic_long_read(&sem->count) &
(RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
raw_spin_unlock_irq(&sem->wait_lock);
rwsem_set_reader_owned(sem);
@@ -708,7 +766,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
list_add_tail(&waiter.list, &sem->wait_list);
/* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
+ if (adjustment)
+ count = atomic_long_add_return(adjustment, &sem->count);
+ else
+ count = atomic_long_read(&sem->count);
/*
* If there are no active locks, wake the front queued process(es).
@@ -767,7 +828,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
+ if (rwsem_can_spin_on_owner(sem) &&
+ rwsem_optimistic_spin(sem, true))
return sem;
/*
Commit-ID: 7d43f1ce9dd075d8b2aa3ad1f3970ef386a5c358
Gitweb: https://git.kernel.org/tip/7d43f1ce9dd075d8b2aa3ad1f3970ef386a5c358
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:13 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:07 +0200
locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.
This patch provides a simple mechanism that lets a writer spin on a
reader-owned rwsem. The spinning is bounded by a time threshold that
can vary from 10us to 25us depending on the condition of the rwsem.
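As a rough illustration of how the 10us to 25us range arises, the heuristic
in rwsem_rspin_threshold() in the diff below can be modelled in user space.
This is only a sketch: rspin_budget_ns() and MAX_COUNTED_READERS are
hypothetical names, and the function returns a relative budget in
nanoseconds instead of an absolute sched_clock() deadline.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC		1000ULL
#define MAX_COUNTED_READERS	30	/* cap described in the patch comment */

/*
 * User-space model of the spinning budget: (10 + readers/2) us,
 * capped at 25 us once 30 or more readers hold the lock.
 * Hypothetical helper, not a kernel function.
 */
static uint64_t rspin_budget_ns(int readers)
{
	assert(readers >= 0);

	if (readers > MAX_COUNTED_READERS)
		readers = MAX_COUNTED_READERS;

	/* (20 + readers) / 2 us, computed in ns to keep the half-us step */
	return (20 + (uint64_t)readers) * NSEC_PER_USEC / 2;
}
```

With no readers counted the budget is 10us; it grows by 0.5us per reader
and stops growing at 25us.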
When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer owned again. For
fairness, not even readers are allowed to acquire the reader-locked
rwsem by optimistic spinning.
We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drops down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.
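The life cycle of the two nonspinnable bits described above can be
sketched with C11 atomics in user space. This is a simplified model under
stated assumptions, not the kernel code: struct model_sem and the model_*()
helpers are hypothetical names, and a plain unsigned long stands in for the
task structure pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define READER_OWNED	(1UL << 0)
#define RD_NONSPINNABLE	(1UL << 1)
#define WR_NONSPINNABLE	(1UL << 2)
#define NONSPINNABLE	(RD_NONSPINNABLE | WR_NONSPINNABLE)

struct model_sem {
	atomic_ulong owner;	/* task pointer bits plus the flags above */
};

/* A spinning writer timed out: set both bits if still reader-owned. */
static void model_set_nonspinnable(struct model_sem *sem)
{
	unsigned long owner = atomic_load(&sem->owner);

	do {
		if (!(owner & READER_OWNED))
			break;
		if (owner & NONSPINNABLE)
			break;
		/* on failure, owner is reloaded and the checks are redone */
	} while (!atomic_compare_exchange_weak(&sem->owner, &owner,
					       owner | NONSPINNABLE));
}

/* Reader count reached 0: writers may spin again, readers may not. */
static void model_clear_wr_nonspinnable(struct model_sem *sem)
{
	atomic_fetch_and(&sem->owner, ~WR_NONSPINNABLE);
}

/* A writer took the lock: storing its pointer clears every flag. */
static void model_set_writer_owner(struct model_sem *sem, unsigned long task)
{
	assert((task & (READER_OWNED | NONSPINNABLE)) == 0);
	atomic_store(&sem->owner, task);
}

/* Would a spinner of the given kind be allowed to spin right now? */
static bool model_can_spin(struct model_sem *sem, unsigned long nonspinnable)
{
	return !(atomic_load(&sem->owner) & nonspinnable);
}
```

In this model, a timeout blocks both kinds of spinners, the last reader
unlock re-enables only writers, and a subsequent write-lock acquisition
re-enables readers as well.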
The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.
System                        Time for 16 Iterations
------                        ----------------------
1-socket Skylake              ~800ns
4-socket Broadwell            ~300ns
2-socket ThunderX2 (arm64)    ~250ns
When the lock cacheline is contended, we can see an almost 10X
increase in elapsed time. So 25us will be at most 500, 1300 and 1600
iterations, respectively, for each of the above systems.
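The iteration counts quoted above follow from dividing the 25us budget by
the uncontended per-iteration cost. A small sketch of that arithmetic,
where spin_iterations() is a hypothetical helper and the inputs are the
sample timings from the table:

```c
#include <assert.h>

/*
 * Number of loop iterations that fit into budget_ns, given the
 * measured time for 16 iterations. Hypothetical helper used only
 * to reproduce the arithmetic in the commit message.
 */
static unsigned long spin_iterations(unsigned long budget_ns,
				     unsigned long ns_per_16_iters)
{
	assert(ns_per_16_iters > 0);

	return budget_ns * 16 / ns_per_16_iters;
}
```

For a 25000ns budget this gives 500 iterations at 800ns, 1333 at 300ns
(rounded to 1300 in the text) and 1600 at 250ns. Under cacheline
contention each iteration is slower, so fewer iterations fit.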
With a locking microbenchmark running on a 5.1-based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
# of Threads    Pre-patch    Post-patch
------------    ---------    ----------
     2            1,759        6,684
     4            1,684        6,738
     8            1,074        7,222
    16              900        7,163
    32              458        7,316
    64              208          520
   128              168          425
   240              143          474
This patch gives a big boost in performance for mixed reader/writer
workloads.
With 32 locking threads, the rwsem lock event data were:
rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663
With 64 locking threads, the data looked like:
rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281
So with 64 threads, a lot more threads acquired the lock in the slowpath
and more threads went to sleep.
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/lock_events_list.h | 1 +
kernel/locking/rwsem.c | 173 +++++++++++++++++++++++++++++++-------
2 files changed, 144 insertions(+), 30 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
+LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index fae557be8334..2d7cabcfca50 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
#include <linux/sched/debug.h>
#include <linux/sched/wake_q.h>
#include <linux/sched/signal.h>
+#include <linux/sched/clock.h>
#include <linux/export.h>
#include <linux/rwsem.h>
#include <linux/atomic.h>
@@ -31,24 +32,28 @@
#include "lock_events.h"
/*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value has the following
* meanings when set.
* - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- * - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- * The rwsem is anonymously owned, i.e. the owner(s) cannot be
- * readily determined. It can be reader owned or the owning writer
- * is indeterminate.
+ * - Bit 1: RWSEM_RD_NONSPINNABLE - Readers cannot spin on this lock.
+ * - Bit 2: RWSEM_WR_NONSPINNABLE - Writers cannot spin on this lock.
*
+ * When the rwsem is either owned by an anonymous writer, or it is
+ * reader-owned, but a spinning writer has timed out, both nonspinnable
+ * bits will be set to disable optimistic spinning by readers and writers.
+ * In the later case, the last unlocking reader should then check the
+ * writer nonspinnable bit and clear it only to give writers preference
+ * to acquire the lock via optimistic spinning, but not readers. Similar
+ * action is also done in the reader slowpath.
+
* When a writer acquires a rwsem, it puts its task_struct pointer
* into the owner field. It is cleared after an unlock.
*
* When a reader acquires a rwsem, it will also puts its task_struct
- * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
- * largely be left untouched. So for a free or reader-owned rwsem,
- * the owner value may contain information about the last reader that
- * acquires the rwsem. The anonymous bit is set because that particular
- * reader may or may not still own the lock.
+ * pointer into the owner field with the RWSEM_READER_OWNED bit set.
+ * On unlock, the owner field will largely be left untouched. So
+ * for a free or reader-owned rwsem, the owner value may contain
+ * information about the last reader that acquires the rwsem.
*
* That information may be helpful in debugging cases where the system
* seems to hang on a reader owned rwsem especially if only one reader
@@ -56,7 +61,9 @@
* a rwsem, but the overhead is simply too big.
*/
#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_NONSPINNABLE (1UL << 1)
+#define RWSEM_RD_NONSPINNABLE (1UL << 1)
+#define RWSEM_WR_NONSPINNABLE (1UL << 2)
+#define RWSEM_NONSPINNABLE (RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE)
#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
#ifdef CONFIG_DEBUG_RWSEMS
@@ -141,7 +148,7 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
atomic_long_set(&sem->owner, val);
}
@@ -191,6 +198,23 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
}
#endif
+/*
+ * Set the RWSEM_NONSPINNABLE bits if the RWSEM_READER_OWNED flag
+ * remains set. Otherwise, the operation will be aborted.
+ */
+static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
+{
+ unsigned long owner = atomic_long_read(&sem->owner);
+
+ do {
+ if (!(owner & RWSEM_READER_OWNED))
+ break;
+ if (owner & RWSEM_NONSPINNABLE)
+ break;
+ } while (!atomic_long_try_cmpxchg(&sem->owner, &owner,
+ owner | RWSEM_NONSPINNABLE));
+}
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -546,7 +570,8 @@ static inline bool owner_on_cpu(struct task_struct *owner)
return owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
}
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ unsigned long nonspinnable)
{
struct task_struct *owner;
unsigned long flags;
@@ -562,7 +587,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
preempt_disable();
rcu_read_lock();
owner = rwsem_owner_flags(sem, &flags);
- if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+ if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
ret = false;
rcu_read_unlock();
preempt_enable();
@@ -588,12 +613,12 @@ enum owner_state {
OWNER_READER = 1 << 2,
OWNER_NONSPINNABLE = 1 << 3,
};
-#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
+#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER | OWNER_READER)
static inline enum owner_state
-rwsem_owner_state(struct task_struct *owner, unsigned long flags)
+rwsem_owner_state(struct task_struct *owner, unsigned long flags, unsigned long nonspinnable)
{
- if (flags & RWSEM_NONSPINNABLE)
+ if (flags & nonspinnable)
return OWNER_NONSPINNABLE;
if (flags & RWSEM_READER_OWNED)
@@ -602,14 +627,15 @@ rwsem_owner_state(struct task_struct *owner, unsigned long flags)
return owner ? OWNER_WRITER : OWNER_NULL;
}
-static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline enum owner_state
+rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
{
struct task_struct *new, *owner;
unsigned long flags, new_flags;
enum owner_state state;
owner = rwsem_owner_flags(sem, &flags);
- state = rwsem_owner_state(owner, flags);
+ state = rwsem_owner_state(owner, flags, nonspinnable);
if (state != OWNER_WRITER)
return state;
@@ -622,7 +648,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
new = rwsem_owner_flags(sem, &new_flags);
if ((new != owner) || (new_flags != flags)) {
- state = rwsem_owner_state(new, new_flags);
+ state = rwsem_owner_state(new, new_flags, nonspinnable);
break;
}
@@ -646,10 +672,39 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
return state;
}
+/*
+ * Calculate reader-owned rwsem spinning threshold for writer
+ *
+ * The more readers own the rwsem, the longer it will take for them to
+ * wind down and free the rwsem. So the empirical formula used to
+ * determine the actual spinning time limit here is:
+ *
+ * Spinning threshold = (10 + nr_readers/2)us
+ *
+ * The limit is capped to a maximum of 25us (30 readers). This is just
+ * a heuristic and is subjected to change in the future.
+ */
+static inline u64 rwsem_rspin_threshold(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+ int readers = count >> RWSEM_READER_SHIFT;
+ u64 delta;
+
+ if (readers > 30)
+ readers = 30;
+ delta = (20 + readers) * NSEC_PER_USEC / 2;
+
+ return sched_clock() + delta;
+}
+
static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
bool taken = false;
int prev_owner_state = OWNER_NULL;
+ int loop = 0;
+ u64 rspin_threshold = 0;
+ unsigned long nonspinnable = wlock ? RWSEM_WR_NONSPINNABLE
+ : RWSEM_RD_NONSPINNABLE;
preempt_disable();
@@ -661,12 +716,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
* Optimistically spin on the owner field and attempt to acquire the
* lock whenever the owner changes. Spinning will be stopped when:
* 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
+ * 2) readers own the lock and spinning time has exceeded limit.
*/
for (;;) {
- enum owner_state owner_state = rwsem_spin_on_owner(sem);
+ enum owner_state owner_state;
+ owner_state = rwsem_spin_on_owner(sem, nonspinnable);
if (!(owner_state & OWNER_SPINNABLE))
break;
@@ -679,6 +734,38 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
if (taken)
break;
+ /*
+ * Time-based reader-owned rwsem optimistic spinning
+ */
+ if (wlock && (owner_state == OWNER_READER)) {
+ /*
+ * Re-initialize rspin_threshold every time when
+ * the owner state changes from non-reader to reader.
+ * This allows a writer to steal the lock in between
+ * 2 reader phases and have the threshold reset at
+ * the beginning of the 2nd reader phase.
+ */
+ if (prev_owner_state != OWNER_READER) {
+ if (rwsem_test_oflags(sem, nonspinnable))
+ break;
+ rspin_threshold = rwsem_rspin_threshold(sem);
+ loop = 0;
+ }
+
+ /*
+ * Check time threshold once every 16 iterations to
+ * avoid calling sched_clock() too frequently so
+ * as to reduce the average latency between the times
+ * when the lock becomes free and when the spinner
+ * is ready to do a trylock.
+ */
+ else if (!(++loop & 0xf) && (sched_clock() > rspin_threshold)) {
+ rwsem_set_nonspinnable(sem);
+ lockevent_inc(rwsem_opt_nospin);
+ break;
+ }
+ }
+
/*
* An RT task cannot do optimistic spinning if it cannot
* be sure the lock holder is running or live-lock may
@@ -733,8 +820,25 @@ done:
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
+
+/*
+ * Clear the owner's RWSEM_WR_NONSPINNABLE bit if it is set. This should
+ * only be called when the reader count reaches 0.
+ *
+ * This give writers better chance to acquire the rwsem first before
+ * readers when the rwsem was being held by readers for a relatively long
+ * period of time. Race can happen that an optimistic spinner may have
+ * just stolen the rwsem and set the owner, but just clearing the
+ * RWSEM_WR_NONSPINNABLE bit will do no harm anyway.
+ */
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
+{
+ if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
+ atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
+}
#else
-static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
+static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
+ unsigned long nonspinnable)
{
return false;
}
@@ -743,6 +847,8 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
{
return false;
}
+
+static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
#endif
/*
@@ -752,10 +858,11 @@ static struct rw_semaphore __sched *
rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
+ bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
- if (!rwsem_can_spin_on_owner(sem))
+ if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
goto queue;
/*
@@ -815,8 +922,12 @@ queue:
* If there are no writers and we are first in the queue,
* wake our own waiter to join the existing active readers !
*/
- if (!(count & RWSEM_LOCK_MASK) ||
- (!(count & RWSEM_WRITER_MASK) && (adjustment & RWSEM_FLAG_WAITERS)))
+ if (!(count & RWSEM_LOCK_MASK)) {
+ clear_wr_nonspinnable(sem);
+ wake = true;
+ }
+ if (wake || (!(count & RWSEM_WRITER_MASK) &&
+ (adjustment & RWSEM_FLAG_WAITERS)))
rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
@@ -866,7 +977,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
DEFINE_WAKE_Q(wake_q);
/* do optimistic spinning and steal lock if possible */
- if (rwsem_can_spin_on_owner(sem) &&
+ if (rwsem_can_spin_on_owner(sem, RWSEM_WR_NONSPINNABLE) &&
rwsem_optimistic_spin(sem, true))
return sem;
@@ -1124,8 +1235,10 @@ inline void __up_read(struct rw_semaphore *sem)
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
- RWSEM_FLAG_WAITERS))
+ RWSEM_FLAG_WAITERS)) {
+ clear_wr_nonspinnable(sem);
rwsem_wake(sem, tmp);
+ }
}
/*
Commit-ID: 94a9717b3c40e77a54e4afacd8f19a9a86bfeead
Gitweb: https://git.kernel.org/tip/94a9717b3c40e77a54e4afacd8f19a9a86bfeead
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:12 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:06 +0200
locking/rwsem: Make rwsem->owner an atomic_long_t
The rwsem->owner contains not just the task structure pointer, it also
holds some flags for storing the current state of the rwsem. Some of
the flags may have to be atomically updated. To reflect the new reality,
the owner is now changed to an atomic_long_t type.
New helper functions are added to properly separate out the task
structure pointer and the embedded flags.
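The idea of keeping a task pointer and state flags in one atomically
updated word can be sketched in user space as follows. This is an
illustration only: struct model_task, struct owner_word and the model_*()
helpers are hypothetical names, and the two-bit mask reflects the flags
defined at this point in the series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Two low flag bits; task pointers are assumed to be 4-byte aligned. */
#define OWNER_FLAGS_MASK	((uintptr_t)0x3)

struct model_task {		/* stand-in for struct task_struct */
	int pid;
};

struct owner_word {
	atomic_uintptr_t owner;	/* pointer in the high bits, flags in the low */
};

/* Store the pointer and flags together in a single atomic write. */
static void model_set_owner(struct owner_word *w, struct model_task *t,
			    uintptr_t flags)
{
	assert(((uintptr_t)t & OWNER_FLAGS_MASK) == 0);
	assert((flags & ~OWNER_FLAGS_MASK) == 0);

	atomic_store(&w->owner, (uintptr_t)t | flags);
}

/*
 * Read the word once and split it, so that the pointer and the flags
 * come from the same snapshot. pflags must be non-NULL.
 */
static struct model_task *model_owner_flags(struct owner_word *w,
					    uintptr_t *pflags)
{
	uintptr_t owner = atomic_load(&w->owner);

	*pflags = owner & OWNER_FLAGS_MASK;
	return (struct model_task *)(owner & ~OWNER_FLAGS_MASK);
}
```

Reading the word once and masking afterwards is what keeps the pointer and
the flags consistent with each other; two separate reads could observe
different owners.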
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/percpu-rwsem.h | 4 +-
include/linux/rwsem.h | 11 ++--
kernel/locking/rwsem.c | 125 +++++++++++++++++++++++++++----------------
3 files changed, 88 insertions(+), 52 deletions(-)
diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 03cb4b6f842e..0a43830f1932 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
lock_release(&sem->rw_sem.dep_map, 1, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
+ atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
#endif
}
@@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
if (!read)
- sem->rw_sem.owner = current;
+ atomic_long_set(&sem->rw_sem.owner, (long)current);
#endif
}
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index bb76e82398b2..e401358c4e7e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -35,10 +35,11 @@
struct rw_semaphore {
atomic_long_t count;
/*
- * Write owner or one of the read owners. Can be used as a
- * speculative check to see if the owner is running on the cpu.
+ * Write owner or one of the read owners as well flags regarding
+ * the current state of the rwsem. Can be used as a speculative
+ * check to see if the write owner is running on the cpu.
*/
- struct task_struct *owner;
+ atomic_long_t owner;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
#endif
@@ -53,7 +54,7 @@ struct rw_semaphore {
* Setting all bits of the owner field except bit 0 will indicate
* that the rwsem is writer-owned with an unknown owner.
*/
-#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
+#define RWSEM_OWNER_UNKNOWN (-2L)
/* In all implementations count != 0 means locked */
static inline int rwsem_is_locked(struct rw_semaphore *sem)
@@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#define __RWSEM_INITIALIZER(name) \
{ __RWSEM_INIT_COUNT(name), \
- .owner = NULL, \
+ .owner = ATOMIC_LONG_INIT(0), \
.wait_list = LIST_HEAD_INIT((name).wait_list), \
.wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
__RWSEM_OPT_INIT(name) \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 985a03ad3f8c..fae557be8334 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -64,7 +64,7 @@
if (!debug_locks_silent && \
WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
#c, atomic_long_read(&(sem)->count), \
- (long)((sem)->owner), (long)current, \
+ atomic_long_read(&(sem)->owner), (long)current, \
list_empty(&(sem)->wait_list) ? "" : "not ")) \
debug_locks_off(); \
} while (0)
@@ -114,12 +114,20 @@
*/
static inline void rwsem_set_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, current);
+ atomic_long_set(&sem->owner, (long)current);
}
static inline void rwsem_clear_owner(struct rw_semaphore *sem)
{
- WRITE_ONCE(sem->owner, NULL);
+ atomic_long_set(&sem->owner, 0);
+}
+
+/*
+ * Test the flags in the owner field.
+ */
+static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
+{
+ return atomic_long_read(&sem->owner) & flags;
}
/*
@@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_NONSPINNABLE;
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
- WRITE_ONCE(sem->owner, (struct task_struct *)val);
+ atomic_long_set(&sem->owner, val);
}
static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
@@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
}
/*
- * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock.
- * N.B. !owner is considered spinnable.
+ * Return true if the rwsem is owned by a reader.
*/
-static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
{
- return !((unsigned long)owner & RWSEM_NONSPINNABLE);
+#ifdef CONFIG_DEBUG_RWSEMS
+ /*
+ * Check the count to see if it is write-locked.
+ */
+ long count = atomic_long_read(&sem->count);
+
+ if (count & RWSEM_WRITER_MASK)
+ return false;
+#endif
+ return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
}
#ifdef CONFIG_DEBUG_RWSEMS
@@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
*/
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
- unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_NONSPINNABLE;
- if (READ_ONCE(sem->owner) == (struct task_struct *)val)
- cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
+ unsigned long val = atomic_long_read(&sem->owner);
+
+ while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current) {
+ if (atomic_long_try_cmpxchg(&sem->owner, &val,
+ val & RWSEM_OWNER_FLAGS_MASK))
+ return;
+ }
}
#else
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
}
#endif
+/*
+ * Return just the real task structure pointer of the owner
+ */
+static inline struct task_struct *rwsem_owner(struct rw_semaphore *sem)
+{
+ return (struct task_struct *)
+ (atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
+/*
+ * Return the real task structure pointer of the owner and the embedded
+ * flags in the owner. pflags must be non-NULL.
+ */
+static inline struct task_struct *
+rwsem_owner_flags(struct rw_semaphore *sem, unsigned long *pflags)
+{
+ unsigned long owner = atomic_long_read(&sem->owner);
+
+ *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
+ return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
+}
+
/*
* Guide to the rw_semaphore's count field.
*
@@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
raw_spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
- sem->owner = NULL;
+ atomic_long_set(&sem->owner, 0L);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
osq_lock_init(&sem->osq);
#endif
@@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner;
+ unsigned long flags;
bool ret = true;
- BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
+ BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
if (need_resched()) {
lockevent_inc(rwsem_opt_fail);
@@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
preempt_disable();
rcu_read_lock();
- owner = READ_ONCE(sem->owner);
- if (owner) {
- ret = is_rwsem_owner_spinnable(owner) &&
- owner_on_cpu(owner);
- }
+ owner = rwsem_owner_flags(sem, &flags);
+ if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
+ ret = false;
rcu_read_unlock();
preempt_enable();
@@ -553,25 +590,26 @@ enum owner_state {
};
#define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
-static inline enum owner_state rwsem_owner_state(unsigned long owner)
+static inline enum owner_state
+rwsem_owner_state(struct task_struct *owner, unsigned long flags)
{
- if (!owner)
- return OWNER_NULL;
-
- if (owner & RWSEM_NONSPINNABLE)
+ if (flags & RWSEM_NONSPINNABLE)
return OWNER_NONSPINNABLE;
- if (owner & RWSEM_READER_OWNED)
+ if (flags & RWSEM_READER_OWNED)
return OWNER_READER;
- return OWNER_WRITER;
+ return owner ? OWNER_WRITER : OWNER_NULL;
}
static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
{
- struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
- enum owner_state state = rwsem_owner_state((unsigned long)owner);
+ struct task_struct *new, *owner;
+ unsigned long flags, new_flags;
+ enum owner_state state;
+ owner = rwsem_owner_flags(sem, &flags);
+ state = rwsem_owner_state(owner, flags);
if (state != OWNER_WRITER)
return state;
@@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
break;
}
- tmp = READ_ONCE(sem->owner);
- if (tmp != owner) {
- state = rwsem_owner_state((unsigned long)tmp);
+ new = rwsem_owner_flags(sem, &new_flags);
+ if ((new != owner) || (new_flags != flags)) {
+ state = rwsem_owner_state(new, new_flags);
break;
}
@@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
&sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
- RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
* sem->owner may differ from current if the ownership is transferred
* to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
*/
- DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
- !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
+ DEBUG_RWSEMS_WARN_ON((rwsem_owner(sem) != current) &&
+ !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
@@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
@@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
void up_read_non_owner(struct rw_semaphore *sem)
{
- DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
- sem);
+ DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
__up_read(sem);
}
EXPORT_SYMBOL(up_read_non_owner);
Commit-ID: 5cfd92e12e13432251981b9d0cd68dbd7aa8d690
Gitweb: https://git.kernel.org/tip/5cfd92e12e13432251981b9d0cd68dbd7aa8d690
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:14 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:09 +0200
locking/rwsem: Adaptive disabling of reader optimistic spinning
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.
1) The reader critical section is long, perhaps the task sleeps after
acquiring the read lock.
2) There are just too many readers contending the lock causing it to
take a while to service all of them.
In the former case, a long reader critical section will impede the progress
of writers, which is usually more important for system performance.
In the latter case, reader optimistic spinning tends to split the readers
into smaller groups that acquire the lock together, which leads to more
such groups. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.
Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.
Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinnable bit.
This is to discourage reader optimistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effects of
this feature.
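The set/preserve/reset rules above can be illustrated with a small userspace sketch using C11 atomics. This is not the kernel code: the `sk_` names are hypothetical, task pointers are modelled as plain aligned integers, and the position of the writer nonspinnable bit is an assumption (only bit 1 for the reader bit appears in the patch).

```c
#include <stdatomic.h>

/*
 * Flag bits kept in the low bits of the owner word. Bit 1 matches
 * RWSEM_RD_NONSPINNABLE in the patch; the writer bit position is an
 * assumption made for this sketch.
 */
#define SK_READER_OWNED     (1UL << 0)
#define SK_RD_NONSPINNABLE  (1UL << 1)
#define SK_WR_NONSPINNABLE  (1UL << 2)	/* assumed */
#define SK_NONSPINNABLE     (SK_RD_NONSPINNABLE | SK_WR_NONSPINNABLE)

/* Reader takes ownership: the reader nonspinnable bit is preserved. */
static void sk_set_reader_owned(atomic_ulong *owner, unsigned long task)
{
	unsigned long keep = atomic_load(owner) & SK_RD_NONSPINNABLE;

	atomic_store(owner, task | SK_READER_OWNED | keep);
}

/* Writer fast path: storing the bare task value resets all flag bits. */
static void sk_set_writer_owned(atomic_ulong *owner, unsigned long task)
{
	atomic_store(owner, task);
}

/*
 * Writer slow path: seen_flags is the owner word sampled after
 * optimistic spinning failed. If any nonspinnable bit was observed,
 * reader spinning is disabled once the write lock is held.
 */
static void sk_writer_slowpath_acquired(atomic_ulong *owner,
					unsigned long task,
					unsigned long seen_flags)
{
	atomic_store(owner, task);
	if (seen_flags & SK_NONSPINNABLE)
		atomic_fetch_or(owner, SK_RD_NONSPINNABLE);
}
```

In this model a fast-path writer is the only caller that drops the bit unconditionally, which mirrors the reset condition described above.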
In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.
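As a rough illustration of the phase check, the sketch below compares the owner word saved at slowpath entry with the current one. The `ph_` names and the three-bit flag mask are assumptions for this example; the actual trylock on the count is left out.

```c
#include <stdbool.h>

#define PH_READER_OWNED     (1UL << 0)
#define PH_RD_NONSPINNABLE  (1UL << 1)
#define PH_FLAGS_MASK       (7UL)	/* assumed: low three bits hold flags */

/*
 * Snapshot taken at slowpath entry: keep the full owner word only if
 * the lock was reader-owned, otherwise keep just the reader
 * nonspinnable bit.
 */
static unsigned long ph_snapshot(unsigned long owner_at_entry)
{
	if (!(owner_at_entry & PH_READER_OWNED))
		owner_at_entry &= PH_RD_NONSPINNABLE;
	return owner_at_entry;
}

/*
 * After spinning fails, the reader may still try for the lock when the
 * lock is reader-owned now and the recorded read owner differs from
 * the snapshot (a new read phase began, or the same phase is ongoing
 * with a newer reader).
 */
static bool ph_reader_phase_ok(unsigned long owner_now,
			       unsigned long last_rowner)
{
	if (!(owner_now & PH_READER_OWNED))
		return false;
	return ((owner_now ^ last_rowner) & ~PH_FLAGS_MASK) != 0;
}
```

A reader that entered while a writer held the lock gets a snapshot with no task bits, so any later reader-owned state counts as a phase change.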
On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various numbers of threads. The
numbers of operations done before the reader optimistic spinning patches,
before this patch, and after this patch were:
  Threads  Before rspin  Before patch  After patch     %change
  -------  ------------  ------------  -----------  -------------
     20       5541068       5345484      5455667     -3.5%/ +2.1%
     40      10185150       7292313      9219276    -28.5%/+26.4%
     60       8196733       6460517      7181209    -21.2%/+11.2%
     80       9508864       6739559      8107025    -29.1%/+20.3%
This patch doesn't recover all the lost performance, but it is more
than half. Given the fact that reader optimistic spinning does benefit
some workloads, this is a good compromise.
Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:
  # of Threads  Pre-patch  Post-patch
  ------------  ---------  ----------
        2         4,730      4,969
        4         4,814      4,786
        8         4,866      4,815
       16         4,715      4,511
       32         3,338      3,500
       64         3,212      3,389
       80         3,110      3,044
When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.
Before patch:
40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644
After patch:
40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/lock_events_list.h | 10 +--
kernel/locking/rwsem.c | 133 ++++++++++++++++++++++++++++++++++++--
2 files changed, 135 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index baa998401052..239039d0ce21 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,12 @@ LOCK_EVENT(rwsem_sleep_reader) /* # of reader sleeps */
LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps */
LOCK_EVENT(rwsem_wake_reader) /* # of reader wakeups */
LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups */
-LOCK_EVENT(rwsem_opt_rlock) /* # of read locks opt-spin acquired */
-LOCK_EVENT(rwsem_opt_wlock) /* # of write locks opt-spin acquired */
-LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings */
-LOCK_EVENT(rwsem_opt_nospin) /* # of disabled reader opt-spinnings */
+LOCK_EVENT(rwsem_opt_rlock) /* # of opt-acquired read locks */
+LOCK_EVENT(rwsem_opt_wlock) /* # of opt-acquired write locks */
+LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
+LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */
+LOCK_EVENT(rwsem_opt_norspin) /* # of disabled reader-only optspins */
+LOCK_EVENT(rwsem_opt_rlock2) /* # of opt-acquired 2ndary read locks */
LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2d7cabcfca50..e1e0bac957c4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,42 @@
* seems to hang on a reader owned rwsem especially if only one reader
* is involved. Ideally we would like to track all the readers that own
* a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a reader-owned lock and set the nospinnable bits, there are two main
+ * reasons for that.
+ *
+ * 1) The reader critical section is long, perhaps the task sleeps after
+ * acquiring the read lock.
+ * 2) There are just too many readers contending the lock causing it to
+ * take a while to service all of them.
+ *
+ * In the former case, long reader critical section will impede the progress
+ * of writers which is usually more important for system performance. In
+ * the later case, reader optimistic spinning tends to make the reader
+ * groups that contain readers that acquire the lock together smaller
+ * leading to more of them. That may hurt performance in some cases. In
+ * other words, the setting of nonspinnable bits indicates that reader
+ * optimistic spinning may not be helpful for those workloads that cause
+ * it.
+ *
+ * Therefore, any writers that had observed the setting of the writer
+ * nonspinnable bit for a given rwsem after they fail to acquire the lock
+ * via optimistic spinning will set the reader nonspinnable bit once they
+ * acquire the write lock. Similarly, readers that observe the setting
+ * of reader nonspinnable bit at slowpath entry will set the reader
+ * nonspinnable bits when they acquire the read lock via the wakeup path.
+ *
+ * Once the reader nonspinnable bit is on, it will only be reset when
+ * a writer is able to acquire the rwsem in the fast path or somehow a
+ * reader or writer in the slowpath doesn't observe the nonspinable bit.
+ *
+ * This is to discourage reader optmistic spinning on that particular
+ * rwsem and make writers more preferred. This adaptive disabling of reader
+ * optimistic spinning will alleviate the negative side effect of this
+ * feature.
*/
#define RWSEM_READER_OWNED (1UL << 0)
#define RWSEM_RD_NONSPINNABLE (1UL << 1)
@@ -144,11 +180,14 @@ static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
* Note that the owner value just indicates the task has owned the rwsem
* previously, it may not be the real owner or one of the real owners
* anymore when that field is examined, so take it with a grain of salt.
+ *
+ * The reader non-spinnable bit is preserved.
*/
static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
- unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED;
+ unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED |
+ (atomic_long_read(&sem->owner) & RWSEM_RD_NONSPINNABLE);
atomic_long_set(&sem->owner, val);
}
@@ -287,6 +326,7 @@ struct rwsem_waiter {
struct task_struct *task;
enum rwsem_waiter_type type;
unsigned long timeout;
+ unsigned long last_rowner;
};
#define rwsem_first_waiter(sem) \
list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -368,6 +408,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* so we can bail out early if a writer stole the lock.
*/
if (wake_type != RWSEM_WAKE_READ_OWNED) {
+ struct task_struct *owner;
+
adjustment = RWSEM_READER_BIAS;
oldcount = atomic_long_fetch_add(adjustment, &sem->count);
if (unlikely(oldcount & RWSEM_WRITER_MASK)) {
@@ -388,8 +430,15 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
/*
* Set it to reader-owned to give spinners an early
* indication that readers now have the lock.
+ * The reader nonspinnable bit seen at slowpath entry of
+ * the reader is copied over.
*/
- __rwsem_set_reader_owned(sem, waiter->task);
+ owner = waiter->task;
+ if (waiter->last_rowner & RWSEM_RD_NONSPINNABLE) {
+ owner = (void *)((unsigned long)owner | RWSEM_RD_NONSPINNABLE);
+ lockevent_inc(rwsem_opt_norspin);
+ }
+ __rwsem_set_reader_owned(sem, owner);
}
/*
@@ -836,6 +885,42 @@ static inline void clear_wr_nonspinnable(struct rw_semaphore *sem)
if (rwsem_test_oflags(sem, RWSEM_WR_NONSPINNABLE))
atomic_long_andnot(RWSEM_WR_NONSPINNABLE, &sem->owner);
}
+
+/*
+ * This function is called when the reader fails to acquire the lock via
+ * optimistic spinning. In this case we will still attempt to do a trylock
+ * when comparing the rwsem state right now with the state when entering
+ * the slowpath indicates that the reader is still in a valid reader phase.
+ * This happens when the following conditions are true:
+ *
+ * 1) The lock is currently reader owned, and
+ * 2) The lock is previously not reader-owned or the last read owner changes.
+ *
+ * In the former case, we have transitioned from a writer phase to a
+ * reader-phase while spinning. In the latter case, it means the reader
+ * phase hasn't ended when we entered the optimistic spinning loop. In
+ * both cases, the reader is eligible to acquire the lock. This is the
+ * secondary path where a read lock is acquired optimistically.
+ *
+ * The reader non-spinnable bit wasn't set at time of entry or it will
+ * not be here at all.
+ */
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+ unsigned long last_rowner)
+{
+ unsigned long owner = atomic_long_read(&sem->owner);
+
+ if (!(owner & RWSEM_READER_OWNED))
+ return false;
+
+ if (((owner ^ last_rowner) & ~RWSEM_OWNER_FLAGS_MASK) &&
+ rwsem_try_read_lock_unqueued(sem)) {
+ lockevent_inc(rwsem_opt_rlock2);
+ lockevent_add(rwsem_opt_fail, -1);
+ return true;
+ }
+ return false;
+}
#else
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
unsigned long nonspinnable)
@@ -849,6 +934,12 @@ static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
}
static inline void clear_wr_nonspinnable(struct rw_semaphore *sem) { }
+
+static inline bool rwsem_reader_phase_trylock(struct rw_semaphore *sem,
+ unsigned long last_rowner)
+{
+ return false;
+}
#endif
/*
@@ -862,6 +953,14 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ /*
+ * Save the current read-owner of rwsem, if available, and the
+ * reader nonspinnable bit.
+ */
+ waiter.last_rowner = atomic_long_read(&sem->owner);
+ if (!(waiter.last_rowner & RWSEM_READER_OWNED))
+ waiter.last_rowner &= RWSEM_RD_NONSPINNABLE;
+
if (!rwsem_can_spin_on_owner(sem, RWSEM_RD_NONSPINNABLE))
goto queue;
@@ -884,6 +983,8 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
wake_up_q(&wake_q);
}
return sem;
+ } else if (rwsem_reader_phase_trylock(sem, waiter.last_rowner)) {
+ return sem;
}
queue:
@@ -964,6 +1065,19 @@ out_nolock:
return ERR_PTR(-EINTR);
}
+/*
+ * This function is called by the a write lock owner. So the owner value
+ * won't get changed by others.
+ */
+static inline void rwsem_disable_reader_optspin(struct rw_semaphore *sem,
+ bool disable)
+{
+ if (unlikely(disable)) {
+ atomic_long_or(RWSEM_RD_NONSPINNABLE, &sem->owner);
+ lockevent_inc(rwsem_opt_norspin);
+ }
+}
+
/*
* Wait until we successfully acquire the write lock
*/
@@ -971,6 +1085,7 @@ static struct rw_semaphore *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
long count;
+ bool disable_rspin;
enum writer_wait_state wstate;
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
@@ -981,6 +1096,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
rwsem_optimistic_spin(sem, true))
return sem;
+ /*
+ * Disable reader optimistic spinning for this rwsem after
+ * acquiring the write lock when the setting of the nonspinnable
+ * bits are observed.
+ */
+ disable_rspin = atomic_long_read(&sem->owner) & RWSEM_NONSPINNABLE;
+
/*
* Optimistic spinning failed, proceed to the slowpath
* and block until we can acquire the sem.
@@ -1077,6 +1199,7 @@ wait:
}
__set_current_state(TASK_RUNNING);
list_del(&waiter.list);
+ rwsem_disable_reader_optspin(sem, disable_rspin);
raw_spin_unlock_irq(&sem->wait_lock);
lockevent_inc(rwsem_wlock);
@@ -1196,7 +1319,8 @@ static inline void __down_write(struct rw_semaphore *sem)
if (unlikely(!atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
RWSEM_WRITER_LOCKED)))
rwsem_down_write_slowpath(sem, TASK_UNINTERRUPTIBLE);
- rwsem_set_owner(sem);
+ else
+ rwsem_set_owner(sem);
}
static inline int __down_write_killable(struct rw_semaphore *sem)
@@ -1207,8 +1331,9 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
RWSEM_WRITER_LOCKED))) {
if (IS_ERR(rwsem_down_write_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
+ } else {
+ rwsem_set_owner(sem);
}
- rwsem_set_owner(sem);
return 0;
}
Commit-ID: 02f1082b003a0cd48f48f12533d969cdbf1c2b63
Gitweb: https://git.kernel.org/tip/02f1082b003a0cd48f48f12533d969cdbf1c2b63
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:10 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:03 +0200
locking/rwsem: Clarify usage of owner's nonspinaable bit
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.
With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need
this bit to indicate that a rwsem is nonspinnable, but a clear bit no
longer implies that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.
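To make the two-bit encoding concrete, here is a minimal standalone sketch of how an owner word might be classified under the renamed flag. The `ow_` names and the enum are hypothetical; the check order follows the rwsem_owner_state() hunk in this patch, and the final writer case is an assumption about the part of the function the hunk does not show.

```c
/* Low two bits of the owner word, as defined by this patch. */
#define OW_READER_OWNED   (1UL << 0)
#define OW_NONSPINNABLE   (1UL << 1)

enum ow_state {
	OW_NULL,		/* no owner recorded */
	OW_STATE_NONSPINNABLE,	/* waiters must not spin */
	OW_READER,		/* reader-owned and spinnable */
	OW_WRITER,		/* known writer, spinnable (assumed fallthrough) */
};

/* Classify an owner word; the nonspinnable bit takes precedence. */
static enum ow_state ow_classify(unsigned long owner)
{
	if (!owner)
		return OW_NULL;
	if (owner & OW_NONSPINNABLE)
		return OW_STATE_NONSPINNABLE;
	if (owner & OW_READER_OWNED)
		return OW_READER;
	return OW_WRITER;
}
```

With this encoding, RWSEM_OWNER_UNKNOWN (-2L) has bit 1 set and bit 0 clear, so it classifies as nonspinnable without being reader-owned.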
This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/rwsem.h | 2 +-
kernel/locking/rwsem.c | 43 +++++++++++++++++++++----------------------
2 files changed, 22 insertions(+), 23 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
};
/*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
* that the rwsem is writer-owned with an unknown owner.
*/
#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ded96023f4dc..180455b6b0d4 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
/*
* The least significant 2 bits of the owner value has the following
* meanings when set.
- * - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- * - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- * i.e. the owner(s) cannot be readily determined. It can be reader
- * owned or the owning writer is indeterminate.
+ * - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ * - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ * The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ * readily determined. It can be reader owned or the owning writer
+ * is indeterminate.
*
* When a writer acquires a rwsem, it puts its task_struct pointer
* into the owner field. It is cleared after an unlock.
*
* When a reader acquires a rwsem, it will also puts its task_struct
* pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
* largely be left untouched. So for a free or reader-owned rwsem,
* the owner value may contain information about the last reader that
* acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
* a rwsem, but the overhead is simply too big.
*/
#define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED (1UL << 1)
+#define RWSEM_NONSPINNABLE (1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
#ifdef CONFIG_DEBUG_RWSEMS
# define DEBUG_RWSEMS_WARN_ON(c, sem) do { \
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
struct task_struct *owner)
{
unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
+ | RWSEM_NONSPINNABLE;
WRITE_ONCE(sem->owner, (struct task_struct *)val);
}
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
/*
* Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
* N.B. !owner is considered spinnable.
*/
static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
{
- return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
- return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+ return !((unsigned long)owner & RWSEM_NONSPINNABLE);
}
#ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
{
unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
- | RWSEM_ANONYMOUSLY_OWNED;
+ | RWSEM_NONSPINNABLE;
if (READ_ONCE(sem->owner) == (struct task_struct *)val)
cmpxchg_relaxed((unsigned long *)&sem->owner, val,
- RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+ RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
}
#else
static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
@@ -495,7 +489,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
struct task_struct *owner;
bool ret = true;
- BUILD_BUG_ON(!rwsem_has_anonymous_owner(RWSEM_OWNER_UNKNOWN));
+ BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
if (need_resched())
return false;
@@ -534,7 +528,7 @@ static inline enum owner_state rwsem_owner_state(unsigned long owner)
if (!owner)
return OWNER_NULL;
- if (owner & RWSEM_ANONYMOUSLY_OWNED)
+ if (owner & RWSEM_NONSPINNABLE)
return OWNER_NONSPINNABLE;
if (owner & RWSEM_READER_OWNED)
@@ -1043,7 +1037,12 @@ static inline void __up_write(struct rw_semaphore *sem)
{
long tmp;
- DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
+ /*
+ * sem->owner may differ from current if the ownership is transferred
+ * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
+ */
+ DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
+ !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
Commit-ID: a15ea1a35f1b2782befc8b958c123c5d6a7cab0a
Gitweb: https://git.kernel.org/tip/a15ea1a35f1b2782befc8b958c123c5d6a7cab0a
Author: Waiman Long <[email protected]>
AuthorDate: Mon, 20 May 2019 16:59:15 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 17 Jun 2019 12:28:11 +0200
locking/rwsem: Guard against making count negative
The upper bits of the count field are used as the reader count. When
a sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps on piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for the
reader count is reduced because those bits are needed for other purposes.
To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
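The guard-bit check can be sketched in userspace with C11 atomics as follows. This is only an illustration under assumptions: the `gd_` names are hypothetical, the count is modelled as an unsigned word to avoid signed overflow, and the kernel's warning and nonspinnable marking on a negative count are omitted.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GD_WRITER_LOCKED   (1UL << 0)
#define GD_FLAG_WAITERS    (1UL << 1)
#define GD_FLAG_HANDOFF    (1UL << 2)
/* Most significant bit of the word acts as the read fail guard bit. */
#define GD_FLAG_READFAIL   (1UL << (sizeof(long) * CHAR_BIT - 1))
#define GD_READER_SHIFT    8
#define GD_READER_BIAS     (1UL << GD_READER_SHIFT)
#define GD_READ_FAILED_MASK (GD_WRITER_LOCKED | GD_FLAG_WAITERS | \
			     GD_FLAG_HANDOFF | GD_FLAG_READFAIL)

/*
 * Add one reader to the count and report whether the fast path
 * succeeded. The attempt fails if a writer holds the lock, waiters or
 * a handoff are pending, or the guard bit is set in the new count.
 * On failure the bias stays in the count; a real slow path would have
 * to account for it.
 */
static bool gd_read_trylock(atomic_ulong *count)
{
	unsigned long cnt;

	cnt = atomic_fetch_add_explicit(count, GD_READER_BIAS,
					memory_order_acquire) + GD_READER_BIAS;
	return !(cnt & GD_READ_FAILED_MASK);
}
```

Because the guard bit is part of the failure mask, a reader whose increment carries into the top bit is rejected in the same test that detects writers and waiters, so the check adds no extra branch to the fast path.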
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 53 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 12 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index e1e0bac957c4..37524a47f002 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -116,13 +116,28 @@
#endif
/*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
*
- * Bit 0 - writer locked bit
- * Bit 1 - waiters present bit
- * Bit 2 - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit 63 - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit 0 - writer locked bit
+ * Bit 1 - waiters present bit
+ * Bit 2 - lock handoff bit
+ * Bits 3-7 - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit 31 - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purpose
+ * in the future.
*
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -139,6 +154,7 @@
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
#define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_READFAIL (1UL << (BITS_PER_LONG - 1))
#define RWSEM_READER_SHIFT 8
#define RWSEM_READER_BIAS (1UL << RWSEM_READER_SHIFT)
@@ -146,7 +162,7 @@
#define RWSEM_WRITER_MASK RWSEM_WRITER_LOCKED
#define RWSEM_LOCK_MASK (RWSEM_WRITER_MASK|RWSEM_READER_MASK)
#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
- RWSEM_FLAG_HANDOFF)
+ RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
/*
* All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -254,6 +270,14 @@ static inline void rwsem_set_nonspinnable(struct rw_semaphore *sem)
owner | RWSEM_NONSPINNABLE));
}
+static inline bool rwsem_read_trylock(struct rw_semaphore *sem)
+{
+ long cnt = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);
+ if (WARN_ON_ONCE(cnt < 0))
+ rwsem_set_nonspinnable(sem);
+ return !(cnt & RWSEM_READ_FAILED_MASK);
+}
+
/*
* Return just the real task structure pointer of the owner
*/
@@ -402,6 +426,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
return;
}
+ /*
+ * No reader wakeup if there are too many of them already.
+ */
+ if (unlikely(atomic_long_read(&sem->count) < 0))
+ return;
+
/*
* Writers might steal the lock before we grant it to the next reader.
* We prefer to do the first reader grant before counting readers
@@ -949,9 +979,9 @@ static struct rw_semaphore __sched *
rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
{
long count, adjustment = -RWSEM_READER_BIAS;
- bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
+ bool wake = false;
/*
* Save the current read-owner of rwsem, if available, and the
@@ -1270,8 +1300,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
*/
inline void __down_read(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ if (!rwsem_read_trylock(sem)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
@@ -1281,8 +1310,7 @@ inline void __down_read(struct rw_semaphore *sem)
static inline int __down_read_killable(struct rw_semaphore *sem)
{
- if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
- &sem->count) & RWSEM_READ_FAILED_MASK)) {
+ if (!rwsem_read_trylock(sem)) {
if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
@@ -1359,6 +1387,7 @@ inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
+ DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_wr_nonspinnable(sem);
On 7/19/19 2:45 PM, Luis Henriques wrote:
> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>> The rwsem->owner contains not just the task structure pointer, it also
>> holds some flags for storing the current state of the rwsem. Some of
>> the flags may have to be atomically updated. To reflect the new reality,
>> the owner is now changed to an atomic_long_t type.
>>
>> New helper functions are added to properly separate out the task
>> structure pointer and the embedded flags.
> I started seeing KASAN use-after-free with current master, and a bisect
> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
> rwsem->owner an atomic_long_t") was the problem. Does it ring any
> bells? I can easily reproduce it with xfstests (generic/464).
>
> Cheers,
> --
> Luís
This patch shouldn't change the behavior of the rwsem code. The code
only accesses data within the rw_semaphore structures. I don't know why it
would cause a KASAN error. I will have to reproduce it and figure out
exactly which statement is doing the invalid access.
Thanks,
Longman
On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
> The rwsem->owner contains not just the task structure pointer, it also
> holds some flags for storing the current state of the rwsem. Some of
> the flags may have to be atomically updated. To reflect the new reality,
> the owner is now changed to an atomic_long_t type.
>
> New helper functions are added to properly separate out the task
> structure pointer and the embedded flags.
I started seeing KASAN use-after-free with current master, and a bisect
showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
rwsem->owner an atomic_long_t") was the problem. Does it ring any
bells? I can easily reproduce it with xfstests (generic/464).
Cheers,
--
Luís
[ 6380.820179] run fstests generic/464 at 2019-07-19 12:04:05
[ 6381.504693] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.506790] libceph: client4572 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6381.618830] libceph: mon0 (1)192.168.155.1:40786 session established
[ 6381.619993] libceph: client4573 fsid 86b39301-7192-4052-8427-a241af35a591
[ 6384.464561] ==================================================================
[ 6384.466165] BUG: KASAN: use-after-free in rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.468288] Read of size 4 at addr ffff8881d5dc9478 by task xfs_io/17238
[ 6384.469545] CPU: 1 PID: 17238 Comm: xfs_io Not tainted 5.2.0+ #444
[ 6384.469550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[ 6384.469554] Call Trace:
[ 6384.469563] dump_stack+0x5b/0x90
[ 6384.469569] print_address_description+0x6f/0x332
[ 6384.469573] ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469575] ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469579] __kasan_report.cold+0x1a/0x3e
[ 6384.469583] ? rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469588] kasan_report+0xe/0x12
[ 6384.469591] rwsem_down_write_slowpath+0x67d/0x8a0
[ 6384.469596] ? __ceph_caps_issued_mask+0xe7/0x280
[ 6384.469599] ? find_held_lock+0xc9/0xf0
[ 6384.469604] ? __ceph_do_getattr+0x19f/0x290
[ 6384.469608] ? down_read_non_owner+0x1c0/0x1c0
[ 6384.469612] ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469617] ? _raw_spin_unlock+0x24/0x30
[ 6384.469622] ? __lock_acquire.isra.0+0x486/0x770
[ 6384.469629] ? path_openat+0x7ef/0xfe0
[ 6384.469635] ? down_write+0x11e/0x130
[ 6384.469638] down_write+0x11e/0x130
[ 6384.469642] ? down_read_killable+0x1e0/0x1e0
[ 6384.469646] ? __sb_start_write+0x11c/0x170
[ 6384.469650] ? __mnt_want_write+0xb4/0xd0
[ 6384.469655] path_openat+0x7ef/0xfe0
[ 6384.469661] ? path_mountpoint+0x4d0/0x4d0
[ 6384.469667] ? __is_insn_slot_addr+0x93/0xb0
[ 6384.469671] ? kernel_text_address+0x113/0x120
[ 6384.469674] ? __kernel_text_address+0xe/0x30
[ 6384.469679] ? unwind_get_return_address+0x2f/0x50
[ 6384.469683] ? swiotlb_map.cold+0x25/0x25
[ 6384.469687] ? arch_stack_walk+0x8f/0xe0
[ 6384.469692] do_filp_open+0x12b/0x1c0
[ 6384.469695] ? may_open_dev+0x50/0x50
[ 6384.469702] ? __alloc_fd+0x115/0x280
[ 6384.469705] ? lock_downgrade+0x350/0x350
[ 6384.469709] ? do_raw_spin_lock+0x113/0x1d0
[ 6384.469713] ? rwlock_bug.part.0+0x60/0x60
[ 6384.469718] ? do_raw_spin_unlock+0xa3/0x130
[ 6384.469722] ? _raw_spin_unlock+0x24/0x30
[ 6384.469725] ? __alloc_fd+0x115/0x280
[ 6384.469731] do_sys_open+0x1f0/0x2d0
[ 6384.469735] ? filp_open+0x50/0x50
[ 6384.469738] ? switch_fpu_return+0x13e/0x230
[ 6384.469742] ? __do_page_fault+0x4b5/0x670
[ 6384.469748] do_syscall_64+0x63/0x1c0
[ 6384.469753] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6384.469756] RIP: 0033:0x7fe961434528
[ 6384.469760] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
[ 6384.469762] RSP: 002b:00007ffd9bbabb20 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6384.469765] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007fe961434528
[ 6384.469767] RDX: 0000000000000242 RSI: 00007ffd9bbae2a5 RDI: 00000000ffffff9c
[ 6384.469769] RBP: 00007ffd9bbae2a5 R08: 0000000000000001 R09: 0000000000000000
[ 6384.469771] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000242
[ 6384.469773] R13: 00007ffd9bbabe00 R14: 0000000000000180 R15: 0000000000000060
[ 6384.470018] Allocated by task 16593:
[ 6384.470562] __kasan_kmalloc.part.0+0x3c/0xa0
[ 6384.470565] kmem_cache_alloc+0xdc/0x240
[ 6384.470569] copy_process+0x1dce/0x27b0
[ 6384.470572] _do_fork+0xec/0x540
[ 6384.470576] __se_sys_clone+0xb2/0x100
[ 6384.470581] do_syscall_64+0x63/0x1c0
[ 6384.470586] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6384.470823] Freed by task 9:
[ 6384.471235] __kasan_slab_free+0x147/0x200
[ 6384.471240] kmem_cache_free+0x111/0x330
[ 6384.471246] rcu_core+0x2f9/0x830
[ 6384.471251] __do_softirq+0x154/0x486
[ 6384.471493] The buggy address belongs to the object at ffff8881d5dc9440
which belongs to the cache task_struct of size 4928
[ 6384.473081] The buggy address is located 56 bytes inside of
4928-byte region [ffff8881d5dc9440, ffff8881d5dca780)
[ 6384.474453] The buggy address belongs to the page:
[ 6384.474989] page:ffffea0007577200 refcount:1 mapcount:0 mapping:ffff8881f6811800 index:0x0 compound_mapcount: 0
[ 6384.474993] flags: 0x8000000000010200(slab|head)
[ 6384.474997] raw: 8000000000010200 0000000000000000 0000000100000001 ffff8881f6811800
[ 6384.475000] raw: 0000000000000000 0000000000060006 00000001ffffffff 0000000000000000
[ 6384.475002] page dumped because: kasan: bad access detected
[ 6384.475176] Memory state around the buggy address:
[ 6384.475744] ffff8881d5dc9300: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 6384.476571] ffff8881d5dc9380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 6384.477390] >ffff8881d5dc9400: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 6384.478214] ^
[ 6384.479052] ffff8881d5dc9480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6384.479898] ffff8881d5dc9500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 6384.481300] ==================================================================
[ 6384.482408] Disabling lock debugging due to kernel taint
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/percpu-rwsem.h | 4 +-
> include/linux/rwsem.h | 11 +--
> kernel/locking/rwsem.c | 125 ++++++++++++++++++++++-------------
> 3 files changed, 88 insertions(+), 52 deletions(-)
>
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 03cb4b6f842e..0a43830f1932 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -117,7 +117,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
> lock_release(&sem->rw_sem.dep_map, 1, ip);
> #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> if (!read)
> - sem->rw_sem.owner = RWSEM_OWNER_UNKNOWN;
> + atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN);
> #endif
> }
>
> @@ -127,7 +127,7 @@ static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
> lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
> #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> if (!read)
> - sem->rw_sem.owner = current;
> + atomic_long_set(&sem->rw_sem.owner, (long)current);
> #endif
> }
>
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index bb76e82398b2..e401358c4e7e 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -35,10 +35,11 @@
> struct rw_semaphore {
> atomic_long_t count;
> /*
> - * Write owner or one of the read owners. Can be used as a
> - * speculative check to see if the owner is running on the cpu.
> + * Write owner or one of the read owners as well flags regarding
> + * the current state of the rwsem. Can be used as a speculative
> + * check to see if the write owner is running on the cpu.
> */
> - struct task_struct *owner;
> + atomic_long_t owner;
> #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> #endif
> @@ -53,7 +54,7 @@ struct rw_semaphore {
> * Setting all bits of the owner field except bit 0 will indicate
> * that the rwsem is writer-owned with an unknown owner.
> */
> -#define RWSEM_OWNER_UNKNOWN ((struct task_struct *)-2L)
> +#define RWSEM_OWNER_UNKNOWN (-2L)
>
> /* In all implementations count != 0 means locked */
> static inline int rwsem_is_locked(struct rw_semaphore *sem)
> @@ -80,7 +81,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
>
> #define __RWSEM_INITIALIZER(name) \
> { __RWSEM_INIT_COUNT(name), \
> - .owner = NULL, \
> + .owner = ATOMIC_LONG_INIT(0), \
> .wait_list = LIST_HEAD_INIT((name).wait_list), \
> .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
> __RWSEM_OPT_INIT(name) \
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 9eb46ab9edaa..555da4868e54 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -64,7 +64,7 @@
> if (!debug_locks_silent && \
> WARN_ONCE(c, "DEBUG_RWSEMS_WARN_ON(%s): count = 0x%lx, owner = 0x%lx, curr 0x%lx, list %sempty\n",\
> #c, atomic_long_read(&(sem)->count), \
> - (long)((sem)->owner), (long)current, \
> + atomic_long_read(&(sem)->owner), (long)current, \
> list_empty(&(sem)->wait_list) ? "" : "not ")) \
> debug_locks_off(); \
> } while (0)
> @@ -114,12 +114,20 @@
> */
> static inline void rwsem_set_owner(struct rw_semaphore *sem)
> {
> - WRITE_ONCE(sem->owner, current);
> + atomic_long_set(&sem->owner, (long)current);
> }
>
> static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> {
> - WRITE_ONCE(sem->owner, NULL);
> + atomic_long_set(&sem->owner, 0);
> +}
> +
> +/*
> + * Test the flags in the owner field.
> + */
> +static inline bool rwsem_test_oflags(struct rw_semaphore *sem, long flags)
> +{
> + return atomic_long_read(&sem->owner) & flags;
> }
>
> /*
> @@ -133,10 +141,9 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
> static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
> struct task_struct *owner)
> {
> - unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
> - | RWSEM_NONSPINNABLE;
> + long val = (long)owner | RWSEM_READER_OWNED | RWSEM_NONSPINNABLE;
>
> - WRITE_ONCE(sem->owner, (struct task_struct *)val);
> + atomic_long_set(&sem->owner, val);
> }
>
> static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
> @@ -145,13 +152,20 @@ static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
> }
>
> /*
> - * Return true if the a rwsem waiter can spin on the rwsem's owner
> - * and steal the lock.
> - * N.B. !owner is considered spinnable.
> + * Return true if the rwsem is owned by a reader.
> */
> -static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
> +static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
> {
> - return !((unsigned long)owner & RWSEM_NONSPINNABLE);
> +#ifdef CONFIG_DEBUG_RWSEMS
> + /*
> + * Check the count to see if it is write-locked.
> + */
> + long count = atomic_long_read(&sem->count);
> +
> + if (count & RWSEM_WRITER_MASK)
> + return false;
> +#endif
> + return rwsem_test_oflags(sem, RWSEM_READER_OWNED);
> }
>
> #ifdef CONFIG_DEBUG_RWSEMS
> @@ -163,11 +177,13 @@ static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
> */
> static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
> {
> - unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
> - | RWSEM_NONSPINNABLE;
> - if (READ_ONCE(sem->owner) == (struct task_struct *)val)
> - cmpxchg_relaxed((unsigned long *)&sem->owner, val,
> - RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
> + long val = atomic_long_read(&sem->owner);
> +
> + while ((val & ~RWSEM_OWNER_FLAGS_MASK) == (long)current) {
> + if (atomic_long_try_cmpxchg(&sem->owner, &val,
> + val & RWSEM_OWNER_FLAGS_MASK))
> + return;
> + }
> }
> #else
> static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
> @@ -175,6 +191,28 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
> }
> #endif
>
> +/*
> + * Return just the real task structure pointer of the owner
> + */
> +static inline struct task_struct *rwsem_read_owner(struct rw_semaphore *sem)
> +{
> + return (struct task_struct *)(atomic_long_read(&sem->owner) &
> + ~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
> +/*
> + * Return the real task structure pointer of the owner and the embedded
> + * flags in the owner. pflags must be non-NULL.
> + */
> +static inline struct task_struct *
> +rwsem_read_owner_flags(struct rw_semaphore *sem, long *pflags)
> +{
> + long owner = atomic_long_read(&sem->owner);
> +
> + *pflags = owner & RWSEM_OWNER_FLAGS_MASK;
> + return (struct task_struct *)(owner & ~RWSEM_OWNER_FLAGS_MASK);
> +}
> +
> /*
> * Guide to the rw_semaphore's count field.
> *
> @@ -208,7 +246,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
> atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
> raw_spin_lock_init(&sem->wait_lock);
> INIT_LIST_HEAD(&sem->wait_list);
> - sem->owner = NULL;
> + atomic_long_set(&sem->owner, 0L);
> #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> osq_lock_init(&sem->osq);
> #endif
> @@ -511,9 +549,10 @@ static inline bool owner_on_cpu(struct task_struct *owner)
> static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
> {
> struct task_struct *owner;
> + long flags;
> bool ret = true;
>
> - BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
> + BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE));
>
> if (need_resched()) {
> lockevent_inc(rwsem_opt_fail);
> @@ -522,11 +561,9 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>
> preempt_disable();
> rcu_read_lock();
> - owner = READ_ONCE(sem->owner);
> - if (owner) {
> - ret = is_rwsem_owner_spinnable(owner) &&
> - owner_on_cpu(owner);
> - }
> + owner = rwsem_read_owner_flags(sem, &flags);
> + if ((flags & RWSEM_NONSPINNABLE) || (owner && !owner_on_cpu(owner)))
> + ret = false;
> rcu_read_unlock();
> preempt_enable();
>
> @@ -553,25 +590,26 @@ enum owner_state {
> };
> #define OWNER_SPINNABLE (OWNER_NULL | OWNER_WRITER)
>
> -static inline enum owner_state rwsem_owner_state(unsigned long owner)
> +static inline enum owner_state rwsem_owner_state(struct task_struct *owner,
> + long flags)
> {
> - if (!owner)
> - return OWNER_NULL;
> -
> - if (owner & RWSEM_NONSPINNABLE)
> + if (flags & RWSEM_NONSPINNABLE)
> return OWNER_NONSPINNABLE;
>
> - if (owner & RWSEM_READER_OWNED)
> + if (flags & RWSEM_READER_OWNED)
> return OWNER_READER;
>
> - return OWNER_WRITER;
> + return owner ? OWNER_WRITER : OWNER_NULL;
> }
>
> static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
> {
> - struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
> - enum owner_state state = rwsem_owner_state((unsigned long)owner);
> + struct task_struct *new, *owner;
> + long flags, new_flags;
> + enum owner_state state;
>
> + owner = rwsem_read_owner_flags(sem, &flags);
> + state = rwsem_owner_state(owner, flags);
> if (state != OWNER_WRITER)
> return state;
>
> @@ -582,9 +620,9 @@ static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
> break;
> }
>
> - tmp = READ_ONCE(sem->owner);
> - if (tmp != owner) {
> - state = rwsem_owner_state((unsigned long)tmp);
> + new = rwsem_read_owner_flags(sem, &new_flags);
> + if ((new != owner) || (new_flags != flags)) {
> + state = rwsem_owner_state(new, new_flags);
> break;
> }
>
> @@ -1001,8 +1039,7 @@ inline void __down_read(struct rw_semaphore *sem)
> if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
> &sem->count) & RWSEM_READ_FAILED_MASK)) {
> rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
> - DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
> - RWSEM_READER_OWNED), sem);
> + DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> } else {
> rwsem_set_reader_owned(sem);
> }
> @@ -1014,8 +1051,7 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
> &sem->count) & RWSEM_READ_FAILED_MASK)) {
> if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
> return -EINTR;
> - DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
> - RWSEM_READER_OWNED), sem);
> + DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> } else {
> rwsem_set_reader_owned(sem);
> }
> @@ -1084,7 +1120,7 @@ inline void __up_read(struct rw_semaphore *sem)
> {
> long tmp;
>
> - DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), sem);
> + DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> @@ -1103,8 +1139,8 @@ static inline void __up_write(struct rw_semaphore *sem)
> * sem->owner may differ from current if the ownership is transferred
> * to an anonymous writer by setting the RWSEM_NONSPINNABLE bits.
> */
> - DEBUG_RWSEMS_WARN_ON((sem->owner != current) &&
> - !((long)sem->owner & RWSEM_NONSPINNABLE), sem);
> + DEBUG_RWSEMS_WARN_ON((rwsem_read_owner(sem) != current) &&
> + !rwsem_test_oflags(sem, RWSEM_NONSPINNABLE), sem);
> rwsem_clear_owner(sem);
> tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
> if (unlikely(tmp & RWSEM_FLAG_WAITERS))
> @@ -1125,7 +1161,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
> * read-locked region is ok to be re-ordered into the
> * write side. As such, rely on RELEASE semantics.
> */
> - DEBUG_RWSEMS_WARN_ON(sem->owner != current, sem);
> + DEBUG_RWSEMS_WARN_ON(rwsem_read_owner(sem) != current, sem);
> tmp = atomic_long_fetch_add_release(
> -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
> rwsem_set_reader_owned(sem);
> @@ -1296,8 +1332,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
>
> void up_read_non_owner(struct rw_semaphore *sem)
> {
> - DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
> - sem);
> + DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
> __up_read(sem);
> }
> EXPORT_SYMBOL(up_read_non_owner);
> --
> 2.18.1
>
Waiman Long <[email protected]> writes:
> On 7/19/19 2:45 PM, Luis Henriques wrote:
>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>> The rwsem->owner contains not just the task structure pointer, it also
>>> holds some flags for storing the current state of the rwsem. Some of
>>> the flags may have to be atomically updated. To reflect the new reality,
>>> the owner is now changed to an atomic_long_t type.
>>>
>>> New helper functions are added to properly separate out the task
>>> structure pointer and the embedded flags.
>> I started seeing KASAN use-after-free with current master, and a bisect
>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>> rwsem->owner an atomic_long_t") was the problem. Does it ring any
>> bells? I can easily reproduce it with xfstests (generic/464).
>>
>> Cheers,
>> --
>> Luís
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only accesses data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.
Yeah, screwing up a bisection is something I've done in the past, so I may
have got the wrong commit. Another detail is that I was running
xfstests against CephFS; I didn't try any other filesystem. I
can try to reproduce with btrfs or xfs next week.
Cheers,
--
Luis
On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>
> This patch shouldn't change the behavior of the rwsem code. The code
> only accesses data within the rw_semaphore structures. I don't know why it
> will cause a KASAN error. I will have to reproduce it and figure out
> exactly which statement is doing the invalid access.
The stack traces should show line numbers if you run them through
scripts/decode_stacktrace.sh.
You need to have debug info enabled for that, though.
Luis?
Linus
On 7/19/19 3:45 PM, Luis Henriques wrote:
> Waiman Long <[email protected]> writes:
>
>> On 7/19/19 2:45 PM, Luis Henriques wrote:
>>> On Mon, May 20, 2019 at 04:59:12PM -0400, Waiman Long wrote:
>>>> The rwsem->owner contains not just the task structure pointer, it also
>>>> holds some flags for storing the current state of the rwsem. Some of
>>>> the flags may have to be atomically updated. To reflect the new reality,
>>>> the owner is now changed to an atomic_long_t type.
>>>>
>>>> New helper functions are added to properly separate out the task
>>>> structure pointer and the embedded flags.
>>> I started seeing KASAN use-after-free with current master, and a bisect
>>> showed me that this commit 94a9717b3c40 ("locking/rwsem: Make
>>> rwsem->owner an atomic_long_t") was the problem. Does it ring any
>>> bells? I can easily reproduce it with xfstests (generic/464).
>>>
>>> Cheers,
>>> --
>>> Luís
>> This patch shouldn't change the behavior of the rwsem code. The code
>> only accesses data within the rw_semaphore structures. I don't know why it
>> will cause a KASAN error. I will have to reproduce it and figure out
>> exactly which statement is doing the invalid access.
> Yeah, screwing up a bisection is something I've done in the past, so I may
> have got the wrong commit. Another detail is that I was running
> xfstests against CephFS; I didn't try any other filesystem. I
> can try to reproduce with btrfs or xfs next week.
>
> Cheers,
Oh, I don't have a CephFS setup. Could you run
scripts/decode_stacktrace.sh to find the line number of the offending
statement? That will help in figuring out what has gone wrong.
Anyway, it seems like a structure that includes a rwsem is freed while
another CPU is still waiting to acquire the lock. It is probably a
hidden bug somewhere in the filesystem code that the recent changes in
rwsem behavior make easier to trigger.
Cheers,
Longman
"Linus Torvalds" <[email protected]> writes:
> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>
>> This patch shouldn't change the behavior of the rwsem code. The code
>> only accesses data within the rw_semaphore structures. I don't know why it
>> will cause a KASAN error. I will have to reproduce it and figure out
>> exactly which statement is doing the invalid access.
>
> The stack traces should show line numbers if you run them through
> scripts/decode_stacktrace.sh.
>
> You need to have debug info enabled for that, though.
>
> Luis?
>
> Linus
Yep, sure. And I should have done this in the initial report. It's a
different trace, I had to recompile the kernel.
(I'm also adding Jeff to the CC list.)
Cheers,
--
Luis
[ 39.801179] ==================================================================
[ 39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
[ 39.802733] Read of size 4 at addr ffff8881f1f65138 by task xfs_io/2145
[ 39.803598] CPU: 0 PID: 2145 Comm: xfs_io Not tainted 5.2.0+ #460
[ 39.803600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
[ 39.803602] Call Trace:
[ 39.803609] dump_stack (/home/miguel/kernel/linux/lib/dump_stack.c:115)
[ 39.803615] print_address_description (/home/miguel/kernel/linux/mm/kasan/report.c:352)
[ 39.803618] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
[ 39.803621] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
[ 39.803624] __kasan_report.cold (/home/miguel/kernel/linux/mm/kasan/report.c:483)
[ 39.803629] ? rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
[ 39.803633] kasan_report (/home/miguel/kernel/linux/./arch/x86/include/asm/smap.h:69 /home/miguel/kernel/linux/mm/kasan/common.c:613)
[ 39.803636] rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
[ 39.803641] ? __ceph_caps_issued_mask (/home/miguel/kernel/linux/fs/ceph/caps.c:914)
[ 39.803644] ? find_held_lock (/home/miguel/kernel/linux/kernel/locking/lockdep.c:4004)
[ 39.803649] ? __ceph_do_getattr (/home/miguel/kernel/linux/fs/ceph/inode.c:2246)
[ 39.803653] ? down_read_non_owner (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1116)
[ 39.803658] ? do_raw_spin_unlock (/home/miguel/kernel/linux/./include/linux/compiler.h:218 /home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139)
[ 39.803663] ? _raw_spin_unlock (/home/miguel/kernel/linux/kernel/locking/spinlock.c:184)
[ 39.803667] ? __lock_acquire.isra.0 (/home/miguel/kernel/linux/kernel/locking/lockdep.c:3884)
[ 39.803674] ? path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533)
[ 39.803680] ? down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486)
[ 39.803683] down_write (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1486)
[ 39.803687] ? down_read_killable (/home/miguel/kernel/linux/kernel/locking/rwsem.c:1482)
[ 39.803690] ? __sb_start_write (/home/miguel/kernel/linux/./include/linux/compiler.h:194 /home/miguel/kernel/linux/./include/linux/rcu_sync.h:38 /home/miguel/kernel/linux/./include/linux/percpu-rwsem.h:52 /home/miguel/kernel/linux/fs/super.c:1608)
[ 39.803694] ? __mnt_want_write (/home/miguel/kernel/linux/fs/namespace.c:253 /home/miguel/kernel/linux/fs/namespace.c:297 /home/miguel/kernel/linux/fs/namespace.c:337)
[ 39.803699] path_openat (/home/miguel/kernel/linux/fs/namei.c:3322 /home/miguel/kernel/linux/fs/namei.c:3533)
[ 39.803706] ? path_mountpoint (/home/miguel/kernel/linux/fs/namei.c:3518)
[ 39.803711] ? __is_insn_slot_addr (/home/miguel/kernel/linux/kernel/kprobes.c:291)
[ 39.803716] ? kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:113)
[ 39.803719] ? __kernel_text_address (/home/miguel/kernel/linux/kernel/extable.c:95)
[ 39.803724] ? unwind_get_return_address (/home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:311 /home/miguel/kernel/linux/arch/x86/kernel/unwind_orc.c:306)
[ 39.803727] ? swiotlb_map.cold (/home/miguel/kernel/linux/kernel/stacktrace.c:83)
[ 39.803730] ? arch_stack_walk (/home/miguel/kernel/linux/arch/x86/kernel/stacktrace.c:26)
[ 39.803735] do_filp_open (/home/miguel/kernel/linux/fs/namei.c:3563)
[ 39.803739] ? may_open_dev (/home/miguel/kernel/linux/fs/namei.c:3557)
[ 39.803746] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536)
[ 39.803749] ? lock_downgrade (/home/miguel/kernel/linux/kernel/locking/lockdep.c:4422)
[ 39.803753] ? do_raw_spin_lock (/home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:92 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:115)
[ 39.803757] ? rwlock_bug.part.0 (/home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:111)
[ 39.803762] ? do_raw_spin_unlock (/home/miguel/kernel/linux/./include/linux/compiler.h:218 /home/miguel/kernel/linux/./include/asm-generic/qspinlock.h:94 /home/miguel/kernel/linux/kernel/locking/spinlock_debug.c:139)
[ 39.803766] ? _raw_spin_unlock (/home/miguel/kernel/linux/kernel/locking/spinlock.c:184)
[ 39.803769] ? __alloc_fd (/home/miguel/kernel/linux/fs/file.c:536)
[ 39.803774] do_sys_open (/home/miguel/kernel/linux/fs/open.c:1070)
[ 39.803778] ? filp_open (/home/miguel/kernel/linux/fs/open.c:1056)
[ 39.803781] ? switch_fpu_return (/home/miguel/kernel/linux/./arch/x86/include/asm/bitops.h:76 /home/miguel/kernel/linux/./include/asm-generic/bitops-instrumented.h:57 /home/miguel/kernel/linux/./include/linux/thread_info.h:60 /home/miguel/kernel/linux/./arch/x86/include/asm/fpu/internal.h:547 /home/miguel/kernel/linux/arch/x86/kernel/fpu/core.c:343)
[ 39.803786] ? __do_page_fault (/home/miguel/kernel/linux/./include/linux/compiler.h:194 /home/miguel/kernel/linux/./arch/x86/include/asm/atomic.h:31 /home/miguel/kernel/linux/./include/asm-generic/atomic-instrumented.h:27 /home/miguel/kernel/linux/./include/linux/jump_label.h:254 /home/miguel/kernel/linux/./include/linux/jump_label.h:264 /home/miguel/kernel/linux/./include/linux/perf_event.h:1094 /home/miguel/kernel/linux/arch/x86/mm/fault.c:1485 /home/miguel/kernel/linux/arch/x86/mm/fault.c:1510)
[ 39.803792] do_syscall_64 (/home/miguel/kernel/linux/arch/x86/entry/common.c:296)
[ 39.803796] entry_SYSCALL_64_after_hwframe (/home/miguel/kernel/linux/arch/x86/entry/entry_64.S:184)
[ 39.803799] RIP: 0033:0x7f62b41a2528
[ 39.803803] Code: 00 00 41 00 3d 00 00 41 00 74 47 48 8d 05 20 4d 0d 00 8b 00 85 c0 75 6b 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 94 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
All code
========
0: 00 00 add %al,(%rax)
2: 41 00 3d 00 00 41 00 add %dil,0x410000(%rip) # 0x410009
9: 74 47 je 0x52
b: 48 8d 05 20 4d 0d 00 lea 0xd4d20(%rip),%rax # 0xd4d32
12: 8b 00 mov (%rax),%eax
14: 85 c0 test %eax,%eax
16: 75 6b jne 0x83
18: 44 89 e2 mov %r12d,%edx
1b: 48 89 ee mov %rbp,%rsi
1e: bf 9c ff ff ff mov $0xffffff9c,%edi
23: b8 01 01 00 00 mov $0x101,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 0f 87 94 00 00 00 ja 0xca
36: 48 8b 4c 24 28 mov 0x28(%rsp),%rcx
3b: 64 fs
3c: 48 rex.W
3d: 33 .byte 0x33
3e: 0c 25 or $0x25,%al
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 0f 87 94 00 00 00 ja 0xa0
c: 48 8b 4c 24 28 mov 0x28(%rsp),%rcx
11: 64 fs
12: 48 rex.W
13: 33 .byte 0x33
14: 0c 25 or $0x25,%al
[ 39.803805] RSP: 002b:00007ffe6c3359e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 39.803808] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f62b41a2528
[ 39.803810] RDX: 0000000000000242 RSI: 00007ffe6c3382a5 RDI: 00000000ffffff9c
[ 39.803812] RBP: 00007ffe6c3382a5 R08: 0000000000000001 R09: 0000000000000000
[ 39.803814] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000242
[ 39.803816] R13: 00007ffe6c335cc0 R14: 0000000000000180 R15: 0000000000000060
[ 39.803996] Allocated by task 2093:
[ 39.804373] __kasan_kmalloc.part.0 (/home/miguel/kernel/linux/mm/kasan/common.c:69 /home/miguel/kernel/linux/mm/kasan/common.c:77 /home/miguel/kernel/linux/mm/kasan/common.c:487)
[ 39.804376] kmem_cache_alloc (/home/miguel/kernel/linux/mm/slab.h:522 /home/miguel/kernel/linux/mm/slub.c:2766 /home/miguel/kernel/linux/mm/slub.c:2774 /home/miguel/kernel/linux/mm/slub.c:2779)
[ 39.804380] copy_process (/home/miguel/kernel/linux/kernel/fork.c:852 /home/miguel/kernel/linux/kernel/fork.c:1856)
[ 39.804382] _do_fork (/home/miguel/kernel/linux/kernel/fork.c:2369)
[ 39.804385] __se_sys_clone (/home/miguel/kernel/linux/kernel/fork.c:2505)
[ 39.804387] do_syscall_64 (/home/miguel/kernel/linux/arch/x86/entry/common.c:296)
[ 39.804390] entry_SYSCALL_64_after_hwframe (/home/miguel/kernel/linux/arch/x86/entry/entry_64.S:184)
[ 39.804558] Freed by task 16:
[ 39.804871] __kasan_slab_free (/home/miguel/kernel/linux/mm/kasan/common.c:69 /home/miguel/kernel/linux/mm/kasan/common.c:77 /home/miguel/kernel/linux/mm/kasan/common.c:449)
[ 39.804874] kmem_cache_free (/home/miguel/kernel/linux/mm/slub.c:1470 /home/miguel/kernel/linux/mm/slub.c:3012 /home/miguel/kernel/linux/mm/slub.c:3028)
[ 39.804877] rcu_core (/home/miguel/kernel/linux/./include/linux/rcupdate.h:213 /home/miguel/kernel/linux/kernel/rcu/rcu.h:223 /home/miguel/kernel/linux/kernel/rcu/tree.c:2114 /home/miguel/kernel/linux/kernel/rcu/tree.c:2314)
[ 39.804880] __do_softirq (/home/miguel/kernel/linux/./include/asm-generic/atomic-instrumented.h:26 /home/miguel/kernel/linux/./include/linux/jump_label.h:254 /home/miguel/kernel/linux/./include/linux/jump_label.h:264 /home/miguel/kernel/linux/./include/trace/events/irq.h:142 /home/miguel/kernel/linux/kernel/softirq.c:293)
[ 39.805048] The buggy address belongs to the object at ffff8881f1f65100
which belongs to the cache task_struct of size 4928
[ 39.806345] The buggy address is located 56 bytes inside of
4928-byte region [ffff8881f1f65100, ffff8881f1f66440)
[ 39.807543] The buggy address belongs to the page:
[ 39.808045] page:ffffea0007c7d800 refcount:1 mapcount:0 mapping:ffff8881f6811800 index:0x0 compound_mapcount: 0
[ 39.808049] flags: 0x8000000000010200(slab|head)
[ 39.808053] raw: 8000000000010200 dead000000000100 dead000000000122 ffff8881f6811800
[ 39.808056] raw: 0000000000000000 0000000000060006 00000001ffffffff 0000000000000000
[ 39.808058] page dumped because: kasan: bad access detected
[ 39.808224] Memory state around the buggy address:
[ 39.808723] ffff8881f1f65000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 39.809476] ffff8881f1f65080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 39.810220] >ffff8881f1f65100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.810968] ^
[ 39.811504] ffff8881f1f65180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.812237] ffff8881f1f65200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.812972] ==================================================================
[ 39.813710] Disabling lock debugging due to kernel taint
Luis Henriques <[email protected]> writes:
> Luis Henriques <[email protected]> writes:
>
>> "Linus Torvalds" <[email protected]> writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>>>
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only accesses data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>>
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>> Linus
>>
>> Yep, sure. And I should have done this in the initial report. It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>
> Ah, and I also managed to reproduce this on btrfs so I guess this rules
> out a bug in the filesystem code.
Just another detail (before I go completely offline until tomorrow
evening): in the btrfs case I'm seeing the bug on the
rwsem_down_read_slowpath path, not on rwsem_down_write_slowpath. But it
seems to be in the same place (i.e. rwsem_can_spin_on_owner).
Cheers,
--
Luis
Luis Henriques <[email protected]> writes:
> "Linus Torvalds" <[email protected]> writes:
>
>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>>
>>> This patch shouldn't change the behavior of the rwsem code. The code
>>> only accesses data within the rw_semaphore structures. I don't know why it
>>> will cause a KASAN error. I will have to reproduce it and figure out
>>> exactly which statement is doing the invalid access.
>>
>> The stack traces should show line numbers if you run them through
>> scripts/decode_stacktrace.sh.
>>
>> You need to have debug info enabled for that, though.
>>
>> Luis?
>>
>> Linus
>
> Yep, sure. And I should have done this in the initial report. It's a
> different trace, I had to recompile the kernel.
>
> (I'm also adding Jeff to the CC list.)
>
Ah, and I also managed to reproduce this on btrfs so I guess this rules
out a bug in the filesystem code.
Cheers,
--
Luis
On Sat, Jul 20, 2019 at 09:41:05AM +0100, Luis Henriques wrote:
> [ 39.801179] ==================================================================
> [ 39.801973] BUG: KASAN: use-after-free in rwsem_down_write_slowpath (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669 /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
That's rwsem_can_spin_on_owner(), specifically line 669 seems to suggest
owner_on_cpu().
So we'd somehow have a dead owner; I'm not immediately seeing how that
can happen.
On 7/20/19 4:41 AM, Luis Henriques wrote:
> "Linus Torvalds" <[email protected]> writes:
>
>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>> This patch shouldn't change the behavior of the rwsem code. The code
>>> only access data within the rw_semaphore structures. I don't know why it
>>> will cause a KASAN error. I will have to reproduce it and figure out
>>> exactly which statement is doing the invalid access.
>> The stack traces should show line numbers if you run them through
>> scripts/decode_stacktrace.sh.
>>
>> You need to have debug info enabled for that, though.
>>
>> Luis?
>>
>> Linus
> Yep, sure. And I should have done this in the initial report. It's a
> different trace, I had to recompile the kernel.
>
> (I'm also adding Jeff to the CC list.)
>
> Cheers,
Thanks for the information. I think I know where the problem is. Would
you mind applying the attached patch to see if it fixes the KASAN error?
Thanks,
Longman
Waiman Long <[email protected]> writes:
> On 7/20/19 4:41 AM, Luis Henriques wrote:
>> "Linus Torvalds" <[email protected]> writes:
>>
>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>> only access data within the rw_semaphore structures. I don't know why it
>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>> exactly which statement is doing the invalid access.
>>> The stack traces should show line numbers if you run them through
>>> scripts/decode_stacktrace.sh.
>>>
>>> You need to have debug info enabled for that, though.
>>>
>>> Luis?
>>>
>>> Linus
>> Yep, sure. And I should have done this in the initial report. It's a
>> different trace, I had to recompile the kernel.
>>
>> (I'm also adding Jeff to the CC list.)
>>
>> Cheers,
>
> Thanks for the information. I think I know where the problem is. Would
> you mind applying the attached patch to see if it can fix the KASAN error.
Yep, that seems to work -- I can't reproduce the error anymore (and
sorry for the delay). Thanks! And feel free to add my Tested-by.
Cheers,
--
Luis
On 7/21/19 4:49 PM, Luis Henriques wrote:
> Waiman Long <[email protected]> writes:
>
>> On 7/20/19 4:41 AM, Luis Henriques wrote:
>>> "Linus Torvalds" <[email protected]> writes:
>>>
>>>> On Fri, Jul 19, 2019 at 12:32 PM Waiman Long <[email protected]> wrote:
>>>>> This patch shouldn't change the behavior of the rwsem code. The code
>>>>> only access data within the rw_semaphore structures. I don't know why it
>>>>> will cause a KASAN error. I will have to reproduce it and figure out
>>>>> exactly which statement is doing the invalid access.
>>>> The stack traces should show line numbers if you run them through
>>>> scripts/decode_stacktrace.sh.
>>>>
>>>> You need to have debug info enabled for that, though.
>>>>
>>>> Luis?
>>>>
>>>> Linus
>>> Yep, sure. And I should have done this in the initial report. It's a
>>> different trace, I had to recompile the kernel.
>>>
>>> (I'm also adding Jeff to the CC list.)
>>>
>>> Cheers,
>> Thanks for the information. I think I know where the problem is. Would
>> you mind applying the attached patch to see if it can fix the KASAN error.
> Yep, that seems to work -- I can't reproduce the error anymore (and
> sorry for the delay). Thanks! And feel free to add my Tested-by.
>
> Cheers,
Thanks for the testing. I will post the official patch tomorrow.
Cheers,
Longman
Commit-ID: 78134300579a45f527ca173ec8fdb4701b69f16e
Gitweb: https://git.kernel.org/tip/78134300579a45f527ca173ec8fdb4701b69f16e
Author: Waiman Long <[email protected]>
AuthorDate: Sat, 20 Jul 2019 11:04:10 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 25 Jul 2019 15:39:22 +0200
locking/rwsem: Don't call owner_on_cpu() on read-owner
For a writer, the owner value is cleared on unlock. For a reader, it is
left intact on unlock, both to provide a better debugging aid in crash
dumps and because the unlock of one reader may not mean the lock is free.
As a result, owner_on_cpu() shouldn't be used on a read-owner, as the
task pointer may no longer be valid and the task might have been freed.
rwsem_spin_on_owner() already avoids this, but rwsem_can_spin_on_owner()
does not. This can lead to a use-after-free error from KASAN. For
example,
BUG: KASAN: use-after-free in rwsem_down_write_slowpath
(/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
/home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)
Fix this by checking for the RWSEM_READER_OWNED flag before calling
owner_on_cpu().
Reported-by: Luis Henriques <[email protected]>
Tested-by: Luis Henriques <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: huang ying <[email protected]>
Fixes: 94a9717b3c40e ("locking/rwsem: Make rwsem->owner an atomic_long_t")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/rwsem.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 37524a47f002..bc91aacaab58 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -666,7 +666,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem,
 	preempt_disable();
 	rcu_read_lock();
 	owner = rwsem_owner_flags(sem, &flags);
-	if ((flags & nonspinnable) || (owner && !owner_on_cpu(owner)))
+	/*
+	 * Don't check the read-owner as the entry may be stale.
+	 */
+	if ((flags & nonspinnable) ||
+	    (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
 		ret = false;
 	rcu_read_unlock();
 	preempt_enable();
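
The check added above can be illustrated outside the kernel. The following
is only a user-space sketch, not the kernel code: struct task, struct sem,
owner_and_flags(), can_spin_on_owner() and the flag values are hypothetical
stand-ins chosen for illustration, and the kernel's preempt/RCU handling and
its "nonspinnable" parameter are left out. It assumes the owner word holds a
task pointer with flag bits in its low bits, and shows that a reader-owned
entry is never dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits kept in the low bits of the owner word. */
#define READER_OWNED     ((uintptr_t)1 << 0)
#define NONSPINNABLE     ((uintptr_t)1 << 1)
#define OWNER_FLAGS_MASK (READER_OWNED | NONSPINNABLE)

/* Stand-in for a task; the long member keeps the low pointer bits free. */
struct task {
	long id;
	bool on_cpu;
};

/* The two flag bits must fit below the task's alignment. */
static_assert(_Alignof(struct task) >= 4, "need two spare low bits");

/* Stand-in for the semaphore: task pointer ORed with flag bits. */
struct sem {
	uintptr_t owner;
};

/* Split the owner word into a task pointer and its flag bits. */
static struct task *owner_and_flags(const struct sem *s, uintptr_t *flags)
{
	*flags = s->owner & OWNER_FLAGS_MASK;
	return (struct task *)(s->owner & ~OWNER_FLAGS_MASK);
}

/* Decide whether spinning is worthwhile without touching a stale owner. */
bool can_spin_on_owner(const struct sem *s)
{
	uintptr_t flags;
	struct task *owner = owner_and_flags(s, &flags);

	if (flags & NONSPINNABLE)
		return false;
	/*
	 * A read-owner entry is left in place on unlock and may point to
	 * a task that no longer exists, so only a write-owner is
	 * dereferenced here.
	 */
	if (owner && !(flags & READER_OWNED) && !owner->on_cpu)
		return false;
	return true;
}
```

With this ordering, a reader-owned word whose pointer part is garbage still
yields "may spin" without any memory access through that pointer, while a
write-owner that is off-CPU yields "don't spin".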