v6:
- Fix an error in patch 2 reported by kernel test robot.
v5:
- Drop patch 2 and replace it with 2 new patches disabling preemption on
all reader functions and writer functions respectively. The other
patches are adjusted accordingly.
v4:
- Update patch descriptions in patches 1 & 2 to make clear the live
lock conditions that are being fixed by these patches. There is no code
change from v3.
It turns out the current waiter optimistic spinning code does not work
that well if we have RT tasks in the mix. This patch series includes two
different fixes to resolve those issues. The last 3 patches modify the
handoff code to implement a true lock handoff similar to that of a mutex.
Waiman Long (6):
locking/rwsem: Prevent non-first waiter from spinning in down_write()
slowpath
locking/rwsem: Disable preemption at all down_read*() and up_read()
code paths
locking/rwsem: Disable preemption at all down_write*() and up_write()
code paths
locking/rwsem: Change waiter->handoff_set to a handoff_state enum
locking/rwsem: Enable direct rwsem lock handoff
locking/rwsem: Update handoff lock events tracking
kernel/locking/lock_events_list.h | 6 +-
kernel/locking/rwsem.c | 240 ++++++++++++++++++++++--------
2 files changed, 182 insertions(+), 64 deletions(-)
--
2.31.1
A non-first waiter can potentially spin in the for loop of
rwsem_down_write_slowpath() without sleeping but fail to acquire the
lock even when the rwsem is free, if the following sequence happens:

 Non-first RT waiter    First waiter      Lock holder
 -------------------    ------------      -----------
 Acquire wait_lock
 rwsem_try_write_lock():
   Set handoff bit if RT or
     wait too long
   Set waiter->handoff_set
 Release wait_lock
                        Acquire wait_lock
                        Inherit waiter->handoff_set
                        Release wait_lock
                                          Clear owner
                                          Release lock
 if (waiter.handoff_set) {
   rwsem_spin_on_owner();
   if (OWNER_NULL)
     goto trylock_again;
 }
 trylock_again:
 Acquire wait_lock
 rwsem_try_write_lock():
   if (first->handoff_set && (waiter != first))
     return false;
 Release wait_lock
A non-first waiter cannot really acquire the rwsem even if it mistakenly
believes that it can spin on an OWNER_NULL value. If that waiter happens
to be an RT task running on the same CPU as the first waiter, it can
block the first waiter from acquiring the rwsem, leading to a live lock.
Fix this problem by making sure that a non-first waiter cannot spin in
the slowpath loop without sleeping.
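For reference, a stripped-down sketch of the pre-fix wait loop in
rwsem_down_write_slowpath() (simplified here; signal handling, lock
events and the preempt_disable() around the spin are elided) shows where
a non-first waiter with handoff_set can keep looping without ever
sleeping:

	for (;;) {
		if (rwsem_try_write_lock(sem, &waiter))
			break;			/* lock acquired */

		raw_spin_unlock_irq(&sem->wait_lock);
		/* ... signal handling elided ... */

		/*
		 * Pre-fix: any waiter with handoff_set, not just the first
		 * one, spins on the owner and, on seeing OWNER_NULL, jumps
		 * straight back to the trylock without sleeping.
		 */
		if (waiter.handoff_set) {
			if (rwsem_spin_on_owner(sem) == OWNER_NULL)
				goto trylock_again;
		}

		schedule();
		set_current_state(state);
trylock_again:
		raw_spin_lock_irq(&sem->wait_lock);
	}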
Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Reviewed-and-tested-by: Mukesh Ojha <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Cc: [email protected]
---
kernel/locking/rwsem.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44873594de03..be2df9ea7c30 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
*/
if (first->handoff_set && (waiter != first))
return false;
-
- /*
- * First waiter can inherit a previously set handoff
- * bit and spin on rwsem if lock acquisition fails.
- */
- if (waiter == first)
- waiter->handoff_set = true;
}
new = count;
if (count & RWSEM_LOCK_MASK) {
+ /*
+ * A waiter (first or not) can set the handoff bit
+ * if it is an RT task or wait in the wait queue
+ * for too long.
+ */
if (has_handoff || (!rt_task(waiter->task) &&
!time_after(jiffies, waiter->timeout)))
return false;
@@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
} while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
/*
- * We have either acquired the lock with handoff bit cleared or
- * set the handoff bit.
+ * We have either acquired the lock with handoff bit cleared or set
+ * the handoff bit. Only the first waiter can have its handoff_set
+ * set here to enable optimistic spinning in slowpath loop.
*/
if (new & RWSEM_FLAG_HANDOFF) {
- waiter->handoff_set = true;
+ first->handoff_set = true;
lockevent_inc(rwsem_wlock_handoff);
return false;
}
--
2.31.1
Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
Disable preemption while trying for rwsem lock") disables preemption
when acquiring rwsem for write. However, preemption has not yet been
disabled when acquiring a read lock on a rwsem. So a reader can add a
RWSEM_READER_BIAS to the count without setting the owner to signal a
reader, then get preempted by an RT task which spins in the writer
slowpath as the owner remains NULL, leading to a live lock.
One easy way to fix this problem is to disable preemption at all the
down_read*() and up_read() code paths as implemented in this patch.
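For context, the window being closed is in the reader fast path, which
looks roughly like this (a sketch loosely based on rwsem_read_trylock();
treat it as an illustration, not the exact upstream code):

	static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
	{
		/* Speculatively add the reader count first. */
		*cntp = atomic_long_add_return_acquire(RWSEM_READER_BIAS, &sem->count);

		/*
		 * Without preemption disabled, an RT task can preempt the
		 * reader right here: count already carries RWSEM_READER_BIAS,
		 * but sem->owner is still NULL (cleared by the previous
		 * writer), so a handoff writer spinning on a NULL owner never
		 * sees the lock become free while the preempted reader never
		 * gets to run and drop its bias.
		 */

		if (!(*cntp & RWSEM_READ_FAILED_MASK)) {
			rwsem_set_reader_owned(sem);	/* record a reader owner */
			return true;
		}
		return false;
	}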
Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..ebaff8a87e1d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1091,7 +1091,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
/* Ordered by sem->wait_lock against rwsem_mark_wake(). */
break;
}
- schedule();
+ schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_reader);
}
@@ -1253,14 +1253,20 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
*/
static inline int __down_read_common(struct rw_semaphore *sem, int state)
{
+ int ret = 0;
long count;
+ preempt_disable();
if (!rwsem_read_trylock(sem, &count)) {
- if (IS_ERR(rwsem_down_read_slowpath(sem, count, state)))
- return -EINTR;
+ if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) {
+ ret = -EINTR;
+ goto out;
+ }
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
}
- return 0;
+out:
+ preempt_enable();
+ return ret;
}
static inline void __down_read(struct rw_semaphore *sem)
@@ -1280,19 +1286,23 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
static inline int __down_read_trylock(struct rw_semaphore *sem)
{
+ int ret = 0;
long tmp;
DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
+ preempt_disable();
tmp = atomic_long_read(&sem->count);
while (!(tmp & RWSEM_READ_FAILED_MASK)) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
tmp + RWSEM_READER_BIAS)) {
rwsem_set_reader_owned(sem);
- return 1;
+ ret = 1;
+ break;
}
}
- return 0;
+ preempt_enable();
+ return ret;
}
/*
@@ -1334,6 +1344,7 @@ static inline void __up_read(struct rw_semaphore *sem)
DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
+ preempt_disable();
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
@@ -1342,6 +1353,7 @@ static inline void __up_read(struct rw_semaphore *sem)
clear_nonspinnable(sem);
rwsem_wake(sem);
}
+ preempt_enable();
}
/*
--
2.31.1
Change the boolean waiter->handoff_set to an enum type so that we can
have more states in some later patches. Also use READ_ONCE() for reads
outside wait_lock critical sections and WRITE_ONCE() for writes inside
wait_lock critical sections for proper synchronization. There is no
functional change.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8159a69b5de8..aa771fa1a1fe 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -335,12 +335,17 @@ enum rwsem_waiter_type {
RWSEM_WAITING_FOR_READ
};
+enum rwsem_handoff_state {
+ HANDOFF_NONE = 0,
+ HANDOFF_REQUESTED,
+};
+
struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+ enum rwsem_handoff_state handoff_state;
unsigned long timeout;
- bool handoff_set;
};
#define rwsem_first_waiter(sem) \
list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -467,7 +472,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment -= RWSEM_FLAG_HANDOFF;
lockevent_inc(rwsem_rlock_handoff);
}
- waiter->handoff_set = true;
+ WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
}
atomic_long_add(-adjustment, &sem->count);
@@ -619,7 +624,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
* waiter is the one that set it. Otherwisee, we
* still try to acquire the rwsem.
*/
- if (first->handoff_set && (waiter != first))
+ if (first->handoff_state && (waiter != first))
return false;
}
@@ -647,11 +652,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
/*
* We have either acquired the lock with handoff bit cleared or set
- * the handoff bit. Only the first waiter can have its handoff_set
+ * the handoff bit. Only the first waiter can have its handoff_state
* set here to enable optimistic spinning in slowpath loop.
*/
if (new & RWSEM_FLAG_HANDOFF) {
- first->handoff_set = true;
+ WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
lockevent_inc(rwsem_wlock_handoff);
return false;
}
@@ -1035,7 +1040,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
- waiter.handoff_set = false;
+ waiter.handoff_state = HANDOFF_NONE;
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list)) {
@@ -1122,7 +1127,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
- waiter.handoff_set = false;
+ waiter.handoff_state = HANDOFF_NONE;
raw_spin_lock_irq(&sem->wait_lock);
rwsem_add_waiter(sem, &waiter);
@@ -1167,7 +1172,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* In this case, we attempt to acquire the lock again
* without sleeping.
*/
- if (waiter.handoff_set) {
+ if (READ_ONCE(waiter.handoff_state)) {
enum owner_state owner_state;
owner_state = rwsem_spin_on_owner(sem);
--
2.31.1
The previous patch has disabled preemption at all the down_read()
and up_read() code paths. For symmetry, this patch extends commit
48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem
lock") to have preemption disabled at all the down_write() and up_write()
code path including downgrade_write().
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ebaff8a87e1d..8159a69b5de8 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -256,16 +256,13 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp)
static inline bool rwsem_write_trylock(struct rw_semaphore *sem)
{
long tmp = RWSEM_UNLOCKED_VALUE;
- bool ret = false;
- preempt_disable();
if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) {
rwsem_set_owner(sem);
- ret = true;
+ return true;
}
- preempt_enable();
- return ret;
+ return false;
}
/*
@@ -716,7 +713,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
return false;
}
- preempt_disable();
/*
* Disable preemption is equal to the RCU read-side crital section,
* thus the task_strcut structure won't go away.
@@ -728,7 +724,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
if ((flags & RWSEM_NONSPINNABLE) ||
(owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner)))
ret = false;
- preempt_enable();
lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
@@ -828,8 +823,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
int loop = 0;
u64 rspin_threshold = 0;
- preempt_disable();
-
/* sem->wait_lock should not be held when doing optimistic spinning */
if (!osq_lock(&sem->osq))
goto done;
@@ -937,7 +930,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
}
osq_unlock(&sem->osq);
done:
- preempt_enable();
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
}
@@ -1178,15 +1170,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
if (waiter.handoff_set) {
enum owner_state owner_state;
- preempt_disable();
owner_state = rwsem_spin_on_owner(sem);
- preempt_enable();
-
if (owner_state == OWNER_NULL)
goto trylock_again;
}
- schedule();
+ schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
trylock_again:
@@ -1310,12 +1299,15 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
*/
static inline int __down_write_common(struct rw_semaphore *sem, int state)
{
+ int ret = 0;
+
+ preempt_disable();
if (unlikely(!rwsem_write_trylock(sem))) {
if (IS_ERR(rwsem_down_write_slowpath(sem, state)))
- return -EINTR;
+ ret = -EINTR;
}
-
- return 0;
+ preempt_enable();
+ return ret;
}
static inline void __down_write(struct rw_semaphore *sem)
@@ -1330,8 +1322,14 @@ static inline int __down_write_killable(struct rw_semaphore *sem)
static inline int __down_write_trylock(struct rw_semaphore *sem)
{
+ int ret;
+
+ preempt_disable();
DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
- return rwsem_write_trylock(sem);
+ ret = rwsem_write_trylock(sem);
+ preempt_enable();
+
+ return ret;
}
/*
@@ -1374,9 +1372,9 @@ static inline void __up_write(struct rw_semaphore *sem)
preempt_disable();
rwsem_clear_owner(sem);
tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count);
- preempt_enable();
if (unlikely(tmp & RWSEM_FLAG_WAITERS))
rwsem_wake(sem);
+ preempt_enable();
}
/*
@@ -1394,11 +1392,13 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* write side. As such, rely on RELEASE semantics.
*/
DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem);
+ preempt_disable();
tmp = atomic_long_fetch_add_release(
-RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count);
rwsem_set_reader_owned(sem);
if (tmp & RWSEM_FLAG_WAITERS)
rwsem_downgrade_wake(sem);
+ preempt_enable();
}
#else /* !CONFIG_PREEMPT_RT */
--
2.31.1
With the new direct rwsem lock handoff, the corresponding handoff lock
events are updated to also track the number of secondary lock handoffs
in rwsem_down_read_slowpath() to see how prevalent those handoff
events are. The number of primary lock handoffs in the unlock paths is
(rwsem_handoff_read + rwsem_handoff_write - rwsem_handoff_rslow).
After running a 96-thread rwsem microbenchmark with equal number
of readers and writers on a 2-socket 96-thread system for 40s, the
following handoff stats were obtained:
rwsem_handoff_read=189
rwsem_handoff_rslow=1
rwsem_handoff_write=6678
rwsem_handoff_wspin=6681
The number of primary handoffs was 6866 (189 + 6678 - 1), whereas there
was only one secondary handoff in this test run.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/lock_events_list.h | 6 ++++--
kernel/locking/rwsem.c | 9 +++++----
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 97fb6f3f840a..04d101767c2c 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -63,7 +63,9 @@ LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
LOCK_EVENT(rwsem_rlock_steal) /* # of read locks by lock stealing */
LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
-LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
LOCK_EVENT(rwsem_wlock) /* # of write locks acquired */
LOCK_EVENT(rwsem_wlock_fail) /* # of failed write lock acquisitions */
-LOCK_EVENT(rwsem_wlock_handoff) /* # of write lock handoffs */
+LOCK_EVENT(rwsem_handoff_read) /* # of read lock handoffs */
+LOCK_EVENT(rwsem_handoff_write) /* # of write lock handoffs */
+LOCK_EVENT(rwsem_handoff_rslow) /* # of handoffs in read slowpath */
+LOCK_EVENT(rwsem_handoff_wspin) /* # of handoff spins in write slowpath */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 047e5fcb2457..b3135e12ca22 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -469,10 +469,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* force the issue.
*/
if (time_after(jiffies, waiter->timeout)) {
- if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
+ if (!(oldcount & RWSEM_FLAG_HANDOFF))
adjustment -= RWSEM_FLAG_HANDOFF;
- lockevent_inc(rwsem_rlock_handoff);
- }
WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
}
@@ -677,7 +675,6 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
*/
if (new & RWSEM_FLAG_HANDOFF) {
WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
- lockevent_inc(rwsem_wlock_handoff);
return false;
}
@@ -1011,10 +1008,12 @@ static void rwsem_handoff(struct rw_semaphore *sem, long adj,
wake_type = RWSEM_WAKE_ANY;
adj += RWSEM_WRITER_LOCKED;
atomic_long_set(&sem->owner, (long)waiter->task);
+ lockevent_inc(rwsem_handoff_write);
} else {
wake_type = RWSEM_WAKE_READ_OWNED;
adj += RWSEM_READER_BIAS;
__rwsem_set_reader_owned(sem, waiter->task);
+ lockevent_inc(rwsem_handoff_read);
}
atomic_long_add(adj, &sem->count);
rwsem_mark_wake(sem, wake_type, wake_q);
@@ -1123,6 +1122,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
if (rwsem_first_waiter(sem)->type == RWSEM_WAITING_FOR_READ)
adjustment = 0;
rwsem_handoff(sem, adjustment, &wake_q);
+ lockevent_inc(rwsem_handoff_rslow);
if (!adjustment) {
raw_spin_unlock_irq(&sem->wait_lock);
@@ -1253,6 +1253,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
if (handoff == HANDOFF_REQUESTED) {
rwsem_spin_on_owner(sem);
handoff = READ_ONCE(waiter.handoff_state);
+ lockevent_inc(rwsem_handoff_wspin);
}
if (handoff == HANDOFF_GRANTED)
--
2.31.1
The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.
Reworking the code to enable a true lock handoff is more complex due to
the following facts:
1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
is too expensive to always take the wait_lock in the unlock path
to prevent racing.
2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
time to prevent a proper lock handoff from a reader owned rwsem.
A lock handoff can only be initiated when the following conditions are
true:
1) The RWSEM_FLAG_HANDOFF bit is set.
2) The task doing the handoff doesn't see any other active lock,
excluding the lock that it might have held itself.
The new handoff mechanism performs handoff in rwsem_wake() to minimize
overhead. The rwsem count will be known at that point to determine if
handoff should be done. However, there is a small time gap between the
time the rwsem becomes free and the time the wait_lock is taken, in which
a reader can come in and add a RWSEM_READER_BIAS to the count, or the
current first waiter can take the rwsem and clear RWSEM_FLAG_HANDOFF.
Either one will fail the handoff operation. To handle the former case, a
secondary handoff will also be done in rwsem_down_read_slowpath() to
catch it.
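Condensed, the two handoff points added by this patch look roughly like
this (a sketch simplified from the diff below; count and adjustment are
the locals used there):

	/* Primary handoff, in rwsem_wake(), with wait_lock held: */
	count = atomic_long_read(&sem->count);
	if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF))
		rwsem_handoff(sem, 0, &wake_q);	/* lock is free, grant it */

	/*
	 * Secondary handoff, in rwsem_down_read_slowpath(): the only active
	 * lock is this reader's own speculative RWSEM_READER_BIAS, so the
	 * handoff can still be granted to the first waiter.
	 */
	if ((count & RWSEM_FLAG_HANDOFF) &&
	    ((count & RWSEM_LOCK_MASK) == RWSEM_READER_BIAS))
		rwsem_handoff(sem, adjustment, &wake_q);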
With true lock handoff, there is no need to do a NULL owner spinning
anymore as wakeup will be performed if handoff is possible. So it
is likely that the first waiter won't actually go to sleep even when
schedule() is called in this case.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 135 +++++++++++++++++++++++++++++++++++------
1 file changed, 117 insertions(+), 18 deletions(-)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index aa771fa1a1fe..047e5fcb2457 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -338,6 +338,7 @@ enum rwsem_waiter_type {
enum rwsem_handoff_state {
HANDOFF_NONE = 0,
HANDOFF_REQUESTED,
+ HANDOFF_GRANTED,
};
struct rwsem_waiter {
@@ -486,6 +487,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
*/
owner = waiter->task;
__rwsem_set_reader_owned(sem, owner);
+ } else if (waiter->handoff_state == HANDOFF_GRANTED) {
+ /*
+ * rwsem_handoff() has added to count RWSEM_READER_BIAS of
+ * the first waiter.
+ */
+ adjustment = RWSEM_READER_BIAS;
}
/*
@@ -583,7 +590,7 @@ rwsem_del_wake_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter,
struct wake_q_head *wake_q)
__releases(&sem->wait_lock)
{
- bool first = rwsem_first_waiter(sem) == waiter;
+ struct rwsem_waiter *first = rwsem_first_waiter(sem);
wake_q_init(wake_q);
@@ -592,8 +599,21 @@ rwsem_del_wake_waiter(struct rw_semaphore *sem, struct rwsem_waiter *waiter,
* the first waiter, we wake up the remaining waiters as they may
* be eligible to acquire or spin on the lock.
*/
- if (rwsem_del_waiter(sem, waiter) && first)
+ if (rwsem_del_waiter(sem, waiter) && (waiter == first)) {
+ switch (waiter->handoff_state) {
+ case HANDOFF_GRANTED:
+ raw_spin_unlock_irq(&sem->wait_lock);
+ return;
+ case HANDOFF_REQUESTED:
+ /* Pass handoff state to the new first waiter */
+ first = rwsem_first_waiter(sem);
+ WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
+ fallthrough;
+ default:
+ break;
+ }
rwsem_mark_wake(sem, RWSEM_WAKE_ANY, wake_q);
+ }
raw_spin_unlock_irq(&sem->wait_lock);
if (!wake_q_empty(wake_q))
wake_up_q(wake_q);
@@ -759,6 +779,11 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
owner = rwsem_owner_flags(sem, &flags);
state = rwsem_owner_state(owner, flags);
+
+ /* A handoff may have been granted */
+ if (!flags && (owner == current))
+ return OWNER_NONSPINNABLE;
+
if (state != OWNER_WRITER)
return state;
@@ -969,6 +994,32 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
}
#endif
+/*
+ * Hand off the lock to the first waiter
+ */
+static void rwsem_handoff(struct rw_semaphore *sem, long adj,
+ struct wake_q_head *wake_q)
+{
+ struct rwsem_waiter *waiter;
+ enum rwsem_wake_type wake_type;
+
+ lockdep_assert_held(&sem->wait_lock);
+ adj -= RWSEM_FLAG_HANDOFF;
+ waiter = rwsem_first_waiter(sem);
+ WRITE_ONCE(waiter->handoff_state, HANDOFF_GRANTED);
+ if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
+ wake_type = RWSEM_WAKE_ANY;
+ adj += RWSEM_WRITER_LOCKED;
+ atomic_long_set(&sem->owner, (long)waiter->task);
+ } else {
+ wake_type = RWSEM_WAKE_READ_OWNED;
+ adj += RWSEM_READER_BIAS;
+ __rwsem_set_reader_owned(sem, waiter->task);
+ }
+ atomic_long_add(adj, &sem->count);
+ rwsem_mark_wake(sem, wake_type, wake_q);
+}
+
/*
* Prepare to wake up waiter(s) in the wait queue by putting them into the
* given wake_q if the rwsem lock owner isn't a writer. If rwsem is likely
@@ -1043,6 +1094,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
waiter.handoff_state = HANDOFF_NONE;
raw_spin_lock_irq(&sem->wait_lock);
+ count = atomic_long_read(&sem->count);
if (list_empty(&sem->wait_list)) {
/*
* In case the wait queue is empty and the lock isn't owned
@@ -1050,7 +1102,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
* immediately as its RWSEM_READER_BIAS has already been set
* in the count.
*/
- if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
+ if (!(count & RWSEM_WRITER_MASK)) {
/* Provide lock ACQUIRE */
smp_acquire__after_ctrl_dep();
raw_spin_unlock_irq(&sem->wait_lock);
@@ -1059,13 +1111,36 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
return sem;
}
adjustment += RWSEM_FLAG_WAITERS;
+ } else if ((count & RWSEM_FLAG_HANDOFF) &&
+ ((count & RWSEM_LOCK_MASK) == RWSEM_READER_BIAS)) {
+ /*
+ * If the waiter to be handed off is a reader, all the
+ * readers in the wait queue will be waken up. As this reader
+ * hasn't been queued in the wait queue yet, it may as well
+ * keep its RWSEM_READER_BIAS and return after waking up
+ * other readers in the queue.
+ */
+ if (rwsem_first_waiter(sem)->type == RWSEM_WAITING_FOR_READ)
+ adjustment = 0;
+ rwsem_handoff(sem, adjustment, &wake_q);
+
+ if (!adjustment) {
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+ return sem;
+ }
+ adjustment = 0;
}
rwsem_add_waiter(sem, &waiter);
- /* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
-
- rwsem_cond_wake_waiter(sem, count, &wake_q);
+ if (adjustment) {
+ /*
+ * We are now waiting on the lock with no handoff, but no
+ * longer actively locking.
+ */
+ count = atomic_long_add_return(adjustment, &sem->count);
+ rwsem_cond_wake_waiter(sem, count, &wake_q);
+ }
raw_spin_unlock_irq(&sem->wait_lock);
if (!wake_q_empty(&wake_q))
@@ -1154,6 +1229,8 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
trace_contention_begin(sem, LCB_F_WRITE);
for (;;) {
+ enum rwsem_handoff_state handoff;
+
if (rwsem_try_write_lock(sem, &waiter)) {
/* rwsem_try_write_lock() implies ACQUIRE on success */
break;
@@ -1168,26 +1245,33 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* After setting the handoff bit and failing to acquire
* the lock, attempt to spin on owner to accelerate lock
* transfer. If the previous owner is a on-cpu writer and it
- * has just released the lock, OWNER_NULL will be returned.
- * In this case, we attempt to acquire the lock again
- * without sleeping.
+ * has just released the lock, handoff_state is likely to be
+ * set to HANDOFF_GRANTED or is to be set soon.
*/
- if (READ_ONCE(waiter.handoff_state)) {
- enum owner_state owner_state;
+ handoff = READ_ONCE(waiter.handoff_state);
+ if (handoff) {
+ if (handoff == HANDOFF_REQUESTED) {
+ rwsem_spin_on_owner(sem);
+ handoff = READ_ONCE(waiter.handoff_state);
+ }
- owner_state = rwsem_spin_on_owner(sem);
- if (owner_state == OWNER_NULL)
- goto trylock_again;
+ if (handoff == HANDOFF_GRANTED)
+ goto skip_sleep;
}
schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
-trylock_again:
+skip_sleep:
raw_spin_lock_irq(&sem->wait_lock);
+ if (waiter.handoff_state == HANDOFF_GRANTED) {
+ rwsem_del_waiter(sem, &waiter);
+ break;
+ }
}
__set_current_state(TASK_RUNNING);
raw_spin_unlock_irq(&sem->wait_lock);
+out_lock:
lockevent_inc(rwsem_wlock);
trace_contention_end(sem, 0);
return sem;
@@ -1196,6 +1280,9 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
rwsem_del_wake_waiter(sem, &waiter, &wake_q);
+ if (unlikely(READ_ONCE(waiter.handoff_state) == HANDOFF_GRANTED))
+ goto out_lock;
+
lockevent_inc(rwsem_wlock_fail);
trace_contention_end(sem, -EINTR);
return ERR_PTR(-EINTR);
@@ -1207,12 +1294,24 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
*/
static struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
- unsigned long flags;
DEFINE_WAKE_Q(wake_q);
+ unsigned long flags;
+ long count;
raw_spin_lock_irqsave(&sem->wait_lock, flags);
- if (!list_empty(&sem->wait_list))
+ if (list_empty(&sem->wait_list)) {
+ raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
+ return sem;
+ }
+ /*
+ * If the rwsem is free and handoff flag is set with wait_lock held,
+ * no other CPUs can take an active lock.
+ */
+ count = atomic_long_read(&sem->count);
+ if (!(count & RWSEM_LOCK_MASK) && (count & RWSEM_FLAG_HANDOFF))
+ rwsem_handoff(sem, 0, &wake_q);
+ else
rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
--
2.31.1
On Thu, Nov 17, 2022 at 09:20:12PM -0500, Waiman Long wrote:
> Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
> spin on owner") assumes that when the owner field is changed to NULL,
> the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
> Disable preemption while trying for rwsem lock") disables preemption
> when acquiring rwsem for write. However, preemption has not yet been
> disabled when acquiring a read lock on a rwsem. So a reader can add a
> RWSEM_READER_BIAS to count without setting owner to signal a reader,
> got preempted out by a RT task which then spins in the writer slowpath
> as owner remains NULL leading to live lock.
>
> One easy way to fix this problem is to disable preemption at all the
> down_read*() and up_read() code paths as implemented in this patch.
>
> Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
> Reported-by: Mukesh Ojha <[email protected]>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> ---
Tested-by: Jiri Wiesner <[email protected]>
--
Jiri Wiesner
SUSE Labs
On Thu, Nov 17, 2022 at 09:20:13PM -0500, Waiman Long wrote:
> The previous patch has disabled preemption at all the down_read()
> and up_read() code paths. For symmetry, this patch extends commit
> 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem
> lock") to have preemption disabled at all the down_write() and up_write()
> code path including downgrade_write().
>
> Suggested-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> ---
Tested-by: Jiri Wiesner <[email protected]>
--
Jiri Wiesner
SUSE Labs
On Thu, Nov 17, 2022 at 09:20:11PM -0500, Waiman Long wrote:
> A non-first waiter can potentially spin in the for loop of
> rwsem_down_write_slowpath() without sleeping but fail to acquire the
> lock even if the rwsem is free if the following sequence happens:
>
>  Non-first RT waiter    First waiter      Lock holder
>  -------------------    ------------      -----------
>  Acquire wait_lock
>  rwsem_try_write_lock():
>    Set handoff bit if RT or
>      wait too long
>    Set waiter->handoff_set
>  Release wait_lock
>                         Acquire wait_lock
>                         Inherit waiter->handoff_set
>                         Release wait_lock
>                                           Clear owner
>                                           Release lock
>  if (waiter.handoff_set) {
>    rwsem_spin_on_owner();
>    if (OWNER_NULL)
>      goto trylock_again;
>  }
>  trylock_again:
>  Acquire wait_lock
>  rwsem_try_write_lock():
>    if (first->handoff_set && (waiter != first))
>      return false;
>  Release wait_lock
>
> A non-first waiter cannot really acquire the rwsem even if it mistakenly
> believes that it can spin on OWNER_NULL value. If that waiter happens
> to be an RT task running on the same CPU as the first waiter, it can
> block the first waiter from acquiring the rwsem leading to live lock.
> Fix this problem by making sure that a non-first waiter cannot spin in
> the slowpath loop without sleeping.
>
> Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
> Reviewed-and-tested-by: Mukesh Ojha <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> Cc: [email protected]
> ---
I was checking if commit 6eebd5fb2083 ("locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter") resolves the issue that was discussed in [1]. I modified the program and script slightly:
fsim.c:
#include <unistd.h>
#include <stdlib.h>
#include <signal.h>

void sig_handle(int sig) { exit(0); }

int main(void)
{
	unsigned long c = 0;

	/* Burn CPU until the SIGALRM fires after 3 seconds. */
	signal(SIGALRM, sig_handle);
	alarm(3);
	while (1)
		c++;
}
run-fsim.sh:
#!/bin/bash
if [ ! -e fsim ]; then
	gcc -o fsim fsim.c
	if [ $? -ne 0 ]; then
		echo Failed to compile fsim
		exit -1
	fi
fi
MAX_ITERATIONS=20000
# The fsim processes are meant to run on both logical CPUs belonging to a CPU core, e.g. 1 and 129.
CPU_RANGE1="${1:-1 11}"
CPU_RANGE2="${1:-129 139}"
for i in `seq 1 $MAX_ITERATIONS`; do
	echo "Start $i/$MAX_ITERATIONS: `date`"
	for CPU in `seq $CPU_RANGE1` `seq $CPU_RANGE2`; do
		taskset -c $CPU chrt -r 10 ./fsim &>/dev/null &
		taskset -c $CPU chrt -r 20 ./fsim &>/dev/null &
		taskset -c $CPU chrt -r 30 ./fsim &>/dev/null &
		taskset -c $CPU chrt -r 40 ./fsim &>/dev/null &
	done
	echo "Wait $i/$MAX_ITERATIONS: `date`"
	wait
done
No soft lockups were triggered but after 1.5 hours of testing, the fsim processes got stuck and only one of them was visible in the output of top:
> top - 18:45:01 up 44 min, 3 users, load average: 72.00, 71.04, 54.81
> Tasks: 2226 total, 4 running, 2222 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 0.0 us, 0.4 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> MiB Mem : 239777.1+total, 235671.7+free, 4332.156 used, 1435.641 buff/cache
> MiB Swap: 1023.996 total, 1023.996 free, 0.000 used. 235444.9+avail Mem
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 100666 root -31 0 0 0 0 D 94.84 0.000 14:59.40 fsim
> 98193 root 20 0 42224 6844 3484 R 0.794 0.003 0:07.05 top
> 1 root 20 0 79220 12544 9124 S 0.000 0.005 0:11.95 systemd
All of the fsim processes got stuck at the same code path - while exiting:
> [ 2462.649033] INFO: task fsim:100600 blocked for more than 491 seconds.
> [ 2462.649036] Tainted: G E N 5.14.21-sle15-sp5-221214-hoff3-7 #8
> [ 2462.649038] task:fsim state:D stack: 0 pid:100600 ppid: 95456 flags:0x00000000
> [ 2462.649042] Call Trace:
> [ 2462.649045] <TASK>
> [ 2462.649046] __schedule+0x2cd/0x1140
> [ 2462.649059] schedule+0x5c/0xc0
> [ 2462.649061] rwsem_down_write_slowpath+0x349/0x5d0
> [ 2462.649070] unlink_file_vma+0x2d/0x60
> [ 2462.649074] free_pgtables+0x67/0x110
> [ 2462.649083] exit_mmap+0xaf/0x1f0
> [ 2462.649088] mmput+0x56/0x120
> [ 2462.649090] do_exit+0x306/0xb50
> [ 2462.649095] do_group_exit+0x3a/0xa0
> [ 2462.649098] __x64_sys_exit_group+0x14/0x20
> [ 2462.649102] do_syscall_64+0x5b/0x80
> [ 2462.649116] entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [ 2462.649120] RIP: 0033:0x7f90abae6c46
> [ 2462.649122] RSP: 002b:00007ffc0ca21638 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [ 2462.649124] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f90abae6c46
> [ 2462.649125] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [ 2462.649127] RBP: 00007f90abdf5970 R08: 00000000000000e7 R09: ffffffffffffff80
> [ 2462.649128] R10: 0000000000000002 R11: 0000000000000246 R12: 00007f90abdf5970
> [ 2462.649129] R13: 0000000000000001 R14: 00007f90abdf9328 R15: 0000000000000000
> [ 2462.649132] </TASK>
> [ 2462.649133] INFO: task fsim:100601 blocked for more than 491 seconds.
> [ 2462.649216] INFO: task fsim:100603 blocked for more than 491 seconds.
> [ 2462.649295] INFO: task fsim:100604 blocked for more than 491 seconds.
> [ 2462.649371] INFO: task fsim:100605 blocked for more than 491 seconds.
> [ 2462.649449] INFO: task fsim:100606 blocked for more than 491 seconds.
> [ 2462.649526] INFO: task fsim:100607 blocked for more than 491 seconds.
> [ 2462.649606] INFO: task fsim:100608 blocked for more than 491 seconds.
> [ 2462.649676] INFO: task fsim:100609 blocked for more than 491 seconds.
So, I tested these fixes all together added on top of 6eebd5fb2083:
* locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
* locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
* locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
After 20 hours of runtime, none of the fsim processes got stuck, nor did any soft lockups occur. AFAICT, it works.
Tested-by: Jiri Wiesner <[email protected]>
[1] https://lore.kernel.org/lkml/[email protected]/
--
Jiri Wiesner
SUSE Labs
On 11/17/22 21:20, Waiman Long wrote:
> v6:
> - Fix an error in patch 2 reported by kernel test robot.
>
> v5:
> - Drop patch 2 and replace it with 2 new patches disabling preemption on
> all reader functions and writer functions respectively. The other
> patches are adjusted accordingly.
>
> v4:
> - Update patch descriptions in patches 1 & 2 to make clear the live
> lock conditions that are being fixed by these patches. There is no code
> change from v3.
>
> It turns out the current waiter optimistic spinning code does not work
> that well if we have RT tasks in the mix. This patch series include two
> different fixes to resolve those issues. The last 3 patches modify the
> handoff code to implement true lock handoff similar to that of mutex.
>
> Waiman Long (6):
> locking/rwsem: Prevent non-first waiter from spinning in down_write()
> slowpath
> locking/rwsem: Disable preemption at all down_read*() and up_read()
> code paths
> locking/rwsem: Disable preemption at all down_write*() and up_write()
> code paths
> locking/rwsem: Change waiter->hanodff_set to a handoff_state enum
> locking/rwsem: Enable direct rwsem lock handoff
> locking/rwsem: Update handoff lock events tracking
>
> kernel/locking/lock_events_list.h | 6 +-
> kernel/locking/rwsem.c | 240 ++++++++++++++++++++++--------
> 2 files changed, 182 insertions(+), 64 deletions(-)
>
Peter,
I know that you are busy these days. Do you have any further comments or
suggestions on this patch series?
Looking forward to your reply.
Thanks,
Longman
On 11/17/22 21:20, Waiman Long wrote:
> A non-first waiter can potentially spin in the for loop of
> rwsem_down_write_slowpath() without sleeping but fail to acquire the
> lock even if the rwsem is free if the following sequence happens:
>
>  Non-first RT waiter    First waiter      Lock holder
>  -------------------    ------------      -----------
>  Acquire wait_lock
>  rwsem_try_write_lock():
>    Set handoff bit if RT or
>      wait too long
>    Set waiter->handoff_set
>  Release wait_lock
>                         Acquire wait_lock
>                         Inherit waiter->handoff_set
>                         Release wait_lock
>                                           Clear owner
>                                           Release lock
>  if (waiter.handoff_set) {
>    rwsem_spin_on_owner();
>    if (OWNER_NULL)
>      goto trylock_again;
>  }
>  trylock_again:
>  Acquire wait_lock
>  rwsem_try_write_lock():
>    if (first->handoff_set && (waiter != first))
>      return false;
>  Release wait_lock
>
> A non-first waiter cannot really acquire the rwsem even if it mistakenly
> believes that it can spin on OWNER_NULL value. If that waiter happens
> to be an RT task running on the same CPU as the first waiter, it can
> block the first waiter from acquiring the rwsem leading to live lock.
> Fix this problem by making sure that a non-first waiter cannot spin in
> the slowpath loop without sleeping.
>
> Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
> Reviewed-and-tested-by: Mukesh Ojha <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> Cc: [email protected]
> ---
> kernel/locking/rwsem.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 44873594de03..be2df9ea7c30 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> */
> if (first->handoff_set && (waiter != first))
> return false;
> -
> - /*
> - * First waiter can inherit a previously set handoff
> - * bit and spin on rwsem if lock acquisition fails.
> - */
> - if (waiter == first)
> - waiter->handoff_set = true;
> }
>
> new = count;
>
> if (count & RWSEM_LOCK_MASK) {
> + /*
> + * A waiter (first or not) can set the handoff bit
> + * if it is an RT task or wait in the wait queue
> + * for too long.
> + */
> if (has_handoff || (!rt_task(waiter->task) &&
> !time_after(jiffies, waiter->timeout)))
> return false;
> @@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new));
>
> /*
> - * We have either acquired the lock with handoff bit cleared or
> - * set the handoff bit.
> + * We have either acquired the lock with handoff bit cleared or set
> + * the handoff bit. Only the first waiter can have its handoff_set
> + * set here to enable optimistic spinning in slowpath loop.
> */
> if (new & RWSEM_FLAG_HANDOFF) {
> - waiter->handoff_set = true;
> + first->handoff_set = true;
> lockevent_inc(rwsem_wlock_handoff);
> return false;
> }
Peter,
I would really like to get this fix patch merged as soon as possible if
you don't see any problem with it. For the rest of the series, you can
take your time.
Cheers,
Longman
On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:
> Waiman Long (6):
> locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
> locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
> locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
> locking/rwsem: Change waiter->hanodff_set to a handoff_state enum
After all these we still have down_read_non_owner() using
__rwsem_set_reader_owned() outside of the preempt_disable() region.
Now, let me go stare at this one:
> locking/rwsem: Enable direct rwsem lock handoff
On 1/22/23 08:46, Peter Zijlstra wrote:
> On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:
>
>> Waiman Long (6):
>> locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
>> locking/rwsem: Disable preemption at all down_read*() and up_read() code paths
>> locking/rwsem: Disable preemption at all down_write*() and up_write() code paths
>> locking/rwsem: Change waiter->hanodff_set to a handoff_state enum
> After all these we still have down_read_non_owner() using
> __rwsem_set_reader_owner() outside of the preempt_disable() region.
>
> Now, let me go stare at this one:
Thanks for spotting that. I kind of overlooked the fact that we have a
down_read_non_owner(). Will update that as well.
Cheers,
Longman
>> locking/rwsem: Enable direct rwsem lock handoff
>
On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> The lock handoff provided in rwsem isn't a true handoff like that in
> the mutex. Instead, it is more like a quiescent state where optimistic
> spinning and lock stealing are disabled to make it easier for the first
> waiter to acquire the lock.
>
> Reworking the code to enable a true lock handoff is more complex due to
> the following facts:
> 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
> is too expensive to always take the wait_lock in the unlock path
> to prevent racing.
Specifically, the issue is that we'd need to turn the
atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
loop, like:
	tmp = atomic_long_read(&sem->count);
	do {
		if (tmp & (WAITERS|HANDOFF))
			return slow_unlock();
	} while (!atomic_long_try_cmpxchg_release(&sem->count, &tmp,
		  tmp - RWSEM_{READER_BIAS,WRITE_LOCKED}));
in order to not race with a concurrent setting of the HANDOFF bit,
right? And we don't like to turn unlock into a cmpxchg loop.
(OTOH we sorta do this for mutex, uncontended mutex has cmpxchg lock and
unlock, any fail and we go to the slow path -- I suppose the distinct
difference is that we sorta expect some contention on the read side)
> 2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
> time to prevent a proper lock handoff from a reader owned rwsem.
This would be much the same, right? We'd have to turn
rwsem_read_trylock() into a cmpxchg-loop and we don't like that.
Therefore we do that speculative add and fix up later.
Now, I'm not entirely sure what issue you allude to here; is the problem
that you can't quite tell when the last reader is gone?
> A lock handoff can only be initiated when the following conditions are
> true:
> 1) The RWSEM_FLAG_HANDOFF bit is set.
d'uh ;-)
> 2) The task to do the handoff don't see any other active lock
> excluding the lock that it might have held.
2) here is the 2) above, right?
> The new handoff mechanism performs handoff in rwsem_wakeup() to minimize
> overhead. The rwsem count will be known at that point to determine if
> handoff should be done. However, there is a small time gap between the
> rwsem becomes free and the wait_lock is taken
Right, that's between atomic_long_fetch_add_release() and calling the
slow path because WAITERS bit is set.
> where a reader can come in and add a RWSEM_READER_BIAS to the count or
Both 2s above.
> the current first waiter can take the rwsem and clear
> RWSEM_FLAG_HANDOFF in the interim.
The actual intended action.
> That will fail the handoff operation.
I would not list that latter as a failure, it's exactly what we want to
happen, no?
> To handle the former case, a secondary handoff will also be done in
> the rwsem_down_read_slowpath() to catch it.
Right. In short:
Having HANDOVER set:
- implies WAITERS set
- disables all fastpaths (spinning or otherwise)
- dis-allows anybody except first waiter to obtain lock
Therefore, having the window between clearing owner and prodding first
waiter is 'harmless'.
> With true lock handoff, there is no need to do a NULL owner spinning
> anymore as wakeup will be performed if handoff is possible. So it
> is likely that the first waiter won't actually go to sleep even when
> schedule() is called in this case.
Right, removing that NULL spinning was the whole purpose -- except I
seem to have forgotten why it was a problem :-)
OK, lemme go read the actual patch.
Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
above, you added a whole intermediate GRANTED state. Why?
Since we fundamentally must deal with the release->wait_lock hole, why
do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
Why can't we skip the whole rwsem_wake()->GRANTED part and only do
handoff in the slowpath?
On 1/23/23 09:59, Peter Zijlstra wrote:
> On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
>> The lock handoff provided in rwsem isn't a true handoff like that in
>> the mutex. Instead, it is more like a quiescent state where optimistic
>> spinning and lock stealing are disabled to make it easier for the first
>> waiter to acquire the lock.
>>
>> Reworking the code to enable a true lock handoff is more complex due to
>> the following facts:
>> 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
>> is too expensive to always take the wait_lock in the unlock path
>> to prevent racing.
> Specifically, the issue is that we'd need to turn the
> atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> loop, like:
>
> tmp = atomic_long_read(&sem->count);
> do {
> if (tmp & (WAITERS|HANDOFF))
> return slow_unock();
> } while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
> tmp - RWSEM_{READER_BIAS,WRITE_LOCKED});
>
> in order to not race with a concurrent setting of the HANDOFF bit,
> right? And we don't like to turn unlock into a cmpxchg loop.
>
> (OTOH we sorta do this for mutex, unconteded mutex has cmpxchg lock and
> unlock, any fail and we go to the slow path -- I suppose the distinct
> difference is that we sorta expect some contention on the read side)
I see that your inclination is to do the handoff right at the unlock
time. It is certainly possible to do that, but it will be more complex
in the case of rwsem than mutex.
First of all for mutex, the owner is the lock word. You do a successful
cmpxchg once and it is all done. rwsem isn't like that. Its owner field
is separated from the actual lock word (count). So setting the right
lock value and owner cannot be atomic. That is why I am using the
wait_lock for synchronizing between the unlocker and the first waiter
that set handoff. Since rwsem_wake() will take the wait_lock anyway, I
decided to do the handoff at that time.
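To illustrate that point, a rough sketch (not the exact upstream code;
the mutex side is condensed from __mutex_handoff(), the rwsem side from
rwsem_handoff() in this series; old, first_waiter, adj and waiter are
illustrative locals):

	/*
	 * mutex: the owner task pointer and the handoff/pickup flags share
	 * one atomic word, so the hand-over is a single cmpxchg on
	 * lock->owner.
	 */
	atomic_long_cmpxchg_release(&lock->owner, old,
			(unsigned long)first_waiter | MUTEX_FLAG_PICKUP);

	/*
	 * rwsem: the lock word (sem->count) and sem->owner are separate, so
	 * granting the lock and recording the owner cannot be done in one
	 * atomic operation; both updates are serialized by sem->wait_lock.
	 */
	atomic_long_add(adj, &sem->count);
	atomic_long_set(&sem->owner, (long)waiter->task);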
Another consideration is to minimize the unlock overhead. Doing the
handoff at unlock time would require an extra read of the rwsem count.
> 2) The reader lock fast path may add a RWSEM_READER_BIAS at the wrong
> time to prevent a proper lock handoff from a reader owned rwsem.
> This would be much the same, right? We'd have to turn
> rwsem_read_trylock() into a cmpxchg-loop and we don't like that.
> Therefore we do that speculative add and fix up later.
>
> Now, I'm not enturely sure what issue you allude to here; is the problem
> that you can't quite tell when the last reader is gone?
>> A lock handoff can only be initiated when the following conditions are
>> true:
>> 1) The RWSEM_FLAG_HANDOFF bit is set.
> d'uh ;-)
>
>> 2) The task to do the handoff don't see any other active lock
>> excluding the lock that it might have held.
> 2) here is the 2) above, right?
>
>> The new handoff mechanism performs handoff in rwsem_wakeup() to minimize
>> overhead. The rwsem count will be known at that point to determine if
>> handoff should be done. However, there is a small time gap between the
>> rwsem becomes free and the wait_lock is taken
> Right, that's between atomic_long_fetch_add_release() and calling the
> slow path because WAITERS bit is set.
Yes, there is no extra overhead unless the waiter bit is set.
>
>> where a reader can come in and add a RWSEM_READER_BIAS to the count or
> Both 2s above.
>
>> the current first waiter can take the rwsem and clear
>> RWSEM_FLAG_HANDOFF in the interim.
> The actual intended action.
>
>> That will fail the handoff operation.
> I would not list that latter as a failure, it's exactly what we want to
> happen, no?
Yes, that is the intended action.
I added this because even if the HANDOFF bit is observed to be set at
unlock time, it does not guarantee that a handoff can be successfully
done, because the first waiter can be interrupted or killed in the
interim. The HANDOFF bit has to be confirmed to be set
while holding the wait lock to be sure that we can do a handoff.
>> To handle the former case, a secondary handoff will also be done in
>> the rwsem_down_read_slowpath() to catch it.
> Right. In short:
>
> Having HANDOVER set:
> - implies WAITERS set
> - disables all fastpaths (spinning or otherwise)
> - dis-allows anybody except first waiter to obtain lock
>
> Therefore, having the window between clearing owner and prodding first
> waiter is 'harmless'.
As said above, we need to confirm that the HANDOFF bit is set with
wait_lock held. Now, the HANDOFF bit may be set at unlock time, or it
may not be.
We can pass the count value fetched at unlock time down to rwsem_wake()
to confirm that the HANDOFF bit is set at unlock time. However, it is
also possible that the original waiter that set HANDOFF has bailed out,
then a reader acquires the lock and another waiter sets HANDOFF before
the unlocker acquires the wait_lock. Then the rwsem is really
reader-owned in this case, so we can't perform the handoff. That is why
I also check whether there is an active lock (mostly a read lock) in
rwsem_wake(). However, that can be a false negative because an incoming
reader may have just added a READER_BIAS which is to be removed soon.
That is the reason why I have a secondary handoff check in the reader
slowpath.
>
>> With true lock handoff, there is no need to do a NULL owner spinning
>> anymore as wakeup will be performed if handoff is possible. So it
>> is likely that the first waiter won't actually go to sleep even when
>> schedule() is called in this case.
> Right, removing that NULL spinning was the whole purpose -- except I
> seem to have forgotten why it was a problem :-)
>
> OK, lemme go read the actual patch.
>
> Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> above, you added a whole intermediate GRANTED state. Why?
>
> Since we fundamentally must deal with the release->wait_lock hole, why
> do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> handoff in the slowpath?
First of all, the owner value for a reader-owned rwsem is mostly
advisory, as it holds only one of the possible owners. So it may be a
bit risky to use it as an indication that a handoff has happened, as it
may be screwed up in some rare cases. That is why I use the repurposed
handoff_state value in the waiter structure. Also, reading this value is
less costly than reading the rwsem cacheline, which can be heavily
contended.
I will update the patch description to highlight the points that I
discussed in this email.
Cheers,
Longman
On 1/22/23 22:40, Waiman Long wrote:
>
> On 1/22/23 08:46, Peter Zijlstra wrote:
>> On Thu, Nov 17, 2022 at 09:20:10PM -0500, Waiman Long wrote:
>>
>>> Waiman Long (6):
>>> locking/rwsem: Prevent non-first waiter from spinning in
>>> down_write() slowpath
>>> locking/rwsem: Disable preemption at all down_read*() and
>>> up_read() code paths
>>> locking/rwsem: Disable preemption at all down_write*() and
>>> up_write() code paths
>>> locking/rwsem: Change waiter->hanodff_set to a handoff_state enum
>> After all these we still have down_read_non_owner() using
>> __rwsem_set_reader_owner() outside of the preempt_disable() region.
>>
>> Now, let me go stare at this one:
>
> Thanks for spotting that. I kind of overlook the fact we have a
> down_read_non_owner(). Will update that as well.
After looking further into this, we will need to either pass one more
parameter to __down_read() for this special case or to do
preempt_disable/enable in each of the higher level down_read functions.
As down_read_non_owner() is a rarely called function, I doubt it is
worth the extra effort to do that since the owner value (other than the
RWSEM_READER_OWNED bit) is for debugging purposes only and is not
critical to the correct functioning of a rwsem. I will add a comment to
note that.
Cheers,
Longman
On 1/23/23 12:30, Waiman Long wrote:
> I will update the patch description to highlight the points that I
> discussed in this email.
I am planning to update the patch description as follows:
The lock handoff provided in rwsem isn't a true handoff like that in
the mutex. Instead, it is more like a quiescent state where optimistic
spinning and lock stealing are disabled to make it easier for the first
waiter to acquire the lock.
For a mutex, lock handoff is done at unlock time as the owner value and
the handoff bit are in the same lock word and can be updated atomically.
That is not the case for rwsem, which has a separate count value for
locking and an owner value. The only way to update them in a
quasi-atomic way is to use the wait_lock for synchronization, as the
handoff bit can
only be updated while holding the wait_lock. So for rwsem, the new
lock handoff mechanism is done mostly at rwsem_wake() time when the
wait_lock has to be acquired anyway to minimize additional overhead.
Passing the count value at unlock time down to rwsem_wake() to
determine
if handoff should be done is not safe as the waiter that set the
RWSEM_FLAG_HANDOFF bit may have been interrupted out or killed in the
interim. So we need to recheck the count value again after taking the
wait_lock. If there is an active lock, we can't perform the handoff
even if the handoff bit is set at both the unlock and rwsem_wake()
times. It is because there is a slight possibility that the original
waiter that set the handoff bit may have bailed out followed by a read
lock and then the handoff bit is set by another waiter.
It is also likely that the active lock in this case may be a transient
RWSEM_READER_BIAS that will be removed soon. So we have a secondary
handoff done at reader slow path to handle this particular case.
For a reader-owned rwsem, the owner value other than the
RWSEM_READER_OWNED bit is mostly for debugging purposes only. So it is
not safe to use the owner value to confirm that a handoff to a reader
has happened. On the
other hand, we can do that when handing off to a writer. However, it
is simpler to use the same mechanism to notify a handoff has happened
for both readers and writers. So a new HANDOFF_GRANTED state is added
to enum rwsem_handoff_state to signify that. This new value will be
written to the handoff_state value of the first waiter.
With true lock handoff, there is no need to do a NULL owner spinning
anymore as wakeup will be performed if handoff is successful. So it
is likely that the first waiter won't actually go to sleep even when
schedule() is called in this case.
Please let me know what you think.
Cheers,
Longman
On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
> On 1/23/23 09:59, Peter Zijlstra wrote:
> > On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> > > The lock handoff provided in rwsem isn't a true handoff like that in
> > > the mutex. Instead, it is more like a quiescent state where optimistic
> > > spinning and lock stealing are disabled to make it easier for the first
> > > waiter to acquire the lock.
> > >
> > > Reworking the code to enable a true lock handoff is more complex due to
> > > the following facts:
> > > 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
> > > is too expensive to always take the wait_lock in the unlock path
> > > to prevent racing.
> > Specifically, the issue is that we'd need to turn the
> > atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> > loop, like:
> >
> > tmp = atomic_long_read(&sem->count);
> > do {
> > if (tmp & (WAITERS|HANDOFF))
> > return slow_unock();
> > } while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
> > tmp - RWSEM_{READER_BIAS,WRITE_LOCKED});
> >
> > in order to not race with a concurrent setting of the HANDOFF bit,
> > right? And we don't like to turn unlock into a cmpxchg loop.
> >
> > (OTOH we sorta do this for mutex, unconteded mutex has cmpxchg lock and
> > unlock, any fail and we go to the slow path -- I suppose the distinct
> > difference is that we sorta expect some contention on the read side)
>
> I see that your inclination is to do the handoff right at the unlock time.
> It is certainly possible to do that, but it will be more complex in the case
> of rwsem than mutex.
Still, it would make things ever so much simpler -- but I agree we'll
probably not get away with it on the performance side of things.
> > Right. In short:
> >
> > Having HANDOVER set:
> > - implies WAITERS set
> > - disables all fastpaths (spinning or otherwise)
> > - dis-allows anybody except first waiter to obtain lock
> >
> > Therefore, having the window between clearing owner and prodding first
> > waiter is 'harmless'.
>
> As said above, we need to confirm that the HANDOFF bit is set with wait_lock
> held. Now, the HANDOFF bit may be set at unlock time, or it may not be.
>
> We can pass the count value fetched at unlock time down to rwsem_wake() to
> confirm that the HANDOFF bit was set at unlock time. However, it is also
> possible that the original waiter that set HANDOFF has bailed out, then a
> reader acquires the lock and another waiter sets HANDOFF before the unlocker
> acquires the wait_lock. In that case the rwsem is really reader-owned, so we
> can't perform the handoff. That is why I also check whether there is an
> active lock (mostly a read lock) at rwsem_wake(). However, that can be a
> false negative because an incoming reader may have just added a READER_BIAS
> which is to be removed soon. That is the reason why I have a secondary
> handoff check in the reader slowpath.
>
> >
> > > With true lock handoff, there is no need to do a NULL owner spinning
> > > anymore as wakeup will be performed if handoff is possible. So it
> > > is likely that the first waiter won't actually go to sleep even when
> > > schedule() is called in this case.
> > Right, removing that NULL spinning was the whole purpose -- except I
> > seem to have forgotten why it was a problem :-)
> >
> > OK, lemme go read the actual patch.
> >
> > Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> > above, you added a whole intermediate GRANTED state. Why?
> >
> > Since we fundamentally must deal with the release->wait_lock hole, why
> > do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> > Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> > handoff in the slowpath?
>
> First of all, the owner value for a reader-owned rwsem is mostly of advisory
> value, as it holds only one of the possible owners. So it may be a bit risky
> to use it as an indication that a handoff has happened, as it may be screwed
> up in some rare cases. That is why I use the repurposed handoff_state value
> in the waiter structure. Also, reading this value is less costly than reading
> the rwsem cacheline, which can be heavily contended.
>
> I will update the patch description to highlight the points that I discussed
> in this email.
Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
all the fastpaths, all the spinning, all the stealing, everything; then
all that is left is the slowpath that is holding wait_lock.
Then in both slowpaths, ensure only the first waiter can go on and we're
done.
What am I missing? Why does it need to be so complicated?
That is, afaict something like the below would actually work, no? Yes,
simply deleting that spinning in write_slowpath isn't ideal, but I
suspect we can do something to rwsem_try_write_lock() to make up for
that if we think about it.
Again, please explain, explicitly and in small steps, why you think you
need all that complexity.
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
*
* When the rwsem is reader-owned and a spinning writer has timed out,
* the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
* When a writer acquires a rwsem, it puts its task_struct pointer
* into the owner field. It is cleared after an unlock.
*
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
* Mark writer at the front of the queue for wakeup.
* Until the task is actually later awoken later by
* the caller, other writers are able to steal it.
+ *
+ * *Unless* HANDOFF is set, in which case only the
+ * first waiter is allowed to take it.
+ *
* Readers, on the other hand, will block as they
* will notice the queued writer.
*/
@@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
* force the issue.
*/
if (time_after(jiffies, waiter->timeout)) {
+ /*
+ * Setting HANDOFF blocks fastpaths and stealing.
+ */
if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
adjustment -= RWSEM_FLAG_HANDOFF;
lockevent_inc(rwsem_rlock_handoff);
@@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
}
atomic_long_add(-adjustment, &sem->count);
+
+ if (waiter->handoff_set) {
+ /*
+ * With HANDOFF set we must terminate all spinning.
+ */
+ rwsem_set_nonspinnable(sem);
+ }
return;
}
/*
@@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
* Try to acquire the lock
*/
taken = rwsem_try_write_lock_unqueued(sem);
-
if (taken)
break;
@@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
if (signal_pending_state(state, current))
goto out_nolock;
- /*
- * After setting the handoff bit and failing to acquire
- * the lock, attempt to spin on owner to accelerate lock
- * transfer. If the previous owner is a on-cpu writer and it
- * has just released the lock, OWNER_NULL will be returned.
- * In this case, we attempt to acquire the lock again
- * without sleeping.
- */
- if (waiter.handoff_set) {
- enum owner_state owner_state;
-
- owner_state = rwsem_spin_on_owner(sem);
- if (owner_state == OWNER_NULL)
- goto trylock_again;
- }
-
schedule_preempt_disabled();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
On Mon, Jan 23, 2023 at 05:07:08PM -0500, Waiman Long wrote:
> On 1/23/23 12:30, Waiman Long wrote:
> > I will update the patch description to highlight the points that I
> > discussed in this email.
>
> I am planning to update the patch description as follows:
>
>    The lock handoff provided in rwsem isn't a true handoff like that in
>    the mutex. Instead, it is more like a quiescent state where optimistic
>    spinning and lock stealing are disabled to make it easier for the first
>    waiter to acquire the lock.
>
>    For mutex, lock handoff is done at unlock time as the owner value and
>    the handoff bit are in the same lock word and can be updated atomically.
>
>    That is not the case for rwsem, which has a separate count value for
>    locking and an owner value. The only way to update them in a quasi-atomic
>    way is to use the wait_lock for synchronization, as the handoff bit can
>    only be updated while holding the wait_lock. So for rwsem, the new
>    lock handoff mechanism is done mostly at rwsem_wake() time when the
>    wait_lock has to be acquired anyway to minimize additional overhead.
So for first==reader, sure, and you don't need anything special, since
rwsem_mark_wake() already does the right thing.
But for first==writer, I don't follow; *WHY* do you have to complicate
this path so. The write_slowpath already takes wait_lock for
rwsem_try_write_lock() and that already knows about handoff.
>    It is also likely that the active lock in this case may be a transient
>    RWSEM_READER_BIAS that will be removed soon. So we have a secondary
>    handoff done at reader slow path to handle this particular case.
Only because you made it so damn complicated. If instead you rely on the
wait_lock in write_slowpath you can keep it all in one place AFAICT.
>    For reader-owned rwsem, the owner value other than the RWSEM_READER_OWNED
>    bit is mostly for debugging purposes only. So it is not safe to use
>    the owner value to confirm a handoff to a reader has happened. On the
What ?!? Where do we care about the owner value? There's
RWSEM_FLAG_HANDOFF which lives in sem->count and there's
waiter->handoff_set. Nowhere do we care about sem->owner in this.
>    other hand, we can do that when handing off to a writer. However, it
>    is simpler to use the same mechanism to notify a handoff has happened
>    for both readers and writers. So a new HANDOFF_GRANTED state is added
I really can't follow whatever logic jump here.
>    to enum rwsem_handoff_state to signify that. This new value will be
>    written to the handoff_state value of the first waiter.
>
>    With true lock handoff, there is no need to do a NULL owner spinning
>    anymore as wakeup will be performed if handoff is successful. So it
>    is likely that the first waiter won't actually go to sleep even when
>    schedule() is called in this case.
So this spinning, this is purely for writer->writer handoff (which is
exceedingly rare since it is readers that set handoff), right?
Why is that so important?
Also, why can't we add something like
owner = rwsem_owner_flags(sem, &flags);
if (owner && !(flags & RWSEM_READER_OWNED))
atomic_long_cond_read_relaxed(&sem->count, !(VAL & RWSEM_WRITER_LOCKED));
to the start of that? If it's really needed.
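Rendered as a small self-contained helper, that suggestion might look
like the sketch below. The function name is made up here; only
rwsem_owner_flags(), RWSEM_READER_OWNED and RWSEM_WRITER_LOCKED are the
existing rwsem.c names, and whether such a wait would actually be
sufficient is exactly what is being debated in this thread.

/*
 * Sketch of the suggested early wait: if the rwsem is currently
 * writer-owned, wait for the writer bit in the count to clear before
 * proceeding. Hypothetical helper name.
 */
static inline void rwsem_wait_for_writer_release(struct rw_semaphore *sem)
{
	unsigned long flags;
	struct task_struct *owner = rwsem_owner_flags(sem, &flags);

	if (owner && !(flags & RWSEM_READER_OWNED))
		atomic_long_cond_read_relaxed(&sem->count,
					      !(VAL & RWSEM_WRITER_LOCKED));
}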
On 1/24/23 07:29, Peter Zijlstra wrote:
> That is, afaict something like the below would actually work, no? Yes,
> simply deleting that spinning in write_slowpath isn't ideal, but I
> suspect we can do something to rwsem_try_write_lock() to make up for
> that if we think about it.
>
> Again, please explain, explicitly and in small steps, why you think you
> need all that complexity.
Thanks for the comments and suggested changes. I will adopt some of your
suggestions to simplify the patchset. Will post a new version once I
finish my testing.
Cheers,
Longman