2022-10-24 19:44:12

by Waiman Long

Subject: [PATCH v4 0/5] locking/rwsem: Fix rwsem bugs & enable true lock handoff

v4:
- Update patch descriptions in patches 1 & 2 to make clear the live
lock conditions that are being fixed by these patches. There is no code
change from v3.

v3:
- Make a minor cleanup to patch 1.
- Add 3 more patches to implement true lock handoff.

It turns out that the current waiter optimistic spinning code does not
work that well when RT tasks are in the mix. This patch series includes
two different fixes to resolve those issues. The last 3 patches modify
the handoff code to implement true lock handoff similar to that of a
mutex.
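
For readers unfamiliar with mutex-style handoff, the rough idea is that
once handoff has been requested, the releasing task passes ownership
directly to the first waiter instead of merely setting a flag and
letting everyone race for the lock again, so the lock is never observed
as free by optimistic spinners in between. A toy userspace sketch of
that idea (illustrative code only, not the code added by patches 4/5):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_lock {
	_Atomic(void *) owner;		/* NULL when free */
	void *first_waiter;		/* head of the wait queue (simplified) */
	_Atomic bool handoff_requested;
};

static void toy_unlock(struct toy_lock *lock)
{
	if (atomic_load(&lock->handoff_requested) && lock->first_waiter) {
		/* Direct handoff: ownership goes straight to the first waiter. */
		atomic_store(&lock->owner, lock->first_waiter);
		atomic_store(&lock->handoff_requested, false);
		/* ... wake the first waiter here (omitted) ... */
		return;
	}
	/* Normal release: the lock becomes free and anyone may grab it. */
	atomic_store(&lock->owner, NULL);
}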

Waiman Long (5):
locking/rwsem: Prevent non-first waiter from spinning in down_write()
slowpath
locking/rwsem: Limit # of null owner retries for handoff writer
locking/rwsem: Change waiter->handoff_set to a handoff_state enum
locking/rwsem: Enable direct rwsem lock handoff
locking/rwsem: Update handoff lock events tracking

kernel/locking/lock_events_list.h | 6 +-
kernel/locking/rwsem.c | 172 +++++++++++++++++++++++-------
2 files changed, 138 insertions(+), 40 deletions(-)

--
2.31.1


2022-10-24 20:07:38

by Waiman Long

Subject: [PATCH v4 3/5] locking/rwsem: Change waiter->handoff_set to a handoff_state enum

Change the boolean waiter->handoff_set to an enum type so that we can
have more states in some later patches. Also use READ_ONCE() outside
wait_lock critical sections for read and WRITE_ONCE() inside wait_lock
critical sections for write for proper synchronization. There is no
functional change.
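
As a rough illustration of that access pattern, here is a userspace
sketch with C11 relaxed atomics standing in for READ_ONCE()/WRITE_ONCE()
(toy code, not the rwsem implementation): handoff_state is only written
with wait_lock held, but it may be polled locklessly by the waiter, so
the accesses need single-copy atomicity.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

enum handoff_state { HANDOFF_NONE = 0, HANDOFF_REQUESTED };

struct waiter {
	_Atomic int handoff_state;	/* holds enum handoff_state values */
};

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

/* Waker side: runs with wait_lock held. */
static void grant_handoff(struct waiter *w)
{
	pthread_mutex_lock(&wait_lock);
	/* ~ WRITE_ONCE(w->handoff_state, HANDOFF_REQUESTED) */
	atomic_store_explicit(&w->handoff_state, HANDOFF_REQUESTED,
			      memory_order_relaxed);
	pthread_mutex_unlock(&wait_lock);
}

/* Waiter side: polls without taking wait_lock. */
static bool handoff_pending(struct waiter *w)
{
	/* ~ READ_ONCE(w->handoff_state) */
	return atomic_load_explicit(&w->handoff_state,
				    memory_order_relaxed) != HANDOFF_NONE;
}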

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c68d76fc8c68..a8bfc905637a 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -338,12 +338,17 @@ enum rwsem_waiter_type {
RWSEM_WAITING_FOR_READ
};

+enum rwsem_handoff_state {
+ HANDOFF_NONE = 0,
+ HANDOFF_REQUESTED,
+};
+
struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+ enum rwsem_handoff_state handoff_state;
unsigned long timeout;
- bool handoff_set;
};
#define rwsem_first_waiter(sem) \
list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
@@ -470,7 +475,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
adjustment -= RWSEM_FLAG_HANDOFF;
lockevent_inc(rwsem_rlock_handoff);
}
- waiter->handoff_set = true;
+ WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
}

atomic_long_add(-adjustment, &sem->count);
@@ -622,7 +627,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
* waiter is the one that set it. Otherwisee, we
* still try to acquire the rwsem.
*/
- if (first->handoff_set && (waiter != first))
+ if (first->handoff_state && (waiter != first))
return false;
}

@@ -650,11 +655,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,

/*
* We have either acquired the lock with handoff bit cleared or set
- * the handoff bit. Only the first waiter can have its handoff_set
+ * the handoff bit. Only the first waiter can have its handoff_state
* set here to enable optimistic spinning in slowpath loop.
*/
if (new & RWSEM_FLAG_HANDOFF) {
- first->handoff_set = true;
+ WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
lockevent_inc(rwsem_wlock_handoff);
return false;
}
@@ -1043,7 +1048,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_READ;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
- waiter.handoff_set = false;
+ waiter.handoff_state = HANDOFF_NONE;

raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list)) {
@@ -1131,7 +1136,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
waiter.task = current;
waiter.type = RWSEM_WAITING_FOR_WRITE;
waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
- waiter.handoff_set = false;
+ waiter.handoff_state = HANDOFF_NONE;

raw_spin_lock_irq(&sem->wait_lock);
rwsem_add_waiter(sem, &waiter);
@@ -1176,7 +1181,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
* In this case, we attempt to acquire the lock again
* without sleeping.
*/
- if (waiter.handoff_set) {
+ if (READ_ONCE(waiter.handoff_state)) {
enum owner_state owner_state;

preempt_disable();
--
2.31.1

2022-10-24 20:27:34

by John Donnelly

Subject: Re: [PATCH v4 3/5] locking/rwsem: Change waiter->handoff_set to a handoff_state enum

On 10/24/22 12:44 PM, Waiman Long wrote:
> Change the boolean waiter->handoff_set to an enum type so that we can
> have more states in some later patches. Also use READ_ONCE() outside
> wait_lock critical sections for read and WRITE_ONCE() inside wait_lock
> critical sections for write for proper synchronization. There is no
> functional change.

Hi,

Do we need


Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more
consistent")

Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner")

Cc: [email protected]


clauses added to patches 3,4,5 so they are all picked up in one series ?

Thank you,

John.



>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/locking/rwsem.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index c68d76fc8c68..a8bfc905637a 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -338,12 +338,17 @@ enum rwsem_waiter_type {
> RWSEM_WAITING_FOR_READ
> };
>
> +enum rwsem_handoff_state {
> + HANDOFF_NONE = 0,
> + HANDOFF_REQUESTED,
> +};
> +
> struct rwsem_waiter {
> struct list_head list;
> struct task_struct *task;
> enum rwsem_waiter_type type;
> + enum rwsem_handoff_state handoff_state;
> unsigned long timeout;
> - bool handoff_set;
> };
> #define rwsem_first_waiter(sem) \
> list_first_entry(&sem->wait_list, struct rwsem_waiter, list)
> @@ -470,7 +475,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
> adjustment -= RWSEM_FLAG_HANDOFF;
> lockevent_inc(rwsem_rlock_handoff);
> }
> - waiter->handoff_set = true;
> + WRITE_ONCE(waiter->handoff_state, HANDOFF_REQUESTED);
> }
>
> atomic_long_add(-adjustment, &sem->count);
> @@ -622,7 +627,7 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> * waiter is the one that set it. Otherwisee, we
> * still try to acquire the rwsem.
> */
> - if (first->handoff_set && (waiter != first))
> + if (first->handoff_state && (waiter != first))
> return false;
> }
>
> @@ -650,11 +655,11 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>
> /*
> * We have either acquired the lock with handoff bit cleared or set
> - * the handoff bit. Only the first waiter can have its handoff_set
> + * the handoff bit. Only the first waiter can have its handoff_state
> * set here to enable optimistic spinning in slowpath loop.
> */
> if (new & RWSEM_FLAG_HANDOFF) {
> - first->handoff_set = true;
> + WRITE_ONCE(first->handoff_state, HANDOFF_REQUESTED);
> lockevent_inc(rwsem_wlock_handoff);
> return false;
> }
> @@ -1043,7 +1048,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
> waiter.task = current;
> waiter.type = RWSEM_WAITING_FOR_READ;
> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> - waiter.handoff_set = false;
> + waiter.handoff_state = HANDOFF_NONE;
>
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list)) {
> @@ -1131,7 +1136,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> waiter.task = current;
> waiter.type = RWSEM_WAITING_FOR_WRITE;
> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> - waiter.handoff_set = false;
> + waiter.handoff_state = HANDOFF_NONE;
>
> raw_spin_lock_irq(&sem->wait_lock);
> rwsem_add_waiter(sem, &waiter);
> @@ -1176,7 +1181,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> * In this case, we attempt to acquire the lock again
> * without sleeping.
> */
> - if (waiter.handoff_set) {
> + if (READ_ONCE(waiter.handoff_state)) {
> enum owner_state owner_state;
>
> preempt_disable();

2022-10-24 21:04:56

by Waiman Long

Subject: Re: [PATCH v4 3/5] locking/rwsem: Change waiter->handoff_set to a handoff_state enum

On 10/24/22 14:25, [email protected] wrote:
> On 10/24/22 12:44 PM, Waiman Long wrote:
>> Change the boolean waiter->handoff_set to an enum type so that we can
>> have more states in some later patches. Also use READ_ONCE() outside
>> wait_lock critical sections for read and WRITE_ONCE() inside wait_lock
>> critical sections for write for proper synchronization. There is no
>> functional change.
>
> Hi,
>
> Do we need
>
>
> Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more
> consistent")
>
> Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer
> optimistically spin on owner")
>
> Cc: [email protected]
>
>
> clauses added to patches 3,4,5 so they are all picked up in one series ?
>
> Thank you,
>
> John.

I am not planning to do that. The handoff code rewrite represents a
moderate amount of change, so I don't want it to be backported to
stable in case there are some hidden bugs inside. That is why patch 2
is essentially reverted in patch 4.

Cheers,
Longman

2022-10-24 21:11:21

by Waiman Long

Subject: [PATCH v4 2/5] locking/rwsem: Limit # of null owner retries for handoff writer

Commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically
spin on owner") assumes that when the owner field is changed to NULL,
the lock will become free soon. Commit 48dfb5d2560d ("locking/rwsem:
Disable preemption while trying for rwsem lock") disables preemption
when acquiring a rwsem for write. However, preemption has not yet been
disabled when acquiring a read lock on a rwsem. So a reader can add
RWSEM_READER_BIAS to the count without setting owner to signal a
reader, then get preempted by an RT task which spins in the writer
slowpath as the owner remains NULL, leading to a live lock.
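
A simplified userspace sketch of that window (illustrative names only,
not the actual rwsem fastpath): the reader publishes its bias in the
count before it records itself as owner, so a preemption in between
leaves the count non-zero while owner stays NULL, which is exactly the
condition the spinning handoff writer keeps retrying on.

#include <stdatomic.h>
#include <stddef.h>

#define READER_BIAS	0x100L
#define WRITE_LOCKED	0x1L

struct toy_rwsem {
	_Atomic long count;	/* readers add READER_BIAS here */
	_Atomic(void *) owner;	/* still NULL right after the add */
};

static void toy_down_read_fast(struct toy_rwsem *sem, void *me)
{
	long cnt = atomic_fetch_add(&sem->count, READER_BIAS) + READER_BIAS;

	if (!(cnt & WRITE_LOCKED)) {
		/*
		 * An RT task preempting here sees count != 0 but owner ==
		 * NULL; a handoff writer spinning on owner keeps seeing
		 * OWNER_NULL and retries while this reader cannot run to
		 * set owner or back its bias out.
		 */
		atomic_store(&sem->owner, me);
		return;
	}
	/* Slowpath: remove the bias and queue up (omitted). */
	atomic_fetch_sub(&sem->count, READER_BIAS);
}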

One way to fix that is to disable preemption before the read lock
attempt and then immediately remove RWSEM_READER_BIAS when the trylock
fails, before re-enabling preemption. This would remove some
optimizations that can be done by delaying the RWSEM_READER_BIAS
backoff. Alternatively, we could delay the preempt_enable() into
rwsem_down_read_slowpath(), even past acquiring and releasing the
wait_lock. Another possible alternative is to limit the number of
trylock attempts without sleeping. The last alternative seems to be
the least messy and is implemented in this patch.

The current limit is now set to 8 to allow enough time for the other
task to hopefully complete its action.

New lock events were added to track the number of NULL owner retries
with the handoff flag set before a successful trylock. Running a
96-thread locking microbenchmark with an equal number of readers and
writers on a 2-core 96-thread system for 15 seconds produced the
following stats. Note that none of the locking threads are RT tasks.

Retries of successful trylock   Count
-----------------------------   -----
             1                   1738
             2                     19
             3                     11
             4                      2
             5                      1
             6                      1
             7                      1
             8                      0
             X                      1

The last row is the one attempt that failed to acquire the lock within
8 retries. So a retry count maximum of 8 should capture most of them
if no RT task is in the mix.

Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")
Reported-by: Mukesh Ojha <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Reviewed-and-tested-by: Mukesh Ojha <[email protected]>
Cc: [email protected]
---
kernel/locking/rwsem.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index be2df9ea7c30..c68d76fc8c68 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1115,6 +1115,7 @@ static struct rw_semaphore __sched *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
struct rwsem_waiter waiter;
+ int null_owner_retries;
DEFINE_WAKE_Q(wake_q);

/* do optimistic spinning and steal lock if possible */
@@ -1156,7 +1157,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
set_current_state(state);
trace_contention_begin(sem, LCB_F_WRITE);

- for (;;) {
+ for (null_owner_retries = 0;;) {
if (rwsem_try_write_lock(sem, &waiter)) {
/* rwsem_try_write_lock() implies ACQUIRE on success */
break;
@@ -1182,8 +1183,21 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
owner_state = rwsem_spin_on_owner(sem);
preempt_enable();

- if (owner_state == OWNER_NULL)
+ /*
+ * A NULL owner doesn't guarantee that the lock is free.
+ * An incoming reader will temporarily increment the
+ * reader count without changing owner and
+ * rwsem_try_write_lock() will fail if the reader
+ * is not able to decrement it in time. Allow 8
+ * trylock attempts when hitting a NULL owner before
+ * going to sleep.
+ */
+ if ((owner_state == OWNER_NULL) &&
+ (null_owner_retries < 8)) {
+ null_owner_retries++;
goto trylock_again;
+ }
+ null_owner_retries = 0;
}

schedule();
--
2.31.1