2021-09-22 19:39:57

by Waiman Long

[permalink] [raw]
Subject: [RFC PATCH] locking/rwsem: Add upgrade_read()

Currently there are about 12 instances in the kernel where an up_read()
is immediately followed by a down_write() of the same lock. For example,

drivers/tty/n_tty.c: up_read(&tty->termios_rwsem);
drivers/tty/n_tty.c- down_write(&tty->termios_rwsem);

Since we have already provided a downgrade_write() function, we may as
well provide an upgrade_read() function to make the code easier to read
and the intention clearer.

If the current task is the only reader, the upgrade can be done by a
single atomic operation. If not, the upgrade will have to be done by a
separate up_read() call followed by a down_write(). In the former case,
the handoff bit is not considered and the waiter will have to wait a
bit longer to acquire the lock.

The new upgrade_read() function returns a value of 0 for safe upgrade
where rwsem protected data won't change. Otherwise a value of 1 is
returned to indicate unsafe upgrade where rwsem protected data may
change during the upgrade process.

For PREEMPT_RT, it falls back to up_read() followed by down_write()
for simplicity.

Some uses of down_write() with long lock hold time may be changed
to the following format in the future:

down_read()
/* check data */
if (upgrade_read()) {
/* unsafe upgrade, recheck data */
}
/* update data */
up_write();

As long as the "recheck data" and "update data" parts are relatively
short compared with the "check data" part, this conversion may help to
improve parallelism and reduce lock contention.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/rwsem.h | 5 ++++
kernel/locking/rwsem.c | 53 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 352c6127cb90..8ece58224f25 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -207,6 +207,11 @@ extern void up_write(struct rw_semaphore *sem);
*/
extern void downgrade_write(struct rw_semaphore *sem);

+/*
+ * upgrade read lock to write lock
+ */
+extern int upgrade_read(struct rw_semaphore *sem);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/*
* nested locking. NOTE: rwsems are not allowed to recurse
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 000e8d5a2884..aeb5b0668304 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1203,6 +1203,29 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
return sem;
}

+/*
+ * Try to upgrade read lock to write lock
+ */
+static inline int __try_upgrade_read(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ WARN_ON_ONCE(count & RWSEM_WRITER_LOCKED);
+
+ /*
+ * When upgrading from shared to exclusive ownership,
+ * anything inside the write-locked region cannot leak
+ * into the read side. Use an ACQUIRE semantics.
+ */
+ if (((count & RWSEM_READER_MASK) == RWSEM_READER_BIAS) &&
+ atomic_long_try_cmpxchg_acquire(&sem->count, &count,
+ count - RWSEM_READER_BIAS + RWSEM_WRITER_LOCKED)) {
+ rwsem_set_owner(sem);
+ return 1;
+ }
+ return 0;
+}
+
/*
* lock for reading
*/
@@ -1438,6 +1461,11 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
rwbase_write_downgrade(&sem->rwbase);
}

+static inline int __try_upgrade_read(struct rw_semaphore *sem)
+{
+ return 0;
+}
+
/* Debug stubs for the common API */
#define DEBUG_RWSEMS_WARN_ON(c, sem)

@@ -1581,6 +1609,31 @@ void downgrade_write(struct rw_semaphore *sem)
}
EXPORT_SYMBOL(downgrade_write);

+/*
+ * Upgrade read lock to write lock
+ *
+ * Return: 0 when upgrade is safe, i.e. rwsem protected data do not change;
+ * 1 when upgrade is unsafe as rwsem protected data may have changed.
+ */
+int upgrade_read(struct rw_semaphore *sem)
+{
+ if (__try_upgrade_read(sem)) {
+ rwsem_release(&sem->dep_map, _RET_IP_);
+ rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
+ return 0;
+ }
+
+ /*
+ * We cannot directly upgrade to the write lock, just do a regular
+ * up_read() and down_write() sequence. The data protected by the
+ * rwsem may have changed before the write lock is acquired.
+ */
+ down_read(sem);
+ up_write(sem);
+ return 1;
+}
+EXPORT_SYMBOL(upgrade_read);
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC

void down_read_nested(struct rw_semaphore *sem, int subclass)
--
2.18.1


2021-09-22 20:13:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] locking/rwsem: Add upgrade_read()

On Wed, Sep 22, 2021 at 03:36:57PM -0400, Waiman Long wrote:
> Currently there are about 12 instances in the kernel where an up_read()
> is immediately followed by a down_write() of the same lock. For example,
>
> drivers/tty/n_tty.c: up_read(&tty->termios_rwsem);
> drivers/tty/n_tty.c- down_write(&tty->termios_rwsem);

And TTY is a high performance issue, that requires hacks like this?

> Since we have already provided a downgrade_write() function, we may as
> well provide an upgrade_read() function to make the code easier to read
> and the intention clearer.
>
> If the current task is the only reader, the upgrade can be done by a
> single atomic operation. If not, the upgrade will have to be done by a
> separate up_read() call followed by a down_write(). In the former case,
> the handoff bit is not considered and the waiter will have to wait a
> bit longer to acquire the lock.
>
> The new upgrade_read() function returns a value of 0 for safe upgrade
> where rwsem protected data won't change. Otherwise a value of 1 is
> returned to indicate unsafe upgrade where rwsem protected data may
> change during the upgrade process.

Yuck...

Is there any workload where this is a massive win? I'm thinking that
either the lock is contended and you get the unsafe option which is the
same as today, or the lock isn't contended and you would've gotten
fast-paths and you barely safe anything anyway.

Also, -ENODATA

2021-09-22 20:24:09

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH] locking/rwsem: Add upgrade_read()

On 9/22/21 4:06 PM, Peter Zijlstra wrote:
> On Wed, Sep 22, 2021 at 03:36:57PM -0400, Waiman Long wrote:
>> Currently there are about 12 instances in the kernel where an up_read()
>> is immediately followed by a down_write() of the same lock. For example,
>>
>> drivers/tty/n_tty.c: up_read(&tty->termios_rwsem);
>> drivers/tty/n_tty.c- down_write(&tty->termios_rwsem);
> And TTY is a high performance issue, that requires hacks like this?
I won't call it a hack. I consider it a better documentation what the
real intention is.
>
>> Since we have already provided a downgrade_write() function, we may as
>> well provide an upgrade_read() function to make the code easier to read
>> and the intention clearer.
>>
>> If the current task is the only reader, the upgrade can be done by a
>> single atomic operation. If not, the upgrade will have to be done by a
>> separate up_read() call followed by a down_write(). In the former case,
>> the handoff bit is not considered and the waiter will have to wait a
>> bit longer to acquire the lock.
>>
>> The new upgrade_read() function returns a value of 0 for safe upgrade
>> where rwsem protected data won't change. Otherwise a value of 1 is
>> returned to indicate unsafe upgrade where rwsem protected data may
>> change during the upgrade process.
> Yuck...
>
> Is there any workload where this is a massive win? I'm thinking that
> either the lock is contended and you get the unsafe option which is the
> same as today, or the lock isn't contended and you would've gotten
> fast-paths and you barely safe anything anyway.
>
> Also, -ENODATA
>
Yes, the best case saving is just just an atomic op.

This is just a RFC. I can look for workloads that can benefit from using
this.

Cheers,
Longman

2021-09-23 01:19:05

by Boqun Feng

[permalink] [raw]
Subject: Re: [RFC PATCH] locking/rwsem: Add upgrade_read()

On Wed, Sep 22, 2021 at 03:36:57PM -0400, Waiman Long wrote:
> Currently there are about 12 instances in the kernel where an up_read()
> is immediately followed by a down_write() of the same lock. For example,
>
> drivers/tty/n_tty.c: up_read(&tty->termios_rwsem);
> drivers/tty/n_tty.c- down_write(&tty->termios_rwsem);
>
> Since we have already provided a downgrade_write() function, we may as
> well provide an upgrade_read() function to make the code easier to read
> and the intention clearer.
>
> If the current task is the only reader, the upgrade can be done by a
> single atomic operation. If not, the upgrade will have to be done by a
> separate up_read() call followed by a down_write(). In the former case,
> the handoff bit is not considered and the waiter will have to wait a
> bit longer to acquire the lock.
>
> The new upgrade_read() function returns a value of 0 for safe upgrade
> where rwsem protected data won't change. Otherwise a value of 1 is
> returned to indicate unsafe upgrade where rwsem protected data may
> change during the upgrade process.
>
> For PREEMPT_RT, it falls back to up_read() followed by down_write()
> for simplicity.
>
> Some uses of down_write() with long lock hold time may be changed
> to the following format in the future:
>
> down_read()
> /* check data */
> if (upgrade_read()) {
> /* unsafe upgrade, recheck data */
> }
> /* update data */
> up_write();
>
> As long as the "recheck data" and "update data" parts are relatively
> short compared with the "check data" part, this conversion may help to
> improve parallelism and reduce lock contention.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/rwsem.h | 5 ++++
> kernel/locking/rwsem.c | 53 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 58 insertions(+)
>
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 352c6127cb90..8ece58224f25 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -207,6 +207,11 @@ extern void up_write(struct rw_semaphore *sem);
> */
> extern void downgrade_write(struct rw_semaphore *sem);
>
> +/*
> + * upgrade read lock to write lock
> + */
> +extern int upgrade_read(struct rw_semaphore *sem);
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> /*
> * nested locking. NOTE: rwsems are not allowed to recurse
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 000e8d5a2884..aeb5b0668304 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1203,6 +1203,29 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
> return sem;
> }
>
> +/*
> + * Try to upgrade read lock to write lock
> + */
> +static inline int __try_upgrade_read(struct rw_semaphore *sem)
> +{
> + long count = atomic_long_read(&sem->count);
> +
> + WARN_ON_ONCE(count & RWSEM_WRITER_LOCKED);
> +
> + /*
> + * When upgrading from shared to exclusive ownership,
> + * anything inside the write-locked region cannot leak
> + * into the read side. Use an ACQUIRE semantics.
> + */
> + if (((count & RWSEM_READER_MASK) == RWSEM_READER_BIAS) &&
> + atomic_long_try_cmpxchg_acquire(&sem->count, &count,
> + count - RWSEM_READER_BIAS + RWSEM_WRITER_LOCKED)) {
> + rwsem_set_owner(sem);
> + return 1;
> + }
> + return 0;
> +}
> +
> /*
> * lock for reading
> */
> @@ -1438,6 +1461,11 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
> rwbase_write_downgrade(&sem->rwbase);
> }
>
> +static inline int __try_upgrade_read(struct rw_semaphore *sem)
> +{
> + return 0;
> +}
> +
> /* Debug stubs for the common API */
> #define DEBUG_RWSEMS_WARN_ON(c, sem)
>
> @@ -1581,6 +1609,31 @@ void downgrade_write(struct rw_semaphore *sem)
> }
> EXPORT_SYMBOL(downgrade_write);
>
> +/*
> + * Upgrade read lock to write lock
> + *
> + * Return: 0 when upgrade is safe, i.e. rwsem protected data do not change;
> + * 1 when upgrade is unsafe as rwsem protected data may have changed.
> + */
> +int upgrade_read(struct rw_semaphore *sem)
> +{
> + if (__try_upgrade_read(sem)) {
> + rwsem_release(&sem->dep_map, _RET_IP_);
> + rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
> + return 0;
> + }
> +
> + /*
> + * We cannot directly upgrade to the write lock, just do a regular
> + * up_read() and down_write() sequence. The data protected by the
> + * rwsem may have changed before the write lock is acquired.
> + */
> + down_read(sem);
> + up_write(sem);

Confused, the comment says up_read()+down_write(), however the code is
down_read()+up_write().

Besides, I don't like the idea that the value may have changed before
the write lock is acquired if we call it "upgrade". Maybe we want api
like down_read_upgradable(), which can be held in parallel with other
down_read() but no other down_read_upgradable(), and one can only
upgrade the read-side critical section created by
down_read_upgradable(). For implementation, that means we need to have
one extra bit for upgradable. Thoughts?

Regards,
Boqun

> + return 1;
> +}
> +EXPORT_SYMBOL(upgrade_read);
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>
> void down_read_nested(struct rw_semaphore *sem, int subclass)
> --
> 2.18.1
>

2021-09-23 01:52:04

by Waiman Long

[permalink] [raw]
Subject: Re: [RFC PATCH] locking/rwsem: Add upgrade_read()

On 9/22/21 9:16 PM, Boqun Feng wrote:
> On Wed, Sep 22, 2021 at 03:36:57PM -0400, Waiman Long wrote:
>> Currently there are about 12 instances in the kernel where an up_read()
>> is immediately followed by a down_write() of the same lock. For example,
>>
>> drivers/tty/n_tty.c: up_read(&tty->termios_rwsem);
>> drivers/tty/n_tty.c- down_write(&tty->termios_rwsem);
>>
>> Since we have already provided a downgrade_write() function, we may as
>> well provide an upgrade_read() function to make the code easier to read
>> and the intention clearer.
>>
>> If the current task is the only reader, the upgrade can be done by a
>> single atomic operation. If not, the upgrade will have to be done by a
>> separate up_read() call followed by a down_write(). In the former case,
>> the handoff bit is not considered and the waiter will have to wait a
>> bit longer to acquire the lock.
>>
>> The new upgrade_read() function returns a value of 0 for safe upgrade
>> where rwsem protected data won't change. Otherwise a value of 1 is
>> returned to indicate unsafe upgrade where rwsem protected data may
>> change during the upgrade process.
>>
>> For PREEMPT_RT, it falls back to up_read() followed by down_write()
>> for simplicity.
>>
>> Some uses of down_write() with long lock hold time may be changed
>> to the following format in the future:
>>
>> down_read()
>> /* check data */
>> if (upgrade_read()) {
>> /* unsafe upgrade, recheck data */
>> }
>> /* update data */
>> up_write();
>>
>> As long as the "recheck data" and "update data" parts are relatively
>> short compared with the "check data" part, this conversion may help to
>> improve parallelism and reduce lock contention.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> include/linux/rwsem.h | 5 ++++
>> kernel/locking/rwsem.c | 53 ++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 58 insertions(+)
>>
>> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
>> index 352c6127cb90..8ece58224f25 100644
>> --- a/include/linux/rwsem.h
>> +++ b/include/linux/rwsem.h
>> @@ -207,6 +207,11 @@ extern void up_write(struct rw_semaphore *sem);
>> */
>> extern void downgrade_write(struct rw_semaphore *sem);
>>
>> +/*
>> + * upgrade read lock to write lock
>> + */
>> +extern int upgrade_read(struct rw_semaphore *sem);
>> +
>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>> /*
>> * nested locking. NOTE: rwsems are not allowed to recurse
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index 000e8d5a2884..aeb5b0668304 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -1203,6 +1203,29 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
>> return sem;
>> }
>>
>> +/*
>> + * Try to upgrade read lock to write lock
>> + */
>> +static inline int __try_upgrade_read(struct rw_semaphore *sem)
>> +{
>> + long count = atomic_long_read(&sem->count);
>> +
>> + WARN_ON_ONCE(count & RWSEM_WRITER_LOCKED);
>> +
>> + /*
>> + * When upgrading from shared to exclusive ownership,
>> + * anything inside the write-locked region cannot leak
>> + * into the read side. Use an ACQUIRE semantics.
>> + */
>> + if (((count & RWSEM_READER_MASK) == RWSEM_READER_BIAS) &&
>> + atomic_long_try_cmpxchg_acquire(&sem->count, &count,
>> + count - RWSEM_READER_BIAS + RWSEM_WRITER_LOCKED)) {
>> + rwsem_set_owner(sem);
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>> /*
>> * lock for reading
>> */
>> @@ -1438,6 +1461,11 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
>> rwbase_write_downgrade(&sem->rwbase);
>> }
>>
>> +static inline int __try_upgrade_read(struct rw_semaphore *sem)
>> +{
>> + return 0;
>> +}
>> +
>> /* Debug stubs for the common API */
>> #define DEBUG_RWSEMS_WARN_ON(c, sem)
>>
>> @@ -1581,6 +1609,31 @@ void downgrade_write(struct rw_semaphore *sem)
>> }
>> EXPORT_SYMBOL(downgrade_write);
>>
>> +/*
>> + * Upgrade read lock to write lock
>> + *
>> + * Return: 0 when upgrade is safe, i.e. rwsem protected data do not change;
>> + * 1 when upgrade is unsafe as rwsem protected data may have changed.
>> + */
>> +int upgrade_read(struct rw_semaphore *sem)
>> +{
>> + if (__try_upgrade_read(sem)) {
>> + rwsem_release(&sem->dep_map, _RET_IP_);
>> + rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
>> + return 0;
>> + }
>> +
>> + /*
>> + * We cannot directly upgrade to the write lock, just do a regular
>> + * up_read() and down_write() sequence. The data protected by the
>> + * rwsem may have changed before the write lock is acquired.
>> + */
>> + down_read(sem);
>> + up_write(sem);
> Confused, the comment says up_read()+down_write(), however the code is
> down_read()+up_write().
Thanks for catching that typo. My bad.
>
> Besides, I don't like the idea that the value may have changed before
> the write lock is acquired if we call it "upgrade". Maybe we want api
> like down_read_upgradable(), which can be held in parallel with other
> down_read() but no other down_read_upgradable(), and one can only
> upgrade the read-side critical section created by
> down_read_upgradable(). For implementation, that means we need to have
> one extra bit for upgradable. Thoughts?

I like your idea. There are spare bits available and we can dedicate one
bit for that purpose. After successfully acquire the bit the reader can
probably spin a little bit and then insert itself to the head of the
wait queue to sleep. The last exiting reader can wake it up to acquire
the write lock.

I will probably use "try_upgrade_read() to indicate that the attempt may
fail.

Thanks for the suggestion.

Cheers,
Longman