From: Waiman Long
To: Boqun Feng
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, linux-kernel@vger.kernel.org, Hillf Danton
Subject: Re: [PATCH v2 2/3] locking/rwsem: Enable early rwsem writer lock handoff
Date: Fri, 17 Feb 2023 09:25:48 -0500
References: <20230216210933.1169097-1-longman@redhat.com> <20230216210933.1169097-3-longman@redhat.com>

On 2/17/23 00:24, Boqun Feng wrote:
> On Thu, Feb 16, 2023 at 04:09:32PM -0500, Waiman Long wrote:
>> The lock handoff provided in rwsem isn't a true handoff like that in
>> the mutex. Instead, it is more like a quiescent state where optimistic
>> spinning and lock stealing are disabled to make it easier for the first
>> waiter to acquire the lock.
>>
>> For readers, setting the HANDOFF bit will disable writers from stealing
>> the lock. The actual handoff is done at rwsem_wake() time after taking
>> the wait_lock. There isn't much we need to improve here other than
>> setting the RWSEM_NONSPINNABLE bit in owner.
>>
>> For writers, setting the HANDOFF bit does not guarantee that it can
>> acquire the rwsem successfully in a subsequent rwsem_try_write_lock()
>> after setting the bit there. A reader can come in and add a
>> RWSEM_READER_BIAS temporarily which can spoil the takeover of the rwsem
>> in rwsem_try_write_lock(), leading to additional delay.
>>
>> For mutex, lock handoff is done at unlock time as the owner value and
>> the handoff bit are in the same lock word and can be updated atomically.
>>
>> That is not the case for rwsem, which has a count value for locking and
>> a separate owner value for storing the lock owner. In addition, the handoff
>> processing differs depending on whether the first waiter is a writer or a
>> reader. We can only make that waiter type determination after acquiring
>> the wait lock. Together with the fact that the RWSEM_FLAG_HANDOFF bit
>> is stable while holding the wait_lock, the most convenient place to
>> do the early handoff is at rwsem_mark_wake() where wait_lock has to be
>> acquired anyway.
>>
>> There isn't much additional cost in doing this check and early handoff
>> in rwsem_mark_wake(), while it increases the chance that a lock handoff
>> will be successful when the HANDOFF-setting writer wakes up, without
>> even the need to take the wait_lock at all. Note that if early handoff
>> fails to happen in rwsem_mark_wake(), a late handoff can still happen
>> when the awoken writer calls rwsem_try_write_lock().
>>
>> Kernel test robot noticed a 19.3% improvement of
>> will-it-scale.per_thread_ops in an earlier version of this commit [1].
>>
>> [1] https://lore.kernel.org/lkml/202302122155.87699b56-oliver.sang@intel.com/
>>
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/locking/lock_events_list.h |  1 +
>>  kernel/locking/rwsem.c            | 71 +++++++++++++++++++++++++++----
>>  2 files changed, 64 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
>> index 97fb6f3f840a..fd80f5828f24 100644
>> --- a/kernel/locking/lock_events_list.h
>> +++ b/kernel/locking/lock_events_list.h
>> @@ -67,3 +67,4 @@ LOCK_EVENT(rwsem_rlock_handoff)	/* # of read lock handoffs */
>>  LOCK_EVENT(rwsem_wlock)		/* # of write locks acquired */
>>  LOCK_EVENT(rwsem_wlock_fail)	/* # of failed write lock acquisitions */
>>  LOCK_EVENT(rwsem_wlock_handoff)	/* # of write lock handoffs */
>> +LOCK_EVENT(rwsem_wlock_ehandoff)	/* # of write lock early handoffs */
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index e589f69793df..fc3961ceabe8 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -412,8 +412,9 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>  			    enum rwsem_wake_type wake_type,
>>  			    struct wake_q_head *wake_q)
>>  {
>> +	long count = atomic_long_read(&sem->count);
>>  	struct rwsem_waiter *waiter, *tmp;
>> -	long count, woken = 0, adjustment = 0;
>> +	long woken = 0, adjustment = 0;
>>  	struct list_head wlist;
>>
>>  	lockdep_assert_held(&sem->wait_lock);
>> @@ -432,19 +433,39 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>  		 * Mark writer at the front of the queue for wakeup.
>>  		 * Until the task is actually later awoken later by
>>  		 * the caller, other writers are able to steal it.
>> +		 *
>> +		 * *Unless* HANDOFF is set, in which case only the
>> +		 * first waiter is allowed to take it.
>> +		 *
>>  		 * Readers, on the other hand, will block as they
>>  		 * will notice the queued writer.
>>  		 */
>>  		wake_q_add(wake_q, waiter->task);
>>  		lockevent_inc(rwsem_wake_writer);
>> +
>> +		if ((count & RWSEM_LOCK_MASK) || !(count & RWSEM_FLAG_HANDOFF))
>> +			return;
>> +
>> +		/*
>> +		 * If the rwsem is free and handoff flag is set with wait_lock
>> +		 * held, no other CPUs can take an active lock. We can do an
>> +		 * early handoff.
>> +		 */
>> +		adjustment = RWSEM_WRITER_LOCKED - RWSEM_FLAG_HANDOFF;
>> +		atomic_long_set(&sem->owner, (long)waiter->task);
>> +		waiter->task = NULL;
>> +		atomic_long_add(adjustment, &sem->count);
>> +		rwsem_del_waiter(sem, waiter);
>> +		lockevent_inc(rwsem_wlock_ehandoff);
>>  	}
>>  	return;
>>
>>  wake_readers:
>>  	/*
>> -	 * No reader wakeup if there are too many of them already.
>> +	 * No reader wakeup if there are too many of them already or
>> +	 * something wrong happens.
>>  	 */
>> -	if (unlikely(atomic_long_read(&sem->count) < 0))
>> +	if (unlikely(count < 0))
>>  		return;
>>
>>  	/*
>> @@ -468,7 +489,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>  			adjustment -= RWSEM_FLAG_HANDOFF;
>>  			lockevent_inc(rwsem_rlock_handoff);
>>  		}
>> +		/*
>> +		 * With HANDOFF set for reader, we must
>> +		 * terminate all spinning.
>> +		 */
>>  		waiter->handoff_set = true;
>> +		rwsem_set_nonspinnable(sem);
>>  	}
>>
>>  	atomic_long_add(-adjustment, &sem->count);
>> @@ -610,6 +636,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>>
>>  	lockdep_assert_held(&sem->wait_lock);
>>
>> +	if (!waiter->task) {
>> +		/* Write lock handed off */
>> +		smp_acquire__after_ctrl_dep();
> I don't think you need smp_acquire__after_ctrl_dep() here, since:
>
> *	The other side is just a normal write "waiter->task = NULL"
> *	The "&sem->wait_lock" already provides the necessary ACQUIRE
>
> , but I came to this series so late that I may miss something subtle.

You are probably right about that. I was just being on the cautious side
to make sure that there is a proper acquire barrier. Thinking about it,
the lock acquire is actually done in the previous wait_lock critical
section where the early lock handoff is done, so the wait_lock acquire
barrier should be good enough.

>
>> +		return true;
>> +	}
>> +
>>  	count = atomic_long_read(&sem->count);
>>  	do {
>>  		bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>> @@ -755,6 +787,10 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>>
>>  	owner = rwsem_owner_flags(sem, &flags);
>>  	state = rwsem_owner_state(owner, flags);
>> +
>> +	if (owner == current)
>> +		return OWNER_NONSPINNABLE;	/* Handoff granted */
>> +
>>  	if (state != OWNER_WRITER)
>>  		return state;
>>
>> @@ -1164,32 +1200,51 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>>  	 * the lock, attempt to spin on owner to accelerate lock
>>  	 * transfer. If the previous owner is a on-cpu writer and it
>>  	 * has just released the lock, OWNER_NULL will be returned.
>> -	 * In this case, we attempt to acquire the lock again
>> -	 * without sleeping.
>> +	 * In this case, the waker may be in the process of early
>> +	 * lock handoff. Use the wait_lock to synchronize with that
>> +	 * before checking for handoff.
>>  	 */
>>  	if (waiter.handoff_set) {
>>  		enum owner_state owner_state;
>>
>>  		owner_state = rwsem_spin_on_owner(sem);
>> -		if (owner_state == OWNER_NULL)
>> -			goto trylock_again;
>> +		if ((owner_state == OWNER_NULL) &&
>> +		    READ_ONCE(waiter.task)) {
> In theory, if there is a read outside some synchronization (say locks),
> not only READ_ONCE() but also WRITE_ONCE() is needed (even for a write
> inside locks), otherwise KCSAN will yell (if
> KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n).
> This is because a plain write may get torn.

Yes, I should have used a WRITE_ONCE() when writing NULL to waiter->task.

Thanks,
Longman
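
For reference, below is a compact userspace model of the publication
pattern being discussed above. It is a sketch only, not the rwsem code:
pthread_mutex_t stands in for sem->wait_lock, C11 relaxed atomics stand
in for WRITE_ONCE()/READ_ONCE(), and the names (handoff_waiter,
grant_handoff, handoff_done_*) are invented for the illustration.

	/*
	 * Userspace model only -- NOT the kernel code.  The mutex models
	 * sem->wait_lock; relaxed atomics model the marked accesses that
	 * rule out load/store tearing.
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	struct handoff_waiter {
		_Atomic(void *) task;	/* NULL once the lock was handed off */
	};

	static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Waker side: publish the handoff while holding wait_lock. */
	static void grant_handoff(struct handoff_waiter *w)
	{
		pthread_mutex_lock(&wait_lock);
		/* ... owner/count updates would go here ... */
		atomic_store_explicit(&w->task, NULL, memory_order_relaxed);
		pthread_mutex_unlock(&wait_lock);	/* RELEASE */
	}

	/* Waiter side, opportunistic check outside the lock (cf. READ_ONCE()). */
	static bool handoff_done_lockless(struct handoff_waiter *w)
	{
		return atomic_load_explicit(&w->task, memory_order_relaxed) == NULL;
	}

	/* Waiter side, decisive check made with the lock held. */
	static bool handoff_done_locked(struct handoff_waiter *w)
	{
		bool done;

		pthread_mutex_lock(&wait_lock);	/* ACQUIRE pairs with the unlock above */
		done = atomic_load_explicit(&w->task, memory_order_relaxed) == NULL;
		pthread_mutex_unlock(&wait_lock);
		return done;
	}

In this model the lockless check is only a hint; the decisive check is
made with the lock held, where the mutex acquire pairs with the waker's
release. That mirrors the argument above that the wait_lock acquire
barrier already provides the ordering, so no extra
smp_acquire__after_ctrl_dep() is needed.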