by Waiman Long

[permalink] [raw]

Subject: Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff

On 1/24/23 07:29, Peter Zijlstra wrote:
> On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
>> On 1/23/23 09:59, Peter Zijlstra wrote:
>>> On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
>>>> The lock handoff provided in rwsem isn't a true handoff like that in
>>>> the mutex. Instead, it is more like a quiescent state where optimistic
>>>> spinning and lock stealing are disabled to make it easier for the first
>>>> waiter to acquire the lock.
>>>>
>>>> Reworking the code to enable a true lock handoff is more complex due to
>>>> the following facts:
>>>> 1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
>>>> is too expensive to always take the wait_lock in the unlock path
>>>> to prevent racing.
>>> Specifically, the issue is that we'd need to turn the
>>> atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
>>> loop, like:
>>>
>>> tmp = atomic_long_read(&sem->count);
>>> do {
>>> if (tmp & (WAITERS|HANDOFF))
>>> return slow_unock();
>>> } while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
>>> tmp - RWSEM_{READER_BIAS,WRITE_LOCKED});
>>>
>>> in order to not race with a concurrent setting of the HANDOFF bit,
>>> right? And we don't like to turn unlock into a cmpxchg loop.
>>>
>>> (OTOH we sorta do this for mutex, unconteded mutex has cmpxchg lock and
>>> unlock, any fail and we go to the slow path -- I suppose the distinct
>>> difference is that we sorta expect some contention on the read side)
>> I see that your inclination is to do the handoff right at the unlock time.
>> It is certainly possible to do that, but it will be more complex in the case
>> of rwsem than mutex.
> Still, it would make things ever so much simpler -- but I agree we'll
> probably not get away with it on the performance side of things.
>
>>> Right. In short:
>>>
>>> Having HANDOVER set:
>>> - implies WAITERS set
>>> - disables all fastpaths (spinning or otherwise)
>>> - dis-allows anybody except first waiter to obtain lock
>>>
>>> Therefore, having the window between clearing owner and prodding first
>>> waiter is 'harmless'.
>> As said above, we need to confirm that the HANDOFF bit is set with wait_lock
>> held. Now, the HANDOFF bit may not set at unlock time, or it may not be.
>>
>> We can pass the count value fetched at unlock time down to rwsem_wake() to
>> confirm that HANDOFF bit is set at unlock time. However, it is also possible
>> that the original waiter that set HANDOFF have bailed out, then a reader
>> acquire the lock and another waiter set HANDOFF before the unlocker acquire
>> the wait lock. Then the rwsem is really reader-owned in this case. So we
>> can't perform handoff. That is why I also check for if there is an active
>> lock (mostly read lock) at rwsem_wake(). However, that can be a false
>> negative because an incoming reader may have just added a READER_BIAS which
>> is to be removed soon. That is the reason why I have a secondary handoff
>> check in the reader slowpath.
>>
>>>> With true lock handoff, there is no need to do a NULL owner spinning
>>>> anymore as wakeup will be performed if handoff is possible. So it
>>>> is likely that the first waiter won't actually go to sleep even when
>>>> schedule() is called in this case.
>>> Right, removing that NULL spinning was the whole purpose -- except I
>>> seem to have forgotten why it was a problem :-)
>>>
>>> OK, lemme go read the actual patch.
>>>
>>> Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
>>> above, you added a whole intermediate GRANTED state. Why?
>>>
>>> Since we fundamentally must deal with the release->wait_lock hole, why
>>> do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
>>> Why can't we skip the whole rwsem_wake()->GRANTED part and only do
>>> handoff in the slowpath?
>> First of all, the owner value for a reader-owned rwsem is mostly of an
>> advisory value as it holds one of the possible owners. So it may be a bit
>> risky to use it as an indication that a handoff had happened as it may be
>> screwed up in some rare cases. That is why I use the repurposed
>> handoff_state value in the waiter structure. Also reading this value is less
>> costly than reading the rwsem cacheline which can be heavily contended.
>>
>> I will update the patch description to highlight the points that I discussed
>> in this email.
> Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
> all the fastpaths, all the spinning, all the stealing, everything; then
> all that is left is the slowpath that is holding wait_lock.
>
> Then in both slowpaths, ensure only the first waiter can go on and we're
> done.
>
> What am I missing? Why does it need to be so complicated?
>
> That is, afaict something like the below would actually work, no? Yes,
> simply deleting that spinning in write_slowpath isn't ideal, but I
> suspect we can do something to rwsem_try_write_lock() to make up for
> that if we think about it.
>
> Again, please explain, explicitly and in small steps, why you think you
> need all that complexity.
>
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -40,7 +40,7 @@
> *
> * When the rwsem is reader-owned and a spinning writer has timed out,
> * the nonspinnable bit will be set to disable optimistic spinning.
> -
> + *
> * When a writer acquires a rwsem, it puts its task_struct pointer
> * into the owner field. It is cleared after an unlock.
> *
> @@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
> * Mark writer at the front of the queue for wakeup.
> * Until the task is actually later awoken later by
> * the caller, other writers are able to steal it.
> + *
> + * *Unless* HANDOFF is set, in which case only the
> + * first waiter is allowed to take it.
> + *
> * Readers, on the other hand, will block as they
> * will notice the queued writer.
> */
> @@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
> * force the issue.
> */
> if (time_after(jiffies, waiter->timeout)) {
> + /*
> + * Setting HANDOFF blocks fastpaths and stealing.
> + */
> if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
> adjustment -= RWSEM_FLAG_HANDOFF;
> lockevent_inc(rwsem_rlock_handoff);
> @@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
> }
>
> atomic_long_add(-adjustment, &sem->count);
> +
> + if (waiter->handoff_set) {
> + /*
> + * With HANDOFF set we must terminate all spinning.
> + */
> + rwsem_set_nonspinnable(sem);
> + }
> return;
> }
> /*
> @@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
> * Try to acquire the lock
> */
> taken = rwsem_try_write_lock_unqueued(sem);
> -
> if (taken)
> break;
>
> @@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
> if (signal_pending_state(state, current))
> goto out_nolock;
>
> - /*
> - * After setting the handoff bit and failing to acquire
> - * the lock, attempt to spin on owner to accelerate lock
> - * transfer. If the previous owner is a on-cpu writer and it
> - * has just released the lock, OWNER_NULL will be returned.
> - * In this case, we attempt to acquire the lock again
> - * without sleeping.
> - */
> - if (waiter.handoff_set) {
> - enum owner_state owner_state;
> -
> - owner_state = rwsem_spin_on_owner(sem);
> - if (owner_state == OWNER_NULL)
> - goto trylock_again;
> - }
> -
> schedule_preempt_disabled();
> lockevent_inc(rwsem_sleep_writer);
> set_current_state(state);
>
Thanks for the comments and suggested changes. I will adopt some of your
suggestions to simplify the patchset. Will post a new version once I
finish my testing.

Cheers,
Longman