Date:   Tue, 24 Jan 2023 13:29:39 +0100
From:   Peter Zijlstra <peterz@infradead.org>
To:     Waiman Long <longman@redhat.com>
Cc:     Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>,
        Boqun Feng <boqun.feng@gmail.com>,
        linux-kernel@vger.kernel.org, john.p.donnelly@oracle.com,
        Hillf Danton <hdanton@sina.com>,
        Mukesh Ojha <quic_mojha@quicinc.com>,
        Ting11 Wang =?utf-8?B?546L5am3?= <wangting11@xiaomi.com>
Subject: Re: [PATCH v6 5/6] locking/rwsem: Enable direct rwsem lock handoff
Message-ID: <Y8/PM7ONLUkIETxy@hirez.programming.kicks-ass.net>
References: <20221118022016.462070-1-longman@redhat.com>
 <20221118022016.462070-6-longman@redhat.com>
 <Y86gzkVHlsOTY8QL@hirez.programming.kicks-ass.net>
 <9fb531a9-0951-f7c0-316d-749d2c59ade8@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9fb531a9-0951-f7c0-316d-749d2c59ade8@redhat.com>
Precedence: bulk

On Mon, Jan 23, 2023 at 12:30:59PM -0500, Waiman Long wrote:
> On 1/23/23 09:59, Peter Zijlstra wrote:
> > On Thu, Nov 17, 2022 at 09:20:15PM -0500, Waiman Long wrote:
> > > The lock handoff provided in rwsem isn't a true handoff like that in
> > > the mutex. Instead, it is more like a quiescent state where optimistic
> > > spinning and lock stealing are disabled to make it easier for the first
> > > waiter to acquire the lock.
> > > 
> > > Reworking the code to enable a true lock handoff is more complex due to
> > > the following facts:
> > >   1) The RWSEM_FLAG_HANDOFF bit is protected by the wait_lock and it
> > >      is too expensive to always take the wait_lock in the unlock path
> > >      to prevent racing.
> > Specifically, the issue is that we'd need to turn the
> > atomic_long_add_return_release() into an atomic_try_cmpxchg_release()
> > loop, like:
> > 
> > 	tmp = atomic_long_read(&sem->count);
> > 	do {
> > 		if (tmp & (WAITERS|HANDOFF))
> > 			return slow_unock();
> > 	} while (atomic_long_try_cmpxchg_release(&sem->count, &tmp,
> > 						 tmp - RWSEM_{READER_BIAS,WRITE_LOCKED});
> > 
> > in order to not race with a concurrent setting of the HANDOFF bit,
> > right? And we don't like to turn unlock into a cmpxchg loop.
> > 
> > (OTOH we sorta do this for mutex, unconteded mutex has cmpxchg lock and
> > unlock, any fail and we go to the slow path -- I suppose the distinct
> > difference is that we sorta expect some contention on the read side)
> 
> I see that your inclination is to do the handoff right at the unlock time.
> It is certainly possible to do that, but it will be more complex in the case
> of rwsem than mutex.

Still, it would make things ever so much simpler -- but I agree we'll
probably not get away with it on the performance side of things.

> > Right. In short:
> > 
> > Having HANDOVER set:
> >   - implies WAITERS set
> >   - disables all fastpaths (spinning or otherwise)
> >   - dis-allows anybody except first waiter to obtain lock
> > 
> > Therefore, having the window between clearing owner and prodding first
> > waiter is 'harmless'.
> 
> As said above, we need to confirm that the HANDOFF bit is set with wait_lock
> held. Now, the HANDOFF bit may not set at unlock time, or it may not be.
> 
> We can pass the count value fetched at unlock time down to rwsem_wake() to
> confirm that HANDOFF bit is set at unlock time. However, it is also possible
> that the original waiter that set HANDOFF have bailed out, then a reader
> acquire the lock and another waiter set HANDOFF before the unlocker acquire
> the wait lock. Then the rwsem is really reader-owned in this case. So we
> can't perform handoff. That is why I also check for if there is an active
> lock (mostly read lock) at rwsem_wake(). However, that can be a false
> negative because an incoming reader may have just added a READER_BIAS which
> is to be removed soon. That is the reason why I have a secondary handoff
> check in the reader slowpath.
> 
> > 
> > > With true lock handoff, there is no need to do a NULL owner spinning
> > > anymore as wakeup will be performed if handoff is possible. So it
> > > is likely that the first waiter won't actually go to sleep even when
> > > schedule() is called in this case.
> > Right, removing that NULL spinning was the whole purpose -- except I
> > seem to have forgotten why it was a problem :-)
> > 
> > OK, lemme go read the actual patch.
> > 
> > Hmmm... you made it a wee bit more complicated, instead of my 3rd clause
> > above, you added a whole intermediate GRANTED state. Why?
> > 
> > Since we fundamentally must deal with the release->wait_lock hole, why
> > do we need to do the whole rwsem_wake()->GRANTED->*_slowpath() dance?
> > Why can't we skip the whole rwsem_wake()->GRANTED part and only do
> > handoff in the slowpath?
> 
> First of all, the owner value for a reader-owned rwsem is mostly of an
> advisory value as it holds one of the possible owners. So it may be a bit
> risky to use it as an indication that a handoff had happened as it may be
> screwed up in some rare cases. That is why I use the repurposed
> handoff_state value in the waiter structure. Also reading this value is less
> costly than reading the rwsem cacheline which can be heavily contended.
> 
> I will update the patch description to highlight the points that I discussed
> in this email.

Maybe I'm being dense, but I'm not seeing it. If we make HANDOFF block
all the fastpaths, all the spinning, all the stealing, everything; then
all that is left is the slowpath that is holding wait_lock.

Then in both slowpaths, ensure only the first waiter can go on and we're
done.

What am I missing? Why does it need to be so complicated?

That is, afaict something like the below would actually work, no? Yes,
simply deleting that spinning in write_slowpath isn't ideal, but I
suspect we can do something to rwsem_try_write_lock() to make up for
that if we think about it.

Again, please explain, explicitly and in small steps, why you think you
need all that complexity.

--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -40,7 +40,7 @@
  *
  * When the rwsem is reader-owned and a spinning writer has timed out,
  * the nonspinnable bit will be set to disable optimistic spinning.
-
+ *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
@@ -430,6 +430,10 @@ static void rwsem_mark_wake(struct rw_se
 			 * Mark writer at the front of the queue for wakeup.
 			 * Until the task is actually later awoken later by
 			 * the caller, other writers are able to steal it.
+			 *
+			 * *Unless* HANDOFF is set, in which case only the
+			 * first waiter is allowed to take it.
+			 *
 			 * Readers, on the other hand, will block as they
 			 * will notice the queued writer.
 			 */
@@ -463,6 +467,9 @@ static void rwsem_mark_wake(struct rw_se
 			 * force the issue.
 			 */
 			if (time_after(jiffies, waiter->timeout)) {
+				/*
+				 * Setting HANDOFF blocks fastpaths and stealing.
+				 */
 				if (!(oldcount & RWSEM_FLAG_HANDOFF)) {
 					adjustment -= RWSEM_FLAG_HANDOFF;
 					lockevent_inc(rwsem_rlock_handoff);
@@ -471,6 +478,13 @@ static void rwsem_mark_wake(struct rw_se
 			}
 
 			atomic_long_add(-adjustment, &sem->count);
+
+			if (waiter->handoff_set) {
+				/*
+				 * With HANDOFF set we must terminate all spinning.
+				 */
+				rwsem_set_nonspinnable(sem);
+			}
 			return;
 		}
 		/*
@@ -844,7 +858,6 @@ static bool rwsem_optimistic_spin(struct
 		 * Try to acquire the lock
 		 */
 		taken = rwsem_try_write_lock_unqueued(sem);
-
 		if (taken)
 			break;
 
@@ -1159,22 +1172,6 @@ rwsem_down_write_slowpath(struct rw_sema
 		if (signal_pending_state(state, current))
 			goto out_nolock;
 
-		/*
-		 * After setting the handoff bit and failing to acquire
-		 * the lock, attempt to spin on owner to accelerate lock
-		 * transfer. If the previous owner is a on-cpu writer and it
-		 * has just released the lock, OWNER_NULL will be returned.
-		 * In this case, we attempt to acquire the lock again
-		 * without sleeping.
-		 */
-		if (waiter.handoff_set) {
-			enum owner_state owner_state;
-
-			owner_state = rwsem_spin_on_owner(sem);
-			if (owner_state == OWNER_NULL)
-				goto trylock_again;
-		}
-
 		schedule_preempt_disabled();
 		lockevent_inc(rwsem_sleep_writer);
 		set_current_state(state);