2021-11-07 21:52:32

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On 11/7/21 04:01, Hillf Danton wrote:
> On Sat, 6 Nov 2021 23:25:38 -0400 Waiman Long wrote:
>> On 11/6/21 08:39, 马振华 wrote:
>>> Dear longman,
>>>
>>> recently, I found an issue where the rwsem count becomes a negative
>>> value; it always happens when a task tries to get the lock
>>> with __down_write_killable and is then killed
>>>
>>> this issue happened like this
>>>
>>>             CPU2         CPU4
>>>     task A[reader]     task B[writer]
>>>     down_read_killable[locked]
>>>     sem->count=0x100
>>>             down_write_killable
>>>             sem->count=0x102[wlist not empty]
>>>     up_read
>>>     count=0x2
>>>             sig kill received
>>>     down_read_killable
>>>     sem->count=0x102[wlist not empty]
>>>             goto branch out_nolock:
>>> list_del(&waiter.list);
>>> wait list is empty
>>> sem->count-RWSEM_FLAG_HANDOFF
>>> sem->count=0xFE
>>>     list_empty(&sem->wait_list) is TRUE
>>>      sem->count andnot RWSEM_FLAG_WAITERS
>>>       sem->count=0xFC
>>>     up_read
>>>     sem->count -= 0x100
>>>     sem->count=0xFFFFFFFFFFFFFFFC
>>>     DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>>>
>>> so sem->count will be negative after the writer is killed.
>>> I think if the RWSEM_FLAG_HANDOFF flag is not set, we shouldn't clear it.
>> Thanks for reporting this possible race condition.
>>
>> However, I am still trying to figure out how it is possible to set the
>> wstate to WRITER_HANDOFF without actually setting the handoff bit as
>> well. The statement sequence should be as follows:
>>
>> wstate = WRITER_HANDOFF;
>> raw_spin_lock_irq(&sem->wait_lock);
>> if (rwsem_try_write_lock(sem, wstate))
>> raw_spin_unlock_irq(&sem->wait_lock);
>>   :
>> if (signal_pending_state(state, current))
>>     goto out_nolock
>>
>> The rwsem_try_write_lock() function will make sure that we either
>> acquire the lock and clear handoff or set the handoff bit. This should
>> be done before we actually check for signal. I do think that it is
> Given that WRITER_HANDOFF does not ensure that RWSEM_FLAG_HANDOFF is set in
> rwsem_try_write_lock(), the flag should be cleared only by the setter to
> avoid count underflow.
>
> Hillf
>
>> probably safer to use atomic_long_andnot to clear the handoff bit
>> instead of using atomic_long_add().

I did have a tentative patch to address this issue which is somewhat
similar to your approach. However, I would like to further investigate
the exact mechanics of the race condition to make sure that I won't miss
a latent bug somewhere else in the rwsem code.

-Longman

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..20be967620c0 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -347,7 +347,8 @@ enum rwsem_wake_type {
 enum writer_wait_state {
     WRITER_NOT_FIRST,    /* Writer is not first in wait list */
     WRITER_FIRST,        /* Writer is first in wait list     */
-    WRITER_HANDOFF        /* Writer is first & handoff needed */
+    WRITER_NEED_HANDOFF,    /* Writer is first & handoff needed */
+    WRITER_HANDOFF        /* Writer is first & handoff set */
 };

 /*
@@ -532,11 +533,11 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
  * race conditions between checking the rwsem wait list and setting the
  * sem->count accordingly.
  *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
+ * If wstate is WRITER_NEED_HANDOFF, it will make sure that either the handoff
  * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
-                    enum writer_wait_state wstate)
+                    enum writer_wait_state *wstate)
 {
     long count, new;

@@ -546,13 +547,13 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
     do {
         bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

-        if (has_handoff && wstate == WRITER_NOT_FIRST)
+        if (has_handoff && *wstate == WRITER_NOT_FIRST)
             return false;

         new = count;

         if (count & RWSEM_LOCK_MASK) {
-            if (has_handoff || (wstate != WRITER_HANDOFF))
+            if (has_handoff || (*wstate != WRITER_NEED_HANDOFF))
                 return false;

             new |= RWSEM_FLAG_HANDOFF;
@@ -569,8 +570,10 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
      * We have either acquired the lock with handoff bit cleared or
      * set the handoff bit.
      */
-    if (new & RWSEM_FLAG_HANDOFF)
+    if (new & RWSEM_FLAG_HANDOFF) {
+        *wstate = WRITER_HANDOFF;
         return false;
+    }

     rwsem_set_owner(sem);
     return true;
@@ -1083,7 +1086,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
     /* wait until we successfully acquire the lock */
     set_current_state(state);
     for (;;) {
-        if (rwsem_try_write_lock(sem, wstate)) {
+        if (rwsem_try_write_lock(sem, &wstate)) {
             /* rwsem_try_write_lock() implies ACQUIRE on success */
             break;
         }
@@ -1138,7 +1141,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
              */
             if ((wstate == WRITER_FIRST) && (rt_task(current) ||
                 time_after(jiffies, waiter.timeout))) {
-                wstate = WRITER_HANDOFF;
+                wstate = WRITER_NEED_HANDOFF;
                 lockevent_inc(rwsem_wlock_handoff);
                 break;
             }
@@ -1159,7 +1162,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
     list_del(&waiter.list);

     if (unlikely(wstate == WRITER_HANDOFF))
-        atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+        atomic_long_andnot(RWSEM_FLAG_HANDOFF, &sem->count);

     if (list_empty(&sem->wait_list))
         atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
--



2021-11-08 02:25:47

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On 11/7/21 10:24, Waiman Long wrote:
> On 11/7/21 04:01, Hillf Danton wrote:
>> On Sat, 6 Nov 2021 23:25:38 -0400 Waiman Long wrote:
>>> On 11/6/21 08:39, 马振华 wrote:
>>>> Dear longman,
>>>>
>>>> recently, I found an issue where the rwsem count becomes a negative
>>>> value; it always happens when a task tries to get the lock
>>>> with __down_write_killable and is then killed
>>>>
>>>> this issue happened like this
>>>>
>>>>              CPU2         CPU4
>>>>      task A[reader]     task B[writer]
>>>>      down_read_killable[locked]
>>>>      sem->count=0x100
>>>>              down_write_killable
>>>>              sem->count=0x102[wlist not empty]
>>>>      up_read
>>>>      count=0x2
>>>>              sig kill received
>>>>      down_read_killable
>>>>      sem->count=0x102[wlist not empty]
>>>>              goto branch out_nolock:
>>>> list_del(&waiter.list);
>>>> wait list is empty
>>>> sem->count-RWSEM_FLAG_HANDOFF
>>>> sem->count=0xFE
>>>>      list_empty(&sem->wait_list) is TRUE
>>>>       sem->count andnot RWSEM_FLAG_WAITERS
>>>>        sem->count=0xFC
>>>>      up_read
>>>>      sem->count -= 0x100
>>>>      sem->count=0xFFFFFFFFFFFFFFFC
>>>>      DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
>>>>
>>>> so sem->count will be negative after the writer is killed.
>>>> I think if the RWSEM_FLAG_HANDOFF flag is not set, we shouldn't clear it.
>>> Thanks for reporting this possible race condition.
>>>
>>> However, I am still trying to figure how it is possible to set the
>>> wstate to WRITER_HANDOFF without actually setting the handoff bit as
>>> well. The statement sequence should be as follows:
>>>
>>> wstate = WRITER_HANDOFF;
>>> raw_spin_lock_irq(&sem->wait_lock);
>>> if (rwsem_try_write_lock(sem, wstate))
>>> raw_spin_unlock_irq(&sem->wait_lock);
>>>    :
>>> if (signal_pending_state(state, current))
>>>      goto out_nolock
>>>
>>> The rwsem_try_write_lock() function will make sure that we either
>>> acquire the lock and clear handoff or set the handoff bit. This should
>>> be done before we actually check for signal. I do think that it is
>> Given that WRITER_HANDOFF does not ensure that RWSEM_FLAG_HANDOFF is set in
>> rwsem_try_write_lock(), the flag should be cleared only by the setter to
>> avoid count underflow.
>>
>> Hillf
>>
>>> probably safer to use atomic_long_andnot to clear the handoff bit
>>> instead of using atomic_long_add().
>
> I did have a tentative patch to address this issue which is somewhat
> similar to your approach. However, I would like to further investigate
> the exact mechanics of the race condition to make sure that I won't
> miss a latent bug somewhere else in the rwsem code.

I still couldn't figure out how this race condition can happen. However, I
did discover that it is possible to leave the rwsem with no waiter but the
handoff bit set if we kill or interrupt all the waiters in the wait
queue. I have just sent out a patch to address that concern, but it
should be able to handle this race condition as well if it really happens.
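
As a bit-level illustration of that corner case, here is a minimal userspace
sketch (not kernel code; the constants mirror the RWSEM_FLAG_* values in
kernel/locking/rwsem.c): if the exit path clears only RWSEM_FLAG_WAITERS when
the wait list becomes empty, as in the writer out_nolock path quoted earlier,
a handoff bit granted earlier is left behind with nobody to consume it.

#include <stdio.h>

#define RWSEM_FLAG_WAITERS   (1UL << 1)
#define RWSEM_FLAG_HANDOFF   (1UL << 2)
#define RWSEM_READER_BIAS    (1UL << 8)

int main(void)
{
        /* reader holds the lock, one waiter queued, handoff already granted */
        long count = RWSEM_READER_BIAS | RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF;

        /* last waiter is killed: exit path clears only WAITERS */
        count &= ~RWSEM_FLAG_WAITERS;

        count -= RWSEM_READER_BIAS;          /* up_read() */

        /* lock is free, wait list is empty, yet HANDOFF is still set */
        printf("count = %#lx (HANDOFF left behind: %s)\n",
               (unsigned long)count,
               (count & RWSEM_FLAG_HANDOFF) ? "yes" : "no");
        return 0;
}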

Cheers,
Longman

2021-11-10 21:39:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Sun, Nov 07, 2021 at 02:52:36PM -0500, Waiman Long wrote:
> >
> > I did have a tentative patch to address this issue which is somewhat
> > similar to your approach. However, I would like to further investigate
> > the exact mechanics of the race condition to make sure that I won't miss
> > a latent bug somewhere else in the rwsem code.
>
> I still couldn't figure how this race condition can happen. However, I do
> discover that it is possible to leave rwsem with no waiter but handoff bit
> set if we kill or interrupt all the waiters in the wait queue. I have just
> sent out a patch to address that concern, but it should be able to handle
> this race condition as well if it really happens.

The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
there's a 4th place that modifies the HANDOFF bit namely
rwsem_down_read_slowpath() in the out_nolock: case.

Now the thing I'm most worried about is that rwsem_down_write_slowpath()
modifies the HANDOFF bit depending on wstate, and wstate itself is not
determined under the same ->wait_lock section, so there could be a race
there.

Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
to return OWNER_NULL such that it goes to trylock_again, however if it
returns anything else then we're at signal_pending_state() and the
observed race can happen.

Now, spin_on_owner() *can* in fact return something else, consider
need_resched() being set for instance.

Combined I think the observed race is valid.

Now before we go make things more complicated, I think we should see if
we can make things simpler. Also I think perhaps the HANDOFF name here
is a misnomer.

I agree that using _andnot() will fix this issue; I also agree with
folding it with the existing _andnot() already there. But let me stare a
little more at this code, something isn't making sense...

2021-11-11 02:42:37

by Aiqun Yu (Maria)

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

From: [email protected]

One important piece of information: when the issue was reported, the first waiter could change. This is due to other hook changes that do not always add waiters to the tail of the wait list, which differs from the current upstream code.

In the original Xiaomi issue, when we checked the log we found that the scenario is: the write waiter set the RWSEM_FLAG_HANDOFF bit in sem->count, and then another reader jumped in and became the first waiter.
So when the reader goes through rwsem_down_read_slowpath(), if the RWSEM_FLAG_HANDOFF bit is set, it will clear that bit.

So the writer's local wstate variable was left as WRITER_HANDOFF, but the reader thought it was a reader handoff and cleared the bit.
Meanwhile the writer only checks its local wstate variable for WRITER_HANDOFF and does the add(-RWSEM_FLAG_HANDOFF). If it did an andnot(RWSEM_FLAG_HANDOFF) instead, the writer would not make sem->count underflow.

The current scenario looked like this:
--------------------------------
wstate = WRITER_HANDOFF;
raw_spin_lock_irq(&sem->wait_lock);
if (rwsem_try_write_lock(sem, wstate))
raw_spin_unlock_irq(&sem->wait_lock);
:
if (signal_pending_state(state, current)) // meanwhile another reader in rwsem_down_read_slowpath() is added to the head of the wait list and clears RWSEM_FLAG_HANDOFF
goto out_nolock
out_nolock:
add(-RWSEM_FLAG_HANDOFF, sem->count) // makes the count negative.
--------------------------------

So doing an andnot would be much safer, and it works in this scenario:
--------------------------------
wstate = WRITER_HANDOFF;
raw_spin_lock_irq(&sem->wait_lock);
if (rwsem_try_write_lock(sem, wstate))
raw_spin_unlock_irq(&sem->wait_lock);
:
if (signal_pending_state(state, current)) // meanwhile another reader in rwsem_down_read_slowpath() is added to the head of the wait list and clears RWSEM_FLAG_HANDOFF
goto out_nolock
out_nolock:
andnot(RWSEM_FLAG_HANDOFF, sem->count) // leaves the count unchanged.
--------------------------------


And doing an andnot also works in the normal scenario where the writer is not killed:
--------------------------------
wstate = WRITER_HANDOFF;
raw_spin_lock_irq(&sem->wait_lock);
if (rwsem_try_write_lock(sem, wstate))
raw_spin_unlock_irq(&sem->wait_lock);
:
if (signal_pending_state(state, current)) // meanwhile another reader in rwsem_down_read_slowpath() is added to the head of the wait list and clears RWSEM_FLAG_HANDOFF
if (wstate == WRITER_HANDOFF)
trylock_again:
raw_spin_lock_irq(&sem->wait_lock);
if (rwsem_try_write_lock(sem, wstate)) //writer will set the RWSEM_FLAG_HANDOFF flag again.
raw_spin_unlock_irq(&sem->wait_lock);
--------------------------------
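
To make the count arithmetic in the killed-writer scenarios above concrete,
here is a minimal userspace model (not kernel code; the constants mirror
RWSEM_FLAG_WAITERS, RWSEM_FLAG_HANDOFF and RWSEM_READER_BIAS from
kernel/locking/rwsem.c, and the sequence follows the bug report): the
add(-RWSEM_FLAG_HANDOFF) corrupts the count when the bit was already cleared
by someone else, while an andnot stays harmless.

#include <stdio.h>

#define RWSEM_FLAG_WAITERS   (1UL << 1)
#define RWSEM_FLAG_HANDOFF   (1UL << 2)
#define RWSEM_READER_BIAS    (1UL << 8)

int main(void)
{
        /* 0x102: one reader holds the lock, wait list is non-empty,
         * and the HANDOFF bit has already been cleared by another task */
        long count = RWSEM_READER_BIAS | RWSEM_FLAG_WAITERS;

        /* killed writer's out_nolock path using add(-HANDOFF) */
        long bad = count - RWSEM_FLAG_HANDOFF;   /* 0xFE: reader count corrupted */
        bad &= ~RWSEM_FLAG_WAITERS;              /* 0xFC */
        bad -= RWSEM_READER_BIAS;                /* up_read(): goes negative */

        /* killed writer's out_nolock path using andnot(HANDOFF) */
        long good = count & ~RWSEM_FLAG_HANDOFF; /* still 0x102, bit was clear */
        good &= ~RWSEM_FLAG_WAITERS;             /* 0x100 */
        good -= RWSEM_READER_BIAS;               /* up_read(): back to 0 */

        printf("add(-HANDOFF):   %#lx\n", (unsigned long)bad);
        printf("andnot(HANDOFF): %#lx\n", (unsigned long)good);
        return 0;
}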



Thx and BRs,
Aiqun Yu (Maria)



2021-11-11 15:08:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Wed, Nov 10, 2021 at 10:38:55PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 07, 2021 at 02:52:36PM -0500, Waiman Long wrote:
> > >
> > > I did have a tentative patch to address this issue which is somewhat
> > > similar to your approach. However, I would like to further investigate
> > > the exact mechanics of the race condition to make sure that I won't miss
> > > a latent bug somewhere else in the rwsem code.
> >
> > I still couldn't figure how this race condition can happen. However, I do
> > discover that it is possible to leave rwsem with no waiter but handoff bit
> > set if we kill or interrupt all the waiters in the wait queue. I have just
> > sent out a patch to address that concern, but it should be able to handle
> > this race condition as well if it really happens.
>
> The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
> there's a 4th place that modifies the HANDOFF bit namely
> rwsem_down_read_slowpath() in the out_nolock: case.
>
> Now the thing I'm most worried about is that rwsem_down_write_slowpath()
> modifies the HANDOFF bit depending on wstate, and wstate itself is not
> determined under the same ->wait_lock section, so there could be a race
> there.
>
> Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
> to return OWNER_NULL such that it goes to trylock_again, however if it
> returns anything else then we're at signal_pending_state() and the
> observed race can happen.
>
> Now, spin_on_owner() *can* in fact return something else, consider
> need_resched() being set for instance.
>
> Combined I think the observed race is valid.
>
> Now before we go make things more complicated, I think we should see if
> we can make things simpler. Also I think perhaps the HANDOFF name here
> is a misnomer.
>
> I agree that using _andnot() will fix this issue; I also agree with
> folding it with the existing _andnot() already there. But let me stare a
> little more at this code, something isn't making sense...

I think I want to see WRITER_HANDOFF go away. And preferably all of
wstate.

Something like the *completely* untested below, might set fire to your
pet, eat your granny, etc..

Also, perhaps s/HANDOFF/PHASE_CHANGE/ ?

Waiman, did I overlook something fundamental here?

---
kernel/locking/rwsem.c | 85 +++++++++++++++++++-------------------------------
1 file changed, 32 insertions(+), 53 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c51387a43265..bc5da05346e2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -104,14 +104,17 @@
* atomic_long_fetch_add() is used to obtain reader lock, whereas
* atomic_long_cmpxchg() will be used to obtain writer lock.
*
- * There are three places where the lock handoff bit may be set or cleared.
- * 1) rwsem_mark_wake() for readers.
- * 2) rwsem_try_write_lock() for writers.
- * 3) Error path of rwsem_down_write_slowpath().
+ * There are four places where the lock handoff bit may be set or cleared.
+ * 1) rwsem_mark_wake() for readers -- set, clear
+ * 2) rwsem_try_write_lock() for writers -- set, clear
+ * 3) Error path of rwsem_down_write_slowpath() -- clear
+ * 4) Error path of rwsem_down_read_slowpath() -- clear
*
* For all the above cases, wait_lock will be held. A writer must also
* be the first one in the wait_list to be eligible for setting the handoff
* bit. So concurrent setting/clearing of handoff bit is not possible.
+ *
+ * XXX handoff is a misnomer here, all it does is force a phase change
*/
#define RWSEM_WRITER_LOCKED (1UL << 0)
#define RWSEM_FLAG_WAITERS (1UL << 1)
@@ -344,12 +347,6 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
};

-enum writer_wait_state {
- WRITER_NOT_FIRST, /* Writer is not first in wait list */
- WRITER_FIRST, /* Writer is first in wait list */
- WRITER_HANDOFF /* Writer is first & handoff needed */
-};
-
/*
* The typical HZ value is either 250 or 1000. So set the minimum waiting
* time to at least 4ms or 1 jiffy (if it is higher than 4ms) in the wait
@@ -531,13 +528,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
* This function must be called with the sem->wait_lock held to prevent
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
- *
- * If wstate is WRITER_HANDOFF, it will make sure that either the handoff
- * bit is set or the lock is acquired with handoff bit cleared.
*/
static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
- enum writer_wait_state wstate)
+ struct rwsem_waiter *waiter)
{
+ bool first = rwsem_first_waiter(sem) == waiter;
+ bool timo = time_after(jiffies, waiter->timeout);
long count, new;

lockdep_assert_held(&sem->wait_lock);
@@ -546,13 +542,13 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
do {
bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);

- if (has_handoff && wstate == WRITER_NOT_FIRST)
+ if (has_handoff && !first)
return false;

new = count;

if (count & RWSEM_LOCK_MASK) {
- if (has_handoff || (wstate != WRITER_HANDOFF))
+ if (has_handoff || !(first && timo))
return false;

new |= RWSEM_FLAG_HANDOFF;
@@ -707,6 +703,8 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
break;
}

+ // XXX also terminate on signal_pending_state(current->__state, current) ?
+
cpu_relax();
}

@@ -1019,11 +1017,10 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
static struct rw_semaphore *
rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
{
- long count;
- enum writer_wait_state wstate;
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
+ long count, flags = 0;

/* do optimistic spinning and steal lock if possible */
if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
@@ -1041,13 +1038,10 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)

raw_spin_lock_irq(&sem->wait_lock);

- /* account for this before adding a new element to the list */
- wstate = list_empty(&sem->wait_list) ? WRITER_FIRST : WRITER_NOT_FIRST;
-
list_add_tail(&waiter.list, &sem->wait_list);

/* we're now waiting on the lock */
- if (wstate == WRITER_NOT_FIRST) {
+ if (rwsem_first_waiter(sem) != &waiter) {
count = atomic_long_read(&sem->count);

/*
@@ -1083,22 +1077,14 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
/* wait until we successfully acquire the lock */
set_current_state(state);
for (;;) {
- if (rwsem_try_write_lock(sem, wstate)) {
+ if (rwsem_try_write_lock(sem, &waiter)) {
/* rwsem_try_write_lock() implies ACQUIRE on success */
break;
}

raw_spin_unlock_irq(&sem->wait_lock);

- /*
- * After setting the handoff bit and failing to acquire
- * the lock, attempt to spin on owner to accelerate lock
- * transfer. If the previous owner is a on-cpu writer and it
- * has just released the lock, OWNER_NULL will be returned.
- * In this case, we attempt to acquire the lock again
- * without sleeping.
- */
- if (wstate == WRITER_HANDOFF) {
+ if (rwsem_first_waiter(sem) == &waiter) {
enum owner_state owner_state;

preempt_disable();
@@ -1117,28 +1103,15 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
schedule();
lockevent_inc(rwsem_sleep_writer);
set_current_state(state);
- /*
- * If HANDOFF bit is set, unconditionally do
- * a trylock.
- */
- if (wstate == WRITER_HANDOFF)
- break;

- if ((wstate == WRITER_NOT_FIRST) &&
- (rwsem_first_waiter(sem) == &waiter))
- wstate = WRITER_FIRST;
+ if (rwsem_first_waiter(sem) != &waiter)
+ continue;

count = atomic_long_read(&sem->count);
if (!(count & RWSEM_LOCK_MASK))
break;

- /*
- * The setting of the handoff bit is deferred
- * until rwsem_try_write_lock() is called.
- */
- if ((wstate == WRITER_FIRST) && (rt_task(current) ||
- time_after(jiffies, waiter.timeout))) {
- wstate = WRITER_HANDOFF;
+ if (rt_task(current) || time_after(jiffies, waiter.timeout)) {
lockevent_inc(rwsem_wlock_handoff);
break;
}
@@ -1156,15 +1129,21 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
out_nolock:
__set_current_state(TASK_RUNNING);
raw_spin_lock_irq(&sem->wait_lock);
- list_del(&waiter.list);

- if (unlikely(wstate == WRITER_HANDOFF))
- atomic_long_add(-RWSEM_FLAG_HANDOFF, &sem->count);
+ if (rwsem_first_waiter(sem) == &waiter)
+ flags |= RWSEM_FLAG_HANDOFF;
+
+ list_del(&waiter.list);

if (list_empty(&sem->wait_list))
- atomic_long_andnot(RWSEM_FLAG_WAITERS, &sem->count);
- else
+ flags |= RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF;
+
+ if (flags)
+ atomic_long_andnot(flags, &sem->count);
+
+ if (!list_empty(&sem->wait_list))
rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
raw_spin_unlock_irq(&sem->wait_lock);
wake_up_q(&wake_q);
lockevent_inc(rwsem_wlock_fail);

2021-11-11 19:15:01

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 10:08, Peter Zijlstra wrote:
> On Wed, Nov 10, 2021 at 10:38:55PM +0100, Peter Zijlstra wrote:
>>
>> The comment above RWSEM_WRITER_LOCKED seems wrong/out-dated in that
>> there's a 4th place that modifies the HANDOFF bit namely
>> rwsem_down_read_slowpath() in the out_nolock: case.
>>
>> Now the thing I'm most worried about is that rwsem_down_write_slowpath()
>> modifies the HANDOFF bit depending on wstate, and wstate itself is not
>> determined under the same ->wait_lock section, so there could be a race
>> there.
>>
>> Another thing is that once wstate==HANDOFF, we rely on spin_on_owner()
>> to return OWNER_NULL such that it goes to trylock_again, however if it
>> returns anything else then we're at signal_pending_state() and the
>> observed race can happen.
>>
>> Now, spin_on_owner() *can* in fact return something else, consider
>> need_resched() being set for instance.
>>
>> Combined I think the observed race is valid.
>>
>> Now before we go make things more complicated, I think we should see if
>> we can make things simpler. Also I think perhaps the HANDOFF name here
>> is a misnomer.
>>
>> I agree that using _andnot() will fix this issue; I also agree with
>> folding it with the existing _andnot() already there. But let me stare a
>> little more at this code, something isn't making sense...
> I think I want to see WRITER_HANDOFF go away. And preferably all of
> wstate.
>
> Something like the *completely* untested below, might set fire to your
> pet, eat your granny, etc..
>
> Also, perhaps s/HANDOFF/PHASE_CHANGE/ ?
>
> Waiman, did I overlook something fundamental here?

The handoff bit is also set when the current writer is an RT task. You
missed that in your patch. The attached patch is my version of your
change. What do you think about that?

As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
mutex. Maybe a follow up patch if you think we should change the
terminology.

Cheers,
Longman


Attachments:
0001-locking-rwsem-Make-handoff-bit-handling-more-consist.patch (9.93 kB)

2021-11-11 19:20:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> mutex. Maybe a follow up patch if you think we should change the
> terminology.

Well, that's exactly the point, they do radically different things.
Having the same name for two different things is confusing.

Anyway, let me go read that patch you sent.

2021-11-11 19:37:11

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 14:20, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>> mutex. Maybe a follow up patch if you think we should change the
>> terminology.
> Well, that's exactly the point, they do radically different things.
> Having the same name for two different things is confusing.
>
> Anyway, let me go read that patch you sent.

My understanding of handoff is to disable optimistic spinning to let
waiters in the wait queue have an opportunity to acquire the lock. There
are differences in detail on how to do that in mutex and rwsem, though.

BTW, I have decided that we should further simplify the trylock for-loop
in rwsem_down_write_slowpath() to make it easier to read. That is the
only difference in the attached v2 patch compared with the previous one.

Cheers,
LongmaN


Attachments:
v2-0001-locking-rwsem-Make-handoff-bit-handling-more-cons.patch (10.12 kB)

2021-11-11 19:52:33

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 14:36, Waiman Long wrote:
>
> On 11/11/21 14:20, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem
>>> and
>>> mutex. Maybe a follow up patch if you think we should change the
>>> terminology.
>> Well, that's exactly the point, they do radically different things.
>> Having the same name for two different things is confusing.
>>
>> Anyway, let me go read that patch you sent.
>
> My understanding of handoff is to disable optimistic spinning to let
> waiters in the wait queue have an opportunity to acquire the lock.
> There are difference in details on how to do that in mutex and rwsem,
> though.
>
> BTW, I have decided that we should further simply the trylock for loop
> in rwsem_down_write_slowpath() to make it easier to read. That is the
> only difference in the attached v2 patch compared with the previous one.

My bad, I forgot to initialize waiter.handoff_set in
rwsem_down_write_slowpath(). I will send out an updated version once you
have finished your review.

Cheers,
Longman


2021-11-11 20:27:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:

> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
> if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
> time_after(jiffies, waiter->timeout)) {
> adjustment -= RWSEM_FLAG_HANDOFF;
> + waiter->handoff_set = true;
> lockevent_inc(rwsem_rlock_handoff);
> }
>

Do we really need this flag? Wouldn't it be the same as waiter-is-first
AND sem-has-handoff ?

> static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> + struct rwsem_waiter *waiter)
> {
> long count, new;
> + bool first = rwsem_first_waiter(sem) == waiter;

flip those lines for reverse xmas please

>
> lockdep_assert_held(&sem->wait_lock);
>
> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> do {
> bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>
> - if (has_handoff && wstate == WRITER_NOT_FIRST)
> + if (has_handoff && !first)
> return false;
>
> new = count;
>
> if (count & RWSEM_LOCK_MASK) {
> - if (has_handoff || (wstate != WRITER_HANDOFF))
> + if (has_handoff || (!waiter->rt_task &&
> + !time_after(jiffies, waiter->timeout)))


Does ->rt_task really help over rt_task(current) ? I suppose there's an
argument for locality, but that should be pretty much it, no?

> return false;
>
> new |= RWSEM_FLAG_HANDOFF;
> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
> }
> #endif
>
> +/*
> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
> + */
> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
> + struct rwsem_waiter *waiter)
> +{
> + long flags = 0;
> +
> + list_del(&waiter->list);
> + if (list_empty(&sem->wait_list))
> + flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
> + else if (waiter->handoff_set)
> + flags = RWSEM_FLAG_HANDOFF;
> +
> + if (flags)
> + atomic_long_andnot(flags, &sem->count);
> +}

Right, so I like sharing this between the two _slowpath functions, that
makes sense.

The difference between this and my approach is that I unconditionally
clear HANDOFF when @waiter was the first. Because if it was set, it
must've been ours, and if it wasn't set, clearing it doesn't really hurt
much. This is an unlikely path, I don't think the potentially extra
atomic is an issue here.

> +
> /*
> * Wait for the read lock to be granted
> */
> @@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
> waiter.task = current;
> waiter.type = RWSEM_WAITING_FOR_READ;
> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> + waiter.handoff_set = false;

Forgot to set rt_task

>
> raw_spin_lock_irq(&sem->wait_lock);
> if (list_empty(&sem->wait_list)) {

> @@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> waiter.task = current;
> waiter.type = RWSEM_WAITING_FOR_WRITE;
> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;

Forget to set handoff_set

> + waiter.rt_task = rt_task(current);
>
> raw_spin_lock_irq(&sem->wait_lock);

Again, I'm not convinced we need these variables.

> @@ -1083,13 +1093,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> /* wait until we successfully acquire the lock */
> set_current_state(state);
> for (;;) {
> - if (rwsem_try_write_lock(sem, wstate)) {
> + if (rwsem_try_write_lock(sem, &waiter)) {
> /* rwsem_try_write_lock() implies ACQUIRE on success */
> break;
> }
>
> raw_spin_unlock_irq(&sem->wait_lock);
>
> + if (signal_pending_state(state, current))
> + goto out_nolock;
> +
> /*
> * After setting the handoff bit and failing to acquire
> * the lock, attempt to spin on owner to accelerate lock
> @@ -1098,7 +1111,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> * In this case, we attempt to acquire the lock again
> * without sleeping.
> */
> - if (wstate == WRITER_HANDOFF) {
> + if (waiter.handoff_set) {
> enum owner_state owner_state;
>
> preempt_disable();

Does it matter much if we spin-wait for every first waiter or only for handoff?

Either way around, I think spin-wait ought to terminate on sigpending
(same for mutex I suppose).

> @@ -1109,40 +1122,9 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
> goto trylock_again;
> }
>
> - /* Block until there are no active lockers. */
> - for (;;) {
> - if (signal_pending_state(state, current))
> - goto out_nolock;
> -
> - schedule();
> - lockevent_inc(rwsem_sleep_writer);
> - set_current_state(state);
> - /*
> - * If HANDOFF bit is set, unconditionally do
> - * a trylock.
> - */
> - if (wstate == WRITER_HANDOFF)
> - break;
> -
> - if ((wstate == WRITER_NOT_FIRST) &&
> - (rwsem_first_waiter(sem) == &waiter))
> - wstate = WRITER_FIRST;
> -
> - count = atomic_long_read(&sem->count);
> - if (!(count & RWSEM_LOCK_MASK))
> - break;
> -
> - /*
> - * The setting of the handoff bit is deferred
> - * until rwsem_try_write_lock() is called.
> - */
> - if ((wstate == WRITER_FIRST) && (rt_task(current) ||
> - time_after(jiffies, waiter.timeout))) {
> - wstate = WRITER_HANDOFF;
> - lockevent_inc(rwsem_wlock_handoff);
> - break;
> - }
> - }
> + schedule();
> + lockevent_inc(rwsem_sleep_writer);
> + set_current_state(state);
> trylock_again:
> raw_spin_lock_irq(&sem->wait_lock);
> }

Nice.

2021-11-11 20:35:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>
> On 11/11/21 14:20, Peter Zijlstra wrote:
> > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > mutex. Maybe a follow up patch if you think we should change the
> > > terminology.
> > Well, that's exactly the point, they do radically different things.
> > Having the same name for two different things is confusing.
> >
> > Anyway, let me go read that patch you sent.
>
> My understanding of handoff is to disable optimistic spinning to let waiters
> in the wait queue have an opportunity to acquire the lock. There are
> difference in details on how to do that in mutex and rwsem, though.

Ah, but the mutex does an actual hand-off, it hands the lock to a
specific waiting task. That is, unlock() sets owner, as opposed to
trylock().

The rwsem code doesn't, it just forces a phase change. Once a waiter has
been blocked too long, the handoff bit is set, causing new readers to be
blocked. Then we wait for existing readers to complete. At that point,
any next waiter (most likely a writer) should really get the lock (and
in that regard the rwsem code is a bit funny).

So while both ensure fairness, the means of doing so is quite different.
One hands the lock ownership to a specific waiter, the other arranges
for a quiescent state such that the next waiter can proceed.
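
A toy, single-threaded userspace sketch of the two policies (hypothetical
types and names, not the kernel implementations): direct handoff writes the
new owner at unlock time, while a phase change only blocks new entrants until
the existing holders drain and a waiter then takes the lock itself.

#include <stdbool.h>
#include <stdio.h>

struct task { const char *name; };

/* mutex-style direct handoff: unlock() hands ownership to a chosen waiter */
struct handoff_lock {
        struct task *owner;
};

static void handoff_unlock(struct handoff_lock *l, struct task *first_waiter)
{
        /* the waiter owns the lock before it even runs again */
        l->owner = first_waiter;
}

/*
 * rwsem-style phase change: a flag only blocks new readers/lock stealers;
 * the waiter still has to take the lock itself once the holders drain.
 */
struct phase_lock {
        int  readers;
        bool writer;
        bool handoff;           /* "quiesce": no new entrants allowed */
};

static bool phase_try_write_lock(struct phase_lock *l, bool first_waiter)
{
        if (l->handoff && !first_waiter)
                return false;   /* phase change blocks everyone else */
        if (l->readers || l->writer)
                return false;   /* still held: keep waiting */
        l->writer  = true;      /* the waiter acquires it itself */
        l->handoff = false;
        return true;
}

int main(void)
{
        struct task waiter = { "first waiter" };
        struct handoff_lock hl = { .owner = NULL };
        struct phase_lock   pl = { .readers = 1, .writer = false, .handoff = true };

        handoff_unlock(&hl, &waiter);
        printf("direct handoff: owner is now \"%s\"\n", hl.owner->name);

        printf("phase change, reader still active: got lock = %d\n",
               phase_try_write_lock(&pl, true));
        pl.readers = 0;         /* existing readers complete */
        printf("phase change, readers drained:     got lock = %d\n",
               phase_try_write_lock(&pl, true));
        return 0;
}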


2021-11-11 20:39:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 09:35:00PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> >
> > On 11/11/21 14:20, Peter Zijlstra wrote:
> > > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > > mutex. Maybe a follow up patch if you think we should change the
> > > > terminology.
> > > Well, that's exactly the point, they do radically different things.
> > > Having the same name for two different things is confusing.
> > >
> > > Anyway, let me go read that patch you sent.
> >
> > My understanding of handoff is to disable optimistic spinning to let waiters
> > in the wait queue have an opportunity to acquire the lock. There are
> > difference in details on how to do that in mutex and rwsem, though.
>
> Ah, but the mutex does an actual hand-off, it hands the lock to a
> specific waiting task. That is, unlock() sets owner, as opposed to
> trylock().
>
> The rwsem code doesn't, it just forces a phase change. Once a waiter has
> been blocked too long, the handoff bit is set, causing new readers to be
> blocked. Then we wait for existing readers to complete. At that point,
> any next waiter (most likely a writer) should really get the lock (and
> in that regards the rwsem code is a bit funny).

And this is I think the thing you tried in your earlier inherit patch.
Keep the quiescent state and simply let whatever next waiter is in line
have a go.

I suspect that change is easier now. But I've not tried.


2021-11-11 20:45:40

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 15:35, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>> On 11/11/21 14:20, Peter Zijlstra wrote:
>>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>>>> mutex. Maybe a follow up patch if you think we should change the
>>>> terminology.
>>> Well, that's exactly the point, they do radically different things.
>>> Having the same name for two different things is confusing.
>>>
>>> Anyway, let me go read that patch you sent.
>> My understanding of handoff is to disable optimistic spinning to let waiters
>> in the wait queue have an opportunity to acquire the lock. There are
>> difference in details on how to do that in mutex and rwsem, though.
> Ah, but the mutex does an actual hand-off, it hands the lock to a
> specific waiting task. That is, unlock() sets owner, as opposed to
> trylock().
>
> The rwsem code doesn't, it just forces a phase change. Once a waiter has
> been blocked too long, the handoff bit is set, causing new readers to be
> blocked. Then we wait for existing readers to complete. At that point,
> any next waiter (most likely a writer) should really get the lock (and
> in that regards the rwsem code is a bit funny).
>
> So while both ensure fairness, the means of doing so is quite different.
> One hands the lock ownership to a specific waiter, the other arranges
> for a quiescent state such that the next waiter can proceed.

That is a valid argument. However, the name PHASE_CHANGE sounds weird to
me. I am not objecting to changing the term, but would prefer a better
name like NO_OPTSPIN or NO_LOCKSTEALING, something that emphasizes
the fact that optimistic spinning or lock stealing should not be allowed.

Anyway, it will be a follow-up patch.

Cheers,
Longman



2021-11-11 20:50:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


So I suspect that if..

On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> - enum writer_wait_state wstate)
> + struct rwsem_waiter *waiter)
> {
> long count, new;
> + bool first = rwsem_first_waiter(sem) == waiter;
>
> lockdep_assert_held(&sem->wait_lock);
>
> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
> do {
> bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>
> - if (has_handoff && wstate == WRITER_NOT_FIRST)
> + if (has_handoff && !first)
> return false;
>
> new = count;
>
> if (count & RWSEM_LOCK_MASK) {
> - if (has_handoff || (wstate != WRITER_HANDOFF))
> + if (has_handoff || (!waiter->rt_task &&
> + !time_after(jiffies, waiter->timeout)))
> return false;

we delete this whole condition, and..

>
> new |= RWSEM_FLAG_HANDOFF;

> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
> }
> #endif
>
> +/*
> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
> + */
> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
> + struct rwsem_waiter *waiter)
> +{
> + long flags = 0;
> +
> + list_del(&waiter->list);
> + if (list_empty(&sem->wait_list))
> + flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
> + else if (waiter->handoff_set)
> + flags = RWSEM_FLAG_HANDOFF;

take out this else,

> +
> + if (flags)
> + atomic_long_andnot(flags, &sem->count);
> +}

We get the inherit thing for free, no?

Once HANDOFF is set, new readers are blocked. And then allow any first
waiter to acquire the lock, who cares if it was the one responsible for
having set the HANDOFF bit.
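
For reference, a sketch of the helper with that else removed (derived from the
quoted rwsem_out_nolock_clear_flags(); kernel-style pseudo-code, not a tested
patch), so that a departing first waiter always clears HANDOFF, whoever set it:

/*
 * Sketch only: called from both out_nolock paths with sem->wait_lock held.
 * If we were first, any HANDOFF bit is either ours or harmless to drop.
 */
static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
                                                struct rwsem_waiter *waiter)
{
        long flags = 0;

        if (rwsem_first_waiter(sem) == waiter)
                flags |= RWSEM_FLAG_HANDOFF;

        list_del(&waiter->list);

        if (list_empty(&sem->wait_list))
                flags |= RWSEM_FLAG_WAITERS | RWSEM_FLAG_HANDOFF;

        if (flags)
                atomic_long_andnot(flags, &sem->count);
}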

2021-11-11 21:01:24

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 15:26, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>
>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>> if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>> time_after(jiffies, waiter->timeout)) {
>> adjustment -= RWSEM_FLAG_HANDOFF;
>> + waiter->handoff_set = true;
>> lockevent_inc(rwsem_rlock_handoff);
>> }
>>
> Do we really need this flag? Wouldn't it be the same as waiter-is-first
> AND sem-has-handoff ?
That is true. The only downside is that we have to read the count first
in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
should be OK to do that.
>> static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> + struct rwsem_waiter *waiter)
>> {
>> long count, new;
>> + bool first = rwsem_first_waiter(sem) == waiter;
> flip those lines for reverse xmas please
Sure, will do.
>
>>
>> lockdep_assert_held(&sem->wait_lock);
>>
>> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> do {
>> bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>
>> - if (has_handoff && wstate == WRITER_NOT_FIRST)
>> + if (has_handoff && !first)
>> return false;
>>
>> new = count;
>>
>> if (count & RWSEM_LOCK_MASK) {
>> - if (has_handoff || (wstate != WRITER_HANDOFF))
>> + if (has_handoff || (!waiter->rt_task &&
>> + !time_after(jiffies, waiter->timeout)))
>
> Does ->rt_task really help over rt_task(current) ? I suppose there's an
> argument for locality, but that should be pretty much it, no?
Waiting for the timeout may introduce too much latency for an RT task. That
is the only reason I am doing it. I can take it out if you think it is
not necessary.
>
>> return false;
>>
>> new |= RWSEM_FLAG_HANDOFF;
>> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>> }
>> #endif
>>
>> +/*
>> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
>> + */
>> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
>> + struct rwsem_waiter *waiter)
>> +{
>> + long flags = 0;
>> +
>> + list_del(&waiter->list);
>> + if (list_empty(&sem->wait_list))
>> + flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
>> + else if (waiter->handoff_set)
>> + flags = RWSEM_FLAG_HANDOFF;
>> +
>> + if (flags)
>> + atomic_long_andnot(flags, &sem->count);
>> +}
> Right, so I like sharing this between the two _slowpath functions, that
> makes sense.
>
> The difference between this and my approach is that I unconditionally
> clear HANDOFF when @waiter was the first. Because if it was set, it
> must've been ours, and if it wasn't set, clearing it doesn't really hurt
> much. This is an unlikely path, I don't think the potentially extra
> atomic is an issue here.
That is true, we shouldn't worry too much about performance for this
unlikely path. Will make the change.
>
>> +
>> /*
>> * Wait for the read lock to be granted
>> */
>> @@ -936,6 +953,7 @@ rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int stat
>> waiter.task = current;
>> waiter.type = RWSEM_WAITING_FOR_READ;
>> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
>> + waiter.handoff_set = false;
> Forgot to set rt_task

We don't use rt_task for readers. It is writer-only. I will document that.

>
>>
>> raw_spin_lock_irq(&sem->wait_lock);
>> if (list_empty(&sem->wait_list)) {
>> @@ -1038,16 +1051,13 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>> waiter.task = current;
>> waiter.type = RWSEM_WAITING_FOR_WRITE;
>> waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> Forget to set handoff_set
Yes, I was aware of that.
>
>> + waiter.rt_task = rt_task(current);
>>
>> raw_spin_lock_irq(&sem->wait_lock);
> Again, I'm not convinced we need these variables.
I will take out handoff_set as suggested. I can also take out
rt_task if you don't think we need to test it.
>
>> @@ -1083,13 +1093,16 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>> /* wait until we successfully acquire the lock */
>> set_current_state(state);
>> for (;;) {
>> - if (rwsem_try_write_lock(sem, wstate)) {
>> + if (rwsem_try_write_lock(sem, &waiter)) {
>> /* rwsem_try_write_lock() implies ACQUIRE on success */
>> break;
>> }
>>
>> raw_spin_unlock_irq(&sem->wait_lock);
>>
>> + if (signal_pending_state(state, current))
>> + goto out_nolock;
>> +
>> /*
>> * After setting the handoff bit and failing to acquire
>> * the lock, attempt to spin on owner to accelerate lock
>> @@ -1098,7 +1111,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
>> * In this case, we attempt to acquire the lock again
>> * without sleeping.
>> */
>> - if (wstate == WRITER_HANDOFF) {
>> + if (waiter.handoff_set) {
>> enum owner_state owner_state;
>>
>> preempt_disable();
> Does it matter much if we spin-wait for every first or only for handoff?
Only for handoff as no other task will be spinning for the lock.
>
> Either way around, I think spin-wait ought to terminate on sigpending
> (same for mutex I suppose).

I am thinking about that too. Time for another followup patch, I think.

Cheers,
Longman


2021-11-11 21:09:43

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On 11/11/21 15:50, Peter Zijlstra wrote:
> So I suspect that if..
>
> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>> static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> - enum writer_wait_state wstate)
>> + struct rwsem_waiter *waiter)
>> {
>> long count, new;
>> + bool first = rwsem_first_waiter(sem) == waiter;
>>
>> lockdep_assert_held(&sem->wait_lock);
>>
>> @@ -546,13 +541,14 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem,
>> do {
>> bool has_handoff = !!(count & RWSEM_FLAG_HANDOFF);
>>
>> - if (has_handoff && wstate == WRITER_NOT_FIRST)
>> + if (has_handoff && !first)
>> return false;
>>
>> new = count;
>>
>> if (count & RWSEM_LOCK_MASK) {
>> - if (has_handoff || (wstate != WRITER_HANDOFF))
>> + if (has_handoff || (!waiter->rt_task &&
>> + !time_after(jiffies, waiter->timeout)))
>> return false;
> we delete this whole condition, and..
I don't think we can take out this if test.
>
>>
>> new |= RWSEM_FLAG_HANDOFF;
>> @@ -889,6 +888,24 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
>> }
>> #endif
>>
>> +/*
>> + * Common code to handle rwsem flags in out_nolock path with wait_lock held.
>> + */
>> +static inline void rwsem_out_nolock_clear_flags(struct rw_semaphore *sem,
>> + struct rwsem_waiter *waiter)
>> +{
>> + long flags = 0;
>> +
>> + list_del(&waiter->list);
>> + if (list_empty(&sem->wait_list))
>> + flags = RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS;
>> + else if (waiter->handoff_set)
>> + flags = RWSEM_FLAG_HANDOFF;
> take out this else,
>
>> +
>> + if (flags)
>> + atomic_long_andnot(flags, &sem->count);
>> +}
> We get the inherit thing for free, no?
>
> Once HANDOFF is set, new readers are blocked. And then allow any first
> waiter to acquire the lock, who cares if it was the one responsible for
> having set the HANDOFF bit.

Yes, we can have the policy of inheriting the HANDOFF bit as long as it
is consistent, which will be the case here with a common out_nolock
function. I can go with that. I just have to document this fact in the
comment.

Cheers,
Longman


2021-11-11 21:26:09

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 16:01, Waiman Long wrote:
>
> On 11/11/21 15:26, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>
>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>>               if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>                   time_after(jiffies, waiter->timeout)) {
>>>                   adjustment -= RWSEM_FLAG_HANDOFF;
>>> +                waiter->handoff_set = true;
>>>                   lockevent_inc(rwsem_rlock_handoff);
>>>               }
>> Do we really need this flag? Wouldn't it be the same as waiter-is-first
>> AND sem-has-handoff ?
> That is true. The only downside is that we have to read the count
> first in rwsem_out_nolock_clear_flags(). Since this is not a fast
> path, it should be OK to do that.

I just realized that I may still need this flag for the writer to determine
if it should spin after failing to acquire the lock. Otherwise I will have to
do an extra read of the count value in the loop. I don't need to use it for
writer now.

Cheers,
Longman


2021-11-11 21:27:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 03:45:30PM -0500, Waiman Long wrote:
>
> On 11/11/21 15:35, Peter Zijlstra wrote:
> > On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> > > On 11/11/21 14:20, Peter Zijlstra wrote:
> > > > On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
> > > > > As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
> > > > > mutex. Maybe a follow up patch if you think we should change the
> > > > > terminology.
> > > > Well, that's exactly the point, they do radically different things.
> > > > Having the same name for two different things is confusing.
> > > >
> > > > Anyway, let me go read that patch you sent.
> > > My understanding of handoff is to disable optimistic spinning to let waiters
> > > in the wait queue have an opportunity to acquire the lock. There are
> > > difference in details on how to do that in mutex and rwsem, though.
> > Ah, but the mutex does an actual hand-off, it hands the lock to a
> > specific waiting task. That is, unlock() sets owner, as opposed to
> > trylock().
> >
> > The rwsem code doesn't, it just forces a phase change. Once a waiter has
> > been blocked too long, the handoff bit is set, causing new readers to be
> > blocked. Then we wait for existing readers to complete. At that point,
> > any next waiter (most likely a writer) should really get the lock (and
> > in that regards the rwsem code is a bit funny).
> >
> > So while both ensure fairness, the means of doing so is quite different.
> > One hands the lock ownership to a specific waiter, the other arranges
> > for a quiescent state such that the next waiter can proceed.
>
> That is a valid argument. However, the name PHASE_CHANGE sounds weird to me.
> I am not objecting to changing the term, but probably with a better name
> NO_OPTSPIN, NO_LOCKSTEALING or something like that to emphasize that fact
> that optimistic spinning or lock stealing should not be allowed.

RWSEM_FLAG_QUIESCE ?

2021-11-11 21:38:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 04:01:16PM -0500, Waiman Long wrote:

> > > + if (has_handoff || (!waiter->rt_task &&
> > > + !time_after(jiffies, waiter->timeout)))
> >
> > Does ->rt_task really help over rt_task(current) ? I suppose there's an
> > argument for locality, but that should be pretty much it, no?
> Waiting for the timeout may introduce too much latency for RT task. That is
> the only reason I am doing it. I can take it out if you think it is not
> necessary.

I meant simply calling rt_task(waiter->task) here, instead of mucking about
with the extra variable.

2021-11-11 21:46:38

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 16:38, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 04:01:16PM -0500, Waiman Long wrote:
>
>>>> + if (has_handoff || (!waiter->rt_task &&
>>>> + !time_after(jiffies, waiter->timeout)))
>>> Does ->rt_task really help over rt_task(current) ? I suppose there's an
>>> argument for locality, but that should be pretty much it, no?
>> Waiting for the timeout may introduce too much latency for RT task. That is
>> the only reason I am doing it. I can take it out if you think it is not
>> necessary.
> I meant simply calling rt_task(waiter->task) here, instead of mucking about
> with the extra variable.

OK, that makes sense.

Cheers,
Longman


2021-11-11 21:53:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set

On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
>
> On 11/11/21 16:01, Waiman Long wrote:
> >
> > On 11/11/21 15:26, Peter Zijlstra wrote:
> > > On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
> > >
> > > > @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
> > > >               if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
> > > >                   time_after(jiffies, waiter->timeout)) {
> > > >                   adjustment -= RWSEM_FLAG_HANDOFF;
> > > > +                waiter->handoff_set = true;
> > > >                   lockevent_inc(rwsem_rlock_handoff);
> > > >               }
> > > Do we really need this flag? Wouldn't it be the same as waiter-is-first
> > > AND sem-has-handoff ?
> > That is true. The only downside is that we have to read the count first
> > in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
> > should be OK to do that.
>
> I just realize that I may still need this flag for writer to determine if it
> should spin after failing to acquire the lock. Or I will have to do extra
> read of count value in the loop. I don't need to use it for writer now.

Maybe it's too late here, but afaict this is right after failing
try_write_lock(), which will have done at least that load you're
interested in, no?

Simply have try_write_lock() update &count or something.
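
A rough sketch of that idea (hypothetical name and signature, not the posted
patch): have the trylock attempt hand its last observed count back to the
caller, so the write slowpath can test RWSEM_FLAG_HANDOFF afterwards without
another atomic_long_read().

/* Sketch only: a single write-lock attempt that reports the observed count. */
static inline bool rwsem_try_write_lock_once(struct rw_semaphore *sem,
                                             long *countp)
{
        long count = atomic_long_read(&sem->count);

        *countp = count;                        /* caller sees what we saw */
        if (count & RWSEM_LOCK_MASK)
                return false;

        /* on failure, *countp is updated with the value actually observed */
        return atomic_long_try_cmpxchg_acquire(&sem->count, countp,
                                               count | RWSEM_WRITER_LOCKED);
}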

2021-11-11 21:54:25

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 16:27, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 03:45:30PM -0500, Waiman Long wrote:
>> On 11/11/21 15:35, Peter Zijlstra wrote:
>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>> On 11/11/21 14:20, Peter Zijlstra wrote:
>>>>> On Thu, Nov 11, 2021 at 02:14:48PM -0500, Waiman Long wrote:
>>>>>> As for the PHASE_CHANGE name, we have to be consistent in both rwsem and
>>>>>> mutex. Maybe a follow up patch if you think we should change the
>>>>>> terminology.
>>>>> Well, that's exactly the point, they do radically different things.
>>>>> Having the same name for two different things is confusing.
>>>>>
>>>>> Anyway, let me go read that patch you sent.
>>>> My understanding of handoff is to disable optimistic spinning to let waiters
>>>> in the wait queue have an opportunity to acquire the lock. There are
>>>> difference in details on how to do that in mutex and rwsem, though.
>>> Ah, but the mutex does an actual hand-off, it hands the lock to a
>>> specific waiting task. That is, unlock() sets owner, as opposed to
>>> trylock().
>>>
>>> The rwsem code doesn't, it just forces a phase change. Once a waiter has
>>> been blocked too long, the handoff bit is set, causing new readers to be
>>> blocked. Then we wait for existing readers to complete. At that point,
>>> any next waiter (most likely a writer) should really get the lock (and
>>> in that regards the rwsem code is a bit funny).
>>>
>>> So while both ensure fairness, the means of doing so is quite different.
>>> One hands the lock ownership to a specific waiter, the other arranges
>>> for a quiescent state such that the next waiter can proceed.
>> That is a valid argument. However, the name PHASE_CHANGE sounds weird to me.
>> I am not objecting to changing the term, but probably with a better name
>> NO_OPTSPIN, NO_LOCKSTEALING or something like that to emphasize that fact
>> that optimistic spinning or lock stealing should not be allowed.
> RWSEM_FLAG_QUIESCE ?

I think that is a more acceptable term than PHASE_CHANGE. Will have a
follow up patch later on. This one is more urgent and I want to get it
done first.

Cheers,
Longman.


2021-11-11 21:55:23

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 16:53, Peter Zijlstra wrote:
> On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
>> On 11/11/21 16:01, Waiman Long wrote:
>>> On 11/11/21 15:26, Peter Zijlstra wrote:
>>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>>
>>>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>>>>               if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>>>                   time_after(jiffies, waiter->timeout)) {
>>>>>                   adjustment -= RWSEM_FLAG_HANDOFF;
>>>>> +                waiter->handoff_set = true;
>>>>>                   lockevent_inc(rwsem_rlock_handoff);
>>>>>               }
>>>> Do we really need this flag? Wouldn't it be the same as waiter-is-first
>>>> AND sem-has-handoff ?
>>> That is true. The only downside is that we have to read the count first
>>> in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
>>> should be OK to do that.
>> I just realize that I may still need this flag for writer to determine if it
>> should spin after failing to acquire the lock. Or I will have to do extra
>> read of count value in the loop. I don't need to use it for writer now.
> Maybe it's too late here, but afaict this is right after failing
> try_write_lock(), which will have done at least that load you're
> interested in, no?
>
> Simply have try_write_lock() update &count or something.

You are right. I have actually decided to do an extra read on second
thought.

Cheers,
Longman


2021-11-11 22:00:34

by Waiman Long

[permalink] [raw]
Subject: Re: [BUG]locking/rwsem: only clean RWSEM_FLAG_HANDOFF when already set


On 11/11/21 16:55, Waiman Long wrote:
>
> On 11/11/21 16:53, Peter Zijlstra wrote:
>> On Thu, Nov 11, 2021 at 04:25:56PM -0500, Waiman Long wrote:
>>> On 11/11/21 16:01, Waiman Long wrote:
>>>> On 11/11/21 15:26, Peter Zijlstra wrote:
>>>>> On Thu, Nov 11, 2021 at 02:36:52PM -0500, Waiman Long wrote:
>>>>>
>>>>>> @@ -434,6 +430,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>>>>>>                if (!(oldcount & RWSEM_FLAG_HANDOFF) &&
>>>>>>                    time_after(jiffies, waiter->timeout)) {
>>>>>>                    adjustment -= RWSEM_FLAG_HANDOFF;
>>>>>> +                waiter->handoff_set = true;
>>>>>>                    lockevent_inc(rwsem_rlock_handoff);
>>>>>>                }
>>>>> Do we really need this flag? Wouldn't it be the same as
>>>>> waiter-is-first
>>>>> AND sem-has-handoff ?
>>>> That is true. The only downside is that we have to read the count
>>>> first
>>>> in rwsem_out_nolock_clear_flags(). Since this is not a fast path, it
>>>> should be OK to do that.
>>> I just realize that I may still need this flag for writer to
>>> determine if it
>>> should spin after failing to acquire the lock. Or I will have to do
>>> extra
>>> read of count value in the loop. I don't need to use it for writer now.
>> Maybe it's too late here, but afaict this is right after failing
>> try_write_lock(), which will have done at least that load you're
>> interested in, no?
>>
>> Simply have try_write_lock() update &count or something.
>
> You are right. I have actually decided to do an extra read after
> second thought.

Oh, I would like to take it back. The condition to do spinning is when
the handoff bit is set and the waiter is the first one in the queue. It
will be easier to do it with an extra internal state variable.

Cheers,
Longman