2022-06-17 10:00:27

by Eric Dumazet

Subject: [PATCH] locking/rwlocks: do not starve writers

From: Eric Dumazet <[email protected]>

Networking is still using rwlocks, and read_lock() is called
from softirq context, potentially from many cpus.

In this (soft)irq context, rwlock code is unfair to writers
and can cause soft lockups.

We first noticed an issue with epoll after commit a218cc491420
("epoll: use rwlock in order to reduce ep_poll_callback() contention"),
but it is trivial to brick a host using this repro:

for i in {1..48}
do
ping -f -n -q 127.0.0.1 &
sleep 0.1
done

If an unfair version of rwlocks is really needed, we should introduce
a new read_lock_unfair().

[ 673.678717][ C34] watchdog: BUG: soft lockup - CPU#34 stuck for 82s! [ping:17794]
[ 673.700713][ C45] watchdog: BUG: soft lockup - CPU#45 stuck for 82s! [ping:17796]
[ 673.702712][ C46] watchdog: BUG: soft lockup - CPU#46 stuck for 78s! [ping:17802]
[ 673.704712][ C47] watchdog: BUG: soft lockup - CPU#47 stuck for 82s! [ping:17798]
[ 677.636023][ C13] watchdog: BUG: soft lockup - CPU#13 stuck for 82s! [ping:17804]
[ 677.638022][ C14] watchdog: BUG: soft lockup - CPU#14 stuck for 75s! [ping:17825]
[ 677.644021][ C17] watchdog: BUG: soft lockup - CPU#17 stuck for 75s! [ping:17821]
[ 677.650020][ C20] watchdog: BUG: soft lockup - CPU#20 stuck for 82s! [ping:17800]
[ 677.686014][ C38] watchdog: BUG: soft lockup - CPU#38 stuck for 75s! [ping:17819]
[ 681.691318][ C41] watchdog: BUG: soft lockup - CPU#41 stuck for 74s! [ping:17823]
[ 684.657807][ C46] rcu: INFO: rcu_sched self-detected stall on CPU
[ 684.664075][ C46] rcu: 46-....: (1 GPs behind) idle=529/1/0x4000000000000000 softirq=22717/22717 fqs=20200
[ 705.633252][ C14] watchdog: BUG: soft lockup - CPU#14 stuck for 101s! [ping:17825]
[ 706.999058][ T309] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-... 41-... } 88575 jiffies s: 2325 root: 0x5/.
[ 706.999069][ T309] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/. l=1:32-47:0x200/.
[ 709.686574][ C41] watchdog: BUG: soft lockup - CPU#41 stuck for 100s! [ping:17823]
[ 714.457782][ C41] rcu: INFO: rcu_sched self-detected stall on CPU
[ 714.464047][ C41] rcu: 41-....: (1 GPs behind) idle=403/1/0x4000000000000000 softirq=24654/24655 fqs=4653

Fixes: 70af2f8a4f48 ("locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Roman Penyaev <[email protected]>
Cc: Shakeel Butt <[email protected]>
---
kernel/locking/qrwlock.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 2e1600906c9f5cd868415d20e2d7024c5b1e0531..bf64d14f0fc88613363c3c007bca8c0918709123 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
/*
* Readers come here when they cannot get the lock without waiting
*/
- if (unlikely(in_interrupt())) {
- /*
- * Readers in interrupt context will get the lock immediately
- * if the writer is just waiting (not holding the lock yet),
- * so spin with ACQUIRE semantics until the lock is available
- * without waiting in the queue.
- */
- atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
- return;
- }
atomic_sub(_QR_BIAS, &lock->cnts);

trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
--
2.36.1.476.g0c4daa206d-goog


2022-06-17 12:45:15

by Peter Zijlstra

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
> /*
> * Readers come here when they cannot get the lock without waiting
> */
> - if (unlikely(in_interrupt())) {
> - /*
> - * Readers in interrupt context will get the lock immediately
> - * if the writer is just waiting (not holding the lock yet),
> - * so spin with ACQUIRE semantics until the lock is available
> - * without waiting in the queue.
> - */
> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> - return;
> - }
> atomic_sub(_QR_BIAS, &lock->cnts);
>
> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);

This is known to break tasklist_lock.

2022-06-17 15:02:18

by Waiman Long

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On 6/17/22 08:07, Peter Zijlstra wrote:
> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
>> --- a/kernel/locking/qrwlock.c
>> +++ b/kernel/locking/qrwlock.c
>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
>> /*
>> * Readers come here when they cannot get the lock without waiting
>> */
>> - if (unlikely(in_interrupt())) {
>> - /*
>> - * Readers in interrupt context will get the lock immediately
>> - * if the writer is just waiting (not holding the lock yet),
>> - * so spin with ACQUIRE semantics until the lock is available
>> - * without waiting in the queue.
>> - */
>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>> - return;
>> - }
>> atomic_sub(_QR_BIAS, &lock->cnts);
>>
>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
> This is known to break tasklist_lock.
>
We certainly can't break the current usage of tasklist_lock.

I am aware of this problem with the networking code and am thinking about
either relaxing the check to exclude softirq or providing a
read_lock_unfair() variant for networking use. I think tasklist_lock
isn't taken from softirq context, but I may be wrong. Providing a
read_lock_unfair() will require quite a bit of work in the supporting
infrastructure as well.

Cheers,
Longman

2022-06-17 15:04:45

by Waiman Long

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On 6/17/22 10:57, Shakeel Butt wrote:
> On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
>> On 6/17/22 08:07, Peter Zijlstra wrote:
>>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
>>>> --- a/kernel/locking/qrwlock.c
>>>> +++ b/kernel/locking/qrwlock.c
>>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
>>>> /*
>>>> * Readers come here when they cannot get the lock without waiting
>>>> */
>>>> - if (unlikely(in_interrupt())) {
>>>> - /*
>>>> - * Readers in interrupt context will get the lock immediately
>>>> - * if the writer is just waiting (not holding the lock yet),
>>>> - * so spin with ACQUIRE semantics until the lock is available
>>>> - * without waiting in the queue.
>>>> - */
>>>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>>>> - return;
>>>> - }
>>>> atomic_sub(_QR_BIAS, &lock->cnts);
>>>>
>>>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
>>> This is known to break tasklist_lock.
>>>
>> We certainly can't break the current usage of tasklist_lock.
>>
>> I am aware of this problem with networking code and is thinking about
>> either relaxing the check to exclude softirq or provide a
>> read_lock_unfair() variant for networking use.
> read_lock_unfair() for networking use or tasklist_lock use?

I meant to say read_lock_fair(), but it could also be the other way
around. Thanks for spotting that.

Cheers,
Longman

2022-06-17 15:24:33

by Shakeel Butt

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
>
> On 6/17/22 08:07, Peter Zijlstra wrote:
> > On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
> >> --- a/kernel/locking/qrwlock.c
> >> +++ b/kernel/locking/qrwlock.c
> >> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
> >> /*
> >> * Readers come here when they cannot get the lock without waiting
> >> */
> >> - if (unlikely(in_interrupt())) {
> >> - /*
> >> - * Readers in interrupt context will get the lock immediately
> >> - * if the writer is just waiting (not holding the lock yet),
> >> - * so spin with ACQUIRE semantics until the lock is available
> >> - * without waiting in the queue.
> >> - */
> >> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> >> - return;
> >> - }
> >> atomic_sub(_QR_BIAS, &lock->cnts);
> >>
> >> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
> > This is known to break tasklist_lock.
> >
> We certainly can't break the current usage of tasklist_lock.
>
> I am aware of this problem with networking code and is thinking about
> either relaxing the check to exclude softirq or provide a
> read_lock_unfair() variant for networking use.

read_lock_unfair() for networking use or tasklist_lock use?

> I think tasklist_lock
> isn't taken from softirq context, but I may be wrong. Providing a
> read_lock_unfair() will require quite a bit of work in the supporting
> infrastructure as well.
>
> Cheers,
> Longman
>

2022-06-17 15:37:16

by Eric Dumazet

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 5:00 PM Waiman Long <[email protected]> wrote:
>
> On 6/17/22 10:57, Shakeel Butt wrote:
> > On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
> >> On 6/17/22 08:07, Peter Zijlstra wrote:
> >>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
> >>>> --- a/kernel/locking/qrwlock.c
> >>>> +++ b/kernel/locking/qrwlock.c
> >>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
> >>>> /*
> >>>> * Readers come here when they cannot get the lock without waiting
> >>>> */
> >>>> - if (unlikely(in_interrupt())) {
> >>>> - /*
> >>>> - * Readers in interrupt context will get the lock immediately
> >>>> - * if the writer is just waiting (not holding the lock yet),
> >>>> - * so spin with ACQUIRE semantics until the lock is available
> >>>> - * without waiting in the queue.
> >>>> - */
> >>>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> >>>> - return;
> >>>> - }
> >>>> atomic_sub(_QR_BIAS, &lock->cnts);
> >>>>
> >>>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
> >>> This is known to break tasklist_lock.
> >>>
> >> We certainly can't break the current usage of tasklist_lock.
> >>
> >> I am aware of this problem with networking code and is thinking about
> >> either relaxing the check to exclude softirq or provide a
> >> read_lock_unfair() variant for networking use.
> > read_lock_unfair() for networking use or tasklist_lock use?
>
> I mean to say read_lock_fair(), but it could also be the other way
> around. Thanks for spotting that.
>

If only tasklist_lock is problematic and needs the unfair variant,
then changing a few read_lock() for tasklist_lock will be less
invasive than ~1000 read_lock() elsewhere....

2022-06-17 16:07:06

by Peter Zijlstra

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 05:24:58PM +0200, Eric Dumazet wrote:

> If only tasklist_lock is problematic and needs the unfair variant,
> then changing a few read_lock() for tasklist_lock will be less
> invasive than ~1000 read_lock() elsewhere....

This is unknown. tasklist_lock was the obvious one; there might be more.
Lockdep should be able to tell you.

2022-06-17 18:03:33

by Waiman Long

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On 6/17/22 11:24, Eric Dumazet wrote:
> On Fri, Jun 17, 2022 at 5:00 PM Waiman Long <[email protected]> wrote:
>> On 6/17/22 10:57, Shakeel Butt wrote:
>>> On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
>>>> On 6/17/22 08:07, Peter Zijlstra wrote:
>>>>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
>>>>>> --- a/kernel/locking/qrwlock.c
>>>>>> +++ b/kernel/locking/qrwlock.c
>>>>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
>>>>>> /*
>>>>>> * Readers come here when they cannot get the lock without waiting
>>>>>> */
>>>>>> - if (unlikely(in_interrupt())) {
>>>>>> - /*
>>>>>> - * Readers in interrupt context will get the lock immediately
>>>>>> - * if the writer is just waiting (not holding the lock yet),
>>>>>> - * so spin with ACQUIRE semantics until the lock is available
>>>>>> - * without waiting in the queue.
>>>>>> - */
>>>>>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>>>>>> - return;
>>>>>> - }
>>>>>> atomic_sub(_QR_BIAS, &lock->cnts);
>>>>>>
>>>>>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
>>>>> This is known to break tasklist_lock.
>>>>>
>>>> We certainly can't break the current usage of tasklist_lock.
>>>>
>>>> I am aware of this problem with networking code and is thinking about
>>>> either relaxing the check to exclude softirq or provide a
>>>> read_lock_unfair() variant for networking use.
>>> read_lock_unfair() for networking use or tasklist_lock use?
>> I mean to say read_lock_fair(), but it could also be the other way
>> around. Thanks for spotting that.
>>
> If only tasklist_lock is problematic and needs the unfair variant,
> then changing a few read_lock() for tasklist_lock will be less
> invasive than ~1000 read_lock() elsewhere....

On second thought, I think the right way is to introduce a fair
variant, if needed. If an arch isn't using qrwlock, the native rwlock
implementation will be unfair; in that sense, unfair rwlock is the
default. We would only need to change the relevant network read_lock()
calls to use the fair variant, which will still be unfair if qrwlock
isn't used. We are not going to touch other read_lock() calls that
don't care about fairness.

Cheers,
Longman

2022-06-17 18:14:16

by Eric Dumazet

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 7:42 PM Waiman Long <[email protected]> wrote:
>
> On 6/17/22 11:24, Eric Dumazet wrote:
> > On Fri, Jun 17, 2022 at 5:00 PM Waiman Long <[email protected]> wrote:
> >> On 6/17/22 10:57, Shakeel Butt wrote:
> >>> On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
> >>>> On 6/17/22 08:07, Peter Zijlstra wrote:
> >>>>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
> >>>>>> --- a/kernel/locking/qrwlock.c
> >>>>>> +++ b/kernel/locking/qrwlock.c
> >>>>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
> >>>>>> /*
> >>>>>> * Readers come here when they cannot get the lock without waiting
> >>>>>> */
> >>>>>> - if (unlikely(in_interrupt())) {
> >>>>>> - /*
> >>>>>> - * Readers in interrupt context will get the lock immediately
> >>>>>> - * if the writer is just waiting (not holding the lock yet),
> >>>>>> - * so spin with ACQUIRE semantics until the lock is available
> >>>>>> - * without waiting in the queue.
> >>>>>> - */
> >>>>>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> >>>>>> - return;
> >>>>>> - }
> >>>>>> atomic_sub(_QR_BIAS, &lock->cnts);
> >>>>>>
> >>>>>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
> >>>>> This is known to break tasklist_lock.
> >>>>>
> >>>> We certainly can't break the current usage of tasklist_lock.
> >>>>
> >>>> I am aware of this problem with networking code and is thinking about
> >>>> either relaxing the check to exclude softirq or provide a
> >>>> read_lock_unfair() variant for networking use.
> >>> read_lock_unfair() for networking use or tasklist_lock use?
> >> I mean to say read_lock_fair(), but it could also be the other way
> >> around. Thanks for spotting that.
> >>
> > If only tasklist_lock is problematic and needs the unfair variant,
> > then changing a few read_lock() for tasklist_lock will be less
> > invasive than ~1000 read_lock() elsewhere....
>
> After a second thought, I think the right way is to introduce a fair
> variant, if needed. If an arch isn't using qrwlock, the native rwlock
> implementation will be unfair. In that sense, unfair rwlock is the
> default. We will only need to change the relevant network read_lock()
> calls to use the fair variant which will still be unfair if qrwlock
> isn't used. We are not going to touch other read_lock call that don't
> care about fair or unfair.
>

Hmm... backporting this kind of invasive change to stable kernels will
be a daunting task.

Were rwlocks always unfair, and we have been lucky?

2022-06-17 19:06:25

by Waiman Long

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On 6/17/22 13:45, Eric Dumazet wrote:
> On Fri, Jun 17, 2022 at 7:42 PM Waiman Long <[email protected]> wrote:
>> On 6/17/22 11:24, Eric Dumazet wrote:
>>> On Fri, Jun 17, 2022 at 5:00 PM Waiman Long <[email protected]> wrote:
>>>> On 6/17/22 10:57, Shakeel Butt wrote:
>>>>> On Fri, Jun 17, 2022 at 7:43 AM Waiman Long <[email protected]> wrote:
>>>>>> On 6/17/22 08:07, Peter Zijlstra wrote:
>>>>>>> On Fri, Jun 17, 2022 at 02:10:39AM -0700, Eric Dumazet wrote:
>>>>>>>> --- a/kernel/locking/qrwlock.c
>>>>>>>> +++ b/kernel/locking/qrwlock.c
>>>>>>>> @@ -23,16 +23,6 @@ void queued_read_lock_slowpath(struct qrwlock *lock)
>>>>>>>> /*
>>>>>>>> * Readers come here when they cannot get the lock without waiting
>>>>>>>> */
>>>>>>>> - if (unlikely(in_interrupt())) {
>>>>>>>> - /*
>>>>>>>> - * Readers in interrupt context will get the lock immediately
>>>>>>>> - * if the writer is just waiting (not holding the lock yet),
>>>>>>>> - * so spin with ACQUIRE semantics until the lock is available
>>>>>>>> - * without waiting in the queue.
>>>>>>>> - */
>>>>>>>> - atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>>>>>>>> - return;
>>>>>>>> - }
>>>>>>>> atomic_sub(_QR_BIAS, &lock->cnts);
>>>>>>>>
>>>>>>>> trace_contention_begin(lock, LCB_F_SPIN | LCB_F_READ);
>>>>>>> This is known to break tasklist_lock.
>>>>>>>
>>>>>> We certainly can't break the current usage of tasklist_lock.
>>>>>>
>>>>>> I am aware of this problem with networking code and is thinking about
>>>>>> either relaxing the check to exclude softirq or provide a
>>>>>> read_lock_unfair() variant for networking use.
>>>>> read_lock_unfair() for networking use or tasklist_lock use?
>>>> I mean to say read_lock_fair(), but it could also be the other way
>>>> around. Thanks for spotting that.
>>>>
>>> If only tasklist_lock is problematic and needs the unfair variant,
>>> then changing a few read_lock() for tasklist_lock will be less
>>> invasive than ~1000 read_lock() elsewhere....
>> After a second thought, I think the right way is to introduce a fair
>> variant, if needed. If an arch isn't using qrwlock, the native rwlock
>> implementation will be unfair. In that sense, unfair rwlock is the
>> default. We will only need to change the relevant network read_lock()
>> calls to use the fair variant which will still be unfair if qrwlock
>> isn't used. We are not going to touch other read_lock call that don't
>> care about fair or unfair.
>>
> Hmm... backporting this kind of invasive change to stable kernels will
> be a daunting task.
>
> Were rwlocks always unfair, and we have been lucky ?
>
Yes, rwlocks were always unfair, and they always had this kind of soft
lockup problem and a scalability problem because of cacheline bouncing.
That was the reason for creating qrwlock, which can at least provide a
fair rwlock in task context. Now we have systems with more and more
cpus, and that is why you are seeing it all over again with the
networking code.

Cheers,
Longman

2022-06-17 19:06:50

by Peter Zijlstra

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 07:45:14PM +0200, Eric Dumazet wrote:

> Were rwlocks always unfair, and we have been lucky ?

Yes.

2022-06-17 19:17:29

by Linus Torvalds

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 12:45 PM Eric Dumazet <[email protected]> wrote:
>
> Were rwlocks always unfair, and we have been lucky ?

Yes, rwlocks have always been unfair.

And the unfairness is very fundamental: it is what allows you to take
a read-lock without disabling interrupts (or softirqs), and still
allowing interrupts to *also* take the lock for reading.

That's actually a big deal for some things, with the tasklist lock
being the traditional thing, but there could be others.

Because with a fair lock, you'd have deadlocks when CPU1 gets the lock
for reading, then another CPU blocks on writing (with interrupts
disabled), and then CPU1 takes an interrupt and wants to read again.

This is not going to change.

If you want fair rwlocks, you have a couple of options:

- add a new explicitly fair version.

- don't use rwlocks from irq or softirq context

- use another lock entirely (spinlocks are fair, and often perform better)

You do *not* get to change behavior that has been there since day#1
and that very core code very much depends on.

In fact, the fact that you use a read_lock from softirq context makes
me suspect that you yourself might be in danger of deadlocking due to
that "look, another CPU is trying to get the write lock" deadlock
situation.

Because the only way to fix that deadlock is

(a) unfair rwlocks like we have now

(b) having to disable interrupts around all read-locked regions

and (b) often basically means that the whole point of using a rwlock
goes away, because it is now much more expensive.

Linus

2022-06-17 19:29:06

by Linus Torvalds

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 2:10 PM Eric Dumazet <[email protected]> wrote:
>
> So I wonder why we replaced eventpoll spinlock with an rwlock.

Yeah, usually we've actually gone the other way.

Spinning rwlocks are seldom a big win, unless you can get some
secondary indirect win out of them.

That secondary win is often:

(a) unfairness is usually very good for throughput (iow, the very
unfairness that you hit may *be* the reason why it looked good in some
benchmark, and people decided "ok, let's do this").

(b) the special case of "interrupts take the lock for reading only"
thing that allows other readers to not disable interrupts

IOW, the win of a spinning rwlock is not necessarily the "we allow
multiple concurrent readers" that you'd expect, because if you have
small sections of code you protect, that just isn't a big deal, and
the costs are in the lock bouncing etc.

It's also worth pointing out that rwlocks are only unfair *if* they
hit that "reader from (soft)interrupt" case. Which means that such
cases *really* had better either have very very short locked regions
(with interrupts disabled), or they really need that (b) part above.

And yes, the tasklist lock really needs the (b) part above. Disabling
interrupts for task traversal would be completely and entirely
unacceptable, because the traversal can actually be fairly expensive
(lots and lots of threads).

I suspect eventpoll just did the wrong thing.

Linus

2022-06-17 19:35:11

by Eric Dumazet

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 9:04 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 07:45:14PM +0200, Eric Dumazet wrote:
>
> > Were rwlocks always unfair, and we have been lucky ?
>
> Yes.

So I wonder why we replaced eventpoll spinlock with an rwlock.

Writing a repro is probably more difficult for eventpoll though.

We definitely had production issues after commit a218cc491420.

2022-06-17 19:37:21

by Shakeel Butt

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 12:19 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 2:10 PM Eric Dumazet <[email protected]> wrote:
> >
> > So I wonder why we replaced eventpoll spinlock with an rwlock.
>
> Yeah, usually we've actually gone the other way.
>
> Spinning rwlocks are seldom a big win, unless you can get some
> secondary indirect win out of them.
>
> That secondary win is often:
>
> (a) unfairness is usually very good for throughput (iow, the very
> unfairness that you hit may *be* the reason why it looked good in some
> benchmark, and people decided "ok, let's do this").
>
> (b) the special case of "interrupts take the lock for reading only"
> thing that allows other readers to not disable interrupts
>
> IOW, the win of a spinning rwlock is not necessarily the "we allow
> multiple concurrent readers" that you'd expect, because if you have
> small sections of code you protect, that just isn't a big deal, and
> the costs are in the lock bouncing etc.
>
> It's also worth pointing out that rwlocks are only unfair *if* they
> hit that "reader from (soft)interrupt" case. Which means that such
> cases *really* had better either have very very short locked regions
> (with interrupts disabled), or they really need that (b) part above.
>
> And yes, the tasklist lock really needs the (b) part above. Disabling
> interrupts for task traversal would be completely and entirely
> unacceptable, because the traversal can actually be fairly expensive
> (lots and lots of threads).
>
> I suspect eventpoll just did the wrong thing.
>

In addition the commit log of commit a218cc491420 has this comment:
"(I assume that write side of a rwlock does not starve, it seems
qrwlock implementation has these guarantees)."

Since this assumption seems incorrect, is there any objection in
reverting the commit a218cc491420? Or do we need more
evidence/arguments/repro?

2022-06-17 19:48:43

by Linus Torvalds

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 2:25 PM Eric Dumazet <[email protected]> wrote:
>
> Interesting...
>
> I think getrusage(RUSAGE_SELF) is blocking interrupts in the
> possible long loop:

Yeah, that looks bad.

It needs that interrupt disable due to sighand->siglock, but normally
we would expect to *not* have a big loop inside the siglock.

Nasty.

I wonder if this is possibly a situation where we should actually make
siglock be a rwlock.

But considering that this RUSAGE_SELF is hopefully a special case,
maybe we could write it differently.

Instead of taking the sighand lock, we might be able to iterate just
over the regular thread list (using the tasklist lock), and then do
the "does sighand match" as a one-off check in
accumulate_thread_rusage().

It's not like we even really need that strict locking there, I suspect.

Anyway, I should have noted in my previous email that my "rwlock is
often not the win you'd think it is" comment is only true for this
*spinning* rwlock.

For the actual sleeping reader-writer lock (down_read/down_write and
friends), the whole "you can have multiple readers" is often a *huge*
deal and very central to using a rwlock. It's literally just the
spinning one that is often better as a spinlock unless you have those
magical reasons to use it.

Linus

2022-06-17 19:51:22

by Eric Dumazet

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 9:19 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 2:10 PM Eric Dumazet <[email protected]> wrote:
> >
> > So I wonder why we replaced eventpoll spinlock with an rwlock.
>
> Yeah, usually we've actually gone the other way.
>
> Spinning rwlocks are seldom a big win, unless you can get some
> secondary indirect win out of them.
>
> That secondary win is often:
>
> (a) unfairness is usually very good for throughput (iow, the very
> unfairness that you hit may *be* the reason why it looked good in some
> benchmark, and people decided "ok, let's do this").
>
> (b) the special case of "interrupts take the lock for reading only"
> thing that allows other readers to not disable interrupts
>
> IOW, the win of a spinning rwlock is not necessarily the "we allow
> multiple concurrent readers" that you'd expect, because if you have
> small sections of code you protect, that just isn't a big deal, and
> the costs are in the lock bouncing etc.
>
> It's also worth pointing out that rwlocks are only unfair *if* they
> hit that "reader from (soft)interrupt" case. Which means that such
> cases *really* had better either have very very short locked regions
> (with interrupts disabled), or they really need that (b) part above.
>
> And yes, the tasklist lock really needs the (b) part above. Disabling
> interrupts for task traversal would be completely and entirely
> unacceptable, because the traversal can actually be fairly expensive
> (lots and lots of threads).

Interesting...

I think getrusage(RUSAGE_SELF) is blocking interrupts across a
possibly long loop:

do {
        accumulate_thread_rusage(t, r);
} while_each_thread(p, t);



>
> I suspect eventpoll just did the wrong thing.
>
> Linus

2022-06-17 20:11:46

by Linus Torvalds

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 2:39 PM Eric Dumazet <[email protected]> wrote:
>
> I am converting RAW sockets to RCU.

RCU is usually absolutely the best approach. Use regular spinlocks for
writers, and RCU for readers.

I'd love to see the tasklist_lock be converted to RCU too. But that
locks predates RCU (and probably 99% of all kernel code), and it's
messy, so nobody sane has ever willingly tried to do that afaik.

Linus

2022-06-17 20:19:19

by Eric Dumazet

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 9:35 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Jun 17, 2022 at 2:25 PM Eric Dumazet <[email protected]> wrote:
> >
> > Interesting...
> >
> > I think getrusage(RUSAGE_SELF) is blocking interrupts in the
> > possible long loop:
>
> Yeah, that looks bad.
>
> It needs that interrupt disable due to sighand->siglock, but normally
> we would expect to *not* have a big loop inside the siglock.
>
> Nasty.
>
> I wonder if this is possibly a situation where we should actually make
> siglock be a rwlock.
>
> But considering that this RUSAGE_SELF is hopefully a special case,
> maybe we could write it differently.
>
> Instead of taking the sighand lock, we might be able to iterate just
> over the regular thread list (using the tasklist lock), and then do
> the "does sighand match" as a one-off check in
> accumulate_thread_rusage().
>
> It's not like we even really need that strict locking there, I suspect.
>
> Anyway, I should have noted in my previous email that my "rwlock is
> often not the win you'd think it is" that that is only true for this
> *spinning* rwlock.
>
> For the actual sleeping reader-writer lock (down_read/down_write and
> friends), the whole "you can have multiple readers" is often a *huge*
> deal and very central to using a rwlock. It's literally just the
> spinning one that is often better as a spinlock unless you have those
> magical reasons to use it.
>

I am converting RAW sockets to RCU.

We will likely need to use RCU in place of rwlock in most networking code.

2022-06-17 22:31:25

by David Laight

Subject: RE: [PATCH] locking/rwlocks: do not starve writers

From: Eric Dumazet
> Sent: 17 June 2022 20:39
...
> I am converting RAW sockets to RCU.

I presume that is because you are shoving a lot of packets
through them?

Will that remove the horrid delayed free of the routing
table entry that gets created for IPv4 with 'hdrinc'
because the address in the packet might not match that
in the address buffer?

I've seen the softint code spend ages doing the final frees.

We have good reasons for using raw sockets to send a lot of UDP.
But 'hdrinc' has to be used to get the UDP checksum right.

David


2022-06-20 08:05:12

by Peter Zijlstra

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

On Fri, Jun 17, 2022 at 02:48:08PM -0500, Linus Torvalds wrote:

> I'd love to see the tasklist_lock be converted to RCU too. But that
> locks predates RCU (and probably 99% of all kernel code), and it's
> messy, so nobody sane has ever willingly tried to do that afaik.

Thomas actually tried a few years ago (for RT, which also hates on RW
type locks). At the time he converted most users to RCU (since the
tasklist itself is also RCU protected), but there's a bunch of users
(more than you'd like) that really need to be exclusive vs fork.

IIRC the most prominent problem is that RCU iteration can miss incoming
tasks and they'll miss getting updated. But like said, it's a few years
ago so I'm a bit hazy on the details.

2022-06-21 17:01:54

by Eric W. Biederman

Subject: Re: [PATCH] locking/rwlocks: do not starve writers

Linus Torvalds <[email protected]> writes:

> On Fri, Jun 17, 2022 at 2:39 PM Eric Dumazet <[email protected]> wrote:
>>
>> I am converting RAW sockets to RCU.
>
> RCU is usually absolutely the best approach. Use regular spinlocks for
> writers, and RCU for readers.
>
> I'd love to see the tasklist_lock be converted to RCU too. But that
> locks predates RCU (and probably 99% of all kernel code), and it's
> messy, so nobody sane has ever willingly tried to do that afaik.

Well sort of. I converted proc many many years ago.

Like Peter mentioned, the big obvious challenge for converting
signal delivery to something else is the atomic delivery aspect.

I am playing with it, and I think I see how to convert signal delivery.
Something like a quick grab of a lock that updates struct pid and creates
a list of signals that are pending to be delivered. Plus code that forces
clone to deliver the pending signal before clone creates a new task.

Plus something like a generation counter so I can see when pulling the
signal in clone if the signal has already been delivered.

I think tasks exiting before getting a signal is ok, and does not need
any code.


I have some patches that are almost working that can use siglock to
protect the parent/child/ptrace relationship for SIGCHLD processing.
Which will remove the pressure on tasklist_lock when I get them sorted.


Not that any of this will kill tasklist_lock, but with a little luck we
can get to short deterministic hold times.

Eric