2024-05-12 07:21:55

by Tetsuo Handa

Subject: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

If a BPF program is attached to the kfree() event, calling kfree()
with psock->link_lock held triggers a lockdep warning.

Defer the kfree() using RCU so that the attached BPF program runs
without holding psock->link_lock.

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172
Tested-by: [email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=a4ed4041b9bea8177ac3
Tested-by: [email protected]
Signed-off-by: Tetsuo Handa <[email protected]>
---
include/linux/skmsg.h | 7 +++++--
net/core/skmsg.c | 2 ++
net/core/sock_map.c | 2 ++
3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index a509caf823d6..66590f20b777 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -66,7 +66,10 @@ enum sk_psock_state_bits {
};

struct sk_psock_link {
- struct list_head list;
+ union {
+ struct list_head list;
+ struct rcu_head rcu;
+ };
struct bpf_map *map;
void *link_raw;
};
@@ -418,7 +421,7 @@ static inline struct sk_psock_link *sk_psock_init_link(void)

static inline void sk_psock_free_link(struct sk_psock_link *link)
{
- kfree(link);
+ kfree_rcu(link, rcu);
}

struct sk_psock_link *sk_psock_link_pop(struct sk_psock *psock);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index fd20aae30be2..9cebfeecd3c9 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -791,10 +791,12 @@ static void sk_psock_link_destroy(struct sk_psock *psock)
{
struct sk_psock_link *link, *tmp;

+ rcu_read_lock();
list_for_each_entry_safe(link, tmp, &psock->link, list) {
list_del(&link->list);
sk_psock_free_link(link);
}
+ rcu_read_unlock();
}

void sk_psock_stop(struct sk_psock *psock)
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 8598466a3805..8bec4b7a8ec7 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
bool strp_stop = false, verdict_stop = false;
struct sk_psock_link *link, *tmp;

+ rcu_read_lock();
spin_lock_bh(&psock->link_lock);
list_for_each_entry_safe(link, tmp, &psock->link, list) {
if (link->link_raw == link_raw) {
@@ -159,6 +160,7 @@ static void sock_map_del_link(struct sock *sk,
}
}
spin_unlock_bh(&psock->link_lock);
+ rcu_read_unlock();
if (strp_stop || verdict_stop) {
write_lock_bh(&sk->sk_callback_lock);
if (strp_stop)
--
2.34.1


2024-05-21 15:39:19

by Alexei Starovoitov

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa
<[email protected]> wrote:
>
> If a BPF program is attached to kfree() event, calling kfree()
> with psock->link_lock held triggers lockdep warning.
>
> Defer kfree() using RCU so that the attached BPF program runs
> without holding psock->link_lock.
>
> Reported-by: [email protected]
> Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172
> Tested-by: [email protected]
> Reported-by: [email protected]
> Closes: https://syzkaller.appspot.com/bug?extid=a4ed4041b9bea8177ac3
> Tested-by: [email protected]
> Signed-off-by: Tetsuo Handa <[email protected]>
> ---
> include/linux/skmsg.h | 7 +++++--
> net/core/skmsg.c | 2 ++
> net/core/sock_map.c | 2 ++
> 3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index a509caf823d6..66590f20b777 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -66,7 +66,10 @@ enum sk_psock_state_bits {
> };
>
> struct sk_psock_link {
> - struct list_head list;
> + union {
> + struct list_head list;
> + struct rcu_head rcu;
> + };
> struct bpf_map *map;
> void *link_raw;
> };
> @@ -418,7 +421,7 @@ static inline struct sk_psock_link *sk_psock_init_link(void)
>
> static inline void sk_psock_free_link(struct sk_psock_link *link)
> {
> - kfree(link);
> + kfree_rcu(link, rcu);
> }
>
> struct sk_psock_link *sk_psock_link_pop(struct sk_psock *psock);
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index fd20aae30be2..9cebfeecd3c9 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -791,10 +791,12 @@ static void sk_psock_link_destroy(struct sk_psock *psock)
> {
> struct sk_psock_link *link, *tmp;
>
> + rcu_read_lock();
> list_for_each_entry_safe(link, tmp, &psock->link, list) {
> list_del(&link->list);
> sk_psock_free_link(link);
> }
> + rcu_read_unlock();
> }
>
> void sk_psock_stop(struct sk_psock *psock)
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 8598466a3805..8bec4b7a8ec7 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
> bool strp_stop = false, verdict_stop = false;
> struct sk_psock_link *link, *tmp;
>
> + rcu_read_lock();
> spin_lock_bh(&psock->link_lock);

I think this is incorrect.
spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.

pw-bot: cr

> list_for_each_entry_safe(link, tmp, &psock->link, list) {
> if (link->link_raw == link_raw) {
> @@ -159,6 +160,7 @@ static void sock_map_del_link(struct sock *sk,
> }
> }
> spin_unlock_bh(&psock->link_lock);
> + rcu_read_unlock();
> if (strp_stop || verdict_stop) {
> write_lock_bh(&sk->sk_callback_lock);
> if (strp_stop)
> --
> 2.34.1
>

2024-05-21 22:59:56

by Hillf Danton

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
> > bool strp_stop = false, verdict_stop = false;
> > struct sk_psock_link *link, *tmp;
> >
> > + rcu_read_lock();
> > spin_lock_bh(&psock->link_lock);
>
> I think this is incorrect.
> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.

Could you specify why it won't be safe in rcu cs if you are right?
What does rcu look like in RT if not nothing?
>
> pw-bot: cr

2024-05-22 09:51:03

by Jakub Sitnicki

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
> On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
>> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
>> > --- a/net/core/sock_map.c
>> > +++ b/net/core/sock_map.c
>> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
>> > bool strp_stop = false, verdict_stop = false;
>> > struct sk_psock_link *link, *tmp;
>> >
>> > + rcu_read_lock();
>> > spin_lock_bh(&psock->link_lock);
>>
>> I think this is incorrect.
>> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
>
> Could you specify why it won't be safe in rcu cs if you are right?
> What does rcu look like in RT if not nothing?

RCU readers can't block, while spinlock RT doesn't disable preemption.

https://docs.kernel.org/RCU/rcu.html
https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt

I've finally gotten around to testing the proposed fix that disallows
map_delete_elem on sockmap/sockhash from BPF tracing progs
entirely. This should put an end to this saga of syzkaller reports.

https://lore.kernel.org/all/[email protected]/

2024-05-22 10:31:42

by Tetsuo Handa

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On 2024/05/22 18:50, Jakub Sitnicki wrote:
> On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
>> On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
>>> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
>>>> bool strp_stop = false, verdict_stop = false;
>>>> struct sk_psock_link *link, *tmp;
>>>>
>>>> + rcu_read_lock();
>>>> spin_lock_bh(&psock->link_lock);
>>>
>>> I think this is incorrect.
>>> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
>>
>> Could you specify why it won't be safe in rcu cs if you are right?
>> What does rcu look like in RT if not nothing?
>
> RCU readers can't block, while spinlock RT doesn't disable preemption.
>
> https://docs.kernel.org/RCU/rcu.html
> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
>

I didn't catch what you mean.

https://elixir.bootlin.com/linux/latest/source/include/linux/spinlock_rt.h#L43 defines spin_lock() for RT as

static __always_inline void spin_lock(spinlock_t *lock)
{
rt_spin_lock(lock);
}

and https://elixir.bootlin.com/linux/v6.9/source/include/linux/spinlock_rt.h#L85 defines spin_lock_bh() for RT as

static __always_inline void spin_lock_bh(spinlock_t *lock)
{
/* Investigate: Drop bh when blocking ? */
local_bh_disable();
rt_spin_lock(lock);
}

and https://elixir.bootlin.com/linux/latest/source/kernel/locking/spinlock_rt.c#L54 defines rt_spin_lock() for RT as

void __sched rt_spin_lock(spinlock_t *lock)
{
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
__rt_spin_lock(lock);
}

and https://elixir.bootlin.com/linux/v6.9/source/kernel/locking/spinlock_rt.c#L46 defines __rt_spin_lock() for RT as

static __always_inline void __rt_spin_lock(spinlock_t *lock)
{
rtlock_might_resched();
rtlock_lock(&lock->lock);
rcu_read_lock();
migrate_disable();
}

You can see that calling spin_lock() or spin_lock_bh() automatically starts an RCU critical section, can't you?

If spin_lock_bh() on RT might sleep, and calling spin_lock_bh() inside an RCU critical section is not safe,
how can

spin_lock(&lock1);
spin_lock(&lock2);
// do something
spin_unlock(&lock2);
spin_unlock(&lock1);

or

spin_lock_bh(&lock1);
spin_lock(&lock2);
// do something
spin_unlock(&lock2);
spin_unlock_bh(&lock1);

be possible?

Unless rcu_read_lock() is implemented in a way that is safe to do

rcu_read_lock();
spin_lock(&lock2);
// do something
spin_unlock(&lock2);
rcu_read_unlock();

and

rcu_read_lock();
spin_lock_bh(&lock2);
// do something
spin_unlock_bh(&lock2);
rcu_read_unlock();

, I think RT kernels can't run safely.

The ordering rules for locking primitives are too complicated and too scattered.
We need documentation with examples of safe and unsafe orderings.


2024-05-22 11:08:35

by Jakub Sitnicki

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, May 22, 2024 at 07:30 PM +09, Tetsuo Handa wrote:
> On 2024/05/22 18:50, Jakub Sitnicki wrote:
>> On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
>>> On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
>>>> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
>>>>> --- a/net/core/sock_map.c
>>>>> +++ b/net/core/sock_map.c
>>>>> @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
>>>>> bool strp_stop = false, verdict_stop = false;
>>>>> struct sk_psock_link *link, *tmp;
>>>>>
>>>>> + rcu_read_lock();
>>>>> spin_lock_bh(&psock->link_lock);
>>>>
>>>> I think this is incorrect.
>>>> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
>>>
>>> Could you specify why it won't be safe in rcu cs if you are right?
>>> What does rcu look like in RT if not nothing?
>>
>> RCU readers can't block, while spinlock RT doesn't disable preemption.
>>
>> https://docs.kernel.org/RCU/rcu.html
>> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
>>
>
> I didn't catch what you mean.
>
> https://elixir.bootlin.com/linux/latest/source/include/linux/spinlock_rt.h#L43 defines spin_lock() for RT as
>
> static __always_inline void spin_lock(spinlock_t *lock)
> {
> rt_spin_lock(lock);
> }
>
> and https://elixir.bootlin.com/linux/v6.9/source/include/linux/spinlock_rt.h#L85 defines spin_lock_bh() for RT as
>
> static __always_inline void spin_lock_bh(spinlock_t *lock)
> {
> /* Investigate: Drop bh when blocking ? */
> local_bh_disable();
> rt_spin_lock(lock);
> }
>
> and https://elixir.bootlin.com/linux/latest/source/kernel/locking/spinlock_rt.c#L54 defines rt_spin_lock() for RT as
>
> void __sched rt_spin_lock(spinlock_t *lock)
> {
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> __rt_spin_lock(lock);
> }
>
> and https://elixir.bootlin.com/linux/v6.9/source/kernel/locking/spinlock_rt.c#L46 defines __rt_spin_lock() for RT as
>
> static __always_inline void __rt_spin_lock(spinlock_t *lock)
> {
> rtlock_might_resched();
> rtlock_lock(&lock->lock);
> rcu_read_lock();
> migrate_disable();
> }
>
> . You can see that calling spin_lock() or spin_lock_bh() automatically starts RCU critical section, can't you?
>
> If spin_lock_bh() for RT might sleep and calling spin_lock_bh() under RCU critical section is not safe,
> how can
>
> spin_lock(&lock1);
> spin_lock(&lock2);
> // do something
> spin_unlock(&lock2);
> spin_unlock(&lock1);
>
> or
>
> spin_lock_bh(&lock1);
> spin_lock(&lock2);
> // do something
> spin_unlock(&lock2);
> spin_unlock_bh(&lock1);
>
> be possible?
>
> Unless rcu_read_lock() is implemented in a way that is safe to do
>
> rcu_read_lock();
> spin_lock(&lock2);
> // do something
> spin_unlock(&lock2);
> rcu_read_unlock();
>
> and
>
> rcu_read_lock();
> spin_lock_bh(&lock2);
> // do something
> spin_unlock_bh(&lock2);
> rcu_read_unlock();
>
> , I think RT kernels can't run safely.
>
> Locking primitive ordering is too much complicated/distributed.
> We need documentation using safe/unsafe ordering examples.

You're right. My answer was too hasty. Docs say that RT kernels can
preempt RCU read-side critical sections:

https://docs.kernel.org/RCU/whatisRCU.html?highlight=rcu_read_lock#rcu-read-lock


2024-05-22 11:34:49

by Hillf Danton

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, 22 May 2024 11:50:49 +0200 Jakub Sitnicki <[email protected]>
> On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
> > On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
> >> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
> >> > --- a/net/core/sock_map.c
> >> > +++ b/net/core/sock_map.c
> >> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
> >> > bool strp_stop = false, verdict_stop = false;
> >> > struct sk_psock_link *link, *tmp;
> >> >
> >> > + rcu_read_lock();
> >> > spin_lock_bh(&psock->link_lock);
> >>
> >> I think this is incorrect.
> >> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
> >
> > Could you specify why it won't be safe in rcu cs if you are right?
> > What does rcu look like in RT if not nothing?
>
> RCU readers can't block, while spinlock RT doesn't disable preemption.
>
> https://docs.kernel.org/RCU/rcu.html
> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
>
> I've finally gotten around to testing proposed fix that just disallows
> map_delete_elem on sockmap/sockhash from BPF tracing progs
> completely. This should put an end to this saga of syzkaller reports.
>
> https://lore.kernel.org/all/[email protected]/
>
The locking info syzbot reported [2] suggests a known issue, and that, like Alexei,
you hit the send button earlier than expected.

4 locks held by syz-executor361/5090:
#0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
#0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
#0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: map_delete_elem+0x388/0x5e0 kernel/bpf/syscall.c:1695
#1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
#1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0x17c/0x400 net/core/sock_map.c:945
#2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
#2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_del_link net/core/sock_map.c:145 [inline]
#2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_unref+0xcc/0x5e0 net/core/sock_map.c:180
#3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
#3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
#3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2380 [inline]
#3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run2+0x114/0x420 kernel/trace/bpf_trace.c:2420

[2] https://lore.kernel.org/all/[email protected]/


If CONFIG_PREEMPT_RCU=y, rcu_read_lock() does not disable
preemption. This is true even for !RT kernels with CONFIG_PREEMPT=y.

[3] Subject: Re: [patch 30/63] locking/spinlock: Provide RT variant
https://lore.kernel.org/all/874kc6rizr.ffs@tglx/

2024-05-22 12:19:53

by Jakub Sitnicki

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, May 22, 2024 at 07:33 PM +08, Hillf Danton wrote:
> On Wed, 22 May 2024 11:50:49 +0200 Jakub Sitnicki <[email protected]>
> On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
>> > On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
>> >> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
>> >> > --- a/net/core/sock_map.c
>> >> > +++ b/net/core/sock_map.c
>> >> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
>> >> > bool strp_stop = false, verdict_stop = false;
>> >> > struct sk_psock_link *link, *tmp;
>> >> >
>> >> > + rcu_read_lock();
>> >> > spin_lock_bh(&psock->link_lock);
>> >>
>> >> I think this is incorrect.
>> >> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
>> >
>> > Could you specify why it won't be safe in rcu cs if you are right?
>> > What does rcu look like in RT if not nothing?
>>
>> RCU readers can't block, while spinlock RT doesn't disable preemption.
>>
>> https://docs.kernel.org/RCU/rcu.html
>> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
>>
>> I've finally gotten around to testing proposed fix that just disallows
>> map_delete_elem on sockmap/sockhash from BPF tracing progs
>> completely. This should put an end to this saga of syzkaller reports.
>>
>> https://lore.kernel.org/all/[email protected]/
>>
> The locking info syzbot reported [2] suggests a known issue that like Alexei
> you hit the send button earlier than expected.
>
> 4 locks held by syz-executor361/5090:
> #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
> #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
> #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: map_delete_elem+0x388/0x5e0 kernel/bpf/syscall.c:1695
> #1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
> #1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0x17c/0x400 net/core/sock_map.c:945
> #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
> #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_del_link net/core/sock_map.c:145 [inline]
> #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_unref+0xcc/0x5e0 net/core/sock_map.c:180
> #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
> #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
> #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2380 [inline]
> #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run2+0x114/0x420 kernel/trace/bpf_trace.c:2420
>
> [2] https://lore.kernel.org/all/[email protected]/
>
>
> If CONFIG_PREEMPT_RCU=y rcu_read_lock() does not disable
> preemption. This is even true for !RT kernels with CONFIG_PREEMPT=y
>
> [3] Subject: Re: [patch 30/63] locking/spinlock: Provide RT variant
> https://lore.kernel.org/all/874kc6rizr.ffs@tglx/

That locking issue is related to my earlier fix, which, as it turned
out, was incomplete:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ff91059932401894e6c86341915615c5eb0eca48

We don't expect map_delete_elem to be called from map_update_elem for
sockmap/sockhash, but that is what syzkaller started doing by attaching
BPF tracing progs which call map_delete_elem.

2024-05-22 14:58:12

by Alexei Starovoitov

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, May 22, 2024 at 5:12 AM Jakub Sitnicki <[email protected]> wrote:
>
> On Wed, May 22, 2024 at 07:33 PM +08, Hillf Danton wrote:
> > On Wed, 22 May 2024 11:50:49 +0200 Jakub Sitnicki <[email protected]>
> > On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
> >> > On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
> >> >> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
> >> >> > --- a/net/core/sock_map.c
> >> >> > +++ b/net/core/sock_map.c
> >> >> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
> >> >> > bool strp_stop = false, verdict_stop = false;
> >> >> > struct sk_psock_link *link, *tmp;
> >> >> >
> >> >> > + rcu_read_lock();
> >> >> > spin_lock_bh(&psock->link_lock);
> >> >>
> >> >> I think this is incorrect.
> >> >> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
> >> >
> >> > Could you specify why it won't be safe in rcu cs if you are right?
> >> > What does rcu look like in RT if not nothing?
> >>
> >> RCU readers can't block, while spinlock RT doesn't disable preemption.
> >>
> >> https://docs.kernel.org/RCU/rcu.html
> >> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
> >>
> >> I've finally gotten around to testing proposed fix that just disallows
> >> map_delete_elem on sockmap/sockhash from BPF tracing progs
> >> completely. This should put an end to this saga of syzkaller reports.
> >>
> >> https://lore.kernel.org/all/[email protected]/

Agree. Let's do that. According to John, the delete path is not something
that is used in production. It's only a source of trouble with syzbot.


> >>
> > The locking info syzbot reported [2] suggests a known issue that like Alexei
> > you hit the send button earlier than expected.
> >
> > 4 locks held by syz-executor361/5090:
> > #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
> > #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
> > #0: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: map_delete_elem+0x388/0x5e0 kernel/bpf/syscall.c:1695
> > #1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
> > #1: ffff88807b2af8f8 (&htab->buckets[i].lock){+...}-{2:2}, at: sock_hash_delete_elem+0x17c/0x400 net/core/sock_map.c:945
> > #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:356 [inline]
> > #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_del_link net/core/sock_map.c:145 [inline]
> > #2: ffff88801c2a4290 (&psock->link_lock){+...}-{2:2}, at: sock_map_unref+0xcc/0x5e0 net/core/sock_map.c:180
> > #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:329 [inline]
> > #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:781 [inline]
> > #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2380 [inline]
> > #3: ffffffff8e334d20 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run2+0x114/0x420 kernel/trace/bpf_trace.c:2420
> >
> > [2] https://lore.kernel.org/all/[email protected]/
> >
> >
> > If CONFIG_PREEMPT_RCU=y rcu_read_lock() does not disable
> > preemption. This is even true for !RT kernels with CONFIG_PREEMPT=y
> >
> > [3] Subject: Re: [patch 30/63] locking/spinlock: Provide RT variant
> > https://lore.kernel.org/all/874kc6rizr.ffs@tglx/
>
> That locking issue is related to my earlier, as it turned out -
> incomplete, fix:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ff91059932401894e6c86341915615c5eb0eca48
>
> We don't expect map_delete_elem to be called from map_update_elem for
> sockmap/sockhash, but that is what syzkaller started doing by attaching
> BPF tracing progs which call map_delete_elem.

2024-05-24 13:07:06

by Jakub Sitnicki

Subject: Re: [PATCH] bpf, sockmap: defer sk_psock_free_link() using RCU

On Wed, May 22, 2024 at 07:57 AM -07, Alexei Starovoitov wrote:
> On Wed, May 22, 2024 at 5:12 AM Jakub Sitnicki <[email protected]> wrote:
>>
>> On Wed, May 22, 2024 at 07:33 PM +08, Hillf Danton wrote:
>> > On Wed, 22 May 2024 11:50:49 +0200 Jakub Sitnicki <[email protected]>
>> > On Wed, May 22, 2024 at 06:59 AM +08, Hillf Danton wrote:
>> >> > On Tue, 21 May 2024 08:38:52 -0700 Alexei Starovoitov <[email protected]>
>> >> >> On Sun, May 12, 2024 at 12:22 AM Tetsuo Handa <[email protected]> wrote:
>> >> >> > --- a/net/core/sock_map.c
>> >> >> > +++ b/net/core/sock_map.c
>> >> >> > @@ -142,6 +142,7 @@ static void sock_map_del_link(struct sock *sk,
>> >> >> > bool strp_stop = false, verdict_stop = false;
>> >> >> > struct sk_psock_link *link, *tmp;
>> >> >> >
>> >> >> > + rcu_read_lock();
>> >> >> > spin_lock_bh(&psock->link_lock);
>> >> >>
>> >> >> I think this is incorrect.
>> >> >> spin_lock_bh may sleep in RT and it won't be safe to do in rcu cs.
>> >> >
>> >> > Could you specify why it won't be safe in rcu cs if you are right?
>> >> > What does rcu look like in RT if not nothing?
>> >>
>> >> RCU readers can't block, while spinlock RT doesn't disable preemption.
>> >>
>> >> https://docs.kernel.org/RCU/rcu.html
>> >> https://docs.kernel.org/locking/locktypes.html#spinlock-t-and-preempt-rt
>> >>
>> >> I've finally gotten around to testing proposed fix that just disallows
>> >> map_delete_elem on sockmap/sockhash from BPF tracing progs
>> >> completely. This should put an end to this saga of syzkaller reports.
>> >>
>> >> https://lore.kernel.org/all/[email protected]/
>
> Agree. Let's do that. According to John the delete path is not something
> that is used in production. It's only a source of trouble with syzbot.

Cool. The proposed API rule would be that if a BPF program type is
allowed to update a sockmap/sockhash, then it is also allowed to delete
from it.

So I need to tweak my patch to allow deletes from sock_ops progs.
We have a dedicated bpf_sock_map_update() helper there.

[...]