2020-02-04 18:44:08

by Qian Cai

Subject: [PATCH v3] skbuff: fix a data race in skb_queue_len()

sk_buff_head.qlen can be accessed concurrently, as noticed by KCSAN:

BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
net/unix/af_unix.c:1761
____sys_sendmsg+0x33e/0x370
___sys_sendmsg+0xa6/0xf0
__sys_sendmsg+0x69/0xf0
__x64_sys_sendmsg+0x51/0x70
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe

write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
__skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
__skb_try_recv_datagram+0xbe/0x220
unix_dgram_recvmsg+0xee/0x850
____sys_recvmsg+0x1fb/0x210
___sys_recvmsg+0xa2/0xf0
__sys_recvmsg+0x66/0xf0
__x64_sys_recvmsg+0x51/0x70
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is lockless, load tearing could introduce a logic
bug in unix_recvq_full(). Fix it by adding lockless variants of
skb_queue_len() and unix_recvq_full(), with READ_ONCE() on the read
side and WRITE_ONCE() on the write side, similar to commit
d7d16a89350a ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <[email protected]>
---

v3: fix minor issues thanks to Eric.
v2: add lockless variant helpers and WRITE_ONCE().

include/linux/skbuff.h | 14 +++++++++++++-
net/unix/af_unix.c | 11 +++++++++--
2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3d13a4b717e9..ca8806b69388 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1822,6 +1822,18 @@ static inline __u32 skb_queue_len(const struct sk_buff_head *list_)
}

/**
+ * skb_queue_len_lockless - get queue length
+ * @list_: list to measure
+ *
+ * Return the length of an &sk_buff queue.
+ * This variant can be used in lockless contexts.
+ */
+static inline __u32 skb_queue_len_lockless(const struct sk_buff_head *list_)
+{
+ return READ_ONCE(list_->qlen);
+}
+
+/**
* __skb_queue_head_init - initialize non-spinlock portions of sk_buff_head
* @list: queue to initialize
*
@@ -2026,7 +2038,7 @@ static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
{
struct sk_buff *next, *prev;

- list->qlen--;
+ WRITE_ONCE(list->qlen, list->qlen - 1);
next = skb->next;
prev = skb->prev;
skb->next = skb->prev = NULL;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 321af97c7bbe..62c12cb5763e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -189,11 +189,17 @@ static inline int unix_may_send(struct sock *sk, struct sock *osk)
return unix_peer(osk) == NULL || unix_our_peer(sk, osk);
}

-static inline int unix_recvq_full(struct sock const *sk)
+static inline int unix_recvq_full(const struct sock *sk)
{
return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}

+static inline int unix_recvq_full_lockless(const struct sock *sk)
+{
+ return skb_queue_len_lockless(&sk->sk_receive_queue) >
+ READ_ONCE(sk->sk_max_ack_backlog);
+}
+
struct sock *unix_peer_get(struct sock *s)
{
struct sock *peer;
@@ -1758,7 +1764,8 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
* - unix_peer(sk) == sk by time of get but disconnected before lock
*/
if (other != sk &&
- unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+ unlikely(unix_peer(other) != sk &&
+ unix_recvq_full_lockless(other))) {
if (timeo) {
timeo = unix_wait_for_peer(other, timeo);

--
1.8.3.1


2020-02-06 13:26:47

by David Miller

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

From: Qian Cai <[email protected]>
Date: Tue, 4 Feb 2020 13:40:29 -0500

> sk_buff_head.qlen can be accessed concurrently, as noticed by KCSAN:
>
> BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg
>
> read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
> unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
> net/unix/af_unix.c:1761
> ____sys_sendmsg+0x33e/0x370
> ___sys_sendmsg+0xa6/0xf0
> __sys_sendmsg+0x69/0xf0
> __x64_sys_sendmsg+0x51/0x70
> do_syscall_64+0x91/0xb47
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
> __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
> __skb_try_recv_datagram+0xbe/0x220
> unix_dgram_recvmsg+0xee/0x850
> ____sys_recvmsg+0x1fb/0x210
> ___sys_recvmsg+0xa2/0xf0
> __sys_recvmsg+0x66/0xf0
> __x64_sys_recvmsg+0x51/0x70
> do_syscall_64+0x91/0xb47
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Since only the read is lockless, load tearing could introduce a logic
> bug in unix_recvq_full(). Fix it by adding lockless variants of
> skb_queue_len() and unix_recvq_full(), with READ_ONCE() on the read
> side and WRITE_ONCE() on the write side, similar to commit
> d7d16a89350a ("net: add skb_queue_empty_lockless()").
>
> Signed-off-by: Qian Cai <[email protected]>

Applied, thank you.

2020-02-06 16:39:58

by Jason A. Donenfeld

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

Hi Eric,

On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> - list->qlen--;
> + WRITE_ONCE(list->qlen, list->qlen - 1);

Sorry I'm a bit late to the party here, but this immediately jumped out.
This generates worse code with a bigger race in some sense:

list->qlen-- is:

0: 83 6f 10 01 subl $0x1,0x10(%rdi)

whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:

0: 8b 47 10 mov 0x10(%rdi),%eax
3: 83 e8 01 sub $0x1,%eax
6: 89 47 10 mov %eax,0x10(%rdi)
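(For reference, that comparison is from a minimal userspace sketch
rather than an actual kernel build; the macro below is only a
simplified stand-in for the real WRITE_ONCE():)

	/* sketch.c: gcc -O2 -c sketch.c && objdump -d sketch.o */
	struct sk_buff_head_sketch {
		void *next, *prev;	/* puts qlen at offset 0x10, as above */
		unsigned int qlen;
	};

	/* simplified stand-in for the kernel's WRITE_ONCE() */
	#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

	void dec_plain(struct sk_buff_head_sketch *list)
	{
		list->qlen--;				/* subl $0x1,0x10(%rdi) */
	}

	void dec_marked(struct sk_buff_head_sketch *list)
	{
		WRITE_ONCE(list->qlen, list->qlen - 1);	/* mov / sub / mov */
	}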

Are you sure that's what we want?

Jason

2020-02-06 17:12:01

by Eric Dumazet

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()



On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> Hi Eric,
>
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> - list->qlen--;
>> + WRITE_ONCE(list->qlen, list->qlen - 1);
>
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
>
> list->qlen-- is:
>
> 0: 83 6f 10 01 subl $0x1,0x10(%rdi)
>
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>
> 0: 8b 47 10 mov 0x10(%rdi),%eax
> 3: 83 e8 01 sub $0x1,%eax
> 6: 89 47 10 mov %eax,0x10(%rdi)
>
> Are you sure that's what we want?
>
> Jason
>


Unfortunately we do not have ADD_ONCE() or something like that.

Sure, on x86 we could get much better code generation.

If we agree that a READ_ONCE() is needed on the read side,
then a WRITE_ONCE() is needed on the write side as well.

If we believe load tearing and/or write tearing must never happen,
then we must document this.
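
For instance, something along these lines next to the field in
struct sk_buff_head (just a sketch of the kind of comment I mean,
not a patch):

	__u32		qlen;	/* Writers hold list->lock and use
				 * WRITE_ONCE(); lockless readers use
				 * READ_ONCE() via skb_queue_len_lockless().
				 */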

2020-02-06 18:14:51

by Jason A. Donenfeld

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <[email protected]> wrote:
> Unfortunately we do not have ADD_ONCE() or something like that.

I guess normally this is called "atomic_add", unless you're thinking
instead about something like this, which generates the same
inefficient code as WRITE_ONCE:

#define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)

2020-02-06 18:23:18

by Eric Dumazet

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()



On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <[email protected]> wrote:
>> Unfortunately we do not have ADD_ONCE() or something like that.
>
> I guess normally this is called "atomic_add", unless you're thinking
> instead about something like this, which generates the same
> inefficient code as WRITE_ONCE:
>
> #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
>

Dmitry Vyukov had a nice suggestion a few months back on how to implement this.

https://lkml.org/lkml/2019/10/5/6

2020-02-06 18:45:00

by Jason A. Donenfeld

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <[email protected]> wrote:
> >> Unfortunately we do not have ADD_ONCE() or something like that.
> >
> > I guess normally this is called "atomic_add", unless you're thinking
> > instead about something like this, which generates the same
> > inefficient code as WRITE_ONCE:
> >
> > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> >
>
> Dmitry Vyukov had a nice suggestion a few months back on how to implement this.
>
> https://lkml.org/lkml/2019/10/5/6

That trick appears to work well in clang but not gcc:

#define ADD_ONCE(d, i) ({ \
	typeof(d) *__p = &(d); \
	__atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
})

gcc 9.2 gives:

 0:   8b 47 10                mov    0x10(%rdi),%eax
  3:   83 e8 01                sub    $0x1,%eax
  6:   89 47 10                mov    %eax,0x10(%rdi)

clang 9.0.1 gives:

   0:   81 47 10 ff ff ff ff    addl   $0xffffffff,0x10(%rdi)

But actually, clang does equally as well with:

#define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

And testing further back, it generates the same code with your original
WRITE_ONCE.

If clang's optimization here is technically correct, maybe we should go
talk to the gcc people about catching this case?
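
(Again for reference, those listings came from a standalone sketch
along these lines, not from the kernel tree itself; the struct is only
there to put qlen at offset 0x10 like in sk_buff_head:)

	/* add_once.c: gcc-9 -O2 -c add_once.c / clang-9 -O2 -c add_once.c,
	 * then objdump -d add_once.o
	 */
	struct sk_buff_head_sketch {
		void *next, *prev;
		unsigned int qlen;		/* offset 0x10 */
	};

	#define ADD_ONCE(d, i) ({ \
		typeof(d) *__p = &(d); \
		__atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), \
				 __ATOMIC_RELAXED); \
	})

	void dec(struct sk_buff_head_sketch *list)
	{
		ADD_ONCE(list->qlen, -1);	/* gcc-9: mov/sub/mov, clang-9: addl */
	}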

2020-02-06 19:30:31

by Marco Elver

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On Thu, 6 Feb 2020 at 19:43, Jason A. Donenfeld <[email protected]> wrote:
>
> On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> > On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <[email protected]> wrote:
> > >> Unfortunately we do not have ADD_ONCE() or something like that.
> > >
> > > I guess normally this is called "atomic_add", unless you're thinking
> > > instead about something like this, which generates the same
> > > inefficient code as WRITE_ONCE:
> > >
> > > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> > >
> >
> > Dmitry Vyukov had a nice suggestion a few months back on how to implement this.
> >
> > https://lkml.org/lkml/2019/10/5/6
>
> That trick appears to work well in clang but not gcc:
>
> #define ADD_ONCE(d, i) ({ \
> typeof(d) *__p = &(d); \
> __atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
> })
>
> gcc 9.2 gives:
>
> 0: 8b 47 10 mov 0x10(%rdi),%eax
> 3: 83 e8 01 sub $0x1,%eax
> 6: 89 47 10 mov %eax,0x10(%rdi)
>
> clang 9.0.1 gives:
>
> 0: 81 47 10 ff ff ff ff addl $0xffffffff,0x10(%rdi)
>
> But actually, clang does equally as well with:
>
> #define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

I feel that the name ADD_ONCE conveys that the add actually happens
once (atomically), that is, if there are concurrent ADD_ONCE calls,
all of them will succeed. This is not the case with the above variants,
where the 'ONCE' can turn into a 'MAYBE'; making it truly atomic would
need a LOCK prefix on e.g. x86, which we probably want to avoid.

In the case here, what we actually want is something that safely
increments/decrements when there are only concurrent readers
(concurrent writers disallowed). So 'add_exclusive(var, val)' (all-caps
or not) might be more appropriate. [As an aside, recent changes to
KCSAN would also allow us to assert for something like
'add_exclusive()' that there are in fact no other writers, only
concurrent readers, even if all accesses are marked.]

If the single-writer constraint isn't wanted, but the operation still
need not be atomic, maybe 'add_lossy()'?
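
Purely as a sketch of the naming idea (neither helper exists in the
tree today, and the assertion is just the KCSAN facility mentioned
above, used here only for illustration):

	/* Single writer, concurrent readers only (sketch, not an existing API). */
	#define add_exclusive(var, val)					\
	do {								\
		/* KCSAN: complain if there is any other writer. */	\
		ASSERT_EXCLUSIVE_WRITER(var);				\
		WRITE_ONCE(var, (var) + (val));				\
	} while (0)

	/* Marked but non-atomic: may lose updates under concurrent writers. */
	#define add_lossy(var, val)	WRITE_ONCE(var, READ_ONCE(var) + (val))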

Thanks,
-- Marco


> And testing further back, it generates the same code with your original
> WRITE_ONCE.
>
> If clang's optimization here is technically correct, maybe we should go
> talk to the gcc people about catching this case?

2020-02-07 10:36:54

by Marco Elver

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On Thu, 6 Feb 2020 at 18:10, Eric Dumazet <[email protected]> wrote:
>
>
>
> On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> >> - list->qlen--;
> >> + WRITE_ONCE(list->qlen, list->qlen - 1);
> >
> > Sorry I'm a bit late to the party here, but this immediately jumped out.
> > This generates worse code with a bigger race in some sense:
> >
> > list->qlen-- is:
> >
> > 0: 83 6f 10 01 subl $0x1,0x10(%rdi)
> >
> > whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
> >
> > 0: 8b 47 10 mov 0x10(%rdi),%eax
> > 3: 83 e8 01 sub $0x1,%eax
> > 6: 89 47 10 mov %eax,0x10(%rdi)
> >
> > Are you sure that's what we want?
> >
> > Jason
> >
>
>
> Unfortunately we do not have ADD_ONCE() or something like that.
>
> Sure, on x86 we could get much better code generation.
>
> If we agree that a READ_ONCE() is needed on the read side,
> then a WRITE_ONCE() is needed on the write side as well.
>
> If we believe load tearing and/or write tearing must never happen,
> then we must document this.

Just FYI, I forgot to mention: by default, recent KCSAN will forgive
unannotated aligned writes up to word size, on the assumption that
these are safe. This would include things like 'var++' if there is
only a single writer. This default was added because of kernel-wide
preferences we were told about.

Since I cannot verify if this assumption is always correct, I would
still prefer to mark writes if at all possible. In the end it's up to
maintainers.

Thanks,
-- Marco

2020-02-17 03:25:26

by Herbert Xu

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

Jason A. Donenfeld <[email protected]> wrote:
> Hi Eric,
>
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> - list->qlen--;
>> + WRITE_ONCE(list->qlen, list->qlen - 1);
>
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
>
> list->qlen-- is:
>
> 0: 83 6f 10 01 subl $0x1,0x10(%rdi)
>
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>
> 0: 8b 47 10 mov 0x10(%rdi),%eax
> 3: 83 e8 01 sub $0x1,%eax
> 6: 89 47 10 mov %eax,0x10(%rdi)
>
> Are you sure that's what we want?

Fixing these KCSAN warnings is actively making the kernel worse.

Why are we still doing this?

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-02-17 07:41:10

by Jason A. Donenfeld

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On 2/17/20, Herbert Xu <[email protected]> wrote:
> Jason A. Donenfeld <[email protected]> wrote:
>> Hi Eric,
>>
>> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>>> - list->qlen--;
>>> + WRITE_ONCE(list->qlen, list->qlen - 1);
>>
>> Sorry I'm a bit late to the party here, but this immediately jumped out.
>> This generates worse code with a bigger race in some sense:
>>
>> list->qlen-- is:
>>
>> 0: 83 6f 10 01 subl $0x1,0x10(%rdi)
>>
>> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>>
>> 0: 8b 47 10 mov 0x10(%rdi),%eax
>> 3: 83 e8 01 sub $0x1,%eax
>> 6: 89 47 10 mov %eax,0x10(%rdi)
>>
>> Are you sure that's what we want?
>
> Fixing these KCSAN warnings is actively making the kernel worse.
>
> Why are we still doing this?
>
Not necessarily a big fan of this either, but just for the record here
in case it helps, while you might complain about instruction size
blowing up a bit, cycle-wise these wind up being about the same
anyway. On x86, one instruction != one cycle.

2020-02-17 10:21:16

by Herbert Xu

Subject: Re: [PATCH v3] skbuff: fix a data race in skb_queue_len()

On Mon, Feb 17, 2020 at 08:39:45AM +0100, Jason A. Donenfeld wrote:
>
> Not necessarily a big fan of this either, but just for the record here
> in case it helps, while you might complain about instruction size
> blowing up a bit, cycle-wise these wind up being about the same
> anyway. On x86, one instruction != one cycle.

I don't care about that. My problem is with the mindless patches
that started this thread. Look at the patch:

commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442
Author: Qian Cai <[email protected]>
Date: Tue Feb 4 13:40:29 2020 -0500

skbuff: fix a data race in skb_queue_len()

It's utter garbage. Why on earth did it change that one instance
of unix_recvq_full? In fact you can see how stupid it is because
right after the call that got changed we again call into
unix_recvq_full which surely would trigger the same warning.

So far the vast majority of the KCSAN patches that have caught
my attention have been of this mindless kind that does not add
any value to the kernel. If anything they could be hiding real
bugs that would now be harder to find.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt