LinuxLists.cc - WARNING in sk_stream_kill

2020-11-30 08:45:03

Subject: WARNING in sk_stream_kill_queues (5)

Hello,

syzbot found the following issue on:

HEAD commit: 6147c83f Add linux-next specific files for 20201126
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
compiler: gcc (GCC) 10.1.0-syz 20200507
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000

The issue was bisected to:

commit 145cd60fb481328faafba76842aa0fd242e2b163
Author: Alexander Potapenko <[email protected]>
Date: Tue Nov 24 05:38:44 2020 +0000

mm, kfence: insert KFENCE hooks for SLUB

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")

------------[ cut here ]------------
WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
Modules linked in:
CPU: 0 PID: 11307 Comm: syz-executor673 Not tainted 5.10.0-rc5-next-20201126-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
Code: 00 00 00 fc ff df 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 63 01 00 00 8b ab 20 02 00 00 e9 60 ff ff ff e8 ad 24 7b fa <0f> 0b eb 97 e8 a4 24 7b fa 0f 0b eb a0 e8 9b 24 7b fa 0f 0b e9 a5
RSP: 0018:ffffc9000979f978 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 00000000fffffe80 RCX: ffffffff86f5877a
RDX: ffff88801ebb5040 RSI: ffffffff86f587e3 RDI: 0000000000000005
RBP: 0000000000000180 R08: 0000000000000001 R09: ffffffff8ebd9817
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8880182f3ce0
R13: ffffffff8fb178c0 R14: ffff8880182f3ae8 R15: ffff8880182f3c70
FS: 0000000000000000(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004c7cd8 CR3: 000000000b08e000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
__tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release+0xcd/0x280 net/socket.c:596
sock_close+0x18/0x20 net/socket.c:1255
__fput+0x283/0x920 fs/file_table.c:280
task_work_run+0xdd/0x190 kernel/task_work.c:140
exit_task_work include/linux/task_work.h:30 [inline]
do_exit+0xb89/0x29e0 kernel/exit.c:823
do_group_exit+0x125/0x310 kernel/exit.c:920
get_signal+0x3ec/0x2010 kernel/signal.c:2770
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:144 [inline]
exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44ea59
Code: Unable to access opcode bytes at RIP 0x44ea2f.
RSP: 002b:00007fd1200f3d98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: 0000000000012255 RBX: 00000000006e6a18 RCX: 000000000044ea59
RDX: 00000001000001bd RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006e6a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e6a1c
R13: 3030303030303030 R14: 3030303030303d65 R15: 2b74d0dd4a6f722c

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

2020-12-03 16:02:30

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 6147c83f Add linux-next specific files for 20201126
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> compiler: gcc (GCC) 10.1.0-syz 20200507
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
>
> The issue was bisected to:
>
> commit 145cd60fb481328faafba76842aa0fd242e2b163
> Author: Alexander Potapenko <[email protected]>
> Date: Tue Nov 24 05:38:44 2020 +0000
>
> mm, kfence: insert KFENCE hooks for SLUB
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
> Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
[...]
> Call Trace:
> inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> __sock_release+0xcd/0x280 net/socket.c:596
> sock_close+0x18/0x20 net/socket.c:1255
> __fput+0x283/0x920 fs/file_table.c:280
> task_work_run+0xdd/0x190 kernel/task_work.c:140
> exit_task_work include/linux/task_work.h:30 [inline]
> do_exit+0xb89/0x29e0 kernel/exit.c:823
> do_group_exit+0x125/0x310 kernel/exit.c:920
> get_signal+0x3ec/0x2010 kernel/signal.c:2770
> arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> handle_signal_work kernel/entry/common.c:144 [inline]
> exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> entry_SYSCALL_64_after_hwframe+0x44/0xa9

I've been debugging this and I think enabling KFENCE uncovered that some
code is assuming that the following is always true:

ksize(kmalloc(S)) == ksize(kmalloc(S))

but I don't think this assumption can be made (with or without KFENCE).

With KFENCE, we actually end up testing no code assumes this, because
KFENCE's ksize() always returns the exact size S.

I have narrowed it down to sk_wmem_queued becoming <0 in
sk_wmem_free_skb().

The skb passed to sk_wmem_free_skb() and whose truesize causes
sk_wmem_queued to become negative is always allocated in:

| kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
| __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
| alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
| sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
| tso_fragment net/ipv4/tcp_output.c:2124 [inline]
| tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
| __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
| tcp_push_pending_frames include/net/tcp.h:1864 [inline]
| tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
| tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
| tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
| sk_backlog_rcv include/net/sock.h:1011 [inline]
| __release_sock+0x134/0x3a0 net/core/sock.c:2523
| release_sock+0x54/0x1b0 net/core/sock.c:3053
| sk_wait_data+0x177/0x450 net/core/sock.c:2565
| tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
| inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
| sock_recvmsg_nosec net/socket.c:885 [inline]
| sock_recvmsg net/socket.c:903 [inline]
| sock_recvmsg net/socket.c:899 [inline]
| ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
| ___sys_recvmsg+0x127/0x200 net/socket.c:2605
| __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
| do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
| entry_SYSCALL_64_after_hwframe+0x44/0xa9

I used the code below to add some warnings that helped narrow it down.

Does any of this help explain the problem?

Thanks,
-- Marco

------ >8 ------

diff --git a/include/net/sock.h b/include/net/sock.h
index e8d958ef3ea0..ef4837f3aba4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -35,6 +35,7 @@
#ifndef _SOCK_H
#define _SOCK_H

+#include <linux/kfence.h>
#include <linux/hardirq.h>
#include <linux/kernel.h>
#include <linux/list.h>
@@ -1534,7 +1535,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
{
+ bool bad = false;
+
sk_wmem_queued_add(sk, -skb->truesize);
+
+ if (WARN_ON(READ_ONCE(sk->sk_wmem_queued) == -384)) {
+ pr_info("wmem_queued=%d truesize=%u\n", sk->sk_wmem_queued, skb->truesize);
+ bad = true;
+ }
+
sk_mem_uncharge(sk, skb->truesize);
if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
!sk->sk_tx_skb_cache && !skb_cloned(skb)) {
@@ -1544,6 +1553,9 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
return;
}
__kfree_skb(skb);
+
+ if (bad)
+ (void)READ_ONCE(skb->truesize); /* UAF to let KASAN show where it was allocated */
}

static inline void sock_release_ownership(struct sock *sk)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ffe3dcc0ebea..f365495819ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -208,6 +208,19 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
+
+ if (is_kfence_address(data))
+ pr_info("kfence's ksize: %zu\n", ksize(data));
+ /*
+ * BUG BUG
+ * Hypothesis: The problem is that some code assumes that:
+ *
+ * ksize(kmalloc(S)) == ksize(kmalloc(S))
+ *
+ * Note: If we force no KFENCE allocation for @data above, the warnings
+ * disappear. KFENCE's ksize() always returns the exact size S.
+ */
+
if (!data)
goto nodata;
/* kmalloc(size) might give us more room than requested.

2020-12-03 16:30:28

by Eric Dumazet

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <[email protected]> wrote:
>
> On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 6147c83f Add linux-next specific files for 20201126
> > git tree: linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > compiler: gcc (GCC) 10.1.0-syz 20200507
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> >
> > The issue was bisected to:
> >
> > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > Author: Alexander Potapenko <[email protected]>
> > Date: Tue Nov 24 05:38:44 2020 +0000
> >
> > mm, kfence: insert KFENCE hooks for SLUB
> >
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: [email protected]
> > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> [...]
> > Call Trace:
> > inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > __sock_release+0xcd/0x280 net/socket.c:596
> > sock_close+0x18/0x20 net/socket.c:1255
> > __fput+0x283/0x920 fs/file_table.c:280
> > task_work_run+0xdd/0x190 kernel/task_work.c:140
> > exit_task_work include/linux/task_work.h:30 [inline]
> > do_exit+0xb89/0x29e0 kernel/exit.c:823
> > do_group_exit+0x125/0x310 kernel/exit.c:920
> > get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > handle_signal_work kernel/entry/common.c:144 [inline]
> > exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> I've been debugging this and I think enabling KFENCE uncovered that some
> code is assuming that the following is always true:
>
> ksize(kmalloc(S)) == ksize(kmalloc(S))
>

I do not think we make this assumption.

Each skb tracks the 'truesize' which is populated from __alloc_skb()
using ksize(allocated head) .

So if ksize() decides to give us random data, it should be still fine,
because we use ksize(buff) only once at alloc skb time, and record the
value in skb->truesize
(only the socket buffer accounting would be off)

> but I don't think this assumption can be made (with or without KFENCE).
>
> With KFENCE, we actually end up testing no code assumes this, because
> KFENCE's ksize() always returns the exact size S.
>
> I have narrowed it down to sk_wmem_queued becoming <0 in
> sk_wmem_free_skb().
>
> The skb passed to sk_wmem_free_skb() and whose truesize causes
> sk_wmem_queued to become negative is always allocated in:
>
> | kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
> | __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
> | alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> | sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
> | tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> | tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
> | __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
> | tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> | tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> | tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
> | tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
> | sk_backlog_rcv include/net/sock.h:1011 [inline]
> | __release_sock+0x134/0x3a0 net/core/sock.c:2523
> | release_sock+0x54/0x1b0 net/core/sock.c:3053
> | sk_wait_data+0x177/0x450 net/core/sock.c:2565
> | tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
> | inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
> | sock_recvmsg_nosec net/socket.c:885 [inline]
> | sock_recvmsg net/socket.c:903 [inline]
> | sock_recvmsg net/socket.c:899 [inline]
> | ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
> | ___sys_recvmsg+0x127/0x200 net/socket.c:2605
> | __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
> | do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> | entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> I used the code below to add some warnings that helped narrow it down.
>
> Does any of this help explain the problem?

Not yet :)

tso_fragment() transfers some payload from one skb to another, and
properly shifts this amount from src skb to dst skb (@buff) :

sk_wmem_queued_add(sk, buff->truesize);
buff->truesize += nlen;
skb->truesize -= nlen;

>
> Thanks,
> -- Marco
>
> ------ >8 ------
>
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e8d958ef3ea0..ef4837f3aba4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -35,6 +35,7 @@
> #ifndef _SOCK_H
> #define _SOCK_H
>
> +#include <linux/kfence.h>
> #include <linux/hardirq.h>
> #include <linux/kernel.h>
> #include <linux/list.h>
> @@ -1534,7 +1535,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
> DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
> static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> {
> + bool bad = false;
> +
> sk_wmem_queued_add(sk, -skb->truesize);
> +
> + if (WARN_ON(READ_ONCE(sk->sk_wmem_queued) == -384)) {
> + pr_info("wmem_queued=%d truesize=%u\n", sk->sk_wmem_queued, skb->truesize);
> + bad = true;
> + }
> +
> sk_mem_uncharge(sk, skb->truesize);
> if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
> !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
> @@ -1544,6 +1553,9 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> return;
> }
> __kfree_skb(skb);
> +
> + if (bad)
> + (void)READ_ONCE(skb->truesize); /* UAF to let KASAN show where it was allocated */
> }
>
> static inline void sock_release_ownership(struct sock *sk)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ffe3dcc0ebea..f365495819ee 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -208,6 +208,19 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> size = SKB_DATA_ALIGN(size);
> size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> +
> + if (is_kfence_address(data))
> + pr_info("kfence's ksize: %zu\n", ksize(data));
> + /*
> + * BUG BUG
> + * Hypothesis: The problem is that some code assumes that:
> + *
> + * ksize(kmalloc(S)) == ksize(kmalloc(S))
> + *
> + * Note: If we force no KFENCE allocation for @data above, the warnings
> + * disappear. KFENCE's ksize() always returns the exact size S.
> + */
> +
> if (!data)
> goto nodata;
> /* kmalloc(size) might give us more room than requested.

2020-12-03 16:39:46

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <[email protected]> wrote:
> On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <[email protected]> wrote:
> >
> > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit: 6147c83f Add linux-next specific files for 20201126
> > > git tree: linux-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > compiler: gcc (GCC) 10.1.0-syz 20200507
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > >
> > > The issue was bisected to:
> > >
> > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > Author: Alexander Potapenko <[email protected]>
> > > Date: Tue Nov 24 05:38:44 2020 +0000
> > >
> > > mm, kfence: insert KFENCE hooks for SLUB
> > >
> > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: [email protected]
> > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > >
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > [...]
> > > Call Trace:
> > > inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > > __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > > tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > > inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > > __sock_release+0xcd/0x280 net/socket.c:596
> > > sock_close+0x18/0x20 net/socket.c:1255
> > > __fput+0x283/0x920 fs/file_table.c:280
> > > task_work_run+0xdd/0x190 kernel/task_work.c:140
> > > exit_task_work include/linux/task_work.h:30 [inline]
> > > do_exit+0xb89/0x29e0 kernel/exit.c:823
> > > do_group_exit+0x125/0x310 kernel/exit.c:920
> > > get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > > arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > > handle_signal_work kernel/entry/common.c:144 [inline]
> > > exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > > exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > > syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > I've been debugging this and I think enabling KFENCE uncovered that some
> > code is assuming that the following is always true:
> >
> > ksize(kmalloc(S)) == ksize(kmalloc(S))
> >
>
>
> I do not think we make this assumption.
>
> Each skb tracks the 'truesize' which is populated from __alloc_skb()
> using ksize(allocated head) .
>
> So if ksize() decides to give us random data, it should be still fine,
> because we use ksize(buff) only once at alloc skb time, and record the
> value in skb->truesize
> (only the socket buffer accounting would be off)

Good, thanks for clarifying. So something else must be off then.

> > but I don't think this assumption can be made (with or without KFENCE).
> >
> > With KFENCE, we actually end up testing no code assumes this, because
> > KFENCE's ksize() always returns the exact size S.
> >
> > I have narrowed it down to sk_wmem_queued becoming <0 in
> > sk_wmem_free_skb().
> >
> > The skb passed to sk_wmem_free_skb() and whose truesize causes
> > sk_wmem_queued to become negative is always allocated in:
> >
> > | kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
> > | __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
> > | alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> > | sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
> > | tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> > | tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
> > | __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
> > | tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> > | tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> > | tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
> > | tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
> > | sk_backlog_rcv include/net/sock.h:1011 [inline]
> > | __release_sock+0x134/0x3a0 net/core/sock.c:2523
> > | release_sock+0x54/0x1b0 net/core/sock.c:3053
> > | sk_wait_data+0x177/0x450 net/core/sock.c:2565
> > | tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
> > | inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
> > | sock_recvmsg_nosec net/socket.c:885 [inline]
> > | sock_recvmsg net/socket.c:903 [inline]
> > | sock_recvmsg net/socket.c:899 [inline]
> > | ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
> > | ___sys_recvmsg+0x127/0x200 net/socket.c:2605
> > | __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
> > | do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > | entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > I used the code below to add some warnings that helped narrow it down.
> >
> > Does any of this help explain the problem?
>
> Not yet :)

Damn...

> tso_fragment() transfers some payload from one skb to another, and
> properly shifts this amount from src skb to dst skb (@buff) :
>
> sk_wmem_queued_add(sk, buff->truesize);
> buff->truesize += nlen;
> skb->truesize -= nlen;

Any other stacktraces that might help?

Thanks,
-- Marco

2020-12-03 16:45:37

by Eric Dumazet

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, Dec 3, 2020 at 5:34 PM Marco Elver <[email protected]> wrote:
>
> On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <[email protected]> wrote:
> > On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <[email protected]> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit: 6147c83f Add linux-next specific files for 20201126
> > > > git tree: linux-next
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > > compiler: gcc (GCC) 10.1.0-syz 20200507
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > > >
> > > > The issue was bisected to:
> > > >
> > > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > > Author: Alexander Potapenko <[email protected]>
> > > > Date: Tue Nov 24 05:38:44 2020 +0000
> > > >
> > > > mm, kfence: insert KFENCE hooks for SLUB
> > > >
> > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > > final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: [email protected]
> > > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > > >
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > > [...]
> > > > Call Trace:
> > > > inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > > > __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > > > tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > > > inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > > > __sock_release+0xcd/0x280 net/socket.c:596
> > > > sock_close+0x18/0x20 net/socket.c:1255
> > > > __fput+0x283/0x920 fs/file_table.c:280
> > > > task_work_run+0xdd/0x190 kernel/task_work.c:140
> > > > exit_task_work include/linux/task_work.h:30 [inline]
> > > > do_exit+0xb89/0x29e0 kernel/exit.c:823
> > > > do_group_exit+0x125/0x310 kernel/exit.c:920
> > > > get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > > > arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > > > handle_signal_work kernel/entry/common.c:144 [inline]
> > > > exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > > > exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > > > syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > I've been debugging this and I think enabling KFENCE uncovered that some
> > > code is assuming that the following is always true:
> > >
> > > ksize(kmalloc(S)) == ksize(kmalloc(S))
> > >
> >
> >
> > I do not think we make this assumption.
> >
> > Each skb tracks the 'truesize' which is populated from __alloc_skb()
> > using ksize(allocated head) .
> >
> > So if ksize() decides to give us random data, it should be still fine,
> > because we use ksize(buff) only once at alloc skb time, and record the
> > value in skb->truesize
> > (only the socket buffer accounting would be off)
>
> Good, thanks for clarifying. So something else must be off then.

Actually we might have the following assumption :

buff = kmalloc(size, GFP...)
if (buff)
ASSERT(ksize(buff) >= size)

So obviously ksize() should not be completely random ;)

2020-12-03 17:45:59

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, Dec 03, 2020 at 05:42PM +0100, Eric Dumazet wrote:
> On Thu, Dec 3, 2020 at 5:34 PM Marco Elver <[email protected]> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <[email protected]> wrote:
> > > On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <[email protected]> wrote:
> > > >
> > > > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > > > Hello,
> > > > >
> > > > > syzbot found the following issue on:
> > > > >
> > > > > HEAD commit: 6147c83f Add linux-next specific files for 20201126
> > > > > git tree: linux-next
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > > > compiler: gcc (GCC) 10.1.0-syz 20200507
> > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > > > >
> > > > > The issue was bisected to:
> > > > >
> > > > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > > > Author: Alexander Potapenko <[email protected]>
> > > > > Date: Tue Nov 24 05:38:44 2020 +0000
> > > > >
> > > > > mm, kfence: insert KFENCE hooks for SLUB
> > > > >
> > > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > > > final oops: https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > > > >
> > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > > Reported-by: [email protected]
> > > > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > > > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > > > [...]
> > > > > Call Trace:
> > > > > inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > > > > __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > > > > tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > > > > inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > > > > __sock_release+0xcd/0x280 net/socket.c:596
> > > > > sock_close+0x18/0x20 net/socket.c:1255
> > > > > __fput+0x283/0x920 fs/file_table.c:280
> > > > > task_work_run+0xdd/0x190 kernel/task_work.c:140
> > > > > exit_task_work include/linux/task_work.h:30 [inline]
> > > > > do_exit+0xb89/0x29e0 kernel/exit.c:823
> > > > > do_group_exit+0x125/0x310 kernel/exit.c:920
> > > > > get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > > > > arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > > > > handle_signal_work kernel/entry/common.c:144 [inline]
> > > > > exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > > > > exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > > > > syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > I've been debugging this and I think enabling KFENCE uncovered that some
> > > > code is assuming that the following is always true:
> > > >
> > > > ksize(kmalloc(S)) == ksize(kmalloc(S))
> > > >
> > >
> > >
> > > I do not think we make this assumption.
> > >
> > > Each skb tracks the 'truesize' which is populated from __alloc_skb()
> > > using ksize(allocated head) .
> > >
> > > So if ksize() decides to give us random data, it should be still fine,
> > > because we use ksize(buff) only once at alloc skb time, and record the
> > > value in skb->truesize
> > > (only the socket buffer accounting would be off)
> >
> > Good, thanks for clarifying. So something else must be off then.
>
> Actually we might have the following assumption :
>
> buff = kmalloc(size, GFP...)
> if (buff)
> ASSERT(ksize(buff) >= size)
>
> So obviously ksize() should not be completely random ;)

One more experiment -- simply adding

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
*/
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ size = 1 << kmalloc_index(size); /* HACK */
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);

also got rid of the warnings. Something must be off with some value that
is computed in terms of ksize(). If not, I don't have any explanation
for why the above hides the problem.

Thanks,
-- Marco

2020-12-03 18:05:02

by Eric Dumazet

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On 12/3/20 6:41 PM, Marco Elver wrote:

> One more experiment -- simply adding
>
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> */
> size = SKB_DATA_ALIGN(size);
> size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + size = 1 << kmalloc_index(size); /* HACK */
> data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
>
>
> also got rid of the warnings. Something must be off with some value that
> is computed in terms of ksize(). If not, I don't have any explanation
> for why the above hides the problem.

Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
hae some kind of assumptions, I will double check this.

2020-12-07 15:33:31

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

2020-12-07 16:34:23

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, Dec 03, 2020 at 07:01PM +0100, Eric Dumazet wrote:
> On 12/3/20 6:41 PM, Marco Elver wrote:
>
> > One more experiment -- simply adding
> >
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > */
> > size = SKB_DATA_ALIGN(size);
> > size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > + size = 1 << kmalloc_index(size); /* HACK */
> > data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> >
> >
> > also got rid of the warnings. Something must be off with some value that
> > is computed in terms of ksize(). If not, I don't have any explanation
> > for why the above hides the problem.
>
> Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> hae some kind of assumptions, I will double check this.

Some more data; removing all uses of ksize() fixes the warnings:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -214,7 +214,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
| * Put skb_shared_info exactly at the end of allocated zone,
| * to allow max possible filling before reallocation.
| */
| - size = SKB_WITH_OVERHEAD(ksize(data));
| + size = SKB_WITH_OVERHEAD(size);
| prefetchw(data + size);
|
| /*
| @@ -1628,7 +1628,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
| gfp_mask, NUMA_NO_NODE, NULL);
| if (!data)
| goto nodata;
| - size = SKB_WITH_OVERHEAD(ksize(data));
| + size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
^^ Reverting *only* this to 'ksize(data)' triggers the warning.
| /* Copy only real data... and, alas, header. This should be
| * optimized for the cases when header is void.
| @@ -5901,7 +5901,7 @@ static int pskb_carve_inside_header(struct sk_buff *skb, const u32 off,
| if (!data)
| return -ENOMEM;
|
| - size = SKB_WITH_OVERHEAD(ksize(data));
| + size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|
| /* Copy real data, and all frags */
| skb_copy_from_linear_data_offset(skb, off, data, new_hlen);
| @@ -6025,7 +6025,7 @@ static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off,
| if (!data)
| return -ENOMEM;
|
| - size = SKB_WITH_OVERHEAD(ksize(data));
| + size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|
| memcpy((struct skb_shared_info *)(data + size),
| skb_shinfo(skb), offsetof(struct skb_shared_info, frags[0]));

Conversely, only doing this also fixes the warnings:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -1628,7 +1628,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
| gfp_mask, NUMA_NO_NODE, NULL);
| if (!data)
| goto nodata;
| - size = SKB_WITH_OVERHEAD(ksize(data));
| + size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|
| /* Copy only real data... and, alas, header. This should be
| * optimized for the cases when header is void.

But not sure if any of this is helpful, since in the end what we want is
to make a bunch of subtractions reach precisely 0, and any deviation
somewhere might, by chance, achieve that.

Thanks,
-- Marco

2020-12-09 01:08:15

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <[email protected]> wrote:
> On 12/3/20 6:41 PM, Marco Elver wrote:
>
> > One more experiment -- simply adding
> >
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > */
> > size = SKB_DATA_ALIGN(size);
> > size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > + size = 1 << kmalloc_index(size); /* HACK */
> > data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> >
> >
> > also got rid of the warnings. Something must be off with some value that
> > is computed in terms of ksize(). If not, I don't have any explanation
> > for why the above hides the problem.
>
> Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> hae some kind of assumptions, I will double check this.

If I force kfence to return 4K sized allocations for everything, the
warnings remain. That might suggest that it's not due to a missed
ALIGN.

Is it possible that copies or moves are a problem? E.g. we copy
something from kfence -> non-kfence object (or vice-versa), and
ksize() no longer matches, then things go wrong?

Thanks,
-- Marco

2020-12-09 23:10:08

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <[email protected]> wrote:
> > On 12/3/20 6:41 PM, Marco Elver wrote:
> >
> > > One more experiment -- simply adding
> > >
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > */
> > > size = SKB_DATA_ALIGN(size);
> > > size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > + size = 1 << kmalloc_index(size); /* HACK */
> > > data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > >
> > >
> > > also got rid of the warnings. Something must be off with some value that
> > > is computed in terms of ksize(). If not, I don't have any explanation
> > > for why the above hides the problem.
> >
> > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > hae some kind of assumptions, I will double check this.
>
> If I force kfence to return 4K sized allocations for everything, the
> warnings remain. That might suggest that it's not due to a missed
> ALIGN.
>
> Is it possible that copies or moves are a problem? E.g. we copy
> something from kfence -> non-kfence object (or vice-versa), and
> ksize() no longer matches, then things go wrong?

I was able to narrow it down to allocations of size 640. I also narrowed
it down to 5 allocations that go through kfence that start triggering
the issue. I have attached the list of those 5 allocations with
allocation + free stacks. I'll try to go through them, maybe I get
lucky eventually. :-)

Thanks,
-- Marco

Attachments:

(No filename) (1.63 kB)
suspect-allocations.log (18.08 kB)
Download all attachments

2020-12-11 08:30:00

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Wed, Dec 09, 2020 at 01:47PM +0100, Marco Elver wrote:
> On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> > On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <[email protected]> wrote:
> > > On 12/3/20 6:41 PM, Marco Elver wrote:
> > >
> > > > One more experiment -- simply adding
> > > >
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > */
> > > > size = SKB_DATA_ALIGN(size);
> > > > size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > + size = 1 << kmalloc_index(size); /* HACK */
> > > > data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > >
> > > >
> > > > also got rid of the warnings. Something must be off with some value that
> > > > is computed in terms of ksize(). If not, I don't have any explanation
> > > > for why the above hides the problem.
> > >
> > > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > > hae some kind of assumptions, I will double check this.
> >
> > If I force kfence to return 4K sized allocations for everything, the
> > warnings remain. That might suggest that it's not due to a missed
> > ALIGN.
> >
> > Is it possible that copies or moves are a problem? E.g. we copy
> > something from kfence -> non-kfence object (or vice-versa), and
> > ksize() no longer matches, then things go wrong?
>
> I was able to narrow it down to allocations of size 640. I also narrowed
> it down to 5 allocations that go through kfence that start triggering
> the issue. I have attached the list of those 5 allocations with
> allocation + free stacks. I'll try to go through them, maybe I get
> lucky eventually. :-)

[...]

> kfence-#3 [0xffff88843681ac00-0xffff88843681ae7f, size=640, cache=kmalloc-1k] allocated by task 17012:
> __kmalloc_reserve net/core/skbuff.c:142 [inline]
> __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
> alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
> tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
> __tcp_push_pending_frames+0x68/0x1f0 net/ipv4/tcp_output.c:2866
> tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> tcp_rcv_established+0x57c/0x10b0 net/ipv4/tcp_input.c:5869
> tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
> sk_backlog_rcv include/net/sock.h:1010 [inline]
> __release_sock+0xd7/0x260 net/core/sock.c:2523
> release_sock+0x40/0x120 net/core/sock.c:3053
> sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
> tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
> inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
> sock_recvmsg_nosec net/socket.c:885 [inline]
> sock_recvmsg net/socket.c:903 [inline]
> sock_recvmsg net/socket.c:899 [inline]
> ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
> ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
> __sys_recvmsg+0x8b/0x130 net/socket.c:2641
> __do_sys_recvmsg net/socket.c:2651 [inline]
> __se_sys_recvmsg net/socket.c:2648 [inline]
> __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
> do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
> entry_SYSCALL_64_after_hwframe+0x44/0xa9

So I started putting gdb to work, and whenever I see an allocation
exactly like the above that goes through tso_fragment() a warning
immediately follows.

Long story short, I somehow synthesized this patch that appears to fix
things, but I can't explain why exactly:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
|
| skb_metadata_clear(skb);
|
| - /* It is not generally safe to change skb->truesize.
| - * For the moment, we really care of rx path, or
| - * when skb is orphaned (not attached to a socket).
| - */
| - if (!skb->sk || skb->destructor == sock_edemux)
| - skb->truesize += size - osize;
| -
| return 0;
|
| nofrags:

Now, here are the breadcrumbs I followed:

1. Breakpoint on kfence_ksize() -- first allocation that matches the above:

| #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
| #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
| #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
| #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
| #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
| #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
| #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
| skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
| #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
| at net/ipv4/tcp_output.c:2674
| #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
| at ./include/net/sock.h:918
| #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
| #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
| #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
| #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
| [...]

Set watchpoint on skb->truesize:

| (gdb) frame 3
| #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
| 217 size = SKB_WITH_OVERHEAD(ksize(data));
| (gdb) p &skb->truesize
| $5 = (unsigned int *) 0xffff888117f55f90
| (gdb) awatch *0xffff888117f55f90
| Hardware access (read/write) watchpoint 6: *0xffff888117f55f90

2. Some time later, we see that the skb with kfence-allocated data
is cloned:

| Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
|
| Value = 1570
| 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
| 1002 C(truesize);
| (gdb) bt
| #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
| #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
| #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
| rcv_nxt=0) at net/ipv4/tcp_output.c:1267
| #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
| sk=<optimized out>) at ./include/linux/tcp.h:439
| #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
| at net/ipv4/tcp_output.c:2688
| #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
| at ./include/net/sock.h:918
| #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
| #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
| #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
| #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
| #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
[...]

3. The original skb (that was cloned) has its truesize adjusted
after a pskb_expand_head():

| Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
|
| Old value = 1570
| New value = 1954

^^ the difference between the old and the new value is exactly
384, which is also the final underflow of the sk_wmem_queued
that triggers the warning. Presumably if the original allocation
had been through kmalloc-1k and not KFENCE, the difference here
would have been 0, since ksize() of the original allocation in
step (1) would have been 1024, and not 640 (difference of 384).

| 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
| at net/core/skbuff.c:1687
| 1687 skb->truesize += size - osize;
| (gdb) bt
| #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
| at net/core/skbuff.c:1687
| #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
| #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
| #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
| #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
| to=<optimized out>) at net/ipv4/tcp_input.c:1497
| #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
| skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
| #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
| state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
| at net/ipv4/tcp_input.c:1670
| #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
| state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
| #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
| #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
| #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
[...]

Any of this make sense?

Thanks,
-- Marco

2020-12-11 21:04:27

by Eric Dumazet

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
>
> On Wed, Dec 09, 2020 at 01:47PM +0100, Marco Elver wrote:
> > On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> > > On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <[email protected]> wrote:
> > > > On 12/3/20 6:41 PM, Marco Elver wrote:
> > > >
> > > > > One more experiment -- simply adding
> > > > >
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > > */
> > > > > size = SKB_DATA_ALIGN(size);
> > > > > size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > > + size = 1 << kmalloc_index(size); /* HACK */
> > > > > data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > > >
> > > > >
> > > > > also got rid of the warnings. Something must be off with some value that
> > > > > is computed in terms of ksize(). If not, I don't have any explanation
> > > > > for why the above hides the problem.
> > > >
> > > > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > > > hae some kind of assumptions, I will double check this.
> > >
> > > If I force kfence to return 4K sized allocations for everything, the
> > > warnings remain. That might suggest that it's not due to a missed
> > > ALIGN.
> > >
> > > Is it possible that copies or moves are a problem? E.g. we copy
> > > something from kfence -> non-kfence object (or vice-versa), and
> > > ksize() no longer matches, then things go wrong?
> >
> > I was able to narrow it down to allocations of size 640. I also narrowed
> > it down to 5 allocations that go through kfence that start triggering
> > the issue. I have attached the list of those 5 allocations with
> > allocation + free stacks. I'll try to go through them, maybe I get
> > lucky eventually. :-)
>
> [...]
>
> > kfence-#3 [0xffff88843681ac00-0xffff88843681ae7f, size=640, cache=kmalloc-1k] allocated by task 17012:
> > __kmalloc_reserve net/core/skbuff.c:142 [inline]
> > __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
> > alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> > sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
> > tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> > tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
> > __tcp_push_pending_frames+0x68/0x1f0 net/ipv4/tcp_output.c:2866
> > tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> > tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> > tcp_rcv_established+0x57c/0x10b0 net/ipv4/tcp_input.c:5869
> > tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
> > sk_backlog_rcv include/net/sock.h:1010 [inline]
> > __release_sock+0xd7/0x260 net/core/sock.c:2523
> > release_sock+0x40/0x120 net/core/sock.c:3053
> > sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
> > tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
> > inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
> > sock_recvmsg_nosec net/socket.c:885 [inline]
> > sock_recvmsg net/socket.c:903 [inline]
> > sock_recvmsg net/socket.c:899 [inline]
> > ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
> > ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
> > __sys_recvmsg+0x8b/0x130 net/socket.c:2641
> > __do_sys_recvmsg net/socket.c:2651 [inline]
> > __se_sys_recvmsg net/socket.c:2648 [inline]
> > __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
> > do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> So I started putting gdb to work, and whenever I see an allocation
> exactly like the above that goes through tso_fragment() a warning
> immediately follows.
>
> Long story short, I somehow synthesized this patch that appears to fix
> things, but I can't explain why exactly:
>
> | --- a/net/core/skbuff.c
> | +++ b/net/core/skbuff.c
> | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> |
> | skb_metadata_clear(skb);
> |
> | - /* It is not generally safe to change skb->truesize.
> | - * For the moment, we really care of rx path, or
> | - * when skb is orphaned (not attached to a socket).
> | - */
> | - if (!skb->sk || skb->destructor == sock_edemux)
> | - skb->truesize += size - osize;
> | -
> | return 0;
> |
> | nofrags:
>
> Now, here are the breadcrumbs I followed:
>
>
> 1. Breakpoint on kfence_ksize() -- first allocation that matches the above:
>
> | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> | at net/ipv4/tcp_output.c:2674
> | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> | at ./include/net/sock.h:918
> | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> | [...]
>
> Set watchpoint on skb->truesize:
>
> | (gdb) frame 3
> | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> | (gdb) p &skb->truesize
> | $5 = (unsigned int *) 0xffff888117f55f90
> | (gdb) awatch *0xffff888117f55f90
> | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
>
> 2. Some time later, we see that the skb with kfence-allocated data
> is cloned:
>
> | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> |
> | Value = 1570
> | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> | 1002 C(truesize);
> | (gdb) bt
> | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> | rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> | sk=<optimized out>) at ./include/linux/tcp.h:439
> | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> | at net/ipv4/tcp_output.c:2688
> | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> | at ./include/net/sock.h:918
> | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> [...]
>
>
> 3. The original skb (that was cloned) has its truesize adjusted
> after a pskb_expand_head():
>
> | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> |
> | Old value = 1570
> | New value = 1954
>
> ^^ the difference between the old and the new value is exactly
> 384, which is also the final underflow of the sk_wmem_queued
> that triggers the warning. Presumably if the original allocation
> had been through kmalloc-1k and not KFENCE, the difference here
> would have been 0, since ksize() of the original allocation in
> step (1) would have been 1024, and not 640 (difference of 384).
>
> | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> | at net/core/skbuff.c:1687
> | 1687 skb->truesize += size - osize;
> | (gdb) bt
> | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> | at net/core/skbuff.c:1687
> | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> | to=<optimized out>) at net/ipv4/tcp_input.c:1497
> | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> | at net/ipv4/tcp_input.c:1670
> | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> [...]
>
>
> Any of this make sense?

Very nice debugging !

I guess we could fix this in skb_prepare_for_shift(), eventually
caring for the truesize manipulation
(or reverting the change done in pskb_expand_head(), since only kfence
is having this issue.

(All TCP skbs in output path have the same allocation size for skb->head)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
*/
static int skb_prepare_for_shift(struct sk_buff *skb)
{
- return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+ unsigned int ret = 0, save;
+
+ if (skb_cloned(skb)) {
+ save = skb->truesize;
+ ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+ skb->truesize = save;
+ }
+ return ret;
}

2020-12-11 21:16:15

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <[email protected]> wrote:
> On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
[...]
> > So I started putting gdb to work, and whenever I see an allocation
> > exactly like the above that goes through tso_fragment() a warning
> > immediately follows.
> >
> > Long story short, I somehow synthesized this patch that appears to fix
> > things, but I can't explain why exactly:
> >
> > | --- a/net/core/skbuff.c
> > | +++ b/net/core/skbuff.c
> > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > |
> > | skb_metadata_clear(skb);
> > |
> > | - /* It is not generally safe to change skb->truesize.
> > | - * For the moment, we really care of rx path, or
> > | - * when skb is orphaned (not attached to a socket).
> > | - */
> > | - if (!skb->sk || skb->destructor == sock_edemux)
> > | - skb->truesize += size - osize;
> > | -
> > | return 0;
> > |
> > | nofrags:
> >
> > Now, here are the breadcrumbs I followed:
> >
> >
> > 1. Breakpoint on kfence_ksize() -- first allocation that matches the above:
> >
> > | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > | at net/ipv4/tcp_output.c:2674
> > | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > | at ./include/net/sock.h:918
> > | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > | [...]
> >
> > Set watchpoint on skb->truesize:
> >
> > | (gdb) frame 3
> > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> > | (gdb) p &skb->truesize
> > | $5 = (unsigned int *) 0xffff888117f55f90
> > | (gdb) awatch *0xffff888117f55f90
> > | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> >
> > 2. Some time later, we see that the skb with kfence-allocated data
> > is cloned:
> >
> > | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > |
> > | Value = 1570
> > | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > | 1002 C(truesize);
> > | (gdb) bt
> > | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > | rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > | sk=<optimized out>) at ./include/linux/tcp.h:439
> > | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > | at net/ipv4/tcp_output.c:2688
> > | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > | at ./include/net/sock.h:918
> > | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > [...]
> >
> >
> > 3. The original skb (that was cloned) has its truesize adjusted
> > after a pskb_expand_head():
> >
> > | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > |
> > | Old value = 1570
> > | New value = 1954
> >
> > ^^ the difference between the old and the new value is exactly
> > 384, which is also the final underflow of the sk_wmem_queued
> > that triggers the warning. Presumably if the original allocation
> > had been through kmalloc-1k and not KFENCE, the difference here
> > would have been 0, since ksize() of the original allocation in
> > step (1) would have been 1024, and not 640 (difference of 384).
> >
> > | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > | at net/core/skbuff.c:1687
> > | 1687 skb->truesize += size - osize;
> > | (gdb) bt
> > | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > | at net/core/skbuff.c:1687
> > | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > | to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > | at net/ipv4/tcp_input.c:1670
> > | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > [...]
> >
> >
> > Any of this make sense?
>
> Very nice debugging !
>
> I guess we could fix this in skb_prepare_for_shift(), eventually
> caring for the truesize manipulation
> (or reverting the change done in pskb_expand_head(), since only kfence
> is having this issue.

Phew, good to hear I finally got lucky. :-)

Either option is fine, as long as it avoids this problem in future.
Hopefully it can be fixed for 5.11.

> (All TCP skbs in output path have the same allocation size for skb->head)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> */
> static int skb_prepare_for_shift(struct sk_buff *skb)
> {
> - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> + unsigned int ret = 0, save;
> +
> + if (skb_cloned(skb)) {
> + save = skb->truesize;
> + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> + skb->truesize = save;
> + }
> + return ret;
> }

FWIW,

Tested-by: Marco Elver <[email protected]>

Thanks,
-- Marco

2020-12-14 11:08:30

by Eric Dumazet

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Mon, Dec 14, 2020 at 11:09 AM Marco Elver <[email protected]> wrote:
>
> On Thu, 10 Dec 2020 at 20:01, Marco Elver <[email protected]> wrote:
> > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <[email protected]> wrote:
> > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
> > [...]
> > > > So I started putting gdb to work, and whenever I see an allocation
> > > > exactly like the above that goes through tso_fragment() a warning
> > > > immediately follows.
> > > >
> > > > Long story short, I somehow synthesized this patch that appears to fix
> > > > things, but I can't explain why exactly:
> > > >
> > > > | --- a/net/core/skbuff.c
> > > > | +++ b/net/core/skbuff.c
> > > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > |
> > > > | skb_metadata_clear(skb);
> > > > |
> > > > | - /* It is not generally safe to change skb->truesize.
> > > > | - * For the moment, we really care of rx path, or
> > > > | - * when skb is orphaned (not attached to a socket).
> > > > | - */
> > > > | - if (!skb->sk || skb->destructor == sock_edemux)
> > > > | - skb->truesize += size - osize;
> > > > | -
> > > > | return 0;
> > > > |
> > > > | nofrags:
> > > >
> > > > Now, here are the breadcrumbs I followed:
> > > >
> > > >
> > > > 1. Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > > >
> > > > | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > > > | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > > > | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > > > | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > > | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > > > | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > > > | at net/ipv4/tcp_output.c:2674
> > > > | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > > | at ./include/net/sock.h:918
> > > > | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > > | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > > | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > > | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > > > | [...]
> > > >
> > > > Set watchpoint on skb->truesize:
> > > >
> > > > | (gdb) frame 3
> > > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> > > > | (gdb) p &skb->truesize
> > > > | $5 = (unsigned int *) 0xffff888117f55f90
> > > > | (gdb) awatch *0xffff888117f55f90
> > > > | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >
> > > > 2. Some time later, we see that the skb with kfence-allocated data
> > > > is cloned:
> > > >
> > > > | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > > |
> > > > | Value = 1570
> > > > | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > > | 1002 C(truesize);
> > > > | (gdb) bt
> > > > | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > > | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > > > | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > > > | rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > > > | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > > > | sk=<optimized out>) at ./include/linux/tcp.h:439
> > > > | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > > > | at net/ipv4/tcp_output.c:2688
> > > > | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > > | at ./include/net/sock.h:918
> > > > | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > > | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > > | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > > | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > > > | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > > > [...]
> > > >
> > > >
> > > > 3. The original skb (that was cloned) has its truesize adjusted
> > > > after a pskb_expand_head():
> > > >
> > > > | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > > |
> > > > | Old value = 1570
> > > > | New value = 1954
> > > >
> > > > ^^ the difference between the old and the new value is exactly
> > > > 384, which is also the final underflow of the sk_wmem_queued
> > > > that triggers the warning. Presumably if the original allocation
> > > > had been through kmalloc-1k and not KFENCE, the difference here
> > > > would have been 0, since ksize() of the original allocation in
> > > > step (1) would have been 1024, and not 640 (difference of 384).
> > > >
> > > > | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > > | at net/core/skbuff.c:1687
> > > > | 1687 skb->truesize += size - osize;
> > > > | (gdb) bt
> > > > | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > > | at net/core/skbuff.c:1687
> > > > | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > > > | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > > > | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > > > | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > > > | to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > > > | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > > > | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > > > | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > > > | at net/ipv4/tcp_input.c:1670
> > > > | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > > > | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > > > | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > > > | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > > > | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > > > [...]
> > > >
> > > >
> > > > Any of this make sense?
> > >
> > > Very nice debugging !
> > >
> > > I guess we could fix this in skb_prepare_for_shift(), eventually
> > > caring for the truesize manipulation
> > > (or reverting the change done in pskb_expand_head(), since only kfence
> > > is having this issue.
> >
> > Phew, good to hear I finally got lucky. :-)
> >
> > Either option is fine, as long as it avoids this problem in future.
> > Hopefully it can be fixed for 5.11.
> >
> > > (All TCP skbs in output path have the same allocation size for skb->head)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > > */
> > > static int skb_prepare_for_shift(struct sk_buff *skb)
> > > {
> > > - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > + unsigned int ret = 0, save;
> > > +
> > > + if (skb_cloned(skb)) {
> > > + save = skb->truesize;
> > > + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > + skb->truesize = save;
> > > + }
> > > + return ret;
> > > }
> >
> > FWIW,
> >
> > Tested-by: Marco Elver <[email protected]>
>
> Has this patch, or similar, already been sent?

Not yet, we have few weeks left before 5.11 is released ;)

2020-12-14 14:26:21

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Thu, 10 Dec 2020 at 20:01, Marco Elver <[email protected]> wrote:
> On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <[email protected]> wrote:
> > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
> [...]
> > > So I started putting gdb to work, and whenever I see an allocation
> > > exactly like the above that goes through tso_fragment() a warning
> > > immediately follows.
> > >
> > > Long story short, I somehow synthesized this patch that appears to fix
> > > things, but I can't explain why exactly:
> > >
> > > | --- a/net/core/skbuff.c
> > > | +++ b/net/core/skbuff.c
> > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > |
> > > | skb_metadata_clear(skb);
> > > |
> > > | - /* It is not generally safe to change skb->truesize.
> > > | - * For the moment, we really care of rx path, or
> > > | - * when skb is orphaned (not attached to a socket).
> > > | - */
> > > | - if (!skb->sk || skb->destructor == sock_edemux)
> > > | - skb->truesize += size - osize;
> > > | -
> > > | return 0;
> > > |
> > > | nofrags:
> > >
> > > Now, here are the breadcrumbs I followed:
> > >
> > >
> > > 1. Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > >
> > > | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > > | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > > | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > > | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > > | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > > | at net/ipv4/tcp_output.c:2674
> > > | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > | at ./include/net/sock.h:918
> > > | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > > | [...]
> > >
> > > Set watchpoint on skb->truesize:
> > >
> > > | (gdb) frame 3
> > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> > > | (gdb) p &skb->truesize
> > > | $5 = (unsigned int *) 0xffff888117f55f90
> > > | (gdb) awatch *0xffff888117f55f90
> > > | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > >
> > > 2. Some time later, we see that the skb with kfence-allocated data
> > > is cloned:
> > >
> > > | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > |
> > > | Value = 1570
> > > | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > | 1002 C(truesize);
> > > | (gdb) bt
> > > | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > > | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > > | rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > > | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > > | sk=<optimized out>) at ./include/linux/tcp.h:439
> > > | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > > | at net/ipv4/tcp_output.c:2688
> > > | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > | at ./include/net/sock.h:918
> > > | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > > | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > > [...]
> > >
> > >
> > > 3. The original skb (that was cloned) has its truesize adjusted
> > > after a pskb_expand_head():
> > >
> > > | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > |
> > > | Old value = 1570
> > > | New value = 1954
> > >
> > > ^^ the difference between the old and the new value is exactly
> > > 384, which is also the final underflow of the sk_wmem_queued
> > > that triggers the warning. Presumably if the original allocation
> > > had been through kmalloc-1k and not KFENCE, the difference here
> > > would have been 0, since ksize() of the original allocation in
> > > step (1) would have been 1024, and not 640 (difference of 384).
> > >
> > > | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > | at net/core/skbuff.c:1687
> > > | 1687 skb->truesize += size - osize;
> > > | (gdb) bt
> > > | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > | at net/core/skbuff.c:1687
> > > | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > > | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > > | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > > | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > > | to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > > | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > > | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > > | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > > | at net/ipv4/tcp_input.c:1670
> > > | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > > | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > > | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > > | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > > | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > > [...]
> > >
> > >
> > > Any of this make sense?
> >
> > Very nice debugging !
> >
> > I guess we could fix this in skb_prepare_for_shift(), eventually
> > caring for the truesize manipulation
> > (or reverting the change done in pskb_expand_head(), since only kfence
> > is having this issue.
>
> Phew, good to hear I finally got lucky. :-)
>
> Either option is fine, as long as it avoids this problem in future.
> Hopefully it can be fixed for 5.11.
>
> > (All TCP skbs in output path have the same allocation size for skb->head)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > */
> > static int skb_prepare_for_shift(struct sk_buff *skb)
> > {
> > - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > + unsigned int ret = 0, save;
> > +
> > + if (skb_cloned(skb)) {
> > + save = skb->truesize;
> > + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > + skb->truesize = save;
> > + }
> > + return ret;
> > }
>
> FWIW,
>
> Tested-by: Marco Elver <[email protected]>

Has this patch, or similar, already been sent?

Thanks,
-- Marco

2021-01-11 08:59:47

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Mon, 14 Dec 2020 at 11:47, Eric Dumazet <[email protected]> wrote:
> On Mon, Dec 14, 2020 at 11:09 AM Marco Elver <[email protected]> wrote:
> > On Thu, 10 Dec 2020 at 20:01, Marco Elver <[email protected]> wrote:
> > > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <[email protected]> wrote:
> > > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
[...]
> > > Either option is fine, as long as it avoids this problem in future.
> > > Hopefully it can be fixed for 5.11.
> > >
> > > > (All TCP skbs in output path have the same allocation size for skb->head)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > > 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > > > */
> > > > static int skb_prepare_for_shift(struct sk_buff *skb)
> > > > {
> > > > - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > > + unsigned int ret = 0, save;
> > > > +
> > > > + if (skb_cloned(skb)) {
> > > > + save = skb->truesize;
> > > > + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > > + skb->truesize = save;
> > > > + }
> > > > + return ret;
> > > > }
> > >
> > > FWIW,
> > >
> > > Tested-by: Marco Elver <[email protected]>
> >
> > Has this patch, or similar, already been sent?
>
>
> Not yet, we have few weeks left before 5.11 is released ;)

Ping. Though KFENCE has been dropped from 5.11, just a reminder so we
get this fixed for 5.12.

Thanks,
-- Marco

2021-02-03 10:28:38

by Marco Elver

[permalink] [raw]

Subject: Re: WARNING in sk_stream_kill_queues (5)

On Mon, 14 Dec 2020 at 11:09, Marco Elver <[email protected]> wrote:
> On Thu, 10 Dec 2020 at 20:01, Marco Elver <[email protected]> wrote:
> > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <[email protected]> wrote:
> > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <[email protected]> wrote:
> > [...]
> > > > So I started putting gdb to work, and whenever I see an allocation
> > > > exactly like the above that goes through tso_fragment() a warning
> > > > immediately follows.
> > > >
> > > > Long story short, I somehow synthesized this patch that appears to fix
> > > > things, but I can't explain why exactly:
> > > >
> > > > | --- a/net/core/skbuff.c
> > > > | +++ b/net/core/skbuff.c
> > > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > |
> > > > | skb_metadata_clear(skb);
> > > > |
> > > > | - /* It is not generally safe to change skb->truesize.
> > > > | - * For the moment, we really care of rx path, or
> > > > | - * when skb is orphaned (not attached to a socket).
> > > > | - */
> > > > | - if (!skb->sk || skb->destructor == sock_edemux)
> > > > | - skb->truesize += size - osize;
> > > > | -
> > > > | return 0;
> > > > |
> > > > | nofrags:
> > > >
> > > > Now, here are the breadcrumbs I followed:
> > > >
> > > >
> > > > 1. Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > > >
> > > > | #0 __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > > > | #1 0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > > > | #2 0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | #4 0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > > > | #5 sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > > | #6 0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > > > | #7 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > > > | at net/ipv4/tcp_output.c:2674
> > > > | #8 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > > | at ./include/net/sock.h:918
> > > > | #9 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > > | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > > | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > > | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > > > | [...]
> > > >
> > > > Set watchpoint on skb->truesize:
> > > >
> > > > | (gdb) frame 3
> > > > | #3 0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> > > > | (gdb) p &skb->truesize
> > > > | $5 = (unsigned int *) 0xffff888117f55f90
> > > > | (gdb) awatch *0xffff888117f55f90
> > > > | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >
> > > > 2. Some time later, we see that the skb with kfence-allocated data
> > > > is cloned:
> > > >
> > > > | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > > |
> > > > | Value = 1570
> > > > | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > > | 1002 C(truesize);
> > > > | (gdb) bt
> > > > | #0 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > > | #1 0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > > > | #2 0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > > > | rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > > > | #3 0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > > > | sk=<optimized out>) at ./include/linux/tcp.h:439
> > > > | #4 tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > > > | at net/ipv4/tcp_output.c:2688
> > > > | #5 0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > > | at ./include/net/sock.h:918
> > > > | #6 0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > > | #7 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > > | #8 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > > | #9 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > > > | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > > > [...]
> > > >
> > > >
> > > > 3. The original skb (that was cloned) has its truesize adjusted
> > > > after a pskb_expand_head():
> > > >
> > > > | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > > |
> > > > | Old value = 1570
> > > > | New value = 1954
> > > >
> > > > ^^ the difference between the old and the new value is exactly
> > > > 384, which is also the final underflow of the sk_wmem_queued
> > > > that triggers the warning. Presumably if the original allocation
> > > > had been through kmalloc-1k and not KFENCE, the difference here
> > > > would have been 0, since ksize() of the original allocation in
> > > > step (1) would have been 1024, and not 640 (difference of 384).
> > > >
> > > > | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > > | at net/core/skbuff.c:1687
> > > > | 1687 skb->truesize += size - osize;
> > > > | (gdb) bt
> > > > | #0 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > > | at net/core/skbuff.c:1687
> > > > | #1 0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > > > | #2 skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > > > | #3 0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > > > | #4 0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > > > | to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > > > | #5 tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > > > | skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > > > | #6 tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > > > | state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > > > | at net/ipv4/tcp_input.c:1670
> > > > | #7 0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > > > | state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > > > | #8 0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > > > | #9 0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > > > | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > > > [...]
> > > >
> > > >
> > > > Any of this make sense?
> > >
> > > Very nice debugging !
> > >
> > > I guess we could fix this in skb_prepare_for_shift(), eventually
> > > caring for the truesize manipulation
> > > (or reverting the change done in pskb_expand_head(), since only kfence
> > > is having this issue.
> >
> > Phew, good to hear I finally got lucky. :-)
> >
> > Either option is fine, as long as it avoids this problem in future.
> > Hopefully it can be fixed for 5.11.
> >
> > > (All TCP skbs in output path have the same allocation size for skb->head)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > > */
> > > static int skb_prepare_for_shift(struct sk_buff *skb)
> > > {
> > > - return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > + unsigned int ret = 0, save;
> > > +
> > > + if (skb_cloned(skb)) {
> > > + save = skb->truesize;
> > > + ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > + skb->truesize = save;
> > > + }
> > > + return ret;
> > > }
> >
> > FWIW,
> >
> > Tested-by: Marco Elver <[email protected]>
>
> Has this patch, or similar, already been sent?

This is now in net-next:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=097b9146c0e2

Thanks,
-- Marco