Hello,
syzbot found the following issue on:
HEAD commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
git tree: upstream
console+strace: https://syzkaller.appspot.com/x/log.txt?x=16c5cd8ca80000
kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/991cea284d12/disk-3f01e9fe.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/13051b4d971a/vmlinux-3f01e9fe.xz
kernel image: https://storage.googleapis.com/syzbot-assets/31cc7c3c2596/bzImage-3f01e9fe.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
INFO: task syz-executor421:5031 blocked for more than 143 seconds.
Not tainted 6.5.0-rc1-syzkaller-00006-g3f01e9fed845 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor421 state:D stack:27728 pid:5031 ppid:5021 flags:0x00004006
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5381 [inline]
__schedule+0xc9a/0x5880 kernel/sched/core.c:6710
schedule+0xde/0x1a0 kernel/sched/core.c:6786
schedule_preempt_disabled+0x13/0x20 kernel/sched/core.c:6845
__mutex_lock_common kernel/locking/mutex.c:679 [inline]
__mutex_lock+0xa3b/0x1350 kernel/locking/mutex.c:747
__pipe_lock fs/pipe.c:103 [inline]
pipe_release+0x4d/0x310 fs/pipe.c:721
__fput+0x40c/0xad0 fs/file_table.c:384
task_work_run+0x16f/0x270 kernel/task_work.c:179
ptrace_notify+0x118/0x140 kernel/signal.c:2372
ptrace_report_syscall include/linux/ptrace.h:411 [inline]
ptrace_report_syscall_exit include/linux/ptrace.h:473 [inline]
syscall_exit_work kernel/entry/common.c:252 [inline]
syscall_exit_to_user_mode_prepare+0x129/0x220 kernel/entry/common.c:279
__syscall_exit_to_user_mode_work kernel/entry/common.c:284 [inline]
syscall_exit_to_user_mode+0xd/0x50 kernel/entry/common.c:297
do_syscall_64+0x46/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fbe8982b7da
RSP: 002b:00007ffd74cb3820 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007fbe8982b7da
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 000000000000000a R08: 0000000000000000 R09: 00007ffd74d90080
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000a9f7
R13: 000000000000aa29 R14: 00007fbe898b642c R15: 00007fbe898b6420
</TASK>
Showing all locks held in the system:
1 lock held by rcu_tasks_kthre/13:
#0: ffffffff8c9a3af0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:522
1 lock held by rcu_tasks_trace/14:
#0: ffffffff8c9a37f0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:522
1 lock held by khungtaskd/28:
#0: ffffffff8c9a4700 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340 kernel/locking/lockdep.c:6615
2 locks held by getty/4773:
#0: ffff88802d33a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x26/0x80 drivers/tty/tty_ldisc.c:243
#1: ffffc900015c02f0 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0xf08/0x13f0 drivers/tty/n_tty.c:2187
1 lock held by syz-executor421/5031:
#0: ffff888022606868 (&pipe->mutex/1){+.+.}-{3:3}, at: __pipe_lock fs/pipe.c:103 [inline]
#0: ffff888022606868 (&pipe->mutex/1){+.+.}-{3:3}, at: pipe_release+0x4d/0x310 fs/pipe.c:721
2 locks held by syz-executor421/5032:
=============================================
NMI backtrace for cpu 1
CPU: 1 PID: 28 Comm: khungtaskd Not tainted 6.5.0-rc1-syzkaller-00006-g3f01e9fed845 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
nmi_cpu_backtrace+0x29c/0x350 lib/nmi_backtrace.c:113
nmi_trigger_cpumask_backtrace+0x2a4/0x300 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:160 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:222 [inline]
watchdog+0xe16/0x1090 kernel/hung_task.c:379
kthread+0x344/0x440 kernel/kthread.c:389
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 5032 Comm: syz-executor421 Not tainted 6.5.0-rc1-syzkaller-00006-g3f01e9fed845 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
RIP: 0010:__ip_append_data+0x957/0x3c90 net/ipv4/ip_output.c:1198
Code: 31 ff 29 eb 89 de e8 d8 6d e9 f8 85 db 0f 8e b5 1a 00 00 e8 ab 71 e9 f8 48 8b 44 24 38 41 39 de 41 0f 4e de 41 89 dc 80 38 00 <0f> 85 15 29 00 00 48 8b 44 24 18 48 8b 18 48 8d bb f0 00 00 00 48
RSP: 0018:ffffc90003b9f520 EFLAGS: 00000246
RAX: ffffed100fab0f00 RBX: 0000000000000006 RCX: 0000000000000000
RDX: ffff888019e95940 RSI: ffffffff889b7015 RDI: 0000000000000005
RBP: 000000000000001c R08: 0000000000000005 R09: 0000000000000000
R10: 000000000000057e R11: 0000000000000001 R12: 0000000000000006
R13: dffffc0000000000 R14: 0000000000000006 R15: ffff8880132dd3c0
FS: 00007fbe897e96c0(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005598ca226780 CR3: 00000000193f8000 CR4: 0000000000350ef0
Call Trace:
<NMI>
</NMI>
<TASK>
ip_append_data net/ipv4/ip_output.c:1352 [inline]
ip_append_data+0x115/0x1a0 net/ipv4/ip_output.c:1331
udp_sendmsg+0x881/0x2840 net/ipv4/udp.c:1287
inet_sendmsg+0x9d/0xe0 net/ipv4/af_inet.c:830
sock_sendmsg_nosec net/socket.c:725 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:748
splice_to_socket+0x964/0xee0 fs/splice.c:882
do_splice_from fs/splice.c:934 [inline]
do_splice+0xb8a/0x1ec0 fs/splice.c:1295
__do_splice+0x14e/0x270 fs/splice.c:1373
__do_sys_splice fs/splice.c:1584 [inline]
__se_sys_splice fs/splice.c:1566 [inline]
__x64_sys_splice+0x19c/0x250 fs/splice.c:1566
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fbe8982c519
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 61 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fbe897e9218 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
RAX: ffffffffffffffda RBX: 00007fbe898b6428 RCX: 00007fbe8982c519
RDX: 0000000000000005 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007fbe898b6420 R08: 000000000004ffe0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fbe898b642c
R13: 00007fbe898834f4 R14: 04000000000003bd R15: 00007ffd74cb3758
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the bug is already fixed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to change bug's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the bug is a duplicate of another bug, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
syzbot has bisected this issue to:
commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
Author: David Howells <[email protected]>
Date: Mon May 22 12:11:22 2023 +0000
udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
git tree: upstream
final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
Reported-by: [email protected]
Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
On Wed, 12 Jul 2023 11:54:33 -0700 syzbot wrote:
> syzbot has bisected this issue to:
>
> commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> Author: David Howells <[email protected]>
> Date: Mon May 22 12:11:22 2023 +0000
>
> udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> git tree: upstream
> final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
>
> Reported-by: [email protected]
> Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
Hi David, any ideas about this one? Looks like it triggers on fairly
recent upstream?
Note that the test program is dodgy:
pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
r2 = socket$inet_udp(0x2, 0x2, 0x0)
r2 is closed here:
close(r2)
r3 = socket$inet_udp(0x2, 0x2, 0x0)
setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
but then used here:
splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
As it happens, r3 will probably end up referring to the same fd as r2 did, but
that's not guaranteed.
David
Jakub Kicinski <[email protected]> wrote:
> Hi David, any ideas about this one? Looks like it triggers on fairly
> recent upstream?
I've managed to reproduce it finally. Instrumenting the pipe_lock/unlock
functions, splice_to_socket() and pipe_release() seems to show that
pipe_release() is being called whilst splice_to_socket() is still running.
I *think* syzbot is arranging things such that splice_to_socket() takes a
significant amount of time so that another thread can close the socket as it
exits.
In this sample logging, the pipe is created by pid 7101:
[ 66.205719] --pipe 7101
[ 66.209942] lock
[ 66.212526] locked
[ 66.215344] unlock
[ 66.218103] unlocked
splice begins in 7101 also and locks the pipe:
[ 66.221057] ==>splice_to_socket() 7101
[ 66.225596] lock
[ 66.228177] locked
but for some reason, pid 7100 then tries to release it:
[ 66.377781] release 7100
and hangs on the __pipe_lock() call in pipe_release():
[ 66.381059] lock
The syz reproducer does weird things with threading - and I'm wondering if
there's a file struct refcount bug here. Note that splice_to_socket() can't
access the pipe file structs to alter the refcount, and the involved pipe
isn't communicated to udp_sendmsg() in any way - so if there is a refcount
bug, it must be somewhere in the VFS, the pipe driver or the splice
infrastructure:-/.
I'm also not sure what's going on inside udp_sendmsg() as yet. It doesn't
show a stack in /proc/7101/stacks, which means it doesn't hit a schedule().
David
David Howells <[email protected]> wrote:
> I've managed to reproduce it finally. Instrumenting the pipe_lock/unlock
> functions, splice_to_socket() and pipe_release() seems to show that
> pipe_release() is being called whilst splice_to_socket() is still running.
That's actually a bit of a red herring. pipe_release() is so-called because
it's called as the release file op for an end of the pipe. It doesn't
automatically free the pipe_inode_info struct - there's refcounting on that.
So the problem is that udp_sendmsg() didn't return; pipe_release() hanging on
the pipe_lock() is merely a noisy symptom thereof.
David
Hi Jakub, Willem,
I think I'm going to need your help with this one.
> > syzbot has bisected this issue to:
> >
> > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > Author: David Howells <[email protected]>
> > Date: Mon May 22 12:11:22 2023 +0000
> >
> > udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> >
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > git tree: upstream
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> >
> > Reported-by: [email protected]
> > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> >
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
The issue that syzbot is triggering seems to be something to do with the
calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
MSG_SPLICE_PAGES is in operation.
What seems to happen is that the test program uses sendmsg() + MSG_MORE to
loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
splices in 8 extra bytes.
r3 = socket$inet_udp(0x2, 0x2, 0x0)
setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
This results in some negative intermediate values turning up in the
calculations - and this results in the remaining length being made longer
from 8 to 14.
I added some printks (patch attached), resulting in the attached tracelines:
==>splice_to_socket() 7099
udp_sendmsg(8,8)
__ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
pagedlen 14 = 14 - 0
copy -6 = 14 - 0 - 6 - 14
length 8 -= -6 + 0
__ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
copy=1414 len=14
skb_splice_from_iter(8,14)
__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
copy=1406 len=6
skb_splice_from_iter(0,6)
__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
copy=1406 len=6
skb_splice_from_iter(0,6)
__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
copy=1406 len=6
skb_splice_from_iter(0,6)
__ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
copy=1406 len=6
skb_splice_from_iter(0,6)
copy=1406 len=6
skb_splice_from_iter(0,6)
...
'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
less than the amount of data then in the packet (skblen=1434).
'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
(0) - fraggap (6) - pagedlen (14).
datalen is 14 because it was length (8) + fraggap (6).
Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
msg_iter.count is 0 and the length to be copied is 6. It always returns 0
because there's nothing to copy, and so __ip_append_data() cycles round the
loop endlessly.
Any suggestion as to how to fix this?
Thanks,
David
---
Debug hang in pipe_release's pipe_lock
---
fs/splice.c | 3 +++
net/core/skbuff.c | 7 +++++++
net/ipv4/ip_output.c | 24 ++++++++++++++++++++++++
net/ipv4/udp.c | 3 +++
4 files changed, 37 insertions(+)
diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..9ee82b818bd6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -801,6 +801,8 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
size_t spliced = 0;
bool need_wakeup = false;
+ printk("==>splice_to_socket() %u\n", current->pid);
+
pipe_lock(pipe);
while (len > 0) {
@@ -911,6 +913,7 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
pipe_unlock(pipe);
if (need_wakeup)
wakeup_pipe_writers(pipe);
+ printk("<==splice_to_socket() = %zd\n", spliced ?: ret);
return spliced ?: ret;
}
#endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..c3d60da9e3f7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6801,6 +6801,13 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
ssize_t spliced = 0, ret = 0;
unsigned int i;
+ static int __pcount;
+
+ if (__pcount < 6) {
+ printk("skb_splice_from_iter(%zu,%zd)\n", iter->count, maxsize);
+ __pcount++;
+ }
+
while (iter->count > 0) {
ssize_t space, nr, len;
size_t off;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..8c84a7d13627 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1066,6 +1066,14 @@ static int __ip_append_data(struct sock *sk,
copy = mtu - skb->len;
if (copy < length)
copy = maxfraglen - skb->len;
+ if (flags & MSG_SPLICE_PAGES) {
+ static int __pcount;
+ if (__pcount < 6) {
+ printk("__ip_append_data(copy=%d,len=%d, mtu=%d skblen=%d maxfl=%d)\n",
+ copy, length, mtu, skb->len, maxfraglen);
+ __pcount++;
+ }
+ }
if (copy <= 0) {
char *data;
unsigned int datalen;
@@ -1112,6 +1120,10 @@ static int __ip_append_data(struct sock *sk,
else {
alloclen = fragheaderlen + transhdrlen;
pagedlen = datalen - transhdrlen;
+ if (flags & MSG_SPLICE_PAGES) {
+ printk("pagedlen %d = %d - %d\n",
+ pagedlen, datalen, transhdrlen);
+ }
}
alloclen += alloc_extra;
@@ -1158,6 +1170,9 @@ static int __ip_append_data(struct sock *sk,
}
copy = datalen - transhdrlen - fraggap - pagedlen;
+ if (flags & MSG_SPLICE_PAGES)
+ printk("copy %d = %d - %d - %d - %d\n",
+ copy, datalen, transhdrlen, fraggap, pagedlen);
if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
@@ -1165,6 +1180,8 @@ static int __ip_append_data(struct sock *sk,
}
offset += copy;
+ if (flags & MSG_SPLICE_PAGES)
+ printk("length %d -= %d + %d\n", length, copy, transhdrlen);
length -= copy + transhdrlen;
transhdrlen = 0;
exthdrlen = 0;
@@ -1192,6 +1209,13 @@ static int __ip_append_data(struct sock *sk,
continue;
}
+ if (flags & MSG_SPLICE_PAGES) {
+ static int __qcount;
+ if (__qcount < 6) {
+ printk("copy=%d len=%d\n", copy, length);
+ __qcount++;
+ }
+ }
if (copy > length)
copy = length;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 42a96b3547c9..bd3f4e62574b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1081,6 +1081,9 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
return -EOPNOTSUPP;
+ if (msg->msg_flags & MSG_SPLICE_PAGES)
+ printk("udp_sendmsg(%zx,%zx)\n", msg->msg_iter.count, len);
+
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
fl4 = &inet->cork.fl.u.ip4;
Hello,
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-and-tested-by: [email protected]
Tested on:
commit: 3f01e9fe Merge tag 'linux-watchdog-6.5-rc2' of git://w..
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
console output: https://syzkaller.appspot.com/x/log.txt?x=1208a829a80000
kernel config: https://syzkaller.appspot.com/x/.config?x=87ef4ae61b35aae1
dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
patch: https://syzkaller.appspot.com/x/patch.diff?x=14dbfef6a80000
Note: testing is done by a robot and is best-effort only.
David Howells wrote:
> Hi Jakub, Willem,
>
> I think I'm going to need your help with this one.
>
> > > syzbot has bisected this issue to:
> > >
> > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > > Author: David Howells <[email protected]>
> > > Date: Mon May 22 12:11:22 2023 +0000
> > >
> > > udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> > >
> > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > > start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > > git tree: upstream
> > > final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > > kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> > >
> > > Reported-by: [email protected]
> > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> > >
> > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
> The issue that syzbot is triggering seems to be something to do with the
> calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
> MSG_SPLICE_PAGES is in operation.
>
> What seems to happen is that the test program uses sendmsg() + MSG_MORE to
> loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
> splices in 8 extra bytes.
>
> r3 = socket$inet_udp(0x2, 0x2, 0x0)
> setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
> bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
> connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
> sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
> write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
> splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
>
> This results in some negative intermediate values turning up in the
> calculations - and this results in the remaining length being made longer
> from 8 to 14.
>
> I added some printks (patch attached), resulting in the attached tracelines:
>
> ==>splice_to_socket() 7099
> udp_sendmsg(8,8)
> __ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
> pagedlen 14 = 14 - 0
> copy -6 = 14 - 0 - 6 - 14
> length 8 -= -6 + 0
> __ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
> copy=1414 len=14
> skb_splice_from_iter(8,14)
> __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> copy=1406 len=6
> skb_splice_from_iter(0,6)
> __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> copy=1406 len=6
> skb_splice_from_iter(0,6)
> __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> copy=1406 len=6
> skb_splice_from_iter(0,6)
> __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> copy=1406 len=6
> skb_splice_from_iter(0,6)
> copy=1406 len=6
> skb_splice_from_iter(0,6)
> ...
>
> 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
> less than the amount of data then in the packet (skblen=1434).
>
> 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
> (0) - fraggap (6) - pagedlen (14).
>
> datalen is 14 because it was length (8) + fraggap (6).
>
> Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
> msg_iter.count is 0 and the length to be copied is 6. It always returns 0
> because there's nothing to copy, and so __ip_append_data() cycles round the
> loop endlessly.
>
> Any suggestion as to how to fix this?
Still ingesting.
The syzkaller repro runs in threaded mode, so syscalls should not be
interpreted in order.
There are two seemingly (but evidently not really) independent
operations:
One involving splicing
pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
r2 = socket$inet_udp(0x2, 0x2, 0x0)
write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
close(r2)
And separately the MSG_MORE transmission that you mentioned
r3 = socket$inet_udp(0x2, 0x2, 0x0)
setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That
is likely not some random noise in the program (but that can be easily
checked, by removing it -- assuming the repro triggers quickly).
Question is whether the infinite loop happens on the r2, the socket
involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST.
If the second, on the surface it would seem that splicing is entirely
unrelated.
Willem de Bruijn wrote:
> David Howells wrote:
> > Hi Jakub, Willem,
> >
> > I think I'm going to need your help with this one.
> >
> > > > syzbot has bisected this issue to:
> > > >
> > > > commit 7ac7c987850c3ec617c778f7bd871804dc1c648d
> > > > Author: David Howells <[email protected]>
> > > > Date: Mon May 22 12:11:22 2023 +0000
> > > >
> > > > udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
> > > >
> > > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15853bcaa80000
> > > > start commit: 3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
> > > > git tree: upstream
> > > > final oops: https://syzkaller.appspot.com/x/report.txt?x=17853bcaa80000
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=13853bcaa80000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12a86682a80000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1520ab6ca80000
> > > >
> > > > Reported-by: [email protected]
> > > > Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
> > > >
> > > > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> >
> > The issue that syzbot is triggering seems to be something to do with the
> > calculations in the "if (copy <= 0) { ... }" chunk in __ip_append_data() when
> > MSG_SPLICE_PAGES is in operation.
> >
> > What seems to happen is that the test program uses sendmsg() + MSG_MORE to
> > loads a UDP packet with 1406 bytes of data to the MTU size (1434) and then
> > splices in 8 extra bytes.
> >
> > r3 = socket$inet_udp(0x2, 0x2, 0x0)
> > setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
> > bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
> > connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
> > sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
> > write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
> > splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
> >
> > This results in some negative intermediate values turning up in the
> > calculations - and this results in the remaining length being made longer
> > from 8 to 14.
> >
> > I added some printks (patch attached), resulting in the attached tracelines:
> >
> > ==>splice_to_socket() 7099
> > udp_sendmsg(8,8)
> > __ip_append_data(copy=-6,len=8, mtu=1434 skblen=1434 maxfl=1428)
> > pagedlen 14 = 14 - 0
> > copy -6 = 14 - 0 - 6 - 14
> > length 8 -= -6 + 0
> > __ip_append_data(copy=1414,len=14, mtu=1434 skblen=20 maxfl=1428)
> > copy=1414 len=14
> > skb_splice_from_iter(8,14)
> > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > copy=1406 len=6
> > skb_splice_from_iter(0,6)
> > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > copy=1406 len=6
> > skb_splice_from_iter(0,6)
> > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > copy=1406 len=6
> > skb_splice_from_iter(0,6)
> > __ip_append_data(copy=1406,len=6, mtu=1434 skblen=28 maxfl=1428)
> > copy=1406 len=6
> > skb_splice_from_iter(0,6)
> > copy=1406 len=6
> > skb_splice_from_iter(0,6)
> > ...
> >
> > 'copy' gets calculated as -6 because the maxfraglen (maxfl=1428) is 8 bytes
> > less than the amount of data then in the packet (skblen=1434).
> >
> > 'copy' gets recalculated part way down as -6 from datalen (14) - transhdrlen
> > (0) - fraggap (6) - pagedlen (14).
> >
> > datalen is 14 because it was length (8) + fraggap (6).
> >
> > Inside skb_splice_from_iter(), we eventually end up in an enless loop in which
> > msg_iter.count is 0 and the length to be copied is 6. It always returns 0
> > because there's nothing to copy, and so __ip_append_data() cycles round the
> > loop endlessly.
> >
> > Any suggestion as to how to fix this?
>
> Still ingesting.
>
> The syzkaller repro runs in threaded mode, so syscalls should not be
> interpreted in order.
>
> There are two seemingly (but evidently not really) independent
> operations:
>
> One involving splicing
>
> pipe(&(0x7f0000000100)={<r0=>0xffffffffffffffff, <r1=>0xffffffffffffffff})
> r2 = socket$inet_udp(0x2, 0x2, 0x0)
> write$binfmt_misc(r1, &(0x7f0000000440)=ANY=[], 0x8)
> splice(r0, 0x0, r2, 0x0, 0x4ffe0, 0x0)
> close(r2)
>
> And separately the MSG_MORE transmission that you mentioned
>
> r3 = socket$inet_udp(0x2, 0x2, 0x0)
> setsockopt$sock_int(r3, 0x1, 0x6, &(0x7f0000000140)=0x32, 0x4)
> bind$inet(r3, &(0x7f0000000000)={0x2, 0x0, @dev={0xac, 0x14, 0x14, 0x15}}, 0x10)
> connect$inet(r3, &(0x7f0000000200)={0x2, 0x0, @broadcast}, 0x10)
> sendmmsg(r3, &(0x7f0000000180)=[{{0x0, 0x0, 0x0}}, {{0x0, 0xfffffffffffffed3, &(0x7f0000000940)=[{&(0x7f00000006c0)='O', 0x57e}], 0x1}}], 0x4000000000003bd, 0x8800)
>
> This second program also sets setsockopt SOL_SOCKET/SO_BROADCAST. That
> is likely not some random noise in the program (but that can be easily
> checked, by removing it -- assuming the repro triggers quickly).
>
> Question is whether the infinite loop happens on the r2, the socket
> involving MSG_SPLICE_PAGES, or r3, the socket involving SO_BROADCAST.
> If the second, on the surface it would seem that splicing is entirely
> unrelated.
I still don't entirely follow how the splice and sendmmsg end up on
the same socket.
Also, the sendmmsg in the case on the dashboard is very odd, a vlen of
0x4000000000003bd and flags MSG_MORE | MSG_CONFIRM. But maybe other
runs have more sensible flags here.
The issue appears to be with appending through splicing to an skb that
exceeds the length with fragments, triggering the if (fraggap) branch
to copy some trailer from skb_prev to skb, then appending the spliced
data.
Perhaps not an true fix based on understanding in detail, but one way
out may be to disable splicing if !transhdrlen (which ip_append_data
clears if appending).
Willem de Bruijn <[email protected]> wrote:
> The syzkaller repro runs in threaded mode, so syscalls should not be
> interpreted in order.
I think they are actually ordered. It's kind of weirdly done, though,
flipping back and forth between threads and using futexes for synchronisation.
David
Hi Willem,
Here's a reduced testcase. I doesn't require anything special; the key is
that the amount of data placed in the packet by the send() - it's related to
the MTU size. It needs to stuff in sufficient data to go over the
fragmentation limit (I think).
In this case, my interface's MTU is 8192. send() is sticking in 8161 bytes of
data and then the output from the aforeposted debugging patch is:
==>splice_to_socket() 6630
udp_sendmsg(8,8)
__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
pagedlen 9 = 9 - 0
copy -1 = 9 - 0 - 1 - 9
length 8 -= -1 + 0
__ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188)
copy=8172 len=9
skb_splice_from_iter(8,9)
__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
copy=8164 len=1
skb_splice_from_iter(0,1)
__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
copy=8164 len=1
skb_splice_from_iter(0,1)
__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
copy=8164 len=1
skb_splice_from_iter(0,1)
__ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
copy=8164 len=1
skb_splice_from_iter(0,1)
copy=8164 len=1
skb_splice_from_iter(0,1)
It looks like send() pushes 1 byte over the fragmentation limit, then the
splice sees -1 crop up, the length to be copied is increased by 1, but
insufficient data is available and we go into an endless loop.
---
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/uio.h>
#define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0)
int main()
{
struct sockaddr_storage ss;
struct sockaddr_in sin;
void *buffer;
unsigned int tmp;
int pfd[2], sfd;
int res;
OSERROR(pipe(pfd), "pipe");
sfd = socket(AF_INET, SOCK_DGRAM, 0);
OSERROR(sfd, "socket/2");
memset(&sin, 0, sizeof(sin));
sin.sin_family = AF_INET;
sin.sin_port = htons(0);
sin.sin_addr.s_addr = htonl(0xc0a80601);
#warning you might want to set the address here - this is 192.168.6.1
OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");
buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
OSERROR(buffer, "mmap");
OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send");
#warning you need to adjust the length on the above line to match your MTU
OSERROR(write(pfd[1], buffer, 8), "write");
OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice");
return 0;
}
David Howells wrote:
> Hi Willem,
>
> Here's a reduced testcase. I doesn't require anything special; the key is
> that the amount of data placed in the packet by the send() - it's related to
> the MTU size. It needs to stuff in sufficient data to go over the
> fragmentation limit (I think).
>
> In this case, my interface's MTU is 8192. send() is sticking in 8161 bytes of
> data and then the output from the aforeposted debugging patch is:
>
> ==>splice_to_socket() 6630
> udp_sendmsg(8,8)
> __ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
> pagedlen 9 = 9 - 0
> copy -1 = 9 - 0 - 1 - 9
> length 8 -= -1 + 0
> __ip_append_data(copy=8172,len=9, mtu=8192 skblen=20 maxfl=8188)
> copy=8172 len=9
> skb_splice_from_iter(8,9)
> __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> copy=8164 len=1
> skb_splice_from_iter(0,1)
> __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> copy=8164 len=1
> skb_splice_from_iter(0,1)
> __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> copy=8164 len=1
> skb_splice_from_iter(0,1)
> __ip_append_data(copy=8164,len=1, mtu=8192 skblen=28 maxfl=8188)
> copy=8164 len=1
> skb_splice_from_iter(0,1)
> copy=8164 len=1
> skb_splice_from_iter(0,1)
>
> It looks like send() pushes 1 byte over the fragmentation limit, then the
> splice sees -1 crop up, the length to be copied is increased by 1, but
> insufficient data is available and we go into an endless loop.
>
> ---
> #define _GNU_SOURCE
> #include <arpa/inet.h>
> #include <fcntl.h>
> #include <netinet/in.h>
> #include <stdarg.h>
> #include <stdbool.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/socket.h>
> #include <sys/mman.h>
> #include <sys/uio.h>
>
> #define OSERROR(R, S) do { if ((long)(R) == -1L) { perror((S)); exit(1); } } while(0)
>
> int main()
> {
> struct sockaddr_storage ss;
> struct sockaddr_in sin;
> void *buffer;
> unsigned int tmp;
> int pfd[2], sfd;
> int res;
>
> OSERROR(pipe(pfd), "pipe");
>
> sfd = socket(AF_INET, SOCK_DGRAM, 0);
> OSERROR(sfd, "socket/2");
>
> memset(&sin, 0, sizeof(sin));
> sin.sin_family = AF_INET;
> sin.sin_port = htons(0);
> sin.sin_addr.s_addr = htonl(0xc0a80601);
> #warning you might want to set the address here - this is 192.168.6.1
> OSERROR(connect(sfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");
>
> buffer = mmap(NULL, 1024*1024, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
> OSERROR(buffer, "mmap");
>
> OSERROR(send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE), "send");
> #warning you need to adjust the length on the above line to match your MTU
>
> OSERROR(write(pfd[1], buffer, 8), "write");
>
> OSERROR(splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0), "splice");
> return 0;
> }
That's helpful.
Is the MSG_CONFIRM needed to trigger this?
Appending to a MSG_MORE datagram that previously fit within MTU, but
no longer, triggers the copy from skb_prev to skb in if (fraggap).
I did not see how that would cause issues, but maybe something in how
that second skb is setup makes none of the cases in the while loop
successfully append, yet also not fail and exit. It would be helpful
to know which path it takes (I assume skb_splice_from_iter) and what
that returns (0?).
Is this indeed trivially sidestepped if downgrading from splicing to
regular copying with fragmentation?
@@ -1042,7 +1042,7 @@ static int __ip_append_data(struct sock *sk,
if (inet->hdrincl)
return -EPERM;
if (rt->dst.dev->features & NETIF_F_SG &&
- getfrag == ip_generic_getfrag)
+ getfrag == ip_generic_getfrag && transhdrlen)
Willem de Bruijn <[email protected]> wrote:
> Is the MSG_CONFIRM needed to trigger this?
It's not actually necessary. The syz test was doing it.
> Is this indeed trivially sidestepped if downgrading from splicing to
> regular copying with fragmentation?
That works. The logging looks like:
==>splice_to_socket() 5535
udp_sendmsg(8,8)
__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
copy 8 = 9 - 0 - 1 - 0
length 8 -= 8 + 0
<==splice_to_socket() = 8
It looks like pagedlen being non-zero might be the issue.
David
The more I look at __ip_append_data(), the more I think the maths is wrong.
In the bit that allocates a new skbuff:
if (copy <= 0) {
...
datalen = length + fraggap;
if (datalen > mtu - fragheaderlen)
datalen = maxfraglen - fragheaderlen;
fraglen = datalen + fragheaderlen;
pagedlen = 0;
...
if ((flags & MSG_MORE) &&
!(rt->dst.dev->features&NETIF_F_SG))
...
else if (!paged &&
(fraglen + alloc_extra < SKB_MAX_ALLOC ||
!(rt->dst.dev->features & NETIF_F_SG)))
...
else {
alloclen = fragheaderlen + transhdrlen;
pagedlen = datalen - transhdrlen;
}
...
In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
The values used here, a few lines further along:
copy = datalen - transhdrlen - fraggap - pagedlen;
are constant over the intervening span. This means that, provided the splice
isn't going to exceed the MTU on the second fragment, the calculation of
'copy' can then be simplified algebraically thus:
copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;
copy = length - transhdrlen - pagedlen;
copy = length - transhdrlen - (datalen - transhdrlen);
copy = length - transhdrlen - datalen + transhdrlen;
copy = length - datalen;
copy = length - (length + fraggap);
copy = length - length - fraggap;
copy = -fraggap;
I think we might need to recalculate copy after the conditional call to
getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The
root seems to be that we're subtracting pagedlen from datalen - but probably
we shouldn't be doing getfrag() if pagedlen > 0.
David
Willem de Bruijn <[email protected]> wrote:
> What causes the infinite loop: does skb_splice_from_iter return 0 and
> therefore the loop neither decrements copy, nor breaks out with error?
Yeah. skb_splice_from_iter() starts returning 0 because the iterator is
empty, but it's still being asked to copy data. Possibly it should break out
of the loop (or give an error) in such a case.
David
Willem de Bruijn <[email protected]> wrote:
> That getfrag is needed. For non-splice cases, to fill the linear part
> of an skb. As your example shows, it is skipped if all data is covered
> by pagedlen?
Yes, because copy goes negative. To quote from the previously quoted log:
==>splice_to_socket() 6630
udp_sendmsg(8,8)
__ip_append_data(copy=-1,len=8, mtu=8192 skblen=8189 maxfl=8188)
pagedlen 9 = 9 - 0
copy -1 = 9 - 0 - 1 - 9
copy is -(the number of excess bytes).
length 8 -= -1 + 0
which then gets deducted from the length - but why? I wonder if copy should
be cleared if we don't call getfrag(). It looks like it's meant to deduct the
amount copied by getfrag(), but that doesn't happen if copy < 0.
Also, note that MSG_ZEROCOPY might see the same maths issue here.
David
David Howells wrote:
> The attached seems to work. I still think copy isn't correctly calculated in
> some circumstances - as I showed, several terms in the maths cancel out,
> including the length of the data.
That arithmetic makes assumptions that are specific to a particular
set of conditions (e.g., that pagedlen is non-zero).
Since the arithmetic is so complicated and error prone, I would try
to structure a fix that is easy to reason about to only change
behavior for the MSG_SPLICE_PAGES case.
> I'm also not entirely sure what 'paged' means in this function. Should it
> actually be set in the MSG_SPLICE_PAGES context?
I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
length that is not copied.
If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
to have reported that previously.
> ---
> udp: Fix __ip_addend_data()
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 6e70839257f7..54675a4f2c9f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk,
> pskb_trim_unique(skb_prev, maxfraglen);
> }
>
> - copy = datalen - transhdrlen - fraggap - pagedlen;
> + copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0);
> if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
> err = -EFAULT;
> kfree_skb(skb);
>
The attached seems to work. I still think copy isn't correctly calculated in
some circumstances - as I showed, several terms in the maths cancel out,
including the length of the data.
I'm also not entirely sure what 'paged' means in this function. Should it
actually be set in the MSG_SPLICE_PAGES context?
---
udp: Fix __ip_addend_data()
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..54675a4f2c9f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1157,7 +1157,7 @@ static int __ip_append_data(struct sock *sk,
pskb_trim_unique(skb_prev, maxfraglen);
}
- copy = datalen - transhdrlen - fraggap - pagedlen;
+ copy = max_t(int, datalen - transhdrlen - fraggap - pagedlen, 0);
if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
David Howells wrote:
> The more I look at __ip_append_data(), the more I think the maths is wrong.
> In the bit that allocates a new skbuff:
>
> if (copy <= 0) {
> ...
> datalen = length + fraggap;
> if (datalen > mtu - fragheaderlen)
> datalen = maxfraglen - fragheaderlen;
> fraglen = datalen + fragheaderlen;
> pagedlen = 0;
> ...
> if ((flags & MSG_MORE) &&
> !(rt->dst.dev->features&NETIF_F_SG))
> ...
> else if (!paged &&
> (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> !(rt->dst.dev->features & NETIF_F_SG)))
> ...
> else {
> alloclen = fragheaderlen + transhdrlen;
> pagedlen = datalen - transhdrlen;
> }
> ...
>
> In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
> The values used here, a few lines further along:
>
> copy = datalen - transhdrlen - fraggap - pagedlen;
>
> are constant over the intervening span. This means that, provided the splice
> isn't going to exceed the MTU on the second fragment, the calculation of
> 'copy' can then be simplified algebraically thus:
>
> copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;
>
> copy = length - transhdrlen - pagedlen;
>
> copy = length - transhdrlen - (datalen - transhdrlen);
>
> copy = length - transhdrlen - datalen + transhdrlen;
>
> copy = length - datalen;
>
> copy = length - (length + fraggap);
>
> copy = length - length - fraggap;
>
> copy = -fraggap;
>
> I think we might need to recalculate copy after the conditional call to
> getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The
> root seems to be that we're subtracting pagedlen from datalen - but probably
> we shouldn't be doing getfrag() if pagedlen > 0.
q
Willem de Bruijn <[email protected]> wrote:
> > I'm also not entirely sure what 'paged' means in this function. Should it
> > actually be set in the MSG_SPLICE_PAGES context?
>
> I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
> length that is not copied.
>
> If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
> to have reported that previously.
Ah... I think it *should* affect MSG_ZEROCOPY also... but... If you look at:
} else {
err = skb_zerocopy_iter_dgram(skb, from, copy);
if (err < 0)
goto error;
}
offset += copy;
length -= copy;
MSG_ZEROCOPY assumes that if it didn't return an error, then
skb_zerocopy_iter_dgram() copied all the data requested - whether or not the
iterator had sufficient data to copy.
If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning
0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as
skb_splice_from_iter() does.
So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you
survive because it subtracts 'copy' from 'length', reducing it to zero, exits
the loop and returns without looking at 'length' again. The actual length to
be transmitted is in the skbuff.
> Since the arithmetic is so complicated and error prone, I would try
> to structure a fix that is easy to reason about to only change
> behavior for the MSG_SPLICE_PAGES case.
Does that mean you want to have a go at that - or did you want me to try?
David
David Howells wrote:
> The more I look at __ip_append_data(), the more I think the maths is wrong.
> In the bit that allocates a new skbuff:
>
> if (copy <= 0) {
> ...
> datalen = length + fraggap;
> if (datalen > mtu - fragheaderlen)
> datalen = maxfraglen - fragheaderlen;
> fraglen = datalen + fragheaderlen;
> pagedlen = 0;
> ...
> if ((flags & MSG_MORE) &&
> !(rt->dst.dev->features&NETIF_F_SG))
> ...
> else if (!paged &&
> (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> !(rt->dst.dev->features & NETIF_F_SG)))
> ...
> else {
> alloclen = fragheaderlen + transhdrlen;
> pagedlen = datalen - transhdrlen;
> }
> ...
>
> In the MSG_SPLICE_READ but not MSG_MORE case, we go through that else clause.
> The values used here, a few lines further along:
>
> copy = datalen - transhdrlen - fraggap - pagedlen;
>
> are constant over the intervening span. This means that, provided the splice
> isn't going to exceed the MTU on the second fragment, the calculation of
> 'copy' can then be simplified algebraically thus:
>
> copy = (length + fraggap) - transhdrlen - fraggap - pagedlen;
>
> copy = length - transhdrlen - pagedlen;
>
> copy = length - transhdrlen - (datalen - transhdrlen);
>
> copy = length - transhdrlen - datalen + transhdrlen;
>
> copy = length - datalen;
>
> copy = length - (length + fraggap);
>
> copy = length - length - fraggap;
>
> copy = -fraggap;
>
> I think we might need to recalculate copy after the conditional call to
> getfrag(). Possibly we should skip that entirely for MSG_SPLICE_READ. The
> root seems to be that we're subtracting pagedlen from datalen - but probably
> we shouldn't be doing getfrag() if pagedlen > 0.
That getfrag is needed. For non-splice cases, to fill the linear part
of an skb. As your example shows, it is skipped if all data is covered
by pagedlen?
In this edge case, splicing appends pagedlen to an skb that holds only
a small linear part for fragheaderlen and fraggap.
Splicing has no problem appending to a normal linear skb, right. Say
send(fd, buf, 100, MSG_MORE);
write(pipe[1], buf, 8);
splice(pipe[2], 0, fd, 0, 8, 0);
This only happens when a new skb has to be allocated during the splice
call.
What causes the infinite loop: does skb_splice_from_iter return 0 and
therefore the loop neither decrements copy, nor breaks out with error?
Apologies for the earlier mail. Accidentally hit send too soon..
David Howells wrote:
> Willem de Bruijn <[email protected]> wrote:
>
> > > I'm also not entirely sure what 'paged' means in this function. Should it
> > > actually be set in the MSG_SPLICE_PAGES context?
> >
> > I introduced it with MSG_ZEROCOPY. It sets up pagedlen to capture the
> > length that is not copied.
> >
> > If the existing code would affect MSG_ZEROCOPY too, I expect syzbot
> > to have reported that previously.
>
> Ah... I think it *should* affect MSG_ZEROCOPY also... but... If you look at:
>
> } else {
> err = skb_zerocopy_iter_dgram(skb, from, copy);
> if (err < 0)
> goto error;
> }
> offset += copy;
> length -= copy;
>
> MSG_ZEROCOPY assumes that if it didn't return an error, then
> skb_zerocopy_iter_dgram() copied all the data requested - whether or not the
> iterator had sufficient data to copy.
>
> If you look in __zerocopy_sg_from_iter(), it will drop straight out, returning
> 0 if/when iov_iter_count() is/reaches 0, even if length is still > 0, just as
> skb_splice_from_iter() does.
>
> So there's a potential bug in the handling of MSG_ZEROCOPY - but one that you
> survive because it subtracts 'copy' from 'length', reducing it to zero, exits
> the loop and returns without looking at 'length' again. The actual length to
> be transmitted is in the skbuff.
>
> > Since the arithmetic is so complicated and error prone, I would try
> > to structure a fix that is easy to reason about to only change
> > behavior for the MSG_SPLICE_PAGES case.
>
> Does that mean you want to have a go at that - or did you want me to try
Please give it a try. I can review. It's just safer if it's trivial
to review that the patch only affects the behavior of the recently
introduced MSG_SPLICE_PAGES code.
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES
__ip_append_data() can get into an infinite loop when asked to splice into
a partially-built UDP message that has more than the frag-limit data and up
to the MTU limit. Something like:
pipe(pfd);
sfd = socket(AF_INET, SOCK_DGRAM, 0);
connect(sfd, ...);
send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE);
write(pfd[1], buffer, 8);
splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0);
where the amount of data given to send() is dependent on the MTU size (in
this instance an interface with an MTU of 8192).
The problem is that the calculation of the amount to copy in
__ip_append_data() goes negative in two places, and, in the second place,
this gets subtracted from the length remaining, thereby increasing it.
This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and
MSG_SPLICE_PAGES), because the terms in:
copy = datalen - transhdrlen - fraggap - pagedlen;
then mostly cancel when pagedlen is substituted for, leaving just -fraggap.
This causes:
length -= copy + transhdrlen;
to increase the length to more than the amount of data in msg->msg_iter,
which causes skb_splice_from_iter() to be unable to fill the request and it
returns less than 'copied' - which means that length never gets to 0 and we
never exit the loop.
Fix this by:
(1) Insert a note about the dodgy calculation of 'copy'.
(2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above
equation, so that 'offset' isn't regressed and 'length' isn't
increased, which will mean that length and thus copy should match the
amount left in the iterator.
(3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if
we're asked to splice more than is in the iterator. It might be
better to not give the warning or even just give a 'short' write.
[!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY
avoids the problem by simply assuming that everything asked for got copied,
not just the amount that was in the iterator. This is a potential bug for
the future.
Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES")
Reported-by: [email protected]
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: David Howells <[email protected]>
cc: Willem de Bruijn <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: David Ahern <[email protected]>
cc: Jens Axboe <[email protected]>
cc: [email protected]
---
net/ipv4/ip_output.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..91715603cf6e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1158,10 +1158,15 @@ static int __ip_append_data(struct sock *sk,
}
copy = datalen - transhdrlen - fraggap - pagedlen;
+ /* [!] NOTE: copy will be negative if pagedlen>0
+ * because then the equation reduces to -fraggap.
+ */
if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
goto error;
+ } else if (flags & MSG_SPLICE_PAGES) {
+ copy = 0;
}
offset += copy;
@@ -1209,6 +1214,10 @@ static int __ip_append_data(struct sock *sk,
} else if (flags & MSG_SPLICE_PAGES) {
struct msghdr *msg = from;
+ err = -EIO;
+ if (WARN_ON_ONCE(copy > msg->msg_iter.count))
+ goto error;
+
err = skb_splice_from_iter(skb, &msg->msg_iter, copy,
sk->sk_allocation);
if (err < 0)
Hello,
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-and-tested-by: [email protected]
Tested on:
commit: 5d0c230f Linux 6.5-rc4
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15a8b5d5a80000
kernel config: https://syzkaller.appspot.com/x/.config?x=df103238a07f256e
dashboard link: https://syzkaller.appspot.com/bug?extid=f527b971b4bdc8e79f9e
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
patch: https://syzkaller.appspot.com/x/patch.diff?x=17e285dea80000
Note: testing is done by a robot and is best-effort only.