2020-04-26 17:42:20

by syzbot

Subject: INFO: rcu detected stall in wg_packet_tx_worker

Hello,

syzbot found the following crash on:

HEAD commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16141e64100000
kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
compiler: gcc (GCC) 9.0.0 20181231 (experimental)
userspace arch: i386
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-...!: (10499 ticks this GP) idle=272/1/0x4000000000000002 softirq=53996/53996 fqs=254
(t=10500 jiffies g=80873 q=139)
rcu: rcu_preempt kthread starved for 9955 jiffies! g80873 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: RCU grace-period kthread stack dump:
rcu_preempt I28848 10 2 0x80004000
Call Trace:
schedule+0xd0/0x2a0 kernel/sched/core.c:4158
schedule_timeout+0x35c/0x850 kernel/time/timer.c:1898
rcu_gp_fqs_loop kernel/rcu/tree.c:1674 [inline]
rcu_gp_kthread+0x9bf/0x1960 kernel/rcu/tree.c:1836
kthread+0x388/0x470 kernel/kthread.c:268
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
NMI backtrace for cpu 1
CPU: 1 PID: 3483 Comm: kworker/1:8 Not tainted 5.7.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: wg-crypt-wg0 wg_packet_tx_worker
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x188/0x20d lib/dump_stack.c:118
nmi_cpu_backtrace.cold+0x70/0xb1 lib/nmi_backtrace.c:101
nmi_trigger_cpumask_backtrace+0x231/0x27e lib/nmi_backtrace.c:62
trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
rcu_dump_cpu_stacks+0x19b/0x1e5 kernel/rcu/tree_stall.h:254
print_cpu_stall kernel/rcu/tree_stall.h:475 [inline]
check_cpu_stall kernel/rcu/tree_stall.h:549 [inline]
rcu_pending kernel/rcu/tree.c:3225 [inline]
rcu_sched_clock_irq.cold+0x55d/0xcfa kernel/rcu/tree.c:2296
update_process_times+0x25/0x60 kernel/time/timer.c:1727
tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:176
tick_sched_timer+0x4e/0x140 kernel/time/tick-sched.c:1320
__run_hrtimer kernel/time/hrtimer.c:1520 [inline]
__hrtimer_run_queues+0x5ca/0xed0 kernel/time/hrtimer.c:1584
hrtimer_interrupt+0x312/0x770 kernel/time/hrtimer.c:1646
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1113 [inline]
smp_apic_timer_interrupt+0x15b/0x600 arch/x86/kernel/apic/apic.c:1138
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
</IRQ>
RIP: 0010:sfq_dequeue+0x19f/0xcc0 net/sched/sch_sfq.c:492
Code: 66 44 03 a5 e2 02 00 00 66 44 89 63 1a 48 8b 9d 28 03 00 00 48 8d 7b 12 48 89 f8 48 c1 e8 03 42 0f b6 14 30 48 89 f8 83 e0 07 <83> c0 01 38 d0 7c 08 84 d2 0f 85 2b 0a 00 00 44 0f b7 63 12 48 8b
RSP: 0018:ffffc9000b69f4e0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000002 RBX: ffff8880986c2370 RCX: ffffffff86466097
RDX: 0000000000000000 RSI: ffffffff86465fd2 RDI: ffff8880986c2382
RBP: ffff888097c2d000 R08: ffff88809b468240 R09: fffff520016d3e7a
R10: 0000000000000003 R11: fffff520016d3e79 R12: 0000000000000000
R13: ffff8880986c238a R14: dffffc0000000000 R15: 0000000000000002
dequeue_skb net/sched/sch_generic.c:263 [inline]
qdisc_restart net/sched/sch_generic.c:366 [inline]
__qdisc_run+0x1ac/0x17b0 net/sched/sch_generic.c:384
__dev_xmit_skb net/core/dev.c:3704 [inline]
__dev_queue_xmit+0x1d07/0x30a0 net/core/dev.c:4021
neigh_hh_output include/net/neighbour.h:499 [inline]
neigh_output include/net/neighbour.h:508 [inline]
ip6_finish_output2+0xfb5/0x25b0 net/ipv6/ip6_output.c:117
__ip6_finish_output+0x442/0xab0 net/ipv6/ip6_output.c:143
ip6_finish_output+0x34/0x1f0 net/ipv6/ip6_output.c:153
NF_HOOK_COND include/linux/netfilter.h:296 [inline]
ip6_output+0x239/0x810 net/ipv6/ip6_output.c:176
dst_output include/net/dst.h:435 [inline]
ip6_local_out+0xaf/0x1a0 net/ipv6/output_core.c:179
ip6tunnel_xmit include/net/ip6_tunnel.h:160 [inline]
udp_tunnel6_xmit_skb+0x6f4/0xcd0 net/ipv6/ip6_udp_tunnel.c:114
send6+0x4e3/0xb20 drivers/net/wireguard/socket.c:164
wg_socket_send_skb_to_peer+0xf5/0x220 drivers/net/wireguard/socket.c:189
wg_packet_create_data_done drivers/net/wireguard/send.c:250 [inline]
wg_packet_tx_worker+0x30c/0xc30 drivers/net/wireguard/send.c:278
process_one_work+0x965/0x16a0 kernel/workqueue.c:2268
worker_thread+0x96/0xe20 kernel/workqueue.c:2414
kthread+0x388/0x470 kernel/kthread.c:268
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


2020-04-26 17:58:50

by syzbot

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

syzbot has bisected this bug to:

commit e7096c131e5161fa3b8e52a650d7719d2857adfd
Author: Jason A. Donenfeld <[email protected]>
Date: Sun Dec 8 23:27:34 2019 +0000

net: WireGuard secure network tunnel

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
git tree: upstream
final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
userspace arch: i386
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000

Reported-by: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

2020-04-26 19:44:48

by Eric Dumazet

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker



On 4/26/20 10:57 AM, syzbot wrote:
> syzbot has bisected this bug to:
>
> commit e7096c131e5161fa3b8e52a650d7719d2857adfd
> Author: Jason A. Donenfeld <[email protected]>
> Date: Sun Dec 8 23:27:34 2019 +0000
>
> net: WireGuard secure network tunnel
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
> start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
> git tree: upstream
> final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
> console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
> dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
> userspace arch: i386
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
>
> Reported-by: [email protected]
> Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>

I have not looked at the repro closely, but WireGuard has some workers
that might loop forever, cond_resched() might help a bit.

diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index da3b782ab7d31df11e381529b144bcc494234a38..349a71e1907e081c61967c77c9f25a6ec5e57a24 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -518,6 +518,7 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 				&PACKET_CB(skb)->keypair->receiving)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
 		wg_queue_enqueue_per_peer_napi(skb, state);
+		cond_resched();
 	}
 }

diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index 7348c10cbae3db54bfcb31f23c2753185735f876..f5b88693176c84b4bfdf8c4e05071481a3ce45b5 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -281,6 +281,7 @@ void wg_packet_tx_worker(struct work_struct *work)

 		wg_noise_keypair_put(keypair, false);
 		wg_peer_put(peer);
+		cond_resched();
 	}
 }
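
For readers unfamiliar with the workers being referenced: they drain a
ptr_ring in a while loop, so a large enough backlog can keep the kthread
on the CPU for a long time. Below is a generic sketch of the shape of
such a loop and of where the suggested cond_resched() fits; struct
my_queue, my_ring_worker() and process_one_skb() are hypothetical
placeholders, not the actual WireGuard code.

#include <linux/ptr_ring.h>
#include <linux/sched.h>
#include <linux/skbuff.h>
#include <linux/workqueue.h>

/* Hypothetical ring-draining worker; all names are placeholders. */
struct my_queue {
	struct work_struct work;
	struct ptr_ring ring;
};

/* Stand-in for the real per-packet work (decrypt, transmit, ...). */
static void process_one_skb(struct sk_buff *skb);

static void my_ring_worker(struct work_struct *work)
{
	struct my_queue *queue = container_of(work, struct my_queue, work);
	struct sk_buff *skb;

	while ((skb = ptr_ring_consume_bh(&queue->ring)) != NULL) {
		process_one_skb(skb);
		/* Yield between packets so the scheduler (and RCU) can
		 * make progress even when the backlog is large.
		 */
		cond_resched();
	}
}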

2020-04-26 19:45:00

by Jason A. Donenfeld

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <[email protected]> wrote:
>
>
>
> On 4/26/20 10:57 AM, syzbot wrote:
> > syzbot has bisected this bug to:
> >
> > commit e7096c131e5161fa3b8e52a650d7719d2857adfd
> > Author: Jason A. Donenfeld <[email protected]>
> > Date: Sun Dec 8 23:27:34 2019 +0000
> >
> > net: WireGuard secure network tunnel
> >
> > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
> > start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
> > git tree: upstream
> > final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
> > dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
> > userspace arch: i386
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
> >
> > Reported-by: [email protected]
> > Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
> >
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> >
>
> I have not looked at the repro closely, but WireGuard has some workers
> that might loop forever, cond_resched() might help a bit.

I'm working on this right now. Having a bit of a difficult time
getting it to reproduce locally...

The reports show the stall happening always at:

static struct sk_buff *
sfq_dequeue(struct Qdisc *sch)
{
	struct sfq_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;
	sfq_index a, next_a;
	struct sfq_slot *slot;

	/* No active slots */
	if (q->tail == NULL)
		return NULL;

next_slot:
	a = q->tail->next;
	slot = &q->slots[a];

Which is kind of interesting, because it's not like that should block
or anything, unless there's some kasan faulting happening.

2020-04-26 19:55:20

by Jason A. Donenfeld

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

It looks like part of the issue might be that I call
udp_tunnel6_xmit_skb while holding rcu_read_lock_bh, in
drivers/net/wireguard/socket.c. But I think there's good reason to do
so, and udp_tunnel6_xmit_skb should be rcu safe. In fact,
every.single.other user of udp_tunnel6_xmit_skb in the kernel uses it
with rcu locked. So, hm...

2020-04-26 20:00:15

by Jason A. Donenfeld

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

On Sun, Apr 26, 2020 at 1:52 PM Jason A. Donenfeld <[email protected]> wrote:
>
> It looks like part of the issue might be that I call
> udp_tunnel6_xmit_skb while holding rcu_read_lock_bh, in
> drivers/net/wireguard/socket.c. But I think there's good reason to do
> so, and udp_tunnel6_xmit_skb should be rcu safe. In fact,
> every.single.other user of udp_tunnel6_xmit_skb in the kernel uses it
> with rcu locked. So, hm...

In the syzkaller log, it looks like several runs are hitting:

run #0: crashed: INFO: rcu detected stall in netlink_sendmsg

And other runs are hitting other functions entirely. So actually, it's
not clear that this is the fault of the call to udp_tunnel6_xmit_skb.

2020-04-26 20:29:38

by Eric Dumazet

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker



On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
> On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <[email protected]> wrote:
>>
>>
>>
>> On 4/26/20 10:57 AM, syzbot wrote:
>>> syzbot has bisected this bug to:
>>>
>>> commit e7096c131e5161fa3b8e52a650d7719d2857adfd
>>> Author: Jason A. Donenfeld <[email protected]>
>>> Date: Sun Dec 8 23:27:34 2019 +0000
>>>
>>> net: WireGuard secure network tunnel
>>>
>>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
>>> start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
>>> git tree: upstream
>>> final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
>>> kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
>>> userspace arch: i386
>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
>>>
>>> Reported-by: [email protected]
>>> Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
>>>
>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>>
>>
>> I have not looked at the repro closely, but WireGuard has some workers
>> that might loop forever, cond_resched() might help a bit.
>
> I'm working on this right now. Having a bit of a difficult time
> getting it to reproduce locally...
>
> The reports show the stall happening always at:
>
> static struct sk_buff *
> sfq_dequeue(struct Qdisc *sch)
> {
> struct sfq_sched_data *q = qdisc_priv(sch);
> struct sk_buff *skb;
> sfq_index a, next_a;
> struct sfq_slot *slot;
>
> /* No active slots */
> if (q->tail == NULL)
> return NULL;
>
> next_slot:
> a = q->tail->next;
> slot = &q->slots[a];
>
> Which is kind of interesting, because it's not like that should block
> or anything, unless there's some kasan faulting happening.
>

I am not really sure WireGuard is involved, the repro does not rely on it anyway.


2020-04-26 20:41:18

by Eric Dumazet

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker



On 4/26/20 1:26 PM, Eric Dumazet wrote:
>
>
> On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
>> On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <[email protected]> wrote:
>>>
>>>
>>>
>>> On 4/26/20 10:57 AM, syzbot wrote:
>>>> syzbot has bisected this bug to:
>>>>
>>>> commit e7096c131e5161fa3b8e52a650d7719d2857adfd
>>>> Author: Jason A. Donenfeld <[email protected]>
>>>> Date: Sun Dec 8 23:27:34 2019 +0000
>>>>
>>>> net: WireGuard secure network tunnel
>>>>
>>>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
>>>> start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
>>>> git tree: upstream
>>>> final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
>>>> userspace arch: i386
>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
>>>>
>>>> Reported-by: [email protected]
>>>> Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
>>>>
>>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>>>>
>>>
>>> I have not looked at the repro closely, but WireGuard has some workers
>>> that might loop forever, cond_resched() might help a bit.
>>
>> I'm working on this right now. Having a bit of a difficult time
>> getting it to reproduce locally...
>>
>> The reports show the stall happening always at:
>>
>> static struct sk_buff *
>> sfq_dequeue(struct Qdisc *sch)
>> {
>> struct sfq_sched_data *q = qdisc_priv(sch);
>> struct sk_buff *skb;
>> sfq_index a, next_a;
>> struct sfq_slot *slot;
>>
>> /* No active slots */
>> if (q->tail == NULL)
>> return NULL;
>>
>> next_slot:
>> a = q->tail->next;
>> slot = &q->slots[a];
>>
>> Which is kind of interesting, because it's not like that should block
>> or anything, unless there's some kasan faulting happening.
>>
>
> I am not really sure WireGuard is involved, the repro does not rely on it anyway.
>

Yes, do not spend too much time on this.

syzbot found its way into crazy qdisc settings these last days.

( I sent a patch yesterday for choke qdisc, it seems similar checks are needed in sfq )

2020-04-26 20:48:45

by Jason A. Donenfeld

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

On Sun, Apr 26, 2020 at 2:38 PM Eric Dumazet <[email protected]> wrote:
>
>
>
> On 4/26/20 1:26 PM, Eric Dumazet wrote:
> >
> >
> > On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
> >> On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On 4/26/20 10:57 AM, syzbot wrote:
> >>>> syzbot has bisected this bug to:
> >>>>
> >>>> commit e7096c131e5161fa3b8e52a650d7719d2857adfd
> >>>> Author: Jason A. Donenfeld <[email protected]>
> >>>> Date: Sun Dec 8 23:27:34 2019 +0000
> >>>>
> >>>> net: WireGuard secure network tunnel
> >>>>
> >>>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
> >>>> start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
> >>>> git tree: upstream
> >>>> final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
> >>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
> >>>> kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
> >>>> dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
> >>>> userspace arch: i386
> >>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
> >>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
> >>>>
> >>>> Reported-by: [email protected]
> >>>> Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
> >>>>
> >>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> >>>>
> >>>
> >>> I have not looked at the repro closely, but WireGuard has some workers
> >>> that might loop forever, cond_resched() might help a bit.
> >>
> >> I'm working on this right now. Having a bit of a difficult time
> >> getting it to reproduce locally...
> >>
> >> The reports show the stall happening always at:
> >>
> >> static struct sk_buff *
> >> sfq_dequeue(struct Qdisc *sch)
> >> {
> >> struct sfq_sched_data *q = qdisc_priv(sch);
> >> struct sk_buff *skb;
> >> sfq_index a, next_a;
> >> struct sfq_slot *slot;
> >>
> >> /* No active slots */
> >> if (q->tail == NULL)
> >> return NULL;
> >>
> >> next_slot:
> >> a = q->tail->next;
> >> slot = &q->slots[a];
> >>
> >> Which is kind of interesting, because it's not like that should block
> >> or anything, unless there's some kasan faulting happening.
> >>
> >
> > I am not really sure WireGuard is involved, the repro does not rely on it anyway.
> >
>
> Yes, do not spend too much time on this.
>
> syzbot found its way into crazy qdisc settings these last days.
>
> ( I sent a patch yesterday for choke qdisc, it seems similar checks are needed in sfq )

Ah, whew, okay. I had just begun instrumenting sfq (the highly
technical term for "adding printks everywhere") to figure out what's
going on. Looks like you've got a handle on it, so I'll let you have
at it.

On the brighter side, it seems like Dmitry's and my effort to get full
coverage of WireGuard has paid off in the sense that tons of packets
wind up being shoveled through it in one way or another, which is
good.

2020-04-26 21:56:04

by Eric Dumazet

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker



On 4/26/20 1:46 PM, Jason A. Donenfeld wrote:

>
> Ah, whew, okay. I had just begun instrumenting sfq (the highly
> technical term for "adding printks everywhere") to figure out what's
> going on. Looks like you've got a handle on it, so I'll let you have
> at it.

Yes, syzbot manages to put a zero in q->scaled_quantum

I will send a fix.
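
For context on how a zero gets in there: scaled_quantum holds
SFQ_ALLOT_SIZE(quantum) in a short-sized field, so a sufficiently large
configured quantum can truncate to zero once scaled. Below is a minimal
sketch of the kind of validation that would reject such a configuration.
sfq_validate_quantum() is a hypothetical helper, assumed to live in
net/sched/sch_sfq.c (where SFQ_ALLOT_SIZE() is defined) and to be called
from sfq_change() before the new quantum is applied; it is not
necessarily the fix that was actually sent.

/* Sketch only: reject quantum values whose scaled form would be zero
 * or would not fit in the short-sized allot/scaled_quantum fields.
 */
static int sfq_validate_quantum(const struct tc_sfq_qopt *ctl)
{
	if (ctl->quantum) {
		unsigned int scaled = SFQ_ALLOT_SIZE(ctl->quantum);

		if (scaled == 0 || scaled > SHRT_MAX)
			return -EINVAL;
	}
	return 0;
}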

>
> On the brighter side, it seems like Dmitry's and my effort to get full
> coverage of WireGuard has paid off in the sense that tons of packets
> wind up being shoveled through it in one way or another, which is
> good.
>

Sure!

2020-05-04 11:27:30

by Jason A. Donenfeld

Subject: Re: INFO: rcu detected stall in wg_packet_tx_worker

So in spite of this syzkaller bug being unrelated in the end, I've
continued to think about the stack trace a bit. Combined with some
other [potentially false alarm] bug reports I'm trying to wrap my head
around, I'm a bit curious about ideal usage for the udp_tunnel API.

All the uses I've seen in the kernel (including wireguard) follow this pattern:

rcu_read_lock_bh();
sock = rcu_dereference(obj->sock);
...
udp_tunnel_xmit_skb(..., sock, ...);
rcu_read_unlock_bh();

udp_tunnel_xmit_skb calls iptunnel_xmit, which goes through the usual
ip_local_out path and eventually either calls some other device's
ndo_xmit or gets queued up in a qdisc. Calls to udp_tunnel_xmit_skb
aren't exactly cheap. So I wonder: is holding the RCU lock for all
that time really a good thing?

A different pattern that avoids holding the rcu lock would be:

rcu_read_lock_bh();
sock = rcu_dereference(obj->sock);
sock_hold(sock);
rcu_read_unlock_bh();
...
udp_tunnel_xmit_skb(..., sock, ...);
sock_put(sock);

This seems better, but I wonder if it has some drawbacks too. For
example, sock_put has some comment that warns against incrementing it
in response to forwarded packets. And if this isn't necessary to do,
it's marginally more costly than the first pattern.
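
To make the comparison concrete, here is a minimal self-contained sketch
of the second pattern (using the _bh dereference variant to match the
_bh read lock). struct my_tunnel, my_send() and my_xmit_path() are
hypothetical stand-ins for the real tunnel state and the
udp_tunnel{,6}_xmit_skb() call chain; the point is only the
sock_hold()/sock_put() bracketing that lets the RCU read-side critical
section end before the expensive transmit path runs.

#include <linux/rcupdate.h>
#include <linux/skbuff.h>
#include <net/sock.h>

struct my_tunnel {
	struct sock __rcu *sock;	/* hypothetical tunnel socket */
};

/* Stand-in for udp_tunnel{,6}_xmit_skb() and everything underneath. */
static void my_xmit_path(struct sock *sk, struct sk_buff *skb);

static void my_send(struct my_tunnel *obj, struct sk_buff *skb)
{
	struct sock *sk;

	rcu_read_lock_bh();
	sk = rcu_dereference_bh(obj->sock);
	if (unlikely(!sk)) {
		rcu_read_unlock_bh();
		kfree_skb(skb);
		return;
	}
	sock_hold(sk);		/* pin the socket past the RCU section */
	rcu_read_unlock_bh();

	my_xmit_path(sk, skb);	/* potentially long: routing, qdisc, peer device xmit */

	sock_put(sk);		/* drop the reference taken above */
}

Whether the extra reference-count traffic is an acceptable price is
exactly the question being asked here.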

Any opinions about this?

Jason