LinuxLists.cc - Re: INFO: rcu detected stall in xfrm_confirm

2018-02-12 15:28:46

Subject: Re: INFO: rcu detected stall in xfrm_confirm_neigh

On Mon, Feb 12, 2018 at 4:23 PM, syzbot
<[email protected]> wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> 9515a2e082f91457db0ecff4b65371d0fb5d9aad (Thu Jan 25 03:37:38 2018 +0000)
> net/ipv4: Allow send to local broadcast from a socket bound to a VRF
>
> So far this crash happened 6 times on net-next.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.

+xfrm maintainers

> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: [email protected]
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> INFO: rcu_sched self-detected stall on CPU
> 1-...!: (124998 ticks this GP) idle=376/140000000000001/0
> softirq=506054/506054 fqs=19
> (t=125000 jiffies g=289415 c=289414 q=312)
> rcu_sched kthread starved for 124920 jiffies! g289415 c289414 f0x0
> RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
> rcu_sched R running task 23456 8 2 0x80000000
> Call Trace:
> context_switch kernel/sched/core.c:2799 [inline]
> __schedule+0x8eb/0x2060 kernel/sched/core.c:3375
> schedule+0xf5/0x430 kernel/sched/core.c:3434
> schedule_timeout+0x118/0x230 kernel/time/timer.c:1793
> rcu_gp_kthread+0x9e5/0x1930 kernel/rcu/tree.c:2314
> kthread+0x33c/0x400 kernel/kthread.c:238
> ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:541
> NMI backtrace for cpu 1
> CPU: 1 PID: 15893 Comm: syz-executor0 Not tainted 4.15.0-rc9+ #210
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> <IRQ>
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x257 lib/dump_stack.c:53
> nmi_cpu_backtrace+0x1d2/0x210 lib/nmi_backtrace.c:103
> nmi_trigger_cpumask_backtrace+0x122/0x180 lib/nmi_backtrace.c:62
> arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
> rcu_dump_cpu_stacks+0x186/0x1d9 kernel/rcu/tree.c:1459
> print_cpu_stall kernel/rcu/tree.c:1608 [inline]
> check_cpu_stall.isra.61+0xbb8/0x15b0 kernel/rcu/tree.c:1676
> __rcu_pending kernel/rcu/tree.c:3440 [inline]
> rcu_pending kernel/rcu/tree.c:3502 [inline]
> rcu_check_callbacks+0x256/0xd00 kernel/rcu/tree.c:2842
> update_process_times+0x30/0x60 kernel/time/timer.c:1628
> tick_sched_handle+0x85/0x160 kernel/time/tick-sched.c:162
> tick_sched_timer+0x42/0x120 kernel/time/tick-sched.c:1194
> __run_hrtimer kernel/time/hrtimer.c:1211 [inline]
> __hrtimer_run_queues+0x358/0xe20 kernel/time/hrtimer.c:1275
> hrtimer_interrupt+0x1c2/0x5e0 kernel/time/hrtimer.c:1309
> local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
> smp_apic_timer_interrupt+0x14a/0x700 arch/x86/kernel/apic/apic.c:1050
> apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:937
> </IRQ>
> RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline]
> RIP: 0010:__sanitizer_cov_trace_pc+0x3b/0x50 kernel/kcov.c:106
> RSP: 0018:ffff8801a6867820 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
> RAX: 0000000000010000 RBX: ffff8801caf7d200 RCX: ffffffff84acc87d
> RDX: 000000000000ffff RSI: ffffc90002aa6000 RDI: ffff8801c54c4f50
> RBP: ffff8801a6867820 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c54c4e00
> R13: dffffc0000000000 R14: ffffed00395efa5f R15: ffff8801caf7d2fc
> xfrm_get_dst_nexthop net/xfrm/xfrm_policy.c:2732 [inline]
> xfrm_confirm_neigh+0xad/0x270 net/xfrm/xfrm_policy.c:2759
> dst_confirm_neigh include/net/dst.h:419 [inline]
> raw_sendmsg+0xece/0x23b0 net/ipv4/raw.c:702
> inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
> sock_sendmsg_nosec net/socket.c:630 [inline]
> sock_sendmsg+0xca/0x110 net/socket.c:640
> SYSC_sendto+0x361/0x5c0 net/socket.c:1747
> SyS_sendto+0x40/0x50 net/socket.c:1715
> entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x452f19
> RSP: 002b:00007f00a389ec58 EFLAGS: 00000212 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 000000000071bf58 RCX: 0000000000452f19
> RDX: 000000000000001e RSI: 0000000020098000 RDI: 0000000000000013
> RBP: 0000000000000510 R08: 0000000020cf9000 R09: 0000000000000010
> R10: fffffffffffffffe R11: 0000000000000212 R12: 00000000006f6a20
> R13: 00000000ffffffff R14: 00007f00a389f6d4 R15: 0000000000000001
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to [email protected].
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/001a1141ba9ea381f70565057687%40google.com.
> For more options, visit https://groups.google.com/d/optout.

2018-02-13 09:21:24

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: INFO: rcu detected stall in xfrm_confirm_neigh

On Mon, Feb 12, 2018 at 4:26 PM, Dmitry Vyukov <[email protected]> wrote:
> On Mon, Feb 12, 2018 at 4:23 PM, syzbot
> <[email protected]> wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 9515a2e082f91457db0ecff4b65371d0fb5d9aad (Thu Jan 25 03:37:38 2018 +0000)
>> net/ipv4: Allow send to local broadcast from a socket bound to a VRF
>>
>> So far this crash happened 6 times on net-next.
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>
>
> +xfrm maintainers

Here is a C repro:
https://gist.githubusercontent.com/dvyukov/92c67ba9afaaa960bcfbdc6ef549ac10/raw/786f9221c1d707c7f4a15effcb1d5997dd4f8638/gistfile1.txt
Somewhat messy, but it gives me:

[64360.028053] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [a.out:7891]
[64360.030043] Modules linked in:
[64360.030355] CPU: 0 PID: 7891 Comm: a.out Not tainted 4.15.0+ #96
[64360.031334] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[64360.033632] RIP: 0010:__sanitizer_cov_trace_pc+0x20/0x50
[64360.035114] RSP: 0018:ffff88006b96f6e0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff11
[64360.037238] RAX: ffff8800636ce040 RBX: ffff880067fadb00 RCX: ffffffff84d147a6
[64360.039238] RDX: 0000000000000000 RSI: ffffffff84d147b0 RDI: 0000000000000001
[64360.041357] RBP: ffff88006b96f6e0 R08: ffff8800636ce040 R09: ffffed000d1008c9
[64360.043364] R10: 0000000000000000 R11: 0000000000000000 R12: ffffed000cff5b7b
[64360.045366] R13: ffff880067fadbdc R14: ffff88006586ae00 R15: 0000000000000000
[64360.047367] FS: 00007f2a5bab5700(0000) GS:ffff88006ca00000(0000)
knlGS:0000000000000000
[64360.049486] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[64360.051095] CR2: 0000000020cf9000 CR3: 00000000626a7001 CR4: 00000000001606f0
[64360.053065] Call Trace:
[64360.053822] xfrm_confirm_neigh+0xd0/0x2c0
[64360.056166] raw_sendmsg+0x1f1e/0x2640
[64360.068396] inet_sendmsg+0x145/0x490
[64360.074198] sock_sendmsg+0xd2/0x120
[64360.075094] SYSC_sendto+0x3de/0x640
[64360.085269] SyS_sendto+0x40/0x50

>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: [email protected]
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> INFO: rcu_sched self-detected stall on CPU
>> 1-...!: (124998 ticks this GP) idle=376/140000000000001/0
>> softirq=506054/506054 fqs=19
>> (t=125000 jiffies g=289415 c=289414 q=312)
>> rcu_sched kthread starved for 124920 jiffies! g289415 c289414 f0x0
>> RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
>> rcu_sched R running task 23456 8 2 0x80000000
>> Call Trace:
>> context_switch kernel/sched/core.c:2799 [inline]
>> __schedule+0x8eb/0x2060 kernel/sched/core.c:3375
>> schedule+0xf5/0x430 kernel/sched/core.c:3434
>> schedule_timeout+0x118/0x230 kernel/time/timer.c:1793
>> rcu_gp_kthread+0x9e5/0x1930 kernel/rcu/tree.c:2314
>> kthread+0x33c/0x400 kernel/kthread.c:238
>> ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:541
>> NMI backtrace for cpu 1
>> CPU: 1 PID: 15893 Comm: syz-executor0 Not tainted 4.15.0-rc9+ #210
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>> <IRQ>
>> __dump_stack lib/dump_stack.c:17 [inline]
>> dump_stack+0x194/0x257 lib/dump_stack.c:53
>> nmi_cpu_backtrace+0x1d2/0x210 lib/nmi_backtrace.c:103
>> nmi_trigger_cpumask_backtrace+0x122/0x180 lib/nmi_backtrace.c:62
>> arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>> trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
>> rcu_dump_cpu_stacks+0x186/0x1d9 kernel/rcu/tree.c:1459
>> print_cpu_stall kernel/rcu/tree.c:1608 [inline]
>> check_cpu_stall.isra.61+0xbb8/0x15b0 kernel/rcu/tree.c:1676
>> __rcu_pending kernel/rcu/tree.c:3440 [inline]
>> rcu_pending kernel/rcu/tree.c:3502 [inline]
>> rcu_check_callbacks+0x256/0xd00 kernel/rcu/tree.c:2842
>> update_process_times+0x30/0x60 kernel/time/timer.c:1628
>> tick_sched_handle+0x85/0x160 kernel/time/tick-sched.c:162
>> tick_sched_timer+0x42/0x120 kernel/time/tick-sched.c:1194
>> __run_hrtimer kernel/time/hrtimer.c:1211 [inline]
>> __hrtimer_run_queues+0x358/0xe20 kernel/time/hrtimer.c:1275
>> hrtimer_interrupt+0x1c2/0x5e0 kernel/time/hrtimer.c:1309
>> local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
>> smp_apic_timer_interrupt+0x14a/0x700 arch/x86/kernel/apic/apic.c:1050
>> apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:937
>> </IRQ>
>> RIP: 0010:__read_once_size include/linux/compiler.h:183 [inline]
>> RIP: 0010:__sanitizer_cov_trace_pc+0x3b/0x50 kernel/kcov.c:106
>> RSP: 0018:ffff8801a6867820 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
>> RAX: 0000000000010000 RBX: ffff8801caf7d200 RCX: ffffffff84acc87d
>> RDX: 000000000000ffff RSI: ffffc90002aa6000 RDI: ffff8801c54c4f50
>> RBP: ffff8801a6867820 R08: 0000000000000001 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801c54c4e00
>> R13: dffffc0000000000 R14: ffffed00395efa5f R15: ffff8801caf7d2fc
>> xfrm_get_dst_nexthop net/xfrm/xfrm_policy.c:2732 [inline]
>> xfrm_confirm_neigh+0xad/0x270 net/xfrm/xfrm_policy.c:2759
>> dst_confirm_neigh include/net/dst.h:419 [inline]
>> raw_sendmsg+0xece/0x23b0 net/ipv4/raw.c:702
>> inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764
>> sock_sendmsg_nosec net/socket.c:630 [inline]
>> sock_sendmsg+0xca/0x110 net/socket.c:640
>> SYSC_sendto+0x361/0x5c0 net/socket.c:1747
>> SyS_sendto+0x40/0x50 net/socket.c:1715
>> entry_SYSCALL_64_fastpath+0x29/0xa0
>> RIP: 0033:0x452f19
>> RSP: 002b:00007f00a389ec58 EFLAGS: 00000212 ORIG_RAX: 000000000000002c
>> RAX: ffffffffffffffda RBX: 000000000071bf58 RCX: 0000000000452f19
>> RDX: 000000000000001e RSI: 0000000020098000 RDI: 0000000000000013
>> RBP: 0000000000000510 R08: 0000000020cf9000 R09: 0000000000000010
>> R10: fffffffffffffffe R11: 0000000000000212 R12: 00000000006f6a20
>> R13: 00000000ffffffff R14: 00007f00a389f6d4 R15: 0000000000000001
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to [email protected].
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is
>> merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug
>> report.
>> Note: all commands must start from beginning of the line in the email body.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "syzkaller-bugs" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/syzkaller-bugs/001a1141ba9ea381f70565057687%40google.com.
>> For more options, visit https://groups.google.com/d/optout.

2018-02-19 07:24:15

by Steffen Klassert

[permalink] [raw]

Subject: Re: INFO: rcu detected stall in xfrm_confirm_neigh

On Tue, Feb 13, 2018 at 10:19:17AM +0100, Dmitry Vyukov wrote:
> On Mon, Feb 12, 2018 at 4:26 PM, Dmitry Vyukov <[email protected]> wrote:
> > On Mon, Feb 12, 2018 at 4:23 PM, syzbot
> > <[email protected]> wrote:
> >> Hello,
> >>
> >> syzbot hit the following crash on net-next commit
> >> 9515a2e082f91457db0ecff4b65371d0fb5d9aad (Thu Jan 25 03:37:38 2018 +0000)
> >> net/ipv4: Allow send to local broadcast from a socket bound to a VRF
> >>
> >> So far this crash happened 6 times on net-next.
> >> Unfortunately, I don't have any reproducer for this crash yet.
> >> Raw console output is attached.
> >> compiler: gcc (GCC) 7.1.1 20170620
> >> .config is attached.
> >
> >
> > +xfrm maintainers
>
> Here is a C repro:
> https://gist.githubusercontent.com/dvyukov/92c67ba9afaaa960bcfbdc6ef549ac10/raw/786f9221c1d707c7f4a15effcb1d5997dd4f8638/gistfile1.txt

Seems like syzbot does not know about this reproducer.

I've send a patch to test and got this as the reply:

This crash does not have a reproducer. I cannot test it.

2018-02-19 10:07:01

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: INFO: rcu detected stall in xfrm_confirm_neigh

On Mon, Feb 19, 2018 at 8:22 AM, Steffen Klassert
<[email protected]> wrote:
>> > <[email protected]> wrote:
>> >> Hello,
>> >>
>> >> syzbot hit the following crash on net-next commit
>> >> 9515a2e082f91457db0ecff4b65371d0fb5d9aad (Thu Jan 25 03:37:38 2018 +0000)
>> >> net/ipv4: Allow send to local broadcast from a socket bound to a VRF
>> >>
>> >> So far this crash happened 6 times on net-next.
>> >> Unfortunately, I don't have any reproducer for this crash yet.
>> >> Raw console output is attached.
>> >> compiler: gcc (GCC) 7.1.1 20170620
>> >> .config is attached.
>> >
>> >
>> > +xfrm maintainers
>>
>> Here is a C repro:
>> https://gist.githubusercontent.com/dvyukov/92c67ba9afaaa960bcfbdc6ef549ac10/raw/786f9221c1d707c7f4a15effcb1d5997dd4f8638/gistfile1.txt
>
> Seems like syzbot does not know about this reproducer.
>
> I've send a patch to test and got this as the reply:
>
> This crash does not have a reproducer. I cannot test it.

Yes, it does not know about the reproducer. I've extracted it
manually, these hangs are sometimes hard to reproduce. For syzbot this
bug does not have a reproducer.
Have you tried to run the reproducer? For me it reproduced the bug
quite reliably.

2018-02-20 09:54:09

by Steffen Klassert

[permalink] [raw]

Subject: Re: INFO: rcu detected stall in xfrm_confirm_neigh

On Mon, Feb 19, 2018 at 11:05:38AM +0100, Dmitry Vyukov wrote:
> On Mon, Feb 19, 2018 at 8:22 AM, Steffen Klassert
> <[email protected]> wrote:
> >> > <[email protected]> wrote:
> >> >> Hello,
> >> >>
> >> >> syzbot hit the following crash on net-next commit
> >> >> 9515a2e082f91457db0ecff4b65371d0fb5d9aad (Thu Jan 25 03:37:38 2018 +0000)
> >> >> net/ipv4: Allow send to local broadcast from a socket bound to a VRF
> >> >>
> >> >> So far this crash happened 6 times on net-next.
> >> >> Unfortunately, I don't have any reproducer for this crash yet.
> >> >> Raw console output is attached.
> >> >> compiler: gcc (GCC) 7.1.1 20170620
> >> >> .config is attached.
> >> >
> >> >
> >> > +xfrm maintainers
> >>
> >> Here is a C repro:
> >> https://gist.githubusercontent.com/dvyukov/92c67ba9afaaa960bcfbdc6ef549ac10/raw/786f9221c1d707c7f4a15effcb1d5997dd4f8638/gistfile1.txt
> >
> > Seems like syzbot does not know about this reproducer.
> >
> > I've send a patch to test and got this as the reply:
> >
> > This crash does not have a reproducer. I cannot test it.
>
> Yes, it does not know about the reproducer. I've extracted it
> manually, these hangs are sometimes hard to reproduce. For syzbot this
> bug does not have a reproducer.
> Have you tried to run the reproducer? For me it reproduced the bug
> quite reliably.

I've tried the reproducer, it does not trigger with my kernel
config and with your config my machine does not boot. So it
was just easy to ask syzbot for a test.

Anyway, Florian Westphal did a test and confirmed that
the patch fixes the issue. I've just applied the fix
below to the ipsec tree.

Subject: [PATCH] xfrm: Fix infinite loop in xfrm_get_dst_nexthop with transport mode.

On transport mode we forget to fetch the child dst_entry
before we continue the while loop, this leads to an infinite
loop. Fix this by fetching the child dst_entry before we
continue the while loop.

Fixes: 0f6c480f23f4 ("xfrm: Move dst->path into struct xfrm_dst")
Reported-by: [email protected]
Tested-by: Florian Westphal <[email protected]>
Signed-off-by: Steffen Klassert <[email protected]>
---
net/xfrm/xfrm_policy.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 150d46633ce6..625b3fca5704 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2732,14 +2732,14 @@ static const void *xfrm_get_dst_nexthop(const struct dst_entry *dst,
while (dst->xfrm) {
const struct xfrm_state *xfrm = dst->xfrm;

+ dst = xfrm_dst_child(dst);
+
if (xfrm->props.mode == XFRM_MODE_TRANSPORT)
continue;
if (xfrm->type->flags & XFRM_TYPE_REMOTE_COADDR)
daddr = xfrm->coaddr;
else if (!(xfrm->type->flags & XFRM_TYPE_LOCAL_COADDR))
daddr = &xfrm->id.daddr;
-
- dst = xfrm_dst_child(dst);
}
return daddr;
}
--
2.14.1