2018-04-30 18:09:34

by syzbot

Subject: INFO: rcu detected stall in kfree_skbmem

Hello,

syzbot found the following crash on:

HEAD commit: 5d1365940a68 Merge git://git.kernel.org/pub/scm/linux/kerne...
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
kernel config: https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

INFO: rcu_sched self-detected stall on CPU
1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908 softirq=71980/71983 fqs=33
(t=125000 jiffies g=39438 c=39437 q=958)
rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
RCU grace-period kthread stack dump:
rcu_sched R running task 23768 9 2 0x80000000
Call Trace:
context_switch kernel/sched/core.c:2848 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3490
schedule+0xef/0x430 kernel/sched/core.c:3549
schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
NMI backtrace for cpu 1
CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
print_cpu_stall kernel/rcu/tree.c:1525 [inline]
check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
__rcu_pending kernel/rcu/tree.c:3356 [inline]
rcu_pending kernel/rcu/tree.c:3401 [inline]
rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
update_process_times+0x2d/0x70 kernel/time/timer.c:1636
tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
__run_hrtimer kernel/time/hrtimer.c:1386 [inline]
__hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
RSP: 0018:ffff8801db105228 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000007 RBX: ffff8800b055c940 RCX: 1ffff1003b2345a5
RDX: 0000000000000000 RSI: ffff8801d91a2d80 RDI: 0000000000000282
RBP: ffff8801db105248 R08: ffff8801d91a2cb8 R09: 0000000000000002
R10: ffff8801d91a2480 R11: 0000000000000000 R12: ffff8801d9848e40
R13: 0000000000000282 R14: ffffffff85b7f27c R15: 0000000000000000
kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
__kfree_skb net/core/skbuff.c:642 [inline]
kfree_skb+0x19d/0x560 net/core/skbuff.c:659
enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
netif_rx+0xba/0x400 net/core/dev.c:4206
loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
__netdev_start_xmit include/linux/netdevice.h:4087 [inline]
netdev_start_xmit include/linux/netdevice.h:4096 [inline]
xmit_one net/core/dev.c:3053 [inline]
dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
__dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
neigh_hh_output include/net/neighbour.h:472 [inline]
neigh_output include/net/neighbour.h:480 [inline]
ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
NF_HOOK_COND include/linux/netfilter.h:277 [inline]
ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
dst_output include/net/dst.h:444 [inline]
NF_HOOK include/linux/netfilter.h:288 [inline]
ip6_xmit+0xf51/0x23f0 net/ipv6/ip6_output.c:277
sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x79e/0xc50 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d1/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:525 [inline]
smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
</IRQ>
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783 [inline]
RIP: 0010:lock_release+0x4d4/0xa10 kernel/locking/lockdep.c:3942
RSP: 0018:ffff8801971ce7b0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: 1ffff10032e39cfb RCX: 1ffff1003b234595
RDX: 1ffffffff11630ed RSI: 0000000000000002 RDI: 0000000000000282
RBP: ffff8801971ce8e0 R08: 1ffff10032e39cff R09: ffffed003b6246c2
R10: 0000000000000003 R11: 0000000000000001 R12: ffff8801d91a2480
R13: ffffffff88b8df60 R14: ffff8801d91a2480 R15: ffff8801971ce7f8
rcu_lock_release include/linux/rcupdate.h:251 [inline]
rcu_read_unlock include/linux/rcupdate.h:688 [inline]
__unlock_page_memcg+0x72/0x100 mm/memcontrol.c:1654
unlock_page_memcg+0x2c/0x40 mm/memcontrol.c:1663
page_remove_file_rmap mm/rmap.c:1248 [inline]
page_remove_rmap+0x6f2/0x1250 mm/rmap.c:1299
zap_pte_range mm/memory.c:1337 [inline]
zap_pmd_range mm/memory.c:1441 [inline]
zap_pud_range mm/memory.c:1470 [inline]
zap_p4d_range mm/memory.c:1491 [inline]
unmap_page_range+0xeb4/0x2200 mm/memory.c:1512
unmap_single_vma+0x1a0/0x310 mm/memory.c:1557
unmap_vmas+0x120/0x1f0 mm/memory.c:1587
exit_mmap+0x265/0x570 mm/mmap.c:3038
__mmput kernel/fork.c:962 [inline]
mmput+0x251/0x610 kernel/fork.c:983
exit_mm kernel/exit.c:544 [inline]
do_exit+0xe98/0x2730 kernel/exit.c:852
do_group_exit+0x16f/0x430 kernel/exit.c:968
get_signal+0x886/0x1960 kernel/signal.c:2469
do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
do_syscall_64+0x792/0x9d0 arch/x86/entry/common.c:292
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x455319
RSP: 002b:00007fa346e81ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 000000000072bf80 RCX: 0000000000455319
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000072bf80
RBP: 000000000072bf80 R08: 0000000000000000 R09: 000000000072bf58
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000a3e81f R14: 00007fa346e829c0 R15: 0000000000000001


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.


2018-05-11 10:01:48

by Dmitry Vyukov

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Mon, Apr 30, 2018 at 8:09 PM, syzbot
<[email protected]> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: 5d1365940a68 Merge
> git://git.kernel.org/pub/scm/linux/kerne...
> git tree: net-next
> console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
> kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.

This looks sctp-related, +sctp maintainers.


2018-05-11 18:43:26

by Marcelo Ricardo Leitner

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
> <[email protected]> wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit: 5d1365940a68 Merge
> > git://git.kernel.org/pub/scm/linux/kerne...
> > git tree: net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
> > kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
>
> This looks sctp-related, +sctp maintainers.
>
> > kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
> > __kfree_skb net/core/skbuff.c:642 [inline]
> > kfree_skb+0x19d/0x560 net/core/skbuff.c:659
> > enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
> > netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
> > netif_rx+0xba/0x400 net/core/dev.c:4206
> > loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
> > __netdev_start_xmit include/linux/netdevice.h:4087 [inline]
> > netdev_start_xmit include/linux/netdevice.h:4096 [inline]
> > xmit_one net/core/dev.c:3053 [inline]
> > dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
> > __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
> > dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
> > neigh_hh_output include/net/neighbour.h:472 [inline]
> > neigh_output include/net/neighbour.h:480 [inline]
> > ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
> > ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
> > NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> > ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
> > dst_output include/net/dst.h:444 [inline]
> > NF_HOOK include/linux/netfilter.h:288 [inline]
> > ip6_xmit+0xf51/0x23f0 net/ipv6/ip6_output.c:277

sctp_v6_xmit() calls ip6_xmit() with rcu_read_lock() held, as it has to pass
np->opt to ip6_xmit(). Sounds like this packet then went through a long
journey and hit the bell.

But calling ip6_xmit() under rcu_read_lock() is expected; the TCP stack
does it too.
So I think this is more of an issue with the IPv6 stack. If a host has
an extensive ip6tables ruleset, it probably triggers this more easily.

> > sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
> > sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
> > sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
> > sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
> > sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> > sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> > sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
> > sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
> > call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
> > expire_timers kernel/time/timer.c:1363 [inline]

Having this call come from a timer means it wasn't processing the sctp
stack for too long.


2018-05-11 19:09:40

by Eric Dumazet

Subject: Re: INFO: rcu detected stall in kfree_skbmem



On 05/11/2018 11:41 AM, Marcelo Ricardo Leitner wrote:

> But calling ip6_xmit with rcu_read_lock is expected. tcp stack also
> does it.
> Thus I think this is more of an issue with IPv6 stack. If a host has
> an extensive ip6tables ruleset, it probably generates this more
> easily.
>
>>> sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
>>> sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
>>> sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
>>> sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>>> sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>>> sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>>> sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>>> sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
>>> call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>>> expire_timers kernel/time/timer.c:1363 [inline]
>
> Having this call from a timer means it wasn't processing sctp stack
> for too long.
>

I feel the problem is that this part is looping, in some infinite loop.

I have seen these stack traces in other reports.

Maybe some kind of list corruption.


2018-05-11 20:50:04

by Marcelo Ricardo Leitner

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Fri, May 11, 2018 at 12:08:33PM -0700, Eric Dumazet wrote:
>
>
> On 05/11/2018 11:41 AM, Marcelo Ricardo Leitner wrote:
>
> > But calling ip6_xmit with rcu_read_lock is expected. tcp stack also
> > does it.
> > Thus I think this is more of an issue with IPv6 stack. If a host has
> > an extensive ip6tables ruleset, it probably generates this more
> > easily.
> >
> >>> sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
> >>> sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
> >>> sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
> >>> sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
> >>> sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> >>> sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> >>> sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
> >>> sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
> >>> call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
> >>> expire_timers kernel/time/timer.c:1363 [inline]
> >
> > Having this call from a timer means it wasn't processing sctp stack
> > for too long.
> >
>
> I feel the problem is that this part is looping, in some infinite loop.
>
> I have seen this stack traces in other reports.

Checked the mail history now; it seems at least two other reports on RCU
stalls had sctp_generate_heartbeat_event involved.

>
> Maybe some kind of list corruption.

Could be.
Do we know if it generated a flood of packets?

Marcelo

2018-05-13 13:53:03

by Dmitry Vyukov

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Fri, May 11, 2018 at 10:42 PM, Marcelo Ricardo Leitner
<[email protected]> wrote:
> On Fri, May 11, 2018 at 12:08:33PM -0700, Eric Dumazet wrote:
>>
>>
>> On 05/11/2018 11:41 AM, Marcelo Ricardo Leitner wrote:
>>
>> > But calling ip6_xmit with rcu_read_lock is expected. tcp stack also
>> > does it.
>> > Thus I think this is more of an issue with IPv6 stack. If a host has
>> > an extensive ip6tables ruleset, it probably generates this more
>> > easily.
>> >
>> >>> sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
>> >>> sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
>> >>> sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
>> >>> sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>> >>> sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>> >>> sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>> >>> sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>> >>> sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
>> >>> call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>> >>> expire_timers kernel/time/timer.c:1363 [inline]
>> >
>> > Having this call from a timer means it wasn't processing sctp stack
>> > for too long.
>> >
>>
>> I feel the problem is that this part is looping, in some infinite loop.
>>
>> I have seen this stack traces in other reports.
>
> Checked mail history now, seems at least two other reports on RCU
> stalls had sctp_generate_heartbeat_event involved.
>
>>
>> Maybe some kind of list corruption.
>
> Could be.
> Do we know if it generated a flood of packets?

We only know what's in the bug reports. Do the other ones have
reproducers? It can make sense to mark them as duplicates to not have
a pile of open bugs about the same root cause.

2018-05-13 16:03:21

by Marcelo Ricardo Leitner

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Sun, May 13, 2018 at 03:52:01PM +0200, Dmitry Vyukov wrote:
> On Fri, May 11, 2018 at 10:42 PM, Marcelo Ricardo Leitner
> <[email protected]> wrote:
> > [earlier discussion and stack trace snipped; quoted in full above]
>
> We only know what's in the bug reports. Do the other ones have

Ok.

> reproducers? It can make sense to mark them as duplicates to not have

No.

> a pile of open bugs about the same root cause.

They may have the same root cause, but right now I cannot tell for
sure.

2018-05-14 13:36:46

by Neil Horman

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
> <[email protected]> wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit: 5d1365940a68 Merge
> > git://git.kernel.org/pub/scm/linux/kerne...
> > git tree: net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
> > kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
>
> This looks sctp-related, +sctp maintainers.
>
Looking at the entire trace, it appears that we are getting caught in the
kfree_skb that is triggered in enqueue_to_backlog, which occurs when our
rx backlog list grows beyond netdev_max_backlog packets. That suggests to me
that whatever test(s) is/are causing this trace are queuing up a large number
of frames to be sent over the loopback interface which are never/rarely
getting received. Looking higher up the stack, in the
sctp_generate_heartbeat_event function we (in addition to taking the
rcu_read_lock in sctp_v6_xmit) also hold the socket lock for the entirety of
the xmit operation. Is it possible that we are just enqueuing so many frames
for xmit that we block progress of other threads using the same socket long
enough to cross the RCU self-detected stall boundary? While it's not a fix
per se, it might be a worthwhile test to limit the number of frames we flush
in a single pass.

Neil

> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: [email protected]
> >
> > INFO: rcu_sched self-detected stall on CPU
> > 1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908
> > softirq=71980/71983 fqs=33
> > (t=125000 jiffies g=39438 c=39437 q=958)
> > rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0
> > RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
> > RCU grace-period kthread stack dump:
> > rcu_sched R running task 23768 9 2 0x80000000
> > Call Trace:
> > context_switch kernel/sched/core.c:2848 [inline]
> > __schedule+0x801/0x1e30 kernel/sched/core.c:3490
> > schedule+0xef/0x430 kernel/sched/core.c:3549
> > schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
> > rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
> > kthread+0x345/0x410 kernel/kthread.c:238
> > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
> > NMI backtrace for cpu 1
> > CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> > <IRQ>
> > __dump_stack lib/dump_stack.c:77 [inline]
> > dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> > nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
> > nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
> > arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> > trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
> > rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
> > print_cpu_stall kernel/rcu/tree.c:1525 [inline]
> > check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
> > __rcu_pending kernel/rcu/tree.c:3356 [inline]
> > rcu_pending kernel/rcu/tree.c:3401 [inline]
> > rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
> > update_process_times+0x2d/0x70 kernel/time/timer.c:1636
> > tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
> > tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
> > __run_hrtimer kernel/time/hrtimer.c:1386 [inline]
> > __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
> > hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
> > local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
> > smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
> > apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
> > [inline]
> > RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
> > RSP: 0018:ffff8801db105228 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000000000007 RBX: ffff8800b055c940 RCX: 1ffff1003b2345a5
> > RDX: 0000000000000000 RSI: ffff8801d91a2d80 RDI: 0000000000000282
> > RBP: ffff8801db105248 R08: ffff8801d91a2cb8 R09: 0000000000000002
> > R10: ffff8801d91a2480 R11: 0000000000000000 R12: ffff8801d9848e40
> > R13: 0000000000000282 R14: ffffffff85b7f27c R15: 0000000000000000
> > kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
> > __kfree_skb net/core/skbuff.c:642 [inline]
> > kfree_skb+0x19d/0x560 net/core/skbuff.c:659
> > enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
> > netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
> > netif_rx+0xba/0x400 net/core/dev.c:4206
> > loopback_xmit+0x283/0x741 drivers/net/loopback.c:91
> > __netdev_start_xmit include/linux/netdevice.h:4087 [inline]
> > netdev_start_xmit include/linux/netdevice.h:4096 [inline]
> > xmit_one net/core/dev.c:3053 [inline]
> > dev_hard_start_xmit+0x264/0xc10 net/core/dev.c:3069
> > __dev_queue_xmit+0x2724/0x34c0 net/core/dev.c:3584
> > dev_queue_xmit+0x17/0x20 net/core/dev.c:3617
> > neigh_hh_output include/net/neighbour.h:472 [inline]
> > neigh_output include/net/neighbour.h:480 [inline]
> > ip6_finish_output2+0x134e/0x2810 net/ipv6/ip6_output.c:120
> > ip6_finish_output+0x5fe/0xbc0 net/ipv6/ip6_output.c:154
> > NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> > ip6_output+0x227/0x9b0 net/ipv6/ip6_output.c:171
> > dst_output include/net/dst.h:444 [inline]
> > NF_HOOK include/linux/netfilter.h:288 [inline]
> > ip6_xmit+0xf51/0x23f0 net/ipv6/ip6_output.c:277
> > sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
> > sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
> > sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
> > sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
> > sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> > sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> > sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
> > sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406
> > call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
> > expire_timers kernel/time/timer.c:1363 [inline]
> > __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
> > run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
> > __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
> > invoke_softirq kernel/softirq.c:365 [inline]
> > irq_exit+0x1d1/0x200 kernel/softirq.c:405
> > exiting_irq arch/x86/include/asm/apic.h:525 [inline]
> > smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
> > apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
> > </IRQ>
> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
> > [inline]
> > RIP: 0010:lock_release+0x4d4/0xa10 kernel/locking/lockdep.c:3942
> > RSP: 0018:ffff8801971ce7b0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
> > RAX: dffffc0000000000 RBX: 1ffff10032e39cfb RCX: 1ffff1003b234595
> > RDX: 1ffffffff11630ed RSI: 0000000000000002 RDI: 0000000000000282
> > RBP: ffff8801971ce8e0 R08: 1ffff10032e39cff R09: ffffed003b6246c2
> > R10: 0000000000000003 R11: 0000000000000001 R12: ffff8801d91a2480
> > R13: ffffffff88b8df60 R14: ffff8801d91a2480 R15: ffff8801971ce7f8
> > rcu_lock_release include/linux/rcupdate.h:251 [inline]
> > rcu_read_unlock include/linux/rcupdate.h:688 [inline]
> > __unlock_page_memcg+0x72/0x100 mm/memcontrol.c:1654
> > unlock_page_memcg+0x2c/0x40 mm/memcontrol.c:1663
> > page_remove_file_rmap mm/rmap.c:1248 [inline]
> > page_remove_rmap+0x6f2/0x1250 mm/rmap.c:1299
> > zap_pte_range mm/memory.c:1337 [inline]
> > zap_pmd_range mm/memory.c:1441 [inline]
> > zap_pud_range mm/memory.c:1470 [inline]
> > zap_p4d_range mm/memory.c:1491 [inline]
> > unmap_page_range+0xeb4/0x2200 mm/memory.c:1512
> > unmap_single_vma+0x1a0/0x310 mm/memory.c:1557
> > unmap_vmas+0x120/0x1f0 mm/memory.c:1587
> > exit_mmap+0x265/0x570 mm/mmap.c:3038
> > __mmput kernel/fork.c:962 [inline]
> > mmput+0x251/0x610 kernel/fork.c:983
> > exit_mm kernel/exit.c:544 [inline]
> > do_exit+0xe98/0x2730 kernel/exit.c:852
> > do_group_exit+0x16f/0x430 kernel/exit.c:968
> > get_signal+0x886/0x1960 kernel/signal.c:2469
> > do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
> > exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
> > prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
> > syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
> > do_syscall_64+0x792/0x9d0 arch/x86/entry/common.c:292
> > entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > RIP: 0033:0x455319
> > RSP: 002b:00007fa346e81ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > RAX: fffffffffffffe00 RBX: 000000000072bf80 RCX: 0000000000455319
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000072bf80
> > RBP: 000000000072bf80 R08: 0000000000000000 R09: 000000000072bf58
> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000a3e81f R14: 00007fa346e829c0 R15: 0000000000000001
> >
> >
> > ---
> > This bug is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at [email protected].
> >
> > syzbot will keep track of this bug report.
> > If you forgot to add the Reported-by tag, once the fix for this bug is
> > merged
> > into any tree, please reply to this email with:
> > #syz fix: exact-commit-title
> > To mark this as a duplicate of another syzbot report, please reply with:
> > #syz dup: exact-subject-of-another-report
> > If it's a one-off invalid bug report, please reply with:
> > #syz invalid
> > Note: if the crash happens again, it will cause creation of a new bug
> > report.
> > Note: all commands must start from beginning of the line in the email body.
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "syzkaller-bugs" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/syzkaller-bugs/000000000000a9b0e3056b14bfb2%40google.com.
> > For more options, visit https://groups.google.com/d/optout.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2018-05-14 18:05:56

by Xin Long

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Mon, May 14, 2018 at 9:34 PM, Neil Horman <[email protected]> wrote:
> On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > syzbot found the following crash on:
>> >
>> > HEAD commit: 5d1365940a68 Merge
>> > git://git.kernel.org/pub/scm/linux/kerne...
>> > git tree: net-next
>> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
>> > kernel config:
>> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
>> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
>> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> >
>> > Unfortunately, I don't have any reproducer for this crash yet.
>>
>> This looks sctp-related, +sctp maintainers.
>>
> Looking at the entire trace, it appears that we are getting caught in the
> kfree_skb that is getting triggered in enqueue_to_backlog which occurs when our
> rx backlog list grows over netdev_max_backlog packets. That suggests to me that
Might it be a long skb->frag_list that makes kfree_skb slow, when packing
lots of small chunks to go through the lo device?

> whatever test(s) is/are causing this trace are queuing up a large number of
> frames to be sent over the loopback interface which are never/rarely getting
> received. Looking higher up the stack, in the sctp_generate_heartbeat_event
> function we (in addition to taking the rcu_read_lock in sctp_v6_xmit) also
> hold the socket lock for the entirety of the xmit operation. Is it possible
> that we are just enqueuing so many frames for xmit that we block progress of
> other threads using the same socket long enough to cross the RCU
> self-detected stall boundary? While it's not a fix per se, it might be a
> worthwhile test to limit the number of frames we flush in a single pass.
>
> Neil
>
>> > [full syzbot report and list footers snipped; quoted in full in Neil's
>> > reply above]

2018-05-26 15:37:01

by Dmitry Vyukov

Subject: Re: INFO: rcu detected stall in kfree_skbmem

On Mon, May 14, 2018 at 8:04 PM, Xin Long <[email protected]> wrote:
> On Mon, May 14, 2018 at 9:34 PM, Neil Horman <[email protected]> wrote:
>> On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
>>> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
>>> <[email protected]> wrote:
>>> > Hello,
>>> >
>>> > syzbot found the following crash on:
>>> >
>>> > HEAD commit: 5d1365940a68 Merge
>>> > git://git.kernel.org/pub/scm/linux/kerne...
>>> > git tree: net-next
>>> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
>>> > kernel config:
>>> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
>>> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
>>> > compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>>> >
>>> > Unfortunately, I don't have any reproducer for this crash yet.
>>>
>>> This looks sctp-related, +sctp maintainers.
>>>
>> Looking at the entire trace, it appears that we are getting caught in the
>> kfree_skb that is getting triggered in enqueue_to_backlog which occurs when our
>> rx backlog list grows over netdev_max_backlog packets. That suggests to me that
> It might be a long skb->frag_list that made kfree_skb slow when packing
> lots of small chunks to go through lo device?
>
>> whatever test(s) is/are causing this trace are queuing up a large number of
>> frames to be sent over the loopback interface, and are never/rarely getting
>> received. Looking up higher in the stack, in the sctp_generate_heartbeat_event
>> function, we (in addition to the rcu_read_lock in sctp_v6_xmit) we also hold the
>> socket lock during the entirety of the xmit operaion. Is it possible that we
>> are just enqueuing so many frames for xmit that we are blocking progress of
>> other threads using the same socket that we cross the RCU self detected stall
>> boundary? While its not a fix per se, it might be a worthwhile test to limit
>> the number of frames we flush in a single pass.
>>
>> Neil
>>
>>> > [syzbot report snipped down to the relevant frames; quoted in full in
>>> > Neil's reply above]
>>> > sctp_v6_xmit+0x4a5/0x6b0 net/sctp/ipv6.c:225
>>> > sctp_packet_transmit+0x26f6/0x3ba0 net/sctp/output.c:650
>>> > sctp_outq_flush+0x1373/0x4370 net/sctp/outqueue.c:1197
>>> > sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>>> > sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>>> > sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>>> > sctp_do_sm+0x596/0x7160 net/sctp/sm_sideeffect.c:1191
>>> > sctp_generate_heartbeat_event+0x218/0x450 net/sctp/sm_sideeffect.c:406


#syz fix: sctp: not allow to set rto_min with a value below 200 msecs


>>> > call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
>>> > expire_timers kernel/time/timer.c:1363 [inline]
>>> > [remainder of quoted trace snipped]
>>> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>>> > R13: 0000000000a3e81f R14: 00007fa346e829c0 R15: 0000000000000001
>>> >
>>> >
>>> > ---
>>> > This bug is generated by a bot. It may contain errors.
>>> > See https://goo.gl/tpsmEJ for more information about syzbot.
>>> > syzbot engineers can be reached at [email protected].
>>> >
>>> > syzbot will keep track of this bug report.
>>> > If you forgot to add the Reported-by tag, once the fix for this bug is
>>> > merged
>>> > into any tree, please reply to this email with:
>>> > #syz fix: exact-commit-title
>>> > To mark this as a duplicate of another syzbot report, please reply with:
>>> > #syz dup: exact-subject-of-another-report
>>> > If it's a one-off invalid bug report, please reply with:
>>> > #syz invalid
>>> > Note: if the crash happens again, it will cause creation of a new bug
>>> > report.
>>> > Note: all commands must start from beginning of the line in the email body.
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups
>>> > "syzkaller-bugs" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an
>>> > email to [email protected].
>>> > To view this discussion on the web visit
>>> > https://groups.google.com/d/msgid/syzkaller-bugs/000000000000a9b0e3056b14bfb2%40google.com.
>>> > For more options, visit https://groups.google.com/d/optout.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>