LinuxLists.cc - [syzbot] [fs?] INFO: task hung in synchronize

2023-05-04 02:26:12

Subject: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

Hello,

syzbot found the following issue on:

HEAD commit: 6686317855c6 net: dsa: mv88e6xxx: add mv88e6321 rsvd2cpu
git tree: net
console output: https://syzkaller.appspot.com/x/log.txt?x=1457a594280000
kernel config: https://syzkaller.appspot.com/x/.config?x=7205cdba522fe4bc
dashboard link: https://syzkaller.appspot.com/bug?extid=222aa26d0a5dbc2e84fe
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12ede410280000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/48685f457043/disk-66863178.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/7e1798ecf070/vmlinux-66863178.xz
kernel image: https://storage.googleapis.com/syzbot-assets/cc77fb901221/bzImage-66863178.xz

The issue was bisected to:

commit 3b5d4ddf8fe1f60082513f94bae586ac80188a03
Author: Martin KaFai Lau <[email protected]>
Date: Wed Mar 9 09:04:50 2022 +0000

bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=153459d7c80000
final oops: https://syzkaller.appspot.com/x/report.txt?x=173459d7c80000
console output: https://syzkaller.appspot.com/x/log.txt?x=133459d7c80000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
Fixes: 3b5d4ddf8fe1 ("bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro")

INFO: task kworker/u4:1:12 blocked for more than 145 seconds.
Not tainted 6.3.0-syzkaller-07940-g6686317855c6 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:1 state:D stack:26288 pid:12 ppid:2 flags:0x00004000
Workqueue: events_unbound fsnotify_connector_destroy_workfn
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5307 [inline]
__schedule+0xc91/0x5770 kernel/sched/core.c:6625
schedule+0xde/0x1a0 kernel/sched/core.c:6701
schedule_timeout+0x276/0x2b0 kernel/time/timer.c:2143
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common+0x1ce/0x5c0 kernel/sched/completion.c:106
__synchronize_srcu+0x1be/0x2c0 kernel/rcu/srcutree.c:1360
fsnotify_connector_destroy_workfn+0x4d/0xa0 fs/notify/mark.c:208
process_one_work+0x991/0x15c0 kernel/workqueue.c:2390
worker_thread+0x669/0x1090 kernel/workqueue.c:2537
kthread+0x344/0x440 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
INFO: task kworker/u4:4:5134 blocked for more than 146 seconds.
Not tainted 6.3.0-syzkaller-07940-g6686317855c6 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u4:4 state:D stack:26344 pid:5134 ppid:2 flags:0x00004000
Workqueue: events_unbound fsnotify_mark_destroy_workfn
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5307 [inline]
__schedule+0xc91/0x5770 kernel/sched/core.c:6625
schedule+0xde/0x1a0 kernel/sched/core.c:6701
schedule_timeout+0x276/0x2b0 kernel/time/timer.c:2143
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common+0x1ce/0x5c0 kernel/sched/completion.c:106
__synchronize_srcu+0x1be/0x2c0 kernel/rcu/srcutree.c:1360
fsnotify_mark_destroy_workfn+0x101/0x3c0 fs/notify/mark.c:898
process_one_work+0x991/0x15c0 kernel/workqueue.c:2390
worker_thread+0x669/0x1090 kernel/workqueue.c:2537
kthread+0x344/0x440 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>

Showing all locks held in the system:
2 locks held by kworker/u4:1/12:
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1324 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:639 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:666 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x87a/0x15c0 kernel/workqueue.c:2361
#1: ffffc90000117da8 (connector_reaper_work){+.+.}-{0:0}, at: process_one_work+0x8ae/0x15c0 kernel/workqueue.c:2365
1 lock held by rcu_tasks_kthre/13:
#0: ffffffff8c7962f0 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:518
1 lock held by rcu_tasks_trace/14:
#0: ffffffff8c795ff0 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:518
3 locks held by kworker/1:0/22:
1 lock held by khungtaskd/28:
#0: ffffffff8c796f00 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340 kernel/locking/lockdep.c:6545
1 lock held by khugepaged/34:
#0: ffffffff8c896708 (lock#3){+.+.}-{3:3}, at: __lru_add_drain_all+0x62/0x6a0 mm/swap.c:852
2 locks held by getty/4759:
#0: ffff88814c1e0098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x26/0x80 drivers/tty/tty_ldisc.c:244
#1: ffffc900015a02f0 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0xef4/0x13e0 drivers/tty/n_tty.c:2177
4 locks held by syz-executor.2/5077:
#0: ffff8880b993c2d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2f/0x120 kernel/sched/core.c:539
#1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: mm_cid_get kernel/sched/sched.h:3280 [inline]
#1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: switch_mm_cid kernel/sched/sched.h:3302 [inline]
#1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: prepare_task_switch kernel/sched/core.c:5117 [inline]
#1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: context_switch kernel/sched/core.c:5258 [inline]
#1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: __schedule+0x2802/0x5770 kernel/sched/core.c:6625
#2: ffff8880b9929698 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x5a/0x1f0 kernel/time/timer.c:999
#3: ffffffff91fb4ac8 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_activate+0x134/0x3f0 lib/debugobjects.c:690
1 lock held by syz-executor.5/5080:
2 locks held by kworker/u4:4/5134:
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1324 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:639 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:666 [inline]
#0: ffff888012479138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x87a/0x15c0 kernel/workqueue.c:2361
#1: ffffc9000433fda8 ((reaper_work).work){+.+.}-{0:0}, at: process_one_work+0x8ae/0x15c0 kernel/workqueue.c:2365
2 locks held by dhcpcd/5583:
#0: ffff88802b7e0130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88802b7e0130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
#1: ffffffff8c7a2378 (rcu_state.exp_mutex){+.+.}-{3:3}, at: exp_funnel_lock kernel/rcu/tree_exp.h:293 [inline]
#1: ffffffff8c7a2378 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x64a/0x770 kernel/rcu/tree_exp.h:992
2 locks held by dhcpcd/5586:
#0: ffff88806919e130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88806919e130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
#1: ffffffff8c7a2378 (rcu_state.exp_mutex){+.+.}-{3:3}, at: exp_funnel_lock kernel/rcu/tree_exp.h:325 [inline]
#1: ffffffff8c7a2378 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x3e8/0x770 kernel/rcu/tree_exp.h:992
1 lock held by dhcpcd/5587:
#0: ffff88802cca0130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88802cca0130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
1 lock held by dhcpcd/5598:
#0: ffff88807be2c130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88807be2c130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
1 lock held by dhcpcd/5621:
#0: ffff88805f706130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88805f706130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
1 lock held by dhcpcd/5622:
#0: ffff88807e98e130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88807e98e130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3204
2 locks held by syz-executor.5/6144:

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 28 Comm: khungtaskd Not tainted 6.3.0-syzkaller-07940-g6686317855c6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
nmi_cpu_backtrace+0x29c/0x350 lib/nmi_backtrace.c:113
nmi_trigger_cpumask_backtrace+0x2a4/0x300 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:148 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:222 [inline]
watchdog+0xe16/0x1090 kernel/hung_task.c:379
kthread+0x344/0x440 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 6146 Comm: syz-executor.1 Not tainted 6.3.0-syzkaller-07940-g6686317855c6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023
RIP: 0010:get_page_from_freelist+0x2f5/0x2e20 mm/page_alloc.c:4281
Code: e8 03 42 80 3c 28 00 0f 85 25 1d 00 00 48 8b 04 24 4d 03 67 20 48 8d 48 1c 48 89 c8 48 89 4c 24 38 48 c1 e8 03 42 0f b6 14 28 <48> 89 c8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 c2 1c 00 00 48
RSP: 0018:ffffc9000392f408 EFLAGS: 00000a07
RAX: 1ffff92000725ec9 RBX: 0000000000000001 RCX: ffffc9000392f64c
RDX: 0000000000000000 RSI: ffffffffffffffff RDI: ffff88813fffae08
RBP: ffff88813fffc300 R08: 0000000000000000 R09: ffff88802581b0b7
R10: ffffed1004b03616 R11: 0000000000000000 R12: 0000000000000006
R13: dffffc0000000000 R14: 0000000000000000 R15: ffff88813fffae00
FS: 00007f7b6a98a700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5f54fa8000 CR3: 0000000191236000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__alloc_pages+0x1cb/0x4a0 mm/page_alloc.c:5592
alloc_pages+0x1aa/0x270 mm/mempolicy.c:2277
__get_free_pages+0xc/0x40 mm/page_alloc.c:5642
kasan_populate_vmalloc_pte mm/kasan/shadow.c:323 [inline]
kasan_populate_vmalloc_pte+0x2e/0x180 mm/kasan/shadow.c:314
apply_to_pte_range mm/memory.c:2578 [inline]
apply_to_pmd_range mm/memory.c:2622 [inline]
apply_to_pud_range mm/memory.c:2658 [inline]
apply_to_p4d_range mm/memory.c:2694 [inline]
__apply_to_page_range+0x68c/0x1030 mm/memory.c:2728
alloc_vmap_area+0x500/0x1e00 mm/vmalloc.c:1642
__get_vm_area_node+0x145/0x3f0 mm/vmalloc.c:2499
__vmalloc_node_range+0x252/0x14a0 mm/vmalloc.c:3165
__bpf_map_area_alloc+0xe5/0x180 kernel/bpf/syscall.c:336
bloom_map_alloc+0x303/0x560 kernel/bpf/bloom_filter.c:137
find_and_alloc_map kernel/bpf/syscall.c:135 [inline]
map_create+0x508/0x1860 kernel/bpf/syscall.c:1161
__sys_bpf+0x127f/0x5420 kernel/bpf/syscall.c:5040
__do_sys_bpf kernel/bpf/syscall.c:5162 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5160 [inline]
__x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5160
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7b69c8c169
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f7b6a98a168 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007f7b69dabf80 RCX: 00007f7b69c8c169
RDX: 0000000000000048 RSI: 0000000020000180 RDI: 0000000000000000
RBP: 00007f7b69ce7ca1 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd1b79d6ff R14: 00007f7b6a98a300 R15: 0000000000022000
</TASK>

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection

If the bug is already fixed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to change bug's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the bug is a duplicate of another bug, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

2023-05-04 07:32:24

by Tetsuo Handa

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On 2023/05/04 15:16, Hillf Danton wrote:
>> 4 locks held by syz-executor.2/5077:
>> #0: ffff8880b993c2d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2f/0x120 kernel/sched/core.c:539
>> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: mm_cid_get kernel/sched/sched.h:3280 [inline]
>> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: switch_mm_cid kernel/sched/sched.h:3302 [inline]
>> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: prepare_task_switch kernel/sched/core.c:5117 [inline]
>> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: context_switch kernel/sched/core.c:5258 [inline]
>> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: __schedule+0x2802/0x5770 kernel/sched/core.c:6625
>> #2: ffff8880b9929698 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x5a/0x1f0 kernel/time/timer.c:999
>> #3: ffffffff91fb4ac8 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_activate+0x134/0x3f0 lib/debugobjects.c:690
>
> What is hard to understand in this report is, how could acquire the
> timer base lock with the mm cid lock held [1]?

Please be aware that lockdep_print_held_locks() is not an atomic action.
Since synchronous printk() is slow, it can sometimes happen that
task_is_running(p) becomes true after passing the

if (p != current && task_is_running(p))
return;

check. I think that this trace is an example where print_lock() by chance hit
hlock_class(p->held_locks + 2) != NULL. If sched_show_task() were also available,
we can know it via mismatch between sched_show_task() and lockdep_print_held_locks().

Linus, I think that "[PATCH v3 (repost)] locking/lockdep: add debug_show_all_lock_holders()"
helps here, but I can't wake up locking people. What can we do?

>
> static inline int mm_cid_get(struct mm_struct *mm)
> {
> int ret;
>
> lockdep_assert_irqs_disabled();
> raw_spin_lock(&mm->cid_lock);
> ret = __mm_cid_get(mm);
> raw_spin_unlock(&mm->cid_lock);
> return ret;
> }
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/kernel/sched/sched.h?id=6686317855c6#n3280
>

2023-05-05 08:54:48

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On Thu, May 04, 2023 at 04:01:23PM +0900, Tetsuo Handa wrote:
> On 2023/05/04 15:16, Hillf Danton wrote:
> >> 4 locks held by syz-executor.2/5077:
> >> #0: ffff8880b993c2d8 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2f/0x120 kernel/sched/core.c:539
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: mm_cid_get kernel/sched/sched.h:3280 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: switch_mm_cid kernel/sched/sched.h:3302 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: prepare_task_switch kernel/sched/core.c:5117 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: context_switch kernel/sched/core.c:5258 [inline]
> >> #1: ffff88802296aef0 (&mm->cid_lock#2){....}-{2:2}, at: __schedule+0x2802/0x5770 kernel/sched/core.c:6625
> >> #2: ffff8880b9929698 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x5a/0x1f0 kernel/time/timer.c:999
> >> #3: ffffffff91fb4ac8 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_activate+0x134/0x3f0 lib/debugobjects.c:690
> >
> > What is hard to understand in this report is, how could acquire the
> > timer base lock with the mm cid lock held [1]?
>
> Please be aware that lockdep_print_held_locks() is not an atomic action.
> Since synchronous printk() is slow, it can sometimes happen that
> task_is_running(p) becomes true after passing the
>
> if (p != current && task_is_running(p))
> return;
>
> check. I think that this trace is an example where print_lock() by chance hit
> hlock_class(p->held_locks + 2) != NULL. If sched_show_task() were also available,
> we can know it via mismatch between sched_show_task() and lockdep_print_held_locks().
>
> Linus, I think that "[PATCH v3 (repost)] locking/lockdep: add debug_show_all_lock_holders()"
> helps here, but I can't wake up locking people. What can we do?

How is that not also racy ?

I think I've seen that patch, and it had a some 'blurb' Changelog that
leaves me wondering wtf the actual problem is and how it attempts to
solve it and I went on with looking at regressions because more
important than random weird patch.

2023-05-05 10:09:58

by Tetsuo Handa

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On 2023/05/05 18:46, Tetsuo Handa wrote:
> On 2023/05/05 17:35, Peter Zijlstra wrote:
>>> Linus, I think that "[PATCH v3 (repost)] locking/lockdep: add debug_show_all_lock_holders()"
>>> helps here, but I can't wake up locking people. What can we do?
>>
>> How is that not also racy ?
>
> Nobody can make this code racy.

Oops, nobody can make this code race-free.

Capturing vmcore is not an option.

>
>>
>> I think I've seen that patch, and it had a some 'blurb' Changelog that
>> leaves me wondering wtf the actual problem is and how it attempts to
>> solve it and I went on with looking at regressions because more
>> important than random weird patch.
>
> Please respond to
> https://lkml.kernel.org/[email protected] .
>

I am waiting for your response.
But since you don't respond, I have to repeat re-posting.

2023-05-05 10:10:58

by Tetsuo Handa

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On 2023/05/05 17:35, Peter Zijlstra wrote:
>> Linus, I think that "[PATCH v3 (repost)] locking/lockdep: add debug_show_all_lock_holders()"
>> helps here, but I can't wake up locking people. What can we do?
>
> How is that not also racy ?

Nobody can make this code racy.

>
> I think I've seen that patch, and it had a some 'blurb' Changelog that
> leaves me wondering wtf the actual problem is and how it attempts to
> solve it and I went on with looking at regressions because more
> important than random weird patch.

Please respond to
https://lkml.kernel.org/[email protected] .

2023-05-20 22:32:28

by syzbot

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

syzbot has found a reproducer for the following issue on:

HEAD commit: dcbe4ea1985d Merge branch '1GbE' of git://git.kernel.org/p..
git tree: net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=123ebd91280000
kernel config: https://syzkaller.appspot.com/x/.config?x=f20b05fe035db814
dashboard link: https://syzkaller.appspot.com/bug?extid=222aa26d0a5dbc2e84fe
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1495596a280000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1529326a280000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/41b9dda0e686/disk-dcbe4ea1.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/64d9bece8f89/vmlinux-dcbe4ea1.xz
kernel image: https://storage.googleapis.com/syzbot-assets/42429896dca0/bzImage-dcbe4ea1.xz

The issue was bisected to:

commit 3b5d4ddf8fe1f60082513f94bae586ac80188a03
Author: Martin KaFai Lau <[email protected]>
Date: Wed Mar 9 09:04:50 2022 +0000

bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=153459d7c80000
final oops: https://syzkaller.appspot.com/x/report.txt?x=173459d7c80000
console output: https://syzkaller.appspot.com/x/log.txt?x=133459d7c80000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
Fixes: 3b5d4ddf8fe1 ("bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro")

INFO: task dhcpcd:10860 blocked for more than 143 seconds.
Not tainted 6.4.0-rc2-syzkaller-00481-gdcbe4ea1985d #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:dhcpcd state:D stack:29024 pid:10860 ppid:4670 flags:0x00000002
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5343 [inline]
__schedule+0xc9a/0x5880 kernel/sched/core.c:6669
schedule+0xde/0x1a0 kernel/sched/core.c:6745
exp_funnel_lock kernel/rcu/tree_exp.h:316 [inline]
synchronize_rcu_expedited+0x6f8/0x770 kernel/rcu/tree_exp.h:992
synchronize_rcu+0x2f1/0x3a0 kernel/rcu/tree.c:3499
synchronize_net+0x4e/0x60 net/core/dev.c:10791
__unregister_prot_hook+0x4b3/0x5c0 net/packet/af_packet.c:380
packet_do_bind+0x93f/0xe30 net/packet/af_packet.c:3235
packet_bind+0x15f/0x1c0 net/packet/af_packet.c:3319
__sys_bind+0x1ed/0x260 net/socket.c:1803
__do_sys_bind net/socket.c:1814 [inline]
__se_sys_bind net/socket.c:1812 [inline]
__x64_sys_bind+0x73/0xb0 net/socket.c:1812
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f522deb3677
RSP: 002b:00007ffec7fc94f8 EFLAGS: 00000217 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 000055e71e865ca3 RCX: 00007f522deb3677
RDX: 0000000000000014 RSI: 00007ffec7fc9508 RDI: 0000000000000005
RBP: 0000000000000000 R08: 000055e7200048a0 R09: 0000000000000004
R10: 000000000000006d R11: 0000000000000217 R12: 000055e71ffff250
R13: 000055e720004788 R14: 00007ffec7fe9dec R15: 000055e720004754
</TASK>
INFO: task dhcpcd:10906 blocked for more than 145 seconds.
Not tainted 6.4.0-rc2-syzkaller-00481-gdcbe4ea1985d #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:dhcpcd state:D stack:28864 pid:10906 ppid:4670 flags:0x00000002
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5343 [inline]
__schedule+0xc9a/0x5880 kernel/sched/core.c:6669
schedule+0xde/0x1a0 kernel/sched/core.c:6745
exp_funnel_lock kernel/rcu/tree_exp.h:316 [inline]
synchronize_rcu_expedited+0x6f8/0x770 kernel/rcu/tree_exp.h:992
synchronize_rcu+0x2f1/0x3a0 kernel/rcu/tree.c:3499
synchronize_net+0x4e/0x60 net/core/dev.c:10791
__unregister_prot_hook+0x4b3/0x5c0 net/packet/af_packet.c:380
packet_do_bind+0x93f/0xe30 net/packet/af_packet.c:3235
packet_bind+0x15f/0x1c0 net/packet/af_packet.c:3319
__sys_bind+0x1ed/0x260 net/socket.c:1803
__do_sys_bind net/socket.c:1814 [inline]
__se_sys_bind net/socket.c:1812 [inline]
__x64_sys_bind+0x73/0xb0 net/socket.c:1812
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f522deb3677
RSP: 002b:00007ffec7fc94f8 EFLAGS: 00000217 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 000055e71e865ca3 RCX: 00007f522deb3677
RDX: 0000000000000014 RSI: 00007ffec7fc9508 RDI: 0000000000000005
RBP: 0000000000000000 R08: 000055e720004a20 R09: 0000000000000004
R10: 000000000000006d R11: 0000000000000217 R12: 000055e71ffff250
R13: 000055e720004908 R14: 00007ffec7fe9dec R15: 000055e7200048d4
</TASK>
INFO: task dhcpcd:11298 blocked for more than 147 seconds.
Not tainted 6.4.0-rc2-syzkaller-00481-gdcbe4ea1985d #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:dhcpcd state:D stack:29008 pid:11298 ppid:4670 flags:0x00000002
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5343 [inline]
__schedule+0xc9a/0x5880 kernel/sched/core.c:6669
schedule+0xde/0x1a0 kernel/sched/core.c:6745
exp_funnel_lock kernel/rcu/tree_exp.h:316 [inline]
synchronize_rcu_expedited+0x6f8/0x770 kernel/rcu/tree_exp.h:992
synchronize_rcu+0x2f1/0x3a0 kernel/rcu/tree.c:3499
synchronize_net+0x4e/0x60 net/core/dev.c:10791
__unregister_prot_hook+0x4b3/0x5c0 net/packet/af_packet.c:380
packet_do_bind+0x93f/0xe30 net/packet/af_packet.c:3235
packet_bind+0x15f/0x1c0 net/packet/af_packet.c:3319
__sys_bind+0x1ed/0x260 net/socket.c:1803
__do_sys_bind net/socket.c:1814 [inline]
__se_sys_bind net/socket.c:1812 [inline]
__x64_sys_bind+0x73/0xb0 net/socket.c:1812
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f522deb3677
RSP: 002b:00007ffec7fc94f8 EFLAGS: 00000217
ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 000055e71e865ca3 RCX: 00007f522deb3677
RDX: 0000000000000014 RSI: 00007ffec7fc9508 RDI: 0000000000000005
RBP: 0000000000000000 R08: 000055e720004420 R09: 0000000000000004
R10: 000000000000006d R11: 0000000000000217 R12: 000055e71ffff250
R13: 000055e720004a88 R14: 00007ffec7fe9dec R15: 000055e720004a54
</TASK>
INFO: task dhcpcd:11328 blocked for more than 149 seconds.
Not tainted 6.4.0-rc2-syzkaller-00481-gdcbe4ea1985d #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:dhcpcd state:D stack:28320 pid:11328 ppid:4670 flags:0x00000002
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5343 [inline]
__schedule+0xc9a/0x5880 kernel/sched/core.c:6669
schedule+0xde/0x1a0 kernel/sched/core.c:6745
exp_funnel_lock kernel/rcu/tree_exp.h:316 [inline]
synchronize_rcu_expedited+0x6f8/0x770 kernel/rcu/tree_exp.h:992
synchronize_rcu+0x2f1/0x3a0 kernel/rcu/tree.c:3499
synchronize_net+0x4e/0x60 net/core/dev.c:10791
__unregister_prot_hook+0x4b3/0x5c0 net/packet/af_packet.c:380
packet_do_bind+0x93f/0xe30 net/packet/af_packet.c:3235
packet_bind+0x15f/0x1c0 net/packet/af_packet.c:3319
__sys_bind+0x1ed/0x260 net/socket.c:1803
__do_sys_bind net/socket.c:1814 [inline]
__se_sys_bind net/socket.c:1812 [inline]
__x64_sys_bind+0x73/0xb0 net/socket.c:1812
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f522deb3677
RSP: 002b:00007ffec7fc94f8 EFLAGS: 00000217 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 000055e71e865ca3 RCX: 00007f522deb3677
RDX: 0000000000000014 RSI: 00007ffec7fc9508 RDI: 0000000000000005
RBP: 0000000000000000 R08: 000055e720004420 R09: 0000000000000004
R10: 000000000000006d R11: 0000000000000217 R12: 000055e71ffff250
R13: 000055e720004c08 R14: 00007ffec7fe9dec R15: 000055e720004bd4
</TASK>

Showing all locks held in the system:
1 lock held by rcu_tasks_kthre/13:
#0: ffffffff8c798430 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:518
1 lock held by rcu_tasks_trace/14:
#0: ffffffff8c798130 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x31/0xd80 kernel/rcu/tasks.h:518
1 lock held by khungtaskd/28:
#0: ffffffff8c799040 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340 kernel/locking/lockdep.c:6545
1 lock held by kswapd0/84:
2 locks held by kswapd1/85:
3 locks held by kworker/0:2/760:
3 locks held by kworker/1:2/1126:
1 lock held by syslogd/4438:
2 locks held by klogd/4445:
1 lock held by dhcpcd/4669:
2 locks held by getty/4756:
#0: ffff8880286bf098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x26/0x80 drivers/tty/tty_ldisc.c:243
#1: ffffc900015902f0 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0xef4/0x13e0 drivers/tty/n_tty.c:2176
2 locks held by sshd/5018:
2 locks held by syz-executor300/5025:
2 locks held by dhcpcd/10797:
#0: ffff8880734ebe10 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:775 [inline]
#0: ffff8880734ebe10 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: __sock_release+0x86/0x290 net/socket.c:652
#1: ffffffff8c7a44b8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: exp_funnel_lock kernel/rcu/tree_exp.h:325 [inline]
#1: ffffffff8c7a44b8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x3e8/0x770 kernel/rcu/tree_exp.h:992
2 locks held by dhcpcd/10818:
#0: ffff8880217f2130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff8880217f2130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3202
#1: ffffffff8c7a44b8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: exp_funnel_lock kernel/rcu/tree_exp.h:325 [inline]
#1: ffffffff8c7a44b8 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x3e8/0x770 kernel/rcu/tree_exp.h:992
1 lock held by dhcpcd/10860:
#0: ffff88807b95c130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff88807b95c130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3202
1 lock held by dhcpcd/10906:
#0: ffff888076080130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff888076080130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3202
1 lock held by dhcpcd/11298:
#0: ffff888027a66130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1697 [inline]
#0: ffff888027a66130 (sk_lock-AF_PACKET){+.+.}-{0:0}, at: packet_do_bind+0x2f/0xe30 net/packet/af_packet.c:3202

---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

2023-05-21 02:48:31

by Martin KaFai Lau

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On 5/20/23 3:13 PM, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit: dcbe4ea1985d Merge branch '1GbE' of git://git.kernel.org/p..
> git tree: net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=123ebd91280000
> kernel config: https://syzkaller.appspot.com/x/.config?x=f20b05fe035db814
> dashboard link: https://syzkaller.appspot.com/bug?extid=222aa26d0a5dbc2e84fe
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1495596a280000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1529326a280000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/41b9dda0e686/disk-dcbe4ea1.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/64d9bece8f89/vmlinux-dcbe4ea1.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/42429896dca0/bzImage-dcbe4ea1.xz
>
> The issue was bisected to:
>
> commit 3b5d4ddf8fe1f60082513f94bae586ac80188a03
> Author: Martin KaFai Lau <[email protected]>
> Date: Wed Mar 9 09:04:50 2022 +0000
>
> bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro

I am afraid this bisect is incorrect. The commit removed a redundant macro and
is a no-op change.

2023-05-21 06:03:26

by Tetsuo Handa

[permalink] [raw]

Subject: Re: [syzbot] [fs?] INFO: task hung in synchronize_rcu (4)

On 2023/05/21 11:26, Martin KaFai Lau wrote:
> On 5/20/23 3:13 PM, syzbot wrote:
>> syzbot has found a reproducer for the following issue on:
>>
>> HEAD commit: dcbe4ea1985d Merge branch '1GbE' of git://git.kernel.org/p..
>> git tree: net-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=123ebd91280000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=f20b05fe035db814
>> dashboard link: https://syzkaller.appspot.com/bug?extid=222aa26d0a5dbc2e84fe
>> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1495596a280000
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1529326a280000
>>
>> Downloadable assets:
>> disk image: https://storage.googleapis.com/syzbot-assets/41b9dda0e686/disk-dcbe4ea1.raw.xz
>> vmlinux: https://storage.googleapis.com/syzbot-assets/64d9bece8f89/vmlinux-dcbe4ea1.xz
>> kernel image: https://storage.googleapis.com/syzbot-assets/42429896dca0/bzImage-dcbe4ea1.xz
>>
>> The issue was bisected to:
>>
>> commit 3b5d4ddf8fe1f60082513f94bae586ac80188a03
>> Author: Martin KaFai Lau <[email protected]>
>> Date: Wed Mar 9 09:04:50 2022 +0000
>>
>> bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro
>
> I am afraid this bisect is incorrect. The commit removed a redundant macro and is a no-op change.
>
>

But the reproducer is heavily calling bpf() syscall.

void execute_call(int call)
{
switch (call) {
case 0:
NONFAILING(*(uint32_t*)0x200027c0 = 3);
NONFAILING(*(uint32_t*)0x200027c4 = 4);
NONFAILING(*(uint32_t*)0x200027c8 = 4);
NONFAILING(*(uint32_t*)0x200027cc = 0x10001);
NONFAILING(*(uint32_t*)0x200027d0 = 0);
NONFAILING(*(uint32_t*)0x200027d4 = -1);
NONFAILING(*(uint32_t*)0x200027d8 = 0);
NONFAILING(memset((void*)0x200027dc, 0, 16));
NONFAILING(*(uint32_t*)0x200027ec = 0);
NONFAILING(*(uint32_t*)0x200027f0 = -1);
NONFAILING(*(uint32_t*)0x200027f4 = 0);
NONFAILING(*(uint32_t*)0x200027f8 = 0);
NONFAILING(*(uint32_t*)0x200027fc = 0);
NONFAILING(*(uint64_t*)0x20002800 = 0);
syscall(__NR_bpf, 0ul, 0x200027c0ul, 0x48ul);
break;
}
}

Something caused infinite loop or too heavy stress to survive?
The first report was 7d31677bb7b1.
Rechecking or running the reproducer on commits shown by
"git log 7d31677bb7b1 net/bpf" might help.