LinuxLists.cc - possible deadlock in sk_clone

2021-02-26 21:11:48

Subject: possible deadlock in sk_clone_lock

Hello,

syzbot found the following issue on:

HEAD commit: 577c2835 Add linux-next specific files for 20210224
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=137cef82d00000
kernel config: https://syzkaller.appspot.com/x/.config?x=e9bb3d369b3bf49
dashboard link: https://syzkaller.appspot.com/bug?extid=506c8a2a115201881d45

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.11.0-next-20210224-syzkaller #0 Not tainted
-----------------------------------------------------
syz-executor.3/15411 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
ffffffff8c0a0e18 (hugetlb_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffffffff8c0a0e18 (hugetlb_lock){+.+.}-{2:2}, at: __free_huge_page+0x4cd/0xc10 mm/hugetlb.c:1390

and this task is already holding:
ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: __tcp_close+0x6d9/0x1170 net/ipv4/tcp.c:2788
which would create a new lock dependency:
(slock-AF_INET){+.-.}-{2:2} -> (hugetlb_lock){+.+.}-{2:2}

but this new dependency connects a SOFTIRQ-irq-safe lock:
(slock-AF_INET){+.-.}-{2:2}

... which became SOFTIRQ-irq-safe at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
sk_clone_lock+0x296/0x1070 net/core/sock.c:1913
inet_csk_clone_lock+0x21/0x4c0 net/ipv4/inet_connection_sock.c:830
tcp_create_openreq_child+0x30/0x16d0 net/ipv4/tcp_minisocks.c:460
tcp_v4_syn_recv_sock+0x10c/0x1460 net/ipv4/tcp_ipv4.c:1526
tcp_check_req+0x616/0x1860 net/ipv4/tcp_minisocks.c:772
tcp_v4_rcv+0x221a/0x3780 net/ipv4/tcp_ipv4.c:2001
ip_protocol_deliver_rcu+0x5c/0x8a0 net/ipv4/ip_input.c:204
ip_local_deliver_finish+0x20a/0x370 net/ipv4/ip_input.c:231
NF_HOOK include/linux/netfilter.h:301 [inline]
NF_HOOK include/linux/netfilter.h:295 [inline]
ip_local_deliver+0x1b3/0x200 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:458 [inline]
ip_sublist_rcv_finish+0x9a/0x2c0 net/ipv4/ip_input.c:551
ip_list_rcv_finish.constprop.0+0x514/0x6e0 net/ipv4/ip_input.c:601
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5408 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5456
__netif_receive_skb_list net/core/dev.c:5508 [inline]
netif_receive_skb_list_internal+0x777/0xd70 net/core/dev.c:5618
gro_normal_list net/core/dev.c:5772 [inline]
gro_normal_list net/core/dev.c:5768 [inline]
napi_complete_done+0x1f1/0x820 net/core/dev.c:6474
virtqueue_napi_complete+0x2c/0xc0 drivers/net/virtio_net.c:334
virtnet_poll+0xae2/0xd90 drivers/net/virtio_net.c:1455
__napi_poll+0xaf/0x440 net/core/dev.c:6892
napi_poll net/core/dev.c:6959 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7036
__do_softirq+0x29b/0x9f6 kernel/softirq.c:345
invoke_softirq kernel/softirq.c:221 [inline]
__irq_exit_rcu kernel/softirq.c:422 [inline]
irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
common_interrupt+0xa4/0xd0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:623
tomoyo_domain_quota_is_ok+0x2f1/0x550 security/tomoyo/util.c:1093
tomoyo_supervisor+0x2f2/0xf00 security/tomoyo/common.c:2089
tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
tomoyo_path_permission security/tomoyo/file.c:587 [inline]
tomoyo_path_permission+0x270/0x3a0 security/tomoyo/file.c:573
tomoyo_path_perm+0x39e/0x400 security/tomoyo/file.c:838
tomoyo_path_symlink+0x94/0xe0 security/tomoyo/tomoyo.c:200
security_path_symlink+0xdf/0x150 security/security.c:1119
do_symlinkat+0x123/0x300 fs/namei.c:4201
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae

to a SOFTIRQ-irq-unsafe lock:
(hugetlb_lock){+.+.}-{2:2}

... which became SOFTIRQ-irq-unsafe at:
...
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
hugetlb_overcommit_handler+0x260/0x3e0 mm/hugetlb.c:3448
proc_sys_call_handler+0x336/0x610 fs/proc/proc_sysctl.c:591
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x426/0x650 fs/read_write.c:519
vfs_write+0x796/0xa30 fs/read_write.c:606
ksys_write+0x12d/0x250 fs/read_write.c:659
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Possible interrupt unsafe locking scenario:

CPU0 CPU1
---- ----
lock(hugetlb_lock);
local_irq_disable();
lock(slock-AF_INET);
lock(hugetlb_lock);
<Interrupt>
lock(slock-AF_INET);

*** DEADLOCK ***

3 locks held by syz-executor.3/15411:
#0: ffff88802a56a190 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:775 [inline]
#0: ffff88802a56a190 (&sb->s_type->i_mutex_key#13){+.+.}-{3:3}, at: __sock_release+0x86/0x280 net/socket.c:598
#1: ffff88802391c7a0 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1600 [inline]
#1: ffff88802391c7a0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1e/0xc0 net/ipv4/tcp.c:2866
#2: ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
#2: ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: __tcp_close+0x6d9/0x1170 net/ipv4/tcp.c:2788

the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (slock-AF_INET){+.-.}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
_raw_spin_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:175
spin_lock_bh include/linux/spinlock.h:359 [inline]
lock_sock_nested+0x40/0x120 net/core/sock.c:3063
lock_sock include/net/sock.h:1600 [inline]
inet_autobind+0x1a/0x190 net/ipv4/af_inet.c:180
inet_dgram_connect+0x1f5/0x2d0 net/ipv4/af_inet.c:578
__sys_connect_file+0x155/0x1a0 net/socket.c:1837
__sys_connect+0x161/0x190 net/socket.c:1854
__do_sys_connect net/socket.c:1864 [inline]
__se_sys_connect net/socket.c:1861 [inline]
__x64_sys_connect+0x6f/0xb0 net/socket.c:1861
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
IN-SOFTIRQ-W at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
sk_clone_lock+0x296/0x1070 net/core/sock.c:1913
inet_csk_clone_lock+0x21/0x4c0 net/ipv4/inet_connection_sock.c:830
tcp_create_openreq_child+0x30/0x16d0 net/ipv4/tcp_minisocks.c:460
tcp_v4_syn_recv_sock+0x10c/0x1460 net/ipv4/tcp_ipv4.c:1526
tcp_check_req+0x616/0x1860 net/ipv4/tcp_minisocks.c:772
tcp_v4_rcv+0x221a/0x3780 net/ipv4/tcp_ipv4.c:2001
ip_protocol_deliver_rcu+0x5c/0x8a0 net/ipv4/ip_input.c:204
ip_local_deliver_finish+0x20a/0x370 net/ipv4/ip_input.c:231
NF_HOOK include/linux/netfilter.h:301 [inline]
NF_HOOK include/linux/netfilter.h:295 [inline]
ip_local_deliver+0x1b3/0x200 net/ipv4/ip_input.c:252
dst_input include/net/dst.h:458 [inline]
ip_sublist_rcv_finish+0x9a/0x2c0 net/ipv4/ip_input.c:551
ip_list_rcv_finish.constprop.0+0x514/0x6e0 net/ipv4/ip_input.c:601
ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
__netif_receive_skb_list_ptype net/core/dev.c:5408 [inline]
__netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5456
__netif_receive_skb_list net/core/dev.c:5508 [inline]
netif_receive_skb_list_internal+0x777/0xd70 net/core/dev.c:5618
gro_normal_list net/core/dev.c:5772 [inline]
gro_normal_list net/core/dev.c:5768 [inline]
napi_complete_done+0x1f1/0x820 net/core/dev.c:6474
virtqueue_napi_complete+0x2c/0xc0 drivers/net/virtio_net.c:334
virtnet_poll+0xae2/0xd90 drivers/net/virtio_net.c:1455
__napi_poll+0xaf/0x440 net/core/dev.c:6892
napi_poll net/core/dev.c:6959 [inline]
net_rx_action+0x801/0xb40 net/core/dev.c:7036
__do_softirq+0x29b/0x9f6 kernel/softirq.c:345
invoke_softirq kernel/softirq.c:221 [inline]
__irq_exit_rcu kernel/softirq.c:422 [inline]
irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
common_interrupt+0xa4/0xd0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:623
tomoyo_domain_quota_is_ok+0x2f1/0x550 security/tomoyo/util.c:1093
tomoyo_supervisor+0x2f2/0xf00 security/tomoyo/common.c:2089
tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
tomoyo_path_permission security/tomoyo/file.c:587 [inline]
tomoyo_path_permission+0x270/0x3a0 security/tomoyo/file.c:573
tomoyo_path_perm+0x39e/0x400 security/tomoyo/file.c:838
tomoyo_path_symlink+0x94/0xe0 security/tomoyo/tomoyo.c:200
security_path_symlink+0xdf/0x150 security/security.c:1119
do_symlinkat+0x123/0x300 fs/namei.c:4201
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
INITIAL USE at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
_raw_spin_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:175
spin_lock_bh include/linux/spinlock.h:359 [inline]
lock_sock_nested+0x40/0x120 net/core/sock.c:3063
lock_sock include/net/sock.h:1600 [inline]
inet_autobind+0x1a/0x190 net/ipv4/af_inet.c:180
inet_dgram_connect+0x1f5/0x2d0 net/ipv4/af_inet.c:578
__sys_connect_file+0x155/0x1a0 net/socket.c:1837
__sys_connect+0x161/0x190 net/socket.c:1854
__do_sys_connect net/socket.c:1864 [inline]
__se_sys_connect net/socket.c:1861 [inline]
__x64_sys_connect+0x6f/0xb0 net/socket.c:1861
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
}
... key at: [<ffffffff901df860>] af_family_slock_keys+0x20/0x300
... acquired at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
__free_huge_page+0x4cd/0xc10 mm/hugetlb.c:1390
free_huge_page+0x31/0xb0 mm/hugetlb.c:1461
__put_page+0xf7/0x3e0 mm/swap.c:126
put_page include/linux/mm.h:1219 [inline]
__skb_frag_unref include/linux/skbuff.h:3085 [inline]
skb_release_data+0x465/0x750 net/core/skbuff.c:666
skb_release_all net/core/skbuff.c:725 [inline]
__kfree_skb+0x46/0x60 net/core/skbuff.c:739
sk_wmem_free_skb include/net/sock.h:1558 [inline]
tcp_rtx_queue_purge net/ipv4/tcp.c:2895 [inline]
tcp_write_queue_purge+0x44c/0x1250 net/ipv4/tcp.c:2908
tcp_v4_destroy_sock+0xf2/0x780 net/ipv4/tcp_ipv4.c:2219
inet_csk_destroy_sock+0x196/0x490 net/ipv4/inet_connection_sock.c:884
__tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2855
tcp_close+0x29/0xc0 net/ipv4/tcp.c:2867
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release+0xcd/0x280 net/socket.c:599
sock_close+0x18/0x20 net/socket.c:1258
__fput+0x288/0x920 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:140
get_signal+0x1c89/0x2100 kernel/signal.c:2554
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:208
__syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x44/0xae

the dependencies between the lock to be acquired
and SOFTIRQ-irq-unsafe lock:
-> (hugetlb_lock){+.+.}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
hugetlb_overcommit_handler+0x260/0x3e0 mm/hugetlb.c:3448
proc_sys_call_handler+0x336/0x610 fs/proc/proc_sysctl.c:591
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x426/0x650 fs/read_write.c:519
vfs_write+0x796/0xa30 fs/read_write.c:606
ksys_write+0x12d/0x250 fs/read_write.c:659
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
SOFTIRQ-ON-W at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
hugetlb_overcommit_handler+0x260/0x3e0 mm/hugetlb.c:3448
proc_sys_call_handler+0x336/0x610 fs/proc/proc_sysctl.c:591
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x426/0x650 fs/read_write.c:519
vfs_write+0x796/0xa30 fs/read_write.c:606
ksys_write+0x12d/0x250 fs/read_write.c:659
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
INITIAL USE at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
hugetlb_overcommit_handler+0x260/0x3e0 mm/hugetlb.c:3448
proc_sys_call_handler+0x336/0x610 fs/proc/proc_sysctl.c:591
call_write_iter include/linux/fs.h:1977 [inline]
new_sync_write+0x426/0x650 fs/read_write.c:519
vfs_write+0x796/0xa30 fs/read_write.c:606
ksys_write+0x12d/0x250 fs/read_write.c:659
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
}
... key at: [<ffffffff8c0a0e18>] hugetlb_lock+0x18/0x4240
... acquired at:
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
__free_huge_page+0x4cd/0xc10 mm/hugetlb.c:1390
free_huge_page+0x31/0xb0 mm/hugetlb.c:1461
__put_page+0xf7/0x3e0 mm/swap.c:126
put_page include/linux/mm.h:1219 [inline]
__skb_frag_unref include/linux/skbuff.h:3085 [inline]
skb_release_data+0x465/0x750 net/core/skbuff.c:666
skb_release_all net/core/skbuff.c:725 [inline]
__kfree_skb+0x46/0x60 net/core/skbuff.c:739
sk_wmem_free_skb include/net/sock.h:1558 [inline]
tcp_rtx_queue_purge net/ipv4/tcp.c:2895 [inline]
tcp_write_queue_purge+0x44c/0x1250 net/ipv4/tcp.c:2908
tcp_v4_destroy_sock+0xf2/0x780 net/ipv4/tcp_ipv4.c:2219
inet_csk_destroy_sock+0x196/0x490 net/ipv4/inet_connection_sock.c:884
__tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2855
tcp_close+0x29/0xc0 net/ipv4/tcp.c:2867
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release+0xcd/0x280 net/socket.c:599
sock_close+0x18/0x20 net/socket.c:1258
__fput+0x288/0x920 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:140
get_signal+0x1c89/0x2100 kernel/signal.c:2554
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:208
__syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x44/0xae

stack backtrace:
CPU: 0 PID: 15411 Comm: syz-executor.3 Not tainted 5.11.0-next-20210224-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0xfa/0x151 lib/dump_stack.c:120
print_bad_irq_dependency kernel/locking/lockdep.c:2460 [inline]
check_irq_usage.cold+0x50d/0x744 kernel/locking/lockdep.c:2689
check_prev_add kernel/locking/lockdep.c:2940 [inline]
check_prevs_add kernel/locking/lockdep.c:3059 [inline]
validate_chain kernel/locking/lockdep.c:3674 [inline]
__lock_acquire+0x2b2c/0x54c0 kernel/locking/lockdep.c:4900
lock_acquire kernel/locking/lockdep.c:5510 [inline]
lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
spin_lock include/linux/spinlock.h:354 [inline]
__free_huge_page+0x4cd/0xc10 mm/hugetlb.c:1390
free_huge_page+0x31/0xb0 mm/hugetlb.c:1461
__put_page+0xf7/0x3e0 mm/swap.c:126
put_page include/linux/mm.h:1219 [inline]
__skb_frag_unref include/linux/skbuff.h:3085 [inline]
skb_release_data+0x465/0x750 net/core/skbuff.c:666
skb_release_all net/core/skbuff.c:725 [inline]
__kfree_skb+0x46/0x60 net/core/skbuff.c:739
sk_wmem_free_skb include/net/sock.h:1558 [inline]
tcp_rtx_queue_purge net/ipv4/tcp.c:2895 [inline]
tcp_write_queue_purge+0x44c/0x1250 net/ipv4/tcp.c:2908
tcp_v4_destroy_sock+0xf2/0x780 net/ipv4/tcp_ipv4.c:2219
inet_csk_destroy_sock+0x196/0x490 net/ipv4/inet_connection_sock.c:884
__tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2855
tcp_close+0x29/0xc0 net/ipv4/tcp.c:2867
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release+0xcd/0x280 net/socket.c:599
sock_close+0x18/0x20 net/socket.c:1258
__fput+0x288/0x920 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:140
get_signal+0x1c89/0x2100 kernel/signal.c:2554
arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
handle_signal_work kernel/entry/common.c:147 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:208
__syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x465ef9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fbfc748c188 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: 0000000000005cd9 RBX: 000000000056bf60 RCX: 0000000000465ef9
RDX: ffffffffffffffd0 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 00000000004bcd1c R08: 0000000000000000 R09: ffffffffffffff36
R10: 000000000401c005 R11: 0000000000000246 R12: 000000000056bf60
R13: 00007ffde63b79bf R14: 00007fbfc748c300 R15: 0000000000022000

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

2021-02-26 22:48:55

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Fri, Feb 26, 2021 at 2:09 PM syzbot
<[email protected]> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 577c2835 Add linux-next specific files for 20210224
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=137cef82d00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=e9bb3d369b3bf49
> dashboard link: https://syzkaller.appspot.com/bug?extid=506c8a2a115201881d45
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
>
> =====================================================
> WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> 5.11.0-next-20210224-syzkaller #0 Not tainted
> -----------------------------------------------------
> syz-executor.3/15411 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
> ffffffff8c0a0e18 (hugetlb_lock){+.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
> ffffffff8c0a0e18 (hugetlb_lock){+.+.}-{2:2}, at: __free_huge_page+0x4cd/0xc10 mm/hugetlb.c:1390
>
> and this task is already holding:
> ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
> ffff88802391c720 (slock-AF_INET){+.-.}-{2:2}, at: __tcp_close+0x6d9/0x1170 net/ipv4/tcp.c:2788
> which would create a new lock dependency:
> (slock-AF_INET){+.-.}-{2:2} -> (hugetlb_lock){+.+.}-{2:2}
>
> but this new dependency connects a SOFTIRQ-irq-safe lock:
> (slock-AF_INET){+.-.}-{2:2}
>
> ... which became SOFTIRQ-irq-safe at:
> lock_acquire kernel/locking/lockdep.c:5510 [inline]
> lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
> __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
> _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
> spin_lock include/linux/spinlock.h:354 [inline]
> sk_clone_lock+0x296/0x1070 net/core/sock.c:1913
> inet_csk_clone_lock+0x21/0x4c0 net/ipv4/inet_connection_sock.c:830
> tcp_create_openreq_child+0x30/0x16d0 net/ipv4/tcp_minisocks.c:460
> tcp_v4_syn_recv_sock+0x10c/0x1460 net/ipv4/tcp_ipv4.c:1526
> tcp_check_req+0x616/0x1860 net/ipv4/tcp_minisocks.c:772
> tcp_v4_rcv+0x221a/0x3780 net/ipv4/tcp_ipv4.c:2001
> ip_protocol_deliver_rcu+0x5c/0x8a0 net/ipv4/ip_input.c:204
> ip_local_deliver_finish+0x20a/0x370 net/ipv4/ip_input.c:231
> NF_HOOK include/linux/netfilter.h:301 [inline]
> NF_HOOK include/linux/netfilter.h:295 [inline]
> ip_local_deliver+0x1b3/0x200 net/ipv4/ip_input.c:252
> dst_input include/net/dst.h:458 [inline]
> ip_sublist_rcv_finish+0x9a/0x2c0 net/ipv4/ip_input.c:551
> ip_list_rcv_finish.constprop.0+0x514/0x6e0 net/ipv4/ip_input.c:601
> ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
> ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
> __netif_receive_skb_list_ptype net/core/dev.c:5408 [inline]
> __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5456
> __netif_receive_skb_list net/core/dev.c:5508 [inline]
> netif_receive_skb_list_internal+0x777/0xd70 net/core/dev.c:5618
> gro_normal_list net/core/dev.c:5772 [inline]
> gro_normal_list net/core/dev.c:5768 [inline]
> napi_complete_done+0x1f1/0x820 net/core/dev.c:6474
> virtqueue_napi_complete+0x2c/0xc0 drivers/net/virtio_net.c:334
> virtnet_poll+0xae2/0xd90 drivers/net/virtio_net.c:1455
> __napi_poll+0xaf/0x440 net/core/dev.c:6892
> napi_poll net/core/dev.c:6959 [inline]
> net_rx_action+0x801/0xb40 net/core/dev.c:7036
> __do_softirq+0x29b/0x9f6 kernel/softirq.c:345
> invoke_softirq kernel/softirq.c:221 [inline]
> __irq_exit_rcu kernel/softirq.c:422 [inline]
> irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
> common_interrupt+0xa4/0xd0 arch/x86/kernel/irq.c:240
> asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:623
> tomoyo_domain_quota_is_ok+0x2f1/0x550 security/tomoyo/util.c:1093
> tomoyo_supervisor+0x2f2/0xf00 security/tomoyo/common.c:2089
> tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
> tomoyo_path_permission security/tomoyo/file.c:587 [inline]
> tomoyo_path_permission+0x270/0x3a0 security/tomoyo/file.c:573
> tomoyo_path_perm+0x39e/0x400 security/tomoyo/file.c:838
> tomoyo_path_symlink+0x94/0xe0 security/tomoyo/tomoyo.c:200
> security_path_symlink+0xdf/0x150 security/security.c:1119
> do_symlinkat+0x123/0x300 fs/namei.c:4201
> do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> to a SOFTIRQ-irq-unsafe lock:
> (hugetlb_lock){+.+.}-{2:2}
>
> ... which became SOFTIRQ-irq-unsafe at:
> ...
> lock_acquire kernel/locking/lockdep.c:5510 [inline]
> lock_acquire+0x1ab/0x730 kernel/locking/lockdep.c:5475
> __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
> _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
> spin_lock include/linux/spinlock.h:354 [inline]
> hugetlb_overcommit_handler+0x260/0x3e0 mm/hugetlb.c:3448
> proc_sys_call_handler+0x336/0x610 fs/proc/proc_sysctl.c:591
> call_write_iter include/linux/fs.h:1977 [inline]
> new_sync_write+0x426/0x650 fs/read_write.c:519
> vfs_write+0x796/0xa30 fs/read_write.c:606
> ksys_write+0x12d/0x250 fs/read_write.c:659
> do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> other info that might help us debug this:
>
> Possible interrupt unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(hugetlb_lock);
> local_irq_disable();
> lock(slock-AF_INET);
> lock(hugetlb_lock);
> <Interrupt>
> lock(slock-AF_INET);
>
> *** DEADLOCK ***

This has been reproduced on 4.19 stable kernel as well [1] and there
is a reproducer as well.

It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
wonder if we just need to make hugetlb_lock softirq-safe.

[1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93

2021-02-26 23:18:33

by Mike Kravetz

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

Cc: Michal

On 2/26/21 2:44 PM, Shakeel Butt wrote:
> On Fri, Feb 26, 2021 at 2:09 PM syzbot
> <[email protected]> wrote:
<snip>
>> other info that might help us debug this:
>>
>> Possible interrupt unsafe locking scenario:
>>
>> CPU0 CPU1
>> ---- ----
>> lock(hugetlb_lock);
>> local_irq_disable();
>> lock(slock-AF_INET);
>> lock(hugetlb_lock);
>> <Interrupt>
>> lock(slock-AF_INET);
>>
>> *** DEADLOCK ***
>
> This has been reproduced on 4.19 stable kernel as well [1] and there
> is a reproducer as well.
>
> It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> wonder if we just need to make hugetlb_lock softirq-safe.
>
> [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93

Thanks Shakeel,

Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
context") attempted to address this issue. It uses a work queue to
acquire hugetlb_lock if the caller is !in_task().

In another recent thread, there was the suggestion to change the
!in_task to in_atomic.

I need to do some research on the subtle differences between in_task,
in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
reported here. But, that obviously is not the case.
--
Mike Kravetz

2021-02-27 00:04:06

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Fri, Feb 26, 2021 at 3:14 PM Mike Kravetz <[email protected]> wrote:
>
> Cc: Michal
>
> On 2/26/21 2:44 PM, Shakeel Butt wrote:
> > On Fri, Feb 26, 2021 at 2:09 PM syzbot
> > <[email protected]> wrote:
> <snip>
> >> other info that might help us debug this:
> >>
> >> Possible interrupt unsafe locking scenario:
> >>
> >> CPU0 CPU1
> >> ---- ----
> >> lock(hugetlb_lock);
> >> local_irq_disable();
> >> lock(slock-AF_INET);
> >> lock(hugetlb_lock);
> >> <Interrupt>
> >> lock(slock-AF_INET);
> >>
> >> *** DEADLOCK ***
> >
> > This has been reproduced on 4.19 stable kernel as well [1] and there
> > is a reproducer as well.
> >
> > It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> > wonder if we just need to make hugetlb_lock softirq-safe.
> >
> > [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93
>
> Thanks Shakeel,
>
> Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> context") attempted to address this issue. It uses a work queue to
> acquire hugetlb_lock if the caller is !in_task().
>
> In another recent thread, there was the suggestion to change the
> !in_task to in_atomic.
>
> I need to do some research on the subtle differences between in_task,
> in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
> reported here. But, that obviously is not the case.

I think the freeing is happening in the process context in this report
but it is creating the lock chain from softirq-safe slock to
irq-unsafe hugetlb_lock. So, two solutions I can think of are: (1)
always defer the freeing of hugetlb pages to a work queue or (2) make
hugetlb_lock softirq-safe.

2021-03-01 12:13:50

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Fri 26-02-21 16:00:30, Shakeel Butt wrote:
> On Fri, Feb 26, 2021 at 3:14 PM Mike Kravetz <[email protected]> wrote:
> >
> > Cc: Michal
> >
> > On 2/26/21 2:44 PM, Shakeel Butt wrote:
> > > On Fri, Feb 26, 2021 at 2:09 PM syzbot
> > > <[email protected]> wrote:
> > <snip>
> > >> other info that might help us debug this:
> > >>
> > >> Possible interrupt unsafe locking scenario:
> > >>
> > >> CPU0 CPU1
> > >> ---- ----
> > >> lock(hugetlb_lock);
> > >> local_irq_disable();
> > >> lock(slock-AF_INET);
> > >> lock(hugetlb_lock);
> > >> <Interrupt>
> > >> lock(slock-AF_INET);
> > >>
> > >> *** DEADLOCK ***
> > >
> > > This has been reproduced on 4.19 stable kernel as well [1] and there
> > > is a reproducer as well.
> > >
> > > It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> > > wonder if we just need to make hugetlb_lock softirq-safe.
> > >
> > > [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93
> >
> > Thanks Shakeel,
> >
> > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > context") attempted to address this issue. It uses a work queue to
> > acquire hugetlb_lock if the caller is !in_task().
> >
> > In another recent thread, there was the suggestion to change the
> > !in_task to in_atomic.
> >
> > I need to do some research on the subtle differences between in_task,
> > in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
> > reported here. But, that obviously is not the case.
>
> I think the freeing is happening in the process context in this report
> but it is creating the lock chain from softirq-safe slock to
> irq-unsafe hugetlb_lock. So, two solutions I can think of are: (1)
> always defer the freeing of hugetlb pages to a work queue or (2) make
> hugetlb_lock softirq-safe.

There is __do_softirq so this should be in the soft IRQ context no?
Is this really reproducible with kernels which have c77c0a8ac4c5
applied?

Btw. making hugetlb lock irq safe has been already discussed and it
seems to be much harder than expected as some heavy operations are done
under the lock. This is really bad. Postponing the whole freeing
operation into a worker context is certainly possible but I would
consider it rather unfortunate. We would have to add some sync mechanism
to wait for hugetlb pages in flight to prevent from external
observability to the userspace. E.g. when shrinking the pool.
--
Michal Hocko
SUSE Labs

2021-03-01 15:14:53

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Mon, Mar 1, 2021 at 4:12 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 26-02-21 16:00:30, Shakeel Butt wrote:
> > On Fri, Feb 26, 2021 at 3:14 PM Mike Kravetz <[email protected]> wrote:
> > >
> > > Cc: Michal
> > >
> > > On 2/26/21 2:44 PM, Shakeel Butt wrote:
> > > > On Fri, Feb 26, 2021 at 2:09 PM syzbot
> > > > <[email protected]> wrote:
> > > <snip>
> > > >> other info that might help us debug this:
> > > >>
> > > >> Possible interrupt unsafe locking scenario:
> > > >>
> > > >> CPU0 CPU1
> > > >> ---- ----
> > > >> lock(hugetlb_lock);
> > > >> local_irq_disable();
> > > >> lock(slock-AF_INET);
> > > >> lock(hugetlb_lock);
> > > >> <Interrupt>
> > > >> lock(slock-AF_INET);
> > > >>
> > > >> *** DEADLOCK ***
> > > >
> > > > This has been reproduced on 4.19 stable kernel as well [1] and there
> > > > is a reproducer as well.
> > > >
> > > > It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> > > > wonder if we just need to make hugetlb_lock softirq-safe.
> > > >
> > > > [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93
> > >
> > > Thanks Shakeel,
> > >
> > > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > > context") attempted to address this issue. It uses a work queue to
> > > acquire hugetlb_lock if the caller is !in_task().
> > >
> > > In another recent thread, there was the suggestion to change the
> > > !in_task to in_atomic.
> > >
> > > I need to do some research on the subtle differences between in_task,
> > > in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
> > > reported here. But, that obviously is not the case.
> >
> > I think the freeing is happening in the process context in this report
> > but it is creating the lock chain from softirq-safe slock to
> > irq-unsafe hugetlb_lock. So, two solutions I can think of are: (1)
> > always defer the freeing of hugetlb pages to a work queue or (2) make
> > hugetlb_lock softirq-safe.
>
> There is __do_softirq so this should be in the soft IRQ context no?
> Is this really reproducible with kernels which have c77c0a8ac4c5
> applied?

Yes this is softirq context and syzbot has reproduced this on
linux-next 20210224.

>
> Btw. making hugetlb lock irq safe has been already discussed and it
> seems to be much harder than expected as some heavy operations are done
> under the lock. This is really bad.

What about just softirq-safe i.e. spin_[un]lock_bh()? Will it still be that bad?

> Postponing the whole freeing
> operation into a worker context is certainly possible but I would
> consider it rather unfortunate. We would have to add some sync mechanism
> to wait for hugetlb pages in flight to prevent from external
> observability to the userspace. E.g. when shrinking the pool.

I think in practice recycling of hugetlb pages is a rare event, so we
might get away without the sync mechanism. How about start postponing
the freeing without sync mechanism and add it later if there are any
user reports complaining?

> --
> Michal Hocko
> SUSE Labs

2021-03-01 15:58:58

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Mon 01-03-21 07:10:11, Shakeel Butt wrote:
> On Mon, Mar 1, 2021 at 4:12 AM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 26-02-21 16:00:30, Shakeel Butt wrote:
> > > On Fri, Feb 26, 2021 at 3:14 PM Mike Kravetz <[email protected]> wrote:
> > > >
> > > > Cc: Michal
> > > >
> > > > On 2/26/21 2:44 PM, Shakeel Butt wrote:
> > > > > On Fri, Feb 26, 2021 at 2:09 PM syzbot
> > > > > <[email protected]> wrote:
> > > > <snip>
> > > > >> other info that might help us debug this:
> > > > >>
> > > > >> Possible interrupt unsafe locking scenario:
> > > > >>
> > > > >> CPU0 CPU1
> > > > >> ---- ----
> > > > >> lock(hugetlb_lock);
> > > > >> local_irq_disable();
> > > > >> lock(slock-AF_INET);
> > > > >> lock(hugetlb_lock);
> > > > >> <Interrupt>
> > > > >> lock(slock-AF_INET);
> > > > >>
> > > > >> *** DEADLOCK ***
> > > > >
> > > > > This has been reproduced on 4.19 stable kernel as well [1] and there
> > > > > is a reproducer as well.
> > > > >
> > > > > It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> > > > > wonder if we just need to make hugetlb_lock softirq-safe.
> > > > >
> > > > > [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93
> > > >
> > > > Thanks Shakeel,
> > > >
> > > > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > > > context") attempted to address this issue. It uses a work queue to
> > > > acquire hugetlb_lock if the caller is !in_task().
> > > >
> > > > In another recent thread, there was the suggestion to change the
> > > > !in_task to in_atomic.
> > > >
> > > > I need to do some research on the subtle differences between in_task,
> > > > in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
> > > > reported here. But, that obviously is not the case.
> > >
> > > I think the freeing is happening in the process context in this report
> > > but it is creating the lock chain from softirq-safe slock to
> > > irq-unsafe hugetlb_lock. So, two solutions I can think of are: (1)
> > > always defer the freeing of hugetlb pages to a work queue or (2) make
> > > hugetlb_lock softirq-safe.
> >
> > There is __do_softirq so this should be in the soft IRQ context no?
> > Is this really reproducible with kernels which have c77c0a8ac4c5
> > applied?
>
> Yes this is softirq context and syzbot has reproduced this on
> linux-next 20210224.

Then how come this can ever be a problem? in_task() should exclude soft
irq context unless I am mistaken.

> > Btw. making hugetlb lock irq safe has been already discussed and it
> > seems to be much harder than expected as some heavy operations are done
> > under the lock. This is really bad.
>
> What about just softirq-safe i.e. spin_[un]lock_bh()? Will it still be that bad?

This would be a similar problem to the irq variant. It would just result
in soft irq being delayed potentially.

> > Postponing the whole freeing
> > operation into a worker context is certainly possible but I would
> > consider it rather unfortunate. We would have to add some sync mechanism
> > to wait for hugetlb pages in flight to prevent from external
> > observability to the userspace. E.g. when shrinking the pool.
>
> I think in practice recycling of hugetlb pages is a rare event, so we
> might get away without the sync mechanism. How about start postponing
> the freeing without sync mechanism and add it later if there are any
> user reports complaining?

I think this should be a last resort. Maybe we can come up with
something better. E.g. break down the hugetlb_lock and use something
different for expensive operations.
--
Michal Hocko
SUSE Labs

2021-03-01 18:41:06

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 01-03-21 07:10:11, Shakeel Butt wrote:
> > On Mon, Mar 1, 2021 at 4:12 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Fri 26-02-21 16:00:30, Shakeel Butt wrote:
> > > > On Fri, Feb 26, 2021 at 3:14 PM Mike Kravetz <[email protected]> wrote:
> > > > >
> > > > > Cc: Michal
> > > > >
> > > > > On 2/26/21 2:44 PM, Shakeel Butt wrote:
> > > > > > On Fri, Feb 26, 2021 at 2:09 PM syzbot
> > > > > > <[email protected]> wrote:
> > > > > <snip>
> > > > > >> other info that might help us debug this:
> > > > > >>
> > > > > >> Possible interrupt unsafe locking scenario:
> > > > > >>
> > > > > >> CPU0 CPU1
> > > > > >> ---- ----
> > > > > >> lock(hugetlb_lock);
> > > > > >> local_irq_disable();
> > > > > >> lock(slock-AF_INET);
> > > > > >> lock(hugetlb_lock);
> > > > > >> <Interrupt>
> > > > > >> lock(slock-AF_INET);
> > > > > >>
> > > > > >> *** DEADLOCK ***
> > > > > >
> > > > > > This has been reproduced on 4.19 stable kernel as well [1] and there
> > > > > > is a reproducer as well.
> > > > > >
> > > > > > It seems like sendmsg(MSG_ZEROCOPY) from a buffer backed by hugetlb. I
> > > > > > wonder if we just need to make hugetlb_lock softirq-safe.
> > > > > >
> > > > > > [1] https://syzkaller.appspot.com/bug?extid=6383ce4b0b8ec575ad93
> > > > >
> > > > > Thanks Shakeel,
> > > > >
> > > > > Commit c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task
> > > > > context") attempted to address this issue. It uses a work queue to
> > > > > acquire hugetlb_lock if the caller is !in_task().
> > > > >
> > > > > In another recent thread, there was the suggestion to change the
> > > > > !in_task to in_atomic.
> > > > >
> > > > > I need to do some research on the subtle differences between in_task,
> > > > > in_atomic, etc. TBH, I 'thought' !in_task would prevent the issue
> > > > > reported here. But, that obviously is not the case.
> > > >
> > > > I think the freeing is happening in the process context in this report
> > > > but it is creating the lock chain from softirq-safe slock to
> > > > irq-unsafe hugetlb_lock. So, two solutions I can think of are: (1)
> > > > always defer the freeing of hugetlb pages to a work queue or (2) make
> > > > hugetlb_lock softirq-safe.
> > >
> > > There is __do_softirq so this should be in the soft IRQ context no?
> > > Is this really reproducible with kernels which have c77c0a8ac4c5
> > > applied?
> >
> > Yes this is softirq context and syzbot has reproduced this on
> > linux-next 20210224.
>
> Then how come this can ever be a problem? in_task() should exclude soft
> irq context unless I am mistaken.
>

If I take the following example of syzbot's deadlock scenario then
CPU1 is the one freeing the hugetlb pages. It is in the process
context but has disabled softirqs (see __tcp_close()).

CPU0 CPU1
---- ----
lock(hugetlb_lock);
local_irq_disable();
lock(slock-AF_INET);
lock(hugetlb_lock);
<Interrupt>
lock(slock-AF_INET);

So, this deadlock scenario is very much possible.

2021-03-02 18:24:26

by Mike Kravetz

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On 3/1/21 9:23 AM, Michal Hocko wrote:
> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> [...]
>>> Then how come this can ever be a problem? in_task() should exclude soft
>>> irq context unless I am mistaken.
>>>
>>
>> If I take the following example of syzbot's deadlock scenario then
>> CPU1 is the one freeing the hugetlb pages. It is in the process
>> context but has disabled softirqs (see __tcp_close()).
>>
>> CPU0 CPU1
>> ---- ----
>> lock(hugetlb_lock);
>> local_irq_disable();
>> lock(slock-AF_INET);
>> lock(hugetlb_lock);
>> <Interrupt>
>> lock(slock-AF_INET);
>>
>> So, this deadlock scenario is very much possible.
>
> OK, I see the point now. I was focusing on the IRQ context and hugetlb
> side too much. We do not need to be freeing from there. All it takes is
> to get a dependency chain over a common lock held here. Thanks for
> bearing with me.
>
> Let's see whether we can make hugetlb_lock irq safe.

I may be confused, but it seems like we have a general problem with
calling free_huge_page (as a result of put_page) with interrupts
disabled.

Consider the current free_huge_page code. Today, we drop the lock
when processing gigantic pages because we may need to block on a mutex
in cma code. If our caller has disabled interrupts, then it doesn't
matter if the hugetlb lock is irq safe, when we drop it interrupts will
still be disabled we can not block . Right? If correct, then making
hugetlb_lock irq safe would not help.

Again, I may be missing something.

Note that we also are considering doing more with the hugetlb lock
dropped in this path in the 'free vmemmap of hugetlb pages' series.

Since we need to do some work that could block in this path, it seems
like we really need to use a workqueue. It is too bad that there is not
an interface to identify all the cases where interrupts are disabled.
--
Mike Kravetz

2021-03-02 21:38:58

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> On 3/1/21 9:23 AM, Michal Hocko wrote:
> > On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> >> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > [...]
> >>> Then how come this can ever be a problem? in_task() should exclude soft
> >>> irq context unless I am mistaken.
> >>>
> >>
> >> If I take the following example of syzbot's deadlock scenario then
> >> CPU1 is the one freeing the hugetlb pages. It is in the process
> >> context but has disabled softirqs (see __tcp_close()).
> >>
> >> CPU0 CPU1
> >> ---- ----
> >> lock(hugetlb_lock);
> >> local_irq_disable();
> >> lock(slock-AF_INET);
> >> lock(hugetlb_lock);
> >> <Interrupt>
> >> lock(slock-AF_INET);
> >>
> >> So, this deadlock scenario is very much possible.
> >
> > OK, I see the point now. I was focusing on the IRQ context and hugetlb
> > side too much. We do not need to be freeing from there. All it takes is
> > to get a dependency chain over a common lock held here. Thanks for
> > bearing with me.
> >
> > Let's see whether we can make hugetlb_lock irq safe.
>
> I may be confused, but it seems like we have a general problem with
> calling free_huge_page (as a result of put_page) with interrupts
> disabled.
>
> Consider the current free_huge_page code. Today, we drop the lock
> when processing gigantic pages because we may need to block on a mutex
> in cma code. If our caller has disabled interrupts, then it doesn't
> matter if the hugetlb lock is irq safe, when we drop it interrupts will
> still be disabled we can not block . Right? If correct, then making
> hugetlb_lock irq safe would not help.
>
> Again, I may be missing something.
>
> Note that we also are considering doing more with the hugetlb lock
> dropped in this path in the 'free vmemmap of hugetlb pages' series.
>
> Since we need to do some work that could block in this path, it seems
> like we really need to use a workqueue. It is too bad that there is not
> an interface to identify all the cases where interrupts are disabled.

Wouldn't something like this help? It is quite ugly but it would be
simple enough and backportable while we come up with a more rigorous
solution. What do you think?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..c9a8b39f678d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
void free_huge_page(struct page *page)
{
/*
- * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+ * Defer freeing if in non-task context or when put_page is called
+ * with IRQ disabled (e.g from via TCP slock dependency chain) to
+ * avoid hugetlb_lock deadlock.
*/
- if (!in_task()) {
+ if (!in_task() || irqs_disabled()) {
/*
* Only call schedule_work() if hpage_freelist is previously
* empty. Otherwise, schedule_work() had been called but the
--
Michal Hocko
SUSE Labs

2021-03-03 14:54:16

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
[...]
> > Then how come this can ever be a problem? in_task() should exclude soft
> > irq context unless I am mistaken.
> >
>
> If I take the following example of syzbot's deadlock scenario then
> CPU1 is the one freeing the hugetlb pages. It is in the process
> context but has disabled softirqs (see __tcp_close()).
>
> CPU0 CPU1
> ---- ----
> lock(hugetlb_lock);
> local_irq_disable();
> lock(slock-AF_INET);
> lock(hugetlb_lock);
> <Interrupt>
> lock(slock-AF_INET);
>
> So, this deadlock scenario is very much possible.

OK, I see the point now. I was focusing on the IRQ context and hugetlb
side too much. We do not need to be freeing from there. All it takes is
to get a dependency chain over a common lock held here. Thanks for
bearing with me.

Let's see whether we can make hugetlb_lock irq safe.

--
Michal Hocko
SUSE Labs

2021-03-04 06:30:56

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> > On 3/1/21 9:23 AM, Michal Hocko wrote:
> > > On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> > >> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > > [...]
> > >>> Then how come this can ever be a problem? in_task() should exclude soft
> > >>> irq context unless I am mistaken.
> > >>>
> > >>
> > >> If I take the following example of syzbot's deadlock scenario then
> > >> CPU1 is the one freeing the hugetlb pages. It is in the process
> > >> context but has disabled softirqs (see __tcp_close()).
> > >>
> > >> CPU0 CPU1
> > >> ---- ----
> > >> lock(hugetlb_lock);
> > >> local_irq_disable();
> > >> lock(slock-AF_INET);
> > >> lock(hugetlb_lock);
> > >> <Interrupt>
> > >> lock(slock-AF_INET);
> > >>
> > >> So, this deadlock scenario is very much possible.
> > >
> > > OK, I see the point now. I was focusing on the IRQ context and hugetlb
> > > side too much. We do not need to be freeing from there. All it takes is
> > > to get a dependency chain over a common lock held here. Thanks for
> > > bearing with me.
> > >
> > > Let's see whether we can make hugetlb_lock irq safe.
> >
> > I may be confused, but it seems like we have a general problem with
> > calling free_huge_page (as a result of put_page) with interrupts
> > disabled.
> >
> > Consider the current free_huge_page code. Today, we drop the lock
> > when processing gigantic pages because we may need to block on a mutex
> > in cma code. If our caller has disabled interrupts, then it doesn't
> > matter if the hugetlb lock is irq safe, when we drop it interrupts will
> > still be disabled we can not block . Right? If correct, then making
> > hugetlb_lock irq safe would not help.
> >
> > Again, I may be missing something.
> >
> > Note that we also are considering doing more with the hugetlb lock
> > dropped in this path in the 'free vmemmap of hugetlb pages' series.
> >
> > Since we need to do some work that could block in this path, it seems
> > like we really need to use a workqueue. It is too bad that there is not
> > an interface to identify all the cases where interrupts are disabled.
>
> Wouldn't something like this help? It is quite ugly but it would be
> simple enough and backportable while we come up with a more rigorous
> solution. What do you think?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4bdb58ab14cb..c9a8b39f678d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> void free_huge_page(struct page *page)
> {
> /*
> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> + * Defer freeing if in non-task context or when put_page is called
> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> + * avoid hugetlb_lock deadlock.
> */
> - if (!in_task()) {
> + if (!in_task() || irqs_disabled()) {

Does irqs_disabled() also check softirqs?

2021-03-04 06:31:30

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> > > On 3/1/21 9:23 AM, Michal Hocko wrote:
> > > > On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> > > >> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > > > [...]
> > > >>> Then how come this can ever be a problem? in_task() should exclude soft
> > > >>> irq context unless I am mistaken.
> > > >>>
> > > >>
> > > >> If I take the following example of syzbot's deadlock scenario then
> > > >> CPU1 is the one freeing the hugetlb pages. It is in the process
> > > >> context but has disabled softirqs (see __tcp_close()).
> > > >>
> > > >> CPU0 CPU1
> > > >> ---- ----
> > > >> lock(hugetlb_lock);
> > > >> local_irq_disable();
> > > >> lock(slock-AF_INET);
> > > >> lock(hugetlb_lock);
> > > >> <Interrupt>
> > > >> lock(slock-AF_INET);
> > > >>
[...]
> > Wouldn't something like this help? It is quite ugly but it would be
> > simple enough and backportable while we come up with a more rigorous
> > solution. What do you think?
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 4bdb58ab14cb..c9a8b39f678d 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> > void free_huge_page(struct page *page)
> > {
> > /*
> > - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> > + * Defer freeing if in non-task context or when put_page is called
> > + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> > + * avoid hugetlb_lock deadlock.
> > */
> > - if (!in_task()) {
> > + if (!in_task() || irqs_disabled()) {
>
> Does irqs_disabled() also check softirqs?

Nope it doesn't AFAICS. I was referring to the above lockdep splat which
claims irq disabled to be the trigger. But now that you are mentioning
that it would be better to replace in_task() along the way. We have
discussed that in another email thread and I was suggesting to use
in_atomic() which should catch also bh disabled situation. The big IF is
that this needs preempt count to be enabled unconditionally. There are
changes in the RCU tree heading that direction.
--
Michal Hocko
SUSE Labs

2021-03-04 07:20:30

by Mike Kravetz

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On 3/2/21 6:29 AM, Michal Hocko wrote:
> On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
>> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
>>>
>>> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
>>>> On 3/1/21 9:23 AM, Michal Hocko wrote:
>>>>> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
>>>>>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
>>>>> [...]
>>>>>>> Then how come this can ever be a problem? in_task() should exclude soft
>>>>>>> irq context unless I am mistaken.
>>>>>>>
>>>>>>
>>>>>> If I take the following example of syzbot's deadlock scenario then
>>>>>> CPU1 is the one freeing the hugetlb pages. It is in the process
>>>>>> context but has disabled softirqs (see __tcp_close()).
>>>>>>
>>>>>> CPU0 CPU1
>>>>>> ---- ----
>>>>>> lock(hugetlb_lock);
>>>>>> local_irq_disable();
>>>>>> lock(slock-AF_INET);
>>>>>> lock(hugetlb_lock);
>>>>>> <Interrupt>
>>>>>> lock(slock-AF_INET);
>>>>>>
> [...]
>>> Wouldn't something like this help? It is quite ugly but it would be
>>> simple enough and backportable while we come up with a more rigorous
>>> solution. What do you think?
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 4bdb58ab14cb..c9a8b39f678d 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
>>> void free_huge_page(struct page *page)
>>> {
>>> /*
>>> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
>>> + * Defer freeing if in non-task context or when put_page is called
>>> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
>>> + * avoid hugetlb_lock deadlock.
>>> */
>>> - if (!in_task()) {
>>> + if (!in_task() || irqs_disabled()) {
>>
>> Does irqs_disabled() also check softirqs?
>
> Nope it doesn't AFAICS. I was referring to the above lockdep splat which
> claims irq disabled to be the trigger. But now that you are mentioning
> that it would be better to replace in_task() along the way. We have
> discussed that in another email thread and I was suggesting to use
> in_atomic() which should catch also bh disabled situation. The big IF is
> that this needs preempt count to be enabled unconditionally. There are
> changes in the RCU tree heading that direction.

I have not been following developments in preemption and the RCU tree.
The comment for in_atomic() says:

/*
* Are we running in atomic context? WARNING: this macro cannot
* always detect atomic context; in particular, it cannot know about
* held spinlocks in non-preemptible kernels. Thus it should not be
* used in the general case to determine whether sleeping is possible.
* Do not use in_atomic() in driver code.
*/

That does seem to be the case. I verified in_atomic can detect softirq
context even in non-preemptible kernels. But, as the comment says it
will not detect a held spinlock in non-preemptible kernels. So, I think
in_atomic would be better than the current check for !in_task. That
would handle this syzbot issue, but we could still have issues if the
hugetlb put_page path is called while someone is holding a spinlock with
all interrupts enabled. Looks like there is no way to detect this
today in non-preemptible kernels. in_atomic does detect spinlocks held
in preemptible kernels.

I might suggest changing !in_task to in_atomic for now, and then work on
a more robust solution. I'm afraid such a robust solution will
require considerable effort. It would need to handle put_page being
called in any context: hardirq, softirq, spinlock held ... The
put_page/free_huge_page path will need to offload (workqueue or
something else) any processing that can possibly sleep.

Is it worth making the in_atomic change now, or should we just start
working on the more robust complete solution?
--
Mike Kravetz

2021-03-04 08:16:56

by Shakeel Butt

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Tue, Mar 2, 2021 at 1:19 PM Mike Kravetz <[email protected]> wrote:
>
> On 3/2/21 6:29 AM, Michal Hocko wrote:
> > On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
> >> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
> >>>
> >>> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> >>>> On 3/1/21 9:23 AM, Michal Hocko wrote:
> >>>>> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> >>>>>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> >>>>> [...]
> >>>>>>> Then how come this can ever be a problem? in_task() should exclude soft
> >>>>>>> irq context unless I am mistaken.
> >>>>>>>
> >>>>>>
> >>>>>> If I take the following example of syzbot's deadlock scenario then
> >>>>>> CPU1 is the one freeing the hugetlb pages. It is in the process
> >>>>>> context but has disabled softirqs (see __tcp_close()).
> >>>>>>
> >>>>>> CPU0 CPU1
> >>>>>> ---- ----
> >>>>>> lock(hugetlb_lock);
> >>>>>> local_irq_disable();
> >>>>>> lock(slock-AF_INET);
> >>>>>> lock(hugetlb_lock);
> >>>>>> <Interrupt>
> >>>>>> lock(slock-AF_INET);
> >>>>>>
> > [...]
> >>> Wouldn't something like this help? It is quite ugly but it would be
> >>> simple enough and backportable while we come up with a more rigorous
> >>> solution. What do you think?
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index 4bdb58ab14cb..c9a8b39f678d 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> >>> void free_huge_page(struct page *page)
> >>> {
> >>> /*
> >>> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> >>> + * Defer freeing if in non-task context or when put_page is called
> >>> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> >>> + * avoid hugetlb_lock deadlock.
> >>> */
> >>> - if (!in_task()) {
> >>> + if (!in_task() || irqs_disabled()) {
> >>
> >> Does irqs_disabled() also check softirqs?
> >
> > Nope it doesn't AFAICS. I was referring to the above lockdep splat which
> > claims irq disabled to be the trigger. But now that you are mentioning
> > that it would be better to replace in_task() along the way. We have
> > discussed that in another email thread and I was suggesting to use
> > in_atomic() which should catch also bh disabled situation. The big IF is
> > that this needs preempt count to be enabled unconditionally. There are
> > changes in the RCU tree heading that direction.
>
> I have not been following developments in preemption and the RCU tree.
> The comment for in_atomic() says:
>
> /*
> * Are we running in atomic context? WARNING: this macro cannot
> * always detect atomic context; in particular, it cannot know about
> * held spinlocks in non-preemptible kernels. Thus it should not be
> * used in the general case to determine whether sleeping is possible.
> * Do not use in_atomic() in driver code.
> */
>
> That does seem to be the case. I verified in_atomic can detect softirq
> context even in non-preemptible kernels. But, as the comment says it
> will not detect a held spinlock in non-preemptible kernels. So, I think
> in_atomic would be better than the current check for !in_task. That
> would handle this syzbot issue, but we could still have issues if the
> hugetlb put_page path is called while someone is holding a spinlock with
> all interrupts enabled. Looks like there is no way to detect this
> today in non-preemptible kernels. in_atomic does detect spinlocks held
> in preemptible kernels.
>
> I might suggest changing !in_task to in_atomic for now, and then work on
> a more robust solution. I'm afraid such a robust solution will
> require considerable effort. It would need to handle put_page being
> called in any context: hardirq, softirq, spinlock held ... The
> put_page/free_huge_page path will need to offload (workqueue or
> something else) any processing that can possibly sleep.
>
> Is it worth making the in_atomic change now, or should we just start
> working on the more robust complete solution?

IMHO the change to in_atomic is beneficial because it will at least
fix this specific issue. No reason to keep the users of TCP TX
zerocopy from hugetlb pages broken for a more comprehensive solution.

2021-03-04 16:11:46

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Wed 03-03-21 09:59:45, Paul E. McKenney wrote:
> On Wed, Mar 03, 2021 at 09:03:27AM +0100, Michal Hocko wrote:
[...]
> > Paul what is the current plan with in_atomic to be usable for !PREEMPT
> > configurations?
>
> Ah, thank you for the reminder! I have rebased that series on top of
> v5.12-rc1 on -rcu branch tglx-pc.2021.03.03a.
>
> The current state is that Linus is not convinced that this change is
> worthwhile given that only RCU and printk() want it. (BPF would use
> it if it were available, but is willing to live without it, at least in
> the short term.)
>
> But maybe Linus will be convinced given your additional use case.
> Here is hoping!

Yes, hugetlb freeing path would benefit from this. You can reference
this lockdep report (http://lkml.kernel.org/r/[email protected])
with an additional argument that making hugetlb_lock irq safe is a
larger undertaking and we will need something reasonably backportable
for older kernels as well.
--
Michal Hocko
SUSE Labs

2021-03-04 21:37:00

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

[Add Paul]

On Tue 02-03-21 13:19:34, Mike Kravetz wrote:
> On 3/2/21 6:29 AM, Michal Hocko wrote:
> > On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
> >> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
> >>>
> >>> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> >>>> On 3/1/21 9:23 AM, Michal Hocko wrote:
> >>>>> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> >>>>>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> >>>>> [...]
> >>>>>>> Then how come this can ever be a problem? in_task() should exclude soft
> >>>>>>> irq context unless I am mistaken.
> >>>>>>>
> >>>>>>
> >>>>>> If I take the following example of syzbot's deadlock scenario then
> >>>>>> CPU1 is the one freeing the hugetlb pages. It is in the process
> >>>>>> context but has disabled softirqs (see __tcp_close()).
> >>>>>>
> >>>>>> CPU0 CPU1
> >>>>>> ---- ----
> >>>>>> lock(hugetlb_lock);
> >>>>>> local_irq_disable();
> >>>>>> lock(slock-AF_INET);
> >>>>>> lock(hugetlb_lock);
> >>>>>> <Interrupt>
> >>>>>> lock(slock-AF_INET);
> >>>>>>
> > [...]
> >>> Wouldn't something like this help? It is quite ugly but it would be
> >>> simple enough and backportable while we come up with a more rigorous
> >>> solution. What do you think?
> >>>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index 4bdb58ab14cb..c9a8b39f678d 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> >>> void free_huge_page(struct page *page)
> >>> {
> >>> /*
> >>> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> >>> + * Defer freeing if in non-task context or when put_page is called
> >>> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> >>> + * avoid hugetlb_lock deadlock.
> >>> */
> >>> - if (!in_task()) {
> >>> + if (!in_task() || irqs_disabled()) {
> >>
> >> Does irqs_disabled() also check softirqs?
> >
> > Nope it doesn't AFAICS. I was referring to the above lockdep splat which
> > claims irq disabled to be the trigger. But now that you are mentioning
> > that it would be better to replace in_task() along the way. We have
> > discussed that in another email thread and I was suggesting to use
> > in_atomic() which should catch also bh disabled situation. The big IF is
> > that this needs preempt count to be enabled unconditionally. There are
> > changes in the RCU tree heading that direction.
>
> I have not been following developments in preemption and the RCU tree.
> The comment for in_atomic() says:
>
> /*
> * Are we running in atomic context? WARNING: this macro cannot
> * always detect atomic context; in particular, it cannot know about
> * held spinlocks in non-preemptible kernels. Thus it should not be
> * used in the general case to determine whether sleeping is possible.
> * Do not use in_atomic() in driver code.
> */
>
> That does seem to be the case. I verified in_atomic can detect softirq
> context even in non-preemptible kernels. But, as the comment says it
> will not detect a held spinlock in non-preemptible kernels. So, I think
> in_atomic would be better than the current check for !in_task. That
> would handle this syzbot issue, but we could still have issues if the
> hugetlb put_page path is called while someone is holding a spinlock with
> all interrupts enabled. Looks like there is no way to detect this
> today in non-preemptible kernels. in_atomic does detect spinlocks held
> in preemptible kernels.

Paul what is the current plan with in_atomic to be usable for !PREEMPT
configurations?

--
Michal Hocko
SUSE Labs

2021-03-04 23:28:19

by Paul E. McKenney

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Wed, Mar 03, 2021 at 09:03:27AM +0100, Michal Hocko wrote:
> [Add Paul]
>
> On Tue 02-03-21 13:19:34, Mike Kravetz wrote:
> > On 3/2/21 6:29 AM, Michal Hocko wrote:
> > > On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
> > >> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
> > >>>
> > >>> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> > >>>> On 3/1/21 9:23 AM, Michal Hocko wrote:
> > >>>>> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> > >>>>>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > >>>>> [...]
> > >>>>>>> Then how come this can ever be a problem? in_task() should exclude soft
> > >>>>>>> irq context unless I am mistaken.
> > >>>>>>>
> > >>>>>>
> > >>>>>> If I take the following example of syzbot's deadlock scenario then
> > >>>>>> CPU1 is the one freeing the hugetlb pages. It is in the process
> > >>>>>> context but has disabled softirqs (see __tcp_close()).
> > >>>>>>
> > >>>>>> CPU0 CPU1
> > >>>>>> ---- ----
> > >>>>>> lock(hugetlb_lock);
> > >>>>>> local_irq_disable();
> > >>>>>> lock(slock-AF_INET);
> > >>>>>> lock(hugetlb_lock);
> > >>>>>> <Interrupt>
> > >>>>>> lock(slock-AF_INET);
> > >>>>>>
> > > [...]
> > >>> Wouldn't something like this help? It is quite ugly but it would be
> > >>> simple enough and backportable while we come up with a more rigorous
> > >>> solution. What do you think?
> > >>>
> > >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >>> index 4bdb58ab14cb..c9a8b39f678d 100644
> > >>> --- a/mm/hugetlb.c
> > >>> +++ b/mm/hugetlb.c
> > >>> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> > >>> void free_huge_page(struct page *page)
> > >>> {
> > >>> /*
> > >>> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> > >>> + * Defer freeing if in non-task context or when put_page is called
> > >>> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> > >>> + * avoid hugetlb_lock deadlock.
> > >>> */
> > >>> - if (!in_task()) {
> > >>> + if (!in_task() || irqs_disabled()) {
> > >>
> > >> Does irqs_disabled() also check softirqs?
> > >
> > > Nope it doesn't AFAICS. I was referring to the above lockdep splat which
> > > claims irq disabled to be the trigger. But now that you are mentioning
> > > that it would be better to replace in_task() along the way. We have
> > > discussed that in another email thread and I was suggesting to use
> > > in_atomic() which should catch also bh disabled situation. The big IF is
> > > that this needs preempt count to be enabled unconditionally. There are
> > > changes in the RCU tree heading that direction.
> >
> > I have not been following developments in preemption and the RCU tree.
> > The comment for in_atomic() says:
> >
> > /*
> > * Are we running in atomic context? WARNING: this macro cannot
> > * always detect atomic context; in particular, it cannot know about
> > * held spinlocks in non-preemptible kernels. Thus it should not be
> > * used in the general case to determine whether sleeping is possible.
> > * Do not use in_atomic() in driver code.
> > */
> >
> > That does seem to be the case. I verified in_atomic can detect softirq
> > context even in non-preemptible kernels. But, as the comment says it
> > will not detect a held spinlock in non-preemptible kernels. So, I think
> > in_atomic would be better than the current check for !in_task. That
> > would handle this syzbot issue, but we could still have issues if the
> > hugetlb put_page path is called while someone is holding a spinlock with
> > all interrupts enabled. Looks like there is no way to detect this
> > today in non-preemptible kernels. in_atomic does detect spinlocks held
> > in preemptible kernels.
>
> Paul what is the current plan with in_atomic to be usable for !PREEMPT
> configurations?

Ah, thank you for the reminder! I have rebased that series on top of
v5.12-rc1 on -rcu branch tglx-pc.2021.03.03a.

The current state is that Linus is not convinced that this change is
worthwhile given that only RCU and printk() want it. (BPF would use
it if it were available, but is willing to live without it, at least in
the short term.)

But maybe Linus will be convinced given your additional use case.
Here is hoping!

Thanx, Paul

2021-03-05 09:11:15

by Michal Hocko

[permalink] [raw]

Subject: Re: possible deadlock in sk_clone_lock

On Tue 02-03-21 19:59:22, Shakeel Butt wrote:
> On Tue, Mar 2, 2021 at 1:19 PM Mike Kravetz <[email protected]> wrote:
> >
> > On 3/2/21 6:29 AM, Michal Hocko wrote:
> > > On Tue 02-03-21 06:11:51, Shakeel Butt wrote:
> > >> On Tue, Mar 2, 2021 at 1:44 AM Michal Hocko <[email protected]> wrote:
> > >>>
> > >>> On Mon 01-03-21 17:16:29, Mike Kravetz wrote:
> > >>>> On 3/1/21 9:23 AM, Michal Hocko wrote:
> > >>>>> On Mon 01-03-21 08:39:22, Shakeel Butt wrote:
> > >>>>>> On Mon, Mar 1, 2021 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > >>>>> [...]
> > >>>>>>> Then how come this can ever be a problem? in_task() should exclude soft
> > >>>>>>> irq context unless I am mistaken.
> > >>>>>>>
> > >>>>>>
> > >>>>>> If I take the following example of syzbot's deadlock scenario then
> > >>>>>> CPU1 is the one freeing the hugetlb pages. It is in the process
> > >>>>>> context but has disabled softirqs (see __tcp_close()).
> > >>>>>>
> > >>>>>> CPU0 CPU1
> > >>>>>> ---- ----
> > >>>>>> lock(hugetlb_lock);
> > >>>>>> local_irq_disable();
> > >>>>>> lock(slock-AF_INET);
> > >>>>>> lock(hugetlb_lock);
> > >>>>>> <Interrupt>
> > >>>>>> lock(slock-AF_INET);
> > >>>>>>
> > > [...]
> > >>> Wouldn't something like this help? It is quite ugly but it would be
> > >>> simple enough and backportable while we come up with a more rigorous
> > >>> solution. What do you think?
> > >>>
> > >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > >>> index 4bdb58ab14cb..c9a8b39f678d 100644
> > >>> --- a/mm/hugetlb.c
> > >>> +++ b/mm/hugetlb.c
> > >>> @@ -1495,9 +1495,11 @@ static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> > >>> void free_huge_page(struct page *page)
> > >>> {
> > >>> /*
> > >>> - * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> > >>> + * Defer freeing if in non-task context or when put_page is called
> > >>> + * with IRQ disabled (e.g from via TCP slock dependency chain) to
> > >>> + * avoid hugetlb_lock deadlock.
> > >>> */
> > >>> - if (!in_task()) {
> > >>> + if (!in_task() || irqs_disabled()) {
> > >>
> > >> Does irqs_disabled() also check softirqs?
> > >
> > > Nope it doesn't AFAICS. I was referring to the above lockdep splat which
> > > claims irq disabled to be the trigger. But now that you are mentioning
> > > that it would be better to replace in_task() along the way. We have
> > > discussed that in another email thread and I was suggesting to use
> > > in_atomic() which should catch also bh disabled situation. The big IF is
> > > that this needs preempt count to be enabled unconditionally. There are
> > > changes in the RCU tree heading that direction.
> >
> > I have not been following developments in preemption and the RCU tree.
> > The comment for in_atomic() says:
> >
> > /*
> > * Are we running in atomic context? WARNING: this macro cannot
> > * always detect atomic context; in particular, it cannot know about
> > * held spinlocks in non-preemptible kernels. Thus it should not be
> > * used in the general case to determine whether sleeping is possible.
> > * Do not use in_atomic() in driver code.
> > */
> >
> > That does seem to be the case. I verified in_atomic can detect softirq
> > context even in non-preemptible kernels. But, as the comment says it
> > will not detect a held spinlock in non-preemptible kernels. So, I think
> > in_atomic would be better than the current check for !in_task. That
> > would handle this syzbot issue, but we could still have issues if the
> > hugetlb put_page path is called while someone is holding a spinlock with
> > all interrupts enabled. Looks like there is no way to detect this
> > today in non-preemptible kernels. in_atomic does detect spinlocks held
> > in preemptible kernels.
> >
> > I might suggest changing !in_task to in_atomic for now, and then work on
> > a more robust solution. I'm afraid such a robust solution will
> > require considerable effort. It would need to handle put_page being
> > called in any context: hardirq, softirq, spinlock held ... The
> > put_page/free_huge_page path will need to offload (workqueue or
> > something else) any processing that can possibly sleep.
> >
> > Is it worth making the in_atomic change now, or should we just start
> > working on the more robust complete solution?
>
> IMHO the change to in_atomic is beneficial because it will at least
> fix this specific issue. No reason to keep the users of TCP TX
> zerocopy from hugetlb pages broken for a more comprehensive solution.

Another option would be to select PREEMPT_COUNT when hugetlb is enabled.
That would reduce dependency on a patch I was talking about in other
email. Not nice but here we are...

--
Michal Hocko
SUSE Labs