2022-12-13 15:38:45

by Wei Chen

Subject: possible deadlock in __ata_sff_interrupt

Dear Linux Developer,

Recently, while using our tool to fuzz the kernel, the following crash was triggered.

HEAD commit: 094226ad94f4 Linux v6.1-rc5
git tree: upstream
compiler: clang 12.0.1
console output:
https://drive.google.com/file/d/1QZttkbuLed4wp6U32UR6TpxfY_HHCIqQ/view?usp=share_link
kernel config: https://drive.google.com/file/d/1TdPsg_5Zon8S2hEFpLBWjb8Tnd2KA5WJ/view?usp=share_link

Unfortunately, I don't have a reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: Wei Chen <[email protected]>

=====================================================
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
6.1.0-rc5 #40 Not tainted
-----------------------------------------------------
syz-executor.0/27911 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffff888076cc4f30 (&new->fa_lock){....}-{2:2}, at: kill_fasync_rcu
fs/fcntl.c:996 [inline]
ffff888076cc4f30 (&new->fa_lock){....}-{2:2}, at:
kill_fasync+0x13b/0x430 fs/fcntl.c:1017

and this task is already holding:
ffff8880144dec18 (&host->lock){-.-.}-{2:2}, at:
ata_scsi_queuecmd+0x7a/0x130 drivers/ata/libata-scsi.c:4048
which would create a new lock dependency:
(&host->lock){-.-.}-{2:2} -> (&new->fa_lock){....}-{2:2}

but this new dependency connects a HARDIRQ-irq-safe lock:
(&host->lock){-.-.}-{2:2}

... which became HARDIRQ-irq-safe at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xb3/0x100 kernel/locking/spinlock.c:162
__ata_sff_interrupt+0x23/0x710 drivers/ata/libata-sff.c:1540
__handle_irq_event_percpu+0x1f7/0x620 kernel/irq/handle.c:158
handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
handle_irq_event+0x83/0x1e0 kernel/irq/handle.c:210
handle_edge_irq+0x245/0xbe0 kernel/irq/chip.c:819
generic_handle_irq_desc include/linux/irqdesc.h:158 [inline]
handle_irq arch/x86/kernel/irq.c:231 [inline]
__common_interrupt+0xce/0x1e0 arch/x86/kernel/irq.c:250
common_interrupt+0x9f/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x22/0x40 arch/x86/include/asm/idtentry.h:640
__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:160 [inline]
_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:202
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

to a HARDIRQ-irq-unsafe lock:
(tasklist_lock){.+.+}-{2:2}

... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock include/linux/rwlock_api_smp.h:150 [inline]
_raw_read_lock+0x32/0x40 kernel/locking/spinlock.c:228
do_wait+0x224/0x9d0 kernel/exit.c:1533
kernel_wait+0xe4/0x230 kernel/exit.c:1723
call_usermodehelper_exec_sync kernel/umh.c:140 [inline]
call_usermodehelper_exec_work+0xb4/0x220 kernel/umh.c:167
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

other info that might help us debug this:

Chain exists of:
&host->lock --> &new->fa_lock --> tasklist_lock

Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(tasklist_lock);
                               local_irq_disable();
                               lock(&host->lock);
                               lock(&new->fa_lock);
  <Interrupt>
    lock(&host->lock);

*** DEADLOCK ***

3 locks held by syz-executor.0/27911:
#0: ffffffff8cd20b60 (rcu_read_lock){....}-{1:2}, at:
rcu_lock_acquire+0x9/0x30 include/linux/rcupdate.h:304
#1: ffff8880144dec18 (&host->lock){-.-.}-{2:2}, at:
ata_scsi_queuecmd+0x7a/0x130 drivers/ata/libata-scsi.c:4048
#2: ffffffff8cd20b60 (rcu_read_lock){....}-{1:2}, at:
rcu_lock_acquire+0x5/0x30 include/linux/rcupdate.h:303

the dependencies between HARDIRQ-irq-safe lock and the holding lock:
-> (&host->lock){-.-.}-{2:2} {
IN-HARDIRQ-W at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_spin_lock_irqsave
include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xb3/0x100
kernel/locking/spinlock.c:162
__ata_sff_interrupt+0x23/0x710 drivers/ata/libata-sff.c:1540
__handle_irq_event_percpu+0x1f7/0x620
kernel/irq/handle.c:158
handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
handle_irq_event+0x83/0x1e0 kernel/irq/handle.c:210
handle_edge_irq+0x245/0xbe0 kernel/irq/chip.c:819
generic_handle_irq_desc include/linux/irqdesc.h:158 [inline]
handle_irq arch/x86/kernel/irq.c:231 [inline]
__common_interrupt+0xce/0x1e0 arch/x86/kernel/irq.c:250
common_interrupt+0x9f/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x22/0x40
arch/x86/include/asm/idtentry.h:640
__raw_spin_unlock_irq
include/linux/spinlock_api_smp.h:160 [inline]
_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:202
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
IN-SOFTIRQ-W at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_spin_lock_irqsave
include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xb3/0x100
kernel/locking/spinlock.c:162
__ata_sff_interrupt+0x23/0x710 drivers/ata/libata-sff.c:1540
__handle_irq_event_percpu+0x1f7/0x620
kernel/irq/handle.c:158
handle_irq_event_percpu kernel/irq/handle.c:193 [inline]
handle_irq_event+0x83/0x1e0 kernel/irq/handle.c:210
handle_edge_irq+0x245/0xbe0 kernel/irq/chip.c:819
generic_handle_irq_desc include/linux/irqdesc.h:158 [inline]
handle_irq arch/x86/kernel/irq.c:231 [inline]
__common_interrupt+0xce/0x1e0 arch/x86/kernel/irq.c:250
common_interrupt+0x4a/0xc0 arch/x86/kernel/irq.c:240
asm_common_interrupt+0x22/0x40
arch/x86/include/asm/idtentry.h:640
__raw_spin_unlock_irq
include/linux/spinlock_api_smp.h:160 [inline]
_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:202
__run_timers+0x922/0x970 kernel/time/timer.c:1792
run_timer_softirq+0x63/0xf0 kernel/time/timer.c:1803
__do_softirq+0x277/0x73a kernel/softirq.c:571
__irq_exit_rcu+0xcf/0x150 kernel/softirq.c:650
irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
sysvec_apic_timer_interrupt+0x91/0xb0
arch/x86/kernel/apic/apic.c:1107
asm_sysvec_apic_timer_interrupt+0x16/0x20
arch/x86/include/asm/idtentry.h:649
__raw_spin_unlock_irqrestore
include/linux/spinlock_api_smp.h:152 [inline]
_raw_spin_unlock_irqrestore+0xbc/0x120
kernel/locking/spinlock.c:194
spin_unlock_irqrestore include/linux/spinlock.h:405 [inline]
ata_scsi_queuecmd+0xc6/0x130 drivers/ata/libata-scsi.c:4058
scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1524 [inline]
scsi_queue_rq+0x1ea6/0x2ec0 drivers/scsi/scsi_lib.c:1760
blk_mq_dispatch_rq_list+0x104f/0x2ca0 block/blk-mq.c:1992
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:173 [inline]
blk_mq_do_dispatch_sched+0x820/0xe60
block/blk-mq-sched.c:187
__blk_mq_sched_dispatch_requests+0x39b/0x490
blk_mq_sched_dispatch_requests+0xef/0x160
block/blk-mq-sched.c:339
__blk_mq_run_hw_queue+0x1cf/0x260 block/blk-mq.c:2110
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
INITIAL USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_spin_lock_irqsave
include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0xb3/0x100
kernel/locking/spinlock.c:162
ata_dev_init drivers/ata/libata-core.c:5202 [inline]
ata_link_init+0x236/0x890 drivers/ata/libata-core.c:5247
ata_port_alloc+0x3ce/0x470 drivers/ata/libata-core.c:5322
ata_host_alloc+0x198/0x2c0 drivers/ata/libata-core.c:5433
ata_host_alloc_pinfo+0x27/0x430
drivers/ata/libata-core.c:5476
ata_pci_sff_prepare_host+0x40/0xe0
drivers/ata/libata-sff.c:2305
ata_pci_bmdma_prepare_host+0x20/0x70
drivers/ata/libata-sff.c:3210
piix_init_one+0x628/0x1ed0 drivers/ata/ata_piix.c:1704
local_pci_probe drivers/pci/pci-driver.c:324 [inline]
pci_call_probe drivers/pci/pci-driver.c:392 [inline]
__pci_device_probe drivers/pci/pci-driver.c:417 [inline]
pci_device_probe+0x4fe/0xa60 drivers/pci/pci-driver.c:460
call_driver_probe+0x96/0x250
really_probe+0x237/0xaf0 drivers/base/dd.c:639
__driver_probe_device+0x1f8/0x3e0 drivers/base/dd.c:778
driver_probe_device+0x50/0x240 drivers/base/dd.c:808
__driver_attach+0x2b6/0x5b0 drivers/base/dd.c:1190
bus_for_each_dev+0x168/0x1d0 drivers/base/bus.c:301
bus_add_driver+0x32f/0x600 drivers/base/bus.c:618
driver_register+0x2e9/0x3e0 drivers/base/driver.c:246
piix_init+0x1b/0x41 drivers/ata/ata_piix.c:1774
do_one_initcall+0x1a7/0x400 init/main.c:1303
do_initcall_level+0x168/0x218 init/main.c:1376
do_initcalls+0x4b/0x8c init/main.c:1392
kernel_init_freeable+0x428/0x5d8 init/main.c:1631
kernel_init+0x19/0x2b0 init/main.c:1519
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
}
... key at: [<ffffffff9114fc40>] ata_host_alloc.__key+0x0/0x40

the dependencies between the lock to be acquired
and HARDIRQ-irq-unsafe lock:
-> (tasklist_lock){.+.+}-{2:2} {
HARDIRQ-ON-R at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock
include/linux/rwlock_api_smp.h:150 [inline]
_raw_read_lock+0x32/0x40 kernel/locking/spinlock.c:228
do_wait+0x224/0x9d0 kernel/exit.c:1533
kernel_wait+0xe4/0x230 kernel/exit.c:1723
call_usermodehelper_exec_sync kernel/umh.c:140 [inline]
call_usermodehelper_exec_work+0xb4/0x220
kernel/umh.c:167
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
SOFTIRQ-ON-R at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock
include/linux/rwlock_api_smp.h:150 [inline]
_raw_read_lock+0x32/0x40 kernel/locking/spinlock.c:228
do_wait+0x224/0x9d0 kernel/exit.c:1533
kernel_wait+0xe4/0x230 kernel/exit.c:1723
call_usermodehelper_exec_sync kernel/umh.c:140 [inline]
call_usermodehelper_exec_work+0xb4/0x220
kernel/umh.c:167
process_one_work+0x83c/0x11a0 kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
INITIAL USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_write_lock_irq
include/linux/rwlock_api_smp.h:195 [inline]
_raw_write_lock_irq+0xae/0xf0
kernel/locking/spinlock.c:326
copy_process+0x37cf/0x6200 kernel/fork.c:2387
kernel_clone+0x212/0x610 kernel/fork.c:2671
user_mode_thread+0x12d/0x190 kernel/fork.c:2747
rest_init+0x21/0x270 init/main.c:694
start_kernel+0x0/0x540 init/main.c:890
start_kernel+0x49a/0x540 init/main.c:1145
secondary_startup_64_no_verify+0xcf/0xdb
INITIAL READ USE at:
lock_acquire+0x17f/0x430
kernel/locking/lockdep.c:5668
__raw_read_lock
include/linux/rwlock_api_smp.h:150 [inline]
_raw_read_lock+0x32/0x40
kernel/locking/spinlock.c:228
do_wait+0x224/0x9d0 kernel/exit.c:1533
kernel_wait+0xe4/0x230 kernel/exit.c:1723
call_usermodehelper_exec_sync
kernel/umh.c:140 [inline]
call_usermodehelper_exec_work+0xb4/0x220
kernel/umh.c:167
process_one_work+0x83c/0x11a0
kernel/workqueue.c:2289
worker_thread+0xa6c/0x1290 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
arch/x86/entry/entry_64.S:306
}
... key at: [<ffffffff8ca0a058>] tasklist_lock+0x18/0x40
... acquired at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock include/linux/rwlock_api_smp.h:150 [inline]
_raw_read_lock+0x32/0x40 kernel/locking/spinlock.c:228
send_sigio+0xbe/0x300 fs/fcntl.c:792
kill_fasync_rcu fs/fcntl.c:1003 [inline]
kill_fasync+0x1e4/0x430 fs/fcntl.c:1017
__receive_buf drivers/tty/n_tty.c:1629 [inline]
n_tty_receive_buf_common+0xaac/0x1370 drivers/tty/n_tty.c:1711
tiocsti drivers/tty/tty_io.c:2286 [inline]
tty_ioctl+0xda7/0x1710 drivers/tty/tty_io.c:2685
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> (&f->f_owner.lock){....}-{2:2} {
INITIAL USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_write_lock_irq
include/linux/rwlock_api_smp.h:195 [inline]
_raw_write_lock_irq+0xae/0xf0 kernel/locking/spinlock.c:326
f_modown+0x38/0x340 fs/fcntl.c:91
__tty_fasync drivers/tty/tty_io.c:2237 [inline]
tty_fasync+0x24a/0x340 drivers/tty/tty_io.c:2252
setfl fs/fcntl.c:73 [inline]
do_fcntl+0xe6c/0x1350 fs/fcntl.c:340
__do_sys_fcntl fs/fcntl.c:454 [inline]
__se_sys_fcntl+0xd5/0x1b0 fs/fcntl.c:439
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
INITIAL READ USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock_irq
include/linux/rwlock_api_smp.h:169 [inline]
_raw_read_lock_irq+0xb6/0x100
kernel/locking/spinlock.c:244
f_getown_ex fs/fcntl.c:212 [inline]
do_fcntl+0x1a5/0x1350 fs/fcntl.c:380
__do_sys_fcntl fs/fcntl.c:454 [inline]
__se_sys_fcntl+0xd5/0x1b0 fs/fcntl.c:439
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
}
... key at: [<ffffffff90e590c0>] __alloc_file.__key+0x0/0x10
... acquired at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock_irqsave include/linux/rwlock_api_smp.h:160 [inline]
_raw_read_lock_irqsave+0xbb/0x100 kernel/locking/spinlock.c:236
send_sigio+0x2f/0x300 fs/fcntl.c:778
kill_fasync_rcu fs/fcntl.c:1003 [inline]
kill_fasync+0x1e4/0x430 fs/fcntl.c:1017
sock_wake_async+0x133/0x150
rcu_read_unlock include/linux/rcupdate.h:767 [inline]
sk_wake_async+0x12e/0x200 include/net/sock.h:2525
sock_def_error_report+0x154/0x200 net/core/sock.c:3264
smc_fback_forward_wakeup+0x1b6/0x500 net/smc/af_smc.c:786
smc_fback_error_report+0x90/0xb0 net/smc/af_smc.c:838
sk_error_report+0x3b/0xb0 net/core/sock.c:345
tcp_validate_incoming+0x1509/0x1fc0 net/ipv4/tcp_input.c:5805
tcp_rcv_state_process+0x513/0x2610 net/ipv4/tcp_input.c:6520
tcp_v4_do_rcv+0x691/0xa10 net/ipv4/tcp_ipv4.c:1704
sk_backlog_rcv include/net/sock.h:1109 [inline]
__release_sock+0x106/0x3a0 net/core/sock.c:2906
release_sock+0x5d/0x1c0 net/core/sock.c:3462
sk_stream_wait_memory+0x6d9/0xe20 net/core/stream.c:145
tcp_sendmsg_locked+0x1888/0x4540 net/ipv4/tcp.c:1445
tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1483
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0x558/0x8a0 net/socket.c:2482
___sys_sendmsg net/socket.c:2536 [inline]
__sys_sendmmsg+0x360/0x6c0 net/socket.c:2622
__do_sys_sendmmsg net/socket.c:2651 [inline]
__se_sys_sendmmsg net/socket.c:2648 [inline]
__x64_sys_sendmmsg+0x9c/0xb0 net/socket.c:2648
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> (&new->fa_lock){....}-{2:2} {
INITIAL USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_write_lock_irq
include/linux/rwlock_api_smp.h:195 [inline]
_raw_write_lock_irq+0xae/0xf0 kernel/locking/spinlock.c:326
fasync_remove_entry+0xff/0x1d0 fs/fcntl.c:873
sock_fasync+0x86/0xf0 net/socket.c:1390
__fput+0x751/0x8c0 fs/file_table.c:317
task_work_run+0x243/0x300 kernel/task_work.c:179
resume_user_mode_work
include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x1f2/0x210
kernel/entry/common.c:203
__syscall_exit_to_user_mode_work
kernel/entry/common.c:285 [inline]
syscall_exit_to_user_mode+0x26/0x60 kernel/entry/common.c:296
do_syscall_64+0x4c/0x90 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
INITIAL READ USE at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock_irqsave
include/linux/rwlock_api_smp.h:160 [inline]
_raw_read_lock_irqsave+0xbb/0x100
kernel/locking/spinlock.c:236
kill_fasync_rcu fs/fcntl.c:996 [inline]
kill_fasync+0x13b/0x430 fs/fcntl.c:1017
sock_wake_async+0x133/0x150
rcu_read_unlock include/linux/rcupdate.h:767 [inline]
sk_wake_async+0x12e/0x200 include/net/sock.h:2525
sock_def_error_report+0x154/0x200 net/core/sock.c:3264
smc_fback_forward_wakeup+0x1b6/0x500
net/smc/af_smc.c:786
smc_fback_error_report+0x90/0xb0 net/smc/af_smc.c:838
sk_error_report+0x3b/0xb0 net/core/sock.c:345
tcp_validate_incoming+0x1509/0x1fc0
net/ipv4/tcp_input.c:5805
tcp_rcv_state_process+0x513/0x2610
net/ipv4/tcp_input.c:6520
tcp_v4_do_rcv+0x691/0xa10 net/ipv4/tcp_ipv4.c:1704
sk_backlog_rcv include/net/sock.h:1109 [inline]
__release_sock+0x106/0x3a0 net/core/sock.c:2906
release_sock+0x5d/0x1c0 net/core/sock.c:3462
sk_stream_wait_memory+0x6d9/0xe20 net/core/stream.c:145
tcp_sendmsg_locked+0x1888/0x4540 net/ipv4/tcp.c:1445
tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1483
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0x558/0x8a0 net/socket.c:2482
___sys_sendmsg net/socket.c:2536 [inline]
__sys_sendmmsg+0x360/0x6c0 net/socket.c:2622
__do_sys_sendmmsg net/socket.c:2651 [inline]
__se_sys_sendmmsg net/socket.c:2648 [inline]
__x64_sys_sendmmsg+0x9c/0xb0 net/socket.c:2648
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
}
... key at: [<ffffffff90e59f80>] fasync_insert_entry.__key+0x0/0x40
... acquired at:
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock_irqsave include/linux/rwlock_api_smp.h:160 [inline]
_raw_read_lock_irqsave+0xbb/0x100 kernel/locking/spinlock.c:236
kill_fasync_rcu fs/fcntl.c:996 [inline]
kill_fasync+0x13b/0x430 fs/fcntl.c:1017
sg_rq_end_io+0x604/0xf50 drivers/scsi/sg.c:1403
__blk_mq_end_request+0x2c7/0x380 block/blk-mq.c:1011
scsi_end_request+0x4ed/0x9c0 drivers/scsi/scsi_lib.c:576
scsi_io_completion+0xc25/0x27a0 drivers/scsi/scsi_lib.c:985
ata_scsi_simulate+0x336e/0x3dd0 drivers/ata/libata-scsi.c:4190
__ata_scsi_queuecmd+0x20b/0x1020 drivers/ata/libata-scsi.c:4009
ata_scsi_queuecmd+0xa0/0x130 drivers/ata/libata-scsi.c:4052
scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1524 [inline]
scsi_queue_rq+0x1ea6/0x2ec0 drivers/scsi/scsi_lib.c:1760
blk_mq_dispatch_rq_list+0x104f/0x2ca0 block/blk-mq.c:1992
__blk_mq_sched_dispatch_requests+0x382/0x490 block/blk-mq-sched.c:306
blk_mq_sched_dispatch_requests+0xef/0x160 block/blk-mq-sched.c:339
__blk_mq_run_hw_queue+0x1cf/0x260 block/blk-mq.c:2110
blk_mq_sched_insert_request+0x1e2/0x430 block/blk-mq-sched.c:458
blk_execute_rq_nowait+0x2e8/0x3b0 block/blk-mq.c:1305
sg_common_write+0x8c0/0x1970 drivers/scsi/sg.c:832
sg_new_write+0x61f/0x860 drivers/scsi/sg.c:770
sg_ioctl_common drivers/scsi/sg.c:935 [inline]
sg_ioctl+0x1c51/0x2be0 drivers/scsi/sg.c:1159
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd


stack backtrace:
CPU: 0 PID: 27911 Comm: syz-executor.0 Not tainted 6.1.0-rc5 #40
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106
print_bad_irq_dependency kernel/locking/lockdep.c:2611 [inline]
check_irq_usage kernel/locking/lockdep.c:2850 [inline]
check_prev_add kernel/locking/lockdep.c:3101 [inline]
check_prevs_add+0x4e5f/0x5b70 kernel/locking/lockdep.c:3216
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x4411/0x6070 kernel/locking/lockdep.c:5055
lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
__raw_read_lock_irqsave include/linux/rwlock_api_smp.h:160 [inline]
_raw_read_lock_irqsave+0xbb/0x100 kernel/locking/spinlock.c:236
kill_fasync_rcu fs/fcntl.c:996 [inline]
kill_fasync+0x13b/0x430 fs/fcntl.c:1017
sg_rq_end_io+0x604/0xf50 drivers/scsi/sg.c:1403
__blk_mq_end_request+0x2c7/0x380 block/blk-mq.c:1011
scsi_end_request+0x4ed/0x9c0 drivers/scsi/scsi_lib.c:576
scsi_io_completion+0xc25/0x27a0 drivers/scsi/scsi_lib.c:985
ata_scsi_simulate+0x336e/0x3dd0 drivers/ata/libata-scsi.c:4190
__ata_scsi_queuecmd+0x20b/0x1020 drivers/ata/libata-scsi.c:4009
ata_scsi_queuecmd+0xa0/0x130 drivers/ata/libata-scsi.c:4052
scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1524 [inline]
scsi_queue_rq+0x1ea6/0x2ec0 drivers/scsi/scsi_lib.c:1760
blk_mq_dispatch_rq_list+0x104f/0x2ca0 block/blk-mq.c:1992
__blk_mq_sched_dispatch_requests+0x382/0x490 block/blk-mq-sched.c:306
blk_mq_sched_dispatch_requests+0xef/0x160 block/blk-mq-sched.c:339
__blk_mq_run_hw_queue+0x1cf/0x260 block/blk-mq.c:2110
blk_mq_sched_insert_request+0x1e2/0x430 block/blk-mq-sched.c:458
blk_execute_rq_nowait+0x2e8/0x3b0 block/blk-mq.c:1305
sg_common_write+0x8c0/0x1970 drivers/scsi/sg.c:832
sg_new_write+0x61f/0x860 drivers/scsi/sg.c:770
sg_ioctl_common drivers/scsi/sg.c:935 [inline]
sg_ioctl+0x1c51/0x2be0 drivers/scsi/sg.c:1159
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f153dc8bded
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f153ede2c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f153ddabf80 RCX: 00007f153dc8bded
RDX: 0000000020000440 RSI: 0000000000002285 RDI: 0000000000000006
RBP: 00007f153dcf8ce0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f153ddabf80
R13: 00007ffc72e5108f R14: 00007ffc72e51230 R15: 00007f153ede2dc0
</TASK>

Best,
Wei


2022-12-15 10:49:44

by Damien Le Moal

Subject: Re: possible deadlock in __ata_sff_interrupt

On 12/14/22 00:09, Wei Chen wrote:
> Dear Linux Developer,
>
> Recently, when using our tool to fuzz kernel, the following crash was triggered.
>
> HEAD commit: 094226ad94f4 Linux v6.1-rc5
> git tree: upstream
> compiler: clang 12.0.1
> console output:
> https://drive.google.com/file/d/1QZttkbuLed4wp6U32UR6TpxfY_HHCIqQ/view?usp=share_link
> kernel config: https://drive.google.com/file/d/1TdPsg_5Zon8S2hEFpLBWjb8Tnd2KA5WJ/view?usp=share_link
>
> Unfortunately, I didn't have a reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: Wei Chen <[email protected]>
>
> =====================================================
> WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> 6.1.0-rc5 #40 Not tainted
> -----------------------------------------------------
> syz-executor.0/27911 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> ffff888076cc4f30 (&new->fa_lock){....}-{2:2}, at: kill_fasync_rcu
> fs/fcntl.c:996 [inline]
> ffff888076cc4f30 (&new->fa_lock){....}-{2:2}, at:
> kill_fasync+0x13b/0x430 fs/fcntl.c:1017

[...]

> stack backtrace:
> CPU: 0 PID: 27911 Comm: syz-executor.0 Not tainted 6.1.0-rc5 #40
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106
> print_bad_irq_dependency kernel/locking/lockdep.c:2611 [inline]
> check_irq_usage kernel/locking/lockdep.c:2850 [inline]
> check_prev_add kernel/locking/lockdep.c:3101 [inline]
> check_prevs_add+0x4e5f/0x5b70 kernel/locking/lockdep.c:3216
> validate_chain kernel/locking/lockdep.c:3831 [inline]
> __lock_acquire+0x4411/0x6070 kernel/locking/lockdep.c:5055
> lock_acquire+0x17f/0x430 kernel/locking/lockdep.c:5668
> __raw_read_lock_irqsave include/linux/rwlock_api_smp.h:160 [inline]
> _raw_read_lock_irqsave+0xbb/0x100 kernel/locking/spinlock.c:236
> kill_fasync_rcu fs/fcntl.c:996 [inline]
> kill_fasync+0x13b/0x430 fs/fcntl.c:1017
> sg_rq_end_io+0x604/0xf50 drivers/scsi/sg.c:1403

The problem is here: sg_rq_end_io() calling kill_fasync(). But at a quick
glance, this is not the only driver calling kill_fasync() with a spinlock
held and IRQs disabled... So either there is a fundamental problem with the
kill_fasync() function if drivers are allowed to do that, or, the reverse,
all drivers calling that function with a lock held and IRQs disabled need
to be fixed.
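
For illustration, the pattern in question looks roughly like the sketch
below (hypothetical device and field names, not the actual sg.c code): a
completion path runs with a driver spinlock held and IRQs disabled, and
calls kill_fasync(), which read-locks ->fa_lock and, via send_sigio(),
tasklist_lock:

#include <linux/fs.h>
#include <linux/spinlock.h>

/* Hypothetical device, for illustration only. */
struct example_dev {
        spinlock_t lock;                       /* also taken from the IRQ handler */
        struct fasync_struct *async_queue;     /* SIGIO subscribers */
};

/* Completion path: kill_fasync() is called with a spinlock held and IRQs
 * disabled - the pattern the lockdep report is about.
 */
static void example_rq_end_io(struct example_dev *dev)
{
        unsigned long flags;

        spin_lock_irqsave(&dev->lock, flags);
        /* ... mark the request as completed ... */
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
        spin_unlock_irqrestore(&dev->lock, flags);
}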

Al, Chuck, Jeff,

Any thoughts?

> __blk_mq_end_request+0x2c7/0x380 block/blk-mq.c:1011
> scsi_end_request+0x4ed/0x9c0 drivers/scsi/scsi_lib.c:576
> scsi_io_completion+0xc25/0x27a0 drivers/scsi/scsi_lib.c:985
> ata_scsi_simulate+0x336e/0x3dd0 drivers/ata/libata-scsi.c:4190
> __ata_scsi_queuecmd+0x20b/0x1020 drivers/ata/libata-scsi.c:4009
> ata_scsi_queuecmd+0xa0/0x130 drivers/ata/libata-scsi.c:4052
> scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1524 [inline]
> scsi_queue_rq+0x1ea6/0x2ec0 drivers/scsi/scsi_lib.c:1760
> blk_mq_dispatch_rq_list+0x104f/0x2ca0 block/blk-mq.c:1992
> __blk_mq_sched_dispatch_requests+0x382/0x490 block/blk-mq-sched.c:306
> blk_mq_sched_dispatch_requests+0xef/0x160 block/blk-mq-sched.c:339
> __blk_mq_run_hw_queue+0x1cf/0x260 block/blk-mq.c:2110
> blk_mq_sched_insert_request+0x1e2/0x430 block/blk-mq-sched.c:458
> blk_execute_rq_nowait+0x2e8/0x3b0 block/blk-mq.c:1305
> sg_common_write+0x8c0/0x1970 drivers/scsi/sg.c:832
> sg_new_write+0x61f/0x860 drivers/scsi/sg.c:770
> sg_ioctl_common drivers/scsi/sg.c:935 [inline]
> sg_ioctl+0x1c51/0x2be0 drivers/scsi/sg.c:1159
> vfs_ioctl fs/ioctl.c:51 [inline]
> __do_sys_ioctl fs/ioctl.c:870 [inline]
> __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:856
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f153dc8bded
> Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48
> 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f153ede2c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00007f153ddabf80 RCX: 00007f153dc8bded
> RDX: 0000000020000440 RSI: 0000000000002285 RDI: 0000000000000006
> RBP: 00007f153dcf8ce0 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007f153ddabf80
> R13: 00007ffc72e5108f R14: 00007ffc72e51230 R15: 00007f153ede2dc0
> </TASK>
>
> Best,
> Wei

--
Damien Le Moal
Western Digital Research

2022-12-15 15:50:52

by Al Viro

Subject: Re: possible deadlock in __ata_sff_interrupt

On Thu, Dec 15, 2022 at 06:48:20PM +0900, Damien Le Moal wrote:

> The problem is here: sg_rq_end_io() calling kill_fasync(). But at a quick
> glance, this is not the only driver calling kill_fasync() with a spinlock
> held with irq disabled... So there may be a fundamental problem with
> kill_fasync() function if drivers are allowed to do that, or the reverse,
> all drivers calling that function with a lock held with irq disabled need
> to be fixed.
>
> Al, Chuck, Jeff,
>
> Any thought ?

What is the problem with read_lock_irqsave() called with irqs disabled?
read_lock_irq() would have been a bug in such conditions, of course, but
that's not what we use...

2022-12-16 02:07:40

by Damien Le Moal

Subject: Re: possible deadlock in __ata_sff_interrupt

On 12/16/22 00:19, Al Viro wrote:
> On Thu, Dec 15, 2022 at 06:48:20PM +0900, Damien Le Moal wrote:
>
>> The problem is here: sg_rq_end_io() calling kill_fasync(). But at a quick
>> glance, this is not the only driver calling kill_fasync() with a spinlock
>> held with irq disabled... So there may be a fundamental problem with
>> kill_fasync() function if drivers are allowed to do that, or the reverse,
>> all drivers calling that function with a lock held with irq disabled need
>> to be fixed.
>>
>> Al, Chuck, Jeff,
>>
>> Any thought ?
>
> What is the problem with read_lock_irqsave() called with irqs disabled?
> read_lock_irq() would have been a bug in such conditions, of course, but
> that's not what we use...

The original & complete lockdep splat is in the report email here:

https://marc.info/?l=linux-ide&m=167094379710177&w=2

It looks like a spinlock is taken for the fasync stuff without IRQs
disabled, and that same spinlock is needed in kill_fasync(), which is
itself (potentially) called with IRQs disabled. Hence the splat. In any
case, that is how I understand the issue. But as mentioned above, given
that I can see many drivers calling kill_fasync() with IRQs disabled, I
wonder whether this is a genuine potential problem or a false positive.

--
Damien Le Moal
Western Digital Research

2022-12-16 04:04:09

by Al Viro

Subject: Re: possible deadlock in __ata_sff_interrupt

On Fri, Dec 16, 2022 at 10:44:06AM +0900, Damien Le Moal wrote:

> The original & complete lockdep splat is in the report email here:
>
> https://marc.info/?l=linux-ide&m=167094379710177&w=2
>
> It looks like a spinlock is taken for the fasync stuff without irq
> disabled and that same spinlock is needed in kill_fasync() which is
> itself called (potentially) with IRQ disabled. Hence the splat. In any
> case, that is how I understand the issue. But as mentioned above, given
> that I can see many drivers calling kill_fasync() with irq disabled, I
> wonder if this is a genuine potential problem or a false negative.

OK, I'm about to fall asleep, so I might very well be missing something
obvious, but...

CPU1: ptrace(2)
ptrace_check_attach()
read_lock(&tasklist_lock);

CPU2: setpgid(2)
write_lock_irq(&tasklist_lock);
spins

CPU1: takes an interrupt that would call kill_fasync(). grep and the
first instance of kill_fasync() is in hpet_interrupt() - it's not
something exotic. IRQs disabled on CPU2 won't stop it.
kill_fasync(..., SIGIO, ...)
kill_fasync_rcu()
read_lock_irqsave(&fa->fa_lock, flags);
send_sigio()
read_lock_irqsave(&fown->lock, flags);
read_lock(&tasklist_lock);

... and CPU1 spins as well.

It's not a matter of kill_fasync() called with IRQs disabled; the
problem is kill_fasync() called from interrupt taken while holding
tasklist_lock at least shared. Somebody trying to grab it on another
CPU exclusive before we get to send_sigio() from kill_fasync() will
end up spinning and will make us spin as well.

I really hope that's just me not seeing something obvious - we had
kill_fasync() called in IRQ handlers since way back and we had
tasklist_lock taken shared without disabling IRQs for just as long.

<goes to sleep, hoping to find "Al, you are a moron, it's obviously OK
for such and such reasons" in the mailbox tomorrow morning>

2022-12-16 12:22:06

by Linus Torvalds

Subject: Re: possible deadlock in __ata_sff_interrupt

On Thu, Dec 15, 2022 at 7:41 PM Al Viro <[email protected]> wrote:
>
> CPU1: ptrace(2)
> ptrace_check_attach()
> read_lock(&tasklist_lock);
>
> CPU2: setpgid(2)
> write_lock_irq(&tasklist_lock);
> spins
>
> CPU1: takes an interrupt that would call kill_fasync(). grep and the
> first instance of kill_fasync() is in hpet_interrupt() - it's not
> something exotic. IRQs disabled on CPU2 won't stop it.
> kill_fasync(..., SIGIO, ...)
> kill_fasync_rcu()
> read_lock_irqsave(&fa->fa_lock, flags);
> send_sigio()
> read_lock_irqsave(&fown->lock, flags);
> read_lock(&tasklist_lock);
>
> ... and CPU1 spins as well.

Nope. See kernel/locking/qrwlock.c:

        /*
         * Readers come here when they cannot get the lock without waiting
         */
        if (unlikely(in_interrupt())) {
                /*
                 * Readers in interrupt context will get the lock immediately
                 * if the writer is just waiting (not holding the lock yet),
                 * so spin with ACQUIRE semantics until the lock is available
                 * without waiting in the queue.
                 */
                atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
                return;
        }

and that's the new "civilized" reader unfairness.

The traditional rwlock was unconditionally unfair to writers, to the
point that there were starvation issues because new readers would
always get the lock.

Linus

2022-12-17 00:04:17

by Al Viro

Subject: Re: possible deadlock in __ata_sff_interrupt

[Boqun Feng Cc'd]

On Fri, Dec 16, 2022 at 03:26:21AM -0800, Linus Torvalds wrote:
> On Thu, Dec 15, 2022 at 7:41 PM Al Viro <[email protected]> wrote:
> >
> > CPU1: ptrace(2)
> > ptrace_check_attach()
> > read_lock(&tasklist_lock);
> >
> > CPU2: setpgid(2)
> > write_lock_irq(&tasklist_lock);
> > spins
> >
> > CPU1: takes an interrupt that would call kill_fasync(). grep and the
> > first instance of kill_fasync() is in hpet_interrupt() - it's not
> > something exotic. IRQs disabled on CPU2 won't stop it.
> > kill_fasync(..., SIGIO, ...)
> > kill_fasync_rcu()
> > read_lock_irqsave(&fa->fa_lock, flags);
> > send_sigio()
> > read_lock_irqsave(&fown->lock, flags);
> > read_lock(&tasklist_lock);
> >
> > ... and CPU1 spins as well.
>
> Nope. See kernel/locking/qrwlock.c:

[snip rwlocks are inherently unfair, queued ones are somewhat milder, but
all implementations have writers-starving behaviour for read_lock() at least
when in_interrupt()]

D'oh... Consider requested "Al, you are a moron" duly delivered... I plead
having been on way too low caffeine and too little sleep ;-/

Looking at the original report, looks like the scenario there is meant to be
the following:

CPU1: read_lock(&tasklist_lock)
tasklist_lock grabbed

CPU2: get an sg write(2) feeding request to libata; host->lock is taken,
request is immediately completed and scsi_done() is about to be called.
host->lock grabbed

CPU3: write_lock_irq(&tasklist_lock)
spins on tasklist_lock until CPU1 gets through.

CPU2: get around to kill_fasync() called by sg_rq_end_io() and to grabbing
tasklist_lock inside send_sigio()
spins, since it's not in an interrupt and there's a pending writer
host->lock is held, spin until CPU3 gets through.

CPU1: take an interrupt, which on libata will try to grab host->lock
tasklist_lock is held, spins on host->lock until CPU2 gets through

Am I reading it correctly?

2022-12-17 00:34:06

by Boqun Feng

Subject: Re: possible deadlock in __ata_sff_interrupt

On Fri, Dec 16, 2022 at 11:39:21PM +0000, Al Viro wrote:
> [Boqun Feng Cc'd]
>
> On Fri, Dec 16, 2022 at 03:26:21AM -0800, Linus Torvalds wrote:
> > On Thu, Dec 15, 2022 at 7:41 PM Al Viro <[email protected]> wrote:
> > >
> > > CPU1: ptrace(2)
> > > ptrace_check_attach()
> > > read_lock(&tasklist_lock);
> > >
> > > CPU2: setpgid(2)
> > > write_lock_irq(&tasklist_lock);
> > > spins
> > >
> > > CPU1: takes an interrupt that would call kill_fasync(). grep and the
> > > first instance of kill_fasync() is in hpet_interrupt() - it's not
> > > something exotic. IRQs disabled on CPU2 won't stop it.
> > > kill_fasync(..., SIGIO, ...)
> > > kill_fasync_rcu()
> > > read_lock_irqsave(&fa->fa_lock, flags);
> > > send_sigio()
> > > read_lock_irqsave(&fown->lock, flags);
> > > read_lock(&tasklist_lock);
> > >
> > > ... and CPU1 spins as well.
> >
> > Nope. See kernel/locking/qrwlock.c:
>
> [snip rwlocks are inherently unfair, queued ones are somewhat milder, but
> all implementations have writers-starving behaviour for read_lock() at least
> when in_interrupt()]
>
> D'oh... Consider requested "Al, you are a moron" duly delivered... I plead
> having been on way too low caffeine and too little sleep ;-/
>
> Looking at the original report, looks like the scenario there is meant to be
> the following:
>
> CPU1: read_lock(&tasklist_lock)
> tasklist_lock grabbed
>
> CPU2: get an sg write(2) feeding request to libata; host->lock is taken,
> request is immediately completed and scsi_done() is about to be called.
> host->lock grabbed
>
> CPU3: write_lock_irq(&tasklist_lock)
> spins on tasklist_lock until CPU1 gets through.
>
> CPU2: get around to kill_fasync() called by sg_rq_end_io() and to grabbing
> tasklist_lock inside send_sigio()
> spins, since it's not in an interrupt and there's a pending writer
> host->lock is held, spin until CPU3 gets through.

Right, for a reader not in_interrupt(), it may be blocked by a random
waiting writer because of the fairness, even if the lock is currently held
by a reader:

CPU 1: read_lock(&tasklist_lock);       // get the lock
CPU 2: write_lock_irq(&tasklist_lock);  // wait for the lock
CPU 3: read_lock(&tasklist_lock);       // cannot get the lock because of the fairness

Regards,
Boqun

>
> CPU1: take an interrupt, which on libata will try to grab host->lock
> tasklist_lock is held, spins on host->lock until CPU2 gets through
>
> Am I reading it correctly?

2022-12-17 02:27:54

by Al Viro

Subject: Re: possible deadlock in __ata_sff_interrupt

On Fri, Dec 16, 2022 at 03:54:09PM -0800, Boqun Feng wrote:
> On Fri, Dec 16, 2022 at 11:39:21PM +0000, Al Viro wrote:
> > [Boqun Feng Cc'd]
> >
> > On Fri, Dec 16, 2022 at 03:26:21AM -0800, Linus Torvalds wrote:
> > > On Thu, Dec 15, 2022 at 7:41 PM Al Viro <[email protected]> wrote:
> > > >
> > > > CPU1: ptrace(2)
> > > > ptrace_check_attach()
> > > > read_lock(&tasklist_lock);
> > > >
> > > > CPU2: setpgid(2)
> > > > write_lock_irq(&tasklist_lock);
> > > > spins
> > > >
> > > > CPU1: takes an interrupt that would call kill_fasync(). grep and the
> > > > first instance of kill_fasync() is in hpet_interrupt() - it's not
> > > > something exotic. IRQs disabled on CPU2 won't stop it.
> > > > kill_fasync(..., SIGIO, ...)
> > > > kill_fasync_rcu()
> > > > read_lock_irqsave(&fa->fa_lock, flags);
> > > > send_sigio()
> > > > read_lock_irqsave(&fown->lock, flags);
> > > > read_lock(&tasklist_lock);
> > > >
> > > > ... and CPU1 spins as well.
> > >
> > > Nope. See kernel/locking/qrwlock.c:
> >
> > [snip rwlocks are inherently unfair, queued ones are somewhat milder, but
> > all implementations have writers-starving behaviour for read_lock() at least
> > when in_interrupt()]
> >
> > D'oh... Consider requested "Al, you are a moron" duly delivered... I plead
> > having been on way too low caffeine and too little sleep ;-/
> >
> > Looking at the original report, looks like the scenario there is meant to be
> > the following:
> >
> > CPU1: read_lock(&tasklist_lock)
> > tasklist_lock grabbed
> >
> > CPU2: get an sg write(2) feeding request to libata; host->lock is taken,
> > request is immediately completed and scsi_done() is about to be called.
> > host->lock grabbed
> >
> > CPU3: write_lock_irq(&tasklist_lock)
> > spins on tasklist_lock until CPU1 gets through.
> >
> > CPU2: get around to kill_fasync() called by sg_rq_end_io() and to grabbing
> > tasklist_lock inside send_sigio()
> > spins, since it's not in an interrupt and there's a pending writer
> > host->lock is held, spin until CPU3 gets through.
>
> Right, for a reader not in_interrupt(), it may be blocked by a random
> waiting writer because of the fairness, even the lock is currently held
> by a reader:
>
> CPU 1 CPU 2 CPU 3
> read_lock(&tasklist_lock); // get the lock
>
> write_lock_irq(&tasklist_lock); // wait for the lock
>
> read_lock(&tasklist_lock); // cannot get the lock because of the fairness

IOW, any caller of scsi_done() from non-interrupt context while
holding a spinlock that is also taken in an interrupt...

And we have drivers/scsi/scsi_error.c:scsi_send_eh_cmnd(), which calls
->queuecommand() under a mutex, with
#define DEF_SCSI_QCMD(func_name) \
        int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd)  \
        {                                                               \
                unsigned long irq_flags;                                \
                int rc;                                                 \
                spin_lock_irqsave(shost->host_lock, irq_flags);         \
                rc = func_name##_lck(cmd);                              \
                spin_unlock_irqrestore(shost->host_lock, irq_flags);    \
                return rc;                                              \
        }

being commonly used for ->queuecommand() instances. So any scsi_done()
in foo_lck() (there are quite a few of those) + use of ->host_lock in
interrupt context for the same driver (also common)...
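
As a concrete (purely hypothetical) illustration of that pattern - not any
specific in-tree driver - something like:

#include <linux/interrupt.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static int example_queuecommand_lck(struct scsi_cmnd *cmd)
{
        /* shost->host_lock is held here with IRQs disabled, courtesy of
         * DEF_SCSI_QCMD() below.
         */
        /* ... emulate the command and complete it inline ... */
        scsi_done(cmd);         /* completion may reach kill_fasync() via sg */
        return 0;
}

static DEF_SCSI_QCMD(example_queuecommand)

static irqreturn_t example_isr(int irq, void *data)
{
        struct Scsi_Host *shost = data;
        unsigned long flags;

        spin_lock_irqsave(shost->host_lock, flags);     /* same lock, hardirq */
        /* ... */
        spin_unlock_irqrestore(shost->host_lock, flags);
        return IRQ_HANDLED;
}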

I wonder why that hadn't triggered the same warning a long time
ago - these warnings had been around for at least two years.

Am I missing something here?

2022-12-17 02:46:14

by Linus Torvalds

Subject: Re: possible deadlock in __ata_sff_interrupt

Ok, let's bring in Waiman for the rwlock side.

On Fri, Dec 16, 2022 at 5:54 PM Boqun Feng <[email protected]> wrote:
>
> Right, for a reader not in_interrupt(), it may be blocked by a random
> waiting writer because of the fairness, even the lock is currently held
> by a reader:
>
> CPU 1 CPU 2 CPU 3
> read_lock(&tasklist_lock); // get the lock
>
> write_lock_irq(&tasklist_lock); // wait for the lock
>
> read_lock(&tasklist_lock); // cannot get the lock because of the fairness

But this should be ok - because CPU1 can make progress and eventually
release the lock.

So the tasklist_lock use is fine on its own - the reason interrupts
are special is because an interrupt on CPU 1 taking the lock for
reading would deadlock otherwise. As long as it happens on another
CPU, the original CPU should then be able to make progress.

But the problem here seems to be that *another* lock is also involved
(in this case apparently "host->lock"), and now if CPU1 and CPU2 get
these two locks in a different order, you can get an ABBA deadlock.

And apparently our lockdep machinery doesn't catch that issue, so it
doesn't get flagged.

I'm not sure what the lockdep rules for rwlocks are, but maybe lockdep
treats rwlocks as being _always_ unfair, not knowing about that "it's
only unfair when it's in interrupt context".

Maybe we need to always make rwlock unfair? Possibly only for tasklist_lock?

Oh, how I hate tasklist_lock. It's pretty much our one remaining "one
big lock". It's been a pain for a long long time.

Linus

2022-12-17 03:44:53

by Boqun Feng

Subject: Re: possible deadlock in __ata_sff_interrupt

On Fri, Dec 16, 2022 at 08:31:54PM -0600, Linus Torvalds wrote:
> Ok, let's bring in Waiman for the rwlock side.
>
> On Fri, Dec 16, 2022 at 5:54 PM Boqun Feng <[email protected]> wrote:
> >
> > Right, for a reader not in_interrupt(), it may be blocked by a random
> > waiting writer because of the fairness, even the lock is currently held
> > by a reader:
> >
> > CPU 1 CPU 2 CPU 3
> > read_lock(&tasklist_lock); // get the lock
> >
> > write_lock_irq(&tasklist_lock); // wait for the lock
> >
> > read_lock(&tasklist_lock); // cannot get the lock because of the fairness
>
> But this should be ok - because CPU1 can make progress and eventually
> release the lock.
>

Yes.

> So the tasklist_lock use is fine on its own - the reason interrupts
> are special is because an interrupt on CPU 1 taking the lock for
> reading would deadlock otherwise. As long as it happens on another
> CPU, the original CPU should then be able to make progress.
>
> But the problem here seems to be thst *another* lock is also involved
> (in this case apparently "host->lock", and now if CPU1 and CPU2 get
> these two locks in a different order, you can get an ABBA deadlock.
>

Right.

> And apparently our lockdep machinery doesn't catch that issue, so it
> doesn't get flagged.
>

I'm confused. Isn't the original problem showing that lockdep catches
this?

> I'm not sure what the lockdep rules for rwlocks are, but maybe lockdep
> treats rwlocks as being _always_ unfair, not knowing about that "it's
> only unfair when it's in interrupt context".
>

The rules nowadays are:

* If the reader is in_interrupt() or the queued-spinlock implementation
  is not used, it's an unfair reader, i.e. it won't wait for
  any existing writer.

* Otherwise, it's a fair reader.

> Maybe we need to always make rwlock unfair? Possibly only for tasklist_lock?
>

That's possible, but I need to make sure I understand the issue for
lockdep. Is it that lockdep misses catching something, or that it has a
false positive?

Regards,
Boqun

> Oh, how I hate tasklist_lock. It's pretty much our one remaining "one
> big lock". It's been a pain for a long long time.
>
> Linus

2022-12-17 03:45:36

by Al Viro

Subject: Re: possible deadlock in __ata_sff_interrupt

On Fri, Dec 16, 2022 at 08:31:54PM -0600, Linus Torvalds wrote:
> Ok, let's bring in Waiman for the rwlock side.
>
> On Fri, Dec 16, 2022 at 5:54 PM Boqun Feng <[email protected]> wrote:
> >
> > Right, for a reader not in_interrupt(), it may be blocked by a random
> > waiting writer because of the fairness, even the lock is currently held
> > by a reader:
> >
> > CPU 1 CPU 2 CPU 3
> > read_lock(&tasklist_lock); // get the lock
> >
> > write_lock_irq(&tasklist_lock); // wait for the lock
> >
> > read_lock(&tasklist_lock); // cannot get the lock because of the fairness
>
> But this should be ok - because CPU1 can make progress and eventually
> release the lock.
>
> So the tasklist_lock use is fine on its own - the reason interrupts
> are special is because an interrupt on CPU 1 taking the lock for
> reading would deadlock otherwise. As long as it happens on another
> CPU, the original CPU should then be able to make progress.
>
> But the problem here seems to be thst *another* lock is also involved
> (in this case apparently "host->lock", and now if CPU1 and CPU2 get
> these two locks in a different order, you can get an ABBA deadlock.
>
> And apparently our lockdep machinery doesn't catch that issue, so it
> doesn't get flagged.

Lockdep has actually caught that; the locks involved are mentioned in the
report (https://marc.info/?l=linux-ide&m=167094379710177&w=2). The form
of the report could have been better, but if anything, it doesn't mention
the potential involvement of a tasklist_lock writer, which is what turns
this into a deadlock.

OTOH, that's more or less implicit for the entire class:

CPU1                              CPU2                              CPU3
read_lock(A)  [non-interrupt]
                                  local_irq_disable()               local_irq_disable()
                                  spin_lock(B)                      write_lock(A)
                                  read_lock(A)
[in interrupt]
spin_lock(B)

is what that sort of reports is about. In this case A is tasklist_lock,
B is host->lock. Possible call chains for CPU1 and CPU2 are reported...

I wonder why analogues of that hadn't been reported for other SCSI hosts -
it's a really common pattern there...

> I'm not sure what the lockdep rules for rwlocks are, but maybe lockdep
> treats rwlocks as being _always_ unfair, not knowing about that "it's
> only unfair when it's in interrupt context".
>
> Maybe we need to always make rwlock unfair? Possibly only for tasklist_lock?

ISTR threads about the possibility of explicit read_lock_unfair()...

2022-12-17 03:50:48

by Boqun Feng

Subject: Re: possible deadlock in __ata_sff_interrupt

On Sat, Dec 17, 2022 at 01:59:32AM +0000, Al Viro wrote:
> On Fri, Dec 16, 2022 at 03:54:09PM -0800, Boqun Feng wrote:
> > On Fri, Dec 16, 2022 at 11:39:21PM +0000, Al Viro wrote:
> > > [Boqun Feng Cc'd]
> > >
> > > On Fri, Dec 16, 2022 at 03:26:21AM -0800, Linus Torvalds wrote:
> > > > On Thu, Dec 15, 2022 at 7:41 PM Al Viro <[email protected]> wrote:
> > > > >
> > > > > CPU1: ptrace(2)
> > > > > ptrace_check_attach()
> > > > > read_lock(&tasklist_lock);
> > > > >
> > > > > CPU2: setpgid(2)
> > > > > write_lock_irq(&tasklist_lock);
> > > > > spins
> > > > >
> > > > > CPU1: takes an interrupt that would call kill_fasync(). grep and the
> > > > > first instance of kill_fasync() is in hpet_interrupt() - it's not
> > > > > something exotic. IRQs disabled on CPU2 won't stop it.
> > > > > kill_fasync(..., SIGIO, ...)
> > > > > kill_fasync_rcu()
> > > > > read_lock_irqsave(&fa->fa_lock, flags);
> > > > > send_sigio()
> > > > > read_lock_irqsave(&fown->lock, flags);
> > > > > read_lock(&tasklist_lock);
> > > > >
> > > > > ... and CPU1 spins as well.
> > > >
> > > > Nope. See kernel/locking/qrwlock.c:
> > >
> > > [snip rwlocks are inherently unfair, queued ones are somewhat milder, but
> > > all implementations have writers-starving behaviour for read_lock() at least
> > > when in_interrupt()]
> > >
> > > D'oh... Consider requested "Al, you are a moron" duly delivered... I plead
> > > having been on way too low caffeine and too little sleep ;-/
> > >
> > > Looking at the original report, looks like the scenario there is meant to be
> > > the following:
> > >
> > > CPU1: read_lock(&tasklist_lock)
> > > tasklist_lock grabbed
> > >
> > > CPU2: get an sg write(2) feeding request to libata; host->lock is taken,
> > > request is immediately completed and scsi_done() is about to be called.
> > > host->lock grabbed
> > >
> > > CPU3: write_lock_irq(&tasklist_lock)
> > > spins on tasklist_lock until CPU1 gets through.
> > >
> > > CPU2: get around to kill_fasync() called by sg_rq_end_io() and to grabbing
> > > tasklist_lock inside send_sigio()
> > > spins, since it's not in an interrupt and there's a pending writer
> > > host->lock is held, spin until CPU3 gets through.
> >
> > Right, for a reader not in_interrupt(), it may be blocked by a random
> > waiting writer because of the fairness, even the lock is currently held
> > by a reader:
> >
> > CPU 1 CPU 2 CPU 3
> > read_lock(&tasklist_lock); // get the lock
> >
> > write_lock_irq(&tasklist_lock); // wait for the lock
> >
> > read_lock(&tasklist_lock); // cannot get the lock because of the fairness
>
> IOW, any caller of scsi_done() from non-interrupt context while
> holding a spinlock that is also taken in an interrupt...
>
> And we have drivers/scsi/scsi_error.c:scsi_send_eh_cmnd(), which calls
> ->queuecommand() under a mutex, with
> #define DEF_SCSI_QCMD(func_name) \
> int func_name(struct Scsi_Host *shost, struct scsi_cmnd *cmd) \
> { \
> unsigned long irq_flags; \
> int rc; \
> spin_lock_irqsave(shost->host_lock, irq_flags); \
> rc = func_name##_lck(cmd); \
> spin_unlock_irqrestore(shost->host_lock, irq_flags); \
> return rc; \
> }
>
> being commonly used for ->queuecommand() instances. So any scsi_done()
> in foo_lck() (quite a few of such) + use of ->host_lock in interrupt
> for the same driver (also common)...
>
> I wonder why that hadn't triggered the same warning a long time
> ago - these warnings had been around for at least two years.
>

FWIW, the complete dependency chain is:

&host->lock --> &new->fa_lock --> &f->f_owner.lock --> tasklist_lock

for the "&f->f_owner.lock" part to get into lockdep's radar, the
following call trace needs to appear once:

  kill_fasync():
    kill_fasync_rcu():
      send_sigio()

not sure whether it's rare or not though. And ->fa_lock also had its own
issue:

https://lore.kernel.org/lkml/[email protected]/

which may have covered &host->lock for a while ;-)

Regards,
Boqun

> Am I missing something here?

2022-12-17 04:56:21

by Waiman Long

Subject: Re: possible deadlock in __ata_sff_interrupt

On 12/16/22 22:05, Al Viro wrote:
> On Fri, Dec 16, 2022 at 08:31:54PM -0600, Linus Torvalds wrote:
>> Ok, let's bring in Waiman for the rwlock side.
>>
>> On Fri, Dec 16, 2022 at 5:54 PM Boqun Feng <[email protected]> wrote:
>>> Right, for a reader not in_interrupt(), it may be blocked by a random
>>> waiting writer because of the fairness, even the lock is currently held
>>> by a reader:
>>>
>>> CPU 1 CPU 2 CPU 3
>>> read_lock(&tasklist_lock); // get the lock
>>>
>>> write_lock_irq(&tasklist_lock); // wait for the lock
>>>
>>> read_lock(&tasklist_lock); // cannot get the lock because of the fairness
>> But this should be ok - because CPU1 can make progress and eventually
>> release the lock.
>>
>> So the tasklist_lock use is fine on its own - the reason interrupts
>> are special is because an interrupt on CPU 1 taking the lock for
>> reading would deadlock otherwise. As long as it happens on another
>> CPU, the original CPU should then be able to make progress.
>>
>> But the problem here seems to be thst *another* lock is also involved
>> (in this case apparently "host->lock", and now if CPU1 and CPU2 get
>> these two locks in a different order, you can get an ABBA deadlock.
>>
>> And apparently our lockdep machinery doesn't catch that issue, so it
>> doesn't get flagged.
> Lockdep has actually caught that; the locks involved are mention in the
> report (https://marc.info/?l=linux-ide&m=167094379710177&w=2). The form
> of report might have been better, but if anything, it doesn't mention
> potential involvement of tasklist_lock writer, turning that into a deadlock.
>
> OTOH, that's more or less implicit for the entire class:
>
> read_lock(A) [non-interrupt]
> local_irq_disable() local_irq_disable()
> spin_lock(B) write_lock(A)
> read_lock(A)
> [in interrupt]
> spin_lock(B)
>
> is what that sort of reports is about. In this case A is tasklist_lock,
> B is host->lock. Possible call chains for CPU1 and CPU2 are reported...
>
> I wonder why analogues of that hadn't been reported for other SCSI hosts -
> it's a really common pattern there...
>
>> I'm not sure what the lockdep rules for rwlocks are, but maybe lockdep
>> treats rwlocks as being _always_ unfair, not knowing about that "it's
>> only unfair when it's in interrupt context".
>>
>> Maybe we need to always make rwlock unfair? Possibly only for tasklist_lock?
That may not be a good idea, as the cacheline bouncing problem would come
back and reduce performance.
> ISTR threads about the possibility of explicit read_lock_unfair()...

Another possible alternative is to treat read_lock() as unfair whenever
interrupts are disabled, since I think we should keep the interrupt-disabled
interval as short as possible.
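
Something along these lines in queued_read_lock_slowpath(), purely as a
sketch of the idea (not a tested patch):

        /* Sketch: also take the unfair reader path when interrupts are
         * disabled, not only when running in interrupt context.
         */
        if (unlikely(in_interrupt() || irqs_disabled())) {
                atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
                return;
        }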

Thoughts?

Cheers,
Longman