Hello,
syzbot hit the following crash on upstream commit
d8a332730e757129e70675679f2b2a03f1ecf65e (Fri Apr 27 17:39:38 2018 +0000)
Merge tag 'char-misc-4.17-rc3' of
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=c4f9cebf9d651f6e54de
So far this crash happened 13 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4727023951937536
syzkaller reproducer:
https://syzkaller.appspot.com/x/repro.syz?id=6091190189424640
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=5018098448990208
Kernel config:
https://syzkaller.appspot.com/x/.config?id=7043958930931867332
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
random: sshd: uninitialized urandom read (32 bytes read)
print_req_error: I/O error, dev loop0, sector 7680
print_req_error: I/O error, dev loop0, sector 10240
print_req_error: I/O error, dev loop0, sector 7680
print_req_error: I/O error, dev loop0, sector 12800
INFO: task kworker/u4:2:31 blocked for more than 120 seconds.
Not tainted 4.17.0-rc2+ #20
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:2 D12840 31 2 0x80000000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
context_switch kernel/sched/core.c:2848 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3490
schedule+0xef/0x430 kernel/sched/core.c:3549
blk_queue_enter+0x8da/0xdf0 block/blk-core.c:953
generic_make_request+0x144/0x1510 block/blk-core.c:2395
submit_bio+0xba/0x460 block/blk-core.c:2568
submit_bh_wbc+0x5e8/0x7c0 fs/buffer.c:3076
__block_write_full_page+0x82e/0xef0 fs/buffer.c:1758
block_write_full_page+0x1ff/0x250 fs/buffer.c:2944
blkdev_writepage+0x24/0x30 fs/block_dev.c:566
__writepage+0x69/0xe0 mm/page-writeback.c:2302
write_cache_pages+0x920/0x1690 mm/page-writeback.c:2240
generic_writepages+0xcc/0x120 mm/page-writeback.c:2326
blkdev_writepages+0x1d/0x30 fs/block_dev.c:1951
do_writepages+0x9a/0x1a0 mm/page-writeback.c:2341
__writeback_single_inode+0x1e4/0x15c0 fs/fs-writeback.c:1323
writeback_sb_inodes+0x6f0/0x11a0 fs/fs-writeback.c:1587
__writeback_inodes_wb+0x1b0/0x320 fs/fs-writeback.c:1656
wb_writeback+0x9e4/0xf50 fs/fs-writeback.c:1765
wb_check_background_flush fs/fs-writeback.c:1833 [inline]
wb_do_writeback fs/fs-writeback.c:1921 [inline]
wb_workfn+0xd2f/0x1770 fs/fs-writeback.c:1949
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Showing all locks held in the system:
3 locks held by kworker/u4:2/31:
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at:
__write_once_size include/linux/compiler.h:215 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at:
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at: atomic64_set
include/asm-generic/atomic-instrumented.h:40 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at:
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at: set_work_data
kernel/workqueue.c:617 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at:
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
#0: (ptrval) ((wq_completion)"writeback"){+.+.}, at:
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
#1: (ptrval) ((work_completion)(&(&wb->dwork)->work)){+.+.}, at:
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
#2: (ptrval) (&type->s_umount_key#35){.+.+}, at:
trylock_super+0x22/0x110 fs/super.c:399
2 locks held by khungtaskd/891:
#0: (ptrval) (rcu_read_lock){....}, at:
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
#0: (ptrval) (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60
kernel/hung_task.c:249
#1: (ptrval) (tasklist_lock){.+.+}, at:
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
2 locks held by getty/4512:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4513:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4514:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4515:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4516:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4517:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4518:
#0: (ptrval) (&tty->ldisc_sem){++++}, at:
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
#1: (ptrval) (&ldata->atomic_read_lock){+.+.}, at:
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
1 lock held by syz-executor516/4581:
#0: (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
drivers/block/loop.c:1391
=============================================
NMI backtrace for cpu 1
CPU: 1 PID: 891 Comm: khungtaskd Not tainted 4.17.0-rc2+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
check_hung_task kernel/hung_task.c:132 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
watchdog+0xc10/0xf60 kernel/hung_task.c:249
kthread+0x345/0x410 kernel/kthread.c:238
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.17.0-rc2+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:ksoftirqd_running kernel/softirq.c:88 [inline]
RIP: 0010:invoke_softirq kernel/softirq.c:355 [inline]
RIP: 0010:irq_exit+0x125/0x200 kernel/softirq.c:405
RSP: 0018:ffff8801dae07cd8 EFLAGS: 00000046
RAX: dffffc0000000000 RBX: ffff8801d9a9c200 RCX: 0000000000000000
RDX: 1ffff1003b353842 RSI: 0000000000000000 RDI: ffff8801d9a9c210
RBP: ffff8801dae07ce0 R08: ffffed003b5c46c3 R09: ffffed003b5c46c2
R10: ffffed003b5c46c2 R11: ffff8801dae23613 R12: ffff8801dae2c580
R13: ffffffff88c75dc0 R14: ffff8801dae07e68 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 00000001b5553000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
scheduler_ipi+0x52b/0xa30 kernel/sched/core.c:1778
smp_reschedule_interrupt+0xed/0x660 arch/x86/kernel/smp.c:277
reschedule_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:887
</IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
RSP: 0018:ffffffff88c07bc0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff02
RAX: dffffc0000000000 RBX: 1ffffffff1180f7b RCX: 0000000000000000
RDX: 1ffffffff11a3160 RSI: 0000000000000001 RDI: ffffffff88d18b00
RBP: ffffffff88c07bc0 R08: ffffed003b5c46c3 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff88c07c78 R14: ffffffff897bf060 R15: 0000000000000000
arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
default_idle+0xc2/0x440 arch/x86/kernel/process.c:354
arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:345
default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
cpuidle_idle_call kernel/sched/idle.c:153 [inline]
do_idle+0x395/0x560 kernel/sched/idle.c:262
cpu_startup_entry+0x104/0x120 kernel/sched/idle.c:368
rest_init+0xe1/0xe4 init/main.c:441
start_kernel+0x906/0x92d init/main.c:737
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:445
x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:426
secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242
Code: 00 01 5b 5d c3 65 48 8b 1d 89 a4 ba 7e 48 85 db 74 2c 48 b8 00 00 00
00 00 fc ff df 48 8d 7b 10 48 89 fa 48 c1 ea 03 80 3c 02 00 <0f> 85 c2 00
00 00 48 8b 43 10 48 85 c0 0f 84 31 ff ff ff 48 c7
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to [email protected].
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
I managed to obtain SysRq-t output when khungtaskd fires, using a debug printk()
( https://groups.google.com/forum/#!topic/syzkaller-bugs/OTuOsVebAiE ).
Only the 4 threads shown below seem to be relevant to this problem.
[ 246.929688] task PC stack pid father
[ 249.888937] jbd2/sda1-8 D17736 2271 2 0x80000000
[ 249.894586] Call Trace:
[ 249.897193] __schedule+0x801/0x1e30
[ 249.900954] schedule+0xef/0x430
[ 249.904356] io_schedule+0x1c/0x70
[ 249.907935] bit_wait_io+0x18/0x90
[ 249.911492] __wait_on_bit+0xb3/0x130
[ 249.915313] out_of_line_wait_on_bit+0x204/0x3a0
[ 249.920105] __wait_on_buffer+0x76/0x90
[ 249.924102] jbd2_journal_commit_transaction+0x655b/0x8c18
[ 249.929893] kjournald2+0x26c/0xb30
[ 249.933579] kthread+0x345/0x410
[ 249.936966] ret_from_fork+0x3a/0x50
[ 254.408972] syz-executor7 D18976 5010 4828 0x00000004
[ 254.414639] Call Trace:
[ 254.417263] __schedule+0x801/0x1e30
[ 254.421070] schedule+0xef/0x430
[ 254.424467] blk_queue_enter+0x8da/0xdf0
[ 254.428584] generic_make_request+0x144/0x1510
[ 254.433217] blk_queue_split+0x374/0x2090
[ 254.437425] blk_mq_make_request+0x2d0/0x25c0
[ 254.442004] generic_make_request+0x795/0x1510
[ 254.446663] submit_bio+0xba/0x460
[ 254.451104] mpage_readpages+0x6d7/0x970
[ 254.455224] blkdev_readpages+0x2c/0x40
[ 254.459220] __do_page_cache_readahead+0x79a/0xdc0
[ 254.464205] ondemand_readahead+0x550/0xc40
[ 254.468559] page_cache_sync_readahead+0xd1/0x110
[ 254.473430] generic_file_read_iter+0x1a74/0x2f00
[ 254.478423] blkdev_read_iter+0x120/0x190
[ 254.482594] generic_file_splice_read+0x552/0x910
[ 254.487484] do_splice_to+0x12e/0x190
[ 254.491311] splice_direct_to_actor+0x268/0x8d0
[ 254.496017] do_splice_direct+0x2cc/0x400
[ 254.500224] do_sendfile+0x60f/0xe00
[ 254.503971] __x64_sys_sendfile64+0x155/0x240
[ 254.508507] do_syscall_64+0x1b1/0x800
[ 254.512431] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 254.568779] syz-executor7 D25408 5021 4828 0x00000004
[ 254.574429] Call Trace:
[ 254.577125] __schedule+0x801/0x1e30
[ 254.580882] schedule+0xef/0x430
[ 254.584297] blk_mq_freeze_queue_wait+0x1ce/0x460
[ 254.589192] blk_freeze_queue+0x4a/0x80
[ 254.593209] blk_mq_freeze_queue+0x15/0x20
[ 254.597464] loop_clr_fd+0x226/0xb80
[ 254.601208] lo_ioctl+0x642/0x2130
[ 254.604774] blkdev_ioctl+0x9b6/0x2020
[ 254.608715] block_ioctl+0xee/0x130
[ 254.612362] do_vfs_ioctl+0x1cf/0x16a0
[ 254.616294] ksys_ioctl+0xa9/0xd0
[ 254.619770] __x64_sys_ioctl+0x73/0xb0
[ 254.623674] do_syscall_64+0x1b1/0x800
[ 254.627596] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 255.022839] 1 lock held by syz-executor7/5021:
[ 255.028209] #0: (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
[ 254.719133] blkid D23880 5071 2713 0x00000000
[ 254.724791] Call Trace:
[ 254.727402] __schedule+0x801/0x1e30
[ 254.731159] schedule+0xef/0x430
[ 254.734562] schedule_preempt_disabled+0x10/0x20
[ 254.739333] __mutex_lock+0xe38/0x17f0
[ 254.743280] mutex_lock_killable_nested+0x16/0x20
[ 254.748135] lo_ioctl+0x8d/0x2130
[ 254.751611] blkdev_ioctl+0x9b6/0x2020
[ 254.755550] block_ioctl+0xee/0x130
[ 254.759200] do_vfs_ioctl+0x1cf/0x16a0
[ 254.763123] ksys_ioctl+0xa9/0xd0
[ 254.766601] __x64_sys_ioctl+0x73/0xb0
[ 254.770506] do_syscall_64+0x1b1/0x800
[ 254.774425] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 255.036399] 1 lock held by blkid/5071:
[ 255.041075] #0: (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
syz-executor7(PID=5021) is doing ioctl(LOOP_CLR_FD).
blkid(PID=5071) is also doing an ioctl() but is blocked on lo_ctl_mutex, which is
held by syz-executor7(PID=5021). Therefore, I assume that blkid(PID=5071)
is irrelevant to this hang.
syz-executor7(PID=5021) is stuck at blk_mq_freeze_queue_wait(), called from
blk_freeze_queue(), called from blk_mq_freeze_queue(), called from loop_clr_fd().
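For reference, loop_clr_fd() freezes the loop device's request queue while
lo_ctl_mutex is held (a simplified sketch of drivers/block/loop.c in this
kernel; unrelated setup/cleanup snipped):

static int loop_clr_fd(struct loop_device *lo)
{
        (...snipped...)
        /* freeze request queue during the transition */
        blk_mq_freeze_queue(lo->lo_queue);
        (...snipped...)
        blk_mq_unfreeze_queue(lo->lo_queue);
        (...snipped...)
}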
/*
 * Guarantee no request is in use, so we can change any data structure of
 * the queue afterward.
 */
void blk_freeze_queue(struct request_queue *q)
{
        /*
         * In the !blk_mq case we are only calling this to kill the
         * q_usage_counter, otherwise this increases the freeze depth
         * and waits for it to return to zero. For this reason there is
         * no blk_unfreeze_queue(), and blk_freeze_queue() is not
         * exported to drivers as the only user for unfreeze is blk_mq.
         */
        blk_freeze_queue_start(q);
        if (!q->mq_ops)
                blk_drain_queue(q);
        blk_mq_freeze_queue_wait(q);
}
q->mq_freeze_depth is incremented at blk_freeze_queue_start() and
the caller waits until q->q_usage_counter becomes 0.
void blk_freeze_queue_start(struct request_queue *q)
{
        int freeze_depth;

        freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
        if (freeze_depth == 1) {
                percpu_ref_kill(&q->q_usage_counter);
                if (q->mq_ops)
                        blk_mq_run_hw_queues(q, false);
        }
}
void blk_mq_freeze_queue_wait(struct request_queue *q)
{
        wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}
I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
stuck at wait_event() in blk_queue_enter().
syz-executor7 D18976 5010 4828 0x00000004
Call Trace:
context_switch kernel/sched/core.c:2859 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3501
schedule+0xef/0x430 kernel/sched/core.c:3545
blk_queue_enter+0x8da/0xdf0 block/blk-core.c:953
generic_make_request+0x144/0x1510 block/blk-core.c:2395
blk_queue_split+0x374/0x2090 block/blk-merge.c:215
blk_mq_make_request+0x2d0/0x25c0 block/blk-mq.c:1861
generic_make_request+0x795/0x1510 block/blk-core.c:2460
submit_bio+0xba/0x460 block/blk-core.c:2568
mpage_bio_submit fs/mpage.c:66 [inline]
mpage_readpages+0x6d7/0x970 fs/mpage.c:393
blkdev_readpages+0x2c/0x40 fs/block_dev.c:577
read_pages mm/readahead.c:121 [inline]
__do_page_cache_readahead+0x79a/0xdc0 mm/readahead.c:199
ra_submit mm/internal.h:66 [inline]
ondemand_readahead+0x550/0xc40 mm/readahead.c:478
page_cache_sync_readahead+0xd1/0x110 mm/readahead.c:510
generic_file_buffered_read mm/filemap.c:2092 [inline]
generic_file_read_iter+0x1a74/0x2f00 mm/filemap.c:2362
blkdev_read_iter+0x120/0x190 fs/block_dev.c:1930
call_read_iter include/linux/fs.h:1778 [inline]
generic_file_splice_read+0x552/0x910 fs/splice.c:307
do_splice_to+0x12e/0x190 fs/splice.c:880
splice_direct_to_actor+0x268/0x8d0 fs/splice.c:952
do_splice_direct+0x2cc/0x400 fs/splice.c:1061
do_sendfile+0x60f/0xe00 fs/read_write.c:1440
__do_sys_sendfile64 fs/read_write.c:1495 [inline]
__se_sys_sendfile64 fs/read_write.c:1487 [inline]
__x64_sys_sendfile64+0x155/0x240 fs/read_write.c:1487
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Since flags == 0, preempt == false. Since the thread is stuck at wait_event(), success == false.
Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
guess blk_queue_dying(q) == false, because we are merely trying to freeze/unfreeze.
/**
 * blk_queue_enter() - try to increase q->q_usage_counter
 * @q: request queue pointer
 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
 */
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
        const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

        while (true) {
                bool success = false;

                rcu_read_lock();
                if (percpu_ref_tryget_live(&q->q_usage_counter)) {
                        /*
                         * The code that sets the PREEMPT_ONLY flag is
                         * responsible for ensuring that that flag is globally
                         * visible before the queue is unfrozen.
                         */
                        if (preempt || !blk_queue_preempt_only(q)) {
                                success = true;
                        } else {
                                percpu_ref_put(&q->q_usage_counter);
                        }
                }
                rcu_read_unlock();

                if (success)
                        return 0;

                if (flags & BLK_MQ_REQ_NOWAIT)
                        return -EBUSY;

                /*
                 * read pair of barrier in blk_freeze_queue_start(),
                 * we need to order reading __PERCPU_REF_DEAD flag of
                 * .q_usage_counter and reading .mq_freeze_depth or
                 * queue dying flag, otherwise the following wait may
                 * never return if the two reads are reordered.
                 */
                smp_rmb();

                wait_event(q->mq_freeze_wq,
                           (atomic_read(&q->mq_freeze_depth) == 0 &&
                            (preempt || !blk_queue_preempt_only(q))) ||
                           blk_queue_dying(q));
                if (blk_queue_dying(q))
                        return -ENODEV;
        }
}
jbd2/sda1-8(PID=2271) is stuck waiting for a journal commit operation.
I don't know how this thread is involved in this problem.
jbd2/sda1-8 D17736 2271 2 0x80000000
Call Trace:
context_switch kernel/sched/core.c:2859 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3501
schedule+0xef/0x430 kernel/sched/core.c:3545
io_schedule+0x1c/0x70 kernel/sched/core.c:5165
bit_wait_io+0x18/0x90 kernel/sched/wait_bit.c:207
__wait_on_bit+0xb3/0x130 kernel/sched/wait_bit.c:48
out_of_line_wait_on_bit+0x204/0x3a0 kernel/sched/wait_bit.c:63
wait_on_bit_io include/linux/wait_bit.h:101 [inline]
__wait_on_buffer+0x76/0x90 fs/buffer.c:118
wait_on_buffer include/linux/buffer_head.h:356 [inline]
journal_wait_on_commit_record fs/jbd2/commit.c:174 [inline]
jbd2_journal_commit_transaction+0x655b/0x8c18 fs/jbd2/commit.c:865
kjournald2+0x26c/0xb30 fs/jbd2/journal.c:229
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Can we find the culprit from these hints?
Tetsuo Handa wrote:
> I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
> but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
> stuck at wait_event() in blk_queue_enter().
>
> Since flags == 0, preempt == false. Since stuck at wait_event(), success == false.
> Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
> guess blk_queue_dying(q) == false because we are just trying to freeze/unfreeze.
>
I was able to reproduce the hang using a modified reproducer, and got the
following values using the debug printk() patch below.
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -950,10 +950,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
*/
smp_rmb();
- wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
- blk_queue_dying(q));
+ while (wait_event_timeout(q->mq_freeze_wq,
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+ (preempt || !blk_queue_preempt_only(q))) ||
+ blk_queue_dying(q), 10 * HZ) == 0)
+ printk("%s(%u): q->mq_freeze_depth=%d preempt=%d blk_queue_preempt_only(q)=%d blk_queue_dying(q)=%d\n",
+ current->comm, current->pid, atomic_read(&q->mq_freeze_depth), preempt, blk_queue_preempt_only(q), blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
}
[ 75.869126] print_req_error: I/O error, dev loop0, sector 0
[ 85.983146] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 96.222884] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 106.463322] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 116.702912] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
One or more threads are waiting for q->mq_freeze_depth to become 0. But the
thread that incremented q->mq_freeze_depth in blk_freeze_queue_start(q), called
from blk_freeze_queue(), is waiting at blk_mq_freeze_queue_wait(). Therefore, the
atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
blk_queue_enter() will never be satisfied. But what does that wait_event()
want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
as if blk_queue_dying(q) == true? That is, something like below:
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..59e2496 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
smp_rmb();
wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
+ atomic_read(&q->mq_freeze_depth) ||
+ (preempt || !blk_queue_preempt_only(q)) ||
blk_queue_dying(q));
- if (blk_queue_dying(q))
+ if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
return -ENODEV;
}
}
On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> One or more threads are waiting for q->mq_freeze_depth to become 0. But the
> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
> blk_queue_enter() will never be satisfied. But what does that wait_event()
> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
> as if blk_queue_dying(q) == true? That is, something like below:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b4..59e2496 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> smp_rmb();
>
> wait_event(q->mq_freeze_wq,
> - (atomic_read(&q->mq_freeze_depth) == 0 &&
> - (preempt || !blk_queue_preempt_only(q))) ||
> + atomic_read(&q->mq_freeze_depth) ||
> + (preempt || !blk_queue_preempt_only(q)) ||
> blk_queue_dying(q));
> - if (blk_queue_dying(q))
> + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> return -ENODEV;
> }
> }
That change looks wrong to me. Additionally, I think that you are looking in
the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
work fine for all block drivers except the loop driver I think that you should
have a closer look at how the loop driver uses this block layer functionality.
Thanks,
Bart.
On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
>> One or more threads are waiting for q->mq_freeze_depth to become 0. But the
>> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
>> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
>> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
>> blk_queue_enter() will never be satisfied. But what does that wait_event()
>> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
>> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
>> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
>> as if blk_queue_dying(q) == true? That is, something like below:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b4..59e2496 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>> smp_rmb();
>>
>> wait_event(q->mq_freeze_wq,
>> - (atomic_read(&q->mq_freeze_depth) == 0 &&
>> - (preempt || !blk_queue_preempt_only(q))) ||
>> + atomic_read(&q->mq_freeze_depth) ||
>> + (preempt || !blk_queue_preempt_only(q)) ||
>> blk_queue_dying(q));
>> - if (blk_queue_dying(q))
>> + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
>> return -ENODEV;
>> }
>> }
>
> That change looks wrong to me.
Hi Bart,
Why does it look wrong to you?
> Additionally, I think that you are looking in
> the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
> work fine for all block drivers except the loop driver I think that you should
> have a closer look at how the loop driver uses this block layer functionality.
>
> Thanks,
>
> Bart.
>
>
>
On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 85909b4..59e2496 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > smp_rmb();
> > >
> > > wait_event(q->mq_freeze_wq,
> > > - (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > - (preempt || !blk_queue_preempt_only(q))) ||
> > > + atomic_read(&q->mq_freeze_depth) ||
> > > + (preempt || !blk_queue_preempt_only(q)) ||
> > > blk_queue_dying(q));
> > > - if (blk_queue_dying(q))
> > > + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > return -ENODEV;
> > > }
> > > }
> >
> > That change looks wrong to me.
>
> Hi Bart,
>
> Why does it look wrong to you?
Because that change conflicts with the purpose of queue freezing and also because
that change would inject I/O errors in code paths that shouldn't inject I/O errors.
Please have a look at e.g. generic_make_request(). From the start of that function:
        if (blk_queue_enter(q, flags) < 0) {
                if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
                        bio_wouldblock_error(bio);
                else
                        bio_io_error(bio);
                return ret;
        }
The above patch changes the behavior of the blk_queue_enter() code from waiting while
q->mq_freeze_depth != 0 into returning -ENODEV while the request queue is frozen.
That will cause generic_make_request() to call bio_io_error(bio) while a request
queue is frozen if REQ_NOWAIT has not been set, which is the default behavior. So
any operation that freezes the queue temporarily, e.g. changing the queue depth,
concurrently with I/O processing can cause I/O to fail with -ENODEV. As you
probably know, failure of write requests has very annoying consequences; it e.g.
causes filesystems to go into read-only mode. That's why I think that the above
change is completely wrong.
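For example (a simplified sketch; assuming the code in this kernel version),
writing to /sys/block/<dev>/queue/nr_requests on a blk-mq device ends up in
blk_mq_update_nr_requests(), which brackets the update with a freeze:

int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
{
        (...snipped...)
        blk_mq_freeze_queue(q);
        (...snipped...)   /* update the tag set depths */
        blk_mq_unfreeze_queue(q);
        return ret;
}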
Bart.
> jbd2/sda1-8(PID=2271) is stuck waiting for journal commit operation.
> I don't know how this thread is involved to this problem.
It feels like it should be a necessary link in the chain. This is the
filesystem underneath the loop device. If that hangs, we would expect
the loop device to hang, but not vice versa.
AIUI, you couldn't reproduce this hang on your own machine.
Don't you think your fuzzer has just found a nice exploit for reiserfs?
Alan
(It's interesting that you found this hang just after we fixed
blk_queue_enter() to wait in D state instead of S state.[1] I don't
know syzkaller, so I assume from this it only flagged this case up
because of the hung task warning. I.e. syzkaller didn't have its own
watchdog that would have failed the test anyway.
[1] the commit that caused you to CC me. "block: do not use
interruptible wait anywhere"
https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428
)
Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > > smp_rmb();
> > > >
> > > > wait_event(q->mq_freeze_wq,
> > > > - (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > - (preempt || !blk_queue_preempt_only(q))) ||
> > > > + atomic_read(&q->mq_freeze_depth) ||
> > > > + (preempt || !blk_queue_preempt_only(q)) ||
> > > > blk_queue_dying(q));
> > > > - if (blk_queue_dying(q))
> > > > + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > > return -ENODEV;
> > > > }
> > > > }
> > >
> > > That change looks wrong to me.
> >
> > Hi Bart,
> >
> > Why does it look wrong to you?
>
> Because that change conflicts with the purpose of queue freezing and also because
> that change would inject I/O errors in code paths that shouldn't inject I/O errors.
But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is causing a deadlock.
A wait_event() that never returns is a bug. I think we should not wait for q->mq_freeze_depth.
I checked the counter values using the debug printk() patch shown below, and
found that q->q_usage_counter.count == 1 when this deadlock occurs.
Since the sum of the percpu counters did not change after percpu_ref_kill(), this is
not a race condition while folding the percpu counter values into the atomic counter
value. That is, for some reason, someone who is responsible for calling
percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
unable to call percpu_ref_put().
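If I read lib/percpu-refcount.c correctly, the real number of outstanding
references while the ref is still in percpu mode is
atomic_long_read(&ref->count) + (sum of the per-CPU counters) - PERCPU_COUNT_BIAS,
and percpu_ref_switch_to_atomic_rcu() later folds the per-CPU sum into
ref->count. After percpu_ref_kill() the per-CPU counters can no longer change,
which is why the debug patch below prints both the atomic value and the
per-CPU sum.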
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
}
EXPORT_SYMBOL(blk_alloc_queue);
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+ return (unsigned long __percpu *)
+ (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
/**
* blk_queue_enter() - try to increase q->q_usage_counter
* @q: request queue pointer
@@ -950,10 +956,22 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
*/
smp_rmb();
- wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
- blk_queue_dying(q));
+ while (wait_event_timeout(q->mq_freeze_wq,
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+ (preempt || !blk_queue_preempt_only(q))) ||
+ blk_queue_dying(q), 3 * HZ) == 0) {
+ struct percpu_ref *ref = &q->q_usage_counter;
+ unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+ unsigned long count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(percpu_count, cpu);
+
+ printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+ ref, atomic_long_read(&ref->count), count);
+ }
+
if (blk_queue_dying(q))
return -ENODEV;
}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
for_each_possible_cpu(cpu)
count += *per_cpu_ptr(percpu_count, cpu);
- pr_debug("global %ld percpu %ld",
- atomic_long_read(&ref->count), (long)count);
+ printk("%px global %ld percpu %ld\n", ref,
+ atomic_long_read(&ref->count), (long)count);
/*
* It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
*/
atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
+ printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
WARN_ONCE(atomic_long_read(&ref->count) <= 0,
"percpu ref (%pf) <= 0 (%ld) after switching to atomic",
ref->release, atomic_long_read(&ref->count));
If I change blk_queue_enter() not to wait at wait_event() when
q->mq_freeze_depth != 0, this deadlock problem does not occur.
Also, I found that if blk_freeze_queue_start() tries to wait for the
counter to drop to 1 before calling percpu_ref_kill() (as shown
below), this deadlock problem does not occur.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
}
+#define PERCPU_COUNT_BIAS (1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+ return (unsigned long __percpu *)
+ (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
void blk_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
+ int i;
+ for (i = 0; i < 10; i++) {
+ struct percpu_ref *ref = &q->q_usage_counter;
+ unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+ unsigned long count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(percpu_count, cpu);
+
+ if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+ break;
+ printk("%s(%d):! %px %ld %ld\n", current->comm, current->pid,
+ ref, atomic_long_read(&ref->count), count);
+ schedule_timeout_uninterruptible(HZ / 10);
+ }
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
But I don't know how to find someone who is failing to call percpu_ref_put()...
Tetsuo Handa wrote:
> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> not a race condition while folding percpu counter values into atomic counter
> value. That is, for some reason, someone who is responsible for calling
> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> unable to call percpu_ref_put().
> But I don't know how to find someone who is failing to call percpu_ref_put()...
I found the someone. It was already there in the backtrace...
----------------------------------------
[ 62.065852] a.out D 0 4414 4337 0x00000000
[ 62.067677] Call Trace:
[ 62.068545] __schedule+0x40b/0x860
[ 62.069726] schedule+0x31/0x80
[ 62.070796] schedule_timeout+0x1c1/0x3c0
[ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
[ 62.073670] blk_queue_enter+0x218/0x520
[ 62.074985] ? remove_wait_queue+0x70/0x70
[ 62.076361] generic_make_request+0x3d/0x540
[ 62.077785] ? __bio_clone_fast+0x6b/0x80
[ 62.079147] ? bio_clone_fast+0x2c/0x70
[ 62.080456] blk_queue_split+0x29b/0x560
[ 62.081772] ? blk_queue_split+0x29b/0x560
[ 62.083162] blk_mq_make_request+0x7c/0x430
[ 62.084562] generic_make_request+0x276/0x540
[ 62.086034] submit_bio+0x6e/0x140
[ 62.087185] ? submit_bio+0x6e/0x140
[ 62.088384] ? guard_bio_eod+0x9d/0x1d0
[ 62.089681] do_mpage_readpage+0x328/0x730
[ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
[ 62.092726] mpage_readpages+0x120/0x190
[ 62.094034] ? check_disk_change+0x70/0x70
[ 62.095454] ? check_disk_change+0x70/0x70
[ 62.096849] ? alloc_pages_current+0x65/0xd0
[ 62.098277] blkdev_readpages+0x18/0x20
[ 62.099568] __do_page_cache_readahead+0x298/0x360
[ 62.101157] ondemand_readahead+0x1f6/0x490
[ 62.102546] ? ondemand_readahead+0x1f6/0x490
[ 62.103995] page_cache_sync_readahead+0x29/0x40
[ 62.105539] generic_file_read_iter+0x7d0/0x9d0
[ 62.107067] ? futex_wait+0x221/0x240
[ 62.108303] ? trace_hardirqs_on+0xd/0x10
[ 62.109654] blkdev_read_iter+0x30/0x40
[ 62.110954] generic_file_splice_read+0xc5/0x140
[ 62.112538] do_splice_to+0x74/0x90
[ 62.113726] splice_direct_to_actor+0xa4/0x1f0
[ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
[ 62.116773] do_splice_direct+0x8a/0xb0
[ 62.118056] do_sendfile+0x1aa/0x390
[ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
[ 62.120666] do_syscall_64+0x6e/0x210
[ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------
The someone is blk_queue_split() from blk_mq_make_request(), which depends on the
assumption that blk_queue_enter() from the recursively called generic_make_request()
does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
----------------------------------------
generic_make_request(struct bio *bio) {
        if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
                if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
                        bio_wouldblock_error(bio);
                else
                        bio_io_error(bio);
                return ret;
        }
        (...snipped...)
        ret = q->make_request_fn(q, bio);
        (...snipped...)
        if (q)
                blk_queue_exit(q);
}
----------------------------------------
where q->make_request_fn == blk_mq_make_request which does
----------------------------------------
blk_mq_make_request(struct request_queue *q, struct bio *bio) {
        blk_queue_split(q, &bio);
}

blk_queue_split(struct request_queue *q, struct bio **bio) {
        generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
}
----------------------------------------
and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
percpu_ref_kill() are called by blk_freeze_queue_start()...
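To summarize the ordering (a rough sketch for illustration; the two threads
are the ones shown in the traces above):

(1) The sendfile() thread calls generic_make_request(); blk_queue_enter()
    succeeds because percpu_ref_tryget_live(&q->q_usage_counter) takes a
    reference.
(2) The loop_clr_fd() thread calls blk_freeze_queue_start(), which increments
    q->mq_freeze_depth and calls percpu_ref_kill(&q->q_usage_counter).
(3) The loop_clr_fd() thread then waits in blk_mq_freeze_queue_wait() for
    q->q_usage_counter to drop to zero.
(4) The sendfile() thread, still holding its reference, recurses into
    generic_make_request() via blk_queue_split(); this time
    percpu_ref_tryget_live() fails and blk_queue_enter() waits for
    q->mq_freeze_depth to become 0.
(5) Step (3) waits for the reference taken in step (1) to be dropped, while
    step (4) waits for the freeze started in step (2) to finish, so neither
    side can make progress.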
Now, it is up to you how to fix this race problem.
On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...
>
> ----------------------------------------
> [ 62.065852] a.out D 0 4414 4337 0x00000000
> [ 62.067677] Call Trace:
> [ 62.068545] __schedule+0x40b/0x860
> [ 62.069726] schedule+0x31/0x80
> [ 62.070796] schedule_timeout+0x1c1/0x3c0
> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> [ 62.073670] blk_queue_enter+0x218/0x520
> [ 62.074985] ? remove_wait_queue+0x70/0x70
> [ 62.076361] generic_make_request+0x3d/0x540
> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> [ 62.079147] ? bio_clone_fast+0x2c/0x70
> [ 62.080456] blk_queue_split+0x29b/0x560
> [ 62.081772] ? blk_queue_split+0x29b/0x560
> [ 62.083162] blk_mq_make_request+0x7c/0x430
> [ 62.084562] generic_make_request+0x276/0x540
> [ 62.086034] submit_bio+0x6e/0x140
> [ 62.087185] ? submit_bio+0x6e/0x140
> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> [ 62.089681] do_mpage_readpage+0x328/0x730
> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> [ 62.092726] mpage_readpages+0x120/0x190
> [ 62.094034] ? check_disk_change+0x70/0x70
> [ 62.095454] ? check_disk_change+0x70/0x70
> [ 62.096849] ? alloc_pages_current+0x65/0xd0
> [ 62.098277] blkdev_readpages+0x18/0x20
> [ 62.099568] __do_page_cache_readahead+0x298/0x360
> [ 62.101157] ondemand_readahead+0x1f6/0x490
> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> [ 62.103995] page_cache_sync_readahead+0x29/0x40
> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> [ 62.107067] ? futex_wait+0x221/0x240
> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> [ 62.109654] blkdev_read_iter+0x30/0x40
> [ 62.110954] generic_file_splice_read+0xc5/0x140
> [ 62.112538] do_splice_to+0x74/0x90
> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 62.116773] do_splice_direct+0x8a/0xb0
> [ 62.118056] do_sendfile+0x1aa/0x390
> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> [ 62.120666] do_syscall_64+0x6e/0x210
> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> bio_io_error(bio);
> return ret;
> }
> (...snipped...)
> ret = q->make_request_fn(q, bio);
> (...snipped...)
> if (q)
> blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you about how to fix this race problem.
Ahh, nicely spotted! One idea would be the one below. For this case,
we're recursing, so we can either do a non-block queue enter, or we
can just do a live enter.
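For reference, blk_queue_enter_live() (defined in block/blk.h) just takes
another reference on q->q_usage_counter without checking the freeze/dying
state, which is only safe when the caller already holds a reference; roughly:

static inline void blk_queue_enter_live(struct request_queue *q)
{
        /*
         * The caller guarantees that a reference on q_usage_counter is
         * already held, so the queue cannot be frozen or torn down here.
         */
        percpu_ref_get(&q->q_usage_counter);
}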
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..7de12e0dcc3d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio->bi_opf & REQ_QUEUE_ENTERED)
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..e1063e8f7eda 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;
+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..a6bc789b83e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -282,6 +282,8 @@ enum req_flag_bits {
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
+ __REQ_QUEUE_ENTERED, /* queue already entered for this request */
+
/* for driver use */
__REQ_DRV,
@@ -302,9 +304,8 @@ enum req_flag_bits {
#define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
-
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
-
+#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
#define REQ_DRV (1ULL << __REQ_DRV)
#define REQ_FAILFAST_MASK \
--
Jens Axboe
On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >
> > I found the someone. It was already there in the backtrace...
> >
> > ----------------------------------------
> > [ 62.065852] a.out D 0 4414 4337 0x00000000
> > [ 62.067677] Call Trace:
> > [ 62.068545] __schedule+0x40b/0x860
> > [ 62.069726] schedule+0x31/0x80
> > [ 62.070796] schedule_timeout+0x1c1/0x3c0
> > [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> > [ 62.073670] blk_queue_enter+0x218/0x520
> > [ 62.074985] ? remove_wait_queue+0x70/0x70
> > [ 62.076361] generic_make_request+0x3d/0x540
> > [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> > [ 62.079147] ? bio_clone_fast+0x2c/0x70
> > [ 62.080456] blk_queue_split+0x29b/0x560
> > [ 62.081772] ? blk_queue_split+0x29b/0x560
> > [ 62.083162] blk_mq_make_request+0x7c/0x430
> > [ 62.084562] generic_make_request+0x276/0x540
> > [ 62.086034] submit_bio+0x6e/0x140
> > [ 62.087185] ? submit_bio+0x6e/0x140
> > [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> > [ 62.089681] do_mpage_readpage+0x328/0x730
> > [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> > [ 62.092726] mpage_readpages+0x120/0x190
> > [ 62.094034] ? check_disk_change+0x70/0x70
> > [ 62.095454] ? check_disk_change+0x70/0x70
> > [ 62.096849] ? alloc_pages_current+0x65/0xd0
> > [ 62.098277] blkdev_readpages+0x18/0x20
> > [ 62.099568] __do_page_cache_readahead+0x298/0x360
> > [ 62.101157] ondemand_readahead+0x1f6/0x490
> > [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> > [ 62.103995] page_cache_sync_readahead+0x29/0x40
> > [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> > [ 62.107067] ? futex_wait+0x221/0x240
> > [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> > [ 62.109654] blkdev_read_iter+0x30/0x40
> > [ 62.110954] generic_file_splice_read+0xc5/0x140
> > [ 62.112538] do_splice_to+0x74/0x90
> > [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> > [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 62.116773] do_splice_direct+0x8a/0xb0
> > [ 62.118056] do_sendfile+0x1aa/0x390
> > [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> > [ 62.120666] do_syscall_64+0x6e/0x210
> > [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > ----------------------------------------
> >
> > The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> > assumption that blk_queue_enter() from recursively called generic_make_request()
> > does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> >
> > ----------------------------------------
> > generic_make_request(struct bio *bio) {
> > if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> > if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> > bio_wouldblock_error(bio);
> > else
> > bio_io_error(bio);
> > return ret;
> > }
> > (...snipped...)
> > ret = q->make_request_fn(q, bio);
> > (...snipped...)
> > if (q)
> > blk_queue_exit(q);
> > }
> > ----------------------------------------
> >
> > where q->make_request_fn == blk_mq_make_request which does
> >
> > ----------------------------------------
> > blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> > blk_queue_split(q, &bio);
> > }
> >
> > blk_queue_split(struct request_queue *q, struct bio **bio) {
> > generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> > }
> > ----------------------------------------
> >
> > and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> > percpu_ref_kill() are called by blk_freeze_queue_start()...
> >
> > Now, it is up to you about how to fix this race problem.
>
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b431eb0..7de12e0dcc3d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>
> if (bio->bi_opf & REQ_NOWAIT)
> flags = BLK_MQ_REQ_NOWAIT;
> - if (blk_queue_enter(q, flags) < 0) {
> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
> + blk_queue_enter_live(q);
> + else if (blk_queue_enter(q, flags) < 0) {
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 782940c65d8a..e1063e8f7eda 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
> /* there isn't chance to merge the splitted bio */
> split->bi_opf |= REQ_NOMERGE;
>
> + /*
> + * Since we're recursing into make_request here, ensure
> + * that we mark this bio as already having entered the queue.
> + * If not, and the queue is going away, we can get stuck
> + * forever on waiting for the queue reference to drop. But
> + * that will never happen, as we're already holding a
> + * reference to it.
> + */
> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
> +
> bio_chain(split, *bio);
> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> generic_make_request(*bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 17b18b91ebac..a6bc789b83e7 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -282,6 +282,8 @@ enum req_flag_bits {
> /* command specific flags for REQ_OP_WRITE_ZEROES: */
> __REQ_NOUNMAP, /* do not free blocks when zeroing */
>
> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
> +
> /* for driver use */
> __REQ_DRV,
>
> @@ -302,9 +304,8 @@ enum req_flag_bits {
> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
> -
> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
> -
> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
> #define REQ_DRV (1ULL << __REQ_DRV)
>
> #define REQ_FAILFAST_MASK \
bio clone inherits .bi_opf, so this way may not work for cloned bio.
Given most of bio_split() is run in .make_request_fn(), holding the
queue ref may not be needed in case of bio splitting, so another
candidate fix might be to introduce one extra parameter of 'need_queue_ref'
to generic_make_request().
Thanks,
Ming
On 6/1/18 5:49 PM, Ming Lei wrote:
> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>> Tetsuo Handa wrote:
>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>> not a race condition while folding percpu counter values into atomic counter
>>>> value. That is, for some reason, someone who is responsible for calling
>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>> unable to call percpu_ref_put().
>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>
>>> I found the someone. It was already there in the backtrace...
>>>
>>> ----------------------------------------
>>> [ 62.065852] a.out D 0 4414 4337 0x00000000
>>> [ 62.067677] Call Trace:
>>> [ 62.068545] __schedule+0x40b/0x860
>>> [ 62.069726] schedule+0x31/0x80
>>> [ 62.070796] schedule_timeout+0x1c1/0x3c0
>>> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
>>> [ 62.073670] blk_queue_enter+0x218/0x520
>>> [ 62.074985] ? remove_wait_queue+0x70/0x70
>>> [ 62.076361] generic_make_request+0x3d/0x540
>>> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
>>> [ 62.079147] ? bio_clone_fast+0x2c/0x70
>>> [ 62.080456] blk_queue_split+0x29b/0x560
>>> [ 62.081772] ? blk_queue_split+0x29b/0x560
>>> [ 62.083162] blk_mq_make_request+0x7c/0x430
>>> [ 62.084562] generic_make_request+0x276/0x540
>>> [ 62.086034] submit_bio+0x6e/0x140
>>> [ 62.087185] ? submit_bio+0x6e/0x140
>>> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
>>> [ 62.089681] do_mpage_readpage+0x328/0x730
>>> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
>>> [ 62.092726] mpage_readpages+0x120/0x190
>>> [ 62.094034] ? check_disk_change+0x70/0x70
>>> [ 62.095454] ? check_disk_change+0x70/0x70
>>> [ 62.096849] ? alloc_pages_current+0x65/0xd0
>>> [ 62.098277] blkdev_readpages+0x18/0x20
>>> [ 62.099568] __do_page_cache_readahead+0x298/0x360
>>> [ 62.101157] ondemand_readahead+0x1f6/0x490
>>> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
>>> [ 62.103995] page_cache_sync_readahead+0x29/0x40
>>> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
>>> [ 62.107067] ? futex_wait+0x221/0x240
>>> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
>>> [ 62.109654] blkdev_read_iter+0x30/0x40
>>> [ 62.110954] generic_file_splice_read+0xc5/0x140
>>> [ 62.112538] do_splice_to+0x74/0x90
>>> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
>>> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
>>> [ 62.116773] do_splice_direct+0x8a/0xb0
>>> [ 62.118056] do_sendfile+0x1aa/0x390
>>> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
>>> [ 62.120666] do_syscall_64+0x6e/0x210
>>> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> ----------------------------------------
>>>
>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>
>>> ----------------------------------------
>>> generic_make_request(struct bio *bio) {
>>> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>> bio_wouldblock_error(bio);
>>> else
>>> bio_io_error(bio);
>>> return ret;
>>> }
>>> (...snipped...)
>>> ret = q->make_request_fn(q, bio);
>>> (...snipped...)
>>> if (q)
>>> blk_queue_exit(q);
>>> }
>>> ----------------------------------------
>>>
>>> where q->make_request_fn == blk_mq_make_request which does
>>>
>>> ----------------------------------------
>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>> blk_queue_split(q, &bio);
>>> }
>>>
>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>> }
>>> ----------------------------------------
>>>
>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>> percpu_ref_kill() are called by blk_freeze_queue_start()...
>>>
>>> Now, it is up to you about how to fix this race problem.
>>
>> Ahh, nicely spotted! One idea would be the one below. For this case,
>> we're recursing, so we can either do a non-block queue enter, or we
>> can just do a live enter.
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b431eb0..7de12e0dcc3d 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>
>> if (bio->bi_opf & REQ_NOWAIT)
>> flags = BLK_MQ_REQ_NOWAIT;
>> - if (blk_queue_enter(q, flags) < 0) {
>> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
>> + blk_queue_enter_live(q);
>> + else if (blk_queue_enter(q, flags) < 0) {
>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>> bio_wouldblock_error(bio);
>> else
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 782940c65d8a..e1063e8f7eda 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>> /* there isn't chance to merge the splitted bio */
>> split->bi_opf |= REQ_NOMERGE;
>>
>> + /*
>> + * Since we're recursing into make_request here, ensure
>> + * that we mark this bio as already having entered the queue.
>> + * If not, and the queue is going away, we can get stuck
>> + * forever on waiting for the queue reference to drop. But
>> + * that will never happen, as we're already holding a
>> + * reference to it.
>> + */
>> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>> +
>> bio_chain(split, *bio);
>> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>> generic_make_request(*bio);
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 17b18b91ebac..a6bc789b83e7 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>> /* command specific flags for REQ_OP_WRITE_ZEROES: */
>> __REQ_NOUNMAP, /* do not free blocks when zeroing */
>>
>> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
>> +
>> /* for driver use */
>> __REQ_DRV,
>>
>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
>> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
>> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
>> -
>> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
>> -
>> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
>> #define REQ_DRV (1ULL << __REQ_DRV)
>>
>> #define REQ_FAILFAST_MASK \
>
>> bio clone inherits .bi_opf, so this approach may not work for cloned bios.
>>
>> Given that most bio_split() calls run inside .make_request_fn(), holding the
>> queue ref may not be needed for bio splitting, so another candidate fix
>> might be to introduce an extra 'need_queue_ref' parameter to
>> generic_make_request().
I did consider that, but that's pretty silly for a corner case and you'd
have to change a crap ton of calls to it. I think it'd be cleaner to
clear the bit when we need to, potentially even adding a debug check to
blk_queue_enter_live() that complains if the ref was not already
elevated. Though that would be expensive, compared to the percpu inc
now. Not saying the bit is necessarily the best way forward, but I do
like it a LOT more than adding an argument to generic_make_request.
It should have been a bio flag to begin with, that better captures the
scope and also avoids clones inheriting anything they shouldn't. It's
also not applicable to a request.
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..8f1063f50863 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio->bi_flags & BIO_QUEUE_ENTERED)
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..9ee9474e579c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;
+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ (*bio)->bi_flags |= BIO_QUEUE_ENTERED;
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..4687d7c99076 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
* throttling rules. Don't do it again. */
#define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
* of this bio. */
+#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */
+
/* See BVEC_POOL_OFFSET below before adding new flags */
/*
--
Jens Axboe
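The deadlock described above hinges on how queue freezing and blk_queue_enter()
interact. Roughly, simplified from the 4.17-era block/blk-core.c and
block/blk-mq.c (RCU, power-management and wakeup details trimmed; this is a
sketch, not verbatim kernel code):

----------------------------------------
void blk_freeze_queue_start(struct request_queue *q)
{
	if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
		/* from here on, percpu_ref_tryget_live() on this queue fails */
		percpu_ref_kill(&q->q_usage_counter);
	}
}

int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
	while (true) {
		if (percpu_ref_tryget_live(&q->q_usage_counter))
			return 0;
		if (flags & BLK_MQ_REQ_NOWAIT)
			return -EBUSY;
		/*
		 * Wait for the freeze to be undone.  The freeze itself only
		 * completes once q_usage_counter drops to zero, so a task
		 * that already holds a reference must never sleep here.
		 */
		wait_event(q->mq_freeze_wq,
			   atomic_read(&q->mq_freeze_depth) == 0 ||
			   blk_queue_dying(q));
		if (blk_queue_dying(q))
			return -ENODEV;
	}
}
----------------------------------------

The recursive generic_make_request() issued from blk_queue_split() is exactly
such a task: the outer call already holds a q_usage_counter reference, so once
a freeze starts the counter can never reach zero, the freeze never completes,
and the inner blk_queue_enter() sleeps forever. Hence the idea of taking a
"live" (unconditional) reference on the recursive path.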
On 6/1/18 6:49 PM, Jens Axboe wrote:
> On 6/1/18 5:49 PM, Ming Lei wrote:
>> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>>> Tetsuo Handa wrote:
>>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>>> not a race condition while folding percpu counter values into atomic counter
>>>>> value. That is, for some reason, someone who is responsible for calling
>>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>>> unable to call percpu_ref_put().
>>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>>
>>>> I found the someone. It was already there in the backtrace...
>>>>
>>>> ----------------------------------------
>>>> [ 62.065852] a.out D 0 4414 4337 0x00000000
>>>> [ 62.067677] Call Trace:
>>>> [ 62.068545] __schedule+0x40b/0x860
>>>> [ 62.069726] schedule+0x31/0x80
>>>> [ 62.070796] schedule_timeout+0x1c1/0x3c0
>>>> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
>>>> [ 62.073670] blk_queue_enter+0x218/0x520
>>>> [ 62.074985] ? remove_wait_queue+0x70/0x70
>>>> [ 62.076361] generic_make_request+0x3d/0x540
>>>> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
>>>> [ 62.079147] ? bio_clone_fast+0x2c/0x70
>>>> [ 62.080456] blk_queue_split+0x29b/0x560
>>>> [ 62.081772] ? blk_queue_split+0x29b/0x560
>>>> [ 62.083162] blk_mq_make_request+0x7c/0x430
>>>> [ 62.084562] generic_make_request+0x276/0x540
>>>> [ 62.086034] submit_bio+0x6e/0x140
>>>> [ 62.087185] ? submit_bio+0x6e/0x140
>>>> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
>>>> [ 62.089681] do_mpage_readpage+0x328/0x730
>>>> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
>>>> [ 62.092726] mpage_readpages+0x120/0x190
>>>> [ 62.094034] ? check_disk_change+0x70/0x70
>>>> [ 62.095454] ? check_disk_change+0x70/0x70
>>>> [ 62.096849] ? alloc_pages_current+0x65/0xd0
>>>> [ 62.098277] blkdev_readpages+0x18/0x20
>>>> [ 62.099568] __do_page_cache_readahead+0x298/0x360
>>>> [ 62.101157] ondemand_readahead+0x1f6/0x490
>>>> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
>>>> [ 62.103995] page_cache_sync_readahead+0x29/0x40
>>>> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
>>>> [ 62.107067] ? futex_wait+0x221/0x240
>>>> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
>>>> [ 62.109654] blkdev_read_iter+0x30/0x40
>>>> [ 62.110954] generic_file_splice_read+0xc5/0x140
>>>> [ 62.112538] do_splice_to+0x74/0x90
>>>> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
>>>> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
>>>> [ 62.116773] do_splice_direct+0x8a/0xb0
>>>> [ 62.118056] do_sendfile+0x1aa/0x390
>>>> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
>>>> [ 62.120666] do_syscall_64+0x6e/0x210
>>>> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> ----------------------------------------
>>>>
[…]
A little too much inheritance from the previous patch, better version
below.
diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..71cb228d7374 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio_flagged(bio, BIO_QUEUE_ENTERED))
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..481dc02668f9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;
+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ bio_set_flag(*bio, BIO_QUEUE_ENTERED);
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..1602bf4ab4cd 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
* throttling rules. Don't do it again. */
#define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
* of this bio. */
+#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */
+
/* See BVEC_POOL_OFFSET below before adding new flags */
/*
--
Jens Axboe
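The practical difference from the interim version above is that BIO_* values
are bit numbers into bi_flags rather than masks, so they have to go through
the bio flag helpers; open-coding bio->bi_flags & BIO_QUEUE_ENTERED would test
the wrong bits, which is presumably the "inheritance" from the REQ_* version
being corrected here. For reference, the helpers and the live enter look
roughly like this in the 4.17-era include/linux/bio.h and block/blk.h (a
sketch, comments added):

----------------------------------------
/* BIO_* flags are bit numbers into the 16-bit bi_flags word */
static inline bool bio_flagged(struct bio *bio, unsigned int bit)
{
	return (bio->bi_flags & (1U << bit)) != 0;
}

static inline void bio_set_flag(struct bio *bio, unsigned int bit)
{
	bio->bi_flags |= (1U << bit);
}

/*
 * The "live" enter: unlike percpu_ref_tryget_live(), percpu_ref_get()
 * still succeeds after percpu_ref_kill(), so it never blocks and never
 * fails.  It is only valid when the caller already holds a reference.
 */
static inline void blk_queue_enter_live(struct request_queue *q)
{
	percpu_ref_get(&q->q_usage_counter);
}
----------------------------------------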
On Fri, Jun 01, 2018 at 06:49:10PM -0600, Jens Axboe wrote:
[…]
> I did consider that, but that's pretty silly for a corner case and you'd
> have to change a crap ton of calls to it. I think it'd be cleaner to
It doesn't seem necessary to change any callers, since blk_queue_split() is
only called from q->make_request_fn() now.
> clear the bit when we need to, potentially even adding a debug check to
> blk_queue_enter_live() that complains if the ref was not already
> elevated. Though that would be expensive, compared to the percpu inc
> now. Not saying the bit is necessarily the best way forward, but I do
> like it a LOT more than adding an argument to generic_make_request.
We may introduce generic_make_request_with_ref() for blk_queue_split(),
what do you think of the following patch?
diff --git a/block/blk-core.c b/block/blk-core.c
index cee03cad99f2..e10ee3f9f338 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2363,7 +2363,7 @@ generic_make_request_checks(struct bio *bio)
* a lower device by calling into generic_make_request recursively, which
* means the bio should NOT be touched after the call to ->make_request_fn.
*/
-blk_qc_t generic_make_request(struct bio *bio)
+blk_qc_t __generic_make_request(struct bio *bio, bool got_ref)
{
/*
* bio_list_on_stack[0] contains bios submitted by the current
@@ -2379,12 +2379,14 @@ blk_qc_t generic_make_request(struct bio *bio)
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
- if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
- bio_wouldblock_error(bio);
- else
- bio_io_error(bio);
- return ret;
+ if (!got_ref) {
+ if (blk_queue_enter(q, flags) < 0) {
+ if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+ bio_wouldblock_error(bio);
+ else
+ bio_io_error(bio);
+ return ret;
+ }
}
if (!generic_make_request_checks(bio))
@@ -2426,8 +2428,9 @@ blk_qc_t generic_make_request(struct bio *bio)
bool enter_succeeded = true;
if (unlikely(q != bio->bi_disk->queue)) {
- if (q)
+ if (q && !got_ref)
blk_queue_exit(q);
+ got_ref = false;
q = bio->bi_disk->queue;
flags = 0;
if (bio->bi_opf & REQ_NOWAIT)
@@ -2476,7 +2479,7 @@ blk_qc_t generic_make_request(struct bio *bio)
blk_queue_exit(q);
return ret;
}
-EXPORT_SYMBOL(generic_make_request);
+EXPORT_SYMBOL(__generic_make_request);
/**
* direct_make_request - hand a buffer directly to its device driver for I/O
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5573d0fbec53..997616f1c85e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -180,6 +180,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return do_split ? new : NULL;
}
+/*
+ * This function has to be called from q->make_request_fn().
+ */
void blk_queue_split(struct request_queue *q, struct bio **bio)
{
struct bio *split, *res;
@@ -212,7 +215,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
- generic_make_request(*bio);
+ generic_make_request_with_ref(*bio);
*bio = split;
}
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4efd9af62e25..051018e0126e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -956,9 +956,18 @@ static inline void rq_flush_dcache_pages(struct request *rq)
}
#endif
+blk_qc_t generic_make_request(struct bio *bio)
+{
+ return __generic_make_request(bio, false);
+}
+
+blk_qc_t generic_make_request_with_ref(struct bio *bio)
+{
+ return __generic_make_request(bio, true);
+}
+
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
-extern blk_qc_t generic_make_request(struct bio *bio);
extern blk_qc_t direct_make_request(struct bio *bio);
extern void blk_rq_init(struct request_queue *q, struct request *rq);
extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
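One nit on the sketch as posted: the two wrappers are defined in blkdev.h, so
to avoid multiple-definition problems they would presumably end up as static
inlines over an exported __generic_make_request(), along these lines
(hypothetical header form, names taken from the patch above):

----------------------------------------
extern blk_qc_t __generic_make_request(struct bio *bio, bool got_ref);

/* plain submission path: take (or fail to take) a queue reference */
static inline blk_qc_t generic_make_request(struct bio *bio)
{
	return __generic_make_request(bio, false);
}

/* recursive submission path: caller already holds a queue reference */
static inline blk_qc_t generic_make_request_with_ref(struct bio *bio)
{
	return __generic_make_request(bio, true);
}
----------------------------------------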
>
> It should have been a bio flag to begin with, that better captures the
> scope and also avoids clones inheriting anything they shouldn't. It's
> also not applicable to a request.
>
[…]
In DM or MD, there is the following use case:
1) bio comes from fs, then one clone is made from this bio
2) the cloned bio is submitted via generic_make_request() from
another context, such as wq.
3) the BIO_QUEUE_ENTERED flag is then inherited, and as a result the
queue usage counter is never actually taken.
Thanks,
Ming
On 6/1/18 8:36 PM, Ming Lei wrote:
>> clear the bit when we need to, potentially even adding a debug check to
>> blk_queue_enter_live() that complains if the ref was not already
>> elevated. Though that would be expensive, compared to the percpu inc
>> now. Not saying the bit is necessarily the best way forward, but I do
>> like it a LOT more than adding an argument to generic_make_request.
>
> We may introduce generic_make_request_with_ref() for blk_queue_split(),
> what do you think of the following patch?
Functionally it'll work as well.
> In DM or MD, there is the following use case:
>
> 1) bio comes from fs, then one clone is made from this bio
>
> 2) the cloned bio is submitted via generic_make_request() from
> another context, such as wq.
>
> 3) the BIO_QUEUE_ENTERED flag is then inherited, and as a result the
> queue usage counter is never actually taken.
We don't copy flags on a clone.
--
Jens Axboe
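For context on the clone point: __bio_clone_fast() copies bi_opf (the REQ_*
bits the earlier objection was about) but builds bi_flags from scratch, so a
BIO_* flag such as BIO_QUEUE_ENTERED is not inherited by a clone. Condensed
from the 4.17-era block/bio.c (throttling and blkcg details omitted; a sketch,
not the full function):

----------------------------------------
void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
{
	bio->bi_disk = bio_src->bi_disk;
	bio->bi_partno = bio_src->bi_partno;
	bio_set_flag(bio, BIO_CLONED);	/* bi_flags is not copied from bio_src */
	bio->bi_opf = bio_src->bi_opf;	/* REQ_* bits are copied wholesale */
	bio->bi_write_hint = bio_src->bi_write_hint;
	bio->bi_iter = bio_src->bi_iter;
	bio->bi_io_vec = bio_src->bi_io_vec;
}
----------------------------------------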
On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> On 6/1/18 8:36 PM, Ming Lei wrote:
> >> clear the bit when we need to, potentially even adding a debug check to
> >> blk_queue_enter_live() that complains if the ref was not already
> >> elevated. Though that would be expensive, compared to the percpu inc
> >> now. Not saying the bit is necessarily the best way forward, but I do
> >> like it a LOT more than adding an argument to generic_make_request.
> >
> > We may introduce generic_make_request_with_ref() for blk_queue_split(),
> > what do you think of the following patch?
>
> Functionally it'll work as well.
If you are fine with this one, I'm glad to prepare a formal patch, so that
we can save one precious bio flag; only 13 flag bits are available in total.
>
> > In DM or MD, there is the following use case:
> >
> > 1) bio comes from fs, then one clone is made from this bio
> >
> > 2) the cloned bio is submitted via generic_make_request() from
> > another context, such as wq.
> >
> > 3) the BIO_QUEUE_ENTERED flag is then inherited, and as a result the
> > queue usage counter is never actually taken.
>
> We don't copy flags on a clone.
Yeah, you are right; it seems I overlooked that.
Thanks,
Ming
Ming Lei - 02.06.18, 06:54:
> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> > On 6/1/18 8:36 PM, Ming Lei wrote:
> > >> clear the bit when we need to, potentially even adding a debug
> > >> check to blk_queue_enter_live() that complains if the ref was
> > >> not already elevated. Though that would be expensive, compared
> > >> to the percpu inc now. Not saying the bit is necessarily the
> > >> best way forward, but I do like it a LOT more than adding an
> > >> argument to generic_make_request.> >
> > > We may introduce generic_make_request_with_ref() for
> > > blk_queue_split(), what do you think of the following patch?
> >
> > Functionally it'll work as well.
>
> If you are fine with this one, I'm glad to prepare a formal patch, so that
> we can save one precious bio flag; only 13 flag bits are available in total.
[…]
I am completely lost about what you are talking about.
That is fine with me, but in case there is anything to test… let me
know. I do not promise quick responses though, as I have been quite a bit
overwhelmed with stuff lately.
Currently I run 4.17-rc7 without any blk-mq related patches and it
appears to be free of hangs or stalls related to blk-mq.
Thanks,
--
Martin
> On Jun 1, 2018, at 10:54 PM, Ming Lei <[email protected]> wrote:
>
>> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
>> On 6/1/18 8:36 PM, Ming Lei wrote:
>>>> clear the bit when we need to, potentially even adding a debug check to
>>>> blk_queue_enter_live() that complains if the ref was not already
>>>> elevated. Though that would be expensive, compared to the percpu inc
>>>> now. Not saying the bit is necessarily the best way forward, but I do
>>>> like it a LOT more than adding an argument to generic_make_request.
>>>
>>> We may introduce generic_make_request_with_ref() for blk_queue_split(),
>>> what do you think of the following patch?
>>
>> Functionally it'll work as well.
>
> If you are fine with this one, I'm glad to prepare a formal patch, so that
> we can save one precious bio flag; only 13 flag bits are available in total.
We've never had much of an issue with bio-specific flags; I prefer that approach to a wrapper around generic_make_request().
On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
<[email protected]> wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...
Nice!
Do I understand it correctly that this bug is probably the root cause
of a whole lot of syzbot "task hung" reports? E.g. this one too?
https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
I guess we will need to sweep close everything related to
filesystems/block devices when this is committed?
[…]
On 2018/06/04 20:46, Dmitry Vyukov wrote:
> On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
> <[email protected]> wrote:
>> Tetsuo Handa wrote:
>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>> not a race condition while folding percpu counter values into atomic counter
>>> value. That is, for some reason, someone who is responsible for calling
>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>> unable to call percpu_ref_put().
>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>
>> I found the someone. It was already there in the backtrace...
>
> Nice!
>
> Do I understand it correctly that this bug is probably the root cause
> of a whole lot of syzbot "task hung" reports? E.g. this one too?
> https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
> I guess we will need to sweep close everything related to
> filesystems/block devices when this is committed?
>
I can't tell. We still have a lockdep warning for the loop module. Also, I
suspect that the hung tasks waiting for a page bit might be related to the
binder module. Since we are about to merge the "kernel/hung_task.c: show all
hung tasks before panic" patch, we might be able to get more hints after 4.18-rc1.
Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >
> > I found the someone. It was already there in the backtrace...
> >
>
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
>
While "block: don't use blocking queue entered for recursive bio submits" was
already applied, syzbot is still reporting a hung task with same signature but
different trace.
https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
----------------------------------------
[ 492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
[ 492.519604] Not tainted 4.17.0+ #83
[ 492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 492.531787] syz-executor1 D23384 20263 4574 0x00000004
[ 492.537443] Call Trace:
[ 492.540041] __schedule+0x801/0x1e30
[ 492.580958] schedule+0xef/0x430
[ 492.610154] blk_queue_enter+0x8da/0xdf0
[ 492.716327] generic_make_request+0x651/0x1790
[ 492.765680] submit_bio+0xba/0x460
[ 492.793198] submit_bio_wait+0x134/0x1e0
[ 492.801891] blkdev_issue_flush+0x204/0x300
[ 492.806236] blkdev_fsync+0x93/0xd0
[ 492.813620] vfs_fsync_range+0x140/0x220
[ 492.817702] vfs_fsync+0x29/0x30
[ 492.821081] __loop_update_dio+0x4de/0x6a0
[ 492.825341] lo_ioctl+0xd28/0x2190
[ 492.833442] blkdev_ioctl+0x9b6/0x2020
[ 492.872146] block_ioctl+0xee/0x130
[ 492.880139] do_vfs_ioctl+0x1cf/0x16a0
[ 492.927550] ksys_ioctl+0xa9/0xd0
[ 492.931036] __x64_sys_ioctl+0x73/0xb0
[ 492.934952] do_syscall_64+0x1b1/0x800
[ 492.963624] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 493.212768] 1 lock held by syz-executor1/20263:
[ 493.217448] #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
----------------------------------------
Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
blk_mq_unfreeze_queue()? The vfs_fsync() from __loop_update_dio() ends up in
blk_queue_enter() after blk_mq_freeze_queue() has already started blocking
blk_queue_enter() by calling atomic_inc_return() and percpu_ref_kill().
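Condensing the trace and the question above into one picture (a reading of the
report, not kernel source; whether the flush really lands on the queue that
was just frozen is exactly what is being asked):

----------------------------------------
lo_ioctl()                              /* holds lo_ctl_mutex                    */
  blk_mq_freeze_queue(q)                /* atomic_inc_return(&q->mq_freeze_depth)
                                           + percpu_ref_kill(&q->q_usage_counter) */
  __loop_update_dio()
    vfs_fsync(lo->lo_backing_file, 0)   /* backing file is a block device here   */
      blkdev_fsync()
        blkdev_issue_flush()
          submit_bio()
            generic_make_request()
              blk_queue_enter(q)        /* sleeps until mq_freeze_depth == 0     */
  blk_mq_unfreeze_queue(q)              /* never reached                         */
----------------------------------------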
On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
[…]
> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
> blk_mq_unfreeze_queue()? The vfs_fsync() from __loop_update_dio() ends up in
> blk_queue_enter() after blk_mq_freeze_queue() has already started blocking
> blk_queue_enter() by calling atomic_inc_return() and percpu_ref_kill().
>
The vfs_fsync() isn't necessary in loop_update_dio(), since both
generic_file_write_iter() and generic_file_read_iter() handle buffered I/O vs
direct I/O coherency well.
I will send a patch to remove the vfs_fsync() later.
Thanks,
Ming
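To see why the flush should be redundant: when a file is accessed with
IOCB_DIRECT, the generic filemap paths already write back and wait on any
overlapping dirty page cache before issuing the direct I/O. Roughly, a
simplified sketch of the generic behaviour in mm/filemap.c (not loop-specific
code, details elided):

----------------------------------------
ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	if (iocb->ki_flags & IOCB_DIRECT) {
		/* flush and wait on dirty page cache over the range first */
		filemap_write_and_wait_range(mapping, pos, pos + count - 1);
		/* ... then hand the range to ->direct_IO() ... */
	}
	/* ... otherwise read through the page cache ... */
}
----------------------------------------

The write side does the equivalent via generic_file_direct_write(), so
switching the loop device between buffered and direct I/O should not need an
explicit vfs_fsync() for coherency.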
On Tue, Jun 5, 2018 at 8:41 AM, Ming Lei <[email protected]> wrote:
[…]
> The vfs_fsync() isn't necessary in loop_update_dio(), since both
> generic_file_write_iter() and generic_file_read_iter() handle buffered I/O
> vs direct I/O coherency well.
>
> I will send a patch to remove the vfs_fsync() later.
Hi Tetsuo,
The issue might be fixed by removing this vfs_fsync(), but I'd like to
understand what is going on, since vfs_fsync() shouldn't have caused any I/O
to this loop queue.
I also tried the test with the following C reproducer from syzbot, but can't
reproduce it yet after running it for several hours.
https://syzkaller.appspot.com/x/repro.c?id=4727023951937536
Could you share how you reproduce it?
Thanks,
Ming Lei
On 2018/06/07 12:29, Ming Lei wrote:
[…]
>
> Could you share how you reproduce it?
I just reported that syzbot has started hitting the vfs_fsync() case.
I haven't succeeded in reproducing the vfs_fsync() case in my environment.
But in general, I modify "any reproducer provided by syzbot that loops
forever" to call exit() if open() etc. fails, since continuing the loop is
unlikely to help after hitting a "Too many open files" error.
  res = syscall(__NR_memfd_create, 0x200002c0, 0);
  if (res != -1)
    r[1] = res;
+ else
+   exit(1);
Thus, I restart the reproducer in a loop from the shell rather than keeping
it running for hours:
while :; do echo -n .; ./a.out; done