LinuxLists.cc - INFO: task hung in blk_queue

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

I managed to obtain SysRq-t when khungtaskd fires using debug printk()
( https://groups.google.com/forum/#!topic/syzkaller-bugs/OTuOsVebAiE ).
Only 4 threads shown below seems to be relevant to this problem.

[ 246.929688] task PC stack pid father
[ 249.888937] jbd2/sda1-8 D17736 2271 2 0x80000000
[ 249.894586] Call Trace:
[ 249.897193] __schedule+0x801/0x1e30
[ 249.900954] schedule+0xef/0x430
[ 249.904356] io_schedule+0x1c/0x70
[ 249.907935] bit_wait_io+0x18/0x90
[ 249.911492] __wait_on_bit+0xb3/0x130
[ 249.915313] out_of_line_wait_on_bit+0x204/0x3a0
[ 249.920105] __wait_on_buffer+0x76/0x90
[ 249.924102] jbd2_journal_commit_transaction+0x655b/0x8c18
[ 249.929893] kjournald2+0x26c/0xb30
[ 249.933579] kthread+0x345/0x410
[ 249.936966] ret_from_fork+0x3a/0x50
[ 254.408972] syz-executor7 D18976 5010 4828 0x00000004
[ 254.414639] Call Trace:
[ 254.417263] __schedule+0x801/0x1e30
[ 254.421070] schedule+0xef/0x430
[ 254.424467] blk_queue_enter+0x8da/0xdf0
[ 254.428584] generic_make_request+0x144/0x1510
[ 254.433217] blk_queue_split+0x374/0x2090
[ 254.437425] blk_mq_make_request+0x2d0/0x25c0
[ 254.442004] generic_make_request+0x795/0x1510
[ 254.446663] submit_bio+0xba/0x460
[ 254.451104] mpage_readpages+0x6d7/0x970
[ 254.455224] blkdev_readpages+0x2c/0x40
[ 254.459220] __do_page_cache_readahead+0x79a/0xdc0
[ 254.464205] ondemand_readahead+0x550/0xc40
[ 254.468559] page_cache_sync_readahead+0xd1/0x110
[ 254.473430] generic_file_read_iter+0x1a74/0x2f00
[ 254.478423] blkdev_read_iter+0x120/0x190
[ 254.482594] generic_file_splice_read+0x552/0x910
[ 254.487484] do_splice_to+0x12e/0x190
[ 254.491311] splice_direct_to_actor+0x268/0x8d0
[ 254.496017] do_splice_direct+0x2cc/0x400
[ 254.500224] do_sendfile+0x60f/0xe00
[ 254.503971] __x64_sys_sendfile64+0x155/0x240
[ 254.508507] do_syscall_64+0x1b1/0x800
[ 254.512431] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 254.568779] syz-executor7 D25408 5021 4828 0x00000004
[ 254.574429] Call Trace:
[ 254.577125] __schedule+0x801/0x1e30
[ 254.580882] schedule+0xef/0x430
[ 254.584297] blk_mq_freeze_queue_wait+0x1ce/0x460
[ 254.589192] blk_freeze_queue+0x4a/0x80
[ 254.593209] blk_mq_freeze_queue+0x15/0x20
[ 254.597464] loop_clr_fd+0x226/0xb80
[ 254.601208] lo_ioctl+0x642/0x2130
[ 254.604774] blkdev_ioctl+0x9b6/0x2020
[ 254.608715] block_ioctl+0xee/0x130
[ 254.612362] do_vfs_ioctl+0x1cf/0x16a0
[ 254.616294] ksys_ioctl+0xa9/0xd0
[ 254.619770] __x64_sys_ioctl+0x73/0xb0
[ 254.623674] do_syscall_64+0x1b1/0x800
[ 254.627596] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 255.022839] 1 lock held by syz-executor7/5021:
[ 255.028209] #0: (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130
[ 254.719133] blkid D23880 5071 2713 0x00000000
[ 254.724791] Call Trace:
[ 254.727402] __schedule+0x801/0x1e30
[ 254.731159] schedule+0xef/0x430
[ 254.734562] schedule_preempt_disabled+0x10/0x20
[ 254.739333] __mutex_lock+0xe38/0x17f0
[ 254.743280] mutex_lock_killable_nested+0x16/0x20
[ 254.748135] lo_ioctl+0x8d/0x2130
[ 254.751611] blkdev_ioctl+0x9b6/0x2020
[ 254.755550] block_ioctl+0xee/0x130
[ 254.759200] do_vfs_ioctl+0x1cf/0x16a0
[ 254.763123] ksys_ioctl+0xa9/0xd0
[ 254.766601] __x64_sys_ioctl+0x73/0xb0
[ 254.770506] do_syscall_64+0x1b1/0x800
[ 254.774425] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 255.036399] 1 lock held by blkid/5071:
[ 255.041075] #0: (ptrval) (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2130

syz-executor7(PID=5021) is doing ioctl(LOOP_CLR_FD).

blkid(PID=5071) is doing ioctl() but is blocked on lo_ctl_mutex which is
held by syz-executor7(PID=5021). Therefore, I assume that blkid(PID=5071)
is irrelevant for this hung up problem.

syz-executor7(PID=5021) is stuck at blk_mq_freeze_queue_wait() from
blk_freeze_queue() from blk_mq_freeze_queue() from loop_clr_fd().

/*
* Guarantee no request is in use, so we can change any data structure of
* the queue afterward.
*/
void blk_freeze_queue(struct request_queue *q)
{
/*
* In the !blk_mq case we are only calling this to kill the
* q_usage_counter, otherwise this increases the freeze depth
* and waits for it to return to zero. For this reason there is
* no blk_unfreeze_queue(), and blk_freeze_queue() is not
* exported to drivers as the only user for unfreeze is blk_mq.
*/
blk_freeze_queue_start(q);
if (!q->mq_ops)
blk_drain_queue(q);
blk_mq_freeze_queue_wait(q);
}

q->mq_freeze_depth is incremented at blk_freeze_queue_start() and
the caller waits until q->q_usage_counter becomes 0.

void blk_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;

freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
}
}

void blk_mq_freeze_queue_wait(struct request_queue *q)
{
wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}

I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
stuck at wait_event() in blk_queue_enter().

syz-executor7 D18976 5010 4828 0x00000004
Call Trace:
context_switch kernel/sched/core.c:2859 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3501
schedule+0xef/0x430 kernel/sched/core.c:3545
blk_queue_enter+0x8da/0xdf0 block/blk-core.c:953
generic_make_request+0x144/0x1510 block/blk-core.c:2395
blk_queue_split+0x374/0x2090 block/blk-merge.c:215
blk_mq_make_request+0x2d0/0x25c0 block/blk-mq.c:1861
generic_make_request+0x795/0x1510 block/blk-core.c:2460
submit_bio+0xba/0x460 block/blk-core.c:2568
mpage_bio_submit fs/mpage.c:66 [inline]
mpage_readpages+0x6d7/0x970 fs/mpage.c:393
blkdev_readpages+0x2c/0x40 fs/block_dev.c:577
read_pages mm/readahead.c:121 [inline]
__do_page_cache_readahead+0x79a/0xdc0 mm/readahead.c:199
ra_submit mm/internal.h:66 [inline]
ondemand_readahead+0x550/0xc40 mm/readahead.c:478
page_cache_sync_readahead+0xd1/0x110 mm/readahead.c:510
generic_file_buffered_read mm/filemap.c:2092 [inline]
generic_file_read_iter+0x1a74/0x2f00 mm/filemap.c:2362
blkdev_read_iter+0x120/0x190 fs/block_dev.c:1930
call_read_iter include/linux/fs.h:1778 [inline]
generic_file_splice_read+0x552/0x910 fs/splice.c:307
do_splice_to+0x12e/0x190 fs/splice.c:880
splice_direct_to_actor+0x268/0x8d0 fs/splice.c:952
do_splice_direct+0x2cc/0x400 fs/splice.c:1061
do_sendfile+0x60f/0xe00 fs/read_write.c:1440
__do_sys_sendfile64 fs/read_write.c:1495 [inline]
__se_sys_sendfile64 fs/read_write.c:1487 [inline]
__x64_sys_sendfile64+0x155/0x240 fs/read_write.c:1487
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since flags == 0, preempt == false. Since stuck at wait_event(), success == false.
Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
guess blk_queue_dying(q) == false because we are just trying to freeze/unfreeze.

/**
* blk_queue_enter() - try to increase q->q_usage_counter
* @q: request queue pointer
* @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
*/
int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
{
const bool preempt = flags & BLK_MQ_REQ_PREEMPT;

while (true) {
bool success = false;

rcu_read_lock();
if (percpu_ref_tryget_live(&q->q_usage_counter)) {
/*
* The code that sets the PREEMPT_ONLY flag is
* responsible for ensuring that that flag is globally
* visible before the queue is unfrozen.
*/
if (preempt || !blk_queue_preempt_only(q)) {
success = true;
} else {
percpu_ref_put(&q->q_usage_counter);
}
}
rcu_read_unlock();

if (success)
return 0;

if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;

/*
* read pair of barrier in blk_freeze_queue_start(),
* we need to order reading __PERCPU_REF_DEAD flag of
* .q_usage_counter and reading .mq_freeze_depth or
* queue dying flag, otherwise the following wait may
* never return if the two reads are reordered.
*/
smp_rmb();

wait_event(q->mq_freeze_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
(preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
}
}

jbd2/sda1-8(PID=2271) is stuck waiting for journal commit operation.
I don't know how this thread is involved to this problem.

jbd2/sda1-8 D17736 2271 2 0x80000000
Call Trace:
context_switch kernel/sched/core.c:2859 [inline]
__schedule+0x801/0x1e30 kernel/sched/core.c:3501
schedule+0xef/0x430 kernel/sched/core.c:3545
io_schedule+0x1c/0x70 kernel/sched/core.c:5165
bit_wait_io+0x18/0x90 kernel/sched/wait_bit.c:207
__wait_on_bit+0xb3/0x130 kernel/sched/wait_bit.c:48
out_of_line_wait_on_bit+0x204/0x3a0 kernel/sched/wait_bit.c:63
wait_on_bit_io include/linux/wait_bit.h:101 [inline]
__wait_on_buffer+0x76/0x90 fs/buffer.c:118
wait_on_buffer include/linux/buffer_head.h:356 [inline]
journal_wait_on_commit_record fs/jbd2/commit.c:174 [inline]
jbd2_journal_commit_transaction+0x655b/0x8c18 fs/jbd2/commit.c:865
kjournald2+0x26c/0xb30 fs/jbd2/journal.c:229
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Can we find the culprit from these hints?

2018-05-16 13:07:25

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

Tetsuo Handa wrote:
> I couldn't check whether freeze_depth in blk_freeze_queue_start() was 1,
> but presumably q->mq_freeze_depth > 0 because syz-executor7(PID=5010) is
> stuck at wait_event() in blk_queue_enter().
>
> Since flags == 0, preempt == false. Since stuck at wait_event(), success == false.
> Thus, atomic_read(&q->mq_freeze_depth) > 0 if blk_queue_dying(q) == false. And I
> guess blk_queue_dying(q) == false because we are just trying to freeze/unfreeze.
>

I was able to reproduce the hung up using modified reproducer, and got values
using below debug printk() patch.

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -950,10 +950,12 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
*/
smp_rmb();

- wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
- blk_queue_dying(q));
+ while (wait_event_timeout(q->mq_freeze_wq,
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+ (preempt || !blk_queue_preempt_only(q))) ||
+ blk_queue_dying(q), 10 * HZ) == 0)
+ printk("%s(%u): q->mq_freeze_depth=%d preempt=%d blk_queue_preempt_only(q)=%d blk_queue_dying(q)=%d\n",
+ current->comm, current->pid, atomic_read(&q->mq_freeze_depth), preempt, blk_queue_preempt_only(q), blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
}

[ 75.869126] print_req_error: I/O error, dev loop0, sector 0
[ 85.983146] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 96.222884] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 106.463322] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0
[ 116.702912] a.out(8838): q->mq_freeze_depth=1 preempt=0 blk_queue_preempt_only(q)=0 blk_queue_dying(q)=0

One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
blk_queue_enter() will never be satisfied. But what does that wait_event()
want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
as if blk_queue_dying(q) == true? That is, something like below:

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..59e2496 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
smp_rmb();

wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
+ atomic_read(&q->mq_freeze_depth) ||
+ (preempt || !blk_queue_preempt_only(q)) ||
blk_queue_dying(q));
- if (blk_queue_dying(q))
+ if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
return -ENODEV;
}
}

2018-05-16 14:58:30

by Bart Van Assche

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
> blk_queue_enter() will never be satisfied. But what does that wait_event()
> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
> as if blk_queue_dying(q) == true? That is, something like below:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b4..59e2496 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> smp_rmb();
>
> wait_event(q->mq_freeze_wq,
> - (atomic_read(&q->mq_freeze_depth) == 0 &&
> - (preempt || !blk_queue_preempt_only(q))) ||
> + atomic_read(&q->mq_freeze_depth) ||
> + (preempt || !blk_queue_preempt_only(q)) ||
> blk_queue_dying(q));
> - if (blk_queue_dying(q))
> + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> return -ENODEV;
> }
> }

That change looks wrong to me. Additionally, I think that you are looking in
the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
work fine for all block drivers except the loop driver I think that you should
have a closer look at how the loop driver uses this block layer functionality.

Thanks,

Bart.

2018-05-16 15:18:41

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
>> One ore more threads are waiting for q->mq_freeze_depth to become 0. But the
>> thread who incremented q->mq_freeze_depth at blk_freeze_queue_start(q) from
>> blk_freeze_queue() is waiting at blk_mq_freeze_queue_wait(). Therefore,
>> atomic_read(&q->mq_freeze_depth) == 0 condition for wait_event() in
>> blk_queue_enter() will never be satisfied. But what does that wait_event()
>> want to do? Isn't "start freezing" a sort of blk_queue_dying(q) == true?
>> Since percpu_ref_tryget_live(&q->q_usage_counter) failed and the queue is
>> about to be frozen, shouldn't we treat atomic_read(&q->mq_freeze_depth) != 0
>> as if blk_queue_dying(q) == true? That is, something like below:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b4..59e2496 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>> smp_rmb();
>>
>> wait_event(q->mq_freeze_wq,
>> - (atomic_read(&q->mq_freeze_depth) == 0 &&
>> - (preempt || !blk_queue_preempt_only(q))) ||
>> + atomic_read(&q->mq_freeze_depth) ||
>> + (preempt || !blk_queue_preempt_only(q)) ||
>> blk_queue_dying(q));
>> - if (blk_queue_dying(q))
>> + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
>> return -ENODEV;
>> }
>> }
>
> That change looks wrong to me.

Hi Bart,

Why does it look wrong to you?

> Additionally, I think that you are looking in
> the wrong direction. Since blk_mq_freeze_queue_wait() and blk_queue_enter()
> work fine for all block drivers except the loop driver I think that you should
> have a closer look at how the loop driver uses this block layer functionality.
>
> Thanks,
>
> Bart.
>
>
>

2018-05-16 15:39:07

by Bart Van Assche

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 85909b4..59e2496 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > smp_rmb();
> > >
> > > wait_event(q->mq_freeze_wq,
> > > - (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > - (preempt || !blk_queue_preempt_only(q))) ||
> > > + atomic_read(&q->mq_freeze_depth) ||
> > > + (preempt || !blk_queue_preempt_only(q)) ||
> > > blk_queue_dying(q));
> > > - if (blk_queue_dying(q))
> > > + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > return -ENODEV;
> > > }
> > > }
> >
> > That change looks wrong to me.
>
> Hi Bart,
>
> Why does it look wrong to you?

Because that change conflicts with the purpose of queue freezing and also because
that change would inject I/O errors in code paths that shouldn't inject I/O errors.
Please have a look at e.g. generic_make_request(). From the start of that function:

if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
bio_io_error(bio);
return ret;
}

The above patch changes the behavior of blk_queue_enter() code from waiting while
q->mq_freeze_depth != 0 into returning -ENODEV while the request queue is frozen.
That will cause generic_make_request() to call bio_io_error(bio) while a request
queue is frozen if REQ_NOWAIT has not been set, which is the default behavior. So
any operation that freezes the queue temporarily, e.g. changing the queue depth,
concurrently with I/O processing can cause I/O to fail with -ENODEV. As you
probably know failure of write requests has very annoying consequences. It e.g.
causes filesystems to go into read-only mode. That's why I think that the above
change is completely wrong.

Bart.

2018-05-16 17:35:40

by Alan Jenkins

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

> jbd2/sda1-8(PID=2271) is stuck waiting for journal commit operation.
> I don't know how this thread is involved to this problem.

It feels like it should be a necessary link in the chain. This is the
filesystem underneath the loop device. If that hangs, we would expect
the loop device to hang, but not vice versa.

AIUI, you couldn't reproduce this hang on your own machine.

Don't you think, your fuzzer has just found a nice exploit for reiserfs?

Alan

(It's interesting that you found this hang just after we fixed
block_queue_enter() to wait in D state instead of S state.[1] I don't
know syszkaller, so I assume from this it only flagged this case up
because of the hung task warning. I.e. syzkaller didn't have its own
watchdog that would have failed the test anyway.

[1] the commit that caused you to CC me. "block: do not use
interruptible wait anywhere"

https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

)

2018-05-21 21:54:22

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche <[email protected]> wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > > smp_rmb();
> > > >
> > > > wait_event(q->mq_freeze_wq,
> > > > - (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > - (preempt || !blk_queue_preempt_only(q))) ||
> > > > + atomic_read(&q->mq_freeze_depth) ||
> > > > + (preempt || !blk_queue_preempt_only(q)) ||
> > > > blk_queue_dying(q));
> > > > - if (blk_queue_dying(q))
> > > > + if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > > return -ENODEV;
> > > > }
> > > > }
> > >
> > > That change looks wrong to me.
> >
> > Hi Bart,
> >
> > Why does it look wrong to you?
>
> Because that change conflicts with the purpose of queue freezing and also because
> that change would inject I/O errors in code paths that shouldn't inject I/O errors.

But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is causing deadlock.
wait_event() never returns is a bug. I think we should not wait for q->mq_freeze_depth.

2018-05-22 11:23:00

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

I checked counter values using debug printk() patch shown below, and
found that q->q_usage_counter.count == 1 when this deadlock occurs.

Since sum of percpu_count did not change after percpu_ref_kill(), this is
not a race condition while folding percpu counter values into atomic counter
value. That is, for some reason, someone who is responsible for calling
percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
unable to call percpu_ref_put().

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b4..6933020 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,6 +908,12 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
}
EXPORT_SYMBOL(blk_alloc_queue);

+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+ return (unsigned long __percpu *)
+ (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
/**
* blk_queue_enter() - try to increase q->q_usage_counter
* @q: request queue pointer
@@ -950,10 +956,22 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
*/
smp_rmb();

- wait_event(q->mq_freeze_wq,
- (atomic_read(&q->mq_freeze_depth) == 0 &&
- (preempt || !blk_queue_preempt_only(q))) ||
- blk_queue_dying(q));
+ while (wait_event_timeout(q->mq_freeze_wq,
+ (atomic_read(&q->mq_freeze_depth) == 0 &&
+ (preempt || !blk_queue_preempt_only(q))) ||
+ blk_queue_dying(q), 3 * HZ) == 0) {
+ struct percpu_ref *ref = &q->q_usage_counter;
+ unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+ unsigned long count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(percpu_count, cpu);
+
+ printk("%s(%d): %px %ld %ld\n", current->comm, current->pid,
+ ref, atomic_long_read(&ref->count), count);
+ }
+
if (blk_queue_dying(q))
return -ENODEV;
}
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7..72773ce 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -133,8 +133,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
for_each_possible_cpu(cpu)
count += *per_cpu_ptr(percpu_count, cpu);

- pr_debug("global %ld percpu %ld",
- atomic_long_read(&ref->count), (long)count);
+ printk("%px global %ld percpu %ld\n", ref,
+ atomic_long_read(&ref->count), (long)count);

/*
* It's crucial that we sum the percpu counters _before_ adding the sum
@@ -150,6 +150,8 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
*/
atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);

+ printk("%px global %ld\n", ref, atomic_long_read(&ref->count));
+
WARN_ONCE(atomic_long_read(&ref->count) <= 0,
"percpu ref (%pf) <= 0 (%ld) after switching to atomic",
ref->release, atomic_long_read(&ref->count));

If I change blk_queue_enter() not to wait at wait_event() if
q->mq_freeze_depth != 0, this deadlock problem does not occur.

Also, I found that if blk_freeze_queue_start() tries to wait for
counters to become 1 before calling percpu_ref_kill() (like shown
below), this deadlock problem does not occur.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ce9cac..4bff534 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -134,12 +134,36 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
}

+#define PERCPU_COUNT_BIAS (1LU << (BITS_PER_LONG - 1))
+
+static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
+{
+ return (unsigned long __percpu *)
+ (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+}
+
void blk_freeze_queue_start(struct request_queue *q)
{
int freeze_depth;

freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
+ int i;
+ for (i = 0; i < 10; i++) {
+ struct percpu_ref *ref = &q->q_usage_counter;
+ unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
+ unsigned long count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(percpu_count, cpu);
+
+ if (atomic_long_read(&ref->count) + count - PERCPU_COUNT_BIAS == 1)
+ break;
+ printk("%s(%d):! %px %ld %ld\n", current->comm, current->pid,
+ ref, atomic_long_read(&ref->count), count);
+ schedule_timeout_uninterruptible(HZ / 10);
+ }
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);

But I don't know how to find someone who is failing to call percpu_ref_put()...

2018-06-01 10:12:32

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

Tetsuo Handa wrote:
> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> not a race condition while folding percpu counter values into atomic counter
> value. That is, for some reason, someone who is responsible for calling
> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> unable to call percpu_ref_put().
> But I don't know how to find someone who is failing to call percpu_ref_put()...

I found the someone. It was already there in the backtrace...

----------------------------------------
[ 62.065852] a.out D 0 4414 4337 0x00000000
[ 62.067677] Call Trace:
[ 62.068545] __schedule+0x40b/0x860
[ 62.069726] schedule+0x31/0x80
[ 62.070796] schedule_timeout+0x1c1/0x3c0
[ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
[ 62.073670] blk_queue_enter+0x218/0x520
[ 62.074985] ? remove_wait_queue+0x70/0x70
[ 62.076361] generic_make_request+0x3d/0x540
[ 62.077785] ? __bio_clone_fast+0x6b/0x80
[ 62.079147] ? bio_clone_fast+0x2c/0x70
[ 62.080456] blk_queue_split+0x29b/0x560
[ 62.081772] ? blk_queue_split+0x29b/0x560
[ 62.083162] blk_mq_make_request+0x7c/0x430
[ 62.084562] generic_make_request+0x276/0x540
[ 62.086034] submit_bio+0x6e/0x140
[ 62.087185] ? submit_bio+0x6e/0x140
[ 62.088384] ? guard_bio_eod+0x9d/0x1d0
[ 62.089681] do_mpage_readpage+0x328/0x730
[ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
[ 62.092726] mpage_readpages+0x120/0x190
[ 62.094034] ? check_disk_change+0x70/0x70
[ 62.095454] ? check_disk_change+0x70/0x70
[ 62.096849] ? alloc_pages_current+0x65/0xd0
[ 62.098277] blkdev_readpages+0x18/0x20
[ 62.099568] __do_page_cache_readahead+0x298/0x360
[ 62.101157] ondemand_readahead+0x1f6/0x490
[ 62.102546] ? ondemand_readahead+0x1f6/0x490
[ 62.103995] page_cache_sync_readahead+0x29/0x40
[ 62.105539] generic_file_read_iter+0x7d0/0x9d0
[ 62.107067] ? futex_wait+0x221/0x240
[ 62.108303] ? trace_hardirqs_on+0xd/0x10
[ 62.109654] blkdev_read_iter+0x30/0x40
[ 62.110954] generic_file_splice_read+0xc5/0x140
[ 62.112538] do_splice_to+0x74/0x90
[ 62.113726] splice_direct_to_actor+0xa4/0x1f0
[ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
[ 62.116773] do_splice_direct+0x8a/0xb0
[ 62.118056] do_sendfile+0x1aa/0x390
[ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
[ 62.120666] do_syscall_64+0x6e/0x210
[ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------

The someone is blk_queue_split() from blk_mq_make_request() who depends on an
assumption that blk_queue_enter() from recursively called generic_make_request()
does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.

----------------------------------------
generic_make_request(struct bio *bio) {
if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
bio_io_error(bio);
return ret;
}
(...snipped...)
ret = q->make_request_fn(q, bio);
(...snipped...)
if (q)
blk_queue_exit(q);
}
----------------------------------------

where q->make_request_fn == blk_mq_make_request which does

----------------------------------------
blk_mq_make_request(struct request_queue *q, struct bio *bio) {
blk_queue_split(q, &bio);
}

blk_queue_split(struct request_queue *q, struct bio **bio) {
generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
}
----------------------------------------

and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
percpu_ref_kill() are called by blk_freeze_queue_start()...

Now, it is up to you about how to fix this race problem.

2018-06-01 17:53:49

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...
>
> ----------------------------------------
> [ 62.065852] a.out D 0 4414 4337 0x00000000
> [ 62.067677] Call Trace:
> [ 62.068545] __schedule+0x40b/0x860
> [ 62.069726] schedule+0x31/0x80
> [ 62.070796] schedule_timeout+0x1c1/0x3c0
> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> [ 62.073670] blk_queue_enter+0x218/0x520
> [ 62.074985] ? remove_wait_queue+0x70/0x70
> [ 62.076361] generic_make_request+0x3d/0x540
> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> [ 62.079147] ? bio_clone_fast+0x2c/0x70
> [ 62.080456] blk_queue_split+0x29b/0x560
> [ 62.081772] ? blk_queue_split+0x29b/0x560
> [ 62.083162] blk_mq_make_request+0x7c/0x430
> [ 62.084562] generic_make_request+0x276/0x540
> [ 62.086034] submit_bio+0x6e/0x140
> [ 62.087185] ? submit_bio+0x6e/0x140
> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> [ 62.089681] do_mpage_readpage+0x328/0x730
> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> [ 62.092726] mpage_readpages+0x120/0x190
> [ 62.094034] ? check_disk_change+0x70/0x70
> [ 62.095454] ? check_disk_change+0x70/0x70
> [ 62.096849] ? alloc_pages_current+0x65/0xd0
> [ 62.098277] blkdev_readpages+0x18/0x20
> [ 62.099568] __do_page_cache_readahead+0x298/0x360
> [ 62.101157] ondemand_readahead+0x1f6/0x490
> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> [ 62.103995] page_cache_sync_readahead+0x29/0x40
> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> [ 62.107067] ? futex_wait+0x221/0x240
> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> [ 62.109654] blkdev_read_iter+0x30/0x40
> [ 62.110954] generic_file_splice_read+0xc5/0x140
> [ 62.112538] do_splice_to+0x74/0x90
> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 62.116773] do_splice_direct+0x8a/0xb0
> [ 62.118056] do_sendfile+0x1aa/0x390
> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> [ 62.120666] do_syscall_64+0x6e/0x210
> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> bio_io_error(bio);
> return ret;
> }
> (...snipped...)
> ret = q->make_request_fn(q, bio);
> (...snipped...)
> if (q)
> blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you about how to fix this race problem.

Ahh, nicely spotted! One idea would be the one below. For this case,
we're recursing, so we can either do a non-block queue enter, or we
can just do a live enter.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..7de12e0dcc3d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)

if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio->bi_opf & REQ_QUEUE_ENTERED)
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..e1063e8f7eda 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;

+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..a6bc789b83e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -282,6 +282,8 @@ enum req_flag_bits {
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */

+ __REQ_QUEUE_ENTERED, /* queue already entered for this request */
+
/* for driver use */
__REQ_DRV,

@@ -302,9 +304,8 @@ enum req_flag_bits {
#define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
-
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
-
+#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
#define REQ_DRV (1ULL << __REQ_DRV)

#define REQ_FAILFAST_MASK \

--
Jens Axboe

2018-06-01 23:51:01

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >
> > I found the someone. It was already there in the backtrace...
> >
> > ----------------------------------------
> > [ 62.065852] a.out D 0 4414 4337 0x00000000
> > [ 62.067677] Call Trace:
> > [ 62.068545] __schedule+0x40b/0x860
> > [ 62.069726] schedule+0x31/0x80
> > [ 62.070796] schedule_timeout+0x1c1/0x3c0
> > [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> > [ 62.073670] blk_queue_enter+0x218/0x520
> > [ 62.074985] ? remove_wait_queue+0x70/0x70
> > [ 62.076361] generic_make_request+0x3d/0x540
> > [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> > [ 62.079147] ? bio_clone_fast+0x2c/0x70
> > [ 62.080456] blk_queue_split+0x29b/0x560
> > [ 62.081772] ? blk_queue_split+0x29b/0x560
> > [ 62.083162] blk_mq_make_request+0x7c/0x430
> > [ 62.084562] generic_make_request+0x276/0x540
> > [ 62.086034] submit_bio+0x6e/0x140
> > [ 62.087185] ? submit_bio+0x6e/0x140
> > [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> > [ 62.089681] do_mpage_readpage+0x328/0x730
> > [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> > [ 62.092726] mpage_readpages+0x120/0x190
> > [ 62.094034] ? check_disk_change+0x70/0x70
> > [ 62.095454] ? check_disk_change+0x70/0x70
> > [ 62.096849] ? alloc_pages_current+0x65/0xd0
> > [ 62.098277] blkdev_readpages+0x18/0x20
> > [ 62.099568] __do_page_cache_readahead+0x298/0x360
> > [ 62.101157] ondemand_readahead+0x1f6/0x490
> > [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> > [ 62.103995] page_cache_sync_readahead+0x29/0x40
> > [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> > [ 62.107067] ? futex_wait+0x221/0x240
> > [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> > [ 62.109654] blkdev_read_iter+0x30/0x40
> > [ 62.110954] generic_file_splice_read+0xc5/0x140
> > [ 62.112538] do_splice_to+0x74/0x90
> > [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> > [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> > [ 62.116773] do_splice_direct+0x8a/0xb0
> > [ 62.118056] do_sendfile+0x1aa/0x390
> > [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> > [ 62.120666] do_syscall_64+0x6e/0x210
> > [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > ----------------------------------------
> >
> > The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> > assumption that blk_queue_enter() from recursively called generic_make_request()
> > does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> >
> > ----------------------------------------
> > generic_make_request(struct bio *bio) {
> > if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> > if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> > bio_wouldblock_error(bio);
> > else
> > bio_io_error(bio);
> > return ret;
> > }
> > (...snipped...)
> > ret = q->make_request_fn(q, bio);
> > (...snipped...)
> > if (q)
> > blk_queue_exit(q);
> > }
> > ----------------------------------------
> >
> > where q->make_request_fn == blk_mq_make_request which does
> >
> > ----------------------------------------
> > blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> > blk_queue_split(q, &bio);
> > }
> >
> > blk_queue_split(struct request_queue *q, struct bio **bio) {
> > generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> > }
> > ----------------------------------------
> >
> > and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> > percpu_ref_kill() are called by blk_freeze_queue_start()...
> >
> > Now, it is up to you about how to fix this race problem.
>
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b431eb0..7de12e0dcc3d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>
> if (bio->bi_opf & REQ_NOWAIT)
> flags = BLK_MQ_REQ_NOWAIT;
> - if (blk_queue_enter(q, flags) < 0) {
> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
> + blk_queue_enter_live(q);
> + else if (blk_queue_enter(q, flags) < 0) {
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 782940c65d8a..e1063e8f7eda 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
> /* there isn't chance to merge the splitted bio */
> split->bi_opf |= REQ_NOMERGE;
>
> + /*
> + * Since we're recursing into make_request here, ensure
> + * that we mark this bio as already having entered the queue.
> + * If not, and the queue is going away, we can get stuck
> + * forever on waiting for the queue reference to drop. But
> + * that will never happen, as we're already holding a
> + * reference to it.
> + */
> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
> +
> bio_chain(split, *bio);
> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> generic_make_request(*bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 17b18b91ebac..a6bc789b83e7 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -282,6 +282,8 @@ enum req_flag_bits {
> /* command specific flags for REQ_OP_WRITE_ZEROES: */
> __REQ_NOUNMAP, /* do not free blocks when zeroing */
>
> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
> +
> /* for driver use */
> __REQ_DRV,
>
> @@ -302,9 +304,8 @@ enum req_flag_bits {
> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
> -
> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
> -
> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
> #define REQ_DRV (1ULL << __REQ_DRV)
>
> #define REQ_FAILFAST_MASK \

bio clone inherits .bi_opf, so this way may not work for cloned bio.

Given most of bio_split() is run in .make_request_fn(), holding the
queue ref may not be needed in case of bio splitting, so another
candidate fix might be to introduce one extra parameter of 'need_queue_ref'
to generic_make_request().

Thanks,
Ming

2018-06-02 00:51:17

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On 6/1/18 5:49 PM, Ming Lei wrote:
> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>> Tetsuo Handa wrote:
>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>> not a race condition while folding percpu counter values into atomic counter
>>>> value. That is, for some reason, someone who is responsible for calling
>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>> unable to call percpu_ref_put().
>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>
>>> I found the someone. It was already there in the backtrace...
>>>
>>> ----------------------------------------
>>> [ 62.065852] a.out D 0 4414 4337 0x00000000
>>> [ 62.067677] Call Trace:
>>> [ 62.068545] __schedule+0x40b/0x860
>>> [ 62.069726] schedule+0x31/0x80
>>> [ 62.070796] schedule_timeout+0x1c1/0x3c0
>>> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
>>> [ 62.073670] blk_queue_enter+0x218/0x520
>>> [ 62.074985] ? remove_wait_queue+0x70/0x70
>>> [ 62.076361] generic_make_request+0x3d/0x540
>>> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
>>> [ 62.079147] ? bio_clone_fast+0x2c/0x70
>>> [ 62.080456] blk_queue_split+0x29b/0x560
>>> [ 62.081772] ? blk_queue_split+0x29b/0x560
>>> [ 62.083162] blk_mq_make_request+0x7c/0x430
>>> [ 62.084562] generic_make_request+0x276/0x540
>>> [ 62.086034] submit_bio+0x6e/0x140
>>> [ 62.087185] ? submit_bio+0x6e/0x140
>>> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
>>> [ 62.089681] do_mpage_readpage+0x328/0x730
>>> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
>>> [ 62.092726] mpage_readpages+0x120/0x190
>>> [ 62.094034] ? check_disk_change+0x70/0x70
>>> [ 62.095454] ? check_disk_change+0x70/0x70
>>> [ 62.096849] ? alloc_pages_current+0x65/0xd0
>>> [ 62.098277] blkdev_readpages+0x18/0x20
>>> [ 62.099568] __do_page_cache_readahead+0x298/0x360
>>> [ 62.101157] ondemand_readahead+0x1f6/0x490
>>> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
>>> [ 62.103995] page_cache_sync_readahead+0x29/0x40
>>> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
>>> [ 62.107067] ? futex_wait+0x221/0x240
>>> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
>>> [ 62.109654] blkdev_read_iter+0x30/0x40
>>> [ 62.110954] generic_file_splice_read+0xc5/0x140
>>> [ 62.112538] do_splice_to+0x74/0x90
>>> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
>>> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
>>> [ 62.116773] do_splice_direct+0x8a/0xb0
>>> [ 62.118056] do_sendfile+0x1aa/0x390
>>> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
>>> [ 62.120666] do_syscall_64+0x6e/0x210
>>> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> ----------------------------------------
>>>
>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>
>>> ----------------------------------------
>>> generic_make_request(struct bio *bio) {
>>> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>> bio_wouldblock_error(bio);
>>> else
>>> bio_io_error(bio);
>>> return ret;
>>> }
>>> (...snipped...)
>>> ret = q->make_request_fn(q, bio);
>>> (...snipped...)
>>> if (q)
>>> blk_queue_exit(q);
>>> }
>>> ----------------------------------------
>>>
>>> where q->make_request_fn == blk_mq_make_request which does
>>>
>>> ----------------------------------------
>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>> blk_queue_split(q, &bio);
>>> }
>>>
>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>> }
>>> ----------------------------------------
>>>
>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>> percpu_ref_kill() are called by blk_freeze_queue_start()...
>>>
>>> Now, it is up to you about how to fix this race problem.
>>
>> Ahh, nicely spotted! One idea would be the one below. For this case,
>> we're recursing, so we can either do a non-block queue enter, or we
>> can just do a live enter.
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b431eb0..7de12e0dcc3d 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>
>> if (bio->bi_opf & REQ_NOWAIT)
>> flags = BLK_MQ_REQ_NOWAIT;
>> - if (blk_queue_enter(q, flags) < 0) {
>> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
>> + blk_queue_enter_live(q);
>> + else if (blk_queue_enter(q, flags) < 0) {
>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>> bio_wouldblock_error(bio);
>> else
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 782940c65d8a..e1063e8f7eda 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>> /* there isn't chance to merge the splitted bio */
>> split->bi_opf |= REQ_NOMERGE;
>>
>> + /*
>> + * Since we're recursing into make_request here, ensure
>> + * that we mark this bio as already having entered the queue.
>> + * If not, and the queue is going away, we can get stuck
>> + * forever on waiting for the queue reference to drop. But
>> + * that will never happen, as we're already holding a
>> + * reference to it.
>> + */
>> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>> +
>> bio_chain(split, *bio);
>> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>> generic_make_request(*bio);
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 17b18b91ebac..a6bc789b83e7 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>> /* command specific flags for REQ_OP_WRITE_ZEROES: */
>> __REQ_NOUNMAP, /* do not free blocks when zeroing */
>>
>> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
>> +
>> /* for driver use */
>> __REQ_DRV,
>>
>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
>> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
>> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
>> -
>> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
>> -
>> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
>> #define REQ_DRV (1ULL << __REQ_DRV)
>>
>> #define REQ_FAILFAST_MASK \
>
> bio clone inherits .bi_opf, so this way may not work for cloned bio.
>
> Given most of bio_split() is run in .make_request_fn(), holding the
> queue ref may not be needed in case of bio splitting, so another
> candidate fix might be to introduce one extra parameter of
> 'need_queue_ref' to generic_make_request().

I did consider that, but that's pretty silly for a corner case and you'd
have to change a crap ton of calls to it. I think it'd be cleaner to
clear the bit when we need to, potentially even adding a debug check to
blk_queue_enter_live() that complains if the ref was not already
elevated. Though that would be expensive, compared to the percpu inc
now. Not saying the bit is necessarily the best way forward, but I do
like it a LOT more than adding an argument to generic_make_request.

It should have been a bio flag to begin with, that better captures the
scope and also avoids clones inheriting anything they shouldn't. It's
also not applicable to a request.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..8f1063f50863 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)

if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio->bi_flags & BIO_QUEUE_ENTERED)
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..9ee9474e579c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;

+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ (*bio)->bi_flags |= BIO_QUEUE_ENTERED;
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..4687d7c99076 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
* throttling rules. Don't do it again. */
#define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
* of this bio. */
+#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */
+
/* See BVEC_POOL_OFFSET below before adding new flags */

/*

--
Jens Axboe

2018-06-02 00:58:36

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On 6/1/18 6:49 PM, Jens Axboe wrote:
> On 6/1/18 5:49 PM, Ming Lei wrote:
>> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>>> Tetsuo Handa wrote:
>>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>>> not a race condition while folding percpu counter values into atomic counter
>>>>> value. That is, for some reason, someone who is responsible for calling
>>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>>> unable to call percpu_ref_put().
>>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>>
>>>> I found the someone. It was already there in the backtrace...
>>>>
>>>> ----------------------------------------
>>>> [ 62.065852] a.out D 0 4414 4337 0x00000000
>>>> [ 62.067677] Call Trace:
>>>> [ 62.068545] __schedule+0x40b/0x860
>>>> [ 62.069726] schedule+0x31/0x80
>>>> [ 62.070796] schedule_timeout+0x1c1/0x3c0
>>>> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
>>>> [ 62.073670] blk_queue_enter+0x218/0x520
>>>> [ 62.074985] ? remove_wait_queue+0x70/0x70
>>>> [ 62.076361] generic_make_request+0x3d/0x540
>>>> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
>>>> [ 62.079147] ? bio_clone_fast+0x2c/0x70
>>>> [ 62.080456] blk_queue_split+0x29b/0x560
>>>> [ 62.081772] ? blk_queue_split+0x29b/0x560
>>>> [ 62.083162] blk_mq_make_request+0x7c/0x430
>>>> [ 62.084562] generic_make_request+0x276/0x540
>>>> [ 62.086034] submit_bio+0x6e/0x140
>>>> [ 62.087185] ? submit_bio+0x6e/0x140
>>>> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
>>>> [ 62.089681] do_mpage_readpage+0x328/0x730
>>>> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
>>>> [ 62.092726] mpage_readpages+0x120/0x190
>>>> [ 62.094034] ? check_disk_change+0x70/0x70
>>>> [ 62.095454] ? check_disk_change+0x70/0x70
>>>> [ 62.096849] ? alloc_pages_current+0x65/0xd0
>>>> [ 62.098277] blkdev_readpages+0x18/0x20
>>>> [ 62.099568] __do_page_cache_readahead+0x298/0x360
>>>> [ 62.101157] ondemand_readahead+0x1f6/0x490
>>>> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
>>>> [ 62.103995] page_cache_sync_readahead+0x29/0x40
>>>> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
>>>> [ 62.107067] ? futex_wait+0x221/0x240
>>>> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
>>>> [ 62.109654] blkdev_read_iter+0x30/0x40
>>>> [ 62.110954] generic_file_splice_read+0xc5/0x140
>>>> [ 62.112538] do_splice_to+0x74/0x90
>>>> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
>>>> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
>>>> [ 62.116773] do_splice_direct+0x8a/0xb0
>>>> [ 62.118056] do_sendfile+0x1aa/0x390
>>>> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
>>>> [ 62.120666] do_syscall_64+0x6e/0x210
>>>> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> ----------------------------------------
>>>>
>>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>>
>>>> ----------------------------------------
>>>> generic_make_request(struct bio *bio) {
>>>> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>>> bio_wouldblock_error(bio);
>>>> else
>>>> bio_io_error(bio);
>>>> return ret;
>>>> }
>>>> (...snipped...)
>>>> ret = q->make_request_fn(q, bio);
>>>> (...snipped...)
>>>> if (q)
>>>> blk_queue_exit(q);
>>>> }
>>>> ----------------------------------------
>>>>
>>>> where q->make_request_fn == blk_mq_make_request which does
>>>>
>>>> ----------------------------------------
>>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>>> blk_queue_split(q, &bio);
>>>> }
>>>>
>>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>>> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>>> }
>>>> ----------------------------------------
>>>>
>>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>>> percpu_ref_kill() are called by blk_freeze_queue_start()...
>>>>
>>>> Now, it is up to you about how to fix this race problem.
>>>
>>> Ahh, nicely spotted! One idea would be the one below. For this case,
>>> we're recursing, so we can either do a non-block queue enter, or we
>>> can just do a live enter.
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 85909b431eb0..7de12e0dcc3d 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>>
>>> if (bio->bi_opf & REQ_NOWAIT)
>>> flags = BLK_MQ_REQ_NOWAIT;
>>> - if (blk_queue_enter(q, flags) < 0) {
>>> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
>>> + blk_queue_enter_live(q);
>>> + else if (blk_queue_enter(q, flags) < 0) {
>>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>> bio_wouldblock_error(bio);
>>> else
>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>> index 782940c65d8a..e1063e8f7eda 100644
>>> --- a/block/blk-merge.c
>>> +++ b/block/blk-merge.c
>>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>>> /* there isn't chance to merge the splitted bio */
>>> split->bi_opf |= REQ_NOMERGE;
>>>
>>> + /*
>>> + * Since we're recursing into make_request here, ensure
>>> + * that we mark this bio as already having entered the queue.
>>> + * If not, and the queue is going away, we can get stuck
>>> + * forever on waiting for the queue reference to drop. But
>>> + * that will never happen, as we're already holding a
>>> + * reference to it.
>>> + */
>>> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>>> +
>>> bio_chain(split, *bio);
>>> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>>> generic_make_request(*bio);
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 17b18b91ebac..a6bc789b83e7 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>>> /* command specific flags for REQ_OP_WRITE_ZEROES: */
>>> __REQ_NOUNMAP, /* do not free blocks when zeroing */
>>>
>>> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
>>> +
>>> /* for driver use */
>>> __REQ_DRV,
>>>
>>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>>> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
>>> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
>>> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
>>> -
>>> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
>>> -
>>> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
>>> #define REQ_DRV (1ULL << __REQ_DRV)
>>>
>>> #define REQ_FAILFAST_MASK \
>>
>> bio clone inherits .bi_opf, so this way may not work for cloned bio.
>>
>> Given most of bio_split() is run in .make_request_fn(), holding the
>> queue ref may not be needed in case of bio splitting, so another
>> candidate fix might be to introduce one extra parameter of
>> 'need_queue_ref' to generic_make_request().
>
> I did consider that, but that's pretty silly for a corner case and you'd
> have to change a crap ton of calls to it. I think it'd be cleaner to
> clear the bit when we need to, potentially even adding a debug check to
> blk_queue_enter_live() that complains if the ref was not already
> elevated. Though that would be expensive, compared to the percpu inc
> now. Not saying the bit is necessarily the best way forward, but I do
> like it a LOT more than adding an argument to generic_make_request.
>
> It should have been a bio flag to begin with, that better captures the
> scope and also avoids clones inheriting anything they shouldn't. It's
> also not applicable to a request.

A little too much inheritance from the previous patch, better version
below.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..71cb228d7374 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)

if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
+ if (bio_flagged(bio, BIO_QUEUE_ENTERED))
+ blk_queue_enter_live(q);
+ else if (blk_queue_enter(q, flags) < 0) {
if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
bio_wouldblock_error(bio);
else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..481dc02668f9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
/* there isn't chance to merge the splitted bio */
split->bi_opf |= REQ_NOMERGE;

+ /*
+ * Since we're recursing into make_request here, ensure
+ * that we mark this bio as already having entered the queue.
+ * If not, and the queue is going away, we can get stuck
+ * forever on waiting for the queue reference to drop. But
+ * that will never happen, as we're already holding a
+ * reference to it.
+ */
+ bio_set_flag(*bio, BIO_QUEUE_ENTERED);
+
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..1602bf4ab4cd 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
* throttling rules. Don't do it again. */
#define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
* of this bio. */
+#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */
+
/* See BVEC_POOL_OFFSET below before adding new flags */

/*

--
Jens Axboe

2018-06-02 02:40:33

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Fri, Jun 01, 2018 at 06:49:10PM -0600, Jens Axboe wrote:
> On 6/1/18 5:49 PM, Ming Lei wrote:
> > On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
> >> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> >>> Tetsuo Handa wrote:
> >>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >>>> not a race condition while folding percpu counter values into atomic counter
> >>>> value. That is, for some reason, someone who is responsible for calling
> >>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >>>> unable to call percpu_ref_put().
> >>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >>>
> >>> I found the someone. It was already there in the backtrace...
> >>>
> >>> ----------------------------------------
> >>> [ 62.065852] a.out D 0 4414 4337 0x00000000
> >>> [ 62.067677] Call Trace:
> >>> [ 62.068545] __schedule+0x40b/0x860
> >>> [ 62.069726] schedule+0x31/0x80
> >>> [ 62.070796] schedule_timeout+0x1c1/0x3c0
> >>> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> >>> [ 62.073670] blk_queue_enter+0x218/0x520
> >>> [ 62.074985] ? remove_wait_queue+0x70/0x70
> >>> [ 62.076361] generic_make_request+0x3d/0x540
> >>> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> >>> [ 62.079147] ? bio_clone_fast+0x2c/0x70
> >>> [ 62.080456] blk_queue_split+0x29b/0x560
> >>> [ 62.081772] ? blk_queue_split+0x29b/0x560
> >>> [ 62.083162] blk_mq_make_request+0x7c/0x430
> >>> [ 62.084562] generic_make_request+0x276/0x540
> >>> [ 62.086034] submit_bio+0x6e/0x140
> >>> [ 62.087185] ? submit_bio+0x6e/0x140
> >>> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> >>> [ 62.089681] do_mpage_readpage+0x328/0x730
> >>> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> >>> [ 62.092726] mpage_readpages+0x120/0x190
> >>> [ 62.094034] ? check_disk_change+0x70/0x70
> >>> [ 62.095454] ? check_disk_change+0x70/0x70
> >>> [ 62.096849] ? alloc_pages_current+0x65/0xd0
> >>> [ 62.098277] blkdev_readpages+0x18/0x20
> >>> [ 62.099568] __do_page_cache_readahead+0x298/0x360
> >>> [ 62.101157] ondemand_readahead+0x1f6/0x490
> >>> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> >>> [ 62.103995] page_cache_sync_readahead+0x29/0x40
> >>> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> >>> [ 62.107067] ? futex_wait+0x221/0x240
> >>> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> >>> [ 62.109654] blkdev_read_iter+0x30/0x40
> >>> [ 62.110954] generic_file_splice_read+0xc5/0x140
> >>> [ 62.112538] do_splice_to+0x74/0x90
> >>> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> >>> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> >>> [ 62.116773] do_splice_direct+0x8a/0xb0
> >>> [ 62.118056] do_sendfile+0x1aa/0x390
> >>> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> >>> [ 62.120666] do_syscall_64+0x6e/0x210
> >>> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>> ----------------------------------------
> >>>
> >>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> >>> assumption that blk_queue_enter() from recursively called generic_make_request()
> >>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
> >>>
> >>> ----------------------------------------
> >>> generic_make_request(struct bio *bio) {
> >>> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> >>> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> >>> bio_wouldblock_error(bio);
> >>> else
> >>> bio_io_error(bio);
> >>> return ret;
> >>> }
> >>> (...snipped...)
> >>> ret = q->make_request_fn(q, bio);
> >>> (...snipped...)
> >>> if (q)
> >>> blk_queue_exit(q);
> >>> }
> >>> ----------------------------------------
> >>>
> >>> where q->make_request_fn == blk_mq_make_request which does
> >>>
> >>> ----------------------------------------
> >>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> >>> blk_queue_split(q, &bio);
> >>> }
> >>>
> >>> blk_queue_split(struct request_queue *q, struct bio **bio) {
> >>> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> >>> }
> >>> ----------------------------------------
> >>>
> >>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> >>> percpu_ref_kill() are called by blk_freeze_queue_start()...
> >>>
> >>> Now, it is up to you about how to fix this race problem.
> >>
> >> Ahh, nicely spotted! One idea would be the one below. For this case,
> >> we're recursing, so we can either do a non-block queue enter, or we
> >> can just do a live enter.
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 85909b431eb0..7de12e0dcc3d 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
> >>
> >> if (bio->bi_opf & REQ_NOWAIT)
> >> flags = BLK_MQ_REQ_NOWAIT;
> >> - if (blk_queue_enter(q, flags) < 0) {
> >> + if (bio->bi_opf & REQ_QUEUE_ENTERED)
> >> + blk_queue_enter_live(q);
> >> + else if (blk_queue_enter(q, flags) < 0) {
> >> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> >> bio_wouldblock_error(bio);
> >> else
> >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> index 782940c65d8a..e1063e8f7eda 100644
> >> --- a/block/blk-merge.c
> >> +++ b/block/blk-merge.c
> >> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
> >> /* there isn't chance to merge the splitted bio */
> >> split->bi_opf |= REQ_NOMERGE;
> >>
> >> + /*
> >> + * Since we're recursing into make_request here, ensure
> >> + * that we mark this bio as already having entered the queue.
> >> + * If not, and the queue is going away, we can get stuck
> >> + * forever on waiting for the queue reference to drop. But
> >> + * that will never happen, as we're already holding a
> >> + * reference to it.
> >> + */
> >> + (*bio)->bi_opf |= REQ_QUEUE_ENTERED;
> >> +
> >> bio_chain(split, *bio);
> >> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> >> generic_make_request(*bio);
> >> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> >> index 17b18b91ebac..a6bc789b83e7 100644
> >> --- a/include/linux/blk_types.h
> >> +++ b/include/linux/blk_types.h
> >> @@ -282,6 +282,8 @@ enum req_flag_bits {
> >> /* command specific flags for REQ_OP_WRITE_ZEROES: */
> >> __REQ_NOUNMAP, /* do not free blocks when zeroing */
> >>
> >> + __REQ_QUEUE_ENTERED, /* queue already entered for this request */
> >> +
> >> /* for driver use */
> >> __REQ_DRV,
> >>
> >> @@ -302,9 +304,8 @@ enum req_flag_bits {
> >> #define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
> >> #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
> >> #define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
> >> -
> >> #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
> >> -
> >> +#define REQ_QUEUE_ENTERED (1ULL << __REQ_QUEUE_ENTERED)
> >> #define REQ_DRV (1ULL << __REQ_DRV)
> >>
> >> #define REQ_FAILFAST_MASK \
> >
> > bio clone inherits .bi_opf, so this way may not work for cloned bio.
> >
> > Given most of bio_split() is run in .make_request_fn(), holding the
> > queue ref may not be needed in case of bio splitting, so another
> > candidate fix might be to introduce one extra parameter of
> > 'need_queue_ref' to generic_make_request().
>
> I did consider that, but that's pretty silly for a corner case and you'd
> have to change a crap ton of calls to it. I think it'd be cleaner to

Seems not necessary to change any call to it, since all blk_queue_split()
are called from q->make_request_fn() now.

> clear the bit when we need to, potentially even adding a debug check to
> blk_queue_enter_live() that complains if the ref was not already
> elevated. Though that would be expensive, compared to the percpu inc
> now. Not saying the bit is necessarily the best way forward, but I do
> like it a LOT more than adding an argument to generic_make_request.

We may introduce generic_make_request_with_ref() for blk_queue_split(),
what do you think of the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index cee03cad99f2..e10ee3f9f338 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2363,7 +2363,7 @@ generic_make_request_checks(struct bio *bio)
* a lower device by calling into generic_make_request recursively, which
* means the bio should NOT be touched after the call to ->make_request_fn.
*/
-blk_qc_t generic_make_request(struct bio *bio)
+blk_qc_t __generic_make_request(struct bio *bio, bool got_ref)
{
/*
* bio_list_on_stack[0] contains bios submitted by the current
@@ -2379,12 +2379,14 @@ blk_qc_t generic_make_request(struct bio *bio)

if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
- if (blk_queue_enter(q, flags) < 0) {
- if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
- bio_wouldblock_error(bio);
- else
- bio_io_error(bio);
- return ret;
+ if (!got_ref) {
+ if (blk_queue_enter(q, flags) < 0) {
+ if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+ bio_wouldblock_error(bio);
+ else
+ bio_io_error(bio);
+ return ret;
+ }
}

if (!generic_make_request_checks(bio))
@@ -2426,8 +2428,9 @@ blk_qc_t generic_make_request(struct bio *bio)
bool enter_succeeded = true;

if (unlikely(q != bio->bi_disk->queue)) {
- if (q)
+ if (q && !got_ref)
blk_queue_exit(q);
+ got_ref = false;
q = bio->bi_disk->queue;
flags = 0;
if (bio->bi_opf & REQ_NOWAIT)
@@ -2476,7 +2479,7 @@ blk_qc_t generic_make_request(struct bio *bio)
blk_queue_exit(q);
return ret;
}
-EXPORT_SYMBOL(generic_make_request);
+EXPORT_SYMBOL(__generic_make_request);

/**
* direct_make_request - hand a buffer directly to its device driver for I/O
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5573d0fbec53..997616f1c85e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -180,6 +180,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return do_split ? new : NULL;
}

+/*
+ * This function has to be called from q->make_request_fn().
+ */
void blk_queue_split(struct request_queue *q, struct bio **bio)
{
struct bio *split, *res;
@@ -212,7 +215,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)

bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
- generic_make_request(*bio);
+ generic_make_request_with_ref(*bio);
*bio = split;
}
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4efd9af62e25..051018e0126e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -956,9 +956,18 @@ static inline void rq_flush_dcache_pages(struct request *rq)
}
#endif

+blk_qc_t generic_make_request(struct bio *bio)
+{
+ return __generic_make_request(bio, false);
+}
+
+blk_qc_t generic_make_request_with_ref(struct bio *bio)
+{
+ return __generic_make_request(bio, true);
+}
+
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
-extern blk_qc_t generic_make_request(struct bio *bio);
extern blk_qc_t direct_make_request(struct bio *bio);
extern void blk_rq_init(struct request_queue *q, struct request *rq);
extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
>
> It should have been a bio flag to begin with, that better captures the
> scope and also avoids clones inheriting anything they shouldn't. It's
> also not applicable to a request.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 85909b431eb0..8f1063f50863 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>
> if (bio->bi_opf & REQ_NOWAIT)
> flags = BLK_MQ_REQ_NOWAIT;
> - if (blk_queue_enter(q, flags) < 0) {
> + if (bio->bi_flags & BIO_QUEUE_ENTERED)
> + blk_queue_enter_live(q);
> + else if (blk_queue_enter(q, flags) < 0) {
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 782940c65d8a..9ee9474e579c 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
> /* there isn't chance to merge the splitted bio */
> split->bi_opf |= REQ_NOMERGE;
>
> + /*
> + * Since we're recursing into make_request here, ensure
> + * that we mark this bio as already having entered the queue.
> + * If not, and the queue is going away, we can get stuck
> + * forever on waiting for the queue reference to drop. But
> + * that will never happen, as we're already holding a
> + * reference to it.
> + */
> + (*bio)->bi_flags |= BIO_QUEUE_ENTERED;
> +
> bio_chain(split, *bio);
> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> generic_make_request(*bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 17b18b91ebac..4687d7c99076 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -186,6 +186,8 @@ struct bio {
> * throttling rules. Don't do it again. */
> #define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
> * of this bio. */
> +#define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */
> +
> /* See BVEC_POOL_OFFSET below before adding new flags */

In DM or MD, there is the following use case:

1) bio comes from fs, then one clone is made from this bio

2) the cloned bio is submitted via generic_make_request() from
another context, such as wq.

3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
the queue usage counter isn't held.

Thanks,
Ming

2018-06-02 04:35:01

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On 6/1/18 8:36 PM, Ming Lei wrote:
>> clear the bit when we need to, potentially even adding a debug check to
>> blk_queue_enter_live() that complains if the ref was not already
>> elevated. Though that would be expensive, compared to the percpu inc
>> now. Not saying the bit is necessarily the best way forward, but I do
>> like it a LOT more than adding an argument to generic_make_request.
>
> We may introduce generic_make_request_with_ref() for blk_queue_split(),
> what do you think of the following patch?

Functionally it'll work as well.

> In DM or MD, there is the following use case:
>
> 1) bio comes from fs, then one clone is made from this bio
>
> 2) the cloned bio is submitted via generic_make_request() from
> another context, such as wq.
>
> 3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
> the queue usage counter isn't held.

We don't copy flags on a clone.

--
Jens Axboe

2018-06-02 04:55:19

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> On 6/1/18 8:36 PM, Ming Lei wrote:
> >> clear the bit when we need to, potentially even adding a debug check to
> >> blk_queue_enter_live() that complains if the ref was not already
> >> elevated. Though that would be expensive, compared to the percpu inc
> >> now. Not saying the bit is necessarily the best way forward, but I do
> >> like it a LOT more than adding an argument to generic_make_request.
> >
> > We may introduce generic_make_request_with_ref() for blk_queue_split(),
> > what do you think of the following patch?
>
> Functionally it'll work as well.

If you are fine with this one, I'm glad to prepare for a normal one,
so that we can save one precious bio flag, and the total number is
just 13.

>
> > In DM or MD, there is the following use case:
> >
> > 1) bio comes from fs, then one clone is made from this bio
> >
> > 2) the cloned bio is submitted via generic_make_request() from
> > another context, such as wq.
> >
> > 3) then the flag of BIO_QUEUE_ENTERED is inherited and causes
> > the queue usage counter isn't held.
>
> We don't copy flags on a clone.

Yeah, you are right, seems I ignored that.

Thanks,
Ming

2018-06-02 08:08:49

by Martin Steigerwald

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

Ming Lei - 02.06.18, 06:54:
> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
> > On 6/1/18 8:36 PM, Ming Lei wrote:
> > >> clear the bit when we need to, potentially even adding a debug
> > >> check to blk_queue_enter_live() that complains if the ref was
> > >> not already elevated. Though that would be expensive, compared
> > >> to the percpu inc now. Not saying the bit is necessarily the
> > >> best way forward, but I do like it a LOT more than adding an
> > >> argument to generic_make_request.> >
> > > We may introduce generic_make_request_with_ref() for
> > > blk_queue_split(), what do you think of the following patch?
> >
> > Functionally it'll work as well.
>
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.
[…]

I am completely lost about what you are talking about.

That is fine with me, but in case there is anything to test… let me
know. I do not promise quick responses tough, as I have been quite a bit
overwhelmed with stuff lately.

Currently I run 4.17-rc7 without any blk-mq related patches and it
appears to be free of hangs or stalls related to blk-mq.

Thanks,
--
Martin

2018-06-02 13:49:16

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

> On Jun 1, 2018, at 10:54 PM, Ming Lei <[email protected]> wrote:
>
>> On Fri, Jun 01, 2018 at 10:31:13PM -0600, Jens Axboe wrote:
>> On 6/1/18 8:36 PM, Ming Lei wrote:
>>>> clear the bit when we need to, potentially even adding a debug check to
>>>> blk_queue_enter_live() that complains if the ref was not already
>>>> elevated. Though that would be expensive, compared to the percpu inc
>>>> now. Not saying the bit is necessarily the best way forward, but I do
>>>> like it a LOT more than adding an argument to generic_make_request.
>>>
>>> We may introduce generic_make_request_with_ref() for blk_queue_split(),
>>> what do you think of the following patch?
>>
>> Functionally it'll work as well.
>
> If you are fine with this one, I'm glad to prepare for a normal one,
> so that we can save one precious bio flag, and the total number is
> just 13.

We’ve never had much issue with bio specific flags, I prefer that approach to a wrapper around generic_make_request().

2018-06-04 11:48:53

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
<[email protected]> wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...

Nice!

Do I understand it correctly that this bug is probably the root cause
of a whole lot of syzbot "task hung" reports? E.g. this one too?
https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
I guess we will need to sweep close everything related to
filesystems/block devices when this is committed?

> ----------------------------------------
> [ 62.065852] a.out D 0 4414 4337 0x00000000
> [ 62.067677] Call Trace:
> [ 62.068545] __schedule+0x40b/0x860
> [ 62.069726] schedule+0x31/0x80
> [ 62.070796] schedule_timeout+0x1c1/0x3c0
> [ 62.072159] ? __next_timer_interrupt+0xd0/0xd0
> [ 62.073670] blk_queue_enter+0x218/0x520
> [ 62.074985] ? remove_wait_queue+0x70/0x70
> [ 62.076361] generic_make_request+0x3d/0x540
> [ 62.077785] ? __bio_clone_fast+0x6b/0x80
> [ 62.079147] ? bio_clone_fast+0x2c/0x70
> [ 62.080456] blk_queue_split+0x29b/0x560
> [ 62.081772] ? blk_queue_split+0x29b/0x560
> [ 62.083162] blk_mq_make_request+0x7c/0x430
> [ 62.084562] generic_make_request+0x276/0x540
> [ 62.086034] submit_bio+0x6e/0x140
> [ 62.087185] ? submit_bio+0x6e/0x140
> [ 62.088384] ? guard_bio_eod+0x9d/0x1d0
> [ 62.089681] do_mpage_readpage+0x328/0x730
> [ 62.091045] ? __add_to_page_cache_locked+0x12e/0x1a0
> [ 62.092726] mpage_readpages+0x120/0x190
> [ 62.094034] ? check_disk_change+0x70/0x70
> [ 62.095454] ? check_disk_change+0x70/0x70
> [ 62.096849] ? alloc_pages_current+0x65/0xd0
> [ 62.098277] blkdev_readpages+0x18/0x20
> [ 62.099568] __do_page_cache_readahead+0x298/0x360
> [ 62.101157] ondemand_readahead+0x1f6/0x490
> [ 62.102546] ? ondemand_readahead+0x1f6/0x490
> [ 62.103995] page_cache_sync_readahead+0x29/0x40
> [ 62.105539] generic_file_read_iter+0x7d0/0x9d0
> [ 62.107067] ? futex_wait+0x221/0x240
> [ 62.108303] ? trace_hardirqs_on+0xd/0x10
> [ 62.109654] blkdev_read_iter+0x30/0x40
> [ 62.110954] generic_file_splice_read+0xc5/0x140
> [ 62.112538] do_splice_to+0x74/0x90
> [ 62.113726] splice_direct_to_actor+0xa4/0x1f0
> [ 62.115209] ? generic_pipe_buf_nosteal+0x10/0x10
> [ 62.116773] do_splice_direct+0x8a/0xb0
> [ 62.118056] do_sendfile+0x1aa/0x390
> [ 62.119255] __x64_sys_sendfile64+0x4e/0xc0
> [ 62.120666] do_syscall_64+0x6e/0x210
> [ 62.121909] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
> if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
> if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
> bio_wouldblock_error(bio);
> else
> bio_io_error(bio);
> return ret;
> }
> (...snipped...)
> ret = q->make_request_fn(q, bio);
> (...snipped...)
> if (q)
> blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
> blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
> generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you about how to fix this race problem.
>

2018-06-04 13:15:29

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On 2018/06/04 20:46, Dmitry Vyukov wrote:
> On Fri, Jun 1, 2018 at 12:10 PM, Tetsuo Handa
> <[email protected]> wrote:
>> Tetsuo Handa wrote:
>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>> not a race condition while folding percpu counter values into atomic counter
>>> value. That is, for some reason, someone who is responsible for calling
>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>> unable to call percpu_ref_put().
>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>
>> I found the someone. It was already there in the backtrace...
>
> Nice!
>
> Do I understand it correctly that this bug is probably the root cause
> of a whole lot of syzbot "task hung" reports? E.g. this one too?
> https://syzkaller.appspot.com/bug?id=cdc4add60bb95a4da3fec27c5fe6d75196b7f976
> I guess we will need to sweep close everything related to
> filesystems/block devices when this is committed?
>

I can't tell it. We still have lockdep warning for loop module. Also, I suspect
that hung tasks waiting for page bit might be related to binder module.

Since we are about to merge "kernel/hung_task.c: show all hung tasks before panic"
patch, we might be able to get more hints after 4.18-rc1.

2018-06-05 00:29:22

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

Jens Axboe wrote:
> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> >> not a race condition while folding percpu counter values into atomic counter
> >> value. That is, for some reason, someone who is responsible for calling
> >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> >> unable to call percpu_ref_put().
> >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> >
> > I found the someone. It was already there in the backtrace...
> >
>
> Ahh, nicely spotted! One idea would be the one below. For this case,
> we're recursing, so we can either do a non-block queue enter, or we
> can just do a live enter.
>

While "block: don't use blocking queue entered for recursive bio submits" was
already applied, syzbot is still reporting a hung task with same signature but
different trace.

https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
----------------------------------------
[ 492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
[ 492.519604] Not tainted 4.17.0+ #83
[ 492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 492.531787] syz-executor1 D23384 20263 4574 0x00000004
[ 492.537443] Call Trace:
[ 492.540041] __schedule+0x801/0x1e30
[ 492.580958] schedule+0xef/0x430
[ 492.610154] blk_queue_enter+0x8da/0xdf0
[ 492.716327] generic_make_request+0x651/0x1790
[ 492.765680] submit_bio+0xba/0x460
[ 492.793198] submit_bio_wait+0x134/0x1e0
[ 492.801891] blkdev_issue_flush+0x204/0x300
[ 492.806236] blkdev_fsync+0x93/0xd0
[ 492.813620] vfs_fsync_range+0x140/0x220
[ 492.817702] vfs_fsync+0x29/0x30
[ 492.821081] __loop_update_dio+0x4de/0x6a0
[ 492.825341] lo_ioctl+0xd28/0x2190
[ 492.833442] blkdev_ioctl+0x9b6/0x2020
[ 492.872146] block_ioctl+0xee/0x130
[ 492.880139] do_vfs_ioctl+0x1cf/0x16a0
[ 492.927550] ksys_ioctl+0xa9/0xd0
[ 492.931036] __x64_sys_ioctl+0x73/0xb0
[ 492.934952] do_syscall_64+0x1b1/0x800
[ 492.963624] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 493.212768] 1 lock held by syz-executor1/20263:
[ 493.217448] #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
----------------------------------------

Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
by caling atomic_inc_return() and percpu_ref_kill() ?

2018-06-05 00:42:45

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
> Jens Axboe wrote:
> > On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
> > >> not a race condition while folding percpu counter values into atomic counter
> > >> value. That is, for some reason, someone who is responsible for calling
> > >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
> > >> unable to call percpu_ref_put().
> > >> But I don't know how to find someone who is failing to call percpu_ref_put()...
> > >
> > > I found the someone. It was already there in the backtrace...
> > >
> >
> > Ahh, nicely spotted! One idea would be the one below. For this case,
> > we're recursing, so we can either do a non-block queue enter, or we
> > can just do a live enter.
> >
>
> While "block: don't use blocking queue entered for recursive bio submits" was
> already applied, syzbot is still reporting a hung task with same signature but
> different trace.
>
> https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
> ----------------------------------------
> [ 492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
> [ 492.519604] Not tainted 4.17.0+ #83
> [ 492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 492.531787] syz-executor1 D23384 20263 4574 0x00000004
> [ 492.537443] Call Trace:
> [ 492.540041] __schedule+0x801/0x1e30
> [ 492.580958] schedule+0xef/0x430
> [ 492.610154] blk_queue_enter+0x8da/0xdf0
> [ 492.716327] generic_make_request+0x651/0x1790
> [ 492.765680] submit_bio+0xba/0x460
> [ 492.793198] submit_bio_wait+0x134/0x1e0
> [ 492.801891] blkdev_issue_flush+0x204/0x300
> [ 492.806236] blkdev_fsync+0x93/0xd0
> [ 492.813620] vfs_fsync_range+0x140/0x220
> [ 492.817702] vfs_fsync+0x29/0x30
> [ 492.821081] __loop_update_dio+0x4de/0x6a0
> [ 492.825341] lo_ioctl+0xd28/0x2190
> [ 492.833442] blkdev_ioctl+0x9b6/0x2020
> [ 492.872146] block_ioctl+0xee/0x130
> [ 492.880139] do_vfs_ioctl+0x1cf/0x16a0
> [ 492.927550] ksys_ioctl+0xa9/0xd0
> [ 492.931036] __x64_sys_ioctl+0x73/0xb0
> [ 492.934952] do_syscall_64+0x1b1/0x800
> [ 492.963624] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 493.212768] 1 lock held by syz-executor1/20263:
> [ 493.217448] #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
> ----------------------------------------
>
> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
> blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
> blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
> by caling atomic_inc_return() and percpu_ref_kill() ?
>

The vfs_fsync() isn't necessary in loop_update_dio() since both
generic_file_write_iter() and generic_file_read_iter() can handle
buffered io vs dio well.

I will send one patch to remove the vfs_sync() later.

Thanks,
Ming

2018-06-07 03:30:14

[permalink] [raw]

Subject: Re: INFO: task hung in blk_queue_enter

On Tue, Jun 5, 2018 at 8:41 AM, Ming Lei <[email protected]> wrote:
> On Tue, Jun 05, 2018 at 09:27:41AM +0900, Tetsuo Handa wrote:
>> Jens Axboe wrote:
>> > On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>> > > Tetsuo Handa wrote:
>> > >> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> > >> not a race condition while folding percpu counter values into atomic counter
>> > >> value. That is, for some reason, someone who is responsible for calling
>> > >> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> > >> unable to call percpu_ref_put().
>> > >> But I don't know how to find someone who is failing to call percpu_ref_put()...
>> > >
>> > > I found the someone. It was already there in the backtrace...
>> > >
>> >
>> > Ahh, nicely spotted! One idea would be the one below. For this case,
>> > we're recursing, so we can either do a non-block queue enter, or we
>> > can just do a live enter.
>> >
>>
>> While "block: don't use blocking queue entered for recursive bio submits" was
>> already applied, syzbot is still reporting a hung task with same signature but
>> different trace.
>>
>> https://syzkaller.appspot.com/text?tag=CrashLog&x=1432cedf800000
>> ----------------------------------------
>> [ 492.512243] INFO: task syz-executor1:20263 blocked for more than 120 seconds.
>> [ 492.519604] Not tainted 4.17.0+ #83
>> [ 492.523793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 492.531787] syz-executor1 D23384 20263 4574 0x00000004
>> [ 492.537443] Call Trace:
>> [ 492.540041] __schedule+0x801/0x1e30
>> [ 492.580958] schedule+0xef/0x430
>> [ 492.610154] blk_queue_enter+0x8da/0xdf0
>> [ 492.716327] generic_make_request+0x651/0x1790
>> [ 492.765680] submit_bio+0xba/0x460
>> [ 492.793198] submit_bio_wait+0x134/0x1e0
>> [ 492.801891] blkdev_issue_flush+0x204/0x300
>> [ 492.806236] blkdev_fsync+0x93/0xd0
>> [ 492.813620] vfs_fsync_range+0x140/0x220
>> [ 492.817702] vfs_fsync+0x29/0x30
>> [ 492.821081] __loop_update_dio+0x4de/0x6a0
>> [ 492.825341] lo_ioctl+0xd28/0x2190
>> [ 492.833442] blkdev_ioctl+0x9b6/0x2020
>> [ 492.872146] block_ioctl+0xee/0x130
>> [ 492.880139] do_vfs_ioctl+0x1cf/0x16a0
>> [ 492.927550] ksys_ioctl+0xa9/0xd0
>> [ 492.931036] __x64_sys_ioctl+0x73/0xb0
>> [ 492.934952] do_syscall_64+0x1b1/0x800
>> [ 492.963624] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [ 493.212768] 1 lock held by syz-executor1/20263:
>> [ 493.217448] #0: 00000000956bf5a3 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8d/0x2190
>> ----------------------------------------
>>
>> Is it OK to call [__]loop_update_dio() between blk_mq_freeze_queue() and
>> blk_mq_unfreeze_queue(), for vfs_fsync() from __loop_update_dio() is calling
>> blk_queue_enter() after blk_mq_freeze_queue() started blocking blk_queue_enter()
>> by caling atomic_inc_return() and percpu_ref_kill() ?
>>
>
> The vfs_fsync() isn't necessary in loop_update_dio() since both
> generic_file_write_iter() and generic_file_read_iter() can handle
> buffered io vs dio well.
>
> I will send one patch to remove the vfs_sync() later.

Hi Tetsuo,

The issue might be fixed by removing this vfs_sync(), but I'd like to
understand the idea behind since vfs_sync() shouldn't have caused
any IO to this loop queue.

I also tried to do the test via the following c syzbot, but can't reproduce
it yet after running it for several hours.

https://syzkaller.appspot.com/x/repro.c?id=4727023951937536

Could you share us how you reproduce it?

Thanks,
Ming Lei

2018-06-07 16:26:33