2020-01-30 19:37:00

by Salman Qazi

Subject: Hung tasks with multiple partitions

Hi,

I am writing on behalf of the Chromium OS team at Google. We found
the root cause of some hung tasks we were experiencing and we would
like to get your opinion on potential solutions. The bugs were
encountered on the 4.19 kernel; however, my reading of the code
suggests that the relevant portions have not changed since then.

We have an eMMC flash drive that has been carved into partitions on an
8-CPU system. The repro case we came up with is to run an 8-threaded,
write-mostly fio workload against one partition, let the system use the
other partition as its read-write filesystem (i.e. just background
activity), and then run the following loop:

while true; do sync; sleep 1 ; done

The hung task stack traces look like the following:

[ 128.994891] jbd2/dm-1-8 D 0 367 2 0x00000028
last_sleep: 96340206998. last_runnable: 96340140151
[ 128.994898] Call trace:
[ 128.994903] __switch_to+0x120/0x13c
[ 128.994909] __schedule+0x60c/0x7dc
[ 128.994914] schedule+0x74/0x94
[ 128.994919] io_schedule+0x1c/0x40
[ 128.994925] bit_wait_io+0x18/0x58
[ 128.994930] __wait_on_bit+0x78/0xdc
[ 128.994935] out_of_line_wait_on_bit+0xa0/0xcc
[ 128.994943] __wait_on_buffer+0x48/0x54
[ 128.994948] jbd2_journal_commit_transaction+0x1198/0x1a4c
[ 128.994956] kjournald2+0x19c/0x268
[ 128.994961] kthread+0x120/0x130
[ 128.994967] ret_from_fork+0x10/0x18

I added some more information to the trace points to understand what
was going on. It turns out that blk_mq_sched_dispatch_requests had
checked hctx->dispatch, found it empty, and then began consuming
requests from the I/O scheduler (in blk_mq_do_dispatch_sched).
Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
doesn't stop for 30 seconds, and there is no mechanism in
blk_mq_do_dispatch_sched to terminate early or to reconsider the
contents of hctx->dispatch. In the meantime, a flush command arrives
in hctx->dispatch (via insertion in blk_mq_sched_bypass_insert) and
languishes there. Eventually the thread waiting on the flush triggers
the hung task watchdog.
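
To make the window concrete, here is roughly how the relevant path is
structured (a heavily simplified paraphrase of block/blk-mq-sched.c as
of 4.19, based on my reading; budget handling and several branches are
omitted):

/* Simplified paraphrase of 4.19 block/blk-mq-sched.c; not verbatim. */
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
{
        LIST_HEAD(rq_list);

        /* hctx->dispatch is examined once, before the elevator drain. */
        if (!list_empty_careful(&hctx->dispatch)) {
                spin_lock(&hctx->lock);
                if (!list_empty(&hctx->dispatch))
                        list_splice_init(&hctx->dispatch, &rq_list);
                spin_unlock(&hctx->lock);
        }

        /*
         * Dispatching of whatever was spliced above is omitted here.  In
         * our traces rq_list was empty at this point, so control went
         * straight to the elevator drain below, and hctx->dispatch was
         * never looked at again until that drain finished.
         */
        blk_mq_do_dispatch_sched(hctx);
}

So a flush that blk_mq_sched_bypass_insert adds to hctx->dispatch during
the drain has to wait for BFQ to run dry, which under this workload takes
longer than the hung task timeout.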

The solution that comes to mind is to periodically check
hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
non-empty. However, not being an expert in this subsystem, I am not
sure if there would be other consequences.
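
Concretely, something along these lines is what I have in mind (an
untested sketch against the 4.19 loop, again with budget handling and
the has_work check omitted, only to illustrate the idea):

static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);

        do {
                struct request *rq;

                /*
                 * Proposed check: if something (e.g. a flush inserted via
                 * blk_mq_sched_bypass_insert) has landed on hctx->dispatch
                 * while we were draining the elevator, stop pulling from
                 * the elevator so the dispatch list can be serviced.
                 */
                if (!list_empty_careful(&hctx->dispatch))
                        break;

                rq = e->type->ops.mq.dispatch_request(hctx);
                if (!rq)
                        break;
                list_add(&rq->queuelist, &rq_list);
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}

(Whether the queue then needs to be re-run explicitly, so that requests
still sitting in the elevator are not stranded, is exactly the kind of
consequence I am unsure about.)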

Any help is appreciated,

Salman


2020-01-30 20:50:54

by Bart Van Assche

Subject: Re: Hung tasks with multiple partitions

On 1/30/20 11:34 AM, Salman Qazi wrote:
> I am writing on behalf of the Chromium OS team at Google. We found
> the root cause of some hung tasks we were experiencing and we would
> like to get your opinion on potential solutions. The bugs were
> encountered on the 4.19 kernel; however, my reading of the code
> suggests that the relevant portions have not changed since then.
>
> We have an eMMC flash drive that has been carved into partitions on an
> 8-CPU system. The repro case we came up with is to run an 8-threaded,
> write-mostly fio workload against one partition, let the system use the
> other partition as its read-write filesystem (i.e. just background
> activity), and then run the following loop:
>
> while true; do sync; sleep 1 ; done
>
> The hung task stack traces look like the following:
>
> [ 128.994891] jbd2/dm-1-8 D 0 367 2 0x00000028
> last_sleep: 96340206998. last_runnable: 96340140151
> [ 128.994898] Call trace:
> [ 128.994903] __switch_to+0x120/0x13c
> [ 128.994909] __schedule+0x60c/0x7dc
> [ 128.994914] schedule+0x74/0x94
> [ 128.994919] io_schedule+0x1c/0x40
> [ 128.994925] bit_wait_io+0x18/0x58
> [ 128.994930] __wait_on_bit+0x78/0xdc
> [ 128.994935] out_of_line_wait_on_bit+0xa0/0xcc
> [ 128.994943] __wait_on_buffer+0x48/0x54
> [ 128.994948] jbd2_journal_commit_transaction+0x1198/0x1a4c
> [ 128.994956] kjournald2+0x19c/0x268
> [ 128.994961] kthread+0x120/0x130
> [ 128.994967] ret_from_fork+0x10/0x18
>
> I added some more information to the trace points to understand what
> was going on. It turns out that blk_mq_sched_dispatch_requests had
> checked hctx->dispatch, found it empty, and then began consuming
> requests from the I/O scheduler (in blk_mq_do_dispatch_sched).
> Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
> doesn't stop for 30 seconds, and there is no mechanism in
> blk_mq_do_dispatch_sched to terminate early or to reconsider the
> contents of hctx->dispatch. In the meantime, a flush command arrives
> in hctx->dispatch (via insertion in blk_mq_sched_bypass_insert) and
> languishes there. Eventually the thread waiting on the flush triggers
> the hung task watchdog.
>
> The solution that comes to mind is to periodically check
> hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
> non-empty. However, not being an expert in this subsystem, I am not
> sure if there would be other consequences.

The call stack shown in your e-mail usually means that an I/O request
got stuck. How about determining first whether this is caused by the BFQ
scheduler or by the eMMC driver? I think the developers of these
software components need that information anyway before they can step in.

The attached script may help to identify which requests got stuck.

Bart.


Attachments:
list-pending-block-requests (1.47 kB)

2020-01-30 21:04:30

by Salman Qazi

Subject: Re: Hung tasks with multiple partitions

On Thu, Jan 30, 2020 at 12:49 PM Bart Van Assche <[email protected]> wrote:
>
> On 1/30/20 11:34 AM, Salman Qazi wrote:
> > I am writing on behalf of the Chromium OS team at Google. We found
> > the root cause of some hung tasks we were experiencing and we would
> > like to get your opinion on potential solutions. The bugs were
> > encountered on the 4.19 kernel; however, my reading of the code
> > suggests that the relevant portions have not changed since then.
> >
> > We have an eMMC flash drive that has been carved into partitions on an
> > 8-CPU system. The repro case we came up with is to run an 8-threaded,
> > write-mostly fio workload against one partition, let the system use the
> > other partition as its read-write filesystem (i.e. just background
> > activity), and then run the following loop:
> >
> > while true; do sync; sleep 1 ; done
> >
> > The hung task stack traces look like the following:
> >
> > [ 128.994891] jbd2/dm-1-8 D 0 367 2 0x00000028
> > last_sleep: 96340206998. last_runnable: 96340140151
> > [ 128.994898] Call trace:
> > [ 128.994903] __switch_to+0x120/0x13c
> > [ 128.994909] __schedule+0x60c/0x7dc
> > [ 128.994914] schedule+0x74/0x94
> > [ 128.994919] io_schedule+0x1c/0x40
> > [ 128.994925] bit_wait_io+0x18/0x58
> > [ 128.994930] __wait_on_bit+0x78/0xdc
> > [ 128.994935] out_of_line_wait_on_bit+0xa0/0xcc
> > [ 128.994943] __wait_on_buffer+0x48/0x54
> > [ 128.994948] jbd2_journal_commit_transaction+0x1198/0x1a4c
> > [ 128.994956] kjournald2+0x19c/0x268
> > [ 128.994961] kthread+0x120/0x130
> > [ 128.994967] ret_from_fork+0x10/0x18
> >
> > I added some more information to the trace points to understand what
> > was going on. It turns out that blk_mq_sched_dispatch_requests had
> > checked hctx->dispatch, found it empty, and then began consuming
> > requests from the I/O scheduler (in blk_mq_do_dispatch_sched).
> > Unfortunately, the deluge from the I/O scheduler (BFQ in our case)
> > doesn't stop for 30 seconds, and there is no mechanism in
> > blk_mq_do_dispatch_sched to terminate early or to reconsider the
> > contents of hctx->dispatch. In the meantime, a flush command arrives
> > in hctx->dispatch (via insertion in blk_mq_sched_bypass_insert) and
> > languishes there. Eventually the thread waiting on the flush triggers
> > the hung task watchdog.
> >
> > The solution that comes to mind is to periodically check
> > hctx->dispatch in blk_mq_do_dispatch_sched and exit early if it is
> > non-empty. However, not being an expert in this subsystem, I am not
> > sure if there would be other consequences.
>
> The call stack shown in your e-mail usually means that an I/O request
> got stuck. How about determining first whether this is caused by the BFQ
> scheduler or by the eMMC driver? I think the developers of these
> software components need that information anyway before they can step in.

As I mentioned in my previous email, I did use trace points to arrive
at my conclusion. I added trace points in blk_mq_sched_dispatch_requests
to detect both the start and the end of that function, as well as where
the dispatched commands were picked from. I also traced
blk_mq_sched_bypass_insert and saw a flush enter hctx->dispatch after
blk_mq_sched_dispatch_requests had started but before it finished.
After reaching my conclusion, I also tried a simple fix: introducing an
exit path in blk_mq_do_dispatch_sched that is taken when we detect that
hctx->dispatch has become non-empty. This made the problem go away.
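
For what it's worth, the instrumentation did not need to be elaborate.
A stand-in for what I did (shown here with trace_printk() rather than
the extended trace points in my actual debugging patch) is roughly:

void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
{
        trace_printk("dispatch start: hctx=%p dispatch_empty=%d\n",
                     hctx, list_empty_careful(&hctx->dispatch));

        /* ... existing function body unchanged ... */

        trace_printk("dispatch end: hctx=%p dispatch_empty=%d\n",
                     hctx, list_empty_careful(&hctx->dispatch));
}

With a matching line in blk_mq_sched_bypass_insert where the flush is
added to hctx->dispatch, the trace shows the flush arriving after the
"start" event and tens of seconds passing before the corresponding
"end" event.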

>
> The attached script may help to identify which requests got stuck.
>
> Bart.