LinuxLists.cc - [PATCH V2] writeback: fix hung

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Wanpeng Li <[email protected]> writes:

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index f2d0109..df879ee 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
>
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
> bdi_queue_work(sb->s_bdi, &work);
> - wait_for_completion(&done);
> + if (sysctl_hung_task_timeout_secs)
> + while (!wait_for_completion_timeout(&done, HZ/2))
> + ;
> + else
> + wait_for_completion(&done);
> }
> EXPORT_SYMBOL(writeback_inodes_sb_nr);

Is it really expected that writeback_inodes_sb_nr will routinely queue
up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
only entity issuing I/O)? For devices that are really slow, it may make
more sense to tune the system so that you don't have too much writeback
I/O submitted at once. Dropping nr_requests for the given queue should
fix this situation, I would think.

This really feels like we're papering over the problem.

Cheers,
Jeff

2012-06-13 14:48:48

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Hi Jeff,

On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> Wanpeng Li <[email protected]> writes:
>
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index f2d0109..df879ee 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> >
> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > bdi_queue_work(sb->s_bdi, &work);
> > - wait_for_completion(&done);
> > + if (sysctl_hung_task_timeout_secs)
> > + while (!wait_for_completion_timeout(&done, HZ/2))
> > + ;
> > + else
> > + wait_for_completion(&done);
> > }
> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
>
> Is it really expected that writeback_inodes_sb_nr will routinely queue
> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> only entity issuing I/O)?

Yes, in the case of syncing the whole superblock.
Basically sync() does its job in two steps:

for all sb:
writeback_inodes_sb_nr() # WB_SYNC_NONE
sync_inodes_sb() # WB_SYNC_ALL

> For devices that are really slow, it may make
> more sense to tune the system so that you don't have too much writeback
> I/O submitted at once. Dropping nr_requests for the given queue should
> fix this situation, I would think.

The worried case is about sync() waiting

(nr_dirty + nr_writeback) / write_bandwidth

time, where it is nr_dirty that could grow rather large.

For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
the sync() will have to wait for 100 seconds. If there are heavy
dirtiers running during the sync, it will typically take several
hundreds of seconds (which looks not that good, but still much better
than being livelocked in some old kernels)..

> This really feels like we're papering over the problem.

That's true. The majority users probably don't want to cache 100s
worth of data in memory. It may be worthwhile to add a new per-bdi
limit whose unit is number-of-seconds (of dirty data).

Thanks,
Fengguang

2012-06-13 14:55:33

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
> Hi Jeff,
>
> On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> > Wanpeng Li <[email protected]> writes:
> >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index f2d0109..df879ee 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> > >
> > > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > > bdi_queue_work(sb->s_bdi, &work);
> > > - wait_for_completion(&done);
> > > + if (sysctl_hung_task_timeout_secs)
> > > + while (!wait_for_completion_timeout(&done, HZ/2))
> > > + ;
> > > + else
> > > + wait_for_completion(&done);
> > > }
> > > EXPORT_SYMBOL(writeback_inodes_sb_nr);
> >
> > Is it really expected that writeback_inodes_sb_nr will routinely queue
> > up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> > only entity issuing I/O)?
>
> Yes, in the case of syncing the whole superblock.
> Basically sync() does its job in two steps:
>
> for all sb:
> writeback_inodes_sb_nr() # WB_SYNC_NONE
> sync_inodes_sb() # WB_SYNC_ALL
>
> > For devices that are really slow, it may make
> > more sense to tune the system so that you don't have too much writeback
> > I/O submitted at once. Dropping nr_requests for the given queue should
> > fix this situation, I would think.

Note that this wait_for_completion() waits for the flusher to move
all dirty pages into writeback pages, which in turn will block in
get_request_wait(). So writeback_inodes_sb_nr() could in theory finish
quickly if the nr_requests is a huge value. But then sync_inodes_sb()
will come and sleep (longer) on each of the writeback pages..

Thanks,
Fengguang

2012-06-13 15:34:44

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Fengguang Wu <[email protected]> writes:

> Hi Jeff,
>
> On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
>> Wanpeng Li <[email protected]> writes:
>>
>> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> > index f2d0109..df879ee 100644
>> > --- a/fs/fs-writeback.c
>> > +++ b/fs/fs-writeback.c
>> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
>> >
>> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
>> > bdi_queue_work(sb->s_bdi, &work);
>> > - wait_for_completion(&done);
>> > + if (sysctl_hung_task_timeout_secs)
>> > + while (!wait_for_completion_timeout(&done, HZ/2))
>> > + ;
>> > + else
>> > + wait_for_completion(&done);
>> > }
>> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
>>
>> Is it really expected that writeback_inodes_sb_nr will routinely queue
>> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
>> only entity issuing I/O)?
>
> Yes, in the case of syncing the whole superblock.
> Basically sync() does its job in two steps:
>
> for all sb:
> writeback_inodes_sb_nr() # WB_SYNC_NONE
> sync_inodes_sb() # WB_SYNC_ALL
>
>> For devices that are really slow, it may make
>> more sense to tune the system so that you don't have too much writeback
>> I/O submitted at once. Dropping nr_requests for the given queue should
>> fix this situation, I would think.
>
> The worried case is about sync() waiting
>
> (nr_dirty + nr_writeback) / write_bandwidth
>
> time, where it is nr_dirty that could grow rather large.
>
> For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
> the sync() will have to wait for 100 seconds. If there are heavy
> dirtiers running during the sync, it will typically take several
> hundreds of seconds (which looks not that good, but still much better
> than being livelocked in some old kernels)..
>
>> This really feels like we're papering over the problem.
>
> That's true. The majority users probably don't want to cache 100s
> worth of data in memory. It may be worthwhile to add a new per-bdi
> limit whose unit is number-of-seconds (of dirty data).

Hi, Fengguang,

Another option is to limit the amount of time we wait to the amount of
time we expect to have to wait. IOW, if we can estimate the amount of
time we think the I/O will take to complete, we can set the
hung_task_timeout[1] to *that* (with some fudge factor). Do you have a
mechanism in place today to make such an estimate? The benefit of this
solution is obvious: you still get notified when tasks are actually
hung, but you don't get false warnings.

Thanks for your quick and detailed response, by the way!

-Jeff

[1] I realize that hung_task_timeout is global. We could simulate a
per-task timeout by simply looping in wait_for_completion_timeout until
expected_time - waited_time <= hung_task_timeout, and then doing
the wait_for_completion (without the timeout).

2012-06-14 01:35:55

by Wanpeng Li

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
>Hi Jeff,
>
>On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
>> Wanpeng Li <[email protected]> writes:
>>
>> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> > index f2d0109..df879ee 100644
>> > --- a/fs/fs-writeback.c
>> > +++ b/fs/fs-writeback.c
>> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
>> >
>> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
>> > bdi_queue_work(sb->s_bdi, &work);
>> > - wait_for_completion(&done);
>> > + if (sysctl_hung_task_timeout_secs)
>> > + while (!wait_for_completion_timeout(&done, HZ/2))
>> > + ;
>> > + else
>> > + wait_for_completion(&done);
>> > }
>> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
>>
>> Is it really expected that writeback_inodes_sb_nr will routinely queue
>> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
>> only entity issuing I/O)?
>
>Yes, in the case of syncing the whole superblock.
>Basically sync() does its job in two steps:
>
>for all sb:
> writeback_inodes_sb_nr() # WB_SYNC_NONE
> sync_inodes_sb() # WB_SYNC_ALL
>
>> For devices that are really slow, it may make
>> more sense to tune the system so that you don't have too much writeback
>> I/O submitted at once. Dropping nr_requests for the given queue should
>> fix this situation, I would think.
>
>The worried case is about sync() waiting
>
> (nr_dirty + nr_writeback) / write_bandwidth
>
>time, where it is nr_dirty that could grow rather large.
>
>For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
>the sync() will have to wait for 100 seconds. If there are heavy
>dirtiers running during the sync, it will typically take several
>hundreds of seconds (which looks not that good, but still much better
>than being livelocked in some old kernels)..
>
>> This really feels like we're papering over the problem.
>
>That's true. The majority users probably don't want to cache 100s
>worth of data in memory. It may be worthwhile to add a new per-bdi
>limit whose unit is number-of-seconds (of dirty data).
Hi Fengguang,

Maybe we have already have a threshold "dirty_expire_interval" to ensure
pages will not dirty more than 30 seconds. Why should add a similar
variable ? I think per-bdi flusher will try its best to flush dirty pages
when waken up, just because the backing storages is too slow. :-)

Best Regards,
Wanpeng Li

>
>Thanks,
>Fengguang

2012-06-14 10:52:52

by Wanpeng Li

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Wed, Jun 13, 2012 at 12:42:33PM +0800, Wanpeng Li wrote:

Hi Fengguang,

This patch need improve or ..., any comments or suggetions?

Best Regards,
Wanpeng Li

>
>From: Wanpeng Li <[email protected]>
>
>"V1 -> V2"
>* remove redundant hangcheck variable
>
>I use several dd processes to write a slow SD card
>dd if=/dev/sda1 of=/dev/sdc4 bs=1M count=4000
>and several sync commands(maybe > 10),dmesg show this:
>
>[ 366.888741] INFO: task sync:3518 blocked for more than 120 seconds.
>[ 366.888742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>[ 366.888746] sync D 00000201 0 3518 3462 0x00000000
>[ 366.888752] dcde5e54 00000082 00000000 00000201 00000000 c180d220 c1933d00 c1933d00
>[ 366.888758] 614a7ae7 00000023 f6bc0d00 ebecb280 c180d220 f6bc0d6c f04acbfc 00000000
>[ 366.888786] f6bc0d6c dcde5e2c 00000023 f6bc0d44 00000000 00000000 ebd8b2b4 dcde5e6c
>[ 366.888788] Call Trace:
>[ 366.888792] [<c107a0ee>] ? enqueue_entity+0xee/0x5a0
>[ 366.888795] [<c15abc83>] schedule+0x23/0x60
>[ 366.888798] [<c15aa5ad>] schedule_timeout+0x22d/0x2a0
>[ 366.888801] [<c1071ce2>] ? check_preempt_curr+0x72/0x90
>[ 366.888804] [<c1071d28>] ? ttwu_do_wakeup+0x28/0x130
>[ 366.888807] [<c1037f28>] ? default_spin_lock_flags+0x8/0x10
>[ 366.888810] [<c15acaad>] ? _raw_spin_lock_irqsave+0x2d/0x40
>[ 366.888813] [<c15abb17>] wait_for_common+0xa7/0x110
>[ 366.888816] [<c1074690>] ? try_to_wake_up+0x220/0x220
>[ 366.888819] [<c15abc57>] wait_for_completion+0x17/0x20
>[ 366.888822] [<c116ad80>] writeback_inodes_sb_nr+0x70/0x90
>[ 366.888825] [<c116af75>] writeback_inodes_sb+0x25/0x30
>[ 366.888828] [<c117159f>] __sync_filesystem+0x4f/0x90
>[ 366.888831] [<c11715f7>] sync_one_sb+0x17/0x20
>[ 366.888834] [<c114c430>] iterate_supers+0xc0/0xd0
>[ 366.888837] [<c11715e0>] ? __sync_filesystem+0x90/0x90
>[ 366.888831] [<c11715f7>] sync_one_sb+0x17/0x20
>[ 366.888834] [<c114c430>] iterate_supers+0xc0/0xd0
>[ 366.888837] [<c11715e0>] ? __sync_filesystem+0x90/0x90
>[ 366.888840] [<c117167b>] sys_sync+0x2b/0x60
>[ 366.888842] [<c15b385f>] sysenter_do_call+0x12/0x28
>
>Too many similar messages flood the logs. So I use a present method to
>fix this issue.
>------------------------------------------
>Author: Mark Lord <[email protected]>
>Date: Fri Sep 24 09:51:13 2010 -0400
>
> block: Prevent hang_check firing during long I/O
>------------------------------------------
>
>Signed-off-by: Wanpeng Li <[email protected]>
>
>
>---
> fs/fs-writeback.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>index f2d0109..df879ee 100644
>--- a/fs/fs-writeback.c
>+++ b/fs/fs-writeback.c
>@@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
>
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
> bdi_queue_work(sb->s_bdi, &work);
>- wait_for_completion(&done);
>+ if (sysctl_hung_task_timeout_secs)
>+ while (!wait_for_completion_timeout(&done, HZ/2))
>+ ;
>+ else
>+ wait_for_completion(&done);
> }
> EXPORT_SYMBOL(writeback_inodes_sb_nr);
>
>--
>1.7.9.5
>

2012-06-14 13:26:58

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Thu, Jun 14, 2012 at 09:35:34AM +0800, Wanpeng Li wrote:
> On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
> >Hi Jeff,
> >
> >On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> >> Wanpeng Li <[email protected]> writes:
> >>
> >> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> > index f2d0109..df879ee 100644
> >> > --- a/fs/fs-writeback.c
> >> > +++ b/fs/fs-writeback.c
> >> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> >> >
> >> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >> > bdi_queue_work(sb->s_bdi, &work);
> >> > - wait_for_completion(&done);
> >> > + if (sysctl_hung_task_timeout_secs)
> >> > + while (!wait_for_completion_timeout(&done, HZ/2))
> >> > + ;
> >> > + else
> >> > + wait_for_completion(&done);
> >> > }
> >> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
> >>
> >> Is it really expected that writeback_inodes_sb_nr will routinely queue
> >> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> >> only entity issuing I/O)?
> >
> >Yes, in the case of syncing the whole superblock.
> >Basically sync() does its job in two steps:
> >
> >for all sb:
> > writeback_inodes_sb_nr() # WB_SYNC_NONE
> > sync_inodes_sb() # WB_SYNC_ALL
> >
> >> For devices that are really slow, it may make
> >> more sense to tune the system so that you don't have too much writeback
> >> I/O submitted at once. Dropping nr_requests for the given queue should
> >> fix this situation, I would think.
> >
> >The worried case is about sync() waiting
> >
> > (nr_dirty + nr_writeback) / write_bandwidth
> >
> >time, where it is nr_dirty that could grow rather large.
> >
> >For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
> >the sync() will have to wait for 100 seconds. If there are heavy
> >dirtiers running during the sync, it will typically take several
> >hundreds of seconds (which looks not that good, but still much better
> >than being livelocked in some old kernels)..
> >
> >> This really feels like we're papering over the problem.
> >
> >That's true. The majority users probably don't want to cache 100s
> >worth of data in memory. It may be worthwhile to add a new per-bdi
> >limit whose unit is number-of-seconds (of dirty data).
> Hi Fengguang,
>
> Maybe we have already have a threshold "dirty_expire_interval" to ensure
> pages will not dirty more than 30 seconds. Why should add a similar
> variable ? I think per-bdi flusher will try its best to flush dirty pages
> when waken up, just because the backing storages is too slow. :-)

dirty_expire_interval is used to start background writeback and not
used for throttling the dirtier task. So the heavy dirtier can still
outrun the flusher thread and push dirty pages all the way to the
dirty limits.

So to actually keep dirty pages under control 30s, we need some time
based _limit_. However I'm not sure whether we are going to do this.
Because the bandwidth estimation may not be accurate, especially that
it's initialized to an arbitrary 100MB/s for all newly detected disks,
whether it be slow USB keys or fast enterprise SSD. So now you claim
to throttle dirty pages under 30s worth of data, however the user can
still be able to dirty 300s worth of data on a newly plugged USB key.
That makes the interface look silly and not working to user expectations.

Another worry is that it will break some existing workloads because
the new limit triggers throttling for them earlier than before.

Thanks,
Fengguang

2012-06-14 13:36:09

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Hi Jeff,

On Wed, Jun 13, 2012 at 11:34:20AM -0400, Jeff Moyer wrote:
> Fengguang Wu <[email protected]> writes:
>
> > Hi Jeff,
> >
> > On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> >> Wanpeng Li <[email protected]> writes:
> >>
> >> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> > index f2d0109..df879ee 100644
> >> > --- a/fs/fs-writeback.c
> >> > +++ b/fs/fs-writeback.c
> >> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> >> >
> >> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >> > bdi_queue_work(sb->s_bdi, &work);
> >> > - wait_for_completion(&done);
> >> > + if (sysctl_hung_task_timeout_secs)
> >> > + while (!wait_for_completion_timeout(&done, HZ/2))
> >> > + ;
> >> > + else
> >> > + wait_for_completion(&done);
> >> > }
> >> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
> >>
> >> Is it really expected that writeback_inodes_sb_nr will routinely queue
> >> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> >> only entity issuing I/O)?
> >
> > Yes, in the case of syncing the whole superblock.
> > Basically sync() does its job in two steps:
> >
> > for all sb:
> > writeback_inodes_sb_nr() # WB_SYNC_NONE
> > sync_inodes_sb() # WB_SYNC_ALL
> >
> >> For devices that are really slow, it may make
> >> more sense to tune the system so that you don't have too much writeback
> >> I/O submitted at once. Dropping nr_requests for the given queue should
> >> fix this situation, I would think.
> >
> > The worried case is about sync() waiting
> >
> > (nr_dirty + nr_writeback) / write_bandwidth
> >
> > time, where it is nr_dirty that could grow rather large.
> >
> > For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
> > the sync() will have to wait for 100 seconds. If there are heavy
> > dirtiers running during the sync, it will typically take several
> > hundreds of seconds (which looks not that good, but still much better
> > than being livelocked in some old kernels)..
> >
> >> This really feels like we're papering over the problem.
> >
> > That's true. The majority users probably don't want to cache 100s
> > worth of data in memory. It may be worthwhile to add a new per-bdi
> > limit whose unit is number-of-seconds (of dirty data).
>
> Hi, Fengguang,
>
> Another option is to limit the amount of time we wait to the amount of
> time we expect to have to wait. IOW, if we can estimate the amount of
> time we think the I/O will take to complete, we can set the
> hung_task_timeout[1] to *that* (with some fudge factor). Do you have a
> mechanism in place today to make such an estimate? The benefit of this
> solution is obvious: you still get notified when tasks are actually
> hung, but you don't get false warnings.

Good idea! Yes we can do some estimation and adaptively extend the
hang timeout for the current writeback_inodes_sb_nr()/sync_inodes_sb()
call.

Note that it's not going to reliably get rid of false warnings due to
estimation errors, which could be pretty large and unavoidable on
change of workload. But still, it would be a net improvement and
perhaps enough to get rid of most false warnings, while still being
able to catch livelock or other kind of task hang.

Thanks,
Fengguang

2012-06-15 22:38:21

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Wed, Jun 13, 2012 at 12:42:33PM +0800, Wanpeng Li wrote:
> Hi Fengguang,
>
> This time it works well after testing.
>
> Regards,
> Wanpeng Li
>
> From: Wanpeng Li <[email protected]>
>
> "V1 -> V2"
> * remove redundant hangcheck variable
>
> I use several dd processes to write a slow SD card
> dd if=/dev/sda1 of=/dev/sdc4 bs=1M count=4000
> and several sync commands(maybe > 10),dmesg show this:
>
> [ 366.888741] INFO: task sync:3518 blocked for more than 120 seconds.
> [ 366.888742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 366.888746] sync D 00000201 0 3518 3462 0x00000000
> [ 366.888752] dcde5e54 00000082 00000000 00000201 00000000 c180d220 c1933d00 c1933d00
> [ 366.888758] 614a7ae7 00000023 f6bc0d00 ebecb280 c180d220 f6bc0d6c f04acbfc 00000000
> [ 366.888786] f6bc0d6c dcde5e2c 00000023 f6bc0d44 00000000 00000000 ebd8b2b4 dcde5e6c
> [ 366.888788] Call Trace:
> [ 366.888792] [<c107a0ee>] ? enqueue_entity+0xee/0x5a0
> [ 366.888795] [<c15abc83>] schedule+0x23/0x60
> [ 366.888798] [<c15aa5ad>] schedule_timeout+0x22d/0x2a0
> [ 366.888801] [<c1071ce2>] ? check_preempt_curr+0x72/0x90
> [ 366.888804] [<c1071d28>] ? ttwu_do_wakeup+0x28/0x130
> [ 366.888807] [<c1037f28>] ? default_spin_lock_flags+0x8/0x10
> [ 366.888810] [<c15acaad>] ? _raw_spin_lock_irqsave+0x2d/0x40
> [ 366.888813] [<c15abb17>] wait_for_common+0xa7/0x110
> [ 366.888816] [<c1074690>] ? try_to_wake_up+0x220/0x220
> [ 366.888819] [<c15abc57>] wait_for_completion+0x17/0x20
> [ 366.888822] [<c116ad80>] writeback_inodes_sb_nr+0x70/0x90
> [ 366.888825] [<c116af75>] writeback_inodes_sb+0x25/0x30
> [ 366.888828] [<c117159f>] __sync_filesystem+0x4f/0x90
> [ 366.888831] [<c11715f7>] sync_one_sb+0x17/0x20
> [ 366.888834] [<c114c430>] iterate_supers+0xc0/0xd0
> [ 366.888837] [<c11715e0>] ? __sync_filesystem+0x90/0x90
> [ 366.888831] [<c11715f7>] sync_one_sb+0x17/0x20
> [ 366.888834] [<c114c430>] iterate_supers+0xc0/0xd0
> [ 366.888837] [<c11715e0>] ? __sync_filesystem+0x90/0x90
> [ 366.888840] [<c117167b>] sys_sync+0x2b/0x60
> [ 366.888842] [<c15b385f>] sysenter_do_call+0x12/0x28
>
> Too many similar messages flood the logs. So I use a present method to
> fix this issue.

NACK.

You're not fixing anything - you are simply removing a warning
that indicates that there IO subsystem is overloaded. This is a very
good warning to have - when the system slows down immensely, and
these start appearing, we know that there is a problem related to IO
writeback. For example, this -current- thread on the XFS list:

http://oss.sgi.com/archives/xfs/2012-06/msg00144.html

Which a) we probably would have never heard of if the hung task warning
didn't fire on blocked IOs, and b) would have struggled to
understand without the stack traces the hung task timer has provided
us with.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-15 22:43:12

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
> > This really feels like we're papering over the problem.
>
> That's true. The majority users probably don't want to cache 100s
> worth of data in memory. It may be worthwhile to add a new per-bdi
> limit whose unit is number-of-seconds (of dirty data).

Doesn't work. You have a BBWC that takes in 500MB of random 4k
writes in a second, then starts to flush and needs to do a RMW cycle
for every 4k write it cached. On RAID5/6, the flush rate will be
about 100 IOPS, so it could take half an hour to flush those writes
that took a second to dump into the cache. IO for that entire half
hour will be extremely slow, and if you isue a sync during it, then
that's when you get a hung task timer.

Limiting the amount of writeback to a few seconds of IO simply won't
fix this - the ingest rate of BBWCs is simply too great to prevent
such events by a slow moving bandwidth throttle....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-19 20:14:24

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Fengguang Wu <[email protected]> writes:

> Good idea! Yes we can do some estimation and adaptively extend the
> hang timeout for the current writeback_inodes_sb_nr()/sync_inodes_sb()
> call.
>
> Note that it's not going to reliably get rid of false warnings due to
> estimation errors, which could be pretty large and unavoidable on
> change of workload. But still, it would be a net improvement and
> perhaps enough to get rid of most false warnings, while still being
> able to catch livelock or other kind of task hang.

Hi, Fengguang,

I didn't see a patch from you for this, so I went ahead and threw
something together. Let me know what you think of it. I wasn't sure
how to estimate the total I/O that will be issued for syncing out an
entire superblock, though, so I didn't do that part.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <[email protected]>

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8d2fb8c..346f3de 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1291,6 +1291,37 @@ static void wait_sb_inodes(struct super_block *sb)
}

/**
+ * wb_wait_for_completion_nohang - wait for the given work item to
+ * complete, attempting to not falsely trigger the hangcheck timer.
+ * @work: the wb_writeback_work we're waiting on
+ *
+ * Wait for the completion of the given work item. If the hang check
+ * timer is activated, then estimate the amount of time we should spend
+ * waiting for I/O, and wake up often enough to not trigger the timer.
+ * Once we've exceeded the estimated I/O time, wait without a timeout so
+ * that the hangcheck timer will then fire.
+ */
+void wb_wait_for_completion_nohang(struct wb_writeback_work *work)
+{
+ unsigned long hang_check = sysctl_hung_task_timeout_secs;
+ bool completed = false;
+
+ if (hang_check) {
+ /* loop until the time remaining is less than the timer */
+ unsigned long est_io_time_s = work->nr_pages /
+ work->sb->s_bdi->avg_write_bandwidth;
+ while (!completed && est_io_time_s > hang_check) {
+ completed = !!wait_for_completion_timeout(work->done,
+ hang_check * (HZ/2));
+ est_io_time_s -= hang_check / 2;
+ }
+ }
+
+ if (!completed)
+ wait_for_completion(work->done);
+}
+
+/**
* writeback_inodes_sb_nr - writeback dirty inodes from given super_block
* @sb: the superblock
* @nr: the number of pages to write
@@ -1316,7 +1347,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,

WARN_ON(!rwsem_is_locked(&sb->s_umount));
bdi_queue_work(sb->s_bdi, &work);
- wait_for_completion(&done);
+ wb_wait_for_completion_nohang(&work);
}
EXPORT_SYMBOL(writeback_inodes_sb_nr);

2012-06-19 21:02:17

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

On Tue, Jun 19, 2012 at 04:14:16PM -0400, Jeff Moyer wrote:
> Fengguang Wu <[email protected]> writes:
>
> > Good idea! Yes we can do some estimation and adaptively extend the
> > hang timeout for the current writeback_inodes_sb_nr()/sync_inodes_sb()
> > call.
> >
> > Note that it's not going to reliably get rid of false warnings due to
> > estimation errors, which could be pretty large and unavoidable on
> > change of workload. But still, it would be a net improvement and
> > perhaps enough to get rid of most false warnings, while still being
> > able to catch livelock or other kind of task hang.
>
> Hi, Fengguang,
>
> I didn't see a patch from you for this, so I went ahead and threw
> something together. Let me know what you think of it. I wasn't sure
> how to estimate the total I/O that will be issued for syncing out an
> entire superblock, though, so I didn't do that part.

As I said to the original patch - having a hang check timeout on a
system that is overloaded w.r.t. IO is an important piece of
information when it comes to debugging problems. Often the hangcheck
timer is the first piece of information that we will get that
indicates a problem somewhere in a production system.

Removing it does not magically fix the underlying problem - it
simply means that we don't hear about them until someone complains
that unmount is taking hours....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-06-19 21:09:34

[permalink] [raw]

Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block

Dave Chinner <[email protected]> writes:

> On Tue, Jun 19, 2012 at 04:14:16PM -0400, Jeff Moyer wrote:
>> Fengguang Wu <[email protected]> writes:
>>
>> > Good idea! Yes we can do some estimation and adaptively extend the
>> > hang timeout for the current writeback_inodes_sb_nr()/sync_inodes_sb()
>> > call.
>> >
>> > Note that it's not going to reliably get rid of false warnings due to
>> > estimation errors, which could be pretty large and unavoidable on
>> > change of workload. But still, it would be a net improvement and
>> > perhaps enough to get rid of most false warnings, while still being
>> > able to catch livelock or other kind of task hang.
>>
>> Hi, Fengguang,
>>
>> I didn't see a patch from you for this, so I went ahead and threw
>> something together. Let me know what you think of it. I wasn't sure
>> how to estimate the total I/O that will be issued for syncing out an
>> entire superblock, though, so I didn't do that part.
>
> As I said to the original patch - having a hang check timeout on a
> system that is overloaded w.r.t. IO is an important piece of
> information when it comes to debugging problems. Often the hangcheck
> timer is the first piece of information that we will get that
> indicates a problem somewhere in a production system.

So, you believe that we should always check at 2 minute intervals (or
whatever is configured), even if we know there is more than that much
I/O queued? In case there is any confusion, here, the patch I posted
ensured that we would eventually spew a warning, but only if the process
was blocked for longer than we (the kernel) expected.

> Removing it does not magically fix the underlying problem - it
> simply means that we don't hear about them until someone complains
> that unmount is taking hours....

There isn't necessarily an underlying problem. This is very much a gray
area, Dave. We get plenty of false positives in this code. I was
trying to reduce *that* problem. Do you have a better idea on how to
address the issue?

Maybe this discussion requires looking at specific instances of the
problem so we're all on the same page. What do you think is the best
way forward, here?

Cheers,
Jeff

2012-06-19 21:56:51