2010-02-12 09:16:15

by Jens Axboe

Subject: [PATCH] writeback: Fix broken sync writeback

Hi,

There's currently a writeback bug in the 2.6.32 and 2.6.33-rc kernels
that prevents proper writeback when sync(1) is being run. Instead of
flushing everything older than the sync run, it does chunks of normal
MAX_WRITEBACK_PAGES writeback and restarts over and over. This results in
very suboptimal writeback for many files; see the initial report from
Jan Engelhardt:

http://lkml.org/lkml/2010/1/22/382

This fixes it by using the passed-in page writeback count, instead of
doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
(Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
finish properly even when new pages are being dirtied.

Thanks to Jan Kara <[email protected]> for spotting this problem!

Cc: [email protected]
Reported-by: Jan Engelhardt <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

---

I'm sending this out as a patch on the list instead, since I'll be going
away for a week of skiing very shortly; a mail+patch is easier to discuss
here.

Greg, I'm attaching a 2.6.32-based patch as well, since this one will
not apply to 2.6.32.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..55aeea9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -749,6 +749,8 @@ static long wb_writeback(struct bdi_writeback *wb,
}

for (;;) {
+ long to_write = 0;
+
/*
* Stop writeback when nr_pages has been consumed
*/
@@ -762,12 +764,17 @@ static long wb_writeback(struct bdi_writeback *wb,
if (args->for_background && !over_bground_thresh())
break;

+ if (args->sync_mode == WB_SYNC_ALL)
+ to_write = args->nr_pages;
+ if (!to_write)
+ to_write = MAX_WRITEBACK_PAGES;
+
wbc.more_io = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+ wbc.nr_to_write = to_write;
wbc.pages_skipped = 0;
writeback_inodes_wb(wb, &wbc);
- args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+ args->nr_pages -= to_write - wbc.nr_to_write;
+ wrote += to_write - wbc.nr_to_write;

/*
* If we consumed everything, see if we have more
@@ -782,7 +789,7 @@ static long wb_writeback(struct bdi_writeback *wb,
/*
* Did we write something? Try for more
*/
- if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+ if (wbc.nr_to_write < to_write)
continue;
/*
* Nothing written. Wait for some inode to

--
Jens Axboe



2010-02-12 15:46:12

by Linus Torvalds

Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Fri, 12 Feb 2010, Jens Axboe wrote:
>
> This fixes it by using the passed in page writeback count, instead of
> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> finish properly even when new pages are being dirtied.

This seems broken.

The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
generate a single IO bigger than a couple of MB. And then we have that
loop around things to do multiple chunks. Your change to use nr_pages
seems to make the whole looping totally pointless, and breaks that "don't
do huge hunks" logic.

So I don't think that your patch is correct.

That said, I _do_ believe you when you say it makes a difference, which
makes me think there is a bug there. I just don't think you fixed the
right bug, and your change just happens to do what you wanted by pure
random luck.

The _real_ bug seems to be the one you mentioned, but then ignored:

> Instead of flushing everything older than the sync run, it will do
> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> over.

and it would seem that the _logical_ way to fix it would be something like
the appended...

Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
any _future_ data is written back, so the 'oldest_jif' thing would seem to
be sane regardless of sync mode.

NOTE NOTE NOTE! I do have to admit that this patch scares me, because
there could be some bug in the 'older_than_this' logic that means that
somebody sets it even if the inode is already dirty. So this patch makes
conceptual sense to me, and I think it's the right thing to do, but I
also suspect that we do not actually have a lot of test coverage of the
whole 'older_than_this' logic, because it historically has been just a
small optimization for kupdated.

So this patch scares me, as it could break 'fdatasync' entirely. So
somebody should really double-check the whole 'dirtied_when' logic, just
to be safe. If anybody ever sets 'dirtied_when' to the current time even
if the inode is already dirty (and has an earlier 'dirtied_when'), then
that would open 'fdatasync()' and friends up to not writing things
back properly at all (because a newer write set 'dirtied_when' so that
old writes get ignored and thought to be 'totally new').

Comments?

Linus

---
fs/fs-writeback.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a7c42c..a0a8424 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
long wrote = 0;
struct inode *inode;

- if (wbc.for_kupdate) {
- wbc.older_than_this = &oldest_jif;
- oldest_jif = jiffies -
- msecs_to_jiffies(dirty_expire_interval * 10);
- }
+ /*
+ * We never write back data that was dirtied _after_ we
+ * started writeback. But kupdate doesn't even want to
+ * write back recently dirtied stuff, only older data.
+ */
+ oldest_jif = jiffies-1;
+ wbc.older_than_this = &oldest_jif;
+ if (wbc.for_kupdate)
+ oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
+
if (!wbc.range_cyclic) {
wbc.range_start = 0;
wbc.range_end = LLONG_MAX;

2010-02-13 12:58:24

by Jan Engelhardt

Subject: Re: [PATCH] writeback: Fix broken sync writeback


On Friday 2010-02-12 16:45, Linus Torvalds wrote:
>On Fri, 12 Feb 2010, Jens Axboe wrote:
>>
>> This fixes it by using the passed in page writeback count, instead of
>> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>> finish properly even when new pages are being dirtied.
>
>This seems broken.

It seems so. Jens, Jan Kara, your patch does not entirely fix this.
While there is no sync/fsync to be seen in these traces, I can
tell there's a livelock, without Dirty decreasing at all.

INFO: task flush-8:0:354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0 D 00000000005691a0 5560 354 2 0x28000000000
Call Trace:
[0000000000568de4] start_this_handle+0x36c/0x520
[00000000005691a0] jbd2_journal_start+0xb4/0xe0
[000000000054f99c] ext4_journal_start_sb+0x54/0x9c
[0000000000540de0] ext4_da_writepages+0x1e0/0x460
[00000000004ab3e4] do_writepages+0x28/0x4c
[00000000004f825c] writeback_single_inode+0xf0/0x330
[00000000004f90a8] writeback_inodes_wb+0x4e4/0x600
[00000000004f9354] wb_writeback+0x190/0x20c
[00000000004f967c] wb_do_writeback+0x1d4/0x1f0
[00000000004f96c0] bdi_writeback_task+0x28/0xa0
[00000000004b71dc] bdi_start_fn+0x64/0xc8
[00000000004786f0] kthread+0x58/0x6c
[000000000042ae7c] kernel_thread+0x30/0x48
[0000000000478644] kthreadd+0xb8/0x10c
1 lock held by flush-8:0/354:
#0: (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>]
# writeback_inodes_wb+0x404/0x600
INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda6-8 D 000000000056eea0 7272 588 2 0x18000000000
Call Trace:
[000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0
[000000000056eea0] kjournald2+0x138/0x2fc
[00000000004786f0] kthread+0x58/0x6c
[000000000042ae7c] kernel_thread+0x30/0x48
[0000000000478644] kthreadd+0xb8/0x10c
no locks held by jbd2/sda6-8/588.
INFO: task rpmbuild:6580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rpmbuild D 00000000005691a0 16 6580 6562 0x310061101000080
Call Trace:
[0000000000568de4] start_this_handle+0x36c/0x520
[00000000005691a0] jbd2_journal_start+0xb4/0xe0
[000000000054f99c] ext4_journal_start_sb+0x54/0x9c
[000000000053dcd4] ext4_dirty_inode+0x8/0x3c
[00000000004f8888] __mark_inode_dirty+0x20/0x15c
[00000000004eeb34] file_update_time+0x104/0x124
[00000000004a50d4] __generic_file_aio_write+0x264/0x36c
[00000000004a5228] generic_file_aio_write+0x4c/0xa4
[00000000005390c0] ext4_file_write+0x94/0xa4
[00000000004dc12c] do_sync_write+0x84/0xd4
[00000000004dca78] vfs_write+0x70/0x12c
[00000000004dcbcc] SyS_write+0x2c/0x5c
[0000000000406214] linux_sparc_syscall32+0x34/0x40
1 lock held by rpmbuild/6580:
#0: (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>]
# generic_file_aio_write+0x38/0xa4
etc.

>
>The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
>generate a single IO bigger than a couple of MB. And then we have that
>loop around things to do multiple chunks. Your change to use nr_pages
>seems to make the whole looping totally pointless, and breaks that "don't
>do huge hunks" logic.
>
>So I don't think that your patch is correct.
>
>That said, I _do_ believe you when you say it makes a difference, which
>makes me think there is a bug there. I just don't think you fixed the
>right bug, and your change just happens to do what you wanted by pure
>random luck.
>
>The _real_ bug seems to be the one you mentioned, but then ignored:
>
>> Instead of flushing everything older than the sync run, it will do
>> chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
>> over.
>
>and it would seem that the _logical_ way to fix it would be something like
>the appended...
>
>Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
>any _future_ data is written back, so the 'oldest_jif' thing would seem to
>be sane regardless of sync mode.
>
> NOTE NOTE NOTE! I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty. So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.
>
> So this patch scares me, as it could break 'fdatasync' entirely. So
> somebody should really double-check the whole 'dirtied_when' logic, just
> to be safe. If anybody ever sets 'dirtied_when' to the current time even
> if the inode is already dirty (and has an earlier 'dirtied_when'), then
> that would open 'fdatasync()' and friends up to not writing things
> back properly at all (because a newer write set 'dirtied_when' so that
> old writes get ignored and thought to be 'totally new').
>
>Comments?
>
> Linus
>
>---
> fs/fs-writeback.c | 15 ++++++++++-----
> 1 files changed, 10 insertions(+), 5 deletions(-)
>
>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>index 1a7c42c..a0a8424 100644
>--- a/fs/fs-writeback.c
>+++ b/fs/fs-writeback.c
>@@ -738,11 +738,16 @@ static long wb_writeback(struct bdi_writeback *wb,
> long wrote = 0;
> struct inode *inode;
>
>- if (wbc.for_kupdate) {
>- wbc.older_than_this = &oldest_jif;
>- oldest_jif = jiffies -
>- msecs_to_jiffies(dirty_expire_interval * 10);
>- }
>+ /*
>+ * We never write back data that was dirtied _after_ we
>+ * started writeback. But kupdate doesn't even want to
>+ * write back recently dirtied stuff, only older data.
>+ */
>+ oldest_jif = jiffies-1;
>+ wbc.older_than_this = &oldest_jif;
>+ if (wbc.for_kupdate)
>+ oldest_jif -= msecs_to_jiffies(dirty_expire_interval * 10);
>+
> if (!wbc.range_cyclic) {
> wbc.range_start = 0;
> wbc.range_end = LLONG_MAX;
>

2010-02-15 14:17:43

by Jan Kara

Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Fri 12-02-10 07:45:04, Linus Torvalds wrote:
> On Fri, 12 Feb 2010, Jens Axboe wrote:
> >
> > This fixes it by using the passed in page writeback count, instead of
> > doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> > (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> > finish properly even when new pages are being dirtied.
>
> This seems broken.
>
> The whole point of MAX_WRITEBACK_PAGES was to make sure that we don't
> generate a single IO bigger than a couple of MB. And then we have that
> loop around things to do multiple chunks. Your change to use nr_pages
> seems to make the whole looping totally pointless, and breaks that "don't
> do huge hunks" logic.
>
> So I don't think that your patch is correct.
>
> That said, I _do_ believe you when you say it makes a difference, which
> makes me think there is a bug there. I just don't think you fixed the
> right bug, and your change just happens to do what you wanted by pure
> random luck.
A few points to this:
Before Jens rewrote the writeback code to use per-bdi flusher threads,
MAX_WRITEBACK_PAGES applied only to writeback done by pdflush and kupdate.
So, for example, sync used .nr_pages as in the suggested patch. So Jens'
patch was just trying to get back to the old behavior...

Personally, I don't see why VM shouldn't generate IO from a single inode
larger than a few MB for data integrity syncs. There are two reasons I know
about for MAX_WRITEBACK_PAGES:
a) avoid livelocking of writeout when the file is continuously grown
(even nastier is when the file is sparse and huge and the writer just
fills the huge hole with data gradually)
b) avoid holding I_SYNC for too long because that could block tasks being
write-throttled.

I find a) a valid reason, but MAX_WRITEBACK_PAGES works around the livelock
only for background writeback, where we cycle through all inodes and do not
care whether we ever make the inode clean. Limiting data integrity
writeback to MAX_WRITEBACK_PAGES just makes things *worse* because you give
processes more time to generate new dirty pages. Using dirtied_when does
not solve this because the inode dirty timestamp is updated only at the
clean->dirty transition, but the inode never gets clean.

I don't think b) is a very valid reason anymore. Currently, a throttled task
writes all inodes from the BDI and the I_SYNC-locked inode is just skipped.
It can only be a problem when that inode is the only one with a big
enough amount of dirty data. Then the throttled thread will exit
balance_dirty_pages only once the amount of dirty data at that BDI drops
below the BDI dirty threshold.

> The _real_ bug seems to be the one you mentioned, but then ignored:
>
> > Instead of flushing everything older than the sync run, it will do
> > chunks of normal MAX_WRITEBACK_PAGES writeback and restart over and
> > over.
>
> and it would seem that the _logical_ way to fix it would be something like
> the appended...
>
> Hmm? Even when you do a 'fdatasync()', it's not going to guarantee that
> any _future_ data is written back, so the 'oldest_jif' thing would seem to
> be sane regardless of sync mode.
>
> NOTE NOTE NOTE! I do have to admit that this patch scares me, because
> there could be some bug in the 'older_than_this' logic that means that
> somebody sets it even if the inode is already dirty. So this patch makes
> conceptual sense to me, and I think it's the right thing to do, but I
> also suspect that we do not actually have a lot of test coverage of the
> whole 'older_than_this' logic, because it historically has been just a
> small optimization for kupdated.
A similar logic to what you are suggesting is actually already there. See
the 'start' variable in writeback_inodes_wb(). It was there exactly to avoid
sync(1) livelocks, but it assumes that writeback_inodes_wb() handles all
the IO for the sync, not that it is called just to write
MAX_WRITEBACK_PAGES. So the only danger of your patch using older_than_this
would be that the ordering on the wb->b_dirty list isn't always by dirty time.

But as I wrote above the main problem I see in your patch is that it does
not avoid all the cases of livelocks.


Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-15 14:49:30

by Jan Kara

Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>
> On Friday 2010-02-12 16:45, Linus Torvalds wrote:
> >On Fri, 12 Feb 2010, Jens Axboe wrote:
> >>
> >> This fixes it by using the passed in page writeback count, instead of
> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> >> finish properly even when new pages are being dirtied.
> >
> >This seems broken.
>
> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
> While there is no sync/fsync to be seen in these traces, I can
> tell there's a livelock, without Dirty decreasing at all.
I don't think this is directly connected with my / Jens' patch.
Similar traces happen even without the patch (see e.g.
http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch
makes it worse... So are you able to reproduce these warnings, and did
they not happen without the patch?
Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0?

> INFO: task flush-8:0:354 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> flush-8:0 D 00000000005691a0 5560 354 2 0x28000000000
> Call Trace:
> [0000000000568de4] start_this_handle+0x36c/0x520
> [00000000005691a0] jbd2_journal_start+0xb4/0xe0
> [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
> [0000000000540de0] ext4_da_writepages+0x1e0/0x460
> [00000000004ab3e4] do_writepages+0x28/0x4c
> [00000000004f825c] writeback_single_inode+0xf0/0x330
> [00000000004f90a8] writeback_inodes_wb+0x4e4/0x600
> [00000000004f9354] wb_writeback+0x190/0x20c
> [00000000004f967c] wb_do_writeback+0x1d4/0x1f0
> [00000000004f96c0] bdi_writeback_task+0x28/0xa0
> [00000000004b71dc] bdi_start_fn+0x64/0xc8
> [00000000004786f0] kthread+0x58/0x6c
> [000000000042ae7c] kernel_thread+0x30/0x48
> [0000000000478644] kthreadd+0xb8/0x10c
> 1 lock held by flush-8:0/354:
> #0: (&type->s_umount_key#18){......}, at: [<00000000004f8fc8>]
> # writeback_inodes_wb+0x404/0x600
> INFO: task jbd2/sda6-8:588 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda6-8 D 000000000056eea0 7272 588 2 0x18000000000
> Call Trace:
> [000000000056976c] jbd2_journal_commit_transaction+0x218/0x15e0
> [000000000056eea0] kjournald2+0x138/0x2fc
> [00000000004786f0] kthread+0x58/0x6c
> [000000000042ae7c] kernel_thread+0x30/0x48
> [0000000000478644] kthreadd+0xb8/0x10c
> no locks held by jbd2/sda6-8/588.
> INFO: task rpmbuild:6580 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> rpmbuild D 00000000005691a0 16 6580 6562 0x310061101000080
> Call Trace:
> [0000000000568de4] start_this_handle+0x36c/0x520
> [00000000005691a0] jbd2_journal_start+0xb4/0xe0
> [000000000054f99c] ext4_journal_start_sb+0x54/0x9c
> [000000000053dcd4] ext4_dirty_inode+0x8/0x3c
> [00000000004f8888] __mark_inode_dirty+0x20/0x15c
> [00000000004eeb34] file_update_time+0x104/0x124
> [00000000004a50d4] __generic_file_aio_write+0x264/0x36c
> [00000000004a5228] generic_file_aio_write+0x4c/0xa4
> [00000000005390c0] ext4_file_write+0x94/0xa4
> [00000000004dc12c] do_sync_write+0x84/0xd4
> [00000000004dca78] vfs_write+0x70/0x12c
> [00000000004dcbcc] SyS_write+0x2c/0x5c
> [0000000000406214] linux_sparc_syscall32+0x34/0x40
> 1 lock held by rpmbuild/6580:
> #0: (&sb->s_type->i_mutex_key#9){......}, at: [<00000000004a5214>]
> # generic_file_aio_write+0x38/0xa4
> etc.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-15 15:41:19

by Jan Engelhardt

Subject: Re: [PATCH] writeback: Fix broken sync writeback


On Monday 2010-02-15 15:49, Jan Kara wrote:
>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>> >>
>> >> This fixes it by using the passed in page writeback count, instead of
>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>> >> finish properly even when new pages are being dirtied.
>> >
>> >This seems broken.
>>
>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>> While there is no sync/fsync to be seen in these traces, I can
>> tell there's a livelock, without Dirty decreasing at all.
>
> I don't think this is directly connected with my / Jens' patch.

I'm starting to think so too.

>Similar traces happen even without the patch (see e.g.
>http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch
>makes it worse... So are you able to reproduce these warnings and
>without the patch they did not happen?

Your patch speeds up the slow sync; without the patch, there was
no real chance to observe the hard lockup, as the slow sync would
take up all the time.

So far, no reproduction. It seems to be just as you say.

> Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0?

0000000000569554 <jbd2_journal_commit_transaction>:
56976c: 40 04 ee 62 call 6a50f4 <schedule>

Since there is an obvious schedule() call in jbd2_journal_commit_transaction's
C code, I think that's where it is.

2010-02-15 15:58:25

by Jan Kara

Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon 15-02-10 16:41:17, Jan Engelhardt wrote:
>
> On Monday 2010-02-15 15:49, Jan Kara wrote:
> >On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
> >> >>
> >> >> This fixes it by using the passed in page writeback count, instead of
> >> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
> >> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
> >> >> finish properly even when new pages are being dirtied.
>
> >> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
> >> While there is no sync/fsync to be seen in these traces, I can
> >> tell there's a livelock, without Dirty decreasing at all.
> >
> > I don't think this is directly connected with my / Jens' patch.
>
> I'm starting to think so too.
>
> >Similar traces happen even without the patch (see e.g.
> >http://bugzilla.kernel.org/show_bug.cgi?id=14830). But maybe the patch
> >makes it worse... So are you able to reproduce these warnings and
> >without the patch they did not happen?
>
> Your patch speeds up the slow sync; without the patch, there was
> no real chance to observe the hard lockup, as the slow sync would
> take up all the time.
>
> So far, no reproduction. It seems to be just as you say.
>
> > Where in the code is jbd2_journal_commit_transaction+0x218/0x15e0?
>
> 0000000000569554 <jbd2_journal_commit_transaction>:
> 56976c: 40 04 ee 62 call 6a50f4 <schedule>
>
> Since there is an obvious schedule() call in jbd2_journal_commit_transaction's
> C code, I think that's where it is.
OK. Thanks. It seems some process is spending excessive time with a
transaction open (jbd2_journal_commit_transaction waits for all handles of
a transaction to be dropped). If you see the traces again, try to obtain
stack traces of all the other processes and maybe we can catch the process
and see whether it's doing something unexpected.
The patch can have an influence on this because we now pass larger
nr_to_write to ext4_writepages so maybe that makes some corner case more
likely.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 00:06:26

by Linus Torvalds

Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Mon, 15 Feb 2010, Jan Kara wrote:
>
> Personally, I don't see why VM shouldn't generate IO from a single inode
> larger than a few MB for data integrity syncs. There are two reasons I know
> about for MAX_WRITEBACK_PAGES:

Two issues:
 - it shouldn't matter for performance (if you have a broken disk that
   wants multi-megabyte writes to get good performance, you need to fix
   the driver, not the VM)
 - I generally think that _latency_ is much more important than
   throughput, and huge writes are simply bad for latency.

But the real complaint I had about your patch was that it made no sense.

Your statement that it speeds up sync is indicative of some _other_
independent problem (perhaps starting from the beginning of the file each
time? Who knows?) and the patch _clearly_ doesn't actually address the
underlying problem, it just changes the logic in a way that makes no
obvious sense to anybody, without actually explaining why it would matter.

So if you can explain _why_ that patch makes such a big difference for
you, and actually write that up, I wouldn't mind it. But right now it was
sent as a voodoo patch with no sensible explanation for why it would
really make any difference, since the outer loop should already do what
the patch does.

Linus

2010-02-16 23:00:14

by Jan Kara

Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon 15-02-10 16:05:43, Linus Torvalds wrote:
>
>
> On Mon, 15 Feb 2010, Jan Kara wrote:
> >
> > Personally, I don't see why VM shouldn't generate IO from a single inode
> > larger than a few MB for data integrity syncs. There are two reasons I know
> > about for MAX_WRITEBACK_PAGES:
>
> Two issues:
> - it shouldn't matter for performance (if you have a broken disk that
> wants multi-megabyte writes to get good performance, you need to fix
> the driver, not the VM)
The IO size actually does matter for performance because if you switch
after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
that means a seek to a distant place, and that has a significant impact (seek
times are in the 10-20 ms range for common drives, so with 50 MB/s throughput
this is like 13-25% of the IO time...). A similar reason why it matters is
if the filesystem does delayed allocation - then allocation is performed
from writeback code, and if it happens in 4 MB chunks, the chances of
fragmentation are higher. Actually XFS, btrfs, and ext4 *work around* the
fact that the VM sends only 4 MB chunks and write more regardless of the
nr_to_write set - I agree this really sucks but that's the way it is.

> - I generally think that _latency_ is much more important than
> throughput, and huge writes are simply bad for latency.
I agree that latency matters, but is the VM the right place to decide where
to break writes into smaller chunks for latency reasons? It knows neither
about the placement of the file nor about possible concurrent requests for
that device. So I personally believe that the IO scheduler should be the
layer that cares about scheduling IO so that we have decent latencies. As
splitting bios is probably not an option, we might want an artificial
upper limit on bio size for latency reasons, but that's still way below
the VM...

> But the real complaint I had about your patch was that it made no sense.
>
> Your statement that it speeds up sync is indicative of some _other_
> independent problem (perhaps starting from the beginning of the file each
> time? Who knows?) and the patch _clearly_ doesn't actually address the
> underlying problem, it just changes the logic in a way that makes no
> obvious sense to anybody, without actually explaining why it would matter.
>
> So if you can explain _why_ that patch makes such a big difference for
> you, and actually write that up, I wouldn't mind it. But right now it was
> sent as a voodoo patch with no sensible explanation for why it would
> really make any difference, since the outer loop should already do what
> the patch does.
OK, fair enough :). The patch is actually Jens', but looking at the code
again, the nr_to_write trimming should rather happen inside
writeback_inodes_wb, not outside of it. That would make the logic clearer
and magically also fix the problem, because write_cache_pages() ignores
nr_to_write in WB_SYNC_ALL mode. But I guess it's better not to depend
on this and to explicitly handle that in writeback_inodes_wb...
So I'll post a new patch with a better changelog shortly.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-16 23:34:51

by Linus Torvalds

Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Wed, 17 Feb 2010, Jan Kara wrote:
>
> The IO size actually does matter for performance because if you switch
> after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,

No.

Dammit, read the code.

That's my whole _point_. Look at the for-loop.

We DO NOT SWITCH to another inode, because we just continue in the
for-loop.

This is why I think your patch is crap. You clearly haven't even read the
code, the patch makes no sense, and there must be something else going on
than what you _claim_ is going on.

> So I'll post a new patch with a better changelog shortly.

Not "shortly". Because you clearly haven't looked at the code. Look here:

/*
* If we consumed everything, see if we have more
*/
if (wbc.nr_to_write <= 0)
continue;

and here:

/*
* Did we write something? Try for more
*/
if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
continue;

the point is, the way the code is written, it is very much _meant_ to
write out one file in one go, except it is supposed to chunk it up so that
you never see _too_ many dirty pages in flight at any time. Because tons
of dirty pages in the queues makes for bad latency.

Now, I admit that since your patch makes a difference, there is a bug
somewhere. That's never what I've argued against. What I argue against is
your patch itself, and your explanation for it. They don't make sense.

I suspect that there is something else going on. In particular, I wonder
if multiple calls to "writeback_inodes_wb()" somehow mess up the wbc
state, and it starts writing the same inode from the beginning, or it
simply has some bug that means that it moves the inode to the head of the
queue, or something.

For example, it might be that the logic in writeback_inodes_wb() moves an
inode back (the "redirty_tail()" cases) in bad ways when it shouldn't.

See what I'm trying to say? I think your patch probably just hides the
_real_ bug. Because it really shouldn't matter if we do things in 4MB
chunks or whatever, because the for-loop should happily continue.

Linus

2010-02-17 00:03:18

by Linus Torvalds

Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Tue, 16 Feb 2010, Linus Torvalds wrote:
>
> For example, it might be that the logic in writeback_inodes_wb() moves an
> inode back (the "redirty_tail()" cases) in bad ways when it shouldn't.

Or another example: perhaps we screw up the inode list ordering when we
move the inodes between b_dirty <-> b_io <-> b_more_io?

And if we put inodes on the b_more_io list, do we perhaps then end up
doing too much waiting in inode_wait_for_writeback()?

See what I'm saying? Your patch - by just submitting the maximal sized
buffers - may well end up hiding the real problem. But if there is a real
problem in that whole list manipulation or waiting, then that problem
still exists for async writeback.

Wouldn't it be better to fix the real problem, so that async writeback
also gets the correct IO patterns?

NOTE! It's entirely possible that we do end up wanting to really submit
the maximal dirty IO for synchronous dirty writeback, in order to get
better IO patterns. So maybe your patch ends up being the right one in the
end. But I really _really_ want to understand this.

But right now, that patch seems like voodoo programming to me, and I
personally suspect that the real problem is in the horribly complex
b_io/b_more_io interaction. Or one of the _other_ horribly complex details
in the write-back logic (even just writing back a single inode is
complicated, see all the logic about "range_start/range_end" in the lower
level write_cache_pages() function).

Our whole writeback code is very very complicated. I don't like it. But
that's also why I want to feel like I understand the patch when I apply
it.

Linus

2010-02-17 01:33:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue 16-02-10 15:34:01, Linus Torvalds wrote:
> On Wed, 17 Feb 2010, Jan Kara wrote:
> >
> > The IO size actually does matter for performance because if you switch
> > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
>
> No.
>
> Dammit, read the code.
>
> That's my whole _point_. Look at the for-loop.
>
> We DO NOT SWITCH to another inode, because we just continue in the
> for-loop.
>
> This is why I think your patch is crap. You clearly haven't even read the
> code, the patch makes no sense, and there must be something else going on
> than what you _claim_ is going on.
I've read the code. Maybe I'm missing something but look:
writeback_inodes_wb(nr_to_write = 1024)
-> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
...
writeback_single_inode()
...writes 1024 pages.
if we haven't written everything in the inode (more than 1024 dirty
pages) we end up doing either requeue_io() or redirty_tail(). In the
first case the inode is put to b_more_io list, in the second case to
the tail of b_dirty list. In either case it will not receive further
writeout until we go through all other members of current b_io list.

So I claim we currently *do* switch to another inode after 4 MB. That
is a fact.
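The queue rotation described here can be sketched as a toy Python simulation (not kernel code; the names only mirror the kernel lists, and the b_more_io / b_dirty requeue targets are collapsed into one queue for brevity):

```python
from collections import deque

MAX_WRITEBACK_PAGES = 1024  # 4 MB in 4 KB pages

def writeback_pass(b_dirty):
    """One writeback_inodes_wb()-style pass: every inode on b_io gets at
    most MAX_WRITEBACK_PAGES written, then goes to the back of the line if
    it is still dirty (requeue_io()/redirty_tail() collapsed into one)."""
    b_io = deque(b_dirty)            # queue_io(): splice b_dirty onto b_io
    b_dirty.clear()
    order = []                       # (inode, pages written) per service
    while b_io:
        name, dirty = b_io.popleft()
        wrote = min(dirty, MAX_WRITEBACK_PAGES)
        order.append((name, wrote))
        if dirty > wrote:            # still dirty: requeued behind the rest
            b_dirty.append((name, dirty - wrote))
    return order

# A 3000-page (~12 MB) file interleaved with two small files: the big
# inode is serviced only once per pass, i.e. writeback *does* switch
# inodes after 4 MB, exactly as claimed above.
b_dirty = [("big", 3000), ("small1", 10), ("small2", 10)]
passes = []
while b_dirty:
    passes.append(writeback_pass(b_dirty))
```

Running this, the big inode drains in three separate passes (1024 + 1024 + 952 pages), with the other inodes serviced in between.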

I *think* it is by design - mainly to avoid the situation where someone
continuously writes a huge file and kupdate or pdflush would never get to
writing other files with dirty data (at least that's the impression I've built
over the years - heck, even 2.6.16 seems to have this redirty_tail logic
with a comment about the above livelock).

I do find this design broken as well as you likely do and think that the
livelock issue described in the above paragraph should be solved differently
(e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.

The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
I think that changing the writeback logic so that it does not switch inodes
after 4 MB is too risky for these two kernels. So with the above
explanation would you accept some fix along the lines of original Jens'
fix?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-17 01:58:17

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Wed, Feb 17, 2010 at 02:33:37AM +0100, Jan Kara wrote:
> On Tue 16-02-10 15:34:01, Linus Torvalds wrote:
> > On Wed, 17 Feb 2010, Jan Kara wrote:
> > >
> > > The IO size actually does matter for performance because if you switch
> > > after 4 MB (current value of MAX_WRITEBACK_PAGES) to writing another inode,
> >
> > No.
> >
> > Dammit, read the code.
> >
> > That's my whole _point_. Look at the for-loop.
> >
> > We DO NOT SWITCH to another inode, because we just continue in the
> > for-loop.
> >
> > This is why I think your patch is crap. You clearly haven't even read the
> > code, the patch makes no sense, and there must be something else going on
> > than what you _claim_ is going on.
> I've read the code. Maybe I'm missing something but look:
> writeback_inodes_wb(nr_to_write = 1024)
> -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
> ...
> writeback_single_inode()
> ...writes 1024 pages.
> if we haven't written everything in the inode (more than 1024 dirty
> pages) we end up doing either requeue_io() or redirty_tail(). In the
> first case the inode is put to b_more_io list, in the second case to
> the tail of b_dirty list. In either case it will not receive further
> writeout until we go through all other members of current b_io list.
>
> So I claim we currently *do* switch to another inode after 4 MB. That
> is a fact.

That is my understanding of how it works, too.

> I *think* it is by design - mainly to avoid the situation where someone
> continuously writes a huge file and kupdate or pdflush would never get to
> writing other files with dirty data (at least that's impression I've built
> over the years - heck, even 2.6.16 seems to have this redirty_tail logic
> with a comment about the above livelock).

Right, and there is another condition as well - writing lots of
small files could starve large file writeback as the large file only
got 4MB written back every 30s writeback period because it took that
long to write out all the new files.

IIRC it was in 2.6.16 that both these problems were fixed...

> I do find this design broken as well as you likely do and think that the
> livelock issue described in the above paragraph should be solved differently
> (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.
>
> The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

We've had this sync() writeback behaviour for a long time - my
question is why is it only now a problem?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-17 03:36:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Wed, 17 Feb 2010, Jan Kara wrote:
>
> I've read the code. Maybe I'm missing something but look:
> writeback_inodes_wb(nr_to_write = 1024)
> -> queue_io() - queues inodes from wb->b_dirty list to wb->b_io list
> ...
> writeback_single_inode()
> ...writes 1024 pages.
> if we haven't written everything in the inode (more than 1024 dirty
> pages) we end up doing either requeue_io() or redirty_tail(). In the
> first case the inode is put to b_more_io list, in the second case to
> the tail of b_dirty list. In either case it will not receive further
> writeout until we go through all other members of current b_io list.
>
> So I claim we currently *do* switch to another inode after 4 MB. That
> is a fact.

Ok, I think that's the bug. I do agree that it may well be intentional,
but considering the performance impact, I suspect it's been "intentional
without any performance numbers".

Which just makes me very unhappy to just paper it over for the sync case,
and leave the now known-broken state alone for the async case. That really
isn't how we want to do things.

That said, if we've done this forever, I can certainly see the allure to
just keep doing it, and then handle the sync case separately.

> I do find this design broken as well as you likely do and think that the
> livelock issue described in the above paragraph should be solved differently
> (e.g. by http://lkml.org/lkml/2010/2/11/321) but that's not a quick fix.

Hmm. The thing is, the new radix tree bit you propose also sounds like
overdesigning things.

If we really do switch inodes (which I obviously didn't expect, even if I
may have been aware of it many years ago), then the max rate limiting is
just always bad.

If it's bad for synchronous syncs, then it's bad for background syncing
too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely -
since the whole latency argument goes away if we don't always honor it
("Oh, we have good latency - _except_ if you do 'sync()' to synchronously
write something out" - that's just insane).

> The question is what to do now for 2.6.33 and 2.6.32-stable. Personally,
> I think that changing the writeback logic so that it does not switch inodes
> after 4 MB is too risky for these two kernels. So with the above
> explanation would you accept some fix along the lines of original Jens'
> fix?

What is affected if we just remove MAX_WRITEBACK_PAGES entirely (as
opposed to the patch under discussion that effectively removes it for
WB_SYNC_ALL)?

I see balance_dirty_pages -> bdi_start_writeback, but that if anything
would be something that I think would be better off with efficient
writeback, and doesn't seem like it should try to round-robin over inodes
for latency reasons.

But I guess we can do it in stages, if it's about "minimal changes for
2.6.32/33".

Linus

2010-02-17 04:30:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue, Feb 16, 2010 at 07:35:35PM -0800, Linus Torvalds wrote:
> > writeback_single_inode()
> > ...writes 1024 pages.
> > if we haven't written everything in the inode (more than 1024 dirty
> > pages) we end up doing either requeue_io() or redirty_tail(). In the
> > first case the inode is put to b_more_io list, in the second case to
> > the tail of b_dirty list. In either case it will not receive further
> > writeout until we go through all other members of current b_io list.
> >
> > So I claim we currently *do* switch to another inode after 4 MB. That
> > is a fact.
>
> Ok, I think that's the bug. I do agree that it may well be intentional,
> but considering the performance impact, I suspect it's been "intentional
> without any performance numbers".

This is well known amongst file system developers. We've even raised
it from time to time, but apparently most people are too scared to
touch the writeback code. I proposed upping the limit some six months
ago, but I got serious pushback. As a result, I followed XFS's lead,
and so now, both XFS and ext4 will write more pages than what is
requested by the writeback logic, to work around this bug.....

What we really want to do is to time how fast the device is. If the
device is some Piece of Sh*t USB stick, then maybe you only want to
write 4MB at a time to avoid latency problems. Heck, maybe you only
want to write 32k at a time, if it's really slow.... But if it's some
super-fast RAID array, maybe you want to write a lot more than 4MB at
a time.

We've had this logic for a long time, and given the increase in disk
density, and spindle speeds, the 4MB limit, which might have made
sense 10 years ago, probably doesn't make sense now.

> If it's bad for synchronous syncs, then it's bad for background syncing
> too, and I'd rather get rid of the MAX_WRITEBACK_PAGES thing entirely -
> since the whole latency argument goes away if we don't always honor it
> ("Oh, we have good latency - _except_ if you do 'sync()' to synchronously
> write something out" - that's just insane).

I tried arguing for this six months ago, and got the argument that it
might cause latency problems on slow USB sticks. So I added a forced
override for ext4, which now writes 128MB at a time --- with a sysfs
tuning knob that allows the old behaviour to be restored if users
really complained. No one did actually complain....

- Ted

2010-02-17 05:17:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback



On Tue, 16 Feb 2010, [email protected] wrote:
>
> We've had this logic for a long time, and given the increase in disk
> density, and spindle speeds, the 4MB limit, which might have made
> sense 10 years ago, probably doesn't make sense now.

I still don't think that 4MB is enough on its own to suck quite that
much. Even a fast device should be perfectly happy with 4MB IOs, or it
must be sucking really badly.

In order to see the kinds of problems that got quoted in the original
thread, there must be something else going on too, methinks (disk light
was "blinking").

So I would guess that it's also getting stuck on that

inode_wait_for_writeback(inode);

inside that loop in wb_writeback().

In fact, I'm starting to wonder about that "Nothing written" case. The
code basically decides that "if I wrote zero pages, I didn't write
anything at all, so I must wait for the inode to complete old writes in
order to not busy-loop". Which sounds sensible on the face of it, but the
thing is, inodes can be dirty without actually having any dirty _pages_
associated with them.

Are we perhaps ending up in a situation where we essentially wait
synchronously on just the inode itself being written out? That would
explain the "40kB/s" kind of behavior.

If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s
throughput.

But if we do a 4MB chunk (for the one file that had real dirty data in
it), and then do a few hundred trivial "write out the inode data
_synchronously_" (due to access time changes etc) in between until we hit
the file that has real dirty data again - now _that_ would explain 40kB/s
throughput. It's not just seeking around - it's not even trying to push
multiple IO's to get any elevator going or anything like that.

And then the patch that started this discussion makes sense: it improves
performance because in between those synchronous inode updates it now
writes big chunks. But again, it's mostly hiding us just doing insane
things.

I dunno. Just a theory. The more I look at that code, the uglier it looks.
And I do get the feeling that the "4MB chunking" is really just making the
more fundamental problems show up.

Linus

2010-02-22 17:31:56

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue 16-02-10 21:16:46, Linus Torvalds wrote:
> On Tue, 16 Feb 2010, [email protected] wrote:
> >
> > We've had this logic for a long time, and given the increase in disk
> > density, and spindle speeds, the 4MB limit, which might have made
> > sense 10 years ago, probably doesn't make sense now.
>
> I still don't think that 4MB is enough on its own to suck quite that
> much. Even a fast device should be perfectly happy with 4MB IOs, or it
> must be sucking really badly.
>
> In order to see the kinds of problems that got quoted in the original
> thread, there must be something else going on too, methinks (disk light
> was "blinking").
<snip>
> Are we perhaps ending up in a situation where we essentially wait
> synchronously on just the inode itself being written out? That would
> explain the "40kB/s" kind of behavior.
>
> If we were actually doing real 4MB chunks, that would _not_ explain 40kB/s
> throughput.
Yes, it is actually 400kB/s but still you're right that that seems too
low if the only problem were seeks. I was looking at the code and it's an
even bigger mess than I thought :(.

a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
(when 1024 is passed in nr_to_write). Writeback code kind of expects that
in WB_SYNC_ALL mode all dirty pages in the given range are written (the
same way as write_cache_pages does that).

b) because of delayed allocation, inode is redirtied during ->writepages
call and thus writeback_single_inode calls redirty_tail at it. Thus each
inode will be written at least twice (synchronously, which means a
transaction commit and a disk cache flush for each such write).

c) the livelock avoidance in writeback_inodes_wb does not work because the
function exits as soon as nr_to_write (set to 1024) gets to 0 and thus
the 'start' timestamp always gets renewed.

d) ext4_writepage never succeeds in writing a page with delayed-allocated
data. So pageout() function never succeeds in cleaning a page on ext4.
I think that when other problems in writeback code make writeout slow (like
in Jan Engelhardt's case), this can bite us and I assume this might be the
reason why Jan saw kswapd active doing some work during his problems.
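Point c) can be modelled numerically. The sketch below (plain Python with hypothetical numbers, not kernel code) treats the 'start' timestamp as an eligibility cutoff: if it is renewed on every re-entry, pages dirtied during the sync keep qualifying, and a constant dirtying load prevents sync from ever finishing:

```python
def run_sync(total_dirty, dirty_rate, chunk, renew_start, max_steps=10_000):
    """Write `chunk` eligible pages per step. With renew_start=False only
    pages dirty before the sync started are eligible (the intended
    livelock avoidance); with renew_start=True pages dirtied during the
    sync become eligible too (the broken behaviour in point c)."""
    eligible = total_dirty
    steps = 0
    while eligible > 0 and steps < max_steps:
        eligible -= min(chunk, eligible)
        if renew_start:
            eligible += dirty_rate   # freshly dirtied pages now qualify
        steps += 1
    return steps, eligible

# The intended behaviour drains 8192 pages in 8 chunks of 1024; the
# broken behaviour never drains when pages are dirtied as fast as they
# are written out.
```

Under a constant load equal to the writeout rate, the renewed-timestamp variant hits the step cap with the backlog undiminished, which is the livelock being described.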

> But if we do a 4MB chunk (for the one file that had real dirty data in
> it), and then do a few hundred trivial "write out the inode data
> _synchronously_" (due to access time changes etc) in between until we hit
> the file that has real dirty data again - now _that_ would explain 40kB/s
> throughput. It's not just seeking around - it's not even trying to push
> multiple IO's to get any elevator going or anything like that.
>
> And then the patch that started this discussion makes sense: it improves
> performance because in between those synchronous inode updates it now
> writes big chunks. But again, it's mostly hiding us just doing insane
> things.
I'm not quite sure whether any of the above problems are really causing the
sync writeback to be so slow. At least problems a) and c) would be
worked-around by the patch setting nr_to_write to LONG_MAX for sync
writeback and the effect of b) would be mitigated to just two synchronous
inode writes instead of (dirty_pages + 8191) / 8192 + 1 sync writes.
For c) I think the original patch is actually the right solution but I
agree that it just hides the other problems...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-22 21:01:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote:
>
> a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
> (when 1024 is passed in nr_to_write). Writeback code kind of expects that
> in WB_SYNC_ALL mode all dirty pages in the given range are written (the
> same way as write_cache_pages does that).

Well, we return after writing 128MB because of the magic
s_max_writeback_mb_bump. The fact that nr_to_write limits the number
of pages which are written is something which is intentional to the
writeback code. I've disagreed with it, but I don't think it would be
legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that
what you are saying we should do? (If it is indeed legit to ignore
nr_to_write, I would have done it a long time ago; I introduced
s_max_writeback_mb_bump instead as a workaround to what I consider to
be a serious misfeature in the writeback code.)

> b) because of delayed allocation, inode is redirtied during ->writepages
> call and thus writeback_single_inode calls redirty_tail at it. Thus each
> inode will be written at least twice (synchronously, which means a
> transaction commit and a disk cache flush for each such write).

Hmm, does this happen with XFS, too? If not, I wonder how they handle
it? And whether we need to push a solution into the generic layers.

> d) ext4_writepage never succeeds to write a page with delayed-allocated
> data. So pageout() function never succeeds in cleaning a page on ext4.
> I think that when other problems in writeback code make writeout slow (like
> in Jan Engelhardt's case), this can bite us and I assume this might be the
> reason why Jan saw kswapd active doing some work during his problems.

Yeah, I've noticed this. What it means is that if we have a massive
memory pressure in a particular zone, pages which are subject to
delayed allocation won't get written out by mm/vmscan.c. Anonymous
pages will be written out to swap, and data pages which are re-written
via random access mmap() (and so we know where they will be written on
disk) will get written, and that's not a problem. So with relatively
large zones, it happens, but most of the time I don't think it's a
major problem.

I am worried about this issue in certain configurations where pseudo
NUMA zones have been created and are artificially really tiny (128MB)
for container support, but that's not a standard upstream thing.

This is done to avoid a lock inversion, and so this is an
ext4-specific thing (at least I don't think XFS's delayed allocation
has this misfeature). It would be interesting if we have documented
evidence that this is easily triggered under normal situations. If
so, we should look into figuring out how to fix this...

- Ted

2010-02-22 22:26:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon 22-02-10 16:01:12, [email protected] wrote:
> On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote:
> >
> > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
> > (when 1024 is passed in nr_to_write). Writeback code kind of expects that
> > in WB_SYNC_ALL mode all dirty pages in the given range are written (the
> > same way as write_cache_pages does that).
>
> Well, we return after writing 128MB because of the magic
I think it's really just 32 MB because the writeback code passes nr_to_write
1024 -> ext4 bumps that to 8192 pages, which is 32 MB. As I understand it,
128 MB is just an absolute maximum over which we never bump.

> s_max_writeback_mb_bump. The fact that nr_to_write limits the number
> of pages which are written is something which is intentional to the
> writeback code. I've disagreed with it, but I don't think it would be
> legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that
> what you are saying we should do? (If it is indeed legit to ignore
The behavior of nr_to_write for WB_SYNC_ALL mode was always kind of
fuzzy. Until Jens rewrote the writeback code, we never passed anything
different than LONG_MAX (or similarly huge values) when doing WB_SYNC_ALL
writeback. Also write_cache_pages which is used as writepages
implementation for most filesystems (ext2, ext3, udf, ...) ignores
nr_to_write for WB_SYNC_ALL writeback. So until recently we actually never
had to decide whether write_cache_pages behavior is a bug or de-facto
standard...
Since it's hard to avoid livelocking problems when synchronous writeback
would stop earlier than after writing the whole range, I'm inclined to
standardize on ignoring the nr_to_write limit for synchronous writeback.
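The write_cache_pages() termination rule being described can be paraphrased as follows (a Python sketch of the logic, not the kernel source): nr_to_write only stops the loop for WB_SYNC_NONE, while WB_SYNC_ALL writes the whole range:

```python
WB_SYNC_NONE, WB_SYNC_ALL = 0, 1

def write_cache_pages_sketch(dirty_pages, nr_to_write, sync_mode):
    """Loop over the dirty pages in the range; honour the nr_to_write
    quota only for asynchronous (WB_SYNC_NONE) writeback."""
    written = 0
    for _page in range(dirty_pages):
        written += 1
        nr_to_write -= 1
        if nr_to_write <= 0 and sync_mode == WB_SYNC_NONE:
            break                    # async writeback stops at the quota
    return written                   # sync writeback ignored it entirely

assert write_cache_pages_sketch(8192, 1024, WB_SYNC_NONE) == 1024
assert write_cache_pages_sketch(8192, 1024, WB_SYNC_ALL) == 8192
```

This is the de-facto standard most filesystems inherit, and it is exactly the behaviour ext4's writepages path diverges from in point a) above.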

> nr_to_write, I would have done it a long time ago; I introduced
> s_max_writeback_mb_bump instead as a workaround to what I consider to
> be a serious misfeature in the writeback code.)
I agree that nr_to_write currently brings more problems than advantages.

> > b) because of delayed allocation, inode is redirtied during ->writepages
> > call and thus writeback_single_inode calls redirty_tail at it. Thus each
> > inode will be written at least twice (synchronously, which means a
> > transaction commit and a disk cache flush for each such write).
>
> Hmm, does this happen with XFS, too? If not, I wonder how they handle
> it? And whether we need to push a solution into the generic layers.
Yes, I think so. As soon as you mark an inode dirty during the writepages
call, the dirty flag won't be cleared even though ->write_inode will be
called after that. So a right fix would IMO be to move clearing of
I_DIRTY_SYNC and I_DIRTY_DATASYNC flag after the writepages call. Or we
could even put some dirty bit handling inside of ->write_inode because
sometimes it would be useful if a filesystem would know in its
->write_inode method what dirty bits were set...
Another thing is that in particular in ext[34] case, we would be much
better off with submitting all the inode writes and then waiting just once.
This is actually what should usually happen because sync calls asynchronous
writeback first and only afterwards a synchronous one. But when
the machine is under a constant IO load and the livelock avoidance code
is invalidated by passing nr_to_write == 1024 to sync writeback, we end up
doing a lot of sync IO and thus this problem becomes much more visible.

> > d) ext4_writepage never succeeds to write a page with delayed-allocated
> > data. So pageout() function never succeeds in cleaning a page on ext4.
> > I think that when other problems in writeback code make writeout slow (like
> > in Jan Engelhardt's case), this can bite us and I assume this might be the
> > reason why Jan saw kswapd active doing some work during his problems.
>
> Yeah, I've noticed this. What it means is that if we have a massive
> memory pressure in a particular zone, pages which are subject to
> delayed allocation won't get written out by mm/vmscan.c. Anonymous
> pages will be written out to swap, and data pages which are re-written
> via random access mmap() (and so we know where they will be written on
> disk) will get written, and that's not a problem. So with relatively
> large zones, it happens, but most of the time I don't think it's a
> major problem.
Yes, I also don't think it's a major problem in common scenarios.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-02-23 02:54:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon, Feb 22, 2010 at 04:01:12PM -0500, [email protected] wrote:
> On Mon, Feb 22, 2010 at 06:29:38PM +0100, Jan Kara wrote:
> >
> > a) ext4_da_writepages returns after writing 32 MB even in WB_SYNC_ALL mode
> > (when 1024 is passed in nr_to_write). Writeback code kind of expects that
> > in WB_SYNC_ALL mode all dirty pages in the given range are written (the
> > same way as write_cache_pages does that).
>
> Well, we return after writing 128MB because of the magic
> s_max_writeback_mb_bump. The fact that nr_to_write limits the number
> of pages which are written is something which is intentional to the
> writeback code. I've disagreed with it, but I don't think it would be
> legit to completely ignore nr_to_write in WB_SYNC_ALL mode --- is that
> what you are saying we should do? (If it is indeed legit to ignore
> nr_to_write, I would have done it a long time ago; I introduced
> s_max_writeback_mb_bump instead as a workaround to what I consider to
> be a serious misfeature in the writeback code.)

Ignoring nr_to_write completely can lead to issues like we used to
have with XFS - it would write an entire extent (8GB) at a time and
starve all other writeback. Those starvation problems - which were
very obvious on NFS servers - went away when we trimmed back the
amount to write in a single pass to saner amounts...

> > b) because of delayed allocation, inode is redirtied during ->writepages
> > call and thus writeback_single_inode calls redirty_tail at it. Thus each
> > inode will be written at least twice (synchronously, which means a
> > transaction commit and a disk cache flush for each such write).
>
> Hmm, does this happen with XFS, too? If not, I wonder how they handle
> it? And whether we need to push a solution into the generic layers.

Yes, it does - XFS dirties the inode during delayed allocation. In
the worst case XFS writes back the inode twice, synchronously.
However, XFS inodes carry their own dirty state and so the second
flush is often a no-op because the inode is actually clean and not
pinned in the log any more. The inode is only dirty the second time
around if the file size was updated as a result of IO completion.

FWIW, we've only just changed the XFS code to issue transactions for
sync inode writes instead of writing the inode directly because of
changes to the async inode writeback that avoid writing the inode
over and over again while it still has dirty data on it (coming in
2.6.34). These changes mean we need to issue a transaction to
capture any unlogged changes to the inode during sync writes, but we
still avoid the second transaction if the XFS inode is cleaned by
the first transaction.

As to a generic solution, why do you think I've been advocating
separate per-sb data sync and inode writeback methods that separate
data writeback from inode writeback for so long? ;)


> > d) ext4_writepage never succeeds to write a page with delayed-allocated
> > data. So pageout() function never succeeds in cleaning a page on ext4.
> > I think that when other problems in writeback code make writeout slow (like
> > in Jan Engelhardt's case), this can bite us and I assume this might be the
> > reason why Jan saw kswapd active doing some work during his problems.

[...]

> This is done to avoid a lock inversion, and so this is an
> ext4-specific thing (at least I don't think XFS's delayed allocation
> has this misfeature).

Not that I know of, but then again I don't know what inversion ext4
is trying to avoid. Can you describe the inversion, Ted?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-23 03:23:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote:
>
> Ignoring nr_to_write completely can lead to issues like we used to
> have with XFS - it would write an entire extent (8GB) at a time and
> starve all other writeback. Those starvation problems - which were
> very obvious on NFS servers - went away when we trimmed back the
> amount to write in a single pass to saner amounts...

How do you determine what a "sane amount" is? Is it something that is
determined dynamically, or is it a hard-coded or manually tuned value?

> As to a generic solution, why do you think I've been advocating
> separate per-sb data sync and inode writeback methods that separate
> data writeback from inode writeback for so long? ;)

Heh.

> > This is done to avoid a lock inversion, and so this is an
> > ext4-specific thing (at least I don't think XFS's delayed allocation
> > has this misfeature).
>
> Not that I know of, but then again I don't know what inversion ext4
> is trying to avoid. Can you describe the inversion, Ted?

The locking order is journal_start_handle (starting a micro
transaction in jbd) -> lock_page. A more detailed description of why
this locking order is non-trivial for us to fix in ext4 can be found
in the description of commit f0e6c985.

Regards,

- Ted

2010-02-23 05:53:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Mon, Feb 22, 2010 at 10:23:17PM -0500, [email protected] wrote:
> On Tue, Feb 23, 2010 at 01:53:50PM +1100, Dave Chinner wrote:
> >
> > Ignoring nr_to_write completely can lead to issues like we used to
> > have with XFS - it would write an entire extent (8GB) at a time and
> > starve all other writeback. Those starvation problems - which were
> > very obvious on NFS servers - went away when we trimmed back the
> > amount to write in a single pass to saner amounts...
>
> How do you determine what a "sane amount" is? Is it something that is
> determined dynamically, or is it a hard-coded or manually tuned value?

"sane amount" was a hard coded mapping look-ahead value that limited
the size of the allocation that was done. It got set to 64 pages,
back in 2.6.16 (IIRC) so that we'd do at least a 256k allocation
on x86 or 1MB on ia64 (which had 16k page size at the time).
It was previously unbound, which is why XFS was mapping entire
extents...

Hence ->writepage() on XFS will write 64 pages at a time if there
are contiguous pages in the page cache, and then go back up to
write_cache_pages() for the loop to either terminate due to
nr_to_write <= 0 or call ->writepage on the next dirty page not
under writeback on the inode.

This behaviour pretty much defines the _minimum IO size_ we'd issue
if there are contiguous dirty pages in the page cache. It was (and
still is) primarily to work around the deficiencies of the random
single page writeback that the VM's memory reclaim algorithms are so
fond of (another of my pet peeves)....
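The clamp described here amounts to the following arithmetic (a sketch; the 64-page constant and the page sizes are the values from the mail above):

```python
XFS_LOOKAHEAD_PAGES = 64             # per-writepage mapping look-ahead

def writepage_cluster_bytes(contig_dirty_pages, page_size):
    """Minimum IO one ->writepage call is clustered into: up to 64
    contiguous dirty pages, never the whole (possibly 8 GB) extent."""
    return min(contig_dirty_pages, XFS_LOOKAHEAD_PAGES) * page_size

assert writepage_cluster_bytes(10_000, 4096) == 256 * 1024    # x86: 256 KB
assert writepage_cluster_bytes(10_000, 16384) == 1024 * 1024  # ia64: 1 MB
```

So a single ->writepage call issues at least a quarter-megabyte of IO when the dirty pages are contiguous, while control still returns to write_cache_pages() often enough for nr_to_write to bound the pass.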

> > As to a generic solution, why do you think I've been advocating
> > separate per-sb data sync and inode writeback methods that separate
> > data writeback from inode writeback for so long? ;)
>
> Heh.
>
> > > This is done to avoid a lock inversion, and so this is an
> > > ext4-specific thing (at least I don't think XFS's delayed allocation
> > > has this misfeature).
> >
> > Not that I know of, but then again I don't know what inversion ext4
> > is trying to avoid. Can you describe the inversion, Ted?
>
> The locking order is journal_start_handle (starting a micro
> transaction in jbd) -> lock_page. A more detailed description of why
> this locking order is non-trivial for us to fix in ext4 can be found
> in the description of commit f0e6c985.

Nasty - you need to start a transaction before you lock pages for
writeback and allocation, but ->writepage hands you a locked page.
And you can't use an already-open transaction handle because you
can't guarantee that you have journal credits reserved for the
allocation? IIUC, ext3/4 has this problem due to the ordered data
writeback constraints, right?

Regardless of the reason for the ext4 behaviour, you are right about
XFS - it doesn't have this particular problem with delalloc because
it does not have any transaction/page lock ordering constraints in the
transaction subsystem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-02-24 14:56:37

by Jan Kara

Subject: Re: [PATCH] writeback: Fix broken sync writeback

On Tue 23-02-10 16:53:35, Dave Chinner wrote:
> > > > This is done to avoid a lock inversion, and so this is an
> > > > ext4-specific thing (at least I don't think XFS's delayed allocation
> > > > has this misfeature).
> > >
> > > Not that I know of, but then again I don't know what inversion ext4
> > > is trying to avoid. Can you describe the inversion, Ted?
> >
> > The locking order is journal_start_handle (starting a micro
> > transaction in jbd) -> lock_page. A more detailed description of why
> > this locking order is non-trivial for us to fix in ext4 can be found
> > in the description of commit f0e6c985.
>
> Nasty - you need to start a transaction before you lock pages for
> writeback and allocation, but ->writepage hands you a locked page.
> And you can't use an existing transaction handle open because you
> can't guarantee that you have journal credits reserved for the
> allocation?
Exactly.

> IIUC, ext3/4 has this problem due to the ordered data writeback
> constraints, right?
Not quite. I don't know how XFS solves this, but in ext3/4 starting
a transaction can block (waiting for journal space) until all users of
the previous transaction are done with it and it can be committed. Thus
transaction start / stop behaves just like an ordinary lock. Because
you need a transaction started when writing a page (for metadata
updates), some lock ordering is forced between the page lock and
transaction start / stop. Ext4 chose transaction -> page lock (which
makes ->writepages more efficient and ->writepage hard), whereas ext3
has page lock -> transaction (so it has a working ->writepage).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-06-27 16:44:46

by Jan Engelhardt

Subject: Re: [PATCH] writeback: Fix broken sync writeback


Hi Jan Kara,


On Monday 2010-02-15 16:41, Jan Engelhardt wrote:
>On Monday 2010-02-15 15:49, Jan Kara wrote:
>>On Sat 13-02-10 13:58:19, Jan Engelhardt wrote:
>>> >>
>>> >> This fixes it by using the passed in page writeback count, instead of
>>> >> doing MAX_WRITEBACK_PAGES batches, which gets us much better performance
>>> >> (Jan reports it's up from ~400KB/sec to 10MB/sec) and makes sync(1)
>>> >> finish properly even when new pages are being dirtied.
>>> >
>>> >This seems broken.
>>>
>>> It seems so. Jens, Jan Kara, your patch does not entirely fix this.
>>> While there is no sync/fsync to be seen in these traces, I can
>>> tell there's a livelock, without Dirty decreasing at all.

What ultimately became of the discussion and/or the patch?

Your original ad-hoc patch certainly still does its job; I've had no
need to reboot in 86 days and counting.