2009-07-28 19:11:40

by Chad Talbott

Subject: Bug in kernel 2.6.31, Slow wb_kupdate writeout

I run a simple workload on a 4GB machine which dirties a few largish
inodes like so:

# seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
bs=1024k count=100

While the dds are running data is written out at disk speed. However,
once the dds have run to completion and exited there is ~500MB of
dirty memory left. Background writeout then takes about 3 more
minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
can see that the disk is capable of 40MB/s, which finishes off the
files in ~10s. [1]

An interesting recent-ish change is "writeback: speed up writeback of
big dirty files." When I revert the change to __sync_single_inode the
problem appears to go away and background writeout proceeds at disk
speed. Interestingly, that code is in the git commit [2], but not in
the post to LKML. [3] This may not be the fix, but it makes this
test behave better.

Thanks,
Chad

[1] I've plotted the dirty memory from /proc/meminfo and disk write
speed from iostat at
http://sites.google.com/site/cwtlinux/2-6-31-writeback-bug
[2] git commit:
http://mirror.celinuxforum.org/gitstat/commit-detail.php?commit=8bc3be2751b4f74ab90a446da1912fd8204d53f7
[3] LKML post: http://marc.info/?l=linux-kernel&m=119131601130372&w=2


2009-07-28 21:49:48

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files." When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed. Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3] This is may not be the fix, but it makes this
> test behave better.

I'm fairly sure this is not fixing the root cause - but putting it at the head
rather than the tail of the queue causes the error not to starve wb_kupdate
for nearly so long - as long as we keep the queue full, the bug is hidden.

M.

2009-07-29 07:15:57

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<[email protected]> wrote:
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files." When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed. Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3] This is may not be the fix, but it makes this
>> test behave better.
>
> I'm fairly sure this is not fixing the root cause - but putting it at the head
> rather than the tail of the queue causes the error not to starve wb_kupdate
> for nearly so long - as long as we keep the queue full, the bug is hidden.

OK, it seems this is the root cause - I wasn't clear why all the pages weren't
being written back, and thought there was another bug. What happens is
we go into write_cache_pages, and stuff the disk queue with as much as
we can put into it, and then inevitably hit the congestion limit.

Then we back out to __sync_single_inode, who says "huh, you didn't manage
to write your whole slice", and penalizes the poor blameless inode in question
by putting it back into the penalty box for 30s.

This results in very lumpy I/O writeback at 5s intervals, and very
poor throughput.

Patch below is inline and probably text munged, but is for RFC only.
I'll test it
more thoroughly tomorrow. As for the comment about starving other writes,
I believe requeue_io moves it from s_io to s_more_io which should at least
allow some progress of other files.
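
For reference, here is roughly what those two helpers look like in
2.6.31's fs/fs-writeback.c (paraphrased from memory, so check your tree
for the exact code):

/*
 * redirty_tail() puts the inode back on s_dirty and may refresh its
 * dirtied_when stamp, so kupdate won't consider it again until it
 * "expires" (dirty_expire_centisecs - the 30s penalty box above).
 */
static void redirty_tail(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;

        if (!list_empty(&sb->s_dirty)) {
                struct inode *tail_inode;

                tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
                if (time_before(inode->dirtied_when,
                                tail_inode->dirtied_when))
                        inode->dirtied_when = jiffies;
        }
        list_move(&inode->i_list, &sb->s_dirty);
}

/*
 * requeue_io() keeps the inode on s_more_io, so it is revisited as soon
 * as the current s_io list is exhausted.
 */
static void requeue_io(struct inode *inode)
{
        list_move(&inode->i_list, &inode->i_sb->s_more_io);
}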

--- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 00:11:28.000000000 -0700
@@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
- * the inode; Move it from s_io onto s_more_io/s_dirty.
+ * the inode; Move it from s_io onto s_more_io. It
+ * may well have just encountered congestion
*/
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
- /*
- * For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
- */
- inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
- /*
- * slice used up: queue for next turn
- */
- requeue_io(inode);
- } else {
- /*
- * somehow blocked: retry later
- */
- redirty_tail(inode);
- }
- } else {
- /*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
- */
- inode->i_state |= I_DIRTY_PAGES;
- redirty_tail(inode);
- }
+ inode->i_state |= I_DIRTY_PAGES;
+ requeue_io(inode);
} else if (inode->i_state & I_DIRTY) {
/*
* Someone redirtied the inode while were writing back

2009-07-29 11:43:55

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<[email protected]> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files."  When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed.  Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3]  This is may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
>
> OK, it seems this is the root cause - I wasn't clear why all the pages weren't
> being written back, and thought there was another bug. What happens is
> we go into write_cache_pages, and stuff the disk queue with as much as
> we can put into it, and then inevitably hit the congestion limit.
>
> Then we back out to __sync_single_inode, who says "huh, you didn't manage
> to write your whole slice", and penalizes the poor blameless inode in question
> by putting it back into the penalty box for 30s.
>
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.

You are right, so let's fix the congestion case. Your analysis would
make a perfect changelog :)

> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it
> more thoroughly tomorrow. As for the comment about starving other writes,
> I believe requeue_io moves it from s_io to s_more_io which should at least
> allow some progress of other files.
>
> --- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
> /*
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything. Redirty
[snip]
> - if (wbc->nr_to_write <= 0) {
> - /*
> - * slice used up: queue for next turn
> - */
> - requeue_io(inode);
> - } else {
> - /*
> - * somehow blocked: retry later
> - */
> - redirty_tail(inode);

Removing this line can be dangerous - we'll probably go into busy
waiting (I have tried that long long ago).

Chad, can you try this small patch? Thank you.

Thanks,
Fengguang
---
fs/fs-writeback.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
                                 * soon as the queue becomes uncongested.
                                 */
                                inode->i_state |= I_DIRTY_PAGES;
-                               if (wbc->nr_to_write <= 0) {
+                               if (wbc->nr_to_write <= 0 ||
+                                   wbc->encountered_congestion) {
                                        /*
                                         * slice used up: queue for next turn
                                         */

2009-07-29 14:11:18

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
>                                  * soon as the queue becomes uncongested.
>                                  */
>                                 inode->i_state |= I_DIRTY_PAGES;
> -                               if (wbc->nr_to_write <= 0) {
> +                               if (wbc->nr_to_write <= 0 ||
> +                                   wbc->encountered_congestion) {
>                                         /*
>                                          * slice used up: queue for next turn
>                                          */
>

That's not sufficient - it only fixes the problem in the wb_kupdate path. If you want
to be more conservative, how about we do this?

--- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 07:08:48.000000000 -0700
@@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
* the inode; Move it from s_io onto s_more_io/s_dirty.
+ * It may well have just encountered congestion
*/
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
- /*
- * For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
- */
- inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
- /*
- * slice used up: queue for next turn
- */
- requeue_io(inode);
- } else {
- /*
- * somehow blocked: retry later
- */
- redirty_tail(inode);
- }
- } else {
- /*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
- */
- inode->i_state |= I_DIRTY_PAGES;
+ inode->i_state |= I_DIRTY_PAGES;
+ if (wbc->nr_to_write <= 0 || /* slice used up */
+ wbc->encountered_congestion)
+ requeue_io(inode);
+ else {
+ /* somehow blocked: retry later */
redirty_tail(inode);
}
} else if (inode->i_state & I_DIRTY) {

2009-07-30 00:19:40

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

BTW, can you explain this code at the bottom of generic_sync_sb_inodes
for me?

                if (wbc->nr_to_write <= 0) {
                        wbc->more_io = 1;
                        break;
                }

I don't understand why we are setting more_io here? AFAICS, more_io
means there's more stuff to write ... I would think we'd set this if
nr_to_write was > 0 ?

Or just have the section below brought up above this
break check and do:

if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
        wbc->more_io = 1;

Am I just misunderstanding the intent of more_io ?

2009-07-30 01:06:34

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Wed, Jul 29, 2009 at 10:11:10PM +0800, Martin Bligh wrote:
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
> >                                 * soon as the queue becomes uncongested.
> >                                 */
> >                                inode->i_state |= I_DIRTY_PAGES;
> > -                               if (wbc->nr_to_write <= 0) {
> > +                               if (wbc->nr_to_write <= 0 ||
> > +                                   wbc->encountered_congestion) {
> >                                        /*
> >                                         * slice used up: queue for next turn
> >                                         */
> >
>
> That's not sufficient - it only the problem in the wb_kupdate path. If you want
> to be more conservative, how about we do this?

I agree on the unification of kupdate and sync paths. In fact I had a
patch for doing this. And I'd recommend to do it in two patches:
one to fix the congestion case, another to do the code unification.

The sync path doesn't care whether requeue_io() or redirty_tail() is
used, because it disregards the time stamps totally - only the order of
inodes matters (ie. starvation), which is the same for requeue_io()/redirty_tail().

Thanks,
Fengguang

> --- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 07:08:48.000000000 -0700
> @@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything. Redirty
> * the inode; Move it from s_io onto s_more_io/s_dirty.
> + * It may well have just encountered congestion
> */
> - /*
> - * akpm: if the caller was the kupdate function we put
> - * this inode at the head of s_dirty so it gets first
> - * consideration. Otherwise, move it to the tail, for
> - * the reasons described there. I'm not really sure
> - * how much sense this makes. Presumably I had a good
> - * reasons for doing it this way, and I'd rather not
> - * muck with it at present.
> - */
> - if (wbc->for_kupdate) {
> - /*
> - * For the kupdate function we move the inode
> - * to s_more_io so it will get more writeout as
> - * soon as the queue becomes uncongested.
> - */
> - inode->i_state |= I_DIRTY_PAGES;
> - if (wbc->nr_to_write <= 0) {
> - /*
> - * slice used up: queue for next turn
> - */
> - requeue_io(inode);
> - } else {
> - /*
> - * somehow blocked: retry later
> - */
> - redirty_tail(inode);
> - }
> - } else {
> - /*
> - * Otherwise fully redirty the inode so that
> - * other inodes on this superblock will get some
> - * writeout. Otherwise heavy writing to one
> - * file would indefinitely suspend writeout of
> - * all the other files.
> - */
> - inode->i_state |= I_DIRTY_PAGES;
> + inode->i_state |= I_DIRTY_PAGES;
> + if (wbc->nr_to_write <= 0 || /* sliced used up */
> + wbc->encountered_congestion)
> + requeue_io(inode);
> + else {
> + /* somehow blocked: retry later */
> redirty_tail(inode);
> }
> } else if (inode->i_state & I_DIRTY) {

2009-07-30 01:12:31

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> I agree on the unification of kupdate and sync paths. In fact I had a
> patch for doing this. And I'd recommend to do it in two patches:
> one to fix the congestion case, another to do the code unification.
>
> The sync path don't care whether requeue_io() or redirty_tail() is
> used, because they disregard the time stamps totally - only order of
> inodes matters (ie. starvation), which is same for requeue_io()/redirty_tail().

But, as I understand it, both paths share the same lists, so we still have
to be consistent?

Also, you set flags like more_io higher up in sync_sb_inodes() based on
whether there's anything in s_more_io queue, so it still seems to have
some effect to me?

2009-07-30 01:28:13

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<[email protected]> wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
>                if (wbc->nr_to_write <= 0) {
>                        wbc->more_io = 1;
>                        break;
>                }
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?
>
> Or just have the section below brought up above this
> break check and do:
>
> if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
>        wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

I am thinking along the lines of:

@@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
                iput(inode);
                cond_resched();
                spin_lock(&inode_lock);
-               if (wbc->nr_to_write <= 0) {
-                       wbc->more_io = 1;
+               if (wbc->nr_to_write <= 0)
                        break;
-               }
-               if (!list_empty(&sb->s_more_io))
-                       wbc->more_io = 1;
        }
+       if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
+               wbc->more_io = 1;
        return;         /* Leave any unwritten inodes on s_io */
 }

2009-07-30 01:49:21

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 08:19:34AM +0800, Martin Bligh wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> if (wbc->nr_to_write <= 0) {
> wbc->more_io = 1;
> break;
> }
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?

That's true: wbc.nr_to_write will always be set to MAX_WRITEBACK_PAGES
by wb_writeback() before entering generic_sync_sb_inodes().

So wbc.nr_to_write <=0 indicates we are interrupted by the quota and
should revisit generic_sync_sb_inodes() to check for more io (which will
_normally_ find more dirty pages to write).
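
For context, the 2.6.31 mainline caller (background_writeout() in
mm/page-writeback.c) refills that quota on every pass; roughly, from
memory and trimmed - the wb_writeback() in linux-next follows the same
pattern:

        for (;;) {
                /* ... bail out once we are back below background_thresh ... */
                wbc.more_io = 0;
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;  /* quota refilled each pass */
                wbc.pages_skipped = 0;
                writeback_inodes(&wbc);
                if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                        /* Wrote less than expected */
                        if (wbc.encountered_congestion || wbc.more_io)
                                congestion_wait(BLK_RW_ASYNC, HZ/10);
                        else
                                break;
                }
        }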

> Or just have the section below brought up above this
> break check and do:
>
> if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

It should be OK. I agree on the change if it makes the logic more
straightforward.

Thanks,
Fengguang

2009-07-30 01:57:58

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> > I agree on the unification of kupdate and sync paths. In fact I had a
> > patch for doing this. And I'd recommend to do it in two patches:
> > one to fix the congestion case, another to do the code unification.
> >
> > The sync path don't care whether requeue_io() or redirty_tail() is
> > used, because they disregard the time stamps totally - only order of
> > inodes matters (ie. starvation), which is same for requeue_io()/redirty_tail().
>
> But, as I understand it, both paths share the same lists, so we still have
> to be consistent?

Then let's first unify the code, then fix the congestion case? :)

> Also, you set flags like more_io higher up in sync_sb_inodes() based on
> whether there's anything in s_more_io queue, so it still seems to have
> some effect to me?

Yes, maybe in some rare cases.

2009-07-30 02:09:38

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 09:28:07AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<[email protected]> wrote:
> > BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> > for me?
> >
> >                if (wbc->nr_to_write <= 0) {
> >                        wbc->more_io = 1;
> >                        break;
> >                }
> >
> > I don't understand why we are setting more_io here? AFAICS, more_io
> > means there's more stuff to write ... I would think we'd set this if
> > nr_to_write was > 0 ?
> >
> > Or just have the section below brought up above this
> > break check and do:
> >
> > if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> >        wbc->more_io = 1;
> >
> > Am I just misunderstanding the intent of more_io ?
>
> I am thinking along the lines of:

On closer look I found this line:

                if (inode_dirtied_after(inode, start))
                        break;

In this case "list_empty(&sb->s_io)" is not a good criteria:
here we are breaking away for some other reasons, and shall
not touch wbc.more_io.

So let's stick with the current code?

Thanks,
Fengguang

> @@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
> iput(inode);
> cond_resched();
> spin_lock(&inode_lock);
> - if (wbc->nr_to_write <= 0) {
> - wbc->more_io = 1;
> + if (wbc->nr_to_write <= 0)
> break;
> - }
> - if (!list_empty(&sb->s_more_io))
> - wbc->more_io = 1;
> }
> + if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io)
> + wbc->more_io = 1;
> return; /* Leave any unwritten inodes on s_io */
> }

2009-07-30 02:57:42

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> On closer looks I found this line:
>
>                if (inode_dirtied_after(inode, start))
>                        break;

Ah, OK.

> In this case "list_empty(&sb->s_io)" is not a good criteria:
> here we are breaking away for some other reasons, and shall
> not touch wbc.more_io.
>
> So let's stick with the current code?

Well, I see two problems. One is that we set more_io based on
whether s_more_io is empty or not before we finish the loop.
I can't see how this can be correct, especially as there can be
other concurrent writers. So somehow we need to check when
we exit the loop, not during it.

The other is that we're saying we are setting more_io when
nr_to_write is <=0 ... but we only really check it when
nr_to_write is > 0 ... I can't see how this can be useful?
I'll admit there is one corner case when pages_skipped is set
from one of the branches, but I am really not sure what the
intended logic is here, given the above?

In the case where we hit the inode_dirtied_after break
condition, is it bad to set more_io ? There is more to do
on that inode after all. Is there a definition somewhere for
exactly what the more_io flag means?

M.

2009-07-30 02:59:17

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<[email protected]> wrote:
> On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
>> > I agree on the unification of kupdate and sync paths. In fact I had a
>> > patch for doing this. And I'd recommend to do it in two patches:
>> > one to fix the congestion case, another to do the code unification.
>> >
>> > The sync path don't care whether requeue_io() or redirty_tail() is
>> > used, because they disregard the time stamps totally - only order of
>> > inodes matters (ie. starvation), which is same for requeue_io()/redirty_tail().
>>
>> But, as I understand it, both paths share the same lists, so we still have
>> to be consistent?
>
> Then let's first unify the code, then fix the congestion case? :)

OK, I will send it out as separate patches. I am just finishing up the testing
first.

M.

2009-07-30 03:19:31

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 10:57:35AM +0800, Martin Bligh wrote:
> > On closer looks I found this line:
> >
> >                if (inode_dirtied_after(inode, start))
> >                        break;
>
> Ah, OK.
>
> > In this case "list_empty(&sb->s_io)" is not a good criteria:
> > here we are breaking away for some other reasons, and shall
> > not touch wbc.more_io.
> >
> > So let's stick with the current code?
>
> Well, I see two problems. One is that we set more_io based on
> whether s_more_io is empty or not before we finish the loop.
> I can't see how this can be correct, especially as there can be
> other concurrent writers. So somehow we need to check when
> we exit the loop, not during it.

It is correct inside the loop, however with some overheads.

We put it inside the loop because sometimes the whole filesystem is
skipped and we shall not set more_io on them whether or not s_more_io
is empty.

> The other is that we're saying we are setting more_io when
> nr_to_write is <=0 ... but we only really check it when
> nr_to_write is > 0 ... I can't see how this can be useful?

That's the caller's fault - I guess the logic was changed a bit by
Jens in linux-next. I noticed this just now. It shall be fixed.

> I'll admit there is one corner case when page_skipped it set
> from one of the branches, but I am really not sure what the
> intended logic is here, given the above?
>
> In the case where we hit the inode_dirtied_after break
> condition, is it bad to set more_io ? There is more to do
> on that inode after all. Is there a definition somewhere for
> exactly what the more_io flag means?

"More dirty pages to be put to io"?

The exact semantics of more_io is determined by the caller,
which used to be (in 2.6.31):

background_writeout():

                if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                        /* Wrote less than expected */
                        if (wbc.encountered_congestion || wbc.more_io)
                                congestion_wait(BLK_RW_ASYNC, HZ/10);
                        else
                                break;
                }

wb_kupdate() is same except that it does not check pages_skipped.
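
For completeness, the corresponding loop in wb_kupdate() looks roughly
like this (again quoting 2.6.31 from memory):

        while (nr_to_write > 0) {
                wbc.more_io = 0;
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                if (wbc.nr_to_write > 0) {
                        if (wbc.encountered_congestion || wbc.more_io)
                                congestion_wait(BLK_RW_ASYNC, HZ/10);
                        else
                                break;  /* All the old data is written */
                }
                nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
        }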

Note that in 2.6.31, more_io is not used at all for sync().

Thanks,
Fengguang

2009-07-30 04:08:08

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 10:59:09AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<[email protected]> wrote:
> > On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> >> > I agree on the unification of kupdate and sync paths. In fact I had a
> >> > patch for doing this. And I'd recommend to do it in two patches:
> >> > one to fix the congestion case, another to do the code unification.
> >> >
> >> > The sync path don't care whether requeue_io() or redirty_tail() is
> >> > used, because they disregard the time stamps totally - only order of
> >> > inodes matters (ie. starvation), which is same for requeue_io()/redirty_tail().
> >>
> >> But, as I understand it, both paths share the same lists, so we still have
> >> to be consistent?
> >
> > Then let's first unify the code, then fix the congestion case? :)
>
> OK, I will send it out as separate patches. I am just finishing up the testing
> first.

Note that this is a simple fix that may have suboptimal write performance.
Here is an old reasoning:

http://lkml.org/lkml/2009/3/28/235

Thanks,
Fengguang

2009-07-30 19:55:56

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> Note that this is a simple fix that may have suboptimal write performance.
> Here is an old reasoning:
>
>        http://lkml.org/lkml/2009/3/28/235

The other thing I've been experimenting with is to disable the per-page
check in write_cache_pages, ie:

        if (wbc->nonblocking && bdi_write_congested(bdi)) {
                wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
                wbc->encountered_congestion = 1;
                /* done = 1; */

This treats the congestion limits as soft, but encourages us to write
back in larger, more efficient chunks. If that's not going to scare
people unduly, I can submit that as well.

2009-07-30 20:33:15

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

(BTW: background ... I'm not picking through this code for fun, I'm
trying to debug writeback problems introduced in our new kernel
that are affecting Google production workloads ;-))

>> Well, I see two problems. One is that we set more_io based on
>> whether s_more_io is empty or not before we finish the loop.
>> I can't see how this can be correct, especially as there can be
>> other concurrent writers. So somehow we need to check when
>> we exit the loop, not during it.
>
> It is correct inside the loop, however with some overheads.
>
> We put it inside the loop because sometimes the whole filesystem is
> skipped and we shall not set more_io on them whether or not s_more_io
> is empty.

My point was that you're setting more_io based on a condition
at a point in time that isn't when you return to the caller.

By the time you return to the caller (after several more loops
iterations), that condition may no longer be true.

One other way to address that would be only to set it if we're
about to fall off the end of the loop, ie change it to:

if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
        wbc->more_io = 1;

>> The other is that we're saying we are setting more_io when
>> nr_to_write is <=0 ... but we only really check it when
>> nr_to_write is > 0 ... I can't see how this can be useful?
>
> That's the caller's fault - I guess the logic was changed a bit by
> Jens in linux-next. I noticed this just now. It shall be fixed.

I am guessing you're setting more_io here because we're stopping
because our slice expired, presumably without us completing
all the io there was to do? That doesn't seem entirely accurate,
we could have finished all the pending IO (particularly given that
we can go over nr_to_write somewhat and send it negative).
Hence, I thought that checking whether s_more_io and s_io were
empty at the time of return might be a more accurate check,
but on the other hand they are shared lists.

2009-07-30 21:40:05

by Jens Axboe

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Tue, Jul 28 2009, Chad Talbott wrote:
> I run a simple workload on a 4GB machine which dirties a few largish
> inodes like so:
>
> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> bs=1024k count=100
>
> While the dds are running data is written out at disk speed. However,
> once the dds have run to completion and exited there is ~500MB of
> dirty memory left. Background writeout then takes about 3 more
> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> can see that the disk is capable of 40MB/s, which finishes off the
> files in ~10s. [1]
>
> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files." When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed. Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3] This is may not be the fix, but it makes this
> test behave better.

Can I talk you into trying the per-bdi writeback patchset? I just tried
your test on a 16gb machine, and the dd's finish immediately since it
wont trip the writeout at that percentage of dirty memory. The 1GB of
dirty memory is flushed when it gets too old, 30 seconds later in two
chunks of writeout running at disk speed.

http://lkml.org/lkml/2009/7/30/302

You can either use the git branch, or just download

http://kernel.dk/writeback-v13.patch

and apply that to -git (or -rc4) directly.

--
Jens Axboe

2009-07-30 22:02:07

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<[email protected]> wrote:
> On Tue, Jul 28 2009, Chad Talbott wrote:
>> I run a simple workload on a 4GB machine which dirties a few largish
>> inodes like so:
>>
>> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
>> bs=1024k count=100
>>
>> While the dds are running data is written out at disk speed. However,
>> once the dds have run to completion and exited there is ~500MB of
>> dirty memory left. Background writeout then takes about 3 more
>> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
>> can see that the disk is capable of 40MB/s, which finishes off the
>> files in ~10s. [1]
>>
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files." When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed. Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3] This is may not be the fix, but it makes this
>> test behave better.
>
> Can I talk you into trying the per-bdi writeback patchset? I just tried
> your test on a 16gb machine, and the dd's finish immediately since it
> wont trip the writeout at that percentage of dirty memory. The 1GB of
> dirty memory is flushed when it gets too old, 30 seconds later in two
> chunks of writeout running at disk speed.

How big did you make the dds? It has to be writing more data than
you have RAM, or it's not going to do anything much interesting ;-)

2009-07-30 22:17:29

by Jens Axboe

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30 2009, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<[email protected]> wrote:
> > On Tue, Jul 28 2009, Chad Talbott wrote:
> >> I run a simple workload on a 4GB machine which dirties a few largish
> >> inodes like so:
> >>
> >> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> >> bs=1024k count=100
> >>
> >> While the dds are running data is written out at disk speed. However,
> >> once the dds have run to completion and exited there is ~500MB of
> >> dirty memory left. Background writeout then takes about 3 more
> >> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> >> can see that the disk is capable of 40MB/s, which finishes off the
> >> files in ~10s. [1]
> >>
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files." When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed. Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3] This is may not be the fix, but it makes this
> >> test behave better.
> >
> > Can I talk you into trying the per-bdi writeback patchset? I just tried
> > your test on a 16gb machine, and the dd's finish immediately since it
> > wont trip the writeout at that percentage of dirty memory. The 1GB of
> > dirty memory is flushed when it gets too old, 30 seconds later in two
> > chunks of writeout running at disk speed.
>
> How big did you make the dds? It has to be writing more data than
> you have RAM, or it's not going to do anything much interesting ;-)

The test case above on a 4G machine is only generating 1G of dirty data.
I ran the same test case on the 16G, resulting in only background
writeout. The relevant bit here being that the background writeout
finished quickly, writing at disk speed.

I re-ran the same test, but using 300 100MB files instead. While the
dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
x25-m). When the dd's are done, it continues doing 80MB/sec for 10
seconds or so. Then the remainder (about 2G) is written in bursts at
disk speeds, but with some time in between.

--
Jens Axboe

2009-07-30 22:34:19

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

> The test case above on a 4G machine is only generating 1G of dirty data.
> I ran the same test case on the 16G, resulting in only background
> writeout. The relevant bit here being that the background writeout
> finished quickly, writing at disk speed.
>
> I re-ran the same test, but using 300 100MB files instead. While the
> dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> seconds or so. Then the remainder (about 2G) is written in bursts at
> disk speeds, but with some time in between.

OK, I think the test case is sensitive to how many files you have - if
we punt them to the back of the list, and yet we still have 299 other
ones, it may well be able to keep the disk spinning despite the bug
I outlined. Try using 30 1GB files?

Though it doesn't seem to happen with just one dd streamer, and
I don't see why the bug doesn't trigger in that case either.

I believe the bugfix is correct independent of any bdi changes?

M.

2009-07-30 22:43:10

by Jens Axboe

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30 2009, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined.Try using 30 1GB files?

If this disk starts spinning, then we have bigger bugs :-)
>
> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.
>
> I believe the bugfix is correct independent of any bdi changes?

Yeah I think so too, I'll run some more tests on this tomorrow and
verify it there as well.

--
Jens Axboe

2009-07-30 22:48:09

by Martin Bligh

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<[email protected]> wrote:
> On Thu, Jul 30 2009, Martin Bligh wrote:
>> > The test case above on a 4G machine is only generating 1G of dirty data.
>> > I ran the same test case on the 16G, resulting in only background
>> > writeout. The relevant bit here being that the background writeout
>> > finished quickly, writing at disk speed.
>> >
>> > I re-ran the same test, but using 300 100MB files instead. While the
>> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
>> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
>> > seconds or so. Then the remainder (about 2G) is written in bursts at
>> > disk speeds, but with some time in between.
>>
>> OK, I think the test case is sensitive to how many files you have - if
>> we punt them to the back of the list, and yet we still have 299 other
>> ones, it may well be able to keep the disk spinning despite the bug
>> I outlined.Try using 30 1GB files?
>
> If this disk starts spinning, then we have bigger bugs :-)
>>
>> Though it doesn't seem to happen with just one dd streamer, and
>> I don't see why the bug doesn't trigger in that case either.
>>
>> I believe the bugfix is correct independent of any bdi changes?
>
> Yeah I think so too, I'll run some more tests on this tomorrow and
> verify it there as well.

There's another issue I was discussing with Peter Z. earlier that the
bdi changes might help with - if you look at where the dirty pages
get to, they are capped hard at the average of the dirty and
background thresholds, meaning we can only dirty about half the
pages we should be able to. That does very slowly go away when
the bdi limit catches up, but it seems to start at 0, and its progress
seems glacially slow (at least if you're impatient ;-))

This seems to affect some of our workloads badly when they have
a sharp spike in dirty data to one device, they get throttled heavily
when they wouldn't have before the per-bdi dirty limits.

2009-07-31 07:46:59

by Peter Zijlstra

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Thu, 2009-07-30 at 15:48 -0700, Martin Bligh wrote:

> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and it's progess
> seems glacially slow (at least if you're impatient ;-))
>
> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device, they get throttled heavily
> when they wouldn't have before the per-bdi dirty limits.

Right, currently that adjustment period is about the same order as
writing out the full dirty page capacity. If your system has unbalanced
memory vs io capacity this might indeed end up being glacial.

I've been considering making it a sublinear function wrt the memory
size, so that larger machines get less and therefore adjust faster.

Something like the below perhaps -- the alternative is yet another
sysctl :/

Not sure how the sqrt works out on a wide variety of machines though,..
we'll have to test and see.

---
mm/page-writeback.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..64aa140 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -152,7 +152,7 @@ static int calc_period_shift(void)
        else
                dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
                                100;
-       return 2 + ilog2(dirty_total - 1);
+       return 2 + ilog2(int_sqrt(dirty_total) - 1);
 }

 /*
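
As a rough feel for the magnitude (hypothetical numbers, not measured on
a real machine): with 4KB pages and dirty_total around 2^20 pages (~4GB
of dirty limit), the shift would drop from 21 to 11, i.e. the proportion
period shrinks by roughly three orders of magnitude. A quick throwaway
userspace check, using libm stand-ins for the kernel's ilog2()/int_sqrt():

#include <math.h>
#include <stdio.h>

/* approximate ilog2(): floor of log2 */
static int ilog2_approx(unsigned long v)
{
        return (int)floor(log2((double)v));
}

int main(void)
{
        unsigned long dirty_total = 1UL << 20;  /* hypothetical: ~4GB / 4KB pages */
        int cur = 2 + ilog2_approx(dirty_total - 1);
        int new = 2 + ilog2_approx((unsigned long)sqrt((double)dirty_total) - 1);

        printf("current shift %d, proposed shift %d\n", cur, new);  /* 21, 11 */
        return 0;
}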

2009-08-01 02:02:41

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Fri, Jul 31, 2009 at 03:55:44AM +0800, Martin Bligh wrote:
> > Note that this is a simple fix that may have suboptimal write performance.
> > Here is an old reasoning:
> >
> >        http://lkml.org/lkml/2009/3/28/235
>
> The other thing I've been experimenting with is to disable the per-page
> check in write_cache_pages, ie:
>
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
> wbc->encountered_congestion = 1;
> /* done = 1; */
>
> This treats the congestion limits as soft, but encourages us to write
> back in larger, more efficient chunks. If that's not going to scare
> people unduly, I can submit that as well.

This risks hitting the hard limit (nr_requests) and blocking everyone,
including the ones with higher priority (ie. kswapd).

On the other hand, the simple fix in previous mails won't necessarily
act too sub-optimally - that's only a potential problem. There is a window of
(1/16)*(nr_requests)*request_size (= 128*256KB/16 = 4MB) between
the congestion-on and congestion-off states. So at best we can inject
a big 4MB chunk into the async write queue once it becomes uncongested.

I have a writeback debug patch that can help find out how
that works out in your real world workloads (by monitoring
nr_to_write). You can also try doubling the ratio (1/16) in
blk_queue_congestion_threshold(), to see how an increased
congestion-on-off window may help.
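
For reference, those thresholds come from blk_queue_congestion_threshold()
in block/blk-core.c, which looks roughly like this in 2.6.31 (quoted from
memory, so double-check the exact constants in your tree):

static void blk_queue_congestion_threshold(struct request_queue *q)
{
        int nr;

        /* congestion turns on when we get within ~1/8 of nr_requests... */
        nr = q->nr_requests - (q->nr_requests / 8) + 1;
        if (nr > q->nr_requests)
                nr = q->nr_requests;
        q->nr_congestion_on = nr;

        /* ...and only turns off again ~1/16 of nr_requests lower */
        nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
        if (nr < 1)
                nr = 1;
        q->nr_congestion_off = nr;
}

If I read that right, with the default nr_requests = 128 congestion turns
on at 113 queued requests and off at 103, i.e. a hysteresis window of
about nr_requests/16.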

Thanks,
Fengguang


Attachments:
writeback-debug-2.6.31.patch (2.65 kB)

2009-08-01 02:58:28

by Fengguang Wu

Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loops
> iterations), that condition may no longer be true.

You are right in that sense. Sorry that my claim of correctness is
somehow biased: we normally care much about early abortion, and don't
mind one extra trip over the superblocks. And the extra trip should be
rare enough. I'd be surprised if you observed many of them in real
workloads.

> One other way to address that would to be only to set if if we're
> about to fall off the end of the loop, ie change it to:
>
> if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> wbc->more_io = 1;

Let more_io=0 when there are more inodes in s_io to be worked on?
I cannot understand it, and suspect we are talking about an imaginary
problem on this point ;)

> >> The other is that we're saying we are setting more_io when
> >> nr_to_write is <=0 ... but we only really check it when
> >> nr_to_write is > 0 ... I can't see how this can be useful?
> >
> > That's the caller's fault - I guess the logic was changed a bit by
> > Jens in linux-next. I noticed this just now. It shall be fixed.
>
> I am guessing you're setting more_io here because we're stopping
> because our slice expired, presumably without us completing
> all the io there was to do? That doesn't seem entirely accurate,
> we could have finished all the pending IO (particularly given that
> we can go over nr_to_write somewhat and send it negative).
> Hence, I though that checking whether s_more_io and s_io were
> empty at the time of return might be a more accurate check,
> but on the other hand they are shared lists.

Yes, the current more_io logic is not entirely accurate, but I doubt we
can gain much, or that the improvement can be done trivially (not in the
lines of code, but in the analysis and tests involved).

Anyway if you would take the time to push forward a patch for reducing
the overheads of possible extra trips, I'll take the time to review it ;)

Thanks,
Fengguang