2007-08-12 09:23:00

by Wu Fengguang

Subject: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

Miklos Szeredi <[email protected]> and I identified a writeback bug:

> The following strange behavior can be observed:
>
> 1. large file is written
> 2. after 30 seconds, nr_dirty goes down by 1024
> 3. then for some time (< 30 sec) nothing happens (disk idle)
> 4. then nr_dirty again goes down by 1024
> 5. repeat from 3. until whole file is written
>
> So basically a 4Mbyte chunk of the file is written every 30 seconds.
> I'm quite sure this is not the intended behavior.

It can be produced by the following test scheme:

# cat bin/test-writeback.sh
grep nr_dirty /proc/vmstat
echo 1 > /proc/sys/fs/inode_debug
dd if=/dev/zero of=/var/x bs=1K count=204800&
while true; do grep nr_dirty /proc/vmstat; sleep 1; done

# bin/test-writeback.sh
nr_dirty 19207
nr_dirty 19207
nr_dirty 30924
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
nr_dirty 47150
nr_dirty 47141
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47205
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47214
nr_dirty 47215
nr_dirty 47216
nr_dirty 47216
nr_dirty 47216
nr_dirty 47154
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47143
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47142
nr_dirty 47134
nr_dirty 47134
nr_dirty 47135
nr_dirty 47135
nr_dirty 47135
nr_dirty 46097 <== -1038
nr_dirty 46098
nr_dirty 46098
nr_dirty 46098
[...]
nr_dirty 46091
nr_dirty 46092
nr_dirty 46092
nr_dirty 45069 <== -1023
nr_dirty 45056
nr_dirty 45056
nr_dirty 45056
[...]
nr_dirty 37822
nr_dirty 36799 <== -1023
[...]
nr_dirty 36781
nr_dirty 35758 <== -1023
[...]
nr_dirty 34708
nr_dirty 33672 <== -1024
[...]
nr_dirty 33692
nr_dirty 32669 <== -1023


% ls -li /var/x
847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x

% dmesg|grep 847824 # generated by a debug printk
[ 529.263184] redirtied inode 847824 line 548
[ 564.250872] redirtied inode 847824 line 548
[ 594.272797] redirtied inode 847824 line 548
[ 629.231330] redirtied inode 847824 line 548
[ 659.224674] redirtied inode 847824 line 548
[ 689.219890] redirtied inode 847824 line 548
[ 724.226655] redirtied inode 847824 line 548
[ 759.198568] redirtied inode 847824 line 548

# line 548 in fs/fs-writeback.c:
543 if (wbc->pages_skipped != pages_skipped) {
544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */
548 redirty_tail(inode);
549 }

Further debugging shows that __block_write_full_page()
never gets the chance to call submit_bh() for that big dirty file:
the buffer heads are *clean*. So basically no page io is issued by
__block_write_full_page(), hence pages_skipped goes up.

This patch fixes the bug, though I'm not fully confident it is the
right fix. At least the comment in generic_sync_sb_inodes():

544 /*
545 * writeback is not making progress due to locked
546 * buffers. Skip this inode for now.
547 */

and the comment in __block_write_full_page():

1713 /*
1714 * The page was marked dirty, but the buffers were
1715 * clean. Someone wrote them back by hand with
1716 * ll_rw_block/submit_bh. A rare case.
1717 */

do not quite agree with each other: the page writeback is skipped not because
of 'locked buffers', but because of 'clean buffers'.

This is the new behavior after the patch:

# bin/test-writeback.sh
nr_dirty 60
847824 /var/x
nr_dirty 60
nr_dirty 31139
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.55338 seconds, 135 MB/s
nr_dirty 47137
nr_dirty 46147
nr_dirty 46147
nr_dirty 46147
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46148
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46193
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46126
nr_dirty 46109
nr_dirty 46109
nr_dirty 46109
nr_dirty 46113
nr_dirty 46113
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46106
nr_dirty 46089
nr_dirty 46089
nr_dirty 46090
nr_dirty 46093
nr_dirty 46093
nr_dirty 15
nr_dirty 15
nr_dirty 15
nr_dirty 15

Those are pretty numbers: wait 30s, then write it ALL :)

But another run is not so good:

# sh bin/test-writeback.sh
mount: proc already mounted
nr_dirty 223
nr_dirty 223
nr_dirty 23664
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 1.51092 seconds, 139 MB/s
nr_dirty 47299
nr_dirty 47271
nr_dirty 47260
nr_dirty 47260
nr_dirty 47267
nr_dirty 47267
nr_dirty 47329
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47352
nr_dirty 47606
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47604
nr_dirty 47480
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 47492
nr_dirty 46470
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 46473
nr_dirty 45428
nr_dirty 45435
nr_dirty 45436
nr_dirty 45436
nr_dirty 45436
nr_dirty 257
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 259
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16
nr_dirty 16

Basically they are:
- during the dd: ~16M
- after 30s: ~4M
- after 5s: ~4M
- after 5s: ~176M

The box has 2G memory.

Question 1:
How come the 5s delays? I ran 4 tests in total, 2 of which showed such 5s delays.

Question 2:
__block_write_full_page() is virtually doing nothing for the whole dirty file.
Isn't that abnormal? Who did the actual writeback for us? The journal? How can we fix it?

Any suggestions? Thank you.

Cc: Ken Chen <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
fs/buffer.c | 1 -
1 file changed, 1 deletion(-)

--- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
+++ linux-2.6.23-rc2-mm2/fs/buffer.c
@@ -1713,7 +1713,6 @@ done:
* The page and buffer_heads can be released at any time from
* here on.
*/
- wbc->pages_skipped++; /* We didn't write this page */
}
return err;


--


2007-08-13 01:03:51

by David Chinner

Subject: Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> Miklos Szeredi <[email protected]> and me identified a writeback bug:
> Basicly they are
> - during the dd: ~16M
> - after 30s: ~4M
> - after 5s: ~4M
> - after 5s: ~176M
>
> The box has 2G memory.
>
> Question 1:
> How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.

pdflush runs every five seconds, so that is indicative of the inode being
written once for 1024 pages, and then delayed to the next pdflush run 5s later.
perhaps the inodes aren't moving between the lists exactly the way you
think they are...

> --- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
> +++ linux-2.6.23-rc2-mm2/fs/buffer.c
> @@ -1713,7 +1713,6 @@ done:
> * The page and buffer_heads can be released at any time from
> * here on.
> */
> - wbc->pages_skipped++; /* We didn't write this page */
> }
> return err;

Hmmmm - I suspect XFS is going to need a similar fix as well. I'm moving
house so I'm not going to get a chance to look at this for a week...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-08-13 13:51:48

by Wu Fengguang

Subject: Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

On Mon, Aug 13, 2007 at 11:03:21AM +1000, David Chinner wrote:
> > --- linux-2.6.23-rc2-mm2.orig/fs/buffer.c
> > +++ linux-2.6.23-rc2-mm2/fs/buffer.c
> > @@ -1713,7 +1713,6 @@ done:
> > * The page and buffer_heads can be released at any time from
> > * here on.
> > */
> > - wbc->pages_skipped++; /* We didn't write this page */
> > }
> > return err;
>
> Hmmmm - I suspect XFS is going to need a similar fix as well. I'm moving
> house so I'm not going to get a chance to look at this for a week...

I guess that as long as the skipped page no longer has the dirty bit set
both as a page flag and as a radix tree tag (i.e. it has been put to io
by someone else), it is OK not to increase wbc->pages_skipped.


> On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > Miklos Szeredi <[email protected]> and me identified a writeback bug:
> > Basicly they are
> > - during the dd: ~16M
> > - after 30s: ~4M
> > - after 5s: ~4M
> > - after 5s: ~176M
> >
> > The box has 2G memory.
> >
> > Question 1:
> > How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.
>
> pdflush runs every five seconds, so that is indicative of the inode being
> written once for 1024 pages, and then delayed to the next pdflush run 5s later.
> perhaps the inodes aren't moving between the lists exactly the way you
> think they are...

Now I have figured out the exact situation. When the scan of s_io finishes
with some small inodes, nr_to_write will be positive, fooling kupdate into
quitting prematurely. But in fact the big dirty file is still on s_more_io
waiting for more io... The attached patch fixes it.

Fengguang
===

Subject: writeback: introduce writeback_control.more_io to indicate more io

After making dirty a 100M file, the normal behavior is to
start the writeback for all data after 30s delays. But
sometimes the following happens instead:

- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:

s_io s_more_io
-------------------------
1) 100M,1K 0
2) 1K 96M
3) 0 96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes (BUG)

nr_to_write > 0 in 3) fools the upper layer into thinking that all data has been
written out. But the big dirty file is still sitting in s_more_io. We cannot
simply splice s_more_io back to s_io as soon as s_io becomes empty and let the
loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes
in s_dirty. It is also not an option to draw inodes from both s_more_io and
s_dirty and let the loop go on: this might lead to livelocks, and might also
starve other superblocks in sync time (well, kupdate may still starve some
superblocks, but that's another bug).

So we have to return when a full scan of s_io completes, which means that
nr_to_write > 0 does not necessarily imply "all data are written". This patch
introduces a flag, writeback_control.more_io, to indicate this situation. With
it, the big dirty file no longer has to wait for the next kupdate invocation 5s later.

Signed-off-by: Fengguang Wu <[email protected]>
---
fs/fs-writeback.c | 2 ++
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 9 insertions(+), 3 deletions(-)

--- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
+++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
@@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
if (wbc->nr_to_write <= 0)
break;
}
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
spin_unlock(&inode_lock);
return ret; /* Leave any unwritten inodes on s_io */
}
--- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
+++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
@@ -61,6 +61,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */

void *fs_private; /* For use by ->writepages() */
};
--- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
@@ -382,6 +382,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -389,8 +390,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion)
+ congestion_wait(WRITE, HZ/10);
+ else if (!wbc.more_io)
break;
}
}
@@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion)
congestion_wait(WRITE, HZ/10);
- else
+ else if (!wbc.more_io)
break; /* All the old data is written */
}
nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;

2007-08-17 07:13:33

by Wu Fengguang

Subject: Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()

On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote:
> > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > > Miklos Szeredi <[email protected]> and me identified a writeback bug:
> > > Basicly they are
> > > - during the dd: ~16M
> > > - after 30s: ~4M
> > > - after 5s: ~4M
> > > - after 5s: ~176M
> > >
> > > The box has 2G memory.
> > >
> > > Question 1:
> > > How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.
> >
> > pdflush runs every five seconds, so that is indicative of the inode being
> > written once for 1024 pages, and then delayed to the next pdflush run 5s later.
> > perhaps the inodes aren't moving between the lists exactly the way you
> > think they are...
>
> Now I figured out the exact situation. When the scan of s_io finishes
> with some small inodes, nr_to_write will be positive, fooling kupdate
> to quit prematurely. But in fact the big dirty file is on s_more_io
> waiting for more io... The attached patch fixes it.
>
> Fengguang
> ===
>
> Subject: writeback: introduce writeback_control.more_io to indicate more io
>
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
>
> - after 30s: ~4M
> - after 5s: ~4M
> - after 5s: all remaining 92M
>
> Some analyze shows that the internal io dispatch queues goes like this:
>
> s_io s_more_io
> -------------------------
> 1) 100M,1K 0
> 2) 1K 96M
> 3) 0 96M
>
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes(BUG)
>
> nr_to_write > 0 in 3) fools the upper layer to think that data have all been
> written out. Bug the big dirty file is still sitting in s_more_io. We cannot
> simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the
> loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes
> in s_dirty. It is also not an option to draw inodes from both s_more_io and
> s_dirty, an let the loop go on: this might lead to live locks, and might also
> starve other superblocks in sync time(well kupdate may still starve some
> superblocks, that's another bug).
>
> So we have to return when a full scan of s_io completes. So nr_to_write > 0 does
> not necessarily mean that "all data are written". This patch introduces a flag
> writeback_control.more_io to indicate this situation. With it the big dirty file
> no longer has to wait for the next kupdate invocation 5s later.

Sorry, this patch turns out to be dangerous. It locks up my desktop
under heavy I/O: kupdate *immediately* returns to push the file in
s_more_io for writeback, but it *could* still be unable to make
progress (locks etc.). Kupdate then ends up *busy looping*. That could
be fixed by waiting for roughly 100ms and then retrying the io. Should
we do that? (Or: is the 5s interval considered too long a wait?)

> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> fs/fs-writeback.c | 2 ++
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 9 ++++++---
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
> if (wbc->nr_to_write <= 0)
> break;
> }
> + if (!list_empty(&sb->s_more_io))
> + wbc->more_io = 1;
> spin_unlock(&inode_lock);
> return ret; /* Leave any unwritten inodes on s_io */
> }
> --- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
> +++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
> @@ -61,6 +61,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> + unsigned more_io:1; /* more io to be dispatched */
>
> void *fs_private; /* For use by ->writepages() */
> };
> --- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
> global_page_state(NR_UNSTABLE_NFS) < background_thresh
> && min_pages <= 0)
> break;
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
> min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> /* Wrote less than expected */
> - congestion_wait(WRITE, HZ/10);
> - if (!wbc.encountered_congestion)
> + if (wbc.encountered_congestion)
> + congestion_wait(WRITE, HZ/10);
> + else if (!wbc.more_io)
> break;
> }
> }
> @@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> while (nr_to_write > 0) {
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> if (wbc.encountered_congestion)
> congestion_wait(WRITE, HZ/10);
> - else
> + else if (!wbc.more_io)
> break; /* All the old data is written */
> }
> nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>