2010-12-13 15:15:59

by Fengguang Wu

Subject: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

Reduce the dampening for the control system, yielding faster
convergence.

Currently it converges at a snail's pace for slow devices (on the order
of minutes). For really fast storage, the convergence speed should be fine.

It makes sense to make it reasonably fast for typical desktops.

After this patch, it converges in ~10 seconds for 60MB/s writes and 4GB mem.
So expect ~1s for fast 600MB/s storage with 4GB mem, or ~4s with
16GB mem, which seems reasonable.

$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh: 0 kB
BdiDirtyThresh: 118748 kB
BdiDirtyThresh: 214280 kB
BdiDirtyThresh: 303868 kB
BdiDirtyThresh: 376528 kB
BdiDirtyThresh: 411180 kB
BdiDirtyThresh: 448636 kB
BdiDirtyThresh: 472260 kB
BdiDirtyThresh: 490924 kB
BdiDirtyThresh: 499596 kB
BdiDirtyThresh: 507068 kB
...
DirtyThresh: 530392 kB

CC: Peter Zijlstra <[email protected]>
CC: Richard Kennedy <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c 2010-12-13 21:46:11.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-13 21:46:11.000000000 +0800
@@ -145,7 +145,7 @@ static int calc_period_shift(void)
else
dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
100;
- return 2 + ilog2(dirty_total - 1);
+ return ilog2(dirty_total - 1) - 1;
}

/*

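To make the one-liner's effect concrete, here is a minimal userspace
sketch (an illustration only: ilog2 is reimplemented to mirror the
kernel helper, and dirty_total assumes the ~530392 kB DirtyThresh from
the log above, expressed in 4kB pages):

#include <stdio.h>

/* ilog2(n): index of the most significant set bit, mirroring the kernel's */
static int ilog2(unsigned long n)
{
	int bit = -1;

	while (n) {
		n >>= 1;
		bit++;
	}
	return bit;
}

int main(void)
{
	/* assumed example: ~530392 kB DirtyThresh, i.e. 132598 4kB pages */
	unsigned long dirty_total = 530392 / 4;

	int old_shift = 2 + ilog2(dirty_total - 1);	/* before the patch */
	int new_shift = ilog2(dirty_total - 1) - 1;	/* after the patch  */

	printf("old period: 2^%d = %lu pages\n", old_shift, 1UL << old_shift);
	printf("new period: 2^%d = %lu pages\n", new_shift, 1UL << new_shift);
	return 0;
}

The shift drops by 3, so the period of the per-bdi proportion estimator
shrinks by a factor of 8: from roughly 4x the dirty threshold's worth of
written pages (2^19 = 524288 here) down to about half of it (2^16 =
65536), which is where the faster convergence comes from.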

2010-12-14 13:37:47

by Richard Kennedy

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Mon, 2010-12-13 at 22:46 +0800, Wu Fengguang wrote:
> [full patch quoted above, trimmed]
>
Hi Fengguang,

I've been running my test set on your v3 series and generally it's
giving good results in line with the mainline kernel, with much less
variability and a lower standard deviation, so it is much more
repeatable.

However, it doesn't seem to be honouring the background_dirty_threshold.

The attached graph is from a simple fio write test of 400MB on ext4.
All dirty pages are completely written in 15 seconds, but I expect to
see up to background_dirty_threshold pages staying dirty until the
30-second background task writes them out. So it is much too eager to
write back dirty pages.

As to the ramp-up time, when writing to 2 disks at the same time I see
the per_bdi_threshold taking up to 20 seconds to converge on a steady
value after one of the writes stops. So I think this could be sped up
even more, at least on my setup.

I am just about to start testing v4 & will report anything interesting.

regards
Richard





Attachments:
dirty.png (3.43 kB)

2010-12-14 13:59:20

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

Hi Richard,

On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> On Mon, 2010-12-13 at 22:46 +0800, Wu Fengguang wrote:
> > [...]
>
> Hi Fengguang,
>
> I've been running my test set on your v3 series and generally it's
> giving good results in line with the mainline kernel, with much less
> variability and a lower standard deviation, so it is much more
> repeatable.

Glad to hear that, and thank you very much for trying it out!

> However, it doesn't seem to be honouring the background_dirty_threshold.

> The attached graph is from a simple fio write test of 400MB on ext4.
> All dirty pages are completely written in 15 seconds, but I expect to
> see up to background_dirty_threshold pages staying dirty until the
> 30-second background task writes them out. So it is much too eager to
> write back dirty pages.

This is interesting, and seems easy to root cause. When testing v4,
would you help collect the following trace events?

echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
echo 1 > /debug/tracing/events/writeback/balance_dirty_state/enable
echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable

They should give us a good chance of uncovering the bug.

> As to the ramp-up time, when writing to 2 disks at the same time I see
> the per_bdi_threshold taking up to 20 seconds to converge on a steady
> value after one of the writes stops. So I think this could be sped up
> even more, at least on my setup.

I have roughly the same ramp-up time on the 1-disk 3GB mem test:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png

Given that this is the typical desktop case, it does seem reasonable to
speed it up further.

> I am just about to start testing v4 & will report anything interesting.

Thanks!

Fengguang

2010-12-14 14:33:49

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, Dec 14, 2010 at 09:59:10PM +0800, Wu Fengguang wrote:
> On Tue, Dec 14, 2010 at 09:37:34PM +0800, Richard Kennedy wrote:
> > As to the ramp-up time, when writing to 2 disks at the same time I see
> > the per_bdi_threshold taking up to 20 seconds to converge on a steady
> > value after one of the writes stops. So I think this could be sped up
> > even more, at least on my setup.
>
> I have roughly the same ramp-up time on the 1-disk 3GB mem test:
>
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages.png
>

Interestingly, the above graph shows that after about 10s of fast ramp
up, there is another 20s of slow ramp down. It's obviously due to the
decline of the global limit:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat-dirty.png

But why is the global limit declining? The following log shows that
nr_file_pages keeps growing and only goes stable after 75 seconds (so
long a time!). In the same period nr_free_pages slowly drops to its
stable value. Given that the global limit is mainly derived from
nr_free_pages + nr_file_pages (I disabled swap), something must be
slowly eating memory until 75 s. Maybe the tracing ring buffers?

       free     file     reclaimable pages
50s  369324 + 318760  =>  688084
60s  235989 + 448096  =>  684085

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/vmstat
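As a minimal sketch of that derivation (an illustration only: it
assumes the default vm_dirty_ratio of 20 and ignores the highmem and
reserved-page details of the real determine_dirtyable_memory()):

#include <stdio.h>

/* with swap disabled, the dirtyable total is roughly free + file pages,
 * and the global limit is vm_dirty_ratio percent of that */
static unsigned long global_limit(unsigned long nr_free, unsigned long nr_file,
				  int vm_dirty_ratio)
{
	return vm_dirty_ratio * (nr_free + nr_file) / 100;
}

int main(void)
{
	/* the two vmstat samples above */
	printf("50s: %lu pages\n", global_limit(369324, 318760, 20));
	printf("60s: %lu pages\n", global_limit(235989, 448096, 20));
	return 0;
}

Any slow leak out of free+file then shows up directly as a declining
global limit.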

Thanks,
Fengguang

2010-12-14 14:39:10

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, Dec 14, 2010 at 10:33:25PM +0800, Wu Fengguang wrote:
> But why is the global limit declining? [...] Given that the global
> limit is mainly derived from nr_free_pages + nr_file_pages (I disabled
> swap), something must be slowly eating memory until 75 s. Maybe the
> tracing ring buffers?

The log shows that ~64MB of reclaimable memory is stolen. But the trace
data only takes 1.8MB. Hmm..

Thanks,
Fengguang

2010-12-14 14:52:19

by Peter Zijlstra

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, 2010-12-14 at 22:39 +0800, Wu Fengguang wrote:
> The log shows that ~64MB of reclaimable memory is stolen. But the trace
> data only takes 1.8MB. Hmm..

Also, trace buffers are fully pre-allocated.

Inodes perhaps?

2010-12-14 14:57:06

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, Dec 14, 2010 at 10:39:02PM +0800, Wu Fengguang wrote:
> The log shows that ~64MB of reclaimable memory is stolen. But the trace
> data only takes 1.8MB. Hmm..

ext2 has the same pattern:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36/dirty-pages.png

But it does not happen for btrfs!

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat-dirty.png

It seems to be nr_slab_reclaimable that keeps growing until 75s.

Looking at
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36/slabinfo-end

It should be the buffer heads that slowly eat the memory during that time:

buffer_head 670304 670662 104 37 1 : tunables 120 60 8 : slabdata 18117 18126 480

(670304/37)*4 = 72464KB, i.e. ~18116 slabs of one 4kB page each.

The consumption seems acceptable for a 3G memory system.
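Spelled out, with the numbers taken straight from that slabinfo line
(assuming 4kB pages):

#include <stdio.h>

int main(void)
{
	/* from the buffer_head line: active_objs, objperslab, pagesperslab */
	unsigned long active_objs = 670304;
	unsigned long objs_per_slab = 37;
	unsigned long pages_per_slab = 1;

	/* each slab occupies pages_per_slab 4kB pages */
	unsigned long kb = (active_objs / objs_per_slab) * pages_per_slab * 4;

	printf("buffer_head slab usage: ~%lu kB\n", kb);	/* ~72464 kB */
	return 0;
}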

Thanks,
Fengguang

2010-12-14 15:15:24

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, Dec 14, 2010 at 10:50:55PM +0800, Peter Zijlstra wrote:
> > The log shows that ~64MB of reclaimable memory is stolen. But the
> > trace data only takes 1.8MB. Hmm..
>
> Also, trace buffers are fully pre-allocated.
>
> Inodes perhaps?

Just figured out that it's the buffer heads :)

The other interesting question is why it takes up to 50s to consume
all the nr_free_pages. I would imagine the free pages being quickly
allocated to the page cache..

Attached is the graph for ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36

Thanks,
Fengguang


Attachments:
vmstat-reclaimable-500.png (64.98 kB)

2010-12-14 15:26:44

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, Dec 14, 2010 at 11:15:07PM +0800, Wu Fengguang wrote:
> Just figured out that it's the buffer heads :)
>
> The other interesting question is why it takes up to 50s to consume
> all the nr_free_pages. I would imagine the free pages being quickly
> allocated to the page cache..
>
> Attached is the graph for ext2-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-01-36

Ah, it's embarrassing.. we are writing data, so the free memory
consumption is simply bounded by the disk write speed..

So it's FS independent.

Here is the graph for ext3 on a vanilla kernel, generated from

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext3-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-19-57/vmstat

And btrfs on a vanilla kernel:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat

Thanks,
Fengguang


Attachments:
vmstat-reclaimable-500.png (66.49 kB)
vmstat-dirty-500.png (55.78 kB)

2010-12-15 18:48:43

by Richard Kennedy

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Tue, 2010-12-14 at 21:59 +0800, Wu Fengguang wrote:
> This is interesting, and seems easy to root cause. When testing v4,
> would you help collect the following trace events?
>
> echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
> echo 1 > /debug/tracing/events/writeback/balance_dirty_state/enable
> echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable
> [...]

I just mailed the trace log to Fengguang; it is a bit big to post to
this list. If anyone wants it, let me know and I'll mail it to them
directly.

I'm also seeing a write stall in some of my tests. When writing 400MB,
after about 6 seconds I see a few seconds when there are no reported
sectors written to sda and no pages under writeback, although there
are lots of dirty pages. (The graph I sent previously shows this stall
as well.)

regards
Richard


2010-12-17 13:07:58

by Fengguang Wu

Subject: Re: [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time

On Thu, Dec 16, 2010 at 02:48:29AM +0800, Richard Kennedy wrote:
> I'm also seeing a write stall in some of my tests. When writing 400MB,
> after about 6 seconds I see a few seconds when there are no reported
> sectors written to sda and no pages under writeback, although there
> are lots of dirty pages. (The graph I sent previously shows this stall
> as well.)

I managed to reproduce your workload, see the attached graphs. They
represent two runs of the following fio job. Obviously the results
are very reproducible.

[zero]
size=400m
rw=write
pre_read=1
ioengine=mmap
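(For anyone reproducing this: a possible way to run the job alongside
the tracing discussed earlier in the thread. The zero.fio file name is
arbitrary, and the paths assume debugfs mounted at /debug as above.)

echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable
fio zero.fio                           # the job above, saved as zero.fio
cat /debug/tracing/trace > trace.log   # collect the events afterwards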

Here is the trace data for the first graph. I'll explain how every
single write is triggered. Vanilla kernels should show the same
behavior.

background threshold exceeded, so background flush is started
-------------------------------------------------------------
flush-8:0-2662 [005] 18.759459: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=544 wrote=16385 to_write=-1 index=1
flush-8:0-2662 [000] 19.941272: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1732 wrote=16385 to_write=-1 index=16386
flush-8:0-2662 [000] 20.162497: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1952 wrote=4097 to_write=-1 index=32771


fio completes data population and does something like fsync()
Note that the dirty age is not reset by fsync().
-------------------------------------------------------------
<...>-2637 [000] 25.364145: fdatawrite_range: fio: bdi=8:0 ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES start=0 end=9223372036854775807 sync=1 wrote=65533 skipped=0
<...>-2637 [004] 26.492765: fdatawrite_range: fio: bdi=8:0 ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES start=0 end=9223372036854775807 sync=0 wrote=0 skipped=0


fio starts the "rw=write" pass, which triggers background flush when
the background threshold is exceeded
----------------------------------------------------------
flush-8:0-2662 [000] 33.277084: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_PAGES age=15112 wrote=16385 to_write=-1 index=1
flush-8:0-2662 [000] 34.486721: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=16324 wrote=16385 to_write=-1 index=16386
flush-8:0-2662 [000] 34.942939: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=16784 wrote=8193 to_write=-1 index=32771


5 seconds later, kupdate flush starts to work on expired inodes in
b_io *as well as* whatever inodes are already in the b_more_io
list. Unfortunately inode 131 was moved to b_more_io by the previous
background flush and has been sitting there ever since.
---------------------------------------------------------------------
flush-8:0-2662 [004] 39.951920: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=21808 wrote=16385 to_write=-1 index=40964
flush-8:0-2662 [000] 40.784427: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=22644 wrote=16385 to_write=-1 index=57349
flush-8:0-2662 [000] 41.840671: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=23704 wrote=8193 to_write=-1 index=73734
flush-8:0-2662 [004] 42.845739: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=24712 wrote=8193 to_write=-1 index=81927
flush-8:0-2662 [004] 43.309379: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=25180 wrote=8193 to_write=-1 index=90120
flush-8:0-2662 [000] 43.547443: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC age=25416 wrote=4088 to_write=12296 index=0


This may be a bit surprising, but it should not be a big problem. After
all, vm.dirty_expire_centisecs=30s merely says that dirty inodes will
be put to IO _within_ 35s. The kernel still has some freedom to start
writeback earlier than the deadline, or even to miss the deadline in
the case of overly busy IO.
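
(Both knobs involved are in hundredths of a second; on a default setup
they can be checked with:)

cat /proc/sys/vm/dirty_expire_centisecs     # 3000 = 30s: age at which a dirty inode expires
cat /proc/sys/vm/dirty_writeback_centisecs  # 500 = 5s: kupdate wakeup interval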

Thanks,
Fengguang


Attachments:
global-dirty-state.png (72.93 kB)
global-dirty-state.png (72.87 kB)