2012-05-03 03:43:23

by Fengguang Wu

Subject: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

This helps write performance when setting the dirty threshold to tiny numbers.

3.4.0-rc2 3.4.0-rc2-btrfs4+
------------ ------------------------
96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2

Signed-off-by: Fengguang Wu <[email protected]>
---
fs/btrfs/disk-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/btrfs/disk-io.c 2012-05-02 14:04:00.989262395 +0800
+++ linux-next/fs/btrfs/disk-io.c 2012-05-02 14:04:01.773262414 +0800
@@ -930,7 +930,8 @@ static int btree_writepages(struct addre

/* this is a bit racy, but that's ok */
num_dirty = root->fs_info->dirty_metadata_bytes;
- if (num_dirty < thresh)
+ if (num_dirty < min(thresh,
+ global_dirty_limit << (PAGE_CACHE_SHIFT-2)))
return 0;
}
return btree_write_cache_pages(mapping, wbc);
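
For illustration, the effect of the clamp can be computed in user space.
Below is a minimal sketch (not kernel code), assuming 4KB pages and the
fixed 32MB thresh used by btree_writepages() at the time. Note that
global_dirty_limit is kept in pages, so shifting left by PAGE_CACHE_SHIFT
converts it to bytes; the extra "-2" divides by 4:

	#include <stdio.h>

	#define PAGE_CACHE_SHIFT 12	/* assumed 4KB pages */
	#define MIN(a, b) ((a) < (b) ? (a) : (b))

	int main(void)
	{
		/* btrfs' fixed floor in btree_writepages() (32MB then) */
		unsigned long long thresh = 32ULL * 1024 * 1024;
		unsigned long long limits_mb[] = { 1, 10, 100, 1000 };

		for (int i = 0; i < 4; i++) {
			/* global_dirty_limit equivalent, in pages */
			unsigned long long pages =
				(limits_mb[i] << 20) >> PAGE_CACHE_SHIFT;
			/* pages -> bytes, then /4, as in the patch */
			unsigned long long clamp =
				pages << (PAGE_CACHE_SHIFT - 2);

			printf("dirty limit %4lluMB -> write metadata above %llu KB\n",
			       limits_mb[i], MIN(thresh, clamp) >> 10);
		}
		return 0;
	}

At thresh=1M the fixed 32MB floor presumably could never be exceeded
before hitting the dirty limit, which would explain the near-zero
baseline numbers; the clamp lowers the floor to 256KB there.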


2012-05-03 03:53:25

by Fengguang Wu

Subject: [PATCH] writeback: initialize global_dirty_limit

This prevents global_dirty_limit from remaining 0 (its initial value)
for a long time: it is only updated in update_dirty_limit(), which is
only reached when the system is above the dirty freerun area.

This avoids unexpected consequences when some random code uses it as a
convenient approximation of the global dirty threshold.

Signed-off-by: Fengguang Wu <[email protected]>
---
mm/page-writeback.c | 1 +
1 file changed, 1 insertion(+)

--- linux-next.orig/mm/page-writeback.c 2012-04-25 20:16:12.766859391 +0800
+++ linux-next/mm/page-writeback.c 2012-05-03 11:44:32.746272930 +0800
@@ -1568,6 +1568,7 @@ void writeback_set_ratelimit(void)
unsigned long background_thresh;
unsigned long dirty_thresh;
global_dirty_limits(&background_thresh, &dirty_thresh);
+ global_dirty_limit = dirty_thresh;
ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
if (ratelimit_pages < 16)
ratelimit_pages = 16;
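
For reference, the reason it can stay 0 for a long time: below the dirty
freerun region, balance_dirty_pages() takes a shortcut and never reaches
update_dirty_limit(). A simplified sketch of that ceiling (modeled on this
era's mm/page-writeback.c):

	/*
	 * Dirtiers run free below this mark: no throttling happens, so
	 * update_dirty_limit() is never called and global_dirty_limit
	 * keeps whatever value it had -- 0 after boot, without this patch.
	 */
	static unsigned long dirty_freerun_ceiling(unsigned long thresh,
						   unsigned long bg_thresh)
	{
		return (thresh + bg_thresh) / 2;
	}

A light workload can stay below this ceiling indefinitely, so any code
reading global_dirty_limit would have seen 0.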

2012-05-03 09:25:33

by Jan Kara

Subject: Re: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

On Thu 03-05-12 11:43:11, Wu Fengguang wrote:
> This helps write performance when setting the dirty threshold to tiny numbers.
>
> 3.4.0-rc2 3.4.0-rc2-btrfs4+
> ------------ ------------------------
> 96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
> 98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
> 99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
> 98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
> 98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
> 99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
> ==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
> ==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
> ==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
> ==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2
>
> Signed-off-by: Fengguang Wu <[email protected]>
> ---
> fs/btrfs/disk-io.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/fs/btrfs/disk-io.c 2012-05-02 14:04:00.989262395 +0800
> +++ linux-next/fs/btrfs/disk-io.c 2012-05-02 14:04:01.773262414 +0800
> @@ -930,7 +930,8 @@ static int btree_writepages(struct addre
>
> /* this is a bit racy, but that's ok */
> num_dirty = root->fs_info->dirty_metadata_bytes;
> - if (num_dirty < thresh)
> + if (num_dirty < min(thresh,
> + global_dirty_limit << (PAGE_CACHE_SHIFT-2)))
> return 0;
> }
> return btree_write_cache_pages(mapping, wbc);
Frankly, that whole condition on WB_SYNC_NONE in btree_writepages() looks
like a hack. I think we also had problems with this condition when we tried
to change the b_more_io list handling. I found a rather terse commit message
explaining the code:
	Btrfs: Limit btree writeback to prevent seeks

Which I kind of understand, but is it that bad? Also, I think the last time
we stumbled over this code we were discussing that this dirty metadata would
simply be hidden from the mm, which would solve the problem of the flusher
thread trying to outsmart the filesystem... But I guess no one has had time
to implement this for btrfs.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-05-03 10:03:01

by Fengguang Wu

Subject: Re: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

On Thu, May 03, 2012 at 11:25:28AM +0200, Jan Kara wrote:
> On Thu 03-05-12 11:43:11, Wu Fengguang wrote:
> > This helps write performance when setting the dirty threshold to tiny numbers.
> >
> > 3.4.0-rc2 3.4.0-rc2-btrfs4+
> > ------------ ------------------------
> > 96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
> > 98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
> > 99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
> > 98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
> > 98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
> > 99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
> > ==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
> > ==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
> > ==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
> > ==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2
> >
> > Signed-off-by: Fengguang Wu <[email protected]>
> > ---
> > fs/btrfs/disk-io.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > --- linux-next.orig/fs/btrfs/disk-io.c 2012-05-02 14:04:00.989262395 +0800
> > +++ linux-next/fs/btrfs/disk-io.c 2012-05-02 14:04:01.773262414 +0800
> > @@ -930,7 +930,8 @@ static int btree_writepages(struct addre
> >
> > /* this is a bit racy, but that's ok */
> > num_dirty = root->fs_info->dirty_metadata_bytes;
> > - if (num_dirty < thresh)
> > + if (num_dirty < min(thresh,
> > + global_dirty_limit << (PAGE_CACHE_SHIFT-2)))
> > return 0;
> > }
> > return btree_write_cache_pages(mapping, wbc);
> Frankly, that whole condition on WB_SYNC_NONE in btree_writepages() looks
> like a hack. I think we also had problems with this condition when we tried
> to change the b_more_io list handling. I found a rather terse commit message
> explaining the code:
> 	Btrfs: Limit btree writeback to prevent seeks
>
> Which I kind of understand, but is it that bad? Also, I think the last time
> we stumbled over this code we were discussing that this dirty metadata would
> simply be hidden from the mm, which would solve the problem of the flusher
> thread trying to outsmart the filesystem... But I guess no one has had time
> to implement this for btrfs.

Yeah, I have the same uneasy feeling. Actually my first attempt was to
remove the heuristic in btree_writepages() altogether. The result was
more or less performance degradation in the normal cases:

wfg@bee /export/writeback% ./compare bay/*/*-{3.4.0-rc2,3.4.0-rc2-btrfs+}
3.4.0-rc2 3.4.0-rc2-btrfs+
------------------------ ------------------------
190.81 -6.8% 177.82 bay/JBOD-2HDD-thresh=1000M/btrfs-100dd-1-3.4.0-rc2
195.86 -3.3% 189.31 bay/JBOD-2HDD-thresh=1000M/btrfs-10dd-1-3.4.0-rc2
196.68 -1.7% 193.30 bay/JBOD-2HDD-thresh=1000M/btrfs-1dd-1-3.4.0-rc2
194.83 -24.4% 147.27 bay/JBOD-2HDD-thresh=100M/btrfs-100dd-1-3.4.0-rc2
196.60 -2.5% 191.61 bay/JBOD-2HDD-thresh=100M/btrfs-10dd-1-3.4.0-rc2
197.09 -0.7% 195.69 bay/JBOD-2HDD-thresh=100M/btrfs-1dd-1-3.4.0-rc2
181.64 -8.7% 165.80 bay/RAID0-2HDD-thresh=1000M/btrfs-100dd-1-3.4.0-rc2
186.14 -2.8% 180.85 bay/RAID0-2HDD-thresh=1000M/btrfs-10dd-1-3.4.0-rc2
191.10 -1.5% 188.23 bay/RAID0-2HDD-thresh=1000M/btrfs-1dd-1-3.4.0-rc2
191.30 -20.7% 151.63 bay/RAID0-2HDD-thresh=100M/btrfs-100dd-1-3.4.0-rc2
186.03 -2.4% 181.54 bay/RAID0-2HDD-thresh=100M/btrfs-10dd-1-3.4.0-rc2
170.18 -2.5% 165.97 bay/RAID0-2HDD-thresh=100M/btrfs-1dd-1-3.4.0-rc2
96.18 -1.9% 94.32 bay/RAID1-2HDD-thresh=1000M/btrfs-100dd-1-3.4.0-rc2
97.71 -1.4% 96.36 bay/RAID1-2HDD-thresh=1000M/btrfs-10dd-1-3.4.0-rc2
97.57 -0.4% 97.23 bay/RAID1-2HDD-thresh=1000M/btrfs-1dd-1-3.4.0-rc2
97.68 -6.0% 91.79 bay/RAID1-2HDD-thresh=100M/btrfs-100dd-1-3.4.0-rc2
97.76 -0.7% 97.07 bay/RAID1-2HDD-thresh=100M/btrfs-10dd-1-3.4.0-rc2
97.53 -0.3% 97.19 bay/RAID1-2HDD-thresh=100M/btrfs-1dd-1-3.4.0-rc2
96.92 -3.0% 94.03 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
98.47 -1.4% 97.08 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
99.38 -0.7% 98.66 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
98.04 -8.2% 89.99 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
98.68 -0.6% 98.09 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
99.34 -0.7% 98.62 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
88.98 -0.5% 88.51 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
86.99 +14.5% 99.60 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
2.75 +1871.2% 54.18 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
3.31 +2035.0% 70.70 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2
3635.55 -1.2% 3592.46 TOTAL write_bw

So I end up with the conservative fix in this patch.

FYI, I also experimented with "global_dirty_limit << PAGE_CACHE_SHIFT",
i.e. without the further "/4" in this patch; however, the results were
not good:

3.4.0-rc2 3.4.0-rc2-btrfs3+
------------------------ ------------------------
96.92 -0.3% 96.62 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
98.47 +0.1% 98.56 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
99.38 -0.2% 99.23 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
98.04 +0.1% 98.15 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
98.68 +0.3% 98.96 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
99.34 -0.1% 99.20 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
88.98 -0.3% 88.73 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
86.99 +1.4% 88.23 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
2.75 +232.0% 9.13 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
3.31 +1.5% 3.36 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2

So this patch is based on "experiment" rather than "reasoning", and I
took the easy way out by using the global dirty threshold. Ideally it
should be based on the per-bdi dirty threshold, but anyway...
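
If someone wants to pursue the per-bdi direction, a rough sketch (purely
hypothetical and untested; bdi_dirty_limit() returns this bdi's share of
the given threshold, in pages) might be:

	/* clamp by this bdi's share of the dirty limit, not the global one */
	if (num_dirty < min(thresh,
			    bdi_dirty_limit(mapping->backing_dev_info,
					    global_dirty_limit)
			    << (PAGE_CACHE_SHIFT - 2)))
		return 0;

The "/4" scaling would presumably still be needed, for the same
experimental reasons as above.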

Thanks,
Fengguang

2012-05-03 12:32:36

by Chris Mason

Subject: Re: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

On Thu, May 03, 2012 at 11:25:28AM +0200, Jan Kara wrote:
> On Thu 03-05-12 11:43:11, Wu Fengguang wrote:
> > This helps write performance when setting the dirty threshold to tiny numbers.
> >
> > 3.4.0-rc2 3.4.0-rc2-btrfs4+
> > ------------ ------------------------
> > 96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
> > 98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
> > 99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
> > 98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
> > 98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
> > 99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
> > ==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
> > ==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
> > ==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
> > ==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2
> >
> > Signed-off-by: Fengguang Wu <[email protected]>
> > ---
> > fs/btrfs/disk-io.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > --- linux-next.orig/fs/btrfs/disk-io.c 2012-05-02 14:04:00.989262395 +0800
> > +++ linux-next/fs/btrfs/disk-io.c 2012-05-02 14:04:01.773262414 +0800
> > @@ -930,7 +930,8 @@ static int btree_writepages(struct addre
> >
> > /* this is a bit racy, but that's ok */
> > num_dirty = root->fs_info->dirty_metadata_bytes;
> > - if (num_dirty < thresh)
> > + if (num_dirty < min(thresh,
> > + global_dirty_limit << (PAGE_CACHE_SHIFT-2)))
> > return 0;
> > }
> > return btree_write_cache_pages(mapping, wbc);
> Frankly, that whole condition on WB_SYNC_NONE in btree_writepages() looks
> like a hack. I think we also had problems with this condition when we tried
> to change the b_more_io list handling. I found a rather terse commit message
> explaining the code:
> 	Btrfs: Limit btree writeback to prevent seeks

It is definitely a hack ;) The basic point is that once we write a
metadata block, we have to COW it for any future changes. So writing
the metadata has a pretty big impact on performance, and I'd rather
write everything else that is dirty first. When that code was added, I
was finding the metadata going to disk very early under memory pressure.
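
To make that concrete, the in-place-modification check looks roughly like
this (a simplified sketch of btrfs' should_cow_block(); details elided):

	/*
	 * A block created in the current transaction can be modified in
	 * place -- until writeback marks it WRITTEN.  From then on every
	 * further change must allocate a new block and copy (COW).
	 */
	if (btrfs_header_generation(buf) == trans->transid &&
	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN))
		return 0;	/* still cheap: modify in place */
	return 1;		/* already on disk: next change COWs */

So each premature writeback turns cheap in-place btree updates into extra
allocations and copies, which is why I'd rather delay it.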

I'm open to any ideas on this one.

-chris

2012-05-03 13:30:56

by Josef Bacik

Subject: Re: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

On Thu, May 03, 2012 at 11:25:28AM +0200, Jan Kara wrote:
> On Thu 03-05-12 11:43:11, Wu Fengguang wrote:
> > This helps write performance when setting the dirty threshold to tiny numbers.
> >
> > 3.4.0-rc2 3.4.0-rc2-btrfs4+
> > ------------ ------------------------
> > 96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
> > 98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
> > 99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
> > 98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
> > 98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
> > 99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
> > ==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
> > ==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
> > ==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
> > ==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2
> >
> > Signed-off-by: Fengguang Wu <[email protected]>
> > ---
> > fs/btrfs/disk-io.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > --- linux-next.orig/fs/btrfs/disk-io.c 2012-05-02 14:04:00.989262395 +0800
> > +++ linux-next/fs/btrfs/disk-io.c 2012-05-02 14:04:01.773262414 +0800
> > @@ -930,7 +930,8 @@ static int btree_writepages(struct addre
> >
> > /* this is a bit racy, but that's ok */
> > num_dirty = root->fs_info->dirty_metadata_bytes;
> > - if (num_dirty < thresh)
> > + if (num_dirty < min(thresh,
> > + global_dirty_limit << (PAGE_CACHE_SHIFT-2)))
> > return 0;
> > }
> > return btree_write_cache_pages(mapping, wbc);
> Frankly, that whole condition on WB_SYNC_NONE in btree_writepages() looks
> like a hack. I think we also had problems with this condition when we tried
> to change the b_more_io list handling. I found a rather terse commit message
> explaining the code:
> 	Btrfs: Limit btree writeback to prevent seeks
>
> Which I kind of understand, but is it that bad? Also, I think the last time
> we stumbled over this code we were discussing that this dirty metadata would
> simply be hidden from the mm, which would solve the problem of the flusher
> thread trying to outsmart the filesystem... But I guess no one has had time
> to implement this for btrfs.
>

Actually I did, but I ran into an OOM problem. See, we can have as much dirty
metadata as we have RAM, and having no insight into what the global dirty and
writeback limits are for the system means btrfs was using way more memory
for its dirty and writeback metadata pages than would normally have been
allowed. In order to avoid OOM I had to re-implement a sort of
balance_dirty_pages() for btrfs, and again, having no access to the global
dirty limits and such at the time (AFAIK; I could just be an idiot), it was
very hacky and prone to breaking. The shrinker doesn't get called often
enough to handle this sort of thing. Dave mentioned at LSF that XFS will
actually do the synchronous writeout from the shrinker, which auto-throttles
everything, so I was going to try that but I haven't gotten around to it.
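
Very roughly, that idea might look like this (a hypothetical sketch against
the 3.4-era shrinker API; the meta_shrinker field and the
btrfs_write_dirty_metadata() helper are invented for illustration, not
real code):

	/*
	 * Synchronous metadata writeback from the shrinker: allocators
	 * end up waiting on the I/O, which throttles dirtiers for free.
	 */
	static int btrfs_meta_shrink(struct shrinker *shrink,
				     struct shrink_control *sc)
	{
		struct btrfs_fs_info *fs_info =
			container_of(shrink, struct btrfs_fs_info,
				     meta_shrinker);

		if (sc->nr_to_scan)
			/* invented helper: write out and wait on up to
			 * nr_to_scan dirty metadata pages */
			btrfs_write_dirty_metadata(fs_info, sc->nr_to_scan);

		/* report remaining dirty metadata, in pages */
		return (int)(fs_info->dirty_metadata_bytes >> PAGE_CACHE_SHIFT);
	}

Thanks,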

Josef

2012-05-03 14:08:42

by Fengguang Wu

Subject: Re: [PATCH] btrfs: lower metadata writeback threshold on low dirty threshold

On Thu, May 03, 2012 at 11:43:11AM +0800, Fengguang Wu wrote:
> This helps write performance when setting the dirty threshold to tiny numbers.
>
> 3.4.0-rc2 3.4.0-rc2-btrfs4+
> ------------ ------------------------
> 96.92 -0.4% 96.54 bay/thresh=1000M/btrfs-100dd-1-3.4.0-rc2
> 98.47 +0.0% 98.50 bay/thresh=1000M/btrfs-10dd-1-3.4.0-rc2
> 99.38 -0.3% 99.06 bay/thresh=1000M/btrfs-1dd-1-3.4.0-rc2
> 98.04 -0.0% 98.02 bay/thresh=100M/btrfs-100dd-1-3.4.0-rc2
> 98.68 +0.3% 98.98 bay/thresh=100M/btrfs-10dd-1-3.4.0-rc2
> 99.34 -0.0% 99.31 bay/thresh=100M/btrfs-1dd-1-3.4.0-rc2
> ==> 88.98 +9.6% 97.53 bay/thresh=10M/btrfs-10dd-1-3.4.0-rc2
> ==> 86.99 +13.1% 98.39 bay/thresh=10M/btrfs-1dd-1-3.4.0-rc2
> ==> 2.75 +2442.4% 69.88 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2
> ==> 3.31 +2634.1% 90.54 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2

Well, further tests show that it behaves very unstably:

3.4.0-rc2-btrfs4+ 3.4.0-rc2-btrfs5+
------------------------ ------------------------
69.88 +16.4% 81.31 bay/thresh=1M/btrfs-10dd-1-3.4.0-rc2-btrfs4+
71.09 +1.4% 72.05 bay/thresh=1M/btrfs-10dd-2-3.4.0-rc2-btrfs4+
72.60 -1.7% 71.38 bay/thresh=1M/btrfs-10dd-3-3.4.0-rc2-btrfs4+
90.54 -0.9% 89.74 bay/thresh=1M/btrfs-1dd-1-3.4.0-rc2-btrfs4+
89.17 -90.2% ==> 8.71 bay/thresh=1M/btrfs-1dd-2-3.4.0-rc2-btrfs4+
==> 14.96 +495.3% 89.06 bay/thresh=1M/btrfs-1dd-3-3.4.0-rc2-btrfs4+
408.23 +1.0% 412.26 TOTAL write_bw

Here the -btrfs5 kernel carries one more patch, which removes the write plug.

Thanks,
Fengguang