2008-08-05 21:22:51

by Chris Mason

Subject: Btrfs v0.16 released

Hello everyone,

Btrfs v0.16 is available for download, please see
http://btrfs.wiki.kernel.org/ for download links and project
information.

v0.16 has a shiny new disk format, and is not compatible with
filesystems created by older Btrfs releases. But, it should be the
fastest Btrfs yet, with a wide variety of scalability fixes and new
features.

There were quite a few contributors this time around, but big thanks to
Josef Bacik and Yan Zheng for their help on this release. Toei Rei also
helped track down an important corruption problem.

Scalability and performance:

* Fine grained btree locking. The large fs_mutex is finally gone.
There is still some work to do on the locking during extent allocation,
but the code is much more scalable than it was.

* Helper threads for checksumming and other background tasks. Most CPU
intensive operations have been pushed off to helper threads to take
advantage of SMP machines. Streaming read and write throughput now
scale to disk speed even with checksumming on.

* Improved data=ordered mode. Metadata is now updated only after all
the blocks in a data extent are on disk. This allows btrfs to provide
data=ordered semantics without waiting for all the dirty data in the FS
to flush at commit time. fsync and O_SYNC writes do not force down all
the dirty data in the FS.

* Faster cleanup of old transactions (Yan Zheng). A new cache now
dramatically reduces the amount of IO required to clean up and delete
old snapshots.

Major features (all from Josef Bacik):

* ACL support. ACLs are enabled by default, no special mount options
required.

* Orphan inode prevention, no more lost files after a crash

* New directory index format, fixing some suboptimal corner cases in the
original.

There are still more disk format changes planned, but we're making every
effort to get them out of the way as quickly as we can. You can see the
major features we have planned on the development timeline:

http://btrfs.wiki.kernel.org/index.php/Development_timeline

A few interesting statistics:

Between v0.15 and v0.16:

42 files changed, 6995 insertions(+), 3011 deletions(-)

The btrfs kernel module now weighs in at 30,000 LOC, which means we're
getting very close to the size of ext[34].

-chris



2008-08-07 12:38:49

by Peter Zijlstra

Subject: Re: Btrfs v0.16 released

On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> There are still more disk format changes planned, but we're making every
> effort to get them out of the way as quickly as we can. You can see the
> major features we have planned on the development timeline:
>
> http://btrfs.wiki.kernel.org/index.php/Development_timeline

Just took a peek, seems to be slightly out of date as it still lists the
single mutex thingy.

Also, how true is the IO-error and disk-full claim?

2008-08-07 12:44:23

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-07 at 11:14 +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:
>
> > There are still more disk format changes planned, but we're making every
> > effort to get them out of the way as quickly as we can. You can see the
> > major features we have planned on the development timeline:
> >
> > http://btrfs.wiki.kernel.org/index.php/Development_timeline
>
> Just took a peek, seems to be slightly out of date as it still lists the
> single mutex thingy.

Thanks, I thought I had removed all the references to it on that page,
but there was one left.

>
> Also, how true is the IO-error and disk-full claim?
>

We still don't handle disk full. The IO errors are handled most of the
time. If a checksum doesn't match or the lower layers report an IO
error, btrfs will use an alternate mirror of the block. If there is no
alternate mirror, the caller gets EIO and in the case of a failed csum,
the page is zero filled (actually filled with ones so I can find bogus
pages in an oops).

Metadata is duplicated by default even on single spindle drives, so this
means that metadata IO errors are handled as long as the other mirror is
happy.

If mirroring is off or both mirrors are bad, we currently get into
trouble.

Data pages work better: those errors bubble up to userland just like in
other filesystems.
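
For readers following along, a minimal sketch of that fallback path; the
type and helpers are hypothetical stand-ins, not the actual btrfs code:

#include <errno.h>
#include <string.h>

struct block_read {
	unsigned char *buf;	/* page-sized destination buffer */
	size_t len;
	int nr_mirrors;		/* 1 when there is no alternate copy */
};

/* Hypothetical helpers: read one copy, verify its checksum. */
int read_mirror(struct block_read *br, int mirror);
int csum_ok(const struct block_read *br);

int read_with_fallback(struct block_read *br)
{
	int mirror;

	for (mirror = 0; mirror < br->nr_mirrors; mirror++) {
		if (read_mirror(br, mirror) == 0 && csum_ok(br))
			return 0;	/* found a good copy */
	}
	/*
	 * No good copy available: fill the page with ones so bogus
	 * pages stand out in an oops, and hand EIO back to the caller.
	 */
	memset(br->buf, 0xff, br->len);
	return -EIO;
}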

-chris

2008-08-07 12:52:34

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-07 at 11:08 +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:
>
> > * Fine grained btree locking. The large fs_mutex is finally gone.
> > There is still some work to do on the locking during extent allocation,
> > but the code is much more scalable than it was.
>
> Cool - will try to find a cycle to stare at the code ;-)
>

I was able to get it mostly lockdep compliant by using mutex_lock_nested
based on the level of the btree I was locking. My allocation mutex is a
bit of a problem for lockdep though.
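
Roughly, that annotation looks like the sketch below; struct btree_node
and the lock helpers are hypothetical stand-ins rather than the real
btrfs structures, but mutex_lock_nested() is the real lockdep API being
described:

#include <linux/mutex.h>

struct btree_node {
	struct mutex lock;
	int level;	/* 0 for leaves, increasing toward the root */
};

static void btree_lock_node(struct btree_node *node)
{
	/*
	 * One lockdep subclass per tree level, so walking from the root
	 * down to a leaf looks like an ordered sequence of lock classes
	 * instead of recursive locking within a single class.
	 */
	mutex_lock_nested(&node->lock, node->level);
}

static void btree_unlock_node(struct btree_node *node)
{
	mutex_unlock(&node->lock);
}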

> > * Helper threads for checksumming and other background tasks. Most CPU
> > intensive operations have been pushed off to helper threads to take
> > advantage of SMP machines. Streaming read and write throughput now
> > scale to disk speed even with checksumming on.
>
> Can this lead to the same Priority Inversion issues as seen with
> kjournald?
>

Yes, although in general only the helper threads end up actually doing
the IO for writes. Unfortunately, they are almost but not quite an
elevator. It is tempting to try sorting the bios on the helper queues
etc. But I haven't done that because it gets into starvation and other
fun.

I haven't done any real single cpu testing, it may make sense in those
workloads to checksum and submit directly in the calling context. But
real single cpu boxes are harder to come by these days.

-chris

2008-08-07 13:12:52

by Peter Zijlstra

Subject: Re: Btrfs v0.16 released

On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> * Fine grained btree locking. The large fs_mutex is finally gone.
> There is still some work to do on the locking during extent allocation,
> but the code is much more scalable than it was.

Cool - will try to find a cycle to stare at the code ;-)

> * Helper threads for checksumming and other background tasks. Most CPU
> intensive operations have been pushed off to helper threads to take
> advantage of SMP machines. Streaming read and write throughput now
> scale to disk speed even with checksumming on.

Can this lead to the same Priority Inversion issues as seen with
kjournald?

2008-08-07 14:06:51

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-07 at 17:03 +0300, Ahmed Kamal wrote:
> With csum errors, do we get warnings in logs ?

Yes

> Does too many faults cause a device to be flagged as faulty ?

Not yet

> is there any user-space application to monitor/scrub/re-silver btrfs
> volumes ?
>
Not yet, but there definitely will be.

-chris

2008-08-07 14:59:25

by Chris Friesen

Subject: Re: Btrfs v0.16 released

Chris Mason wrote:

> I haven't done any real single cpu testing, it may make sense in those
> workloads to checksum and submit directly in the calling context. But
> real single cpu boxes are harder to come by these days.

They're still pretty common in the embedded/low-power space. I could
see something like a set-top box wanting to use btrfs with massive disks.

Chris

2008-08-07 15:08:21

by Tvrtko Ursulin

Subject: Re: Btrfs v0.16 released

Chris Mason wrote on 07/08/2008 11:34:02:

> > > * Helper threads for checksumming and other background tasks. Most CPU
> > > intensive operations have been pushed off to helper threads to take
> > > advantage of SMP machines. Streaming read and write throughput now
> > > scale to disk speed even with checksumming on.
> >
> > Can this lead to the same Priority Inversion issues as seen with
> > kjournald?
>
> Yes, although in general only the helper threads end up actually doing
> the IO for writes. Unfortunately, they are almost but not quite an
> elevator. It is tempting to try sorting the bios on the helper queues
> etc. But I haven't done that because it gets into starvation and other
> fun.
>
> I haven't done any real single cpu testing, it may make sense in those
> workloads to checksum and submit directly in the calling context. But
> real single cpu boxes are harder to come by these days.

[just jumping in as a casual bystander with one remark]

For this purpose it seems that booting with the machine limited to one
CPU should be sufficient.

Tvrtko


2008-08-07 18:02:54

by Andi Kleen

Subject: Re: Btrfs v0.16 released

Chris Mason <[email protected]> writes:
>
> Metadata is duplicated by default even on single spindle drives,

Can you please say a bit how much that impacts performance? That sounds
costly.

-Andi

2008-08-08 18:49:56

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-07 at 20:02 +0200, Andi Kleen wrote:
> Chris Mason <[email protected]> writes:
> >
> > Metadata is duplicated by default even on single spindle drives,
>
> Can you please say a bit how much that impacts performance? That sounds
> costly.

Most metadata is allocated in groups of 128k or 256k, and so most of the
writes are nicely sized. The mirroring code has areas of the disk
dedicated to mirroring other areas. So we end up with something like this:

metadata chunk A (~1GB in size)
[ ......................... ]

mirror of chunk A (~1GB in size)
[ ......................... ]

So, the mirroring turns a single large write into two large writes.
Definitely not free, but always a fixed cost.
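
As an illustration of that fixed cost, a userspace-style sketch with
plain pwrite() calls and made-up chunk offsets (not the real btrfs chunk
mapping): the same metadata extent goes out once into chunk A and once
into its mirror.

#include <sys/types.h>
#include <unistd.h>

/*
 * Write one metadata extent twice: once at its offset inside chunk A,
 * and once at the same relative offset inside the mirror chunk.  Both
 * writes stay large and mostly sequential, so the cost is roughly 2x
 * the bytes rather than extra seeks per block.
 */
ssize_t write_metadata_dup(int fd, const void *buf, size_t len,
			   off_t chunk_a, off_t mirror_chunk, off_t rel)
{
	ssize_t ret = pwrite(fd, buf, len, chunk_a + rel);

	if (ret < 0)
		return ret;
	return pwrite(fd, buf, len, mirror_chunk + rel);
}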

I started gathering some numbers on this yesterday on single spindles
and discovered that my worker threads are not doing as good a job as
they should at maintaining IO ordering. I've been using an array with a
writeback cache for benchmarking lately and hadn't noticed.

I need to fix that, but here are some numbers on a single sata drive.
The drive can do about 100MB/s streaming reads/writes. Btrfs
checksumming and inline data (tail packing) are both turned on.

Single process creating 30 kernel trees (2.6.27-rc2)

Btrfs defaults 36MB/s
Btrfs no mirror 50MB/s
Ext4 defaults 59.2MB/s (much better than ext3 here)

With /sys/block/sdb/queue/nr_requests at 8192 to hide my IO ordering
submission problems:

Btrfs defaults: 57MB/s
Btrfs no mirror: 61.51MB/s

-chris

2008-08-08 22:11:39

by Andi Kleen

Subject: Re: Btrfs v0.16 released

> So, the mirroring turns a single large write into two large writes.
> Definitely not free, but always a fixed cost.

Thanks for the explanation and the numbers. I see that's the advantage of
copy-on-write: you can always cluster the metadata together, always get
batched IO that way, and can then afford to do more of it.

Still wondering what that will do to read seekiness.

-Andi

2008-08-09 01:19:37

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

On Fri, Aug 08, 2008 at 11:56:25PM +0200, Andi Kleen wrote:
> > So, the mirroring turns a single large write into two large writes.
> > Definitely not free, but always a fixed cost.
>
> Thanks for the explanation and the numbers. I see that's the advantage of
> copy-on-write: you can always cluster the metadata together, always get
> batched IO that way, and can then afford to do more of it.
>
> Still wondering what that will do to read seekiness.

In theory, if the elevator was smart enough, it could actually help
read seekiness; there are two copies of the metadata, and it shouldn't
matter which one is fetched. So I could imagine a (hypothetical) read
request which says, "please give me the contents of block 4500 or
75000000 --- I don't care which, if the disk head is closer to one end
of the disk or another, use whichever one is most convenient". Our
elevator algorithms are currently totally unable to deal with this
sort of request, and if SSD's are going to be coming on line as
quickly as some people are claiming, maybe it's not worth it to try to
implement that kind of thing, but at least in theory it's something
that could be done....
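
Purely for illustration, the decision such a request would enable could
be as simple as the helper below, assuming numeric block distance is a
usable proxy for seek distance:

#include <stdint.h>

/*
 * Return whichever of two mirrored block numbers is numerically closer
 * to the last known head position.
 */
static uint64_t pick_mirror(uint64_t head, uint64_t a, uint64_t b)
{
	uint64_t da = (head > a) ? head - a : a - head;
	uint64_t db = (head > b) ? head - b : b - head;

	return (da <= db) ? a : b;
}

/* e.g. pick_mirror(5000, 4500, 75000000) would choose block 4500. */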

- Ted

2008-08-09 01:22:45

by Andi Kleen

Subject: Re: Btrfs v0.16 released

> In theory, if the elevator was smart enough, it could actually help
> read seekiness; there are two copies of the metadata, and it shouldn't

That assumes the elevator actually knows what is nearby? I thought
that wasn't that easy with modern disks with multiple spindles
and invisible remapping, not even talking about RAID
arrays looking like disks.

-Andi

2008-08-09 01:44:23

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

On Sat, Aug 09, 2008 at 03:23:22AM +0200, Andi Kleen wrote:
> > In theory, if the elevator was smart enough, it could actually help
> > read seekiness; there are two copies of the metadata, and it shouldn't
>
> That assumes the elevator actually knows what is nearby? I thought
> that wasn't that easy with modern disks with multiple spindles
> and invisible remapping, not even talking about RAID
> arrays looking like disks.

RAID is the big problem, yeah. In general, though, we are already
making an assumption in the elevator code and in filesystem code that
block numbers which are numerically closer together are "close" from
the perspective of disks. There has been talk about trying to make
filesystems smarter about allocating blocks by giving them visibility
to the RAID parameters; in theory the elevator algorithm could also be
made smarter as well using the same information. I'm really not sure
if the complexity is worth it, though....

- Ted

2008-08-14 21:01:32

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Fri, 2008-08-08 at 14:48 -0400, Chris Mason wrote:
> On Thu, 2008-08-07 at 20:02 +0200, Andi Kleen wrote:
> > Chris Mason <[email protected]> writes:
> > >
> > > Metadata is duplicated by default even on single spindle drives,
> >
> > Can you please say a bit how much that impacts performance? That sounds
> > costly.
>
> Most metadata is allocated in groups of 128k or 256k, and so most of the
> writes are nicely sized. The mirroring code has areas of the disk
> dedicated to mirror other areas.

[ ... ]

> So, the mirroring turns a single large write into two large writes.
> Definitely not free, but always a fixed cost.

> With /sys/block/sdb/queue/nr_requests at 8192 to hide my IO ordering
> submission problems:
>
> Btrfs defaults: 57MB/s
> Btrfs no mirror: 61.51MB/s

I spent a bunch of time hammering on different ways to fix this without
increasing nr_requests, and it was a mixture of needing better tuning in
btrfs and needing to init mapping->writeback_index on inode allocation.

So, today's numbers for creating 30 kernel trees in sequence:

Btrfs defaults 57.41 MB/s
Btrfs dup no csum 74.59 MB/s
Btrfs no duplication 76.83 MB/s
Btrfs no dup no csum no inline 76.85 MB/s

Ext4 data=writeback, delalloc 60.50 MB/s

I may be able to get the duplication numbers higher by tuning metadata
writeback. My current code doesn't push metadata throughput as high in
order to give some spindle time to data writes.

This graph may give you an idea of how the duplication goes to disk:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-default.png

Compared with the result of mkfs.btrfs -m single (no duplication):

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-single.png

Both on one graph is a little hard to read:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-compare.png

Here is btrfs with duplication on, but without checksumming. Even with
inline extents on, the checksums seem to cause most of the metadata
related syncing (they are stored in the btree):

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-nosum.png

It is worth noting that with checksumming on, I go through async
kthreads to do the checksumming and they may be reordering the IO a bit
as they submit things. So, I'm not 100% sure the extra seeks aren't
coming from my async code.

And Ext4:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/ext4-writeback.png

This benchmark has questionable real-world value, but since it includes
a number of smallish files it is a good place to look at the cost of
metadata and metadata duplication.

I'll push the btrfs related changes for this out tonight after some
stress testing.

-chris

2008-08-14 21:16:45

by Andi Kleen

Subject: Re: Btrfs v0.16 released

On Thu, Aug 14, 2008 at 05:00:56PM -0400, Chris Mason wrote:
> Btrfs defaults 57.41 MB/s
> Btrfs dup no csum 74.59 MB/s

With duplications checksums seem to be quite costly (CPU bound?)

> Btrfs no duplication 76.83 MB/s
> Btrfs no dup no csum no inline 76.85 MB/s

But without duplication they are basically free here at least
in IO rate. Seems odd?

Does it compute them twice in the duplication case perhaps?

-Andi

2008-08-14 23:45:20

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

> I spent a bunch of time hammering on different ways to fix this without
> increasing nr_requests, and it was a mixture of needing better tuning in
> btrfs and needing to init mapping->writeback_index on inode allocation.
>
> So, today's numbers for creating 30 kernel trees in sequence:
>
> Btrfs defaults 57.41 MB/s
> Btrfs dup no csum 74.59 MB/s
> Btrfs no duplication 76.83 MB/s
> Btrfs no dup no csum no inline 76.85 MB/s

What sort of script are you using? Basically something like this?

for i in `seq 1 30`; do
    mkdir $i; cd $i
    tar xjf /usr/src/linux-2.6.28.tar.bz2
    cd ..
done

- Ted

2008-08-15 01:11:17

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-14 at 19:44 -0400, Theodore Tso wrote:
> > I spent a bunch of time hammering on different ways to fix this without
> > increasing nr_requests, and it was a mixture of needing better tuning in
> > btrfs and needing to init mapping->writeback_index on inode allocation.
> >
> > So, today's numbers for creating 30 kernel trees in sequence:
> >
> > Btrfs defaults 57.41 MB/s
> > Btrfs dup no csum 74.59 MB/s
> > Btrfs no duplication 76.83 MB/s
> > Btrfs no dup no csum no inline 76.85 MB/s
>
> What sort of script are you using? Basically something like this?
>
> for i in `seq 1 30`; do
>     mkdir $i; cd $i
>     tar xjf /usr/src/linux-2.6.28.tar.bz2
>     cd ..
> done

Similar. I used compilebench -i 30 -r 0, which means create 30 initial
kernel trees and then do nothing. compilebench simulates compiles by
writing files to the FS of the same sizes you would get by creating
kernel trees or compiling them.

The idea is to get all of the IO without needing to keep 2.6.28.tar.bz2
in cache or the compiler using up CPU.

http://www.oracle.com/~mason/compilebench

-chris

2008-08-15 01:26:54

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-14 at 23:17 +0200, Andi Kleen wrote:
> On Thu, Aug 14, 2008 at 05:00:56PM -0400, Chris Mason wrote:
> > Btrfs defaults 57.41 MB/s

Looks like I can get the btrfs defaults up to 64MB/s with some writeback
tweaks.

> > Btrfs dup no csum 74.59 MB/s
>
> With duplications checksums seem to be quite costly (CPU bound?)
>

The async worker threads should be spreading the load across CPUs pretty
well, and even a single CPU could keep up with 100MB/s checksumming.
But, the async worker threads do randomize the IO somewhat because the
IO goes from pdflush -> one worker thread per CPU -> submit_bio. So,
maybe that 3rd thread is more than the drive can handle?
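
For reference, a rough sketch of that pipeline; the work item and both
helper functions are hypothetical names, and only the pdflush -> helper
thread -> submit shape is the point:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/bio.h>

/* Hypothetical helpers standing in for the checksum and submit paths. */
void checksum_bio(struct bio *bio);
void submit_one_bio(struct bio *bio);

struct csum_work {
	struct work_struct work;
	struct bio *bio;
};

static void csum_and_submit(struct work_struct *w)
{
	struct csum_work *cw = container_of(w, struct csum_work, work);

	checksum_bio(cw->bio);		/* the CPU-heavy part */
	submit_one_bio(cw->bio);	/* hand the IO to the elevator */
	kfree(cw);
}

/* Called from pdflush context: defer the checksumming to a helper
 * thread (the real code spreads this over one worker per CPU). */
static int queue_csum_write(struct workqueue_struct *wq, struct bio *bio)
{
	struct csum_work *cw = kmalloc(sizeof(*cw), GFP_NOFS);

	if (!cw)
		return -ENOMEM;
	cw->bio = bio;
	INIT_WORK(&cw->work, csum_and_submit);
	queue_work(wq, &cw->work);
	return 0;
}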

btrfsck tells me the total size of the btree is only 20MB larger with
checksumming on.

> > Btrfs no duplication 76.83 MB/s
> > Btrfs no dup no csum no inline 76.85 MB/s
>
> But without duplication they are basically free here at least
> in IO rate. Seems odd?
>
> Does it compute them twice in the duplication case perhaps?
>

The duplication happens lower down in the stack, they only get done
once.

-chris

2008-08-15 01:38:22

by Andi Kleen

Subject: Re: Btrfs v0.16 released

> The async worker threads should be spreading the load across CPUs pretty
> well, and even a single CPU could keep up with 100MB/s checksumming.
> But, the async worker threads do randomize the IO somewhat because the
> IO goes from pdflush -> one worker thread per CPU -> submit_bio. So,
> maybe that 3rd thread is more than the drive can handle?

You have more threads with duplication?

>
> btrfsck tells me the total size of the btree is only 20MB larger with
> checksumming on.
>
> > > Btrfs no duplication 76.83 MB/s
> > > Btrfs no dup no csum no inline 76.85 MB/s
> >
> > But without duplication they are basically free here at least
> > in IO rate. Seems odd?
> >
> > Does it compute them twice in the duplication case perhaps?
> >
>
> The duplication happens lower down in the stack, they only get done
> once.

Ok was just speculation. The big difference still seems odd.

-Andi

2008-08-15 12:46:38

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Thu, 2008-08-14 at 21:10 -0400, Chris Mason wrote:
> On Thu, 2008-08-14 at 19:44 -0400, Theodore Tso wrote:
> > > I spent a bunch of time hammering on different ways to fix this without
> > > increasing nr_requests, and it was a mixture of needing better tuning in
> > > btrfs and needing to init mapping->writeback_index on inode allocation.
> > >
> > > So, today's numbers for creating 30 kernel trees in sequence:
> > >
> > > Btrfs defaults 57.41 MB/s
> > > Btrfs dup no csum 74.59 MB/s
> > > Btrfs no duplication 76.83 MB/s
> > > Btrfs no dup no csum no inline 76.85 MB/s
> >
> > What sort of script are you using? Basically something like this?
> >
> > for i in `seq 1 30`; do
> >     mkdir $i; cd $i
> >     tar xjf /usr/src/linux-2.6.28.tar.bz2
> >     cd ..
> > done
>
> Similar. I used compilebench -i 30 -r 0, which means create 30 initial
> kernel trees and then do nothing. compilebench simulates compiles by
> writing files to the FS of the same sizes you would get by creating
> kernel trees or compiling them.
>
> The idea is to get all of the IO without needing to keep 2.6.28.tar.bz2
> in cache or the compiler using up CPU.
>
> http://www.oracle.com/~mason/compilebench

Whoops the link above is wrong, try:

http://oss.oracle.com/~mason/compilebench

It is worth noting that the end throughput doesn't matter quite as much
as the writeback pattern. Ext4 is pretty solid on this test, with very
consistent results.

-chris

2008-08-15 13:02:19

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Fri, 2008-08-15 at 03:39 +0200, Andi Kleen wrote:
> > The async worker threads should be spreading the load across CPUs pretty
> > well, and even a single CPU could keep up with 100MB/s checksumming.
> > But, the async worker threads do randomize the IO somewhat because the
> > IO goes from pdflush -> one worker thread per CPU -> submit_bio. So,
> > maybe that 3rd thread is more than the drive can handle?
>
> You have more threads with duplication?
>

It was a very confusing use of the word thread. I have the same number
of kernel threads running, but the single spindle on the drive has to
deal with 3 different streams of writes. The seeks/sec portion of the
graph shows a big enough increase in seeks on the duplication run to
explain the performance.

> > btrfsck tells me the total size of the btree is only 20MB larger with
> > checksumming on.
> >
> > > > Btrfs no duplication 76.83 MB/s
> > > > Btrfs no dup no csum no inline 76.85 MB/s
> > >
> > > But without duplication they are basically free here at least
> > > in IO rate. Seems odd?
> > >
> > > Does it compute them twice in the duplication case perhaps?
> > >
> >
> > The duplication happens lower down in the stack, they only get done
> > once.
>
> Ok was just speculation. The big difference still seems odd.

It does; I'll give the test a shot on other hardware too. To be honest
I'm pretty happy to be matching ext4 with duplication on. The graph shows
even writeback, and the times from each iteration are fairly consistent.

Ext3 and XFS score somewhere between 10-15MB/s on the same test...

-chris

2008-08-15 13:46:04

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

On Fri, Aug 15, 2008 at 08:46:01AM -0400, Chris Mason wrote:
> Whoops the link above is wrong, try:
>
> http://oss.oracle.com/~mason/compilebench

Thanks, I figured it out.

> It is worth noting that the end throughput doesn't matter quite as much
> as the writeback pattern. Ext4 is pretty solid on this test, with very
> consistent results.

There were two reasons why I wanted to play with compilebench. The
first is we have a fragmentation problem with delayed allocation and
small files getting forced out due to memory pressure, which we've been
working on for the past week. My intuition (which has proven to be
correct) is that compilebench is a great tool to show it off. It may
not matter so much for write throughput results, since usually the
separation distance between the first block and the rest of the file
is small, and the write elevator takes care of it, but in the long run
this kind of allocation pattern is no good:

Inode 221280: (0):887097, (1):882497
Inode 221282: (0):887098, (1-2):882498-882499
Inode 221284: (0):887099, (1):882500

The other reason I was interested in playing with the compilebench
tool is that I wanted to try tweaking the commit timers to see if this
would make a difference to the result. Not for this benchmark, it
appears, given a quick test that I did last night.

- Ted

2008-08-15 17:55:12

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Fri, 2008-08-15 at 09:45 -0400, Theodore Tso wrote:
> On Fri, Aug 15, 2008 at 08:46:01AM -0400, Chris Mason wrote:
> > Whoops the link above is wrong, try:
> >
> > http://oss.oracle.com/~mason/compilebench
>
> Thanks, I figured it out.
>
> > It is worth noting that the end throughput doesn't matter quite as much
> > as the writeback pattern. Ext4 is pretty solid on this test, with very
> > consistent results.
>
> There were two reasons why I wanted to play with compilebench. The
> first is we have a fragmentation problem with delayed allocation and
> small files getting forced out due to memory pressure, that we've been
> working for the past week.

Have you tried this one:

http://article.gmane.org/gmane.linux.file-systems/25560

This bug should cause fragmentation on small files getting forced out
due to memory pressure in ext4. But, I wasn't able to really
demonstrate it with ext4 on my machine.

-chris


2008-08-15 20:00:11

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> Have you tried this one:
>
> http://article.gmane.org/gmane.linux.file-systems/25560
>
> This bug should cause fragmentation on small files getting forced out
> due to memory pressure in ext4. But, I wasn't able to really
> demonstrate it with ext4 on my machine.

I've been able to use compilebench to see the fragmentation problem
very easily.

Aneesh has been working on it, and has some fixes that he queued up.
I'll have to point him at your proposed fix, thanks. This is what he
came up with in the common code. What do you think?

- Ted

(From Aneesh, on the linux-ext4 list.)

As I explained in my previous patch, the problem is due to pdflush
background_writeout. When pdflush does the writeout we may have only a
few pages for the file, and we would attempt to write them to disk. So
my attempt in the last patch was to do the following:

a) When allocating blocks, try to be close to the specified goal block.
b) When we call ext4_da_writepages, make sure we have a minimal
nr_to_write that ensures we allocate all dirty buffer_heads in a single
go. nr_to_write is set to 1024 in pdflush background_writeout, which
means we may end up calling some inodes' writepages() with really small
values even though we have more dirty buffer_heads.

What it doesn't handle is:
1) File A has 4 dirty buffer_heads.
2) pdflush tries to write them. We get 4 contiguous blocks.
3) File A now has 5 new dirty buffer_heads.
4) File B now has 6 dirty buffer_heads.
5) pdflush tries to write the 6 dirty buffer_heads of file B and
allocates them next to the earlier file A blocks.
6) pdflush tries to write the 5 dirty buffer_heads of file A and
allocates them after the file B blocks, resulting in discontinuity.

I am right now testing the patch below, which makes sure new dirty
inodes are added to the tail of the dirty inode list.

commit 6ad9d25595aea8efa0d45c0a2dd28b4a415e34e6
Author: Aneesh Kumar K.V <[email protected]>
Date: Fri Aug 15 23:19:15 2008 +0530

move the dirty inodes to the end of the list

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 25adfc3..91f3c54 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -163,7 +163,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
+			list_move_tail(&inode->i_list, &sb->s_dirty);
 		}
 	}
 out:

2008-08-15 20:39:56

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Fri, 2008-08-15 at 15:59 -0400, Theodore Tso wrote:
> On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> > Have you tried this one:
> >
> > http://article.gmane.org/gmane.linux.file-systems/25560
> >
> > This bug should cause fragmentation on small files getting forced out
> > due to memory pressure in ext4. But, I wasn't able to really
> > demonstrate it with ext4 on my machine.
>
> I've been able to use compilebench to see the fragmentation problem
> very easily.
>
> Aneesh has been working on it, and has some fixes that he queued up.
> I'll have to point him at your proposed fix, thanks. This is what he
> came up with in the common code. What do you think?
>

It sounds like ext4 would show the writeback_index bug with
fragmentation on disk and btrfs would show it with seeks during the
benchmark. I was only watching the throughput numbers and not looking
at filefrag results.

> - Ted
>
> (From Annesh, on the linux-ext4 list.)
>
> As I explained in my previous patch, the problem is due to pdflush
> background_writeout. When pdflush does the writeout we may have only a
> few pages for the file, and we would attempt to write them to disk. So
> my attempt in the last patch was to do the following:
>

pdflush and delalloc and raid stripe alignment and lots of other things
don't play well together. In general, I think we need one or more
pdflush threads per mounted FS so that write_cache_pages doesn't have to
bail out every time it hits congestion.

The current write_cache_pages code even misses easy chances to create
bigger bios just because a block device is congested when called by
background_writeout().

But I would hope we can deal with a single-threaded small-file workload
like compilebench without resorting to big rewrites.

> a) When allocating blocks, try to be close to the specified goal block.
> b) When we call ext4_da_writepages, make sure we have a minimal
> nr_to_write that ensures we allocate all dirty buffer_heads in a single
> go. nr_to_write is set to 1024 in pdflush background_writeout, which
> means we may end up calling some inodes' writepages() with really small
> values even though we have more dirty buffer_heads.
>
> What it doesn't handle is:
> 1) File A has 4 dirty buffer_heads.
> 2) pdflush tries to write them. We get 4 contiguous blocks.
> 3) File A now has 5 new dirty buffer_heads.
> 4) File B now has 6 dirty buffer_heads.
> 5) pdflush tries to write the 6 dirty buffer_heads of file B and
> allocates them next to the earlier file A blocks.
> 6) pdflush tries to write the 5 dirty buffer_heads of file A and
> allocates them after the file B blocks, resulting in discontinuity.
>
> I am right now testing the patch below, which makes sure new dirty
> inodes are added to the tail of the dirty inode list.
>
> commit 6ad9d25595aea8efa0d45c0a2dd28b4a415e34e6
> Author: Aneesh Kumar K.V <[email protected]>
> Date: Fri Aug 15 23:19:15 2008 +0530
>
> move the dirty inodes to the end of the list
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 25adfc3..91f3c54 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -163,7 +163,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  		 */
>  		if (!was_dirty) {
>  			inode->dirtied_when = jiffies;
> -			list_move(&inode->i_list, &sb->s_dirty);
> +			list_move_tail(&inode->i_list, &sb->s_dirty);
>  		}
>  	}
>  out:

Looks like everyone who walks sb->s_io or s_dirty walks it backwards.
This should make the newly dirtied inode the first one to be processed,
which probably isn't what we want. I could be reading it backwards of
course ;)
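
To make the ordering question concrete, a simplified sketch using the
kernel list API (arguments simplified; the real function operates on the
inode itself):

#include <linux/list.h>

/*
 * Writeback takes inodes from the tail of the list (list->prev).
 *
 * With the original list_move(), a freshly dirtied inode goes to the
 * head, so a tail-first walker services the oldest inodes first.
 *
 * With the patch's list_move_tail(), the freshly dirtied inode lands at
 * the tail -- exactly where the walker picks next -- which is why it
 * would be processed first, as noted above.
 */
static void mark_dirty_old_way(struct list_head *i_list,
			       struct list_head *s_dirty)
{
	list_move(i_list, s_dirty);		/* newest at the head */
}

static void mark_dirty_patched(struct list_head *i_list,
			       struct list_head *s_dirty)
{
	list_move_tail(i_list, s_dirty);	/* newest at the tail */
}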

-chris

2008-08-16 18:12:58

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Fri, 2008-08-15 at 16:37 -0400, Chris Mason wrote:
> On Fri, 2008-08-15 at 15:59 -0400, Theodore Tso wrote:
> > On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> > > Have you tried this one:
> > >
> > > http://article.gmane.org/gmane.linux.file-systems/25560
> > >
> > > This bug should cause fragmentation on small files getting forced out
> > > due to memory pressure in ext4. But, I wasn't able to really
> > > demonstrate it with ext4 on my machine.
> >
> > I've been able to use compilebench to see the fragmentation problem
> > very easily.
> >
> > Aneesh has been working on it, and has some fixes that he queued up.
> > I'll have to point him at your proposed fix, thanks. This is what he
> > came up with in the common code. What do you think?
> >
>
> It sounds like ext4 would show the writeback_index bug with
> fragmentation on disk and btrfs would show it with seeks during the
> benchmark. I was only watching the throughput numbers and not looking
> at filefrag results.
>

I tried just the writeback_index patch and got only 4 fragmented files
on ext4 after a compilebench run. Then I tried again and got 1200.
Seems there is something timing dependent in here ;)

By default compilebench uses 256k buffers for writing (see compilebench
-b) and btrfs_file_write will lock down up to 512 pages at a time during
a single write. This means that for most small files, compilebench will
send the whole file down in one write() and btrfs_file_write will lock
down pages for the entire write() call while working on it.

So, even if pdflush tries to jump in and do the wrong thing, the pages
will be locked by btrfs_file_write and pdflush will end up skipping
them.

With the generic file write routines, pages are locked one at a time,
giving pdflush more windows to trigger delalloc while a write is still
ongoing.
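
A sketch of that batching, with hypothetical copy and delalloc helpers
standing in for the real btrfs code; the point is that every page of the
write stays locked until the whole batch is set up, so writeback's
trylock pass skips them:

#include <linux/pagemap.h>
#include <linux/slab.h>

/* Hypothetical helpers for the data copy and delalloc bookkeeping. */
void copy_user_data_into_pages(struct page **pages, unsigned long nr);
void set_pages_delalloc(struct page **pages, unsigned long nr);

static int write_one_batch(struct address_space *mapping, pgoff_t index,
			   unsigned long nr_pages)
{
	struct page **pages;
	unsigned long i;
	int ret = -ENOMEM;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_NOFS);
	if (!pages)
		return ret;

	/* find_or_create_page() returns each page locked. */
	for (i = 0; i < nr_pages; i++) {
		pages[i] = find_or_create_page(mapping, index + i, GFP_NOFS);
		if (!pages[i])
			goto out;
	}

	copy_user_data_into_pages(pages, nr_pages);
	set_pages_delalloc(pages, nr_pages);
	ret = 0;
out:
	/* Only now do the pages become visible to writeback again. */
	while (i--) {
		unlock_page(pages[i]);
		put_page(pages[i]);
	}
	kfree(pages);
	return ret;
}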

-chris

2008-08-16 19:28:15

by Theodore Ts'o

Subject: Re: Btrfs v0.16 released

On Sat, Aug 16, 2008 at 02:10:10PM -0400, Chris Mason wrote:
>
> I tried just the writeback_index patch and got only 4 fragmented files
> on ext4 after a compilebench run. Then I tried again and got 1200.
> Seems there is something timing dependent in here ;)
>

Yeah, the patch Aneesh sent to change where we added the inode to the
dirty list was a false lead. The right fix is in the ext4 patch queue
now. I think we have the problem licked, and a quick test showed it
increased the compilebench MB/s by a very tiny amount (small enough that
I wasn't sure whether or not it was measurement error), but it does
avoid the needless fragmentation.

- Ted

2008-08-16 19:32:44

by Szabolcs Szakacsits

Subject: Re: Btrfs v0.16 released


On Fri, 15 Aug 2008, Chris Mason wrote:

> Ext3 and XFS score somewhere between 10-15MB/s on the same test...

Interesting (and cool animations).

We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
freshly formatted partition, with defaults. Results:

            MB/s   Runtime (s)
           -----   -----------
ext3       13.24       877
btrfs      12.33       793
ntfs-3g     8.55       865
reiserfs    8.38       966
xfs         1.88      3901

Regards,
Szaka

--
NTFS-3G: http://ntfs-3g.org

2008-08-18 13:53:44

by Chris Mason

Subject: Re: Btrfs v0.16 released

On Sat, 2008-08-16 at 22:26 +0300, Szabolcs Szakacsits wrote:
> On Fri, 15 Aug 2008, Chris Mason wrote:
>
> > Ext3 and XFS score somewhere between 10-15MB/s on the same test...
>
> Interesting (and cool animations).
>
> We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
> freshly formatted partition, with defaults. Results:
>
>             MB/s   Runtime (s)
>            -----   -----------
> ext3       13.24       877
> btrfs      12.33       793


Thanks for running things.

The code in the btrfs-unstable tree has all my performance fixes.
You'll need it to get good results. Also, the MB/s number doesn't
include the time to run sync at the end, which is probably why the
runtime for btrfs is shorter but MB/s is lower.

-chris

2008-08-18 18:22:08

by Szabolcs Szakacsits

Subject: Re: Btrfs v0.16 released


On Mon, 18 Aug 2008, Chris Mason wrote:
> On Sat, 2008-08-16 at 22:26 +0300, Szabolcs Szakacsits wrote:
> >
> > We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
> > freshly formatted partition, with defaults. Results:
> >
> >             MB/s   Runtime (s)
> >            -----   -----------
> > ext3       13.24       877
> > btrfs      12.33       793
>
> Thanks for running things.
>
> The code in the btrfs-unstable tree has all my performance fixes.
> You'll need it to get good results.

The numbers are indeed much better:

                  MB/s   Runtime (s)
                 -----   -----------
btrfs-unstable   17.09       572

The disk is capable of 40+ MB/s, but the test partition was one of the
last ones on the disk and, as I have now figured out, it can only do
26 MB/s. Btrfs bulk writes easily sustain that: the write speed was
21 MB/s during the benchmark, so btrfs came closest to the best possible
write speed in the test environment.

Szaka

--
NTFS-3G: http://ntfs-3g.org