2007-05-16 14:44:06

by Chris Mason

[permalink] [raw]
Subject: filesystem benchmarking fun

Hello everyone,

I've been spending some time lately on filesystem benchmarking, in part
because my pet FS project is getting more stable and closer to release.
Now seems like a good time to step back and try to find out what
workloads we think are most important and see how well Linux is doing on
them. So, I'll start with my favorite three benchmarks and why I think
they matter. Over time I hope to collect a bunch of results for all of
us to argue about.

* fio: http://brick.kernel.dk/snaps/
Fio can abuse a file via just about every API in the kernel: aio, dio,
syslets, splice, etc. It can thread, fork, record and play back traces,
and provides good numbers for throughput and latencies on various
sequential and random IO loads.

* fs_mark: http://developer.osdl.org/dev/doubt/fs_mark/index.html
This one covers most of the 'use the FS as a database' type workloads,
and can vary the number of files, directory depth etc. It has detailed
timings for reads, writes, unlinks and fsyncs that make it good for
simulating mail servers and other setups.

* compilebench: http://oss.oracle.com/~mason/compilebench/
Tries to benchmark the filesystem allocator by aging the FS through
simulated kernel compiles, patch runs and other operations.
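
(Illustrative invocations only -- the flags below are from each tool's
own usage text as best I can tell, and /mnt/test is a placeholder mount
point, so treat these as a sketch rather than the canonical runs behind
the numbers in this thread:)

# fio: 4 threads doing O_DIRECT 4k random writes through libaio
fio --name=randwrite --directory=/mnt/test --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --size=1g --numjobs=4 --iodepth=32 --group_reporting

# fs_mark: 4 threads creating 10000 50k files spread over 64 subdirs
fs_mark -d /mnt/test -t 4 -n 10000 -s 51200 -D 64

# compilebench: create 10 initial kernel trees, then run 30 random operations
compilebench -D /mnt/test -i 10 -r 30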

It's easy to get caught up in one benchmark or another and try to use
them for bragging rights. But, what I want to do is talk about the
workloads we're trying to optimize for and our current methods for
measuring success. If we don't have good benchmarks for a given
workload, I'd like to try and collect ideas on how to make one.

For example, I'll pick on xfs for a minute. compilebench shows the
default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
kernel trees. Dave Chinner gave me some mount options that make it
dramatically better, but it still writes at 10MB/s on a sata drive that
can do 80MB/s. Ext3 is better, but still only 20MB/s.

Both are presumably picking a reasonable file and directory layout.
Still, our writeback algorithms are clearly not optimized for this kind
of workload. Should we fix it?

-chris


2007-05-16 16:01:49

by Chuck Ebbert

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Chris Mason wrote:
> For example, I'll pick on xfs for a minute. compilebench shows the
> default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
> kernel trees. Dave Chinner gave me some mount options that make it
> dramatically better, but it still writes at 10MB/s on a sata drive that
> can do 80MB/s. Ext3 is better, but still only 20MB/s.
>

Now try JFS. My lawn grows faster than it can write a new kernel tree.

What we need is a tool that shows *why* this stuff happens...

2007-05-16 17:14:21

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 12:01:06PM -0400, Chuck Ebbert wrote:
> Chris Mason wrote:
> > For example, I'll pick on xfs for a minute. compilebench shows the
> > default FS you get from mkfs.xfs is pretty slow for untarring a
> > bunch of kernel trees. Dave Chinner gave me some mount options that
> > make it dramatically better, but it still writes at 10MB/s on a sata
> > drive that can do 80MB/s. Ext3 is better, but still only 20MB/s.
> >
>
> Now try JFS. My lawn grows faster than it can write a new kernel tree.
>
> What we need is a tool that shows *why* this stuff happens...
>
Unfortunately, this varies with every FS. ext3 is pretty fast for the
first 10 or so untars, and then the log wraps. I tossed a systemtap
probe into __log_wait_for_space and basically every time it gets called
vmstat shows write throughput go from 30MB/s to 4MB/s.
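
(A minimal sketch of that kind of observation, assuming systemtap plus
matching kernel debuginfo are installed; the probe body here is made up
for illustration:)

# terminal 1: watch block-out throughput once per second
vmstat 1
# terminal 2: log each entry into the jbd checkpoint path
stap -e 'probe kernel.function("__log_wait_for_space")
         { printf("%d waiting for log space\n", pid()) }'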

Presumably, xfs suffers from something similar since tuning the log to
be larger and the log buffers to be larger improves performance. I
don't remember if the jfs log is tunable here or not.

On ext3, blktrace shows that we're getting pretty good overall
sequential writeback except for the log flushing. reiserfsv3 gives
similar numbers to xfs, which is important only because I spent a
bunch of time tuning the v3 log flushing code to work in big batches a
few years ago.
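
(For the blktrace side, the usual live-trace pipeline is enough to see
whether the writeback is sequential; /dev/sdb is a placeholder device:)

# trace the block device and decode the events on the fly
blktrace -d /dev/sdb -o - | blkparse -i -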

At least on ext3, it may help to sort the blocks under io for
flushing...it may not. A bigger log would definitely help, but I would
say the mkfs defaults should be reasonable for a workload this simple.

(data=writeback was used for my ext3 numbers).

-chris

2007-05-16 18:13:43

by Jan Engelhardt

[permalink] [raw]
Subject: Re: filesystem benchmarking fun


On May 16 2007 10:42, Chris Mason wrote:
>
>For example, I'll pick on xfs for a minute. compilebench shows the
>default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
>kernel trees.

I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]

>Dave Chinner gave me some mount options that make it
>dramatically better,

and `mkfs.xfs -l version=2` is also said to make it better

>but it still writes at 10MB/s on a sata drive that
>can do 80MB/s. Ext3 is better, but still only 20MB/s.
>
>Both are presumably picking a reasonable file and directory layout.
>Still, our writeback algorithms are clearly not optimized for this kind
>of workload. Should we fix it?

Also try with tmpfs.



Jan
--

2007-05-16 18:28:28

by Andrew Morton

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, 16 May 2007 13:11:56 -0400
Chris Mason <[email protected]> wrote:

> At least on ext3, it may help to sort the blocks under io for
> flushing...it may not. A bigger log would definitely help, but I would
> say the mkfs defaults should be reasonable for a workload this simple.
>
> (data=writeback was used for my ext3 numbers).

When ext3 runs out of journal space it needs to sync lots of metadata out
to the fs so that its space in the journal can be reclaimed. That metadata
is of course splattered all over the place so it's seekstorm time.

The filesystem does take some care to place the metadata blocks "close" to
the data blocks. But of course if we're writing all the pagecache and then
we later separately go back and write the metadata then that would screw
things up.

I put some code in there which will place indirect blocks under I/O at the
same time as their data blocks, so everything _should_ go out in a nice
slurp (see write_boundary_block()). The first thing to do here is to check
that write_boundary_block() didn't get broken.

If that's still working then the problem will _probably_ be directory
writeout. Possibly inodes, but they should be well-laid-out.

Were you using dir_index? That might be screwing things up.
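
(A rough sketch of checking that, plus building a comparison filesystem
with a bigger journal -- the device name and journal size are placeholder
examples, not values from this thread:)

# see whether dir_index is among the filesystem features
tune2fs -l /dev/sdb1 | grep -i features
# make a test fs without dir_index and with a 128MB journal
mkfs.ext3 -O ^dir_index -J size=128 /dev/sdb1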

2007-05-16 19:13:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Jan Engelhardt wrote:
> On May 16 2007 10:42, Chris Mason wrote:
>> For example, I'll pick on xfs for a minute. compilebench shows the
>> default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
>> kernel trees.
>
> I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]

Shouldn't that option be renamed to 'corrupt_my_data'? ;-)

Jeff



2007-05-16 19:17:36

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 11:25:15AM -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 13:11:56 -0400
> Chris Mason <[email protected]> wrote:
>
> > At least on ext3, it may help to sort the blocks under io for
> > flushing...it may not. A bigger log would definitely help, but I would
> > say the mkfs defaults should be reasonable for a workload this simple.
> >
> > (data=writeback was used for my ext3 numbers).
>
> When ext3 runs out of journal space it needs to sync lots of metadata out
> to the fs so that its space in the journal can be reclaimed. That metadata
> is of course splattered all over the place so it's seekstorm time.
>
> The filesystem does take some care to place the metadata blocks "close" to
> the data blocks. But of course if we're writing all the pagecache and then
> we later separately go back and write the metadata then that would screw
> things up.

Just to clarify, in the initial stage where kernel trees are created,
the benchmark doesn't call sync. So all the writeback is through the normal
async mechanisms.

>
> I put some code in there which will place indirect blocks under I/O at
> the same time as their data blocks, so everything _should_ go out in a
> nice slurp (see write_boundary_block()). The first thing to do here
> is to check that write_boundary_block() didn't get broken.

write_boundary_block should get called from pdflush and the IO done by
pdflush seems to be pretty sequential. But, in this phase the
vast majority of the files are small (95% are less than 46k).
>
> If that's still working then the problem will _probably_ be directory
> writeout. Possibly inodes, but they should be well-laid-out.
>
> Were you using dir_index? That might be screwing things up.

Yes, dir_index. A quick test of mkfs.ext3 -O ^dir_index seems to still
have the problem. Even though the inodes are well laid out, is the
order they get written sane? Looks like ext3 is just walking a list of
bh/jh, maybe we can just sort the silly thing?

-chris

2007-05-16 19:24:16

by Jan Engelhardt

[permalink] [raw]
Subject: Re: filesystem benchmarking fun


On May 16 2007 14:16, Jeffrey Hundstad wrote:
> Jeff Garzik wrote:
>> Jan Engelhardt wrote:
>> > On May 16 2007 10:42, Chris Mason wrote:
>> > > For example, I'll pick on xfs for a minute. compilebench shows
>> > > the
>> > > default FS you get from mkfs.xfs is pretty slow for untarring a
>> > > bunch of
>> > > kernel trees.
>> >
>> > I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]
>>
>> Shouldn't that option be renamed to 'corrupt_my_data'? ;-)
>
> Perhaps maybe "my_power_never_fails".

It's not like I live in a country where power regularly fails.


Jan
--

2007-05-16 19:29:28

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 08:12:09PM +0200, Jan Engelhardt wrote:
>
> On May 16 2007 10:42, Chris Mason wrote:
> >
> >For example, I'll pick on xfs for a minute. compilebench shows the
> >default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
> >kernel trees.
>
> I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]

Oddly, xfs fails barriers on this sata drive although the other filesystems
don't. But yes, I tried both ways.

>
> >Dave Chinner gave me some mount options that make it
> >dramatically better,
>
> and `mkfs.xfs -l version=2` is also said to make it better

I used mkfs.xfs -l size=128m,version=2
mount -o logbsize=256k,nobarrier
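
(Spelled out with a placeholder device and mount point, that would be
something like:)

mkfs.xfs -f -l size=128m,version=2 /dev/sdb1
mount -t xfs -o logbsize=256k,nobarrier /dev/sdb1 /mnt/test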

>
> >but it still writes at 10MB/s on a sata drive that
> >can do 80MB/s. Ext3 is better, but still only 20MB/s.
> >
> >Both are presumably picking a reasonable file and directory layout.
> >Still, our writeback algorithms are clearly not optimized for this kind
> >of workload. Should we fix it?
>
> Also try with tmpfs.
>
Sorry, I'm not entirely clear on what we learn from trying tmpfs?

-chris

2007-05-16 19:37:21

by Andrew Morton

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, 16 May 2007 15:13:39 -0400
Chris Mason <[email protected]> wrote:

> >
> > If that's still working then the problem will _probably_ be directory
> > writeout. Possibly inodes, but they should be well-laid-out.
> >
> > Were you using dir_index? That might be screwing things up.
>
> Yes, dir_index. A quick test of mkfs.ext3 -O ^dir_index seems to still
> have the problem. Even though the inodes are well laid out, is the
> order they get written sane?

Should be: it uses first-fit.

> Looks like ext3 is just walking a list of
> bh/jh, maybe we can just sort the silly thing?

The IO scheduler is supposed to do that.

But I don't know what's causing this.

2007-05-16 19:41:52

by Jeffrey Hundstad

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Jeff Garzik wrote:
> Jan Engelhardt wrote:
>> On May 16 2007 10:42, Chris Mason wrote:
>>> For example, I'll pick on xfs for a minute. compilebench shows the
>>> default FS you get from mkfs.xfs is pretty slow for untarring a
>>> bunch of
>>> kernel trees.
>>
>> I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]
>
> Shouldn't that option be renamed to 'corrupt_my_data'? ;-)

Perhaps maybe "my_power_never_fails".
>
> Jeff

2007-05-16 19:56:30

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 12:33:42PM -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 15:13:39 -0400
> Chris Mason <[email protected]> wrote:
>
> > >
> > > If that's still working then the problem will _probably_ be directory
> > > writeout. Possibly inodes, but they should be well-laid-out.
> > >
> > > Were you using dir_index? That might be screwing things up.
> >
> > Yes, dir_index. A quick test of mkfs.ext3 -O ^dir_index seems to still
> > have the problem. Even though the inodes are well laid out, is the
> > order they get written sane?
>
> Should be: it uses first-fit.
>
> > Looks like ext3 is just walking a list of
> > bh/jh, maybe we can just sort the silly thing?
>
> The IO scheduler is supposed to do that.
>
> But I don't know what's causing this.

I had high hopes of blaming cfq, but deadline gives the same results:

create dir kernel-0 222MB in 5.38 seconds (41.33 MB/s)
... [ ~30MB/s here ] ...
create dir kernel-7 222MB in 8.11 seconds (27.42 MB/s)
create dir kernel-8 222MB in 18.39 seconds (12.09 MB/s)
create dir kernel-9 222MB in 6.91 seconds (32.18 MB/s)
create dir kernel-10 222MB in 24.32 seconds (9.14 MB/s)
create dir kernel-11 222MB in 12.06 seconds (18.44 MB/s)
create dir kernel-12 222MB in 10.95 seconds (20.31 MB/s)

The good news is that if you let it run long enough, the times
stabilize. The bad news is:

create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)

echo 2048 > /sys/block/..../nr_requests didn't do it either.

I guess I'll have systemtap tell me more about the log flushing.

-chris

2007-05-16 20:07:46

by Andrew Morton

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, 16 May 2007 15:53:59 -0400
Chris Mason <[email protected]> wrote:

> >
> > Should be: it uses first-fit.
> >
> > > Looks like ext3 is just walking a list of
> > > bh/jh, maybe we can just sort the silly thing?
> >
> > The IO scheduler is supposed to do that.
> >
> > But I don't know what's causing this.
>
> I had high hopes of blaming cfq, but deadline gives the same results:
>
> create dir kernel-0 222MB in 5.38 seconds (41.33 MB/s)
> ... [ ~30MB/s here ] ...
> create dir kernel-7 222MB in 8.11 seconds (27.42 MB/s)
> create dir kernel-8 222MB in 18.39 seconds (12.09 MB/s)
> create dir kernel-9 222MB in 6.91 seconds (32.18 MB/s)
> create dir kernel-10 222MB in 24.32 seconds (9.14 MB/s)
> create dir kernel-11 222MB in 12.06 seconds (18.44 MB/s)
> create dir kernel-12 222MB in 10.95 seconds (20.31 MB/s)
>
> The good news is that if you let it run long enough, the times
> stabilize. The bad news is:
>
> create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)

well hang on. Doesn't this just mean that the first few runs were writing
into pagecache and the later ones were blocking due to dirty-memory limits?

Or do you have a sync in there?

> echo 2048 > /sys/block/..../nr_requests didn't do it either.
>
> I guess I'll have systemtap tell me more about the log flushing.

2007-05-16 20:16:47

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > The good news is that if you let it run long enough, the times
> > stabilize. The bad news is:
> >
> > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>
> well hang on. Doesn't this just mean that the first few runs were writing
> into pagecache and the later ones were blocking due to dirty-memory limits?
>
> Or do you have a sync in there?
>
There's no sync, but if you watch vmstat you can clearly see the log
flushes, even when the overall create times are 11MB/s. vmstat goes
30MB/s -> 4MB/s or less, then back up to 30MB/s.

On the same box, my shiny new FS writes at 30MB/s the whole time. For
this part of the benchmark, I think we should all be getting the same
numbers.

-chris

2007-05-16 20:40:11

by Andrew Morton

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, 16 May 2007 16:14:14 -0400
Chris Mason <[email protected]> wrote:

> On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > The good news is that if you let it run long enough, the times
> > > stabilize. The bad news is:
> > >
> > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> >
> > well hang on. Doesn't this just mean that the first few runs were writing
> > into pagecache and the later ones were blocking due to dirty-memory limits?
> >
> > Or do you have a sync in there?
> >
> There's no sync, but if you watch vmstat you can clearly see the log
> flushes, even when the overall create times are 11MB/s. vmstat goes
> 30MB/s -> 4MB/s or less, then back up to 30MB/s.

How do you know that it is a log flush rather than, say, pdflush
hitting the blockdev inode and doing a big seeky write?

2007-05-16 21:02:29

by Al Boldi

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> > > Should be: it uses first-fit.
> > >
> > > > Looks like ext3 is just walking a list of
> > > > bh/jh, maybe we can just sort the silly thing?
> > >
> > > The IO scheduler is supposed to do that.
> > >
> > > But I don't know what's causing this.
> >
> > I had high hopes of blaming cfq, but deadline gives the same results:
> >
> > create dir kernel-0 222MB in 5.38 seconds (41.33 MB/s)
> > ... [ ~30MB/s here ] ...
> > create dir kernel-7 222MB in 8.11 seconds (27.42 MB/s)
> > create dir kernel-8 222MB in 18.39 seconds (12.09 MB/s)
> > create dir kernel-9 222MB in 6.91 seconds (32.18 MB/s)
> > create dir kernel-10 222MB in 24.32 seconds (9.14 MB/s)
> > create dir kernel-11 222MB in 12.06 seconds (18.44 MB/s)
> > create dir kernel-12 222MB in 10.95 seconds (20.31 MB/s)
> >
> > The good news is that if you let it run long enough, the times
> > stabilize. The bad news is:
> >
> > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>
> well hang on. Doesn't this just mean that the first few runs were writing
> into pagecache and the later ones were blocking due to dirty-memory
> limits?
>
> Or do you have a sync in there?
>
> > echo 2048 > /sys/block/..../nr_requests didn't do it either.
> >
> > I guess I'll have systemtap tell me more about the log flushing.

Try these:
# echo anticipatory > /sys/block/.../scheduler
# echo 0 > /sys/block/.../iosched/antic_expire
# echo 192 > /sys/block/.../max_sectors_kb
# echo 192 > /sys/block/.../read_ahead_kb

These give me the best performance, but most noticeably antic_expire > 0 leaves
the IO scheduler in an apparent limbo.

see http://bugzilla.kernel.org/show_bug.cgi?id=5900


Thanks!

--
Al

2007-05-16 21:06:24

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 16:14:14 -0400
> Chris Mason <[email protected]> wrote:
>
> > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > The good news is that if you let it run long enough, the times
> > > > stabilize. The bad news is:
> > > >
> > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > >
> > > well hang on. Doesn't this just mean that the first few runs were writing
> > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > >
> > > Or do you have a sync in there?
> > >
> > There's no sync, but if you watch vmstat you can clearly see the log
> > flushes, even when the overall create times are 11MB/s. vmstat goes
> > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>
> How do you know that it is a log flush rather than, say, pdflush
> hitting the blockdev inode and doing a big seeky write?

I don't...it gets especially tricky because ext3_writepage starts
a transaction, and so pdflush does hit the log flushing code too.

So, in comes systemtap. I instrumented submit_bh to look for seeks
(defined as writes more than 16 blocks apart) when the process was
inside __log_wait_for_space. The probe is attached, it is _really_
quick and dirty because I'm about to run out the door.

Watching vmstat, every time the __log_wait_for_space hits lots of seeks,
vmstat goes into the 2-4MB/s range. Not a scientific match up, but
here's some sample output:

7824 ext3 done waiting for space total wrote 3155 blocks seeks 2241
7827 ext3 done waiting for space total wrote 855 blocks seeks 598
7827 ext3 done waiting for space total wrote 2547 blocks seeks 1759
7653 ext3 done waiting for space total wrote 2273 blocks seeks 1609

I also recorded the total size of each seek, 66% of them were 6000
blocks or more.

-chris


Attachments:
jbd.tap (1.02 kB)

2007-05-17 11:53:01

by Xu CanHao

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On May 17, 5:10 am, Chris Mason <[email protected]> wrote:
> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > On Wed, 16 May 2007 16:14:14 -0400
> > Chris Mason <[email protected]> wrote:
>
> > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > The good news is that if you let it run long enough, the times
> > > > > stabilize. The bad news is:
>
> > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>
> > > > well hang on. Doesn't this just mean that the first few runs were writing
> > > > into pagecache and the later ones were blocking due to dirty-memory limits?
>
> > > > Or do you have a sync in there?
>
> > > There's no sync, but if you watch vmstat you can clearly see the log
> > > flushes, even when the overall create times are 11MB/s. vmstat goes
> > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>
> > How do you know that it is a log flush rather than, say, pdflush
> > hitting the blockdev inode and doing a big seeky write?
>
> I don't...it gets especially tricky because ext3_writepage starts
> a transaction, and so pdflush does hit the log flushing code too.
>
> So, in comes systemtap. I instrumented submit_bh to look for seeks
> (defined as writes more than 16 blocks apart) when the process was
> inside __log_wait_for_space. The probe is attached, it is _really_
> quick and dirty because I'm about to run out the door.
>
> Watching vmstat, every time the __log_wait_for_space hits lots of seeks,
> vmstat goes into the 2-4MB/s range. Not a scientific match up, but
> here's some sample output:
>
> 7824 ext3 done waiting for space total wrote 3155 blocks seeks 2241
> 7827 ext3 done waiting for space total wrote 855 blocks seeks 598
> 7827 ext3 done waiting for space total wrote 2547 blocks seeks 1759
> 7653 ext3 done waiting for space total wrote 2273 blocks seeks 1609
>
> I also recorded the total size of each seek, 66% of them were 6000
> blocks or more.
>
> -chris
>
> [jbd.tap]
>
> global in_process
> global writers
> global last
> global seeks
>
> probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c") {
> printf("%d ext3 waiting for space\n", pid())
> p = pid()
> writers[p] = 0
> in_process[p] = 1
> last[p] = 0
> seeks[p] = 0
>
> }
>
> probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c").return {
> p = pid()
> in_process[p] = 0
> printf("%d ext3 done waiting for space total wrote %d blocks seeks %d\n", p,
> writers[p], seeks[p])
>
> }
>
> probe kernel.function("submit_bh") {
> p = pid()
> in_proc = in_process[p]
> if (in_proc != 0) {
> writers[p] += 1
> block = $bh->b_blocknr
> last_block = last[p]
> diff = 0
> if (last_block != 0) {
> if (last_block < block && block - last_block > 16) {
> diff = block - last_block
> }
> if (last_block > block && last_block - block > 16) {
> diff = last_block - block
> }
> }
>
> last[p] = block
> if (diff != 0) {
> printf("seek log write pid %d last %d this %d diff %d\n",
> p, last_block, block, diff);
> seeks[p] += 1
> }
> }
>
> }

To Chris Mason:

I see that your file-system aging methodology is much the same as
here: http://defragfs.sourceforge.net/theory.html

Would it be useful to you?

2007-05-18 03:32:17

by Eric Sandeen

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Jeff Garzik wrote:
> Jan Engelhardt wrote:
>> On May 16 2007 10:42, Chris Mason wrote:
>>> For example, I'll pick on xfs for a minute. compilebench shows the
>>> default FS you get from mkfs.xfs is pretty slow for untarring a bunch of
>>> kernel trees.
>>
>> I suppose you used 'nobarrier'? [ http://lkml.org/lkml/2006/5/19/33 ]
>
> Shouldn't that option be renamed to 'corrupt_my_data'? ;-)

It means "I have real storage with battery backed cache and I'd like
good performance, please"

-Eric

2007-05-22 16:39:36

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 16:14:14 -0400
> Chris Mason <[email protected]> wrote:
>
> > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > The good news is that if you let it run long enough, the times
> > > > stabilize. The bad news is:
> > > >
> > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > >
> > > well hang on. Doesn't this just mean that the first few runs were writing
> > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > >
> > > Or do you have a sync in there?
> > >
> > There's no sync, but if you watch vmstat you can clearly see the log
> > flushes, even when the overall create times are 11MB/s. vmstat goes
> > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>
> How do you know that it is a log flush rather than, say, pdflush
> hitting the blockdev inode and doing a big seeky write?

Ok, I did some more work to split out the two cases (block device inode
writeback and log flushing).

I patched jbd's log_do_checkpoint to put all the blocks it wanted to
write in a radix tree, then send them all down in order at the end.
The elevator should be helping here, but jbd is sending down 2,000
to 3,000 blocks during the checkpoint and upping nr_requests alone
didn't seem to be doing the trick.

Unpatched ext3 would break down into seeks after 8 kernel trees are
created (222MB each). With the radix sorting, the first 15 kernel trees
are created quickly, and then we slow down.

So I waited until around the 25th kernel tree was created, hit ctrl-c
and ran sync. vmstat showed writes going at 2MB/s, and sysrq-w showed
sync was running the block device inode for most of the 2MB/s period.
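
(A minimal sketch of grabbing that kind of dump, assuming sysrq is
enabled via kernel.sysrq=1:)

# dump blocked tasks to the kernel log, then read it back
echo w > /proc/sysrq-trigger
dmesg | tail -n 60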

It looks as though the dirty pages on the block device inode are spread
out far enough that we're not getting good streaming writes. Mark
Fasheh ran on a bigger raid array, where performance was consistently
good for the whole run. I'm assuming the larger write cache on the
array was able to group the data writes with the metadata on disk, while
my poor little sata drive wasn't. Dave Chinner hinted that xfs is
probably suffering a similar problem, which is usually fixed by backing
the FS with stripes and big raid.

My vaporware FS is able to maintain speed through the run because the
allocator tries to keep data and metadata grouped into 256mb chunks,
and so they don't end up mingling on disk until things get full.

At any rate, it may be worth putzing with the writeback routines to try
and find dirty pages close by in the block dev inode when doing data
writeback. My guess is that ext3 should be going 1.5x to 2x faster for
this particular run, but that's a huge amount of complexity added so I'm
not convinced it is a great idea.

-chris

2007-05-22 17:51:55

by John Stoffel

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

>>>>> "Chris" == Chris Mason <[email protected]> writes:

Chris> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
>> On Wed, 16 May 2007 16:14:14 -0400
>> Chris Mason <[email protected]> wrote:
>>
>> > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
>> > > > The good news is that if you let it run long enough, the times
>> > > > stabilize. The bad news is:
>> > > >
>> > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
>> > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
>> > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
>> > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>> > >
>> > > well hang on. Doesn't this just mean that the first few runs were writing
>> > > into pagecache and the later ones were blocking due to dirty-memory limits?
>> > >
>> > > Or do you have a sync in there?
>> > >
>> > There's no sync, but if you watch vmstat you can clearly see the log
>> > flushes, even when the overall create times are 11MB/s. vmstat goes
>> > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>>
>> How do you know that it is a log flush rather than, say, pdflush
>> hitting the blockdev inode and doing a big seeky write?

Chris> Ok, I did some more work to split out the two cases (block device inode
Chris> writeback and log flushing).

Chris> I patched jbd's log_do_checkpoint to put all the blocks it
Chris> wanted to write in a radix tree, then send them all down in
Chris> order at the end. The elevator should be helping here, but jbd
Chris> is sending down 2,000 to 3,000 blocks during the checkpoint and
Chris> upping nr_requests alone didn't seem to be doing the trick.

That seems like a really high number to me, in terms of the number of
blocks being checkpointed here. Should we be more aggressively
writing journal blocks? Or having sub-transactions so we can write
smaller atomic chunks, while still not forcing them to disk?

I dunno, I'm probably smoking something here.

Just thinking out loud, what's the worst possible layout of inodes and
such that can be written in ext3? And can we generate a test case
which re-creates that layout and then the handling of it? Is it when
we try to write N inodes, and we do inode 1, then N, then 2, then N-1,
etc? So that the writes are spread out all over the place and don't
have any clustering in them at all?

Chris> Unpatched ext3 would break down into seeks after 8 kernel trees
Chris> are created (222MB each). With the radix sorting, the first 15
Chris> kernel trees are created quickly, and then we slow down.

Chris> So I waited until around the 25th kernel tree was created, hit
Chris> ctrl-c and ran sync. vmstat showed writes going at 2MB/s, and
Chris> sysrq-w showed sync was running the block device inode for most
Chris> of the 2MB/s period.

How much data was written overall at this point? And was it sorted at
all, or were just sub-chunks of it sorted?

Chris> It looks as though the dirty pages on the block device inode
Chris> are spread out far enough that we're not getting good streaming
Chris> writes. Mark Fasheh ran on a bigger raid array, where
Chris> performance was consistently good for the whole run. I'm
Chris> assuming the larger write cache on the array was able to group
Chris> the data writes with the metadata on disk, while my poor little
Chris> sata drive wasn't. Dave Chinner hinted that xfs is probably
Chris> suffering a similar problem, which is usually fixed by backing
Chris> the FS with stripes and big raid.

It sounds like Mark Fasheh just needs to have a bigger test case to
also hit the same wall. His hardware just moves it out a bit.

Chris> My vaporware FS is able to maintain speed through the run
Chris> because the allocator tries to keep data and metadata grouped
Chris> into 256mb chunks, and so they don't end up mingling on disk
Chris> until things get full.

So what happens when your vaporFS is told to allocate in 128Mb chunks
doing this same test? I assume it gets into the same type of problem?

I'm happy to see these tests happening, even if I don't have a lot to
contribute otherwise. :[

One thought, do you have a list of the blocks being written for the
dirty block device inode? Should we be trying to push them out
whenever we write data blocks which are nearby?

I wonder how ext4 will do with this?

John

2007-05-22 18:15:44

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Tue, May 22, 2007 at 01:50:13PM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason <[email protected]> writes:
>
> Chris> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:

[ seeky writes while creating kernel trees on ext3 ]

> >> How do you know that it is a log flush rather than, say, pdflush
> >> hitting the blockdev inode and doing a big seeky write?
>
> Chris> Ok, I did some more work to split out the two cases (block device inode
> Chris> writeback and log flushing).
>
> Chris> I patched jbd's log_do_checkpoint to put all the blocks it
> Chris> wanted to write in a radix tree, then send them all down in
> Chris> order at the end. The elevator should be helping here, but jbd
> Chris> is sending down 2,000 to 3,000 blocks during the checkpoint and
> Chris> upping nr_requests alone didn't seem to be doing the trick.
>
> That seems like a really high number to me, in terms of the number of
> blocks being check pointed here. Should we be more aggressively
> writing journal blocks? Or having sub-transactions so we can write
> smaller atomic chunks, while still not forcing them to disk?
>

The FS is 40GB and the log is 32768 blocks. So, writing 2,000 blocks as
part of a checkpoint seems reasonable. If you're whipping up a quick
and dirty patch to sort all the blocks being checkpointed, working in
bigger batches will increase the chances that the writes are sequential. So
I don't think smaller sub-transactions will do the trick.
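
(Assuming the default 4KB block size, that 32768-block log is 128MB, so
a 2,000-3,000 block checkpoint works out to roughly 8-12MB of metadata
going out per batch.)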

> I dunno, I'm probably smoking something here.
>
> Just thinking out loud, what's the worst possible layout of inodes and
> such that can be written in ext3? And can we generate a test case
> which re-creates that layout and then the handling of it? Is it when
> we try to write N inodes, and we do inode 1, then N, then 2, then N-1,
> etc? So that the writes are spread out all over the place and don't
> have any clustering in them at all?

I don't know the jbd code well enough, but it looks like it is putting
things down in the order they were logged. So it should be possible,
but every FS is going to have a pathological worst case. I'm actually
trying to tune the benchmark such that any bad performance is only from
allocator decisions and not writeback decisions.

>
> Chris> Unpatched ext3 would break down into seeks after 8 kernel trees
> Chris> are created (222MB each). With the radix sorting, the first 15
> Chris> kernel trees are created quickly, and then we slow down.
>
> Chris> So I waited until around the 25th kernel tree was created, hit
> Chris> ctrl-c and ran sync. vmstat showed writes going at 2MB/s, and
> Chris> sysrq-w showed sync was running the block device inode for most
> Chris> of the 2MB/s period.
>
> How much data was written overall at this point? And was it sorted at
> all, or were just sub-chunks of it sorted?

Each tree is 222MB, plus metadata. It works out to about 10GB when you
factor in space wasted due to partially used blocks. writeback will be
somewhat sorted, well enough that file data seems to go down at 30MB/s
or so. It's the metadata slowing things down.

>
> Chris> My vaporware FS is able to maintain speed through the run
> Chris> because the allocator tries to keep data and metadata grouped
> Chris> into 256mb chunks, and so they don't end up mingling on disk
> Chris> until things get full.
>
> So what happens when your vaporFS is told to allocate in 128Mb chunks
> doing this same test? I assume it gets into the same type of problem?

128MB chunks would probably perform well, it isn't so much the size of
the group that helps but not having mixed data and metadata in the same
group.

>
> I'm happy to see these tests happening, even if I don't have a lot to
> contribute otherwise. :[
>
> One thought, do you have a list of the blocks being written for the
> dirty block device inode? Should we be trying to push them out
> whenever we write data blocks which are nearby?
>

It's an option, but not a light programming chore.

> I wonder how ext4 will do with this?

I would guess it'll score the same way. The files are fairly small in
this dataset and ext4's jbd code is based on ext3.

-chris

2007-05-22 18:24:27

by Andrew Morton

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Tue, 22 May 2007 12:35:11 -0400
Chris Mason <[email protected]> wrote:

> On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > On Wed, 16 May 2007 16:14:14 -0400
> > Chris Mason <[email protected]> wrote:
> >
> > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > The good news is that if you let it run long enough, the times
> > > > > stabilize. The bad news is:
> > > > >
> > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > > >
> > > > well hang on. Doesn't this just mean that the first few runs were writing
> > > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > > >
> > > > Or do you have a sync in there?
> > > >
> > > There's no sync, but if you watch vmstat you can clearly see the log
> > > flushes, even when the overall create times are 11MB/s. vmstat goes
> > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
> >
> > How do you know that it is a log flush rather than, say, pdflush
> > hitting the blockdev inode and doing a big seeky write?
>
> Ok, I did some more work to split out the two cases (block device inode
> writeback and log flushing).
>
> I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> write in a radix tree, then send them all down in order at the end.

Side note: we already have all of that capability in the kernel:
sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
blockdev.

It could be that as a quick diddle, running sync_inode() in
do-block-on-queue-congestion mode prior to doing the checkpoint would have
some benefit.

> The elevator should be helping here, but jbd is sending down 2,000
> to 3,000 blocks during the checkpoint and upping nr_requests alone
> didn't seem to be doing the trick.
>
> Unpatched ext3 would break down into seeks after 8 kernel trees are
> created (222MB each). With the radix sorting, the first 15 kernel trees
> are created quickly, and then we slow down.
>
> So I waited until around the 25th kernel tree was created, hit ctrl-c
> and ran sync. vmstat showed writes going at 2MB/s, and sysrq-w showed
> sync was running the block device inode for most of the 2MB/s period.
>
> It looks as though the dirty pages on the block device inode are spread
> out far enough that we're not getting good streaming writes. Mark
> Fasheh ran on a bigger raid array, where performance was consistently
> good for the whole run. I'm assuming the larger write cache on the
> array was able to group the data writes with the metadata on disk, while
> my poor little sata drive wasn't. Dave Chinner hinted that xfs is
> probably suffering a similar problem, which is usually fixed by backing
> the FS with stripes and big raid.
>
> My vaporware FS is able to maintain speed through the run because the
> allocator tries to keep data and metadata grouped into 256mb chunks,
> and so they don't end up mingling on disk until things get full.
>
> At any rate, it may be worth putzing with the writeback routines to try
> and find dirty pages close by in the block dev inode when doing data
> writeback. My guess is that ext3 should be going 1.5x to 2x faster for
> this particular run, but that's a huge amount of complexity added so I'm
> not convinced it is a great idea.

Yes, this is a distinct disadvantage of the whole per-address-space
writeback scheme - we're leaving IO scheduling optimisations on the floor,
especially wrt the blockdev inode, but probably also wrt regular-file
versus regular-file. Even if one makes the request queue tremendously
huge, that won't help if there's dirty data close-by the disk head which
hasn't even been put into the queue yet.

2007-05-22 18:42:17

by Chris Mason

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Tue, May 22, 2007 at 11:21:20AM -0700, Andrew Morton wrote:
> >
> > I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> > write in a radix tree, then send them all down in order at the end.
>
> Side note: we already have all of that capability in the kernel:
> sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
> blockdev.
>
> It could be that as a quick diddle, running sync_inode() in
> do-block-on-queue-congestion mode prior to doing the checkpoint would have
> some benefit.

I had played with this in the past (although not this time around), but
I had performance problems with newly dirtied blocks sneaking in.

> > At any rate, it may be worth putzing with the writeback routines to try
> > and find dirty pages close by in the block dev inode when doing data
> > writeback. My guess is that ext3 should be going 1.5x to 2x faster for
> > this particular run, but that's a huge amount of complexity added so I'm
> > not convinced it is a great idea.
>
> Yes, this is a distinct disadvantage of the whole per-address-space
> writeback scheme - we're leaving IO scheduling optimisations on the floor,
> especially wrt the blockdev inode, but probably also wrt regular-file
> versus regular-file. Even if one makes the request queue tremendously
> huge, that won't help if there's dirty data close-by the disk head which
> hasn't even been put into the queue yet.
>

I'm not sure yet on a good way to fix it, but I do think I've nailed
it down as the cause of the strange performance numbers I'm getting.

-chris

2007-05-22 21:29:16

by Matt Mackall

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Tue, May 22, 2007 at 11:21:20AM -0700, Andrew Morton wrote:
> On Tue, 22 May 2007 12:35:11 -0400
> Chris Mason <[email protected]> wrote:
>
> > On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > > On Wed, 16 May 2007 16:14:14 -0400
> > > Chris Mason <[email protected]> wrote:
> > >
> > > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > > The good news is that if you let it run long enough, the times
> > > > > > stabilize. The bad news is:
> > > > > >
> > > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > > > >
> > > > > well hang on. Doesn't this just mean that the first few runs were writing
> > > > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > > > >
> > > > > Or do you have a sync in there?
> > > > >
> > > > There's no sync, but if you watch vmstat you can clearly see the log
> > > > flushes, even when the overall create times are 11MB/s. vmstat goes
> > > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
> > >
> > > How do you know that it is a log flush rather than, say, pdflush
> > > hitting the blockdev inode and doing a big seeky write?
> >
> > Ok, I did some more work to split out the two cases (block device inode
> > writeback and log flushing).
> >
> > I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> > write in a radix tree, then send them all down in order at the end.
>
> Side note: we already have all of that capability in the kernel:
> sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
> blockdev.

Why don't we simply plug the queue, do all the writes, and let the I/O
scheduler sort it out instead?

--
Mathematics is the supreme nostalgia of our time.

2007-05-24 17:30:26

by Vara Prasad

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

Chris Mason wrote:

>On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
>
>
>>On Wed, 16 May 2007 16:14:14 -0400
>>Chris Mason <[email protected]> wrote:
>>
>>
>>
>>>On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
>>>
>>>
>>>>>The good news is that if you let it run long enough, the times
>>>>>stabilize. The bad news is:
>>>>>
>>>>>create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
>>>>>create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
>>>>>create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
>>>>>create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
>>>>>
>>>>>
>>>>well hang on. Doesn't this just mean that the first few runs were writing
>>>>into pagecache and the later ones were blocking due to dirty-memory limits?
>>>>
>>>>Or do you have a sync in there?
>>>>
>>>>
>>>>
>>>There's no sync, but if you watch vmstat you can clearly see the log
>>>flushes, even when the overall create times are 11MB/s. vmstat goes
>>>30MB/s -> 4MB/s or less, then back up to 30MB/s.
>>>
>>>
>>How do you know that it is a log flush rather than, say, pdflush
>>hitting the blockdev inode and doing a big seeky write?
>>
>>
>
>I don't...it gets especially tricky because ext3_writepage starts
>a transaction, and so pdflush does hit the log flushing code too.
>
>So, in comes systemtap. I instrumented submit_bh to look for seeks
>(defined as writes more than 16 blocks apart) when the process was
>inside __log_wait_for_space. The probe is attached, it is _really_
>quick and dirty because I'm about to run out the door.
>
>Watching vmstat, every time the __log_wait_for_space hits lots of seeks,
>vmstat goes into the 2-4MB/s range. Not a scientific match up, but
>here's some sample output:
>
>7824 ext3 done waiting for space total wrote 3155 blocks seeks 2241
>7827 ext3 done waiting for space total wrote 855 blocks seeks 598
>7827 ext3 done waiting for space total wrote 2547 blocks seeks 1759
>7653 ext3 done waiting for space total wrote 2273 blocks seeks 1609
>
>I also recorded the total size of each seek, 66% of them were 6000
>blocks or more.
>
>-chris
>
>
>
>------------------------------------------------------------------------
>
>
>global in_process
>global writers
>global last
>global seeks
>
>probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c") {
> printf("%d ext3 waiting for space\n", pid())
> p = pid()
> writers[p] = 0
> in_process[p] = 1
> last[p] = 0
> seeks[p] = 0
>}
>
>probe kernel.function("__log_wait_for_space@fs/jbd/checkpoint.c").return {
> p = pid()
> in_process[p] = 0
> printf("%d ext3 done waiting for space total wrote %d blocks seeks %d\n", p,
> writers[p], seeks[p])
>}
>
>probe kernel.function("submit_bh") {
> p = pid()
> in_proc = in_process[p]
> if (in_proc != 0) {
> writers[p] += 1
> block = $bh->b_blocknr
> last_block = last[p]
> diff = 0
> if (last_block != 0) {
> if (last_block < block && block - last_block > 16) {
> diff = block - last_block
> }
> if (last_block > block && last_block - block > 16) {
> diff = last_block - block
> }
> }
>
> last[p] = block
> if (diff != 0) {
> printf("seek log write pid %d last %d this %d diff %d\n",
> p, last_block, block, diff);
> seeks[p] += 1
> }
> }
>}
>
>
Hi Chris,

I am glad to see SystemTap is helping you get useful information to
understand the behavior of the benchmark.
I am not a filesystem expert, hence I have a couple of questions.
1) From your usage of SystemTap in your benchmark work, do you see a
need for a generic tool that would be useful for understanding general
filesystem performance issues? If so, please let us know what you would
like to see in such a tool and we will be more than happy to help
develop one.
2) Based on your usage of SystemTap, do you have any suggestions for a
filesystem probe library that captures most of the essential information
and state transitions needed to understand the performance and
functionality of a filesystem? In other words, I am looking for some
help/advice from experts in the subsystem to identify probe points, so
we can develop a probe library that anyone can use to get more insight
into that subsystem without being a guru in it. If you have any ideas
there, please let us know.
3) Of course, if you have any feedback on SystemTap in general and its
use, and any suggestions for improvements, we would love to hear that
as well.

Thanks again for using SystemTap,
Vara Prasad

2007-05-25 07:17:25

by Jens Axboe

[permalink] [raw]
Subject: Re: filesystem benchmarking fun

On Tue, May 22 2007, Matt Mackall wrote:
> On Tue, May 22, 2007 at 11:21:20AM -0700, Andrew Morton wrote:
> > On Tue, 22 May 2007 12:35:11 -0400
> > Chris Mason <[email protected]> wrote:
> >
> > > On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> > > > On Wed, 16 May 2007 16:14:14 -0400
> > > > Chris Mason <[email protected]> wrote:
> > > >
> > > > > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > > > > The good news is that if you let it run long enough, the times
> > > > > > > stabilize. The bad news is:
> > > > > > >
> > > > > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > > > > >
> > > > > > well hang on. Doesn't this just mean that the first few runs were writing
> > > > > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > > > > >
> > > > > > Or do you have a sync in there?
> > > > > >
> > > > > There's no sync, but if you watch vmstat you can clearly see the log
> > > > > flushes, even when the overall create times are 11MB/s. vmstat goes
> > > > > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
> > > >
> > > > How do you know that it is a log flush rather than, say, pdflush
> > > > hitting the blockdev inode and doing a big seeky write?
> > >
> > > Ok, I did some more work to split out the two cases (block device inode
> > > writeback and log flushing).
> > >
> > > I patched jbd's log_do_checkpoint to put all the blocks it wanted to
> > > write in a radix tree, then send them all down in order at the end.
> >
> > Side note: we already have all of that capability in the kernel:
> > sync_inode(blockdev_inode, wbc) will do an ascending-LBA write of the whole
> > blockdev.
>
> Why don't we simply plug the queue, do all the writes, and let the I/O
> scheduler sort it out instead?

The data set is too huge for that to work, even if you increased
nr_requests to some huuuge number.

--
Jens Axboe