2002-03-27 13:55:12

by Matthew Kirkwood

Subject: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

Hi,

A while ago, I did some longish runs of OSDB (osdb.sf.net)
against PostgreSQL 7.2. All runs were on kernel 2.5.6 + the
dc395x driver and the futexes patch. I'd have included
reiserfs too, but in 2.5.6 it seemed to oops on mount. 2.5.7
doesn't boot for me, but I'll run these again when a more
interesting kernel appears.

Hardware is: 2 x P3-450, 384Mb, 3 x 9Gb Quantum disks on
internal aic7xxx (new driver). Except for a "vmstat 1", the
system was otherwise unused during the tests. There was no
other mounted filesystem on the disk with the test partition.
The numbers seem pretty consistent -- if they're more than 5%
different, that's probably a valid comparison (no, I'm not a
statistician and can't justify that).
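
As a rough illustration of the 5% rule of thumb above, here is a minimal
shell sketch. The helper names pct_diff and is_significant are made up
for this example and are not part of OSDB; the two numbers are the ext2
OLTP results for "dd" and "dn" from the table further down.

```shell
# pct_diff A B: absolute difference between A and B as a percentage of A.
# (Hypothetical helper for illustration; not part of OSDB.)
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = (b - a) / a * 100
        if (d < 0) d = -d
        printf "%.2f\n", d
    }'
}

# is_significant A B: exit status 0 when A and B differ by more than 5%.
is_significant() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = (b - a) / a * 100
        if (d < 0) d = -d
        exit !(d > 5)
    }'
}

# ext2 OLTP throughput (tps), "dd" vs "dn"
if is_significant 188.50 234.08; then
    echo "difference of $(pct_diff 188.50 234.08)% is probably real"
fi
```

The 5% threshold is only the informal cut-off stated above, not a
statistical test.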

The scripts I used are available on request, but they do
roughly:

stop postgres
umount
mkfs
mount
create postgres data directories
start postgres (incl. creating postgres database)
"osdb-pg --datadir /scratch/data-40mb/ --short"
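
The cycle above can be sketched as a dry run that only prints the
commands for one filesystem. The device and mount point are taken from
the script posted later in this thread; the "service" invocations and
the function name bench_plan are assumptions made for illustration.

```shell
# bench_plan FSTYPE [MOUNTOPTS]: print (do not execute) one benchmark cycle.
# Hypothetical helper; nothing here touches the disk.
bench_plan() {
    fstype=$1
    mountopts=${2:-defaults}
    part=/dev/sdb6        # assumption: scratch test partition
    mnt=/var/lib/pgsql    # assumption: PostgreSQL mount point

    echo "service postgresql stop"
    echo "umount $part"
    echo "mkfs -t $fstype $part"
    echo "mount -t $fstype -o $mountopts $part $mnt"
    echo "mkdir -m 0700 $mnt/data"
    echo "chown postgres:postgres $mnt/data"
    echo "service postgresql start"
    echo "osdb-pg --datadir /scratch/data-40mb/ --short"
}

bench_plan ext3 noatime
```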


"Tuning" key:
"dd" -- default PG, default FS opts
"dn" -- default PG, "noatime"
"bn" -- big PG buffers, "noatime"

PostgreSQL
        tuning?   single       ir    mx-ir     oltp  mixed-oltp
                   (sec)    (tps)    (sec)    (tps)       (sec)
ext2    dd       1304.72    66.64   214.25   188.50      230.55
        dn       1288.31    65.93   209.57   234.08      213.75
        bn       1283.50    77.90  1867.71   192.43      226.77

ext3    dd       1303.84    66.87   212.49    66.06      361.04
        dn       1288.03    64.62   209.27   111.41      278.54
        bn       1285.32    65.98  1996.41    90.05      307.79

ext3-wb dn       1291.68    66.06   209.94   138.25      242.28
        bn       1287.31    98.42  2149.38   125.13      236.02

jfs     dd       1308.97    66.82   212.59   117.28      273.08
        dn       1288.60    65.08   211.56   116.18      218.22
        bn       1279.89    81.00  2059.26   114.20      225.56

minix   dd       1305.26    67.38   207.74   193.90      228.81
        dn       1331.27    67.14   210.07   223.70      214.33
        bn       1299.24    89.58  1988.31   231.17      231.17


My conclusions:

1. I'll have to spend more time learning to tune postgres,
but clearly something went wrong there -- the
"agg_simple_report" test accounted for almost all of the
differences.

2. "noatime" is a very useful switch for these circumstances.

3. The journalled filesystems do have measurable overhead
for this workload.

Questions:

1. Is there anything else I should try in the way of fs
options, etc?

2. What does jfs do in the way of data journalling? Is it
"ordered" or "writeback", in ext3-speak? (I assume
fully journalled data would give much worse performance.)


Cheers,
Matthew.


2002-03-27 14:09:52

by Andi Kleen

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

Matthew Kirkwood writes:

> PostgreSQL
>         tuning?   single       ir    mx-ir     oltp  mixed-oltp
>                    (sec)    (tps)    (sec)    (tps)       (sec)
> ext2    dd       1304.72    66.64   214.25   188.50      230.55
>         dn       1288.31    65.93   209.57   234.08      213.75
>         bn       1283.50    77.90  1867.71   192.43      226.77
>
> ext3    dd       1303.84    66.87   212.49    66.06      361.04
>         dn       1288.03    64.62   209.27   111.41      278.54
>         bn       1285.32    65.98  1996.41    90.05      307.79

This is ext3 with ordered data?

> minix   dd       1305.26    67.38   207.74   193.90      228.81
>         dn       1331.27    67.14   210.07   223.70      214.33
>         bn       1299.24    89.58  1988.31   231.17      231.17

Wow minix is faster than ext2 @) That certainly looks strange.

Any chance to test XFS too?

> 3. The journalled filesystems do have measurable overhead
> for this workload.

Normally (non data journaling, noatime) journaling fs shouldn't have any
overhead for database load, because database files should be preallocated
and the database should do direct IO in/out the preallocated buffers
with the FS never doing any metadata writes, except for occasional inode
updates for mtime depending on what sync mode that DB uses (hmm, I guess a
nomtime or verylazymtime or alwaysasyncmtime mount option could be helpful
for that)

That's the theory, but doesn't seem to be the case in your test. I guess
your test is not very realistic then.

> 2. What does jfs do in the way of data journalling? Is it
> "ordered" or "writeback", in ext3-speak? (I assume
> fully journalled data would give much worse performance.)

Kind of ordered I believe.

-Andi

2002-03-27 14:17:45

by Florin Andrei

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Wed, 2002-03-27 at 05:54, Matthew Kirkwood wrote:
>
> 3. The journalled filesystems do have measurable overhead
> for this workload.

Can you repeat the tests with XFS too?

In my tests, it did the best for database-type workloads (and generally,
for large files with multiple access).

--
Florin Andrei

"Sorry judge, we would like to publish the file formats, but the data is
not stored in files. It is stored in a database that is an indivisible
part of the operating system." - a potential future Microsoft excuse

2002-03-27 14:47:31

by Matthew Kirkwood

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On 27 Mar 2002, Andi Kleen wrote:

> > ext3    dd       1303.84    66.87   212.49    66.06      361.04
> >         dn       1288.03    64.62   209.27   111.41      278.54
> >         bn       1285.32    65.98  1996.41    90.05      307.79
>
> This is ext3 with ordered data?

Yep. Everything is default unless otherwise stated.

> > minix   dd       1305.26    67.38   207.74   193.90      228.81
> >         dn       1331.27    67.14   210.07   223.70      214.33
> >         bn       1299.24    89.58  1988.31   231.17      231.17
>
> Wow minix is faster than ext2 @) That certainly looks strange.

Yeah, I thought it was a little odd. Postgres does so much
fsync()ing that I thought it may just have been that the lower
overhead won out over ext2's cleverer layout. All the I/O was
basically fsync-driven, so this test was only about write
performance.

> Any chance to test XFS too?

Sure. I'll try to build a more interesting kernel sometime
this week. ext2 with delalloc might be fun, too.

Do you know of any simple patch or patches which might get
reiserfs working on 2.5.6?

> > 3. The journalled filesystems do have measurable overhead
> > for this workload.
>
> Normally (non data journaling, noatime) journaling fs shouldn't have
> any overhead for database load, because database files should be
> preallocated and the database should do direct IO in/out the
> preallocated buffers with the FS never doing any metadata writes,
> except for occasional inode updates for mtime depending on what sync
> mode that DB uses (hmm, I guess a nomtime or verylazymtime or
> alwaysasyncmtime mount option could be helpful for that)

Postgres doesn't pre-allocate datafiles. They reckon it's not
their job to implement a filesystem, and I'm inclined to agree.
They do prefer fdatasync on datafiles and (I think) O_DATASYNC
for their journal files where available, but I haven't checked
that my build is doing that.

> That's the theory, but doesn't seem to be the case in your test. I
> guess your test is not very realistic then.

Or your assumptions about DB vs filesystems are not valid in
this case.

> > 2. What does jfs do in the way of data journalling? Is it
> > "ordered" or "writeback", in ext3-speak? (I assume
> > fully journalled data would give much worse performance.)
>
> Kind of ordered I believe.

OK, ta. So it probably does something right that ext3
doesn't? (Or has rather weaker semantics, of course.)

Matthew.

2002-03-27 15:36:01

by Michael Alan Dorman

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

Matthew Kirkwood writes:
> Postgres doesn't pre-allocate datafiles.

I haven't received your original message, so I don't know what version
of PostgreSQL you're using, but I believe it is pertinent given that
versions >= 7.2 (and perhaps >= 7.1) *do* pre-allocate WAL logs, which
is where most of the action is.

In this situation you might benefit from any reduction in FS
overhead, even if it means a reduction in features, because WAL
is going to dramatically change the way disk access happens.

Mike.

2002-03-27 17:53:31

by Andrew Morton

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

Matthew Kirkwood wrote:
>
> ...
> Yeah, I thought it was a little odd. Postgres does so much
> fsync()ing that I thought it may just have been that the lower
> overhead won out over ext2's cleverer layout. All the I/O was
> basically fsync-driven, so this test was only about write
> performance.
>

For fsync-intensive loads ext3's best mode is generally
data=journal. That way, an fsync is satisfied by a nice
single linear write to the journal.

With a high volume of data you'll quickly exhaust the
journal space so it would also be beneficial to create
a monster journal with, say, mke2fs -J 400.
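
A minimal dry-run sketch of that setup follows. It uses the
"-J size=N" spelling (N in megabytes) documented in the mke2fs manual
page. The device, mount point, and the helper names run and
make_journalled_ext3 are assumptions for illustration only.

```shell
# Dry run by default: each command is only printed. Setting DRY_RUN=0
# executes them, which needs root and DESTROYS the contents of $PART.
DRY_RUN=${DRY_RUN:-1}
PART=${PART:-/dev/sdb6}       # assumption: scratch test partition
MNT=${MNT:-/var/lib/pgsql}    # assumption: PostgreSQL mount point

run() {
    echo "$*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# make_journalled_ext3 SIZE_MB: ext3 with a large internal journal,
# mounted with full data journalling.
make_journalled_ext3() {
    run mke2fs -j -J size="$1" "$PART" &&
    run mount -t ext3 -o data=journal,noatime "$PART" "$MNT"
}

make_journalled_ext3 400
```

Whether a journal that large fits depends on the partition; it has to
leave room for the data files as well.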

-

2002-03-27 18:04:21

by Andreas Dilger

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Mar 27, 2002 14:47 +0000, Matthew Kirkwood wrote:
> Postgres doesn't pre-allocate datafiles. They reckon it's not
> their job to implement a filesystem, and I'm inclined to agree.
> They do prefer fdatasync on datafiles and (I think) O_DATASYNC
> for their journal files where available, but I haven't checked
> that my build is doing that.

If the I/O is normally sync driven, you should consider testing ext3
with "data=journal". While this seems counterintuitive because it is
writing the data to disk twice, it can often be faster in real-world
"bursty" environments because the sync I/O goes to the journal in one
contiguous write, and it can then be written to the rest of the fs
asynchronously safely. You can also set up an external journal device
so that the journal is on another disk and avoid seeking between the
journal and the rest of the filesystem.
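
For reference, an external journal can be set up roughly as below.
This is a dry-run sketch only: /dev/sdc1 is a hypothetical partition on
a second disk, and the mount point and helper names are likewise
assumptions. The mke2fs options used (-O journal_dev, -J device=) are
the ones described in the mke2fs manual page, which also requires the
journal device and the filesystem to share a block size.

```shell
# Dry run by default: each command is only printed. Setting DRY_RUN=0
# executes them, which needs root and DESTROYS data on both devices.
DRY_RUN=${DRY_RUN:-1}
DATA_DEV=${DATA_DEV:-/dev/sdb6}        # assumption: filesystem under test
JOURNAL_DEV=${JOURNAL_DEV:-/dev/sdc1}  # hypothetical partition on another disk
MNT=${MNT:-/var/lib/pgsql}             # assumption: PostgreSQL mount point
BLOCK=4096                             # same block size on both devices

run() {
    echo "$*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

setup_external_journal() {
    # 1. Turn the second partition into a dedicated journal device.
    run mke2fs -b "$BLOCK" -O journal_dev "$JOURNAL_DEV" &&
    # 2. Create the filesystem and attach it to that journal.
    run mke2fs -b "$BLOCK" -J device="$JOURNAL_DEV" "$DATA_DEV" &&
    # 3. Mount with full data journalling.
    run mount -t ext3 -o data=journal,noatime "$DATA_DEV" "$MNT"
}

setup_external_journal
```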

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2002-03-28 00:05:14

by Matthew Kirkwood

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Wed, 27 Mar 2002, Andrew Morton wrote:

> > Yeah, I thought it was a little odd. Postgres does so much
> > fsync()ing that I thought it may just have been that the lower
> > overhead won out over ext2's cleverer layout. All the I/O was
> > basically fsync-driven, so this test was only about write
> > performance.
>
> For fsync-intensive loads ext3's best mode is generally
> data=journal. That way, an fsync is satisfied by a nice
> single linear write to the journal.

Here we are. This is with just a 200Mb journal (the partition
is only a little over 1Gb, and the datafiles grow fairly big,
so I didn't brave making it any bigger).

        tuning?   single       ir    mx-ir     oltp  mixed-oltp
                   (sec)    (tps)    (sec)    (tps)       (sec)
ext3    bn       1285.32    65.98  1996.41    90.05      307.79
ext3-wb bn       1287.31    98.42  2149.38   125.13      236.02
ext3-jd bn       1306.90    72.07  1813.54   125.15      305.27

The I/O load should be almost exclusively fsync-driven writes,
so I'm not sure how to account for the fact that the OLTP and
OLTP + misc (mostly read) activity give different numbers.

I'll try to find time to run these again tomorrow to convince
myself that all is sane, but these numbers are usually pretty
stable.

Matthew.

2002-03-28 00:10:34

by Matthew Kirkwood

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Wed, 27 Mar 2002, Andreas Dilger wrote:

> If the I/O is normally sync driven, you should consider testing ext3
> with "data=journal". While this seems counterintuitive because it is
> writing the data to disk twice, it can often be faster in real-world
> "bursty" environments because the sync I/O goes to the journal in one
> contiguous write, and it can then be written to the rest of the fs
> asynchronously safely.

Good point (and partially borne out by my new numbers).

> You can also set up an external journal device so that the journal is
> on another disk and avoid seeking between the journal and the rest of
> the filesystem.

Good idea. If I had only two disks - a slow one and a fast one,
how should they be configured? (Or might this be another area
worthy of testing? The tradeoffs can go both ways -- the slow
disk might seem better for the async writes, but it'll also be
worse at seeking, so perhaps might be more appropriate for the
journal disk?)

Matthew.

2002-03-28 00:31:44

by Andrew Morton

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

Matthew Kirkwood wrote:
>
> On Wed, 27 Mar 2002, Andrew Morton wrote:
>
> > > Yeah, I thought it was a little odd. Postgres does so much
> > > fsync()ing that I thought it may just have been that the lower
> > > overhead won out over ext2's cleverer layout. All the I/O was
> > > basically fsync-driven, so this test was only about write
> > > performance.
> >
> > For fsync-intensive loads ext3's best mode is generally
> > data=journal. That way, an fsync is satisfied by a nice
> > single linear write to the journal.
>
> Here we are. This is with just a 200Mb journal (the partition
> is only a little over 1Gb, and the datafiles grow fairly big,
> so I didn't brave making it any bigger).
>
>         tuning?   single       ir    mx-ir     oltp  mixed-oltp
>                    (sec)    (tps)    (sec)    (tps)       (sec)
> ext3    bn       1285.32    65.98  1996.41    90.05      307.79
> ext3-wb bn       1287.31    98.42  2149.38   125.13      236.02
> ext3-jd bn       1306.90    72.07  1813.54   125.15      305.27

Oh well.

It sounds like a useful and valid workload to optimise
for. So I'll take you up on the offer of those scripts,
please.

-

2002-03-28 00:42:46

by Matthew Kirkwood

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Wed, 27 Mar 2002, Andrew Morton wrote:

> >         tuning?   single       ir    mx-ir     oltp  mixed-oltp
> >                    (sec)    (tps)    (sec)    (tps)       (sec)
> > ext3    bn       1285.32    65.98  1996.41    90.05      307.79
> > ext3-wb bn       1287.31    98.42  2149.38   125.13      236.02
> > ext3-jd bn       1306.90    72.07  1813.54   125.15      305.27
>
> Oh well.

Sometimes better, sometimes worse. I'll kick another run
off tonight, to check that the numbers aren't too far off.

> It sounds like a useful and valid workload to optimise
> for. So I'll take you up on the offer of those scripts,
> please.

My scripts are roughly the appended, and:

grep -E '(agg_simple|Bench|crossSe|Mixed|"Sin)' dbb-tuned.out | \
sed 's/^crossSection/cS/'

I've been too lazy so far to automate the "make it into a
table" bit, particularly because I quite like watching the
results come in :)

Cheers,
Matthew.



#!perl -w
use strict;

my $PART = '/dev/sdb6';
my $FORCEOPTS = 'noatime';
my $DEFOPTS = undef;
my $DEBUG = 1;
my $DEBUGONLY = 0;

my @FSES = qw(jfs ext3 ext3-wb ext3-jd minix ext2);
my @DBS = qw(postgresql);

my @BENCHOPTS = qw(--datadir /home/matthew/dbbench/data-40mb --short);

my %filesystems = (
    minix => { mkfs => [ qw(/root/mkfs.minix -v) ], },
    ext2 => {},
    ext3 => {},
    'ext3-wb' => { type => 'ext3', mountopts => 'data=writeback', },
    'ext3-jd' => { mkfs => [ qw(mkfs.ext3 -J size=200 )],
                   type => 'ext3', mountopts => 'data=journal', },
    jfs => { mkfs => [ qw(mkfs.jfs -q) ], },
    reiser => { type => 'reiserfs', },
);

my %dbs = (
    mysql => { mntpoint => '/var/lib/mysql', osdb => 'osdb-my', },
    postgresql => { mntpoint => '/var/lib/pgsql', osdb => 'osdb-pg',
                    init => \&pg_init, },
);

runit('umount', $PART);

foreach my $db (@DBS) {
    my $dbopts = $dbs{$db};
    my $mntpoint = $dbopts->{mntpoint} or die "$db has no mntpoint\n";
    my $osdb = $dbopts->{osdb} or die "$db has no \"osdb\"\n";

    foreach my $fs (@FSES) {
        my $opts = $filesystems{$fs};
        print "Benchmark for ", $db, " on ", $fs, "\n\n";

        my $fstype = $opts->{type} || $fs;
        my $mkfs = $opts->{mkfs} || [ qw(mkfs -t), $fstype ];
        print "making fs\n";
        runit(@$mkfs, $PART) or die "can't mkfs $fstype\n";
        print "\n\n";

        print "mounting fs\n";
        my $opt = $opts->{mountopts} || $DEFOPTS;
        $opt = [$opt] if $opt && ! ref $opt;
        push @$opt, $FORCEOPTS;
        $opt = join(",", @$opt);
        $opt = ['-o', $opt] if $opt;
        runit('mount', '-t', $fstype, @$opt, $PART, $mntpoint)
            or die "can't mount $fstype\n";
        print "\n\n";

        print "Starting ", $db, "\n";
        if ($dbopts->{init}) {
            &{$dbopts->{init}}($dbopts, $opts);
        } else {
            runit('/sbin/service', $db, 'start');
        }
        print "\n\n";

        print "Running test\n";
        runit($osdb, @BENCHOPTS) or warn "tests failed\n";
        print "\n\n";

        print "Stopping ", $db, "\n";
        runit('/sbin/service', $db, 'stop');
        sleep(2);
        print "\n\n";

        print "Umounting\n";
        runit('umount', $PART) or die "can't umount $fstype\n";
        print "\n\n";

        print "\n\n";
        print "\n\n";
    }
}


exit;



sub pg_init {
    my $dbopts = shift;
    my $opts = shift;
    my $mp = $dbopts->{mntpoint};

    my @dirs = ($mp.'/data', $mp.'/backups');
    runit('mkdir', '-m', '0700', @dirs);
    runit('chown', 'postgres.postgres', @dirs);
    # runit('cp', '/etc/postgresql.conf', $dirs[0]);
    runit('/sbin/service', 'postgresql', 'start');
    sleep(2);
    runit('sudo', '-u', 'postgres', 'createuser', '-a', '-d', 'root');
}

sub runit {
    print join(" ", @_), "\n" if $DEBUG || $DEBUGONLY;
    return !system(@_) unless $DEBUGONLY;
}

2002-03-28 02:16:25

by Mike Fedyk

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Wed, Mar 27, 2002 at 11:02:47AM -0700, Andreas Dilger wrote:
> On Mar 27, 2002 14:47 +0000, Matthew Kirkwood wrote:
> > Postgres doesn't pre-allocate datafiles. They reckon it's not
> > their job to implement a filesystem, and I'm inclined to agree.
> > They do prefer fdatasync on datafiles and (I think) O_DATASYNC
> > for their journal files where available, but I haven't checked
> > that my build is doing that.
>
> If the I/O is normally sync driven, you should consider testing ext3
> with "data=journal". While this seems counterintuitive because it is
> writing the data to disk twice, it can often be faster in real-world
> "bursty" environments because the sync I/O goes to the journal in one
> contiguous write, and it can then be written to the rest of the fs
> asynchronously safely.

Don't forget to have enough extra memory so that it can have time to do
those async writes later.

When is ext3 going to get high and low watermarks?

Currently it hits a (50%?) high usage level and then sync writes the entire
journal contents. :( Has that changed?

2002-03-28 11:12:20

by Matthew Kirkwood

Subject: Re: Filesystem benchmarks: ext2 vs ext3 vs jfs vs minix

On Thu, 28 Mar 2002, Matthew Kirkwood wrote:

> I'll try to find time to run these again tomorrow to convince
> myself that all is sane, but these numbers are usually pretty
> stable.

Here's another run, with noatime on, and default postgres
parameters.

        tuning?   single       ir    mx-ir     oltp  mixed-oltp
                   (sec)    (tps)    (sec)    (tps)       (sec)
ext3    dn       1296.30    66.34   207.59    69.19      318.26
ext3-wb dn       1286.38    66.27   212.48   135.48      229.74
ext3-jd dn       1293.08    68.72   209.33   113.40      283.97

Looks like I'll have to invest some time in tuning postgres
a little better before the filesystem becomes more of a
bottleneck.

Matthew.