2006-11-07 19:47:27

by Andrew Morton

Subject: Fw: Re: ICP, 3ware, Areca?


Why is ext3 slow??


Begin forwarded message:

Date: Tue, 7 Nov 2006 09:47:17 -0500
From: "Bill Rugolsky Jr." <[email protected]>
To: Arne Schmitz <[email protected]>
Cc: [email protected]
Subject: Re: ICP, 3ware, Areca?


On Tue, Nov 07, 2006 at 03:25:04PM +0100, Arne Schmitz wrote:
> Does anyone have information about how current ICP and Areca hardware performs under
> Linux? We are currently running kernel 2.6.17 and have two offers, one with
> an Areca ARC-1220 8-port, and one with an ICP 9087MA 8-port. Does either of
> them cause trouble running (64-bit) Linux?
>
> At the moment we only have two 3ware controllers running on 32-bit Linux.

On Fri, 18 Aug 2006, I wrote to the list:

I've been doing sequential raw disk I/O testing with both Jens Axboe's
"fio" using libaio and iodepths up to 32, as well as a basic
"dd if=/dev/zero oflag=direct".

Reads look fine; a zone read test shows 360 MiB/s at the start of the disk,
190 MiB/s at the end. I see similarly high numbers doing direct reads via
ext3.

Unfortunately, no matter what I do on the write side, I don't see
more than 72 MiB/s for a sequential direct I/O write to the raw disk.
I've tried the deadline and noop schedulers, boosted nr_requests and
toyed with various I/O sizes and queue depths using fio. I was expecting
sequential writes in the range of 120-150 MiB/s, based on the (now
ancient) tweakers.net review and various other info. [Copying /dev/zero
to tmpfs on this box yields ~860 MiB/s.]

The machine is a Tyan 2882 dual Opteron with 8GB RAM and an Areca 1220
/ 128MB BBU and 8xWDC WD2500JS-00NCB1 250.1GB 7200 RPM configured as a
RAID6 with chunk size 64K. [System volume is on a separate MD RAID1 on
the Nvidia controller.] It's running FC4 x86_64 with a custom-built
2.6.17.7 kernel and the arcmsr driver from scsi-misc GIT, which is
basically 1.20.0X.13 + fixes. The firmware is V1.41 2006-5-24.

Chris Caputo suggested:

I'd run a test with write cache on and one with write cache off and
compare the results. The difference can be vast and depending on your
application it may be okay to run with write cache on.
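
On a directly attached SATA/SCSI disk that toggle is just hdparm or sdparm, as in
the sketch below; behind the Areca controller the per-drive cache setting is made
through the controller's own firmware/CLI instead, so these lines are illustrative only:

hdparm -W0 /dev/sdX          # turn the drive write cache off
hdparm -W1 /dev/sdX          # turn it back on
sdparm --set WCE /dev/sdX    # SCSI/SAT equivalent: set Write Cache Enable
sdparm --clear WCE /dev/sdX  # ...and clear it again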

And I reported back on Tue, 22 Aug 2006:

Forcing disk write caching on certainly changes the results
(and the risk profile, of course). For the archives, here are
some simple "dd" and "fio" odirect results. These benchmarks
were run with defaults (CFQ scheduler, nr_request = 128).

...

Summary:

Raw partition: 228 MiB/s
XFS: 228 MiB/s
Ext3: 139-151 MiB/s


Regards,

Bill Rugolsky


2006-11-07 19:55:21

by Alex Tomas

Subject: Re: Fw: Re: ICP, 3ware, Areca?


can we get vmstat 1 output for the run?
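
something along these lines would capture it for one of the dd runs (just a
sketch; the output file and test file names are placeholders):

vmstat 1 > vmstat.log &
VMSTAT_PID=$!
dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
kill $VMSTAT_PID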

thanks, Alex

>>>>> Andrew Morton (AM) writes:

AM> Why is ext3 slow??



2006-11-07 21:00:00

by Dave Kleikamp

Subject: Re: Fw: Re: ICP, 3ware, Areca?

On Tue, 2006-11-07 at 11:47 -0800, Andrew Morton wrote:
> Why is ext3 slow??

Allocation? I don't see anything indicating that Bill is overwriting an
existing file, so there is block allocation and journaling overhead. If
that's the case, it would be interesting to see how fast ext3 is when
overwriting a file. Extents and delayed allocation should improve on
this a lot.

> On Fri, 18 Aug 2006, I wrote to the list:
>
> I've been doing sequential raw disk I/O testing with both Jens Axboe's
> "fio" using libaio and iodepths up to 32, as well as a basic
> "dd if=/dev/zero oflag=direct".
>
> Reads look fine; a zone read test shows 360 MiB/s at the start of the disk,
> 190 MiB/s at the end. I see similarly high numbers doing direct reads via
> ext3.

This would indicate that indirect block lookups themselves aren't a
problem.

> Summary:
>
> Raw partition: 228 MiB/s
> XFS: 228 MiB/s
> Ext3: 139-151 MiB/s
--
David Kleikamp
IBM Linux Technology Center

2006-11-07 21:07:23

by bzzz

Subject: Re: Fw: Re: ICP, 3ware, Areca?

>>>>> Dave Kleikamp (DK) writes:

DK> On Tue, 2006-11-07 at 11:47 -0800, Andrew Morton wrote:
>> Why is ext3 slow??

DK> Allocation? I don't see anything indicating that Bill is overwriting an
DK> existing file, so there is block allocation and journaling overhead. If
DK> that's the case, it would be interesting to see how fast ext3 is when
DK> overwriting a file. Extents and delayed allocation should improve on
DK> this a lot.

this was my first suspicion as well, though in my testing on an Opteron,
writes achieved ~300 MB/s while consuming 100% CPU. so it would be interesting
to see the vmstat output and actual CPU consumption.

thanks, Alex

2006-11-07 21:45:19

by Andrew Morton

Subject: Re: Fw: Re: ICP, 3ware, Areca?

On Tue, 07 Nov 2006 14:59:52 -0600
Dave Kleikamp <[email protected]> wrote:

> On Tue, 2006-11-07 at 11:47 -0800, Andrew Morton wrote:
> > Why is ext3 slow??
>
> Allocation? I don't see anything indicating that Bill is overwriting an
> existing file, so there is block allocation and journaling overhead. If
> that's the case, it would be interesting to see how fast ext3 is when
> overwriting a file. Extents and delayed allocation should improve on
> this a lot.

Maybe, or perhaps some funniness with RAID alignment.

Bill, if you have time it'd be interesting to repeat the comparative
benchmarking with:

ext3, data=ordered:

dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc

ext4dev:

dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc

ext4dev, -oextents

rm foo
dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc
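
For the ext4dev runs the setup would presumably be along these lines (a sketch:
the mount point is arbitrary, the mkfs options should match whatever the ext3
run used, and extents are switched on per-mount as the -oextents above indicates):

mke2fs -j /dev/sdc2                          # match the ext3 mkfs options here
mount -t ext4dev /dev/sdc2 /mnt              # ext4dev without extents
umount /mnt
mount -t ext4dev -o extents /dev/sdc2 /mnt   # ext4dev with extents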

> > Reads look fine; a zone read test shows 360 MiB/s at the start of the disk,
> > 190 MiB/s at the end. I see similarly high numbers doing direct reads via
> > ext3.
>
> This would indicate that indirect block lookups themselves aren't a
> problem.
>
> > Summary:
> >
> > Raw partition: 228 MiB/s
> > XFS: 228 MiB/s
> > Ext3: 139-151 MiB/s

It's hard to believe that the block allocator could do this to us. I'd be
suspecting that something is causing additional seeking.
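
If it is seeking, a blktrace of the ext3 run would presumably show it (a rough
sketch; the trace basename is arbitrary):

blktrace -d /dev/sdc -o ext3-seq-write &     # needs debugfs mounted at /sys/kernel/debug
BT_PID=$!
dd if=/dev/zero of=foo bs=4M count=1024 oflag=direct
kill $BT_PID
blkparse -i ext3-seq-write | less            # eyeball the sector offsets for extra seeks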

Bill, when publishing figures like this it is useful (and somewhat
important) to also record the CPU consumption. So please publish the full
output of /usr/bin/time and not just the elapsed time, thanks.
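
That is, something like the line below; a bare "time" may resolve to the shell
builtin and print only the elapsed/user/sys summary:

/usr/bin/time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc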

2006-11-07 22:07:04

by Bill Rugolsky Jr.

Subject: Re: Fw: Re: ICP, 3ware, Areca?

On Tue, Nov 07, 2006 at 01:45:13PM -0800, Andrew Morton wrote:
> Bill, if you have time it'd be interesting to repeat the comparative
> benchmarking with:
>
> ext3, data=ordered:
>
> dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
> time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc
>
> ext4dev:
>
> dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
> time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc
>
> ext4dev, -oextents
>
> rm foo
> dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct
> time dd if=/dev/zero of=foo bs=1M count=1000 oflag=direct conv=notrunc

Andrew,

Will do.

I currently have one of these servers running a production Postgresql
over Ext3. The warm-standby backup server is not yet fully configured
or in use, so I will do some testing on it before deploying it.

We are at the tail end of a horrible office move, so I've been a bit
removed from kernel-building. [Sadly, I have yet to have a chance to test
the excellent sata_nv ADMA work to see whether the latencies are gone.]
I ought to be able to get to testing in the next day or two; sorry in advance
for the delay.

In the e-mail you received, I had omitted the full information from my
original postings. I don't see the archives online, so I've appended the
full results. fio-1.5-0.20060728152503 was used; the parameters appear
in the fio output.

-Bill

=========================================================================

Date: Tue, 22 Aug 2006 12:39:01 -0400
From: "Bill Rugolsky Jr." <[email protected]>
To: Chris Caputo <[email protected]>
Cc: [email protected]
Subject: Re: Areca 1220 Sequential I/O performance numbers
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>


On Fri, Aug 18, 2006 at 10:54:22PM +0000, Chris Caputo wrote:
> I'd run a test with write cache on and one with write cache off and
> compare the results. The difference can be vast and depending on your
> application it may be okay to run with write cache on.

Thanks Chris,

Forcing disk write caching on certainly changes the results
(and the risk profile, of course). For the archives, here are
some simple "dd" and "fio" odirect results. These benchmarks
were run with defaults (CFQ scheduler, nr_request = 128).

Again, the machine is a Tyan 2882 dual Opteron with 8GB RAM and an Areca 1220
/ 128MB BBU and 8xWDC WD2500JS-00NCB1 250.1GB 7200 RPM configured as a
RAID6 with chunk size 64K. [System volume is on a separate MD RAID1 on
the Nvidia controller.] It's running FC4 x86_64 with a custom-built
2.6.17.7 kernel and the arcmsr driver from scsi-misc GIT, which is
basically 1.20.0X.13 + fixes. The firmware is V1.41 2006-5-24.


Summary:

Raw partition: 228 MiB/s
XFS: 228 MiB/s
Ext3: 139-151 MiB/s

[N.B.: The "dd" numbers are displayed in MB/s, the "fio" results are in MiB/s.]

=================
= Raw partition =
=================

% sudo time dd if=/dev/zero of=/dev/sdc2 bs=4M count=1024 oflag=direct
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 17.7893 seconds, 241 MB/s
0.00user 0.68system 0:17.86elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+264minor)pagefaults 0swaps

% sudo fio sequential-write
client1: (g=0): rw=write, odir=1, bs=131072-131072, rate=0,
ioengine=libaio, iodepth=32
Starting 1 thread
Threads running: 1: [W] [100.00% done] [eta 00m:00s]
client1: (groupid=0): err= 0:
write: io= 4099MiB, bw=228004KiB/s, runt= 18855msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 83, avg=18.07, dev=26.64
bw (KiB/s) : min= 0, max=358612, per=98.57%, avg=224741.21, dev=243343.17
cpu : usr=0.30%, sys=5.15%, ctx=33015

Run status group 0 (all jobs):
WRITE: io=4099MiB, aggrb=228004, minb=228004, maxb=228004,
mint=18855msec, maxt=18855msec

Disk stats (read/write):
sdc: ios=0/32799, merge=0/0, ticks=0/602466, in_queue=602461, util=99.73%


======================================================
= XFS (/sbin/mkfs.xfs -f -d su=65536,sw=6 /dev/sdc2) =
======================================================

% sudo time dd if=/dev/zero of=foo bs=4M count=1024 oflag=direct
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 17.9354 seconds, 239 MB/s
0.00user 0.80system 0:17.93elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+268minor)pagefaults 0swaps

% sudo fio sequential-write-foo
client1: (g=0): rw=write, odir=1, bs=131072-131072, rate=0,
ioengine=libaio, iodepth=32
Starting 1 thread
client1: Laying out IO file (4096MiB)
Threads running: 1: [W] [100.00% done] [eta 00m:00s]
client1: (groupid=0): err= 0:
write: io= 4096MiB, bw=228613KiB/s, runt= 18787msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 105, avg=18.02, dev=26.63
bw (KiB/s) : min= 0, max=359137, per=97.62%, avg=223165.97, dev=240029.16
cpu : usr=0.21%, sys=5.39%, ctx=32928

Run status group 0 (all jobs):
WRITE: io=4096MiB, aggrb=228613, minb=228613, maxb=228613,
mint=18787msec, maxt=18787msec

Disk stats (read/write):
sdc: ios=28/49658, merge=0/1, ticks=520/2564125, in_queue=2564637, util=92.62%

==================================================================
= Ext3 (/sbin/mke2fs -j -J size=400 -E stride=96 /dev/sdc2) =
= This is with data=ordered; data=writeback was slightly slower. =
==================================================================

% sudo time dd if=/dev/zero of=foo bs=4M count=1024 oflag=direct
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB) copied, 29.4102 seconds, 146 MB/s
0.00user 1.40system 0:29.95elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+268minor)pagefaults 0swaps

% sudo fio sequential-write-foo
client1: (g=0): rw=write, odir=1, bs=131072-131072, rate=0,
ioengine=libaio, iodepth=32
Starting 1 thread
Threads running: 1: [W] [100.00% done] [eta 00m:00s]0m:10s]
client1: (groupid=0): err= 0:
write: io= 4096MiB, bw=151894KiB/s, runt= 28276msec
slat (msec): min= 0, max= 0, avg= 0.00, dev= 0.00
clat (msec): min= 0, max= 428, avg=27.23, dev=56.99
bw (KiB/s) : min= 0, max=266338, per=100.11%, avg=152057.02, dev=173467.74
cpu : usr=0.23%, sys=3.64%, ctx=32944

Run status group 0 (all jobs):
WRITE: io=4096MiB, aggrb=151894, minb=151894, maxb=151894,
mint=28276msec, maxt=28276msec

Disk stats (read/write):
sdc: ios=0/33867, merge=0/5, ticks=0/934143, in_queue=934143, util=99.96%

2006-11-07 22:20:12

by Bill Rugolsky Jr.

Subject: Re: Fw: Re: ICP, 3ware, Areca?

On Tue, Nov 07, 2006 at 01:45:13PM -0800, Andrew Morton wrote:
> On Tue, 07 Nov 2006 14:59:52 -0600
> Dave Kleikamp <[email protected]> wrote:
>
> > On Tue, 2006-11-07 at 11:47 -0800, Andrew Morton wrote:
> > > Why is ext3 slow??
> >
> > Allocation? I don't see anything indicating that Bill is overwriting an
> > existing file, so there is block allocation and journaling overhead. If
> > that's the case, it would be interesting to see how fast ext3 is when
> > overwriting a file. Extents and delayed allocation should improve on
> > this a lot.

Will do.

> Maybe. or perhaps some funniness with RAID aligment.

I neglected to include the relevant RAID/mkfs info here.

device=/dev/sdc2 # ought to have been on a raid stripe boundary
# very close to the start of the array

# XFS:
mkfs.xfs -f -d su=65536,sw=6 -l su=65536 $device
mount -o noatime,attr2,largeio,logbsize=64k $device /mnt

# Ext3: XFS has problems up through 2.6.18-rc5; use slow, but safe, Ext3:
mke2fs -j -J size=400 -E stride=96 $device
mount -o noatime $device /mnt
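
For reference, the arithmetic behind those numbers, assuming 4 KiB filesystem
blocks:

# 64 KiB chunk / 4 KiB block = 16 blocks per chunk on each disk
# 16 blocks * 6 data disks   = 96 blocks per full RAID6 stripe (8 drives - 2 parity)
# which is where su=65536,sw=6 (XFS) and stride=96 (mke2fs) come from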

Also, before each test I ran:

blockdev --flushbufs $device
echo 1 | sudo tee /proc/sys/vm/drop_caches

-Bill