2004-10-28 22:47:06

by Martin Mokrejs

Subject: Filesystem performance on 2.4.28-pre3 on hardware RAID5.

Hi,
I have finished evaluating my filesystem tests. For your
interest, I have put the results at http://www.natur.cuni.cz/~mmokrejs/linux-performance.
I have some questions for the developers:

1.
ext3 has lost the nice performance of ext2 in the "Random create /Delete"
test, whatever it does (see the bonnie++ 1.93c docs), as well as in the
"Sequential output /Char" test, in the peak performance of the
"Sequential output /Block" test, in the "Sequential create /Create" test,
and in "Random seek" performance.

2.
"mount -t xfs -o async" unexpectedly kills random seek performance,
but is still a bit better than with "-o sync". ;) Maybe it has to do
with the dramatic jump in CPU consumption of this operation,
as it in both cases it takes about 21-26% instead of usual 3%.
Why? Isn't actually async default mode?
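
To be concrete, this is roughly how I would double-check which mode is in
effect (a sketch only; the device and mount point are placeholders, and it
assumes the usual behaviour where "sync" shows up in /proc/mounts only when
explicitly requested):

  # mount without the switch, then with -o sync, and compare what the kernel records
  mount -t xfs /dev/sda1 /scratch
  grep /scratch /proc/mounts      # no "sync" flag expected here, i.e. async by default
  umount /scratch
  mount -t xfs -o sync /dev/sda1 /scratch
  grep /scratch /proc/mounts      # "sync" should now appear among the options
  umount /scratch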

3.
"mke2fs -O dir_index" decreased Sequential output /Block perf
by 25%. It seems the only positive effect of this option
is in Random create /Delete test and "Random seek" test.

4.
At least on this RAID5, "mke2fs -T largefile" and "-T largefile4"
should be prohibited, as they kill "Sequential create /Create" performance.
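
As far as I understand, -T largefile and -T largefile4 merely pick a much
larger bytes-per-inode ratio, so the equivalent explicit invocations would be
roughly the following (the device is a placeholder):

  mke2fs -i 1048576 /dev/sda1     # about what -T largefile selects (one inode per 1 MB)
  mke2fs -i 4194304 /dev/sda1     # about what -T largefile4 selects (one inode per 4 MB)
  tune2fs -l /dev/sda1 | grep -i 'inode count'   # compare the resulting inode counts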

5. Based on your experience, would you prefer better random IO
or better sequential IO? The server should house many huge files of around
800 MB or more, and in general there should be more reads than writes on the
partition. As the RAID5 controller splits the data into chunks across the disks,
sequential reads and writes get split up anyway (and I cannot tell whether the
data is also randomly spread over every disk platter).

Any feedback? Please Cc: me. Thanks.
Martin


For the archive, here are some minimal results from the tests.
There is a lot more on that page, including the raw data, further comments,
and some comparisons. Briefly (more comments on the web):

----------------------------------------------------
Sequential output /Char
ReiserFS3 255 K/sec
XFS 425 K/sec
ext3 122 K/sec
ext2 540 K/sec (560?)
----------------------------------------------------
Sequential output /Block
ReiserFS3 53400 K/sec
XFS 56500 K/sec
ext3 48000-51500 K/sec
ext2 36000-53000 K/sec
----------------------------------------------------
Sequential create /Create
ReiserFS3 18-23 ms
ReiserFS3 55 us (when "mkreiserfs -t 128 or 256 or 512 or 768" used)
XFS 60-120 ms
ext3 2-3 ms
ext2 25-30 us (with the exception of "mke2fs -T largefile" or "largefile4")
----------------------------------------------------
Random seeks consume very different amounts of CPU and achieve
different numbers of seeks per second:
ReiserFS3 520 seeks/sec and consumes 60% (with 2 weird exceptions below)
ReiserFS3 1290 seeks/sec and consumes 3%
(mkreiserfs -s 1024 or 2048 or 16384)
(mkreiserfs -t 768)
XFS 804 seeks/sec and consumes 3%
XFS 540-660 seeks/sec and consumes 21-26%
(worse values for mount -o sync,
better values for -o async, but still
worse than if the switch is omitted).
ext3 770 seeks/sec and consumes 30%
ext2 790-800 seeks/sec and consumes 30%
ext2 815 seeks/sec and consumes 30%
(mke2fs -O dir_index)
----------------------------------------------------
Random create /Create
ReiserFS3 50-55 us
XFS 3000 us
ext3 1400-7000 us
ext2 24-30 us
----------------------------------------------------
Random create /Read
ReiserFS3 mostly 8-10 us (-o notail doubles the time,
increases "Sequential create /Create" time by 30%,
and decreases the number of random seeks per second by 60%)
XFS 9-13 or 19 us
ext3 5 us
ext2 10 us
----------------------------------------------------
Random create /Delete:
ReiserFS3 80-90 us
XFS 3000-3500 us
ext3 43-66 us
ext2 23-38 us


How did I test?
I used "bonnie++ -n 1 -s 12G -d /scratch -u apache -q"
on an external 1 TB RAID5 logical drive. Data should be split
by the RAID controller into 128 kB chunks. The server has 6 GB RAM,
an SMP and HIGHMEM enabled 2.4.28-pre3 kernel, 12 GB of swap
on the same RAID5, and 1 GB of ECC RAM on the RAID controller. However,
only 500 MB of that is used (it's a dual-controller unit, so each controller
uses just half; the other half mirrors the other controller's cache).
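
Each run was essentially a cycle like the following (a simplified sketch; the
device name and the particular mkfs/mount variations are placeholders for the
combinations described on the web page):

  # one test cycle; repeated with different mkfs and mount options per filesystem
  mkfs.xfs -f /dev/sda1
  mount -t xfs /dev/sda1 /scratch
  bonnie++ -n 1 -s 12G -d /scratch -u apache -q >> results.csv   # -q gives CSV on stdout
  umount /scratch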


2004-10-29 07:34:05

by Nathan Scott

Subject: Re: Filesystem performance on 2.4.28-pre3 on hardware RAID5.

Hi there,

On Fri, Oct 29, 2004 at 12:43:30AM +0200, Martin MOKREJŠ wrote:
> "mount -t xfs -o async" unexpectedly kills random seek performance,
> but is still a bit better than with "-o sync". ;) Maybe it has to do
> with the dramatic jump in CPU consumption of this operation,
> as in both cases it takes about 21-26% instead of the usual 3%.
> Why? Isn't async actually the default mode?

That's odd. Actually, I'm not sure what the "async" option is meant
to do, it isn't seen by the fs afaict (XFS isn't looking for it)...
we also use the generic_file_llseek code in XFS ... so we're not
doing anything special there either -- some profiling data showing
where that CPU time is spent would be insightful.
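
Something along these lines would do, assuming the kernel was booted with the
profile=2 parameter and the System.map path matches your install (adjust as
needed):

  readprofile -r                                  # reset the kernel profiling counters
  bonnie++ -n 1 -s 12G -d /scratch -u apache -q   # the same seek-heavy workload as before
  readprofile -m /boot/System.map | sort -nr | head -30   # top kernel hot spots by ticks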

> Sequential create /Create
> Random create /Create
> XFS 60-120 ms

You may get better results using a version 2 log (mkfs option)
with large in-core log buffers (mount option) for these (which
mkfs version are you using atm?)
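
i.e. something like the following -- the exact logbsize ceiling depends on
your kernel/xfsprogs combination, and the device/mount point are placeholders:

  mkfs.xfs -f -l version=2 /dev/sda1                             # version 2 log
  mount -t xfs -o logbufs=8,logbsize=262144 /dev/sda1 /scratch   # more/larger in-core log buffers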

cheers.

--
Nathan

2004-10-29 11:11:40

by Martin Mokrejs

Subject: Re: Filesystem performance on 2.4.28-pre3 on hardware RAID5.

Hi Nathan, Marcello and others,
the collected meminfo, slabinfo and vmstat output is at
http://www.natur.cuni.cz/~mmokrejs/crash/

Those precrash-* files contain output collected since the machine
was fresh; every second I appended the current stats to each
of them. I believe some data were not flushed to disk
before the problem occurred.

On STDERR I get "fork: Cannot allocate memory".

Using another open console session and running df gives me:

start_pipeline: Too many open files in system
fork: Cannot allocate memory

Fortunately I had xdm to kill, so I could then run
sync and collect some stats (although some resources
were freed by killing xdm/X11). Those files are named with
the crash-* prefix.

After that, I decided to let the suspended job continue,
and the files collected then are prefixed precrash2-*.

The /var/log/messages output is included there as well.

If you tell me what kind of memory/XFS debugging I should turn
on and *how*, I can do it immediately. I don't have access
to the machine daily, and it already had to go into production. :(

Martin
P.S: It is hardware raid5. I use mkfs.xfs version 2.6.13.

2004-10-31 23:24:45

by Nathan Scott

Subject: Re: Filesystem performance on 2.4.28-pre3 on hardware RAID5.

On Fri, Oct 29, 2004 at 01:10:49PM +0200, [email protected] wrote:
> Hi Nathan, Marcello and others,
> the collected meminfo, slabinfo and vmstat output is at
> http://www.natur.cuni.cz/~mmokrejs/crash/

On Sun, Oct 31, 2004 at 11:20:35PM +0100, Martin MOKREJŠ wrote:
> Sorry, fixed by a softlink. It was actually http://www.natur.cuni.cz/~mmokrejs/tmp/crash/

OK, well there's your problem - see the slabinfo output - you
have over 700MB of buffer_head structures that are not being
reclaimed. Everything else seems to be fine.
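
For reference, this is the line I was looking at; give or take the exact 2.4
slabinfo column layout, you can watch it with:

  grep buffer_head /proc/slabinfo
  # roughly: cache name, active objects, total objects, object size, ...
  # total objects x object size gives the memory tied up in buffer_heads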

> If you tell me what kind of memory/XFS debugging I should turn
> on and *how*, I can do it immediately. I don't have access

I think turning on debugging is going to hinder more than it
will help here.

> to the machine daily, and it already had to go into production. :(
>
> P.S: It is hardware raid5. I use mkfs.xfs version 2.6.13.

Hmm. Did that patch I sent you help at all? That should help
free up buffer_heads more effectively in low memory situations
like this. You may also have some luck with tweaking bdflush
parameters so that flushing out of dirty buffers is started
earlier and/or done more often. I can't remember off the top
of my head what all the bdflush tunables are - see the bdflush
section in Documentation/filesystems/proc.txt.
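
From memory the later 2.4 defaults look like the sketch below; do check the
field order against proc.txt for your kernel before writing anything back.
Lowering the first field (nfract, the percentage of dirty buffers before
bdflush starts flushing) is probably the first thing to try:

  cat /proc/sys/vm/bdflush          # shows the nine tunables; first field is nfract
  # example only: write back your own current values with just the first
  # number lowered, e.g. from 30 to 20
  echo "20 500 0 0 500 3000 60 20 0" > /proc/sys/vm/bdflush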

Alternatively, pick a filesystem blocksize that matches your
pagesize (4K instead of 512 bytes) to minimise the number of
buffer_heads you end up using.
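
i.e. just take the mkfs defaults, or spell it out explicitly (the device is a
placeholder):

  mkfs.xfs -f -b size=4096 /dev/sda1    # 4K blocks to match the i386 pagesize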

cheers.

--
Nathan

2004-11-03 00:18:15

by Nathan Scott

Subject: Re: Filesystem performance on 2.4.28-pre3 on hardware RAID5.

On Tue, Nov 02, 2004 at 01:57:22PM +0100, Martin MOKREJŠ wrote:
> I retested with blocksize 1024, instead of 512 (default) which causes problems:

4K is the default blocksize, not 1024 or 512 bytes. From going
through all your notes, the default mkfs parameters are working
fine, and changing to a 512 byte blocksize (-blog=9 / -bsize=512)
is where the VM starts to see problems.

I don't have a device the size of yours handy on my test box, nor
do I have as much memory as you -- but I ran similar bonnie++
commands with -bsize=512 filesystems on a machine with very little
memory, and a filesystem and file size many times larger than
available memory, and it ran to completion without problems.
I did see vastly more buffer_heads being created than with the
default mkfs parameters (as we'd expect with that blocksize) but
it didn't cause me any VM problems.

> How can I free the buffer_head without rebooting? I'm trying to help myself with

AFAICT, there is no way to do this without a reboot. They are
meant to be reclaimed (and were reclaimed on my test box) as
needed, but they don't seem to be for you.

This looks a lot like a VM balancing sort of problem to me (that
6G of memory you have is a bit unusual - probably not a widely
tested configuration on i386... maybe try booting with mem=1G
and see if that changes anything?), so far it doesn't seem like
XFS is at fault here at least.
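
For the mem=1G experiment it is just a kernel argument, e.g. for LILO
something like:

  # /etc/lilo.conf, in the image stanza for this kernel (then rerun lilo)
  append="mem=1G"

or the equivalent on the kernel line of whatever boot loader you use.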

cheers.

--
Nathan