2002-09-03 09:19:48

by Aaron Lehmann

Subject: ext3 throughput woes on certain (possibly heavily fragmented) files

This pretty much sums it up:

[aaronl@vitelus:~]$ time cat mail/debian-legal > /dev/null
cat mail/debian-legal > /dev/null 0.00s user 0.02s system 0% cpu 5.565 total
[aaronl@vitelus:~]$ ls -l mail/debian-legal
-rw------- 1 aaronl mail 7893525 Sep 3 00:42 mail/debian-legal
[aaronl@vitelus:~]$ time cat /usr/src/linux-2.4.18.tar.bz2 > /dev/null
cat /usr/src/linux-2.4.18.tar.bz2 > /dev/null 0.00s user 0.10s system 16% cpu 0.616 total
[aaronl@vitelus:~]$ ls -l /usr/src/linux-2.4.18.tar.bz2
-rw-r--r-- 1 aaronl aaronl 24161675 Apr 14 11:53

Both files were AFAIK not in any cache, and they are on the same
partition.
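
To put the two timings on a common scale, the sizes and wall-clock times from the transcript above can be turned into throughput figures. This is only arithmetic on the numbers already shown; the helper name is made up for illustration, and 1 MB is taken as 10^6 bytes.

```python
def throughput_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Return read throughput in MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / seconds / 1e6


# Sizes (bytes) and elapsed times (seconds) copied from the transcript.
mailbox = throughput_mb_per_s(7893525, 5.565)
tarball = throughput_mb_per_s(24161675, 0.616)

print(f"mailbox: {mailbox:.2f} MB/s")
print(f"tarball: {tarball:.2f} MB/s")
print(f"ratio:   {tarball / mailbox:.1f}x")
```

That works out to roughly 1.4 MB/s for the mailbox against roughly 39 MB/s for the tarball, a gap of about 28x.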

My current uninformed theory is that this is caused by fragmentation,
since the linux tarball was downloaded all at once but the mailbox I'm
comparing it to has 1695 messages, each of which was appended
separately to the file. All of my mailboxes exhibit similarly awful
performance.

Do any other filesystems handle this type of thing more gracefully? Is
there room for improvement in ext3? Is there any way I can test my
theory by seeing how fragmented a certain inode is? What can I do to
avoid extensive fragmentation, if it is truly the cause of my issue?
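
One way to test the fragmentation theory on Linux is the FIBMAP ioctl, which maps a file's logical block numbers to physical block numbers. The sketch below is a rough illustration rather than a polished tool: the function names are invented here, FIBMAP normally requires root, and on ext2/ext3 the indirect blocks typically sit between data blocks, so even a well-laid-out file will show some breaks.

```python
import fcntl
import os
import struct

FIBMAP = 1    # ioctl from linux/fs.h: logical block -> physical block
FIGETBSZ = 2  # ioctl from linux/fs.h: filesystem block size in bytes


def physical_blocks(path):
    """Return the physical block number of every logical block of `path`.

    Needs root (CAP_SYS_RAWIO) and a filesystem that supports FIBMAP.
    """
    with open(path, "rb") as f:
        fd = f.fileno()
        raw = fcntl.ioctl(fd, FIGETBSZ, struct.pack("i", 0))
        block_size = struct.unpack("i", raw)[0]
        nblocks = (os.fstat(fd).st_size + block_size - 1) // block_size
        blocks = []
        for n in range(nblocks):
            raw = fcntl.ioctl(fd, FIBMAP, struct.pack("i", n))
            blocks.append(struct.unpack("i", raw)[0])
        return blocks


def count_fragments(blocks):
    """Count runs of physically consecutive blocks; holes (0) are skipped."""
    fragments = 0
    prev = None
    for b in blocks:
        if b == 0:
            continue  # FIBMAP reports 0 for an unmapped block (a hole)
        if prev is None or b != prev + 1:
            fragments += 1
        prev = b
    return fragments
```

Run as root, `count_fragments(physical_blocks("mail/debian-legal"))` would give a number to compare against the same call on the tarball.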

I'm running 2.4.20-pre5, but this is not a recently-introduced problem.

The disk is IDE - nothing fancy, WDC WD200BB-18CAA0. IDE controller is
ServerWorks CSB5. However, I've had this problem consistently on
previous hardware.


2002-09-06 16:01:50

by Stephen C. Tweedie

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Hi,

On Tue, Sep 03, 2002 at 02:24:19AM -0700, Aaron Lehmann wrote:

> [aaronl@vitelus:~]$ time cat mail/debian-legal > /dev/null
> cat mail/debian-legal > /dev/null 0.00s user 0.02s system 0% cpu 5.565 total
> [aaronl@vitelus:~]$ ls -l mail/debian-legal
> -rw------- 1 aaronl mail 7893525 Sep 3 00:42 mail/debian-legal
> [aaronl@vitelus:~]$ time cat /usr/src/linux-2.4.18.tar.bz2 > /dev/null
> cat /usr/src/linux-2.4.18.tar.bz2 > /dev/null 0.00s user 0.10s system 16% cpu 0.616 total
> [aaronl@vitelus:~]$ ls -l /usr/src/linux-2.4.18.tar.bz2
> -rw-r--r-- 1 aaronl aaronl 24161675 Apr 14 11:53
>
> Both files were AFAIK not in any cache, and they are on the same
> partition.
>
> My current uninformed theory is that this is caused by fragmentation,
> since the linux tarball was downloaded all at once but the mailbox I'm
> comparing it to has 1695 messages, each of which was appended
> separately to the file. All of my mailboxes exhibit similarly awful
> performance.

Yep, both ext2 and ext3 can get badly fragmented by files which are
closed, reopened and appended to frequently like that.

> Do any other filesystems handle this type of thing more gracefully?

There are some ideas from recent FFS changes. One thing they now do
is to defragment things automatically as a file grows by effectively
deleting and then reallocating the last 16 blocks of the file.
Fragmentation will still occur, but less so, if we do that.

Cheers,
Stephen

2002-09-06 17:09:51

by Nikita Danilov

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Stephen C. Tweedie writes:
> There are some ideas from recent FFS changes. One thing they now do
> is to defragment things automatically as a file grows by effectively
> deleting and then reallocating the last 16 blocks of the file.
> Fragmentation will still occur, but less so, if we do that.
>

Another possible solution is to try to "defer" allocation. For example,
in reiser4 (and XFS, I believe) extents are allocated at transaction
commit; as a result, a file created by several writes will still be
allocated as one extent.
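
As a toy illustration of why deferring allocation helps, the sketch below compares a naive allocator that places each appended block immediately with one that waits until all writes are known. It is a deliberately simplified model with invented function names; it is not how reiser4, XFS, or the ext2/ext3 allocator actually choose blocks.

```python
def allocate_immediately(num_files, blocks_per_file):
    """Each one-block append takes the next free disk block right away."""
    layout = {f: [] for f in range(num_files)}
    next_free = 0
    for _ in range(blocks_per_file):
        for f in range(num_files):  # files take turns appending one block
            layout[f].append(next_free)
            next_free += 1
    return layout


def allocate_deferred(num_files, blocks_per_file):
    """Appends are only counted; each file gets one contiguous run at flush."""
    pending = {f: 0 for f in range(num_files)}
    for _ in range(blocks_per_file):
        for f in range(num_files):
            pending[f] += 1  # remember the write, choose no block yet
    layout = {}
    next_free = 0
    for f, n in pending.items():
        layout[f] = list(range(next_free, next_free + n))
        next_free += n
    return layout


def extents(blocks):
    """Number of runs of consecutive block numbers in `blocks`."""
    return sum(
        1 for i, b in enumerate(blocks) if i == 0 or b != blocks[i - 1] + 1
    )


print(extents(allocate_immediately(4, 8)[0]))
print(extents(allocate_deferred(4, 8)[0]))
```

With four files appended to in turn, the immediate allocator leaves each file in as many pieces as it has blocks, while the deferred one gives each file a single run.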

Nikita.

2002-09-06 17:17:51

by Hans Reiser

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Nikita Danilov wrote:

>Another possible solution is to try to "defer" allocation. For example,
>in reiser4 (and XFS, I believe) extents are allocated at transaction
>commit; as a result, a file created by several writes will still be
>allocated as one extent.
>
The FFS approach has an advantage for the case where the file grows too
slowly for allocation to be delayed.

I think I prefer that we implement a repacker for reiser4 though, as
that, combined with delayed allocation, will be a balanced and thorough
solution.

Hans


2002-09-06 17:20:26

by Stephen C. Tweedie

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Hi,

On Fri, Sep 06, 2002 at 09:14:28PM +0400, Nikita Danilov wrote:

> Another possible solution is to try to "defer" allocation. For example,
> in reiser4 (and XFS, I believe) extents are allocated at transaction
> commit; as a result, a file created by several writes will still be
> allocated as one extent.

Ext2 has a preallocation mechanism so that if you have multiple
writes, they get dealt with to some extent as a single allocation.
However, that doesn't work over close(): the preallocated blocks are
discarded whenever we close the file.

The problem with mail files, though, is that they tend to grow quite
slowly, so the writes span very many transactions and we don't have
that opportunity for coalescing the writes. Actively defragmenting on
writes is an alternative in that case.

Cheers,
Stephen

2002-09-06 20:59:36

by Aaron Lehmann

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

On Fri, Sep 06, 2002 at 09:22:22PM +0400, Hans Reiser wrote:
> I think I prefer that we implement a repacker for reiser4 though, as
> that, combined with delayed allocation, will be a balanced and thorough
> solution.

How does current ReiserFS fare against extreme fragmentation? What
about XFS? Without wanting to start a flamewar, which Linux filesystems
are best at preventing fragmentation?

The filesystem could make a huge difference on a machine like a mail
server...

2002-09-06 22:00:39

by Hans Reiser

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Aaron Lehmann wrote:

>On Fri, Sep 06, 2002 at 09:22:22PM +0400, Hans Reiser wrote:
>
>>I think I prefer that we implement a repacker for reiser4 though, as
>>that, combined with delayed allocation, will be a balanced and thorough
>>solution.
>
>How does current ReiserFS fare against extreme fragmentation? What
>about XFS? Without wanting to start a flamewar, which Linux filesystems
>are best at preventing fragmentation?
>
>The filesystem could make a huge difference on a machine like a mail
>server...
>
Sometimes it is best to confess that one does not have the expertise
appropriate for answering a question. Someone on our mailing list
studied it carefully though. Perhaps they can comment.

Hans

2002-09-06 23:15:15

by Hell.Surfers

Subject: RE:Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Perhaps you could use a fat partition, you can defragment those, or ntfs [mwaaaahahaha].



On Sat, 07 Sep 2002 02:05:12 +0400 Hans Reiser <[email protected]> wrote:



2002-09-16 17:55:59

by Peter Niemayer

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Hans Reiser wrote:

> Sometimes it is best to confess that one does not have the expertise
> appropriate for answering a question. Someone on our mailing list
> studied it carefully though. Perhaps they can comment.

You can find everything about the diploma thesis Constantin Loizides
wrote on that topic at

http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/

Alas, while fragmentation effects are measurable, their real-world impact
is so heavily masked by even the slightest differences in the VFS of
different Linux kernel versions and the usage pattern of applications
that it is hard to make a definitive general statement.

Regards,

Peter Niemayer

2002-09-16 22:34:17

by Simon Kirby

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

On Fri, Sep 06, 2002 at 06:24:57PM +0100, Stephen C. Tweedie wrote:

> Ext2 has a preallocation mechanism so that if you have multiple
> writes, they get dealt with to some extent as a single allocation.
> However, that doesn't work over close(): the preallocated blocks are
> discarded whenever we close the file.
>
> The problem with mail files, though, is that they tend to grow quite
> slowly, so the writes span very many transactions and we don't have
> that opportunity for coalescing the writes. Actively defragmenting on
> writes is an alternative in that case.

We recently switched a large mail spool from ext2 to ext3 with default
journalling, and we are now having huge problems with disk I/O load.

We have fsync and friends disabled for performance reasons. With ext2,
the machine would happily hum along with a load average of 0.2 and a
typical 400 kB - 800 kB write every 5 seconds, with reads of about
10 kB/sec.

Now with ext3, the machine has a load average of about 15 and writing
happens almost all of the time. "vmstat 1" output:

  procs              memory               swap      io       system        cpu
 r   b  w   swpd   free    buff    cache  si  so  bi    bo    in    cs  us  sy  id
 0  42  2  79368  47196  100456  1080348   0   0   0  3036  2514  2077  18  21  60
 0  76  2  79368  44264  100456  1080348   0   0   0  1776  1266   823   4   3  92
 0 111  3  79368  41248  100456  1080348   0   0   0  1952  1176   722   4   5  91
 0 132  2  79368  39432  100460  1080348   0   0   0  1368  1007   612   1   3  96
 0  67  3  79368  34412  100460  1080628   0   0   0  2884  1968  1246  18  13  69
 0  41  2  79368  36572  100468  1080828   0   0  24  4020  2661  1530  16  21  64
 0  32  3  79368  31736  100500  1081456   0   0   0  3688  2696  2061  26  22  52
 0  39  3  79368  24588  100528  1082164   0   0   4  3800  2636  2643  30  21  50
 0  32  4  79368  21500  100536  1082832   0   0  24  3216  2404  2419  32  15  54
 5  28  2  79368  18160  100536  1083360   0   0   0  3416  2372  2164  24  19  57
 0  25  4  79368  19748  100552  1082896   0   0   4  4120  2544  2421  17  21  62
 4  16  4  79368  18216  100560  1083284   0   0   0  3532  2115  2361  20  17  63
 0  37  2  79368  17240  100568  1083456   0   0  16  2376  1817  1691   8  12  80
 1  67  3  79368  15112  100568  1083456   4   0   4  1644  1051   723   6   4  90
 1  88  3  79368  12028  100572  1083464   0   0   8  1884  1102   684   6   3  91
 0 108  3  79368  10132  100572  1083468   0   0   0  1716   924   503   3   3  94
15   0  2  79368  14460  100548  1081996   0   0  12  3852  2609  2000  17  25  59
 0  39  3  79368  13252  100576  1082220   0   0  52  4288  2740  2095  19  19  62

This box is primarily running a POP3 server (written in-house to cache
mbox offsets, so that it can handle a huge volume of mail), and also
exports the mail spool via NFS to other servers which run exim (-fsync).
nfsd is exported async. Everything is mounted noatime, nodiratime. No
applications should be calling sync/fsync/fdatasync or using O_SYNC.
It's a mail server, so everything is fragmented.

We're using dotlocking. Would this cause metadata journalling? We had
to hash the mail spool a long time ago due to system time eating all CPU
(the ext2 linear directory scan to find a slot available in the spool
directory to add the dotlock file). I estimate about 200 - 300 dotlock
files are created per second, but these should all be asynchronous.
Would switching to fcntl() locking (if this works over NFS) solve the
problem?

A "ps -eo pid,stat,args,wchan | grep simpopd | grep ' D '" shows POP3
processes stuck in either "down" or in "do_get_write_access", which
appears to be a journal function.

We notice there are some ext3 updates included as a patch to vanilla
2.4.18 in the newest Red Hat kernel, including changes to the
do_get_write_access function. Have improvements in this area been made?

Thanks!

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2002-09-17 16:50:06

by Andreas Dilger

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

On Sep 16, 2002 15:39 -0700, Simon Kirby wrote:
> We recently switched a large mail spool from ext2 to ext3 with default
> journalling, and we are now having huge problems with disk I/O load.
>
> We have fsync and friends disabled for performance reasons. With ext2,
> the machine would happily hum along with an average load of 0.2 and a
> usual 400 kB - 800 kB write every 5 seconds, with about 10 kB/sec read in
> every second.
>
> This box is primarily running a POP3 server (written in-house to cache
> mbox offsets, so that it can handle a huge volume of mail), and also
> exports the mail spool via NFS to other servers which run exim (-fsync).
> nfsd is exported async. Everything is mounted noatime, nodiratime. No
> applications should be calling sync/fsync/fdatasync or using O_SYNC.
> It's a mail server, so everything is fragmented.

Hmm, it seems strange (and rather unsafe) that you would run a mail
spool without using sync I/O. Unfortunate too, because sync I/O with
a large journal (and perhaps an external journal disk) would give you
very fast throughput on ext3.

> We're using dotlocking. Would this cause metadata journalling?
> I estimate about 200 - 300 dotlock files are created per second, but
> these should all be asynchronous.

Lots of it. So, that is 250 * (1 dir block + 1 inode bitmap + 1 inode
table block (+ 1 block bitmap + 1 data block, if there is data in the
dotlock file)) = 1250 blocks/second or so.
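
The estimate above can be written out as a small calculation. The per-create block counts are the ones listed in the message; the function name is made up for illustration, and 250 creates per second is simply the midpoint of the quoted 200 - 300 range.

```python
def journaled_blocks_per_second(creates_per_second, lock_file_has_data):
    """Estimate metadata blocks dirtied per second by dotlock creation."""
    # Every create touches: directory block, inode bitmap, inode table block.
    blocks_per_create = 3
    if lock_file_has_data:
        # A non-empty dotlock also touches: block bitmap, data block.
        blocks_per_create += 2
    return creates_per_second * blocks_per_create


print(journaled_blocks_per_second(250, lock_file_has_data=True))
print(journaled_blocks_per_second(250, lock_file_has_data=False))
```

With data in the dotlock files this gives the 1250 blocks/second quoted above; with empty dotlock files the same model gives 750.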

> Would switching to fcntl() locking (if this works over NFS) solve the
> problem?

Probably (no disk I/O generated), but I don't know the state of NFS locking.

> We had to hash the mail spool a long time ago due to system time eating all
> CPU (the ext2 linear directory scan to find a slot available in the spool
> directory to add the dotlock file).

One reason why we are adding hash-indexed directory support to ext3, so
that you don't have to implement it in each application.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-09-17 21:50:54

by jw schultz

Subject: Re: ext3 throughput woes on certain (possibly heavily fragmented) files

On Mon, Sep 16, 2002 at 03:39:11PM -0700, Simon Kirby wrote:
> This box is primarily running a POP3 server (written in-house to cache
> mbox offsets, so that it can handle a huge volume of mail), and also
> exports the mail spool via NFS to other servers which run exim (-fsync).
> nfsd is exported async. Everything is mounted noatime, nodiratime. No
> applications should be calling sync/fsync/fdatasync or using O_SYNC.
> It's a mail server, so everything is fragmented.
>
> We're using dotlocking. Would this cause metadata journalling? We had
> to hash the mail spool a long time ago due to system time eating all CPU
> (the ext2 linear directory scan to find a slot available in the spool
> directory to add the dotlock file). I estimate about 200 - 300 dotlock
> files are created per second, but these should all be asynchronous.
> Would switching to fcntl() locking (if this works over NFS) solve the
> problem?

I'd absolutely go to fcntl(). As bad as dotlocking is for
journaling filesystems, it is even worse for NFS (when it works).
Look at the lkml thread "invalidate_inode_pages in 2.5.32/3"
to get an idea. Multiply the directory invalidations by the
size of the directories. fcntl() is the preferred way of locking
over NFS as it will even report if there is a problem with
lockd.
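
A minimal sketch of what an fcntl()-based mbox append could look like, using Python's `fcntl.lockf`, which is a wrapper around the fcntl(2) locking calls. The function name and the 0600 mode are assumptions for illustration. Every program touching the spool would have to use the same locking scheme, and whether the lock is honoured over NFS depends on a working lockd on both ends.

```python
import fcntl
import os


def append_message(mbox_path, message):
    """Append `message` (bytes) to an mbox file under an exclusive lock."""
    fd = os.open(mbox_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # Blocks until the whole-file lock is granted; raises OSError if
        # the lock cannot be obtained (for example, a lock manager problem).
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            view = memoryview(message)
            while view:
                # os.write may write fewer bytes than asked; keep going.
                written = os.write(fd, view)
                view = view[written:]
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Unlike a dotlock, taking and releasing this lock creates no file in the spool directory, so there is no directory or inode metadata to journal.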


--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt