2009-07-16 11:51:28

by Stephan Kulow

Subject: file allocation problem

Hi,

I played around with ext4 online defrag on 2.6.31-rc3 and noticed a problem.
The core is this:

# filefrag -v /usr/bin/gimp-2.6
File size of /usr/bin/gimp-2.6 is 4677400 (1142 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 2884963 29
1 29 2890819 2884991 29
2 58 2906960 2890847 62
3 120 2893864 2907021 29
4 149 2898531 2893892 29
5 178 2887012 2898559 28
6 206 2887261 2887039 27
7 233 2888229 2887287 27
8 260 2907727 2888255 49
9 309 2907811 2907775 90
10 399 2889078 2907900 26
11 425 2890641 2889103 26
12 451 2908065 2890666 31
13 482 2908136 2908095 33
14 515 2908170 2908168 54
15 569 2908257 2908223 31
16 600 2908378 2908287 38
17 638 2886399 2908415 25
18 663 2908646 2886423 26
19 689 2909129 2908671 56
20 745 2909186 2909184 62
21 807 2909281 2909247 31
22 838 2902503 2909311 25
23 863 103690 2902527 161
24 1024 109621 103850 118 eof
/usr/bin/gimp-2.6: 25 extents found

ext4 defragmentation for /usr/bin/gimp-2.6
[1/1]/usr/bin/gimp-2.6: 100% extents: 25 -> 25 [ OK ]
Success: [1/1]

(filefrag now reports much the same layout)
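For reference, filefrag's "expected" column makes the break count easy to
compute: an extent is discontiguous whenever its physical start differs
from the expected continuation of the previous one. A rough Python sketch,
using the first few lines of the listing above as sample input:

```python
# Count discontiguities in `filefrag -v` output. Sample lines are taken
# from the listing above; the first extent has no "expected" column.
sample = """\
0 0 2884963 29
1 29 2890819 2884991 29
2 58 2906960 2890847 62
"""

def count_breaks(text):
    breaks = 0
    for line in text.splitlines():
        fields = line.split()
        # Lines with an "expected" column have 5+ fields; a mismatch
        # between physical (field 2) and expected (field 3) is a break.
        if len(fields) >= 5 and fields[2] != fields[3]:
            breaks += 1
    return breaks

print(count_breaks(sample))
```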

But now the really interesting part starts: when I copy that file
away (as far as I understand the code, e4defrag allocates space
in /usr/bin too), I get:

cp -a /usr/bin/gimp-2.6{,.defrag} (I have 50% free, so I expect it to find
room):

filefrag -v /usr/bin/gimp-2.6.defrag
File size of /usr/bin/gimp-2.6.defrag is 4677400 (1142 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 452952 40
1 40 439168 452991 32
2 72 442912 439199 32
3 104 448544 442943 32
4 136 449472 448575 32
5 168 453920 449503 32
6 200 429625 453951 31
7 231 430714 429655 31
8 262 435296 430744 31
9 293 454842 435326 31
10 324 436410 454872 29
11 353 426832 436438 28
12 381 453651 426859 27
13 408 447705 453677 25
14 433 436510 447729 23
15 456 442421 436532 23
16 479 451098 442443 23
17 502 447082 451120 22
18 524 451647 447103 22
19 546 437950 451668 21
20 567 439293 437970 21
21 588 454464 439313 21
22 609 455776 454484 21
23 630 454624 455796 20
24 650 450592 454643 18
25 668 451136 450609 18
26 686 452305 451153 18
27 704 427088 452322 16
28 720 427568 427103 16
29 736 427952 427583 16
30 752 427984 427967 16
31 768 650240 427999 256
32 1024 634851 650495 69
33 1093 633344 634919 49 eof
/usr/bin/gimp-2.6.defrag: 34 extents found

Now that's what I call fragmented! Calling e4defrag again gives me
34->28, and now it moved _parts_:

..
24 781 478136 480191 56
25 837 475850 478191 54
26 891 1836751 475903 133
27 1024 1875978 1836883 118 eof
/usr/bin/gimp-2.6.defrag: 28 extents found

This looks really strange to me. Is this a problem with my particular
file system, or a bug?

Greetings, Stephan


2009-07-16 16:22:08

by Theodore Ts'o

Subject: Re: file allocation problem

On Thu, Jul 16, 2009 at 01:31:17PM +0200, Stephan Kulow wrote:
> Hi,
>
> I played around with ext4 online defrag on 2.6.31-rc3 and noticed a problem.
> The core is this:

Was your filesystem originally an ext3 filesystem which was converted
over to ext4? What features are currently enabled? (Sending a copy of
the output of "dumpe2fs -h /dev/XXX" would be helpful.)

If it is the case that this was originally an ext3 filesystem,
e4defrag does have some definite limitations that will prevent it from
doing a great job in such a case. I'm guessing that's what's going on
here.

> Now that's what I call fragmented! Calling e4defrag again gives me
> 34->28 and now it moved _parts_

I'm not sure what you mean by moving _parts_?

- Ted

2009-07-16 17:43:25

by Stephan Kulow

Subject: Re: file allocation problem

On Thursday 16 July 2009 17:58:32 Theodore Tso wrote:
> On Thu, Jul 16, 2009 at 01:31:17PM +0200, Stephan Kulow wrote:
> > Hi,
> >
> > I played around with ext4 online defrag on 2.6.31-rc3 and noticed a
> > problem. The core is this:
>
> Was your filesystem originally an ext3 filesystem which was converted
> over to ext4? What features are currently enabled (sending a copy of
Yes, it was converted quite some time ago.

> the output of "dumpe2fs -h /dev/XXX" would be helpful.)

Filesystem volume name: <none>
Last mounted on: /root
Filesystem UUID: ec4454af-a8db-42ad-9627-19c9c17a0220
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery extent sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 853440
Block count: 3409788
Reserved block count: 170489
Free blocks: 1156411
Free inodes: 615319
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 832
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8128
Inode blocks per group: 508
Filesystem created: Fri Dec 12 17:01:57 2008
Last mount time: Thu Jul 16 19:30:26 2009
Last write time: Thu Jul 16 19:30:26 2009
Mount count: 718
Maximum mount count: -1
Last checked: Thu Jan 29 15:01:57 2009
Check interval: 0 (<none>)
Lifetime writes: 5211 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 650850
Default directory hash: half_md4
Directory Hash Seed: a262693d-9659-4212-8e5b-5901140edff8
Journal backup: inode blocks
Journal size: 128M

>
> If it is the case that this was originally an ext3 filesystem,
> e4defrag does have some definite limitations that will prevent it from
> doing a great job in such a case. I'm guessing that's what's going on
> here.
My problem is not so much with what e4defrag does, but the fact that
a new file I create with cp(1) contains 34 extents.

>
> > Now that's what I call fragmented! Calling e4defrag again gives me
> > 34->28 and now it moved _parts_
>
> I'm not sure what you mean by moving _parts_?
It moved a couple of blocks from 6XXX to 10XXX and most extents stayed in the
area where they were (I guess close to the rest of /usr/bin?)

Greetings, Stephan

2009-07-17 01:12:27

by Theodore Ts'o

Subject: Re: file allocation problem

On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > If it is the case that this was originally an ext3 filesystem,
> > e4defrag does have some definite limitations that will prevent it from
> > doing a great job in such a case. I'm guessing that's what's going on
> > here.
> My problem is not so much with what e4defrag does, but the fact that
> a new file I create with cp(1) contains 34 extents.

Well, because your filesystem is still fragmented; you asked e4defrag
to defragment a single file. In fact, it wasn't able to do much --
the file previously had 25 extents, and the new file had 25 extents.
E4defrag is quite new, and still needs a lot of polishing; I'm not
sure it should have tried to swap files when the newly allocated file
has the same number of extents. This might be a case of changing a
">=" to ">" in code.

The reason why "cp" still created a file with 34 extents is because
the free space was still fragmented. As I said, e4defrag is quite
primitive; it doesn't know how to defrag free space; it simply tries
to reduce the number of extents for each file, on a file-by-file
basis.

The other problem is that an ext3 filesystem that has been converted
to ext4 does not have the flex_bg feature. This is a feature that,
when set when the file system is formatted, combines several block
groups into a bigger allocation group, a flex_bg. This helps avoid
fragmentation, especially for directories like /usr/bin which
typically have more than 128 megs (a single block group) worth of
files in them.

Using an ext3 filesystem format, the filesystem driver will first try
to find space in the home block group of the directory, and if there
is no space there, it will look in other block groups. With a freshly
formatted ext4 filesystem, the allocation group is the flex_bg, which
is much larger, and which gives us a better opportunity for allocating
contiguous blocks.
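To make the difference in search windows concrete, here is a toy sketch: an
ext3-style allocator starts from the file's home block group and spills over
one group at a time, while flex_bg treats a power-of-two batch of groups as
one bigger allocation group. The 16-group figure assumes a flex_bg size of
2**4; the numbers are purely illustrative, not the kernel's actual code.

```python
# Toy model of the flex_bg allocation window described above.
GROUPS_PER_FLEX = 16  # 2**4, i.e. log_groups_per_flex = 4 (assumption)

def flex_window(home_group):
    """Return the range of block groups searched together under flex_bg;
    an ext3-style allocator would start with just home_group itself."""
    start = (home_group // GROUPS_PER_FLEX) * GROUPS_PER_FLEX
    return range(start, start + GROUPS_PER_FLEX)

# A file homed in group 13 gets the whole first 16-group window to
# allocate from, instead of only group 13.
print(list(flex_window(13))[:3], len(flex_window(13)))
```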

I suspect we could do better with our allocator in this case; maybe
we should use a flex_bg to give the block group allocator a bigger
set of block groups to search. The inode tables will still not be
optimally laid out for flex_bg, but we might still be better off.
Or, if the block group is terribly fragmented, maybe we should have
the allocator find some other bg, even if it isn't the ideal block
group close to the directory. According to the dumpe2fs output, the
filesystem is only 66% or so full, so there are probably some
completely unused block groups we should be using instead. One of
the things that we have _not_ had time to do is optimize the block
allocator for heavily fragmented filesystems, especially for
filesystems that had been converted from ext3.

In any case, I don't think anything went _wrong_ per se, just that both
e4defrag and our block allocator are insufficiently smart to help
improve things for you given your current filesystem. A backup,
reformat, and restore will result in a filesystem that works far
better.

Out of curiosity, what sort of workload had the file system received?
It looks like the filesystem hadn't been created that long ago, so
it's a bit surprising it was so fragmented. Were you perhaps updating
your system (by doing a yum update or apt-get update) very
frequently?

- Ted

2009-07-17 04:33:37

by Andreas Dilger

Subject: Re: file allocation problem

On Jul 16, 2009 21:12 -0400, Theodore Ts'o wrote:
> On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > My problem is not so much with what e4defrag does, but the fact that
> > a new file I create with cp(1) contains 34 extents.
>
> The other problem is that an ext3 filesystem that has been converted
> to ext4 does not have the flex_bg feature. This is a feature that,
> when set when the file system is formatted, combines several block
> groups into a bigger allocation group, a flex_bg. This helps avoid
> fragmentation, especially for directories like /usr/bin which
> typically have more than 128 megs (a single block group) worth of
> files in them.

It seems quite odd to me that mballoc didn't find enough contiguous
free space for this relatively small file. It might be worthwhile
to look at (though not necessarily post) the output from the file
/sys/fs/ext4/{dev}/mb_groups (or "dumpe2fs" has equivalent data)
and see if there are groups with a lot of contiguous free space.
In the mb_groups file this would be numbers in the 2^{high} column.

I don't agree that flex_bg is necessary to have good block allocation,
since we do get about 125MB per group. Maybe mballoc is being
constrained to look at too few block groups in this case? Looking at
/sys/fs/ext4/{dev}/mb_history under the "groups" column will tell how
many groups were scanned to find that allocation, and the "original"
and "result" will show group/grpblock/[email protected] for recent writes.

$ dd if=/dev/zero of=/myth/tmp/foo bs=1M count=1

pid inode original goal result
4423 110359 3448/14336/[email protected] 1646/18944/[email protected] 1646/19456/[email protected]

You might also try to create a new temp directory elsewhere on the
filesystem, copy the file over to the temp directory, and then see
if it is less fragmented in the new directory.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-17 05:17:16

by Stephan Kulow

Subject: Re: file allocation problem

On Friday 17 July 2009 03:12:19 you wrote:
> On Thu, Jul 16, 2009 at 07:43:21PM +0200, Stephan Kulow wrote:
> > > If it is the case that this was originally an ext3 filesystem,
> > > e4defrag does have some definite limitations that will prevent it from
> > > doing a great job in such a case. I'm guessing that's what's going on
> > > here.
> >
> > My problem is not so much with what e4defrag does, but the fact that
> > a new file I create with cp(1) contains 34 extents.
>
Hi,
>
> The reason why "cp" still created a file with 34 extents is because
> the free space was still fragmented. As I said, e4defrag is quite
> primitive; it doesn't know how to defrag free space; it simply tries
> to reduce the number of extents for each file, on a file-by-file
> basis.
Well, is there a tool to check the overall state of the file system? I can't
really believe it's 1010101010, but it's hard to say without a picture :)
>
> The other problem is that an ext3 filesystem that has been converted
> to ext4 does not have the flex_bg feature. This is a feature that,
> when set when the file system is formatted, combines several block
> groups into a bigger allocation group, a flex_bg. This helps avoid
> fragmentation, especially for directories like /usr/bin which
> typically have more than 128 megs (a single block group) worth of
> files in them.

Oh, I enabled flex_bg after you asked, and rebooted to get an e2fsck run -
and I still get 34 extents for my gimp-2.6.defrag. From what I understand,
this doesn't help after the fact, but then again how am I supposed to fix
my file system if even new files are created fragmented?

> In any case, I don't think anything went _wrong_ per se, just that both
> e4defrag and our block allocator are insufficiently smart to help
> improve things for you given your current filesystem. A backup,
> reformat, and restore will result in a filesystem that works far
> better.
I believe that, but my hope for online defrag was not having to rely on this
'80s defrag method :)
>
> Out of curiosity, what sort of workload had the file system received?
> It looks like the filesystem hadn't been created that long ago, so
> it's a bit surprising it was so fragmented. Were you perhaps updating
> your system (by doing a yum update or apt-get update) very
> frequently?
Yes, that's what I'm doing. I'm updating about every file in this file system
every second day by means of rpm packages (openSUSE calls it Factory; you
will know it as Rawhide).

Greetings, Stephan

2009-07-17 05:31:55

by Stephan Kulow

Subject: Re: file allocation problem

On Friday 17 July 2009 06:32:42 Andreas Dilger wrote:

Hi,

> It seems quite odd to me that mballoc didn't find enough contiguous
> free space for this relatively small file. It might be worthwhile
> to look at (though not necessarily post) the output from the file
> /sys/fs/ext4/{dev}/mb_groups (or "dumpe2fs" has equivalent data)
> and see if there are groups with a lot of contiguous free space.
> In the mb_groups file this would be numbers in the 2^{high} column.

I'm not sure what you mean by "a lot", so I pasted the full file (which
happens to be under /proc/fs here): http://ktown.kde.org/~coolo/sda6
>
> I don't agree that flex_bg is necessary to have good block allocation,
> since we do get about 125MB per group. Maybe mballoc is being
> constrained to look at too few block groups in this case? Looking at
> /sys/fs/ext4/{dev}/mb_history under the "groups" column will tell how
> many groups were scanned to find that allocation, and the "original"
> and "result" will show group/grpblock/[email protected] for recent writes.
>
> $ dd if=/dev/zero of=/myth/tmp/foo bs=1M count=1
>
> pid inode original goal result
> 4423 110359 3448/14336/[email protected] 1646/18944/[email protected] 1646/19456/[email protected]
>
> You might also try to create a new temp directory elsewhere on the
> filesystem, copy the file over to the temp directory, and then see
> if it is less fragmented in the new directory.
>
cp /usr/bin/gimp-2.6{,.defrag}:

31548 106916 13/0/[email protected] 13/0/[email protected] 13/24152/[email protected]
201 1 2 1056 0 0
31548 106916 13/24211/[email protected] 13/24211/[email protected] 13/26192/[email protected]
201 1 2 1568 0 0
31548 106916 13/26233/[email protected] 13/26233/[email protected] 13/21777/[email protected]
201 1 2 1568 0 0
31548 106916 13/21811/[email protected] 13/21811/[email protected] 13/6688/[email protected]
201 1 2 1568 0 0
31548 106916 13/6720/[email protected] 13/6720/[email protected] 13/10944/[email protected]
201 1 2 1568 0 0
31548 106916 13/6720/[email protected] 13/6720/[email protected] 13/513/[email protected]
1 1 1 1024 0 0
31548 106916 13/10976/[email protected] 13/10976/[email protected] 13/16896/[email protected]
201 1 2 1568 0 0
31548 106916 13/16928/[email protected] 13/16928/[email protected] 13/12564/[email protected]
201 1 2 1568 0 0
31548 106916 13/12595/[email protected] 13/12595/[email protected] 13/12724/[email protected]
201 1 2 1568 0 0
31548 106916 13/12755/[email protected] 13/12755/[email protected] 13/31700/[email protected]
201 1 2 1568 0 0
31548 106916 13/31731/[email protected] 13/31731/[email protected] 13/18103/[email protected]
201 1 2 1568 0 0
31548 106916 13/18133/[email protected] 13/18133/[email protected] 13/21691/[email protected]
201 1 2 1568 0 0
31548 106916 13/21721/[email protected] 13/21721/[email protected] 13/25881/[email protected]
201 1 2 1568 0 0
31548 106916 13/25911/[email protected] 13/25911/[email protected] 13/22196/[email protected]
201 1 2 1568 0 0
31548 106916 13/22225/[email protected] 13/22225/[email protected] 13/31380/[email protected]
201 1 2 1568 0 0
31548 106916 13/31409/[email protected] 13/31409/[email protected] 13/12954/[email protected]
201 2 2 1568 0 0
31548 106916 13/12981/[email protected] 13/12981/[email protected] 13/18176/[email protected]
201 2 2 1568 0 0
31548 106916 13/18203/[email protected] 13/18203/[email protected] 13/15161/[email protected]
201 2 2 1568 0 0
31548 106916 13/15187/[email protected] 13/15187/[email protected] 13/17625/[email protected]
201 2 2 1568 0 0
31548 106916 13/17651/[email protected] 13/17651/[email protected] 13/19936/[email protected]
201 2 2 1568 0 0
31548 106916 13/19962/[email protected] 13/19962/[email protected] 13/20247/[email protected]
201 2 2 1568 0 0
31548 106916 13/20273/[email protected] 13/20273/[email protected] 13/23515/[email protected]
201 2 2 1568 0 0
31548 106916 13/23541/[email protected] 13/23541/[email protected] 13/9949/[email protected]
201 2 2 1568 0 0
31548 106916 13/9974/[email protected] 13/9974/[email protected] 13/19832/[email protected]
201 2 2 1568 0 0
31548 106916 13/19857/[email protected] 13/19857/[email protected] 13/29244/[email protected]
201 2 2 1568 0 0
31548 106916 13/29269/[email protected] 13/29269/[email protected] 13/1344/[email protected]
201 2 2 1568 0 0
31548 106916 13/1368/[email protected] 13/1368/[email protected] 13/11776/[email protected]
201 2 2 1568 0 0
31548 106916 13/11799/[email protected] 13/11799/[email protected] 14/3104/[email protected]
201 2 2 1568 0 0
31548 106916 14/3130/[email protected] 14/3130/[email protected] 14/9984/[email protected]
201 1 2 1568 0 0
31548 106916 14/10034/[email protected] 14/10034/[email protected] 14/11264/[email protected]
201 1 2 1568 0 0
31548 106916 14/11310/[email protected] 14/11310/[email protected] 58/1024/[email protected]
11 1 1 1568 125 128
31548 106916 58/1149/[email protected] 58/1149/[email protected]
58/17408/[email protected] 201 2 2 1568 0 0

filefrag: 59 extents.

cp /usr/bin/gimp-2.6 /tmp/nd/

25449 650578 80/0/[email protected] 80/0/[email protected] 80/589/[email protected]
4 1 1 0 0 0

Filesystem type is: ef53
File size of /tmp/nd/gimp-2.6 is 4677400 (1142 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 2638592 588
1 588 2628896 2639179 436
2 1024 2637846 2629331 118 eof
/tmp/nd/gimp-2.6: 3 extents found

Greetings, Stephan

2009-07-17 14:26:33

by Theodore Ts'o

Subject: Re: file allocation problem

On Fri, Jul 17, 2009 at 07:17:12AM +0200, Stephan Kulow wrote:
> Well, is there a tool to check the overall state of the file system? I can't
> really believe it's 1010101010, but it's hard to say without a picture :)

Well, you can check the fragmentation of the free space by using
dumpe2fs and looking at the free blocks in each block group.
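dumpe2fs prints each group's free blocks as ranges like "5120-8191" or
single block numbers. A small sketch (the ranges here are made up for
illustration) that turns such a list into run lengths, so the number of
free runs and the largest contiguous one are easy to see:

```python
# Summarize free-space fragmentation from dumpe2fs-style free block
# ranges: "start-end" pairs or single block numbers, comma-separated.
def free_runs(ranges_text):
    runs = []
    for tok in ranges_text.replace(",", " ").split():
        if "-" in tok:
            lo, hi = map(int, tok.split("-"))
            runs.append(hi - lo + 1)   # length of a contiguous run
        else:
            runs.append(1)             # a lone free block
    return runs

runs = free_runs("4353-4355, 4896, 5120-8191")
# Number of free runs and the largest contiguous run, in blocks.
print(len(runs), max(runs))
```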

> > The other problem is that an ext3 filesystem that has been converted
> > to ext4 does not have the flex_bg feature. This is a feature that,
> > when set when the file system is formatted, combines several block
> > groups into a bigger allocation group, a flex_bg. This helps avoid
> > fragmentation, especially for directories like /usr/bin which
> > typically have more than 128 megs (a single block group) worth of
> > files in them.
>
> Oh, I enabled flex_bg after you asked, and rebooted to get an e2fsck
> run - and I still get 34 extents for my gimp-2.6.defrag. From what I
> understand, this doesn't help after the fact, but then again how am
> I supposed to fix my file system if even new files are created
> fragmented?

Well, it's actually not enough to enable the flex_bg filesystem feature;
you also need to set the flex_bg size, like this:

debugfs -w /dev/XXX
debugfs: ssv log_groups_per_flex 4
debugfs: quit

(And no, this isn't something which we've done a lot of testing on.)

And this isn't necessarily going to help; if the 16 block groups
(2**4) in the flex_bg for the /usr/bin directory are all badly
fragmented, then when you create new files in /usr/bin, they will
still be fragmented.

> > In any case, I don't think anything went _wrong_ per se, just that both
> > e4defrag and our block allocator are insufficiently smart to help
> > improve things for you given your current filesystem. A backup,
> > reformat, and restore will result in a filesystem that works far
> > better.
>
> I believe that, but my hope for online defrag was not having to rely on this
> '80s defrag method :)

Yeah, sorry, online defrag is a very new feature. It will hopefully
get better, but it's a matter of resources. Ultimately, though, the
problem is that the ext3 allocation algorithms are very different (and
far more primitive) than the ext4 allocation algorithms. So undoing
the ext3 allocation algorithm decisions is going to be non-trivial,
and even if we can eventually get e4defrag to the point where it can
do this on the whole filesystem, I suspect backup/reformat/restore
will almost always be faster.

> > Out of curiosity, what sort of workload had the file system received?
> > It looks like the filesystem hadn't been created that long ago, so
> > it's a bit surprising it was so fragmented. Were you perhaps updating
> > your system (by doing a yum update or apt-get update) very
> > frequently?
>
> Yes, that's what I'm doing. I'm updating about every file in this
> file system every second day by means of rpm packages (openSUSE
> calls it Factory; you will know it as Rawhide).

Unfortunately, constantly updating every single file on a daily basis
is a very effective way of seriously aging a filesystem. The ext4
allocator tries to keep files aligned on power of two boundaries,
which tends to help this a lot (although this means that dumpe2fs -h
will show a bunch of holes that makes the free space look more
fragmented than it really is), but the ext3 allocator doesn't have any
such smarts in it.

- Ted

2009-07-17 18:02:29

by Stephan Kulow

Subject: Re: file allocation problem

On Friday 17 July 2009 16:26:28 Theodore Tso wrote:
> And this isn't necessarily going to help; if the 16 block groups
> (2**4) in the flex_bg for the /usr/bin directory are all badly
> fragmented, then when you create new files in /usr/bin, they will
> still be fragmented.
Yeah, but even the file in /tmp/nd got 3 extents. My file is 1142 blocks,
and my mb_groups says 2**9 is the highest possible value. So I guess I will
indeed try to create the file system from scratch to test the allocator for
real.
>
> > > In any case, I don't think anything went _wrong_ per se, just that both
> > > e4defrag and our block allocator are insufficiently smart to help
> > > improve things for you given your current filesystem. A backup,
> > > reformat, and restore will result in a filesystem that works far
> > > better.
> >
> > I believe that, but my hope for online defrag was not having to rely on
> > this '80s defrag method :)
>
> Yeah, sorry, online defrag is a very new feature. It will hopefully
> get better, but it's matter of resources. Ultimately, though, the
> problem is that the ext3 allocation algorithms are very different (and
> far more primitive) than the ext4 allocation algorithms. So undoing
> the ext3 allocation algorithm decisions is going to be non-trivial,
> and even if we can eventually get e4defrag to the point where it can
> do this on the whole filesystem, I suspect backup/reformat/restore
> will almost always be faster.
I don't have any kind of experience in that field, but would it be possible
to allocate a big file that would get all the free blocks, and then move
the extents of one group into it, basically freeing all blocks of one group
so it can be used purely by ext4 allocation? Or even go as far as packing
the blocks of every group. As far as I can see, there is no way with the
current ioctl interface to achieve that once your file system is fragmented
enough, because the allocator will always create new files fragmented, and
the ioctl can only move extents from one fragmented file to another
fragmented one.

And yes, backup/restore might be faster, but it's also far more
disruptive than leaving a defrag running overnight.

>
> > > Out of curiosity, what sort of workload had the file system received?
> > > It looks like the filesystem hadn't been created that long ago, so
> > > it's bit surprising it was so fragmented. Were you perhaps updating
> > > your system (by doing a yum update or apt-get update) very frequently,
> > > perhaps?
> >
> > Yes, that's what I'm doing. I'm updating about every file in this
> > file system every second day by means of rpm packages (openSUSE
> > calls it Factory; you will know it as Rawhide).
>
> Unfortunately, constantly updating every single file on a daily basis
> is a very effective way of seriously aging a filesystem. The ext4
Of course it is, guess why I'm so interested in having it :)

> allocator tries to keep files aligned on power of two boundaries,
> which tends to help this a lot (although this means that dumpe2fs -h
> will show a bunch of holes that makes the free space look more
> fragmented than it really is), but the ext3 allocator doesn't have any
> such smarts on it.
But there is nothing packing the blocks if the groups get full, so these
holes will always cause fragmentation once the file system gets full, right?

So I guess online defragmentation first needs to pretend to do an online
resize so it can use the gained free space. Now I have something to test... :)

Greetings, Stephan

2009-07-17 21:14:56

by Andreas Dilger

Subject: Re: file allocation problem

On Jul 17, 2009 20:02 +0200, Stephan Kulow wrote:
> On Friday 17 July 2009 16:26:28 Theodore Tso wrote:
> > And this isn't necessarily going to help; if the 16 block groups
> > (2**4) in the flex_bg for the /usr/bin directory are all badly
> > fragmented, then when you create new files in /usr/bin, they will
> > still be fragmented.
>
> Yeah, but even the file in /tmp/nd got 3 extents. my file is 1142 blocks
> and my mb_groups says 2**9 is the highest possible value. So I guess I will
> indeed try to create the file system from scratch to test the allocator for
> real.

The defrag code needs to become smarter, so that it finds small files
in the middle of free space and migrates those to fit into small gaps.
That will allow larger files to be defragged once there are large
chunks of free space.

> > allocator tries to keep files aligned on power of two boundaries,
> > which tends to help this a lot (although this means that dumpe2fs -h
> > will show a bunch of holes that makes the free space look more
> > fragmented than it really is), but the ext3 allocator doesn't have any
> > such smarts on it.
> But there is nothing packing the blocks if the groups get full, so these
> holes will always cause fragmentation once the file system gets full, right?


Well, this isn't quite correct. The mballoc code only tries to allocate
"large" files on power-of-two boundaries, where large is 64kB by default,
but is tunable in /proc. For smaller files it tries to pack them together
into the same block, or into gaps that are exactly the size of the file.
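A toy sketch of the policy as described above: files at or above the "large
file" threshold get a goal aligned to a power-of-two boundary, smaller
files go into an exact-fit gap when one is available. The threshold value
and the gap lookup are simplifications for illustration, not the real
mballoc code.

```python
# Toy model of the size-dependent allocation policy described above.
LARGE_FILE_BLOCKS = 16  # 64kB at a 4kB block size (the default threshold)

def choose_goal(size_blocks, next_free, exact_gaps):
    if size_blocks >= LARGE_FILE_BLOCKS:
        align = 1
        while align < size_blocks:          # smallest power of two >= size
            align *= 2
        return -(-next_free // align) * align  # round goal up to boundary
    # Small file: prefer a gap that is exactly the size of the file.
    return exact_gaps.get(size_blocks, next_free)

print(choose_goal(100, 1000, {}))      # large file: goal on a 128-block boundary
print(choose_goal(3, 1000, {3: 776}))  # small file: exact-fit gap wins
```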

> So I guess online defragmentation first needs to pretend to do an online
> resize so it can use the gained free space. Now I have something to test... :)

Yes, that would give you some good free space at the end of the filesystem.
Then find the largest files in the filesystem, migrate them there, then
defrag the smaller files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-18 21:16:43

by Stephan Kulow

Subject: Re: file allocation problem

On Friday 17 July 2009 23:14:44 Andreas Dilger wrote:
> > Yeah, but even the file in /tmp/nd got 3 extents. my file is 1142 blocks
> > and my mb_groups says 2**9 is the highest possible value. So I guess I
> > will indeed try to create the file system from scratch to test the
> > allocator for real.
>
> The defrag code needs to become smarter, so that it finds small files
> in the middle of free space and migrates those to fit into small gaps.
> That will allow larger files to be defragged once there are large
> chunks of free space.

Is there a way for user space to hint the allocator to fill these gaps? I
don't see any obvious one. Relying on the allocator not to make matters worse
might be enough, but it doesn't sound ideal. Unless something urgent comes up,
I might actually continue experimenting next week :)

My resize2fs defrag actually worked pretty well, but then again I did it on
an offline copy, and it won't work that way online.

Greetings, Stephan

2009-07-19 22:45:11

by Ron Johnson

Subject: Re: file allocation problem

On 2009-07-17 16:14, Andreas Dilger wrote:
[snip]
>
> Well, this isn't quite correct. The mballoc code only tries to allocate
> "large" files on power-of-two boundaries, where large is 64kB by default,
> but is tunable in /proc. For smaller files it tries to pack them together
> into the same block, or into gaps that are exactly the size of the file.

How does ext4 act on growing files? I.e., creating a tarball that,
obviously, starts at 0 bytes and then grows to multi-GB?

--
Scooty Puff, Sr
The Doom-Bringer

2009-07-20 21:19:00

by Andreas Dilger

Subject: Re: file allocation problem

On Jul 19, 2009 17:45 -0500, Ron Johnson wrote:
> On 2009-07-17 16:14, Andreas Dilger wrote:
>> Well, this isn't quite correct. The mballoc code only tries to allocate
>> "large" files on power-of-two boundaries, where large is 64kB by default,
>> but is tunable in /proc. For smaller files it tries to pack them together
>> into the same block, or into gaps that are exactly the size of the file.
>
> How does ext4 act on growing files? I.e., creating a tarball that,
> obviously, starts at 0 bytes and then grows to multi-GB?

ext4 has "delayed allocation" (delalloc) so that no blocks are allocated
during initial file writes, but rather only when RAM is running short or
when the data has been sitting around for a while.

Normally, if you are writing to a file with _most_ applications, the IO
rate is high enough that within the 5-30s memory flush interval the
size of the file has grown large enough to give the allocator an idea
of whether the file will be small or large.
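When an application does know the final size up front, it can sidestep that
guessing entirely by preallocating the range explicitly. A minimal sketch
using POSIX fallocate (the helper function is illustrative, not part of any
tool in this thread):

```python
import os
import tempfile

def preallocate(nbytes):
    """Reserve nbytes up front, so the filesystem can pick contiguous
    space immediately instead of inferring the file's eventual size
    from its write rate under delayed allocation."""
    fd, path = tempfile.mkstemp()
    try:
        os.posix_fallocate(fd, 0, nbytes)  # reserve the full range at offset 0
        return os.fstat(fd).st_size        # size now reflects the reservation
    finally:
        os.close(fd)
        os.remove(path)

print(preallocate(1 << 20))
```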

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.