2008-08-10 17:30:19

by Theodore Ts'o

[permalink] [raw]
Subject: Bug in delayed allocation: really bad block layouts!


Thanks to a comment on a recent blog entry of mine[1], I think I've
uncovered a rather embarrassing bug in mballoc.

[1]http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times

I created a fresh 5 gig ext4 filesystem, and then populated it using a
single-threaded tar command:

(cd /usr ; tar cf - bin lib) | (cd /mnt; tar xfp -)

I then unmounted the filesystem, and ran an instrumented e2fsck looking
for fragmented files, and found a whole series of fragmented files with
the following pattern:

Inode 122: (0):58399, (1-3):43703-43705
Inode 124: (0):58400, (1):43707
Inode 127: (0):58401, (1-7):43709-43715
Inode 128: (0):58402, (1-2):43716-43717
Inode 129: (0):58403, (1-3):43718-43720
Inode 133: (0):58404, (1-5):43722-43726
Inode 135: (0):58405, (1):43728
Inode 136: (0):58406, (1-3):43729-43731
Inode 141: (0-1):58407-58408, (2-6):43734-43738
Inode 143: (0):58409, (1):43740
Inode 144: (0):58410, (1-5):43741-43745
Inode 146: (0):58411, (1):43746

Inode Pathname
122 /bin/smproxy
124 /bin/debconf-updatepo
127 /bin/iostat
128 /bin/xeyes
129 /bin/pbmtog3
133 /bin/join-dctrl
135 /bin/dpkg-name
136 /bin/lockfile
141 /bin/id
143 /bin/ppmcolormask
144 /bin/tty
146 /bin/colrm

If I do this test with -o nodelalloc, I get a slightly different
pattern. Now I get a whole series of discontiguous regions after the
first 15 blocks:

inode last_block pblk lblk len
=============================================
2932: was 47087 actual extent 41894 (15, 3)...
3512: was 47829 actual extent 41908 (15, 1)...
3535: was 47904 actual extent 41912 (15, 37)...
3549: was 47977 actual extent 41949 (15, 4)...
3637: was 48225 actual extent 41959 (15, 6)...
3641: was 48245 actual extent 41965 (15, 13)...
3675: was 48418 actual extent 41978 (15, 1)...
3675: was 41979 actual extent 48640 (16, 15)...
3714: was 41984 actual extent 48656 (1, 2)...
3954: was 49449 actual extent 48660 (15, 16)...
3999: was 49569 actual extent 48679 (15, 2)...
4010: was 49644 actual extent 48681 (15, 1)...
4143: was 49943 actual extent 48687 (15, 10)...
4202: was 50036 actual extent 48699 (15, 6)...

So all of the discontiguities start at logical block #15, and when I
examine the inodes, what we find is one extent for blocks 0-14, ending
at the last_block number, and then the second extent which extends for
the rest of the file, starting somewhere else earlier in the block
group.
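
One plausible reading of the magic number 15 (an editorial sketch, not
a confirmed diagnosis from the thread): mballoc sends allocations for
small files to a shared per-CPU locality-group preallocation, and only
switches to per-inode stream allocation once a file reaches
s_mb_stream_request blocks, 16 by default. Condensed from the
2.6.26-era ext4_mb_group_or_file() (the function Aneesh's second patch
later in this thread modifies); kernel context assumed, not compilable
on its own:

	/*
	 * Sketch of the small-vs-large decision in ext4_mb_group_or_file().
	 * Without delalloc, each write_begin() allocates one block at a
	 * time, so the first 15 one-block allocations see size < 16 and
	 * draw from the shared locality group; the allocation that pushes
	 * the file to 16 blocks switches pools, which would produce
	 * exactly this split at logical block 15.
	 */
	size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
	isize = i_size_read(ac->ac_inode) >> bsbits;
	if (size < isize)
		size = isize;

	if (size >= sbi->s_mb_stream_request)
		return;		/* large file: per-inode (stream) allocation */

	/* small file: shared per-CPU locality group preallocation */
	ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
	put_cpu();
	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;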

So a very similar issue, even without delayed allocation. That leads me
to suspect the problem is somewhere inside mballoc. Aneesh, Andreas,
Alex --- I think you folks are the most familiar with the mballoc code;
does someone have time to take a look? This is clearly a bug, and clearly
something we want to fix. If we can't get an optimal layout with one
single-threaded process writing to the filesystem, what hope do we have
of getting it right on more realistic benchmarks or real-world usage?

- Ted


2008-08-10 17:54:15

by Eric Sandeen

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

Theodore Ts'o wrote:
> Thanks to a comment on a recent blog entry of mine[1], I think I've
> uncovered a rather embarrassing bug in mballoc.
>
> [1]http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times
>
> I created a fresh 5 gig ext4 filesystem, and then populated it using a
> single-threaded tar command:
>
> (cd /usr ; tar cf - bin lib) | (cd /mnt; tar xfp -)
>
> I then unmounted the filesystem, and ran an instrumented e2fsck looking
> for fragmented files, and found a whole series of fragmented files with
> the following pattern:

Hm, last year I reported "delalloc fragmenting files?" and Alex sent
some patches; I wonder if they all got integrated. I guess I should
re-run that simple large-file dd test and see how we're doing now...

ok, on a 1G single-threaded write it looks sane:

Discontinuity: Block 53248 is at 100352 (was 98303)
Discontinuity: Block 116736 is at 165888 (was 163839)
Discontinuity: Block 180224 is at 231424 (was 229375)
Discontinuity: Block 243712 is at 296960 (was 294911)

so I suppose this is not the same issue ...
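
For reference, these jumps look like ordinary block-group boundary
crossings rather than allocator bugs. A quick arithmetic check (an
illustrative userspace program, not from the thread; it assumes 32,768
blocks per block group and reads each "was" value as the last block of
the previous extent):

#include <stdio.h>

int main(void)
{
	/* { logical block, new physical start, end of previous extent } */
	unsigned long c[][3] = {
		{  53248, 100352,  98303 },
		{ 116736, 165888, 163839 },
		{ 180224, 231424, 229375 },
		{ 243712, 296960, 294911 },
	};
	const unsigned long bpg = 32768;	/* blocks per group */

	for (int i = 0; i < 4; i++) {
		unsigned long end = c[i][2];	/* last block of prev extent */
		printf("extent ends at %lu: %s group boundary, skips %lu blocks\n",
		       end, (end + 1) % bpg ? "not at" : "at",
		       c[i][1] - end - 1);
	}
	return 0;
}

Every discontinuity sits exactly at a group boundary and skips 2048
blocks, consistent with stepping over the next group's metadata.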

Hm, and I tried writing out 10 files in order as a simple test, but
umount/remount brought me back many 0-byte files; I need to update my
patchset, I guess. :)

-Eric

2008-08-10 18:22:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Sun, Aug 10, 2008 at 12:54:00PM -0500, Eric Sandeen wrote:
>
> Hm, and I tried writing out 10 files in order as a simple test, but
> umount/remount brought me back many 0-byte files; I need to update my
> patchset, I guess. :)
>

One of the questions in my mind is whether this is a regression
triggered by some of our most recent patches.... since I only
have 2.2% of files reported as fragmented by e2fsck, and if this problem
had always been there, I would have expected a much higher
fragmentation number. So if you have some older kernels, you might
want to see if you can replicate the problem. I've since found that
just doing a copy via "(tar -cf - -C / usr/include) | (tar -C /mnt -xf -)"
is sufficient to see the problem. Just add a "sync; sleep 5" before
the umount. :-)

- Ted

2008-08-10 18:28:31

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

P.S. Please excuse the subject line; when I was first writing up the
e-mail, I thought the issue was caused by a bug in how delayed
allocation under memory pressure interacted with mballoc.
Subsequent tests showed my initial theory was incorrect, and I fixed
the text of my e-mail, but not the subject line....

- Ted

2008-08-10 18:54:54

by Eric Sandeen

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

Theodore Tso wrote:
> On Sun, Aug 10, 2008 at 12:54:00PM -0500, Eric Sandeen wrote:
>> Hm, and I tried writing out 10 files in order as a simple test, but
>> umount/remount brought me back many 0-byte files; I need to update my
>> patchset, I guess. :)
>>
>
> One of the questions in my mind is whether this is a regression
> triggered by some of our most recent patches.... since I only
> have 2.2% of files reported as fragmented by e2fsck, and if this problem
> had always been there, I would have expected a much higher
> fragmentation number. So if you have some older kernels, you might
> want to see if you can replicate the problem. I've since found that
> just doing a copy via "(tar -cf - -C / usr/include) | (tar -C /mnt -xf -)"
> is sufficient to see the problem. Just add a "sync; sleep 5" before
> the umount. :-)

It may be; I tried this and then a quick filefrag run:

# filefrag usr/include/*.h | grep -v extents | awk -F : '{print $2}' |
sort | uniq -c
146 1 extent found

so everything came out contiguous.

This was with 2.6.27-0.186.rc0.git15.fc10.x86_64

-Eric

2008-08-10 20:04:04

by Eric Sandeen

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

Eric Sandeen wrote:

> It may be; I tried this and then a quick filefrag run:
>
> # filefrag usr/include/*.h | grep -v extents | awk -F : '{print $2}' |
> sort | uniq -c
> 146 1 extent found
>
> so everything came out contiguous.
>
> This was with 2.6.27-0.186.rc0.git15.fc10.x86_64

Or better:

# find . -name \*.h | xargs filefrag | grep -v extents | awk -F :
'{print $2}' | sort | uniq -c
2460 1 extent found

so *really* everything.

-Eric

2008-08-11 02:49:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Sun, Aug 10, 2008 at 03:04:00PM -0500, Eric Sandeen wrote:
>
> # find . -name \*.h | xargs filefrag | grep -v extents | awk -F :
> '{print $2}' | sort | uniq -c
> 2460 1 extent found
>
> so *really* everything.
>

Um, no, because you filtered out "extents", and filefrag will print "2
extents found", so that it gets the singular/plural correct. One of
the things I've been meaning to do is to clean up the output of
filefrag so that it's a bit easier to parse. It's really stupid for
it to print "file is in extents format" over and over again...

I've done some research on my laptop's filesystem, and I've found
examples of files with the characteristic horrendous layout going back
to at least July 7th. (See below)

This takes us back to about the time when we were stabilizing the
patches that went into 2.6.26, and before the most recent set of
patches to fix the journal credits and delayed allocation problems.

Interestingly, I'm not seeing the bad layouts for files created June
30th, which is when I switched my laptop's root filesystem to ext4.
So that gives us a pretty good time bound on the patches that might be
involved. (This is a great demonstration of why it's a real win to
manage the ext4 patch queue using git, since we can see which patches
were added during that week or so.)

In other news, I've added an enhancement to e2fsck which makes it much
easier to spot these sorts of problems. On the latest e2fsprogs git
branch, you'll be able to do "e2fsck -E fragcheck /dev/thunk/root" and
get a report which makes it really easy to see the problem.
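
The shape of such an instrumented check (an editorial sketch; e2fsck's
real pass walks the block map through its own iteration API) is
roughly: flag a fragment whenever a block is not physically adjacent to
its predecessor, which is exactly the "expecting X actual Y" condition
in the reports later in this thread:

#include <stdio.h>

static unsigned long expect;

/* Called once per mapped logical block of an inode, in logical order;
 * reports a discontinuity when pblk is not the successor of the
 * previous physical block. */
static void check_block(unsigned long ino, unsigned long lblk,
			unsigned long pblk)
{
	if (lblk && pblk != expect)
		printf("%lu: expecting %lu actual %lu log %lu\n",
		       ino, expect, pblk, lblk);
	expect = pblk + 1;
}

int main(void)
{
	/* inode 122 from the first report: (0):58399, (1-3):43703-43705 */
	unsigned long blocks[] = { 58399, 43703, 43704, 43705 };

	for (unsigned long l = 0; l < 4; l++)
		check_block(122, l, blocks[l]);
	return 0;
}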

- Ted

debugfs: stat <4545760>
Inode: 4545760 Type: regular Mode: 0755 Flags: 0x80000
Generation: 278636620 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 6669
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 16
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x487282ce:50c55b54 -- Mon Jul 7 16:55:42 2008
atime: 0x487282cd:00000000 -- Mon Jul 7 16:55:41 2008
mtime: 0x485d754f:00000000 -- Sat Jun 21 17:40:31 2008
crtime: 0x487282ce:50c55b54 -- Mon Jul 7 16:55:42 2008
Size of extra inode fields: 28
BLOCKS:
(0):19472457, (1):19816447
TOTAL: 2

debugfs: stat <4545768>
Inode: 4545768 Type: regular Mode: 0755 Flags: 0x80000
Generation: 278636628 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 7655
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 16
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x487282ce:5a3cff0c -- Mon Jul 7 16:55:42 2008
atime: 0x487282cd:00000000 -- Mon Jul 7 16:55:41 2008
mtime: 0x485d754f:00000000 -- Sat Jun 21 17:40:31 2008
crtime: 0x487282ce:5a3cff0c -- Mon Jul 7 16:55:42 2008
Size of extra inode fields: 28
BLOCKS:
(0):19472458, (1):20849215
TOTAL: 2

debugfs: stat <4545769>
Inode: 4545769 Type: regular Mode: 0755 Flags: 0x80000
Generation: 278636629 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 17452
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 40
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x487282ce:5a3cff0c -- Mon Jul 7 16:55:42 2008
atime: 0x487282cd:00000000 -- Mon Jul 7 16:55:41 2008
mtime: 0x485d754f:00000000 -- Sat Jun 21 17:40:31 2008
crtime: 0x487282ce:5a3cff0c -- Mon Jul 7 16:55:42 2008
Size of extra inode fields: 28
BLOCKS:
(0):19472459, (1-4):20849216-20849219
TOTAL: 5

debugfs: ncheck 4545760 4545768 4545769
Inode Pathname
4545760 /usr/projects/sid-root/usr/bin/piconv
4545768 /usr/projects/sid-root/usr/bin/shasum
4545769 /usr/projects/sid-root/usr/bin/splain


----------------------Example of files created June 30th


debugfs: stat <18246>
Inode: 18246 Type: regular Mode: 0755 Flags: 0x80000
Generation: 1457985356 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 24448
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 48
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4868cc68:d7f10208 -- Mon Jun 30 08:07:04 2008
atime: 0x4868cbfa:8ac0ef60 -- Mon Jun 30 08:05:14 2008
mtime: 0x47f5cddd:00000000 -- Fri Apr 4 02:42:37 2008
crtime: 0x4868cc68:d724eee4 -- Mon Jun 30 08:07:04 2008
Size of extra inode fields: 28
BLOCKS:
(0-5):351687-351692
TOTAL: 6

debugfs: stat <18247>
Inode: 18247 Type: regular Mode: 0755 Flags: 0x80000
Generation: 1457985357 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 1297
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4868cc68:d7f10208 -- Mon Jun 30 08:07:04 2008
atime: 0x4868cbfa:8ac0ef60 -- Mon Jun 30 08:05:14 2008
mtime: 0x47e38ec1:00000000 -- Fri Mar 21 06:32:33 2008
crtime: 0x4868cc68:d7f10208 -- Mon Jun 30 08:07:04 2008
Size of extra inode fields: 28
BLOCKS:
(0):351693
TOTAL: 1

debugfs: stat <18248>
Inode: 18248 Type: regular Mode: 0755 Flags: 0x80000
Generation: 1457985358 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 28492
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 56
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4868cc68:d7f10208 -- Mon Jun 30 08:07:04 2008
atime: 0x4868cbfa:8ac0ef60 -- Mon Jun 30 08:05:14 2008
mtime: 0x47398252:00000000 -- Tue Nov 13 05:54:10 2007
crtime: 0x4868cc68:d7f10208 -- Mon Jun 30 08:07:04 2008
Size of extra inode fields: 28
BLOCKS:
(0-6):351694-351700
TOTAL: 7

debugfs: ncheck 18246 18247 18248
Inode Pathname
18246 /bin/uname
18247 /bin/bzmore
18248 /bin/mt-gnu

2008-08-11 05:36:24

by Eric Sandeen

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

Theodore Tso wrote:
> On Sun, Aug 10, 2008 at 03:04:00PM -0500, Eric Sandeen wrote:
>> # find . -name \*.h | xargs filefrag | grep -v extents | awk -F :
>> '{print $2}' | sort | uniq -c
>> 2460 1 extent found
>>
>> so *really* everything.
>>
>
> Um, no, because you filtered out "extents", and filefrag will print "2
> extents found", so that it gets the singular/plural correct.

Oh, heh, whoops! :) But still:

[root@inode test]# find . -name \*.h | xargs filefrag | grep -v "extents
format" | awk -F : '{print $2}' | sort | uniq -c
2461 1 extent found

and to be sure:

[root@inode test]# find . -name \*.h | xargs filefrag | grep "extents
found"
[root@inode test]#

> One of
> the things I've been meaning to do is to clean up the output of
> filefrag so that it's a bit easier to parse. It's really stupid for
> it to print "file is in extents format" over and over again...

I'm not really even sure why it should print it at all, ever, at least
without -vvv or something. (That's what I meant to filter the first
time ;) If the tool is to report layout I'm not sure the trivia about
how the layout is tracked internally is all that interesting. The
fiemap patches for filefrag that print it out more as a table format,
more like xfs_bmap does, should be a lot better. But anyway...
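
As a sketch of that direction (an editorial illustration, not code from
the thread): a minimal extent counter built directly on the FIEMAP
ioctl, which sidesteps parsing filefrag's English output entirely. It
assumes a kernel and headers that provide FS_IOC_FIEMAP; the
512-extent cap is an arbitrary choice:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	int fd, n = 512;	/* max extents we bother to fetch */

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = n;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	printf("%s: %u extent%s\n", argv[1], fm->fm_mapped_extents,
	       fm->fm_mapped_extents == 1 ? "" : "s");
	free(fm);
	close(fd);
	return 0;
}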

> I've done some research on my laptop's filesystem, and I've found
> examples of files with the characteristic horrendous layout going back
> to at least July 7th. (See below)
>
> This takes us back to about the time when we were stabilizing the
> patches that went into 2.6.26, and before the most recent set of
> patches to fix the journal credits and delayed allocation problems.
>
> Interestingly, I'm not seeing the bad layouts for files created June
> 30th, which is when I switched my laptop's root filesystem to ext4.
> So that gives us a pretty good time bound on the patches that might be
> involved. (This is a great demonstration of why it's a real win to
> manage the ext4 patch queue using git, since we can see which patches
> were added during that week or so.)
>
> In other news, I've added an enhancement to e2fsck which makes it much
> easier to spot these sorts of problems. On the latest e2fsprogs git
> branch, you'll be able to do "e2fsck -E fragcheck /dev/thunk/root" and
> get a report which makes it really easy to see the problem.
>
> - Ted

Hm... I wonder why I'm not seeing it?

-Eric

2008-08-11 07:58:21

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Sun, Aug 10, 2008 at 01:30:14PM -0400, Theodore Ts'o wrote:
>
> Thanks to a comment on a recent blog entry of mine[1], I think I've
> uncovered a rather embarrassing bug in mballoc.
>
> [1]http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times
>
> I created a fresh 5 gig ext4 filesystem, and then populated it using a
> single-threaded tar command:
>
> (cd /usr ; tar cf - bin lib) | (cd /mnt; tar xfp -)
>
> I then unmounted the filesystem, and ran an instrumented e2fsck looking
> for fragmented files, and found a whole series of fragmanted files with
> the following pattern:
>
> Inode 122: (0):58399, (1-3):43703-43705
> Inode 124: (0):58400, (1):43707
> Inode 127: (0):58401, (1-7):43709-43715
> Inode 128: (0):58402, (1-2):43716-43717
> Inode 129: (0):58403, (1-3):43718-43720
> Inode 133: (0):58404, (1-5):43722-43726
> Inode 135: (0):58405, (1):43728
> Inode 136: (0):58406, (1-3):43729-43731
> Inode 141: (0-1):58407-58408, (2-6):43734-43738
> Inode 143: (0):58409, (1):43740
> Inode 144: (0):58410, (1-5):43741-43745
> Inode 146: (0):58411, (1):43746
>
> Inode Pathname
> 122 /bin/smproxy
> 124 /bin/debconf-updatepo
> 127 /bin/iostat
> 128 /bin/xeyes
> 129 /bin/pbmtog3
> 133 /bin/join-dctrl
> 135 /bin/dpkg-name
> 136 /bin/lockfile
> 141 /bin/id
> 143 /bin/ppmcolormask
> 144 /bin/tty
> 146 /bin/colrm
>
> If I do this test with -o nodelalloc, I get a slightly different
> pattern. Now I get a whole series of discontiguous regions after the
> first 15 blocks:
>
> inode last_block pblk lblk len
> =============================================
> 2932: was 47087 actual extent 41894 (15, 3)...
> 3512: was 47829 actual extent 41908 (15, 1)...
> 3535: was 47904 actual extent 41912 (15, 37)...
> 3549: was 47977 actual extent 41949 (15, 4)...
> 3637: was 48225 actual extent 41959 (15, 6)...
> 3641: was 48245 actual extent 41965 (15, 13)...
> 3675: was 48418 actual extent 41978 (15, 1)...
> 3675: was 41979 actual extent 48640 (16, 15)...
> 3714: was 41984 actual extent 48656 (1, 2)...
> 3954: was 49449 actual extent 48660 (15, 16)...
> 3999: was 49569 actual extent 48679 (15, 2)...
> 4010: was 49644 actual extent 48681 (15, 1)...
> 4143: was 49943 actual extent 48687 (15, 10)...
> 4202: was 50036 actual extent 48699 (15, 6)...
>
> So all of the discontiguities start at logical block #15, and when I
> examine the inodes, what we find is one extent for blocks 0-14, ending
> at the last_block number, and then the second extent which extends for
> the rest of the file, starting somewhere else earlier in the block
> group.
>
> So a very similar issue, even without delayed allocation. That leads me
> to suspect the problem is somewhere inside mballoc. Aneesh, Andreas,
> Alex --- I think you folks are the most familiar with the mballoc code;
> does someone have time to take a look? This is clearly a bug, and clearly
> something we want to fix. If we can't get an optimal layout with one
> single-threaded process writing to the filesystem, what hope do we have
> of getting it right on more realistic benchmarks or real-world usage?
>

Are these small files? The locality group preallocation is discarded
with the recent changes: commit 6be2ded1d7c51b39144b9f07d2c839e1bd8707f1
in Linus's tree?

-aneesh

2008-08-11 14:39:23

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Sun, Aug 10, 2008 at 01:30:14PM -0400, Theodore Ts'o wrote:
>
> Thanks to a comment on a recent blog entry of mine[1], I think I've
> uncovered a rather embarrassing bug in mballoc.
>
> [1]http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times
>
> I created a fresh 5 gig ext4 filesystem, and then populated it using a
> single-threaded tar command:
>
> (cd /usr ; tar cf - bin lib) | (cd /mnt; tar xfp -)
>
> I then unmounted the filesystem, and ran an instrumented e2fsck looking
> for fragmented files, and found a whole series of fragmanted files with
> the following pattern:
>
> Inode 122: (0):58399, (1-3):43703-43705
> Inode 124: (0):58400, (1):43707
> Inode 127: (0):58401, (1-7):43709-43715
> Inode 128: (0):58402, (1-2):43716-43717
> Inode 129: (0):58403, (1-3):43718-43720
> Inode 133: (0):58404, (1-5):43722-43726
> Inode 135: (0):58405, (1):43728
> Inode 136: (0):58406, (1-3):43729-43731
> Inode 141: (0-1):58407-58408, (2-6):43734-43738
> Inode 143: (0):58409, (1):43740
> Inode 144: (0):58410, (1-5):43741-43745
> Inode 146: (0):58411, (1):43746
>
> Inode Pathname
> 122 /bin/smproxy
> 124 /bin/debconf-updatepo
> 127 /bin/iostat
> 128 /bin/xeyes
> 129 /bin/pbmtog3
> 133 /bin/join-dctrl
> 135 /bin/dpkg-name
> 136 /bin/lockfile
> 141 /bin/id
> 143 /bin/ppmcolormask
> 144 /bin/tty
> 146 /bin/colrm
>
> If I do this test with -o nodelalloc, I get a slightly different
> pattern. Now I get a whole series of discontiguous regions after the
> first 15 blocks:
>
> inode last_block pblk lblk len
> =============================================
> 2932: was 47087 actual extent 41894 (15, 3)...
> 3512: was 47829 actual extent 41908 (15, 1)...
> 3535: was 47904 actual extent 41912 (15, 37)...
> 3549: was 47977 actual extent 41949 (15, 4)...
> 3637: was 48225 actual extent 41959 (15, 6)...
> 3641: was 48245 actual extent 41965 (15, 13)...
> 3675: was 48418 actual extent 41978 (15, 1)...
> 3675: was 41979 actual extent 48640 (16, 15)...
> 3714: was 41984 actual extent 48656 (1, 2)...
> 3954: was 49449 actual extent 48660 (15, 16)...
> 3999: was 49569 actual extent 48679 (15, 2)...
> 4010: was 49644 actual extent 48681 (15, 1)...
> 4143: was 49943 actual extent 48687 (15, 10)...
> 4202: was 50036 actual extent 48699 (15, 6)...
>
> So all of the discontiguities start at logical block #15, and when I
> examine the inodes, what we find is one extent for blocks 0-14, ending
> at the last_block number, and then the second extent which extends for
> the rest of the file, starting somewhere else earlier in the block
> group.
>
> So a very similar issue, even without delayed allocation. That leads me
> to suspect the problem is somewhere inside mballoc. Aneesh, Andreas,
> Alex --- I think you folks are the most familiar with the mballoc code;
> does someone have time to take a look? This is clearly a bug, and clearly
> something we want to fix. If we can't get an optimal layout with one
> single-threaded process writing to the filesystem, what hope do we have
> of getting it right on more realistic benchmarks or real-world usage?
>

Can you try this patch? The patch makes group preallocation use the goal
block.

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 865e9dd..25fe375 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3281,6 +3281,29 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac,
 	mb_debug("use %u/%u from group pa %p\n", pa->pa_lstart-len, len, pa);
 }
 
+static struct ext4_prealloc_space *
+ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
+			struct ext4_prealloc_space *pa,
+			struct ext4_prealloc_space *cpa)
+{
+	ext4_fsblk_t cur_distance, new_distance;
+
+	if (cpa == NULL) {
+		atomic_inc(&pa->pa_count);
+		return pa;
+	}
+	cur_distance = abs(goal_block - cpa->pa_pstart);
+	new_distance = abs(goal_block - pa->pa_pstart);
+
+	if (cur_distance < new_distance)
+		return cpa;
+
+	/* drop the previous reference */
+	atomic_dec(&cpa->pa_count);
+	atomic_inc(&pa->pa_count);
+	return pa;
+}
+
 /*
  * search goal blocks in preallocated space
  */
@@ -3290,7 +3313,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	int order, i;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_locality_group *lg;
-	struct ext4_prealloc_space *pa;
+	struct ext4_prealloc_space *pa, *cpa = NULL;
+	ext4_fsblk_t goal_block;
 
 	/* only data can be preallocated */
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
@@ -3333,6 +3357,10 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	/* The max size of hash table is PREALLOC_TB_SIZE */
 	order = PREALLOC_TB_SIZE - 1;
 
+	goal_block = ac->ac_g_ex.fe_group * EXT4_BLOCKS_PER_GROUP(ac->ac_sb) +
+		ac->ac_g_ex.fe_start +
+		le32_to_cpu(EXT4_SB(ac->ac_sb)->s_es->s_first_data_block);
+
 	for (i = order; i < PREALLOC_TB_SIZE; i++) {
 		rcu_read_lock();
 		list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
@@ -3340,17 +3368,19 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 			spin_lock(&pa->pa_lock);
 			if (pa->pa_deleted == 0 &&
 					pa->pa_free >= ac->ac_o_ex.fe_len) {
-				atomic_inc(&pa->pa_count);
-				ext4_mb_use_group_pa(ac, pa);
-				spin_unlock(&pa->pa_lock);
-				ac->ac_criteria = 20;
-				rcu_read_unlock();
-				return 1;
+
+				cpa = ext4_mb_check_group_pa(goal_block,
+								pa, cpa);
 			}
 			spin_unlock(&pa->pa_lock);
 		}
 		rcu_read_unlock();
 	}
+	if (cpa) {
+		ext4_mb_use_group_pa(ac, cpa);
+		ac->ac_criteria = 20;
+		return 1;
+	}
 	return 0;
 }

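To make the selection policy concrete, here is a userspace toy (an
editorial illustration with made-up numbers, not part of the patch)
mirroring the idea in ext4_mb_check_group_pa(): among preallocation
spaces with enough free blocks, keep the one whose physical start is
closest to the goal block:

#include <stdio.h>

struct pa { unsigned long pstart; unsigned long free; };

static unsigned long dist(unsigned long a, unsigned long b)
{
	return a > b ? a - b : b - a;
}

int main(void)
{
	/* candidate preallocation spaces: physical start, free blocks */
	struct pa list[] = { { 58399, 40 }, { 43703, 200 }, { 91000, 10 } };
	unsigned long goal = 43700, need = 8;
	struct pa *best = NULL;

	for (int i = 0; i < 3; i++) {
		if (list[i].free < need)
			continue;	/* not enough space, skip */
		if (!best || dist(goal, list[i].pstart) <
			     dist(goal, best->pstart))
			best = &list[i];
	}
	if (best)
		printf("goal %lu -> PA at %lu (distance %lu)\n",
		       goal, best->pstart, dist(goal, best->pstart));
	return 0;
}

This prints "goal 43700 -> PA at 43703 (distance 3)": the old code took
the first space that fit regardless of distance; the patch keeps the
closest one.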

2008-08-11 18:15:33

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Mon, Aug 11, 2008 at 08:09:12PM +0530, Aneesh Kumar K.V wrote:
> Can you try this patch? The patch makes group preallocation use the goal
> block.
>

Results with and without patch.

http://www.radian.org/~kvaneesh/ext4/lg-fragmentation/

-aneesh

2008-08-13 02:32:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Mon, Aug 11, 2008 at 11:45:24PM +0530, Aneesh Kumar K.V wrote:
> On Mon, Aug 11, 2008 at 08:09:12PM +0530, Aneesh Kumar K.V wrote:
> > Can you try this patch? The patch makes group preallocation use the goal
> > block.
> >
>
> Results with and without patch.
>
> http://www.radian.org/~kvaneesh/ext4/lg-fragmentation/
>

My results match yours; seems to be a bit better, but it's not fixing
the fundamental problem. With the patch:

26524: expecting 638190 actual extent phys 631960 log 1 len 1
26527: expecting 638191 actual extent phys 631963 log 1 len 1
26533: expecting 638192 actual extent phys 631976 log 1 len 5
26534: expecting 638193 actual extent phys 631981 log 1 len 2
26536: expecting 638194 actual extent phys 631984 log 1 len 6
26538: expecting 638195 actual extent phys 631991 log 1 len 5
26540: expecting 638196 actual extent phys 631997 log 1 len 2
26545: expecting 638197 actual extent phys 632009 log 1 len 1
26546: expecting 638198 actual extent phys 632010 log 1 len 6
26604: expecting 638199 actual extent phys 632156 log 1 len 1

Using debugfs's stat command to look at the blocks:

26524: (0):638189, (1):631960
26527: (0):638190, (1):631963
26533: (0):638191, (1-5):631976-631980
26534: (0):638192, (1-2):631981-631982
26536: (0):638193, (1-6):631984-631989
26538: (0):638194, (1-5):631991-631995
26540: (0):638195, (1-2):631997-631998
26545: (0):638196, (1):632009
26546: (0):638197, (1-6):632010-632015

Out of curiosity, I also probed the inode numbers that were out of
sequence from above. They seem to be mostly allocating out of the
numbers used for the second extent, above.

26526: (0):631961
26526: (0):631962
26528: (0):631964
26529: (0):411742
26530: (0):631965
26531: (0-1):631966-631967
26532: (0-7):631968-631975
26535: (0):631983
26537: (0):631990
26541: (0-7):631999-632006
26542: (0):632007
26543: (0):632008
26544: (0):411743
26547: (0):632016

Inode Pathname
26524 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.py
26525 /lib/rhythmbox/plugins/lyrics/LyrcParser.py
26526 /lib/rhythmbox/plugins/lyrics/LyricsParse.py
26527 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.pyc
26528 /lib/rhythmbox/plugins/lyrics/WinampcnParser.py
26529 /lib/rhythmbox/plugins/magnatune
26530 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_small.png
26531 /lib/rhythmbox/plugins/magnatune/magnatune.rb-plugin
26532 /lib/rhythmbox/plugins/magnatune/magnatune-prefs.glade
26533 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.pyc
26534 /lib/rhythmbox/plugins/magnatune/__init__.py
26535 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.py
26536 /lib/rhythmbox/plugins/magnatune/magnatune-purchase.glade
26537 /lib/rhythmbox/plugins/magnatune/TrackListHandler.py
26538 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.py
26539 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_tiny.png
26540 /lib/rhythmbox/plugins/magnatune/__init__.pyc
26541 /lib/rhythmbox/plugins/magnatune/magnatune-loading.glade
26542 /lib/rhythmbox/plugins/magnatune/TrackListHandler.pyc
26543 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.pyc
26544 /lib/rhythmbox/plugins/audioscrobbler
26546 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-prefs.glade
26547 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-ui.xml

Looks like we still have some problems with the block allocator...

- Ted

2008-08-13 10:52:29

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Tue, Aug 12, 2008 at 10:32:05PM -0400, Theodore Tso wrote:
> On Mon, Aug 11, 2008 at 11:45:24PM +0530, Aneesh Kumar K.V wrote:
> > On Mon, Aug 11, 2008 at 08:09:12PM +0530, Aneesh Kumar K.V wrote:
> > > Can you try this patch? The patch makes group preallocation use the goal
> > > block.
> > >
> >
> > Results with and without patch.
> >
> > http://www.radian.org/~kvaneesh/ext4/lg-fragmentation/
> >
>
> My results match yours; seems to be a bit better, but it's not fixing
> the fundamental problem. With the patch:
>
> 26524: expecting 638190 actual extent phys 631960 log 1 len 1
> 26527: expecting 638191 actual extent phys 631963 log 1 len 1
> 26533: expecting 638192 actual extent phys 631976 log 1 len 5
> 26534: expecting 638193 actual extent phys 631981 log 1 len 2
> 26536: expecting 638194 actual extent phys 631984 log 1 len 6
> 26538: expecting 638195 actual extent phys 631991 log 1 len 5
> 26540: expecting 638196 actual extent phys 631997 log 1 len 2
> 26545: expecting 638197 actual extent phys 632009 log 1 len 1
> 26546: expecting 638198 actual extent phys 632010 log 1 len 6
> 26604: expecting 638199 actual extent phys 632156 log 1 len 1
>
> Using debugfs's stat command to look at the blocks:
>
> 26524: (0):638189, (1):631960
> 26527: (0):638190, (1):631963
> 26533: (0):638191, (1-5):631976-631980
> 26534: (0):638192, (1-2):631981-631982
> 26536: (0):638193, (1-6):631984-631989
> 26538: (0):638194, (1-5):631991-631995
> 26540: (0):638195, (1-2):631997-631998
> 26545: (0):638196, (1):632009
> 26546: (0):638197, (1-6):632010-632015

I am not sure why we are getting single-block requests for inodes
26524 etc. With delayed alloc we should have got a 2-block request.

>
> Out of curiosity, I also probed the inode numbers that were out of
> sequence from above. They seem to be mostly allocating out of the
> numbers used for the second extent, above.
>
> 26526: (0):631961
> 26526: (0):631962
> 26528: (0):631964
> 26529: (0):411742
> 26530: (0):631965
> 26531: (0-1):631966-631967
> 26532: (0-7):631968-631975
> 26535: (0):631983
> 26537: (0):631990
> 26541: (0-7):631999-632006
> 26542: (0):632007
> 26543: (0):632008
> 26544: (0):411743
> 26547: (0):632016
>
> Inode Pathname
> 26524 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.py
> 26525 /lib/rhythmbox/plugins/lyrics/LyrcParser.py
> 26526 /lib/rhythmbox/plugins/lyrics/LyricsParse.py
> 26527 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.pyc
> 26528 /lib/rhythmbox/plugins/lyrics/WinampcnParser.py
> 26529 /lib/rhythmbox/plugins/magnatune
> 26530 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_small.png
> 26531 /lib/rhythmbox/plugins/magnatune/magnatune.rb-plugin
> 26532 /lib/rhythmbox/plugins/magnatune/magnatune-prefs.glade
> 26533 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.pyc
> 26534 /lib/rhythmbox/plugins/magnatune/__init__.py
> 26535 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.py
> 26536 /lib/rhythmbox/plugins/magnatune/magnatune-purchase.glade
> 26537 /lib/rhythmbox/plugins/magnatune/TrackListHandler.py
> 26538 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.py
> 26539 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_tiny.png
> 26540 /lib/rhythmbox/plugins/magnatune/__init__.pyc
> 26541 /lib/rhythmbox/plugins/magnatune/magnatune-loading.glade
> 26542 /lib/rhythmbox/plugins/magnatune/TrackListHandler.pyc
> 26543 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.pyc
> 26544 /lib/rhythmbox/plugins/audioscrobbler
> 26546 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-prefs.glade
> 26547 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-ui.xml
>
> Looks like we still have some problems with the block allocator...

The problem is with delalloc and the mballoc locality group. With
delalloc we use pdflush to write the pages. Small file allocations use
a per-CPU prealloc space. In my understanding, using a per-CPU prealloc
space is fine without delalloc, because without delalloc get_block
happens in process context at write_begin, and the OS scheduler will
not move the task to another CPU unless needed.

With delalloc we have pdflush doing the block allocation, and per-CPU
prealloc spaces may not really help here. So I tried a small patch as
below. But that didn't help much. Also, the patch would increase
contention on the locality group mutex, so I guess the change is not
worth it.

But with delalloc we should have got multiple block requests together.
That implies we should get a single get_block request for the whole
file. I will have to instrument the kernel to understand why that is
not happening.

Even though the files are fragmented, I guess we have those blocks
close together on disk, right?

diff --git a/fs/ext4/ext4_i.h b/fs/ext4/ext4_i.h
index ef7409f..734b6ef 100644
--- a/fs/ext4/ext4_i.h
+++ b/fs/ext4/ext4_i.h
@@ -163,6 +163,8 @@ struct ext4_inode_info {
 	/* mballoc */
 	struct list_head i_prealloc_list;
 	spinlock_t i_prealloc_lock;
+	/* locality group used for block allocation */
+	struct ext4_locality_group *lg;
 
 	/* allocation reservation info for delalloc */
 	unsigned long i_reserved_data_blocks;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 25fe375..293f048 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4061,9 +4061,10 @@ static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
  */
 static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
+	int bsbits = ac->ac_sb->s_blocksize_bits;
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -4085,13 +4086,23 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	 * per cpu locality group is to reduce the contention between block
 	 * request from multiple CPUs.
 	 */
-	ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
-	put_cpu();
+	if (ei->lg)
+		ac->ac_lg = ei->lg;
+	else {
+		ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
+		ei->lg = ac->ac_lg;
+		put_cpu();
+	}
 
 	/* we're going to use group allocation */
 	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
 
-	/* serialize all allocations in the group */
+	/*
+	 * serialize all allocations in the group
+	 * If we find lot of contention we may want
+	 * to add waiters count and use other lg if
+	 * we have large number of waiters
+	 */
 	mutex_lock(&ac->ac_lg->lg_mutex);
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 09e3c56..08bdbf9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -576,6 +576,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
 	ei->i_delalloc_reserved_flag = 0;
+	ei->lg = NULL;
 	spin_lock_init(&(ei->i_block_reservation_lock));
 	return &ei->vfs_inode;
 }

2008-08-14 21:49:35

by Mingming Cao

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!


On Wed, 2008-08-13 at 16:22 +0530, Aneesh Kumar K.V wrote:
> On Tue, Aug 12, 2008 at 10:32:05PM -0400, Theodore Tso wrote:
> > On Mon, Aug 11, 2008 at 11:45:24PM +0530, Aneesh Kumar K.V wrote:
> > > On Mon, Aug 11, 2008 at 08:09:12PM +0530, Aneesh Kumar K.V wrote:
> > > > Can you try this patch? The patch makes group preallocation use the goal
> > > > block.
> > > >
> > >
> > > Results with and without patch.
> > >
> > > http://www.radian.org/~kvaneesh/ext4/lg-fragmentation/
> > >
> >
> > My results match yours; seems to be a bit better, but it's not fixing
> > the fundamental problem. With the patch:
> >
> > 26524: expecting 638190 actual extent phys 631960 log 1 len 1
> > 26527: expecting 638191 actual extent phys 631963 log 1 len 1
> > 26533: expecting 638192 actual extent phys 631976 log 1 len 5
> > 26534: expecting 638193 actual extent phys 631981 log 1 len 2
> > 26536: expecting 638194 actual extent phys 631984 log 1 len 6
> > 26538: expecting 638195 actual extent phys 631991 log 1 len 5
> > 26540: expecting 638196 actual extent phys 631997 log 1 len 2
> > 26545: expecting 638197 actual extent phys 632009 log 1 len 1
> > 26546: expecting 638198 actual extent phys 632010 log 1 len 6
> > 26604: expecting 638199 actual extent phys 632156 log 1 len 1
> >
> > Using debugfs's stat command to look at the blocks:
> >
> > 26524: (0):638189, (1):631960
> > 26527: (0):638190, (1):631963
> > 26533: (0):638191, (1-5):631976-631980
> > 26534: (0):638192, (1-2):631981-631982
> > 26536: (0):638193, (1-6):631984-631989
> > 26538: (0):638194, (1-5):631991-631995
> > 26540: (0):638195, (1-2):631997-631998
> > 26545: (0):638196, (1):632009
> > 26546: (0):638197, (1-6):632010-632015
>
> I am not sure why we are getting single-block requests for inodes
> 26524 etc. With delayed alloc we should have got a 2-block request.
>
> >
> > Out of curiosity, I also probed the inode numbers that were out of
> > sequence from above. They seem to be mostly allocating out of the
> > numbers used for the second extent, above.
> >
> > 26526: (0):631961
> > 26526: (0):631962
> > 26528: (0):631964
> > 26529: (0):411742
> > 26530: (0):631965
> > 26531: (0-1):631966-631967
> > 26532: (0-7):631968-631975
> > 26535: (0):631983
> > 26537: (0):631990
> > 26541: (0-7):631999-632006
> > 26542: (0):632007
> > 26543: (0):632008
> > 26544: (0):411743
> > 26547: (0):632016
> >
> > Inode Pathname
> > 26524 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.py
> > 26525 /lib/rhythmbox/plugins/lyrics/LyrcParser.py
> > 26526 /lib/rhythmbox/plugins/lyrics/LyricsParse.py
> > 26527 /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.pyc
> > 26528 /lib/rhythmbox/plugins/lyrics/WinampcnParser.py
> > 26529 /lib/rhythmbox/plugins/magnatune
> > 26530 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_small.png
> > 26531 /lib/rhythmbox/plugins/magnatune/magnatune.rb-plugin
> > 26532 /lib/rhythmbox/plugins/magnatune/magnatune-prefs.glade
> > 26533 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.pyc
> > 26534 /lib/rhythmbox/plugins/magnatune/__init__.py
> > 26535 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.py
> > 26536 /lib/rhythmbox/plugins/magnatune/magnatune-purchase.glade
> > 26537 /lib/rhythmbox/plugins/magnatune/TrackListHandler.py
> > 26538 /lib/rhythmbox/plugins/magnatune/MagnatuneSource.py
> > 26539 /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_tiny.png
> > 26540 /lib/rhythmbox/plugins/magnatune/__init__.pyc
> > 26541 /lib/rhythmbox/plugins/magnatune/magnatune-loading.glade
> > 26542 /lib/rhythmbox/plugins/magnatune/TrackListHandler.pyc
> > 26543 /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.pyc
> > 26544 /lib/rhythmbox/plugins/audioscrobbler
> > 26546 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-prefs.glade
> > 26547 /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-ui.xml
> >
> > Looks like we still have some problems with the block allocator...
>
> The problem is with delalloc and the mballoc locality group. With
> delalloc we use pdflush to write the pages. Small file allocations use
> a per-CPU prealloc space. In my understanding, using a per-CPU prealloc
> space is fine without delalloc, because without delalloc get_block
> happens in process context at write_begin, and the OS scheduler will
> not move the task to another CPU unless needed.
>
> With delalloc we have pdflush doing the block allocation, and per-CPU
> prealloc spaces may not really help here.

I wonder if it still makes sense to use per-CPU group locality
allocation with delalloc, given that all block allocation is
done via pdflush?

> So I tried a small patch as below. But that
> didn't help much. Also, the patch would increase contention on the
> locality group mutex, so I guess the change is not worth it.
>
> But with delalloc we should have got multiple block requests together.
> That implies we should get a single get_block request for the whole
> file. I will have to instrument the kernel to understand why that is
> not happening.
>

I am curious to know this too. Why do we get single-block allocation
requests with delalloc?


Mingming

2008-08-18 10:50:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: Bug in delayed allocation: really bad block layouts!

On Aug 11, 2008 00:36 -0500, Eric Sandeen wrote:
> Theodore Tso wrote:
> > One of
> > the things I've been meaning to do is to clean up the output of
> > filefrag so that it's a bit easier to parse. It's really stupid for
> > it to print "file is in extents format" over and over again...
>
> I'm not really even sure why it should print it at all, ever, at least
> without -vvv or something. (That's what I meant to filter the first
> time ;) If the tool is to report layout I'm not sure the trivia about
> how the layout is tracked internally is all that interesting. The
> fiemap patches for filefrag that print it out more as a table format,
> more like xfs_bmap does, should be a lot better. But anyway...

Actually, the FIEMAP patches for filefrag provide a much nicer format
for the output, more like what xfs_bmap does. For now it only does
this if requested with the "-e" option (--extents), but if Ted is not
averse to changing the output format of filefrag we could make it the
default.

The whole Lustre patchset (against 1.40.11) is at
http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.11-sun1-patches.tar.gz

but Kalpak or I can send just the FIEMAP one again if you are interested.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.