I was doing a little seekwatchering today, and found something...
interesting.
I was doing an 8G buffered write via dd, on a machine that reports 3G of
memory, in 1M chunks like so:
dd if=/dev/zero of=/mnt/test/foobar bs=1024k count=8192
on a fairly decent hardware RAID, with a ~90G filesystem.
Kernel is 2.6.24-rc1, with all the git patches from a day or so ago applied.
I made the ext4 fs with Lustre's e2fsprogs, with -I 256, and mounted with:
mount -t ext4dev -o data=writeback,delalloc,extents,mballoc /dev/sdb7 /mnt/test
The resulting file had over 4k extents.
[root@bear-05 ~]# filefrag -v /mnt/test/foobar | grep -i extents
File is stored in extents format
/mnt/test/foobar: 4075 extents found
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-write.png
if I don't mount with delalloc:
mount -t ext4dev -o data=writeback,extents,mballoc /dev/sdb7 /mnt/test
and run the same dd, I get 229 extents:
[root@bear-05 ~]# filefrag -v /mnt/test/foobar | grep -i extents
File is stored in extents format
/mnt/test/foobar: 229 extents found
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-write-nodelalloc.png
It looks like delalloc is dribbling all over the disk....
(note: times & rates look wrong to me, something is up with blktrace I
think, but FIBMAP shouldn't lie about allocation)
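(For anyone wanting to reproduce the graphs: a seekwatcher invocation along
these lines - the option names are from memory and may differ between
seekwatcher versions - traces the dd and renders the png:)

seekwatcher -t ext4-dd.trace -o ext4-dd-write.png -d /dev/sdb7 \
    -p 'dd if=/dev/zero of=/mnt/test/foobar bs=1024k count=8192'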
-Eric
On Oct 26, 2007 16:24 -0500, Eric Sandeen wrote:
> The resulting file had over 4k extents.
> [root@bear-05 ~]# filefrag -v /mnt/test/foobar | grep -i extents
> File is stored in extents format
> /mnt/test/foobar: 4075 extents found
On a related note - we're just putting the finishing touches on the
FIEMAP patches for ext4 + e2fsprogs, so that we can get decent looking
output from filefrag, and much more efficiently than FIBMAP.
> if I don't mount with delalloc:
>
> mount -t ext4dev -o data=writeback,extents,mballoc /dev/sdb7 /mnt/test
>
> and run the same dd, I get 229 extents:
>
> [root@bear-05 ~]# filefrag -v /mnt/test/foobar | grep -i extents
> File is stored in extents format
> /mnt/test/foobar: 229 extents found
One of the issues is that w/o delalloc the mballoc code only gets
single-block allocations, so there might be a problem with the interface
to mballoc. That might be caused by the fact the patches were changed
at one point from delalloc-atop-mballoc to mballoc-atop-delalloc, and
something was missed in that conversion.
Have you tried O_DIRECT? That is another way to access mballoc w/o
using delalloc.
Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Oct 26, 2007 16:24 -0500, Eric Sandeen wrote:
>> The resulting file had over 4k extents.
>> [root@bear-05 ~]# filefrag -v /mnt/test/foobar | grep -i extents
>> File is stored in extents format
>> /mnt/test/foobar: 4075 extents found
>
> On a related note - we're just putting the finishing touches on the
> FIEMAP patches for ext4 + e2fsprogs, so that we can get decent looking
> output from filefrag, and much more efficiently than FIBMAP.
sounds good.
> One of the issues is that w/o delalloc the mballoc code only gets
> single-block allocations, so there might be a problem with the interface
> to mballoc. That might be caused by the fact the patches were changed
> at one point from delalloc-atop-mballoc to mballoc-atop-delalloc, and
> something was missed in that conversion.
>
> Have you tried O_DIRECT? That is another way to access mballoc w/o
> using delalloc.
I've tried O_SYNC:
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-osync-write.png
but not yet O_DIRECT, will do.
-Eric
Eric Sandeen wrote:
> but not yet O_DIRECT, will do.
dd if=/dev/zero oflag=direct of=/mnt/test/foobar bs=1024k count=8192
filefrag -v /mnt/test/foobar | grep -i extents
File is stored in extents format
/mnt/test/foobar: 67 extents found
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-write-odirect.png
-Eric
Eric Sandeen wrote:
> (note: times & rates look wrong to me, something is up with blktrace I
> think, but FIBMAP shouldn't lie about allocation)
CONFIG_NO_HZ was causing this... here are better graphs:
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-write.png
http://people.redhat.com/esandeen/seekwatcher/xfs-dd-write.png
http://people.redhat.com/esandeen/seekwatcher/ext4-xfs-dd-write.png
http://people.redhat.com/esandeen/seekwatcher/ext4-dd-write-odirect.png
http://people.redhat.com/esandeen/seekwatcher/xfs-dd-write-odirect.png
http://people.redhat.com/esandeen/seekwatcher/ext4-xfs-dd-write-odirect.png
-Eric
Andreas Dilger wrote:
> One of the issues is that w/o delalloc the mballoc code only gets
> single-block allocations, so there might be a problem with the interface
> to mballoc. That might be caused by the fact the patches were changed
> at one point from delalloc-atop-mballoc to mballoc-atop-delalloc, and
> something was missed in that conversion.
I've been testing this with a 16x1MB buffered dd to a fresh filesystem;
it ends up with 7 extents, and out of order at that. :(
First block: 37120
Last block: 63427
Discontinuity: Block 1024 is at 38400 (was 38143)
Discontinuity: Block 1036 is at 57344 (was 38411)
Discontinuity: Block 2048 is at 38412 (was 58355)
Discontinuity: Block 2072 is at 61440 (was 38435)
Discontinuity: Block 3072 is at 38436 (was 62439)
Discontinuity: Block 3108 is at 62440 (was 38471)
/mnt/test/testfile: 7 extents found
One thing that seems to be happening is that thanks to delalloc, a nice
big request is coming in (only 1036 blocks of the 4096, not quite sure
why), but then it gets into ext4_mb_normalize_request(), which finds the
most blocks it can "preallocate" is 256, and chops down the request to
256 blocks. Shouldn't this preallocation be over & above what was asked
for, vs. reducing the request?
Ok, so, we get allocations in 256-block chunks... Why they don't all
come out contiguous, I don't know yet...
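To make the question concrete, here is a rough illustration of the two
policies. This is not the real ext4_mb_normalize_request(), just a sketch
using the numbers above (a ~1036-block request against a 256-block
preallocation window):

#include <stdio.h>

/* Illustrative only -- not ext4 code.  goal_len is what the caller asked
 * for; window is the preallocation limit (256 blocks above). */
static unsigned cap_to_window(unsigned goal_len, unsigned window)
{
        /* behavior as observed: the request is chopped down to the window,
         * so a 1036-block request becomes a run of 256-block allocations */
        return goal_len > window ? window : goal_len;
}

static unsigned round_up_to_window(unsigned goal_len, unsigned window)
{
        /* the alternative the question argues for: preallocate over and
         * above the request, i.e. round it *up* to a window multiple */
        return ((goal_len + window - 1) / window) * window;
}

int main(void)
{
        printf("capped:     %u\n", cap_to_window(1036, 256));      /* 256  */
        printf("rounded up: %u\n", round_up_to_window(1036, 256)); /* 1280 */
        return 0;
}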
-Eric
Please try the attached patch.
thanks, Alex
Eric Sandeen wrote:
> One thing that seems to be happening is that thanks to delalloc, a nice
> big request is coming in (only 1036 blocks of the 4096, not quite sure
> why), but then it gets into ext4_mb_normalize_request(), which finds the
> most blocks it can "preallocate" is 256, and chops down the request to
> 256 blocks. Shouldn't this preallocation be over & above what was asked
> for, vs. reducing the request?
>
> Ok, so, we get allocations in 256-block chunks... Why they don't all
> come out contiguous, I don't know yet...
>
> -Eric
Alex Tomas wrote:
> Please try the attached patch.
Looks quite a bit better:
http://people.redhat.com/esandeen/seekwatcher/ext4-alex.png
http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-dd-write.png
http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-xfs-dd-write.png
It is much less fragmented, although still not exactly the nice linear
allocation I'd expect from a single threaded large write on a fresh fs...
-Eric
> thanks, Alex
>
> Eric Sandeen wrote:
>> One thing that seems to be happening is that thanks to delalloc, a nice
>> big request is coming in (only 1036 blocks of the 4096, not quite sure
>> why), but then it gets into ext4_mb_normalize_request(), which finds the
>> most blocks it can "preallocate" is 256, and chops down the request to
>> 256 blocks. Shouldn't this preallocation be over & above what was asked
>> for, vs. reducing the request?
>>
>> Ok, so, we get allocations in 256-block chunks... Why they don't all
>> come out contiguous, I don't know yet...
>>
>> -Eric
>
Eric Sandeen wrote:
> Alex Tomas wrote:
>> Please try the attached patch.
>
> Looks quite a bit better:
>
> http://people.redhat.com/esandeen/seekwatcher/ext4-alex.png
> http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-dd-write.png
> http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-xfs-dd-write.png
>
> It is much less fragmented, although still not exactly the nice linear
> allocation I'd expect from a single threaded large write on a fresh fs...
Note, we're still getting out-of-order extents, too:
First block: 122880
Last block: 2694143
Discontinuity: Block 7424 is at 101376 (was 130303)
Discontinuity: Block 28160 is at 133120 (was 122111)
Discontinuity: Block 58368 is at 188416 (was 163327)
Discontinuity: Block 66304 is at 180224 (was 196351)
Discontinuity: Block 73984 is at 172032 (was 187903)
Discontinuity: Block 81664 is at 167936 (was 179711)
Discontinuity: Block 84736 is at 221184 (was 171007)
Discontinuity: Block 92416 is at 212992 (was 228863)
Discontinuity: Block 100096 is at 204800 (was 220671)
Discontinuity: Block 107776 is at 198656 (was 212479)
...
I'm trying to find time to look into this but other things are knocking
at my door so no promises...
-Eric
Eric,
would you mind repeating the run and then grabbing /proc/fs/ext4/<dev>/mb_history?
thanks in advance, Alex
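(For the record, capturing the history around a run looks roughly like
this; the directory under /proc/fs/ext4/ is named after the block device,
so with /dev/sdb7 from the mount command earlier it would be "sdb7" -
adjust to taste:)

dd if=/dev/zero of=/mnt/test/foobar bs=1024k count=1024
sync    # force writeback so delalloc actually allocates before we look
cat /proc/fs/ext4/sdb7/mb_history > /tmp/mb_history.txt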
Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Alex Tomas wrote:
>>> Please try the attached patch.
>> Looks quite a bit better:
>>
>> http://people.redhat.com/esandeen/seekwatcher/ext4-alex.png
>> http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-dd-write.png
>> http://people.redhat.com/esandeen/seekwatcher/ext4-alex-ext4-xfs-dd-write.png
>>
>> It is much less fragmented, although still not exactly the nice linear
>> allocation I'd expect from a single threaded large write on a fresh fs...
>
> Note, we're still getting out-of-order extents, too:
>
> First block: 122880
> Last block: 2694143
> Discontinuity: Block 7424 is at 101376 (was 130303)
> Discontinuity: Block 28160 is at 133120 (was 122111)
> Discontinuity: Block 58368 is at 188416 (was 163327)
> Discontinuity: Block 66304 is at 180224 (was 196351)
> Discontinuity: Block 73984 is at 172032 (was 187903)
> Discontinuity: Block 81664 is at 167936 (was 179711)
> Discontinuity: Block 84736 is at 221184 (was 171007)
> Discontinuity: Block 92416 is at 212992 (was 228863)
> Discontinuity: Block 100096 is at 204800 (was 220671)
> Discontinuity: Block 107776 is at 198656 (was 212479)
> ...
>
> I'm trying to find time to look into this but other things are knocking
> at my door so no promises...
>
> -Eric
>
Alex Tomas wrote:
> Eric,
>
> would you mind repeating the run and then grabbing /proc/fs/ext4/<dev>/mb_history?
>
> thanks in advance, Alex
Sure thing, attached; this is from a 1024x1M run, wound up with 32
fragments, out of order:
First block: 122880
Last block: 491519
Discontinuity: Block 7424 is at 114688 (was 130303)
Discontinuity: Block 15616 is at 106496 (was 122879)
Discontinuity: Block 23552 is at 102400 (was 114431)
Discontinuity: Block 26624 is at 155648 (was 105471)
Discontinuity: Block 33792 is at 147456 (was 162815)
Discontinuity: Block 41472 is at 162816 (was 155135)
Discontinuity: Block 42496 is at 188416 (was 163839)
Discontinuity: Block 50176 is at 180224 (was 196095)
Discontinuity: Block 58368 is at 172032 (was 188415)
Discontinuity: Block 66560 is at 167936 (was 180223)
....
pid 3695 is pdflush, I assume 7634 was the dd itself.
-Eric
Thanks for the data. The attached patch should fix a couple of issues:
the broken history output, and the policy that caused a smaller usable
chunk to be used. Can you give it a spin, please?
thanks, Alex
Eric Sandeen wrote:
> Alex Tomas wrote:
>> Eric,
>>
>> would you mind repeating the run and then grabbing /proc/fs/ext4/<dev>/mb_history?
>>
>> thanks in advance, Alex
>
> Sure thing, attached; this is from a 1024x1M run, wound up with 32
> fragments, out of order:
>
> First block: 122880
> Last block: 491519
> Discontinuity: Block 7424 is at 114688 (was 130303)
> Discontinuity: Block 15616 is at 106496 (was 122879)
> Discontinuity: Block 23552 is at 102400 (was 114431)
> Discontinuity: Block 26624 is at 155648 (was 105471)
> Discontinuity: Block 33792 is at 147456 (was 162815)
> Discontinuity: Block 41472 is at 162816 (was 155135)
> Discontinuity: Block 42496 is at 188416 (was 163839)
> Discontinuity: Block 50176 is at 180224 (was 196095)
> Discontinuity: Block 58368 is at 172032 (was 188415)
> Discontinuity: Block 66560 is at 167936 (was 180223)
> ....
>
> pid 3695 is pdflush, I assume 7634 was the dd itself.
>
> -Eric
Alex Tomas wrote:
> Thanks for the data. The attached patch should fix a couple of issues:
> the broken history output, and the policy that caused a smaller usable
> chunk to be used. Can you give it a spin, please?
Hi Alex -
This looks *much* better:
First block: 100355
Last block: 2342657
Discontinuity: Block 30208 is at 133120 (was 130562)
Discontinuity: Block 60928 is at 165891 (was 163839)
Discontinuity: Block 90368 is at 197634 (was 195330)
Discontinuity: Block 121856 is at 232448 (was 229121)
Discontinuity: Block 150528 is at 263170 (was 261119)
Discontinuity: Block 181504 is at 296963 (was 294145)
Discontinuity: Block 210944 is at 328706 (was 326402)
Discontinuity: Block 241664 is at 361474 (was 359425)
Discontinuity: Block 272896 is at 395264 (was 392705)
Discontinuity: Block 303360 is at 427010 (was 425727)
Discontinuity: Block 334080 is at 459778 (was 457729)
Discontinuity: Block 365568 is at 493568 (was 491265)
Discontinuity: Block 395264 is at 525314 (was 523263)
Discontinuity: Block 426240 is at 558082 (was 556289)
....
For an 8192x1M (8G) buffered dd, I now get 88 extents, in order.
Do you want the mballoc history?
Now I can try some multithreaded tests :)
Thanks,
-Eric
> thanks, Alex
Eric Sandeen wrote:
> Alex Tomas wrote:
>> Thanks for the data. The attached patch should fix a couple of issues:
>> the broken history output, and the policy that caused a smaller usable
>> chunk to be used. Can you give it a spin, please?
>
> Hi Alex -
>
> This looks *much* better:
(nb: that's running w/ both of the patches you sent on this thread)
-Eric
Hi,
Eric Sandeen wrote:
> Do you want the mballoc history?
I wouldn't mind :)
> Now I can try some multithreaded tests :)
I'm very interested!
thanks, Alex
On Nov 05, 2007 11:37 -0600, Eric Sandeen wrote:
> Alex Tomas wrote:
> > Thanks for the data. The attached patch should fix a couple of issues:
> > the broken history output, and the policy that caused a smaller usable
> > chunk to be used. Can you give it a spin, please?
>
> This looks *much* better:
>
> First block: 100355
> Last block: 2342657
> Discontinuity: Block 30208 is at 133120 (was 130562)
> Discontinuity: Block 60928 is at 165891 (was 163839)
> Discontinuity: Block 90368 is at 197634 (was 195330)
> Discontinuity: Block 121856 is at 232448 (was 229121)
> Discontinuity: Block 150528 is at 263170 (was 261119)
> Discontinuity: Block 181504 is at 296963 (was 294145)
> Discontinuity: Block 210944 is at 328706 (was 326402)
> Discontinuity: Block 241664 is at 361474 (was 359425)
> Discontinuity: Block 272896 is at 395264 (was 392705)
> Discontinuity: Block 303360 is at 427010 (was 425727)
> Discontinuity: Block 334080 is at 459778 (was 457729)
> Discontinuity: Block 365568 is at 493568 (was 491265)
> Discontinuity: Block 395264 is at 525314 (was 523263)
> Discontinuity: Block 426240 is at 558082 (was 556289)
> ....
On a related note - the FIEMAP patches to filefrag also include a new
output format that is much more useful, IMHO. The new format is like:
{filename}
ext: [logical start.. end kB]: phys start..end kB : kB:lun: flags
0: [ 0.. 30207]: 401416.. 522251: 120828: 0 :
1: [ 30208.. 60927]: 532480.. 655359: 122880: 0 :
2: [ 60928.. 121855]: 790536.. 916484: 125948: 0 :
Hopefully Kalpak will be able to post the updated patches here soon.
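(For context, the guts of such a filefrag boil down to one ioctl; below is
a minimal sketch using the FIEMAP interface as it was eventually merged
into mainline - FS_IOC_FIEMAP from linux/fs.h, struct fiemap from
linux/fiemap.h - so field names may differ slightly from the patches being
discussed here:)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* room for 32 extents after the fiemap header */
        size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;                  /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delalloc first */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

        for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
                printf("%u: logical %llu phys %llu len %llu flags 0x%x\n", i,
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length,
                       fm->fm_extents[i].fe_flags);
        free(fm);
        close(fd);
        return 0;
}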
Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On a related note - the FIEMAP patches to filefrag also include a new
> output format that is much more useful, IMHO. The new format is like:
>
> {filename}
> ext: [logical start.. end kB]: phys start..end kB : kB:lun: flags
> 0: [ 0.. 30207]: 401416.. 522251: 120828: 0 :
> 1: [ 30208.. 60927]: 532480.. 655359: 122880: 0 :
> 2: [ 60928.. 121855]: 790536.. 916484: 125948: 0 :
>
> Hopefully Kalpak will be able to post the updated patches here soon.
Yep, I hacked the existing filefrag to do something like this; the
current format is pretty hard to glance over :)
One thing I like about xfs_bmap is that it can tell you which Allocation
Group the blocks are in; most filesystems have some concept of
sub-regions of the filesystem, such as BGs or resource groups or whatnot
- do you think there is room for this in the FIEMAP interface? Hm, or
should this just be calculated from knowing the size of the sub-regions...
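(As a sketch of the "just calculate it" option: for ext4 it's a plain
division by the per-group size from the superblock - the helper below is
hypothetical, not from any patch. With 4 KiB blocks the defaults are 32768
blocks per group and first data block 0, so block 122880 from the FIBMAP
output earlier lands in group 3.)

#include <stdio.h>

/* Hypothetical helper: map a physical block number to its block group,
 * given the geometry dumpe2fs -h reports ("Blocks per group", "First block"). */
static unsigned long block_to_group(unsigned long long block,
                                    unsigned long long first_data_block,
                                    unsigned long blocks_per_group)
{
        return (unsigned long)((block - first_data_block) / blocks_per_group);
}

int main(void)
{
        printf("group %lu\n", block_to_group(122880ULL, 0ULL, 32768UL));
        return 0;
}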
-Eric
Eric Sandeen wrote:
> Alex Tomas wrote:
>> Thanks for the data. The attached patch should fix a couple of issues:
>> the broken history output, and the policy that caused a smaller usable
>> chunk to be used. Can you give it a spin, please?
>
> Hi Alex -
>
> This looks *much* better:
Hmm bad news is when I add uninit_groups into the mix, it goes a little
south again, with some out-of-order extents. Not the end of the world,
but a little unexpected?
....
Discontinuity: Block 1430784 is at 24183810 (was 24181761)
Discontinuity: Block 1461760 is at 24216578 (was 24214785)
Discontinuity: Block 1492480 is at 37888 (was 24247297)
Discontinuity: Block 1519616 is at 850944 (was 65023)
Discontinuity: Block 1520640 is at 883712 (was 851967)
Discontinuity: Block 1521664 is at 1670144 (was 884735)
Discontinuity: Block 1522688 is at 2685952 (was 1671167)
Discontinuity: Block 1523712 is at 4226048 (was 2686975)
Discontinuity: Block 1524736 is at 11271168 (was 4227071)
Discontinuity: Block 1525760 is at 23952384 (was 11272191)
...
-Eric
On Nov 06, 2007 13:54 -0600, Eric Sandeen wrote:
> Hmm bad news is when I add uninit_groups into the mix, it goes a little
> south again, with some out-of-order extents. Not the end of the world,
> but a little unexpected?
>
> ....
> Discontinuity: Block 1430784 is at 24183810 (was 24181761)
> Discontinuity: Block 1461760 is at 24216578 (was 24214785)
> Discontinuity: Block 1492480 is at 37888 (was 24247297)
> Discontinuity: Block 1519616 is at 850944 (was 65023)
> Discontinuity: Block 1520640 is at 883712 (was 851967)
> Discontinuity: Block 1521664 is at 1670144 (was 884735)
> Discontinuity: Block 1522688 is at 2685952 (was 1671167)
> Discontinuity: Block 1523712 is at 4226048 (was 2686975)
> Discontinuity: Block 1524736 is at 11271168 (was 4227071)
> Discontinuity: Block 1525760 is at 23952384 (was 11272191)
I think part of the issue is that by default the groups marked BLOCK_UNINIT
are skipped, to avoid dirtying those groups if they have never been used
before. This policy could be changed in the mballoc code pretty easily if
you think it is a net loss. Note that the size of the extents is large
enough (120MB or more) that some small reordering is probably not going
to affect the performance in any meaningful way.
Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Nov 06, 2007 13:54 -0600, Eric Sandeen wrote:
>> Hmm bad news is when I add uninit_groups into the mix, it goes a little
>> south again, with some out-of-order extents. Not the end of the world,
>> but a little unexpected?
> I think part of the issue is that by default the groups marked BLOCK_UNINIT
> are skipped, to avoid dirtying those groups if they have never been used
> before. This policy could be changed in the mballoc code pretty easily if
> you think it is a net loss. Note that the size of the extents is large
> enough (120MB or more) that some small reordering is probably not going
> to affect the performance in any meaningful way.
You're probably right; on the other hand, this is about the simplest
test an allocator could wish for - a single-threaded large linear write
in big IO chunks.
In this case it's probably not a big deal; I do wonder how it might
affect the bigger picture though, with more writing threads, aged
filesystems, and the like. Just thought it was worth pointing out, as I
started looking at allocator behavior in the simple/isolated/unrealistic
:) cases.
-Eric