Hi folks,
As a follow on to my previous issue with ext3, it's looking like the
indirect block allocator in ext4 is not doing a very good job of making
block allocations sequential. On a 1GB test filesystem, I'm getting
the following allocation results for 10MB files (written out with a single
10MB write()):
debugfs: stat testfile.0
Inode: 12 Type: regular Mode: 0600 Flags: 0x0 Generation: 2584871807
User: 0 Group: 0 Size: 10485760
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 20512
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
atime: 0x52d6de27 -- Wed Jan 15 14:14:47 2014
mtime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
BLOCKS:
(0-11):24576-24587, (IND):8797, (12-1035):24588-25611, (DIND):8798, (IND):8799,
(1036-2059):25612-26635, (IND):10248, (2060-2559):26636-27135
TOTAL: 2564
debugfs: stat testfile.1
Inode: 15 Type: regular Mode: 0600 Flags: 0x0 Generation: 1625569093
User: 0 Group: 0 Size: 10485760
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 20512
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
atime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
mtime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
BLOCKS:
(0-11):12288-12299, (IND):8787, (12-1035):12300-13323, (DIND):8790, (IND):8791,
(1036-2059):13324-14347, (IND):8789, (2060-2559):14348-14847
TOTAL: 2564
debugfs:
To give folks an idea of how significant the performance impact is: creating
files on my ext3 filesystem mounted as ext4 results in a 10-15% reduction in
speed when the data is read back into memory.
I also tested 3.11.7 and see the same poor allocation layout. I also
tried turning off delalloc, but there was no change in the layout of the
data blocks. Has anyone got any ideas what's going on here? Cheers,
-ben
--
"Thought is the essence of where you are now."
On Wed, Jan 15, 2014 at 02:28:02PM -0500, Benjamin LaHaise wrote:
> Hi folks,
>
> As a follow on to my previous issue with ext3, it's looking like the
> indirect block allocator in ext4 is not doing a very good job of making
> block allocations sequential. On a 1GB test filesystem, I'm getting
> the following allocation results for 10MB files (written out with a single
> 10MB write()):
>
> debugfs: stat testfile.0
> Inode: 12 Type: regular Mode: 0600 Flags: 0x0 Generation: 2584871807
> User: 0 Group: 0 Size: 10485760
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 20512
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
> atime: 0x52d6de27 -- Wed Jan 15 14:14:47 2014
> mtime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
> BLOCKS:
> (0-11):24576-24587, (IND):8797, (12-1035):24588-25611, (DIND):8798, (IND):8799,
> (1036-2059):25612-26635, (IND):10248, (2060-2559):26636-27135
> TOTAL: 2564
A dumpe2fs would be nice, but I think I have enough here to speculate:
The data blocks are all sequential, which looks like what one would expect from
mballoc. Is your complaint that the *IND blocks are not inline with the
data blocks, like what ext3 did?
FWIW, ext3 did something like this:
(0-11):6144-6155, (IND):6156, (12-1035):6157-7180, (DIND):7181, (IND):7182,
(1036-2059):7183-8206, (IND):8207, (2060-2559):8208-8707
I think the behavior that you're seeing is ext4 trying to keep the mapping
blocks close to the inode table to avoid fragmenting the file -- see
ext4_find_near() in indirect.c. There's an XXX comment in ext4_find_goal()
that implies that someone might have wanted to tie in with mballoc, which I
suppose you could use to restore the ext3 behavior... but there's no way to do
that.
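For the curious, the goal selection goes roughly like this (a simplified
paraphrase from memory, not the verbatim kernel source; see the real
ext4_find_near() for the details):
/*
 * Simplified paraphrase of the goal heuristic for indirect/metadata
 * blocks; see ext4_find_near() in fs/ext4/indirect.c for the real code.
 */
static ext4_fsblk_t find_near_sketch(struct inode *inode, Indirect *ind)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	__le32 *start = ind->bh ? (__le32 *) ind->bh->b_data : ei->i_data;
	__le32 *p;

	/* Aim at the most recently mapped block in this map, if any. */
	for (p = ind->p - 1; p >= start; p--)
		if (*p)
			return le32_to_cpu(*p);

	/* Otherwise aim near the indirect block we are chaining from. */
	if (ind->bh)
		return ind->bh->b_blocknr;

	/* Failing that, fall back to a goal in the inode's own (flex_)group. */
	return ext4_inode_to_goal_block(inode);
}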
--D
>
> debugfs: stat testfile.1
> Inode: 15 Type: regular Mode: 0600 Flags: 0x0 Generation: 1625569093
> User: 0 Group: 0 Size: 10485760
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 20512
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> atime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> mtime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> BLOCKS:
> (0-11):12288-12299, (IND):8787, (12-1035):12300-13323, (DIND):8790, (IND):8791,
> (1036-2059):13324-14347, (IND):8789, (2060-2559):14348-14847
> TOTAL: 2564
>
> debugfs:
>
> To give folks an idea of how significant the performance impact is: creating
> files on my ext3 filesystem mounted as ext4 results in a 10-15% reduction in
> speed when the data is read back into memory.
> I also tested 3.11.7 and see the same poor allocation layout. I also
> tried turning off delalloc, but there was no change in the layout of the
> data blocks. Has anyone got any ideas what's going on here? Cheers,
>
> -ben
> --
> "Thought is the essence of where you are now."
On Wed, Jan 15, 2014 at 12:22:14PM -0800, Darrick J. Wong wrote:
> A dumpe2fs would be nice, but I think I have enough here to speculate:
It's trivial to reproduce. Just create a 1GB file, run mkfs.ext3, then
mount with ext4 and dd a 10MB file onto the filesystem.
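The writer is nothing special; something along these lines is all it takes
(the /mnt/test mount point below is just an example path):
/* Write one 10MB file with a single write(), as in the test above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static char buf[10 * 1024 * 1024];	/* zero-filled 10MB buffer */
	int fd = open("/mnt/test/testfile.0", O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
		perror("write");
		return 1;
	}
	close(fd);
	return 0;
}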
> The data blocks are all sequential, which looks like what one would expect from
> mballoc. Is your complaint that the *IND blocks are not inline with the
> data blocks, like what ext3 did?
The problem is that the indirect blocks are nowhere near where the file's
data is. It'd be perfectly okay if they were at the beginning of the range
of blocks used for the file's data.
> FWIW, ext3 did something like this:
> (0-11):6144-6155, (IND):6156, (12-1035):6157-7180, (DIND):7181, (IND):7182,
> (1036-2059):7183-8206, (IND):8207, (2060-2559):8208-8707
>
> I think the behavior that you're seeing is ext4 trying to keep the mapping
> blocks close to the inode table to avoid fragmenting the file -- see
> ext4_find_near() in indirect.c. There's an XXX comment in ext4_find_goal()
> that implies that someone might have wanted to tie in with mballoc, which I
> suppose you could use to restore the ext3 behavior... but there's no way to do
> that.
...
I tried a few tests setting goal to different things, but evidently I'm
not managing to convince mballoc to put the file's data close to my goal
block; something in that mess of complicated logic is making it ignore
the goal value I'm passing in.
-ben
--
"Thought is the essence of where you are now."
On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> I tried a few tests setting goal to different things, but evidently I'm
> not managing to convince mballoc to put the file's data close to my goal
> block; something in that mess of complicated logic is making it ignore
> the goal value I'm passing in.
It appears that ext4_new_meta_blocks() essentially ignores the goal block
specified for metadata blocks. If I hack around things and pass in the
EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
ext4_alloc_blocks(), then it will at least try to allocate the block
specified by goal. However, if the block specified by goal is not free,
it ends up allocating blocks many megabytes away, even if one is free
within a few blocks of goal.
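The hack amounts to a one-argument change, roughly like this (quoting the
call site from memory, so treat it as a sketch rather than a tested patch):
	/*
	 * In ext4_alloc_blocks() (fs/ext4/indirect.c), where the indirect
	 * blocks are allocated: pass EXT4_MB_HINT_TRY_GOAL instead of 0 so
	 * mballoc at least attempts the goal block we computed.
	 */
	current_block = ext4_new_meta_blocks(handle, inode, goal,
					     EXT4_MB_HINT_TRY_GOAL,
					     &count, err);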
-ben
--
"Thought is the essence of where you are now."
On Wed, Jan 15, 2014 at 04:56:13PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> > I tried a few tests setting goal to different things, but evidently I'm
> > not managing to convince mballoc to put the file's data close to my goal
> > block; something in that mess of complicated logic is making it ignore
> > the goal value I'm passing in.
>
> It appears that ext4_new_meta_blocks() essentially ignores the goal block
> specified for metadata blocks. If I hack around things and pass in the
> EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
> ext4_alloc_blocks(), then it will at least try to allocate the block
> specified by goal. However, if the block specified by goal is not free,
> it ends up allocating blocks many megabytes away, even if one is free
> within a few blocks of goal.
I don't remember who sent in the patch to make this change, but the
goal of this change (which was deliberate) was to speed up operations
such as deletes, since the indirect blocks would be (ideally) close
together. If I recall correctly, the person who made this change was
more concerned about random read/write workloads than sequential
workloads. He or she did make the assertion that in general the
triple indirect and double indirect blocks would tend to be flushed
out of memory anyway.
Looking back, I'm not sure how strong that particular argument really
was, but I don't think we really spent a lot of time focusing on that
argument, given that extents were what was going to give the very
clear win.
Something that might be worth experimenting with is extending the
EXT4_IOC_PRECACHE_EXTENTS ioctl to support indirect-block-mapped files. If
we have managed to keep all of the indirect blocks close together at
the beginning of the flex_bg, and if we have indeed succeeded in
keeping the data blocks contiguous on disk, then sucking in all of the
indirect blocks and distilling it into a few extent status cache
entries might be the best way to accelerate performance.
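For reference, this is how the ioctl gets invoked from user space today on
extent-mapped files; the ioctl number lives in fs/ext4/ext4.h rather than a
uapi header, so the definition below is copied from the kernel tree as best
I recall it (double-check the value against your sources):
/* Sketch: trigger extent precaching on a file passed on the command line. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

#ifndef EXT4_IOC_PRECACHE_EXTENTS
#define EXT4_IOC_PRECACHE_EXTENTS	_IO('f', 18)
#endif

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, EXT4_IOC_PRECACHE_EXTENTS) < 0) {
		perror(argv[1]);
		return 1;
	}
	close(fd);
	return 0;
}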
If we can keep the data blocks for the multi-gigabyte file completely
contiguous on disk, then all of the indirect blocks (or extent tree)
can be stored in memory in a single 40 byte data structure. (Of
course, with a legacy ext3 file system layout, every 128 megs or so the
data blocks will be broken up by the block group metadata --- this is
one of the reasons why we implemented the flex_bg feature in ext4, to
relax the requirement that the inode table and allocation bitmaps for
a block group have to be stored in the block group. Still, using 320
bytes of memory for each 1G file is not too shabby.)
That way, we get the best of both worlds; because the indirect blocks
are close to each other (instead of being inline with the data blocks)
things like deleting the file will be fast. But so will precaching
all of the logical->physical block data, since we can read all of the
indirect blocks in at once, and then store it in memory in a highly
compacted form in the extents status cache.
Regards,
- Ted
Hi Ted,
On Wed, Jan 15, 2014 at 10:54:59PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 15, 2014 at 04:56:13PM -0500, Benjamin LaHaise wrote:
> > On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> > > I tried a few tests setting goal to different things, but evidently I'm
> > > not managing to convince mballoc to put the file's data close to my goal
> > > block; something in that mess of complicated logic is making it ignore
> > > the goal value I'm passing in.
> >
> > It appears that ext4_new_meta_blocks() essentially ignores the goal block
> > specified for metadata blocks. If I hack around things and pass in the
> > EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
> > ext4_alloc_blocks(), then it will at least try to allocate the block
> > specified by goal. However, if the block specified by goal is not free,
> > it ends up allocating blocks many megabytes away, even if one is free
> > within a few blocks of goal.
>
> I don't remember who sent in the patch to make this change, but the
> goal of this change (which was deliberate) was to speed up operations
> such as deletes, since the indirect blocks would be (ideally) close
> together. If I recall correctly, the person who made this change was
> more concerned about random read/write workloads than sequential
> workloads. He or she did make the assertion that in general the
> triple indirect and double indirect blocks would tend to be flushed
> out of memory anyway.
Any idea when this commit was made or what it was titled? I care about random
performance as well, but that can't be at the cost of making sequential
reads suck.
> Looking back, I'm not sure how strong that particular argument really
> was, but I don't think we really spent a lot of time focusing on that
> argument, given that extents were what was going to give the very
> clear win.
>
> Something that might be worth experimenting with is extending the
> EXT4_IOC_PRECACHE_EXTENTS ioctl to support indirect-block-mapped files. If
> we have managed to keep all of the indirect blocks close together at
> the beginning of the flex_bg, and if we have indeed succeeded in
> keeping the data blocks contiguous on disk, then sucking in all of the
> indirect blocks and distilling it into a few extent status cache
> entries might be the best way to accelerate performance.
The seek to get to the indirect blocks is still a cost that is not present
in ext3, meaning that the bar is pretty high to avoid a regression.
> If we can keep the data blocks for the multi-gigabyte file completely
> contiguous on disk, then all of the indirect blocks (or extent tree)
> can be stored in memory in a single 40 byte data structure. (Of
> course, with a legacy ext3 file system layout, every 128 megs or so the
> data blocks will be broken up by the block group metadata --- this is
> one of the reasons why we implemented the flex_bg feature in ext4, to
> relax the requirement that the inode table and allocation bitmaps for
> a block group have to be stored in the block group. Still, using 320
> bytes of memory for each 1G file is not too shabby.)
The files I'm dealing with are usually 8MB in size, and there can be up
to 1 million of them. In such a use-case, I don't expect the inodes will
always remain cached in memory (some of the systems involved only have
4GB of RAM), so adding another metadata cache won't fix the regression.
The crux of the issue is that the indirect blocks are getting placed many
*megabytes* away from the data blocks. Incurring a seek for every 4MB
of data read seems pretty painful. Putting the metadata closer to the
data seems like the right thing to do. And it should help the random
i/o case as well.
-ben
> That way, we get the best of both worlds; because the indirect blocks
> are close to each other (instead of being inline with the data blocks)
> things like deleting the file will be fast. But so will precaching
> all of the logical->physical block data, since we can read all of the
> indirect blocks in at once, and then store it in memory in a highly
> compacted form in the extents status cache.
>
> Regards,
>
> - Ted
--
"Thought is the essence of where you are now."
On Thu, Jan 16, 2014 at 01:48:26PM -0500, Benjamin LaHaise wrote:
>
> Any idea when this commit was made or what it was titled? I care about random
> performance as well, but that can't be at the cost of making sequential
> reads suck.
Thinking about this some more, I think it was made as part of the
changes to better take advantage of the flex_bg feature in ext4. The
idea was to keep metadata blocks such as directory blocks and extent
trees closer together. I don't think when we made that change we
really consciously thought that much about indirect block support,
since that was viewed as a legacy feature for backwards compatibility
support in ext4. (This was years ago, before distributions started
wanting to support only one code base for ext3 and ext4 file systems.)
I *know* we've had this discussion about whether to put the indirect
blocks inline with the data, or closer together to speed up metadata
operations (i.e., unlink, fsck, etc.) before, though. There was a
patch against ext3 I remember looking at which forced the indirect
blocks to the end of the previous block group. That kept the indirect
blocks closer together, and on average 64MB away from the data blocks.
As I recall, the stated reason for the patch was to make unlinks of
backups of DVD images not take forever and a day.
I'm pretty sure we've had it at least once on the weekly ext4
concalls, and I'm pretty sure we've had it in one hallway track or
another. Ultimately, extents are such a huge win that it's not clear
it's really worth that much effort to try to optimize indirect blocks,
which are a lose no matter how you slice and dice things.
> The files I'm dealing with are usually 8MB in size, and there can be up
> to 1 million of them. In such a use-case, I don't expect the inodes will
> always remain cached in memory (some of the systems involved only have
> 4GB of RAM), so adding another metadata cache won't fix the regression.
> The crux of the issue is that the indirect blocks are getting placed many
> *megabytes* away from the data blocks. Incurring a seek for every 4MB
> of data read seems pretty painful. Putting the metadata closer to the
> data seems like the right thing to do. And it should help the random
> i/o case as well.
An 8MB file will require two indirect blocks. If you are using
extents, almost certainly it will fit inside the inode, which means we
don't need any external metadata blocks. That massively speeds up
fsck time, and unlink time, and it also speeds up the random read case
since the best way to optimize a seek is to eliminate it. :-)
I understand that for your use case, it would be hard to move to using
extents right away. But I think you'd see so many improvements from
going to ext4 and extents that it might be more efficient than trying to
optimize an indirect block scheme.
- Ted
On Thu, Jan 16, 2014 at 02:12:27PM -0500, Theodore Ts'o wrote:
> An 8MB file will require two indirect blocks. If you are using
> extents, almost certainly it will fit inside the inode, which means we
> don't need any external metadata blocks. That massively speeds up
> fsck time, and unlink time, and it also speeds up the random read case
> since the best way to optimize a seek is to eliminate it. :-)
> I understand that for your use case, it would be hard to move to using
> extents right away. But I think you'd see so many improvements from
> going to ext4 and extents that it might be more efficient than trying to
> optimize an indirect block scheme.
Unfortunately, the improvements from extents for our use-case are not
enough to outweigh the other costs of deployment. I think I've figured
out a hack that results in the system doing most of what I want it to do:
I've removed EXT4_MB_HINT_DATA in ext4_alloc_blocks(). With that change,
the allocator is giving me mostly sequential allocations. Hopefully that
doesn't have any other negative side effects.
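The change itself is just not setting the hint where the data block request
is built in ext4_alloc_blocks(); roughly (from memory, so the surrounding
context may differ slightly):
	/*
	 * In ext4_alloc_blocks() (fs/ext4/indirect.c), where the allocation
	 * request for the data blocks is filled in: skip EXT4_MB_HINT_DATA
	 * so mballoc stops steering the data away from the metadata goal.
	 */
	memset(&ar, 0, sizeof(ar));
	ar.inode = inode;
	ar.goal = goal;
	ar.len = target;
	ar.logical = iblock;
	/*
	 * if (S_ISREG(inode->i_mode))
	 *	ar.flags = EXT4_MB_HINT_DATA;
	 */
	current_block = ext4_mb_new_blocks(handle, &ar, err);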
-ben
> - Ted
--
"Thought is the essence of where you are now."
On 1/16/14, 1:12 PM, Theodore Ts'o wrote:
> On Thu, Jan 16, 2014 at 01:48:26PM -0500, Benjamin LaHaise wrote:
>>
>> Any idea when this commit was made or what it was titled? I care about random
>> performance as well, but that can't be at the cost of making sequential
>> reads suck.
>
> Thinking about this some more, I think it was made as part of the
> changes to better take advantage of the flex_bg feature in ext4. The
> idea was to keep metadata blocks such as directory blocks and extent
> trees closer together. I don't think when we made that change we
> really consciously thought that much about indirect block support,
> since that was viewed as a legacy feature for backwards compatibility
> support in ext4. (This was years ago, before distributions started
> wanting to support only one code base for ext3 and ext4 file systems.)
Just to nitpick, wasn't this always the plan? ;)
https://lkml.org/lkml/2006/6/28/454 :
> 4) At some point, probably in 6-9 months when we are satisified with the
> set of features that have been added to fs/ext4, and confident that the
> filesystem format has stablized, we will submit a patch which causes the
> fs/ext4 code to register itself as the ext4 filesystem.
-Eric
p.s. "6-9 months" ;)