2009-12-08 10:12:33

by Vyacheslav Dubeyko

Subject: About reserve of blocks for "overflow extents" in ext4 metadata

Hello,

I think it makes sense for ext4 metadata to include a reserve of blocks for "overflow extents" (that is, the extents that form a file's extent tree and are stored in blocks referenced from the inode's i_block field). The reserve can be placed by mkfs at file system creation time, as one contiguous run of blocks after the inode table of every virtual (FLEX_BG) group. The size and placement of this reserve would be described by a free special inode.

In my opinion, such a reserve solves the following problems:
1) When shrinking an ext4 volume (especially a very fragmented one), it can be very difficult to estimate whether the resize will succeed, because of the way extent trees are currently laid out on the volume. During the resize there may not be enough free blocks to rebuild the extent trees of relocated files. A reserve of blocks for "overflow extents" guarantees that this problem does not occur during resizes.
2) With the reserve in place, the extent trees of all files are located in one place. This, together with placing the reserve right after the inode table, will increase the efficiency of extent tree operations, in my opinion.
3) A localized layout of the files' extent trees also means more efficient journaling of this metadata.

I think the reserve could have the following on-disk layout. The reserve consists of a bitmap (which tracks used and free blocks in the reserve) and some number of blocks (used for extent trees). All blocks allocated to the reserve at volume creation must be marked as used in the block bitmap of the group(s) containing the reserve. The size of the reserve in blocks can be defined as inode_counts * count_blocks_for_inode (the number of blocks needed to form an extent tree of some average depth). The i_block field of the special inode that describes the reserve will hold two extents: 1) an extent describing the placement and size of the reserve's bitmap block(s); 2) an extent describing the placement and size of the blocks used for extent trees.
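
As a rough illustration of the sizing rule above, the sketch below computes the length of the two extents the special inode would describe: the reserve's bitmap and the blocks for extent trees. The function name, the blocks_per_inode parameter, and the example figures are assumptions made for illustration only; nothing like this exists in ext4 or mkfs.

```python
import math


def reserve_layout(inode_count, blocks_per_inode, block_size=4096):
    """Hypothetical sizing of the proposed "overflow extents" reserve.

    Returns (bitmap_blocks, tree_blocks), i.e. the lengths of the two
    extents the special inode would hold. All names are illustrative.
    """
    # Blocks set aside for extent tree nodes:
    # inode_count * blocks_per_inode, rounded up to whole blocks.
    tree_blocks = math.ceil(inode_count * blocks_per_inode)
    # The reserve's own bitmap needs one bit per reserved block, and
    # each bitmap block holds block_size * 8 bits.
    bits_per_bitmap_block = block_size * 8
    bitmap_blocks = math.ceil(tree_blocks / bits_per_bitmap_block)
    return bitmap_blocks, tree_blocks


# Example with assumed numbers: 65536 inodes, a quarter block per inode.
bitmap_blocks, tree_blocks = reserve_layout(65536, 0.25)
print(bitmap_blocks, tree_blocks)
```

The blocks_per_inode factor is the open question in this scheme: too small and the reserve runs out, too large and space is wasted.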

--
Vyacheslav Dubeyko <[email protected]>
Acronis


2009-12-08 15:48:23

by Eric Sandeen

Subject: Re: About reserve of blocks for "overflow extents" in ext4 metadata

Vyacheslav Dubeyko wrote:
> Hello,
>
> I think it makes sense for ext4 metadata to include a reserve of
> blocks for "overflow extents" (that is, the extents that form a
> file's extent tree and are stored in blocks referenced from the
> inode's i_block field). The reserve can be placed by mkfs at file
> system creation time, as one contiguous run of blocks after the
> inode table of every virtual (FLEX_BG) group. The size and placement
> of this reserve would be described by a free special inode.
>
> In my opinion, such a reserve solves the following problems:
> 1) When shrinking an ext4 volume (especially a very fragmented one),
> it can be very difficult to estimate whether the resize will succeed,
> because of the way extent trees are currently laid out on the volume.
> During the resize there may not be enough free blocks to rebuild the
> extent trees of relocated files. A reserve of blocks for "overflow
> extents" guarantees that this problem does not occur during resizes.
> 2) With the reserve in place, the extent trees of all files are
> located in one place. This, together with placing the reserve right
> after the inode table, will increase the efficiency of extent tree
> operations, in my opinion.
> 3) A localized layout of the files' extent trees also means more
> efficient journaling of this metadata.
>
> I think the reserve could have the following on-disk layout. The
> reserve consists of a bitmap (which tracks used and free blocks in
> the reserve) and some number of blocks (used for extent trees). All
> blocks allocated to the reserve at volume creation must be marked as
> used in the block bitmap of the group(s) containing the reserve. The
> size of the reserve in blocks can be defined as
> inode_counts * count_blocks_for_inode (the number of blocks needed to
> form an extent tree of some average depth). The i_block field of the
> special inode that describes the reserve will hold two extents:
> 1) an extent describing the placement and size of the reserve's
> bitmap block(s); 2) an extent describing the placement and size of
> the blocks used for extent trees.

If I understand this correctly, then you would be pre-reserving
all extent metadata blocks that are possible on the filesystem, in
the same way that we currently pre-provision inodes, at mkfs time?

What happens if we have a highly fragmented filesystem, and we
run out of these reserved "overflow extents" blocks? And would
overprovisioning waste more filesystem space as the inodes do
today?

-Eric

2009-12-08 18:23:54

by Andreas Dilger

Subject: Re: About reserve of blocks for "overflow extents" in ext4 metadata

On 2009-12-08, at 03:03, Vyacheslav Dubeyko wrote:
> I think it makes sense for ext4 metadata to include a reserve of
> blocks for "overflow extents" (that is, the extents that form a
> file's extent tree and are stored in blocks referenced from the
> inode's i_block field). The reserve can be placed by mkfs at file
> system creation time, as one contiguous run of blocks after the
> inode table of every virtual (FLEX_BG) group. The size and placement
> of this reserve would be described by a free special inode.
>
> In my opinion, such a reserve solves the following problems:
> 1) When shrinking an ext4 volume (especially a very fragmented one),
> it can be very difficult to estimate whether the resize will succeed,
> because of the way extent trees are currently laid out on the volume.
> During the resize there may not be enough free blocks to rebuild the
> extent trees of relocated files. A reserve of blocks for "overflow
> extents" guarantees that this problem does not occur during resizes.
> 2) With the reserve in place, the extent trees of all files are
> located in one place. This, together with placing the reserve right
> after the inode table, will increase the efficiency of extent tree
> operations, in my opinion.
> 3) A localized layout of the files' extent trees also means more
> efficient journaling of this metadata.

In fact, for most files the 4 extents that can be stored within the
inode itself provide enough space to store all of the extents of the
file. Reserving extra space is generally sub-optimal, either because
it wastes space when too many blocks are reserved (causing ENOSPC
before it is needed), or when too few blocks are reserved it will
cause the same failures as you report today.
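
The figure of 4 in-inode extents follows from the on-disk sizes: the inode's i_block area is 60 bytes, and the extent header, each extent entry, and each index entry are 12 bytes. A small sketch of that arithmetic, and of how the capacity grows with tree depth, assuming 4 KiB blocks with no checksum tail in the tree blocks (the function names are illustrative only):

```python
I_BLOCK_BYTES = 60   # i_block area in the inode (15 * 4 bytes)
HEADER_BYTES = 12    # struct ext4_extent_header
ENTRY_BYTES = 12     # struct ext4_extent and struct ext4_extent_idx


def entries_in_inode():
    """Extent (or index) entries that fit in i_block after the header."""
    return (I_BLOCK_BYTES - HEADER_BYTES) // ENTRY_BYTES


def entries_per_block(block_size=4096):
    """Entries that fit in one extent tree block after its header."""
    return (block_size - HEADER_BYTES) // ENTRY_BYTES


def max_extents(depth, block_size=4096):
    """Upper bound on extents for a tree of the given depth.

    depth 0 means all extents live in the inode; each extra level
    multiplies the capacity by the per-block fan-out.
    """
    return entries_in_inode() * entries_per_block(block_size) ** depth


for depth in range(3):
    print(depth, max_extents(depth))
```

Only files that need more than entries_in_inode() extents consume any extent tree block at all, which is why a reserve sized per inode tends to be either wasteful or insufficient.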

I wouldn't object to tuning the block allocator to pack index and
extent blocks into shared (in-memory) preallocated regions, but I
don't think that needs to be a hard reservation. The mballoc code
already has the concept of aggregating small IOs into a single free
chunk, and it makes sense to put the index/extent blocks together in
this way, to avoid seeking during e2fsck, and to avoid fragmenting the
free space with small allocations.

In fact, I thought Ted had done some work in this area already?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-12-09 11:02:55

by Vyacheslav Dubeyko

Subject: RE: About reserve of blocks for "overflow extents" in ext4 metadata

Hello,

> If I understand this correctly, then you would be pre-reserving all
> extent metadata blocks that are possible on the filesystem, in the
> same way that we currently pre-provision inodes, at mkfs time?

It is not necessary to pre-reserve all extent metadata blocks that are possible on the filesystem. I propose pre-reserving a reasonable, not very large share of those blocks, because some inodes have no extent tree at all while others can have a deep one. I think that file servers (where file deletion and creation are very frequent) will have a considerable number of extent trees. Currently, extent metadata blocks can be placed anywhere on the volume, which I think is inefficient.

> What happens if we have a highly fragmented filesystem, and we run
> out of these reserved "overflow extents" blocks? And would
> overprovisioning waste more filesystem space as the inodes do today?

If the existing reserve has no free blocks, we can try to allocate another portion of reserved "overflow extents" blocks. I think the pre-reservation scheme should reserve a block count that is adequate for the filesystem size and the needs of extent metadata without wasting filesystem space. Having such a reserve is very important for the resize case. ext4 (like ext3) already has reserved blocks for the GDT; I think it needs reserved blocks for extent metadata as well. It is also not mandatory to calculate the number of reserved "overflow extents" blocks from the inode count.
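
To make the "allocate another portion" idea concrete, here is a toy model of a reserve that grows by fixed-size chunks taken from the filesystem's free blocks. The class, its fields, and the chunk size are all hypothetical; this only models the accounting and is not how ext4 allocates anything.

```python
import errno


class OverflowReserve:
    """Toy accounting model of a reserve that grows chunk by chunk."""

    def __init__(self, chunk_blocks, fs_free_blocks):
        self.chunk_blocks = chunk_blocks      # size of each reserved portion
        self.fs_free_blocks = fs_free_blocks  # free blocks outside the reserve
        self.free_in_reserve = 0
        self.chunks = 0
        self._grow()                          # initial portion, as at mkfs time

    def _grow(self):
        """Move one more chunk from general free space into the reserve."""
        if self.fs_free_blocks < self.chunk_blocks:
            return False
        self.fs_free_blocks -= self.chunk_blocks
        self.free_in_reserve += self.chunk_blocks
        self.chunks += 1
        return True

    def alloc_tree_block(self):
        """Take one block for an extent tree node, growing if needed."""
        if self.free_in_reserve == 0 and not self._grow():
            raise OSError(errno.ENOSPC, "cannot extend the overflow reserve")
        self.free_in_reserve -= 1


reserve = OverflowReserve(chunk_blocks=2, fs_free_blocks=5)
for _ in range(4):
    reserve.alloc_tree_block()
print(reserve.chunks, reserve.fs_free_blocks)
```

In this model the growth step still depends on free space elsewhere on the volume, so once that is gone the allocation fails just as it would without a reserve.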

--
Vyacheslav Dubeyko <[email protected]>
Acronis

2009-12-09 15:31:39

by Theodore Ts'o

Subject: Re: About reserve of blocks for "overflow extents" in ext4 metadata

On Tue, Dec 08, 2009 at 01:03:28PM +0300, Vyacheslav Dubeyko wrote:
> 1) When shrinking an ext4 volume (especially a very fragmented one),
> it can be very difficult to estimate whether the resize will succeed,
> because of the way extent trees are currently laid out on the volume.
> During the resize there may not be enough free blocks to rebuild the
> extent trees of relocated files. A reserve of blocks for "overflow
> extents" guarantees that this problem does not occur during resizes.

I'm not sure how important it is to make fs shrink work "better",
since most of the time, system administrators are more interested in
growing file systems than shrinking file systems. That's one of the
reasons why we haven't spent more time trying to make resize2fs smarter
about avoiding fragmenting files while shrinking --- or perhaps even
doing some defragmentation as part of the resize. In some sense
that's something that we should think about doing if people want to be
using resizing as a common operation, as opposed to something rare
that works but isn't very well optimized. (I've always been a bit
concerned with Fedora using it as a regular way of making ISO disks,
since a number of files inevitably ended up being fragmented, for
ext3 or ext4 file systems, and seeks on CD's aren't exactly like that
of HDD's or SSD's.)

That being said, if the goal is to allow the shrink to succeed, we
just need a pool of reserve blocks. We do have something exactly like
that for non-privileged users (the 5% reserve). What you're talking
about is adding more of a reserve, and perhaps one that gets used for
the privileged users as well. We could do that, but people are
already annoyed about having 5% reserved. Reserving more would not be
popular.

> 2) With the reserve in place, the extent trees of all files are
> located in one place. This, together with placing the reserve right
> after the inode table, will increase the efficiency of extent tree
> operations, in my opinion.

We have this already. Directory and extent tree blocks are currently
preferentially allocated in the first block group of each flex_bg, and
data file blocks are preferentially allocated outside of the first
block group. If the file system gets full, then extent tree blocks or
data blocks can be located anywhere, of course.
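
The placement described above depends on which block group is the first one in its flex_bg. A minimal sketch of that mapping, assuming the number of groups per flex_bg is a power of two (ext4 records it as a log2 value in the superblock field s_log_groups_per_flex); the helper names and the example value of 4, meaning 16 groups per flex_bg, are illustrative:

```python
def flex_group(group, log_groups_per_flex):
    """Index of the flex_bg that contains the given block group."""
    return group >> log_groups_per_flex


def first_group_of_flex(group, log_groups_per_flex):
    """First block group of the flex_bg that contains the given group."""
    return flex_group(group, log_groups_per_flex) << log_groups_per_flex


# With 16 groups per flex_bg (log2 = 4), group 37 belongs to flex_bg 2,
# whose first block group is 32.
print(flex_group(37, 4), first_group_of_flex(37, 4))
```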

Your idea seems to have a contradiction, though. If you have a
"reserve of blocks" and you aren't necessarily allocating enough space
for the absolute worst case (which means reserving a *very* large
number of blocks), but you are allocating them under normal
circumstances, then there will be times when the reserve will be
exhausted. At that point, resize shrinking will once again be
problematic or not possible, and fragmentation losses will be quite bad.

> 3) A localized layout of the files' extent trees also means more
> efficient journaling of this metadata.

Um, how?

- Ted