Hi,
another idea for how to achieve more (dynamic) inodes:
* new dir_entry format with 64bit inum
* ino space is 64bit:
* 2^48 phys. 4K blocks
* 2^5 inodes in 4K block
* highest bit is used to choose addressing schema: static or dynamic
* each block is covered by two bits: one in the inode (I) and one in the block (B) bitmap:
I: 0, B: 0 - block is just free
I: 0, B: 1 - block is used, but does not contain inodes
I: 1, B: 0 - block is full of inodes
I: 1, B: 1 - block contains a few inodes, has free space
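(To make the proposed layout concrete, a minimal userspace sketch of the
inode-number encoding; all names here are made up for illustration and
nothing below is existing ext4 code:)

#include <stdint.h>

/* bit 63      - addressing schema: 1 = dynamic, 0 = static
 * bits 5..52  - physical 4K block holding the inode (2^48 blocks)
 * bits 0..4   - inode slot within that block (2^5 inodes per block)
 */
#define DYN_INO_FLAG    (1ULL << 63)
#define SLOT_BITS       5
#define SLOT_MASK       ((1ULL << SLOT_BITS) - 1)

static inline uint64_t make_dyn_ino(uint64_t block, unsigned slot)
{
        return DYN_INO_FLAG | (block << SLOT_BITS) | (slot & SLOT_MASK);
}

static inline int ino_is_dynamic(uint64_t ino)
{
        return (ino & DYN_INO_FLAG) != 0;
}

static inline uint64_t dyn_ino_block(uint64_t ino)
{
        return (ino & ~DYN_INO_FLAG) >> SLOT_BITS;
}

static inline unsigned dyn_ino_slot(uint64_t ino)
{
        return (unsigned)(ino & SLOT_MASK);
}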
implementation issues:
* how to allocate new blocks for inodes (see the sketch after this list)
* try to find a block with an empty inode slot using the bitmaps
* if we can't find one - allocate a new block and make it an inode block
* if there is no free block in this group - repeat the previous two steps in other groups
* how to release block from inodes
* if the block has no used inodes anymore, we mark it 0 in I and 1 in B
* or, if the group has very few free inodes, leave it as an inode block
* how to find free inode in existing inode block
* just scan all slots
* how to migrate/co-exist with static inodes
* group is marked with DYNAMIC_INODES flag
* we can turn a group dynamic when it has 0 inodes (simple)
* we can turn a group dynamic and update the bitmaps (hard)
* how to implement varlen inodes
* should be simple at allocation time
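(a rough userspace sketch of the allocation loop above; the group
structure and bit helpers are invented stand-ins for the kernel's
bitmap primitives:)

#include <stdbool.h>

#define BPW (8 * sizeof(unsigned long))

struct group {                          /* illustrative stand-in only */
        unsigned long *I;               /* inode bitmap, 1 bit per block */
        unsigned long *B;               /* block bitmap, 1 bit per block */
        unsigned long nr_blocks;
};

static bool test_bit_ul(const unsigned long *map, unsigned long n)
{
        return (map[n / BPW] >> (n % BPW)) & 1;
}

static void set_bit_ul(unsigned long *map, unsigned long n)
{
        map[n / BPW] |= 1UL << (n % BPW);
}

/* Return a block in this group that can hold a new inode, or -1 so the
 * caller can retry in other groups.  First look for an I:1 B:1 block
 * (contains inodes, has free slots); failing that, convert an I:0 B:0
 * (free) block into a new inode block.
 */
static long find_inode_block(struct group *g)
{
        for (unsigned long b = 0; b < g->nr_blocks; b++)
                if (test_bit_ul(g->I, b) && test_bit_ul(g->B, b))
                        return (long)b;

        for (unsigned long b = 0; b < g->nr_blocks; b++)
                if (!test_bit_ul(g->I, b) && !test_bit_ul(g->B, b)) {
                        set_bit_ul(g->I, b);    /* now an inode block... */
                        set_bit_ul(g->B, b);    /* ...with free slots */
                        return (long)b;
                }

        return -1;
}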
possibilities:
* use large inodes to store directory entries
(4096-128)/(4+8+8) = 198 entries
* use large inodes to store EAs
* if we introduce a small header for the new inode blocks, we could use it to store tails
problems:
* can't relocate inode w/o changing directory entry
* can't change inode size after creation
* e2fsck?
comments/suggestions are very welcome.
thanks, Alex
Sadly this was sitting in my outbox overnight, and might be obsolete
already (explanation in a follow-up email), but I'm sending it as food
for thought...
On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> another idea for how to achieve more (dynamic) inodes:
> * new dir_entry format with 64bit inum
Yes, that is a requirement in all cases.
I've always thought that we should also implement inode-in-dirent when
we need to change the dirent format and make dynamic inodes, but that
may be too much to chew on at one time.
> * ino space is 64bit:
> * 2^48 phys. 4K blocks
> * 2^5 inodes in 4K block
The 2^5 inodes/4kB block would actually depend on the blocksize/inodesize;
let's just call this inodes-per-block-bits (IPBB). It will be between 0
and 8 (i.e. between 2^0 = 1 and 2^8 = 256 inodes per block), which is fine.
For common ext4 filesystems this would be 2^4 = 16 inodes/block, because
the default is 256-byte inodes today.
> * highest bit is used to choose addressing schema: static or dynamic
Alternately, any inode >= 2^32 would be dynamic? One clear benefit of
putting the dynamic inodes at the end of the number space is that they
will only be used if the static inodes are full, which reduces risk due
to corruption and overhead due to dynamic allocations.
> * each block is covered by two bits: one in the inode (I) and one in the block (B) bitmap:
> I: 0, B: 0 - block is just free
> I: 0, B: 1 - block is used, but does not contain inodes
> I: 1, B: 0 - block is full of inodes
> I: 1, B: 1 - block contains a few inodes, has free space
Storing B:0 for an in-use block seems very dangerous to me. This also
doesn't really address the need to be able to quickly locate free inodes,
because it means "I:1" _might_ mean the inode is free or it might not,
so EVERY "in-use" inode would need to be checked to see if it is free.
We need to start with a "dynamic inode bitmap" (DIB) that is mapped from
an "inode table file" (possibly only for the dynamic inode table blocks).
Free inodes can be scanned using the normal ext4_find_next_zero_bit()
in each of the bitmaps.
Each such DIB block holds an array of bits indicating dynamic inode
use, as well as an array of block numbers which map IPBB inode bits to
dynamic inode table blocks. The DIBB should also have a header which
contains space for a magic, a checksum, and the count of free and total
inodes, like a GDT has, as well as a count of in-use itable blocks.
The dynamic inode table blocks (DITB) should also hold a header with
magic, checksum, and a back-pointer to the DIBB. The back-pointer
allows efficient clearing of the in-use bit, location of the DIBB if the
dynamic inode itself is corrupted, and possibly freeing of the DITB when
the last in-use inode is freed.
For common 256-byte inodes and 4kB blocks we need 8 bytes/block for the
block addresses, and 1 bit/inode, so
4096 bytes/block / 256 bytes/inode = 16 inodes(bits)/block = 2 byte bitmap
(4096 bytes - 64-byte header) / (8 byte address + 2 byte bitmap) =
403 itable blocks per DIBB = 403 * 16 = 6448 inodes/DIBB
65536 bytes/block / 256 bytes/inode = 256 inodes(bits)/block = 32 byte bitmap
(65536 bytes - 64-byte header) / (8 byte address + 32 byte bitmap) =
1636 itable blocks per DIBB = 1636 * 256 = 418816 inodes/DIBB
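(To make the layout concrete, a rough C sketch of the DIBB and DITB
described above, sized for the 4kB block / 256-byte inode case; all
names, the magic value and the exact 64-byte header layout are my
assumptions, not a settled format:)

#include <stdint.h>

#define DIBB_MAGIC      0xD1B0          /* made-up value */
#define DIBB_ENTRIES    403             /* (4096 - 64) / (8 + 2) */

struct dibb_header {                    /* 64 bytes total */
        uint32_t magic;
        uint32_t checksum;
        uint32_t free_inodes;
        uint32_t total_inodes;
        uint32_t itable_blocks;         /* in-use DITB count */
        uint8_t  pad[44];
};

struct dibb_entry {
        uint64_t ditb_block;            /* block number of the itable block */
        uint16_t inode_bitmap;          /* 1 bit per inode slot, 16 slots */
} __attribute__((packed));              /* 10 bytes/entry */

struct dibb {                           /* 64 + 403 * 10 = 4094 <= 4096 */
        struct dibb_header hdr;
        struct dibb_entry  map[DIBB_ENTRIES];
};

struct ditb_header {                    /* at the front of each itable block */
        uint32_t magic;
        uint32_t checksum;
        uint64_t dibb_backref;          /* block holding the owning DIBB */
};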
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> another idea for how to achieve more (dynamic) inodes:
Actually, José proposed a _very_ simple idea that would allow dynamic
inodes with relatively low code complexity or risk due to dynamic
placement of inode tables.
The basic idea is to extend the FLEX_BG feature so that (essentially)
"blockless groups" can be added to the filesystem when the inodes are
all gone. The core idea of FLEX_BG is that the "group metadata" (inode
and block bitmaps, inode table) can be placed anywhere in the filesystem.
This implies that a "block group" is strictly just a contiguous range of
blocks, and somewhere in the filesystem is the metadata that describes its
usage.
If one adds a new group (ostensibly "at the end of the filesystem") that
has a flag which indicates there are no blocks available in the group,
then what we get is the inode bitmap and inode table, with a 1-block
"excess baggage" of the block bitmap and a new group descriptor. The
"baggage" is small considering any overhead needed to locate and describe
fully dynamic inode tables.
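(Illustration only: one way the "no blocks" flag could look in the
group descriptor. The flag value and the cut-down descriptor below are
invented for this sketch, not the real ext4_group_desc layout:)

#include <stdint.h>

#define EXT4_BG_NO_BLOCKS 0x0010        /* invented: group contributes no data blocks */

struct gdesc {                          /* cut-down stand-in for a GDT entry */
        uint64_t itable_block;          /* FLEX_BG: itable can live anywhere */
        uint32_t free_blocks;
        uint16_t flags;
};

/* The block allocator simply skips inode-only groups: */
static int group_has_data_blocks(const struct gdesc *gd)
{
        return !(gd->flags & EXT4_BG_NO_BLOCKS) && gd->free_blocks > 0;
}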
A big plus is that there are very few changes needed to the kernel or
e2fsck (the "dynamic inode table" is just a group which has no space
for data). Some quick checks on 10 filesystems (some local, some
server) show that there is enough contiguous space in the filesystems
to allocate a full inode table (between 1-4MB for most filesystems), and
mballoc can help with this. This makes sense because the cases where
there is a shortage of inodes also mean there is an excess of space,
and if the inodes were under-provisioned it also (usually) means the
itable is on the smaller side.
Another important benefit is that the 32-bit inode space is used fully
before there is any need to grow to 64-bit inodes. This avoids the
compatibility issues with userspace to the maximum possible extent,
without any complex remapping of inode numbers.
We could hedge our bets for finding large enough contiguous itable space
and allow the itable to be smaller than normal, and mark the end inodes
as in-use. e2fsck will in fact consider any blocks under the rest of
the inode table as "shared blocks" and do duplicate block processing to
remap the data blocks. We could also leverage online defrag to remap
the blocks before allocating the itable if there isn't enough space.
Another major benefit of this approach is that the "dynamic" inode table
is actually relatively static in location, and we don't need a tree to
find it. We would continue to use the "normal" group inodes first, and
only add dynamic groups if there are no free inodes. It would also be
possible to remove the last dynamic group if all its inodes are freed.
The itable location would be replicated to all of the group descriptor
backups for safety, though we would need to find a way for "META_BG"
to store a backup of the GDT in blocks that don't exist, in the case
where increasing the GDT size in-place isn't possible.
The drawback of the approach is relatively coarse-grained itable
allocation, which would fail if the filesystem is highly fragmented,
but we don't _have_ to succeed either. The coarse-grained approach is
also a benefit because we don't need complex data structures to find the
itable, it reduces seeking during e2fsck, and we can keep some hysteresis
in adding/removing dynamic groups to reduce overhead (updates of many
GDT backups).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> Alternately, any inode >= 2^32 would be dynamic? One clear benefit of
> putting the dynamic inodes at the end of the number space is that they
> will only be used if the static inodes are full, which reduces risk due
> to corruption and overhead due to dynamic allocations.
the highest (63rd) bit puts dynamic inodes at the end, no? the idea is that
using 48 bits we can address the block directly, w/o any additional lookup via
some metainode. essentially this is just a way to introduce 64bit-addressable
fragments of 2^(64-48-1) size.
>> * each block is covered by two bits: one in the inode (I) and one in the block (B) bitmap:
>> I: 0, B: 0 - block is just free
>> I: 0, B: 1 - block is used, but does not contain inodes
>> I: 1, B: 0 - block is full of inodes
>> I: 1, B: 1 - block contains a few inodes, has free space
>
> Storing B:0 for an in-use block seems very dangerous to me. This also
> doesn't really address the need to be able to quickly locate free inodes,
> because it means "I:1" _might_ mean the inode is free or it might not,
> so EVERY "in-use" inode would need to be checked to see if it is free.
just combine I and B into a single bitmap:
1) when you look for a free block, it's any 0 bit in the bitmap made by (I & B)
2) when you look for a free inode (in current inode blocks), it's any 1 bit
in the bitmap made again by (I & B); then you read the corresponding block and
find a free slot there (for example, it can be a null i_mode)
looks very simple and doable?
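(a tiny userspace sketch of the two lookups; in the kernel this would
presumably build on ext4_find_next_zero_bit() and friends. Note the
free-block case, as Andreas points out below, still needs the
individual bitmaps:)

#define BPW (8 * sizeof(unsigned long))

/* Free inode: a 1 bit in (I & B) names a partially filled inode block;
 * the caller then reads that block and scans its slots for a null
 * i_mode.
 */
static long find_partial_inode_block(const unsigned long *I,
                                     const unsigned long *B,
                                     unsigned long nbits)
{
        for (unsigned long n = 0; n < nbits; n++)
                if ((I[n / BPW] & B[n / BPW]) >> (n % BPW) & 1)
                        return (long)n;
        return -1;
}

/* Free block: a 0 bit in (I & B) is only a candidate, since it also
 * covers used data blocks (I:0 B:1) and full inode blocks (I:1 B:0);
 * a truly free block is 0 in both bitmaps.
 */
static long find_free_block(const unsigned long *I,
                            const unsigned long *B,
                            unsigned long nbits)
{
        for (unsigned long n = 0; n < nbits; n++)
                if (!(((I[n / BPW] | B[n / BPW]) >> (n % BPW)) & 1))
                        return (long)n;
        return -1;
}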
> We need to start with a "dynamic inode bitmap" (DIB) that is mapped from
> an "inode table file" (possibly only for the dynamic inode table blocks).
> Free inodes can be scanned using the normal ext4_find_next_zero_bit()
> in each of the bitmaps.
the idea is that we can implement truly dynamic and varlen inodes w/o
introducing special files, using existing structures.
thanks, Alex
On Sep 26, 2008 03:00 +0400, Alex Tomas wrote:
> Andreas Dilger wrote:
>> Storing B:0 for an in-use block seems very dangerous to me. This also
>> doesn't really address the need to be able to quickly locate free inodes,
>> because it means "I:1" _might_ mean the inode is free or it might not,
>> so EVERY "in-use" inode would need to be checked to see if it is free.
>
> just combine I and B into a single bitmap:
> 1) when you look for a free block, it's any 0 bit in the bitmap made by (I & B)
> 2) when you look for a free inode (in current inode blocks), it's any 1 bit
> in the bitmap made again by (I & B); then you read the corresponding block and
> find a free slot there (for example, it can be a null i_mode)
>
> looks very simple and doable?
It _sounds_ simple, but I think the implementation will not be what
is expected. Either you need to keep a 3rd bitmap for each group,
(I&B), used for finding either inodes or blocks first (with
find_first_bit() or find_first_zero_bit() respectively) and then check
the "normal" inode and block bitmaps; this has to be kept in sync with
mballoc, and invites confusion/danger on disk and in e2fsck because
in-use itable blocks are marked "0" in the block bitmap. There will
also be races between updating these bitmaps, unless the group is
locked for both block and inode allocations on any update, because
setting any bit completely changes the meaning.
Alternately, if there are only I and B bitmaps, then find_first_bit()
and find_first_zero_bit() are not useful. Searching for free blocks
means looking for "B:0" and finding potentially many "B:0 I:1" blocks
that are full of inodes. Searching for free inodes means looking for
"I:1" (strangely) but finding potentially many "I:1 B:0" blocks.
I much prefer the dynamic itable idea from José (which I embellished in
my other email), which is very simple for both the kernel and e2fsck,
robust, and avoids the 64-bit inode problem for userspace to the maximum
amount (i.e. a full 4B inodes must be in use before we ever need to
use 64-bit inodes). The lack of complexity in itable allocation also
translates directly into increased robustness in the face of corruption.
It doesn't provide dynamic-sized inodes (which hasn't traditionally
been a problem), nor is it perfect in terms of being able to fully
populate a filesystem with inodes in all use cases but it could work
in all but completely pathological fragmentation cases (at which point
one wonders if it isn't better to just return -ENOSPC than to flog a
nearly dead filesystem). It can definitely do a good job in most likely
uses, and also provides a big win over what is done today.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Thu, 25 Sep 2008 16:37:31 -0600
Andreas Dilger <[email protected]> wrote:
> On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> > another idea for how to achieve more (dynamic) inodes:
>
> Actually, José proposed a _very_ simple idea that would allow dynamic
> inodes with relatively low code complexity or risk due to dynamic
> placement of inode tables.
>
> The basic idea is to extend the FLEX_BG feature so that (essentially)
> "blockless groups" can be added to the filesystem when the inodes are
> all gone. The core idea of FLEX_BG is that the "group metadata" (inode
> and block bitmaps, inode table) can be placed anywhere in the filesystem.
> This implies that a "block group" is strictly just a contiguous range of
> blocks, and somewhere in the filesystem is the metadata that describes its
> usage.
>
> If one adds a new group (ostensibly "at the end of the filesystem") that
> has a flag which indicates there are no blocks available in the group,
> then what we get is the inode bitmap and inode table, with a 1-block
> "excess baggage" of the block bitmap and a new group descriptor. The
> "baggage" is small considering any overhead needed to locate and describe
> fully dynamic inode tables.
>
> A big plus is that there are very few changes needed to the kernel or
> e2fsck (the "dynamic inode table" is just a group which has no space
> for data). Some quick checks on 10 filesystems (some local, some
> server) show that there is enough contiguous space in the filesystems
> to allocate a full inode table (between 1-4MB for most filesystems), and
> mballoc can help with this. This makes sense because the cases where
> there is a shortage of inodes also mean there is an excess of space,
> and if the inodes were under-provisioned it also (usually) means the
> itable is on the smaller side.
>
> Another important benefit is that the 32-bit inode space is used fully
> before there is any need to grow to 64-bit inodes. This avoids the
> compatibility issues with userspace to the maximum possible extent,
> without any complex remapping of inode numbers.
>
> We could hedge our bets for finding large enough contiguous itable space
> and allow the itable to be smaller than normal, and mark the end inodes
> as in-use. e2fsck will in fact consider any blocks under the rest of
> the inode table as "shared blocks" and do duplicate block processing to
> remap the data blocks. We could also leverage online defrag to remap
> the blocks before allocating the itable if there isn't enough space.
>
> Another major benefit of this approach is that the "dynamic" inode table
> is actually relatively static in location, and we don't need a tree to
> find it. We would continue to use the "normal" group inodes first, and
> only add dynamic groups if there are no free inodes. It would also be
> possible to remove the last dynamic group if all its inodes are freed.
>
> The itable location would be replicated to all of the group descriptor
> backups for safety, though we would need to find a way for "META_BG"
> to store a backup of the GDT in blocks that don't exist, in the case
> where increasing the GDT size in-place isn't possible.
One way to get around this is to implement the exact opposite of what I
proposed earlier and have a block group with no inode tables. If we do
a 1:1 distribution of inodes to blocks and don't allocate inode tables
for a series of block groups within a flexbg, we could later on attempt
to allocate new inode tables when we run out of inodes. If we leave
holes in the inode numbers for the missing inode tables, adding new
inode tables in these block groups would not require any inode
renumbering. This also does not break the current inode allocator,
which would be a good thing. This should be even simpler to implement
than the previous proposal. The drawback is that when allocating a
new inode table, the 1:1 distribution of inodes to blocks would mean
that we need to find a bigger chunk of contiguous blocks, since we
have bigger inode tables per block group. Since the current inode
allocator tries to keep 10% of the blocks in a flexbg free, finding
contiguous blocks may not be a really big issue. Another issue is 64bit
filesystems if we use a 1:1 scheme.
This would be like uninitialized inode tables, with the added steps of
finding free blocks, allocating a new inode table and zeroing the newly
created table. Since we could choose to allocate a new inode
table on the flexbg with the most free blocks, this could keep filesystem
meta-data/data layout consistently close together to maintain
predictable performance. This option also has no overhead compared to
the previous proposal.
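(A sketch of why the holes avoid any renumbering: the static
inode-number mapping is pure arithmetic, so a group whose itable was
never allocated simply leaves its slice of the number space unused.
The per-group state below is hypothetical; only the (group, index)
formulas reflect how ext4 numbers inodes today:)

#include <stdint.h>
#include <stdbool.h>

/* ext4's static mapping: inode numbers start at 1 and are laid out
 * group by group, whether or not the group's itable exists yet.
 */
static inline uint32_t ino_to_group(uint32_t ino, uint32_t inodes_per_group)
{
        return (ino - 1) / inodes_per_group;
}

static inline uint32_t ino_to_index(uint32_t ino, uint32_t inodes_per_group)
{
        return (ino - 1) % inodes_per_group;
}

/* Hypothetical per-group state: a group created without an itable
 * keeps its inode-number range reserved (a hole); instantiating the
 * table later makes those numbers valid with no renumbering.
 */
struct group_state {
        bool itable_allocated;
        uint64_t itable_block;          /* meaningful only once allocated */
};

static bool ino_backed_by_itable(uint32_t ino, uint32_t ipg,
                                 const struct group_state *groups)
{
        return groups[ino_to_group(ino, ipg)].itable_allocated;
}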
>
> The drawback of the approach is relatively coarse-grained itable
> allocation, which would fail if the filesystem is highly fragmented,
> but we don't _have_ to succeed either. The coarse-grained approach is
> also a benefit because we don't need complex data structures to find the
> itable, it reduces seeking during e2fsck, and we can keep some hysteresis
> in adding/removing dynamic groups to reduce overhead (updates of many
> GDT backups).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> --
-JRS
On Thu, Sep 25, 2008 at 04:37:31PM -0600, Andreas Dilger wrote:
> If one adds a new group (ostensibly "at the end of the filesystem") that
> has a flag which indicates there are no blocks available in the group,
> then what we get is the inode bitmap and inode table, with a 1-block
> "excess baggage" of the block bitmap and a new group descriptor. The
> "baggage" is small considering any overhead needed to locate and describe
> fully dynamic inode tables.
It's a good idea; and technically you don't have to allocate a block
bitmap, given that the flag is present which says "no blocks
available". The reason for allocating it is if you're trying to
maintain full backwards compatibility, it will work --- except that
you need some way of making sure that the on-line resizing code won't
screw with the filesystem --- so the feature would have to be a
read/only compat feature anyway.
To do on-line resizing, you'd have to clear the flag and then know
that the first "inode-only" block group should be given the new
blocks.
> The itable location would be replicated to all of the group descriptor
> backups for safety, though we would need to find a way for "META_BG"
> to store a backup of the GDT in blocks that don't exist, in the case
> where increasing the GDT size in-place isn't possible.
This is actually the big problem; with META_BG, in order to find the
group descriptor blocks, it assumes that the first group descriptor
can be found at the beginning of the group descriptor block, which
means it has to be found at a certain offset from the beginning of the
filesystem. And this would not be true for inode-only block groups.
The simplest solution actually would be to allocate inodes from the
*end* of the 32-bit inode space, growing downwards, and having those
inodes be stored in a reserved inode. You would lose block locality,
although that could be solved by adding a block group affinity field
in the inode structure which is used by "extended inodes".
- Ted
On Sep 25, 2008 22:11 -0400, Theodore Ts'o wrote:
> On Thu, Sep 25, 2008 at 04:37:31PM -0600, Andreas Dilger wrote:
> > If one adds a new group (ostensibly "at the end of the filesystem") that
> > has a flag which indicates there are no blocks available in the group,
> > then what we get is the inode bitmap and inode table, with a 1-block
> > "excess baggage" of the block bitmap and a new group descriptor. The
> > "baggage" is small considering any overhead needed to locate and describe
> > fully dynamic inode tables.
>
> It's a good idea; and technically you don't have to allocate a block
> bitmap, given that the flag is present which says "no blocks
> available". The reason for allocating it is if you're trying to
> maintain full backwards compatibility, it will work --- except that
> you need some way of making sure that the on-line resizing code won't
> screw with the filesystem --- so the feature would have to be a
> read/only compat feature anyway.
Sure, I agree it is possible to go either way. I was just trying to
go for the element of least surprise. Having a group with
"bg_block_bitmap = 0" would be strange, but no more strange than having
a group for blocks beyond the end of the filesystem...
> To do on-line resizing, you'd have to clear the flag and then know
> that the first "inode-only" block group should be given the new
> blocks.
Right.
> > The itable location would be replicated to all of the group descriptor
> > backups for safety, though we would need to find a way for "META_BG"
> > to store a backup of the GDT in blocks that don't exist, in the case
> > where increasing the GDT size in-place isn't possible.
>
> This is actually the big problem; with META_BG, in order to find the
> group descriptor blocks, it assumes that the first group descriptor
> can be found at the beginning of the group descriptor block, which
> means it has to be found at a certain offset from the beginning of the
> filesystem. And this would not be true for inode-only block groups.
We could special-case the placement of the GDT blocks in this case, and
then put them into the proper META_BG location when/if the blocks are
actually added to the filesystem.
> The simplest solution actually would be to allocate inodes from the
> *end* of the 32-bit inode space, growing downwards, and having those
> inodes be stored in a reserved inode. You would lose block locality,
> although that could be solved by adding a block group affinity field
> in the inode structure which is used by "extended inodes".
I don't see how growing the inode numbers downward really helps anything.
With FLEX_BG there already is no "affinity" between the inodes and the
blocks. The drawback of putting the inode table into an inode is that
this is relatively fragile if the inode is corrupted. We'd want to have
replication of the inode itself (we couldn't replicate the whole inode
table very efficiently).
Alternately, we could put the GDT into the inode and replicate the whole
inode several times (the data would already be present in the filesystem).
We just need to select inodes from disparate parts of the filesystem to
avoid corruption (I'd suggest one inode from each backup superblock
group), point them at the existing GDT blocks, then allow the new GDT
blocks to be added to each one. The backup GDT-inode copies only need
to be changed when new groups are added/removed.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sep 25, 2008 20:10 -0500, Jose R. Santos wrote:
> One way to get around this is to implement the exact opposite of what I
> proposed earlier and have a block group with no inode tables. If we do
> a 1:1 distribution of inodes to blocks and don't allocate inode tables
> for a series of block groups within a flexbg, we could later on attempt
> to allocate new inode tables when we run out of inodes. If we leave
> holes in the inode numbers for the missing inode tables, adding new
> inode tables in these block groups would not require any inode
> renumbering. This also does not break the current inode allocator,
> which would be a good thing. This should be even simpler to implement
> than the previous proposal. The drawback is that when allocating a
> new inode table, the 1:1 distribution of inodes to blocks would mean
> that we need to find a bigger chunk of contiguous blocks, since we
> have bigger inode tables per block group. Since the current inode
> allocator tries to keep 10% of the blocks in a flexbg free, finding
> contiguous blocks may not be a really big issue. Another issue is 64bit
> filesystems if we use a 1:1 scheme.
>
> This would be like uninitialized inode tables, with the added steps of
> finding free blocks, allocating a new inode table and zeroing the newly
> created table. Since we could choose to allocate a new inode
> table on the flexbg with the most free blocks, this could keep filesystem
> meta-data/data layout consistently close together to maintain
> predictable performance. This option also has no overhead compared to
> the previous proposal.
The problem with leaving gaps in the itable is that this needs the
filesystem to be created in this manner in the first place, while adding
them at the end can be done to any filesystem. If we are preparing the
filesystem in advance for this we could just reserve enough GDT space
too (as online resize already does to some extent).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Fri, Sep 26, 2008 at 04:33:22AM -0600, Andreas Dilger wrote:
> > This is actually the big problem; with META_BG, in order to find the
> > group descriptor blocks, it assumes that the first group descriptor
> > can be found at the beginning of the group descriptor block, which
> > means it has to be found at a certain offset from the beginning of the
> > filesystem. And this would not be true for inode-only block groups.
>
> We could special-case the placement of the GDT blocks in this case, and
> then put them into the proper META_BG location when/if the blocks are
> actually added to the filesystem.
Yes, but where do you put the GDT blocks in the case where there is
no more space in the reserved gdt blocks? Using some inode is
probably the best bet, since we would then know where to find the GDT
blocks.
My suggestion of using inode numbers growing downward from the top of
the 2**32 number space was to avoid needing to move the GDT blocks
into their proper place if and when the filesystem is grown; it
simplifies the code needed for the on-line resizing, and it also means
that when you do the on-line resizing the filesystem gets more inodes
along with more blocks. If we move the GDT blocks into their proper
place, then the filesystem gets more blocks, but not more inodes; but
if the inodes are dynamically grown automatically by the filesystem,
maybe that's not a problem.
> Alternately, we could put the GDT into the inode and replicate the whole
> inode several times (the data would already be present in the filesystem).
> We just need to select inodes from disparate parts of the filesystem to
> avoid corruption (I'd suggest one inode from each backup superblock
> group), point them at the existing GDT blocks, then allow the new GDT
> blocks to be added to each one. The backup GDT-inode copies only need
> to be changed when new groups are added/removed.
Yes, that's probably the best solution, IMHO.
- Ted
On Fri, 26 Sep 2008 04:36:07 -0600
Andreas Dilger <[email protected]> wrote:
> On Sep 25, 2008 20:10 -0500, Jose R. Santos wrote:
> > One way to get around this is to implement the exact opposite of what I
> > proposed earlier and have a block group with no inode tables. If we do
> > a 1:1 distribution of inodes to blocks and don't allocate inode tables
> > for a series of block groups within a flexbg, we could later on attempt
> > to allocate new inode tables when we run out of inodes. If we leave
> > holes in the inode numbers for the missing inode tables, adding new
> > inode tables in these block groups would not require any inode
> > renumbering. This also does not break the current inode allocator,
> > which would be a good thing. This should be even simpler to implement
> > than the previous proposal. The drawback is that when allocating a
> > new inode table, the 1:1 distribution of inodes to blocks would mean
> > that we need to find a bigger chunk of contiguous blocks, since we
> > have bigger inode tables per block group. Since the current inode
> > allocator tries to keep 10% of the blocks in a flexbg free, finding
> > contiguous blocks may not be a really big issue. Another issue is 64bit
> > filesystems if we use a 1:1 scheme.
> >
> > This would be like uninitialized inode tables, with the added steps of
> > finding free blocks, allocating a new inode table and zeroing the newly
> > created table. Since we could choose to allocate a new inode
> > table on the flexbg with the most free blocks, this could keep filesystem
> > meta-data/data layout consistently close together to maintain
> > predictable performance. This option also has no overhead compared to
> > the previous proposal.
>
> The problem with leaving gaps in the itable is that this needs the
> filesystem to be created in this manner in the first place, while adding
> them at the end can be done to any filesystem. If we are preparing the
> filesystem in advance for this we could just reserve enough GDT space
> too (as online resize already does to some extent).
Agreed, but performance-wise this way is more consistent with the
current block and inode allocators. The block allocator will start its
free block search in the block group that contains the inode. Since
these block groups do not contain any blocks, the block allocator will
have to be modified to make sure data is not being placed randomly on the
disk. The flex_bg inode allocator would also need to be modified since
it currently depends on an algorithm that assumes that block groups
contain actual blocks. One of the things that got flex_bg added to
ext4 in the first place was the performance improvements it provided.
I would like to keep that advantage if possible.
This could also be used to speed up mkfs since we would not need to zero
out as many inode tables. We could initialize just a couple of inode
tables per flex_bg group and allocate the rest dynamically. You do pay
a small penalty when allocating a new inode table since we first need
to find the blocks for that inode table as well as zeroing it afterward.
The penalty is less than if we do the one-time background zeroing of
inode tables, where your disk will be thrashing for a while the first
time it is mounted.
If supporting already existing filesystems is really important we could
always implement both techniques since they technically should not
conflict with each other, though you couldn't use both of them at the
same time if you have a 1:1 block/inode ratio.
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
-JRS
On Sep 26, 2008 09:49 -0500, Jose R. Santos wrote:
> Agreed, but performance-wise this way is more consistent with the
> current block and inode allocators. The block allocator will start its
> free block search in the block group that contains the inode. Since
> these block groups do not contain any blocks, the block allocator will
> have to be modified to make sure data is not being placed randomly on the
> disk.
This is already the case today when a block group is full. The block
allocator needs to handle this gracefully.
> The flex_bg inode allocator would also need to be modified since
> it currently depends on an algorithm that assumes that block groups
> contain actual blocks. One of the things that got flex_bg added to
> ext4 in the first place was the performance improvements it provided.
> I would like to keep that advantage if possible.
I don't think the performance advantage was at all related to inode->block
locality (since this is actually worse with FLEX_BG) but rather better
metadata locality (e.g. contiguous bitmaps and itables, avoiding seeking
during metadata operations).
> This could also be used to speed up mkfs since we would not need to zero
> out as many inode tables. We could initialize just a couple of inode
> tables per flex_bg group and allocate the rest dynamically.
There is already the ability to avoid zeroing ANY inode tables with
uninit_bg, but it is unsafe to do this in production because the old
itable data is there and e2fsck might become confused if the group's
bg_itable_unused is lost (due to gdt corruption or other inconsistency).
> You do pay
> a small penalty when allocating a new inode table since we first need
> to find the blocks for that inode table as well as zeroing it afterward.
> The penalty is less than if we do the one-time background zeroing of
> inode tables, where your disk will be thrashing for a while the first
> time it is mounted.
I don't think it is any different. The itable zeroing is _still_ needed,
because the flag that indicates if an itable is used or not is unreliable
in some corruption cases, and we don't want to read garbage from disk.
IMHO when a filesystem is first formatted and mounted it is probably
mostly idle, and if not the zeroing (and other stuff) thread can be delayed
(e.g. in a new distro install maybe the itables aren't zeroed until the
second or third mount, no great loss/risk).
> If supporting already existing filesystems is really important we could
> always implement both techniques since they technically should not
> conflict with each other, though you couldn't use both of them at the
> same time if you have a 1:1 block/inode ratio.
IMHO dynamic inode tables for existing filesystems are the MAIN goal.
Once you know you have run out of inodes it is already too late to plan
for it, and if you need a reformat to implement this scheme you could
just as easily reformat with enough inodes in the first place :-).
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Sep 26, 2008 10:33 -0400, Theodore Ts'o wrote:
> > We could special-case the placement of the GDT blocks in this case, and
> > then put them into the proper META_BG location when/if the blocks are
> > actually added to the filesystem.
>
> Yes, but where do you put the GDT blocks in the case where there is
> no more space in the reserved gdt blocks? Using some inode is
> probably the best bet, since we would then know where to find the GDT
> blocks.
I agree that replicating a GDT inode is probably the easiest answer.
IIRC this was proposed also in the past, before META_BG was implemented.
To be honest, we should just deprecate META_BG at that time; I don't
think it was ever used by anything, and it still isn't properly handled
by the cross-product of filesystem features (online resize, others).
> My suggestion of using inode numbers growing downward from the top of
> the 2**32 number space was to avoid needing to move the GDT blocks
> into their proper place if and when the filesystem is grown;
How do inode numbers affect the GDT blocks? Is it because high inode
numbers would be in correspondingly high "groups" and resizing could
be done "normally" without affecting the new GDT placement?
Once we move over to a scheme of GDT inodes, there isn't necessarily a
"proper place" for GDT blocks, so I don't know if that makes a difference.
I was going to object on the grounds that the GDT inodes will become too
large and sparse, but for a "normal" ratio (8192 inodes/group) this
only works out to be 32MB for the whole gdt to hit 2^32 inodes.
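(Worked out, assuming the 64-byte group descriptors of the 64bit
feature: 2^32 inodes / 8192 inodes/group = 2^19 groups, and
2^19 groups * 64 bytes/descriptor = 32MB of GDT.)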
The other thing we should consider is the case where the inode ratio
is too high, and it is limiting the growth of the filesystem due to
2^32 inode limit. With a default inode ratio of 1 inode/8192 bytes,
this hits 2^32 inodes at 262144 groups, or only 32TB... We may need
to also be able to add "inodeless groups" in such systems unless we
also implement 64-bit inode numbers at the same time.
This isn't impossible, though the directory format would need to change
to handle 64-bit inode numbers, and we would need some way to convert
between the leaf formats.
> it simplifies the code needed for the on-line resizing, and it also means
> that when you do the on-line resizing the filesystem gets more inodes
> along with more blocks. [...] but if the inodes are dynamically grown
> automatically by the filesystem, maybe that's not a problem.
It probably makes sense to increase the "static" inode count proportionally
with the new blocks, since we already know the inode ratio is too small,
so I can see a benefit from this direction.
> > Alternately, we could put the GDT into the inode and replicate the whole
> > inode several times (the data would already be present in the filesystem).
> > We just need to select inodes from disparate parts of the filesystem to
> > avoid corruption (I'd suggest one inode from each backup superblock
> > group), point them at the existing GDT blocks, then allow the new GDT
> > blocks to be added to each one. The backup GDT-inode copies only need
> > to be changed when new groups are added/removed.
>
> Yes, that's probably the best solution, IMHO.
>
> - Ted
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Fri, Sep 26, 2008 at 02:18:32PM -0600, Andreas Dilger wrote:
> I agree that replicating a GDT inode is probably the easiest answer.
> IIRC this was proposed also in the past, before META_BG was implemented.
> To be honest, we should just deprecate META_BG at that time; I don't
> think it was ever used by anything, and it still isn't properly handled
> by the cross-product of filesystem features (online resize, others).
Meta_BG is I think relatively well supported now, actually. More to
the point, the resize_inode feature doesn't work for filesystems with
more than 2**32 blocks, since indirect blocks don't work for such
filesystems. The assumption had always been that we would use meta_bg
to support online-resize for > 2**32 block filesystems, once we had
implemented on-line resize support for it.
> How do inode numbers affect the GDT blocks? Is it because high inode
> numbers would be in correspondingly high "groups" and resizing could
> be done "normally" without affecting the new GDT placement?
Yep. So inode numbers between 1 and (num_bg*inodes_per_bg)+1 are
"natural" inodes, and inodes above that would have to be "dynamic"
inodes where the GDT would be found in an inode.
> I was going to object on the grounds that the GDT inodes will become too
> large and sparse, but for a "normal" ratio (8192 inodes/group) this
> only works out to be 32MB for the whole gdt to hit 2^32 inodes.
I'm not sure what you mean by "sparse"... they would be just as tightly
packed, but just starting at 2**32-1 and growing down.
> The other thing we should consider is the case where the inode ratio
> is too high, and it is limiting the growth of the filesystem due to
> 2^32 inode limit. With a default inode ratio of 1 inode/8192 bytes,
> this hits 2^32 inodes at 262144 groups, or only 32TB... We may need
> to also be able to add "inodeless groups" in such systems unless we
> also implement 64-bit inode numbers at the same time.
Yeah, good point. The real fundamental question is whether we want to
try to support 2**64 inodes as a long-term goal. Past a certain
point, we would have to have inodeless groups if we support 2**48
physical blocks, but only 2**32 inodes, with or without dynamic
inodes.
- Ted
Andreas Dilger wrote:
> It _sounds_ simple, but I think the implementation will not be what
> is expected. Either you need to keep a 3rd bitmap for each group,
> (I&B), used for finding either inodes or blocks first (with
> find_first_bit() or find_first_zero_bit() respectively) and then check
> the "normal" inode and block bitmaps; this has to be kept in sync with
> mballoc, and invites confusion/danger on disk and in e2fsck because
> in-use itable blocks are marked "0" in the block bitmap. There will
> also be races between updating these bitmaps, unless the group is
> locked for both block and inode allocations on any update, because
> setting any bit completely changes the meaning.
>
> Alternately, if there are only I and B bitmaps, then find_first_bit()
> and find_first_zero_bit() are not useful. Searching for free blocks
> means looking for "B:0" and finding potentially many "B:0 I:1" blocks
> that are full of inodes. Searching for free inodes means looking for
> "I:1" (strangely) but finding potentially many "I:1 B:0" blocks.
mballoc already maintains its own in-core copy, so we'd have to apply another
bitmap to it. as for races - I think this can be done by proper ordering,
probably w/o locks even: a free block turns used first, then becomes part
of the "fragmentary" space. anyway, this would be way simpler than
mballoc itself, for example.
> I much prefer the dynamic itable idea from José (which I embellished in
> my other email), which is very simple for both the kernel and e2fsck,
> robust, and avoids the 64-bit inode problem for userspace to the maximum
> amount (i.e. a full 4B inodes must be in use before we ever need to
> use 64-bit inodes). The lack of complexity in itable allocation also
> translates directly into increased robustness in the face of corruption.
>
> It doesn't provide dynamic-sized inodes (which hasn't traditionally
> been a problem), nor is it perfect in terms of being able to fully
> populate a filesystem with inodes in all use cases but it could work
> in all but completely pathological fragmentation cases (at which point
> one wonders if it isn't better to just return -ENOSPC than to flog a
> nearly dead filesystem). It can definitely do a good job in most likely
> uses, and also provides a big win over what is done today.
I do very much understand simplicity and robustness as driving reasons. and I
agree dynamic inodes added via empty group descriptors is an excellent idea.
but I still think that with the original idea we could get much more than just
dynamic inodes (though that was the original intention). for example, storing
small directories (up to ~200 dir entries) within the inode could be very nice
to avoid a bunch of seeks. and tail packing could be as well. notice we don't
really need fine structures to find free slots in that fragmentary space -
usually small files are generated at once (e.g. tar -xf), so to pack them
we just need to remember the few last "partially filled" blocks. if some file
is deleted and we get a free slot in the corresponding block - who cares - with
current ext3/4 we'd waste a whole block anyway.
another reason for that design was to support >2^32 files per filesystem.
thanks, Alex