I was checking up on 64bit ( > 16 TB ) fs support in the
next branch last night and have a few questions:
1) Why is blocks_per_group limited to 32k ( more specifically, 8 x
blocksize )?
2) Why can't 64bit be enabled explicitly ( with -O or -E? ) instead of
only automatically, when it is needed and the automatic 64bit setting is
enabled in mke2fs.conf?
3) Why does 64bit disable the resize inode?
On 2011-06-09, at 8:36 AM, Phillip Susi wrote:
> I was checking up on 64bit ( > 16 TB ) fs support in the next branch last night and have a few questions:
>
> 1) Why is blocks_per_group limited to 32k ( more specifically, 8 x blocksize )?
There is only a single block pointer for each bitmap per group. That said,
with flex_bg this is mostly meaningless, since the bitmaps do not have to
be located in the group, and a flex group is the same as a virtual group
that is {flex_bg_factor} times as large.
> 2) Why can't 64bit be enabled explicitly ( with -O or -E? ) instead of only automatically, when it is needed and the automatic 64bit setting is enabled in mke2fs.conf?
I thought it could be enabled with "-O 64bit", but I admit I've never tried.
> 3) Why does 64bit disable the resize inode?
Because the on-disk format of the resize inode is only suitable for 32-bit
filesystems (it is an indirect-block mapped file and cannot reserve blocks
beyond 2^32). The "future" way to resize filesystems is using the META_BG
feature, but the ability to use it has not been integrated into the kernel
or e2fsprogs yet.
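To put a number on the 2^32 limit mentioned above, here is a minimal Python sketch of the largest filesystem that 32-bit block numbers can address. The 4 KiB block size is an assumption (the thread does not state one), and the function name is made up for this illustration.

```python
def max_fs_bytes(block_size, block_number_bits=32):
    """Largest size addressable when block numbers have the given width."""
    return (1 << block_number_bits) * block_size

TIB = 1 << 40
# With 4 KiB blocks, 32-bit block numbers top out at 16 TiB, which is
# the "> 16 TB" threshold from the start of the thread.
print(max_fs_bytes(4096) // TIB)  # 16
```

Any block past that point cannot be named by a 32-bit block pointer, which is why an indirect-block mapped resize inode cannot reserve it.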
Cheers, Andreas
On 6/9/2011 8:08 PM, Andreas Dilger wrote:
> There is only a single block pointer for each bitmap per group. That said,
> with flex_bg this is mostly meaningless, since the bitmaps do not have to
> be located in the group, and a flex group is the same as a virtual group
> that is {flex_bg_factor} times as large.
Of course there is only a single pointer because there is only a single
bitmap. What does this have to do with limiting the block count to 8 *
blocksize?
>> 3) Why does 64bit disable the resize inode?
>
> Because the on-disk format of the resize inode is only suitable for 32-bit
> filesystems (it is an indirect-block mapped file and cannot reserve blocks
> beyond 2^32). The "future" way to resize filesystems is using the META_BG
> feature, but the ability to use it has not been integrated into the kernel
> or e2fsprogs yet.
Ahh, right... no indirect blocks. Couldn't and shouldn't the resize
inode just use extents instead? Also I thought that META_BG was an idea
that eventually became FLEX_BG and has been dropped?
On 2011-06-10, at 9:19 AM, Phillip Susi wrote:
> On 6/9/2011 8:08 PM, Andreas Dilger wrote:
>> There is only a single block pointer for each bitmap per group. That said,
>> with flex_bg this is mostly meaningless, since the bitmaps do not have to
>> be located in the group, and a flex group is the same as a virtual group
>> that is {flex_bg_factor} times as large.
>
> Of course there is only a single pointer because there is only a single bitmap. What does this have to do with limiting the block count to 8 * blocksize?
I think in the presence of flex_bg this issue is moot.
>>> 3) Why does 64bit disable the resize inode?
>>
>> Because the on-disk format of the resize inode is only suitable for 32-bit
>> filesystems (it is an indirect-block mapped file and cannot reserve blocks
>> beyond 2^32). The "future" way to resize filesystems is using the META_BG
>> feature, but the ability to use it has not been integrated into the kernel
>> or e2fsprogs yet.
>
> Ahh, right... no indirect blocks. Couldn't and shouldn't the resize inode just use extents instead? Also I thought that META_BG was an idea that eventually became FLEX_BG and has been dropped?
META_BG also reduces the number of group descriptor blocks needed for the table.
Normally (without META_BG) each group descriptor table has a full copy of all
group descriptor blocks, and it has to be allocated contiguously on disk.
With META_BG, there are only 2 backups of each GDT block, and it is spread
around the filesystem, so there is not a need to allocate huge chunks of space.
Once we get a filesystem up to 256TB in size the size of the GDT will be larger
than a whole group (more than 128MB per GDT) and it will not be possible to
create a larger filesystem without META_BG.
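The 256TB / 128MB figures above can be checked with a short Python sketch. It assumes 4 KiB blocks, one bitmap block per group (so 8 x blocksize blocks per group), and 64-byte group descriptors as used with the 64bit feature. None of these values are stated in the message itself, and the function name is hypothetical.

```python
def gdt_bytes(fs_bytes, block_size=4096, desc_size=64):
    """Return (size of one full GDT copy, size of one block group) in bytes."""
    blocks_per_group = 8 * block_size          # one bitmap block per group
    group_bytes = blocks_per_group * block_size
    groups = -(-fs_bytes // group_bytes)       # ceiling division
    return groups * desc_size, group_bytes

TIB = 1 << 40
MIB = 1 << 20
gdt, group = gdt_bytes(256 * TIB)
print(gdt // MIB, group // MIB)  # 128 128
```

Under these assumptions a full copy of the table reaches the size of an entire group at 256 TiB, and any larger filesystem pushes it past that.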
Cheers, Andreas
On 6/10/2011 12:19 PM, Andreas Dilger wrote:
> I think in the presence of flex_bg this issue is moot.
What is the issue without flex_bg?
On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>> I think in the presence of flex_bg this issue is moot.
>
> What is the issue without flex_bg?
No "issue" really, just that the block/inode bitmaps are spread all over
the filesystem. The original discussion was about whether there could be
"larger bitmaps that addressed more than 32768 blocks", which is essentially
what the flex_bg feature provides. With flex_bg the bitmaps for different
groups will be allocated adjacent to each other on disk, and allow addressing
more than 32768 blocks without any seeking.
On large filesystems without flex_bg, the bitmaps are scattered across the
disk, so a seek is needed to read each one. Given that spinning disks have
stayed at about 100 seeks/sec for decades, that means 10+ minutes just to
read all of the bitmaps.
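A back-of-the-envelope Python sketch of that estimate follows. It assumes 4 KiB blocks, two bitmaps per group (block and inode), exactly one seek per bitmap, and no caching or read-ahead. These are simplifications for illustration rather than measurements, and the function name is made up.

```python
def bitmap_read_seconds(fs_bytes, block_size=4096,
                        seeks_per_sec=100, bitmaps_per_group=2):
    """Estimated time to read every bitmap when each read costs one seek."""
    group_bytes = 8 * block_size * block_size   # 8 * blocksize blocks per group
    groups = -(-fs_bytes // group_bytes)        # ceiling division
    return groups * bitmaps_per_group / seeks_per_sec

TIB = 1 << 40
for size in (2, 4, 16):
    minutes = bitmap_read_seconds(size * TIB) / 60
    print(size, "TiB:", round(minutes, 1), "min")
```

With these numbers the estimate crosses the 10 minute mark at roughly 4 TiB and grows linearly with filesystem size.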
On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
by copying the data to a new ext4 filesystem with flex_bg + extents. For a
fair comparison, I then reformatted the original (identical) disk without
flex_bg or extents and copied the data back, so that there wasn't any unfair
comparison between the newly-formatted filesystem and the old fragmented one.
Cheers, Andreas
On 6/10/2011 1:29 PM, Andreas Dilger wrote:
> On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
>> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>>> I think in the presence of flex_bg this issue is moot.
>>
>> What is the issue without flex_bg?
>
> No "issue" really, just that the block/inode bitmaps are spread all over
> the filesystem. The original discussion was about whether there could be
> "larger bitmaps that addressed more than 32768 blocks", which is essentially
> what the flex_bg feature provides. With flex_bg the bitmaps for different
> groups will be allocated adjacent to each other on disk, and allow addressing
> more than 32768 blocks without any seeking.
>
> On large filesystems without flex_bg, the bitmaps are scattered across the
> disk, so a seek is needed to read each one. Given that spinning disks have
> stayed at about 100 seeks/sec for decades, that means 10+ minutes just to
> read all of the bitmaps.
>
> On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
> by copying the data to a new ext4 filesystem with flex_bg + extents. For a
> fair comparison, I then reformatted the original (identical) disk without
> flex_bg or extents and copied the data back, so that there wasn't any unfair
> comparison between the newly-formatted filesystem and the old fragmented one.
I know what flex_bg is; what I don't understand is what it has to do
with the limit on the size of a block group. Whether the block bitmaps
are stored in their native block group, or clustered up with flex_bg
does not seem to have anything to do with whether or not the size of the
bitmap can exceed 32k blocks.
On 2011-06-10, at 11:45 AM, Phillip Susi wrote:
> On 6/10/2011 1:29 PM, Andreas Dilger wrote:
>> On 2011-06-10, at 11:14 AM, Phillip Susi wrote:
>>> On 6/10/2011 12:19 PM, Andreas Dilger wrote:
>>>> I think in the presence of flex_bg this issue is moot.
>>>
>>> What is the issue without flex_bg?
>>
>> No "issue" really, just that the block/inode bitmaps are spread all over
>> the filesystem. The original discussion was about whether there could be
>> "larger bitmaps that addressed more than 32768 blocks", which is essentially
>> what the flex_bg feature provides. With flex_bg the bitmaps for different
>> groups will be allocated adjacent to each other on disk, and allow addressing
>> more than 32768 blocks without any seeking.
>>
>> On large filesystems without flex_bg, the bitmaps are scattered across the
>> disk, so a seek is needed to read each one. Given that spinning disks have
>> stayed at about 100 seeks/sec for decades, that means 10+ minutes just to
>> read all of the bitmaps.
>>
>> On my 2TB 5400 RPM SATA drive, e2fsck time went from ~20 minutes to ~3 minutes
>> by copying the data to a new ext4 filesystem with flex_bg + extents. For a
>> fair comparison, I then reformatted the original (identical) disk without
>> flex_bg or extents and copied the data back, so that there wasn't any unfair
>> comparison between the newly-formatted filesystem and the old fragmented one.
>
> I know what flex_bg is; what I don't understand is what it has to do with the limit on the size of a block group. Whether the block bitmaps are stored in their native block group, or clustered up with flex_bg does not seem to have anything to do with whether or not the size of the bitmap can exceed 32k blocks.
I hope it is obvious that a single bitmap block can only address the number
of bits (==blocks) that fit within that block. To address more blocks the
block bitmap needs to be larger than a single block in size. One possible
way to do this (discussed early on for ext4) would be to have N block
bitmap blocks per group. That raises issues of how to address those blocks
for each "block group", and what the meaning of a "block group" really is.
The other (very similar, but not identical) approach is to essentially merge
N adjacent "block groups" into a single "large block group" that has N block
bitmaps, and addresses N * blocksize * 8 blocks per "large block group".
In this case "N" is the flex_bg factor (constrained to 2^n), and the "large
block group" is called a "flex group". It achieves exactly the same thing
as having N block bitmaps per group, with the only difference that there are
N group descriptors that point to the bitmaps, and they no longer have to be
located within the groups themselves.
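The arithmetic in the two paragraphs above can be sketched in Python as follows. The function names are invented for this example, and the 4 KiB block size in the usage lines is an assumption.

```python
def blocks_per_bitmap_block(block_size):
    """One bit per block, so a bitmap block covers 8 * blocksize blocks."""
    return 8 * block_size

def blocks_per_flex_group(block_size, flex_factor):
    """N adjacent groups merged into one flex group: N * blocksize * 8 blocks."""
    if flex_factor < 1 or flex_factor & (flex_factor - 1):
        raise ValueError("flex_factor must be a power of two")
    return flex_factor * blocks_per_bitmap_block(block_size)

print(blocks_per_bitmap_block(4096))     # 32768
print(blocks_per_flex_group(4096, 4))    # 131072
print(blocks_per_flex_group(4096, 16))   # 524288
```

The power-of-two check mirrors the "constrained to 2^n" remark; everything else is just the N * blocksize * 8 product.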
There is virtually no difference between "larger bitmap" and "flex_bg":
"b"=block bitmap, "i"=inode bitmap, "."=data block
Non-flex_bg configuration for 4 groups * 32768 blocks:
bi...{32760}...bi...{32760}...bi...{32760}...bi...{32760}...
Each block bitmap addresses 32768 blocks in total (including itself).
flex_bg configuration for the same 4 groups * 32768 blocks:
bbbbiiii.....................{131020}.......................
If you treat the four "bbbb" blocks as a single block bitmap, and "iiii"
as a single inode bitmap, and the contiguous range of free blocks as a
single group, it is exactly what you are asking for - a larger bitmap.
Cheers, Andreas
On 6/10/2011 4:37 PM, Andreas Dilger wrote:
> I hope it is obvious that a single bitmap block can only address the number
> of bits (==blocks) that fit within that block. To address more blocks the
> block bitmap needs to be larger than a single block in size. One possible
> way to do this (discussed early on for ext4) would be to have N block
> bitmap blocks per group. That raises issues of how to address those blocks
> for each "block group", and what the meaning of a "block group" really is.
I thought it was obvious that if there were more blocks, then you would
have more than one bitmap block and it would just follow the first.
> The other (very similar, but not identical) approach is to essentially merge
> N adjacent "block groups" into a single "large block group" that has N block
> bitmaps, and addresses N * blocksize * 8 blocks per "large block group".
> In this case "N" is the flex_bg factor (constrained to 2^n), and the "large
> block group" is called a "flex group". It achieves exactly the same thing
> as having N block bitmaps per group, with the only difference that there are
> N group descriptors that point to the bitmaps, and they no longer have to be
> located within the groups themselves.
The other side effect is that you have N inode tables and N inode
bitmaps. A typical fs these days seems to have 8192 inodes in each bg,
which gives far more inodes than needed, and only uses 1/4 of the inode
bitmap block.
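The 1/4 figure follows directly from the bitmap size. A small Python sketch, assuming a 4 KiB block size (not stated in the message) and a made-up function name:

```python
from fractions import Fraction

def inode_bitmap_usage(inodes_per_group, block_size=4096):
    """Fraction of a one-block inode bitmap that is actually used."""
    return Fraction(inodes_per_group, 8 * block_size)

print(inode_bitmap_usage(8192))  # 1/4
```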
Now that I've looked a bit more at the code, it seems the 32k block
limit comes from the old ext2 block group descriptor only having a 16
bit field for the free blocks count. This was fixed in the ext4 bg
descriptor, but it seems that is not actually used except on a 64bit fs.
It looks like a few more bits of code need to be cleaned up to allow for
more blocks per group when using 64bit.
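To illustrate the observation about the 16-bit free blocks count, the Python sketch below checks whether a fully free group could still report its block count in a field of that width. It only models the field width described above, not the actual e2fsprogs code, and the names are hypothetical.

```python
FIELD_BITS = 16  # width of the old free blocks count field, per the message above

def fits_in_field(blocks_per_group, bits=FIELD_BITS):
    """A completely free group must be able to report all its blocks as free."""
    return blocks_per_group <= (1 << bits) - 1

print(fits_in_field(8 * 4096))       # True:  32768 <= 65535
print(fits_in_field(2 * 8 * 4096))   # False: 65536 >  65535
```

Assuming 4 KiB blocks, a group twice the current size would already overflow the 16-bit counter, which fits the description of the wider ext4 descriptor field being needed.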
> If you treat the four "bbbb" blocks as a single block bitmap, and "iiii"
> as a single inode bitmap, and the contiguous range of free blocks as a
> single group, it is exactly what you are asking for - a larger bitmap.
While each of those inode bitmaps may follow the previous, each one is
typically only 1/4 used and the rest ignored. It would be better to
have only the single inode bitmap for a single, larger bg.