Currently, the maximum number of blocks in a block group is the number
of bits in a block, since the block bitmap must be stored inside a
single block.
So on a 4 KB blocksize filesystem, the maximum number of blocks in a
group is 32768.
This constraint can limit the maximum size of the filesystem.
Increasing the block group size also reduces the number of
groups in the filesystem.
This patchset allows several consecutive blocks to be used to store the
bitmaps, thus allowing larger block groups and so increasing the maximum
filesystem size.
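To make the constraint concrete, here is a small stand-alone example
(illustrative only, not code from the patchset) showing how the
blocks-per-group limit scales with the number of bitmap blocks:

    #include <stdio.h>

    /*
     * Illustration only: with the bitmap confined to a single block, a
     * group can cover at most blocksize * 8 blocks; storing the bitmap
     * in several consecutive blocks multiplies that limit accordingly.
     */
    int main(void)
    {
        unsigned long blocksize = 4096;        /* 4 KB blocks */
        unsigned long nbitmap;

        for (nbitmap = 1; nbitmap <= 8; nbitmap *= 2) {
            unsigned long blocks_per_group = blocksize * 8 * nbitmap;

            printf("%lu bitmap block(s) -> %lu blocks/group (%lu MB per group)\n",
                   nbitmap, blocks_per_group,
                   (blocks_per_group * blocksize) >> 20);
        }
        return 0;
    }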
It applies against a linux-2.6.19-rc6 kernel and follows up the work on
the 64-bit support done by Laurent Vivier and Alexandre Ratchov.
These patches have been tested on an x86_64 machine with a 20TB device
and with updated e2fsprogs available here:
http://www.bullopensource.org/ext4/20061124
We already see that the execution time of the fsck and mkfs commands is
reduced when increasing the size of groups on a large filesystem; I will
do performance testing in the next few days to see the other impacts of
this modification.
Any comments are welcome.
Thanks,
Valerie.
On Fri, Nov 24, 2006 at 05:47:40PM +0100, Valerie Clement wrote:
> Currently, the maximum number of blocks in a block group is the
> number of bits in a block, since the block bitmap must be stored
> inside a single block. So on a 4 KB blocksize filesystem, the
> maximum number of blocks in a group is 32768. This constraint can
> limit the maximum size of the filesystem.
So what's the current limitation on the maximum size of the filesystem
without big block groups? Well, the block group number is an unsigned
32 bit number, so we can have 2**32 block groups. Using a 4k (2**12)
blocksize, we have a limit of 32768 blocks per block group, or 2**15
blocks. So the limit is 2**(32+15) or 2**47 blocks, or 2**59 bytes
(512 petabytes).
So one justification of doing this work is we can raise the limit from
2**59 bytes to 2**63 bytes (after which point we have to worry about
loff_t on 32-bit systems and off_t on 64-bit systems being a signed
64-bit number). So it buys us a factor of 16 increase from 512
petabytes to 8 exabytes of maximum filesystem size (assuming that we
also have the full 64-bit support patches, of course).
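To spell the arithmetic out, here is a quick illustrative calculation
(not from any patch):

    #include <stdio.h>

    /*
     * Illustration only: the limits discussed above.
     * 2**32 groups * 2**15 blocks/group = 2**47 blocks;
     * at 2**12 bytes/block that is 2**59 bytes (512 PB).
     * The off_t/loff_t ceiling is 2**63 bytes (8 EB), a factor of 16 more.
     */
    int main(void)
    {
        unsigned long long groups = 1ULL << 32;    /* dgrp_t is 32 bits */
        unsigned long long blocks_per_group = 1ULL << 15;
        unsigned long long blocksize = 1ULL << 12; /* 4k */
        unsigned long long max_blocks = groups * blocks_per_group;
        unsigned long long max_bytes = max_blocks * blocksize;

        printf("max blocks: %llu (2**47)\n", max_blocks);
        printf("max bytes : %llu (2**59, 512 PB)\n", max_bytes);
        printf("off_t cap : %llu (2**63, 8 EB)\n", 1ULL << 63);
        return 0;
    }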
(For reference, the Internet Archive Wayback machine contains
approximately 2 petabytes of information, and the Star Trek: TNG's
character Data was purported to have a storage capacity of 88
petabytes.)
The other thing to note about the 2**32 block group number limitation
is that this is not a filesystem format limitation, but an implementation
limitation, and is based on dgrp_t being a 32-bit unsigned integer.
If we ever needed to go beyond 512 petabytes, we could do so by making
dgrp_t 64-bits.
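For illustration, the change would be on the order of widening that
typedef; the sketch below is not an actual patch, and the real
definition lives in e2fsprogs:

    /*
     * Sketch only, not an actual patch: widening the in-memory group
     * counter.  Today dgrp_t is an unsigned 32-bit type (roughly
     * "typedef __u32 dgrp_t;"); a 64-bit counter would lift the 2**32
     * group limit without changing the on-disk format.
     */
    typedef unsigned long long dgrp_t;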
> We already see that the execution time of the fsck and mkfs commands is
> reduced when increasing the size of groups on a large filesystem; I will
> do performance testing in the next few days to see the other impacts of
> this modification.
The execution time speedup of mkfs would mainly be in the time that it
takes to write out the bitmaps in a less seeky fashion --- although
the metablockgroup format change does this as well. There would also be
a secondary improvement based on the fact that the overhead of writing
out the block group descriptors shrinks, but given that this overhead
is an extra 4k block for every 8 gigabytes of filesystem space, I'm
not sure how important the overhead really is. We will also get the
mkfs execution speedup (and in fact a much more significant one) when
we implement the lazy initialization of the bitmap blocks and inode
table blocks.
The metablockgroup changes would also enable storing the bitmap
blocks contiguously, which would speed up reading and writing the
bitmap blocks. The main remaining advantage of big blockgroups is
then reducing the number of block utilization statistics that must be
tracked on a per-block group basis, and allowing the "pool" of
blocks in an ext4 block group to be bigger, which could help (or hurt)
our allocation policies. And, the metablockgroup changes are
significantly simpler (already implemented in e2fsprogs, and involve a
very minor change to the mount code's sanity checking about valid
locations for bitmap blocks).
Am I missing anything?
Based on this analysis, it's clear that the big block groups patch has
some benefits, but I'm wondering if they are sufficiently large to be
worth it, especially since we also have to consider the changes
necessary to the e2fsprogs (which haven't been written yet as far as I
know). Comments?
- Ted
Theodore Tso wrote:
> So what's the current limitation on the maximum size of the filesystem
> without big block groups? Well, the block group number is an unsigned
> 32 bit number, so we can have 2**32 block groups. Using a 4k (2**12)
> blocksize, we have a limit of 32768 blocks per block group, or 2**15
> blocks. So the limit is 2**(32+15) or 2**47 blocks, or 2**59 bytes
> (512 petabytes).
Hi Ted,
thanks for your comments.
In fact, there is another limitation related to the block group size:
all the group descriptors are stored in the first group of the filesystem.
Currently, with a 4-KB block size, the maximum size of a group is
2**15 blocks = 2**27 bytes.
With a group descriptor size of 32 bytes, we can store a maximum of
2**27 / 32 = 2**22 group descriptors in the first group.
So the maximum number of groups is limited to 2**22, which limits the
size of the filesystem to
2**22 (groups) * 2**15 (blocks) * 2**12 (blocksize) = 2**49 bytes = 512 TB
With big block groups, we can grow beyond this limit of 512TB.
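Spelled out as a small illustrative calculation (not taken from the
patches):

    #include <stdio.h>

    /*
     * Illustration of the group-descriptor limit described above: all
     * descriptors must fit in the first group.
     *   first group     = 2**15 blocks * 2**12 bytes = 2**27 bytes
     *   max descriptors = 2**27 / 32   = 2**22 groups
     *   max filesystem  = 2**22 * 2**15 * 2**12 = 2**49 bytes = 512 TB
     */
    int main(void)
    {
        unsigned long long blocksize = 4096;
        unsigned long long blocks_per_group = blocksize * 8;  /* 32768 */
        unsigned long long group_bytes = blocks_per_group * blocksize;
        unsigned long long desc_size = 32;  /* 64 with the 64-bit patches */
        unsigned long long max_groups = group_bytes / desc_size;
        unsigned long long max_fs = max_groups * blocks_per_group * blocksize;

        printf("max groups: %llu, max filesystem size: %llu TB\n",
               max_groups, max_fs >> 40);
        return 0;
    }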
>
> Based on this analysis, it's clear that the big block groups patch has
> some benefits, but I'm wondering if they are sufficiently large to be
> worth it, especially since we also have to consider the changes
> necessary to the e2fsprogs (which haven't been written yet as far as I
> know).
I have already made changes in e2fsprogs to support larger block groups,
but there is still some work to do.
The first question is, when creating a large filesystem (over 512 TB, or
perhaps before that), what is the optimal value for the block group size,
and how should mkfs choose its default value?
That is why I am now running tests to see the behavior of the system when
increasing the size of the block groups.
Valérie
On Thu, Nov 30, 2006 at 04:17:41PM +0100, Valerie Clement wrote:
> In fact, there is another limitation related to the block group size:
> all the group descriptors are stored in the first group of the filesystem.
> Currently, with a 4-KB block size, the maximum size of a group is
> 2**15 blocks = 2**27 bytes.
> With a group descriptor size of 32 bytes, we can store a maximum of
> 2**27 / 32 = 2**22 group descriptors in the first group.
> So the maximum number of groups is limited to 2**22, which limits the
> size of the filesystem to
> 2**22 (groups) * 2**15 (blocks) * 2**12 (blocksize) = 2**49 bytes = 512 TB
Hmm, yes. Good point. Thanks for pointing that out. In fact, with
the 64-bit patches, the block group descriptor size becomes 64 bytes
long, which means we can only have 2**21 groups, which means 2**48
bytes, or 256TB.
There is one other problem with big block groups which I had forgotten
to mention in my last note. As we grow the size of the big block
group, it means that we increase the number of contiguous blocks
required for block and inode allocation bitmaps. If we use the
smallest possible block group size to support a given filesystem, then
for a 1 Petabyte filesystem (using 128k blocks/group), we will need
4 contiguous blocks for the block and inode allocation bitmaps, and
for an Exabyte (2**60) filesystem we would need 4096 contiguous bitmap
blocks. The problem with requiring this many contiguous blocks is
that it makes the filesystem less robust in the face of bad blocks
appearing in the middle of a block group, or in the face of filesystem
corruptions where it becomes necessary to relocate the bitmap blocks.
(For example, if the block allocation bitmap gets damaged and data
blocks get allocated into bitmap blocks.) Finding even 4 contiguous
blocks can be quite difficult, especially if you are constrained to
find them within the current block group.
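To put numbers on it, here is an illustrative sketch (counting only the
block bitmap):

    #include <stdio.h>

    /*
     * Illustration only: contiguous 4k blocks needed for one group's
     * block bitmap, i.e. blocks_per_group / (8 * blocksize).
     */
    int main(void)
    {
        unsigned long long blocksize = 4096;
        unsigned long long sizes[] = { 1ULL << 15, 1ULL << 17, 1ULL << 27 };
        const char *labels[] = {
            "32k blocks/group  (today's limit)",
            "128k blocks/group (the 1 PB example)",
            "128M blocks/group (the exabyte example)",
        };
        int i;

        for (i = 0; i < 3; i++)
            printf("%-40s -> %llu contiguous bitmap block(s)\n",
                   labels[i], sizes[i] / (8 * blocksize));
        return 0;
    }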
Even if we relax this constraint for ext4 (and I believe we should),
it is not always guaranteed that it is possible to find a large number
of contiguous free blocks. And if we can't then e2fsck will not be
able to repair the filesystem, leaving the user dead in the water.
What are potential solutions to this issue?
* We could add two per-block group flags indicating whether the block
bitmap and inode bitmap are stored contiguously, or whether the block
number points to an indirect or doubly-indirect block (depending on
what is necessary to store the bitmap information); see the rough
sketch following this list.
* We could use the bitmap block address as the root of a b-tree
containing the allocation information --- at the cost of adding some
XFS-like complexity.
* We ignore the problem, and accept that there are some kinds of
filesystem corruptions which e2fsck will not be able to fix --- or at
least not without adding complexity which would allow it to relocate
data blocks in order to make a contiguous range of blocks to be used
for the allocation bitmaps.
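Here is the rough sketch of that first option (hypothetical only; the
flag names, values, and struct below are invented for illustration and
are not the ext4 on-disk format):

    /*
     * Hypothetical sketch only.  Two per-group flag bits could record
     * whether each bitmap location is the start of a contiguous run of
     * bitmap blocks or the root of a (doubly-)indirect block listing
     * the bitmap blocks.  Flag values and struct are invented here.
     */
    #define BG_BLOCK_BITMAP_INDIRECT    0x0001
    #define BG_INODE_BITMAP_INDIRECT    0x0002

    struct big_group_bitmap_info {      /* not struct ext4_group_desc */
        unsigned long long block_bitmap; /* first block, or indirect block */
        unsigned long long inode_bitmap;
        unsigned short flags;            /* BG_*_INDIRECT bits above */
    };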
The last alternative sounds horrible, but if we assume that some other
layer (i.e., the hard drive's bad block replacement pool) provides us
the illusion of flawless storage media, and that CRCs protecting metadata
will prevent us from relying on a corrupted bitmap block, maybe it is
acceptable that e2fsck may not be able to fix certain types of
filesystem corruption. In that case, though, for laptop drives
without any of these protections, I'd want to keep the block group
size under 32k so we can avoid dealing with these issues for as long
as possible. Even if we assume laptop drives will double in size
every 12 months, we still have a good 10+ years before we're in danger
of seeing a 512TB laptop drive. :-)
Yet another solution that we could consider besides supporting larger
block groups would be to increase the block size. The downside of
this solution is that we would have to fix the VM helper functions
(i.e., the filemap functions, et al.) to allow supporting filesystems
where the blocksize > page size, and of course it would increase the
fragmentation cost for small files. But for certain partitions which
are dedicated for video files, using a larger block size could also
improve data I/O efficiency, as well as decrease the overhead caused
by needing to update the block allocation bitmaps as
a file is extended.
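As a rough illustration of that tradeoff (sketch only), each doubling of
the block size quadruples the bytes covered by a single group, since the
bitmap gains bits and each bit covers a bigger block:

    #include <stdio.h>

    /*
     * Illustration only: bytes covered by one block group as the block
     * size grows (bitmap still confined to a single block).
     */
    int main(void)
    {
        unsigned long long bs;

        for (bs = 4096; bs <= 65536; bs *= 2) {
            unsigned long long blocks_per_group = 8 * bs;
            unsigned long long group_bytes = blocks_per_group * bs;

            printf("%6llu-byte blocks: %7llu blocks/group, %5llu MB/group\n",
                   bs, blocks_per_group, group_bytes >> 20);
        }
        return 0;
    }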
As always, filesystem design is full of tradeoffs....
- Ted
On Nov 30, 2006 14:41 -0500, Theodore Tso wrote:
> * We ignore the problem, and accept that there are some kinds of
> filesystem corruptions which e2fsck will not be able to fix --- or at
> least not without adding complexity which would allow it to relocate
> data blocks in order to make a contiguous range of blocks to be used
> for the allocation bitmaps.
>
> The last alternative sounds horrible, but if we assume that some other
> layer (i.e., the hard drive's bad block replacement pool) provides us
> the illusion of flawless storage media, and that CRCs protecting metadata
> will prevent us from relying on a corrupted bitmap block, maybe it is
> acceptable that e2fsck may not be able to fix certain types of
> filesystem corruption.
I'd agree that even with media errors, the bad-block replacement pool
is almost certainly available to handle this case. Even if there are
media errors on the read of the bitmap, they will generally go away
if the bitmap is rewritten (because of relocation). At worst, we
would no longer allow new blocks/inodes to be allocated that are
tracked by that block, and if we are past 256TB then the sacrifice
of 128MB of space is not fatal. It wouldn't even have to impact any
files that are already allocated in that space.
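(For the record, the 128MB figure comes from one 4k bitmap block
tracking 32768 blocks of 4k each; a quick check, illustrative only:)

    #include <stdio.h>

    /* Illustration: space tracked by a single 4k bitmap block. */
    int main(void)
    {
        unsigned long long blocksize = 4096;
        unsigned long long blocks_per_bitmap_block = blocksize * 8;

        printf("one bitmap block covers %llu MB\n",
               (blocks_per_bitmap_block * blocksize) >> 20);
        return 0;
    }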
> without any of these protections, I'd want to keep the block group
> size under 32k so we can avoid dealing with these issues for as long
> as possible. Even if we assume laptop drives will double in size
> every 12 months, we still have a good 10+ years before we're in danger
> of seeing a 512TB laptop drive. :-)
Agreed, I think there isn't any reason to increase the group size
unless it is really needed, or it is specified with "mke2fs -g {blocks}"
or the number of inodes requires it.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.