2011-10-03 21:55:12

by Eric Sandeen

[permalink] [raw]
Subject: mkfs'ing a 48-bit fs... or not.

Has anyone tried mke2fs at its limits? The latest git tree seems to fail in several ways.
(Richard Jones reported the initial failure)

# truncate --size 1152921504606846976 reallybigfile
# mke2fs -t ext4 reallybigfile

first,

Warning: the fs_type huge is not defined in mke2fs.conf

(when types "big" and "huge" got added, they never got a mke2fs.conf update?)

Then, I got:

reallybigfile: Not enough space to build proposed filesystem while setting up superblock


because:

        fs->group_desc_count = (blk_t) ext2fs_div64_ceil(
                        ext2fs_blocks_count(super) - super->s_first_data_block,
                        EXT2_BLOCKS_PER_GROUP(super));
        if (fs->group_desc_count == 0) {
                retval = EXT2_ET_TOOSMALL;

The div64_ceil returns > 2^32 (2^33, actually), and the cast to blk_t
(which should be dgrp_t?) turns that into a 0.
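To illustrate outside of libext2fs (a standalone sketch; plain stdint types stand in for blk_t/dgrp_t, both of which are 32-bit):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t blocks = 1ULL << 48;   /* 2^60 bytes worth of 4k blocks */
        uint64_t bpg = 32768;           /* blocks per group at 4k block size */
        uint64_t groups64 = (blocks + bpg - 1) / bpg;   /* div64_ceil -> 2^33 */
        uint32_t groups32 = (uint32_t) groups64;        /* the offending cast */

        printf("%llu -> %u\n", (unsigned long long) groups64, groups32);
        /* prints "8589934592 -> 0", hence the bogus "too small" error */
        return 0;
}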

Trying it with "-O bigalloc" (which should be automatic at this size,
I think?) just goes away for a very long time, I'm not sure what it's
thinking about, or if it's in a loop somewhere (looking now).

I also came across this in ext2fs_initialize() in the bigalloc case:

        if (super->s_clusters_per_group > EXT2_MAX_CLUSTERS_PER_GROUP(super))
                super->s_blocks_per_group = EXT2_MAX_CLUSTERS_PER_GROUP(super);
        super->s_blocks_per_group = EXT2FS_C2B(fs,
                                               super->s_clusters_per_group);

which seems to be incorrect; I doubt that you meant to set s_blocks_per_group under
a conditional, and then unconditionally set it immediately after. I assume
that should be super->s_clusters_per_group in the first case? I'll send a patch,
assuming so.
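Roughly, I'd expect that fix to look like this (untested sketch, assuming the clamp really was meant to apply to the cluster count):

        /* Clamp clusters per group first, then derive blocks per group */
        if (super->s_clusters_per_group > EXT2_MAX_CLUSTERS_PER_GROUP(super))
                super->s_clusters_per_group = EXT2_MAX_CLUSTERS_PER_GROUP(super);
        super->s_blocks_per_group = EXT2FS_C2B(fs,
                                               super->s_clusters_per_group);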

TBH I've kind of lost the thread on bigalloc, so just putting this out there for
now while I look into things a bit more.

-Eric


2011-10-04 04:00:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: mkfs'ing a 48-bit fs... or not.

On Mon, Oct 03, 2011 at 04:55:11PM -0500, Eric Sandeen wrote:
> Has anyone tried mke2fs at its limits? The latest git tree seems to fail in several ways.
> (Richard Jones reported the initial failure)
>
> # truncate --size 1152921504606846976 reallybigfile
> # mke2fs -t ext4 reallybigfile
>
> first,
>
> Warning: the fs_type huge is not defined in mke2fs.conf
>
> (when types "big" and "huge" got added, they never got a mke2fs.conf update?)

It used to be that an undefined file system type didn't flag an error.
It now does, so we should have definitions for them in mke2fs.conf.

> reallybigfile: Not enough space to build proposed filesystem while setting up superblock
>
> because:
>
> fs->group_desc_count = (blk_t) ext2fs_div64_ceil(
> ext2fs_blocks_count(super) - super->s_first_data_block,
> EXT2_BLOCKS_PER_GROUP(super));
> if (fs->group_desc_count == 0) {
> retval = EXT2_ET_TOOSMALL;
>
> The div64_ceil returns > 2^32 (2^33, actually), and the cast to blk_t
> (which should be dgrp_t?) turns that into a 0.

Yep, that should be dgrp_t. Oops.

> Trying it with "-O bigalloc" (which should be automatic at this size,
> I think?) just goes away for a very long time, I'm not sure what it's
> thinking about, or if it's in a loop somewhere (looking now).

Well, we probably do want to engage bigalloc automatically, at some
point (I want to wait until bigalloc is in commonly used kernels, at
least for community distros). I'm not sure what the best cluster
size to pick by default should be, though. 16k? 64k?
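Back-of-the-envelope (my arithmetic, assuming the cluster bitmap stays at one 4k block per group, i.e. 32768 clusters per group):

        16k clusters: 32768 * 16 KiB = 512 MiB per group -> 2^31 groups for a 2^60-byte fs
        64k clusters: 32768 * 64 KiB =   2 GiB per group -> 2^29 groups for a 2^60-byte fs

versus 2^33 groups at the default 128 MiB group size, which is what overflows the 32-bit count.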

- Ted

2011-10-04 04:03:43

by Eric Sandeen

[permalink] [raw]
Subject: Re: mkfs'ing a 48-bit fs... or not.

On 10/3/11 4:55 PM, Eric Sandeen wrote:
> Has anyone tried mke2fs at its limits? The latest git tree seems to fail in several ways.
> (Richard Jones reported the initial failure)
>
> # truncate --size 1152921504606846976 reallybigfile
> # mke2fs -t ext4 reallybigfile

...

> Trying it with "-O bigalloc" (which should be automatic at this size,
> I think?) just goes away for a very long time, I'm not sure what it's
> thinking about, or if it's in a loop somewhere (looking now).

It comes up with too many inodes, then tries to reduce the count,
but the "waste not want not" logic bumps it back up... ipg eventually
goes "below" 0 but it's unsigned so it goes on in this loop forever.

Some of this is my fault... I put that retry logic in years ago. :(

I'll see what I can do to fix it up.

-Eric

2011-10-04 04:27:03

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 1/2] Add "big" and "huge" types to mke2fs.conf

mke2fs attempts to use the "big" and "huge" types, and now that mke2fs
will complain if there are file system types which are undefined,
let's add definitions for them.

Thanks to Richard Jones for reporting this problem.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
misc/mke2fs-hurd.conf | 6 ++++++
misc/mke2fs.conf | 6 ++++++
2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/misc/mke2fs-hurd.conf b/misc/mke2fs-hurd.conf
index 52ed7e5..4f0527d 100644
--- a/misc/mke2fs-hurd.conf
+++ b/misc/mke2fs-hurd.conf
@@ -21,6 +21,12 @@
floppy = {
inode_ratio = 8192
}
+ big = {
+ inode_ratio = 32768
+ }
+ huge = {
+ inode_ratio = 65536
+ }
news = {
inode_ratio = 4096
}
diff --git a/misc/mke2fs.conf b/misc/mke2fs.conf
index 775e046..0871f77 100644
--- a/misc/mke2fs.conf
+++ b/misc/mke2fs.conf
@@ -30,6 +30,12 @@
inode_size = 128
inode_ratio = 8192
}
+ big = {
+ inode_ratio = 32768
+ }
+ huge = {
+ inode_ratio = 65536
+ }
news = {
inode_ratio = 4096
}
--
1.7.4.1.22.gec8e1.dirty


2011-10-04 04:27:07

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH 2/2] libext2fs: fix bad cast which causes problems for file systems > 512EB

If the number of block groups exceeds 2**32, a bad cast would lead to
a bogus "Not enough space to build proposed filesystem while setting
up superblock" failure.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/ext2fs/initialize.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
index 2875f97..b050a0a 100644
--- a/lib/ext2fs/initialize.c
+++ b/lib/ext2fs/initialize.c
@@ -248,7 +248,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
}

retry:
- fs->group_desc_count = (blk_t) ext2fs_div64_ceil(
+ fs->group_desc_count = (dgrp_t) ext2fs_div64_ceil(
ext2fs_blocks_count(super) - super->s_first_data_block,
EXT2_BLOCKS_PER_GROUP(super));
if (fs->group_desc_count == 0) {
--
1.7.4.1.22.gec8e1.dirty


2011-10-04 04:28:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: mkfs'ing a 48-bit fs... or not.

On Mon, Oct 03, 2011 at 11:03:40PM -0500, Eric Sandeen wrote:
> It comes up with too many inodes, then tries to reduce the count,
> but the "waste not want not" logic bumps it back up... ipg eventually
> goes "below" 0 but it's unsigned so it goes on in this loop forever.

Oh, this is because we can't have more than 2**32 inodes, right? Doh!

> Some of this is my fault... I put that retry logic in years ago. :(
>
> I'll see what I can do to fix it up.

Many thanks. I've fixed the other issues you've pointed out. Check
out the next branch on github...

- Ted

2011-10-04 05:30:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: mkfs'ing a 48-bit fs... or not.

On 2011-10-03, at 10:00 PM, Ted Ts'o <[email protected]> wrote:
> On Mon, Oct 03, 2011 at 04:55:11PM -0500, Eric Sandeen wrote:
>> Has anyone tried mke2fs at its limits? The latest git tree seems to fail in several ways.
>> (Richard Jones reported the initial failure)
>>
>> # truncate --size 1152921504606846976 reallybigfile
>> # mke2fs -t ext4 reallybigfile
>>
>> first,
>>
>> Warning: the fs_type huge is not defined in mke2fs.conf
>>
>> (when types "big" and "huge" got added, they never got a mke2fs.conf update?)
>
> It used to be that an undefined file system type didn't flag an error.
> It now does, so we should have definitions for them in mke2fs.conf.
>
>> reallybigfile: Not enough space to build proposed filesystem while setting up superblock

Isn't there also a problem with the number of block group descriptor blocks in the first group if meta_bg is not used? With one 64-byte group descriptor per 128MB group, that is 1024 bytes of descriptors for 2GB of blocks, or 128MB of descriptors for 256TB of blocks. At that point group 0 is entirely full of primary group descriptors and group 1 is full of backup descriptors, and we are out of luck for making a larger filesystem.

That is only 2^48 bytes, not 2^48 blocks (2^60 bytes), so it means meta_bg needs to get into more testing, and online resize with flex_bg needs to move forward.
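Spelling out the arithmetic (4k blocks, 128MB groups, 64-byte group descriptors):

        2^48 bytes / 2^27 bytes per group = 2^21 groups
        2^21 groups * 64 bytes per descriptor = 2^27 bytes = 128MB of descriptors

i.e. exactly one full group's worth, so without meta_bg the descriptor table cannot grow past a 2^48-byte (256TB) filesystem.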

>> because:
>>
>> fs->group_desc_count = (blk_t) ext2fs_div64_ceil(
>> ext2fs_blocks_count(super) - super->s_first_data_block,
>> EXT2_BLOCKS_PER_GROUP(super));
>> if (fs->group_desc_count == 0) {
>> retval = EXT2_ET_TOOSMALL;
>>
>> The div64_ceil returns > 2^32 (2^33, actually), and the cast to blk_t
>> (which should be dgrp_t?) turns that into a 0.
>
> Yep, that should be dgrp_t. Oops.
>
>> Trying it with "-O bigalloc" (which should be automatic at this size,
>> I think?) just goes away for a very long time, I'm not sure what it's
>> thinking about, or if it's in a loop somewhere (looking now).
>
> Well, we probably do want to engage bigalloc automatically, at some
> point (I want to wait until bigalloc is in commonly used kernels, at
> least for community distros). I'm not sure what the best cluster
> size to pick by default should be, though. 16k? 64k?
>
> - Ted

2011-10-04 07:06:20

by Richard W.M. Jones

[permalink] [raw]
Subject: Re: mkfs'ing a 48-bit fs... or not.


Thanks Eric. Here is the original thread (see also the replies).

https://lists.fedoraproject.org/pipermail/devel/2011-October/157618.html

In theory I could test this up to ~ 2**63, but it requires a number of
bugfixes and changes in qemu. Obviously that size is ridiculous :-)
but it may reveal bugs that wouldn't be found by ordinary testing.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into Xen guests.
http://et.redhat.com/~rjones/virt-p2v

2011-10-04 11:47:13

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] libext2fs: fix bad cast which causes problems for file systems > 512EB

On 10/3/11 11:27 PM, Theodore Ts'o wrote:
> If the number of block groups exceeds 2**32, a bad cast would lead to
> a bogus "Not enough space to build proposed filesystem while setting
> up superblock" failure.

It's the proper cast now, but I don't think it fixes the problem, since they
are both __u32...

But in any case, for the actual change at least:

Reviewed-by: Eric Sandeen <[email protected]>

> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> lib/ext2fs/initialize.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
> index 2875f97..b050a0a 100644
> --- a/lib/ext2fs/initialize.c
> +++ b/lib/ext2fs/initialize.c
> @@ -248,7 +248,7 @@ errcode_t ext2fs_initialize(const char *name, int flags,
> }
>
> retry:
> - fs->group_desc_count = (blk_t) ext2fs_div64_ceil(
> + fs->group_desc_count = (dgrp_t) ext2fs_div64_ceil(
> ext2fs_blocks_count(super) - super->s_first_data_block,
> EXT2_BLOCKS_PER_GROUP(super));
> if (fs->group_desc_count == 0) {


2011-10-04 18:05:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] libext2fs: fix bad cast which causes problems for file systems > 512EB

On Tue, Oct 04, 2011 at 06:47:12AM -0500, Eric Sandeen wrote:
> On 10/3/11 11:27 PM, Theodore Ts'o wrote:
> > If the number of block groups exceeds 2**32, a bad cast would lead to
> > a bogus "Not enough space to build proposed filesystem while setting
> > up superblock" failure.
>
> It's the proper cast now, but I don't think it fixes the problem, since they
> are both __u32...

Hmm, yes.

And to be quite honest I'm not sure it's worth fixing. 2**32 block
groups gets us up to 2**59 bytes assuming 4k blocks. The theoretical
maximum given the current extent tree format is 2**60 assuming 4k
blocks. So changing dgrp_t to be 64-bits just to get that last power
of two (i.e., from 512PB to a full EB) doesn't seem worth it. Simply
using a bigalloc cluster size of 8k would make the problem go away
(and arguably we'd probably want a large cluster size if someone
wanted to create a file system that big anyway).

So maybe we should just check to see if the required number of block
groups is greater than 2**32, and if so, give an error.
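(For reference, 2^32 groups * 32768 blocks/group * 4096 bytes/block = 2^59 bytes.) A rough sketch of such a check, error-code choice and exact placement aside:

        /* Sketch only: do the division in 64 bits, and refuse to wrap
         * the result into the 32-bit dgrp_t. */
        __u64 groups = ext2fs_div64_ceil(
                ext2fs_blocks_count(super) - super->s_first_data_block,
                EXT2_BLOCKS_PER_GROUP(super));

        if (groups > 0xffffffffULL) {
                retval = EXT2_ET_TOOSMALL;      /* placeholder; a dedicated
                                                 * "too many block groups"
                                                 * error would be clearer */
                goto cleanup;
        }
        fs->group_desc_count = (dgrp_t) groups;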

- Ted

2011-10-04 18:15:19

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/2] libext2fs: fix bad cast which causes problems for file systems > 512EB

On 10/4/11 1:05 PM, Ted Ts'o wrote:
> On Tue, Oct 04, 2011 at 06:47:12AM -0500, Eric Sandeen wrote:
>> On 10/3/11 11:27 PM, Theodore Ts'o wrote:
>>> If the number of block groups exceeds 2**32, a bad cast would lead to
>>> a bogus "Not enough space to build proposed filesystem while setting
>>> up superblock" failure.
>>
>> It's the proper cast now, but I don't think it fixes the problem, since they
>> are both __u32...
>
> Hmm, yes.
>
> And to be quite honest I'm not sure it's worth fixing. 2**32 block
> groups gets us up to 2**59 bytes assuming 4k blocks. The theoretical
> maximum given the current extent tree format is 2**60 assuming 4k
> blocks. So changing dgrp_t to be 64-bits just to get that last power
> of two (i.e., from 512PB to a full EB) doesn't seem worth it. Simply
> using a bigalloc cluster size of 8k would make the problem go away
> (and arguably we'd probably want a large cluster size if someone
> wanted to create a file system that big anyway).
>
> So maybe we should just check to see if the required number of block
> groups is greater than 2**32, and if so, give an error.
>
> - Ted
>

As long as we have a consistent, predictable, well-designed and well-understood
maximum (theoretical) size for the fs, I'm all for documenting & enforcing it.

TBH I'm still trying to get all the moving parts together in my head, between
meta_bg & bigalloc & whatnot, at these sizes.

The initialization functions are looking pretty ad-hoc to me right now. :)

-Eric