2014-01-12 03:23:54

by Theodore Ts'o

Subject: [PATCH RFC] Add support for new compat feature "one_backup_sb"

In practice, it is **extremely** rare for users to try to use more
than the first backup superblock located at the beginning of block
group #1 (i.e., at block number 32768 for file systems with a 4k
block size).

Aside from reducing the overhead of the file system by a small number
of blocks, by eliminating the rest of the backup superblocks, it
allows us to have a much more flexible metadata layout. For example,
we can force all of the allocation bitmaps and inode table blocks to
the beginning of the disk, which allows most of the disk to be
exclusively used for contiguous data blocks.

This simplifies taking advantage of certain HDD-specific features,
such as Shingled Magnetic Recording (aka Shingled Drives), and the
TCG's OPAL Storage Specification where having a simple mapping between
LBA block ranges and data blocks used by the file system can make life
much simpler.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/e2p/feature.c | 2 ++
lib/ext2fs/closefs.c | 3 +++
lib/ext2fs/ext2_fs.h | 1 +
lib/ext2fs/ext2fs.h | 3 ++-
lib/ext2fs/res_gdt.c | 8 +++++++-
misc/ext4.5.in | 7 +++++++
misc/mke2fs.c | 3 ++-
resize/online.c | 8 ++++++++
8 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
index 9691263..8f4d8e7 100644
--- a/lib/e2p/feature.c
+++ b/lib/e2p/feature.c
@@ -43,6 +43,8 @@ static struct feature feature_list[] = {
"lazy_bg" },
{ E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
"snapshot_bitmap" },
+ { E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_ONE_BACKUP_SB,
+ "one_backup_sb" },

{ E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
"sparse_super" },
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 3e4af7f..c806291 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -38,6 +38,9 @@ int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
if (!(fs->super->s_feature_ro_compat &
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
return 1;
+ if (fs->super->s_feature_compat &
+ EXT4_FEATURE_COMPAT_ONE_BACKUP_SB)
+ return 0;
if (!(group & 1))
return 0;
if (test_root(group, 3) || (test_root(group, 5)) ||
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index 930c2a3..65c40c0 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -696,6 +696,7 @@ struct ext2_super_block {
#define EXT2_FEATURE_COMPAT_LAZY_BG 0x0040
/* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE 0x0080 not used, legacy */
#define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100
+#define EXT4_FEATURE_COMPAT_ONE_BACKUP_SB 0x0200


#define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 367b8de..8ff5a7e 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
EXT2_FEATURE_COMPAT_RESIZE_INODE|\
EXT2_FEATURE_COMPAT_DIR_INDEX|\
- EXT2_FEATURE_COMPAT_EXT_ATTR)
+ EXT2_FEATURE_COMPAT_EXT_ATTR|\
+ EXT4_FEATURE_COMPAT_ONE_BACKUP_SB)

/* This #ifdef is temporary until compression is fully supported */
#ifdef ENABLE_COMPRESSION
diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
index 6449228..5fb0195 100644
--- a/lib/ext2fs/res_gdt.c
+++ b/lib/ext2fs/res_gdt.c
@@ -37,7 +37,13 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
*min += 1;
return ret;
}
-
+ if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_ONE_BACKUP_SB) {
+ if (*min == 1) {
+ *min += 1;
+ return 1;
+ }
+ return fs->group_desc_count;
+ }
if (*five < *min) {
min = five;
mult = 5;
diff --git a/misc/ext4.5.in b/misc/ext4.5.in
index fab1139..9005b86 100644
--- a/misc/ext4.5.in
+++ b/misc/ext4.5.in
@@ -171,6 +171,13 @@ kernels from mounting file systems that they could not understand.
.\" .br
.\" .B Future feature, available in e2fsprogs 1.43-WIP
.TP
+.B one_backup_sb
+.br
+This feature indicates that there will only be a single backup
+superblock and block group descriptor located at the beginning of the
+second block group (i.e., block group #1). This is a more extreme
+version of sparse_super.
+.TP
.B meta_bg
.br
This ext4 feature allows file systems to be resized on-line without explicitly
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c45b42f..cec473e 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -924,7 +924,8 @@ static __u32 ok_features[3] = {
EXT3_FEATURE_COMPAT_HAS_JOURNAL |
EXT2_FEATURE_COMPAT_RESIZE_INODE |
EXT2_FEATURE_COMPAT_DIR_INDEX |
- EXT2_FEATURE_COMPAT_EXT_ATTR,
+ EXT2_FEATURE_COMPAT_EXT_ATTR |
+ EXT4_FEATURE_COMPAT_ONE_BACKUP_SB,
/* Incompat */
EXT2_FEATURE_INCOMPAT_FILETYPE|
EXT3_FEATURE_INCOMPAT_EXTENTS|
diff --git a/resize/online.c b/resize/online.c
index defcac1..ab126e7 100644
--- a/resize/online.c
+++ b/resize/online.c
@@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
no_resize_ioctl = 1;
}

+ if (EXT2_HAS_COMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_COMPAT_ONE_BACKUP_SB) &&
+ (access("/sys/fs/ext4/features/one_backup_sb", R_OK) != 0)) {
+ com_err(program_name, 0, _("kernel does not support online "
+ "resize with one_backup_sb"));
+ exit(1);
+ }
+
printf(_("Filesystem at %s is mounted on %s; "
"on-line resizing required\n"), fs->device_name, mtpt);

--
1.8.5.rc3.362.gdf10213



2014-01-13 13:27:12

by Carlos Maiolino

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

I'm not really a big fan of removing more backup metadata blocks than we already
have with the sparse_super feature, but given how SMR drives handle writes, this
might save the filesystem from a lot of fragmentation when updating the backup
superblocks.

I'm just wondering whether it might still be worth keeping a backup
superblock in the last block group, but I'm not really sure what problems
that might cause when resizing a filesystem. It might add more problems than
benefits :)

Despite my thoughts above, the patch looks good to me; consider it
Reviewed-by: Carlos Maiolino <[email protected]>

Cheers


> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Carlos

2014-01-13 14:06:48

by Theodore Ts'o

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

On Mon, Jan 13, 2014 at 11:27:08AM -0200, Carlos Maiolino wrote:
> I'm not really a big fan of removing more backup metadata blocks
> than we already have with the sparse_super feature, but given how
> SMR drives handle writes, this might save the filesystem from a lot
> of fragmentation when updating the backup superblocks.

Not only that, but the backup superblocks become extra things for the
SMR subsystem to have to copy around.

> I'm just wondering whether it might still be worth keeping a
> backup superblock in the last block group, but I'm not really sure
> what problems that might cause when resizing a filesystem. It
> might add more problems than benefits :)

Yes, I thought about putting the backup superblock either at the very
last block group, or at the very end of the file system (which would
avoid further fragmentation). Either way, it adds a lot more
complications, and realistically, have you ever actually seen a user
take advantage of a backup superblock other than the one at block
#32768?

Still, it is something we could do, if we really thought it would
help. My original thought was that keeping things simple was better
than adding more complexity, especially if the vast majority of users
would never take advantage of such a feature, and given that everyone
is trained to know that a backup superblock is always available at
32768, so we really want to have one there regardless. Given that, is
having a third backup superblock at the end of the disk really going
to provide that much additional safety?

-Ted

P.S. If we want to have extra copies of the backup blocks, we could
also use e2image to make additional backup copies; we could even have a
future version of mke2fs place a copy of the e2image file at the very
end of the file system, if we really thought it would be helpful later on.
If we thought it was really required, we should do it now, of course.
I'm just not entirely sure it's worth the extra hair, either way. I'm
curious to hear what other people think.

2014-01-13 16:19:57

by Theodore Ts'o

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

On Mon, Jan 13, 2014 at 09:06:45AM -0500, Theodore Ts'o wrote:
> I'm just not entirely sure it's worth the extra hair, either way. I'm
> curious to hear what other people think.

After the ext4 concall this morning, I've been convinced by Carlos's
and Andreas' arguments that we should add a superblock at the end of
the disk. So I'll rework the patches to do that.

- Ted




2014-01-13 16:42:20

by Carlos Maiolino

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

On Mon, Jan 13, 2014 at 09:06:45AM -0500, Theodore Ts'o wrote:
> On Mon, Jan 13, 2014 at 11:27:08AM -0200, Carlos Maiolino wrote:
> > I'm not really a big fan of removing more backup metadata blocks
> > than we already have with the sparse_super feature, but given how
> > SMR drives handle writes, this might save the filesystem from a lot
> > of fragmentation when updating the backup superblocks.
>
> Not only that, but the backup superblocks become extra things for the
> SMR subsystem to have to copy around.
>
> > I'm just wondering whether it might still be worth keeping a
> > backup superblock in the last block group, but I'm not really sure
> > what problems that might cause when resizing a filesystem. It
> > might add more problems than benefits :)
>
> Yes, I thought about putting the backup superblock either at the very
> last block group, or at the very end of the file system (which would
> avoid further fragmentation). Either way, it adds a lot more
> complications, and realistically, have you ever actually seen a user
> take advantage of a backup superblock other than the one at block
> #32768?
>
> Still, it is something we could do, if we really thought it would
> help. My original thought was that keeping things simple was better
> than adding more complexity, especially if the vast majority of users
> would never take advantage of such a feature, and given that everyone
> is trained to know that a backup superblock is always available at
> 32768, so we really want to have one there regardless. Given that, is
> having a third backup superblock at the end of the disk really going
> to provide that much additional safety?
>

I agree with you here; after working in frontline support for a few
years, I guess I've become too paranoid :)
The first thing that came to my mind when you argued for removing the remaining
backup superblocks was people issuing `dd` commands to the wrong place on the
device; a copy at the very end of the filesystem is less likely to be
overwritten than one at the beginning (or even at 32768). But as I said, I
agree with you here; it has been a long time since I saw anybody need another backup SB.

> -Ted
>
> P.S. If we want to have extra copies of the backup blocks, we could
> also use e2image to make additional backup copies; we could even have a
> future version of mke2fs place a copy of the e2image file at the very
> end of the file system, if we really thought it would be helpful later on.
> If we thought it was really required, we should do it now, of course.
> I'm just not entirely sure it's worth the extra hair, either way. I'm
> curious to hear what other people think.


--
Carlos

2014-01-13 20:41:33

by Andreas Dilger

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

On Jan 13, 2014, at 9:19 AM, Theodore Ts'o <[email protected]> wrote:

> On Mon, Jan 13, 2014 at 09:06:45AM -0500, Theodore Ts'o wrote:
>> I'm just not entirely sure it's worth the extra hair, either way. I'm
>> curious to hear what other people think.
>
> After the ext4 concall this morning, I've been convinced by Carlos's
> and Andreas' arguments that we should add a superblock at the end of
> the disk. So I'll rework the patches to do that.

Instead of adding a new location for the backup superblock at the
"end" of the disk (which is subject to change if the filesystem is
resized), what about using the last group that would otherwise have
a backup superblock with the "sparse_super" feature? That means
resizing the filesystem by some amount won't change the location of
the backup, and avoids the need to change all of the documentation
for how to find superblocks.

Cheers, Andreas







2014-01-13 22:39:28

by Theodore Ts'o

Subject: Re: [PATCH RFC] Add support for new compat feature "one_backup_sb"

On Mon, Jan 13, 2014 at 01:41:27PM -0700, Andreas Dilger wrote:
>
> Instead of adding a new location for the backup superblock at the
> "end" of the disk (which is subject to change if the filesystem is
> resized), what about using the last group that would otherwise have
> a backup superblock with the "sparse_super" feature? That means
> resizing the filesystem by some amount won't change the location of
> the backup, and avoids the need to change all of the documentation
> for how to find superblocks.

That doesn't actually reduce the complexity. The two headaches in
dealing with resize are (a) when you grow the file system and add
one or more new block groups, you need to release the blocks
associated with the backup superblock and block group descriptors, and
then (b) when you shrink a file system and remove one or more
block groups, you need to relocate blocks where we need to put backup
superblocks in the "new" last block group.

Neither of these is a show stopper, but it's annoying since we need to
implement (a) and (b) in resize2fs, and then in the kernel twice, once
for the old-style online resize, and then again for the meta_bg style
online resize.

If we put the 2nd backup superblock in the last location where the
sparse_super feature would place one, we still have to deal with both of
these issues, and indeed it's actually a bit more of a headache to calculate
the block group which satisfies this constraint. In addition, it
slightly decreases the range of contiguous data blocks. (Putting the
backup superblock at the very end of the file system is the best from
this perspective, but it increases the complexity headache even more.)

What I'll probably do for now is to simply not support online resize
for the newly renamed "super sparse" feature, since for my use case,
which is for things like SMR disks, we're not going to be resizing
them anyway. If we want to use this mode for RAID 5 arrays, then we
will have to get the resizing support working for the new "super
sparse" mode.

- Ted

2014-01-14 05:54:30

by Theodore Ts'o

Subject: [PATCH v2] Add support for new compat feature "super_sparse"

And here's the version of this patch which adds a backup superblock in the
last block group. Note the huge complexity required to support
shrinking such a file system. I still haven't tested that bit of code
yet, since it's also painful to create all of the various file systems
to test all of reserve_super_sparse_last_group().

But I'll send it out so people have an idea of what's needed/involved.

- Ted

From af0f4ad05d1bbce4ae6b817e2638a3700e8a5a6e Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <[email protected]>
Date: Sat, 11 Jan 2014 22:11:42 -0500
Subject: [PATCH] Add support for new compat feature "super_sparse"

In practice, it is **extremely** rare for users to try to use more
than the first backup superblock located at the beginning of block
group #1 (i.e., at block number 32768 for file systems with a 4k
block size). This new compat feature restricts the backup superblocks
to block group #1 and the last block group in the file system.

Aside from reducing the overhead of the file system by a small number
of blocks, by eliminating the rest of the backup superblocks, it
allows us to have a much more flexible metadata layout. For example,
we can force all of the allocation bitmaps and inode table blocks to
the beginning of the disk, which allows most of the disk to be
exclusively used for contiguous data blocks.

This simplifies taking advantage of certain HDD-specific features,
such as Shingled Magnetic Recording (aka Shingled Drives), and the
TCG's OPAL Storage Specification where having a simple mapping between
LBA block ranges and the data blocks used by the file system can make
life much simpler.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/e2p/feature.c | 2 +
lib/ext2fs/closefs.c | 10 +++-
lib/ext2fs/ext2_fs.h | 1 +
lib/ext2fs/ext2fs.h | 3 +-
lib/ext2fs/res_gdt.c | 14 +++++-
misc/ext4.5.in | 7 +++
misc/mke2fs.c | 3 +-
resize/online.c | 8 ++++
resize/resize2fs.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 169 insertions(+), 6 deletions(-)

diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
index 9691263..c06b833 100644
--- a/lib/e2p/feature.c
+++ b/lib/e2p/feature.c
@@ -43,6 +43,8 @@ static struct feature feature_list[] = {
"lazy_bg" },
{ E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
"snapshot_bitmap" },
+ { E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_SUPER_SPARSE,
+ "super_sparse" },

{ E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
"sparse_super" },
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 3e4af7f..caf5b46 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -35,9 +35,15 @@ static int test_root(unsigned int a, unsigned int b)

int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
{
- if (!(fs->super->s_feature_ro_compat &
- EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
+ if ((group <= 1) || !(fs->super->s_feature_ro_compat &
+ EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER))
return 1;
+ if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {
+ /* Implied by the above test */
+ if (/* group == 1 || */ group == fs->group_desc_count - 1)
+ return 1;
+ return 0;
+ }
if (!(group & 1))
return 0;
if (test_root(group, 3) || (test_root(group, 5)) ||
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index 930c2a3..eb040e5 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -696,6 +696,7 @@ struct ext2_super_block {
#define EXT2_FEATURE_COMPAT_LAZY_BG 0x0040
/* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE 0x0080 not used, legacy */
#define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100
+#define EXT4_FEATURE_COMPAT_SUPER_SPARSE 0x0200


#define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 1e07f88..efec97f 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
EXT2_FEATURE_COMPAT_RESIZE_INODE|\
EXT2_FEATURE_COMPAT_DIR_INDEX|\
- EXT2_FEATURE_COMPAT_EXT_ATTR)
+ EXT2_FEATURE_COMPAT_EXT_ATTR|\
+ EXT4_FEATURE_COMPAT_SUPER_SPARSE)

/* This #ifdef is temporary until compression is fully supported */
#ifdef ENABLE_COMPRESSION
diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
index 6449228..1ce6f68 100644
--- a/lib/ext2fs/res_gdt.c
+++ b/lib/ext2fs/res_gdt.c
@@ -31,13 +31,23 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
int mult = 3;
unsigned int ret;

+ if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {
+ if (*min == 1) {
+ *min = fs->group_desc_count - 1;
+ if (*min <= 1)
+ *min = 2;
+ return 1;
+ }
+ ret = *min;
+ *min += 1;
+ return ret;
+ }
if (!(fs->super->s_feature_ro_compat &
EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
ret = *min;
- *min += 1;
+ *min +=1 ;
return ret;
}
-
if (*five < *min) {
min = five;
mult = 5;
diff --git a/misc/ext4.5.in b/misc/ext4.5.in
index fab1139..d6f71e7 100644
--- a/misc/ext4.5.in
+++ b/misc/ext4.5.in
@@ -171,6 +171,13 @@ kernels from mounting file systems that they could not understand.
.\" .br
.\" .B Future feature, available in e2fsprogs 1.43-WIP
.TP
+.B super_sparse
+.br
+This feature indicates that there will be only two backup
+superblocks and block group descriptors: one located at the beginning of
+block group #1, and one in the last block group in the file system.
+This is a more extreme version of sparse_super.
+.TP
.B meta_bg
.br
This ext4 feature allows file systems to be resized on-line without explicitly
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c45b42f..825165f 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -924,7 +924,8 @@ static __u32 ok_features[3] = {
EXT3_FEATURE_COMPAT_HAS_JOURNAL |
EXT2_FEATURE_COMPAT_RESIZE_INODE |
EXT2_FEATURE_COMPAT_DIR_INDEX |
- EXT2_FEATURE_COMPAT_EXT_ATTR,
+ EXT2_FEATURE_COMPAT_EXT_ATTR |
+ EXT4_FEATURE_COMPAT_SUPER_SPARSE,
/* Incompat */
EXT2_FEATURE_INCOMPAT_FILETYPE|
EXT3_FEATURE_INCOMPAT_EXTENTS|
diff --git a/resize/online.c b/resize/online.c
index defcac1..af640c3 100644
--- a/resize/online.c
+++ b/resize/online.c
@@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
no_resize_ioctl = 1;
}

+ if (EXT2_HAS_COMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_COMPAT_SUPER_SPARSE) &&
+ (access("/sys/fs/ext4/features/super_sparse", R_OK) != 0)) {
+ com_err(program_name, 0, _("kernel does not support online "
+ "resize with super_sparse"));
+ exit(1);
+ }
+
printf(_("Filesystem at %s is mounted on %s; "
"on-line resizing required\n"), fs->device_name, mtpt);

diff --git a/resize/resize2fs.c b/resize/resize2fs.c
index c4c2517..a6cbe57 100644
--- a/resize/resize2fs.c
+++ b/resize/resize2fs.c
@@ -53,6 +53,9 @@ static errcode_t ext2fs_calculate_summary_stats(ext2_filsys fs);
static errcode_t fix_sb_journal_backup(ext2_filsys fs);
static errcode_t mark_table_blocks(ext2_filsys fs,
ext2fs_block_bitmap bmap);
+static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs);
+static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
+ ext2fs_block_bitmap meta_bmap);

/*
* Some helper CPP macros
@@ -191,6 +194,10 @@ errcode_t resize_fs(ext2_filsys fs, blk64_t *new_size, int flags,
goto errout;
print_resource_track(rfs, &rtrack, fs->io);

+ retval = clear_super_sparse_last_group(rfs);
+ if (retval)
+ goto errout;
+
rfs->new_fs->super->s_state &= ~EXT2_ERROR_FS;
rfs->new_fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;

@@ -952,6 +959,10 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
}

+ retval = reserve_super_sparse_last_group(rfs, meta_bmap);
+ if (retval)
+ goto errout;
+
if (old_blocks == new_blocks) {
retval = 0;
goto errout;
@@ -1840,6 +1851,122 @@ errout:
}

/*
+ * This function is used when expanding a file system. It frees the
+ * superblock and block group descriptor blocks from the block group
+ * which is no longer the last block group.
+ */
+static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs)
+{
+ ext2_filsys fs = rfs->new_fs;
+ errcode_t retval;
+ dgrp_t old_groups = rfs->old_fs->group_desc_count;
+ dgrp_t new_groups = fs->group_desc_count;
+ blk64_t sb, old_desc;
+ blk_t num;
+
+ if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
+ return 0;
+
+ if (new_groups <= old_groups || old_groups <= 2)
+ return 0;
+
+ retval = ext2fs_super_and_bgd_loc2(rfs->old_fs, old_groups - 1,
+ &sb, &old_desc, NULL, &num);
+ if (retval)
+ return retval;
+
+ if (sb)
+ ext2fs_unmark_block_bitmap2(fs->block_map, sb);
+ if (old_desc)
+ ext2fs_unmark_block_bitmap_range2(fs->block_map, old_desc, num);
+ return 0;
+}
+
+/*
+ * This function is used when shrinking a file system. We need to
+ * utilize blocks from what will be the new last block group for the
+ * backup superblock and block group descriptor blocks.
+ * Unfortunately, those blocks may be used by other files or fs
+ * metadata blocks. We need to mark them as being in use.
+ */
+static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
+ ext2fs_block_bitmap meta_bmap)
+{
+ ext2_filsys fs = rfs->new_fs;
+ ext2_filsys old_fs = rfs->old_fs;
+ errcode_t retval;
+ dgrp_t old_groups = old_fs->group_desc_count;
+ dgrp_t new_groups = fs->group_desc_count;
+ dgrp_t g;
+ blk64_t blk, sb, old_desc;
+ blk_t i, num;
+ int realloc = 0;
+
+ if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
+ return 0;
+
+ if (new_groups >= old_groups || new_groups <= 2)
+ return 0;
+
+ retval = ext2fs_super_and_bgd_loc2(rfs->new_fs, new_groups - 1,
+ &sb, &old_desc, NULL, &num);
+ if (retval)
+ return retval;
+
+ if (!sb) {
+ fputs(_("Should never happen! No sb in last super_sparse bg?\n"),
+ stderr);
+ exit(1);
+ }
+ if (old_desc && old_desc != sb+1) {
+ fputs(_("Should never happen! Unexpected old_desc in "
+ "super_sparse bg?\n"),
+ stderr);
+ exit(1);
+ }
+ num = (old_desc) ? num + 1 : 1;
+
+ /* Reserve the backup blocks */
+ ext2fs_mark_block_bitmap_range2(fs->block_map, sb, num);
+
+ for (g = 0; g < fs->group_desc_count; g++) {
+ blk64_t mb;
+
+ mb = ext2fs_block_bitmap_loc(fs, g);
+ if ((mb >= sb) && (mb < sb + num)) {
+ ext2fs_block_bitmap_loc_set(fs, g, 0);
+ realloc = 1;
+ }
+ mb = ext2fs_inode_bitmap_loc(fs, g);
+ if ((mb >= sb) && (mb < sb + num)) {
+ ext2fs_inode_bitmap_loc_set(fs, g, 0);
+ realloc = 1;
+ }
+ mb = ext2fs_inode_table_loc(fs, g);
+ if ((mb < sb + num) &&
+ (sb < mb + fs->inode_blocks_per_group)) {
+ ext2fs_inode_table_loc_set(fs, g, 0);
+ realloc = 1;
+ }
+ if (realloc) {
+ retval = ext2fs_allocate_group_table(fs, g, 0);
+ if (retval)
+ return retval;
+ }
+ }
+
+ for (blk = sb, i = 0; i < num; blk++, i++) {
+ if (ext2fs_test_block_bitmap2(old_fs->block_map, blk) &&
+ !ext2fs_test_block_bitmap2(meta_bmap, blk)) {
+ ext2fs_mark_block_bitmap2(rfs->move_blocks, blk);
+ rfs->needed_blocks++;
+ }
+ ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
+ }
+ return 0;
+}
+
+/*
* Fix the resize inode
*/
static errcode_t fix_resize_inode(ext2_filsys fs)
--
1.8.5.rc3.362.gdf10213


2014-01-14 11:21:58

by Andreas Dilger

Subject: Re: [PATCH v2] Add support for new compat feature "super_sparse"

A few comments on this new patch:
- I think the name will be confusing to users, especially non-native English
speakers. Is it "sparse_super" or "super_sparse" they want?
- I would suspect that group #1 is not the best place to put the backup.
For very large filesystems, there is a conflict with the backup group
descriptors in group #0 and #1. It would be better to put the one
backup in group #3 or something. I don't think this will be a problem
for SMR drives, since they will be so large that this will easily fit inside
(or close to) the flex_bg layout of the inode table.
- To simplify matters, it makes sense that super_sparse supersedes
the sparse_super and meta_bg features. It doesn't make sense
to have both. Should it also require flex_bg? Without it, it is mostly
useless.

Cheers, Andreas

> On Jan 13, 2014, at 22:54, Theodore Ts'o <[email protected]> wrote:
>
> And here's the version of this patch which adds a backup superblock in
> the last block group. Note the huge complexity required to support
> shrinking such a file system. I still haven't tested that bit of code
> yet, since it's also painful to create all of the various file systems
> to test all of reserve_super_sparse_last_group().
>
> But I'll send it out so people have an idea of what's needed/involved.
>
> - Ted
>
> From af0f4ad05d1bbce4ae6b817e2638a3700e8a5a6e Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Sat, 11 Jan 2014 22:11:42 -0500
> Subject: [PATCH] Add support for new compat feature "super_sparse"
>
> In practice, it is **extremely** rare for users to try to use more
> than the first backup superblock located at the beginning of block
> group #1. (i.e., at block number 32768 for file systems with a 4k
> block size). This new compat feature restricts the backup superblock
> to block group #1 and the last block group in the file system.
>
> Aside from reducing the overhead of the file system by a small number
> of blocks, by eliminating the rest of the backup superblocks, it
> allows us to have a much more flexible metadata layout. For example,
> we can force all of the allocation bitmaps and inode table blocks to
> the beginning of the disk, which allows most of the disk to be
> exclusively used for contiguous data blocks.
>
> This simplifies taking advantage of certain HDD specific features,
> such as Shingled Magnetic Recording (aka Shingled Drives), and the
> TCG's OPAL Storage Specification where having a simple mapping between
> LBA block ranges and the data blocks used by the file system can make
> life much simpler.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> lib/e2p/feature.c | 2 +
> lib/ext2fs/closefs.c | 10 +++-
> lib/ext2fs/ext2_fs.h | 1 +
> lib/ext2fs/ext2fs.h | 3 +-
> lib/ext2fs/res_gdt.c | 14 +++++-
> misc/ext4.5.in | 7 +++
> misc/mke2fs.c | 3 +-
> resize/online.c | 8 ++++
> resize/resize2fs.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 9 files changed, 169 insertions(+), 6 deletions(-)
>
> diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
> index 9691263..c06b833 100644
> --- a/lib/e2p/feature.c
> +++ b/lib/e2p/feature.c
> @@ -43,6 +43,8 @@ static struct feature feature_list[] = {
> "lazy_bg" },
> { E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
> "snapshot_bitmap" },
> + { E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_SUPER_SPARSE,
> + "super_sparse" },
>
> { E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
> "sparse_super" },
> diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
> index 3e4af7f..caf5b46 100644
> --- a/lib/ext2fs/closefs.c
> +++ b/lib/ext2fs/closefs.c
> @@ -35,9 +35,15 @@ static int test_root(unsigned int a, unsigned int b)
>
> int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
> {
> - if (!(fs->super->s_feature_ro_compat &
> - EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
> + if ((group <= 1) || !(fs->super->s_feature_ro_compat &
> + EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER))
> return 1;
> + if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {
> + /* Implied by the above test */
> + if (/* group == 1 || */ group == fs->group_desc_count - 1)
> + return 1;
> + return 0;
> + }
> if (!(group & 1))
> return 0;
> if (test_root(group, 3) || (test_root(group, 5)) ||
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index 930c2a3..eb040e5 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -696,6 +696,7 @@ struct ext2_super_block {
> #define EXT2_FEATURE_COMPAT_LAZY_BG 0x0040
> /* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE 0x0080 not used, legacy */
> #define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100
> +#define EXT4_FEATURE_COMPAT_SUPER_SPARSE 0x0200
>
>
> #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 1e07f88..efec97f 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
> EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
> EXT2_FEATURE_COMPAT_RESIZE_INODE|\
> EXT2_FEATURE_COMPAT_DIR_INDEX|\
> - EXT2_FEATURE_COMPAT_EXT_ATTR)
> + EXT2_FEATURE_COMPAT_EXT_ATTR|\
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE)
>
> /* This #ifdef is temporary until compression is fully supported */
> #ifdef ENABLE_COMPRESSION
> diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
> index 6449228..1ce6f68 100644
> --- a/lib/ext2fs/res_gdt.c
> +++ b/lib/ext2fs/res_gdt.c
> @@ -31,13 +31,23 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
> int mult = 3;
> unsigned int ret;
>
> + if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {
> + if (*min == 1) {
> + *min = fs->group_desc_count - 1;
> + if (*min <= 1)
> + *min = 2;
> + return 1;
> + }
> + ret = *min;
> + *min += 1;
> + return ret;
> + }
> if (!(fs->super->s_feature_ro_compat &
> EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
> ret = *min;
> - *min += 1;
> + *min +=1 ;
> return ret;
> }
> -
> if (*five < *min) {
> min = five;
> mult = 5;
> diff --git a/misc/ext4.5.in b/misc/ext4.5.in
> index fab1139..d6f71e7 100644
> --- a/misc/ext4.5.in
> +++ b/misc/ext4.5.in
> @@ -171,6 +171,13 @@ kernels from mounting file systems that they could not understand.
> .\" .br
> .\" .B Future feature, available in e2fsprogs 1.43-WIP
> .TP
> +.B super_sparse
> +.br
> +This feature indicates that there will be only two backup
> +superblocks and block group descriptors; one located at the beginning of
> +block group #1, and one in the last block group in the file system.
> +This is a more extreme version of sparse_super.
> +.TP
> .B meta_bg
> .br
> This ext4 feature allows file systems to be resized on-line without explicitly
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index c45b42f..825165f 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -924,7 +924,8 @@ static __u32 ok_features[3] = {
> EXT3_FEATURE_COMPAT_HAS_JOURNAL |
> EXT2_FEATURE_COMPAT_RESIZE_INODE |
> EXT2_FEATURE_COMPAT_DIR_INDEX |
> - EXT2_FEATURE_COMPAT_EXT_ATTR,
> + EXT2_FEATURE_COMPAT_EXT_ATTR |
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE,
> /* Incompat */
> EXT2_FEATURE_INCOMPAT_FILETYPE|
> EXT3_FEATURE_INCOMPAT_EXTENTS|
> diff --git a/resize/online.c b/resize/online.c
> index defcac1..af640c3 100644
> --- a/resize/online.c
> +++ b/resize/online.c
> @@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
> no_resize_ioctl = 1;
> }
>
> + if (EXT2_HAS_COMPAT_FEATURE(fs->super,
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE) &&
> + (access("/sys/fs/ext4/features/super_sparse", R_OK) != 0)) {
> + com_err(program_name, 0, _("kernel does not support online "
> + "resize with super_sparse"));
> + exit(1);
> + }
> +
> printf(_("Filesystem at %s is mounted on %s; "
> "on-line resizing required\n"), fs->device_name, mtpt);
>
> diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> index c4c2517..a6cbe57 100644
> --- a/resize/resize2fs.c
> +++ b/resize/resize2fs.c
> @@ -53,6 +53,9 @@ static errcode_t ext2fs_calculate_summary_stats(ext2_filsys fs);
> static errcode_t fix_sb_journal_backup(ext2_filsys fs);
> static errcode_t mark_table_blocks(ext2_filsys fs,
> ext2fs_block_bitmap bmap);
> +static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs);
> +static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
> + ext2fs_block_bitmap meta_bmap);
>
> /*
> * Some helper CPP macros
> @@ -191,6 +194,10 @@ errcode_t resize_fs(ext2_filsys fs, blk64_t *new_size, int flags,
> goto errout;
> print_resource_track(rfs, &rtrack, fs->io);
>
> + retval = clear_super_sparse_last_group(rfs);
> + if (retval)
> + goto errout;
> +
> rfs->new_fs->super->s_state &= ~EXT2_ERROR_FS;
> rfs->new_fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;
>
> @@ -952,6 +959,10 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
> new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
> }
>
> + retval = reserve_super_sparse_last_group(rfs, meta_bmap);
> + if (retval)
> + goto errout;
> +
> if (old_blocks == new_blocks) {
> retval = 0;
> goto errout;
> @@ -1840,6 +1851,122 @@ errout:
> }
>
> /*
> + * This function is used when expanding a file system. It frees the
> + * superblock and block group descriptor blocks from the block group
> + * which is no longer the last block group.
> + */
> +static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs)
> +{
> + ext2_filsys fs = rfs->new_fs;
> + errcode_t retval;
> + dgrp_t old_groups = rfs->old_fs->group_desc_count;
> + dgrp_t new_groups = fs->group_desc_count;
> + blk64_t sb, old_desc;
> + blk_t num;
> +
> + if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
> + return 0;
> +
> + if (new_groups <= old_groups || old_groups <= 2)
> + return 0;
> +
> + retval = ext2fs_super_and_bgd_loc2(rfs->old_fs, old_groups - 1,
> + &sb, &old_desc, NULL, &num);
> + if (retval)
> + return retval;
> +
> + if (sb)
> + ext2fs_unmark_block_bitmap2(fs->block_map, sb);
> + if (old_desc)
> + ext2fs_unmark_block_bitmap_range2(fs->block_map, old_desc, num);
> + return 0;
> +}
> +
> +/*
> + * This function is used when shrinking a file system. We need to
> + * utilize blocks from what will be the new last block group for the
> + * backup superblock and block group descriptor blocks.
> + * Unfortunately, those blocks may be used by other files or fs
> + * metadata blocks. We need to mark them as being in use.
> + */
> +static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
> + ext2fs_block_bitmap meta_bmap)
> +{
> + ext2_filsys fs = rfs->new_fs;
> + ext2_filsys old_fs = rfs->old_fs;
> + errcode_t retval;
> + dgrp_t old_groups = old_fs->group_desc_count;
> + dgrp_t new_groups = fs->group_desc_count;
> + dgrp_t g;
> + blk64_t blk, sb, old_desc;
> + blk_t i, num;
> + int realloc = 0;
> +
> + if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
> + return 0;
> +
> + if (new_groups >= old_groups || new_groups <= 2)
> + return 0;
> +
> + retval = ext2fs_super_and_bgd_loc2(rfs->new_fs, new_groups - 1,
> + &sb, &old_desc, NULL, &num);
> + if (retval)
> + return retval;
> +
> + if (!sb) {
> + fputs(_("Should never happen! No sb in last super_sparse bg?\n"),
> + stderr);
> + exit(1);
> + }
> + if (old_desc != sb+1) {
> + fputs(_("Should never happen! Unexpected old_desc in "
> + "super_sparse bg?\n"),
> + stderr);
> + exit(1);
> + }
> + num = (old_desc) ? num + 1 : 1;
> +
> + /* Reserve the backup blocks */
> + ext2fs_mark_block_bitmap_range2(fs->block_map, sb, num);
> +
> + for (g = 0; g < fs->group_desc_count; g++) {
> + blk64_t mb;
> +
> + mb = ext2fs_block_bitmap_loc(fs, g);
> + if ((mb >= sb) && (mb < sb + num)) {
> + ext2fs_block_bitmap_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + mb = ext2fs_inode_bitmap_loc(fs, g);
> + if ((mb >= sb) && (mb < sb + num)) {
> + ext2fs_inode_bitmap_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + mb = ext2fs_inode_table_loc(fs, g);
> + if ((mb < sb + num) &&
> + (sb < mb + fs->inode_blocks_per_group)) {
> + ext2fs_inode_table_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + if (realloc) {
> + retval = ext2fs_allocate_group_table(fs, g, 0);
> + if (retval)
> + return retval;
> + }
> + }
> +
> + for (blk = sb, i = 0; i < num; blk++, i++) {
> + if (ext2fs_test_block_bitmap2(old_fs->block_map, blk) &&
> + !ext2fs_test_block_bitmap2(meta_bmap, blk)) {
> + ext2fs_mark_block_bitmap2(rfs->move_blocks, blk);
> + rfs->needed_blocks++;
> + }
> + ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
> + }
> + return 0;
> +}
> +
> +/*
> * Fix the resize inode
> */
> static errcode_t fix_resize_inode(ext2_filsys fs)
> --
> 1.8.5.rc3.362.gdf10213
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-01-14 16:08:19

by Theodore Ts'o

Subject: Re: [PATCH v2] Add support for new compat feature "super_sparse"

On Tue, Jan 14, 2014 at 04:21:52AM -0700, Andreas Dilger wrote:
> A few comments on this new patch:
> - I think the name will be confusing to users, especially non-native English
> speakers. Is it "sparse_super" or "super_sparse" they want?

Yes, good point. Maybe sparse_super2? More generally, I don't think
we want most users of mke2fs ever needing or wanting to use these
features. We can kind of handle this by using "mke2fs -T smr", or
some such, but this is related to something I've been thinking about
for a while, which is a way of collapsing the following from dumpe2fs:

Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize

... into something like this.

Filesystem features: ext4_default_set needs_recovery
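
To make the collapsing idea concrete, here is a minimal sketch of how a
feature list could be folded into a named set. Everything in it is
hypothetical: the flag values, the EXT4_DEFAULT_SET mask, and
format_features() are invented for illustration and do not exist in
e2fsprogs. Real feature bits are also split across the compat, incompat
and ro_compat words, which this sketch ignores.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flag values, for illustration only. */
#define F_HAS_JOURNAL    0x01u
#define F_EXT_ATTR       0x02u
#define F_DIR_INDEX      0x04u
#define F_EXTENT         0x08u
#define F_FLEX_BG        0x10u
#define F_NEEDS_RECOVERY 0x20u

/* Hypothetical named set; assumed to be frozen once published. */
#define EXT4_DEFAULT_SET (F_HAS_JOURNAL | F_EXT_ATTR | F_DIR_INDEX | \
			  F_EXTENT | F_FLEX_BG)

static const struct {
	unsigned int bit;
	const char *name;
} feature_names[] = {
	{ F_HAS_JOURNAL,    "has_journal" },
	{ F_EXT_ATTR,       "ext_attr" },
	{ F_DIR_INDEX,      "dir_index" },
	{ F_EXTENT,         "extent" },
	{ F_FLEX_BG,        "flex_bg" },
	{ F_NEEDS_RECOVERY, "needs_recovery" },
};

/* Append one space-separated word to buf without overflowing len bytes. */
static void append_word(char *buf, size_t len, const char *word)
{
	size_t used = strlen(buf);

	snprintf(buf + used, len - used, "%s%s", used ? " " : "", word);
}

/*
 * Write the feature names for mask into buf (len must be > 0).  If every
 * bit of EXT4_DEFAULT_SET is present, print the set name once and list
 * only the remaining bits individually.
 */
void format_features(unsigned int mask, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	if ((mask & EXT4_DEFAULT_SET) == EXT4_DEFAULT_SET) {
		append_word(buf, len, "ext4_default_set");
		mask &= ~EXT4_DEFAULT_SET;
	}
	for (i = 0; i < sizeof(feature_names) / sizeof(feature_names[0]); i++)
		if (mask & feature_names[i].bit)
			append_word(buf, len, feature_names[i].name);
}
```

A file system missing even one bit of the set falls back to the full
per-feature listing, which is why the set definition would have to stay
fixed over time.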


> - I would suspect that group #1 is not the best place to put the backup.
> For very large filesystems, there is a conflict with the backup group
> descriptors in group #0 and #1. It would be better to put the one
> backup in group #3 or something. I don't think this will be a problem
> for SMR drives, since they will be so large that this will easily fit inside
> (or close to) the flex_bg layout of the inode table.

I'm not sure what you mean by "conflict with the backup
descriptors in #0 and #1"?

One reason why I'm inclined to leave a backup at group #1 is that for
most file systems, sysadmins are trained to know that there is a
backup at -b 32768. If we change it to be something else, it makes it
a bit harder to find the backup sb, which is a consideration.
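
For reference, the well-known "-b" values follow from the default
geometry. The sketch below assumes the default of 8 * block_size blocks
per group (one bitmap block's worth of bits), no bigalloc, and
s_first_data_block being 1 only for 1k blocks; group_first_block() is a
made-up helper name, not a libext2fs function.

```c
#include <stdint.h>

/*
 * First block of a block group under default mke2fs geometry.
 * Assumptions: blocks per group = 8 * block_size, no bigalloc,
 * and the first data block is 1 for 1k blocks, 0 otherwise.
 */
uint64_t group_first_block(uint32_t block_size, uint32_t group)
{
	uint64_t blocks_per_group = 8ULL * block_size;
	uint64_t first_data_block = (block_size == 1024) ? 1 : 0;

	return first_data_block + (uint64_t)group * blocks_per_group;
}
```

With these assumptions group #1 starts at block 8193, 16384 or 32768
for 1k, 2k and 4k block sizes respectively, which is where the first
backup superblock lives.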

Yes, bigalloc does change the offset, but that's actually another
solution I had been looking at for our use case inside Google for big
SMR drives.


> - To simplify matters, it makes sense that super_sparse supersedes
> the sparse_super and meta_bg features. It doesn't make sense
> to have both. Should it also require flex_bg? Without it, it is mostly
> useless.

Actually, it doesn't supersede meta_bg. Meta_bg is about where to put
the block group descriptors to allow for 64-bit online resize, such
that the bg descriptor blocks are no longer contiguous. This is
separate and distinct from the question of which block groups have a
superblock and the contiguous (aka "old-style") set of block group
descriptors as backup.
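
The "which groups carry a backup" question can be shown as a standalone
sketch that mirrors the logic of the patched ext2fs_bg_has_super(). The
function and parameter names here are illustrative stand-ins (plain
booleans instead of the superblock feature words), not the library API.

```c
#include <stdbool.h>

/* True if a is an exact power of b (b^1, b^2, ...); a == 1 is not. */
static bool is_power_of(unsigned int a, unsigned int b)
{
	while (a > b) {
		if (a % b)
			return false;
		a /= b;
	}
	return a == b;
}

/*
 * Does this group hold a superblock backup and old-style descriptors?
 * Mirrors the order of tests in the patch: without sparse_super every
 * group has one, so super_sparse only takes effect on top of it.
 */
bool bg_has_super(unsigned int group, unsigned int group_count,
		  bool sparse_super, bool super_sparse)
{
	if (group <= 1 || !sparse_super)
		return true;
	if (super_sparse)
		return group == group_count - 1;	/* group 1 handled above */
	if (!(group & 1))
		return false;
	return is_power_of(group, 3) || is_power_of(group, 5) ||
	       is_power_of(group, 7);
}
```

Nothing in this decision depends on meta_bg, which only changes where
the descriptor blocks themselves are stored.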

I agree that for the use case of keeping the data blocks contiguous,
it only makes sense to use it with flex_bg; but the file system
options are largely orthogonal, and it doesn't actually simplify
anything from a code complexity standpoint to require them. How we
make it easy for users to request a certain set of features is a
different question, and that's where I think ultimately mke2fs's -T
option is going to come in really handy.

- Ted

2014-01-14 18:42:37

by Darrick J. Wong

Subject: Re: [PATCH v2] Add support for new compat feature "super_sparse"

On Tue, Jan 14, 2014 at 12:54:26AM -0500, Theodore Ts'o wrote:
> And here's the version of this patch which adds a backup superblock in
> the last block group. Note the huge complexity required to support
> shrinking such a file system. I still haven't tested that bit of code
> yet, since it's also painful to create all of the various file systems
> to test all of reserve_super_sparse_last_group().
>
> But I'll send it out so people have an idea of what's needed/involved.
>
> - Ted
>
> From af0f4ad05d1bbce4ae6b817e2638a3700e8a5a6e Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <[email protected]>
> Date: Sat, 11 Jan 2014 22:11:42 -0500
> Subject: [PATCH] Add support for new compat feature "super_sparse"
>
> In practice, it is **extremely** rare for users to try to use more
> than the first backup superblock located at the beginning of block
> group #1. (i.e., at block number 32768 for file systems with a 4k
> block size). This new compat feature restricts the backup superblock
> to block group #1 and the last block group in the file system.
>
> Aside from reducing the overhead of the file system by a small number
> of blocks, by eliminating the rest of the backup superblocks, it
> allows us to have a much more flexible metadata layout. For example,
> we can force all of the allocation bitmaps and inode table blocks to
> the beginning of the disk, which allows most of the disk to be
> exclusively used for contiguous data blocks.
>
> This simplifies taking advantage of certain HDD specific features,
> such as Shingled Magnetic Recording (aka Shingled Drives), and the
> TCG's OPAL Storage Specification where having a simple mapping between
> LBA block ranges and the data blocks used by the file system can make
> life much simpler.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> lib/e2p/feature.c | 2 +
> lib/ext2fs/closefs.c | 10 +++-
> lib/ext2fs/ext2_fs.h | 1 +
> lib/ext2fs/ext2fs.h | 3 +-
> lib/ext2fs/res_gdt.c | 14 +++++-
> misc/ext4.5.in | 7 +++
> misc/mke2fs.c | 3 +-
> resize/online.c | 8 ++++
> resize/resize2fs.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 9 files changed, 169 insertions(+), 6 deletions(-)
>
> diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
> index 9691263..c06b833 100644
> --- a/lib/e2p/feature.c
> +++ b/lib/e2p/feature.c
> @@ -43,6 +43,8 @@ static struct feature feature_list[] = {
> "lazy_bg" },
> { E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
> "snapshot_bitmap" },
> + { E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_SUPER_SPARSE,
> + "super_sparse" },
>
> { E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
> "sparse_super" },
> diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
> index 3e4af7f..caf5b46 100644
> --- a/lib/ext2fs/closefs.c
> +++ b/lib/ext2fs/closefs.c
> @@ -35,9 +35,15 @@ static int test_root(unsigned int a, unsigned int b)
>
> int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
> {
> - if (!(fs->super->s_feature_ro_compat &
> - EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
> + if ((group <= 1) || !(fs->super->s_feature_ro_compat &
> + EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER))
> return 1;
> + if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {

Ugh, SPARSE_SUPER/SUPER_SPARSE is already making my head spin.

May I suggest FEW_SUPERS? Or perhaps MINIMAL_SUPERS?

> + /* Implied by the above test */
> + if (/* group == 1 || */ group == fs->group_desc_count - 1)
> + return 1;
> + return 0;
> + }
> if (!(group & 1))
> return 0;
> if (test_root(group, 3) || (test_root(group, 5)) ||
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index 930c2a3..eb040e5 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -696,6 +696,7 @@ struct ext2_super_block {
> #define EXT2_FEATURE_COMPAT_LAZY_BG 0x0040
> /* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE 0x0080 not used, legacy */
> #define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100
> +#define EXT4_FEATURE_COMPAT_SUPER_SPARSE 0x0200
>
>
> #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 1e07f88..efec97f 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
> EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
> EXT2_FEATURE_COMPAT_RESIZE_INODE|\
> EXT2_FEATURE_COMPAT_DIR_INDEX|\
> - EXT2_FEATURE_COMPAT_EXT_ATTR)
> + EXT2_FEATURE_COMPAT_EXT_ATTR|\
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE)
>
> /* This #ifdef is temporary until compression is fully supported */
> #ifdef ENABLE_COMPRESSION
> diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
> index 6449228..1ce6f68 100644
> --- a/lib/ext2fs/res_gdt.c
> +++ b/lib/ext2fs/res_gdt.c
> @@ -31,13 +31,23 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
> int mult = 3;
> unsigned int ret;
>
> + if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE) {
> + if (*min == 1) {
> + *min = fs->group_desc_count - 1;
> + if (*min <= 1)
> + *min = 2;
> + return 1;
> + }
> + ret = *min;
> + *min += 1;
> + return ret;
> + }
> if (!(fs->super->s_feature_ro_compat &
> EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
> ret = *min;
> - *min += 1;
> + *min +=1 ;

Is this whitespace change supposed to be here?

> return ret;
> }
> -
> if (*five < *min) {
> min = five;
> mult = 5;
> diff --git a/misc/ext4.5.in b/misc/ext4.5.in
> index fab1139..d6f71e7 100644
> --- a/misc/ext4.5.in
> +++ b/misc/ext4.5.in
> @@ -171,6 +171,13 @@ kernels from mounting file systems that they could not understand.
> .\" .br
> .\" .B Future feature, available in e2fsprogs 1.43-WIP
> .TP
> +.B super_sparse
> +.br
> +This feature indicates that there will be only two backup
> +superblocks and block group descriptors; one located at the beginning of
> +block group #1, and one in the last block group in the file system.
> +This is a more extreme version of sparse_super.
> +.TP
> .B meta_bg
> .br
> This ext4 feature allows file systems to be resized on-line without explicitly
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index c45b42f..825165f 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -924,7 +924,8 @@ static __u32 ok_features[3] = {
> EXT3_FEATURE_COMPAT_HAS_JOURNAL |
> EXT2_FEATURE_COMPAT_RESIZE_INODE |
> EXT2_FEATURE_COMPAT_DIR_INDEX |
> - EXT2_FEATURE_COMPAT_EXT_ATTR,
> + EXT2_FEATURE_COMPAT_EXT_ATTR |
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE,
> /* Incompat */
> EXT2_FEATURE_INCOMPAT_FILETYPE|
> EXT3_FEATURE_INCOMPAT_EXTENTS|
> diff --git a/resize/online.c b/resize/online.c
> index defcac1..af640c3 100644
> --- a/resize/online.c
> +++ b/resize/online.c
> @@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
> no_resize_ioctl = 1;
> }
>
> + if (EXT2_HAS_COMPAT_FEATURE(fs->super,
> + EXT4_FEATURE_COMPAT_SUPER_SPARSE) &&
> + (access("/sys/fs/ext4/features/super_sparse", R_OK) != 0)) {
> + com_err(program_name, 0, _("kernel does not support online "
> + "resize with super_sparse"));
> + exit(1);
> + }
> +
> printf(_("Filesystem at %s is mounted on %s; "
> "on-line resizing required\n"), fs->device_name, mtpt);
>
> diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> index c4c2517..a6cbe57 100644
> --- a/resize/resize2fs.c
> +++ b/resize/resize2fs.c
> @@ -53,6 +53,9 @@ static errcode_t ext2fs_calculate_summary_stats(ext2_filsys fs);
> static errcode_t fix_sb_journal_backup(ext2_filsys fs);
> static errcode_t mark_table_blocks(ext2_filsys fs,
> ext2fs_block_bitmap bmap);
> +static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs);
> +static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
> + ext2fs_block_bitmap meta_bmap);
>
> /*
> * Some helper CPP macros
> @@ -191,6 +194,10 @@ errcode_t resize_fs(ext2_filsys fs, blk64_t *new_size, int flags,
> goto errout;
> print_resource_track(rfs, &rtrack, fs->io);
>
> + retval = clear_super_sparse_last_group(rfs);
> + if (retval)
> + goto errout;
> +
> rfs->new_fs->super->s_state &= ~EXT2_ERROR_FS;
> rfs->new_fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;
>
> @@ -952,6 +959,10 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
> new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
> }
>
> + retval = reserve_super_sparse_last_group(rfs, meta_bmap);
> + if (retval)
> + goto errout;
> +
> if (old_blocks == new_blocks) {
> retval = 0;
> goto errout;
> @@ -1840,6 +1851,122 @@ errout:
> }
>
> /*
> + * This function is used when expanding a file system. It frees the
> + * superblock and block group descriptor blocks from the block group
> + * which is no longer the last block group.
> + */
> +static errcode_t clear_super_sparse_last_group(ext2_resize_t rfs)
> +{
> + ext2_filsys fs = rfs->new_fs;
> + errcode_t retval;
> + dgrp_t old_groups = rfs->old_fs->group_desc_count;
> + dgrp_t new_groups = fs->group_desc_count;
> + blk64_t sb, old_desc;
> + blk_t num;
> +
> + if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
> + return 0;
> +
> + if (new_groups <= old_groups || old_groups <= 2)
> + return 0;
> +
> + retval = ext2fs_super_and_bgd_loc2(rfs->old_fs, old_groups - 1,
> + &sb, &old_desc, NULL, &num);
> + if (retval)
> + return retval;
> +
> + if (sb)
> + ext2fs_unmark_block_bitmap2(fs->block_map, sb);
> + if (old_desc)
> + ext2fs_unmark_block_bitmap_range2(fs->block_map, old_desc, num);
> + return 0;
> +}
> +
> +/*
> + * This function is used when shrinking a file system. We need to
> + * utilize blocks from what will be the new last block group for the
> + * backup superblock and block group descriptor blocks.
> + * Unfortunately, those blocks may be used by other files or fs
> + * metadata blocks. We need to mark them as being in use.
> + */
> +static errcode_t reserve_super_sparse_last_group(ext2_resize_t rfs,
> + ext2fs_block_bitmap meta_bmap)
> +{
> + ext2_filsys fs = rfs->new_fs;
> + ext2_filsys old_fs = rfs->old_fs;
> + errcode_t retval;
> + dgrp_t old_groups = old_fs->group_desc_count;
> + dgrp_t new_groups = fs->group_desc_count;
> + dgrp_t g;
> + blk64_t blk, sb, old_desc;
> + blk_t i, num;
> + int realloc = 0;
> +
> + if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SUPER_SPARSE))
> + return 0;
> +
> + if (new_groups >= old_groups || new_groups <= 2)
> + return 0;
> +
> + retval = ext2fs_super_and_bgd_loc2(rfs->new_fs, new_groups - 1,
> + &sb, &old_desc, NULL, &num);
> + if (retval)
> + return retval;
> +
> + if (!sb) {
> + fputs(_("Should never happen! No sb in last super_sparse bg?\n"),
> + stderr);
> + exit(1);
> + }
> + if (old_desc != sb+1) {
> + fputs(_("Should never happen! Unexpected old_desc in "
> + "super_sparse bg?\n"),
> + stderr);
> + exit(1);
> + }
> + num = (old_desc) ? num + 1 : 1;
> +
> + /* Reserve the backup blocks */
> + ext2fs_mark_block_bitmap_range2(fs->block_map, sb, num);
> +
> + for (g = 0; g < fs->group_desc_count; g++) {
> + blk64_t mb;
> +
> + mb = ext2fs_block_bitmap_loc(fs, g);
> + if ((mb >= sb) && (mb < sb + num)) {
> + ext2fs_block_bitmap_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + mb = ext2fs_inode_bitmap_loc(fs, g);
> + if ((mb >= sb) && (mb < sb + num)) {
> + ext2fs_inode_bitmap_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + mb = ext2fs_inode_table_loc(fs, g);
> + if ((mb < sb + num) &&
> + (sb < mb + fs->inode_blocks_per_group)) {
> + ext2fs_inode_table_loc_set(fs, g, 0);
> + realloc = 1;
> + }
> + if (realloc) {
> + retval = ext2fs_allocate_group_table(fs, g, 0);
> + if (retval)
> + return retval;
> + }
> + }
> +
> + for (blk = sb, i = 0; i < num; blk++, i++) {
> + if (ext2fs_test_block_bitmap2(old_fs->block_map, blk) &&
> + !ext2fs_test_block_bitmap2(meta_bmap, blk)) {
> + ext2fs_mark_block_bitmap2(rfs->move_blocks, blk);
> + rfs->needed_blocks++;
> + }
> + ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
> + }
> + return 0;
> +}
> +
> +/*

At a glance this seems ok to me...

--D
> * Fix the resize inode
> */
> static errcode_t fix_resize_inode(ext2_filsys fs)
> --
> 1.8.5.rc3.362.gdf10213
>

2014-01-16 20:21:52

by Andreas Dilger

Subject: Re: [PATCH v2] Add support for new compat feature "super_sparse"

On Jan 14, 2014, at 9:08 AM, Theodore Ts'o <[email protected]> wrote:
> On Tue, Jan 14, 2014 at 04:21:52AM -0700, Andreas Dilger wrote:
>> A few comments on this new patch:
>> - I think the name will be confusing to users, especially non-native English speakers. Is it "sparse_super" or "super_sparse" they want?
>
> Yes, good point. Maybe sparse_super2? More generally, I don't think
> we want most users of mke2fs ever needing or wanting to use these
> features. We can kind of handle this by using "mke2fs -T smr", or
> some such, but this is related to something I've been thinking about
> for a while, which is a way of collapsing the following from dumpe2fs:
>
> Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>
> ... into something like this.
>
> Filesystem features: ext4_default_set needs_recovery

I'm OK with this in theory, but it would make it harder to know what
features are actually enabled, especially if "ext4_default_set" is
changing over time. Also, while this might be OK for "dumpe2fs"
output, it shouldn't be used for the debugfs "features" command
output, since that would break the ability to determine what features
are actually implemented.

>> - I would suspect that group #1 is not the best place to put the backup.
>> For very large filesystems, there is a conflict with the backup group
>> descriptors in group #0 and #1. It would be better to put the one
>> backup in group #3 or something. I don't think this will be a problem
>> for SMR drives, since they will be so large that this will easily fit inside
>> (or close to) the flex_bg layout of the inode table.
>
> I'm not sure what you mean by "conflict with the backup
> descriptors in #0 and #1"?

In 4kB blocksize filesystems with 64-bit group descriptors, there
are 64 group descriptors per block, so for the 32k blocks in group
#0 this means a maximum of 32767 * 64 ~= 2M groups = 255TB before
the group #0 group descriptors collide with the group #1 superblock
and group #1 descriptor backups.
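
The estimate above can be reproduced with a short calculation. This is
only a sketch under the same simplifications: default geometry of
8 * block_size blocks per group, one block of group #0 taken by the
primary superblock, and no reserved GDT blocks. Both function names are
made up for the example.

```c
#include <stdint.h>

/*
 * Largest number of groups whose descriptors still fit in group #0
 * after the primary superblock takes one block.
 */
uint64_t max_groups_before_collision(uint32_t block_size, uint32_t desc_size)
{
	uint64_t blocks_per_group = 8ULL * block_size;
	uint64_t descs_per_block = block_size / desc_size;

	return (blocks_per_group - 1) * descs_per_block;
}

/* Size in whole TiB of a file system with the given number of groups. */
uint64_t groups_to_tib(uint64_t groups, uint32_t block_size)
{
	uint64_t group_bytes = 8ULL * block_size * block_size;

	return (groups * group_bytes) >> 40;
}
```

For 4k blocks and 64-byte descriptors this gives 32767 * 64 = 2097088
groups of 128 MiB each, just under 256 TiB, matching the rough 255TB
figure.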

This problem would be avoided by meta_bg, but that also reverts back
to the undesirable behaviour of spreading small metadata chunks all
over the filesystem. In some respects, meta_bg would be worse than
the normal sparse_super for SMR, since it writes a few blocks every
64 groups, while sparse_super will write a larger number of blocks
together but less often.
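
As a rough illustration of the meta_bg pattern: groups are clustered
into meta groups of one descriptor block's worth of groups, and that
descriptor block is kept in the first group of the cluster with copies
in the second and last groups. The helper below is a hypothetical
sketch, not libext2fs code, and it ignores a truncated final meta group.

```c
#include <stdint.h>

/*
 * Fill out[] with the three groups that hold the descriptor block
 * (primary plus two backups) for the given meta group.
 * Assumes the meta group is complete, i.e. not cut short by the end
 * of the file system.
 */
void meta_bg_desc_groups(uint32_t block_size, uint32_t desc_size,
			 uint32_t meta_group, uint32_t out[3])
{
	uint32_t groups_per_meta = block_size / desc_size;
	uint32_t first = meta_group * groups_per_meta;

	out[0] = first;				/* primary copy */
	out[1] = first + 1;			/* first backup */
	out[2] = first + groups_per_meta - 1;	/* second backup */
}
```

With 4k blocks and 64-byte descriptors that is three single blocks in
every run of 64 groups (for example groups 128, 129 and 191), which is
the scattering that hurts on SMR.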

It might make sense to combine meta_bg and flex_bg in this case so
that the superblock and its backups are kept in the same groups as
the bitmaps. That avoids metadata being spread around the disk.

> One reason why I'm inclined to leave a backup at group #1 is that for
> most file systems, sysadmins are trained to know that there is a
> backup at -b 32768. If we change it to be something else, it makes it
> a bit harder to find the backup sb, which is a consideration.

I thought that e2fsprogs automatically tries to read all of the
backup superblock and group descriptors if the primary fails, so
as long as it is kept in one of the "known" groups it should be
found automatically?

> Yes, bigalloc does change the offset, but that's actually another
> solution I had been looking at for our use case inside Google for big
> SMR drives.
>
>
>> - To simplify matters, it makes sense that super_sparse supersedes
>> the sparse_super and meta_bg features. It doesn't make sense
>> to have both. Should it also require flex_bg? Without it, it is mostly
>> useless.
>
> Actually, it doesn't supersede meta_bg. Meta_bg is about where to put
> the block group descriptors to allow for 64-bit online resize, such
> that the bg descriptor blocks are no longer contiguous. This is
> separate and distinct from the question of which block groups have a
> superblock and the contiguous (aka "old-style") set of block group
> descriptors as backup.
>
> I agree that for the use case of keeping the data blocks contiguous,
> it only makes sense to use it with flex_bg; but the file systems
> options are largely orthogonal, and it doesn't actually simplify
> anything from a code complexity standpoint to require them. How we
> make it easy for users to request a certain set of features is a
> different question, and that's where I think ultimately mke2fs's -T
> option is going to come in really handy.
>
> - Ted


Cheers, Andreas







2014-01-16 20:54:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v2] Add support for new compat feature "super_sparse"

On Thu, Jan 16, 2014 at 01:21:47PM -0700, Andreas Dilger wrote:
>
> I'm OK with this in theory, but it would make it harder to know what
> features are actually enabled, especially if "ext4_default_set" is
> changing over time. Also, while this might be OK for "dumpe2fs"
> output, it shouldn't be used for the debugfs "features" command
> output, since that would break the ability to determine what features
> are actually implemented.

Yeah, I think if we were going to use sets, the sets would have to be
invariant over time. So that probably means we'd have to do things
like ext4_set_v3, ext4_set_v4, etc. And I think we'd want to have
options to both debugfs's "features" command and dumpe2fs which
either show the full feature set, or the compressed version using
feature sets. There are some interesting UI design issues hiding
here, which is one of the reasons I haven't pursued this seriously for
the past couple of years.

> > I'm not sure what you mean by "conflict with the backup
> > descriptors in #0 and #1"?
>
> In 4kB blocksize filesystems with 64-bit group descriptors, there
> are 64 group descriptors per block, so for the 32k blocks in group
> #0 this means a maximum of 32767 * 64 ~= 2M groups = 255TB before
> the group #0 group descriptors collide with the group #1 superblock
> and group #1 descriptor backups.

Ah.... yes, good point. I suspect that we'd definitely want to use
bigalloc for a file system as big as 256TB, but still, this is
something we should try to fix in the future "sparse_super2" feature.
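The arithmetic behind the quoted figures can be checked with a short sketch. It assumes a group holds 8 * block_size blocks (one bitmap block's worth), that block 0 of group #0 holds the primary superblock, and that every remaining block of group #0 could hold descriptors; it ignores reserved GDT blocks and other metadata. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of groups whose descriptors fit in group #0 before they
 * would run into the start of group #1 (hypothetical helper).
 */
static uint64_t max_groups_before_collision(uint32_t block_size,
					    uint32_t desc_size)
{
	uint64_t descs_per_block, blocks_per_group;

	assert(desc_size != 0 && block_size % desc_size == 0);
	descs_per_block = block_size / desc_size;
	/* One bitmap block tracks 8 * block_size blocks. */
	blocks_per_group = 8ULL * block_size;
	/* Block 0 is taken by the primary superblock. */
	return (blocks_per_group - 1) * descs_per_block;
}

/* Capacity in bytes covered by `groups` full block groups. */
static uint64_t groups_to_bytes(uint64_t groups, uint32_t block_size)
{
	return groups * (8ULL * block_size) * block_size;
}
```

For block_size = 4096 and desc_size = 64 this gives 32767 * 64 = 2097088 groups of 128MB each, which is just under 256TB (2^48 bytes minus 8GB). That is consistent with both the "255TB" and the "256TB" figures in this thread.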

I wonder if the right answer is that we should have two fields in the
superblock which describe which block groups have the backup
superblocks, and then the tools which do automated searching for the
bitmaps would simply search the first couple of block groups looking
for the backup superblock.

If these fields are zero, then we can also skip having the backup
superblock --- which is actually what I'd probably use at Google,
because if the file system is that badly damaged, it's not worth it to
fix it. Better to simply fix the file system by using mke2fs, and
relying on the redundancies at the cluster file system level.
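A rough sketch of how such a lookup might work is shown here. The structure and the field name backup_bgs are hypothetical and are used only for illustration; this is not the on-disk superblock layout. A value of zero in a field is taken to mean "no backup".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two proposed superblock fields. */
struct sb_sketch {
	uint32_t backup_bgs[2];	/* group numbers; 0 means "unused" */
};

/*
 * Returns 1 if `group` holds a superblock under the two-field
 * scheme: group 0 always has the primary, and any other group has
 * a backup only if one of the fields names it.
 */
static int bg_has_super_sketch(const struct sb_sketch *sb, uint32_t group)
{
	int i;

	assert(sb != NULL);
	if (group == 0)
		return 1;
	for (i = 0; i < 2; i++) {
		/* group is nonzero here, so a zero field never matches. */
		if (sb->backup_bgs[i] == group)
			return 1;
	}
	return 0;
}
```

With both fields set to zero, only group 0 reports a superblock, which corresponds to the "no backups at all" case described above.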

- Ted