2009-09-16 16:25:04

by Will Drewry

[permalink] [raw]
Subject: [PATCH][RFC] resize2fs and uninit_bg questions

Hi linux-ext4,

I have a two questions with an accompanying patch for clarification.

resize2fs is uninit_bg aware, but when it is expanding an ext4
filesystem, it will always zero the inode tables. Is it safe to mimick
mke2fs's write_inode_table(.., lazy_flag=1) and leave the new block
groups' inode tables marked INODE_UNINIT, BLOCK_UNINIT and _not_ zero
out the inode table if uninit_bg is supported?

If it is okay, then it means offline resizing upwards can be just as
fast as mke2fs. I've attached a patch which is probably incomplete.
I'd love feedback as to the feasibility of the change and/or patch
quality.

As a follow-on, would it be sane to add support like this for
online resizing. From a cursory investigation, it looks like
setup_new_block_groups() could be modified to not zero itables
if uninit_bg is supported, and INODE_ZEROED could be replaced
with ΒG_*_UNINIT. However, I'm not sure if that is a naive
view. I'm happy to send along a patch illustrating this change
if that'd be helpful or welcome.

Any and all feedback is appreciated -- even if it just for me
to look at the archives/link/etc.

Thanks!


Signed-off-by: Will Drewry <[email protected]>
---
resize/resize2fs.c | 28 ++++++++++++++++++++++------
1 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/resize/resize2fs.c b/resize/resize2fs.c
index 1a5d910..9fcc3b9 100644
--- a/resize/resize2fs.c
+++ b/resize/resize2fs.c
@@ -497,8 +497,7 @@ retry:

fs->group_desc[i].bg_flags = 0;
if (csum_flag)
- fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
- EXT2_BG_INODE_ZEROED;
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
if (i == fs->group_desc_count-1) {
numblocks = (fs->super->s_blocks_count -
fs->super->s_first_data_block) %
@@ -568,7 +567,7 @@ errout:
static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
{
ext2_filsys fs;
- int adj = 0;
+ int adj = 0, csum_flag = 0, num = 0;
errcode_t retval;
blk_t group_block;
unsigned long i;
@@ -624,6 +623,9 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
&rfs->itable_buf);
if (retval)
goto errout;
+ /* Track if we can get by with a lazy init */
+ csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM);

memset(rfs->itable_buf, 0, fs->blocksize * fs->inode_blocks_per_group);
group_block = fs->super->s_first_data_block +
@@ -642,10 +644,24 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
/*
* Write out the new inode table
*/
+ if (csum_flag) {
+ /* These are _new_ inode tables. No inodes should be in use.
+ * (As per ext2fs_set_gdt_csum) */
+ fs->group_desc[i].bg_itable_unused = fs->super->s_inodes_per_group;
+ num = ((((fs->super->s_inodes_per_group -
+ fs->group_desc[i].bg_itable_unused) *
+ EXT2_INODE_SIZE(fs->super)) +
+ EXT2_BLOCK_SIZE(fs->super) - 1) /
+ EXT2_BLOCK_SIZE(fs->super));
+ } else {
+ num = fs->inode_blocks_per_group;
+ /* The kernel doesn't need to zero the itable blocks. We will below */
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
+ }
retval = io_channel_write_blk(fs->io,
- fs->group_desc[i].bg_inode_table,
- fs->inode_blocks_per_group,
- rfs->itable_buf);
+ fs->group_desc[i].bg_inode_table, /* blk */
+ num, /* count */
+ rfs->itable_buf); /* contents */
if (retval) goto errout;

io_channel_flush(fs->io);


2009-09-16 19:08:36

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Sep 16, 2009 11:24 -0500, Will Drewry wrote:
> I have a two questions with an accompanying patch for clarification.
>
> resize2fs is uninit_bg aware, but when it is expanding an ext4
> filesystem, it will always zero the inode tables. Is it safe to mimick
> mke2fs's write_inode_table(.., lazy_flag=1) and leave the new block
> groups' inode tables marked INODE_UNINIT, BLOCK_UNINIT and _not_ zero
> out the inode table if uninit_bg is supported?
>
> If it is okay, then it means offline resizing upwards can be just as
> fast as mke2fs. I've attached a patch which is probably incomplete.
> I'd love feedback as to the feasibility of the change and/or patch
> quality.
>
> As a follow-on, would it be sane to add support like this for
> online resizing. From a cursory investigation, it looks like
> setup_new_block_groups() could be modified to not zero itables
> if uninit_bg is supported, and INODE_ZEROED could be replaced
> with ΒG_*_UNINIT. However, I'm not sure if that is a naive
> view. I'm happy to send along a patch illustrating this change
> if that'd be helpful or welcome.

The question is why you would want to risk disk corruption vs.
the (likely not performance critical) online resize?

> Any and all feedback is appreciated -- even if it just for me
> to look at the archives/link/etc.
>
> diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> index 1a5d910..9fcc3b9 100644
> --- a/resize/resize2fs.c
> +++ b/resize/resize2fs.c
> @@ -497,8 +497,7 @@ retry:
>
> fs->group_desc[i].bg_flags = 0;
> if (csum_flag)
> - fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
> - EXT2_BG_INODE_ZEROED;
> + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;

This shouldn't be unconditional, since most users will want the
safety of having zeroed inode tables.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-09-16 20:47:43

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Wed, Sep 16, 2009 at 01:08:31PM -0600, Andreas Dilger wrote:
> On Sep 16, 2009 11:24 -0500, Will Drewry wrote:
> > I have a two questions with an accompanying patch for clarification.
> >
> > resize2fs is uninit_bg aware, but when it is expanding an ext4
> > filesystem, it will always zero the inode tables. Is it safe to mimick
> > mke2fs's write_inode_table(.., lazy_flag=1) and leave the new block
> > groups' inode tables marked INODE_UNINIT, BLOCK_UNINIT and _not_ zero
> > out the inode table if uninit_bg is supported?
> >
> > If it is okay, then it means offline resizing upwards can be just as
> > fast as mke2fs. I've attached a patch which is probably incomplete.
> > I'd love feedback as to the feasibility of the change and/or patch
> > quality.
> >
> > As a follow-on, would it be sane to add support like this for
> > online resizing. From a cursory investigation, it looks like
> > setup_new_block_groups() could be modified to not zero itables
> > if uninit_bg is supported, and INODE_ZEROED could be replaced
> > with ΒG_*_UNINIT. However, I'm not sure if that is a naive
> > view. I'm happy to send along a patch illustrating this change
> > if that'd be helpful or welcome.
>
> The question is why you would want to risk disk corruption vs.
> the (likely not performance critical) online resize?

I'm interested in it for a few reasons:
1. it undermines the use of uninit_bg on large filesystems as
e2fsck -f will go back to normal speed, even without those block
groups being 'used'. In my local example, it goes from 14s to 2m24s.

2. it will spread the I/O cost out over time. Online resizing often
means that you don't want to/can't unmount the fs. However, a
large filesystem increase might result in gigabytes of 0s being
written to the backing store which could result in I/O throttling
that makes doing it online less useful. It'd be nice to be able to
optionally amortize that cost as is done if the fs had been mke2fs -O
lazy_itable_init=1 at full size initially.

3. dm/sparse-file-backed ext4 filesystems end up having the file size
expanded quite early as init'ing the itables force extra non-sparse
bytes to be written. Otherwise, ext4+uninit_bg is a really nice fs
type for this purpose.

Would it seriously raise the risk of corruption if uninit_bg is already
in use? Alternately, would a patch to that effect stand a chance of ever
making it upstream?

> > Any and all feedback is appreciated -- even if it just for me
> > to look at the archives/link/etc.
> >
> > diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> > index 1a5d910..9fcc3b9 100644
> > --- a/resize/resize2fs.c
> > +++ b/resize/resize2fs.c
> > @@ -497,8 +497,7 @@ retry:
> >
> > fs->group_desc[i].bg_flags = 0;
> > if (csum_flag)
> > - fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
> > - EXT2_BG_INODE_ZEROED;
> > + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
>
> This shouldn't be unconditional, since most users will want the
> safety of having zeroed inode tables.

I've attached a version with it being flagged by a "-l" for lazy.

Thanks for the quick response!
will

Signed-off-by: Will Drewry <[email protected]>
---
resize/main.c | 7 +++++--
resize/online.c | 2 +-
resize/resize2fs.8.in | 3 +++
resize/resize2fs.c | 41 +++++++++++++++++++++++++++++------------
resize/resize2fs.h | 5 ++++-
5 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/resize/main.c b/resize/main.c
index 220c192..f04a939 100644
--- a/resize/main.c
+++ b/resize/main.c
@@ -39,7 +39,7 @@ char *program_name, *device_name, *io_options;

static void usage (char *prog)
{
- fprintf (stderr, _("Usage: %s [-d debug_flags] [-f] [-F] [-M] [-P] "
+ fprintf (stderr, _("Usage: %s [-d debug_flags] [-f] [-F] [-l] [-M] [-P] "
"[-p] device [new_size]\n\n"), prog);

exit (1);
@@ -189,7 +189,7 @@ int main (int argc, char ** argv)
if (argc && *argv)
program_name = *argv;

- while ((c = getopt (argc, argv, "d:fFhMPpS:")) != EOF) {
+ while ((c = getopt (argc, argv, "d:fFhlMPpS:")) != EOF) {
switch (c) {
case 'h':
usage(program_name);
@@ -209,6 +209,9 @@ int main (int argc, char ** argv)
case 'd':
flags |= atoi(optarg);
break;
+ case 'l':
+ flags |= RESIZE_LAZY_INIT;
+ break;
case 'p':
flags |= RESIZE_PERCENT_COMPLETE;
break;
diff --git a/resize/online.c b/resize/online.c
index 4bc5451..f0b0569 100644
--- a/resize/online.c
+++ b/resize/online.c
@@ -104,7 +104,7 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
* but at least it allows on-line resizing to function.
*/
new_fs->super->s_feature_incompat &= ~EXT4_FEATURE_INCOMPAT_FLEX_BG;
- retval = adjust_fs_info(new_fs, fs, 0, *new_size);
+ retval = adjust_fs_info(new_fs, fs, 0, *new_size, flags);
if (retval)
return retval;

diff --git a/resize/resize2fs.8.in b/resize/resize2fs.8.in
index 3ea7a63..020393e 100644
--- a/resize/resize2fs.8.in
+++ b/resize/resize2fs.8.in
@@ -102,6 +102,9 @@ really useful for doing
.B resize2fs
time trials.
.TP
+.B \-l
+Lazily initialize new inode tables if supported (uninit_bg).
+.TP
.B \-M
Shrink the filesystem to the minimum size.
.TP
diff --git a/resize/resize2fs.c b/resize/resize2fs.c
index 1a5d910..fee2e3f 100644
--- a/resize/resize2fs.c
+++ b/resize/resize2fs.c
@@ -41,7 +41,8 @@
#endif

static void fix_uninit_block_bitmaps(ext2_filsys fs);
-static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size);
+static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size,
+ int flags);
static errcode_t blocks_to_move(ext2_resize_t rfs);
static errcode_t block_mover(ext2_resize_t rfs);
static errcode_t inode_scan_and_fix(ext2_resize_t rfs);
@@ -106,7 +107,7 @@ errcode_t resize_fs(ext2_filsys fs, blk_t *new_size, int flags,
if (retval)
goto errout;

- retval = adjust_superblock(rfs, *new_size);
+ retval = adjust_superblock(rfs, *new_size, flags);
if (retval)
goto errout;

@@ -292,7 +293,7 @@ static void free_gdp_blocks(ext2_filsys fs,
* filesystem.
*/
errcode_t adjust_fs_info(ext2_filsys fs, ext2_filsys old_fs,
- ext2fs_block_bitmap reserve_blocks, blk_t new_size)
+ ext2fs_block_bitmap reserve_blocks, blk_t new_size, int flags)
{
errcode_t retval;
int overhead = 0;
@@ -496,9 +497,11 @@ retry:
adjblocks = 0;

fs->group_desc[i].bg_flags = 0;
- if (csum_flag)
- fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
- EXT2_BG_INODE_ZEROED;
+ if (csum_flag) {
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
+ if (!(flags & RESIZE_LAZY_INIT))
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
+ }
if (i == fs->group_desc_count-1) {
numblocks = (fs->super->s_blocks_count -
fs->super->s_first_data_block) %
@@ -565,10 +568,10 @@ errout:
* This routine adjusts the superblock and other data structures, both
* in disk as well as in memory...
*/
-static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
+static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size, int flags)
{
ext2_filsys fs;
- int adj = 0;
+ int adj = 0, csum_flag = 0, num = 0;
errcode_t retval;
blk_t group_block;
unsigned long i;
@@ -584,7 +587,7 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
if (retval)
return retval;

- retval = adjust_fs_info(fs, rfs->old_fs, rfs->reserve_blocks, new_size);
+ retval = adjust_fs_info(fs, rfs->old_fs, rfs->reserve_blocks, new_size, flags);
if (retval)
goto errout;

@@ -624,6 +627,9 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
&rfs->itable_buf);
if (retval)
goto errout;
+ /* Track if we can get by with a lazy init */
+ csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM);

memset(rfs->itable_buf, 0, fs->blocksize * fs->inode_blocks_per_group);
group_block = fs->super->s_first_data_block +
@@ -642,10 +648,21 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
/*
* Write out the new inode table
*/
+ if (csum_flag && (flags & RESIZE_LAZY_INIT)) {
+ /* These are _new_ inode tables. No inodes should be in use. */
+ fs->group_desc[i].bg_itable_unused = fs->super->s_inodes_per_group;
+ num = ((((fs->super->s_inodes_per_group -
+ fs->group_desc[i].bg_itable_unused) *
+ EXT2_INODE_SIZE(fs->super)) +
+ EXT2_BLOCK_SIZE(fs->super) - 1) /
+ EXT2_BLOCK_SIZE(fs->super));
+ } else {
+ num = fs->inode_blocks_per_group;
+ }
retval = io_channel_write_blk(fs->io,
- fs->group_desc[i].bg_inode_table,
- fs->inode_blocks_per_group,
- rfs->itable_buf);
+ fs->group_desc[i].bg_inode_table, /* blk */
+ num, /* count */
+ rfs->itable_buf); /* contents */
if (retval) goto errout;

io_channel_flush(fs->io);
diff --git a/resize/resize2fs.h b/resize/resize2fs.h
index fab7290..d4c862c 100644
--- a/resize/resize2fs.h
+++ b/resize/resize2fs.h
@@ -77,6 +77,8 @@ typedef struct ext2_sim_progress *ext2_sim_progmeter;
#define RESIZE_DEBUG_INODEMAP 0x0004
#define RESIZE_DEBUG_ITABLEMOVE 0x0008

+#define RESIZE_LAZY_INIT 0x0010
+
#define RESIZE_PERCENT_COMPLETE 0x0100
#define RESIZE_VERBOSE 0x0200

@@ -129,7 +131,8 @@ extern errcode_t resize_fs(ext2_filsys fs, blk_t *new_size, int flags,

extern errcode_t adjust_fs_info(ext2_filsys fs, ext2_filsys old_fs,
ext2fs_block_bitmap reserve_blocks,
- blk_t new_size);
+ blk_t new_size,
+ int flags);
extern blk_t calculate_minimum_resize_size(ext2_filsys fs);

2009-09-16 21:22:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Sep 16, 2009 15:42 -0500, Will Drewry wrote:
> I'm interested in it for a few reasons:
> 1. it undermines the use of uninit_bg on large filesystems as
> e2fsck -f will go back to normal speed, even without those block
> groups being 'used'. In my local example, it goes from 14s to 2m24s.

Ah, my bad... It definitely makes sense to mark new groups added
during online resize as {BLOCK,INODE}_UNINIT if that feature is
enabled for the filesystem. The e2fsck slowdown after a resize is
only a one-time event (that e2fsck would mark the unused groups as
UNINIT again) but it makes sense to do it correctly the first time.

> 2. it will spread the I/O cost out over time. Online resizing often
> means that you don't want to/can't unmount the fs. However, a
> large filesystem increase might result in gigabytes of 0s being
> written to the backing store which could result in I/O throttling
> that makes doing it online less useful. It'd be nice to be able to
> optionally amortize that cost as is done if the fs had been mke2fs -O
> lazy_itable_init=1 at full size initially.

While this is true, there is a non-zero risk of problems if the inode
table isn't zeroed, which is why lazy_itable_init isn't the default.
The risk is that if the group descriptor is invalid for some reason
(found by bad checksum, or some inode in use beyond itable_unused)
then the UNINIT and itable_unused fields will be ignored and a full
inode table scan for that group is done.

If the itable isn't zeroed, then any old inodes (from a previous
filesystem, or garbage) will be "reattached" to the filesystem in
lost+found, and may cause a LOT of duplicate blocks processing (slow!).

If you had the time to work on the solution, it would be very useful,
and we could make lazy_itable_init the default. What needs to be done
is have a thread that is created at filesystem mount that walks all the
groups and validates the GDT checksum, and zeroes inode tables and
bitmaps that are marked UNINIT w/o ZEROED. For bonus points it could
check bitmap validity (later that might validate a bitmap checksum),
compute buddy bitmaps for groups that have free space, etc.

The thread would exit once all of the groups have had the inode tables
zeroed, or if the filesystem is unmounted. In the common case (i.e.
once all inode tables are zeroed), it would just walk the already-loaded
group descriptor table looking for the ZEROED flag and no IO is done,
assuming we don't check the on-disk bitmaps on each mount (that could
be done only periodically, with a timestamp in the superblock).

> 3. dm/sparse-file-backed ext4 filesystems end up having the file size
> expanded quite early as init'ing the itables force extra non-sparse
> bytes to be written. Otherwise, ext4+uninit_bg is a really nice fs
> type for this purpose.

If you know the backing storage is always zero-filled (e.g. a new sparse
loop device), or you don't care about rare corruption bugs (i.e. for
test filesystems) then using lazy_itable_init is very useful for sure.

> Would it seriously raise the risk of corruption if uninit_bg is already
> in use? Alternately, would a patch to that effect stand a chance of ever
> making it upstream?

If the filesystem is already formatted with lazy_itable_init, then
doing further resizing w/o inode table zeroing is fine.

> I've attached a version with it being flagged by a "-l" for lazy.

It might make sense to avoid requiring the user to specify this,
rather remembering the option supplied at mke2fs time? There is
the COMPAT_LAZY_BG superblock flag that might be usable for this,
though Ted might have some comments about any potential compatibility
issues.

Other than that, the patch looks reasonable at first glance.

> diff --git a/resize/main.c b/resize/main.c
> index 220c192..f04a939 100644
> --- a/resize/main.c
> +++ b/resize/main.c
> @@ -39,7 +39,7 @@ char *program_name, *device_name, *io_options;
>
> static void usage (char *prog)
> {
> - fprintf (stderr, _("Usage: %s [-d debug_flags] [-f] [-F] [-M] [-P] "
> + fprintf (stderr, _("Usage: %s [-d debug_flags] [-f] [-F] [-l] [-M] [-P] "
> "[-p] device [new_size]\n\n"), prog);
>
> exit (1);
> @@ -189,7 +189,7 @@ int main (int argc, char ** argv)
> if (argc && *argv)
> program_name = *argv;
>
> - while ((c = getopt (argc, argv, "d:fFhMPpS:")) != EOF) {
> + while ((c = getopt (argc, argv, "d:fFhlMPpS:")) != EOF) {
> switch (c) {
> case 'h':
> usage(program_name);
> @@ -209,6 +209,9 @@ int main (int argc, char ** argv)
> case 'd':
> flags |= atoi(optarg);
> break;
> + case 'l':
> + flags |= RESIZE_LAZY_INIT;
> + break;
> case 'p':
> flags |= RESIZE_PERCENT_COMPLETE;
> break;
> diff --git a/resize/online.c b/resize/online.c
> index 4bc5451..f0b0569 100644
> --- a/resize/online.c
> +++ b/resize/online.c
> @@ -104,7 +104,7 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
> * but at least it allows on-line resizing to function.
> */
> new_fs->super->s_feature_incompat &= ~EXT4_FEATURE_INCOMPAT_FLEX_BG;
> - retval = adjust_fs_info(new_fs, fs, 0, *new_size);
> + retval = adjust_fs_info(new_fs, fs, 0, *new_size, flags);
> if (retval)
> return retval;
>
> diff --git a/resize/resize2fs.8.in b/resize/resize2fs.8.in
> index 3ea7a63..020393e 100644
> --- a/resize/resize2fs.8.in
> +++ b/resize/resize2fs.8.in
> @@ -102,6 +102,9 @@ really useful for doing
> .B resize2fs
> time trials.
> .TP
> +.B \-l
> +Lazily initialize new inode tables if supported (uninit_bg).
> +.TP
> .B \-M
> Shrink the filesystem to the minimum size.
> .TP
> diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> index 1a5d910..fee2e3f 100644
> --- a/resize/resize2fs.c
> +++ b/resize/resize2fs.c
> @@ -41,7 +41,8 @@
> #endif
>
> static void fix_uninit_block_bitmaps(ext2_filsys fs);
> -static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size);
> +static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size,
> + int flags);
> static errcode_t blocks_to_move(ext2_resize_t rfs);
> static errcode_t block_mover(ext2_resize_t rfs);
> static errcode_t inode_scan_and_fix(ext2_resize_t rfs);
> @@ -106,7 +107,7 @@ errcode_t resize_fs(ext2_filsys fs, blk_t *new_size, int flags,
> if (retval)
> goto errout;
>
> - retval = adjust_superblock(rfs, *new_size);
> + retval = adjust_superblock(rfs, *new_size, flags);
> if (retval)
> goto errout;
>
> @@ -292,7 +293,7 @@ static void free_gdp_blocks(ext2_filsys fs,
> * filesystem.
> */
> errcode_t adjust_fs_info(ext2_filsys fs, ext2_filsys old_fs,
> - ext2fs_block_bitmap reserve_blocks, blk_t new_size)
> + ext2fs_block_bitmap reserve_blocks, blk_t new_size, int flags)
> {
> errcode_t retval;
> int overhead = 0;
> @@ -496,9 +497,11 @@ retry:
> adjblocks = 0;
>
> fs->group_desc[i].bg_flags = 0;
> - if (csum_flag)
> - fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
> - EXT2_BG_INODE_ZEROED;
> + if (csum_flag) {
> + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
> + if (!(flags & RESIZE_LAZY_INIT))
> + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
> + }
> if (i == fs->group_desc_count-1) {
> numblocks = (fs->super->s_blocks_count -
> fs->super->s_first_data_block) %
> @@ -565,10 +568,10 @@ errout:
> * This routine adjusts the superblock and other data structures, both
> * in disk as well as in memory...
> */
> -static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
> +static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size, int flags)
> {
> ext2_filsys fs;
> - int adj = 0;
> + int adj = 0, csum_flag = 0, num = 0;
> errcode_t retval;
> blk_t group_block;
> unsigned long i;
> @@ -584,7 +587,7 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
> if (retval)
> return retval;
>
> - retval = adjust_fs_info(fs, rfs->old_fs, rfs->reserve_blocks, new_size);
> + retval = adjust_fs_info(fs, rfs->old_fs, rfs->reserve_blocks, new_size, flags);
> if (retval)
> goto errout;
>
> @@ -624,6 +627,9 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
> &rfs->itable_buf);
> if (retval)
> goto errout;
> + /* Track if we can get by with a lazy init */
> + csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
> + EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
>
> memset(rfs->itable_buf, 0, fs->blocksize * fs->inode_blocks_per_group);
> group_block = fs->super->s_first_data_block +
> @@ -642,10 +648,21 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
> /*
> * Write out the new inode table
> */
> + if (csum_flag && (flags & RESIZE_LAZY_INIT)) {
> + /* These are _new_ inode tables. No inodes should be in use. */
> + fs->group_desc[i].bg_itable_unused = fs->super->s_inodes_per_group;
> + num = ((((fs->super->s_inodes_per_group -
> + fs->group_desc[i].bg_itable_unused) *
> + EXT2_INODE_SIZE(fs->super)) +
> + EXT2_BLOCK_SIZE(fs->super) - 1) /
> + EXT2_BLOCK_SIZE(fs->super));
> + } else {
> + num = fs->inode_blocks_per_group;
> + }
> retval = io_channel_write_blk(fs->io,
> - fs->group_desc[i].bg_inode_table,
> - fs->inode_blocks_per_group,
> - rfs->itable_buf);
> + fs->group_desc[i].bg_inode_table, /* blk */
> + num, /* count */
> + rfs->itable_buf); /* contents */
> if (retval) goto errout;
>
> io_channel_flush(fs->io);
> diff --git a/resize/resize2fs.h b/resize/resize2fs.h
> index fab7290..d4c862c 100644
> --- a/resize/resize2fs.h
> +++ b/resize/resize2fs.h
> @@ -77,6 +77,8 @@ typedef struct ext2_sim_progress *ext2_sim_progmeter;
> #define RESIZE_DEBUG_INODEMAP 0x0004
> #define RESIZE_DEBUG_ITABLEMOVE 0x0008
>
> +#define RESIZE_LAZY_INIT 0x0010
> +
> #define RESIZE_PERCENT_COMPLETE 0x0100
> #define RESIZE_VERBOSE 0x0200
>
> @@ -129,7 +131,8 @@ extern errcode_t resize_fs(ext2_filsys fs, blk_t *new_size, int flags,
>
> extern errcode_t adjust_fs_info(ext2_filsys fs, ext2_filsys old_fs,
> ext2fs_block_bitmap reserve_blocks,
> - blk_t new_size);
> + blk_t new_size,
> + int flags);
> extern blk_t calculate_minimum_resize_size(ext2_filsys fs);
>
>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-16 23:11:31

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Wed, Sep 16, 2009 at 03:22:50PM -0600, Andreas Dilger wrote:
> On Sep 16, 2009 15:42 -0500, Will Drewry wrote:
> > I'm interested in it for a few reasons:
> > 1. it undermines the use of uninit_bg on large filesystems as
> > e2fsck -f will go back to normal speed, even without those block
> > groups being 'used'. In my local example, it goes from 14s to 2m24s.
>
> Ah, my bad... It definitely makes sense to mark new groups added
> during online resize as {BLOCK,INODE}_UNINIT if that feature is
> enabled for the filesystem. The e2fsck slowdown after a resize is
> only a one-time event (that e2fsck would mark the unused groups as
> UNINIT again) but it makes sense to do it correctly the first time.

Cool - didn't realize e2fsck would swap them back. That only makes
it seem like an even heavier burden if I know the backing store is
zero-filled! :)

> > 2. it will spread the I/O cost out over time. Online resizing often
> > means that you don't want to/can't unmount the fs. However, a
> > large filesystem increase might result in gigabytes of 0s being
> > written to the backing store which could result in I/O throttling
> > that makes doing it online less useful. It'd be nice to be able to
> > optionally amortize that cost as is done if the fs had been mke2fs -O
> > lazy_itable_init=1 at full size initially.
>
> While this is true, there is a non-zero risk of problems if the inode
> table isn't zeroed, which is why lazy_itable_init isn't the default.
> The risk is that if the group descriptor is invalid for some reason
> (found by bad checksum, or some inode in use beyond itable_unused)
> then the UNINIT and itable_unused fields will be ignored and a full
> inode table scan for that group is done.
>
> If the itable isn't zeroed, then any old inodes (from a previous
> filesystem, or garbage) will be "reattached" to the filesystem in
> lost+found, and may cause a LOT of duplicate blocks processing (slow!).

That makes things a lot clearer - thanks! I wasn't sure what the default
action was, but it makes sense to assume that corruption would lead
you to crawl the inode table regardless. In which case, your best bet
is to zero-fill it to minimize the weirdness.

> If you had the time to work on the solution, it would be very useful,
> and we could make lazy_itable_init the default. What needs to be done
> is have a thread that is created at filesystem mount that walks all the
> groups and validates the GDT checksum, and zeroes inode tables and
> bitmaps that are marked UNINIT w/o ZEROED. For bonus points it could
> check bitmap validity (later that might validate a bitmap checksum),
> compute buddy bitmaps for groups that have free space, etc.
>
> The thread would exit once all of the groups have had the inode tables
> zeroed, or if the filesystem is unmounted. In the common case (i.e.
> once all inode tables are zeroed), it would just walk the already-loaded
> group descriptor table looking for the ZEROED flag and no IO is done,
> assuming we don't check the on-disk bitmaps on each mount (that could
> be done only periodically, with a timestamp in the superblock).

I'd love to have this functionality so it's definitely going on my TODO
list, but probably not for a while yet. This is a great description of
the needed code which will make it that much easier.

> > Would it seriously raise the risk of corruption if uninit_bg is already
> > in use? Alternately, would a patch to that effect stand a chance of ever
> > making it upstream?
>
> If the filesystem is already formatted with lazy_itable_init, then
> doing further resizing w/o inode table zeroing is fine.

Cool -- I'll start in on a patch to setup to add that support as a
precursor to having a mount triggered itable zero'ing thread. At least,
then test filesystems and known zero-filled ones will benefit (as you
pointed out!).

>
> > I've attached a version with it being flagged by a "-l" for lazy.
>
> It might make sense to avoid requiring the user to specify this,
> rather remembering the option supplied at mke2fs time? There is
> the COMPAT_LAZY_BG superblock flag that might be usable for this,
> though Ted might have some comments about any potential compatibility
> issues.

Cool - yeah I'd love to make use of the COMPAT_LAZY_BG flag since it
seems that all (but e2p/features.c) references to it seem to be gone
from the e2fsprogs source and the kernel. I'm happy to rewrite it to do
so and update mke2fs to set LAZY_BG when lazy_itable_init=1 is set.

> Other than that, the patch looks reasonable at first glance.

Thanks! If Ted has any feedback on the use of COMPAT_LAZY_BG, I'll
rewrite it using that (or not). Using COMPAT_LAZY_BG would also be nice
because it would make it easier to decide when it's okay to online resize
without initializing itables too (and would fit its initial purpose
of being useful for sparse files)!

cheers -
will

2009-09-17 19:28:21

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On 09/16/2009 07:11 PM, Will Drewry wrote:
> On Wed, Sep 16, 2009 at 03:22:50PM -0600, Andreas Dilger wrote:
>
>> On Sep 16, 2009 15:42 -0500, Will Drewry wrote:
>>
>>> I'm interested in it for a few reasons:
>>> 1. it undermines the use of uninit_bg on large filesystems as
>>> e2fsck -f will go back to normal speed, even without those block
>>> groups being 'used'. In my local example, it goes from 14s to 2m24s.
>>>
>> Ah, my bad... It definitely makes sense to mark new groups added
>> during online resize as {BLOCK,INODE}_UNINIT if that feature is
>> enabled for the filesystem. The e2fsck slowdown after a resize is
>> only a one-time event (that e2fsck would mark the unused groups as
>> UNINIT again) but it makes sense to do it correctly the first time.
>>
> Cool - didn't realize e2fsck would swap them back. That only makes
> it seem like an even heavier burden if I know the backing store is
> zero-filled! :)
>
>
>>> 2. it will spread the I/O cost out over time. Online resizing often
>>> means that you don't want to/can't unmount the fs. However, a
>>> large filesystem increase might result in gigabytes of 0s being
>>> written to the backing store which could result in I/O throttling
>>> that makes doing it online less useful. It'd be nice to be able to
>>> optionally amortize that cost as is done if the fs had been mke2fs -O
>>> lazy_itable_init=1 at full size initially.
>>>
>> While this is true, there is a non-zero risk of problems if the inode
>> table isn't zeroed, which is why lazy_itable_init isn't the default.
>> The risk is that if the group descriptor is invalid for some reason
>> (found by bad checksum, or some inode in use beyond itable_unused)
>> then the UNINIT and itable_unused fields will be ignored and a full
>> inode table scan for that group is done.
>>
>> If the itable isn't zeroed, then any old inodes (from a previous
>> filesystem, or garbage) will be "reattached" to the filesystem in
>> lost+found, and may cause a LOT of duplicate blocks processing (slow!).
>>
> That makes things a lot clearer - thanks! I wasn't sure what the default
> action was, but it makes sense to assume that corruption would lead
> you to crawl the inode table regardless. In which case, your best bet
> is to zero-fill it to minimize the weirdness.
>

One note - the WRITE_SAME command in SCSI has long been used by array
vendors to do relatively high performance zero fills.

It will actually write the disk (and that can be slow), but it won't do
multiple transfers of the data block of zeroes from server to storage.

Note sure that is a useful point, but might be nice to take advantage of :-)

ric


>> If you had the time to work on the solution, it would be very useful,
>> and we could make lazy_itable_init the default. What needs to be done
>> is have a thread that is created at filesystem mount that walks all the
>> groups and validates the GDT checksum, and zeroes inode tables and
>> bitmaps that are marked UNINIT w/o ZEROED. For bonus points it could
>> check bitmap validity (later that might validate a bitmap checksum),
>> compute buddy bitmaps for groups that have free space, etc.
>>
>> The thread would exit once all of the groups have had the inode tables
>> zeroed, or if the filesystem is unmounted. In the common case (i.e.
>> once all inode tables are zeroed), it would just walk the already-loaded
>> group descriptor table looking for the ZEROED flag and no IO is done,
>> assuming we don't check the on-disk bitmaps on each mount (that could
>> be done only periodically, with a timestamp in the superblock).
>>
> I'd love to have this functionality so it's definitely going on my TODO
> list, but probably not for a while yet. This is a great description of
> the needed code which will make it that much easier.
>
>
>>> Would it seriously raise the risk of corruption if uninit_bg is already
>>> in use? Alternately, would a patch to that effect stand a chance of ever
>>> making it upstream?
>>>
>> If the filesystem is already formatted with lazy_itable_init, then
>> doing further resizing w/o inode table zeroing is fine.
>>
> Cool -- I'll start in on a patch to setup to add that support as a
> precursor to having a mount triggered itable zero'ing thread. At least,
> then test filesystems and known zero-filled ones will benefit (as you
> pointed out!).
>
>
>>
>>> I've attached a version with it being flagged by a "-l" for lazy.
>>>
>> It might make sense to avoid requiring the user to specify this,
>> rather remembering the option supplied at mke2fs time? There is
>> the COMPAT_LAZY_BG superblock flag that might be usable for this,
>> though Ted might have some comments about any potential compatibility
>> issues.
>>
> Cool - yeah I'd love to make use of the COMPAT_LAZY_BG flag since it
> seems that all (but e2p/features.c) references to it seem to be gone
> from the e2fsprogs source and the kernel. I'm happy to rewrite it to do
> so and update mke2fs to set LAZY_BG when lazy_itable_init=1 is set.
>
>
>> Other than that, the patch looks reasonable at first glance.
>>
> Thanks! If Ted has any feedback on the use of COMPAT_LAZY_BG, I'll
> rewrite it using that (or not). Using COMPAT_LAZY_BG would also be nice
> because it would make it easier to decide when it's okay to online resize
> without initializing itables too (and would fit its initial purpose
> of being useful for sparse files)!
>
> cheers -
> will
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2009-09-18 06:09:21

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

>>>>> "Ric" == Ric Wheeler <[email protected]> writes:

Ric> One note - the WRITE_SAME command in SCSI has long been used by
Ric> array vendors to do relatively high performance zero fills.

I've been working on WRITE SAME and UNMAP support recently (layering on
top of Christoph's patch kit). It might be worthwhile to expose the
WRITE SAME functionality further up the stack. It could also be helpful
for initializing RAID disks, for instance...

--
Martin K. Petersen Oracle Linux Engineering

2009-09-18 11:42:52

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On 09/18/2009 02:09 AM, Martin K. Petersen wrote:
>>>>>> "Ric" == Ric Wheeler<[email protected]> writes:
>>>>>>
> Ric> One note - the WRITE_SAME command in SCSI has long been used by
> Ric> array vendors to do relatively high performance zero fills.
>
> I've been working on WRITE SAME and UNMAP support recently (layering on
> top of Christoph's patch kit). It might be worthwhile to expose the
> WRITE SAME functionality further up the stack. It could also be helpful
> for initializing RAID disks, for instance...
>
>

Using it to initialize RAID disks is exactly what it is there for - that
sounds quite useful...

ric


2009-09-18 13:56:50

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Sep 18, 2009 07:43 -0400, Ric Wheeler wrote:
> On 09/18/2009 02:09 AM, Martin K. Petersen wrote:
>>>>>>> "Ric" == Ric Wheeler<[email protected]> writes:
>>>>>>>
>> Ric> One note - the WRITE_SAME command in SCSI has long been used by
>> Ric> array vendors to do relatively high performance zero fills.
>>
>> I've been working on WRITE SAME and UNMAP support recently (layering on
>> top of Christoph's patch kit). It might be worthwhile to expose the
>> WRITE SAME functionality further up the stack. It could also be helpful
>> for initializing RAID disks, for instance...
>
> Using it to initialize RAID disks is exactly what it is there for - that
> sounds quite useful...

In ext4 we already have a routine ext4_ext_zeroout() that is used to zero
out chunks of disk in order to wipe uninitialized extents when they need
to be put into use. It currently maps ZERO_PAGE() multiple times to avoid
causing cache pressure, but if we could wire it up to some block-level
routine that optimized this it would be even better.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-19 12:19:16

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Wed, Sep 16, 2009 at 06:11:31PM -0500, Will Drewry wrote:
> On Wed, Sep 16, 2009 at 03:22:50PM -0600, Andreas Dilger wrote:
> > On Sep 16, 2009 15:42 -0500, Will Drewry wrote:
[snip]
> > It might make sense to avoid requiring the user to specify this,
> > rather remembering the option supplied at mke2fs time? There is
> > the COMPAT_LAZY_BG superblock flag that might be usable for this,
> > though Ted might have some comments about any potential compatibility
> > issues.
>
> Cool - yeah I'd love to make use of the COMPAT_LAZY_BG flag since it
> seems that all (but e2p/features.c) references to it seem to be gone
> from the e2fsprogs source and the kernel. I'm happy to rewrite it to do
> so and update mke2fs to set LAZY_BG when lazy_itable_init=1 is set.
>
> > Other than that, the patch looks reasonable at first glance.
>
> Thanks! If Ted has any feedback on the use of COMPAT_LAZY_BG, I'll
> rewrite it using that (or not). Using COMPAT_LAZY_BG would also be nice
> because it would make it easier to decide when it's okay to online resize
> without initializing itables too (and would fit its initial purpose
> of being useful for sparse files)!


Here's the same patch for offline resizing based on LAZY_BG. I only have
a less-than-stable kernel patch at present, but I figured I wouldn't push
on that until I have a bit more time.

This change adds support for offline resize2fs to use lazy inode
table initialization if the superblock is marked with LAZY_BG and
GDT_CSUM. This speeds up offline resizing dramatically and doesn't
undermine the speed gains of e2fsck on the first run after resizing.

Thanks!
will

Signed-off-by: Will Drewry <redpig <at> dataspill.org>
---

diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 5c3d17f..45fab3e 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -508,6 +508,7 @@ typedef struct ext2_icount *ext2_icount_t;
EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
EXT2_FEATURE_COMPAT_RESIZE_INODE|\
EXT2_FEATURE_COMPAT_DIR_INDEX|\
+ EXT2_FEATURE_COMPAT_LAZY_BG|\
EXT2_FEATURE_COMPAT_EXT_ATTR)

/* This #ifdef is temporary until compression is fully supported */
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 84c4361..28880a9 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -1687,6 +1687,12 @@ got_size:
fs_param.s_log_groups_per_flex = int_log2(flex_bg_size);
}

+ /* Mark the filesystem as friendly to uninit_bg during later resizing. */
+ if (lazy_itable_init &&
+ EXT2_HAS_RO_COMPAT_FEATURE(&fs_param,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM))
+ fs_param.s_feature_compat |= EXT2_FEATURE_COMPAT_LAZY_BG;
+
if (inode_size && fs_param.s_rev_level >= EXT2_DYNAMIC_REV) {
if (inode_size < EXT2_GOOD_OLD_INODE_SIZE ||
inode_size > EXT2_BLOCK_SIZE(&fs_param) ||
diff --git a/resize/resize2fs.c b/resize/resize2fs.c
index 1a5d910..12003e4 100644
--- a/resize/resize2fs.c
+++ b/resize/resize2fs.c
@@ -302,7 +302,7 @@ errcode_t adjust_fs_info(ext2_filsys fs, ext2_filsys old_fs,
int adj, old_numblocks, numblocks, adjblocks;
unsigned long i, j, old_desc_blocks, max_group;
unsigned int meta_bg, meta_bg_size;
- int has_super, csum_flag;
+ int has_super, csum_flag, lazy_flag;
unsigned long long new_inodes; /* u64 to check for overflow */
double percent;

@@ -482,6 +482,8 @@ retry:

csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
+ lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
+ EXT2_FEATURE_COMPAT_LAZY_BG);
adj = old_fs->group_desc_count;
max_group = fs->group_desc_count - adj;
if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
@@ -496,9 +498,12 @@ retry:
adjblocks = 0;

fs->group_desc[i].bg_flags = 0;
- if (csum_flag)
- fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
- EXT2_BG_INODE_ZEROED;
+ if (csum_flag) {
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
+ if (!lazy_flag) {
+ fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
+ }
+ }
if (i == fs->group_desc_count-1) {
numblocks = (fs->super->s_blocks_count -
fs->super->s_first_data_block) %
@@ -568,7 +573,7 @@ errout:
static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
{
ext2_filsys fs;
- int adj = 0;
+ int adj = 0, csum_flag = 0, lazy_flag = 0, num = 0;
errcode_t retval;
blk_t group_block;
unsigned long i;
@@ -624,6 +629,11 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
&rfs->itable_buf);
if (retval)
goto errout;
+ /* Track if we can get by with a lazy init */
+ csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
+ lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
+ EXT2_FEATURE_COMPAT_LAZY_BG);

memset(rfs->itable_buf, 0, fs->blocksize * fs->inode_blocks_per_group);
group_block = fs->super->s_first_data_block +
@@ -642,10 +652,21 @@ static errcode_t adjust_superblock(ext2_resize_t rfs, blk_t new_size)
/*
* Write out the new inode table
*/
+ if (csum_flag && lazy_flag) {
+ /* These are _new_ inode tables. No inodes should be in use. */
+ fs->group_desc[i].bg_itable_unused = fs->super->s_inodes_per_group;
+ num = ((((fs->super->s_inodes_per_group -
+ fs->group_desc[i].bg_itable_unused) *
+ EXT2_INODE_SIZE(fs->super)) +
+ EXT2_BLOCK_SIZE(fs->super) - 1) /
+ EXT2_BLOCK_SIZE(fs->super));
+ } else {
+ num = fs->inode_blocks_per_group;
+ }
retval = io_channel_write_blk(fs->io,
- fs->group_desc[i].bg_inode_table,
- fs->inode_blocks_per_group,
- rfs->itable_buf);
+ fs->group_desc[i].bg_inode_table, /* blk */
+ num, /* count */
+ rfs->itable_buf); /* contents */
if (retval) goto errout;

io_channel_flush(fs->io);

2009-09-23 09:34:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Sep 19, 2009 07:19 -0500, Will Drewry wrote:
> @@ -482,6 +482,8 @@ retry:
>
> csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
> EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
> + lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
> + EXT2_FEATURE_COMPAT_LAZY_BG);
> adj = old_fs->group_desc_count;
> max_group = fs->group_desc_count - adj;
> if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
> @@ -496,9 +498,12 @@ retry:
> adjblocks = 0;
>
> fs->group_desc[i].bg_flags = 0;
> - if (csum_flag)
> - fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
> - EXT2_BG_INODE_ZEROED;
> + if (csum_flag) {
> + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
> + if (!lazy_flag) {
> + fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
> + }
> + }

This code could be cleaned up a bit by assigning "EXT2_BG_INODE_ZEROED" to
lazy_flag so that you don't need to check this each time:

lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
EXT2_FEATURE_COMPAT_LAZY_BG) ?
EXT2_BG_INODE_ZEROED : 0;

if (csum_flag)
fs->group_desc[i].bg_flags |=
EXT2_BG_INODE_UNINIT | lazy_flag;

Still waiting to head from Ted on the use of COMPAT_LAZY_BG.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-09-23 19:28:02

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Wed, Sep 23, 2009 at 4:34 AM, Andreas Dilger <[email protected]> wrote:
> On Sep 19, 2009 ?07:19 -0500, Will Drewry wrote:
>> @@ -482,6 +482,8 @@ retry:
>>
>> ? ? ? csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
>> + ? ? lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT2_FEATURE_COMPAT_LAZY_BG);
>> ? ? ? adj = old_fs->group_desc_count;
>> ? ? ? max_group = fs->group_desc_count - adj;
>> ? ? ? if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
>> @@ -496,9 +498,12 @@ retry:
>> ? ? ? ? ? ? ? adjblocks = 0;
>>
>> ? ? ? ? ? ? ? fs->group_desc[i].bg_flags = 0;
>> - ? ? ? ? ? ? if (csum_flag)
>> - ? ? ? ? ? ? ? ? ? ? fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? EXT2_BG_INODE_ZEROED;
>> + ? ? ? ? ? ? if (csum_flag) {
>> + ? ? ? ? ? ? ? ? ? ? fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
>> + ? ? ? ? ? ? ? ? ? ? if (!lazy_flag) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? }
>
> This code could be cleaned up a bit by assigning "EXT2_BG_INODE_ZEROED" to
> lazy_flag so that you don't need to check this each time:
>
> ? ? ? ?lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT2_FEATURE_COMPAT_LAZY_BG) ?
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT2_BG_INODE_ZEROED : 0;
>
> ? ? ? ? ? ? ? ?if (csum_flag)
> ? ? ? ? ? ? ? ? ? ? ? ?fs->group_desc[i].bg_flags |=
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?EXT2_BG_INODE_UNINIT | lazy_flag;
>
> Still waiting to head from Ted on the use of COMPAT_LAZY_BG.

Ah nice - I can definitely do that. I also think I mangled the spaces
v tabs on the ext4.h part of the patch. If COMPAT_LAZY_BG is off limits,
I can easily use some other flag/indicator.

That aside, I've also got a barebones kernel patch which supports lazy
online resizing which accompanies the e2fsprogs patch above. I realize
it is probably less-than-practical until there is an initializing
thread, but I'd appreciate any feedback if possible -- even if just to
ensure I'm understanding things correctly.

Thanks!
will


Signed-off-by: Will Drewry <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/resize.c | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..24d2880 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1043,6 +1043,7 @@ static inline int ext4_valid_inum(struct
super_block *sb, unsigned long ino)
#define EXT4_FEATURE_COMPAT_EXT_ATTR 0x0008
#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
+#define EXT4_FEATURE_COMPAT_LAZY_BG 0x0040

#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 68b0351..faf9263 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -183,7 +183,7 @@ static int setup_new_group_blocks(struct super_block *sb,
handle_t *handle;
ext4_fsblk_t block;
ext4_grpblk_t bit;
- int i;
+ int i, lazy = 0;
int err = 0, err2;

/* This transaction may be extended/restarted along the way */
@@ -261,13 +261,31 @@ static int setup_new_group_blocks(struct super_block *sb,
input->inode_bitmap - start);
ext4_set_bit(input->inode_bitmap - start, bh->b_data);

- /* Zero out all of the inode table blocks */
+ /* Zero out all of the inode table blocks except if the fs supports lazy
+ * itables. */
+ lazy = EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+ EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG);
+ if (lazy) {
+ ext4_debug("lazy_bg for inode blocks "
+ "%#04llx (+%d) - %#04llx (+%d) ",
+ input->inode_table, input->inode_table - start,
+ input->inode_table + sbi->s_itb_per_group - 1,
+ (input->inode_table - start) + sbi->s_itb_per_group - 1);
+ }
+
for (i = 0, block = input->inode_table, bit = block - start;
i < sbi->s_itb_per_group; i++, bit++, block++) {
struct buffer_head *it;

ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);

+ /* Even though we don't initialize the inode table, we need
+ * to mark the blocks used by it for later init. */
+ if (lazy) {
+ ext4_set_bit(bit, bh->b_data);
+ continue;
+ }
if ((err = extend_or_restart_transaction(handle, 1, bh)))
goto exit_bh;

@@ -286,6 +304,11 @@ static int setup_new_group_blocks(struct super_block *sb,
mark_bitmap_end(input->blocks_count, sb->s_blocksize * 8, bh->b_data);
ext4_handle_dirty_metadata(handle, NULL, bh);
brelse(bh);
+ /* If lazy, we're done since we are marked INODE_UNINIT and that
+ * includes the inode bitmap. (ext4_init_inode_bitmap will do
+ * this for us later). */
+ if (lazy)
+ goto exit_journal;
/* Mark unused entries in inode bitmap used */
ext4_debug("clear inode bitmap %#04llx (+%llu)\n",
input->inode_bitmap, input->inode_bitmap - start);
@@ -868,6 +891,17 @@ int ext4_group_add(struct super_block *sb, struct
ext4_new_group_data *input)
ext4_free_blks_set(sb, gdp, input->free_blocks_count);
ext4_free_inodes_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_ZEROED);
+ /* If we can do lazy initialization, we do. */
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+ EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG)) {
+ /* Only use EXT4_BG_INODE_UNINIT and not BLOCK_UNINIT
+ * because we purposefully initialize the block bitmaps
+ * to avoid managing super block backup decisions, making
+ * sure the last block is init'd, etc.
+ */
+ gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_UNINIT);
+ ext4_itable_unused_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
+ }
gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);

/*

2009-09-25 06:21:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH][RFC] resize2fs and uninit_bg questions

On Sep 23, 2009 14:28 -0500, Will Drewry wrote:
> That aside, I've also got a barebones kernel patch which supports lazy
> online resizing which accompanies the e2fsprogs patch above. I realize
> it is probably less-than-practical until there is an initializing
> thread, but I'd appreciate any feedback if possible -- even if just to
> ensure I'm understanding things correctly.

This is looking very good. For prototyping the initializing thread, I
would suggest to create a "per-group initialization and check" function
(maybe just stealing the inside of the ext4_check_descriptors() loop as
a starting point) that is called inline for each group at mount time.
Add in the check for a missing INODE_ZEROED and do the zeroing there.

While that will initially just push the mke2fs time into the kernel, it
would likely be somewhat faster than running it from mke2fs because it
will not need any userspace memory, nor will there be any user->kernel
data copying going on.

Once you have that working, you can work on adding this to a kernel
thread that starts at mount time and stops when it has checked all
of the groups, or if the filesystem is being unmounted. It probably
makes sense to only have one of these threads for the entire system
(instead of one per filesystem), so that they don't start crazy seeking
IO when there are multiple new partitions on a single disk. That means
some kind of work queue with a list of superblocks and their groups to
check (so that when online resizing only the new groups are checked).


One recent thought I had was that we might want to distinguish between
filesystems that want lazy itable zeroing (i.e. real production filesystems)
and ones that don't want itable zeroing at all (i.e. filesystems for
testing or ones created on sparse loop devices that will always return
0s for unwritten blocks). The latter is very useful for keeping image
size small, for VM images, etc.

IMHO for filesystems that want itable zeroing could just be a mke2fs
option, and the kernel detects this from groups that do not have
INODE_ZEROED set. Filesystems that don't want any itable zeroing
should set COMPAT_LAZY_BG to tell the kernel not to zero the itable
(though the flag would be misnamed in that case).

Maybe it makes sense to keep COMPAT_LAZY_BG for filesystems that haven't
had their itables zeroed, but want it, and add a new COMPAT_NO_ZERO flag
that indicates the kernel shouldn't zero itables at all? Ted?

> Signed-off-by: Will Drewry <[email protected]>
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index 68b0351..faf9263 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -261,13 +261,31 @@ static int setup_new_group_blocks(struct super_block *sb,
> input->inode_bitmap - start);
> ext4_set_bit(input->inode_bitmap - start, bh->b_data);
>
> - /* Zero out all of the inode table blocks */
> + /* Zero out all of the inode table blocks except if the fs supports lazy
> + * itables. */
> + lazy = EXT4_HAS_RO_COMPAT_FEATURE(sb,
> + EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&

(style) This continuation should be aligned with the '(' on the previous line.

> + EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG);
> + if (lazy) {
> + ext4_debug("lazy_bg for inode blocks "
> + "%#04llx (+%d) - %#04llx (+%d) ",
> + input->inode_table, input->inode_table - start,
> + input->inode_table + sbi->s_itb_per_group - 1,

(style) the indenting doesn't match (first 2 lines are correct), last one not

> + (input->inode_table - start) + sbi->s_itb_per_group - 1);

(style) wrap at 80 columns, or in this case, remove a few spaces to make
line 80 columns wide.

(suggestion) It might make sense to print this debug message with the
inode table geometry regardless of whether it is lazy or not, and just
indicate in the message whether it will be zeroed or not.

> for (i = 0, block = input->inode_table, bit = block - start;
> i < sbi->s_itb_per_group; i++, bit++, block++) {
> struct buffer_head *it;
>
> ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
>
> + /* Even though we don't initialize the inode table, we need
> + * to mark the blocks used by it for later init. */
> + if (lazy) {
> + ext4_set_bit(bit, bh->b_data);
> + continue;
> + }

(suggestion) I prefer not to do the same operation in two different parts
of the loop, as this does, since if there are other changes to the code
later the "if (lazy)" clause will get increasing code duplication.

Better would be a conditional here that only did the loop internals if
not lazy, like:


for (i = 0, block = input->inode_table, bit = block - start;
i < sbi->s_itb_per_group; i++, bit++, block++) {
struct buffer_head *it;

ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);

/* If the filesystem is not using lazy initialization then we
* need zero the inode table blocks to avoid e2fsck issues. */
if (!lazy) {
if ((err = extend_or_restart_transaction(handle, 1,bh)))
goto exit_bh;

if (IS_ERR(it = bclean(handle, sb, block))) {
err = PTR_ERR(it);
goto exit_bh;
}
ext4_handle_dirty_metadata(handle, NULL, it);
brelse(it);
}
ext4_set_bit(bit, bh->b_data);
}

It could also be an "if (lazy) goto set_bit;", but given the small size
of the conditional I think the above is still pretty readable and not
too much indented.

The rest of it looks good.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.