2007-04-15 16:16:09

by Andreas Dilger

Subject: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

Just a quick note before I forget. I thought there was a call in ext4
to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
more than 2^32 blocks? I don't see that anywhere in the 2.6.20 ext4.
Is that in the upstream git repo?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
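
The mount-time check asked about above could be sketched roughly as
follows. This is a userspace model only, not the ext4 code: the struct,
the function names and the MODEL_ constant are hypothetical stand-ins,
and the threshold is taken literally from the message ("more than 2^32
blocks").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the JBD2 incompat bit; the real definition
 * lives in the kernel's jbd2 header. */
#define MODEL_JBD2_INCOMPAT_64BIT 0x2u

/* Minimal model of a journal: only its incompat feature mask. */
struct model_journal {
    uint32_t incompat;
};

/* True when the filesystem has more than 2^32 blocks, i.e. when some
 * block numbers no longer fit in 32 bits. */
static bool needs_64bit_journal(uint64_t blocks_count)
{
    return blocks_count > (UINT64_C(1) << 32);
}

/* Hypothetical mount-time hook: set the flag only when it is required
 * and not already present. Returns true if the flag was newly set. */
static bool model_mount_set_journal_features(struct model_journal *j,
                                             uint64_t blocks_count)
{
    assert(j != NULL);
    if (!needs_64bit_journal(blocks_count))
        return false;
    if (j->incompat & MODEL_JBD2_INCOMPAT_64BIT)
        return false;
    j->incompat |= MODEL_JBD2_INCOMPAT_64BIT;
    return true;
}
```

In this model a filesystem of exactly 2^32 blocks is left alone, since
its highest block number still fits in 32 bits.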


2007-04-19 19:15:04

by Mingming Cao

Subject: Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
> Just a quick note before I forget. I thought there was a call in ext4
> to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
> more than 2^32 blocks?

Question about the online resize case. If the fs is increased to more
than 2^32 blocks, we should set JBD2_FEATURE_INCOMPAT_64BIT in the
journal. What about existing transactions that still store 32-bit block
numbers? I guess the journal needs to commit them all so that revoke
will not get confused about the bits for block numbers later. Once
that is done, JBD2 can set this feature safely.
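
The ordering described here (commit everything first, then set the flag)
might look like the sketch below. It is a toy model under stated
assumptions: "pending" simply counts transactions still recorded with
32-bit block numbers, and every name is hypothetical rather than a JBD2
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the JBD2 incompat bit. */
#define MODEL_JBD2_INCOMPAT_64BIT 0x2u

/* Toy journal: a feature mask plus a count of uncommitted transactions
 * that were written with 32-bit block numbers. */
struct model_journal {
    uint32_t incompat;
    unsigned pending;
};

/* Stand-in for flushing the journal: afterwards nothing is pending. */
static void model_commit_all(struct model_journal *j)
{
    assert(j != NULL);
    j->pending = 0;
}

/* Refuse to change the on-disk block number width while old-format
 * transactions are still outstanding. */
static bool model_set_64bit(struct model_journal *j)
{
    assert(j != NULL);
    if (j->pending != 0)
        return false;
    j->incompat |= MODEL_JBD2_INCOMPAT_64BIT;
    return true;
}

/* The sequence suggested in the message: flush, then set the flag. */
static bool model_enable_64bit_after_flush(struct model_journal *j)
{
    model_commit_all(j);
    return model_set_64bit(j);
}
```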



2007-04-19 21:18:20

by Andreas Dilger

Subject: Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

On Apr 19, 2007 12:15 -0700, Mingming Cao wrote:
> On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
> > Just a quick note before I forget. I thought there was a call in ext4
> > to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
> > more than 2^32 blocks?
>
> Question about the online resize case. If the fs is increased to more
> than 2^32 blocks, we should set JBD2_FEATURE_INCOMPAT_64BIT in the
> journal. What about existing transactions that still store 32-bit block
> numbers? I guess the journal needs to commit them all so that revoke
> will not get confused about the bits for block numbers later. Once
> that is done, JBD2 can set this feature safely.

Well, there are two options here:
1) refuse resizing filesystems beyond 16TB
   - this is required if they were not formatted as ext4 to start with, as
     the group descriptors will not be large enough to handle the "_hi"
     word in the bitmap/inode table locations
   - this is also a problem for block-mapped files that need to allocate
     blocks beyond 16TB (though this could just fail on those files with
     e.g. ENOSPC or EFBIG or something similar)
2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond
   16TB. This would also require changing over to META_BG at some point,
   because there cannot be enough reserved group descriptor blocks (the
   resize_inode is set up for a maximum of 2TB filesystems I think)

For now I'd be happy with just setting the JBD2_*_64BIT flag at mount for
filesystems > 16TB, and refusing resize across 16TB. We can fix it later.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
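
The interim policy proposed above (set the flag at mount, refuse any
resize that crosses 16TB) can be sketched as a simple guard. This is a
hedged model, not resize code from the kernel: the names are
hypothetical, and the 16TB figure is derived assuming 4k blocks, since
2^32 blocks * 4096 bytes = 2^44 bytes.

```c
#include <assert.h>
#include <stdint.h>

enum model_resize_verdict { MODEL_RESIZE_OK, MODEL_RESIZE_REFUSED };

/* Number of blocks addressable with 32-bit block numbers. */
static uint64_t model_32bit_block_limit(void)
{
    return UINT64_C(1) << 32;
}

/* Size in bytes of a filesystem sitting exactly at that limit. */
static uint64_t model_limit_bytes(uint32_t block_size)
{
    assert(block_size != 0);
    return model_32bit_block_limit() * block_size;
}

/* Refuse only resizes that start at or below the limit and end above
 * it. A filesystem already above the limit is assumed to have had the
 * 64-bit journal flag set at mount, so growing it further is allowed. */
static enum model_resize_verdict model_check_resize(uint64_t cur_blocks,
                                                    uint64_t new_blocks)
{
    const uint64_t limit = model_32bit_block_limit();

    if (cur_blocks <= limit && new_blocks > limit)
        return MODEL_RESIZE_REFUSED;
    return MODEL_RESIZE_OK;
}
```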

2007-04-20 00:41:48

by Mingming Cao

Subject: Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

On Thu, 2007-04-19 at 15:18 -0600, Andreas Dilger wrote:
> On Apr 19, 2007 12:15 -0700, Mingming Cao wrote:
> > On Sun, 2007-04-15 at 10:16 -0600, Andreas Dilger wrote:
> > > Just a quick note before I forget. I thought there was a call in ext4
> > > to set JBD2_FEATURE_INCOMPAT_64BIT at mount time if the filesystem has
> > > more than 2^32 blocks?
> >
> > Question about the online resize case. If the fs is increased to more
> > than 2^32 blocks, we should set JBD2_FEATURE_INCOMPAT_64BIT in the
> > journal. What about existing transactions that still store 32-bit block
> > numbers? I guess the journal needs to commit them all so that revoke
> > will not get confused about the bits for block numbers later. Once
> > that is done, JBD2 can set this feature safely.
>
> Well, there are two options here:
> 1) refuse resizing filesystems beyond 16TB
>    - this is required if they were not formatted as ext4 to start with, as
>      the group descriptors will not be large enough to handle the "_hi"
>      word in the bitmap/inode table locations
>    - this is also a problem for block-mapped files that need to allocate
>      blocks beyond 16TB (though this could just fail on those files with
>      e.g. ENOSPC or EFBIG or something similar)

I agree that for a filesystem not formatted as ext4 (block-mapped ext3
but mounted as ext4), resizing to >16TB is not possible.

This concern is mostly for newly formatted ext4, which is extent-based
by default.


> 2) flush the journal (like ext4_write_super_lockfs()) while resizing beyond
> 16TB.

Ah, thanks for pointing this out.

> This would also require changing over to META_BG at some point,
> because there cannot be enough reserved group descriptor blocks (the
> resize_inode is set up for a maximum of 2TB filesystems I think)
>

Any concerns about turning on META_BG by default for all new ext4
filesystems? Initially I thought we only needed META_BG to support
>256TB, so there was no rush to turn it on for all new filesystems. But
it appears there are multiple benefits to enabling META_BG by default:

- enable online resize >2TB
- support >256TB fs
- Since metadata (bitmaps, group descriptors, etc.) is no longer put at
the beginning of each block group, the 128MB limit (block group size
with 4k block size) that used to limit extent size is removed.
- Speeds up fsck since metadata is placed close together.

So I am wondering why not make it default?

Mingming

2007-04-20 05:03:45

by Andreas Dilger

Subject: Re: Missing JBD2_FEATURE_INCOMPAT_64BIT in ext4

On Apr 19, 2007 17:41 -0700, Mingming Cao wrote:
> Any concerns about turning on META_BG by default for all new ext4
> filesystems? Initially I thought we only needed META_BG to support
> >256TB, so there was no rush to turn it on for all new filesystems. But
> it appears there are multiple benefits to enabling META_BG by default:

I would prefer not to have it default for the first 1TB or so of the
filesystem. One reason is that using META_BG for all of the groups
gives us only 2 backups of each group descriptor, and those are relatively
close together. In the first 1TB we would get 17 backups of the group
descriptors, which should be plenty.
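
The figure of 17 can be reproduced with a short sketch, under the
assumptions that the sparse_super rule applies (backups in groups 0 and
1 and in groups that are powers of 3, 5 or 7) and that groups are 128MB,
so the first 1TB holds 8192 groups. Counting only the power-of-3/5/7
groups below 8192 gives 17. The function names are illustrative and are
not taken from e2fsprogs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n == base^k for some k >= 1. */
static bool model_is_power_of(uint64_t n, uint64_t base)
{
    uint64_t p = base;

    assert(base > 1);
    while (p < n)
        p *= base;
    return p == n;
}

/* sparse_super rule: groups 0 and 1, plus powers of 3, 5 and 7, carry a
 * copy of the superblock and group descriptors. */
static bool model_group_has_backup(uint64_t group)
{
    if (group <= 1)
        return true;
    return model_is_power_of(group, 3) ||
           model_is_power_of(group, 5) ||
           model_is_power_of(group, 7);
}

/* Count backup-carrying groups in [2, ngroups), i.e. excluding the
 * primary in group 0 and the copy in group 1. */
static unsigned model_count_power_backups(uint64_t ngroups)
{
    unsigned count = 0;

    for (uint64_t g = 2; g < ngroups; g++)
        if (model_group_has_backup(g))
            count++;
    return count;
}
```

With 8192 groups this counts 8 powers of 3, 5 powers of 5 and 4 powers
of 7.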

> - enable online resize >2TB

Actually, I don't think the current online resize code supports META_BG.
There was a patch last year by Glauber de Oliveira Costa which added
support for online resizing with META_BG, which would need to be updated
to work with ext4. Also, the usage of s_first_meta_bg in that patch is
incorrect.

> - support >256TB fs

True, though not exactly pressing, and filesystems can be changed
to add META_BG support at any point.

> - Since metadata (bitmaps, group descriptors, etc.) is no longer put at
> the beginning of each block group, the 128MB limit (block group size
> with 4k block size) that used to limit extent size is removed.
> - Speeds up fsck since metadata is placed close together.

That isn't really true, even though descriptions of META_BG say this.
There will still be block and inode bitmaps and the inode table.
The ext3 code was missing support for moving the bitmaps/itable outside
their respective groups, and that has not been fixed yet in ext4.

The problem is that ext4_check_descriptors() in the kernel was never
changed to support META_BG, so it does not allow the bitmaps or inode
table to be outside the group. Similarly, ext2fs_group_first_block()
and ext2fs_group_last_block() in lib/ext2fs also don't take META_BG
into account.

Also, since the extent format supports at most 2^15 blocks (128MB) it
doesn't really make much difference in that regard, though it does help
the allocator somewhat because it has more contiguous space to allocate
from.
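
The two 128MB figures coincide for 4k blocks, as this small arithmetic
sketch shows. It assumes the usual layout where a single bitmap block
covers one group (8 * block_size blocks per group) and takes the 2^15
block extent limit from the paragraph above; the helper names are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block has 8 * block_size bits, one bit per block. */
static uint64_t model_blocks_per_group(uint32_t block_size)
{
    assert(block_size != 0);
    return UINT64_C(8) * block_size;
}

/* Bytes covered by one block group. */
static uint64_t model_group_bytes(uint32_t block_size)
{
    return model_blocks_per_group(block_size) * block_size;
}

/* Bytes covered by one extent of at most 2^15 blocks. */
static uint64_t model_max_extent_bytes(uint32_t block_size)
{
    assert(block_size != 0);
    return (UINT64_C(1) << 15) * block_size;
}
```

For a 4096-byte block size both helpers return 2^27 bytes (128MB), so
lifting the group boundary alone would not lengthen a single extent.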

> So I am wondering why not make it default?

It wouldn't be too hard to add in support for this I think, and there
is definitely some benefit. Since neither e2fsprogs nor the kernel
handle this correctly, the placement of bitmaps and inode tables outside
of their respective groups may as well be a separate feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.