Ted,
there are several COMPAT flag assignments that have been proposed in
the past:
- EXT4_FEATURE_INCOMPAT_64BIT (0x0080!) - support for 64-bit block count
fields in the superblock (s_blocks_count_hi, s_free_blocks_count_hi),
large group descriptors (s_desc_size), extents with high 16 bits
(ee_start_hi, ei_leaf_hi), inode ACL (i_file_acl_hi). May also grow
to encompass the previously proposed BIG_BG.
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE (0x0008) - change i_blocks to be
in units of s_blocksize units instead of 512-byte sectors, use
l_i_frag and l_i_fsize as i_blocks_hi (could also be part of 64BIT).
Also uses EXT4_HUGE_FILE_FL 0x40000 for i_flags.
- EXT4_FEATURE_RO_COMPAT_GDT_CSUM (0x0010?) - store a crc16 checksum in
the group descriptor (s_uuid[16] | __u32 group | ext3_group_desc
(excluding gd_checksum itself)). This allows the kernel to more safely
manage UNINIT groups. Incomplete patch, e2fsck support mostly done.
- EXT4_FEATURE_RO_COMPAT_DIR_NLINK (0x0020?) - allow directories to have
> 65000 subdirectories (i_nlinks) by setting i_nlinks = 1 for such
directories. RO_COMPAT protects old filesystems from unlinking such
directories incorrectly and losing all files therein. Needs RO_COMPAT
flag handling, needs e2fsck support, but very heavily tested.
- EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE (0x0040?) - add s_min_extra_isize and
s_want_extra_isize fields to superblock, which allow specifying
the minimum and desired i_extra_isize fields in large inodes
(for nsec+epoch timestamps, potential other uses). Needs RO_COMPAT
flag handling, needs e2fsck support, patch complete, little testing.
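To make the INCOMPAT vs. RO_COMPAT distinction concrete, here is a minimal sketch of how an implementation is expected to treat feature bits it does not recognize. The enum and helper name are made up for illustration, and the values marked with "?" above are still only proposals.

```c
#include <stdint.h>

/* Values as proposed in this mail; those marked "?" above are tentative. */
#define EXT4_FEATURE_INCOMPAT_64BIT         0x0080
#define EXT4_FEATURE_RO_COMPAT_HUGE_FILE    0x0008
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM     0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK    0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE  0x0040

enum mount_mode { MOUNT_REFUSED, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* Hypothetical helper: decide how a filesystem may be mounted, given the
 * feature bits found on disk and the bits this implementation understands. */
static enum mount_mode
check_features(uint32_t sb_incompat, uint32_t sb_ro_compat,
               uint32_t known_incompat, uint32_t known_ro_compat)
{
    /* Any unknown INCOMPAT bit means the on-disk format cannot be trusted. */
    if (sb_incompat & ~known_incompat)
        return MOUNT_REFUSED;
    /* Unknown RO_COMPAT bits are safe to read past, but not to write past. */
    if (sb_ro_compat & ~known_ro_compat)
        return MOUNT_READ_ONLY;
    return MOUNT_READ_WRITE;
}
```

With this logic, an old kernel that knows none of these flags refuses a 64BIT filesystem outright, but can still mount a DIR_NLINK filesystem read-only.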
I'm not sure about the state of HUGE_FILE (it might be useful for ext[23]
to allow larger sparse files, but also fits quite well with INCOMPAT_64BIT),
but the others are definitely useful independent from INCOMPAT_64BIT.
There are patches in various states of completion for all of the features.
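As a rough illustration of what HUGE_FILE buys, the sketch below computes how many bytes i_blocks can account for with and without it. The struct and function names are hypothetical. It assumes that l_i_frag and l_i_fsize (8 bits each) together provide a 16-bit i_blocks_hi, and it models the unit change as a simple boolean.

```c
#include <stdint.h>

/* Hypothetical in-memory view of the two halves of i_blocks. */
struct iblocks {
    uint32_t lo;   /* the existing 32-bit i_blocks */
    uint16_t hi;   /* i_blocks_hi, carved out of l_i_frag + l_i_fsize */
};

/* Bytes accounted to the inode. Without HUGE_FILE, i_blocks counts
 * 512-byte sectors and the high half is ignored; with it, the combined
 * 48-bit value counts filesystem blocks. */
static uint64_t
inode_bytes(struct iblocks b, int huge_file, uint32_t blocksize)
{
    if (!huge_file)
        return (uint64_t)b.lo * 512;
    return (((uint64_t)b.hi << 32) | b.lo) * blocksize;
}
```

Without the feature the limit is just under 2 TiB (2^32 sectors of 512 bytes); with 48 bits counted in 4 KiB blocks the limit moves far beyond anything practical today.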
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Fri, Sep 22, 2006 at 02:15:20AM -0700, Andreas Dilger wrote:
> Ted,
> there are several COMPAT flag assignments that have been proposed in
> the past:
As Ted suggested, I am posting here a list of all the fields we plan to use;
perhaps others could complete it?
>
> - EXT4_FEATURE_INCOMPAT_64BIT (0x0080!) - support for 64-bit block count
> fields in the superblock (s_blocks_count_hi, s_free_blocks_count_hi),
> large group descriptors (s_desc_size), extents with high 16 bits
> (ee_start_hi, ei_leaf_hi), inode ACL (i_file_acl_hi). May also grow
> to encompass the previously proposed BIG_BG.
>
Here is a list of the fields we plan to use for 64-bit support; they must be
zero on filesystems without EXT4_FEATURE_INCOMPAT_64BIT.
struct ext4_super_block
{
/* at offset 0xfe */
__le32 s_desc_size; /* Group descriptor size */
/* at offset 0x150 */
__le32 s_blocks_count_hi; /* Blocks count */
__le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */
__le32 s_jnl_blocks_hi[17]; /* Backup of the journal inode */
};
struct ext4_group_desc
{
/* at offset 0x20 */
__le32 bg_block_bitmap; /* Blocks bitmap block hi bits */
__le32 bg_inode_bitmap; /* Inodes bitmap block hi bits */
__le32 bg_inode_table; /* Inodes table block hi bits */
__le16 bg_free_blocks_count; /* Free blocks count hi bits */
__le16 bg_free_inodes_count; /* Free inodes count hi bits */
__le16 bg_used_dirs_count; /* Directories count hi bits */
};
Basically, we make all block numbers 64-bit and we double the size of all
xxx_count fields in the block group descriptor.
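A minimal sketch of how the split fields would be read back, assuming the values have already been converted from little-endian. The struct and helper names are only illustrative and are not taken from the patches.

```c
#include <stdint.h>

#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080

/* Hypothetical, already byte-swapped view of the relevant superblock fields. */
struct sb_counts {
    uint32_t s_feature_incompat;
    uint32_t s_blocks_count;      /* existing low 32 bits */
    uint32_t s_blocks_count_hi;   /* new high 32 bits */
};

/* Full block count: the high half only counts when 64BIT is set. */
static uint64_t
sb_blocks_count(const struct sb_counts *sb)
{
    uint64_t count = sb->s_blocks_count;

    if (sb->s_feature_incompat & EXT4_FEATURE_INCOMPAT_64BIT)
        count |= (uint64_t)sb->s_blocks_count_hi << 32;
    return count;
}

/* Group descriptor counters grow from 16 to 32 bits in the same way. */
static uint32_t
gd_count(uint16_t lo, uint16_t hi, int has_64bit)
{
    return has_64bit ? ((uint32_t)hi << 16) | lo : lo;
}
```

Ignoring the high halves when the flag is clear is a defensive choice in this sketch; the fields are supposed to be zero in that case anyway.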
> - EXT4_FEATURE_RO_COMPAT_HUGE_FILE (0x0008) - change i_blocks to be
> in units of s_blocksize units instead of 512-byte sectors, use
> l_i_frag and l_i_fsize as i_blocks_hi (could also be part of 64BIT).
> Also uses EXT4_HUGE_FILE_FL 0x40000 for i_flags.
>
> - EXT4_FEATURE_RO_COMPAT_GDT_CSUM (0x0010?) - store a crc16 checksum in
> the group descriptor (s_uuid[16] | __u32 group | ext3_group_desc
> (excluding gd_checksum itself)). This allows the kernel to more safely
> manage UNINIT groups. Incomplete patch, e2fsck support mostly done.
>
> - EXT4_FEATURE_RO_COMPAT_DIR_NLINK (0x0020?) - allow directories to have >
> 65000 subdirectories (i_nlinks) by setting i_nlinks = 1 for such
> directories. RO_COMPAT protects old filesystems from unlinking such
> directories incorrectly and losing all files therein. Needs RO_COMPAT
> flag handling, needs e2fsck support, but very heavily tested.
>
> - EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE (0x0040?) - add s_min_extra_isize and
> s_want_extra_isize fields to superblock, which allow specifying
> the minimum and desired i_extra_isize fields in large inodes
> (for nsec+epoch timestamps, potential other uses). Needs RO_COMPAT
> flag handling, needs e2fsck support, patch complete, little testing.
>
>
> I'm not sure about the state of HUGE_FILE (it might be useful for ext[23]
> to allow larger sparse files, but also fits quite well with INCOMPAT_64BIT),
> but the others are definitely useful independent from INCOMPAT_64BIT.
> There are patches in various states of completion for all of the features.
>
I agree, this should go with INCOMPAT_64BIT.
There's also the change attribute patch; it currently uses the l_i_reserved2
field of the inode:
- __u32 l_i_reserved2;
+ __le32 l_i_change_attribute;
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_chattr osd2.linux2.l_i_change_attribute
It doesn't need a RO_COMPAT/INCOMPAT flag because there are no incompatibility
issues when kernels that do not support the change attribute mount
filesystems that have used it. It also doesn't really need changes in fsck.
cheers,
-- Alexandre
Alexandre Ratchov writes:
> here is a list of fields we plan to use for the 64bit support, they must be
> zero on file systems without the EXT4_FEATURE_INCOMPAT_64BIT.
>
> struct ext4_super_block
> {
> /* at offset 0xfe */
> __le32 s_desc_size; /* Group descriptor size */
> /* at offset 0x150 */
> __le32 s_blocks_count_hi; /* Blocks count */
> __le32 s_r_blocks_count_hi; /* Reserved blocks count */
> __le32 s_free_blocks_count_hi; /* Free blocks count */
> __le32 s_jnl_blocks_hi[17]; /* Backup of the journal inode */
> };
>
> struct ext4_group_desc
> {
> /* at offset 0x20 */
> __le32 bg_block_bitmap; /* Blocks bitmap block hi bits */
> __le32 bg_inode_bitmap; /* Inodes bitmap block hi bits */
> __le32 bg_inode_table; /* Inodes table block hi bits */
> __le16 bg_free_blocks_count; /* Free blocks count hi bits */
> __le16 bg_free_inodes_count; /* Free inodes count hi bits */
> __le16 bg_used_dirs_count; /* Directories count hi bits */
> };
>
> basically, we make 64bit all block numbers and we double the size of all
> xxx_count in the block group descriptor.
When you do this have you considered at least reserving fields in the
new 64bit indirect blocks for checksums for each block?
IMHO it would be a great advantage to checksum all metadata
(as demonstrated by ZFS) and CPU cycles are cheap enough now that it is
basically free.
The checksums could be different feature flags, but it would be useful
to reserve space in any new format. 16 free bytes in each block should be enough.
-Andi
On Sep 28, 2006 22:29 +0200, Andi Kleen wrote:
> Alexandre Ratchov writes:
> > struct ext4_group_desc
> > {
> > /* at offset 0x20 */
> > __le32 bg_block_bitmap; /* Blocks bitmap block hi bits */
> > __le32 bg_inode_bitmap; /* Inodes bitmap block hi bits */
> > __le32 bg_inode_table; /* Inodes table block hi bits */
> > __le16 bg_free_blocks_count; /* Free blocks count hi bits */
> > __le16 bg_free_inodes_count; /* Free inodes count hi bits */
> > __le16 bg_used_dirs_count; /* Directories count hi bits */
> > };
> >
> > basically, we make 64bit all block numbers and we double the size of all
> > xxx_count in the block group descriptor.
>
> When you do this have you considered at least reserving fields in the
> new 64bit indirect blocks for checksums for each block?
>
> IMHO it would be a great advantage to checksum all metadata
> (as demonstrated by ZFS) and CPU cycles are cheap enough now that it is
> basically free.
Actually, there are several plans afoot in that direction already.
Some of them need at least some help in the "finish up and get it
into the kernel" department, some of them are just ideas previously
discussed.
One of the reasons for Alexandre pushing the 64-bit inode/block counters
into the "large" descriptor is that the 64-bit filesystem is already
incompatible with a 32-bit filesystem so there is no extra harm, and this
leaves space in the "original" group descriptor for checksums of the block
and inode bitmaps. The bitmaps are a critical single point of failure,
and having checksums allows the kernel to avoid cascading
filesystem corruption even if it can't (yet) do anything about it.
Having the checksums in the "original" group descriptor allows this
feature to be used on both 32-bit and 64-bit filesystems.
No work has been done on this yet. Getting checksums to be efficient
depends on having a generic callback mechanism from the journal code
to avoid repeated checksums on a block while it is being modified.
The journal callback would do the checksum exactly once for each block
(or sub-structure therein) at checkpoint time.
A second change is to add checksums to the ext3 journal commit blocks
(per U. Wisconsin) to avoid need for 2-phase commit for transactions,
and to provide redundancy. Patches for the kernel and e2fsck are
available for that already (not 100% sure if I posted them here).
A third change is checksums for the group descriptors themselves, to allow mke2fs
and the kernel to handle "uninitialized groups". This means that mke2fs
doesn't need to zero the block/inode bitmaps and inode table, and the
kernel can selectively initialize the inode tables so that e2fsck does not
need to read all of them. The checksum is a safety check on
the group descriptor flags, as well as providing normal corruption detection.
Patches for the kernel and e2fsck are in early prototype and were posted
about a week ago.
Finally, the extents format has the capability (though no code is implemented
for this yet) to store a checksum in each index and extent block. This
would be done by reducing the count of allowed entries in the block and
storing an ext3_extent_tail (checksum, inode+generation backpointer) as
the last entry in the block. No work has been done on this, but I've
described the ext3_extent_tail a few times previously on this list.
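To give a feel for the cost, here is a small sketch of the entry-count arithmetic. The 12-byte header and 12-byte entry sizes match the current extents format; the tail layout below is purely an assumption (three 32-bit fields, so that it occupies exactly one entry slot), and the field and function names are illustrative.

```c
#include <stdint.h>

#define EXTENT_HEADER_SIZE 12u  /* size of struct ext3_extent_header */
#define EXTENT_ENTRY_SIZE  12u  /* size of struct ext3_extent / ext3_extent_idx */

/* Assumed layout: only the contents (checksum, inode + generation
 * backpointer) are described above, the field widths are a guess. */
struct ext3_extent_tail {
    uint32_t et_inum;        /* owning inode number */
    uint32_t et_igeneration; /* owning inode generation */
    uint32_t et_checksum;    /* checksum of the block */
};

/* Number of extent or index entries that fit in one block. */
static uint32_t
max_entries(uint32_t blocksize, int with_tail)
{
    uint32_t space = blocksize - EXTENT_HEADER_SIZE;

    if (with_tail)
        space -= (uint32_t)sizeof(struct ext3_extent_tail);
    return space / EXTENT_ENTRY_SIZE;
}
```

Under these assumptions a 4096-byte block goes from 340 entries to 339, so the tail costs one entry per block.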
Cheers, Andreas
On Sep 28, 2006 10:55 +0200, Alexandre Ratchov wrote:
> here is a list of fields we plan to use for the 64bit support, they must be
> zero on file systems without the EXT4_FEATURE_INCOMPAT_64BIT.
>
> struct ext4_super_block
> {
> /* at offset 0xfe */
> __le32 s_desc_size; /* Group descriptor size */
I believe this is actually a __u16 and not __u32. The group descriptor
can't be larger than a filesystem block anyways. Formerly called
s_reserved_word_pad.
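As a sketch of why a 16-bit s_desc_size is enough and how it would be used, the helper below finds a descriptor inside the group descriptor table. The names are hypothetical, and falling back to 32 bytes when s_desc_size is zero is my assumption, based on the field being zero on filesystems without 64BIT.

```c
#include <assert.h>
#include <stdint.h>

#define EXT3_DESC_SIZE 32u  /* size of the original group descriptor */

struct gd_location {
    uint32_t block;   /* index of the block within the descriptor table */
    uint32_t offset;  /* byte offset of the descriptor inside that block */
};

/* Locate the descriptor for `group`, given the block size and s_desc_size. */
static struct gd_location
locate_group_desc(uint32_t group, uint32_t blocksize, uint16_t s_desc_size)
{
    uint32_t desc_size = s_desc_size ? s_desc_size : EXT3_DESC_SIZE;
    uint32_t per_block = blocksize / desc_size;
    struct gd_location loc;

    /* A descriptor larger than a block would make per_block zero. */
    assert(per_block > 0);
    loc.block = group / per_block;
    loc.offset = (group % per_block) * desc_size;
    return loc;
}
```

For example, with 4096-byte blocks group 130 lives in table block 1 at offset 64 with 32-byte descriptors, and in table block 2 at offset 128 with 64-byte descriptors.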
> > - EXT4_FEATURE_RO_COMPAT_GDT_CSUM (0x0010?) - store a crc16 checksum in
> > the group descriptor (s_uuid[16] | __u32 group | ext3_group_desc
> > (excluding gd_checksum itself)). This allows the kernel to more safely
> > manage UNINIT groups. Incomplete patch, e2fsck support mostly done.
struct ext3_group_desc
{
__le32 bg_block_bitmap; /* Blocks bitmap block */
__le32 bg_inode_bitmap; /* Inodes bitmap block */
__le32 bg_inode_table; /* Inodes table block */
__le16 bg_free_blocks_count; /* Free blocks count */
__le16 bg_free_inodes_count; /* Free inodes count */
__le16 bg_used_dirs_count; /* Directories count */
- __u16 bg_pad;
- __le32 bg_reserved[3];
+ __le16 bg_flags;
+ __le32 bg_reserved[2];
+ __le16 bg_itable_unused; /*Unused inodes count*/
+ __le16 bg_checksum; /*crc16(s_uuid+group_num+group_desc)*/
};
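A minimal sketch of the checksum calculation over this layout. The bitwise CRC-16 below uses the reflected polynomial 0xA001 and a seed of 0xFFFF; both are assumptions on my part, as is the host-endian struct, so treat it as an illustration of "crc16 over uuid, group number and descriptor up to the checksum field" rather than the exact on-disk algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-endian mirror of the descriptor layout above (32 bytes, no padding). */
struct group_desc {
    uint32_t bg_block_bitmap;
    uint32_t bg_inode_bitmap;
    uint32_t bg_inode_table;
    uint16_t bg_free_blocks_count;
    uint16_t bg_free_inodes_count;
    uint16_t bg_used_dirs_count;
    uint16_t bg_flags;
    uint32_t bg_reserved[2];
    uint16_t bg_itable_unused;
    uint16_t bg_checksum;
};

/* Bitwise CRC-16, reflected polynomial 0xA001 (assumed variant). */
static uint16_t
crc16(uint16_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    }
    return crc;
}

/* Checksum of uuid | group number | descriptor, excluding bg_checksum. */
static uint16_t
group_desc_csum(const uint8_t s_uuid[16], uint32_t group,
                const struct group_desc *gd)
{
    uint16_t crc = crc16(0xFFFF, s_uuid, 16);   /* seed is an assumption */

    crc = crc16(crc, &group, sizeof(group));
    return crc16(crc, gd, offsetof(struct group_desc, bg_checksum));
}
```

Including the UUID and group number means a descriptor copied from another filesystem or another group fails verification, and stopping at offsetof(bg_checksum) keeps the stored value out of its own calculation.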
> > - EXT4_FEATURE_RO_COMPAT_DIR_NLINK (0x0020?) - allow directories to have >
> > 65000 subdirectories (i_nlinks) by setting i_nlinks = 1 for such
> > directories. RO_COMPAT protects old filesystems from unlinking such
> > directories incorrectly and losing all files therein. Needs RO_COMPAT
> > flag handling, needs e2fsck support, but very heavily tested.
No extra fields needed, just compat. Bumps EXT3_LINK_MAX to 65000.
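Roughly, the link count handling would look like the sketch below. The helper names are made up, and the exact point at which the count switches to 1 is an assumption; the idea is only that a count of 1 means "too many subdirectories to track".

```c
#include <stdint.h>

#define EXT3_LINK_MAX 65000

/* Called when a subdirectory is added to a directory. */
static void
dir_inc_nlink(uint16_t *nlink)
{
    if (*nlink == 1)                /* already in "uncounted" mode */
        return;
    if (*nlink >= EXT3_LINK_MAX)    /* would overflow the limit: stop counting */
        *nlink = 1;
    else
        (*nlink)++;
}

/* Called when a subdirectory is removed from a directory. */
static void
dir_dec_nlink(uint16_t *nlink)
{
    if (*nlink > 2)                 /* a count of 1 stays at 1 */
        (*nlink)--;
}
```

Once a directory is in this mode its link count can no longer say whether it is empty, so rmdir has to check the directory contents instead. That is why a kernel without this support must not modify such a filesystem.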
> > - EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE (0x0040?) - add s_min_extra_isize and
> > s_want_extra_isize fields to superblock, which allow specifying
> > the minimum and desired i_extra_isize fields in large inodes
> > (for nsec+epoch timestamps, potential other uses). Needs RO_COMPAT
> > flag handling, needs e2fsck support, patch complete, little testing.
No patch yet which uses s_*_extra_isize, they can go in next available slots.
struct ext3_inode {
[...]
} osd2; /* OS dependent 2 */
__le16 i_extra_isize;
__le16 i_pad1;
__le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
__le32 i_mtime_extra; /* extra Modification time (nsec << 2 | epoch) */
__le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
__le32 i_extra_reserved1;
};
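A short sketch of the "nsec << 2 | epoch" packing: nanoseconds need 30 bits, which leaves 2 bits to extend the existing 32-bit seconds field. The function names are hypothetical, and for simplicity the seconds are treated as unsigned, so dates before 1970 are not handled here.

```c
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1)

/* Build the *_extra field from a 34-bit seconds value and nanoseconds.
 * The low 32 bits of `seconds` go into the existing i_[cma]time field. */
static uint32_t
encode_extra(uint64_t seconds, uint32_t nsec)
{
    uint32_t epoch = (uint32_t)(seconds >> 32) & EPOCH_MASK;

    return (nsec << EPOCH_BITS) | epoch;
}

/* Recover seconds and nanoseconds from the base field plus the extra field. */
static void
decode_extra(uint32_t base, uint32_t extra, uint64_t *seconds, uint32_t *nsec)
{
    *seconds = ((uint64_t)(extra & EPOCH_MASK) << 32) | base;
    *nsec = extra >> EPOCH_BITS;
}
```

An old kernel that only reads the 32-bit base field still sees a sensible time for any date within the first epoch, since the extra field is zero-extended information.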
> There's also the change attribute patch; it currently uses the l_i_reserved2
> field of the inode:
>
> - __u32 l_i_reserved2;
> + __le32 l_i_change_attribute;
>
> -#define i_reserved2 osd2.linux2.l_i_reserved2
> +#define i_chattr osd2.linux2.l_i_change_attribute
>
> It doesn't need RO_COMPAT/INCOMPAT flag because there are no incompatibility
> issues with kernels that do not support the change attribute but that mount
> file systems that have used it. Also it doesn't really need changes in fsck.
Did we decide if l_i_change_attribute would also be the ctime nsec value?
That would affect the RO_COMPAT_EXTRA_ISIZE implementation above, putting
the i_ctime_extra in place of l_i_reserved2. That doesn't change the
patch significantly, though it does need the "always increment" change.
Cheers, Andreas
> Actually, there are several plans afoot in that direction already.
> Some of them need at least some help in the "finish up and get it
> into the kernel" department, some of them are just ideas previously
> discussed..
The important part right now is to just keep enough space in all
structures that are being changed anyways.
>
> One of the reason for Alexandre pushing the 64-bit inode/block counters
> into the "large" descriptor is because the 64-bit filesystem is already
> incompatible with a 32-bit filesystem so there is no extra harm, and this
> leaves space in the "original" group descriptor for checksums of the block
> and inode bitmaps. The bitmap checksums are a critical single-point-of-
> failure, and having checksums allows the kernel to avoid cascading
> filesystem corruption even if it can't (yet) do anything about it.
> Having the checksums in the "original" group descriptor allows this
> feature to be used on both 32-bit and 64-bit filesystems.
Ok.
> No work has been done on this yet. Getting checksums to be efficient
> depends on having a generic callback mechanism from the journal code
> to avoid repeated checksums on a block while it is being modified.
You can just do incremental checksumming which is very cheap.
Or did you mean the flushing to disk of the checksum? If it's always in the same
object that would be free, but that is not possible for bitmaps at least.
But I guess the checksum write in the block descriptor
could be done very lazily at least, perhaps keeping track on disk if invalid
checksums are expected or not.
> Finally, the extents format has the capability (though no code is implemented
> for this yet) to store a checksum in each index and extent block. This
> would be done by reducing the count of allowed entries in the block and
> storing an ext3_extent_tail (checksum, inode+generation backpointer) as
> the last entry in the block. No work has been done on this, but I've
> described the ext3_extent_tail a few times previously on this list.
Old style indirect blocks will need them too. My thinking was
to use another block for those (so an indirect block would be two nearby
blocks).
Inodes need them too, but with the inode extension it will hopefully
not be a problem to keep a few bytes for this.
And directories, which should be relatively easy to extend with
the current format.
-Andi
On Sep 29, 2006 01:06 +0200, Andi Kleen wrote:
> Andreas Dilger wrote:
> > No work has been done on this yet. Getting checksums to be efficient
> > depends on having a generic callback mechanism from the journal code
> > to avoid repeated checksums on a block while it is being modified.
>
> You can just do incremental checksumming which is very cheap.
>
> Or did you mean the flushing to disk of the checksum? If it's always in
> the same object that would be free, but that is not possible for bitmaps
> at least. But I guess the checksum write in the block descriptor
> could be done very lazily at least, perhaps keeping track on disk if invalid
> checksums are expected or not.
I'm not sure I understand what you mean. My goal is that the ext4 code
modifies the block as many times as it wants during a transaction (this
may happen from multiple threads for a single block), then just before
the transaction is committed to disk the journal calls a callback for that
block (inode, group descriptor, bitmap, superblock, extent, index, etc) and
computes the checksum only once for that block. Then the block is flushed
to the filesystem.
I'm not sure I like the idea of writing "this block doesn't have a valid
checksum" to disk, since there is some risk of that block being corrupted
during a crash and then we don't know if the block is valid or not.
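To show what I mean by the callback, here is a toy model. Everything in it (struct, function names, the stand-in checksum) is hypothetical and has nothing to do with the real journal interfaces; it only demonstrates that modifications are cheap and the checksum is computed once per block at commit.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 64   /* tiny block, just for illustration */

struct tracked_block {
    uint8_t  data[TOY_BLOCK_SIZE];
    int      dirty;                 /* modified in the running transaction */
    uint32_t csum;                  /* checksum as of the last commit */
    unsigned csum_runs;             /* how often the checksum was computed */
    void   (*commit_cb)(struct tracked_block *);
};

/* Stand-in checksum (Fletcher-style sums), not the one ext4 would use. */
static uint32_t
toy_checksum(const uint8_t *p, size_t len)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

/* Per-block callback invoked by the commit path. */
static void
checksum_cb(struct tracked_block *blk)
{
    blk->csum = toy_checksum(blk->data, sizeof(blk->data));
    blk->csum_runs++;
}

/* Modify the block; no checksum work is done here. */
static void
block_modify(struct tracked_block *blk, size_t off, uint8_t val)
{
    blk->data[off] = val;
    blk->dirty = 1;
}

/* Just before commit, run the callback once for every dirty block. */
static void
transaction_commit(struct tracked_block **blocks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct tracked_block *blk = blocks[i];

        if (blk->dirty && blk->commit_cb) {
            blk->commit_cb(blk);
            blk->dirty = 0;
        }
    }
}
```

However many times block_modify() is called within a transaction, checksum_cb() runs once for that block, and a block that was not touched is skipped entirely.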
> > Finally, the extents format has the capability (though no code is
> > implemented for this yet) to store a checksum in each index and extent
> > block... storing an ext3_extent_tail (checksum, inode+generation
> > backpointer) as the last entry in the block.
>
> Old style indirect blocks will need them too. My thinking was
> to use another block for those (so a indirect block would be two nearby
> blocks)
We couldn't do this for the existing indirect blocks easily, but what I'd
thought is that it is possible to either have e2fsck convert block-mapped
files to extent mapped (with extent tail of checksum + inode backpointer)
or have a new block-mapped extent (for fragmented files), which would also
have a header with magic (so that random garbage in a large filesystem
doesn't look like a valid [dt]indirect block) and also have the extent
tail to contain the checksum + inode backpointer.
> Inodes need them, but with the inode extension that will be hopefully
> not a problem to keep a few bytes for this.
Yes, it might even be valuable to put this into the "small" inode so
that it can be used for existing ext3 filesystems.
> And directories, which should be relatively easy to extend with
> the current format.
Haven't thought about that specifically for directories, but I do have
some ideas about enhancing the directory format to allow storing more
data into the dir_entries (e.g. 64-bit inode) and possibly using the
same code to store a tree of EAs in the same format as directories, so
the htree code can be used to do lookups if there are lots of EAs.
Cheers, Andreas
On Thu, Sep 28, 2006 at 10:55:15AM +0200, Alexandre Ratchov wrote:
> struct ext4_super_block
> {
> /* at offset 0xfe */
> __le32 s_desc_size; /* Group descriptor size */
> /* at offset 0x150 */
> __le32 s_blocks_count_hi; /* Blocks count */
> __le32 s_r_blocks_count_hi; /* Reserved blocks count */
> __le32 s_free_blocks_count_hi; /* Free blocks count */
> __le32 s_jnl_blocks_hi[17]; /* Backup of the journal inode */
> };
Why do we need to have the high block numbers of the journal inode?
s_jnl_blocks was just a backup of the i_blocks[] array. But if we are
assuming that we will only support 64-bits using extents, we shouldn't
need s_jnl_blocks_hi[]. How specifically is this array being used in
the patches?
- Ted
On Oct 04, 2006 16:04 -0400, Theodore Tso wrote:
> On Thu, Sep 28, 2006 at 10:55:15AM +0200, Alexandre Ratchov wrote:
> > struct ext4_super_block
> > {
> > /* at offset 0xfe */
> > __le32 s_desc_size; /* Group descriptor size */
> > /* at offset 0x150 */
> > __le32 s_blocks_count_hi; /* Blocks count */
> > __le32 s_r_blocks_count_hi; /* Reserved blocks count */
> > __le32 s_free_blocks_count_hi; /* Free blocks count */
> > __le32 s_jnl_blocks_hi[17]; /* Backup of the journal inode */
> > };
>
> Why do we need to have the high blocks # of the journal inode.
> s_jnl_blocks was just a backup of the i_blocks[] array. But if we are
> assuming that we will only support 64-bits using extents, we shouldn't
> need s_jnl_blocks_hi[]. How specifically is this array being used in
> the patches?
Good question, I don't know that it is. Even if the journal was extent
mapped (possible, but would need support in e2fsprogs for this) the
data would be stored in the same sized i_blocks array.
Cheers, Andreas
On Wed, Oct 04, 2006 at 06:19:06PM -0600, Andreas Dilger wrote:
> Good question, I don't know that it is. Even if the journal was extent
> mapped (possible, but would need support in e2fsprogs for this) the
> data would be stored in the same sized i_blocks array.
That won't be too hard. The e2fsprogs code is designed to be
identical to the kernel code, so we just drop in a new version of
fs/ext3/recovery.c that understands extents into e2fsck/recovery.c,
and e2fsprogs will have support. :-)
- Ted
Theodore Tso wrote:
> On Thu, Sep 28, 2006 at 10:55:15AM +0200, Alexandre Ratchov wrote:
>> struct ext4_super_block
>> {
>> /* at offset 0xfe */
>> __le32 s_desc_size; /* Group descriptor size */
>> /* at offset 0x150 */
>> __le32 s_blocks_count_hi; /* Blocks count */
>> __le32 s_r_blocks_count_hi; /* Reserved blocks count */
>> __le32 s_free_blocks_count_hi; /* Free blocks count */
>> __le32 s_jnl_blocks_hi[17]; /* Backup of the journal inode */
>> };
>
> Why do we need to have the high blocks # of the journal inode.
> s_jnl_blocks was just a backup of the i_blocks[] array. But if we are
> assuming that we will only support 64-bits using extents, we shouldn't
> need s_jnl_blocks_hi[]. How specifically is this array being used in
> the patches?
The s_jnl_blocks_hi[] array is not used in the current patchset.
Alexandre wanted to reserve these fields for future use, for instance
to support larger inode sizes.
Since we will not use them in the short term and still need to think
about this, you can remove the array.
Regards,
Valérie