2006-10-18 06:25:09

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size


Comments?

- Ted


On-disk format for controlling the inode size

- EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE (0x0040?) - add s_min_extra_isize and
s_want_extra_isize fields to superblock, which allow specifying
the minimum and desired i_extra_isize fields in large inodes
(for nsec+epoch timestamps, potential other uses). Needs RO_COMPAT
flag handling, needs e2fsck support, patch complete, little testing.

Signed-off-by: "Theodore Ts'o" <[email protected]>

Index: e2fsprogs/lib/ext2fs/ext2_fs.h
===================================================================
--- e2fsprogs.orig/lib/ext2fs/ext2_fs.h 2006-10-18 01:49:51.000000000 -0400
+++ e2fsprogs/lib/ext2fs/ext2_fs.h 2006-10-18 02:07:35.000000000 -0400
@@ -298,7 +298,7 @@
__u16 i_uid; /* Low 16 bits of Owner Uid */
__u32 i_size; /* Size in bytes */
__u32 i_atime; /* Access time */
- __u32 i_ctime; /* Creation time */
+ __u32 i_ctime; /* Inode Change time */
__u32 i_mtime; /* Modification time */
__u32 i_dtime; /* Deletion Time */
__u16 i_gid; /* Low 16 bits of Group Id */
@@ -356,7 +356,7 @@
__u16 i_uid; /* Low 16 bits of Owner Uid */
__u32 i_size; /* Size in bytes */
__u32 i_atime; /* Access time */
- __u32 i_ctime; /* Creation time */
+ __u32 i_ctime; /* Inode Change time */
__u32 i_mtime; /* Modification time */
__u32 i_dtime; /* Deletion Time */
__u16 i_gid; /* Low 16 bits of Group Id */
@@ -405,6 +405,11 @@
} osd2; /* OS dependent 2 */
__u16 i_extra_isize;
__u16 i_pad1;
+ __u32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
+ __u32 i_mtime_extra; /* extra Mod. time (nsec << 2 | epoch) */
+ __u32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
+ __u32 i_crtime; /* File creation time */
+ __u32 i_crtime_extra; /* extra File creation time (nsec << 2 | epoch)*/
};

#define i_size_high i_dir_acl
@@ -554,7 +559,9 @@
__u32 s_blocks_count_hi; /* Blocks count high 32bits */
__u32 s_r_blocks_count_hi; /* Reserved blocks count high 32 bits*/
__u32 s_free_blocks_hi; /* Free blocks count */
- __u32 s_reserved[169]; /* Padding to the end of the block */
+ __u16 s_min_extra_isize; /* All inodes have at least # bytes */
+ __u16 s_want_extra_isize; /* New inodes should reserve # bytes */
+ __u32 s_reserved[168]; /* Padding to the end of the block */
};

/*
@@ -607,6 +614,7 @@
#define EXT4_FEATURE_RO_COMPAT_HUGE_FILE 0x0008
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
+#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040

#define EXT2_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT2_FEATURE_INCOMPAT_FILETYPE 0x0002


2006-10-18 19:34:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size

On Oct 18, 2006 02:25 -0400, Theodore Ts'o wrote:
> On-disk format for controlling the inode size
>
> - EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE (0x0040?) - add s_min_extra_isize and
> s_want_extra_isize fields to superblock, which allow specifying
> the minimum and desired i_extra_isize fields in large inodes
> (for nsec+epoch timestamps, potential other uses). Needs RO_COMPAT
> flag handling, needs e2fsck support, patch complete, little testing.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>

> @@ -405,6 +405,11 @@
> } osd2; /* OS dependent 2 */
> __u16 i_extra_isize;
> __u16 i_pad1;
> + __u32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */

There was some discussion about moving the i_ctime_extra field into the
small inode, for use as a version field by NFSV4. The proposed field was
l_i_reserved2 in the original Bull patch.

> + __u32 i_mtime_extra; /* extra Mod. time (nsec << 2 | epoch) */
^^^ Modify

Looks good otherwise.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-10-18 22:16:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size

On Wed, Oct 18, 2006 at 01:34:38PM -0600, Andreas Dilger wrote:
> There was some discussion about moving the i_ctime_extra field into the
> small inode, for use as a version field by NFSV4. The proposed field was
> l_i_reserved2 in the original Bull patch.

The potential problem with this is what if program makes a huge number
of changes to inodes, such that we end up having to bump the ctime
field into the future?

But I have another question --- why does this the change attribute
counter have to be a persistent value that takes up space in the inode
at all? This is strictly a NFSv4 only hack, and NFSv4 simply requires
a 64-bit increasing counter when any inode attributes have changed, so
it can determine whether client-side caches are refreshed.

So let the high 32-bits of the attribute value be the time in seconds
that the system was booted (or the filesystem was mounted; I don't
care), and let the low 32-bit values be a value which is stored in the
in-core inode which is set to a global counter value which is
incremented whenever inode is changed. If the in-core inode gets
dropped and then later reloaded, it will get a newer global counter
value, which may forced some clients to reload their cache when they
may not have needed to --- and the same would happen if the server
gets rebooted, since the high 32-bits would get bumped to the new
mount time.

But that's not a big deal, since it's only there to make client-side
caching side more efficient. In these cases the clients may need to
refresh their inode cache from the server, but is that really going to
be such a bad thing?

- Ted

2006-10-18 22:59:39

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH, E2FSPROGS] On-disk format for inode extra size control inode size

On Oct 18, 2006 18:16 -0400, Theodore Tso wrote:
> On Wed, Oct 18, 2006 at 01:34:38PM -0600, Andreas Dilger wrote:
> > There was some discussion about moving the i_ctime_extra field into the
> > small inode, for use as a version field by NFSV4. The proposed field was
> > l_i_reserved2 in the original Bull patch.
>
> The potential problem with this is what if program makes a huge number
> of changes to inodes, such that we end up having to bump the ctime
> field into the future?

Well, if you are changing the inode billions of times/second for any
extended amount of time, then yes that might be possible. Normally
setting it to the current ctime and incrementing the nsec timestamp
on the rare cases two threads are updating it at once is sufficient.
Note that in this version is only critical to compare for a single
inode, and not filesystem-wide.

> But I have another question --- why does this the change attribute
> counter have to be a persistent value that takes up space in the inode
> at all? This is strictly a NFSv4 only hack, and NFSv4 simply requires
> a 64-bit increasing counter when any inode attributes have changed, so
> it can determine whether client-side caches are refreshed.

Well, Lustre has a need for this also, in order to know if a particular
modification to an inode has been done or not. Using the nsec ctime is
a way of saving space in the inode, while at the same time allowing two
operations to be ordered relative to each other. This needs to be
persistent across a server reboot for both NFSv4 and Lustre.

Since i_ctime_extra needs to be stored on disk anyways, I don't see why
there would be an objection to having it serve a dual purpose. For
Lustre we don't actually care whether it is in the core inode, since we
always format filesystems with large inodes anyways, but in consideration
of NFSv4 exporting existing ext3 filesystems I suggested i_ctime_extra go
into the core inode.

> So let the high 32-bits of the attribute value be the time in seconds
> that the system was booted (or the filesystem was mounted; I don't
> care), and let the low 32-bit values be a value which is stored in the
> in-core inode which is set to a global counter value which is
> incremented whenever inode is changed. If the in-core inode gets
> dropped and then later reloaded, it will get a newer global counter
> value, which may forced some clients to reload their cache when they
> may not have needed to --- and the same would happen if the server
> gets rebooted, since the high 32-bits would get bumped to the new
> mount time.

This would work, I suppose. It doesn't allow this value to be exported
to userspace without additional API changes as the i_ctime_extra one can.
The initial Bull patch consumed an extra field in struct stat, stat64,
struct inode, and struct iattr in order to allow userspace access also.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.