2006-11-29 18:53:13

by Cordenner jean noel

[permalink] [raw]
Subject: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

This part of the patch concerns the ext4 code.



Signed-off-by: Celine Bourde
Signed-off-by: Alexandre Ratchov
Signed-off-by: Jean Noel Cordenner <[email protected]>


--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/ialloc.c 2006-11-09
17:20:57.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/ialloc.c 2006-11-29 15:34:35.000000000 +0100
@@ -564,6 +564,7 @@
/* This is the optimal IO size (for stat), not the fs block size */
inode->i_blocks = 0;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute = 1;

memset(ei->i_data, 0, sizeof(ei->i_data));
ei->i_dir_start_lookup = 0;
Index: linux-2.6.19-rc2-mm2-CA/fs/ext4/inode.c
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/inode.c 2006-11-09
17:20:57.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/inode.c 2006-11-29 15:38:39.000000000 +0100
@@ -728,6 +728,7 @@
/* We are done with atomic stuff, now do the rest of housekeeping */

inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
ext4_mark_inode_dirty(handle, inode);

/* had we spliced it onto indirect block? */
@@ -2377,6 +2378,7 @@

mutex_unlock(&ei->truncate_mutex);
inode->i_mtime = inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
ext4_mark_inode_dirty(handle, inode);

/*
@@ -2615,6 +2617,7 @@
inode->i_ctime.tv_sec = le32_to_cpu(raw_inode->i_ctime);
inode->i_mtime.tv_sec = le32_to_cpu(raw_inode->i_mtime);
inode->i_atime.tv_nsec = inode->i_ctime.tv_nsec =
inode->i_mtime.tv_nsec = 0;
+ inode->i_change_attribute = le32_to_cpu(raw_inode->i_chattr);

ei->i_state = 0;
ei->i_dir_start_lookup = 0;
@@ -2773,6 +2776,7 @@
raw_inode->i_atime = cpu_to_le32(inode->i_atime.tv_sec);
raw_inode->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
+ raw_inode->i_chattr = cpu_to_le32(inode->i_change_attribute);
raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
Index: linux-2.6.19-rc2-mm2-CA/fs/ext4/ioctl.c
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/ioctl.c 2006-11-29
15:41:28.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/ioctl.c 2006-11-29 15:42:21.000000000 +0100
@@ -97,6 +97,7 @@

ext4_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;

err = ext4_mark_iloc_dirty(handle, inode, &iloc);
flags_err:
@@ -134,6 +135,7 @@
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err == 0) {
inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
inode->i_generation = generation;
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
}
Index: linux-2.6.19-rc2-mm2-CA/fs/ext4/namei.c
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/namei.c 2006-11-09
17:20:57.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/namei.c 2006-11-29 15:53:06.000000000 +0100
@@ -1274,6 +1274,7 @@
* and/or different from the directory change time.
*/
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ dir->i_change_attribute++;
ext4_update_dx_flag(dir);
dir->i_version++;
ext4_mark_inode_dirty(handle, dir);
@@ -2050,6 +2051,8 @@
inode->i_size = 0;
ext4_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
+ dir->i_change_attribute++;
ext4_mark_inode_dirty(handle, inode);
drop_nlink(dir);
ext4_update_dx_flag(dir);
@@ -2100,12 +2103,14 @@
if (retval)
goto end_unlink;
dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
+ dir->i_change_attribute++;
ext4_update_dx_flag(dir);
ext4_mark_inode_dirty(handle, dir);
drop_nlink(inode);
if (!inode->i_nlink)
ext4_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime;
+ inode->i_change_attribute++;
ext4_mark_inode_dirty(handle, inode);
retval = 0;

@@ -2191,6 +2196,7 @@
handle->h_sync = 1;

inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
ext4_inc_count(handle, inode);
atomic_inc(&inode->i_count);

@@ -2293,6 +2299,7 @@
* rename.
*/
old_inode->i_ctime = CURRENT_TIME_SEC;
+ old_inode->i_change_attribute++;
ext4_mark_inode_dirty(handle, old_inode);

/*
@@ -2326,8 +2333,10 @@
if (new_inode) {
drop_nlink(new_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
+ new_inode->i_change_attribute++;
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME_SEC;
+ old_dir->i_change_attribute++;
ext4_update_dx_flag(old_dir);
if (dir_bh) {
BUFFER_TRACE(dir_bh, "get_write_access");
Index: linux-2.6.19-rc2-mm2-CA/fs/ext4/super.c
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/super.c 2006-11-09
17:20:57.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/super.c 2006-11-29 15:57:02.000000000 +0100
@@ -2780,6 +2780,7 @@
}
inode->i_version++;
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ inode->i_change_attribute = 1;
ext4_mark_inode_dirty(handle, inode);
mutex_unlock(&inode->i_mutex);
return len - towrite;
Index: linux-2.6.19-rc2-mm2-CA/fs/ext4/xattr.c
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/fs/ext4/xattr.c 2006-11-09
17:20:57.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/fs/ext4/xattr.c 2006-11-29 15:57:50.000000000 +0100
@@ -1005,6 +1005,7 @@
if (!error) {
ext4_xattr_update_super_block(handle, inode->i_sb);
inode->i_ctime = CURRENT_TIME_SEC;
+ inode->i_change_attribute++;
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
/*
* The bh is consumed by ext4_mark_iloc_dirty, even with
Index: linux-2.6.19-rc2-mm2-CA/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.19-rc2-mm2-CA.orig/include/linux/ext4_fs.h 2006-11-09
17:21:48.000000000 +0100
+++ linux-2.6.19-rc2-mm2-CA/include/linux/ext4_fs.h 2006-11-29
16:15:01.000000000 +0100
@@ -312,7 +312,7 @@
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
- __u32 l_i_reserved2;
+ __u32 l_i_change_attribute;
} linux2;
struct {
__u8 h_i_frag; /* Fragment number */
@@ -344,7 +344,7 @@
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
#define i_gid_high osd2.linux2.l_i_gid_high
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_chattr osd2.linux2.l_i_change_attribute

#elif defined(__GNU__)


2006-12-06 21:49:36

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

On Nov 29, 2006 19:54 +0100, Jean-Noel Cordenner wrote:
> This part of the patch concerns the ext4 code.

I was looking more closely at this code, and wondering two things:
- why not just use the existing inode->i_version field instead of
adding a new i_change_attribute? The i_version is not used by
the VFS at all, and only for detecting directory modifications in
ext3 (where it has the same semantic as the new i_change_attribute
anyways). This avoids bloating the VFS inode more than it already is.
- why not just do an increment of i_version in ext3_do_update_inode()?
That is ext3_dirty_inode->ext3_mark_inode_dirty->ext3_mark_iloc_dirty()
and also handles all of the VFS locations that call notify_change().
This MUST be called anywhere that we make a persistent change to the
inode in order to flush it to disk. That would reduce the patch to
a few lines at most. I don't think there are any places we need to
supplement this (even mmap IO or writes to a hole will update mtime).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-12-13 17:59:53

by Cordenner jean noel

[permalink] [raw]
Subject: Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

Sorry for the late answer.
I agree that using i_version field sounds better, I'm working on it.

regards,
Jean noel

Andreas Dilger wrote:
> On Nov 29, 2006 19:54 +0100, Jean-Noel Cordenner wrote:
>> This part of the patch concerns the ext4 code.
>
> I was looking more closely at this code, and wondering two things:
> - why not just use the existing inode->i_version field instead of
> adding a new i_change_attribute? The i_version is not used by
> the VFS at all, and only for detecting directory modifications in
> ext3 (where it has the same semantic as the new i_change_attribute
> anyways). This avoids bloating the VFS inode more than it already is.
> - why not just do an increment of i_version in ext3_do_update_inode()?
> That is ext3_dirty_inode->ext3_mark_inode_dirty->ext3_mark_iloc_dirty()
> and also handles all of the VFS locations that call notify_change().
> This MUST be called anywhere that we make a persistent change to the
> inode in order to flush it to disk. That would reduce the patch to
> a few lines at most. I don't think there are any places we need to
> supplement this (even mmap IO or writes to a hole will update mtime).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2006-12-14 16:35:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

On Wed, Dec 13, 2006 at 06:31:50PM +0100, Cordenner jean noel wrote:

There was discussion on yesterday's call about whether or not 32-bit
was enough for NFSv4, or whether it also requried 64-bits of change
notification in the RFC's. So one of the questions is whether this is
something that would justify requiring 64-bits --- and if so, maybe we
need to require that big inodes be used and store the entire 64-bit
value beyond 128 bytes. This would mean that NFSv4 cache management
couldn't be fully implemented without big inodes, or we'd have to make
do by using the inode ctime as a partial substitute.

What do you think?

- Ted

2006-12-14 22:57:14

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

On Dec 14, 2006 11:03 -0500, Theodore Tso wrote:
> There was discussion on yesterday's call about whether or not 32-bit
> was enough for NFSv4, or whether it also requried 64-bits of change
> notification in the RFC's. So one of the questions is whether this is
> something that would justify requiring 64-bits --- and if so, maybe we
> need to require that big inodes be used and store the entire 64-bit
> value beyond 128 bytes. This would mean that NFSv4 cache management
> couldn't be fully implemented without big inodes, or we'd have to make
> do by using the inode ctime as a partial substitute.

Per Trond and Bruce Field's reply to my email it seems that NFSv4 only
needs the version to compare for inequality. If the change numbers are
sequential for a given inode it can OPTIONALLY extract additional
information about the server (i.e. it still has an up-to-date cache
because it was the only one that did an update on a given file).

So, I think for basic NFSv4 setups that 2^32 is sufficient (per Bull's
original patch) but 2^64 is desirable to avoid collisions and allow the
"sequential updates" logic to work properly for long-lived files.

So, I think a 32-bit field in the small inode, and an additional 32-bit
field in the large inode would be perfect. It allows this functionality
to work with existing ext3 filesystems, if not quite optimally.

In addition, for Lustre, could we get a 64-bit field in the superblock
which contains the fs-wide version number.

I'm proposing that, per the original Bull patch, l_i_reserved1 be changed
to be i_version for linux, and we add i_version_hi after cr_time_extra in
the large inode. The disk i_version would be stored in the vfs_inode
i_version (which is already used for this same purpose). It would be good
for NFSv4 if the i_version field could be expanded to 64 bits to avoid
the need for it to have fs-specific operations, but failing that we can
put the high word into ext4_inode_info and NFS can access it via
export_operations I think.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-12-15 10:36:39

by Cordenner jean noel

[permalink] [raw]
Subject: Re: [RFC] [patch 2/3] change attribute for ext4: ext4 specific code

Theodore Tso a ?crit :
> On Wed, Dec 13, 2006 at 06:31:50PM +0100, Cordenner jean noel wrote:
>
> There was discussion on yesterday's call about whether or not 32-bit
> was enough for NFSv4, or whether it also requried 64-bits of change
> notification in the RFC's. So one of the questions is whether this is
> something that would justify requiring 64-bits --- and if so, maybe we
> need to require that big inodes be used and store the entire 64-bit
> value beyond 128 bytes. This would mean that NFSv4 cache management
> couldn't be fully implemented without big inodes, or we'd have to make
> do by using the inode ctime as a partial substitute.
>
> What do you think?
>
> - Ted
>

Well it seems that NFSv4 RFC requires a 64-bits notification.
The interest of the change attribute is that it has a simple
implementation and doesn't seem to penalize the performance.
Using a 32bits counter and the ctime can give a resolution less than ns.

But it seems to me that finner timestamp could be usefull in the future.
As the ns patch also use a counter, I don't know if a common
implementation would avoid using big inode.

I agree with Andreas saying that the "change_attribute" can be stored in
the i_version field.

Jean noel