Hi,
This is an update of the i_version patch.
The i_version field is a 64bit counter that is set on every inode
creation and that is incremented every time the inode data is modified
(similarly to the "ctime" time-stamp).
The aim is to fulfill a NFSv4 requirement for rfc3530:
"5.5. Mandatory Attributes - Definitions
Name # DataType Access Description
___________________________________________________________________
change 3 uint64 READ A value created by the
server that the client can use to determine if file
data, directory contents or attributes of the object
have been modified. The servermay return the object's
time_metadata attribute for this attribute's value but
only if the filesystem object can not be updated more
frequently than the resolution of time_metadata.
"
The patch set is divided into 2parts, the vfs layer and a ext4 specific
code.
Any comments welcome.
Cheers,
Jean noel
On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
> Hi,
>
> This is an update of the i_version patch.
> The i_version field is a 64bit counter that is set on every inode
> creation and that is incremented every time the inode data is modified
> (similarly to the "ctime" time-stamp).
My understanding (please correct me if I'm wrong) is that the
requirements are much more rigourous than simply incrementing an in
memory counter on every change. i.e. the this counter has to
survive server crashes intact so clients never see the counter go
backwards. That means version number changes need to be journalled
along with the operation that caused the change of the version
number.
> The aim is to fulfill a NFSv4 requirement for rfc3530:
> "5.5. Mandatory Attributes - Definitions
> Name # DataType Access Description
> ___________________________________________________________________
> change 3 uint64 READ A value created by the
> server that the client can use to determine if file
> data, directory contents or attributes of the object
^^^^
File data writes are included in this list of things that need to
increment the version field. Hence to fulfill the crash requirement,
that implies server data writes either need to be synchronous or
journalled...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
> On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
> > Hi,
> >
> > This is an update of the i_version patch.
> > The i_version field is a 64bit counter that is set on every inode
> > creation and that is incremented every time the inode data is modified
> > (similarly to the "ctime" time-stamp).
>
> My understanding (please correct me if I'm wrong) is that the
> requirements are much more rigourous than simply incrementing an in
> memory counter on every change. i.e. the this counter has to
> survive server crashes intact so clients never see the counter go
> backwards. That means version number changes need to be journalled
> along with the operation that caused the change of the version
> number.
>
Yeah, the i_version is the in memeory counter. From the patch it looks
like the counter is being updated inside ext4_mark_iloc_dirty(), so it
is being journalled and being flush to on-disk ext4 inode structure
immediately (On-disk ext4 inode structure is being modified/expanded to
store the counter in the first patch).
>This patch is on top of i_version_update_vfs.
> The i_version field of the inode is set on inode creation and incremented when
> the inode is being modified.
>
> Signed-off-by: Jean Noel Cordenner <[email protected]>
>
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c
> ===================================================================
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 18:05:28.000000000 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c 2007-05-25 18:05:40.000000000 +0200
> @@ -565,6 +565,7 @@
> inode->i_blocks = 0;
> inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
> ext4_current_time(inode);
> + inode->i_version = 1;
>
> memset(ei->i_data, 0, sizeof(ei->i_data));
> ei->i_dir_start_lookup = 0;
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c 2007-05-25 18:05:28.000000000 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c 2007-05-25 18:05:40.000000000 +0200
> @@ -3082,6 +3082,7 @@
> {
> int err = 0;
>
> + inode->i_version++;
> /* the do_update_inode consumes one bh->b_count */
> get_bh(iloc->bh);
>
> Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c 2007-05-25 18:05:28.000000000 +0200
> +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c 2007-05-25 18:05:40.000000000 +0200
> @@ -2839,8 +2839,8 @@
> i_size_write(inode, off+len-towrite);
> EXT4_I(inode)->i_disksize = inode->i_size;
> }
> - inode->i_version++;
> inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> + inode->i_version = 1;
> ext4_mark_inode_dirty(handle, inode);
> mutex_unlock(&inode->i_mutex);
> return len - towrite;
I am not very clear about this part -- what is i_version being used for
with quota? And shall we reset the counter to 1 for ext4_quota_write()?
Mingming
On Wednesday May 30, [email protected] wrote:
> On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
>
> > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > "5.5. Mandatory Attributes - Definitions
> > Name # DataType Access Description
> > ___________________________________________________________________
> > change 3 uint64 READ A value created by the
> > server that the client can use to determine if file
> > data, directory contents or attributes of the object
> ^^^^
>
> File data writes are included in this list of things that need to
> increment the version field. Hence to fulfill the crash requirement,
> that implies server data writes either need to be synchronous or
> journalled...
I think that would be excessive.
The important property if the 'change' number is:
If the 'change' number is still the same, then the data and
metadata etc is still the same.
The converse of this (if the data+metadata is still then same then the
'change' is still the same) is valuable but less critical. Having
the 'change' change when it doesn't need to will defeat client-side
caching and so will reduce performance but not correctness.
So after a crash, I think it is perfectly acceptable to assign a
change number that is known to be different to anything previously
supplied if there is any doubt about recent change history.
e.g. suppose we had a filesystem with 1-second resolution mtime, and
an in-memory 'change' counter that was incremented on every change.
When we load an inode from storage, we initialise the counter to
-1: if the mtime is earlier than current_seconds
current_nanoseconds: if the mtime is equal to current_seconds.
We arrange that when the ctime changes, the change number is reset to 0.
Then when the 'change' number of an inode is required, we use the
bottom 32bits of the 'change' counter and the 32bits of the mtime.
This will provide a change number that normally changes only when the
file changes and doesn't require any extra storage on disk.
The change number will change inappropriately only when the inode has
fallen out of cache and is being reload, which is either after a crash
(hopefully rare) of when a file hasn't been used for a while, implying
that it is unlikely that any client has it in cache.
So in summary: I think it is impossible to have a change number that
changes *only* when content changes (as your 'crash' example suggests)
and it is quite achievable to have a change number that changes rarely
when the content doesn't change.
NeilBrown
On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote:
> On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
> > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
> > > Hi,
> > >
> > > This is an update of the i_version patch.
> > > The i_version field is a 64bit counter that is set on every inode
> > > creation and that is incremented every time the inode data is modified
> > > (similarly to the "ctime" time-stamp).
> >
> > My understanding (please correct me if I'm wrong) is that the
> > requirements are much more rigourous than simply incrementing an in
> > memory counter on every change. i.e. the this counter has to
> > survive server crashes intact so clients never see the counter go
> > backwards. That means version number changes need to be journalled
> > along with the operation that caused the change of the version
> > number.
> >
> Yeah, the i_version is the in memeory counter. From the patch it looks
> like the counter is being updated inside ext4_mark_iloc_dirty(), so it
> is being journalled and being flush to on-disk ext4 inode structure
> immediately (On-disk ext4 inode structure is being modified/expanded to
> store the counter in the first patch).
Ok, that catches most things (I missed that), but the version number still
needs to change on file data changes, right? So if we are overwriting the
file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't
get the callout and so the version number doesn't change or get logged. In
that case, the version number is not doing what it needs to do, right?
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, 2007-05-31 at 10:01 +1000, Neil Brown wrote:
> This will provide a change number that normally changes only when the
> file changes and doesn't require any extra storage on disk.
> The change number will change inappropriately only when the inode has
> fallen out of cache and is being reload, which is either after a crash
> (hopefully rare) of when a file hasn't been used for a while, implying
> that it is unlikely that any client has it in cache.
It will also change inappropriately if the server is under heavy load
and needs to reclaim memory by tossing out inodes that are cached and
still in use by the clients. That change will trigger clients to
invalidate their caches and to refetch the data from the server, further
cranking up the load.
Not an ideal solution...
Trond
On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote:
> On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote:
> > On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
> > > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
> > > > Hi,
> > > >
> > > > This is an update of the i_version patch.
> > > > The i_version field is a 64bit counter that is set on every inode
> > > > creation and that is incremented every time the inode data is modified
> > > > (similarly to the "ctime" time-stamp).
> > >
> > > My understanding (please correct me if I'm wrong) is that the
> > > requirements are much more rigourous than simply incrementing an in
> > > memory counter on every change. i.e. the this counter has to
> > > survive server crashes intact so clients never see the counter go
> > > backwards. That means version number changes need to be journalled
> > > along with the operation that caused the change of the version
> > > number.
> > >
> > Yeah, the i_version is the in memeory counter. From the patch it looks
> > like the counter is being updated inside ext4_mark_iloc_dirty(), so it
> > is being journalled and being flush to on-disk ext4 inode structure
> > immediately (On-disk ext4 inode structure is being modified/expanded to
> > store the counter in the first patch).
>
> Ok, that catches most things (I missed that), but the version number still
> needs to change on file data changes, right? So if we are overwriting the
> file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't
> get the callout and so the version number doesn't change or get logged. In
> that case, the version number is not doing what it needs to do, right?
>
Hmm, maybe I missed something... but looking at the code again, in the
case of overwrite (file date updated),it seems the ctime/mtime is being
updated and the inode is being dirtied, so the version number is being
updated.
vfs_write()->..
->__generic_file_aio_write_nolock()
->file_update_time()
->mark_inode_dirty_sync()
->__mark_inode_dirty(I_DIRTY_SYNC)
->ext4_dirty_inode()
->ext4_mark_inode_dirty()
Regards,
Mingming