2007-07-01 07:37:04

by Mingming Cao

[permalink] [raw]
Subject: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

This patch converts the 32-bit i_version in the generic inode to a 64-bit
i_version field.

Signed-off-by: Mingming Cao <[email protected]>
Signed-off-by: Jean Noel Cordenner <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>

Index: linux-2.6.21/include/linux/fs.h
===================================================================
--- linux-2.6.21.orig/include/linux/fs.h
+++ linux-2.6.21/include/linux/fs.h
@@ -549,7 +549,7 @@ struct inode {
uid_t i_uid;
gid_t i_gid;
dev_t i_rdev;
- unsigned long i_version;
+ u64 i_version;
loff_t i_size;
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;


2007-07-02 14:58:33

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

Trond or Bruce, can you please review these patch series and ack if you
agrees? Thanks.

As to performance concerns that raise before the inode version counter
(at least for ext4) is done inside ext4_mark_inode_dirty), so there is
no extra IO work to store this counter to disk.

Mingming
On Sun, 2007-07-01 at 03:37 -0400, Mingming Cao wrote:
> This patch converts the 32-bit i_version in the generic inode to a 64-bit
> i_version field.
>
> Signed-off-by: Mingming Cao <[email protected]>
> Signed-off-by: Jean Noel Cordenner <[email protected]>
> Signed-off-by: Kalpak Shah <[email protected]>
>
> Index: linux-2.6.21/include/linux/fs.h
> ===================================================================
> --- linux-2.6.21.orig/include/linux/fs.h
> +++ linux-2.6.21/include/linux/fs.h
> @@ -549,7 +549,7 @@ struct inode {
> uid_t i_uid;
> gid_t i_gid;
> dev_t i_rdev;
> - unsigned long i_version;
> + u64 i_version;
> loff_t i_size;
> #ifdef __NEED_I_SIZE_ORDERED
> seqcount_t i_size_seqcount;
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2007-07-03 14:29:07

by Trond Myklebust

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Mon, 2007-07-02 at 10:58 -0400, Mingming Cao wrote:
> Trond or Bruce, can you please review these patch series and ack if you
> agrees? Thanks.
>
> As to performance concerns that raise before the inode version counter
> (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> no extra IO work to store this counter to disk.

Hi Mingming,

It looks OK to me, but you might want to strip out the now redundant
i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

I also have some questions about how this will affect the readdir code:
unless I missed something, the filp->f_version is still unsigned long,
so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
no longer make sense.

Cheers
Trond

2007-07-03 21:56:45

by Andreas Dilger

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Jul 03, 2007 10:24 -0400, Trond Myklebust wrote:
> It looks OK to me, but you might want to strip out the now redundant
> i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

Agreed, and I thought we discussed that already on the ext4 list.

> I also have some questions about how this will affect the readdir code:
> unless I missed something, the filp->f_version is still unsigned long,
> so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
> no longer make sense.

I don't see them as any worse than existing checks. For 32-bit systems
we only ever had a 32-bit in-memory version anyway so using only the
low 32 bits of i_version in f_version is no more racy than in the past.
For 64-bit systems using the full on-disk i_version is possible.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-03 22:15:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Mon, Jul 02, 2007 at 10:58:33AM -0400, Mingming Cao wrote:
> Trond or Bruce, can you please review these patch series and ack if you
> agrees?

Thanks, looks like what we need!

How will nfsd tell whether it can really on a given filesystem's
i_version, or whether it should fall back on ctime?

> As to performance concerns that raise before the inode version counter
> (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> no extra IO work to store this counter to disk.

So what's the motivation for the "noversion" mount option?

--b.

2007-07-03 23:32:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Jul 03, 2007 18:15 -0400, J. Bruce Fields wrote:
> How will nfsd tell whether it can really on a given filesystem's
> i_version, or whether it should fall back on ctime?

Good question.

> > As to performance concerns that raise before the inode version counter
> > (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> > no extra IO work to store this counter to disk.
>
> So what's the motivation for the "noversion" mount option?

Lustre needs to be able to control the version number directly (version
number needs to be ordered between all inodes, is set by Lustre to be a
transaction number). Instead of trying to incorporate this unused code
into ext4 we just turn off the ext4 version code and let Lustre control
this directly. It may even be that NFSv4 will need to control the version
numbers itself...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-06 13:52:04

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, Jul 03, 2007 at 05:32:00PM -0600, Andreas Dilger wrote:
> On Jul 03, 2007 18:15 -0400, J. Bruce Fields wrote:
> > How will nfsd tell whether it can really on a given filesystem's
> > i_version, or whether it should fall back on ctime?
>
> Good question.

Well, we don't need anything particularly complicated--just a one-bit
flag on the superblock would be enough.

> > So what's the motivation for the "noversion" mount option?
>
> Lustre needs to be able to control the version number directly (version
> number needs to be ordered between all inodes, is set by Lustre to be a
> transaction number). Instead of trying to incorporate this unused code
> into ext4 we just turn off the ext4 version code and let Lustre control
> this directly. It may even be that NFSv4 will need to control the version
> numbers itself...

I can't think of any reason we would need to in the near future, but
maybe I'm insufficiently creative.

The use of a mount option means the change attribute could be
inconsistent across mounts. If we really need this, wouldn't it make
more sense for it to be a persistent feature of the filesystem, set at
mkfs time?

--b.

2007-07-06 22:53:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Jul 06, 2007 09:51 -0400, J. Bruce Fields wrote:
> The use of a mount option means the change attribute could be
> inconsistent across mounts. If we really need this, wouldn't it make
> more sense for it to be a persistent feature of the filesystem, set at
> mkfs time?

Yes, having it stored into the superblock in s_flags is probably a good
idea. Kalpak, do you think you could get a patch that adds e.g.
EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-09 21:16:30

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Fri, 2007-07-06 at 16:53 -0600, Andreas Dilger wrote:
> On Jul 06, 2007 09:51 -0400, J. Bruce Fields wrote:
> > The use of a mount option means the change attribute could be
> > inconsistent across mounts. If we really need this, wouldn't it make
> > more sense for it to be a persistent feature of the filesystem, set at
> > mkfs time?
>
> Yes, having it stored into the superblock in s_flags is probably a good
> idea. Kalpak, do you think you could get a patch that adds e.g.
> EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs).
>
Per our ext4 interlock discussion today, I have dropped the
ext4-no-inode version-mount-option patch from ext4 patch queue.

Thanks,
Mingming


2007-07-10 23:30:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Sun, 01 Jul 2007 03:37:04 -0400
Mingming Cao <[email protected]> wrote:

> This patch converts the 32-bit i_version in the generic inode to a 64-bit
> i_version field.
>

That's obvious from the patch. But what was the reason for making this
(unrelated to ext4) change?

Please update the changelog for this.

> Index: linux-2.6.21/include/linux/fs.h
> ===================================================================
> --- linux-2.6.21.orig/include/linux/fs.h
> +++ linux-2.6.21/include/linux/fs.h
> @@ -549,7 +549,7 @@ struct inode {
> uid_t i_uid;
> gid_t i_gid;
> dev_t i_rdev;
> - unsigned long i_version;
> + u64 i_version;
> loff_t i_size;
> #ifdef __NEED_I_SIZE_ORDERED
> seqcount_t i_size_seqcount;


2007-07-10 22:09:40

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:37:04 -0400
> Mingming Cao <[email protected]> wrote:
>
> > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > i_version field.
> >
>
> That's obvious from the patch. But what was the reason for making this
> (unrelated to ext4) change?
>

The need is came from NFSv4

On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> Hi,
>
> This is an update of the i_version patch.
> The i_version field is a 64bit counter that is set on every inode
> creation and that is incremented every time the inode data is modified
> (similarly to the "ctime" time-stamp).
> The aim is to fulfill a NFSv4 requirement for rfc3530:
> "5.5. Mandatory Attributes - Definitions
> Name # DataType Access Description
> ___________________________________________________________________
> change 3 uint64 READ A value created by the
> server that the client can use to determine if file
> data, directory contents or attributes of the object
> have been modified. The servermay return the object's
> time_metadata attribute for this attribute's value but
> only if the filesystem object can not be updated more
> frequently than the resolution of time_metadata.
> "
>

> Please update the changelog for this.
>

Is above description clear to you?


> > Index: linux-2.6.21/include/linux/fs.h
> > ===================================================================
> > --- linux-2.6.21.orig/include/linux/fs.h
> > +++ linux-2.6.21/include/linux/fs.h
> > @@ -549,7 +549,7 @@ struct inode {
> > uid_t i_uid;
> > gid_t i_gid;
> > dev_t i_rdev;
> > - unsigned long i_version;
> > + u64 i_version;
> > loff_t i_size;
> > #ifdef __NEED_I_SIZE_ORDERED
> > seqcount_t i_size_seqcount;
>

2007-07-11 01:22:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[email protected]> wrote:

> On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > On Sun, 01 Jul 2007 03:37:04 -0400
> > Mingming Cao <[email protected]> wrote:
> >
> > > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > > i_version field.
> > >
> >
> > That's obvious from the patch. But what was the reason for making this
> > (unrelated to ext4) change?
> >
>
> The need is came from NFSv4
>
> On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> > Hi,
> >
> > This is an update of the i_version patch.
> > The i_version field is a 64bit counter that is set on every inode
> > creation and that is incremented every time the inode data is modified
> > (similarly to the "ctime" time-stamp).
> > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > "5.5. Mandatory Attributes - Definitions
> > Name # DataType Access Description
> > ___________________________________________________________________
> > change 3 uint64 READ A value created by the
> > server that the client can use to determine if file
> > data, directory contents or attributes of the object
> > have been modified. The servermay return the object's
> > time_metadata attribute for this attribute's value but
> > only if the filesystem object can not be updated more
> > frequently than the resolution of time_metadata.
> > "
> >
>
> > Please update the changelog for this.
> >
>
> Is above description clear to you?
>

Yes, thanks. It doesn't actually tell us why we want to implement
this attribute and it doesn't tell us what the implications of failing
to do so are, but I guess we can take that on trust from the NFS guys.

But I suspect the ext4 implementation doesn't actually do this. afaict we
won't update i_version for file overwrites (especially if s_time_gran can
indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
would be the implications of this?

And how does the NFS server know that the filesystem implements i_version?
Will a zero-value of i_version have special significance, telling the
server to not send this attribute, perhaps?



2007-07-11 00:19:16

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[email protected]> wrote:
>
> > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > Mingming Cao <[email protected]> wrote:
> > >
> > > > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > > > i_version field.
> > > >
> > >
> > > That's obvious from the patch. But what was the reason for making this
> > > (unrelated to ext4) change?
> > >
> >
> > The need is came from NFSv4
> >
> > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> > > Hi,
> > >
> > > This is an update of the i_version patch.
> > > The i_version field is a 64bit counter that is set on every inode
> > > creation and that is incremented every time the inode data is modified
> > > (similarly to the "ctime" time-stamp).
> > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > "5.5. Mandatory Attributes - Definitions
> > > Name # DataType Access Description
> > > ___________________________________________________________________
> > > change 3 uint64 READ A value created by the
> > > server that the client can use to determine if file
> > > data, directory contents or attributes of the object
> > > have been modified. The servermay return the object's
> > > time_metadata attribute for this attribute's value but
> > > only if the filesystem object can not be updated more
> > > frequently than the resolution of time_metadata.
> > > "
> > >
> >
> > > Please update the changelog for this.
> > >
> >
> > Is above description clear to you?
> >
>
> Yes, thanks. It doesn't actually tell us why we want to implement
> this attribute and it doesn't tell us what the implications of failing
> to do so are, but I guess we can take that on trust from the NFS guys.
>
> But I suspect the ext4 implementation doesn't actually do this. afaict we
> won't update i_version for file overwrites (especially if s_time_gran can
> indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> would be the implications of this?
>

In the case of overwrite (file date updated), I assume the ctime/mtime
is being updated and the inode is being dirtied, so the version number
is being updated.

vfs_write()->..
->__generic_file_aio_write_nolock()
->file_update_time()
->mark_inode_dirty_sync()
->__mark_inode_dirty(I_DIRTY_SYNC)
->ext4_dirty_inode()
->ext4_mark_inode_dirty()

> And how does the NFS server know that the filesystem implements i_version?
> Will a zero-value of i_version have special significance, telling the
> server to not send this attribute, perhaps?

Bruce raised up this question a few days back when he reviewed this
patch, I think the solution is add a superblock flag for fs support
inode versioning, probably at VFS layer?

Mingming

2007-07-11 03:21:55

by NeilBrown

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tuesday July 10, [email protected] wrote:
>
> Yes, thanks. It doesn't actually tell us why we want to implement
> this attribute and it doesn't tell us what the implications of failing
> to do so are, but I guess we can take that on trust from the NFS guys.

You would like to think so, but remember NFSv4 was designed by a
committee :-)

The 'change' number is used for cache consistency, and as the spec
makes very strong statements about the 'change' number, it is very
hard (or impossible) to implement a server correctly without storing a
change number in stable storage (just one of my grips about V4).

>
> But I suspect the ext4 implementation doesn't actually do this. afaict we
> won't update i_version for file overwrites (especially if s_time_gran can
> indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> would be the implications of this?

The first part sounds like a bug - i_version should really be updated
by every call to ->commit_write (if that is still what it is called).

The MAP_SHARED thing is less obvious. I guess every time we notice
that the page might have been changed, we need to increment i_version.

>
> And how does the NFS server know that the filesystem implements i_version?
> Will a zero-value of i_version have special significance, telling the
> server to not send this attribute, perhaps?

That is a very important question. Zero probably makes sense, but
what ever it is needs to be agreed and documented.
And just by-the-way, the server doesn't really have the option of not
sending the attribute. If i_version isn't defined, it has to fake
something using mtime, and hope that is good enough.

Alternately we could mandate that i_version is always kept up-to-date
and if a filesystem doesn't have anything to load from storage, it
just sets it to the current time in nanoseconds.

That would mean that a client would need to flush it's cache whenever
the inode fell out of cache on the server, but I don't think we can
reliably do better than that.

I think I like that approach.

So my vote is to increment i_version in common code every time any
change is made to the file, and alloc_inode should initialise it to
current time, which might be changed by the filesystem before it calls
unlock_new_inode.
... but doesn't lustre want to control its i_version... so maybe not :-(

NeilBrown

2007-07-11 03:34:29

by Trond Myklebust

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> On Tuesday July 10, [email protected] wrote:
> >
> > Yes, thanks. It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
>
> You would like to think so, but remember NFSv4 was designed by a
> committee :-)
>
> The 'change' number is used for cache consistency, and as the spec
> makes very strong statements about the 'change' number, it is very
> hard (or impossible) to implement a server correctly without storing a
> change number in stable storage (just one of my grips about V4).

Well... It reflects a requirement that was just as present in the
caching models that we use for NFSv2/v3, but that we glossed over for
other reasons.
The real difference here is that v2/v3 caching model assumes that you
have sufficient resolution in the ctime/mtime to allow clients to detect
any changes to the file/directory contents, whereas NFSv4 allows you to
use an arbitrary variable (that may be the ctime, if it has sufficient
resolution) for the same purposes.

> > But I suspect the ext4 implementation doesn't actually do this. afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> > would be the implications of this?
>
> The first part sounds like a bug - i_version should really be updated
> by every call to ->commit_write (if that is still what it is called).
>
> The MAP_SHARED thing is less obvious. I guess every time we notice
> that the page might have been changed, we need to increment i_version.

You need to increment is any time that you detect remotely visible
changes.
IOW: any change that posix mandates should result in a ctime update,
must also result in an update of i_version. The only difference is that
i_version must not suffer from the time resolution issues that ctime
does.

> > And how does the NFS server know that the filesystem implements i_version?
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
>
> That is a very important question. Zero probably makes sense, but
> what ever it is needs to be agreed and documented.
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute. If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.
>
> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
>
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
>
> I think I like that approach.
>
> So my vote is to increment i_version in common code every time any
> change is made to the file, and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode.
> ... but doesn't lustre want to control its i_version... so maybe not :-(

If lustre wants to be exportable via pNFS (as Peter Braam has suggested
it should), then it had better be able to return a change attribute that
is compatible with the NFSv4.1 spec...

Trond

2007-07-11 04:22:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[email protected]> wrote:

> On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[email protected]> wrote:
> >
> > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > > > > i_version field.
> > > > >
> > > >
> > > > That's obvious from the patch. But what was the reason for making this
> > > > (unrelated to ext4) change?
> > > >
> > >
> > > The need is came from NFSv4
> > >
> > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> > > > Hi,
> > > >
> > > > This is an update of the i_version patch.
> > > > The i_version field is a 64bit counter that is set on every inode
> > > > creation and that is incremented every time the inode data is modified
> > > > (similarly to the "ctime" time-stamp).
> > > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > > "5.5. Mandatory Attributes - Definitions
> > > > Name # DataType Access Description
> > > > ___________________________________________________________________
> > > > change 3 uint64 READ A value created by the
> > > > server that the client can use to determine if file
> > > > data, directory contents or attributes of the object
> > > > have been modified. The servermay return the object's
> > > > time_metadata attribute for this attribute's value but
> > > > only if the filesystem object can not be updated more
> > > > frequently than the resolution of time_metadata.
> > > > "
> > > >
> > >
> > > > Please update the changelog for this.
> > > >
> > >
> > > Is above description clear to you?
> > >
> >
> > Yes, thanks. It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
> >
> > But I suspect the ext4 implementation doesn't actually do this. afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> > would be the implications of this?
> >
>
> In the case of overwrite (file date updated), I assume the ctime/mtime
> is being updated and the inode is being dirtied, so the version number
> is being updated.
>
> vfs_write()->..
> ->__generic_file_aio_write_nolock()
> ->file_update_time()
> ->mark_inode_dirty_sync()
> ->__mark_inode_dirty(I_DIRTY_SYNC)
> ->ext4_dirty_inode()
> ->ext4_mark_inode_dirty()

That assumes an mtime update for every write(). OK, so two writes in a
single nanosecond won't be happening. But in that case why is this code:

static inline struct timespec ext4_current_time(struct inode *inode)
{
return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
}

checking (s_time_gran < NSEC_PER_SEC) ??

Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS
server implementation: if we were to later decrease s_time_gran (as we
might do, for performance reasons), the NFS server implementation starts
reporting incorrect information.

> > And how does the NFS server know that the filesystem implements i_version?
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
>
> Bruce raised up this question a few days back when he reviewed this
> patch, I think the solution is add a superblock flag for fs support
> inode versioning, probably at VFS layer?

That would work.

2007-07-11 05:05:35

by NeilBrown

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version


It just occurred to me:

If i_version is 64bit, then knfsd would need to be careful when
reading it on a 32bit host. What are the locking rules?

Presumably it is only updated under i_mutex protection, but having to
get i_mutex to read it would seem a little heavy handed.

Should it use a seqlock like i_size?
Could we use the same seqlock that i_size uses, or would we need a
separate one?

NeilBrown

2007-07-11 05:09:09

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> On Tuesday July 10, [email protected] wrote:
> >
> > Yes, thanks. It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
>
> You would like to think so, but remember NFSv4 was designed by a
> committee :-)
>
> The 'change' number is used for cache consistency, and as the spec
> makes very strong statements about the 'change' number, it is very
> hard (or impossible) to implement a server correctly without storing a
> change number in stable storage (just one of my grips about V4).
>
> >
> > But I suspect the ext4 implementation doesn't actually do this. afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> > would be the implications of this?
>
> The first part sounds like a bug - i_version should really be updated
> by every call to ->commit_write (if that is still what it is called).
>
> The MAP_SHARED thing is less obvious. I guess every time we notice
> that the page might have been changed, we need to increment i_version.
>
> >
> > And how does the NFS server know that the filesystem implements i_version?
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
>
> That is a very important question. Zero probably makes sense, but
> what ever it is needs to be agreed and documented.
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute. If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.
>
> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
>
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
>
> I think I like that approach.
>
> So my vote is to increment i_version in common code every time any
> change is made to the file,

David Chinneer pointed that we need to journal the version number
updates together with the operations that causes the change of the inode
version number, in order to survive server crashes so clients won't see
the counter go backwards.

So increment i_version in fs code is probably the place to ensure the
inode version changes are stored to disk. It's seems update the ext4
inode version in every ext4_mark_inode_dirty() is the easiest way.

> and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode.
> ... but doesn't lustre want to control its i_version... so maybe not :-(
>
> NeilBrown

2007-07-11 05:17:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[email protected]> wrote:

> David Chinneer pointed that we need to journal the version number
> updates together with the operations that causes the change of the inode
> version number, in order to survive server crashes so clients won't see
> the counter go backwards.
>
> So increment i_version in fs code is probably the place to ensure the
> inode version changes are stored to disk. It's seems update the ext4
> inode version in every ext4_mark_inode_dirty() is the easiest way.

That still makes us dependent upon _something_ changing the inode. For
overwrites the only something is mtime.

If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
I don't think we do) then I guess we'll need new code in or around
file_update_time() to do this.

2007-07-11 05:22:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, 11 Jul 2007 15:05:27 +1000 Neil Brown <[email protected]> wrote:

>
> It just occurred to me:
>
> If i_version is 64bit, then knfsd would need to be careful when
> reading it on a 32bit host. What are the locking rules?
>
> Presumably it is only updated under i_mutex protection, but having to
> get i_mutex to read it would seem a little heavy handed.
>
> Should it use a seqlock like i_size?
> Could we use the same seqlock that i_size uses, or would we need a
> separate one?
>

seqlocks are a bit of a pain to use (we've had plenty of deadlocks on the
i_size one). We could reuse inode.i_lock for this modification. Its
mandate is "general purpose innermost lock to protect stuff in this inode".

2007-07-11 05:27:58

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 2007-07-10 at 21:22 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[email protected]> wrote:
>
> > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> > > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[email protected]> wrote:
> > >
> > > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > > > Mingming Cao <[email protected]> wrote:
> > > > >
> > > > > > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > > > > > i_version field.
> > > > > >
> > > > >
> > > > > That's obvious from the patch. But what was the reason for making this
> > > > > (unrelated to ext4) change?
> > > > >
> > > >
> > > > The need is came from NFSv4
> > > >
> > > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
> > > > > Hi,
> > > > >
> > > > > This is an update of the i_version patch.
> > > > > The i_version field is a 64bit counter that is set on every inode
> > > > > creation and that is incremented every time the inode data is modified
> > > > > (similarly to the "ctime" time-stamp).
> > > > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > > > "5.5. Mandatory Attributes - Definitions
> > > > > Name # DataType Access Description
> > > > > ___________________________________________________________________
> > > > > change 3 uint64 READ A value created by the
> > > > > server that the client can use to determine if file
> > > > > data, directory contents or attributes of the object
> > > > > have been modified. The servermay return the object's
> > > > > time_metadata attribute for this attribute's value but
> > > > > only if the filesystem object can not be updated more
> > > > > frequently than the resolution of time_metadata.
> > > > > "
> > > > >
> > > >
> > > > > Please update the changelog for this.
> > > > >
> > > >
> > > > Is above description clear to you?
> > > >
> > >
> > > Yes, thanks. It doesn't actually tell us why we want to implement
> > > this attribute and it doesn't tell us what the implications of failing
> > > to do so are, but I guess we can take that on trust from the NFS guys.
> > >
> > > But I suspect the ext4 implementation doesn't actually do this. afaict we
> > > won't update i_version for file overwrites (especially if s_time_gran can
> > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications. What
> > > would be the implications of this?
> > >
> >
> > In the case of overwrite (file date updated), I assume the ctime/mtime
> > is being updated and the inode is being dirtied, so the version number
> > is being updated.
> >
> > vfs_write()->..
> > ->__generic_file_aio_write_nolock()
> > ->file_update_time()
> > ->mark_inode_dirty_sync()
> > ->__mark_inode_dirty(I_DIRTY_SYNC)
> > ->ext4_dirty_inode()
> > ->ext4_mark_inode_dirty()
>
> That assumes an mtime update for every write(). OK, so two writes in a
> single nanosecond won't be happening. But in that case why is this code:
>
> static inline struct timespec ext4_current_time(struct inode *inode)
> {
> return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
> current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
> }
>
> checking (s_time_gran < NSEC_PER_SEC) ??
>

Ext4 can still load/read ext3 fs (which by default with 128 bytes old
inode size, means doens't have support nanosecond timestamps), so it's
not always gurantee nanosecond timestamps granularity.(it depends on the
size of the inode (>128 bytes), by default, a fresh ext4 increase inode
size to 256 bytes to have the room to store nanoseond timestamps, inode
versioning etc)

> Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS
> server implementation: if we were to later decrease s_time_gran (as we
> might do, for performance reasons), the NFS server implementation starts
> reporting incorrect information.
>

:( that is a problem...

> > > And how does the NFS server know that the filesystem implements i_version?
> > > Will a zero-value of i_version have special significance, telling the
> > > server to not send this attribute, perhaps?
> >
> > Bruce raised up this question a few days back when he reviewed this
> > patch, I think the solution is add a superblock flag for fs support
> > inode versioning, probably at VFS layer?
>
> That would work.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2007-07-11 03:18:50

by Mingming Cao

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[email protected]> wrote:
>
> > David Chinneer pointed that we need to journal the version number
> > updates together with the operations that causes the change of the inode
> > version number, in order to survive server crashes so clients won't see
> > the counter go backwards.
> >
> > So increment i_version in fs code is probably the place to ensure the
> > inode version changes are stored to disk. It's seems update the ext4
> > inode version in every ext4_mark_inode_dirty() is the easiest way.
>
> That still makes us dependent upon _something_ changing the inode. For
> overwrites the only something is mtime.
>
> If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
> I don't think we do) then I guess we'll need new code in or around
> file_update_time() to do this.

do you mean mark inode dirty all the times in file_update_time()? Not
sure about the overhead for ext3/4.

Mingming


2007-07-11 06:35:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao <[email protected]> wrote:

> On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
> > On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[email protected]> wrote:
> >
> > > David Chinneer pointed that we need to journal the version number
> > > updates together with the operations that causes the change of the inode
> > > version number, in order to survive server crashes so clients won't see
> > > the counter go backwards.
> > >
> > > So increment i_version in fs code is probably the place to ensure the
> > > inode version changes are stored to disk. It's seems update the ext4
> > > inode version in every ext4_mark_inode_dirty() is the easiest way.
> >
> > That still makes us dependent upon _something_ changing the inode. For
> > overwrites the only something is mtime.
> >
> > If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
> > I don't think we do) then I guess we'll need new code in or around
> > file_update_time() to do this.
>
> do you mean mark inode dirty all the times in file_update_time()? Not
> sure about the overhead for ext3/4.
>

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}

+ if (IS_I_VERSION_64(inode)) {
+ inode_inc_iversion(inode); /* Takes i_lock on 32-bit */
+ sync_it = 1;
+ }
+
if (sync_it)
mark_inode_dirty_sync(inode);
}
_

?

2007-07-11 11:41:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Jul 10, 2007 23:34 -0400, Trond Myklebust wrote:
> On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> > So my vote is to increment i_version in common code every time any
> > change is made to the file, and alloc_inode should initialise it to
> > current time, which might be changed by the filesystem before it calls
> > unlock_new_inode.
> > ... but doesn't lustre want to control its i_version... so maybe not :-(
>
> If lustre wants to be exportable via pNFS (as Peter Braam has suggested
> it should), then it had better be able to return a change attribute that
> is compatible with the NFSv4.1 spec...

The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre
version can be used to compare the updates of two inodes. It is set
to be the Lustre transaction number (which is sequentially incremented
on a per filesystem basis), so that we can use the per-inode version
to do replay of client operations even if they have been disconnected for
a long time, which is why we want to be able to control the actual value.
We don't want the version to be updated for e.g. file defragmentation
or other similar internal-only changes which need ext4_mark_inode_dirty().

We had a patch to disable ext4 inode versioning by a flag the superblock,
but we dropped it at the last minute because it needed some updates and
we didn't want to wait on that for submitting these changes upstream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-07-11 14:28:06

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
> It just occurred to me:
>
> If i_version is 64bit, then knfsd would need to be careful when
> reading it on a 32bit host. What are the locking rules?

How does knfsd use i_version? I would think that if all it was doing
was to compare (i_version == previous_version), then locking wouldn't
really matter. Well, theoretically, previous_version could be
0x100000000, and i_version could be 0x1ffffffff, knfsd checks the high
word, then ext4 updates i_version to 0x200000000, then knfsd checks the
low word, detecting no change. How likely is this? (I don't understand
why i_version even needs to be 64 bits in the first place.)

> Presumably it is only updated under i_mutex protection, but having to
> get i_mutex to read it would seem a little heavy handed.

How does knfsd protect itself from the inode changing after i_version is
checked? Is any locking being done otherwise?

> Should it use a seqlock like i_size?
> Could we use the same seqlock that i_size uses, or would we need a
> separate one?
>
> NeilBrown

--
David Kleikamp
IBM Linux Technology Center

2007-07-11 16:57:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Tue, Jul 10, 2007 at 08:19:16PM -0400, Mingming Cao wrote:
> On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> > And how does the NFS server know that the filesystem implements i_version?
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
>
> Bruce raised up this question a few days back when he reviewed this
> patch, I think the solution is add a superblock flag for fs support
> inode versioning, probably at VFS layer?

Sounds fine. As long as it's something that's standard across
filesystems, then I can just do something like

if (sb->s_flags & MS_CHANGEATTR)
/* return i_version to client as change attribute */
else
/* return ctime to client as change attribute */

--b.

2007-07-11 17:26:23

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, Jul 11, 2007 at 01:21:55PM +1000, Neil Brown wrote:
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute. If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.

ctime, actually--the change attribute is also supposed to be updated on
attribute updates.

> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
>
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
>
> I think I like that approach.
>
> So my vote is to increment i_version in common code every time any
> change is made to the file, and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode.

So the client would be invalidating its cache more often than necessary,
rather than failing to invalidate it when it should. I agree that
that's probably the better tradeoff, although I wish I had a better idea
of the downside. I don't know, for example, whether users might see
unpleasant results if every client has to reread its cached data on a
reboot.

The currently proposed change--just providing a model change attribute
implementation for ext4 and leaving other filesystems untouched--is a
more conservative step.

So I'm inclined to just do this ext4 thing first, and then look into
further change attribute experiments next time around....

--b.

2007-07-11 20:04:30

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Wed, Jul 11, 2007 at 09:28:06AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-07-11 at 15:05 +1000, Neil Brown wrote:
> > It just occurred to me:
> >
> > If i_version is 64bit, then knfsd would need to be careful when
> > reading it on a 32bit host. What are the locking rules?
>
> How does knfsd use i_version? I would think that if all it was doing
> was to compare (i_version == previous_version)

That's correct. (Though it's the client that's doing the comparison,
actually--the server is just reporting the value.)

> then locking wouldn't really matter. Well, theoretically,
> previous_version could be 0x100000000, and i_version could be
> 0x1ffffffff, knfsd checks the high word, then ext4 updates i_version
> to 0x200000000, then knfsd checks the low word, detecting no change.
> How likely is this?

The choice of upper word in your example is arbitrary, but other than
that I believe your example is essentially the only one. So this would
only happen when *both*

- the read of the new value of the low word happens precisely
2^32 i_version updates after the word was read on the client's
previous cache revalidation, and
- the value of i_version itself is close enough to a 32-bit
boundary that wraparound can happen between the reads of the
high and low words.

> (I don't understand why i_version even needs to be 64 bits in the
> first place.)

A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
That's not a problem in itself--the problem would only arise if two
subsequent client queries of the change attribute happened a multiple of
2^32 i_version increments apart.

This is more likely than the previous scenario, but still very unlikely.
I would have guessed that even in situations with a very high rate of
updates and a low rate of client revalidations, the chance of two
revalidations happening exactly 2^32 updates apart would still be no
more than 1 in 2^32. (Could odd characteristics of the workloads (like
updates that tend to happen in power-of-2 groups?) make it any more
likely?)

I'd be happier if ext4 at least allowed the possibility of 64 bits in
the future. And there's always the chance someone would find a use for
an i_version that was nondecreasing, even if nfs didn't care.

> > Presumably it is only updated under i_mutex protection, but having to
> > get i_mutex to read it would seem a little heavy handed.
>
> How does knfsd protect itself from the inode changing after i_version is
> checked? Is any locking being done otherwise?

If the client always requests the change attribute before reading, and
the i_version is always updated after data is modified, I think we're
OK. Admittedly this is a little subtle.

--b.

2007-07-12 04:56:35

by Andreas Dilger

[permalink] [raw]
Subject: Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

On Jul 11, 2007 16:04 -0400, J. Bruce Fields wrote:
> A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
> That's not a problem in itself--the problem would only arise if two
> subsequent client queries of the change attribute happened a multiple of
> 2^32 i_version increments apart.
>
> This is more likely than the previous scenario, but still very unlikely.
> I would have guessed that even in situations with a very high rate of
> updates and a low rate of client revalidations, the chance of two
> revalidations happening exactly 2^32 updates apart would still be no
> more than 1 in 2^32. (Could odd characteristics of the workloads (like
> updates that tend to happen in power-of-2 groups?) make it any more
> likely?)
>
> I'd be happier if ext4 at least allowed the possibility of 64 bits in
> the future. And there's always the chance someone would find a use for
> an i_version that was nondecreasing, even if nfs didn't care.

This would indeed be the case for ext3 filesystems updated to ext4.
They will only be able to store the low 32 bits of the version, which
is itself normally enough for NFSv4 because it only uses the inequality
check. Having the full 64 bits available eliminates the risk of
collisions, and given that the spec mandates a 64-bit version I'm sure
someone will take full advantage of it in NFS at some point.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.