LinuxLists.cc - rfc: [patch] change attribute for ext3

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> hello,
>
> here is a small patch that adds the "change attribute" for ext3
> file-systems;
>
> the change attribute is a simple counter that is reset to zero on
> inode creation and that is incremented every time the inode data is
> modified (similarly to the "ctime" time-stamp).

I would really have preferred a full-blown 64-bit counter as per
RFC3530, but I suppose we could always combine this change attribute
with the high word from ctime in order to make up the NFSv4 change
attribute. That should keep us safe until someone develops a ramdisk
with < 1 nsecond access time.

Cheers,
Trond

2006-09-13 18:30:25

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > hello,
> >
> > here is a small patch that adds the "change attribute" for ext3
> > file-systems;
> >
> > the change attribute is a simple counter that is reset to zero on
> > inode creation and that is incremented every time the inode data is
> > modified (similarly to the "ctime" time-stamp).
>
> I would really have preferred a full-blown 64-bit counter as per
> RFC3530, but I suppose we could always combine this change attribute
> with the high word from ctime in order to make up the NFSv4 change
> attribute. That should keep us safe until someone develops a ramdisk
> with < 1 nsecond access time.
>

do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
would allow 2^32 inode changes per second.
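
As a minimal sketch of that combination (the helper name make_change_attr
and the truncation of the seconds value to 32 bits are assumptions for
illustration, not part of the patch), the 64-bit value could be built like
this:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: build a 64-bit
 * NFSv4-style change attribute from the ctime seconds (high word)
 * and a 32-bit per-inode change counter (low word).
 */
static uint64_t make_change_attr(uint32_t ctime_sec, uint32_t counter)
{
    /* Widen before shifting: shifting a 32-bit value by 32 is undefined. */
    return ((uint64_t)ctime_sec << 32) | counter;
}
```

With this layout, any change made in a later second compares greater than
every change made in an earlier one, provided the clock does not go
backward.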

For ext3 it's hard to find unused bits in the on-disk inode structure, but
ext4 inode may become larger in the future, allowing a 64bit counter.

-- Alexandre

2006-09-13 19:06:13

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, 2006-09-13 at 20:30 +0200, Alexandre Ratchov wrote:
> On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > hello,
> > >
> > > here is a small patch that adds the "change attribute" for ext3
> > > file-systems;
> > >
> > > the change attribute is a simple counter that is reset to zero on
> > > inode creation and that is incremented every time the inode data is
> > > modified (similarly to the "ctime" time-stamp).
> >
> > I would really have preferred a full-blown 64-bit counter as per
> > RFC3530, but I suppose we could always combine this change attribute
> > with the high word from ctime in order to make up the NFSv4 change
> > attribute. That should keep us safe until someone develops a ramdisk
> > with < 1 nsecond access time.
> >
>
> do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> would allow 2^32 inode changes per second.

Yes. As I said, that probably ought to suffice for now.

> For ext3 it's hard to find unused bits in the on-disk inode structure, but
> ext4 inode may become larger in the future, allowing a 64bit counter.

In anticipation of that event, could you please change the field in
'struct stat' and 'struct stat64' to be an 'unsigned long long' instead
of the current 'unsigned long'?

All the other fields are internal to the kernel, so a future change of
their size should not matter.

Cheers,
Trond

2006-09-13 19:31:38

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote:
> the change attribute is a simple counter that is reset to zero on
> inode creation and that is incremented every time the inode data is
> modified (similarly to the "ctime" time-stamp).

To start, I'm supportive of this concept, my comments are only to
get the most efficient implementation.

This appears to be very similar to the i_version field that is already
in the inode (which is also modified only by ext3), so instead of
increasing the size of the inode further we could use that field and
just make it persistent on disk. The i_version field is incremented
in mkdir/rmdir/create/rename. For Lustre it would also be desirable
to modify the version field for regular files when they are modified
(e.g. setattr, write), and it appears NFS v4 also wants the same
(assumed from use of file_update_time()). The question is whether
this should be handled internal to the filesystem (as ext3 does now)
or if it should be part of the VFS.

> The patch also adds a new ``st_change_attribute'' field in the stat
> structure, and modifies the stat(2) syscall accordingly. Currently the
> change is only visible on i386 and x86_64 archs.

Is this really necessary for knfsd?

> @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
> inode->i_blocks = 0;
> inode->i_atime = inode->i_mtime = inode->i_ctime =
> current_fs_time(inode->i_sb);
> + inode->i_change_attribute = 0;

Initializing to zero is more dangerous than any non-zero number,
since this is the most likely outcome of corruption... The current
ext3 code initializes i_version to a random number, and we can use
comparisons similar to jiffies as long as we don't expect > 2^31
changes between comparisons.
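
A wraparound-tolerant comparison in the style of the jiffies time_after()
macro might look like the sketch below. The function name version_after is
hypothetical, and the cast relies on two's-complement conversion, which is
what the kernel macros assume as well:

```c
#include <stdint.h>

/*
 * Hypothetical helper: nonzero if version a is newer than version b.
 * The unsigned subtraction wraps modulo 2^32, so the result is correct
 * across counter wraparound as long as fewer than 2^31 changes separate
 * the two values.
 */
static int version_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

Because only the difference matters, the starting value of the counter
(zero or random) does not affect the comparison.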

> +++ fs/inode.c 13 Sep 2006 18:15:43 -0000 1.1.1.3.2.1
> @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
> return;
>
> now = current_fs_time(inode->i_sb);
> - if (!timespec_equal(&inode->i_mtime, &now))
> - sync_it = 1;
> inode->i_mtime = now;
> -
> - if (!timespec_equal(&inode->i_ctime, &now))
> - sync_it = 1;
> inode->i_ctime = now;
> -
> - if (sync_it)
> - mark_inode_dirty_sync(inode);
> + inode->i_change_attribute++;
> + mark_inode_dirty_sync(inode);

Ugh, this would definitely hurt performance, because ext3_dirty_inode()
packs-for-disk the whole inode each time. I believe Stephen had patches
at one time to do the inode packing at transaction commit time instead
of when it is changed, so we only do the packing once. Having a generic
mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
inode to the buffer) would also be useful for other things like doing
the inode or group descriptor checksum only once per transaction...

> Index: include/linux/ext3_fs.h
> ===================================================================
> RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v
> retrieving revision 1.1.1.3
> retrieving revision 1.1.1.3.2.1
> diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
> --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 -0000 1.1.1.3
> +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 -0000 1.1.1.3.2.1
> @@ -286,7 +286,7 @@ struct ext3_inode {
> __u16 i_pad1;
> __le16 l_i_uid_high; /* these 2 fields */
> __le16 l_i_gid_high; /* were reserved2[0] */
> - __u32 l_i_reserved2;
> + __le32 l_i_change_attribute;

There was some other use of the reserved fields for ext4 64-bit-blocks
support. One was for i_file_acl_hi (I think this is using the i_pad1
above), one was for i_blocks_hi (I believe this was proposed to use the
i_frag and i_fsize bytes). Is this conflicting with anything else?
There were a lot of proposals for these fields, and I don't recall which
ones are still out there.

> @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb
> +#define ATTR_CHANGE_ATTRIBUTE 16384

Do you need a setattr interface for this, or is it sufficient to use
the i_version field from the inode, and let the filesystem manage
i_version updates as it is doing now?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-09-13 20:30:25

by Randy Dunlap

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, 13 Sep 2006 15:06:02 -0400 Trond Myklebust wrote:

> On Wed, 2006-09-13 at 20:30 +0200, Alexandre Ratchov wrote:
> > On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > > hello,
> > > >
> > > > here is a small patch that adds the "change attribute" for ext3
> > > > file-systems;
> > > >
> > > > the change attribute is a simple counter that is reset to zero on
> > > > inode creation and that is incremented every time the inode data is
> > > > modified (similarly to the "ctime" time-stamp).
> > >
> > > I would really have preferred a full-blown 64-bit counter as per
> > > RFC3530, but I suppose we could always combine this change attribute
> > > with the high word from ctime in order to make up the NFSv4 change
> > > attribute. That should keep us safe until someone develops a ramdisk
> > > with < 1 nsecond access time.
> > >
> >
> > do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> > would allow 2^32 inode changes per second.
>
> Yes. As I said, that probably ought to suffice for now.
>
> > For ext3 it's hard to find unused bits in the on-disk inode structure, but
> > ext4 inode may become larger in the future, allowing a 64bit counter.
>
> In anticipation of that event, could you please change the field in
> 'struct stat' and 'struct stat64' to be an 'unsigned long long' instead
> of the current 'unsigned long'?
>
> All the other fields are internal to the kernel, so a future change of
> their size should not matter.

and while you are making changes + resubmitting,
Signed-off-by:
needs to have name + email address.

---
~Randy

2006-09-14 01:24:52

by J. Bruce Fields

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, Sep 13, 2006 at 01:31:30PM -0600, Andreas Dilger wrote:
> On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote:
> > The patch also adds a new ``st_change_attribute'' field in the stat
> > structure, and modifies the stat(2) syscall accordingly. Currently the
> > change is only visible on i386 and x86_64 archs.
>
> Is this really necessary for knfsd?

Of course knfsd is completely in kernel, so it doesn't care about the
userspace interface.

But I think that a change attribute is potentially an *extremely* useful
thing, and for more than just nfs servers. Lots of userspace programs
also need to know whether a file has changed since they last examined
it, and also suffer from the limitations of using ctime or mtime as an
imperfect approximation to a real change attribute.

But it would make sense to split the user space changes into a second
patch and possibly apply it later.

--b.

2006-09-14 09:23:26

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 13, 2006 20:30 +0200, Alexandre Ratchov wrote:
> On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > the change attribute is a simple counter that is reset to zero on
> > > inode creation and that is incremented every time the inode data is
> > > modified (similarly to the "ctime" time-stamp).
> >
> > I would really have preferred a full-blown 64-bit counter as per
> > RFC3530, but I suppose we could always combine this change attribute
> > with the high word from ctime in order to make up the NFSv4 change
> > attribute. That should keep us safe until someone develops a ramdisk
> > with < 1 nsecond access time.
>
> do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> would allow 2^32 inode changes per second.

It might be preferable, since we are depending on the ctime here anyway,
to combine this with the nsec-resolution ctime, and kill two birds with
one field in the inode.

The implementation would be to update the ctime+nsec field as normal, but
in the unlikely case that the new second+nsec ctime is the same as before,
the nsec value would be incremented by 1. This could happen in case of
low-resolution kernel timers, and would also handle the future case where
the inode is modified more than once in the same nanosecond.
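
A userspace sketch of that update rule, assuming a hypothetical helper
next_ctime() that takes the inode's previous ctime and the current clock
reading (neither the name nor the signature comes from any posted patch):

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: return the ctime to store for a new change.
 * If the clock reading differs from the stored stamp, use it as is;
 * if it is identical, bump the stored stamp by one nanosecond so the
 * change is still visible.
 */
static struct timespec next_ctime(struct timespec old, struct timespec now)
{
    if (now.tv_sec != old.tv_sec || now.tv_nsec != old.tv_nsec)
        return now;                 /* clock moved: normal update */

    now.tv_nsec++;                  /* same stamp: force a difference */
    if (now.tv_nsec >= NSEC_PER_SEC) {
        now.tv_nsec = 0;            /* carry into the seconds field */
        now.tv_sec++;
    }
    return now;
}
```

This only covers the "same stamp" case described above; it does nothing
special if the clock reading is older than the stored ctime.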

The other benefit is that it allows comparisons between two different
inodes to be more meaningful, instead of just using the seconds + random
version number.

It would be possible/desirable to make the nsec ctime field be part of the
small inode (using the proposed reserved field) instead of the large inode,
since that is a requirement for working with existing ext3 filesystems. The
previous nsec timestamp patch would only need trivial modifications to make
this work, just #define i_ctime_extra to be l_i_reserved1 I believe.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-09-14 13:22:04

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, Sep 13, 2006 at 01:31:30PM -0600, Andreas Dilger wrote:
> On Sep 13, 2006 18:42 +0200, Alexandre Ratchov wrote:
> > the change attribute is a simple counter that is reset to zero on
> > inode creation and that is incremented every time the inode data is
> > modified (similarly to the "ctime" time-stamp).
>
> To start, I'm supportive of this concept, my comments are only to
> get the most efficient implementation.
>
> This appears to be very similar to the i_version field that is already
> in the inode (which is also modified only by ext3), so instead of
> increasing the size of the inode further we could use that field and
> just make it persistent on disk. The i_version field is incremented
> in mkdir/rmdir/create/rename. For Lustre it would also be desirable
> to modify the version field for regular files when they are modified
> (e.g. setattr, write), and it appears NFS v4 also wants the same
> (assumed from use of file_update_time()). The question is whether
> this should be handled internal to the filesystem (as ext3 does now)
> or if it should be part of the VFS.

hmm..., i_version is currently used for directory entries validation; i've
browsed the ext{2,3,4} sources and i don't see any drawbacks in merging
i_version and i_chattr.

IMHO, the natural place to do this stuff is the VFS, because it can be
common to all file-systems supporting this feature. Currently it's the same
with ctime, mtime and atime. These are in the VFS even if there are
file-systems that don't support all of them.

> > The patch also adds a new ``st_change_attribute'' field in the stat
> > structure, and modifies the stat(2) syscall accordingly. Currently the
> > change is only visible on i386 and x86_64 archs.
>
> Is this really necessary for knfsd?
>
> > @@ -511,6 +511,7 @@ static struct inode *bm_get_inode(struct
> > inode->i_blocks = 0;
> > inode->i_atime = inode->i_mtime = inode->i_ctime =
> > current_fs_time(inode->i_sb);
> > + inode->i_change_attribute = 0;
>
> Initializing to zero is more dangerous than any non-zero number,
> since this is the most likely outcome of corruption... The current
> ext3 code initializes i_version to a random number, and we can use
> comparisons similar to jiffies as long as we don't expect > 2^31
> changes between comparisons.
>

it's ok for me;

> > +++ fs/inode.c 13 Sep 2006 18:15:43 -0000 1.1.1.3.2.1
> > @@ -1232,16 +1232,10 @@ void file_update_time(struct file *file)
> > return;
> >
> > now = current_fs_time(inode->i_sb);
> > - if (!timespec_equal(&inode->i_mtime, &now))
> > - sync_it = 1;
> > inode->i_mtime = now;
> > -
> > - if (!timespec_equal(&inode->i_ctime, &now))
> > - sync_it = 1;
> > inode->i_ctime = now;
> > -
> > - if (sync_it)
> > - mark_inode_dirty_sync(inode);
> > + inode->i_change_attribute++;
> > + mark_inode_dirty_sync(inode);
>
> Ugh, this would definitely hurt performance, because ext3_dirty_inode()
> packs-for-disk the whole inode each time. I believe Stephen had patches
> at one time to do the inode packing at transaction commit time instead
> of when it is changed, so we only do the packing once. Having a generic
> mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
> inode to the buffer) would also be useful for other things like doing
> the inode or group descriptor checksum only once per transaction...
>

yes, that part is ugly. I've thought about another solution, but i don't
know if this would work:

afaik, for an open file, there is always a copy of the inode in memory,
because there is a reference to it in the file structure. So, in principle,
we shouldn't need to mark the inode dirty. I don't know if this is feasible;
perhaps I am missing something here.

> > Index: include/linux/ext3_fs.h
> > ===================================================================
> > RCS file: /home/ratchova/cvs/linux/include/linux/ext3_fs.h,v
> > retrieving revision 1.1.1.3
> > retrieving revision 1.1.1.3.2.1
> > diff -u -p -r1.1.1.3 -r1.1.1.3.2.1
> > --- include/linux/ext3_fs.h 13 Sep 2006 17:45:20 -0000 1.1.1.3
> > +++ include/linux/ext3_fs.h 13 Sep 2006 18:15:43 -0000 1.1.1.3.2.1
> > @@ -286,7 +286,7 @@ struct ext3_inode {
> > __u16 i_pad1;
> > __le16 l_i_uid_high; /* these 2 fields */
> > __le16 l_i_gid_high; /* were reserved2[0] */
> > - __u32 l_i_reserved2;
> > + __le32 l_i_change_attribute;
>
> There was some other use of the reserved fields for ext4 64-bit-blocks
> support. One was for i_file_acl_hi (I think this is using the i_pad1
> above), one was for i_blocks_hi (I believe this was proposed to use the
> i_frag and i_fsize bytes). Is this conflicting with anything else?
> There were a lot of proposals for these fields, and I don't recall which
> ones are still out there.

i haven't noticed any conflicts here.

>
> > @@ -280,6 +280,7 @@ typedef void (dio_iodone_t)(struct kiocb
> > +#define ATTR_CHANGE_ATTRIBUTE 16384
>
> Do you need a setattr interface for this, or is it sufficient to use
> the i_version field from the inode, and let the filesystem manage
> i_version updates as it is doing now?

it's not strictly necessary; it's no more necessary than the interface to
ctime or other attributes. It's here for completeness; in my opinion the
change attribute is the same as the ctime time-stamp.

thanks for your comments

-- Alexandre

2006-09-14 13:46:03

by Peter Staubach

Subject: Re: rfc: [patch] change attribute for ext3

Trond Myklebust wrote:
> On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
>
>> hello,
>>
>> here is a small patch that adds the "change attribute" for ext3
>> file-systems;
>>
>> the change attribute is a simple counter that is reset to zero on
>> inode creation and that is incremented every time the inode data is
>> modified (similarly to the "ctime" time-stamp).
>>
>
> I would really have preferred a full-blown 64-bit counter as per
> RFC3530, but I suppose we could always combine this change attribute
> with the high word from ctime in order to make up the NFSv4 change
> attribute. That should keep us safe until someone develops a ramdisk
> with < 1 nsecond access time.

Wouldn't the generation count work better than ctime to differentiate
between instances of files using the same inode number? That way, there
wouldn't be a clock resolution issue.

Thanx...

ps

2006-09-14 13:48:31

Subject: Re: rfc: [patch] change attribute for ext3

On Thu, Sep 14, 2006 at 03:23:18AM -0600, Andreas Dilger wrote:
> On Sep 13, 2006 20:30 +0200, Alexandre Ratchov wrote:
> > On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > > On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > > > the change attribute is a simple counter that is reset to zero on
> > > > inode creation and that is incremented every time the inode data is
> > > > modified (similarly to the "ctime" time-stamp).
> > >
> > > I would really have preferred a full-blown 64-bit counter as per
> > > RFC3530, but I suppose we could always combine this change attribute
> > > with the high word from ctime in order to make up the NFSv4 change
> > > attribute. That should keep us safe until someone develops a ramdisk
> > > with < 1 nsecond access time.
> >
> > do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> > would allow 2^32 inode changes per second.
>
> It might be preferable, since we are depending on the ctime here anyway,
> to combine this with the nsec-resolution ctime, and kill two birds with
> one field in the inode.
>
> The implementation would be to update the ctime+nsec field as normal, but
> in the unlikely case that the new second+nsec ctime is the same as before,
> the nsec value would be incremented by 1. This could happen in case of
> low-resolution kernel timers, and would also handle the future case where
> the inode is modified more than once in the same nanosecond.
>
> The other benefit is that it allows comparisons between two different
> inodes to be more meaningful, instead of just using the seconds + random
> version number.
>
> It would be possible/desirable to make the nsec ctime field be part of the
> small inode (using the proposed reserved field) instead of the large inode,
> since that is a requirement for working with existing ext3 filesystems. The
> previous nsec timestamp patch would only need trivial modifications to make
> this work, just #define i_ctime_extra to be l_i_reserved1 I believe.
>

there is something i dislike with incrementing the nsec value. The ctime is
a global (as opposed to per-inode) time reference for the file-system. And
it is expected to be globally coherent; imagine the following situation:

Within the same time-slice (with time-stamp T0, in nanoseconds), we do the
following in this order:

change file1 -> ctime = T0
change file2 -> ctime = T0
change file2 -> ctime = T0 + 1
change file2 -> ctime = T0 + 2
change file1 -> ctime = T0 + 1

so it appears that file2 is strictly newer than file1, which is false. So
the assumption "if ctime(file1) < ctime(file2) then file2 is newer than
file1" is no longer true.

In order to fix this, we'll need to increment a global counter, not a
per-inode counter. It's feasible.
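
The scenario can be reproduced with a small userspace sketch. All names
here (file_stamp, touch_per_inode, touch_global, last_issued_ns) are
hypothetical, timestamps are plain nanosecond counts, and a real
implementation of the global variant would need locking or atomics:

```c
#include <stdint.h>

/* Timestamps as plain nanosecond counts, for illustration only. */
struct file_stamp {
    uint64_t ctime_ns;
};

/*
 * Per-inode rule: if the clock has not advanced past this inode's own
 * stamp, bump the stamp by 1 ns. Two inodes can then drift apart.
 */
static void touch_per_inode(struct file_stamp *f, uint64_t now_ns)
{
    f->ctime_ns = (now_ns > f->ctime_ns) ? now_ns : f->ctime_ns + 1;
}

/* Last stamp handed out to any inode (assumed file-system wide). */
static uint64_t last_issued_ns;

/*
 * Global rule: every stamp is strictly greater than the previous one
 * issued to any inode, so ordering between files is preserved.
 */
static void touch_global(struct file_stamp *f, uint64_t now_ns)
{
    last_issued_ns = (now_ns > last_issued_ns) ? now_ns : last_issued_ns + 1;
    f->ctime_ns = last_issued_ns;
}
```

Replaying the five changes above with a fixed clock reading T0,
touch_per_inode() leaves file1 at T0 + 1 and file2 at T0 + 2 even though
file1 changed last, while touch_global() leaves file1 with the larger
stamp.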

cheers,

-- Alexandre

2006-09-14 13:56:20

Subject: Re: rfc: [patch] change attribute for ext3

On Thu, 2006-09-14 at 09:46 -0400, Peter Staubach wrote:
> Wouldn't the generation count work better than ctime to differentiate
> between instances of files using the same inode number? That way, there
> wouldn't be a clock resolution issue.

No. This is about distinguishing updates to the metadata/data of the
same instance. It is exactly what ctime was supposed to do, but ctime
relies on a clock which usually has too poor time resolution, may not be
monotonic, etc...

Cheers,
Trond

2006-09-14 14:06:48

Subject: Re: rfc: [patch] change attribute for ext3

On Thu, Sep 14, 2006 at 09:46:03AM -0400, Peter Staubach wrote:
> Trond Myklebust wrote:
> >On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> >
> >>hello,
> >>
> >>here is a small patch that adds the "change attribute" for ext3
> >>file-systems;
> >>
> >>the change attribute is a simple counter that is reset to zero on
> >>inode creation and that is incremented every time the inode data is
> >>modified (similarly to the "ctime" time-stamp).
> >>
> >
> >I would really have preferred a full-blown 64-bit counter as per
> >RFC3530, but I suppose we could always combine this change attribute
> >with the high word from ctime in order to make up the NFSv4 change
> >attribute. That should keep us safe until someone develops a ramdisk
> >with < 1 nsecond access time.
>
> Wouldn't the generation count work better than ctime to differentiate
> between instances of files using the same inode number? That way, there
> wouldn't be a clock resolution issue.

Yes, and afaik it's already used for that purpose by NFSv{2,3}.

Note that the change attribute is for counting changes of the same instance
of a file using a given inode (as opposed to the generation counter that's
used to count the number of files that have used a given inode).

-- Alexandre

2006-09-14 21:02:00

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 14, 2006 15:21 +0200, Alexandre Ratchov wrote:
> IMHO, the natural place to do this stuff is the VFS, because it can be
> common to all file-systems supporting this feature. Currently it's the same
> with ctime, mtime and atime. These are in the VFS even if there are
> file-systems that don't support all of them.

Well, that is only partly true. I see lots of places in ext3 that are
setting i_ctime and i_mtime...

> > Ugh, this would definitely hurt performance, because ext3_dirty_inode()
> > packs-for-disk the whole inode each time. I believe Stephen had patches
> > at one time to do the inode packing at transaction commit time instead
> > of when it is changed, so we only do the packing once. Having a generic
> > mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
> > inode to the buffer) would also be useful for other things like doing
> > the inode or group descriptor checksum only once per transaction...
>
> afaik, for an open file, there is always a copy of the inode in memory,
> because there is a reference to it in the file structure. So, in principle,
> we shouldn't need to mark the inode dirty. I don't know if this is feasible;
> perhaps I am missing something here.

The in-memory inode needs to be copied into the buffer so that it is
part of the transaction being committed to disk, or updates are lost.
This was a common bug with early ext3 - marking the inode dirty and
then changing a field in the in-core inode - which would not be saved
to disk. In other filesystems this is only a few-cycle race, but in
ext3 the in-core inode is not written to disk unless the inode is
again marked dirty.

The potential benefit of making this a callback from the JBD layer is
it avoids copying the inode for EVERY dirty, and only doing it once
per transaction. Add a list of callbacks hooked onto the transaction
to be called before it is committed, and the callback data is the
inode pointer which does a single ext3_do_update_inode() call if the
inode is still marked dirty.
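
A userspace simulation of that idea, under the assumption of a simple
singly linked pre-commit list. All types and functions here (inode_sim,
txn_sim, pack_inode, modify_inode, commit_txn) are hypothetical stand-ins;
pack_inode() plays the role of ext3_do_update_inode(), and nothing below
is real JBD API:

```c
#include <stddef.h>

struct inode_sim {
    unsigned long change_attr;   /* in-core value, bumped on every change */
    unsigned long on_disk_attr;  /* value last packed into the buffer */
    int dirty;                   /* in-core copy differs from the buffer */
    int queued;                  /* already on the transaction's list */
    int pack_count;              /* how many times the inode was packed */
    struct inode_sim *next;      /* link in the pre-commit list */
};

struct txn_sim {
    struct inode_sim *precommit_list;
};

/* Stand-in for ext3_do_update_inode(): copy in-core state to the buffer. */
static void pack_inode(struct inode_sim *inode)
{
    inode->on_disk_attr = inode->change_attr;
    inode->pack_count++;
    inode->dirty = 0;
}

/* Change the in-core inode and register it once with the transaction. */
static void modify_inode(struct txn_sim *t, struct inode_sim *inode)
{
    inode->change_attr++;
    inode->dirty = 1;
    if (!inode->queued) {
        inode->queued = 1;
        inode->next = t->precommit_list;
        t->precommit_list = inode;
    }
}

/* Pre-commit: pack each still-dirty inode exactly once, then clear the list. */
static void commit_txn(struct txn_sim *t)
{
    struct inode_sim *inode = t->precommit_list;

    while (inode) {
        struct inode_sim *next = inode->next;

        if (inode->dirty)
            pack_inode(inode);
        inode->queued = 0;
        inode->next = NULL;
        inode = next;
    }
    t->precommit_list = NULL;
}
```

In this model, calling modify_inode() a thousand times followed by one
commit_txn() packs the inode a single time, which is the saving being
argued for.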

> it's not strictly necessary; it's no more necessary than the interface to
> ctime or other attributes. It's here for completeness; in my opinion the
> change attribute is the same as the ctime time-stamp.

Then it makes sense to just improve the ctime mechanism instead of adding
new code and interfaces...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-09-15 10:19:40

Subject: Re: rfc: [patch] change attribute for ext3

On Thu, Sep 14, 2006 at 03:01:48PM -0600, Andreas Dilger wrote:
> On Sep 14, 2006 15:21 +0200, Alexandre Ratchov wrote:
> > IMHO, the natural place to do this stuff is the VFS, because it can be
> > common to all file-systems supporting this feature. Currently it's the same
> > with ctime, mtime and atime. These are in the VFS even if there are
> > file-systems that don't support all of them.
>
> Well, that is only partly true. I see lots of places in ext3 that are
> setting i_ctime and i_mtime...
>

i fully agree here; personally, in the long term, i'd like to see all this
stuff common to all file systems. IMHO, the file-system specific code should
only handle the on-disk storage part of the problem.

> > > Ugh, this would definitely hurt performance, because ext3_dirty_inode()
> > > packs-for-disk the whole inode each time. I believe Stephen had patches
> > > at one time to do the inode packing at transaction commit time instead
> > > of when it is changed, so we only do the packing once. Having a generic
> > > mechanism to do pre-commit callbacks from jbd (i.e. to pack a dirty
> > > inode to the buffer) would also be useful for other things like doing
> > > the inode or group descriptor checksum only once per transaction...
> >
> > afaik, for an open file, there is always a copy of the inode in memory,
> > because there is a reference to it in the file structure. So, in principle,
> > we shouldn't need to mark the inode dirty. I don't know if this is feasible;
> > perhaps I am missing something here.
>
> The in-memory inode needs to be copied into the buffer so that it is
> part of the transaction being committed to disk, or updates are lost.
> This was a common bug with early ext3 - marking the inode dirty and
> then changing a field in the in-core inode - which would not be saved
> to disk. In other filesystems this is only a few-cycle race, but in
> ext3 the in-core inode is not written to disk unless the inode is
> again marked dirty.
>
> The potential benefit of making this a callback from the JBD layer is
> it avoids copying the inode for EVERY dirty, and only doing it once
> per transaction. Add a list of callbacks hooked onto the transaction
> to be called before it is committed, and the callback data is the
> inode pointer which does a single ext3_do_update_inode() call if the
> inode is still marked dirty.
>

yes, that would be a real solution. Do you have any reference to Stephen's
patches?

BTW, note that when we add support for nanosecond time-stamps, we'll still
have the same problem, because with very high clock and time-stamp
resolutions, we'll have to update the inode on every change.

> > it's not strictly necessary; it's no more necessary than the interface to
> > ctime or other attributes. It's here for completeness; in my opinion the
> > change attribute is the same as the ctime time-stamp.
>
> Then it makes sense to just improve the ctime mechanism instead of adding
> new code and interfaces...
>

that's also my opinion, and i really see the change attribute as part of the
ctime mechanism. It adds very few lines of code (most of them are trivial) and
it uses the same interface as ctime.

The problem we want to solve is how to reliably track file changes; there
are several solutions. The solution that Celine and I have considered is
the change attribute. It uses the 'ctime' semantics so we used the ctime
interface.

It's available for kernel threads, but I don't see any reason not to make it
available for user-land programs. In order to use it, user-land will just
need some trivial modifications.

Note that the change attribute is not incompatible with nanosecond
time-stamps. It's just a complement to the ctime, regardless of the time
resolution.

-- Alexandre

2006-09-15 16:18:23

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 15, 2006 12:19 +0200, Alexandre Ratchov wrote:
> BTW, note that when we add support for nanosecond time-stamps, we'll still
> have the same problem, because with very high clock and time-stamp
> resolutions, we'll have to update the inode on every change.

If we have a journal commit callback, then even if the in-core inode ctime
is changed continually, the on-disk inode will be updated exactly once per
jbd transaction. The problem as it stands now is that the ext3/VFS code
doesn't know where a transaction boundary is, so it has to continually
update the on-disk inode to make sure that the changes are in the relevant
transaction.

I recall a long time ago that Andrew got significant improvement from
reducing the number of mark_inode_dirty() calls in ext3, just because of
avoiding the repeated core->disk inode packing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-11-14 22:17:25

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 13, 2006 20:30 +0200, Alexandre Ratchov wrote:
> On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > I would really have preferred a full-blown 64-bit counter as per
> > RFC3530, but I suppose we could always combine this change attribute
> > with the high word from ctime in order to make up the NFSv4 change
> > attribute. That should keep us safe until someone develops a ramdisk
> > with < 1 nsecond access time.
>
> do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> would allow 2^32 inode changes per second.

I've been giving this further thought, and it may be that a full 64-bit
counter per inode is the only bulletproof solution.

One reason that ctime+nsec as the version number isn't so great is that if
there is some reason to set the clock backward (e.g. it was incorrectly
set into the future at some point) the inode ctime may jump backward.
This could cause either misordering of events, or collisions between
version numbers. The problem could be mitigated by having the ctime+nsec
value only increment the nsec component by 1 for each new version (like
a counter) until real time catches up with the bad ctime, but it might
leave files with a bad ctime for a long time.

Other than not being able to set ctime backward (which isn't really
something that should happen under normal behaviour), this is a reasonable
solution.
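
A minimal sketch of that mitigation, assuming the version is the ctime folded
into a single 64-bit nanosecond count (that encoding and the function names
are assumptions made for illustration only): the version follows the clock
while the clock moves forward, and otherwise advances by one.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Fold seconds and nanoseconds into one 64-bit value (sketch assumption). */
static uint64_t demo_time_to_version(uint64_t sec, uint32_t nsec)
{
    return sec * DEMO_NSEC_PER_SEC + nsec;
}

/*
 * Pick the next version for an inode.  If the clock is ahead of the
 * previous version, use the clock.  If the clock was set backward (or has
 * not advanced), behave like a counter and add 1 to the previous version,
 * so the value never repeats and never decreases.
 */
static uint64_t demo_next_version(uint64_t prev_version, uint64_t now_version)
{
    if (now_version > prev_version)
        return now_version;
    return prev_version + 1;
}
```

The cost is the one described above: until real time catches up, the stored
ctime stays ahead of the wall clock.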

The main drawback of a 64-bit counter is the space in the inode that it
consumes... I don't think we can find 64 bits of free space in the core
inode, so this would relegate the solution to new filesystems that are
formatted with large inodes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-11-24 00:23:11

Subject: Re: rfc: [patch] change attribute for ext3

On Nov 14, 2006 15:17 -0700, Andreas Dilger wrote:
> On Sep 13, 2006 20:30 +0200, Alexandre Ratchov wrote:
> > On Wed, Sep 13, 2006 at 02:11:11PM -0400, Trond Myklebust wrote:
> > > I would really have preferred a full-blown 64-bit counter as per
> > > RFC3530, but I suppose we could always combine this change attribute
> > > with the high word from ctime in order to make up the NFSv4 change
> > > attribute. That should keep us safe until someone develops a ramdisk
> > > with < 1 nsecond access time.
> >
> > do you mean something like "(ctime.tv_sec << 32) | change_attribute"? this
> > would allow 2^32 inode changes per second.
>
> I've been giving this further thought, and it may be that a full 64-bit
> counter per inode is the only bulletproof solution.
>
> One reason that ctime+nsec as the version number isn't so great is that if
> there is some reason to set the clock backward (i.e. it was incorrectly
> set into the future at some point) the inode ctime may jump backward.
> This could cause either misordering of events, or collisions between
> version numbers. The problem could be mitigated by having the ctime+nsec
> value only increment the nsec component by 1 for each new version (like
> a counter) until real time catches up with the bad ctime, but it might
> leave files with a bad ctime for a long time.
>
> The main drawback of a 64-bit counter is the space in the inode that it
> consumes... I don't think we can find 64 bits of free space in the core
> inode, so this would relegate the solution to new filesystems that are
> formatted with large inodes.

Alexandre, Trond,
what do you think about using a 32-bit in-inode version (sufficient for
casual uses of NFSv4), and putting the 32-bit MSB of the version into the
large part of the inode (say after cr_time)?

That allows use of the version for existing ext3 filesystems, and with
large inodes (Lustre, ext4) it also meets the specs of RFC 3530 and any
intended NFSv4 future use?
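
A sketch of how such a split version might be read and incremented, assuming
a hypothetical layout (the field names and the has_large flag are invented
for illustration and are not the actual ext3/ext4 on-disk format):

```c
#include <stdint.h>

/* Hypothetical on-disk inode fragment for the split version. */
struct demo_disk_inode {
    uint32_t version_lo;  /* low 32 bits, fits in the core (small) inode */
    uint32_t version_hi;  /* high 32 bits, stored in the large-inode area */
    int      has_large;   /* nonzero when the large-inode area exists */
};

/* Assemble the version; small inodes only ever expose the low 32 bits. */
static uint64_t demo_get_version(const struct demo_disk_inode *di)
{
    uint64_t v = di->version_lo;

    if (di->has_large)
        v |= (uint64_t)di->version_hi << 32;
    return v;
}

/* Increment the version, carrying into the high word when it is available. */
static void demo_inc_version(struct demo_disk_inode *di)
{
    di->version_lo++;
    if (di->version_lo == 0 && di->has_large)
        di->version_hi++;
}
```

With a small inode the counter wraps after 2^32 changes; with a large inode
the carry into version_hi gives the full 64-bit range.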

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-11-28 19:00:16

by J. Bruce Fields

Subject: Re: rfc: [patch] change attribute for ext3

On Thu, Nov 23, 2006 at 05:23:11PM -0700, Andreas Dilger wrote:
> On Nov 14, 2006 15:17 -0700, Andreas Dilger wrote:
> > I've been giving this further thought, and it may be that a full 64-bit
> > counter per inode is the only bulletproof solution.
> >
> > One reason that ctime+nsec as the version number isn't so great is that if
> > there is some reason to set the clock backward (i.e. it was incorrectly
> > set into the future at some point) the inode ctime may jump backward.
> > This could cause either misordering of events, or collisions between
> > version numbers. The problem could be mitigated by having the ctime+nsec
> > value only increment the nsec component by 1 for each new version (like
> > a counter) until real time catches up with the bad ctime, but it might
> > leave files with a bad ctime for a long time.
> >
> > The main drawback of a 64-bit counter is the space in the inode that it
> > consumes... I don't think we can find 64 bits of free space in the core
> > inode, so this would relegate the solution to new filesystems that are
> > formatted with large inodes.
>
> Alexandre, Trond,
> what do you think about using a 32-bit in-inode version (sufficient for
> causal uses of NFSv4), and put the 32-bit MSB of the version into the
> large part of the inode (say after cr_time)?

So does that mean that the MSB of the change attribute would only be
available on some filesystems, or that it would be available on all of
them but be slower on those with smaller inodes? And how does the user
(e.g. the nfsd code) distinguish the two cases?

> That allows use of the version for existing ext3 filesystems, and with
> large inodes (Lustre, ext4) it also meets the specs of RFC 3530 and any
> intended NFSv4 future use?

--b.

2006-11-28 22:06:22

Subject: Re: rfc: [patch] change attribute for ext3

On Nov 28, 2006 14:00 -0500, J. Bruce Fields wrote:
> On Thu, Nov 23, 2006 at 05:23:11PM -0700, Andreas Dilger wrote:
> > On Nov 14, 2006 15:17 -0700, Andreas Dilger wrote:
> > > I've been giving this further thought, and it may be that a full 64-bit
> > > counter per inode is the only bulletproof solution.
> > >
> > > One reason that ctime+nsec as the version number isn't so great is that if
> > > there is some reason to set the clock backward (i.e. it was incorrectly
> > > set into the future at some point) the inode ctime may jump backward.
> > > This could cause either misordering of events, or collisions between
> > > version numbers. The problem could be mitigated by having the ctime+nsec
> > > value only increment the nsec component by 1 for each new version (like
> > > a counter) until real time catches up with the bad ctime, but it might
> > > leave files with a bad ctime for a long time.
> > >
> > > The main drawback of a 64-bit counter is the space in the inode that it
> > > consumes... I don't think we can find 64 bits of free space in the core
> > > inode, so this would relegate the solution to new filesystems that are
> > > formatted with large inodes.
> >
> > Alexandre, Trond,
> > what do you think about using a 32-bit in-inode version (sufficient for
> > causal uses of NFSv4), and put the 32-bit MSB of the version into the
> > large part of the inode (say after cr_time)?
>
> So does that mean that the MSB of the change attribute would only be
> available on some filesystems, or that it would be available on all of
> them but be slower on those with smaller inodes? And how does the user
> (e.g. the nfsd code) distinguish the two cases?

One other option is to use the other reserved field (l_i_reserved2) to
store the MSB of the version.

> > That allows use of the version for existing ext3 filesystems, and with
> > large inodes (Lustre, ext4) it also meets the specs of RFC 3530 and any
> > intended NFSv4 future use?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-11-29 18:53:01

by Cordenner jean noel

Subject: [RFC] [patch 0/3] change attribute for ext4

Hello,

I've updated the change attribute patch for ext4 that was initially posted by
Alexandre Ratchov.

The change attribute is a simple counter that is set on inode creation and that
is incremented every time the inode data is modified (similarly to the "ctime"
time-stamp) and never reset.

Here are the results of tests I ran with and without the change attribute patch.

http://www.bullopensource.org/ext4/change_attribute/index.html

Any comments are welcome.

regards,
Jean noel

2006-12-14 01:24:31

Subject: Re: rfc: [patch] change attribute for ext3

On Sep 13, 2006 14:11 -0400, Trond Myklebust wrote:
> On Wed, 2006-09-13 at 18:42 +0200, Alexandre Ratchov wrote:
> > here is a small patch that adds the "change attribute" for ext3
> > file-systems;
> >
> > the change attribute is a simple counter that is reset to zero on
> > inode creation and that is incremented every time the inode data is
> > modified (similarly to the "ctime" time-stamp).
>
> I would really have preferred a full-blown 64-bit counter as per
> RFC3530, but I suppose we could always combine this change attribute
> with the high word from ctime in order to make up the NFSv4 change
> attribute. That should keep us safe until someone develops a ramdisk
> with < 1 nsecond access time.

Trond, can you please elaborate on the need for a 64-bit version counter
for NFSv4? We have been looking at something similar, but ctime+nsec is
not really sufficient as it is possible that the inode ctime can go
backward if the clock is reset.

What kind of requirements does NFSv4 place on the version? Monotonic is
probably a good bet. Does it need to be global for the filesystem, or
is a per-inode version sufficient? What functionality of NFSv4 needs
the version?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-12-14 01:52:12

by J. Bruce Fields

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, Dec 13, 2006 at 06:24:28PM -0700, Andreas Dilger wrote:
> On Sep 13, 2006 14:11 -0400, Trond Myklebust wrote:
> > I would really have preferred a full-blown 64-bit counter as per
> > RFC3530, but I suppose we could always combine this change attribute
> > with the high word from ctime in order to make up the NFSv4 change
> > attribute. That should keep us safe until someone develops a ramdisk
> > with < 1 nsecond access time.
>
> Trond, can you please elaborate on the need for a 64-bit version counter
> for NFSv4?

I'm not Trond, but....

> What kind of requirements does NFSv4 place on the version? Monotonic is
> probably a good bet.

The only requirement is that it be unique (assuming a file is never
modified 2^64 times). Clients can't compare them except for equality.

> Does it need to be global for the filesystem

Nope.

> or is a per-inode version sufficient?

Yes.

> What functionality of NFSv4 needs the version?

Clients use it to revalidate their caches.
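
To illustrate the equality-only comparison, here is a minimal sketch of a
client-side check (the cache structure and function name are hypothetical and
are not taken from the Linux NFS client):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file cache state kept by a client. */
struct demo_cache_entry {
    uint64_t change;  /* change attribute seen when the data was cached */
    bool     valid;   /* cached data may be used */
};

/*
 * Revalidate against the change attribute just fetched from the server.
 * Only equality is tested: a different value, whether larger or smaller,
 * means the file changed and the cached data must be refetched.
 */
static bool demo_revalidate(struct demo_cache_entry *ce, uint64_t server_change)
{
    if (ce->valid && ce->change == server_change)
        return true;

    ce->change = server_change;
    ce->valid = false;  /* caller refetches data, then sets valid again */
    return false;
}
```

Because the test is for equality only, the server value does not have to be
monotonic for this check to work; it only has to be unique per change.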

--b.

2006-12-14 16:48:09

Subject: Re: rfc: [patch] change attribute for ext3

On Wed, 2006-12-13 at 20:52 -0500, J. Bruce Fields wrote:
> > What kind of requirements does NFSv4 place on the version? Monotonic is
> > probably a good bet.
>
> The only requirement is that it be unique (assuming a file is never
> modified 2^64 times). Clients can't compare them except for equality.

The other requirement is that they be updated in more or less any
situation where you would normally see a 'ctime' update. In other words
any time when the file metadata or data changes, and any time when the
ACL changes.

(NB: I'm not sure what we should do w.r.t. xattr changes since those are
not really covered by RFC3530.)

Atomicity is not a hard requirement; however, the server is required to
know whether or not the update was atomic. If the update is atomic, a
careful client may perform certain optimisations based upon it knowing
that no other changes to the inode have raced with this one. For
instance, if it knows that a file creation atomically updated the change
attribute of the directory, then it can determine that it does not need
to check for other changes to that directory.
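
A sketch of that client-side optimisation, with a structure modelled on the
change_info4 type of RFC 3530 (atomic flag plus before/after values); the
cache structure and function name are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Modelled on change_info4 from RFC 3530. */
struct demo_change_info {
    bool     atomic;  /* server says before/after bracket this op atomically */
    uint64_t before;  /* directory change attribute before the operation */
    uint64_t after;   /* directory change attribute after the operation */
};

/* Hypothetical cached state for a directory. */
struct demo_dir_cache {
    uint64_t change;
    bool     valid;
};

/*
 * Apply the result of our own operation (e.g. a file creation) to the
 * cached directory.  If the update was atomic and "before" matches what we
 * had cached, nobody else changed the directory in between, so the cache
 * can be kept and simply moved to "after".  Otherwise it is invalidated.
 */
static bool demo_apply_change_info(struct demo_dir_cache *dc,
                                   const struct demo_change_info *ci)
{
    if (dc->valid && ci->atomic && dc->change == ci->before) {
        dc->change = ci->after;
        return true;
    }
    dc->valid = false;
    return false;
}
```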

> > Does it need to be global for the filesystem
>
> Nope.
>
> > or is a per-inode version sufficient?
>
> Yes.

Yes. If your filesystem wants to support Solaris or Reiser4-like
subfiles, then it is expected that each subfile should have its own
change attribute (whereas changes to the subfile 'directory' will be
reflected by the parent inode's change attribute).

Change attribute values may be reused if the inode number is reused (as
long as the filesystem has something like a generation counter that
allows it to distinguish between different instances of the same inode
number).
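
One way to picture this, as a sketch with hypothetical names: a cached change
attribute is only meaningful together with the (inode number, generation)
pair it was read from.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of one instance of an inode number. */
struct demo_file_key {
    uint64_t ino;
    uint32_t generation;  /* bumped when the inode number is reused */
};

/* Hypothetical cached attribute record. */
struct demo_cached_attr {
    struct demo_file_key key;
    uint64_t change;
};

/*
 * The cache matches only if both the file instance and the change
 * attribute are equal.  A reused inode number with a new generation never
 * matches, even if its change attribute happens to repeat an old value.
 */
static bool demo_cache_matches(const struct demo_cached_attr *c,
                               const struct demo_file_key *key,
                               uint64_t change)
{
    if (c->key.ino != key->ino || c->key.generation != key->generation)
        return false;
    return c->change == change;
}
```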

> > What functionality of NFSv4 needs the version?
>
> Clients use it to revalidate their caches.

Yup. It is used to detect changes made on the NFS server itself
(possibly by other NFS clients, possibly by local processes on the
server), so that the client can flush out any stale cached data.

Cheers
Trond

2006-12-14 23:04:57