LinuxLists.cc - [PATCH 00/18] Extended file stat functions [ver #6]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sat, 2010-07-31 at 17:53 +0100, David Howells wrote:
> utz lehmann wrote:
>
> > When abusing an existing time stamp use atime not ctime please.
> > ctime has its uses. atime was just a mistake and is nearly useless.
>
> CacheFiles currently uses atime to determine least-recently-usedness.

How does this work correctly with noatime or relatime (which is the default)?

We have used FS-Cache with a few tens of thousands of files cached. Doesn't
that mean the cleanup has to stat them all?

Why doesn't cachefilesd manage the cache index in a separate database
like other caches?
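
For context on the relatime question: the usual description of relatime is that atime is only updated when the old atime is not later than mtime or ctime, or when it is at least a day old. A minimal sketch of that rule, with a made-up function name and whole-second timestamps (this is an illustration, not the kernel's code):

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: would a read at time `now` update atime under
 * relatime?  All arguments are seconds since the epoch. */
static bool relatime_would_update(time_t atime, time_t mtime,
                                  time_t ctim, time_t now)
{
    if (mtime >= atime)                 /* file modified since last access */
        return true;
    if (ctim >= atime)                  /* inode changed since last access */
        return true;
    if (now - atime >= 24 * 60 * 60)    /* old atime is at least a day old */
        return true;
    return false;
}
```

Under that rule an atime-based LRU would presumably see repeated reads of an unchanged file at roughly one-day resolution, and nothing at all with noatime.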

2010-07-16 13:32:15

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Friday 16 July 2010, David Howells wrote:
> Arnd Bergmann wrote:
>
> > You could also define the tv_gran_units to be power-of-ten nanoseconds,
> > making it a decimal floating point number like
> >
> > enum {
> >     XSTAT_NANOSECONDS_GRANULARITY = 0,
> >     XSTAT_MICROSECONDS_GRANULARITY = 3,
> >     XSTAT_MILLISECONDS_GRANULARITY = 6,
> >     XSTAT_SECONDS_GRANULARITY = 9,
> > };
>
> Are you thinking, then, of having tv_nsec be in terms of those units?

No, just tv_granularity. Most users won't need to care that this
is not a regular timespec then.

> > That would make it easier to define an xstat_time_before() function, though
> > it means that you could no longer do XSTAT_MINUTES_GRANULARITY and
> > higher directly other than { .tv_gran_units = 10, .tv_granularity = 6, }.
>
> So you're thinking of indicating time (in)equality based on overlapping time
> granules?

Yes, for example rsync could use this to determine whether a local (e.g. FAT)
and a remote (e.g. NFS) file are identical or not. Right now, you can pass
the granularity in seconds as a command line argument, but it would be nice
to have rsync do this automatically if possible.
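
A rough sketch of what such a granule-overlap test could look like, using the struct layout discussed in this thread. The function names are hypothetical, the exponent is taken relative to seconds, and it assumes a filesystem truncates timestamps down to the start of a granule; none of this is a merged kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout as proposed in this thread (assumption: exponent base is seconds). */
struct xstat_time {
    signed long long tv_sec;        /* seconds */
    unsigned int     tv_nsec;       /* nanoseconds */
    unsigned short   tv_gran_mant;  /* mantissa */
    signed char      tv_gran_exp;   /* exponent */
    unsigned char    unused;
};

/* Granule length in ns: mant * 10^(exp + 9), clamped to at least 1 ns. */
static int64_t xstat_granule_ns(const struct xstat_time *t)
{
    int64_t g = t->tv_gran_mant;
    int e = t->tv_gran_exp + 9;

    if (e < 0 || g == 0)
        return 1;
    while (e-- > 0)
        g *= 10;
    return g;
}

/* Granule start in ns since the epoch (overflows beyond roughly 292 years). */
static int64_t xstat_start_ns(const struct xstat_time *t)
{
    return (int64_t)t->tv_sec * 1000000000LL + t->tv_nsec;
}

/* True if the half-open intervals [start, start + granule) intersect. */
static bool xstat_time_may_equal(const struct xstat_time *a,
                                 const struct xstat_time *b)
{
    int64_t as = xstat_start_ns(a), bs = xstat_start_ns(b);

    return as < bs + xstat_granule_ns(b) && bs < as + xstat_granule_ns(a);
}
```

With a 2-second FAT timestamp of 100 s and a nanosecond-granular NFS timestamp of 101.5 s, the two granules overlap, so the files could be treated as possibly identical; an NFS timestamp of 102 s would not overlap.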

> Your suggestion would suffice, I think. With a 2:2 split between exponent
> (tv_gran_units) and mantissa (tv_granularity), you can do:
>
> UNIT            SECONDS/UNIT    EXPONENT    MANTISSA
> nanoseconds     0.000000001     -9          1
> microseconds    0.000001        -6          1
> milliseconds    0.001           -3          1
> seconds         1               0           1
> minutes         60              1           6
> hours           3600            2           36
> days            86400           2           864
> weeks           604800          2           6048
>
> Any units beyond that are variable length and not worth considering, IMO.

right.

> And if you don't want negative numbers in your exponent, you can make the base
> unit nS instead of S.

either way works fine for me.

> Is it worth allowing a filesystem to indicate that it has granularity smaller
> than nS, even if the resolution can't be handled here? We could even have:
>
> struct xstat_time {
>     signed long long tv_sec;        /* seconds */
>     unsigned int tv_nsec;           /* nanoseconds */
>     unsigned char tv_psec4;         /* picoseconds/4 */
>     signed char tv_gran_exp;        /* exponent */
>     unsigned short tv_gran_mant;    /* mantissa */
> };
>
> Though it's probably still an unnecessary extravagance to have the pS field.
> It's probably best left as padding for now; we can always change our minds
> later...

There are also two extra bits in tv_nsec ;-). No, I don't think we
need picoseconds any time soon.

One byte padding might not be the worst thing to have in here, like

struct xstat_time {
    signed long long tv_sec;        /* seconds */
    unsigned int tv_nsec;           /* nanoseconds */
    unsigned short tv_gran_mant;    /* mantissa */
    signed char tv_gran_exp;        /* exponent */
    unsigned char unused;
};
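
One way to check that this ordering packs without hidden padding is a handful of compile-time assertions. This sketch assumes an ABI with 8-byte long long, 4-byte int and 2-byte short; the expected offsets are derived from that assumption, not from any posted patch.

```c
#include <stddef.h>

struct xstat_time {
    signed long long tv_sec;        /* seconds */
    unsigned int     tv_nsec;       /* nanoseconds */
    unsigned short   tv_gran_mant;  /* mantissa */
    signed char      tv_gran_exp;   /* exponent */
    unsigned char    unused;        /* explicit padding byte */
};

/* Each member sits on its natural alignment, so nothing is inserted. */
_Static_assert(offsetof(struct xstat_time, tv_nsec) == 8, "tv_nsec offset");
_Static_assert(offsetof(struct xstat_time, tv_gran_mant) == 12, "mantissa offset");
_Static_assert(offsetof(struct xstat_time, tv_gran_exp) == 14, "exponent offset");
_Static_assert(offsetof(struct xstat_time, unused) == 15, "padding byte offset");
_Static_assert(sizeof(struct xstat_time) == 16, "no hidden padding");
```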

Arnd

2010-07-22 18:02:07

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
>
> That would make it impossible to export the filesystem with NFSv2 and
> v3. They do rely on ctime checking for certain operations (e.g. deciding
> when to invalidate access and acl caches). NFSv4 needs this too if the
> filesystem has no dedicated change attribute.
>
> Still, I suppose the market for exporting the same filesystem with both
> NFS and Samba is limited...

Ask NetApp about that :-). They have built a rather large
business on just that fact :-).

Jeremy.

2010-07-22 18:21:57

by Benny Halevy

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Jul. 22, 2010, 20:24 +0300, Linus Torvalds wrote:
> On Thu, Jul 22, 2010 at 10:03 AM, Jan Engelhardt wrote:
>>
>> I beg to differ. ctime is not completely useless. It reflects changes on
>> the inode for when you don't change the content.
>
> Uh. Yes. Except that why is file metadata really different from file
> data? Most people really don't care. And a lot of people have asked
> for creation dates - and I seriously doubt that Windows people
> complain a lot about the fact that there you have mtime for metadata
> changes too.
>
> The point being that Unix ctime semantics certainly have well-defined
> semantics, but they are in no way "better" than having a real creation
> time, and are often worse.

Yeah, having create time would be important.
That said, having a non user-settable modify timestamp is crucial
for quickly determining whether a file has changed.

Benny

>
> Just imagine what you could do as an MIS person if you actually had a
> creation time you could somewhat trust? You talk about seeing somebody
> change the permissions of /etc/passwd, but realistically, absent
> preexisting semantics, who would really ask for that? The only reason
> you mention that as an example of what you can do with ctime is that
> that is indeed pretty much the _only_ thing you can do with ctime, and
> it really isn't that useful.
>
> In contrast, with a creation date, you see the difference between
> people overwriting files by writing to them, or overwriting files by
> creating a new one and moving it over the old one. At a guess, that
> would be quite as useful to a sysadmin as ctime is now (my gut feel is
> that it would be more so, but whatever).
>
> IOW, there really isn't anything magically good about UNIX ctime
> semantics, and in fact they are totally broken in the presence of
> extended attributes (that's file data, but it only changes ctime? WTF
> is up with that? Yes, I know why it happens, and it makes sense within
> the insane unix ctime rules, but no way does it make sense in a bigger
> picture unless you are in total denial and try to claim that xattrs
> are just metadata despite having contents).
>
> And yes, I am also sure that there are applications that do depend on
> ctime semantics. Trond mentioned NFS serving, and that's unfortunate.
> I bet there are others. That's inevitable when you have 40 years of
> history. So I'm not claiming that re-using ctime is painfree, but for
> somebody that cares about samba a lot, I bet it's a _lot_ better than
> adding a new time that almost nobody actually supports as things stand
> now.
>
> Or people can just use xattrs and do it all entirely in user space. I
> assume that's what samba does now, even outside of birthtime.
>
> Linus

2010-07-22 17:12:25

by Jim Rees

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Linus Torvalds wrote:

I personally think that Unix ctime is insane. There is no real reason
why "write()" should change mtime, but "chmod" changes ctime. It was
just a random decision way back when...

I believe it was done that way so "dump" could backup just the inode and not
the data if only the inode had changed. Full history here:

http://blog.plover.com/Unix/ctime.html
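
The incremental-dump idea described above can be sketched as a small decision function. The names are invented for illustration and this is not dump's actual code; it only shows why separate mtime and ctime values let a backup tool skip file data when only the inode changed.

```c
#include <time.h>

enum dump_action {
    DUMP_NOTHING,           /* neither data nor inode changed */
    DUMP_INODE_ONLY,        /* e.g. chmod/chown since the last dump */
    DUMP_INODE_AND_DATA     /* contents were written since the last dump */
};

/* Decide what an incremental backup needs, given the time of the last dump. */
static enum dump_action dump_classify(time_t mtime, time_t ctim,
                                      time_t last_dump)
{
    if (mtime > last_dump)
        return DUMP_INODE_AND_DATA;
    if (ctim > last_dump)
        return DUMP_INODE_ONLY;
    return DUMP_NOTHING;
}
```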

2010-07-22 15:37:13

by Volker Lendecke

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 08:14:42AM -0700, Linus Torvalds wrote:
> > # recent FreeBSD, NetBSD have creation timestamps called birthtime:
> > AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
> > AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))
> >
> > and the supporting code around that. "birth" might also be
> > where the "b" comes from :-)
>
> Oh wow. And all of this just convinces me that we should _not_ do any
> of this, since clearly it's all totally useless and people can't even
> agree on a name.
>
> Let's wait five years and see if there is actually any consensus on it
> being needed and used at all, rather than rush into something just
> because "we can".

The nice thing about this is also that if this is supposed
to be fully usable for Windows clients, the birthtime needs
to be changeable. That's what NTFS semantics gives you, thus
Windows clients tend to require it.

Just as a hint, nothing that Linux should necessarily have
to be bothered with, this is Samba's duty :-)

Volker

2010-07-31 14:44:19

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sat, 2010-07-31 at 10:08 +0200, Jan Engelhardt wrote:
> >When abusing an existing time stamp use atime not ctime please.
> >ctime has its uses. atime was just a mistake and is nearly useless.
>
> MUAs make use of atime.

I know mutt uses atime to detect new messages. But there are better and
more reliable ways to do this.

>
> >And with noatime we already have creation time semantics for atime.
>
> noatime was a late afterthought, and because it can interfere with
> some programs, relatime came along too.

There are people who prefer noatime over relatime.

Using an existing time stamp for creation time is a bad idea IMHO. But
when doing this use the least important one. Which is atime. For example
ctime is used by backup programs.

Anyway when we want to support creation time it should be an additional
time stamp.

utz

2010-07-22 12:52:13

by Volker Lendecke

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 01:14:47PM +0100, David Howells wrote:
> Jan Engelhardt wrote:
>
> > Linux already has a creation time field, it's called otime (there is no "b"
> > in "creation"), and you will find scattered fragments of that all over the
> > kernel (foremost, fs/jfs/, now btrfs, and I also notice sysvipc having
> > something with that name).
>
> It is? It's called crtime in Ext4. st_btime, however, would be compatible
> with BSD's stat, and Samba would just use it by way of autoconf magic if it
> appeared.

Samba has the following check:

# recent FreeBSD, NetBSD have creation timestamps called birthtime:
AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))

and the supporting code around that. "birth" might also be
where the "b" comes from :-)
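
As an illustration of the supporting code such checks enable, here is a minimal sketch that picks whichever birthtime member configure found. The `HAVE_*` macro names follow the usual Autoconf naming for `AC_CHECK_MEMBERS`; the function name is hypothetical and this is not Samba's actual code.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Copy the creation ("birth") time out of a struct stat, if the platform
 * has one.  Returns false and zeroes *out when no member is available. */
static bool stat_get_birthtime(const struct stat *st, struct timespec *out)
{
#if defined(HAVE_STRUCT_STAT_ST_BIRTHTIMESPEC_TV_NSEC)
    *out = st->st_birthtimespec;
    return true;
#elif defined(HAVE_STRUCT_STAT_ST_BIRTHTIME)
    out->tv_sec = st->st_birthtime;
#if defined(HAVE_STRUCT_STAT_ST_BIRTHTIMENSEC)
    out->tv_nsec = st->st_birthtimensec;
#else
    out->tv_nsec = 0;
#endif
    return true;
#else
    (void)st;               /* no birthtime member on this platform */
    out->tv_sec = 0;
    out->tv_nsec = 0;
    return false;
#endif
}
```

Built without a config.h that defines any of those macros, the function falls through to the last branch and reports that no birthtime is available.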

Volker

2010-07-31 08:08:10

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Friday 2010-07-30 23:22, utz lehmann wrote:
>On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
>> But the fact is, the Unix ctime semantics are insane and largely
>> useless. There's a damn good reason almost nobody uses ctime under
>> unix.
>>
>> So what I'm suggesting is that we have a flag - either per-process or
>> per-mount - that just says "use windows semantics for ctime".
>
>When abusing an existing time stamp use atime not ctime please.
>ctime has its uses. atime was just a mistake and is nearly useless.

MUAs make use of atime.

>And with noatime we already have creation time semantics for atime.

noatime was a late afterthought, and because it can interfere with
some programs, relatime came along too.

2010-07-22 15:46:41

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 17:14, Linus Torvalds wrote:
>>>
>>> It is? It's called crtime in Ext4. st_btime, however, would be compatible
>>> with BSD's stat, and Samba would just use it by way of autoconf magic if it
>>> appeared.
>>
>> Samba has the following check:
>> # recent FreeBSD, NetBSD have creation timestamps called birthtime:
>> AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
>> AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))
>>
>> and the supporting code around that. "birth" might also be
>> where the "b" comes from :-)
>
>Oh wow. And all of this just convinces me that we should _not_ do any
>of this, since clearly it's all totally useless and people can't even
>agree on a name.
>
>Let's wait five years and see if there is actually any consensus on it
>being needed and used at all, rather than rush into something just
>because "we can".

There just is no way currently to store creation times. Abusing ctimes
for write-once archives also stops working once you rsync it from one
place to another. (Which brings me to the side question of why
the ctime isn't settable through futimesnat.)
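
To make the side question concrete: the POSIX `utimensat()` call takes exactly two timestamps, atime and mtime, so there is no slot for ctime, and the kernel stamps ctime with the current time as part of the update. A minimal sketch (the helper name is invented):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/* Set both atime and mtime of `path` to `when`.  Returns 0 on success,
 * -1 with errno set on failure.  ctime cannot be passed in. */
static int set_file_times(const char *path, time_t when)
{
    struct timespec ts[2] = {
        { .tv_sec = when, .tv_nsec = 0 },   /* ts[0]: atime */
        { .tv_sec = when, .tv_nsec = 0 },   /* ts[1]: mtime */
    };

    return utimensat(AT_FDCWD, path, ts, 0);
}
```

This is why a copy made with rsync can preserve mtime but ends up with a ctime reflecting when the copy was made.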

2010-07-19 17:47:33

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Mon, Jul 19, 2010 at 10:26 AM, David Howells wrote:
>
>> Ask your samba people, for example, if they'd _ever_ do just a "xstat()"?
>
> I suspect they would, though maybe they can say otherwise.  What about SMB
> directory enumeration?  I believe that is effectively getdents-with-stat.
> Having to do open+stat for each file for that would be painful.

Yeah, but do you need xstat information at all for something like
that? Most people try very hard to make do with the information
returned by readdir itself (d_type and inode number), because if you
end up looking up each name you've already pretty much lost in a
performance model.

(And I do agree that a "readdirplus()" is probably something that a
lot of server people would find useful, but obviously that's another
cross-filesystem nightmare. Only a few filesystems can cheaply give
you anything but d_type/d_ino, and not all do even that),
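
A short sketch of the pattern being described: rely on `d_type` from `readdir()` and only fall back to a per-name stat when the filesystem reports `DT_UNKNOWN`. The function name is made up; the calls are the standard dirent and `fstatat()` interfaces.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* Count the subdirectories of `path`, or return -1 if it cannot be opened. */
static int count_subdirs(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *de;
    int n = 0;

    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (de->d_type == DT_DIR) {
            n++;                        /* cheap: answered by readdir itself */
        } else if (de->d_type == DT_UNKNOWN) {
            struct stat st;             /* expensive: one lookup per name */

            if (fstatat(dirfd(dir), de->d_name, &st,
                        AT_SYMLINK_NOFOLLOW) == 0 && S_ISDIR(st.st_mode))
                n++;
        }
    }
    closedir(dir);
    return n;
}
```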

Linus

2010-07-16 15:10:12

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Arnd Bergmann wrote:

> > > For the volume id, I could not find any file system that requires more
> > > than 32 bytes here, which is also reasonable to put into the structure.
> > > Make it 36 if you want to cover ascii encoded UUIDs.
> >
> > You should also include a length. Volume IDs may be binary rather than
> > NUL-terminated strings.
>
> Yes, maybe. There are several possible encodings for this. I was actually
> thinking of fixed-length string rather than zero-terminated, but that
> is possible as well. If this gets added, we need to audit every possible
> use to make sure each of them is covered. My point was mostly that if we
> need at most 40 bytes, it doesn't have to be variable length at all.

I suppose it depends what you want it for. Steve French asked for it:

    > (4) Should the inode number and data version number fields be
    >     128-bit?
    >
    This is tricky for SMB2, if you can also provide a device id (or an
    object id of some sort for the superblock) then 64 bit inode number is
    ok.

But I'm not sure what he wants to put in there. He didn't respond to my reply:

    A remote device ID?  That would be possible.  That could be used by
    AFS to return the numeric volume ID (32 bits) and by NFS to return the
    FSID (128 bits).  Would you be using the VolumeGUID (128 bits) for
    SMB2?

so I'm not sure what he's thinking of.

Looking through various filesystems:

FS      SOURCE                          FORMAT  LENGTH (BYTES)
======= =============================== ======= ==============
-       __kernel_fsid_t                 int     8
-       super_block::s_id               chars   32
ext234  superblock s_uuid               UUID    16
ext234  superblock s_volume_name        chars   16
nfs2    FSID                            int     4
nfs3    FSID                            int     8
nfs4    FSID                            int     16
afs     Volume Name + type              chars   64+1
afs     Numeric volume ID               int     4
cifs    VolumeGUID                      UUID    16
btrfs   superblock fsid                 bytes   16
fat     superblock system_id+version?   bytes   8+2
ntfs    volume_serial_number            int     8
ntfs    FILE_Volume object_id           UUID    16
xfs     superblock sb_fname             chars   12
xfs     superblock sb_uuid              UUID    16
jfs     superblock s_uuid               UUID    16
jfs     superblock s_label              bytes   16
isofs   medium_catalog_number           chars   13
isofs   volume_id                       chars   32
udf     volIdent                        chars   32

it would seem that a 16-byte (128-bit) ID would suit quite well. That would be
able to contain most things and could be added to the super_block struct. That
would also give NFSD something to use as a default FSID and Samba something to
use as a VolumeGUID.
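
A possible shape for such a field, combining the 16-byte size with the explicit length suggested earlier in the thread. The struct, macro and function names are purely hypothetical; nothing like this appears in the posted patches.

```c
#include <string.h>

#define XSTAT_VOLUME_ID_MAX 16          /* hypothetical: 128 bits */

struct xstat_volume_id {                /* hypothetical layout */
    unsigned char len;                  /* number of valid bytes in id[] */
    unsigned char id[XSTAT_VOLUME_ID_MAX];
};

/* Store a binary ID of `len` bytes, zero-padding the remainder.
 * Returns 0 on success, -1 if the ID does not fit. */
static int xstat_volume_id_set(struct xstat_volume_id *v,
                               const void *src, size_t len)
{
    if (len > XSTAT_VOLUME_ID_MAX)
        return -1;
    memset(v->id, 0, sizeof v->id);
    memcpy(v->id, src, len);
    v->len = (unsigned char)len;
    return 0;
}
```

A 4-byte AFS numeric volume ID or a 16-byte UUID fits; the 64+1 character AFS volume name from the table would be rejected.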

David

2010-07-22 16:27:03

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 08:47:46AM -0700, Linus Torvalds wrote:
> On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke wrote:
> >
> > The nice thing about this is also that if this is supposed
> > to be fully usable for Windows clients, the birthtime needs
> > to be changeable. That's what NTFS semantics gives you, thus
> > Windows clients tend to require it.
>
> Ok. So it's not really a creation date, exactly the same way ctime
> isn't at all a creation date.
>
> And maybe that actually hints at a better solution: maybe a better
> model is to create a new per-thread flag that says "do ctime updates
> the way windows does them".
>
> So instead of adding another "btime" - which isn't actually what even
> windows does - just admit that the _real_ issue is that Unix and
> Windows semantics are different for the pre-existing "ctime".
>
> The fact is, windows has "access time", "modification time" and
> "creation time" _exactly_ like UNIX. It's just that the ctime has
> slightly different semantics in windows vs unix. So quite frankly,
> it's totally insane to introduce a "birthtime", when that isn't even
> what windows wants, just because people cannot face the actual real
> difference.
>
> Tell me why we shouldn't just do this right?

No, ctime isn't the same as Windows "create time". Windows
"create time" semantics are that the timestamp is set to
current time on file creation, but afterwards anyone with
sufficient access can then modify it (!). Which is different
from the "birthtime" spec on *BSD, as they can't be modified.

Currently on *BSD we look for our special EA containing any
modified create times on a file, and return that as "create
time" if found, if not we return the st_birthtime from the
stat struct. That works well enough for systems where you
don't want to allow birthtime to be changed. Having said
that I'm not sure how they cope with doing restores to
a filesystem where you would need to set st_birthtime :-).
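
The lookup order described above (a stored override first, the filesystem's own birthtime otherwise) might be sketched like this. It uses Linux's `getxattr()` purely for illustration, since the *BSD extended-attribute calls differ; the attribute name and its raw integer encoding are invented and are not what Samba actually stores.

```c
#include <sys/types.h>
#include <sys/xattr.h>
#include <time.h>

/* Hypothetical attribute name and encoding, for illustration only. */
#define CREATE_TIME_EA "user.example.create_time"

/* Return the user-modified create time if one is stored in the EA,
 * otherwise fall back to the birthtime reported by the filesystem. */
static time_t effective_create_time(const char *path, time_t fs_birthtime)
{
    long long stored;
    ssize_t n = getxattr(path, CREATE_TIME_EA, &stored, sizeof stored);

    if (n == (ssize_t)sizeof stored)
        return (time_t)stored;
    return fs_birthtime;    /* no EA, wrong size, or xattrs unsupported */
}
```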

Jeremy.

2010-07-22 17:33:27

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 10:12 AM, Jim Rees wrote:
>
> I believe it was done that way so "dump" could backup just the inode and not
> the data if only the inode had changed.  Full history here:
>
> http://blog.plover.com/Unix/ctime.html

Yes, the dump reasoning makes sense, and that history also shows that
originally chmod just changed mtime (since that's the _sane_ thing to
do). So if it wasn't for dump - that nobody uses any more and that was
considered a hack even back when and never supported things like
xattrs etc - unix probably wouldn't have a ctime at all (or would have
implemented a "creation time" because people would have asked for it).

So I'm sure there are reasons for ctime. That just doesn't mean that
it's really "good", the same way there were reasons to name "creat()"
without the "e".

Linus

2010-07-31 18:48:41

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Saturday 2010-07-31 20:41, Andreas Dilger wrote:
>On 2010-07-30, at 12:11, Trond Myklebust wrote:
>> Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
>> They both interoperate just fine with Samba, and would presumably
>> continue to do so if someone were to decide to reuse the ctime field on
>> your Samba box as storage for a create time.
>
>CIFS doesn't support symlinks (they just appear as the referenced
>file), so I've had applications that scan the filesystem recurse
>indefinitely due to symlinked directories on a CIFS share appearing
>as hard-linked directories on the client. This doesn't happen when
>the filesystem is accessed via NFS.

This shouldn't go on indefinitely - PATH_MAX is reached at some point.

2010-07-16 12:38:53

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Arnd Bergmann wrote:

> You could also define the tv_gran_units to be power-of-ten nanoseconds,
> making it a decimal floating point number like
>
> enum {
>     XSTAT_NANOSECONDS_GRANULARITY = 0,
>     XSTAT_MICROSECONDS_GRANULARITY = 3,
>     XSTAT_MILLISECONDS_GRANULARITY = 6,
>     XSTAT_SECONDS_GRANULARITY = 9,
> };

Are you thinking, then, of having tv_nsec be in terms of those units?

> That would make it easier to define an xstat_time_before() function, though
> it means that you could no longer do XSTAT_MINUTES_GRANULARITY and
> higher directly other than { .tv_gran_units = 10, .tv_granularity = 6, }.

So you're thinking of indicating time (in)equality based on overlapping time
granules?

Your suggestion would suffice, I think. With a 2:2 split between exponent
(tv_gran_units) and mantissa (tv_granularity), you can do:

UNIT            SECONDS/UNIT    EXPONENT    MANTISSA
nanoseconds     0.000000001     -9          1
microseconds    0.000001        -6          1
milliseconds    0.001           -3          1
seconds         1               0           1
minutes         60              1           6
hours           3600            2           36
days            86400           2           864
weeks           604800          2           6048

Any units beyond that are variable length and not worth considering, IMO.

And if you don't want negative numbers in your exponent, you can make the base
unit nS instead of S.
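
The table and the change of base unit can be checked with a few lines of integer arithmetic: granularity in seconds is mantissa * 10^exponent, and rebasing to nanoseconds simply adds 9 to the exponent. The function name below is made up and there is no overflow checking beyond what the table needs.

```c
/* Convert a (mantissa, exponent-in-seconds) pair to nanoseconds.
 * Returns 0 for granularities finer than 1 ns, which do not fit here. */
static unsigned long long xstat_gran_to_ns(unsigned short mant,
                                           signed char exp_sec)
{
    int e = exp_sec + 9;            /* rebase exponent: seconds -> ns */
    unsigned long long g = mant;

    if (e < 0)
        return 0;
    while (e-- > 0)
        g *= 10;
    return g;
}
```

For example, minutes (mantissa 6, exponent 1) come out as 60,000,000,000 ns, and with a nanosecond base the same entry would be written as mantissa 6, exponent 10.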

Is it worth allowing a filesystem to indicate that it has granularity smaller
than nS, even if the resolution can't be handled here? We could even have:

struct xstat_time {
    signed long long tv_sec;        /* seconds */
    unsigned int tv_nsec;           /* nanoseconds */
    unsigned char tv_psec4;         /* picoseconds/4 */
    signed char tv_gran_exp;        /* exponent */
    unsigned short tv_gran_mant;    /* mantissa */
};

Though it's probably still an unnecessary extravagance to have the pS field.
It's probably best left as padding for now; we can always change our minds
later...

David

2010-07-31 16:54:05

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

utz lehmann wrote:

> When abusing an existing time stamp use atime not ctime please.
> ctime has its uses. atime was just a mistake and is nearly useless.

CacheFiles currently uses atime to determine least-recently-usedness.

David

2010-07-22 18:15:51

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 10:24:17AM -0700, Linus Torvalds wrote:
>
> And yes, I am also sure that there are applications that do depend on
> ctime semantics. Trond mentioned NFS serving, and that's unfortunate.
> I bet there are others. That's inevitable when you have 40 years of
> history. So I'm not claiming that re-using ctime is painfree, but for
> somebody that cares about samba a lot, I bet it's a _lot_ better than
> adding a new time that almost nobody actually supports as things stand
> now.

Samba mostly ignores ctime, for just the reasons you mention.
But re-using ctime as create time will lead to more horrible
confusion (IMHO).

Easier to add a btime field to stat (or whatever you want to
call it), especially as some of the filesystems already support it,
the code for it exists inside Samba and is working on other UNIX-style
OS'es, and for filesystems that don't support it, just return
zero or -1 in that field (which we already ignore).

> Or people can just use xattrs and do it all entirely in user space. I
> assume that's what samba does now, even outside of birthtime.

Yep. We even have to do that on systems with an immutable
btime to get Windows semantics.

Jeremy.

2010-07-15 20:35:48

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 15 July 2010 04:17:12 David Howells wrote:
> (10) Can be extended by using more request flags and tagging further data after
> the end of the standard return data. Such things as the following could
> be returned:
>
> - BSD st_flags or FS_IOC_GETFLAGS.
> - Volume ID / Remote Device ID [Steve French].
> - Time granularity (NFSv4 time_delta) [Steve French].
> - Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
> Michael Kerrisk].
>
> This was initially proposed as a set of xattrs, but the general preference is
> for an extended stat structure.

I don't think I'd call this general preference. Three of the four
are fixed length and could easily be done inside the structure if you
leave a bit of space instead of a variable-length field at the end.

For the volume id, I could not find any file system that requires more
than 32 bytes here, which is also reasonable to put into the structure.
Make it 36 if you want to cover ascii encoded UUIDs.

That's at most 60 bytes for the extensions you're considering already,
plus the 152 you have already is still less than a cache line on
some machines. Padding it to 256 bytes would make it nice and round,
if you want to be really sure, make it 384 bytes.

> The following structures are defined for the use of these new system calls:
>
> struct xstat_parameters {
>     unsigned long long request_mask;
> };

I'd also still argue that 32 bits would be better since you can
put them into the argument list instead of having to use a pointer
to xstat_parameters. You only use 15 bits so far, so the remaining
17 bits should go a long way. It's not as important to me as the
previous point though.

> The system calls are:
>
> ssize_t ret = xstat(int dfd,
>                     const char *filename,
>                     unsigned flags,
>                     const struct xstat_parameters *params,
>                     struct xstat *buffer,
>                     size_t buflen);

The resulting syscall I'd hope for would be

int xstat(int dfd, const char *filename, unsigned flags,
          unsigned mask, struct xstat *buf);

Everything else in your patch looks very good and has my full support.

Arnd

2010-07-22 18:45:09

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 2:21 PM, Benny Halevy wrote:
> On Jul. 22, 2010, 20:24 +0300, Linus Torvalds wrote:
>> On Thu, Jul 22, 2010 at 10:03 AM, Jan Engelhardt wrote:
>>>
>>> I beg to differ. ctime is not completely useless. It reflects changes on
>>> the inode for when you don't change the content.
>>
>> Uh. Yes. Except that why is file metadata really different from file
>> data? Most people really don't care. And a lot of people have asked
>> for creation dates - and I seriously doubt that Windows people
>> complain a lot about the fact that there you have mtime for metadata
>> changes too.
>>
>> The point being that Unix ctime semantics certainly have well-defined
>> semantics, but they are in no way "better" than having a real creation
>> time, and are often worse.
>
> Yeah, having create time would be important.
> That said, having a non user-settable modify timestamp is crucial
> for quickly determining whether a file has changed.

How would "cp --archive" and a host of backup/restore tools work
without user-settable modify timestamps?

Or are you proposing another timestamp? I do computer forensics, I
like timestamps, but enough is enough.

Greg

2010-07-31 19:04:22

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sat, 2010-07-31 at 12:41 -0600, Andreas Dilger wrote:
> On 2010-07-30, at 12:11, Trond Myklebust wrote:
> > Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
> > They both interoperate just fine with Samba, and would presumably
> > continue to do so if someone were to decide to reuse the ctime field on
> > your Samba box as storage for a create time.
>
> CIFS doesn't support symlinks (they just appear as the referenced file), so I've had applications that scan the filesystem recurse indefinitely due to symlinked directories on a CIFS share appearing as hard-linked directories on the client. This doesn't happen when the filesystem is accessed via NFS.

Sigh... So please explain how it would be useful to export that
particular filesystem through _both_ CIFS and NFS?

My point was that in most circumstances you want to export either
through CIFS or through NFS, but very rarely both.

I also made the point that converting ctime into a creation time would
break NFS, but it would be a limited breakage, mainly affecting the
client's ability to detect ACL changes, and possibly causing the inode
to get temporarily updated with stale attribute information on occasion
due to out-of-order RPC replies.

Trond

2010-07-22 17:53:32

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 10:03 AM, Jan Engelhardt wrote:
>
> I beg to differ. ctime is not completely useless. It reflects changes on
> the inode for when you don't change the content.

Uh. Yes. Except that why is file metadata really different from file
data? Most people really don't care. And a lot of people have asked
for creation dates - and I seriously doubt that Windows people
complain a lot about the fact that there you have mtime for metadata
changes too.

The point being that Unix ctime semantics certainly have well-defined
semantics, but they are in no way "better" than having a real creation
time, and are often worse.

Just imagine what you could do as an MIS person if you actually had a
creation time you could somewhat trust? You talk about seeing somebody
change the permissions of /etc/passwd, but realistically, absent
preexisting semantics, who would really ask for that? The only reason
you mention that as an example of what you can do with ctime is that
that is indeed pretty much the _only_ thing you can do with ctime, and
it really isn't that useful.

In contrast, with a creation date, you see the difference between
people overwriting files by writing to them, or overwriting files by
creating a new one and moving it over the old one. At a guess, that
would be quite as useful to a sysadmin as ctime is now (my gut feel is
that it would be more so, but whatever).

IOW, there really isn't anything magically good about UNIX ctime
semantics, and in fact they are totally broken in the presence of
extended attributes (that's file data, but it only changes ctime? WTF
is up with that? Yes, I know why it happens, and it makes sense within
the insane unix ctime rules, but no way does it make sense in a bigger
picture unless you are in total denial and try to claim that xattrs
are just metadata despite having contents).

And yes, I am also sure that there are applications that do depend on
ctime semantics. Trond mentioned NFS serving, and that's unfortunate.
I bet there are others. That's inevitable when you have 40 years of
history. So I'm not claiming that re-using ctime is pain-free, but for
somebody that cares about samba a lot, I bet it's a _lot_ better than
adding a new time that almost nobody actually supports as things stand
now.

Or people can just use xattrs and do it all entirely in user space. I
assume that's what samba does now, even outside of birthtime.

Linus

2010-07-30 21:23:14

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
> But the fact is, the Unix ctime semantics are insane and largely
> useless. There's a damn good reason almost nobody uses ctime under
> unix.
>
> So what I'm suggesting is that we have a flag - either per-process or
> per-mount - that just says "use windows semantics for ctime".

When abusing an existing time stamp use atime not ctime please.
ctime has its uses. atime was just a mistake and is nearly useless.

And with noatime we already have creation time semantics for atime.

utz

2010-07-16 06:28:31

by Mark Harris

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

> struct xstat_time {
> 	unsigned long long tv_sec, tv_nsec;
> };

unsigned? Existing filesystems support on-disk timestamps
representing times prior to the epoch.
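
To make the concern concrete, here is a minimal sketch (the helper names are hypothetical and not part of the patch): a timestamp one day before the epoch has tv_sec = -86400, and once that value is stored in an unsigned field it compares as later than the epoch instead of earlier.

```c
#include <stdint.h>

/* Ordering test on signed seconds: pre-epoch times sort before the epoch. */
int earlier_signed(int64_t a_sec, int64_t b_sec)
{
	return a_sec < b_sec;
}

/*
 * The same test on unsigned seconds. A negative value converted to
 * uint64_t wraps to a huge number, so it sorts after every real date.
 */
int earlier_unsigned(uint64_t a_sec, uint64_t b_sec)
{
	return a_sec < b_sec;
}
```

With a signed field, earlier_signed(-86400, 0) holds; with the unsigned layout the same timestamp no longer sorts before the epoch.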

2010-07-30 17:55:00

by Phil Pishioneri

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On 7/22/10 2:59 PM, Trond Myklebust wrote:
> The fact remains that most of us would be hard pressed to name an
> application

Microsoft Office?

> that requires you to share the same dataset to both
> Windows/CIFS and posix NFS clients.

NFS client: Mac OS X (NFSv3, since v4 on it is still alpha *cough*).

> tends to discourage mixing the two environments.

Or is "discourage" not a strong enough term to describe that we shouldn't
be doing this?

-Phil

2010-07-22 18:04:53

by Volker Lendecke

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 11:02:04AM -0700, Jeremy Allison wrote:
> On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
> >
> > That would make it impossible to export the filesystem with NFSv2 and
> > v3. They do rely on ctime checking for certain operations (e.g. deciding
> > when to invalidate access and acl caches). NFSv4 needs this too if the
> > filesystem has no dedicated change attribute.
> >
> > Still, I suppose the market for exporting the same filesystem with both
> > NFS and Samba is limited...
>
> Ask NetApp about that :-). They have built a rather large
> business on just that fact :-).

Jeremy, how many hours have you spent getting "posix
locking" to the point where it is now? :-)

Volker

P.S: For those not aware, "posix locking = yes" is
cross-protocol byte range locking done by smbd to co-operate
with local processes and NFS.

2010-07-22 16:48:35

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 9:27 AM, Jeremy Allison wrote:
> On Thu, Jul 22, 2010 at 08:47:46AM -0700, Linus Torvalds wrote:
>> Tell me why we shouldn't just do this right?
>
> No, ctime isn't the same as Windows "create time".

Umm. What kind of reading problems do you guys have?

I know effin well that ctime isn't the same as Windows create time.
THAT WAS MY POINT.

But the fact is, the Unix ctime semantics are insane and largely
useless. There's a damn good reason almost nobody uses ctime under
unix.

So what I'm suggesting is that we have a flag - either per-process or
per-mount - that just says "use windows semantics for ctime".

And yes, I'm very aware that the "c" in ctime doesn't stand for
"create". But anybody who points that out is - once more - totally
missing the point. My point is that we have three timestamps, and
windows wants three timestamps (somebody claims that NTFS has four
timestamps, but the Windows file time access functions certainly only
show three times, so any potential extra on-disk times have no
relevance because they are invisible to pretty much everybody). We can
have unix semantics for mtime/atime/ctime, or we can have windows
semantics for those three values.

So let's say that we introduce a mount flag that says
"ctime=winctime", which basically just sets a flag that instead of
changing ctime on chmod/chown/etc, it just changes mtime instead (or,
as mentioned, we could make it a process flag instead).

Let's face it, Unix semantics are not sacred. Especially not
something like ctime, which is pretty damn useless. If you're a samba
server, why not just say "let's do ctime the way windows does creation
times", and let it be at that?

I personally think that Unix ctime is insane. There is no real reason
why "write()" should change mtime, but "chmod" changes ctime. It was
just a random decision way back when, and it's clearly not what samba
wants, and it's equally clearly not what even most _unix_ people want
(just google for "ctime" and "creation time", and watch the confusion
- exactly because unix semantics are simply _random_ and odd semantics
in this area)
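
The write()/chmod() split described above can be observed from user space. The following is a small sketch under stated assumptions: the helper name chmod_timestamps() and the scratch path passed to it are made up for illustration, and it relies only on POSIX open(), write(), chmod() and stat().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch file at 'path', chmod() it, and capture stat() results
 * from before and after the chmod(). The file is removed again on return.
 * Returns 0 on success, -1 on any failure.
 */
int chmod_timestamps(const char *path, struct stat *before, struct stat *after)
{
	int rc = -1;
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, "data", 4) == 4 &&   /* content change: sets mtime */
	    stat(path, before) == 0 &&
	    chmod(path, 0600) == 0 &&      /* metadata change: sets ctime only */
	    stat(path, after) == 0)
		rc = 0;
	close(fd);
	unlink(path);
	return rc;
}
```

Comparing before->st_mtim with after->st_mtim shows the chmod() left mtime alone, while st_ctime is free to move forward.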

I would not be at all surprised if it turns out that people might want
to really turn ctime into creation time (with the mount flag or
whatever) even if they are _not_ running samba.

An added issue is that most filesystems simply don't have more than
three times (and some obviously have not even that, but that's true in
Windows too). So re-using ctime actually means that this scheme would
work a whole lot better than some crazy xstat() interface that doesn't
support common filesystems anyway.

Linus

2010-07-31 21:20:26

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Saturday 2010-07-31 21:03, Trond Myklebust wrote:
>On Sat, 2010-07-31 at 12:41 -0600, Andreas Dilger wrote:
>> On 2010-07-30, at 12:11, Trond Myklebust wrote:
>> > Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
>> > They both interoperate just fine with Samba, and would presumably
>> > continue to do so if someone were to decide to reuse the ctime field on
>> > your Samba box as storage for a create time.
>>
>> CIFS doesn't support symlinks (they just appear as the referenced file), so I've had applications that scan the filesystem recurse indefinitely due to symlinked directories on a CIFS share appearing as hard-linked directories on the client. This doesn't happen when the filesystem is accessed via NFS.
>
>Sigh... So please explain how it would be useful to export that
>particular filesystem through _both_ CIFS and NFS?

Seems like a reasonable case for, say, a public "ftp server". For
example, I keep ftp5.gwdg.de:/ftp/pub mounted, that's a little more
convenient than always having to start an ftp client.

Conversely, since NFS is, well, non-existent on Windows, one would
use CIFS there (if ftp5 offered it) to get the same convenience.

2010-07-16 10:24:53

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Mark Harris wrote:

> > struct xstat_time {
> > 	unsigned long long tv_sec, tv_nsec;
> > };
>
> unsigned? Existing filesystems support on-disk timestamps
> representing times prior to the epoch.

I suppose it doesn't hurt to make it signed. It's large enough...

Looking at it again, having a 64-bit field for tv_nsec is overkill. It can't
(or shouldn't) exceed 999,999,999 - well within the capability of a 32-bit
unsigned integer.

So how about using up the dead space for what Steve French wanted:

| One hole that this reminded me about is how to return the superblock
| time granularity (for NFSv4 this is attribute 51 "time_delta" which
| is called on a superblock not on a file). We run into time rounding
| issues with Samba too.

By doing something like:

struct xstat_time {
	signed long long	tv_sec;
	unsigned int		tv_nsec;
	unsigned short		tv_granularity;
	unsigned short		tv_gran_units;
};

Where tv_granularity is the minimum granularity for tv_sec and tv_nsec given
as a quantity of tv_gran_units. tv_gran_units could then be a constant, such
as:

XSTAT_NANOSECONDS_GRANULARITY
XSTAT_MICROSECONDS_GRANULARITY
XSTAT_MILLISECONDS_GRANULARITY
XSTAT_SECONDS_GRANULARITY
XSTAT_MINUTES_GRANULARITY
XSTAT_HOURS_GRANULARITY
XSTAT_DAYS_GRANULARITY

So, for example, FAT times have a 2s granularity, so FAT would set
tv_granularity to 2 and tv_gran_units to XSTAT_SECONDS_GRANULARITY.
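
As a rough sketch of how a consumer might interpret those two fields (the enum values and the helper xstat_granularity_ns() are assumptions made for illustration; only the field and constant names come from the proposal above):

```c
#include <stdint.h>

/* Unit codes named as in the proposal; the numeric values are arbitrary. */
enum xstat_gran_units {
	XSTAT_NANOSECONDS_GRANULARITY,
	XSTAT_MICROSECONDS_GRANULARITY,
	XSTAT_MILLISECONDS_GRANULARITY,
	XSTAT_SECONDS_GRANULARITY,
	XSTAT_MINUTES_GRANULARITY,
	XSTAT_HOURS_GRANULARITY,
	XSTAT_DAYS_GRANULARITY,
};

struct xstat_time {
	int64_t  tv_sec;
	uint32_t tv_nsec;
	uint16_t tv_granularity;	/* quantity of tv_gran_units */
	uint16_t tv_gran_units;		/* one of enum xstat_gran_units */
};

/* Granularity in nanoseconds, or 0 if the unit code is not recognised. */
uint64_t xstat_granularity_ns(const struct xstat_time *t)
{
	static const uint64_t ns_per_unit[] = {
		1ULL, 1000ULL, 1000000ULL, 1000000000ULL,
		60ULL * 1000000000ULL,
		3600ULL * 1000000000ULL,
		86400ULL * 1000000000ULL,
	};

	if (t->tv_gran_units >= sizeof(ns_per_unit) / sizeof(ns_per_unit[0]))
		return 0;
	return (uint64_t)t->tv_granularity * ns_per_unit[t->tv_gran_units];
}
```

For the FAT case, tv_granularity = 2 with XSTAT_SECONDS_GRANULARITY works out to 2,000,000,000 ns, and the struct still occupies 16 bytes on common ABIs.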

We could even support picosecond granularity if we made tv_nsec a 5-byte
field (tv_psec):

struct xstat_time {
	signed long long	tv_sec;
	unsigned long long	tv_gran_units : 8;
	unsigned long long	tv_granularity : 16;
	unsigned long long	tv_psec : 40;
};

but that's probably excessive. Does any filesystem we currently support need
that?

David

2010-07-16 11:03:00

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Friday 16 July 2010, David Howells wrote:
> Mark Harris wrote:
> So how about using up the dead space for what Steve French wanted:
>
> | One hole that this reminded me about is how to return the superblock
> | time granularity (for NFSv4 this is attribute 51 "time_delta" which
> | is called on a superblock not on a file). We run into time rounding
> | issues with Samba too.
>
> By doing something like:
>
> struct xstat_time {
> 	signed long long	tv_sec;
> 	unsigned int		tv_nsec;
> 	unsigned short		tv_granularity;
> 	unsigned short		tv_gran_units;
> };

I like that!

> Where tv_granularity is the minimum granularity for tv_sec and tv_nsec given
> as a quantity of tv_gran_units. tv_gran_units could then be a constant, such
> as:
>
> XSTAT_NANOSECONDS_GRANULARITY
> XSTAT_MICROSECONDS_GRANULARITY
> XSTAT_MILLISECONDS_GRANULARITY
> XSTAT_SECONDS_GRANULARITY
> XSTAT_MINUTES_GRANULARITY
> XSTAT_HOURS_GRANULARITY
> XSTAT_DAYS_GRANULARITY
>
> So, for example, FAT times have a 2s granularity, so FAT would set
> tv_granularity to 2 and tv_gran_units to XSTAT_SECONDS_GRANULARITY.

You could also define the tv_gran_units to be power-of-ten nanoseconds,
making it a decimal floating point number like

enum {
	XSTAT_NANOSECONDS_GRANULARITY	= 0,
	XSTAT_MICROSECONDS_GRANULARITY	= 3,
	XSTAT_MILLISECONDS_GRANULARITY	= 6,
	XSTAT_SECONDS_GRANULARITY	= 9,
};

That would make it easier to define an xstat_time_before() function, though
it means that you could no longer do XSTAT_MINUTES_GRANULARITY and
higher directly other than { .tv_gran_units = 10, .tv_granularity = 6, }.
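
A possible shape for such an xstat_time_before(), as a sketch only. It assumes each timestamp was truncated down to a granule boundary, so it stands for the interval [t, t + granule); it also assumes times close enough to the epoch to fit in 64-bit nanoseconds, and it clamps exponents above 12 to stay within int64_t. The helpers granule_ns(), to_ns() and xstat_time_maybe_equal() are hypothetical names.

```c
#include <stdint.h>

struct xstat_time {
	int64_t  tv_sec;
	uint32_t tv_nsec;
	uint16_t tv_granularity;	/* mantissa */
	uint16_t tv_gran_units;		/* power-of-ten exponent, in nanoseconds */
};

/* Granule width in ns: tv_granularity * 10^tv_gran_units. */
int64_t granule_ns(const struct xstat_time *t)
{
	int64_t g = t->tv_granularity;
	unsigned int exp = t->tv_gran_units > 12 ? 12 : t->tv_gran_units;
	unsigned int i;

	for (i = 0; i < exp; i++)
		g *= 10;
	return g;
}

/* Timestamp as nanoseconds since the epoch. */
int64_t to_ns(const struct xstat_time *t)
{
	return t->tv_sec * 1000000000LL + t->tv_nsec;
}

/* True only if a's whole granule ends at or before b's timestamp. */
int xstat_time_before(const struct xstat_time *a, const struct xstat_time *b)
{
	return to_ns(a) + granule_ns(a) <= to_ns(b);
}

/* True if neither is certainly earlier, i.e. the granules overlap. */
int xstat_time_maybe_equal(const struct xstat_time *a,
			   const struct xstat_time *b)
{
	return !xstat_time_before(a, b) && !xstat_time_before(b, a);
}
```

Under these assumptions a FAT timestamp of 100s with a 2s granule and an NFS timestamp of 101.5s compare as possibly equal, while { .tv_gran_units = 10, .tv_granularity = 6 } gives the 60s granule mentioned above.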

> We could even support picosecond granularity if we made tv_nsec a 5-byte
> field (tv_psec):
>
> struct xstat_time {
> 	signed long long	tv_sec;
> 	unsigned long long	tv_gran_units : 8;
> 	unsigned long long	tv_granularity : 16;
> 	unsigned long long	tv_psec : 40;
> };
>
> but that's probably excessive. Does any filesystem we currently support need
> that?

I wouldn't even go that far if we needed sub-ns (I don't think we do), because
that breaks old compilers that cannot do bit fields.

Arnd

2010-07-22 16:59:22

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
> On Thu, Jul 22, 2010 at 9:27 AM, Jeremy Allison wrote:
> > On Thu, Jul 22, 2010 at 08:47:46AM -0700, Linus Torvalds wrote:
> >> Tell me why we shouldn't just do this right?
> >
> > No, ctime isn't the same as Windows "create time".
>
> Umm. What kind of reading problems do you guys have?
>
> I know effin well that ctime isn't the same as Windows create time.
> THAT WAS MY POINT.
>
> But the fact is, the Unix ctime semantics are insane and largely
> useless. There's a damn good reason almost nobody uses ctime under
> unix.
>
> So what I'm suggesting is that we have a flag - either per-process or
> per-mount - that just says "use windows semantics for ctime".
>
> And yes, I'm very aware that the "c" in ctime doesn't stand for
> "create". But anybody who points that out is - once more - totally
> missing the point. My point is that we have three timestamps, and
> windows wants three timestamps (somebody claims that NTFS has four
> timestamps, but the Windows file time access functions certainly only
> show three times, so any potential extra on-disk times have no
> relevance because they are invisible to pretty much everybody). We can
> have unix semantics for mtime/atime/ctime, or we can have windows
> semantics for those three values.
>
> So let's say that we introduce a mount flag that says
> "ctime=winctime", which basically just sets a flag that instead of
> changing ctime on chmod/chown/etc, it just changes mtime instead (or,
> as mentioned, we could make it a process flag instead).
>
> Let's face it, Unix semantics are not sacred. Especially not
> something like ctime, which is pretty damn useless. If you're a samba
> server, why not just say "let's do ctime the way windows does creation
> times", and let it be at that?
>
> I personally think that Unix ctime is insane. There is no real reason
> why "write()" should change mtime, but "chmod" changes ctime. It was
> just a random decision way back when, and it's clearly not what samba
> wants, and it's equally clearly not what even most _unix_ people want
> (just google for "ctime" and "creation time", and watch the confusion
> - exactly because unix semantics are simply _random_ and odd semantics
> in this area)
>
> I would not be at all surprised if it turns out that people might want
> to really turn ctime into creation time (with the mount flag or
> whatever) even if they are _not_ running samba.
>
> An added issue is that most filesystems simply don't have more than
> three times (and some obviously have not even that, but that's true in
> Windows too). So re-using ctime actually means that this scheme would
> work a whole lot better than some crazy xstat() interface that doesn't
> support common filesystems anyway.

That would make it impossible to export the filesystem with NFSv2 and
v3. They do rely on ctime checking for certain operations (e.g. deciding
when to invalidate access and acl caches). NFSv4 needs this too if the
filesystem has no dedicated change attribute.

Still, I suppose the market for exporting the same filesystem with both
NFS and Samba is limited...

Cheers
Trond

2010-07-20 08:28:27

by Andreas Dilger

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On 2010-07-19, at 11:46, Linus Torvalds wrote:
> On Mon, Jul 19, 2010 at 10:26 AM, David Howells wrote:
>> I suspect they would, though maybe they can say otherwise. What about SMB
>> directory enumeration? I believe that is effectively getdents-with-stat.
>> Having to do open+stat for each file for that would be painful.
>
> Yeah, but do you need xstat information at all for something like
> that? Most people try very hard to make do with the information
> returned by readdir itself (d_type and inode number), because if you
> end up looking up each name you've already pretty much lost in a
> performance model.

This lightweight stat() interface is exactly what is needed for things like "color ls", which is the default on all distros today. "ls --color" always does a stat on the file just to get the file mode to color executable files differently. For Lustre and other distributed filesystems, getting things like the current file size is hard work (i.e. multiple RPCs per file), yet "ls" doesn't care about the size or modification times unless "ls -l" is used. Same goes for "find".
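
For comparison, the readdir-only approach Linus describes can be sketched as below: rely on d_type and fall back to a stat call only when the filesystem reports DT_UNKNOWN. The function name count_subdirs() is made up for illustration. Note that d_type carries the file type but not the permission bits, which is why a colorizing "ls" still ends up calling stat() per file.

```c
#define _DEFAULT_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Count the subdirectories of 'dir' using d_type where available.
 * *stat_calls receives the number of fallback stat calls that were needed.
 * Returns the count, or -1 if the directory cannot be opened.
 */
int count_subdirs(const char *dir, int *stat_calls)
{
	DIR *d;
	struct dirent *e;
	int n = 0;

	*stat_calls = 0;
	d = opendir(dir);
	if (!d)
		return -1;

	while ((e = readdir(d)) != NULL) {
		if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
			continue;
		if (e->d_type == DT_DIR) {
			n++;			/* answered by readdir alone */
		} else if (e->d_type == DT_UNKNOWN) {
			struct stat st;

			(*stat_calls)++;	/* filesystem gave no d_type */
			if (fstatat(dirfd(d), e->d_name, &st,
				    AT_SYMLINK_NOFOLLOW) == 0 &&
			    S_ISDIR(st.st_mode))
				n++;
		}
	}
	closedir(d);
	return n;
}
```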

> (And I do agree that a "readdirplus()" is probably something that a
> lot of server people would find useful, but obviously that's another
> cross-filesystem nightmare. Only a few filesystems can cheaply give
> you anything but d_type/d_ino, and not all do even that),

Having a readdirplus() syscall would be even better, but again only with the ability to request specific attributes. Otherwise the filesystem may be doing a lot of extra work to collect all of the file attributes, and then userspace will probably be throwing most of them away.

Cheers, Andreas

2010-07-22 18:41:05

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 1:24 PM, Linus Torvalds wrote:

<snip>

> and I seriously doubt that Windows people
> complain a lot about the fact that there you have mtime for metadata
> changes too.

But Windows doesn't work that way, I'm fairly sure.

Windows' mtime is only affected by file content updates. (I don't
know about xattr updates).

If you look at the first and fourth rows of the table at:

http://blogs.sans.org/computer-forensics/2010/04/12/windows-7-mft-entry-timestamp-properties/

You see that there are a number of activities that update the "$STD
Info MFT Entry Modified Field" that don't update the "$STD Info
Modification Time"

Again, "$STD Info MFT Entry Modified Field" has semantics close to linux ctime.

And "$STD Info Modification Time" similar to mtime.

I don't know if there are APIs to present MFT Entry Modified to user
space or if Samba uses that info. I just know it's part of the
on-disk NTFS filesystem data.

Greg

2010-07-22 17:36:24

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 19:16, Trond Myklebust wrote:
>> >But the fact is, the Unix ctime semantics are insane and largely
>> >useless. There's a damn good reason almost nobody uses ctime under
>> >unix.
>>
>> I beg to differ. ctime is not completely useless. It reflects changes on
>> the inode for when you don't change the content. It's like an mtime
>> for the metadata. It comes in useful when you go around in your filesystem
>> trying to figure out who of your co-admins screwed up the permissions on
>> /etc/passwd... and if the mtime is the same as that of the last backup,
>> I can at least have a reasonable assurance that it was /only/ the
>> metadata that was tampered with. (SHA1 check, yeah yeah, costly on large
>> files.)
>
>Errr... Only if you eliminate utimes() from your syscall table.
>Otherwise it is trivial to reset the mtime after changing the file
>contents.
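
Trond's objection is easy to reproduce. This is a minimal sketch, with modify_and_hide() as a hypothetical helper name: it appends to a file and then puts mtime back with utimensat(). The utimes family offers no way to set ctime, so ctime still records that something happened.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Append one byte to 'path', then reset its mtime to 'saved_mtime'.
 * Returns 0 on success, -1 on failure.
 */
int modify_and_hide(const char *path, time_t saved_mtime)
{
	struct timespec ts[2];
	int fd = open(path, O_WRONLY | O_APPEND);

	if (fd < 0)
		return -1;
	if (write(fd, "x", 1) != 1) {
		close(fd);
		return -1;
	}
	close(fd);

	ts[0].tv_sec = 0;
	ts[0].tv_nsec = UTIME_OMIT;	/* leave atime untouched */
	ts[1].tv_sec = saved_mtime;	/* pretend nothing was written */
	ts[1].tv_nsec = 0;
	return utimensat(AT_FDCWD, path, ts, 0);
}
```

After the call, stat() reports the old mtime even though the size has grown, so an unchanged mtime by itself says little about the contents.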

Well yes; I had implicitly assumed that evil people with malicious intent
are absent.

2010-07-23 09:22:13

by Björn JACKE

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On 2010-07-22 at 21:21 -0400 Ted Ts'o sent off:
> Well, not POSIX, because POSIX doesn't have CreationTime at all.
> BSD's birthtime doesn't allow it to be set, and the question here is
> largely philosophical.

actually, it can (partly :). But the way it can be done is an insane hack:

<quote "http://ace.delos.com/kirk/">
To provide a sensible birth time for applications that are unaware of the birth
time attribute, we changed the semantics of the "utimes" system call so that if
the birth time was newer than the value of the modification time that it was
setting, it sets the birth time to the same time as the modification time. An
application that is aware of the birth time attribute can set both the birth
time and the modification time by doing two calls to "utimes". First it calls
"utimes" with a modification time equal to the saved birth time, then it calls
"utimes" a second time with a modification time equal to the (presumably newer)
saved modification time.
</quote>

Thus it can only be set further into the past.
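
The rule in the quote can be modelled in a few lines. This is a user-space simulation only, not BSD code; struct sim_inode, sim_utimes() and sim_restore() are invented names.

```c
#include <stdint.h>

/* Toy inode holding just the two timestamps involved. */
struct sim_inode {
	int64_t birthtime;
	int64_t mtime;
};

/*
 * Model of the quoted "utimes" semantics: if the birth time is newer than
 * the mtime being set, the birth time is pulled back to that mtime.
 */
void sim_utimes(struct sim_inode *ino, int64_t new_mtime)
{
	if (ino->birthtime > new_mtime)
		ino->birthtime = new_mtime;
	ino->mtime = new_mtime;
}

/* The two-call sequence a birthtime-aware application would use. */
void sim_restore(struct sim_inode *ino, int64_t saved_birth,
		 int64_t saved_mtime)
{
	sim_utimes(ino, saved_birth);	/* drags birthtime back */
	sim_utimes(ino, saved_mtime);	/* then sets the real mtime */
}
```

Restoring (birth 1000, mtime 1500) onto a file created at time 2000 yields exactly those values, whereas trying to restore a birth time of 3000 leaves it at 2000.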

Cheers
Björn
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen

2010-07-15 21:53:36

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Arnd Bergmann wrote:

> > This was initially proposed as a set of xattrs, but the general preference
> > is for an extended stat structure.
>
> I don't think I'd call this general preference. Three of the four
> are fixed length and could easily be done inside the structure if you
> leave a bit of space instead of a variable-length field at the end.

?

Maybe I wasn't clear: I meant having an extended stat() syscall rather than
using a bunch of getxattr()'s was the general preference.

> For the volume id, I could not find any file system that requires more
> than 32 bytes here, which is also reasonable to put into the structure.
> Make it 36 if you want to cover ascii encoded UUIDs.

You should also include a length. Volume IDs may be binary rather than
NUL-terminated strings.
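
A sketch of what a length-carrying volume ID could look like. The struct layout, the 36-byte limit (taken from the UUID remark above) and the helper name are assumptions for illustration, not the patch's definition.

```c
#include <stdint.h>
#include <string.h>

#define VOLUME_ID_MAX 36	/* hypothetical cap, enough for an ASCII UUID */

struct volume_id_sketch {
	uint8_t len;			/* number of valid bytes in id[] */
	uint8_t id[VOLUME_ID_MAX];	/* may contain embedded NUL bytes */
};

/* Compare by explicit length with memcmp(), never with strcmp(). */
int volume_id_equal(const struct volume_id_sketch *a,
		    const struct volume_id_sketch *b)
{
	if (a->len > VOLUME_ID_MAX || a->len != b->len)
		return 0;
	return memcmp(a->id, b->id, a->len) == 0;
}
```

Two binary IDs that differ only after an embedded zero byte would look identical to a string comparison, but are distinguished here.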

> That's at most 60 bytes for the extensions you're considering already,
> plus the 152

160, I think.

> you have already is still less than a cache line on
> some machines. Padding it to 256 bytes would make it nice and round,
> if you want to be really sure, make it 384 bytes.

Which we currently allocate on the kernel stack, plus up to a couple of kstat
structs if something like eCryptFS is used. Admittedly, the base xstat struct
could be kmalloc()'d instead, but why use up all that space if you don't need
it?

David

2010-07-17 05:51:35

by Mark Harris

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

David Howells wrote:
> With a 2:2 split between exponent
> (tv_gran_units) and mantissa (tv_granularity), you can do:
>
> UNIT          SECONDS/UNIT  EXPONENT  MANTISSA
> nanoseconds   0.000000001   -9        1
> microseconds  0.000001      -6        1
> milliseconds  0.001         -3        1
> seconds       1             0         1
> minutes       60            1         6
> hours         3600          2         36
> days          86400         2         864
> weeks         604800        2         6048

At least for the in-tree filesystems, I do not see any that keep
timestamps with a granularity larger than 2s. For that, a simple
32-bit tv_granularity in nanoseconds (not limited to 1e9) would
suffice, and there is no need for the complexity of dealing with
a separate exponent.

If there is a need to handle larger granularity, its msb could
potentially be used to indicate that the number is in seconds
instead of nanoseconds. This is convenient because the timestamp
is already broken down into sec and nsec fields. So this bit would
then indicate that the granularity applies to the tv_sec field, and
that tv_nsec is not in use. But even this is overkill if no one
uses a granularity larger than 2s.
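
To make the encoding concrete, a small sketch follows. The flag name XSTAT_GRAN_IN_SECONDS and the decoder are hypothetical. Reserving the msb leaves 31 bits for nanosecond values, which still covers FAT's 2s (2,000,000,000 ns is below 2^31).

```c
#include <stdint.h>

/* Hypothetical flag: when set, the low 31 bits count seconds, not ns. */
#define XSTAT_GRAN_IN_SECONDS 0x80000000u

/* Split an encoded granularity into whole seconds and leftover ns. */
void xstat_gran_decode(uint32_t g, uint64_t *sec, uint32_t *nsec)
{
	if (g & XSTAT_GRAN_IN_SECONDS) {
		*sec = g & ~XSTAT_GRAN_IN_SECONDS;
		*nsec = 0;		/* tv_nsec is not in use */
	} else {
		*sec = g / 1000000000u;
		*nsec = g % 1000000000u;
	}
}
```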

- Mark

2010-07-22 10:35:04

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-15 04:17, David Howells wrote:
> (6) BSD stat compatibility: Including more fields from the BSD stat such as
> creation time (st_btime) and inode generation number (st_gen) [Jeremy
> Allison, Bernd Schubert].
>where st_btime is the file creation time, st_gen is the inode generation
>(i_generation), st_data_version is the data version number (i_version),
>st_inode_flags is the flags from FS_IOC_GETFLAGS plus some extras,
>request_mask and st_result_mask are bitmasks of data desired/provided and
>st_extra_results[] is where as-yet undefined fields are appended.

Linux already has a creation time field, it's called otime (there is no "b" in
"creation"), and you will find scattered fragments of that all over the kernel
(foremost, fs/jfs/, now btrfs, and I also notice sysvipc having something with
that name).

> struct xstat_time {
> 	unsigned long long tv_sec, tv_nsec;
> };

If it helps getting rid of the ugly suseconds_t in userspace ;-)

2010-07-19 16:15:45

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Linus Torvalds wrote:

> - that whole xstat buffer handling is just a mess. I think you
> already fixed the "xstat_parameters" crud and just made it a simple
> unsigned long and a direct argument,

I was thinking more of an unsigned int argument, since it can't have more than
32 flags in it if it is also to work on 32-bit arches.

> but the "buffer+buflen" thing is still disgusting.
>
> Why not just leave a few empty fields at the end, and make the rule
> be: "We don't just add random crap, so don't expect it to grow widely
> in the future".

Because it gets allocated on the kernel stack. It's already 160 bytes, and
expanding it will eat more kernel stack space. Now, I can offset that by: (a)
embedding it in struct kstat so that we allocate less stack space in xstat()
overall, and (b) allocating kstat/xstat structs with kmalloc() rather than on
the stack in all the stat syscalls.

> - you use "long long" all over the place. Don't do that. If you want
> a fixed size, say so, and use "u64/s64". That's the _real_ fixed size,
> and "long long" just _happens_ to be the same size on all current
> architectures.

I was following struct stat/stat64 in arch/x86/include/asm/stat.h which do the
same. Also, if this is going to be seen by userspace, isn't it better to use
uint32_t and suchlike?

> - why create that new kind of xstat() that realistically absolutely
> nobody will use outside of some very special cases, and that has no
> real advantages for 99.9% of all people?

The new information is useful for some cases. Samba for example. At least
two of the fields I'm adding are also made available through BSD's stat()
call, and will automatically be used for some things by autoconf magic if they
become available.

I'm still trying to get a handle on what people think will be truly useful. I
can see things *could* be useful, particularly to GUI file managers and ls,
but not everyone is of the same opinion.

Perhaps you or others can offer answers to the following questions as these
might help:

(1) Should I offer information that's effectively free to come by, but could
be got through:

(a) An extra statfs() call - such as whether a file is remote, whether
it's some kernel special file? Or what the volume label is for this
file?

(b) An extra getxattr() call - such as a file's security label.

(c) An extra ioctl() call - such as FS_IOC_GETFLAGS.

(2) Should I offer information that's appropriate to non-UNIX filesystems
such as FAT, NTFS or CIFS. Some of this may map onto other fields, such
as FS_IOC_GETFLAGS.

(3) Should I offer information about which results that I've returned are
actually useful, as opposed to being fabricated on the spot? Such as
UID/GID in FAT or blocks in UBIFS. This may be of use to df or a GUI.
For instance, a GUI, seeing that UID/GID aren't useful, could ask the
filesystem to provide information about what it considers to be valid
ownership information.
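
One way question (3) could be answered with the existing result mask, sketched purely as an illustration: the filesystem still fills in fabricated fields so old callers keep working, but leaves their bits clear in st_result_mask. All names below other than st_result_mask are hypothetical.

```c
#include <stdint.h>

/* Hypothetical request/result bits for this sketch. */
#define SKETCH_SIZE 0x1u
#define SKETCH_UID  0x2u
#define SKETCH_GID  0x4u

struct xstat_sketch {
	uint32_t st_result_mask;	/* bits set only for genuine data */
	uint32_t st_uid;
	uint32_t st_gid;
	uint64_t st_size;
};

/*
 * A FAT-like filesystem: the size is stored on disk, while uid/gid are
 * made up from mount options, so only SKETCH_SIZE is reported as real.
 */
void fatlike_getattr(uint32_t request_mask, uint32_t mount_uid,
		     uint32_t mount_gid, uint64_t size,
		     struct xstat_sketch *out)
{
	out->st_size = size;
	out->st_uid = mount_uid;	/* fabricated, but still filled in */
	out->st_gid = mount_gid;
	out->st_result_mask = request_mask & SKETCH_SIZE;
}

/* What a GUI would check before displaying an ownership column. */
int field_is_real(const struct xstat_sketch *st, uint32_t flag)
{
	return (st->st_result_mask & flag) != 0;
}
```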

> You could make it a "atomic stat+open" by replacing the useless
> "size" return value with a "fd" return value, add a flag saying "we're
> also interested in opening it" (in the same result set flags), and
> instead of that stupid "buflen" input, give the "mode" input that open
> needs.

Which would be used by even fewer people, I suspect. However, it's certainly
an interesting idea. I suspect it doesn't gain much over open()+fstat()
though, given that you'd still have to do most of the work of fstat() after
doing the open() thing anyway.

Also, I'm not sure how much use the atomicity is, given that the file may have
changed state between the gathering of the stat data and userspace getting to
do anything with it.

> >         ssize_t ret = fxstat(unsigned fd,
>
> Quite frankly, my gut feel is that once you do "xstat(dfd, filename,
> ...)" then it's damn stupid to do a separate "fxstat()", when you
> might as well say that "xstat(dfd, NULL, ...)" is the same as
> "fxstat(fd, ...)"

This has been suggested and denounced as stupid already. That said, I agree
with you.

> Now, the difference between adding one or two system calls may not be
> huge, but just from a cleanliness angle, I really don't see the point
> of having another fstat variant when the extended xstat() already very
> naturally supports the thing. And let's face it, using a NULL path
> pointer just makes sense if you don't have a path. You already passed
> it a target file descriptor in the dfd.

Agreed.
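
The NULL-path convention can be mimicked in user space with existing POSIX calls. This is only a sketch of the calling convention; stat_at_or_fd() is a made-up wrapper and not the proposed syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * One entry point for both cases: with a NULL filename, 'dfd' itself is
 * the target (the fstat() case); otherwise 'filename' is looked up
 * relative to 'dfd' (the fstatat() case).
 */
int stat_at_or_fd(int dfd, const char *filename, struct stat *buf)
{
	if (filename == NULL)
		return fstat(dfd, buf);
	return fstatat(dfd, filename, buf, 0);
}
```

Calling it as stat_at_or_fd(fd, NULL, &st) on an open descriptor and as stat_at_or_fd(AT_FDCWD, "/", &st) on a path exercises both branches through a single function.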

David

2010-07-23 01:03:59

by tridge

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Hi Linus,

> My point is that we have three timestamps, and
> windows wants three timestamps (somebody claims that NTFS has four
> timestamps, but the Windows file time access functions certainly only
> shows three times, so any potential extra on-disk times have no
> relevance because they are invisible to pretty much everybody).

Not quite. The underlying structure available to Windows programmers
is this one:

typedef struct _FILE_BASIC_INFORMATION {
	LARGE_INTEGER CreationTime;
	LARGE_INTEGER LastAccessTime;
	LARGE_INTEGER LastWriteTime;
	LARGE_INTEGER ChangeTime;
	ULONG FileAttributes;
} FILE_BASIC_INFORMATION, *PFILE_BASIC_INFORMATION;

See http://msdn.microsoft.com/en-us/library/ff545762%28v=VS.85%29.aspx

These are the definitions:

CreationTime
Specifies the time that the file was created.
LastAccessTime
Specifies the time that the file was last accessed.
LastWriteTime
Specifies the time that the file was last written to.
ChangeTime
Specifies the last time the file was changed.

You are right that the more commonly used APIs (such as
GetFileInformationByHandle()) omit the ChangeTime field in the return
value. The ChangeTime is also not visible via the normal Windows GUI
or command line tools.

But there are APIs that are used by quite a few programs that do get
all 4 timestamps. For example, GetFileInformationByHandleEx() returns
all 4 fields. I include an example program that uses that API to show
all the timestamps below.

And yes, we think that real applications (such as Excel) look at
these values separately.

The other big difference from POSIX timestamps is that the
CreationTime is settable on Windows, and some of the windows UI
behaviour relies on this.

Cheers, Tridge

PS: Sorry for coming into this discussion so late

/*
show all 4 file times
[email protected], July 2010
*/

#define _WIN32_WINNT 0x0600

#include <stdio.h>
#include <stdlib.h>
#include "windows.h"
#include "winbase.h"

static void FileTime(const char *fname)
{
	HANDLE h;
	FILE_BASIC_INFO info;
	BOOL ret;

	h = CreateFile(
		fname, GENERIC_READ,
		FILE_SHARE_READ,
		NULL,
		OPEN_EXISTING,
		0,
		NULL
	);
	if (h == INVALID_HANDLE_VALUE) {
		printf("Unable to open %s\n", fname);
		exit(1);
	}

	ret = GetFileInformationByHandleEx(h, FileBasicInfo, &info, sizeof(info));

	if (!ret) {
		printf("Unable to get file information\n");
		exit(1);
	}

	printf("CreationTime: %llu\n", (unsigned long long)info.CreationTime.QuadPart);
	printf("LastAccessTime: %llu\n", (unsigned long long)info.LastAccessTime.QuadPart);
	printf("LastWriteTime: %llu\n", (unsigned long long)info.LastWriteTime.QuadPart);
	printf("ChangeTime: %llu\n", (unsigned long long)info.ChangeTime.QuadPart);

	CloseHandle(h);
}

int main(int argc, char* argv[])
{
	if (argc < 2) {
		printf("Usage: filetime FILENAME\n");
		exit(1);
	}

	FileTime(argv[1]);
	return 0;
}

2010-07-31 18:41:52

by Andreas Dilger

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On 2010-07-30, at 12:11, Trond Myklebust wrote:
> Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
> They both interoperate just fine with Samba, and would presumably
> continue to do so if someone were to decide to reuse the ctime field on
> your Samba box as storage for a create time.

CIFS doesn't support symlinks (they just appear as the referenced file), so I've had applications that scan the filesystem recurse indefinitely due to symlinked directories on a CIFS share appearing as hard-linked directories on the client. This doesn't happen when the filesystem is accessed via NFS.

Cheers, Andreas

2010-07-22 16:07:32

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Jan Engelhardt <[email protected]> wrote:

> There just is no way currently to store creation times.

What do you mean? Ext4 and BtrFS can both do so; it's just that there's no
user interface to it.

David

2010-07-31 19:27:40

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

utz lehmann <[email protected]> wrote:

> How does this works right with noatime or relatime (which is default)?

Don't do that then.

> We had used FS-Cache with a few 10000s files cached. Doesn't it mean
> that the cleanup has to stat them all?

Yes.

> Why didn't cachefilesd managed the cache index in a separate database
> like other caches?

Because using atime is much simpler since the filesystem updates it
automatically. If you have a separate database then you have redundant
information and you need to maintain metadata integrity which has a cost, both
in terms of disk usage and performance. I'm working on it, but you don't get
it for free.

David

2010-07-16 10:46:33

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 15 July 2010, David Howells wrote:
> Arnd Bergmann <[email protected]> wrote:
>
> > > This was initially proposed as a set of xattrs, but the general preference
> > > is for an extended stat structure.
> >
> > I don't think I'd call this general preference. Three of the four
> > are fixed length and could easily be done inside the structure if you
> > leave a bit of space instead of a variable-length field at the end.
>
> ?
>
> Maybe I wasn't clear: I meant having an extended stat() syscall rather than
> using a bunch of getxattr()'s was the general preference.

Ok, I misparsed your statement there. I don't think anyone was
objecting to the use of xstat for this.

The controversial part is only how the extension happens. I would
already feel better about it if you just dropped the
'unsigned long long st_extra_results[0];' at the end and
added a comment saying that the structure may grow in the future, though
my preference would be to leave space for extensions and make it fixed length.

> > For the volume id, I could not find any file system that requires more
> > than 32 bytes here, which is also reasonable to put into the structure.
> > Make it 36 if you want to cover ascii encoded UUIDs.
>
> You should also include a length. Volume IDs may be binary rather than
> NUL-terminated strings.

Yes, maybe. There are several possible encodings for this. I was actually
thinking of fixed-length string rather than zero-terminated, but that
is possible as well. If this gets added, we need to audit every possible
use to make sure each of them is covered. My point was mostly that if we
need at most 40 bytes, it doesn't have to be variable length at all.

> > That's at most 60 bytes for the extensions you're considering already,
> > plus the 152
>
> 160, I think.

right.

> > you have already is still less than a cache line on
> > some machines. Padding it to 256 bytes would make it nice and round,
> > if you want to be really sure, make it 384 bytes.
>
> Which we currently allocate on the kernel stack, plus up to a couple of kstat
> structs if something like eCryptFS is used. Admittedly, the base xstat struct
> could be kmalloc()'d instead, but why use up all that space if you don't need
> it?

If you're worried about stack utilization, xstat could also be embedded into
kstat, like

	struct kstat {
		u64 request_mask;
		struct xstat x;
	};

Then you only need one of them on the stack for sys_xstat, or have both
struct kstat and struct stat/stat64 for the other syscalls.

Arnd
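
The fixed-length variant argued for above can be sketched like this. The layout is hypothetical and is not the struct from the patch: the 160 existing bytes are represented by an opaque stand-in, the volume ID gets a length byte plus a fixed 39-byte field (40 bytes in total), and the remainder is reserved so that the total stays at 256 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration only. */
struct xstat_fixed {
	uint8_t basic[160];	/* stand-in for the existing 160 bytes of fields */
	uint8_t volume_id_len;	/* number of valid bytes in volume_id */
	uint8_t volume_id[39];	/* binary or ASCII; 36 bytes cover a textual UUID */
	uint8_t reserved[56];	/* zero for now, room for later extensions */
};

/* The size is part of the ABI, so pin it at compile time. */
static_assert(sizeof(struct xstat_fixed) == 256, "xstat_fixed must stay 256 bytes");

/* Hypothetical helper: store a volume ID, rejecting one that does not fit. */
static int set_volume_id(struct xstat_fixed *x, const void *id, size_t len)
{
	assert(x != NULL && id != NULL);
	if (len > sizeof(x->volume_id))
		return -1;
	memset(x->volume_id, 0, sizeof(x->volume_id));
	memcpy(x->volume_id, id, len);
	x->volume_id_len = (uint8_t)len;
	return 0;
}
```

Carrying an explicit length alongside the fixed field answers the objection that volume IDs may be binary rather than NUL-terminated.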

2010-07-19 17:27:25

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Linus Torvalds <[email protected]> wrote:

> > The new information is useful for some cases. Samba for example. At
> > least two of the fields I'm adding are also made available through BSD's
> > stat() call, and will automatically be used for some things by autoconf
> > magic if they become available.
>
> .. that's a pointless argument. If the only way something gets used is
> through autoconf, then clearly nobody cares.

That's not what I meant at all. I meant there may be things out there that
will just use st_btime and st_gen as soon as they appear without anything
having to be done to them because these fields already exist in the BSD stat
struct.

Samba is such an example as this. It will use st_btime immediately if it
exists as the SMB protocol wants to pass the creation time around.

> Yeah, maybe it adds a flag to "ls", but let's face it - that isn't actually
> _buying_ anything.

Not having ls cause a mass automount just because you did an ls of a directory
full of automount points would be very nice.

> So the only thing that matters for new system calls is who actually
> really seriously wants to use the information, even if it's not there
> by default. Is it _anybody_ else than samba?

Perhaps. As previously mentioned, BSD (and other unices) already make some of
these fields available (notably st_btime and st_gen). We could also make a
BSD-compatible st_flags available.

> In other words, in the absence of some seriously generic users, it
> sounds more like an ioctl to me to ask for something like "creation
> time" or "inode version", when not all filesystems support anything
> like that.

I initially did them by getxattr(), but that didn't go down too well.

> Ask your samba people, for example, if they'd _ever_ do just a "xstat()"?

I suspect they would, though maybe they can say otherwise. What about SMB
directory enumeration? I believe that is effectively getdents-with-stat.
Having to do open+stat for each file for that would be painful.

David

2010-07-28 01:15:40

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 22 Jul 2010 10:24:17 -0700
Linus Torvalds <[email protected]> wrote:

> Or people can just use xattrs and do it all entirely in user space. I
> assume that's what samba does now, even outside of birthtime.

Much as I despise xattrs, this would definitely be my preference.

ctime and mtime have real cache-coherence semantics which require them being
updated by the kernel (whether the cache is on an NFS client, in a backup
archive, or in a .o translation of a .c file).

create-time, on the other hand, would never be updated by the kernel, and
might sometimes be updated by an application. So it is a very different sort
of attribute, much like a hypothetical 'last archived' time.

The only role the kernel might have would be setting the 'creation time' when
the file was created, but it seems even that isn't always what is wanted,
because people don't so much want the time of creation of the
container-on-disk, but the time of creation of the data-content.

I would want to see a pretty convincing use-case that cannot be solved with
xattrs before 'creation time' was added to a generic kernel interface.

So just use xattrs and don't involve the kernel in any detailed knowledge of
this value.

Maybe xstat should take a list of xattrs to be retrieved as well?? or maybe
not.

But I hope the xstat debate doesn't get bogged down about whether 'create
time' is sensible or not. Quite apart from the ability to return more
attributes, I think it has real value in being able to return fewer
attributes, and being allowed to ask for 'best guess' values. Being able to
do an 'fstat' and being certain that you won't be blocked by a non-responsive
NFS server would be a GOOD THING (TM).

NeilBrown

2010-07-22 15:16:08

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 5:17 AM, Volker Lendecke
<[email protected]> wrote:
> On Thu, Jul 22, 2010 at 01:14:47PM +0100, David Howells wrote:
>> Jan Engelhardt <[email protected]> wrote:
>>
>> > Linux already has a creation time field, it's called otime (there is no "b"
>> > in "creation"), and you will find scattered fragments of that all over the
>> > kernel (foremost, fs/jfs/, now btrfs, and I also notice sysvipc having
>> > something with that name).
>>
>> It is? It's called crtime in Ext4. st_btime, however, would be compatible
>> with BSD's stat, and Samba would just use it by way of autoconf magic if it
>> appeared.
>
> Samba has the following check:
>
> # recent FreeBSD, NetBSD have creation timestamps called birthtime:
> AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
> AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))
>
> and the supporting code around that. "birth" might also be
> where the "b" comes from :-)

Oh wow. And all of this just convinces me that we should _not_ do any
of this, since clearly it's all totally useless and people can't even
agree on a name.

Let's wait five years and see if there is actually any consensus on it
being needed and used at all, rather than rush into something just
because "we can".

Linus

2010-07-30 18:12:13

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, 2010-07-30 at 13:55 -0400, Phil Pishioneri wrote:
> On 7/22/10 2:59 PM, Trond Myklebust wrote:
> > The fact remains that most of us would be hard pressed to name an
> > application
>
> Microsoft Office?
>
> > that requires you to share the same dataset to both
> > Windows/CIFS and posix NFS clients.
>
> NFS client: Mac OS X (NFSv3, since v4 on it is still alpha *cough*).
>
> > tends to discourage mixing the two environments.
>
> Or is "discourage" not strong enough term to describe that we shouldn't
> be doing this?
>
> -Phil

Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
They both interoperate just fine with Samba, and would presumably
continue to do so if someone were to decide to reuse the ctime field on
your Samba box as storage for a create time.

Trond

2010-07-29 16:16:13

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Neil Brown <[email protected]> wrote:

> This justifies for me why a CIFS client would want to extract the
> creation-time from the CIFS protocol, but not why you want to expose it via a
> generic interface.

It would also be easier for NFSD if the creation time was in struct kstat.
It's included as an optional element in NFSv4. The same goes for the data
version number. I'm not sure about the inode generation, I suspect that's used
as part of the FH construction.

However, someone was talking about a userspace NFS daemon, and there they may
want all three bits. Even Samba may want multiple bits. Calling getxattr
multiple times per file starts to add up, even for internal values.

Consider further: NFS, for example, could be made to retrieve the creation time
from the server. This can be merged with the attribute fetch done by the
getattr() call, or it could be done separately by getxattr. Unless it's stored
in RAM, that's one NFS RPC op versus two. Okay, that's a bit of an artificial
example, but still.

> Given that we have an extensible attribute framework, it seems wrong to be
> adding new attributes to *stat. If a given filesystem wants to store certain
> attributes more efficiently, then it is welcome to intercept xattr calls and
> store (say) "cifs.birthtime" directly at a known offset in the inode.

It's not attribute storage I'm thinking about, but making attribute retrieval
more efficient.

> The flip-side of extracting these various attributes is setting them.

I acknowledge that if we went down the getxattr() route, then that
automatically makes setxattr() the obvious candidate for setting things.

But think about it another way: what if you want to set several attributes?
You have to make a bunch of setxattr() calls. But what if it were possible to
do all of chmod, chgrp, chown, truncate, utimes, set_btime, etc. all in one go,
atomically? We more or less have this internally in the kernel, and it might
stand to be exposed to userspace.

It might, for example, make untarring that little bit more efficient.

> I'm still pondering those extra flags:
> FS_SPECIAL_FL
> FS_AUTOMOUNT_FL
> FS_AUTOMOUNT_ANY_FL
> FS_REMOTE_FL
> FS_ENCRYPTED_FL
> FS_OFFLINE_FL
>
> They sound like they might be useful, they are not file-metadata (like
> btime) but rather implementation details (like st_blocks). So it is probably
> sensible to include them as you have done.

I've split these away from ioc flags as ioc flags is very ext2/3/4 centric, and
those filesystems happily create their own ioc flags sets without updating the
master set.

> If a filesystem is mounted on an network-block-device, or a loop-back of a
> file on NFS, is FS_REMOTE_FL set?
> Is ROT13 enough for FS_ENCRYPTED_FL to be set?
> If the NFS server is "not responding, still trying", should FS_OFFLINE_FL get
> set on all files?
> And I cannot even guess at the difference between the two FS_AUTOMOUNT flags.
> I'm sure it is something useful, but doco would be good. Should one of them
> be set on mountpoints that NFSv4 detects from the server?

Yeah. I have plans to write documentation for it, but I'd like to have a
clearer idea of what the interface might be before doing that.

But to give you an idea of the flags:

(*) FS_SPECIAL_FL - Kernel API file from a quasi-filesystem such as /proc or
/sys - the sort of thing you might not want to expose through NFSD.

(*) FS_AUTOMOUNT_FL - A named automount/referral point. You attempt to
transit this directory and the backing fs will mount something over the
top.

(*) FS_AUTOMOUNT_ANY_FL - A directory in which you can look up a non-existent
directory entry, which will cause that dirent to be fabricated and the
target filesystem be mounted over the top. Examples include looking up
arbitrary cell names in /afs, or arbitrary hostnames in autofs or amd
indirect mount directories.

(*) FS_REMOTE_FL - A filesystem object that is assumed not to be stored on the
computer issuing the request. It would be quite nice to have loopback NFS
not set the remote flag and to have NBD mounted filesystems set the
remote flag, but this can get quite messy with things like overmounts.

My thought is that this can be used by a GUI to choose its icons for
files.

(*) FS_ENCRYPTED_FL - A file that is stored encrypted and that presumably
needs a key providing to decrypt it. CIFS has an attribute bit for this
(ATTR_ENCRYPTED).

(*) FS_OFFLINE_FL - A file that isn't immediately available, and that requires
a connection to the data store to be made. CIFS has an attribute bit for
this (ATTR_OFFLINE). AFS has a field in its volume data and an error code
indicating that a volume is offline and cannot currently be accessed.

This could be set by network filesystems for which the network or the
server is absent for example. Especially if the lightweight stat is
requested (non-blocking in essence).
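
As a rough sketch of the GUI use mentioned under FS_REMOTE_FL, a file manager could map these flags to an icon. The flag names come from the list above, but the bit values, the icon names and icon_for() are placeholders invented for this example; the thread does not specify any of them.

```c
#include <stdint.h>

/* Placeholder bit values for illustration only. */
enum {
	FS_SPECIAL_FL       = 1u << 0,
	FS_AUTOMOUNT_FL     = 1u << 1,
	FS_AUTOMOUNT_ANY_FL = 1u << 2,
	FS_REMOTE_FL        = 1u << 3,
	FS_ENCRYPTED_FL     = 1u << 4,
	FS_OFFLINE_FL       = 1u << 5,
};

/* Hypothetical helper: choose an icon name, most restrictive state first. */
static const char *icon_for(uint32_t flags)
{
	if (flags & FS_OFFLINE_FL)
		return "offline";
	if (flags & FS_ENCRYPTED_FL)
		return "locked";
	if (flags & (FS_AUTOMOUNT_FL | FS_AUTOMOUNT_ANY_FL))
		return "mountpoint";
	if (flags & FS_REMOTE_FL)
		return "network";
	if (flags & FS_SPECIAL_FL)
		return "system";
	return "plain";
}
```

The ordering is a design choice of the sketch: a file that is both remote and offline is shown as offline, since that is what determines whether opening it will work.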

> It would probably help to keep that sort of decision process (complete with
> who to blame) documented in the change-log entry, but one never thinks of
> doing that at the time.

There have been a lot of conflicting opinions on this. I'm not sure rendering
them into a list in the change log would be that useful.

> Providing everybody imposes exactly the same semantics for "creation time"...

We can invent some for Linux. The time at which an inode is created would seem
to be a sensible course, but with the ability for the creation time to be set
by archiving tools. Overwriting an existing inode by truncating it and then
writing it should keep the creation time of the inode.

I think this would then be the same behaviour as Windows.

> "well derided" like high-mem and SMP support? or "real-time" support and
> priority inheritance?
> I guess the deriders are wrong, and will eventually realise that they are
> wrong. The difficult bit is we cannot know how long it will take them, or
> how much you have to care.

Almost everyone hates the idea of having a stat function with a variable length
buffer. To quote Linus:

the "buffer+buflen" thing is still disgusting.

You might be right, though: the deriders might be wrong; it just doesn't help
at this particular point in time.

> (unambiguous documentation!! the rest is just details)

I normally do write documentation. It's just that I don't want to have to keep
changing the docs as well as constantly rewriting the code.

David

2010-07-22 18:07:52

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 08:04:41PM +0200, Volker Lendecke wrote:
> On Thu, Jul 22, 2010 at 11:02:04AM -0700, Jeremy Allison wrote:
> > On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
> > >
> > > That would make it impossible to export the filesystem with NFSv2 and
> > > v3. They do rely on ctime checking for certain operations (e.g. deciding
> > > when to invalidate access and acl caches). NFSv4 needs this too if the
> > > filesystem has no dedicated change attribute.
> > >
> > > Still, I suppose the market for exporting the same filesystem with both
> > > NFS and Samba is limited...
> >
> > Ask NetApp about that :-). They have built a rather large
> > business on just that fact :-).
>
> Jeremy, how many hours have you spent getting "posix
> locking" to the point where it is now? :-)
>
> Volker
>
> P.S: For those not aware, "posix locking = yes" is
> cross-protocol byte range locking done by smbd to co-operate
> with local processes and NFS.

The time is counted in years, not hours :-).

2010-07-30 18:19:37

by Phil Pishioneri

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On 7/30/10 2:11 PM, Trond Myklebust wrote:
> Your Mac has a perfectly functional CIFS client,

It didn't, at one point. Some version of Mac OS X would cause a client
kernel crash when unmounting the CIFS share. I think it's been fixed,
but we had to have some OS X clients switch to NFS because of it.

-Phil

2010-07-22 13:05:24

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 14:17, Volker Lendecke wrote:
>On Thu, Jul 22, 2010 at 01:14:47PM +0100, David Howells wrote:
>> Jan Engelhardt <[email protected]> wrote:
>>
>> > Linux already has a creation time field, it's called otime (there is no "b"
>> > in "creation"), and you will find scattered fragments of that all over the
>> > kernel (foremost, fs/jfs/, now btrfs, and I also notice sysvipc having
>> > something with that name).
>>
>> It is? It's called crtime in Ext4. st_btime, however, would be compatible
>> with BSD's stat, and Samba would just use it by way of autoconf magic if it
>> appeared.

Of course you can find remnants of btime in Linux's BSD-style task
accounting, but Linux always looked more like SysV than BSD, speaking
for otime. And if you are using autoconf, the cost of using otime over
btime seems the same.

>Samba has the following check:
>
># recent FreeBSD, NetBSD have creation timestamps called birthtime:
>AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
>AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))
>
>and the supporting code around that. "birth" might also be
>where the "b" comes from :-)

Well, in all reference to the Matrix movie, files aren't born. Except
for Directory Default ACLs and possibly security labels, they usually
don't inherit either :) And on a CS level, it's more like copy than
inherit, because if the parent changes, the file does not (with the
potential exception of security relabeling, bla).

2010-07-22 16:07:08

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 12:06 PM, Greg Freemyer <[email protected]> wrote:
> On Thu, Jul 22, 2010 at 11:47 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke
>> <[email protected]> wrote:
>>>
>>> The nice thing about this is also that if this is supposed
>>> to be fully usable for Windows clients, the birthtime needs
>>> to be changeable. That's what NTFS semantics gives you, thus
>>> Windows clients tend to require it.
>>
>> Ok. So it's not really a creation date, exactly the same way ctime
>> isn't at all a creation date.
>>
>> And maybe that actually hints at a better solution: maybe a better
>> model is to create a new per-thread flag that says "do ctime updates
>> the way windows does them".
>>
>> So instead of adding another "btime" - which isn't actually what even
>> windows does - just admit that the _real_ issue is that Unix and
>> Windows semantics are different for the pre-existing "ctime".
>>
>> The fact is, windows has "access time", "modification time" and
>> "creation time" _exactly_ like UNIX. It's just that the ctime has
>> slightly different semantics in windows vs unix. So quite frankly,
>> it's totally insane to introduce a "birthtime", when that isn't even
>> what windows wants, just because people cannot face the actual real
>> difference.
>>
>> Tell me why we shouldn't just do this right?
>>
>>                Linus
>
> I haven't been keeping up with this thread, but I believe NTFS has a
> number of timestamps, not just 3.
>
> This blog post references 8 in the left hand column.
>
> The 4 standard (most common) ones are:
>
> File last access
> File last modified
> File created
> MFT last modified
>
> My understanding is that "MFT last modified" has semantics very
> similar to Linux ctime.
>
> But there is not a generic equivalent to NTFS created.
>
> Thus if trying to have the Linux kernel match NTFS semantics for the
> benefit of Samba is the goal, it seems a new field should be preferred
> instead of having linux ctime try to do different jobs.
>
> Greg

I forgot the blog post url:

http://blogs.sans.org/computer-forensics/2010/04/12/windows-7-mft-entry-timestamp-properties/

2010-07-22 19:53:40

by Benny Halevy

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Jul. 22, 2010, 21:45 +0300, Greg Freemyer <[email protected]> wrote:
> On Thu, Jul 22, 2010 at 2:21 PM, Benny Halevy <[email protected]> wrote:
>> On Jul. 22, 2010, 20:24 +0300, Linus Torvalds <[email protected]> wrote:
>>> On Thu, Jul 22, 2010 at 10:03 AM, Jan Engelhardt <[email protected]> wrote:
>>>>
>>>> I beg to differ. ctime is not completely useless. It reflects changes on
>>>> the inode for when you don't change the content.
>>>
>>> Uh. Yes. Except that why is file metadata really different from file
>>> data? Most people really don't care. And a lot of people have asked
>>> for creation dates - and I seriously doubt that Windows people
>>> complain a lot about the fact that there you have mtime for metadata
>>> changes too.
>>>
>>> The point being that Unix ctime semantics certainly have well-defined
>>> semantics, but they are in no way "better" than having a real creation
>>> time, and are often worse.
>>
>> Yeah, having create time would be important.
>> That said, having a non user-settable modify timestamp is crucial
>> for quickly determining whether a file has changed.
>
> How would "cp --archive" and a host of backup/restore tools work
> without user-settable modify timestamps?
>
> Or are you proposing another timestamp? I do computer forensics, I
> like timestamps, but enough is enough.

mtime and atime are already user settable and archive programs use
this on the destination, but ctime would be different after
copy/restore.

When updating the archive, just comparing mtime to determine if the source
changed is problematic as it can be set to any value after the change,
but src.ctime would be greater than dest.ctime in this case.

With posix semantics (http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_07)
this is not perfect either as there can be false-positives when the file stat changed but
the file has not, e.g. when st_nlink changed.

Benny

>
> Greg

2010-07-19 16:51:53

[permalink] [raw]

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Mon, Jul 19, 2010 at 9:15 AM, David Howells <[email protected]> wrote:
> Linus Torvalds <[email protected]> wrote:
>
>>  - that whole xstat buffer handling is just a mess. I think you
>> already fixed the "xstat_parameters" crud and just made it a simple
>> unsigned long and a direct argument,
>
> I was thinking more of an unsigned int argument, since it can't have more than
> 32 flags in it if it is also to work on 32-bit arches.

That's fine.

>> but the "buffer+buflen" thing is still disgusting.
>>
>>    Why not just leave a few empty fields at the end, and make the rule
>> be: "We don't just add random crap, so don't expect it to grow widely
>> in the future".
>
> Because it gets allocated on the kernel stack. It's already 160 bytes, and
> expanding it will eat more kernel stack space. Now, I can offset that by: (a)
> embedding it in struct kstat so that we allocate less stack space in xstat()
> overall, and (b) allocating kstat/xstat structs with kmalloc() rather than on
> the stack in all the stat syscalls.

Using implementation issues like that as a reason for some odd
interface that we'll have to live with for the next decades sounds
bad. It's basically a broken form of versioning, since if you end up
using buffer sizes, everybody will just use "sizeof()" except for some
random crazy developer that decides to re-use a buffer they use for
something else, and then use the size of that instead.

End result: the kernel gets passed in some random constant that
depends on just which version of glibc they were compiled against _or_
on just how crazy they were. And it all just encourages people to do
odd things. For example, the glibc developers, who love adding their
own random fields for crazy "forwards compatibility", will start
extending the xstat structure on their own and then just pass in the
larger size and emulate a few new fields à la that whole vfstat thing.
And then if/when we want to extend on it, we're screwed.

So making it fixed is not only simpler, it avoids all the "I'm passing
in random integers" crud.

You don't need to allocate the whole thing inside the kernel anyway.
Quite the reverse. You probably want to continue using the kernel
"kstat" interface with some extensions. That's the point of kstat,
after all - allowing the filesystem interfaces to share _one_
interface rather than having new interfaces at the VFS level for every
damn new stat implementation we have to do for user space.

In short, your stack space usage is all totally bogus. You should copy
the kstat to the user xstat one field at a time, and NOT allocate an
xstat on the kernel stack at all. There is no advantage to using
"memcpy_to_user()" (after having filled in the kernel struct one field
at a time) over just filling in the user struct directly.

Just do "access_ok() + several __put_user() calls", in other words.

I think you wanted to use "memcpy_to_user()" just because you had that
broken "bufsize" argument to begin with. If you get rid of the
bufsize, you also get rid of the potential for partial structures, and
all the reasons to use memcpy go away.

Just do the obvious thing.

>>  - you use "long long" all over the place. Don't do that. If you want
>> a fixed size, say so, and use "u64/s64". That's the _real_ fixed size,
>> and "long long" just _happens_ to be the same size on all current
>> architectures.
>
> I was following struct stat/stat64 in arch/x86/include/asm/stat.h which do the
> same. Also, if this is going to be seen by userspace, isn't it better to use
> uint32_t and suchlike?

The arch/x86/include/asm stuff isn't trying to be the same image on
different architectures, it's just x86[-64]-specific. But if you want
to have a cross-architectural thing, you want to use
cross-architectural types. Don't use "long long".

Yeah, we may well do it somewhere, but there's no reason to add new ones.

Another thing you should look for in things like this - make sure that
u64 is always naturally aligned. Otherwise some architectures will
align it at 4-byte boundaries (notably x86-32), while others will
align it at 8-byte boundaries (native 64-bit).
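
A minimal sketch of the alignment rule, using a made-up struct rather than the xstat layout from the patch: every 64-bit field is preceded by explicit padding where needed, so its offset is a multiple of 8 whether the compiler would align it to 4 or to 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment for illustration only. */
struct sample_stat {
	uint32_t mode;
	uint32_t pad0;	/* explicit padding: keeps size at offset 8 on every ABI */
	uint64_t size;
	uint32_t nlink;
	uint32_t pad1;	/* explicit padding before the next 64-bit field */
	uint64_t ino;
};

/* Compile-time checks that the layout is the same everywhere. */
static_assert(offsetof(struct sample_stat, size) == 8, "size must be 8-byte aligned");
static_assert(offsetof(struct sample_stat, ino) == 24, "ino must be 8-byte aligned");
static_assert(sizeof(struct sample_stat) == 32, "no hidden tail padding");

/* Hypothetical helper: is a field of the given size naturally aligned? */
static int is_naturally_aligned(size_t offset, size_t size)
{
	return size != 0 && offset % size == 0;
}
```

Without pad0, a 32-bit x86 compiler would place size at offset 4 and a 64-bit one at offset 8, giving two incompatible layouts for the same declaration.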

>>  - why create that new kind of xstat() that realistically absolutely
>> nobody will use outside of some very special cases, and that has no
>> real advantages for 99.9% of all people?
>
> The new information is useful for some cases. Samba for example. At least
> two of the fields I'm adding are also made available through BSD's stat()
> call, and will automatically be used for some things by autoconf magic if they
> become available.

.. that's a pointless argument. If the only way something gets used is
through autoconf, then clearly nobody cares. Yeah, maybe it adds a
flag to "ls", but let's face it - that isn't actually _buying_
anything.

So the only thing that matters for new system calls is who actually
really seriously wants to use the information, even if it's not there
by default. Is it _anybody_ else than samba?

That's why I asked about maybe making it "open+stat". Because that at
least _potentially_ opens things up to another class of users.

Because if it really is just samba that wants some odd crap that not
even all filesystems support, then why add a whole new xstat for it?
If nobody else clamors for it (except for people who just want new
interfaces), then it's not generic enough to be worth something like
that.

In other words, in the absence of some seriously generic users, it
sounds more like an ioctl to me to ask for something like "creation
time" or "inode version", when not all filesystems support anything
like that.

>> [open+stat]
>
> Which would be used by even fewer people, I suspect.

Umm, no. It would be used by _at_least_ as many people.

And don't get me wrong - I'm not saying "you need to make it
open+stat". I'm saying "you need to make the case that the thing is so
generically useful that it's worth a whole new system call, rather
than just a filesystem specific ioctl".

> Also, I'm not sure how much use the atomicity is, given that the file may have
> changed state between the gathering of the stat data and userspace getting to
> do anything with it.

It's a security issue. It's not atomic wrt the file being edited, but
it would be atomic wrt the filename changing. IOW, the same thing as
why web servers etc need to do "open+fstat" rather than "stat+open".

And yes, we can already do open+fstat. But exactly because it's a
fairly common pattern, and the kinds of programs that do it tend to
also care about performance, maybe they'd like a single system call.

Ask your samba people, for example, if they'd _ever_ do just a
"xstat()"? Somehow I suspect that most server kind of apps almost
always end up doing open+fstat, just because they don't want just the
stat information, and need to do the fstat in order to guarantee they
are talking about the same file.

But again - the whole "open+stat" is not because I think they need to
be done together. It's because I'm trying to see if it could make the
system call worth it at all.

>> >        ssize_t ret = fxstat(unsigned fd,
>>
>> Quite frankly, my gut feel is that once you do "xstat(dfd, filename,
>> ...)" then it's damn stupid to do a separate "fxstat()", when you
>> might as well say that "xstat(dfd, NULL, ...)" is the same as
>> "fxstat(fd, ...)"
>
> This has been suggested and denounced as stupid already. That said, I agree
> with you.

Hey, whoever denounced it as stupid obviously doesn't have the neurons
to go around to be involved in the discussion. Ignore them.

Linus

2010-07-22 18:59:47

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 2010-07-22 at 20:04 +0200, Volker Lendecke wrote:
> On Thu, Jul 22, 2010 at 11:02:04AM -0700, Jeremy Allison wrote:
> > On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
> > >
> > > That would make it impossible to export the filesystem with NFSv2 and
> > > v3. They do rely on ctime checking for certain operations (e.g. deciding
> > > when to invalidate access and acl caches). NFSv4 needs this too if the
> > > filesystem has no dedicated change attribute.
> > >
> > > Still, I suppose the market for exporting the same filesystem with both
> > > NFS and Samba is limited...
> >
> > Ask NetApp about that :-). They have built a rather large
> > business on just that fact :-).
>
> Jeremy, how many hours have you spent getting "posix
> locking" to the point where it is now? :-)
>
> Volker
>
> P.S: For those not aware, "posix locking = yes" is
> cross-protocol byte range locking done by smbd to co-operate
> with local processes and NFS.

I said "limited", not "non-existent".

The fact remains that most of us would be hard pressed to name an
application that requires you to share the same dataset to both
Windows/CIFS and posix NFS clients. Everything from ACL models through
caseless vs case-aware filesystems and Windows vs posix locking
semantics tends to discourage mixing the two environments.

Trond

2010-07-22 12:15:03

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Jan Engelhardt <[email protected]> wrote:

> Linux already has a creation time field, it's called otime (there is no "b"
> in "creation"), and you will find scattered fragments of that all over the
> kernel (foremost, fs/jfs/, now btrfs, and I also notice sysvipc having
> something with that name).

It is? It's called crtime in Ext4. st_btime, however, would be compatible
with BSD's stat, and Samba would just use it by way of autoconf magic if it
appeared.

David

2010-07-22 12:25:47

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Jan Engelhardt <[email protected]> wrote:

> >> (8) Allow the filesystem to indicate what it can/cannot provide: A
> >> filesystem can now say it doesn't support a standard stat feature if
> >> that isn't available.
> >
> >What for?
>
> Given xstat.otime=0, how would you determine whether the file is really
> tagged with a date of 1970, or whether it's just the fs which did not
> store this kind of information?

I was thinking more of stuff that's already in the Linux stat struct, some of
which is fabricated because the underlying fs doesn't support it.

Take RomFS for example: it fabricates all of st_mtime, st_atime, st_ctime,
st_nlink, st_blocks, st_uid and st_gid because none of them are stored in the
medium.

Similarly, UbiFS fabricates st_blocks and complains in a comment that it makes
no sense for that type of filesystem.

There are other examples.

David

2010-07-19 15:17:54

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Wed, Jul 14, 2010 at 7:17 PM, David Howells <[email protected]> wrote:
>
>         ssize_t ret = xstat(int dfd,
>                             const char *filename,
>                             unsigned flags,
>                             const struct xstat_parameters *params,
>                             struct xstat *buffer,
>                             size_t buflen);

Ugh. So I think this is pretty disgusting. For a few reasons:

- that whole xstat buffer handling is just a mess. I think you
already fixed the "xstat_parameters" crud and just made it a simple
unsigned long and a direct argument, but the "buffer+buflen" thing is
still disgusting.

Why not just leave a few empty fields at the end, and make the rule
be: "We don't just add random crap, so don't expect it to grow widely
in the future".

- you use "long long" all over the place. Don't do that. If you want
a fixed size, say so, and use "u64/s64". That's the _real_ fixed size,
and "long long" just _happens_ to be the same size on all current
architectures.

Put another way: "long" just _happened_ to be 32 bits way back when
on pretty much all targets. That's where all the 64-bit compatibility
mess came from. Don't make the same mistake. Besides, if the point is
to make things be the same, _document_ that point by using a type that
is explicitly sized.
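
A small sketch of the point about explicitly sized types. The structure and
its field names are hypothetical; uint64_t/int64_t are the userspace
counterparts of the kernel's u64/s64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical record: every field has an explicit width, so the layout
 * does not depend on how wide "long" or "long long" is on a given ABI.
 */
struct example_stat {
    uint64_t ino;
    uint64_t size;
    int64_t  mtime_sec;
    uint32_t mtime_nsec;
    uint32_t reserved;      /* explicit padding instead of compiler padding */
};

/* The size is part of the contract, so check it at compile time. */
static_assert(sizeof(struct example_stat) == 32,
              "layout must be the same on every architecture");
```

The static_assert documents the intent in the same way the type names do: if
a later change alters the layout, the build fails instead of the ABI.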

- why create that new kind of xstat() that realistically absolutely
nobody will use outside of some very special cases, and that has no
real advantages for 99.9% of all people?

You could make it a "atomic stat+open" by replacing the useless
"size" return value with a "fd" return value, add a flag saying "we're
also interested in opening it" (in the same result set flags), and
instead of that stupid "buflen" input, give the "mode" input that open
needs.

Tadaa! You now have something that more people might be interested
in, if only because it avoids a system call and might be a performance
win. Who knows. Ask the Wine people what strange
open-function-from-hell they are interested in.

>         ssize_t ret = fxstat(unsigned fd,

Quite frankly, my gut feel is that once you do "xstat(dfd, filename,
...)" then it's damn stupid to do a separate "fxstat()", when you
might as well say that "xstat(dfd, NULL, ...)" is the same as
"fxstat(fd, ...)"

Now, the difference between adding one or two system calls may not be
huge, but just from a cleanliness angle, I really don't see the point
of having another fstat variant when the extended xstat() already very
naturally supports the thing. And let's face it, using a NULL path
pointer just makes sense if you don't have a path. You already passed
it a target file descriptor in the dfd.

Anyway, I didn't look at whether the new xstat fields made any sense,
but I hated the interface enough that I can't be bothered to. Don't
make up baroque new things that will never be used. Make a better
argument for why anybody would use them despite the lack of
standardization etc. And make sure they are as simple as possible
(which is why I hate that "buflen" thing etc).

Linus

2010-07-22 16:25:26

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 17:47, Linus Torvalds wrote:
>On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke
><[email protected]> wrote:
>>
>> The nice thing about this is also that if this is supposed
>> to be fully usable for Windows clients, the birthtime needs
>> to be changeable. That's what NTFS semantics gives you, thus
>> Windows clients tend to require it.
>
>Ok. So it's not really a creation date, exactly the same way ctime
>isn't at all a creation date. [...]
>Tell me why we shouldn't just do this right?

Nobody said the c in ctime stands for creation. It stands for
change (you probably knew that).

$ touch this
$ stat this
File: `this'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fh/15d Inode: 106777647 Links: 1
Access: (0644/-rw-r--r--) Uid: (25121/ jengelh) Gid: ( 100/ users)
Access: 2010-07-22 18:18:52.665480058 +0200
Modify: 2010-07-22 18:18:52.665480058 +0200
Change: 2010-07-22 18:18:52.665480058 +0200

# Only change inode, not content
$ chmod 600 this
$ stat this
File: `this'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: fh/15d Inode: 106777647 Links: 1
Access: (0600/-rw-------) Uid: (25121/ jengelh) Gid: ( 100/ users)
Access: 2010-07-22 18:18:52.665480058 +0200
Modify: 2010-07-22 18:18:52.665480058 +0200
Change: 2010-07-22 18:18:58.533436339 +0200

(Solaris exhibits the very same kind of behavior.)

2010-07-22 17:16:32

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 2010-07-22 at 19:03 +0200, Jan Engelhardt wrote:
> On Thursday 2010-07-22 18:40, Linus Torvalds wrote:
>
> >On Thu, Jul 22, 2010 at 9:27 AM, Jeremy Allison <[email protected]> wrote:
> >> On Thu, Jul 22, 2010 at 08:47:46AM -0700, Linus Torvalds wrote:
> >>> Tell me why we shouldn't just do this right?
> >>
> >> No, ctime isn't the same as Windows "create time".
> >
> >Umm. What kind of reading problems do you guys have?
> >
> >I know effin well that ctime isn't the same as Windows create time.
> >THAT WAS MY POINT.
> >
> >But the fact is, the Unix ctime semantics are insane and largely
> >useless. There's a damn good reason almost nobody uses ctime under
> >unix.
>
> I beg to differ. ctime is not completely useless. It reflects changes on
> the inode when you don't change the content. It's like an mtime
> for the metadata. It comes in useful when you go around in your filesystem
> trying to figure out who of your co-admins screwed up the permissions on
> /etc/passwd... and if the mtime is the same as that of the last backup,
> I can at least have a reasonable assurance that it was /only/ the
> metadata that was tampered with. (SHA1 check, yeah yeah, costly on large
> files.)

Errr... Only if you eliminate utimes() from your syscall table.
Otherwise it is trivial to reset the mtime after changing the file
contents.

Cheers
Trond

2010-07-19 14:09:40

Subject: Re: [PATCH 09/18] xstat: Make special system filesystems return FS_SPECIAL_FL [ver #6]

Christoph Hellwig <[email protected]> wrote:

> special is not a very useful identifier. Also what you are returning
> is per-filesystem data, not per-file. This needs to go into statfs,
> not into stat. We're about to introduce flags for statfs, so try
> to do it on top of those.
>
> The same thing applies to the remote flag in the next patch.

Which means that you have to do two calls (xstat+statfs) to find this
information that we can return pretty much for free here, though you can cache
it based on st_dev, I suppose.

Also, not all the flags are per-filesystem. The following are:

FS_SPECIAL_FL /* Special file as found in procfs/sysfs */
FS_REMOTE_FL /* File is remote */

but the rest aren't:

FS_AUTOMOUNT_FL /* Specific automount point */
FS_AUTOMOUNT_ANY_FL /* Unspecific automount directory */
FS_ENCRYPTED_FL /* File is encrypted */
FS_HIDDEN_FL /* File is marked hidden (DOS+) */
FS_SYSTEM_FL /* File is marked system (DOS+) */
FS_ARCHIVE_FL /* File is marked archive (DOS+) */
FS_TEMPORARY_FL /* File is temporary (NTFS/CIFS) */
FS_OFFLINE_FL /* File is offline (CIFS) */
FS_REPARSE_POINT_FL /* Reparse point (NTFS/CIFS) */

David

2010-07-28 17:28:30

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Neil Brown <[email protected]> wrote:

> ctime and mtime have real cache-coherence semantics which require them being
> updated by the kernel (whether the cache is on an NFS client, in a backup
> archive, or in a .o translation of a .c file).

So does creation time, at least for CIFS caching. Creation time has potential
for spotting when the object at a pathname has changed for something else,
given the lack of inode number and inode generation from windows servers.
Creation time gives us one more datum to use.

> The only role the kernel might have would be setting the 'creation time' when
> the file was created, but it seems even that isn't always what is wanted,
> because people don't so much want the time of creation of the
> container-on-disk, but the time of creation of the data-content.

That should be a timestamp in the content itself, not a filesystem metadata
timestamp.

> I would want to see a pretty convincing use-case that cannot be solved with
> xattrs before 'creation time' was added to a generic kernel interface.

Then there's no point even considering this. You could emulate the entirety
of stat() with getxattr(). I've previously posted a patch to implement the
retrieval of creation time, inode gen and data version as xattrs and been told
that it's the wrong way to do it and I should extend stat instead.

> So just use xattrs and don't involve the kernel in any detailed knowledge of
> this value.

Why not? BSD has it in its stat struct. Windows has it in its Win32
equivalents. Samba for one will look for it there, and use it if it is.

Using an xattr means an extra pathwalk and extra locking per access for any
program that wants it. It's a reasonable bet such a program will also be
stat'ing the file it wants the creation time for.

If we are going to extend stat anyway, then why not make out a short list of
extra things we could usefully return and consider adding them? Something
like creation time is reasonably easy to come by for little extra overhead.
Ext4, for example, retains a copy of it in RAM in its inode struct.

> Maybe xstat should take a list of xattrs to be retrieved as well?? or maybe
> not.

The idea of xstat() having a variable-length buffer and variable arguments has
been well derided. It ain't going to happen, much though I'd like it to. I'd
quite like to offer the opportunity to return the security label, for example.

David

2010-07-23 01:22:12

by Theodore Ts'o

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Jul 23, 2010 at 11:03:47AM +1000, [email protected] wrote:
>
> The other big difference from POSIX timestamps is that the
> CreationTime is settable on Windows, and some of the windows UI
> behaviour relies on this.

Well, not POSIX, because POSIX doesn't have CreationTime at all.
BSD's birthtime doesn't allow it to be set, and the question here is
largely philosophical. Does it literally mean "file creation time" in
terms of when the OS created the file, or does it mean "file" in the
sense of application contents. For example, if an application edits
the file and saves it out using "write file to foo.new; sync; rename
foo to foo.bak; rename foo.new to foo", should the creation time for
the newly written file "foo" be the time when the editor saved out the
file (i.e., when "foo.new" was created), or copied from the original
file "foo"'s creation time.
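
For reference, the save sequence described above might look like this in C.
This is only a sketch: the function name save_by_rename() is hypothetical,
fsync() stands in for the "sync" step, and PATH is assumed to exist already.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical editor-style save: write PATH.new, flush it, keep the old
 * file as PATH.bak, then move the new file into place.
 * Returns 0 on success, -1 on failure.
 */
int save_by_rename(const char *path, const char *data, size_t len)
{
    char tmp[4096], bak[4096];

    assert(path != NULL && data != NULL);

    if (snprintf(tmp, sizeof tmp, "%s.new", path) >= (int)sizeof tmp ||
        snprintf(bak, sizeof bak, "%s.bak", path) >= (int)sizeof bak)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    /* After this, PATH names a new inode; the old one lives on as PATH.bak. */
    if (rename(path, bak) < 0 || rename(tmp, path) < 0)
        return -1;
    return 0;
}
```

Because "foo" ends up naming a different inode, a creation time assigned by
the filesystem would be the save time unless the application copies the old
value across.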

This is something (whether or not the application is allowed to set
the creation time) that I think makes sense to be either a filesystem
level mount option, or superblock tunable, or even a per-process
personality flag.

However, I think Linus's idea of using a per-process flag to control
whether or not "ctime" has the original POSIX semantics or some new
"creation time" semantics would lead to a huge amount of confusion.
Given that a number of new filesystems, including both ext4 and btrfs,
have creation time, it makes sense for us to have a fourth timestamp.
Whether or not our creation time is settable or not is a separate
question, and I don't think we need to follow BSD's lead on this. If
GNOME and/or KDE applications start using it, I could see this
becoming something that gets wide adoption fairly quickly.

- Ted

2010-07-22 15:48:16

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke
<[email protected]> wrote:
>
> The nice thing about this is also that if this is supposed
> to be fully usable for Windows clients, the birthtime needs
> to be changeable. That's what NTFS semantics gives you, thus
> Windows clients tend to require it.

Ok. So it's not really a creation date, exactly the same way ctime
isn't at all a creation date.

And maybe that actually hints at a better solution: maybe a better
model is to create a new per-thread flag that says "do ctime updates
the way windows does them".

So instead of adding another "btime" - which isn't actually what even
windows does - just admit that the _real_ issue is that Unix and
Windows semantics are different for the pre-existing "ctime".

The fact is, windows has "access time", "modification time" and
"creation time" _exactly_ like UNIX. It's just that the ctime has
slightly different semantics in windows vs unix. So quite frankly,
it's totally insane to introduce a "birthtime", when that isn't even
what windows wants, just because people cannot face the actual real
difference.

Tell me why we shouldn't just do this right?

Linus

2010-07-30 18:40:15

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 29, 2010 at 09:04:01AM +1000, Neil Brown wrote:
> On Wed, 28 Jul 2010 18:28:02 +0100
> David Howells <[email protected]> wrote:
>
> > Neil Brown <[email protected]> wrote:
> >
> > > ctime and mtime have real cache-coherence semantics which require them being
> > > updated by the kernel (whether the cache is on an NFS client, in a backup
> > > archive, or in a .o translation of a .c file).
> >
> > So does creation time, at least for CIFS caching. Creation time has potential
> > for spotting when the object at a pathname has changed for something else,
> > given the lack of inode number and inode generation from windows servers.
> > Creation time gives us one more datum to use.
>
> This justifies for me why a CIFS client would want to extract the
> creation-time from the CIFS protocol, but not why you want to expose it via a
> generic interface.
> The kernel/filesystem doesn't need to maintain creation-time to meet this
> need, only the CIFS server needs to maintain it

For what it's worth, the NFSv4 server would also export creation time if
we had it.

--b.

2010-07-22 17:04:01

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 18:40, Linus Torvalds wrote:

>On Thu, Jul 22, 2010 at 9:27 AM, Jeremy Allison <[email protected]> wrote:
>> On Thu, Jul 22, 2010 at 08:47:46AM -0700, Linus Torvalds wrote:
>>> Tell me why we shouldn't just do this right?
>>
>> No, ctime isn't the same as Windows "create time".
>
>Umm. What kind of reading problems do you guys have?
>
>I know effin well that ctime isn't the same as Windows create time.
>THAT WAS MY POINT.
>
>But the fact is, the Unix ctime semantics are insane and largely
>useless. There's a damn good reason almost nobody uses ctime under
>unix.

I beg to differ. ctime is not completely useless. It reflects changes on
the inode when you don't change the content. It's like an mtime
for the metadata. It comes in useful when you go around in your filesystem
trying to figure out who of your co-admins screwed up the permissions on
/etc/passwd... and if the mtime is the same as that of the last backup,
I can at least have a reasonable assurance that it was /only/ the
metadata that was tampered with. (SHA1 check, yeah yeah, costly on large
files.)

>I personally think that Unix ctime is insane. There is no real reason
>why "write()" should change mtime, but "chmod" changes ctime.

2010-07-27 13:41:33

Subject: Re: [PATCH 09/18] xstat: Make special system filesystems return FS_SPECIAL_FL [ver #6]

David Howells <[email protected]> wrote:

> Also, not all the flags are per-filesystem. The following are:
>
> FS_SPECIAL_FL /* Special file as found in procfs/sysfs */
> FS_REMOTE_FL /* File is remote */

Actually, that last is not true; FS_REMOTE_FL is per-file, not per-fs. You
can have a filesystem that has fabricated files and remote files. For
instance, with kAFS at some point you will be able to go into /afs, do a lookup for a
directory that doesn't exist, but whose name represents a cell+volume, the
filesystem will fabricate a local directory and then attempt to mount a remote
directory on to it.

David

2010-07-19 14:10:56

Subject: Re: [PATCH 12/18] xstat: Add a dentry op to handle automounting rather than abusing follow_link() [ver #6]

Christoph Hellwig <[email protected]> wrote:

> Moving this out of ->follow_link is a good idea, but please submit this
> as a separate patch series, as it has very little to do with stat().

Except that I want to use it to create a new AT flag for xstat() (and also
fstatat()), but fair enough.

David

2010-07-22 16:06:24

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Jul 22, 2010 at 11:47 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke
> <[email protected]> wrote:
>>
>> The nice thing about this is also that if this is supposed
>> to be fully usable for Windows clients, the birthtime needs
>> to be changeable. That's what NTFS semantics gives you, thus
>> Windows clients tend to require it.
>
> Ok. So it's not really a creation date, exactly the same way ctime
> isn't at all a creation date.
>
> And maybe that actually hints at a better solution: maybe a better
> model is to create a new per-thread flag that says "do ctime updates
> the way windows does them".
>
> So instead of adding another "btime" - which isn't actually what even
> windows does - just admit that the _real_ issue is that Unix and
> Windows semantics are different for the pre-existing "ctime".
>
> The fact is, windows has "access time", "modification time" and
> "creation time" _exactly_ like UNIX. It's just that the ctime has
> slightly different semantics in windows vs unix. So quite frankly,
> it's totally insane to introduce a "birthtime", when that isn't even
> what windows wants, just because people cannot face the actual real
> difference.
>
> Tell me why we shouldn't just do this right?
>
>                 Linus

I haven't been keeping up with this thread, but I believe NTFS has a
number of timestamps, not just 3.

This blog post references 8 in the left hand column.

The 4 standard (most common) ones are:

File last access
File last modified
File created
MFT last modified

My understanding is that "MFT last modified" has semantics very
similar to Linux ctime.

But there is not a generic equivalent to NTFS created.

Thus if trying to have the Linux kernel match NTFS semantics for the
benefit of Samba is the goal, it seems a new field should be preferred
instead of having linux ctime try to do different jobs.

Greg

2010-07-23 02:13:01

by tridge

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Hi Ted,

> Does it literally mean "file creation time" in terms of when the OS
> created the file, or does it mean "file" in the sense of
> application contents. For example, if an application edits the
> file and saves it out using "write file to foo.new; sync; rename
> foo to foo.bak; rename foo.new to foo", should the creation time
> for the newly written file "foo" be the time when the editor saved
> out the file (i.e., when "foo.new" was created), or copied from the
> original file "foo"'s creation time.

In Windows this can be controlled by applications, but it also is
done at the filesystem level in NTFS using a technique that Microsoft
call "File System Tunneling". If you create a file with the same name
within a short time (default 15s and settable in the registry) of when
the file previously existed then it will get the same CreationTime as
the previous file.

For details see http://support.microsoft.com/kb/172190

Some applications also do this regardless of the registry setting for
MaximumTunnelEntryAgeInSeconds. They use the ability to set the
CreationTime to get the same behaviour.

Cheers, Tridge

2010-07-22 10:52:37

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sunday 2010-07-18 10:48, Christoph Hellwig wrote:

>Adding Uli to the Cc list to make sure this system call is useful
>for glibc / can be exported by it. Otherwise it's rather pointless
>to add it.
>
>> (6) BSD stat compatibility: Including more fields from the BSD stat such as
>> creation time (st_btime) and inode generation number (st_gen) [Jeremy
>> Allison, Bernd Schubert]
>
>How is this different from (1) and (4)?

(4), the inode generation number, need not be related to time.

>> (8) Allow the filesystem to indicate what it can/cannot provide: A filesystem
>> can now say it doesn't support a standard stat feature if that isn't
>> available.
>
>What for?

Given xstat.otime=0, how would you determine whether the file is really
tagged with a date of 1970, or whether it's just the fs which did not
store this kind of information?

2010-07-18 08:48:28

by Christoph Hellwig

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Adding Uli to the Cc list to make sure this system call is useful
for glibc / can be exported by it. Otherwise it's rather pointless
to add it.

> (6) BSD stat compatibility: Including more fields from the BSD stat such as
> creation time (st_btime) and inode generation number (st_gen) [Jeremy
> Allison, Bernd Schubert]

How is this different from (1) and (4)?

> (7) Extra coherency data may be useful in making backups [Andreas Dilger].

What do you mean with that?

> (8) Allow the filesystem to indicate what it can/cannot provide: A filesystem
> can now say it doesn't support a standard stat feature if that isn't
> available.

What for?

> (9) Make the fields a consistent size on all arches, and make them large.

Why making them large for the sake of it? We'll need massive changes
all through libc and applications to ever make use of this. So please
coordinate the types used with Uli.

> The following structures are defined for the use of these new system calls:
>
> struct xstat_parameters {
> unsigned long long request_mask;
> };

Just pass this as a single flag by value. And just make it an unsigned
long to make the calling convention a lot simpler.

> struct xstat_dev {
> unsigned int major, minor;
> };
>
> struct xstat_time {
> unsigned long long tv_sec, tv_nsec;
> };

No point in adding special types here that aren't generically useful.
Also this is the first and only system call using split major/minor
values for the dev_t. All this just creates more churn than it helps.

>
> struct xstat {
> unsigned long long st_result_mask;

Just st_mask?

> unsigned long long st_data_version;

st_version?

> unsigned long long st_inode_flags;

> The defined bits in request_mask and st_result_mask are:
>
> XSTAT_REQUEST_MODE Want/got st_mode
> XSTAT_REQUEST_NLINK Want/got st_nlink
> XSTAT_REQUEST_UID Want/got st_uid
> XSTAT_REQUEST_GID Want/got st_gid
> XSTAT_REQUEST_RDEV Want/got st_rdev
> XSTAT_REQUEST_ATIME Want/got st_atime
> XSTAT_REQUEST_MTIME Want/got st_mtime
> XSTAT_REQUEST_CTIME Want/got st_ctime
> XSTAT_REQUEST_INO Want/got st_ino
> XSTAT_REQUEST_SIZE Want/got st_size
> XSTAT_REQUEST_BLOCKS Want/got st_blocks
> XSTAT_REQUEST__BASIC_STATS The stuff in the normal stat struct
> XSTAT_REQUEST_BTIME Want/got st_btime
> XSTAT_REQUEST_GEN Want/got st_gen
> XSTAT_REQUEST_DATA_VERSION Want/got st_data_version
> XSTAT_REQUEST_INODE_FLAGS Want/got st_inode_flags
> XSTAT_REQUEST__EXTENDED_STATS The stuff in the xstat struct
> XSTAT_REQUEST__ALL_STATS The defined set of requestables

What's the point of the REQUEST in the name? Also no double
underscores inside the identifier. Instead adding a _MASK postfix
for masks would make it a lot more clear.

> The defined bits in st_inode_flags are the usual FS_xxx_FL flags in the LSW,
> plus some extra flags in the MSW:
>
> FS_SPECIAL_FL Special kernel file, such as found in procfs
> FS_AUTOMOUNT_FL Specific automount point
> FS_AUTOMOUNT_ANY_FL Free-form automount directory
> FS_REMOTE_FL File is remote
> FS_ENCRYPTED_FL File is encrypted
> FS_SYSTEM_FL File is marked system (DOS/NTFS/CIFS)
> FS_TEMPORARY_FL File is temporary (NTFS/CIFS)
> FS_OFFLINE_FL File is offline (CIFS)

Please don't overload the FL_ namespace even more. It's already a
complete mess given that it overloads the extN on-disk namespace.
You're much better off just adding a clean new namespace.

> The system calls are:
>
> ssize_t ret = xstat(int dfd,
> const char *filename,
> unsigned flags,
> const struct xstat_parameters *params,
> struct xstat *buffer,
> size_t buflen);

If you already have a buflen parameter there is absolutely no need for
the extra results field. Just define new fields at the end and include
them if the bufsize is big enough and it's in the mask of requested
fields.
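
One way to read the buflen suggestion, as a sketch only: the two structure
versions and the copy helper below are hypothetical, and a real
implementation would also have to clear mask bits for fields that did not
fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical original layout, and a later one with a field appended. */
struct ex_stat_v1 { uint64_t mask; uint64_t size; };
struct ex_stat_v2 { uint64_t mask; uint64_t size; uint64_t btime_sec; };

/*
 * Copy as much of the current structure as the caller's buffer can hold.
 * Old callers pass sizeof(struct ex_stat_v1) and never see the new field.
 * Returns the number of bytes copied.
 */
size_t ex_copy_out(void *buf, size_t buflen, const struct ex_stat_v2 *full)
{
    assert(buf != NULL && full != NULL);

    size_t n = buflen < sizeof *full ? buflen : sizeof *full;
    memcpy(buf, full, n);
    return n;
}
```

This only works because new fields are appended at the end, so a shorter
buffer is always a valid prefix of a longer one.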

> When the system call is executed, the request_mask bitmask is read from the
> parameter block to work out what the user is requesting. If params is NULL,
> then request_mask will be assumed to be XSTAT_REQUEST__BASIC_STATS.

Why add a special case like that? Especially if we make the request
flags a pass-by-value scalar, initializing it is trivial.

> The request_mask should be set by the caller to specify extra results that the
> caller may desire. These come in a number of classes:
>
> (0) dev, blksize.
>
> These are local data and are always available.
>
> (1) mode, nlinks, uid, gid, [amc]time, ino, size, blocks.
>
> These will be returned whether the caller asks for them or not. The
> corresponding bits in result_mask will be set to indicate their presence.
>
> If the caller didn't ask for them, then they may be approximated. For
> example, NFS won't waste any time updating them from the server, unless as
> a byproduct of updating something requested.

Please don't introduce tons of special cases. Instead use a simple rule
like:

- a filesystem must return all attributes requested, or return an
error if it can't.
- a filesystem may return additional attributes, the caller can detect
this by looking at st_mask.

plus possibly a list of attributes the filesystem must be able to
provide if requested. I don't see a reason to make that mask different
from the attributes required by Posix.

2010-07-18 08:50:50

by Christoph Hellwig

Subject: Re: [PATCH 12/18] xstat: Add a dentry op to handle automounting rather than abusing follow_link() [ver #6]

Moving this out of ->follow_link is a good idea, but please submit this
as a separate patch series, as it has very little to do with stat().

2010-07-18 08:49:57

by Christoph Hellwig

Subject: Re: [PATCH 09/18] xstat: Make special system filesystems return FS_SPECIAL_FL [ver #6]

special is not a very useful identifier. Also what you are returning
is per-filesystem data, not per-file. This needs to go into statfs,
not into stat. We're about to introduce flags for statfs, so try
to do it on top of those.

The same thing applies to the remote flag in the next patch.

2010-07-19 14:05:27

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Christoph Hellwig <[email protected]> wrote:

> Adding Uli to the Cc list to make sure this system call is useful
> for glibc / can be exported by it. Otherwise it's rather pointless
> to add it.
>
> > (6) BSD stat compatibility: Including more fields from the BSD stat such
> > as creation time (st_btime) and inode generation number (st_gen)
> > [Jeremy Allison, Bernd Schubert]
>
> How is this different from (1) and (4)?

A matter of intent, really, and who proposed it.

> > (7) Extra coherency data may be useful in making backups [Andreas Dilger].
>
> What do you mean with that?

There are extra dates and version numbers potentially available. This may be
useful in making backups. Ask Andreas.

> > (8) Allow the filesystem to indicate what it can/cannot provide: A
> > filesystem can now say it doesn't support a standard stat feature if
> > that isn't available.
>
> What for?

So that you can decide not to use it. Some of our filesystems fabricate things
that they don't actually store.

> > (9) Make the fields a consistent size on all arches, and make them large.
>
> Why making them large for the sake of it? We'll need massive changes
> all through libc and applications to ever make use of this. So please
> coordinate the types used with Uli.

Otherwise we end up with #ifdefs and duplicated fields of different sizes
within stat structs, and fields of "long" types which vary in size, depending
on the environment.

I just want to make sure that:

- st_ino is stored as 64-bit
- st_size and st_blocks are stored 64-bit
- st.{a,b,c,m}time.tv_sec are stored 64-bit

We could probably stand to make st_blksize 32-bit. I'd quite like to leave
st_gen as 64-bits and I definitely want to leave st_data_version as 64-bits.
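
To make the sizing concern concrete, here is a minimal userspace sketch of a record built only from fixed-width types. The names `xstat_sketch` and `xstat_time_sketch`, the `xs_` field prefix, the field order, and the padding word are all illustrative assumptions rather than the layout in the patch. The point is only that no field width depends on what `long` happens to be on a given architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timestamp: both halves are fixed-width, no "long". */
struct xstat_time_sketch {
	uint64_t tv_sec;
	uint64_t tv_nsec;
};

/* Hypothetical record; field names are prefixed xs_ to avoid clashing
 * with the st_* macros some C libraries define in <sys/stat.h>. */
struct xstat_sketch {
	uint64_t xs_ino;
	uint64_t xs_size;
	uint64_t xs_blocks;
	uint64_t xs_gen;
	uint64_t xs_data_version;
	uint32_t xs_blksize;	/* 32 bits is probably enough here */
	uint32_t xs_pad;	/* explicit padding keeps the layout obvious */
	struct xstat_time_sketch xs_atime, xs_btime, xs_ctime, xs_mtime;
};

/* The size is the same whether "long" is 32 or 64 bits wide. */
_Static_assert(sizeof(struct xstat_sketch) == 112,
	       "layout must not depend on the width of long");
```

With a layout like this, 32-bit and 64-bit userspace would see the same structure, so no compat translation or `#ifdef` duplication of fields should be needed.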

> > The following structures are defined for the use of these new system calls:
> >
> > struct xstat_parameters {
> > unsigned long long request_mask;
> > };
>
> Just pass this as a single flag by value. And just make it an unsigned
> long to make the calling convention a lot simpler.

Already done.

> > struct xstat_dev {
> > unsigned int major, minor;
> > };
> >
> > struct xstat_time {
> > unsigned long long tv_sec, tv_nsec;
> > };
>
> No point in adding special types here that aren't generically useful.
> Also this is the first and only system call using split major/minor
> values for the dev_t. All this just creates more churn than it helps.

I can perhaps agree on the device numbers, though some filesystems we have can
store numbers that can't be represented by dev_t. I think, however, everything
we have can be handled by a 32:32 split. The numbers could then be encoded as
desired in userspace.
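
As a rough illustration of the 32:32 idea, the sketch below packs a split device number into a single 64-bit value and back again. The struct and helper names (`xstat_dev_sketch`, `xdev_pack`, `xdev_unpack`) and the choice of putting the major number in the upper half are assumptions made for this example only; userspace could equally encode the pair in whatever `dev_t` format it prefers.

```c
#include <stdint.h>

/* Hypothetical 32:32 device number. */
struct xstat_dev_sketch {
	uint32_t dev_major;
	uint32_t dev_minor;
};

/* Pack into one 64-bit value: major in the upper half, minor in the lower. */
static inline uint64_t xdev_pack(struct xstat_dev_sketch d)
{
	return ((uint64_t)d.dev_major << 32) | d.dev_minor;
}

/* Inverse of xdev_pack(). */
static inline struct xstat_dev_sketch xdev_unpack(uint64_t v)
{
	struct xstat_dev_sketch d = {
		.dev_major = (uint32_t)(v >> 32),
		.dev_minor = (uint32_t)v,
	};
	return d;
}
```

Because both halves are a full 32 bits, any major or minor that a filesystem can store within that range round-trips without loss.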

The problem with using extant time structs is they use "long" or "unsigned
long". And I specifically want to get away from that, since it might be
32-bits or it might be 64-bits.

> >
> > struct xstat {
> > unsigned long long st_result_mask;
>
> Just st_mask?

Perhaps, but it contrasts nicely with request_mask, and makes it easier to
document.

> > unsigned long long st_data_version;
>
> st_version?

Acceptable.

> > unsigned long long st_inode_flags;
>
>
>
> > The defined bits in request_mask and st_result_mask are:
> >
> > XSTAT_REQUEST_MODE Want/got st_mode
> > XSTAT_REQUEST_NLINK Want/got st_nlink
> > XSTAT_REQUEST_UID Want/got st_uid
> > XSTAT_REQUEST_GID Want/got st_gid
> > XSTAT_REQUEST_RDEV Want/got st_rdev
> > XSTAT_REQUEST_ATIME Want/got st_atime
> > XSTAT_REQUEST_MTIME Want/got st_mtime
> > XSTAT_REQUEST_CTIME Want/got st_ctime
> > XSTAT_REQUEST_INO Want/got st_ino
> > XSTAT_REQUEST_SIZE Want/got st_size
> > XSTAT_REQUEST_BLOCKS Want/got st_blocks
> > XSTAT_REQUEST__BASIC_STATS The stuff in the normal stat struct
> > XSTAT_REQUEST_BTIME Want/got st_btime
> > XSTAT_REQUEST_GEN Want/got st_gen
> > XSTAT_REQUEST_DATA_VERSION Want/got st_data_version
> > XSTAT_REQUEST_INODE_FLAGS Want/got st_inode_flags
> > XSTAT_REQUEST__EXTENDED_STATS The stuff in the xstat struct
> > XSTAT_REQUEST__ALL_STATS The defined set of requestables
>
> What's the point of the REQUEST in the name?

Well, they are.

> Also no double underscores inside the identifier. Instead adding a _MASK
> postfix for masks would make it a lot more clear.

Perhaps.

> > The defined bits in st_inode_flags are the usual FS_xxx_FL flags in the
> > LSW, plus some extra flags in the MSW:
> >
> > FS_SPECIAL_FL Special kernel file, such as found in procfs
> > FS_AUTOMOUNT_FL Specific automount point
> > FS_AUTOMOUNT_ANY_FL Free-form automount directory
> > FS_REMOTE_FL File is remote
> > FS_ENCRYPTED_FL File is encrypted
> > FS_SYSTEM_FL File is marked system (DOS/NTFS/CIFS)
> > FS_TEMPORARY_FL File is temporary (NTFS/CIFS)
> > FS_OFFLINE_FL File is offline (CIFS)
>
> Please don't overload the FL_ namespace even more. It's already a
> complete mess given that it overloads the extN on-disk namespace.
> You're much better off just adding a clean new namespace.

Yeah. I've been thinking that's probably the better thing to do.

> > The system calls are:
> >
> > ssize_t ret = xstat(int dfd,
> > const char *filename,
> > unsigned flags,
> > const struct xstat_parameters *params,
> > struct xstat *buffer,
> > size_t buflen);
>
> If you already have a buflen parameter there is absolutely no need for
> the extra results field. Just define new fields at the end and include
> them if the bufsize is big enough and it's in the mask of requested
> fields.

Or, as someone else has already said, return -E2BIG if the result won't fit.
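
The -E2BIG alternative can be sketched as a simple size check before the copy. This is a userspace illustration under stated assumptions: `copy_result_sketch` is a hypothetical helper, plain `memcpy()` stands in for the kernel's copy-to-user step, and returning the byte count on success merely mirrors the `ssize_t` return type of the proposed call.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copy a result record of 'len' bytes into the caller's buffer.
 * Returns the number of bytes copied, or -E2BIG if it will not fit. */
static long copy_result_sketch(void *buf, size_t buflen,
			       const void *result, size_t len)
{
	if (len > buflen)
		return -E2BIG;
	memcpy(buf, result, len);
	return (long)len;
}
```

The trade-off against Christoph's suggestion is that a caller with a short buffer gets a hard error here, rather than a silently truncated set of fields.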

> > The request_mask should be set by the caller to specify extra results that
> > the caller may desire. These come in a number of classes:
> >
> > (0) dev, blksize.
> >
> > These are local data and are always available.
> >
> > (1) mode, nlinks, uid, gid, [amc]time, ino, size, blocks.
> >
> > These will be returned whether the caller asks for them or not. The
> > corresponding bits in result_mask will be set to indicate their
> > presence.
> >
> > If the caller didn't ask for them, then they may be approximated. For
> > example, NFS won't waste any time updating them from the server,
> > unless as a byproduct of updating something requested.
>
> Please don't introduce tons of special cases. Instead use a simple rule
> like:
>
> - a filesystem must return all attributes requested, or return an
> error if it can't.
> - a filesystem may return additional attributes, the caller can detect
> this by looking at st_mask.
>
> plus possibly a list of attributes the filesystem must be able to
> provide if requested. I don't see a reason to make that mask different
> from the attributes required by Posix.

Firstly: Lightweight stat: I want to say that the filesystem may return data
that is out of date if it isn't asked for specifically, but the filesystem has
a copy available. But I'm not sure that this should apply to non-standard
fields.

Secondly: It doesn't matter what POSIX wants; not all filesystems we support
have everything available. Where something that's standard is not available,
we have the opportunity to indicate this, whilst still providing a fabricated
result, so that the user can take note of this fact if they choose to, whilst
totally ignoring the indication if they prefer, and just using the fabrication.
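
A caller-side view of the two masks might look like the sketch below. The bit values (`XS_MODE`, `XS_SIZE`, `XS_BTIME`) and helper names are hypothetical and chosen only for this example; the real bits are whatever the patch defines as XSTAT_REQUEST_*. The helpers show how a program could tell which requested fields were genuinely supplied, which are missing or fabricated, and which arrived unasked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define XS_MODE		(1ULL << 0)
#define XS_SIZE		(1ULL << 1)
#define XS_BTIME	(1ULL << 2)

/* True if every requested field is flagged as provided. */
static bool xs_all_provided(uint64_t request_mask, uint64_t result_mask)
{
	return (result_mask & request_mask) == request_mask;
}

/* Requested fields the filesystem did not flag (absent or fabricated). */
static uint64_t xs_missing(uint64_t request_mask, uint64_t result_mask)
{
	return request_mask & ~result_mask;
}

/* Fields the filesystem returned although the caller did not ask. */
static uint64_t xs_extra(uint64_t request_mask, uint64_t result_mask)
{
	return result_mask & ~request_mask;
}
```

A program that does not care about the distinction can ignore the result mask entirely and use whatever values are in the buffer.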

David

2010-07-28 23:04:14

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Wed, 28 Jul 2010 18:28:02 +0100
David Howells wrote:

> Neil Brown wrote:
>
> > ctime and mtime have real cache-coherence semantics which require them being
> > updated by the kernel (whether the cache is on an NFS client, in a backup
> > archive, or in a .o translation of a .c file).
>
> So does creation time, at least for CIFS caching. Creation time has potential
> for spotting when the object at a pathname has changed for something else,
> given the lack of inode number and inode generation from windows servers.
> Creation time gives us one more datum to use.

This justifies for me why a CIFS client would want to extract the
creation-time from the CIFS protocol, but not why you want to expose it via a
generic interface.
The kernel/filesystem doesn't need to maintain creation-time to meet this
need, only the CIFS server needs to maintain it - the kernel/filesystem just
needs to provide somewhere to store it - xattrs.

Given that we have an extensible attribute framework, it seems wrong to be
adding new attributes to *stat. If a given filesystem wants to store certain
attributes more efficiently, then it is welcome to intercept xattr calls and
store (say) "cifs.birthtime" directly at a known offset in the inode.

The flip-side of extracting these various attributes is setting them. One
presumably doesn't want to set st_data_version and possibly not st_gen, but
there seems to be a need to set st_btime and FS_SYSTEM_FL and FS_TEMPORARY_FL
might want to be set. Your xstat doesn't give any way to do that, xattrs
already does - you just need to define names for the attributes.

So I'm against adding new attributes that simply involve the fs storing some
information for the application to use.

I'm still pondering those extra flags:
FS_SPECIAL_FL
FS_AUTOMOUNT_FL
FS_AUTOMOUNT_ANY_FL
FS_REMOTE_FL
FS_ENCRYPTED_FL
FS_OFFLINE_FL

They sound like they might be useful; they are not file metadata (like
btime) but rather implementation details (like st_blocks). So it is probably
sensible to include them as you have done.

However I would really like to see clear and complete documentation for them.
When exactly should a filesystem set these flags, and what exactly can an
application assume if they are (or are not) set?

If a filesystem is mounted on an network-block-device, or a loop-back of a
file on NFS, is FS_REMOTE_FL set?
Is ROT13 enough for FS_ENCRYPTED_FL to be set?
If the NFS server is "not responding, still trying", should FS_OFFLINE_FL get
set on all files?
And I cannot even guess at the difference between the two FS_AUTOMOUNT flags.
I'm sure it is something useful, but doco would be good. Should one of them
be set on mountpoints that NFSv4 detects from the server?

They sound useful, but they are only really useful if they have precise
meanings.

>
> > The only role the kernel might have would be setting the 'creation time' when
> > the file was created, but it seems even that isn't always what is wanted,
> > because people don't so much want the time of creation of the
> > container-on-disk, but the time of creation of the data-content.
>
> That should be a timestamp in the content itself, not a filesystem metadata
> timestamp.
>
> > I would want to see a pretty convincing use-case that cannot be solved with
> > xattrs before 'creation time' was added to a generic kernel interface.
>
> Then there's no point even considering this. You could emulate the entirety
> of stat() with getxattr(). I've previously posted a patch to implement the
> retrieval of creation time, inode gen and data version as xattrs and been told
> that it's the wrong way to do it and I should extend stat instead.

:-( stuck between a rock and a hard-place ??

It would probably help to keep that sort of decision process (complete with
who to blame) documented in the change-log entry, but one never thinks of
doing that at the time.

I don't have any veto power here, and I don't want any. I think creation
time and inode gen make more sense as xattrs. I'm less certain about
data-version as the kernel has to know about it as a first-class concept.

If I have any power to influence the results, I would much rather spend it on
requiring clear precise useful definitions for each new attribute than on
determining which attributes should be first-class and which should be xattrs.

>
> > So just use xattrs and don't involve the kernel in any detailed knowledge of
> > this value.
>
> Why not? BSD has it in its stat struct. Windows has it in its Win32
> equivalents. Samba for one will look for it there, and use it if it is.
>
> Using an xattr means an extra pathwalk and extra locking per access for any
> program that wants it. It's a reasonable bet such a program will also be
> stat'ing the file it wants the creation time for.
>
> If we are going to extend stat anyway, then why not make out a short list of
> extra things we could usefully return and consider adding them? Something
> like creation time is reasonably easy to come by for little extra overhead.
> Ext4, for example, retains a copy of it in RAM in its inode struct.

Providing everybody imposes exactly the same semantics for "creation time"...

>
> > Maybe xstat should take a list of xattrs to be retrieved as well?? or maybe
> > not.
>
> The idea of xstat() having a variable-length buffer and variable arguments has
> been well derided. It ain't going to happen, much though I'd like it to. I'd
> quite like to offer the opportunity to return the security label, for example.

"well derided" like high-mem and SMP support? or "real-time" support and
priority inheritance?
I guess the deriders are wrong, and will eventually realise that they are
wrong. The difficult bit is we cannot know how long it will take them, or
how much you have to care.

NeilBrown
(unambiguous documentation!! the rest is just details)

2010-07-22 19:44:13

by John Stoffel

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

>>>>> "Jeremy" == Jeremy Allison writes:

Jeremy> On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
>>
>> That would make it impossible to export the filesystem with NFSv2 and
>> v3. They do rely on ctime checking for certain operations (e.g. deciding
>> when to invalidate access and acl caches). NFSv4 needs this too if the
>> filesystem has no dedicated change attribute.
>>
>> Still, I suppose the market for exporting the same filesystem with both
>> NFS and Samba is limited...

Jeremy> Ask NetApp about that :-). They have built a rather large
Jeremy> business on just that fact :-).

And it does work, as long as you also go with either unix or windows
semantics for the security and permissions bits. If you try to use
the mixed-mode, you're in for a world of hurt.

Oh yeah, Netapp still uses dump/restore for its backups. :] Though
whether it's still dependent on the optimization of ctime being used
to know whether to just dump the inode only or not, I can't say.

John

2010-07-22 18:05:00

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thursday 2010-07-22 20:02, Jeremy Allison wrote:
>On Thu, Jul 22, 2010 at 12:58:50PM -0400, Trond Myklebust wrote:
>>
>> That would make it impossible to export the filesystem with NFSv2 and
>> v3. They do rely on ctime checking for certain operations (e.g. deciding
>> when to invalidate access and acl caches). NFSv4 needs this too if the
>> filesystem has no dedicated change attribute.
>>
>> Still, I suppose the market for exporting the same filesystem with both
>> NFS and Samba is limited...
>
>Ask NetApp about that :-). They have built a rather large
>business on just that fact :-).

What would ZFS do? :p

2010-07-15 02:17:25

Subject: [PATCH 13/18] xstat: AFS: Use d_automount() rather than abusing follow_link() [ver #6]

Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/dir.c | 1 +
fs/afs/internal.h | 1 +
fs/afs/mntpt.c | 46 +++++++++++++++-------------------------------
3 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index afb9ff8..d2dd137 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -65,6 +65,7 @@ static const struct dentry_operations afs_fs_dentry_operations = {
.d_revalidate = afs_d_revalidate,
.d_delete = afs_d_delete,
.d_release = afs_d_release,
+ .d_automount = afs_d_automount,
};

#define AFS_DIR_HASHTBL_SIZE 128
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 5f679b7..2c700dc 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -583,6 +583,7 @@ extern int afs_abort_to_error(u32);
extern const struct inode_operations afs_mntpt_inode_operations;
extern const struct file_operations afs_mntpt_file_operations;

+extern struct vfsmount *afs_d_automount(struct path *);
extern int afs_mntpt_check_symlink(struct afs_vnode *, struct key *);
extern void afs_mntpt_kill_timer(void);

diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index a9e2303..ea9cfee 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -24,7 +24,6 @@ static struct dentry *afs_mntpt_lookup(struct inode *dir,
struct dentry *dentry,
struct nameidata *nd);
static int afs_mntpt_open(struct inode *inode, struct file *file);
-static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd);
static void afs_mntpt_expiry_timed_out(struct work_struct *work);

const struct file_operations afs_mntpt_file_operations = {
@@ -33,7 +32,6 @@ const struct file_operations afs_mntpt_file_operations = {

const struct inode_operations afs_mntpt_inode_operations = {
.lookup = afs_mntpt_lookup,
- .follow_link = afs_mntpt_follow_link,
.readlink = page_readlink,
.getattr = afs_getattr,
};
@@ -82,6 +80,7 @@ int afs_mntpt_check_symlink(struct afs_vnode *vnode, struct key *key)
_debug("symlink is a mountpoint");
spin_lock(&vnode->lock);
set_bit(AFS_VNODE_MOUNTPOINT, &vnode->flags);
+ vnode->vfs_inode.i_flags |= S_AUTOMOUNT;
spin_unlock(&vnode->lock);
}

@@ -205,52 +204,37 @@ error_no_devname:
}

/*
- * follow a link from a mountpoint directory, thus causing it to be mounted
+ * handle an automount point
*/
-static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
+struct vfsmount *afs_d_automount(struct path *path)
{
struct vfsmount *newmnt;
int err;

- _enter("%p{%s},{%s:%p{%s},}",
- dentry,
- dentry->d_name.name,
- nd->path.mnt->mnt_devname,
- dentry,
- nd->path.dentry->d_name.name);
-
- dput(nd->path.dentry);
- nd->path.dentry = dget(dentry);
+ _enter("{%s,%s}", path->mnt->mnt_devname, path->dentry->d_name.name);

- newmnt = afs_mntpt_do_automount(nd->path.dentry);
- if (IS_ERR(newmnt)) {
- path_put(&nd->path);
- return (void *)newmnt;
- }
+ newmnt = afs_mntpt_do_automount(path->dentry);
+ if (IS_ERR(newmnt))
+ return newmnt;

mntget(newmnt);
- err = do_add_mount(newmnt, &nd->path, MNT_SHRINKABLE, &afs_vfsmounts);
+ err = do_add_mount(newmnt, path, MNT_SHRINKABLE, &afs_vfsmounts);
switch (err) {
case 0:
- path_put(&nd->path);
- nd->path.mnt = newmnt;
- nd->path.dentry = dget(newmnt->mnt_root);
schedule_delayed_work(&afs_mntpt_expiry_timer,
afs_mntpt_expiry_timeout * HZ);
- break;
+ _leave(" = %p {%s}", newmnt, newmnt->mnt_devname);
+ return newmnt;
case -EBUSY:
/* someone else made a mount here whilst we were busy */
- while (d_mountpoint(nd->path.dentry) &&
- follow_down(&nd->path))
- ;
- err = 0;
+ mntput(newmnt);
+ _leave(" = NULL [EBUSY]");
+ return NULL;
default:
mntput(newmnt);
- break;
+ _leave(" = %d", err);
+ return ERR_PTR(err);
}
-
- _leave(" = %d", err);
- return ERR_PTR(err);
}

/*
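
The return convention used by afs_d_automount() in the patch above has three outcomes: a valid vfsmount on success, NULL when someone else mounted there first (-EBUSY), and an ERR_PTR-encoded errno otherwise. The following userspace sketch imitates that convention so the three cases can be seen side by side. The lower-case helpers `err_ptr()`, `is_err()` and `ptr_err()` are stand-ins written for this example, modelled on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and `classify_automount()` is a hypothetical caller, not the pathwalk code from the series.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* errnos live in the top 4095 pointer values */

/* Userspace stand-ins for the kernel's pointer-encoded error helpers. */
static inline void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

enum automount_outcome { AM_MOUNTED, AM_COLLISION, AM_FAILED };

/* Decide what a caller should do with the value a d_automount-style
 * operation returned. */
static enum automount_outcome classify_automount(const void *mnt)
{
	if (mnt == NULL)
		return AM_COLLISION;	/* another mount won the race */
	if (is_err(mnt))
		return AM_FAILED;	/* ptr_err(mnt) holds the -errno */
	return AM_MOUNTED;		/* a usable mount was returned */
}
```

Folding the collision case into NULL means the filesystem no longer has to walk down through the competing mount itself, which is what the removed follow_down() loop used to do.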

2010-07-15 02:17:26

Subject: [PATCH 14/18] xstat: NFS: Use d_automount() rather than abusing follow_link() [ver #6]

Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <[email protected]>
---

fs/nfs/dir.c | 2 +
fs/nfs/inode.c | 1 +
fs/nfs/internal.h | 1 +
fs/nfs/namespace.c | 87 ++++++++++++++++++++++++----------------------------
4 files changed, 44 insertions(+), 47 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 782b431..d7e5810 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -927,6 +927,7 @@ const struct dentry_operations nfs_dentry_operations = {
.d_revalidate = nfs_lookup_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
+ .d_automount = nfs_d_automount,
};

static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
@@ -1002,6 +1003,7 @@ const struct dentry_operations nfs4_dentry_operations = {
.d_revalidate = nfs_open_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
+ .d_automount = nfs_d_automount,
};

/*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 8c6de96..f9737bd 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -296,6 +296,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
inode->i_op = &nfs_mountpoint_inode_operations;
inode->i_fop = NULL;
set_bit(NFS_INO_MOUNTPOINT, &nfsi->flags);
+ inode->i_flags |= S_AUTOMOUNT;
}
} else if (S_ISLNK(inode->i_mode))
inode->i_op = &nfs_symlink_inode_operations;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index d8bd619..48de6f8 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -238,6 +238,7 @@ extern char *nfs_path(const char *base,
const struct dentry *droot,
const struct dentry *dentry,
char *buffer, ssize_t buflen);
+extern struct vfsmount *nfs_d_automount(struct path *path);

/* getroot.c */
extern struct dentry *nfs_get_root(struct super_block *, struct nfs_fh *);
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index db6aa36..bf80079 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -88,9 +88,8 @@ Elong:
}

/*
- * nfs_follow_mountpoint - handle crossing a mountpoint on the server
- * @dentry - dentry of mountpoint
- * @nd - nameidata info
+ * nfs_d_automount - Handle crossing a mountpoint on the server
+ * @path - The mountpoint
*
* When we encounter a mountpoint on the server, we want to set up
* a mountpoint on the client too, to prevent inode numbers from
@@ -100,87 +99,81 @@ Elong:
* situation, and that different filesystems may want to use
* different security flavours.
*/
-static void * nfs_follow_mountpoint(struct dentry *dentry, struct nameidata *nd)
+struct vfsmount *nfs_d_automount(struct path *path)
{
struct vfsmount *mnt;
- struct nfs_server *server = NFS_SERVER(dentry->d_inode);
+ struct nfs_server *server = NFS_SERVER(path->dentry->d_inode);
struct dentry *parent;
struct nfs_fh *fh = NULL;
struct nfs_fattr *fattr = NULL;
int err;

- dprintk("--> nfs_follow_mountpoint()\n");
+ dprintk("--> nfs_d_automount()\n");

- err = -ESTALE;
- if (IS_ROOT(dentry))
- goto out_err;
+ mnt = ERR_PTR(-ESTALE);
+ if (IS_ROOT(path->dentry))
+ goto out_nofree;

- err = -ENOMEM;
+ mnt = ERR_PTR(-ENOMEM);
fh = nfs_alloc_fhandle();
fattr = nfs_alloc_fattr();
if (fh == NULL || fattr == NULL)
- goto out_err;
+ goto out;

dprintk("%s: enter\n", __func__);
- dput(nd->path.dentry);
- nd->path.dentry = dget(dentry);

- /* Look it up again */
- parent = dget_parent(nd->path.dentry);
+ /* Look it up again to get its attributes */
+ parent = dget_parent(path->dentry);
err = server->nfs_client->rpc_ops->lookup(parent->d_inode,
- &nd->path.dentry->d_name,
+ &path->dentry->d_name,
fh, fattr);
dput(parent);
- if (err != 0)
- goto out_err;
+ if (err != 0) {
+ mnt = ERR_PTR(err);
+ goto out;
+ }

if (fattr->valid & NFS_ATTR_FATTR_V4_REFERRAL)
- mnt = nfs_do_refmount(nd->path.mnt, nd->path.dentry);
+ mnt = nfs_do_refmount(path->mnt, path->dentry);
else
- mnt = nfs_do_submount(nd->path.mnt, nd->path.dentry, fh,
- fattr);
- err = PTR_ERR(mnt);
+ mnt = nfs_do_submount(path->mnt, path->dentry, fh, fattr);
if (IS_ERR(mnt))
- goto out_err;
+ goto out;

mntget(mnt);
- err = do_add_mount(mnt, &nd->path, nd->path.mnt->mnt_flags|MNT_SHRINKABLE,
+ err = do_add_mount(mnt, path, path->mnt->mnt_flags | MNT_SHRINKABLE,
&nfs_automount_list);
- if (err < 0) {
+ switch (err) {
+ case 0:
+ dprintk("%s: done, success\n", __func__);
+ schedule_delayed_work(&nfs_automount_task, nfs_mountpoint_expiry_timeout);
+ break;
+ case -EBUSY:
+ /* someone else made a mount here whilst we were busy */
mntput(mnt);
- if (err == -EBUSY)
- goto out_follow;
- goto out_err;
+ dprintk("%s: done, collision\n", __func__);
+ mnt = NULL;
+ break;
+ default:
+ mntput(mnt);
+ dprintk("%s: done, error %d\n", __func__, err);
+ mnt = ERR_PTR(err);
+ break;
}
- path_put(&nd->path);
- nd->path.mnt = mnt;
- nd->path.dentry = dget(mnt->mnt_root);
- schedule_delayed_work(&nfs_automount_task, nfs_mountpoint_expiry_timeout);
+
out:
nfs_free_fattr(fattr);
nfs_free_fhandle(fh);
- dprintk("%s: done, returned %d\n", __func__, err);
-
- dprintk("<-- nfs_follow_mountpoint() = %d\n", err);
- return ERR_PTR(err);
-out_err:
- path_put(&nd->path);
- goto out;
-out_follow:
- while (d_mountpoint(nd->path.dentry) &&
- follow_down(&nd->path))
- ;
- err = 0;
- goto out;
+out_nofree:
+ dprintk("<-- nfs_d_automount() = %p\n", mnt);
+ return mnt;
}

const struct inode_operations nfs_mountpoint_inode_operations = {
- .follow_link = nfs_follow_mountpoint,
.getattr = nfs_getattr,
};

const struct inode_operations nfs_referral_inode_operations = {
- .follow_link = nfs_follow_mountpoint,
};

static void nfs_expire_automounts(struct work_struct *work)

2010-07-15 02:17:28

Subject: [PATCH 16/18] xstat: Remove the automount through follow_link() kludge code from pathwalk [ver #6]

Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().

Signed-off-by: David Howells <[email protected]>
---

fs/namei.c | 17 +++--------------
1 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fcec3c6..86068a2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,17 +845,6 @@ fail:
}

/*
- * This is a temporary kludge to deal with "automount" symlinks; proper
- * solution is to trigger them on follow_mount(), so that do_lookup()
- * would DTRT. To be killed before 2.6.34-final.
- */
-static inline int follow_on_final(struct inode *inode, unsigned lookup_flags)
-{
- return inode && unlikely(inode->i_op->follow_link) &&
- ((lookup_flags & LOOKUP_FOLLOW) || S_ISDIR(inode->i_mode));
-}
-
-/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
* the final dentry. We expect 'base' to be positive and a directory.
@@ -975,7 +964,8 @@ last_component:
if (err)
break;
inode = next.dentry->d_inode;
- if (follow_on_final(inode, lookup_flags)) {
+ if (inode && unlikely(inode->i_op->follow_link) &&
+ (lookup_flags & LOOKUP_FOLLOW)) {
err = do_follow_link(&next, nd);
if (err)
goto return_err;
@@ -1888,8 +1878,7 @@ reval:
struct inode *inode = path.dentry->d_inode;
void *cookie;
error = -ELOOP;
- /* S_ISDIR part is a temporary automount kludge */
- if (!(nd.flags & LOOKUP_FOLLOW) && !S_ISDIR(inode->i_mode))
+ if (!(nd.flags & LOOKUP_FOLLOW))
goto exit_dput;
if (count++ == 32)
goto exit_dput;

2010-07-15 02:17:29

Subject: [PATCH 17/18] xstat: Add an AT_NO_AUTOMOUNT flag to suppress terminal automount [ver #6]

Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of directories
with follow_link semantics. This can be used by fstatat()/xstat() users to
permit the gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.

Signed-off-by: David Howells <[email protected]>
---

fs/namei.c | 15 ++++++++++-----
fs/stat.c | 4 +++-
include/linux/fcntl.h | 1 +
include/linux/namei.h | 2 ++
4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 86068a2..056427e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -654,7 +654,8 @@ static int follow_automount(struct path *path, int res)
/* no need for dcache_lock, as serialization is taken care in
* namespace.c
*/
-static int __follow_mount(struct path *path, unsigned nofollow)
+static int __follow_mount(struct path *path, unsigned nofollow,
+ struct nameidata *nd)
{
struct vfsmount *mounted;
int ret, res = 0;
@@ -674,8 +675,12 @@ static int __follow_mount(struct path *path, unsigned nofollow)
}
if (!d_automount_point(path->dentry))
break;
- if (nofollow)
- return -ELOOP;
+ if (!(nd->flags & LOOKUP_CONTINUE)) {
+ if (nofollow)
+ return -ELOOP;
+ if (nd->flags & LOOKUP_NO_AUTOMOUNT)
+ break;
+ }
ret = follow_automount(path, res);
if (ret < 0)
return ret;
@@ -769,7 +774,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
done:
path->mnt = mnt;
path->dentry = dentry;
- ret = __follow_mount(path, 0);
+ ret = __follow_mount(path, 0, nd);
if (unlikely(ret < 0))
path_put(path);
return ret;
@@ -1762,7 +1767,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (open_flag & O_EXCL)
goto exit_dput;

- error = __follow_mount(path, open_flag & O_NOFOLLOW);
+ error = __follow_mount(path, open_flag & O_NOFOLLOW, nd);
if (error < 0)
goto exit_dput;

diff --git a/fs/stat.c b/fs/stat.c
index bb0f538..3f2ab5f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -182,11 +182,13 @@ int vfs_xstat(int dfd, const char __user *filename, int flags,
struct path path;
int error, lookup_flags;

- if (flags & ~(AT_SYMLINK_NOFOLLOW | KSTAT_QUERY_FLAGS))
+ if (flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ KSTAT_QUERY_FLAGS))
return -EINVAL;

stat->query_flags = flags & KSTAT_QUERY_FLAGS;
lookup_flags = (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW;
+ lookup_flags |= (flags & AT_NO_AUTOMOUNT) ? LOOKUP_NO_AUTOMOUNT : 0;

error = user_path_at(dfd, filename, lookup_flags, &path);
if (!error) {
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index bcf8083..768b0fd 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -46,6 +46,7 @@
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
#define AT_FORCE_ATTR_SYNC 0x800 /* Force the attributes to be sync'd with the server */
+#define AT_NO_AUTOMOUNT 0x1000 /* Suppress terminal automount traversal */

#ifdef __KERNEL__

diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..1e1febf 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -43,12 +43,14 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
* - internal "there are more path components" flag
* - locked when lookup done with dcache_lock held
* - dentry cache is untrusted; force a real lookup
+ * - suppress terminal automount
*/
#define LOOKUP_FOLLOW 1
#define LOOKUP_DIRECTORY 2
#define LOOKUP_CONTINUE 4
#define LOOKUP_PARENT 16
#define LOOKUP_REVAL 64
+#define LOOKUP_NO_AUTOMOUNT 128
/*
* Intent data
*/
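
From userspace, the flag added by this patch would be passed to fstatat() (or xstat()) roughly as in the sketch below. `stat_no_automount()` is a hypothetical wrapper written for this example. Note the assumption in the fallback definition: mainline Linux headers define AT_NO_AUTOMOUNT as 0x800, whereas this series uses 0x1000 because AT_FORCE_ATTR_SYNC occupies 0x800 here, so the value must always come from the headers matching the running kernel where they provide it.

```c
#define _GNU_SOURCE		/* glibc exposes AT_NO_AUTOMOUNT under this */
#include <fcntl.h>
#include <sys/stat.h>

/* Fallback for C libraries that do not expose the flag. 0x800 is the
 * mainline Linux value; the patch series above defines 0x1000 instead. */
#ifndef AT_NO_AUTOMOUNT
#define AT_NO_AUTOMOUNT 0x800
#endif

/* stat 'path' without following a terminal symlink and without
 * triggering an automount on the final component.
 * Returns 0 on success, or -1 with errno set. */
static int stat_no_automount(const char *path, struct stat *st)
{
	return fstatat(AT_FDCWD, path, st,
		       AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT);
}
```

A directory lister such as ls could call something like this for each entry, so that listing a directory full of automount points does not mount every one of them.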

2010-07-15 02:17:27

Subject: [PATCH 15/18] xstat: CIFS: Use d_automount() rather than abusing follow_link() [ver #6]

Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

[NOTE: THIS IS UNTESTED!]

[Question: Why does cifs_dfs_do_refmount() call build_path_from_dentry() when
the caller has already done that and could pass the result through?]

Signed-off-by: David Howells <[email protected]>
Cc: Steve French <[email protected]>
---

fs/cifs/cifs_dfs_ref.c | 145 +++++++++++++++++++++++-------------------------
fs/cifs/cifsfs.h | 6 ++
fs/cifs/dir.c | 2 +
fs/cifs/inode.c | 8 ++-
4 files changed, 83 insertions(+), 78 deletions(-)

diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
index 4516867..500b952 100644
--- a/fs/cifs/cifs_dfs_ref.c
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -230,8 +230,8 @@ compose_mount_options_err:
}

-static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
- struct dentry *dentry, const struct dfs_info3_param *ref)
+static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
+ const struct dfs_info3_param *ref)
{
struct cifs_sb_info *cifs_sb;
struct vfsmount *mnt;
@@ -239,12 +239,12 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
char *devname = NULL;
char *fullpath;

- cifs_sb = CIFS_SB(dentry->d_inode->i_sb);
+ cifs_sb = CIFS_SB(mntpt->d_inode->i_sb);
/*
* this function gives us a path with a double backslash prefix. We
* require a single backslash for DFS.
*/
- fullpath = build_path_from_dentry(dentry);
+ fullpath = build_path_from_dentry(mntpt);
if (!fullpath)
return ERR_PTR(-ENOMEM);

@@ -262,35 +262,6 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,

}

-static int add_mount_helper(struct vfsmount *newmnt, struct nameidata *nd,
- struct list_head *mntlist)
-{
- /* stolen from afs code */
- int err;
-
- mntget(newmnt);
- err = do_add_mount(newmnt, &nd->path, nd->path.mnt->mnt_flags | MNT_SHRINKABLE, mntlist);
- switch (err) {
- case 0:
- path_put(&nd->path);
- nd->path.mnt = newmnt;
- nd->path.dentry = dget(newmnt->mnt_root);
- schedule_delayed_work(&cifs_dfs_automount_task,
- cifs_dfs_mountpoint_expiry_timeout);
- break;
- case -EBUSY:
- /* someone else made a mount here whilst we were busy */
- while (d_mountpoint(nd->path.dentry) &&
- follow_down(&nd->path))
- ;
- err = 0;
- default:
- mntput(newmnt);
- break;
- }
- return err;
-}
-
static void dump_referral(const struct dfs_info3_param *ref)
{
cFYI(1, "DFS: ref path: %s", ref->path_name);
@@ -300,34 +271,30 @@ static void dump_referral(const struct dfs_info3_param *ref)
ref->path_consumed);
}

-
-static void*
-cifs_dfs_follow_mountpoint(struct dentry *dentry, struct nameidata *nd)
+/*
+ * Create a vfsmount that we can automount
+ */
+static struct vfsmount *cifs_dfs_do_automount(struct dentry *mntpt)
{
struct dfs_info3_param *referrals = NULL;
unsigned int num_referrals = 0;
struct cifs_sb_info *cifs_sb;
struct cifsSesInfo *ses;
- char *full_path = NULL;
+ char *full_path;
int xid, i;
- int rc = 0;
- struct vfsmount *mnt = ERR_PTR(-ENOENT);
+ int rc;
+ struct vfsmount *mnt;

cFYI(1, "in %s", __func__);
- BUG_ON(IS_ROOT(dentry));
+ BUG_ON(IS_ROOT(mntpt));

xid = GetXid();

- dput(nd->path.dentry);
- nd->path.dentry = dget(dentry);
-
- cifs_sb = CIFS_SB(dentry->d_inode->i_sb);
+ cifs_sb = CIFS_SB(mntpt->d_inode->i_sb);
+ mnt = ERR_PTR(-EINVAL);
ses = cifs_sb->tcon->ses;
-
- if (!ses) {
- rc = -EINVAL;
- goto out_err;
- }
+ if (!ses)
+ goto free_xid;

/*
* The MSDFS spec states that paths in DFS referral requests and
@@ -335,56 +302,84 @@ cifs_dfs_follow_mountpoint(struct dentry *dentry, struct nameidata *nd)
* the double backslashes usually used in the UNC. This function
* gives us the latter, so we must adjust the result.
*/
- full_path = build_path_from_dentry(dentry);
- if (full_path == NULL) {
- rc = -ENOMEM;
- goto out_err;
- }
+ mnt = ERR_PTR(-ENOMEM);
+ full_path = build_path_from_dentry(mntpt);
+ if (full_path == NULL)
+ goto free_xid;

rc = get_dfs_path(xid, ses , full_path + 1, cifs_sb->local_nls,
&num_referrals, &referrals,
cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR);

+ mnt = ERR_PTR(-ENOENT);
for (i = 0; i < num_referrals; i++) {
int len;
- dump_referral(referrals+i);
+ dump_referral(referrals + i);
/* connect to a node */
len = strlen(referrals[i].node_name);
if (len < 2) {
cERROR(1, "%s: Net Address path too short: %s",
__func__, referrals[i].node_name);
- rc = -EINVAL;
- goto out_err;
+ mnt = ERR_PTR(-EINVAL);
+ break;
}
- mnt = cifs_dfs_do_refmount(nd->path.mnt,
- nd->path.dentry, referrals + i);
+ mnt = cifs_dfs_do_refmount(mntpt, referrals + i);
cFYI(1, "%s: cifs_dfs_do_refmount:%s , mnt:%p", __func__,
referrals[i].node_name, mnt);
-
- /* complete mount procedure if we accured submount */
if (!IS_ERR(mnt))
- break;
+ goto success;
}

- /* we need it cause for() above could exit without valid submount */
- rc = PTR_ERR(mnt);
- if (IS_ERR(mnt))
- goto out_err;
-
- rc = add_mount_helper(mnt, nd, &cifs_dfs_automount_list);
+ /* no valid submounts were found; return error from get_dfs_path() by
+ * preference */
+ if (rc != 0)
+ mnt = ERR_PTR(rc);

-out:
- FreeXid(xid);
+success:
free_dfs_info_array(referrals, num_referrals);
kfree(full_path);
+free_xid:
+ FreeXid(xid);
cFYI(1, "leaving %s" , __func__);
- return ERR_PTR(rc);
+ return mnt;
-out_err:
- path_put(&nd->path);
- goto out;
+}
+
+/*
+ * Attempt to automount the referral
+ */
+struct vfsmount *cifs_dfs_d_automount(struct path *path)
+{
+ struct vfsmount *newmnt;
+ int err;
+
+ cFYI(1, "in %s", __func__);
+
+ newmnt = cifs_dfs_do_automount(path->dentry);
+ if (IS_ERR(newmnt)) {
+ cFYI(1, "leaving %s [automount failed]" , __func__);
+ return newmnt;
+ }
+
+ mntget(newmnt);
+ err = do_add_mount(newmnt, path, MNT_SHRINKABLE,
+ &cifs_dfs_automount_list);
+ switch (err) {
+ case 0:
+ schedule_delayed_work(&cifs_dfs_automount_task,
+ cifs_dfs_mountpoint_expiry_timeout);
+ cFYI(1, "leaving %s [ok]" , __func__);
+ return newmnt;
+ case -EBUSY:
+ /* someone else made a mount here whilst we were busy */
+ mntput(newmnt);
+ cFYI(1, "leaving %s [EBUSY]" , __func__);
+ return NULL;
+ default:
+ mntput(newmnt);
+ cFYI(1, "leaving %s [error %d]" , __func__, err);
+ return ERR_PTR(err);
+ }
}

const struct inode_operations cifs_dfs_referral_inode_operations = {
- .follow_link = cifs_dfs_follow_mountpoint,
};
-
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 50bf70b..fb50a4a 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -95,6 +95,12 @@ extern int cifs_readdir(struct file *file, void *direntry, filldir_t filldir);
extern const struct dentry_operations cifs_dentry_ops;
extern const struct dentry_operations cifs_ci_dentry_ops;

+#ifdef CONFIG_CIFS_DFS_UPCALL
+extern struct vfsmount *cifs_dfs_d_automount(struct path *path);
+#else
+#define cifs_dfs_d_automount NULL
+#endif
+
/* Functions related to symlinks */
extern void *cifs_follow_link(struct dentry *direntry, struct nameidata *nd);
extern void cifs_put_link(struct dentry *direntry,
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 65da30e..73fd652 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -804,6 +804,7 @@ cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)

const struct dentry_operations cifs_dentry_ops = {
.d_revalidate = cifs_d_revalidate,
+ .d_automount = cifs_dfs_d_automount,
/* d_delete: cifs_d_delete, */ /* not needed except for debugging */
};

@@ -844,4 +845,5 @@ const struct dentry_operations cifs_ci_dentry_ops = {
.d_revalidate = cifs_d_revalidate,
.d_hash = cifs_ci_hash,
.d_compare = cifs_ci_compare,
+ .d_automount = cifs_dfs_d_automount,
};
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index ff4a62f..12d95f4 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -31,7 +31,7 @@
#include "cifs_fs_sb.h"

-static void cifs_set_ops(struct inode *inode, const bool is_dfs_referral)
+static void cifs_set_ops(struct inode *inode)
{
struct cifs_sb_info *cifs_sb = CIFS_SB(inode->i_sb);

@@ -59,7 +59,7 @@ static void cifs_set_ops(struct inode *inode, const bool is_dfs_referral)
break;
case S_IFDIR:
#ifdef CONFIG_CIFS_DFS_UPCALL
- if (is_dfs_referral) {
+ if (IS_AUTOMOUNT(inode)) {
inode->i_op = &cifs_dfs_referral_inode_operations;
} else {
#else /* NO DFS support, treat as a directory */
@@ -170,7 +170,9 @@ cifs_fattr_to_inode(struct inode *inode, struct cifs_fattr *fattr)
}
spin_unlock(&inode->i_lock);

- cifs_set_ops(inode, fattr->cf_flags & CIFS_FATTR_DFS_REFERRAL);
+ if (fattr->cf_flags & CIFS_FATTR_DFS_REFERRAL)
+ inode->i_flags |= S_AUTOMOUNT;
+ cifs_set_ops(inode);
}

void
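
The new cifs_dfs_d_automount() in the patch above reports three outcomes through one pointer return value: a usable mount, NULL when someone else mounted there first (-EBUSY), and an encoded errno for any other failure. The following user-space sketch imitates that convention. ERR_PTR(), PTR_ERR() and IS_ERR() are re-implemented here in simplified form, and struct fake_mount, the_mount and automount() are hypothetical stand-ins for illustration, not kernel interfaces.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space imitations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for struct vfsmount. */
struct fake_mount {
	int id;
};

static struct fake_mount the_mount = { 1 };

/*
 * Hypothetical stand-in for the d_automount tail: add_result plays the
 * role of the value returned by do_add_mount().
 */
static struct fake_mount *automount(int add_result)
{
	struct fake_mount *newmnt = &the_mount;

	switch (add_result) {
	case 0:
		return newmnt;		/* mounted: hand the mount back */
	case -EBUSY:
		return NULL;		/* lost the race: not an error */
	default:
		return ERR_PTR(add_result);	/* real failure */
	}
}
```

A caller would test IS_ERR() first, then treat NULL as "already mounted by someone else", and only then use the returned pointer as a mount.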

2010-07-17 09:00:30

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Saturday 17 July 2010 07:51:30 Mark Harris wrote:
> David Howells wrote:
> > With a 2:2 split between exponent
> > (tv_gran_units) and mantissa (tv_granularity), you can do:
> >
> >   UNIT          SECONDS/UNIT  EXPONENT  MANTISSA
> >   nanoseconds   0.000000001   -9        1
> >   microseconds  0.000001      -6        1
> >   milliseconds  0.001         -3        1
> >   seconds       1             0         1
> >   minutes       60            1         6
> >   hours         3600          2         36
> >   days          86400         2         864
> >   weeks         604800        2         6048
>
> At least for the in-tree filesystems, I do not see any that keep
> timestamps with a granularity larger than 2s. For that, a simple
> 32-bit tv_granularity in nanoseconds (not limited to 1e9) would
> suffice, and there is no need for the complexity of dealing with
> a separate exponent.

Yes, good point. That would indeed be a significant simplification.

Arnd
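
As a rough illustration of the single-field proposal in this message, the user-space sketch below compares two timestamps by treating each one as an interval that is one granule wide. Only the idea of a 32-bit tv_granularity counted in nanoseconds comes from the thread. The struct xtime layout and the helper names are hypothetical, and the sketch assumes that each filesystem truncates its timestamps down to the start of a granule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical timestamp carrying its own granularity in nanoseconds. */
struct xtime {
	int64_t  tv_sec;
	uint32_t tv_nsec;
	uint32_t tv_granularity;	/* e.g. 2000000000 for a 2s granule */
};

/* Flatten to nanoseconds; adequate for a sketch, not for extreme dates. */
static int64_t xtime_to_ns(const struct xtime *t)
{
	assert(t->tv_nsec < NSEC_PER_SEC);
	return t->tv_sec * NSEC_PER_SEC + t->tv_nsec;
}

/* True only if a's whole granule [a, a + gran) lies before b's start. */
static bool xtime_definitely_before(const struct xtime *a,
				    const struct xtime *b)
{
	return xtime_to_ns(a) + a->tv_granularity <= xtime_to_ns(b);
}

/* Overlapping granules: the two stamps cannot be told apart. */
static bool xtime_may_be_equal(const struct xtime *a, const struct xtime *b)
{
	return !xtime_definitely_before(a, b) &&
	       !xtime_definitely_before(b, a);
}
```

With a 2s-granular stamp of 100s and a 1ns-granular stamp of 101.5s, xtime_may_be_equal() reports an overlap, whereas a 1ns-granular stamp of 102s falls outside the first granule and compares as definitely later. A 32-bit field holds up to about 4.29 seconds in nanoseconds, which covers the 2s case mentioned above.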

2010-07-15 02:17:11

Subject: [PATCH 01/18] Mark arguments to certain syscalls as being const [ver #6]

Mark arguments to certain system calls as being const where they should be but
aren't. The list includes:

(*) The filename arguments of various stat syscalls, execve(), various utimes
syscalls and some mount syscalls.

(*) The filename arguments of some syscall helpers relating to the above.

(*) The buffer argument of various write syscalls.

Signed-off-by: David Howells <[email protected]>
---
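
For readers who want to see what the const qualifier buys a caller, here is a small user-space sketch. It mirrors the kernel_execve() hunks in this patch, where the "(char *)filename" cast is dropped once the callee accepts a const pointer. The function names name_len_old(), name_len_new() and caller() are hypothetical and exist only for this illustration.

```c
#include <string.h>

/* Hypothetical helper with an old-style prototype: non-const filename. */
static size_t name_len_old(char *filename)
{
	return strlen(filename);
}

/* The same helper once its argument is marked const. */
static size_t name_len_new(const char *filename)
{
	return strlen(filename);
}

/* A caller that only holds a const pointer, much like kernel_execve(). */
static size_t caller(const char *filename)
{
	/* The old prototype forces a cast that silently discards const. */
	size_t a = name_len_old((char *)filename);
	/* The new prototype accepts the pointer as it is. */
	size_t b = name_len_new(filename);

	return a == b ? b : 0;
}
```

Neither helper writes through the pointer, so the const version documents that fact in the prototype and lets the compiler check it.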

arch/alpha/kernel/osf_sys.c | 6 +++---
arch/alpha/kernel/process.c | 2 +-
arch/arm/kernel/sys_arm.c | 4 ++--
arch/arm/kernel/sys_oabi-compat.c | 6 +++---
arch/avr32/include/asm/syscalls.h | 2 +-
arch/avr32/kernel/process.c | 3 ++-
arch/blackfin/kernel/process.c | 2 +-
arch/frv/kernel/process.c | 3 ++-
arch/h8300/kernel/process.c | 2 +-
arch/ia64/include/asm/unistd.h | 2 +-
arch/ia64/kernel/process.c | 2 +-
arch/m32r/kernel/process.c | 3 ++-
arch/m68k/kernel/process.c | 2 +-
arch/m68knommu/kernel/process.c | 2 +-
arch/microblaze/kernel/sys_microblaze.c | 2 +-
arch/mips/kernel/syscall.c | 2 +-
arch/mn10300/kernel/process.c | 2 +-
arch/parisc/hpux/fs.c | 7 ++++---
arch/powerpc/kernel/process.c | 2 +-
arch/powerpc/kernel/sys_ppc32.c | 2 +-
arch/s390/kernel/compat_linux.c | 10 +++++-----
arch/s390/kernel/compat_linux.h | 10 +++++-----
arch/s390/kernel/entry.h | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/include/asm/syscalls_32.h | 2 +-
arch/sh/include/asm/syscalls_64.h | 2 +-
arch/sh/kernel/process_64.c | 2 +-
arch/sparc/kernel/sys_sparc32.c | 7 ++++---
arch/um/kernel/exec.c | 6 +++---
arch/um/kernel/internal.h | 2 +-
arch/um/kernel/syscall.c | 2 +-
arch/x86/ia32/sys_ia32.c | 14 +++++++-------
arch/x86/include/asm/sys_ia32.h | 12 ++++++------
arch/x86/include/asm/syscalls.h | 2 +-
arch/x86/kernel/entry_64.S | 4 ++--
arch/x86/kernel/process.c | 2 +-
arch/xtensa/kernel/process.c | 2 +-
fs/compat.c | 23 +++++++++++++----------
fs/stat.c | 29 ++++++++++++++++++-----------
fs/utimes.c | 7 ++++---
include/linux/compat.h | 6 +++---
include/linux/fs.h | 6 +++---
include/linux/syscalls.h | 20 ++++++++++----------
include/linux/time.h | 2 +-
44 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index de9d397..1719fe3 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -244,7 +244,7 @@ do_osf_statfs(struct dentry * dentry, struct osf_statfs __user *buffer,
return error;
}

-SYSCALL_DEFINE3(osf_statfs, char __user *, pathname,
+SYSCALL_DEFINE3(osf_statfs, const char __user *, pathname,
struct osf_statfs __user *, buffer, unsigned long, bufsiz)
{
struct path path;
@@ -358,7 +358,7 @@ osf_procfs_mount(char *dirname, struct procfs_args __user *args, int flags)
return do_mount("", dirname, "proc", flags, NULL);
}

-SYSCALL_DEFINE4(osf_mount, unsigned long, typenr, char __user *, path,
+SYSCALL_DEFINE4(osf_mount, unsigned long, typenr, const char __user *, path,
int, flag, void __user *, data)
{
int retval;
@@ -932,7 +932,7 @@ SYSCALL_DEFINE3(osf_setitimer, int, which, struct itimerval32 __user *, in,

}

-SYSCALL_DEFINE2(osf_utimes, char __user *, filename,
+SYSCALL_DEFINE2(osf_utimes, const char __user *, filename,
struct timeval32 __user *, tvs)
{
struct timespec tv[2];
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 395a464..88e608a 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -387,7 +387,7 @@ EXPORT_SYMBOL(dump_elf_task_fp);
* sys_execve() executes a new program.
*/
asmlinkage int
-do_sys_execve(char __user *ufilename, char __user * __user *argv,
+do_sys_execve(const char __user *ufilename, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
int error;
diff --git a/arch/arm/kernel/sys_arm.c b/arch/arm/kernel/sys_arm.c
index c235018..5b7c541 100644
--- a/arch/arm/kernel/sys_arm.c
+++ b/arch/arm/kernel/sys_arm.c
@@ -62,7 +62,7 @@ asmlinkage int sys_vfork(struct pt_regs *regs)
/* sys_execve() executes a new program.
* This is called indirectly via a small wrapper
*/
-asmlinkage int sys_execve(char __user *filenamei, char __user * __user *argv,
+asmlinkage int sys_execve(const char __user *filenamei, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
int error;
@@ -84,7 +84,7 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[])
int ret;

memset(&regs, 0, sizeof(struct pt_regs));
- ret = do_execve((char *)filename, (char __user * __user *)argv,
+ ret = do_execve(filename, (char __user * __user *)argv,
(char __user * __user *)envp, &regs);
if (ret < 0)
goto out;
diff --git a/arch/arm/kernel/sys_oabi-compat.c b/arch/arm/kernel/sys_oabi-compat.c
index 33ff678..4ad8da1 100644
--- a/arch/arm/kernel/sys_oabi-compat.c
+++ b/arch/arm/kernel/sys_oabi-compat.c
@@ -141,7 +141,7 @@ static long cp_oldabi_stat64(struct kstat *stat,
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-asmlinkage long sys_oabi_stat64(char __user * filename,
+asmlinkage long sys_oabi_stat64(const char __user * filename,
struct oldabi_stat64 __user * statbuf)
{
struct kstat stat;
@@ -151,7 +151,7 @@ asmlinkage long sys_oabi_stat64(char __user * filename,
return error;
}

-asmlinkage long sys_oabi_lstat64(char __user * filename,
+asmlinkage long sys_oabi_lstat64(const char __user * filename,
struct oldabi_stat64 __user * statbuf)
{
struct kstat stat;
@@ -172,7 +172,7 @@ asmlinkage long sys_oabi_fstat64(unsigned long fd,
}

asmlinkage long sys_oabi_fstatat64(int dfd,
- char __user *filename,
+ const char __user *filename,
struct oldabi_stat64 __user *statbuf,
int flag)
{
diff --git a/arch/avr32/include/asm/syscalls.h b/arch/avr32/include/asm/syscalls.h
index 66a1972..ab608b7 100644
--- a/arch/avr32/include/asm/syscalls.h
+++ b/arch/avr32/include/asm/syscalls.h
@@ -21,7 +21,7 @@ asmlinkage int sys_clone(unsigned long, unsigned long,
unsigned long, unsigned long,
struct pt_regs *);
asmlinkage int sys_vfork(struct pt_regs *);
-asmlinkage int sys_execve(char __user *, char __user *__user *,
+asmlinkage int sys_execve(const char __user *, char __user *__user *,
char __user *__user *, struct pt_regs *);

/* kernel/signal.c */
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 2d76515..e5daddf 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -383,7 +383,8 @@ asmlinkage int sys_vfork(struct pt_regs *regs)
0, NULL, NULL);
}

-asmlinkage int sys_execve(char __user *ufilename, char __user *__user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename,
+ char __user *__user *uargv,
char __user *__user *uenvp, struct pt_regs *regs)
{
int error;
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 93ec07d..a566f61 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -209,7 +209,7 @@ copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv, char __user * __user *envp)
{
int error;
char *filename;
diff --git a/arch/frv/kernel/process.c b/arch/frv/kernel/process.c
index 21d0fd1..428931c 100644
--- a/arch/frv/kernel/process.c
+++ b/arch/frv/kernel/process.c
@@ -250,7 +250,8 @@ int copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv,
+ char __user * __user *envp)
{
int error;
char * filename;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8c8b0ff..8b7b78d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -212,7 +212,7 @@ int copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *name, char **argv, char **envp,int dummy,...)
+asmlinkage int sys_execve(const char *name, char **argv, char **envp,int dummy,...)
{
int error;
char * filename;
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index bb8b0ff..46f36fc 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -353,7 +353,7 @@ asmlinkage unsigned long sys_mmap2(
int fd, long pgoff);
struct pt_regs;
struct sigaction;
-long sys_execve(char __user *filename, char __user * __user *argv,
+long sys_execve(const char __user *filename, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs);
asmlinkage long sys_ia64_pipe(void);
asmlinkage long sys_rt_sigaction(int sig,
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 53f1648..a879c03 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -633,7 +633,7 @@ dump_fpu (struct pt_regs *pt, elf_fpregset_t dst)
}

long
-sys_execve (char __user *filename, char __user * __user *argv, char __user * __user *envp,
+sys_execve (const char __user *filename, char __user * __user *argv, char __user * __user *envp,
struct pt_regs *regs)
{
char *fname;
diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c
index bc8c8c1..8665a4d 100644
--- a/arch/m32r/kernel/process.c
+++ b/arch/m32r/kernel/process.c
@@ -288,7 +288,8 @@ asmlinkage int sys_vfork(unsigned long r0, unsigned long r1, unsigned long r2,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *ufilename, char __user * __user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename,
+ char __user * __user *uargv,
char __user * __user *uenvp,
unsigned long r3, unsigned long r4, unsigned long r5,
unsigned long r6, struct pt_regs regs)
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 1a6be27..221d0b7 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -315,7 +315,7 @@ EXPORT_SYMBOL(dump_fpu);
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv, char __user * __user *envp)
{
int error;
char * filename;
diff --git a/arch/m68knommu/kernel/process.c b/arch/m68knommu/kernel/process.c
index 6aa6613..6350f68 100644
--- a/arch/m68knommu/kernel/process.c
+++ b/arch/m68knommu/kernel/process.c
@@ -350,7 +350,7 @@ void dump(struct pt_regs *fp)
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *name, char **argv, char **envp)
+asmlinkage int sys_execve(const char *name, char **argv, char **envp)
{
int error;
char * filename;
diff --git a/arch/microblaze/kernel/sys_microblaze.c b/arch/microblaze/kernel/sys_microblaze.c
index f4e00b7..6abab6e 100644
--- a/arch/microblaze/kernel/sys_microblaze.c
+++ b/arch/microblaze/kernel/sys_microblaze.c
@@ -47,7 +47,7 @@ asmlinkage long microblaze_clone(int flags, unsigned long stack, struct pt_regs
return do_fork(flags, stack, regs, 0, NULL, NULL);
}

-asmlinkage long microblaze_execve(char __user *filenamei, char __user *__user *argv,
+asmlinkage long microblaze_execve(const char __user *filenamei, char __user *__user *argv,
char __user *__user *envp, struct pt_regs *regs)
{
int error;
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index dd81b0f..6322c39 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -207,7 +207,7 @@ asmlinkage int sys_execve(nabi_no_regargs struct pt_regs regs)
int error;
char * filename;

- filename = getname((char __user *) (long)regs.regs[4]);
+ filename = getname((const char __user *) (long)regs.regs[4]);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c
index 82b817c..762eb32 100644
--- a/arch/mn10300/kernel/process.c
+++ b/arch/mn10300/kernel/process.c
@@ -268,7 +268,7 @@ asmlinkage long sys_vfork(void)
0, NULL, NULL);
}

-asmlinkage long sys_execve(char __user *name,
+asmlinkage long sys_execve(const char __user *name,
char __user * __user *argv,
char __user * __user *envp)
{
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 6935123..1444875 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -36,7 +36,7 @@ int hpux_execve(struct pt_regs *regs)
int error;
char *filename;

- filename = getname((char __user *) regs->gr[26]);
+ filename = getname((const char __user *) regs->gr[26]);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
@@ -169,7 +169,7 @@ static int cp_hpux_stat(struct kstat *stat, struct hpux_stat64 __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-long hpux_stat64(char __user *filename, struct hpux_stat64 __user *statbuf)
+long hpux_stat64(const char __user *filename, struct hpux_stat64 __user *statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -191,7 +191,8 @@ long hpux_fstat64(unsigned int fd, struct hpux_stat64 __user *statbuf)
return error;
}

-long hpux_lstat64(char __user *filename, struct hpux_stat64 __user *statbuf)
+long hpux_lstat64(const char __user *filename,
+ struct hpux_stat64 __user *statbuf)
{
struct kstat stat;
int error = vfs_lstat(filename, &stat);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 773424d..3ef6ed4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -991,7 +991,7 @@ int sys_execve(unsigned long a0, unsigned long a1, unsigned long a2,
int error;
char *filename;

- filename = getname((char __user *) a0);
+ filename = getname((const char __user *) a0);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index 19471a1..20fd701 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -546,7 +546,7 @@ compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_siz
return sys_pread64(fd, ubuf, count, ((loff_t)poshi << 32) | poslo);
}

-compat_ssize_t compat_sys_pwrite64(unsigned int fd, char __user *ubuf, compat_size_t count,
+compat_ssize_t compat_sys_pwrite64(unsigned int fd, const char __user *ubuf, compat_size_t count,
u32 reg6, u32 poshi, u32 poslo)
{
return sys_pwrite64(fd, ubuf, count, ((loff_t)poshi << 32) | poslo);
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 73b624e..1e6449c 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -436,7 +436,7 @@ sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo)
* sys32_execve() executes a new program after the asm stub has set
* things up for us. This should basically do what I want it to.
*/
-asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+asmlinkage long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp)
{
struct pt_regs *regs = task_pt_regs(current);
@@ -570,7 +570,7 @@ static int cp_stat64(struct stat64_emu31 __user *ubuf, struct kstat *stat)
return copy_to_user(ubuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-asmlinkage long sys32_stat64(char __user * filename, struct stat64_emu31 __user * statbuf)
+asmlinkage long sys32_stat64(const char __user * filename, struct stat64_emu31 __user * statbuf)
{
struct kstat stat;
int ret = vfs_stat(filename, &stat);
@@ -579,7 +579,7 @@ asmlinkage long sys32_stat64(char __user * filename, struct stat64_emu31 __user
return ret;
}

-asmlinkage long sys32_lstat64(char __user * filename, struct stat64_emu31 __user * statbuf)
+asmlinkage long sys32_lstat64(const char __user * filename, struct stat64_emu31 __user * statbuf)
{
struct kstat stat;
int ret = vfs_lstat(filename, &stat);
@@ -597,7 +597,7 @@ asmlinkage long sys32_fstat64(unsigned long fd, struct stat64_emu31 __user * sta
return ret;
}

-asmlinkage long sys32_fstatat64(unsigned int dfd, char __user *filename,
+asmlinkage long sys32_fstatat64(unsigned int dfd, const char __user *filename,
struct stat64_emu31 __user* statbuf, int flag)
{
struct kstat stat;
@@ -655,7 +655,7 @@ asmlinkage long sys32_read(unsigned int fd, char __user * buf, size_t count)
return sys_read(fd, buf, count);
}

-asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
+asmlinkage long sys32_write(unsigned int fd, const char __user * buf, size_t count)
{
if ((compat_ssize_t) count < 0)
return -EINVAL;
diff --git a/arch/s390/kernel/compat_linux.h b/arch/s390/kernel/compat_linux.h
index cb97afc..9635d75 100644
--- a/arch/s390/kernel/compat_linux.h
+++ b/arch/s390/kernel/compat_linux.h
@@ -193,7 +193,7 @@ long sys32_rt_sigprocmask(int how, compat_sigset_t __user *set,
compat_sigset_t __user *oset, size_t sigsetsize);
long sys32_rt_sigpending(compat_sigset_t __user *set, size_t sigsetsize);
long sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo);
-long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp);
long sys32_init_module(void __user *umod, unsigned long len,
const char __user *uargs);
@@ -207,16 +207,16 @@ long sys32_sendfile(int out_fd, int in_fd, compat_off_t __user *offset,
size_t count);
long sys32_sendfile64(int out_fd, int in_fd, compat_loff_t __user *offset,
s32 count);
-long sys32_stat64(char __user * filename, struct stat64_emu31 __user * statbuf);
-long sys32_lstat64(char __user * filename,
+long sys32_stat64(const char __user * filename, struct stat64_emu31 __user * statbuf);
+long sys32_lstat64(const char __user * filename,
struct stat64_emu31 __user * statbuf);
long sys32_fstat64(unsigned long fd, struct stat64_emu31 __user * statbuf);
-long sys32_fstatat64(unsigned int dfd, char __user *filename,
+long sys32_fstatat64(unsigned int dfd, const char __user *filename,
struct stat64_emu31 __user* statbuf, int flag);
unsigned long old32_mmap(struct mmap_arg_struct_emu31 __user *arg);
long sys32_mmap2(struct mmap_arg_struct_emu31 __user *arg);
long sys32_read(unsigned int fd, char __user * buf, size_t count);
-long sys32_write(unsigned int fd, char __user * buf, size_t count);
+long sys32_write(unsigned int fd, const char __user * buf, size_t count);
long sys32_fadvise64(int fd, loff_t offset, size_t len, int advise);
long sys32_fadvise64_64(struct fadvise64_64_args __user *args);
long sys32_sigaction(int sig, const struct old_sigaction32 __user *act,
diff --git a/arch/s390/kernel/entry.h b/arch/s390/kernel/entry.h
index eb15c12..e2c048b 100644
--- a/arch/s390/kernel/entry.h
+++ b/arch/s390/kernel/entry.h
@@ -42,7 +42,7 @@ long sys_clone(unsigned long newsp, unsigned long clone_flags,
int __user *parent_tidptr, int __user *child_tidptr);
long sys_vfork(void);
void execve_tail(void);
-long sys_execve(char __user *name, char __user * __user *argv,
+long sys_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp);
long sys_sigsuspend(int history0, int history1, old_sigset_t mask);
long sys_sigaction(int sig, const struct old_sigaction __user *act,
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 1039fde..7eafaf2 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -267,7 +267,7 @@ asmlinkage void execve_tail(void)
/*
* sys_execve() executes a new program.
*/
-SYSCALL_DEFINE3(execve, char __user *, name, char __user * __user *, argv,
+SYSCALL_DEFINE3(execve, const char __user *, name, char __user * __user *, argv,
char __user * __user *, envp)
{
struct pt_regs *regs = task_pt_regs(current);
diff --git a/arch/sh/include/asm/syscalls_32.h b/arch/sh/include/asm/syscalls_32.h
index 8b30200..be201fd 100644
--- a/arch/sh/include/asm/syscalls_32.h
+++ b/arch/sh/include/asm/syscalls_32.h
@@ -19,7 +19,7 @@ asmlinkage int sys_clone(unsigned long clone_flags, unsigned long newsp,
asmlinkage int sys_vfork(unsigned long r4, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs __regs);
-asmlinkage int sys_execve(char __user *ufilename, char __user * __user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename, char __user * __user *uargv,
char __user * __user *uenvp, unsigned long r7,
struct pt_regs __regs);
asmlinkage int sys_sigsuspend(old_sigset_t mask, unsigned long r5,
diff --git a/arch/sh/include/asm/syscalls_64.h b/arch/sh/include/asm/syscalls_64.h
index 751fd88..ee519f4 100644
--- a/arch/sh/include/asm/syscalls_64.h
+++ b/arch/sh/include/asm/syscalls_64.h
@@ -21,7 +21,7 @@ asmlinkage int sys_vfork(unsigned long r2, unsigned long r3,
unsigned long r4, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs);
-asmlinkage int sys_execve(char *ufilename, char **uargv,
+asmlinkage int sys_execve(const char *ufilename, char **uargv,
char **uenvp, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs);
diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c
index d4ca648..68d128d 100644
--- a/arch/sh/kernel/process_64.c
+++ b/arch/sh/kernel/process_64.c
@@ -483,7 +483,7 @@ asmlinkage int sys_vfork(unsigned long r2, unsigned long r3,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *ufilename, char **uargv,
+asmlinkage int sys_execve(const char *ufilename, char **uargv,
char **uenvp, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs)
diff --git a/arch/sparc/kernel/sys_sparc32.c b/arch/sparc/kernel/sys_sparc32.c
index c0ca875..e6375a7 100644
--- a/arch/sparc/kernel/sys_sparc32.c
+++ b/arch/sparc/kernel/sys_sparc32.c
@@ -162,7 +162,7 @@ static int cp_compat_stat64(struct kstat *stat,
return err;
}

-asmlinkage long compat_sys_stat64(char __user * filename,
+asmlinkage long compat_sys_stat64(const char __user * filename,
struct compat_stat64 __user *statbuf)
{
struct kstat stat;
@@ -173,7 +173,7 @@ asmlinkage long compat_sys_stat64(char __user * filename,
return error;
}

-asmlinkage long compat_sys_lstat64(char __user * filename,
+asmlinkage long compat_sys_lstat64(const char __user * filename,
struct compat_stat64 __user *statbuf)
{
struct kstat stat;
@@ -195,7 +195,8 @@ asmlinkage long compat_sys_fstat64(unsigned int fd,
return error;
}

-asmlinkage long compat_sys_fstatat64(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_fstatat64(unsigned int dfd,
+ const char __user *filename,
struct compat_stat64 __user * statbuf, int flag)
{
struct kstat stat;
diff --git a/arch/um/kernel/exec.c b/arch/um/kernel/exec.c
index 97974c1..59b20d9 100644
--- a/arch/um/kernel/exec.c
+++ b/arch/um/kernel/exec.c
@@ -44,7 +44,7 @@ void start_thread(struct pt_regs *regs, unsigned long eip, unsigned long esp)
PT_REGS_SP(regs) = esp;
}

-static long execve1(char *file, char __user * __user *argv,
+static long execve1(const char *file, char __user * __user *argv,
char __user *__user *env)
{
long error;
@@ -61,7 +61,7 @@ static long execve1(char *file, char __user * __user *argv,
return error;
}

-long um_execve(char *file, char __user *__user *argv, char __user *__user *env)
+long um_execve(const char *file, char __user *__user *argv, char __user *__user *env)
{
long err;

@@ -71,7 +71,7 @@ long um_execve(char *file, char __user *__user *argv, char __user *__user *env)
return err;
}

-long sys_execve(char __user *file, char __user *__user *argv,
+long sys_execve(const char __user *file, char __user *__user *argv,
char __user *__user *env)
{
long error;
diff --git a/arch/um/kernel/internal.h b/arch/um/kernel/internal.h
index 3bda43c..1303a10 100644
--- a/arch/um/kernel/internal.h
+++ b/arch/um/kernel/internal.h
@@ -1 +1 @@
-extern long um_execve(char *file, char __user *__user *argv, char __user *__user *env);
+extern long um_execve(const char *file, char __user *__user *argv, char __user *__user *env);
diff --git a/arch/um/kernel/syscall.c b/arch/um/kernel/syscall.c
index 4393173..7427c0b 100644
--- a/arch/um/kernel/syscall.c
+++ b/arch/um/kernel/syscall.c
@@ -58,7 +58,7 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[])

fs = get_fs();
set_fs(KERNEL_DS);
- ret = um_execve((char *)filename, (char __user *__user *)argv,
+ ret = um_execve(filename, (char __user *__user *)argv,
(char __user *__user *) envp);
set_fs(fs);

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index 626be15..1baddad 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -51,7 +51,7 @@
#define AA(__x) ((unsigned long)(__x))

-asmlinkage long sys32_truncate64(char __user *filename,
+asmlinkage long sys32_truncate64(const char __user *filename,
unsigned long offset_low,
unsigned long offset_high)
{
@@ -96,7 +96,7 @@ static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat)
return 0;
}

-asmlinkage long sys32_stat64(char __user *filename,
+asmlinkage long sys32_stat64(const char __user *filename,
struct stat64 __user *statbuf)
{
struct kstat stat;
@@ -107,7 +107,7 @@ asmlinkage long sys32_stat64(char __user *filename,
return ret;
}

-asmlinkage long sys32_lstat64(char __user *filename,
+asmlinkage long sys32_lstat64(const char __user *filename,
struct stat64 __user *statbuf)
{
struct kstat stat;
@@ -126,7 +126,7 @@ asmlinkage long sys32_fstat64(unsigned int fd, struct stat64 __user *statbuf)
return ret;
}

-asmlinkage long sys32_fstatat(unsigned int dfd, char __user *filename,
+asmlinkage long sys32_fstatat(unsigned int dfd, const char __user *filename,
struct stat64 __user *statbuf, int flag)
{
struct kstat stat;
@@ -408,8 +408,8 @@ asmlinkage long sys32_pread(unsigned int fd, char __user *ubuf, u32 count,
((loff_t)AA(poshi) << 32) | AA(poslo));
}

-asmlinkage long sys32_pwrite(unsigned int fd, char __user *ubuf, u32 count,
- u32 poslo, u32 poshi)
+asmlinkage long sys32_pwrite(unsigned int fd, const char __user *ubuf,
+ u32 count, u32 poslo, u32 poshi)
{
return sys_pwrite64(fd, ubuf, count,
((loff_t)AA(poshi) << 32) | AA(poslo));
@@ -449,7 +449,7 @@ asmlinkage long sys32_sendfile(int out_fd, int in_fd,
return ret;
}

-asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+asmlinkage long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp, struct pt_regs *regs)
{
long error;
diff --git a/arch/x86/include/asm/sys_ia32.h b/arch/x86/include/asm/sys_ia32.h
index 3ad4217..c8a052a 100644
--- a/arch/x86/include/asm/sys_ia32.h
+++ b/arch/x86/include/asm/sys_ia32.h
@@ -18,13 +18,13 @@
#include <asm/ia32.h>

/* ia32/sys_ia32.c */
-asmlinkage long sys32_truncate64(char __user *, unsigned long, unsigned long);
+asmlinkage long sys32_truncate64(const char __user *, unsigned long, unsigned long);
asmlinkage long sys32_ftruncate64(unsigned int, unsigned long, unsigned long);

-asmlinkage long sys32_stat64(char __user *, struct stat64 __user *);
-asmlinkage long sys32_lstat64(char __user *, struct stat64 __user *);
+asmlinkage long sys32_stat64(const char __user *, struct stat64 __user *);
+asmlinkage long sys32_lstat64(const char __user *, struct stat64 __user *);
asmlinkage long sys32_fstat64(unsigned int, struct stat64 __user *);
-asmlinkage long sys32_fstatat(unsigned int, char __user *,
+asmlinkage long sys32_fstatat(unsigned int, const char __user *,
struct stat64 __user *, int);
struct mmap_arg_struct32;
asmlinkage long sys32_mmap(struct mmap_arg_struct32 __user *);
@@ -49,12 +49,12 @@ asmlinkage long sys32_rt_sigpending(compat_sigset_t __user *, compat_size_t);
asmlinkage long sys32_rt_sigqueueinfo(int, int, compat_siginfo_t __user *);

asmlinkage long sys32_pread(unsigned int, char __user *, u32, u32, u32);
-asmlinkage long sys32_pwrite(unsigned int, char __user *, u32, u32, u32);
+asmlinkage long sys32_pwrite(unsigned int, const char __user *, u32, u32, u32);

asmlinkage long sys32_personality(unsigned long);
asmlinkage long sys32_sendfile(int, int, compat_off_t __user *, s32);

-asmlinkage long sys32_execve(char __user *, compat_uptr_t __user *,
+asmlinkage long sys32_execve(const char __user *, compat_uptr_t __user *,
compat_uptr_t __user *, struct pt_regs *);
asmlinkage long sys32_clone(unsigned int, unsigned int, struct pt_regs *);

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 5c044b4..feb2ff9 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -23,7 +23,7 @@ long sys_iopl(unsigned int, struct pt_regs *);
/* kernel/process.c */
int sys_fork(struct pt_regs *);
int sys_vfork(struct pt_regs *);
-long sys_execve(char __user *, char __user * __user *,
+long sys_execve(const char __user *, char __user * __user *,
char __user * __user *, struct pt_regs *);
long sys_clone(unsigned long, unsigned long, void __user *,
void __user *, struct pt_regs *);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..77f5986 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1185,13 +1185,13 @@ END(kernel_thread_helper)
* execve(). This function needs to use IRET, not SYSRET, to set up all state properly.
*
* C extern interface:
- * extern long execve(char *name, char **argv, char **envp)
+ * extern long execve(const char *name, char **argv, char **envp)
*
* asm input arguments:
* rdi: name, rsi: argv, rdx: envp
*
* We want to fallback into:
- * extern long sys_execve(char *name, char **argv,char **envp, struct pt_regs *regs)
+ * extern long sys_execve(const char *name, char **argv,char **envp, struct pt_regs *regs)
*
* do_sys_execve asm fallback arguments:
* rdi: name, rsi: argv, rdx: envp, rcx: fake frame on the stack
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e7e3521..f5c816e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -300,7 +300,7 @@ EXPORT_SYMBOL(kernel_thread);
/*
* sys_execve() executes a new program.
*/
-long sys_execve(char __user *name, char __user * __user *argv,
+long sys_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
long error;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index f167e0f..7c2f38f 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -318,7 +318,7 @@ long xtensa_clone(unsigned long clone_flags, unsigned long newsp,
*/

asmlinkage
-long xtensa_execve(char __user *name, char __user * __user *argv,
+long xtensa_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp,
long a3, long a4, long a5,
struct pt_regs *regs)
diff --git a/fs/compat.c b/fs/compat.c
index 6490d21..d72591a 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -76,7 +76,8 @@ int compat_printk(const char *fmt, ...)
* Not all architectures have sys_utime, so implement this in terms
* of sys_utimes.
*/
-asmlinkage long compat_sys_utime(char __user *filename, struct compat_utimbuf __user *t)
+asmlinkage long compat_sys_utime(const char __user *filename,
+ struct compat_utimbuf __user *t)
{
struct timespec tv[2];

@@ -90,7 +91,7 @@ asmlinkage long compat_sys_utime(char __user *filename, struct compat_utimbuf __
return do_utimes(AT_FDCWD, filename, t ? tv : NULL, 0);
}

-asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename, struct compat_timespec __user *t, int flags)
+asmlinkage long compat_sys_utimensat(unsigned int dfd, const char __user *filename, struct compat_timespec __user *t, int flags)
{
struct timespec tv[2];

@@ -105,7 +106,7 @@ asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename, st
return do_utimes(dfd, filename, t ? tv : NULL, flags);
}

-asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename, struct compat_timeval __user *t)
+asmlinkage long compat_sys_futimesat(unsigned int dfd, const char __user *filename, struct compat_timeval __user *t)
{
struct timespec tv[2];

@@ -124,7 +125,7 @@ asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename, st
return do_utimes(dfd, filename, t ? tv : NULL, 0);
}

-asmlinkage long compat_sys_utimes(char __user *filename, struct compat_timeval __user *t)
+asmlinkage long compat_sys_utimes(const char __user *filename, struct compat_timeval __user *t)
{
return compat_sys_futimesat(AT_FDCWD, filename, t);
}
@@ -168,7 +169,7 @@ static int cp_compat_stat(struct kstat *stat, struct compat_stat __user *ubuf)
return err;
}

-asmlinkage long compat_sys_newstat(char __user * filename,
+asmlinkage long compat_sys_newstat(const char __user * filename,
struct compat_stat __user *statbuf)
{
struct kstat stat;
@@ -180,7 +181,7 @@ asmlinkage long compat_sys_newstat(char __user * filename,
return cp_compat_stat(&stat, statbuf);
}

-asmlinkage long compat_sys_newlstat(char __user * filename,
+asmlinkage long compat_sys_newlstat(const char __user * filename,
struct compat_stat __user *statbuf)
{
struct kstat stat;
@@ -193,7 +194,8 @@ asmlinkage long compat_sys_newlstat(char __user * filename,
}

#ifndef __ARCH_WANT_STAT64
-asmlinkage long compat_sys_newfstatat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_newfstatat(unsigned int dfd,
+ const char __user *filename,
struct compat_stat __user *statbuf, int flag)
{
struct kstat stat;
@@ -836,9 +838,10 @@ static int do_nfs4_super_data_conv(void *raw_data)
#define NCPFS_NAME "ncpfs"
#define NFS4_NAME "nfs4"

-asmlinkage long compat_sys_mount(char __user * dev_name, char __user * dir_name,
- char __user * type, unsigned long flags,
- void __user * data)
+asmlinkage long compat_sys_mount(const char __user * dev_name,
+ const char __user * dir_name,
+ const char __user * type, unsigned long flags,
+ const void __user * data)
{
char *kernel_type;
unsigned long data_page;
diff --git a/fs/stat.c b/fs/stat.c
index c4ecd52..12e90e2 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -68,7 +68,8 @@ int vfs_fstat(unsigned int fd, struct kstat *stat)
}
EXPORT_SYMBOL(vfs_fstat);

-int vfs_fstatat(int dfd, char __user *filename, struct kstat *stat, int flag)
+int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
+ int flag)
{
struct path path;
int error = -EINVAL;
@@ -91,13 +92,13 @@ out:
}
EXPORT_SYMBOL(vfs_fstatat);

-int vfs_stat(char __user *name, struct kstat *stat)
+int vfs_stat(const char __user *name, struct kstat *stat)
{
return vfs_fstatat(AT_FDCWD, name, stat, 0);
}
EXPORT_SYMBOL(vfs_stat);

-int vfs_lstat(char __user *name, struct kstat *stat)
+int vfs_lstat(const char __user *name, struct kstat *stat)
{
return vfs_fstatat(AT_FDCWD, name, stat, AT_SYMLINK_NOFOLLOW);
}
@@ -147,7 +148,8 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(stat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+SYSCALL_DEFINE2(stat, const char __user *, filename,
+ struct __old_kernel_stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -159,7 +161,8 @@ SYSCALL_DEFINE2(stat, char __user *, filename, struct __old_kernel_stat __user *
return cp_old_stat(&stat, statbuf);
}

-SYSCALL_DEFINE2(lstat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+SYSCALL_DEFINE2(lstat, const char __user *, filename,
+ struct __old_kernel_stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -234,7 +237,8 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(newstat, char __user *, filename, struct stat __user *, statbuf)
+SYSCALL_DEFINE2(newstat, const char __user *, filename,
+ struct stat __user *, statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -244,7 +248,8 @@ SYSCALL_DEFINE2(newstat, char __user *, filename, struct stat __user *, statbuf)
return cp_new_stat(&stat, statbuf);
}

-SYSCALL_DEFINE2(newlstat, char __user *, filename, struct stat __user *, statbuf)
+SYSCALL_DEFINE2(newlstat, const char __user *, filename,
+ struct stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -257,7 +262,7 @@ SYSCALL_DEFINE2(newlstat, char __user *, filename, struct stat __user *, statbuf
}

#if !defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_SYS_NEWFSTATAT)
-SYSCALL_DEFINE4(newfstatat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(newfstatat, int, dfd, const char __user *, filename,
struct stat __user *, statbuf, int, flag)
{
struct kstat stat;
@@ -355,7 +360,8 @@ static long cp_new_stat64(struct kstat *stat, struct stat64 __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(stat64, char __user *, filename, struct stat64 __user *, statbuf)
+SYSCALL_DEFINE2(stat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -366,7 +372,8 @@ SYSCALL_DEFINE2(stat64, char __user *, filename, struct stat64 __user *, statbuf
return error;
}

-SYSCALL_DEFINE2(lstat64, char __user *, filename, struct stat64 __user *, statbuf)
+SYSCALL_DEFINE2(lstat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
{
struct kstat stat;
int error = vfs_lstat(filename, &stat);
@@ -388,7 +395,7 @@ SYSCALL_DEFINE2(fstat64, unsigned long, fd, struct stat64 __user *, statbuf)
return error;
}

-SYSCALL_DEFINE4(fstatat64, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename,
struct stat64 __user *, statbuf, int, flag)
{
struct kstat stat;
diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..179b586 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -126,7 +126,8 @@ out:
* must be owner or have write permission.
* Else, update from *times, must be owner or super user.
*/
-long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags)
+long do_utimes(int dfd, const char __user *filename, struct timespec *times,
+ int flags)
{
int error = -EINVAL;

@@ -170,7 +171,7 @@ out:
return error;
}

-SYSCALL_DEFINE4(utimensat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(utimensat, int, dfd, const char __user *, filename,
struct timespec __user *, utimes, int, flags)
{
struct timespec tstimes[2];
@@ -188,7 +189,7 @@ SYSCALL_DEFINE4(utimensat, int, dfd, char __user *, filename,
return do_utimes(dfd, filename, utimes ? tstimes : NULL, flags);
}

-SYSCALL_DEFINE3(futimesat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE3(futimesat, int, dfd, const char __user *, filename,
struct timeval __user *, utimes)
{
struct timeval times[2];
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 168f7da..9ddc878 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -331,7 +331,7 @@ asmlinkage long compat_sys_epoll_pwait(int epfd,
const compat_sigset_t __user *sigmask,
compat_size_t sigsetsize);

-asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_utimensat(unsigned int dfd, const char __user *filename,
struct compat_timespec __user *t, int flags);

asmlinkage long compat_sys_signalfd(int ufd,
@@ -348,9 +348,9 @@ asmlinkage long compat_sys_move_pages(pid_t pid, unsigned long nr_page,
const int __user *nodes,
int __user *status,
int flags);
-asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_futimesat(unsigned int dfd, const char __user *filename,
struct compat_timeval __user *t);
-asmlinkage long compat_sys_newfstatat(unsigned int dfd, char __user * filename,
+asmlinkage long compat_sys_newfstatat(unsigned int dfd, const char __user * filename,
struct compat_stat __user *statbuf,
int flag);
asmlinkage long compat_sys_openat(unsigned int dfd, const char __user *filename,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68ca1b0..f5e7cf2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2349,10 +2349,10 @@ void inode_set_bytes(struct inode *inode, loff_t bytes);

extern int vfs_readdir(struct file *, filldir_t, void *);

-extern int vfs_stat(char __user *, struct kstat *);
-extern int vfs_lstat(char __user *, struct kstat *);
+extern int vfs_stat(const char __user *, struct kstat *);
+extern int vfs_lstat(const char __user *, struct kstat *);
extern int vfs_fstat(unsigned int, struct kstat *);
-extern int vfs_fstatat(int , char __user *, struct kstat *, int);
+extern int vfs_fstatat(int , const char __user *, struct kstat *, int);

extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7f614ce..8812a63 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -393,7 +393,7 @@ asmlinkage long sys_umount(char __user *name, int flags);
asmlinkage long sys_oldumount(char __user *name);
asmlinkage long sys_truncate(const char __user *path, long length);
asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
-asmlinkage long sys_stat(char __user *filename,
+asmlinkage long sys_stat(const char __user *filename,
struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_statfs(const char __user * path,
struct statfs __user *buf);
@@ -402,21 +402,21 @@ asmlinkage long sys_statfs64(const char __user *path, size_t sz,
asmlinkage long sys_fstatfs(unsigned int fd, struct statfs __user *buf);
asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
struct statfs64 __user *buf);
-asmlinkage long sys_lstat(char __user *filename,
+asmlinkage long sys_lstat(const char __user *filename,
struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_fstat(unsigned int fd,
struct __old_kernel_stat __user *statbuf);
-asmlinkage long sys_newstat(char __user *filename,
+asmlinkage long sys_newstat(const char __user *filename,
struct stat __user *statbuf);
-asmlinkage long sys_newlstat(char __user *filename,
+asmlinkage long sys_newlstat(const char __user *filename,
struct stat __user *statbuf);
asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
#if BITS_PER_LONG == 32
-asmlinkage long sys_stat64(char __user *filename,
+asmlinkage long sys_stat64(const char __user *filename,
struct stat64 __user *statbuf);
asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
-asmlinkage long sys_lstat64(char __user *filename,
+asmlinkage long sys_lstat64(const char __user *filename,
struct stat64 __user *statbuf);
asmlinkage long sys_truncate64(const char __user *path, loff_t length);
asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
@@ -756,7 +756,7 @@ asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
-asmlinkage long sys_futimesat(int dfd, char __user *filename,
+asmlinkage long sys_futimesat(int dfd, const char __user *filename,
struct timeval __user *utimes);
asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode);
asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
@@ -765,13 +765,13 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
gid_t group, int flag);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
int mode);
-asmlinkage long sys_newfstatat(int dfd, char __user *filename,
+asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
struct stat __user *statbuf, int flag);
-asmlinkage long sys_fstatat64(int dfd, char __user *filename,
+asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
struct stat64 __user *statbuf, int flag);
asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *buf,
int bufsiz);
-asmlinkage long sys_utimensat(int dfd, char __user *filename,
+asmlinkage long sys_utimensat(int dfd, const char __user *filename,
struct timespec __user *utimes, int flags);
asmlinkage long sys_unshare(unsigned long unshare_flags);

diff --git a/include/linux/time.h b/include/linux/time.h
index ea3559f..16346c0 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -135,7 +135,7 @@ extern void do_gettimeofday(struct timeval *tv);
extern int do_settimeofday(struct timespec *tv);
extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
#define do_posix_clock_monotonic_gettime(ts) ktime_get_ts(ts)
-extern long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags);
+extern long do_utimes(int dfd, const char __user *filename, struct timespec *times, int flags);
struct itimerval;
extern int do_setitimer(int which, struct itimerval *value,
struct itimerval *ovalue);

2010-07-15 02:17:13

Subject: [PATCH 03/18] AFS: Use i_generation not i_version for the vnode uniquifier [ver #6]

Store the AFS vnode uniquifier in the i_generation field, not the i_version
field of the inode struct. i_version can then be given the AFS data version
number.
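
As a minimal sketch of the split described above (the struct layouts and the
sketch_* names are hypothetical stand-ins, not the real kernel or AFS
definitions), the uniquifier takes part in the identity check while the data
version is free to change without affecting it:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct sketch_fid {
	uint32_t vnode;		/* AFS vnode number */
	uint32_t unique;	/* AFS uniquifier, differs when a vnode number is reused */
};

struct sketch_inode {
	unsigned long i_ino;
	uint32_t i_generation;	/* holds the uniquifier after this patch */
	uint64_t i_version;	/* holds the AFS data version after this patch */
};

/* Assign the identity and version fields the way the patch does. */
static void sketch_map_status(struct sketch_inode *inode,
			      const struct sketch_fid *fid,
			      uint64_t data_version)
{
	inode->i_ino = fid->vnode;
	inode->i_generation = fid->unique;
	inode->i_version = data_version;
}

/* Same file only if vnode number and uniquifier both match; this mirrors
 * the comparison made in afs_iget5_test(). */
static bool sketch_same_file(const struct sketch_inode *inode,
			     const struct sketch_fid *fid)
{
	return inode->i_ino == fid->vnode &&
	       inode->i_generation == fid->unique;
}
```

With this arrangement a change of data version leaves sketch_same_file()
true, whereas a reused vnode number with a new uniquifier makes it false.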

Signed-off-by: David Howells <[email protected]>
---

fs/afs/dir.c | 8 ++++----
fs/afs/fsclient.c | 3 ++-
fs/afs/inode.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b42d5cc..afb9ff8 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -542,11 +542,11 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
dentry->d_op = &afs_fs_dentry_operations;

d_add(dentry, inode);
- _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
+ _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%u }",
fid.vnode,
fid.unique,
dentry->d_inode->i_ino,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);

return NULL;
}
@@ -626,10 +626,10 @@ static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
* been deleted and replaced, and the original vnode ID has
* been reused */
if (fid.unique != vnode->fid.unique) {
- _debug("%s: file deleted (uq %u -> %u I:%llu)",
+ _debug("%s: file deleted (uq %u -> %u I:%u)",
dentry->d_name.name, fid.unique,
vnode->fid.unique,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);
spin_lock(&vnode->lock);
set_bit(AFS_VNODE_DELETED, &vnode->flags);
spin_unlock(&vnode->lock);
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 4bd0218..346e328 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -89,7 +89,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
i_size_write(&vnode->vfs_inode, size);
vnode->vfs_inode.i_uid = status->owner;
vnode->vfs_inode.i_gid = status->group;
- vnode->vfs_inode.i_version = vnode->fid.unique;
+ vnode->vfs_inode.i_generation = vnode->fid.unique;
vnode->vfs_inode.i_nlink = status->nlink;

mode = vnode->vfs_inode.i_mode;
@@ -102,6 +102,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_server;
vnode->vfs_inode.i_mtime = vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime = vnode->vfs_inode.i_ctime;
+ vnode->vfs_inode.i_version = data_version;
}

expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d00b312..ee3190a 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -73,7 +73,8 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
inode->i_ctime.tv_nsec = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
- inode->i_version = vnode->fid.unique;
+ inode->i_generation = vnode->fid.unique;
+ inode->i_version = vnode->status.data_version;
inode->i_mapping->a_ops = &afs_fs_aops;

/* check to see whether a symbolic link is really a mountpoint */
@@ -98,7 +99,7 @@ static int afs_iget5_test(struct inode *inode, void *opaque)
struct afs_iget_data *data = opaque;

return inode->i_ino == data->fid.vnode &&
- inode->i_version == data->fid.unique;
+ inode->i_generation == data->fid.unique;
}

/*
@@ -110,7 +111,7 @@ static int afs_iget5_set(struct inode *inode, void *opaque)
struct afs_vnode *vnode = AFS_FS_I(inode);

inode->i_ino = data->fid.vnode;
- inode->i_version = data->fid.unique;
+ inode->i_generation = data->fid.unique;
vnode->fid = data->fid;
vnode->volume = data->volume;

@@ -306,8 +307,7 @@ int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,

inode = dentry->d_inode;

- _enter("{ ino=%lu v=%llu }", inode->i_ino,
- (unsigned long long)inode->i_version);
+ _enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);

generic_fillattr(inode, stat);
return 0;

2010-07-15 02:17:19

Subject: [PATCH 08/18] xstat: CIFS: Return extended attributes [ver #6]

Return extended attributes from the CIFS filesystem. This includes the
following:

(1) Return the file creation time as btime. We assume that the creation time
won't change over the life of the inode.

(2) Set FS_AUTOMOUNT_FL on referral/submount directories.

(3) Deassert XSTAT_REQUEST_INO in st_result_mask if we made up the inode
number rather than getting it from the server.

(4) Map various Windows file attributes to FS_xxx_FL flags in st_inode_flags,
fetching them from the server if we don't have them yet or don't have a
current copy.
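
The attribute mapping in (4) can be sketched as a table lookup. This is an
illustration only: the SK_ATTR_* values are the standard Windows file
attribute bits, but the SK_FL_* values are placeholders chosen for the
sketch and are not the kernel's FS_xxx_FL values, and sk_attrs_to_flags()
is a hypothetical helper rather than anything in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Windows/SMB file attribute bits (standard FILE_ATTRIBUTE_* values). */
#define SK_ATTR_READONLY	0x0001u
#define SK_ATTR_HIDDEN		0x0002u
#define SK_ATTR_SYSTEM		0x0004u
#define SK_ATTR_ARCHIVE		0x0020u
#define SK_ATTR_TEMPORARY	0x0100u
#define SK_ATTR_REPARSE		0x0400u
#define SK_ATTR_COMPRESSED	0x0800u
#define SK_ATTR_OFFLINE		0x1000u
#define SK_ATTR_ENCRYPTED	0x4000u

/* Placeholder flag bits: assumptions for this sketch, NOT FS_xxx_FL values. */
#define SK_FL_IMMUTABLE		(1ull << 0)
#define SK_FL_HIDDEN		(1ull << 1)
#define SK_FL_SYSTEM		(1ull << 2)
#define SK_FL_ARCHIVE		(1ull << 3)
#define SK_FL_TEMPORARY		(1ull << 4)
#define SK_FL_REPARSE_POINT	(1ull << 5)
#define SK_FL_COMPR		(1ull << 6)
#define SK_FL_OFFLINE		(1ull << 7)
#define SK_FL_ENCRYPTED		(1ull << 8)

static const struct {
	uint32_t attr;
	uint64_t flag;
} sk_attr_map[] = {
	{ SK_ATTR_READONLY,	SK_FL_IMMUTABLE },
	{ SK_ATTR_HIDDEN,	SK_FL_HIDDEN },
	{ SK_ATTR_SYSTEM,	SK_FL_SYSTEM },
	{ SK_ATTR_ARCHIVE,	SK_FL_ARCHIVE },
	{ SK_ATTR_TEMPORARY,	SK_FL_TEMPORARY },
	{ SK_ATTR_REPARSE,	SK_FL_REPARSE_POINT },
	{ SK_ATTR_COMPRESSED,	SK_FL_COMPR },
	{ SK_ATTR_OFFLINE,	SK_FL_OFFLINE },
	{ SK_ATTR_ENCRYPTED,	SK_FL_ENCRYPTED },
};

/* Translate a Windows attribute word into the sketch's inode flag word.
 * Attribute bits with no table entry are ignored. */
static uint64_t sk_attrs_to_flags(uint32_t attrs)
{
	uint64_t flags = 0;
	size_t i;

	for (i = 0; i < sizeof(sk_attr_map) / sizeof(sk_attr_map[0]); i++)
		if (attrs & sk_attr_map[i].attr)
			flags |= sk_attr_map[i].flag;
	return flags;
}
```

A table keeps the pairing of each attribute with its flag in one place; the
patch itself uses a run of if-statements to the same effect.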

Furthermore, what cifs_getattr() does can be controlled as follows:

(1) If AT_FORCE_ATTR_SYNC is indicated, or if the inode flags or creation time
are requested but not yet collected, then the attributes will be reread
unconditionally.

(2) If the basic stats are requested or if the inode flags are requested and
have been collected previously, then the attributes will be reread if out
of date.

(3) Otherwise the cached attributes will be used - even if expired - without
reference to the server.

Note that cifs_revalidate_dentry() will issue an extra operation to get the
FILE_ALL_INFO in addition to the FILE_UNIX_BASIC_INFO if it needs to collect
creation time and attributes on behalf of cifs_getattr().
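
The three cases above can be sketched as a pure decision function. The
SK_REQ_* bit values and all sk_* names are assumptions made for the sketch;
the real request masks are defined elsewhere in the series. The sketch
follows the branch structure of cifs_getattr() as posted in this patch, in
which the reread-if-stale path is taken when the basic stats are requested.

```c
#include <stdbool.h>

/* Hypothetical bit values standing in for the XSTAT_REQUEST_* masks. */
#define SK_REQ_BASIC_STATS	0x01u
#define SK_REQ_BTIME		0x02u
#define SK_REQ_INODE_FLAGS	0x04u

enum sk_action {
	SK_USE_CACHE,		/* case (3): answer from cached attributes */
	SK_REVAL_IF_STALE,	/* case (2): reread only if out of date */
	SK_REVAL_FORCED,	/* case (1): reread unconditionally */
};

struct sk_decision {
	enum sk_action action;
	bool want_extra_bits;	/* also fetch FILE_ALL_INFO */
};

/* Decide what a getattr call should do, given what was requested and which
 * of the extra attributes have already been collected. */
static struct sk_decision sk_decide(unsigned int request_mask, bool force_sync,
				    bool btime_valid, bool attrs_valid)
{
	struct sk_decision d = { SK_USE_CACHE, false };
	bool force = force_sync;

	if ((request_mask & SK_REQ_BTIME) && !btime_valid) {
		d.want_extra_bits = true;
		force = true;
	}

	if (request_mask & SK_REQ_INODE_FLAGS) {
		d.want_extra_bits = true;
		if (!attrs_valid)
			force = true;
	}

	if (force)
		d.action = SK_REVAL_FORCED;
	else if (request_mask & SK_REQ_BASIC_STATS)
		d.action = SK_REVAL_IF_STALE;
	return d;
}
```

For example, a request for only the creation time on an inode that has not
yet collected it yields a forced reread with the extra bits wanted, while an
empty request mask without the force flag is answered from the cache.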

[NOTE: THIS PATCH IS UNTESTED!]

Signed-off-by: David Howells <[email protected]>
---

fs/cifs/cifsfs.h | 2 +
fs/cifs/cifsglob.h | 5 +++
fs/cifs/dir.c | 2 +
fs/cifs/inode.c | 76 ++++++++++++++++++++++++++++++++++++++++++++--------
4 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index a7eb65c..50bf70b 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -62,7 +62,7 @@ extern int cifs_rmdir(struct inode *, struct dentry *);
extern int cifs_rename(struct inode *, struct dentry *, struct inode *,
struct dentry *);
extern int cifs_revalidate_file(struct file *filp);
-extern int cifs_revalidate_dentry(struct dentry *);
+extern int cifs_revalidate_dentry(struct dentry *, bool, bool);
extern int cifs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
extern int cifs_setattr(struct dentry *, struct iattr *);

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index a88479c..f12e78d 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -396,6 +396,9 @@ struct cifsInodeInfo {
bool clientCanCacheAll:1; /* read and writebehind oplock */
bool delete_pending:1; /* DELETE_ON_CLOSE is set */
bool invalid_mapping:1; /* pagecache is invalid */
+ bool cifsAttrs_valid:1; /* stored cifs attributes are valid */
+ bool btime_valid:1; /* stored creation time is valid */
+ struct timespec btime; /* creation time */
u64 server_eof; /* current file size on server */
u64 uniqueid; /* server inode number */
struct inode vfs_inode;
@@ -508,6 +511,7 @@ struct dfs_info3_param {
#define CIFS_FATTR_DELETE_PENDING 0x2
#define CIFS_FATTR_NEED_REVAL 0x4
#define CIFS_FATTR_INO_COLLISION 0x8
+#define CIFS_FATTR_WINATTRS_VALID 0x10 /* T if cf_btime and cf_cifsattrs valid */

struct cifs_fattr {
u32 cf_flags;
@@ -524,6 +528,7 @@ struct cifs_fattr {
struct timespec cf_atime;
struct timespec cf_mtime;
struct timespec cf_ctime;
+ struct timespec cf_btime;
};

static inline void free_dfs_info_param(struct dfs_info3_param *param)
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index e7ae78b..65da30e 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -778,7 +778,7 @@ cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
int isValid = 1;

if (direntry->d_inode) {
- if (cifs_revalidate_dentry(direntry))
+ if (cifs_revalidate_dentry(direntry, false, false))
return 0;
} else {
cFYI(1, "neg dentry 0x%p name = %s",
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 6f0683c..ff4a62f 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -136,7 +136,11 @@ cifs_fattr_to_inode(struct inode *inode, struct cifs_fattr *fattr)
!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_DYNPERM))
inode->i_mode = fattr->cf_mode;

- cifs_i->cifsAttrs = fattr->cf_cifsattrs;
+ if (fattr->cf_flags & CIFS_FATTR_WINATTRS_VALID) {
+ cifs_i->cifsAttrs = fattr->cf_cifsattrs;
+ cifs_i->btime = fattr->cf_btime;
+ cifs_i->btime_valid = true;
+ }

if (fattr->cf_flags & CIFS_FATTR_NEED_REVAL)
cifs_i->time = 0;
@@ -468,6 +472,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
struct cifs_sb_info *cifs_sb, bool adjust_tz)
{
memset(fattr, 0, sizeof(*fattr));
+ fattr->cf_flags = CIFS_FATTR_WINATTRS_VALID;
fattr->cf_cifsattrs = le32_to_cpu(info->Attributes);
if (info->DeletePending)
fattr->cf_flags |= CIFS_FATTR_DELETE_PENDING;
@@ -479,6 +484,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,

fattr->cf_ctime = cifs_NTtimeToUnix(info->ChangeTime);
fattr->cf_mtime = cifs_NTtimeToUnix(info->LastWriteTime);
+ fattr->cf_btime = cifs_NTtimeToUnix(info->CreationTime);

if (adjust_tz) {
fattr->cf_ctime.tv_sec += cifs_sb->tcon->ses->server->timeAdj;
@@ -1591,7 +1597,8 @@ check_inval:
}

/* revalidate a dentry's inode attributes */
-int cifs_revalidate_dentry(struct dentry *dentry)
+int cifs_revalidate_dentry(struct dentry *dentry, bool want_extra_bits,
+ bool force)
{
int xid;
int rc = 0;
@@ -1604,7 +1611,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)

xid = GetXid();

- if (!cifs_inode_needs_reval(inode))
+ if (!force && !cifs_inode_needs_reval(inode))
goto check_inval;

/* can not safely grab the rename sem here if rename calls revalidate
@@ -1619,9 +1626,12 @@ int cifs_revalidate_dentry(struct dentry *dentry)
"jiffies %ld", full_path, inode, inode->i_count.counter,
dentry, dentry->d_time, jiffies);

- if (CIFS_SB(sb)->tcon->unix_ext)
+ if (CIFS_SB(sb)->tcon->unix_ext) {
rc = cifs_get_inode_info_unix(&inode, full_path, sb, xid);
- else
+ if (rc != 0)
+ goto check_inval;
+ }
+ if (!CIFS_SB(sb)->tcon->unix_ext || want_extra_bits)
rc = cifs_get_inode_info(&inode, full_path, NULL, sb,
xid, NULL);

@@ -1637,13 +1647,55 @@ check_inval:
int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
- int err = cifs_revalidate_dentry(dentry);
- if (!err) {
- generic_fillattr(dentry->d_inode, stat);
- stat->blksize = CIFS_MAX_MSGSIZE;
- stat->ino = CIFS_I(dentry->d_inode)->uniqueid;
- }
- return err;
+ struct cifsInodeInfo *cifs_i = CIFS_I(dentry->d_inode);
+ struct cifs_sb_info *cifs_sb = CIFS_SB(dentry->d_sb);
+ unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
+ bool want_extra_bits = false;
+ u64 iflag;
+ u32 attrs;
+ int err;
+
+ if (stat->request_mask & XSTAT_REQUEST_BTIME && !cifs_i->btime_valid) {
+ want_extra_bits = true;
+ force = true;
+ }
+
+ if (stat->request_mask & XSTAT_REQUEST_INODE_FLAGS) {
+ want_extra_bits = true;
+ if (!cifs_i->cifsAttrs_valid)
+ force = true;
+ }
+
+ if (force || stat->request_mask & XSTAT_REQUEST__BASIC_STATS) {
+ err = cifs_revalidate_dentry(dentry, want_extra_bits, force);
+ if (err)
+ return err;
+ }
+
+ generic_fillattr(&cifs_i->vfs_inode, stat);
+ stat->blksize = CIFS_MAX_MSGSIZE;
+
+ /* we don't promise an inode number if we made one up */
+ stat->ino = cifs_i->uniqueid;
+ if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SERVER_INUM))
+ stat->result_mask &= ~XSTAT_REQUEST_INO;
+ if (cifs_i->btime_valid) {
+ stat->btime = cifs_i->btime;
+ stat->result_mask |= XSTAT_REQUEST_BTIME;
+ }
+ attrs = cifs_i->cifsAttrs;
+ iflag = 0;
+ if (attrs & ATTR_READONLY) iflag |= FS_IMMUTABLE_FL;
+ if (attrs & ATTR_HIDDEN) iflag |= FS_HIDDEN_FL;
+ if (attrs & ATTR_SYSTEM) iflag |= FS_SYSTEM_FL;
+ if (attrs & ATTR_ARCHIVE) iflag |= FS_ARCHIVE_FL;
+ if (attrs & ATTR_TEMPORARY) iflag |= FS_TEMPORARY_FL;
+ if (attrs & ATTR_REPARSE) iflag |= FS_REPARSE_POINT_FL;
+ if (attrs & ATTR_COMPRESSED) iflag |= FS_COMPR_FL;
+ if (attrs & ATTR_OFFLINE) iflag |= FS_OFFLINE_FL;
+ if (attrs & ATTR_ENCRYPTED) iflag |= FS_ENCRYPTED_FL;
+ stat->inode_flags |= iflag;
+ return 0;
}

static int cifs_truncate_page(struct address_space *mapping, loff_t from)

2010-07-15 02:17:12

Subject: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

Add a pair of system calls to make extended file stats available, including
file creation time, inode version and data version where available through the
underlying filesystem.

[This depends on the previously posted pair of patches to (a) constify a number
of syscall string and buffer arguments and (b) rearrange AFS's use of
i_version and i_generation].

This has a number of uses:

(1) Creation time: The SMB protocol carries the creation time, which Samba
could export. That would in turn help CIFS make use of FS-Cache, as the
creation time can be used as coherency data.

This is also specified in NFSv4 as a recommended attribute and could be
exported by NFSD [Steve French].

(2) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper].

(3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
cached attributes are up to date [Trond Myklebust].

(4) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd
Schubert].

(5) Data version number: Could be used by userspace NFS servers [Aneesh Kumar].

Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get it
from the kstat struct if it used vfs_xgetattr() instead.

(6) BSD stat compatibility: Including more fields from the BSD stat such as
creation time (st_btime) and inode generation number (st_gen) [Jeremy
Allison, Bernd Schubert].

(7) Extra coherency data may be useful in making backups [Andreas Dilger].

(8) Allow the filesystem to indicate what it can/cannot provide: A filesystem
can now say it doesn't support a standard stat feature if that isn't
available.

(9) Make the fields a consistent size on all arches, and make them large.

(10) Can be extended by using more request flags and tagging further data after
the end of the standard return data. Such things as the following could
be returned:

- BSD st_flags or FS_IOC_GETFLAGS.
- Volume ID / Remote Device ID [Steve French].
- Time granularity (NFSv4 time_delta) [Steve French].
- Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].

This was initially proposed as a set of xattrs, but the general preference is
for an extended stat structure.

The following structures are defined for the use of these new system calls:

	struct xstat_parameters {
		unsigned long long	request_mask;
	};

	struct xstat_dev {
		unsigned int		major, minor;
	};

	struct xstat_time {
		unsigned long long	tv_sec, tv_nsec;
	};

	struct xstat {
		unsigned long long	st_result_mask;
		unsigned int		st_mode;
		unsigned int		st_nlink;
		unsigned int		st_uid;
		unsigned int		st_gid;
		struct xstat_dev	st_rdev;
		struct xstat_dev	st_dev;
		struct xstat_time	st_atime;
		struct xstat_time	st_mtime;
		struct xstat_time	st_ctime;
		struct xstat_time	st_btime;
		unsigned long long	st_ino;
		unsigned long long	st_size;
		unsigned long long	st_blksize;
		unsigned long long	st_blocks;
		unsigned long long	st_gen;
		unsigned long long	st_data_version;
		unsigned long long	st_inode_flags;
		unsigned long long	st_extra_results[0];
	};

where st_btime is the file creation time, st_gen is the inode generation
(i_generation), st_data_version is the data version number (i_version),
st_inode_flags is the flags from FS_IOC_GETFLAGS plus some extras,
request_mask and st_result_mask are bitmasks of data desired/provided and
st_extra_results[] is where as-yet undefined fields are appended.

The defined bits in request_mask and st_result_mask are:

        XSTAT_REQUEST_MODE              Want/got st_mode
        XSTAT_REQUEST_NLINK             Want/got st_nlink
        XSTAT_REQUEST_UID               Want/got st_uid
        XSTAT_REQUEST_GID               Want/got st_gid
        XSTAT_REQUEST_RDEV              Want/got st_rdev
        XSTAT_REQUEST_ATIME             Want/got st_atime
        XSTAT_REQUEST_MTIME             Want/got st_mtime
        XSTAT_REQUEST_CTIME             Want/got st_ctime
        XSTAT_REQUEST_INO               Want/got st_ino
        XSTAT_REQUEST_SIZE              Want/got st_size
        XSTAT_REQUEST_BLOCKS            Want/got st_blocks
        XSTAT_REQUEST__BASIC_STATS      The stuff in the normal stat struct
        XSTAT_REQUEST_BTIME             Want/got st_btime
        XSTAT_REQUEST_GEN               Want/got st_gen
        XSTAT_REQUEST_DATA_VERSION      Want/got st_data_version
        XSTAT_REQUEST_INODE_FLAGS       Want/got st_inode_flags
        XSTAT_REQUEST__EXTENDED_STATS   The stuff in the xstat struct
        XSTAT_REQUEST__ALL_STATS        The defined set of requestables
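
As a rough illustration of how a caller might use the two masks together, the
sketch below works out which requested items were not provided. The bit values
are copied from the test program in this posting; the helper name
xstat_missing() is hypothetical and not part of the patch.

```c
/* Bit values copied from the test program in this posting. */
#define XSTAT_REQUEST_RDEV		0x00000010ULL
#define XSTAT_REQUEST_BTIME		0x00000800ULL
#define XSTAT_REQUEST_GEN		0x00001000ULL
#define XSTAT_REQUEST_DATA_VERSION	0x00002000ULL
#define XSTAT_REQUEST__ALL_STATS	0x00007fffULL

/*
 * Hypothetical helper (not part of the patch): return the bits that were
 * requested but that were not reported in st_result_mask.
 */
unsigned long long xstat_missing(unsigned long long request_mask,
				 unsigned long long result_mask)
{
	return request_mask & ~result_mask;
}
```

For instance, with request_mask set to XSTAT_REQUEST__ALL_STATS and a result
mask of 0x47ef (the /proc example in the TESTING section), the helper yields
the RDEV, BTIME, GEN and DATA_VERSION bits.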

The defined bits in st_inode_flags are the usual FS_xxx_FL flags in the least
significant 32 bits (LSW), plus some extra flags in the most significant 32
bits (MSW):

        FS_SPECIAL_FL           Special kernel file, such as found in procfs
        FS_AUTOMOUNT_FL         Specific automount point
        FS_AUTOMOUNT_ANY_FL     Free-form automount directory
        FS_REMOTE_FL            File is remote
        FS_ENCRYPTED_FL         File is encrypted
        FS_SYSTEM_FL            File is marked system (DOS/NTFS/CIFS)
        FS_TEMPORARY_FL         File is temporary (NTFS/CIFS)
        FS_OFFLINE_FL           File is offline (CIFS)
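
A minimal sketch of how userspace might pull the two halves of st_inode_flags
apart, using the flag values from the test program in this posting. The helper
names are made up for illustration only.

```c
/* Values copied from the test program in this posting. */
#define FS__STANDARD_FL		0x00000000ffffffffULL
#define FS_SPECIAL_FL		0x0000000100000000ULL
#define FS_REMOTE_FL		0x0000000800000000ULL

/* Hypothetical helper: the usual FS_xxx_FL flags (low 32 bits). */
unsigned int xstat_standard_flags(unsigned long long inode_flags)
{
	return (unsigned int)(inode_flags & FS__STANDARD_FL);
}

/* Hypothetical helper: the extra flags, shifted down from the high 32 bits. */
unsigned int xstat_extra_flags(unsigned long long inode_flags)
{
	return (unsigned int)(inode_flags >> 32);
}

/* Hypothetical helper: non-zero if the file is flagged as remote. */
int xstat_is_remote(unsigned long long inode_flags)
{
	return (inode_flags & FS_REMOTE_FL) != 0;
}
```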

Note that Ext4 returns flags outside of FS_FL_USER_VISIBLE in response to
FS_IOC_GETFLAGS. Should FS_FL_USER_VISIBLE be extended to cover them? Or
should the extra flags be suppressed?

The system calls are:

	ssize_t ret = xstat(int dfd,
			    const char *filename,
			    unsigned flags,
			    const struct xstat_parameters *params,
			    struct xstat *buffer,
			    size_t buflen);

	ssize_t ret = fxstat(unsigned fd,
			     unsigned flags,
			     const struct xstat_parameters *params,
			     struct xstat *buffer,
			     size_t buflen);

The dfd, filename, flags and fd parameters indicate the file to query. There
is no equivalent of lstat() as that can be emulated with xstat() by passing
AT_SYMLINK_NOFOLLOW in flags.

AT_FORCE_ATTR_SYNC can also be set in flags. This will require a network
filesystem to synchronise its attributes with the server.
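
To make the flag handling concrete, here is a small sketch of how a caller
might build the flags argument. AT_FORCE_ATTR_SYNC uses the value proposed by
this patch (0x800) and is not an established constant; the fallback value for
AT_SYMLINK_NOFOLLOW is the Linux one; xstat_flags() is a hypothetical helper.

```c
#include <fcntl.h>

#ifndef AT_SYMLINK_NOFOLLOW
#define AT_SYMLINK_NOFOLLOW	0x100	/* Linux value, in case <fcntl.h> hides it */
#endif
#ifndef AT_FORCE_ATTR_SYNC
#define AT_FORCE_ATTR_SYNC	0x800	/* value proposed by this patch */
#endif

/*
 * Hypothetical helper: build the flags argument for xstat().
 * follow_symlinks == 0 gives lstat()-like behaviour; force_sync != 0 asks a
 * network filesystem to resynchronise its attributes with the server.
 */
unsigned xstat_flags(int follow_symlinks, int force_sync)
{
	unsigned flags = 0;

	if (!follow_symlinks)
		flags |= AT_SYMLINK_NOFOLLOW;
	if (force_sync)
		flags |= AT_FORCE_ATTR_SYNC;
	return flags;
}
```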

When the system call is executed, the request_mask bitmask is read from the
parameter block to work out what the user is requesting. If params is NULL,
then request_mask will be assumed to be XSTAT_REQUEST__BASIC_STATS.
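
The defaulting rule can be modelled in userspace roughly as below. This mirrors
what xstat_get_params() does in the diff (including masking off bits outside
XSTAT_REQUEST__ALL_STATS), but the function name effective_request_mask() is
an invention for this sketch.

```c
#include <stddef.h>

struct xstat_parameters {
	unsigned long long request_mask;
};

/* Values copied from the test program in this posting. */
#define XSTAT_REQUEST__BASIC_STATS	0x000007ffULL
#define XSTAT_REQUEST__ALL_STATS	0x00007fffULL

/*
 * Hypothetical model of the kernel-side defaulting: a NULL parameter block
 * means "basic stats"; otherwise undefined request bits are discarded.
 */
unsigned long long effective_request_mask(const struct xstat_parameters *params)
{
	if (!params)
		return XSTAT_REQUEST__BASIC_STATS;
	return params->request_mask & XSTAT_REQUEST__ALL_STATS;
}
```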

The request_mask should be set by the caller to specify extra results that the
caller may desire. These come in a number of classes:

(0) dev, blksize.

These are local data and are always available.

(1) mode, nlinks, uid, gid, [amc]time, ino, size, blocks.

These will be returned whether the caller asks for them or not. The
corresponding bits in result_mask will be set to indicate their presence.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server, unless as
a byproduct of updating something requested.

(2) rdev.

As for class (1), but this won't be returned if the file is not a blockdev
or chardev. The bit will be cleared if the value is not returned.

(3) File creation time, inode generation and data version.

These will be returned if available whether the caller asked for them or
not. The corresponding bits in result_mask will be set or cleared as
appropriate to indicate their presence.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server, unless
as a byproduct of updating something requested.

(4) Inode flags.

Some of the extra flags (in the MSW) may be returned anyway, and if so,
XSTAT_REQUEST_INODE_FLAGS will be set to indicate it. A base set of
flags is stored in a filesystem's file_system_type struct and is loaded
into inode_flags by generic_fillattr() for further addition by the
filesystem.

(5) Extra results.

These will only be returned if the caller asked for them by setting their
bits in request_mask. They will be placed in the buffer after the xstat
struct in ascending result_mask bit order. Any bit set in request_mask will
be left set in result_mask if the result is available and cleared
otherwise.

The pointer into the results list will be rounded up to the nearest 8-byte
boundary after each result is written in. The size of each extra result
is specific to the definition for that result.

No extra results are currently defined.
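
The 8-byte rounding described for class (5) can be sketched as follows. Since
no extra results are defined yet, the result sizes passed in are purely
hypothetical, as are the helper names.

```c
#include <stddef.h>

/* Round an offset up to the next 8-byte boundary. */
size_t xstat_round_up8(size_t offset)
{
	return (offset + 7) & ~(size_t)7;
}

/*
 * Hypothetical helper: given the offset at which an extra result starts and
 * its size, return the offset at which the next extra result would start.
 */
size_t xstat_next_extra_offset(size_t offset, size_t result_size)
{
	return xstat_round_up8(offset + result_size);
}
```

With the 160-byte struct xstat used here, a hypothetical 4-byte extra result
starting at offset 160 would be followed by the next result at offset 168.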

If buflen is 0, the syscall just returns the amount of space needed to hold the
complete result set. If buflen is non-zero but too small to hold the complete
result set, the syscall fails with E2BIG and nothing is written to the buffer.
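
The size handling done by xstat_set_result() in the diff further down can be
modelled in userspace roughly like this. It is only a model of the return
value, with a made-up function name, and does not call the kernel.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the value returned by xstat()/fxstat() for a given
 * buffer size, following xstat_set_result() in this patch:
 *  - bufsize == 0: report the size needed, write nothing;
 *  - bufsize too small: fail with -E2BIG;
 *  - otherwise: report the number of bytes written.
 */
long xstat_model_return(size_t bufsize, size_t result_size)
{
	if (bufsize == 0)
		return (long)result_size;
	if (bufsize < result_size)
		return -E2BIG;
	return (long)result_size;
}
```

A caller could therefore probe with a zero-length buffer first and then
allocate exactly the size reported before making the real call.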

At the moment, this will only work on x86_64 as it requires system calls to be
wired up.

=======
TESTING
=======

The following test program can be used to test the xstat system call:

/* Test the xstat() system call
*
* Copyright (C) 2010 Red Hat, Inc. All Rights Reserved.
* Written by David Howells ([email protected])
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public Licence
* as published by the Free Software Foundation; either version
* 2 of the Licence, or (at your option) any later version.
*/

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <sys/types.h>

#define AT_FORCE_ATTR_SYNC 0x800
#define AT_NO_AUTOMOUNT 0x1000

struct xstat_parameters {
unsigned long long request_mask;
#define XSTAT_REQUEST_MODE 0x00000001ULL
#define XSTAT_REQUEST_NLINK 0x00000002ULL
#define XSTAT_REQUEST_UID 0x00000004ULL
#define XSTAT_REQUEST_GID 0x00000008ULL
#define XSTAT_REQUEST_RDEV 0x00000010ULL
#define XSTAT_REQUEST_ATIME 0x00000020ULL
#define XSTAT_REQUEST_MTIME 0x00000040ULL
#define XSTAT_REQUEST_CTIME 0x00000080ULL
#define XSTAT_REQUEST_INO 0x00000100ULL
#define XSTAT_REQUEST_SIZE 0x00000200ULL
#define XSTAT_REQUEST_BLOCKS 0x00000400ULL
#define XSTAT_REQUEST__BASIC_STATS 0x000007ffULL
#define XSTAT_REQUEST_BTIME 0x00000800ULL
#define XSTAT_REQUEST_GEN 0x00001000ULL
#define XSTAT_REQUEST_DATA_VERSION 0x00002000ULL
#define XSTAT_REQUEST_INODE_FLAGS 0x00004000ULL
#define XSTAT_REQUEST__EXTENDED_STATS 0x00007fffULL
#define XSTAT_REQUEST__ALL_STATS 0x00007fffULL
#define XSTAT_REQUEST__EXTRA_STATS (XSTAT_REQUEST__ALL_STATS & ~XSTAT_REQUEST__EXTENDED_STATS)
};

struct xstat_dev {
unsigned int major;
unsigned int minor;
};

struct xstat_time {
unsigned long long tv_sec;
unsigned long long tv_nsec;
};

struct xstat {
unsigned long long st_result_mask;
unsigned int st_mode;
unsigned int st_nlink;
unsigned int st_uid;
unsigned int st_gid;
struct xstat_dev st_rdev;
struct xstat_dev st_dev;
struct xstat_time st_atim;
struct xstat_time st_mtim;
struct xstat_time st_ctim;
struct xstat_time st_btim;
unsigned long long st_ino;
unsigned long long st_size;
unsigned long long st_blksize;
unsigned long long st_blocks;
unsigned long long st_gen;
unsigned long long st_data_version;
unsigned long long st_inode_flags;
unsigned long long st_extra_results[0];
};

#define FS__STANDARD_FL 0x00000000ffffffffULL
#define FS_SPECIAL_FL 0x0000000100000000ULL
#define FS_AUTOMOUNT_FL 0x0000000200000000ULL
#define FS_AUTOMOUNT_ANY_FL 0x0000000400000000ULL
#define FS_REMOTE_FL 0x0000000800000000ULL
#define FS_ENCRYPTED_FL 0x0000001000000000ULL
#define FS_SYSTEM_FL 0x0000002000000000ULL
#define FS_TEMPORARY_FL 0x0000004000000000ULL
#define FS_OFFLINE_FL 0x0000008000000000ULL

#define __NR_xstat 300
#define __NR_fxstat 301

static __attribute__((unused))
ssize_t xstat(int dfd, const char *filename, unsigned flags,
struct xstat_parameters *params,
struct xstat *buffer, size_t bufsize)
{
return syscall(__NR_xstat, dfd, filename, flags,
params, buffer, bufsize);
}

static __attribute__((unused))
ssize_t fxstat(int fd, unsigned flags,
struct xstat_parameters *params,
struct xstat *buffer, size_t bufsize)
{
return syscall(__NR_fxstat, fd, flags,
params, buffer, bufsize);
}

static void print_time(const char *field, const struct xstat_time *xstm)
{
struct tm tm;
time_t tim;
char buffer[100];
int len;

tim = xstm->tv_sec;
if (!localtime_r(&tim, &tm)) {
perror("localtime_r");
exit(1);
}
len = strftime(buffer, 100, "%F %T", &tm);
if (len == 0) {
perror("strftime");
exit(1);
}
printf("%s", field);
fwrite(buffer, 1, len, stdout);
printf(".%09llu", xstm->tv_nsec);
len = strftime(buffer, 100, "%z", &tm);
if (len == 0) {
perror("strftime2");
exit(1);
}
fwrite(buffer, 1, len, stdout);
printf("\n");
}

static void dump_xstat(struct xstat *xst)
{
char buffer[256], ft;

printf("results=%llx\n", xst->st_result_mask);

printf(" ");
if (xst->st_result_mask & XSTAT_REQUEST_SIZE)
printf(" Size: %-15llu", xst->st_size);
if (xst->st_result_mask & XSTAT_REQUEST_BLOCKS)
printf(" Blocks: %-10llu", xst->st_blocks);
printf(" IO Block: %-6llu ", xst->st_blksize);
if (xst->st_result_mask & XSTAT_REQUEST_MODE) {
switch (xst->st_mode & S_IFMT) {
case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
case S_IFREG: printf(" regular file\n"); ft = '-'; break;
case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
default:
printf("unknown type (%o)\n", xst->st_mode & S_IFMT);
ft = '?';
break;
}
}

sprintf(buffer, "%02x:%02x", xst->st_dev.major, xst->st_dev.minor);
printf("Device: %-15s", buffer);
if (xst->st_result_mask & XSTAT_REQUEST_INO)
printf(" Inode: %-11llu", xst->st_ino);
if (xst->st_result_mask & XSTAT_REQUEST_NLINK)
printf(" Links: %-5u", xst->st_nlink);
if (xst->st_result_mask & XSTAT_REQUEST_RDEV)
printf(" Device type: %u,%u",
xst->st_rdev.major, xst->st_rdev.minor);
printf("\n");

if (xst->st_result_mask & XSTAT_REQUEST_MODE)
printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
xst->st_mode & 07777,
ft,
xst->st_mode & S_IRUSR ? 'r' : '-',
xst->st_mode & S_IWUSR ? 'w' : '-',
xst->st_mode & S_IXUSR ? 'x' : '-',
xst->st_mode & S_IRGRP ? 'r' : '-',
xst->st_mode & S_IWGRP ? 'w' : '-',
xst->st_mode & S_IXGRP ? 'x' : '-',
xst->st_mode & S_IROTH ? 'r' : '-',
xst->st_mode & S_IWOTH ? 'w' : '-',
xst->st_mode & S_IXOTH ? 'x' : '-');
if (xst->st_result_mask & XSTAT_REQUEST_UID)
printf("Uid: %u \n", xst->st_uid);
if (xst->st_result_mask & XSTAT_REQUEST_GID)
printf("Gid: %u\n", xst->st_gid);

if (xst->st_result_mask & XSTAT_REQUEST_ATIME)
print_time("Access: ", &xst->st_atim);
if (xst->st_result_mask & XSTAT_REQUEST_MTIME)
print_time("Modify: ", &xst->st_mtim);
if (xst->st_result_mask & XSTAT_REQUEST_CTIME)
print_time("Change: ", &xst->st_ctim);
if (xst->st_result_mask & XSTAT_REQUEST_BTIME)
print_time("Create: ", &xst->st_btim);

if (xst->st_result_mask & XSTAT_REQUEST_GEN)
printf("Inode version: %llxh\n", xst->st_gen);
if (xst->st_result_mask & XSTAT_REQUEST_DATA_VERSION)
printf("Data version: %llxh\n", xst->st_data_version);

if (xst->st_result_mask & XSTAT_REQUEST_INODE_FLAGS) {
unsigned char bits;
int loop, byte;

static char flag_representation[64 + 1] =
"????????"
"????????"
"????????"
"otserAaS"
"????????"
"????ehTD"
"tj?IE?XZ"
"AdaiScus"
;

printf("Inode flags: %016llx (", xst->st_inode_flags);
for (byte = 64 - 8; byte >= 0; byte -= 8) {
bits = xst->st_inode_flags >> byte;
for (loop = 7; loop >= 0; loop--) {
int bit = byte + loop;

if (bits & 0x80)
putchar(flag_representation[63 - bit]);
else
putchar('-');
bits <<= 1;
}
if (byte)
putchar(' ');
}
printf(")\n");
}
}

int main(int argc, char **argv)
{
struct xstat_parameters params;
union {
struct xstat xst;
unsigned long long raw[4096 / 8];
} buffer;
int ret, atflag = AT_SYMLINK_NOFOLLOW;

unsigned long long query = XSTAT_REQUEST__ALL_STATS;

for (argv++; *argv; argv++) {
if (strcmp(*argv, "-F") == 0) {
atflag |= AT_FORCE_ATTR_SYNC;
continue;
}
if (strcmp(*argv, "-L") == 0) {
atflag &= ~AT_SYMLINK_NOFOLLOW;
continue;
}
if (strcmp(*argv, "-O") == 0) {
query &= ~XSTAT_REQUEST__BASIC_STATS;
continue;
}
if (strcmp(*argv, "-A") == 0) {
atflag |= AT_NO_AUTOMOUNT;
continue;
}

memset(&buffer, 0xbf, sizeof(buffer));
params.request_mask = query;
ret = xstat(AT_FDCWD, *argv, atflag, &params, &buffer.xst,
sizeof(buffer));
printf("xstat(%s) = %d\n", *argv, ret);
if (ret < 0) {
perror(*argv);
exit(1);
}

dump_xstat(&buffer.xst);

ret = (ret + 7) / 8;
if (ret > sizeof(buffer.xst) / 8) {
unsigned offset, print_offset = 1, col = 0;
if (ret > sizeof(buffer) / 8)
ret = sizeof(buffer) / 8;

for (offset = sizeof(buffer.xst) / 8; offset < ret; offset++) {
if (print_offset) {
printf("%04x: ", offset * 8);
print_offset = 0;
}
printf("%016llx", buffer.raw[offset]);
col++;
if ((col & 3) == 0) {
printf("\n");
print_offset = 1;
} else {
printf(" ");
}
}

if (!print_offset)
printf("\n");
}
}
return 0;
}

Just compile and run, passing it paths to the files you want to examine:

[root@andromeda ~]# /tmp/xstat /proc/$$
xstat(/proc/2074) = 160
results=47ef
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 00:03 Inode: 9072 Links: 7
Access: (0555/dr-xr-xr-x) Uid: 0
Gid: 0
Access: 2010-07-14 16:50:46.609336272+0100
Modify: 2010-07-14 16:50:46.609336272+0100
Change: 2010-07-14 16:50:46.609336272+0100
Inode flags: 0000000100000000 (-------- -------- -------- -------S -------- -------- -------- --------)
[root@andromeda ~]# /tmp/xstat /afs/archive/linuxdev/fedora9/x86_64/kernel-devel-2.6.25.10-86.fc9.x86_64.rpm
xstat(/afs/archive/linuxdev/fedora9/x86_64/kernel-devel-2.6.25.10-86.fc9.x86_64.rpm) = 160
results=77ef
Size: 5413882 Blocks: 0 IO Block: 4096 regular file
Device: 00:15 Inode: 2288 Links: 1
Access: (0644/-rw-r--r--) Uid: 75338
Gid: 0
Access: 2008-11-05 19:47:22.000000000+0000
Modify: 2008-11-05 19:47:22.000000000+0000
Change: 2008-11-05 19:47:22.000000000+0000
Inode version: 795h
Data version: 2h
Inode flags: 0000000800000000 (-------- -------- -------- ----r--- -------- -------- -------- --------)
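
The results= values in the output above can be decoded with something like the
following sketch. The bit values are those from the test program; the table
and the xstat_describe_mask() helper are hypothetical additions for
illustration.

```c
#include <stdio.h>

/* Bit values copied from the test program above. */
struct xstat_bit_name {
	unsigned long long bit;
	const char *name;
};

static const struct xstat_bit_name xstat_bit_names[] = {
	{ 0x00000001ULL, "MODE" },
	{ 0x00000002ULL, "NLINK" },
	{ 0x00000004ULL, "UID" },
	{ 0x00000008ULL, "GID" },
	{ 0x00000010ULL, "RDEV" },
	{ 0x00000020ULL, "ATIME" },
	{ 0x00000040ULL, "MTIME" },
	{ 0x00000080ULL, "CTIME" },
	{ 0x00000100ULL, "INO" },
	{ 0x00000200ULL, "SIZE" },
	{ 0x00000400ULL, "BLOCKS" },
	{ 0x00000800ULL, "BTIME" },
	{ 0x00001000ULL, "GEN" },
	{ 0x00002000ULL, "DATA_VERSION" },
	{ 0x00004000ULL, "INODE_FLAGS" },
};

/*
 * Hypothetical helper: print the names of the bits set in a result mask and
 * return how many known bits were set.
 */
int xstat_describe_mask(unsigned long long mask, FILE *out)
{
	size_t i;
	int count = 0;

	for (i = 0; i < sizeof(xstat_bit_names) / sizeof(xstat_bit_names[0]); i++) {
		if (mask & xstat_bit_names[i].bit) {
			fprintf(out, "%s%s", count ? " " : "",
				xstat_bit_names[i].name);
			count++;
		}
	}
	fputc('\n', out);
	return count;
}
```

For results=47ef this lists the basic set minus RDEV, plus INODE_FLAGS; for
results=77ef it additionally lists GEN and DATA_VERSION.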

Signed-off-by: David Howells <[email protected]>
---

arch/x86/include/asm/unistd_32.h | 4
arch/x86/include/asm/unistd_64.h | 4
fs/stat.c | 322 +++++++++++++++++++++++++++++++++++---
include/linux/fcntl.h | 1
include/linux/fs.h | 4
include/linux/stat.h | 119 ++++++++++++++
include/linux/syscalls.h | 9 +
7 files changed, 436 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index beb9b5f..a9953cc 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_xstat 338
+#define __NR_fxstat 339

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 340

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index ff4307b..c90d240 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_xstat 300
+__SYSCALL(__NR_xstat, sys_xstat)
+#define __NR_fxstat 301
+__SYSCALL(__NR_fxstat, sys_fxstat)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/fs/stat.c b/fs/stat.c
index 12e90e2..89d72fc 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -18,6 +18,15 @@
#include <asm/uaccess.h>
#include <asm/unistd.h>

+/**
+ * generic_fillattr - Fill in the basic attributes from the inode struct
+ * @inode: Inode to use as the source
+ * @stat: Where to fill in the attributes
+ *
+ * Fill in the basic attributes in the kstat structure from data that's to be
+ * found on the VFS inode structure. This is the default if no getattr inode
+ * operation is supplied.
+ */
void generic_fillattr(struct inode *inode, struct kstat *stat)
{
stat->dev = inode->i_sb->s_dev;
@@ -33,11 +42,37 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
stat->size = i_size_read(inode);
stat->blocks = inode->i_blocks;
stat->blksize = (1 << inode->i_blkbits);
+ stat->inode_flags = inode->i_sb->s_type->inode_flags;
+ stat->result_mask |= XSTAT_REQUEST__BASIC_STATS & ~XSTAT_REQUEST_RDEV;
+ if (unlikely(S_ISBLK(stat->mode) || S_ISCHR(stat->mode)))
+ stat->result_mask |= XSTAT_REQUEST_RDEV;
+ if (stat->inode_flags)
+ stat->result_mask |= XSTAT_REQUEST_INODE_FLAGS;
}
-
EXPORT_SYMBOL(generic_fillattr);

-int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+/**
+ * vfs_xgetattr - Get the extended attributes of a file
+ * @mnt: The mountpoint to which the dentry belongs
+ * @dentry: The file of interest
+ * @stat: Where to return the statistics
+ *
+ * Ask the filesystem for a file's attributes. The caller must have preset
+ * stat->request_mask and stat->query_flags to indicate what they want.
+ *
+ * If the file is remote, the filesystem can be forced to update the attributes
+ * from the backing store by passing AT_FORCE_ATTR_SYNC in query_flags.
+ *
+ * Bits must have been set in stat->request_mask to indicate which attributes
+ * the caller wants retrieving. Only attributes from the set
+ * XSTAT_REQUEST__EXTENDED_STATS can be retrieved through this interface. Any
+ * such attribute not requested may be returned anyway, but the value may be
+ * approximate, and, if remote, may not have been synchronised with the server.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_xgetattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
{
struct inode *inode = dentry->d_inode;
int retval;
@@ -46,61 +81,176 @@ int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
if (retval)
return retval;

+ stat->result_mask = 0;
if (inode->i_op->getattr)
return inode->i_op->getattr(mnt, dentry, stat);

generic_fillattr(inode, stat);
return 0;
}
+EXPORT_SYMBOL(vfs_xgetattr);

+/**
+ * vfs_getattr - Get the basic attributes of a file
+ * @mnt: The mountpoint to which the dentry belongs
+ * @dentry: The file of interest
+ * @stat: Where to return the statistics
+ *
+ * Ask the filesystem for a file's attributes. If remote, the filesystem isn't
+ * forced to update its files from the backing store. Only the basic set of
+ * attributes will be retrieved; anyone wanting more must use vfs_xgetattr(),
+ * as must anyone who wants to force attributes to be sync'd with the server.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
+{
+ stat->query_flags = 0;
+ stat->request_mask = XSTAT_REQUEST__BASIC_STATS;
+ return vfs_xgetattr(mnt, dentry, stat);
+}
EXPORT_SYMBOL(vfs_getattr);

-int vfs_fstat(unsigned int fd, struct kstat *stat)
+/**
+ * vfs_fxstat - Get extended attributes by file descriptor
+ * @fd: The file descriptor referring to the file of interest
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_xgetattr(). The main difference is
+ * that it uses a file descriptor to determine the file location.
+ *
+ * The caller must have preset stat->query_flags and stat->request_mask as for
+ * vfs_xgetattr().
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_fxstat(unsigned int fd, struct kstat *stat)
{
struct file *f = fget(fd);
int error = -EBADF;

+ if (stat->query_flags & ~KSTAT_QUERY_FLAGS)
+ return -EINVAL;
if (f) {
- error = vfs_getattr(f->f_path.mnt, f->f_path.dentry, stat);
+ error = vfs_xgetattr(f->f_path.mnt, f->f_path.dentry, stat);
fput(f);
}
return error;
}
+EXPORT_SYMBOL(vfs_fxstat);
+
+/**
+ * vfs_fstat - Get basic attributes by file descriptor
+ * @fd: The file descriptor referring to the file of interest
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_getattr(). The main difference is
+ * that it uses a file descriptor to determine the file location.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_fstat(unsigned int fd, struct kstat *stat)
+{
+ stat->query_flags = 0;
+ stat->request_mask = XSTAT_REQUEST__BASIC_STATS;
+ return vfs_fxstat(fd, stat);
+}
EXPORT_SYMBOL(vfs_fstat);

-int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
- int flag)
+/**
+ * vfs_xstat - Get extended attributes by filename
+ * @dfd: A file descriptor representing the base dir for a relative filename
+ * @filename: The name of the file of interest
+ * @flags: Flags to control the query
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_xgetattr(). The main difference is
+ * that it uses a filename and base directory to determine the file location.
+ * Additionally, setting AT_SYMLINK_NOFOLLOW in flags will prevent a
+ * symlink at the given name from being followed.
+ *
+ * The caller must have preset stat->request_mask as for vfs_xgetattr(). The
+ * flags are also used to load up stat->query_flags.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_xstat(int dfd, const char __user *filename, int flags,
+ struct kstat *stat)
{
struct path path;
- int error = -EINVAL;
- int lookup_flags = 0;
+ int error, lookup_flags;

- if ((flag & ~AT_SYMLINK_NOFOLLOW) != 0)
- goto out;
+ if (flags & ~(AT_SYMLINK_NOFOLLOW | KSTAT_QUERY_FLAGS))
+ return -EINVAL;

- if (!(flag & AT_SYMLINK_NOFOLLOW))
- lookup_flags |= LOOKUP_FOLLOW;
+ stat->query_flags = flags & KSTAT_QUERY_FLAGS;
+ lookup_flags = (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW;

error = user_path_at(dfd, filename, lookup_flags, &path);
- if (error)
- goto out;
-
- error = vfs_getattr(path.mnt, path.dentry, stat);
- path_put(&path);
-out:
+ if (!error) {
+ error = vfs_xgetattr(path.mnt, path.dentry, stat);
+ path_put(&path);
+ }
return error;
}
+EXPORT_SYMBOL(vfs_xstat);
+
+/**
+ * vfs_fstatat - Get basic attributes by filename
+ * @dfd: A file descriptor representing the base dir for a relative filename
+ * @filename: The name of the file of interest
+ * @flags: Flags to control the query
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_xstat(). The difference is that it
+ * preselects basic stats only. The flags are used to load up
+ * stat->query_flags in addition to indicating symlink handling during path
+ * resolution.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
+ int flags)
+{
+ stat->request_mask = XSTAT_REQUEST__BASIC_STATS;
+ return vfs_xstat(dfd, filename, flags, stat);
+}
EXPORT_SYMBOL(vfs_fstatat);

-int vfs_stat(const char __user *name, struct kstat *stat)
+/**
+ * vfs_stat - Get basic attributes by filename
+ * @filename: The name of the file of interest
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_xstat(). The difference is that it
+ * preselects basic stats only, terminal symlinks are followed regardless and a
+ * remote filesystem can't be forced to query the server. If such is desired,
+ * vfs_xstat() should be used instead.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
+int vfs_stat(const char __user *filename, struct kstat *stat)
{
- return vfs_fstatat(AT_FDCWD, name, stat, 0);
+ stat->request_mask = XSTAT_REQUEST__BASIC_STATS;
+ return vfs_xstat(AT_FDCWD, filename, 0, stat);
}
EXPORT_SYMBOL(vfs_stat);

+/**
+ * vfs_lstat - Get basic attributes by filename, without following terminal symlink
+ * @name: The name of the file of interest
+ * @stat: The result structure to fill in.
+ *
+ * This function is a wrapper around vfs_xstat(). The difference is that it
+ * preselects basic stats only, terminal symlinks are not followed regardless
+ * and a remote filesystem can't be forced to query the server. If such is
+ * desired, vfs_xstat() should be used instead.
+ *
+ * 0 will be returned on success, and a -ve error code if unsuccessful.
+ */
int vfs_lstat(const char __user *name, struct kstat *stat)
{
- return vfs_fstatat(AT_FDCWD, name, stat, AT_SYMLINK_NOFOLLOW);
+ return vfs_xstat(AT_FDCWD, name, AT_SYMLINK_NOFOLLOW, stat);
}
EXPORT_SYMBOL(vfs_lstat);

@@ -115,7 +265,7 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
{
static int warncount = 5;
struct __old_kernel_stat tmp;
-
+
if (warncount > 0) {
warncount--;
printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Recompile your binary.\n",
@@ -140,7 +290,7 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
#if BITS_PER_LONG == 32
if (stat->size > MAX_NON_LFS)
return -EOVERFLOW;
-#endif
+#endif
tmp.st_size = stat->size;
tmp.st_atime = stat->atime.tv_sec;
tmp.st_mtime = stat->mtime.tv_sec;
@@ -222,7 +372,7 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
#if BITS_PER_LONG == 32
if (stat->size > MAX_NON_LFS)
return -EOVERFLOW;
-#endif
+#endif
tmp.st_size = stat->size;
tmp.st_atime = stat->atime.tv_sec;
tmp.st_mtime = stat->mtime.tv_sec;
@@ -408,6 +558,130 @@ SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename,
}
#endif /* __ARCH_WANT_STAT64 */

+/*
+ * Get the xstat parameters if supplied
+ */
+static int xstat_get_params(struct xstat_parameters __user *_params,
+ struct kstat *stat)
+{
+ struct xstat_parameters params;
+
+ memset(stat, 0xde, sizeof(*stat)); // DEBUGGING
+
+ if (_params) {
+ if (copy_from_user(&params, _params, sizeof(params)) != 0)
+ return -EFAULT;
+ stat->request_mask =
+ params.request_mask & XSTAT_REQUEST__ALL_STATS;
+ } else {
+ stat->request_mask = XSTAT_REQUEST__BASIC_STATS;
+ }
+ stat->result_mask = 0;
+ return 0;
+}
+
+/*
+ * Set the xstat results.
+ *
+ * If the buffer size was 0, we just return the size of the buffer needed to
+ * return the full result.
+ *
+ * If bufsize indicates a buffer of insufficient size to hold the full result,
+ * we return -E2BIG.
+ *
+ * Otherwise we copy the extended stats to userspace and return the amount of
+ * data written into the buffer (or -EFAULT).
+ */
+static long xstat_set_result(struct kstat *stat,
+ struct xstat __user *buffer, size_t bufsize)
+{
+ struct xstat tmp;
+ size_t result_size = sizeof(tmp);
+
+ if (bufsize == 0)
+ return result_size;
+ if (bufsize < result_size)
+ return -E2BIG;
+
+ /* transfer the fixed results */
+ memset(&tmp, 0, sizeof(tmp));
+ tmp.st_result_mask = stat->result_mask;
+ tmp.st_mode = stat->mode;
+ tmp.st_nlink = stat->nlink;
+ tmp.st_uid = stat->uid;
+ tmp.st_gid = stat->gid;
+ tmp.st_blksize = stat->blksize;
+ tmp.st_rdev.major = MAJOR(stat->rdev);
+ tmp.st_rdev.minor = MINOR(stat->rdev);
+ tmp.st_dev.major = MAJOR(stat->dev);
+ tmp.st_dev.minor = MINOR(stat->dev);
+ tmp.st_atime.tv_sec = stat->atime.tv_sec;
+ tmp.st_atime.tv_nsec = stat->atime.tv_nsec;
+ tmp.st_mtime.tv_sec = stat->mtime.tv_sec;
+ tmp.st_mtime.tv_nsec = stat->mtime.tv_nsec;
+ tmp.st_ctime.tv_sec = stat->ctime.tv_sec;
+ tmp.st_ctime.tv_nsec = stat->ctime.tv_nsec;
+ tmp.st_ino = stat->ino;
+ tmp.st_size = stat->size;
+ tmp.st_blocks = stat->blocks;
+
+ if (tmp.st_result_mask & XSTAT_REQUEST_BTIME) {
+ tmp.st_btime.tv_sec = stat->btime.tv_sec;
+ tmp.st_btime.tv_nsec = stat->btime.tv_nsec;
+ }
+ if (tmp.st_result_mask & XSTAT_REQUEST_GEN)
+ tmp.st_gen = stat->gen;
+ if (tmp.st_result_mask & XSTAT_REQUEST_DATA_VERSION)
+ tmp.st_data_version = stat->data_version;
+ if (tmp.st_result_mask & XSTAT_REQUEST_INODE_FLAGS)
+ tmp.st_inode_flags = stat->inode_flags;
+
+ if (copy_to_user(buffer, &tmp, result_size) != 0)
+ return -EFAULT;
+ return result_size;
+}
+
+/*
+ * System call to get extended stats by path
+ */
+SYSCALL_DEFINE6(xstat,
+ int, dfd, const char __user *, filename, unsigned, atflag,
+ struct xstat_parameters __user *, params,
+ struct xstat __user *, buffer, size_t, bufsize)
+{
+ struct kstat stat;
+ int error;
+
+ error = xstat_get_params(params, &stat);
+ if (error != 0)
+ return error;
+ error = vfs_xstat(dfd, filename, atflag, &stat);
+ if (error)
+ return error;
+ return xstat_set_result(&stat, buffer, bufsize);
+}
+
+/*
+ * System call to get extended stats by file descriptor
+ */
+SYSCALL_DEFINE5(fxstat, unsigned int, fd, unsigned int, flags,
+ struct xstat_parameters __user *, params,
+ struct xstat __user *, buffer, size_t, bufsize)
+{
+ struct kstat stat;
+ int error;
+
+ error = xstat_get_params(params, &stat);
+ if (error < 0)
+ return error;
+ stat.query_flags = flags;
+ error = vfs_fxstat(fd, &stat);
+ if (error)
+ return error;
+
+ return xstat_set_result(&stat, buffer, bufsize);
+}
+
/* Caller is here responsible for sufficient locking (ie. inode->i_lock) */
void __inode_add_bytes(struct inode *inode, loff_t bytes)
{
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index afc00af..bcf8083 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -45,6 +45,7 @@
#define AT_REMOVEDIR 0x200 /* Remove directory instead of
unlinking file. */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
+#define AT_FORCE_ATTR_SYNC 0x800 /* Force the attributes to be sync'd with the server */

#ifdef __KERNEL__

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5e7cf2..951c36b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1735,6 +1735,7 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc);
struct file_system_type {
const char *name;
int fs_flags;
+ u64 inode_flags; /* base inode_flags for generic_fillattr() */
int (*get_sb) (struct file_system_type *, int,
const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
@@ -2341,6 +2342,7 @@ extern const struct inode_operations page_symlink_inode_operations;
extern int generic_readlink(struct dentry *, char __user *, int);
extern void generic_fillattr(struct inode *, struct kstat *);
extern int vfs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
+extern int vfs_xgetattr(struct vfsmount *, struct dentry *, struct kstat *);
void __inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_add_bytes(struct inode *inode, loff_t bytes);
void inode_sub_bytes(struct inode *inode, loff_t bytes);
@@ -2353,6 +2355,8 @@ extern int vfs_stat(const char __user *, struct kstat *);
extern int vfs_lstat(const char __user *, struct kstat *);
extern int vfs_fstat(unsigned int, struct kstat *);
extern int vfs_fstatat(int , const char __user *, struct kstat *, int);
+extern int vfs_xstat(int, const char __user *, int, struct kstat *);
+extern int vfs_xfstat(unsigned int, struct kstat *);

extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg);
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 611c398..41a3c22 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,114 @@

#endif

+/*
+ * Extended stat structures
+ */
+struct xstat_parameters {
+ /* Query request/result mask
+ *
+ * Bits should be set in request_mask to request particular items
+ * before calling xstat() or fxstat().
+ *
+ * For each item in the set XSTAT_REQUEST__EXTENDED_STATS:
+ *
+ * - if not available at all, the bit will be cleared before returning
+ * and the field will be cleared; otherwise,
+ *
+ * - if AT_FORCE_ATTR_SYNC is set, then the datum will be synchronised
+ * to the server and the bit will be set on return; otherwise,
+ *
+ * - if requested, the datum will be synchronised to a server or other
+ * hardware if out of date before being returned, and the bit will be
+ * set on return; otherwise,
+ *
+ * - if not requested, but available in approximate form without any
+ * effort, it will be filled in anyway, and the bit will be set upon
+ * return (it might not be up to date, however, and no attempt will
+ * be made to synchronise the internal state first); otherwise,
+ *
+ * - the bit will be cleared before returning, and the field will be
+ * cleared.
+ *
+ * For each item not in the set XSTAT_REQUEST__EXTENDED_STATS:
+ *
+ * - if not available at all, the bit will be cleared, and no result
+ * data will be returned; otherwise,
+ *
+ * - if requested, the datum will be synchronised to a server or other
+ * hardware before being appended if necessary, and the bit will be
+ * set on return; otherwise,
+ *
+ * - the bit will be cleared, and no result data will be returned.
+ *
+ * Items in XSTAT_REQUEST__BASIC_STATS may be marked unavailable on
+ * return, but they will have a value installed for compatibility
+ * purposes.
+ */
+ unsigned long long request_mask;
+#define XSTAT_REQUEST_MODE 0x00000001ULL /* want/got st_mode */
+#define XSTAT_REQUEST_NLINK 0x00000002ULL /* want/got st_nlink */
+#define XSTAT_REQUEST_UID 0x00000004ULL /* want/got st_uid */
+#define XSTAT_REQUEST_GID 0x00000008ULL /* want/got st_gid */
+#define XSTAT_REQUEST_RDEV 0x00000010ULL /* want/got st_rdev */
+#define XSTAT_REQUEST_ATIME 0x00000020ULL /* want/got st_atime */
+#define XSTAT_REQUEST_MTIME 0x00000040ULL /* want/got st_mtime */
+#define XSTAT_REQUEST_CTIME 0x00000080ULL /* want/got st_ctime */
+#define XSTAT_REQUEST_INO 0x00000100ULL /* want/got st_ino */
+#define XSTAT_REQUEST_SIZE 0x00000200ULL /* want/got st_size */
+#define XSTAT_REQUEST_BLOCKS 0x00000400ULL /* want/got st_blocks */
+#define XSTAT_REQUEST__BASIC_STATS 0x000007ffULL /* the stuff in the normal stat struct */
+#define XSTAT_REQUEST_BTIME 0x00000800ULL /* want/got st_btime */
+#define XSTAT_REQUEST_GEN 0x00001000ULL /* want/got st_gen */
+#define XSTAT_REQUEST_DATA_VERSION 0x00002000ULL /* want/got st_data_version */
+#define XSTAT_REQUEST_INODE_FLAGS 0x00004000ULL /* want/got st_inode_flags */
+#define XSTAT_REQUEST__EXTENDED_STATS 0x00007fffULL /* the stuff in the xstat struct */
+#define XSTAT_REQUEST__ALL_STATS 0x00007fffULL /* the defined set of requestables */
+};
+
+struct xstat_dev {
+ unsigned int major, minor;
+};
+
+struct xstat_time {
+ unsigned long long tv_sec, tv_nsec;
+};
+
+struct xstat {
+ unsigned long long st_result_mask; /* what results were written */
+ unsigned int st_mode; /* file mode */
+ unsigned int st_nlink; /* number of hard links */
+ unsigned int st_uid; /* user ID of owner */
+ unsigned int st_gid; /* group ID of owner */
+ struct xstat_dev st_rdev; /* device ID of special file */
+ struct xstat_dev st_dev; /* ID of device containing file */
+ struct xstat_time st_atime; /* last access time */
+ struct xstat_time st_mtime; /* last data modification time */
+ struct xstat_time st_ctime; /* last attribute change time */
+ struct xstat_time st_btime; /* file creation time */
+ unsigned long long st_ino; /* inode number */
+ unsigned long long st_size; /* file size */
+ unsigned long long st_blksize; /* block size for filesystem I/O */
+ unsigned long long st_blocks; /* number of 512-byte blocks allocated */
+ unsigned long long st_gen; /* inode generation number */
+ unsigned long long st_data_version; /* data version number */
+ unsigned long long st_inode_flags; /* inode flags (!= BSD st_flags) */
+ unsigned long long st_extra_results[0]; /* extra requested results */
+};
+
+#define FS__STANDARD_FL 0x00000000ffffffffULL /* As for user visible FS_IOC_GETFLAGS */
+#define FS_SPECIAL_FL 0x0000000100000000ULL /* Special file as found in procfs/sysfs */
+#define FS_AUTOMOUNT_FL 0x0000000200000000ULL /* Specific automount point */
+#define FS_AUTOMOUNT_ANY_FL 0x0000000400000000ULL /* Unspecific automount directory */
+#define FS_REMOTE_FL 0x0000000800000000ULL /* File is remote */
+#define FS_ENCRYPTED_FL 0x0000001000000000ULL /* File is encrypted */
+#define FS_HIDDEN_FL 0x0000002000000000ULL /* File is marked hidden (DOS+) */
+#define FS_SYSTEM_FL 0x0000004000000000ULL /* File is marked system (DOS+) */
+#define FS_ARCHIVE_FL 0x0000008000000000ULL /* File is marked archive (DOS+) */
+#define FS_TEMPORARY_FL 0x0000010000000000ULL /* File is temporary (NTFS/CIFS) */
+#define FS_OFFLINE_FL 0x0000020000000000ULL /* File is offline (CIFS) */
+#define FS_REPARSE_POINT_FL 0x0000040000000000ULL /* Reparse point (NTFS/CIFS) */
+
#ifdef __KERNEL__
#define S_IRWXUGO (S_IRWXU|S_IRWXG|S_IRWXO)
#define S_IALLUGO (S_ISUID|S_ISGID|S_ISVTX|S_IRWXUGO)
@@ -60,6 +168,8 @@
#include <linux/time.h>

struct kstat {
+ u64 request_mask; /* what fields the user asked for */
+ u64 result_mask; /* what fields the user got */
u64 ino;
dev_t dev;
umode_t mode;
@@ -67,14 +177,19 @@ struct kstat {
uid_t uid;
gid_t gid;
dev_t rdev;
+ unsigned int query_flags; /* operational flags */
+#define KSTAT_QUERY_FLAGS (AT_FORCE_ATTR_SYNC)
loff_t size;
- struct timespec atime;
+ struct timespec atime;
struct timespec mtime;
struct timespec ctime;
+ struct timespec btime; /* file creation time */
unsigned long blksize;
unsigned long long blocks;
+ u64 gen; /* inode generation */
+ u64 data_version;
+ u64 inode_flags; /* inode flags (!= BSD st_flags) */
};

#endif
-
#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 8812a63..5d68b4c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -44,6 +44,8 @@ struct shmid_ds;
struct sockaddr;
struct stat;
struct stat64;
+struct xstat_parameters;
+struct xstat;
struct statfs;
struct statfs64;
struct __sysctl_args;
@@ -824,4 +826,11 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long fd, unsigned long pgoff);
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);

+asmlinkage long sys_xstat(int, const char __user *, unsigned,
+ struct xstat_parameters __user *,
+ struct xstat __user *, size_t);
+asmlinkage long sys_fxstat(unsigned, unsigned,
+ struct xstat_parameters __user *,
+ struct xstat __user *, size_t);
+
#endif
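
As a reading aid for the request_mask comment in struct xstat_parameters above, here is a minimal userspace sketch of the decision chain for items in XSTAT_REQUEST__EXTENDED_STATS. The enum, the function names and the four boolean inputs are illustrative assumptions and are not part of the patch; the real decisions are made by each filesystem's getattr operation.

```c
#include <stdbool.h>

/* Possible outcomes for one extended-stat item (illustrative names). */
enum item_outcome {
	ITEM_CLEARED,		/* result bit cleared, field cleared */
	ITEM_FORCED_SYNC,	/* synchronised because AT_FORCE_ATTR_SYNC was set */
	ITEM_SYNCED_IF_STALE,	/* requested: synchronised only if out of date */
	ITEM_APPROXIMATE,	/* not requested, filled in from cheap cached data */
};

/*
 * Walk the "otherwise" chain from the comment in the order it is written.
 * "cheap_copy" stands for "available in approximate form without any effort".
 */
static enum item_outcome resolve_extended_item(bool available, bool force_sync,
					       bool requested, bool cheap_copy)
{
	if (!available)
		return ITEM_CLEARED;
	if (force_sync)
		return ITEM_FORCED_SYNC;
	if (requested)
		return ITEM_SYNCED_IF_STALE;
	if (cheap_copy)
		return ITEM_APPROXIMATE;
	return ITEM_CLEARED;
}

/* The result bit is set on return for every outcome except ITEM_CLEARED. */
static bool result_bit_set(enum item_outcome outcome)
{
	return outcome != ITEM_CLEARED;
}
```

Note that an unrequested item can still come back with its bit set (ITEM_APPROXIMATE), so a caller that needs an up-to-date value should request the item explicitly rather than rely on it being filled in anyway.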

2010-07-15 02:17:14

Subject: [PATCH 04/18] xstat: AFS: Return extended attributes [ver #6]

Return extended attributes from the AFS filesystem. This includes the
following:

(1) The vnode uniquifier as st_gen.

(2) The data version number as st_data_version.

(3) FS_AUTOMOUNT_FL on mountpoint directories.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/inode.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index ee3190a..02f115f 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -300,16 +300,19 @@ error_unlock:
/*
* read the attributes of an inode
*/
-int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,
- struct kstat *stat)
+int afs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
- struct inode *inode;
-
- inode = dentry->d_inode;
+ struct inode *inode = dentry->d_inode;

_enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);

generic_fillattr(inode, stat);
+
+ stat->result_mask |= XSTAT_REQUEST_GEN | XSTAT_REQUEST_DATA_VERSION;
+ stat->gen = inode->i_generation;
+ stat->data_version = inode->i_version;
+ if (test_bit(AFS_VNODE_MOUNTPOINT, &AFS_FS_I(inode)->flags))
+ stat->inode_flags |= FS_AUTOMOUNT_FL;
return 0;
}

2010-07-15 02:17:15

Subject: [PATCH 05/18] xstat: eCryptFS: Return extended attributes [ver #6]

Return extended attributes from the eCryptFS filesystem, dredged up from the
lower filesystem.

Possibly eCryptFS should also set FS_COMPR_FL on its compressed files.

Signed-off-by: David Howells <[email protected]>
---

fs/ecryptfs/inode.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 31ef525..41bc407 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -994,8 +994,10 @@ int ecryptfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat lower_stat;
int rc;

- rc = vfs_getattr(ecryptfs_dentry_to_lower_mnt(dentry),
- ecryptfs_dentry_to_lower(dentry), &lower_stat);
+ lower_stat.query_flags = stat->query_flags;
+ lower_stat.request_mask = stat->request_mask | XSTAT_REQUEST_BLOCKS;
+ rc = vfs_xgetattr(ecryptfs_dentry_to_lower_mnt(dentry),
+ ecryptfs_dentry_to_lower(dentry), &lower_stat);
if (!rc) {
generic_fillattr(dentry->d_inode, stat);
stat->blocks = lower_stat.blocks;

2010-07-15 02:17:16

Subject: [PATCH 06/18] xstat: Ext4: Return extended attributes [ver #6]

Return extended attributes from the Ext4 filesystem. This includes the
following:

(1) The inode creation time (i_crtime) as st_btime.

(2) The inode i_generation as st_gen if not the root directory.

(3) The inode i_version as st_data_version if a file with I_VERSION set or a
directory.

(4) FS_xxx_FL flags as for FS_IOC_GETFLAGS.
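
To make the conditions in the list above concrete, here is a small standalone sketch of which extended result bits ext4_getattr() ends up setting for a given inode. The struct ext4_inode_facts and the function name are hypothetical stand-ins for what the kernel reads from the inode and superblock; the mask values are copied from the xstat patch in this series.

```c
#include <stdbool.h>

/* Result bits, copied from include/linux/stat.h in the xstat patch. */
#define XSTAT_REQUEST_BTIME		0x00000800ULL
#define XSTAT_REQUEST_GEN		0x00001000ULL
#define XSTAT_REQUEST_DATA_VERSION	0x00002000ULL
#define XSTAT_REQUEST_INODE_FLAGS	0x00004000ULL

/* Hypothetical summary of the inode properties the checks depend on. */
struct ext4_inode_facts {
	bool is_root;			/* i_ino == EXT4_ROOT_INO */
	bool is_dir;			/* S_ISDIR(i_mode) */
	bool mount_has_i_version;	/* test_opt(sb, I_VERSION) */
};

/* Extended bits added on top of whatever generic_fillattr() reports. */
static unsigned long long
ext4_extended_result_mask(const struct ext4_inode_facts *f)
{
	/* Creation time and inode flags are always reported. */
	unsigned long long mask =
		XSTAT_REQUEST_BTIME | XSTAT_REQUEST_INODE_FLAGS;

	/* The generation number is withheld for the root directory. */
	if (!f->is_root)
		mask |= XSTAT_REQUEST_GEN;

	/* i_version is reported for directories or I_VERSION mounts. */
	if (f->is_dir || f->mount_has_i_version)
		mask |= XSTAT_REQUEST_DATA_VERSION;

	return mask;
}
```

For example, the root directory yields BTIME, INODE_FLAGS and DATA_VERSION but not GEN, while a regular file on a mount without I_VERSION yields BTIME, INODE_FLAGS and GEN but not DATA_VERSION.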

Signed-off-by: David Howells <[email protected]>
---

fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++---
fs/ext4/namei.c | 2 ++
fs/ext4/symlink.c | 2 ++
5 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..96823f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1571,6 +1571,8 @@ extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct dentry *, struct iattr *);
extern int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
+extern int ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat);
extern void ext4_delete_inode(struct inode *);
extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5313ae4..18c29ab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -150,7 +150,7 @@ const struct file_operations ext4_file_operations = {
const struct inode_operations ext4_file_inode_operations = {
.truncate = ext4_truncate,
.setattr = ext4_setattr,
- .getattr = ext4_getattr,
+ .getattr = ext4_file_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..822a4ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5550,12 +5550,38 @@ err_out:
int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
- struct inode *inode;
- unsigned long delalloc_blocks;
+ struct inode *inode = dentry->d_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);

- inode = dentry->d_inode;
generic_fillattr(inode, stat);

+ stat->result_mask |= XSTAT_REQUEST_BTIME;
+ stat->btime.tv_sec = ei->i_crtime.tv_sec;
+ stat->btime.tv_nsec = ei->i_crtime.tv_nsec;
+
+ if (inode->i_ino != EXT4_ROOT_INO) {
+ stat->result_mask |= XSTAT_REQUEST_GEN;
+ stat->gen = inode->i_generation;
+ }
+ if (S_ISDIR(inode->i_mode) || test_opt(inode->i_sb, I_VERSION)) {
+ stat->result_mask |= XSTAT_REQUEST_DATA_VERSION;
+ stat->data_version = inode->i_version;
+ }
+
+ ext4_get_inode_flags(ei);
+ stat->inode_flags |= ei->i_flags & EXT4_FL_USER_VISIBLE;
+ stat->result_mask |= XSTAT_REQUEST_INODE_FLAGS;
+ return 0;
+}
+
+int ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ struct inode *inode = dentry->d_inode;
+ unsigned long delalloc_blocks;
+
+ ext4_getattr(mnt, dentry, stat);
+
/*
* We can't update i_blocks if the block allocation is delayed
* otherwise in the case of system crash before the real block
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a43e661..0f776c7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2542,6 +2542,7 @@ const struct inode_operations ext4_dir_inode_operations = {
.mknod = ext4_mknod,
.rename = ext4_rename,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
@@ -2554,6 +2555,7 @@ const struct inode_operations ext4_dir_inode_operations = {

const struct inode_operations ext4_special_inode_operations = {
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c
index ed9354a..d8fe7fb 100644
--- a/fs/ext4/symlink.c
+++ b/fs/ext4/symlink.c
@@ -35,6 +35,7 @@ const struct inode_operations ext4_symlink_inode_operations = {
.follow_link = page_follow_link_light,
.put_link = page_put_link,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
@@ -47,6 +48,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = {
.readlink = generic_readlink,
.follow_link = ext4_follow_link,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,

2010-07-15 02:17:18

Subject: [PATCH 07/18] xstat: NFS: Return extended attributes [ver #6]

Return extended attributes from the NFS filesystem. This includes the
following:

(1) The change attribute as st_data_version if NFSv4.

(2) FS_AUTOMOUNT_FL on referral/submount directories.

Furthermore, what nfs_getattr() does can be controlled as follows:

(1) If AT_FORCE_ATTR_SYNC is indicated, or mtime, ctime or data_version (NFSv4
only) are requested then the outstanding writes will be written to the
server first.

(2) The inode's attributes may be synchronised with the server:

(a) If AT_FORCE_ATTR_SYNC is indicated or if atime is requested (and atime
updating is not suppressed by a mount flag) then the attributes will
be reread unconditionally.

(b) If the data version or any of the basic stats are requested then the
attributes will be reread if the cached attributes have expired.

(c) Otherwise the cached attributes will be used - even if expired -
without reference to the server.
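
The control rules above can be sketched as a pure function, which may help when reasoning about what a given request costs. This is a userspace model and an assumption-laden reading of the diff below, not kernel code: struct nfs_getattr_inputs, struct nfs_getattr_plan and nfs_plan_getattr() are hypothetical names, and atime_invalid stands for NFS_INO_INVALID_ATIME being set in the inode's cache_validity, which the diff also checks before rereading for atime.

```c
#include <stdbool.h>

/* Request bits, copied from the xstat patch in this series. */
#define XSTAT_REQUEST_ATIME		0x00000020ULL
#define XSTAT_REQUEST_MTIME		0x00000040ULL
#define XSTAT_REQUEST_CTIME		0x00000080ULL
#define XSTAT_REQUEST__BASIC_STATS	0x000007ffULL
#define XSTAT_REQUEST_DATA_VERSION	0x00002000ULL

enum nfs_reval {
	NFS_REVAL_NONE,		/* use cached attributes, even if expired */
	NFS_REVAL_IF_EXPIRED,	/* reread only if the cache has expired */
	NFS_REVAL_ALWAYS,	/* reread unconditionally */
};

struct nfs_getattr_inputs {
	unsigned long long request_mask;
	bool force;		/* AT_FORCE_ATTR_SYNC given */
	bool is_regular_file;	/* S_ISREG(i_mode) */
	bool atime_invalid;	/* models NFS_INO_INVALID_ATIME */
	bool atime_suppressed;	/* noatime, or nodiratime on a directory */
	int nfs_version;
};

struct nfs_getattr_plan {
	bool flush_writes;		/* write out dirty pages first */
	enum nfs_reval reval;		/* how to revalidate the inode */
	bool report_data_version;	/* set st_data_version on return */
};

static struct nfs_getattr_plan nfs_plan_getattr(struct nfs_getattr_inputs in)
{
	struct nfs_getattr_plan plan = { false, NFS_REVAL_NONE, false };
	unsigned long long mask = in.request_mask;
	bool need_atime;

	/* The change attribute only exists from NFSv4 onwards. */
	if (in.nfs_version < 4)
		mask &= ~XSTAT_REQUEST_DATA_VERSION;

	/* Rule (1): flush writes so c/mtime and data version are current. */
	if ((in.force || (mask & (XSTAT_REQUEST_MTIME | XSTAT_REQUEST_CTIME |
				  XSTAT_REQUEST_DATA_VERSION))) &&
	    in.is_regular_file)
		plan.flush_writes = true;

	need_atime = in.atime_invalid && (mask & XSTAT_REQUEST_ATIME) &&
		     !in.atime_suppressed;

	/* Rule (2): pick between (a), (b) and (c). */
	if (in.force || (mask & (XSTAT_REQUEST__BASIC_STATS |
				 XSTAT_REQUEST_DATA_VERSION)))
		plan.reval = (in.force || need_atime) ?
			NFS_REVAL_ALWAYS : NFS_REVAL_IF_EXPIRED;

	plan.report_data_version = (mask & XSTAT_REQUEST_DATA_VERSION) != 0;
	return plan;
}
```

Under this model, a caller on NFSv3 that asks only for the data version gets no flush and no revalidation at all, because the request bit is dropped before any of the checks run.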

Signed-off-by: David Howells <[email protected]>
---

fs/nfs/inode.c | 46 ++++++++++++++++++++++++++++++++++------------
1 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 099b351..8c6de96 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -495,11 +495,21 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr)
int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
struct inode *inode = dentry->d_inode;
+ unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
int err;

- /* Flush out writes to the server in order to update c/mtime. */
- if (S_ISREG(inode->i_mode)) {
+ if (NFS_SERVER(inode)->nfs_client->rpc_ops->version < 4)
+ stat->request_mask &= ~XSTAT_REQUEST_DATA_VERSION;
+
+ /* Flush out writes to the server in order to update c/mtime
+ * or data version if the user wants them */
+ if ((force || stat->request_mask & (XSTAT_REQUEST_MTIME |
+ XSTAT_REQUEST_CTIME |
+ XSTAT_REQUEST_DATA_VERSION
+ )) &&
+ S_ISREG(inode->i_mode)
+ ) {
err = filemap_write_and_wait(inode->i_mapping);
if (err)
goto out;
@@ -514,18 +524,30 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
* - NFS never sets MS_NOATIME or MS_NODIRATIME so there is
* no point in checking those.
*/
- if ((mnt->mnt_flags & MNT_NOATIME) ||
- ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
+ if (!(stat->request_mask & XSTAT_REQUEST_ATIME) ||
+ (mnt->mnt_flags & MNT_NOATIME) ||
+ ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
need_atime = 0;

- if (need_atime)
- err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
- else
- err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
- if (!err) {
- generic_fillattr(inode, stat);
- stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+ if (force || stat->request_mask & (XSTAT_REQUEST__BASIC_STATS |
+ XSTAT_REQUEST_DATA_VERSION)
+ ) {
+ if (force || need_atime)
+ err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ else
+ err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ if (err)
+ goto out;
}
+
+ generic_fillattr(inode, stat);
+ stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+
+ if (stat->request_mask & XSTAT_REQUEST_DATA_VERSION) {
+ stat->data_version = NFS_I(inode)->change_attr;
+ stat->result_mask |= XSTAT_REQUEST_DATA_VERSION;
+ }
+
out:
return err;
}
@@ -770,7 +792,7 @@ int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
static int nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
{
struct nfs_inode *nfsi = NFS_I(inode);
-
+
if (mapping->nrpages != 0) {
int ret = invalidate_inode_pages2(mapping);
if (ret < 0)

2010-07-15 02:17:21

Subject: [PATCH 10/18] xstat: Make network filesystems return FS_REMOTE_FL [ver #6]

Make network filesystems return FS_REMOTE_FL in st_inode_flags to xstat().
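
For illustration, a userspace consumer might test the flag roughly as follows. This is only a sketch: struct xstat_flags_view and xstat_is_remote() are hypothetical, the view carries just the two members needed here rather than the full struct xstat, and the constants are copied from the xstat patch in this series.

```c
#include <stdbool.h>

/* Values copied from include/linux/stat.h in the xstat patch. */
#define XSTAT_REQUEST_INODE_FLAGS	0x00004000ULL
#define FS_REMOTE_FL			0x0000000800000000ULL

/* Hypothetical cut-down view of struct xstat for this example. */
struct xstat_flags_view {
	unsigned long long st_result_mask;	/* what results were written */
	unsigned long long st_inode_flags;	/* inode flags */
};

/*
 * st_inode_flags is only copied out when XSTAT_REQUEST_INODE_FLAGS is set
 * in st_result_mask, so check the mask before trusting the flag. Treating
 * "unknown" as "not remote" is a choice made for this sketch.
 */
static bool xstat_is_remote(const struct xstat_flags_view *xs)
{
	if (!(xs->st_result_mask & XSTAT_REQUEST_INODE_FLAGS))
		return false;
	return (xs->st_inode_flags & FS_REMOTE_FL) != 0;
}
```

The same pattern would apply to the other FS_*_FL bits above FS__STANDARD_FL.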

Signed-off-by: David Howells <[email protected]>
---

fs/afs/super.c | 1 +
fs/ceph/super.c | 1 +
fs/cifs/cifsfs.c | 1 +
fs/coda/inode.c | 1 +
fs/ncpfs/inode.c | 1 +
fs/nfs/super.c | 7 +++++++
fs/smbfs/inode.c | 1 +
7 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index e932e5a..daaa3d4 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -40,6 +40,7 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
struct file_system_type afs_fs_type = {
.owner = THIS_MODULE,
.name = "afs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = afs_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = 0,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index fa87f51..f486ac8 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1019,6 +1019,7 @@ static void ceph_kill_sb(struct super_block *s)
static struct file_system_type ceph_fs_type = {
.owner = THIS_MODULE,
.name = "ceph",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = ceph_get_sb,
.kill_sb = ceph_kill_sb,
.fs_flags = FS_RENAME_DOES_D_MOVE,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index ef9a773..eb2c517 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -586,6 +586,7 @@ static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
struct file_system_type cifs_fs_type = {
.owner = THIS_MODULE,
.name = "cifs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = cifs_get_sb,
.kill_sb = kill_anon_super,
/* .fs_flags */
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index d97f993..cb05427 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -308,6 +308,7 @@ static int coda_get_sb(struct file_system_type *fs_type,
struct file_system_type coda_fs_type = {
.owner = THIS_MODULE,
.name = "coda",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = coda_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = FS_BINARY_MOUNTDATA,
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index fa33851..c5892a1 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -1018,6 +1018,7 @@ static int ncp_get_sb(struct file_system_type *fs_type,
static struct file_system_type ncp_fs_type = {
.owner = THIS_MODULE,
.name = "ncpfs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = ncp_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = FS_BINARY_MOUNTDATA,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index f9df16d..2553683 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -251,6 +251,7 @@ static int nfs_remount(struct super_block *sb, int *flags, char *raw_data);
static struct file_system_type nfs_fs_type = {
.owner = THIS_MODULE,
.name = "nfs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs_get_sb,
.kill_sb = nfs_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -259,6 +260,7 @@ static struct file_system_type nfs_fs_type = {
struct file_system_type nfs_xdev_fs_type = {
.owner = THIS_MODULE,
.name = "nfs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs_xdev_get_sb,
.kill_sb = nfs_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -297,6 +299,7 @@ static void nfs4_kill_super(struct super_block *sb);
static struct file_system_type nfs4_fs_type = {
.owner = THIS_MODULE,
.name = "nfs4",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs4_get_sb,
.kill_sb = nfs4_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -305,6 +308,7 @@ static struct file_system_type nfs4_fs_type = {
static struct file_system_type nfs4_remote_fs_type = {
.owner = THIS_MODULE,
.name = "nfs4",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs4_remote_get_sb,
.kill_sb = nfs4_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -313,6 +317,7 @@ static struct file_system_type nfs4_remote_fs_type = {
struct file_system_type nfs4_xdev_fs_type = {
.owner = THIS_MODULE,
.name = "nfs4",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs4_xdev_get_sb,
.kill_sb = nfs4_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -321,6 +326,7 @@ struct file_system_type nfs4_xdev_fs_type = {
static struct file_system_type nfs4_remote_referral_fs_type = {
.owner = THIS_MODULE,
.name = "nfs4",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs4_remote_referral_get_sb,
.kill_sb = nfs4_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
@@ -329,6 +335,7 @@ static struct file_system_type nfs4_remote_referral_fs_type = {
struct file_system_type nfs4_referral_fs_type = {
.owner = THIS_MODULE,
.name = "nfs4",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = nfs4_referral_get_sb,
.kill_sb = nfs4_kill_super,
.fs_flags = FS_RENAME_DOES_D_MOVE|FS_REVAL_DOT|FS_BINARY_MOUNTDATA,
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 9551cb6..e47d86a 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -800,6 +800,7 @@ static int smb_get_sb(struct file_system_type *fs_type,
static struct file_system_type smb_fs_type = {
.owner = THIS_MODULE,
.name = "smbfs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = smb_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = FS_BINARY_MOUNTDATA,

2010-07-15 02:17:20

Subject: [PATCH 09/18] xstat: Make special system filesystems return FS_SPECIAL_FL [ver #6]

Make special system filesystems return FS_SPECIAL_FL in st_inode_flags to
xstat().

Signed-off-by: David Howells <[email protected]>
---

arch/ia64/kernel/perfmon.c | 7 ++++---
arch/powerpc/platforms/cell/spufs/inode.c | 1 +
arch/s390/hypfs/inode.c | 1 +
drivers/infiniband/hw/ipath/ipath_fs.c | 1 +
drivers/infiniband/hw/qib/qib_fs.c | 1 +
drivers/isdn/capi/capifs.c | 1 +
drivers/misc/ibmasm/ibmasmfs.c | 1 +
drivers/mtd/mtdchar.c | 1 +
drivers/oprofile/oprofilefs.c | 1 +
drivers/usb/core/inode.c | 1 +
drivers/usb/gadget/f_fs.c | 1 +
drivers/usb/gadget/inode.c | 1 +
drivers/xen/xenfs/super.c | 1 +
fs/anon_inodes.c | 1 +
fs/binfmt_misc.c | 1 +
fs/configfs/mount.c | 1 +
fs/debugfs/inode.c | 1 +
fs/fuse/control.c | 1 +
fs/hostfs/hostfs_kern.c | 1 +
fs/nfsd/nfsctl.c | 1 +
fs/ocfs2/dlmfs/dlmfs.c | 1 +
fs/openpromfs/inode.c | 1 +
fs/pipe.c | 1 +
fs/proc/root.c | 1 +
fs/sysfs/mount.c | 1 +
ipc/mqueue.c | 1 +
kernel/cgroup.c | 1 +
kernel/cpuset.c | 1 +
net/socket.c | 1 +
net/sunrpc/rpc_pipe.c | 1 +
security/inode.c | 1 +
security/selinux/selinuxfs.c | 1 +
security/smack/smackfs.c | 1 +
33 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index ab985f7..bc96df7 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -626,9 +626,10 @@ pfmfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name,
}

static struct file_system_type pfm_fs_type = {
- .name = "pfmfs",
- .get_sb = pfmfs_get_sb,
- .kill_sb = kill_anon_super,
+ .name = "pfmfs",
+ .inode_flags = FS_SPECIAL_FL,
+ .get_sb = pfmfs_get_sb,
+ .kill_sb = kill_anon_super,
};

DEFINE_PER_CPU(unsigned long, pfm_syst_info);
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index e5e5f82..f7d7c84 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -808,6 +808,7 @@ spufs_get_sb(struct file_system_type *fstype, int flags,
static struct file_system_type spufs_type = {
.owner = THIS_MODULE,
.name = "spufs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = spufs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 6b120f0..01060c6 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -454,6 +454,7 @@ static const struct file_operations hypfs_file_ops = {
static struct file_system_type hypfs_type = {
.owner = THIS_MODULE,
.name = "s390_hypfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = hypfs_get_super,
.kill_sb = hypfs_kill_super
};
diff --git a/drivers/infiniband/hw/ipath/ipath_fs.c b/drivers/infiniband/hw/ipath/ipath_fs.c
index 2fca708..e5bb001 100644
--- a/drivers/infiniband/hw/ipath/ipath_fs.c
+++ b/drivers/infiniband/hw/ipath/ipath_fs.c
@@ -407,6 +407,7 @@ bail:
static struct file_system_type ipathfs_fs_type = {
.owner = THIS_MODULE,
.name = "ipathfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = ipathfs_get_sb,
.kill_sb = ipathfs_kill_super,
};
diff --git a/drivers/infiniband/hw/qib/qib_fs.c b/drivers/infiniband/hw/qib/qib_fs.c
index 844954b..b888a49 100644
--- a/drivers/infiniband/hw/qib/qib_fs.c
+++ b/drivers/infiniband/hw/qib/qib_fs.c
@@ -601,6 +601,7 @@ int qibfs_remove(struct qib_devdata *dd)
static struct file_system_type qibfs_fs_type = {
.owner = THIS_MODULE,
.name = "ipathfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = qibfs_get_sb,
.kill_sb = qibfs_kill_super,
};
diff --git a/drivers/isdn/capi/capifs.c b/drivers/isdn/capi/capifs.c
index 2b83850..1208e4c 100644
--- a/drivers/isdn/capi/capifs.c
+++ b/drivers/isdn/capi/capifs.c
@@ -134,6 +134,7 @@ static int capifs_get_sb(struct file_system_type *fs_type,
static struct file_system_type capifs_fs_type = {
.owner = THIS_MODULE,
.name = "capifs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = capifs_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/drivers/misc/ibmasm/ibmasmfs.c b/drivers/misc/ibmasm/ibmasmfs.c
index 8844a3f..0767280 100644
--- a/drivers/misc/ibmasm/ibmasmfs.c
+++ b/drivers/misc/ibmasm/ibmasmfs.c
@@ -108,6 +108,7 @@ static const struct file_operations *ibmasmfs_dir_ops = &simple_dir_operations;
static struct file_system_type ibmasmfs_type = {
.owner = THIS_MODULE,
.name = "ibmasmfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = ibmasmfs_get_super,
.kill_sb = kill_litter_super,
};
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index 91c8013..acb4ad7 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -986,6 +986,7 @@ static int mtd_inodefs_get_sb(struct file_system_type *fs_type, int flags,

static struct file_system_type mtd_inodefs_type = {
.name = "mtd_inodefs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = mtd_inodefs_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/drivers/oprofile/oprofilefs.c b/drivers/oprofile/oprofilefs.c
index 2766a6d..cf3dab6 100644
--- a/drivers/oprofile/oprofilefs.c
+++ b/drivers/oprofile/oprofilefs.c
@@ -279,6 +279,7 @@ static int oprofilefs_get_sb(struct file_system_type *fs_type,
static struct file_system_type oprofilefs_type = {
.owner = THIS_MODULE,
.name = "oprofilefs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = oprofilefs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/drivers/usb/core/inode.c b/drivers/usb/core/inode.c
index 1a27618..0d942f0 100644
--- a/drivers/usb/core/inode.c
+++ b/drivers/usb/core/inode.c
@@ -586,6 +586,7 @@ static int usb_get_sb(struct file_system_type *fs_type,
static struct file_system_type usb_fs_type = {
.owner = THIS_MODULE,
.name = "usbfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = usb_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/drivers/usb/gadget/f_fs.c b/drivers/usb/gadget/f_fs.c
index d69eccf..63d8f91 100644
--- a/drivers/usb/gadget/f_fs.c
+++ b/drivers/usb/gadget/f_fs.c
@@ -1221,6 +1221,7 @@ ffs_fs_kill_sb(struct super_block *sb)
static struct file_system_type ffs_fs_type = {
.owner = THIS_MODULE,
.name = "functionfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = ffs_fs_get_sb,
.kill_sb = ffs_fs_kill_sb,
};
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index de8a838..cb304b8 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -2127,6 +2127,7 @@ gadgetfs_kill_sb (struct super_block *sb)
static struct file_system_type gadgetfs_type = {
.owner = THIS_MODULE,
.name = shortname,
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = gadgetfs_get_sb,
.kill_sb = gadgetfs_kill_sb,
};
diff --git a/drivers/xen/xenfs/super.c b/drivers/xen/xenfs/super.c
index 8924d93..d1029bf 100644
--- a/drivers/xen/xenfs/super.c
+++ b/drivers/xen/xenfs/super.c
@@ -59,6 +59,7 @@ static int xenfs_get_sb(struct file_system_type *fs_type,
static struct file_system_type xenfs_type = {
.owner = THIS_MODULE,
.name = "xenfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = xenfs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..8820975 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -45,6 +45,7 @@ static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)

static struct file_system_type anon_inode_fs_type = {
.name = "anon_inodefs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = anon_inodefs_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index c4e8353..df727ea 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -715,6 +715,7 @@ static struct linux_binfmt misc_format = {
static struct file_system_type bm_fs_type = {
.owner = THIS_MODULE,
.name = "binfmt_misc",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = bm_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/configfs/mount.c b/fs/configfs/mount.c
index 8c8d642..11107f4 100644
--- a/fs/configfs/mount.c
+++ b/fs/configfs/mount.c
@@ -113,6 +113,7 @@ static int configfs_get_sb(struct file_system_type *fs_type,
static struct file_system_type configfs_fs_type = {
.owner = THIS_MODULE,
.name = "configfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = configfs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 30a87b3..84f09a1 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -144,6 +144,7 @@ static int debug_get_sb(struct file_system_type *fs_type,
static struct file_system_type debug_fs_type = {
.owner = THIS_MODULE,
.name = "debugfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = debug_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 3773fd6..798da59 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -341,6 +341,7 @@ static void fuse_ctl_kill_sb(struct super_block *sb)
static struct file_system_type fuse_ctl_fs_type = {
.owner = THIS_MODULE,
.name = "fusectl",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = fuse_ctl_get_sb,
.kill_sb = fuse_ctl_kill_sb,
};
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 87ac189..07a9973 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -1037,6 +1037,7 @@ static int hostfs_read_sb(struct file_system_type *type,
static struct file_system_type hostfs_type = {
.owner = THIS_MODULE,
.name = "hostfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = hostfs_read_sb,
.kill_sb = kill_anon_super,
.fs_flags = 0,
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 508941c..4e6cf14 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1397,6 +1397,7 @@ static int nfsd_get_sb(struct file_system_type *fs_type,
static struct file_system_type nfsd_fs_type = {
.owner = THIS_MODULE,
.name = "nfsd",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = nfsd_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index b83d610..0bba39f 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -649,6 +649,7 @@ static int dlmfs_get_sb(struct file_system_type *fs_type,
static struct file_system_type dlmfs_fs_type = {
.owner = THIS_MODULE,
.name = "ocfs2_dlmfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = dlmfs_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..8bbc7ce 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -424,6 +424,7 @@ static int openprom_get_sb(struct file_system_type *fs_type,
static struct file_system_type openprom_fs_type = {
.owner = THIS_MODULE,
.name = "openpromfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = openprom_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/fs/pipe.c b/fs/pipe.c
index 279eef9..247efd1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1254,6 +1254,7 @@ static int pipefs_get_sb(struct file_system_type *fs_type,

static struct file_system_type pipe_fs_type = {
.name = "pipefs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = pipefs_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4258384..9c95a19 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -97,6 +97,7 @@ static void proc_kill_sb(struct super_block *sb)

static struct file_system_type proc_fs_type = {
.name = "proc",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = proc_get_sb,
.kill_sb = proc_kill_sb,
};
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 281c0c9..ec2ea3b 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -147,6 +147,7 @@ static void sysfs_kill_sb(struct super_block *sb)

static struct file_system_type sysfs_fs_type = {
.name = "sysfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = sysfs_get_sb,
.kill_sb = sysfs_kill_sb,
};
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c93fd3f..bba10cf 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1232,6 +1232,7 @@ static const struct super_operations mqueue_super_ops = {

static struct file_system_type mqueue_fs_type = {
.name = "mqueue",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = mqueue_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3ac6f5b..71c81b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1619,6 +1619,7 @@ static void cgroup_kill_sb(struct super_block *sb) {

static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = cgroup_get_sb,
.kill_sb = cgroup_kill_sb,
};
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 02b9611..d5e5c56 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -250,6 +250,7 @@ static int cpuset_get_sb(struct file_system_type *fs_type,

static struct file_system_type cpuset_fs_type = {
.name = "cpuset",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = cpuset_get_sb,
};

diff --git a/net/socket.c b/net/socket.c
index 367d547..0aaf9f6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -326,6 +326,7 @@ static struct vfsmount *sock_mnt __read_mostly;

static struct file_system_type sock_fs_type = {
.name = "sockfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = sockfs_get_sb,
.kill_sb = kill_anon_super,
};
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 95ccbcf..1e1d477 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1033,6 +1033,7 @@ rpc_get_sb(struct file_system_type *fs_type,
static struct file_system_type rpc_pipe_fs_type = {
.owner = THIS_MODULE,
.name = "rpc_pipefs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = rpc_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/security/inode.c b/security/inode.c
index 1c812e8..7a904bb 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -139,6 +139,7 @@ static int get_sb(struct file_system_type *fs_type,
static struct file_system_type fs_type = {
.owner = THIS_MODULE,
.name = "securityfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 0293843..9c4bc22 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -1706,6 +1706,7 @@ static int sel_get_sb(struct file_system_type *fs_type,

static struct file_system_type sel_fs_type = {
.name = "selinuxfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = sel_get_sb,
.kill_sb = kill_litter_super,
};
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index a2b72d7..56a8cc1 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -1325,6 +1325,7 @@ static int smk_get_sb(struct file_system_type *fs_type,

static struct file_system_type smk_fs_type = {
.name = "smackfs",
+ .inode_flags = FS_SPECIAL_FL,
.get_sb = smk_get_sb,
.kill_sb = kill_litter_super,
};

2010-07-15 02:17:22

Subject: [PATCH 11/18] xstat: Make automounter filesystems return FS_AUTOMOUNT_FL [ver #6]

Make automounter filesystems return FS_AUTOMOUNT_FL in st_inode_flags to
xstat().

Signed-off-by: David Howells <[email protected]>
---

fs/autofs/init.c | 1 +
fs/autofs4/init.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/autofs/init.c b/fs/autofs/init.c
index cea5219..2c06d4b 100644
--- a/fs/autofs/init.c
+++ b/fs/autofs/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
static struct file_system_type autofs_fs_type = {
.owner = THIS_MODULE,
.name = "autofs",
+ .inode_flags = FS_AUTOMOUNT_FL,
.get_sb = autofs_get_sb,
.kill_sb = autofs_kill_sb,
};
diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 9722e4b..43df431 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
static struct file_system_type autofs_fs_type = {
.owner = THIS_MODULE,
.name = "autofs",
+ .inode_flags = FS_AUTOMOUNT_FL,
.get_sb = autofs_get_sb,
.kill_sb = autofs4_kill_sb,
};

2010-07-15 02:17:24

Subject: [PATCH 12/18] xstat: Add a dentry op to handle automounting rather than abusing follow_link() [ver #6]

Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
inode flag (S_AUTOMOUNT).

This makes it easier to add an AT_ flag to suppress terminal segment automount
during pathwalk. It should also remove the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

I've only changed __follow_mount() to handle automount points, but it might be
necessary to change follow_mount() too. The latter is only used from
follow_dotdot(), but any automounts on ".." should be pinned whilst we're using
a child of it.

Note that autofs4's use of follow_mount() will need examining if this patch is
committed.

Signed-off-by: David Howells <[email protected]>
---

Documentation/filesystems/Locking | 2 +
Documentation/filesystems/vfs.txt | 13 ++++++
fs/namei.c | 85 +++++++++++++++++++++++++++++--------
fs/stat.c | 2 +
include/linux/dcache.h | 5 ++
include/linux/fs.h | 2 +
6 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 96d4293..ccbfa98 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -16,6 +16,7 @@ prototypes:
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
+ struct vfsmount *(*d_automount)(struct path *path);

locking rules:
none have BKL
@@ -27,6 +28,7 @@ d_delete: yes no yes no
d_release: no no no yes
d_iput: no no no yes
d_dname: no no no no
+d_automount: no no no yes

--------------------------- inode_operations ---------------------------
prototypes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94677e7..31a9e8f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -851,6 +851,7 @@ struct dentry_operations {
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *, char *, int);
+ struct vfsmount *(*d_automount)(struct path *);
};

d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -885,6 +886,18 @@ struct dentry_operations {
at the end of the buffer, and returns a pointer to the first char.
dynamic_dname() helper function is provided to take care of this.

+ d_automount: called when an automount dentry is to be traversed (optional).
+ This should create a new VFS mount record, mount it on the directory
+ and return the record to the caller. The caller is supplied with a
+ path parameter giving the automount directory to describe the automount
+ target and the parent VFS mount record to provide inheritable mount
+ parameters. NULL should be returned if someone else managed to make
+ the automount first. If the automount failed, then an error code
+ should be returned.
+
+ This function is only used if S_AUTOMOUNT is set on the inode to which
+ the dentry refers.
+
Example :

static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen)
diff --git a/fs/namei.c b/fs/namei.c
index 868d0cb..fcec3c6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -617,24 +617,71 @@ int follow_up(struct path *path)
return 1;
}

+/*
+ * Perform an automount
+ */
+static int follow_automount(struct path *path, int res)
+{
+ struct vfsmount *mnt;
+
+ if (!path->dentry->d_op || !path->dentry->d_op->d_automount)
+ return -EREMOTE;
+
+ current->total_link_count++;
+ if (current->total_link_count >= 40)
+ return -ELOOP;
+
+ mnt = path->dentry->d_op->d_automount(path);
+ if (IS_ERR(mnt))
+ return PTR_ERR(mnt);
+ if (!mnt) /* mount collision */
+ return 0;
+
+ if (mnt->mnt_sb == path->mnt->mnt_sb &&
+ mnt->mnt_root == path->dentry) {
+ mntput(mnt);
+ return -ELOOP;
+ }
+
+ dput(path->dentry);
+ if (res)
+ mntput(path->mnt);
+ path->mnt = mnt;
+ path->dentry = dget(mnt->mnt_root);
+ return 0;
+}
+
/* no need for dcache_lock, as serialization is taken care in
* namespace.c
*/
-static int __follow_mount(struct path *path)
+static int __follow_mount(struct path *path, unsigned nofollow)
{
- int res = 0;
- while (d_mountpoint(path->dentry)) {
- struct vfsmount *mounted = lookup_mnt(path);
- if (!mounted)
+ struct vfsmount *mounted;
+ int ret, res = 0;
+ for (;;) {
+ while (d_mountpoint(path->dentry)) {
+ if (nofollow)
+ return -ELOOP;
+ mounted = lookup_mnt(path);
+ if (!mounted)
+ break;
+ dput(path->dentry);
+ if (res)
+ mntput(path->mnt);
+ path->mnt = mounted;
+ path->dentry = dget(mounted->mnt_root);
+ res = 1;
+ }
+ if (!d_automount_point(path->dentry))
break;
- dput(path->dentry);
- if (res)
- mntput(path->mnt);
- path->mnt = mounted;
- path->dentry = dget(mounted->mnt_root);
+ if (nofollow)
+ return -ELOOP;
+ ret = follow_automount(path, res);
+ if (ret < 0)
+ return ret;
res = 1;
}
- return res;
+ return 0;
}

static void follow_mount(struct path *path)
@@ -702,6 +749,8 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
struct vfsmount *mnt = nd->path.mnt;
struct dentry *dentry, *parent;
struct inode *dir;
+ int ret;
+
/*
* See if the low-level filesystem might want
* to use its own hash..
@@ -720,8 +769,10 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
done:
path->mnt = mnt;
path->dentry = dentry;
- __follow_mount(path);
- return 0;
+ ret = __follow_mount(path, 0);
+ if (unlikely(ret < 0))
+ path_put(path);
+ return ret;

need_lookup:
parent = nd->path.dentry;
@@ -1721,11 +1772,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (open_flag & O_EXCL)
goto exit_dput;

- if (__follow_mount(path)) {
- error = -ELOOP;
- if (open_flag & O_NOFOLLOW)
- goto exit_dput;
- }
+ error = __follow_mount(path, open_flag & O_NOFOLLOW);
+ if (error < 0)
+ goto exit_dput;

error = -ENOENT;
if (!path->dentry->d_inode)
diff --git a/fs/stat.c b/fs/stat.c
index 89d72fc..bb0f538 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -43,6 +43,8 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
stat->blocks = inode->i_blocks;
stat->blksize = (1 << inode->i_blkbits);
stat->inode_flags = inode->i_sb->s_type->inode_flags;
+ if (IS_AUTOMOUNT(inode))
+ stat->inode_flags |= FS_AUTOMOUNT_FL;
stat->result_mask |= XSTAT_REQUEST__BASIC_STATS & ~XSTAT_REQUEST_RDEV;
if (unlikely(S_ISBLK(stat->mode) || S_ISCHR(stat->mode)))
stat->result_mask |= XSTAT_REQUEST_RDEV;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index eebb617..5380bff 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -139,6 +139,7 @@ struct dentry_operations {
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)(struct dentry *, char *, int);
+ struct vfsmount *(*d_automount)(struct path *);
};

/* the dentry parameter passed to d_hash and d_compare is the parent
@@ -157,6 +158,7 @@ d_compare: no yes yes no
d_delete: no yes no no
d_release: no no no yes
d_iput: no no no yes
+d_automount: no no no yes
*/

/* d_flags entries */
@@ -389,6 +391,9 @@ static inline int d_mountpoint(struct dentry *dentry)
return dentry->d_mounted;
}

+#define d_automount_point(dentry) \
+ (dentry->d_inode && IS_AUTOMOUNT(dentry->d_inode))
+
extern struct vfsmount *lookup_mnt(struct path *);
extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 951c36b..579ad9d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -235,6 +235,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_AUTOMOUNT 1024 /* Automount/referral quasi-directory */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -269,6 +270,7 @@ struct inodes_stat_t {
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
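
As a rough illustration of the d_automount contract described in this patch (a valid mount is stepped onto, NULL means someone else won the race, an error pointer aborts the walk), here is a user-space model. It is only a sketch: struct mount_model, struct walk_model, follow_automount_model() and the three example callbacks are invented names, the ERR_PTR helpers are stand-ins for the kernel ones, and the "mounted on itself" test is reduced to a pointer comparison rather than the superblock/root check in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical models of a VFS mount and of the state of a path walk. */
struct mount_model { const char *name; };
struct walk_model {
	struct mount_model *mnt;	/* mount the walk currently stands on */
	int total_link_count;		/* models current->total_link_count */
};

typedef struct mount_model *(*automount_fn)(struct walk_model *walk);

#define MAX_FOLLOWS 40			/* same limit as in the patch */

/* Model of follow_automount(): returns 0 or a negative errno. */
int follow_automount_model(struct walk_model *walk, automount_fn d_automount)
{
	struct mount_model *mnt;

	assert(walk != NULL);
	if (!d_automount)		/* no op supplied by the filesystem */
		return -EREMOTE;

	walk->total_link_count++;
	if (walk->total_link_count >= MAX_FOLLOWS)
		return -ELOOP;

	mnt = d_automount(walk);
	if (IS_ERR(mnt))		/* automount failed */
		return (int)PTR_ERR(mnt);
	if (!mnt)			/* mount collision: nothing to do */
		return 0;
	if (mnt == walk->mnt)		/* simplified "mounted on itself" check */
		return -ELOOP;

	walk->mnt = mnt;		/* step onto the new mount */
	return 0;
}

/* Example callbacks covering the three possible results. */
static struct mount_model remote_mount = { "remote" };

struct mount_model *automount_ok(struct walk_model *walk)
{
	(void)walk;
	return &remote_mount;
}

struct mount_model *automount_collision(struct walk_model *walk)
{
	(void)walk;
	return NULL;
}

struct mount_model *automount_fail(struct walk_model *walk)
{
	(void)walk;
	return ERR_PTR(-ENOENT);
}
```

The model leaves out reference counting (dput/mntput/dget) entirely; it is meant only to show how the caller tells the three outcomes apart.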

2010-08-08 12:12:11

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 06, 2010 at 01:38:36PM +1000, Neil Brown wrote:
> I'm curious. Why do you particularly care what interface the kernel uses to
> provide you with access to this attribute?

It's a matter of taste. The *BSDs have this right IMHO. It
should be part of the stat information. A file timestamp is not
an EA. Making it available that way just feels like an appallingly
tasteless kludge. It offends the artist in me :-).

> Or do you really want something like BSD's 'btime' which as I understand it
> cannot be set. Would that be really useful to you?

It is *already* useful to us, and is widely used in
existing code. The occasions when btime is set are
relatively rare, and at that point we store it in a
separate EA for Windows reporting purposes.

Jeremy.

2010-08-06 23:31:17

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 5 Aug 2010 22:55:06 -0500
Steve French <[email protected]> wrote:

> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
> > On Thu, 5 Aug 2010 16:52:18 -0700
> > Jeremy Allison <[email protected]> wrote:
>
> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
> >
> > I'm curious. Why do you particularly care what interface the kernel uses to
> > provide you with access to this attribute?
> >
> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
> > would seem to be an extension - an extended attribute.
> > As the Linux kernel does virtually nothing with this attribute except provide
> > access, it seems to be a very different class of thing to other timestamps.
> > Surely it is simply some storage associated with a file which is capable of
> > storing a timestamp, which can be set or retrieved by an application, and
> > which happens to be initialised to the current time when a file is created.
> >
> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
> > user-settable metadata. Sounds like an EA to me.
> >
> > Or do you really want something like BSD's 'btime' which as I understand it
> > cannot be set. Would that be really useful to you?
>
> Obviously the cifs and SMB2 protocols which Samba server support can
> ask the server to set the create time of a file (this is handled
> through xattrs today along with the "dos attribute" flags such as
> archive/hidden/system), but certainly it is much more common (and
> important) to read the creation time of an existing file.
>

Just a point of clarification - when you say it is common and important to be
able to read the creation time on an existing file, are you still talking in
the context of cifs/smb windows compatibility, or are you talking in the
broader context?
If you are referring to a broader context, could you please give more details,
because I have not heard any mention of any real value of creation-time
outside of windows interoperability - having such a use clearly documented
would assist the conversation I think.

If on the other hand you are just referring to the windows interoperability
context ... given that you have to read an EA if the create-time has been
changed, you will always have to read an EA so having something else is
pointless ... or I'm missing something.

>
> > Is there something important that I am missing?
>
> It is another syscall that Samba server would have to make - and xattr
> performance is extremely slow on some file systems (although
> presumably this one would be more likely to be stored in inode and
> perhaps not as bad on ext4, cifs and a few others such as ntfs).
>
>

Obviously if we were to make xattrs the preferred way to get create time out
of the filesystem we would want to make sure it is efficient.
It would seem to make perfect sense to add a 'getxattrat' syscall and allow an
AT_NONBLOCK flag (which would probably be useful for statat too). The
AT_NONBLOCK flag would only get attributes if they were available immediately
without going to storage/network/whatever.

And if it is simply a case of too many syscalls per file, then
getxattrat_multi would seem to be the most general way to go.

NeilBrown

2010-08-02 14:09:51

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

> Furthermore, I'll go ahead and propose the following (simple) semantics:
>
> 1) birthtime is initialized to the current time when a new inode is
> created
>
> 2) it's settable via the xattr to an arbitrary value
>
> Either way, the xattr for this ought to be named the same on all
> filesystems. Samba shouldn't need to know or care what the underlying
> filesystem is, as long as it presents the correct xattr.
>
> That should make samba happy, and be reasonably simple to implement.

Is there any reason to allow birthtime to be set in advance of the
current birthtime?

ie restore / copy tools clearly need to backdate it, but I'd prefer to
see it not advance-able.

Greg
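
For illustration only, the "backdate but never advance" rule asked about above could be expressed as a simple timestamp comparison. Nothing like this appears in the posted patches; ts_cmp() and birthtime_change_allowed() are hypothetical names, and the sketch assumes normalised timespec values (tv_nsec in 0..999999999).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Returns <0, 0 or >0 as a is earlier than, equal to or later than b. */
int ts_cmp(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * Hypothetical policy check: a new birthtime is accepted only if it is
 * not later than the one currently recorded, so restore/copy tools can
 * backdate a file but nobody can move its birthtime forward.
 */
bool birthtime_change_allowed(const struct timespec *current,
			      const struct timespec *proposed)
{
	assert(current != NULL && proposed != NULL);
	return ts_cmp(proposed, current) <= 0;
}
```

A check of this kind compares against the stored birthtime rather than the wall clock, so it would not by itself depend on client and server clocks agreeing.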

2010-08-01 13:15:04

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, 30 Jul 2010 14:11:46 -0400
Trond Myklebust <[email protected]> wrote:

> On Fri, 2010-07-30 at 13:55 -0400, Phil Pishioneri wrote:
> > On 7/22/10 2:59 PM, Trond Myklebust wrote:
> > > The fact remains that most of us would be hard pressed to name an
> > > application
> >
> > Microsoft Office?
> >
> > > that requires you to share the same dataset to both
> > > Windows/CIFS and posix NFS clients.
> >
> > NFS client: Mac OS X (NFSv3, since v4 on it is still alpha *cough*).
> >
> > > tends to discourage mixing the two environments.
> >
> > Or is "discourage" not strong enough term to describe that we shouldn't
> > be doing this?
> >
> > -Phil
>
> Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
> They both interoperate just fine with Samba, and would presumably
> continue to do so if someone were to decide to reuse the ctime field on
> your Samba box as storage for a create time.
>
> Trond
>

It's not so much particular applications that require access to the
same data via NFS and CIFS. There is, however, a common desire to share
the same data with different client OSes.

All of the unix CIFS clients that I know of (including Linux's) trail
NFS in several areas. For instance, if you need to have the same data
accessible by multiple users using their own credentials then you need
multiple mounts.

--
Jeff Layton <[email protected]>

2010-08-13 19:19:05

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 13, 2010 at 09:06:28PM +0200, Jan Engelhardt wrote:
>
> On Friday 2010-08-13 19:54, Jeremy Allison wrote:
> >On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
> >> On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> >> > We don't need to ape Windows in everything.
> >> > The coming ACL disaster will show that (we will go from an ACL
> >> > model that is slightly too complex to use, to one that is impossibly
> >> > complex to use :-).
> >>
> >> Care to elaborate?
> >
> >POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
> >people are asking for this. But Windows ACLs are a nightmare
> >beyond human comprehension :-). In the "too complex to be
> >usable" camp.
>
> Well, for one, ACLs in NT can be recursive IIRC. You can't say that of Linux
> ACLs - instead you have to setfacl -R and setfacl -Rd to give one user access
> to a directory and all its subdirs including future new inodes.

You do realize that Windows does exactly the same thing under
the covers, right ? Watch SMB or SMB2 traffic between a client
and Windows server when someone changes an ACL sometime :-).

Jeremy

2010-08-13 19:06:32

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Friday 2010-08-13 19:54, Jeremy Allison wrote:
>On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
>> On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
>> > We don't need to ape Windows in everything.
>> > The coming ACL disaster will show that (we will go from an ACL
>> > model that is slightly too complex to use, to one that is impossibly
>> > complex to use :-).
>>
>> Care to elaborate?
>
>POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
>people are asking for this. But Windows ACLs are a nightmare
>beyond human comprehension :-). In the "too complex to be
>usable" camp.

Well, for one, ACLs in NT can be recursive IIRC. You can't say that of Linux
ACLs - instead you have to setfacl -R and setfacl -Rd to give one user access
to a directory and all its subdirs including future new inodes.

2010-08-08 13:05:05

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sun, Aug 08, 2010 at 08:53:01AM -0400, Jeff Layton wrote:
>
> It would be more convenient if this were part of stat() but adding a
> new stat call is non-trivial. Even if we did that, it still doesn't
> solve the problem of being able to set the create time. The fact that
> that's rarely done doesn't really matter much -- we ought to shoot for
> the semantics that are needed to handle this properly.

*BSD didn't. They just added something that was useful to UNIX.
I'd be happy with that. We don't need to ape Windows in everything.
The coming ACL disaster will show that (we will go from an ACL
model that is slightly too complex to use, to one that is impossibly
complex to use :-).

> If that's the case, don't you have to query for this EA every time you
> need to return the create time anyway? If so, then doing this really
> isn't any more costly -- you'd just be querying a different EA, right?

No, we'd be querying an additional EA. The EA we query contains
the DOS attributes as well as the create time.

Jeremy.

2010-08-02 14:39:52

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Mon, 2 Aug 2010 10:09:49 -0400
Greg Freemyer <[email protected]> wrote:

> > Furthermore, I'll go ahead and propose the following (simple) semantics:
> >
> > 1) birthtime is initialized to the current time when a new inode is
> > created
> >
> > 2) it's settable via the xattr to an arbitrary value
> >
> > Either way, the xattr for this ought to be named the same on all
> > filesystems. Samba shouldn't need to know or care what the underlying
> > filesystem is, as long as it presents the correct xattr.
> >
> > That should make samba happy, and be reasonably simple to implement.
>
> Is there any reason to allow birthtime to be set in advance of the
> current birthtime?
>
> ie restore / copy tools clearly need to backdate it, but I'd prefer to
> see it not advance-able.
>
> Greg

Why not? Is there a good argument for prohibiting it? We allow people
to set mtime in the future. Why not allow the same semantics here?

We also have to consider that this may eventually be settable via
networked filesystem interfaces. If the client and server don't have
synchronized clocks, you may end up with the client getting an error
back in some cases if you don't allow it.

--
Jeff Layton <[email protected]>

2010-08-01 13:23:10

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, 30 Jul 2010 23:22:58 +0200
utz lehmann <[email protected]> wrote:

> On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
> > But the fact is, the Unix ctime semantics are insane and largely
> > useless. There's a damn good reason almost nobody uses ctime under
> > unix.
> >
> > So what I'm suggesting is that we have a flag - either per-process or
> > per-mount - that just says "use windows semantics for ctime".
>
> When abusing an existing time stamp use atime not ctime please.
> ctime has its uses. atime was just a mistake and is nearly useless.
>
> And with noatime we already have creation time semantics for atime.
>

Ugh. Honestly all of this talk of abusing different time fields seems
like craziness to me. It's going to be very hard to do that without
breaking *something*. There's also very little reason to do this when
xattrs are a much cleaner approach.

Neil Brown has put forth a very reasoned justification for putting the
birthtime in an xattr. After reading it, I think that makes more sense
than anything. It's also something that can be done without any extra
infrastructure. If at some point in the future we get an xstat-like
syscall then we can always add birthtime to that as well.

Ditto for the other fields under discussion (i_generation and the like).

--
Jeff Layton <[email protected]>

2010-08-01 13:37:46

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 29 Jul 2010 09:04:01 +1000
Neil Brown <[email protected]> wrote:

> On Wed, 28 Jul 2010 18:28:02 +0100
> David Howells <[email protected]> wrote:
>
> > Neil Brown <[email protected]> wrote:
> >
> > > ctime and mtime have real cache-coherence semantics which require them being
> > > updated by the kernel (whether the cache is on an NFS client, in a backup
> > > archive, or in a .o translation of a .c file).
> >
> > So does creation time, at least for CIFS caching. Creation time has potential
> > for spotting when the object at a pathname has changed for something else,
> > given the lack of inode number and inode generation from windows servers.
> > Creation time gives us one more datum to use.
>
> This justifies for me why a CIFS client would want to extract the
> creation-time from the CIFS protocol, but not why you want to expose it via a
> generic interface.
> The kernel/filesystem doesn't need to maintain creation-time to meet this
> need, only the CIFS server needs to maintain it - the kernel/filesystem just
> needs to provide somewhere to store it - xattrs.
>
> Given that we have an extensible attribute framework, it seems wrong to be
> adding new attributes to *stat. If a given filesystem wants to store certain
> attributes more efficiently, then it is welcome to intercept xattr calls and
> store (say) "cifs.birthtime" directly at a known offset in the inode.
>

The problem with the above approach is that you're assuming that the
data in question is always accessed via the CIFS server. If someone
comes along and messes with the data outside of CIFS, then samba won't
have knowledge of that and the birthtime will be wrong.

There's some history behind this as well -- samba tracks windows ACLs
via xattr and it can be very problematic keeping those up to date when
the data is accessed outside of samba.

I think presenting this data via xattr makes the most sense. It's
simple and, as Neil points out, it also provides us with a clearly
settable interface. If we ever get an xstat-like syscall, we can always
present the same data via that as well.

I also think it's quite reasonable to consider tracking birthtime in a
generic inode field. In the absence of that, filesystems could track
this themselves in their filesystem-specific inode structs.

Furthermore, I'll go ahead and propose the following (simple) semantics:

1) birthtime is initialized to the current time when a new inode is
created

2) it's settable via the xattr to an arbitrary value

Either way, the xattr for this ought to be named the same on all
filesystems. Samba shouldn't need to know or care what the underlying
filesystem is, as long as it presents the correct xattr.

That should make samba happy, and be reasonably simple to implement.
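
As a rough userspace illustration of semantic (2) above, here is a minimal
sketch in Python. It is not kernel code. The attribute name "user.birthtime"
and the 12-byte layout are assumptions made only for this example; the thread
does not settle on either. Semantic (1), initialisation at inode creation,
would have to be done by the filesystem and is not shown.

```python
import os
import struct

# Hypothetical attribute name; no name has been agreed on in this thread.
BIRTHTIME_XATTR = "user.birthtime"

# Assumed layout: signed 64-bit seconds, unsigned 32-bit nanoseconds,
# little-endian, no padding (12 bytes in total).
_LAYOUT = struct.Struct("<qI")


def encode_birthtime(seconds, nanoseconds=0):
    """Pack a timestamp into the assumed xattr value format."""
    return _LAYOUT.pack(seconds, nanoseconds)


def decode_birthtime(raw):
    """Unpack an xattr value into (seconds, nanoseconds)."""
    return _LAYOUT.unpack(raw)


def set_birthtime(path, seconds, nanoseconds=0):
    """Store an arbitrary birthtime on path (Linux only)."""
    os.setxattr(path, BIRTHTIME_XATTR, encode_birthtime(seconds, nanoseconds))


def get_birthtime(path):
    """Return (seconds, nanoseconds), or None if the xattr is unavailable."""
    try:
        return decode_birthtime(os.getxattr(path, BIRTHTIME_XATTR))
    except OSError:
        return None
```

A server such as samba would call get_birthtime() once per file and would not
need to know which filesystem sits underneath, provided the name is the same
everywhere.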

--
Jeff Layton <[email protected]>

2010-08-06 03:38:54

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 5 Aug 2010 16:52:18 -0700
Jeremy Allison <[email protected]> wrote:

> On Sun, Aug 01, 2010 at 09:25:29AM -0400, Jeff Layton wrote:
> > On Fri, 30 Jul 2010 23:22:58 +0200
> > utz lehmann <[email protected]> wrote:
> >
> > > On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
> > > > But the fact is, th Unix ctime semantics are insane and largely
> > > > useless. There's a damn good reason almost nobody uses ctime under
> > > > unix.
> > > >
> > > > So what I'm suggesting is that we have a flag - either per-process or
> > > > per-mount - that just says "use windows semantics for ctime".
> > >
> > > When abusing an existing time stamp use atime not ctime please.
> > > ctime has it's uses. atime was just a mistake and is nearly useless.
> > >
> > > And with noatime we already have creation time semantics for atime.
> > >
> >
> > Ugh. Honestly all of this talk of abusing different time fields seems
> > like craziness to me. It's going to be very hard to do that without
> > breaking *something*. There's also very little reason to do this when
> > xattrs are a much cleaner approach.
> >
> > Neil Brown has put forth a very reasoned justification for putting the
> > birthtime in an xattr. After reading it, I think that makes more sense
> > than anything. It's also something that can be done without any extra
> > infrastructure. If at some point in the future we get an xstat-like
> > syscall then we can always add birthtime to that as well.
>
> Just my 2 cents (as a Samba server implementor). I *hate* the idea
> of adding a "virtual" EA for birthtime. If you're going to add it,
> just add it to the stat struct like *BSD does. Don't abuse the other
> time fields, it's a new one.
>
> Jeff, please don't advocate for an EA for the Samba server to use.
> Don't add it as an EA. It's *not* an EA, it's a timestamp.

I'm curious. Why do you particularly care what interface the kernel uses to
provide you with access to this attribute?

And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
would seem to be an extension - an extended attribute.
As the Linux kernel does virtually nothing with this attribute except provide
access, it seems to be a very different class of thing to other timestamps.
Surely it is simply some storage associated with a file which is capable of
storing a timestamp, which can be set or retrieved by an application, and
which happens to be initialised to the current time when a file is created.

Yes, to you it is a timestamp. But to Linux it is a few bytes of
user-settable metadata. Sounds like an EA to me.

Or do you really want something like BSD's 'btime' which, as I understand it,
cannot be set? Would that be really useful to you?

Is there something important that I am missing?

NeilBrown

2010-08-13 18:09:38

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 13, 2010 at 12:54 PM, Jeremy Allison <[email protected]> wrote:
> On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
>> On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
>> > We don't need to ape Windows in everything.
>> > The coming ACL disaster will show that (we will go from an ACL
>> > model that is slightly too complex to use, to one that is impossibly
>> > complex to use :-).
>>
>> Care to elaborate?
>
> POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
> people are asking for this. But Windows ACLs are a nightmare
> beyond human comprehension :-). In the "too complex to be
> usable" camp.

Not much choice - even community colleges now have
to teach students about this ACL model in their sysadmin courses.

>> And what would native ACL support mean for Samba?
>
> RichACLs'll do it, but I feel sorry for the admins :-).

Yes - RichACLs and Windows ACLs allow you to set
some strange combinations of permission bits.
RichACLs will make a more natural mapping for
Samba and NFSv4 - and it is far too late to
remove the requirement for Windows and
MacOS (among other clients) support.

--
Thanks,

Steve

2010-08-06 23:58:43

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 6, 2010 at 6:30 PM, Neil Brown <[email protected]> wrote:
> On Thu, 5 Aug 2010 22:55:06 -0500
> Steve French <[email protected]> wrote:
>
>> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
>> > On Thu, 5 Aug 2010 16:52:18 -0700
>> > Jeremy Allison <[email protected]> wrote:
>>
>> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
>> >
>> > I'm curious. Why do you particularly care what interface the kernel uses to
>> > provide you with access to this attribute?
>> >
>> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
>> > would seem to be an extension - an extended attribute.
>> > As the Linux kernel does virtually nothing with this attribute except provide
>> > access, it seems to be a very different class of thing to other timestamps.
>> > Surely it is simply some storage associated with a file which is capable of
>> > storing a timestamp, which can be set or retrieved by an application, and
>> > which happens to be initialised to the current time when a file is created.
>> >
>> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
>> > user-settable metadata. Sounds like an EA to me.
>> >
>> > Or do you really want something like BSD's 'btime' which as I understand it
>> > cannot be set. Would that be really useful to you?
>>
>> Obviously the cifs and SMB2 protocols which Samba server support can
>> ask the server to set the create time of a file (this is handled
>> through xattrs today along with the "dos attribute" flags such as
>> archive/hidden/system), but certainly it is much more common (and
>> important) to read the creation time of an existing file.
>>
>
> Just a point of clarification - when you say it is common and important to be
> able to read the creation time on an existing file, and you still talking in
> the context of cifs/smb windows compatibility, or are you talking in the
> broader context?
> If you are referring to a broader context could be please give more details
> because I have not heard any mention of any real value of creation-time out
> side of window interoperability - have such a use clearly documented would
> assist the conversation I think.
>
> If on the other hand you are just referring the the windows interoperability
> context ... given that you have to read an EA if the create-time has been
> changed, you will always have to read and EA so having something else is
> pointless ... or I'm missing something.

There are other cases, less common than cifs and smb2. One
that comes to mind is NFS version 4, but there are a few other
cases that I have heard of (backup/archive applications).
The RFC recommends that servers return attribute 50 (creation
time). See the text below:

time_create  50  nfstime4  R/W  The time of creation of the object.
                                This attribute does not have any relation
                                to the traditional UNIX file attribute
                                "ctime" or "change time".

--
Thanks,

Steve

2010-08-07 00:29:16

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, 6 Aug 2010 18:58:42 -0500
Steve French <[email protected]> wrote:

> On Fri, Aug 6, 2010 at 6:30 PM, Neil Brown <[email protected]> wrote:
> > On Thu, 5 Aug 2010 22:55:06 -0500
> > Steve French <[email protected]> wrote:
> >
> >> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
> >> > On Thu, 5 Aug 2010 16:52:18 -0700
> >> > Jeremy Allison <[email protected]> wrote:
> >>
> >> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
> >> >
> >> > I'm curious. Why do you particularly care what interface the kernel uses to
> >> > provide you with access to this attribute?
> >> >
> >> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
> >> > would seem to be an extension - an extended attribute.
> >> > As the Linux kernel does virtually nothing with this attribute except provide
> >> > access, it seems to be a very different class of thing to other timestamps.
> >> > Surely it is simply some storage associated with a file which is capable of
> >> > storing a timestamp, which can be set or retrieved by an application, and
> >> > which happens to be initialised to the current time when a file is created.
> >> >
> >> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
> >> > user-settable metadata. Sounds like an EA to me.
> >> >
> >> > Or do you really want something like BSD's 'btime' which as I understand it
> >> > cannot be set. Would that be really useful to you?
> >>
> >> Obviously the cifs and SMB2 protocols which Samba server support can
> >> ask the server to set the create time of a file (this is handled
> >> through xattrs today along with the "dos attribute" flags such as
> >> archive/hidden/system), but certainly it is much more common (and
> >> important) to read the creation time of an existing file.
> >>
> >
> > Just a point of clarification - when you say it is common and important to be
> > able to read the creation time on an existing file, and you still talking in
> > the context of cifs/smb windows compatibility, or are you talking in the
> > broader context?
> > If you are referring to a broader context could be please give more details
> > because I have not heard any mention of any real value of creation-time out
> > side of window interoperability - have such a use clearly documented would
> > assist the conversation I think.
> >
> > If on the other hand you are just referring the the windows interoperability
> > context ... given that you have to read an EA if the create-time has been
> > changed, you will always have to read and EA so having something else is
> > pointless ... or I'm missing something.
>
> There are other cases, less common than cifs and smb2. One
> that comes to mind is NFS version 4, but there are a few other
> cases that I have heard of (backup/archive applications).
> The RFC recommends that servers return attribute 50 (creation
> time). See below text:
>
> time_create  50  nfstime4  R/W  The time of creation of the object.
>                                 This attribute does not have any relation
>                                 to the traditional UNIX file attribute
>                                 "ctime" or "change time".

I really don't think NFSv4 is a separate justification. I'm fairly sure
that attribute was only included in NFSv4 for enhanced Windows
compatibility (windows interoperation was a big issue during the protocol
development).

That leaves hypothetical "backup/archive applications". Do you have a
concrete example? Or we are left with just various flavours of Windows
compatibility (not that I have a problem with Windows compatibility, but if
that is the only reason that we have creation-time then I think it is
important to be clear and open about that).

NeilBrown

2010-08-07 03:33:00

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, 6 Aug 2010 21:54:49 -0500
Steve French <[email protected]> wrote:

> On Fri, Aug 6, 2010 at 9:42 PM, Steve French <[email protected]> wrote:
> > On Fri, Aug 6, 2010 at 7:29 PM, Neil Brown <[email protected]> wrote:
> >> On Fri, 6 Aug 2010 18:58:42 -0500
> >> Steve French <[email protected]> wrote:
> >>
> >>> On Fri, Aug 6, 2010 at 6:30 PM, Neil Brown <[email protected]> wrote:
> >>> > On Thu, 5 Aug 2010 22:55:06 -0500
> >>> > Steve French <[email protected]> wrote:
> >>> >
> >>> >> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
> >>> >> > On Thu, 5 Aug 2010 16:52:18 -0700
> >>> >> > Jeremy Allison <[email protected]> wrote:
> >>> >>
> >>> >> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
> >>> >> >
> >>> >> > I'm curious. Why do you particularly care what interface the kernel uses to
> >>> >> > provide you with access to this attribute?
> >>> >> >
> >>> >> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
> >>> >> > would seem to be an extension - an extended attribute.
> >>> >> > As the Linux kernel does virtually nothing with this attribute except provide
> >>> >> > access, it seems to be a very different class of thing to other timestamps.
> >>> >> > Surely it is simply some storage associated with a file which is capable of
> >>> >> > storing a timestamp, which can be set or retrieved by an application, and
> >>> >> > which happens to be initialised to the current time when a file is created.
> >>> >> >
> >>> >> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
> >>> >> > user-settable metadata. Sounds like an EA to me.
> >>> >> >
> >>> >> > Or do you really want something like BSD's 'btime' which as I understand it
> >>> >> > cannot be set. Would that be really useful to you?
> >>> >>
> >>> >> Obviously the cifs and SMB2 protocols which Samba server support can
> >>> >> ask the server to set the create time of a file (this is handled
> >>> >> through xattrs today along with the "dos attribute" flags such as
> >>> >> archive/hidden/system), but certainly it is much more common (and
> >>> >> important) to read the creation time of an existing file.
> >>> >>
> >>> >
> >>> > Just a point of clarification - when you say it is common and important to be
> >>> > able to read the creation time on an existing file, and you still talking in
> >>> > the context of cifs/smb windows compatibility, or are you talking in the
> >>> > broader context?
> >>> > If you are referring to a broader context could be please give more details
> >>> > because I have not heard any mention of any real value of creation-time out
> >>> > side of window interoperability - have such a use clearly documented would
> >>> > assist the conversation I think.
> >>> >
> >>> > If on the other hand you are just referring the the windows interoperability
> >>> > context ... given that you have to read an EA if the create-time has been
> >>> > changed, you will always have to read and EA so having something else is
> >>> > pointless ... or I'm missing something.
> >>>
> >>> There are other cases, less common than cifs and smb2. One
> >>> that comes to mind is NFS version 4, but there are a few other
> >>> cases that I have heard of (backup/archive applications).
> >>> The RFC recommends that servers return attribute 50 (creation
> >>> time). See below text:
> >>>
> >>> time_create  50  nfstime4  R/W  The time of creation of the object.
> >>>                                 This attribute does not have any relation
> >>>                                 to the traditional UNIX file attribute
> >>>                                 "ctime" or "change time".
> >>
> >> I really don't think NFSv4 is a separate justification. I'm fairly sure
> >> that attribute was only including in NFSv4 for enhanced Windows
> >> compatibility (windows interoperation was a big issue during the protocol
> >> development).
> >
> > Perhaps also useful for MacOS (and other BSD), not just Windows,
> > although MacOS may use cifs more often than nfs.
>
> >> That leaves hypothetical "backup/archive applications".
> >> Do you have a concrete example?
>
> A quick search for backup applications in Wikipedia came up with a
> reference fairly easily (to backup app which uses creation
> time) for Linux:
>
> http://www.aqualab.cs.northwestern.edu/publications/Cornell04VFS.html

That publication seems to mention 'creation time' only as an abstract concept.
The backup architecture keeps a history of the file all the way back to its
"creation time".
It doesn't appear to need or use a 'creation time' attribute stored with any
file.

>
> Presumably Windows compat. is a stronger motivation, than BSD/MacOS
> NFSv4 (returning birth time) compat, and backup applications
> are a lesser motivations. There may also be some value in using creation
> time as a generation number where no generation number is
> available.
>
> Intuitively seems like creation time would be as "useful" as ctime (and probably
> more so) to app developers ... but that is hard to prove.
>

I agree, it does seem like an intuitively valuable number - after all we each
have a birthday which we are very aware of and often make use of. It is
often treated as part of our identity - just like you were mentioning that
the CIFS client uses creation-time to help identify files which lack the
'inode number' identifier that is the common tool in Unix and derivatives.

But I'm not convinced that it is *practically* useful. The only practical
use beyond windows-compatibility that has been mentioned is a stronger
'identity' tag. However inode+generation number, or "file-handle-fragment"
are better things to use for identifying a file than "creation time",
especially when the latter is settable.
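
To make the identity point concrete, here is a minimal sketch of how a
userspace application typically decides whether two paths name the same
file, using the device and inode numbers that stat() already returns. The
helper names are invented for this example. Note the limitation: stat() does
not expose an inode generation number, so this check alone cannot detect an
inode number being reused after a delete.

```python
import os


def file_identity(path):
    """Return the (device, inode) pair that identifies a file on Unix."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_file(path_a, path_b):
    """True if both paths currently refer to the same inode."""
    return file_identity(path_a) == file_identity(path_b)
```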

So if we were to add something for native applications to use, I doubt that
it would be 'creation time' (but I'm still open to hearing a convincing
use-case).

So we are left with an attribute that is needed for windows compatibility,
and so just needs to be understood by samba and wine. Some filesystems might
support it efficiently, others might require the use of generic
extended-attributes, still others might not support it at all (I guess you
store it in some 'tdb' and hope it works well enough).

Core-linux doesn't really need to know about this - there just needs to be a
channel to pass it between samba/wine and the filesystem. xattr still seems
the best mechanism to pass this stuff around. Team-samba can negotiate with
fs developers to optimise/accelerate certain attributes, and linux-VFS
doesn't need to know or care (except maybe to provide generic non-blocking or
multiple-access interfaces).

What is 'creation time' used for in the windows world??? Maybe there really
is something valuable here that we are missing....

NeilBrown

2010-08-13 17:54:17

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
> On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> > We don't need to ape Windows in everything.
> > The coming ACL disaster will show that (we will go from an ACL
> > model that is slightly too complex to use, to one that is impossibly
> > complex to use :-).
>
> Care to elaborate?

POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
people are asking for this. But Windows ACLs are a nightmare
beyond human comprehension :-). In the "too complex to be
usable" camp.

> And what would native ACL support mean for Samba?

RichACLs'll do it, but I feel sorry for the admins :-).

Jeremy.

2010-08-07 11:04:40

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sat, 7 Aug 2010 06:34:00 -0400
Jeff Layton <[email protected]> wrote:

> On Sat, 7 Aug 2010 13:32:40 +1000
> Neil Brown <[email protected]> wrote:
>
> > So we are left with an attribute that is needed for windows compatibility,
> > and so just needs to be understood by samba and wine. Some filesystems might
> > support it efficiently, others might require the use of generic
> > extended-attributes, still others might not support it at all (I guess you
> > store it in some 'tdb' and hope it works well enough).
> >
> > Core-linux doesn't really need to know about this - there just needs to be a
> > channel to pass it between samba/wine and the filesystem. xattr still seems
> > the best mechanism to pass this stuff around. Team-samba can negotiate with
> > fs developers to optimise/accelerate certain attributes, and linux-VFS
> > doesn't need to know or care (except maybe to provide generic non-blocking or
> > multiple-access interfaces).
> >
>
> IIUC, you're saying that we should basically just have samba stuff the
> current time into an xattr when it creates the file and leave the
> filesystems alone. If so, I disagree here.

I'm not quite saying that (though there is a temptation). Some attributes
are initialised by the filesystem rather than by common code. i_uid is a
simple example. I have no problem with the filesystem initialising the
storage that is used for this well-known-EA to the current time at creation.
This would be part of what team-samba negotiated with FS developers.

>
> The problem with treating this as *just* an xattr is that it doesn't
> account for files that are created outside of samba but are then shared
> out by it.

If something is created in a different universe, then brought into this one -
when is its date of birth? The moment of creation, or the moment of entry
into this universe? If both universes have a common time line (although
with a 10 year offset) then I guess the former, though I think it is a bit of
a philosophical point.... :-)

>
> To handle this correctly, I believe it needs to be initialized by the
> kernel to the current time whenever an inode is created, even if samba
> doesn't create it. After that, it can be treated as just another xattr.
>
Yes, I suspect that would be ideal, and trivial for the fs to implement (it
has to initialise it to something after all).

i.e. I agree.

NeilBrown

2010-08-06 03:55:08

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
> On Thu, 5 Aug 2010 16:52:18 -0700
> Jeremy Allison <[email protected]> wrote:

>> Don't add it as an EA. It's *not* an EA, it's a timestamp.
>
> I'm curious. Why do you particularly care what interface the kernel uses to
> provide you with access to this attribute?
>
> And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
> would seem to be an extension - an extended attribute.
> As the Linux kernel does virtually nothing with this attribute except provide
> access, it seems to be a very different class of thing to other timestamps.
> Surely it is simply some storage associated with a file which is capable of
> storing a timestamp, which can be set or retrieved by an application, and
> which happens to be initialised to the current time when a file is created.
>
> Yes, to you it is a timestamp. But to Linux it is a few bytes of
> user-settable metadata. Sounds like an EA to me.
>
> Or do you really want something like BSD's 'btime' which as I understand it
> cannot be set. Would that be really useful to you?

Obviously the cifs and SMB2 protocols which Samba server support can
ask the server to set the create time of a file (this is handled
through xattrs today along with the "dos attribute" flags such as
archive/hidden/system), but certainly it is much more common (and
important) to read the creation time of an existing file.

> Is there something important that I am missing?

It is another syscall that Samba server would have to make - and xattr
performance is extremely slow on some file systems (although
presumably this one would be more likely to be stored in inode and
perhaps not as bad on ext4, cifs and a few others such as ntfs).

--
Thanks,

Steve

2010-08-08 23:07:33

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sun, 8 Aug 2010 05:12:09 -0700
Jeremy Allison <[email protected]> wrote:

> On Fri, Aug 06, 2010 at 01:38:36PM +1000, Neil Brown wrote:
> > I'm curious. Why do you particularly care what interface the kernel uses to
> > provide you with access to this attribute?
>
> It's a matter of taste. The *BSD's have this right IMHO. It
> should be part of the stat information. A file timestamp is not
> an EA. Making it available that way just feels like an appalingly
> tasteless kludge. It offends the artist in me :-).

Unfortunately whenever you work on a collaborative project someone has to make
concessions to taste, as we all taste different.. (or have different taste..
or something).

So I think it is very important to clearly differentiate the practical issues
from the aesthetic issues as I think we can hope for unity on the former, but
never on the latter.

>
> > Or do you really want something like BSD's 'btime' which as I understand it
> > cannot be set. Would that be really useful to you?
>
> It is *already* useful to us, and is widely used in
> existing code. The occasions when btime is set are
> relatively rare, and at that point we store it in a
> separate EA for Windows reporting purposes.

I'm probably sounding like a scratched record, but when you say "is widely
used" do you mean "is used in samba which is widely used" or do you mean "is
used in a wide variety of applications"?

Because if you are only saying the former, then I don't think we should copy
BSD, but rather I think we should provide exactly the semantics that are most
useful to samba - and that would seem to be creation-time and DOS flags which
the filesystem can store directly in the inode and which samba can access
cheaply.
(and I would prefer to use xattrs, but that is a taste thing and as I'm not
writing the code, I don't get to choose the taste).

But if you are saying the latter, then sharing those details might help us see
that copying BSD is actually the best thing to do, or maybe that something
else is better.

I'm just afraid that if some new interface is added without clear,
comprehensive and up-front justification then we will end up getting a
sub-optimal interface.

NeilBrown

2010-08-16 18:07:07

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 13, 2010 at 12:19:00PM -0700, Jeremy Allison wrote:
> On Fri, Aug 13, 2010 at 09:06:28PM +0200, Jan Engelhardt wrote:
> >
> > On Friday 2010-08-13 19:54, Jeremy Allison wrote:
> > >On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
> > >> On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> > >> > We don't need to ape Windows in everything.
> > >> > The coming ACL disaster will show that (we will go from an ACL
> > >> > model that is slightly too complex to use, to one that is impossibly
> > >> > complex to use :-).
> > >>
> > >> Care to elaborate?
> > >
> > >POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
> > >people are asking for this. But Windows ACLs are a nightmare
> > >beyond human comprehension :-). In the "too complex to be
> > >usable" camp.
> >
> > Well, for one, ACLs in NT can be recursive IIRC. You can't say that of Linux
> > ACLs - instead you have to setfacl -R and setfacl -Rd to give one user access
> > to a directory and all its subdirs including future new inodes.
>
> You do realize that Windows does exactly the same thing under
> the covers, right ? Watch SMB or SMB2 traffic between a client
> and Windows server when someone changes an ACL sometime :-).

Yeah. There's some explanation here:

http://tools.ietf.org/search/rfc5661#section-6.4.3.2

What NT-style ACLs provide is a few bits that help a setfacl-like
application decide how to propagate the change. But it's still up to
the application to do the recursive traversal.
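
A deliberately simplified sketch of that division of labour follows. The
flag values are the ACE4_* inheritance bits from the NFSv4 specification.
Everything else (the Node class, the function names, the in-memory tree) is
hypothetical, and the model ignores ACE types, ACE ordering and the ACL-level
auto-inherit and protected flags.

```python
from dataclasses import dataclass, field

FILE_INHERIT = 0x01       # ACE4_FILE_INHERIT_ACE ("object inherit")
DIRECTORY_INHERIT = 0x02  # ACE4_DIRECTORY_INHERIT_ACE ("container inherit")
NO_PROPAGATE = 0x04       # ACE4_NO_PROPAGATE_INHERIT_ACE
INHERIT_ONLY = 0x08       # ACE4_INHERIT_ONLY_ACE
INHERITED = 0x80          # ACE4_INHERITED_ACE


@dataclass
class Node:
    name: str
    is_dir: bool
    aces: list = field(default_factory=list)      # list of (who, flags)
    children: list = field(default_factory=list)


def inherit(parent_aces, child_is_dir):
    """Compute the ACEs a child picks up from its parent directory."""
    out = []
    for who, flags in parent_aces:
        if child_is_dir:
            if not flags & (FILE_INHERIT | DIRECTORY_INHERIT):
                continue
            if flags & NO_PROPAGATE:
                if not flags & DIRECTORY_INHERIT:
                    continue
                new = 0  # applies to the child, goes no further
            else:
                new = flags & (FILE_INHERIT | DIRECTORY_INHERIT)
                if not flags & DIRECTORY_INHERIT:
                    # Passed down for the benefit of files only.
                    new |= INHERIT_ONLY
        else:
            if not flags & FILE_INHERIT:
                continue
            new = 0
        out.append((who, new | INHERITED))
    return out


def propagate(node):
    """The recursive walk the application has to perform itself."""
    for child in node.children:
        explicit = [a for a in child.aces if not a[1] & INHERITED]
        child.aces = explicit + inherit(node.aces, child.is_dir)
        if child.is_dir:
            propagate(child)
```

The flag bits only tell inherit() what to copy; nothing happens to existing
files until something calls propagate() on the tree.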

--b.

2010-08-16 19:07:26

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Mon, Aug 16, 2010 at 02:08:29PM -0400, J. Bruce Fields wrote:
> On Fri, Aug 13, 2010 at 10:54:10AM -0700, Jeremy Allison wrote:
> > On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
> > > On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> > > > We don't need to ape Windows in everything.
> > > > The coming ACL disaster will show that (we will go from an ACL
> > > > model that is slightly too complex to use, to one that is impossibly
> > > > complex to use :-).
> > >
> > > Care to elaborate?
> >
> > POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
> > people are asking for this. But Windows ACLs are a nightmare
> > beyond human comprehension :-). In the "too complex to be
> > usable" camp.
> >
> > > And what would native ACL support mean for Samba?
> >
> > RichACLs'll do it, but I feel sorry for the admins :-).
>
> I was curious whether you can support that with any data (or even just
> anecdotes) about real-world sysadmins.

Just an anecdote, but I remember giving a talk to a room full
of admins, all of whom told me it was essential for Samba to
implement "full Windows ACL compatibility" (we were in the process
of coding it up at the time). I asked them to tell me the difference
between object inherit, container inherit, and inherit only. Only
one hand remained up (out of a room containing a couple of hundred
Windows admins). I asked him where he worked, and the reply was
"the US Marine Corps." :-).

> The NT-style ACLs give me a headache, honestly. But that may just be
> because I've been involved with the implementation. Admins may have the
> luxury of using only the subset that they're comfortable with.

Yeah. I think most sites set a group as the owner of a share
and the directory so exported, set the directory to inherit
everything down below, and just leave it up to the members
of that group without getting further involved :-).

Jeremy.

2010-08-05 23:52:31

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sun, Aug 01, 2010 at 09:25:29AM -0400, Jeff Layton wrote:
> On Fri, 30 Jul 2010 23:22:58 +0200
> utz lehmann <[email protected]> wrote:
>
> > On Thu, 2010-07-22 at 09:40 -0700, Linus Torvalds wrote:
> > > But the fact is, th Unix ctime semantics are insane and largely
> > > useless. There's a damn good reason almost nobody uses ctime under
> > > unix.
> > >
> > > So what I'm suggesting is that we have a flag - either per-process or
> > > per-mount - that just says "use windows semantics for ctime".
> >
> > When abusing an existing time stamp use atime not ctime please.
> > ctime has it's uses. atime was just a mistake and is nearly useless.
> >
> > And with noatime we already have creation time semantics for atime.
> >
>
> Ugh. Honestly all of this talk of abusing different time fields seems
> like craziness to me. It's going to be very hard to do that without
> breaking *something*. There's also very little reason to do this when
> xattrs are a much cleaner approach.
>
> Neil Brown has put forth a very reasoned justification for putting the
> birthtime in an xattr. After reading it, I think that makes more sense
> than anything. It's also something that can be done without any extra
> infrastructure. If at some point in the future we get an xstat-like
> syscall then we can always add birthtime to that as well.

Just my 2 cents (as a Samba server implementor). I *hate* the idea
of adding a "virtual" EA for birthtime. If you're going to add it,
just add it to the stat struct like *BSD does. Don't abuse the other
time fields, it's a new one.

Jeff, please don't advocate for an EA for the Samba server to use.
Don't add it as an EA. It's *not* an EA, it's a timestamp.

Jeremy.

2010-08-07 10:34:07

by Jeffrey Layton

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sat, 7 Aug 2010 13:32:40 +1000
Neil Brown <[email protected]> wrote:

> So we are left with an attribute that is needed for windows compatibility,
> and so just needs to be understood by samba and wine. Some filesystems might
> support it efficiently, others might require the use of generic
> extended-attributes, still others might not support it at all (I guess you
> store it in some 'tdb' and hope it works well enough).
>
> Core-linux doesn't really need to know about this - there just needs to be a
> channel to pass it between samba/wine and the filesystem. xattr still seems
> the best mechanism to pass this stuff around. Team-samba can negotiate with
> fs developers to optimise/accelerate certain attributes, and linux-VFS
> doesn't need to know or care (except maybe to provide generic non-blocking or
> multiple-access interfaces).
>

IIUC, you're saying that we should basically just have samba stuff the
current time into an xattr when it creates the file and leave the
filesystems alone. If so, I disagree here.

The problem with treating this as *just* an xattr is that it doesn't
account for files that are created outside of samba but are then shared
out by it.

To handle this correctly, I believe it needs to be initialized by the
kernel to the current time whenever an inode is created, even if samba
doesn't create it. After that, it can be treated as just another xattr.
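
The behaviour proposed above can be pictured with a toy model. This is only
a sketch: `toy_inode`, `toy_inode_create` and `toy_set_birthtime` are
hypothetical names, not kernel interfaces. The point it shows is that the
stamp is applied on the creation path regardless of which program creates
the file, and can still be overridden afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical model of an inode that carries a birth time. */
struct toy_inode {
	struct timespec btime;   /* stamped at creation, may be replaced later */
	int btime_overridden;    /* non-zero once an application has set it */
};

/* Creation path: the stamp is applied no matter who creates the file. */
void toy_inode_create(struct toy_inode *ino, struct timespec now)
{
	assert(ino != NULL);
	ino->btime = now;
	ino->btime_overridden = 0;
}

/* Later, an application such as samba may replace the value. */
void toy_set_birthtime(struct toy_inode *ino, struct timespec when)
{
	assert(ino != NULL);
	ino->btime = when;
	ino->btime_overridden = 1;
}
```

A file created by some other program still gets a birth time from
`toy_inode_create`, which is the case a samba-only xattr would miss.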

--
Jeff Layton <[email protected]>

2010-08-13 12:56:41

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> We don't need to ape Windows in everything.
> The coming ACL disaster will show that (we will go from an ACL
> model that is slightly too complex to use, to one that is impossibly
> complex to use :-).

Care to elaborate?

And what would native ACL support mean for Samba?

--b.

2010-08-07 02:54:50

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 6, 2010 at 9:42 PM, Steve French <[email protected]> wrote:
> On Fri, Aug 6, 2010 at 7:29 PM, Neil Brown <[email protected]> wrote:
>> On Fri, 6 Aug 2010 18:58:42 -0500
>> Steve French <[email protected]> wrote:
>>
>>> On Fri, Aug 6, 2010 at 6:30 PM, Neil Brown <[email protected]> wrote:
>>> > On Thu, 5 Aug 2010 22:55:06 -0500
>>> > Steve French <[email protected]> wrote:
>>> >
>>> >> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
>>> >> > On Thu, 5 Aug 2010 16:52:18 -0700
>>> >> > Jeremy Allison <[email protected]> wrote:
>>> >>
>>> >> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
>>> >> >
>>> >> > I'm curious. Why do you particularly care what interface the kernel uses to
>>> >> > provide you with access to this attribute?
>>> >> >
>>> >> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
>>> >> > would seem to be an extension - an extended attribute.
>>> >> > As the Linux kernel does virtually nothing with this attribute except provide
>>> >> > access, it seems to be a very different class of thing to other timestamps.
>>> >> > Surely it is simply some storage associated with a file which is capable of
>>> >> > storing a timestamp, which can be set or retrieved by an application, and
>>> >> > which happens to be initialised to the current time when a file is created.
>>> >> >
>>> >> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
>>> >> > user-settable metadata. Sounds like an EA to me.
>>> >> >
>>> >> > Or do you really want something like BSD's 'btime' which as I understand it
>>> >> > cannot be set. Would that be really useful to you?
>>> >>
>>> >> Obviously the cifs and SMB2 protocols which Samba server support can
>>> >> ask the server to set the create time of a file (this is handled
>>> >> through xattrs today along with the "dos attribute" flags such as
>>> >> archive/hidden/system), but certainly it is much more common (and
>>> >> important) to read the creation time of an existing file.
>>> >>
>>> >
>>> > Just a point of clarification - when you say it is common and important to be
>>> > able to read the creation time on an existing file, are you still talking in
>>> > the context of cifs/smb windows compatibility, or are you talking in the
>>> > broader context?
>>> > If you are referring to a broader context could you please give more details
>>> > because I have not heard any mention of any real value of creation-time
>>> > outside of windows interoperability - having such a use clearly documented
>>> > would assist the conversation I think.
>>> >
>>> > If on the other hand you are just referring to the windows interoperability
>>> > context ... given that you have to read an EA if the create-time has been
>>> > changed, you will always have to read an EA so having something else is
>>> > pointless ... or I'm missing something.
>>>
>>> There are other cases, less common than cifs and smb2. One
>>> that comes to mind is NFS version 4, but there are a few other
>>> cases that I have heard of (backup/archive applications).
>>> The RFC recommends that servers return attribute 50 (creation
>>> time). See below text:
>>>
>>>    time_create   50   nfstime4   R/W   The time of creation
>>>                                        of the object. This
>>>                                        attribute does not
>>>                                        have any relation to
>>>                                        the traditional UNIX
>>>                                        file attribute
>>>                                        "ctime" or "change
>>>                                        time".
>>
>> I really don't think NFSv4 is a separate justification. I'm fairly sure
>> that attribute was only included in NFSv4 for enhanced Windows
>> compatibility (windows interoperation was a big issue during the protocol
>> development).
>
> Perhaps also useful for MacOS (and other BSD), not just Windows,
> although MacOS may use cifs more often than nfs.

>> That leaves hypothetical "backup/archive applications".
>> Do you have a concrete example?

A quick search for backup applications on Wikipedia fairly easily
turned up a reference to a backup app for Linux that uses creation
time:

http://www.aqualab.cs.northwestern.edu/publications/Cornell04VFS.html

Presumably Windows compat. is a stronger motivation than BSD/MacOS
NFSv4 (returning birth time) compat, and backup applications
are a lesser motivation. There may also be some value in using creation
time as a generation number where no generation number is
available.

Intuitively it seems like creation time would be as "useful" as ctime (and
probably more so) to app developers ... but that is hard to prove.

--
Thanks,

Steve

2010-08-08 12:53:38

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Sun, 8 Aug 2010 05:12:09 -0700
Jeremy Allison <[email protected]> wrote:

> On Fri, Aug 06, 2010 at 01:38:36PM +1000, Neil Brown wrote:
> > I'm curious. Why do you particularly care what interface the kernel uses to
> > provide you with access to this attribute?
>
> It's a matter of taste. The *BSD's have this right IMHO. It
> should be part of the stat information. A file timestamp is not
> an EA. Making it available that way just feels like an appallingly
> tasteless kludge. It offends the artist in me :-).
>

It would be more convenient if this were part of stat() but adding a
new stat call is non-trivial. Even if we did that, it still doesn't
solve the problem of being able to set the create time. The fact that
that's rarely done doesn't really matter much -- we ought to shoot for
the semantics that are needed to handle this properly.

If we do settle on a xstat() interface, it might also end up being able
to report things like selinux labels which are also available and
settable via xattr. I don't see a problem with presenting the same data
via multiple interfaces. If presenting this data via xattr solves the
immediate problem of being able to properly store and report the create
time then it seems like a win.

> > Or do you really want something like BSD's 'btime' which as I understand it
> > cannot be set. Would that be really useful to you?
>
> It is *already* useful to us, and is widely used in
> existing code. The occasions when btime is set are
> relatively rare, and at that point we store it in a
> separate EA for Windows reporting purposes.
>

If that's the case, don't you have to query for this EA every time you
need to return the create time anyway? If so, then doing this really
isn't any more costly -- you'd just be querying a different EA, right?
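
The lookup being discussed can be sketched as below. The function name
`report_create_time` and its calling convention are assumptions made for
illustration, not Samba code. It shows why the override EA has to be
consulted on every query: the filesystem birth time is only the answer when
no override has been stored.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Choose the create time to report to a Windows client.
 * 'override' is the value read from the separate EA, or NULL when that
 * EA is absent; 'fs_btime' is the birth time the filesystem reports.
 */
struct timespec report_create_time(const struct timespec *override,
				   const struct timespec *fs_btime)
{
	assert(fs_btime != NULL);
	return override != NULL ? *override : *fs_btime;
}
```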

--
Jeff Layton <[email protected]>

2010-08-16 18:10:40

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 13, 2010 at 10:54:10AM -0700, Jeremy Allison wrote:
> On Fri, Aug 13, 2010 at 08:54:32AM -0400, J. Bruce Fields wrote:
> > On Sun, Aug 08, 2010 at 06:05:01AM -0700, Jeremy Allison wrote:
> > > We don't need to ape Windows in everything.
> > > The coming ACL disaster will show that (we will go from an ACL
> > > model that is slightly too complex to use, to one that is impossibly
> > > complex to use :-).
> >
> > Care to elaborate?
>
> POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
> people are asking for this. But Windows ACLs are a nightmare
> beyond human comprehension :-). In the "too complex to be
> usable" camp.
>
> > And what would native ACL support mean for Samba?
>
> RichACLs'll do it, but I feel sorry for the admins :-).

I was curious whether you can support that with any data (or even just
anecdotes) about real-world sysadmins.

The NT-style ACLs give me a headache, honestly. But that may just be
because I've been involved with the implementation. Admins may have the
luxury of using only the subset that they're comfortable with.

--b.

2010-08-07 02:42:42

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Fri, Aug 6, 2010 at 7:29 PM, Neil Brown <[email protected]> wrote:
> On Fri, 6 Aug 2010 18:58:42 -0500
> Steve French <[email protected]> wrote:
>
>> On Fri, Aug 6, 2010 at 6:30 PM, Neil Brown <[email protected]> wrote:
>> > On Thu, 5 Aug 2010 22:55:06 -0500
>> > Steve French <[email protected]> wrote:
>> >
>> >> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
>> >> > On Thu, 5 Aug 2010 16:52:18 -0700
>> >> > Jeremy Allison <[email protected]> wrote:
>> >>
>> >> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
>> >> >
>> >> > I'm curious. Why do you particularly care what interface the kernel uses to
>> >> > provide you with access to this attribute?
>> >> >
>> >> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
>> >> > would seem to be an extension - an extended attribute.
>> >> > As the Linux kernel does virtually nothing with this attribute except provide
>> >> > access, it seems to be a very different class of thing to other timestamps.
>> >> > Surely it is simply some storage associated with a file which is capable of
>> >> > storing a timestamp, which can be set or retrieved by an application, and
>> >> > which happens to be initialised to the current time when a file is created.
>> >> >
>> >> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
>> >> > user-settable metadata. Sounds like an EA to me.
>> >> >
>> >> > Or do you really want something like BSD's 'btime' which as I understand it
>> >> > cannot be set. Would that be really useful to you?
>> >>
>> >> Obviously the cifs and SMB2 protocols which Samba server support can
>> >> ask the server to set the create time of a file (this is handled
>> >> through xattrs today along with the "dos attribute" flags such as
>> >> archive/hidden/system), but certainly it is much more common (and
>> >> important) to read the creation time of an existing file.
>> >>
>> >
>> > Just a point of clarification - when you say it is common and important to be
>> > able to read the creation time on an existing file, are you still talking in
>> > the context of cifs/smb windows compatibility, or are you talking in the
>> > broader context?
>> > If you are referring to a broader context could you please give more details
>> > because I have not heard any mention of any real value of creation-time
>> > outside of windows interoperability - having such a use clearly documented
>> > would assist the conversation I think.
>> >
>> > If on the other hand you are just referring to the windows interoperability
>> > context ... given that you have to read an EA if the create-time has been
>> > changed, you will always have to read an EA so having something else is
>> > pointless ... or I'm missing something.
>>
>> There are other cases, less common than cifs and smb2. One
>> that comes to mind is NFS version 4, but there are a few other
>> cases that I have heard of (backup/archive applications).
>> The RFC recommends that servers return attribute 50 (creation
>> time). See below text:
>>
>>    time_create   50   nfstime4   R/W   The time of creation
>>                                        of the object. This
>>                                        attribute does not
>>                                        have any relation to
>>                                        the traditional UNIX
>>                                        file attribute
>>                                        "ctime" or "change
>>                                        time".
>
> I really don't think NFSv4 is a separate justification. I'm fairly sure
> that attribute was only included in NFSv4 for enhanced Windows
> compatibility (windows interoperation was a big issue during the protocol
> development).

Perhaps also useful for MacOS (and other BSD), not just Windows,
although MacOS may use cifs more often than nfs.

--
Thanks,

Steve

2010-08-06 11:15:51

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 5 Aug 2010 22:55:06 -0500
Steve French <[email protected]> wrote:

> On Thu, Aug 5, 2010 at 10:38 PM, Neil Brown <[email protected]> wrote:
> > On Thu, 5 Aug 2010 16:52:18 -0700
> > Jeremy Allison <[email protected]> wrote:
>
> >> Don't add it as an EA. It's *not* an EA, it's a timestamp.
> >
> > I'm curious. Why do you particularly care what interface the kernel uses to
> > provide you with access to this attribute?
> >
> > And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
> > would seem to be an extension - an extended attribute.
> > As the Linux kernel does virtually nothing with this attribute except provide
> > access, it seems to be a very different class of thing to other timestamps.
> > Surely it is simply some storage associated with a file which is capable of
> > storing a timestamp, which can be set or retrieved by an application, and
> > which happens to be initialised to the current time when a file is created.
> >
> > Yes, to you it is a timestamp. But to Linux it is a few bytes of
> > user-settable metadata. Sounds like an EA to me.
> >
> > Or do you really want something like BSD's 'btime' which as I understand it
> > cannot be set. Would that be really useful to you?
>
> Obviously the cifs and SMB2 protocols which Samba server support can
> ask the server to set the create time of a file (this is handled
> through xattrs today along with the "dos attribute" flags such as
> archive/hidden/system), but certainly it is much more common (and
> important) to read the creation time of an existing file.
>
>
> > Is there something important that I am missing?
>
> It is another syscall that Samba server would have to make - and xattr
> performance is extremely slow on some file systems (although
> presumably this one would be more likely to be stored in inode and
> perhaps not as bad on ext4, cifs and a few others such as ntfs).
>
>

Right. One has to consider that samba has to satisfy READDIRPLUS-like
calls, and on a large directory all of those extra syscalls are likely
to impact performance.

In my view, the ideal thing would be to add this field as an EA and
continue work on implementing xstat(). Adding it as an EA gives
userland a way to set this value, without needing to add a new utimes()
variant.

If/when xstat becomes available, samba could use that instead of the EA
for reading this value.

--
Jeff Layton <[email protected]>

2010-08-03 01:13:52

Subject: Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

On Thu, 29 Jul 2010 17:15:15 +0100
David Howells <[email protected]> wrote:

> Neil Brown <[email protected]> wrote:
>
> > This justifies for me why a CIFS client would want to extract the
> > creation-time from the CIFS protocol, but not why you want to expose it via a
> > generic interface.
>
> It would also be easier for NFSD if the creation time was in struct kstat.
> It's included as an optional element in NFSv4. The same goes for the data
> version number. I'm not sure about the inode generation, I suspect that's used
> as part of the FH construction.
>
> However, someone was talking about a userspace NFS daemon, and there they may
> want all three bits. Even Samba may want multiple bits. Calling getxattr
> multiple times per file starts to add up, even for internal values.
>
> Consider further: NFS, for example, could be made to retrieve the creation time
> from the server. This can be merged with the attribute fetch done by the
> getattr() call, or it could be done separately by getxattr. Unless it's stored
> in RAM, that's one NFS RPC op versus two. Okay, that's a bit of an artificial
> example, but still.
>
> > Given that we have an extensible attribute framework, it seems wrong to be
> > adding new attributes to *stat. If a given filesystem wants to store certain
> > attributes more efficiently, then it is welcome to intercept xattr calls and
> > store (say) "cifs.birthtime" directly at a known offset in the inode.
>
> It's not attribute storage I'm thinking about, but making attribute retrieval
> more efficient.
>
> > The flip-side of extracting these various attributes is setting them.
>
> I acknowledge that if we went down the getxattr() route, then that
> automatically makes setxattr() the obvious candidate for setting things.
>
> But think about it another way: what if you want to set several attributes?
> You have to make a bunch of setxattr() calls. But what if it were possible to
> do all of chmod, chgrp, chown, truncate, utimes, set_btime, etc. all in one go,
> atomically? We more or less have this internally in the kernel, and it might
> stand to be exposed to userspace.
>
> It might, for example, make untarring that little bit more efficient.
>
> > I'm still pondering those extra flags:
> > FS_SPECIAL_FL
> > FS_AUTOMOUNT_FL
> > FS_AUTOMOUNT_ANY_FL
> > FS_REMOTE_FL
> > FS_ENCRYPTED_FL
> > FS_OFFLINE_FL
> >
> > They sound like they might be useful, they are not file-metadata (like
> > btime) but rather implementation details (like st_blocks). So it is probably
> > sensible to include them as you have done.
>
> I've split these away from ioc flags as ioc flags is very ext2/3/4 centric, and
> those filesystems happily create their own ioc flags sets without updating the
> master set.
>
> > If a filesystem is mounted on an network-block-device, or a loop-back of a
> > file on NFS, is FS_REMOTE_FL set?
> > Is ROT13 enough for FS_ENCRYPTED_FL to be set?
> > If the NFS server is "not responding, still trying", should FS_OFFLINE_FL get
> > set on all files?
> > And I cannot even guess at the difference between the two FS_AUTOMOUNT flags.
> > I'm sure it is something useful, but doco would be good. Should one of them
> > be set on mountpoints that NFSv4 detects from the server?
>
> Yeah. I have plans to write documentation for it, but I'd like to have a
> clearer idea of what the interface might be before doing that.
>
> But to give you an idea of the flags:
>
> (*) FS_SPECIAL_FL - Kernel API file from a quasi-filesystem such as /proc or
> /sys - the sort of thing you might not want to expose through NFSD.
>
> (*) FS_AUTOMOUNT_FL - A named automount/referral point. You attempt to
> transit this directory and the backing fs will mount something over the
> top.
>
> (*) FS_AUTOMOUNT_ANY_FL - A directory in which you can look up a non-existent
> directory entry, which will cause that dirent to be fabricated and the
> target filesystem be mounted over the top. Examples include looking up
> arbitrary cell names in /afs, or arbitrary hostnames in autofs or amd
> indirect mount directories.
>
> (*) FS_REMOTE_FL - A filesystem object that is assumed not to be stored on the
> computer issuing the request. It would be quite nice to have loopback NFS
> not set the remote flag and to have NBD mounted filesystems to set the
> remote flag, but this can get quite messy with things like overmounts.
>
> My thought is that this can be used by a GUI to choose its icons for
> files.
>
> (*) FS_ENCRYPTED_FL - A file that is stored encrypted and that presumably
> needs a key providing to decrypt it. CIFS has an attribute bit for this
> (ATTR_ENCRYPTED).
>
> (*) FS_OFFLINE_FL - A file that isn't immediately available, and that requires
> a connection to the data store to be made. CIFS has an attribute bit for
> this (ATTR_OFFLINE). AFS has a field in its volume data and an error code
> indicating that a volume is offline and cannot currently be accessed.
>
> This could be set by network filesystems for which the network or the
> server is absent for example. Especially if the lightweight stat is
> requested (non-blocking in essence).

Thanks for these. It particularly helps when you identify how the flag might
be used - guiding GUI icon choice is certainly valid and tells me that if I
don't set the flag 'correctly' (maybe because it is too difficult) then it
isn't the end of the world.
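
As a rough illustration of the GUI use case, a file manager might map the
flags to an icon along these lines. The bit values, the `TOY_` names and the
priority order (offline before encrypted before remote) are all assumptions
made for this sketch; the thread does not define them.

```c
#include <assert.h>

/* Hypothetical bit values; the thread does not give the real ones. */
enum {
	TOY_FS_REMOTE_FL    = 1 << 0,
	TOY_FS_ENCRYPTED_FL = 1 << 1,
	TOY_FS_OFFLINE_FL   = 1 << 2,
};

enum toy_icon {
	TOY_ICON_PLAIN,
	TOY_ICON_REMOTE,
	TOY_ICON_LOCKED,
	TOY_ICON_OFFLINE,
	TOY_ICON_COUNT
};

/* Map flags to an icon; a flag that is not set only costs a plainer icon. */
enum toy_icon pick_icon(unsigned int flags)
{
	if (flags & TOY_FS_OFFLINE_FL)
		return TOY_ICON_OFFLINE;
	if (flags & TOY_FS_ENCRYPTED_FL)
		return TOY_ICON_LOCKED;
	if (flags & TOY_FS_REMOTE_FL)
		return TOY_ICON_REMOTE;
	return TOY_ICON_PLAIN;
}

/* Icon file name for display purposes. */
const char *icon_name(enum toy_icon icon)
{
	static const char *const names[TOY_ICON_COUNT] = {
		"file", "file-remote", "file-locked", "file-offline",
	};

	assert((unsigned int)icon < TOY_ICON_COUNT);
	return names[icon];
}
```

If a filesystem fails to set a flag, `pick_icon` simply falls through to a
less specific icon, which matches the "not the end of the world" reading.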

I get the AUTOMOUNT distinction too - FS_AUTOMOUNT_ANY_FL would be good for a
GUI as it could allow you to type in a filename for it to try to follow.

I'm not sure exactly how FS_ENCRYPTED_FL would be used - if the gui might be
prompted to ask for a key there would either need to be a completely general
interface for presenting keys, or the flag should be specific to CIFS and
should mean that a key must be given to CIFS to unlock the file.

Similarly, what can you do with an OFFLINE file? Do CIFS and AFS offline
files behave the same way? If not there should be two different flags. If
so then that behaviour should be specified with the flag ... unless this flag
is just for GUI cosmetics too.

Anyway, I've been thinking more about this and have refined my position
somewhat. I'll present it here for what it is worth - feel free to ignore
bits you don't like.

Your proposed 'xstat' seems to combine a number of different goals - doing
that is always a bit dangerous as you have to defend it on multiple fronts...

The separate goals I see are:
A/ allowing attributes to be accessed independently - an explicit list of
required attributes is given and the FS doesn't need to collect the other
attributes.
B/ allowing synthetic attributes to be identified - if the FS doesn't
natively support some attribute but must synthesise it, you can now
discover that fact
C/ add an ad-hoc collection of new attributes that filesystems can return if
they happen to support them
D/ do all the above with a single system call for efficiency.
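
Goals A, B and D can be pictured together with a small mask-based sketch.
Everything here is hypothetical (the `TOY_ATTR_` bits, `toy_attr_reply` and
`toy_getattrs`); it is not the interface from the patch, only one way the
three ideas might fit in a single call.

```c
#include <assert.h>

/* Hypothetical attribute bits, for illustration only. */
enum {
	TOY_ATTR_INO   = 1 << 0,
	TOY_ATTR_UID   = 1 << 1,
	TOY_ATTR_BTIME = 1 << 2,
};

struct toy_attr_reply {
	unsigned int returned;   /* attributes actually filled in (goal A) */
	unsigned int synthetic;  /* subset of 'returned' that was faked (goal B) */
};

/*
 * Answer one request (goal D) against a filesystem described by two
 * masks: 'native' attributes are stored by the filesystem, 'faked' ones
 * can only be synthesised. Attributes in neither mask are not returned.
 */
struct toy_attr_reply toy_getattrs(unsigned int wanted,
				   unsigned int native,
				   unsigned int faked)
{
	struct toy_attr_reply r;

	assert((native & faked) == 0);
	r.returned = wanted & (native | faked);
	r.synthetic = wanted & faked;
	return r;
}
```

A caller asks only for what it needs and can then check `synthetic` to see
which answers the filesystem had to make up.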

I think pushing all these together is asking for trouble - arguments about one
aspect will interfere with completion of the others.

Given that we already have the 'xattr' interface it seems most sensible to
achieve 'A' by defining xattr names for all 'standard' attributes and
handling them in a common library function. Maybe 'linux.inum' to get the
inode numbers, etc. There is doubtlessly a better name than 'linux.inum'.
I understand that you tried something like this before and it was rejected.
To borrow Linus's hyperbole from up-thread:
>> Hey, whoever denounced it as stupid obviously doesn't have the neurons
>> to go around to be involved in the discussion. Ignore them.

With that in place, 'B' can be achieved by the simple expedient of not
listing (in listxattr) the system attributes that the filesystem doesn't
support natively. So if a filesystem doesn't support uid and has to fake it,
then it would not list 'linux.uid' in the xattr list, but will still return
the faked uid if explicitly asked for it.
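
Under that scheme, userspace would discover native support by scanning the
name list that listxattr() fills in, which is a sequence of NUL-terminated
names packed back to back. A minimal sketch of that scan follows; the helper
name `xattr_list_contains` is made up, and names such as 'linux.uid' are
only the proposal from this mail, not attributes that exist today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if 'name' appears in a listxattr()-style buffer of 'len'
 * bytes (NUL-terminated names packed back to back), and 0 otherwise.
 */
int xattr_list_contains(const char *list, size_t len, const char *name)
{
	size_t want, off = 0;

	assert(list != NULL || len == 0);
	assert(name != NULL);
	want = strlen(name);

	while (off < len) {
		const char *cur = list + off;
		const char *end = memchr(cur, '\0', len - off);
		size_t n = end != NULL ? (size_t)(end - cur) : len - off;

		if (n == want && memcmp(cur, name, n) == 0)
			return 1;
		off += n + 1;   /* skip the name and its terminating NUL */
	}
	return 0;
}
```

A name that is absent from the list but still readable with getxattr()
would then be understood as synthesised.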

The various proposed new attributes (C) could then be added one at a time or
as groups depending on how much opposition they receive. Some might be
generic (linux.*) while others should possibly be filesystem-specific (FAT.*,
CIFS.*).

This could result in the need to make multiple system calls to get all of the
attributes that you want. Maybe this would be a problem ... I keep hearing
that in Linux context switches are really cheap and system calls are also
really cheap, so maybe it isn't a problem.

However if you can demonstrate a cost in a credible workload you would then
have ammunition to defend a new syscall (D) which would get multiple xattrs.
And maybe one that would set multiple xattrs.

Thus you can address each goal one at a time and the more contentious parts
can be delayed without interfering with the clearly valuable parts.

Whether a particular attribute were stored in kstat, or whether the fs needed
extra disk access to get the attribute would be entirely internal details
which we are free to get wrong the first few times and then fix up once we
understand all the issues properly.

> > Providing everybody imposes exactly the same semantics for "creation time"...
>
> We can invent some for Linux. The time at which an inode is created would seem
> to be a sensible course, but with the ability for the creation time to be set
> by archiving tools. Overwriting an existing inode by truncating it and then
> writing it should keep the creation time of the inode.
>
> I think this would then be the same behaviour as Windows.

Yes, it seems that supporting the Windows behaviour is the only actual
use-case that has been suggested - so I think that we should be explicit that
this attribute has exactly the same semantics as the windows attribute. i.e.
we shouldn't invent some, we should precisely copy them.

>
> > "well derided" like high-mem and SMP support? or "real-time" support and
> > priority inheritance?
> > I guess the deriders are wrong, and will eventually realise that they are
> > wrong. The difficult bit is we cannot know how long it will take them, or
> > how much you have to care.
>
> Almost everyone hates the idea of having a stat function with a variable length
> buffer. To quote Linus:
>
> the "buffer+buflen" thing is still disgusting.
>
> You might be right, though: the deriders might be wrong; it just doesn't help
> at this particular point in time.

We do seem to suffer from the squeaky-wheel syndrome - the louder someone
complains the more attention they are given - I'm sorry I wasn't listening
when you first suggested using xattrs for accessing creation-time - maybe I
could have squeaked loudly too .... though probably I wouldn't have
considered the issues deeply enough by that time.

(Look - getxattr has buffer+buflen ! - it may well be disgusting, but
following established practice is good for consistency).
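
For reference, the established buffer+buflen practice with getxattr() looks
roughly like this on Linux: a first call with size 0 reports the value
length, and a second call fetches the value. The wrapper name `read_xattr`
is made up for this sketch, and a real caller would probably retry when the
second call fails with ERANGE.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Read an extended attribute into a freshly allocated buffer.
 * Returns NULL on failure. On success the caller frees the result
 * and *len_out holds the length of the value.
 */
void *read_xattr(const char *path, const char *name, size_t *len_out)
{
	ssize_t need, got;
	void *buf;

	assert(path != NULL && name != NULL && len_out != NULL);

	need = getxattr(path, name, NULL, 0);   /* size 0: ask for the length */
	if (need < 0)
		return NULL;

	buf = malloc(need > 0 ? (size_t)need : 1);
	if (buf == NULL)
		return NULL;

	got = getxattr(path, name, buf, (size_t)need);
	if (got < 0) {          /* e.g. the value grew between the two calls */
		free(buf);
		return NULL;
	}
	*len_out = (size_t)got;
	return buf;
}
```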

>
> > (unambiguous documentation!! the rest is just details)
>
> I normally do write documentation. It's just that I don't want to have to keep
> changing the docs as well as constantly rewriting the code.

I understand that desire ... but with an interface, the docs really are just
as important as the code!

thanks,
NeilBrown

2010-08-01 16:18:44