2010-06-30 01:17:59

by David Howells

Subject: [PATCH 0/3] Extended file stat functions [ver #2]

Implement a pair of new system calls to provide extended and further extensible
stat functions.

The third of the associated patches provides these new system calls:

struct xstat_dev {
	unsigned int	major;
	unsigned int	minor;
};

struct xstat_time {
	unsigned long long	tv_sec;
	unsigned long long	tv_nsec;
};

struct xstat {
	unsigned int		struct_version;
#define XSTAT_STRUCT_VERSION		0
	unsigned int		st_mode;
	unsigned int		st_nlink;
	unsigned int		st_uid;
	unsigned int		st_gid;
	unsigned int		st_blksize;
	struct xstat_dev	st_rdev;
	struct xstat_dev	st_dev;
	unsigned long long	st_ino;
	unsigned long long	st_size;
	struct xstat_time	st_atime;
	struct xstat_time	st_mtime;
	struct xstat_time	st_ctime;
	struct xstat_time	st_btime;
	unsigned long long	st_blocks;
	unsigned long long	st_gen;
	unsigned long long	st_data_version;
	unsigned long long	query_flags;
#define XSTAT_QUERY_SIZE		0x00000001ULL
#define XSTAT_QUERY_NLINK		0x00000002ULL
#define XSTAT_QUERY_AMC_TIMES		0x00000004ULL
#define XSTAT_QUERY_CREATION_TIME	0x00000008ULL
#define XSTAT_QUERY_BLOCKS		0x00000010ULL
#define XSTAT_QUERY_INODE_GENERATION	0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION	0x00000040ULL
#define XSTAT_QUERY__ORDINARY_SET	0x00000017ULL
#define XSTAT_QUERY__GET_ANYWAY		0x0000007fULL
#define XSTAT_QUERY__DEFINED_SET	0x0000007fULL
	unsigned long long	extra_results[0];
};

ssize_t ret = xstat(int dfd,
		    const char *filename,
		    unsigned atflag,
		    struct xstat *buffer,
		    size_t buflen);

ssize_t ret = fxstat(int fd,
		     struct xstat *buffer,
		     size_t buflen);

which are more fully documented in that patch's description.
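
As an illustration only (not part of the posted patches), the layout above can
be checked from userspace. The sketch below copies the structure definitions
from this message, spells the trailing array as a C99 flexible array member,
and assumes a common ABI with a 4-byte unsigned int and an 8-byte unsigned long
long. Under those assumptions the fixed part comes to 152 bytes, which is the
same number the sample run further down prints as the xstat() return value.
The helper names xstat_fixed_size(), xstat_extra_slots() and
xstat_layout_selfcheck() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct xstat_dev {
	unsigned int	major;
	unsigned int	minor;
};

struct xstat_time {
	unsigned long long	tv_sec;
	unsigned long long	tv_nsec;
};

struct xstat {
	unsigned int		struct_version;
	unsigned int		st_mode;
	unsigned int		st_nlink;
	unsigned int		st_uid;
	unsigned int		st_gid;
	unsigned int		st_blksize;
	struct xstat_dev	st_rdev;
	struct xstat_dev	st_dev;
	unsigned long long	st_ino;
	unsigned long long	st_size;
	struct xstat_time	st_atime;
	struct xstat_time	st_mtime;
	struct xstat_time	st_ctime;
	struct xstat_time	st_btime;
	unsigned long long	st_blocks;
	unsigned long long	st_gen;
	unsigned long long	st_data_version;
	unsigned long long	query_flags;
	unsigned long long	extra_results[];	/* [0] in the patch */
};

/* Size of the fixed part, i.e. everything before extra_results[]. */
static size_t xstat_fixed_size(void)
{
	return offsetof(struct xstat, extra_results);
}

/* How many 64-bit extra result slots fit in a buffer of buflen bytes. */
static size_t xstat_extra_slots(size_t buflen)
{
	size_t fixed = xstat_fixed_size();

	if (buflen < fixed)
		return 0;
	return (buflen - fixed) / sizeof(unsigned long long);
}

/* Self-check for the assumed ABI; aborts if the layout differs. */
static void xstat_layout_selfcheck(void)
{
	assert(sizeof(struct xstat_dev) == 8);
	assert(sizeof(struct xstat_time) == 16);
	assert(xstat_fixed_size() == 152);
}
```

A caller wanting room for two extra results would, under the same assumptions,
pass a buffer of 152 + 2 * 8 = 168 bytes.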

The bonuses of these new stat functions are:

(1) The fields in the xstat struct are cleaned up. There are no split or
duplicated fields.

(2) Some extra information is made available (file creation time, inode
generation number and data version number) where provided by the
underlying filesystem.

These are implemented here for Ext4 and AFS, but could also be provided
for CIFS, NTFS and BtrFS and probably others.

(3) The structure is versioned and extensible, meaning that further new system
calls shouldn't be required.

Note that no lstat() equivalent is required as that can be implemented through
xstat() with atflag == 0.


The first patch marks a number of system call userspace string/buffer arguments
as const. That lets me make sys_xstat()'s filename pointer const too (though
the entire first patch is not required for that).

The second patch makes the AFS filesystem use i_generation for the vnode ID
uniquifier rather than i_version, and assigns i_version to hold the AFS data
version number, making them more logical for when I want to get at them from
afs_getattr().


There's a test program attached to the description for patch 3. It can be run
as follows:

[root@andromeda ~]# /tmp/xstat /afs/archive/linuxdev/fedora9/i386/repodata/
xstat(/afs/archive/linuxdev/fedora9/i386/repodata/) = 152
sv=0 qf=77 cr=0.0 iv=7a5 dv=5
Size: 2048 Blocks: 0 IO Block: 4096 directory
Device: 00:15 Inode: 83 Links: 2
Access: (0755/drwxr-xr-x) Uid: 75338 Gid: 0
Access: 2008-11-05 20:00:12.000000000+0000
Modify: 2008-11-05 20:00:12.000000000+0000
Change: 2008-11-05 20:00:12.000000000+0000
Inode version: 7a5h
Data version: 5h
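
To make the "qf=77" line easier to read, the following sketch decodes a
query_flags value into the flag names defined earlier in this message. It
assumes that the test program prints query_flags in hex and that a set bit in
the returned value means the filesystem filled in the corresponding field;
neither is spelled out in this message, though the "cr=0.0" output is
consistent with creation time not being supplied. The function
xstat_flags_to_string() and the name table are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define XSTAT_QUERY_SIZE		0x00000001ULL
#define XSTAT_QUERY_NLINK		0x00000002ULL
#define XSTAT_QUERY_AMC_TIMES		0x00000004ULL
#define XSTAT_QUERY_CREATION_TIME	0x00000008ULL
#define XSTAT_QUERY_BLOCKS		0x00000010ULL
#define XSTAT_QUERY_INODE_GENERATION	0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION	0x00000040ULL
#define XSTAT_QUERY__ORDINARY_SET	0x00000017ULL

struct xstat_flag_name {
	unsigned long long	bit;
	const char		*name;
};

static const struct xstat_flag_name xstat_flag_names[] = {
	{ XSTAT_QUERY_SIZE,		"SIZE" },
	{ XSTAT_QUERY_NLINK,		"NLINK" },
	{ XSTAT_QUERY_AMC_TIMES,	"AMC_TIMES" },
	{ XSTAT_QUERY_CREATION_TIME,	"CREATION_TIME" },
	{ XSTAT_QUERY_BLOCKS,		"BLOCKS" },
	{ XSTAT_QUERY_INODE_GENERATION,	"INODE_GENERATION" },
	{ XSTAT_QUERY_DATA_VERSION,	"DATA_VERSION" },
};

/* Write a '|'-separated list of the flag names set in qf into buf. */
static char *xstat_flags_to_string(unsigned long long qf, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(xstat_flag_names) / sizeof(xstat_flag_names[0]); i++) {
		int n;

		if (!(qf & xstat_flag_names[i].bit))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", xstat_flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the name that did not fit */
			break;
		}
		used += (size_t)n;
	}
	return buf;
}
```

Read this way, 0x77 is every defined flag except XSTAT_QUERY_CREATION_TIME
(0x08), and 0x17 is the OR of the size, nlink, a/m/c times and blocks flags.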


Things that need consideration:

(1) Is it worth retaining the ability to arbitrarily add extra bits onto the
end of the stat buffer? And what's the best way to do this?

I've defined a way that from userspace involves assigning bits in
query_flags to extra results that you might want. But this could instead
be done, say, by just upping the struct version number any time we want to
pass back more information. Alternatively, we could go for a tagged data
method, perhaps using the same format as the recvmsg() control message
field.

If we use tagged data then rather than being selective, we could just
return as many tagged data items as we feel the user might want and we can
cram into the buffer. That could be rather slow, though.

(2) What extra bits of information might we like to see available through the
stat interface? Security labels? NFS file IDs? Xattrs?

If we went for a tagged data method, xstat() could be modified to take a
list of tags as an argument, and could then return arbitrarily-sized
tagged results, including fs-specific stuff.

(3) Does st_blksize really need to be 64 bits on a 64-bit system? Or can it
be 32 bits? Are we really likely to see something with a 4GB+ blocksize?

(4) Should the inode number and data version number fields be 128-bit?
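
For discussion purposes, the tagged data alternative raised in points (1) and
(2) could look roughly like the sketch below. Nothing here comes from the
patches: the record header, the 8-byte alignment, the tag numbers and the
helpers xstat_tag_put() and xstat_tag_find() are all assumptions, loosely
modelled on the length-plus-type header used for recvmsg() control messages.
It only shows how a caller might walk variable-sized tagged results in a
buffer.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record header: total length (header + payload) and a tag. */
struct xstat_tag_hdr {
	unsigned int	len;
	unsigned int	tag;
};

/* Records are assumed to start on 8-byte boundaries. */
#define XSTAT_TAG_ALIGN(n)	(((size_t)(n) + 7) & ~(size_t)7)

/* Append one record at offset off; returns the new offset, or 0 if it
 * does not fit. */
static size_t xstat_tag_put(void *buf, size_t buflen, size_t off,
			    unsigned int tag, const void *data, size_t len)
{
	struct xstat_tag_hdr hdr;
	size_t room, padded;

	if (off > buflen)
		return 0;
	room = buflen - off;
	if (room < sizeof(hdr) || len > room - sizeof(hdr) ||
	    len > UINT_MAX - sizeof(hdr))
		return 0;
	padded = XSTAT_TAG_ALIGN(sizeof(hdr) + len);
	if (padded < sizeof(hdr) + len || padded > room)
		return 0;

	hdr.len = (unsigned int)(sizeof(hdr) + len);
	hdr.tag = tag;
	memset((unsigned char *)buf + off, 0, padded);
	memcpy((unsigned char *)buf + off, &hdr, sizeof(hdr));
	if (len)
		memcpy((unsigned char *)buf + off + sizeof(hdr), data, len);
	return off + padded;
}

/* Find the first record carrying tag; returns its payload or NULL. */
static const void *xstat_tag_find(const void *buf, size_t buflen,
				  unsigned int tag, size_t *payload_len)
{
	const unsigned char *p = buf;
	size_t off = 0;

	assert(payload_len != NULL);
	while (buflen - off >= sizeof(struct xstat_tag_hdr)) {
		struct xstat_tag_hdr hdr;
		size_t step;

		memcpy(&hdr, p + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > buflen - off)
			return NULL;	/* malformed record */
		if (hdr.tag == tag) {
			*payload_len = hdr.len - sizeof(hdr);
			return p + off + sizeof(hdr);
		}
		step = XSTAT_TAG_ALIGN(hdr.len);
		if (step > buflen - off)
			break;		/* padding runs past the end */
		off += step;
	}
	return NULL;
}
```

The cost mentioned in point (1) shows up here as the linear walk in
xstat_tag_find(): every lookup has to step over the records in front of the
one it wants.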

David
---

David Howells (3):
Add a pair of system calls to make extended file stats available
AFS: Use i_generation not i_version for the vnode uniquifier
Mark arguments to certain syscalls as being const


arch/alpha/kernel/osf_sys.c | 6 +
arch/alpha/kernel/process.c | 2
arch/arm/kernel/sys_arm.c | 4 -
arch/arm/kernel/sys_oabi-compat.c | 6 +
arch/avr32/include/asm/syscalls.h | 2
arch/avr32/kernel/process.c | 3 -
arch/blackfin/kernel/process.c | 2
arch/frv/kernel/process.c | 3 -
arch/h8300/kernel/process.c | 2
arch/ia64/include/asm/unistd.h | 2
arch/ia64/kernel/process.c | 2
arch/m32r/kernel/process.c | 3 -
arch/m68k/kernel/process.c | 2
arch/m68knommu/kernel/process.c | 2
arch/microblaze/kernel/sys_microblaze.c | 2
arch/mips/kernel/syscall.c | 2
arch/mn10300/kernel/process.c | 2
arch/parisc/hpux/fs.c | 7 +
arch/powerpc/kernel/process.c | 2
arch/powerpc/kernel/sys_ppc32.c | 2
arch/s390/kernel/compat_linux.c | 10 +-
arch/s390/kernel/compat_linux.h | 10 +-
arch/s390/kernel/entry.h | 2
arch/s390/kernel/process.c | 2
arch/sh/include/asm/syscalls_32.h | 2
arch/sh/include/asm/syscalls_64.h | 2
arch/sh/kernel/process_64.c | 2
arch/sparc/kernel/sys_sparc32.c | 7 +
arch/um/kernel/exec.c | 6 +
arch/um/kernel/internal.h | 2
arch/um/kernel/syscall.c | 2
arch/x86/ia32/sys_ia32.c | 14 +-
arch/x86/include/asm/sys_ia32.h | 12 +-
arch/x86/include/asm/syscalls.h | 2
arch/x86/include/asm/unistd_32.h | 4 +
arch/x86/include/asm/unistd_64.h | 4 +
arch/x86/kernel/entry_64.S | 4 -
arch/x86/kernel/process.c | 2
arch/xtensa/kernel/process.c | 2
fs/afs/dir.c | 8 +
fs/afs/fsclient.c | 3 -
fs/afs/inode.c | 22 ++--
fs/compat.c | 23 ++--
fs/ecryptfs/inode.c | 1
fs/ext4/ext4.h | 2
fs/ext4/file.c | 2
fs/ext4/inode.c | 27 ++++-
fs/ext4/namei.c | 2
fs/ext4/symlink.c | 2
fs/nfs/inode.c | 38 +++++--
fs/nfsd/nfs3proc.c | 1
fs/nfsd/nfs3xdr.c | 2
fs/nfsd/nfs4xdr.c | 2
fs/nfsd/nfsproc.c | 3 +
fs/nfsd/nfsxdr.c | 1
fs/stat.c | 178 ++++++++++++++++++++++++++++---
fs/utimes.c | 7 +
include/linux/compat.h | 6 +
include/linux/fs.h | 8 +
include/linux/stat.h | 88 +++++++++++++++
include/linux/syscalls.h | 25 +++-
include/linux/time.h | 2
62 files changed, 456 insertions(+), 146 deletions(-)


2010-06-30 01:17:23

by David Howells

Subject: [PATCH 1/3] Mark arguments to certain syscalls as being const [ver #2]

Mark arguments to certain system calls as being const where they should be but
aren't. The list includes:

(*) The filename arguments of various stat syscalls, execve(), various utimes
syscalls and some mount syscalls.

(*) The filename arguments of some syscall helpers relating to the above.

(*) The buffer argument of various write syscalls.
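
As a small userspace illustration of why this helps (not code from the patch),
the sketch below uses a made-up read-only helper, count_path_components(), in
place of the kernel functions. Because the helper takes a const pointer, a
caller that only holds a const char * can pass it straight through, which is
the same effect that lets this patch drop the (char *) casts in the
kernel_execve() implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper that only reads the name it is given. */
static size_t count_path_components(const char *filename)
{
	size_t n = 0;
	int in_component = 0;

	assert(filename != NULL);
	for (; *filename; filename++) {
		if (*filename == '/') {
			in_component = 0;
		} else if (!in_component) {
			in_component = 1;
			n++;
		}
	}
	return n;
}

/* A caller that only has a const pointer, as kernel_execve() does. */
static size_t caller_with_const(const char *filename)
{
	/* No (char *) cast is needed: the callee takes const char *. */
	return count_path_components(filename);
}

/* A caller with a writable buffer can still use the same helper. */
static size_t caller_with_buffer(void)
{
	char path[] = "/sbin/init";

	return count_path_components(path);
}
```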

Signed-off-by: David Howells <[email protected]>
---

arch/alpha/kernel/osf_sys.c | 6 +++---
arch/alpha/kernel/process.c | 2 +-
arch/arm/kernel/sys_arm.c | 4 ++--
arch/arm/kernel/sys_oabi-compat.c | 6 +++---
arch/avr32/include/asm/syscalls.h | 2 +-
arch/avr32/kernel/process.c | 3 ++-
arch/blackfin/kernel/process.c | 2 +-
arch/frv/kernel/process.c | 3 ++-
arch/h8300/kernel/process.c | 2 +-
arch/ia64/include/asm/unistd.h | 2 +-
arch/ia64/kernel/process.c | 2 +-
arch/m32r/kernel/process.c | 3 ++-
arch/m68k/kernel/process.c | 2 +-
arch/m68knommu/kernel/process.c | 2 +-
arch/microblaze/kernel/sys_microblaze.c | 2 +-
arch/mips/kernel/syscall.c | 2 +-
arch/mn10300/kernel/process.c | 2 +-
arch/parisc/hpux/fs.c | 7 ++++---
arch/powerpc/kernel/process.c | 2 +-
arch/powerpc/kernel/sys_ppc32.c | 2 +-
arch/s390/kernel/compat_linux.c | 10 +++++-----
arch/s390/kernel/compat_linux.h | 10 +++++-----
arch/s390/kernel/entry.h | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/include/asm/syscalls_32.h | 2 +-
arch/sh/include/asm/syscalls_64.h | 2 +-
arch/sh/kernel/process_64.c | 2 +-
arch/sparc/kernel/sys_sparc32.c | 7 ++++---
arch/um/kernel/exec.c | 6 +++---
arch/um/kernel/internal.h | 2 +-
arch/um/kernel/syscall.c | 2 +-
arch/x86/ia32/sys_ia32.c | 14 +++++++-------
arch/x86/include/asm/sys_ia32.h | 12 ++++++------
arch/x86/include/asm/syscalls.h | 2 +-
arch/x86/kernel/entry_64.S | 4 ++--
arch/x86/kernel/process.c | 2 +-
arch/xtensa/kernel/process.c | 2 +-
fs/compat.c | 23 +++++++++++++----------
fs/stat.c | 29 ++++++++++++++++++-----------
fs/utimes.c | 7 ++++---
include/linux/compat.h | 6 +++---
include/linux/fs.h | 6 +++---
include/linux/syscalls.h | 20 ++++++++++----------
include/linux/time.h | 2 +-
44 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index de9d397..1719fe3 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -244,7 +244,7 @@ do_osf_statfs(struct dentry * dentry, struct osf_statfs __user *buffer,
return error;
}

-SYSCALL_DEFINE3(osf_statfs, char __user *, pathname,
+SYSCALL_DEFINE3(osf_statfs, const char __user *, pathname,
struct osf_statfs __user *, buffer, unsigned long, bufsiz)
{
struct path path;
@@ -358,7 +358,7 @@ osf_procfs_mount(char *dirname, struct procfs_args __user *args, int flags)
return do_mount("", dirname, "proc", flags, NULL);
}

-SYSCALL_DEFINE4(osf_mount, unsigned long, typenr, char __user *, path,
+SYSCALL_DEFINE4(osf_mount, unsigned long, typenr, const char __user *, path,
int, flag, void __user *, data)
{
int retval;
@@ -932,7 +932,7 @@ SYSCALL_DEFINE3(osf_setitimer, int, which, struct itimerval32 __user *, in,

}

-SYSCALL_DEFINE2(osf_utimes, char __user *, filename,
+SYSCALL_DEFINE2(osf_utimes, const char __user *, filename,
struct timeval32 __user *, tvs)
{
struct timespec tv[2];
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 395a464..88e608a 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -387,7 +387,7 @@ EXPORT_SYMBOL(dump_elf_task_fp);
* sys_execve() executes a new program.
*/
asmlinkage int
-do_sys_execve(char __user *ufilename, char __user * __user *argv,
+do_sys_execve(const char __user *ufilename, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
int error;
diff --git a/arch/arm/kernel/sys_arm.c b/arch/arm/kernel/sys_arm.c
index c235018..5b7c541 100644
--- a/arch/arm/kernel/sys_arm.c
+++ b/arch/arm/kernel/sys_arm.c
@@ -62,7 +62,7 @@ asmlinkage int sys_vfork(struct pt_regs *regs)
/* sys_execve() executes a new program.
* This is called indirectly via a small wrapper
*/
-asmlinkage int sys_execve(char __user *filenamei, char __user * __user *argv,
+asmlinkage int sys_execve(const char __user *filenamei, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
int error;
@@ -84,7 +84,7 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[])
int ret;

memset(&regs, 0, sizeof(struct pt_regs));
- ret = do_execve((char *)filename, (char __user * __user *)argv,
+ ret = do_execve(filename, (char __user * __user *)argv,
(char __user * __user *)envp, &regs);
if (ret < 0)
goto out;
diff --git a/arch/arm/kernel/sys_oabi-compat.c b/arch/arm/kernel/sys_oabi-compat.c
index 33ff678..4ad8da1 100644
--- a/arch/arm/kernel/sys_oabi-compat.c
+++ b/arch/arm/kernel/sys_oabi-compat.c
@@ -141,7 +141,7 @@ static long cp_oldabi_stat64(struct kstat *stat,
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-asmlinkage long sys_oabi_stat64(char __user * filename,
+asmlinkage long sys_oabi_stat64(const char __user * filename,
struct oldabi_stat64 __user * statbuf)
{
struct kstat stat;
@@ -151,7 +151,7 @@ asmlinkage long sys_oabi_stat64(char __user * filename,
return error;
}

-asmlinkage long sys_oabi_lstat64(char __user * filename,
+asmlinkage long sys_oabi_lstat64(const char __user * filename,
struct oldabi_stat64 __user * statbuf)
{
struct kstat stat;
@@ -172,7 +172,7 @@ asmlinkage long sys_oabi_fstat64(unsigned long fd,
}

asmlinkage long sys_oabi_fstatat64(int dfd,
- char __user *filename,
+ const char __user *filename,
struct oldabi_stat64 __user *statbuf,
int flag)
{
diff --git a/arch/avr32/include/asm/syscalls.h b/arch/avr32/include/asm/syscalls.h
index 66a1972..ab608b7 100644
--- a/arch/avr32/include/asm/syscalls.h
+++ b/arch/avr32/include/asm/syscalls.h
@@ -21,7 +21,7 @@ asmlinkage int sys_clone(unsigned long, unsigned long,
unsigned long, unsigned long,
struct pt_regs *);
asmlinkage int sys_vfork(struct pt_regs *);
-asmlinkage int sys_execve(char __user *, char __user *__user *,
+asmlinkage int sys_execve(const char __user *, char __user *__user *,
char __user *__user *, struct pt_regs *);

/* kernel/signal.c */
diff --git a/arch/avr32/kernel/process.c b/arch/avr32/kernel/process.c
index 2d76515..e5daddf 100644
--- a/arch/avr32/kernel/process.c
+++ b/arch/avr32/kernel/process.c
@@ -383,7 +383,8 @@ asmlinkage int sys_vfork(struct pt_regs *regs)
0, NULL, NULL);
}

-asmlinkage int sys_execve(char __user *ufilename, char __user *__user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename,
+ char __user *__user *uargv,
char __user *__user *uenvp, struct pt_regs *regs)
{
int error;
diff --git a/arch/blackfin/kernel/process.c b/arch/blackfin/kernel/process.c
index 93ec07d..a566f61 100644
--- a/arch/blackfin/kernel/process.c
+++ b/arch/blackfin/kernel/process.c
@@ -209,7 +209,7 @@ copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv, char __user * __user *envp)
{
int error;
char *filename;
diff --git a/arch/frv/kernel/process.c b/arch/frv/kernel/process.c
index 21d0fd1..428931c 100644
--- a/arch/frv/kernel/process.c
+++ b/arch/frv/kernel/process.c
@@ -250,7 +250,8 @@ int copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv,
+ char __user * __user *envp)
{
int error;
char * filename;
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 8c8b0ff..8b7b78d 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -212,7 +212,7 @@ int copy_thread(unsigned long clone_flags,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *name, char **argv, char **envp,int dummy,...)
+asmlinkage int sys_execve(const char *name, char **argv, char **envp,int dummy,...)
{
int error;
char * filename;
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index bb8b0ff..46f36fc 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -353,7 +353,7 @@ asmlinkage unsigned long sys_mmap2(
int fd, long pgoff);
struct pt_regs;
struct sigaction;
-long sys_execve(char __user *filename, char __user * __user *argv,
+long sys_execve(const char __user *filename, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs);
asmlinkage long sys_ia64_pipe(void);
asmlinkage long sys_rt_sigaction(int sig,
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index 53f1648..a879c03 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -633,7 +633,7 @@ dump_fpu (struct pt_regs *pt, elf_fpregset_t dst)
}

long
-sys_execve (char __user *filename, char __user * __user *argv, char __user * __user *envp,
+sys_execve (const char __user *filename, char __user * __user *argv, char __user * __user *envp,
struct pt_regs *regs)
{
char *fname;
diff --git a/arch/m32r/kernel/process.c b/arch/m32r/kernel/process.c
index bc8c8c1..8665a4d 100644
--- a/arch/m32r/kernel/process.c
+++ b/arch/m32r/kernel/process.c
@@ -288,7 +288,8 @@ asmlinkage int sys_vfork(unsigned long r0, unsigned long r1, unsigned long r2,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *ufilename, char __user * __user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename,
+ char __user * __user *uargv,
char __user * __user *uenvp,
unsigned long r3, unsigned long r4, unsigned long r5,
unsigned long r6, struct pt_regs regs)
diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c
index 1a6be27..221d0b7 100644
--- a/arch/m68k/kernel/process.c
+++ b/arch/m68k/kernel/process.c
@@ -315,7 +315,7 @@ EXPORT_SYMBOL(dump_fpu);
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char __user *name, char __user * __user *argv, char __user * __user *envp)
+asmlinkage int sys_execve(const char __user *name, char __user * __user *argv, char __user * __user *envp)
{
int error;
char * filename;
diff --git a/arch/m68knommu/kernel/process.c b/arch/m68knommu/kernel/process.c
index 6aa6613..6350f68 100644
--- a/arch/m68knommu/kernel/process.c
+++ b/arch/m68knommu/kernel/process.c
@@ -350,7 +350,7 @@ void dump(struct pt_regs *fp)
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *name, char **argv, char **envp)
+asmlinkage int sys_execve(const char *name, char **argv, char **envp)
{
int error;
char * filename;
diff --git a/arch/microblaze/kernel/sys_microblaze.c b/arch/microblaze/kernel/sys_microblaze.c
index f4e00b7..6abab6e 100644
--- a/arch/microblaze/kernel/sys_microblaze.c
+++ b/arch/microblaze/kernel/sys_microblaze.c
@@ -47,7 +47,7 @@ asmlinkage long microblaze_clone(int flags, unsigned long stack, struct pt_regs
return do_fork(flags, stack, regs, 0, NULL, NULL);
}

-asmlinkage long microblaze_execve(char __user *filenamei, char __user *__user *argv,
+asmlinkage long microblaze_execve(const char __user *filenamei, char __user *__user *argv,
char __user *__user *envp, struct pt_regs *regs)
{
int error;
diff --git a/arch/mips/kernel/syscall.c b/arch/mips/kernel/syscall.c
index dd81b0f..6322c39 100644
--- a/arch/mips/kernel/syscall.c
+++ b/arch/mips/kernel/syscall.c
@@ -207,7 +207,7 @@ asmlinkage int sys_execve(nabi_no_regargs struct pt_regs regs)
int error;
char * filename;

- filename = getname((char __user *) (long)regs.regs[4]);
+ filename = getname((const char __user *) (long)regs.regs[4]);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
diff --git a/arch/mn10300/kernel/process.c b/arch/mn10300/kernel/process.c
index 82b817c..762eb32 100644
--- a/arch/mn10300/kernel/process.c
+++ b/arch/mn10300/kernel/process.c
@@ -268,7 +268,7 @@ asmlinkage long sys_vfork(void)
0, NULL, NULL);
}

-asmlinkage long sys_execve(char __user *name,
+asmlinkage long sys_execve(const char __user *name,
char __user * __user *argv,
char __user * __user *envp)
{
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 6935123..1444875 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -36,7 +36,7 @@ int hpux_execve(struct pt_regs *regs)
int error;
char *filename;

- filename = getname((char __user *) regs->gr[26]);
+ filename = getname((const char __user *) regs->gr[26]);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
@@ -169,7 +169,7 @@ static int cp_hpux_stat(struct kstat *stat, struct hpux_stat64 __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-long hpux_stat64(char __user *filename, struct hpux_stat64 __user *statbuf)
+long hpux_stat64(const char __user *filename, struct hpux_stat64 __user *statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -191,7 +191,8 @@ long hpux_fstat64(unsigned int fd, struct hpux_stat64 __user *statbuf)
return error;
}

-long hpux_lstat64(char __user *filename, struct hpux_stat64 __user *statbuf)
+long hpux_lstat64(const char __user *filename,
+ struct hpux_stat64 __user *statbuf)
{
struct kstat stat;
int error = vfs_lstat(filename, &stat);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 773424d..3ef6ed4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -991,7 +991,7 @@ int sys_execve(unsigned long a0, unsigned long a1, unsigned long a2,
int error;
char *filename;

- filename = getname((char __user *) a0);
+ filename = getname((const char __user *) a0);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index 19471a1..20fd701 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -546,7 +546,7 @@ compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_siz
return sys_pread64(fd, ubuf, count, ((loff_t)poshi << 32) | poslo);
}

-compat_ssize_t compat_sys_pwrite64(unsigned int fd, char __user *ubuf, compat_size_t count,
+compat_ssize_t compat_sys_pwrite64(unsigned int fd, const char __user *ubuf, compat_size_t count,
u32 reg6, u32 poshi, u32 poslo)
{
return sys_pwrite64(fd, ubuf, count, ((loff_t)poshi << 32) | poslo);
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 73b624e..1e6449c 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -436,7 +436,7 @@ sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo)
* sys32_execve() executes a new program after the asm stub has set
* things up for us. This should basically do what I want it to.
*/
-asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+asmlinkage long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp)
{
struct pt_regs *regs = task_pt_regs(current);
@@ -570,7 +570,7 @@ static int cp_stat64(struct stat64_emu31 __user *ubuf, struct kstat *stat)
return copy_to_user(ubuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-asmlinkage long sys32_stat64(char __user * filename, struct stat64_emu31 __user * statbuf)
+asmlinkage long sys32_stat64(const char __user * filename, struct stat64_emu31 __user * statbuf)
{
struct kstat stat;
int ret = vfs_stat(filename, &stat);
@@ -579,7 +579,7 @@ asmlinkage long sys32_stat64(char __user * filename, struct stat64_emu31 __user
return ret;
}

-asmlinkage long sys32_lstat64(char __user * filename, struct stat64_emu31 __user * statbuf)
+asmlinkage long sys32_lstat64(const char __user * filename, struct stat64_emu31 __user * statbuf)
{
struct kstat stat;
int ret = vfs_lstat(filename, &stat);
@@ -597,7 +597,7 @@ asmlinkage long sys32_fstat64(unsigned long fd, struct stat64_emu31 __user * sta
return ret;
}

-asmlinkage long sys32_fstatat64(unsigned int dfd, char __user *filename,
+asmlinkage long sys32_fstatat64(unsigned int dfd, const char __user *filename,
struct stat64_emu31 __user* statbuf, int flag)
{
struct kstat stat;
@@ -655,7 +655,7 @@ asmlinkage long sys32_read(unsigned int fd, char __user * buf, size_t count)
return sys_read(fd, buf, count);
}

-asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
+asmlinkage long sys32_write(unsigned int fd, const char __user * buf, size_t count)
{
if ((compat_ssize_t) count < 0)
return -EINVAL;
diff --git a/arch/s390/kernel/compat_linux.h b/arch/s390/kernel/compat_linux.h
index cb97afc..9635d75 100644
--- a/arch/s390/kernel/compat_linux.h
+++ b/arch/s390/kernel/compat_linux.h
@@ -193,7 +193,7 @@ long sys32_rt_sigprocmask(int how, compat_sigset_t __user *set,
compat_sigset_t __user *oset, size_t sigsetsize);
long sys32_rt_sigpending(compat_sigset_t __user *set, size_t sigsetsize);
long sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo);
-long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp);
long sys32_init_module(void __user *umod, unsigned long len,
const char __user *uargs);
@@ -207,16 +207,16 @@ long sys32_sendfile(int out_fd, int in_fd, compat_off_t __user *offset,
size_t count);
long sys32_sendfile64(int out_fd, int in_fd, compat_loff_t __user *offset,
s32 count);
-long sys32_stat64(char __user * filename, struct stat64_emu31 __user * statbuf);
-long sys32_lstat64(char __user * filename,
+long sys32_stat64(const char __user * filename, struct stat64_emu31 __user * statbuf);
+long sys32_lstat64(const char __user * filename,
struct stat64_emu31 __user * statbuf);
long sys32_fstat64(unsigned long fd, struct stat64_emu31 __user * statbuf);
-long sys32_fstatat64(unsigned int dfd, char __user *filename,
+long sys32_fstatat64(unsigned int dfd, const char __user *filename,
struct stat64_emu31 __user* statbuf, int flag);
unsigned long old32_mmap(struct mmap_arg_struct_emu31 __user *arg);
long sys32_mmap2(struct mmap_arg_struct_emu31 __user *arg);
long sys32_read(unsigned int fd, char __user * buf, size_t count);
-long sys32_write(unsigned int fd, char __user * buf, size_t count);
+long sys32_write(unsigned int fd, const char __user * buf, size_t count);
long sys32_fadvise64(int fd, loff_t offset, size_t len, int advise);
long sys32_fadvise64_64(struct fadvise64_64_args __user *args);
long sys32_sigaction(int sig, const struct old_sigaction32 __user *act,
diff --git a/arch/s390/kernel/entry.h b/arch/s390/kernel/entry.h
index eb15c12..e2c048b 100644
--- a/arch/s390/kernel/entry.h
+++ b/arch/s390/kernel/entry.h
@@ -42,7 +42,7 @@ long sys_clone(unsigned long newsp, unsigned long clone_flags,
int __user *parent_tidptr, int __user *child_tidptr);
long sys_vfork(void);
void execve_tail(void);
-long sys_execve(char __user *name, char __user * __user *argv,
+long sys_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp);
long sys_sigsuspend(int history0, int history1, old_sigset_t mask);
long sys_sigaction(int sig, const struct old_sigaction __user *act,
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 1039fde..7eafaf2 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -267,7 +267,7 @@ asmlinkage void execve_tail(void)
/*
* sys_execve() executes a new program.
*/
-SYSCALL_DEFINE3(execve, char __user *, name, char __user * __user *, argv,
+SYSCALL_DEFINE3(execve, const char __user *, name, char __user * __user *, argv,
char __user * __user *, envp)
{
struct pt_regs *regs = task_pt_regs(current);
diff --git a/arch/sh/include/asm/syscalls_32.h b/arch/sh/include/asm/syscalls_32.h
index 8b30200..be201fd 100644
--- a/arch/sh/include/asm/syscalls_32.h
+++ b/arch/sh/include/asm/syscalls_32.h
@@ -19,7 +19,7 @@ asmlinkage int sys_clone(unsigned long clone_flags, unsigned long newsp,
asmlinkage int sys_vfork(unsigned long r4, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs __regs);
-asmlinkage int sys_execve(char __user *ufilename, char __user * __user *uargv,
+asmlinkage int sys_execve(const char __user *ufilename, char __user * __user *uargv,
char __user * __user *uenvp, unsigned long r7,
struct pt_regs __regs);
asmlinkage int sys_sigsuspend(old_sigset_t mask, unsigned long r5,
diff --git a/arch/sh/include/asm/syscalls_64.h b/arch/sh/include/asm/syscalls_64.h
index 751fd88..ee519f4 100644
--- a/arch/sh/include/asm/syscalls_64.h
+++ b/arch/sh/include/asm/syscalls_64.h
@@ -21,7 +21,7 @@ asmlinkage int sys_vfork(unsigned long r2, unsigned long r3,
unsigned long r4, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs);
-asmlinkage int sys_execve(char *ufilename, char **uargv,
+asmlinkage int sys_execve(const char *ufilename, char **uargv,
char **uenvp, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs);
diff --git a/arch/sh/kernel/process_64.c b/arch/sh/kernel/process_64.c
index d4ca648..68d128d 100644
--- a/arch/sh/kernel/process_64.c
+++ b/arch/sh/kernel/process_64.c
@@ -483,7 +483,7 @@ asmlinkage int sys_vfork(unsigned long r2, unsigned long r3,
/*
* sys_execve() executes a new program.
*/
-asmlinkage int sys_execve(char *ufilename, char **uargv,
+asmlinkage int sys_execve(const char *ufilename, char **uargv,
char **uenvp, unsigned long r5,
unsigned long r6, unsigned long r7,
struct pt_regs *pregs)
diff --git a/arch/sparc/kernel/sys_sparc32.c b/arch/sparc/kernel/sys_sparc32.c
index c0ca875..e6375a7 100644
--- a/arch/sparc/kernel/sys_sparc32.c
+++ b/arch/sparc/kernel/sys_sparc32.c
@@ -162,7 +162,7 @@ static int cp_compat_stat64(struct kstat *stat,
return err;
}

-asmlinkage long compat_sys_stat64(char __user * filename,
+asmlinkage long compat_sys_stat64(const char __user * filename,
struct compat_stat64 __user *statbuf)
{
struct kstat stat;
@@ -173,7 +173,7 @@ asmlinkage long compat_sys_stat64(char __user * filename,
return error;
}

-asmlinkage long compat_sys_lstat64(char __user * filename,
+asmlinkage long compat_sys_lstat64(const char __user * filename,
struct compat_stat64 __user *statbuf)
{
struct kstat stat;
@@ -195,7 +195,8 @@ asmlinkage long compat_sys_fstat64(unsigned int fd,
return error;
}

-asmlinkage long compat_sys_fstatat64(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_fstatat64(unsigned int dfd,
+ const char __user *filename,
struct compat_stat64 __user * statbuf, int flag)
{
struct kstat stat;
diff --git a/arch/um/kernel/exec.c b/arch/um/kernel/exec.c
index 97974c1..59b20d9 100644
--- a/arch/um/kernel/exec.c
+++ b/arch/um/kernel/exec.c
@@ -44,7 +44,7 @@ void start_thread(struct pt_regs *regs, unsigned long eip, unsigned long esp)
PT_REGS_SP(regs) = esp;
}

-static long execve1(char *file, char __user * __user *argv,
+static long execve1(const char *file, char __user * __user *argv,
char __user *__user *env)
{
long error;
@@ -61,7 +61,7 @@ static long execve1(char *file, char __user * __user *argv,
return error;
}

-long um_execve(char *file, char __user *__user *argv, char __user *__user *env)
+long um_execve(const char *file, char __user *__user *argv, char __user *__user *env)
{
long err;

@@ -71,7 +71,7 @@ long um_execve(char *file, char __user *__user *argv, char __user *__user *env)
return err;
}

-long sys_execve(char __user *file, char __user *__user *argv,
+long sys_execve(const char __user *file, char __user *__user *argv,
char __user *__user *env)
{
long error;
diff --git a/arch/um/kernel/internal.h b/arch/um/kernel/internal.h
index 3bda43c..1303a10 100644
--- a/arch/um/kernel/internal.h
+++ b/arch/um/kernel/internal.h
@@ -1 +1 @@
-extern long um_execve(char *file, char __user *__user *argv, char __user *__user *env);
+extern long um_execve(const char *file, char __user *__user *argv, char __user *__user *env);
diff --git a/arch/um/kernel/syscall.c b/arch/um/kernel/syscall.c
index 4393173..7427c0b 100644
--- a/arch/um/kernel/syscall.c
+++ b/arch/um/kernel/syscall.c
@@ -58,7 +58,7 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[])

fs = get_fs();
set_fs(KERNEL_DS);
- ret = um_execve((char *)filename, (char __user *__user *)argv,
+ ret = um_execve(filename, (char __user *__user *)argv,
(char __user *__user *) envp);
set_fs(fs);

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index 626be15..1baddad 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -51,7 +51,7 @@
#define AA(__x) ((unsigned long)(__x))


-asmlinkage long sys32_truncate64(char __user *filename,
+asmlinkage long sys32_truncate64(const char __user *filename,
unsigned long offset_low,
unsigned long offset_high)
{
@@ -96,7 +96,7 @@ static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat)
return 0;
}

-asmlinkage long sys32_stat64(char __user *filename,
+asmlinkage long sys32_stat64(const char __user *filename,
struct stat64 __user *statbuf)
{
struct kstat stat;
@@ -107,7 +107,7 @@ asmlinkage long sys32_stat64(char __user *filename,
return ret;
}

-asmlinkage long sys32_lstat64(char __user *filename,
+asmlinkage long sys32_lstat64(const char __user *filename,
struct stat64 __user *statbuf)
{
struct kstat stat;
@@ -126,7 +126,7 @@ asmlinkage long sys32_fstat64(unsigned int fd, struct stat64 __user *statbuf)
return ret;
}

-asmlinkage long sys32_fstatat(unsigned int dfd, char __user *filename,
+asmlinkage long sys32_fstatat(unsigned int dfd, const char __user *filename,
struct stat64 __user *statbuf, int flag)
{
struct kstat stat;
@@ -408,8 +408,8 @@ asmlinkage long sys32_pread(unsigned int fd, char __user *ubuf, u32 count,
((loff_t)AA(poshi) << 32) | AA(poslo));
}

-asmlinkage long sys32_pwrite(unsigned int fd, char __user *ubuf, u32 count,
- u32 poslo, u32 poshi)
+asmlinkage long sys32_pwrite(unsigned int fd, const char __user *ubuf,
+ u32 count, u32 poslo, u32 poshi)
{
return sys_pwrite64(fd, ubuf, count,
((loff_t)AA(poshi) << 32) | AA(poslo));
@@ -449,7 +449,7 @@ asmlinkage long sys32_sendfile(int out_fd, int in_fd,
return ret;
}

-asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv,
+asmlinkage long sys32_execve(const char __user *name, compat_uptr_t __user *argv,
compat_uptr_t __user *envp, struct pt_regs *regs)
{
long error;
diff --git a/arch/x86/include/asm/sys_ia32.h b/arch/x86/include/asm/sys_ia32.h
index 3ad4217..c8a052a 100644
--- a/arch/x86/include/asm/sys_ia32.h
+++ b/arch/x86/include/asm/sys_ia32.h
@@ -18,13 +18,13 @@
#include <asm/ia32.h>

/* ia32/sys_ia32.c */
-asmlinkage long sys32_truncate64(char __user *, unsigned long, unsigned long);
+asmlinkage long sys32_truncate64(const char __user *, unsigned long, unsigned long);
asmlinkage long sys32_ftruncate64(unsigned int, unsigned long, unsigned long);

-asmlinkage long sys32_stat64(char __user *, struct stat64 __user *);
-asmlinkage long sys32_lstat64(char __user *, struct stat64 __user *);
+asmlinkage long sys32_stat64(const char __user *, struct stat64 __user *);
+asmlinkage long sys32_lstat64(const char __user *, struct stat64 __user *);
asmlinkage long sys32_fstat64(unsigned int, struct stat64 __user *);
-asmlinkage long sys32_fstatat(unsigned int, char __user *,
+asmlinkage long sys32_fstatat(unsigned int, const char __user *,
struct stat64 __user *, int);
struct mmap_arg_struct32;
asmlinkage long sys32_mmap(struct mmap_arg_struct32 __user *);
@@ -49,12 +49,12 @@ asmlinkage long sys32_rt_sigpending(compat_sigset_t __user *, compat_size_t);
asmlinkage long sys32_rt_sigqueueinfo(int, int, compat_siginfo_t __user *);

asmlinkage long sys32_pread(unsigned int, char __user *, u32, u32, u32);
-asmlinkage long sys32_pwrite(unsigned int, char __user *, u32, u32, u32);
+asmlinkage long sys32_pwrite(unsigned int, const char __user *, u32, u32, u32);

asmlinkage long sys32_personality(unsigned long);
asmlinkage long sys32_sendfile(int, int, compat_off_t __user *, s32);

-asmlinkage long sys32_execve(char __user *, compat_uptr_t __user *,
+asmlinkage long sys32_execve(const char __user *, compat_uptr_t __user *,
compat_uptr_t __user *, struct pt_regs *);
asmlinkage long sys32_clone(unsigned int, unsigned int, struct pt_regs *);

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 5c044b4..feb2ff9 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -23,7 +23,7 @@ long sys_iopl(unsigned int, struct pt_regs *);
/* kernel/process.c */
int sys_fork(struct pt_regs *);
int sys_vfork(struct pt_regs *);
-long sys_execve(char __user *, char __user * __user *,
+long sys_execve(const char __user *, char __user * __user *,
char __user * __user *, struct pt_regs *);
long sys_clone(unsigned long, unsigned long, void __user *,
void __user *, struct pt_regs *);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..77f5986 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1185,13 +1185,13 @@ END(kernel_thread_helper)
* execve(). This function needs to use IRET, not SYSRET, to set up all state properly.
*
* C extern interface:
- * extern long execve(char *name, char **argv, char **envp)
+ * extern long execve(const char *name, char **argv, char **envp)
*
* asm input arguments:
* rdi: name, rsi: argv, rdx: envp
*
* We want to fallback into:
- * extern long sys_execve(char *name, char **argv,char **envp, struct pt_regs *regs)
+ * extern long sys_execve(const char *name, char **argv,char **envp, struct pt_regs *regs)
*
* do_sys_execve asm fallback arguments:
* rdi: name, rsi: argv, rdx: envp, rcx: fake frame on the stack
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e7e3521..f5c816e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -300,7 +300,7 @@ EXPORT_SYMBOL(kernel_thread);
/*
* sys_execve() executes a new program.
*/
-long sys_execve(char __user *name, char __user * __user *argv,
+long sys_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp, struct pt_regs *regs)
{
long error;
diff --git a/arch/xtensa/kernel/process.c b/arch/xtensa/kernel/process.c
index f167e0f..7c2f38f 100644
--- a/arch/xtensa/kernel/process.c
+++ b/arch/xtensa/kernel/process.c
@@ -318,7 +318,7 @@ long xtensa_clone(unsigned long clone_flags, unsigned long newsp,
*/

asmlinkage
-long xtensa_execve(char __user *name, char __user * __user *argv,
+long xtensa_execve(const char __user *name, char __user * __user *argv,
char __user * __user *envp,
long a3, long a4, long a5,
struct pt_regs *regs)
diff --git a/fs/compat.c b/fs/compat.c
index 6490d21..d72591a 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -76,7 +76,8 @@ int compat_printk(const char *fmt, ...)
* Not all architectures have sys_utime, so implement this in terms
* of sys_utimes.
*/
-asmlinkage long compat_sys_utime(char __user *filename, struct compat_utimbuf __user *t)
+asmlinkage long compat_sys_utime(const char __user *filename,
+ struct compat_utimbuf __user *t)
{
struct timespec tv[2];

@@ -90,7 +91,7 @@ asmlinkage long compat_sys_utime(char __user *filename, struct compat_utimbuf __
return do_utimes(AT_FDCWD, filename, t ? tv : NULL, 0);
}

-asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename, struct compat_timespec __user *t, int flags)
+asmlinkage long compat_sys_utimensat(unsigned int dfd, const char __user *filename, struct compat_timespec __user *t, int flags)
{
struct timespec tv[2];

@@ -105,7 +106,7 @@ asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename, st
return do_utimes(dfd, filename, t ? tv : NULL, flags);
}

-asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename, struct compat_timeval __user *t)
+asmlinkage long compat_sys_futimesat(unsigned int dfd, const char __user *filename, struct compat_timeval __user *t)
{
struct timespec tv[2];

@@ -124,7 +125,7 @@ asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename, st
return do_utimes(dfd, filename, t ? tv : NULL, 0);
}

-asmlinkage long compat_sys_utimes(char __user *filename, struct compat_timeval __user *t)
+asmlinkage long compat_sys_utimes(const char __user *filename, struct compat_timeval __user *t)
{
return compat_sys_futimesat(AT_FDCWD, filename, t);
}
@@ -168,7 +169,7 @@ static int cp_compat_stat(struct kstat *stat, struct compat_stat __user *ubuf)
return err;
}

-asmlinkage long compat_sys_newstat(char __user * filename,
+asmlinkage long compat_sys_newstat(const char __user * filename,
struct compat_stat __user *statbuf)
{
struct kstat stat;
@@ -180,7 +181,7 @@ asmlinkage long compat_sys_newstat(char __user * filename,
return cp_compat_stat(&stat, statbuf);
}

-asmlinkage long compat_sys_newlstat(char __user * filename,
+asmlinkage long compat_sys_newlstat(const char __user * filename,
struct compat_stat __user *statbuf)
{
struct kstat stat;
@@ -193,7 +194,8 @@ asmlinkage long compat_sys_newlstat(char __user * filename,
}

#ifndef __ARCH_WANT_STAT64
-asmlinkage long compat_sys_newfstatat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_newfstatat(unsigned int dfd,
+ const char __user *filename,
struct compat_stat __user *statbuf, int flag)
{
struct kstat stat;
@@ -836,9 +838,10 @@ static int do_nfs4_super_data_conv(void *raw_data)
#define NCPFS_NAME "ncpfs"
#define NFS4_NAME "nfs4"

-asmlinkage long compat_sys_mount(char __user * dev_name, char __user * dir_name,
- char __user * type, unsigned long flags,
- void __user * data)
+asmlinkage long compat_sys_mount(const char __user * dev_name,
+ const char __user * dir_name,
+ const char __user * type, unsigned long flags,
+ const void __user * data)
{
char *kernel_type;
unsigned long data_page;
diff --git a/fs/stat.c b/fs/stat.c
index c4ecd52..12e90e2 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -68,7 +68,8 @@ int vfs_fstat(unsigned int fd, struct kstat *stat)
}
EXPORT_SYMBOL(vfs_fstat);

-int vfs_fstatat(int dfd, char __user *filename, struct kstat *stat, int flag)
+int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
+ int flag)
{
struct path path;
int error = -EINVAL;
@@ -91,13 +92,13 @@ out:
}
EXPORT_SYMBOL(vfs_fstatat);

-int vfs_stat(char __user *name, struct kstat *stat)
+int vfs_stat(const char __user *name, struct kstat *stat)
{
return vfs_fstatat(AT_FDCWD, name, stat, 0);
}
EXPORT_SYMBOL(vfs_stat);

-int vfs_lstat(char __user *name, struct kstat *stat)
+int vfs_lstat(const char __user *name, struct kstat *stat)
{
return vfs_fstatat(AT_FDCWD, name, stat, AT_SYMLINK_NOFOLLOW);
}
@@ -147,7 +148,8 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(stat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+SYSCALL_DEFINE2(stat, const char __user *, filename,
+ struct __old_kernel_stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -159,7 +161,8 @@ SYSCALL_DEFINE2(stat, char __user *, filename, struct __old_kernel_stat __user *
return cp_old_stat(&stat, statbuf);
}

-SYSCALL_DEFINE2(lstat, char __user *, filename, struct __old_kernel_stat __user *, statbuf)
+SYSCALL_DEFINE2(lstat, const char __user *, filename,
+ struct __old_kernel_stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -234,7 +237,8 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(newstat, char __user *, filename, struct stat __user *, statbuf)
+SYSCALL_DEFINE2(newstat, const char __user *, filename,
+ struct stat __user *, statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -244,7 +248,8 @@ SYSCALL_DEFINE2(newstat, char __user *, filename, struct stat __user *, statbuf)
return cp_new_stat(&stat, statbuf);
}

-SYSCALL_DEFINE2(newlstat, char __user *, filename, struct stat __user *, statbuf)
+SYSCALL_DEFINE2(newlstat, const char __user *, filename,
+ struct stat __user *, statbuf)
{
struct kstat stat;
int error;
@@ -257,7 +262,7 @@ SYSCALL_DEFINE2(newlstat, char __user *, filename, struct stat __user *, statbuf
}

#if !defined(__ARCH_WANT_STAT64) || defined(__ARCH_WANT_SYS_NEWFSTATAT)
-SYSCALL_DEFINE4(newfstatat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(newfstatat, int, dfd, const char __user *, filename,
struct stat __user *, statbuf, int, flag)
{
struct kstat stat;
@@ -355,7 +360,8 @@ static long cp_new_stat64(struct kstat *stat, struct stat64 __user *statbuf)
return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
}

-SYSCALL_DEFINE2(stat64, char __user *, filename, struct stat64 __user *, statbuf)
+SYSCALL_DEFINE2(stat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
@@ -366,7 +372,8 @@ SYSCALL_DEFINE2(stat64, char __user *, filename, struct stat64 __user *, statbuf
return error;
}

-SYSCALL_DEFINE2(lstat64, char __user *, filename, struct stat64 __user *, statbuf)
+SYSCALL_DEFINE2(lstat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
{
struct kstat stat;
int error = vfs_lstat(filename, &stat);
@@ -388,7 +395,7 @@ SYSCALL_DEFINE2(fstat64, unsigned long, fd, struct stat64 __user *, statbuf)
return error;
}

-SYSCALL_DEFINE4(fstatat64, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename,
struct stat64 __user *, statbuf, int, flag)
{
struct kstat stat;
diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..179b586 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -126,7 +126,8 @@ out:
* must be owner or have write permission.
* Else, update from *times, must be owner or super user.
*/
-long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags)
+long do_utimes(int dfd, const char __user *filename, struct timespec *times,
+ int flags)
{
int error = -EINVAL;

@@ -170,7 +171,7 @@ out:
return error;
}

-SYSCALL_DEFINE4(utimensat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE4(utimensat, int, dfd, const char __user *, filename,
struct timespec __user *, utimes, int, flags)
{
struct timespec tstimes[2];
@@ -188,7 +189,7 @@ SYSCALL_DEFINE4(utimensat, int, dfd, char __user *, filename,
return do_utimes(dfd, filename, utimes ? tstimes : NULL, flags);
}

-SYSCALL_DEFINE3(futimesat, int, dfd, char __user *, filename,
+SYSCALL_DEFINE3(futimesat, int, dfd, const char __user *, filename,
struct timeval __user *, utimes)
{
struct timeval times[2];
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 168f7da..9ddc878 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -331,7 +331,7 @@ asmlinkage long compat_sys_epoll_pwait(int epfd,
const compat_sigset_t __user *sigmask,
compat_size_t sigsetsize);

-asmlinkage long compat_sys_utimensat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_utimensat(unsigned int dfd, const char __user *filename,
struct compat_timespec __user *t, int flags);

asmlinkage long compat_sys_signalfd(int ufd,
@@ -348,9 +348,9 @@ asmlinkage long compat_sys_move_pages(pid_t pid, unsigned long nr_page,
const int __user *nodes,
int __user *status,
int flags);
-asmlinkage long compat_sys_futimesat(unsigned int dfd, char __user *filename,
+asmlinkage long compat_sys_futimesat(unsigned int dfd, const char __user *filename,
struct compat_timeval __user *t);
-asmlinkage long compat_sys_newfstatat(unsigned int dfd, char __user * filename,
+asmlinkage long compat_sys_newfstatat(unsigned int dfd, const char __user * filename,
struct compat_stat __user *statbuf,
int flag);
asmlinkage long compat_sys_openat(unsigned int dfd, const char __user *filename,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7c443c3..a18bcea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2339,10 +2339,10 @@ void inode_set_bytes(struct inode *inode, loff_t bytes);

extern int vfs_readdir(struct file *, filldir_t, void *);

-extern int vfs_stat(char __user *, struct kstat *);
-extern int vfs_lstat(char __user *, struct kstat *);
+extern int vfs_stat(const char __user *, struct kstat *);
+extern int vfs_lstat(const char __user *, struct kstat *);
extern int vfs_fstat(unsigned int, struct kstat *);
-extern int vfs_fstatat(int , char __user *, struct kstat *, int);
+extern int vfs_fstatat(int , const char __user *, struct kstat *, int);

extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7f614ce..8812a63 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -393,7 +393,7 @@ asmlinkage long sys_umount(char __user *name, int flags);
asmlinkage long sys_oldumount(char __user *name);
asmlinkage long sys_truncate(const char __user *path, long length);
asmlinkage long sys_ftruncate(unsigned int fd, unsigned long length);
-asmlinkage long sys_stat(char __user *filename,
+asmlinkage long sys_stat(const char __user *filename,
struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_statfs(const char __user * path,
struct statfs __user *buf);
@@ -402,21 +402,21 @@ asmlinkage long sys_statfs64(const char __user *path, size_t sz,
asmlinkage long sys_fstatfs(unsigned int fd, struct statfs __user *buf);
asmlinkage long sys_fstatfs64(unsigned int fd, size_t sz,
struct statfs64 __user *buf);
-asmlinkage long sys_lstat(char __user *filename,
+asmlinkage long sys_lstat(const char __user *filename,
struct __old_kernel_stat __user *statbuf);
asmlinkage long sys_fstat(unsigned int fd,
struct __old_kernel_stat __user *statbuf);
-asmlinkage long sys_newstat(char __user *filename,
+asmlinkage long sys_newstat(const char __user *filename,
struct stat __user *statbuf);
-asmlinkage long sys_newlstat(char __user *filename,
+asmlinkage long sys_newlstat(const char __user *filename,
struct stat __user *statbuf);
asmlinkage long sys_newfstat(unsigned int fd, struct stat __user *statbuf);
asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
#if BITS_PER_LONG == 32
-asmlinkage long sys_stat64(char __user *filename,
+asmlinkage long sys_stat64(const char __user *filename,
struct stat64 __user *statbuf);
asmlinkage long sys_fstat64(unsigned long fd, struct stat64 __user *statbuf);
-asmlinkage long sys_lstat64(char __user *filename,
+asmlinkage long sys_lstat64(const char __user *filename,
struct stat64 __user *statbuf);
asmlinkage long sys_truncate64(const char __user *path, loff_t length);
asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
@@ -756,7 +756,7 @@ asmlinkage long sys_linkat(int olddfd, const char __user *oldname,
int newdfd, const char __user *newname, int flags);
asmlinkage long sys_renameat(int olddfd, const char __user * oldname,
int newdfd, const char __user * newname);
-asmlinkage long sys_futimesat(int dfd, char __user *filename,
+asmlinkage long sys_futimesat(int dfd, const char __user *filename,
struct timeval __user *utimes);
asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode);
asmlinkage long sys_fchmodat(int dfd, const char __user * filename,
@@ -765,13 +765,13 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user,
gid_t group, int flag);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
int mode);
-asmlinkage long sys_newfstatat(int dfd, char __user *filename,
+asmlinkage long sys_newfstatat(int dfd, const char __user *filename,
struct stat __user *statbuf, int flag);
-asmlinkage long sys_fstatat64(int dfd, char __user *filename,
+asmlinkage long sys_fstatat64(int dfd, const char __user *filename,
struct stat64 __user *statbuf, int flag);
asmlinkage long sys_readlinkat(int dfd, const char __user *path, char __user *buf,
int bufsiz);
-asmlinkage long sys_utimensat(int dfd, char __user *filename,
+asmlinkage long sys_utimensat(int dfd, const char __user *filename,
struct timespec __user *utimes, int flags);
asmlinkage long sys_unshare(unsigned long unshare_flags);

diff --git a/include/linux/time.h b/include/linux/time.h
index ea3559f..16346c0 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -135,7 +135,7 @@ extern void do_gettimeofday(struct timeval *tv);
extern int do_settimeofday(struct timespec *tv);
extern int do_sys_settimeofday(struct timespec *tv, struct timezone *tz);
#define do_posix_clock_monotonic_gettime(ts) ktime_get_ts(ts)
-extern long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags);
+extern long do_utimes(int dfd, const char __user *filename, struct timespec *times, int flags);
struct itimerval;
extern int do_setitimer(int which, struct itimerval *value,
struct itimerval *ovalue);

2010-06-30 01:17:21

by David Howells

Subject: [PATCH 2/3] AFS: Use i_generation not i_version for the vnode uniquifier [ver #2]

Store the AFS vnode uniquifier in the i_generation field, not the i_version
field of the inode struct. i_version can then be given the AFS data version
number.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/dir.c | 8 ++++----
fs/afs/fsclient.c | 3 ++-
fs/afs/inode.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b42d5cc..afb9ff8 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -542,11 +542,11 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
dentry->d_op = &afs_fs_dentry_operations;

d_add(dentry, inode);
- _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
+ _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%u }",
fid.vnode,
fid.unique,
dentry->d_inode->i_ino,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);

return NULL;
}
@@ -626,10 +626,10 @@ static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
* been deleted and replaced, and the original vnode ID has
* been reused */
if (fid.unique != vnode->fid.unique) {
- _debug("%s: file deleted (uq %u -> %u I:%llu)",
+ _debug("%s: file deleted (uq %u -> %u I:%u)",
dentry->d_name.name, fid.unique,
vnode->fid.unique,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);
spin_lock(&vnode->lock);
set_bit(AFS_VNODE_DELETED, &vnode->flags);
spin_unlock(&vnode->lock);
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 4bd0218..346e328 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -89,7 +89,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
i_size_write(&vnode->vfs_inode, size);
vnode->vfs_inode.i_uid = status->owner;
vnode->vfs_inode.i_gid = status->group;
- vnode->vfs_inode.i_version = vnode->fid.unique;
+ vnode->vfs_inode.i_generation = vnode->fid.unique;
vnode->vfs_inode.i_nlink = status->nlink;

mode = vnode->vfs_inode.i_mode;
@@ -102,6 +102,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
vnode->vfs_inode.i_ctime.tv_sec = status->mtime_server;
vnode->vfs_inode.i_mtime = vnode->vfs_inode.i_ctime;
vnode->vfs_inode.i_atime = vnode->vfs_inode.i_ctime;
+ vnode->vfs_inode.i_version = data_version;
}

expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d00b312..ee3190a 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -73,7 +73,8 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
inode->i_ctime.tv_nsec = 0;
inode->i_atime = inode->i_mtime = inode->i_ctime;
inode->i_blocks = 0;
- inode->i_version = vnode->fid.unique;
+ inode->i_generation = vnode->fid.unique;
+ inode->i_version = vnode->status.data_version;
inode->i_mapping->a_ops = &afs_fs_aops;

/* check to see whether a symbolic link is really a mountpoint */
@@ -98,7 +99,7 @@ static int afs_iget5_test(struct inode *inode, void *opaque)
struct afs_iget_data *data = opaque;

return inode->i_ino == data->fid.vnode &&
- inode->i_version == data->fid.unique;
+ inode->i_generation == data->fid.unique;
}

/*
@@ -110,7 +111,7 @@ static int afs_iget5_set(struct inode *inode, void *opaque)
struct afs_vnode *vnode = AFS_FS_I(inode);

inode->i_ino = data->fid.vnode;
- inode->i_version = data->fid.unique;
+ inode->i_generation = data->fid.unique;
vnode->fid = data->fid;
vnode->volume = data->volume;

@@ -306,8 +307,7 @@ int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,

inode = dentry->d_inode;

- _enter("{ ino=%lu v=%llu }", inode->i_ino,
- (unsigned long long)inode->i_version);
+ _enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);

generic_fillattr(inode, stat);
return 0;

2010-06-30 01:17:33

by David Howells

Subject: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Add a pair of system calls to make extended file stats available, including
file creation time, inode version and data version where available through the
underlying filesystem:

struct xstat_dev {
unsigned int major;
unsigned int minor;
};

struct xstat_time {
unsigned long long tv_sec;
unsigned long long tv_nsec;
};

struct xstat {
unsigned int struct_version;
#define XSTAT_STRUCT_VERSION 0
unsigned int st_mode;
unsigned int st_nlink;
unsigned int st_uid;
unsigned int st_gid;
unsigned int st_blksize;
struct xstat_dev st_rdev;
struct xstat_dev st_dev;
unsigned long long st_ino;
unsigned long long st_size;
struct xstat_time st_atime;
struct xstat_time st_mtime;
struct xstat_time st_ctime;
struct xstat_time st_btime;
unsigned long long st_blocks;
unsigned long long st_gen;
unsigned long long st_data_version;
unsigned long long query_flags;
#define XSTAT_QUERY_SIZE 0x00000001ULL
#define XSTAT_QUERY_NLINK 0x00000002ULL
#define XSTAT_QUERY_AMC_TIMES 0x00000004ULL
#define XSTAT_QUERY_CREATION_TIME 0x00000008ULL
#define XSTAT_QUERY_BLOCKS 0x00000010ULL
#define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION 0x00000040ULL
unsigned long long extra_results[0];
};

ssize_t ret = xstat(int dfd,
const char *filename,
unsigned atflag,
struct xstat *buffer,
size_t buflen);

ssize_t ret = fxstat(int fd,
struct xstat *buffer,
size_t buflen);


The dfd, filename, atflag and fd parameters indicate the file to query. There
is no equivalent of lstat() as that can be emulated with xstat() by passing
AT_SYMLINK_NOFOLLOW as atflag. Passing 0 instead follows a trailing symlink,
as stat() does.

When the system call is executed, the struct_version ID and query_flags bitmask
are read from the buffer to work out what the user is requesting.

If the structure version specified is not supported, the system call will
return ENOTSUPP. The above structure is version 0.
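
A minimal sketch of filling in the two input fields before making the call
might look like this. xstat_init_request() is a hypothetical helper; zeroing
the remaining fields is a choice made for this sketch only (the test program
further down fills them with 0xbf instead). The member names use the st_atim
spelling from the test program, which sidesteps glibc's st_atime macros.

```c
#include <assert.h>
#include <string.h>

/* Layout copied from the description above; the trailing
 * extra_results[] array is left out of this sketch. */
struct xstat_dev  { unsigned int major, minor; };
struct xstat_time { unsigned long long tv_sec, tv_nsec; };

struct xstat {
    unsigned int        struct_version;
    unsigned int        st_mode;
    unsigned int        st_nlink;
    unsigned int        st_uid;
    unsigned int        st_gid;
    unsigned int        st_blksize;
    struct xstat_dev    st_rdev;
    struct xstat_dev    st_dev;
    unsigned long long  st_ino;
    unsigned long long  st_size;
    struct xstat_time   st_atim;
    struct xstat_time   st_mtim;
    struct xstat_time   st_ctim;
    struct xstat_time   st_btim;
    unsigned long long  st_blocks;
    unsigned long long  st_gen;
    unsigned long long  st_data_version;
    unsigned long long  query_flags;
};

#define XSTAT_STRUCT_VERSION        0
#define XSTAT_QUERY_SIZE            0x00000001ULL
#define XSTAT_QUERY_CREATION_TIME   0x00000008ULL

/* Set the fields the syscall reads: the structure version and the
 * bitmask of results the caller would like. */
static void xstat_init_request(struct xstat *buf, unsigned long long query)
{
    assert(buf != NULL);
    memset(buf, 0, sizeof(*buf));
    buf->struct_version = XSTAT_STRUCT_VERSION;
    buf->query_flags = query;
}
```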

The query_flags should be set by the caller to specify extra results that the
caller may desire. These come in three classes:

(1) Size, nlinks, [amc]times and block count.

These will be returned whether the caller asks for them or not. The
corresponding bits in query_flags will be set to indicate their presence.

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server, unless
as a byproduct of updating something requested.

Query Flag                      Field
=============================== ================
XSTAT_QUERY_SIZE                st_size
XSTAT_QUERY_NLINK               st_nlink
XSTAT_QUERY_AMC_TIMES           st_[amc]time
XSTAT_QUERY_BLOCKS              st_blocks

(2) Creation time, Inode generation and Data version.

These will be returned if available whether the caller asked for them or
not. The corresponding bits in query_flags will be set or cleared as
appropriate to indicate their presence.

Query Flag                      Field
=============================== ================
XSTAT_QUERY_CREATION_TIME       st_btime
XSTAT_QUERY_INODE_GENERATION    st_gen
XSTAT_QUERY_DATA_VERSION        st_data_version

If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server, unless
as a byproduct of updating something requested.

(3) Extra results.

These will only be returned if the caller asked for them by setting their
bits in query_flags. They will be placed in the buffer after the xstat
struct in ascending query_flags bit order. Any bit set in query_flags
mask will be left set if the result is available and cleared otherwise.

The pointer into the results list will be rounded up to the nearest 8-byte
boundary after each result is written in. The size of each extra result
is specific to the definition for that result.

No extra results are currently defined.
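
The layout rule for class (3) can be sketched as follows. Since no extra
results are defined yet, the result sizes here are purely hypothetical, and
both helper names are made up for illustration. The sketch also assumes that
extra_results[] itself starts on an 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Round an offset up to the next multiple of 8 (unchanged if aligned). */
static size_t round_up_8(size_t off)
{
    return (off + 7) & ~(size_t)7;
}

/*
 * Given the sizes of the extra results that were returned, listed in
 * ascending query_flags bit order, compute the offset of each one
 * relative to the start of extra_results[].  Returns the total space
 * the results occupy, including padding after the last one.
 */
static size_t extra_result_offsets(const size_t *sizes, size_t n,
                                   size_t *offsets)
{
    size_t off = 0, i;

    assert(sizes != NULL && offsets != NULL);
    for (i = 0; i < n; i++) {
        offsets[i] = off;
        off = round_up_8(off + sizes[i]);
    }
    return off;
}
```

For three hypothetical results of 4, 12 and 8 bytes this yields offsets 0, 8
and 24, and a total of 32 bytes.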

If the buffer is insufficiently big, the syscall returns the amount of space it
will need to write the complete result set, but otherwise does nothing.

If successful, the amount of data written into the buffer will be returned.
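
One way a caller might tell these two outcomes apart is sketched below. It
relies on reading the two paragraphs above together: a non-negative return
value larger than buflen can only mean the buffer was too small.
xstat_space_needed() is a hypothetical helper, and error returns are assumed
to have been dealt with before it is called.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/*
 * Returns 0 if the result fitted into a buffer of buflen bytes, or the
 * buffer size to retry with if the syscall reported needing more space.
 */
static size_t xstat_space_needed(ssize_t ret, size_t buflen)
{
    assert(ret >= 0);   /* negative returns are errors; check errno first */
    return (size_t)ret > buflen ? (size_t)ret : 0;
}
```

A caller would then reallocate to the returned size, set struct_version and
query_flags again if the new buffer is fresh, and repeat the call.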

At the moment, this will only work on x86_64 as it requires system calls to be
wired up.


===========
FILESYSTEMS
===========

The following filesystems have been modified to make use of this facility:

(*) Ext4. This will return the creation time and inode version number for all
files. It will, however, only return the data version number for
directories as i_version is only maintained for them.

(*) AFS. This will return the vnode ID uniquifier as the inode version and
the AFS data version number as the data version. There is no file
creation time available.

(*) NFS. This will return the change attribute, but only for NFSv4. No other
extra values are returned at this time. If mtime and ctime aren't asked for,
outstanding writes won't be flushed to the server. If none of [amc]time,
size, nlink, blocks and data_version are requested, then the attributes
won't be refreshed from the server.

Probably this isn't sufficient, as the other non-optional attributes may
require refreshing.
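
Because each filesystem fills in a different subset of the optional fields,
a caller would presumably test the returned query_flags before trusting a
field. A minimal sketch, with xstat_got() as a hypothetical helper:

```c
#include <assert.h>
#include <stdbool.h>

#define XSTAT_QUERY_CREATION_TIME       0x00000008ULL
#define XSTAT_QUERY_INODE_GENERATION    0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION        0x00000040ULL
#define XSTAT_QUERY__ORDINARY_SET       0x00000017ULL

/* True if every bit in wanted is set in the flags the syscall handed back. */
static bool xstat_got(unsigned long long returned_flags,
                      unsigned long long wanted)
{
    assert(wanted != 0);
    return (returned_flags & wanted) == wanted;
}
```

With flags such as AFS is described as returning (the ordinary set plus inode
generation and data version, but no creation time), the check for
XSTAT_QUERY_CREATION_TIME fails and st_btime should be left alone.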


=======
TESTING
=======

The following test program can be used to test the xstat system call:

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <time.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <sys/types.h>

struct xstat_dev {
unsigned int major;
unsigned int minor;
};

struct xstat_time {
unsigned long long tv_sec;
unsigned long long tv_nsec;
};

struct xstat {
unsigned int struct_version;
#define XSTAT_STRUCT_VERSION 0
unsigned int st_mode;
unsigned int st_nlink;
unsigned int st_uid;
unsigned int st_gid;
unsigned int st_blksize;
struct xstat_dev st_rdev;
struct xstat_dev st_dev;
unsigned long long st_ino;
unsigned long long st_size;
struct xstat_time st_atim;
struct xstat_time st_mtim;
struct xstat_time st_ctim;
struct xstat_time st_btim;
unsigned long long st_blocks;
unsigned long long st_gen;
unsigned long long st_data_version;
unsigned long long query_flags;
#define XSTAT_QUERY_SIZE 0x00000001ULL /* want/got st_size */
#define XSTAT_QUERY_NLINK 0x00000002ULL /* want/got st_nlink */
#define XSTAT_QUERY_AMC_TIMES 0x00000004ULL /* want/got st_[amc]time */
#define XSTAT_QUERY_CREATION_TIME 0x00000008ULL /* want/got st_btime */
#define XSTAT_QUERY_BLOCKS 0x00000010ULL /* want/got st_blocks */
#define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL /* want/got st_gen */
#define XSTAT_QUERY_DATA_VERSION 0x00000040ULL /* want/got st_data_version */
#define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL /* the stuff in the normal stat struct */
#define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL /* what we get anyway if available */
#define XSTAT_QUERY__DEFINED_SET 0x0000007fULL /* the defined set of flags */
unsigned long long extra_results[0];
};

#define __NR_xstat 300
#define __NR_fxstat 301

static __attribute__((unused))
ssize_t xstat(int dfd, const char *filename, int atflag,
struct xstat *buffer, size_t bufsize)
{
return syscall(__NR_xstat, dfd, filename, atflag, buffer, bufsize);
}

static __attribute__((unused))
ssize_t fxstat(int fd, struct xstat *buffer, size_t bufsize)
{
return syscall(__NR_fxstat, fd, buffer, bufsize);
}

static void print_time(const struct xstat_time *xstm)
{
struct tm tm;
time_t tim;
char buffer[100];
int len;

tim = xstm->tv_sec;
if (!localtime_r(&tim, &tm)) {
perror("localtime_r");
exit(1);
}
len = strftime(buffer, 100, "%F %T", &tm);
if (len == 0) {
perror("strftime");
exit(1);
}
fwrite(buffer, 1, len, stdout);
printf(".%09llu", xstm->tv_nsec);
len = strftime(buffer, 100, "%z", &tm);
if (len == 0) {
perror("strftime2");
exit(1);
}
fwrite(buffer, 1, len, stdout);
}

static void dump_xstat(struct xstat *xst)
{
char buffer[256], ft;

printf(" ");
if (xst->query_flags & XSTAT_QUERY_SIZE)
printf(" Size: %-15llu", xst->st_size);
if (xst->query_flags & XSTAT_QUERY_BLOCKS)
printf(" Blocks: %-10llu", xst->st_blocks);
printf(" IO Block: %-6u ", xst->st_blksize);
switch (xst->st_mode & S_IFMT) {
case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
case S_IFREG: printf(" regular file\n"); ft = '-'; break;
case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
default:
printf("unknown type (%o)\n", xst->st_mode & S_IFMT);
ft = '?';
break;
}

sprintf(buffer, "%02x:%02x", xst->st_dev.major, xst->st_dev.minor);
printf("Device: %-15s Inode: %-11llu", buffer, xst->st_ino);
if (xst->query_flags & XSTAT_QUERY_NLINK)
printf(" Links: %u", xst->st_nlink);
printf("\n");

printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
xst->st_mode & 07777,
ft,
xst->st_mode & S_IRUSR ? 'r' : '-',
xst->st_mode & S_IWUSR ? 'w' : '-',
xst->st_mode & S_IXUSR ? 'x' : '-',
xst->st_mode & S_IRGRP ? 'r' : '-',
xst->st_mode & S_IWGRP ? 'w' : '-',
xst->st_mode & S_IXGRP ? 'x' : '-',
xst->st_mode & S_IROTH ? 'r' : '-',
xst->st_mode & S_IWOTH ? 'w' : '-',
xst->st_mode & S_IXOTH ? 'x' : '-');
printf("Uid: %d Gid: %u\n", xst->st_uid, xst->st_gid);

if (xst->query_flags & XSTAT_QUERY_AMC_TIMES) {
printf("Access: "); print_time(&xst->st_atim); printf("\n");
printf("Modify: "); print_time(&xst->st_mtim); printf("\n");
printf("Change: "); print_time(&xst->st_ctim); printf("\n");
}
if (xst->query_flags & XSTAT_QUERY_CREATION_TIME) {
printf("Create: "); print_time(&xst->st_btim); printf("\n");
}

if (xst->query_flags & XSTAT_QUERY_INODE_GENERATION)
printf("Inode version: %llxh\n", xst->st_gen);
if (xst->query_flags & XSTAT_QUERY_DATA_VERSION)
printf("Data version: %llxh\n", xst->st_data_version);
}

int main(int argc, char **argv)
{
struct xstat xst;
int ret, atflag = AT_SYMLINK_NOFOLLOW;

unsigned long long query =
XSTAT_QUERY__ORDINARY_SET |
XSTAT_QUERY_CREATION_TIME |
XSTAT_QUERY_INODE_GENERATION |
XSTAT_QUERY_DATA_VERSION;

for (argv++; *argv; argv++) {
if (strcmp(*argv, "-L") == 0) {
atflag = 0;
continue;
}
if (strcmp(*argv, "-O") == 0) {
query &= ~XSTAT_QUERY__ORDINARY_SET;
continue;
}

memset(&xst, 0xbf, sizeof(xst));
xst.struct_version = 0;
xst.query_flags = query;
ret = xstat(AT_FDCWD, *argv, atflag, &xst, sizeof(xst));
printf("xstat(%s) = %d\n", *argv, ret);
if (ret < 0) {
perror(*argv);
exit(1);
}

printf("sv=%u qf=%llx cr=%llx.%llx iv=%llx dv=%llx\n",
xst.struct_version, xst.query_flags,
xst.st_btim.tv_sec, xst.st_btim.tv_nsec,
xst.st_gen, xst.st_data_version);

dump_xstat(&xst);
}
return 0;
}

Just compile and run, passing it paths to the files you want to examine:

[root@andromeda ~]# /tmp/xstat /afs/archive/linuxdev/fedora9/i386/repodata/
xstat(/afs/archive/linuxdev/fedora9/i386/repodata/) = 152
sv=0 qf=77 cr=0.0 iv=7a5 dv=5
Size: 2048 Blocks: 0 IO Block: 4096 directory
Device: 00:15 Inode: 83 Links: 2
Access: (0755/drwxr-xr-x) Uid: 75338 Gid: 0
Access: 2008-11-05 20:00:12.000000000+0000
Modify: 2008-11-05 20:00:12.000000000+0000
Change: 2008-11-05 20:00:12.000000000+0000
Inode version: 7a5h
Data version: 5h

[root@andromeda ~]# /tmp/xstat /warthog/nfs/linux-2.6-fscache
xstat(/warthog/nfs/linux-2.6-fscache) = 152
sv=0 qf=57 cr=0.0 iv=0 dv=f4992a4c00000000
Size: 4096 Blocks: 16 IO Block: 1048576 directory
Device: 00:13 Inode: 19005487 Links: 27
Access: (2775/drwxrwxr-x) Uid: -2 Gid: 4294967294
Access: 2010-06-30 02:07:42.000000000+0100
Modify: 2010-06-30 02:12:20.000000000+0100
Change: 2010-06-30 02:12:20.000000000+0100
Data version: f4992a4c00000000h

[root@andromeda ~]# /tmp/xstat /var/cache/fscache/cache/
xstat(/var/cache/fscache/cache/) = 152
sv=0 qf=7f cr=4c24ba83.1c15ee3d iv=f585ab70 dv=2
Size: 4096 Blocks: 16 IO Block: 4096 directory
Device: 08:06 Inode: 130561 Links: 3
Access: (0700/drwx------) Uid: 0 Gid: 0
Access: 2010-06-29 18:16:33.680703545+0100
Modify: 2010-06-29 18:16:20.132786632+0100
Change: 2010-06-29 18:16:20.132786632+0100
Create: 2010-06-25 15:17:39.471199293+0100
Inode version: f585ab70h
Data version: 2h
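
The qf= value printed on the second line of each run is the result mask, so it
shows which fields the filesystem actually supplied. As a rough illustration,
it could be decoded along the following lines. This is only a sketch: the flag
values are copied from the struct xstat definition, but xstat_flags_to_str()
is a hypothetical helper and not part of the patch.

```c
#include <stdio.h>

/* Flag values copied from the struct xstat definition in the patch. */
#define XSTAT_QUERY_SIZE		0x00000001ULL
#define XSTAT_QUERY_NLINK		0x00000002ULL
#define XSTAT_QUERY_AMC_TIMES		0x00000004ULL
#define XSTAT_QUERY_CREATION_TIME	0x00000008ULL
#define XSTAT_QUERY_BLOCKS		0x00000010ULL
#define XSTAT_QUERY_INODE_GENERATION	0x00000020ULL
#define XSTAT_QUERY_DATA_VERSION	0x00000040ULL

/*
 * Hypothetical helper, not part of the patch: write the names of the bits
 * set in flags into out, in ascending bit order and separated by spaces.
 * Names that do not fit in outlen bytes are dropped.  Returns out.
 */
char *xstat_flags_to_str(unsigned long long flags, char *out, size_t outlen)
{
	static const struct {
		unsigned long long bit;
		const char *name;
	} names[] = {
		{ XSTAT_QUERY_SIZE,		"SIZE" },
		{ XSTAT_QUERY_NLINK,		"NLINK" },
		{ XSTAT_QUERY_AMC_TIMES,	"AMC_TIMES" },
		{ XSTAT_QUERY_CREATION_TIME,	"CREATION_TIME" },
		{ XSTAT_QUERY_BLOCKS,		"BLOCKS" },
		{ XSTAT_QUERY_INODE_GENERATION,	"INODE_GENERATION" },
		{ XSTAT_QUERY_DATA_VERSION,	"DATA_VERSION" },
	};
	size_t i, used = 0;

	if (outlen == 0)
		return out;
	out[0] = '\0';

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		int n;

		if (!(flags & names[i].bit))
			continue;
		n = snprintf(out + used, outlen - used, "%s%s",
			     used ? " " : "", names[i].name);
		if (n < 0 || (size_t)n >= outlen - used) {
			out[used] = '\0';	/* undo the partial write */
			break;
		}
		used += (size_t)n;
	}
	return out;
}
```

For the NFS directory above, qf=57 has the SIZE, NLINK, AMC_TIMES, BLOCKS and
DATA_VERSION bits set, which is why that run prints no Create: or
Inode version: line.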


Signed-off-by: David Howells <[email protected]>
---

arch/x86/include/asm/unistd_32.h | 4 +
arch/x86/include/asm/unistd_64.h | 4 +
fs/afs/inode.c | 12 ++-
fs/ecryptfs/inode.c | 1
fs/ext4/ext4.h | 2
fs/ext4/file.c | 2
fs/ext4/inode.c | 27 ++++++-
fs/ext4/namei.c | 2
fs/ext4/symlink.c | 2
fs/nfs/inode.c | 38 ++++++---
fs/nfsd/nfs3proc.c | 1
fs/nfsd/nfs3xdr.c | 2
fs/nfsd/nfs4xdr.c | 2
fs/nfsd/nfsproc.c | 3 +
fs/nfsd/nfsxdr.c | 1
fs/stat.c | 153 +++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 2
include/linux/stat.h | 88 ++++++++++++++++++++++
include/linux/syscalls.h | 5 +
19 files changed, 322 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index beb9b5f..a9953cc 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_xstat 338
+#define __NR_fxstat 339

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 340

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index ff4307b..c90d240 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_xstat 300
+__SYSCALL(__NR_xstat, sys_xstat)
+#define __NR_fxstat 301
+__SYSCALL(__NR_fxstat, sys_fxstat)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index ee3190a..3b68136 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -300,16 +300,18 @@ error_unlock:
/*
* read the attributes of an inode
*/
-int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,
- struct kstat *stat)
+int afs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
- struct inode *inode;
-
- inode = dentry->d_inode;
+ struct inode *inode = dentry->d_inode;

_enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);

generic_fillattr(inode, stat);
+
+ stat->result_flags |=
+ XSTAT_QUERY_INODE_GENERATION | XSTAT_QUERY_DATA_VERSION;
+ stat->gen = inode->i_generation;
+ stat->data_version = inode->i_version;
return 0;
}

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 31ef525..93b914b 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -994,6 +994,7 @@ int ecryptfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat lower_stat;
int rc;

+ lower_stat.query_flags = XSTAT_QUERY_BLOCKS;
rc = vfs_getattr(ecryptfs_dentry_to_lower_mnt(dentry),
ecryptfs_dentry_to_lower(dentry), &lower_stat);
if (!rc) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..96823f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1571,6 +1571,8 @@ extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct dentry *, struct iattr *);
extern int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
+extern int ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat);
extern void ext4_delete_inode(struct inode *);
extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5313ae4..18c29ab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -150,7 +150,7 @@ const struct file_operations ext4_file_operations = {
const struct inode_operations ext4_file_inode_operations = {
.truncate = ext4_truncate,
.setattr = ext4_setattr,
- .getattr = ext4_getattr,
+ .getattr = ext4_file_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..465ce48 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5550,12 +5550,33 @@ err_out:
int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
- struct inode *inode;
- unsigned long delalloc_blocks;
+ struct inode *inode = dentry->d_inode;

- inode = dentry->d_inode;
generic_fillattr(inode, stat);

+ stat->result_flags |= XSTAT_QUERY_CREATION_TIME;
+ stat->btime.tv_sec = EXT4_I(inode)->i_crtime.tv_sec;
+ stat->btime.tv_nsec = EXT4_I(inode)->i_crtime.tv_nsec;
+
+ if (inode->i_ino != EXT4_ROOT_INO) {
+ stat->result_flags |= XSTAT_QUERY_INODE_GENERATION;
+ stat->gen = inode->i_generation;
+ }
+ if (S_ISDIR(inode->i_mode)) {
+ stat->result_flags |= XSTAT_QUERY_DATA_VERSION;
+ stat->data_version = inode->i_version;
+ }
+ return 0;
+}
+
+int ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ struct inode *inode = dentry->d_inode;
+ unsigned long delalloc_blocks;
+
+ ext4_getattr(mnt, dentry, stat);
+
/*
* We can't update i_blocks if the block allocation is delayed
* otherwise in the case of system crash before the real block
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a43e661..0f776c7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2542,6 +2542,7 @@ const struct inode_operations ext4_dir_inode_operations = {
.mknod = ext4_mknod,
.rename = ext4_rename,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
@@ -2554,6 +2555,7 @@ const struct inode_operations ext4_dir_inode_operations = {

const struct inode_operations ext4_special_inode_operations = {
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/ext4/symlink.c b/fs/ext4/symlink.c
index ed9354a..d8fe7fb 100644
--- a/fs/ext4/symlink.c
+++ b/fs/ext4/symlink.c
@@ -35,6 +35,7 @@ const struct inode_operations ext4_symlink_inode_operations = {
.follow_link = page_follow_link_light,
.put_link = page_put_link,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
@@ -47,6 +48,7 @@ const struct inode_operations ext4_fast_symlink_inode_operations = {
.readlink = generic_readlink,
.follow_link = ext4_follow_link,
.setattr = ext4_setattr,
+ .getattr = ext4_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 099b351..bb19eaf 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -498,8 +498,10 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
int err;

- /* Flush out writes to the server in order to update c/mtime. */
- if (S_ISREG(inode->i_mode)) {
+ /* Flush out writes to the server in order to update c/mtime if the
+ * user wants them */
+ if (stat->query_flags & XSTAT_QUERY_AMC_TIMES &&
+ S_ISREG(inode->i_mode)) {
err = filemap_write_and_wait(inode->i_mapping);
if (err)
goto out;
@@ -514,18 +516,30 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
* - NFS never sets MS_NOATIME or MS_NODIRATIME so there is
* no point in checking those.
*/
- if ((mnt->mnt_flags & MNT_NOATIME) ||
- ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
+ if (!(stat->query_flags & XSTAT_QUERY_AMC_TIMES) ||
+ (mnt->mnt_flags & MNT_NOATIME) ||
+ ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
need_atime = 0;

- if (need_atime)
- err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
- else
- err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
- if (!err) {
- generic_fillattr(inode, stat);
- stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+ if (stat->query_flags &
+ (XSTAT_QUERY__ORDINARY_SET | XSTAT_QUERY_DATA_VERSION)) {
+ if (need_atime)
+ err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ else
+ err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ if (err)
+ goto out;
}
+
+ generic_fillattr(inode, stat);
+ stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+
+ if (stat->query_flags & XSTAT_QUERY_DATA_VERSION &&
+ NFS_SERVER(inode)->nfs_client->rpc_ops->version == 4) {
+ stat->data_version = NFS_I(inode)->change_attr;
+ stat->result_flags |= XSTAT_QUERY_DATA_VERSION;
+ }
+
out:
return err;
}
@@ -770,7 +784,7 @@ int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
static int nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
{
struct nfs_inode *nfsi = NFS_I(inode);
-
+
if (mapping->nrpages != 0) {
int ret = invalidate_inode_pages2(mapping);
if (ret < 0)
diff --git a/fs/nfsd/nfs3proc.c b/fs/nfsd/nfs3proc.c
index 3d68f45..bcd08a3 100644
--- a/fs/nfsd/nfs3proc.c
+++ b/fs/nfsd/nfs3proc.c
@@ -55,6 +55,7 @@ nfsd3_proc_getattr(struct svc_rqst *rqstp, struct nfsd_fhandle *argp,
if (nfserr)
RETURN_STATUS(nfserr);

+ resp->stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
err = vfs_getattr(resp->fh.fh_export->ex_path.mnt,
resp->fh.fh_dentry, &resp->stat);
nfserr = nfserrno(err);
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2a533a0..7a8737b 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -205,6 +205,7 @@ encode_post_op_attr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
int err;
struct kstat stat;

+ stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
err = vfs_getattr(fhp->fh_export->ex_path.mnt, dentry, &stat);
if (!err) {
*p++ = xdr_one; /* attributes follow */
@@ -257,6 +258,7 @@ void fill_post_wcc(struct svc_fh *fhp)
if (fhp->fh_post_saved)
printk("nfsd: inode locked twice during operation.\n");

+ fhp->fh_post_attr.query_flags = XSTAT_QUERY__GET_ANYWAY;
err = vfs_getattr(fhp->fh_export->ex_path.mnt, fhp->fh_dentry,
&fhp->fh_post_attr);
fhp->fh_post_change = fhp->fh_dentry->d_inode->i_version;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index ac17a70..afed8d5 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1769,6 +1769,7 @@ nfsd4_encode_fattr(struct svc_fh *fhp, struct svc_export *exp,
goto out;
}

+ stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
err = vfs_getattr(exp->ex_path.mnt, dentry, &stat);
if (err)
goto out_nfserr;
@@ -2139,6 +2140,7 @@ out_acl:
if (path.dentry != path.mnt->mnt_root)
break;
}
+ stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
err = vfs_getattr(path.mnt, path.dentry, &stat);
path_put(&path);
if (err)
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index a047ad6..81e4b4c 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -26,6 +26,7 @@ static __be32
nfsd_return_attrs(__be32 err, struct nfsd_attrstat *resp)
{
if (err) return err;
+ resp->stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
return nfserrno(vfs_getattr(resp->fh.fh_export->ex_path.mnt,
resp->fh.fh_dentry,
&resp->stat));
@@ -34,6 +35,7 @@ static __be32
nfsd_return_dirop(__be32 err, struct nfsd_diropres *resp)
{
if (err) return err;
+ resp->stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
return nfserrno(vfs_getattr(resp->fh.fh_export->ex_path.mnt,
resp->fh.fh_dentry,
&resp->stat));
@@ -150,6 +152,7 @@ nfsd_proc_read(struct svc_rqst *rqstp, struct nfsd_readargs *argp,
&resp->count);

if (nfserr) return nfserr;
+ resp->stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
return nfserrno(vfs_getattr(resp->fh.fh_export->ex_path.mnt,
resp->fh.fh_dentry,
&resp->stat));
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 4ce005d..c5f9869 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -197,6 +197,7 @@ encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp,
__be32 *nfs2svc_encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp)
{
struct kstat stat;
+ stat.query_flags = XSTAT_QUERY__GET_ANYWAY;
vfs_getattr(fhp->fh_export->ex_path.mnt, fhp->fh_dentry, &stat);
return encode_fattr(rqstp, p, fhp, &stat);
}
diff --git a/fs/stat.c b/fs/stat.c
index 12e90e2..9ee968b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -33,6 +33,7 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
stat->size = i_size_read(inode);
stat->blocks = inode->i_blocks;
stat->blksize = (1 << inode->i_blkbits);
+ stat->result_flags |= XSTAT_QUERY__ORDINARY_SET;
}

EXPORT_SYMBOL(generic_fillattr);
@@ -42,6 +43,8 @@ int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
struct inode *inode = dentry->d_inode;
int retval;

+ stat->result_flags = 0;
+
retval = security_inode_getattr(mnt, dentry);
if (retval)
return retval;
@@ -55,7 +58,10 @@ int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)

EXPORT_SYMBOL(vfs_getattr);

-int vfs_fstat(unsigned int fd, struct kstat *stat)
+/*
+ * VFS entrypoint to get extended stats by file descriptor
+ */
+int vfs_fxstat(unsigned int fd, struct kstat *stat)
{
struct file *f = fget(fd);
int error = -EBADF;
@@ -66,10 +72,20 @@ int vfs_fstat(unsigned int fd, struct kstat *stat)
}
return error;
}
+EXPORT_SYMBOL(vfs_fxstat);
+
+int vfs_fstat(unsigned int fd, struct kstat *stat)
+{
+ stat->query_flags = XSTAT_QUERY__ORDINARY_SET;
+ return vfs_fxstat(fd, stat);
+}
EXPORT_SYMBOL(vfs_fstat);

-int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
- int flag)
+/*
+ * VFS entrypoint to get extended stats by filename
+ */
+int vfs_xstat(int dfd, const char __user *filename, int flag,
+ struct kstat *stat)
{
struct path path;
int error = -EINVAL;
@@ -90,6 +106,14 @@ int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
out:
return error;
}
+EXPORT_SYMBOL(vfs_xstat);
+
+int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
+ int flag)
+{
+ stat->query_flags = XSTAT_QUERY__ORDINARY_SET;
+ return vfs_xstat(dfd, filename, flag, stat);
+}
EXPORT_SYMBOL(vfs_fstatat);

int vfs_stat(const char __user *name, struct kstat *stat)
@@ -115,7 +139,7 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
{
static int warncount = 5;
struct __old_kernel_stat tmp;
-
+
if (warncount > 0) {
warncount--;
printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Recompile your binary.\n",
@@ -140,7 +164,7 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
#if BITS_PER_LONG == 32
if (stat->size > MAX_NON_LFS)
return -EOVERFLOW;
-#endif
+#endif
tmp.st_size = stat->size;
tmp.st_atime = stat->atime.tv_sec;
tmp.st_mtime = stat->mtime.tv_sec;
@@ -222,7 +246,7 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
#if BITS_PER_LONG == 32
if (stat->size > MAX_NON_LFS)
return -EOVERFLOW;
-#endif
+#endif
tmp.st_size = stat->size;
tmp.st_atime = stat->atime.tv_sec;
tmp.st_mtime = stat->mtime.tv_sec;
@@ -408,6 +432,123 @@ SYSCALL_DEFINE4(fstatat64, int, dfd, const char __user *, filename,
}
#endif /* __ARCH_WANT_STAT64 */

+/*
+ * check the input parameters in the xstat struct
+ */
+static int xstat_check_param(struct xstat __user *buffer, size_t bufsize,
+ struct kstat *stat)
+{
+ u32 struct_version;
+ int ret;
+
+ /* if the buffer isn't large enough, return how much we wanted to
+ * write, but otherwise do nothing */
+ if (bufsize < sizeof(struct xstat))
+ return sizeof(struct xstat);
+
+ ret = get_user(struct_version, &buffer->struct_version);
+ if (ret < 0)
+ return ret;
+ if (struct_version != 0)
+ return -ENOTSUPP;
+
+ memset(stat, 0xde, sizeof(*stat));
+
+ ret = get_user(stat->query_flags, &buffer->query_flags);
+ if (ret < 0)
+ return ret;
+
+ /* nothing outside this set has a defined purpose */
+ stat->query_flags &= XSTAT_QUERY__DEFINED_SET;
+ stat->result_flags = 0;
+ return 0;
+}
+
+/*
+ * copy the extended stats to userspace and return the amount of data written
+ * into the buffer
+ */
+static long xstat_set_result(struct kstat *stat,
+ struct xstat __user *buffer, size_t bufsize)
+{
+ struct xstat tmp;
+
+ memset(&tmp, 0, sizeof(tmp));
+ tmp.struct_version = XSTAT_STRUCT_VERSION;
+ tmp.query_flags = stat->result_flags;
+ tmp.st_dev.major = MAJOR(stat->dev);
+ tmp.st_dev.minor = MINOR(stat->dev);
+ tmp.st_rdev.major = MAJOR(stat->rdev);
+ tmp.st_rdev.minor = MINOR(stat->rdev);
+ tmp.st_ino = stat->ino;
+ tmp.st_mode = stat->mode;
+ tmp.st_uid = stat->uid;
+ tmp.st_gid = stat->gid;
+ tmp.st_blksize = stat->blksize;
+
+ if (stat->result_flags & XSTAT_QUERY_NLINK)
+ tmp.st_nlink = stat->nlink;
+ if (stat->result_flags & XSTAT_QUERY_AMC_TIMES) {
+ tmp.st_atime.tv_sec = stat->atime.tv_sec;
+ tmp.st_atime.tv_nsec = stat->atime.tv_nsec;
+ tmp.st_mtime.tv_sec = stat->mtime.tv_sec;
+ tmp.st_mtime.tv_nsec = stat->mtime.tv_nsec;
+ tmp.st_ctime.tv_sec = stat->ctime.tv_sec;
+ tmp.st_ctime.tv_nsec = stat->ctime.tv_nsec;
+ }
+ if (stat->result_flags & XSTAT_QUERY_SIZE)
+ tmp.st_size = stat->size;
+ if (stat->result_flags & XSTAT_QUERY_BLOCKS)
+ tmp.st_blocks = stat->blocks;
+ if (stat->result_flags & XSTAT_QUERY_CREATION_TIME) {
+ tmp.st_btime.tv_sec = stat->btime.tv_sec;
+ tmp.st_btime.tv_nsec = stat->btime.tv_nsec;
+ }
+ if (stat->result_flags & XSTAT_QUERY_INODE_GENERATION)
+ tmp.st_gen = stat->gen;
+ if (stat->result_flags & XSTAT_QUERY_DATA_VERSION)
+ tmp.st_data_version = stat->data_version;
+
+ return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : sizeof(tmp);
+}
+
+/*
+ * System call to get extended stats by path
+ */
+SYSCALL_DEFINE5(xstat,
+ int, dfd, const char __user *, filename, unsigned, atflag,
+ struct xstat __user *, buffer, size_t, bufsize)
+{
+ struct kstat stat;
+ int error;
+
+ error = xstat_check_param(buffer, bufsize, &stat);
+ if (error != 0)
+ return error;
+ error = vfs_xstat(dfd, filename, atflag, &stat);
+ if (error)
+ return error;
+ return xstat_set_result(&stat, buffer, bufsize);
+}
+
+/*
+ * System call to get extended stats by file descriptor
+ */
+SYSCALL_DEFINE3(fxstat, int, fd, struct xstat __user *, buffer, size_t, bufsize)
+{
+ struct kstat stat;
+ int error;
+
+ error = xstat_check_param(buffer, bufsize, &stat);
+ if (error != 0)
+ return error;
+ error = vfs_fxstat(fd, &stat);
+ if (error)
+ return error;
+
+ return xstat_set_result(&stat, buffer, bufsize);
+}
+
/* Caller is here responsible for sufficient locking (ie. inode->i_lock) */
void __inode_add_bytes(struct inode *inode, loff_t bytes)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a18bcea..9ce2119 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2343,6 +2343,8 @@ extern int vfs_stat(const char __user *, struct kstat *);
extern int vfs_lstat(const char __user *, struct kstat *);
extern int vfs_fstat(unsigned int, struct kstat *);
extern int vfs_fstatat(int , const char __user *, struct kstat *, int);
+extern int vfs_xstat(int, const char __user *, int, struct kstat *);
+extern int vfs_fxstat(unsigned int, struct kstat *);

extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
unsigned long arg);
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 611c398..5ef092a 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,87 @@

#endif

+/*
+ * Extended stat structures
+ */
+struct xstat_dev {
+ unsigned int major;
+ unsigned int minor;
+};
+
+struct xstat_time {
+ unsigned long long tv_sec;
+ unsigned long long tv_nsec;
+};
+
+struct xstat {
+ unsigned int struct_version; /* version of this structure */
+#define XSTAT_STRUCT_VERSION 0
+
+ unsigned int st_mode; /* file mode */
+ unsigned int st_nlink; /* number of hard links */
+ unsigned int st_uid; /* user ID of owner */
+ unsigned int st_gid; /* group ID of owner */
+ unsigned int st_blksize; /* block size for filesystem I/O */
+ struct xstat_dev st_rdev; /* device ID of special file */
+ struct xstat_dev st_dev; /* ID of device containing file */
+ unsigned long long st_ino; /* inode number */
+ unsigned long long st_size; /* file size */
+ struct xstat_time st_atime; /* last access time */
+ struct xstat_time st_mtime; /* last data modification time */
+ struct xstat_time st_ctime; /* last attribute change time */
+ struct xstat_time st_btime; /* file creation time */
+ unsigned long long st_blocks; /* number of 512-byte blocks allocated */
+ unsigned long long st_gen; /* inode generation number */
+ unsigned long long st_data_version; /* data version number */
+
+ /* Query request/result flags
+ *
+ * Bits should be set in query_flags to request particular items before
+ * calling xstat() or fxstat().
+ *
+ * For each item in the set XSTAT_QUERY__GET_ANYWAY:
+ *
+ * - if not available at all, the bit will be cleared before returning
+ * and the field will be cleared; otherwise,
+ *
+ * - if requested, the datum will be synchronised to a server or other
+ * hardware before being returned if necessary, and the bit will be
+ * set on return; otherwise,
+ *
+ * - if not requested, but available in approximate form without any
+ * effort, it will be filled in anyway, and the bit will be set upon
+ * return (it might not be up to date, however, and no attempt will
+ * be made to synchronise the internal state first); otherwise,
+ *
+ * - the bit will be cleared before returning, and the field will be
+ * cleared.
+ *
+ * For each item not in the set XSTAT_QUERY__GET_ANYWAY:
+ *
+ * - if not available at all, the bit will be cleared, and no result
+ * data will be returned; otherwise,
+ *
+ * - if requested, the datum will be synchronised to a server or other
+ * hardware before being appended if necessary, and the bit will be
+ * set on return; otherwise,
+ *
+ * - the bit will be cleared, and no result data will be returned.
+ */
+ unsigned long long query_flags;
+#define XSTAT_QUERY_SIZE 0x00000001ULL /* want/got st_size */
+#define XSTAT_QUERY_NLINK 0x00000002ULL /* want/got st_nlink */
+#define XSTAT_QUERY_AMC_TIMES 0x00000004ULL /* want/got st_[amc]time */
+#define XSTAT_QUERY_CREATION_TIME 0x00000008ULL /* want/got st_btime */
+#define XSTAT_QUERY_BLOCKS 0x00000010ULL /* want/got st_blocks */
+#define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL /* want/got st_gen */
+#define XSTAT_QUERY_DATA_VERSION 0x00000040ULL /* want/got st_data_version */
+#define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL /* the stuff in the normal stat struct */
+#define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL /* what we get anyway if available */
+#define XSTAT_QUERY__DEFINED_SET 0x0000007fULL /* the defined set of flags */
+ unsigned long long extra_results[0]; /* extra requested results */
+};
+
#ifdef __KERNEL__
#define S_IRWXUGO (S_IRWXU|S_IRWXG|S_IRWXO)
#define S_IALLUGO (S_ISUID|S_ISGID|S_ISVTX|S_IRWXUGO)
@@ -68,11 +149,16 @@ struct kstat {
gid_t gid;
dev_t rdev;
loff_t size;
- struct timespec atime;
+ struct timespec atime;
struct timespec mtime;
struct timespec ctime;
+ struct timespec btime; /* file creation time */
unsigned long blksize;
unsigned long long blocks;
+ u64 query_flags; /* what extras the user asked for */
+ u64 result_flags; /* what extras the user got */
+ u64 gen; /* inode generation */
+ u64 data_version;
};

#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 8812a63..760a303 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -44,6 +44,7 @@ struct shmid_ds;
struct sockaddr;
struct stat;
struct stat64;
+struct xstat;
struct statfs;
struct statfs64;
struct __sysctl_args;
@@ -824,4 +825,8 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long fd, unsigned long pgoff);
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);

+asmlinkage long sys_xstat(int, const char __user *, unsigned,
+ struct xstat __user *, size_t);
+asmlinkage long sys_fxstat(int, struct xstat __user *, size_t);
+
#endif

2010-06-30 01:49:18

by Trond Myklebust

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wed, 2010-06-30 at 02:17 +0100, David Howells wrote:
> Add a pair of system calls to make extended file stats available, including
> file creation time, inode version and data version where available through the
> underlying filesystem:
>
> struct xstat_dev {
> unsigned int major;
> unsigned int minor;
> };
>
> struct xstat_time {
> unsigned long long tv_sec;
> unsigned long long tv_nsec;
> };
>
> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> unsigned int st_mode;
> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;
> unsigned int st_blksize;
> struct xstat_dev st_rdev;
> struct xstat_dev st_dev;
> unsigned long long st_ino;
> unsigned long long st_size;
> struct xstat_time st_atime;
> struct xstat_time st_mtime;
> struct xstat_time st_ctime;
> struct xstat_time st_btime;
> unsigned long long st_blocks;
> unsigned long long st_gen;
> unsigned long long st_data_version;
> unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL
> #define XSTAT_QUERY_NLINK 0x00000002ULL
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL
> unsigned long long extra_results[0];
> };
>
> ssize_t ret = xstat(int dfd,
> const char *filename,
> unsigned atflag,
> struct xstat *buffer,
> size_t buflen);
>
> ssize_t ret = fxstat(int fd,
> struct xstat *buffer,
> size_t buflen);
>
>
> The dfd, filename, atflag and fd parameters indicate the file to query. There
> is no equivalent of lstat() as that can be emulated with xstat(), passing
> AT_SYMLINK_NOFOLLOW instead of 0 as atflag.
>
> When the system call is executed, the struct_version ID and query_flags bitmask
> are read from the buffer to work out what the user is requesting.
>
> If the structure version specified is not supported, the system call will
> return ENOTSUPP. The above structure is version 0.
>
> The query_flags should be set by the caller to specify extra results that the
> caller may desire. These come in three classes:
>
> (1) Size, nlinks, [amc]times and block count.
>
> These will be returned whether the caller asks for them or not. The
> corresponding bits in query_flags will be set to indicate their presence.
>
> If the caller didn't ask for them, then they may be approximated. For
> example, NFS won't waste any time updating them from the server, unless
> as a byproduct of updating something requested.
>
> Query Flag Field
> =============================== ================
> XSTAT_QUERY_SIZE st_size
> XSTAT_QUERY_NLINK st_nlink
> XSTAT_QUERY_AMC_TIMES st_[amc]time
> XSTAT_QUERY_BLOCKS st_blocks
>
> (2) Creation time, Inode generation and Data version.
>
> These will be returned if available whether the caller asked for them or
> not. The corresponding bits in query_flags will be set or cleared as
> appropriate to indicate their presence.
>
> Query Flag Field
> =============================== ================
> XSTAT_QUERY_CREATION_TIME st_btime
> XSTAT_QUERY_INODE_GENERATION st_gen
> XSTAT_QUERY_DATA_VERSION st_data_version
>
> If the caller didn't ask for them, then they may be approximated. For
> example, NFS won't waste any time updating them from the server, unless
> as a byproduct of updating something requested.
>
> (3) Extra results.
>
> These will only be returned if the caller asked for them by setting their
> bits in query_flags. They will be placed in the buffer after the xstat
> struct in ascending query_flags bit order. Any bit set in query_flags
> mask will be left set if the result is available and cleared otherwise.
>
> The pointer into the results list will be rounded up to the nearest 8-byte
> boundary after each result is written in. The size of each extra result
> is specific to the definition for that result.
>
> No extra results are currently defined.
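
The 8-byte rounding rule for extra results could be sketched as below. Since
no extra results are defined yet, any result sizes fed to it are made up for
illustration, and xstat_round_up8() and xstat_next_extra_offset() are
hypothetical names rather than anything in the patch.

```c
#include <stddef.h>

/* Round off up to the next multiple of 8; a multiple of 8 is unchanged. */
size_t xstat_round_up8(size_t off)
{
	return (off + 7) & ~(size_t)7;
}

/*
 * Hypothetical walker: given the buffer offset at which one extra result
 * starts and that result's size, return the offset at which the following
 * extra result would start.
 */
size_t xstat_next_extra_offset(size_t off, size_t result_size)
{
	return xstat_round_up8(off + result_size);
}
```

With the 152-byte version 0 struct, the first extra result would start at
offset 152, and an imaginary 4-byte result there would put the next one at
offset 160.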
>
> If the buffer is insufficiently big, the syscall returns the amount of space it
> will need to write the complete result set, but otherwise does nothing.
>
> If successful, the amount of data written into the buffer will be returned.
>
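
Going by the two rules above, a caller can tell the outcomes apart by
comparing the return value with the buffer length it passed in. A minimal
sketch, assuming the usual syscall() convention of returning -1 with errno set
on failure; the enum and xstat_classify() are hypothetical names:

```c
#include <stddef.h>
#include <sys/types.h>

enum xstat_outcome {
	XSTAT_FAILED,		/* ret < 0: the call failed */
	XSTAT_BUF_TOO_SMALL,	/* ret > buflen: ret is the size needed */
	XSTAT_OK,		/* otherwise: ret bytes were written */
};

/* Classify the value returned by xstat() or fxstat() for a given buflen. */
enum xstat_outcome xstat_classify(ssize_t ret, size_t buflen)
{
	if (ret < 0)
		return XSTAT_FAILED;
	if ((size_t)ret > buflen)
		return XSTAT_BUF_TOO_SMALL;
	return XSTAT_OK;
}
```

On XSTAT_BUF_TOO_SMALL nothing has been written, so the caller would allocate
at least ret bytes and repeat the call.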
> At the moment, this will only work on x86_64 as it requires system calls to be
> wired up.
>
>
> ===========
> FILESYSTEMS
> ===========
>
> The following filesystems have been modified to make use of this facility:
>
> (*) Ext4. This will return the creation time and inode version number for all
> files. It will, however, only return the data version number for
> directories as i_version is only maintained for them.
>
> (*) AFS. This will return the vnode ID uniquifier as the inode version and
> the AFS data version number as the data version. There is no file
> creation time available.
>
> (*) NFS. This will return the change attribute, but only for NFSv4. No other extra
> values are returned at this time. If mtime and ctime aren't asked for,
> the outstanding writes won't be written to the server. If none of
> [amc]time, size, nlink, blocks and data_version are requested, then the
> attributes won't be refreshed from the server.
>
> Probably this isn't sufficient, as the other non-optional attributes may
> require refreshing.
>
>
> =======
> TESTING
> =======
>
> The following test program can be used to test the xstat system call:
>
> #define _GNU_SOURCE
> #define _ATFILE_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <time.h>
> #include <sys/syscall.h>
> #include <sys/stat.h>
> #include <sys/types.h>
>
> struct xstat_dev {
> unsigned int major;
> unsigned int minor;
> };
>
> struct xstat_time {
> unsigned long long tv_sec;
> unsigned long long tv_nsec;
> };
>
> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> unsigned int st_mode;
> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;
> unsigned int st_blksize;
> struct xstat_dev st_rdev;
> struct xstat_dev st_dev;
> unsigned long long st_ino;
> unsigned long long st_size;
> struct xstat_time st_atim;
> struct xstat_time st_mtim;
> struct xstat_time st_ctim;
> struct xstat_time st_btim;
> unsigned long long st_blocks;
> unsigned long long st_gen;
> unsigned long long st_data_version;
> unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL /* want/got st_size */
> #define XSTAT_QUERY_NLINK 0x00000002ULL /* want/got st_nlink */
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL /* want/got st_[amc]time */
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL /* want/got st_btime */
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL /* want/got st_blocks */
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL /* want/got st_gen */
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL /* want/got st_data_version */
> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL /* the stuff in the normal stat struct */
> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL /* what we get anyway if available */
> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL /* the defined set of flags */
> unsigned long long extra_results[0];
> };
>
> #define __NR_xstat 300
> #define __NR_fxstat 301
>
> static __attribute__((unused))
> ssize_t xstat(int dfd, const char *filename, int atflag,
> struct xstat *buffer, size_t bufsize)
> {
> return syscall(__NR_xstat, dfd, filename, atflag, buffer, bufsize);
> }
>
> static __attribute__((unused))
> ssize_t fxstat(int fd, struct xstat *buffer, size_t bufsize)
> {
> return syscall(__NR_fxstat, fd, buffer, bufsize);
> }
>
> static void print_time(const struct xstat_time *xstm)
> {
> struct tm tm;
> time_t tim;
> char buffer[100];
> int len;
>
> tim = xstm->tv_sec;
> if (!localtime_r(&tim, &tm)) {
> perror("localtime_r");
> exit(1);
> }
> len = strftime(buffer, 100, "%F %T", &tm);
> if (len == 0) {
> perror("strftime");
> exit(1);
> }
> fwrite(buffer, 1, len, stdout);
> printf(".%09llu", xstm->tv_nsec);
> len = strftime(buffer, 100, "%z", &tm);
> if (len == 0) {
> perror("strftime2");
> exit(1);
> }
> fwrite(buffer, 1, len, stdout);
> }
>
> static void dump_xstat(struct xstat *xst)
> {
> char buffer[256], ft;
>
> printf(" ");
> if (xst->query_flags & XSTAT_QUERY_SIZE)
> printf(" Size: %-15llu", xst->st_size);
> if (xst->query_flags & XSTAT_QUERY_BLOCKS)
> printf(" Blocks: %-10llu", xst->st_blocks);
> printf(" IO Block: %-6u ", xst->st_blksize);
> switch (xst->st_mode & S_IFMT) {
> case S_IFIFO: printf(" FIFO\n"); ft = 'p'; break;
> case S_IFCHR: printf(" character special file\n"); ft = 'c'; break;
> case S_IFDIR: printf(" directory\n"); ft = 'd'; break;
> case S_IFBLK: printf(" block special file\n"); ft = 'b'; break;
> case S_IFREG: printf(" regular file\n"); ft = '-'; break;
> case S_IFLNK: printf(" symbolic link\n"); ft = 'l'; break;
> case S_IFSOCK: printf(" socket\n"); ft = 's'; break;
> default:
> printf("unknown type (%o)\n", xst->st_mode & S_IFMT);
> ft = '?';
> break;
> }
>
> sprintf(buffer, "%02x:%02x", xst->st_dev.major, xst->st_dev.minor);
> printf("Device: %-15s Inode: %-11llu", buffer, xst->st_ino);
> if (xst->query_flags & XSTAT_QUERY_NLINK)
> printf(" Links: %u", xst->st_nlink);
> printf("\n");
>
> printf("Access: (%04o/%c%c%c%c%c%c%c%c%c%c) ",
> xst->st_mode & 07777,
> ft,
> xst->st_mode & S_IRUSR ? 'r' : '-',
> xst->st_mode & S_IWUSR ? 'w' : '-',
> xst->st_mode & S_IXUSR ? 'x' : '-',
> xst->st_mode & S_IRGRP ? 'r' : '-',
> xst->st_mode & S_IWGRP ? 'w' : '-',
> xst->st_mode & S_IXGRP ? 'x' : '-',
> xst->st_mode & S_IROTH ? 'r' : '-',
> xst->st_mode & S_IWOTH ? 'w' : '-',
> xst->st_mode & S_IXOTH ? 'x' : '-');
> printf("Uid: %d Gid: %u\n", xst->st_uid, xst->st_gid);
>
> if (xst->query_flags & XSTAT_QUERY_AMC_TIMES) {
> printf("Access: "); print_time(&xst->st_atim); printf("\n");
> printf("Modify: "); print_time(&xst->st_mtim); printf("\n");
> printf("Change: "); print_time(&xst->st_ctim); printf("\n");
> }
> if (xst->query_flags & XSTAT_QUERY_CREATION_TIME) {
> printf("Create: "); print_time(&xst->st_btim); printf("\n");
> }
>
> if (xst->query_flags & XSTAT_QUERY_INODE_GENERATION)
> printf("Inode version: %llxh\n", xst->st_gen);
> if (xst->query_flags & XSTAT_QUERY_DATA_VERSION)
> printf("Data version: %llxh\n", xst->st_data_version);
> }
>
> int main(int argc, char **argv)
> {
> struct xstat xst;
> int ret, atflag = AT_SYMLINK_NOFOLLOW;
>
> unsigned long long query =
> XSTAT_QUERY__ORDINARY_SET |
> XSTAT_QUERY_CREATION_TIME |
> XSTAT_QUERY_INODE_GENERATION |
> XSTAT_QUERY_DATA_VERSION;
>
> for (argv++; *argv; argv++) {
> if (strcmp(*argv, "-L") == 0) {
> atflag = 0;
> continue;
> }
> if (strcmp(*argv, "-O") == 0) {
> query &= ~XSTAT_QUERY__ORDINARY_SET;
> continue;
> }
>
> memset(&xst, 0xbf, sizeof(xst));
> xst.struct_version = 0;
> xst.query_flags = query;
> ret = xstat(AT_FDCWD, *argv, atflag, &xst, sizeof(xst));
> printf("xstat(%s) = %d\n", *argv, ret);
> if (ret < 0) {
> perror(*argv);
> exit(1);
> }
>
> printf("sv=%u qf=%llx cr=%llx.%llx iv=%llx dv=%llx\n",
> xst.struct_version, xst.query_flags,
> xst.st_btim.tv_sec, xst.st_btim.tv_nsec,
> xst.st_gen, xst.st_data_version);
>
> dump_xstat(&xst);
> }
> return 0;
> }
>
> Just compile and run, passing it paths to the files you want to examine:
>
> [root@andromeda ~]# /tmp/xstat /afs/archive/linuxdev/fedora9/i386/repodata/
> xstat(/afs/archive/linuxdev/fedora9/i386/repodata/) = 152
> sv=0 qf=77 cr=0.0 iv=7a5 dv=5
> Size: 2048 Blocks: 0 IO Block: 4096 directory
> Device: 00:15 Inode: 83 Links: 2
> Access: (0755/drwxr-xr-x) Uid: 75338 Gid: 0
> Access: 2008-11-05 20:00:12.000000000+0000
> Modify: 2008-11-05 20:00:12.000000000+0000
> Change: 2008-11-05 20:00:12.000000000+0000
> Inode version: 7a5h
> Data version: 5h
>
> [root@andromeda ~]# /tmp/xstat /warthog/nfs/linux-2.6-fscache
> xstat(/warthog/nfs/linux-2.6-fscache) = 152
> sv=0 qf=57 cr=0.0 iv=0 dv=f4992a4c00000000
> Size: 4096 Blocks: 16 IO Block: 1048576 directory
> Device: 00:13 Inode: 19005487 Links: 27
> Access: (2775/drwxrwxr-x) Uid: -2 Gid: 4294967294
> Access: 2010-06-30 02:07:42.000000000+0100
> Modify: 2010-06-30 02:12:20.000000000+0100
> Change: 2010-06-30 02:12:20.000000000+0100
> Data version: f4992a4c00000000h
>
> [root@andromeda ~]# /tmp/xstat /var/cache/fscache/cache/
> xstat(/var/cache/fscache/cache/) = 152
> sv=0 qf=7f cr=4c24ba83.1c15ee3d iv=f585ab70 dv=2
> Size: 4096 Blocks: 16 IO Block: 4096 directory
> Device: 08:06 Inode: 130561 Links: 3
> Access: (0700/drwx------) Uid: 0 Gid: 0
> Access: 2010-06-29 18:16:33.680703545+0100
> Modify: 2010-06-29 18:16:20.132786632+0100
> Change: 2010-06-29 18:16:20.132786632+0100
> Create: 2010-06-25 15:17:39.471199293+0100
> Inode version: f585ab70h
> Data version: 2h

Yes, but could we please also add a flag that allows you to specify that
the kernel _must_ provide up to date attributes.

IOW: a flag that for something like NFS or CIFS will force a GETATTR RPC
call on the wire as opposed to using cached values.

Cheers
Trond

2010-06-30 02:32:46

by Nicholas Miell

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wed, 2010-06-30 at 02:17 +0100, David Howells wrote:
> Add a pair of system calls to make extended file stats available, including
> file creation time, inode version and data version where available through the
> underlying filesystem:
>
> struct xstat_dev {
> unsigned int major;
> unsigned int minor;
> };
>
> struct xstat_time {
> unsigned long long tv_sec;
> unsigned long long tv_nsec;
> };
>
> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> unsigned int st_mode;
> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;
> unsigned int st_blksize;
> struct xstat_dev st_rdev;
> struct xstat_dev st_dev;
> unsigned long long st_ino;
> unsigned long long st_size;
> struct xstat_time st_atime;
> struct xstat_time st_mtime;
> struct xstat_time st_ctime;
> struct xstat_time st_btime;
> unsigned long long st_blocks;
> unsigned long long st_gen;
> unsigned long long st_data_version;
> unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL
> #define XSTAT_QUERY_NLINK 0x00000002ULL
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL
> unsigned long long extra_results[0];
> };
>

unsigned long long inside a struct has 4 byte alignment on x86, while
AMD64 has 8 byte alignment. This struct layout isn't affected by the
difference, but that's something to keep in mind.
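
The layout claim can be checked mechanically. The following is a minimal sketch, not part of the patch: it copies the struct from the patch description and uses offsetof() to confirm that every 64-bit member already falls on an 8-byte boundary, so the 4-byte and 8-byte alignment rules produce the same layout. The helper name check_xstat_layout() is made up for illustration, and the trailing array is written as a C99 flexible array member rather than the GNU [0] form.

```c
#include <assert.h>
#include <stddef.h>

struct xstat_dev { unsigned int major, minor; };
struct xstat_time { unsigned long long tv_sec, tv_nsec; };

/* Copied from the patch description; the time members use the st_*tim
 * spelling of the test program to stay clear of the <sys/stat.h> macros. */
struct xstat {
    unsigned int struct_version;
    unsigned int st_mode;
    unsigned int st_nlink;
    unsigned int st_uid;
    unsigned int st_gid;
    unsigned int st_blksize;
    struct xstat_dev st_rdev;
    struct xstat_dev st_dev;
    unsigned long long st_ino;
    unsigned long long st_size;
    struct xstat_time st_atim;
    struct xstat_time st_mtim;
    struct xstat_time st_ctim;
    struct xstat_time st_btim;
    unsigned long long st_blocks;
    unsigned long long st_gen;
    unsigned long long st_data_version;
    unsigned long long query_flags;
    unsigned long long extra_results[];
};

/* Hypothetical helper: aborts if the compiler inserted padding anywhere. */
static void check_xstat_layout(void)
{
    assert(offsetof(struct xstat, st_rdev) == 24);    /* six 4-byte ints */
    assert(offsetof(struct xstat, st_ino) == 40);     /* first 64-bit member */
    assert(offsetof(struct xstat, st_atim) == 56);
    assert(offsetof(struct xstat, st_blocks) == 120);
    assert(offsetof(struct xstat, query_flags) == 144);
    assert(sizeof(struct xstat) == 152);
}
```

The total of 152 bytes agrees with the value the test program prints as the return of xstat() in the sample runs.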
--
Nicholas Miell <[email protected]>

2010-06-30 08:31:06

by Arnd Bergmann

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wednesday 30 June 2010 03:17:12 David Howells wrote:
> +static int xstat_check_param(struct xstat __user *buffer, size_t bufsize,
> + struct kstat *stat)
> +{
> + u32 struct_version;
> + int ret;
> +
> + /* if the buffer isn't large enough, return how much we wanted to
> + * write, but otherwise do nothing */
> + if (bufsize < sizeof(struct xstat))
> + return sizeof(struct xstat);
> +
> + ret = get_user(struct_version, &buffer->struct_version);
> + if (ret < 0)
> + return ret;
> + if (struct_version != 0)
> + return -ENOTSUPP;
> +
> + memset(stat, 0xde, sizeof(*stat));
> +
> + ret = get_user(stat->query_flags, &buffer->query_flags);
> + if (ret < 0)
> + return ret;
> +
> + /* nothing outside this set has a defined purpose */
> + stat->query_flags &= XSTAT_QUERY__DEFINED_SET;
> + stat->result_flags = 0;
> + return 0;
> +}

I think it would be better to leave the structure as write-only from
the kernel and pass the query_flags and struct_version as syscall
arguments, though it makes sense to store them in the result as well.

Independent from this, I also think that we can collapse the
struct_version into the more flexible query_flags. When the structure
gets extended with new fields, just add another flag to let
the user ask for them.

When the flags are outside of the structure, you can even have a flag
that will result in a completely new structure layout to be returned.

Arnd

2010-06-30 08:56:17

by David Howells

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Arnd Bergmann <[email protected]> wrote:

> I think it would be better to leave the structure as write-only from
> the kernel

Why?

> and pass the query_flags and struct_version as syscall arguments, though it
> makes sense to store them in the result as well.

The problem with that is that the number of syscall arguments is limited, and
there is no SYSCALL_DEFINE7.

On the other hand, I could make a separate argument block struct and pass a
pointer to it...

David

2010-06-30 09:31:47

by Arnd Bergmann

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wednesday 30 June 2010 10:55:51 David Howells wrote:
> Arnd Bergmann <[email protected]> wrote:
>
> > I think it would be better to leave the structure as write-only from
> > the kernel
>
> Why?

Consistency mostly. stat and stat64 don't read it, so I think xstat
also shouldn't if we can easily avoid it.

It also makes things like strace more complicated.

> > and pass the query_flags and struct_version as syscall arguments, though it
> > makes sense to store them in the result as well.
>
> The problem with that is that the number of syscall arguments is limited, and
> there is no SYSCALL_DEFINE7.
>
> On the other hand, I could make a separate argument block struct and pass a
> pointer to it...

No, I think that would be worse than the current version. But if you remove
the structure version in favor of the flags, you only need six arguments
anyway.

You can also go further and fold the structure length into flags, because
the length is just a function of the data you are passing.

Having a system call with flags, size and version is like wearing a belt,
braces and suspenders. An unsigned long flags argument should be enough to
hold up your pants[1].

Arnd

[1] I hope I managed to make this sound wrong in both American and proper
English.

2010-06-30 09:33:54

by Andreas Dilger

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On 2010-06-29, at 19:48, Trond Myklebust wrote:
>> When the system call is executed, the struct_version ID and query_flags bitmask are read from the buffer to work out what the user is requesting.
>
> Yes, but could we please also add a flag that allows you to specify that
> the kernel _must_ provide up to date attributes.

To my reading, if the query_flags are set in the input buffer, then the attributes MUST be fetched. If they are unset, then they MAY be fetched, and the corresponding query_flags will be set in the return buffer. If the query_flags are not set in the return buffer then I assume the output values are undefined.
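
That reading can be written down as a small predicate over the request and result masks. This is only a sketch of the interpretation given above, not of anything in the patch; the enum and the function xstat_field_state() are hypothetical names, and the flag values are the ones from the patch description.

```c
#include <assert.h>

#define XSTAT_QUERY_SIZE           0x00000001ULL
#define XSTAT_QUERY_CREATION_TIME  0x00000008ULL
#define XSTAT_QUERY_DATA_VERSION   0x00000040ULL

/* Hypothetical classification of one field after the call returns. */
enum xstat_field_state {
    XSTAT_FIELD_UNDEFINED,  /* bit clear on return: do not read the field */
    XSTAT_FIELD_BONUS,      /* not asked for, but the kernel filled it in (MAY) */
    XSTAT_FIELD_REQUESTED   /* asked for and returned (MUST have been fetched) */
};

static enum xstat_field_state
xstat_field_state(unsigned long long asked, unsigned long long returned,
                  unsigned long long bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0);  /* exactly one flag */

    if (!(returned & bit))
        return XSTAT_FIELD_UNDEFINED;
    return (asked & bit) ? XSTAT_FIELD_REQUESTED : XSTAT_FIELD_BONUS;
}
```

With the AFS sample from the test output (everything asked for, qf=77 returned), the creation time comes out as undefined while the size counts as requested.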

In discussions about the proposed "statlite()" API (which this is very similar to) it was desirable that there be separate flags for the individual fields (at least AMC_TIME should be split), since it isn't always clear whether it is "free" to get all of these timestamps, if just one is desired. For Lustre, in particular, the mtime is stored with the file data (where it is updated), and it is more costly to get this if it isn't needed.

Cheers, Andreas




2010-06-30 09:46:05

by Andreas Dilger

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On 2010-06-29, at 19:17, David Howells wrote:
> int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
> struct kstat *stat)
> {
> + if (S_ISDIR(inode->i_mode)) {
> + stat->result_flags |= XSTAT_QUERY_DATA_VERSION;
> + stat->data_version = inode->i_version;
> + }

Note that when ext4 is mounted with the "i_version" option, the i_version field is also updated on regular files, for use by NFSv4. See, for example, ext4_mark_iloc_dirty().

I had a hard time finding this, even though I knew it was there somewhere, because it isn't modifying "i_version" directly, but rather calling a helper function inode_inc_iversion().

It probably makes sense to always return i_version, unless it is 0.

Cheers, Andreas




2010-06-30 09:47:54

by David Howells

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Andreas Dilger <[email protected]> wrote:

> > Yes, but could we please also add a flag that allows you to specify that
> > the kernel _must_ provide up to date attributes.
>
> To my reading, if the query_flags are set in the input buffer, then the
> attributes MUST be fetched. If they are unset, then they MAY be fetched,
> and the corresponding query_flags will be set in the return buffer. If the
> query_flags are not set in the return buffer then I assume the output values
> are undefined.

I think Trond may have a point, looking at nfs_getattr().

There can be three levels:

(1) Don't check with the server, just go with what we've got in the cache if
it's available. Results returned may be approximate.

(2) Check with the server if the cached attributes are out of date or if
something is requested that we don't keep in RAM.

(3) Check with the server anyway.
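
The three levels could be modelled as a decision function along these lines. This is a sketch only: the enum names, the cache-state struct and must_ask_server() are all invented for illustration, and treating an empty cache as a reason to go to the server even at level (1) is an assumption about what "if it's available" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for levels (1) to (3). */
enum xstat_sync_level {
    XSTAT_SYNC_CACHED_OK,  /* (1) use the cache, results may be approximate */
    XSTAT_SYNC_IF_STALE,   /* (2) revalidate only when needed */
    XSTAT_SYNC_FORCE       /* (3) always go to the server */
};

/* Hypothetical summary of what a network filesystem knows locally. */
struct attr_cache_state {
    bool populated;       /* some attributes are cached at all */
    bool fresh;           /* the cached attributes have not timed out */
    bool covers_request;  /* everything asked for is kept in RAM */
};

static bool must_ask_server(enum xstat_sync_level level,
                            const struct attr_cache_state *c)
{
    assert(c != NULL);

    switch (level) {
    case XSTAT_SYNC_CACHED_OK:
        return !c->populated;
    case XSTAT_SYNC_IF_STALE:
        return !c->populated || !c->fresh || !c->covers_request;
    case XSTAT_SYNC_FORCE:
        return true;
    }
    return true;  /* unknown level: be conservative */
}
```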

David

2010-06-30 10:01:34

by David Howells

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Arnd Bergmann <[email protected]> wrote:

> It also makes things like strace more complicated.

That's the most compelling argument.

> No, I think that would be worse than the current version. But if you remove
> the structure version in favor of the flags, you only need six arguments
> anyway.

I want to keep the structure version, just in case we need to expand fields in
the stat struct in future. Otherwise we may need to create yet another stat
syscall.

> You can also go further and fold the structure length into flags, because
> the length is just a function of the data you are passing.

The potential problem with passing the flags as a syscall argument is that
we're then limited to a single 32-bit integer. It might be enough, but if I
do as at least one person has suggested and assign each field in the struct
its own bit, that uses up half right there, plus I'd like to add at least one
operational flag (to force synchronisation with the server).

> Having a system call with flags, size and version is like wearing a belt,
> braces and suspenders. An unsigned long flags argument should be enough to
> hold up your pants[1].

I would like the size argument for two reasons: firstly, to prevent buffer
overruns and, secondly, because I can see some scope for variable-size fields
(such as for volume IDs or security labels), though the latter might be better
handled through getxattr() (which would mean extra overhead).
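
For illustration, here is how a caller might use the size argument if variable-size results did appear. The patch description says a too-small buffer makes the call return the space needed without writing anything, so a return value larger than the buffer length means "retry with more". The sketch below assumes exactly that; fake_xstat() is a stand-in for the real syscall (which needs a patched kernel), FAKE_RESULT_SIZE is an invented figure, and xstat_alloc() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Pretend result: the 152-byte struct plus 16 bytes of extra results. */
#define FAKE_RESULT_SIZE 168

/* Stand-in for the syscall, following the documented size semantics. */
static ssize_t fake_xstat(void *buffer, size_t buflen)
{
    if (buflen < FAKE_RESULT_SIZE)
        return FAKE_RESULT_SIZE;   /* too small: report the need, write nothing */
    memset(buffer, 0, FAKE_RESULT_SIZE);
    return FAKE_RESULT_SIZE;       /* success: bytes written */
}

/* Try with the base size, grow and retry if told to. Caller frees *out. */
static ssize_t xstat_alloc(void **out)
{
    size_t len = 152;
    void *buf = malloc(len);
    ssize_t ret;

    assert(out != NULL);
    if (!buf)
        return -1;

    ret = fake_xstat(buf, len);
    while (ret > 0 && (size_t)ret > len) {
        void *bigger = realloc(buf, (size_t)ret);

        if (!bigger) {
            free(buf);
            return -1;
        }
        buf = bigger;
        len = (size_t)ret;
        ret = fake_xstat(buf, len);
    }
    if (ret < 0) {
        free(buf);
        return ret;
    }
    *out = buf;
    return ret;
}
```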

David

2010-06-30 10:22:32

by David Howells

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Andreas Dilger <[email protected]> wrote:

> Note that when ext4 is mounted with the "i_version" option that the
> i_version field is also updated on regular files, for use by NFSv4. See,
> for example, ext4_mark_iloc_dirty().
>
> I had a hard time finding this, even though I knew it was there somewhere,
> because it isn't modifying "i_version" directly, but rather calling a helper
> function inode_inc_iversion().

Ah, okay. Thanks!

> It probably makes sense to always return i_version, unless it is 0.

I didn't want to return it if it wasn't supported on the nominated file as
that may give a false sense of coherency.

David

2010-06-30 11:04:45

by Andreas Dilger

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On 2010-06-29, at 19:16, David Howells wrote:
> Implement a pair of new system calls to provide extended and further extensible stat functions.
>
> The third of the associated patches provides these new system calls:
>
> struct xstat_dev {
> unsigned int major;
> unsigned int minor;
> };

Doesn't glibc use two 64-bit values for devices?

> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0

I dislike sequential "version" fields (which are "all or nothing"), and prefer the ext2/3/4-like "feature flags" that allow the caller to state what features and fields it expects and/or understands. This allows extensibility without unduly breaking compatibility.

> unsigned int st_mode;

Having a separate MODE flag would be great for "ls --color", since that is basically the only information that it needs that isn't already available in the readdir() output.

> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;

In struct stat64 it uses "unsigned long" for both st_uid and st_gid. Having a 64-bit value here is useful for CIFS servers to be able to remap different UID domains into a 32-bit domain and a 32-bit UID. If you change this, please remember to reorder the fields for proper 64-bit alignment.

> unsigned int st_blksize;
> Does st_blksize really need to be 64 bits on a 64-bit system?

I don't think so, but adding a 32-bit padding couldn't hurt.

> unsigned long long st_ino;
> unsigned long long st_size;
> Should the inode number and data version number fields be 128-bit?

I wouldn't object to having a 128-bit st_ino field, since this is what Lustre will be using internally in the next release.

Similarly, _filesystems_ are not SO far from hitting the 64-bit size limit (a Lustre filesystem will likely hit 100PB ~= 2^57 bytes in the next year), so having a 128-bit st_size wouldn't be unreasonable, because...

Something else I learned is that Solaris stat() returns the device size in st_size for a block device file. This is very convenient, and avoids the morass of ioctls and "binary llseek guessing" used by libext2fs and libblkid to determine the size of a block device. Any reason not to add this into this new syscall?

> unsigned long long st_blocks;

If st_size is 128-bit (or has padding) then st_blocks should have the same.

> unsigned long long query_flags;

It is inconsistent to have all the other fields use the "st_" prefix, but "query_flags" and "struct_version" do not have this prefix.

> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL

Can these be split into separate ATIME, MTIME, CTIME flags?

> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL

It seems a bit inconsistent to call the field "st_btime" and the mask "CREATION_TIME". It would be more consistent (if somewhat less clear) to call the mask "BTIME". The struct definition should get short comments for each field to explain their meaning anyway, so "st_btime" can have "/* birth/creation time */".

> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL

This is also a bit inconsistent with the "st_gen" field name.

> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL

It wouldn't be a bad idea to interleave these flags with each of the fields that they represent, to make it more clear what is included in each.

> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL

Could you provide some information what the semantic distinction between these is? It might be useful to have an "XSTAT_QUERY_LEGACY_STAT" mask that returns only the fields that are in the previous struct stat, unless that is what "ORDINARY_SET" means, in which case it should be renamed I think.

> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL

It is smart to have a "DEFINED_SET" mask that maps to the currently-understood fields. This ensures that applications compiled against a specific set of headers/struct will not request fields which they don't understand. It might be better to call this "XSTAT_QUERY_ALL" so that it is more easily understood and used by callers, instead of the incorrect "-1" or "~0" that some may be tempted to use if they don't understand what "__DEFINED_SET" means.


Cheers, Andreas




2010-06-30 11:47:08

by Arnd Bergmann

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wednesday 30 June 2010, David Howells wrote:
> Arnd Bergmann <[email protected]> wrote:
> > No, I think that would be worse than the current version. But if you remove
> > the structure version in favor of the flags, you only need six arguments
> > anyway.
>
> I want to keep the structure version, just in case we need to expand fields in
> the stat struct in future. Otherwise we may need to create yet another stat
> syscall.

How many versions do you expect we need in the next 10 years, not counting
those where you just add a new field to the structure?

Given a 64 bit flag word, you can start using bits for the version from
the top and bits from the bottom for fields:

#define XSTAT_DEV 0x00000001
#define XSTAT_INO 0x00000002
#define XSTAT_MODE 0x00000004
...
#define XSTAT_LAYOUT_VERSION_2 0x8000000000000000
#define XSTAT_LAYOUT_VERSION_1 0x0000000000000000
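
A sketch of how such a combined word might be built and taken apart again. The defines repeat the ones above with an explicit ULL suffix; XSTAT_LAYOUT_MASK (top four bits reserved for layout selection) and the three helpers are assumptions made for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define XSTAT_DEV               0x0000000000000001ULL
#define XSTAT_INO               0x0000000000000002ULL
#define XSTAT_MODE              0x0000000000000004ULL
#define XSTAT_LAYOUT_VERSION_1  0x0000000000000000ULL
#define XSTAT_LAYOUT_VERSION_2  0x8000000000000000ULL

/* Assumption: the top four bits are set aside for layout versions. */
#define XSTAT_LAYOUT_MASK       0xf000000000000000ULL

/* Combine a layout selector and a field mask, refusing overlaps. */
static uint64_t xstat_make_flags(uint64_t layout, uint64_t fields)
{
    assert((layout & ~XSTAT_LAYOUT_MASK) == 0);
    assert((fields & XSTAT_LAYOUT_MASK) == 0);
    return layout | fields;
}

static uint64_t xstat_layout(uint64_t flags)
{
    return flags & XSTAT_LAYOUT_MASK;
}

static uint64_t xstat_fields(uint64_t flags)
{
    return flags & ~XSTAT_LAYOUT_MASK;
}
```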

> > You can also go further and fold the structure length into flags, because
> > the length is just a function of the data you are passing.
>
> The potential problem with passing the flags as a syscall argument is that
> we're then limited to a single 32-bit integer. It might be enough, but if I
> do as at least one person has suggested and assign each field in the struct
> its own bit, that uses up half right there, plus I'd like to add at least one
> operational flag (to force synchronisation with the server).

I'd imagine that there would be some reasonable way to group some of the
fields so that 32 bits last long enough. Alternatively, you can also make
it a 64 bit argument everywhere, which has some other small disadvantages.

> > Having a system call with flags, size and version is like wearing a belt,
> > braces and suspenders. An unsigned long flags argument should be enough to
> > hold up your pants[1].
>
> I would like the size argument for two reasons: firstly, to prevent buffer
> overruns and, secondly, because I can see some scope for variable-size fields
> (such as for volume IDs or security labels), though the latter might be better
> handled through getxattr() (which would mean extra overhead).

The idea of a syscall API with multiple fixed-length and variable-length
fields in the same structure scares me. If you want to go this far,
it may be better to base the interface on netlink and allow querying
multiple files at once.

For a classic syscall interface, I'd just stay away from variable-length
data and use either fixed-length fields or spend the extra overhead for
the getxattr values.

When all members of struct xstat are fixed length, you can simply add
new members at the end and add the associated flags at the same time.
Any code built against a given header file can only ask for the fields
that are part of the struct definition it uses. The kernel should
obviously only write the fields that the user asked for, in case the
user was built against an older header file. You can also maintain
forward compatibility if the kernel sets a bitmask in the struct with
the fields it has returned.
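
To make the compatibility argument concrete, here is a minimal userspace model of it. Everything in it is hypothetical (the demo_* names, the three flags, the two struct revisions); it only shows that a "kernel" which writes just the requested members never touches memory beyond the struct an older caller was built against, and that the returned mask tells the caller what it got.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_SIZE   0x1ULL
#define DEMO_NLINK  0x2ULL
#define DEMO_BTIME  0x4ULL  /* added together with the new member below */

/* Struct as seen by a program built against the older header. */
struct demo_stat_v1 {
    unsigned long long result_mask;
    unsigned long long size;
    unsigned long long nlink;
};

/* Newer header: one member appended at the end. */
struct demo_stat_v2 {
    unsigned long long result_mask;
    unsigned long long size;
    unsigned long long nlink;
    unsigned long long btime;
};

static void put_u64(unsigned char *buf, size_t off, unsigned long long v)
{
    memcpy(buf + off, &v, sizeof(v));
}

/* "Kernel" side: write only what was asked for and is available. */
static void demo_fill(void *ubuf, unsigned long long asked,
                      const struct demo_stat_v2 *have)
{
    unsigned char *p = ubuf;
    unsigned long long got;

    assert(ubuf != NULL && have != NULL);
    got = asked & have->result_mask;

    if (got & DEMO_SIZE)
        put_u64(p, offsetof(struct demo_stat_v2, size), have->size);
    if (got & DEMO_NLINK)
        put_u64(p, offsetof(struct demo_stat_v2, nlink), have->nlink);
    if (got & DEMO_BTIME)
        put_u64(p, offsetof(struct demo_stat_v2, btime), have->btime);
    put_u64(p, offsetof(struct demo_stat_v2, result_mask), got);
}
```

An old caller can only name DEMO_SIZE and DEMO_NLINK, so the write to btime, which would land just past the end of its struct, is never attempted.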

Arnd

2010-06-30 12:06:05

by David Howells

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

Andreas Dilger <[email protected]> wrote:

> Doesn't glibc use two 64-bit values for devices?

So it would seem. Does Linux need to, though?

> I dislike sequential "version" fields (which are "all or nothing"), and
> prefer the ext2/3/4-like "feature flags" that allow the caller to state what
> features and fields it expects and/or understands. This allows
> extensibility without unduly breaking compatibility.

My aim was to avoid the need to create new stat syscalls in the future by
making it possible to increment the version number you're asking for.

> > unsigned int st_uid;
> > unsigned int st_gid;
>
> In struct stat64 it uses "unsigned long" for both st_uid and st_gid. Having
> a 64-bit value here is useful for CIFS servers to be able to remap different
> UID domains into a 32-bit domain and a 32-bit UID. If you change this,
> please remember to reorder the fields for proper 64-bit alignment.

glibc, on the other hand, only supports 32-bits for these.

My thought was that I could add extension fields to provide access to the
remote username/UID/GID values as well as the local UID/GID (since the latter
are used by the permission checking routines in the local VFS).

> I wouldn't object to having a 128-bit st_ino field, since this is what
> Lustre will be using internally in the next release.

I wonder how best to represent 128-bit numbers.

unsigned long long long

gives:

include/linux/stat.h:151: error: 'long long long' is too long for GCC

so perhaps something like:

struct xstat_u128 { unsigned long long lsw, msw; };

however, I suspect the kernel will require a bit of reengineering to handle a
pgoff_t and loff_t of 128-bits.
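
As a rough sketch of what working with such a two-word type would look like in userspace, assuming unsigned long long is 64 bits wide: the struct is the one suggested above, while the three helper functions are invented here for illustration.

```c
#include <assert.h>

struct xstat_u128 { unsigned long long lsw, msw; };

/* Add a 64-bit quantity, propagating the carry into the upper word. */
static struct xstat_u128 xstat_u128_add(struct xstat_u128 a,
                                        unsigned long long b)
{
    struct xstat_u128 r;

    r.lsw = a.lsw + b;
    r.msw = a.msw + (r.lsw < a.lsw);  /* unsigned wraparound means carry */
    return r;
}

/* Three-way compare: negative, zero or positive like memcmp(). */
static int xstat_u128_cmp(struct xstat_u128 a, struct xstat_u128 b)
{
    if (a.msw != b.msw)
        return a.msw < b.msw ? -1 : 1;
    if (a.lsw != b.lsw)
        return a.lsw < b.lsw ? -1 : 1;
    return 0;
}

/* Narrow to 64 bits for code that cannot handle more yet. */
static unsigned long long xstat_u128_to_u64(struct xstat_u128 v)
{
    assert(v.msw == 0);  /* value must fit */
    return v.lsw;
}
```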

> What is also very convenient that I learned Solaris stat() does is it
> returns the device size in st_size for a block device file. This is very
> convenient, and avoids the morass of ioctls and "binary llseek guessing"
> used by libext2fs and libblkid to determine the size of a block device. Any
> reason not to add this into this new syscall?

That's a separate problem. That can be implemented now by overriding getattr
on blockdev files. You could also set st_blocks and st_blksize to indicate
parameters of the blockdev - though that may upset df, I suppose.

> It is inconsistent to have all the other fields use the "st_" prefix, but
> "query_flags" and "struct_version" do not have this prefix.

They are a different sort of field (metametadata, I suppose). But I can add
that on if you'd prefer.

> It wouldn't be a bad idea to interleave these flags with each of the fields
> that they represent, to make it more clear what is included in each.

Or comments could be used for that.

> > #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
> > #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL
>
> Could you provide some information what the semantic distinction between
> these is? It might be useful to have an "XSTAT_QUERY_LEGACY_STAT" mask that
> returns only the fields that are in the previous struct stat, unless that is
> what "ORDINARY_SET" means, in which case it should be renamed I think.

XSTAT_QUERY_LEGACY_STAT is XSTAT_QUERY__ORDINARY_SET. Is "legacy" an
appropriate appellation, though? They're the set most people expect to see
and want to use.

> > #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL
>
> It is smart to have a "DEFINED_SET" mask that maps to the
> currently-understood fields. This ensures that applications compiled
> against a specific set of headers/struct will not request fields which they
> don't understand. It might be better to call this "XSTAT_QUERY_ALL" so that
> it is more easily understood and used by callers, instead of the incorrect
> "-1" or "~0" that some may be tempted to use if they don't understand what
> "__DEFINED_SET" means.

Passing -1 (or ULONGLONG_MAX) to get everything would be reasonable.

This should probably be an internal kernel constant.
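
Passing an all-ones mask works because the kernel side of the patch already discards undefined bits (stat->query_flags &= XSTAT_QUERY__DEFINED_SET in xstat_check_param()). A minimal sketch of that clamping follows; the function xstat_clamp_query() and its "dropped" output are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define XSTAT_QUERY__DEFINED_SET 0x0000007fULL

/* Keep only the flags with a defined meaning; report the rest. */
static unsigned long long xstat_clamp_query(unsigned long long asked,
                                            unsigned long long *dropped)
{
    assert(dropped != NULL);

    *dropped = asked & ~XSTAT_QUERY__DEFINED_SET;
    return asked & XSTAT_QUERY__DEFINED_SET;
}
```

A caller that passes ~0ULL therefore ends up asking for exactly the defined set, without ever needing to see the constant.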

David

2010-06-30 12:11:46

by Christoph Hellwig

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Wed, Jun 30, 2010 at 01:05:44PM +0100, David Howells wrote:
> My aim was to avoid the need to create new stat syscalls in the future by
> making it possible to increment the version number you're asking for.

The cost of adding a syscall is much smaller than adding all the wanking
in your system call. Think about it - adding a new stat variant is a
few lines of code which add minimal icache footprint. Even less so when
the applications using the old ones disappear for a while. Totally
overdesigned crap like yours on the other hand adds lots of code and
branches that stay forever. In addition to making life for strace and
co really hard if the structure ever changes.

So adding a few fields of padding at the end for new members is fine,
but doing overkill of versioning including queries for supported
versions isn't.

2010-06-30 12:14:25

by David Howells

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

Arnd Bergmann <[email protected]> wrote:

> Given a 64 bit flag word, you can start using bits for the version from
> the top and bits from the bottom for fields:

I suppose. It's cleaner, though, to keep them separate.

> Alternatively, you can also make it a 64 bit argument everywhere, which has
> some other small disadvantages.

No, you can't. 32-bit systems can only pass 32-bit arguments. If you're
suggesting passing a pointer to a 64-bit argument instead, how's that any
different from my suggestion of a separate parameter block?

> The idea of a syscall API with multiple fixed-length and variable-length
> fields in the same structure scares me. If you want to go this far,
> it may be better to base the interface on netlink and allow querying
> multiple files at once.

Urgh. Netlink is way too much overhead and even scarier. That's pretty much
a guarantee that people won't use it. It also has to work if CONFIG_NET=n.

David

2010-06-30 12:24:00

by David Howells

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

Christoph Hellwig wrote:

> The cost of adding a syscall is much smaller than adding all the wanking
> in your system call.

Simply, if inelegantly put.

David

2010-06-30 12:44:36

by Arnd Bergmann

Subject: Re: [PATCH 3/3] Add a pair of system calls to make extended file stats available [ver #2]

On Wednesday 30 June 2010, David Howells wrote:
> Arnd Bergmann wrote:
>
> > Given a 64 bit flag word, you can start using bits for the version from
> > the top and bits from the bottom for fields:
>
> I suppose. It's cleaner, though, to keep them separate.

Yes, but it's a tradeoff. If separating them means you have to add
another structure, I'd prefer having just a flags word with different
kinds of bits. In particular since I don't think we actually need
to worry about wildly different layouts. If struct oldstat had
come with an extensibility concept like what you propose here, we
would not have needed newstat, stat64 and xstat.
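
A minimal sketch of the single-flags-word idea, assuming (purely for
illustration) that the top 8 bits carry the version and the remaining
56 bits carry field flags. The split point and all names below are
hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: version in the top 8 bits, field flags below. */
#define XFLAGS_VERSION_SHIFT 56
#define XFLAGS_FIELD_MASK    ((UINT64_C(1) << XFLAGS_VERSION_SHIFT) - 1)

/* Combine a version number and a set of field flags into one word. */
static inline uint64_t xflags_pack(unsigned int version, uint64_t fields)
{
	assert(version < 256);                       /* must fit in 8 bits */
	assert((fields & ~XFLAGS_FIELD_MASK) == 0);  /* must not spill into version */
	return ((uint64_t)version << XFLAGS_VERSION_SHIFT) | fields;
}

/* Extract the version from the top of the word. */
static inline unsigned int xflags_version(uint64_t word)
{
	return (unsigned int)(word >> XFLAGS_VERSION_SHIFT);
}

/* Extract the field flags from the bottom of the word. */
static inline uint64_t xflags_fields(uint64_t word)
{
	return word & XFLAGS_FIELD_MASK;
}
```

With version 0, the packed word is identical to the plain field mask, so
nothing changes for callers until a version is actually needed.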

> > Alternatively, you can also make it a 64 bit argument everywhere, which has
> > some other small disadvantages.
>
> No, you can't. 32-bit systems can only pass 32-bit arguments. If you're
> suggesting passing a pointer to a 64-bit argument instead, how's that any
> different from my suggestion of a separate parameter block?

I was thinking of splitting the 64 bit argument into two registers on
32 bit systems, like we do with other 64 bit input arguments (e.g. loff_t).
While there is not much of a difference, I'd always prefer passing
input arguments by register to a memory location when possible.
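
As a rough illustration of the register-splitting approach, a 64-bit
flags value can be passed as two 32-bit halves and reassembled on the
other side. The helper names are made up for this sketch, and the order
in which the halves land in registers is ABI-specific and not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Caller side: split a 64-bit value into two register-sized halves. */
static inline void split_u64(uint64_t v, uint32_t *lo, uint32_t *hi)
{
	assert(lo && hi);
	*lo = (uint32_t)(v & 0xffffffffu);
	*hi = (uint32_t)(v >> 32);
}

/* Callee side: rebuild the 64-bit value from the two halves. */
static inline uint64_t join_u64(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```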

> > The idea of a syscall API with multiple fixed-length and variable-length
> > fields in the same structure scares me. If you want to go this far,
> > it may be better to base the interface on netlink and allow querying
> > multiple files at once.
>
> Urgh. Netlink is way too much overhead and even scarier. That's pretty much
> a guarantee that people won't use it. It also has to work if CONFIG_NET=n.

Exactly. Just resist the urge to add complexity bordering on what we already
have in netlink.

Arnd

2010-06-30 13:31:51

by Arnd Bergmann

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Wednesday 30 June 2010, Christoph Hellwig wrote:
> The cost of adding a syscall is much smaller

Ack. No need for different struct layout version since we
can add another stat syscall every ten years.

> So adding a few fields of padding at the end for new members is fine,
> but doing overkill of versioning including queries for supported
> versions isn't.

The ability to request and return a subset of the fields seems useful
regardless and it can be used to avoid the need for this kind of padding.
A sufficient amount of padding wouldn't be too bad either, but I guess
we should not have both the padding _and_ the option for extending the
structure after the padding.

With the padding, the 'size' argument can go away, though I'd argue that
even without the padding we can safely add extra fixed-length fields
when needed and not need a size argument.

Arnd

2010-06-30 14:06:12

by Jeff Layton

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Wed, 30 Jun 2010 15:31:39 +0200
Arnd Bergmann wrote:

> On Wednesday 30 June 2010, Christoph Hellwig wrote:
> > The cost of adding a syscall is much smaller
>
> Ack. No need for different struct layout version since we
> can add another stat syscall every ten years.
>
> > So adding a few fields of padding at the end for new members is fine,
> > but doing overkill of versioning including queries for supported
> > versions isn't.
>
> The ability to request and return a subset of the fields seems useful
> regardless and it can be used to avoid the need for this kind of padding.
> A sufficient amount of padding wouldn't be too bad either, but I guess
> we should not have both the padding _and_ the option for extending the
> structure after the padding.
>
> With the padding, the 'size' argument can go away, though I'd argue that
> even without the padding we can safely add extra fixed-length fields
> when needed and not need a size argument.
>

Simply having a flags field seems sufficient to me too. I don't think
we need padding, version or a size. Just make it a "rule" that if you
add a new field, it has to go at the end of the struct and a new
flag has to go with it. The kernel will need to only fill out fields
that are requested and that it knows about.

In the event that we approach running out of flags, we could even use
the last flag as a "HAS_FLAGS2" flag, to add a new flags field at the
end. Ugly, but it would avoid the need for a new syscall. We can kick
that potential problem down the road though. With 64 flags to play
with, it likely won't be a problem for a while.
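
A minimal sketch of that rule, using made-up flag values and struct
names rather than the ones from the patch: the filler masks the request
against the flags it knows, fills only those fields, and reports what it
actually filled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values, not the ones from the patch. */
#define XQ_SIZE   UINT64_C(0x1)
#define XQ_NLINK  UINT64_C(0x2)
#define XQ_BTIME  UINT64_C(0x4)  /* added later, so its field goes last */
#define XQ_KNOWN  (XQ_SIZE | XQ_NLINK | XQ_BTIME)

struct xstat_sketch {
	uint64_t result_flags;  /* which of the fields below were filled in */
	uint64_t st_size;
	uint64_t st_nlink;
	uint64_t st_btime_sec;  /* appended together with XQ_BTIME */
};

struct inode_sketch {
	uint64_t size, nlink, btime_sec;
	int has_btime;          /* does the filesystem store a creation time? */
};

/* Fill only the fields that were requested and that this code knows about. */
static void fill_xstat(const struct inode_sketch *ino, uint64_t request,
		       struct xstat_sketch *out)
{
	uint64_t want = request & XQ_KNOWN;  /* silently drop unknown flags */

	assert(ino && out);
	out->result_flags = 0;
	if (want & XQ_SIZE) {
		out->st_size = ino->size;
		out->result_flags |= XQ_SIZE;
	}
	if (want & XQ_NLINK) {
		out->st_nlink = ino->nlink;
		out->result_flags |= XQ_NLINK;
	}
	if ((want & XQ_BTIME) && ino->has_btime) {
		out->st_btime_sec = ino->btime_sec;
		out->result_flags |= XQ_BTIME;
	}
}
```

Fields that were not requested, or that the filesystem cannot supply,
are left untouched and their bit stays clear in result_flags.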

--
Jeff Layton

2010-06-30 17:36:30

by Arnd Bergmann

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Wednesday 30 June 2010, Jeff Layton wrote:
> In the event that we approach running out of flags, we could even use
> the last flag as a "HAS_FLAGS2" flag, to add a new flags field at the
> end. Ugly, but it would avoid the need for a new syscall. We can kick
> that potential problem down the road though. With 64 flags to play
> with, it likely won't be a problem for a while.

Along the lines of what Christoph argued, we can also just use the
new syscall when that happens.

Arnd

2010-06-30 21:45:27

by Andreas Dilger

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On 2010-06-30, at 06:05, David Howells wrote:
> Andreas Dilger wrote:
>> In struct stat64 it uses "unsigned long" for both st_uid and st_gid. Having
>> a 64-bit value here is useful for CIFS servers to be able to remap different
>> UID domains into a 32-bit domain and a 32-bit UID. If you change this,
>> please remember to reorder the fields for proper 64-bit alignment.
>
> glibc, on the other hand, only supports 32-bits for these.

For the cost of those extra bytes it would definitely save a lot of extra complexity in every application packing and unpacking the struct. At a minimum put a 32-bit padding that is zero-filled for now.

>> I wouldn't object to having a 128-bit st_ino field, since this is what
>> Lustre will be using internally in the next release.
>
> so perhaps something like:
>
> struct xstat_u128 { unsigned long long lsw, msw; };
>
> however, I suspect the kernel will require a bit of reengineering to handle a
> pgoff_t and loff_t of 128-bits.

Well, not any different from having 32-bit platforms work with two 32-bit values for 64-bit offsets today, except that we would be doing this with two 64-bit values.

>> What is also very convenient that I learned Solaris stat() does is it
>> returns the device size in st_size for a block device file. This is very
>> convenient, and avoids the morass of ioctls and "binary llseek guessing"
>> used by libext2fs and libblkid to determine the size of a block device. Any
>> reason not to add this into this new syscall?
>
> That's a separate problem. That can be implemented now by overriding getattr
> on blockdev files. You could also set st_blocks and st_blksize to indicate
> parameters of the blockdev - though that may upset df, I suppose.

I don't know if Solaris does that or not, I'd have to check with someone who has more than anecdotal understanding of it. Actually, a quick google shows that st_blocks and st_blksize are undefined for block/char devices.

>>> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
>>> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL
>>
>> Could you provide some information what the semantic distinction between
>> these is? It might be useful to have an "XSTAT_QUERY_LEGACY_STAT" mask that
>> returns only the fields that are in the previous struct stat, unless that is
>> what "ORDINARY_SET" means, in which case it should be renamed I think.
>
> XSTAT_QUERY_LEGACY_STAT is XSTAT_QUERY__ORDINARY_SET. Is "legacy" an
> appropriate appellation, though? They're the set most people expect to see
> and want to use.

I was thinking that most applications using this interface would use it because they have a specific need to, or it would be internal to glibc. In those cases it is useful to know what the "traditional" stat() returned, but I don't think "__ORDINARY_SET" encompasses that idea. Other possibilities include "NORMAL_STAT" or "BASIC_STAT", or similar.

>>> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL
>>
>> It is smart to have a "DEFINED_SET" mask that maps to the
>> currently-understood fields. This ensures that applications compiled
>> against a specific set of headers/struct will not request fields which they
>> don't understand. It might be better to call this "XSTAT_QUERY_ALL" so that
>> it is more easily understood and used by callers, instead of the incorrect
>> "-1" or "~0" that some may be tempted to use if they don't understand what
>> "__DEFINED_SET" means.
>
> Passing -1 (or ULLONG_MAX) to get everything would be reasonable.

NOOOO. That is exactly what we _don't_ want, since it makes it impossible for the kernel to actually understand which fields the application is ready to handle. If the application always uses XSTAT_QUERY_ALL, instead of "-1", then the kernel can easily tell which fields are present in the userspace structure, and what it should avoid touching.

If applications start using "-1" to mean "all fields", then it will work so long as the kernel and userspace agree on the size of struct xstat, but as soon as the kernel understands some new field, but userspace does not, the application will segfault or clobber random memory because the kernel thinks it is asking for XSTAT_QUERY_NEXT_NEW_FIELD|... when it really isn't asking for that at all.

Cheers, Andreas




2010-06-30 23:15:30

by David Howells

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

Andreas Dilger wrote:

> For the cost of those extra bytes it would definitely save a lot of extra
> complexity in every application packing and unpacking the struct. At a
> minimum put a 32-bit padding that is zero-filled for now.

Blech. I'd prefer to just expand the fields to 64-bits.

Note that you can't just arbitrarily pass a raw 64-bit UID, say, back to
vfs_getattr() and expect it to be coped with. Those stat syscalls that return
32-bit (or even 16-bit) would have to do something with it, and glibc would
have to do something with it.

I think we'd need extra request bits to ask for the longer UID/GID - at which
point the extra result data can be appended and extra capacity in the basic
part of the struct is not required.

> > so perhaps something like:
> >
> > struct xstat_u128 { unsigned long long lsw, msw; };
> >
> > however, I suspect the kernel will require a bit of reengineering to handle
> > a pgoff_t and loff_t of 128-bits.
>
> Well, not any different from having 32-bit platforms work with two 32-bit
> values for 64-bit offsets today, except that we would be doing this with two
> 64-bit values.

gcc for 32-bit platforms can handle 64-bit numbers. gcc doesn't handle 128-bit
numbers.

This can be handled as suggested above by allocating extra result bits to get
the upper halves of longer fields:

XSTAT_REQUEST_SIZE__MSW
XSTAT_REQUEST_BLOCKS__MSW

for example.

> > Passing -1 (or ULLONG_MAX) to get everything would be reasonable.
>
> NOOOO. That is exactly what we _don't_ want, since it makes it impossible
> for the kernel to actually understand which fields the application is ready
> to handle. If the application always uses XSTAT_QUERY_ALL, instead of "-1",
> then the kernel can easily tell which fields are present in the userspace
> structure, and what it should avoid touching.
>
> If applications start using "-1" to mean "all fields", then it will work so
> long as the kernel and userspace agree on the size of struct xstat, but as
> soon as the kernel understands some new field, but userspace does not, the
> application will segfault or clobber random memory because the kernel thinks
> it is asking for XSTAT_QUERY_NEXT_NEW_FIELD|... when it really isn't asking
> for that at all.

As long as the field bits are allocated in order and the extra results are
tacked on in bit number order, will it actually be a problem? Userspace must
know how to deal with all the bits up to the last one it knows about; anything
beyond that is irrelevant.
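
One way to read "tacked on in bit number order" is sketched below. It
assumes, hypothetically, that bits 0..6 map to fixed struct fields, that
each higher bit returned adds exactly one 64-bit slot, and that slots
for bits not returned are skipped. None of this is confirmed by the
patch; the names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define XQ_BASIC_BITS 7  /* assumption: bits 0..6 are the fixed fields */

/*
 * Return the index into extra_results[] holding the value for `bit`,
 * or -1 if that bit is not set in the result mask.
 */
static int extra_index(uint64_t result_mask, unsigned int bit)
{
	uint64_t extras, below;
	unsigned int rel;
	int idx = 0;

	assert(bit >= XQ_BASIC_BITS && bit < 64);
	extras = result_mask >> XQ_BASIC_BITS;
	rel = bit - XQ_BASIC_BITS;
	if (!((extras >> rel) & 1))
		return -1;

	/* Count the extra-result bits below ours: each occupies one slot. */
	below = extras & ((UINT64_C(1) << rel) - 1);
	while (below) {
		below &= below - 1;  /* clear lowest set bit */
		idx++;
	}
	return idx;
}
```

Under these assumptions userspace can stop at the last bit it knows
about, since anything later in the array belongs to higher bits.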

What would you have me do? Return an error if a request is made that the
kernel doesn't support? That's bad too. This can be handled simply by
clearing the result bit for any unsupported field.

David

2010-06-30 23:28:10

by H. Peter Anvin

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On 06/30/2010 04:15 PM, David Howells wrote:
>
> gcc for 32-bit platforms can handle 64-bit numbers. gcc doesn't handle 128-bit
> numbers.
>

gcc for 64-bit platforms does handle 128-bit numbers, but I don't think
it does on 32-bit platforms.

-hpa

2010-07-01 00:15:56

by David Howells

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

H. Peter Anvin wrote:

> gcc for 64-bit platforms does handle 128-bit numbers, but I don't think
> it does on 32-bit platforms.

How do you specify them? If I say "long long long" gcc moans that it can't
support it on x86_64.

David

2010-07-01 03:21:06

by H. Peter Anvin

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On 06/30/2010 05:15 PM, David Howells wrote:
> H. Peter Anvin wrote:
>
>> gcc for 64-bit platforms does handle 128-bit numbers, but I don't think
>> it does on 32-bit platforms.
>
> How do you specify them? If I say "long long long" gcc moans that it can't
> support it on x86_64.
>

__int128
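
A short sketch of how the two-word struct suggested earlier in the
thread could coexist with __int128: the native type is used only where
the compiler advertises it (gcc and clang define __SIZEOF_INT128__ on
such targets), and a portable add-with-carry covers the rest. The helper
names are invented, and unsigned long long is assumed to be 64 bits.

```c
#include <assert.h>

struct xstat_u128 { unsigned long long lsw, msw; };

#ifdef __SIZEOF_INT128__
/* Only compiled where the compiler provides a native 128-bit type. */
static inline unsigned __int128 u128_to_native(struct xstat_u128 v)
{
	assert(sizeof(unsigned long long) == 8);
	return ((unsigned __int128)v.msw << 64) | v.lsw;
}
#endif

/* Portable 128-bit addition on the two-word representation. */
static inline struct xstat_u128 u128_add(struct xstat_u128 a,
					 struct xstat_u128 b)
{
	struct xstat_u128 r;

	r.lsw = a.lsw + b.lsw;                    /* wraps modulo 2^64 */
	r.msw = a.msw + b.msw + (r.lsw < a.lsw);  /* carry out of the low word */
	return r;
}
```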

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-01 04:57:13

by Andreas Dilger

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On 2010-06-30, at 17:15, David Howells wrote:
> Andreas Dilger wrote:
>>> Passing -1 (or ULLONG_MAX) to get everything would be reasonable.
>>
>> NOOOO. That is exactly what we _don't_ want, since it makes it impossible
>> for the kernel to actually understand which fields the application is ready
>> to handle. If the application always uses XSTAT_QUERY_ALL, instead of "-1",
>> then the kernel can easily tell which fields are present in the userspace
>> structure, and what it should avoid touching.
>>
>> If applications start using "-1" to mean "all fields", then it will work so
>> long as the kernel and userspace agree on the size of struct xstat, but as
>> soon as the kernel understands some new field, but userspace does not, the
>> application will segfault or clobber random memory because the kernel thinks
>> it is asking for XSTAT_QUERY_NEXT_NEW_FIELD|... when it really isn't asking
>> for that at all.
>
> As long as the field bits are allocated in order and the extra results are
> tacked on in bit number order, will it actually be a problem? Userspace must
> know how to deal with all the bits up to the last one it knows about; anything
> beyond that is irrelevant.

The patch you sent seems to get this right, but just for completeness, I'll answer in this thread. Using the new struct as an example:

#define XSTAT_REQUEST_GEN 0x00001000ULL
#define XSTAT_REQUEST_DATA_VERSION 0x00002000ULL

struct xstat {
	:
	:
	unsigned long long st_data_version;
	unsigned long long st_result_mask;
	unsigned long long st_extra_results[0];
};

An app "today" would allocate a struct xstat that ends at st_result_mask, and "today's" kernel will not know anything about flags beyond *_DATA_VERSION. Even if today's app incorrectly uses request_mask = ~0ULL nothing will break until the kernel code changes.

If a future kernel gets a new static field at st_extra_results (say unsigned long long st_ino_high) with a new flag XSTAT_REQUEST_INO_HIGH 0x00004000ULL, the kernel will think that the old app is requesting this field, and will fill in the 64-bit field at st_extra_results[1] (which the old app didn't allocate space for, nor does it understand), and the app may get a segfault, or stack smashing, or random heap corruption.

> What would you have me do? Return an error if a request is made that the
> kernel doesn't support? That's bad too. This can be handled simply by
> clearing the result bit for any unsupported field.

I agree the desirable behaviour is that an app sets request_mask at most to XSTAT_REQUEST__ALL_STATS 0x00003fffULL (or whatever it is at the time the app is compiled, matching the current struct xstat). Whether the kernel understands e.g. XSTAT_REQUEST_INO_HIGH is then irrelevant, since the kernel will not touch fields that are not requested. Likewise, if the application is compiled with a newer/larger XSTAT_REQUEST__ALL_STATS mask than the kernel understands, the kernel will ignore the flags it doesn't understand.
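
The difference between the two request styles can be sketched with a
small model. It assumes the app was built against the 0x00003fff mask
above, that a newer kernel adds one flag at 0x00004000, and that each
granted flag beyond the app's mask costs one 64-bit slot past the end of
the app's struct. The function and macro names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define APP_ALL_STATS  UINT64_C(0x00003fff)  /* mask the old app was built against */
#define REQ_INO_HIGH   UINT64_C(0x00004000)  /* assumed flag, known only to a newer kernel */

/* Count the 64-bit slots a kernel would write past the old app's struct. */
static unsigned int slots_past_app_struct(uint64_t kernel_known,
					  uint64_t request)
{
	uint64_t beyond = request & kernel_known & ~APP_ALL_STATS;
	unsigned int n = 0;

	assert((REQ_INO_HIGH & APP_ALL_STATS) == 0);
	while (beyond) {
		beyond &= beyond - 1;  /* clear lowest set bit */
		n++;
	}
	return n;
}
```

In this model a request of ~0ULL against a kernel that knows
REQ_INO_HIGH yields one slot of overrun, while a request of
APP_ALL_STATS yields none on either the old or the new kernel.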


Cheers, Andreas




2010-07-01 08:10:07

by Arnd Bergmann

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Thursday 01 July 2010 06:57:07 Andreas Dilger wrote:
> If a future kernel gets a new static field at st_extra_results (say
> unsigned long long st_ino_high) with a new flag XSTAT_REQUEST_INO_HIGH
> 0x00004000ULL, the kernel will think that the old app is requesting
> this field, and will fill in the 64-bit field at st_extra_results[1]
> (which the old app didn't allocate space for, nor does it understand),
> and the app may get a segfault, or stack smashing, or random heap corruption.

That depends on whether the interface carries a 'buflen' or not
(it may be part of the struct, a syscall argument, or in a second struct).
I argue that there should be no buflen, and that users should
consequently not set bits that they don't know about, to prevent the
scenario you describe.

If the buflen stays in, it will prevent the stack smashing part,
but add extra complexity in the interface, which can cause other
problems.
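
For comparison, the protection a buflen gives can be sketched as a
bounded copy. This is only an illustration of the size check, not the
behaviour of the patch: refusing the request outright when the results
do not fit is an assumption, and truncating would be another option.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `needed` bytes of results into the caller's buffer.
 * Returns the number of bytes copied, or -1 if the buffer is too small.
 */
static long copy_results(void *dst, size_t buflen,
			 const void *src, size_t needed)
{
	assert(dst && src);
	if (needed > buflen)
		return -1;  /* refuse rather than write past the buffer */
	memcpy(dst, src, needed);
	return (long)needed;
}
```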

Arnd

2010-07-05 23:52:39

by Brad Boyer

Subject: Re: [PATCH 0/3] Extended file stat functions [ver #2]

On Wed, Jun 30, 2010 at 02:16:56AM +0100, David Howells wrote:
> struct xstat {
> unsigned int struct_version;
> #define XSTAT_STRUCT_VERSION 0
> unsigned int st_mode;
> unsigned int st_nlink;
> unsigned int st_uid;
> unsigned int st_gid;
> unsigned int st_blksize;
> struct xstat_dev st_rdev;
> struct xstat_dev st_dev;
> unsigned long long st_ino;
> unsigned long long st_size;
> struct xstat_time st_atime;
> struct xstat_time st_mtime;
> struct xstat_time st_ctime;
> struct xstat_time st_btime;
> unsigned long long st_blocks;
> unsigned long long st_gen;
> unsigned long long st_data_version;
> unsigned long long query_flags;
> #define XSTAT_QUERY_SIZE 0x00000001ULL
> #define XSTAT_QUERY_NLINK 0x00000002ULL
> #define XSTAT_QUERY_AMC_TIMES 0x00000004ULL
> #define XSTAT_QUERY_CREATION_TIME 0x00000008ULL
> #define XSTAT_QUERY_BLOCKS 0x00000010ULL
> #define XSTAT_QUERY_INODE_GENERATION 0x00000020ULL
> #define XSTAT_QUERY_DATA_VERSION 0x00000040ULL
> #define XSTAT_QUERY__ORDINARY_SET 0x00000017ULL
> #define XSTAT_QUERY__GET_ANYWAY 0x0000007fULL
> #define XSTAT_QUERY__DEFINED_SET 0x0000007fULL
> unsigned long long extra_results[0];
> };

Would it be worthwhile to have a field for which security features are
enabled for a file? Maybe with bits for if the file has ACLs (and which
type if the richacl code goes in) or for the selinux labels or other
similar types of data? The current version of ls I have does a huge
pile of getxattr calls along with all the lstat64 calls.

Brad Boyer